Inside GPT-3

  • Varun@mastodon.social · 1 year ago

    @behohippy @saint Instead of modeling the sequence timestep by timestep, attention lets us process the whole sequence in parallel, much like a fully connected network. The positional encoding tells the model where each token sits in the sequence, and keys that receive low attention values contribute little to the output and can effectively be dropped…
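
    As a rough illustration of that idea, here is a minimal NumPy sketch (not from the thread; the function names and sizes are made up) of scaled dot-product self-attention with sinusoidal positional encodings: every position is scored against every other position in a single matrix product rather than step by step, and keys with near-zero attention weight barely influence the result.

    ```python
    import numpy as np

    def positional_encoding(seq_len, d_model):
        # Sinusoidal encodings so the model can tell positions apart.
        pos = np.arange(seq_len)[:, None]
        i = np.arange(d_model)[None, :]
        angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
        return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

    def attention(q, k, v):
        # Scores for every query/key pair at once -- no timestep loop.
        scores = q @ k.T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # Keys with near-zero weight contribute almost nothing here.
        return weights @ v, weights

    seq_len, d_model = 5, 8                      # toy sizes for illustration
    x = np.random.randn(seq_len, d_model) + positional_encoding(seq_len, d_model)
    out, w = attention(x, x, x)                  # self-attention over the whole sequence
    print(out.shape, w.shape)                    # (5, 8) (5, 5)
    ```

    The whole 5x5 attention matrix is computed in one shot, which is what makes the computation parallel instead of sequential like an RNN.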