You can find the reference code for a custom implementation using PyTorch, along with a model-training notebook and a small natural-text-generation project, here: https://github.com/agasheaditya/handson-transformers

Machines cannot understand text directly, so an encoding technique is used to represent words/sentences in an embedding space. In transformers, positional encoders generate vectors that add context based on a word's position in the sentence. After passing a sentence through the embedding layer we get word embeddings, and passing those through the positional encoding gives each word a position-aware representation, i.e. context.
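A minimal sketch of this step in PyTorch (the toy vocabulary size, model dimension, and token ids below are illustrative, not from the linked repo): the sinusoidal positional encoding from the original paper is added to the token embeddings to produce position-aware inputs.

```python
import math
import torch

def positional_encoding(seq_len, d_model):
    # Sinusoidal positional encoding:
    # PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    pe = torch.zeros(seq_len, d_model)
    position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Token embeddings plus positional encodings give position-aware inputs
emb = torch.nn.Embedding(1000, 64)    # toy vocab of 1000 tokens, d_model = 64
tokens = torch.tensor([[5, 17, 42]])  # one sentence of 3 token ids
x = emb(tokens) + positional_encoding(3, 64)
print(x.shape)  # torch.Size([1, 3, 64])
```

Because the encoding is a fixed function of position, the same vector is added to a word wherever that word's position recurs, which is how the model distinguishes "dog bites man" from "man bites dog".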

The encoding component is nothing but a stack of encoders that processes the input sequence in parallel. N identical encoder blocks are stacked on top of each other, and they do not share weights with one another. All encoders are identical in structure, and each is divided into two sub-layers.

The paper stacks 6 of them on top of each other; this value is a hyperparameter you can change and experiment with.
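The stack described above can be sketched as follows, using PyTorch's built-in `nn.TransformerEncoderLayer` as a stand-in for one encoder block (the dimensions are illustrative; N = 6 matches the paper). Each instance in the list has its own parameters, reflecting that the blocks do not share weights.

```python
import torch
import torch.nn as nn

d_model, n_heads, N = 512, 8, 6  # the paper uses N = 6 encoder blocks
encoder_stack = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True) for _ in range(N)]
)

x = torch.randn(2, 10, d_model)  # (batch, seq_len, d_model)
for layer in encoder_stack:      # each block has its own, unshared weights
    x = layer(x)                 # output shape matches input shape
print(x.shape)  # torch.Size([2, 10, 512])
```

Because every block maps (batch, seq_len, d_model) to the same shape, the depth N can be changed freely without touching the rest of the model.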
Self-attention :- Also referred to as the Scaled Dot-Product Attention mechanism, it allows the model to compute the significance of different words in a sequence relative to each other while considering their relationships. The encoder's input is passed to this layer, which helps the encoder look at other words in the input sentence as it encodes a specific word.
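A minimal sketch of scaled dot-product attention (tensor shapes are illustrative): scores are the dot products of queries and keys, scaled by the square root of the key dimension, then softmaxed into weights over the value vectors. In self-attention, Q, K, and V all come from the same input sequence, which is what lets each word attend to every other word.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)  # each row sums to 1
    return weights @ v, weights

x = torch.randn(1, 3, 8)  # (batch, seq_len, d_k): 3 words, head size 8
out, w = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V
print(out.shape)  # torch.Size([1, 3, 8])
```

Each row of `w` tells you how much the corresponding word attends to every word in the sentence, including itself, when building its encoded representation.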