Natural Language Processing

Laboratory

(Deadline: -)


BERT (Bidirectional Encoder Representations from Transformers)

BERT is a relatively recent technique, introduced in 2018. Unlike word2vec, which is a context-free model, BERT is a contextual model and uses masked language modeling.

Transformers are a neural network architecture that uses the attention mechanism (self-attention) to create vector encodings of the words in a text based on their surrounding text (context). Attention assigns each token in the input a weight in relation to the other tokens in the context (producing a weighted hidden state at each step), thus obtaining a contextual model. Basically, it creates a context-dependent vector representation (embedding) for each token.

We will use Hugging Face Transformers.
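As a starting point, the following minimal sketch (assuming the transformers and torch packages are installed, e.g. with pip install transformers torch) loads a pretrained BERT model from the Hugging Face hub and runs a sentence through it:

    from transformers import BertTokenizer, BertModel

    # Load the pretrained tokenizer and model (downloaded from the Hugging Face hub)
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertModel.from_pretrained("bert-base-uncased")

    # Tokenize a sentence and obtain the contextual embeddings
    inputs = tokenizer("BERT produces contextual embeddings.", return_tensors="pt")
    outputs = model(**inputs)
    print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, 768 for bert-base)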

BERT tokenizer

BERT uses a WordPiece tokenizer, which divides the text into fragments (pieces of words) instead of whole words only (as other tokenizers do).

It starts from the letters of a given word and merges them into subwords based on the probability of those letters appearing together in a word of the language being processed.
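For illustration, a small tokenization sketch (using the bert-base-uncased vocabulary as an example) shows how a word that is not in the vocabulary is split into word pieces, with continuation pieces marked by "##":

    from transformers import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

    # Common words stay whole; rarer words are split into subword pieces
    print(tokenizer.tokenize("The cat sat on the mat."))
    print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']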

Encoder/decoder

The encoder receives the input and turns it into an embedding to be used by the neural network. The decoder does the reverse, generating the output in words based on the output of the network.

BERT only uses an encoder. BERT is bidirectional, as it considers both the left and the right context of each word in the input sentence during training. For this, it uses the Masked Language Modeling (MLM) technique: part of the words in the input are masked (considered unknown) and BERT is trained to predict them based on their context in the input text.
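A quick way to see MLM in action is the fill-mask pipeline from Hugging Face Transformers (the model name and sentence below are only examples); BERT predicts the token hidden behind [MASK] using both its left and right context:

    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")

    # Each prediction contains the proposed token and its probability
    for prediction in fill_mask("The capital of France is [MASK]."):
        print(prediction["token_str"], round(prediction["score"], 3))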

It also uses Next Sentence Prediction (NSP): the model is trained to predict whether a given sentence actually follows another sentence in the original text or not.
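A short NSP sketch using the BertForNextSentencePrediction head (the sentences are chosen only as an example); index 0 of the logits corresponds to "the second sentence follows the first", index 1 to "the second sentence is random":

    import torch
    from transformers import BertTokenizer, BertForNextSentencePrediction

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

    sentence_a = "He went to the store."
    sentence_b = "He bought some milk."
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt")

    logits = model(**inputs).logits
    probabilities = torch.softmax(logits, dim=-1)
    print(probabilities)  # [P(is the next sentence), P(is a random sentence)]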

Attention

The attention mechanism helps decide how important a token is within an input, and how it relates to the other tokens in the input, by using weights.

BERT uses a multi-head attention mechanism: several attention heads run in parallel, each with its own learned projections, so the model can capture different types of relations between the tokens at the same time.
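For reference, the scaled dot-product attention computed inside each head can be written as follows, where Q, K and V are the query, key and value matrices obtained from the token representations and d_k is the dimension of the keys:

    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V

The softmax produces the attention weights, and multiplying them with V gives the weighted (contextual) representation of each token.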

Training and fine tuning

BERT is usually pretrained, and if we want to specialize it for a task we fine-tune it by training it on our data for a small number of epochs (usually 2-4), with a learning rate chosen from the following: 3e-4, 1e-4, 5e-5, 3e-5.
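A minimal fine-tuning sketch for a 2-class text classification task is shown below; the DataLoader named train_loader is assumed to exist and to yield already tokenized batches (it is an assumption, not part of this lab's setup):

    import torch
    from transformers import BertForSequenceClassification

    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)  # one of the learning rates above

    model.train()
    for epoch in range(3):  # 2-4 epochs are usually enough for fine-tuning
        for batch in train_loader:  # assumed DataLoader with tokenized batches
            optimizer.zero_grad()
            outputs = model(input_ids=batch["input_ids"],
                            attention_mask=batch["attention_mask"],
                            labels=batch["labels"])
            outputs.loss.backward()  # compute the gradients of the loss
            optimizer.step()         # update the model parameters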

Important terms

Logits - the raw predicted values produced by the last layer of the model (before applying softmax)

Loss - the difference between predicted and true values

Batch - the input data is divided into batches that are delivered to the network. Recommended batch sizes: 8, 16, 32, 64, 128

Optimizer - adjusts the parameters of the model in order to minimize the loss function

Gradients - the partial derivatives of the loss function with respect to the model's parameters (weights and biases); they are used to update the parameters in a way that minimizes the loss
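The following toy example (plain PyTorch, not BERT-specific, written only to illustrate the terms) maps them to code in a single training step:

    import torch
    import torch.nn as nn

    model = nn.Linear(10, 3)                                  # a tiny model with 3 output classes
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # the optimizer

    batch = torch.randn(8, 10)           # a batch of 8 examples
    labels = torch.randint(0, 3, (8,))   # their true classes

    logits = model(batch)                                # logits: raw values of the last layer
    loss = nn.functional.cross_entropy(logits, labels)   # loss: predicted vs. true values
    loss.backward()                      # gradients: d(loss)/d(parameters), stored in .grad
    optimizer.step()                     # optimizer: adjusts the parameters to reduce the loss
    optimizer.zero_grad()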


Exercises and homework

All exercises can be done by all students, regardless of attendance.