Natural Language Processing

Laboratory

(Deadline: -)

If you are not logged in, the exercises are not displayed.

Wordnet

WordNet is a semantic dictionary organized into synsets (synonym rings). A synset is a group of synonymous words with an associated id and a sense, illustrated by a brief gloss. A polysemous word (or collocation) appears in several synsets, one for each of its senses.

To easily visualize the links between words, you can use this web-based GUI.

>>> from nltk.corpus import wordnet
>>> wordnet.synsets('school')
[Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('school.v.01'), Synset('educate.v.03'), Synset('school.v.03')]



>>> wordnet.synsets('schooled')
[Synset('school.v.01'), Synset('educate.v.03'), Synset('school.v.03')]
>>> wordnet.synsets('cats')
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]

Types of relations (for nouns):

Types of relations (for verbs):

Types of relations (for adjectives):

Types of relations (for adverbs):

The similarity of two synsets:

>>> tree = wordnet.synset('tree.n.01')
>>> tree.path_similarity(wordnet.synset('plant.n.01'))
0.09090909090909091
>>> tree.path_similarity(wordnet.synset('car.n.01'))
0.07142857142857142
>>> tree.path_similarity(wordnet.synset('public_school.n.01'))
0.05555555555555555
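
As a rough illustration of where such scores come from: path_similarity is based on the shortest path that connects two senses in the is-a (hypernym) taxonomy, and the score is 1 / (1 + path length). A score of 0.0909... is therefore 1/11, which corresponds to a path of 10 links. The sketch below applies the same formula to a tiny hand-made taxonomy. TOY_HYPERNYMS and both helper functions are assumptions made for this example; they are not WordNet data and not NLTK's implementation.

```python
from collections import deque

# Hypothetical toy taxonomy: each node maps to its direct hypernyms (parents).
TOY_HYPERNYMS = {
    "oak": ["tree"],
    "tree": ["plant"],
    "rose": ["shrub"],
    "shrub": ["plant"],
    "plant": ["organism"],
    "dog": ["animal"],
    "animal": ["organism"],
    "organism": [],
}


def shortest_path_length(graph, start, goal):
    """Breadth-first search, treating is-a links as undirected edges."""
    neighbours = {node: set(parents) for node, parents in graph.items()}
    for node, parents in graph.items():
        for parent in parents:
            neighbours.setdefault(parent, set()).add(node)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two nodes are not connected


def toy_path_similarity(graph, a, b):
    """1 / (1 + shortest path length), or None if no path exists."""
    dist = shortest_path_length(graph, a, b)
    return None if dist is None else 1.0 / (1 + dist)


print(toy_path_similarity(TOY_HYPERNYMS, "oak", "tree"))
print(toy_path_similarity(TOY_HYPERNYMS, "oak", "rose"))
print(toy_path_similarity(TOY_HYPERNYMS, "oak", "dog"))
```

The closer two nodes are in the taxonomy, the closer the score is to 1; a node compared with itself has path length 0 and therefore similarity 1.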

Word embeddings

Word2vec is a technique for associating a vector with each word, such that words used in similar contexts get similar vectors.

There are two techniques (both using neural networks) to obtain such vectors: continuous bag-of-words (CBOW) and skip-gram.
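
Word2vec models are commonly trained in one of two setups: the context predicts the central word (CBOW), or the central word predicts each context word (skip-gram). The sketch below only shows how training examples could be formed from a tokenized sentence in each setup. The function names and the window sizes are illustrative assumptions, and the neural network itself is left out.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """(context words, target word): the context predicts the central word."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """(central word, context word): the central word predicts each neighbour."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


sentence = ["the", "cat", "sat", "on", "the", "mat"]

for context, target in cbow_examples(sentence, window=1):
    print("CBOW     ", context, "->", target)

for centre, context_word in skipgram_examples(sentence, window=1):
    print("skip-gram", centre, "->", context_word)
```

CBOW produces one example per position, while skip-gram produces one example per (word, neighbour) pair, so it yields more training examples from the same text.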

For this lesson we will use the gensim module: pip install gensim

We will use the gensim.models.word2vec.Word2Vec class. When a corpus is passed to its constructor (the sentences or corpus_file argument), the constructor trains the model and returns it.

According to the documentation, the constructor has the following syntax (the default values are shown too):

gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), comment=None, max_final_vocab=None)

The most important parameters are:

We can load a pretrained model (trained on the Google News dataset) with the following commands:

>>> import gensim
>>> from gensim.models import Word2Vec
>>> model = gensim.models.KeyedVectors.load_word2vec_format(modelPath, binary=True)

You need to set the modelPath variable to the path of the Google News file. You can download the model from Kaggle.

We can easily find the similarity between two words:

>>> model.similarity("cat", "dog")
0.76094574
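
The similarity reported for two words is the cosine similarity of their vectors: the dot product divided by the product of the vector lengths. A minimal sketch of that computation with plain Python lists follows. The three-dimensional vectors in toy_vectors are made up for illustration and are not taken from the Google News model.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen so the results are easy to check by hand.
toy_vectors = {
    "cat": [1.0, 2.0, 0.0],
    "dog": [2.0, 4.0, 0.0],  # same direction as "cat"
    "car": [0.0, 0.0, 3.0],  # orthogonal to "cat"
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.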

To obtain the vector itself, use the get_vector(word) method:

>>> model.get_vector("house")

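The vocabulary is stored in the model's key_to_index property (this is the gensim 4.x name, matching the constructor syntax above; older versions used vocab). You can therefore check whether a certain word appears in the vocabulary:

>>> "word" in model.key_to_index

Notice that words appear both in uppercase and lowercase, so the lookup is case-sensitive.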
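
Because the vocabulary contains differently cased variants, a word may be missing in one casing but present in another. A small sketch of a fallback lookup follows. find_in_vocab is a hypothetical helper, not a gensim function, and toy_vocab is a plain dict standing in for the model's vocabulary.

```python
def find_in_vocab(word, vocab):
    """Return the first casing variant of `word` found in `vocab`, else None."""
    for candidate in (word, word.lower(), word.capitalize(), word.upper()):
        if candidate in vocab:
            return candidate
    return None


# Hypothetical stand-in for the model's vocabulary (word -> index).
toy_vocab = {"house": 0, "House": 1, "NASA": 2, "Paris": 3}

print(find_in_vocab("paris", toy_vocab))
print(find_in_vocab("nasa", toy_vocab))
print(find_in_vocab("zebra", toy_vocab))
```

The exact spelling is tried first, so a word that exists in the requested casing is never replaced by another variant.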

Exercises and homework

All exercises can be done by all students, regardless of attendance.