natural language processing

Laboratory

(Deadline: 30.05.2025 23:59:59)

If you are not logged in, the exercises are not displayed.

WordNet

WordNet is a semantic dictionary organized into synsets (synonym rings). A synset is a group of synonymous words with an associated id and a sense, illustrated by a brief gloss. A polysemous word (or collocation) appears in several synsets, one for each of its senses.

To easily visualize the links between words, you can use this web-based GUI.

The synsets of a word are obtained with wordnet.synsets(). Inflected forms such as 'schooled' or 'cats' are reduced to their base form before the lookup:
>>> from nltk.corpus import wordnet
>>> wordnet.synsets('school')
[Synset('school.n.01'), Synset('school.n.02'), Synset('school.n.03'), Synset('school.n.04'), Synset('school.n.05'), Synset('school.n.06'), Synset('school.n.07'), Synset('school.v.01'), Synset('educate.v.03'), Synset('school.v.03')]
>>> wordnet.synsets('schooled')
[Synset('school.v.01'), Synset('educate.v.03'), Synset('school.v.03')]
>>> wordnet.synsets('cats')
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), Synset('vomit.v.01')]

Types of relations (for nouns):

hypernyms/hyponyms. A hypernym is a more general term that describes a concept. A synset s1 is a hypernym of s2 if s2 is a type of s1. If s1 is a hypernym of s2, then s2 is a hyponym of s1.
>>> sch = wordnet.synset('school.n.01')
>>> sch.hyponyms()
[Synset('academy.n.03'), Synset('alma_mater.n.01'), Synset('conservatory.n.01'), Synset('correspondence_school.n.01'), Synset('crammer.n.03'), Synset('dance_school.n.01'), Synset('dancing_school.n.01'), Synset('day_school.n.02'), Synset('direct-grant_school.n.01'), Synset('driving_school.n.01'), Synset('finishing_school.n.01'), Synset('flying_school.n.01'), Synset('grade_school.n.01'), Synset('graduate_school.n.01'), Synset('language_school.n.01'), Synset('night_school.n.01'), Synset('nursing_school.n.01'), Synset('private_school.n.01'), Synset('public_school.n.01'), Synset('religious_school.n.01'), Synset('riding_school.n.01'), Synset('secondary_school.n.01'), Synset('secretarial_school.n.01'), Synset('sunday_school.n.01'), Synset('technical_school.n.01'), Synset('training_school.n.01'), Synset('veterinary_school.n.01')]
>>> sch.hypernyms()
[Synset('educational_institution.n.01')]

meronyms/holonyms. A synset s1 is a holonym of synset s2 (and, simultaneously, s2 is a meronym of s1) if s2 is contained in s1. Meronyms and holonyms come in three types, each with its own method prefix (in parentheses): "part of" ("part_"), "substance" ("substance_"), "member of" ("member_").
>>> aer = wordnet.synset('air.n.01')
>>> aer.substance_holonyms()
[Synset('wind.n.01')]
>>> aer.substance_meronyms()
[Synset('argon.n.01'), Synset('krypton.n.01'), Synset('neon.n.01'), Synset('nitrogen.n.01'), Synset('oxygen.n.01'), Synset('xenon.n.01')]
>>> house = wordnet.synset('house.n.01')
>>> house.part_holonyms()
[]
>>> house.part_meronyms()
[Synset('library.n.01'), Synset('loft.n.02'), Synset('porch.n.01'), Synset('study.n.05')]
>>> tree = wordnet.synset('tree.n.01')
>>> tree.member_holonyms()
[Synset('forest.n.01')]
>>> tree.member_meronyms()
[]
>>> tree.part_meronyms()
[Synset('burl.n.02'), Synset('crown.n.07'), Synset('limb.n.02'), Synset('stump.n.01'), Synset('trunk.n.01')]

attributes (relation to adjective synsets). An adjective synset s1 is an attribute of noun synset s2 if s1 can be a value of s2.
>>> wordnet.synset('strength.n.01').attributes()
[Synset('delicate.a.01'), Synset('rugged.a.01'), Synset('strong.a.01'), Synset('weak.a.01')]

Types of relations (for verbs):

hypernyms/troponyms. A troponym is a verb that denotes the same action performed in a more specific manner or context. For example, diving is a type of swimming, so the verb dive is a troponym of the verb swim. Troponyms are in fact verb hyponyms, so they are obtained with the hyponyms() method.
>>> vb = wordnet.synset('run.v.01')
>>> vb.hypernyms()
[Synset('travel_rapidly.v.01')]
>>> vb.hyponyms()
[Synset('hare.v.01'), Synset('jog.v.03'), Synset('lope.v.01'), Synset('outrun.v.01'), Synset('romp.v.02'), Synset('run.v.33'), Synset('run_bases.v.01'), Synset('rush.v.05'), Synset('scurry.v.01'), Synset('sprint.v.01'), Synset('streak.v.02'), Synset('trot.v.01')]

entailments. A verb entails another verb if the first action cannot take place without the second one.
>>> wordnet.synset('look.v.01').entailments()
[Synset('see.v.01')]

verb_groups. Verb synsets with closely related senses that WordNet groups together.
>>> wordnet.synset('quiz.v.01').verb_groups()
[Synset('test.v.07')]

Types of relations (for adjectives):

antonyms (a relation between lemmas, so it is queried on a lemma of the synset)
>>> lem = wordnet.synset('good.a.01').lemmas()[0]
>>> lem.antonyms()
[Lemma('bad.a.01.bad')]

similar to (also for adjective satellites)
>>> wordnet.synset('strong.a.01').similar_tos()
[Synset('beardown.s.01'), Synset('beefed-up.s.01'), Synset('brawny.s.01'), Synset('bullnecked.s.01'), Synset('bullocky.s.01'), Synset('fortified.s.02'), Synset('hard.s.04'), Synset('industrial-strength.s.01'), Synset('ironlike.s.01'), Synset('knock-down.s.01'), Synset('noticeable.s.04'), Synset('reinforced.s.01'), Synset('robust.s.03'), Synset('stiff.s.02'), Synset('vehement.s.02'), Synset('virile.s.01'), Synset('well-knit.s.01')]

pertainyms (can be nouns or other adjectives). Concepts (synsets) that pertain to the given synset. Like antonyms, they are queried on a lemma.
>>> lem = wordnet.synset('technical.a.01').lemmas()[0]
>>> lem.pertainyms()
[Lemma('technique.n.01.technique')]

attributes (relation to noun synsets)
>>> wordnet.synset('strong.a.01').attributes()
[Synset('strength.n.01')]

Types of relations (for adverbs):

antonyms
>>> wordnet.synset('quickly.r.01').lemmas()[0].antonyms()
[Lemma('slowly.r.01.slowly')]

pertainyms. Concepts (synsets) that pertain to the given synset. For an adverb this is usually the adjective the adverb is derived from.
>>> wordnet.synset('quickly.r.01').lemmas()[0].pertainyms()
[Lemma('quick.s.01.quick')]

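The similarity of two synsets: path_similarity returns a score in the range (0, 1] based on the shortest path that connects the two synsets in the hypernym/hyponym taxonomy. A higher score means the synsets are more closely related.
>>> tree = wordnet.synset('tree.n.01')
>>> tree.path_similarity(wordnet.synset('plant.n.01'))
0.09090909090909091
>>> tree.path_similarity(wordnet.synset('car.n.01'))
0.07142857142857142
>>> tree.path_similarity(wordnet.synset('public_school.n.01'))
0.05555555555555555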
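
The scores returned by path_similarity can be understood through the formula NLTK documents for it: 1 / (shortest path length + 1), where the path runs over hypernym/hyponym links. The sketch below applies that formula to a tiny hand-made taxonomy. The taxonomy and the names HYPERNYMS, neighbours, shortest_path_length and toy_path_similarity are hypothetical and only serve as an illustration; they are not WordNet data or NLTK functions.

```python
from collections import deque

# Hypothetical toy taxonomy: each concept maps to its direct hypernyms.
HYPERNYMS = {
    "entity": [],
    "plant": ["entity"],
    "artifact": ["entity"],
    "tree": ["plant"],
    "oak": ["tree"],
    "vehicle": ["artifact"],
    "car": ["vehicle"],
}


def neighbours(concept):
    """Concepts one hypernym or hyponym link away from `concept`."""
    up = HYPERNYMS[concept]
    down = [c for c, parents in HYPERNYMS.items() if concept in parents]
    return up + down


def shortest_path_length(start, goal):
    """Breadth-first search over hypernym/hyponym links; None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        concept, dist = queue.popleft()
        if concept == goal:
            return dist
        for nxt in neighbours(concept):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def toy_path_similarity(a, b):
    """1 / (shortest path length + 1) on the toy taxonomy."""
    dist = shortest_path_length(a, b)
    return None if dist is None else 1 / (dist + 1)


print(toy_path_similarity("tree", "tree"))  # identical concepts
print(toy_path_similarity("tree", "oak"))   # one link apart
print(toy_path_similarity("tree", "car"))   # path goes through the root
```

Concepts that are only connected through the root of the taxonomy get a low score, which is the same effect seen in the WordNet results for tree and car.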

Word embeddings

Word2vec is a technique that maps each word to a dense numeric vector, so that words used in similar contexts get similar vectors.

There are two techniques (both using neural networks) to obtain such vectors:

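CBOW (Continuous Bag of Words): predicts a word from the words that surround it (sg=0 in gensim, the default)
skip-gram: predicts the surrounding words from a given word (sg=1 in gensim)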
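
To make the difference between the two techniques concrete, the sketch below lists the training examples each of them would derive from one short sentence. It is a simplified illustration, not gensim's implementation: the function names context_window, cbow_examples and skipgram_examples are hypothetical, and the window is kept fixed, whereas real implementations may vary it during training.

```python
def context_window(tokens, i, window):
    """Tokens at most `window` positions to the left or right of index i."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return left + right


def cbow_examples(tokens, window):
    """CBOW: the context words are the input, the middle word is the target."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: the middle word is the input, each context word is a target."""
    return [
        (tokens[i], ctx)
        for i in range(len(tokens))
        for ctx in context_window(tokens, i, window)
    ]


sentence = ["the", "cat", "sat", "down"]
print(cbow_examples(sentence, window=1))
print(skipgram_examples(sentence, window=1))
```

With the same window, skip-gram produces one example per (word, context word) pair, while CBOW produces one example per word position.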

For this lesson we will use the gensim module: pip install gensim

We will use the gensim.models.word2vec.Word2Vec class. When its constructor receives a corpus (through sentences or corpus_file), it builds the vocabulary and trains the model.

According to the documentation, the constructor has the following signature (the default values are shown as well):
gensim.models.word2vec.Word2Vec(sentences=None, corpus_file=None, vector_size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, ns_exponent=0.75, cbow_mean=1, hashfxn=<built-in function hash>, epochs=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000, compute_loss=False, callbacks=(), comment=None, max_final_vocab=None)

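The most important parameters are:

the corpus, together with its size and vocabulary size (sentences or corpus_file; min_count and max_vocab_size limit the vocabulary)
the number of training epochs (epochs)
the window size (window)
the training algorithm: hierarchical softmax (hs=1) or negative sampling (hs=0 and negative > 0)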
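
The vocabulary size depends on the corpus and on min_count: words that occur fewer than min_count times are ignored. The sketch below imitates that filtering step in plain Python so its effect on a small corpus is visible. The function build_vocabulary and the toy corpus are hypothetical and are not part of gensim.

```python
from collections import Counter


def build_vocabulary(sentences, min_count=5):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}


# Hypothetical toy corpus: a list of tokenized sentences.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran"],
]

print(build_vocabulary(corpus, min_count=2))  # rare words are dropped
print(build_vocabulary(corpus, min_count=5))  # nothing is frequent enough
```

On a very small corpus the default min_count=5 can leave the vocabulary empty, so a lower value is usually needed for quick experiments.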

We can load a pretrained model (trained on the Google News dataset) with the following commands:
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format(modelPath, binary=True)
You need to set the modelPath variable to the path of the Google News file. You can download the model from Kaggle.

We can easily find the similarity between two words, computed as the cosine similarity of their vectors: model.similarity("cat", "dog") returns 0.76094574.
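
The similarity score is the cosine of the angle between the two word vectors. The sketch below computes it by hand on made-up 3-dimensional vectors; the function cosine_similarity and the values in toy_vectors are hypothetical and unrelated to the Google News model.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, chosen only to make the arithmetic easy to follow.
toy_vectors = {
    "cat": [1.0, 2.0, 0.0],
    "dog": [2.0, 1.0, 0.0],  # points in a direction close to "cat"
    "car": [0.0, 0.0, 3.0],  # orthogonal to "cat"
}

print(cosine_similarity(toy_vectors["cat"], toy_vectors["dog"]))
print(cosine_similarity(toy_vectors["cat"], toy_vectors["car"]))
```

The score depends only on the direction of the vectors, not on their length: scaling a vector by a positive constant leaves the similarity unchanged.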

In order to actually obtain the vector, you can use the get_vector(word) method: model.get_vector("house")

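In gensim 4.x the vocabulary is stored in the model's key_to_index property, a dict that maps each word to its index (the vocab property of gensim 3.x was removed). You can therefore check whether a certain word appears in the vocabulary: "word" in model.key_to_index. Note that the vocabulary is case-sensitive: words appear in both uppercase and lowercase forms.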
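
Because a word can appear in the vocabulary under several capitalizations, it can be useful to list all the variants that are actually present. The sketch below does this on a plain dict used as a hypothetical stand-in for the model's vocabulary; the helper case_variants is not a gensim function.

```python
def case_variants(word, vocabulary):
    """All vocabulary entries equal to `word` when case is ignored."""
    target = word.lower()
    return sorted(entry for entry in vocabulary if entry.lower() == target)


# Hypothetical stand-in for the model's vocabulary (word -> index).
toy_vocabulary = {"house": 0, "House": 1, "HOUSE": 2, "cat": 3}

print("house" in toy_vocabulary)   # exact match
print("hOuse" in toy_vocabulary)   # lookup is case-sensitive
print(case_variants("hOuse", toy_vocabulary))
```

The helper scans the whole vocabulary, which is acceptable for occasional lookups. For many lookups, a lowercase-to-variants index built once would be a better choice.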

Exercises and homework

All exercises can be done by all students, regardless of attendance.