Natural Language Processing

Laboratory

(Deadline: 07.06.2020 23:59:59)


Lesk

The Lesk measure quantifies the relatedness of two words (senses) by counting the number of words that their definitions (glosses) have in common (overlaps). The Lesk measure is the number of such common words.

The Lesk algorithm is used in word sense disambiguation; it assigns a sense to a given word based on how related that sense is to the context (the rest of the words in the text).
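As a rough illustration of how the algorithm works, here is a minimal sketch built on the WordNet glosses in nltk (the helper name simple_lesk and the plain word-overlap count are just for illustration; it assumes the WordNet corpus has been downloaded, e.g. with nltk.download('wordnet')):

from nltk.corpus import wordnet as wn

def simple_lesk(context_words, target, pos=None):
    # Pick the synset of `target` whose gloss shares the most words with the context.
    context = set(w.lower() for w in context_words)
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(target, pos=pos):
        gloss_words = set(synset.definition().lower().split())
        overlap = len(gloss_words & context)  # the Lesk measure: number of common words
        if overlap > best_overlap:
            best_sense, best_overlap = synset, overlap
    return best_sense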

The Lesk algorithm is already implemented in nltk:

>>> import nltk
>>> from nltk.wsd import lesk
>>> lesk(nltk.word_tokenize('Students enjoy going to school, studying and reading books'), 'school', 'n')
Synset('school.n.06')
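To check which gloss was actually chosen, you can ask the returned synset for its definition (again assuming the WordNet data is available locally):

>>> sense = lesk(nltk.word_tokenize('Students enjoy going to school, studying and reading books'), 'school', 'n')
>>> sense.definition()

This displays the gloss of Synset('school.n.06').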

Extended Gloss Overlaps (Extended Lesk)

This technique was presented by Satanjeev Banerjee and Ted Pedersen in 2003, in the article "Extended Gloss Overlaps as a Measure of Semantic Relatedness".

The algorithm measures the relatedness of two words. Just like Lesk, it counts the overlaps between glosses; however, it also takes into account the glosses of the synsets related to the two words.

Suppose that we have two synsets s1 and s2. For each of them we obtain, in addition to its own gloss, the glosses of all the synsets related to it through the WordNet relations (such as hypernyms, hyponyms, meronyms and holonyms).

In computing the score, we add 1 for each single word that appears in both glosses. However, if a word appears in a common phrase, supposing the length of the common phrase is L, we add L² (for example, if "white bread" appears in both glosses, we add 4). We obviously don't also add the score for the separate words of the phrase. We try to find the longest common phrase (that doesn't start or end with a pronoun, preposition, article or conjunction) in both glosses.

If we have multiple synsets in the same relation to one of the given synsets, we concatenate all of their glosses into a single string.
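A rough sketch of this scoring is given below; it greedily takes the longest common word sequence, scores it with its squared length, and repeats, but for simplicity it skips the check that a phrase must not start or end with a pronoun, preposition, article or conjunction (overlap_score is a hypothetical helper, not part of nltk):

def overlap_score(gloss1, gloss2):
    # Greedily find the longest common word sequence, add its squared length,
    # replace it with placeholders that can never match again, and repeat.
    words1, words2 = gloss1.lower().split(), gloss2.lower().split()
    score = 0
    while True:
        best = None  # (length, start in words1, start in words2) of the longest common phrase
        for i in range(len(words1)):
            for j in range(len(words2)):
                k = 0
                while (i + k < len(words1) and j + k < len(words2)
                       and words1[i + k] == words2[j + k]):
                    k += 1
                if k > 0 and (best is None or k > best[0]):
                    best = (k, i, j)
        if best is None:
            break
        length, i, j = best
        score += length ** 2            # a phrase of length L contributes L^2
        words1[i:i + length] = ['#1#']  # the placeholders differ, so the matched
        words2[j:j + length] = ['#2#']  # words are not counted a second time
    return score

>>> overlap_score('a loaf of white bread', 'white bread with butter')
4

For extended Lesk, this score would be computed on the concatenated gloss strings obtained for each pair of relations, as described above.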

Exercises and homework

All exercises can be done by all students, regardless of attendance.