Natural Language Processing

Laboratory

(Deadline: -)

Lesk

Original Lesk Algorithm

The Lesk algorithm was introduced by Michael Lesk in 1986. The core idea of the algorithm is that words used together in a sentence are more likely to share a common context.

If the word "mouse" appears next to "cat" in a sentence, it most likely refers to the animal; if it appears next to "computer," it most likely refers to the device.

The algorithm determines the meaning of a word by comparing the dictionary definitions (glosses) of the ambiguous word with the definitions of all the content words (nouns, verbs, adjectives and adverbs) surrounding it (in a window of text of given length).

Algorithm steps:

  1. For the chosen word to disambiguate, we obtain the list of its definitions.
  2. We obtain the words surrounding the target word in the sentence (the words from the context window).
  3. For each content word extracted from the context window we create a list with all its possible definitions.
  4. We compare the definition of each sense of the target word against the definitions of the context words and count how many words they have in common.
  5. The sense with the highest number of overlapping words (the highest "score") is chosen as the correct meaning.
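The steps above can be sketched in Python. To keep the example self-contained, it uses a tiny invented gloss dictionary instead of a real dictionary; all sense ids and glosses below are made up for illustration:

```python
# Toy gloss dictionary: word -> list of (sense_id, gloss). Invented for the example.
GLOSSES = {
    "bank": [("bank.n.01", "sloping land beside a body of water"),
             ("bank.n.02", "a financial institution that accepts deposits and lends money")],
    "money": [("money.n.01", "the medium of exchange used by a financial institution")],
    "river": [("river.n.01", "a large natural stream of water flowing to the sea")],
}

def overlap(gloss1, gloss2):
    # Number of distinct words the two glosses share (steps 4-5).
    return len(set(gloss1.lower().split()) & set(gloss2.lower().split()))

def lesk_sense(target, context_words):
    best_id, best_score = None, -1
    for sense_id, gloss in GLOSSES.get(target, []):
        # Compare this sense's gloss against every gloss of every context word.
        score = sum(overlap(gloss, ctx_gloss)
                    for w in context_words
                    for _, ctx_gloss in GLOSSES.get(w, []))
        if score > best_score:
            best_id, best_score = sense_id, score
    return best_id

print(lesk_sense("bank", ["money"]))   # -> bank.n.02 (shares "financial", "institution")
print(lesk_sense("bank", ["river"]))   # -> bank.n.01 (shares "water", among others)
```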

The original Lesk algorithm is already implemented in nltk:

>>> import nltk
>>> from nltk.corpus import wordnet
>>> from nltk.wsd import lesk
>>> lesk(nltk.word_tokenize('Students enjoy going to school, studying and reading books'), 'school', 'n')
Synset('school.n.06')
>>> syns = wordnet.synset('school.n.06')
>>> syns.definition()
"an educational institution's faculty and students"

Disadvantages

Simplified Lesk Algorithm

Simplified Lesk compares the definitions (glosses) of the target word only against the words in the surrounding context (rather than against the context words' definitions).

Lesk measure

The Lesk measure quantifies the relatedness of two words (senses) by counting the number of words their definitions (glosses) have in common (overlaps). The Lesk measure is the number of such common words.
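As a toy illustration (the two glosses below are invented for the example, not taken from a dictionary):

```python
def lesk_measure(gloss1, gloss2):
    # Count the distinct words the two glosses have in common.
    words1 = set(gloss1.lower().split())
    words2 = set(gloss2.lower().split())
    return len(words1 & words2)

print(lesk_measure("a financial institution that accepts deposits",
                   "an institution that lends money"))  # 2 ("institution", "that")
```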

Extended Gloss Overlaps (Extended Lesk)

This technique was presented by Satanjeev Banerjee and Ted Pedersen in 2003.

The algorithm measures the relatedness of two words. Just like Lesk, it counts the overlaps of glosses; however, it also takes into account the glosses of synsets related to the two words.

Suppose we want to obtain the sense for a word in a certain context (for example a sentence or just a window of text). The steps of the algorithm are:

  1. We first tag the words in the sentence with their part of speech.
  2. For each word we obtain the list of synsets corresponding to that part of speech.
  3. For each synset s we obtain the glosses of the synsets for all its:
    • hypernyms
    • hyponyms (or, for verbs, troponyms)
    • meronyms
    • holonyms
    It can also use:
    • attributes
    • similar–to
    • also–see
    It is good to use a structure that records, for each gloss, where it comes from (in order to do the tests in the exercise). We add all of them to one list of glosses per target word. We call these lists "extended glosses".
  4. For each synset of the target word (for which we want to obtain the sense) we compute a score by counting the overlaps between its extended gloss and the extended glosses of all the synsets corresponding to the words in the context.
    In computing the score, each single word that appears in both extended glosses adds 1. However, if it appears inside a common phrase of length L, we add L² (for example, if "white bread" appears in both glosses, we add 4); we obviously do not also add the score for the separate words of the phrase. We look for the longest common sequences of consecutive words in both glosses (a common sequence should not start or end with a pronoun, preposition, article or conjunction). In order to avoid counting the same overlap multiple times for the same two glosses, after counting an overlap you should replace the sequence of words with a special string that does not appear in the text (do not remove it completely, as you may obtain false overlaps). For example, you can use "###" as the special string.
  5. After computing the score for each synset of the target word, choose as the result the synset with the highest score.

Observation. In order to obtain the longest common substring (in our case, sublist, as we use a list of words), you can use SequenceMatcher from difflib:

from difflib import SequenceMatcher

l1=["She", "ate", "an", "apple", "pie", "and", "another", "apple", "pie"]
l2=["The", "apple", "pie", "is", "on", "the", "table"]

sm = SequenceMatcher(None, l1, l2)
potrivire = sm.find_longest_match(0, len(l1), 0, len(l2))
print(f"The sublist in the first list starts at {potrivire.a}")
print(f"The sublist in the second list starts at {potrivire.b}")
print("The length of the sublist is", potrivire.size)
The sublist in the first list starts at 3
The sublist in the second list starts at 1
The length of the sublist is 2
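The L²-with-masking scoring from step 4 could be sketched like this. The function name and the `#1#`/`#2#` sentinel strings are our own choices, and the restriction that a common phrase must not start or end with a function word is omitted for brevity:

```python
from difflib import SequenceMatcher

def overlap_score(gloss1, gloss2):
    """Score two glosses (given as word lists): each longest common phrase
    of length L adds L**2, then is masked so it is not counted again."""
    g1, g2 = gloss1[:], gloss2[:]   # work on copies; keep positions stable
    score = 0
    while True:
        m = SequenceMatcher(None, g1, g2).find_longest_match(0, len(g1), 0, len(g2))
        if m.size == 0:
            break
        score += m.size ** 2
        # Mask with two DIFFERENT sentinels so the masked slots never match
        # each other in later iterations (false overlaps).
        g1[m.a:m.a + m.size] = ["#1#"] * m.size
        g2[m.b:m.b + m.size] = ["#2#"] * m.size
    return score

# "white bread" (length 2) adds 4, then "good" adds 1 -> 5 in total.
print(overlap_score(["white", "bread", "is", "good"],
                    ["white", "bread", "tastes", "good"]))  # 5
```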

Exercises and homework

All exercises can be done by all students, regardless of attendance.