Natural Language Processing

Laboratory

(Deadline: -)

Lesk

Original Lesk Algorithm

The Lesk algorithm was introduced by Michael Lesk in 1986. The core idea of the algorithm is that words used together in a sentence are more likely to share a common context.

If the word "mouse" appears next to "cat" in a sentence, it most likely refers to the animal; if it appears next to "computer," it most likely refers to the device.

The algorithm determines the meaning of a word by comparing the dictionary definitions (glosses) of the ambiguous word with the definitions of all the content words (nouns, verbs, adjectives and adverbs) surrounding it (in a window of text of given length).

Algorithm steps:

  1. For the chosen word to disambiguate, we obtain the list of its definitions.
  2. We obtain the words surrounding the target word in the sentence (the words from the context window).
  3. For each content word extracted from the context window we create a list with all its possible definitions.
  4. We compare the definition of each sense of the target word against the definitions of the context words and count how many words they have in common.
  5. The sense with the highest number of overlapping words (the highest "score") is chosen as the correct meaning.
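The steps above can be sketched in Python. To keep the example self-contained, it uses a tiny invented gloss dictionary instead of a real dictionary; all sense ids and glosses below are made up for illustration:

```python
# Toy gloss dictionary: word -> list of (sense_id, gloss). Invented for the example.
GLOSSES = {
    "bank": [("bank.n.01", "sloping land beside a body of water"),
             ("bank.n.02", "a financial institution that accepts deposits and lends money")],
    "money": [("money.n.01", "the medium of exchange used by a financial institution")],
    "river": [("river.n.01", "a large natural stream of water flowing to the sea")],
}

def overlap(gloss1, gloss2):
    # Number of distinct words the two glosses share (steps 4-5).
    return len(set(gloss1.lower().split()) & set(gloss2.lower().split()))

def lesk_sense(target, context_words):
    best_id, best_score = None, -1
    for sense_id, gloss in GLOSSES.get(target, []):
        # Compare this sense's gloss against every gloss of every context word.
        score = sum(overlap(gloss, ctx_gloss)
                    for w in context_words
                    for _, ctx_gloss in GLOSSES.get(w, []))
        if score > best_score:
            best_id, best_score = sense_id, score
    return best_id

print(lesk_sense("bank", ["money"]))   # -> bank.n.02 (shares "financial", "institution")
print(lesk_sense("bank", ["river"]))   # -> bank.n.01 (shares "water", among others)
```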

The original Lesk algorithm is already implemented in nltk:

>>> import nltk
>>> from nltk.corpus import wordnet
>>> from nltk.wsd import lesk
>>> lesk(nltk.word_tokenize('Students enjoy going to school, studying and reading books'), 'school', 'n')
Synset('school.n.06')
>>> syns = wordnet.synset('school.n.06')
>>> syns.definition()
"an educational institution's faculty and students"

Disadvantages

Simplified Lesk Algorithm

Simplified Lesk compares the definitions (glosses) of the target word only against the words in the surrounding context (rather than against the context words' definitions).

Lesk measure

The Lesk measure quantifies the relatedness of two words (senses) by counting the number of words their definitions (glosses) have in common (overlaps). The Lesk measure is the number of such common words.
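As a toy illustration (the two glosses below are invented for the example, not taken from a dictionary):

```python
def lesk_measure(gloss1, gloss2):
    # Count the distinct words the two glosses have in common.
    words1 = set(gloss1.lower().split())
    words2 = set(gloss2.lower().split())
    return len(words1 & words2)

print(lesk_measure("a financial institution that accepts deposits",
                   "an institution that lends money"))  # 2 ("institution", "that")
```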

Extended Gloss Overlaps (Extended Lesk)

This technique was presented by Satanjeev Banerjee and Ted Pedersen in 2003.

The algorithm measures the relatedness of two words. Just like Lesk, it counts the overlaps of glosses; however, it also takes into account the glosses of synsets related to the two words.

Suppose we want to obtain the sense for a word in a certain context (for example a sentence or just a window of text). The steps of the algorithm are:

  1. We first tag the words in the sentence with their part of speech.
  2. For each word we obtain the list of synsets corresponding to that part of speech.
  3. For each synset s we obtain the glosses of the synsets for all its:
    • hypernyms
    • hyponyms (or, for verbs, troponyms)
    • meronyms
    • holonyms
    It can also use:
    • attributes
    • similar–to
    • also–see
    It is good to use a structure that records, for each gloss, where it comes from (in order to do the tests in the exercise). We add all of them to one list of glosses per target word. We call these lists "extended glosses".
  4. For each synset of the target word (for which we want to obtain the sense) we compute a score by counting the overlaps between its extended gloss and the extended glosses of all the synsets corresponding to the words in the context.
    In computing the score, each single word that appears in both extended glosses adds 1. However, if it appears inside a common phrase of length L, we add L² (for example, if "white bread" appears in both glosses, we add 4); we obviously do not also add the score for the separate words of the phrase. We look for the longest common sequences of consecutive words in both glosses (a common sequence should not start or end with a pronoun, preposition, article or conjunction). In order to avoid counting the same overlap multiple times for the same two glosses, after counting an overlap you should replace the sequence of words with a special string that does not appear in the text (do not remove it completely, as you may obtain false overlaps). For example, you can use "###" as the special string.
  5. After computing the score for each synset of the target word, choose as the result the synset with the highest score.

Observation. In order to obtain the longest common substring (in our case, sublist, as we use a list of words), you can use SequenceMatcher from difflib:

from difflib import SequenceMatcher

l1=["She", "ate", "an", "apple", "pie", "and", "another", "apple", "pie"]
l2=["The", "apple", "pie", "is", "on", "the", "table"]

sm = SequenceMatcher(None, l1, l2)
potrivire = sm.find_longest_match(0, len(l1), 0, len(l2))
print(f"The sublist in the first list starts at {potrivire.a}")
print(f"The sublist in the second list starts at {potrivire.b}")
print("The length of the sublist is", potrivire.size)
The sublist in the first list starts at 3
The sublist in the second list starts at 1
The length of the sublist is 2
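The L²-with-masking scoring from step 4 could be sketched like this. The function name and the `#1#`/`#2#` sentinel strings are our own choices, and the restriction that a common phrase must not start or end with a function word is omitted for brevity:

```python
from difflib import SequenceMatcher

def overlap_score(gloss1, gloss2):
    """Score two glosses (given as word lists): each longest common phrase
    of length L adds L**2, then is masked so it is not counted again."""
    g1, g2 = gloss1[:], gloss2[:]   # work on copies; keep positions stable
    score = 0
    while True:
        m = SequenceMatcher(None, g1, g2).find_longest_match(0, len(g1), 0, len(g2))
        if m.size == 0:
            break
        score += m.size ** 2
        # Mask with two DIFFERENT sentinels so the masked slots never match
        # each other in later iterations (false overlaps).
        g1[m.a:m.a + m.size] = ["#1#"] * m.size
        g2[m.b:m.b + m.size] = ["#2#"] * m.size
    return score

# "white bread" (length 2) adds 4, then "good" adds 1 -> 5 in total.
print(overlap_score(["white", "bread", "is", "good"],
                    ["white", "bread", "tastes", "good"]))  # 5
```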

Exercises and homework

All exercises can be done by all students, regardless of attendance.