Laboratory
(Deadline: -)
Lesk
Original Lesk Algorithm
The Lesk algorithm was introduced by Michael Lesk in 1986. The core idea of the algorithm is that words used together in a sentence are more likely to share a common context.
If the word "mouse" appears in a sentence next to "cat", it most likely refers to the animal; if it appears next to "computer", it most likely refers to the device.
The algorithm determines the meaning of a word by comparing the dictionary definitions (glosses) of the ambiguous word with the definitions of all the content words (nouns, verbs, adjectives and adverbs) surrounding it (within a text window of a given length).
Algorithm steps:
- For the chosen target word, we obtain the list of its dictionary definitions (glosses).
- We collect the words surrounding the target word in the sentence (the words in the context window).
- For each content word extracted from the context window, we build a list of all its possible definitions.
- We compare the definition of each sense of the target word against the definitions of the context words and count how many words they have in common.
- The sense with the highest number of overlapping words (the highest "score") is chosen as the correct meaning.
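The steps above can be sketched as follows. To keep the example self-contained, it uses a tiny hypothetical gloss inventory instead of WordNet; in practice you would obtain the senses and glosses via nltk's wordnet.synsets(word) and synset.definition().

```python
# Toy sense inventory (hypothetical glosses, for illustration only).
TOY_GLOSSES = {
    "mouse": {
        "mouse.n.01": "small rodent animal with a long tail",
        "mouse.n.04": "hand-operated pointing device for a computer",
    },
    "cat": {
        "cat.n.01": "feline animal kept as a pet that hunts a rodent or bird",
    },
    "computer": {
        "computer.n.01": "a machine device for performing computations",
    },
}

def tokens(gloss):
    """Lowercased set of word types in a gloss."""
    return set(gloss.lower().split())

def original_lesk(target, context_words):
    """Pick the sense of `target` whose gloss shares the most words with
    the glosses of all senses of the context words."""
    best_sense, best_score = None, -1
    for sense, gloss in TOY_GLOSSES[target].items():
        score = sum(
            len(tokens(gloss) & tokens(g))
            for w in context_words if w != target
            for g in TOY_GLOSSES.get(w, {}).values()
        )
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

print(original_lesk("mouse", ["cat"]))       # -> mouse.n.01 (the animal)
print(original_lesk("mouse", ["computer"]))  # -> mouse.n.04 (the device)
```

Note that every sense of every context word contributes to the score; this is exactly what makes the original algorithm expensive on longer texts.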
The original Lesk algorithm is already implemented in nltk:
>>> import nltk
>>> from nltk.corpus import wordnet
>>> from nltk.wsd import lesk
>>> lesk(nltk.word_tokenize('Students enjoy going to school, studying and reading books'), 'school', 'n')
Synset('school.n.06')
>>> syns = wordnet.synset('school.n.06')
>>> syns.definition()
"an educational institution's faculty and students"
Disadvantages
- The algorithm depends heavily on how long and descriptive the definitions are.
- It is very inefficient for disambiguating a whole text, as the number of sense combinations to check is very high.
Simplified Lesk Algorithm
Simplified Lesk compares the definitions (glosses) of the target word only against the words in the surrounding context (rather than against the glosses of the context words).
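The difference from the original algorithm can be sketched as follows (again with hypothetical glosses; with nltk the senses would come from wordnet.synsets): the context itself, not the context words' glosses, is matched against each sense's gloss.

```python
def simplified_lesk(target_senses, context_words):
    """Simplified Lesk: compare each sense's gloss directly against the
    raw context words (not against the context words' glosses)."""
    context = set(w.lower() for w in context_words)
    best_sense, best_score = None, -1
    for sense, gloss in target_senses.items():
        score = len(set(gloss.lower().split()) & context)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

# Hypothetical glosses for "bank":
senses = {
    "bank.n.01": "sloping land beside a body of water",
    "bank.n.02": "a financial institution that accepts deposits",
}
print(simplified_lesk(senses,
      "I deposited money at the financial institution".split()))
# -> bank.n.02
```

Because each sense's gloss is compared against a single word set (the context) instead of against every gloss of every context word, the number of comparisons drops dramatically.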
Lesk measure
The Lesk measure quantifies the relatedness of two words (senses) by counting the number of words they have in common (overlaps) in their definitions (glosses). The Lesk measure is the number of such common words.
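As a minimal sketch (the glosses below are made up for illustration), the measure is just the size of the intersection of the two glosses' word sets:

```python
def lesk_measure(gloss1, gloss2):
    """Lesk relatedness of two senses: the number of word types
    shared by their glosses."""
    return len(set(gloss1.lower().split()) & set(gloss2.lower().split()))

# Hypothetical glosses; the shared words are "a", "device" and "screen".
print(lesk_measure("a device for pointing on a screen",
                   "an electronic device with a screen"))  # -> 3
```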
Extended Gloss Overlaps (Extended Lesk)
This technique was presented by Satanjeev Banerjee and Ted Pedersen in a 2003 article.
The algorithm measures the relatedness of two words. Just like Lesk, it counts the overlaps of glosses; however, it takes into account the glosses of synsets related to the two words as well.
Suppose we want to obtain the sense for a word in a certain context (for example a sentence or just a window of text). The steps of the algorithm are:
- We first tag the words in the sentence with their part of speech
- For each word we obtain the list of synsets corresponding to that part of speech.
- For each synset s, we obtain the glosses of all its related synsets:
- hypernyms
- hyponyms (or, for verbs, troponyms)
- meronyms
- holonyms
- attributes
- similar-to
- also-see
- For each synset of the target word (the word whose sense we want to obtain), we compute a score by counting the overlaps between its extended gloss and the extended glosses of all the synsets corresponding to the words in the context. When computing the score, each single word that appears in both extended glosses adds 1; however, if it appears inside a common phrase of length L, the phrase adds L² (for example, if "white bread" appears in both glosses, we add 4). We obviously do not also add the score for the separate words of the phrase. We try to find the longest common sequences of consecutive words in the two glosses (a common sequence should not start or end with a pronoun, preposition, article or conjunction). In order to avoid counting the same overlap multiple times for the same two glosses, after counting an overlap you should replace the matched sequence of words with a special string that does not appear in the text (do not remove it completely, as that may create false overlaps). For example, you can use "###" as the special string.
- After computing the score for each synset of the target word, choose the synset with the highest score as the result.
Observation. In order to obtain the longest common substring (in our case a sublist, since we work with lists of words), you can use SequenceMatcher from difflib:
from difflib import SequenceMatcher
l1 = ["She", "ate", "an", "apple", "pie", "and", "another", "apple", "pie"]
l2 = ["The", "apple", "pie", "is", "on", "the", "table"]
sm = SequenceMatcher(None, l1, l2)
potrivire = sm.find_longest_match(0, len(l1), 0, len(l2))
print(f"The sublist starts at index {potrivire.a} in the first list")
print(f"The sublist starts at index {potrivire.b} in the second list")
print("The length of the sublist is", potrivire.size)
Output:
The sublist starts at index 3 in the first list
The sublist starts at index 1 in the second list
The length of the sublist is 2
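The extended-Lesk scoring of a pair of glosses can be sketched with SequenceMatcher as below. This is a simplified sketch: it skips the part-of-speech filter on sequence boundaries, and it masks the two glosses with two different filler strings ("###1" and "###2") so that the masks themselves can never match each other and produce false overlaps.

```python
from difflib import SequenceMatcher

def overlap_score(gloss1, gloss2):
    """Score two glosses as in extended Lesk: repeatedly find the longest
    common sequence of consecutive words, add its squared length, then
    mask the matched words so they cannot be counted again."""
    w1, w2 = gloss1.lower().split(), gloss2.lower().split()
    score = 0
    while True:
        m = SequenceMatcher(None, w1, w2).find_longest_match(
            0, len(w1), 0, len(w2))
        if m.size == 0:
            break
        score += m.size ** 2  # a phrase of length L contributes L**2
        # Mask each side with a distinct filler so the fillers never
        # match each other in later iterations.
        for i in range(m.size):
            w1[m.a + i] = "###1"
            w2[m.b + i] = "###2"
    return score

# "white bread" (length 2) contributes 4, "butter" contributes 1.
print(overlap_score("baked white bread with butter",
                    "fresh white bread and butter"))  # -> 5
```

The loop always terminates because every iteration with a non-empty match masks at least one word on each side, so the number of matchable words strictly decreases.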
Exercises and homework
All exercises can be done by all students, regardless of attendance.