Natural Language Processing

Laboratory

(Deadline: -)


Supervised Word Sense Disambiguation (Bayes)

We use a Naive Bayes classifier to label (classify) words with a certain WordNet sense. For this we need a context window surrounding the target word (the word whose sense we are looking for). The context window should contain only "content words" (words that carry meaning and bring information, such as nouns, verbs, etc.).
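As an illustration, the context window could be built as in the sketch below. It assumes the context is a list of (word, POS tag) tuples and treats Penn Treebank tags starting with NN, VB, JJ or RB as content words. The function name content_window, the tag list and the window size are assumptions made for this example; they are not part of NLTK.

```python
# Tag prefixes treated as "content words" (an assumption for this sketch):
# nouns, verbs, adjectives and adverbs in the Penn Treebank tag set.
CONTENT_TAG_PREFIXES = ("NN", "VB", "JJ", "RB")


def is_content(item):
    """True if item is a (word, tag) tuple whose tag marks a content word."""
    return (
        isinstance(item, tuple)
        and len(item) == 2
        and item[1].startswith(CONTENT_TAG_PREFIXES)
    )


def content_window(context, position, size=3):
    """Return up to `size` content words on each side of the target word."""
    left = [item[0].lower() for item in context[:position] if is_content(item)]
    right = [item[0].lower() for item in context[position + 1:] if is_content(item)]
    # Keep the content words closest to the target on both sides.
    return left[max(0, len(left) - size):] + right[:size]


# Hypothetical excerpt of a tagged sentence; the target "interest" is at index 6.
sample = [
    ("portfolio", "NN"), ("managers", "NNS"), ("expect", "VBP"),
    ("further", "JJ"), ("declines", "NNS"), ("in", "IN"),
    ("interest", "NN"), ("rates", "NNS"), (".", "."),
]

print(content_window(sample, 6, size=2))  # ['further', 'declines', 'rates']
```

The preposition "in" and the final period are skipped because their tags are not in the content-word list.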

We denote by P(s|c) the probability of sense s in the context c. This probability is computed for each sense of the target word, and we choose the sense with the highest probability.

To compute the probability P(s|c), we use Bayes' formula: P(s|c) = P(c|s) * P(s) / P(c). P(s) is the probability of a sense without any context. P(c) is the same for every sense, so it can be ignored when comparing senses. To estimate P(c|s), however, we need a training set (texts that contain the target word, already labeled with its correct sense).
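To make the formula concrete, here is a minimal sketch that applies it by hand to a tiny, made-up training set. It relies on two assumptions that are not stated above: the "naive" assumption that the context words are independent given the sense, so P(c|s) is a product of per-word probabilities, and add-one smoothing so that unseen words do not give a zero probability. The sense labels "money" and "curiosity" and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Made-up labeled contexts for the target word "interest".
labeled = [
    (["rates", "bank", "loan"], "money"),
    (["rates", "pay", "loan"], "money"),
    (["bank", "pay", "rates"], "money"),
    (["music", "hobby", "art"], "curiosity"),
]


def train_counts(examples):
    """Count how often each sense and each (sense, word) pair occurs."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, sense in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(words)
        vocab.update(words)
    return sense_counts, word_counts, vocab


def log_score(context, sense, sense_counts, word_counts, vocab):
    """log P(s) + sum of log P(w|s); P(c) is dropped because it is constant."""
    score = math.log(sense_counts[sense] / sum(sense_counts.values()))
    denom = sum(word_counts[sense].values()) + len(vocab)  # add-one smoothing
    for word in context:
        score += math.log((word_counts[sense][word] + 1) / denom)
    return score


def disambiguate(context, sense_counts, word_counts, vocab):
    """Return the sense with the highest score for the given context."""
    return max(
        sense_counts,
        key=lambda s: log_score(context, s, sense_counts, word_counts, vocab),
    )


model = train_counts(labeled)
print(disambiguate(["rates", "loan"], *model))  # money
print(disambiguate(["music", "art"], *model))   # curiosity
```

Logarithms are used only to avoid multiplying many small numbers; the ranking of the senses is the same as with the plain product.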

Fortunately, NLTK already implements the classifier. We can use the NLTK NaiveBayesClassifier: https://www.nltk.org/_modules/nltk/classify/naivebayes.html

The Naive Bayes classifier first computes the prior probability of each sense (or, generally speaking, of each class label); this is determined by the label's frequency in the training set. The features are then used to estimate the likelihood of that label in a given context.

nltk.NaiveBayesClassifier.train(train_set)

where train_set is a list of tuples with two elements each, one tuple per training instance. The first element is a dictionary with the features (the name and value of each feature). The second element is the class label.
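The expected shape of train_set can be sketched with a toy example. The feature naming scheme "contains(word)", the helper featureset and the sense labels are assumptions chosen for illustration; any feature names and hashable values should work. The training step is skipped if NLTK is not installed.

```python
def featureset(words):
    """Turn a list of context words into a feature dictionary."""
    return {f"contains({word})": True for word in words}


# Each element is a (feature dictionary, class label) tuple.
# The contexts and the labels "money" / "curiosity" are made up.
train_set = [
    (featureset(["rates", "loan"]), "money"),
    (featureset(["bank", "rates"]), "money"),
    (featureset(["music", "hobby"]), "curiosity"),
    (featureset(["art", "hobby"]), "curiosity"),
]

try:
    import nltk
except ImportError:
    print("NLTK is not installed; only the train_set structure is shown.")
else:
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    # Classify a new context using the same feature naming scheme.
    print(classifier.classify(featureset(["rates", "bank"])))
```

The same featureset function must be used for training and for classification, otherwise the feature names will not match.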

You can also use the Naive Bayes classifier from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html).
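With sklearn, the feature dictionaries first have to be turned into a numeric matrix. One possible way, sketched below, is to count the context words and pass the dictionaries through DictVectorizer before fitting MultinomialNB. The toy data, the sense labels and the helper word_counts are assumptions for this example, and the sklearn part is skipped if the library is not installed.

```python
from collections import Counter


def word_counts(words):
    """Feature dictionary mapping each context word to its count."""
    return dict(Counter(words))


# Made-up labeled contexts; the sense labels are hypothetical.
examples = [
    (["rates", "bank", "loan"], "money"),
    (["rates", "pay", "loan"], "money"),
    (["bank", "pay", "rates"], "money"),
    (["music", "hobby", "art"], "curiosity"),
]
X_dicts = [word_counts(words) for words, _ in examples]
y = [sense for _, sense in examples]

try:
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import MultinomialNB
except ImportError:
    prediction = None  # sklearn is not available
else:
    vectorizer = DictVectorizer()
    model = MultinomialNB()
    # fit_transform learns the feature names and builds the count matrix.
    model.fit(vectorizer.fit_transform(X_dicts), y)
    new_context = vectorizer.transform([word_counts(["rates", "loan"])])
    prediction = model.predict(new_context)[0]

print(prediction)
```

Words that were never seen during training are ignored by DictVectorizer.transform, so they do not influence the prediction.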

Useful link: https://www.nltk.org/book/ch06.html

For today's task, you need to train the NLTK Bayes classifier on senseval, on a word of your choice.

>>> from nltk.corpus import senseval
>>> inst=senseval.instances('interest.pos')
>>> inst[0]
SensevalInstance(word='interest-n', position=18, context=[('yields', 'NNS'), ('on', 'IN'), ('money-market', 'JJ'), ('mutual', 'JJ'), ('funds', 'NNS'), ('continued', 'VBD'), ('to', 'TO'), ('slide', 'VB'), (',', ','), ('amid', 'IN'), ('signs', 'VBZ'), ('that', 'IN'), ('portfolio', 'NN'), ('managers', 'NNS'), ('expect', 'VBP'), ('further', 'JJ'), ('declines', 'NNS'), ('in', 'IN'), ('interest', 'NN'), ('rates', 'NNS'), ('.', '.')], senses=('interest_6',))
>>> len(inst)
2368
>>>

Use 90% of the instances as the training set for the classifier. Then predict the sense of the word for the remaining 10% of the instances and compare each prediction with the labeled sense. Print your findings. The classes used for training are the senses, and the features are the surrounding words.
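One possible outline of the whole experiment is sketched below; it is a starting point under stated assumptions, not a reference solution. The helpers context_features, split_instances and run_experiment, the window size of 3, and the shuffle with a fixed seed are choices made for this example. Context items are assumed to be (word, tag) tuples, with a defensive fallback in case an item is a plain string. The experiment needs NLTK and the senseval corpus (for example via nltk.download('senseval')) and is skipped if they are unavailable.

```python
import random


def context_features(context, position, size=3):
    """Featureset built from up to `size` tokens on each side of the target."""
    start = max(0, position - size)
    window = context[start:position] + context[position + 1:position + 1 + size]
    features = {}
    for item in window:
        # Defensive: fall back to the item itself if it is not a (word, tag) tuple.
        word = item[0] if isinstance(item, tuple) else item
        features[f"contains({word.lower()})"] = True
    return features


def split_instances(items, train_fraction=0.9):
    """Split a list into a training part and a test part."""
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]


def run_experiment(fileid="interest.pos", seed=0):
    """Train on 90% of the instances and report accuracy on the other 10%."""
    import nltk
    from nltk.corpus import senseval

    labeled = [
        (context_features(inst.context, inst.position), inst.senses[0])
        for inst in senseval.instances(fileid)
    ]
    # Shuffle so that the test set is not just the last instances of the file.
    random.Random(seed).shuffle(labeled)
    train_set, test_set = split_instances(labeled)
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print("accuracy:", nltk.classify.accuracy(classifier, test_set))
    for features, gold in test_set[:5]:
        print("gold:", gold, "predicted:", classifier.classify(features))
    return classifier


if __name__ == "__main__":
    try:
        run_experiment()
    except (ImportError, LookupError) as err:
        print("NLTK or the senseval corpus is unavailable:", err)
```

Filtering the window down to content words, or trying other window sizes, are natural variations to compare in your findings.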

Exercises and homework

All exercises can be done by all students, regardless of attendance.