Natural Language Processing

Laboratory

(Deadline: -)

Knowledge-rich WSD based on WordNet++

In this laboratory we present the technique developed by Simone Paolo Ponzetto and Roberto Navigli in their article "Knowledge-rich Word Sense Disambiguation Rivaling Supervised Systems". This approach uses supplementary relations between words in order to compute the relatedness between concepts. All the new relations are based on Wikipedia, which is why in this laboratory we need the wikipedia module (documentation: https://wikipedia.readthedocs.io/en/latest/code.html). However, some of the needed relations are not implemented in the wikipedia module, therefore we will also need the requests module in order to send requests to the MediaWiki action API.
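
As a quick warm-up, here is a minimal sketch of what the wikipedia module already gives us (the title "Car" is just an illustrative choice; any article title works):

import wikipedia

# fetch a page by its exact title; auto_suggest=False prevents the module
# from silently replacing the title with a search suggestion
page = wikipedia.page("Car", auto_suggest=False)

print(page.title)            # the title of the article
print(page.url)              # the full URL of the article
print(page.links[:10])       # the first few outgoing internal links
print(page.categories[:10])  # the first few categories of the page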

Types of relations

  1. "Redirect to" relations (https://www.mediawiki.org/wiki/API:Redirects)
  2. disambiguation pages
  3. internal links

In order to use these relations we need a mapping between WordNet word senses and Wikipedia articles. In the article, the authors give as an example the word "soda" (https://en.wikipedia.org/wiki/Soda). Notice that the disambiguation page https://en.wikipedia.org/wiki/Soda_(disambiguation) redirects to this very page, and that the page illustrates the multiple senses of the word as a list of pages. You can obtain the ids of such pages with code similar to:

import requests

#create a connection(session)
r_session = requests.Session()

#url for the MediaWiki action API
URL = "https://en.wikipedia.org/w/api.php"

PARAMS = {
    "action": "query", #we are creating a query
    "titles": "car", #for the title car    
    "prop": "redirects", #asking for all the redirects (to the title car)
    "format": "json" #and we want the output in a json format
}

#we obtain the response to the GET request with the given parameters
query_response = r_session.get(url=URL, params=PARAMS)
json_data = query_response.json()

wikipedia_pages = json_data["query"]["pages"]

#we iterate through items and print all the redirects (their title and id)
try:
    for k, v in wikipedia_pages.items():
        for redir in v["redirects"]:
            print("{} redirect to {}({})".format(redir["title"], v["title"], redir["pageid"]))
except KeyError as err:
    if err.args[0]=='redirects':
        print("It has no redirects")
    else:
        print(repr(err))

The code above will print the following output:

Cars redirects to Car(73688)
Motor car redirects to Car(458458)
Motorcar redirects to Car(458459)
Automobiles redirects to Car(513608)
Motor Car redirects to Car(840650)
Ottomobile redirects to Car(1836567)
Automobles redirects to Car(1842410)
Motorization redirects to Car(3223435)
Motorisation redirects to Car(3223436)
Passenger Vehicle redirects to Car(6260924)

The JSON data of the response looks like this:

{
   "continue":{
      "rdcontinue":"6492781",
      "continue":"||"
   },
   "query":{
      "normalized":[
         {
            "from":"car",
            "to":"Car"
         }
      ],
      "pages":{
         "13673345":{
            "pageid":13673345,
            "ns":0,
            "title":"Car",
            "redirects":[
               {
                  "pageid":73688,
                  "ns":0,
                  "title":"Cars"
               },
               {
                  "pageid":458458,
                  "ns":0,
                  "title":"Motor car"
               },
               {
                  "pageid":458459,
                  "ns":0,
                  "title":"Motorcar"
               },
               {
                  "pageid":513608,
                  "ns":0,
                  "title":"Automobiles"
               },
               {
                  "pageid":840650,
                  "ns":0,
                  "title":"Motor Car"
               },
               {
                  "pageid":1836567,
                  "ns":0,
                  "title":"Ottomobile"
               },
               {
                  "pageid":1842410,
                  "ns":0,
                  "title":"Automobles"
               },
               {
                  "pageid":3223435,
                  "ns":0,
                  "title":"Motorization"
               },
               {
                  "pageid":3223436,
                  "ns":0,
                  "title":"Motorisation"
               },
               {
                  "pageid":6260924,
                  "ns":0,
                  "title":"Passenger Vehicle"
               }
            ]
         }
      }
   }
}

Notice the "normalized" field; it is not what you might expect: it does not give the lemma of the word, it only reflects MediaWiki title normalization (Unicode normalization, underscores turned into spaces and, as in the example above, the first letter capitalized).
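
Notice also the "continue" field at the top of the response: the API returns the redirects in batches, so to be sure you collect all of them you have to feed the continuation parameters back into the next request. A minimal sketch of that loop, reusing URL, PARAMS and r_session from the code above:

# collect all redirect titles, following the API continuation mechanism
all_redirects = []
params = dict(PARAMS)                  # start from the same parameters as above
while True:
    data = r_session.get(url=URL, params=params).json()
    for page in data["query"]["pages"].values():
        all_redirects.extend(r["title"] for r in page.get("redirects", []))
    if "continue" not in data:
        break                          # no more batches
    params.update(data["continue"])    # e.g. rdcontinue=..., for the next batch

print(len(all_redirects), "redirects in total")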

For disambiguations, recall the two "Soda" links above: the page https://en.wikipedia.org/wiki/Soda and the disambiguation page https://en.wikipedia.org/wiki/Soda_(disambiguation), which redirects to it.
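
One convenient way to obtain the candidate pages listed on a disambiguation page is through the wikipedia module, which raises a DisambiguationError whose options attribute holds the titles the page may refer to. A sketch (whether a given title, such as "Soda", actually triggers the exception depends on the current state of Wikipedia):

import wikipedia

try:
    # requesting a disambiguation page raises DisambiguationError
    page = wikipedia.page("Soda", auto_suggest=False)
    print("Not a disambiguation page:", page.title)
except wikipedia.DisambiguationError as err:
    # err.options holds the titles of the candidate pages (the possible senses)
    for option in err.options:
        print(option)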

In order to create the mapping, for a given Wikipedia page we shall use:

  1. sense labels (these are actually the titles of the pages; at the time the article was written, titles had the syntax "word (sense label)", such as "soda (soft drink)", but notice that nowadays you may find only the sense label as the title)
  2. links (outgoing links from the current page)
  3. categories

The article uses the notation Ctx(w) for the set of words obtained from the text of some or all of these pages.
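
A possible way to approximate Ctx(w) with the wikipedia module, using the sense label (title), the outgoing links and the categories of the page; the tokenization below is just one simple choice:

import re
import wikipedia

def wikipedia_context(title):
    """Approximate Ctx(w): the set of lowercased words taken from the
    title (sense label), the outgoing links and the categories of a page."""
    page = wikipedia.page(title, auto_suggest=False)
    text = " ".join([page.title] + page.links + page.categories)
    return set(re.findall(r"[a-z]+", text.lower()))

print(sorted(wikipedia_context("Car"))[:10])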

Next, for each sense s of the word we need its WordNet context, Ctx(s). For this we use the following relations (a sketch of how to collect them with NLTK follows the list):

  1. synonymy
  2. hypernymy/hyponymy
  3. sisterhood (senses that have the same direct hypernym)
  4. gloss
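
A sketch of how Ctx(s) could be collected with NLTK's WordNet interface, using exactly these four relations (the word "car" and the tokenization are again only illustrative):

import re
from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet")

def wordnet_context(synset):
    """Approximate Ctx(s): words from the synonyms, the hypernyms/hyponyms,
    the sister synsets (same direct hypernym) and the gloss of the synset."""
    related = [synset] + synset.hypernyms() + synset.hyponyms()
    for hypernym in synset.hypernyms():
        related += hypernym.hyponyms()   # sister senses
    words = set()
    for s in related:
        words.update(lemma.name().replace("_", " ").lower() for lemma in s.lemmas())
    words.update(re.findall(r"[a-z]+", synset.definition().lower()))
    return words

for s in wn.synsets("car"):
    print(s.name(), len(wordnet_context(s)))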

The next step is the mapping itself:

  1. For each word that we want to disambiguate, if it has only one sense and only one Wikipedia page, we map that Wikipedia page to the word's single sense.
  2. In the case of multiple senses, for each remaining Wikipedia page w (after the mapping from the previous step) that still has no associated WordNet sense, we take all the redirects to w. For each such redirect we check whether we already have a mapping associated with it (a relation between its sense and a Wikipedia page). If we have such a mapping and the mapped word is in the synset of w, we map w to the sense associated with the redirect page.
  3. For all Wikipedia pages that are not mapped yet, we try to assign the most probable sense. The most probable sense is the one with the highest value p, computed as score(s, w)/sum, where sum adds up the scores of all combinations between a sense of the word from WordNet and a page of the word from Wikipedia. The score is the number of common words between the Wikipedia context and the WordNet context of the sense, plus 1: score(s, w) = |Ctx(s) ∩ Ctx(w)| + 1.
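
The scoring in step 3 can be written down directly. A sketch, reusing the illustrative wikipedia_context and wordnet_context helpers defined earlier (all names below are our own, not taken from the article's code):

def score(ctx_s, ctx_w):
    # score(s, w) = |Ctx(s) ∩ Ctx(w)| + 1
    return len(ctx_s & ctx_w) + 1

def most_probable_sense(wordnet_ctxs, wiki_ctxs, page_title):
    """wordnet_ctxs: dict synset -> Ctx(s); wiki_ctxs: dict page title -> Ctx(w).
    Returns the WordNet sense with the highest p = score(s, w) / sum, where the
    sum runs over every (sense, page) combination for the word."""
    total = sum(score(cs, cw) for cs in wordnet_ctxs.values() for cw in wiki_ctxs.values())
    ctx_w = wiki_ctxs[page_title]
    return max(wordnet_ctxs, key=lambda s: score(wordnet_ctxs[s], ctx_w) / total)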

In the end we have created new relations (WordNet++) that we can use in a simplified-Lesk manner to disambiguate a text: we compute the overlaps on all the glosses reachable through the mentioned relations.
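
A sketch of that final step: a simplified-Lesk-style scorer that measures the overlap between the words of the sentence and an extended gloss (the gloss of the sense plus the glosses reachable through the new relations). The extended_glosses argument below is hypothetical, standing in for whatever WordNet++ relation set you build in the exercises:

import re
from nltk.corpus import wordnet as wn

def tokenize(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def simplified_lesk_plus(word, sentence, extended_glosses):
    """Pick the sense of `word` whose (extended) gloss overlaps the sentence most.
    extended_glosses: hypothetical dict synset -> list of extra gloss strings."""
    context = tokenize(sentence)
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(word):
        gloss = " ".join([synset.definition()] + extended_glosses.get(synset, []))
        overlap = len(tokenize(gloss) & context)
        if overlap > best_overlap:
            best_sense, best_overlap = synset, overlap
    return best_sense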

Exercises and homework

All exercises can be done by all students, regardless of attendance.