Natural Language Processing

Laboratory

(Deadline: 14.05.2020 23:59:59)


Morphological analysis

StanfordPOSTagger

A POS-tagger is a program that tags words in raw text, indicating their part of speech. StanfordPOSTagger is widely used in NLP tasks. Please download the full version. Unzip the archive and search for stanford-postagger.jar.

Import the tagger in the Python console: from nltk.tag.stanford import StanfordPOSTagger

We want to create a tagger instance: tagger=StanfordPOSTagger(model_path, jar_tagger_path)

We will use english-bidirectional-distsim.tagger.

To compute POS tags we use the tag() method

tagger.tag(["I","saw","a","cat"])

In case you receive an error while trying to use the tagger, read here: https://stackoverflow.com/questions/34692987/cant-make-stanford-pos-tagger-working-in-nltk

Try to parse a sentence and learn what each tag means.

Before we tag raw text, we must tokenize it first.
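As a quick sketch of the tokenization step (the tokenizer choice here is an assumption; nltk.word_tokenize is also common, but it requires the punkt data package, while a RegexpTokenizer needs no extra downloads):

```python
from nltk.tokenize import RegexpTokenizer

# Split the text into runs of word characters or single punctuation marks;
# RegexpTokenizer needs no extra NLTK data downloads.
tokenizer = RegexpTokenizer(r"\w+|[^\w\s]")
tokens = tokenizer.tokenize("I saw a cat.")
print(tokens)  # ['I', 'saw', 'a', 'cat', '.']
# The resulting token list can then be passed to tagger.tag(tokens).
```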

You can also use the Stanford POS tagger's online version.

NLTK tagger

You can also use NLTK's own POS tagger:
>>> nltk.pos_tag(["I", "saw", "a", "cat"])
[('I', 'PRP'), ('saw', 'VBD'), ('a', 'DT'), ('cat', 'NN')]
You can find the meaning of a tag with nltk.help.upenn_tagset(tag):
>>> nltk.help.upenn_tagset('DT')
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
>>>

Syntax analysis

It is used to obtain the structure of the sentence and the connections between tokens in a sentence.

Parsing based on Context-Free Grammar

Please download the Stanford Parser.

Use the graphical user interface: lexparser-gui.bat

The parse tree for the sentence "I like to go to school." is:
(syntax tree image)

Recursive Descent Parsing

This is a top-down parser that backtracks through the rules, expanding the tree nodes in a depth-first manner.

gram=nltk.CFG.fromstring("""  S -> NP VP | TO VB
VP -> V NP | V NP PP | V S | V PP
PP -> P NP  
V -> "caught" | "ate" | "likes" | "like" | "chase" | "go"
NP -> Det N | Det N PP | PRP
Det -> "the" | "a" | "an" | "my" | "some"
N -> "mice" | "cat" | "dog" |  "school"
P -> "in" | "to" | "on"
TO -> "to"
PRP -> "I"  """)
>>> sent=["I", "like", "my", "dog"]
>>> rdp = nltk.RecursiveDescentParser(gram)
>>> for tree in rdp.parse(sent):
    print(tree)

(S (NP (PRP I)) (VP (V like) (NP (Det my) (N dog))))

You can observe the way it works by using the app provided by nltk: nltk.app.rdparser()

Shift Reduce Parsing

>>> srp = nltk.ShiftReduceParser(gram)
>>> sent=["I", "like", "my", "dog"]
>>> for tree in srp.parse(sent):
    print(tree)

(S (NP (PRP I)) (VP (V like) (NP (Det my) (N dog))))

You can observe the way it works by using the app provided by nltk: nltk.app.srparser()

Left corner parser

The left-corner parser uses both a top-down and a bottom-up strategy. In the top-down steps it tries to predict the possible structure of the phrase, while continuously checking that prediction through a bottom-up process (verifying that the input phrase matches it). A left-corner parser does some preprocessing before the parsing itself: it creates an association between each non-terminal label and the list of all its possible left corners (symbols that can start the expression). Before applying a production from the context-free grammar, it checks that the next word has a starting label (in the left-corner list) that applies to it.

Algorithm steps:

  1. Create the left-corner table. For each production A -> sequence_of_symbols we take the non-terminal A and check whether it already has a corresponding row in the table. If it doesn't, we create a new row for it and add the first (leftmost) symbol from sequence_of_symbols. If A already has a row, we just add the symbol to that row.
  2. After the table is built, we start the parsing itself. We consider the starting symbol to be S (sentence). In order to find a correct production for S, we do a bottom-up search using the left-corner table. We take the first input word, move it from the input string into a list of already processed words, and find the first row in the left-corner table that contains it. We take the non-terminal (let's call it N1) associated with that row and search for it in the left-corner table; suppose we find it in the row of non-terminal N2. We repeat the same process for N2 and all the non-terminals obtained this way, until we reach the non-terminal S. Suppose S was reached through the left-corner value Nk. We take the production "S -> Nk set_of_symbols" and make a top-down step, adding it to the parse tree. We take the first symbol from the remaining set of symbols, let's call it A, and repeat the following steps:
    1. For the current non-terminal taken from the top-down prediction, we want to find a match with the input string.
    2. We take the first word from the remaining input string and search the left-corner table for the non-terminal whose row contains it. We repeat the same process for that non-terminal, finding it in a row of the table and noting the non-terminal associated with that row.
    3. We stop when this process reaches the non-terminal A. We then expand A (as we did with S, making a top-down prediction) or, if A expands into only one symbol (the one already associated with it through the bottom-up steps), we choose the next non-terminal from the top-down prediction as the current non-terminal to match, and repeat the process.
    4. If we don't find a match with the non-terminal A (the bottom-up steps resulting from the left-corner table searches end with another non-terminal), we backtrack (first over the bottom-up steps and, once those are exhausted, over the top-down prediction as well).

We consider the following grammar (we also consider that the production rules are processed in this exact order):

S -> NP VP
S -> VP
NP -> DT NN
NP -> DT JJ NN
NP -> PRP
VP -> VBP NP
VP -> VBP VP
VP -> VBG NP
VP -> TO VP
VP -> VB
VP -> VB NP
NN -> "show" | "book"
PRP -> "I"
VBP -> "am"
VBG -> "watching"
VB -> "show"
DT -> "a" | "the"
MD -> "will"
The left corner table would be:

Symbol | Left corners
-------+---------------------
S      | NP, VP
NP     | DT, PRP
VP     | VBP, VBG, TO, VB
NN     | "show", "book"
PRP    | "I"
VBP    | "am"
VBG    | "watching"
VB     | "show"
DT     | "a", "the"
MD     | "will"
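The table construction in step 1 can be sketched in Python with NLTK (a minimal sketch; duplicate left corners are simply skipped):

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
S -> VP
NP -> DT NN
NP -> DT JJ NN
NP -> PRP
VP -> VBP NP
VP -> VBP VP
VP -> VBG NP
VP -> TO VP
VP -> VB
VP -> VB NP
NN -> "show" | "book"
PRP -> "I"
VBP -> "am"
VBG -> "watching"
VB -> "show"
DT -> "a" | "the"
MD -> "will"
""")

# Build the left-corner table: for each production, record the
# leftmost right-hand-side symbol under the left-hand-side's row.
left_corners = {}
for prod in grammar.productions():
    lhs = prod.lhs().symbol()
    first = prod.rhs()[0]
    if nltk.grammar.is_nonterminal(first):
        first = first.symbol()
    row = left_corners.setdefault(lhs, [])
    if first not in row:
        row.append(first)

print(left_corners["VP"])  # ['VBP', 'VBG', 'TO', 'VB']
```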

Well-Formed Substring Tables

This is a type of chart parsing (in Romanian: parser cu agenda). It is used in order to avoid rebuilding subtrees that are already correct. All the parsers above use backtracking, so subtrees may be processed again because of a wrong assumption in subtrees created before them: it is in backtracking's nature to delete (forget) all the computation done for a wrong prefix of the solution. With a well-formed substring table, each time we build a subtree we save it in a "table" and reuse it when needed. An idea of implementation is given in the NLTK book (Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper, 2009).

  1. Supposing we have a sentence of n words, we create a matrix of (n+1)×(n+1) elements. Let's call this matrix T. T[i][j] will contain the root of the subtree covering all the words from i to j-1. At first the matrix is empty (initialized with a null value), except the elements T[i][i+1], which contain the i-th word.
  2. Next we repeatedly apply the productions, completing the table, until no more changes are made. In order to complete T[i][j] with a label, we must have a number k in [0,n] such that T[i][k] and T[k][j] are both completed (say T[i][k] is B and T[k][j] is C) and a production A -> B C exists. In this case we assign T[i][j] = A. We may obtain multiple values for the same line i and column j (from different reductions of the trees); in this case we save all of them, so it is better to consider T[i][j] a list of symbols. An even better representation, in order to easily recover the productions responsible for the final tree, is to save the whole production (A -> B C) in T[i][j], for example through its id (assuming all the grammar's productions have ids).
  3. How do we treat productions whose right-hand side does not have exactly two symbols? We can preprocess the grammar and rewrite such productions, adding auxiliary ones so that each production has exactly two symbols; for example, "A -> B C D" can be changed into "A -> B NewT1" and "NewT1 -> C D", where NewT1 is a new non-terminal used only for this production.
  4. The algorithm finishes when no more reductions can be made. If we obtained S (the sentence node) in T[0][n], we have successfully parsed the sentence.
It is not mandatory to use a matrix; you can keep the agenda as a list of subtrees, however you still need to save (in that structure) the indexes of the leaves contained by each subtree.
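The steps above can be sketched as follows (a minimal sketch that stores sets of labels rather than production ids, on a small grammar that is already in two-symbol form, so the rewriting in step 3 is not needed):

```python
import nltk

# A small grammar whose productions have at most two symbols
# on the right-hand side.
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> Det N
NP -> "I"
V -> "like"
Det -> "my"
N -> "dog"
""")

def wfst_parse(tokens, grammar):
    n = len(tokens)
    # T[i][j] holds the set of labels covering the words from i to j-1.
    T = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(tokens):
        for prod in grammar.productions(rhs=word):
            T[i][i + 1].add(prod.lhs())
    # Fill larger spans from smaller ones: T[i][j] gets A whenever
    # A -> B C exists with B in T[i][k] and C in T[k][j].
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for prod in grammar.productions():
                    if (len(prod.rhs()) == 2
                            and prod.rhs()[0] in T[i][k]
                            and prod.rhs()[1] in T[k][j]):
                        T[i][j].add(prod.lhs())
    return T

tokens = ["I", "like", "my", "dog"]
T = wfst_parse(tokens, grammar)
print(grammar.start() in T[0][len(tokens)])  # True: the sentence is parsed
```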

Dependency parsing

Dependency Grammars (DG) are based on dependency rules between the words of a phrase. A dependency rule relates two words: the head word and the dependent word. We visually represent the relation between the two words as an arrow, beginning at the head word and ending at the dependent word. On each such arrow, the name of the dependency is also written. The Stanford Parser prints the dependencies in a text format, as you can see below for the phrase "I like to go to school.":

nsubj(like-2, I-1)
nsubj:xsubj(go-4, I-1)
root(ROOT-0, like-2)
mark(go-4, to-3)
xcomp(like-2, go-4)
case(school-6, to-5)
nmod:to(go-4, school-6)

The syntax of each dependency rule is dependency_type(head_word-position_in_sentence, dependent_word-position_in_sentence)
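This text format is easy to post-process; below is a small sketch (the regular expression is an assumption matching exactly the format shown above):

```python
import re

# Matches lines like "nsubj(like-2, I-1)" and captures the relation,
# the head word with its position, and the dependent word with its position.
DEP = re.compile(r"(\S+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def parse_dependency(line):
    rel, head, head_pos, dep, dep_pos = DEP.match(line).groups()
    return rel, head, int(head_pos), dep, int(dep_pos)

print(parse_dependency("nsubj(like-2, I-1)"))
# ('nsubj', 'like', 2, 'I', 1)
```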

You can find information about the meaning of the dependencies at https://nlp.stanford.edu/software/stanford-dependencies.html. You can also find a list with each dependency type and its meaning at http://universaldependencies.org/u/dep/

In order to apply dependency parsing on a phrase, we'll use the Stanford parser. Open cmd and change the current directory to the parser directory. We also search for the models jar (its file name has the syntax stanford-parser-version-models.jar, for example stanford-parser-3.9.2-models.jar), unzip from it the file at the relative path edu\stanford\nlp\models\lexparser\englishPCFG.ser.gz, and copy this file into the parser's folder.

In cmd we write a command with this syntax: java -mx200m -cp "stanford-parser.jar;stanford-parser-3.9.2-models.jar" edu.stanford.nlp.parser.lexparser.LexicalizedParser -retainTMPSubcategories -outputFormat "wordsAndTags,penn,typedDependencies" englishPCFG.ser.gz [path_to_input_file] > [path_to_output_file]

For the outputFormat parameter, we've specified:

  • wordsAndTags - show the words with their POS tags
  • penn - show the syntactic analysis (in a tree format) based on CFG
  • typedDependencies - show the dependency parsing

In the input file you'll write the text to be parsed. It can contain multiple phrases.

By default the result is printed to the console, but we can redirect the output to a file using ">" and specifying the output file path, for example: java -mx200m -cp "stanford-parser.jar;stanford-parser-3.9.2-models.jar" edu.stanford.nlp.parser.lexparser.LexicalizedParser -retainTMPSubcategories -outputFormat "wordsAndTags,penn,typedDependencies" englishPCFG.ser.gz ../text.txt > ../out.txt, where ../text.txt is the relative path of a file containing the text to be parsed and ../out.txt will be the relative path of the output file for the parsing.

To easily visualise the dependencies, we download DependenSee from the extensions section at https://nlp.stanford.edu/software/lex-parser.html. We add the jar file to the parser's folder. Use a command similar to the one below: java -cp dependensee-3.7.0.jar;stanford-parser.jar;stanford-parser-3.9.2-models.jar;slf4j-api.jar com.chaoticity.dependensee.Main [sentence to parse] [output png file path], for example (it is the exact example from the DependenSee official web page): java -cp "dependensee-3.7.0.jar;stanford-parser.jar;stanford-parser-3.9.2-models.jar" com.chaoticity.dependensee.Main "Example isn't another way to teach, it is the only way to teach." out.png

Attention: this command is for Windows. If you are using Unix, you need the corresponding separator: use : instead of ; in the classpath.

Projective Dependency Parsing

Non-Projective Dependency Parsing

Using Stanford Parser in NLTK

For CFG parsing:

import os
from nltk.parse.stanford import StanfordParser

os.environ['JAVAHOME'] = "path to the Java bin folder"
os.environ['STANFORD_PARSER'] = 'relative/absolute path to the parser jar'
os.environ['STANFORD_MODELS'] = 'relative/absolute path to the models jar'

parser = StanfordParser(model_path="relative/absolute path to englishPCFG.ser.gz")
sentences = parser.raw_parse_sents(("I like to go to school.", "The cat is running through the room.", "Where are you?"))
for sentence in sentences:
    print(list(sentence))

For dependency parsing:

import os
from nltk.parse.stanford import StanfordDependencyParser
# you can also use nltk.parse.corenlp.CoreNLPDependencyParser

os.environ['JAVAHOME'] = "path to the Java bin folder"
parser_path = 'relative/absolute path to the parser jar'
models_path = 'relative/absolute path to the models jar'

dependency_parser = StanfordDependencyParser(path_to_jar=parser_path, path_to_models_jar=models_path)

dependencies = dependency_parser.raw_parse('I like to go to school.')

for dep in dependencies:
    print(list(dep.triples()))

Exercises and homework

All exercises can be done by all students, regardless of attendance.