Natural Language Processing

Laboratory

(Deadline: 31.03.2024 23:59:59)


Text preprocessing

Tokenization

Tokenization is the process of breaking raw text into sentences or words.

>>> from nltk.tokenize import sent_tokenize
>>> sent_tokenize("I have two dogs and a cat. Do you have pets too? My cat likes to chase mice. My dogs like to chase my cat.")
['I have two dogs and a cat.', 'Do you have pets too?', 'My cat likes to chase mice.', 'My dogs like to chase my cat.']
>>>
>>> from nltk.tokenize import word_tokenize
>>> word_tokenize("I have two dogs and a cat. Do you have pets too? My cat likes to chase mice. My dogs like to chase my cat.")
['I', 'have', 'two', 'dogs', 'and', 'a', 'cat', '.', 'Do', 'you', 'have', 'pets', 'too', '?', 'My', 'cat', 'likes', 'to', 'chase', 'mice', '.', 'My', 'dogs', 'like', 'to', 'chase', 'my', 'cat', '.']
>>>
We might also not want to distinguish between words like "This" (capitalized because it appears at the beginning of a sentence) and "this". We can apply lower() to the whole text, but we might lose proper nouns this way.

Removing stopwords

Stopwords are very common words that carry little information about the theme or meaning of a text (pronouns, prepositions, articles etc.)

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
>>>

Stemming

Suppose we want to count how many times the action of running appears in a text. The verb run may appear in different forms: run, ran, running, runs etc. For this we can use stemming, a process that reduces a word to its root (stem). There are multiple stemming algorithms; we will look at three of them:

  • Porter stemmer
    >>> import nltk
    >>> ps=nltk.PorterStemmer()
    >>> ps.stem("running")
    'run'
    >>> ps.stem("run")
    'run'
    >>> ps.stem("runs")
    'run'
    >>> ps.stem("ran")
    'ran'
    >>> ps.stem("am")
    'am'
    >>> ps.stem("being")
    'be'
    >>> ps.stem("darling")
    'darl'
    >>> ps.stem("ding")
    'ding'
    >>> ps.stem("bring")
    'bring'
    >>> ps.stem("cats")
    'cat'
    >>> ps.stem("Charles")
    'charl'
  • Lancaster stemmer (not recommended, as it often results in overstemming)
    >>> ls=nltk.LancasterStemmer()
    >>> ls.stem("running")
    'run'
    >>> ls.stem("runs")
    'run'
    >>> ls.stem("ran")
    'ran'
    >>> ls.stem("darling")
    'darl'
    >>> ls.stem("are")
    'ar'
    >>> ls.stem("bring")
    'bring'
    >>> ls.stem("being")
    'being'
    >>> ls.stem("Charles")
    'charl'
    >>>
  • Snowball stemmer (also known as Porter2)
    >>> snb=nltk.SnowballStemmer("english")
    >>> snb.stem("running")
    'run'
    >>> snb.stem("runs")
    'run'
    >>> snb.stem("ran")
    'ran'
    >>> snb.stem("darling")
    'darl'
    >>> snb.stem("are")
    'are'
    >>> snb.stem("being")
    'be'
    >>> snb.stem("Charles")
    'charl'
    >>>

Lemmatization

The process of lemmatization returns the dictionary form of a word (its canonical form, or lemma). We will use WordNetLemmatizer.

>>> from nltk.stem import WordNetLemmatizer
>>> lem=WordNetLemmatizer()
>>> lem.lemmatize("runs")
'run'
>>> lem.lemmatize("running")
'running'
>>> lem.lemmatize("ran")
'ran'
>>> lem.lemmatize("are")
'are'
>>> lem.lemmatize("is")
'is'
>>> lem.lemmatize("being")
'being'
>>> lem.lemmatize("darling")
'darling'
>>> lem.lemmatize("Charles")
'Charles'

Notice that if you specify the part of speech for the given word, the results improve greatly.

>>> from nltk.stem import WordNetLemmatizer
>>> lem=WordNetLemmatizer()
>>> lem.lemmatize("running", pos="v")
'run'
>>> lem.lemmatize("are",pos="v")
'be'
>>> lem.lemmatize("is",pos="v")
'be'
>>>

Numeral conversion or removal

Sometimes we want to remove all numerals, as they give no information about the category of the text. At other times we need the numerals in order to programmatically understand the text and store its information in some form of knowledge representation.

>>> from word2number import w2n
>>> w2n.word_to_num("eleven")
11
>>> w2n.word_to_num("twenty three")
23

>>> from num2words import num2words
>>> num2words(12)
'twelve'
>>> num2words(101)
'one hundred and one'
>>> num2words(2020)
'two thousand and twenty'
>>>