In this article by Deepti Chopra, Nisheeth Joshi, and Iti Mathur, authors of the book Mastering Natural Language Processing with Python, morphology is defined as the study of the composition of words in terms of morphemes. A morpheme is the smallest unit of a language that carries meaning. In this article, we will discuss stemming and lemmatization, creating a stemmer and lemmatizer for non-English languages, developing a morphological analyzer and a morphological generator using machine learning tools, creating a search engine, and many other related concepts.
Morphology may be defined as the study of how tokens are produced from morphemes. A morpheme is the basic unit of language that carries meaning. There are two types of morphemes: stems and affixes (suffixes, prefixes, infixes, and circumfixes). Stems are also referred to as free morphemes, since they can exist without any affixes. Affixes are referred to as bound morphemes, since they cannot exist in free form and always occur along with free morphemes. Consider the word "unbelievable". Here, "believe" is a stem, or free morpheme: it can exist on its own. The morphemes "un" and "able" are affixes, or bound morphemes: they cannot exist in free form, but only together with a stem.
There are three kinds of languages, namely isolating languages, agglutinative languages, and inflecting languages, and morphology behaves differently in each. Isolating languages are those in which words are merely free morphemes and carry no tense (past, present, or future) or number (singular or plural) information; Mandarin Chinese is an example. Agglutinative languages are those in which small words combine to convey compound information; Turkish is an example. Inflecting languages are those in which words break down into simpler units that each exhibit a different meaning; Latin is an example.
There are morphological processes such as inflection, derivation, semi-affixation, combining forms, and cliticization. Inflection transforms a word into a form that expresses person, number, tense, gender, case, aspect, or mood; here, the syntactic category of the token remains the same. In derivation, the syntactic category of the word changes as well. Semi-affixes are bound morphemes that exhibit a word-like quality, for example, noteworthy, antisocial, and anticlockwise.
Stemming may be defined as the process of obtaining a stem from a word by eliminating its affixes. For example, for the word "raining", a stemmer returns the root word or stem "rain" by removing the affix "ing". To increase the accuracy of information retrieval, search engines mostly use stemming to obtain stems and store them as index words; words that share the same stem are treated as synonyms, a kind of query expansion known as conflation. Martin Porter designed the well-known Porter Stemming Algorithm, which is essentially designed to replace and eliminate well-known suffixes in English words. To perform stemming in NLTK, we can simply instantiate the PorterStemmer class and then call its stem() method.
Let's take a look at the code for stemming using the PorterStemmer class in NLTK:
>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmerporter = PorterStemmer()
>>> stemmerporter.stem('working')
'work'
>>> stemmerporter.stem('happiness')
'happi'
The PorterStemmer class is trained on and has knowledge of many stems and word forms of the English language. Stemming takes place in a series of steps that transform a word into a shorter word whose meaning is similar to that of the root word. The StemmerI interface defines the stem() method, and all stemmers inherit from this interface.
Another stemming algorithm, known as the Lancaster stemming algorithm, was developed at Lancaster University. Similar to the PorterStemmer class, the LancasterStemmer class is used in NLTK to implement Lancaster stemming.
Let's consider the following code, which depicts Lancaster stemming in NLTK:
>>> import nltk
>>> from nltk.stem import LancasterStemmer
>>> stemmerlan=LancasterStemmer()
>>> stemmerlan.stem('working')
'work'
>>> stemmerlan.stem('happiness')
'happy'
We can also build our own stemmer in NLTK using RegexpStemmer. It accepts a regular expression and removes any prefix or suffix of a word that matches it.
Let's consider an example of stemming using RegexpStemmer in NLTK:
>>> import nltk
>>> from nltk.stem import RegexpStemmer
>>> stemmerregexp=RegexpStemmer('ing')
>>> stemmerregexp.stem('working')
'work'
>>> stemmerregexp.stem('happiness')
'happiness'
>>> stemmerregexp.stem('pairing')
'pair'
We can use RegexpStemmer in cases where PorterStemmer and LancasterStemmer do not produce the desired stems.
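For instance, here is a minimal sketch of a custom regex stemmer; the suffix pattern and the min length used here are illustrative choices, not values from the book:
>>> from nltk.stem import RegexpStemmer
>>> customstemmer = RegexpStemmer('ing$|s$|ness$', min=5)
>>> customstemmer.stem('darkness')
'dark'
>>> customstemmer.stem('king')
'king'
The min argument keeps words shorter than five characters, such as 'king', from being mangled by the suffix pattern.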
The SnowballStemmer class is used to perform stemming in 13 languages other than English. To perform stemming with SnowballStemmer, an instance is first created for the language in which stemming needs to be performed, and then stemming is performed using the stem() method.
Consider the following example to perform stemming in Spanish and French in NLTK using SnowballStemmer:
>>> import nltk
>>> from nltk.stem import SnowballStemmer
>>> SnowballStemmer.languages
('danish', 'dutch', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish')
>>> spanishstemmer=SnowballStemmer('spanish')
>>> spanishstemmer.stem('comiendo')
'com'
>>> frenchstemmer=SnowballStemmer('french')
>>> frenchstemmer.stem('manger')
'mang'
The nltk.stem.api module contains the StemmerI class, which defines the stem() method.
Consider the following code present in NLTK, which enables stemming to be performed:
class StemmerI(object):
    """
    An interface that removes morphological affixes from tokens; this
    process is known as stemming.
    """

    def stem(self, token):
        """
        Remove affixes from the token and return the stem.
        """
        raise NotImplementedError()
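Because every stemmer inherits from StemmerI, we can also plug in our own. Here is a minimal sketch of a custom stemmer implementing this interface; the SuffixStemmer class and its single-suffix behavior are hypothetical, for illustration only:
>>> from nltk.stem.api import StemmerI
>>> class SuffixStemmer(StemmerI):
        # A toy stemmer that strips one hard-coded suffix
        def __init__(self, suffix):
            self.suffix = suffix
        def stem(self, token):
            if token.endswith(self.suffix):
                return token[:-len(self.suffix)]
            return token
>>> SuffixStemmer('ing').stem('talking')
'talk'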
Here's the code used to perform stemming using multiple stemmers:
>>> import nltk
>>> from nltk.stem.porter import PorterStemmer
>>> from nltk.stem.lancaster import LancasterStemmer
>>> from nltk.stem import SnowballStemmer
>>> def obtain_tokens():
        with open('/home/p/NLTK/sample1.txt') as stem:
            tok = nltk.word_tokenize(stem.read())
        return tok
>>> def stemming(filtered):
        stem = []
        for x in filtered:
            stem.append(PorterStemmer().stem(x))
        return stem
>>> if __name__ == "__main__":
        tok = obtain_tokens()
        print("tokens is %s" % tok)
        stem_tokens = stemming(tok)
        print("After stemming is %s" % stem_tokens)
        res = dict(zip(tok, stem_tokens))
        print("{tok:stemmed}=%s" % res)
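Since the stemmers can disagree on the same input, a quick side-by-side comparison like the following sketch can help decide which one to use (the exact output depends on the word, so none is shown here):
>>> from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer
>>> porter = PorterStemmer()
>>> lancaster = LancasterStemmer()
>>> snowball = SnowballStemmer('english')
>>> for w in ['generously', 'running', 'happiness']:
        print(porter.stem(w), lancaster.stem(w), snowball.stem(w))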
Lemmatization is the process of transforming a word into its lemma, that is, its base or dictionary form. Unlike a stem, the lemma is always a valid word, and it may differ considerably from the original surface form.
Consider an example of lemmatization in NLTK:
>>> import nltk
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer_output=WordNetLemmatizer()
>>> lemmatizer_output.lemmatize('working')
'working'
>>> lemmatizer_output.lemmatize('working',pos='v')
'work'
>>> lemmatizer_output.lemmatize('works')
'work'
WordNetLemmatizer may be defined as a wrapper around the WordNet corpus; it uses the morphy() function from WordNetCorpusReader to extract a lemma. If no lemma is found, the word is returned in its original form. For example, for 'works', the lemma returned is the singular form 'work'.
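We can also call morphy() directly through the WordNet corpus reader to see this behavior; a small check, assuming the WordNet corpus has been downloaded:
>>> from nltk.corpus import wordnet
>>> wordnet.morphy('works')
'work'
>>> wordnet.morphy('working', wordnet.VERB)
'work'
>>> print(wordnet.morphy('qwerty'))
None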
This code snippet illustrates the difference between stemming and lemmatization:
>>> import nltk
>>> from nltk.stem import PorterStemmer
>>> stemmer_output=PorterStemmer()
>>> stemmer_output.stem('happiness')
'happi'
>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer_output=WordNetLemmatizer()
>>> lemmatizer_output.lemmatize('happiness')
'happiness'
In the preceding code, 'happiness' is converted to 'happi' by stemming. Lemmatization cannot find a root word for 'happiness', so it returns the word 'happiness' unchanged.
Polyglot is a library that provides models, called Morfessor models, which are used to obtain morphemes from tokens. The goal of the Morpho project is to create unsupervised, data-driven methods focused on the discovery of morphemes, the smallest meaning-bearing units of language. Morphemes play an important role in natural language processing; they are useful in the automatic recognition and generation of language. Polyglot's Morfessor models were trained on the 50,000 most frequent tokens of each language, taken from Polyglot's vocabulary dictionaries.
Here's the code to obtain the supported-language table from Polyglot:
from polyglot.downloader import downloader
print(downloader.supported_languages_table("morph2"))
The preceding code produces the following list of supported languages:
1. Piedmontese language 2. Lombard language 3. Gan Chinese
4. Sicilian 5. Scots 6. Kirghiz, Kyrgyz
7. Pashto, Pushto 8. Kurdish 9. Portuguese
10. Kannada 11. Korean 12. Khmer
13. Kazakh 14. Ilokano 15. Polish
16. Panjabi, Punjabi 17. Georgian 18. Chuvash
19. Alemannic 20. Czech 21. Welsh
22. Chechen 23. Catalan; Valencian 24. Northern Sami
25. Sanskrit (Saṁskṛta) 26. Slovene 27. Javanese
28. Slovak 29. Bosnian-Croatian-Serbian 30. Bavarian
31. Swedish 32. Swahili 33. Sundanese
34. Serbian 35. Albanian 36. Japanese
37. Western Frisian 38. French 39. Finnish
40. Upper Sorbian 41. Faroese 42. Persian
43. Sinhala, Sinhalese 44. Italian 45. Amharic
46. Aragonese 47. Volapük 48. Icelandic
49. Sakha 50. Afrikaans 51. Indonesian
52. Interlingua 53. Azerbaijani 54. Ido
55. Arabic 56. Assamese 57. Yoruba
58. Yiddish 59. Waray-Waray 60. Croatian
61. Hungarian 62. Haitian; Haitian Creole 63. Quechua
64. Armenian 65. Hebrew (modern) 66. Silesian
67. Hindi 68. Divehi; Dhivehi; Mald... 69. German
70. Danish 71. Occitan 72. Tagalog
73. Turkmen 74. Thai 75. Tajik
76. Greek, Modern 77. Telugu 78. Tamil
79. Oriya 80. Ossetian, Ossetic 81. Tatar
82. Turkish 83. Kapampangan 84. Venetian
85. Manx 86. Gujarati 87. Galician
88. Irish 89. Scottish Gaelic; Gaelic 90. Nepali
91. Cebuano 92. Zazaki 93. Walloon
94. Dutch 95. Norwegian 96. Norwegian Nynorsk
97. West Flemish 98. Chinese 99. Bosnian
100. Breton 101. Belarusian 102. Bulgarian
103. Bashkir 104. Egyptian Arabic 105. Tibetan Standard, Tib...
106. Bengali 107. Burmese 108. Romansh
109. Marathi (Marāthī) 110. Malay 111. Maltese
112. Russian 113. Macedonian 114. Malayalam
115. Mongolian 116. Malagasy 117. Vietnamese
118. Spanish; Castilian 119. Estonian 120. Basque
121. Bishnupriya Manipuri 122. Asturian 123. English
124. Esperanto 125. Luxembourgish, Letzeb... 126. Latin
127. Uighur, Uyghur 128. Ukrainian 129. Limburgish, Limburgan...
130. Latvian 131. Urdu 132. Lithuanian
133. Fiji Hindi 134. Uzbek 135. Romanian, Moldavian, ...
The necessary models can be downloaded using the following code:
%%bash
polyglot download morph2.en morph2.ar
[polyglot_data] Downloading package morph2.en to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data] /home/rmyeid/polyglot_data...
[polyglot_data] Package morph2.ar is already up-to-date!
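If you prefer to stay inside Python rather than use the shell, the same packages can be fetched through Polyglot's downloader API; a sketch, reusing the morph2.en and morph2.ar package names shown above:
from polyglot.downloader import downloader
# Download the English and Arabic Morfessor models
downloader.download("morph2.en")
downloader.download("morph2.ar")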
Consider this example, which obtains morphemes using Polyglot:
from polyglot.text import Text, Word
tokens = ["unconditional", "precooked", "impossible", "painful", "entered"]
for s in tokens:
    s = Word(s, language="en")
    print("{:<20}{}".format(s, s.morphemes))
unconditional ['un','conditional']
precooked ['pre','cook','ed']
impossible ['im','possible']
painful ['pain','ful']
entered ['enter','ed']
If tokenization has not been performed properly, we can use morphological analysis to split the text back into its original constituents:
sent="Ihopeyoufindthebookinteresting"
para=Text(sent)
para.language="en"
para.morphemes
WordList(['I','hope','you','find','the','book','interesting'])
Morphological analysis may be defined as the process of obtaining grammatical information about a token from its suffix information. It can be performed in three ways: morpheme-based morphology (the item-and-arrangement approach), lexeme-based morphology (the item-and-process approach), and word-based morphology (the word-and-paradigm approach). A morphological analyzer is a program that analyzes the morphology of a given input token: it takes the token and generates morphological information, such as gender, number, and class, as output.
In order to perform morphological analysis on a given non-whitespace token, the PyEnchant dictionary can be used.
Consider the following code that performs morphological analysis:
>>> import enchant
>>> s = enchant.Dict("en_US")
>>> tok=[]
>>> def tokenize(st1):
        if not st1:
            return
        for j in range(len(st1), -1, -1):
            if s.check(st1[0:j]):
                tok.append(st1[0:j])
                st1 = st1[j:]
                tokenize(st1)
                break
>>> tokenize("itismyfavouritebook")
>>> tok
['it', 'is', 'my','favourite','book']
>>> tok=[ ]
>>> tokenize("ihopeyoufindthebookinteresting")
>>> tok
['i','hope','you','find','the','book','interesting']
The category of a word can be determined from hints at several levels, such as its morphology, its syntactic context, and its semantics.
Omorfi (open morphology for Finnish) is a package licensed under version 3 of the GNU GPL. It is used to perform numerous tasks, such as language modeling, morphological analysis, rule-based machine translation, information retrieval, statistical machine translation, morphological segmentation, ontologies, and spell checking and correction.
A morphological generator is a program that performs the task of morphological generation. Morphological generation may be considered the opposite of morphological analysis: given the description of a word in terms of its number, category, stem, and so on, the original word is produced. For example, if root = go, part of speech = verb, tense = present, and the verb occurs with a third person singular subject, then a morphological generator would generate its surface form, goes.
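As a rough illustration of the idea (not a real system), a generator for this one English pattern might look like the following sketch; the generate() function and its rules are hypothetical and handle only third person singular present-tense verbs:
def generate(root, pos, tense, person, number):
    # Hypothetical toy generator: covers only verbs in the
    # third person singular present tense
    if pos == 'verb' and tense == 'present' and person == 3 and number == 'singular':
        # English orthography doubles to "-es" after sibilants, x, and o
        if root.endswith(('s', 'sh', 'ch', 'x', 'o')):
            return root + 'es'
        return root + 's'
    return root

print(generate('go', 'verb', 'present', 3, 'singular'))   # prints 'goes'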
There are many Python-based packages that perform morphological analysis and generation; one of them is described next.
PyStemmer 1.0.1 provides the Snowball stemming algorithms, which are well suited to information retrieval tasks and the construction of a search engine. It includes the Porter stemming algorithm and many other stemming algorithms that are useful for stemming and information retrieval in many languages, including many European languages.
We can construct a vector space search engine by converting the texts into vectors.
Here are the steps needed to construct a vector space search engine:
A stemmer is a program that accepts words and converts them into stems. Tokens that have the same stem usually have related meanings. Stop words are also eliminated from the text.
Consider the following code for the removal of stop words and tokenization:
def eliminatestopwords(self, list):
    """
    Eliminate words which occur often and carry little significance
    from the context point of view.
    """
    return [word for word in list if word not in self.stopwords]

def tokenize(self, string):
    """
    Perform the task of splitting text into tokens and removing stop words.
    """
    string = self.clean(string)
    words = string.split(" ")
    return [self.stemmer.stem(word, 0, len(word) - 1) for word in words]
def obtainvectorkeywordindex(self, documentList):
    """
    In the document vectors, generate the keyword for the given
    position of the element.
    """
    # Map the documents into a single string
    vocabstring = " ".join(documentList)
    vocablist = self.parser.tokenize(vocabstring)
    # Eliminate common words that have no search significance
    vocablist = self.parser.eliminatestopwords(vocablist)
    uniqueVocablist = util.removeDuplicates(vocablist)
    vectorIndex = {}
    offset = 0
    # Attach a position to each keyword; it maps to the dimension
    # that is used to depict this token
    for word in uniqueVocablist:
        vectorIndex[word] = offset
        offset += 1
    return vectorIndex  # (keyword:position)
def constructVector(self, wordString):
    # Initialise the vector with 0's
    vector = [0] * len(self.vectorKeywordIndex)
    tokList = self.parser.tokenize(wordString)
    tokList = self.parser.eliminatestopwords(tokList)
    for word in tokList:
        # A simple term-count model is used
        vector[self.vectorKeywordIndex[word]] += 1
    return vector
By finding the cosine of the angle between the vectors of two documents, we can determine whether they are similar. If the cosine value is 1, the angle is 0 degrees and the vectors are parallel (the documents are related). If the cosine value is 0, the angle is 90 degrees and the vectors are perpendicular (the documents are not related).
This is the code to compute the cosine between text vectors (dot and norm come from NumPy, on which SciPy is built):
from numpy import dot
from numpy.linalg import norm

def cosine(vec1, vec2):
    """
    cosine = ( X * Y ) / ( ||X|| * ||Y|| )
    """
    return float(dot(vec1, vec2) / (norm(vec1) * norm(vec2)))
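A quick sanity check of this function on toy vectors, assuming NumPy is installed:
>>> cosine([1, 0], [1, 0])
1.0
>>> cosine([1, 0], [0, 1])
0.0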
We map the keywords to the vector space: we construct a temporary text that represents the items to be searched and then compare it with the document vectors using the cosine measurement.
Here is the code needed to search the vector space:
def searching(self, searchinglist):
    """
    Search for documents that match, based on a list of search items.
    """
    askVector = self.buildQueryVector(searchinglist)
    ratings = [util.cosine(askVector, textVector) for textVector in self.documentVectors]
    ratings.sort(reverse=True)
    return ratings
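The buildQueryVector() helper used above is not shown in this excerpt; a minimal sketch of it, assuming the parser and the constructVector() method defined earlier, could treat the query terms as one small document:
def buildQueryVector(self, termList):
    # Treat the search terms as a tiny document and vectorize it
    # with the same constructVector() method used for documents
    query = self.constructVector(" ".join(termList))
    return query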
The following code can be used to detect languages from a source text:
>>> import nltk
>>> import sys
>>> try:
        from nltk import wordpunct_tokenize
        from nltk.corpus import stopwords
    except ImportError:
        print('Error has occurred')
#----------------------------------------------------------------------
>>> def _calculate_languages_ratios(text):
        """
        Compute a score for each language the given text could be written
        in and return a dictionary that looks like
        {'german': 2, 'french': 4, 'english': 1}.
        """
        languages_ratios = {}
        '''
        nltk.wordpunct_tokenize() splits all punctuation into separate tokens:
        wordpunct_tokenize("I hope you like the book interesting .")
        ['I', 'hope', 'you', 'like', 'the', 'book', 'interesting', '.']
        '''
        tok = wordpunct_tokenize(text)
        words = [word.lower() for word in tok]
        # Count the unique stop words of each language that occur in the text
        for language in stopwords.fileids():
            stopwords_set = set(stopwords.words(language))
            words_set = set(words)
            common_elements = words_set.intersection(stopwords_set)
            languages_ratios[language] = len(common_elements)  # language "score"
        return languages_ratios
#----------------------------------------------------------------
>>> def detect_language(text):
        """
        Score the given text against each language and return the
        highest-scoring one. It uses a stop word based approach: the
        language whose unique stop words appear most often in the
        analyzed text wins.
        """
        ratios = _calculate_languages_ratios(text)
        most_rated_language = max(ratios, key=ratios.get)
        return most_rated_language
if __name__=='__main__':
text = '''
All over this cosmos, most of the people believe that there is an invisible supreme power that is the creator and the runner of this world. Human being is supposed to be the most intelligent and loved creation by that power and that is being searched by human beings in different ways into different things. As a result people reveal His assumed form as per their own perceptions and beliefs. It has given birth to different religions and people are divided on the name of religion viz. Hindu, Muslim, Sikhs, Christian etc. People do not stop at this. They debate the superiority of one over the other and fight to establish their views. Shrewd people like politicians oppose and support them at their own convenience to divide them and control them. It has intensified to the extent that even parents of a
new born baby teach it about religious differences and recommend their own religion superior to that of others and let the child learn to hate other people just because of religion. Jonathan Swift, an eighteenth century novelist, observes that we have just enough religion to make us hate, but not enough to make us love one another.
The word 'religion' does not have a derogatory meaning - A literal meaning of religion is 'A
personal or institutionalized system grounded in belief in a God or Gods and the activities connected
with this'. At its basic level, 'religion is just a set of teachings that tells people how to lead a good
life'. It has never been the purpose of religion to divide people into groups of isolated followers that
cannot live in harmony together. No religion claims to teach intolerance or even instructs its believers to segregate a certain religious group or even take the fundamental rights of an individual solely based on their religious choices. It is also said that 'Majhab nhi sikhata aaps mai bair krna'. But this very majhab or religion takes a very heinous form when it is misused by the shrewd politicians and the fanatics e.g. in Ayodhya on 6th December, 1992 some right wing political parties
and communal organizations incited the Hindus to demolish the 16th century Babri Masjid in the
name of religion to polarize Hindus votes. Muslim fanatics in Bangladesh retaliated and destroyed a
number of temples, assassinated innocent Hindus and raped Hindu girls who had nothing to do with
the demolition of Babri Masjid. This very inhuman act has been presented by Taslima Nasrin, a Bangladeshi doctor-cum-writer, in her controversial novel 'Lajja' (1993), in which she seems to utilize fiction's mass emotional appeal, rather than its potential for nuance and universality.
'''
>>> language = detect_language(text)
>>> print(language)
The preceding code counts the stop words of each language that appear in the text and detects its language, which here is English.
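Assuming the NLTK stopwords corpus is installed, the same function should pick out other languages as well; for example, a French sentence such as the following would be expected to score highest for French:
>>> detect_language("Je voudrais un livre, s'il vous plaît. Merci beaucoup pour le café.")
'french'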
In this article, we discussed stemming, lemmatization, and morphological analysis and generation.