Mahima Gupta
6 min read · Nov 8, 2020


Natural Language Processing and Text-Preprocessing

What is Text Preprocessing?

In any machine learning task, cleaning or preprocessing the data is as important as model building, if not more so. And when it comes to unstructured data like text, this step matters even more, because machine learning algorithms cannot take unstructured data as input. To preprocess your text simply means to bring your text into a form that is predictable and analyzable for your task. Text preprocessing is mainly used in NLP. NLP is a subfield of computer science and artificial intelligence concerned with interactions between computers and human (natural) languages.

Steps of Text Preprocessing:

  • Tokenization
  • Noise Removal
  • Removal of Stopwords
  • Removal of Frequent words
  • Removal of Rare words
  • Stemming
  • Lemmatization
  • Removal of HTML tags
  • Spelling correction
  • Lower casing

What is the NLTK toolkit for Python?

NLTK is a free, open-source, community-driven project and a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to many corpora and lexical resources. It also contains text processing libraries for tokenization, stemming, lemmatization, POS tagging, parsing, classification, and building bag-of-words representations.

To use the NLTK toolkit, first install it with pip, then open a Jupyter notebook and import it:

import nltk

# download all nltk resources
nltk.download('all')
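Downloading 'all' is quite large; a lighter alternative (my suggestion, not from the article) is to fetch only the resources this walkthrough actually uses:

# download only the resources needed below
nltk.download('punkt')                       # sentence/word tokenizers
nltk.download('stopwords')                   # stop word lists
nltk.download('wordnet')                     # lemmatizer data
nltk.download('gutenberg')                   # corpus used below
nltk.download('averaged_perceptron_tagger')  # POS tagger
nltk.download('universal_tagset')            # universal tag mapping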


Accessing Text Corpora and Lexical Resources

Practical work in Natural Language Processing typically uses large bodies of linguistic data, or corpora. A corpus is simply a collection of texts or documents, and NLTK provides many corpora, grouped under names such as Brown, Gutenberg, etc.

# load the gutenberg corpus: a sample of texts from the Project Gutenberg archive
from nltk.corpus import gutenberg

We can access the corpus as a list of words or a list of sentences (where each sentence is itself just a list of words). We can optionally specify particular categories or files to read:

# list the files available in the corpus
gutenberg.fileids()

# the raw() method returns the full contents of a file as a single string
emma = gutenberg.raw(fileids=['austen-emma.txt'])
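The corpus can also be read directly as a list of words or a list of sentences, as mentioned above; a quick sketch (the variable names here are mine):

# words() gives a flat list of tokens; sents() gives a list of
# sentences, each of which is itself a list of words
emma_word_list = gutenberg.words('austen-emma.txt')
emma_sent_list = gutenberg.sents('austen-emma.txt')
print(len(emma_word_list), len(emma_sent_list))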

TOKENIZATION:

Tokenization generally means taking a document or a collection of sentences and breaking it down into its component sentences or words.

It is of two types:

Sentence Tokenization: document to sentences

Word Tokenization: sentence to words

1. Sentence Tokenization: There are many approaches to tokenization; we follow the default and recommended one, using the sent_tokenize function:

# this is the default and recommended sentence tokenizer
# it is pre-trained on several languages and works well
# on languages besides English
emma_sents = nltk.sent_tokenize(emma)

2. Word Tokenization: Again there are many approaches; here we use the default and recommended word_tokenize function:

emma_words = nltk.word_tokenize(emma)
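A quick sanity check on the two tokenizers (a sketch using the variables above):

# how many sentences and word tokens the tokenizers produced
print(len(emma_sents), len(emma_words))
# peek at the first few word tokens
print(emma_words[:10])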

Stemming:

For grammatical reasons, documents contain different forms of a word, such as read, reading, and reads. We also have related words with a similar meaning, such as nation, national, and nationality. New words are created from root words by attaching affixes (a process called inflection); stemming is the reverse process, finding the root word (the stem) of any given word. It is useful for classifying and clustering text, and even for information retrieval.

NLTK has many interfaces for stemming, but we use the recommended and most popular one, the Porter stemmer. It applies phase-wise rules to reduce inflected words to their stems.

from nltk.stem import PorterStemmer

# create an instance of the class
pstem = PorterStemmer()

NLTK also has another stemmer of this type, the Lancaster stemmer.
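A small usage sketch comparing the two (the example words are mine; the values in the comments are what these stemmers typically produce):

from nltk.stem import LancasterStemmer

lstem = LancasterStemmer()

# Porter is relatively conservative: 'reading' -> 'read', 'nation' -> 'nation'
print(pstem.stem('reading'), pstem.stem('nation'))
# Lancaster is more aggressive and can over-stem: 'nation' -> 'nat'
print(lstem.stem('reading'), lstem.stem('nation'))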

Lemmatization:

Lemmatization is similar to stemming. The only difference is that in stemming, the stem or base word may not always be a lexicographically correct word, i.e. it may not be present in the dictionary; in lemmatization, the root or base word (the lemma) is always present in the dictionary. The lemma is formed by removing affixes only if the resulting root word is present in the dictionary.

from nltk.stem import WordNetLemmatizer

wn1 = WordNetLemmatizer()
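A brief usage sketch (the example words are mine); note that WordNetLemmatizer treats words as nouns unless a part of speech is passed:

# default POS is noun: 'cars' -> 'car'
print(wn1.lemmatize('cars'))
# pass pos='v' for verbs: 'running' -> 'run', 'ate' -> 'eat'
print(wn1.lemmatize('running', pos='v'))
print(wn1.lemmatize('ate', pos='v'))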

Removal of Stopwords:

Stop words are words which are filtered out before or after processing of text. When applying machine learning to text, these words can add a lot of noise, which is why we want to remove these irrelevant words.

Stop words usually refer to the most common words such as “and”, “the”, “a” in a language, but there is no single universal list of stopwords. The list of the stop words can change depending on your application. The nltk tool has a predefined list of stopwords that refers to the most common words.

from nltk.corpus import stopwords

print(stopwords.words('english'))

To remove stopwords from our text, we define a helper function:

def remove_stop_words(tokens):
    # use the English stopword list
    seq_w = nltk.corpus.stopwords.words('english')
    token_f = [tn for tn in tokens if tn not in seq_w]
    return token_f
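Applying it to the word tokens (a sketch; lower-casing first helps, since NLTK's stopword list is lower case):

# remove stopwords from the Emma word tokens
emma_words_f = remove_stop_words([tn.lower() for tn in emma_words])
print(len(emma_words), len(emma_words_f))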

Before removing stopwords, the word count was 191785; after applying remove_stop_words it is 112632. See the difference.

Correcting spelling:

Sometimes when someone is writing, they use the wrong word, make spelling mistakes, or repeat words and characters. So we need to check the tokens and remove such stray characters.
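The article does not include code for this step; a minimal sketch of one common approach, collapsing runs of repeated characters with a regular expression (the function name and exact behavior are my assumptions, not the author's):

import re

def remove_repeated_chars(token):
    # collapse runs of 3+ identical characters to a single one,
    # e.g. 'finalllly' -> 'finaly' (a spell checker can finish the job)
    return re.sub(r'(.)\1{2,}', r'\1', token)

print(remove_repeated_chars('soooo'))  # -> 'so'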

POS tagging:

POS tagging labels each token with its main part of speech: noun, adjective, adverb, verb, and so on. NLTK's recommended pos_tag() function uses the Penn Treebank notation by default, which contains the most widely used POS tag set; here we map the tags to the simpler universal tag set.

# define a function to tokenize text into tokens
def tokenize_text(text):
    tokens = nltk.word_tokenize(text)
    tokens = [tn.strip() for tn in tokens]
    return tokens

# tokenize the first sentence using the tokenize_text() function
emma_tokens = tokenize_text(emma_sents[0])

# annotate the tokens with POS tags using the pos_tag() function
emma_tokens_t = nltk.pos_tag(emma_tokens, tagset='universal')
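pos_tag() returns a list of (token, tag) tuples; a quick look at the shape of the output (the exact values depend on the sentence):

# each element pairs a token with its universal POS tag,
# in the form ('word', 'TAG'), e.g. ('Emma', 'NOUN')
print(emma_tokens_t[:5])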

Feature extraction:

Machine learning algorithms cannot work with raw text directly; we need to convert the text into vectors of numbers. This is called feature extraction. For feature extraction we need a vector space model (term vector model): each text document is transformed into and represented as a numeric vector, in which each specific term is a vector dimension.

Bag of words model:

After the cleaning steps are done, this representation of the text is called the bag of words model. It is a popular and simple feature extraction technique that describes the occurrence of each word within a document while ignoring word order. The counts are stored in a matrix because we need to store them for multiple documents.

Document-term matrix:

We prepare the document-term matrix as follows (a code sketch follows the list):

  • each row is a different document
  • each column is a different term
  • each value is a word count
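A minimal sketch of building such a matrix with scikit-learn's CountVectorizer (scikit-learn is my choice here; the article's notebook may use a different tool, and the toy documents are mine):

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    'the cat sat on the mat',
    'the dog sat on the log',
]

cv = CountVectorizer()
dtm = cv.fit_transform(docs)  # sparse document-term matrix

# columns are terms, rows are documents, values are word counts
print(cv.get_feature_names_out())  # requires scikit-learn >= 1.0
print(dtm.toarray())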

After converting text to this vector model, you can apply any machine learning algorithm, or an RNN, for text classification, text generation, topic modeling, or sentiment analysis.

Conclusion:

Machine learning algorithms cannot take unstructured data as input, so we first convert unstructured data into structured data using the text-preprocessing steps above. You have learned the basics of NLP for text. NLP is what lets us apply machine learning and deep learning algorithms to text and speech.

Notebook link on GitHub: https://github.com/mahima-c/Text-Mining/blob/master/text%20mining%20modeling.ipynb
