NLP - Basic Text Processing

Machine learning algorithms cannot work with raw text directly, so the text must first be converted into numbers (specifically, vectors of numbers).
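
For intuition, this numeric representation can be as simple as counting words. The sketch below is a minimal bag-of-words vectorizer written from scratch for illustration; libraries such as scikit-learn's `CountVectorizer` provide robust versions of this:

```python
# Minimal bag-of-words vectorizer: each document becomes a vector of word
# counts over a shared vocabulary. Illustrative sketch only; real libraries
# also handle tokenization, normalization, and sparse storage.
def bag_of_words(docs):
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for word in doc.lower().split():
            vec[index[word]] += 1
        vectors.append(vec)
    return vocab, vectors

bag_of_words(["a topic sentence", "a supporting sentence"])
# (['a', 'sentence', 'supporting', 'topic'], [[1, 1, 0, 1], [1, 1, 1, 0]])
```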

Pre-Processing in NLP: Pre-processing is one of the most important steps in preparing text documents before any modeling.

The following are the most widely used methods:

Text Normalization refers to a series of tasks that bring text into a uniform form for processing.

  • Case Conversion: convert all text to lower or upper case.
     paragraph = "A paragraph is a brief piece of writing that's around seven to ten sentences long. It has a topic sentence and supporting sentences that all relate closely to the topic sentence. The paragraph form refers to its overall structure, which is a group of sentences focusing on a single topic."

     def paragraph_to_lower_or_upper(paragraph, case):
         if isinstance(paragraph, str) and case in ["upper", "lower"]:
             if case == "lower":
                 return paragraph.lower()
             elif case == "upper":
                 return paragraph.upper()
         else:
             print("Wrong case or data format")

     paragraph_to_lower_or_upper(paragraph, "lower")
     paragraph_to_lower_or_upper(paragraph, "upper")

    
     Output:
     #Lower Case:
     a paragraph is a brief piece of writing that's around seven to ten sentences long. it has a topic sentence and supporting sentences that all relate closely to the topic sentence. the paragraph form refers to its overall structure, which is a group of sentences focusing on a single topic.
    
    
     #Upper Case:
     A PARAGRAPH IS A BRIEF PIECE OF WRITING THAT'S AROUND SEVEN TO TEN SENTENCES LONG. IT HAS A TOPIC SENTENCE AND SUPPORTING SENTENCES THAT ALL RELATE CLOSELY TO THE TOPIC SENTENCE. THE PARAGRAPH FORM REFERS TO ITS OVERALL STRUCTURE, WHICH IS A GROUP OF SENTENCES FOCUSING ON A SINGLE TOPIC.
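
A side note: `str.lower()` is fine for English text, but Python also provides `str.casefold()`, a stricter form of lowercasing intended for caseless matching across languages:

```python
# casefold() is more aggressive than lower(): e.g. the German sharp s "ß"
# folds to "ss", which lower() leaves unchanged.
"Straße".lower()     # 'straße'
"Straße".casefold()  # 'strasse'
```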
    
  • Punctuation removal

    Removing punctuation marks such as . ? ! , ' : ; - from sentences.

     sentence=". Period ? Question Mark ! Exclamation Mark , Comma ' Apostrophe  : Colon ; Semicolon - Dash - Hyphen"
    
     def remove_punctuation(sentence):
         import re
         sentence_clean = re.sub(r'[^\w\s]', '', sentence)
         return sentence_clean
    
     remove_punctuation(sentence)
    	
    
    
     Output:
     ' Period  Question Mark  Exclamation Mark  Comma  Apostrophe   Colon  Semicolon  Dash  Hyphen'
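
Note that the regex above also strips the apostrophe inside words like "that's", producing "thats". An equivalent stdlib approach uses `str.translate` with `string.punctuation` (the function name here is illustrative):

```python
import string

# str.translate drops every character listed in string.punctuation
# (ASCII punctuation only; Unicode punctuation needs the regex approach).
def remove_punctuation_translate(sentence):
    return sentence.translate(str.maketrans('', '', string.punctuation))

remove_punctuation_translate("Hello, world!")  # 'Hello world'
```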
    
  • Stopwords removal: stop words are common words such as 'the', 'a', 'this' that carry little information, so it is usually better to remove them from the text.
     words_list=['It', 'has', 'a', 'topic', 'sentence', 'and', 'supporting', 'sentences', 'that', 'all', 'relate', 'closely', 'to', 'the', 'topic', 'sentence']
    	
     def remove_stopwords(words_list):
         from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')
         words_clean = []
         for word in words_list:
             if word not in stopwords.words('english'):
                 words_clean.append(word)
         return words_clean
    
     remove_stopwords(words_list)
    	
    
    
     Output:
     ['It','topic','sentence','supporting','sentences','relate','closely','topic','sentence']
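
Notice that 'It' survives in the output above: NLTK's English stopword list is all lowercase, so the capitalized form is not matched. Lowercasing each word before the membership test avoids this. The sketch below uses a tiny hand-picked stopword set in place of `stopwords.words('english')` so it stands alone:

```python
# Tiny illustrative stopword set; in practice use stopwords.words('english').
STOPWORDS = {'it', 'has', 'a', 'and', 'that', 'all', 'to', 'the'}

def remove_stopwords_casefolded(words_list):
    # lowercase each word before the membership test so 'It' matches 'it'
    return [w for w in words_list if w.lower() not in STOPWORDS]

remove_stopwords_casefolded(['It', 'has', 'a', 'topic', 'sentence'])
# ['topic', 'sentence']
```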
    

    Tokenization: splitting text or a sequence of characters into smaller chunks called tokens. Types of tokenizers:

  • Sentence tokenizer: splitting a paragraph into individual sentences, roughly at a period followed by a capital letter.

     paragraph="A paragraph is a brief piece of writing that's around seven to ten sentences long. It has a topic sentence and supporting sentences that all relate closely to the topic sentence. The paragraph form refers to its overall structure, which is a group of sentences focusing on a single topic."
    
     def sentence_tokens(paragraph):
         from nltk.tokenize import sent_tokenize  # requires a one-time nltk.download('punkt')
         sentence_tokens_list=sent_tokenize(paragraph)
         return sentence_tokens_list
    	    
     sentence_tokens(paragraph)
    	
    
    
     Output:
     ["A paragraph is a brief piece of writing that's around seven to ten sentences long.",
      'It has a topic sentence and supporting sentences that all relate closely to the topic sentence.',
      'The paragraph form refers to its overall structure, which is a group of sentences focusing on a single topic.']
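
In practice `sent_tokenize` relies on NLTK's pretrained Punkt model rather than a literal period-plus-capital rule. A naive regex splitter (illustrative only) shows why that rule alone is fragile:

```python
import re

# Naive splitter at ". <Capital letter>" -- roughly the rule described above.
def naive_sentence_split(text):
    return re.split(r'(?<=\.)\s+(?=[A-Z])', text)

# "Dr." is wrongly treated as a sentence boundary; Punkt learns to keep
# common abbreviations attached to the following sentence.
naive_sentence_split("Dr. Smith wrote it. He was brief.")
# ['Dr.', 'Smith wrote it.', 'He was brief.']
```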
    
  • Word tokenizer: splitting a sentence into individual words, roughly at spaces, with punctuation split off as separate tokens.
    sentence="A paragraph is a brief piece of writing that's around seven to ten sentences long."
    
    def word_tokens(sentence):
        from nltk.tokenize import word_tokenize
        word_tokens_list=word_tokenize(sentence)
        return word_tokens_list
    
    word_tokens(sentence)
    	
    	
    Output:
    ['A','paragraph', 'is','a','brief','piece','of','writing','that',"'s",'around', 'seven', 'to','ten','sentences','long','.']
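
Notice in the output above that `word_tokenize` splits off punctuation and clitics such as "'s" as their own tokens. A plain `str.split()` only breaks at whitespace:

```python
# str.split() only breaks at whitespace, so punctuation stays glued to words,
# unlike word_tokenize, which emits '.' and "'s" as separate tokens.
sentence = "A paragraph is brief, that's all."
sentence.split()
# ['A', 'paragraph', 'is', 'brief,', "that's", 'all.']
```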
    

Stemming refers to a crude heuristic process that chops off the ends of words to get the base or root word. The output stem is often not a real word.

words_list=['It', 'has', 'a', 'topic', 'sentence', 'and', 'supporting', 'sentences', 'that', 'all', 'relate', 'closely', 'to', 'the', 'topic', 'sentence']

def stem_words(words_list):
    from nltk.stem import LancasterStemmer
    stemmer = LancasterStemmer()
    words_stem = []
    for word in words_list:
        stem = stemmer.stem(word)
        words_stem.append(stem)
    return words_stem
    
stem_words(words_list)


Output:
['it', 'has', 'a', 'top', 'sent', 'and', 'support', 'sent', 'that', 'al', 'rel', 'clos', 'to', 'the', 'top', 'sent']
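
The "chop off the ends" heuristic can be illustrated without NLTK. The toy stemmer below is purely illustrative and far cruder than Lancaster or Porter, which apply many ordered rules plus conditions on what remains after stripping:

```python
# Toy suffix-stripping stemmer: chop one known suffix off the end of a word.
SUFFIXES = ('ing', 'ly', 'es', 's', 'ed')

def toy_stem(word):
    for suffix in SUFFIXES:
        # keep at least 3 characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

[toy_stem(w) for w in ['supporting', 'closely', 'sentences']]
# ['support', 'close', 'sentenc'] -- note the non-word stem 'sentenc'
```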

Lemmatization removes inflectional endings and returns the base form of a word, known as the lemma. Unlike a stem, the output lemma is a real word that can be looked up in a dictionary.

words_list=['It', 'has', 'a', 'topic', 'sentence', 'and', 'supporting', 'sentences', 'that', 'all', 'relate', 'closely', 'to', 'the', 'topic', 'sentence']

def lemmatize_words(words_list):
    from nltk.stem import WordNetLemmatizer  # requires a one-time nltk.download('wordnet')
    lemmatizer = WordNetLemmatizer()
    words_lemma = []
    for word in words_list:
        lemma = lemmatizer.lemmatize(word, pos='v')  # pos='v' treats every word as a verb
        words_lemma.append(lemma)
    return words_lemma

lemmatize_words(words_list)


Output:
['It', 'have', 'a', 'topic', 'sentence', 'and', 'support', 'sentence', 'that', 'all', 'relate', 'closely', 'to', 'the', 'topic', 'sentence']
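
Passing `pos='v'` tells the lemmatizer to treat each word as a verb, which is why 'has' becomes 'have' above while 'It', unknown as a verb, is returned unchanged. Under the hood this is essentially dictionary lookup keyed by word and part of speech; the toy table below (illustrative entries, not WordNet data) mimics the mechanism:

```python
# Toy lemma table keyed by (word, pos), mimicking how WordNet's lookup is
# guided by the part-of-speech tag. Entries are illustrative, not WordNet data.
LEMMAS = {('has', 'v'): 'have',
          ('supporting', 'v'): 'support',
          ('sentences', 'n'): 'sentence'}

def toy_lemmatize(word, pos='n'):
    # fall back to the word itself when no lemma is known, as the real
    # WordNetLemmatizer does for out-of-vocabulary words
    return LEMMAS.get((word.lower(), pos), word)

toy_lemmatize('has', pos='v')  # 'have'
toy_lemmatize('has')           # 'has' (no noun entry applies)
```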