According to industry estimates, only about 21% of the available data today exists in a structured format. Most text data generated from social media, books, newspapers, emails, and conversations is unstructured. Text analysis techniques are therefore essential for obtaining high-quality information and actionable insights from text data.
The following is a concise definition of Text Mining (from Linguamatics.com): Text mining (also referred to as text analysis) is an artificial intelligence technology that uses Natural Language Processing to transform the unstructured text in documents and databases into normalized, structured data suitable for analysis or to drive machine learning algorithms. This raises a question: what is the relationship between natural language processing and text mining? Let’s dive into it in the following paragraphs.
Text Mining and Natural Language Processing (NLP)
I introduced Natural Language Processing (NLP), which is a combination of machine learning and linguistics, in one of my previous journals, “Sentiment Analysis for text with Google Natural Language processing.” With NLP, users can extract information about people, places, and events, and better understand social media sentiment and customer conversations. NLP can be used to solve a wide range of tasks, including automatic summarization, machine translation, and topic segmentation.
Text mining (or text analysis) is the process of deriving meaningful information from natural language text. It involves structuring the input text, deriving patterns within the structured data, and finally evaluating and interpreting the output. In other words, text mining derives high-quality information from text through the application of NLP.
Main Functions for Text Mining
There are five main functions for text mining that will be introduced in this journal: Tokenization, Stemming, Lemmatization, Stop Words, and Syntax. All of these functions can be implemented using NLTK (Natural Language Toolkit), a leading platform for building Python programs that work with human language data.
Tokenization is the first step in NLP; it breaks strings into tokens. Tokens can be words, numbers, or punctuation, but in text mining/text analytics, tokens are most commonly just words. The three steps for tokenization are: breaking a complex sentence into words; understanding the importance of each word with respect to the sentence; and producing a structural description of the input sentence.
Let’s use the following sentence as an example: ‘Tokenization is the first step in NLP.’ Using tokenization, this sentence can be divided into seven word tokens: ‘Tokenization’, ‘is’, ‘the’, ‘first’, ‘step’, ‘in’, and ‘NLP’. If punctuation is kept, the closing period counts as an eighth token.
We can implement tokenization using NLTK, which also supports tokenizing phrases that contain more than one word. Below is a demo of tokenization using NLTK on a sentence labeled “s1.” The Python code is written in a Jupyter Notebook, an open-source web application for creating and sharing documents that contain live code, equations, visualizations, and text.
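As a minimal sketch of that demo, the snippet below uses NLTK's `wordpunct_tokenize`, chosen here because it works without downloading extra data (the more commonly used `word_tokenize` needs the Punkt models from `nltk.download`). The sentence assigned to `s1` is a made-up stand-in, not the sentence from the original notebook, and `bigrams` is just one way to form two-word phrases.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.util import bigrams

# Hypothetical stand-in for the sentence "s1" (not the original notebook text).
s1 = "It is a great day, and it was a great idea."

# Split the string into word tokens and punctuation tokens.
tokens = wordpunct_tokenize(s1)
print(tokens)

# Pair each token with its neighbour to get two-word phrases (bigrams).
phrase_pairs = list(bigrams(tokens))
print(phrase_pairs[:3])
```

Note that the comma and the period come back as tokens of their own; filter with `str.isalpha` if only words are wanted.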
After the text in s1 is split into tokens, we can import the FreqDist class from NLTK to find the frequency distribution of all tokens in the text. Below is an example of this in action in Jupyter Notebook.
From the output, we can see that ‘a’, ‘great’, and ‘it’ each appear twice, and the rest of the words from s1 appear once. The output from the second cell is the frequency of the top 10 words.
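A rough sketch of the frequency count might look like the following. The sentence `s1` is again a hypothetical stand-in, written so that a few words repeat, and lower-casing before tokenizing is an assumption made here so that ‘It’ and ‘it’ are counted together.

```python
from nltk.probability import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Hypothetical stand-in for "s1"; lower-cased so "It" and "it" count as one token.
s1 = "It is a great day, and it was a great idea."
tokens = wordpunct_tokenize(s1.lower())

# FreqDist counts how often each token occurs.
fdist = FreqDist(tokens)
print(fdist["great"])

# The ten most frequent tokens with their counts.
top10 = fdist.most_common(10)
print(top10)
```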
Most of the time, when we search for information, we want relevant results not only for the exact expression we typed in the search bar but also for other possible forms of the words. For example, if we type ‘shop’ in a search bar, we will probably also want results containing ‘shopping’. This can be achieved through stemming and lemmatization. Stemming modifies tokens after the input text has been broken into tokens: it is the process of normalizing words into their base or root form. For example, ‘give’, ‘gave’, ‘given’, and ‘giving’ all share the root word ‘give’. Note, however, that the result of stemming is not always the root word: stemmers work by stripping affixes, so irregular forms such as ‘gave’ are typically left as they are, and some stems are not real words at all. Below is a demo of how to perform stemming using NLTK on a given set of words.
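One way to try this is with NLTK's `PorterStemmer`, which is rule-based and needs no extra data download. The word list below is chosen purely for illustration; it is not the dataset from the original demo.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative word list: regular forms of "shop" and several forms of "give".
words = ["shop", "shops", "shopping", "shopped", "give", "giving", "given", "gave"]

# Map every word to the stem produced by the Porter rules.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word:>10} -> {stem}")
```

The regular forms of ‘shop’ collapse to a single stem, while the printed stems for the ‘give’ family show how far suffix stripping gets with irregular forms.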
Lemmatization is similar to stemming in that it maps several word forms to one common root. It differs from stemming in that the output is a proper word (called a ‘lemma’) and that the morphological analysis of the words is taken into consideration. The Medium journal Text Mining in Python summarizes the difference as follows: “lemmatization considers the context and converts the word to its meaningful base form, whereas stemming just removes the last few characters, often leading to incorrect meanings and spelling errors.”
For example, lemmatization would correctly identify the base form of ‘caring’ as ‘care’, whereas a naive stemmer that simply cuts off the ‘ing’ part would convert it to ‘car’. The implementation of lemmatization using NLTK is shown below.
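In NLTK, lemmatization is available through `WordNetLemmatizer().lemmatize(word, pos="v")`, which looks words up in WordNet and therefore needs the `wordnet` corpus from `nltk.download`. To show the idea without that download, here is a toy, standard-library-only contrast between crude suffix stripping and a dictionary lookup. The `naive_stem` and `toy_lemmatize` helpers and the tiny `LEMMAS` table are hypothetical illustrations, not NLTK functions.

```python
# Toy illustration only: neither helper below is part of NLTK.

def naive_stem(word):
    """Crudely strip a common suffix, whether or not the result is a real word."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Hand-written lookup table standing in for a real dictionary such as WordNet.
LEMMAS = {"caring": "care", "gave": "give", "given": "give", "giving": "give"}

def toy_lemmatize(word):
    """Return the dictionary base form if known, else the word unchanged."""
    return LEMMAS.get(word, word)

for word in ["caring", "giving", "gave"]:
    print(word, "| stem:", naive_stem(word), "| lemma:", toy_lemmatize(word))
```

Because the lemma comes from a lookup, it is always a valid word. The suffix stripper only looks at characters, which is how ‘caring’ ends up as ‘car’.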
In the English language, words such as “of,” “the,” “not,” “who,” and “is” are essential to the composition of sentences, but they contribute little when it comes to processing the language. These words are called stop words. The NLTK Python package has a list of stop words for 16 different languages, including English, Spanish, German, and French. At the time of this writing, there are 179 English words that NLTK considers to be stop words. Below is a demo where we check a list of stop words in English and remove them from a piece of text.
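The filtering step itself can be sketched with the standard library alone. The `STOP_WORDS` set below is a tiny hand-picked subset (the five words quoted above), and the sample sentence is made up. With NLTK, the full English list is available via `nltk.corpus.stopwords.words("english")` once the `stopwords` corpus has been downloaded.

```python
import re

# Tiny illustrative subset; NLTK's real English list is much longer.
STOP_WORDS = {"of", "the", "not", "who", "is"}

# Made-up sample text for the demo.
text = "The cat who is on the mat is not afraid of the dog"

# Lower-case the text and pull out alphabetic word tokens.
tokens = re.findall(r"[a-z]+", text.lower())

# Keep only the tokens that are not stop words.
filtered = [tok for tok in tokens if tok not in STOP_WORDS]
print(filtered)
```

Removing ‘not’ here changes what the sentence says, so the stop-word list is worth reviewing for tasks such as sentiment analysis.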
Chunking refers to a range of sentence-breaking systems that split a sentence into its component phrases (noun phrases, verb phrases, and so on). The main goal of chunking is to group nouns with the words that relate to them. Said another way, we’re trying to group words into meaningful chunks.
The code above demonstrates how to implement part-of-speech (POS) tagging and named-entity recognition (NER) to classify text into a pre-defined set of categories. Below is an example that shows how chunking groups words or tokens (‘old’ and ‘grandma’) into a meaningful chunk (‘old grandma’).
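A small sketch of noun-phrase chunking with NLTK's `RegexpParser` is shown here. The sentence and its POS tags are hand-written assumptions (in practice the tags would come from a tagger such as `nltk.pos_tag`, which needs a downloaded model), and the grammar is one simple noun-phrase pattern among many possible ones.

```python
from nltk.chunk.regexp import RegexpParser

# Hand-tagged example sentence (an assumption; normally produced by a POS tagger).
tagged = [
    ("We", "PRP"), ("saw", "VBD"), ("old", "JJ"), ("grandma", "NN"),
    ("at", "IN"), ("the", "DT"), ("market", "NN"),
]

# A noun phrase (NP): optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
parser = RegexpParser(grammar)
tree = parser.parse(tagged)

# Collect the words of every NP chunk found in the parse tree.
chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(chunks)
```

The adjective ‘old’ and the noun ‘grandma’ end up in one chunk because the pattern allows adjectives directly before a noun.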
In this journal, we covered an introduction to text mining and NLP, and demonstrated the key functions through which text mining uses NLP to transform unstructured text into structured data. However, the examples in the demos are simple sentences rather than paragraphs or more complicated text. I’m going to discuss how to perform text mining on a dataset in my next journal. Stay tuned!