
Mining Online Reviews Using Text Mining

Text Mining

With the growth of the Internet and technology, an increasing number of people regard online customer reviews as one of the most important criteria for making purchasing decisions. People check ratings and reviews before buying makeup products, reserving a hotel, or going shopping, and sometimes we even hesitate to purchase products or services without this information. Customer reviews, as the “voice of the customer” and the bridge between customers and businesses, are critical for businesses to understand overall customer satisfaction and strengthen their credibility. Millions of customers share their experiences and reflections online every day, so a simple and effective methodology for thoroughly cleaning customer review data and extracting insights is vital for business success.

In my last article, A Powerful Technology – Text Mining, I introduced text mining, a powerful technology that converts unstructured text data into structured data for further analysis, from which meaningful information can be derived. That article also demonstrated how to apply NLTK functions to perform text mining on simple sentences. The scope of this article is to illustrate the steps of preprocessing and analyzing a dataset containing a large number of online reviews using text mining. Let’s dive into the fun part together!

Data Set and Problem Statement

  • Dataset: The dataset used for this article is from Kaggle (https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products). It is a list of 34,660 consumer reviews of Amazon products such as the Kindle and Fire TV Stick, provided by Datafiniti’s Product Database. The dataset is in CSV format and includes 21 feature variables: the ID of the reviewer (id), the ID of the product (asin), the product name, categories, the review text, and more for each review. To demonstrate the process of implementing text mining concisely, we will focus only on the review-text variable, i.e., reviews.text, the customer review of the product (a string variable containing the review content).
  • Problem Statement: It is essential to formulate a solid, clear, and well-defined problem statement before starting to analyze the dataset. As stated above, the objective of this article is to preprocess and clean the data by implementing text mining functions, and to visualize the keywords that appear in customer reviews.

Analyzing the Dataset

In this article, we are also going to use Jupyter Notebook to demonstrate the process (Python code) of cleaning the preliminary data and implementing text mining.

1. Install necessary packages and libraries

# Step 1.1 Install necessary packages 
!pip install PyDrive
!pip install gensim
!pip install pyldavis
!python -m spacy download en
# Step 1.2 Load necessary libraries 
import pandas as pd
pd.set_option("display.max_colwidth", 200)
import numpy as np
import json
import re
import gzip
import spacy
import nltk
import os
import nltk.corpus
import matplotlib.pyplot as plt
from nltk import FreqDist
import gensim  # a topic-modelling and vector space modelling toolkit
from gensim import corpora
# Step 1.3 Import libraries for visualization 
import pyLDAvis
import pyLDAvis.gensim
import seaborn as sns
%matplotlib inline
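
One note before moving on: the NLTK functions used in Step 3 (stop words, POS tagging, and lemmatization) rely on corpora and models that NLTK downloads separately. If you have never used NLTK on your machine, a one-time download such as the sketch below should prevent LookupError messages later; the exact resource list is my assumption based on the functions used in this article.

# One-time NLTK resource downloads needed by the cleaning steps in Step 3
nltk.download('stopwords')                   # stop-word lists
nltk.download('averaged_perceptron_tagger')  # tagger behind nltk.pos_tag
nltk.download('wordnet')                     # lexical database for lemmatization
nltk.download('omw-1.4')                     # may be required by newer NLTK versions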

2. Load the raw data

# Step 2.1 Import data in csv format by using pandas.read_csv() function 
df = pd.read_csv("C:/Users/mcyang/Desktop/data.csv")
# Step 2.2 Return the top 5 rows by using df.head() function 
df.head()
# Step 2.3 Select only the relevant column 'reviews.text' and return a dataframe containing it
review_df = df[["reviews.text"]]
review_df.head()
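
Before cleaning, it is worth running a quick sanity check on the selected column. The snippet below is an optional, illustrative check that reports the dataset size and how many review texts are missing.

# Optional sanity check: dataset size and missing review texts
print(review_df.shape)
print(review_df['reviews.text'].isnull().sum())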

3. Clean Data

After loading the raw data into Jupyter Notebook, the next step is to remove meaningless characters, numbers, and symbols, and then create a plot to visualize the most common words.

# Step 3.1 Remove unwanted characters, numbers and symbols 
df['reviews.text_1'] = df['reviews.text'].str.replace("[^a-zA-Z#]", " ", regex=True)  # regex=True keeps this working on newer pandas
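
To see what this regular expression does, here is a tiny illustration on a made-up review string: every character that is not a letter or ‘#’ is replaced with a space, so digits and punctuation disappear.

# Illustrative example (hypothetical string, not from the dataset)
import re
sample = "Loved it!!! 10/10 would buy again :)"
print(re.sub("[^a-zA-Z#]", " ", sample))  # digits and punctuation become spaces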

Next, create a freq_words() function to plot the 25 most frequent terms.

# Step 3.2 Create freq_words() function to plot the 25 most frequent terms 
def freq_words(x, terms = 25): 
  # Join all texts into a single string and split it into words  
  all_words = ' '.join([str(text) for text in x]) 
  all_words = all_words.split() 
  # Count the frequency of every word and transfer the counts into a dataframe  
  fdist = FreqDist(all_words) 
  words_df = pd.DataFrame({'word':list(fdist.keys()),   
             'count':list(fdist.values())}) 
  # Sort words by frequency and plot the top 25 most frequent words in a bar chart
  d = words_df.nlargest(columns="count", n = terms)      
  plt.figure(figsize=(20,5)) 
  ax = sns.barplot(data=d, x="word", y="count") 
  ax.set(ylabel = 'Count') 
  plt.show()
# Step 3.3 Apply freq_words() function to data 
freq_words(df["reviews.text_1"])

From the plot above, the most common words are ‘the’, ‘to’, ‘and’, and so on. These words are not meaningful, and we are unable to retrieve further information from them, so they need to be removed. The clean_text() function below, which cleans the text data with four text mining techniques (tokenization, stop-word removal, POS tagging, and lemmatization), can be used to achieve this.

# Step 3.4 Implement Text Mining functions 
# a. Import necessary libraries 
import string
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import WhitespaceTokenizer
from nltk.stem import WordNetLemmatizer
# b. Return the wordnet object value corresponding to the POS tag
def get_wordnet_pos(pos_tag):
    # Return the corresponding wordnet tag for the given POS tag; default to wordnet.NOUN if none matches
    if pos_tag.startswith('J'):
        return wordnet.ADJ
    elif pos_tag.startswith('V'):
        return wordnet.VERB
    elif pos_tag.startswith('N'):
        return wordnet.NOUN
    elif pos_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN   # default to NOUN if nothing else matches 
# c. Create clean_text() function 
def clean_text(text):
    # lower text: make entire text lowercase 
    text = str(text).lower()
    # tokenize text (split text into words) and remove punctuation
    text = [word.strip(string.punctuation) for word in text.split(" ")]
    # remove words that contain numbers
    text = [word for word in text if not any(c.isdigit() for c in word)]
    # remove stop words
    stop = stopwords.words('english')
    text = [x for x in text if x not in stop]
    # remove empty tokens
    text = [t for t in text if len(t) > 0]
    # pos tag text: assign a tag to every word to define if it corresponds to a noun, a verb, etc. (using the WordNet lexical database)
    pos_tags = pos_tag(text)
    # lemmatize text: transform every word into their root form 
    text = [WordNetLemmatizer().lemmatize(t[0], get_wordnet_pos(t[1])) for t in pos_tags]
    # remove words with two or fewer letters
    text = [t for t in text if len(t) > 2]
    # join all
    text = " ".join(text)
    return(text)
# d. Apply clean_text() function to the column 'reviews.text_1' and store the result in a new column 'reviews.text_2'
df['reviews.text_2'] = df['reviews.text_1'].apply(lambda x: clean_text(x))
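
Before re-plotting, an optional spot check helps confirm the cleaning behaves as expected; the snippet below simply compares one raw review with its cleaned counterpart.

# Optional spot check: compare a raw review with its cleaned version
print(df['reviews.text'].iloc[0])
print(df['reviews.text_2'].iloc[0])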

Now, let’s plot the most frequent words again and see whether more significant words appear.

# e. Plot the top 25 most frequent words again 
freq_words(df["reviews.text_2"])

According to the result, the most frequent terms in our data (‘great’, ‘use’, ‘tablet’, ‘love’, etc.) now look relevant. We can start creating a wordcloud to get an overall view of the kinds of words that appear in our reviews.

4. Create a wordcloud

# Step 4.1 Install wordcloud package and necessary libraries 
!pip install wordcloud
from wordcloud import WordCloud
# Step 4.2 Define a show_wordcloud() function 
def show_wordcloud(data, title = None):
    wordcloud = WordCloud(
        background_color = 'white',
        max_words = 200,
        max_font_size = 40, 
        scale = 3,
        random_state = 42
    ).generate(str(data))
    fig = plt.figure(1, figsize = (20, 20))
    plt.axis('off')
    if title: 
        fig.suptitle(title, fontsize = 20)
        fig.subplots_adjust(top = 2.3)
    plt.imshow(wordcloud)
    plt.show()
# Step 4.3 Print a wordcloud 
show_wordcloud(df["reviews.text_2"])
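
One caveat worth knowing: calling str() on a pandas Series includes the index numbers and a dtype footer in the resulting string, which can leak tokens such as ‘dtype’ into the cloud. If that happens, a safer variant (shown below as an alternative, not the original call) joins the column’s values explicitly.

# Alternative call: join the review texts so the Series index is excluded
show_wordcloud(" ".join(df["reviews.text_2"].astype(str)))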

Most of the words are indeed related to the Amazon products: Kindle, charger, tablet, screen, etc. Other words relate more to the customer experience of using the products: great, well, good, need, love, etc.

Next Step

Now we have a cleaned dataset containing only informative words and text, so we can start building topic models to visualize topics or perform sentiment analysis to extract the emotions behind the customer reviews.
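
As a teaser for that next step, below is a minimal, illustrative sketch of a topic model built with the gensim and pyLDAvis imports from Step 1. The number of topics, passes, and other parameters here are assumptions chosen for demonstration, not tuned values.

# A minimal topic-modelling sketch (illustrative parameters, not tuned values)
# Tokenize the cleaned reviews into lists of words
tokenized = df['reviews.text_2'].apply(lambda x: str(x).split())
# Map each word to an integer id and build a bag-of-words corpus
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]
# Fit a small LDA model; 5 topics is an arbitrary starting point
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary,
                                   num_topics=5, random_state=42, passes=5)
# Print the top words of each discovered topic
for topic_id, words in lda_model.print_topics(num_words=8):
    print(topic_id, words)
# Optionally explore the topics interactively with pyLDAvis
pyLDAvis.enable_notebook()
pyLDAvis.display(pyLDAvis.gensim.prepare(lda_model, corpus, dictionary))

Hope you enjoyed reading this article; see you in the next one!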