A beginner's guide to text mining and analytics
Why do we need text mining and analytics?
For centuries, humans have been generating data in the form of either numbers or text. As businesses matured, they began using the numbers to solve critical business challenges. The text, however, remained largely untamed and unused: it held significant information, but it was difficult and time-consuming to comprehend. It is estimated that 80% of the world’s data is unstructured and held in some form of text document. Text mining and analytics let us extract the valuable information stored in text and use it to solve business problems in a timely, unbiased, and efficient manner.
This graphical representation shows the evolution of business reporting from traditional descriptive analytics to prescriptive analytics.
What is text mining and analytics?
Text mining is the practice of analyzing and processing a large corpus of semi-structured and unstructured text data (such as articles, website text, blog posts, journals, surveys, reviews, and emails) using software or a computer program. Its prime purpose is to “turn text into numbers or meaningful indices”.
The resulting text indices can then be used to derive relationships between documents, to find patterns, trends, and sentiments, and to compute summaries either of the words contained in the documents or of the documents based on the words they contain.
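To make "turning text into numbers" concrete, here is a minimal Python sketch that builds a document-term matrix: one row per document, one column per word, each cell holding a word count. The function name `document_term_matrix` and the three sample sentences are made up for illustration, and the sketch uses only the Python standard library rather than the R packages used later in this post.

```python
from collections import Counter


def document_term_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in documents[i]."""
    # Naive tokenization: lower-case and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word, sorted so column order is stable.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Made-up sample documents, not taken from any real corpus.
docs = ["farmers need water", "youth need jobs", "water for farmers and youth"]
vocab, matrix = document_term_matrix(docs)
print(vocab)
for row in matrix:
    print(row)
```

Once text is in this numeric form, documents can be compared, clustered, or fed to other analytics methods like any other table of numbers.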
Text mining sphere:
Text mining is a subset of applied artificial intelligence. To get the best results, it is used in conjunction with other advanced analytics techniques to discover and measure patterns, trends, and sentiments in text data.
Fig 1.2 depicts the landscape of Artificial intelligence and the key advanced analytics methods which are used along with text mining to solve problems
Text mining process flow:
1. Data Collection: Collect the data to be analyzed.
2. Text Parsing & Transformation: Extract words and parts of speech, and apply stemming, filtering, synonym handling, and spell checking. The different methods of text transformation are:
   - Term-based method
   - Phrase-based method
   - Concept-based method
   - Pattern taxonomy method
3. Readiness: Clean up the data by eliminating punctuation, numbers, and irrelevant terms.
4. Text Mining: Use data mining methods to find solutions from the mined data.
Text mining and analytics applications can be broadly classified into two categories:
- Search function
- Analytics function
Fig 1.3 shows some practical uses of these applications.
The following are a few areas where text mining is applied across industries and business verticals:
- Risk management
- Research
- Manufacturing
- Marketing
- Security
- Performance management
Advantages of text mining:
- Develops new knowledge and unlocks hidden knowledge.
- Highly efficient: information can be extracted and processed automatically from thousands of documents without human assistance.
- Text mining provides an unbiased opinion.
- Improves the research process, its quality and robustness, and supports evidence-based conclusions.
- Text mining can be used across different industry verticals for predictive analytics.
Challenges in text mining:
- Text analysis software might not support every text file format.
- Copyright restrictions and the cost of accessing journals and periodicals.
- Non-availability of digital copies of publications, books, and articles.
- Noise in the data, ambiguous or unwanted information unrelated to the text, and lack of critical mass.
- Text written by humans in a sarcastic tone.
- Text mining doesn’t generate new facts. The results need to be analyzed further by a domain expert to give the complete picture.
- Models can’t be reused across industries for different problems.
Use Case
Text mining the radio talk show “Mann ki Baat” of India’s Prime Minister, Mr. Narendra Modi.
Indian PM Mr. Narendra Modi addresses India every month via radio to share his thoughts and ideas. The objective of this exercise is to analyze his speeches using text mining and analytics. We used RStudio for this exercise.
Step 0: Install the key packages: tm, syuzhet, colorspace, Rcpp, Rsentiment.
Step 1: Preprocessing. This includes data cleansing, integration, and transformation:
- Converting the raw text files (Sep’16 to Dec’17) into a data corpus.
- Transforming the data to lower case.
- Removing numbers.
- Removing punctuation.
- Converting the data corpus to plain text.
- Removing redundant words such as will, can, year, day, one, and two.
- Tokenization: breaking the stream of text into single words.
- Stemming: using the Snowball stemmer, we converted words to their roots.
- Filtering stopwords: every token is compared against a list of stopwords, and any token found in that list is removed.
Step 2: Analysis and results. Frequency of the most used words:
These are the top six words used in the text corpus. Since the address was to the nation, the results below are consistent with the raw data.
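Counting the most frequent words is a small step once the text has been tokenized. The following Python sketch shows the idea with the standard library's `Counter`; the helper name `top_words` and the token list are made up for illustration and are not the actual counts from the speeches.

```python
from collections import Counter


def top_words(tokens, n=6):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


# Made-up tokens standing in for a cleaned corpus.
sample_tokens = "youth india farmer youth india youth water".split()
most_used = top_words(sample_tokens, n=2)
print(most_used)
```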
The table below shows the top three words that were used along with the most frequent words in his address:
Word Cloud depicting the most frequent words used
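The post does not say how the word associations were computed. One common approach, assumed here, is to correlate how often two words occur across the documents of the corpus: words that rise and fall together get a score near 1. The sketch below is plain Python with hypothetical helper names (`pearson`, `associations`) and made-up documents.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def associations(docs_tokens, target, top_n=3):
    """Rank other words by how strongly their per-document counts track target's."""
    vocabulary = sorted({t for doc in docs_tokens for t in doc})
    target_counts = [doc.count(target) for doc in docs_tokens]
    scores = []
    for term in vocabulary:
        if term == target:
            continue
        counts = [doc.count(term) for doc in docs_tokens]
        scores.append((term, round(pearson(target_counts, counts), 2)))
    # Highest correlation first; ties broken alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


# Made-up tokenized documents, one list per document.
sample_docs = [
    ["poor", "worker", "benefit"],
    ["poor", "worker", "money"],
    ["youth", "sport"],
    ["youth", "money"],
]
related = associations(sample_docs, "poor")
print(related)
```

With only a handful of documents the scores are coarse; on a real corpus each speech (or each paragraph) would be one document.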
Sentiment Analysis: Analyzing the thoughts of his address.
We used the Syuzhet package to extract the sentiments of our text. This package comes with four sentiment dictionaries and provides insight into a wider range of emotions than just positive and negative. We analyzed 10 different sentiments in the text (Positive, Trust, Negative, Anticipation, Joy, Fear, Sadness, Anger, Surprise, and Disgust).
Using the Syuzhet package, we determined the sum of each sentiment used in the text and plotted it on a scale.
Consistent with the previous analysis, the chart below confirms that the positive tone of the address dominates the negative words.
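Dictionary-based sentiment scoring of this kind boils down to looking up each word in a lexicon and summing the hits per emotion. The Python sketch below illustrates that idea only; it is not the Syuzhet implementation. `TOY_LEXICON` is a made-up four-word lexicon, not an excerpt from any real sentiment dictionary, and the sample sentence is invented.

```python
from collections import Counter

# Hypothetical toy lexicon: each word maps to the categories it counts toward.
TOY_LEXICON = {
    "hope": ["positive", "anticipation", "joy", "trust"],
    "proud": ["positive", "joy", "trust"],
    "poor": ["negative", "sadness"],
    "fear": ["negative", "fear"],
}


def emotion_totals(tokens, lexicon=TOY_LEXICON):
    """Sum, per category, how many tokens in the text carry that category."""
    totals = Counter()
    for token in tokens:
        for category in lexicon.get(token, []):
            totals[category] += 1
    return totals


# Made-up sentence; words missing from the lexicon contribute nothing.
sample = "we hope the poor feel proud and hope again without fear".split()
totals = emotion_totals(sample)
for category, count in sorted(totals.items()):
    print(category, count)
```

Plotting these totals as a bar chart gives the kind of per-emotion view described above, and comparing the positive and negative totals gives the overall tone.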
Conclusion:
- We can conclude that he uses three times as many positive words as negative words.
- Based on word frequency, the address is directed toward the youth of the country.
- The word associations show that the frequently used negative word “poor” appears along with worker, benefit, money, and accomplish.
This analysis can be expanded further to draw more insights from the Indian PM's address to the nation.