A beginner's guide to text mining and analytics



Why do we need text mining and analytics?

For centuries, humans have generated data in the form of either numbers or text. As business analysis matured, numbers were put to work on critical business challenges. A large body of text, however, remained untapped: it held significant information but was difficult and time-consuming to comprehend, so it was rarely used. It is estimated that 80% of the world’s data is unstructured, held in text documents of one kind or another. Text mining and analytics let us extract the valuable information stored in text and use it to solve business problems in a timely, unbiased, and efficient manner.


What is text mining and analytics?

Text mining is the practice of analyzing and processing a large corpus of semi-structured and unstructured text data (articles, website text, blog posts, journals, surveys, reviews, emails, etc.) using software. Its prime purpose is to “turn text into numbers or meaningful indices”.
The resulting indices can then be used to derive relationships between documents, to find patterns, trends and sentiments in the words they contain, or to compute summaries of the documents based on those words.
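As a minimal sketch of what "turning text into numbers" can look like, the Python snippet below converts each text into word counts and compares documents by cosine similarity. The sample sentences and the function names `term_counts` and `cosine_similarity` are invented for illustration; the analysis later in this article was done in R, not with this code.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Turn a text into a bag of words: a mapping from word to count."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 = no shared words)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, invented for illustration.
review_1 = term_counts("The delivery was fast and the product is great")
review_2 = term_counts("Great product, fast delivery")
email = term_counts("Please find the quarterly invoice attached")

# The two reviews share more vocabulary than a review and the email do.
print(cosine_similarity(review_1, review_2))
print(cosine_similarity(review_1, email))
```

Once texts are count vectors, any numeric technique (clustering, classification, trend analysis) can be applied to them.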

This graphical representation shows the evolution of business reporting from traditional descriptive analytics to prescriptive analytics.




Text mining sphere:
Text mining is a subset of applied artificial intelligence. To get the best results, it is used in conjunction with other advanced analytics techniques to discover and measure patterns, trends and sentiments in text data.

Fig 1.2 depicts the landscape of artificial intelligence and the key advanced analytics methods that are used along with text mining to solve problems.





Text mining process flow:
  1. Data Collection: Collect the data to be analyzed.
  2. Text Parsing & Transformation: Extract words and parts of speech, then apply stemming, filtering, synonym handling and spell checking. The different methods of text transformation are:
    1. Term Based Method
    2. Phrase Based Method
    3. Concept Based Method
    4. Pattern Taxonomy Method
  3. Readiness: Clean up the data by eliminating punctuation, numbers and irrelevant terms.
  4. Text Mining: Use data mining methods to find solutions from the mined data.
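To illustrate the difference between the term based and phrase based methods listed under step 2, here is a small Python sketch. It uses single words as terms and two-word n-grams as a simple stand-in for phrases, which is an assumption made for illustration; real phrase extraction is usually more selective. The sample text and function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical sample text, invented for illustration.
text = "Credit risk is high. Credit risk must be reviewed."
tokens = tokenize(text)

term_features = Counter(tokens)               # term based: single words
phrase_features = Counter(ngrams(tokens, 2))  # phrase based: word pairs

print(term_features.most_common(2))
print(phrase_features.most_common(1))
```

The term counts treat "credit" and "risk" as unrelated words, while the phrase counts keep "credit risk" together as one unit. Note that this naive version also pairs words across sentence boundaries (for example "high credit").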
Text mining and analytics applications:
Text mining and analytics applications can be broadly classified into two categories:
  1. Search function
  2. Analytics function
Fig 1.3 shows some practical uses of these applications.





These are a few text mining applications across industries and business verticals.

Risk management
  1. Credit risk mitigation, line of credit
  2. Predicting bankruptcy/fraud detection
  3. High-risk individuals / corporations
  4. Services contract review, potential risks, warranty claims
Research
  1. New areas of research opportunity, research trends
  2. Patent filing and likelihood of infringement
  3. Legal case research: judgments, notifications, treaties
Manufacturing
  1. Reducing the product design phase by retrieving generic BoMs and routings from the large amounts of production information and process data available
  2. Processing new product feedback and addressing customer quality issues from customer services
Marketing
  1. Contextual advertising, product promotion
  2. New campaign development based on customer feedback
  3. Gauging customer sentiment for new or existing products from surveys and social media posts

Security
  1. Cyber-crime prevention: analyzing posts and digital footprints
  2. Analyzing social media chatter, posts and emails to identify potential threats to national security
Performance Management
  1. Product performance measurement from electronic data
  2. Employee performance review from surveys, service requests, and customer call notes


Advantages of text mining:

  1. Developing new knowledge and unlocking hidden knowledge.
  2. Highly efficient: information can be extracted and processed automatically from thousands of documents without human assistance.
  3. Text mining provides an unbiased opinion.
  4. Improved research process, quality and robustness, with evidence-based conclusions.
  5. Text mining can be utilized across different industry verticals for predictive analytics.

Challenges in text mining:
  1. Not all text file formats are supported by text analysis software.
  2. Copyright restrictions and the cost of accessing journals and periodicals.
  3. Non-availability of digital copies of publications, books and articles.
  4. Noise in the data: ambiguous or unwanted information not related to the text, and lack of critical mass.
  5. Text written in a sarcastic tone.
  6. Text mining doesn’t generate new facts; the results need to be analyzed further by a domain expert to give the complete picture.
  7. Models can’t be reused across industries for different problems.
Use Case

Text mining Indian Prime Minister Narendra Modi’s radio talk show “Mann ki Baat”.
Indian PM Mr. Narendra Modi addresses India every month via radio to share his thoughts and ideas. The objective of this exercise is to analyze his speeches using text mining and analytics. We used RStudio for this exercise.
Step 0: Installing key packages: tm, syuzhet, colorspace, Rcpp, RSentiment
Step 1: Preprocessing – this includes data cleansing, integration and transformation.
  1. Converting raw text files (Sep’16 to Dec’17) to a data corpus
  2. Transforming the data to lower case
  3. Removing numbers
  4. Removing punctuation
  5. Converting the data corpus to plain text
  6. Removing redundant words like will, can, year, day, one, two etc.
  7. Tokenization – breaking the stream of text into single words
  8. Stemming – using the Snowball stemmer, we converted words to their roots
  9. Filtering stopwords – every token created is compared against a list of stopwords, and tokens on the list are removed
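The preprocessing steps above can be sketched in a few lines of Python. This is an illustration only, not the R code used for the analysis: the stopword list is an assumed, heavily abbreviated one, the sample sentence is invented, and `crude_stem` is a hypothetical suffix stripper that is far simpler than the Snowball stemmer.

```python
import re
import string

# Assumed, abbreviated lists; real stopword lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "be", "for"}
CUSTOM_WORDS = {"will", "can", "year", "day", "one", "two"}


def crude_stem(word):
    """Strip a few common English suffixes. NOT the Snowball algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Apply the cleaning steps in order and return a list of tokens."""
    text = text.lower()                                   # lower case
    text = re.sub(r"\d+", " ", text)                      # remove numbers
    blanks = " " * len(string.punctuation)
    text = text.translate(str.maketrans(string.punctuation, blanks))  # punctuation
    tokens = text.split()                                 # tokenization
    tokens = [t for t in tokens if t not in CUSTOM_WORDS] # redundant words
    tokens = [crude_stem(t) for t in tokens]              # stemming
    return [t for t in tokens if t not in STOPWORDS]      # stopword filter


# Hypothetical sentence, invented for illustration.
print(preprocess("In 2017, farmers will be helping the villages of India!"))
```

Stems such as "villag" are not dictionary words; stemming only needs to map related word forms to the same token.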
Step 2: Create a Document Term Matrix – a mathematical matrix that describes the frequency of terms that occur in a collection of documents, with one row per document and one column per term.
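A minimal Python sketch of a document term matrix, assuming the documents have already been tokenized. The two toy documents and the function name `document_term_matrix` are invented for illustration; the article's matrix was built in R.

```python
from collections import Counter


def document_term_matrix(docs):
    """docs: list of token lists. Returns (sorted vocabulary, matrix rows).

    Each row is one document; each column is the count of one vocabulary term.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        rows.append([counts[term] for term in vocab])
    return vocab, rows


# Hypothetical tokenized documents, invented for illustration.
docs = [
    ["farmer", "india", "farmer"],
    ["youth", "india"],
]
vocab, dtm = document_term_matrix(docs)

print(vocab)  # column labels
for row in dtm:
    print(row)  # one row of counts per document
```

Real corpora produce very wide, mostly zero matrices, which is why text mining libraries usually store them in a sparse format.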
Step 3: Analysis and Results: Frequency of most used words:
These are the top six words used in the text corpus. Since the address was to the nation, the results below are consistent with the raw data.
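Finding the most used words amounts to summing each word's count over all documents and sorting. A short Python sketch, with invented toy documents and a hypothetical helper `top_terms` (not the code used to produce the results here):

```python
from collections import Counter


def top_terms(docs, n):
    """Sum token counts across all documents and return the n most frequent."""
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    return totals.most_common(n)


# Hypothetical tokenized documents, invented for illustration.
docs = [
    ["india", "youth", "india"],
    ["india", "farmer", "youth"],
]
print(top_terms(docs, 2))
```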





The table below shows the top three words used alongside the most frequent words in his address:
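The article does not state how the associated words were computed. One common approach, sketched below in Python as an assumption, is to correlate the per-document counts of a target word with those of every other word and keep the highest scores. The counts and the names `pearson` and `associations` are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no association signal
    return cov / math.sqrt(var_x * var_y)


def associations(term, counts_by_term, top_n=3):
    """Rank the other terms by correlation with `term`, highest first."""
    target = counts_by_term[term]
    scores = {
        other: pearson(target, column)
        for other, column in counts_by_term.items()
        if other != term
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical counts: one list entry per document, invented for illustration.
counts_by_term = {
    "poor": [3, 0, 2, 0],
    "worker": [2, 0, 1, 0],
    "money": [1, 0, 1, 0],
    "cricket": [0, 2, 0, 1],
}
for word, score in associations("poor", counts_by_term):
    print(word, round(score, 2))
```

Words that tend to appear in the same documents as the target score close to 1, while words that appear in different documents score negative.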




Ratio of positive to negative word frequency: (p) 3254 / (n) 901 ≈ 3.61
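The ratio is a simple count of lexicon matches. The Python sketch below shows the idea with two tiny word sets that are assumed purely for illustration (real sentiment lexicons contain thousands of entries), then reproduces the arithmetic for the counts reported above.

```python
# Assumed toy lexicons, invented for illustration.
POSITIVE = {"good", "proud", "hope", "success"}
NEGATIVE = {"poor", "fear", "problem"}


def polarity_ratio(tokens):
    """Return (positive count, negative count, positive/negative ratio)."""
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    ratio = pos / neg if neg else None  # undefined when no negative words
    return pos, neg, ratio


# Hypothetical tokens, invented for illustration.
tokens = ["proud", "nation", "hope", "poor", "success"]
print(polarity_ratio(tokens))

# The counts reported in this article: 3254 positive, 901 negative.
print(round(3254 / 901, 2))  # 3.61
```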

Word Cloud – Word clouds are used to visualize language and word frequency. These are the most frequent words used in his address.





Word cloud depicting the most frequent words used

Sentiment Analysis: Analyzing the tone of his address.

We used the syuzhet package to extract the sentiments in our text. This package comes with four sentiment dictionaries and provides insight into a wider range of emotions than just positive and negative. We analyzed 10 different sentiments in the text (Positive, Trust, Negative, Anticipation, Joy, Fear, Sadness, Anger, Surprise and Disgust).
Using the syuzhet package, we determined the total for each sentiment in the text and plotted it on a scale.
As established in the previous analysis, the chart below confirms that the positive tone of the passage dominates the negative words.
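Conceptually, the per-sentiment totals come from looking each word up in an emotion lexicon and adding one to every category the word belongs to. The Python sketch below shows that idea with a four-word lexicon assumed for illustration; it is not the syuzhet package or its dictionaries, and the tokens are invented.

```python
from collections import Counter

# Assumed toy lexicon; a real emotion lexicon maps thousands of words.
EMOTION_LEXICON = {
    "hope": {"positive", "anticipation", "joy", "trust"},
    "proud": {"positive", "joy", "trust"},
    "poor": {"negative", "sadness"},
    "threat": {"negative", "fear", "anger"},
}


def emotion_totals(tokens):
    """Count, per category, how many tokens carry that emotion or polarity."""
    totals = Counter()
    for tok in tokens:
        for category in EMOTION_LEXICON.get(tok, ()):
            totals[category] += 1
    return totals


# Hypothetical tokens, invented for illustration.
tokens = ["hope", "proud", "poor", "hope", "nation"]
totals = emotion_totals(tokens)

# Print a simple text bar chart, largest total first.
for category, count in sorted(totals.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{category:<13}{'#' * count}")
```

A single word can add to several categories at once, so the category totals do not sum to the number of words.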







Conclusion:
  1. We can conclude that he used positive words more than three times as often as negative words.
  2. Based on the word frequencies, the address is directed at the young people of the country.
  3. From the word associations, it can be inferred that the frequently used negative word “poor” appears along with worker, benefit, money and accomplish.

This analysis can be expanded further to draw more insights from the Indian PM's addresses to the nation.

