Machine Learning & Text Analysis

Article by Yulia Gavrilova
Thursday, December 17th, 2020

Language is a logical structure that, in theory, should be easy for a machine to work with. How difficult is it, really, to train an ML text analysis system? Let’s find out today.

What is text analysis in machine learning?

Text analysis is the process of obtaining valuable insights from texts.

ML can work with different types of textual information such as social media posts, messages, and emails. Special software helps to preprocess and analyze this data.

Text analysis vs. text mining vs. text analytics

Text analysis and text mining are synonyms. They describe the same process of extracting meaning from data by observing patterns.

However, text analysis and text analytics are slightly different things:

  • Text analysis works with the concepts, the meaning of the text. Text analysis can be used to answer these questions: is a review positive or negative? What is the main topic of the text?
  • Text analytics studies patterns across many texts. The results can be shown in graphs, charts, and spreadsheets. If you want to estimate the percentage of positive customer feedback, you will need text analytics.

In this post, we will talk about ML text analysis techniques and use cases.

Why is text mining important?

Every piece of content can be analyzed on a deeper level in order to understand more about the author or the topic of the text. By introducing ML text analysis, we can provide users with better services:

  • provide answers to FAQs;
  • translate into different languages;
  • monitor public sentiment towards products and services;
  • facilitate paperwork through clustering and classification of documents.

Companies become much more efficient at communicating with their customers: by studying customer feedback, a company can discover public opinion about its products. ML algorithms can automatically classify customer support tickets or reviews by topic or by the language they are written in.

ML makes textual analysis much faster and more efficient than manual processing of texts. It reduces labor costs and speeds up the processing of texts without compromising on quality.

How does machine learning text analysis work?

What do you need to build a text analysis tool? Let’s look at it step-by-step.

  1. Gather the data. Decide what information you will study and how you will collect it. These samples will be used to train and test your model. There are two major types of information sources. If you go to resources such as forums or newspapers, then you are collecting external data. Internal data is what every person or company generates every day: emails, reports, chats, etc. Both internal and external resources can be valuable for text mining.

  2. Prepare the data. Unstructured data needs to be prepared, or preprocessed. Otherwise, the program won’t understand it. In our blog, we have already talked about different strategies for data preprocessing.

  3. Apply a machine learning algorithm for text analysis. You can write your algorithm from scratch or use a library. Consider NLTK, TextBlob, and Stanford’s CoreNLP if you are looking for something easily accessible for your study and research.

These are the techniques used for ML text analysis:

Tokenization

Every token is a meaningful unit. Words and punctuation are tokens, while whitespaces are not. Example: This post is about text analysis. = [“This”, “post”, “is”, “about”, “text”, “analysis”, “.”]
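As a rough illustration, a tokenizer of this kind can be sketched with a regular expression. The name `tokenize` and the pattern are assumptions made for this example; libraries such as NLTK handle many more edge cases (contractions, URLs, abbreviations).

```python
import re

# Assumed pattern for illustration: a run of word characters is one token,
# and any single character that is neither a word character nor whitespace
# (that is, punctuation) is a token of its own. Whitespace is skipped.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenize("This post is about text analysis."))
# ['This', 'post', 'is', 'about', 'text', 'analysis', '.']
```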

Part-of-speech tagging

When you assign a grammatical category to each token, it’s part-of-speech tagging.

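Example: This post is about text analysis. = [“This”: DET, “post”: NOUN, “is”: VERB, “about”: ADP, “text”: NOUN, “analysis”: NOUN, “.”: PUNCT]

Here “This” is a determiner and “about” is an adposition (preposition). The exact tag names depend on the tagset you use.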
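To make the idea concrete, here is a minimal sketch of the simplest possible tagger: a lookup in a hand-made lexicon. The `LEXICON` table, the `tag` function, and the choice of universal-style tag names (DET for determiners, ADP for prepositions) are assumptions for this example. Real taggers learn from annotated corpora and use the surrounding words to resolve ambiguity.

```python
# Hypothetical hand-made lexicon mapping lowercase words to a tag.
LEXICON = {
    "this": "DET", "post": "NOUN", "is": "VERB",
    "about": "ADP", "text": "NOUN", "analysis": "NOUN",
}

def tag(tokens, default="NOUN"):
    """Tag each token by lexicon lookup; unknown words get the default tag."""
    tagged = []
    for token in tokens:
        if not any(ch.isalnum() for ch in token):
            # Tokens without letters or digits are treated as punctuation.
            tagged.append((token, "PUNCT"))
        else:
            tagged.append((token, LEXICON.get(token.lower(), default)))
    return tagged

print(tag(["This", "post", "is", "about", "text", "analysis", "."]))
```

A lookup like this cannot tell the noun “post” from the verb “post”, which is exactly why practical taggers take context into account.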

Lemmatization

Lemmatization puts a word back into its dictionary form (lemma). You map all the possible forms of the word to one base form, and the machine can treat them as the same word. For example, ‘being’, ‘was’, and ‘were’ all have the lemma ‘be’.
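A minimal sketch of the idea, assuming a tiny hand-written lemma table (the `LEMMAS` dictionary and `lemmatize` function are hypothetical; real lemmatizers rely on a full dictionary and part-of-speech information):

```python
# Hypothetical lemma table for illustration only.
LEMMAS = {
    "being": "be", "was": "be", "were": "be",
    "is": "be", "are": "be", "been": "be",
    "bought": "buy", "buying": "buy",
}

def lemmatize(word):
    """Return the dictionary form if it is known, else the lowercased word."""
    key = word.lower()
    return LEMMAS.get(key, key)

print([lemmatize(w) for w in ["Being", "was", "were", "text"]])
# ['be', 'be', 'be', 'text']
```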

Stemming

By removing affixes from the word, you get the stem, a ‘clean’ form of the word. Google uses stemming when indexing queries: instead of storing all the forms of a word, the lexicon is reduced to stems. Stemming is much faster than lemmatization but also less accurate. For instance, the stem of ‘buying’ is simply ‘buy’.
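The speed and the inaccuracy both come from the same place: a stemmer applies mechanical rules and knows nothing about the dictionary. Below is a deliberately naive sketch; the suffix list and the minimum stem length are assumptions for this example, not the rules of any established stemmer such as Porter’s.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # assumed, deliberately tiny suffix list

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([stem(w) for w in ["Buying", "cleaned", "posts", "was"]])
# ['buy', 'clean', 'post', 'was']
```

Note that ‘was’ stays ‘was’: suffix stripping cannot map it to ‘be’ the way a lemmatizer would.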

Parsing

There are two kinds of parsing: dependency and constituency. You conduct parsing when you want to understand the grammatical structure of a sentence.

During constituency parsing, you break the text into sub-phrases, also called constituents. This helps to represent the structure of the sentence. Disadvantage: it relies on a context-free grammar, so in a sentence like ‘Visiting relatives can be boring’, the algorithm cannot use context to resolve the ambiguity. However, it’s good for grammar checking. For instance, it’s hard for Grammarly to parse a grammatically incorrect sentence, but, thanks to constituency parsing, it uses models of what the sentence should look like to find the right solution.

[Image: constituency parsing]

Dependency parsing identifies the main words in the sentence and finds related words that modify their meaning. Syntactic relationships help to understand what the sentence means, especially in synthetic languages such as Slavic languages. Dependency parsing is also applied to grammar checking and word processing because it can parse free word order and fragmented sentences.
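The output of a dependency parser can be thought of as a list of labeled edges between words. The sketch below uses a hand-annotated example sentence rather than a real parser; the edge list, the relation labels, and the helper names are all assumptions made for illustration.

```python
# Hand-annotated dependency edges for "The cat chased a small mouse",
# written as (head, relation, dependent). A parser would produce these.
EDGES = [
    ("chased", "nsubj", "cat"),    # "cat" is the subject of "chased"
    ("cat", "det", "The"),
    ("chased", "obj", "mouse"),    # "mouse" is the object of "chased"
    ("mouse", "det", "a"),
    ("mouse", "amod", "small"),    # "small" modifies "mouse"
]

def dependents(head, edges):
    """Return (relation, word) pairs for all words attached to `head`."""
    return [(rel, dep) for h, rel, dep in edges if h == head]

def root(edges):
    """The root is the only head that is never a dependent."""
    heads = {h for h, _, _ in edges}
    deps = {d for _, _, d in edges}
    (only_root,) = heads - deps
    return only_root

print(root(EDGES))                  # chased
print(dependents("mouse", EDGES))   # [('det', 'a'), ('amod', 'small')]
```

Because the structure is a set of word-to-word links rather than nested phrases, it does not depend on the words appearing in a fixed order.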

[Image: dependency parsing]

For the demonstration, we have used AllenNLP, which determines the relationships between the words automatically with a neural network trained on a large dataset of texts.

Text mining techniques

Now let us discover some of the approaches that allow you to work with textual data.

Word frequency analysis

This technique allows you to measure how frequently words appear in the text.

This mirrors one way human readers identify the topic of a text or judge its sentiment. We know that the word “interesting” usually signals a positive impression, so if it appears in a review, the client is probably satisfied. However, this method is not sensitive to sarcasm, which can skew the results of your analysis.
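Counting words takes only a few lines with the standard library. This is a minimal sketch; the `word_frequencies` name, the regular expression, and the sample review are assumptions for the example.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lowercase words; punctuation and digits are ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)

review = "Interesting plot, interesting characters, but the ending was slow."
freq = word_frequencies(review)
print(freq.most_common(1))
# [('interesting', 2)]
```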

Collocation analysis

Two, three, or more words that are often used together in speech are called collocations. The same word in different collocations can have different meanings. The word “free” means “liberated” as in “free spirit”. “Free” can also mean “free of charge”. “Free” is much more likely to appear on an online store’s website together with “shipping” than with “spirit” or on its own. Taking collocations into account makes semantic analysis more accurate.
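A first approximation of collocation analysis is to count adjacent word pairs (bigrams). The sketch below is an assumption-laden simplification: it uses raw counts and ignores sentence boundaries, whereas practical tools usually rank pairs with an association measure such as pointwise mutual information.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count pairs of adjacent words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(zip(words, words[1:]))

page = "Free shipping on all orders. Free shipping and free returns this week."
counts = bigram_counts(page)
print(counts[("free", "shipping")])  # 2
print(counts[("free", "spirit")])    # 0
```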

Concordance analysis

A concordance is a table that lists the occurrences of a word together with the context around each one, which reveals the different meanings the word takes in different contexts. Here is an example from a contextual dictionary showing how different people use the word “concordance”:

[Image: contextual dictionary entries for the word “concordance”]

Contextual dictionaries are good for language learners because they contain real-life examples showing different ways of using the same word. They are just as good for machine translation and speech generation systems.
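A simple concordance can be built by collecting a few words on either side of every occurrence of a keyword (often called a keyword-in-context view). The function name, the window width, and the sample text below are assumptions for this sketch.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, match, right context) for each occurrence."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width): i])
            right = " ".join(tokens[i + 1: i + 1 + width])
            lines.append((left, token, right))
    return lines

text = "Shipping is free today . She has a free spirit ."
for left, word, right in concordance(text.split(), "free"):
    print(f"{left:>15} [{word}] {right}")
```

Each printed row shows the keyword in the middle, so the two senses of “free” can be compared at a glance.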

Concordance and collocation analysis are useful for keyword meaning disambiguation.

Using these basic techniques, you can proceed to more advanced types of ML text analysis.

Text classification

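ML algorithms detect patterns in data and assign each text to a category. (When the categories are not known in advance and the algorithm groups similar texts on its own, the task is called clustering.) Let us talk a bit more about typical text classification tasks.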

Sentiment analysis

Sentiment analysis, or opinion mining, identifies and studies emotions in the text.

The emotions of the author are important for understanding texts. Sentiment analysis (SA) lets you classify the polarity of opinions about a new product or assess a brand’s reputation. It can be applied to reviews, surveys, and social media posts. Because SA models look at context rather than isolated words, they handle negation and mixed opinions better than plain word counts, although sarcasm remains hard to detect reliably.
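The simplest form of polarity classification is a word list with a rule for negation. The sketch below is only a baseline under stated assumptions: the word lists, the one-word negation window, and the `polarity` function are invented for this example, and production systems typically use trained models instead.

```python
import re

# Assumed, deliberately small word lists.
POSITIVE = {"good", "great", "interesting", "love", "excellent"}
NEGATIVE = {"bad", "boring", "slow", "hate", "terrible"}
NEGATIONS = {"not", "never", "no"}

def polarity(text):
    """Classify text as 'positive', 'negative', or 'neutral'."""
    words = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, word in enumerate(words):
        value = (word in POSITIVE) - (word in NEGATIVE)  # +1, -1, or 0
        if i > 0 and words[i - 1] in NEGATIONS:
            value = -value  # flip polarity right after a negation word
        score += value
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("The book was interesting and the plot was great"))  # positive
print(polarity("Not good, rather boring"))                          # negative
print(polarity("Oh great, another delay"))                          # positive
```

The last call shows the limit of this baseline: a sarcastic “great” is still counted as praise.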

Topic analysis

Topic analysis classifies texts by subject and can make people’s lives easier in many domains. Finding books in a library, goods in a store, or customer support tickets in a CRM would be much harder without it. Text classifiers can be tailored to your needs.
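One classic way to tailor a classifier to your own topics is multinomial Naive Bayes trained on a few labeled examples. The sketch below is written from scratch for illustration; the function names, the four sample tickets, and the two labels are assumptions, and a real project would use far more data and most likely an existing library.

```python
import math
import re
from collections import Counter, defaultdict

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    """examples: list of (text, label). Returns word counts and label counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(words(text))
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label with the highest log-probability for the text."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # prior
        for w in words(text):
            # Add-one (Laplace) smoothing avoids log(0) for unseen words.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

TICKETS = [
    ("I was charged twice for my order", "billing"),
    ("Refund has not arrived on my card", "billing"),
    ("The app crashes when I log in", "technical"),
    ("Cannot log in after the update", "technical"),
]
model = train(TICKETS)
print(classify("The app crashes after the update", *model))  # technical
print(classify("Charged twice, need a refund", *model))      # billing
```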

Content tagging

Students and professors, lawyers, scientists and laboratory assistants can all benefit from the use of text classification technology. Since they are dealing with massive amounts of unstructured data on a daily basis, tagging and classifying texts into categories would make their lives much easier.

Meaning extraction

With the help of text analysis, it is possible to extract keywords, prices, features, and other important information. A marketer can conduct competitor analysis and find out about competitors’ prices and special offers in just a few clicks.

Keyword extraction

Techniques that help to identify keywords and measure their frequency are useful to summarize the contents of texts, find an answer to a question, index data, and generate word clouds.
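Raw frequency alone tends to rank words like “the” highest, so a common refinement is TF-IDF, which weights a word’s frequency in one document against how many documents contain it. The sketch below is a minimal from-scratch version; the function names, the unsmoothed formula, and the sample documents are assumptions for the example.

```python
import math
import re
from collections import Counter

def words(text):
    return re.findall(r"[a-z]+", text.lower())

def top_keywords(docs, doc_index, k=1):
    """Rank the words of one document by term frequency * log(N / doc freq)."""
    tokenized = [words(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    tf = Counter(tokenized[doc_index])
    scores = {w: count * math.log(n_docs / doc_freq[w]) for w, count in tf.items()}
    # Sort by score (descending), then alphabetically to break ties.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:k]]

DOCS = [
    "the battery life of the phone is great",
    "the camera of the phone is sharp, the camera app is fast",
    "the delivery of the order is late",
]
print(top_keywords(DOCS, 1))
# ['camera']
```

Words that occur in every document, such as “the”, get a score of zero and drop out of the ranking.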

Entity recognition

Entities are people, companies, or locations mentioned in the text. Recognizing them can be useful in machine translation so that the program does not translate last names or brand names. Moreover, entity recognition is indispensable for market analysis and competitor analysis in business.
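The most basic approach is a gazetteer: a list of known names that is matched against the text. The names, entity types, and `find_entities` function below are hypothetical and only illustrate the lookup; modern systems use trained sequence models that can also recognize names they have never seen.

```python
# Hypothetical gazetteer: known entity names and their types.
GAZETTEER = {
    "Jane Smith": "PERSON",
    "Berlin": "LOC",
    "Acme Corp": "ORG",
}

def find_entities(text):
    """Return (name, type) for each gazetteer entry found, in text order."""
    found = []
    for name, kind in GAZETTEER.items():
        start = text.find(name)
        while start != -1:
            found.append((start, name, kind))
            start = text.find(name, start + len(name))
    return [(name, kind) for _, name, kind in sorted(found)]

print(find_entities("Jane Smith moved from Berlin to join Acme Corp."))
# [('Jane Smith', 'PERSON'), ('Berlin', 'LOC'), ('Acme Corp', 'ORG')]
```

A translation system could use spans found this way to leave names untouched.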

Practical applications of ML text analysis

What are the practical applications of ML text analysis techniques? We’ve tried to mention the most common ones.

Natural language processing

NLP is what helps the machines to comprehend human language and act according to the requests. NLP systems are used for chatbots, smart assistants, and voice recognition security systems.

Social media monitoring

How much do people love your brand? Twitter, Facebook, and Instagram are the places where users share their impressions and leave good and bad reviews about the places they have visited and the products they have tried. You can see how your company is perceived in general or focus on a specific product.

Customer service

Trusting routine work to ML means that employees can focus on tasks that demand human attention. ML text analysis helps with ticket tagging, identifying the problem, and assigning it to the right person. Based on the keywords, ML systems can prioritize requests.
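Keyword-based prioritization can be sketched as a weighted lookup. Everything specific here is an assumption for illustration: the keywords, their weights, and the function names. A deployed system would more likely learn priorities from historical tickets.

```python
import re

# Assumed keyword weights; higher means more urgent.
URGENCY_KEYWORDS = {"outage": 3, "urgent": 3, "crash": 2, "refund": 2, "error": 1}

def priority(ticket):
    """Sum the weights of the urgency keywords that appear in the ticket."""
    words = set(re.findall(r"[a-z]+", ticket.lower()))
    return sum(weight for kw, weight in URGENCY_KEYWORDS.items() if kw in words)

def prioritize(tickets):
    """Return tickets ordered from most to least urgent."""
    return sorted(tickets, key=priority, reverse=True)

queue = [
    "How do I change my avatar?",
    "Urgent: payment error after outage",
    "App crash on login",
]
for ticket in prioritize(queue):
    print(priority(ticket), ticket)
```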

Business intelligence

In BI, preference is given to numbers. They are great for understanding trends and statistics. However, numbers can’t provide you with the reasons why things are happening. ML algorithms that analyze textual data can provide valuable insights by analyzing both internal and external data.

Sales and marketing

Analyze client and competitor profiles by parsing through their data and get a more detailed understanding of the situation on the market. Based on this data, you can provide more personalized sales offers. ML text analysis is used to analyze and write emails to help the sales team effectively communicate with customers.

SEO

SEO tools rely on machine learning when analyzing the content on web pages. If you want your website to be shown high in the search results, you should optimize it for the search engine. You can identify the topics other people in your field write about using keyword parsers and make your content more useful to the target audience.

Software for people with disabilities

ML text analysis helps to give voice to people with speech disabilities. By using text-to-speech technology, machine learning systems vocalize input text. It is possible to generate an original and unique voice for each user based on their own voice (if applicable). This software enables people with disabilities to communicate with other people and use voice-activated interfaces.

Robotics

Robots need to understand human speech and communicate with people, which would be impossible without ML text analysis. Moreover, sentiment analysis techniques allow them to get a bit better at understanding human emotions and acting accordingly. Robots that have been trained using ML text analysis models can read and interpret texts and the data they contain.

Challenges of ML text analysis

According to a recent study, about 80% of all data generated in enterprises is in the form of texts. A lot of insights can be drawn from it.

But ML textual analysis also presents some challenges:

  • Complexity. Transforming text into a format that can be processed by the computer requires several steps. For example, if we are solving a text classification problem, we need to collect the data, detect the keywords in it, define a number of classes, group the data according to these classes, and describe these processes in mathematical terms. It’s challenging both intellectually and in terms of people, money, and time.
  • Conceptual struggles. Computers don’t understand the concepts behind words, so working with homographs is difficult for them. Programmers have to come up with effective tools for word sense disambiguation in order to handle sentences such as ‘Will, will Will will Will Will’s will?’. Google Translate, for example, could not cope with this sentence at the time of writing.
  • Understanding culture. Understanding human speech means understanding people’s emotions. One of the hardest emotions for a computer to grasp is sarcasm. Continuing the topic of disambiguation, the same meaning in different cultures can be expressed by different words such as slang or local variants. What is a “jumper” to a Brit is a “sweater” to an American. A computer program needs experience and cultural background to communicate effectively with speakers who use less conventional forms of language.
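To give a sense of what “a format that can be processed by the computer” means, here is a minimal bag-of-words sketch that turns each text into a vector of word counts. The function names and sample sentences are assumptions for the example.

```python
import re

def build_vocabulary(docs):
    """Map every distinct word to a column index (alphabetical order)."""
    vocab = sorted({w for d in docs for w in re.findall(r"[a-z]+", d.lower())})
    return {word: i for i, word in enumerate(vocab)}

def vectorize(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    vector = [0] * len(vocab)
    for w in re.findall(r"[a-z]+", doc.lower()):
        if w in vocab:
            vector[vocab[w]] += 1
    return vector

DOCS = ["the will is valid", "Will will sign the will"]
vocab = build_vocabulary(DOCS)
print(vocab)                      # {'is': 0, 'sign': 1, 'the': 2, 'valid': 3, 'will': 4}
print(vectorize(DOCS[1], vocab))  # [0, 1, 1, 0, 3]
```

The second vector also illustrates the homograph problem: the name “Will”, the verb “will”, and the noun “will” all land in the same column, so the representation alone cannot tell them apart.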

Conclusion

ML text analysis is a technology that is used in various industries from marketing and sales to robotics. Special models help to teach the machine to work with such data and draw valuable conclusions from it. All in all, it can be a valuable technique for generating insights for your product or for your business.
