The best apps are intuitive. A user shouldn’t be required to spend time figuring them out or consulting a help doc. So it’s no surprise that engineers and developers are looking for ways to make applications that use the best tool we have for interaction: language.
Machines are fast and accurate when it comes to processing structured data, but that’s not how humans communicate. Our language is unstructured data. So in order for machines to communicate in our language, they must first turn our unstructured language into their structured language using natural language processing (NLP).
In this article, we’re going to help you understand natural language processing. Then we’ll lay out the most common NLP techniques that you (and your development team) can fold into your applications.
What is Natural Language Processing?
Natural language processing is a subfield of artificial intelligence that helps us build systems that understand the meaning behind human language. It’s a marriage of computer science and linguistics that lets algorithms understand, analyze, categorize, and extract meaning from written and spoken words. It powers language translators, chatbots, recommendation systems (like how Netflix knows what you like), and everyday gadgets like Alexa and Siri.
NLP starts by analyzing the components of language, such as semantics, syntax, morphology, and pragmatics, to capture its structure and meaning. A step called text vectorization then converts the text into numbers that a machine can work with. Those numbers are passed to rule-based or machine learning algorithms that solve problems and perform predefined actions.
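As a rough illustration of text vectorization, the sketch below turns sentences into word-count vectors (a bag-of-words model). This is only one simple way to vectorize text, and the function names and sample sentences are invented for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters; a deliberately crude tokenizer.
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    # Map every distinct word to a column index (sorted for a stable order).
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def vectorize(text, vocabulary):
    # Count how often each vocabulary word occurs; unknown words are ignored.
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]


docs = ["The cat sat", "The dog sat on the cat"]
vocab = build_vocabulary(docs)  # columns: cat, dog, on, sat, the
vectors = [vectorize(doc, vocab) for doc in docs]
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each sentence becomes a fixed-length list of numbers, which is the kind of structured input an algorithm can compare or learn from.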
How do the algorithms know how to understand the text? We feed them training data (text with expected outputs) beforehand to help them learn. Over time, they build a knowledge bank they draw on when you start feeding them unseen data (new text). When they are retrained on more data, they continue to learn and refine themselves.
Gmail is a popular tool that uses natural language processing well. Emails are categorized as primary, promotions, social, and spam based on the words in their subject lines and body. If you use Gmail, you probably think the sorting is pretty accurate. That’s because its machine learning system has consumed a lot of data to learn.
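To make the idea of learning from training data concrete, here is a toy classifier built on a naive Bayes model with add-one smoothing. It is a minimal sketch, not a description of how Gmail works: the labels, sample texts, and function names are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def train(examples):
    # examples: (text, label) pairs, i.e. text with expected outputs.
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of training texts
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(text, model):
    word_counts, label_counts, vocabulary = model
    total_texts = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihood of each known word (add-one smoothing).
        score = math.log(label_counts[label] / total_texts)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:
                count = word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


training_data = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for monday", "primary"),
    ("lunch on monday with the team", "primary"),
]
model = train(training_data)
print(predict("claim your free prize", model))        # spam
print(predict("agenda for the team meeting", model))  # primary
```

The two test messages never appear in the training data, so they play the role of unseen text. With only four training examples the model is fragile; real systems learn from far more data.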
Common Natural Language Processing Techniques
Now that you understand how NLP works, let’s go over the common NLP techniques that might support your applications. This isn’t an exhaustive list. There are plenty of sophisticated but less common NLP techniques, but this list will give you an idea of the technology’s potential.
1. Named Entity Recognition (NER)
This is one of the most useful natural language processing techniques. It’s the process of identifying and extracting the entities in the text. The algorithm finds people, locations, organizations, dates, etc. This helps the algorithm pick out the fundamental concepts and references of a text.
This technique is often used for categorization. For example, an algorithm could scan and extract entities from articles to sort the content into categories. But you wouldn’t have to create the category beforehand. The algorithm would generate new categories whenever it discovered a new name.
NER generally uses grammar rules and supervised models to find entities. For instance, proper nouns are typically capitalized, so a capitalized word in the middle of a sentence is a useful clue that it names an entity.
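The capitalization clue can be sketched as a tiny rule-based extractor. This is a rough illustration under simplifying assumptions, not a production NER system: it only finds candidate entities, cannot tell a person from a place, misses names at the start of a sentence, and would wrongly flag the pronoun “I.” The function name and sample text are made up for this example.

```python
import re

PUNCTUATION = ".,!?;:\"'()"


def find_candidate_entities(text):
    entities = []
    # Split into sentences on ., !, or ? followed by whitespace (a crude rule).
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        run = []  # consecutive capitalized words, e.g. ["New", "Mexico"]
        for index, word in enumerate(sentence.split()):
            cleaned = word.strip(PUNCTUATION)
            # Skip the first word: it is capitalized whether or not it is a name.
            is_candidate = index > 0 and cleaned[:1].isupper()
            if is_candidate:
                run.append(cleaned)
            # Close the run on a lowercase word or on trailing punctuation.
            if run and (not is_candidate or word[-1] in PUNCTUATION):
                entities.append(" ".join(run))
                run = []
        if run:
            entities.append(" ".join(run))
    return entities


sample = (
    "Yesterday Maria Lopez flew from New Mexico to Berlin. "
    "She met engineers at Acme Corp, then left."
)
print(find_candidate_entities(sample))
# ['Maria Lopez', 'New Mexico', 'Berlin', 'Acme Corp']
```

A supervised model would replace the hand-written rule with patterns learned from labeled examples, and would also assign each entity a type.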
2. Sentiment Analysis
Sentiment analysis is one of the most widely used NLP techniques. It’s also known as emotion AI or opinion mining.
You’ll see sentiment analysis used in reviews, customer surveys, social media comments, and other places where people express their opinions and give feedback. This technique helps the algorithm understand how people feel about a particular issue.
In its simplest form, sentiment analysis evaluates a string of text using a point system. Positive words award points. Negative words deduct points. Neutral words are worth 0.
Let’s say a user leaves a comment like this: “The support team was rude and abusive.”
Sentiment analysis would recognize two negative words, “rude” and “abusive,” and deduct a point for each. The final score would be -2.
In more complex systems, you could categorize words as more or less influential than others. For instance, “nice,” a positive word, might award one point, but “awesome,” a more positive word, might award two points.
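The point system can be sketched in a few lines. The lexicon here is a hypothetical four-word example; a real one would hold thousands of scored words.

```python
import re

# Hypothetical lexicon: each word maps to the points it adds or deducts.
LEXICON = {
    "nice": 1,
    "awesome": 2,   # a stronger positive word earns more points
    "rude": -1,
    "abusive": -1,
}


def sentiment_score(text):
    words = re.findall(r"[a-z]+", text.lower())
    # Words missing from the lexicon are neutral and count as 0.
    return sum(LEXICON.get(word, 0) for word in words)


print(sentiment_score("The support team was rude and abusive."))  # -2
print(sentiment_score("The new dashboard is awesome"))            # 2
```

Because the score is a plain sum, this sketch ignores negation (“not nice” still scores 1), which is one reason real systems go beyond simple word counting.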
Sentiment analysis becomes less useful in longer strings of text that express multiple sentiments. A blog post, for example, would include lots of positive and negative words that relate to different subjects, so your results wouldn’t be reliable. But this technique is great for single sentences.
3. Tokenization
Tokenization is a critical component of retrieving information. It’s a way of “preprocessing” text so algorithms can reliably understand the meaning.
Tokenization is the process of breaking up strings of words into semantically useful units. Each unit is called a token, which could be a word, phrase, number, punctuation, etc. You can use sentence tokenization to split sentences within a text and word tokenization to split words within the sentence.
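The purpose of tokenization is to give later steps, such as searching and indexing, small, consistent units to work with. Here’s an example of how it simplifies text when combined with light cleanup (dropping “The” and the punctuation, and expanding the contraction):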
User input: The flavor couldn’t be better!
Tokens: “flavor” “could” “not” “be” “better”
Tokens don’t need to be single words, however. You can tokenize phrases to help algorithms understand meaning. For instance, “New Mexico” has a different meaning than “new” and “mexico.”
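Sentence tokenization, word tokenization, and phrase tokens can each be sketched with regular expressions. This is a minimal illustration with made-up function names. It keeps every word and punctuation mark as a token; expanding contractions or dropping words would be separate steps.

```python
import re


def sentence_tokenize(text):
    # Split after ., !, or ? when followed by whitespace (a crude rule).
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def word_tokenize(sentence):
    # A token is a word (apostrophes allowed inside) or one punctuation mark.
    return re.findall(r"\w+(?:['’]\w+)*|[^\w\s]", sentence)


def merge_phrases(tokens, phrases):
    # Join adjacent tokens that form a known two-word phrase into one token.
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:
            merged.append(" ".join(pair))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


text = "I moved to New Mexico. The flavor couldn't be better!"
sentences = sentence_tokenize(text)
print(sentences)
# ['I moved to New Mexico.', "The flavor couldn't be better!"]
print(word_tokenize(sentences[1]))
# ['The', 'flavor', "couldn't", 'be', 'better', '!']
print(merge_phrases(word_tokenize(sentences[0]), {("New", "Mexico")}))
# ['I', 'moved', 'to', 'New Mexico', '.']
```

The phrase list passed to `merge_phrases` is an assumption of this sketch; in practice it might come from a gazetteer or from named entity recognition.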
4. Stemming and Lemmatization
When we write or speak, we use words in many different grammatical forms. This creates a lot of complexity for algorithms, so NLP uses stemming and lemmatization to reduce words to their root forms.
Stemming is the process of reducing words to their word stem. Basically, this means removing suffixes and extra pieces that appear due to conjugation. For example, the word “asking” would be reduced to “ask.”
Lemmatization is the process of reducing a word to its dictionary form, called a lemma. For instance, “be” is the lemma of “is,” “are,” “am,” “been,” and “were.”
In many cases, stemming reduces words to forms that aren’t real words. For instance, an aggressive stemmer might reduce “having” to “hav.” That’s where lemmatization comes in: it uses vocabulary and grammar to return a real word (“have”). Stemming is fast, but lemmatization is more accurate.
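The difference is easy to see in a toy implementation. The stemmer below blindly strips a few suffixes, while the lemmatizer looks words up in a small table. Both are simplified stand-ins with invented names and data; real stemmers (such as the Porter stemmer) and lemmatizers use far richer rules and dictionaries, and their output for a given word may differ from this sketch.

```python
# Suffixes are tried in order; this list is a small, hypothetical sample.
SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {
    "is": "be", "are": "be", "am": "be", "been": "be", "were": "be",
    "having": "have",
}


def lemmatize(word):
    # Unknown words are returned unchanged.
    return LEMMAS.get(word, word)


for word in ["asking", "having", "were"]:
    print(word, "->", stem(word), "|", lemmatize(word))
# asking -> ask | asking
# having -> hav | have
# were -> were | be
```

The stemmer needs no dictionary, which is why it is fast, but it cannot connect “were” to “be” and it produces the non-word “hav.” The lemmatizer gets both right, but only for words in its table.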
5. Stopword Removal
Stopword removal is a natural language processing technique that filters out high-frequency words that don’t add semantic value to a sentence, such as “which,” “the,” “a,” “to,” “for,” etc. You can customize your own list of stopwords. This lets your app focus on important words without being led astray by irrelevant words.
For example, let’s say you have a customer service support app. You want the app to filter tickets to the right support representative based on the customer’s issue. A customer sends this message: “Hello, I’m unable to reset my password.”
In this case, it’s smart to add “hello,” “I’m,” “to,” and “my” to your stopword list so you’re left with just “unable reset password.” That phrase is clean, simple, and hard to misunderstand.
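Here is a minimal sketch of that filter. The stopword list combines the general examples above with the custom words for the support app; the function name is made up for this example.

```python
import re

# General stopwords plus the custom entries for the support app.
STOPWORDS = {"which", "the", "a", "to", "for", "hello", "i'm", "my"}


def remove_stopwords(text, stopwords=STOPWORDS):
    # Lowercase and normalize curly apostrophes so "I’m" matches "i'm".
    normalized = text.lower().replace("’", "'")
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", normalized)
    return [token for token in tokens if token not in stopwords]


message = "Hello, I’m unable to reset my password."
print(" ".join(remove_stopwords(message)))  # unable reset password
```

Keeping the list as a parameter makes it easy to swap in a different set of stopwords per application.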
6. Word Sense Disambiguation (WSD)
Word sense disambiguation is the process of using context to understand the meaning of words. After all, the same words can have different meanings.
Take the word “Java,” for instance. By itself, we can’t be sure if it refers to coffee, the programming language, or the island in Indonesia. But if we look at a longer sentence (“I want to learn to write Java”) we get more clarity.
Some word sense disambiguation techniques use a knowledge-based approach that infers meaning from the dictionary definition of a word. Others use a supervised approach, which is based on algorithms that learn from training data.
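The knowledge-based approach can be sketched with a simplified version of the Lesk algorithm, which picks the sense whose definition shares the most words with the surrounding sentence. The three definitions below are written for this example and are not taken from a real dictionary.

```python
import re

# Hypothetical sense inventory: word -> {sense name: short definition}.
SENSES = {
    "java": {
        "coffee": "a hot drink brewed from roasted coffee beans",
        "programming language": "a programming language used to write software and code",
        "island": "an island in Indonesia with volcanoes and a large population",
    }
}

# Function words that should not count as evidence for any sense.
IGNORED = {"a", "an", "the", "to", "of", "in", "and", "i", "with", "from", "used"}


def content_words(text):
    return set(re.findall(r"[a-z]+", text.lower())) - IGNORED


def disambiguate(word, sentence):
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, 0
    for sense, definition in SENSES[word].items():
        overlap = len(context & content_words(definition))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense  # None when no definition overlaps the context


print(disambiguate("java", "I want to learn to write Java"))  # programming language
print(disambiguate("java", "She brewed a hot cup of java"))   # coffee
```

With so little context to go on, a single shared word (“write”) decides the first example. A supervised model would instead learn which context words signal each sense from labeled sentences.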
7. Text Summarization
Text summarization is a technique used to understand large chunks of text by identifying their most important components. It’s often used for news articles and research articles.
Text summarization uses two methods: extraction, which creates a summary by pulling important parts from the broader text, and abstraction, which creates fresh content that explains the theme of the broader text.
As you can imagine, this is a complex process of breaking down the content into sentences, comparing sentences to one another, and then ranking their relevance.
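A bare-bones extractive summarizer follows those three steps. In this sketch, sentences are compared by word overlap (Jaccard similarity), and a sentence counts as relevant when it resembles many other sentences. That scoring rule, the stopword list, and the sample text are assumptions chosen to keep the example short; real summarizers use more sophisticated measures.

```python
import re

STOPWORDS = {"the", "a", "and", "is", "this", "say", "of", "to", "in"}


def content_words(sentence):
    return set(re.findall(r"[a-z]+", sentence.lower())) - STOPWORDS


def similarity(a, b):
    # Jaccard similarity: shared words divided by all distinct words.
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def summarize(text, max_sentences=1):
    # Step 1: break the content into sentences.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_sets = [content_words(s) for s in sentences]
    # Step 2: compare every sentence with every other sentence.
    scores = [
        sum(similarity(words, other) for j, other in enumerate(word_sets) if j != i)
        for i, words in enumerate(word_sets)
    ]
    # Step 3: rank by score, then restore the original order of the winners.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


article = (
    "The city council approved the new park budget. "
    "The park budget funds new trails and playgrounds. "
    "Critics say the council ignored traffic concerns. "
    "Rain is expected this weekend."
)
print(summarize(article, max_sentences=1))
# The city council approved the new park budget.
```

The unrelated sentence about rain shares no words with the others, so it scores zero and is the first to be cut.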
This technique is often used before other techniques are applied. Text summarization cuts out the fat, then some of the other techniques on this list can be used to turn what’s left into structured data for algorithms.
Use Your Tools Wisely
Natural language processing is not an easy technology to implement because human language is naturally ambiguous. Consider sarcasm, exaggeration, or humor, for instance. The meanings locked behind those concepts require deep knowledge of culture, history, and empathy. Machine learning is getting better, but there’s still a ways to go. You need a skilled development team to make it work properly.
This means that while NLP is a tool to help users engage with your app, it isn’t right for every application. Don’t force it into your app unless it makes sense.