The 7 Basic Functions of Text Analytics
Text analytics and natural language processing are often portrayed as ultra-complex
computer science functions that can only be understood by trained data scientists.
But the core concepts are pretty easy to understand even if the actual technology is
quite complicated. In this article I’ll review the basic functions of text analytics and
explore how each contributes to deeper natural language processing features.
Quick background: text analytics (also known as text mining) refers to a discipline of
computer science that combines machine learning and natural language processing
(NLP) to draw meaning from unstructured text documents. Text mining is how a
business analyst turns 50,000 hotel guest reviews into specific recommendations;
how a workforce analyst improves productivity and reduces employee turnover; how
healthcare providers and biopharma researchers understand patient experiences;
and much, much more.
Okay, now let’s get down and dirty with how text analytics really works.
There are 7 basic steps involved in preparing an unstructured text document for
deeper analysis:
1. Language Identification
2. Tokenization
3. Sentence Breaking
4. Part of Speech Tagging
5. Chunking
6. Syntax Parsing
7. Sentence Chaining
Each step is achieved on a spectrum between pure machine learning and pure
software rules. Let’s review each step in order, and discuss the contributions of
machine learning and rules-based NLP.
1. Language Identification
Fig. 1 – Lexalytics’ text analytics technology and NLP feature stack, showing the
layers of processing each text document goes through to be transformed into
structured data.
The first step in text analytics is identifying what language the text is written in.
Spanish? Singlish? Arabic? Each language has its own idiosyncrasies, so it’s
important to know what we’re dealing with.
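To make this step concrete, here's a minimal sketch using the open-source langdetect package. It is purely illustrative (not how Lexalytics identifies languages); it just shows what goes in and what comes out of the step.

```python
# Purely illustrative: langdetect (pip install langdetect) is not the
# Lexalytics language identifier, just an easy way to show the step.
from langdetect import detect

samples = [
    "The quick brown fox jumps over the lazy dog.",
    "El zorro marrón salta sobre el perro perezoso.",
]

for text in samples:
    # detect() returns an ISO 639-1 code such as "en" or "es"
    print(detect(text), "<-", text)
```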
2. Tokenization
Now that we know what language the text is in, we can break it up into
pieces. Tokens are the individual units of meaning you’re operating on. These can
be words, phonemes, or even full sentences. Tokenization is the process of
breaking a text document apart into those pieces.
In text analytics, tokens are most frequently just words. A sentence of 10 words,
then, would contain 10 tokens. For deeper analytics, however, it’s often useful to
expand your definition of a token. For Lexalytics, tokens can be:
Words
Punctuation (exclamation points intensify sentiment)
Hyperlinks (https://…)
Possessive markers (apostrophes)
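To illustrate, here's a toy tokenizer built from a single regular expression. It is nothing like a production tokenizer, but it keeps hyperlinks, words, possessive markers, and punctuation as separate tokens, mirroring the list above.

```python
import re

# Toy tokenizer, not the Lexalytics implementation: URLs, words (with optional
# contraction/possessive endings), and punctuation each come out as one token.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"       # hyperlinks
    r"|\w+(?:'\w+)?"      # words, optionally with an 's / n't style ending
    r"|[^\w\s]"           # individual punctuation marks
)

def tokenize(text):
    return TOKEN_PATTERN.findall(text)

print(tokenize("Lexalytics' demo is great! See https://www.lexalytics.com"))
# ['Lexalytics', "'", 'demo', 'is', 'great', '!', 'See', 'https://www.lexalytics.com']
```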
3. Sentence Breaking
Once you’ve identified the tokens, you can tell where the sentences end. (See, look
at that period right there, you knew exactly where the sentence ended, didn’t you Dr.
Smart?)
But look again at the second sentence above. Did it end with the period at the end of
“Dr.”?
Now check out the punctuation in that last sentence. There’s a period and a question
mark right at the end of it!
The point is, before you can run deeper text analytics functions (such as syntax parsing),
you must be able to tell where one sentence ends and the next begins. Sometimes it’s a
simple process, and other times… not so much.
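To see how quickly it stops being simple, here's a small, purely illustrative sketch: a naive splitter that breaks on every terminal punctuation mark, and a slightly smarter one that knows a handful of abbreviations. Real sentence breakers (Lexalytics' included) are far more sophisticated, but they are solving exactly this problem.

```python
import re

# Illustrative only: real sentence breakers use trained models,
# not a hard-coded abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "etc."}

def naive_split(text):
    # Break after every '.', '!' or '?' that is followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text)

def smarter_split(text):
    out = []
    for piece in naive_split(text):
        # If the previous piece ended in a known abbreviation, glue it back on.
        if out and out[-1].split()[-1].lower() in ABBREVIATIONS:
            out[-1] += " " + piece
        else:
            out.append(piece)
    return out

text = "You knew exactly where the sentence ended, didn't you Dr. Smart? Of course you did."
print(naive_split(text))    # wrongly breaks after "Dr."
print(smarter_split(text))  # keeps "Dr. Smart?" in one sentence
```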
4. Part of Speech Tagging
Part of Speech tagging (or PoS tagging) is the process of determining the part of
speech of every token in a document, and then tagging it as such.
For example, we use PoS tagging to figure out whether a given token represents a
proper noun or a common noun, or if it’s a verb, an adjective, or something else
entirely.
Part of Speech tagging may sound simple, but much like an onion, you’d be
surprised at the layers involved – and they just might make you cry. At Lexalytics,
due to our breadth of language coverage, we’ve had to train our systems to
understand 93 unique Part of Speech tags.
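Here's a quick look at what PoS tagging produces, using spaCy's small English model as an illustrative stand-in; it uses the Penn Treebank tagset rather than Lexalytics' 93 tags, but the idea is the same.

```python
# Illustration only, using spaCy rather than the Lexalytics tagger:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The tall man is going to quickly walk under the ladder.")

for token in doc:
    # pos_ is the coarse category, tag_ the fine-grained Penn Treebank tag
    print(f"{token.text:<8} {token.pos_:<6} {token.tag_}")
```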
5. Chunking
Let’s move on to the text analytics function known as Chunking (a few people call
it light parsing, but we don’t). Chunking refers to a range of sentence-breaking
systems that splinter a sentence into its component phrases (noun phrases, verb
phrases, and so on).
Before we move forward, I want to draw a quick distinction between Chunking and
Part of Speech tagging in text analytics. PoS tagging labels each individual token
(noun, verb, adjective, and so on), while Chunking groups those tokens into phrases.
Take the sentence “the tall man is going to quickly walk under the ladder.”
Chunking will return: [the tall man]_np [is going to quickly walk]_vp [under the
ladder]_pp
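Here's a rough sketch of that grouping using NLTK's RegexpParser over hand-supplied PoS tags. The grammar is a toy, not the Lexalytics chunker, but it produces the same kind of phrase structure:

```python
import nltk

# Hand-tagged tokens so the example needs no model downloads.
tagged = [("the", "DT"), ("tall", "JJ"), ("man", "NN"), ("is", "VBZ"),
          ("going", "VBG"), ("to", "TO"), ("quickly", "RB"), ("walk", "VB"),
          ("under", "IN"), ("the", "DT"), ("ladder", "NN")]

# A toy grammar: chunk noun phrases, then prepositional phrases, then verb groups.
grammar = r"""
  NP: {<DT>?<JJ>*<NN.*>+}      # e.g. "the tall man", "the ladder"
  PP: {<IN><NP>}               # e.g. "under [the ladder]"
  VP: {<VB.*><VB.*|TO|RB>*}    # e.g. "is going to quickly walk"
"""
chunker = nltk.RegexpParser(grammar)
print(chunker.parse(tagged))
# Roughly: (S (NP the tall man) (VP is going to quickly walk) (PP under (NP the ladder)))
```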
6. Syntax Parsing
The syntax parsing sub-function is a way to determine the structure of a sentence.
In truth, syntax parsing is really just fancy talk for sentence diagramming. But it’s a
critical preparatory step in sentiment analysis and other natural language processing
features.
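For a sense of what a parse looks like, here's a dependency-parsing sketch with spaCy (again, an illustrative stand-in rather than the Lexalytics parser). Every token is attached to a head word through a labelled grammatical relation, which is what downstream features like sentiment analysis lean on.

```python
# Illustrative stand-in for syntax parsing, using spaCy's dependency parser:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The tall man is going to quickly walk under the ladder.")

for token in doc:
    # Each token points at its syntactic head via a labelled dependency arc.
    print(f"{token.text:<8} --{token.dep_}--> {token.head.text}")
```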
7. Sentence Chaining
The final step in preparing unstructured text for deeper analysis is sentence
chaining, sometimes known as sentence relation.
Lexalytics utilizes a technique called “lexical chaining” to connect related sentences.
Lexical chaining links individual sentences by each sentence’s strength of
association to an overall topic.
Even if sentences appear many paragraphs apart in a document, the lexical chain
will flow through the document and help a machine detect over-arching topics and
quantify the overall “feel”.
In fact, once you’ve drawn associations between sentences, you can run complex
analyses, such as comparing and contrasting sentiment scores and quickly
generating accurate summaries of long documents.
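As a very crude illustration of the idea (this is not Lexalytics' lexical chaining algorithm), the sketch below links any two sentences that share a content word, no matter how far apart they sit in the document:

```python
import re
from itertools import combinations

# Crude stand-in for lexical chaining: two sentences are "related" if they
# share at least one non-stopword term.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "in", "of", "to", "and", "all"}

def content_words(sentence):
    return {w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS}

sentences = [
    "The hotel room was spotless and quiet.",
    "Breakfast started late every morning.",
    "Housekeeping kept the room spotless all week.",
]

for (i, a), (j, b) in combinations(enumerate(sentences), 2):
    shared = content_words(a) & content_words(b)
    if shared:
        print(f"sentence {i} <-> sentence {j}: shared terms {sorted(shared)}")
# sentence 0 <-> sentence 2: shared terms ['room', 'spotless']
```

A real lexical chain would weight each link by the sentence's strength of association to a topic rather than a binary overlap, which is what lets you roll sentence-level sentiment up into document-level themes and summaries.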
I’m a big fan of the Wikipedia article on this subject (don’t tell my high school English
teacher). Note that Wikipedia considers Text Analytics and Text Mining to be one
and the same thing. I don’t necessarily agree with that position, but we’ll discuss that
another time.