Natural Language Processing Succinctly PDF
Natural Language
Processing Succinctly
By
Joseph D. Booth
Foreword by Daniel Jebaraj
Copyright © 2018 by Syncfusion, Inc.
2501 Aerial Center Parkway
Suite 200
Morrisville, NC 27560
USA
All rights reserved.
If you obtained this book from any other source, please register and download a free copy from
www.syncfusion.com.
The authors and copyright holders provide absolutely no warranty for any information provided.
The authors and copyright holders shall not be liable for any claim, damages, or any other
liability arising from, out of, or in connection with the information in this book.
Please do not use this book if the listed terms are unacceptable.
www.dbooks.org
Table of Contents
Eliza
SHRDLU
  Sarcasm
  Exceptions
Summary
  Tagger
Playground
  Installing
Getting started
Summary
IsQuestion
Summary
Contractions
Summary
Chapter 5 Tagging
  Brill steps
Summary
Entity types
  Patterns
  Remembering
  Prompting
Summary
Example domain
  Summary
Second question
Third question
Summary
Summary
  Basic class
Summary
Summary
Categorization
Summary
Playground
The Story Behind the Succinctly Series of Books
Daniel Jebaraj, Vice President
Syncfusion, Inc.
Whenever platforms or tools are shipping out of Microsoft, which seems to be about every other
week these days, we have to educate ourselves, quickly.
While more information is becoming available on the Internet and more and more books are
being published, even on topics that are relatively new, one aspect that continues to inhibit us is
the inability to find concise technology overview books.
We are usually faced with two options: read several 500+ page books or scour the web for
relevant blog posts and other articles. Just like everyone else who has a job to do and customers
to serve, we find this quite frustrating.
We firmly believe, given the background knowledge such developers have, that most topics can
be translated into books that are between 50 and 100 pages.
This is exactly what we resolved to accomplish with the Succinctly series. Isn’t everything
wonderful born out of a deep desire to change things for the better?
Free forever
Syncfusion will be working to produce books on several topics. The books will always be free.
Any updates we publish will also be free.
Free? What is the catch?
There is no catch here. Syncfusion has a vested interest in this effort.
As a component vendor, our unique claim has always been that we offer deeper and broader
frameworks than anyone else on the market. Developer education greatly helps us market and
sell against competing vendors who promise to “enable AJAX support with one click,” or “turn
the moon to cheese!”
We sincerely hope you enjoy reading this book and that it helps you better understand the topic
of study. Thank you for reading.
About the Author
Joseph D. Booth has been programming since 1981 in a variety of languages, including BASIC,
Clipper, FoxPro, Delphi, Classic ASP, Visual Basic, Visual C#, and the .NET Framework. He
has also worked in various database platforms, including DBASE, Paradox, Oracle, and SQL
Server.
Joe has worked for a number of companies including Sperry Univac, MCI-WorldCom, Ronin,
Harris Interactive, Thomas Jefferson University, People Metrics, and Investor Force. He is one
of the primary authors of Results for Research (market research software), PEPSys (industrial
distribution software), and a key contributor to AccuBuild (accounting software for the
construction industry).
He has a background in accounting, having worked as a controller for several years in the
industrial distribution field, but his real passion is computer programming.
In his spare time, Joe is an avid tennis player, practices yoga and martial arts, and plays with
his first granddaughter, Blaire.
Chapter 1 Natural Language Processing
In Star Trek, the actors frequently talk to the computer. The computer understands their
requests and immediately delivers the expected results. While this level of understanding is still
quite a way off, it is one of the goals of Artificial Intelligence—for the computer to accurately
understand English (and other languages) and be able to extract meaning from the words.
return RainCheck(zipCode,date()+1)
The assistant software will determine the probability of rain, and based on that probability, return
an answer, such as “There is a good chance you will need your raincoat.” I asked Cortana that
same question and was shown the weather chart in Figure 1.
Eliza
Back in 1966, Joseph Weizenbaum wrote a program called ELIZA, which was meant to simulate
a Rogerian psychotherapist. A sample "session" is shown in Figure 2.
Figure 2 – Eliza session
Many early computer games were influenced by ELIZA, allowing the user to enter short
sentences instructing the program what to do for the next step. “Colossal Cave” was a very
early computer game that allowed you to explore a “nearby cave” for treasures. It was entirely
text based; you would type in one- or two-word commands to “explore” the world. Figure 3
shows the start of the program.
Eliza, Colossal Cave, and similar games had no understanding, but were simply clever rule-
based programming to “simulate” understanding. However, they did show the potential of using
Natural Language as an interaction method.
SHRDLU
SHRDLU (named after the seventh through twelfth most common letters in English) was written
in 1970 by Terry Winograd, and showed some very impressive Natural Language Processing
capabilities, within the controlled environment of a blocks world. Figure 4 shows a sample dialog
with the program.
The level of understanding was quite impressive, but limited to the blocks in the virtual world.
While the program caused a lot of optimism, researchers began to realize just how complex
modeling the real world could be. While parsing text and understanding it is a big part of NLP,
building a model of all known facts is an incredibly complex task. By limiting the domain to a
reasonable size, Natural Language could be used to help systems, but a computer program that
understands all the nuances and complexities of the real world, and can answer any questions,
is still quite a way off.
Search engines
Internet search engines, such as Google, Microsoft Bing, Yahoo, and DuckDuckGo, all attempt
to interpret the meaning behind your search text. However, it is curious to see how well the
“questions” get answered. For example, for soccer fans, you might ask the question, "Who won
last night's match?" The result could be a list of lottery game winners or multiple sporting event
outcomes.
Part of the reason that companies collect information about you whenever they can is to
give better answers when you search. I play a lot of tennis, so when I ask the question, "Who
won last night's match?" and it is during the US Open dates, I'd really love to see the result of
the tennis matches. You can see this level of personalization with advertisements, particularly
on social media. If you purchase a product from Amazon, your next visit to Facebook will very
likely show you ads related to your purchase. The more a computer system knows about you,
the better your search results can be. Ask Google to "show me restaurants near me," and it will
understand (based on the IP address of your computer) what "near me" means.
Chat bots
A chat bot is a computer program that attempts to respond to questions asked by humans. In
small domains, chat bots can be very helpful, solving common problems. Many businesses rely
on chat bot technology to handle simple requests, like fixing a router, or finding out what is on
TV for a subscriber. Chat bots generally rely on patterns, and they attempt to provide a
response if they detect that a question meets the pattern.
<aiml>
<category>
<pattern>HELLO *</pattern>
<template>
Hello There!
</template>
</category>
</aiml>
This file simply says: if the user enters "HELLO" followed by any text, respond with the template
response of "Hello There!" There are many additional features, such as allowing the response to
be one of several random templates, embedding prior responses, etc.
By keeping the conversation space small, it is possible to create enough patterns and template
responses. There is a chat bot called Mitsuku that uses AIML files to converse as an 18-year-
old woman from Leeds. You can visit the Mitsuku website to see a sample chat bot in action.
Figure 5 shows a sample chat with the bot.
Figure 5 – Sample chat bot
Turing test
Alan Turing, the brilliant cryptologist from World War II, suggested a “test” that could be
performed to indicate if a computer has reached a certain level of intelligence. The test, named
after him, simply allows people to interact via a computer keyboard with an unknown person or
computer on the other side. If a computer manages to convince 30 percent of the people
interacting with it that it is a human being, the test is considered passed. Note that he was not
trying to determine whether the computer could "think" or not, but rather, if the computer could
respond with human-like conversation.
Loebner Prize
The Loebner Prize is a competition to see if any computer software can pass the Turing test.
The gold medal (and $100,000) will be awarded for a program that can pass the test using
visual and audio components, while the silver medal ($25,000) is for a program that passes the
test using text-only messages. The bronze medal is awarded to the most human-like program.
As of the writing of this book, the gold and silver medals have never been awarded.
Context changes meaning
Consider the sentence "He shot an eagle." To a game warden, this is bad news, perhaps a
poacher to deal with. However, to a golfer, this is good news, since an eagle is two strokes
under par.
Sarcasm
Another issue is that of sarcasm. If a reviewer writes "I really enjoyed the loud noise in the
theater," a computer program might interpret that as a positive comment. A person's
background knowledge that theaters are generally quiet would make it clear that the reviewer is
making a negative remark.
Exceptions
In English, there are three general rules for making a word plural.
However, there are many exceptions to the rules. Some words are the same whether singular or
plural (sheep), other words change some letters (man/men), and others have specific plural
forms (child/children).
Summary
The goal of this book is to describe the various components needed to create a system that
appears to understand natural language and will provide reasonable responses to English
questions. We will design a simple system to take question text and provide an answer from a
specific set of data.
We will also cover some APIs from Microsoft, Cloudmersive, and Google that provide the
various methods for an NLP application. With an understanding of the steps involved, you
should be able to use one of the APIs to add natural language support to your application.
If your goal is to create Siri’s older, wiser sister, or pass the next Turing test, hopefully the
descriptions and code presented in this book, and the available APIs, will give you a reasonable
starting point.
Chapter 2 What we’re building
The goal of the book is to build a simple NLP library, and use that library code as part of a
question-answering NLP application. The complete source code to the library and the database
we will be querying are available on the Syncfusion author's website at
http://www.joebooth-consulting.com.
The first few chapters will discuss some of the key functions that are used to parse sentences
into actionable structures to query a dataset. By the end of Chapter 6, you should be able to get
a list of words and tags from a sentence. In Chapters 9–11, we will explore some API calls
(Cloudmersive, Google, and Microsoft) that you can use to get a tagged list of words. Some of
those web services offer additional NLP tasks beyond our goal of question answering, so it is
worth reading them to see what else NLP can do.
In Chapters 7 and 8, we show how to take a tagged list of words and use it to ask questions and
get answers back, first by building the knowledge and code to access it, and then by showing
how to match the questions to the appropriate function to provide an answer. If you are simply
interested in the question-answering side, feel free to skip Chapters 3–6 and use one of the web
services to build your tagged word list.
In the next chapters, we will also make use of another NLP web service product from
Cloudmersive. I suggest registering for an API key with Cloudmersive; it allows you up to 50,000
requests per day on the free account. In this chapter, we will show how to get set up to make
the API calls and use these calls in early chapters to supplement our code.
Tagger
The tagger static class within the NLP project contains the list of words and parts of speech we
want to understand. We have a small list: approximately 500 top verbs, adverbs, and adjectives,
as well as pronouns. You can create your own word list or rely on various web services and
freely available dictionaries to help look up and interpret words you find in a sentence or
question.
DataSet (sample dataset)
Our sample dataset is simply a collection of facts (in our example, tennis major tournaments)
stored as a collection of objects. If your dataset is small and static, this approach could work.
However, for a large or frequently changing dataset, you would most likely pull the data from a
database backend using SQL queries.
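As a rough illustration of "a collection of facts stored as a collection of objects," the sketch below holds a couple of tennis results in memory and queries them with LINQ. The class and member names (TournamentResult, SampleDataSet, ChampionOf) are hypothetical and are not taken from the book's library; the two rows are men's singles results included only as sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical fact type; the book's actual dataset classes may differ.
public class TournamentResult
{
    public string Tournament { get; set; }
    public int Year { get; set; }
    public string Champion { get; set; }
}

public static class SampleDataSet
{
    // A small, static list of facts (men's singles champions).
    public static readonly List<TournamentResult> Results = new List<TournamentResult>
    {
        new TournamentResult { Tournament = "Wimbledon", Year = 2017, Champion = "Roger Federer" },
        new TournamentResult { Tournament = "US Open", Year = 2017, Champion = "Rafael Nadal" }
    };

    // Returns the champion's name, or null when the fact is not in the dataset.
    public static string ChampionOf(string tournament, int year)
    {
        return Results
            .Where(r => r.Year == year &&
                        string.Equals(r.Tournament, tournament, StringComparison.OrdinalIgnoreCase))
            .Select(r => r.Champion)
            .FirstOrDefault();
    }
}
```

A question-answering function would call something like SampleDataSet.ChampionOf("wimbledon", 2017) once the tournament and year have been extracted from the question text.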
Playground
Playground is a simple project that allows you to test out the API calls. It is a Windows
application that allows you to enter some text, and then either parse it, or ask questions of the
dataset. Figure 6 shows the Playground window.
Web service
Cloudmersive is a web services company that offers several different web services to solve
problems that often plague developers. These include OCR, data validation, document
conversion, and my favorite, the Natural Language API calls. In their own words:
The Cloudmersive Natural Language Processing APIs let you perform part of speech tagging,
entity identification, sentence parsing, language detection, text analytics, and much more to help
you understand the meaning of unstructured text across a range of programming languages -
Node.JS, Python, C#, Java, PHP, Objective-C, and Ruby.
Getting registered
To register with Cloudmersive (and get 50,000 free web service calls per day), go to this
webpage and create a login account.
Once you've done that, next time you log in, you will be directed to the Management Center,
where you can manage your API keys. You will need an API key to call the web services. Figure
7 shows the API Keys management page.
Installing
Once you have an API key, you can use NuGet to install the package into your application.
using Cloudmersive.APIClient.NET.NLP.Api;
using Cloudmersive.APIClient.NET.NLP.Client;
using Cloudmersive.APIClient.NET.NLP.Model;
Listing 2 – Language detection API
This is the general structure of the API calls. First, set your API key, and then create an instance
of the API you want to call. Finally, make the call within a try...catch block (since you are
POSTing to a web service) and return the result, or null if the call fails.
You can visit the website to explore the various API calls provided. For convenience, Chapter 9
has sample C# code to call the APIs from Cloudmersive for the internal NLP functions we
discuss in the next few chapters.
Getting started
To have the computer appear to process and understand text, there are a few concepts we
should be comfortable with before we begin.
Expected usage
Knowing what kind of data your application will work with and what kinds of inputs and
questions the user is likely to use can go a long way towards making your application better
able to figure out the question and answer. Siri and other "intelligent assistants" already know a
lot about the user, simply by having access to the information in the phone or device. If you
want to know tomorrow's weather, Siri can access the location information in your phone to
determine the location you are asking about.
If you are adding Natural Language support to a personnel system, you will expect questions
about employees, applicants, roles, etc. The question, "Who is the president?" would be
answered with the company's president name, not the president of the United States. A
scheduling system would answer questions about appointments, free time, etc. If I tell a
scheduler system I want to play tennis with Roger tomorrow, I would hope it would find the
Roger in my contacts, and not put me on the court with Roger Federer.
Knowing your expected usage and data allows you to make assumptions to help return the most
likely response. English is highly ambiguous, so any context we can provide to help resolve the
ambiguities will improve the appearance of understanding.
Domain size
It is overly ambitious to assume that we could create a massive knowledge base that would
understand every conceivable fact in the world. But if we keep the knowledge domain small,
such as vendors and products a business uses, or football teams and stats in the NFL, we could
create a reasonable system to handle basic English-language queries.
Regular expressions
Many of the parsing techniques in the subsequent chapters will use regular expressions (or
regex), which are patterns used to perform string searching. A regex is a compact string that
indicates how data should be searched. It is very powerful, very cryptic, and very easy to get
wrong. To get a sense of regular expressions, let's explore a simple example.
Imagine you wanted to find the word "CAT" in a text string. Pretty easy, right? The regular
expression is simply CAT. However, your needs are a bit more complex: you need any
three-letter word beginning with C and ending with T. You can use C.T (the period represents any
character). But wait, that finds unexpected things, like C#T. No problem: change the expression
to C[A-Z]T. Only vowels allowed? Then let’s go with C[AEIOU]T.
Regular expressions can handle a lot, such as the following expression: ^[0-9]{5}([- /]?[0-9]{4})?$.
This expression asks for five digits, optionally followed by a group made up of an optional dash,
space, or slash and four more digits (such as a United States Postal Service ZIP+4 code).
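To make the zip code expression concrete, here is a minimal sketch that wraps it in a helper. The class and method names (ZipCodeCheck, LooksLikeZip) are made up for illustration; only the pattern itself comes from the text above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper; only the pattern comes from the text.
public static class ZipCodeCheck
{
    // Five digits, then optionally: an optional dash, space, or slash and four more digits.
    private const string ZipPattern = @"^[0-9]{5}([- /]?[0-9]{4})?$";

    public static bool LooksLikeZip(string text)
    {
        return Regex.IsMatch(text, ZipPattern);
    }
}
```

With this pattern, "27560", "27560-1234", and "275601234" all match, while "2756" does not. Because the separator is optional, any run of exactly nine digits is accepted, which may or may not be what your application wants.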
Such a regular expression could read through a corpus of text and try to identify possible zip
codes, phone numbers, email addresses, or URLs. A regular expression is looking for text
patterns that match. The expressions can get very involved. The cryptic expression in Listing 3
shows one way to validate that text looks like a valid file name.
^(([a-zA-Z]:|\\)\\)?(((\.)|(\.\.)|([^\\/:\*\?"\|<>\. ](([^\\/:\*\?"\|<>\.
])|([^\\/:\*\?"\|<>]*[^\\/:\*\?"\|<>\. ]))?))\\)*[^\\/:\*\?"\|<>\.
](([^\\/:\*\?"\|<>\. ])|([^\\/:\*\?"\|<>]*[^\\/:\*\?"\|<>\. ]))?$
Regex syntax is very powerful, but can be very cryptic and hard to read. You do not necessarily
need to understand how regexes work, but you will see references to two key regex methods
throughout the book.
Regex.Split()
This method works just like the regular string Split function, but uses a regular expression to
split the string. For example, using string.Split, we could use the following code to split by
punctuation characters.
StringSplitOptions.RemoveEmptyEntries);
Using the Regex.Split function, we can perform the same functionality with the following code:
You will see examples using Regex.Split in the code when the splitting criteria are a bit more
complex than simple string splitting.
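The comparison between the two splitting styles might look like the sketch below. The punctuation set and the names SplitComparison, WithStringSplit, and WithRegexSplit are assumptions chosen for the example, not the book's own listing.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative comparison; the punctuation set is an assumption.
public static class SplitComparison
{
    public static string[] WithStringSplit(string text)
    {
        char[] punctuation = { '.', ',', ';', ':', '!', '?' };
        return text.Split(punctuation, StringSplitOptions.RemoveEmptyEntries);
    }

    public static string[] WithRegexSplit(string text)
    {
        // The character class lists the same punctuation; + treats a run as one separator.
        string[] parts = Regex.Split(text, @"[.,;:!?]+");

        // Regex.Split has no RemoveEmptyEntries option, so filter empty strings by hand.
        return Array.FindAll(parts, p => p.Length > 0);
    }
}
```

Both methods return the same pieces for simple input such as "Game, set; match!". Note that neither trims the leading space left on the second and third pieces.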
Regex.IsMatch()
This method compares the text with a regular expression pattern and returns a Boolean value
indicating whether the text matches the search rules in the pattern. For example, a simple
regex to test a time string (such as 10:00) is:
@"^([0-1][0-9]|[2][0-3]):([0-5][0-9])$"
The following code snippet will set the IsTime flag to true if the value in the word string looks
like a time value.
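A minimal sketch of such a check is shown here. The wrapper names (TimeCheck, LooksLikeTime) are hypothetical; the pattern is the one given above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical wrapper around the time pattern from the text.
public static class TimeCheck
{
    // Hours 00-23, a colon, then minutes 00-59.
    private const string TimePattern = @"^([0-1][0-9]|[2][0-3]):([0-5][0-9])$";

    public static bool LooksLikeTime(string word)
    {
        return Regex.IsMatch(word, TimePattern);
    }
}
```

A caller could then write bool IsTime = TimeCheck.LooksLikeTime(word);. The pattern requires a two-digit hour, so "10:00" and "09:30" match, but "9:30" and "24:00" do not.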
Summary
Know your application and keep your domain size small—this will help reduce the ambiguities
and give you a better chance to get meaningful results from the inputted text.
If you want to roll your own parsing routines, you should spend a bit of time exploring the regex
syntax, which is very useful for parsing text. You can download my book on regular expressions
here.
Chapter 3 Extracting Sentences
At first glance, it would seem that breaking a text into individual sentences is a trivial task.
Simply use the .NET Split() method, as shown in Listing 4.
The code says to add a new element to the string array as soon as a period, exclamation point,
question mark, or semi-colon is found. If an element is empty, don't include it in the result.
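The naive split described above can be sketched as follows. The class name NaiveSentenceSplit is made up for this example.

```csharp
using System;

// Naive splitter: every delimiter is treated as the end of a sentence.
public static class NaiveSentenceSplit
{
    private static readonly char[] Delimiters = { '.', '!', '?', ';' };

    public static string[] Split(string text)
    {
        // RemoveEmptyEntries drops the empty element that follows a trailing delimiter.
        return text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

For "It rained. We stayed in! Why?" this returns three pieces, with the delimiters removed. For "Mr. Smith left." it returns two pieces ("Mr" and " Smith left"), which hints at why the period needs more careful handling.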
However, English has its own set of punctuation rules, and the period character is used for a lot
more than just an end-of-sentence indicator. Some uses of the period character include:
In addition, the English rules of grammar state that if an acronym or abbreviation ends a
sentence, you should not add a second period.
In NLP parlance, the problem of extracting sentences from text is called sentence boundary
disambiguation. There are multiple approaches to the problem, and we are going to explore two
of them in this chapter.
Mr. Federer won the tennis match against Mr. Nadal on Jan. 28, 2018. Check
the scores at www.tennis.com.
Our expected outcome would be two sentences, telling who won and how to check out the
scores. Let's create a static class, SimpleSentenceSplit, to handle our parsing requirements.
The first step would be to take our test sentence, and change it to the following text:
Mr~ Federer won the tennis match against Mr~ Nadal on Jan~ 28, 2018. Check
the scores at www~tennis~com.
Once the text is converted, the Split function produces the following result.
[0] Mr~ Federer won the tennis match against Mr~ Nadal on Jan~ 28, 2018.
[1] Check the scores at www~tennis~com.
We now loop through the "sentences" and replace the ~ character with the period.
The sentence parser would declare a few settings as constant variables when the class is first
called. This includes your end-of-sentence characters and any anticipated abbreviation.
You can add your own abbreviations; the list shown in Listing 5 is just a sample of the possible
words you might encounter in the text. The code to do the parsing is shown in Listing 6.
}
Note: In the .NET Framework, string.Replace has no case-insensitive overload, so the code
uses the Regex.Replace method to provide case-insensitive replacement.
This approach will work reasonably well, but does require some understanding of the likely
abbreviations your text can expect. You can adjust the delimiters and abbreviations to fine-tune
this parsing strategy. If your goal is to parse a reasonably consistent set of text, this simple
approach could give you a usable sentence splitter.
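The placeholder approach described above can be sketched roughly as follows. This is not the book's SimpleSentenceSplit listing; the class name, the abbreviation list, and the choice to split with a lookbehind are all assumptions made for the example, and the ~ placeholder is assumed not to appear in the input text.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Sketch only: names and abbreviation list are assumptions, not the book's listing.
public static class PlaceholderSentenceSplit
{
    // Assumed placeholder; it must not occur in the input text.
    private const char Placeholder = '~';

    // Sample abbreviations and URL fragments only; extend for your own text.
    private static readonly string[] Abbreviations =
        { "Mr.", "Mrs.", "Dr.", "Jan.", "www.", ".com" };

    public static List<string> Split(string text)
    {
        foreach (string abbreviation in Abbreviations)
        {
            // Word boundaries stop "Dr." from matching inside a longer word.
            string pattern =
                (char.IsLetterOrDigit(abbreviation[0]) ? @"\b" : "") +
                Regex.Escape(abbreviation) +
                (char.IsLetterOrDigit(abbreviation[abbreviation.Length - 1]) ? @"\b" : "");

            // Case-insensitive replacement that keeps the casing of the original text.
            text = Regex.Replace(text, pattern,
                m => m.Value.Replace('.', Placeholder),
                RegexOptions.IgnoreCase);
        }

        // Split after each end-of-sentence character, keeping that character,
        // then restore the periods that were hidden behind the placeholder.
        return Regex.Split(text, @"(?<=[.!?;])")
            .Select(s => s.Replace(Placeholder, '.').Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }
}
```

Run against the Federer test text, this sketch returns the two expected sentences. Its weakness is the one noted above: any abbreviation missing from the list will still split a sentence early.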
[0] Mr.
[1] Federer won the tennis match against Mr.
[2] Nadal on Jan.
[3] 28, 2018.
[4]
[5] Check the scores at www.
[6] tennis.
[7] com.
The first key to this approach is adapting the split function to be sure to include the delimiter
character, since we will need that to assemble the final resulting sentences. Listing 7 shows the
regular expression to split the string, but uses backtracking to keep the delimiter character.
Tip: If you are targeting languages other than English, you can add additional
delimiters between the [ ] characters, such as \u2047 for the double question
mark.
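One way to split while keeping the delimiter attached to the preceding text is a lookbehind assertion, sketched below. This is an assumption about how such a pattern could look, not necessarily the expression used in Listing 7; the class name is hypothetical.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

// Sketch: the pattern is an assumption, not necessarily the one in Listing 7.
public static class DelimiterKeepingSplit
{
    // (?<=...) matches the empty position just after a delimiter, so the
    // delimiter itself is not consumed and stays on the preceding piece.
    private const string Punctuation = @"(?<=[.!?;])";

    public static string[] Split(string text)
    {
        return Regex.Split(text, Punctuation)
            .Where(s => s.Length > 0)   // drop the empty piece after a trailing delimiter
            .ToArray();
    }
}
```

For "Mr. Nadal won. Great!" this yields "Mr.", " Nadal won.", and " Great!". Each piece keeps its leading space, which a later pass can use to tell a real sentence break from a period inside a token such as www.tennis.com.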
We can now write our code to extract the sections from the input text string. The first snippet of
the code is shown in Listing 8.
{
    // Split by new line character
    List<string> FirstPass =
        Regex.Split(Paragraph, @"((?:\r ?\n |\r)+)",
                    RegexOptions.IgnorePatternWhitespace)
             .Where(s => s != Environment.NewLine &&
                         !string.IsNullOrEmpty(s)).ToList<string>();
    foreach (string curSentence in FirstPass)
    {
        string[] chunks = Regex.Split(curSentence, punctuation);
        Sections_.AddRange(chunks.ToList<string>());
    }
}
After this code runs, the Sections_ string list contains the split elements, including the
delimiter character. Figure 8 shows the content of Sections_ list.
We now make a second pass, checking to see if the current list element is a sentence or a
different usage of the delimiter. Our first step is to build a regex pattern to look for potential uses
of the delimiter, other than end of sentence.
You can add your own expected abbreviations to this list. Knowing the type of questions and
input text will be very helpful.
We will also use a bit of LINQ to remove the empty strings from our collected list of sections.
Sections_ = Sections_.Where(s => s.Length > 0).ToList<string>();
[1] Mr.
Mr. is found by the regular expression, and the second element gets added to it,
The process continues until all sections are processed. You can add your own rules, such as: if
the two sections are numeric (such as currency or IP address), combine them. This allows you
to fine-tune the sentence extraction based on your application.
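The second pass can be sketched as a loop that keeps gluing sections together until a real sentence end is found. The two rules used here are assumptions for illustration: a section ending in a known abbreviation is joined to the next one, and a section is joined to the next one when the next one does not begin with whitespace (as in www.tennis.com). The names SectionMerger and Merge are hypothetical, and the sections are assumed to keep their leading whitespace.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Sketch of a second pass; rules and names are assumptions, not the book's listing.
public static class SectionMerger
{
    // Hypothetical abbreviation list; extend it for your own text.
    private static readonly Regex EndsWithAbbreviation =
        new Regex(@"\b(Mr|Mrs|Ms|Dr|Jan)\.$", RegexOptions.IgnoreCase);

    public static List<string> Merge(IList<string> sections)
    {
        var sentences = new List<string>();
        string current = "";

        for (int i = 0; i < sections.Count; i++)
        {
            current += sections[i];

            // Rule 1: "Mr." and friends do not end a sentence.
            bool abbreviation = EndsWithAbbreviation.IsMatch(current);

            // Rule 2: no whitespace after the delimiter means it sits inside a token.
            bool nextIsAttached = i + 1 < sections.Count
                && sections[i + 1].Length > 0
                && !char.IsWhiteSpace(sections[i + 1][0]);

            if (!abbreviation && !nextIsAttached)
            {
                if (current.Trim().Length > 0) { sentences.Add(current.Trim()); }
                current = "";
            }
        }

        // Keep any trailing text that never reached a sentence end.
        if (current.Trim().Length > 0) { sentences.Add(current.Trim()); }
        return sentences;
    }
}
```

One known limitation of this sketch is that a sentence that genuinely ends with a listed abbreviation will be merged with the sentence that follows it.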
Note: Be sure to review the string that comes back from the API to determine
which characters you might need. In the example shown in Listing 10, rather than
return a string, we are returning a list of strings, after splitting on new line characters
and removing the extra quotes placed on both ends of the paragraph.
Sample paragraph
Try the following sample paragraph to confirm how well the program performs at splitting
sentences.
Kerri won her match 6-2,6-2. Rachel/Dori also won 6-4,6-3. Dr. Schmidt of Frog Hollow Assn.
was on hand to watch. I.B.M. provided the scoring software. The players paid $18.00 to play.
This should test to see how well the parser handles various uses of the period character.
IsQuestion
Once you have the sentences, you might want to determine if the sentence is asking a question
(particularly in the case of a system to provide answers). Listing 11 below shows a simple
function to determine if the sentence is asking a question.
Listing 11 – IsQuestion
public static bool IsQuestion(string text)
{
    // A trailing question mark is the simplest signal (assumes English only).
    bool isQuestion = text.Trim().EndsWith("?");
    if (!isQuestion)
    {
        // \b keeps "how" from matching inside words such as "somehow".
        isQuestion = Regex.IsMatch(text, @"\b(Who|What|Where|When|How)\s.*",
            RegexOptions.IgnoreCase);
    }
    return isQuestion;
}
This function would allow you to read each retrieved sentence from the input text and determine
which ones need an answer.
Summary
We touched upon the basics of extracting sentences, but did not touch upon all the nuances
that can occur. For example, we ignored emoticons and only provided code to handle English.
Sentence boundary disambiguation is a complex problem to solve completely, but the code
should give you a basic idea of how it works.
Chapter 4 Extracting Words
Fortunately, splitting a sentence into words (called tokenization in NLP parlance) is a bit easier
of a task than sentence splitting. Delimiters, such as spaces and commas, generally don't have
other purposes within a sentence. A simple solution to extracting words is shown in the following
example, using the Char class from .NET and some LINQ code.
var ListOfSeparators =
sentence.Where(Char.IsPunctuation).Distinct().ToList();
If you want to add additional "separators,” such as symbols, you can simply append to the
ListOfSeparators:
ListOfSeparators.AddRange(sentence.Where(Char.IsSymbol).Distinct().ToList());
You can also add your own word delimiters to the list, in case of any unusual separators that
might be common in your application.
Once you've determined your separators, simply perform a Split() using the character list
you've just built.
Mr. Federer won the tennis match against Mr. Nadal on Jan. 28, 2018.
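The steps above can be combined into one short routine. This is only a minimal sketch: the class and method names (WordSplitterSketch, SplitWords) are made up for illustration, and it assumes whitespace has to be added to the separator list explicitly, because Char.IsPunctuation does not cover spaces.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class WordSplitterSketch
{
    // Splits a sentence on every punctuation mark, symbol, or whitespace
    // character that actually occurs in it.
    public static List<string> SplitWords(string sentence)
    {
        var separators = sentence.Where(Char.IsPunctuation).Distinct().ToList();
        separators.AddRange(sentence.Where(Char.IsSymbol).Distinct());
        // Assumption: whitespace is added explicitly, since spaces are
        // neither punctuation nor symbols.
        separators.AddRange(sentence.Where(Char.IsWhiteSpace).Distinct());

        return sentence
            .Split(separators.ToArray(), StringSplitOptions.RemoveEmptyEntries)
            .ToList();
    }
}
```

Because periods are treated as separators, abbreviations such as Mr. and Jan. lose their trailing period in the resulting word list.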
Regular expressions
You can also use regular expressions for a compact way to get the list of words. The regular
expression engine will generally perform slower than other .NET approaches, but unless you
are dealing with a massive amount of text, an end user would not notice the performance
difference within an application. Using regular expressions allows the regex engine to decide
how to split words, rather than custom splitting. Listing 12 shows how you can split the words
using regular expressions.
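One way to sketch the regular-expression approach is shown here. The pattern and the names (RegexWordSketch, GetWords) are assumptions for illustration and may differ from the code in the listing; the optional apostrophe group is there so that contractions stay in one piece.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexWordSketch
{
    // \w+ matches runs of letters, digits, and underscores; the optional
    // group keeps an apostrophe suffix (as in "could've") attached to the word.
    public static List<string> GetWords(string sentence)
    {
        return Regex.Matches(sentence, @"\w+('\w+)?")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```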
Contractions
Contractions, which are a short way to represent two words, are an interesting construct. For
example:
Depending on your preferences, you might want to expand contractions to have a better chance
of interpreting the text. Listing 13 is a function that will take a list of words and return a new list
with the contractions (if any found) expanded.
if (cont == "t") { ans_.Add("not"); }
if (cont == "d") { ans_.Add("would"); } // 'd can also mean "had"; "would" is the more common reading
if (cont == "re") { ans_.Add("are"); }
}
else
{
ans_.Add(words_[x]);
}
}
return ans_;
}
• I
• could
• have
• been
• a
• contender
Breaking the contraction into two words reduces the work that the application needs to do
to parse the sentence and attempt to determine its meaning.
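For readers who want a complete, compilable version of the idea, here is a minimal sketch. The suffix table is a hypothetical, deliberately small one, and the names (ContractionSketch, Expand) are illustrative; irregular forms such as "won't" would need their own entries.

```csharp
using System;
using System.Collections.Generic;

public static class ContractionSketch
{
    // Hypothetical suffix table; real text needs more entries and
    // special cases (for example "won't").
    private static readonly Dictionary<string, string> Suffixes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "ve", "have" }, { "ll", "will" }, { "re", "are" }, { "t", "not" }
        };

    public static List<string> Expand(IEnumerable<string> words)
    {
        var result = new List<string>();
        foreach (string word in words)
        {
            int apos = word.IndexOf('\'');
            string suffix = apos > 0 ? word.Substring(apos + 1) : "";
            if (apos > 0 && Suffixes.TryGetValue(suffix, out string expanded))
            {
                string stem = word.Substring(0, apos);
                // "n't" leaves a trailing 'n' on the stem: "don't" -> "do" + "not".
                if (suffix.Equals("t", StringComparison.OrdinalIgnoreCase)
                    && stem.EndsWith("n", StringComparison.Ordinal))
                {
                    stem = stem.Substring(0, stem.Length - 1);
                }
                result.Add(stem);
                result.Add(expanded);
            }
            else
            {
                result.Add(word);   // not a known contraction: keep as-is
            }
        }
        return result;
    }
}
```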
Summary
Splitting the words is a simple task, but now you are left with a list of words. In the next chapter,
we will start using the word list to help the system determine what the user is asking or telling
us.
Chapter 5 Tagging
Now that we have a list of words from the sentence, our next step is to tag the words,
essentially providing a part-of-speech code indicating how each word is being used. Children are
taught this in school: words are verbs, nouns, adjectives, etc. However, many words cannot be
classified in isolation, since how a word is used in context determines its likely part of speech.
One well-known example is the following expression:
Time flies like an arrow; fruit flies like a banana.
This example shows that in the first segment, flies is a verb, while in the second, flies is a noun
and fruit is an adjective. Tagging is the process of using as many clues as possible to come
up with the most likely set of tag values for each word in the sentence.
To keep our code simple, we are only going to use a few common tags.
Tag Meaning Example
DT Determiner the, an
Different APIs may use different tag sets; Google's API calls use the Universal tag set, and
Microsoft Cognitive Services uses the Penn Treebank tags. In the chapter where we cover the
Google API calls, I will provide code to map the Google tags to the matching Penn Treebank
tag.
Everything's a noun
When we start the process of tagging, we begin by assuming everything is a noun. Since nouns
make up most English words, it seems a reasonable starting point. This means that if we cannot
make a more accurate tag, we will treat it as a noun.
Regular expressions
There are some common patterns in words that we can use to change the tag from the default
noun, to a more likely tag. For example, if a word has five or more letters, and ends with a
consonant, followed by "ed" (such as saved, accumulated, or assumed), we could make a
guess that the word is more likely to be a past-tense verb, rather than a noun. By using regular
expressions to look for these patterns, we can possibly determine a better tag.
Table 2 shows some sample regular expressions to try to assign a better tag, based on these
word patterns.
Tag Expression Meaning
CD ^(((\d{1,3})(,\d{3})*)|(\d+))(\.\d+)?$ Numbers
DT (?i)^(the|an|a)$ Determiners
CC (?i)^(for|and|nor|but|or|yet|so)$ Conjunctions
RB (?i)^[a-z]{2,}ly$ Adverbs
Listing 14 is the start of our Tagger class, which makes every tag a noun and then uses some
regular expressions to identify words that should get a different tag.
We assume the word is a noun, but check the regular expression to see if there is a better tag.
The regular expression check improves our tagging performance, but we need to add more
coding to get a likely set of tags we can interpret.
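The noun-by-default idea with a regular expression pass can be sketched as follows. This is not the Tagger class from the listing; the names (RegexTaggerSketch, TagWord) and the past-tense pattern are assumptions written to match the description above (five or more letters, ending in a consonant followed by "ed").

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTaggerSketch
{
    // Ordered list: the first matching pattern wins.
    private static readonly List<KeyValuePair<string, string>> Patterns =
        new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>(@"^(((\d{1,3})(,\d{3})*)|(\d+))(\.\d+)?$", "CD"),
            new KeyValuePair<string, string>(@"(?i)^(the|an|a)$", "DT"),
            new KeyValuePair<string, string>(@"(?i)^(for|and|nor|but|or|yet|so)$", "CC"),
            new KeyValuePair<string, string>(@"(?i)^[a-z]{2,}ly$", "RB"),
            // Assumed pattern: 5+ letters ending in consonant + "ed" (past tense).
            new KeyValuePair<string, string>(@"(?i)^[a-z]{2,}[b-df-hj-np-tv-z]ed$", "VBD")
        };

    public static string TagWord(string word)
    {
        string tag = "NN";   // default: everything is a noun
        foreach (var pair in Patterns)
        {
            if (Regex.IsMatch(word, pair.Key))
            {
                tag = pair.Value;
                break;
            }
        }
        return tag;
    }
}
```

Patterns this simple produce false positives (family ends in "ly" but is not an adverb), which is one reason a dictionary lookup is still needed.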
Tip: If a word is caught in the regular expression code, there is no need to repeat
the word in your dictionary. Be sure to consider the regex "catches" before building
your dictionary.
Dictionary lookup
Although regular expressions will help, there are many cases where the regular expression is
going to get it wrong. To improve our tagging work, we are going to create a dictionary of
common words (top 500 verbs, top 500 adjectives, etc.). Because we know our intended use
(question answering), we can get by with a smaller dictionary of words. For larger, more
free-form question answering, you would most likely need a much larger word list (or rely on
one of the NLP APIs available).
Listing 15 – Dictionary
static private Dictionary<string, string>
MyDictionary = new Dictionary<string, string>
{
{"aboard","RB"},
{"almost","RB"},
{"always","RB"},
{"and/or","CC"},
{"bad","JJ"},
{"became","VBD"},
{"began","VBD"},
{"best","JJS"},
. . .
Note: The source code, with a more complete word list, is available at
https://github.com/SyncfusionSuccinctlyE-Books/Natural-Language-Processing-Succinctly.
To use this dictionary, we are going to add additional code after our regular expression lookup,
but before we return the final tag. Listing 16 shows the code to look the word up in the
dictionary.
if (MyDictionary.ContainsKey(WordLower))
{
tag= MyDictionary[WordLower];
}
using System.Data.Entity.Design.PluralizationServices;
using System.Globalization;
We can now use this feature, and a simple regular expression, to fine-tune our noun tag. Listing
17 shows some code that attempts to distinguish between singular and plural nouns, and to
take a guess as to whether the noun is a proper noun.
We have assumed a noun (NN) and marked it as plural (NNS) based on the pluralization service.
Finally, we use a regular expression to assume that a word beginning with a capital letter,
followed by two or more lowercase letters (and an optional hyphen), is probably a proper name
(NNP). While this simple regular expression won't handle all names, it should help us make a
reasonable guess as to the type of noun.
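The noun refinement described above can be sketched without the pluralization service. Note the substitution: the plural check here is a naive trailing-"s" test, not PluralizationService, so it will misfire on words like tennis. The names (NounRefinerSketch, RefineNoun) and the exact proper-name pattern are assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NounRefinerSketch
{
    public static string RefineNoun(string word)
    {
        string tag = "NN";
        // Stand-in assumption: a trailing "s" (but not "ss") marks a plural.
        if (word.Length > 3
            && word.EndsWith("s", StringComparison.Ordinal)
            && !word.EndsWith("ss", StringComparison.Ordinal))
        {
            tag = "NNS";
        }
        // Capital letter, two or more lowercase letters, optional hyphenated part.
        if (Regex.IsMatch(word, @"^[A-Z][a-z]{2,}(-[A-Za-z]+)?$"))
        {
            tag = "NNP";
        }
        return tag;
    }
}
```

The proper-name check runs last, so a capitalized name ending in "s" (such as Williams) ends up as NNP rather than NNS.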
Adding words
Your application may have its own unique vocabulary, so the class library has two additional
methods to allow you to add words to the dictionary, or regular expressions to the regex list.
Listing 18 shows two methods that allow you to customize your dictionaries.
return ans;
}
The routines allow you to access the tagger's internal structures and return error codes if
something goes wrong (such as an invalid regular expression or a duplicate entry).
Not all sentences will play that nicely with the tags we've assigned. For example, a question like
“Who spoke at the play?” would produce the following set of tags:
Word Tag
Who WP
spoke NN
at IN
the DT
play VB
Since spoke is not in our dictionary, it is assumed to be a noun. Trying to determine how to
answer a question based on that parsing would be difficult to do. This is where our next step
comes into play: tweaking. 😊
To improve our word tags, we need to rely on more than simply looking up words and assigning
the appropriate parts of speech. We are going to implement a very simple algorithm, loosely
based on the Brill tagger. The Brill tagger algorithm was invented by Eric Brill in 1993 as part of
his Ph.D. thesis. Essentially, it says to make your best guess, then correct anything you might
have gotten wrong.
Brill steps
Once we've assigned the tags, we need to process "rules" to see if any tags should be
switched. Our rules structure is shown in the following table. Each rule compares two sequential
words and tags; the first tag gets switched to the replacement tag if the specified condition is met.
Table 4 – Sample tag swap rules
The rule syntax is based on regular expressions. For example, the third rule indicates that a
verb should be changed to a noun if it occurs after a determiner and one or more adjectives.
For example, the word play is generally considered a verb. However, if the sentence started
with “An exciting play,” the rule parser would determine that instead of DT, JJ, VB (determiner,
adjective, verb), it should be DT, JJ, NN.
Using the rules shown in Table 4, our new set of tags for our problematic sentence, “Who spoke
at the play,” becomes:
Word Tag
Who WP
spoke VB
at IN
the DT
play NN
This represents an interpretation of the input sentence that is much more likely to be accurate.
Listing 19 shows the code to adjust the collection of words and tags based on the rules table.
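To make the tag-swap idea concrete, here is a deliberately simplified sketch. It only looks at adjacent tag pairs and ignores the "tags in between" patterns that the full routine supports. The two rules and the names (TagSwapSketch, Apply) are hypothetical, chosen so that the problematic sentence above comes out as shown.

```csharp
using System.Collections.Generic;

public static class TagSwapSketch
{
    // Hypothetical rules: (previous tag, current tag, replacement tag).
    private static readonly (string Prev, string Cur, string New)[] Rules =
    {
        ("WP", "NN", "VB"),   // noun right after who/what -> verb
        ("DT", "VB", "NN")    // verb right after a determiner -> noun
    };

    // Returns a corrected copy of the tag list; the input is not modified.
    public static List<string> Apply(IList<string> tags)
    {
        var result = new List<string>(tags);
        for (int i = 1; i < result.Count; i++)
        {
            foreach (var rule in Rules)
            {
                if (result[i - 1] == rule.Prev && result[i] == rule.Cur)
                {
                    result[i] = rule.New;
                    break;
                }
            }
        }
        return result;
    }
}
```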
Listing 20 shows the code to apply these rules prior to returning the tag list.
Listing 21 shows the pattern-matching code that deals with patterns using regular expression
syntax.
Listing 21 – Pattern matching
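static private bool MatchMiddlePattern(string Pattern, string tagsBetween)
{
    bool ans = false;
    if (string.IsNullOrEmpty(Pattern)) { return ans; }
    string lastchar = Pattern.Substring(Pattern.Length - 1);
    // Regex search patterns: a trailing +, *, or ? is a quantifier
    if ("+*?".IndexOf(lastchar) >= 0)
    {
        // Apply the quantifier to the whole "TAG," unit, so "JJ+" becomes
        // ^((JJ),)+$ (assumes each tag in tagsBetween is followed by a comma)
        string tag = Pattern.Substring(0, Pattern.Length - 1);
        string regex = "^((" + tag + "),)" + lastchar + "$";
        ans = Regex.IsMatch(tagsBetween, regex);
    }
    else
    {
        ans = tagsBetween.StartsWith(Pattern);
    }
    return ans;
}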
The code reads each tag from your list. If the current tag is found in the transform rules, it then
looks to see if the ending tag is found within your tag list as well. If both tags are found, we then
compare the tags in between them against the rule's middle pattern. The middle pattern borrows
syntax from regular expressions, such as the options shown in Table 6.
This is a very simple example of a Brill-based tagging routine, combining regular expressions
and word lookups to provide a reasonable interpretation of the text the user entered.
Summary
This chapter gave a simple tagger example, with a very small set of words and rules.
Implementing a tagger capable of handling more complex sentences would require a larger
word set, more tags, and several additional rules. For example, you could adapt the dictionary
to return all possible tags a word could have, or you could consider the tense of a word or whether
it’s singular or plural. English is an ambiguous and complex language, and a tagger to
understand all those nuances would be a very ambitious undertaking (leave it to the big guys).
Chapter 6 Entity Recognition
Once you have a tagged collection of words, you should scan the list for named entities. A
named entity could be something like a city, a person, or a corporation. Essentially, it is likely a
noun or adjective and noun that has meaning for your application. If you were processing a
travel application, and wanted the user to be able to say, "I want to fly from Philadelphia to
Orlando," your system should consider airport cities as important entities to extract from the
sentence.
In addition to named entities, there are certain “nouns” that have special meaning, such as
email addresses, phone numbers, or credit card numbers. Recognizing these entities can
improve your application’s ability to determine what the text contains.
The more information we can gather about the word, the better we will be able to extract
meaning and find answers to the user's English questions.
Note: Your application might need to expand upon the entity types, depending on
what type of questions you need answered. If you are developing a human resources
application, for example, you might want to include ID numbers (such as Social Security
numbers), phone numbers, and email addresses.
Entity types
Table 7 contains a list of entity type tags we might consider within our application.
Named entities
The simplest way to find named entities is by looking them up in a dictionary object. If you
create a simple list of strings, comparing them against the words in the tag list is a simple
matter. We might want to add some new tags, such as Airport, Person, City, Organization, or
Event for our travel application. Listing 22 shows the code to traverse the tag and word list, and
if possible, update the tag to identify it as an entity.
Words_.RemoveRange(nStart + 1, nSize - 1);
}
}
}
}
}
}
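A compact, self-contained sketch of the dictionary lookup is shown here for experimentation. The entity list, the three-word limit, and the names (EntityLookupSketch, TagEntities) are hypothetical; the merge step mirrors the RemoveRange call in the listing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class EntityLookupSketch
{
    // Hypothetical entity list: phrase -> entity tag.
    private static readonly Dictionary<string, string> Entities =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "French Open", "EVENT" }, { "US Open", "EVENT" },
            { "Wimbledon", "EVENT" },   { "Serena Williams", "PERSON" }
        };

    // Replaces each run of words that matches an entity with a single word
    // carrying the entity tag. Longest phrases are tried first.
    public static void TagEntities(List<string> words, List<string> tags, int maxWords = 3)
    {
        for (int start = 0; start < words.Count; start++)
        {
            for (int size = Math.Min(maxWords, words.Count - start); size >= 1; size--)
            {
                string phrase = string.Join(" ", words.Skip(start).Take(size));
                if (!Entities.TryGetValue(phrase, out string entityTag)) { continue; }

                words[start] = phrase;
                tags[start] = entityTag;
                // Drop the words (and tags) that were merged into the entity.
                words.RemoveRange(start + 1, size - 1);
                tags.RemoveRange(start + 1, size - 1);
                break;
            }
        }
    }
}
```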
Patterns
In addition to a word list, we are also going to create a dictionary of patterns (via regular
expressions) to help us identify "entities." In general, if a word matches one of the defined
patterns, we are going to assign it as an entity.
{@"^([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|" +
 @"(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)$","EMAIL" },
{@"^((\(\d{3}\) ?)|(\d{3}-))?\d{3}-\d{4}$","PHONE" },
{@"^((\d{2})|(\d))\/((\d{2})|(\d))\/((\d{4})|(\d{2}))$","DATE" },
{@"^(\d{4})$","YEAR" },
{@"(?i)^(Mr|Mrs|Miss|Ms)$","TITLE"},
{@"^(\d+)$","CD"},
{@"(?i)^(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|" +
 @"Jul(y)?|Aug(ust)?|Sep(tember)?|Sept|Oct(ober)?|Nov(ember)?|" +
 @"Dec(ember)?)$","MONTH" }
};
Again, knowing your type of application will help indicate which regular expressions you should
look for. If your system is a Human Resources application, identifying Social Security numbers
is a good idea. For a payment system, you might want to identify credit card numbers. If you are
processing emails, emoticons would be useful.
Listing 24 shows the code that loops through your words and updates any tags based on the
regular expression patterns you've defined.
Listing 24 – Regex pattern matching
for (int x = 0; x < Words_.Count; x++)
{
    string curWord = Words_[x];
    foreach (KeyValuePair<string, string> pair in Patterns)
    {
        // The first matching pattern wins, so pattern order matters.
        if (Regex.IsMatch(curWord, pair.Key))
        {
            Tags_[x] = pair.Value;
            break;
        }
    }
}
There are a couple of things to keep in mind. First, be sure to put your regular expressions in
order, so that the YEAR tag (four digits) is checked before the cardinal regex (any number of
digits). Second, you might need additional checks in your code. For example, before we
consider a four-digit number to be a year, we want to confirm it falls between 1960 and 2029.
Listing 25 shows just such a check.
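The range check described above could look something like the following sketch. The names (YearCheckSketch, TagNumber) are illustrative, and the 1960 to 2029 bounds are exposed as parameters so a different application can narrow them.

```csharp
using System.Text.RegularExpressions;

public static class YearCheckSketch
{
    // Tags a four-digit number as YEAR only when it falls inside the
    // expected range; any other all-digit word is a cardinal number (CD).
    public static string TagNumber(string word, int minYear = 1960, int maxYear = 2029)
    {
        if (Regex.IsMatch(word, @"^[0-9]{4}$"))
        {
            int value = int.Parse(word);
            return (value >= minYear && value <= maxYear) ? "YEAR" : "CD";
        }
        return Regex.IsMatch(word, @"^[0-9]+$") ? "CD" : "NN";
    }
}
```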
In a Human Resources application, you might decide that a four-digit number should
only be considered a year if it’s lower than the current year (for example, when asking someone
for their birth date). A good source for regular expressions is the Regex Library website, which
has a large number of user-contributed regular expressions.
Rule-based lookup
Another approach for identifying named entities is to look for tag patterns that suggest an entity,
rather than individual words. For example, the tag TITLE followed by one or more tags of NNP
(proper name) suggests a person. Anytime we see that pattern, we should create a single
person entity. Our pattern dictionary would be very similar to the rules we used when we
tweaked the tag list while tagging words.
{"TITLE|NNP|NNP","PERSON" },
{"TITLE|NNP","PERSON" }
};
This rule says: if we find the pattern of a title followed by one or two proper names, we combine
the words into a single tagged word and tag it as a PERSON. So, Mr. Joseph Booth, originally
three tags, becomes a single PERSON tag with the value Mr. Joseph Booth.
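Applying such tag-pattern rules can be sketched as follows. The two rules come from the text above; everything else (the names PersonRuleSketch and Apply, and the choice to try the longer pattern first) is an assumption for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class PersonRuleSketch
{
    // Tag patterns, longest first, mapped to the entity tag they produce.
    private static readonly List<KeyValuePair<string, string>> Rules =
        new List<KeyValuePair<string, string>>
        {
            new KeyValuePair<string, string>("TITLE|NNP|NNP", "PERSON"),
            new KeyValuePair<string, string>("TITLE|NNP", "PERSON")
        };

    public static void Apply(List<string> words, List<string> tags)
    {
        for (int start = 0; start < tags.Count; start++)
        {
            foreach (var rule in Rules)
            {
                string[] pattern = rule.Key.Split('|');
                if (start + pattern.Length > tags.Count) { continue; }
                if (!tags.Skip(start).Take(pattern.Length).SequenceEqual(pattern)) { continue; }

                // Merge the matched words into one entity word.
                words[start] = string.Join(" ", words.Skip(start).Take(pattern.Length));
                tags[start] = rule.Value;
                words.RemoveRange(start + 1, pattern.Length - 1);
                tags.RemoveRange(start + 1, pattern.Length - 1);
                break;
            }
        }
    }
}
```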
Example question
Let’s say the user asks the following question:
After we apply the named entity recognition, the result is shown in Figure 10.
Based on that sentence structure, we can determine a course of action: look up the winner of
the French Open from 2001.
Remembering
Once you get information from a question, your application should remember key components.
A user querying the tennis results might see dialog like this:
Gustavo Kuerten won the men's side and Jennifer Capriati won on the women's side.
Who lost?
Àlex Corretja lost to Gustavo Kuerten and Kim Clijsters lost to Jennifer Capriati.
Since your code needs to know the year and name of the tournament, it should remember those
key variables from the previous questions. The better you know your data and types of
questions, the better you'll be able to determine which pieces of information to retain and plug in
for later questions.
Prompting
You might also want to consider having the system ask for additional information in case
something is missing in the sentence. For example:
You could assume the question refers to the most recent French Open, or you could prompt
back with a question to obtain the missing information: In what year?
Many of these decisions about how to plan the interaction will help determine how well received
your application is. For example, in our tennis application, many people refer to players by
their last names. We could consider adding the last name to our named entity stack, so
someone asking "Did Nadal win the French Open in 2014?" would get an answer
without knowing Nadal's first name (Rafael).
Summary
After tagging and applying named entities and some chunking logic, we should be able to come
up with enough information to determine what the user wants to know. In the next few chapters,
we will see how to build our knowledge base, and after that, how to provide an answer to the
user.
Chapter 7 Knowledge Base
As we mentioned earlier, building a knowledge base of every known fact would be an incredibly
complex and massive undertaking. However, if we can restrict our knowledge base to a small
set of data, we can create a system that allows end-users to query that knowledge with English
questions.
Many applications have small data sets that can be queried. For example, you might want to
ask an employee system questions like: "Who is up for a review this month?" or "Show me
employees working for Miss Blaire." An ordering system might answer questions such as
"When did order #1234 ship?" or "Which unshipped orders are older than one week?"
Example domain
The first step in building the knowledge base is to determine the data subset you want to use.
For our example code, we are going to use the list of tennis champions at the Grand Slam
events since 1968. This gives us a small base (50 years x 4 events x 2 genders) of 400 rows
of data. Such a small base allows us to store the data in memory and use LINQ to process it.
Larger systems might rely on SQL Server or possibly third-party API calls.
• Event
• Gender
• Year
• Champion Seed
• Champion Country
• Champion
• Runner-up Seed
• Runner-up Country
• Runner-up
• Score in the Final
The data was obtained from Data World and released to the public domain. The link to this
dataset is here.
Data World is a website of collaborative data sets contributed by users. Be sure to review the
licensing, as each contributor may have their own licensing and attribution rules.
Base classes
To use the data, we are going to create a couple of base classes and then load the data into an
enumerated list, so we can rely on LINQ queries to search. If you have a much larger data set,
or data stored in a SQL database, you will probably use SQL queries or Entity Framework to
retrieve the data.
Player class
This class represents the tennis players from the dataset. Listing 28 shows the base Player
class. We rely on a very simple Split function to take player's name and return the first and last
name.
If you are building a data set that refers to people's names, the names should probably be
treated as named entities. By exposing our NamedEntities collection, we can add code to
identify the players when their names appear in a question text.
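A minimal sketch of such a player class follows. The property names FullName, Seed, and Country match the loader code later in this chapter; the class name PlayerSketch and the splitting rule (the last space-separated token is the last name) are assumptions.

```csharp
public class PlayerSketch
{
    public string FullName { get; set; } = "";
    public int Seed { get; set; }
    public string Country { get; set; } = "";

    // Assumption: everything before the last space is the first name.
    public string FirstName
    {
        get
        {
            string name = FullName.Trim();
            int pos = name.LastIndexOf(' ');
            return pos < 0 ? "" : name.Substring(0, pos);
        }
    }

    // Assumption: the last space-separated token is the last name.
    public string LastName
    {
        get
        {
            string name = FullName.Trim();
            int pos = name.LastIndexOf(' ');
            return pos < 0 ? name : name.Substring(pos + 1);
        }
    }
}
```

A split this simple will mishandle multi-word surnames, so a production data set may need an explicit last-name column instead.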
Tournament class
We also have a class that represents the tournament. Listing 29 shows the Tournament base
class.
public Player RunnerUp { get; set; }
public int SetsPlayed
{
get
{
string[] ans_ = FinalScore.Split(',');
return ans_.Length;
}
}
}
Now we’ll create a static class called TennisMajors, and declare a collection to hold the
tournament results.
static TennisMajors() {
    LoadDataSet();
}
The class will load the data set once at startup when the class is first referenced. Listing 30
shows the class to load the data file into the collection.
{
Name = CurrentData[0].ToUpper().Trim(),
Gender = CurrentData[1].ToUpper().Trim(),
Year = Convert.ToInt16(CurrentData[2]),
FinalScore = CurrentData[9],
Winner = new Player(),
RunnerUp = new Player()
};
CurTournament.Winner.FullName = CurrentData[3].Trim();
CurTournament.Winner.Seed = Convert.ToInt16(CurrentData[4]);
CurTournament.Winner.Country = CurrentData[5].Trim();
CurTournament.RunnerUp.FullName = CurrentData[6].Trim();
CurTournament.RunnerUp.Seed =
Convert.ToInt16(CurrentData[7]);
CurTournament.RunnerUp.Country = CurrentData[8].Trim();
TennisResults.Add(CurTournament);
Players.Add(CurTournament.Winner.FullName);
Players.Add(CurTournament.RunnerUp.FullName);
}
The code loads the dataset from the text file and grabs the players' names as it loads the data.
The players' names are then loaded to the Entities collection in the base NLP object to give
us a collection of named entities from the tennis players. So now, a question such as:
Will tag Serena Williams as PERSON, 2014 as a YEAR, and US Open as an EVENT.
Single answers
Based on our dataset and expectations, we should provide the following functions at a
minimum. If the year is 0, it will assume the first tournament. Most of these functions are simple
LINQ queries against the collection.
GetResults(TournamentName, Year)
Listing 31 shows a method to look up the tournament, year, and optionally the gender, and
return the winner, loser, and number of sets. We return the data as a Tournament object
and let the program using the data format the response.
static Tournament GetResults(string TournamentName, int Year, string
Gender)
{
Tournament Fnd = TennisResults.FirstOrDefault(x => x.Year == Year
&& x.Name == TournamentName && x.Gender==Gender);
return Fnd;
}
If no results are found, the method returns null; otherwise, it returns the requested tournament
object. Using this method, we can easily create some simple wrappers (or allow the query
program to make direct use of the object).
}
The wrapper functions are not necessary, but simply hide the details of the Tournament object
behind the scenes.
Aggregate answers
You might also want to answer some questions about the history of the tournament, again by
using LINQ queries against the collection.
MostWins(TournamentName, Gender)
MostWins is a function using a LINQ query that determines for a particular tournament, who has
won it the most times. Listing 34 shows the method.
Tip: If you are not familiar with LINQ, I recommend reading Jason Roberts's LINQ
Succinctly title.
MostLosses(TournamentName, Gender)
MostLosses is the same code, except looking at the runner up, rather than the winner name
during the GroupBy.
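The GroupBy approach for both functions can be sketched like this. It uses a flattened, hypothetical result type (ResultSketch) rather than the Tournament class, and the return format is made up; swapping r.Winner for r.RunnerUp in the GroupBy gives the MostLosses variant.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical flattened result row, for illustration only.
public class ResultSketch
{
    public string Name { get; set; }
    public string Gender { get; set; }
    public int Year { get; set; }
    public string Winner { get; set; }
    public string RunnerUp { get; set; }
}

public static class AggregateSketch
{
    // Returns "Player (count)" for the most frequent winner, or "" if the
    // tournament and gender combination has no rows.
    public static string MostWins(IEnumerable<ResultSketch> results,
                                  string tournamentName, string gender)
    {
        var top = results
            .Where(r => string.Equals(r.Name, tournamentName, StringComparison.OrdinalIgnoreCase)
                     && r.Gender == gender)
            .GroupBy(r => r.Winner)                 // use r.RunnerUp for MostLosses
            .OrderByDescending(g => g.Count())
            .FirstOrDefault();

        return top == null ? "" : top.Key + " (" + top.Count() + ")";
    }
}
```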
PlayerWins(PlayerName,TournamentName)
Listing 35 shows the code to determine how many times a player has won the tournament.
Listing 35 – Player wins
static public string PlayerWins(string PlayerName,string TournamentName)
{
string ans = PlayerName + " has never won.";
int NumTimes_ = TennisResults.Where(x => x.Name.ToLower() ==
TournamentName.ToLower()
&& x.Winner.FullName.ToUpper() ==
PlayerName.ToUpper()).Count();
if (NumTimes_>0)
{
if (NumTimes_ == 1) { ans = PlayerName + " has won once"; }
if (NumTimes_ == 2) { ans = PlayerName + " has won twice"; }
if (NumTimes_>2) {
ans = PlayerName + " has won " + NumTimes_.ToString() + " times";
}
}
return ans;
}
PlayerLosses(PlayerName,TournamentName)
In Listing 36, determining the number of losses is slightly different, because we need to
distinguish between a player who never reached the finals versus a player who reached the
finals, but lost.
if (NumLost_ > 0 && NumWins_ > 0) { ans = PlayerName + " lost " +
(NumLost_) + " times in " + (NumLost_ + NumWins_).ToString()+
" trips to the finals"; }
}
return ans;
}
Note: The code to the Tennis Data and all the lookup functions can be downloaded from
https://github.com/SyncfusionSuccinctlyE-Books/Natural-Language-Processing-Succinctly.
Summary
The question-answering component of the application needs to understand how to get the
answers, be it a LINQ query, SQL call, or web service. The list of tagged words is parsed, and
hopefully, tied to one of the functions you wrote. The goal is interpreting a tagged word list and
extracting enough information to know which function to call to determine the answer.
Now that we know what data is available, we need to take the user’s question and determine
which function to call to give the best answers. That is what we will discuss in Chapter 8.
Chapter 8 Answering Questions
We've reached this point with the ability to generate a list of tagged words from a text. Whether
you relied on Cloudmersive (or another API) or used the code presented in the book, that
tagged list of word objects gives us a good chance to have the computer respond, in a somewhat
friendly manner, to the user's text. Our basic method will be one that takes that tagged sentence
and tries to provide a response.
The word list and the tag list are passed as parameters. By searching through the lists, we
should be able to determine what the user is asking, and how to answer.
We now have the phrases and a set of functions about our data. In this chapter, we will
integrate the pieces and allow you to ask questions and get answers. Figure 12 shows a
conversation with the tennis major application.
Gustavo Kuerten won the men's side and Jennifer Capriati won on the women's side.
Who lost?
Àlex Corretja lost to Gustavo Kuerten and Kim Clijsters lost to Jennifer Capriati.
Rod Laver won in a Good match over Tony Roche and Billie Jean King won in a Good match
over Judy Tegart.
Note: Our dataset only goes back 50 years, so "first" refers to the first Wimbledon in
our dataset. The first actual Wimbledon tournament was played in 1877 and won by
Spencer Gore.
Getting started
Our goal sounds simple: we need to determine three things to answer the question. First, what
is being asked. Second, which function call has the answer. Third, whether we can get the
information the function call needs from the list of word info objects.
What we can answer
If we review the functions from the previous chapter, we know how to answer the following
questions. Table 8 lists the functions and the information we need to determine to call the
functions.
Function Parameters
At minimum, we need the tournament name (all functions expect the tournament as the first
parameter). The other parameters will vary, depending on the type of question being asked.
First question
Our first question is “Who won the 2001 French Open?” If we look at our tagged phrases, we
get the list of tagged words shown in Figure 13.
The YEAR and EVENT tags let us know which tournament the question is referring to. We have
two parameters that we can pass to any of our first three function calls. Since we don't know
gender, we will report results from the tournament for both men and women.
The verb is likely to indicate which function we want: the winner or the loser of the tournament.
Since our verb is the word won, we should report the winner. Since the question word is who, we
know the user is expecting a name.
With this information, we can call the function and generate an answer.
Saving information
We also want to create variables to hold information the user gives us that might be reused
as parameters. Listing 37 shows the static class variables we declare to
remember the parameters.
Whenever we determine the parameters from a question text, we update the class variables, so
the user doesn't have to repeat themselves.
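The remembering behavior can be sketched with a few static fields and an update helper. The field names follow the variables used in the answer routine in this chapter; the class name ConversationMemorySketch, the Remember helper, and the "B" (both genders) default are assumptions.

```csharp
public static class ConversationMemorySketch
{
    public static string Tournament = "";
    public static int TournamentYear = 0;
    public static string Gender = "B";   // assumed default: "B" = both men and women

    // Only overwrite a remembered value when the new question supplies one,
    // so a follow-up such as "Who lost?" reuses the earlier tournament and year.
    public static void Remember(string tournament, int year)
    {
        if (!string.IsNullOrEmpty(tournament)) { Tournament = tournament; }
        if (year > 0) { TournamentYear = year; }
    }
}
```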
for (int x = 0; x < Tags_.Count; x++)
{
if (Tags_[x] == "YEAR")
{
TournamentYear = Convert.ToInt16(Words_[x]);
}
if (Tags_[x] == "EVENT")
{
Tournament = Words_[x];
if (Tournament.Contains("FRENCH")) { Tournament = "FRENCH"; }
if (Tournament.Contains("US ")) { Tournament = "USOPEN"; }
if (Tournament.Contains("AUS ")) { Tournament = "AUS"; }
}
if (Tags_[x].StartsWith("VB")) { LastVerb = Words_[x].ToUpper(); }
if (Tags_[x] == "PERSON") { PlayerName = Words_[x].ToUpper(); }
}
if (LastVerb == "WON") {
if (Gender == "B")
{
ans_ = WhoWon(Tournament, TournamentYear, "M");
string ansW = WhoWon(Tournament, TournamentYear, "F");
if (ansW.Length>0) { ans_ += " and " + ansW; }
}
else
{
ans_ = WhoWon(Tournament, TournamentYear, Gender);
}
}
if (LastVerb == "LOST") {
if (Gender == "B")
{
ans_ = WhoLost(Tournament, TournamentYear, "M");
string ansW = WhoLost(Tournament, TournamentYear, "F");
if (ansW.Length > 0) { ans_ += " and " + ansW; }
}
else
{
ans_ = WhoLost(Tournament, TournamentYear, Gender);
}
}
if (ans_.Length < 1) { ans_ = "I don't know..."; }
return ans_;
}
This basic code processes a couple of key verbs (WON and LOST) and calls the appropriate function
to return an answer. It loops through the sentence tags, looking for parameters to pass along
to the calls to narrow down which tournament and year the user is asking about.
Don't be boring
When people answer a question, they might phrase it differently each time. We want our “Who
Won” routine to be a bit creative. Listing 40 shows the routine determining the answer, but
formatting it differently based on a random selection.
ans_ = string.Format(PossibleReplies[reply],
Results_.Winner.FullName,
GenderText,
Results_.RunnerUp.FullName,
SetText);
}
return ans_;
While it is possible to simply extract the answer and return the person's name, the application
will appear more friendly and easier to use if it seems more human (in this case, by giving
different ways of providing the answer).
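The random-reply idea can be tried out in isolation with a sketch like this one. The three templates are invented for illustration; only the placeholder order (winner, gender text, runner-up, set text) follows the string.Format call in the listing.

```csharp
using System;

public static class ReplySketch
{
    // Hypothetical templates: {0} winner, {1} gender text, {2} runner-up, {3} set text.
    private static readonly string[] PossibleReplies =
    {
        "{0} won the {1} title over {2} in {3}.",
        "The {1} champion was {0}, who beat {2} in {3}.",
        "{0} defeated {2} in {3} to take the {1} title."
    };

    private static readonly Random Rng = new Random();

    public static string WhoWonReply(string winner, string genderText,
                                     string runnerUp, string setText)
    {
        int reply = Rng.Next(PossibleReplies.Length);   // pick a template at random
        return string.Format(PossibleReplies[reply], winner, genderText, runnerUp, setText);
    }
}
```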
Depending on your application, you can really enhance the application by understanding your
data. In our case, the score of the match is stored as a string. With a bit of string manipulation,
we can make a guess as to how close (or one-sided) the match was.
Games won
The score string from the data looks like the following.
Listing 41 shows how to determine the games won and the games lost based on the score
string.
int x = WinLoss[1].IndexOf("(");
if (x > 0) { WinLoss[1] = WinLoss[1].Substring(0, x).Trim(); } // keep only the games, drop the tiebreak score
TotalLost += Convert.ToInt16(WinLoss[1]);
}
}
return TotalLost;
}
The games-lost calculation is very similar, but it also strips the tiebreaker score (the part in
parentheses) if one appears. By looking at a match and applying some tennis logic, we can
determine if the match was close or not. We might want to adjust our replies even further. Let's
determine if the match was one-sided, close, or a good match.
If the match was not decided in straight sets (three for men, or two for women), the winner lost
at least one set, so we can assume it was a close match. If the match was decided in straight
sets, and the winner won roughly two to three times as many games as the loser (or more), we
will call that a one-sided match. Listing 42 shows the function to make a rating guess for the match.
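A sketch of this rating logic follows. Several details are assumptions: the score format ("6-2,6-3,6-1", with an optional tiebreak such as "7-6(5)"), the 2.5 ratio used as the one-sided threshold, and the names (MatchRatingSketch, Rate). Rather than counting sets by gender, it treats any set the winner lost as proof the match was not won in straight sets.

```csharp
using System;

public static class MatchRatingSketch
{
    // Returns "close", "one-sided", or "good" for a final score string.
    public static string Rate(string finalScore, double oneSidedRatio = 2.5)
    {
        int won = 0, lost = 0;
        bool straightSets = true;

        foreach (string setScore in finalScore.Split(new[] { ',' },
                     StringSplitOptions.RemoveEmptyEntries))
        {
            string[] games = setScore.Trim().Split('-');
            int w = int.Parse(games[0]);
            // Drop a tiebreak score such as "(5)" before parsing.
            int paren = games[1].IndexOf('(');
            int l = int.Parse(paren >= 0 ? games[1].Substring(0, paren) : games[1]);

            won += w;
            lost += l;
            if (w < l) { straightSets = false; }   // the winner dropped this set
        }

        if (!straightSets) { return "close"; }
        return won > oneSidedRatio * lost ? "one-sided" : "good";
    }
}
```

With these assumptions, a score of 6-2,6-3,6-1 (18 games to 6) is rated one-sided.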
With this code added to our functions, the application gets a bit opinionated (Nadal won
6-2,6-3,6-1), as shown in Figure 14.
Rafael Nadal won in a one-sided match over Stan Wawrinka, and Jeļena Ostapenko won
in a very close match over Simona Halep.
Of course, the computer knows nothing about the actual matches, just what the data it sees tells
it. Be sure to consider your audience as your application gets more creative in its responses.
Second question
The second question is very simple: “Who lost?” Since the user has not given us a year or the
tournament name, we will rely on the previous answers. From the previous question, we know it
was the 2001 French Open, so the system finds the verb LOST, and has enough information to
determine which function to call.
Remembering previous replies will make the system much more friendly to the user. If I enter
“Who is the Human Resource manager?” and the system replies “Julie,” you can assume my
follow-up question of, “What is her email?” refers to Julie and can be answered. If your system
identifies a person, the pronouns he and she should be replaced with the person’s name in the
system's memory.
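The pronoun replacement described above could be sketched as follows. The ConversationMemory class and its members are hypothetical names, and the rule for telling a possessive "her" from an object "her" is a deliberately naive assumption (a possessive is any "his" or "her" followed directly by another word).

```csharp
using System;
using System.Text.RegularExpressions;

public class ConversationMemory
{
    // Name of the most recently identified person, or null if none yet.
    public string LastPerson { get; set; }

    // Replaces he/she/him/her/his in a follow-up question with the remembered name.
    public string ResolvePronouns(string question)
    {
        if (string.IsNullOrEmpty(LastPerson)) return question;

        // "his"/"her" directly before another word is treated as a possessive.
        string result = Regex.Replace(question, @"\b(his|her)\b(?=\s+\w)",
            m => LastPerson + "'s", RegexOptions.IgnoreCase);

        // Remaining subject and object pronouns become the plain name.
        return Regex.Replace(result, @"\b(he|she|him|her)\b",
            m => LastPerson, RegexOptions.IgnoreCase);
    }
}
```

After the system answers "Julie" and stores that name in LastPerson, the follow-up "What is her email?" is rewritten to "What is Julie's email?" before it goes through the normal tagging and lookup steps.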
Third question
The third question is: “Who won the first Wimbledon?” The WHO question and EVENT tell us we
are looking for a person, but we don't know the year. However, the keyword FIRST tells us to
get the very first year we have data for. So, we scan our word list, looking for first, earliest, etc.,
keyword-searching the word array.
One drawback is that you need to know all the synonyms a person might use. There is
a web service available called WordsAPI that allows you to find synonyms for a given word. An
example of the JSON response is shown in Listing 43.
{ "word": "first",
"synonyms":
[ "1st",
"inaugural",
"maiden",
"kickoff",
"start",
"foremost"
]
}
By using the API, you can anticipate expected words, such as first or latest, and have the API
prebuild possible synonyms. You can also store the list locally in a dictionary. If you are using
your own code, you will probably want to keep a synonym dictionary of words likely to be used
by your audience.
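A locally stored synonym dictionary might look like the sketch below. The KeywordNormalizer class, the canonical keywords FIRST and LATEST, and the particular synonyms listed are assumptions for illustration; in practice the entries would come from the API response or from your own knowledge of the audience.

```csharp
using System;
using System.Collections.Generic;

public static class KeywordNormalizer
{
    // Maps words the user might type to the keyword the code looks for.
    private static readonly Dictionary<string, string> Synonyms =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { "first", "FIRST" }, { "1st", "FIRST" }, { "inaugural", "FIRST" },
            { "maiden", "FIRST" }, { "earliest", "FIRST" },
            { "latest", "LATEST" }, { "last", "LATEST" }, { "newest", "LATEST" }
        };

    // Returns the first canonical keyword found in the word list, or null if none matches.
    public static string FindKeyword(IEnumerable<string> words)
    {
        foreach (string word in words)
        {
            string keyword;
            if (Synonyms.TryGetValue(word, out keyword)) return keyword;
        }
        return null;
    }
}
```

With this in place, "Who won the inaugural Wimbledon?" and "Who won the first Wimbledon?" both resolve to the FIRST keyword, so the same lookup function handles either phrasing.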
Final question
The last question is: “Who has won the most Wimbledons?” In this example, we are counting on
the tags to identify the event (Wimbledon) and verb (WON). We are also expecting the keyword
most (as an adverb or adjective). By detecting the verb (WON or LOST) and the modifier most, we
can determine which method in the dataset to call. We can change the question a bit, and still
get a reasonable reply, as shown in Figure 15.
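The choice of method from the verb and the modifier can be sketched as a small lookup. The QuestionRouter class and the function names it returns (MostWins, GetWinner, and so on) are hypothetical; the dataset methods themselves are not shown in this section.

```csharp
using System;
using System.Collections.Generic;

public static class QuestionRouter
{
    // Returns the name of the (hypothetical) dataset function to call,
    // based on the detected verb and whether the word list contains "most".
    public static string ChooseFunction(string verb, IEnumerable<string> words)
    {
        bool wantsMost = false;
        foreach (string word in words)
        {
            if (string.Equals(word, "most", StringComparison.OrdinalIgnoreCase))
            {
                wantsMost = true;
            }
        }

        switch (verb.ToUpperInvariant())
        {
            case "WON": return wantsMost ? "MostWins" : "GetWinner";
            case "LOST": return wantsMost ? "MostLosses" : "GetLoser";
            default: return null; // Unknown verb: let the caller ask for clarification.
        }
    }
}
```

Because only the verb and the modifier drive the decision, rephrasing the question ("Which player won the most Wimbledons?") still reaches the same function.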
Summary
By parsing the tagged sentence to extract missing data and relying on the verb to guess which
function to call, we can generally do a pretty good job of matching input sentences to functions
that provide the answer. Again, the more you know about your application, the better you will be
able to anticipate the types of questions you might find.
I would suggest, at least initially, keeping a log of the questions asked and answers provided by
your application. You will likely keep tweaking your code, based on what the users are asking.
As you get a collection of common questions, and tweak the code to answer them, your system
will appear smarter every time.
It is possible to simply return the exact answer a person wants, but the system will appear more
useful if you use random responses, or even humorous answers, to provide the information. We
are designing a system to interact with people, so we don't need to be quite as rigid as the
protocols needed when we talk between computer systems. People will like the variety and light
nature of the responses generated.
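A minimal sketch of varied responses is to pick one of several sentence templates at random. The ReplyBuilder class and the template wording are assumptions; the first template mirrors the reply shown earlier in Figure 14.

```csharp
using System;

public static class ReplyBuilder
{
    private static readonly Random Rng = new Random();

    // {0} = winner, {1} = match rating, {2} = loser.
    private static readonly string[] Templates =
    {
        "{0} won in a {1} match over {2}.",
        "{0} beat {2} in a {1} match.",
        "It was a {1} match, and {0} came out ahead of {2}."
    };

    // Fills a randomly chosen template with the facts from the data.
    public static string WinnerReply(string winner, string rating, string loser)
    {
        string template = Templates[Rng.Next(Templates.Length)];
        return string.Format(template, winner, rating, loser);
    }
}
```

Every template carries the same facts, so the variety changes only the tone of the reply and never the answer itself.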
Have fun generating responses, but know your audience. If you are designing a system for the
military, they might not appreciate a lighter, varying response. (And they carry guns.)
Chapter 9 Cloudmersive
As we mentioned earlier, Cloudmersive offers several web services to save developers from
tasks that could be tedious or difficult to do. For example, you can validate that an input from the
user looks like an email address with a simple regular expression. However, there isn’t an
easy way to confirm that [email protected] is not an actual address.
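The local check can be as small as the sketch below. The pattern is an assumption and is deliberately loose: it only confirms the general shape of an address (something, an @ sign, something, a dot, something) and says nothing about whether the mailbox exists.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailShapeCheck
{
    // Loose shape test: text@text.text with no spaces and a single @.
    private static readonly Regex EmailPattern =
        new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+$", RegexOptions.Compiled);

    // Returns true when the input looks like an email address.
    public static bool LooksLikeEmail(string input)
    {
        return !string.IsNullOrWhiteSpace(input) && EmailPattern.IsMatch(input.Trim());
    }
}
```

An address can pass this test and still be undeliverable, which is the gap a validation web service fills.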
The Cloudmersive API provides additional web services to check whether that is an actual email
address. When I ran the /validate/email/address/full web service, I received this reply.
"ValidAddress": false,
"MailServerUsedForValidation": "psg.com"
Be sure to explore the Cloudmersive web service library; the email example is just the tip of the
iceberg.
Sentence parsing
Listing 45 shows the code to ask the API to break a paragraph into sentences. The API returns
a delimited string; this code breaks that string apart and returns a list of strings (sentences).
Language detection
Listing 46 shows the code to ask the API to determine the language of a text string.
Extracting entities
This API attempts to extract entities (people, locations, etc.) from the input sentence. It is shown
in Listing 47.
Extracting words
Listing 48 extracts a list of words from a sentence.
Tagging words
Cloudmersive uses the Penn Treebank tags, and Listing 49 shows the code that returns a string
with the words tagged with the appropriate codes. This web service provides the same type of
information as the code we developed in Chapter 5.
Summary
While the web services don't offer Sentiment Analysis or Text Summarization, they do provide a
solid set of services you can use to create a tagged list of words for your application to use.
Since these are web services called via a POST, you need to pass your credentials and check for
an exception (typically raised if the server is unavailable, the network is down, etc.).
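The exception check can be centralized in one small wrapper, sketched below. The ApiGuard class is a hypothetical helper, not part of any Cloudmersive library; it assumes the underlying call throws a WebException when the server or network is unavailable, as HttpWebRequest does.

```csharp
using System;
using System.Net;

public static class ApiGuard
{
    // Runs an API call and returns null instead of throwing
    // when the service cannot be reached.
    public static string TryCall(Func<string> apiCall)
    {
        try
        {
            return apiCall();
        }
        catch (WebException ex)
        {
            Console.Error.WriteLine("NLP service unavailable: " + ex.Message);
            return null;
        }
    }
}
```

A caller passes the actual request as a lambda and treats a null result as a signal to fall back to local code or to ask the user to try again later.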
The source code for the Cloudmersive API calls is included on the book’s source code site.
Chapter 10 Google Cloud NLP API
Google provides many API web services, including a set of API calls for processing text. In this
chapter, we will discuss how to get set up to use the service, and the service calls available.
Developers dashboard
The developer's dashboard (Figure 16) provides access to all of Google's APIs, which you can
explore by clicking on the library icon.
API Library
You will need to create credentials by first finding the Cloud Natural Language API and clicking
on it. Figure 17 shows the API library that Google offers.
Figure 17 – Google API library
Select the API by clicking on the box, shown in Figure 18. Click Enable to enable the API.
Creating credentials
Once you've picked the API, you'll need to create credentials for using the API. Once you do,
you'll be given an API key, which you'll need in your code. (I've hidden mine here.)
That's it. Save that API key, and you are ready to start interfacing with the Natural Language API.
Basic class
Listing 50 is the basic class to call the Google APIs. Be sure to set the GOOGLE_KEY variable to
the API key you obtained previously.
        NLPrequest.ContentType = "application/json";
        using (var streamWriter = new StreamWriter(NLPrequest.GetRequestStream()))
        {
            // Serialize the body so quotes and line breaks in msg are escaped correctly.
            string json = JsonConvert.SerializeObject(
                new { document = new { type = "PLAIN_TEXT", content = msg } });
            streamWriter.Write(json);
            streamWriter.Flush();
        }
        var NLPresponse = (HttpWebResponse)NLPrequest.GetResponse();
        if (NLPresponse.StatusCode == HttpStatusCode.OK)
        {
            using (var streamReader = new StreamReader(NLPresponse.GetResponseStream()))
            {
                result = streamReader.ReadToEnd();
            }
            // Parse the JSON reply into a dynamic object the callers can walk by name.
            results = JsonConvert.DeserializeObject<dynamic>(result);
        }
        return results;
    }
}
These calls return dynamic objects, which you can further parse to pull out information. The
Google documentation provides details on the response data returned via the API.
You can access any of the properties of the object by name. For example, using the syntax
resp.language.Value will return the string en. The syntax resp.sentences.Count will return
the number of sentences in the response, and resp.sentences[x] will let you access the
individual items in the sentences collection.
Listing 52 – Extract sentences
static public List<string> GoogleExtractSentences(string Paragraph)
{
    List<string> Sentences_ = new List<string>();
    var ans_ = analyzeSyntax(Paragraph);
    if (ans_ != null)
    {
        // Copy the text of each sentence out of the dynamic response.
        for (int x = 0; x < ans_.sentences.Count; x++)
        {
            Sentences_.Add(ans_.sentences[x].text.content.Value);
        }
    }
    return Sentences_;
}
In Chapter 5, we introduced the concept of tagging words, using the Penn Treebank tag set.
(See Appendix A.) Google uses a different set of tags, so we need to do some conversion work
to map the Google tags to the Penn Treebank set.
Google tags
Google's API returns a token for each word in the parsed word list. These tags are based on the
Universal tag list (see Appendix B). Each token has a part of speech component, which has 12
properties. We need to map these properties to the appropriate Penn Treebank tag. Listing 53
shows the code to get a list of tagged words from the Google API (by converting the Google
partOfSpeech information to a Penn Treebank tag).
We've included an option to return the lemma. A lemma is the base form of a word, so that wins,
won, winning, etc. would all have a lemma of win. Adding a lemma value simplifies the task of
determining the concept the verb or noun represents. Google's API includes a lemma, so the
method can append the lemma to the returned tag. Figure 21 shows a sample syntax analysis
call from the API, including lemmas.
Listing 54 shows the code to convert the Universal Part of Speech tag to a Treebank code,
relying on the Universal tag and some of the other fields available with the 12 properties
returned for each word.
}
// Tweak certain tags
if (ans=="NN")
{
string number = Token.number.Value.ToUpper();
string proper = Token.proper.Value.ToUpper();
The code takes the Token object and maps it to the appropriate Treebank tag. However, there
is additional information in the Google token that lets us map further, determining the type of
noun and the tense of the verb.
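The overall shape of such a mapping can be sketched as below. This is a simplified illustration, not the book's Listing 54: the TagMapper class is hypothetical, it takes the tag, number, proper, and tense values as plain strings, and it refines only nouns and past-tense verbs. The enum values used (PLURAL, PROPER, PAST) are assumed to match the strings the Google API returns.

```csharp
using System;

public static class TagMapper
{
    // Maps a Universal POS tag plus a few partOfSpeech properties
    // to a Penn Treebank tag. Unhandled tags are passed through unchanged.
    public static string ToTreebank(string tag, string number, string proper, string tense)
    {
        switch (tag)
        {
            case "NOUN":
                bool isProper = proper == "PROPER";
                bool isPlural = number == "PLURAL";
                if (isProper) return isPlural ? "NNPS" : "NNP";
                return isPlural ? "NNS" : "NN";
            case "VERB":
                // Only the past tense is refined in this sketch.
                return tense == "PAST" ? "VBD" : "VB";
            case "ADJ": return "JJ";
            case "ADV": return "RB";
            case "ADP": return "IN";
            case "CONJ": return "CC";
            case "DET": return "DT";
            case "NUM": return "CD";
            case "PRON": return "PRP";
            case "PRT": return "RP";
            default: return tag;
        }
    }
}
```

A fuller version would also separate the remaining verb forms (VBG, VBN, VBZ, and so on) and the comparative and superlative adjectives and adverbs, which need more of the token's properties than the four used here.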
Summary
The Google Cloud API provides a good set of NLP API calls, and can be helpful to build your
tagged list of words. The API also offers categorization, named entity recognition, and sentiment
analysis, all handy features when processing natural language text.
Chapter 11 Microsoft Cognitive Services
Getting started
To get started with Microsoft Cognitive Services, you will first need to visit this site and sign up
for an API key.
Figure 23 – Cognitive Services sign-up
Figure 24 – Sign in
Once you've signed in, you will be provided an endpoint and an API key. Save these two,
because you'll need them to retrieve data via the API.
Figure 25 – NuGet package
{
result = streamReader.ReadToEnd();
}
results = JsonConvert.DeserializeObject<dynamic>(result);
}
return results;
This is the basic code to reach the endpoint and pass the data along to Cognitive Services. Be
sure to set the ENDPOINT and API_KEY to the ones you obtained from the registration. This
method returns the result object as a JSON object; you will need to have your method calls
extract whatever information your application needs.
Listing 56 shows a number of wrapper calls that extract the JSON information for use in your
application.
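static public List<string> classifyText(string text)
{
    List<string> ans = new List<string>();
    var results = BuildAPICall("keyPhrases", text);
    try
    {
        // Collect each key phrase reported for the first (and only) document.
        foreach (var curPhrase in results.documents[0].keyPhrases)
        {
            ans.Add(curPhrase.Value);
        }
    }
    catch
    {
        // Missing or malformed response: return whatever was collected so far.
    }
    return ans;
}

static public List<string> analyzeEntities(string text)
{
    List<string> ans = new List<string>();
    var results = BuildAPICall("entities", text);
    try
    {
        // Collect the name of each entity reported for the first document.
        foreach (var curEntity in results.documents[0].entities)
        {
            ans.Add(curEntity.name.Value);
        }
    }
    catch
    {
        // Missing or malformed response: return whatever was collected so far.
    }
    return ans;
}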
These calls make the API call and extract the returned data, either as a string, a list, or other
object, depending on your application needs.
Summary
Cognitive Services is a helpful API from Microsoft that handles some of the trickier issues in
Natural Language Processing. You can explore the text analytics offered via these APIs to
enhance your application.
Chapter 12 Other NLP uses
In the previous chapters, we explored some of the basic functions and code to create a simple
question-answering NLP application. The goal of the book was to get the reader comfortable
with understanding how to parse sentences to create a tagged word list, and then use that
tagged word list to answer English-language questions.
Answering questions is just the tip of the iceberg. Knowing the words and a reasonable set of
tags helps the programs attempt to discern meaning. Here are some other applications of
Natural Language Processing, which can be accessed through the various APIs we talked
about in earlier chapters.
Language recognition
As the world continues to get smaller, most applications cannot simply assume the user is
entering English text. For example, if an application allows users to send tech support requests
via email or text message, you might get some text as follows:
Since you would like to send the support request to a person who speaks the language, you would
first need to discern the language. Most of the APIs handle this task for you, by returning the
language they think the text is in. Many European languages were derived from Latin, so
discerning the actual language is a tricky task. Leave this one to the big guys. 😊
The Microsoft Cognitive Services API can detect over 100 different languages. For example, say
we give the API the request shown in Listing 57.
"documents": [
Listing 58 – Results of language detection
"documents": [
The API returns the language, the language code, and a score indicating the confidence in its
language determination.
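Once the language code and score are in hand, routing the request is a short decision. The sketch below is an assumption-laden illustration: the LanguageRouter class, the queue names, and the 0.8 minimum score are all hypothetical, and a low-confidence result is sent to a person rather than guessed at.

```csharp
using System;

public static class LanguageRouter
{
    // Picks a support queue from a detected language code and confidence score.
    public static string ChooseQueue(string languageCode, double score, double minScore = 0.8)
    {
        // Low confidence or no result: let a person look at it.
        if (string.IsNullOrEmpty(languageCode) || score < minScore) return "manual-review";

        switch (languageCode.ToLowerInvariant())
        {
            case "en": return "english-support";
            case "es": return "spanish-support";
            case "fr": return "french-support";
            default: return "manual-review";
        }
    }
}
```

The threshold matters most for short messages, where closely related languages are easy to confuse.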
Sentiment analysis
Sentiment analysis attempts to discover the attitude a piece of text is conveying. Is it a positive
text or a negative one? Such knowledge would be very helpful for a business trying to keep in
touch with its customers. Social media and online reviews provide a great source of customer
feedback, and companies would do well to listen to it.
Unfortunately, it is not a simple task. While some sentences are clear, there are many times
where the meaning is not. For example, is the following sentence positive or negative?
Even particular words can be better interpreted with knowledge of the domain. For example, if I
see the word “thin”, and I am talking about cell phones, this is probably a positive comment.
However, if I am referring to my hotel room, thin sheets or thin walls would generally be
negative comments.
Both Microsoft Cognitive Services and Google Cloud NLP offer sentiment analysis API calls.
Figure 26 – Google sentiment analysis
The score ranges from -1 to 1; the lower the number, in general, the more negative the text is
perceived. In this case, a score of 0.6 indicates a pretty positive press release. The magnitude,
which ranges from 0 to infinity, indicates how strong the emotion is. Based on Google's
analysis, I can't wait for Windows 22.
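To act on these two numbers, an application might turn them into a label, as in the sketch below. The SentimentLabel class and every threshold in it (0.25 for a clear score, 0.5 and 2.0 for magnitude) are assumptions to be tuned against your own text; a near-zero score with a high magnitude is treated as mixed, since strong positive and negative passages can cancel out.

```csharp
using System;

public static class SentimentLabel
{
    // Converts a score (-1 to 1) and magnitude (0 and up) into a readable label.
    public static string Describe(double score, double magnitude)
    {
        if (Math.Abs(score) < 0.25)
        {
            // Little net sentiment: either no emotion at all or emotions that cancel out.
            return magnitude < 0.5 ? "neutral" : "mixed";
        }

        string strength = magnitude >= 2.0 ? "strongly " : "";
        return strength + (score > 0 ? "positive" : "negative");
    }
}
```

Because magnitude grows with the length of the text, thresholds that work for a one-line review will be too low for a long press release.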
Categorization
Imagine a corporation that sells many different products and services. They have an NLP
system to process received emails. Knowing the general category that the email is discussing
could help them direct the message to the appropriate department for processing. The larger
the domain space (that is, the number of products and services), the more difficult it becomes to
extract categories from text messages.
Microsoft Cognitive Services provides a means to read through a text and extract the various
objects the document is talking about. Figure 27 shows an example of a customer complaint
email and the information that the Cognitive Services API gleaned from it.
You can use the code from Chapter 11 to gather this information into JSON objects so your
code can decide what actions to take.
Summary
Using the APIs discussed in the previous few chapters will assist you in adding NLP to your
application. The problems mentioned in this chapter are handled via the APIs, so while it can be
intellectually fun and challenging to think about, I would suggest letting the big guys deal with
the nuances if your goal is simply to enhance your application by allowing English questions.
Chapter 13 Summary
Natural Language Processing is an ambitious field. The idea of being able to converse with a
computer program—once the realm of science fiction—is getting closer all the time. In this book,
we touched upon how to use Natural Language Processing in your code. The code in the early
part of the book illustrates the principal components, but the API services from the “big guys”
offer a quick way to implement those concepts.
Once you have a parsed and tagged list of words, your application still needs to decide what to do
with them, but ideally, allowing your users to communicate with your systems in English would
be a nice accomplishment.
The list of words and tags (and any other information) is the starting point for most text analytics
uses. The code in this book and the various API calls all provide methods to build that list. If
your domain is limited and you know the data and response very well, you can use the limited
code here to process and interpret the text sentences. If you are trying to create a more general
application, I suggest using the APIs and leaving the hard work to someone else.
Keep in mind that the more you know about your domain and usage, the less ambiguous the
user questions will be, so spend a lot of time getting to know your data and users. Now, I must
go to play tennis with Roger. My scheduling app set it up for me—I just hope the application
knows I meant my friend Roger.
Appendix A Penn Treebank tags
The Penn Treebank project provides a list of 36 tags used to classify words. Many NLP web
services use these tags. The complete list is shown in Table 10. The official site can be found
here.
Appendix B Universal POS tags
The Universal Part of Speech tag set (used by Google) is smaller than the Penn Treebank tag set.
Table 11 shows the list of Universal POS tags (see http://universaldependencies.org/u/pos/ for
more information).
Appendix C About the code
The code for the book consists of a single solution with three projects. The primary project is the
Natural Language Processing class library, which contains the code samples discussed in this
book. You can also find them at the Syncfusion GitHub repository.
You will likely need to review the dictionary of words, regular expressions, etc. in tagger and
Entities to customize them for your application needs.
There are also three wrapper classes that interact with the various APIs we've discussed
throughout the book. You will need to obtain keys and endpoints, and then update these API
classes with your credentials. These classes are:
• CloudmersiveNLP
• GoogleNLP
• MicrsoftNLP
Playground
This project is a Windows Forms application that allows you to enter a sentence and see
the results of the various NLP operations. It takes the input text, and then performs the
requested API call. The API dropdown list allows you to specify which API (or the book's internal
code) performs the requested action (see Figure 28).
Figure 28 - Playground
Tennis Data
The Tennis Data project contains the code to load the data and, using LINQ queries, create
functions that answer the questions the user might ask. You will likely write your own version,
using SQL, Entity Framework, etc.