Natural Language Toolkit NLTK PDF

The Natural Language Toolkit
(NLTK)
Natural Language Processing (NLP)
• How can we make a computer understand language?
– Can a human write/talk to the computer?
• Or can the computer guess/predict the input?
– Can the computer talk back?
– Based on language rules, patterns, or statistics
• For now, statistical approaches are the more accurate and more popular choice
Some areas of NLP
• shallow processing – the surface level
– tokenization
– part-of-speech tagging
– forms of words
• deep processing – the underlying structures of language
– word order (syntax)
– meaning
– translation
• natural language generation
The NLTK
• A collection of:
– Python functions and objects for accomplishing NLP tasks
– sample texts (corpora)
• Available at: http://nltk.sourceforge.net
– Requires Python 2.4 or higher
– Click 'Download' and follow instructions for your OS
Tokenization
• Say we want to know the words in Marty's vocabulary
– "You know what I hate? Anybody who drives an S.U.V. I'd really
like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him
square in the teeth. Booyah. Be like, I'm Marty Stepp, the best
ever. Booyah!"
• How do we split his speech into tokens?
Tokenization (cont.)
• How do we split his speech into tokens?
>>> martysSpeech.split()
['You', 'know', 'what', 'I', 'hate?', 'Anybody',
'who', 'drives', 'an', 'S.U.V.', "I'd", 'really',
'like', 'to', 'find', 'Mr.',
'It-Costs-Me-100-Dollars-To-Gas-Up', 'and', 'kick', 'him',
'square', 'in', 'the', 'teeth.', 'Booyah.', 'Be',
'like,', "I'm", 'Marty', 'Stepp,', 'the', 'best',
'ever.', 'Booyah!']
• Now, how often does he use the word "booyah"?
>>> martysSpeech.split().count("booyah")
0
>>> # What the!
Tokenization (cont.)
• We could lowercase the speech
• We could write our own method to split on ".", ",", "-", etc.
• The NLTK already has several tokenizer options
• Try:
• nltk.tokenize.WordPunctTokenizer
– tokenizes on all punctuation
• nltk.tokenize.PunktWordTokenizer
– trained algorithm to statistically split on words
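A minimal sketch of the first two ideas above, using only Python's standard library rather than the NLTK: lowercase the speech, then split it into runs of word characters and runs of punctuation. The function name simple_tokenize and the regular expression are illustrative assumptions, not NLTK code; a punctuation-based tokenizer behaves in a broadly similar way.

```python
import re

martysSpeech = ("You know what I hate? Anybody who drives an S.U.V. I'd really "
                "like to find Mr. It-Costs-Me-100-Dollars-To-Gas-Up and kick him "
                "square in the teeth. Booyah. Be like, I'm Marty Stepp, the best "
                "ever. Booyah!")

def simple_tokenize(text):
    """Lowercase the text, then return runs of word characters
    and runs of punctuation as separate tokens."""
    return re.findall(r"\w+|[^\w\s]+", text.lower())

tokens = simple_tokenize(martysSpeech)
booyah_count = tokens.count("booyah")
print(booyah_count)
```

Because punctuation becomes its own token and case is normalized, "Booyah." and "Booyah!" are both counted. The trade-off is that "S.U.V." and "I'd" are broken into pieces, which is one reason to try a trained tokenizer.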
Part-of-speech (POS) tagging
• If you know a token's POS you know:
– is it the subject?
– is it the verb?
– is it introducing a grammatical structure?
– is it a proper name?
Part-of-speech (POS) tagging
• Exercise: most frequent proper noun in the Penn Treebank?
– Try:
• nltk.corpus.treebank
• Python's dir() to list attributes of an object
– Example:
>>> dir("hello world!")
[..., 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs',
'find', 'index', 'isalnum', 'isalpha',
'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', ...]
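The full dir() listing also contains many names that start with an underscore. A small sketch, assuming you only want the ordinary public methods, filters those out (the variable name public_names is just for illustration):

```python
# dir() returns a list of attribute names as strings,
# so it can be filtered like any other list.
public_names = [name for name in dir("hello world!")
                if not name.startswith("_")]

print(public_names[:5])
```

The same trick works on any object, including nltk.corpus.treebank, which makes it a quick way to discover what a corpus reader offers.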
Tuples
• tagged_words() gives us a list of tuples
– tuple: like a list, but immutable (you can't change it once it is created)
– in this case, the tuples are (word, tag) pairs
>>> # Get the (word, tag) pair at list index 0
...
>>> pair = nltk.corpus.treebank.tagged_words()[0]
>>> pair
('Pierre', 'NNP')
>>> word = pair[0]
>>> tag = pair[1]
>>> print word, tag
Pierre NNP
>>> word, tag = pair # or unpack in 1 line!
>>> print word, tag
Pierre NNP
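One possible approach to the proper-noun exercise is sketched below, assuming the data is a list of (word, tag) tuples like the one tagged_words() returns. To stay self-contained it uses a tiny hand-written list as a hypothetical stand-in for the Treebank, and the helper name most_frequent_with_tag is made up for this example.

```python
from collections import Counter

# Hypothetical stand-in for nltk.corpus.treebank.tagged_words()
tagged = [("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"),
          ("years", "NNS"), ("old", "JJ"), ("Mr.", "NNP"), ("Vinken", "NNP"),
          ("is", "VBZ"), ("chairman", "NN")]

def most_frequent_with_tag(pairs, wanted_tag):
    """Return (word, count) for the most common word carrying wanted_tag,
    or None if no word has that tag."""
    # Tuple unpacking in the loop gives us word and tag directly.
    counts = Counter(word for word, tag in pairs if tag == wanted_tag)
    if not counts:
        return None
    return counts.most_common(1)[0]

best = most_frequent_with_tag(tagged, "NNP")
print(best)
```

With the real corpus installed, passing nltk.corpus.treebank.tagged_words() instead of the stand-in list should answer the exercise.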
POS tagging (cont.)
• How do we tag plain sentences?
– An NLTK tagger needs a list of tagged sentences to train on
• We'll use nltk.corpus.treebank.tagged_sents()
– Then it is ready to tag any input! (but how well?)
– Try these tagger objects:
• nltk.UnigramTagger(tagged_sentences)
• nltk.TrigramTagger(tagged_sentences)
– Call the tagger's tag(tokens) method
>>> tagger = nltk.UnigramTagger(tagged_sentences)
>>> result = tagger.tag(tokens)
>>> result
[('You', 'PRP'), ('know', 'VB'), ('what', 'WP'),
('I', 'PRP'), ('hate', None), ('?', '.'), ...]
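Why does 'hate' come back with None? A rough sketch of the idea behind a unigram tagger may help. This is not the NLTK's implementation: the function names train_unigram and tag_tokens and the one-sentence training set are assumptions made for illustration. The model remembers the most frequent tag seen for each word and has nothing to say about words it never saw in training.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sentences):
    """Map each word to the tag it carried most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_tokens(model, tokens):
    """Tag each token; words unseen in training get None."""
    return [(tok, model.get(tok)) for tok in tokens]

# Hypothetical, tiny training set: one tagged sentence.
training = [[("You", "PRP"), ("know", "VB"), ("what", "WP"),
             ("I", "PRP"), ("like", "VBP"), ("?", ".")]]

model = train_unigram(training)
result = tag_tokens(model, ["You", "know", "what", "I", "hate", "?"])
print(result)
```

Since "hate" never appears in the training sentences, the lookup fails and the tag is None, which mirrors the gap in the output above.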
Parsing
• Syntax is as important for a compiler as it is for natural
language
• Realizing the hidden structure of a sentence is useful for:
– translation
– meaning analysis
– relationship analysis
– a cool demo!
• Try:
– nltk.draw.rdparser.demo()
Conclusion
• NLTK: NLP made easy with Python
– Functions and objects for:
• tokenization, tagging, generation, parsing, ...
• and much more!
– Even armed with these tools, NLP has a lot of difficult problems!
• Also saw:
– List methods
– dir()
– Tuples
Python scikit-learn
• Popular machine learning toolkit in Python
http://scikit-learn.org/stable/
• Requirements
– Anaconda
– Available from https://www.continuum.io/downloads
– Includes numpy, scipy, and scikit-learn (former two are
necessary for scikit-learn)
SciKit
Many popular Python toolboxes/libraries:
– NumPy
– SciPy
– Pandas
– SciKit-Learn
Visualization libraries
– matplotlib
– Seaborn
and many more …
All these libraries are installed on the SCC
Python Libraries for Data Science
SciPy:
▪ collection of algorithms for linear algebra, differential
equations, numerical integration, optimization, statistics and
more
▪ part of SciPy Stack
▪ built on NumPy
Link: https://www.scipy.org/scipylib/
SciKit-Learn:
▪ provides machine learning algorithms: classification,
regression, clustering, model validation etc.
▪ built on NumPy, SciPy and matplotlib
Link: http://scikit-learn.org/
matplotlib:
▪ Python 2D plotting library which produces publication-quality
figures in a variety of hardcopy formats
▪ a set of functionalities similar to those of MATLAB
▪ line plots, scatter plots, bar charts, histograms, pie charts etc.
▪ relatively low-level; some effort needed to create advanced
visualizations
Seaborn:
▪ based on matplotlib
▪ provides a high-level interface for drawing attractive statistical
graphics
▪ Similar (in style) to the popular ggplot2 library in R
Link: https://seaborn.pydata.org/
Login to the Shared Computing
Cluster
• Use your SCC login information if you have an SCC account
• If you are using a tutorial account, see the info on the blackboard
Note: Your password will not be displayed while you enter it.
Selecting Python Version on the
SCC
# view available python versions on the SCC
[scc1 ~] module avail python
# load python 3 version
[scc1 ~] module load python/3.6.2
Start Jupyter notebook
# On the Shared Computing Cluster
[scc1 ~] jupyter notebook
Loading Python Libraries
In [ ]: #Import Python Libraries
        import numpy as np
        import scipy as sp
        import pandas as pd
        import matplotlib as mpl
        import seaborn as sns
Press Shift+Enter to execute the jupyter cell