Lesson 1 - NLP

HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
Natural Language Processing - (NLP)
Lecturer: Dr. Bùi Thanh Hùng

Data Science Department
Faculty of Information Technology
Industrial University of Ho Chi Minh city
Email: [email protected]
Website: https://sites.google.com/site/hungthanhbui1980/
Outline
• Overview of the field

– What is Natural Language Processing?
– NLP applications
– Aspects of language processing
– Why is NLP difficult?
• The NLP Research Community
What is Natural Language Processing?
• Natural Language Processing

– Process information contained in natural language
text.
– Also known as Computational Linguistics (CL),
Human Language Technology (HLT), Natural
Language Engineering (NLE)
NLP applications
• Text Categorization
– Classify documents by topic, language, or author; examples include spam filtering,
information retrieval (relevant vs. not relevant), and sentiment classification (positive vs. negative)
• Spelling & Grammar Corrections
• Information Extraction
• Speech Recognition
• Information Retrieval
– Synonym Generation
• Summarization
• Machine Translation
• Question Answering
• Dialog Systems
– Language generation
Where does it fit in the CS taxonomy?
Computers
  Databases
  Artificial Intelligence
    Robotics
    Natural Language Processing
      Information Retrieval
      Machine Translation
      Language Analysis
        Semantics
        Parsing
    Search
  Algorithms
  Networking
Aspects of language processing
• Word, lexicon: lexical analysis

– Morphology, word segmentation
• Syntax
– Sentence structure, phrase, grammar, …
• Semantics
– Meaning
– Execute commands
• Discourse analysis
– Meaning of a text
– Relationship between sentences (e.g. anaphora)
Dependency Parsing
Raw sentence:
He reckons the current account deficit will narrow to only 1.8 billion in September.
Step 1: Part-of-speech tagging
POS-tagged sentence:
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Step 2: Word dependency parsing
Word dependency parsed sentence:
[Figure: dependency arcs drawn over the sentence, with labels SUBJ, MOD, COMP, S-COMP, SPEC, and ROOT]
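The tagging step above can be sketched in a few lines of Python. This is only an illustration of the data structure, not a tagger: the tokens and tags are copied from the example, and the variable names are assumptions made for this sketch.

```python
from collections import defaultdict

# Tokens and tags copied from the POS-tagged example sentence.
tokens = ("He reckons the current account deficit will narrow "
          "to only 1.8 billion in September .").split()
tags = "PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .".split()

# Every token must receive exactly one tag.
assert len(tokens) == len(tags)

# A tagged sentence is simply a list of (word, tag) pairs.
tagged = list(zip(tokens, tags))

# Group the words by tag, e.g. to list all common nouns (NN).
words_by_tag = defaultdict(list)
for word, tag in tagged:
    words_by_tag[tag].append(word)

print(tagged[:3])          # [('He', 'PRP'), ('reckons', 'VBZ'), ('the', 'DT')]
print(words_by_tag["NN"])  # ['account', 'deficit']
```

A dependency parser would take `tagged` as input and add one head and one label per word; the arcs from the figure are not reproduced here.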
Dependency Trees: 1. Assign heads
Sentence: The plan to swallow Wanda has been thrilling Otto
S [head=thrill]
  NP [head=plan]
    Det: The
    N [head=plan]
      N [head=plan]: plan
      VP [head=swallow]
        to
        VP [head=swallow]
          V [head=swallow]: swallow
          NP [head=Wanda]: Wanda
  VP [head=thrill]
    V: has
    VP [head=thrill]
      V: been
      VP [head=thrill]
        V [head=thrill]: thrilling
        NP [head=Otto]: Otto
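Once every constituent has a head, dependencies can be read off the tree: a child whose head differs from its parent's head depends on the parent's head. The following is a minimal sketch of that idea, not the lecturer's code. The tuple encoding of the tree and the choice to treat "the", "to", "has", and "been" as their own heads are assumptions made for illustration.

```python
def leaf(label, word):
    """A leaf node: its head is the word itself."""
    return (label, word, [])

# Each node is (label, head, children); heads are written as on the slide.
tree = ("S", "thrill", [
    ("NP", "plan", [
        leaf("Det", "the"),
        ("N", "plan", [
            leaf("N", "plan"),
            ("VP", "swallow", [
                leaf("to", "to"),
                ("VP", "swallow", [
                    leaf("V", "swallow"),
                    leaf("NP", "Wanda"),
                ]),
            ]),
        ]),
    ]),
    ("VP", "thrill", [
        leaf("V", "has"),
        ("VP", "thrill", [
            leaf("V", "been"),
            ("VP", "thrill", [
                leaf("V", "thrill"),
                leaf("NP", "Otto"),
            ]),
        ]),
    ]),
])

def dependencies(node):
    """Return (dependent, head) pairs read off a head-annotated tree.

    Heads are compared by word, so this sketch assumes no word
    occurs twice in the sentence.
    """
    _, head, children = node
    pairs = []
    for child in children:
        child_head = child[1]
        if child_head != head:
            pairs.append((child_head, head))
        pairs.extend(dependencies(child))
    return pairs

for dependent, head in dependencies(tree):
    print(f"{dependent} <- {head}")
```

The sentence has nine words and the sketch yields eight dependencies; the remaining word, "thrill", is the root.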
Parsing (in Definite Clause Grammars)
s --> np, vp.
np --> det, noun.
np --> proper_noun.
vp --> v, np.
vp --> v.
det --> [a].
det --> [an].
det --> [the].
noun --> [apple].
noun --> [orange].
proper_noun --> [john].
proper_noun --> [mary].
v --> [eats].
v --> [loves].

E.g. john eats an apple.
john: proper_noun, eats: v, an: det, apple: noun
proper_noun (john) gives np
det noun (an apple) gives np
v np (eats an apple) gives vp
np vp (john eats an apple) gives s
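For readers without a Prolog system, the same grammar can be tried with a small top-down parser in Python. This is a sketch under assumptions: the dictionary names `GRAMMAR` and `LEXICON` and the helper functions are invented for this example, and the tuple tree format is just one possible encoding.

```python
# Phrase-structure rules and lexicon, transcribed from the grammar above.
GRAMMAR = {
    "s": [["np", "vp"]],
    "np": [["det", "noun"], ["proper_noun"]],
    "vp": [["v", "np"], ["v"]],
}
LEXICON = {
    "det": {"a", "an", "the"},
    "noun": {"apple", "orange"},
    "proper_noun": {"john", "mary"},
    "v": {"eats", "loves"},
}

def parse(symbol, words, start):
    """Yield (tree, end) for every way `symbol` can cover words[start:end]."""
    if symbol in LEXICON:
        if start < len(words) and words[start] in LEXICON[symbol]:
            yield (symbol, words[start]), start + 1
        return
    for rhs in GRAMMAR[symbol]:
        for children, end in parse_sequence(rhs, words, start):
            yield (symbol, *children), end

def parse_sequence(symbols, words, start):
    """Yield (list_of_trees, end) for a sequence of symbols."""
    if not symbols:
        yield [], start
        return
    for first, middle in parse(symbols[0], words, start):
        for rest, end in parse_sequence(symbols[1:], words, middle):
            yield [first] + rest, end

def parse_sentence(sentence):
    """Return all complete parse trees for a sentence."""
    words = sentence.lower().rstrip(".").split()
    return [tree for tree, end in parse("s", words, 0) if end == len(words)]

print(parse_sentence("john eats an apple."))
print(parse_sentence("eats john"))  # [] because no rule starts a sentence with a verb
```

Like the original grammar, this sketch does not check agreement between "a"/"an" and the noun, so "john eats a apple" is also accepted.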
Semantic analysis
john eats an apple.
proper_noun (john): [person: john]
v (eats): λYλX eat(X,Y)
det noun (an apple): np = [apple]
np (john): [person: john]
vp (eats an apple): eat(X, [apple])
s: eat([person: john], [apple])

Sem. Cat (Ontology)
object
  animated
    person
    animal
      vertebral
      …
  non-anim
    food
      fruit
        apple
        …
      …
    …
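The composition of meanings above can be imitated with ordinary Python functions. This is an illustrative sketch only: the tuple encoding of `eat(X, Y)`, the `PARENT` table, and the `is_a` helper are assumptions, and the ontology contains just the links that are legible on the slide.

```python
# Lexical meanings.
john = ("person", "john")   # [person: john]
apple = "apple"             # [apple]

def eat(y):
    """λY λX eat(X, Y): takes the object first, then the subject."""
    return lambda x: ("eat", x, y)

# vp = v applied to the object np:  λX eat(X, [apple])
vp = eat(apple)
# s = vp applied to the subject np: eat([person: john], [apple])
s = vp(john)
print(s)  # ('eat', ('person', 'john'), 'apple')

# A fragment of the semantic-category ontology as child -> parent links.
PARENT = {
    "apple": "fruit",
    "fruit": "food",
    "food": "non-anim",
    "non-anim": "object",
    "person": "animated",
    "animal": "animated",
    "animated": "object",
}

def is_a(category, ancestor):
    """True if `category` equals `ancestor` or lies below it in the ontology."""
    while category is not None:
        if category == ancestor:
            return True
        category = PARENT.get(category)
    return False

print(is_a("apple", "food"))      # True
print(is_a("apple", "animated"))  # False
```

A check such as `is_a(apple, "food")` is one way a system could test whether the object of "eat" belongs to a plausible semantic category.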
Parsing & semantic analysis
• Rules: syntactic rules or semantic rules

– What component can be combined with what component?
– What is the result of the combination?
• Categories
– Syntactic categories: Verb, Noun, …
– Semantic categories: Person, Fruit, Apple, …
• Analyses
– Recognize the category of an element
– See how different elements can be combined into a
sentence
– Problem: The choice is often not unique
Discourse analysis
• Anaphora
He hits the car with a stone. It bounces back.
• Understanding a text
– Who/when/where/what … are involved in an event?
– How to connect the semantic representations of
different sentences?
– What is the cause of an event and what is the
consequence of an action?
–…
Why is NLP difficult?
• An NLP system needs to answer the question “who did
what to whom”
• Language is ambiguous
– At all levels: lexical, phrase, semantic
– Iraqi Head Seeks Arms
• Word sense is ambiguous (head, arms)
– Stolen Painting Found by Tree
• Thematic role is ambiguous: tree is agent or location?
– Ban on Nude Dancing on Governor’s Desk
• Syntactic structure (attachment) is ambiguous: is the ban or the
dancing on the desk?
– Hospitals Are Sued by 7 Foot Doctors
• Semantics is ambiguous : what is 7 foot?
• Language is flexible
– New words, new meanings
– Different meanings in different contexts
• Language is subtle
– He arrived at the lecture
– He chuckled at the lecture
– He chuckled his way through the lecture
– **He arrived his way through the lecture
• Language is complex!
• MANY hidden variables

– Knowledge about the world
– Knowledge about the context
– Knowledge about human communication techniques
• Can you tell me the time?
• Problem of scale
– Many (infinite?) possible words, meanings, context
• Problem of sparsity
– Very difficult to do statistical analysis, most things (words,
concepts) are never seen before
• Long range correlations
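The sparsity problem can be made concrete with a toy experiment: count words in a small "training" text and see how much of a new sentence was never observed. The two texts below reuse the example sentences from the previous slide; treating them as training and test data is an assumption made purely for illustration.

```python
from collections import Counter

# Toy "training" and "test" texts built from the lecture's example sentences.
train = "he arrived at the lecture . he chuckled at the lecture .".split()
test = "he chuckled his way through the lecture .".split()

counts = Counter(train)

# Test tokens that never occurred in the training text.
unseen = [word for word in test if word not in counts]
oov_rate = len(unseen) / len(test)

# Training words seen exactly once: their statistics are unreliable.
singletons = sorted(word for word, c in counts.items() if c == 1)

print(unseen)      # ['his', 'way', 'through']
print(oov_rate)    # 0.375
print(singletons)  # ['arrived', 'chuckled']
```

Even in this tiny example, more than a third of the test tokens are unseen; with real corpora the vocabulary keeps growing as more text is read, so the problem does not go away with more data.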
• Key problems:
– Representation of meaning
– Language presupposes knowledge about the world
– Language only reflects the surface of meaning
– Language presupposes communication between
people
Meaning
• What is meaning?
– Physical referent in the real world
– Semantic concepts, characterized also by relations.
• How do we represent and use meaning
– I am Italian
• From lexical database (WordNet)
• Italian = a native or inhabitant of Italy → Italy = republic in southern
Europe [..]
– I am Italian
• Who is “I”?
– I know she is Italian/I think she is Italian
• How do we represent “I know” and “I think”
• Does this mean that I is Italian? What does it say about the “I” and about
the person speaking?
– I thought she was Italian
• How do we represent tenses?
The NLP Research Community
• Papers
– ACL Anthology has nearly everything, free!
• Over 20,000 papers!
• Free-text searchable
– Great way to learn about current research on a topic
– New search interfaces currently available in beta
» Find recent or highly cited work; follow citations
• Used as a dataset by various projects
– Analyzing the text of the papers (e.g., parsing it)
– Extracting a graph of papers, authors, and institutions
(Who wrote what? Who works where? What cites what?)
• Conferences
– Most work in NLP is published as 8-page conference papers
with 3 double-blind reviewers.
– Main annual conferences: ACL, EMNLP, NAACL
• Also EACL, IJCNLP, COLING
• + various specialized conferences and workshops
– Big events, and growing fast! ACL 2014:
• About 1000 attendees
• 572 full-length papers submitted (146 accepted)
• 551 short papers submitted (139 accepted)
• 16 workshops on various topics
NLP Journals
Computational Linguistics
Journal of Natural Language Engineering (JNLE)
Machine Translation
Natural Language and Linguistic Theory
Journal of Natural Language Processing
…
• Institutions
– Universities: Many have NLP faculty
• Several “big players” with many faculty
• Some of them also have good linguistics,
cognitive science, machine learning, AI
– Companies:
• Old days: AT&T Bell Labs, IBM
• Now: Google, Microsoft, IBM, many startups …
– Speech: Nuance, …
– Machine translation: Language Weaver, Systran, …
– Many niche markets – online reviews, medical transcription, news
summarization, legal search and discovery …
NLP Research Centers

AT&T Labs - Research
BBN Systems and Technologies Corporation
DFKI (German research center for AI)
General Electric R&D
IRST, Italy
IBM T.J. Watson Research, NY
Lucent Technologies Bell Labs, Murray Hill, NJ
Microsoft Research, Redmond, WA
MITRE
NEC Corporation
SRI International, Menlo Park, CA
SRI International, Cambridge, UK
Xerox, Palo Alto, CA
XRCE, Grenoble, France
Google, Microsoft, Facebook, Amazon, …
• Standard tasks
– If you want people to work on your problem, make it
easy for them to get started and to measure their
progress. Provide:
• Test data, for evaluating the final systems
• Development data, for measuring whether a change to the
system helps, and for tuning parameters
• An evaluation metric (formula for measuring how well a
system does on the dev or test data)
• A program for computing the evaluation metric
• Labeled training data and other data resources
• A prize? – with clear rules on what data can be used
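As an example of "an evaluation metric" and "a program for computing" it, here is a minimal sketch of precision, recall, and F1 over sets of predicted items. The choice of F1 and the `(position, label)` encoding of the items are assumptions for illustration; each shared task defines its own metric.

```python
def precision_recall_f1(gold, predicted):
    """Compare a set of predicted items against the gold-standard set."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical named-entity annotations as (token position, label) pairs.
gold = {(0, "PER"), (3, "LOC"), (7, "ORG")}
predicted = {(0, "PER"), (3, "ORG"), (7, "ORG"), (9, "LOC")}

p, r, f = precision_recall_f1(gold, predicted)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Two of the four predictions are correct (precision 0.50) and two of the three gold items are found (recall 0.67). Distributing such a script with the data lets every participant measure progress in the same way.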
• Software
– Lots of people distribute code for these tasks
• Or you can email a paper’s authors to ask for their code
– Some lists of software, but no central site 
– Some end-to-end pipelines for text analysis

• “One-stop shopping”
• Cleanup/tokenization + morphology + tagging + parsing + …
• NLTK is easy for beginners and has a free book (intersession?)
• GATE has been around for a long time and has a bunch of modules
• Software
– To find good or popular tools:
• Search current papers, ask around, use the web
– Still, often hard to identify the best tool for your job:
• Produces appropriate, sufficiently detailed output?
• Accurate? (on the measure you care about)
• Robust? (accurate on your data, not just theirs)
• Fast?
• Easy and flexible to use? Nice file formats, command line options,
visualization?
• Trainable for new data and languages? How slow is training?
• Open-source and easy to extend?
• Datasets
– Raw text or speech corpora
• Or just their n-gram counts, for super-big corpora
• Various languages and genres
• Usually there’s some metadata (each document’s date, author, etc.)
• Sometimes licensing restrictions (proprietary or copyrighted data)
– Text or speech with manual or automatic annotations
• What kind of annotations? That’s the rest of this lecture …
• May include translations into other languages
– Words and their relationships
• Morphological, semantic, translational, evolutionary
– Grammars
– World Atlas of Language Structures
– Parameters of statistical models (e.g., grammar weights)
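The n-gram counts mentioned above are simply frequencies of word sequences of length n. A minimal sketch, assuming whitespace tokenization and a made-up helper name `ngram_counts`:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(zip(*(tokens[i:] for i in range(n))))

tokens = "he arrived at the lecture and he chuckled at the lecture".split()

bigrams = ngram_counts(tokens, 2)
print(bigrams[("at", "the")])   # 2
print(bigrams.most_common(2))
```

For very large corpora, distributing only such a table of counts is far more compact than distributing the text, and it can also sidestep licensing problems with the original documents.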
• Datasets
– Read papers to find out what datasets others are using
• Linguistic Data Consortium (searchable) hosts many large datasets
• Many projects and competitions post data on their websites
• But sometimes you have to email the author for a copy
– CORPORA mailing list is also a good place to ask around
– LREC Conference publishes papers about new datasets & metrics
– Amazon Mechanical Turk – pay humans (very cheaply) to annotate your
data or to correct automatic annotations
• Old task, new domain: Annotate parses etc. on your kind of data
• New task: Annotate something new that you want your system to find
• Auxiliary task: Annotate something new that your system may benefit from
finding (e.g., annotate subjunctive mood to improve translation)
– Can you make annotation so much fun or so worthwhile
that they’ll do it for free?
Datasets
1. Google Datasets:
Link : https://datasetsearch.research.google.com/
2. Papers with Code Datasets.
Link : https://paperswithcode.com/datasets
3. Kaggle Dataset
Link: https://www.kaggle.com/datasets
4. Big Bad NLP Database
Link: https://index.quantumstat.com/#/
5. Hugging Face Datasets
Link: https://huggingface.co/datasets
6. UCI Machine Learning
Link: https://archive.ics.uci.edu/ml/index.php
7. Amazon Datasets (Open Data on AWS)
Link: https://aws.amazon.com/opendata/
8. Awesome Public Datasets
Link: https://github.com/awesomedata/awesome-public-datasets
9. Azure public datasets
Link: https://docs.microsoft.com/.../azure-sql/public-data-sets
10. Carnegie Mellon University
Link: https://guides.library.cmu.edu/az.php
11. .gov Datasets
Link: https://data.gov.au/
https://data.gov.in/
https://data.gov.sg/
https://data.europa.eu/data/datasets?locale=en&minScoring=0
The sources listed above are some places to find datasets for machine learning, data science, and AI.
• Standard data formats

– Often just simple ad hoc text-file formats
• Documented in a README; easily read with scripts
– Some standards:
• Unicode – strings in any language (see ICU toolkit)
• PCM (.wav, .aiff) – uncompressed audio
– BWF and AUP extend w/metadata; also many compressed formats
• XML – documents with embedded annotations
• Text Encoding Initiative – faithful digital representations of printed text
• Protocol Buffers, JSON – structured data
• UIMA – “unstructured information management”; Watson uses it
– Standoff markup: raw text in one file, annotations in other
files (“noun phrase from byte 378–392”)
• Annotations can be independently contributed & distributed
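Standoff markup can be illustrated with a raw text and a separate list of byte-offset annotations. The sketch below is an assumption-laden example, not a standard format: the JSON field names `start`, `end`, and `label`, and the convention that `end` is exclusive, are choices made here for illustration.

```python
import json

# The raw text is stored on its own and is never modified.
raw_text = "John eats an apple."
raw_bytes = raw_text.encode("utf-8")

# Annotations live elsewhere (here a JSON string standing in for a second file)
# and point into the raw text by byte offsets, with `end` exclusive.
annotations_json = json.dumps([
    {"start": 0, "end": 4, "label": "noun phrase"},
    {"start": 10, "end": 18, "label": "noun phrase"},
])

def resolve(raw_bytes, annotations):
    """Return (label, covered text) for each standoff annotation."""
    return [
        (a["label"], raw_bytes[a["start"]:a["end"]].decode("utf-8"))
        for a in annotations
    ]

spans = resolve(raw_bytes, json.loads(annotations_json))
print(spans)  # [('noun phrase', 'John'), ('noun phrase', 'an apple')]
```

Because the raw text stays untouched, a second annotator can publish another annotation file (for example, part-of-speech tags) that refers to the same offsets without coordinating with the first.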
• Survey articles
– May help you get oriented in a new area
– Synthesis Lectures on Human Language Technologies
– Handbook of Natural Language Processing
– Oxford Handbook of Computational Linguistics
– Foundations & Trends in Machine Learning
– ACM Computing Surveys?
– Online tutorial papers
– Slides from tutorials at conferences
– Textbooks
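• Vietnam
JAIST: Prof. Nguyễn Lê Minh
University of Engineering and Technology, Vietnam National University, Hanoi
Vin University
University of Science (Đại học KHTN)
University of Technology (Đại học Bách khoa)
University of Information Technology (Đại học CNTT)
Posts and Telecommunications Institute of Technology
Kyoto University: Dr. Phạm Quang Nhật Minh
Ton Duc Thang University, Đại học Kỹ thuật CN, Hanoi University
…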
• Toolkits
Tsujii Lab-Tokyo, Japan: http://www.nactem.ac.uk/tsujii/
Stanford Lab, America: http://nlp.stanford.edu/
Matsumoto Lab-NAIST, Japan: http://cl.naist.jp/en/
NLTK Toolkits: http://www.nltk.org/
Open NLP: http://opennlp.sourceforge.net/projects.html
NLP Toolkits: http://www.phontron.com/nlptools.php
Kyoto Lab: http://nlp.ist.i.kyoto-u.ac.jp/EN/
Google NLP research: http://research.google.com/pubs/NaturalLanguageProcessing.html
VLSP project: http://vlsp.vietlp.org:8080/demo/?&lang=en

Nguyễn Lê Minh: http://www.jaist.ac.jp/~nguyenml/
Lưu Văn Hải, Nguyễn Tuấn Hải, Japan: http://viet.jnlp.org/
https://github.com/undertheseanlp/NLP-Vietnamese-progress
Named Entity Recognition
Summary
• Overview of the field

– What is Natural Language Processing?
– NLP applications
– Aspects of language processing
– Why is NLP difficult?
• The NLP Research Community
