Lesson 1 - NLP
• Text Categorization
– Classify documents by topic, language, or author; spam filtering; information
retrieval (relevant vs. not relevant); sentiment classification (positive vs. negative)
• Spelling & Grammar Corrections
• Information Extraction
• Speech Recognition
• Information Retrieval
– Synonym Generation
• Summarization
• Machine Translation
• Question Answering
• Dialog Systems
– Language generation
Where does it fit in the CS taxonomy?
[Figure: diagram placing NLP within computer science; surviving labels: Computers, Semantics, Parsing]
Aspects of language processing
Raw sentence
He reckons the current account deficit will narrow to only 1.8 billion in September.
Part-of-speech tagging
POS-tagged sentence
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
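As a minimal sketch of what a POS-tagged sentence looks like as data, the snippet below pairs each token of the example sentence with the tag shown on the slide. The variable names and the whitespace tokenization are assumptions made for illustration; a real tagger predicts the tags rather than reading them from a list.

```python
# Tokens and tags copied from the example above; the final period is its own token.
tokens = ("He reckons the current account deficit will narrow "
          "to only 1.8 billion in September .").split()
tags = "PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .".split()

# A tagged sentence is a sequence of (token, tag) pairs, one tag per token.
assert len(tokens) == len(tags)
tagged = list(zip(tokens, tags))

# Render the sentence in the common token/TAG notation.
tagged_text = " ".join(f"{tok}/{tag}" for tok, tag in tagged)
print(tagged_text)

# Example query: collect the nouns (tags that start with "NN").
nouns = [tok for tok, tag in tagged if tag.startswith("NN")]
print(nouns)
```

Keeping tokens and tags as aligned pairs makes later steps, such as parsing, simple list operations.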
[Figure: parse tree for "The plan to swallow Wanda has been thrilling Otto", with each phrase annotated by its head word: NP [head=plan], VP [head=thrill], VP [head=swallow], NP [head=Wanda], NP [head=Otto]]
Parsing (in Definite Clause Grammars)
[Figure: parse tree in which s expands to np vp]
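The rule s → np vp can be illustrated with a small top-down parser. This is only a sketch in Python rather than Prolog, and the grammar rules beyond s → np vp, the lexicon, and all function names are hypothetical. Like a DCG, each nonterminal consumes a prefix of the remaining tokens and hands the rest to the next symbol.

```python
# Hypothetical toy grammar and lexicon (only "s -> np vp" comes from the slide).
GRAMMAR = {
    "s": [["np", "vp"]],
    "np": [["name"], ["det", "noun"]],
    "vp": [["verb", "np"], ["verb"]],
}
LEXICON = {
    "name": {"john"},
    "det": {"the", "an"},
    "noun": {"apple"},
    "verb": {"eats"},
}

def parse(symbol, tokens, start=0):
    """Yield (tree, end) for every way `symbol` can cover tokens[start:end]."""
    if symbol in LEXICON:  # preterminal: consume exactly one word
        if start < len(tokens) and tokens[start] in LEXICON[symbol]:
            yield (symbol, tokens[start]), start + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each rule for this nonterminal
        for children, end in parse_sequence(rhs, tokens, start):
            yield (symbol, *children), end

def parse_sequence(symbols, tokens, start):
    """Yield (list_of_trees, end) for a sequence of symbols parsed in order."""
    if not symbols:
        yield [], start
        return
    for tree, mid in parse(symbols[0], tokens, start):
        for trees, end in parse_sequence(symbols[1:], tokens, mid):
            yield [tree] + trees, end

def parse_sentence(text):
    """Return all complete parses of `text` as nested tuples."""
    tokens = text.lower().split()
    return [tree for tree, end in parse("s", tokens) if end == len(tokens)]

trees = parse_sentence("John eats the apple")
print(trees)
```

The toy grammar has no left-recursive rules, so the top-down search terminates; a left-recursive grammar would need a different strategy.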
Semantic analysis
[Figure: parse tree (s, np) linked to a concept hierarchy with nodes animated, non-anim, vertebral …, fruit …, apple …]
Semantic representation: eat([person: john], [apple])
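One way to read the figure is that a concept hierarchy constrains which arguments a predicate accepts. The sketch below is an assumption-laden illustration: the is-a links (in particular, placing person under vertebral), the restriction table, and the function names are all hypothetical, chosen only to reproduce the representation eat([person: john], [apple]).

```python
# Hypothetical is-a hierarchy, loosely based on the labels in the figure.
IS_A = {
    "person": "vertebral",
    "vertebral": "animated",
    "apple": "fruit",
    "fruit": "non-anim",
}

def is_a(concept, ancestor):
    """Return True if `concept` equals `ancestor` or lies below it in IS_A."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False

# Assumed selectional restrictions: eat(agent: animated, theme: non-anim).
RESTRICTIONS = {"eat": ("animated", "non-anim")}

def show(arg):
    """Format a (concept, instance) pair, e.g. [person: john] or [apple]."""
    concept, instance = arg
    return f"[{concept}: {instance}]" if instance else f"[{concept}]"

def build_predicate(verb, agent, theme):
    """Return the predicate string, or None if a restriction is violated."""
    agent_type, theme_type = RESTRICTIONS[verb]
    if not is_a(agent[0], agent_type) or not is_a(theme[0], theme_type):
        return None
    return f"{verb}({show(agent)}, {show(theme)})"

ok = build_predicate("eat", ("person", "john"), ("apple", None))
bad = build_predicate("eat", ("apple", None), ("person", "john"))
print(ok)
print(bad)
```

The second call is rejected because, under these assumed restrictions, an apple is not an animated agent.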
Parsing & semantic analysis
• Anaphora
He hits the car with a stone. It bounces back.
• Understanding a text
– Who/when/where/what … are involved in an event?
– How to connect the semantic representations of
different sentences?
– What is the cause of an event and what is the
consequence of an action?
–…
NLP
Why NLP is difficult
• Language is flexible
– New words, new meanings
– Different meanings in different contexts
• Language is subtle
– He arrived at the lecture
– He chuckled at the lecture
– He chuckled his way through the lecture
– **He arrived his way through the lecture
• Language is complex!
Why NLP is difficult
• Key problems:
– Representation of meaning
– Language presupposes knowledge about the world
– Language only reflects the surface of meaning
– Language presupposes communication between
people
Meaning
• What is meaning?
– Physical referent in the real world
– Semantic concepts, characterized also by relations.
• How do we represent and use meaning?
– I am Italian
• From a lexical database (WordNet)
• Italian = a native or inhabitant of Italy → Italy = republic in southern
Europe [..]
– I am Italian
• Who is “I”?
– I know she is Italian/I think she is Italian
• How do we represent “I know” and “I think”?
• Does this mean that she is Italian? What does it say about the “I” and about
the person speaking?
– I thought she was Italian
• How do we represent tenses?
The NLP Research Community
• Papers
– ACL Anthology has nearly everything, free!
• Over 20,000 papers!
• Free-text searchable
– Great way to learn about current research on a topic
– New search interfaces currently available in beta
» Find recent or highly cited work; follow citations
• Used as a dataset by various projects
– Analyzing the text of the papers (e.g., parsing it)
– Extracting a graph of papers, authors, and institutions
(Who wrote what? Who works where? What cites what?)
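The graph of papers and authors mentioned above can be sketched with plain dictionaries. The records below are invented placeholders, not real papers, and the field names are assumptions; real metadata would come from a bibliographic dataset.

```python
from collections import defaultdict

# Hypothetical records (placeholder ids and author names).
papers = [
    {"id": "P1", "authors": ["A. Author"], "cites": []},
    {"id": "P2", "authors": ["A. Author", "B. Writer"], "cites": ["P1"]},
    {"id": "P3", "authors": ["C. Scholar"], "cites": ["P1", "P2"]},
]

wrote = defaultdict(list)     # author -> papers they wrote
cited_by = defaultdict(list)  # paper  -> papers that cite it

for paper in papers:
    for author in paper["authors"]:
        wrote[author].append(paper["id"])
    for cited in paper["cites"]:
        cited_by[cited].append(paper["id"])

# "Find highly cited work": the paper with the most incoming citations.
most_cited = max(papers, key=lambda p: len(cited_by[p["id"]]))["id"]

print(dict(wrote))
print(dict(cited_by))
print(most_cited)
```

An author-to-institution dictionary of the same shape would answer "Who works where?".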
The NLP Research Community
• Conferences
– Most work in NLP is published as 8-page conference papers,
each reviewed double-blind by 3 reviewers.
– Main annual conferences: ACL, EMNLP, NAACL
• Also EACL, IJCNLP, COLING
• + various specialized conferences and workshops
– Big events, and growing fast! ACL 2014:
• About 1000 attendees
• 572 full-length papers submitted (146 accepted)
• 551 short papers submitted (139 accepted)
• 16 workshops on various topics
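The ACL 2014 figures above imply acceptance rates of roughly one in four. The short calculation below makes that explicit; the dictionary layout and variable names are just one way to organize the numbers from the slide.

```python
# (submitted, accepted) counts for ACL 2014, taken from the slide.
submissions = {"full": (572, 146), "short": (551, 139)}

rates = {}
for kind, (submitted, accepted) in submissions.items():
    rates[kind] = accepted / submitted
    print(f"{kind} papers: {accepted}/{submitted} = {rates[kind]:.1%}")

# Combined rate over both tracks.
total_submitted = sum(s for s, _ in submissions.values())
total_accepted = sum(a for _, a in submissions.values())
overall = total_accepted / total_submitted
print(f"overall: {total_accepted}/{total_submitted} = {overall:.1%}")
```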
The NLP Research Community
NLP Journals
Computational Linguistics
Journal of Natural Language Engineering (JNLE)
Machine Translation
Natural Language and Linguistic Theory
Journal of Natural Language Processing
…
The NLP Research Community
• Institutions
– Universities: Many have NLP faculty
• Several “big players” with many faculty
• Some of them also have good linguistics,
cognitive science, machine learning, AI
– Companies:
• Old days: AT&T Bell Labs, IBM
• Now: Google, Microsoft, IBM, many startups …
– Speech: Nuance, …
– Machine translation: Language Weaver, Systran, …
– Many niche markets – online reviews, medical transcription, news
summarization, legal search and discovery …
The NLP Research Community
• Standard tasks
– If you want people to work on your problem, make it
easy for them to get started and to measure their
progress. Provide:
• Test data, for evaluating the final systems
• Development data, for measuring whether a change to the
system helps, and for tuning parameters
• An evaluation metric (formula for measuring how well a
system does on the dev or test data)
• A program for computing the evaluation metric
• Labeled training data and other data resources
• A prize? – with clear rules on what data can be used
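As an illustration of "a program for computing the evaluation metric", here is a minimal sketch that scores predicted labels against gold labels using precision, recall, and F1. The choice of metric, the label names, and the tiny development set are assumptions for illustration; each shared task defines its own metric.

```python
def precision_recall_f1(gold, predicted, positive="spam"):
    """Score predictions against gold labels for one positive class."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)  # true positives
    fp = sum(g != positive and p == positive for g, p in pairs)  # false positives
    fn = sum(g == positive and p != positive for g, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical development data: gold labels and one system's output.
dev_gold = ["spam", "ham", "spam", "spam", "ham", "ham"]
dev_pred = ["spam", "spam", "spam", "ham", "ham", "ham"]

precision, recall, f1 = precision_recall_f1(dev_gold, dev_pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Running the same function on the development set after each change to a system shows whether the change helped; the test set is kept for the final evaluation.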
The NLP Research Community
• Software
– Lots of people distribute code for these tasks
• Or you can email a paper’s authors to ask for their code
– Some lists of software, but no central site
– To find good or popular tools:
• Search current papers, ask around, use the web
– Still, often hard to identify the best tool for your job:
• Produces appropriate, sufficiently detailed output?
• Accurate? (on the measure you care about)
• Robust? (accurate on your data, not just theirs)
• Fast?
• Easy and flexible to use? Nice file formats, command line options,
visualization?
• Trainable for new data and languages? How slow is training?
• Open-source and easy to extend?
The NLP Research Community
• Datasets
– Raw text or speech corpora
• Or just their n-gram counts, for super-big corpora
• Various languages and genres
• Usually there’s some metadata (each document’s date, author, etc.)
• Sometimes licensing restrictions (proprietary or copyright data)
– Text or speech with manual or automatic annotations
• What kind of annotations? That’s the rest of this lecture …
• May include translations into other languages
– Words and their relationships
• Morphological, semantic, translational, evolutionary
– Grammars
– World Atlas of Language Structures (WALS)
– Parameters of statistical models (e.g., grammar weights)
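Some very large corpora are distributed only as n-gram counts, as noted in the list above. The sketch below shows what such counts are, using two sentences borrowed from an earlier slide as a stand-in corpus; the function name and the whitespace tokenization are assumptions.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous run of n tokens in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Tiny stand-in corpus; a real corpus would have millions of sentences.
corpus = [
    "he arrived at the lecture",
    "he chuckled at the lecture",
]

bigrams = Counter()
for sentence in corpus:
    bigrams.update(ngram_counts(sentence.split(), 2))

for ngram, count in bigrams.most_common():
    print(" ".join(ngram), count)
```

Distributing only the counts keeps the release compact and avoids shipping the original documents.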
The NLP Research Community
• Datasets
– Read papers to find out what datasets others are using
• Linguistic Data Consortium (searchable) hosts many large datasets
• Many projects and competitions post data on their websites
• But sometimes you have to email the author for a copy
– The CORPORA mailing list is also a good place to ask around
– LREC Conference publishes papers about new datasets & metrics
– Amazon Mechanical Turk – pay humans (very cheaply) to annotate your
data or to correct automatic annotations
• Old task, new domain: Annotate parses etc. on your kind of data
• New task: Annotate something new that you want your system to find
• Auxiliary task: Annotate something new that your system may benefit from
finding (e.g., annotate subjunctive mood to improve translation)
– Can you make annotation so much fun or so worthwhile
that they’ll do it for free?
The NLP Research Community
Datasets
1. Google Dataset Search
Link: https://datasetsearch.research.google.com/
2. Papers with Code Datasets
Link: https://paperswithcode.com/datasets
3. Kaggle Datasets
Link: https://www.kaggle.com/datasets
4. Big Bad NLP Database
Link: https://index.quantumstat.com/#/
5. Hugging Face Datasets
Link: https://huggingface.co/datasets
6. UCI Machine Learning Repository
Link: https://archive.ics.uci.edu/ml/index.php
The NLP Research Community
Datasets
7. Amazon Datasets (Open Data on AWS)
Link: https://aws.amazon.com/opendata/
8. Awesome Public Datasets
Link: https://github.com/awesomedata/awesome-public-datasets
9. Azure public datasets
Link: https://docs.microsoft.com/.../azure-sql/public-data-sets
10. Carnegie Mellon University
Link: https://guides.library.cmu.edu/az.php
11. .gov Datasets
Link: https://data.gov.au/
https://data.gov.in/
https://data.gov.sg/
https://data.europa.eu/data/datasets?locale=en&minScoring=0
The sources listed above are places to find datasets for machine learning, data science, and AI.
The NLP Research Community
• Survey articles
– May help you get oriented in a new area
– Synthesis Lectures on Human Language Technologies
– Handbook of Natural Language Processing
– Oxford Handbook of Computational Linguistics
– Foundations & Trends in Machine Learning
– ACM Computing Surveys?
– Online tutorial papers
– Slides from tutorials at conferences
– Textbooks
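The NLP Research Community
• Vietnam
JAIST: Prof. Nguyễn Lê Minh
University of Engineering and Technology, Vietnam National University, Hanoi
VinUniversity
University of Science (Đại học KHTN)
Bach Khoa University (Đại học Bách khoa)
University of Information Technology (Đại học CNTT)
Posts and Telecommunications Institute of Technology
Kyoto University: Dr. Phạm Quang Nhật Minh
Ton Duc Thang University, University of Industrial Technology (Đại học Kỹ thuật CN), Hanoi University
…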
The NLP Research Community
• Toolkits
Tsujii Lab-Tokyo, Japan: http://www.nactem.ac.uk/tsujii/
Stanford NLP Group, USA: http://nlp.stanford.edu/
Matsumoto Lab-NAIST, Japan: http://cl.naist.jp/en/
NLTK Toolkits: http://www.nltk.org/
Open NLP: http://opennlp.sourceforge.net/projects.html
NLP Toolkits: http://www.phontron.com/nlptools.php
Kyoto Lab: http://nlp.ist.i.kyoto-u.ac.jp/EN/
Google NLP research: http://research.google.com/pubs/NaturalLanguageProcessing.html
https://github.com/undertheseanlp/NLP-Vietnamese-progress
The NLP Research Community
https://github.com/undertheseanlp/NLP-Vietnamese-progress
Named Entity Recognition
Summary