Ilak Pos Tagging


Part-of-speech tagging
Parts of Speech
The idea of parts of speech (a.k.a. lexical categories, word classes, “tags”, POS) goes back in the West perhaps to Aristotle (384–322 BC).
From Dionysius Thrax of Alexandria (c. 100 BC) comes the idea, still with us, that there are 8 parts of speech.
But his 8 aren’t exactly the ones we are taught today:
Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun
School grammar: noun, verb, adjective, adverb, preposition, conjunction, pronoun, interjection
POS Tagging

Words often have more than one POS: back


The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS tag for
a particular instance of a word.
Open class (lexical) words
Nouns
  Proper: IBM, Italy
  Common: cat / cats, snow
Verbs
  Main: see, registered
Adjectives: old, older, oldest
Adverbs: slowly
Numbers: 122,312; one
… more
Closed class (functional) words
Determiners: the, some
Modals: can, had
Prepositions: to, with
Conjunctions: and, or
Particles: off, up
Pronouns: he, its
Interjections: Ow, Eh
… more

Definition
• Annotate each word in a sentence with a part-of-speech marker.
• Lowest level of syntactic analysis.
• Useful for subsequent syntactic parsing and word sense disambiguation.
• Example
John  saw  the  saw  and  decided  to  take  it   to  the  table.
NNP   VBD  DT   NN   CC   VBD      TO  VB    PRP  IN  DT   NN

POS TAGGING 5
English POS Tagsets
• The original Brown corpus used a large set of 87 POS tags.
• Most common in NLP today is the Penn Treebank set of 45 tags.
  • Reduced from the Brown set for use in the context of a parsed corpus (i.e. treebank).
• The C5 tagset used for the British National Corpus (BNC) has 61 tags.

Why POS
• POS tags tell us a lot about a word (and the words near it).
  • E.g., adjectives are often followed by nouns
  • personal pronouns are often followed by verbs
  • possessive pronouns by nouns
• Pronunciation depends on POS, e.g.
  • object (stress on the first syllable as a noun, NN; on the second as a verb, VB), content, discount
• First step in many NLP applications

Word Classes
Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …
Open vs. Closed classes
◦ Open:
  ◦ Nouns, Verbs, Adjectives, Adverbs.
  ◦ Why “open”?
◦ Closed:
  ◦ determiners: a, an, the
  ◦ pronouns: she, he, I
  ◦ prepositions: on, under, over, near, by, …

Closed vs. Open Class
Closed class categories are composed of a small, fixed set of grammatical function words for a given language.
• prepositions: on, under, over, …
• particles: up, down, on, off, …
• determiners: a, an, the, …
• pronouns: she, who, I, …
• conjunctions: and, but, or, …
• auxiliary verbs: can, may, should, …

Closed vs. Open Class
Open class categories have a large number of words, and new ones are easily invented.
• New nouns: Internet, website, URL, CD-ROM, email, newsgroup, bitmap, modem, multimedia
• New verbs: download, upload, reboot, right-click, double-click
• Verbs (Google)
• Adjectives (geeky)
• Adverbs (chompingly)

English Parts of Speech (Nouns)
Noun (person, place or thing)
• Singular (NN): dog, fork
• Plural (NNS): dogs, forks
• Proper (NNP, NNPS): John, Springfields
• Personal pronoun (PRP): I, you, he, she, it
• Wh-pronoun (WP): who, what

English Parts of Speech (Nouns)
Proper nouns (Penn, Philadelphia, Davidson)
• English capitalizes these.
Common nouns (the rest).
Count nouns and mass nouns
• Count: have plurals, get counted: goat/goats
• Mass: don’t get counted (snow, salt, water)
English Parts of Speech (Verbs)
Verb (actions and processes)
• Base, infinitive (VB): eat
• Past tense (VBD): ate
• Gerund or present participle (VBG): eating
• Past participle (VBN): eaten
• Non-3rd person singular present tense (VBP): eat
• 3rd person singular present tense (VBZ): eats
• Modal (MD): should, can
• To (TO): to (to eat)

English Parts of Speech (Adjectives)
Adjective (modify nouns, identify properties or qualities of nouns)
• Basic (JJ): red, tall
• Comparative (JJR): redder, taller
• Superlative (JJS): reddest, tallest
Adjective ordering restrictions in English:
• Old blue book, not Blue old book
• the 44th president
• a green product
• a responsible investment
• the dumbest, worst leader

English Parts of Speech (Adverbs)
Adverb (modify verbs)
• Basic (RB): quickly
• Comparative (RBR): quicker
• Superlative (RBS): quickest
Unfortunately, John walked home extremely slowly yesterday
• Directional/locative adverbs (here, downhill)
• Degree adverbs (extremely, very, somewhat)
• Manner adverbs (slowly, slinkily, delicately)
• Temporal adverbs (yesterday, tomorrow)

English Parts of Speech (Determiner)

A determiner is a word that occurs together with a noun or noun phrase and serves to express the reference of that noun or noun phrase in the context.
That is, a determiner may indicate whether the noun is referring to a definite or indefinite element of a class, to a closer or more distant element, to an element belonging to a specified person or thing, to a particular number or quantity, etc.

English Parts of Speech (Determiner)
Common kinds of determiners include
• definite and indefinite articles (the, a, an)
• demonstratives (this, that, these)
• possessive determiners (my, their)
• quantifiers (many, few, several).

English Parts of Speech (Preposition)
Preposition (IN): a word governing, and usually preceding, a noun or
pronoun and expressing a relation to another word or element in the
clause, as in ‘the man on the platform’, ‘she arrived after dinner’.
Ex: on, in, by, to, with

English Parts of Speech
Coordinating Conjunction (CC): a word that connects words, phrases, clauses or sentences.
Ex: and, but, or, as in ‘the truth of nature, and the power of giving interest’.
Particle (RP): a particle is a function word that must be associated with
another word or phrase to impart meaning, i.e., does not have its own
lexical definition.
Ex: off (took off), up (put up)

POS tagging
POS tagging is a process that attaches a suitable tag, from a given set of tags, to each word in a sentence.
Tagging is the assignment of a single part-of-speech tag to each word (and punctuation marker) in a corpus.
• The set of tags is called the tag set.
• Standard tag set: Penn Treebank (for English).

POS tagging
There are many parts of speech and many potential distinctions we can draw.
To do POS tagging, we need to choose a standard set of tags to work with.
• Could pick a very coarse tag set.
  • N, V, Adj, Adv.
• The more commonly used set is finer grained (Penn TreeBank, 45 tags).
  • PRP$, WRB, WP$, VBG
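The relation between a coarse and a fine tag set can be sketched by collapsing Penn Treebank tags into the four coarse classes by prefix. This is only an illustration: the helper name `coarse_tag` and the fallback label "Other" are assumptions, not part of any standard.

```python
# Hypothetical helper: collapse Penn Treebank tags into coarse classes.
COARSE_BY_PREFIX = [
    ("NN", "N"),    # NN, NNS, NNP, NNPS
    ("VB", "V"),    # VB, VBD, VBG, VBN, VBP, VBZ
    ("JJ", "Adj"),  # JJ, JJR, JJS
    ("RB", "Adv"),  # RB, RBR, RBS
]

def coarse_tag(penn_tag):
    """Return the coarse class for a Penn tag, or "Other" if no prefix matches."""
    for prefix, coarse in COARSE_BY_PREFIX:
        if penn_tag.startswith(prefix):
            return coarse
    return "Other"

print([coarse_tag(t) for t in ["NNS", "VBG", "JJR", "RB", "PRP$", "WRB"]])
```

Tags such as PRP$ and WRB have no counterpart in the four-way coarse set, which is one reason the finer-grained set is preferred.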
POS Tag Ambiguity
Deciding on the correct part of speech can be difficult even for people.
• In English: I bank1 on the bank2 on the river bank3 for my transactions.
  • bank1 is a verb; the other two are nouns.
• In Hindi:
  • “Khaanaa” can be a noun (food) or a verb (to eat).

Measuring Ambiguity

How Hard is POS Tagging?
About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech.
• But they tend to be very common words.
• 40% of the word tokens are ambiguous.

Penn TreeBank POS Tagset

Using the Penn Tagset
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP…”).
Except that the preposition/complementizer “to” is just marked “TO”.
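Text in the word/TAG notation above is easy to read back into (word, tag) pairs. A minimal sketch, assuming tokens are separated by whitespace and the tag follows the last slash; the function name `parse_tagged` is made up for this example.

```python
def parse_tagged(text):
    """Split 'word/TAG' tokens into (word, tag) pairs.

    Splitting on the LAST '/' keeps words that themselves contain '/' intact.
    """
    pairs = []
    for token in text.split():
        word, tag = token.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs

sentence = ("The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
            "number/NN of/IN other/JJ topics/NNS ./.")
pairs = parse_tagged(sentence)
print(pairs)
```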

Process
• List all possible tags for each word in the sentence.
• Choose the best-suited tag sequence.
• Example
  • “People jump high”.
  • People: Noun/Verb
  • jump: Noun/Verb
  • high: Noun/Verb/Adjective
• We can start with probabilities.
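The two steps above (list the candidates, then pick the best sequence) can be sketched by brute force. All numbers below are made up for illustration and are not estimated from any corpus, and the score is a simple product of per-word and tag-to-tag weights rather than a properly normalized probability model.

```python
from itertools import product

# Candidate tags per word, as listed in the example.
candidates = {
    "People": ["Noun", "Verb"],
    "jump": ["Noun", "Verb"],
    "high": ["Noun", "Verb", "Adjective"],
}

# Invented weights, for illustration only.
lexical = {  # how likely a word is to carry a tag
    ("People", "Noun"): 0.9, ("People", "Verb"): 0.1,
    ("jump", "Noun"): 0.3, ("jump", "Verb"): 0.7,
    ("high", "Noun"): 0.1, ("high", "Verb"): 0.05, ("high", "Adjective"): 0.85,
}
transition = {  # how likely a tag is to follow the previous tag
    ("Noun", "Verb"): 0.5, ("Verb", "Adjective"): 0.3,
    ("Noun", "Noun"): 0.2, ("Verb", "Noun"): 0.3,
}
DEFAULT_TRANSITION = 0.05  # used for tag pairs not listed above

def score(words, tags):
    """Multiply the lexical and transition weights along one tag sequence."""
    total = 1.0
    for i, (word, tag) in enumerate(zip(words, tags)):
        total *= lexical[(word, tag)]
        if i > 0:
            total *= transition.get((tags[i - 1], tag), DEFAULT_TRANSITION)
    return total

words = ["People", "jump", "high"]
# Step 1: enumerate every possible tag sequence (2 * 2 * 3 of them).
sequences = list(product(*(candidates[w] for w in words)))
# Step 2: keep the highest-scoring one.
best = max(sequences, key=lambda tags: score(words, tags))
print(len(sequences), best)
```

Enumerating every sequence grows exponentially with sentence length, so this only works for toy inputs.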

How difficult is POS tagging?
About 11% of the word types in the Brown corpus are ambiguous with regard to part of speech.
But they tend to be very common words. E.g., that
I know that he is honest = IN
Yes, that play was nice = DT
You can’t go that far = RB
40% of the word tokens are ambiguous
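The gap between the type figure and the token figure can be reproduced on any tagged corpus. A minimal sketch on a toy list built from the “that” sentences above: the tags for “that” come from the slide, the tags for the other words are assumptions added for the example, and the function name `ambiguity` is hypothetical.

```python
from collections import defaultdict

def ambiguity(tagged_tokens):
    """Return (share of ambiguous word types, share of ambiguous word tokens)."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_tokens:
        tags_by_word[word.lower()].add(tag)
    # A type is ambiguous if it was seen with more than one tag.
    ambiguous = {w for w, tags in tags_by_word.items() if len(tags) > 1}
    type_rate = len(ambiguous) / len(tags_by_word)
    ambiguous_tokens = sum(1 for w, _ in tagged_tokens if w.lower() in ambiguous)
    token_rate = ambiguous_tokens / len(tagged_tokens)
    return type_rate, token_rate

# Toy data; only the three tags for "that" are taken from the examples above.
toy = [
    ("I", "PRP"), ("know", "VBP"), ("that", "IN"), ("he", "PRP"),
    ("is", "VBZ"), ("honest", "JJ"),
    ("that", "DT"), ("play", "NN"), ("was", "VBD"), ("nice", "JJ"),
    ("go", "VB"), ("that", "RB"), ("far", "RB"),
]
type_rate, token_rate = ambiguity(toy)
print(f"types: {type_rate:.2f}, tokens: {token_rate:.2f}")
```

Because the one ambiguous word is also the most frequent one, the token rate comes out higher than the type rate, which is the same effect the Brown corpus figures show.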
Sources of information
What are the main sources of information for POS tagging?
Knowledge of neighboring words
Bill  saw    that  man  yesterday
NNP   NN     DT    NN   NN
VB    VB(D)  IN    VB   NN
Knowledge of word probabilities
man is rarely used as a verb…
The latter proves the most useful, but the former also helps
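Knowledge of word probabilities alone already gives a usable tagger: assign each word the tag it carried most often in training data. A minimal sketch, assuming a tiny hand-made training list (the counts are invented) and a default tag of NN for unseen words; both function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_tokens):
    """Map each lowercased word to the tag it was seen with most often."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_words(words, model, default="NN"):
    """Tag each word with its most frequent tag; unseen words get the default."""
    return [(w, model.get(w.lower(), default)) for w in words]

# Invented counts, for illustration only.
training = (
    [("man", "NN")] * 3 + [("man", "VB")]
    + [("saw", "VBD")] * 2 + [("saw", "NN")]
    + [("that", "DT")] * 2 + [("that", "IN")]
    + [("Bill", "NNP")]
)
model = train_most_frequent_tag(training)
tagged = tag_words("Bill saw that man yesterday".split(), model)
print(tagged)
```

This baseline ignores neighboring words entirely, so it always gives “that” the same tag regardless of context.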
More and Better Features → Feature-based tagger
Can do surprisingly well just looking at a word by itself:
Word             the           the → DT
Lowercased word  Importantly   importantly → RB
Prefixes         unfathomable  un- → JJ
Suffixes         Importantly   -ly → RB
Capitalization   Meridian      CAP → NNP
Word shapes      35-year       d-x → JJ
Then build a maxent (or whatever) model to predict the tag
Maxent P(t|w): 93.7% overall / 82.6% unknown
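The word-internal features in the table can be computed with a few lines of code. A sketch under assumptions: the feature names, the fixed prefix and suffix length of 2, and the exact shape alphabet (d for digits, x for lowercase, X for uppercase) are choices made for this example. The resulting dictionaries would then be passed to a classifier such as a maxent model, which is not shown.

```python
import re

def word_shape(word):
    """Collapse runs of digits to 'd', lowercase letters to 'x', uppercase to 'X'."""
    def classify(match):
        first = match.group(0)[0]
        if first.isdigit():
            return "d"
        return "x" if first.islower() else "X"
    # One pass over the word, so the inserted 'd'/'x' letters are not re-matched.
    return re.sub(r"[0-9]+|[a-z]+|[A-Z]+", classify, word)

def word_features(word, affix_len=2):
    """Features that look only at the word itself, not at its context."""
    return {
        "word": word,
        "lower": word.lower(),
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
        "capitalized": word[:1].isupper(),
        "shape": word_shape(word),
    }

for w in ["the", "Importantly", "unfathomable", "Meridian", "35-year"]:
    print(word_features(w))
```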
How to improve supervised results?
Build better features!
They  left  as  soon  as  he   arrived  .
PRP   VBD   IN  RB    IN  PRP  VBD      .
The first “as” is tagged IN but should be RB.
We could fix this with a feature that looked at the next word
Intrinsic  flaws  remained  undetected  .
NNP        NNS    VBD       VBN         .
“Intrinsic” is tagged NNP but should be JJ.
We could fix this by linking capitalized words to their lowercase versions
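Both fixes amount to adding features that look beyond the surface form of the current word. A minimal sketch: the feature names and the `<end>` boundary marker are assumptions for this example, and the function only builds the features, it does not train a tagger.

```python
def context_features(words, i):
    """Features for words[i] that also use its neighbors and its lowercase form."""
    word = words[i]
    return {
        "word": word,
        # Links a capitalized sentence-initial word to its lowercase version.
        "lower": word.lower(),
        # Capitalization at the start of a sentence says little about NNP vs. JJ.
        "is_first": i == 0,
        "prev_lower": words[i - 1].lower() if i > 0 else "<start>",
        # Lets the model see "soon" when deciding the tag of the first "as".
        "next_lower": words[i + 1].lower() if i + 1 < len(words) else "<end>",
    }

sent1 = "They left as soon as he arrived .".split()
sent2 = "Intrinsic flaws remained undetected .".split()
print(context_features(sent1, 2))
print(context_features(sent2, 0))
```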


Rule-Based Tagging
• Start with a dictionary.
• Assign all possible tags to words from the dictionary.
• Write rules by hand to selectively remove tags.
• Leaving the correct tag for each word.

Step 1: Start with a Dictionary
she: PRP
promised: VBN, VBD
to: TO
back: VB, JJ, RB, NN
the: DT
bill: NN, VB
Etc… for the ~100,000 words of English with more than 1 tag

POS TAGGING 33
Step 2: Assign Every Possible Tag
She  promised  to  back  the  bill
PRP  VBN       TO  NN    DT   NN
     VBD           RB         VB
                   JJ
                   VB
Step 3: Write Rules to Eliminate Tags
Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”
Here the rule removes VBN from “promised”, leaving VBD:
She  promised  to  back  the  bill
PRP  VBD       TO  NN    DT   NN
                   RB         VB
                   JJ
                   VB
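The three steps can be sketched end to end on the example sentence. This is an illustration only: the dictionary holds just the six example words (with VBN and VBD for “promised”, as in the Step 2 lattice), the function names are hypothetical, and only the single elimination rule stated above is implemented.

```python
# Step 1: a tiny dictionary holding only the words of the example.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}

def assign_all_tags(words):
    """Step 2: give every word every tag the dictionary allows."""
    return [list(LEXICON[w.lower()]) for w in words]

def eliminate_vbn_after_initial_prp(lattice):
    """Step 3: drop VBN where VBD is also an option and the word follows '<start> PRP'."""
    if len(lattice) > 1 and lattice[0] == ["PRP"]:
        options = lattice[1]
        if "VBN" in options and "VBD" in options:
            lattice[1] = [t for t in options if t != "VBN"]
    return lattice

words = "She promised to back the bill".split()
lattice = eliminate_vbn_after_initial_prp(assign_all_tags(words))
for word, tags in zip(words, lattice):
    print(word, tags)
```

After this one rule, “back” and “bill” are still ambiguous; a full rule-based tagger needs further hand-written rules to resolve them.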

END
