Natural Language
Processing
Lecture 4: Sequence Labeling with Hidden Markov
Models. Part-of-Speech Tagging.
11/6/2020
COMS W4705
Yassine Benajiba
Garden-Path Sentences
• The horse raced past the barn.
• The horse raced past the barn fell.
• The old dog the footsteps of the young.
Garden-Path Sentences
• Why does this happen?
The horse raced/VBD past the barn fell/???   (raced read as a past tense verb)
• raced can be a past tense verb or a past participle
(indicating passive voice).
• The verb interpretation is more likely before fell is read.
Garden-Path Sentences
• Why does this happen?
[The horse raced/VBN past the barn]NP fell/VBD   (raced read as a past participle)
• raced can be a past tense verb or a past participle
(indicating passive voice).
• Once fell is read, the verb interpretation is impossible.
Garden-Path Sentences
• Why does this happen?
[The old/JJ dog/NN]NP [the footsteps of the young]NP   (old read as an adjective, dog as a noun; no verb remains)
• dog can be a noun or a verb (plural, present tense)
Garden-Path Sentences
• Why does this happen?
[The old/NN]NP dog/VB [the footsteps of the young]NP   (dog read as a verb)
• dog can be a noun or a verb (plural, present tense)
Parts-of-Speech
• Classes of words that behave alike:
• Appear in similar contexts.
• Perform a similar grammatical function in the sentence.
• Undergo similar morphological transformations.
• ~9 traditional parts-of-speech:
• noun, pronoun, determiner, adjective, verb, adverb,
preposition, conjunction, interjection
Syntactic Ambiguities and
Parts-of-Speech
• Time (N or V?) flies (N or V?) like (V or preposition?) an arrow.
Syntactic Ambiguities and
Parts-of-Speech
• [Time/N flies/N]NP like/V an arrow.
Why do we need P.O.S.?
• Interacts with most levels of linguistic representation.
• Speech processing:
• OBject (noun) vs. obJECT (verb)
• CONtent (noun) vs. conTENT (adjective)
(the stress pattern depends on the part of speech)
• Syntactic parsing
• …
• P.O.S. tag-set should contain morphological and maybe syntactic
information.
Penn Treebank Tagset
CC: Coordinating conjunction
CD: Cardinal number
DT: Determiner
EX: Existential there
FW: Foreign word
IN: Preposition or subordinating conjunction
JJ: Adjective
JJR: Adjective, comparative
JJS: Adjective, superlative
LS: List item marker
MD: Modal
NN: Noun, singular or mass
NNS: Noun, plural
NNP: Proper noun, singular
NNPS: Proper noun, plural
PDT: Predeterminer
POS: Possessive ending
PRP: Personal pronoun
PRP$: Possessive pronoun
RB: Adverb
RBR: Adverb, comparative
RBS: Adverb, superlative
RP: Particle
SYM: Symbol
TO: to
UH: Interjection
VB: Verb, base form
VBD: Verb, past tense
VBG: Verb, gerund or present participle
VBN: Verb, past participle
VBP: Verb, non-3rd person singular present
VBZ: Verb, 3rd person singular present
WP: Wh-pronoun
WP$: Possessive wh-pronoun
WRB: Wh-adverb
(plus punctuation symbols)
P.O.S. Tagsets
• Tagsets are language-specific.
• Some languages capture more morphological information,
which should be reflected in the tag set.
• “Universal Part Of Speech Tags?”
• Petrov et al. 2011: mapping of 25 language-specific tagsets
to a common set of 12 universal tags.
Part-of-Speech Tagging
• Goal: Assign a part-of-speech label to each word in a
sentence.
DT NN VBD DT NNS IN DT NN .
the koala put the keys on the table .
• This is an example of a sequence labeling task.
• Think of this as a translation task from a sequence of
words (w1, w2, …, wn) ∈ V* to a sequence of tags
(t1, t2, …, tn) ∈ T*.
Determining Part-of-Speech
• A blue seat / A child seat: noun or adj?
• Syntactic tests:
• A very blue seat vs. *A very child seat
• This seat is blue vs. *This seat is child
• Morphological tests:
• bluer vs. *childer
Determining Part-of-Speech
• Preposition or Particle?
• He threw out the garbage. / He threw the garbage out.
→ out is a particle
• He threw the garbage out the door. / *He threw the garbage the door out.
→ out is a preposition
Part-of-Speech Tagging
• Goal: Translate from a sequence of words
(w1, w2, …, wn) ∈ V* to a sequence of tags
(t1, t2, …, tn) ∈ T*.
• NLP is full of translation problems from one structure to
another. Basic solution:
• For each translation step:
1. Construct the search space of possible translations.
2. Find the best path through this space (decoding) according
to some performance measure.
Bayesian Inference for
Sequence Labeling
• Recall Bayesian Inference (Generative Models): Given
some observation, infer the value of some hidden variable.
(see Naive Bayes)
• We can apply this approach to sequence labeling:
• Assume each word wi in the observed sequence
(w1, w2, …, wn) ∈ V* was generated by some hidden
variable ti.
• Infer the most likely sequence of hidden variables given
the sequence of observed words.
Noisy Channel Model
[Diagram: a tag sequence, e.g. “NN VBZ IN DT NN”, is generated with probability P(tags) and sent through a noisy channel, which emits the observed sentence “time flies like an arrow” with probability P(words | tags).]
• Goal: figure out what the original input to the channel
was. Use Bayes’ rule:
argmax_{tags} P(tags | words) = argmax_{tags} P(words | tags) ⋅ P(tags)
• This model is used widely (speech recognition, MT, …)
Hidden Markov Models (HMMs)
• Generative (Bayesian) probability model.
• Observations: sequences of words.
• Hidden states: sequences of part-of-speech tags.
hidden:   START  NN   VBZ   IN   DT  NN
observed:        time flies like an  arrow
• The hidden sequence is generated by an n-gram language
model (typically a bigram model), with t0 = START.
Markov Chains
[Figure: a Markov chain over the tag states start, DT, NN, IN, VBZ, with transition probabilities (0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 1.0, …) labeling the edges.]
• A Markov chain is a sequence of random variables X1, X2, …
• The domain of these variables is a set of states.
• Markov assumption: Next state depends only on current state.
• This is a special case of a weighted finite state automaton (WFSA).
Hidden Markov Models (HMMs)
• There are two types of probabilities:
Transition probabilities and Emission Probabilities.
[Figure: start → t1 → t2 → t3 linked by transition probabilities; each hidden state ti emits the word wi with an emission probability.]
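In formulas (a standard factorization that the figure implies, with t0 = START):

P(w1, …, wn, t1, …, tn) = ∏_{i=1..n} P(ti | ti-1) ⋅ P(wi | ti)

where the P(ti | ti-1) are the transition probabilities and the P(wi | ti) are the emission probabilities.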
Important Tasks on HMMs
• Decoding: Given a sequence of words, find the most likely
tag sequence.
(Bayesian inference using the Viterbi algorithm)
• Evaluation: Given a sequence of words, find the
total probability of this word sequence given an HMM.
Note that we can view the HMM as another type of language
model. (Forward algorithm)
• Training: Estimate emission and transition probabilities from
training data. (MLE, Forward-Backward a.k.a. Baum-Welch
algorithm)
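For the supervised case (tagged training data), the MLE estimates are simple relative frequencies over the corpus (a standard result, stated here for completeness):

P(t | t') = count(t', t) / count(t')        P(w | t) = count(t, w) / count(t)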
Decoding HMMs
[Trellis: one column of candidate tags (VBZ, IN, NN, DT) for each word of "time flies like an arrow"; edges connect the tags of adjacent words.]
Goal: Find the path with the highest total probability (given the words)
There are d^n possible paths for n words and d tags.
Viterbi Algorithm
• Input: Sequence of observed words w1, …, wn
• Create a table π, such that each entry π[k,t] contains the score of the
highest-probability tag sequence ending in tag t at time k.
• initialize π[0,start] = 1.0 and π[0,t] = 0.0 for all tags t ∈ T.
• for k = 1 to n:
• for t ∈ T:
• π[k,t] = max_{t'} π[k-1,t'] ⋅ P(t | t') ⋅ P(wk | t)
(transition probability P(t | t'), emission probability P(wk | t); keep a backpointer to the maximizing t')
• return max_{t ∈ T} π[n,t], following the backpointers to recover the best tag sequence
Emission Probabilities
• P(time | VBZ) = 0.2
P(flies | VBZ) = 0.3
P(like | VBZ) = 0.5
• P(time | NN) = 0.3
P(flies | NN) = 0.2
P(arrow | NN) = 0.5
• P(like | IN) = 1.0
• P(an | DT) = 1.0
Viterbi Algorithm: Worked Example
• Idea: Because of the Markov assumption, we only need
the probabilities for Xn to compute the probabilities for Xn+1.
This suggests a dynamic programming algorithm.
• Transition probabilities used in the example (all others are 0):
P(NN | start) = 0.2, P(VBZ | start) = 0.1,
P(VBZ | NN) = 0.6, P(NN | NN) = 0.2, P(IN | NN) = 0.2,
P(NN | VBZ) = 0.4, P(IN | VBZ) = 0.2, P(DT | VBZ) = 0.4,
P(DT | IN) = 0.7, P(NN | DT) = 1.0
• Filling the table column by column for "time flies like an arrow"
(entries not shown are 0):
k=1 (time): π[1,NN] = .2 × .3 = .06; π[1,VBZ] = .1 × .2 = .02
k=2 (flies): π[2,VBZ] = .06 × .6 × .3 = .0108;
π[2,NN] = max(.02 × .4 × .2 = .0016, .06 × .2 × .2 = .0024) = .0024
k=3 (like): π[3,VBZ] = .0024 × .6 × .5 = .00072;
π[3,IN] = max(.0108 × .2 × 1 = .00216, .0024 × .2 × 1 = .00048) = .00216
k=4 (an): π[4,DT] = max(.00072 × .4 × 1 = .000288, .00216 × .7 × 1 = .001512) = .001512
k=5 (arrow): π[5,NN] = .001512 × 1.0 × .5 = .000756
• Following the backpointers from π[5,NN] recovers the best path:
time/NN flies/VBZ like/IN an/DT arrow/NN
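A minimal Python sketch of this recurrence, using the toy tables above (the dictionary encoding, function name, and the convention that missing entries mean probability 0 are my own assumptions):

transition = {
    ("start", "NN"): 0.2, ("start", "VBZ"): 0.1,
    ("NN", "VBZ"): 0.6, ("NN", "NN"): 0.2, ("NN", "IN"): 0.2,
    ("VBZ", "NN"): 0.4, ("VBZ", "IN"): 0.2, ("VBZ", "DT"): 0.4,
    ("IN", "DT"): 0.7, ("DT", "NN"): 1.0,
}
emission = {
    ("VBZ", "time"): 0.2, ("VBZ", "flies"): 0.3, ("VBZ", "like"): 0.5,
    ("NN", "time"): 0.3, ("NN", "flies"): 0.2, ("NN", "arrow"): 0.5,
    ("IN", "like"): 1.0, ("DT", "an"): 1.0,
}
TAGS = ["NN", "VBZ", "IN", "DT"]

def viterbi(words):
    # pi[k][t] = probability of the best tag sequence ending in tag t at position k
    pi = [{"start": 1.0}]
    back = [{}]
    for k, w in enumerate(words, 1):
        pi.append({})
        back.append({})
        for t in TAGS:
            # maximize pi[k-1][t'] * P(t | t') * P(w | t) over previous tags t'
            best_prev, best = None, 0.0
            for prev, prev_score in pi[k - 1].items():
                score = (prev_score
                         * transition.get((prev, t), 0.0)   # transition probability
                         * emission.get((t, w), 0.0))       # emission probability
                if score > best:
                    best_prev, best = prev, score
            if best_prev is not None:
                pi[k][t] = best
                back[k][t] = best_prev          # backpointer to the maximizing t'
    # pick the best final tag and follow the backpointers
    best_tag = max(pi[-1], key=pi[-1].get)
    score = pi[-1][best_tag]
    tags, t = [best_tag], best_tag
    for k in range(len(words), 1, -1):
        t = back[k][t]
        tags.append(t)
    return list(reversed(tags)), score

print(viterbi("time flies like an arrow".split()))
# expected: (['NN', 'VBZ', 'IN', 'DT', 'NN'], ~0.000756), matching the table above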
Trigram Language Model
• Instead of using a unigram context P(ti | ti-1), use a bigram
context P(ti | ti-2, ti-1).
• Think of this as having states that represent pairs of tags.
• So the HMM probability for a given tag and word sequence
is:
P(w1, …, wn, t1, …, tn) = ∏_{i=1..n} P(ti | ti-2, ti-1) ⋅ P(wi | ti)
• Need to handle data sparseness when estimating transition
probabilities (for example using backoff or linear
interpolation).
More POS tagging tricks
• It is also often useful in practice to add an end-of-sentence marker (just like we
did for n-gram language models):
P(w1, …, wn, t1, …, tn) = [∏_{i=1..n} P(ti | ti-2, ti-1) ⋅ P(wi | ti)] ⋅ P(STOP | tn-1, tn)
where t-1 = t0 = START and tn+1 = STOP.
• Another useful trick is to replace rare words with “pseudo-words” representing an
entire class, as in the sketch below.
• For example: replace {“01”, “85”, “90”, …} with twoDigitNumber
replace {“1985”, “2018”, …} with fourDigitNumber
replace {“1”, “1.0”, “234.3”, …} with otherNum
replace {“IBM”, “DNC”, …} with allCaps, etc.
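A minimal sketch of such a word-class mapping (the regular expressions and the function name are illustrative assumptions; the classes are the ones listed above):

import re

def pseudo_word(word):
    # Map a word to a coarse class; classes follow the slide's examples.
    if re.fullmatch(r"\d{2}", word):
        return "twoDigitNumber"
    if re.fullmatch(r"\d{4}", word):
        return "fourDigitNumber"
    if re.fullmatch(r"\d+(\.\d+)?", word):
        return "otherNum"
    if word.isupper():
        return "allCaps"
    return word

print([pseudo_word(w) for w in ["01", "1985", "234.3", "IBM", "flies"]])
# expected: ['twoDigitNumber', 'fourDigitNumber', 'otherNum', 'allCaps', 'flies']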
Using a smoothed trigram HMM with these tricks, we can build a tagger that is
close to the state of the art (~97% accuracy on the Penn Treebank).
HMMs as Language Models
• We can also use an HMM as a language model (language
generation, MT, …), i.e. evaluate P(w1, …, wn) for a
given sentence.
What is the advantage over a plain word n-gram model?
• Problem: There are many tag sequences that could have
generated w1, …, wn.
• This is an example of spurious ambiguity.
• Need to compute:
P(w1, …, wn) = Σ_{t1, …, tn} P(w1, …, wn, t1, …, tn)
Forward Algorithm
• Input: Sequence of observed words w1, …, wn
• Create a table π, such that each entry π[k,t] contains the total
probability of all tag sequences ending in tag t at time k.
• initialize π[0,start] = 1.0 and π[0,t] = 0.0 for all tags t ∈ T.
• for k = 1 to n:
• for t ∈ T:
• π[k,t] = Σ_{t'} π[k-1,t'] ⋅ P(t | t') ⋅ P(wk | t)
• return Σ_{t ∈ T} π[n,t] = P(w1, …, wn)
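This is the Viterbi dynamic program with max replaced by sum; a minimal sketch, assuming the transition, emission, and TAGS tables from the Viterbi sketch above are in scope:

def forward(words):
    # pi[k][t] = total probability of all tag sequences ending in t at position k
    pi = [{"start": 1.0}]
    for k, w in enumerate(words, 1):
        pi.append({})
        for t in TAGS:
            total = sum(prev_score
                        * transition.get((prev, t), 0.0)   # transition probability
                        * emission.get((t, w), 0.0)        # emission probability
                        for prev, prev_score in pi[k - 1].items())
            if total > 0.0:
                pi[k][t] = total
    return sum(pi[-1].values())  # P(w1 ... wn) under the HMM

print(forward("time flies like an arrow".split()))
# sums over all tag sequences, so it exceeds the single best path: ~0.001276 here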
Named Entity Recognition as
Sequence Labeling
• Use 3 tags:
• O - outside of named entity
• I - inside named entity
• B - first word (beginning) of named entity
… identification/O of/O tetronic/B acid/I in/O …
• Other encodings are possible (for example, NE-type
specific tags such as B-PER and I-PER).
• This can also be used for phrase chunking.
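A minimal sketch of producing these tags from entity spans (the function name and the (start, end) span representation are illustrative assumptions):

def bio_encode(tokens, entity_spans):
    # entity_spans: (start, end) token index pairs, end exclusive
    tags = ["O"] * len(tokens)
    for start, end in entity_spans:
        tags[start] = "B"                  # first word of the entity
        for i in range(start + 1, end):
            tags[i] = "I"                  # inside the entity
    return tags

tokens = ["identification", "of", "tetronic", "acid", "in"]
print(bio_encode(tokens, [(2, 4)]))
# expected: ['O', 'O', 'B', 'I', 'O'], matching the example above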