Lect#2 RegularExprAndAutomata
Lect#2 RegularExprAndAutomata
Lect#2 RegularExprAndAutomata
REGULAR EXPRESSIONS
AND AUTOMATA
NLP Representations
• State Machines
– FSAs: Finite State Automata
– FSTs: Finite State Transducers
– HMMs: Hidden Markov Model
– ATNs: Augmented Transition Network
– RTNs: Recursive Transition Network
2
NLP Representations
• Rule Systems:
– CFGs: Context Free Grammar
– Unification Grammars
– Probabilistic CFGs
• Logic-based Formalisms
– 1st Order Predicate Calculus
– Temporal and other Higher Order Logics
• Models of Uncertainty
– Bayesian Probability Theory
3
NLP Algorithms
• Most are transducers: accept or reject
input, and construct new structure from
input
– State space search
• To manage the problem of making choices
during processing when we lack the information
needed to make the right choice
– Dynamic programming
• To avoid having to redo work during the course
of a state-space search
4
State Space Search
• States represent pairings of partially
processed inputs with partially
constructed answers
• Goals are exhausted inputs paired with
complete answers that satisfy some
criteria
• The spaces are normally too large to
exhaustively explore
5
Dynamic Programming
• Don’t do the same work over and over
• Avoid this by building and making use of
solutions to sub-problems that must be
invariant across all parts of the space
6
Regular Expressions and Text
Searching
• Regular expression (RE): A formula (in a
special language) for specifying a set of
strings
• String: A sequence of alphanumeric
characters (letters, numbers, spaces, tabs,
and punctuation)
7
Regular Expression Patterns
• Regular Expression can be considered as
a pattern to specify text search strings to
search a corpus of texts
• What is Corpus?
• For text search purpose: use Perl syntax
• Show the exact part of the string in a line
that first matches a Regular Expression
pattern
8
Regular Expression Patterns
RE String matched
10
RE Description
/a*/ Zero or more a’s
/a+/ One or more a’s
12
Sidebar: Errors
• The process we just went through was
based on two fixing kinds of errors
– Matching strings that we should not have
matched (there, then, other)
• False positives
– Not matching things that we should have
matched (The)
• False negatives
13
Sidebar: Errors
14
Regular expressions
• Basic regular expression patterns
• Perl-based syntax (slightly different from
other notations for regular expressions)
• Disjunctions [abc]
• Ranges [A-Z]
• Negations [^Ss]
• Optional characters ? and *
• Wild cards .
• Anchors ^ and $, also \b and \B
• Disjunction, grouping, and precedence |
15
16
Writing correct expressions
• Exercise: write a Perl regular expression to
match the English article “the”:
/the/ missed ‘The’
/[^a-zA-Z][tT]he[^a-zA-Z]/
Missed ‘The’ at the beginning of a line
/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
17
A more complex example
• Exercise: Write a regular expression that will
match “any PC with more than 500MHz and 32
Gb of disk space for less than $1000”:
18
Example
• Price
– /$[0-9]+/ # whole dollars
– /$[0-9]+\.[0-9][0-9]/ # dollars and cents
– /$[0-9]+(\.[0-9][0-9])?/ #cents optional
– /\b$[0-9]+(\.[0-9][0-9])?\b/ #word boundaries
• Specifications for processor speed
– /\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/
• Memory size
– /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
– /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
• Vendors
– /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
– /\b(Mac|Macintosh|Apple)\b/
19
Advanced Operators
Underscore:
Correct figure 2.6
20
21
Substitutions in RE
• Substitutions
s/colour/color/
23
baa!
baaa!
Finite State Automata baaaa!
baaaaa!
...
• FSAs recognize the regular languages
represented by regular expressions
– SheepTalk: /baa+!/ a
b a a !
q0 q1 q2 q3 q4
• Directed graph with labeled nodes and arc transitions
•Five states: q0 the start state, q4 the final state, 5
transitions
24
Languages
• A language is a set of strings
25
Finite State Automata
• Regular expressions can be viewed as a
textual way of specifying the structure of
finite-state automata.
26
Three Views: REs, FSA, RL
• Three equivalent formal ways to look at
what we’re up to
Regular Expressions
27
Recognition
• The process of determining if a string
should be accepted by a machine
• The process of determining if a string is in
the language we’re defining with the
machine
• The process of determining if a regular
expression matches a string
28
Recognition
• in the start state
• Examine the current input (tape)
• Consult the transition table
• Go to the next state and update the tape
pointer
• Repeat until you run out of tape
29
D-RECOGNIZE
function D-RECOGNIZE (tape, machine) returns accept or reject
index Beginning of tape
current-state Initial state of machine
loop
if End of input has been reached then
if current-state is an accept state then
return accept
else
return reject
else if transition-table [current-state, tape[index]] is empty then
return reject
else
current-state transition-table [current-state, tape[index]]
index index + 1
end
30
D-Recognize
• Deterministic means that at each point in
processing there is always one unique
thing to do (no choices)
• D-recognize algorithm is a simple table-
driven interpreter
• The algorithm is universal for all
unambiguous languages
– To change the machine, you change the table
31
Recognition as Search
• You can view this algorithm as a kind of
state-space search
• States are pairings of tape positions and
state numbers
• Operators are compiled into the table
• Goal state is a pairing with the end of tape
position and a final accept state
32
Generative Formalisms
• Formal Languages are sets of strings
composed of symbols from a finite set of
symbols
• Finite-state automate define formal
languages (without having to enumerate
all the strings in the language)
• The term Generative is based on the view
that you can run the machine as a
generator to get strings from the language
33
Generative Formalisms
• FSAs can be viewed from two
perspectives:
– Acceptors that can tell you if a string is in the
language
– Generators to produce all and only the strings
in the language
34
D-RECOGNIZE
function D-RECOGNIZE (tape, machine) returns accept or reject
index Beginning of tape
current-state Initial state of machine
loop
if End of input has been reached then
if current-state is an accept state then
return accept
else
return reject
else if transition-table [current-state, tape[index]] is empty then
return reject
else
current-state transition-table [current-state, tape[index]]
index index + 1
end
35
Deterministic Algorithm
37
Assignment 2- Part 1
• Practice search in Ms Word using regular
expressions (Wildcards) for both Arabic
and English. Submit at least 5 nontrivial
examples.
38
Assignment 2 - Part 2
• You have been asked to participate in
writing an exam about chapter 2 of the
textbook. Write one question to check
student understanding of chapter two
material. Include the answer in your
submission.
39