Lect#2 RegularExprAndAutomata

Download as pdf or txt
Download as pdf or txt
You are on page 1of 39

Lecture 2:

REGULAR EXPRESSIONS
AND AUTOMATA
NLP Representations
• State Machines
– FSAs: Finite State Automata
– FSTs: Finite State Transducers
– HMMs: Hidden Markov Model
– ATNs: Augmented Transition Network
– RTNs: Recursive Transition Network

2
NLP Representations
• Rule Systems:
– CFGs: Context Free Grammar
– Unification Grammars
– Probabilistic CFGs
• Logic-based Formalisms
– 1st Order Predicate Calculus
– Temporal and other Higher Order Logics
• Models of Uncertainty
– Bayesian Probability Theory
3
NLP Algorithms
• Most are transducers: accept or reject
input, and construct new structure from
input
– State space search
• To manage the problem of making choices
during processing when we lack the information
needed to make the right choice
– Dynamic programming
• To avoid having to redo work during the course
of a state-space search

4
State Space Search
• States represent pairings of partially
processed inputs with partially
constructed answers
• Goals are exhausted inputs paired with
complete answers that satisfy some
criteria
• The spaces are normally too large to
exhaustively explore

5
Dynamic Programming
• Don’t do the same work over and over
• Avoid this by building and making use of
solutions to sub-problems that must be
invariant across all parts of the space

6
Regular Expressions and Text
Searching
• Regular expression (RE): A formula (in a
special language) for specifying a set of
strings
• String: A sequence of alphanumeric
characters (letters, numbers, spaces, tabs,
and punctuation)

7
Regular Expression Patterns
• Regular Expression can be considered as
a pattern to specify text search strings to
search a corpus of texts
• What is Corpus?
• For text search purpose: use Perl syntax
• Show the exact part of the string in a line
that first matches a Regular Expression
pattern
8
Regular Expression Patterns

RE String matched

/woodchucks/ “interesting links to woodchucks


and lemurs”
/a/ “Sarah Ali stopped by Mona’s”
/Ali says,/ “My gift please,” Ali says,”
/book/ “all our pretty books”

/!/ “Leave him behind!” said Sami


9
RE Match

/[wW]oodchuck/ Woodchuck or woodchuck


/[abc]/ “a”, “b”, or “c”
/[0123456789]/ Any digit

10
RE Description
/a*/ Zero or more a’s
/a+/ One or more a’s

/a?/ Zero or one a’s


/cat|dog/ ‘cat’ or ‘dog’
/^cat$/ A line containing only ‘cat’
/\bun\B/ Beginnings of longer strings starts
by ‘un’ 11
Example
• Find all instances of the word “the” in a
text.
– /the/
• What About ‘The’
– /[tT]he/
• What about ‘Theater”, ‘Another’
– /\b[tT]he\b/

12
Sidebar: Errors
• The process we just went through was
based on two fixing kinds of errors
– Matching strings that we should not have
matched (there, then, other)
• False positives
– Not matching things that we should have
matched (The)
• False negatives

13
Sidebar: Errors

• Reducing the error rate for an application


often involves two efforts
– Increasing accuracy (minimizing false
positives)
– Increasing coverage (minimizing false
negatives)

14
Regular expressions
• Basic regular expression patterns
• Perl-based syntax (slightly different from
other notations for regular expressions)
• Disjunctions [abc]
• Ranges [A-Z]
• Negations [^Ss]
• Optional characters ? and *
• Wild cards .
• Anchors ^ and $, also \b and \B
• Disjunction, grouping, and precedence |
15
16
Writing correct expressions
• Exercise: write a Perl regular expression to
match the English article “the”:
/the/ missed ‘The’

/[tT]he/ included ‘the’ in ‘others’

/\b[tT]he\b/ Missed ‘the25’ ‘the_’

/[^a-zA-Z][tT]he[^a-zA-Z]/
Missed ‘The’ at the beginning of a line

/(^|[^a-zA-Z])[tT]he[^a-zA-Z]/
17
A more complex example
• Exercise: Write a regular expression that will
match “any PC with more than 500MHz and 32
Gb of disk space for less than $1000”:

18
Example
• Price
– /$[0-9]+/ # whole dollars
– /$[0-9]+\.[0-9][0-9]/ # dollars and cents
– /$[0-9]+(\.[0-9][0-9])?/ #cents optional
– /\b$[0-9]+(\.[0-9][0-9])?\b/ #word boundaries
• Specifications for processor speed
– /\b[0-9]+ *(MHz|[Mm]egahertz|Ghz|[Gg]igahertz)\b/
• Memory size
– /\b[0-9]+ *(Mb|[Mm]egabytes?)\b/
– /\b[0-9](\.[0-9]+) *(Gb|[Gg]igabytes?)\b/
• Vendors
– /\b(Win95|WIN98|WINNT|WINXP *(NT|95|98|2000|XP)?)\b/
– /\b(Mac|Macintosh|Apple)\b/

19
Advanced Operators

Underscore:
Correct figure 2.6
20
21
Substitutions in RE
• Substitutions

s/colour/color/

• Memory (\1, \2, etc. refer back to matches)


s/([0-9]+)/<\1>/
• Put angle brackets around all integers
• Practice with Microsoft Word
22
Assignment: Try regular
expressions in MS WORD in both
Arabic & English

23
baa!
baaa!
Finite State Automata baaaa!
baaaaa!
...
• FSAs recognize the regular languages
represented by regular expressions
– SheepTalk: /baa+!/ a
b a a !

q0 q1 q2 q3 q4
• Directed graph with labeled nodes and arc transitions
•Five states: q0 the start state, q4 the final state, 5
transitions

24
Languages
• A language is a set of strings

• String: A sequence of letters


– Examples: “cat”, “dog”, “house”,

– Defined over an alphabet:

25
Finite State Automata
• Regular expressions can be viewed as a
textual way of specifying the structure of
finite-state automata.

26
Three Views: REs, FSA, RL
• Three equivalent formal ways to look at
what we’re up to

Regular Expressions

Finite State Automata Regular Languages

27
Recognition
• The process of determining if a string
should be accepted by a machine
• The process of determining if a string is in
the language we’re defining with the
machine
• The process of determining if a regular
expression matches a string

28
Recognition
• in the start state
• Examine the current input (tape)
• Consult the transition table
• Go to the next state and update the tape
pointer
• Repeat until you run out of tape

29
D-RECOGNIZE
function D-RECOGNIZE (tape, machine) returns accept or reject
index  Beginning of tape
current-state  Initial state of machine
loop
if End of input has been reached then
if current-state is an accept state then
return accept
else
return reject
else if transition-table [current-state, tape[index]] is empty then
return reject
else
current-state  transition-table [current-state, tape[index]]
index  index + 1
end
30
D-Recognize
• Deterministic means that at each point in
processing there is always one unique
thing to do (no choices)
• D-recognize algorithm is a simple table-
driven interpreter
• The algorithm is universal for all
unambiguous languages
– To change the machine, you change the table

31
Recognition as Search
• You can view this algorithm as a kind of
state-space search
• States are pairings of tape positions and
state numbers
• Operators are compiled into the table
• Goal state is a pairing with the end of tape
position and a final accept state

32
Generative Formalisms
• Formal Languages are sets of strings
composed of symbols from a finite set of
symbols
• Finite-state automate define formal
languages (without having to enumerate
all the strings in the language)
• The term Generative is based on the view
that you can run the machine as a
generator to get strings from the language
33
Generative Formalisms
• FSAs can be viewed from two
perspectives:
– Acceptors that can tell you if a string is in the
language
– Generators to produce all and only the strings
in the language

34
D-RECOGNIZE
function D-RECOGNIZE (tape, machine) returns accept or reject
index  Beginning of tape
current-state  Initial state of machine
loop
if End of input has been reached then
if current-state is an accept state then
return accept
else
return reject
else if transition-table [current-state, tape[index]] is empty then
return reject
else
current-state  transition-table [current-state, tape[index]]
index  index + 1
end
35
Deterministic Algorithm

1. Index the tape to the beginning and the machine to the


initial state.
2. First check to see if you have any more input
•If no and you’re in a final state, ACCEPT
•If no and you’re in a non-final state reject
3. If you have more input check what state you’re in by
consulting the transition table. The index of the Current
State tells you what row in the table to consult. The index
on the tape symbol tells you what column to consult in the
table. Loop through until no more input then go back to 2.
36
Languages and automata
• Formal languages: regular languages, non-regular
languages
• deterministic vs. non-deterministic FSAs
• Epsilon () transitions
–  is the empty string & Ø is the empty set (empty regular
language)

37
Assignment 2- Part 1
• Practice search in Ms Word using regular
expressions (Wildcards) for both Arabic
and English. Submit at least 5 nontrivial
examples.

38
Assignment 2 - Part 2
• You have been asked to participate in
writing an exam about chapter 2 of the
textbook. Write one question to check
student understanding of chapter two
material. Include the answer in your
submission.

39

You might also like