Regular Expression & Autometa

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 62

Natural Language

Processing
Definitions

• A language is a set of strings


• String: A sequence of letters
 Examples: “cat”, “dog”, “house”, …
 Defined over an alphabet:

  a, b, c,, z

2
Alphabets and Strings

• We will use small alphabets:   a, b


• Strings
a
ab u  ab
abba v  bbbaaa
baba w  abba
aaabbbaabab
3
Regular Expressions

• In computer science, RE is a language used for


specifying text search string.
• A regular expression is a formula in a special language
that is used for specifying a simple class of string.
• Formally, a regular expression is an algebraic notation
for characterizing a set of strings.
• RE search requires
 a pattern that we want to search for, and
 a corpus of texts to search through.
Basic Regular Expression Patterns

• The use of the brackets [] to specify a disjunction of characters.

• The use of the brackets [] plus the dash - to specify a range.

5
Basic Regular Expression Patterns

• Uses of the caret ^ for negation or just to mean ^

• The question-mark ? marks optionality of the previous expression.

• The use of period . to specify any character

6
Disjunction, Grouping, and Precedence

• Disjunction
• /cat|dog
• Precedence
• /gupp(y|ies)
• To find the English article the
• /the/
• /[tT]he/
• /[^a-zA-Z][tT]he[^a-zA-Z]/

7
Aliases for common sets of characters

8
Regular expression operators for counting

9
Some characters that need to be backslashed

10
Finite State Automata

• FSAs recognize the regular languages represented by


regular expressions
 SheepTalk: /baa+!/
a
b a a !

q0 q1 q2 q3 q4

• Directed graph with labeled nodes and arc transitions


•Five states: q0 the start state, q4 the final state, 5
transitions

11
Formally

• FSA is a 5-tuple consisting of


 Q: set of states {q0,q1,q2,q3,q4}
 : an alphabet of symbols {a,b,!}
 q0: A start state
 F: a set of final states in Q {q4}
 (q,i): a transition function mapping Q x 
to Q
a
b a a !

q0 q1 q2 q3 q4
12
• FSA recognizes (accepts) strings of a
regular language
 baa! a
 baaa! b a a !
q0 q1 q2 q3 q4
 baaaa!
…
• Tape Input: a rejected input

a b a ! b

13
State Transition Table for SheepTalk

Input
State
baa! b a !
baaa!
baaaa!
0 1 Ø Ø
baaaaa
! 1 Ø 2 Ø
...
2 Ø 3 Ø
a 3 Ø 3 4
b a a !
4 Ø Ø Ø
q0 q1 q2 q3 q4

14
Non-Deterministic FSAs for SheepTalk

b a a a !

q0 q1 q2 q3 q4

b a a !

q0 q1 q2 q3 q4

15
Finite Accepter

• Input
String
Output
“Accept”
Finite
or
Automata
“Reject”

16
Transition Graph

abba -Finite Accepter a, b


q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

initial final
state state
transition
state “accept”

17
Initial Configuration

• Input String
a b b a
a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

18
Reading the Input

• a b b a

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 19
• a b b a

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 20
• a b b a

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 21
• a b b a

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 22
a b b a

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

Output: “accept”
2/5/2020 23
Rejection

• a b a

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 24
• a b a

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 25
• a b a

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 26
• a b a

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 27
a b a

a, b
Output:
q5 “reject”
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 28
Another Example

a a b

a a, b

q0 b q1 a, b q2

2/5/2020 29
a a b

a a, b

q0 b q1 a, b q2

2/5/2020 30
a a b

a a, b

q0 b q1 a, b q2

2/5/2020 31
a a b

a a, b

q0 b q1 a, b q2

2/5/2020 32
a a b

a a, b
Output: “accept”

q0 b q1 a, b q2

2/5/2020 33
Rejection

b a b

a a, b

q0 b q1 a, b q2

2/5/2020 34
b a b

a a, b

q0 b q1 a, b q2

2/5/2020 35
b a b

a a, b

q0 b q1 a, b q2

2/5/2020 36
b a b

a a, b

q0 b q1 a, b q2

2/5/2020 37
b a b

a a, b

q0 b q1 a, b q2

Output: “reject”

2/5/2020 38
Formalities

• Deterministic Finite Accepter (DFA)

M  Q, ,  , q0 , F 
Q : set of states
 : input alphabet

 : transition function
q0 : initial state

F : set of final states


2/5/2020 39
About Alphabets

• Alphabets means we need a finite set of symbols in the


input.
• These symbols can and will stand for bigger objects that
can have internal structure.

2/5/2020 40
Input Aplhabet

•   a, b

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 41
Set of States
Q
Q  q0 , q1, q2 , q3 , q4 , q5 

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 42
Initial State
q0

a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 43
Set of Final States
F
F  q4 
a, b

q5
a a, b
b a b
q0 a q1 b q2 b q3 a q4

2/5/2020 44
Transition Function

 :Q  Q

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 45
 q0 , a   q1

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 46
 q0 , b   q5

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 47
 q2 , b   q3

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 48
Transition Function

 a b
q0 q1 q5
q1 q5 q2
q2 q2 q3 a, b
q3 q4 q5
q4 q5 q5 q5
q5 q5 q5 a, b
b a a b
q0 a q1 b q2 b q3 a q4

2/5/2020 49
Extended Transition Function
(Reads the entire string) *

 * : Q  *  Q

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 50
 * q0 , ab   q2

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 51
 * q0 , abba   q4

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 52
 * q0 , abbbaa   q5

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 53
Observation: There is a walk from q0 to q 5
with label abbbaa

 * q0 , abbbaa   q5
a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

2/5/2020 54
Example

LM   abba M

a, b

q5
b a a, b
a b
q0 a q1 b q2 b q3 a q4

accept

2/5/2020 55
Another Example

LM   ab, abba


M

a, b

q5
b a a a, b
b
q0 a q1 b q2 b q3 a q4

accept accept accept

2/5/2020 56
More Examples

LM   {a b : n  0}
n

a a, b

q0 b q1 a, b q2

accept trap state

2/5/2020 57
LM = { all substrings with prefix ab }
a, b

q0 a q1 b q2

b a accept

q3 a, b

2/5/2020 58
LM  = { all strings without
substring 001 }

1 0 0,1
1

 0 1
0 00 001

0
2/5/2020 59
Regular Languages

• A language L is regular if there is


a DFA M such that L  LM 

• All regular languages form a language


family

2/5/2020 60
Example

• The language L  awa : w  a, b*


• is regular:
a
b
b
q0 a q2 q3

b a
q4

a, b

2/5/2020 61
Dollars and Cents

2/5/2020 62

You might also like