PYTHON REGULAR EXPRESSIONS
http://www.tutorialspoint.com/python/python_reg_expressions.htm Copyright © tutorialspoint.com
A regular expression is a special sequence of characters that helps you match or find other strings or sets of
strings, using a specialized syntax held in a pattern. Regular expressions are widely used in the UNIX world.
The module re provides full support for Perl-like regular expressions in Python. The re module raises the
exception re.error if an error occurs while compiling or using a regular expression.
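As a minimal sketch of the exception mentioned above, the snippet below compiles a deliberately invalid pattern and catches re.error. The pattern and the variable names are illustrative only.

```python
import re

# An unbalanced parenthesis makes this pattern invalid.
bad_pattern = r'(unclosed'

try:
    re.compile(bad_pattern)
    compiled_ok = True
except re.error as exc:
    # re.error carries a human-readable description of the problem.
    compiled_ok = False
    print("Invalid pattern:", exc)
```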
We will cover two important functions that are used to handle regular expressions. But a small thing
first: several characters have a special meaning when they are used in a regular expression.
To avoid any confusion while dealing with regular expressions, we will use raw strings, written as r'expression'.
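The following short sketch shows why raw strings matter. In a normal string literal Python itself interprets "\b" as a backspace character before re ever sees it, while in a raw string the backslash survives and re reads it as a word boundary. The sample text is made up for illustration.

```python
import re

text = "a cat sat"

# Normal literal: "\b" becomes the backspace character (0x08).
normal = re.search("\bcat\b", text)
# Raw literal: r"\b" reaches re unchanged and means "word boundary".
raw = re.search(r"\bcat\b", text)

print(normal)         # None
print(raw.group())    # cat
```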
The match Function
This function attempts to match the RE pattern at the beginning of string, with optional flags.
Here is the syntax for this function:
re.match(pattern, string, flags=0)
Here is the description of the parameters:
Parameter    Description
pattern      This is the regular expression to be matched.
string       This is the string that is searched for the pattern; the match must start at the
             beginning of the string.
flags        You can specify different flags using bitwise OR (|). These are modifiers,
             which are listed in the table below.
The re.match function returns a match object on success, None on failure. We use the group(num) or
groups() method of the match object to get the matched text.
Match Object Methods    Description
group(num=0)            This method returns the entire match (or the specific subgroup num).
groups()                This method returns all matching subgroups in a tuple (empty if there
                        weren't any).
Example:
#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# Group 1 is greedy, group 2 is non-greedy; re.I ignores case
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
When the above code is executed, it produces the following result:
matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
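The example above uses group() only. As a small complementary sketch, groups() returns every captured subgroup at once, and it returns an empty tuple when the pattern has no groups. The variable names are illustrative.

```python
import re

line = "Cats are smarter than dogs"

m = re.match(r'(.*) are (.*?) .*', line, re.I)
all_groups = m.groups()    # every captured subgroup, as a tuple
whole = m.group(0)         # same as m.group(): the entire match

# A pattern without parentheses has no subgroups.
no_groups = re.match(r'Cats', line).groups()

print(all_groups)   # ('Cats', 'smarter')
print(no_groups)    # ()
```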
The search Function
This function searches for the first occurrence of the RE pattern within string, with optional flags.
Here is the syntax for this function:
re.search(pattern, string, flags=0)
Here is the description of the parameters:
Parameter    Description
pattern      This is the regular expression to be matched.
string       This is the string that is searched for the pattern; the match may start anywhere in
             the string.
flags        You can specify different flags using bitwise OR (|). These are modifiers,
             which are listed in the table below.
The re.search function returns a match object on success, None on failure. We use the group(num) or
groups() method of the match object to get the matched text.
Match Object Methods    Description
group(num=0)            This method returns the entire match (or the specific subgroup num).
groups()                This method returns all matching subgroups in a tuple (empty if there
                        weren't any).
Example:
#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# re.search scans the whole string for the first place the pattern matches
matchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
When the above code is executed, it produces the following result:
matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
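Because re.search returns None on failure, calling group() on the result without a check raises an error. The helper below, first_number, is a hypothetical name used only for this sketch; it returns the first run of digits found anywhere in the text, or None.

```python
import re

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    m = re.search(r'\d+', text)
    return m.group() if m else None

print(first_number("Order 66 shipped in 3 days"))   # 66
print(first_number("no digits here"))               # None
```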
Matching vs Searching:
Python offers two different primitive operations based on regular expressions: match checks for a match only at
the beginning of the string, while search checks for a match anywhere in the string (this is what Perl does by
default).
Example:
#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# match: "dogs" is not at the beginning of the string, so this fails
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")

# search: "dogs" is found later in the string
matchObj = re.search(r'dogs', line, re.M | re.I)
if matchObj:
    print("search --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")
When the above code is executed, it produces the following result:
No match!!
search --> matchObj.group() : dogs
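One detail that is easy to miss: even with the re.M flag, re.match still tries only the very beginning of the string, not the beginning of each line. The sketch below contrasts it with re.search and the ^ anchor; the two-line sample text is invented for the illustration.

```python
import re

text = "first line\nsecond line"

# re.match only tries position 0, even with re.M.
at_start = re.match(r'second', text, re.M)
# re.search with ^ and re.M can match at the start of any line.
any_line = re.search(r'^second', text, re.M)

print(at_start)            # None
print(any_line.group())    # second
```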
Search and Replace:
One of the most important re functions is sub.
Syntax:
re.sub(pattern, repl, string, count=0, flags=0)
This function replaces occurrences of the RE pattern in string with repl, substituting all occurrences unless
count is provided, in which case at most count replacements are made. It returns the modified string.
Example:
#!/usr/bin/python3
import re

phone = "2004-959-559 # This is Phone Number"

# Delete Python-style comments
num = re.sub(r'#.*$', "", phone)
print("Phone Num : ", num)

# Remove anything other than digits
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)
When the above code is executed, it produces the following result:
Phone Num : 2004-959-559
Phone Num : 2004959559
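To limit the number of replacements, Python's re.sub takes a count argument. A minimal sketch, reusing the phone number from the example above without its comment:

```python
import re

phone = "2004-959-559"

# count=1 replaces only the first occurrence.
first_only = re.sub(r'-', "", phone, count=1)
# Without count (or with count=0) every occurrence is replaced.
all_removed = re.sub(r'-', "", phone)

print(first_only)    # 2004959-559
print(all_removed)   # 2004959559
```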
Regular-expression Modifiers - Option Flags
Regular expression functions accept an optional flags argument that controls various aspects of matching. You
can combine multiple modifiers using bitwise OR (|), as shown previously. The available modifiers are:
Modifier    Description
re.I        Performs case-insensitive matching.
re.L        Interprets words according to the current locale. This interpretation affects the
            alphabetic group (\w and \W), as well as word boundary behavior (\b and \B).
            In Python 3 this flag can be used only with bytes patterns.
re.M        Makes $ match the end of a line (not just the end of the string) and makes ^ match
            the start of any line (not just the start of the string).
re.S        Makes a period (dot) match any character, including a newline.
re.U        Interprets letters according to the Unicode character set. This flag affects the
            behavior of \w, \W, \b, \B. In Python 3 this is already the default for str patterns.
re.X        Permits "cuter" regular expression syntax. It ignores whitespace (except inside
            a set [] or when escaped by a backslash) and treats unescaped # as a comment
            marker.
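The sketch below exercises several of these flags on small made-up strings: re.I and re.M combined with bitwise OR, re.S for the dot, and re.X for a commented pattern. The variable names are illustrative.

```python
import re

text = "Python\npython rocks"

# re.I ignores case; re.M lets ^ and $ match at line boundaries.
hits = re.search(r'^python rocks$', text, re.I | re.M)

# re.S lets the dot match the newline between the two lines.
across_lines = re.search(r'Python.python', text, re.S)
without_s = re.search(r'Python.python', text)

# re.X ignores whitespace and # comments inside the pattern.
number = re.search(r'''
    \d+      # integer part
    \.       # decimal point
    \d+      # fraction
''', "pi is 3.14", re.X)

print(hits.group())       # python rocks
print(without_s)          # None
print(number.group())     # 3.14
```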
Regular-expression patterns:
Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a
metacharacter by preceding it with a backslash.
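A brief sketch of escaping, using invented sample strings: an unescaped + or . keeps its special meaning, while a backslash in front makes it match the literal character.

```python
import re

# Unescaped, + is a quantifier, so the literal text "1+1" is not found.
unescaped_plus = re.search(r'1+1', "1+1=2")
# Escaped, \+ matches a literal plus sign.
escaped_plus = re.search(r'1\+1', "1+1=2")

# Unescaped, the dot matches any character; escaped, only a real dot.
dot_any = re.search(r'a.c', "abc")
dot_literal = re.search(r'a\.c', "abc")

print(unescaped_plus)          # None
print(escaped_plus.group())    # 1+1
print(dot_any.group())         # abc
print(dot_literal)             # None
```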
The following table lists the regular expression syntax that is available in Python:
Pattern         Description
^               Matches beginning of the string; with re.M, also the beginning of each line.
$               Matches end of the string (or just before a trailing newline); with re.M, also the
                end of each line.
.               Matches any single character except newline. Using the re.S option allows it to match
                newline as well.
[...]           Matches any single character in brackets.
[^...]          Matches any single character not in brackets.
re*             Matches 0 or more occurrences of preceding expression.
re+             Matches 1 or more occurrences of preceding expression.
re?             Matches 0 or 1 occurrence of preceding expression.
re{n}           Matches exactly n occurrences of preceding expression.
re{n,}          Matches n or more occurrences of preceding expression.
re{n,m}         Matches at least n and at most m occurrences of preceding expression.
a|b             Matches either a or b.
(re)            Groups regular expressions and remembers matched text.
(?imx)          Turns on the i, m, or x options. In Python this applies to the whole expression and
                must appear at its start.
(?-imx)         Turns off the i, m, or x options in Perl-style engines. Python accepts this only in
                the scoped form (?-imx:re).
(?:re)          Groups regular expressions without remembering matched text.
(?imx:re)       Temporarily toggles on i, m, or x options within parentheses (Python 3.6 and later).
(?-imx:re)      Temporarily toggles off i, m, or x options within parentheses (Python 3.6 and later).
(?#...)         Comment.
(?=re)          Specifies position using a pattern. Doesn't have a range.
(?!re)          Specifies position using pattern negation. Doesn't have a range.
(?>re)          Matches independent pattern without backtracking (Python 3.11 and later).
\w              Matches word characters.
\W              Matches nonword characters.
\s              Matches whitespace. For ASCII text, equivalent to [ \t\n\r\f\v].
\S              Matches nonwhitespace.
\d              Matches digits. For ASCII text, equivalent to [0-9].
\D              Matches nondigits.
\A              Matches beginning of string.
\Z              Matches end of string. In Python it matches only at the very end; in Perl it also
                matches just before a final newline.
\z              Matches end of string. Only recent Python releases accept \z; use \Z otherwise.
\G              Matches point where last match finished (Perl syntax; not supported by Python's re
                module).
\b              Matches word boundaries when outside brackets. Matches backspace (0x08)
                when inside brackets.
\B              Matches nonword boundaries.
\n, \t, etc.    Matches newlines, carriage returns, tabs, etc.
\1...\9         Matches nth grouped subexpression.
\10             Matches nth grouped subexpression if it matched already. Otherwise refers to
                the octal representation of a character code.
REGULAR-EXPRESSION EXAMPLES
Literal characters:
Example         Description
python          Match "python".
Character classes:
Example         Description
[Pp]ython       Match "Python" or "python"
rub[ye]         Match "ruby" or "rube"
[aeiou]         Match any one lowercase vowel
[0-9]           Match any digit; same as [0123456789]
[a-z]           Match any lowercase ASCII letter
[A-Z]           Match any uppercase ASCII letter
[a-zA-Z0-9]     Match any of the above
[^aeiou]        Match anything other than a lowercase vowel
[^0-9]          Match anything other than a digit
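A short sketch that tries a few of these character classes with re.match and re.sub. The sample words and the names results, digits_only, and masked are made up for illustration.

```python
import re

# [Pp]ython accepts either capitalization of the first letter only.
results = {w: bool(re.match(r'[Pp]ython', w)) for w in ["Python", "python", "Jython"]}

# [^0-9] matches anything that is not a digit, so removing it keeps digits.
digits_only = re.sub(r'[^0-9]', "", "tel: 555-0199")

# [aeiou] matches any one lowercase vowel.
masked = re.sub(r'[aeiou]', "*", "regular")

print(results)       # {'Python': True, 'python': True, 'Jython': False}
print(digits_only)   # 5550199
print(masked)        # r*g*l*r
```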
Special Character Classes:
Example    Description
.          Match any character except newline
\d         Match a digit: [0-9]
\D         Match a nondigit: [^0-9]
\s         Match a whitespace character: [ \t\r\n\f\v]
\S         Match nonwhitespace: [^ \t\r\n\f\v]
\w         Match a single word character: [A-Za-z0-9_]
\W         Match a nonword character: [^A-Za-z0-9_]
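As a quick illustration of these shorthand classes on ASCII input, the sketch below collapses whitespace, extracts a word, and strips nonword characters. All sample strings are invented.

```python
import re

# \s+ matches a run of spaces, tabs, or newlines.
collapsed = re.sub(r'\s+', " ", "a  b\t\nc")

# \w+ matches letters, digits, and underscores.
word = re.search(r'\w+', "  hello_world42!").group()

# \W matches everything that \w does not.
stripped = re.sub(r'\W', "", "a-b c!")

print(collapsed)   # a b c
print(word)        # hello_world42
print(stripped)    # abc
```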
Repetition Cases:
Example    Description
ruby?      Match "rub" or "ruby": the y is optional
ruby*      Match "rub" plus 0 or more ys
ruby+      Match "rub" plus 1 or more ys
\d{3}      Match exactly 3 digits
\d{3,}     Match 3 or more digits
\d{3,5}    Match 3, 4, or 5 digits
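The repetition operators can be checked with a small sketch like the one below. The \Z anchor is added here only so that trailing characters cause a failure; the inputs are illustrative.

```python
import re

# ruby? : the final y is optional, but a second y is not allowed before \Z.
optional_y = [bool(re.match(r'ruby?\Z', s)) for s in ["rub", "ruby", "rubyy"]]

digits = "id 1234567"
exactly_three = re.search(r'\d{3}', digits).group()      # first 3 digits
three_or_more = re.search(r'\d{3,}', digits).group()     # as many as possible
three_to_five = re.search(r'\d{3,5}', digits).group()    # greedy, capped at 5
too_short = re.search(r'\d{3}', "a12b")                  # only 2 digits available

print(optional_y)      # [True, True, False]
print(exactly_three)   # 123
print(three_or_more)   # 1234567
print(three_to_five)   # 12345
print(too_short)       # None
```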
Nongreedy repetition:
This matches the smallest number of repetitions:
Example    Description
<.*>       Greedy repetition: matches "<python>perl>"
<.*?>      Nongreedy: matches "<python>" in "<python>perl>"
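The two rows of the table can be reproduced directly. A minimal sketch:

```python
import re

text = "<python>perl>"

greedy = re.search(r'<.*>', text).group()    # .* grabs as much as it can
lazy = re.search(r'<.*?>', text).group()     # .*? stops at the first >

print(greedy)   # <python>perl>
print(lazy)     # <python>
```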
Grouping with parentheses:
Example              Description
\D\d+                No group: + repeats \d
(\D\d)+              Grouped: + repeats \D\d pair
([Pp]ython(, )?)+    Match "Python", "Python, python, python", etc.
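The sketch below shows how the parentheses change what + repeats. Note that a repeated group remembers only its last repetition. The input strings are made up.

```python
import re

# Without a group, + repeats only \d.
no_group = re.match(r'\D\d+', "a1b2c3").group()

# With a group, + repeats the whole \D\d pair.
m = re.match(r'(\D\d)+', "a1b2c3")
grouped = m.group()
last_pair = m.group(1)    # only the last repetition is kept

listing = re.match(r'([Pp]ython(, )?)+', "Python, python, python").group()

print(no_group)    # a1
print(grouped)     # a1b2c3
print(last_pair)   # c3
print(listing)     # Python, python, python
```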
Backreferences:
This matches a previously matched group again:
Example               Description
([Pp])ython&\1ails    Match python&pails or Python&Pails
(['"]).*?\1           Single or double-quoted string. \1 matches whatever the 1st group matched. \2
                      matches whatever the 2nd group matched, etc.
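A small sketch of backreferences in action. The quoted-string pattern here uses a lazy .*? between the quotes, which is a choice made for this illustration; inside square brackets \1 would not act as a backreference in Python. The sample strings are invented.

```python
import re

# \1 must repeat exactly what group 1 captured, including its case.
same_case = re.search(r'([Pp])ython&\1ails', "Python&Pails")
mixed_case = re.search(r'([Pp])ython&\1ails', "Python&pails")

# The closing quote must be the same character as the opening quote.
quoted = re.search(r'''(['"]).*?\1''', '''say "it's fine" now''').group()

print(same_case.group())   # Python&Pails
print(mixed_case)          # None
print(quoted)              # "it's fine"
```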
Alternatives:
Example          Description
python|perl      Match "python" or "perl"
rub(y|le)        Match "ruby" or "ruble"
Python(!+|\?)    "Python" followed by one or more ! or one ?
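The alternation rows can be tried out as follows. This is only a sketch, with sample sentences invented for the purpose.

```python
import re

either = re.search(r'python|perl', "I like perl").group()
ruble = re.search(r'rub(y|le)', "ten rubles").group()
excited = re.search(r'Python(!+|\?)', "Python!!! rocks").group()
asking = re.search(r'Python(!+|\?)', "Python?!").group()

print(either)    # perl
print(ruble)     # ruble
print(excited)   # Python!!!
print(asking)    # Python?
```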
Anchors:
Anchors specify the match position.
Example        Description
^Python        Match "Python" at the start of a string, or of an internal line when re.M is set
Python$        Match "Python" at the end of a string, or of a line when re.M is set
\APython       Match "Python" at the start of a string
Python\Z       Match "Python" at the end of a string
\bPython\b     Match "Python" at a word boundary
\brub\B        \B is nonword boundary: match "rub" in "rube" and "ruby" but not alone
Python(?=!)    Match "Python", if followed by an exclamation point
Python(?!!)    Match "Python", if not followed by an exclamation point
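The anchors and lookaheads above match positions rather than characters, which the following sketch tries to make visible. The inputs are illustrative; note that the lookahead does not include the exclamation point in the match.

```python
import re

two_lines = "I like\nPython"

line_start_plain = re.search(r'^Python', two_lines)          # ^ is string start only
line_start_multi = re.search(r'^Python', two_lines, re.M)    # ^ is any line start
string_start = re.search(r'\APython', two_lines, re.M)       # \A ignores re.M

whole_word = re.search(r'\bPython\b', "Pythonic")    # no boundary after "Python"
rub_inside = re.search(r'\brub\B', "ruby")           # "rub" followed by a word char
rub_alone = re.search(r'\brub\B', "rub")             # boundary after "rub", so \B fails

followed = re.search(r'Python(?=!)', "Python!")      # lookahead consumes nothing
not_followed = re.search(r'Python(?!!)', "Python!")

print(line_start_multi.group())   # Python
print(rub_inside.group())         # rub
print(followed.group())           # Python
```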
Special syntax with parentheses:
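Example         Description
R(?#comment)    Matches "R". All the rest is a comment
R(?i)uby        Case-insensitive while matching "uby" in Perl or Ruby. Python requires a global
                flag such as (?i) at the start of the pattern, so use the next form instead
R(?i:uby)       Case-insensitive while matching "uby" only
rub(?:y|le)     Group only without creating \1 backreference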
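A final sketch covering the special parenthesized forms that Python accepts: an inline comment, a scoped case-insensitive group (this form assumes Python 3.6 or later), and a non-capturing group compared with a capturing one. Inputs and names are illustrative.

```python
import re

# (?#...) is skipped by the engine, so this pattern is just "Ruby".
commented = re.match(r'R(?#a comment, ignored by the engine)uby', "Ruby").group()

# (?i:...) ignores case only inside the parentheses.
scoped_upper = re.match(r'R(?i:uby)', "RUBY")
scoped_lower_r = re.match(r'R(?i:uby)', "rUBY")    # the leading R is still case-sensitive

# (?:...) groups without capturing; (...) captures.
non_capturing = re.match(r'rub(?:y|le)', "ruble").groups()
capturing = re.match(r'rub(y|le)', "ruble").groups()

print(commented)              # Ruby
print(scoped_upper.group())   # RUBY
print(scoped_lower_r)         # None
print(non_capturing)          # ()
print(capturing)              # ('le',)
```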