13B RegExp
Modules:
A file containing a set of related functions
Easy to create and use your own modules
First import it: import utils …
Then use dot notation: utils.makeDict()
A quick review – cont’
Recursion:
A function that calls itself
Divide and conquer algorithms
Every recursion must have two key features:
1. There are one or more base cases for which no recursion is applied.
2. All recursion chains eventually end up at one of the base cases.
Examples:
Factorial, string reversal
Binary search
Traversing trees
Merge sort
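The two key features can be seen in a minimal sketch of the first two examples (the function names here are illustrative, not from the course code):

```python
def factorial(n):
    """Return n! for a non-negative integer n."""
    if n <= 1:                     # base case: no recursion applied
        return 1
    return n * factorial(n - 1)    # each call moves closer to the base case

def reverse(s):
    """Return the string s reversed."""
    if len(s) <= 1:                # base case: empty or one-character string
        return s
    return reverse(s[1:]) + s[0]   # shorter string on every call

print(factorial(5))      # 120
print(reverse("regex"))  # xeger
```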
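"abc", '''abc''' and r'abc' all hold the same three characters: a b c
Newlines are a bit more complicated
'abc\n' and "abc\n" hold four characters: a b c plus a newline
'''abc
''' holds the same four characters
r'abc\n' holds five characters: a b c \ n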
Why so many?
' vs " lets you put the other kind of quote inside a string.
Very useful.
''' lets you run a string across multiple lines.
All 3 let you include invisible characters
(using \n, \t, etc.)
r'...' (raw strings) do not interpret \n, \t, etc., so every
backslash stays a literal backslash. Will become
useful very soon.
open('C:\new\text.dat') vs.
open('C:\\new\\text.dat') vs.
open(r'C:\new\text.dat')
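A small sketch of why the first form is a problem: without opening any file, we can inspect what each literal actually contains (the variable names are illustrative).

```python
plain = 'C:\new\text.dat'      # \n becomes a newline, \t becomes a tab
escaped = 'C:\\new\\text.dat'  # each \\ is one literal backslash
raw = r'C:\new\text.dat'       # raw string: backslashes are left alone

print(len(plain))       # 13
print(len(escaped))     # 15
print(escaped == raw)   # True
print('\n' in plain)    # True
```

Only the second and third forms name the intended path.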
String operations
As you recall, the string data type supports a variety of
operations:
>>> my_str = 'tea for too'
>>> print my_str.replace('too','two')
tea for two
>>> my_str.split(' ')
['tea', 'for', 'too']
Regular expressions
Regular expressions (a.k.a. RE, regexp, regexes, regex)
are a highly specialized text-matching tool.
http://docs.python.org/library/re.html
Not only in Python
REs are very widespread:
Unix utility “grep”
Perl
TextWrangler
TextPad
Python
WARNING:
backslash is special in Python strings
It’s special again in RE
This means you need too many backslashes
Use "raw strings" to make things simpler
re.findall(r'[AG]{3,3}CATG[TC]{4,4}[AG]{2,2}C[AT]TG[CT][CG][TC]', myDNA)
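The backslash warning can be checked directly. In this sketch the path and the DNA string are made up for illustration; the DNA string was built by hand so that the pattern above matches exactly once.

```python
import re

# The same regex written two ways: doubled backslashes vs. a raw string.
print('\\d+' == r'\d+')                 # True

# Matching one literal backslash: the regex itself is \\ (two characters).
path = r'C:\new\text.dat'
print(len(re.findall('\\\\', path)))    # 2, written without a raw string
print(len(re.findall(r'\\', path)))     # 2, same regex as a raw string

# Hypothetical DNA string (an assumption, not course data).
myDNA = 'TT' + 'AGACATGTCTCAGCATGCGT' + 'AA'
site = r'[AG]{3,3}CATG[TC]{4,4}[AG]{2,2}C[AT]TG[CT][CG][TC]'
print(re.findall(site, myDNA))          # ['AGACATGTCTCAGCATGCGT']
```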
More examples
>>> re.sub('\d', 'x', 'a_b - 12')
'a_b - xx'
>>> re.sub('\D', 'x', 'a_b - 12')
'xxxxxx12'
>>> re.sub('\s', 'x', 'a_b - 12')
'a_bx-x12'
>>> re.sub('\S', 'x', 'a_b - 12')
'xxx x xx'
>>> re.sub('\w', 'x', 'a_b - 12')
'xxx - xx'
>>> re.sub('\W', 'x', 'a_b - 12')
'a_bxxx12'
>>> re.sub('^', 'x', 'a_b - 12')
'xa_b - 12'
>>> re.sub('$', 'x', 'a_b - 12')
'a_b - 12x'
>>> re.sub('\b', 'x', 'a_b - 12')
'a_b - 12'
>>> re.sub('\\b', 'x', 'a_b - 12')
'xa_bx - x12x'
>>> re.sub(r'\b', 'x', 'a_b - 12')
'xa_bx - x12x'
>>> re.sub('\B', 'x', 'a_b - 12')
'ax_xb x-x 1x2'
RE Semantics
If R, S are regexes:
RS matches the concatenation of strings matched by R, S
individually
R|S matches the union (either R or S)
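A minimal sketch of both rules, using two made-up regexes R and S and a made-up test string:

```python
import re

R = r'[0-9]+'    # hypothetical regex R: one or more digits
S = r'[a-z]+'    # hypothetical regex S: one or more lowercase letters
text = 'abc 123 42xy'

# RS: digits immediately followed by letters
print(re.findall(R + S, text))         # ['42xy']

# R|S: either digits or letters
print(re.findall(R + '|' + S, text))   # ['abc', '123', '42', 'xy']
```

Since | binds more loosely than concatenation, building R|S by pasting strings together is safe only when R and S are simple like these; in general each side may need grouping with (?:...).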
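Matching is greedy
>>> import re
>>> mystring = "This contains 2 files, hw3.py and uppercase.py."
>>> all_matches = re.findall(r'.+\.py', mystring)
>>> print all_matches
['This contains 2 files, hw3.py and uppercase.py']
What happened? We expected:
['hw3.py', 'uppercase.py']
.+ is greedy: it grabs as many characters as it can, so the match
runs from the start of the string to the last ".py".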
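Two possible ways to rein in the match, sketched on the same string. The first makes the quantifier non-greedy, which turns out not to be enough here; the second restricts which characters may appear in the file name.

```python
import re

mystring = "This contains 2 files, hw3.py and uppercase.py."

# Non-greedy quantifier: shorter matches, but each still starts too early.
print(re.findall(r'.+?\.py', mystring))
# ['This contains 2 files, hw3.py', ' and uppercase.py']

# Restricted characters: \w cannot cross spaces or punctuation.
print(re.findall(r'\w+\.py', mystring))
# ['hw3.py', 'uppercase.py']
```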
Code like a pro … Tip of the day
['jht@uw.edu', 'elbo@uw.edu']
Sample problem #2
1. Download and save warandpeace.txt. Write a program
to read it line-by-line. Use re.findall to check whether
the current line contains one or more "proper" names
ending in "...ski". If so, print these names:
['Bolkonski']
['Bolkonski']
['Bolkonski']
['Bolkonski']
['Volkonski']
['Volkonski']
['Volkonski']
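import sys
import re
file_name = sys.argv[1]
file = open(file_name,"r")
# for each line: use re.findall to find names ending in "ski"; print any found
file.close()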
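One way the line-by-line matching could look, as a sketch rather than the official solution. The pattern, the function name, and the sample lines are all assumptions made for illustration; the sample list stands in for the open file.

```python
import re

# One possible pattern (an assumption): a capital letter, then word
# characters, ending in "ski" at a word boundary.
SKI_NAME = re.compile(r'[A-Z]\w*ski\b')

def print_ski_names(lines):
    """Print the list of matching names for every line that has any."""
    for line in lines:
        names = SKI_NAME.findall(line)
        if names:
            print(names)

# Made-up sample lines standing in for the lines of the file.
sample = [
    "Prince Bolkonski entered the room.\n",
    "Nobody spoke for a while.\n",
    "Volkonski and Bolkonski's sister arrived.\n",
]
print_ski_names(sample)
```

Because a file object yields its lines one at a time, the open file can be passed in place of the sample list.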
Solution #2.2
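import sys
import re
file_name = sys.argv[1]
file = open(file_name,"r")
names_dict = {}
for line in file:
    # one possible pattern for capitalized names ending in "ski"
    for name in re.findall(r'[A-Z]\w*ski\b', line):
        names_dict[name] = True   # dictionary keys keep each name once
file.close()
name_list = sorted(names_dict.keys())
print(name_list)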