Commit beb93f6

updated python regex cheatsheet
1 parent 3508896 commit beb93f6

4 files changed, +40 -27 lines changed

_config.yml

+1 -1
@@ -21,7 +21,7 @@ title: "learnbyexample"
2121
description: "Doing is often better than thinking of doing"
2222
#baseurl: # the subpath of your site, e.g. "/blog"
2323
url: "https://learnbyexample.github.io"
24-
logo: "/images/lbe_logo.png"
24+
logo: "/images/lbe.jpg"
2525
date_format: "%B %-d, %Y"
2626
read_time: true
2727
words_per_minute: 100

_posts/2019-09-23-python-regex-cheatsheet.md

+39 -26
@@ -21,18 +21,19 @@ From [docs.python: re](https://docs.python.org/3/library/re.html):
2121

2222
>A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression
2323
24-
This blog post gives an overview and examples of regular expression syntax as implemented by the `re` built-in module (Python 3.7+). Assume ASCII character set unless otherwise specified. This post is an excerpt from my [Python re(gex)?](https://github.com/learnbyexample/py_regular_expressions) book.
24+
This blog post gives an overview and examples of regular expression syntax as implemented by the `re` built-in module (Python 3.8+). Assume ASCII character set unless otherwise specified. This post is an excerpt from my [Python re(gex)?](https://github.com/learnbyexample/py_regular_expressions) book.
2525

2626
## Elements that define a regular expression
2727

2828
| Anchors | Description |
2929
| ------------- | ----------- |
30-
| `\A` | restricts the match to start of string |
31-
| `\Z` | restricts the match to end of string |
32-
| `^` | restricts the match to start of line |
33-
| `$` | restricts the match to end of line |
30+
| `\A` | restricts the match to the start of string |
31+
| `\Z` | restricts the match to the end of string |
32+
| `^` | restricts the match to the start of line |
33+
| `$` | restricts the match to the end of line |
3434
| `\n` | newline character is used as line separator |
35-
| `\b` | restricts the match to start/end of words |
35+
| `re.MULTILINE` or `re.M` | flag to treat input as multiline string |
36+
| `\b` | restricts the match to the start/end of words |
3637
| | word characters: alphabets, digits, underscore |
3738
| `\B` | matches wherever `\b` doesn't match |
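
The anchors in this table, including the newly added `re.MULTILINE` row, can be tried out with a minimal sketch like the one below. The sample strings are borrowed from examples elsewhere in the post; the variable names are assumptions made for illustration only.

```python
import re

text = 'hi hello\ntop spot'

# \A matches only at the very start of the string
start_only = re.findall(r'\A\w+', text)

# without a flag, ^ matches only at the start of the string
no_flag = re.findall(r'^\w+', text)

# with re.M (re.MULTILINE), ^ also matches right after each newline
line_starts = re.findall(r'^\w+', text, flags=re.M)

# with re.M, $ matches before each newline and at the end of the string
line_ends = re.findall(r'\w+$', text, flags=re.M)

# \b restricts the match to word boundaries, \B matches where \b does not
sample = 'par spar apparent spare part'
whole = re.findall(r'\bpar\b', sample)
inside = re.findall(r'\Bpar\B', sample)

print(start_only, no_flag, line_starts, line_ends, whole, inside)
```

Here `inside` picks up the `par` in `apparent` and `spare`, since only those occurrences have word characters on both sides.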
3839

@@ -61,7 +62,7 @@ This blog post gives an overview and examples of regular expression syntax as im
6162
| `pat1.*pat2` | any number of characters between `pat1` and `pat2` |
6263
| `pat1.*pat2|pat2.*pat1` | match both `pat1` and `pat2` in any order |
6364

64-
Greedy here means that the above quantifiers will match as much as possible that'll also honor the overall RE. Appending a `?` to greedy quantifiers makes them non-greedy, i.e. match as minimally as possible. Quantifiers can be applied to literal characters, groups, backreferences and character classes.
65+
Greedy here means that the above quantifiers will match as much as possible that'll also honor the overall RE. Appending a `?` to greedy quantifiers makes them **non-greedy**, i.e. match as *minimally* as possible. Quantifiers can be applied to literal characters, groups, backreferences and character classes.
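
As a rough illustration of the greedy versus non-greedy distinction described in this paragraph, the sketch below reuses the sample sentence from the `re.findall` examples later in the post. The `(?:ab){2}` line is an added assumption showing a quantifier applied to a group.

```python
import re

s = 'that is quite a fabricated tale'

# greedy: .* grabs as much as possible, then backtracks to the last 'a'
greedy = re.findall(r't.*a', s)

# non-greedy: .*? stops at the earliest 'a' that lets the RE succeed
lazy = re.findall(r't.*?a', s)

# quantifier applied to a (non-capturing) group instead of a single character
grouped = re.findall(r'(?:ab){2}', 'ababab')

print(greedy)   # ['that is quite a fabricated ta']
print(lazy)     # ['tha', 't is quite a', 'ted ta']
print(grouped)  # ['abab']
```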
6566

6667
| Character class | Description |
6768
| ------------- | ----------- |
@@ -104,7 +105,7 @@ Greedy here means that the above quantifiers will match as much as possible that
104105
| `(?-flags:pat)` | negate flags only for this `pat` |
105106
| `(?flags-flags:pat)` | apply and negate particular flags only for this `pat` |
106107
| `(?flags)` | apply flags for whole RE, can be used only at start of RE |
107-
| | anchors if any, should be specified after these flags |
108+
| | anchors if any, should be specified after `(?flags)` |
108109

109110
| Matched portion | Description |
110111
| ------------- | ----------- |
@@ -114,6 +115,7 @@ Greedy here means that the above quantifiers will match as much as possible that
114115
| `m.groups()` | tuple of all the capture groups' matched portions |
115116
| `m.span()` | start and end+1 index of entire matched portion |
116117
| | pass a number to get span of that particular capture group |
118+
| | can also use `m.start()` and `m.end()` |
117119
| `\N` | backreference, gives matched portion of *N*th capture group |
118120
| | applies to both search and replacement sections |
119121
| | possible values: `\1`, `\2` up to `\99` provided no more digits |
@@ -124,6 +126,8 @@ Greedy here means that the above quantifiers will match as much as possible that
124126
| | refer as `'name'` in `re.Match` object |
125127
| | refer as `(?P=name)` in search section |
126128
| | refer as `\g<name>` in replacement section |
129+
| `groupdict` | method applied on a `re.Match` object |
130+
| | gives named capture group portions as a `dict` |
127131

128132
`\0` and `\100` onwards are considered as octal values, hence cannot be used as backreferences.
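
The `re.Match` rows touched in this hunk (`m.start()`, `m.end()`, `groupdict`) could be exercised roughly as follows. The input string and the group names `year`, `month` and `w` are assumptions chosen for illustration.

```python
import re

m = re.search(r'(?P<year>\d{4})/(?P<month>\w+)', 'on 2020/04 and 1986/Mar')

whole = m.group(0)            # entire matched portion
span = m.span()               # start and end+1 index of the match
bounds = (m.start(), m.end()) # same values, retrieved separately
month_span = m.span('month')  # span of one particular capture group
named = m.groupdict()         # named capture groups as a dict

# numbered backreference in both the search and replacement sections
dedup = re.sub(r'\b(\w+)( \1)+\b', r'\1', 'aa a a a 42 f_1 f_1 f_13.14')

# named group: (?P=name) in the search section, \g<name> in the replacement
named_dedup = re.sub(r'(?P<w>\w+) (?P=w)', r'\g<w>', 'a a b')

print(whole, span, bounds, month_span, named, dedup, named_dedup)
```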
129133

@@ -136,21 +140,26 @@ Greedy here means that the above quantifiers will match as much as possible that
136140
| | r-strings preferred to define RE |
137141
| | Use byte pattern for byte input |
138142
| | Python also maintains a small cache of recent RE |
143+
| `re.fullmatch` | ensures pattern matches the entire input string |
139144
| `re.compile` | Compile a pattern for reuse, outputs `re.Pattern` object |
140145
| `re.sub` | search and replace |
141146
| `re.sub(r'pat', f, s)` | function `f` with `re.Match` object as argument |
142147
| `re.escape` | automatically escape all metacharacters |
143148
| `re.split` | split a string based on RE |
149+
| | text matched by the groups will be part of the output |
150+
| | portion matched by pattern outside group won't be in output |
144151
| `re.findall` | returns all the matches as a list |
145152
| | if 1 capture group is used, only its matches are returned |
146153
| | 1+, each element will be tuple of capture groups |
154+
| | portion matched by pattern outside group won't be in output |
147155
| `re.finditer` | iterator with `re.Match` object for each match |
148156
| `re.subn` | gives tuple of modified string and number of substitutions |
149157

150158
The function definitions are given below:
151159

152160
```python
153161
re.search(pattern, string, flags=0)
162+
re.fullmatch(pattern, string, flags=0)
154163
re.compile(pattern, flags=0)
155164
re.sub(pattern, repl, string, count=0, flags=0)
156165
re.escape(pattern)
@@ -160,6 +169,8 @@ re.finditer(pattern, string, flags=0)
160169
re.subn(pattern, repl, string, count=0, flags=0)
161170
```
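
A short sketch of how a few of these functions behave, including `re.fullmatch` and the `re.split` capture-group notes added in this commit. All inputs below are made up for illustration.

```python
import re

# re.fullmatch succeeds only if the pattern matches the entire input string
full_ok = bool(re.fullmatch(r'\d+', '2020'))
full_no = bool(re.fullmatch(r'\d+', '2020/04'))

# re.split: without a group, the matched separators are dropped
plain_split = re.split(r'\d+', 'a12b3c')

# with a capture group, the text matched by the group is part of the output
group_split = re.split(r'(\d+)', 'a12b3c')

# the '-' characters are matched outside the group, so they are not in the output
partial_split = re.split(r'-(\d+)-', 'x-1-y-22-z')

# re.subn gives the modified string and the number of substitutions
replaced = re.subn(r'\d', 'X', 'a1b22')

# function as replacement: called with the re.Match object for every match
doubled = re.sub(r'\d+', lambda m: str(int(m[0]) * 2), 'a1b22')

# re.escape backslash-escapes metacharacters so they match literally
escaped = re.escape('a+b')

print(full_ok, full_no, plain_split, group_split, partial_split)
print(replaced, doubled, escaped)
```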
162171

172+
<br>
173+
163174
## Regular expression examples
164175

165176
As a good practice, always use **raw strings** to construct RE, unless other formats are required. This will avoid clash of special meaning of backslash character between RE and normal quoted strings.
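
To make the backslash clash concrete, here is a small sketch comparing a normal string, a raw string and a manually escaped string for the same word-boundary pattern. The input `'par spar'` is an assumption for illustration.

```python
import re

# in a normal string, '\b' is a single backspace character
# in a raw string, r'\b' is a backslash followed by 'b'
normal_len = len('\b')
raw_len = len(r'\b')

# the backspace character is not present in the input, so nothing matches
normal_result = re.findall('\bpar\b', 'par spar')

# the raw string passes \b through to the regex engine as a word boundary
raw_result = re.findall(r'\bpar\b', 'par spar')

# equivalent normal string, with every backslash doubled by hand
escaped_result = re.findall('\\bpar\\b', 'par spar')

print(normal_len, raw_len, normal_result, raw_result, escaped_result)
```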
@@ -197,7 +208,7 @@ True
197208
# string anchors
198209
>>> bool(re.search(r'\Ahi', 'hi hello\ntop spot'))
199210
True
200-
>>> words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']
211+
>>> words = ['surrender', 'up', 'newer', 'do', 'ear', 'eel', 'pest']
201212
>>> [w for w in words if re.search(r'er\Z', w)]
202213
['surrender', 'newer']
203214

@@ -209,7 +220,7 @@ True
209220
* examples for `re.findall`
210221

211222
```python
212-
# match whole word par with optional s at start and optional e at end
223+
# whole word par with optional s at start and optional e at end
213224
>>> re.findall(r'\bs?pare?\b', 'par spar apparent spare part pare')
214225
['par', 'spar', 'spare', 'pare']
215226

@@ -219,15 +230,15 @@ True
219230

220231
# if multiple capturing groups are used, each element of output
221232
# will be a tuple of strings of all the capture groups
222-
>>> re.findall(r'(x*):(y*)', 'xx:yyy x: x:yy :y')
223-
[('xx', 'yyy'), ('x', ''), ('x', 'yy'), ('', 'y')]
233+
>>> re.findall(r'([^/]+)/([^/,]+),?', '2020/04,1986/Mar')
234+
[('2020', '04'), ('1986', 'Mar')]
224235

225236
# normal capture group will hinder ability to get whole match
226237
# non-capturing group to the rescue
227-
>>> re.findall(r'\b\w*(?:st|in)\b', 'cost akin more east run against')
228-
['cost', 'akin', 'east', 'against']
238+
>>> re.findall(r'\b\w*(?:st|in)\b', 'cost akin more east run')
239+
['cost', 'akin', 'east']
229240

230-
# useful for debugging purposes as well before applying substitution
241+
# useful for debugging purposes as well
231242
>>> re.findall(r't.*?a', 'that is quite a fabricated tale')
232243
['tha', 't is quite a', 'ted ta']
233244
```
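
To see why the comments above recommend a non-capturing group, the sketch below compares the capturing and non-capturing versions of the same pattern, and adds `re.finditer` as another way to get the whole match while keeping the capture group. Variable names are assumptions for illustration.

```python
import re

s = 'cost akin more east run'

# capturing group: findall returns only what the group matched
captured = re.findall(r'\b\w*(st|in)\b', s)

# non-capturing group: findall returns the whole matched portion
whole = re.findall(r'\b\w*(?:st|in)\b', s)

# finditer yields re.Match objects, so both pieces remain available
pairs = [(m[0], m[1]) for m in re.finditer(r'\b\w*(st|in)\b', s)]

print(captured)  # ['st', 'in', 'st']
print(whole)     # ['cost', 'akin', 'east']
print(pairs)     # [('cost', 'st'), ('akin', 'in'), ('east', 'st')]
```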
@@ -323,7 +334,7 @@ True
323334
* backreferencing in replacement section
324335

325336
```python
326-
# remove any number of consecutive duplicate words separated by space
337+
# remove consecutive duplicate words separated by space
327338
>>> re.sub(r'\b(\w+)( \1)+\b', r'\1', 'aa a a a 42 f_1 f_1 f_13.14')
328339
'aa a 42 f_1 f_13.14'
329340

@@ -357,22 +368,22 @@ True
357368
```python
358369
# change 'foo' only if it is not followed by a digit character
359370
# note that end of string satisfies the given assertion
360-
# 'foofoo' has two matches as the assertion doesn't consume characters
371+
# foofoo has 2 matches as the assertion doesn't consume characters
361372
>>> re.sub(r'foo(?!\d)', r'baz', 'hey food! foo42 foot5 foofoo')
362373
'hey bazd! foo42 bazt5 bazbaz'
363374

364375
# change whole word only if it is not preceded by : or -
365376
>>> re.sub(r'(?<![:-])\b\w+\b', r'X', ':cart <apple -rest ;tea')
366377
':cart <X -rest ;X'
367378

368-
# extract digits only if it is preceded by - and followed by ; or :
369-
>>> re.findall(r'(?<=-)\d+(?=[:;])', '42 foo-5, baz3; x-83, y-20: f12')
379+
# match digits only if it is preceded by - and followed by ; or :
380+
>>> re.findall(r'(?<=-)\d+(?=[:;])', 'fo-5, ba3; x-83, y-20: f12')
370381
['20']
371382

372-
# words containing all lowercase vowels in any order
373-
>>> words = ['sequoia', 'subtle', 'questionable', 'exhibit', 'equation']
374-
>>> [w for w in words if re.search(r'(?=.*a)(?=.*e)(?=.*i)(?=.*o).*u', w)]
375-
['sequoia', 'questionable', 'equation']
383+
# words containing 'b' and 'e' and 't' in any order
384+
>>> words = ['sequoia', 'questionable', 'exhibit', 'equation']
385+
>>> [w for w in words if re.search(r'(?=.*b)(?=.*e).*t', w)]
386+
['questionable', 'exhibit']
376387

377388
# match if 'do' is not there between 'at' and 'par'
378389
>>> bool(re.search(r'at((?!do).)*par', 'fox,cat,dog,parrot'))
@@ -395,13 +406,15 @@ True
395406
>>> bool(pet.search('A cat crossed their path'))
396407
False
397408

398-
>>> remove_parentheses = re.compile(r'\([^)]*\)')
399-
>>> remove_parentheses.sub('', 'a+b(addition) - foo() + c%d(#modulo)')
409+
>>> pat = re.compile(r'\([^)]*\)')
410+
>>> pat.sub('', 'a+b(addition) - foo() + c%d(#modulo)')
400411
'a+b - foo + c%d'
401-
>>> remove_parentheses.sub('', 'Hi there(greeting). Nice day(a(b)')
412+
>>> pat.sub('', 'Hi there(greeting). Nice day(a(b)')
402413
'Hi there. Nice day'
403414
```
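
The last output above removes `(a(b)` as a single unit because `[^)]*` is allowed to match an opening parenthesis. A hedged sketch of what the compiled pattern matches, next to a variant that excludes both parentheses (the variant `inner` is an assumption, not part of the post):

```python
import re

s = 'Hi there(greeting). Nice day(a(b)'

# pattern from the post: [^)]* can include '(' characters
pat = re.compile(r'\([^)]*\)')
matched = pat.findall(s)

# hypothetical variant: excluding '(' as well matches only innermost pairs
inner = re.compile(r'\([^()]*\)')
inner_matched = inner.findall(s)
inner_removed = inner.sub('', s)

print(matched)        # ['(greeting)', '(a(b)']
print(inner_matched)  # ['(greeting)', '(b)']
print(inner_removed)  # 'Hi there. Nice day(a'
```

Which behavior is preferable depends on whether unbalanced parentheses should be swallowed or left in place.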
404415

416+
<br>
417+
405418
## Python re(gex)? book
406419

407420
Visit my repo [Python re(gex)?](https://github.com/learnbyexample/py_regular_expressions) for details about the book I wrote on Python regular expressions. The ebook uses plenty of examples to explain the concepts from the very beginning and step by step introduces more advanced concepts. The book also covers the [third party module regex](https://pypi.org/project/regex/). The cheatsheet and examples presented in this post are based on contents of this book.

images/lbe.jpg

48.7 KB

images/lbe_logo.png

-132 KB
Binary file not shown.
