_posts/2019-09-23-python-regex-cheatsheet.md
+39-26
@@ -21,18 +21,19 @@ From [docs.python: re](https://docs.python.org/3/library/re.html):
21
21
22
22
>A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression
23
23
24
-
This blog post gives an overview and examples of regular expression syntax as implemented by the `re` built-in module (Python 3.7+). Assume ASCII character set unless otherwise specified. This post is an excerpt from my [Python re(gex)?](https://github.com/learnbyexample/py_regular_expressions) book.
24
+
This blog post gives an overview and examples of regular expression syntax as implemented by the `re` built-in module (Python 3.8+). Assume ASCII character set unless otherwise specified. This post is an excerpt from my [Python re(gex)?](https://github.com/learnbyexample/py_regular_expressions) book.
25
25
26
26
## Elements that define a regular expression
27
27
28
28
| Anchors | Description |
29
29
| ------------- | ----------- |
30
-
|`\A`| restricts the match to start of string |
31
-
|`\Z`| restricts the match to end of string |
32
-
|`^`| restricts the match to start of line |
33
-
|`$`| restricts the match to end of line |
30
+
|`\A`| restricts the match to the start of string |
31
+
|`\Z`| restricts the match to the end of string |
32
+
|`^`| restricts the match to the start of line |
33
+
|`$`| restricts the match to the end of line |
34
34
|`\n`| newline character is used as line separator |
35
-
|`\b`| restricts the match to start/end of words |
35
+
|`re.MULTILINE` or `re.M`| flag to treat input as multiline string |
36
+
|`\b`| restricts the match to the start/end of words |
36
37
|| word characters: letters, digits, underscore |
37
38
|`\B`| matches wherever `\b` doesn't match |
38
39
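As a rough illustration of the anchors table, here is a minimal sketch contrasting string anchors, line anchors with `re.M`, and `\b` versus `\B`. The sample strings are only for demonstration.

```python
import re

s = 'hi hello\ntop spot'

# without re.M, ^ and $ only match at the start/end of the whole string
first_word = re.findall(r'^\w+', s)                     # ['hi']
last_word = re.findall(r'\w+$', s)                      # ['spot']

# with re.M, ^ and $ also match at the start/end of every line
line_starts = re.findall(r'^\w+', s, flags=re.M)        # ['hi', 'top']
line_ends = re.findall(r'\w+$', s, flags=re.M)          # ['hello', 'spot']

# \b restricts the match to word boundaries, \B to everywhere else
words = 'par spar apparent spare part'
whole = re.sub(r'\bpar\b', 'X', words)   # 'X spar apparent spare part'
inside = re.sub(r'\Bpar\B', 'X', words)  # 'par spar apXent sXe part'
```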
@@ -61,7 +62,7 @@ This blog post gives an overview and examples of regular expression syntax as im
61
62
|`pat1.*pat2`| any number of characters between `pat1` and `pat2`|
62
63
| `pat1.*pat2|pat2.*pat1` | match both `pat1` and `pat2` in any order |
63
64
64
-
Greedy here means that the above quantifiers will match as much as possible that'll also honor the overall RE. Appending a `?` to greedy quantifiers makes them non-greedy, i.e. match as minimally as possible. Quantifiers can be applied to literal characters, groups, backreferences and character classes.
65
+
Greedy here means that the above quantifiers will match as much as possible that'll also honor the overall RE. Appending a `?` to greedy quantifiers makes them **non-greedy**, i.e. match as *minimally* as possible. Quantifiers can be applied to literal characters, groups, backreferences and character classes.
65
66
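To make the greedy versus non-greedy difference concrete, here is a small sketch using `re.sub`. The input string is made up for illustration.

```python
import re

s = 'green:3.14:teal::brown:oh!:blue'

# greedy: .* grabs as much as possible, then gives back only what is
# needed for the final ':' to match, so the match ends at the last ':'
greedy = re.sub(r':.*:', 'X', s)

# non-greedy: .*? matches as little as possible, so each match ends
# at the nearest following ':'
non_greedy = re.sub(r':.*?:', 'X', s)

print(greedy)      # greenXblue
print(non_greedy)  # greenXtealXbrownXblue
```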
66
67
| Character class | Description |
67
68
| ------------- | ----------- |
@@ -104,7 +105,7 @@ Greedy here means that the above quantifiers will match as much as possible that
104
105
|`(?-flags:pat)`| negate flags only for this `pat`|
105
106
|`(?flags-flags:pat)`| apply and negate particular flags only for this `pat`|
106
107
|`(?flags)`| apply flags for whole RE, can be used only at start of RE |
107
-
|| anchors if any, should be specified after these flags |
108
+
|| anchors if any, should be specified after `(?flags)`|
108
109
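The inline flag forms can be sketched as follows, using `i` (the short form of `re.IGNORECASE`) as the example flag. The word list is an assumption chosen to show the differences.

```python
import re

words = ['Cat', 'cot', 'CATER', 'SCat', 'ScUtTLe']

# (?i) at the start applies the flag to the whole RE;
# the \A anchor is specified after the flag
whole = [w for w in words if re.search(r'(?i)\Ac.t', w)]

# (?i:pat) applies the flag only to 'c'; 'at' stays case sensitive
partial = [w for w in words if re.search(r'(?i:c)at', w)]

# (?-i:pat) negates the flag only for 'C', even though re.I is passed
negated = [w for w in words if re.search(r'(?-i:C)at', w, flags=re.I)]

print(whole)    # ['Cat', 'cot', 'CATER']
print(partial)  # ['Cat', 'SCat']
print(negated)  # ['Cat', 'CATER', 'SCat']
```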
109
110
| Matched portion | Description |
110
111
| ------------- | ----------- |
@@ -114,6 +115,7 @@ Greedy here means that the above quantifiers will match as much as possible that
114
115
|`m.groups()`| tuple of all the capture groups' matched portions |
115
116
|`m.span()`| start and end+1 index of entire matched portion |
116
117
|| pass a number to get span of that particular capture group |
118
+
|| can also use `m.start()` and `m.end()`|
117
119
|`\N`| backreference, gives matched portion of *N*th capture group |
118
120
|| applies to both search and replacement sections |
119
121
|| possible values: `\1`, `\2` up to `\99`, provided no digits follow |
@@ -124,6 +126,8 @@ Greedy here means that the above quantifiers will match as much as possible that
124
126
|| refer as `'name'` in `re.Match` object |
125
127
|| refer as `(?P=name)` in search section |
126
128
|| refer as `\g<name>` in replacement section |
129
+
|`groupdict`| method applied on a `re.Match` object |
130
+
|| gives named capture group portions as a `dict`|
127
131
128
132
`\0` and `\100` onwards are considered as octal values, hence cannot be used as backreferences.
129
133
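A short sketch of backreferences and named groups in action. The sample inputs are illustrative, and the `\g<1>` form comes from the `re` documentation rather than from the table rows shown here.

```python
import re

# \N in the replacement section: swap the fields around the comma
swapped = re.sub(r'(\w+),(\w+)', r'\2,\1', 'good,bad 42,24')
# 'bad,good 24,42'

# \N in the search section: words with a consecutive repeated character
repeated = [w for w in ['effort', 'flee', 'facade', 'rat', 'tool']
            if re.search(r'(\w)\1', w)]
# ['effort', 'flee', 'tool']

# \g<1> avoids ambiguity when a digit follows the backreference,
# since r'\10' would be read as a reference to group 10
padded = re.sub(r'(\d+)', r'\g<1>0', 'a1 b22')
# 'a10 b220'

# named groups: groupdict() on the match, \g<name> in the replacement
pat = r'(?P<key>\w+)=(?P<val>\w+)'
pair = re.search(pat, 'mode=fast').groupdict()
flipped = re.sub(pat, r'\g<val>=\g<key>', 'mode=fast')
# {'key': 'mode', 'val': 'fast'} and 'fast=mode'
```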
@@ -136,21 +140,26 @@ Greedy here means that the above quantifiers will match as much as possible that
136
140
|| r-strings preferred to define RE |
137
141
|| Use byte pattern for byte input |
138
142
|| Python also maintains a small cache of recent RE |
143
+
|`re.fullmatch`| ensures pattern matches the entire input string |
139
144
|`re.compile`| Compile a pattern for reuse, outputs `re.Pattern` object |
140
145
|`re.sub`| search and replace |
141
146
|`re.sub(r'pat', f, s)`| function `f` with `re.Match` object as argument |
142
147
|`re.escape`| automatically escape all metacharacters |
143
148
|`re.split`| split a string based on RE |
149
+
|| text matched by the groups will be part of the output |
150
+
|| portion matched by pattern outside group won't be in output |
144
151
|`re.findall`| returns all the matches as a list |
145
152
|| if 1 capture group is used, only its matches are returned |
146
153
|| if more than 1 capture group is used, each element will be a tuple of capture groups |
154
+
|| portion matched by pattern outside group won't be in output |
147
155
|`re.finditer`| iterator with `re.Match` object for each match |
148
156
|`re.subn`| gives tuple of modified string and number of substitutions |
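The effect of capture groups on `re.split` and `re.findall` output is easier to see with an example. This is a minimal sketch with made-up input strings.

```python
import re

s = 'Sample123string42with777numbers'

# no capture group: the separators are dropped
plain = re.split(r'\d+', s)
# ['Sample', 'string', 'with', 'numbers']

# capture group: text matched by the group is part of the output
kept = re.split(r'(\d+)', s)
# ['Sample', '123', 'string', '42', 'with', '777', 'numbers']

# digits matched outside the group are not in the output
part = re.split(r'\d(\d)\d*', s)
# ['Sample', '2', 'string', '2', 'with', '7', 'numbers']

dates = '2020/04/25,1986/Mar/02'

# one capture group: only the group's matches are returned
years = re.findall(r'(\d+)/\w+/', dates)
# ['2020', '1986']

# more than one capture group: each element is a tuple
fields = re.findall(r'(\d+)/(\w+)/(\d+)', dates)
# [('2020', '04', '25'), ('1986', 'Mar', '02')]
```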
As a good practice, always use **raw strings** to construct RE, unless other formats are required. This avoids a clash between the special meaning of the backslash character in RE and its meaning in normal quoted strings.
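Here is a quick sketch of what can go wrong without raw strings, using `\b` as the example since it means backspace in a normal string and word boundary in RE. The input string is only for illustration.

```python
import re

# in a normal string, '\b' is a single backspace character;
# in a raw string, it stays as backslash followed by 'b'
assert len('\b') == 1 and len(r'\b') == 2

s = 'par spar apparent'

# normal string: the RE looks for backspace + 'par' + backspace
normal = re.findall('\bpar\b', s)

# raw string: the RE sees \b and treats it as a word boundary
raw = re.findall(r'\bpar\b', s)

print(normal)  # []
print(raw)     # ['par']
```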
@@ -197,7 +208,7 @@ True
197
208
# string anchors
198
209
>>> bool(re.search(r'\Ahi', 'hi hello\ntop spot'))
199
210
True
200
-
>>> words = ['surrender', 'unicorn', 'newer', 'door', 'empty', 'eel', 'pest']
211
+
>>> words = ['surrender', 'up', 'newer', 'do', 'ear', 'eel', 'pest']
201
212
>>> [w for w in words if re.search(r'er\Z', w)]
202
213
['surrender', 'newer']
203
214
@@ -209,7 +220,7 @@ True
209
220
* examples for `re.findall`
210
221
211
222
```python
212
-
#match whole word par with optional s at start and optional e at end
223
+
# whole word par with optional s at start and optional e at end
213
224
>>> re.findall(r'\bs?pare?\b', 'par spar apparent spare part pare')
214
225
['par', 'spar', 'spare', 'pare']
215
226
@@ -219,15 +230,15 @@ True
219
230
220
231
# if multiple capturing groups are used, each element of output
221
232
# will be a tuple of strings of all the capture groups
Visit my repo [Python re(gex)?](https://github.com/learnbyexample/py_regular_expressions) for details about the book I wrote on Python regular expressions. The ebook uses plenty of examples to explain the concepts from the very beginning and step by step introduces more advanced concepts. The book also covers the [third party module regex](https://pypi.org/project/regex/). The cheatsheet and examples presented in this post are based on contents of this book.