Regular Expressions in Python
Introduction
Regular Expressions (Regex) are patterns used to match character combinations in strings.
They are an essential tool for processing text and data across various domains. In web
development, regex is used to validate user inputs like email addresses or passwords. Data
scientists and analysts use regex to clean, transform, and extract meaningful patterns from
raw datasets. Similarly, regex plays a critical role in parsing log files, identifying errors in
large datasets, and extracting specific information from documents or web pages. Its
versatility makes it a foundational skill for software engineers, data professionals, and system
administrators alike. Regular expressions are widely used for:
• Text validation (e.g., email validation)
• Searching within text
• Text manipulation (e.g., replacing patterns)
• Parsing complex datasets (e.g., logs, HTML, or CSV files)
Python provides the re module to work with regular expressions. This document explains
regex concepts with examples and outputs to help beginners understand and apply regex
effectively.
Basics of Regular Expressions
Raw String Literals in Python (r prefix)
• Raw strings in Python (e.g., r"\d") treat backslashes literally, which simplifies regex
patterns. Without the r prefix, each backslash must be doubled (e.g., "\\d").
• Example:
import re
pattern = r"\d"
string = "abc123"
result = re.search(pattern, string)
print(result.group()) # Output: 1
Key Functions in the re Module
1. re.match()
Matches a pattern at the start of the string.
import re
result = re.match(r'\d+', '123abc')
if result:
    print(result.group())  # Output: 123
2. re.search()
Searches for the first occurrence of a pattern anywhere in the string.
result = re.search(r'\d+', 'abc123def')
if result:
    print(result.group())  # Output: 123
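To contrast re.match with re.search, the sketch below runs both on the same string. The sample string and variable names are assumptions chosen for illustration.

```python
import re

text = "abc123def"  # hypothetical sample input

# re.match only tries position 0, where "a" is not a digit
at_start = re.match(r"\d+", text)
# re.search scans forward until it finds the first run of digits
anywhere = re.search(r"\d+", text)

print(at_start)          # None
print(anywhere.group())  # 123
```

Because re.match can return None here, checking the result before calling .group() matters for both functions.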
3. re.findall()
Returns all occurrences of a pattern as a list.
result = re.findall(r'\d+', 'abc123def456')
print(result) # Output: ['123', '456']
4. re.split()
Splits a string by occurrences of a pattern.
result = re.split(r'\d+', 'abc123def456')
print(result) # Output: ['abc', 'def', '']
5. re.sub()
Replaces occurrences of a pattern with a replacement string.
result = re.sub(r'\s+', '-', 'This is a test')
print(result) # Output: This-is-a-test
6. re.compile()
Creates a reusable regex pattern object for efficiency.
pattern = re.compile(r'\d+')
result = pattern.findall('123abc456')
print(result) # Output: ['123', '456']
What does re.compile do?
When you use re.compile, it "prepares" (or compiles) your regular expression into a reusable
pattern object. This object can then be used multiple times for different operations like finding
matches, replacing text, etc., without passing the pattern string again each time.
Without re.compile, module-level functions such as re.findall or re.search must turn the
pattern string into a compiled pattern on every call. The re module keeps a cache of recently
compiled patterns, so this is usually a cache lookup rather than a full recompilation, but a
precompiled object skips even that step when the same pattern is used many times.
Key Benefits of re.compile:
1. Improved performance: The pattern is compiled once and reused, saving time if you
use it repeatedly.
2. Better readability: The pattern is defined and reused in a clear, structured way.
Example Without re.compile:
Imagine you need to find and replace all numbers in multiple strings. Without re.compile,
you'll repeatedly pass the pattern to the re functions:
import re
strings = ["abc123", "456def", "ghi789"]
for s in strings:
    # Find all numbers in each string
    matches = re.findall(r'\d+', s)
    print(f"Numbers in '{s}': {matches}")
Output:
Numbers in 'abc123': ['123']
Numbers in '456def': ['456']
Numbers in 'ghi789': ['789']
Here, the pattern r'\d+' is passed to re.findall on every call, so the re module has to look it
up in its cache of compiled patterns (or compile it) each time.
Example With re.compile:
If you use re.compile, the pattern is prepared once and reused for each string:
import re
# Compile the pattern once
pattern = re.compile(r'\d+')
strings = ["abc123", "456def", "ghi789"]
for s in strings:
    # Use the compiled pattern to find numbers
    matches = pattern.findall(s)
    print(f"Numbers in '{s}': {matches}")
Output:
Numbers in 'abc123': ['123']
Numbers in '456def': ['456']
Numbers in 'ghi789': ['789']
What’s the Difference?
• Without re.compile: The pattern string is passed on every call, and re.findall looks up (or
builds) the compiled pattern each time.
• With re.compile: The pattern is compiled once, and the resulting object is reused for all
operations.
If you're working with the pattern only once, re.compile doesn't make a noticeable difference.
However, if the pattern is reused many times (e.g., in a loop or across different parts of
your code), using re.compile avoids the repeated lookup and gives the pattern a single, named
definition, which makes your code more readable.
Summary:
• re.compile is useful when you use the same pattern repeatedly.
• It saves time by compiling the pattern once and allows you to use the compiled object
for all regex operations.
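As a brief sketch of that reuse, one compiled pattern object can serve several different operations. The sample text and variable names here are assumptions for illustration.

```python
import re

number = re.compile(r"\d+")        # compiled once
line = "order 66 shipped 3 items"  # hypothetical sample text

first = number.search(line).group()  # first run of digits
all_numbers = number.findall(line)   # every run of digits
masked = number.sub("#", line)       # replace each run with "#"

print(first)        # 66
print(all_numbers)  # ['66', '3']
print(masked)       # order # shipped # items
```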
Common Regex Patterns and Characters
Table 1: Basic Characters and Character Classes
Character/Pattern | Description | Example | Matches / Fails to Match
. | Matches any character except a newline. | "a.b" | Matches: "acb", "a1b"; Fails: "ab", "a\nb"
\w | Matches any word character (letters, digits, and underscore: [a-zA-Z0-9_]). | "\w+" | Matches: "hello", "Python_123"; Fails: "hello!", "123$"
\W | Matches any non-word character. | "\W+" | Matches: "!!!", "#@$%"; Fails: "abc123", "hello"
\d | Matches any digit ([0-9]). | "\d{3}" | Matches: "123", "456"; Fails: "12", "abc"
\D | Matches any non-digit character. | "\D+" | Matches: "hello", "abc!"; Fails: "1234", "567"
\s | Matches any whitespace character (space, tab, newline). | "\s+" | Matches: " ", "\t"; Fails: "abc", "123"
\S | Matches any non-whitespace character. | "\S+" | Matches: "hello123", "abc!"; Fails: " ", "\n"
Note: these results hold when the whole string is tested against the pattern, as with re.fullmatch.
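A few rows of the table above can be checked directly. This is a minimal sketch that assumes the whole string is tested against the pattern with re.fullmatch; the rows list and its layout are illustrative choices.

```python
import re

# (pattern, string, expected result) for a few rows of the table
rows = [
    (r"a.b", "acb", True),
    (r"a.b", "a\nb", False),   # "." does not match a newline by default
    (r"\w+", "Python_123", True),
    (r"\w+", "hello!", False),
    (r"\d{3}", "123", True),
    (r"\d{3}", "12", False),
    (r"\s+", " ", True),
    (r"\S+", "abc!", True),
]

# True when the entire string matches the pattern
results = [re.fullmatch(p, s) is not None for p, s, _ in rows]

for (p, s, expected), got in zip(rows, results):
    print(f"{p!r} on {s!r}: got {got}, expected {expected}")
```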
Anchors and Special Characters
Character/Pattern | Description | Example | Matches / Fails to Match
^ | Anchors the pattern to the start of the string. | "^hello" | Matches: "hello world"; Fails: "world hello", "abc hello"
$ | Anchors the pattern to the end of the string. | "world$" | Matches: "hello world"; Fails: "world hello", "hello"
\b | Matches a word boundary. | "\bcat\b" | Matches: "cat", "a cat"; Fails: "catalog", "scattered"
Note: these results hold when the pattern is searched for anywhere in the string, as with re.search.
Quantifiers
Character/Pattern | Description | Example | Matches / Fails to Match
* | Matches zero or more repetitions of the preceding element. | "a*b" | Matches: "b", "ab", "aaab"; Fails: "cab", "c"
+ | Matches one or more repetitions of the preceding element. | "a+b" | Matches: "ab", "aaab"; Fails: "b", "c"
{n} | Matches exactly n repetitions. | "a{3}" | Matches: "aaa"; Fails: "aa", "aaaa"
{n,} | Matches at least n repetitions. | "a{2,}" | Matches: "aa", "aaa"; Fails: "a"
{n,m} | Matches between n and m repetitions. | "a{1,3}" | Matches: "a", "aa", "aaa"; Fails: "aaaa"
Note: these results hold when the whole string is tested against the pattern, as with re.fullmatch.
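The quantifier rows above can be reproduced with a short script. This sketch assumes whole-string matching via re.fullmatch; the helper name whole is hypothetical.

```python
import re

def whole(pattern, s):
    """Hypothetical helper: True when the entire string matches the pattern."""
    return re.fullmatch(pattern, s) is not None

star = [whole(r"a*b", s) for s in ("b", "ab", "aaab", "cab")]
plus = [whole(r"a+b", s) for s in ("ab", "aaab", "b")]
exact = [whole(r"a{3}", s) for s in ("aaa", "aa", "aaaa")]
at_least = [whole(r"a{2,}", s) for s in ("aa", "aaa", "a")]
between = [whole(r"a{1,3}", s) for s in ("a", "aa", "aaa", "aaaa")]

print(star)      # [True, True, True, False]
print(plus)      # [True, True, False]
print(exact)     # [True, False, False]
print(at_least)  # [True, True, False]
print(between)   # [True, True, True, False]
```

With re.search instead of re.fullmatch, strings such as "cab" or "aaaa" would succeed, because a matching substring exists inside them.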
Examples of Practical Applications
Email Validation
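A minimal sketch of email validation follows. The pattern is a deliberately simplified assumption, not a complete check of the email address standard, and the names EMAIL_RE and is_valid_email are hypothetical.

```python
import re

# Simplified pattern (an assumption for illustration):
# local part, "@", domain, and a top-level domain of two or more letters.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_valid_email(address):
    """Return True when the whole string looks like an email address."""
    return EMAIL_RE.fullmatch(address) is not None

print(is_valid_email("user@example.com"))  # True
print(is_valid_email("user@example"))      # False
print(is_valid_email("user example.com"))  # False
```

Using fullmatch rather than search ensures that extra text before or after the address causes the check to fail.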
Extracting Data
Example 1: Extract Numbers
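One possible sketch of number extraction uses re.findall with \d+ and converts the results to integers. The sample sentence is an assumption.

```python
import re

text = "Order 66 shipped 3 items on day 15"  # hypothetical sample text

# findall returns strings, so convert each match to int
numbers = [int(n) for n in re.findall(r"\d+", text)]
print(numbers)  # [66, 3, 15]
```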
Example 2: Extract Email Addresses
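A hedged sketch of extracting email addresses from free text is shown here. The pattern is a simplified assumption and the sample text is invented for illustration.

```python
import re

text = "Contact alice@example.com or bob.smith@example.org for details."

# Simplified email pattern (an assumption, not a full standard-compliant check)
emails = re.findall(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", text)
print(emails)  # ['alice@example.com', 'bob.smith@example.org']
```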
Example 3: Validate Mobile Numbers
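Mobile number formats vary by country, so the sketch below assumes the simplest rule: exactly 10 digits with no separators. The names MOBILE_RE and is_valid_mobile are hypothetical, and the rule would need adjusting for a real numbering plan.

```python
import re

MOBILE_RE = re.compile(r"\d{10}")  # assumption: exactly 10 digits, no separators

def is_valid_mobile(number):
    """Return True when the whole string is exactly 10 digits."""
    return MOBILE_RE.fullmatch(number) is not None

print(is_valid_mobile("9876543210"))   # True
print(is_valid_mobile("98765"))        # False
print(is_valid_mobile("98765-43210"))  # False
```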
Example 4: Extract Hours from Timestamps
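Assuming timestamps in HH:MM:SS form, a capturing group around the first two digits lets re.findall return only the hours. The sample log line is an assumption.

```python
import re

log = "Login at 09:15:32, logout at 17:48:05"  # hypothetical sample text

# Only the parenthesized part (the hour) is returned by findall
hours = re.findall(r"\b(\d{2}):\d{2}:\d{2}\b", log)
print(hours)  # ['09', '17']
```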
Example 5: Extract Specific Data from Text
Using Regex Flags
Multi-line Matching
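A short sketch of the re.MULTILINE flag follows, using an invented three-line string. With the flag, ^ matches at the start of every line rather than only at the start of the string.

```python
import re

text = "alpha one\nbeta two\nalpha three"  # hypothetical sample text

without_flag = re.findall(r"^alpha", text)
with_flag = re.findall(r"^alpha", text, flags=re.MULTILINE)

print(without_flag)  # ['alpha']
print(with_flag)     # ['alpha', 'alpha']
```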
Case-Insensitive Matching
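A short sketch of the re.IGNORECASE flag follows, using an invented sample string. The flag lets a lowercase pattern match any capitalization.

```python
import re

text = "Python, PYTHON, and python"  # hypothetical sample text

matches = re.findall(r"python", text, flags=re.IGNORECASE)
print(matches)  # ['Python', 'PYTHON', 'python']
```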
Advanced Regex Techniques
Groups and Alternation
The following regex components are explained below, each with an example:
1. | (Either or):
The pipe | is used to match either of two or more options.
Example:
• Explanation: The pattern matches either "falls" or "stays" in the input text.
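The alternation described above can be sketched as follows. The sample sentence is an assumption.

```python
import re

text = "The rain in Spain falls mainly in the plain."  # hypothetical sample text

# Match either "falls" or "stays"
matches = re.findall(r"falls|stays", text)
print(matches)  # ['falls']
```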
2. () (Capture and group):
Parentheses are used to group parts of a pattern and capture them as separate groups.
Example:
• Explanation:
o (rain|sun) captures "rain" or "sun".
o (falls|stays) captures "falls" or "stays".
o The result is a list of tuples containing the captured groups.
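A sketch of the two capturing groups described above is shown here. The sample sentence and the single space between the groups are assumptions.

```python
import re

text = "The rain falls in Spain, but the sun stays in Rome."  # hypothetical

# Two groups, so findall returns one tuple per match
pairs = re.findall(r"(rain|sun) (falls|stays)", text)
print(pairs)  # [('rain', 'falls'), ('sun', 'stays')]
```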
3. [] (Set of characters):
Square brackets [] define a set of characters to match.
Example:
• Explanation: The pattern [a-z] matches any lowercase letter from 'a' to 'z'. Each match
is returned as a separate element in the list.
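The character set described above can be sketched with a short, invented input. Uppercase letters and spaces fall outside [a-z] and are skipped.

```python
import re

text = "Hi Bob"  # hypothetical sample text

letters = re.findall(r"[a-z]", text)
print(letters)  # ['i', 'o', 'b']
```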
4. \ (Special sequence):
The backslash \ is used to escape special characters or represent special sequences.
Example:
• Explanation: The pattern \d matches any digit (0–9). Here, it finds all the digits in the
input text.
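A sketch of the \d special sequence described above follows. The sample text is an assumption.

```python
import re

text = "Room 42, floor 7"  # hypothetical sample text

# \d matches one digit at a time, so "42" yields two separate matches
digits = re.findall(r"\d", text)
print(digits)  # ['4', '2', '7']
```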
Another Example (Escaping Special Characters):
• Explanation: The backslash \ escapes the special meaning of |, treating it as a literal
character to match.
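The escaping described above can be sketched with a pipe-delimited string, which is an assumed sample. Without the backslash, | would be read as alternation.

```python
import re

text = "name|age|city"  # hypothetical sample text

pipes = re.findall(r"\|", text)  # literal "|" characters
fields = re.split(r"\|", text)   # split on the literal "|"

print(pipes)   # ['|', '|']
print(fields)  # ['name', 'age', 'city']
```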
The .group() method in Python regular expressions is used to extract the part of the string
that matches the pattern or a specific group within the match.
Syntax of .group()
match.group([group_number])
• group_number (optional):
o If not specified (i.e., .group()), it returns the entire match.
o group(0): Returns the entire match (same as .group()).
o group(n): Returns the text matched by the n-th capturing group (inside
parentheses).
Example 1: Using .group() to Return the Entire Match
• Explanation:
o The pattern \d+ matches one or more digits.
o .search() finds the first match ("123").
o .group() returns the entire match.
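A minimal sketch of the steps above follows. The input string is an assumption chosen so that the first match is "123".

```python
import re

match = re.search(r"\d+", "I have 123 apples")  # hypothetical input
if match:
    print(match.group())  # 123
```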
Example 2: Using .group(n) to Access Capturing Groups
• Explanation:
o The parentheses () create a capturing group for the digits (\d+).
o .group(1) returns the content of the first capturing group (the digits "123").
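A sketch of a single capturing group follows. The pattern and input string are assumptions.

```python
import re

match = re.search(r"(\d+) apples", "I have 123 apples")  # hypothetical input
if match:
    print(match.group(0))  # 123 apples
    print(match.group(1))  # 123
```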
Example 3: Multiple Capturing Groups
• Explanation:
o (123) is captured in group 1.
o (apples) is captured in group 2.
o .group(0) always returns the full match.
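A sketch with two capturing groups follows. The pattern and input string are assumptions chosen to produce "123" and "apples".

```python
import re

match = re.search(r"(\d+) (\w+)", "I have 123 apples")  # hypothetical input
if match:
    print(match.group(0))  # 123 apples
    print(match.group(1))  # 123
    print(match.group(2))  # apples
```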
Example 4: Named Groups
You can assign names to groups and access them with .group('name').
• Explanation:
o (?P<number>\d+) names the first group "number".
o (?P<item>\w+) names the second group "item".
o .group('name') retrieves the content of the named group.
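A sketch of the named groups described above follows. The input string is an assumption. The groupdict() method, which returns all named groups as a dictionary, is shown as an additional convenience.

```python
import re

match = re.search(r"(?P<number>\d+) (?P<item>\w+)", "I have 123 apples")
if match:
    print(match.group("number"))  # 123
    print(match.group("item"))    # apples
    print(match.groupdict())      # {'number': '123', 'item': 'apples'}
```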
What Happens If There’s No Match?
If there’s no match, re.search (or re.match) returns None, and calling .group() on None raises
an AttributeError. To avoid this, always check that the match is not None before using .group().
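A minimal sketch of that check follows, using an input with no digits. The input string and the result variable are assumptions.

```python
import re

match = re.search(r"\d+", "no digits here")  # hypothetical input

# Guard against None before calling .group()
if match is not None:
    result = match.group()
else:
    result = None

print(result)  # None
```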
Summary:
• .group() returns the entire match.
• .group(n) returns the n-th capturing group.
• Named groups allow you to retrieve specific parts of the match using names.