REGULAR EXPRESSIONS (REGEX) IN PYTHON:
Regular Expressions (RegEx) are a powerful tool for pattern matching and text manipulation. In Python, regex
functionality is implemented through the re module.
APPLICATIONS OF REGEX
● Data validation
● Data extraction
● Input sanitization (data cleaning)
This document covers regex basics, syntax, the main functions of the re module, and practical examples.
What is a Regular Expression?
A Regular Expression is a sequence of characters that defines a search pattern. It can be used to match strings,
validate formats, or extract information.
COMMON USE CASES OF REGEX (SOME ARE COVERED IN DETAIL IN THIS ARTICLE):
● Extracting email addresses
● Extracting timestamps from logs
● Extracting URLs
● Validating phone numbers or dates
● Searching for words or patterns in text
● Validating passwords
Regex Syntax in Python
To use regex, you define a pattern made up of literal characters, special characters, and sequences that together
describe what to look for in a text.
Here are some of the most common components of regex syntax:
1. SPECIAL CHARACTERS
Character Description
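. Matches any single character except a newline.
^ Matches the start of the string.
$ Matches the end of the string.
* Matches 0 or more repetitions of the preceding element.
+ Matches 1 or more repetitions of the preceding element.
? Matches 0 or 1 occurrence of the preceding element.
{n} Matches exactly n occurrences of the preceding element.
{n,} Matches n or more occurrences of the preceding element.
{n,m} Matches between n and m occurrences of the preceding element.
\ Escapes a special character (for example, \. matches a literal dot) or introduces a special sequence such as \d.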
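The symbols in this table can be tried out with a few one-line calls. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not from the original article.

```python
import re

# Hypothetical sample strings, chosen only to illustrate each symbol.
any_char = re.findall(r"c.t", "cat cot cut ct")           # . needs exactly one character between c and t
zero_or_more = re.findall(r"ab*", "a ab abbb")            # * allows zero or more b's after a
one_or_more = re.findall(r"ab+", "a ab abbb")             # + needs at least one b after a
optional_u = re.findall(r"colou?r", "color colour")       # ? makes the u optional
two_to_three = re.findall(r"\d{2,3}", "7 42 123 4567")    # {2,3} takes two or three digits at a time
starts_hello = bool(re.search(r"^Hello", "Hello world"))  # ^ anchors the match at the start
ends_world = bool(re.search(r"world$", "Hello world"))    # $ anchors the match at the end
literal_dot = re.findall(r"3\.14", "3.14 3x14")           # \. matches only a real dot

print(any_char)       # ['cat', 'cot', 'cut']
print(zero_or_more)   # ['a', 'ab', 'abbb']
print(one_or_more)    # ['ab', 'abbb']
print(optional_u)     # ['color', 'colour']
print(two_to_three)   # ['42', '123', '456']
print(starts_hello, ends_world)  # True True
print(literal_dot)    # ['3.14']
```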
Created by: Anjali Garg | Data Scientist | Aspiring ML Engineer | https://www.linkedin.com/in/anjali-garg-2a7747222/
2. CHARACTER CLASSES
Syntax Description
[arn] Matches one character that is a, r, or n.
[a-n] Matches any lowercase letter from a to n.
[^arn] Matches any character except a, r, or n.
[0123] Matches any one of the digits 0, 1, 2, or 3.
[0-9] Matches any digit from 0 to 9.
[0-5][0-9] Matches any two-digit number from 00 to 59.
[a-zA-Z] Matches any letter, lowercase or uppercase.
[+] Inside a set, characters such as +, *, ., |, (, ), $, and { } have no special meaning, so [+] matches a literal '+' character.
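A short sketch of how these sets behave in practice. The sample sentence and variable names are hypothetical and serve only to show which characters each set picks up.

```python
import re

# Hypothetical sample sentence for illustration.
text = "Train 7 departs at 09:45 from platform 3+"

lower_a_to_n = re.findall(r"[a-n]", "banana")   # every letter of "banana" lies between a and n
not_arn = re.findall(r"[^arn]", "rain barn")    # everything except a, r, and n
minutes = re.findall(r"[0-5][0-9]", text)       # two-digit numbers from 00 to 59
letters = re.findall(r"[a-zA-Z]+", text)        # runs of letters
plus = re.findall(r"[+]", text)                 # + is a literal character inside a set

print(lower_a_to_n)  # ['b', 'a', 'n', 'a', 'n', 'a']
print(not_arn)       # ['i', ' ', 'b']
print(minutes)       # ['09', '45']
print(letters)       # ['Train', 'departs', 'at', 'from', 'platform']
print(plus)          # ['+']
```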
3. PREDEFINED SEQUENCES
Sequence Description
\A Matches only at the start of the string.
\b Matches at a word boundary, i.e., at the beginning or end of a word. Write the pattern as a raw string such as r"\bain", because "\b" in a normal Python string is a backspace character.
\B Matches where the specified characters are present, but NOT at the beginning or end of a word.
\d Matches any digit (0-9).
\D Matches any character that is not a digit.
\s Matches any whitespace character (space, tab, newline).
\S Matches any character that is not whitespace.
\w Matches any word character, i.e., a-z, A-Z, 0-9, and underscore.
\W Matches any character that is not a word character.
\Z Matches only at the end of the string.
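The sequences above are easiest to compare side by side on the same input. This is a small sketch with made-up sample strings; all patterns are written as raw strings so that Python does not interpret the backslashes itself.

```python
import re

# Hypothetical sample sentence for illustration.
text = "The rain in Spain stays mainly in the plain."

starts_with_the = bool(re.search(r"\AThe", text))       # \A: only at the very start
ends_with_plain = bool(re.search(r"plain\.\Z", text))   # \Z: only at the very end
words_ending_ain = re.findall(r"\b\w*ain\b", text)      # \b: "ain" must end the word
inner_ain = re.findall(r"\Bain\B", text)                # \B: "ain" strictly inside a word
numbers = re.findall(r"\d+", "Order 66 shipped on 2024-01-15")  # runs of digits
tokens = re.findall(r"\S+", "a  b\tc")                  # runs of non-whitespace
word_runs = re.findall(r"\w+", "user_name-42!")         # letters, digits, underscore
non_word = re.findall(r"\W", "user_name-42!")           # everything else

print(starts_with_the, ends_with_plain)  # True True
print(words_ending_ain)  # ['rain', 'Spain', 'plain']
print(inner_ain)         # ['ain'] (from "mainly")
print(numbers)           # ['66', '2024', '01', '15']
print(tokens)            # ['a', 'b', 'c']
print(word_runs)         # ['user_name', '42']
print(non_word)          # ['-', '!']
```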
4. GROUPING AND CAPTURING
Parentheses () are used to group parts of a regex pattern and capture matches. Capturing groups save the matched
content for later use, while non-capturing groups allow grouping without saving the matched content.
CAPTURING GROUP
A capturing group matches the specified pattern and saves the matched content for reference. For example:
import re

pattern = r"(\d{3})-(\d{2})-(\d{4})"
text = "123-45-6789"
match = re.match(pattern, text)
print(match.groups()) # Output: ('123', '45', '6789')
NON-CAPTURING GROUP
A non-capturing group groups the pattern without saving the matched content. Use (?:...) to create a non-capturing
group. For example:
import re

pattern = r"(?:\d{3})-(\d{2})-(\d{4})"
text = "123-45-6789"
match = re.match(pattern, text)
print(match.groups()) # Output: ('45', '6789')
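To show what "saved for later use" can look like, here is a minimal sketch that reads individual groups from a match and reuses one of them in a replacement. The masking format and the title example are assumptions made for illustration, not part of the original article.

```python
import re

pattern = r"(\d{3})-(\d{2})-(\d{4})"
text = "123-45-6789"

match = re.match(pattern, text)
whole = match.group(0)   # the entire match
first = match.group(1)   # content of the first capturing group

# \3 in the replacement refers back to the third capturing group.
masked = re.sub(pattern, r"***-**-\3", text)

# With re.findall, only capturing groups are returned, so a non-capturing
# group keeps the titles out of the result (hypothetical sample sentence).
names = re.findall(r"(?:Mr|Ms|Dr)\. (\w+)", "Dr. Rao met Ms. Lee")
titled = re.findall(r"(Mr|Ms|Dr)\. (\w+)", "Dr. Rao met Ms. Lee")

print(whole)   # 123-45-6789
print(first)   # 123
print(masked)  # ***-**-6789
print(names)   # ['Rao', 'Lee']
print(titled)  # [('Dr', 'Rao'), ('Ms', 'Lee')]
```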
PRACTICAL EXAMPLES
1. MATCHING EMAIL ADDRESSES
Example: john.doe123@abc-school.ac.in
● The username part, i.e., the part before @:
Can contain letters a-z and A-Z, digits 0-9, dot ., and hyphen -. Some providers, unlike Gmail, also allow
underscore _ and other special characters such as +.
○ john.doe123 is matched by "[a-zA-Z0-9._+-]+": one or more occurrences of these characters. The hyphen is
placed last in the set so that it is read as a literal hyphen and not as a range.
● The domain part, i.e., the part after @:
Can contain subdomains, a domain, domain extensions, and one mandatory final extension that must
contain at least 2 letters.
○ abc-school.ac is matched by "[a-zA-Z0-9.-]+"
○ .in is matched by "\.[a-zA-Z]{2,}"
# Complete regex:
r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"
# Shorter, nearly equivalent regex:
r"[\w.+-]+@[\w.-]+\.[a-zA-Z]{2,}"
# (\w: any letter, digit, or underscore, so this version also accepts underscores in the domain;
# {2,} means 2 or more occurrences)
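The email pattern can be checked against a few addresses with re.fullmatch, which requires the whole string to match. This is a sketch under the assumption that the hyphen is written last inside each set; the helper names is_email and compiles are hypothetical. The last two lines show why the position of the hyphen matters: placed between _ and +, it is read as a reversed range and the pattern does not compile.

```python
import re

# Hyphen written last in each set so it is a literal character.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_email(candidate):
    """Return True if the whole string looks like an email address."""
    return EMAIL_RE.fullmatch(candidate) is not None

def compiles(pattern):
    """Return True if the pattern is a valid regex, False otherwise."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

for address in ["john.doe123@abc-school.ac.in",
                "why_not.valid+email@gmail.com",
                "john.doe@com",
                "noatsymbol.com"]:
    print(address, is_email(address))

print(compiles(r"[._+-]"))  # True: hyphen at the end is literal
print(compiles(r"[._-+]"))  # False: _-+ is read as a reversed range
```

This pattern is a practical approximation and not a full implementation of the email address standard.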
2. MATCHING QUESTIONS
Examples:
- Is this your final answer?
- "Python is a snake" - is this statement correct?
- Why is the sky blue during the day?
● Start of a question: alphanumeric characters or quotation marks: r"[a-zA-Z0-9\"']+"
● Middle part of a question: r"[a-zA-Z0-9\"' ,_+-]*"
(you can include more special characters if they are allowed in the questions, or you can use [^?\n] to match
every character except a question mark and a newline). The hyphen is written last in the set: written as ,-_ it
would be read as the range from comma to underscore, which also includes characters such as . / : and ?
● End of a question: r"\?"
# Complete regex:
r"[\w\"']+[\w\"',+ -]*\?"
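The [^?\n] alternative mentioned above can be sketched as follows. Anchoring each question to a full line with ^, $ and re.MULTILINE is an assumption added here for illustration; the name QUESTION_RE and the sample text are hypothetical. The last two checks illustrate how a hyphen between a comma and an underscore is read as a range.

```python
import re

# One question per line: a word character or quote, then anything except ? and newline, then ?
QUESTION_RE = re.compile(r"^[\w\"'][^?\n]*\?$", re.MULTILINE)

text = (
    "Is this your final answer?\n"
    '"Python is a snake" - is this statement correct?\n'
    "Why is the sky blue during the day?\n"
    "Is this even correct..\n"
)

questions = QUESTION_RE.findall(text)
for question in questions:
    print(question)

# Inside a set, ,-_ is the range from comma to underscore, which contains ':'.
range_accepts_colon = re.fullmatch(r"[,-_]", ":") is not None    # True
literal_rejects_colon = re.fullmatch(r"[,_-]", ":") is not None  # False
print(range_accepts_colon, literal_rejects_colon)
```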
3. MATCHING URLS
Example:
- https://www.example.com?query_param1=value1&query_param2=value2
Components of a URL:
Many special characters are allowed in a URL, but some are not. For example, white space is encoded as %20, and
non-ASCII characters are percent-encoded in the same way.
● Scheme (http/https) of the URL followed by ://: r"https?:\/\/"
● Subdomain, domain, top-level domain: r"(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}"
● Port number (optional non-capturing group): r"(?::[0-9]{1,5})?"
● Path (optional non-capturing group): r"(?:\/[^\s?#]*)?"
● Query separator and parameters (optional non-capturing group): r"(?:\?[a-zA-Z0-9%._\-~+=&]*)?"
● Fragment (optional non-capturing group): r"(?:#[^\s]*)?"
# Complete regex:
r"https?:\/\/(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}(?::[0-9]{1,5})?(?:\/[^\s?#]*)?(?:\?[a-zA-Z0-9%._\-~+=&]*)?(?:#[^\s]*)?"
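Because the complete URL regex is long, one way to keep it readable is to build it from the named components listed above. The following is a sketch; the constant names and the sample sentence are hypothetical. The forward slash is written without a backslash here, since it needs no escaping in Python regex.

```python
import re

SCHEME = r"https?://"
HOST = r"(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}"   # subdomains, domain, top-level domain
PORT = r"(?::[0-9]{1,5})?"                   # optional port
PATH = r"(?:/[^\s?#]*)?"                     # optional path
QUERY = r"(?:\?[a-zA-Z0-9%._\-~+=&]*)?"      # optional query string
FRAGMENT = r"(?:#[^\s]*)?"                   # optional fragment

URL_RE = re.compile(SCHEME + HOST + PORT + PATH + QUERY + FRAGMENT)

# Hypothetical sample sentence for illustration.
text = (
    "see https://www.example.com?query_param1=value1&query_param2=value2 "
    "or http://example.org:8080/resource#top "
    "but not ftp://wrong.protocol.com"
)

# All groups are non-capturing, so findall returns the full matches.
urls = URL_RE.findall(text)
for url in urls:
    print(url)
```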
4. MATCHING IPV4 ADDRESSES
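An IPv4 address consists of four octets separated by dots (.), where each octet is a number between 0 and 255.
Logic behind the regex to match a number between 0 and 255:
● Number between 0-9: [0-9]
● Number between 10-99: [1-9][0-9]
● Number between 0-99: [0-9][0-9]?
● Number between 0-199: [0-1]?[0-9][0-9]?
● Number between 200-249: 2[0-4][0-9]
● Number between 250-255: 25[0-5]
(A single pattern 2[0-5][0-5] does not cover 200-255, because it rejects values such as 206 or 249 whose last
digit is greater than 5.)
Regex for a number between 0 and 255: r"(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)"
The three-digit alternatives come first so that a value such as 255 is matched in full instead of stopping after 25.
# Complete regex (\b keeps a match from starting or ending in the middle of a longer number):
r"\b(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)\b"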
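The octet logic can be tested directly. This sketch assumes the range 200-255 is split into 2[0-4][0-9] and 25[0-5]; the names OCTET, IPV4_RE and is_ipv4 are hypothetical. The final check shows that a single pattern 2[0-5][0-5] would reject a valid value such as 206.

```python
import re

# One octet: 250-255, 200-249, or 0-199 (leading zeros are accepted by the last branch).
OCTET = r"(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)"

# Three octets each followed by a dot, then a final octet.
IPV4_RE = re.compile(rf"(?:{OCTET}\.){{3}}{OCTET}")

def is_ipv4(candidate):
    """Return True if the whole string is a dotted IPv4 address."""
    return IPV4_RE.fullmatch(candidate) is not None

for address in ["192.168.1.1", "127.0.0.1", "249.206.219.255",
                "256.256.256.256", "999.999.999.999", "1.2.3"]:
    print(address, is_ipv4(address))

# 206 lies between 200 and 255, but its last digit is outside [0-5].
narrow_accepts_206 = re.fullmatch(r"2[0-5][0-5]", "206") is not None
print(narrow_accepts_206)  # False
```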
Python’s re Module
The re module provides built-in functions for regex operations.
COMMON FUNCTIONS
Each function is listed with its description, syntax, and return value (x).

re.findall
Description: Returns a list containing all matches in the order they are found. If no match, empty list.
Syntax: x = re.findall("regex_expression", text)
Return value (x): List of all matched strings

re.search
Description: Returns a match object for the first match found. Returns None if no match is found.
Syntax: x = re.search("regex_expression", text)
Return value (x): Match object (if found) or None

re.split
Description: Splits a string into a list at each match. Optionally, limit the splits with maxsplit.
Syntax: x = re.split("regex_expression", text, [maxsplit])
Return value (x): List of separated strings

re.sub
Description: Replaces one or more matches with a given string. Optionally limit replacements with count.
Syntax: x = re.sub("regex_expression", "replacement_string", text, count)
Return value (x): A new string with substitutions applied
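A compact sketch of the four functions applied to one string. The sample text and variable names are made up for illustration; maxsplit and count are passed as keyword arguments, which keeps the calls unambiguous.

```python
import re

# Hypothetical sample text for illustration.
text = "Order 66 shipped on 2024-01-15; order 67 ships on 2024-02-01."
date_pattern = r"\d{4}-\d{2}-\d{2}"

all_dates = re.findall(date_pattern, text)        # every match, in order
first_date = re.search(date_pattern, text)        # match object for the first match
no_match = re.search(r"refund", text)             # None when nothing matches
parts = re.split(r";\s*", text, maxsplit=1)       # split at the first semicolon only
redacted = re.sub(date_pattern, "<date>", text, count=1)  # replace the first date only

print(all_dates)           # ['2024-01-15', '2024-02-01']
print(first_date.group())  # 2024-01-15
print(no_match)            # None
print(parts)               # ['Order 66 shipped on 2024-01-15', 'order 67 ships on 2024-02-01.']
print(redacted)            # Order 66 shipped on <date>; order 67 ships on 2024-02-01.
```

Since re.search returns None when there is no match, it is safer to check the result before calling group() on it.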
CODE:
import re
# Sample text with correct and incorrect examples
sample_text = """
Correct Examples:
john.doe123@abc-school.ac.in
why_not.valid+email@gmail.com
Is this your final answer?
"Python is a snake" - is this statement correct?
https://www.example.com?query_param1=value1&query_param2=value2
http://example.org/resource
192.168.1.1
127.0.0.1
Incorrect Examples:
john.doe@com
noatsymbol.com
Is this even correct..
ftp://wrong.protocol.com
256.256.256.256
999.999.999.999
"""
# Regex patterns
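patterns = {
    "Email Address": r"[a-zA-Z0-9._+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}",
    # Note: inside the set, ,-_ is a range (comma through underscore), so it also admits
    # characters such as . / : and ?. The output below reflects this: the URL prefix is
    # reported as a question. Writing the hyphen last in the set makes it a literal hyphen.
    "Question": r"[a-zA-Z0-9\"'][a-zA-Z0-9\"',-_-+ ]*\?",
    "URL": r"https?:\/\/(?:[a-zA-Z0-9-]+\.)+[a-zA-Z]{2,}(?::[0-9]{1,5})?(?:\/[^\s?#]*)?(?:\?[a-zA-Z0-9%._\-~+=&]*)?(?:#[^\s]*)?",
    # 200-255 is split into 25[0-5] and 2[0-4][0-9] so that octets such as 206 or 249 are accepted.
    "IPv4 Address": r"\b(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)\b",
}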
def test_regex(pattern_name, pattern, text):
    # Print every match of the given pattern in the text
    print(f"\nTesting: {pattern_name}")
    matches = re.findall(pattern, text)
    print("Matches:")
    for match in matches:
        print(f" - {match}")

# Testing all patterns
for name, regex in patterns.items():
    test_regex(name, regex, sample_text)
OUTPUT:
Testing: Email Address
Matches:
- john.doe123@abc-school.ac.in
- why_not.valid+email@gmail.com
Testing: Question
Matches:
- Is this your final answer?
- "Python is a snake" - is this statement correct?
- https://www.example.com?
Testing: URL
Matches:
- https://www.example.com?query_param1=value1&query_param2=value2
- http://example.org/resource
Testing: IPv4 Address
Matches:
- 192.168.1.1
- 127.0.0.1
Theory References:
https://www.w3schools.com/python/python_regex.asp
https://www.geeksforgeeks.org/components-of-a-url/