Python 3 – Regular Expressions

A regular expression is a special sequence of characters used in a pattern to match or find other strings or sets of strings using a specialized syntax. Regular expressions are widely used in the UNIX world.

The re module provides comprehensive support for Perl-like regular expressions in Python. If an error occurs while compiling or using a regular expression, the re module raises the exception re.error.
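As a small illustration of the re.error exception, the sketch below tries to compile a pattern and reports whether it is well formed. The helper name is_valid_pattern is hypothetical and exists only for this example.

```python
import re

def is_valid_pattern(pattern):
    """Return True if `pattern` compiles, False if re.error is raised."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(is_valid_pattern(r"(\d+)"))  # True: a well-formed pattern
print(is_valid_pattern(r"(\d+"))   # False: the parenthesis is never closed
```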

We will introduce two important functions for working with regular expressions. First, a small matter: various characters have special meanings in regular expressions, and the backslash is also an escape character in ordinary Python string literals. To avoid confusion, we will write patterns as raw strings, r'expression'.
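To see why raw strings matter, the following sketch compares the same pattern written as a normal string and as a raw string. The sample text is made up for illustration.

```python
import re

# In a normal string literal, "\b" is the backspace character (0x08).
# In a raw string, r"\b" stays as a backslash plus "b", which the regex
# engine reads as a word boundary.
normal = "\bcat\b"
raw = r"\bcat\b"

print(len(normal), len(raw))           # 5 7
print(re.search(normal, "a cat sat"))  # None: it looks for literal backspaces
print(re.search(raw, "a cat sat"))     # a match object for "cat"
```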

Basic Patterns: Matching Single Characters

Order Numbers Expressions and Matches
1 a, X, 9, < Ordinary characters match themselves exactly.
2 . (period) Matches any single character except the newline character ‘\n’.
3 \w Matches a “word” character: a letter, digit, or underscore [a-zA-Z0-9_].
4 \W Matches any non-word character.
5 \b Matches the boundary between a word and a non-word character.
6 \s Matches a single whitespace character: space, newline, carriage return, tab.
7 \S Matches any non-whitespace character.
8 \t, \n, \r Tab, line feed, carriage return.
9 \d Decimal digit [0-9].
10 ^ Matches the beginning of the string.
11 $ Matches the end of the string.
12 \ Suppresses the “specialness” of a character; for example, \. matches a literal period.
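The single-character patterns can be tried out with a short sketch like the one below. It uses re.search, which is described later in this tutorial, and the sample sentence is invented for the example.

```python
import re

text = "Order 66 shipped to Mr. Smith"

print(re.search(r"\d\d", text).group())        # 66    (two digits)
print(re.search(r"\w+", text).group())         # Order (first run of word characters)
print(repr(re.search(r"\s\S", text).group()))  # ' 6'  (whitespace, then non-whitespace)
print(re.search(r"^Order", text) is not None)  # True  (anchored at the start)
print(re.search(r"Smith$", text) is not None)  # True  (anchored at the end)
print(re.search(r"Mr\.", text).group())        # Mr.   (escaped period is literal)
```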

Compilation Flags

Compilation flags allow you to modify certain aspects of how regular expressions work. Flags are referred to in the re module by two names: a long name, such as IGNORECASE , and a short name, such as I.

Order Number Flags and Meanings
1 ASCII, A Causes escape sequences such as \w, \b, \s, and \d to match only ASCII characters with their respective properties.
2 DOTALL, S Causes . to match any character, including newline.
3 IGNORECASE, I Matches case-insensitively.
4 LOCALE, L Performs a match that conforms to the current locale (in Python 3, only with bytes patterns).
5 MULTILINE, M Enables multi-line matching, affecting ^ and $.
6 VERBOSE, X (for “extended”) Enables verbose regular expressions, which can be laid out more clearly and are easier to understand.
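A minimal sketch of how the most common flags change matching is shown below. The two-line sample text is an assumption made for the example.

```python
import re

text = "first line\nSecond line"

# IGNORECASE: "second" matches "Second".
print(re.search(r"second", text, re.I).group())       # Second

# Without MULTILINE, ^ matches only at the very start of the string.
print(re.search(r"^Second", text))                    # None
# With MULTILINE, ^ also matches right after each newline.
print(re.search(r"^Second", text, re.M).group())      # Second

# Without DOTALL, . stops at the newline; with DOTALL it crosses it.
print(re.match(r".*", text).group())                  # first line
print(re.match(r".*", text, re.S).group() == text)    # True

# Flags are combined with bitwise OR.
print(re.search(r"^second", text, re.I | re.M).group())  # Second
```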

Match Function

This function attempts to match the RE pattern at the beginning of string, optionally using flags.

The syntax of this function is as follows:

re.match(pattern, string, flags=0)

The parameters are described below:

Number Parameters and Descriptions
1 pattern This is the regular expression to match.
2 string This is the string to be searched; the pattern is matched only at its beginning.
3 flags You can combine different flags using bitwise OR (|). The available flags are listed in the Compilation Flags table above.

The re.match function returns a match object on success and None on failure. We use the match object’s group(num) or groups() methods to retrieve the matched text.

Order Number Match Object Methods and Descriptions
1 group(num = 0) This method returns the entire match (or a specific subgroup num).
2 groups() This method returns all matching subgroups in a tuple (or empty if there are none).

Example

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# (.*) is greedy, (.*?) is non-greedy; re.I makes the match case-insensitive
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M | re.I)

if matchObj:
    print("matchObj.group() :", matchObj.group())
    print("matchObj.group(1) :", matchObj.group(1))
    print("matchObj.group(2) :", matchObj.group(2))
else:
    print("No match!!")

Executing the above code produces the following result:

matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter

Search Function

This function searches for the first occurrence of the RE pattern anywhere in string, optionally using flags.

Here is the syntax of the function:

re.search(pattern, string, flags=0)

Below is a description of the parameters:

Number Parameters and Description
1 pattern This is the regular expression to match.
2 string This is the string to be searched; the pattern may match anywhere within it.
3 flags You can combine different flags using bitwise OR (|). The available flags are listed in the Compilation Flags table above.

The re.search function returns a match object on success and None on failure. We use the match object’s group(num) or groups() methods to retrieve the matched text.

Order Number Match Object Methods and Descriptions
1 group(num = 0) This method returns the entire match (or a specific subgroup num).
2 groups() This method returns all matching subgroups in a tuple (or empty if there are none).

Example

#!/usr/bin/python3
import re

line = "Cats are smarter than dogs"

# Search the entire string for a matching pattern
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M | re.I)

if searchObj:
    print("searchObj.group() :", searchObj.group())    # The entire match
    print("searchObj.group(1) :", searchObj.group(1))  # Text captured by the first group
    print("searchObj.group(2) :", searchObj.group(2))  # Text captured by the second group
else:
    print("Nothing found!!")

# match only matches at the beginning of the string, while search scans the entire string
matchObj = re.match(r'dogs', line, re.M | re.I)
if matchObj:
    print("match --> matchObj.group() :", matchObj.group())
else:
    print("No match!!")

searchObj = re.search(r'dogs', line, re.M | re.I)
if searchObj:
    print("search --> searchObj.group() :", searchObj.group())
else:
    print("Nothing found!!")

phone = "2004-959-559 # This is Phone Number"

# Remove the Python-style comment with re.sub(pattern, replacement, string)
num = re.sub(r'#.*$', "", phone)
print("Phone Number :", num)

# Remove all non-digit characters
num = re.sub(r'\D', "", phone)
print("Phone Num :", num)

When the above code is executed, it produces the following output:

searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter
No match!!
search --> searchObj.group() : dogs
Phone Number : 2004-959-559
Phone Num : 2004959559

Regular Expression Modifiers: Optional Flags

The regular expression functions accept optional modifiers that control various aspects of matching. Modifiers are specified as optional flags. Multiple modifiers can be combined using bitwise OR (|), as described above, and each is one of the following:

Order Number Modifier and Description
1 re.I Performs case-insensitive matching.
2 re.L Interprets words according to the current locale. This interpretation affects the behavior of word characters (\w and \W) and word boundaries (\b and \B). In Python 3 it can be used only with bytes patterns.
3 re.M Causes $ to match the end of a line (not just the end of a string) and causes ^ to match the beginning of any line (not just the beginning of a string).
4 re.S Causes a period (dot) to match any character, including newline.
5 re.U Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, and \B. In Python 3 this is already the default for str patterns.
6 re.X Permits more readable (“verbose”) regular expression syntax. It ignores whitespace (except within a set [] or when escaped with a backslash) and treats an unescaped # as a comment marker.
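The re.X modifier is easiest to appreciate on a pattern with several parts. The sketch below is one possible layout; the NNNN-NNN-NNN phone format and the name phone_re are assumptions chosen to fit the phone number used earlier.

```python
import re

# Hypothetical phone format used only for illustration: NNNN-NNN-NNN.
phone_re = re.compile(r"""
    (\d{4})   # first block: four digits
    -         # literal hyphen
    (\d{3})   # second block: three digits
    -         # literal hyphen
    (\d{3})   # third block: three digits
""", re.X)

m = phone_re.match("2004-959-559")
print(m.groups())  # ('2004', '959', '559')
```

The whitespace and the # comments inside the pattern are ignored because of re.X, so the compiled pattern behaves like (\d{4})-(\d{3})-(\d{3}).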

Regular Expression Patterns

Except for the control characters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a control character by preceding it with a backslash.

The following table lists commonly used regular expression syntax available in Python:

Sequence Number Parameter & Description
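1 ^ Matches the beginning of the string (and of each line when re.M is set).
2 $ Matches the end of the string, or just before a newline at the end of the string (and the end of each line when re.M is set).
3 . Matches any single character except newline. Use the re.S (DOTALL) flag to match newline as well.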
4 [...] Matches any single character enclosed in square brackets.
5 [^...] Matches any single character not enclosed in square brackets.
6 re* Matches zero or more occurrences of the preceding expression.
7 re+ Matches one or more occurrences of the preceding expression.
8 re? Matches 0 or 1 occurrences of the preceding expression.
9 re{n} Matches the preceding expression exactly n times.
10 re{n,} Matches the preceding expression n or more times.
11 re{n,m} Matches the preceding expression at least n times and at most m times.
12 a|b Matches either a or b.
13 (re) Captures a regular expression and remembers the matched text.
14 (?imx) Turns on the i, m, or x option for the entire pattern. In current Python versions this form must appear at the very start of the pattern.
15 (?-imx) Turns off the i, m, or x option in some other regex flavors. Python's re module does not accept this standalone form; use the scoped form (?-imx:re) instead.
16 (?:re) Groups an expression without capturing (remembering) the matched text.
17 (?imx:re) Temporarily turns on the i, m, or x option within the parentheses (Python 3.6 and later).
18 (?-imx:re) Temporarily turns off the i, m, or x option within the parentheses (Python 3.6 and later).
19 (?#...) Comment; its contents are ignored.
20 (?=re) Positive lookahead: succeeds if re matches at the current position, without consuming any characters.
21 (?!re) Negative lookahead: succeeds if re does not match at the current position, without consuming any characters.
22 (?>re) Atomic group: matches re independently, without backtracking into it (Python 3.11 and later).
23 \w Matches a word character.
24 \W Matches a non-word character.
25 \s Matches a whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v].
26 \S Matches a non-whitespace character.
27 \d Matches a digit. Equivalent to [0-9] for ASCII text.
28 \D Matches a non-digit.
29 \A Matches the beginning of the string.
30 \Z Matches the end of the string. In Python it matches only at the very end, even if the string ends with a newline (unlike Perl, where it can also match just before a final newline).
31 \z Matches the end of the string in Perl-style flavors. Python's re has long provided only \Z for this; \z is not accepted by older Python versions.
32 \G Matches the position where the previous match ended in other flavors. Python's re module does not support it.
33 \b Matches a word boundary when outside square brackets. Inside square brackets, it matches a backspace (0x08).
34 \B Matches a non-word boundary.
35 \n, \r, \t, etc. Matches newline, carriage return, tab, etc.
36 \1...\9 Matches the nth grouped subexpression.
37 \10 Matches the tenth grouped subexpression. In Python this is an error if group 10 does not exist; to write an octal character code, use a leading 0 or three octal digits (for example \010).

Regular Expression Example

Literal Characters

Sequence Number Example and Explanation
1 python matches “python”.

Character Classes

Order Numbers Examples and Explanations
1 [Pp]ython matches “Python” or “python”.
2 rub[ye] matches “ruby” or “rube”.
3 [aeiou] matches any lowercase vowel.
4 [0-9] matches any digit; equivalent to [0123456789].
5 [a-z] matches any lowercase ASCII letter.
6 [A-Z] matches any uppercase ASCII letter.
7 [a-zA-Z0-9] matches any of the above.
8 [^aeiou] matches all characters except lowercase vowels.
9 [^0-9] matches all characters except digits.
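The character classes above can be checked with a short sketch. It uses re.findall, a real re function that this tutorial does not otherwise cover; it returns every non-overlapping match as a list. The sample strings are invented.

```python
import re

words = "Python python ruby rube rubz"

print(re.findall(r"[Pp]ython", words))         # ['Python', 'python']
print(re.findall(r"rub[ye]", words))           # ['ruby', 'rube'] ("rubz" is skipped)
# A negated class: everything except lowercase vowels and whitespace.
print(re.findall(r"[^aeiou\s]", "a big cat"))  # ['b', 'g', 'c', 't']
```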

Special Character Classes

Sequence Numbers Examples and Explanations
1 . Matches any character except newline.
2 \d Matches a digit: [0-9].
3 \D Matches a non-digit: [^0-9].
4 \s Matches a whitespace character: [ \t\r\n\f].
5 \S Matches a non-whitespace character: [^ \t\r\n\f].
6 \w Matches a word character: [A-Za-z0-9_].
7 \W Matches a non-word character: [^A-Za-z0-9_].
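One caveat worth illustrating: for str patterns in Python 3, classes such as \w are Unicode-aware by default, so the bracketed ASCII equivalents hold exactly only when the re.ASCII flag is set. The sketch below shows the difference on a made-up sample containing accented letters.

```python
import re

# "naïve café", written with escapes so the accented letters are unambiguous.
text = "na\u00efve caf\u00e9"

# Default (Unicode) behavior: accented letters count as word characters.
print(re.findall(r"\w+", text))            # ['naïve', 'café']

# With re.ASCII, \w is limited to [A-Za-z0-9_].
print(re.findall(r"\w+", text, re.ASCII))  # ['na', 've', 'caf']
```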

Repeat

Sequence Number Example and Explanation
1 ruby? Matches “rub” or “ruby”: the y is optional.
2 ruby* Matches “rub” followed by zero or more y’s.
3 ruby+ Matches “rub” followed by one or more y’s.
4 \d{3} Matches three digits.
5 \d{3,} Matches three or more digits.
6 \d{3,5} Matches three, four, or five digits.
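The repetition operators can be compared side by side with a sketch such as the following. It uses re.fullmatch, which succeeds only when the whole string matches; the candidate strings are chosen for illustration.

```python
import re

# Columns: candidate, ruby?, ruby*, ruby+
for candidate in ["rub", "ruby", "rubyyy"]:
    print(candidate,
          bool(re.fullmatch(r"ruby?", candidate)),
          bool(re.fullmatch(r"ruby*", candidate)),
          bool(re.fullmatch(r"ruby+", candidate)))
# rub True True False
# ruby True True True
# rubyyy False True True

# {3,5} takes between three and five digits at a time.
print(re.findall(r"\d{3,5}", "12 123 1234567"))  # ['123', '12345']
```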

Non-greedy repetition

Unlike greedy repetition, non-greedy repetition matches the minimum number of repetitions:

Sequence number Example and explanation
1 <.*> Greedy repetition: matches <python>perl>.
2 <.*?> Non-greedy repetition: matches <python>, not <python>perl>.
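The difference between the two forms can be reproduced directly on the sample text from the table:

```python
import re

text = "<python>perl>"

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
print(re.match(r"<.*>", text).group())   # <python>perl>

# Non-greedy: .*? grabs as little as possible, stopping at the first ">".
print(re.match(r"<.*?>", text).group())  # <python>
```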

Grouping with Parentheses

Sequence number Example and description
1 \D\d+ No grouping: + repeats only \d.
2 (\D\d)+ Grouping: + repeats the \D\d pair.
3 ([Pp]ython(, )?)+ Matches “Python”, “Python, python, python”, etc.
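A brief sketch of how parentheses change what + repeats is given below. The test strings are invented, and the last pattern assumes a space after the optional comma so that it can span a comma-separated list.

```python
import re

# Without grouping, + applies only to \d.
print(re.fullmatch(r"\D\d+", "a123") is not None)    # True
print(re.fullmatch(r"\D\d+", "a1b2") is not None)    # False

# With grouping, + repeats the whole \D\d pair.
print(re.fullmatch(r"(\D\d)+", "a1b2") is not None)  # True

# A repeated group with an optional ", " separator inside it.
m = re.match(r"([Pp]ython(, )?)+", "Python, python, python")
print(m.group())  # Python, python, python
```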

Backreference

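A backreference matches the same text that an earlier group matched.

Sequence Number Example and Description
1 ([Pp])ython&\1ails Matches python&pails or Python&Pails.
2 (['"])[^\1]*\1 A string enclosed in single or double quotes. \1 matches whatever the first group matched, \2 matches whatever the second group matched, and so on. Note that in Python, \1 inside a character class [...] is read as a character code rather than a backreference, so a non-greedy form such as (['"]).*?\1 is more dependable.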
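The following sketch shows a backreference enforcing that two parts of the text agree. The quoted-string pattern uses a non-greedy body instead of a negated class; that is a substitution made for this example, and the name quoted is hypothetical.

```python
import re

# \1 must repeat exactly what the first group captured ("P" or "p").
print(re.match(r"([Pp])ython&\1ails", "Python&Pails") is not None)  # True
print(re.match(r"([Pp])ython&\1ails", "Python&pails") is not None)  # False

# A quoted string: the closing quote must be the same character as the opening one.
quoted = re.compile(r"""(['"])(.*?)\1""")
print(quoted.search('say "it\'s fine" now').group(2))  # it's fine
```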

Alternatives

Number Example and Description
1 python|perl Matches “python” or “perl”
2 rub(y|le) Matches “ruby” or “ruble”
3 Python(!+|\?) Matches “Python” followed by one or more ! or by one ?.
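Alternation can be exercised with a sketch like the one below. Because re.findall returns group contents when a pattern contains a capturing group, the second example uses re.finditer to get the full matches; the sample strings are made up.

```python
import re

print(re.findall(r"python|perl", "perl, python, ruby"))  # ['perl', 'python']

# finditer yields match objects, so group() gives the whole match.
print([m.group() for m in re.finditer(r"rub(y|le)", "ruby ruble rubric")])
# ['ruby', 'ruble']

# The ? must be escaped to stand for a literal question mark.
print(re.fullmatch(r"Python(!+|\?)", "Python!!!") is not None)  # True
print(re.fullmatch(r"Python(!+|\?)", "Python?") is not None)    # True
print(re.fullmatch(r"Python(!+|\?)", "Python?!") is not None)   # False
```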

Anchor

Anchors specify the position at which a match must occur.

Sequence Number Example and Description
1 ^Python Matches “Python” at the beginning of a string or line.
2 Python$ Matches “Python” at the end of a string or line.
3 \APython Matches “Python” at the beginning of a string.
4 Python\Z Matches “Python” at the end of a string.
5 \bPython\b Matches “Python” at a word boundary.
6 \brub\B \B is a non-word boundary: it matches “rub” in “rube” and “ruby,” but not “rub” on its own.
7 Python(?=!) Matches “Python” if followed by an exclamation point.
8 Python(?!!) Matches “Python” if not followed by an exclamation point.
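The anchors and lookaheads above can be compared on a small two-line sample, as in this sketch. The sample text is an assumption chosen so that each anchor behaves differently.

```python
import re

text = "Python\nPython!"

# ^ matches only at the start of the string unless re.M is given.
print(len(re.findall(r"^Python", text)))         # 1
print(len(re.findall(r"^Python", text, re.M)))   # 2
# \A always means the start of the string, even with re.M.
print(len(re.findall(r"\APython", text, re.M)))  # 1

# \B requires that "rub" is followed by another word character.
print(re.findall(r"\brub\B", "rub rube ruby"))   # ['rub', 'rub']

# Lookaheads check what follows without including it in the match.
print(re.findall(r"Python(?=!)", text))                        # ['Python']
print([m.start() for m in re.finditer(r"Python(?!!)", text)])  # [0]
```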

Special Syntax with Parentheses

Sequence Number Example and Description
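1 R(?#comment) Matches “R”. Everything inside (?#...) is a comment.
2 R(?i)uby Intended to match “R” followed by a case-insensitive “uby”. Python does not accept an inline flag in the middle of a pattern (an error since Python 3.11); use the scoped form below.
3 R(?i:uby) Matches “R” followed by a case-insensitive “uby” (Python 3.6 and later).
4 rub(?:y|le) Groups only; no backreference is created.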
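A short sketch of the parenthesized forms, assuming Python 3.6 or later for the scoped (?i:...) syntax:

```python
import re

# Only the part inside (?i:...) is case-insensitive; the leading R is not.
print(re.fullmatch(r"R(?i:uby)", "RUBY") is not None)  # True
print(re.fullmatch(r"R(?i:uby)", "rUBY") is not None)  # False

# A non-capturing group creates no numbered group; a capturing group does.
print(re.match(r"rub(?:y|le)", "ruble").groups())  # ()
print(re.match(r"rub(y|le)", "ruble").groups())    # ('le',)

# An inline comment is ignored by the regex engine.
print(re.match(r"R(?#just a comment)uby", "Ruby").group())  # Ruby
```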
