Python 3 – Regular Expressions
A regular expression is a special sequence of characters that describes a pattern, written in a specialized syntax, and is used to match or find other strings or sets of strings. Regular expressions are widely used in the UNIX world.
The re module provides comprehensive support for Perl-like regular expressions in Python. If an error occurs while compiling or using a regular expression, the re module raises the exception re.error.
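The error handling described above can be sketched as follows. This is only an illustration: the helper name try_compile is hypothetical and not part of the re module.

```python
import re

def try_compile(pattern):
    """Return a compiled pattern, or None if the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:
        # re.error carries a message describing what went wrong.
        print("Invalid pattern:", exc)
        return None

good = try_compile(r"\d+")        # a valid pattern
bad = try_compile(r"(unclosed")   # missing closing parenthesis
```

Here good is a usable compiled pattern, while bad is None because compilation failed.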
We will introduce two important functions for working with regular expressions. First, a small matter: several characters have special meanings in regular expressions, and many patterns contain backslashes. To avoid confusion, we will write patterns as raw strings, r'expression', so that Python passes backslashes through to the re module unchanged.
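A short sketch of why raw strings help, assuming nothing beyond the standard re module. The variable names are illustrative only.

```python
import re

plain = "\\d+"   # regular string: the backslash must be doubled
raw = r"\d+"     # raw string: written exactly as the regex engine sees it
same = (plain == raw)

# In a regular string "\b" is a backspace character, not a word boundary.
backspace = "\bcat\b"
boundary = r"\bcat\b"
found_plain = re.search(backspace, "a cat sat")   # looks for backspace characters
found_raw = re.search(boundary, "a cat sat")      # looks for the whole word "cat"
```

Only the raw-string version finds the word, because the regular string never delivers a backslash to the re module.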
Basic Patterns That Match Single Characters
Order Numbers | Expressions and Matches |
---|---|
1 | a, X, 9, < Ordinary characters match themselves exactly. |
2 | . (period) Matches any single character except the newline character ‘\n’. |
3 | \w Matches a “word” character: a letter, digit, or underscore [a-zA-Z0-9_]. |
4 | \W Matches any non-word character. |
5 | \b Matches the boundary between a word and a non-word character. |
6 | \s Matches a single whitespace character: space, newline, carriage return, tab. |
7 | \S Matches any non-whitespace character. |
8 | \t, \n, \r Tab, newline (line feed), carriage return. |
9 | \d Decimal digit [0-9]. |
10 | ^ Matches the beginning of the string. |
11 | $ Matches the end of the string. |
12 | \ Suppresses the “specialness” of a character; for example, \. matches a literal period. |
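A few of the patterns in the table can be tried out as follows. This is a minimal sketch: the sample text is made up, and re.findall (a standard re function that returns all non-overlapping matches) is used only to keep the example compact.

```python
import re

text = "Order 66 shipped\ton 2024-01-15"

digits = re.findall(r"\d+", text)       # runs of decimal digits
words = re.findall(r"\w+", text)        # runs of word characters
starts = re.match(r"^Order", text)      # ^ anchors at the start of the string
ends = re.search(r"15$", text)          # $ anchors at the end of the string
whole_on = re.search(r"\bon\b", text)   # "on" only as a whole word
```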
Compilation Flags
Compilation flags let you modify certain aspects of how regular expressions work. Each flag is available in the re module under two names: a long name, such as IGNORECASE, and a short, one-letter name, such as I.
Order Number | Flags and Meanings |
---|---|
1 | ASCII, A Causes escape sequences such as \w, \b, \s, and \d to match only ASCII characters with the respective property. |
2 | DOTALL, S Causes . to match any character, including newline. |
3 | IGNORECASE, I Matches case-insensitively. |
4 | LOCALE, L Performs a match that conforms to the current locale. |
5 | MULTILINE, M Multi-line matching: ^ and $ also match at the beginning and end of each line. |
6 | VERBOSE, X (for “extended”) Enables verbose regular expressions, which can be laid out more clearly with whitespace and comments. |
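The effect of several flags from the table can be sketched like this. The sample text and variable names are illustrative assumptions.

```python
import re

text = "first line\nSecond LINE"

# IGNORECASE: letter case is ignored.
ci = re.search(r"second line", text, re.IGNORECASE)

# MULTILINE: ^ matches at the start of every line.
line_starts = re.findall(r"^\w+", text, re.MULTILINE)

# DOTALL: the dot also matches the newline character.
dot_default = re.search(r"line.Second", text)
dot_all = re.search(r"line.Second", text, re.DOTALL)

# VERBOSE: whitespace and comments inside the pattern are ignored.
verbose = re.compile(r"""
    (\d{4})   # year
    -
    (\d{2})   # month
""", re.VERBOSE)
ym = verbose.match("2024-01").groups()
```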
Match Function
This function attempts to match the RE pattern against the beginning of string, optionally using flags.
The syntax of this function is as follows:
re.match(pattern, string, flags = 0)
Here is a description of the parameters:
Number | Parameters and Descriptions |
---|---|
1 | pattern This is the regular expression to match. |
2 | string This is the string to be searched; the pattern must match at the beginning of the string. |
3 | flags You can combine different flags using bitwise OR (|). These are the modifiers listed under Regular Expression Modifiers. |
The re.match function returns a match object on success and None on failure. We use the match object’s group(num) or groups() method to retrieve the matched text.
Order Number | Match Object Methods and Descriptions |
---|---|
1 | group(num = 0) This method returns the entire match (or a specific subgroup num). |
2 | groups() This method returns all matching subgroups in a tuple (or empty if there are none). |
Example
#!/usr/bin/python3
import re
line = "Cats are smarter than dogs"
matchObj = re.match(r'(.*) are (.*?) .*', line, re.M|re.I)
if matchObj:
    print("matchObj.group() : ", matchObj.group())
    print("matchObj.group(1) : ", matchObj.group(1))
    print("matchObj.group(2) : ", matchObj.group(2))
else:
    print("No match!!")
When the above code is executed, it produces the following result:
matchObj.group() : Cats are smarter than dogs
matchObj.group(1) : Cats
matchObj.group(2) : smarter
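The groups() method mentioned above returns every captured subgroup at once. A minimal sketch, reusing the same sample sentence and checking for None before touching the result:

```python
import re

line = "Cats are smarter than dogs"

m = re.match(r"(.*) are (.*?) .*", line)
# groups() returns all captured subgroups as a tuple.
all_groups = m.groups() if m else ()

# re.match only succeeds at the beginning of the string, so this is None.
no_match = re.match(r"dogs", line)
```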
Search Function
This function searches for the first occurrence of the RE pattern anywhere in string, optionally using flags.
Here is the syntax of the function:
re.search(pattern, string, flags = 0)
Below is a description of the parameters:
Number | Parameters and Description |
---|---|
1 | pattern This is the regular expression to match. |
2 | string This is the string to be searched; the pattern may match anywhere within the string. |
3 | flags You can combine different flags using bitwise OR (|). These are the modifiers listed under Regular Expression Modifiers. |
The re.search function returns a match object on success and None on failure. We use the match object’s group(num) or groups() method to retrieve the matched text.
Order Number | Match Object Methods and Descriptions |
---|---|
1 | group(num = 0) This method returns the entire match (or a specific subgroup num). |
2 | groups() This method returns all matching subgroups in a tuple (or empty if there are none). |
Example
#!/usr/bin/python3
import re
line = "Cats are smarter than dogs"
# search scans the whole string for the first place the pattern matches
searchObj = re.search(r'(.*) are (.*?) .*', line, re.M|re.I)
if searchObj:
    print("searchObj.group() : ", searchObj.group())    # the entire matched text
    print("searchObj.group(1) : ", searchObj.group(1))  # text captured by the first group
    print("searchObj.group(2) : ", searchObj.group(2))  # text captured by the second group
else:
    print("Nothing found!!")
# match only matches at the beginning of the string, while search scans the entire string
matchObj = re.match(r'dogs', line, re.M|re.I)
if matchObj:
    print("match --> matchObj.group() : ", matchObj.group())
else:
    print("No match!!")
searchObj = re.search(r'dogs', line, re.M|re.I)
if searchObj:
    print("search --> searchObj.group() : ", searchObj.group())
else:
    print("Nothing found!!")
phone = "2004-959-559 # This is Phone Number"
# re.sub(pattern, repl, string) replaces every match of pattern with repl
# Remove the Python-style comment
num = re.sub(r'#.*$', "", phone)
print("Phone Number : ", num)
# Remove all non-digit characters
num = re.sub(r'\D', "", phone)
print("Phone Num : ", num)
When the above code is executed, it produces the following output:
searchObj.group() : Cats are smarter than dogs
searchObj.group(1) : Cats
searchObj.group(2) : smarter
No match!!
search --> searchObj.group() : dogs
Phone Number : 2004-959-559
Phone Num : 2004959559
Regular Expression Modifiers: Optional Flags
Regular expressions can take optional modifiers that control various aspects of matching. Modifiers are passed as optional flags. Multiple modifiers can be combined using bitwise OR (|), as described above. Each is one of the following:
Order Number | Modifier and Description |
---|---|
1 | re.I Performs case-insensitive matching. |
2 | re.L Interprets words according to the current locale. This interpretation affects the behavior of letter groups (\w and \W) and word boundaries (\b and \B). |
3 | re.M Causes $ to match the end of a line (not just the end of a string) and causes ^ to match the beginning of any line (not just the beginning of a string). |
4 | re.S Causes a period (dot) to match any character, including newline. |
5 | re.U Interprets letters according to the Unicode character set. This flag affects the behavior of \w, \W, \b, and \B. In Python 3 this is already the default for string patterns. |
6 | re.X Allows more readable regular expression syntax. It ignores whitespace (except inside a set [] or when escaped with a backslash) and treats an unescaped # as a comment marker. |
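Combining modifiers with bitwise OR can be sketched as follows. The sample text is an assumption chosen so that neither flag is sufficient on its own.

```python
import re

text = "alpha\nBETA\ngamma"

# Both flags together: ignore case, and let ^ and $ match at line boundaries.
combined = re.findall(r"^beta$", text, re.I | re.M)

# Either flag alone is not enough for this text.
only_i = re.findall(r"^beta$", text, re.I)
only_m = re.findall(r"^beta$", text, re.M)
```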
Regular Expression Patterns
Except for the metacharacters (+ ? . * ^ $ ( ) [ ] { } | \), all characters match themselves. You can escape a metacharacter by preceding it with a backslash.
The following table lists commonly used regular expression syntax available in Python:
Sequence Number | Parameter & Description |
---|---|
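1 | ^ Matches the beginning of the string, and the beginning of each line in multi-line mode. |
2 | $ Matches the end of the string (or just before a trailing newline), and the end of each line in multi-line mode. |
3 | . Matches any single character except newline. Use the s option (re.S) to match newline as well. |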
4 | [...] Matches any single character enclosed in square brackets. |
5 | [^...] Matches any single character not enclosed in square brackets. |
6 | re* Matches zero or more occurrences of the preceding expression. |
7 | re+ Matches one or more occurrences of the preceding expression. |
8 | re? Matches 0 or 1 occurrences of the preceding expression. |
9 | re{n} Matches the preceding expression exactly n times. |
10 | re{n,} Matches the preceding expression n or more times. |
11 | re{n,m} Matches the preceding expression at least n times and at most m times. |
12 | a|b Matches either a or b. |
13 | (re) Captures a regular expression and remembers the matched text. |
14 | (?imx) Turns on the i, m, or x option for the whole regular expression. In Python this form must appear at the start of the pattern. |
15 | (?-imx) Turns off the i, m, or x option. Python accepts this only in the scoped form (?-imx: re) shown below. |
16 | (?: re) Groups a regular expression without remembering the matched text. |
17 | (?imx: re) Temporarily turns on the i, m, or x option within the parentheses. |
18 | (?-imx: re) Temporarily turns off the i, m, or x option within the parentheses. |
19 | (?#...) Comments. |
20 | (?= re) Positive lookahead: asserts that re matches at this position, without consuming any characters. |
21 | (?! re) Negative lookahead: asserts that re does not match at this position, without consuming any characters. |
22 | (?> re) Atomic group: matches re independently, without backtracking into it. Python supports this from version 3.11. |
23 | \w Matches a word character. |
24 | \W Matches a non-word character. |
25 | \s Matches a whitespace character, equivalent to [ \t\n\r\f\v] for ASCII text. |
26 | \S Matches a non-whitespace character. |
27 | \d Matches a digit. Equivalent to [0-9] for ASCII text. |
28 | \D Matches a non-digit. |
29 | \A Matches the beginning of the string. |
30 | \Z Matches the end of the string. In Python, \Z matches only at the very end of the string, even when the string ends with a newline. |
31 | \z Matches the end of the string. Python accepts \z only from version 3.14, as a synonym for \Z; in earlier versions use \Z. |
32 | \G Matches the position where the previous match ended. This anchor comes from Perl and is not supported by Python’s re module. |
33 | \b Matches a word boundary when outside square brackets. Inside square brackets, it matches a backspace (0x08). |
34 | \B Matches a non-word boundary. |
35 | \n, \t, etc. Matches newline, tab, etc. |
36 | \1...\9 Matches the nth grouped subexpression. |
37 | \10 Matches the nth grouped subexpression, if it matched. Otherwise, refers to the octal representation of a character code. |
Regular Expression Example
Literal Characters
Sequence Number | Example and Explanation |
---|---|
1 | python matches “python”. |
Character Classes
Order Numbers | Examples and Explanations |
---|---|
1 | [Pp]ython matches “Python” or “python”. |
2 | rub[ye] matches “ruby” or “rube”. |
3 | [aeiou] matches any lowercase vowel. |
4 | [0-9] matches any digit; equivalent to [0123456789] . |
5 | [a-z] matches any lowercase ASCII letter. |
6 | [A-Z] matches any uppercase ASCII letter. |
7 | [a-zA-Z0-9] matches any of the above. |
8 | [^aeiou] matches all characters except lowercase vowels. |
9 | [^0-9] matches all characters except digits. |
Special Character Classes
Sequence Numbers | Examples and Explanations |
---|---|
1 | . Matches any character except newline. |
2 | \d Matches a digit: [0-9]. |
3 | \D Matches a non-digit: [^0-9]. |
4 | \s Matches a whitespace character: [ \t\r\n\f]. |
5 | \S Matches a non-whitespace character: [^ \t\r\n\f]. |
6 | \w Matches a word character: [A-Za-z0-9_]. |
7 | \W Matches a non-word character: [^A-Za-z0-9_]. |
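The bracketed equivalents in the table describe ASCII text. With Python 3 string patterns, \d, \w, and \s also match Unicode characters unless the re.ASCII flag is given. A small sketch, using a made-up sample string that starts with an Arabic-Indic digit:

```python
import re

# "\u0663" is ARABIC-INDIC DIGIT THREE, a Unicode decimal digit.
text = "\u0663 apples, 4 pears"

unicode_digits = re.findall(r"\d", text)            # default Unicode matching
ascii_digits = re.findall(r"\d", text, re.ASCII)    # restricted to [0-9]
```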
Repetition
Sequence Number | Example and Explanation |
---|---|
1 | ruby? Matches “rub” or “ruby”: the y is optional. |
2 | ruby* Matches “rub” followed by zero or more y’s. |
3 | ruby+ Matches “rub” followed by one or more y’s. |
4 | \d{3} Matches exactly three digits. |
5 | \d{3,} Matches three or more digits. |
6 | \d{3,5} Matches three, four, or five digits. |
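The quantifiers in the table can be checked with a short sketch. It uses re.fullmatch, a standard re function that succeeds only if the whole string matches; the candidate strings are illustrative.

```python
import re

candidates = ("rub", "ruby", "rubyy")

# ?  : the final y is optional (zero or one)
optional = [bool(re.fullmatch(r"ruby?", s)) for s in candidates]
# *  : zero or more y's
star = [bool(re.fullmatch(r"ruby*", s)) for s in candidates]
# +  : one or more y's
plus = [bool(re.fullmatch(r"ruby+", s)) for s in candidates]

# {3,5} : between three and five digits, as many as possible
three_to_five = re.findall(r"\d{3,5}", "12 123 1234567")
```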
Non-Greedy Repetition
Unlike greedy repetition, non-greedy repetition matches the minimum number of repetitions:
Sequence number | Example and explanation |
---|---|
1 | <.*> Greedy repetition: matches all of “<python>perl>”. |
2 | <.*?> Non-greedy repetition: matches only “<python>” in “<python>perl>”. |
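The difference can be seen directly on the sample string from the table. A minimal sketch:

```python
import re

text = "<python>perl>"

# Greedy: .* takes as much as it can, so the match runs to the last ">".
greedy = re.search(r"<.*>", text).group()

# Non-greedy: .*? takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.*?>", text).group()
```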
Grouping with Parentheses
Sequence number | Example and description |
---|---|
1 | \D\d+ No grouping: + repeats only \d. |
2 | (\D\d)+ Grouping: + repeats the \D\d pair. |
3 | ([Pp]ython(, )?)+ Matches “Python”, “Python, python, python”, etc. |
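How parentheses change what a quantifier repeats can be sketched as follows. The test strings are illustrative assumptions.

```python
import re

# Without grouping, + applies only to \d: one non-digit, then digits.
ungrouped_ok = re.fullmatch(r"\D\d+", "a123")
ungrouped_fail = re.fullmatch(r"\D\d+", "a1b2")

# With grouping, + repeats the whole non-digit/digit pair.
grouped = re.fullmatch(r"(\D\d)+", "a1b2")
last_pair = grouped.group(1)   # a repeated group keeps its last repetition

# Nested groups: the optional ", " belongs to each repetition.
listing = re.fullmatch(r"([Pp]ython(, )?)+", "Python, python, python")
```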
Backreference
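A backreference matches the same text that an earlier group matched.
Sequence Number | Example and Description |
---|---|
1 | ([Pp])ython&\1ails Matches “python&pails” or “Python&Pails”. |
2 | (['"])[^\1]*\1 A string enclosed in single or double quotes. \1 matches whatever the first group matched, \2 matches whatever the second group matched, and so on. Note that in Python a backreference is not recognized inside square brackets, where \1 is read as a character code. |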
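Backreferences can be sketched as follows. The quoted-string pattern here is a swapped-in variant that uses non-greedy .*? rather than the bracketed form from the table; it is an illustration, not the only way to write it.

```python
import re

# \1 must repeat exactly what group 1 matched ("p" or "P").
pails = re.compile(r"([Pp])ython&\1ails")
lower = pails.fullmatch("python&pails")
upper = pails.fullmatch("Python&Pails")
mixed = pails.fullmatch("Python&pails")   # "P" then "p": no match

# The closing quote must be the same character as the opening quote.
quoted = re.compile(r"""(['"]).*?\1""")
first_quoted = quoted.search("say \"hi\" and 'bye'").group()
mismatched = quoted.fullmatch("\"mixed'")
```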
Alternatives
Number | Example and Description |
---|---|
1 | python|perl Matches “python” or “perl” |
2 | rub(y|le) Matches “ruby” or “ruble” |
3 | Python(!+|\?) “Python” followed by one or more ! or one ? |
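The alternatives in the table can be tried out like this. A minimal sketch; note that re.findall returns the captured group rather than the whole match when the pattern contains a capturing group, so re.finditer is used where the whole match is wanted.

```python
import re

either = re.findall(r"python|perl", "perl and python")

# With a capturing group, findall returns only the group contents.
endings = re.findall(r"rub(y|le)", "ruby ruble rub")
# finditer gives match objects, so group() returns the whole match.
whole = [m.group() for m in re.finditer(r"rub(y|le)", "ruby ruble rub")]

samples = ("Python!!!", "Python?", "Python??", "Python")
bang = [bool(re.fullmatch(r"Python(!+|\?)", s)) for s in samples]
```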
Anchors
Anchors specify the position at which a match must occur.
Sequence Number | Example and Description |
---|---|
1 | ^Python Matches “Python” at the beginning of a string, or of a line when re.M is set. |
2 | Python$ Matches “Python” at the end of a string, or of a line when re.M is set. |
3 | \APython Matches “Python” at the beginning of a string. |
4 | Python\Z Matches “Python” at the end of a string. |
5 | \bPython\b Matches “Python” at a word boundary. |
6 | \brub\B \B is a non-word boundary: it matches “rub” in “rube” and “ruby”, but not “rub” on its own. |
7 | Python(?=!) Matches “Python” if followed by an exclamation point. |
8 | Python(?!!) Matches “Python” if not followed by an exclamation point. |
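The anchors and lookaheads above can be sketched on a small two-line sample. The text is an assumption chosen so that each anchor behaves differently.

```python
import re

text = "Python!\nPython"

start_only = re.findall(r"^Python", text)          # start of the string
each_line = re.findall(r"^Python", text, re.M)     # start of every line
string_start = re.findall(r"\APython", text, re.M) # \A ignores re.M

inside_word = re.search(r"\brub\B", "rube")   # "rub" followed by a word character
alone = re.search(r"\brub\B", "rub")          # "rub" at a word boundary: no match

# Lookaheads check what follows without consuming it.
followed = re.search(r"Python(?=!)", text)
not_followed = re.search(r"Python(?!!)", text)
```

The first lookahead finds the “Python” on the first line, and the second finds the one on the second line.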
Special Syntax with Parentheses
Sequence Number | Example and Description |
---|---|
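1 | R(?#comment) Matches “R”. Everything inside (?#...) is a comment. |
2 | R(?i)uby Intended to match “uby” case-insensitively. Python 3.11 and later reject a global flag that is not at the start of the pattern, so use the scoped form below. |
3 | R(?i:uby) Matches “R” followed by “uby” in any letter case. |
4 | rub(?:y|le) Groups only; no backreference is created. |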
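The parenthesized forms can be sketched as follows, using only the variants that Python's re module accepts in any position (inline comments, scoped flags, and non-capturing groups).

```python
import re

# (?#...) is ignored by the engine.
comment = re.fullmatch(r"R(?#comment)", "R")

# (?i:...) applies case-insensitivity only inside the parentheses.
scoped_ok = re.fullmatch(r"R(?i:uby)", "RUBY")
scoped_fail = re.fullmatch(r"R(?i:uby)", "rUBY")   # the "R" is still case-sensitive

# (?:...) groups without capturing; (...) captures.
non_capturing = re.search(r"rub(?:y|le)", "ruble").groups()
capturing = re.search(r"rub(y|le)", "ruble").groups()
```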