Python Regex

Posted on: September 23, 2022 at 03:22 PM

Python Regular expressions

Note: In this post, “alphanumeric characters” means the class [a-zA-Z0-9_], that is, letters, digits, and the underscore.

Note: Python automatically concatenates adjacent string literals that are separated only by whitespace.
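
This matters for regular expressions because a long pattern can be split into short, commented pieces. A minimal sketch follows; the name date_pattern and the sample input are made up for illustration.

```python
import re

# Adjacent string literals are joined by the parser, so a long pattern can be
# written as several short, commented pieces.
date_pattern = (
    r"\d{4}"  # year
    r"-"      # separator
    r"\d{2}"  # month
)

joined = date_pattern == r"\d{4}-\d{2}"
matched = re.match(date_pattern, "2022-09").group()
print(joined)   # True
print(matched)  # 2022-09
```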

Matching Characters

Complete list of the metacharacters

. ^ $ * + ? { } [ ] \ | ( )

Specifying a character class

Use of the metacharacter backslash, \

Regex   Match
\d      Matches any decimal digit; this is equivalent to the class [0-9].
\D      Matches any non-digit character; this is equivalent to the class [^0-9].
\s      Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
\S      Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
\w      Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
\W      Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].

These sequences can be included inside a character class. For example, [\s,.] is a character class that will match any whitespace character, a comma (‘,’), or a period (‘.’).
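
As a rough illustration of these escapes and of a class that contains one, here is a small sketch. The sample text is invented for the example.

```python
import re

text = "a1, b_2.\tc 3"  # "\t" here is a real tab character

digits = re.findall(r"\d", text)     # single decimal digits
words = re.findall(r"\w+", text)     # runs of alphanumeric characters
pieces = re.split(r"[\s,.]+", text)  # split on whitespace, commas, periods

print(digits)  # ['1', '2', '3']
print(words)   # ['a1', 'b_2', 'c', '3']
print(pieces)  # ['a1', 'b_2', 'c', '3']
```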

Matching any character except newline

The final metacharacter in this section is the dot (‘.’). It matches anything except a newline character, and there’s an alternate mode (re.DOTALL) where it will match even a newline. The dot is often used where you want to match “any character”.
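
The difference is easiest to see on a string that contains a newline. A minimal sketch, with a made-up two-line input:

```python
import re

text = "one\ntwo"

# By default "." stops at the newline, so each line is matched separately.
per_line = re.findall(r".+", text)

# With re.DOTALL, "." also matches the newline, giving one single match.
whole = re.findall(r".+", text, re.DOTALL)

print(per_line)  # ['one', 'two']
print(whole)     # ['one\ntwo']
```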

Repeating Things

Repetitions are greedy

The metacharacter * matches zero or more times. Another repeating metacharacter is +, which matches one or more times.

Another repeating qualifier, the question mark character, ?, matches either once or zero times; you can think of it as marking something as being optional.

The most complicated repeated qualifier is {m,n}, where m and n are decimal integers. This qualifier means there must be at least m repetitions, and at most n. You can omit either m or n. Omitting m is interpreted as a lower limit of 0, while omitting n results in an upper bound of infinity.

Equivalence of Metachar sequences

Note: It is better to use *, +, or ? when you can, simply because they’re shorter and easier to read.
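
The sketch below shows greedy matching and checks the usual equivalences on a few sample strings: {0,} behaves like *, {1,} like +, and {0,1} like ?. The inputs are illustrative only, and the loop is a spot check rather than a proof.

```python
import re

# Greedy: ".*" grabs as much as possible, so the match runs to the LAST "b".
greedy = re.match(r"a.*b", "aXbXb").group()

# {m,n}: at least 2 and at most 3 repetitions of "a".
bounded = re.match(r"a{2,3}", "aaaa").group()

# "?" makes the hyphen optional.
optional = re.findall(r"home-?brew", "homebrew home-brew")

print(greedy)    # aXbXb
print(bounded)   # aaa
print(optional)  # ['homebrew', 'home-brew']

# Spot-check that the brace forms agree with the shorthand forms.
pairs = [(r"a{0,}", r"a*"), (r"a{1,}", r"a+"), (r"a{0,1}", r"a?")]
equivalent = all(
    bool(re.fullmatch(braces, s)) == bool(re.fullmatch(shorthand, s))
    for braces, shorthand in pairs
    for s in ["", "a", "aa", "aaa"]
)
print(equivalent)  # True
```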

Compiling Regular Expressions

import re

a = re.compile(r"ABC", re.IGNORECASE)  # compile a case-insensitive pattern
b = a.match("ABCxfhh")                 # match() only looks at the start of the string
print(b.group()) # prints ABC

The Backslash Plague

Solution: Python raw strings

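For example, in re.compile(r'\d+') the r prefix is needed because the pattern contains a backslash. In a raw string, Python leaves backslashes untouched, so they reach the regex engine as written.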
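
To make the problem concrete, the sketch below matches a literal backslash followed by “section”. The variable names and the sample text are made up for the example.

```python
import re

plain = "\\\\section"  # four backslashes are needed in a regular string literal
raw = r"\\section"     # two backslashes are enough in a raw string literal
same = plain == raw    # both spell the same regex: \\section

found = re.search(raw, r"see \section 2").group()
print(same)   # True
print(found)  # \section

# Without the r prefix, "\b" is a single backspace character,
# not the two-character word-boundary escape.
lengths = (len("\b"), len(r"\b"))
print(lengths)  # (1, 2)
```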

Performing Matches

The following methods can be invoked on a compiled pattern object.

Pattern object method   Description
match()                 Determine if the RE matches at the beginning of the string. Returns None if no match can be found; otherwise returns a match object containing information about the match: where it starts and ends, the substring it matched, and more.
search()                Scan through a string, looking for any location where this RE matches. Returns None if no match can be found; otherwise returns a match object, as match() does.
findall()               Find all substrings where the RE matches, and return them as a list.
finditer()              Find all substrings where the RE matches, and return them as an iterator of match objects.
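
A short sketch of the four methods on the same input. The pattern and text are chosen only for illustration.

```python
import re

p = re.compile(r"\d+")
text = "room 12, floor 3"

at_start = p.match(text)          # None: the string does not START with a digit
first = p.search(text).group()    # first match anywhere in the string
all_numbers = p.findall(text)     # every match, as a list of strings
spans = [(m.group(), m.span()) for m in p.finditer(text)]  # match objects

print(at_start)     # None
print(first)        # 12
print(all_numbers)  # ['12', '3']
print(spans)        # [('12', (5, 7)), ('3', (15, 16))]
```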

Methods on match Object

Match object method   Description
group()               Return the string matched by the RE.
start()               Return the starting position of the match.
end()                 Return the ending position of the match.
span()                Return a tuple containing the (start, end) positions of the match.
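
Because match() and search() return None when nothing matches, it is safest to test the result before calling these methods. A minimal sketch with a made-up input:

```python
import re

text = "123 abc 456"
m = re.search(r"[a-z]+", text)

if m is not None:
    print(m.group())           # abc
    print(m.start(), m.end())  # 4 7
    print(m.span())            # (4, 7)
    # The span indexes directly into the original string.
    print(text[m.start():m.end()] == m.group())  # True
```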

Module level methods
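
The re module also provides top-level functions such as re.match(), re.search(), re.findall() and re.sub(). They take the pattern as their first argument, so no explicit compile step is needed. A small sketch with invented inputs:

```python
import re

# Same behaviour as the pattern-object methods, with the pattern passed first.
starts_with_from = re.match(r"From\s+", "From amk Thu") is not None
number = re.search(r"\d+", "abc 42").group()
words = re.findall(r"\w+", "a bc")
squeezed = re.sub(r"\s+", " ", "a   b")

print(starts_with_from)  # True
print(number)            # 42
print(words)             # ['a', 'bc']
print(squeezed)          # a b
```

Compiling explicitly is mainly worthwhile when the same pattern is reused many times.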

Compilation Flags

Note: Multiple flags can be specified by bitwise OR-ing them.

Flag                          Description
ASCII, A                      Makes several escapes like \w, \b, \s and \d match only on ASCII characters with the respective property.
DOTALL, S                     Make . match any character, including newlines.
IGNORECASE, I                 Do case-insensitive matches.
LOCALE, L                     Do a locale-aware match. \w, \W, \b, \B and case-insensitive matching then depend on the current locale instead of just the ASCII set [a-zA-Z0-9_]. In Python 3 this flag can only be used with bytes patterns.
MULTILINE, M                  Multi-line matching, affecting ^ and $.
VERBOSE, X (for ‘extended’)   Enable verbose REs, which can be organized more cleanly and understandably.
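
As a sketch of combining flags with the bitwise OR operator, the example below needs both IGNORECASE and DOTALL to match. The input string is made up.

```python
import re

text = "HELLO\nWORLD"

# Both flags together: case is ignored AND "." may match the newline.
both = re.compile(r"^hello.world$", re.IGNORECASE | re.DOTALL)
matched_with_both = both.match(text) is not None

# With IGNORECASE alone, "." cannot cross the newline, so there is no match.
matched_with_one = re.match(r"^hello.world$", text, re.IGNORECASE) is not None

print(matched_with_both)  # True
print(matched_with_one)   # False
```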

When MULTILINE flag is NOT specified
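
Without the flag, ^ and $ anchor to the whole string rather than to each line. A minimal sketch on a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# "^" matches only at the very start of the string.
starts = re.findall(r"^\w+", text)

# "$" matches only at the very end of the string.
ends = re.findall(r"\w+$", text)

print(starts)  # ['first']
print(ends)    # ['line']
```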

When MULTILINE flag is specified
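
With re.MULTILINE, ^ and $ also match at the start and end of every line. A minimal sketch on a made-up two-line string:

```python
import re

text = "first line\nsecond line"

# "^" now matches at the start of the string and after each newline.
starts = re.findall(r"^\w+", text, re.MULTILINE)

# "$" now matches at the end of the string and before each newline.
ends = re.findall(r"\w+$", text, re.MULTILINE)

print(starts)  # ['first', 'second']
print(ends)    # ['line', 'line']
```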

When UNICODE flag is specified
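
In Python 3, Unicode matching is already the default for str patterns, so re.UNICODE changes nothing there; re.ASCII is the flag that switches it off. A small sketch, with accented sample text written using escapes to keep the source ASCII-only:

```python
import re

text = "na\u00efve caf\u00e9"  # the words "naive cafe" with accents on i and e

default_words = re.findall(r"\w+", text)              # Unicode-aware by default
unicode_words = re.findall(r"\w+", text, re.UNICODE)  # same result, flag is redundant
ascii_words = re.findall(r"\w+", text, re.ASCII)      # accented letters are not \w

print(default_words == [text[:5], text[6:]])  # True
print(unicode_words == default_words)         # True
print(ascii_words)                            # ['na', 've', 'caf']
```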

When VERBOSE flag is specified

import re

charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

More Metacharacters

zero-width assertions

They don’t cause the engine to advance through the string, instead, they consume no characters at all, and simply succeed or fail. For example, \b is an assertion that the current position is located at a word boundary; the position isn’t changed by the \b at all. This means that zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched an infinite number of times.

|: Alternation, or the “or” operator. If A and B are regular expressions, A|B will match any string that matches either A or B. | has very low precedence in order to make it work reasonably when you’re alternating multi-character strings. Crow|Servo will match either Crow or Servo, not Cro, a ‘w’ or an ‘S’, and ervo.

^: Matches at the beginning of lines. Unless the MULTILINE flag has been set, this will only match at the beginning of the string. In MULTILINE mode, this also matches immediately after each newline within the string.

$: Matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character.

\A: Matches only at the start of the string. When not in MULTILINE mode, \A and ^ are effectively the same. In MULTILINE mode, they’re different: \A still matches only at the beginning of the string, but ^ may match at any location inside the string that follows a newline character.

\b: Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word. A word is defined as a sequence of alphanumeric characters, so the end of a word is indicated by whitespace or a non-alphanumeric character. Inside a character class, where there’s no use for this assertion, \b represents the backspace character, for compatibility with Python’s string literals.

\B: Another zero-width assertion, this is the opposite of \b, only matching when the current position is not at a word boundary.
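
The assertions above can be tried out with a few short searches. This is only a sketch; the sample strings are chosen for illustration.

```python
import re

# \b: "class" must stand alone as a word.
whole_word = re.search(r"\bclass\b", "no class at all")
inside_word = re.search(r"\bclass\b", "the declassified algorithm")

# \B: "class" must NOT sit at a word boundary on either side.
not_boundary = re.findall(r"\Bclass\B", "the declassified algorithm")

# |: either alternative as a whole.
either = re.findall(r"Crow|Servo", "Tom Servo and Crow")

# ^ versus \A in MULTILINE mode.
line_starts = re.findall(r"^\w+", "From a\nTo b", re.MULTILINE)
string_start = re.findall(r"\A\w+", "From a\nTo b", re.MULTILINE)

print(whole_word.span())  # (3, 8)
print(inside_word)        # None
print(not_boundary)       # ['class']
print(either)             # ['Servo', 'Crow']
print(line_starts)        # ['From', 'To']
print(string_start)       # ['From']
```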

Grouping

>>> p = re.compile('(a(b)c)d')
>>> m = p.match('abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
>>> m.group(2,1,2)
('b', 'abc', 'b')
>>> m.groups()
('abc', 'b')
>>> m = re.match("([abc])+", "abc")
>>> m.groups()
('c',)

Backreferences

For example, the following RE detects doubled words in a string.

>>> p = re.compile(r'\b(\w+)\s+\1\b')
>>> p.search('Paris in the the spring').group()
'the the'

Non-capturing and Named Groups

Note: Except for the fact that you can’t retrieve the contents of what the group matched, a non-capturing group behaves exactly the same as a capturing group.

Sometimes you’ll want to use a group to collect a part of a regular expression, but aren’t interested in retrieving the group’s contents. You can make this fact explicit by using a non-capturing group, (?:…).
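
A small sketch of the difference. The URL pattern is deliberately simplified and made up for illustration; it is not a general URL parser.

```python
import re

# A capturing group keeps the last thing it matched ...
captured = re.match(r"([abc])+", "abc").groups()

# ... while a non-capturing group matches the same text but stores nothing.
not_captured = re.match(r"(?:[abc])+", "abc").groups()

# Non-capturing groups keep the numbering of the groups you DO want simple.
m = re.match(r"(?:https?)://([\w.]+)", "https://example.com/x")
host = m.group(1)

print(captured)      # ('c',)
print(not_captured)  # ()
print(host)          # example.com
```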

Named groups: Instead of referring to them by numbers, groups can be referenced by a name. A named group is written as (?P<name>...).

The regex to find duplicate words can now be written with a named group and a named backreference, (?P=name), instead of \1.
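
A sketch of the doubled-word pattern rewritten with a named group. The group name word is an arbitrary choice for this example.

```python
import re

# (?P<word>...) names the group; (?P=word) refers back to whatever it matched.
p = re.compile(r"\b(?P<word>\w+)\s+(?P=word)\b")
m = p.search("Paris in the the spring")

print(m.group())        # the the
print(m.group("word"))  # the
print(m.groupdict())    # {'word': 'the'}
```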

Lookahead Assertions

The following regex matches a filename that has an extension after the ‘.’, as long as the extension is not “bat”: .*[.](?!bat$)[^.]*$
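
The pattern can be checked against a few filenames. This is a sketch, and the filenames are just sample inputs.

```python
import re

# (?!bat$) is a negative lookahead: fail if the extension is exactly "bat".
p = re.compile(r".*[.](?!bat$)[^.]*$")

names = ["foo.bar", "autoexec.bat", "sendmail.cf", "printers.conf", "foo.batch"]
results = {name: p.match(name) is not None for name in names}

for name, ok in results.items():
    print(name, ok)
# foo.bar True
# autoexec.bat False
# sendmail.cf True
# printers.conf True
# foo.batch True
```

Note that “foo.batch” is accepted: the $ inside the lookahead means only the exact extension “bat” is rejected.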

Modifying Strings

Method/Attribute                      Purpose
split(string[, maxsplit=0])           Split the string wherever the RE matches, returning a list of the pieces. If capturing parentheses are used in the RE, then their values are also returned as part of the list.
sub(replacement, string[, count=0])   Find all substrings where the RE matches, and replace them with a different string. Returns the string obtained by replacing the leftmost non-overlapping occurrences of the RE in string with replacement.
subn()                                Does the same thing as sub(), but returns a 2-tuple containing the new string value and the number of replacements that were performed.

Split

>>> p = re.compile(r'\W+')
>>> p2 = re.compile(r'(\W+)')
>>> p.split('This... is a test.')
['This', 'is', 'a', 'test', '']
>>> p2.split('This... is a test.')
['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
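
The maxsplit argument limits the number of splits; the rest of the string is returned as the last element. A minimal sketch with a sample sentence:

```python
import re

p = re.compile(r"\W+")
text = "This is a test, short and sweet, of split()."

all_pieces = p.split(text)
first_three = p.split(text, maxsplit=3)  # stop after three splits

print(all_pieces)
# ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
print(first_three)
# ['This', 'is', 'a', 'test, short and sweet, of split().']
```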

Search and Replace

The example below replaces colour names with the word colour:

>>> p = re.compile('(blue|white|red)')
>>> p.sub('colour', 'blue socks and red shoes')
'colour socks and colour shoes'
>>> p.sub('colour', 'blue socks and red shoes', count=1)
'colour socks and red shoes'
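
For comparison, here is a sketch of subn() with the same pattern. It returns the new string together with the number of replacements made.

```python
import re

p = re.compile("(blue|white|red)")

replaced = p.subn("colour", "blue socks and red shoes")
unchanged = p.subn("colour", "no colours at all")

print(replaced)   # ('colour socks and colour shoes', 2)
print(unchanged)  # ('no colours at all', 0)
```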

Empty matches are replaced only when they’re not adjacent to a previous empty match. The output below is what Python 3.7 and later produce.

>>> p = re.compile('x*')
>>> p.sub('-', 'abxd')
'-a-b--d-'

If replacement is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \& are left alone, but since Python 3.7 an unknown escape made of a backslash and an ASCII letter, such as \j, raises an error. Backreferences, such as \6, are replaced with the substring matched by the corresponding group in the RE. This lets you incorporate portions of the original text in the resulting replacement string.

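In the replacement string, a group can be referenced by number, as \1 or \g<1>, or by name, as \g<name>. The following substitutions are all equivalent, but use all three variations of the replacement string.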

>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
>>> p.sub(r'subsection{\1}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<1>}','section{First}')
'subsection{First}'
>>> p.sub(r'subsection{\g<name>}','section{First}')
'subsection{First}'

If replacement is a function, the function is called for every non-overlapping occurrence of pattern. On each call, the function is passed a match object argument for the match and can use this information to compute the desired replacement string and return it. In the following example, the replacement function translates decimals into hexadecimal.

>>> def hexrepl(match):
...     "Return the hex string for a decimal number"
...     value = int(match.group())
...     return hex(value)
...
>>> p = re.compile(r'\d+')
>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
'Call 0xffd2 for printing, 0xc000 for user code.'