Regular Expressions
Regular expressions are useful tools in a programmer's arsenal for text and string matching. In the following sections, token will refer to a character, or a group of characters inside (), that needs to be matched.
\d will match a single digit
\D will match a non-digit (this is similar to [^0-9], explained below)
\w matches a word character (a-z, A-Z, 0-9 and _)
\W matches any non-word character (equivalent to [^a-zA-Z0-9_])
\s is for any whitespace character (tab \t, space, newline \n, carriage return \r, form feed \f or vertical tab \v)
\S matches any character that is not in the \s group of characters (in other words, non-whitespace characters)
[] allows matching any one of the characters placed between the square brackets
  [a-z] matches a single character that can be any lowercase English letter
  [a-z0-9] matches a single character that can be either a lowercase English letter or a digit
  [a-zA-Z0-9_] is the same as \w described earlier
  By placing a ^ character at the start, we can create a negative character class; [^a-z] will match anything except the lowercase English letters
+, *, ?
  + matches at least 1 instance of the previous token, up to as many as it can match (greedy)
  * on the other hand matches 0 or more instances
  ? matches 0 or 1 instance of the previous token
To match the previous token a specific number of times, {} can be used
  {num} will match the previous token exactly num times
  {num1,num2} matches the token between num1 and num2 times, as many times as possible (greedy)
  {num,} matches the token between num and unlimited times, as many times as possible, giving back as needed (greedy)
  {,num} will match between 0 and num times (Python's re accepts this form; some other regex flavors require {0,num})
. is a special wildcard character that matches anything except a line terminator (characters like \n, \r and others; in Python's re only \n is excluded by default)
^ and $ can be used to match the start and end of a line respectively (of the whole string, unless multiline matching is enabled)
() are used to enclose the desired set of characters as a group
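  To match literal parentheses, escape them with \ as \( or \)
  Groups can be referred back to by number as \1, \2 and so on
| can be used to separate such groups when any one of them should match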
  To match either batman or superman, we use batman|superman
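The building blocks listed so far can be tried out in a few lines. This is only a minimal sketch, assuming Python's re module; the sample strings and variable names are made up for illustration.

```python
import re

# Shorthand class plus a quantifier: one or more digits in a row.
digits = re.findall(r'\d+', 'room 42, floor 7')

# Negative character class: runs of anything that is not a lowercase letter.
non_lower = re.findall(r'[^a-z]+', 'abcXYZ123def')

# {num1,num2}: between 2 and 3 word characters, as many as possible.
chunks = re.findall(r'\w{2,3}', 'abcdefgh')

# Alternation: either word is accepted.
heroes = re.findall(r'batman|superman', 'batman met superman')

# Backreference: \1 must repeat whatever group 1 matched (a doubled word).
doubled = re.search(r'\b(\w+) \1\b', 'this is is a test')

print(digits, non_lower, chunks, heroes, doubled.group())
```

The \b word boundaries in the last pattern are an addition for the sketch; without them the backreference could also pair up the tail of one word with the next word.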
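To match any of the special characters .+?*|[](){}^$\ literally, we need to escape them, i.e. add a \ before the character
In Python, regular expressions are provided by the re module, loaded with import re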
The r prefix (raw string) is used before the expression string in most cases, so that we can write \d instead of \\d
regex_obj = re.compile(r'\d{3}-\d{4}-\d{3}') matches phone numbers of the form 123-4567-890
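A short sketch of what the r prefix saves, using the phone number pattern from the text. The sample sentence and the variable names other than regex_obj are assumptions made for the example.

```python
import re

# Without the r prefix every backslash has to be doubled.
plain = '\\d{3}-\\d{4}-\\d{3}'
raw = r'\d{3}-\d{4}-\d{3}'
same_pattern = (plain == raw)  # both spellings produce the same string

# Compile once, then reuse the pattern object for searching.
regex_obj = re.compile(raw)
found = regex_obj.search('call 123-4567-890 today')

print(same_pattern, found.group())
```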
Flags can be passed to re.compile for specifying search settings
  re.IGNORECASE for case-insensitive searching
  re.DOTALL to make . match line terminators as well
  re.VERBOSE ignores whitespace in the regex
  Multiple flags can be combined with |, e.g. re.compile(r'expression', re.IGNORECASE | re.DOTALL)
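The effect of the flags is easiest to see by compiling the same pattern with and without them. The following is a sketch with invented sample text; the comments inside the verbose pattern rely on re.VERBOSE also allowing # comments.

```python
import re

text = 'Batman\nBATMAN returns'

# Without flags the pattern is case sensitive and . stops at a newline.
strict = re.compile(r'batman.batman')
relaxed = re.compile(r'batman.batman', re.IGNORECASE | re.DOTALL)

strict_result = strict.search(text)    # no match
relaxed_result = relaxed.search(text)  # matches across the newline

# re.VERBOSE lets the pattern be laid out over several lines.
phone = re.compile(r'''
    \d{3}   # first block of digits
    -
    \d{4}   # second block
    -
    \d{3}   # third block
''', re.VERBOSE)
phone_result = phone.search('123-4567-890')

print(strict_result, relaxed_result.group(), phone_result.group())
```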
Searching is done on the object returned by re.compile using the following functions
  match() matches the expression only at the beginning of the string (returns None if there is no match, otherwise a match object is returned containing information about the match)
  search() is similar to match() but searches the entire string
  findall() tries to find all the matches through the entire string, returning them as a list
  finditer() is similar to findall(), but returns an iterator of match objects instead
The match object, returned by the above functions in case of a match, has the following available functions
  group() returns the string matched by the expression
    It accepts a group number, which defaults to 0, and 0 denotes the entire expression that was matched
    Groups enclosed in () are numbered 1, 2, and so on
    If group(i) does not exist, an IndexError is raised
  start() is the starting position of the matched string
  end() is the ending position of the matched string (the index just past its last character)
  span() is the tuple containing the start and end positions of the match
On the pattern object (returned by re.compile), the following functions allow executing specific tasks on the matched string
  split(string) splits the string at the matched expression and returns the list
  sub(replacement, string) substitutes every match of the expression with the replacement argument
  subn(replacement, string) does the same job as sub, but returns a tuple containing the new string and the count of replacements
These functions can also be called from the re module directly, but will take an additional first argument pattern, which is the pattern to search

Matching a phone number
>>> import re
>>> p = re.compile(r'\d{3}-\d{3}-\d{4}')
>>> p
re.compile('\\d{3}-\\d{3}-\\d{4}')
>>> p.search('My number is 415-555-4242.')
<re.Match object; span=(13, 25), match='415-555-4242'>
>>> p.search('My number is 415-555-4242.').group()
'415-555-4242'
>>> p.search('My number is 415-555-4242.').end()
25
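The difference between match() and search() described earlier can be seen on the same sentence. This is a small sketch; the second sample string and the variable names are made up for the example.

```python
import re

p = re.compile(r'\d{3}-\d{3}-\d{4}')
text = 'My number is 415-555-4242.'

at_start = p.match(text)    # None: the string does not begin with the number
anywhere = p.search(text)   # scans forward and finds the number
at_start_ok = p.match('415-555-4242 is my number')

# Check for None before using the result, since None has no group().
number = anywhere.group() if anywhere is not None else None

print(at_start, number, at_start_ok.span())
```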
Matching using groups
>>> p = re.compile(r'\d+')
>>> p.search('My number is 415-555-4242.')
<re.Match object; span=(13, 16), match='415'>
>>>
>>> p = re.compile(r'(\d+)')
>>> p.findall('My number is 415-555-4242.')
['415', '555', '4242']
>>> type(p.findall('My number is 415-555-4242.')[0])
<class 'str'>
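Where findall() only returns strings, finditer() yields match objects, so the position of each match is available too. A minimal sketch on the same sentence, with found as an illustrative name:

```python
import re

p = re.compile(r'\d+')
text = 'My number is 415-555-4242.'

# Collect the matched text together with its (start, end) span.
found = [(m.group(), m.span()) for m in p.finditer(text)]

print(found)
```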
Matching multiple groups
>>> p = re.compile(r'(\d+)-(\d+)-(\d+)')
>>> p.search('My number is 415-555-4242.')
<re.Match object; span=(13, 25), match='415-555-4242'>
>>> p.search('My number is 415-555-4242.').group(0)
'415-555-4242'
>>> p.search('My number is 415-555-4242.').group(2)
'555'
>>> p.search('My number is 415-555-4242.').group(4)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: no such group
>>>
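The same match object can be queried in a script rather than at the prompt. The sketch below also passes a group number to start() and end(), and catches the IndexError for a missing group. The variable names are illustrative only.

```python
import re

p = re.compile(r'(\d+)-(\d+)-(\d+)')
mo = p.search('My number is 415-555-4242.')

whole = mo.group(0)   # the entire match
first = mo.group(1)   # first parenthesised group
last = mo.group(3)    # third parenthesised group
first_span = (mo.start(1), mo.end(1))  # positions of group 1 only

# Asking for a group that does not exist raises IndexError.
try:
    mo.group(4)
    missing = False
except IndexError:
    missing = True

print(whole, first, last, first_span, missing)
```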
Matching all mentions of Batman and his vehicles
>>> p = re.compile(r'Bat(man|mobile|copter)')
>>> p.search('Batmobile, the vehicle of choice for Batman, has lost an engine.')
<re.Match object; span=(0, 9), match='Batmobile'>
>>> p.findall('Batmobile, the vehicle of choice for Batman, has lost an engine.')
['mobile', 'man']
>>> # wrap the whole pattern in an outer group so findall also returns the entire match
>>> p = re.compile(r'(Bat(man|mobile|copter))')
>>> p.findall('Batmobile, the vehicle of choice for Batman, has lost an engine.')
[('Batmobile', 'mobile'), ('Batman', 'man')]
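The split(), sub() and subn() functions described earlier have no example above, so here is a brief sketch. The sample strings and variable names are invented for illustration. The last line shows the module-level form, which takes the pattern as its first argument.

```python
import re

# split(): cut the string wherever the pattern matches.
comma = re.compile(r'\s*,\s*')
fields = comma.split('red , green,blue ,  yellow')

# sub() / subn(): replace every match; subn() also reports the count.
digit = re.compile(r'\d')
masked = digit.sub('X', 'My number is 415-555-4242.')
masked_again, count = digit.subn('X', 'My number is 415-555-4242.')

# Module-level call: same job, with the pattern passed first.
masked_direct = re.sub(r'\d', 'X', 'My number is 415-555-4242.')

print(fields, masked, count)
```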