Regex

Updated on 28 Dec 2022

regex

As I used to explain to my students, 'regular expressions' are like a black art. They are used for pattern matching, and in this chapter we’ll introduce you to the basics. You’ll often refer back to the cheat sheet, but the postcode and subject code examples are a great place to start learning about regular expressions.

regex - cheat sheet

Characters

character	legend	example	sample_match
\d	One digit from 0 to 9	blah_\d	blah_3
\w	ASCII letter, digit or underscore	blah-\w	blah-b
\s	Whitespace character: space, tab, newline, carriage return, vertical tab	blah\s\d	blah 3
\D	One character that is not a digit	\D	d
\W	One character that is not a word character	\W\W\W	+=*
\S	One character that is not a whitespace character	\S	b
.	Any character except line break	a.c	azc
\.	An actual period	a\.c	a.c
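To see the table in action, here is a small sketch that runs several of its rows through Python's re module. I'm using re.fullmatch so the whole sample has to match; the names samples, results and escaped_dot_matches_z are just for illustration.

```python
import re

# Each tuple is (pattern, sample string) taken from the rows above.
samples = [
    (r"blah_\d", "blah_3"),
    (r"blah-\w", "blah-b"),
    (r"blah\s\d", "blah 3"),
    (r"\W\W\W", "+=*"),
    (r"a.c", "azc"),
    (r"a\.c", "a.c"),
]

# fullmatch returns a match object or None; bool() turns that into True/False.
results = {pattern: bool(re.fullmatch(pattern, text)) for pattern, text in samples}
for pattern, matched in results.items():
    print(pattern, matched)

# An escaped period only matches a literal period, so "azc" is rejected.
escaped_dot_matches_z = bool(re.fullmatch(r"a\.c", "azc"))
print(escaped_dot_matches_z)
```

Note the r prefix on every pattern: a raw string stops Python from treating the backslash as its own escape character before the regex engine sees it.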

Quantifiers

quantifier	legend	example	sample_match
+	One or more	blah_\d+	blah_321
*	Zero or more times	A*B*C*	ACC
?	Once or none	blahs?	blah
{3}	Exactly 3 times	\w{3}	3b1
{2,4}	Two to four times	\d{2,4}	156
{3,}	Three or more times	\w{3,}	regex
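A quick way to get a feel for quantifiers is to pair each pattern with a string that should pass and one that should fail. The sketch below does that with re.fullmatch; the checks dictionary and its expected values are my own illustration, not part of any library.

```python
import re

# (pattern, text) -> the result we expect from a whole-string match.
checks = {
    (r"blah_\d+", "blah_321"): True,
    (r"blah_\d+", "blah_"): False,    # + needs at least one digit
    (r"A*B*C*", "ACC"): True,         # zero B's is fine for *
    (r"blahs?", "blah"): True,        # the s is optional
    (r"blahs?", "blahss"): False,     # but only one s is allowed
    (r"\w{3}", "3b1"): True,
    (r"\d{2,4}", "156"): True,
    (r"\d{2,4}", "15678"): False,     # five digits is too many
    (r"\w{3,}", "regex"): True,
}

# Run every pair and collect what the regex engine actually says.
outcomes = {key: bool(re.fullmatch(key[0], key[1])) for key in checks}

for (pattern, text), matched in outcomes.items():
    print(pattern, repr(text), matched)
```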

Character Classes

character	legend	example	sample_match
[…]	One of the characters in the brackets	[AEIOU]	E
[…]	One of the characters in the brackets	T[ao]p	Tap or Top
[a-z]	One of the characters in range from a to z	[a-k]	b

Example

pattern	string	match
[a-z]{4}\d{3}	comp306	yes
[a-z]{4}\d{3}	COMP30	no. The pattern needs lowercase letters and exactly 3 digits.
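The two rows above can be checked directly. This is a minimal sketch using re.fullmatch; the last line shows one possible way to accept uppercase codes as well, by passing the re.IGNORECASE flag, which is my addition and not something the table requires.

```python
import re

pattern = r"[a-z]{4}\d{3}"

lower_ok = bool(re.fullmatch(pattern, "comp306"))     # lowercase, 3 digits
upper_short = bool(re.fullmatch(pattern, "COMP30"))   # uppercase, 2 digits
upper_full = bool(re.fullmatch(pattern, "COMP306"))   # uppercase, 3 digits

# re.IGNORECASE makes [a-z] match uppercase letters too.
ignore_case = bool(re.fullmatch(pattern, "COMP306", re.IGNORECASE))

print(lower_ok, upper_short, upper_full, ignore_case)
```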

Anchors

anchor	legend	example	sample_match
^	Start of string	^abc	abcd
$	End of string	end$	the end
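Anchors are easiest to see with re.search, which looks for the pattern anywhere in the string unless an anchor pins it down. The variable names below are only for illustration.

```python
import re

starts = bool(re.search(r"^abc", "abcd"))          # abc is at the start
not_start = bool(re.search(r"^abc", "xabcd"))      # abc is there, but not at the start
ends = bool(re.search(r"end$", "the end"))         # end is at the end
not_end = bool(re.search(r"end$", "the ending"))   # end is followed by more text
unanchored = bool(re.search(r"abc", "xabcd"))      # no anchor, so anywhere will do

print(starts, not_start, ends, not_end, unanchored)
```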

Logic and Grouping

logic	legend	example	sample_match
|	OR operator	1|2	2
(…)	Capturing group	A(nt|pple)	Apple

Example

pattern	string	match
([a-z]{4}\d{3})|(\w{,4})	comp306	yes. matches first group
([a-z]{4}\d{3})|(\w{,4})	Soot	yes. matches second group
([a-z]{4}\d{3})|(\w{,4})	sooty	no, when the whole string must match. Fails the first group because the letters are not followed by 3 digits, and fails the second group because the string is longer than 4 characters
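The results in this table assume the whole string has to match. Here is a sketch of how that could be checked in Python with fullmatch, along with a way to see which group did the matching. The helper name which_group is hypothetical, and the {,4} shorthand is accepted by Python's re module but not by every regex flavour.

```python
import re

pattern = re.compile(r"([a-z]{4}\d{3})|(\w{,4})")

def which_group(text):
    """Return 1 or 2 for the group that matched the whole string, else None."""
    m = pattern.fullmatch(text)
    if m is None:
        return None
    # A group that did not take part in the match reports None.
    return 1 if m.group(1) is not None else 2

for text in ["comp306", "Soot", "sooty"]:
    print(text, which_group(text))

# Without the whole-string requirement, the second group happily
# matches the first four characters of "sooty".
partial = pattern.match("sooty").group()
print(partial)
```

One thing to watch: because {,4} starts from zero, the second group also accepts an empty string.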

example - postcodes

Imagine that we want to check whether a value matches the pattern for an Australian postcode.

postcode = 'b052'

How could we write something like "if the postcode is valid, do something"? A good way is with a regular expression, as shown in the example below.

import re

postcode = 'b052'

if re.match(r'^\d{4}$', postcode):
    print(postcode + " matches the regular expression")
else:
    print(postcode + " DOES NOT match the regular expression")

debugging

^ start of string
$ end of string

The fact that we are using both ^ and $ means that the entire string must match the pattern. If we only use one or the other it means that we are looking for a match at the start or end of the string.

\d any digit, 0-9
{4} exactly 4 of the preceding item (here, digits).

So this particular regular expression is looking for digit characters, exactly 4 of them, and it must match the entire string. \d is one of many special sequences that you can choose from.
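If the goal is simply "the entire string must be four digits", Python also offers re.fullmatch, which does not need ^ and $. The sketch below wraps it in a hypothetical helper called is_postcode. It also shows one small difference: in Python, $ will match just before a newline at the very end of the string, whereas fullmatch will not let that newline through.

```python
import re

def is_postcode(value):
    """Return True when value is exactly four digits and nothing else."""
    return re.fullmatch(r"\d{4}", value) is not None

for candidate in ["3052", "b052", "30520"]:
    print(candidate, is_postcode(candidate))

# $ tolerates a single trailing newline; fullmatch does not.
anchored_accepts_newline = re.match(r"^\d{4}$", "3052\n") is not None
fullmatch_accepts_newline = is_postcode("3052\n")
print(anchored_accepts_newline, fullmatch_accepts_newline)
```

Which one to use is a judgment call: the anchored version reads the same in most regex tools, while fullmatch is a little stricter for input that has not been stripped.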

{} sets how many times the preceding item must repeat.

{4} exactly 4
{1,4} from 1 to 4 occurrences
{,4} from zero to 4 occurrences
{4,} 4 or more occurrences
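To make the four forms concrete, this sketch tries each one against strings of 0 to 6 digits and records which lengths are accepted. The names forms and accepted are just for illustration.

```python
import re

forms = [r"\d{4}", r"\d{1,4}", r"\d{,4}", r"\d{4,}"]

accepted = {}
for form in forms:
    # "7" * n builds a string of n digits, e.g. "777" when n is 3.
    accepted[form] = [n for n in range(7) if re.fullmatch(form, "7" * n)]

for form, lengths in accepted.items():
    print(form, lengths)
```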

example - subject codes

At university, our subject codes were 4 alphabetic characters followed by 3 digits, with the first digit being 1, 2 or 3 to denote 1st, 2nd or 3rd year. Can we write a regex to validate this pattern?

import re

code = 'COMP306'

if re.match(r'^[A-Z]{4}(1|2|3)\d{2}$', code):
    print(code + " matches the regular expression")
else:
    print(code + " DOES NOT match the regular expression")

debugging

We’ve already covered some of the symbols used in this regex, so I’ll just give a brief explanation to the new ones we’re seeing in this example.

[] denotes a character class, here the range A to Z.
| is the or symbol, i.e. 1 or 2 or 3.
() is for grouping. Some parts of our pattern need to be grouped: we want 1|2|3, but we don’t want the last option blended with \d{2}, so we use the brackets to mark where the group starts and ends.
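It can help to see what goes wrong when the brackets are left out. In the sketch below, the ungrouped pattern is split by | into three separate alternatives, so a lone "2" is accepted and a real subject code is not. I've also included a character class version, [123], as one possible shorter spelling; the variable names are my own.

```python
import re

grouped = r"^[A-Z]{4}(1|2|3)\d{2}$"
ungrouped = r"^[A-Z]{4}1|2|3\d{2}$"    # means: ^[A-Z]{4}1  OR  2  OR  3\d{2}$
char_class = r"^[A-Z]{4}[123]\d{2}$"   # same idea as (1|2|3) for single characters

grouped_code = bool(re.match(grouped, "COMP306"))
grouped_two = bool(re.match(grouped, "2"))
ungrouped_code = bool(re.match(ungrouped, "COMP306"))
ungrouped_two = bool(re.match(ungrouped, "2"))
class_code = bool(re.match(char_class, "COMP306"))
class_fourth_year = bool(re.match(char_class, "COMP406"))

print(grouped_code, grouped_two)
print(ungrouped_code, ungrouped_two)
print(class_code, class_fourth_year)
```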

It is also worth noting that a character class [] can include multiple ranges, e.g.

[A-Za-z]

means any letter A to Z, upper or lower case. We can also add individual characters to the class. In this example we also include the underscore, period and comma in the list of acceptable characters.

[A-Za-z_.,]
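Here is a short sketch of that character class at work. I've added a + so that it applies to a whole string rather than a single character; the name allowed is hypothetical. Inside the square brackets the period is just a period, so it does not need a backslash.

```python
import re

allowed = re.compile(r"[A-Za-z_.,]+")

letters_and_punctuation = bool(allowed.fullmatch("Hello_world,ok."))
has_space = bool(allowed.fullmatch("Hello world"))   # a space is not in the class
has_digits = bool(allowed.fullmatch("abc123"))       # digits are not in the class

print(letters_and_punctuation, has_space, has_digits)
```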

regex - guided exercise

In an earlier exercise with dates, we used strings (that contained a date) and converted them to a date object and did stuff to it. We also know that things don’t work out too well for us if what we thought was a date couldn’t actually be converted to a date object. Let’s combine our knowledge of dates with regex to ensure that only a string that can be converted will be converted.

As a first pass, let’s give this regex a go.

import re

dateStr = '08/03/2019'

if re.match(r'^\d{2}/\d{2}/\d{4}$', dateStr):
    print(dateStr + " matches the regular expression")
else:
    print(dateStr + " DOES NOT match the regular expression")

Our dateStr passes the pattern, but unfortunately so does 88/14/2019 which definitely isn’t a valid date. Let’s see if we can do something about the months first (because it is easier), and then we’ll do the day.

months

Probably the easiest way for us to deal with the months is to use the | (or) with the 12 possible month values.

^\d{2}/(01|02|03|04|05|06|07|08|09|10|11|12)/\d{4}$
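As a sanity check, the month group can be tested on its own against every two-digit string from 00 to 99. The sketch below also tries a shorter spelling, (0[1-9]|1[0-2]), which is my suggestion rather than part of the pattern above, and confirms that both accept the same twelve values.

```python
import re

long_form = r"(01|02|03|04|05|06|07|08|09|10|11|12)"
short_form = r"(0[1-9]|1[0-2])"   # 01-09, or 10-12

# "00", "01", ... "99"
candidates = [f"{n:02d}" for n in range(100)]

long_ok = [c for c in candidates if re.fullmatch(long_form, c)]
short_ok = [c for c in candidates if re.fullmatch(short_form, c)]

print(long_ok)
print(long_ok == short_ok)
```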

days

I could follow the same pattern for the days as the month, but I don’t want to do all that typing and I’d rather see if I can put my regex skills to the test. For this one I am thinking of 2 distinct groupings.

First digit 0, 1 or 2 can be followed by a second digit 0-9.
First digit 3 can be followed by a second digit 0 or 1.

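^(((0|1|2)[0-9])|(3(0|1)))/(01|02|03|04|05|06|07|08|09|10|11|12)/\d{4}$

Both day alternatives sit inside one outer group, so the | between them applies only to the day and not to the rest of the pattern.

So our final solution becomes

import re

dateStr = '88/03/2019'  # hmmm, a bad date!

if re.match(r'^(((0|1|2)[0-9])|(3(0|1)))/(01|02|03|04|05|06|07|08|09|10|11|12)/\d{4}$', dateStr):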
    print(dateStr + " matches the regular expression")
else:
    print(dateStr + " DOES NOT match the regular expression")

This new regular expression does a much better job of weeding out strings that are bad dates, and for now we should be happy with the improvement we’ve made over the original ^\d{2}/\d{2}/\d{4}$. It certainly catches 88/03/2019. But be aware that it won’t catch 30/02/2019, the 31st of a month that has only 30 days, or a day of 00.
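One way to close that gap, sketched below, is to let the regex check the shape of the string and then hand it to datetime.strptime, which knows how many days each month has and raises ValueError for an impossible date. The helper name parse_date and the constant DATE_SHAPE are hypothetical, and this assumes the day/month/year order used in the examples above.

```python
import re
from datetime import datetime

# The regex only checks the shape: two digits, two digits, four digits.
DATE_SHAPE = re.compile(r"\d{2}/\d{2}/\d{4}")

def parse_date(date_str):
    """Return a date object for a valid dd/mm/yyyy string, otherwise None."""
    if DATE_SHAPE.fullmatch(date_str) is None:
        return None
    try:
        # strptime rejects impossible dates such as 30 February.
        return datetime.strptime(date_str, "%d/%m/%Y").date()
    except ValueError:
        return None

for candidate in ["08/03/2019", "88/03/2019", "30/02/2019", "8/3/2019"]:
    print(candidate, parse_date(candidate))
```

The regex still earns its keep here: strptime on its own would accept 8/3/2019, so the shape check is what enforces the two-digit day and month.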