Regex
Updated on 28 Dec 2022
As I used to explain to my students, 'regular expressions' are like a black art. They are used for pattern matching, and in this chapter we’ll introduce you to the basics. You’ll often refer back to the cheat sheet; however, the postcode and subject code examples are a great place to start learning about regular expressions.
regex - cheat sheet
Characters
character | legend | example | sample_match |
---|---|---|---|
\d | One digit from 0 to 9 | blah_\d | blah_3 |
\w | ASCII letter, digit or underscore (in Python 3 also Unicode letters and digits, unless re.ASCII is set) | blah-\w | blah-b |
\s | whitespace character: space, tab, newline, carriage return, vertical tab | blah\s\d | blah 3 |
\D | one character that is not a digit | \D | d |
\W | one character that is not a word character | \W\W\W | +=* |
\S | one character that is not a whitespace | \S | b |
. | any character except line break | a.c | azc |
\. | an actual period | a\.c | a.c |
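The rows above can be tried out directly in Python. This is only a small sketch: the `examples` list is a name made up for illustration, and `re.fullmatch` is used so that each sample has to fit its pattern from start to finish.

```python
import re

# Each pattern/string pair mirrors a row of the Characters table.
examples = [
    (r"blah_\d", "blah_3"),
    (r"blah-\w", "blah-b"),
    (r"blah\s\d", "blah 3"),
    (r"\W\W\W", "+=*"),
    (r"a.c", "azc"),
    (r"a\.c", "a.c"),
]

for pattern, text in examples:
    # re.fullmatch succeeds only if the whole string fits the pattern
    matched = re.fullmatch(pattern, text) is not None
    print(pattern, repr(text), matched)

# An escaped period matches only a literal '.', so 'azc' is rejected
escaped_dot_rejects_z = re.fullmatch(r"a\.c", "azc") is None
print(escaped_dot_rejects_z)
```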
Quantifiers
quantifier | legend | example | sample_match |
---|---|---|---|
+ | One or more | blah_\d+ | blah_321 |
* | Zero or more times | A*B*C* | ACC |
? | Once or none | blahs? | blah |
{3} | Exactly 3 times | \w{3} | 3b1 |
{2,4} | Two to Four times | \d{2,4} | 156 |
{3,} | Three or more times | \w{3,} | regex |
Character Classes
character | legend | example | sample_match |
---|---|---|---|
[…] | One of the characters in the brackets | [AEIOU] | E |
[…] | One of the characters in the brackets | T[ao]p | Tap or Top |
[a-z] | One of the characters in range from a to z | [a-k] | b |
Example
pattern | string | match |
---|---|---|
[a-z]{4}\d{3} | comp306 | yes |
[a-z]{4}\d{3} | COMP30 | no. looking for lowercase letters and must have 3 digits. |
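The two rows of this example can be checked with a few lines of Python. The names `subject_pattern` and `relaxed` are made up for this sketch, and the second half uses the `re.IGNORECASE` flag (not covered in this chapter) to show that ignoring case fixes the letters but not the missing digit.

```python
import re

subject_pattern = re.compile(r"[a-z]{4}\d{3}")

answers = {}
for candidate in ["comp306", "COMP30"]:
    # fullmatch: the whole string must be 4 lowercase letters then 3 digits
    answers[candidate] = "yes" if subject_pattern.fullmatch(candidate) else "no"
    print(candidate, answers[candidate])

# re.IGNORECASE relaxes the lowercase requirement but not the digit count
relaxed = re.compile(r"[a-z]{4}\d{3}", re.IGNORECASE)
upper_ok = relaxed.fullmatch("COMP306") is not None
short_ok = relaxed.fullmatch("COMP30") is not None
print(upper_ok, short_ok)
```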
Anchors
anchor | legend | example | sample_match |
---|---|---|---|
^ | Start of string | ^abc | abcd |
$ | End of string | end$ | the end |
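A quick way to see what the anchors do is to compare the same search with and without them. This sketch uses `re.search`, which looks for the pattern anywhere in the string, so any restriction on position comes from `^` and `$` alone. The variable names are illustrative.

```python
import re

text = "the end"

anywhere = re.search(r"end", text) is not None        # 'end' appears somewhere
at_start = re.search(r"^end", text) is not None       # but not at the start
at_end = re.search(r"end$", text) is not None         # it is at the end
starts_abc = re.search(r"^abc", "abcd") is not None   # 'abc' starts the string
only_abc = re.search(r"^abc$", "abcd") is not None    # the trailing 'd' breaks it

print(anywhere, at_start, at_end, starts_abc, only_abc)
```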
Logic and Grouping
logic | legend | example | sample_match |
---|---|---|---|
| | OR operator | 1|2 | 2 |
(…) | capturing group | A(nt|pple) | Apple |
Example
pattern | string | match |
---|---|---|
([a-z]{4}\d{3})|(\w{,4}) | comp306 | yes. matches first group |
([a-z]{4}\d{3})|(\w{,4}) | Soot | yes. matches second group |
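([a-z]{4}\d{3})|(\w{,4}) | sooty | no, if the whole string has to match (wrap the pattern in ^(…)$). does not match first group because the digits are missing. does not match second group because the string is longer than 4 characters. Without anchors, the second group would still match the leading soot |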
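The table can be reproduced in Python as a rough check. This sketch assumes the whole string has to match, which is what `re.fullmatch` enforces; the helper `which_group` is a hypothetical name. The last two lines show why that assumption matters: `re.match` only pins the start of the string, so it is satisfied by the first four letters.

```python
import re

pattern = r"([a-z]{4}\d{3})|(\w{,4})"

def which_group(text):
    """Return 1 or 2 for the capturing group that matched the whole string, else None."""
    m = re.fullmatch(pattern, text)
    if m is None:
        return None
    return 1 if m.group(1) is not None else 2

for text in ["comp306", "Soot", "sooty"]:
    print(text, which_group(text))

# re.match only anchors at the start, so it settles for the leading 'soot'
prefix = re.match(pattern, "sooty").group(0)
print(prefix)
```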
example - postcodes
Imagine that we want to check whether a value matches the pattern for an Australian postcode, which is exactly four digits.
postcode = 'b052'
How could I write something like if postcode valid: do something? The best way is with regular expressions, as shown in the example below.
import re
postcode = 'b052'
# raw string (r'...') so Python leaves the backslash in \d alone
if re.match(r'^\d{4}$', postcode):
    print(postcode + " matches the regular expression")
else:
    print(postcode + " DOES NOT match the regular expression")
debugging
- ^ start of string
- $ end of string
The fact that we are using both ^ and $ means that the entire string must match the pattern. If we only use one or the other it means that we are looking for a match at the start or end of the string.
- \d any digit, 0-9
- {4} exactly 4 of whatever comes just before it (here, \d).
So this particular regular expression is looking for digit characters, exactly 4 of them, and it must match the entire string. The \d is a special sequence, and there are many of them that you can choose from.
{} sets how many times the preceding item must occur.
- {4} exactly 4
- {1,4} from 1 to 4 occurrences
- {,4} from zero to 4 occurrences
- {4,} 4 or more occurrences
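The four forms above can be compared side by side. In this sketch each one is attached to \d and anchored with ^ and $, then tried against digit strings of length 0, 1, 4 and 5. The names `quantifiers`, `samples` and `accepted` are made up for the illustration.

```python
import re

quantifiers = ["{4}", "{1,4}", "{,4}", "{4,}"]
samples = ["", "1", "1234", "12345"]

accepted = {}
for q in quantifiers:
    pattern = r"^\d" + q + "$"
    # keep the samples that the anchored pattern accepts
    accepted[q] = [s for s in samples if re.match(pattern, s)]
    print(pattern, accepted[q])
```

Note that {,4} accepts the empty string, because zero occurrences is allowed.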
example - subject codes
At University our subject codes were 4 alpha characters followed by 3 digits with the first digit being 1, 2, or 3 to denote 1st, 2nd or 3rd year. Can we write a regex to validate this pattern?
import re
code = 'COMP306'
# 4 capital letters, then 1, 2 or 3, then two more digits
if re.match(r'^[A-Z]{4}(1|2|3)\d{2}$', code):
    print(code + " matches the regular expression")
else:
    print(code + " DOES NOT match the regular expression")
debugging
We’ve already covered some of the symbols used in this regex, so I’ll just give a brief explanation to the new ones we’re seeing in this example.
- [] denotes a set of allowed characters, which can be written as a range. I.e. [A-Z] is any one character between A and Z.
- | or symbol. I.e. 1 or 2 or 3.
- () grouping. Some parts of our pattern need to be grouped. I.e. we want 1|2|3 but we don’t want the last option blended with \d{2}, therefore we use the brackets to signal where our groups are.
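To see why the brackets matter, this sketch compares the subject code pattern with and without them. The names `grouped`, `ungrouped` and `char_class` are illustrative. Without the brackets, the | splits the entire expression into three separate alternatives, which is not what we meant. Since each option is a single character, [123] is an equivalent way to write (1|2|3).

```python
import re

grouped = r"^[A-Z]{4}(1|2|3)\d{2}$"
# Without brackets this reads as:  ^[A-Z]{4}1   OR   2   OR   3\d{2}$
ungrouped = r"^[A-Z]{4}1|2|3\d{2}$"
# A character class does the same job as (1|2|3) for single characters
char_class = r"^[A-Z]{4}[123]\d{2}$"

outcome = {}
for code in ["COMP306", "COMP1", "COMP906"]:
    outcome[code] = (
        bool(re.match(grouped, code)),
        bool(re.match(ungrouped, code)),
        bool(re.match(char_class, code)),
    )
    print(code, outcome[code])
```

The ungrouped version rejects the valid COMP306 and accepts the invalid COMP1, which is exactly the blending the brackets prevent.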
It is also worth noting that [] can include multiple ranges. I.e. [A-Za-z] means any letter A - Z, upper or lower case. We can also add individual characters to the set. In this example we are also including the underscore, period and comma in the list of acceptable characters.
[A-Za-z_.,]
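Here is a small sketch of that last set in use. The + quantifier and `re.fullmatch` are added so that every character of the string has to come from the set; the name `allowed` is made up for the example. Inside the square brackets the period is an ordinary character, so it does not need a backslash.

```python
import re

# One or more characters, each a letter, underscore, period or comma
allowed = re.compile(r"[A-Za-z_.,]+")

verdicts = {}
for text in ["Hello,_World.", "hello world", "abc123"]:
    verdicts[text] = allowed.fullmatch(text) is not None
    print(repr(text), verdicts[text])
```

The second string fails because of the space, and the third because digits are not in the set.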
regex - guided exercise
In an earlier exercise with dates, we used strings (that contained a date) and converted them to a date object and did stuff to it. We also know that things don’t work out too well for us if what we thought was a date couldn’t actually be converted to a date object. Let’s combine our knowledge of dates with regex to ensure that only a string that can be converted will be converted.
As a first pass, let’s give this regex a go.
import re
dateStr = '08/03/2019'
# two digits, slash, two digits, slash, four digits
if re.match(r'^\d{2}/\d{2}/\d{4}$', dateStr):
    print(dateStr + " matches the regular expression")
else:
    print(dateStr + " DOES NOT match the regular expression")
Our dateStr passes the pattern, but unfortunately so does 88/14/2019, which definitely isn’t a valid date. Let’s see if we can do something about the months first (because it is easier), and then we’ll do the day.
months
Probably the easiest way for us to deal with the months is to use the | (or) to list the 12 possible two-digit month values.
^\d{2}/(01|02|03|04|05|06|07|08|09|10|11|12)/\d{4}$
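As a quick check of the month-only pattern, the sketch below runs it over a few date strings. The name `month_pattern` is illustrative. It also compares the long list of months with a shorter form, (0[1-9]|1[0-2]), which is offered here as an alternative spelling rather than part of the original solution.

```python
import re

month_pattern = r"^\d{2}/(01|02|03|04|05|06|07|08|09|10|11|12)/\d{4}$"

month_results = {}
for date_str in ["08/03/2019", "88/14/2019", "88/03/2019"]:
    month_results[date_str] = bool(re.match(month_pattern, date_str))
    print(date_str, month_results[date_str])

# A shorter spelling of the same 12 months: 01-09 or 10-12
long_form = r"^(01|02|03|04|05|06|07|08|09|10|11|12)$"
short_form = r"^(0[1-9]|1[0-2])$"
same = all(
    bool(re.match(long_form, f"{n:02d}")) == bool(re.match(short_form, f"{n:02d}"))
    for n in range(0, 100)
)
print(same)
```

The month 14 is now rejected, but the day 88 still gets through, which is the next thing to fix.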
days
I could follow the same pattern for the days as the month, but I don’t want to do all that typing and I’d rather see if I can put my regex skills to the test. For this one I am thinking of 2 distinct groupings.
- First digit 0, 1 or 2 can be followed by a second digit 0-9.
- First digit 3 can be followed by a second digit 0 or 1.
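^(((0|1|2)[0-9])|(3(0|1)))/(01|02|03|04|05|06|07|08|09|10|11|12)/\d{4}$
Note the extra pair of brackets around the two day groupings. Without them the | would split the whole expression in two, and any string that merely starts with a day from 00 to 29 (such as 08/14/2019) would pass.
So our final solution becomes
import re
dateStr = '88/03/2019' # hmmm, a bad date!
# day: 00-29 or 30-31, month: 01-12, year: any four digits
if re.match(r'^(((0|1|2)[0-9])|(3(0|1)))/(01|02|03|04|05|06|07|08|09|10|11|12)/\d{4}$', dateStr):
    print(dateStr + " matches the regular expression")
else:
    print(dateStr + " DOES NOT match the regular expression")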
This new regular expression does a good job of weeding out a string that is a bad date, and for now we should be happy with the improvement we’ve made on the original ^\d{2}/\d{2}/\d{4}$. It certainly does catch out 88/03/2019. But be aware that it won’t catch 00/03/2019, 30/02/2019, or the 31st of a month that only has 30 days.
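One way to close that remaining gap is to let the regex check the shape of the string and then let the date conversion itself have the final say. The sketch below is only one possible approach. It assumes the dates are in dd/mm/yyyy order, the function name `parse_date` is hypothetical, and the day alternatives are wrapped in one extra group so that the | applies to the day only.

```python
import re
from datetime import datetime

# Assumed format: dd/mm/yyyy, with the two day groupings bracketed together
DATE_PATTERN = re.compile(
    r"^(((0|1|2)[0-9])|(3(0|1)))/(01|02|03|04|05|06|07|08|09|10|11|12)/\d{4}$"
)

def parse_date(date_str):
    """Return a datetime for a dd/mm/yyyy string, or None if it is not a real date."""
    if not DATE_PATTERN.match(date_str):
        return None  # wrong shape, so do not even try to convert
    try:
        return datetime.strptime(date_str, "%d/%m/%Y")
    except ValueError:
        return None  # right shape, but no such day in that month

for candidate in ["08/03/2019", "88/03/2019", "30/02/2019", "00/03/2019"]:
    print(candidate, parse_date(candidate))
```

The regex rejects 88/03/2019 before any conversion is attempted, while 30/02/2019 and 00/03/2019 pass the regex and are then turned away by strptime.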