Finally Understand Regular Expressions: Regex isn't as hard as it looks

Alex Hyett
·Jan 10, 2023·

6 min read

There's nothing like a regular expression to strike fear in the heart of a developer.

Regular expressions (regex) are used for a lot of things, such as validating that a string is in the right format and grabbing certain parts of a string.

You can do simple string searches with regex, but obviously that's not what makes it powerful.

If you want to follow along with these examples, there is a great website called Regexr that I always use when testing regular expressions.

Special characters

There are a few different special characters you can use to help you with your searches.

  • \w - will match every alphanumeric character as well as underscores.
  • \d - will match all number characters.
  • \s - matches spaces, tabs and new lines.

We can turn these into the negative versions by using a capital letter:

  • \W - will match everything that is not alphanumeric or an underscore.
  • \D - will match everything that is not a number.
  • \S - will match everything that is not a space, tab or new line.
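
As a rough sketch of how these six characters behave, here is a small Python example. Python's built-in re module is an assumption on my part (the examples in this post are not tied to a language), and the sample string is made up for illustration.

```python
import re

# Hypothetical sample text containing letters, digits, an underscore,
# two spaces and a tab.
sample = "Order_42 shipped\tto Bob"

word_chars = re.findall(r"\w", sample)   # letters, digits and underscores
digits = re.findall(r"\d", sample)       # number characters only
whitespace = re.findall(r"\s", sample)   # spaces, tabs and new lines

non_word = re.findall(r"\W", sample)     # everything \w does not match
non_digits = re.findall(r"\D", sample)   # everything \d does not match
non_space = re.findall(r"\S", sample)    # everything \s does not match

print(digits)      # ['4', '2']
print(whitespace)  # [' ', '\t', ' ']
print(non_word)    # [' ', '\t', ' ']
```

Each capital-letter version picks up exactly the characters its lowercase partner leaves behind, so the two lists together always cover the whole string.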

There is also another special character that matches any single character in your string (except, by default, a new line), and that is the dot (.).

If you were to search for .at in the following sentence:

The cat sat on the mat at home.

It will match on cat, sat, mat, and also on at together with the space in front of it, because . matches the space too.
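
If you would rather check this in code than in Regexr, a minimal Python sketch (Python is an assumption here, not something the post prescribes) looks like this:

```python
import re

sentence = "The cat sat on the mat at home."

# "." stands for any single character, so ".at" needs exactly one
# character in front of "at". For the standalone word "at" that
# character is the space before it.
matches = re.findall(r".at", sentence)

print(matches)  # ['cat', 'sat', 'mat', ' at']
```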

Quantifiers

There are a few quantifiers you can use in regex to match multiple occurrences of a character or pattern.

  • * - match 0 or more of the preceding pattern.
  • + - match at least 1 of the preceding pattern.
  • ? - match 0 or 1 of the preceding pattern (it is basically optional).
  • {3} - matches exactly 3 occurrences of the preceding pattern.
  • {3,5} - match between 3 and 5 occurrences (3,4,5) of the preceding pattern.

Let's say we have the following text:

a aa aaa aaaa aaaaa aaaaaa

This is what we get with the following patterns:

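  • a* matches each run in full: a, aa, aaa, aaaa, aaaaa, aaaaaa. Because it allows 0 occurrences, most tools will also report empty matches in between.
  • a+ also matches all six, as each one contains at least 1 a.
  • a? matches a 21 times, once for each individual a in the text.
  • a{3} matches aaa 5 times: once each in aaa, aaaa and aaaaa, and twice in aaaaaa.
  • a{4,5} matches 3 times: aaaa, aaaaa, and the first five characters of aaaaaa.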
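
The counts above can be reproduced with a short Python sketch (assuming Python's re module; other engines should give the same non-empty matches). Patterns that allow 0 occurrences also return empty matches, so those are filtered out here to keep the comparison readable.

```python
import re

text = "a aa aaa aaaa aaaaa aaaaaa"

# a* and a? can match nothing at all, so drop the empty strings.
star = [m for m in re.findall(r"a*", text) if m]
plus = re.findall(r"a+", text)
optional = [m for m in re.findall(r"a?", text) if m]
exactly_three = re.findall(r"a{3}", text)
four_to_five = re.findall(r"a{4,5}", text)

print(plus)                # ['a', 'aa', 'aaa', 'aaaa', 'aaaaa', 'aaaaaa']
print(len(optional))       # 21
print(len(exactly_three))  # 5
print(four_to_five)        # ['aaaa', 'aaaaa', 'aaaaa']
```

The last result shows that quantifiers are greedy: in aaaaaa the pattern a{4,5} takes five characters rather than four, leaving a single a that is too short to match again.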

Character Sets

In some cases, we want to match a range of different characters. For this, we have character sets. To use a character set, we can put a range of characters in square brackets [].

Let's take our simple sentence again and look at an example:

The cat sat on the mat at home.

If we search for the pattern [cs]at we are going to match on cat and sat but not mat.

You can also do ranges of characters too. If we search for the pattern [a-p]at then we are going to match on cat and mat but not sat.

As with the special characters, it is also possible to look at the negative version of this by putting a ^ symbol at the start inside the brackets.

So doing [^a-p]at will match on sat, but also on at together with the space before it, as a space counts as a character too.
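
Here are the three character sets side by side as a small Python sketch (again assuming Python's re module purely for illustration):

```python
import re

sentence = "The cat sat on the mat at home."

either_c_or_s = re.findall(r"[cs]at", sentence)    # c or s, then "at"
a_to_p = re.findall(r"[a-p]at", sentence)          # any letter a..p, then "at"
not_a_to_p = re.findall(r"[^a-p]at", sentence)     # anything except a..p, then "at"

print(either_c_or_s)  # ['cat', 'sat']
print(a_to_p)         # ['cat', 'mat']
print(not_a_to_p)     # ['sat', ' at']
```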

Capture Groups

One of the main reasons for using regular expressions is because you want to extract a string from a bit of text.

For example, if you wanted to extract the domain from the following email address:

cat@alexhyett.com

We can use the following regular expression to match on this email address:

[\w-\.]+@([\w-]+\.[\w]{2,63})

Let's break this down, so we can see what it is doing:

  • [\w-\.]+ - The first part is matching on any alphanumeric character and underscore (as denoted by the \w) as well as a hyphen (-) and a dot (.). The dot here has been escaped with a backslash so that it doesn’t get confused with the . special character. These characters are matched one or more times, denoted by the +.
  • @ - is just matching the @ character.
  • [\w-]+ - is matching any alphanumeric character and underscore (as denoted by the \w) as well as a hyphen (-). These characters are matched one or more times, denoted by the +.
  • \. - is just matching the . character.
  • [\w]{2,63} - matches any alphanumeric character and underscore. I don’t think you can have underscores in top level domains so this should probably just be [a-zA-Z] but \w is simpler. This is then matched 2 to 63 times to allow for extensions such as uk, com, technology.

We have then added parentheses ( ) around everything after the @, up to the end of the pattern, to create a capture group.

When you use this regex in code you will be able to look at the groups and extract the domain e.g. alexhyett.com.
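
As a hedged sketch of what "looking at the groups" means in practice, here is a Python version. Two things are assumptions rather than part of the post: the helper name extract_domain is made up, and the first character set is written as [\w.-] because Python's re module is stricter than Regexr about a hyphen sitting between \w and \. inside square brackets. It uses a single \. between the name and the extension, as in the breakdown above.

```python
import re
from typing import Optional

# Same idea as the pattern in the post, adapted for Python:
# the hyphen is moved to the end of the first set so it is read
# as a literal hyphen rather than the start of a range.
EMAIL_PATTERN = re.compile(r"[\w.-]+@([\w-]+\.\w{2,63})")


def extract_domain(email: str) -> Optional[str]:
    """Return the captured domain, or None if the text does not match."""
    match = EMAIL_PATTERN.search(email)
    if match is None:
        return None
    # group(0) is the whole match; group(1) is the first capture group.
    return match.group(1)


print(extract_domain("cat@alexhyett.com"))  # alexhyett.com
print(extract_domain("not an email"))       # None
```

This is a teaching pattern rather than a full email validator, so treat a None result as "did not look like the example format" and nothing stronger.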

Lookahead and Lookbehind

This is where people start to switch off when it comes to regular expressions.

Positive and Negative Lookahead and Lookbehinds sound complicated, but they are not actually that hard.

A lookahead, or lookbehind, just looks for a particular pattern ahead or behind what you are looking for, without including it as part of the match.

Positive Lookahead

Let’s go back to our simple string and see how we can use a positive lookahead.

The cat sat on the mat at home.

Say we want to match on the letter o but only if it has the letter m after it.

To do this, we use the pattern o(?=m), which will match the o in home but not the o in on.

Negative Lookahead

We can also do the negative of this. If we wanted to find all occurrences of the letter o that do not have the letter m after them, we would use o(?!m), basically replacing the = with an !.

This would then match the o in on but not the o in home.
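
To see both lookaheads in action, here is a minimal Python sketch (Python's re module is assumed for illustration). It records where each match starts, which makes it easy to tell the o in on apart from the o in home.

```python
import re

sentence = "The cat sat on the mat at home."

# o followed by m: the m is checked but is not part of the match.
positive = [m.start() for m in re.finditer(r"o(?=m)", sentence)]

# o NOT followed by m.
negative = [m.start() for m in re.finditer(r"o(?!m)", sentence)]

print(positive == [sentence.index("ome")])  # True, the o in "home"
print(negative == [sentence.index("on")])   # True, the o in "on"

# The matched text is only ever the single letter o.
print(re.findall(r"o(?=m)", sentence))      # ['o']
```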

Positive Lookbehind

You can probably see where this is going now. A lookbehind, as the name suggests, looks backwards instead of forwards.

If we want to find all occurrences of the word at that are preceded by the letter c we can use the following pattern (?<=c)at. This will only match the at in cat but not any of the other occurrences.

Negative Lookbehind

Similarly, we can find the negative version of the positive lookbehind by changing the = to a !.

If we now search for (?<!c)at it will match on the at in sat, mat and at.
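
The two lookbehinds can be checked the same way. This is a small Python sketch (an assumed language choice) that records where each at starts:

```python
import re

sentence = "The cat sat on the mat at home."

# "at" preceded by c: only the one inside "cat".
after_c = [m.start() for m in re.finditer(r"(?<=c)at", sentence)]

# "at" NOT preceded by c: the ones in "sat", "mat" and the word "at".
not_after_c = [m.start() for m in re.finditer(r"(?<!c)at", sentence)]

print(after_c == [sentence.index("cat") + 1])  # True
print(len(not_after_c))                        # 3
```

As with lookaheads, the letter being checked (c here) never becomes part of the match, which is why every match is just at.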

Extra Tip

It is also possible to combine multiple patterns in one regular expression.

Let's say we want to find all the a characters that aren't preceded by an s as well as all the t characters.

We can use the OR symbol | and have a pattern that looks like this:

(?<!s)a|t
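
Run against the sentence used throughout this post, the combined pattern behaves like this in a quick Python sketch (Python is an assumption for illustration):

```python
import re

sentence = "The cat sat on the mat at home."

# The | splits the pattern into two alternatives:
#   (?<!s)a   an a that is not preceded by s
#   t         any lowercase t
matches = re.findall(r"(?<!s)a|t", sentence)

print(matches)             # ['a', 't', 't', 't', 'a', 't', 'a', 't']
print(matches.count("a"))  # 3 (the a in sat is skipped)
print(matches.count("t"))  # 5 (the capital T in The is not a match)
```

Note that the lookbehind only applies to the a on its side of the |, not to the t.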

You can do this, but it can get quite complicated if you're going to be chaining on lots of different expressions.

If you're doing these regular expressions in code, then I recommend that you split these out, just to make the regular expressions that much clearer.
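
One way to split things out, sketched in Python with made-up names (A_NOT_AFTER_S, ANY_T and find_positions are hypothetical, not from any library), is to give each pattern its own name and combine the results in code instead:

```python
import re
from typing import List

# Each pattern does one job and has a name that says what it is for.
A_NOT_AFTER_S = re.compile(r"(?<!s)a")
ANY_T = re.compile(r"t")


def find_positions(text: str) -> List[int]:
    """Return the sorted start positions matched by either pattern."""
    a_hits = {m.start() for m in A_NOT_AFTER_S.finditer(text)}
    t_hits = {m.start() for m in ANY_T.finditer(text)}
    return sorted(a_hits | t_hits)


sentence = "The cat sat on the mat at home."

# For comparison, the single combined pattern finds the same positions.
combined = [m.start() for m in re.finditer(r"(?<!s)a|t", sentence)]

print(find_positions(sentence) == combined)  # True
```

It is a few more lines, but each pattern can now be read, tested and changed on its own.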

I hope that demystifies regular expressions for you. If you like this post, you can also follow me on Twitter and Medium.

 