Regular Expressions: Introduction
posted in productivity by Wouter Van Schandevijl
You already know regex is a (very) useful skill? Skip to the Tutorial. If not, allow me to convince you…
Some people, when confronted with a problem, think ‘I know, I’ll use regular expressions.’ Now they have two problems. - Jeffrey Friedl’s Blog
Or the XKCD version:
Okay, maybe that was not the best way to start convincing you to learn regex.
The Why
You don’t know you need to know regex until you know regex.
Applicability
Regular Expressions are typically used for user input validation.
Another good fit is extracting, deleting and/or reorganising text.
Replace a dozen lines of Substring and IndexOf with one ‘pretty’ regex!
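To make that concrete, here is a minimal sketch in Python, where str.find and slicing play the role of IndexOf and Substring. The input format (key=value pairs separated by semicolons) and both function names are made up for illustration.

```python
import re
from typing import Optional


def get_id_manual(line: str) -> Optional[str]:
    """IndexOf/Substring style: locate the key, then slice up to the next ';'."""
    key = "id="
    start = line.find(key)
    if start == -1:
        return None
    start += len(key)
    end = line.find(";", start)
    if end == -1:
        end = len(line)
    return line[start:end]


# The same extraction as one pattern: 'id=' at the start of the line or
# right after a ';', followed by everything up to the next ';'.
ID_PATTERN = re.compile(r"(?:^|;)id=([^;]*)")


def get_id_regex(line: str) -> Optional[str]:
    match = ID_PATTERN.search(line)
    return match.group(1) if match else None
```

Note that the manual version is fooled by a key such as uid= that merely contains id=, while the regex states the boundary explicitly. Fixing that by hand costs a few more lines.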
Other usage in code might be suspect: does the readability really benefit from the regex, or does it just make things more complex?
Be sure to have Unit Tests for all input you’d like to support.
You’ll need them whenever requirements change or bugs need to be fixed.
Even with this seemingly limited applicability (if you don’t want to shoot yourself in the foot, that is),
regexes are still a very useful tool, as they can be used in many one-time actions:
- Find in (all project) files in your IDE or in a SQL DB
- Quickly find interesting stuff in big files (think logs, data dumps, CSVs, etc.)
- Convert data that came from A to what B likes
- Many a CLI tool or its parameters understands regex
- You can even use ‘regex’ in Excel and Word…
These are the reasons this series is posted under the productivity category.
Example
Whenever I fix a bug caused by a mistake that might have been made in other places as well, I might use a regex to find those. Even if the other code uses different variable names, code style, etc., the regex will still find them, at least if it was written in a way that takes the context into account.
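As an illustration only, suppose the bug was a comparison against None with == or != where "is None" was intended. The sketch below finds the same mistake elsewhere regardless of variable name or spacing. The bug, the sample lines and the function name are all hypothetical.

```python
import re

# Any identifier (optionally dotted), then == or !=, then None,
# with or without spaces around the operator.
SUSPECT = re.compile(r"\b[\w.]+\s*[!=]=\s*None\b")

SOURCE_LINES = [
    "if user == None:",
    "if order.customer!=None and order.total > 0:",
    "if user is None:",
    "result = None",
]


def find_suspects(lines):
    """Return (line number, text) for every line that repeats the mistake."""
    return [
        (number, text)
        for number, text in enumerate(lines, start=1)
        if SUSPECT.search(text)
    ]


for number, text in find_suspects(SOURCE_LINES):
    print(f"{number}: {text}")
```

Only the first two sample lines are reported. The correct "is None" check and the plain assignment are left alone, which is the "takes the context into account" part.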
Tooling
Don’t forget about tooling.
Getting direct feedback on what your regex matches in your dataset as you type it… It’s the best 😃
Any IDE worth its salt has this capability.
Learn to work with multiple cursors in your IDE. It is a strong and capable companion to regex:
Why write the perfect find & replace regex when you can write a simplistic regex off the top of your
head and handle the replacement details with multiple cursor trickery?
You want to avoid falling into the trap of spending more time on your regex than you would spend doing it manually.
Learning
RegExOne: An interactive tutorial!
Not necessarily learning resources, but still pretty handy: interactive RegEx testers. Paste a RegEx you found and match it against your input. Plus they explain the regex visually.
Mastering Regular Expressions
Even though the name of the book is Mastering Regular Expressions, it starts out really slowly and is perfectly readable for someone with no (or very little) regex experience.
The Why Not
Always be wary before resorting to regex 😉
Implementations
As if RegExes were not confusing enough in and of themselves, there are different implementations of regex. While all implementations are largely the same, you may still find yourself in a situation where the regex works perfectly in your IDE but not from your application.
Some examples (or a full list):
- Features not implemented. ex: named capture groups in JS (since added in ECMAScript 2018)
- \b vs <|> for example in Visual Studio
- Different behavior of ‘small’ syntax errors. ex: throwing vs silently continuing
- .NET has (?#comment), Java has not. Java has \Q...\E, .NET has not
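Python's re module makes a handy guinea pig for this, since it is yet another flavour. The sketch below checks which of a few patterns even compile. The helper name compiles is my own, and the results reflect my understanding of current CPython 3 releases, so try it on the version you actually run.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if this Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Named groups: Python spells them (?P<name>...), while the
# (?<name>...) spelling used by .NET and ECMAScript 2018 is rejected.
print(compiles(r"(?P<year>\d{4})"))
print(compiles(r"(?<year>\d{4})"))

# Inline comments are supported, \Q...\E quoting is not.
print(compiles(r"(?#a comment)\d+"))
print(compiles(r"\Qa.b\E"))

# The local substitute for \Q...\E is re.escape.
print(re.fullmatch(re.escape("a.b"), "a.b") is not None)
```

A pattern copied from a tester set to another flavour can therefore fail outright, or worse, be accepted and quietly mean something else.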
Inapplicability
HTML
There are the obvious ones: throwing regex against HTML or XML.
Use a parser instead. Then you can throw some jQuery against it, or whatever API your language exposes. It’s going to be a lot more readable, more maintainable and more robust to changes in the HTML.
I might get hate mail for posting this on the interwebs (or worse) but really: if you have some HTML or XML which you know always has the exact same structure and you need to do something simple, well… Why not? What’s the worst that can happen by using a regex to extract <SomeUniqueNode>SOME ID WITH KNOWN STRUCTURE</SomeUniqueNode> and saving yourself from importing a third-party library which is probably bloated and inefficiently written?
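For what it's worth, that "something simple" could look like the sketch below. It assumes the node appears once, has no attributes and contains no nested markup. The function name and the sample document are invented.

```python
import re
from typing import Optional

# [^<]* stops at the first '<', so the match can never run past the
# closing tag of this node.
NODE = re.compile(r"<SomeUniqueNode>([^<]*)</SomeUniqueNode>")


def extract_id(document: str) -> Optional[str]:
    """Return the text inside <SomeUniqueNode>, or None if it is absent."""
    match = NODE.search(document)
    return match.group(1).strip() if match else None


sample = "<Root><Other>x</Other><SomeUniqueNode> ABC-123 </SomeUniqueNode></Root>"
print(extract_id(sample))  # ABC-123
```

The moment the node gains attributes, CDATA or nesting, this stops working, and that is the point where the parser earns its place.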
Everywhere
I once worked with a senior developer who, I think, had just discovered regex.
And he had become so fond of them (or something?) that all the code he touched was littered with little regexes afterwards.
I mean, why keep str.split('|')[0] around when you can replace it with a regex, right?
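For the record, here are the two side by side in Python. Both function names are made up, and the regex is just one way such a rewrite could look, not his actual code.

```python
import re


def first_field_split(record: str) -> str:
    # Says exactly what it does: split on '|' and take the first piece.
    return record.split("|")[0]


def first_field_regex(record: str) -> str:
    # Same result: everything up to (but not including) the first '|'.
    # [^|]* can match the empty string, so re.match never returns None here.
    return re.match(r"[^|]*", record).group(0)


print(first_field_split("a|b|c"), first_field_regex("a|b|c"))  # a a
```

Same output, but only one of them can be read without stopping to think.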
To top it off, all his regexes were needlessly long; most of them could be written in 1/2 or even 1/3 of the length of his “style”.
Which is how we fluently arrive at…
Writing vs Reading
It’s funny because it’s true. Requirements change and well, good luck deciphering your own regex.
Or, heaven forbid, one someone else wrote! 😱
The REAL Why
Let’s be honest here…
Moving on
Did I already mention that you’ll want to have tests of the code using regexes, especially if they tend to change often? Don’t be shy to write them, even if they are the only ones in your repository.
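If you have never written one, such a test can be as small as the sketch below, which uses Python's built-in unittest module. The ISO-date pattern and the function name are placeholders for whatever your code actually validates.

```python
import re
import unittest

# Placeholder pattern: yyyy-mm-dd with a plausible month and day.
ISO_DATE = re.compile(r"\d{4}-(0[1-9]|1[0-2])-(0[1-9]|[12]\d|3[01])")


def looks_like_iso_date(text: str) -> bool:
    # fullmatch requires the whole input to match, with no stray characters.
    return ISO_DATE.fullmatch(text) is not None


class LooksLikeIsoDateTests(unittest.TestCase):
    def test_accepts_supported_input(self):
        for text in ("2024-01-31", "1999-12-01"):
            with self.subTest(text=text):
                self.assertTrue(looks_like_iso_date(text))

    def test_rejects_everything_else(self):
        for text in ("2024-13-01", "2024-1-1", "2024-01-31 ", "24-01-31", ""):
            with self.subTest(text=text):
                self.assertFalse(looks_like_iso_date(text))
```

Run it with python -m unittest. When a requirement changes or a bug report comes in, add the new input to one of the lists first and then touch the regex.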