How to craft a regular expression

Wed 13 Sep 2023

Disclaimer

This is an In Progress post. It is incomplete and of poorer quality than my other posts. It’s an experiment in encouraging me to publish more often.

The problem

Picture this. You’re searching a large amount of code for some kind of statement/expression that has a pretty non-specific form. Obviously, the best option would be to use some kind of matching tool that knows about the grammar/syntax of the programming language that the code is written in.

But we aren’t always lucky enough to have such a tool at hand, or maybe we’re searching through many code bases each with their own idiosyncratic build systems.

So we fall back on regular expressions and our favourite Unix tools, e.g. grep or rg/ripgrep.

Now it is well known that, in general, programming languages can’t be parsed using regular expressions (and if you try, you’re liable to summon unholy Eldritch beings). So you can’t hope to match every case of some general syntactic form. But what if you don’t care about 100% accuracy and you just want to do a “good enough” job?

Then by all means use a regular expression!

Making sure it matches

The problem is that, if the code base is big enough, a naive regex is going to miss a lot. You want to get the proportion of what you miss down to less than, say, 1%. Here is the process I follow:

Create a special text file that you’ll place all the weird examples in.
Find the weird examples by running more general regular expression searches and carefully checking whether the extra hits are valid.
Add the weird examples to the text file.
Change them up a bit. If the programming language allows it, do things like
- adding white space
- putting brackets around sub-expressions
- using weird identifiers containing special characters
Then make sure your regex matches these weird examples.

If you follow this process you can craft a convoluted (but pretty good!) regex in short order.
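
The process above can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the target (calls to a function named `connect`), both patterns, and the inline `EXAMPLES` string standing in for the text file are all hypothetical.

```python
import re

# Hypothetical target: calls to a function named `connect`.
# STRICT is a first attempt; LOOSE also tolerates white space before "(".
STRICT = re.compile(r"\bconnect\(")
LOOSE = re.compile(r"\bconnect\s*\(")

# Stand-in for the text file of weird examples, one example per line.
EXAMPLES = (
    "connect(host)\n"
    "connect (host)\n"
    "db.connect  ( host )\n"
    "(db.pool).connect(host)\n"
)


def unmatched(pattern, text):
    """Return the non-blank example lines that the pattern fails to match."""
    return [
        line
        for line in text.splitlines()
        if line.strip() and not pattern.search(line)
    ]


if __name__ == "__main__":
    for name, pattern in [("STRICT", STRICT), ("LOOSE", LOOSE)]:
        misses = unmatched(pattern, EXAMPLES)
        print(f"{name}: {len(misses)} example(s) missed")
        for line in misses:
            print("  MISSED:", line)
```

In practice you would read the examples from the text file instead of an inline string, and rerun the check every time you tweak the regex. Python's `re` is not identical to the regex flavours used by grep or rg, so treat a clean run here as a strong hint, not a guarantee.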

Or you could use `regexr`

Recently ƎDOↃ Security shared with me an online tool that lets you do the same thing I suggested above with a text file: regexr. You can collect examples of what you want to match in the text box at the top and then iteratively craft your regex until you have what you want. Be aware that there are subtle differences between regex flavours, so something you craft in regexr might not give the same result in grep or rg. Then again, in most cases, it will.
