How to craft a regular expression

Wed 13 Sep 2023


This is an In Progress post. It is incomplete and of poorer quality than my other posts. It’s an experiment in encouraging me to publish more often.

The problem

Picture this. You’re searching a large amount of code for some kind of statement/expression that has a pretty non-specific form. Obviously, the best option would be to use some kind of matching tool that knows about the grammar/syntax of the programming language that the code is written in.

But we aren’t always lucky enough to have such a tool at hand, or maybe we’re searching through many code bases each with their own idiosyncratic build systems.

So we fall back on regular expressions and our favourite Unix tools, e.g. grep or rg/ripgrep.

Now it is well known that, in general, programming languages can’t be parsed using regular expressions (and if you try, you’re liable to summon unholy Eldritch beings). So you can’t hope to match every case of some general syntactic form. But what if you don’t care about 100% accuracy and you just want to do a “good enough” job?

Then by all means use a regular expression!

Making sure it matches

The problem is that, if the code base is big enough, your first attempt at a regex is going to miss a lot. You want to get the percentage of what you miss down to less than, say, 1%.

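My suggestion: collect examples of what you want to match in a text file, then iteratively refine your regex against that file until it catches them all. If you follow this process you can craft a convoluted (but pretty good!) regex in short order.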
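Here is a minimal sketch of that kind of loop in Python. Everything in it is illustrative: the strcpy search target, the example lines, and the evaluate helper are hypothetical names I made up for this post, and in practice you would read the examples from your text file rather than hard-coding them. It also uses Python's re module, which is not identical to the grep or rg regex languages.

```python
import re

# Hypothetical examples of lines we want to find (calls to strcpy).
should_match = [
    "strcpy(dst, src);",
    "strcpy (dst, src);",
    "  strcpy(",
    "result = strcpy(buf, name);",
]

# Hypothetical examples of lines we do NOT want to find.
should_not_match = [
    "my_strcpy(dst, src);",
    "strcpy_s(dst, n, src);",
]


def evaluate(pattern, positives, negatives=()):
    """Return (missed positives, wrongly matched negatives, miss rate)."""
    regex = re.compile(pattern)
    missed = [s for s in positives if not regex.search(s)]
    false_hits = [s for s in negatives if regex.search(s)]
    miss_rate = len(missed) / len(positives) if positives else 0.0
    return missed, false_hits, miss_rate


# Iterate: start simple, look at what was missed, refine, repeat.
attempts = [
    r"strcpy\(",         # first guess
    r"\bstrcpy\s*\(",    # allow whitespace, require a word boundary
]

for pattern in attempts:
    missed, false_hits, miss_rate = evaluate(pattern, should_match, should_not_match)
    print(f"{pattern!r}: miss rate {miss_rate:.0%}")
    for line in missed:
        print(f"  missed:    {line!r}")
    for line in false_hits:
        print(f"  false hit: {line!r}")
```

Each time the real search turns up a case you did not anticipate, add it to the examples and run the check again. The miss rate is only as trustworthy as the examples are representative of the code base.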

Or you could use regexr

Recently ƎDOↃ Security shared with me an online tool that allows you to do the same as what I suggested above with a text file: regexr. You can collect examples of what you want to match in the textbox at the top and then iteratively craft your regex until you have what you want. Be aware that there are subtle differences between regex dialects, so something you craft in regexr might not give the same result in grep or rg. Then again, in most cases, it will.
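To give one concrete example of such a difference (as far as I know; check the documentation for your versions): ripgrep's default engine does not support look-around or backreferences unless you pass -P/--pcre2, and plain grep uses POSIX basic syntax, where + and ? are literal characters unless you switch to grep -E. The sketch below uses Python's re module, which does support look-ahead, to show a pattern that depends on it next to a rewrite that avoids it. The foo function name and the sample line are made up for illustration.

```python
import re

line = "x = foo(1) + foo_bar"

# Uses a look-ahead: match "foo" only when "(" comes next.
# Fine in Python's re, but rejected by ripgrep's default engine.
with_lookahead = re.compile(r"foo(?=\()")

# Rewrite without look-ahead: simply include the "(" in the match.
# This form uses only constructs that most regex dialects share.
without_lookahead = re.compile(r"foo\(")

print(with_lookahead.findall(line))
print(without_lookahead.findall(line))
```

Both patterns find the same place in the line; the only difference is whether the bracket is part of the matched text. For a grep-style search, where you just want the matching lines, that usually doesn't matter.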