Writing a 7000 character regex

Published: 2018 June 8

Note: This was originally a response on the sketch.systems forum to someone who wanted to name states with characters outside of ASCII. The details turned out to be sufficiently entertaining to publish here for posterity.

Sketch.systems specs are parsed by a GLL-parser (Clojure’s Instaparse), defined by an EBNF grammar. The relevant grammar rule for this story is:

word = !indent (#"\w+" | '-' | '*' | '?')+

where the \w is the familiar “word” regular expression character class.

As someone who only speaks English, I never thought much about this character class since it always matched what I wanted (that is to say, English words). Somewhere in the back of my mind, I assumed that since JavaScript has UTF-16 strings, its regex implementation would probably just “do the right thing” and match word characters outside of ASCII.

Nope — turns out the \w character class is equivalent to just [A-Za-z0-9_] (MDN docs).
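A quick check makes the limitation concrete. This is just a sketch; the sample strings are my own picks rather than anything from the original forum thread.

```javascript
// Without any Unicode help, \w only covers [A-Za-z0-9_].
const asciiWord = /^\w+$/;

// Sample state names (illustrative only).
const samples = ["hello", "state_1", "матрёшка", "naïve"];

// Pair each sample with whether the whole string consists of \w characters.
const results = samples.map((s) => [s, asciiWord.test(s)]);
console.log(results);
```

Only the first two samples pass; the Cyrillic word fails outright, and even the single ï in “naïve” is enough to reject an otherwise ASCII word.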

If you’re thinking “surely there’s a modern Unicode equivalent!”, you’re right! Unicode characters are assigned to general categories like “Letter, uppercase” or “Number, decimal digit”.

These categories can be referred to within Java regular expressions since JDK7, which I verified by using the regex [\p{L}\p{M}\p{N}]+ (L, M, and N for the Unicode Letter, Mark, and Number categories, respectively).

Chrome also supports these classes within regexes that have the “unicode” flag. For example, /[\p{L}\p{M}\p{N}]/u.test("матрёшка") evaluates to true in Chrome.
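Here is a slightly fuller sketch of the same idea, anchored so that the whole string has to consist of letters, marks, and numbers. It assumes an engine that implements Unicode property escapes; the sample strings are my own.

```javascript
// Requires an engine with Unicode property escapes and the "u" flag.
const unicodeWord = /^[\p{L}\p{M}\p{N}]+$/u;

const checks = {
  cyrillic: unicodeWord.test("матрёшка"),      // letters (category L)
  kanji: unicodeWord.test("日本語"),             // letters (category L)
  combining: unicodeWord.test("cafe\u0301"),   // "e" plus a combining acute accent (category M)
  arabicDigit: unicodeWord.test("\u0663"),     // Arabic-Indic digit three (category N)
  withSpace: unicodeWord.test("hello world"),  // the space is not a letter, mark, or number
};
console.log(checks);
```

Everything except the string containing a space should come back true.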

I’m calling out “Chrome” specifically here, because it does not yet work in other browsers: regex unicode property escapes are scheduled for ES2018 (see proposal).

So, how can we support the many attractive and intelligent people who use Sketch.systems with other browsers?

The best solution I could find was to create a regex with a character class that explicitly matches the underlying codepoints. E.g., [A-Za-z\xAA\xB5...] where A-Z and a-z match ASCII, then \xAA (the feminine ordinal indicator), \xB5 (the micro sign), and so forth.

Thankfully, most letter codepoints tend to be consecutive and can be matched with ranges, just like the ASCII A-Z and a-z. For example, the Greek letter μ (not the same thing as the micro sign!) is at \u03BC, which is matched by the range \u03A3-\u03F5: Σ-ϵ, capital Sigma to lunate epsilon. (This isn’t the entire Greek alphabet, just a portion; e.g., following the lunate epsilon (a letter) is the reversed lunate epsilon, which Unicoders have deemed part of the “math symbol” category rather than “letter” category.)
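To make the range idea concrete, here is a sketch using only the handful of entries mentioned above. It is a tiny illustrative subset, not the real regex, and the variable names are made up for this example.

```javascript
// A tiny subset of an explicit letter class: ASCII letters, the feminine
// ordinal indicator, the micro sign, and the Greek range Sigma..lunate epsilon.
const letterSubset = /^[A-Za-z\xAA\xB5\u03A3-\u03F5]+$/;

const greekMu = "\u03BC";              // Greek small letter mu
const microSign = "\u00B5";            // micro sign (looks alike, different code point)
const reversedLunateEpsilon = "\u03F6"; // classified as a math symbol, not a letter

const subsetChecks = {
  muInRange: letterSubset.test(greekMu),
  microListed: letterSubset.test(microSign),
  sameCodePoint: greekMu === microSign,
  sigmaToEpsilon: letterSubset.test("\u03A3\u03BC\u03F5"),
  reversedExcluded: letterSubset.test(reversedLunateEpsilon),
};
console.log(subsetChecks);
```

The mu matches because it falls inside the range, the micro sign matches because it is listed on its own, and the reversed lunate epsilon is rejected because the range stops one code point before it.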

All in all, the letter-matching regex comes in at a hair over 7000 characters, which is quite a deal, since it matches 125,419 characters!
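The reason the ratio works out so well is that each range costs only a few characters of regex source no matter how many code points it covers. The sketch below shows the arithmetic on a toy list of ranges. The helper names (toClassSource, countCodePoints) are hypothetical, and it assumes the “u” flag with \u{...} escapes for simplicity, which may well differ from how the real regex was generated.

```javascript
// Hypothetical helper: turn [start, end] code point pairs into character class source.
function toClassSource(ranges) {
  const hex = (cp) => `\\u{${cp.toString(16).toUpperCase()}}`;
  const body = ranges
    .map(([start, end]) => (start === end ? hex(start) : `${hex(start)}-${hex(end)}`))
    .join("");
  return `[${body}]`;
}

// Hypothetical helper: how many code points do the ranges cover in total?
function countCodePoints(ranges) {
  return ranges.reduce((total, [start, end]) => total + (end - start + 1), 0);
}

// Toy subset of letter ranges (the real list is far longer).
const sampleRanges = [
  [0x41, 0x5a],   // A-Z
  [0x61, 0x7a],   // a-z
  [0xaa, 0xaa],   // feminine ordinal indicator
  [0xb5, 0xb5],   // micro sign
  [0x3a3, 0x3f5], // capital Sigma to lunate epsilon
];

const classSource = toClassSource(sampleRanges);
const sampleLetters = new RegExp(`^${classSource}+$`, "u");
const covered = countCodePoints(sampleRanges);

console.log(classSource);
console.log(`${classSource.length} characters of source cover ${covered} code points`);
```

Scripts with large contiguous blocks of letters push the ratio much further than this toy list does, which is how a few thousand characters of source can cover over a hundred thousand code points.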

So, please use as many letters as you can in your sketches =)

Huge thanks to Mathias Bynens for writing a bunch of great articles about the JavaScript-meets-Unicode situation, and for his tool that expands regexes using unicode property escapes into regexes that match codepoints directly.