Parsing in anger

Hi friends,

The past few newsletters have been about architecture, so I thought this month we should revisit the wonderful world of computering. In particular, parsing.

Backstory

I use Quicken to manage my finances. And by “manage”, I mean import credit card transactions twice a month, add tags and notes to individual transactions, and aggregate the data to answer questions like: “How much did I spend on coffee last year?” ($78/month, on average).

Quicken for Mac used to do alright with this stuff, but over the past few years it has gotten cloud-fever.

Symptoms:

Even though it’s desktop software, you have to “sign in” with an account. Now when I refresh bank statements, it spends two minutes (literally) “uploading transactions”. Where? Who knows.
Now that it has my email, every time I open Quicken I get 5+ “Unusual Spending!” emails sent to my inbox.
Every few weeks, all of my accounts are automatically renamed “CREDIT CARD”.
My file has a “password” set, but when I open up Kevin Lynagh.quicken2015/data in Emacs I can see transaction details and my email address in cleartext. (Technically this isn’t a symptom of cloud-fever, but it doesn’t exactly inspire me with confidence about the security of their servers.)

I’m a patient man, and have put up with all of this nonsense. However, I recently reached my breaking point when some transactions started showing up in duplicate.

This threw off all of my coffee-tracking calculations, defeating the entire purpose of the software.

Okay, about that parsing

So, after losing all faith in Quicken, I decided to write my own, 100% cloud free, financial management software: Moneytron.

The highly-advanced Moneytron user experience will work as follows:

Open a beer
Manually sign into bank websites and download transaction activity
Drag ‘n drop transaction activity files into Moneytron
Use Moneytron interface to add tags, descriptions, and PDF reciepts
Use Moneytron query engine to estimate coffee-runway length

The first two steps are up to the human operator, but for the third step Moneytron needs to parse the transaction activity file.

For the banks that my girlfriend and I use (Bank of America, Chase, and Umpqua Bank), the “QBO” file extension contained the most raw data, so that’s what I decided to focus on.

Unfortunately, QBO is based on the OFX standard which is kinda like XML except:

There’s a weird block at the top of stuff that isn’t tags
Some end tags are optional (Umpqua has 'em, Chase doesn’t)

Here’s a snippet:

OFXHEADER:100
DATA:OFXSGML
VERSION:102
SECURITY:NONE
ENCODING:USASCII
CHARSET:1252
COMPRESSION:NONE
OLDFILEUID:NONE
NEWFILEUID:NONE
<OFX>
<SIGNONMSGSRSV1>
<SONRS>
<STATUS>
<CODE>0
<SEVERITY>INFO
</STATUS>
<DTSERVER>20160912120000[0:GMT]
<LANGUAGE>ENG
<FI>
<ORG>ISC
<FID>10898
</FI>
<INTU.BID>2430
</SONRS>
</SIGNONMSGSRSV1>
<CREDITCARDMSGSRSV1>
<CCSTMTTRNRS>
<TRNUID>1
<STATUS>
<CODE>0
<SEVERITY>INFO
<MESSAGE>Success
</STATUS>
<CCSTMTRS>
<CURDEF>USD
<CCACCTFROM>
<ACCTID>123456789
</CCACCTFROM>
<BANKTRANLIST>
<DTSTART>20160311120000[0:GMT]
<DTEND>20160911120000[0:GMT]
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20160710120000[0:GMT]
<TRNAMT>-4.00
<FITID>123456789
<NAME>CITY OF PORTLAND DEPT
</STMTTRN>
<STMTTRN>
<TRNTYPE>DEBIT
<DTPOSTED>20160527120000[0:GMT]
<TRNAMT>-8.25
<FITID>987654321
<NAME>SQ *TAILORED COFFEE
</STMTTRN>
...

Instead of going at it with a regular-expression sawzall, I thought I’d try making a real grown up parser.

I’d seen Mark Engelberg’s Instaparse, but never had the chance to use it in anger.

I’ll walk through the grammar I came up with:

<ROOT> = <CRAP>* ACCTID? <CRAP>* (STMTTRN <SPACE>*)+ <CRAP>*

This is my description of the file as I see it: there’s some crap I don’t care about, maybe followed by an account ID, then maybe some more crap, then a bunch of statement transactions, then finally some more crap.

Of course, crap is defined as anything that’s not a statement transaction or account id:

SPACE = #"\s+"
CRAP = !(STMTTRN | ACCTID) (#"." | SPACE)

and the things I care about are defined with tag literals and regular expressions:

ACCTID = <'<ACCTID>'> #"[0-9]*" <'</ACCTID>'>?

<STMTTRN> = <'<STMTTRN>'> <SPACE>* (statement_info <SPACE>*)+ <'</STMTTRN>'>
<statement_info> = TRNTYPE | DTPOSTED | TRNAMT | FITID | MEMO

TRNTYPE = <'<TRNTYPE>'> #"[0-9A-z]+" <'</TRNTYPE>'>?
DTPOSTED = <'<DTPOSTED>'> #"[0-9A-z.\[\]:]+" <'</DTPOSTED>'>?
TRNAMT = <'<TRNAMT>'> #"-?[0-9]+(\.[0-9]+)?" <'</TRNAMT>'>?
FITID = <'<FITID>'> #"[A-z0-9.\-]+" <'</FITID>'>?

MEMO = <'<MEMO>'> memo_middle+ <memo_end>
<memo_middle> = !memo_end #"."
memo_end = (<'</MEMO>'> | <'\n'>)

One thing to note is that, I can’t use the regex wildcard (AKA the dot) to match anything for the values:

TRNTYPE = <'<TRNTYPE>'> #".*" <'</TRNTYPE>'>?

because if I did this, the regular expression would consume the rest of the input string and the parse would fail.

However, I can’t describe the <MEMO> tag with a simple regex, because I don’t know what’ll be in there. It needs to be able to match spaces and stuff. So for that one I break the pattern up into pieces (memo_middle, memo_end) and let Instaparse match character-by-character and figure things out.

From this grammar and an input file, Instaparse returns a data structure that I can then process further (converting string amounts into integer cents, parsing dates, etc.).

Need for speed

Unfortunately, I couldn’t figure out how to make Instaparse go fast: a file with 600 transactions took more than 3 seconds to parse.

Given the extensive documentation, code quality, and general well-put-togetherness of Instaparse, this poor performance is almost certainly my fault. (Like driving around a Porche with the handbrake engaged.)

I suspect this is because the <CRAP>* chews through the file one character at a time until it hits an account ID (or maybe a statement transaction?), then it backs off and lets everything proceed. It may be backtracking a lot, so perhaps there’s some quadratic nonsense happening.

This is just a hunch — to really know, I’d need to understand how Instaparse works internally and or have it spit out a some kind of “step-by-step” log of what it’s doing.

I reached out to some thought-leaders for assistance (thx @brandonbloom and @zachtellman) and both of them had the same advice: “Ehh, just write your own parser using regular expressions” (Paraphrased).

The dark side

So I decided I’d try that out. After all, since I only care about extracting the account ID and statement transactions, I don’t need to parse the full file.

Here’s the function that scans a source string for an account ID:

(defn account-id
  [source]
  (when-let [[_ id _] (->> source
                           (re-seq #"<ACCTID>(.*?)(\s|</ACCTID>)")
                           first)]
    id))

The regular expression first matches <ACCTID>, then it grabs characters, then matches either whitespace or the closing tag. (I’m relying on the fact that the bank’s software will crash before mine if there’s an account ID that contains whitespace.)

Note that .* is a greedy match, whereas .*? is non-greedy — the question mark means “if the next part of the pattern can match, do that rather than continuing the repetition”.

This function runs in 0.5ms.

I handle the statement transactions in two phases.

First I extract each individual transaction using:

(re-seq #"<STMTTRN>([\s\S]*?)</STMTTRN>" source)

Same regex story as before, though I’m using [\s\S]*? instead of .*? since I need to match newlines within the <STMTTRN> tag (Umpqua prints everything on one line, while Chase puts each tag/value on its own line).

Then for each matching substring (AKA the guts of the <STMTTRN> tag), I use a single regex to match tag/value pairs:

(def re-tag
  (let [opening-tag "<(?!/)([^>]+)>"
        body "([\\s\\S]*?)"
        tag-or-end-lookahead "(?=\\s*<|$)"]
    (re-pattern (str opening-tag body tag-or-end-lookahead))))

Two things to note:

I’m building up the regex from named pieces so that when I look at it two weeks from now I’ll be able to understand what’s what.
The last bit is a “lookahead” that matches without consuming the token. If it were a regular match instead of a lookahead, then the input string <foo>1<bar>2 would get matched as <foo>1< and then the remaining part of the string, bar>2, would fail.

I then put the matches into a Clojure map, processing the values as necessary:

(defn stmtrn->map
  [stmtrn-source]
  (->> stmtrn-source
       (re-seq re-tag)
       (reduce (fn [m [_ tag value]]
                 (case tag

                   "TRNTYPE"
                   (assoc m :transaction/type value)

                   "TRNAMT"
                   (assoc m :transaction/amount (str->cents value))

                   "and so on"))
               {})))

All-in-all, this takes about 15 ms — or 200x faster than my Instaparse attempt.

Takeaways

I’m not sure if the speed difference between the two approaches is something fundamental / algorithmic, or if it’s just the difference between performance-tuned, language built-ins (regular expressions) vs. a general purpose library (Instaparse).

One thing I appreciated from both tools is that they support incremental progress in the Clojure REPL:

I used three different regular expressions to extract what I needed, and composed them together with Clojure functions.
Instaparse’s :start optional argument allowed me to test parts of my grammar against input substrings (like the <STMTTRN> tags)

However, in this particular problem I only wanted to extract pieces of the input file — not make sense of the whole thing — so regular expressions were a good fit for that situation.

I don’t know enough about Instaparse to say whether my grammar could be tweaked to ignore stuff as efficiently as regular expressions.

If ya’ll have any ideas, send 'em in.

Best,

Kevin