Published: 2022 April 8

In the spirit of working in public, I figured I’d share some application design problems I’ve been pondering, as well as half-baked potential solutions.

Background: Application state as a value

I came up as a programmer in the Clojure community, where I internalized the value of values. In particular, the usefulness of modeling an application’s entire state as a single value, which allows one to pass it around as an argument, print it out for examination, serialize it to disk, etc.

Furthermore, if this value is immutable — if change is modeled via functions that return new states rather than procedures that modify a single state in-place — it’s straightforward to implement various user interface patterns with minimal additional code.

For example, rather than implement “undo” by writing reverse-direction code that “un-does” for every possible operation, one can instead maintain references to previous states and just switch back. Similarly in the forward direction, rather than implement “previews” for, e.g., drag operations via special “add this transient offset to the last real position” logic, one can just create an entirely new application state with the held items “actually moved” to the cursor location and render that state instead. (I did this all the time in Subform and it meant I never had to worry about, e.g., inconsistencies between previews and what was actually allowed.)

The “just”s are doing some work in those examples (e.g., undo is usually a bit more involved since people don’t expect their viewport/scroll position to be undone), but I’ve yet to run into serious problems due to conceptual limitations of this logical model.
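As a minimal sketch of the history-based approach (using plain immutable maps for the app state, with no Datascript involved):

```clojure
;; Every past state is kept in a vector; "undo" is just popping
;; back to the previous value. `update-app` is any pure function
;; of state -> state.
(def history (atom [{:items []}]))

(defn transact! [update-app]
  (swap! history conj (update-app (peek @history))))

(defn undo! []
  (swap! history #(if (> (count %) 1) (pop %) %)))

(defn current-state []
  (peek @history))
```

(A real implementation would cap the history length and, per the caveat above, probably keep "undoable" state separate from viewport/scroll state.)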

Rather, the challenges I’ve faced relate to performance and, to a lesser extent, code-organization considerations, which are the subject of the rest of this newsletter.

Application “databases”

Here’s how I structure apps in ClojureScript:

The word “database” can mean many things — a relational model, SQL, remote/distributed servers, durable persistence, query optimizers, etc. — but for my purposes making standalone (non-networked) apps, I just mean something that’s:

  1. easy to put data into
  2. easy to get data out of

Datascript works well on both fronts because it’s “embedded” in Clojure: You put data in using regular vectors and maps, and data comes out the same way; as vectors from a query or as map-like “entities” which can be manipulated by regular Clojure functions. There are no type coercions, no network round-trip latencies (since the entire database exists in the application’s memory space), and just like Clojure’s maps, vectors, and sets, the entire database is an immutable value.

Here’s an example, just to make things concrete; this database contains three “point” entities and two “line” entities that reference them:

(def db
  (-> (d/empty-db {:line/a {:db/valueType :db.type/ref}
                   :line/b {:db/valueType :db.type/ref}})
      (d/db-with [#:point {:db/id -1 :x 0 :y 0}
                  #:point {:db/id -2 :x 1 :y 1}
                  #:point {:db/id -3 :x 2 :y 2}
                  #:line {:a -1 :b -2}
                  #:line {:a -2 :b -3}])))

Note that the schema is at the attribute level (“columns” rather than “tables”) and is only required when an attribute needs interpretation (here, entity references); there is no need to specify the physical type of attributes, since in memory every value is just a pointer.

We can use the query API to look up the IDs of the lines (i.e., any entity with a :line/a attribute):

(d/q '{:find [?line-id]
       :where [[?line-id :line/a]]}
     db)
;; => #{[4] [5]}

and use the entity API to, say, pattern match and calculate the horizontal length of the line:

(clojure.core.match/match (d/entity db 4)
  {:line/a {:point/x x1} :line/b {:point/x x2}}
  (- x2 x1))
;; => 1

Again, because Datascript’s entities implement Clojure’s map lookup protocols, they “just work” with read operations like those used by core.match.

Making illegal states unrepresentable

How can we enforce an invariant like “every line must have two endpoints, a and b”?

Folks with a type checker like to “make illegal states unrepresentable” by modeling their data with algebraic data types. In databases with table-oriented schemas, a line table with “not null” constraints on the a and b columns could also be used.

What are my options for doing this in Datascript?

One option is runtime checking: At the end of the update-app function, examine the database for any entities with only a single :line/a or :line/b attribute and retract them before returning.

This could be done with a query, something like:

(d/q '{:find [?id]
       :where [(or (and [?id :line/a]
                        (not [?id :line/b]))
                   (and [?id :line/b]
                        (not [?id :line/a])))]}
     db)

but running such invariant-checking queries after every state update will cause performance issues once there are more than a handful of entities and/or invariants.

One could “hand-optimize” the invariant queries using the :tx-data (datoms added/removed) provided after database updates. For the example invariant, we’d be able to see directly from :tx-data the entities that had :line/a or :line/b attributes removed, and retract them without having to examine anything else in the database.
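A sketch of that hand-optimization for the example invariant (attribute names from the database above; the keyword access on datoms is the part to double-check against Datascript’s transaction-report API):

```clojure
;; After each transaction, scan the report's :tx-data for retracted
;; :line/a or :line/b datoms, and retract the now-dangling lines --
;; without examining the rest of the database.
(defn retract-dangling-lines-tx [{:keys [tx-data]}]
  (for [datom tx-data
        :when (and (not (:added datom))               ;; a retraction...
                   (#{:line/a :line/b} (:a datom)))]  ;; ...of an endpoint
    [:db/retractEntity (:e datom)]))
```

The returned tx-data could then be transacted as a follow-up step at the end of the update-app function.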

However, for more complex invariants like those between multiple entities or recursive relationships (e.g., a parent container cannot be moved into one of its descendants), it’s not clear that such hand-optimizations of the datom stream will even be possible, much less net out overall in terms of performance or logical clarity.

Luckily, I’m (mostly) immune to the pathology that drives some programmers to attempt to capture their entire domain within a type system or other “elegant” abstraction at the expense of shipping a working application.

So typically I lean on “less safe” approaches like

  1. just remembering that lines always need two endpoints, and when deleting points being sure to check for that (ya dummy!)
  2. writing slow (but obviously correct) invariant checks and running them during development / testing

Because Datascript’s transaction API accepts regular Clojure data, the usual language data abstraction and reuse mechanisms are available. E.g., I can write a delete-points-tx function that accepts points to delete and returns the transaction data to actually delete them, including any lines that would’ve otherwise been left dangling.
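Such a function might look like this (a sketch; the attribute names come from the example database above):

```clojure
(defn delete-points-tx
  "Returns transaction data retracting the given points, plus any
   lines that reference them (so no dangling lines are left behind)."
  [db point-ids]
  (let [line-ids (d/q '{:find [?l]
                        :in [$ [?p ...]]
                        :where [(or [?l :line/a ?p]
                                    [?l :line/b ?p])]}
                      db point-ids)]
    (concat (for [[l] line-ids]  [:db/retractEntity l])
            (for [p point-ids]   [:db/retractEntity p]))))
```

Because it returns plain data rather than transacting itself, callers can inspect, test, or combine it with other tx-data before committing — which is exactly where the composition trouble below comes in.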

However, such abstractions won’t necessarily compose — there will be trouble when create-line-tx and delete-points-tx reference the same point entity in the same transaction.

So if you know of other approaches to maintaining invariants at different points in the expressiveness / correctness / performance tradeoff space, please let me know!

Compile-time specialization

During prototyping and development, I love the flexibility of the RDF / Datomic / Datascript graph triple-store information model, which makes it easy to change up storage and retrieval patterns.

However, I need none of that flexibility at runtime — my application is a closed-world, with no plugin API, no need for arbitrary reflection, and no query prompt available to the end-user. So there must be substantial performance gains to be had from specializing code to the known-at-compile-time data shapes and access patterns.

A data-oriented design play would be to introduce value types (structs) to store data compactly — in contiguous memory with known offsets for each attribute, rather than as RDF triples. This would improve the performance of “fetch all entities of this type and their attributes” operations (common in rendering), at the expense of the schema flexibility described above.

Mechanical sympathy

Datascript is written in ClojureScript, which is compiled to JavaScript. I’m curious how much performance would improve if the storage and query engine were ported to WASM. One might get an order of magnitude estimate by comparing various textbook datalog queries of primitive values (e.g., transitive closure, connected graph components, group by aggregations, etc.) in Datascript versus Rust engines like Datafrog or Crepe compiled to WASM.

For interactive applications, there may also be gains to be had from an allocation-free, mutate-in-place API when lower latency is preferred over the ability to create derived immutable values. This might be a specialized API for value updates only (e.g., a dragging operation that updates item positions without creating/deleting any attributes) or a more general, transients-style API.

Differential datalog

Datalog makes it very easy to write recursive queries.

Following our geometry theme, if we had reference attributes :coincident/a and :coincident/b which “connect” two points (in addition to the :line attributes), recursive query rules could be used to find all groups of connected points (polylines) and which of those form cycles (i.e., closed shapes).
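Sketched with Datascript query rules (the :coincident attributes are hypothetical, as above):

```clojure
;; "connected" is the symmetric, transitive closure of the
;; coincident and line relations between points.
(def rules
  '[[(linked ?p ?q) [?e :coincident/a ?p] [?e :coincident/b ?q]]
    [(linked ?p ?q) [?e :line/a ?p] [?e :line/b ?q]]
    [(connected ?p ?q) (linked ?p ?q)]
    [(connected ?p ?q) (linked ?q ?p)]
    [(connected ?p ?q) (connected ?p ?x) (connected ?x ?q)]])

;; find every point connected to point 1:
(d/q '{:find [?q]
       :in [$ %]
       :where [(connected 1 ?q)]}
     db rules)
```

Detecting cycles (closed shapes) would take a further rule along these lines, which is exactly the kind of derived topological fact worth maintaining incrementally rather than recomputing per render.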

These are topological properties derived from the :line and :coincident attributes; they won’t change if points are repositioned (e.g., :point/x value changes) — so re-running the full query on every database update / render is unnecessary work. (It’s the same situation as the “two line endpoints” invariant check discussed earlier.)

The general problem of incremental query updates (AKA “materialized views”) is an active area of research, but open-source implementations like differential datalog have appeared in the past few years. (It compiles to WASM too!) Perhaps some of these techniques could be applied to Datascript.

See Jamie Brandon’s opinionated map for an overview of this space.

Developer experience

For me, the biggest, most interesting unknown of any of these approaches is how to retain the ergonomics and well-integrated feel of Rich Hickey’s Datomic API in a system that relies on a pre-compilation step. The vibe of a Clojure macroexpansion is quite different from “wait 2 minutes to rebuild WASM blob, then reload the page”.

I suspect Datascript’s success is in large part due to the philosophy behind its tagline: “What if creating a database would be as cheap as creating a Hashmap?”. After all, I’m not using it because it’s the fastest or has the most advanced query logic, but because it strikes an excellent balance and is a joy to use. It reminds me of the photography quip: “The best camera is the one you’re carrying.”

Since the entity API makes it easy to fall back to Clojure for expressing complex logic/algorithms, the main limitation I’ve run into building applications with Datascript is performance. However, I’m optimistic this could be improved using both general mechanically sympathetic techniques (via WASM) and specialized incremental view maintenance algorithms, while retaining the excellent developer experience.


Here’s a rough sketch of how I’d explore the above ideas:

  1. Benchmark suite: extract a few representative queries from my current projects; since I do interactive applications, I’d structure the problems as “how much data can we process in under 10ms?”

  2. Other people’s datalogs: Benchmark those queries in Datafrog, Crepe, and differential datalog

  3. WASM / ClojureScript bridge: Build a joins-only (no rules) query engine and entity API backed by a WASM triple store of primitive values only (i.e., numbers).

  4. Extend the above with a Rust persistent data structures version.

  5. If performance is reasonable (which I expect it would be, barring perhaps WASM overhead for entity lookups), merge the engine into a real application for improved performance. It’s worth noting that since Datascript can query any collection, we can fall back to it for more advanced queries that require rules, negation, or anything else not implemented in the fast/minimal version.

That’s it for now — definitely let me know if you are interested in collaborating!

Misc. stuff