← Back to Kevin's newslettersPublished: 2022 April 8

In the spirit of working in public, I figured I’d share some application design problems I’ve been pondering, as well as half-baked potential solutions.

Background: Application state as a value

I came up as a programmer in the Clojure community, where I internalized the value of values. In particular, the usefulness of modeling an application’s entire state as a single value, which allows one to pass it around an argument, print it out for examination, serialize it to disk, etc.

Furthermore, if this value is immutable — if change is modeled via functions that return new states rather than procedures that modify a single state in-place — it’s straightforward to implement various user interface patterns with minimal additional code.

For example, rather than implement “undo” by writing reverse-direction code that “un-does” for every possible operation, one can instead maintain references to previous states and just switch back. Similarly in the forward direction, rather than implement “previews” for, e.g., drag operations via special “add this transient offset to the last real position” logic, one can just create an entirely new application state with the held items “actually moved” to the cursor location and render that state instead. (I did this all the time in Subform and it meant I never had to worry about, e.g., inconsistencies between previews and what was actually allowed.)

The “just"s are doing some work in those examples (e.g., undo is usually a bit more involved since people don’t expect their viewport/scroll position to be undone), but I’ve yet to run into serious problems due to conceptual limitations of this logical model.

Rather, the challenges I’ve faced relate to performance and, to a lesser extent, code-organization considerations, which are the subject of the rest of this newsletter.

Application "databases”

Here’s how I structure apps in ClojureScript:

The word “database” can mean many things — a relational model, SQL, remote/distributed servers, durable persistence, query optimizers, etc. — but for my purposes making standalone (non-networked) apps, I just mean something that’s:

  1. easy to put data into
  2. easy to get data out of

Datascript works well on both fronts because it’s “embedded” in Clojure: You put data in using regular vectors and maps, and data comes out the same way; as vectors from a query or as map-like “entities” which can be manipulated by regular Clojure functions. There are no type coercions, no network round-trip latencies (since the entire database exists in the application’s memory space), and just like Clojure’s maps, vectors, and sets, the entire database is an immutable value.

Here’s an example, just to make things concrete; this database contains three “point” entities and two “line” entities that reference them:

(def db
  (-> (d/empty-db {:line/a {:db/valueType :db.type/ref}
                   :line/b {:db/valueType :db.type/ref}})
      (d/db-with [#:point {:db/id -1 :x 0 :y 0}
                  #:point {:db/id -2 :x 1 :y 1}
                  #:point {:db/id -3 :x 2 :y 2}
                  #:line {:a -1 :b -2}
                  #:line {:a -2 :b -3}])))

Note that the schema is on attribute-level (“columns” rather than “tables”) and is only required when an attribute requires interpretation (here, entity references); there is no need to specify the physical type of attributes, since in-memory every value is just “pointer”.

We can use the query API to look up the IDs of the lines (i.e., any entity with a :line/a attribute):

(d/q '{:find [?line-id]
       :where [[?line-id :line/a]]}
     db)
;; => #{[4] [5]}

and use the entity API to, say, pattern match and calculate the horizontal length of the line:

(clojure.core.match/match (d/entity db 4)
  {:line/a {:point/x x1} :line/b {:point/x x2}}
  (- x2 x1))
;; => 1

Again, because Datascript’s entities implement Clojure’s map lookup protocols, they “just work” with read operations like those used by core.match.

Making illegal states unrepresentable

How can we enforce an invariant like “every line must have two endpoints, a and b”?

Folks with a type checker like to “make illegal states unrepresentable” by modeling their data with algebraic data types. In databases with table-oriented schemas, a line table with “not null” constraints on the a and b columns could also be used.

What are my options for doing this in Datascript?

One option is runtime checking: At the end of the update-app function, examine the database for any entities with only a single :line/a or :line/b attribute and retract them before returning.

This could be done with a query, something like:

(d/q '{:find [?id]
       :where [(or (and [?id :line/a]
                        (not [?id :line/b]))
                   (and [?id :line/b]
                        (not [?id :line/a])))]}
     db)

but running such invariant-checking queries after every state update will cause performance issues for after not-too-many entities and/or invariants.

One could “hand-optimize” the invariant queries using the :tx-data (datoms added/removed) provided after database updates. For the example invariant, we’d be able to see directly from :tx-data the entities that had :line/a or :line/b attributes removed, and retract them without having to examine anything else in the database.

However, for more complex invariants like those between multiple entities or recursive relationships (e.g., a parent container cannot be moved into one its descendants), it’s not clear that such hand-optimizations of the datom stream will even be possible, much less net out overall in terms of performance or logical clarity.

Luckily, I’m (mostly) immune to the pathology that afflicts some programmers into attempting to capture their entire domain within a type system or other “elegant” abstraction at the expense of shipping a working application.

So typically I lean on “less safe” approaches like

  1. just remembering that lines always need two endpoints, and when deleting points being sure to check for that (ya dummy!)
  2. writing slow (but obviously correct) invariant checks and running them during development / testing

Because Datascript’s transaction API accepts regular Clojure data, the usual language data abstraction and reuse mechanisms are available. E.g., I can write a delete-points-tx function that accepts points to delete and returns the transaction data to actually delete them, including any lines that would’ve otherwise been left dangling.

However, such abstractions won’t necessarily compose — there will be trouble when create-line-tx and delete-points-tx reference the same point entity in the same transaction.

So if you know of other approaches to maintaining invariants at different points in the expressiveness / correctness / performance tradeoff space, please let me know!

Compile-time specialization

During prototyping and development, I love the flexibility of the RDF / Datomic / Datascript graph triple-store information model, which makes it easy to change up storage and retrieval patterns.

However, I need none of that flexibility at runtime — my application is a closed-world, with no plugin API, no need for arbitrary reflection, and no query prompt available to the end-user. So there must be substantial performance gains to be had from specializing code to the known-at-compile-time data shapes and access patterns.

A data-oriented design play would be to introduce value types (structs) to store data compactly; in contiguous memory with known offsets for each attribute, rather than RDF triples. This would improve the performance of “fetch all entities of this type and their attributes” operations (common in rendering), at the expense of:

Mechanical sympathy

Datascript is written in ClojureScript, which is compiled to JavaScript. I’m curious how much performance would improve if the storage and query engine were ported to WASM. One might get an order of magnitude estimate by comparing various textbook datalog queries of primitive values (e.g., transitive closure, connected graph components, group by aggregations, etc.) in Datascript versus Rust engines like Datafrog or Crepe compiled to WASM.

For interactive applications, there may also be gains to be had from an allocation-free, mutable-in-place API when lower-latency is preferred over the ability to create derived immutable values. This might be a specialized API for just value updates (e.g., a dragging operation that updates item positions without creating/deleting any attributes) or a more general, transients-style API.

Differential datalog

Datalog makes it very easy to write recursive queries.

Following our geometry theme, if we had reference attributes :coincident/a and :coincident/b which “connect” two points (in addition to the :line attributes), recursive query rules could be used to find all groups of connected points (polylines) and which of those form cycles (i.e., closed shapes).

These are topological properties derived from the :line and :coincident attributes; they won’t change if points are repositioned (e.g., :point/x value changes) — so re-running the full query on every database update / render is unnecessary work. (It’s the same situation as the “two line endpoints” invariant check discussed earlier.)

The general problem of incremental query updates (AKA “materialized views”) is an active research problem, but open source implementations like differential datalog have appeared in the past few years. (It compiles to WASM too!) Perhaps some of these techniques could be applied to Datascript.

See Jamie Brandon’s opinionated map for an overview of this space.

Developer experience

For me, the biggest, most interesting unknown of any of these approaches is how to retain the ergonomics and well-integrated feel of Rich Hickey’s Datomic API in a system that relies on a pre-compilation step. The vibe of a Clojure macroexpansion is quite different than “wait 2 minutes to rebuild WASM blob, then reload the page”.

I suspect Datascript’s success is in large part due to the philosophy behind its tagline: “What if creating a database would be as cheap as creating a Hashmap?”. After all, I’m not using it because it’s the fastest or has the most advanced query logic, but because it strikes an excellent balance and is a joy to use. It reminds me of the photography quip: “The best camera is the one you’re carrying.”

Since the entity API makes it easy to fall back to Clojure for expressing complex logic/algorithms, the main limitation I’ve run into building applications with Datascript is performance. However, I’m optimistic this could be improved using both general mechanically sympathetic techniques (via WASM) and specialized incremental view maintenance algorithms, while retaining the excellent developer experience.

TODO

Here’s a rough sketch of how I’d explore the above ideas:

  1. Benchmark suite: extract a few representative queries from my current projects; since I do interactive applications, I’d structure the problems as “how much data can we process in under 10ms?”

  2. Other people’s datalogs: Benchmark those queries in Datafrog, Crepe, and differential datalog

  3. WASM / ClojureScript bridge: Build a joins-only (no rules) query engine and entity API backed by a WASM triple store of primitive values only (i.e., numbers).

  4. Extend the above with a rust persistent data structures version.

  5. If performance is reasonable (which I expect it would be, barring perhaps WASM overhead for entity lookups), merge the engine into a real application for improved performance. It’s worth noting that since Datascript can query any collection, we can fall back to it for more advanced queries that require rules, negation, or anything else not implemented in the fast/minimal version.

That’s it for now — definitely let me know if you are interested in collaborating!

Misc. stuff