A Relational Spreadsheet

← Back to Kevin's homepagePublished: 2023 February 26

I tried doing an INNER JOIN in Google Sheets and it was so awkward I was compelled to build my own lightweight relational spreadsheet / language / system.

Timestamps:
 0:00 intro and motivation
 4:00 language semantics
 5:50 toy financial problem
12:00 symbolic variables

Thanks

I was inspired by:

The motivating example

In trying to tidy up my finances I had to rebalance my portfolio, but even though I had all of my trades loaded into a spreadsheet, I found it difficult to aggregate them by ticker / account / category and calculate the actionable bit — the stocks and quantities I needed to buy/sell to meet a desired allocation.

This example shows the basic pieces I wanted to play with.

def TradeWide (
  [ 'trade-0 2021-12-01 'appl    2    ]
  [ 'trade-1 2022-02-01 'appl    3    ]
  [ 'trade-2 2022-03-01 'appl    5    ]
  [ 'trade-3 2022-06-01 'msft    8    ]
  [ 'trade-4 2022-12-01 ?ticker  ?qty ]
)

def Trade      TradeWide[0]
def   date     TradeWide[0 1]
def   ticker   TradeWide[0 2]
def   qty      TradeWide[0 3]

def position sum [
  t: Trade
  t.ticker
  t.qty
]

def latest-px (
  ['appl 2.00]
  ['msft 3.00]
)

def position-value
  position * latest-px

def total-value
  sum position-value[-1]

def proportion
  position-value / total-value

proportion . 'appl = 0.5

The program is basically asking, “What trade should I make so that the mark-to-market value of my portfolio is 50% Apple?” (Answer: buy 2 appl.)

See the video demo above for more explanation.

Language

Everything acts on relations (sets of same-arity tuples).

  1
= (1)
= ([1])

Round parentheses union the contained relations and everything is first-order:

  (1 2 3)
= (1 (2 3))
= ([1] [2] [3])

Square brackets cross (cartesian product) the contained relations:

  [1 (2 3)]
= ([1 2] [1 3])

These can also include variable bindings and expressions:

  [x: (1 2)
   x
   x * 2]

= ([1 2]
   [2 4])

The empty set is “false”, so predicates can be used within square brackets to filter:

  [x: (1 2)
   y: (1 2)
   x + y = 3
   x
   y]

= ([1 2]
   [2 1])

The dot joins on the first column, consuming matches:

  [1 2] . ([1 3] 
           [5 6])
= ([2 3])

Aggregation and binary operators match tuples based on their “key” (all columns but their last) and operate on the tuple “value” (the last column).

sum ([1 2]
     [1 3]
     [2 0])

= ([1 5] 
   [2 0])

With binary operators, this feels a bit like array programming but with symbolic indexes rather than integer ones:

(['foo 2] 
 ['bar 4]) * (['bar 4]
              ['foo 1])

= (['bar 16] 
   ['foo 2])

The “empty key” matches everything, which allows binary operators and 1-arity tuples to act sensibly:

2 * (['foo 1]
     ['bar 2])
= (['foo 2]
   ['bar 4])

Symbolic varibles are prefixed with ? and can be used anywhere. They are first-order (scalar values like numbers, not relations) and can take on only a single value (even within square brackets). Here the constraint on ?x forces the value of ?s:

?x = ?s . sum (['foo 1]
               ['bar 2]
               [?s   3])
?x = 5
// => {s "bar", x 5}

Other examples:

[?y ?x] in min ([1 2]
                [1 3] 
                [2 0])
?y != 1
// => {y 2, x 0}

 

[?x ?y ?output] in [x: (1 2 3)
                    y: (3 4 5)
                    x 
                    y 
                    x * y]
maximize ?output
// => {x 3, y 5, output 15}

Implementation

This prototype is implemented in about 1000 lines of Clojure, leaning heavily on Instaparse and backtick to transform the formula syntax into forms suitable for Clojure’s eval.

I use Clojure’s binary operators, “lifting” them twice. First to handle symbolic variables:

(def s+ (lift-symbolic + '+ 0))

(assert (= 0          (s+)))
(assert (= 1          (s+ 1)))
(assert (= 2          (s+ 1 1)))
(assert (= '(+ 3 x)   (s+ 1 2 'x)))
(assert (= 'y         (s+ 'y)))
(assert (= '(+ x y)   (s+ 'x 'y)))

Then again to handle relations:

(def r+    (lift-relational s+))
(assert (= (->rel [4])         (r+ (->rel [1])         (->rel [3]))))
(assert (= (->rel ['(+ 1 x)])  (r+ (->rel [1])         (->rel ['x]))))
(assert (= (->rel [1 4])       (r+ (->rel [1 1])       (->rel [3]))))
(assert (= (->rel [1 2] [2 4]) (r+ (->rel [1 1] [2 2]) (->rel [1 1] [2 2]))))

The relational aggregation and join operations track logical conditions on tuple-by-tuple basis. For example, if [1 2] . [?x 3] join to create the tuple [2 3], it must be the case that ?x = 1. So the [2 3] tuple better remember that for future operations.

Conceptually, aggregations are reductions yielding a set of “aggregate tuples” whose keys are distinct. The reduction starts with an empty set and when adding each tuple there are two cases to consider:

  1. The tuple’s key doesn’t match anything, so the tuple is added to the aggregation set
  2. The tuple’s key does match, so the tuple’s value is combined with the matching aggregate tuple’s value

This is straightforward if everything is known. However, things get complicated with symbolic variables since whether or not a tuple matches may depend on:

  1. Any symbolic variables in the tuple keys
  2. Whether or not the tuple itself exists. I.e., even if all of a tuple’s values are ground (non-symbolic), the tuple itself may be the result of an operation that depends on symbolic variables

This leads to a lot of logical conditions, since to prevent double-counting each aggregate tuple’s conditions must assert both that the matching tuples matched and that the non-matching tuples didn’t.

Future work

While logical conditions are ad-hoc simplified during some execution steps (e.g., any conjunction containing false is simplified to false), they tend to become enormous after only a few joins/aggregations containing symbolic variables. I suspect performance could be greatly improved with an approach like equality saturation.

Set semantics have footguns around aggregations as intermediate sets will discard duplicates. For example, if position-value is (['msft $100] ['appl $100] ['tsmc $50]) then the total value calculation sum position-value[-1] will yield $150 rather than the correct $250. Some systems have dedicated syntax to solve this (e.g., Datomic’s with), but why not just avoid the problem entirely by using bag (multiset) semantics? (Of course, such a system would require dedicated syntax for deduplication.)

Rewrite in Rust, compile to WASM, and actually publish online so everyone can use it. =D