Published: 2026 May 28

Difftron

Last newsletter I wished for structural diffs in a Magit-like UI and, a few weekends later, it turned out as awesome as I’d imagined.

Behold, Difftron:

[Image: The Difftron UI in Emacs, showing changes in a file > entity type hierarchy with one function change expanded to show a side-by-side diff with highlighting]

See the 15 minute demo video for more details.

Conceptually, the tool works as follows:

This barely scratches the surface of the complex computer sciencey questions about tree diffing and heuristics for matching entities, but so far I don’t mind at all — as it turns out, just this basic scheme combined with a rich user interface already feels miles better than the standard file-and-line-based diffs I’ve been using. All of the UI credit goes to Magit and Emacs, for a few reasons:

First, Magit espouses the idea that everything can be interacted with:

Commands are invoked, not by typing them out, but by pressing short mnemonic key sequences. Many commands act on the thing the cursor is on, either by showing more detailed or alternative information about it in a separate buffer or by performing an action with side-effects on it.

In Difftron:

Second, Emacs is pretty much just text with different colors, so the default data-density is quite high. I’m sure Emacs is capable of supporting excessive amounts of whitespace to match even the most “minimal, clean, modern” (i.e., useless) web UI, but you’d have to work against the grain to mess things up.

Third, the entire system is “live”, meaning that I can imagine some new feature, prompt an LLM to implement it, and then test it without restarting Emacs.

I just defined:

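(defun difftron-reload ()
  "Unload Difftron if it is loaded, then load it again from source."
  (interactive)
  ;; A non-nil second argument forces the unload even if other
  ;; loaded libraries depend on the feature.
  (when (featurep 'difftron)
    (unload-feature 'difftron t))
  (load "/Users/dev/work/difftron/emacs/difftron.el"))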

and ran it whenever I wanted to test some changes.

This made the iteration loop quite fast, which made it low-friction and fun to polish away rough edges.

All in all, it took about 8 hours over two afternoons to knock out the initial implementation in Magit and record the first demo video, then another 16 hours polishing it as I used it myself and ran it by a few friends for critique.

The entire implementation was done by Codex with GPT-5.5 High (via my $20/month ChatGPT Plus subscription), and I’ve barely touched the code myself. I mainly provided guidance in terms of:

My agents.md instructions were minimal:

Same advice I give myself, honestly: Read the source code of what you’re using, put frequent/complex tasks into scripts, and specify the success condition before trying to implement it.

On open-sourcing a vibe-coded project

Last newsletter I said:

Mayyybe if I’m happy with it I’ll end up releasing something. But I’m not trying to collect Github stars or HN karma, so I might just happily use it in the privacy of my own home without trying to “commercialize it”.

While I am indeed happy with Difftron so far, I’ve hesitated about sharing it because of a few lingering questions in the back of my mind:

However, all the same questions apply to the Whispertron dictation app I vibe-coded last fall, and I’m glad I released that because I’ve heard from several folks who have been enjoying it daily, which makes me happy.

Furthermore, all these concerns revolve around setting expectations, which I can simply do:

So, between that and the general principle of increasing my luck surface area, I figure I should release Difftron and any other generally useful tool I might cook up.

LLM determinism

When delegating to humans, there’s a delicate balance between:

LLMs, though, can’t learn from mistakes, nor do they have creative flames that can be snuffed out by overbearing procedures, so I’m all about using them as cogs in a deterministic machine.

Contrary to what the majority of GitHub repos I’ve seen lately seem to assume, this cannot be done via the context window.

LLMs on their own cannot follow, for example, Graydon’s Not Rocket Science Rule of Software Engineering: “automatically maintain a repository of code that always passes all the tests”.

No matter how much you plead in markdown:

You MUST run test.sh before committing

there’s a chance they’ll just go ahead and commit anyway (or “fix” the failing test by deleting it, etc.).

If you want LLMs to follow a deterministic process, you must use them via a deterministic harness.

Understanding this idea is easy; the tricky part is actually building that harness.

The Not Rocket Science Rule, for example, is phrased in terms of git commits and usually implemented via a continuous integration server: Once a commit is pushed, a script attempts the merge, runs the tests, and updates the branch only if the merged code passed the tests.
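A minimal sketch of that gate in Python, assuming the test run and the branch update are each a single command. The function name `gated_update` and both commands are hypothetical, and a real CI server would also attempt the merge before running the tests.

```python
import subprocess
from typing import Sequence


def gated_update(test_cmd: Sequence[str], update_cmd: Sequence[str], cwd: str = ".") -> bool:
    """Run test_cmd, then run update_cmd only if the tests exit with status 0.

    Returns True if the update ran and succeeded, False otherwise.
    """
    tests = subprocess.run(test_cmd, cwd=cwd)
    if tests.returncode != 0:
        # Tests failed: the update command is never invoked,
        # so the branch is left untouched.
        return False
    update = subprocess.run(update_cmd, cwd=cwd)
    return update.returncode == 0
```

For example, `gated_update(["./test.sh"], ["git", "merge", "--ff-only", "candidate"])` would fast-forward only when `test.sh` exits with status 0. The branch name `candidate` is made up for illustration. The point is that the decision lives in the harness, not in the model's instructions.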

This does maintain the desired invariant, but the feedback loop is long: The continuous integration failure occurs many minutes after the code change.

This wastes time, tokens, and (much worse!) potentially a full turn of human feedback (if the agent stopped before getting the CI feedback).

This article on Stripe’s coding agents puts it nicely:

We seek to “shift feedback left” when thinking about developer productivity. That means that it’s best for humans and agents if any lint step that would fail in CI is enforced in the IDE or on a git push, and presented to the engineer immediately.

So how can we change the harness to shorten the feedback loop and give agents the chance to correct their mistake?

I have similar implementation questions about other useful invariants beyond the Not Rocket Science Rule. E.g., how do we ensure an agent doesn’t modify the tests or write slop to the README?
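One way to at least detect such edits is to hash the protected files before the agent runs and compare afterwards. This is a sketch under assumptions: the names `snapshot` and `changed_paths` and the glob patterns are hypothetical, and it only detects changes after the fact rather than preventing them.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List


def snapshot(root: str, patterns: Iterable[str]) -> Dict[str, str]:
    """Map each file matching the glob patterns (relative to root) to its SHA-256 digest."""
    base = Path(root)
    digests: Dict[str, str] = {}
    for pattern in patterns:
        for path in sorted(base.glob(pattern)):
            if path.is_file():
                key = path.relative_to(base).as_posix()
                digests[key] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def changed_paths(before: Dict[str, str], after: Dict[str, str]) -> List[str]:
    """Return protected paths that were added, removed, or modified between two snapshots."""
    all_paths = before.keys() | after.keys()
    return sorted(p for p in all_paths if before.get(p) != after.get(p))
```

A harness could take a snapshot with patterns such as `["tests/**/*", "README.md"]` before handing control to the agent, then reject the turn if `changed_paths` returns anything.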

I’m reminded of the dichotomy between static and dynamic language strategies:

For example, if we want an LLM to rename a bunch of methods/types/variables under the invariant “don’t change the program behavior”, we could:

When I’m programming, I’ll switch between these strategies depending on the specific problem and my mood. For agents, it’s not obvious to me whether one of these strategies in general dominates the other.

In terms of my work as an implementer/specifier, it’s probably easier for me to define invariants via runtime tests than by designing a set of “safe” tools:

Ultimately, which strategy to pursue is an empirical question about balancing the setup hassle (for me) against the combined performance of the model and harness.

My interest in invariants is, I hope, not some midwit folly (dimwit and genius: “Just ask the model for what you want”).

Nor does it stem from the religious purity / “safety” mindset that afflicts certain programmers. (You know, the ones whose aesthetic appreciation of category theory-inspired, borrow-checked phantom types has them constructing complex defenses against technically-possible-but-rare-in-practice “problems”.)

Rather, I’m stubbornly attached to the idea of understanding what I’m building. Having more constraints — immutable data, pure functions, limited scope, etc. — makes it easier to hold a system in my head. Trying to understand code written by other people (or one’s past self) is hard enough, so if I’m going to have any chance of understanding, shaping, and directing code written by a machine, I’ll take all of the invariants I can get =D

If you have any favorite approaches for building this kind of stuff — Linux sandbox APIs, version control hooks, build-your-own-harness libraries, etc. — or if you’re interested in exploring/collaborating in this space, please let me know!

Misc. stuff