Performance of Rust's match vs. lookup tables
Published: 2019 Jan 22

I’ve been getting into bioinformatics algorithms lately and ran across an interesting pull request that improved performance by changing a Rust match expression to a lookup table. Basically, changing:
fn match4(x: u8) -> u64 {
    match x {
        b'A' => 1,
        b'B' => 2,
        b'C' => 3,
        b'D' => 4,
        _ => 0,
    }
}
to
use lazy_static::lazy_static;

lazy_static! {
    static ref LOOKUP4: [u64; 256] = {
        let mut l = [0; 256];
        l[b'A' as usize] = 1;
        l[b'B' as usize] = 2;
        l[b'C' as usize] = 3;
        l[b'D' as usize] = 4;
        l
    };
}

fn lookup4(x: u8) -> u64 {
    LOOKUP4[x as usize]
}
This felt quite surprising to me since, well, the match is so simple — why isn’t the compiler already generating optimal code for it?
Most of my career has been on the web, which is a gazillion layers of abstraction away from what the hardware is doing. Since I’m trying to develop mechanical sympathy, I thought this would be a fun problem to dive into.
Benchmarking
The first thing I did was try to find a minimal test case that reproduced the results from the pull request. This turned out to be more difficult than expected, and I ended up looking at 36 variations:
(See the source and analysis code.)
What sticks out here is that the match is always faster than the lookup, except when inlining comes into play, and even then inlined lookup is only faster in the “mix with rotate” task for alphabet sizes of 4 and 8.
Based on this benchmark, the question we’ll be diving into is: What’s different about 4-letter inlined match/lookup between “mix” and “mix with rotate”?
A few notes on benchmarking first: I used the awesome Criterion crate.
However, it took me a while before I figured out how to run all of the variations I wanted (the product of lookup/match, 2/4/8-sized alphabets, and 3 different tasks).
I ended up with nested for loops across function pointers and explicit labels:
for (size, s, f_match, f_lookup) in vec![
    (2, &s2, match2 as fn(u8) -> u64, lookup2 as fn(u8) -> u64),
    (4, &s4, match4 as fn(u8) -> u64, lookup4 as fn(u8) -> u64),
    (8, &s8, match8 as fn(u8) -> u64, lookup8 as fn(u8) -> u64),
] {
    for (label, f) in vec![("match", f_match), ("lookup", f_lookup)] {
        {
            let s = s.clone();
            c.bench_function(
                &format!("sum,{},not inline,{}", label, size),
                move |b: &mut Bencher| {
                    b.iter(|| {
                        let mut sum = 0;
                        for x in s.as_bytes() {
                            sum += f(*x);
                        }
                        sum
                    })
                },
            );
        }
        //... the other two tasks
Note:
- the input string is 10,000 uniformly selected characters from the appropriate alphabet (i.e., the letters always matched, and the “not found” case was never run)
- the input string s was cloned so it could be moved into each benchmark closure; an explicit scope was created so that s would be dropped between each of the three tasks
- match/lookup is done via a function pointer (f: fn(u8) -> u64), so the compiler can’t inline it
- the simplest task is sum, since when I tried just the match/lookup, the compiler optimized out the match (because the result wasn’t being used, I guess?)
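As an aside, another way to keep the compiler from throwing away an unused result would be Criterion’s black_box. This is just a sketch of that alternative (not what I actually benchmarked; all the results above use the sum/mix tasks):

use criterion::{black_box, Bencher};

fn bench_bare(b: &mut Bencher, s: &str, f: fn(u8) -> u64) {
    b.iter(|| {
        for x in s.as_bytes() {
            // black_box tells the optimizer the value is "used",
            // so the call can't be optimized away
            black_box(f(*x));
        }
    });
}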
To get the inlined results, I copy/pasted all variations (see the benchmark source).
I’m sure this could be tidied up using Rust macros, but I find them syntactically terrifying and decided to go with the bounded yak-shaving of copy/paste. =)
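(If you’re braver than me, a macro along these lines could probably stamp out the copy/pasted “inlined” variants. A sketch only, assuming per-alphabet functions like match4 and lookup4 are in scope; I haven’t run it against the benchmark harness:)

macro_rules! sum_inlined {
    ($name:ident, $f:ident) => {
        fn $name(s: &str) -> u64 {
            let mut sum = 0;
            for x in s.as_bytes() {
                // direct call (no function pointer), so the compiler may inline it
                sum += $f(*x);
            }
            sum
        }
    };
}

sum_inlined!(sum_match_4, match4);
sum_inlined!(sum_lookup_4, lookup4);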
I couldn’t figure out how to run the benchmarks using my own Rust program (instead of cargo bench), so I manually cleaned up the results printed to the terminal1 and generated graphs in R.
Looking at the generated assembly
I posted this question to the Rust users forum, where someone suggested that I look at the generated assembly via the Godbolt Compiler Explorer.
I don’t know much about assembly; here’s my mental model:
- registers hold values and are fast to read/write
- memory holds values and is slow to read/write
- CPU instructions operate on the values in the registers and/or set hidden flags
- these flags determine if the CPU jumps to other sets of instructions
I decided to start by looking at the simplest functions: match and lookup with a 2-letter alphabet.
Here’s the match:
example::match2:
xor ecx, ecx #clear register ecx (see https://stackoverflow.com/questions/33666617/what-is-the-best-way-to-set-a-register-to-zero-in-x86-assembly-xor-mov-or-and/33668295 for why it's xor)
cmp dil, 66 #compare dil with 66, which is ASCII 'B'.
sete cl #if comparison was equal, set lowest byte of the rcx register to 1.
add rcx, rcx #add rcx to itself, which makes it 2 (which is the mapping we wanted from the match for 'B')
cmp dil, 65 #compare dil with 65, which is ASCII 'A'.
mov eax, 1 #move 1 into eax, which is lower 32 bits of rax.
cmovne rax, rcx #if comparison was not equal (i.e., dil wasn't 'A'), move rcx into rax.
ret
The lookup is a bit more complicated:
example::lookup2:
movzx eax, dil # moves 1-byte dil into lowest byte of 4-byte eax.
lea rcx, # load effective address into rcx
[rip + .L__unnamed_1] # the address is the instruction pointer plus the address of the lookup table
mov rax, # move into rax
qword ptr [rcx + 8*rax] # a pointer's worth of bytes (8) from the address given by `rcx + 8*rax`
ret
.L__unnamed_1:
.asciz "\000\000..." #a string with 8188 characters; each \000 is a byte, so there are 2047 total bytes here.
Grouping the .L__unnamed_1
lookup table data into chunks of 8 bytes, the 65th and 66th rows (counting from 0) are:
\001\000\000\000\000\000\000\000
\002\000\000\000\000\000\000\000
which are the values 1 and 2 in 64-bit little-endian. Since ‘A’ is 65 and should map to 1, this must be how Rust has compiled the lookup table into assembly.
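As a quick sanity check of that little-endian claim, Rust can print the byte layout of a u64 directly:

fn main() {
    // entry 65 of the table holds the u64 value 1; in memory that's 8 little-endian bytes
    assert_eq!(1u64.to_le_bytes(), [1, 0, 0, 0, 0, 0, 0, 0]);
    assert_eq!(2u64.to_le_bytes(), [2, 0, 0, 0, 0, 0, 0, 0]);
}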
For this analysis, I used these references on x86 architecture and instruction set. See the original post for more discussion on assembly.
Spending a few hours googling this assembly stuff was cool, but I didn’t feel much closer to understanding what was going on. Mostly because I only looked at the simplest match/lookup code, whereas the real performance difference came up in the much more complicated “mix with rotate” task:
let mut out: u64 = 0;
for idx in 0..1000 {
    for x in s.as_bytes().iter().skip(idx).take(32) {
        out ^= lookup4(*x).rotate_left(idx as u32)
    }
}
out
which compiled to a lot more assembly — and it wasn’t clear that wading through it all would lead to a satisfying answer.
Kevin’s baby understanding of compilers and a hypothesis
Looking at generated code is neat, but perhaps not that helpful without knowing what to look for — it’d help to have a hypothesis.
In the benchmarks, inlined match and lookup have basically the same performance, except for “mix with rotate” when the alphabet size is greater than two.
My hypothesis is that there is some optimization that applies to the lookup but not the match code in this specific case.
Before we talk about how to test this hypothesis, let me share my baby understanding of how a Rust program runs:
- Kevin writes code in Rust
- the Rust compiler translates this into LLVM IR (other code)
- LLVM optimizes this IR and translates it into hardware-specific machine code (the assembly we just looked at)
- my CPU translates this machine code into, uh, microcode?
- the microcode controls, uh, transistor switches and other electrical stuff?
- the universe runs my computations
My friend Jamie Brandon suggested that instead of looking at the assembly (third step), I look at the optimized LLVM IR (second step), since that would probably show the “big picture” of memory access and branching, without the details of specific hardware instructions, registers, calling conventions, etc.
This reasoning made sense, both in terms of understanding this specific case and in terms of personal skill investment. I.e., knowing LLVM IR will be more useful (for me) than knowing the quirks and features of the x86 instruction set.
So, to test my hypothesis, let’s look at the generated LLVM IR and see how it differs between the two cases we’re interested in (4-letter “mix with rotate”).
Looking at LLVM IR
Before we get to the complex “mix with rotate” case, let’s start again with the simplest match2 and lookup2.
If you want to follow along:
Helpful resources I found for learning about LLVM IR:
- The LLVM language reference, for looking up API docs
- Mapping high-level constructs to LLVM, for more walkthrough-style discussion
I generated the following code on OS X 10.9.5 with:
rustc 1.31.1 (b6c32da9b 2018-12-18)
binary: rustc
commit-hash: b6c32da9b0481e3e9d737153286b3ff8aa39a22c
commit-date: 2018-12-18
host: x86_64-apple-darwin
release: 1.31.1
LLVM version: 8.0
I ran
rustc --crate-type lib --emit=llvm-ir -C opt-level=3 src/match2.rs
to generate LLVM IR in a file match2.ll. You can also generate IR from Rust via an online playground.
Anyway, here’s the Rust source:
const LOOKUP: [u64; 256] = [
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 2, 3, 4, 5, 6, 7, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
];
pub fn lookup2(x: u8) -> u64 {
    LOOKUP[x as usize]
}

pub fn match2(x: u8) -> u64 {
    match x {
        b'A' => 1,
        b'B' => 2,
        _ => 0,
    }
}
and here’s the generated LLVM IR (with the @0 lookup table constant trimmed):
; ModuleID = '2.3a1fbbbh-cgu.0'
source_filename = "2.3a1fbbbh-cgu.0"
target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-apple-darwin"
@0 = private unnamed_addr constant <{ [2048 x i8] }> <{ [2048 x i8] c"\00\00\00..." }>, align 8
; _2::lookup2
; Function Attrs: norecurse nounwind readnone uwtable
define i64 @_ZN2_27lookup217he284a079f87517e3E(i8 %x) unnamed_addr #0 {
start:
%0 = zext i8 %x to i64
%1 = getelementptr inbounds [256 x i64], [256 x i64]* bitcast (<{ [2048 x i8] }>* @0 to [256 x i64]*), i64 0, i64 %0
%2 = load i64, i64* %1, align 8
ret i64 %2
}
; _2::match2
; Function Attrs: norecurse nounwind readnone uwtable
define i64 @_ZN2_26match217h7e07b9733c4bda11E(i8 %x) unnamed_addr #0 {
start:
%switch.selectcmp = icmp eq i8 %x, 66
%switch.select = select i1 %switch.selectcmp, i64 2, i64 0
%switch.selectcmp1 = icmp eq i8 %x, 65
%switch.select2 = select i1 %switch.selectcmp1, i64 1, i64 %switch.select
ret i64 %switch.select2
}
attributes #0 = { norecurse nounwind readnone uwtable "no-frame-pointer-elim"="true" "probe-stack"="__rust_probestack" "target-cpu"="core2" }
Hopefully this should look similar to the assembly (which was generated from this IR, after all). A few things to point out:
- LLVM has unlimited registers, which can have nice names (prefixed with %)
- there’s metadata like norecurse (the function doesn’t call itself), which presumably LLVM will use when generating machine code
- instead of hidden flags, comparisons are returned as values and passed around explicitly; for example, %switch.selectcmp1 = icmp eq i8 %x, 65 means that the register %switch.selectcmp1 has a 1-bit value of 1 if %x is equal to 65 and 0 otherwise
- types are written out everywhere
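To connect this back to Rust, the two select instructions behave like nested branchless conditionals. Roughly like this (a mental model only, not what rustc actually emits):

fn match2_as_selects(x: u8) -> u64 {
    let select = if x == b'B' { 2 } else { 0 };   // %switch.selectcmp / %switch.select
    if x == b'A' { 1 } else { select }            // %switch.selectcmp1 / %switch.select2
}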
In the 4-letter case, the lookup is the same, but the match contains a branch instruction:
; _4::match4
; Function Attrs: norecurse nounwind readnone uwtable
define i64 @_ZN2_46match417h91abd8c68f0f7a54E(i8 %x) unnamed_addr #0 {
start:
%switch.tableidx = add i8 %x, -65
%0 = icmp ult i8 %switch.tableidx, 4
br i1 %0, label %switch.lookup, label %bb6
switch.lookup: ; preds = %start
%switch.idx.cast = zext i8 %switch.tableidx to i64
%switch.offset = add nuw nsw i64 %switch.idx.cast, 1
ret i64 %switch.offset
bb6: ; preds = %start
ret i64 0
}
The add of -65 and the unsigned comparison form a single range check: if %x - 65, treated as an unsigned byte, is less than 4 (i.e., %x is between 65 and 68, ASCII ‘A’ through ‘D’), we branch to switch.lookup; otherwise we know it’s out of range (anything below ‘A’ wraps around to a large value, so it also fails the check) and we branch to %bb6 and just return 0.
In the switch.lookup
, the byte with value %x - 65
is “zero extended” (add a bunch of zeros to the front until it’s as big as an i64), then the value is incremented2 and returned.
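Written back out as Rust for intuition (not literally what the compiler produced), the whole match4 lowering is a wrapping subtract, one unsigned comparison, and an add:

fn match4_lowered(x: u8) -> u64 {
    let idx = x.wrapping_sub(b'A');  // %switch.tableidx = add i8 %x, -65
    if idx < 4 {                     // %0 = icmp ult i8 %switch.tableidx, 4
        idx as u64 + 1               // zext, then add nuw nsw ..., 1
    } else {
        0                            // bb6: return 0
    }
}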
Task 1: Sum
Now that we’ve seen the plain lookup and match LLVM IR, let’s look at the slightly more complicated task of mapping and summing over all the characters in a string (in the 4-letter alphabet case):
pub fn sum_lookup_4(s: &str) -> u64 {
    let mut sum = 0;
    for x in s.as_bytes() {
        sum += lookup4(*x);
    }
    sum
}
This Rust function generates about 75 lines of LLVM IR, which I’ll go through in blocks (for the sake of storytelling, I’ll present in a different order than the source order):
; lookup4::sum_lookup_4
; Function Attrs: nounwind readonly uwtable
define i64 @sum_lookup([0 x i8]* noalias nonnull readonly align 1 %s.0, i64 %s.1) unnamed_addr #1 {
This function returns an i64 and takes two arguments:
- %s.0 has type [0 x i8]*, a pointer to an array of bytes. (LLVM doesn’t actually enforce bounds checking, and since the Rust compiler doesn’t know how long the array is going to be, it just uses 0.)
- %s.1 is an i64: the actual length of the array. After all, our Rust function takes one input of type &str — a “string slice” — and a slice is “a view into a block of memory represented as a pointer and a length”.
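You can see that pointer/length pair from Rust itself:

fn main() {
    let s: &str = "ACGTACGT";
    // a &str slice is exactly (pointer to bytes, length in bytes)
    println!("ptr = {:p}, len = {}", s.as_ptr(), s.len());
}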
Then in the starting block:
start:
%0 = icmp eq i64 %s.1, 0
br i1 %0, label %bb5, label %bb7.preheader
We look at the length of the array and branch based on whether or not it’s zero (i.e., whether or not our string slice is empty). Let’s look at the empty string case first:
bb5:
%sum.0.lcssa = phi i64 [ 0, %start ], [ %.lcssa.ph, %bb5.loopexit.unr-lcssa ], [ %9, %bb7.epil ]
ret i64 %sum.0.lcssa
The phi
instruction returns a value based on where we just came from.
If we ended up here from the %start
block, then phi
returns 0, which we then return.
This is what we want — for a blank string, we don’t need to loop over anything, just return the initial value of the sum
variable: 0.
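In Rust terms, the phi plays the role of an expression whose value depends on which path reached it. As a loose analogy (not generated code, and with the lookup elided):

fn sum_shape(bytes: &[u8]) -> u64 {
    if bytes.is_empty() {
        0 // the [ 0, %start ] arm of the phi
    } else {
        // the other phi arms carry the running sum out of the loop
        bytes.iter().map(|&b| u64::from(b)).sum()
    }
}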
For non-empty strings, %start
instead branches here:
bb7.preheader: ; preds = %start
%1 = getelementptr inbounds [0 x i8], [0 x i8]* %s.0, i64 0, i64 0
%2 = add i64 %s.1, -1
%xtraiter = and i64 %s.1, 3
%3 = icmp ult i64 %2, 3
br i1 %3, label %bb5.loopexit.unr-lcssa, label %bb7.preheader.new
First, %1
is set as the address of the first byte of the slice3.
(See the getelementptr FAQ if you’re wondering why it takes so many arguments to calculate this address.)
The %1
register is not used in this block, but we’ll see it later.
Likewise, the %xtraiter register is set (the 1s and 2s bits of the array length %s.1, i.e., %s.1 & 3, which is the length modulo 4) but not used until later.
Finally, the block branches on whether the array length4 minus one is less than 3, i.e., whether the length is less than 4.
If the length is 4 or more, we jump to
bb7.preheader.new: ; preds = %bb7.preheader
%unroll_iter = sub i64 %s.1, %xtraiter
br label %bb7
where we see %xtraiter
subtracted out of the array length.
The resulting %unroll_iter
is then the length of the input array minus any 1s or 2s.
This will make sense in the block we branch to next, the body of the loop:
bb7: ; preds = %bb7, %bb7.preheader.new
%sum.09 = phi i64 [ 0, %bb7.preheader.new ], [ %33, %bb7 ]
%iter.sroa.0.08 = phi i8* [ %1, %bb7.preheader.new ], [ %28, %bb7 ]
%niter = phi i64 [ %unroll_iter, %bb7.preheader.new ], [ %niter.nsub.3, %bb7 ]
%10 = getelementptr inbounds i8, i8* %iter.sroa.0.08, i64 1
%11 = load i8, i8* %iter.sroa.0.08, align 1
%12 = zext i8 %11 to i64
%13 = getelementptr inbounds [256 x i64], [256 x i64]* bitcast (<{ [2048 x i8] }>* @0 to [256 x i64]*), i64 0, i64 %12
%14 = load i64, i64* %13, align 8
%15 = add i64 %14, %sum.09
%16 = getelementptr inbounds i8, i8* %iter.sroa.0.08, i64 2
%17 = load i8, i8* %10, align 1
%18 = zext i8 %17 to i64
%19 = getelementptr inbounds [256 x i64], [256 x i64]* bitcast (<{ [2048 x i8] }>* @0 to [256 x i64]*), i64 0, i64 %18
%20 = load i64, i64* %19, align 8
%21 = add i64 %20, %15
%22 = getelementptr inbounds i8, i8* %iter.sroa.0.08, i64 3
%23 = load i8, i8* %16, align 1
%24 = zext i8 %23 to i64
%25 = getelementptr inbounds [256 x i64], [256 x i64]* bitcast (<{ [2048 x i8] }>* @0 to [256 x i64]*), i64 0, i64 %24
%26 = load i64, i64* %25, align 8
%27 = add i64 %26, %21
%28 = getelementptr inbounds i8, i8* %iter.sroa.0.08, i64 4
%29 = load i8, i8* %22, align 1
%30 = zext i8 %29 to i64
%31 = getelementptr inbounds [256 x i64], [256 x i64]* bitcast (<{ [2048 x i8] }>* @0 to [256 x i64]*), i64 0, i64 %30
%32 = load i64, i64* %31, align 8
%33 = add i64 %32, %27
%niter.nsub.3 = add i64 %niter, -4
%niter.ncmp.3 = icmp eq i64 %niter.nsub.3, 0
br i1 %niter.ncmp.3, label %bb5.loopexit.unr-lcssa, label %bb7
The first thing that stands out is the repetition of familiar code: it’s the lookup code we saw earlier, repeated 4 times.
This is an optimization known as loop unrolling. The idea is that a loop body tends to execute many times, and the overhead of iteration counting and branching (the last three instructions) can be reduced by paying it once per group of, e.g., 4 loop bodies rather than once per body.
Interestingly, if we look at this repeated group of 6 instructions, the first instruction looks up the address of a byte from the input array, but this address isn’t used until the next group.
E.g., the %10
register is set in the first group, but used in the second group.
I have no idea why this is.
That the loop has been unrolled into sets of 4 explains why the 1s and 2s were subtracted out earlier — so that %unroll_iter
is a multiple of 4.
Once we’ve run as many iterations as we can in steps of 4, we branch to:
bb5.loopexit.unr-lcssa: ; preds = %bb7, %bb7.preheader
%.lcssa.ph = phi i64 [ undef, %bb7.preheader ], [ %33, %bb7 ]
%sum.09.unr = phi i64 [ 0, %bb7.preheader ], [ %33, %bb7 ]
%iter.sroa.0.08.unr = phi i8* [ %1, %bb7.preheader ], [ %28, %bb7 ]
%lcmp.mod = icmp eq i64 %xtraiter, 0
br i1 %lcmp.mod, label %bb5, label %bb7.epil
This block tests whether we have any iterations left to do, using the %xtraiter
register that was set earlier with the 1s and 2s bits from the original input length.
If we have no iterations left, branch to %bb5
(which we saw earlier) and return the value.
If there are a few more iterations left, branch to %bb7.epil
(“epilogue”, presumably):
bb7.epil: ; preds = %bb5.loopexit.unr-lcssa, %bb7.epil
%sum.09.epil = phi i64 [ %9, %bb7.epil ], [ %sum.09.unr, %bb5.loopexit.unr-lcssa ]
%iter.sroa.0.08.epil = phi i8* [ %4, %bb7.epil ], [ %iter.sroa.0.08.unr, %bb5.loopexit.unr-lcssa ]
%epil.iter = phi i64 [ %epil.iter.sub, %bb7.epil ], [ %xtraiter, %bb5.loopexit.unr-lcssa ]
%4 = getelementptr inbounds i8, i8* %iter.sroa.0.08.epil, i64 1
%5 = load i8, i8* %iter.sroa.0.08.epil, align 1
%6 = zext i8 %5 to i64
%7 = getelementptr inbounds [256 x i64], [256 x i64]* bitcast (<{ [2048 x i8] }>* @0 to [256 x i64]*), i64 0, i64 %6
%8 = load i64, i64* %7, align 8
%9 = add i64 %8, %sum.09.epil
%epil.iter.sub = add i64 %epil.iter, -1
%epil.iter.cmp = icmp eq i64 %epil.iter.sub, 0
br i1 %epil.iter.cmp, label %bb5, label %bb7.epil, !llvm.loop !0
which completes the remaining iterations by branching back to itself and then finally to %bb5
to return the value.
Phew! That was a lot of code, but hopefully you caught the gist.
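To recap the shape of all that in hand-written Rust (for intuition only; the actual unrolling is something LLVM does to the simple loop we wrote):

pub fn sum_lookup_4_shape(s: &str, table: &[u64; 256]) -> u64 {
    let bytes = s.as_bytes();
    let mut sum = 0u64;

    // bb7: four "bodies" per iteration, one counter update and branch
    let mut chunks = bytes.chunks_exact(4);
    for chunk in &mut chunks {
        sum += table[chunk[0] as usize];
        sum += table[chunk[1] as usize];
        sum += table[chunk[2] as usize];
        sum += table[chunk[3] as usize];
    }

    // bb7.epil: the 0-3 leftover bytes (%xtraiter)
    for &x in chunks.remainder() {
        sum += table[x as usize];
    }

    sum
}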
How does the match version compare?
From what I can tell, it’s quite similar. The only thing that stands out to me is that the loop unrolls into 2 bodies per iteration rather than 4 bodies5.
Diffing LLVM IR
Now that we’ve read some LLVM IR and gotten a rough sense of how things work, let’s go directly to the problem at hand: What’s different about the generated LLVM IR between the match and lookup for the “mix with rotate” task:
pub fn match_mix_4(s: &str) -> u64 {
    let mut out: u64 = 0;
    for idx in 0..1000 {
        for x in s.as_bytes().iter().skip(idx).take(32) {
            out ^= match4(*x).rotate_left(idx as u32)
        }
    }
    out
}
Rather than try to read through the generated code (about 90 lines), let’s instead use the llvm-diff command to do a semantic diff for us! I manually renamed the function names to be identical (so they’d be diffed against each other) and llvm-diff reported the following difference: The lookup version had
%20 = zext i8 %19 to i64
%21 = getelementptr inbounds [256 x i64], [256 x i64]* bitcast (<{ [2048 x i8] }>* @0 to [256 x i64]*), i64 0, i64 %20
%22 = load i64, i64* %21, align 8
%23 = shl i64 %22, %4
%24 = lshr i64 %22, %6
%25 = or i64 %23, %24
%26 = xor i64 %25, %out.158
where the match version had
%switch.tableidx = add i8 %19, -65
%20 = icmp ult i8 %switch.tableidx, 4
%switch.idx.cast = zext i8 %switch.tableidx to i64
%switch.offset = add nuw nsw i64 %switch.idx.cast, 1
%_0.0.i = select i1 %20, i64 %switch.offset, i64 0
%21 = shl i64 %_0.0.i, %4
%22 = lshr i64 %_0.0.i, %6
%23 = or i64 %21, %22
%24 = xor i64 %23, %out.160
which…isn’t that helpful? I mean, the diff looks like exactly what I’d expect: the lookup version has the lookup code inlined, and the match version has the match code inlined.
What about the simple mix case? Lookup had:
%17 = zext i8 %16 to i64
%18 = getelementptr inbounds [256 x i64], [256 x i64]* bitcast (<{ [2048 x i8] }>* @0 to [256 x i64]*), i64 0, i64 %17
%19 = load i64, i64* %18, align 8
%20 = xor i64 %19, %out.158
where match had:
%switch.tableidx = add i8 %16, -65
%17 = icmp ult i8 %switch.tableidx, 4
%switch.idx.cast = zext i8 %switch.tableidx to i64
%switch.offset = add nuw nsw i64 %switch.idx.cast, 1
%_0.0.i = select i1 %17, i64 %switch.offset, i64 0
%18 = xor i64 %_0.0.i, %out.160
which shows the same situation (the only difference is that the rotation code is missing, as expected).
So perhaps the difference really isn’t detectable at the level of LLVM IR and we need to look at the emitted assembly?
Revisiting assembly
I compiled the LLVM IR into assembly using the llc
program, then aligned a “diff” manually.
Here’s “mix with rotate”, with lookup on the left and match on the right:
LBB0_10: ## %bb15
## in Loop: Header=BB0_2 Depth=2
movzbl (%rbx), %ecx movzbl (%rbx), %ecx
movq (%r9,%rcx,8), %rbx addb $-65, %cl
movzbl %cl, %ecx
leaq 1(%rcx), %rbx
cmpb $4, %cl
cmovaeq %r9, %rbx
movq %rbx, %rdx movq %rbx, %rdx
movl %r8d, %ecx movl %r8d, %ecx
shlq %cl, %rdx shlq %cl, %rdx
movl %r11d, %ecx movl %r11d, %ecx
shrq %cl, %rbx shrq %cl, %rbx
orq %rdx, %rbx orq %rdx, %rbx
xorq %rbx, %rax xorq %rbx, %rax
incq %r15 incq %r15
movl $0, %ecx movl $0, %ecx
jne LBB0_2 jne LBB0_2
Ditto, but just “mix”:
LBB0_10: ## %bb15
## in Loop: Header=BB0_2 Depth=2
movzbl (%r11), %ecx movzbl (%rdx), %ecx
addb $-65, %cl
movzbl %cl, %ecx
leaq 1(%rcx), %rdx
cmpb $4, %cl
cmovaeq %r9, %rdx
xorq (%r9,%rcx,8), %rax xorq %rdx, %rax
incq %rbx incq %r11
movl $0, %ecx movl $0, %ecx
jne LBB0_2 jne LBB0_2
Again, nothing really jumps out here. The match code is 5 instructions longer than the lookup code — that’d explain things if match were consistently slower than lookup, but that’s not the case — match is only slower in the “mix with rotate” task for 4 and 8 letter alphabets.
It’s also worth noting that this disproves my original hypothesis that some optimization applied to the lookup but not the match code. As far as I can tell, basically the same code is getting generated “around” the match and lookup in all cases.
Cachegrind
Maybe the issue is that these extra few instructions in the loop body end up keeping the match “mix with rotate” instructions from fitting in a cache line?
cargo build --release && valgrind --tool=cachegrind target/release/match_mix_cache
==31976==
==31976== I refs: 599,041,912
==31976== I1 misses: 6,066
==31976== LLi misses: 3,642
==31976== I1 miss rate: 0.00%
==31976== LLi miss rate: 0.00%
==31976==
==31976== D refs: 36,578,581 (35,418,178 rd + 1,160,403 wr)
==31976== D1 misses: 45,598 ( 40,688 rd + 4,910 wr)
==31976== LLd misses: 20,374 ( 16,861 rd + 3,513 wr)
==31976== D1 miss rate: 0.1% ( 0.1% + 0.4% )
==31976== LLd miss rate: 0.1% ( 0.0% + 0.3% )
==31976==
==31976== LL refs: 51,664 ( 46,754 rd + 4,910 wr)
==31976== LL misses: 24,016 ( 20,503 rd + 3,513 wr)
==31976== LL miss rate: 0.0% ( 0.0% + 0.3% )
cargo build --release && valgrind --tool=cachegrind target/release/match_mix_rotate_cache
==31979==
==31979== I refs: 798,023,810
==31979== I1 misses: 6,177
==31979== LLi misses: 3,640
==31979== I1 miss rate: 0.00%
==31979== LLi miss rate: 0.00%
==31979==
==31979== D refs: 36,574,131 (35,414,303 rd + 1,159,828 wr)
==31979== D1 misses: 45,740 ( 40,748 rd + 4,992 wr)
==31979== LLd misses: 20,488 ( 16,887 rd + 3,601 wr)
==31979== D1 miss rate: 0.1% ( 0.1% + 0.4% )
==31979== LLd miss rate: 0.1% ( 0.0% + 0.3% )
==31979==
==31979== LL refs: 51,917 ( 46,925 rd + 4,992 wr)
==31979== LL misses: 24,128 ( 20,527 rd + 3,601 wr)
==31979== LL miss rate: 0.0% ( 0.0% + 0.3% )
Nope, doesn’t seem like there’s any difference in miss rates between the 4-letter “mix” and “mix with rotate” tasks.
Revisiting assumptions + a zeroth-order explanation
At this point, I was starting to doubt whether it’s possible to reason about performance at all.
I decided to revisit my assumptions and re-run benchmarks.
After much whirring of fans and angry reformatting of Criterion output into R for analysis, I made a startling discovery: The 2-to-4-letter inline match “mix with rotate” slowdown disappears when lazy_static
is not used.
Compare the results by hovering over this graphic with your mouse:
You’ll see that without lazy_static
(i.e., with const LOOKUP: [u64; 256]
written out with 256 values), all the lookups are faster (“mix” and “mix with rotate” tasks drop by about 50 microseconds).
That makes sense, since there’s a bit of overhead to access the lazily initialized value.
After all, it was my desire to avoid wading through this extra lazy_static
generated code that prompted me to write out the lookup table with all 256 values explicitly.
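To make that overhead concrete, here’s the shape of a lazily initialized table. (This uses std’s OnceLock rather than the lazy_static crate, and it isn’t lazy_static’s actual implementation, but the cost is the same kind of thing: every access has to check whether the table has been initialized yet.)

use std::sync::OnceLock;

static LOOKUP4: OnceLock<[u64; 256]> = OnceLock::new();

fn lookup4_lazy(x: u8) -> u64 {
    // get_or_init performs an "already initialized?" check on every call;
    // a plain const table doesn't pay that cost
    let table = LOOKUP4.get_or_init(|| {
        let mut l = [0u64; 256];
        l[b'A' as usize] = 1;
        l[b'B' as usize] = 2;
        l[b'C' as usize] = 3;
        l[b'D' as usize] = 4;
        l
    });
    table[x as usize]
}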
To my great shame and embarrassment, it did not occur to me to benchmark everything again after making that change.
But now that we have a benchmark that more accurately reflects the code we’ve been examining, we can find an explanation.
In particular, it looks like the lazy_static
overhead was “hiding” the cost of the match’s extra instructions (and/or its reduced, uh, loop unrollability).
With this overhead removed and all inlined lookups faster than matches, I’m satisfied with the explanation that: Inlined lookups are faster because they require fewer instructions than matches
Open questions
We could go deeper on this explanation — after all, we still don’t know why Rust/LLVM doesn’t automatically convert the match into a lookup table.
Another question (one which I find more troubling) is why lazy_static
makes the 2-alphabet match “mix with rotate” faster.
After all, the match code has nothing to do with the lookup table, so it shouldn’t be touching any lazy_static
code whatsoever.
As we can see from the benchmarks, though, there is some kind of relationship.
Perhaps Criterion doesn’t sufficiently isolate the functions under test?
Or lazy_static
enables some kind of optimization that only speeds up the 2-letter “mix with rotate” match task?
I’ll leave these questions as an exercise to the reader. =)
1. I know Criterion prints results out to CSV and JSON files in the target/criterion directory, but I couldn’t bring myself to walk the directory tree, etc. to collect those results automatically using either R or Rust. Ideally it’d be possible to access benchmark results as structured data directly from my own program (which is easy to do in Clojure’s Criterium benchmarking library).

2. Question: Why does the second add instruction in this block have nuw and nsw annotations? They’re related to wrapping/overflow, but that doesn’t seem necessary since the value was just zero extended. Maybe it doesn’t hurt to have these here — but if it’s a default, why here and not on the first add instruction?

3. Question: Isn’t the function input [0 x i8]* %s.0 already the address of the first byte of the string slice? I would expect getelementptr inbounds [0 x i8], [0 x i8]* %foo, i64 0, i64 0 to be the same as just %foo. If that’s true, why does Rust bother to emit this getelementptr instruction at all? If not, what am I missing?

4. Question: Why is the condition implemented as
%2 = add i64 %s.1, -1
%3 = icmp ult i64 %2, 3
i.e., length - 1 < 3? Why not just length < 4? The %2 register doesn’t seem to be used anywhere else.

5. Maybe this is because the match takes 8 instructions:
%6 = getelementptr inbounds i8, i8* %iter.sroa.0.09, i64 1
%7 = load i8, i8* %iter.sroa.0.09, align 1
%switch.tableidx = add i8 %7, -65
%8 = icmp ult i8 %switch.tableidx, 4
%switch.idx.cast = zext i8 %switch.tableidx to i64
%switch.offset = add nuw nsw i64 %switch.idx.cast, 1
%_0.0.i = select i1 %8, i64 %switch.offset, i64 0
%9 = add i64 %_0.0.i, %sum.010
compared to the lookup’s 6, and so hits some “loop unrolling instruction count” ceiling?