On "Real-time search of all bacterial and viral genomic data"
Published: 2019 Jan 5
Josh Batson recommended “Real-time search of all bacterial and viral genomic data” to me because of my interest in search as part of Finda.
These are my notes from a read-through of the paper, followed by some exploration and rough analysis of alternative implementation ideas.
Update: Phelim Bradley (the paper’s first author) kindly answered many of my questions on Twitter.
Paper notes
- traditional methods (e.g., BLAST) rely on alignments against reference genomes, so this paper’s formulation of genome search as an exact-match problem is atypical
- there aren’t many high-quality reference genomes for bacteria and viruses, since they mutate too much; most data is based on short reads
- the diversity of these genomes (number of unique k-mers / n-grams) means k-mer indexing schemes that work well for eukaryotes don’t work well for bacteria
The main contribution of the paper seems to be a “bitsliced signature”-based approach, which represents each of the $N$ sequences as a bloom filter of length $m$ and bounds the diversity of the k-mers using $\eta$ hash functions (which map k-mers to $1..m$).
To index a new sequence: construct all unique k-mers in that sequence, map all of them under the hash functions, and set the resulting bits to make a bloom filter.
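Concretely, the indexing step might look something like this sketch (my own illustration, not the authors’ code; the seeded DefaultHasher and $\eta = 3$ are stand-in assumptions):
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const M: usize = 25_000_000; // bloom filter length (the paper's choice, see below)
const K: usize = 31;         // k-mer size
const ETA: u64 = 3;          // number of hash functions (assumed)

// Map one k-mer to its ETA bit positions in 0..M.
fn hashed_bits(kmer: &[u8]) -> impl Iterator<Item = usize> + '_ {
    (0..ETA).map(move |seed| {
        let mut h = DefaultHasher::new();
        (seed, kmer).hash(&mut h);
        (h.finish() % M as u64) as usize
    })
}

// Index a sequence: hash every k-mer and set the resulting bits.
// Duplicate k-mers are harmless, since setting a bit twice is a no-op.
fn bloom_filter(sequence: &[u8]) -> Vec<bool> {
    let mut bits = vec![false; M];
    for kmer in sequence.windows(K) {
        for bit in hashed_bits(kmer) {
            bits[bit] = true;
        }
    }
    bits
}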
To retrieve sequences matching some query sequence: construct all unique k-mers in the query, map them under the hash functions, and find all bloom filters that have the same bits set (or within some difference, if inexact matches are okay).
This retrieval can be sped up by thinking of the bloom filters as “columns” in a matrix of $m$ rows and scanning over the relevant rows (i.e., the ones corresponding to the hashed query k-mers) rather than scanning column-by-column and taking the intersection of each column with the query bloom filter.
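In code, that row-oriented scan might look roughly like this (again my own sketch, not the authors’ implementation; each row packs one bit per sample into plain u64 words):
// rows[i] has one bit per sample: is bit i set in that sample's bloom filter?
// To answer a query, AND together only the rows its k-mers hash to.
fn candidate_samples(rows: &[Vec<u64>], hashed_query_bits: &[usize]) -> Vec<u64> {
    let words_per_row = rows[0].len();
    let mut hits = vec![std::u64::MAX; words_per_row]; // start with "every sample matches"
    for &bit in hashed_query_bits {
        for (acc, word) in hits.iter_mut().zip(&rows[bit]) {
            *acc &= word; // drop samples whose filter is missing this bit
        }
    }
    hits // sample j is a candidate iff bit j is still set
}
With $\eta$ hash functions and $q$ query k-mers, this touches only $\eta q$ rows of $N$ bits each, instead of all $N$ filters of $m$ bits each.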
(This “bitsliced signature” technique was known before this paper, but fell out of favor in human language search applications because inverted indexes worked better. See also this paper about signature search at Bing, which provides a nice overview with more math and pictures.)
The authors implemented their system with two backends: BerkeleyDB (on disk key/value store) and Redis (in-memory key/value store, running on 3TB of RAM). Source code is on Github and is about 5,000 lines of Python.
Questions
“Indexing at a smaller k-mer (31) than our minimum query length (61) enables both compression and a low error rate.” Is there a technical reason for these values to be different? Forcing the minimum query length to be longer lowers the false positive rate, but at the cost of true positives — why not let users make this tradeoff for themselves?
The method doesn’t seem to do any normalization around sequence length (e.g., tf-idf). So a giant sequence containing every possible k-mer would always be an optimal match for everything. Is this not a problem in practice because the sequences in the corpus are all roughly the same size? (i.e., the bloom filters all have roughly the same density)
From the docs:
You can use any tool you want to extract unique k-mers from your raw data. We recommend mccortex as you can use its error cleaning methods to extract error cleaned k-mers.
What are errors in k-mers? Like, isn’t the input data just a string? The mccortex docs say
‘Clean’ graphs to remove sequencing error (per sample, for high coverage samples)
which suggests to me that the tool does some kind of assembly of multiple reads to generate a consensus sequence, then spits out k-mers of that sequence.
The authors indexed about 500,000 sequences with $k = 31$, leading to about $10^{10}$ (~ $2^{37}$) unique k-mers. For a 4-letter alphabet, the space could contain as many as $4^{31} = 2^{62}$ unique k-mers — why don’t we see more? Is there just not that much variation in biology, or have we just not sequenced enough to see it all?
Why have multiple hash functions ($\eta > 1$)? Wikipedia gives an equation relating this to the number of elements in the bloom filter and its size, but I don’t have an intuitive sense for it. Maybe if you only have a few items and a huge bloom filter, you want to “smear out” each item across several bits of the signature to reduce the chance of collisions. E.g., if you have 2 items and 10 bits, with one hash you only set 1 bit per item and the chance of a collision is 1 in 10; whereas if you follow the formula and hash each item 3 times, a spurious match requires several independent collisions, so the chance drops well below 1 in 10?
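For reference, the standard bloom filter false-positive approximation (presumably the equation that Wikipedia article gives) for $n$ stored items, $m$ bits, and $\eta$ hash functions is:
$$ p \approx \left(1 - e^{-\eta n / m}\right)^{\eta} $$
Treating one of the two toy items as stored ($n = 1$, $m = 10$) and the other as the query, one hash gives $p \approx 1 - e^{-0.1} \approx 0.1$, while three hashes give $p \approx (1 - e^{-0.3})^3 \approx 0.02$ — so extra hashes do help, at least until $\eta$ gets large enough that the filter starts to fill up.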
Potential improvement via compressed bitmaps?
The “signature” method compresses the huge space of $4^k$ possible k-mers down to $2^m$ possible bit vectors. The space required for the index is $mN$ bits ($m$ bits for each of the $N$ sequences).
The authors choose $m = 25,000,000$, so each sample needs ~3 MB of storage (thus ~1.5 TB of RAM to support real-time search of 500,000 samples).
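Spelled out:
$$ \frac{mN}{8} = \frac{25,000,000 \times 500,000}{8} \ \text{bytes} \approx 1.5 \ \text{TB} $$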
If this vector isn’t that dense, then it could be losslessly compressed with something like Roaring Bitmaps, which might greatly decrease the storage requirements of this method.
What’s a ballpark estimate for the bit vector density?
For a given sequence of length $n$, the number of bits we try to set (the number of draws $r$ below) is the product of:
- The number of k-mers in the sequence: $n-k+1$
- The number of times each k-mer is hashed: $\eta$
Assuming the hash functions map k-mers uniformly to $1..m$, the problem is similar to drawing balls with replacement: After $r$ draws (hashed k-mers), how many unique balls $y$ do we expect to have seen (how many bits were set)?
$$y = m - \frac{(m-1)^r}{m^{r-1}}$$
Dividing by the length of the vector $m$ to get a proportion, $\alpha$:
$$\alpha = 1 - (1 - \frac{1}{m})^r$$
According to this paper, bacterial genomes look to be somewhere around 2–6 million bases long. So, between friends, calling that $10^7$ bases, throwing out the factor of $\eta$, and using the approximation $(1 - \frac{1}{m})^r \approx e^{-r/m}$, we have:
$$ \begin{align} \alpha &\approx 1 - e^{-10^7 / 25,000,000} \newline &\approx 1 - e^{-0.4} \newline &\approx 1/3 \end{align} $$
So it looks like these bitvectors may not actually be that sparse.
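As a quick numerical check of that formula (my own sketch; the 2–6 Mbp sequence lengths come from the range above, and $\eta = 3$ is an assumption), this:
// Evaluate alpha = 1 - (1 - 1/m)^(eta * n) for a few plausible sequence lengths.
fn main() {
    let m: f64 = 25_000_000.0;
    for &(n, eta) in &[(2_000_000.0_f64, 1.0_f64), (6_000_000.0, 1.0), (6_000_000.0, 3.0)] {
        let r = n * eta; // hashed k-mers, ignoring the tiny k - 1 correction
        let alpha = 1.0 - (1.0 - 1.0 / m).powf(r);
        println!("n = {:>9}, eta = {}, density = {:.2}", n, eta, alpha);
    }
}
prints densities of roughly 0.08, 0.21, and 0.51, so typical samples probably aren’t very sparse, though shorter sequences could be.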
In that case, whether there’s a real benefit to using Roaring Bitmaps may come down to:
- the overhead of compression on non-sparse sequences (i.e., is a Roaring Bitmap that much slower or bigger than a naive bitvector?)
- the specific distribution of lengths within the indexed sequences
The first question can be answered empirically:
use croaring::Bitmap;
use rand::{thread_rng, Rng};

fn main() {
    let mut rng = thread_rng();
    let m = 25_000_000; // bloom filter length used in the paper
    println!("Naive bytes: {}", m / 8);
    for density_exponent in 1..16 {
        // Build a bitmap where each of the m bits is set with probability 1/2^density_exponent.
        let b: Bitmap = (0..m)
            .filter(|_| rng.gen_ratio(1, 2_u32.pow(density_exponent)))
            .collect();
        println!(
            "density: 1/2^{}, bytes used: {}",
            density_exponent,
            b.get_serialized_size_in_bytes()
        );
    }
}
prints the following:
Naive bytes: 3125000
density: 1/2^1, bytes used: 3132408
density: 1/2^2, bytes used: 3132408
density: 1/2^3, bytes used: 3132038
density: 1/2^4, bytes used: 3108258
density: 1/2^5, bytes used: 1563798
density: 1/2^6, bytes used: 783102
density: 1/2^7, bytes used: 394296
density: 1/2^8, bytes used: 198302
density: 1/2^9, bytes used: 100524
density: 1/2^10, bytes used: 52136
density: 1/2^11, bytes used: 27940
density: 1/2^12, bytes used: 15116
density: 1/2^13, bytes used: 9120
density: 1/2^14, bytes used: 6024
density: 1/2^15, bytes used: 4154
This shows that Roaring Bitmaps carry a ~7kB space overhead compared to a naive bitvector until the density gets down to about 1/16. (And the real space savings only begin around a density of 1/32.)
So if the bitvectors all have density greater than 1/16, Roaring Bitmaps will impose an extra ~4GB of storage cost across 500,000 samples for little benefit. But if there are many vectors less dense than 1/16 (i.e., sequences less than $10^6$ bases long), the savings could be substantial.
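For what it’s worth, setting the density formula from above equal to 1/16 puts that crossover at roughly:
$$ 1 - \left(1 - \frac{1}{m}\right)^{\eta n} = \frac{1}{16} \;\Rightarrow\; n \approx \frac{m}{\eta} \ln\frac{16}{15} \approx \frac{1.6 \times 10^6}{\eta} $$
i.e., on the order of a million bases with a single hash function, or a few hundred thousand if $\eta = 3$, which lines up with the $10^6$ ballpark.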
What about a direct k-mer index?
The authors found about $10^{10}$ unique k-mers; how much space would it take to encode these directly, rather than lossily compress them in Bloom filters?
Unfortunately, $10^{10} \approx 2^{37}$, which means we’re outside of Roaring bitmap’s 32-bit integer domain. (See issue, where Roaring’s author suggests options for pre-partitioning data.)
What if we worked around this by reducing the k-mer size? Since $2^{32} = 4^{16}$, if $k = 16$ then every possible k-mer fits directly into a 32-bit bitmap index, with no need to mess with hash functions, etc.
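As a sketch of what “directly” could mean here (my own illustration), a 16-mer over {A, C, G, T} packs into a u32 at 2 bits per base:
// Pack a 16-mer into a u32: 2 bits per base, so 16 bases fill exactly 32 bits.
fn pack_16mer(kmer: &str) -> Option<u32> {
    if kmer.len() != 16 {
        return None;
    }
    let mut packed = 0u32;
    for base in kmer.bytes() {
        let code = match base {
            b'A' => 0,
            b'C' => 1,
            b'G' => 2,
            b'T' => 3,
            _ => return None, // e.g., ambiguous bases like N
        };
        packed = (packed << 2) | code;
    }
    Some(packed)
}
Each sequence then becomes the set of packed values of its 16-mers, stored in one Roaring bitmap.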
So with bitmaps of length $2^{32}$ (to account for every possible 16-mer) and typical sequences having $10^7 \approx 2^{24}$ unique k-mers (a density about $2^{-8}$), we can run the numbers:
// Reusing the rng, imports, and main() scaffolding from the previous example:
let m = std::u32::MAX; // one bit for (nearly) every possible 16-mer
let b: Bitmap = (0..m).filter(|_| rng.gen_ratio(1, 2_u32.pow(8))).collect();
println!("{}", b.get_serialized_size_in_bytes());
which gives us ~34MB per sample — about 10x the space requirement (not to mention potential loss of specificity due to using half-sized k-mers). So, probably best to stick with bloom filters =)