On "Real-time search of all bacterial and viral genomic data"

← Back to Kevin's homepagePublished: 2018 Jan 5

Josh Batson recommended “Real-time search of all bacterial and viral genomic data” to me because of my interest in search as part of Finda.

These are my notes from a read-through of the paper then some exploration / rough analysis of alternative implementation ideas.

Update: Phelim Bradley (the paper’s first author) kindly answered many of my questions on Twitter.

Paper notes

The main contribution of the paper seems to be a “bitsliced signature”-based approach, which represents each of the $N$ sequences as a bloom filter of length $m$ and bounds the diversity of the k-mers using $\eta$ hash functions (which map k-mers to $1..m$).

To index a new sequence: construct all unique k-mers in that sequence, map all of them under the hash functions, and set the resulting bits to make a bloom filter.

To retrieve sequences matching some query sequence: construct all unique k-mers in the query, map them under the hash functions, and find all bloom filters that have the same bits set (or within some difference, if inexact matches are okay).

This retrieval can be speed up by thinking of the bloom filters as “columns” in a matrix of $m$ rows and scanning over the relevant rows (i.e., the ones corresponding to the hashed query k-mers) rather than scanning column-by-column and taking the intersection of each column with the query bloom filter.

(This “bitsliced signature” technique was known before this paper, but fell out of favor in human language search applications because inverted indexes worked better. See also this paper about signatures-search at Bing, which provides a nice overview with more math and pictures.)

The authors implemented their system with two backends: BerkeleyDB (on disk key/value store) and Redis (in-memory key/value store, running on 3TB of RAM). Source code is on Github and is about 5,000 lines of Python.

Questions

Potential improvement via compressed bitmaps?

The “signature” method compresses the huge space of $4^k$ possible k-mers down to $2^m$ possible bit vectors. The space required for the index is $mN$ bits ($m$ bits for each of the $N$ sequences).

The authors choose $m = 25,000,000$, so each sample needs about ~3MB of storage (thus, 1.5 TB of RAM to support real time search of 500,000 samples).

If this vector isn’t that dense, then it could be losslessly compressed with something like Roaring Bitmaps, which might greatly decrease the storage requirements of this method.

What’s a ballpark estimate for the bit vector density?

For a given sequence of length $n$, the number of set bits is proportional to:

Assuming the hash functions map k-mers uniformly to $1..m$, the problem is similar to drawing balls with replacement: After $r$ draws (hashed k-mers), how many unique balls $y$ do we expect to have seen (how many bits were set)?

$$y = m - \frac{(m-1)^r}{m^{r-1}}$$

(see explanation)

Dividing by the length of the vector $m$ to get a proportion, $\alpha$:

$$\alpha = 1 - (1 - \frac{1}{m})^r$$

According to this paper, bacterial genomes look to be somewhere around 2–6 million bases long. So, between friends, calling that $10^7$ bases, throwing out the factor of $\eta$, and taking a power series expansion of the second term, $(1 + x)^a \approx 1 + ax$, we have:

$$ \begin{align} \alpha &\approx 1 - (1 - \frac{10^7}{28,755,176}) \newline &\approx 1/3 \end{align} $$

So it looks like these bitvectors may not actually be that sparse.

In that case, whether there’s a real benefit to using Roaring Bitmaps may come down to:

This first question can be answered empirically:

use croaring::Bitmap;
use rand::{thread_rng, Rng};

fn main() {
    let mut rng = thread_rng();
    let m = 25_000_000;
    println!("Naive bytes: {}", m / 8);

    for density_exponent in 1..16 {
        let b: Bitmap = (0..m)
            .filter(|_| rng.gen_ratio(1, 2_u32.pow(density_exponent)))
            .collect();

        println!(
            "density: 1/2^{}, bytes used: {}",
            density_exponent,
            b.get_serialized_size_in_bytes()
        );
    }
}

prints the following:

Naive bytes:                3125000
density: 1/2^1, bytes used: 3132408
density: 1/2^2, bytes used: 3132408
density: 1/2^3, bytes used: 3132038
density: 1/2^4, bytes used: 3108258
density: 1/2^5, bytes used: 1563798
density: 1/2^6, bytes used:  783102
density: 1/2^7, bytes used:  394296
density: 1/2^8, bytes used:  198302
density: 1/2^9, bytes used:  100524
density: 1/2^10, bytes used:  52136
density: 1/2^11, bytes used:  27940
density: 1/2^12, bytes used:  15116
density: 1/2^13, bytes used:   9120
density: 1/2^14, bytes used:   6024
density: 1/2^15, bytes used:   4154

This shows that Roaring Bitmaps have a ~10kB space overhead compared to a naive bitvector until the density gets down to about 1/16. (And the real space savings only begin around a density of 1/32.)

So if the bitvectors all have density greater than 1/16, Roaring Bitmaps will impose an extra ~5GB of storage cost for little benefit. But if there are many vectors less dense than 1/16 (i.e., sequences less than $10^6$ bases long), the savings could be substantial.

What about direct k-mer index?

The authors found about $10^{10}$ unique k-mers; how much space would it take to encode these directly, rather than lossily compress them in Bloom filters?

Unfortunately, $10^{10} \approx 2^{37}$, which means we’re outside of Roaring bitmap’s 32-bit integer domain. (See issue, where Roaring’s author suggests options for pre-partitioning data.)

What if we worked around this by reducing the k-mer size: $2^{32} = 4^{16}$, which means if $k=16$ then the k-mers could fit directly into bitmaps without any need to mess with hash functions, etc.

So with bitmaps of length $2^{32}$ (to account for every possible 16-mer) and typical sequences having $10^7 \approx 2^{24}$ unique k-mers (a density about $2^{-8}$), we can run the numbers:

let m = std::u32::MAX;
let b: Bitmap = (0..m).filter(|_| rng.gen_ratio(1, 2_u32.pow(8))).collect();
println!("{}", b.get_serialized_size_in_bytes());

which gives us ~34MB per sample — about 10x the space requirement (not to mention potential loss of specificity due to using half-sized k-mers). So, probably best to stick with bloom filters =)