Thursday, September 15, 2016

Blending gene sequence variation data into PostgreSQL

VCF, the Variant Call Format, is a clever idea. Instead of storing all genetic information of a sampled genome, it only stores the delta against some reference genome. This squeezes a lot of redundancy out of the data and thus occupies a lot less storage space.

Unfortunately, VCF is also a unwieldy format. Only a part is fixed, with metadata in the header describing fields in the body which in turn describe the actual data format of the samples.

This makes VCF especially hard to grasp for systems like Hadoop that work on chunked files spread over many compute nodes. With VCF, every chunk has to carry a copy of the header to make sense of the data it carries.

Formats like ADAM are under development that tackle this (any many more) problems, but when I was given the task to make VCF files accessible to a database last year, I took the direct route:

Since there have been already quite powerful tools and libraries to work with VCF files around for some time, I used Multicorn and PyVCF and wrote a foreign data wrapper for PostgreSQL that understands the VCF format.

After it was finished, I realized that it had been done before...

However, my implementation is different in some parts. Most notably it abstracts more from the actual storage of the VCF files and it works with vanilla PyVCF and does not need any special modifications.

A few days ago, I was granted permission to release the code into the wild by the people who paid for it in the first place, so now you have the freedom of choice. ;-)

Ladies and Gentlemen, please take a look at just another multicorn based foreign data wrapper for VCF files for PostgreSQL, the one and only truly integrarelational DBMS.

No comments:

Post a Comment