Genetic data is widely used in scientific research and rapidly becoming a standard part of healthcare. Genome sequencing generates vast amounts of data, which presents many challenges. One major problem is that the standard format for representing the genomes of many individuals for statistical analysis, Variant Call Format (VCF), which was designed as part of the 1000 Genomes Project in 2010, is not suitable for today’s population-scale datasets. VCF defines the genetic and quality control data for all individuals in a dataset at one position on the genome as a single record, usually encoded as a line of text. With datasets now approaching millions of individuals and billions of records this representation is increasingly unwieldy.
In a paper published in the journal GigaScience the team describe how today’s petabyte-scale genetic datasets can be stored using the popular Zarr data format. The paper shows that translating VCF data to Zarr speeds up statistical analysis and opens up many exciting new possibilities. Because Zarr is an open standard that is widely used to store huge scientific datasets, biologists can now take full advantage of modern infrastrucure like cloud computing and AI frameworks such as PyTorch and TensorFlow to analyse genetic data.