A team of scientists and engineers who analyse and curate the world’s largest genetic datasets have announced a new data format designed to unlock the potential of the millions of genomes now sequenced in healthcare systems around the world.

Genetic data is widely used in scientific research and is rapidly becoming a standard part of healthcare. Genome sequencing generates vast amounts of data, which presents many challenges. One major problem is that the standard format for representing the genomes of many individuals for statistical analysis, Variant Call Format (VCF), which was designed as part of the 1000 Genomes Project in 2010, is not suitable for today’s population-scale datasets. VCF stores the genetic and quality control data for all individuals in a dataset at one position on the genome as a single record, usually encoded as a line of text. With datasets now approaching millions of individuals and billions of records, this representation is increasingly unwieldy.

In a paper published in the journal GigaScience, the team describe how today’s petabyte-scale genetic datasets can be stored using the popular Zarr data format. The paper shows that translating VCF data to Zarr speeds up statistical analysis and opens up many exciting new possibilities. Because Zarr is an open standard that is widely used to store huge scientific datasets, biologists can now take full advantage of modern infrastructure such as cloud computing, and of AI frameworks such as PyTorch and TensorFlow, to analyse genetic data.
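To give a rough sense of why the layout matters, the sketch below contrasts the two approaches using only the Python standard library. It is an illustration under stated assumptions, not the encoding described in the paper and not the Zarr API: the three records are invented toy data, and the helper names `to_columns` and `alt_allele_frequency` are hypothetical. It splits row-oriented, VCF-style text lines into one array per field, so that a statistic such as allele frequency can be computed from the genotype array alone.

```python
# Invented toy data: one VCF-style text line per variant site.
# Tab-separated fields: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT, then one genotype per sample.
VCF_LINES = [
    "chr1\t101\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1",
    "chr1\t250\t.\tC\tT\t99\tPASS\t.\tGT\t0/1\t0/0\t0/0",
    "chr1\t377\t.\tG\tA\t20\tPASS\t.\tGT\t1/1\t1/1\t0/1",
]


def to_columns(lines):
    """Split row-oriented records into one list per field (column-oriented)."""
    columns = {"pos": [], "qual": [], "genotypes": []}
    for line in lines:
        fields = line.split("\t")
        columns["pos"].append(int(fields[1]))
        columns["qual"].append(float(fields[5]))
        # Fields from index 9 onward hold per-sample genotypes such as "0/1".
        columns["genotypes"].append(
            [[int(allele) for allele in gt.split("/")] for gt in fields[9:]]
        )
    return columns


def alt_allele_frequency(genotypes):
    """Fraction of non-reference alleles at each site."""
    freqs = []
    for site in genotypes:
        alleles = [allele for sample in site for allele in sample]
        freqs.append(sum(1 for allele in alleles if allele != 0) / len(alleles))
    return freqs


columns = to_columns(VCF_LINES)
# Only the genotype column is read here; positions and quality scores are never touched.
freqs = alt_allele_frequency(columns["genotypes"])
print(freqs)
```

In the text format, every field on every line has to be parsed before any one of them can be used. Once the data is held as separate arrays, an analysis reads only the fields it needs, which is the general property that array stores such as Zarr are built around.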

Read the full story on the Big Data Institute website.