Unlocking the potential of genetic data for research and healthcare

Genetic data is widely used in scientific research and is rapidly becoming a standard part of healthcare. Genome sequencing generates vast amounts of data, which presents many challenges. One major problem lies with Variant Call Format (VCF), the standard format for representing the genomes of many individuals for statistical analysis. VCF was designed as part of the 1000 Genomes Project in 2010 and is not suitable for today’s population-scale datasets. It defines the genetic and quality control data for all individuals in a dataset at one position on the genome as a single record, usually encoded as a line of text. With datasets now approaching millions of individuals and billions of records, this representation is increasingly unwieldy.
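
To make the one-record-per-position layout concrete, here is a minimal Python sketch of reading VCF-style text. The sample names, positions and genotypes are invented for illustration, and the helper functions `parse_records` and `alt_allele_counts` are hypothetical names rather than part of any VCF library. The sketch handles only the simple case of a single genotype field per individual.

```python
# Toy VCF-style text: one line per genome position, with every individual's
# genotype packed into that same line. All names and values are made up.
VCF_TEXT = """\
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3
1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1
1\t250\t.\tC\tT\t60\tPASS\t.\tGT\t0/1\t0/1\t0/0
1\t900\t.\tG\tA\t45\tPASS\t.\tGT\t0/0\t0/0\t0/1
"""


def parse_records(text):
    """Yield (position, {sample: genotype}) for each data line."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and metadata lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            continue
        # Even to learn only the position, the whole line has to be split.
        yield int(fields[1]), dict(zip(samples, fields[9:]))


def alt_allele_counts(text):
    """Count alternate alleles (allele '1') at each position."""
    counts = {}
    for pos, calls in parse_records(text):
        counts[pos] = sum(
            gt.replace("|", "/").split("/").count("1") for gt in calls.values()
        )
    return counts


if __name__ == "__main__":
    print(alt_allele_counts(VCF_TEXT))
```

Because every individual sits on the same line, the length of each record grows with the number of individuals, and any analysis has to scan and split the full text of every line it touches.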

In a paper published in the journal GigaScience, the team describe how today’s petabyte-scale genetic datasets can be stored using the popular Zarr data format. The paper shows that translating VCF data to Zarr speeds up statistical analysis and opens up many exciting new possibilities. Because Zarr is an open standard that is widely used to store huge scientific datasets, biologists can now take full advantage of modern infrastructure like cloud computing and AI frameworks such as PyTorch and TensorFlow to analyse genetic data.
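
The general idea behind array-based storage can be sketched in a few lines of plain Python. This is a toy illustration only: it does not use the Zarr library or its API, and the field names, chunk size and data values are assumptions chosen for the example. It shows each field held as its own array, split into chunks, so that an analysis reads only the fields it needs.

```python
# Toy per-field, chunked layout (illustrative only; this is NOT the Zarr API).
CHUNK = 2  # variants per chunk; a hypothetical chunk size


def chunked(values, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


# Each field is its own array, stored independently of the others.
# Field names and values here are made up for the example.
store = {
    "variant_position": chunked([100, 250, 900], CHUNK),
    "call_genotype": chunked(
        [
            [(0, 0), (0, 1), (1, 1)],  # position 100, three individuals
            [(0, 1), (0, 1), (0, 0)],  # position 250
            [(0, 0), (0, 0), (0, 1)],  # position 900
        ],
        CHUNK,
    ),
}


def read_field(store, name):
    """Read one field by concatenating only that field's chunks."""
    return [item for chunk in store[name] for item in chunk]


def alt_counts(store):
    """Count alternate alleles per variant using only the genotype array."""
    return [
        sum(allele for call in row for allele in call)
        for row in read_field(store, "call_genotype")
    ]


if __name__ == "__main__":
    print(read_field(store, "variant_position"))
    print(alt_counts(store))
```

Reading the positions here never touches the genotype data, in contrast to a text record where both sit on the same line. In a real chunked store the chunks would typically be separate compressed objects on disk or in cloud storage.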

Read the full story on the Big Data Institute website.
