The entire human genome consists of six billion letters; if it had to be stored in books it would fill a library with thousands of volumes. This poses a challenge for scientists and physicians who are looking for single-letter typos (corresponding to genetic defects or cancer) in this multi-volume encyclopedia. The search is harder still because normal genomes, which code for characteristics such as eye or hair color, vary between individuals.
Now a computational biologist at The University of Texas Southwestern Medical Center is proposing a new way to quickly compare genetic sequences with the entire reference volume, a task that currently requires either using incomplete sequences or enormous computing power.
Bioinformatics expert Daehwan Kim was recruited in 2017 from Johns Hopkins University, where he was a postdoctoral researcher, with the help of a First-Time Tenure-Track Award from CPRIT.
One of the crucial steps in sequencing a genome is cutting long DNA strands from millions of cells into tiny pieces of about 100-200 letters each. These tiny pieces are then sequenced. The final step is putting the sequences back together to create a record of the original DNA strand.
Traditionally, the small sequences are aligned using a reference genome whose entire sequence is known. Unfortunately, the reference genome is built from a single genome or a small number of known human genomes, so it fails to reflect the normal genetic diversity of the human population. This could mean flagging typos that aren’t significant genetic mutations, or missing mutations that really are a sign of a genetic defect.
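To make the alignment step concrete, here is a minimal Python sketch of the idea, not Kim's algorithm: it slides each short read along a toy reference, keeps the placement with the fewest mismatched letters, and then labels each mismatch. The function names, the toy sequences, and the `known_variants` table are all hypothetical, chosen only for illustration; real aligners use indexed data structures rather than this brute-force scan.

```python
from typing import List, Optional, Set, Tuple

Mismatch = Tuple[int, str, str]  # (reference position, reference letter, read letter)


def align_read(reference: str, read: str,
               max_mismatches: int = 1) -> Optional[Tuple[int, List[Mismatch]]]:
    """Brute-force placement: try every offset and keep the one with the
    fewest mismatches, provided it stays within max_mismatches."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        mismatches = [
            (start + i, ref_base, read_base)
            for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base
        ]
        if len(mismatches) > max_mismatches:
            continue
        if best is None or len(mismatches) < len(best[1]):
            best = (start, mismatches)
    return best


def classify(mismatches: List[Mismatch],
             known_variants: Set[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Label each mismatch as normal variation or a candidate mutation,
    using a (hypothetical) table of variants seen in healthy genomes."""
    labels = []
    for position, _ref_base, read_base in mismatches:
        if (position, read_base) in known_variants:
            labels.append((position, "known variant"))
        else:
            labels.append((position, "candidate mutation"))
    return labels


# Toy data (assumed for illustration only).
reference = "ACGTTAGCCGATAGGCTTAACG"
reads = ["TAGCCGAT", "GATAGCCT"]

for read in reads:
    hit = align_read(reference, read)
    if hit is None:
        print(read, "-> no placement found")
        continue
    start, mismatches = hit
    # With a single reference, every mismatch looks like a typo.
    print(read, "->", start, classify(mismatches, known_variants=set()))
    # With a table of normal variation, the same mismatch can be explained.
    print(read, "->", start, classify(mismatches, known_variants={(14, "C")}))
```

The second `classify` call shows why a richer reference matters: the same single-letter difference is reported as a candidate mutation when only one reference is consulted, and as a known variant once normal variation at that position is taken into account.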
Kim is developing a computer algorithm that reduces the computing power necessary, while also increasing the number of reference genomes a sequence can be compared with. He’s also making it possible for the computations to be performed with an ordinary desktop computer.
The first human genome ever sequenced, completed in 2000, cost $3 billion and took a decade. Today, sequencing technology enables an entire genome to be sequenced for about $10,000. Kim predicts that soon the cost will drop below $1,000, allowing sequencing of a patient’s entire genome to look for significant genetic variants. (Mail-in genetic sequencing companies typically only sequence portions of each chromosome, looking for variations at specific locations associated with ancestry or traits known to vary considerably between individuals.)
Cancer typically arises from an accumulation of genetic mutations in certain cell populations. These mutations can be inherited or caused by, among other things, damage from ultraviolet light, inhaled smoke or pollution, or diet. Kim’s method promises to be able to look for novel mutations anywhere in the genome that might be linked to cancer.
As a graduate student at the University of Maryland and later at Johns Hopkins, Kim developed programs that made genome analysis hundreds of times faster.
Although Kim could have parlayed his skills into a lucrative job in the tech industry or taken a position at an engineering school, he enjoys solving biological problems. “I really like being surrounded by biologists and medical doctors,” he says, “and I want to tackle practical problems.” CPRIT support has afforded him the ability to pay software engineers enough to entice them into an academic job.
Kim received his undergraduate degree in computer science & engineering from Chung-Ang University in Seoul, South Korea, and spent five years there as a software developer and chief technology officer for a computer-gaming company. He then came to the University of Maryland, College Park, to study for a graduate degree in computer science with a focus on computational biology. After receiving his Ph.D. in 2013, he joined the Center for Computational Biology at Johns Hopkins School of Medicine.