Robel Hailu

The genome you trust isn't quite finished

Why GRCh38 still has gaps, what T2T-CHM13 fixed, and the practical caveat behind every coordinate you load into a notebook.

By Robel Wolde


For more than a decade, “the human genome” has meant GRCh38, the assembly first released in 2013 and bundled with Ensembl, GENCODE, and most of the alignment tools you have ever used. Bioinformaticians cite it constantly. Patients are sequenced against it. Pharma companies model their pipelines around it.

It has gaps.

Until 2022, the reference was missing about eight percent of the genome: repetitive regions in centromeres, the short arms of acrocentric chromosomes, ribosomal DNA arrays, and stretches of segmental duplication. These were not minor omissions. They harbor genes implicated in disease, structural variants that shape ancestry differences, and sequence that is functionally important even if it does not code for protein. Most of the field treated that eight percent as “we will get to it later” and kept publishing.

The T2T-CHM13 assembly closed those gaps. Released by the Telomere-to-Telomere Consortium, it added nearly 200 million base pairs, fixed thousands of structural errors in GRCh38, and resolved every chromosome in the CHM13 cell line from end to end. (CHM13 carries no Y chromosome; a complete Y from a different donor was added in a later release.) It is the first genuinely complete assembly of a human genome. You should be using it.

Most people are not. Some of the reasons are mundane. One is not.

The mundane reasons

Tooling lags. Variant callers are validated against GRCh38; their performance on T2T-CHM13 is sometimes worse, often the same, occasionally better. Annotation databases like ClinVar still report coordinates against GRCh38. Liftover from GRCh38 to T2T-CHM13 is not lossless: about three percent of variants do not transfer cleanly, and the variants that drop out are not a random sample, so the loss is quietly biased. Most labs are not going to redo six years of analyses for an eight percent gain.
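One way to see whether liftover loss is biased in your own data is to tally failures by region class rather than reporting a single overall rate. The sketch below assumes you have already labeled each variant with a region class of your choosing and recorded whether it transferred; the function name, the labels, and the toy records are all hypothetical, and none of the numbers are real liftover results.

```python
from collections import Counter


def liftover_failure_rates(records):
    """Return {region_class: fraction of variants that failed liftover}.

    `records` is an iterable of (region_class, lifted_ok) pairs.
    region_class is whatever label you assign (an assumption of this sketch);
    lifted_ok is True when the variant transferred cleanly.
    """
    total = Counter()
    failed = Counter()
    for region_class, lifted_ok in records:
        total[region_class] += 1
        if not lifted_ok:
            failed[region_class] += 1
    return {region: failed[region] / total[region] for region in total}


# Toy input, invented for illustration only.
toy_records = [
    ("unique", True), ("unique", True), ("unique", True), ("unique", False),
    ("segdup", True), ("segdup", False),
]
rates = liftover_failure_rates(toy_records)
print(rates)
```

If the per-class rates differ a lot, a single headline failure percentage is hiding where the loss actually falls.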

So: people use both. Pipelines emit calls against the new reference and the old. Curators maintain coordinate maps. The literature splits.

The less mundane reason

A reference is not a description of “the human genome.” It is a description of one human genome (historically, mostly one anonymous donor from Buffalo, New York, known as RP11) that has accreted patches over time. T2T-CHM13 is similarly one source: a hydatidiform mole, conveniently haploid in effect, from a single donor. It is not us, plural.

The Human Pangenome Reference Consortium is trying to fix this. Instead of one reference, it releases a graph: in its first draft, 47 individuals from diverse ancestries, encoded as a pangenome. You align reads to the graph and discover variants that the linear reference would have erased. Early analyses on the pangenome show that GRCh38 systematically misrepresents about 1.6% of the genome for African ancestry samples. In other words, the field has been doing inferential work on a known-bad foundation for a long time.

Graph-based aligners are slower. They cost more. The output schemas are different. Most clinical pipelines cannot ingest them. Part of the field is moving anyway, because the alternative is wrong.

What I tell early-career people

If you are coming into computational biology now, do not skip these layers. Understand that the reference is a choice and not a given. Understand that “well-annotated” means “well-curated by people you do not know, against assumptions you should examine.” Look at how the gene model for your favorite locus has changed across GENCODE versions. Look at how the coordinates differ between Ensembl and NCBI for the same gene. Notice that they sometimes do not agree even on whether a transcript exists.
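The cross-source comparison suggested above can be sketched in a few lines. This assumes you have already parsed each annotation into a dictionary keyed by a shared gene identifier and normalized both to the same coordinate convention, which is the hard part in practice; the function name, the placeholder gene IDs, and the coordinates are all invented for illustration.

```python
def compare_gene_models(source_a, source_b):
    """Classify genes by whether two annotation sources agree on coordinates.

    Each source maps gene_id -> (chrom, start, end). The shared gene_id key
    is an assumption of this sketch; real sources use different identifiers.
    """
    report = {"agree": [], "differ": [], "only_a": [], "only_b": []}
    for gene_id in sorted(set(source_a) | set(source_b)):
        if gene_id not in source_b:
            report["only_a"].append(gene_id)
        elif gene_id not in source_a:
            report["only_b"].append(gene_id)
        elif source_a[gene_id] == source_b[gene_id]:
            report["agree"].append(gene_id)
        else:
            report["differ"].append(gene_id)
    return report


# Placeholder gene IDs and coordinates, invented for illustration only.
annot_a = {"GENE_A": ("chr1", 100, 900), "GENE_B": ("chr2", 50, 400), "GENE_C": ("chr3", 10, 80)}
annot_b = {"GENE_A": ("chr1", 100, 900), "GENE_B": ("chr2", 50, 450), "GENE_D": ("chr4", 5, 60)}
report = compare_gene_models(annot_a, annot_b)
print(report)
```

The "only_a" and "only_b" buckets are the ones worth reading first: they are the cases where the sources do not agree that the gene exists at all.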

This is not a complaint about the field. The field has done extraordinary work assembling, annotating, and re-annotating one of the most complex molecular objects we know about. It is a reminder that the abstractions you load into a Jupyter notebook — gene_id, chrom, start, end — are summaries of an unfinished, contested process.

The practical advice is short. When the analysis matters, run it twice — once against GRCh38 for compatibility, once against T2T-CHM13 for completeness — and compare. The places where they disagree are usually the most interesting parts of the genome. The genome you trust is mostly correct. It is also still being written. Plan accordingly.
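For the run-it-twice comparison, a minimal sketch of the "compare" step might look like the following. It assumes both call sets have already been brought into one coordinate system, which is its own lossy step, and it simply counts calls that appear in only one run per fixed-size window. The function name, the window size, and the toy calls are hypothetical.

```python
from collections import Counter


def disagreement_windows(calls_a, calls_b, window=1_000_000):
    """Count calls present in only one call set, grouped by genomic window.

    Each call is a (chrom, pos, ref, alt) tuple. Both sets must already share
    a coordinate system; converting between assemblies is out of scope here.
    Returns [((chrom, window_index), count), ...], most discordant first.
    """
    discordant = set(calls_a) ^ set(calls_b)  # symmetric difference
    counts = Counter((chrom, pos // window) for chrom, pos, _ref, _alt in discordant)
    return counts.most_common()


# Toy calls, invented for illustration only.
run_grch38 = {("chr1", 1_200_000, "A", "G"), ("chr1", 5_000_100, "C", "T")}
run_t2t = {
    ("chr1", 1_200_000, "A", "G"),
    ("chr1", 5_000_900, "G", "A"),
    ("chr2", 300, "T", "C"),
}
hotspots = disagreement_windows(run_grch38, run_t2t)
print(hotspots)
```

The windows at the top of that list are where I would start reading.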

