Primate Centromeres

Article
The variation and evolution of complete

human centromeres
https://doi.org/10.1038/s41586-024-07278-3 Glennis A. Logsdon1,14, Allison N. Rozanski1, Fedor Ryabov2, Tamara Potapova3,

Valery A. Shepelev4, Claudia R. Catacchio5, David Porubsky1, Yafei Mao6, DongAhn Yoo1,
Received: 29 May 2023
Mikko Rautiainen7,15, Sergey Koren7, Sergey Nurk7,16, Julian K. Lucas8,9, Kendra Hoekzema1,
Accepted: 7 March 2024 Katherine M. Munson1, Jennifer L. Gerton3, Adam M. Phillippy7, Mario Ventura5,
Ivan A. Alexandrov10,11,12 & Evan E. Eichler1,13 ✉
Published online: xx xx xxxx
Open access
Human centromeres have been traditionally very difficult to sequence and assemble
Check for updates
owing to their repetitive nature and large size1. As a result, patterns of human
centromeric variation and models for their evolution and function remain incomplete,
despite centromeres being among the most rapidly mutating regions2,3. Here, using
long-read sequencing, we completely sequenced and assembled all centromeres from
a second human genome and compared it to the finished reference genome4,5. We find
that the two sets of centromeres show at least a 4.1-fold increase in single-nucleotide
variation when compared with their unique flanks and vary up to 3-fold in size.
Moreover, we find that 45.8% of centromeric sequence cannot be reliably aligned
using standard methods owing to the emergence of new α-satellite higher-order
repeats (HORs). DNA methylation and CENP-A chromatin immunoprecipitation
experiments show that 26% of the centromeres differ in their kinetochore position
by >500 kb. To understand evolutionary change, we selected six chromosomes and
sequenced and assembled 31 orthologous centromeres from the common chimpanzee,
orangutan and macaque genomes. Comparative analyses reveal a nearly complete
turnover of α-satellite HORs, with characteristic idiosyncratic changes in α-satellite
HORs for each species. Phylogenetic reconstruction of human haplotypes supports
limited to no recombination between the short (p) and long (q) arms across
centromeres and reveals that novel α-satellite HORs share a monophyletic origin,
providing a strategy to estimate the rate of saltatory amplification and mutation of
human centromeric DNA.
Advances in long-read sequencing technologies and assembly algo- advances, human centromeres still pose a challenge to sequencing and
rithms have now enabled the complete assembly of complex repetitive assembly. In a recent analysis of human genomes sequenced as part
regions in the human genome, including centromeres4–8. In addition to of the Human Pangenome Reference Consortium (HPRC), no other
these technological advances, completion of the first human genome human genome was completely sequenced across its centromeres10.
was aided by the use of a complete hydatidiform mole (CHM)4—an The centromeres, in particular, were among the most gap-ridden
abnormality of development in which only the paternal chromosomal regions11 and were excluded from the construction of a pangenome10.
complement is retained. The particular cell line, CHM13, simplified the Additional methods and approaches are still required to fully sequence
assembly process because the presence of a single human haplotype and assemble these regions12.
eliminated allelic variation that can otherwise complicate the assembly Human centromeres are among the most diverse and rapidly evolv-
of structurally complex regions6,9. This combination of technologies ing regions of the genome13,14. The bulk of human centromeric DNA is
and resources therefore provided the first complete sequence of each composed of tandemly repeating, approximately 171 bp α-satellite
centromere from a single human genome4,5. Notwithstanding these DNA, which is organized into HOR units that can extend for megabase
1
Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA. 2Masters Program in National Research University Higher School of Economics, Moscow,
Russia. 3Stowers Institute for Medical Research, Kansas City, MO, USA. 4Institute of Molecular Genetics, Moscow, Russia. 5Department of Biosciences, Biotechnology and Environment,
University of Bari Aldo Moro, Bari, Italy. 6Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University,
Shanghai, China. 7Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA.
8
Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA, USA. 9UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA, USA.
10
Department of Human Molecular Genetics and Biochemistry, Tel Aviv University, Tel Aviv, Israel. 11Department of Anatomy and Anthropology, Sackler Faculty of Medicine, Tel Aviv University,
Tel Aviv, Israel. 12Dan David Center for Human Evolution and Biohistory Research, Tel Aviv University, Tel Aviv, Israel. 13Howard Hughes Medical Institute, University of Washington, Seattle, WA,
USA. 14Present address: Department of Genetics, Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA. 15Present address: Institute for Molecular
Medicine Finland (FIMM), Helsinki Institute of Life Science (HiLIFE), University of Helsinki, Helsinki, Finland. 16Present address: Oxford Nanopore Technologies, Oxford, United Kingdom.
✉e-mail: [email protected]
Nature | www.nature.com | 1
Article
pairs (Mb) of sequence. Centromeres are particularly variable among biological relevance of CHM1 centromeres (similar to the T2T-CHM13
humans owing to the action of unequal crossing over, concerted evo- centromeres), both genomes are aberrations of normal development
lution and saltatory amplification12,15. Thus, a single human genome, followed by cell culture propagation. Thus, caution should be taken
such as CHM13, cannot adequately represent human genetic diversity. until all structures and configurations have been confirmed in addi-
Although most of the human genome has been examined for allelic tional human samples.
variation at the base-pair level, studies of centromeric DNA are far
more limited, based on early pulsed-field gels and Southern blots1,16,17,
monomer α-satellite analyses with short reads18,19, or analyses restricted Genetic variation among human centromeres
to select regions or chromosomes5,20,21. Here we present a complete set The complete assembly of each CHM1 centromere enables, in principle,
of centromeres from another human genome using a second hyda- a comprehensive comparison of centromeric allelic sequence and struc-
tidiform mole cell line (CHM1)6,9,22. We compare two complete sets of ture between two human genomes (Fig. 1). In light of the considerable
human centromeres to establish a baseline for single-nucleotide and variation between centromeres and the challenge in creating optimal
structural variation, and we relate these differences to shifts in the site alignments (especially among α-satellite HORs), we analysed the blocks
of kinetochore attachment. We also compare the rate of mutational of monomeric α-satellite DNA in the pericentromere separately from
change of centromeric DNA by sequencing select chromosomes from the α-satellite HOR arrays, and we considered three different alignment
other non-human primate (NHP) species and comparing our findings to strategies, including one designed to specifically handle variation in
finished centromeres from the HPRC10 and Human Genome Structural tandem repeats30 (Methods). We initially compared the centromeres
Variation Consortium (HGSVC)23. from the CHM1 and CHM13 genomes and then extended our analysis
to both complete and incompletely sequenced centromeres from 56
human genomes (Supplementary Tables 3–6). Comparison of the CHM1
The complete sequence of CHM1 centromeres and CHM13 centromeres revealed that 63.0–71.5% of α-satellite HORs
To assemble each CHM1 centromere, we used an approach similar to (depending on the chromosome) could be reliably aligned (that is,
that used for the assembly of the CHM13 centromeres (Supplementary greater than 90% identity; Supplementary Table 3). Extending this
Fig. 1). First, we generated approximately 66-fold sequence coverage analysis to those from 56 diverse human genomes from the HPRC
of Pacific Biosciences (PacBio) high-fidelity (HiFi) sequencing data and and HGSVC, we found that this drops to 53.2–55.3% (Supplementary
about 98-fold coverage of Oxford Nanopore Technologies (ONT) data Table 6), underscoring the considerable variation in these genomes
from the complete hydatidiform mole cell line CHM1 (Supplementary and the emergence of new α-satellite HOR structures in some human
Table 1). We initially used the whole-genome assembler hifiasm24 to haplotypes but not others. For the portions that could be aligned, the
generate a highly accurate backbone genome assembly. Only four results were comparable among the three methods (Supplementary
centromeres were contiguously assembled (from chromosomes 2, 7, Table 3), and we report the full contig alignment statistics with respect
19 and 20), with the remaining 19 fragmented into multiple contigs. We to single-nucleotide variation below (Methods).
resolved the remaining centromeres using singly unique nucleotide In comparing the CHM1 and CHM13 centromeres to each other, we
k-mers (SUNKs) to barcode the PacBio HiFi contigs, bridging them found that sequence identity increases as we transition from hetero-
with ultra-long (>100 kb) ONT reads that share a similar barcode, as chromatin to euchromatin. For example, the mean sequence identity
described previously21. Finally, we improved the base accuracy of the for the alignable portions of the CHM1 and CHM13 α-satellite HOR
assemblies by replacing the ONT sequences with locally assembled arrays is 98.6 ± 1.6%, in contrast to monomeric/diverged α-satellites
PacBio HiFi contigs, generating complete CHM1 centromere assem- at 99.8 ± 0.4% and other pericentromeric satellite DNA (β-satellite,
blies with an estimated base accuracy >99.9999% (QV > 60; Methods). γ-satellite and human satellites) at 99.1 ± 1.5% (Extended Data Fig. 2
Owing to the potential for somatic rearrangement arising during cell and Supplementary Table 4). Extending further into the non-satellite
culture, especially for centromeric regions25,26, we carefully assessed pericentromeric DNA, the sequence identity begins to approximate
the CHM1 cell line for chromosomal rearrangements (Supplemen- rates of allelic variation corresponding to the euchromatic portions
tary Figs. 2 and 3 and Supplementary Notes 1 and 2) and validated the of the genome (99.9 ± 0.3%; Extended Data Fig. 2 and Supplementary
integrity and biological relevance of each CHM1 centromere. First, we Table 4). However, we note that this varies considerably depending
mapped native long-read sequencing data generated from the CHM1 on the chromosome (Fig. 2a, Extended Data Fig. 3 and Supplementary
genome to each centromere assembly and confirmed the integrity of Figs. 16 and 17), and the presence of imperfectly aligned α-satellite
all chromosomes, with two exceptions (Supplementary Figs. 4–8 and repeats further complicates such calculations. The centromeres of
Supplementary Note 2). We next applied an algorithm, VerityMap27, some chromosomes, such as 19 and X, show the highest degree of con-
that identifies discordant k-mers between the centromere assemblies cordance between their α-satellite HOR arrays, whereas all others show
and PacBio HiFi reads and found no evidence of discordance (Meth- greater divergence in both sequence identity and structure (Fig. 2a,
ods). Third, we used a method, GAVISUNK28, that compares SUNKs in Extended Data Fig. 3 and Supplementary Figs. 16 and 17). A comparison
the centromere assemblies to those in the ONT reads generated from of the chromosome 5 D5Z2 α-satellite HOR array, for example, reveals
the same sample and observed support for each SUNK with orthogo- tracts that have as much as 4% sequence divergence, with clear expan-
nal ONT data (Supplementary Figs. 9–11). Fourth, we compared the sions of α-satellite HORs in the CHM1 α-satellite HOR array (Fig. 2a).
sequence of each CHM1 centromere assembly to those generated by Comparison with 56 incompletely assembled HPRC/HGSVC reference
an independent assembler, Verkko29, and found that they were highly genomes10,23 generally confirms that this wide variance in sequence
concordant, with greater than 99.99% sequence identity between each identity is a chromosome-specific property (Extended Data Fig. 4 and
pair (Supplementary Figs. 12 and 13). Finally, we compared both the Supplementary Figs. 18 and 19). Whereas most α-satellite HOR arrays
CHM1 and CHM13 genomes directly to 56 genomes (112 haplotypes) share at least 97% sequence identity, chromosomes 1, 5, 10, 12, 13 and
sequenced as part of the HPRC10 and HGSVC23. While many of these 19 represent clear outliers, with 16.6% of α-satellite HOR arrays aligning
additional human genomes are not yet completely assembled across very divergently (<97% sequence identity; Extended Data Fig. 4 and Sup-
the centromeres, 20.9% of human haplotypes match ≥99% to the newly plementary Figs. 18 and 19). Importantly, neither set of fully resolved
assembled centromeric regions (Supplementary Table 2, Extended human centromeres is a better match for the majority of HPRC/HGSVC
Data Fig. 1 and Supplementary Figs. 14 and 15). In fact, we found that genomes, nor does either adequately capture the full extent of human
46.9% of these haplotypes are a better match to CHM1 than to CHM13 genetic diversity (Supplementary Figs. 20–42). For example, the mean
(Methods and Supplementary Table 2). Although the data support the sequence identity among the 56 HPRC/HGSVC genomes to either CHM1
2 | Nature | www.nature.com
Chromosome 1 Chromosome 2 Chromosome 3 Chromosome 4 Chromosome 5 Chromosome 6
1.42 0.73
2.48 0.86 1.86 2.75
0.31 0.55 0.19 0.22 3.32 2.56
4.32 4.50 2.81 2.77
Mb Mb
1.53 2.34

4.04 3.30 2.58 2.08
2.53 2.63 2.21 2.03 5.36 3.47 3.07 2.58

2.52 1.95 1.76 2.62 1.87 1.01 1.87 1.98 4.29 3.68 5.36 4.97
Chromosome 19 Chromosome 20 Chromosome 21 Chromosome 22 Chromosome X

3.69 3.97 2.31 2.17 1.23 0.34 2.00 2.92
3.32 3.11
0
Scale (Mb)
1
2
3
4
Genome Centromeric regions Chromatin α-Satellite HORs

CHM1 Mon./div. α-satellite CENP-A 2-mer 5-mer 8-mer 11-mer 14-mer 17-mer 20-mer 24-mer
CHM13 Other satellite 3-mer 6-mer 9-mer 12-mer 15-mer 18-mer 21-mer 26-mer
4-mer 7-mer 10-mer 13-mer 16-mer 19-mer 22-mer
Fig. 1 | Overview of the centromeric genetic and epigenetic variation between each pair of chromosomes. The length (in Mb) of the α-satellite
between two human genomes. Complete assembly of centromeres from two higher-order repeat (HOR) array(s) is indicated, and the location of centromeric
hydatidiform moles, CHM1 and CHM13, reveals both small- and large-scale chromatin, marked by the presence of the histone H3 variant CENP-A, is indicated
variation in centromere sequence, structure and epigenetic landscape. The by a dark red circle. Transposable elements that are polymorphic in these regions
CHM1 and CHM13 centromeres are shown on the left and right, respectively, are shown in Supplementary Fig. 73. Mon./div., monomeric/diverged.
or CHM13 is 98.0 ± 2.3% (Supplementary Table 6). Similarly, we find range of variation (1.7-fold to 79.7-fold; median, 2.3-fold), based on
that 11 centromeres are a better match to CHM1, while 12 are a better released haplotype-phased genome assemblies from the HPRC10
match to CHM13 (Supplementary Table 2). However, if we require that and HGSVC23 (Fig. 2c). Our analysis shows, for example, that human
more than 75% of all HPRC haplotypes match better to either CHM1 or α-satellite HOR arrays range in size from 0.03 Mb on chromosome 4 to
CHM13, only five centromeres meet this requirement for CHM1 (chro- 6.5 Mb on chromosome 11. Chromosomes 3, 4 and 21 represent some
mosomes 2, 12, 13, 19 and 22), while seven do for CHM13 (chromosomes of the smallest α-satellite HOR arrays and show the greatest variation
3, 4, 7, 10, 11, 14 and 15; Supplementary Table 2). These analyses reflect in length among human haplotypes (Fig. 2b; 13.3-fold, 19.0-fold and
an extraordinary degree of single-nucleotide and structural diversity 9.0-fold difference, respectively). Almost all of the large-scale structural
of human centromeres. variation is due to variation in α-satellite HOR array organization and
Comparison of the length of the α-satellite HOR arrays reveals that size, although the patterns are considerably more complex than simple
CHM1 arrays are around 1.3-fold larger, on average, than their CHM13 insertion, deletion or inversion processes.
counterparts, with 16 out of 23 chromosomes containing a larger array Comparison of the CHM1 and CHM13 centromeres identifies eight
in CHM1 than in CHM13 (Figs. 2b and 3a and Supplementary Table 7). Of with distinctly different α-satellite HOR array structures (chromo-
these, five arrays are more than 1.5-fold larger in CHM1 than in CHM13 somes 5, 7, 8 and 10–14; Fig. 3b,c and Supplementary Fig. 43). This
(chromosomes 3, 4, 11, 15 and 21), with the greatest variation in length includes four arrays with a high abundance of previously uncharacter-
occurring on chromosome 21 (3.6-fold; Fig. 3a). This variation in length ized α-satellite HORs (chromosomes 5, 7, 10 and 14; Supplementary
between CHM1 and CHM13 α-satellite HOR arrays falls within the normal Fig. 44 and Supplementary Table 8). The centromeric D5Z2 α-satellite
Article
a Chromosome 19 Chromosome 18 Chromosome 5
α-satellite HOR array

CHM13 D19Z3

100.0
CHM13 D18Z1
Sequence identity (%)

CHM13 D5Z2
97.5
95.0
92.5
90.0
CHM1 D5Z2
CHM1 D19Z3 CHM1 D18Z1 α-satellite HOR array
α-satellite HOR array α-satellite HOR array
Centromeric regions α-Satellite HORs
Mon./div. α-satellite 2-mer 6-mer 10-mer 14-mer 20-mer 26-mer Scale (Mb)
Other satellite 3-mer 7-mer 12-mer 16-mer 22-mer
Transition region 4-mer 8-mer 13-mer 18-mer 24-mer 0 1 2 3
b
8 Genome
Length of α-satellite HOR arrays
in diverse human genomes (Mb)
7 CHM1
CHM13
Other
6
1
0
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
Chromosome
Fig. 2 | The variation in sequence and structure between two sets of human b, The length of the active α-satellite HOR arrays among the CHM1 (red), CHM13
centromeres. a, The allelic variation between CHM1 and CHM13 centromeric/ (black) and complete HPRC/HGSVC (various colours) centromeres. n = 626.
pericentromeric haplotypes. Diagonal lines are coloured according to per cent The α-satellite HOR arrays range in size from 0.03 Mb on chromosome 4 to
sequence identity. The α-satellite HOR array structure is shown on the axes, 6.5 Mb on chromosome 11. Data are mean (solid black bar) and 25% and 75%
along with the organization of each centromeric/pericentromeric region. quartiles (dotted black bars).
array from CHM1 chromosome 5, for example, is significantly more to six, with the majority having four (Extended Data Fig. 5 and Sup-
diverse, containing two novel α-satellite HOR variants that are four plementary Figs. 45–67).
and six α-satellite monomers in length (Fig. 3b and Supplementary
Fig. 44a). Phylogenetic and comparative analysis of these HOR variants
reveals that they are both derivatives of an ancestral ten-monomer Epigenetic differences among centromeres
α-satellite HOR, which resides at the edge of the D5Z2 α-satellite The kinetochore is a proteinaceous complex marked by the presence
HOR array. These novel HORs, confirmed by analysis of the HPRC of nucleosomes containing the histone H3 variant CENP-A, which is
genomes10, probably arose from repeated deletions of α-satellite mon- critical for both meiotic and mitotic segregation of chromosomes. Pre-
omers in the ancestral HOR, giving rise to novel four- and six-monomer vious studies have shown that the kinetochore typically resides within
HOR variants (Supplementary Fig. 44a). Moreover, specific α-satellite a region of hypomethylated DNA, named the centromere dip region
HORs appear to be more consolidated, forming distinct evolutionary (CDR)5,31, that colocalizes with CENP-A immunostaining21. We assessed
layers that are not as apparent or are completely absent in the other the DNA methylation pattern and CENP-A chromatin organization of
haplotype. A clear 870 kb evolutionary layer, for example, is apparent each CHM1 centromere and compared it with its CHM13 counterpart.
in the CHM1 chromosome 5 centromeric D5Z2 α-satellite HOR array, Although CHM1 centromeric α-satellite HOR arrays are typically larger,
and it corresponds to a cluster of highly identical eight-monomer the majority of CHM1 kinetochore sites (18 out of 23) are smaller than
α-satellite HORs (Fig. 3b). This evolutionary layer is absent from the their CHM13 counterparts, with an average size of 178 kb versus 214 kb,
CHM13 centromere, of which the eight-monomer α-satellite HORs respectively (Fig. 4a and Supplementary Table 7). Moreover, 16 out of
are more dispersed along with four-monomer HORs. Similarly, the 23 CHM1 kinetochore sites are located further than 100 kb away from
CHM1 chromosome 11 D11Z1 centromere evolved a 1.2 Mb layer in their corresponding location in the CHM13 centromere, with six located
the core of its α-satellite HOR array that is missing from the CHM13 further than 500 kb away (chromosomes 4, 6, 11, 12, 18 and 20), when
centromere (Fig. 3c). This novel layer is composed of six-monomer measuring the distance from the α-satellite HOR-to-monomeric transi-
α-satellite HORs that are found only rarely in the CHM13 centromere. tion region (Fig. 4b and Supplementary Table 7). Consistent with earlier
We observed new evolutionary layers in the CHM1 chromosome 10, observations5, we identified five chromosomes with evidence of two
12 and 13 α-satellite HOR arrays, all of which have divergent array kinetochore sites, separated by >150 kb (chromosomes 1, 2, 13, 17 and 19).
structures. The remaining centromeres have a similar number of In the case of chromosomes 13 and 19, the two distinct kinetochore sites
evolutionary layers between the two genomes, ranging from two are located more than 1 Mb apart from each other (Extended Data Fig. 6).
a b CHM1 CHM13
centromere centromere
1 Chromosome 5
2
2.9×
3 Structure
1.9×
4 Mon./div. α-satellite
10-mer
5 8-mer
6 6-mer
α-Satellite
7 HORs 4-mer
2-mer
8 6-mer*
Novel α-satellite HOR variants
9 4-mer*
2 HSAT
10 Evolutionary layers 1 3 4 3 HSAT 1 1 2 3 1
Chromosome
1.5×
11 4
3
12
3 Kinetochore sites Kinetochore site
13 2
14 2
1.9×
15 1
16 1
17
18 CHM1 CHM13
19 c centromere centromere
20 Chromosome 11
3.6×
21
22
Structure
X Mon./div. α-satellite
α-Satellite 5-mer
0 1 2 3 4 HORs 6-mer*
Ratio of CHM1:CHM13 active 2 2 2
α-satellite HOR array length (Mb) Evolutionary layers 1 3 4 3 1 1 2 3 1
4
3
Centromeric regions Sequence identity (%)
Mon./div. α-satellite 3 Kinetochore sites Kinetochore sites
Other satellite 2
70 80 90 95 100
Transition region
Sequence identity 1
Scale (Mb) (inset; %) 2
0 1 2
90.0 93.5 97.0 100.0
1
Fig. 3 | Variation in the length and sequence composition of human and comprises a new evolutionary layer, or a stretch of sequence that has evolved
centromeric α-satellite HOR arrays. a, Ratio of the length of the active separately from neighbouring sequences (layer 4; indicated with an arrow),
α-satellite HOR arrays in the CHM1 genome compared with those in the CHM13 although this 1.21 Mb segment is more highly identical to the flanking sequence.
genome. b,c, Comparison of the CHM1 and CHM13 chromosome 5 D5Z2 The inset shows each of the new evolutionary layers with a higher stringency of
α-satellite HOR arrays (b) and CHM1 and CHM13 chromosome 11 D11Z1 α-satellite sequence identity, as well as the relative position of the kinetochore. Notably,
HOR arrays (c). The CHM1 chromosome 5 D5Z2 array contains two novel the α-satellite HOR variants comprising the new evolutionary layers in both
α-satellite HOR variants (Supplementary Fig. 44a) as well as a new evolutionary CHM1 chromosomes 5 and 11 have divergent CpG methylation patterns despite
layer (layer 4; indicated by an arrow), which is absent from the CHM13 array. their identical structure (Supplementary Fig. 74). Asterisk, α-satellite HORs
Similarly, the CHM1 chromosome 11 D11Z1 α-satellite HOR array contains a variants that are either novel or present in higher abundance in the CHM1
six-monomer HOR variant that is much more abundant than in the CHM13 array centromere relative to the CHM13 centromere.
To test whether these two kinetochore sites represent two distinct cell haplotypes. This change spans 87–88% of the length of the α-satellite
populations or, alternatively, an early-stage somatic mutational event HOR array itself and coincides with an alteration in the underlying
resulting in two kinetochores on the same chromosome, we performed α-satellite HOR sequence and structure, switching from a mixture of 16-
immunostaining combined with fluorescence in situ hybridization and 18-monomer α-satellite HORs to a mixture of 15- and 18-monomer
(immuno-FISH) analysis of stretched CHM1 metaphase chromosome HORs (Fig. 4c). Given the complete sequence of CHM1 and CHM13 cen-
spreads. We found that the chromosome 13 centromere has a single tromeres and the availability of incomplete assemblies from 56 diverse
kinetochore, marked by the inner-kinetochore protein CENP-C, within human genomes, we assessed whether the sequences underlying the
the D13Z2 α-satellite HOR array, while the chromosome 19 centromere kinetochore were more likely to be conserved compared with α-satellite
has two kinetochores within the D19Z3 α-satellite HOR array (Extended HORs that were not associated with the kinetochore. While we observed
Data Fig. 7). Assessment of the underlying sequence and structure of the clear examples of sequence conservation underlying the kinetochore
chromosome 13 D13Z2 α-satellite HOR array reveals a 631 kb deletion for specific chromosomes involving both CHM1 and CHM13 (for exam-
in approximately half of CHM1 cells (Supplementary Fig. 5c and Sup- ple, chromosomes 4, 5, 7, 12, 13, 16 and 18), other kinetochore regions
plementary Note 2), which may have contributed to the repositioning appeared to be more similar (chromosomes 1–3, 6, 8, 9, 17, 20, 21 and
of the kinetochore in a subpopulation of cells, whereas the chromo- X) or more divergent (chromosomes 10, 11, 14, 15, 19 and 22) than other
some 19 centromere has no such deletion and may have had two kine- portions of the α-satellite HOR array (Supplementary Figs. 20–42).
tochores present from the first few cell divisions. Centromeres with two
kinetochores (known as dicentrics) have been previously observed in
humans and other species and have been shown to be viable, even with Diverse evolutionary trajectories
interkinetochore distances of up to 12 Mb (refs. 32,33). Our analyses (Figs. 1–4) revealed that human centromeres vary non-
The chromosome 6 centromere shows the greatest variation in uniformly depending on the chromosome. In particular, specific
kinetochore position, with a difference of 2.4 Mb between the two human chromosomes show either highly variable α-satellite HOR array
Article
a b
0.5 NS 3
of the kinetochore site (Mb)

Difference in the position
2.4
kinetochore site (Mb)

0.4 2.2
Length of the
2
0.3
0.2 1.0
1 0.9
0.7
0.5
0.1
0 0
CHM1 CHM13 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X
Chromosome
c CHM1 CHM13
centromere centromere
Chromosome 6
Kinetochore site Kinetochore site
Structure
18-mer
α-Satellite 16-mer
HORs
15-mer
1 1
Frequency of
CpG methylation
0 0
Ratio of 600 3,000
CENP-A ChIP:bulk
nucleosome reads
(replicate 1) 0 0
0 1 2 3 0 1 2 3
Position (Mb) Position (Mb)
Fig. 4 | Variation at the site of the kinetochore among two sets of human was performed using a two-sided Kolmogorov–Smirnov test; NS, not significant.
centromeres. a, Comparison of the length of the kinetochore site, marked by b, The difference in the position of the kinetochore among the CHM1 and
hypomethylated DNA and CENP-A-containing chromatin, between the CHM1 CHM13 centromeres. c, Comparison of the CHM1 and CHM13 chromosome 6
and CHM13 centromeres. n = 28 and 25 kinetochore sites for the CHM1 and centromeres, which differ in kinetochore position by 2.4 Mb.
CHM13 centromeres, respectively. Data are mean ± s.e.m. Statistical analysis
lengths (such as chromosome 21), diverse α-satellite HOR organi- became apparent during this analysis (Fig. 5). For example, we estimate
zations (such as chromosomes 5, 10 and 12) or divergent epigenetic that common chimpanzee α-satellite HOR arrays are, on average, 67.8%
landscapes (such as chromosome 20). By contrast, the X chromosome the size of their human counterparts—a reduction observed in both
is among the most conserved, with nearly identical sequences and chimpanzee haplotypes. Like humans, chimpanzee α-satellite HOR
structures among diverse human genomes (Supplementary Fig. 42). arrays show evidence of clear evolutionary layers, with the pairwise
These findings imply that centromeres may have different mutation sequence identity of these layers dropping as they move toward peri-
rates and diverse evolutionary trajectories that shape their variation. centromeric DNA. This layered α-satellite HOR organization consists
To test this hypothesis, we sequenced and assembled orthologous mainly of a single, continuous block of higher-order α-satellite repeats
centromeres from four primate species, focusing on the completion that are >95% identical to each other, except for on chromosomes 12 and
of these six centromeres, in an effort to reconstruct their evolution- 20, which have two or three discrete blocks of α-satellite HORs that are
ary history over a 25-million-year window of primate evolution. Each only 90–95% identical to each other. By contrast, orangutan centromere
was specifically selected because it represents different forms of organization differs radically from either human or chimpanzee. We
centromeric diversity, but additional analyses, such as sampling all found that orangutan α-satellite HOR arrays are composed of three to
centromeres across multiple individuals, will need to be performed four distinct blocks of α-satellite HORs that are only 80–90% identical
to fully assess the complete diversity. To assemble these centromeres, to each other, creating a mosaic of independent HOR expansions with
we first generated PacBio HiFi data (38- to 100-fold coverage) from a patchwork-quilt pattern based on sequence identity (Fig. 5). Finally,
diploid human, chimpanzee, orangutan and macaque genomes (Meth- macaque centromeric α-satellite arrays are substantially larger in size,
ods), producing whole-genome assemblies ranging from 6.1 to 6.3 Gb with an average length of 12.2 ± 1.6 Mb. In contrast to apes, which pos-
in size (Supplementary Table 1). Using ultra-long ONT data (14- to sess complex HOR structures, macaque centromeric arrays are com-
20-fold coverage), we then ordered, oriented and joined the PacBio posed of dimeric α-satellite units18,35 that are 93–97% identical across
HiFi contigs together from each centromere, creating 31 contigu- all centromeres.
ous assemblies of primate centromeres for these six chromosomes Assessment of the α-satellite suprachromosomal families (SFs),
(Fig. 5). Mapping of long-read sequencing data to each centromere which are classes of α-satellite HORs defined by their monomer type
showed uniform coverage, indicating a lack of large structural errors and homology36, revealed three unexpected findings. First, we iden-
and validating the overall organization (Supplementary Figs. 68–71). tified an African ape centromere that is primarily composed of SF5
With the exception of the X chromosome from a male chimpanzee, α-satellite repeats: the chimpanzee chromosome 5 centromere. While
both haplotypes were completely sequenced for each diploid female all human and chimpanzee α-satellite arrays are mainly composed of
sample, providing additional insights into their overall organization α-satellites from SF1–3 with the exception of the Y chromosome (SF4)37,
and variation (Fig. 5). we found that both chimpanzee chromosome 5 centromeres harbour
Comparative analysis of these six sets of NHP centromeres revealed, SF5 α-satellite. Second, we found that all four chimpanzee chromo-
as expected2,18,19,34, diverse α-satellite HOR array organization and struc- some 20 and 21 α-satellite HOR arrays are composed of SF1 α-satellite,
tures, with arrays varying in size by more than 18.6-fold (the small- as opposed to SF2 as in human. Third, we found multiple examples of
est residing on human chromosome 21, and the largest residing on large-scale inversion polymorphisms across α-satellite arrays, such as
macaque chromosome 20). Distinct species-specific differences also one of two orangutan chromosome 20 centromeres, which contains a
CHM1/CHM13 Human Chimpanzee Orangutan Macaque
3.31 2.56
Mb Mb 2.47 5.05 2.18 1.85 10.58 10.32 11.77 11.65
Chr. 5
2.19 2.03 2.48 2.56 1.72 1.58 5.59 7.34 11.32 9.39
Chr. 10
3.07 2.58 2.31 4.08 1.46 2.04 13.20 13.02

0.13 0.14
Vestigial Vestigial
Chr. 12 Neocen Neocen
2.31 2.18 3.23 3.68 3.14 3.18 4.47 7.26 15.07 13.85
Chr. 20
1.23 0.34 0.82 0.81 2.17 1.44 4.52 3.89

0.002 0.002
Chr. 21
3.32 3.11 3.49 3.84 1.27 6.82 8.94 11.42 10.99
Chr. X
Human Orangutan
Centromeric regions α-Satellite SFs Sequence identity (%) Chimpanzee Macaque
Mon./div. α-satellite SF1 SF4
Other satellite SF2 SF5 0 1 2 0 2 4
Transition region SF3 SF7 70 80 90 100 Scale (Mb) Scale (Mb)
Fig. 5 | Sequence and structure of six sets of centromeres from diverse indicated (vertical bar colour), with arrows illustrating the orientation of the
primate species. Complete assembly of centromeres from chromosomes 5, repeats within the array. Chromosome 12 in orangutan has a neocentromere,
10, 12, 20, 21 and X in human, chimpanzee, orangutan and macaque reveals while the chromosome 21 centromere in macaque is no longer active due to a
diverse α-satellite SF organization and evolutionary landscapes. Sequence chromosomal fusion event in that lineage 45. All chromosomes are labelled
identity maps generated using StainedGlass44 are shown for each centromere according to the human phylogenetic group nomenclature46. The human diploid
(Methods and Supplementary Figs. 75–80), with the size of the α-satellite genome used as a control (second column) is HG00733—a 1000 Genomes sample
higher-order (human, chimpanzee and orangutan) or dimeric (macaque) of Puerto Rican origin. Note that the orangutan and macaque centromeres are
repeat array indicated in Mb. The α-satellite SF for each centromeric array is drawn at half the scale with respect to the other apes.
large 3.2 Mb inversion (Supplementary Fig. 72), and all four macaque array, while more ancient α-satellite sequences are located within inac-
chromosome 5 and 10 centromeres. tive D5Z1 α-satellite HOR arrays (Fig. 6a). This is in contrast to the chro-
Despite these species-specific patterns, a common feature of all mosome 12 centromere, which contains α-satellite HORs that are shared
primate centromeres is the presence of two to five distinct evolution- among orangutan and chimpanzee (Extended Data Fig. 8b). Finally,
ary layers, marked by the most highly identical α-satellite sequences the chromosome X centromere is composed of α-satellite HORs and
at the centre of the α-satellite array that become increasingly diver- monomers that are evolutionarily similar to each other, and in con-
gent towards the periphery. These more divergent higher-order and trast to the other centromeres, are also similar to macaque’s α-satellite
dimeric repeats are flanked by blocks of monomeric α-satellite DNA. monomers (Fig. 6b).
We performed phylogenetic and comparative analyses of all six com-
plete orthologous centromere sets and observed that monomeric
α-satellite is generally more closely related to the Old World monkey Mutation rate estimation
dimeric satellites of macaques. Notwithstanding this general topology, As our analyses showed that monomeric α-satellite sequences mutate
distinct chromosome-specific patterns emerge (Fig. 6 and Extended less quickly and can be readily aligned among human and non-human
Data Fig. 8). The chromosome 5 centromere, for example, has evolved apes, we focused first on the pericentromeric DNA flanking the
human-specific α-satellite that defines the active D5Z2 α-satellite HOR α-satellite HOR array. On the basis of the complete sequence from
Article
Inactive Inactive
a Great ape c α-satellite Active α-satellite
Human monomeric HORs α-satellite HORs
higher-order α-satellite Mon. HORs HSat3
α-satellite (p- and q-arm) Chr. 5 centromere
(active array)
10–4
Great ape
10–5
(mutations per bp
monomeric
per generation)
Mutation rate
α-satellite 10–6
(p-arm)
10–7
Great ape
higher-order 10–8
Macaque
α-satellite dimeric and 10–9
(inactive arrays) Human monomeric
higher-order α-satellite 10–10
α-satellite 45 46 47 48 49 50 51 52
(active array) CHM13 chromosome 5 position (Mb)
Active
b d α-satellite
Mon. HORs Mon.
Great ape Great ape
monomeric and macaque Chr. X centromere
α-satellite monomeric
α-satellite 10–4
(mutations per bp
10–5
per generation)
Mutation rate
10–6
Great ape
higher-order 10–7
and monomeric Macaque 10–8
α-satellite dimeric and
monomeric 10–9
α-satellite 10–10
Tree scale: 0.1 56 57 58 59 60 61 62
CHM13 chromosome X position (Mb)
Human (CHM13) Human (HG00733; H1) Chimpanzee (H1) Orangutan (H1) Macaque (H1)
Human (CHM1) Human (HG00733; H2) Chimpanzee (H2) Orangutan (H2) Macaque (H2)
Fig. 6 | Centromeres evolve with different evolutionary trajectories and are shown. Note that the regions corresponding to the active α-satellite HORs
mutation rates. a,b, Phylogenetic trees of human, chimpanzee, orangutan have only approximate mutation rates based on human–human comparisons.
and macaque α-satellites from the higher-order and monomeric (mon.) Owing to unequal rates of mutation and the emergence of new α-satellite HORs,
α-satellite regions of the chromosome 5 (a) and X (b) centromeres, respectively. interspecies comparisons are not possible in these regions. HSat3, human
c,d, The mutation rate of the chromosome 5 (c) and X (d) centromeric regions, satellite 3.
respectively. Individual data points from 10 kb pairwise sequence alignments
human and NHP centromeric transition regions, we estimated the muta- of suppressed or limited recombination across the region. Importantly,
tion rate of the approximately 1–2 Mb region flanking the α-satellite haplotypes containing new α-satellite HORs most often share a mono-
HOR arrays using established evolutionary models (Methods) and phyletic origin (Fig. 7b and Extended Data Figs. 9–11). For example, in
found that the mutation rate increases from 1.1- to 4.1-fold compared the case of chromosome 12, we estimate the new HORs emerged approx-
with the unique portions in each of the six centromeres (Fig. 6c,d and imately 13–23 thousand years ago; Fig. 7b), while for chromosome 11,
Extended Data Fig. 8e–h). The greatest increase in the mutation rate they emerged approximately 80–153 thousand years ago (Extended
was observed for the chromosome 5 centromere (4.1-fold), while the Data Fig. 11a). This suggests a single origin for the new α-satellite HORs,
smallest increase was observed for the chromosome X centromere followed by the saltatory spread of >1 Mb of new HORs to this subset
(1.1-fold), consistent with the observed rapid and slower structural of human haplotypes. As we are specifically selecting haplotypes that
diversity for this chromosome. Owing to nearly complete evolutionary show a saltatory amplification of α-satellite HORs, these rate estimates
turnover of the α-satellite HORs, biologically meaningful alignment should not be considered genome- or even centromere-wide rates of
comparisons among humans and NHPs could not be made. However, change.
analyses of the sequence alignments among the four human haplo- By directly comparing the structure of the α-satellite HOR arrays
types suggest a potential mutation rate increase of at least an order of with the nearest human haplotype lacking the newly derived HORs, we
magnitude, given the caveat that substantial portions of the α-satellite computed the difference in the number of base pairs, α-satellite mono-
repeats do not align. mers, α-satellite HORs and distinct structural changes (Supplementary
To understand the nature of evolutionary change within the Table 9). Using these α-satellite HORs as a benchmark, our results sug-
α-satellite HOR arrays and especially the emergence of new HORs, we gest 392–2,490 nucleotide differences (or up to two α-satellite HORs)
used a population genetics approach leveraging the genetic diversity per generation, on average, to create the new HORs on chromosomes 11
present in the HPRC10 and HGSVC23 genomes. We reasoned that less and 12 (Fig. 7b and Supplementary Fig. 11a). Given the average length of
divergent sequence comparisons within the human species would each α-satellite HOR array and the estimated coalescent time, this trans-
enable more accurate alignments and, therefore, better reconstruc- lates to considerably different rates for the emergence of these new
tion of the series of mutational events occurring within the α-satellite α-satellite HORs on chromosomes 11 (~30–60 nucleotide differences
HOR arrays. Given the relative stability of the flanking monomeric per Mb per generation) and 12 (~500–1,000 nucleotide differences per
satellite DNA, we constructed phylogenetic trees using the chimpanzee Mb per generation; Supplementary Table 9). While caution should be
sequence as an outgroup and estimated separation times for different exercised given the focus on new α-satellite HOR structures and the
human haplotypes, assuming a chimpanzee and human divergence limited number of human haplotypes compared, a notable finding is
time of 6 million years (Fig. 7 and Extended Data Figs. 9–11). Under both the speed at which these new HORs emerged and the interdigitated
the assumption that there is limited to no recombination across the nature of new α-satellite HORs intermixed with relic ancestral HORs.
α-satellite HOR array, we then compared the topologies of both the Our results suggest approximately 100 distinct structural changes
p- and q-arms, focusing specifically on haplotypes in which we had (insertions and deletions) as this new HOR variant evolved. This pat-
documented the emergence of novel α-satellite HOR arrays. Despite tern implicates mechanisms other than simple unequal crossover for
being anchored in sequence separated 2–3 Mb apart, the p- and q-arm the spread of novel α-satellite HORs within centromeres. The change
topologies of the resulting trees were similar, consistent with the notion in array structure is probably due to saltatory amplification of newly
a b Chromosome 12
Mon./div. Mon./div.
HOR HG02723 (H2) HG03486 (H2)
p q * HG03486 (H2) HG02723 (H2) *
26 ka
30 ka HG02630 (H2) HG02630 (H2)
HG02559 (H2) HG01106 (H1)
13 ka
* HG00735 (H2) HG02559 (H2) *23
HG01123 (H2) HG01109 (H1)
p-arm tree q-arm tree HG01109 (H1) HG01358 (H1) ka
HG01258 (H2) HG02486 (H2) *q-arm
97 CHM13 HG01258 (H2)
p-arm
* 556
HG02486 (H2) HG00735 (H2)
HG01358 (H1) CHM13
ka HG01106 (H1) HG01123 (H2)
Diverged HG02630 (H1) HG02630 (H1)
1.4 Ma HG03540 (H2) HG03540 (H2) 90 485 ka
*
67 ka 99 HG00438 (H2) CHM1
Diverged
HG00438 (H1)
90 94 HG01978 (H2)
HG01361 (H2)
HG01358 (H2)
*
92 257 1.4 ma
HG01928 (H1) HG03540 (H1) 99 ka
667 ka CHM1 HG01071 (H2)
HG02055 (H2) HG02055 (H2)
<1 ka * HG03540 (H1) HG01928 (H1)
HG01361 (H2) HG00438 (H1)
92 HG01071 (H2) HG01978 (H2) 320 ka
HG01358 (H2) HG00438 (H2)
HG01109 (H2) HG03486 (H1)
HG03486 (H1) HG02145 (H2) *89
* HG02145 (H2) HG03579 (H2)
ka
72 HG03579 (H2) HG02622 (H1)
ka HG02622 (H1) HG03098 (H2)
* HG03098 (H2) HG00741 (H2) 91
306 ka
NA18906 (H2) NA18906 (H2) *<1
*
<1 ka HG00741 (H2) HG01109 (H2) ka
HG00741 (H1) HG00741 (H1)
Centromeric regions Superpopulation NA20129 (H2) HG02723 (H1)
Mon./div. α-satellite African 99 HG02723 (H1) NA20129 (H2)
Other satellite American HG02818 (H1) HG02818 (H1)
Transition region 99 HG005 (H1) HG005 (H1)
East Asian 99
South Asian HG03492 (H1) HG01928 (H2)
European HG01952 (H2) HG01952 (H2)
* HG01928 (H2)
<1 ka 96 HG01243 (H1)
HG01243 (H1)
α-Satellite HORs HG00735 (H1)
2-mer 6-mer 14-mer Tree scale HG00735 (H1) HG03492 (H1)
3-mer 8-mer 16-mer
4-mer 10-mer 0.001 0 1 2 3 0 1 2 3
5-mer 12-mer Position (Mb) Position (Mb)
Fig. 7 | Phylogenetic reconstruction of human centromeric haplotypes and pattern of new α-satellite HOR insertions and deletions over a short period of
the saltatory amplification of new α-satellite HORs. a, The strategy to evolutionary time. The asterisks indicate nodes with 100% bootstrap support,
determine the phylogeny and divergence times of completely sequenced and nodes with 90–99% bootstrap support are indicated numerically. Nodes
centromeres using monomeric α-satellite or unique sequence flanking the without an asterisk or number have bootstrap support <90%. The haplotypes
canonical α-satellite HOR array from both the p- and q-arms. Chimpanzee was from the p- and the q-arm trees are linked with a light teal bar, as shown in the
used as an outgroup with an estimated species divergence time of 6 million schematic in a. Note that most differences in the order of the haplotypes occur
years ago (Ma). b, Maximum-likelihood phylogenetic trees depicting the p- at the terminal branches, where the order of sequence taxa can be readily
and q-arm topologies along with the estimated divergence times reveal a reshuffled to establish near-complete concordance. Thus, there are no
monophyletic origin for the emergence of new α-satellite HORs within the significant changes in the overall topologies of the phylogenetic tree.
chromosome 12 (D12Z3) α-satellite HOR array. This array shows a complex ka, thousand years ago.
emerged α-satellite HOR variants at multiple sites in the original HOR underscores the centromere paradox3, an unresolved conundrum
array, leading to an overall increase in array size from 554 kb to 2 Mb, regarding the contradictory phenomenon of rapidly evolving centro-
on average (Fig. 7b and Extended Data Figs. 9–11). meric DNA and proteins despite their essential role in ensuring faithful
chromosome transmission.
Comparison of the sequence and structure across six sets of ortholo-
Discussion gous primate centromeres suggests near-complete lineage-specific
Here we present a detailed comparative analysis of two completely turnover of α-satellite HORs as well as unique features specific to each
assembled reference sets of human centromeres compared to a diver- lineage. We find that chimpanzee α-satellite HOR arrays are around
sity panel of human and NHP centromeres. We show a demonstrable 67% smaller than their human counterparts. Orangutan α-satellite
acceleration of single-nucleotide and structural variation transition- HOR arrays are organized as a mosaic patchwork of distinct α-satellite
ing from euchromatin to heterochromatin, with most of this excess HOR blocks with a high degree of divergence. Macaque centromeres
occurring within the core of the centromeric α-satellite HOR arrays. are consistently the largest, but are also more homogenous and com-
An important caveat is that a substantial fraction (45–47%) of the composed of dimeric α-satellites that are approximately 95% similar in
pletely sequenced centromeres cannot be readily aligned to either of sequence to each other, with blocks of polymorphic inversions present
the two references, owing in part to the emergence of new α-satellite on some centromeres. Using the emergence of these HOR structures
HOR structures (Supplementary Tables 6 and 8 and Supplementary within human as a marker of evolutionary mutability, our coalescent
Figs. 43 and 44). These initial mutation rate estimates therefore prob- approach suggests that centromeric α-satellite HOR arrays can mutate
ably represent an underestimate until a greater diversity of human and multiple orders of magnitude more quickly than unique DNA (estimated
NHP centromeres is sampled. Notably, we find that the predicted site of at 30–1,000 nucleotides per generation per Mb based on our analysis
the kinetochore attachment varies considerably in location, with eight of newly emerged HORs on chromosomes 11 and 12; Supplementary
differing by more than 500 kb in these two human genomes (Fig. 4b). Table 9). These changes in DNA occur most frequently in concert with
While some of this repositioning corresponds to the emergence of gains and losses of α-satellite HOR units and do not appear to do so in a
novel α-satellite HORs (Fig. 3c), overall we have not found a one-to-one contiguous manner but, instead, are intermixed with ancestral HORs.
correspondence between the sites of kinetochore attachment and These patterns are consistent with saltatory as opposed to a constant
areas of rapid evolutionary turnover and homogenization as predicted rate of mutation38, potentially as a result of meiotic drive for the newly
by the kinetochore-associated recombination machinery model15,34 minted HORs39. Mechanisms involving DNA double-stranded break
(Supplementary Figs. 20–42). This notable plasticity in kinetochore formation followed by homologous or unidirectional gene conver-
position despite the conserved, essential function of these regions sion between sister chromatids, as has been recently suggested for
Article
Arabidopsis thaliana40, may account for this pattern. The emergence 25. Aldrup-MacDonald, M. E., Kuo, M. E., Sullivan, L. L., Chew, K. & Sullivan, B. A. Genomic
variation within alpha satellite DNA influences centromere location on human
of new α-satellite HORs may also contribute to increased centromere chromosomes with metastable epialleles. Genome Res. 26, 1301–1311 (2016).
strength41, which can lead to non-Mendelian chromosome segregation 26. Mahtani, M. M. & Willard, H. F. A primary genetic map of the pericentromeric region of the
and biased chromosome retention in oocytes42,43. Now that centromeres human X chromosome. Genomics 2, 294–301 (1988).
27. Bzikadze, A. V., Mikheenko, A. & Pevzner, P. A. Fast and accurate mapping of long reads
can be fully sequenced, it will be critical to study the mutational pro- to complete genome assemblies with VerityMap. Genome Res. https://doi.org/10.1101/
cesses in multigenerational families to understand the mechanisms gr.276871.122 (2022).
shaping these rapidly evolving regions of our genome. 28. Dishuck, P. C., Rozanski, A. N., Logsdon, G. A., Porubsky, D. & Eichler, E. E. GAVISUNK:
genome assembly validation via inter-SUNK distances in Oxford Nanopore reads.
Bioinformatics https://doi.org/10.1093/bioinformatics/btac714 (2022).
29. Rautiainen, M. et al. Telomere-to-telomere assembly of diploid chromosomes with
Online content Verkko. Nat. Biotechnol. https://doi.org/10.1038/s41587-023-01662-6 (2023).
30. Bzikadze, A. V. & Pevzner, P. A. TandemAligner: a new parameter-free framework for
Any methods, additional references, Nature Portfolio reporting summa- fast sequence alignment. Preprint at bioRxiv https://doi.org/10.1101/2022.09.15.507041
ries, source data, extended data, supplementary information, acknowl- (2022).
edgements, peer review information; details of author contributions 31. Gershman, A. et al. Epigenetic patterns in a complete human genome. Science 376,
eabj5089 (2022).
and competing interests; and statements of data and code availability 32. Stimpson, K. M., Matheny, J. E. & Sullivan, B. A. Dicentric chromosomes: unique
are available at https://doi.org/10.1038/s41586-024-07278-3. models to study centromere function and inactivation. Chromosome Res. 20, 595–605
(2012).
33. Sullivan, B. A. & Willard, H. F. Stable dicentric X chromosomes with two functional
1. Willard, H. F. Chromosome-specific organization of human alpha satellite DNA. Am. J. Hum. centromeres. Nat. Genet. 20, 227–228 (1998).
Genet. 37, 524–532 (1985). 34. Shepelev, V. A., Alexandrov, A. A., Yurov, Y. B. & Alexandrov, I. A. The evolutionary origin of
2. Alexandrov, I., Kazakov, A., Tumeneva, I., Shepelev, V. & Yurov, Y. Alpha-satellite DNA of man can be traced in the layers of defunct ancestral alpha satellites flanking the active
primates: old and new families. Chromosoma 110, 253–266 (2001). centromeres of human chromosomes. PLoS Genet. 5, e1000641 (2009).
3. Henikoff, S., Ahmad, K. & Malik, H. S. The centromere paradox: stable inheritance with 35. Pike, L. M., Carlisle, A., Newell, C., Hong, S.-B. & Musich, P. R. Sequence and evolution of
rapidly evolving DNA. Science 293, 1098–1102 (2001). rhesus monkey alphoid DNA. J. Mol. Evol. 23, 127–137 (1986).
4. Nurk, S. et al. The complete sequence of a human genome. Science 376, 44–53 (2022). 36. Alexandrov, I. A., Mitkevich, S. P. & Yurov, Y. B. The phylogeny of human chromosome
5. Altemose, N. et al. Complete genomic and epigenetic maps of human centromeres. specific alpha satellites. Chromosoma 96, 443–453 (1988).
Science 376, eabl4178 (2022). 37. Hughes, J. F., Skaletsky, H. & Page, D. C. ALRY-MAJOR:PT: Major Repeat Unit of Chimpanzee
6. Chaisson, M. J. P. et al. Resolving the complexity of the human genome using single- Alpha Repetitive DNA from the Y Chromosome Centromere—A Consensus (Repbase,
molecule sequencing. Nature 517, 608–611 (2015). accessed 28 May 2023); http://www.girinst.org/.
7. Nurk, S. et al. HiCanu: accurate assembly of segmental duplications, satellites, and allelic 38. Plohl, M., Luchetti, A., Meštrović, N. & Mantovani, B. Satellite DNAs between selfishness
variants from high-fidelity long reads. Genome Res. https://doi.org/10.1101/gr.263566.120 and functionality: structure, genomics and evolution of tandem repeats in centromeric
(2020). (hetero)chromatin. Gene 409, 72–82 (2008).
8. Vollger, M. R. et al. Segmental duplications and their variation in a complete human 39. Amor, D. J. et al. Human centromere repositioning ‘in progress’. Proc. Natl Acad. Sci. USA
genome. Science 376, eabj6965 (2022). 101, 6542–6547 (2004).
9. Steinberg, K. M. et al. Single haplotype assembly of the human genome from a 40. Wlodzimierz, P. et al. Cycles of satellite and transposon evolution in Arabidopsis
hydatidiform mole. Genome Res .24, 2066–2076 (2014). centromeres. Nature 618, 557–565 (2023).
10. Liao, W.-W. et al. A draft human pangenome reference. Nature 617, 312–324 (2023). 41. Iwata-Otsubo, A. et al. Expanded satellite repeats amplify a discrete CENP-A nucleosome
11. Porubsky, D. et al. Inversion polymorphism in a complete human genome assembly. assembly site on chromosomes that drive in female meiosis. Curr. Biol. 27, 2365–2373
Genome Biol. 24, 100 (2023). (2017).
12. Logsdon, G. A. & Eichler, E. E. The dynamic structure and rapid evolution of human 42. Akera, T. et al. Spindle asymmetry drives non-Mendelian chromosome segregation.
centromeric satellite DNA. Genes 14, 92 (2023). Science 358, 668–672 (2017).
13. Archidiacono, N. et al. Comparative mapping of human alphoid sequences in great apes 43. Akera, T., Trimm, E. & Lampson, M. A. Molecular strategies of meiotic cheating by selfish
using fluorescence in situ hybridization. Genomics 25, 477–484 (1995). centromeres. Cell 178, 1132–1144 (2019).
14. Cechova, M. et al. High satellite repeat turnover in great apes studied with short- and 44. Vollger, M. R., Kerpedjiev, P., Phillippy, A. M. & Eichler, E. E. StainedGlass: interactive
long-read technologies. Mol. Biol. Evol. 36, 2415–2431 (2019). visualization of massive tandem repeat structures with identity heatmaps. Bioinformatics
15. Miga, K. H. & Alexandrov, I. A. Variation and evolution of human centromeres: a field 38, 2049–2051 (2022).
guide and perspective. Annu. Rev. Genet. 55, 583–602 (2021). 45. Richard, F. & Dutrillaux, B. Origin of human chromosome 21 and its consequences: a
16. Willard, H. F., Wevrick, R. & Warburton, P. E. Human centromere structure: organization 50-million-year-old story. Chromosome Res. 6, 263–268 (1998).
and potential role of alpha satellite DNA. Prog. Clin. Biol. Res. 318, 9–18 (1989). 46. McConkey, E. H. Orthologous numbering of great ape and human chromosomes is
17. Wu, J. C. & Manuelidis, L. Sequence definition and organization of a human repeated essential for comparative genomics. Cytogenet. Genome Res. 105, 157–158 (2004).
DNA. J. Mol. Biol. 142, 363–386 (1980).
18. Alkan, C. et al. Organization and evolution of primate centromeric DNA from whole- Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in
genome shotgun sequence data. PLoS Comput. Biol. 3, 1807–1818 (2007). published maps and institutional affiliations.
19. Alkan, C. et al. Genome-wide characterization of centromeric satellites from multiple
mammalian genomes. Genome Res. 21, 137–145 (2011). Open Access This article is licensed under a Creative Commons Attribution
20. Miga, K. H. et al. Telomere-to-telomere assembly of a complete human X chromosome. 4.0 International License, which permits use, sharing, adaptation, distribution
Nature 585, 79–84 (2020). and reproduction in any medium or format, as long as you give appropriate
21. Logsdon, G. A. et al. The structure, function and evolution of a complete human credit to the original author(s) and the source, provide a link to the Creative Commons licence,
chromosome 8. Nature 593, 101–107 (2021). and indicate if changes were made. The images or other third party material in this article are
22. Vollger, M. R. et al. Long-read sequence and assembly of segmental duplications. included in the article’s Creative Commons licence, unless indicated otherwise in a credit line
Nat. Methods 16, 88–94 (2019). to the material. If material is not included in the article’s Creative Commons licence and your
23. Ebert, P. et al. Haplotype-resolved diverse human genomes and integrated analysis of intended use is not permitted by statutory regulation or exceeds the permitted use, you will
structural variation. Science 372, eabf7117 (2021). need to obtain permission directly from the copyright holder. To view a copy of this licence,
24. Cheng, H., Concepcion, G. T., Feng, X., Zhang, H. & Li, H. Haplotype-resolved de novo visit http://creativecommons.org/licenses/by/4.0/.
assembly using phased assembly graphs with hifiasm. Nat. Methods 18, 170–175
(2021). © The Author(s) 2024
Methods using DeepConsensus48 (v.0.2.0) with the default parameters. Raw
HG00733 data were processed using the CCS algorithm (v.3.4.1) with
Cell lines the following parameters: --minPasses 3 --minPredictedAccuracy 0.99
CHM1hTERT (CHM1) cells were originally isolated from a hydatidiform --maxLength 21000 or 50000.
mole at Magee-Womens Hospital. Cryogenically frozen cells from this Ultra-long ONT data were generated from the CHM1, HG00733,
culture were grown and transformed using human telomerase reverse chimpanzee, orangutan and macaque genomes according to a previ-
transcriptase (hTERT) to immortalize the cell line. This cell line has been ously published protocol49. In brief, 3–5 × 107 cells were lysed in a buffer
authenticated by short-tandem-repeat analysis by Cell Line Genet- containing 10 mM Tris-Cl (pH 8.0), 0.1 M EDTA (pH 8.0), 0.5% (w/v) SDS
ics and has tested negative for mycoplasma contamination. Human and 20 μg ml−1 RNase A (Qiagen, 19101) for 1 h at 37 °C. Then, 200 μg ml−1
HG00733 lymphoblastoid cells were originally obtained from a female proteinase K (Qiagen, 19131) was added, and the solution was incubated
Puerto Rican child, immortalized with the Epstein–Barr Virus (EBV) and at 50 °C for 2 h. DNA was purified through two rounds of 25:24:1 (v/v)
stored at the Coriell Institute for Medical Research. This cell line has phenol–chloroform–isoamyl alcohol extraction followed by ethanol
been authenticated using a multiplex PCR assay with six autosomal precipitation. Precipitated DNA was solubilized in 10 mM Tris (pH 8.0)
microsatellite markers and has tested negative for mycoplasma con- containing 0.02% Triton X-100 at 4 °C for 2 days. Libraries were con-
tamination. Chimpanzee (Pan troglodytes, Clint, S006007) fibroblast structed using the Ultra-Long DNA Sequencing Kit (ONT, SQK-ULK001)
cells were originally obtained from a male western chimpanzee named with modifications to the manufacturer’s protocol. Specifically, around
Clint (now deceased) at the Yerkes National Primate Research Center 40 μg of DNA was mixed with FRA enzyme and FDB buffer as described
and immortalized with EBV. Orangutan (Pongo abelii, Susie, PR01109) in the protocol and incubated for 5 min at room temperature, followed
fibroblast cells were originally obtained from a female Sumatran oran- by a 5 min heat-inactivation at 75 °C. RAP enzyme was mixed with the
gutan named Susie (now deceased) at the Gladys Porter Zoo, immortal- DNA solution and incubated at room temperature for 1 h before the
ized with EBV and stored at the Coriell Institute for Medical Research. clean-up step. Clean-up was performed using the Nanobind UL Library
Macaque (Macaca mulatta; AG07107) fibroblast cells were originally Prep Kit (Circulomics, NB-900-601-01) and eluted in 225 μl EB. Then,
obtained from a female rhesus macaque of Indian origin and stored at 75 μl of library was loaded onto a primed FLO-PRO002 R9.4.1 flow cell
the Coriell Institute for Medical Research. The chimpanzee, orangutan for sequencing on the PromethION, with two nuclease washes and
and macaque cell lines have not yet been authenticated or assessed for reloads after 24 and 48 h of sequencing.
mycoplasma contamination to our knowledge. Additional ONT data were generated from the CHM1, HG00733,
chimpanzee, orangutan and macaque genomes according to a previ-
Cell culture ously published protocol21. In brief, high-molecular-weight DNA was
CHM1 cells were cultured in complete AmnioMax C-100 Basal Medium extracted from cells using a modified Qiagen Gentra Puregene pro-
(Thermo Fisher Scientific, 17001082) supplemented with 15% Amni- tocol47. High-molecular-weight DNA was prepared into libraries with
oMax C-100 Supplement (Thermo Fisher Scientific, 12556015) and the Ligation Sequencing Kit (SQK-LSK110) from ONT and loaded onto
1% penicillin–streptomycin (Thermo Fisher Scientific, 15140122). primed FLO-PRO002 R9.4.1 flow cells for sequencing on the Prome-
HG00733 (Homo sapiens) cells were cultured in RPMI-1650 medium thION system, with two nuclease washes and reloads after 24 and 48 h
(Sigma-Aldrich, R8758) supplemented with 15% fetal bovine serum (FBS; of sequencing. All ONT data were base-called using Guppy (v.5.0.11)
Thermo Fisher Scientific, 16000-044) and 1% penicillin–streptomycin with the SUP model.
(Thermo Fisher Scientific, 15140122). Chimpanzee (P. troglodytes; Clint;
S006007) and macaque (Macaque mulatta; AG07107) cells were cul- Targeted sequence assembly and validation of centromeric
tured in MEM α containing ribonucleosides, deoxyribonucleosides regions
and l-glutamine (Thermo Fisher Scientific, 12571063) supplemented To generate complete assemblies of centromeric regions from the
with 12% FBS (Thermo Fisher Scientific, 16000-044) and 1% penicillin– CHM1, HG00733, chimpanzee, orangutan and macaque genomes, we
streptomycin (Thermo Fisher Scientific, 15140122). Orangutan (P. abelii; first assembled each genome from PacBio HiFi data (Supplementary
Susie; PR01109) cells were cultured in MEM α containing ribonucleo- Table 1) using hifiasm24 (v.0.16.1). The resulting PacBio HiFi contigs were
sides, deoxyribonucleosides and l-glutamine (Thermo Fisher Scien- aligned to the T2T-CHM13 reference genome4 (v.2.0) using minimap250
tific, 12571063) supplemented with 15% FBS (Thermo Fisher Scientific, (v.2.24) with the following parameters: -I 15G -a --eqx -x asm20 -s 5000.
16000-044) and 1% penicillin–streptomycin (Thermo Fisher Scientific, Fragmented centromeric contigs were subsequently scaffolded with
15140122). All cells were cultured in a humidity-controlled environment ultra-long (>100 kb) ONT data generated from the same source genome
at 37 °C under 95% O2. using a method that takes advantage of SUNKs (Supplementary Fig. 1;
https://github.com/arozanski97/SUNK-based-contig-scaffolding). In
DNA extraction, library preparation and sequencing brief, SUNKs (k = 20 bp) were identified from the CHM1 PacBio HiFi
PacBio HiFi data were generated from the CHM1 and HG00733 gen whole-genome assembly using Jellyfish (v.2.2.4) and barcoded on the
omes as previously described21 with some modifications. In brief, high- CHM1 PacBio HiFi centromeric contigs as well as all ultra-long ONT
molecular-weight DNA was extracted from cells using a modified reads. PacBio HiFi centromeric contigs sharing a SUNK barcode with
Qiagen Gentra Puregene Cell Kit protocol47. High-molecular-weight ultra-long ONT reads were subsequently joined together to generate
DNA was used to generate PacBio HiFi libraries using the Template Prep contiguous assemblies that traverse each centromeric region. The
Kit v1 (PacBio, 100-259-100) or SMRTbell Express Template Prep Kit v2 base accuracy of the assemblies was improved by replacing the ONT
(PacBio, 100-938-900) and SMRTbell Enzyme Clean Up kits (PacBio, 101- sequences with locally assembled PacBio HiFi contigs generated using
746-400 and 101-932-600). Size selection was performed with SageELF HiCanu7 (v.2.1.1).
(Sage Science, ELF001), and fractions sized 11 kb, 14 kb, 15 kb or 16 kb We validated the construction of each centromere assembly using
(as determined by FEMTO Pulse (Agilent, M5330AA)) were chosen for four different methods. First, we aligned native PacBio HiFi and ONT
sequencing. Libraries were sequenced on the Sequel II platform with data from the same source genome to each whole-genome assem-
seven or eight SMRT Cells 8M (PacBio, 101-389-001) per sample using bly using pbmm2 (v.1.1.0) (for PacBio HiFi data; https://github.com/
either Sequel II Sequencing Chemistry 1.0 (PacBio, 101-717-200) or 2.0 PacificBiosciences/pbmm2) or Winnowmap51 (v.1.0) (for ONT data) and
(PacBio, 101-820-200), both with 2 h pre-extension and 30 h videos, assessed the assemblies for uniform read depth across the centromeric
aiming for a minimum estimated coverage of 30× in PacBio HiFi reads regions using IGV52 and NucFreq22. We next assessed the concordance
(assuming a genome size of 3.1 Gb). Raw CHM1 data were processed between the assemblies and raw PacBio HiFi data using VerityMap27,
Article
which identifies discordant k-mers between the two and flags them for Illumina standard software. We aligned the reads in the FASTQ files to
correction. We then assessed the concordance between the assemblies the T2T-CHM13 reference genome4 (v.2.0) using BWA58 (v.0.7.17-r1188),
and ONT data using GAVISUNK28, which identifies concordant SUNKs sorted the alignments using SAMtools59 (v.1.9) and marked duplicate
between the two. Finally, we estimated the accuracy of the centromere reads using sambamba60 (v.1.0). We merged the BAM files for the mono-
assemblies from mapped k-mers (k = 21) using Merqury (v.1.1)53 and and dinucleosome fractions of each cell using SAMtools59 (v.1.9). We
publicly available Illumina data from each genome (Extended Data used breakpointR (v.1.18)61 to assess the quality of generated strand-seq
Table 1). We estimated the QV of the centromeric regions with the fol- libraries with the following parameters: windowsize = 2000000, bin-
lowing formula: Method = ‘size’, pairedEndReads = TRUE, min.mapq = 10, background =
0.1, minReads = 50. We filtered the libraries based on the read density,
−10 × log(1 − (1 − (number of erroneous k -mers/total number of k -mers)) (1/ k )) level of background reads and level of genome coverage variability62.
In total, 48 BAM files were selected for all subsequent analysis and are
publicly available. We detected changes in strand-state inheritance
FISH and spectral karyotyping across all strand-seq libraries using the R package AneuFinder63 with
To determine the karyotype of the CHM1 genome, we first prepared the following parameters: variable.width.reference = <merged BAM of
metaphase chromosome spreads by arresting CHM1 cells in mitosis all 48 strand-seq libraries>, binsizes = windowsize, use.bamsignals =
via the addition of KaryoMAX Colcemid Solution (0.1 µg ml−1, Thermo FALSE, pairedEndReads = TRUE, remove.duplicate.reads = TRUE, min.
Fisher Scientific, 15212012) to the growth medium for 6 h. Cells were mapq = 10, method = ‘edivisive’, strandseq = TRUE, cluster.plots =
collected by centrifugation at 200g for 5 min and incubated in 0.4% KCl TRUE, refine.breakpoints = TRUE. We extracted a list of recurrent
swelling solution for 10 min. Swollen cells were pre-fixed by the addition strand-state changes reported as sister chromatid exchange hotspots
of freshly prepared methanol:acetic acid (3:1) fixative solution (~100 μl by AneuFinder. With this analysis, we identified reciprocal transloca-
per 10 ml total volume). Pre-fixed cells were collected by centrifugation tions between chromosomes 4q35.1/11q24.3 and 16q23.3/17q25.3 (see
at 200g for 5 min and fixed in methanol:acetic acid (3:1) fixative solu- below) and established the overall copy number for each chromosome
tion. Spreads were dropped on a glass slide and incubated on a heating and strand-seq library.
block at 65 °C overnight. Before hybridization, slides were treated with To identify the reciprocal translocation breakpoints between chro-
1 mg ml−1 RNase A (Qiagen, 19101) in 2× SSC for at least 45 min at 37 °C mosomes 4q35.1/11q24.3 and 16q23.3/17q25.3 in the CHM1 genome,
and then dehydrated in a 70%, 80% and 100% ethanol series for 2 min. we first aligned CHM1 PacBio HiFi reads to the T2T-CHM13 reference
Denaturation of spreads was performed in 70% formamide/2× SSC genome4 (v.2.0) using pbmm2 (v.1.1.0) and used BEDtools64 intersect
solution at 72 °C for 1.5 min and was immediately stopped by immers- (v.2.29.0) to define putative translocation regions based on Aneu-
ing the slides into an ethanol series pre-chilled to −20 °C. Finder analysis (described above). We extracted PacBio HiFi reads
Fluorescent probes for spectral karyotyping were generated with supplementary alignments using SAMtools59 (v.1.9) flag 2048.
in-house. Individual fluorescently labelled whole-chromosome paints Using this method, we were able to identify the precise breakpoint of
were obtained from Applied Spectral Imaging. Paints were provided in a each translocation. Note that, for the reciprocal translocation between
hybridization buffer and mixed 1:1 for indicated combinations. Labelled chromosomes 4q35.1/11q24.3, we report two breakpoints in each chro-
chromosome probes and paints were denatured by heating to 80 °C for mosome due to the presence of a ~97–98 kb deletion in the translocated
10 min before applying them to denatured slides. Spreads were hybrid- homologues (Supplementary Fig. 3). The breakpoints are located at
ized to probes under a HybriSlip hybridization cover (Grace Bio-Labs, chromosome 4: 187112496/chromosome 11: 130542388, chromosome 4:
716024) sealed with Cytobond (SciGene, 2020-00-1) in a humidified 187209555/chromosome 11: 130444240, and chromosome 16: 88757545/
chamber at 37 °C for 48 h. After hybridization, the slides were washed chromosome 17: 81572367 (in T2T-CHM13 v.2.0).
three times in 50% formamide/2× SSC for 5 min at 45 °C, 1× SSC solu-
tion at 45 °C for 5 min twice, and at room temperature once. The slides Sequence identity across centromeric regions
were then rinsed with double-deionized H2O, air-dried and mounted in To calculate the sequence identity across the centromeric regions
Vectashield-containing DAPI (Vector Laboratories, H-1200-10). from CHM1, CHM13 and 56 other diverse human genomes (generated
For spectral karyotyping, images were acquired using LSM710 con- by the HPRC10 and HGSVC23), we performed three analyses that take
focal microscope (Zeiss) with the 63×/1.40 NA oil-immersion objec- advantage of different alignment methods. In the first analysis, we
tive and ZEN (v.3.7) software. Segmentation, spectral unmixing and performed a pairwise sequence alignment between contigs from the
identification of chromosomes were performed using an open-source CHM1, CHM13 and diverse genomes using minimap250 (v.2.24) and
karyotype identification via spectral separation (KISS) analysis pack- the following command: minimap2 -I 15G -K 8G -t {threads} -ax asm20
age for Fiji54 (v.2.13.1), freely available online (http://research.stow- --secondary=no --eqx -s 2500 {ref.fasta} {query.fasta}. We chose these
ers.org/imagejplugins/KISS_analysis.html). A detailed description minimap2 parameters after testing several options and identifying
of chromosome paints, hybridization and analysis procedures was optimal ones for alignment between repetitive and/or structurally
reported previously55. divergent regions in diploid human genomes. Specifically, we chose
For individually painted chromosomes, z stack images were acquired -I 15G to provide additional memory for aligning between centromeric
on the Nikon Ti-E microscope equipped with a 100× objective NA 1.45, regions (the default is 4G and sometimes throws an error because of the
Yokogawa CSU-W1 spinning disk and Flash 4.0 sCMOS camera with large number of potential alignments). We also chose -K 8G because it
NIS-Elements AR (v.3.2) software. Image processing was performed allows for 8 Gb of sequence to be loaded into memory at a time. This
in Fiji54 (v.2.13.1). is enough for a typical human diploid genome (~6 Gb) to be loaded. If
we had left it at the default (500M), only a subset of contigs would be
Strand-seq analysis loaded at a time, and once the shortest contigs align, we would be left
To assess the karyotype of the CHM1 genome, we prepared strand-seq with only one thread aligning the longest contig. We therefore chose
libraries from CHM1 cells using a previously published protocol56,57. to increase this parameter so that the whole assembly is aligned at
We sequenced the mono- and dinucleosome fractions separately, with one time. We also chose to use -ax asm20 as it allows for sequences
the mononucleosomes sequenced with 75 bp, paired-end Illumina that are up to 20% divergent to be aligned. This is more permissive to
sequencing, and the dinucleosomes sequenced with 150 bp, paired-end alternative α-satellite HOR structures and sequence compositions than
Illumina sequencing. We demultiplexed the raw sequencing data based the other alignment options (for example, asm5 and asm10). We also
on library-specific barcodes and converted them to FASTQ files using opted to use --secondary=no to prevent secondary alignments from the
same contig, thereby preventing multi-mapping and ensuring that the of haplotypes, limiting our analysis to only the centromeric α-satellite
query would only align once to the reference. We added --eqx to allow HOR arrays as follows: (total number of aligned bases in the query)/
us to parse the CIGAR string and calculate the mean sequence identity (total number of bases in the reference) × (mean sequence identity
of the alignments. Finally, we selected -s 2500 as the minimal peak by event). The mean sequence identity by event is calculated as fol-
dynamic programming alignment score. The default setting for this lows: (number of matches)/(number of matches + number of mis-
parameter is 40, and we tested that one as well as 1000, 2500 and 5000. matches + number of insertion events + number of deletion events).
We found that with -s 40 and -s 1000, spurious alignments occurred The set of centromeres with a higher alignment score was determined to
from other centromeres, and with -s 5000, accurate alignments from be a better match to that haplotype than the other set of centromeres.
centromeres were filtered out. We therefore chose -s 2500 to allow
for diverse α-satellite HOR structures to align without some align- Pairwise sequence identity heat maps
ments being filtered out. After generating the alignments, we filtered To generate pairwise sequence identity heat maps of each centromeric
them using SAMtools59 (v.1.9) flag 4, which keeps primary and partial region, we ran StainedGlass44 (v.6.7.0) with the following parameters:
alignments. We subsequently partitioned the alignments into 10 kb window=5000 mm_f=30000 mm_s=1000. We normalized the col-
non-overlapping windows in the reference genome (either CHM1 or our scale across the StainedGlass plots by binning the percentage of
CHM13) and calculated the mean sequence identity between the pair- sequence identities equally and recolouring the data points according
wise alignments in each window with the following formula: (number to the binning. To generate heat maps that show only the variation
of matches)/(number of matches + number of mismatches + number between centromeric regions, we ran StainedGlass44 (v.6.7.0) with the
of insertion events + number of deletion events). We then averaged following parameters: window=5000 mm_f=60000 mm_s=30000. As
the sequence identity across the 10 kb windows within the α-satellite above, we normalized the colour scale across the StainedGlass plots by
HOR array(s), monomeric/diverged α-satellites, other satellites and binning the percentage of sequence identities equally and recolouring
non-satellites for each chromosome to determine the mean sequence the datapoints according to the binning.
identity in each region.
In the second analysis, we first fragmented the centromeric contigs Estimation of α-satellite HOR array length
from each genome assembly into 10 kb fragments with seqtk (v.1.3; To estimate the length of the α-satellite HOR arrays of each centromere
https://github.com/lh3/seqtk) and subsequently aligned them to the in the CHM1, CHM13 and 56 diverse genome assemblies10,23, we first
reference genome (either CHM1 or CHM13) using minimap250 (v.2.24) ran RepeatMasker65 (v.4.1.0) on the assemblies and identified contigs
and the following command: minimap2 -I 15G -K 8G -t {threads} -ax containing α-satellite repeats, marked by ‘ALR/Alpha’. We extracted
asm20 --secondary=no --eqx -s 40 {ref.fasta} {query.fasta}. We filtered these α-satellite-containing contigs and ran HumAS-HMMER (https://
the alignments using SAMtools59 (v.1.9) flag 4, which keeps primary github.com/fedorrik/HumAS-HMMER_for_AnVIL) on each of them.
and partial alignments. In this method, multiple 10 kb fragments are HumAS-HMMER is a tool that identifies the location of α-satellite
allowed to align to the same region in the reference genome, but each HORs in human centromeric sequences. It uses a hidden Markov model
10 kb fragment is only allowed to align once. We then partitioned the (HMM) profile for centromeric α-satellite HOR monomers and gener-
alignments into 10 kb non-overlapping windows in the reference ates a BED file with the coordinates of the α-satellite HORs and their
genome and calculated the mean sequence identity between all align- classification. Using this BED file, we extracted contigs containing
ments in each window as described above. We averaged the sequence α-satellite HORs that were designated as live or active (denoted with
identity across the 10 kb windows within the α-satellite HOR array(s), an ‘L’ in the HumAS-HMMER BED file), which are those that belong to
monomeric/diverged α-satellites, other satellites and non-satellites an array that consistently associates with the kinetochore in several
for each chromosome to determine the mean sequence identity in individuals5,67. By contrast, dead or inactive α-satellite HORs (denoted
each region. with a ‘d’ in the HumAS-HMMER BED file) are those that have not been
In the third analysis, we first identified the location of the α-satellite found to be associated with the kinetochore and are usually more diver-
HOR array(s) in each genome assembly using RepeatMasker65 gent in sequence than the live or active arrays. We filtered out contigs
(v.4.1.0) followed by HumAS-HMMER (https://github.com/fedorrik/ that had incomplete α-satellite HOR arrays (such as those that did not
HumAS-HMMER_for_AnVIL) and subsequently extracted regions traverse into unique sequence), thereby limiting our analysis to only
enriched with ‘live’ α-satellite HORs (denoted with an ‘L’ in the complete α-satellite HOR arrays. Moreover, we assessed the integrity of
HumAS-HMMER BED file). We then ran TandemAligner66 (v.0.1) on pairs each of the α-satellite HOR array-containing contigs using NucFreq22 to
of complete centromeric HOR arrays using the following command: ensure that they were completely and accurately assembled, filtering
tandem_aligner --first {ref.fasta} --second {query.fasta} -o {output_direc- out those with evidence of a deletion, duplication or misjoin. Finally,
tory}. We parsed the CIGAR string generated by TandemAligner by we calculated the length of the α-satellite HOR arrays in the remain-
first binning the alignments into 10 kb non-overlapping windows and ing contigs by taking the minimum and maximum coordinate of the
calculating the mean sequence identity in each window as described ‘live’ α-satellite HOR arrays and plotting their lengths with GraphPad
above. As TandemAligner is only optimized for tandem repeat arrays, Prism (v.9.5.1).
we assessed the sequence identity only in the α-satellite HOR array(s)
of each centromeric region and did not use it to assess the sequence Sequence composition and organization of α-satellite HOR
identity in any other region. arrays
To determine the sequence composition and organization of each
Better-match analysis α-satellite HOR array in the CHM1, CHM13 and 56 diverse genome
To determine whether the CHM1 or CHM13 centromeres are a better assemblies10,23, we ran HumAS-HMMER (https://github.com/fedorrik/
match to those from the 56 diverse human genomes assembled by the HumAS-HMMER_for_AnVIL) on centromeric contigs with the default
HPRC10 and HGSVC23, we performed a pairwise sequence alignment parameters and parsed the resulting BED file with StV (https://github.
between contigs from the HPRC and HGSVC assemblies to either the com/fedorrik/stv). This generated a BED file with each α-satellite
CHM1 or CHM1 assembly using minimap250 (v.2.24) and the following HOR sequence composition and its organization along the α-satellite
command: minimap2 -I 15G -K 8G -t {threads} -ax asm20 --secondary=no HOR arrays. We used the stv_row.bed file to visualize the organiza-
--eqx -s 2500 {ref.fasta} {query.fasta}. We filtered the alignments using tion of the α-satellite HOR arrays with R68 (v.1.1.383) and the ggplot2
SAMtools59 (v.1.9) flag 4, which keeps primary, secondary and partial package66. The α-satellite monomer and HOR classification gener-
alignments, and then calculated an alignment score between each pair ated with HumAS-HMMER is described in detail in the supplementary
Article
information of a previous study5, in which a more complete description a 1:80 dilution) was added and rotated overnight at 4 °C. Immuno-
of these annotations can be found. complexes were recovered by the addition of 200 ml 50% protein G
Sepharose bead slurry followed by rotation at 4 °C for 3 h. The beads
CpG methylation analysis were washed three times with buffer B and once with buffer B without
To determine the CpG methylation status of each CHM1 centromere, we Tween-20. For the input fraction, an equal volume of input recovery
aligned CHM1 ONT reads >30 kb in length to the CHM1 whole-genome buffer (0.6 M NaCl, 20 mM EDTA, 20 mM Tris, pH 7.5 and 1% SDS) and
assembly using Winnowmap51 (v.1.0) and then assessed the CpG meth- 1 ml of RNase A (10 mg ml−1) was added, followed by incubation for 1 h
ylation status of the centromeric regions with Nanopolish69 (v.0.13.3). at 37 °C. Proteinase K (100 mg ml−1, Roche) was then added, and the
Nanopolish distinguishes 5-methylcytosines from unmethylated samples were incubated for another 3 h at 37 °C. For the ChIP fraction,
cytosines via a HMM on the raw nanopore current signal. The methyla- 300 μl of ChIP recovery buffer (20 mM Tris, pH 7.5, 20 mM EDTA, 0.5%
tion caller generates a log-likelihood value for the ratio of probability SDS and 500 mg ml−1 proteinase K) was added directly to the beads
of methylated to unmethylated CpGs at a specific k-mer. We filtered and incubated for 3–4 h at 56 °C. The resulting proteinase-K-treated
methylation calls using the nanopore_methylation_utilities tool70 samples were subjected to a phenol–chloroform extraction followed
(https://github.com/isaclee/nanopore-methylation-utilities), which by purification using the Qiagen MinElute PCR purification column.
uses a log-likelihood ratio of 2.5 as a threshold for calling methylation. Unamplified bulk nucleosomal and ChIP DNA was analysed using an
CpG sites with log-likelihood ratios greater than 2.5 (methylated) or Agilent Bioanalyzer instrument and a 2100 High Sensitivity Kit.
less than −2.5 (unmethylated) are considered to be high quality and are Sequencing libraries were generated using the TruSeq ChIP Library
included in the analysis. Reads that do not have any high-quality CpG Preparation Kit, Set A (Illumina, IP-202-1012) according to the manu-
sites are filtered from the BAM for subsequent methylation analysis. facturer’s instructions, with some modifications. In brief, 5–10 ng bulk
Nanopore_methylation_utilities integrates methylation information nucleosomal or ChIP DNA was end-repaired and A-tailed. Illumina
into the BAM file for viewing in IGV’s52 bisulfite mode, which was used TruSeq adaptors were ligated, libraries were size-selected to exclude
to visualize CpG methylation. To determine the size of hypomethylated polynucleosomes using an E-Gel SizeSelect II agarose gel and the librar-
region (termed the CDR31) in each centromere, we developed a novel ies were PCR-amplified using the PCR polymerase and primer cocktail
tool, CDR-Finder (https://github.com/arozanski97/CDR-Finder). This provided in the kit. The resulting libraries were submitted for 150 bp,
tool first bins the assembly into 5 kb windows, computes the median paired-end Illumina sequencing using the NextSeq 500/550 High Out-
CpG methylation frequency within windows containing α-satellite (as put Kit v2.5 (300 cycles). The resulting reads were assessed for quality
determined by RepeatMasker65 (v.4.1.0), selects bins that have a lower using FastQC (https://github.com/s-andrews/FastQC), trimmed with
CpG methylation frequency than the median frequency in the region, Sickle (v.1.33; https://github.com/najoshi/sickle) to remove low-quality
merges consecutive bins into a larger bin, filters for merged bins that 5′- and 3′-end bases, and trimmed using Cutadapt71 (v.1.18) to remove
are >50 kb and reports the location of these bins. adapters.
Processed CENP-A ChIP and bulk nucleosomal reads were aligned to
Native CENP-A ChIP–seq and analysis the CHM1 whole-genome assembly using BWA-MEM72 (v.0.7.17) with
To determine the location of centromeric chromatin within the CHM1 the following parameters: bwa mem -k 50 -c 1000000 {index} {read1.
genome, we performed two independent replicates of native CENP-A fastq.gz} {read2.fastq.gz}. The resulting SAM files were filtered using
chromatin immunprecipitation–sequencing (ChIP–seq) analysis of SAMtools59 (v.1.9) with flag score 2308 to prevent multi-mapping of
CHM1 cells as described previously21, with some modifications. In reads. With this filter, reads mapping to more than one location are
brief, 3–4 × 107 cells were collected and resuspended in 2 ml of ice-cold randomly assigned a single mapping location, thereby preventing
buffer I (0.32 M sucrose, 15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl2, mapping biases in highly identical regions. Alignments were normalized
0.1 mM EGTA and 2× Halt Protease Inhibitor Cocktail (Thermo Fisher and filtered with deepTools73 (v.3.4.3) bamCompare with the follow-
Scientific, 78429)). Then, 2 ml of ice-cold buffer II (0.32 M sucrose, ing parameters: bamCompare -b1 {ChIP.bam} -b2 {bulk_nucleosomal.
15 mM Tris, pH 7.5, 15 mM NaCl, 5 mM MgCl2, 0.1 mM EGTA, 0.1% bam} --operation ratio --binSize 1000 --minMappingQuality 1 -o {out.
IGEPAL and 2× Halt Protease Inhibitor Cocktail) was added, and the bw}. Alternatively, CENP-A ChIP–seq data alignments were filtered
samples were placed onto ice for 10 min. The resulting 4 ml of nuclei using a marker-assisted mapping strategy as described previously5.
was gently layered on top of 8 ml of ice-cold buffer III (1.2 M sucrose, In brief, unique 51-mers in the CHM1 whole-genome assembly were
60 mM KCl, 15 mM, Tris pH 7.5, 15 mM NaCl, 5 mM MgCl2, 0.1 mM EGTA counted and filtered with meryl53 (v.1.3). The locations of the unique
and 2× Halt Protease Inhibitor Cocktail (Thermo Fisher Scientific, 51-mers were identified with meryl53 (v.1.3) and then used to filter the
78429)) and centrifuged at 10,000g for 20 min at 4 °C. Pelleted nuclei CENP-A ChIP–seq and input alignments using BEDtools64 intersect
were resuspended in buffer A (0.34 M sucrose, 15 mM HEPES, pH 7.4, (v.2.29.0). Alignments were normalized and filtered with deepTools73
15 mM NaCl, 60 mM KCl, 4 mM MgCl2 and 2× Halt Protease Inhibitor (v.3.4.3) bamCompare with the following parameters: bamCompare
Cocktail) to 400 ng ml−1. Nuclei were frozen on dry ice and stored at -b1 {ChIP.bam} -b2 {bulk_nucleosomal.bam} --operation ratio --binSize
80 °C. MNase digestion reactions were performed on 200–300 μg 1000 -o {out.bw}.
chromatin, using 0.2–0.3 U μg−1 MNase (Thermo Fisher Scientific,
88216) in buffer A supplemented with 3 mM CaCl2 for 10 min at 37 °C. Estimation of the length of the kinetochore sites
The reaction was quenched with 10 mM EGTA on ice and centrifuged To estimate the length of the CHM1 and CHM13 kinetochore sites,
at 500g for 7 min at 4 °C. The chromatin was resuspended in 10 mM we first determined the CpG methylation status of each CHM1 and
EDTA and rotated at 4 °C for 2 h. The mixture was adjusted to 500 mM CHM13 centromere using the approach described above (see the ‘CpG
NaCl, rotated for another 45 min at 4 °C and then centrifuged at maxi- methylation analysis’ section). We then mapped the CENP-A ChIP–seq
mum speed (21,100g) for 5 min at 4 °C, yielding digested chromatin data from each genome to the same source genome using the map-
in the supernatant. Chromatin was diluted to 100 ng ml−1 with buffer ping parameters described above (see the ‘Native CENP-A ChIP–seq
B (20 mM Tris, pH 8.0, 5 mM EDTA, 500 mM NaCl and 0.2% Tween-20) and analysis’ section). We next used CDR-Finder (https://github.com/
and precleared with 100 μl 50% protein G Sepharose bead (Abcam, arozanski97/CDR-Finder) to identify the location of hypomethylated
ab193259) slurry for 20 min at 4 °C with rotation. Precleared superna- regions within the centromeres, and we filtered the hypomethylated
tant (10–20 μg bulk nucleosomes) was saved for further processing. regions that had less than tenfold enrichment of CENP-A ChIP–seq reads
To the remaining supernatant, 20 μg mouse monoclonal anti-human relative to the bulk nucleosomal reads. We reported the lengths of the
CENP-A antibody (3-19; Enzo, ADI-KAM-CC006-E; approximately hypomethylated regions enriched with CENP-A as determined with
CDR-Finder, and we tested for statistical significance using a two-sided {path_to_directory_with_fasta} AS-SFs-hmmer3.0.290621.hmm {num-
Kolmogorov–Smirnov test with GraphPad Prism (v.9.5.1). ber_of_threads}. This generated a BED file with the SF classification
and strand orientation of each α-satellite monomer, which we visual-
Immuno-FISH on stretched metaphase chromosome spreads ized with R68 (v.1.1.383) using the ggplot2 package66. In cases in which
Mechanically stretched metaphase spreads were obtained from the an inversion was detected, we ran StringDecomposer77, a tool that
CHM1 cell line according to established procedures74. In brief, colcemid- detects and reports changes in orientation of tandem repeats, using
treated cells were washed in phosphate-buffered saline (1× PBS), the default parameters to confirm the presence of reoriented α-satellite
counted, and resuspended for 15 min in a hypotonic buffer HCM (10 mM monomers at the breakpoints. Finally, we validated the presence of
HEPES, pH 7.3, 1 mM glycerol, 1 mM CaCl2 and 0.8 mM MgCl2) to achieve the inversion by aligning native ultra-long ONT reads to the assem-
a final concentration of 10,000 cells per ml. Then, 0.5 ml of the cell blies as described above and confirming even coverage across the
suspension was cytocentrifuged onto glass slides at 2,000 rpm for breakpoints as well as the presence of inverted α-satellite monomers
8 min with a Shandon Cytospin 3 and fixed in methanol at −20 °C for in the aligned reads.
15 min and in methanol:acetic acid 3:1 at −20 °C for 30 min. The slides We uploaded the α-satellite SF and strand orientation tracks gener-
were aged overnight at room temperature. ated by HumAS-HMMER for each centromere assembly to the UCSC
Immunofluorescence was performed on the stretched metaphase Human Genome Browser. For the CHM1 centromeres, we uploaded
chromosome spreads using an in-house rabbit polyclonal CENP-C anti- two additional tracks: one showing each α-satellite monomer belong-
body as previously described with minor modifications75. In brief, each ing to known human HORs (ASat-HOR track) and another showing
slide was rehydrated by immersion in 1× PBS-azide (10 mM NaPO4, structural variation in human HORs (StV track). All tracks were built
pH 7.4, 0.15 M NaCl, 1 mM EGTA and 0.01% NaN3) for 15 min at room and colour-coded as described previously5 and are publicly available
temperature. Chromosomes were then swollen by washing the slides online (https://genome.ucsc.edu/s/fedorrik/chm1_cen (CHM1); https://
(three times, 2 min each) with 1× TEEN (1 mM triethanolamine-HCl, genome.ucsc.edu/s/fedorrik/T2T_dev (CHM13); https://genome.ucsc.
pH 8.5, 0.2 mM NaEDTA, and 25 mM NaCl), 0.5% Triton X-100 and 0.1% edu/s/fedorrik/cen_primates (chimpanzee, orangutan, and macaque)).
BSA. The primary polyclonal antibody against the centromeric protein Note that the SF annotation coverage in macaque is sometimes discon-
CENP-C was diluted 1:40 in the same solution and then added (100 μl) tinuous (some monomers are not annotated due to significant diver-
onto the slides. Each slide was incubated for 2 h at 37 °C. Excess of pri- gence of macaque dimers from their progenitor Ka class monomers).
mary antibody was removed by washing the slides at room temperature However, most monomers are identified as Ka, which indicates SF7.
(three times, 2, 5 and 3 min each) with 1× KB buffer (10 mM Tris-HCl, In orangutan centromeres, most monomers are identified as R1 and
pH 7.7, 0.15 M NaCl and 0.1% BSA). A goat anti-rabbit IgG secondary R2, which indicates SF5. In chimpanzee and human autosome and
antibody conjugated to FITC (Sigma-Aldrich, F0382) was diluted 1:40 X chromosome centromeres, active arrays are formed by J1 and J2 (SF1),
in the same solution, and 100 μl was then added to the slides that were D1, FD and D2 (SF2), and W1–W5 (SF3) monomers. The only exception
then incubated for 45 min at 37 °C in a dark chamber. After incubation uncovered in this paper is the centromere of chimpanzee chromosome
with the secondary antibody, the slides were washed once with 1× KB 5, which appears to be formed by R1 and R2 (SF5), with some monomers
for 2 min, prefixed with 4% paraformaldehyde in 1× KB for 45 min at identified as J4 and Ga. The former belongs to SF01, which represents
room temperature, washed with distilled H2O by immersion for 10 min the generation of α-satellite intermediate between the progenitor SF5
at room temperature, and fixed with methanol and acetic acid (3:1) and the more derived SF1, and J4 is particularly close to the R1 monomer.
for 15 min. FISH was then performed using two α-satellite-containing Moreover, the other SF01 monomers, such as J3, J5 and J6, are absent
plasmids (pZ21A and pGA16) directly labelled by nick-translation in the array, which indicates that it is not genuine SF01. Thus, the J4
with Cy3-dUTP (Enzo, 42501) according to a standard procedure with monomer in chimpanzee centromere 5 should be considered variant
minor modifications76. In brief, 300 ng of labelled probe was used for R1. Similarly, occasional Ga monomers belong to SF4, which is the direct
the FISH experiments; DNA denaturation was performed at 70 °C for progenitor of SF5, and Ga is very close to R2. Ga monomers dispersed
4 min and hybridization at 37 °C in 2× SSC, 50% (v/v) formamide, 10% in the SF5 array are therefore just misclassed R2 monomers. The whole
(w/v) dextran sulphate, 3 μg Cot-1 DNA and 3 mg sonicated salmon chimpanzee chromosome 5 α-satellite HOR array should therefore be
sperm DNA, in a volume of 10 μl. Post-hybridization washing was per- classified as SF5, despite the abovementioned contaminations.
formed under high stringency conditions: at 60 °C in 0.1× SSC (three
times, 5 min each). Nuclei and chromosome metaphases were simul- Human and NHP phylogenetic analysis
taneously DAPI-stained. Digital images were obtained using a Leica Humans, chimpanzees, orangutans and macaques diverged over a
DMRXA2 epifluorescence microscope equipped with a cooled CCD period of at least 25 million years, with chimpanzees diverging approxi-
camera (Princeton Instruments). DAPI, Cy3 and fluorescein fluores- mately 6 million years ago29, orangutans 12–16 million years ago29 and
cence signals, detected with specific filters, were recorded separately macaques ~25 million years ago78. Despite these divergence times, all
as grayscale images. Pseudocolouring and merging of images were primates retain α-satellite repeats, which permit the phylogenetic
performed using ImageJ (v.1.53k). analysis of these regions and an estimation of their evolutionary trajec-
tory. To assess the phylogenetic relationship between α-satellite repeats
Human and NHP α-satellite SF classification and strand in human and NHP genomes, we first masked every non-α-satellite
orientation analysis repeat in the CHM1, CHM13, HG00733, chimpanzee, orangutan and
Human and NHP α-satellite monomers are grouped into 20 distinct macaque centromere assemblies using RepeatMasker65 (v.4.1.0). We
SF classes based on shared sequence identity and structure, which is then subjected the masked assemblies to StringDecomposer77 using
described in detail previously5. The SF classes and their monomers are α-satellite monomers derived from the T2T-CHM13 reference genome4
as follows: SF1 ( J1 and J2), SF01 ( J3, J4, J5 and J6), SF2 (D2, D2, FD), SF02 (v.2.0). This tool identifies the location of α-satellite monomers in the
(D3, D4, D5, D6, D7, D8 and D9), SF3 (W1, W2, W3, W4 and W5), SF4 (Ga), assemblies, and we used this to extract the α-satellite monomers from
SF5 (R1 and R2), SF6 (Ha), SF7 (Ka), SF8 (Oa and Na), SF9 (Ca), SF10 (Ba), the HOR/dimeric array and monomeric regions into multi-FASTA files.
SF11 ( Ja), SF12 (Aa), SF13 (Ia), SF14 (La), SF15 (Fa), SF16 (Ea), SF17 (Qa), We randomly selected 100 and 50 α-satellite monomers from the HOR/
SF18 (Pa and Ta). To determine the α-satellite SF content and strand dimeric array and monomeric regions, respectively, and aligned them
orientation of human and NHP centromeres, we ran HumAS-HMMER with MAFFT79,80 (v.7.453). We used IQ-TREE81 (v.2.1.2) to reconstruct
(https://github.com/fedorrik/HumAS-HMMER_for_AnVIL) on cen- the maximum-likelihood phylogeny with model selection and 1,000
tromeric contigs with the following command: hmmer-run_SF.sh bootstraps. The resulting tree file was visualized in iTOL82.
Article
To estimate sequence divergence along the pericentromeric regions, these regions using BEDtools64 maskfasta (v.2.29.0). We aligned the
we first mapped each NHP centromere assembly to the CHM13 cen- masked CHM1 fasta to the masked CHM13 fasta using minimap250
tromere assembly using minimap250 (v.2.17-r941) with the following and the following command: minimap2 -t {threads} --eqx -c -x asm20
parameters: -ax asm20 --eqx -Y -t 8 -r 500000. We then generated a BED --secondary=no {ref.fasta} {query.fasta}. Using the resulting PAF, we
file of 10 kb windows located within the CHM13 centromere assembly. extracted the regions with structural variants that were >50 bp long. We
We used the BED file to subset the BAM file, which was subsequently next intersected these regions with the RepeatMasker annotation file to
converted into a set of FASTA files. FASTA files contained at least 5 kb identify those variants that overlapped SINE, LINE or LTR repeat classes
of sequence from one or more NHP centromere assemblies mapping by >75%. We considered the following LINE and SINE subgroups: LINE/
to orthologous chromosomes. Pairs of human and NHP sequences CR1, LINE/L1, LINE/L1-Tx1, LINE/L2, LINE/Penelope, LINE/RTE-BovB,
were realigned using MAFFT79,80 (v.7.453) with the following command: LINE/RTE-X, SINE/5S-Deu-L2, SINE/Alu, SINE/MIR, SINE/tRNA, SINE/
mafft --maxiterate 1000 --localpair. Next, we calculated the SNV density tRNA-Deu, SINE/tRNA-RTE. We then determined the variation in length
and Ti/Tv ratios from these alignments, limiting our analysis to only of these regions between the two centromeric regions, and we plotted
those regions with one-to-one unambiguous mapping and exclud- their position and length using R68 (v.1.1.383) and the ggplot2 package66.
ing segmental duplications and satellite repeats (Supplementary
Table 10). As a control, we also calculated the SNV density and Ti/Tv Reporting summary
ratios from 500 uniquely mapping regions across the genomes (Sup- Further information on research design is available in the Nature Port-
plementary Table 11). We estimated the sequence divergence using folio Reporting Summary linked to this article.
the Tamura-Nei substitution model83, which accounts for recurrent
mutations and differences between transversions and transitions as
well as within transitions. The mutation rate per segment was estimated Data availability
using Kimura’s model of neutral evolution84. In brief, we modelled the All sequencing data generated and/or used in this study are pub-
estimated divergence (D) as a result of between-species substitutions licly available and listed in Extended Data Table 1 with their BioPro-
and within-species polymorphisms, that is: ject ID, accession number (if available) and/or URL. The following
BioProject accessions were used: CHM1 whole-genome assembly
D = 2μt + 4Ne μ with complete centromeres (PRJNA975207); CHM1 PacBio HiFi data
(PRJNA726974); CHM1 ONT data (PRJNA869061); CHM1 Illumina data
where Ne is the ancestral human effective population size, t is the diver- (PRJNA246220); CHM1 strand-seq alignments (https://doi.org/10.5281/
gence time for a given human–NHP pair and μ is the mutation rate. zenodo.7959305)85; CHM1 CENP-A ChIP–seq data (PRJNA975217);
We assumed a generation time of [20, 29] years and the following T2T-CHM13 (v.2.0) whole-genome assembly (PRJNA559484); CHM13
divergence times: human–macaque = [23 × 106, 25 × 106] years, human– PacBio HiFi data (PRJNA530776); CHM13 ONT data (PRJNA559484);
orangutan = [12 × 106, 14 × 106] years, human–chimpanzee = [4 × 106, HG00733 PacBio HiFi data (PRJNA975575 and PRJEB36100); HG00733
6 × 106] years. To convert the genetic unit to a physical unit, our com- ONT data (PRJNA975575, PRJNA686388 and PRJEB37264); HPRC
putation also assumes Ne = 10,000 and uniformly drawn values for the whole-genome assemblies (https://projects.ensembl.org/hprc/);
generation and divergence times. HGSVC whole-genome assemblies (https://www.internationalgenome.
org/data-portal/data-collection/hgsvc2); and NHP (chimpanzee (Clint;
Human-specific phylogenetic analysis S006007), orangutan (Susie; PR01109), and macaque (AG07107))
To determine the phylogenetic relationship and divergence times PacBio HiFi and ONT data (PRJNA659034). The original karyotyping
between centromeric regions from chromosomes 5, 7 and 10–14 in the imaging data for the CHM1 cell line are available from the Stowers Origi-
CHM1, CHM13 and 56 other diverse human genomes (sequenced and nal Data Repository (http://www.stowers.org/research/publications/
assembled by the HPRC10 and HGSVC23), we first identified contigs with libpb-2457).
complete and accurately assembled centromeric α-satellite HOR arrays,
as determined by RepeatMasker65 (v.4.1.0) and NucFreq22 analysis. We
then aligned each of these contigs to the T2T-CHM13 reference genome4 Code availability
(v.2.0) using minimap250 (v.2.24). We also aligned the chimpanzee Custom code for the SUNK-based assembly of centromeric regions
whole-genome assembly to the T2T-CHM13 reference genome4 (v.2.0) is available at GitHub (https://github.com/arozanski97/SUNK-based-
to serve as an outgroup in our analysis. We identified 20 kb regions in contig-scaffolding)86. Custom code to detect hypomethylated regions
the flanking monomeric α-satellite or unique regions on the p- or q-arms within centromeric regions, termed CDRs31, is available at GitHub
and ensured that the region we had selected had only a single alignment (https://github.com/arozanski97/CDR-Finder)87. All other code is
from each haplotype to the reference genome. We next aligned these publicly available.
regions to each other using MAFFT79,80 (v.7.453) with the following
command: mafft --auto --thread {num_of_threads} {multi-fasta.fasta}. 47. Huddleston, J. et al. Reconstructing complex regions of genomes using long-read
sequencing technology. Genome Res. 24, 688–696 (2014).
We used IQ-TREE81 (v.2.1.2) to reconstruct the maximum-likelihood 48. Baid, G. et al. DeepConsensus improves the accuracy of sequences with a gap-aware
phylogeny with model selection and 1,000 bootstraps. The resulting sequence transformer. Nat. Biotechnol. 41, 232–238 (2023).
tree file was visualized in iTOL82. Timing estimates were calculated 49. Logsdon, G. A. HMW gDNA purification and ONT ultra-long-read data generation.
Protocols.io https://doi.org/10.17504/protocols.io.bchhit36 (2020).
by applying a molecular clock based on the branch-length distance 50. Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34,
to individual nodes and assuming a divergence time between human 3094–3100 (2018).
and chimpanzee of 6 million years ago. Clusters of α-satellite HOR 51. Jain, C. et al. Weighted minimizer sampling improves long read mapping. Bioinformatics
36, i111–i118 (2020).
arrays with a single monophyletic origin were assessed for gains and 52. Robinson, J. T. et al. Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
losses of α-satellite base pairs, monomers, HORs and distinct structural 53. Rhie, A., Walenz, B. P., Koren, S. & Phillippy, A. M. Merqury: reference-free quality,
changes manually. completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245
(2020).
54. Schindelin, J. et al. Fiji: an open-source platform for biological-image analysis. Nat. Methods
Polymorphic TE analysis 9, 676–682 (2012).
To detect polymorphic TEs between the CHM1 and CHM13 centro- 55. Potapova, T. A. et al. Karyotyping human and mouse cells using probes from single-sorted
chromosomes and open source software. BioTechniques 59, 335–346 (2015).
meric regions, we first ran RepeatMasker65 (v.4.1.0) on the CHM1 and 56. Falconer, E. et al. DNA template strand sequencing of single-cells maps genomic
CHM13 centromeric regions. We then masked all satellite repeats within rearrangements at high resolution. Nat. Methods 9, 1107–1112 (2012).
57. Sanders, A. D., Falconer, E., Hills, M., Spierings, D. C. J. & Lansdorp, P. M. Single-cell 83. Tamura, K. & Nei, M. Estimation of the number of nucleotide substitutions in the control
template strand sequencing by Strand-seq enables the characterization of individual region of mitochondrial DNA in humans and chimpanzees. Mol. Biol. Evol. 10, 512–526 (1993).
homologs. Nat. Protoc. 12, 1151–1176 (2017). 84. Kimura, M. The Neutral Theory of Molecular Evolution (Cambridge Univ. Press, 1983).
58. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. 85. Porubsky, D. & Lansdorp, P. The variation and evolution of complete human centromeres.
Bioinformatics 26, 589–595 (2010). Zenodo https://doi.org/10.5281/zenodo.7959305 (2022).
59. Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 86. Logsdon, G. A., Rozandki, A. N., Harvey, W. H. & Eichler, E. E. SUNK-based contig scaffolding
2078–2079 (2009). pipeline. GitHub github.com/arozanski97/SUNK-based-contig-scaffolding (2023).
60. Tarasov, A., Vilella, A. J., Cuppen, E., Nijman, I. J. & Prins, P. Sambamba: fast processing of 87. Logsdon, G. A., Rozanski, A. N., Harvey, W. H., Mastrorosa, F. K. & Eichler, E. E. CDR-Finder.
NGS alignment formats. Bioinformatics 31, 2032–2034 (2015). GitHub github.com/arozanski97/CDR-Finder (2023).
61. Porubsky, D. et al. breakpointR: an R/Bioconductor package to localize strand state 88. Guarracino, A. et al. Recombination between heterologous human acrocentric
changes in strand-seq data. Bioinformatics 36, 1260–1261 (2020). chromosomes. Nature 617, 335–343 (2023).
62. Porubsky, D. et al. Direct chromosome-length haplotyping by single-cell sequencing.
Genome Res. 26, 1565–1574 (2016).
63. Bakker, B. et al. Single-cell sequencing reveals karyotype heterogeneity in murine and Acknowledgements We thank the staff at Google for basecalling CHM1 PacBio circular
human malignancies. Genome Biol. 17, 115 (2016). consensus sequencing data with DeepConsensus; P. M. Lansdorp and D. C. J. Spierings for
64. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic generating and sharing the CHM1 strand-seq data; P. Hsieh for assistance with phylogenetic
features. Bioinformatics 26, 841–842 (2010). analyses; V. Slon for providing space and resources for the α-satellite SF and strand orientation
65. Smit, A. F. A., Hubley, R. & Green, P. RepeatMasker Open-4.0 (2013). analysis with funding from the Genetic Society of America and the Gruber foundation; Z. Zhao
66. Wickham, H. Ggplot2: Elegant Graphics for Data Analysis (Springer, 2009). for comments and suggestions on figures; and T. Brown for assistance in editing this manuscript.
67. McNulty, S. M. & Sullivan, B. A. Alpha satellite DNA biology: finding function in the This research was supported in part by funding from the National Institutes of Health (NIH)
recesses of the genome. Chromosome Res. 26, 115–138 (2018). National Human Genome Research Institute (NHGRI) R01 HG010169 (to E.E.E.); the National
68. R Core Team. R: a Language and Environment for Statistical Computing (R Foundation for Institute of General Medical Sciences (NIGMS) K99 GM147352 (to G.A.L.); the National Cancer
Statistical Computing, 2020). Institute (NCI) R01 CA266339 (to J.L.G.); the Intramural Research Program of the National
69. Loman, N. J., Quick, J. & Simpson, J. T. A complete bacterial genome assembled de novo Human Genome Research Institute (NHGRI) at NIH (to M.R., S.K., S.N. and A.M.P.); Shanghai
using only nanopore sequencing data. Nat. Methods 12, 733–735 (2015). Jiao Tong University 2030 Program WH510363001-7 (to Y.M.); and the Center for Integration in
70. Lee, I. et al. Simultaneous profiling of chromatin accessibility and methylation on human Science of the Ministry of Aliyah, Israel (to I.A.A.). This work used the computational resources
cell lines with nanopore sequencing. Nat. Methods 17, 1191–1199 (2020). of the NIH HPC Biowulf cluster (https://hpc.nih.gov). E.E.E. is an investigator of the Howard
71. Martin, M. Cutadapt removes adapter sequences from high-throughput sequencing Hughes Medical Institute.
reads. EMBnet.Journal 17, 10–12 (2011).
72. Li, H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. Author contributions G.A.L. and E.E.E. conceived the project. G.A.L., J.K.L., K.H. and K.M.M.
Preprint at arxiv.org/abs/1303.3997 (2013). generated sequencing data. G.A.L. and A.N.R. analysed sequencing data and performed
73. Ramírez, F., Dündar, F., Diehl, S., Grüning, B. A. & Manke, T. deepTools: a flexible platform quality-control analyses. G.A.L. and A.N.R. generated and validated the CHM1 centromere
for exploring deep-sequencing data. Nucleic Acids Res. 42, W187–W191 (2014). assemblies. M.R., S.K., S.N. and A.M.P. generated the Verkko CHM1 centromere assemblies.
74. Ventura, M. et al. The evolution of African great ape subtelomeric heterochromatin and G.A.L., A.N.R., F.R., V.A.S. and I.A.A. analysed the CHM1 centromere assemblies. T.P. performed
the fusion of human chromosome 2. Genome Res. 22, 1036–1049 (2012). spectral karyotyping and FISH experiments. C.R.C. performed immuno-FISH experiments. D.P.
75. Earnshaw, W. C. & Tomkiel, J. E. Centromere and kinetochore structure. Curr. Opin. Cell performed strand-seq and polymorphic TE analyses. G.A.L., A.N.R., D.Y. and Y.M. performed
Biol. 4, 86–93 (1992). phylogenetic analyses. J.L.G., A.M.P., M.V., I.A.A. and E.E.E. supervised experiments and
76. Lichter, P. et al. High-resolution mapping of human chromosome 11 by in situ hybridization analyses. G.A.L., T.P., C.R.C., D.P., I.A.A. and E.E.E. developed figures. G.A.L., I.A.A. and E.E.E.
with cosmid clones. Science 247, 64–69 (1990). drafted the manuscript. All of the authors read and approved the manuscript.
77. Dvorkina, T., Bzikadze, A. V. & Pevzner, P. A. The string decomposition problem and its
applications to centromere analysis and assembly. Bioinformatics 36, i93–i101 (2020). Competing interests S.N. is an employee of Oxford Nanopore Technologies. S.K. has received
78. Glazko, G. V. & Nei, M. Estimation of divergence times for major lineages of primate travel funds to speak at events hosted by Oxford Nanopore Technologies. E.E.E. is a scientific
species. Mol. Biol. Evol. 20, 424–434 (2003). advisory board member of Variant Bio. The other authors declare no competing interests.
79. Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7:
Improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013). Additional information
80. Nakamura, T., Yamada, K. D., Tomii, K. & Katoh, K. Parallelization of MAFFT for large-scale Supplementary information The online version contains supplementary material available at
multiple sequence alignments. Bioinformatics 34, 2490–2492 (2018). https://doi.org/10.1038/s41586-024-07278-3.
81. Nguyen, L.-T., Schmidt, H. A., von Haeseler, A. & Minh, B. Q. IQ-TREE: a fast and effective Correspondence and requests for materials should be addressed to Evan E. Eichler.
stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, Peer review information Nature thanks Ian Henderson, André Marques and the other,
268–274 (2015). anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer
82. Letunic, I. & Bork, P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree reports are available.
display and annotation. Bioinformatics 23, 127–128 (2007). Reprints and permissions information is available at http://www.nature.com/reprints.
Article
Extended Data Fig. 1 | Variation in the sequence and structure of centromeric to either the CHM1 (left) or CHM13 (right) α-satellite HOR array(s). The percent
α-satellite higher-order repeat (HOR) arrays among 56 diverse human of unaligned bases are shown in black. The size of each data point corresponds
genomes. Plots showing the percent sequence identity between centromeric to the total percent of aligned bases among the CHM1 and CHM13 centromeric
α-satellite HOR arrays from CHM1 (y-axis), CHM13 (x-axis), and 56 other diverse α-satellite HOR arrays. Precise quantification of the sequence identity and
human genomes [generated by the Human Pangenome Reference Consortium proportion of aligned versus unaligned sequences is provided in Supplementary
(HPRC)10 and Human Genome Structural Variation Consortium (HGSVC)23]. Table 6. Enlarged versions of these plots are shown in Supplementary Figs. 14, 15.
Each data point shows the percent of aligned bases from each human haplotype
Extended Data Fig. 2 | Sequence identities between the CHM1 and CHM13 α-satellite, other satellite, and non-satellite portions were assessed separately
centromeric regions. Histogram showing the distribution of sequence and reveal a much larger distribution in sequence identities for the α-satellite
identities from complete contig alignments between centromeric regions in HORs. The mean and standard deviation (s.d.) are indicated.
the CHM1 and CHM13 genomes. The α-satellite HOR, monomeric/divergent
Article
Extended Data Fig. 3 | Comparison of the CHM1 and CHM13 centromeric regions. Dot plots showing the percent sequence identity between the CHM1 and
CHM13 centromeric regions. Plots were generated with StainedGlass44. Enlarged versions of these plots are shown in Supplementary Figs. 16, 17.
Extended Data Fig. 4 | Comparison of CHM1 and CHM13 centromeric Note that each data point represents a haplotype with 1:1 best mapping,
α-satellite HOR arrays to those from 56 diverse human genomes. Plots although many of the centromeres are not yet complete in the HPRC and HGSVC
showing the percent sequence identity and number of megabase pairs (Mbp) assemblies. Enlarged versions of these plots are shown in Supplementary
aligned for 56 diverse human genomes (112 haplotypes), generated by the Figs. 18, 19.
HPRC10 and HGSVC23, mapped to the CHM1 and CHM13 centromeric regions.
Article
Extended Data Fig. 5 | Comparison of the genetic, epigenetic, and frequency (second track), CENP-A nucleosome enrichment (third track), and
evolutionary landscapes between the CHM1 and CHM13 centromeric evolutionary layers (bottom triangle) for each CHM1 and CHM13 centromeric
regions. Plots showing the sequence organization (top track), CpG methylation region. Enlarged versions of these plots are shown in Supplementary Figs. 45–67.
Extended Data Fig. 6 | CHM1 chromosome 13 and 19 centromeres have two reveal similar patterns of CENP-A chromatin enrichment, with two regions
regions enriched with CENP-A chromatin within hypomethylated α-satellite enriched with CENP-A that coincide with hypomethylated α-satellite DNA within
DNA. a,b) Two strategies for mapping CHM1 CENP-A ChIP-seq data (Methods) the CHM1 a) chromosome 13 and b) chromosome 19 α-satellite HOR arrays.
Article
Extended Data Fig. 7 | The CHM1 chromosome 13 centromere likely has probe for each chromosome 13 sister chromatid, indicating that this
one kinetochore site, while the CHM1 chromosome 19 centromere has chromosome likely has one kinetochore (a,b). Conversely, we find that
two kinetochore sites. a-d) Immuno-FISH staining of stretched metaphase there are two CENP-C signals that coincide with a single chromosome 5/19
chromosome spreads from CHM1 cells with a fluorescent antibody against α-satellite probe signal for each sister chromatid, indicating there are likely
CENP-C (an inner-kinetochore protein; green) as well as a fluorescent two kinetochores on this chromosome (c,d). Each experiment was performed
chromosome 13/21 α-satellite DNA probe (a,b; red) or a fluorescent three times with similar results. n = 32 and 34 metaphase chromosome spreads
chromosome 5/19 α-satellite DNA probe (c,d; red). We find that there is a were analysed for chromosomes 13 and 19, respectively. Insets are magnified
single CENP-C signal that coincides with the chromosome 13/21 α-satellite 1.7-fold (panels a and c) or 3.9-fold (panels b and d). Bar, 10 μm.
Extended Data Fig. 8 | Centromeres evolve with different evolutionary regions. Individual data points from 10-kbp pairwise sequence alignments are
trajectories and mutation rates. a-d) Phylogenetic trees of α-satellite shown. We note that the regions corresponding to the active α-satellite HORs
monomers derived from the human, chimpanzee, orangutan, and macaque have only approximate mutation rates based on human–human comparisons,
chromosome a) 10, b) 12, c) 20, and d) 21 centromeric regions. e-h) Plot showing Due to unequal rates of mutation and the emergence of new α-satellite HORs,
the mutation rate of the chromosome e) 10, f) 12, g) 20, and h) 21 centromeric interspecies comparisons are not possible in these regions.
Article
Extended Data Fig. 9 | Phylogenetic reconstruction of human chromosome 90–99% bootstrap support are indicated numerically. Nodes without an
5 and 7 centromeric haplotypes. a,b) Phylogenetic trees showing the asterisk or number have bootstrap support <90%. The haplotypes from the
evolutionary relationship and estimated divergence times of completely and p- and the q-arm trees are linked with a light teal bar, as schematized in panel a.
accurately assembled a) D5Z2 α-satellite HOR arrays and b) D7Z1 α-satellite We note that most differences in the order of the haplotypes occur at the
HOR arrays from CHM1, CHM13, and diverse human samples (generated by the terminal branches where the order of sequence taxa can be readily reshuffled
HPRC10 and HGSVC23). The trees were generated from 20-kbp segments in the to establish near-complete concordance. Thus, there are no significant changes
monomeric α-satellite or unique sequence regions on the p- (left) and q- (right) in the overall topologies of the phylogenetic trees.
arms. Asterisks indicate nodes with 100% bootstrap support, and nodes with
Extended Data Fig. 10 | Phylogenetic reconstruction of human chromosome 90–99% bootstrap support are indicated numerically. Nodes without an asterisk
8 and 10 centromeric haplotypes. a,b) Phylogenetic trees showing the or number have bootstrap support <90%. The haplotypes from the p- and the
evolutionary relationship and estimated divergence times of completely and q-arm trees are linked with a light teal bar, as schematized in panel a. We note
accurately assembled a) D8Z2 α-satellite HOR arrays and b) D10Z1 α-satellite that most differences in the order of the haplotypes occur at the terminal
HOR arrays from CHM1, CHM13, and diverse human samples (generated by the branches where the order of sequence taxa can be readily reshuffled to establish
HPRC10 and HGSVC23). The trees were generated from 20-kbp segments in the near-complete concordance. Thus, there are no significant changes in the overall
monomeric α-satellite or unique sequence regions on the p- (left) and q- (right) topologies of the phylogenetic trees.
arms. Asterisks indicate nodes with 100% bootstrap support, and nodes with
Article
Extended Data Fig. 11 | See next page for caption.

Extended Data Fig. 11 | Phylogenetic reconstruction of human chromosome support <90%. The haplotypes from the p- and the q-arm trees are linked with a
11, 13, and 14 centromeric haplotypes. a-c) Phylogenetic trees showing the light teal bar, as schematized in panel a. We note that most differences in the
evolutionary relationship and estimated divergence times of completely and order of the haplotypes occur at the terminal branches where the order of
accurately assembled a) D11Z1 α-satellite HOR arrays, b) D13Z2 α-satellite HOR sequence taxa can be readily reshuffled to establish near-complete concordance.
arrays, and b) D14Z9 α-satellite HOR arrays from CHM1, CHM13, and diverse Thus, there are no significant changes in the overall topologies of the
human samples (generated by the HPRC10 and HGSVC 23). The trees were phylogenetic trees. We note, however, in the case of the chromosome 13 p-arm
generated from 20-kbp segments in the monomeric α-satellite or unique (panel b), the CHM13 divergence time is exceptional (5.2 mya) compared to all
sequence regions on the p- (left) and q- (right) arms. Asterisks indicate nodes other regions of the genome. The basis for this is unknown, but it may reflect
with 100% bootstrap support, and nodes with 90–99% bootstrap support are ectopic exchange of the p-arm of human acrocentric chromosomes, leading to
indicated numerically. Nodes without an asterisk or number have bootstrap non-homologous exchange among five human chromosomes88.
Article
Extended Data Table 1 | Datasets generated and/or used in this study
See accompanying Excel file. ONT, Oxford Nanopore Technologies; PacBio, Pacific Biosciences; HiFi, high-fidelity; ChIP, chromatin immunoprecipitation.
nature portfolio | reporting summary
Corresponding author(s): Evan E. Eichler
Last updated by author(s): Dec 28, 2023
Reporting Summary
Nature Portfolio wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Portfolio policies, see our Editorial Policies and the Editorial Policy Checklist.
Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested

A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons
A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient)
AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals)
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.
For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
Our web collection on statistics for biologists contains articles on many of the points above.
Software and code

Policy information about availability of computer code
Data collection The software used to collect sequencing data are Pacific Bioscience Sequel II Instrument Control SW (v7.0, 7.1, and 8.0) and Oxford Nanopore
Technologies PromethION software (v21.02.17 - 23.04.5). The software used to collect image data are ZEN (v3.7) and NIS-Elements AR (v3.2).
Data analysis Custom code for the SUNK-based sequence assembly of centromeric regions is publicly available at https://github.com/arozanski97/SUNK-
based-contig-scaffolding. Custom code to detect hypomethylated regions within centromeric regions, termed "centromere dip
regions” (CDRs), is publicly available at https://github.com/arozanski97/CDR-Finder. Other publicly available software used in this study
include DeepConsensus (v0.2.0), PacBio circular consensus sequencing software (v3.4.1), hifiasm (v0.16.1), HiCanu (v2.1.1), minimap2 (v2.17-
r941 and v2.24), Jellyfish (v2.2.4), pbmm2 (v1.1.0), Winnowmap (v1.0), Merqury (v1.1), BWA-MEM (v0.7.17), sambamba (v1.0), SAMtools
(v1.9), breakpointR (v1.18), BEDtools (v2.29.0), deepTools (v3.4.3), seqtk (v1.3), TandemAligner (v0.1), meryl (v1.3), StringDecomposer (no
version specified), StainedGlass (v6.7.0), Nanopolish (v0.13.3), RepeatMasker (v4.1.0), Sickle (v1.33), Cutadapt (v1.18), MAFFT (v7.453), IQ-
TREE (v2.1.2), Fiji (v2.13.1), ImageJ (v1.53k), KISS ImageJ plug-in (v1), Prism (v9.5.1), and R (v1.1.383).
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and
reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.
April 2023
1
Data
Policy information about availability of data
All manuscripts must include a data availability statement. This statement should provide the following information, where applicable:
- Accession codes, unique identifiers, or web links for publicly available datasets
- A description of any restrictions on data availability
- For clinical datasets or third party data, please ensure that the statement adheres to our policy
All sequencing data generated and/or used in this study are publicly available and listed in Extended Data Table 1 with their BioProject ID, accession # (if available),
and/or URL. For convenience, we also list the BioProject IDs and/or URLs here: CHM1 whole-genome assembly with complete centromeres (PRJNA975207); CHM1
PacBio HiFi data (PRJNA726974); CHM1 ONT data (PRJNA869061); CHM1 Illumina data (PRJNA246220); CHM1 Strand-Seq alignments (https://doi.org/10.5281/
zenodo.7959305); CHM1 CENP-A ChIP-seq data (PRJNA975217); T2T-CHM13 (v2.0) whole-genome assembly (PRJNA559484); CHM13 PacBio HiFi data
(PRJNA530776); CHM13 ONT data (PRJNA559484); HG00733 PacBio HiFi data (PRJNA975575 and PRJEB36100); HG00733 ONT data (PRJNA975575, PRJNA686388,
and PRJEB37264); HPRC whole-genome assemblies (https://projects.ensembl.org/hprc/); HGSVC whole-genome assemblies (https://www.internationalgenome.org/
data-portal/data-collection/hgsvc2); and NHP [chimpanzee (Clint; S006007), orangutan (Susie; PR01109), and macaque (AG07107)] PacBio HiFi and ONT data
(PRJNA659034). The original karyotyping imaging data for the CHM1 cell line is available from the Stowers Original Data Repository (http://www.stowers.org/
research/publications/libpb-2457).
Research involving human participants, their data, or biological material

Policy information about studies with human participants or human data. See also policy information about sex, gender (identity/presentation),
and sexual orientation and race, ethnicity and racism.
Reporting on sex and gender N/A
Reporting on race, ethnicity, or We report analyses of publicly available human genome sequencing data generated by the 1000 Genomes Project (https://
other socially relevant www.internationalgenome.org/home) and their associated genetic ancestry information, as established and described by the
groupings 1000 Genomes Project (https://www.internationalgenome.org/category/population/).
Population characteristics See above.
Recruitment See above.
Ethics oversight See above.
Note that full information on the approval of the study protocol must also be provided in the manuscript.
Field-specific reporting
Please select the one below that is the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/documents/nr-reporting-summary-flat.pdf
Life sciences study design

All studies must disclose on these points even when the disclosure is negative.
Sample size We generated complete sequence assemblies of each centromere in the human CHM1hTERT genome (n=23) as well as a subset of
orthologous centromeres in a second human (HG00733; n=12), chimpanzee (Pan troglodytes; Clint; S006007; n=11), orangutan (Pongo abelii;
Susie; PR01109; n=10), and macaque (Macaca mulatta; AG07107; n=10). We also analyzed whole-genome assemblies from diverse humans
generated by the Human Pangenome Reference Consortium (HPRC) and Human Genome Structural Variation Consortium (HGSVC; n=56
genomes; n=112 haplotypes; n=580 completely assembled centromeres; and n=2,049 incompletely assembled centromeres). For
phylogenetic tree construction of centromeric regions, we used 150 data points from each genome. For centromeric mutation rate
computation, we used hundreds to thousands of data points from each genome.
Data exclusions No data were excluded.
Replication Computational experiments are deterministic and are, therefore, reproducible. Each wet-lab experiment was performed at least two
independent times.
Randomization Randomization is not applicable to this study because we did not perform any experiments where there are treatment and control groups
April 2023
that would necessitate randomization between the subjects.
Blinding Blinding is not applicable to this study because we did not perform any experiments where there are treatment and control groups that would
necessitate blinding.
2
Reporting for specific materials, systems and methods
We require information from authors about some types of materials, experimental systems and methods used in many studies. Here, indicate whether each material,
system or method listed is relevant to your study. If you are not sure if a list item applies to your research, read the appropriate section before selecting a response.
Materials & experimental systems Methods

n/a Involved in the study n/a Involved in the study
Antibodies ChIP-seq
Eukaryotic cell lines Flow cytometry
Palaeontology and archaeology MRI-based neuroimaging
Animals and other organisms
Clinical data
Dual use research of concern
Plants
Antibodies
Antibodies used We used a mouse monoclonal anti-human CENP-A antibody (clone 3-19; Enzo, ADI-KAM-CC006-E) in the ChIP-seq experiments.
We used a rabbit polyclonal anti-human CENP-C antibody (made in house) and a goat anti-rabbit IgG antibody conjugated to FITC
(Sigma F0382) in the immuno-FISH experiments.
Validation The anti-human CENP-A antibody was generated against a synthetic peptide consisting of amino acids 3-19 of human CENP-A, and
mutation of this epitope in human cells prevents antibody binding (Logsdon et al., JCB, 2015).
Eukaryotic cell lines

Policy information about cell lines and Sex and Gender in Research
Cell line source(s) The human CHM1hTERT cell line was a gift from Urvashi Surti (Pittsburgh, PA). The human HG00733 lymphoblastoid cell line
was obtained from the Coriell Institute for Medical Research (Camden, NJ). Chimpanzee (Pan troglodytes; Clint; S006007)
fibroblast cells were obtained from a male western chimpanzee named Clint (now deceased) at the Yerkes National Primate
Research Center (Atlanta, GA) and immortalized with EBV. Orangutan (Pongo abelii; Susie; PR01109) fibroblast cells were
obtained from a female Sumatran orangutan named Susie (now deceased) at the Gladys Porter Zoo (Brownsville, TX),
immortalized with EBV, and stored at the Coriell Institute for Medical Research (Camden, NJ). Macaque (Macaca mulatta;
AG07107) fibroblast cells were originally obtained from a female rhesus macaque of Indian origin and stored at the Coriell
Institute for Medical Research (Camden, NJ).
Authentication The human CHM1hTERT cell line was authenticated via STR analysis by Cell Line Genetics (Madison, WI). The human
HG00733 cell line is part of the NHGRI Sample Repository for Human Genetic Research at the Coriell Institute for Medical
Research (Camden, NJ) and was authenticated using a multiplex PCR assay with six autosomal microsatellite markers. The
chimpanzee, orangutan, and macaque cell lines have not yet been authenticated to our knowledge.
Mycoplasma contamination The human CHM1hTERT and HG00733 cell lines are negative for mycoplasma contamination. The chimpanzee, orangutan,
and macaque cell lines have not yet been tested for mycoplasma contamination to our knowledge.
Commonly misidentified lines No commonly misidentified cell lines were used in this study.
(See ICLAC register)
Plants
Seed stocks Report on the source of all seed stocks or other plant material used. If applicable, state the seed stock centre and catalogue number. If
plant specimens were collected from the field, describe the collection location, date and sampling procedures.
Novel plant genotypes Describe the methods by which all novel plant genotypes were produced. This includes those generated by transgenic approaches,
gene editing, chemical/radiation-based mutagenesis and hybridization. For transgenic lines, describe the transformation method, the
number of independent lines analyzed and the generation upon which experiments were performed. For gene-edited lines, describe
the editor used, the endogenous sequence targeted for editing, the targeting guide RNA sequence (if applicable) and how the editor
was applied.
Authentication Describe any authentication procedures for each seed stock used or novel genotype generated. Describe any experiments used to
April 2023
assess the effect of a mutation and, where applicable, how potential secondary effects (e.g. second site T-DNA insertions, mosiacism,
off-target gene editing) were examined.
3
ChIP-seq

Data deposition
Confirm that both raw and final processed data have been deposited in a public database such as GEO.
Confirm that you have deposited or provided access to graph files (e.g. BED files) for the called peaks.
Data access links https://www.ncbi.nlm.nih.gov/sra/?term=SRR24675260

May remain private before publication. https://www.ncbi.nlm.nih.gov/sra/?term=SRR24675261
https://www.ncbi.nlm.nih.gov/sra/?term=SRR24675262
https://www.ncbi.nlm.nih.gov/sra/?term=SRR24675263
Files in database submission CHM1_CA_ChIP_1_S3_R1_001.fastq.gz

CHM1_CA_ChIP_1_S3_R2_001.fastq.gz
CHM1_Input_1_S1_R1_001.fastq.gz
Genome browser session No longer applicable.

(e.g. UCSC)
Methodology
Replicates Two independent replicates of CENP-A ChIP-seq (with chromatin input as a control) were performed on CHM1hTERT cells and were
in agreement with each other.
Sequencing depth The total number of reads generated from each CHM1hTERT CENP-A ChIP-seq experiment is as follows:
CHM1hTERT CENP-A ChIP (Replicate 1): 113,284,073 paired-end, 150x150-bp reads

CHM1hTERT CENP-A ChIP (Replicate 2): 82,612,743 paired-end, 150x150-bp reads
CHM1hTERT Input (Replicate 1): 81,452,960 paired-end, 150x150-bp reads
CHM1hTERT Input (Replicate 2): 90,430,891 paired-end, 150x150-bp reads
Antibodies We used a mouse monoclonal anti-human CENP-A antibody (clone 3-19; Enzo, ADI-KAM-CC006-E) to enrich for CENP-A-containing
chromatin in the CHM1hTERT cell line.
Peak calling parameters We aligned the CHM1hTERT CENP-A ChIP and input sequencing data to the CHM1hTERT whole-genome assembly generated in this
study using BWA-MEM (v0.7.17) with the following parameters: bwa mem -k 50 -c 1000000 {index} {read1.fastq.gz} {read2.fastq.gz}.
The resulting SAM files were filtered using SAMtools (v1.9) with flag score 2308 to prevent multi-mapping of reads. With this filter,
reads mapping to more than one location are randomly assigned a single mapping location, thereby preventing mapping biases in
highly identical regions. Alignments were normalized with deepTools (v3.4.3) bamCompare with the following parameters:
bamCompare -b1 {ChIP.bam} -b2 {bulk_nucleosomal.bam} --operation ratio --binSize 1000 minMappingQuality 1 -o {out.bw}.
Data quality The CHM1hTERT CENP-A ChIP and input sequencing data were assessed for quality using FastQC (https://github.com/s-andrews/
FastQC), trimmed with Sickle (v1.33; https://github.com/najoshi/sickle) to remove low-quality 5' and 3' end bases, and trimmed with
Cutadapt (v1.18) to remove adapters.
Software BWA-MEM (v0.7.17) was used to align the CHM1hTERT CENP-A ChIP and input sequencing data to the CHM1hTERT whole-genome
assembly. SAMtools (v1.9) was used to remove multi-mapped reads, and deepTools (v3.4.3) bamCompare was used to normalize and
filter CENP-A ChIP data relative to input data to calculate fold enrichment.
April 2023

Primate Centromeres

Uploaded by

Copyright:

Available Formats

Primate Centromeres

Uploaded by

Document Information

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Primate Centromeres

Uploaded by

Copyright:

Available Formats

Article

The variation and evolution of complete

https://doi.org/10.1038/s41586-024-07278-3 Glennis A. Logsdon1,14, Allison N. Rozanski1, Fedor Ryabov2, Tamara Potapova3,

Chromosome 7 Chromosome 8 Chromosome 9 Chromosome 10 Chromosome 11 Chromosome 12

Chromosome 13 Chromosome 14 Chromosome 15 Chromosome 16 Chromosome 17 Chromosome 18

Chromosome 19 Chromosome 20 Chromosome 21 Chromosome 22 Chromosome X

Genome Centromeric regions Chromatin α-Satellite HORs

α-satellite HOR array

α-satellite HOR array

Sequence identity (%)

of the kinetochore site (Mb)

kinetochore site (Mb)

3.07 2.58 2.31 4.08 1.46 2.04 13.20 13.02

1.23 0.34 0.82 0.81 2.17 1.44 4.52 3.89

3.32 3.11 3.49 3.84 1.27 6.82 8.94 11.42 10.99

Extended Data Fig. 11 | See next page for caption.

A description of all covariates tested

Software and code

Research involving human participants, their data, or biological material

Population characteristics See above.

Recruitment See above.

Ethics oversight See above.

Life sciences study design

Data exclusions No data were excluded.

that would necessitate randomization between the subjects.

Materials & experimental systems Methods

Eukaryotic cell lines

nature portfolio | reporting summary

Data access links https://www.ncbi.nlm.nih.gov/sra/?term=SRR24675260

Files in database submission CHM1_CA_ChIP_1_S3_R1_001.fastq.gz

Genome browser session No longer applicable.

CHM1hTERT CENP-A ChIP (Replicate 1): 113,284,073 paired-end, 150x150-bp reads

You might also like