African-specific molecular taxonomy of prostate cancer

Patient cohorts and WGS

Our study included 183 treatment-naïve patients with prostate cancer who were recruited under informed consent and appropriate ethics approval (Supplementary Information 2) from Australia (n = 53), Brazil (n = 7) and South Africa (n = 123). Although patients were matched for pathological grading, prostate-specific antigen levels are, as noted earlier, notably elevated in our African patients16, and we cannot exclude the potential for metastases (data on metastases in this cohort are not available). DNA extracted from fresh tissue and matched blood underwent 2 × 150 bp sequencing on the Illumina NovaSeq instrument (Kinghorn Center for Clinical Genomics, Garvan Institute of Medical Research).

WGS processing and variant calling

Each lane of raw sequencing reads was aligned to the human reference hg38 plus alternate contigs using bwa (v.0.7.15)37. Lane-level BAM files from the same library were merged and duplicate reads were flagged. The Genome Analysis Toolkit (GATK, v. …) was used for base quality score recalibration38. Contaminated and duplicate samples (n = 8) were removed. We implemented three main pipelines for the discovery of germline and somatic variants, the latter ranging from small (SNV and indel) to large (CNA and SV) genomic variation. The complete pipelines and tools used are available from the Sydney Informatics Hub (SIH), Core Research Facilities, University of Sydney (see “Code Availability” section). The scalable bioinformatics workflows are described in Supplementary Information 4.

Genetic ancestry was estimated using fastSTRUCTURE (v.1.0)39, a Bayesian inference method that approximates the marginal likelihood for very large variant datasets. The reference panels for African and European ancestry compared in this study were extracted from previous whole-genome databases19.

Analysis of chromothripsis and chromoplexy

Clustered genomic rearrangements of prostate tumors were identified using ShatterSeek (v.0.4)40 and ChainFinder (v.1.0.1)41. Our somatic SV and somatic CNA call sets were prepared and co-analyzed using custom scripts (see “Code Availability” section; Supplementary Information 6).

Mutational recurrence analysis

We used three approaches to detect recurrently mutated genes or regions, one for each of three mutation types: small mutations, SVs and CNAs (Supplementary Information 7). Briefly, for small mutations, each genomic element was tested for a significantly higher mutation burden than its adjacent background sequences. The genomic elements, retrieved from syn5259886 of the PCAWG Consortium20, comprised one group of coding sequences and ten groups of non-coding regions. SV breakpoints were tested per gene for statistical enrichment using gamma-Poisson regression, corrected for genomic covariates12. Focal and arm-level recurrent CNAs were examined using GISTIC (v.2.0.23)42. Known driver mutations in coding and non-coding regions published by PCAWG20,43,44 were also recorded across our 183 tumors, as were those specific to prostate cancer genes7,8,12,17,18.
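The idea of testing an element against its adjacent background can be illustrated with a small sketch. This is not the study's procedure (that is described in Supplementary Information 7); it swaps in a simple conditional binomial test as an assumption, and the function name, arguments and toy counts are hypothetical. Conditional on the total number of mutations in the element plus its flanks, the count falling inside the element is treated as binomial with success probability equal to the element's share of the combined length.

```python
from math import comb


def enrichment_p_value(k_element, len_element, k_background, len_background):
    """One-sided P(X >= k_element) under a binomial null.

    Assumption (not from the source): mutations fall uniformly over the
    element plus its flanking background, so each of the n mutations lands
    in the element with probability len_element / (len_element + len_background).
    """
    n = k_element + k_background
    p = len_element / (len_element + len_background)
    # Sum the upper tail of the binomial distribution.
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k_element, n + 1)
    )


# Toy example: a 100 bp element with 5 mutations versus 900 bp of flanking
# sequence with 5 mutations (values invented for illustration only).
p_value = enrichment_p_value(5, 100, 5, 900)
print(f"one-sided p = {p_value:.4g}")
```

A real analysis would also need covariate correction and multiple-testing control across elements, which this sketch omits.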

Integrative analysis of prostate cancer subtypes

Integrative clustering of three types of genomic data for the 183 patients was performed using iClusterPlus11,45 in R, with the following inputs: (1) driver genes and elements; (2) somatic CN segments; and (3) significantly recurrent SV breakpoints. We ran iClusterPlus.tune with clusters ranging from 1 to 9. We also performed unsupervised consensus clustering on each of the three data types individually. Association analysis of genomic alterations with the different iCluster subtypes was performed in detail (Supplementary Information 8). Differences in driver mutations, recurrent breakpoints and somatic CNAs across iCluster subtypes were reported.

Comparison of iCluster with Asian and pan-cancer data

To compare molecular subtypes between extant human populations, the Chinese Prostate Cancer Genome and Epigenome Atlas (CPGEA, PRJCA001124)11 was merged with our data and processed through our integrative clustering analysis of the three data types described above, with some modifications. In addition, we used data from the PCAWG Consortium13 to define molecular subtypes across different ethnic groups in other cancer types, using published data on somatic mutations, SVs and gene-level GISTIC results. Four cancer types (breast, liver, ovarian and pancreatic) were considered because they included primary ancestries of African, Asian and European with a contribution of at least 70%. Full details are provided in Supplementary Information 8.4.

PCAWG13 participants with prostate cancer were retrieved for comparison with the Australian data, which have clinical follow-up. Only those with more than 90% European ancestry (n = 139) were analyzed by iCluster subtyping of the three types of genomic data, as well as by individual consensus clustering. Clustering results identical to those of the larger cohort described above were chosen for the association analyses. Differences in biochemical relapse and lethal prostate cancer across subtypes were assessed using Kaplan–Meier plots followed by a log-rank test for significance.
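For readers unfamiliar with the log-rank test, the two-group case can be sketched in a few lines of standard-library Python. This is a generic textbook implementation, not the code used in the study; the function name, argument layout and toy follow-up times are assumptions made for illustration.

```python
from math import erfc, sqrt


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test.

    times_*  : follow-up time for each patient
    events_* : 1 if the event (e.g. relapse) was observed, 0 if censored
    Returns (chi-square statistic with 1 d.f., p-value).
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})

    observed_a = expected_a = variance = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for tt, _, g in data if tt >= t and g == 0)
        n_b = sum(1 for tt, _, g in data if tt >= t and g == 1)
        # Events occurring exactly at time t.
        d_a = sum(1 for tt, e, g in data if tt == t and e and g == 0)
        d_b = sum(1 for tt, e, g in data if tt == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:  # hypergeometric variance is undefined for n == 1
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Survival function of a chi-square variable with 1 degree of freedom.
    return chi2, erfc(sqrt(chi2 / 2))


# Toy data (invented): subtype A relapses early, subtype B late.
chi2, p = logrank_test([1, 2, 3, 4, 5], [1] * 5, [10, 11, 12, 13, 14], [1] * 5)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```

With more than two subtypes the statistic generalizes to a chi-square with k - 1 degrees of freedom, which this sketch does not cover.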

Analysis of mutational signatures

Mutational signatures (SBS, DBS and indels), as defined by the PCAWG Mutational Signatures Working Group3, were fitted to individual tumors with observed signature activities using SigProfiler46. Non-negative matrix factorization was implemented to detect de novo and global signature profiles among the 183 patients, together with their contributions. Novel mutational signatures of genomic rearrangement (CN and SV) were also extracted using non-negative matrix factorization, with 45 CN and 44 SV features examined across the 183 tumors. We followed the PCAWG working classification and annotation scheme for genomic rearrangement26. Two SV callers were used to obtain exact breakpoint coordinates. Replication timing scores influencing SV detection were set at >75, 20–75 and <20 (ref. 47). Full details of the relevant analysis steps, parameters and statistical tests are provided in Supplementary Information 9.
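As a rough illustration of how non-negative matrix factorization separates a feature-by-tumor count matrix into signatures and per-tumor activities, the following sketch uses the classic multiplicative-update rule for the squared-error objective. It is a generic, dependency-free example and not the SigProfiler implementation; the function names, the choice of objective, the rank and the tiny input matrix are all assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor V (features x tumors) into W (features x rank) and H (rank x tumors).

    Columns of W play the role of signatures, rows of H their activities.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + eps for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + eps for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(n_cols)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n_rows)]
    return w, h


# Toy matrix (invented): 4 features x 3 tumors.
v = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [1.0, 0.0, 1.0], [3.0, 2.0, 5.0]]
w, h = nmf(v, rank=2)
print(f"reconstruction error: {frobenius_error(v, w, h):.6f}")
```

In practice the number of signatures is chosen by repeating the factorization over several ranks and random restarts and comparing stability and reconstruction error, a step left out here.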

Reconstruction of cancer timelines

Timing of CN gains and driver mutations (SNVs and indels) into four epochs of cancer evolution (early clonal, unspecified clonal, late clonal and subclonal) was performed using MutationTimeR24. CN gains including 2+0, 2+1 and 2+2 (1+1 for a diploid genome) were taken into account to give a clearer boundary between epochs than variant allele frequency information alone. Confidence intervals (t_lo–t_up) for timing estimates were calculated with 200 bootstraps. Mutation rates for each subtype were calculated according to ref. 24, such that only CpG-to-TpG mutations were counted for the analysis; these are attributed to spontaneous deamination of 5-methyl-cytosine to thymine at CpG dinucleotides and therefore act as a molecular clock.
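Two ingredients of this paragraph lend themselves to a short sketch: recognizing a CpG-to-TpG substitution from its trinucleotide context, and deriving a percentile confidence interval from 200 bootstrap resamples. Both functions below are generic illustrations with hypothetical names; in particular, the simple resampling shown here is an assumption and is not MutationTimeR's own bootstrap procedure.

```python
import random


def is_cpg_to_tpg(trinucleotide, alt):
    """True for a C>T change at a CpG site, on either strand.

    trinucleotide: reference context centered on the mutated base (e.g. "ACG").
    alt: the alternate base.
    """
    context, alt = trinucleotide.upper(), alt.upper()
    ref = context[1]
    if ref == "C" and alt == "T":
        return context[2] == "G"  # C followed by G on the reference strand
    if ref == "G" and alt == "A":
        return context[0] == "C"  # same event read from the opposite strand
    return False


def bootstrap_ci(values, statistic, n_boot=200, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(values)."""
    rng = random.Random(seed)
    stats = sorted(
        statistic([rng.choice(values) for _ in values]) for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Toy mutations (invented): (trinucleotide context, alternate base).
mutations = [("ACG", "T"), ("ACA", "T"), ("CGT", "A"), ("TCG", "T"), ("AGT", "A")]
flags = [1 if is_cpg_to_tpg(ctx, alt) else 0 for ctx, alt in mutations]
fraction = sum(flags) / len(flags)
lo, hi = bootstrap_ci(flags, lambda xs: sum(xs) / len(xs))
print(f"CpG>TpG fraction = {fraction:.2f}, bootstrap interval = ({lo:.2f}, {hi:.2f})")
```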

League model relative ordering was performed to aggregate across all study samples and calculate the overall ranking of driver mutations and recurrent CNAs. The information used for ranking was derived from the timing of each driver mutation and of the clonal and subclonal CN segments, as described above. A full description is provided in Supplementary Information 10.
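The intuition behind a league-style ordering can be conveyed with a deliberately simplified sketch: within each tumor, every pair of timed events plays one "match", the earlier event wins, and events are ranked by their win rate across all tumors. The scoring rule, function name and toy timing values below are assumptions for illustration only and should not be read as the procedure detailed in Supplementary Information 10.

```python
from collections import defaultdict
from itertools import combinations


def league_ranking(samples):
    """Rank events from earliest to latest by pairwise win rate.

    samples: list of dicts mapping event name -> timing estimate
             (smaller value = earlier in tumor evolution).
    """
    points = defaultdict(float)
    games = defaultdict(int)
    for timings in samples:
        for a, b in combinations(sorted(timings), 2):
            games[a] += 1
            games[b] += 1
            if timings[a] < timings[b]:
                points[a] += 1.0
            elif timings[b] < timings[a]:
                points[b] += 1.0
            else:  # tie: split the point
                points[a] += 0.5
                points[b] += 0.5
    # Highest win rate first; ties broken alphabetically for determinism.
    return sorted(games, key=lambda e: (-points[e] / games[e], e))


# Toy timing estimates (invented values, three tumors).
tumors = [
    {"TP53": 0.2, "SPOP": 0.1, "gain_8q": 0.6},
    {"SPOP": 0.2, "gain_8q": 0.5},
    {"TP53": 0.3, "gain_8q": 0.9},
]
print(league_ranking(tumors))
```

A fuller treatment would propagate timing uncertainty, for example by repeating the ranking over resampled timing estimates, rather than relying on point estimates as done here.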

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.
