Mycobacterium tuberculosis complex lineage 5 exhibits high levels of within-lineage genomic diversity and differing gene content compared to the type strain H37Rv
Leen Rigouts, Bouke C. De Jong, Mireia Coscolla, Julian Parkhill, Martin Antonio, Boatema Ofori-Anyinam, Isaac Darko Otchere, Sebastien Gagneux, Dorothy Yeboah-Manu, C. N’dira Sanoussi, Dissou Affolabi, Conor J. Meehan, Stefan Niemann, Simon R. Harris

subject
Genetics; 0303 health sciences; Lineage (genetic); 030306 microbiology; Sequence assembly; Single-nucleotide polymorphism; Biology; biology.organism_classification; Genome; 3. Good health; 03 medical and health sciences; Mycobacterium tuberculosis complex; Gene; Mycobacterium africanum; 030304 developmental biology; Reference genome

description
Abstract

Pathogens of the Mycobacterium tuberculosis complex (MTBC) are considered monomorphic, with little gene content variation between strains. Nevertheless, several genotypic and phenotypic factors separate the different MTBC lineages (L) from each other, especially L5 and L6 (traditionally termed Mycobacterium africanum). However, genome variability and gene content, especially of L5 and L6 strains, have not been fully explored and may be important for pathobiology and for current approaches to genomic analysis of MTBC isolates, including transmission studies.

We compared the genomes of 358 L5 clinical isolates (3 completed genomes and 355 Illumina whole-genome sequenced (WGS) isolates) to the L5 complete genomes and H37Rv, and identified multiple genes differentially present or absent between H37Rv and L5 strains. Additionally, considerable gene content variability was found across L5 strains, including a split of the L5.3 sublineage into L5.3.1 and L5.3.2. These gene content differences had a small knock-on effect on transmission cluster estimation: clustering rates were influenced by the selection of reference genome, with potential over-estimation of recent transmission when H37Rv was used as the reference genome.

Our data show that using H37Rv as the reference genome results in missing SNPs in genes unique to L5 strains. This potentially leads to an underestimation of the diversity present in the genomes of L5 strains and in turn affects transmission clustering rates. As such, a full capture of gene diversity, especially for high-resolution outbreak analysis, requires a variation of the single H37Rv-centric reference genome mapping approach currently used in most WGS data analysis pipelines. Moreover, the high within-lineage gene content variability suggests that the pan-genome of M. tuberculosis is at least several kilobases larger than previously thought, implying that a concatenated or reference-free (de novo) genome assembly approach may be needed for particular questions.

Data summary

Sequence data for the Illumina dataset are available at the European Genome-phenome Archive (EGA; https://www.ebi.ac.uk/ega/) under the study accession numbers PRJEB38317 and PRJEB38656. Individual run accession numbers are indicated in Table S8. PacBio raw reads for the L5 Benin genome are available from the ENA under accession SAME3170744. The assembled L5 Benin genome is available on NCBI under accession PRJNA641267. To ensure that the naming conventions of the genes in the three L5 genomes can be followed, we have uploaded the annotated GFF files to figshare at https://doi.org/10.6084/m9.figshare.12911849.v1. Custom Python scripts used in this analysis can be found at https://github.com/conmeehan/pathophy.
year | journal | country | edition | language |
---|---|---|---|---|
2020-06-22 | | | | |