RESEARCH PRODUCT
Mycobacterium tuberculosis complex lineage 5 exhibits high levels of within-lineage genomic diversity and differing gene content compared to the type strain H37Rv
Boatema Ofori-Anyinam, Martin Antonio, Dissou Affolabi, C. N’dira Sanoussi, Mireia Coscolla, Sebastien Gagneux, Conor J. Meehan, Simon R. Harris, Leen Rigouts, Dorothy Yeboah-Manu, Isaac Darko Otchere, Bouke C. de Jong, Julian Parkhill, Stefan Niemann
subject
0301 basic medicine; Lineage (genetic); Genotype; 030106 microbiology; Sequence assembly; Pathogens and Epidemiology; lineage 5; Genome; genomic diversity; 03 medical and health sciences; Species Specificity; Drug Resistance Multiple Bacterial; Humans; Tuberculosis; H37Rv; Biology; Gene; Research Articles; reference genome; within-lineage variability; Genetics; Whole Genome Sequencing; Chromosome Mapping; Genetic Variation; High-Throughput Nucleotide Sequencing; Mycobacterium tuberculosis; Sequence Analysis DNA; gene presence/absence; General Medicine; biology.organism_classification; 030104 developmental biology; L5.3.2; Mycobacterium tuberculosis complex; M. africanum; Human medicine; Mycobacterium africanum; Genome Bacterial
description
Pathogens of the Mycobacterium tuberculosis complex (MTBC) are considered to be monomorphic, with little gene content variation between strains. Nevertheless, several genotypic and phenotypic factors separate strains of the different MTBC lineages (L), especially L5 and L6 (traditionally termed Mycobacterium africanum) strains, from each other. However, this genome variability and gene content, especially of L5 strains, has not been fully explored and may be important for pathobiology and current approaches for genomic analysis of MTBC strains, including transmission studies. By comparing the genomes of 355 L5 clinical strains (including 3 complete genomes and 352 Illumina whole-genome sequenced isolates) to each other and to H37Rv, we identified multiple genes that were differentially present or absent between H37Rv and L5 strains. Additionally, considerable gene content variability was found across L5 strains, including a split in the L5.3 sub-lineage into L5.3.1 and L5.3.2. These gene content differences had a small knock-on effect on transmission cluster estimation, with clustering rates influenced by the selected reference genome, and with potential overestimation of recent transmission when using H37Rv as the reference genome. We conclude that full capture of the gene diversity, especially for high-resolution outbreak analysis, requires a variation of the single H37Rv-centric reference genome mapping approach currently used in most whole-genome sequencing data analysis pipelines. Moreover, the high within-lineage gene content variability suggests that the pan-genome of M. tuberculosis is at least several kilobases larger than previously thought, implying that a concatenated or reference-free genome assembly (de novo) approach may be needed for particular questions.
year | journal |
---|---|
2021-07-01 | Microbial Genomics |