RESEARCH PRODUCT

CROSSMAPPER: estimating cross-mapping rates and optimizing experimental design in multi-species sequencing studies

Ahmed Ibrahem Hafez, Carlos Llorens, Hrant Hovhannisyan, Toni Gabaldón

subject

Statistics and Probability; Informàtica::Aplicacions de la informàtica::Bioinformàtica [Àrees temàtiques de la UPC]; Computer science; computer.software_genre; Biochemistry; Genome; Transcriptome; 03 medical and health sciences; Resource (project management); Genomes; Transcriptomics; Molecular Biology; Organism; Genòmica -- Informàtica; 030304 developmental biology; 0303 health sciences; 030306 microbiology; High-Throughput Nucleotide Sequencing; Genomics; Sequence Analysis DNA; DNA; Genome analysis; Anàlisis de seqüències; Computer Science Applications; Applications Note; Computational Mathematics; Computational Theory and Mathematics; Cross-mapping; Research Design; Metagenomics; RNA; Data mining; Line (text file); computer; Software; Genètica; parametres

description

Motivation: Numerous sequencing studies, including transcriptomics of host-pathogen systems, sequencing of hybrid genomes, xenografts, mixed species systems, metagenomics and meta-transcriptomics, involve samples containing genetic material from divergent organisms. A crucial step in these studies is identifying the organism from which each sequencing read originated, and the experimental design should aim to minimize biases caused by cross-mapping of reads to incorrect source genomes. Additionally, pooling sufficiently different genetic material into a single sequencing library could significantly reduce experimental costs, but it requires careful planning and assessment of the impact of cross-mapping. With these applications in mind, we designed Crossmapper, to our knowledge the first tool able to assess cross-mapping prior to sequencing, thereby allowing optimization of experimental design.

Results: Using any combination of reference genomes, Crossmapper simulates reads, maps those reads back to the pool of references, and quantifies and reports the cross-mapping rates for each organism. Crossmapper performs these analyses with numerous user-specified parameters, including, among others, read length, read layout, coverage, mapping parameters, and the choice of genomic or transcriptomic data. Additionally, it outputs the results in highly interactive and publication-ready reports. This allows the user to perform multiple comparisons at once and choose the experimental setup that minimizes cross-mapping rates. Moreover, Crossmapper can be used for resource optimization in sequencing facilities by pooling different samples into one sequencing library.

Availability and implementation: Crossmapper is a command line tool implemented in Python 3.6 and available as a conda package, allowing effortless installation. The source code, detailed information and a step-by-step tutorial are available on our GitHub page https://github.com/Gabaldonlab/crossmapper.

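The quantification step described under Results can be illustrated with a minimal sketch. This is not Crossmapper's own code or API: the function name `cross_mapping_rates`, the `(source, mapped_to)` input format, and the choice of all simulated reads per source as the denominator are assumptions made purely for illustration. Because the reads are simulated, the true source of each read is known, so a read counts as cross-mapped when it aligns to a reference other than its source.

```python
from collections import Counter

def cross_mapping_rates(read_assignments):
    """Return, per source organism, the fraction of its reads mapped elsewhere.

    read_assignments: iterable of (source, mapped_to) pairs, where mapped_to
    is the reference a read aligned to, or None if the read was unmapped.
    Hypothetical input format, chosen for illustration only.
    """
    totals = Counter()   # simulated reads per source organism
    crossed = Counter()  # reads that mapped to a different organism
    for source, mapped_to in read_assignments:
        totals[source] += 1
        if mapped_to is not None and mapped_to != source:
            crossed[source] += 1
    # Assumption: the rate is taken over all simulated reads of the source,
    # including unmapped ones.
    return {src: crossed[src] / totals[src] for src in totals}

# Toy data (invented): four host reads and two pathogen reads.
reads = [
    ("host", "host"),
    ("host", "host"),
    ("host", "pathogen"),
    ("host", None),
    ("pathogen", "pathogen"),
    ("pathogen", "host"),
]
rates = cross_mapping_rates(reads)
print(rates)
```

Repeating such a calculation for each candidate read length or layout would give the kind of side-by-side comparison the abstract describes for choosing an experimental setup.
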
Supplementary information: Supplementary data are available at Bioinformatics online.

This work was supported by the Spanish Ministry of Economy, Industry and Competitiveness (MEIC) for the EMBL partnership and the grant ‘Centro de Excelencia Severo Ochoa’ SEV-2012-0208, co-funded by the European Regional Development Fund (ERDF); by the CERCA Programme/Generalitat de Catalunya; by the Catalan Research Agency (AGAUR) SGR857; and by grants from the European Union’s Horizon 2020 research and innovation programme under the grant agreement ERC-2016-724173 and the Marie Sklodowska-Curie grant agreement No H2020-MSCA-ITN-2014-642095. The group also receives support from an INB grant (PT17/0009/0023–ISCIII-SGEFI/ERDF).

Peer Reviewed

10.1093/bioinformatics/btz626
http://hdl.handle.net/2117/345929