Search results for "Data_FILES"

Showing 10 of 197 documents

A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform

2020

Indexing sequence data is important in the context of Precision Medicine, where large amounts of "omics" data must be collected and analyzed daily in order to categorize patients and identify the most effective therapies. Here we propose an algorithm for computing the Burrows-Wheeler transform that relies on Big Data technologies, i.e., Apache Spark and Hadoop. Our approach is the first to distribute the index computation, and not only the input dataset, allowing full use of the available cloud resources.
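The abstract above targets the Burrows-Wheeler transform at cloud scale. As orientation, here is a minimal single-machine sketch of the transform itself (sorted rotations, with an assumed `$` sentinel); the paper's distributed Spark/Hadoop algorithm is not reproduced here.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Illustration only: O(n^2 log n), nothing like the paper's
    distributed computation, but it defines the output the
    distributed version must produce.
    """
    s = text + sentinel  # unique terminator, lexicographically smallest
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)  # last column

print(bwt("banana"))  # -> annb$aa
```

The last column groups equal characters by context, which is what makes the BWT useful both for compression and as the backbone of sequence indexes.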

FOS: Computer and information sciences; Artificial Intelligence (cs.AI); Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Artificial Intelligence; Computer Science - Data Structures and Algorithms; Data_FILES; Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC)
researchProduct

Sorting suffixes of a text via its Lyndon Factorization

2013

The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms: sorted suffixes are used, for instance, in the construction of the Burrows-Wheeler transform and of the suffix array, both widely used in several fields of Computer Science. For this reason, much recent research has been devoted to finding new strategies for performing this sorting effectively. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization: the local suffixes inside the factors detected by this factorization keep their mutual order when extended to suffixes of the whole word. This property suggests a versatile technique that easily can b…
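The Lyndon factorization the abstract relies on can be computed in linear time with Duval's algorithm; a compact sketch (not the paper's suffix-sorting method itself, just the factorization it builds on):

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: factor s into a non-increasing
    sequence of Lyndon words in O(n) time."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            # extend the candidate Lyndon word (or its periodic repeat)
            k = i if s[k] < s[j] else k + 1
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors

print(lyndon_factorization("banana"))  # -> ['b', 'an', 'an', 'a']
```

The factors are non-increasing in lexicographic order ("b" ≥ "an" ≥ "an" ≥ "a"), the property the paper exploits to sort local suffixes independently inside each factor.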

FOS: Computer and information sciences; BWT; Lyndon Factorization; Settore INF/01 - Informatica; Sorting Suffixes; Lyndon Words; Suffix array; Computer Science - Data Structures and Algorithms; Data_FILES; Data Structures and Algorithms (cs.DS)

Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark

2021

With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of "omics" data are collected daily and need to be processed. Indexing and compressing large sequence datasets are among the most important tasks in this context. Here we propose algorithms for computing the Burrows-Wheeler transform that rely on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first to distribute the index computation, and not only the input dataset, allowing full use of the available cloud resources.
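What makes the BWT suitable for both indexing and compression, as this abstract uses it, is that it is invertible. A minimal single-machine inversion sketch (repeated prepend-and-sort; real indexes use rank/select structures instead, and this is not the paper's Spark algorithm):

```python
def inverse_bwt(last_column: str, sentinel: str = "$") -> str:
    """Invert the Burrows-Wheeler transform by rebuilding the
    sorted-rotation table one column at a time. O(n^2 log n)."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        # prepend the last column to every row, then re-sort
        table = sorted(last_column[i] + table[i] for i in range(n))
    # the original text is the row that ends with the sentinel
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-len(sentinel)]

print(inverse_bwt("annb$aa"))  # -> banana
```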

FOS: Computer and information sciences; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Data Structures and Algorithms; Data_FILES; Data Structures and Algorithms (cs.DS); Distributed, Parallel, and Cluster Computing (cs.DC)

Lightweight LCP construction for very large collections of strings

2016

The longest common prefix array is a very useful data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows efficient computation of several combinatorial properties of a string, useful in many applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings of any length. The computation is reali…
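For a single string, the LCP array the abstract refers to can be derived from the suffix array in linear time with Kasai's algorithm; a sketch (extLCP itself works on collections and in external memory, which is not shown here):

```python
def suffix_array(s: str) -> list[int]:
    """Naive O(n^2 log n) suffix array, for illustration only."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def lcp_array(s: str, sa: list[int]) -> list[int]:
    """Kasai's algorithm: lcp[r] is the length of the longest common
    prefix of the suffixes of rank r-1 and r. O(n) given sa."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):            # walk suffixes in text order
        if rank[i] > 0:
            j = sa[rank[i] - 1]   # suffix preceding i in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h:
                h -= 1            # key invariant: h drops by at most 1
    return lcp

sa = suffix_array("banana$")
print(sa, lcp_array("banana$", sa))
```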

FOS: Computer and information sciences; Computer science; Computation; 0102 computer and information sciences; 02 engineering and technology; Parallel computing; 01 natural sciences; Generalized Suffix Array; Theoretical Computer Science; law.invention; law; Computational Theory and Mathematics; Computer Science - Data Structures and Algorithms; Extended Burrows-Wheeler Transform; Data_FILES; 0202 electrical engineering, electronic engineering, information engineering; Discrete Mathematics and Combinatorics; Data Structures and Algorithms (cs.DS); Auxiliary memory; Longest Common Prefix Array; String (computer science); LCP array; Suffix array; Data structure; 010201 computation theory & mathematics; 020201 artificial intelligence & image processing; Journal of Discrete Algorithms

Sorted deduplication: How to process thousands of backup streams

2016

The requirements of deduplication systems have changed in recent years. Early deduplication systems had to process dozens to hundreds of backup streams at the same time, while today they are able to process hundreds to thousands of them. Traditional approaches rely on stream locality, which supports parallelism but easily leads to many non-contiguous disk accesses, as each stream competes with all other streams for the available resources. This paper presents a new exact deduplication approach designed for processing thousands of backup streams at the same time on the same fingerprint index. The underlying approach destroys the traditionally exploited temporal chunk locality and cre…
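The fingerprint index at the heart of the abstract can be illustrated with a toy exact-deduplication sketch (fixed-size chunking and an in-memory dict; the paper's system uses content-defined chunks and a sorted, shared on-disk index, none of which is modeled here):

```python
import hashlib

def deduplicate(stream: bytes, chunk_size: int = 8):
    """Toy exact deduplication: split into fixed-size chunks,
    store each unique chunk once, keyed by its fingerprint."""
    index = {}            # fingerprint -> chunk (the "fingerprint index")
    recipe, stored = [], 0
    for off in range(0, len(stream), chunk_size):
        chunk = stream[off:off + chunk_size]
        fp = hashlib.sha256(chunk).hexdigest()
        if fp not in index:       # first occurrence: actually store it
            index[fp] = chunk
            stored += len(chunk)
        recipe.append(fp)         # a backup is just a list of fingerprints
    return recipe, stored

recipe, stored = deduplicate(b"abcdefghabcdefgh")
# two identical 8-byte chunks -> 16 bytes in, only 8 stored
```

With many concurrent streams, every stream probes this shared index, which is exactly where the contention the abstract describes arises.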

File system; 020203 distributed computing; Computer science; Data domain; Fingerprint (computing); Search engine indexing; Sorting; 020206 networking & telecommunications; 02 engineering and technology; Parallel computing; computer.software_genre; Backup; Server; Data_FILES; 0202 electrical engineering, electronic engineering, information engineering; Data deduplication; computer; 2016 32nd Symposium on Mass Storage Systems and Technologies (MSST)

Direct lookup and hash-based metadata placement for local file systems

2013

New challenges to file systems' metadata performance are imposed by the continuously growing number of files in file systems. The total amount of metadata can become too big to be cached, potentially leading to multiple storage device accesses for a single metadata lookup operation. This paper examines the limitations of traditional file system designs and discusses an alternative metadata handling approach, using hash-based concepts already established for metadata and data placement in distributed storage systems. Furthermore, a POSIX-compliant prototype implementation based on these concepts is introduced and benchmarked. A variety of file system metadata and data operati…
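The hash-based placement idea in the abstract can be sketched in a few lines: hash the full path to a fixed bucket, so a lookup goes straight to one location instead of walking every directory component. The bucket count and hash function below are illustrative assumptions, not the paper's actual layout:

```python
import hashlib

def metadata_bucket(path: str, n_buckets: int = 256) -> int:
    """Map a full path to a metadata bucket by hashing.

    Direct lookup: one hash, one bucket read -- no per-component
    directory traversal as in a traditional file system.
    """
    digest = hashlib.sha1(path.encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_buckets

bucket = metadata_bucket("/home/user/projects/notes.txt")
```

The trade-off is that renaming a directory invalidates the hashed locations of everything beneath it, one of the classic costs of path-hash placement.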

File system; Data element; Database; Computer science; Computer files -- Organization; Computer file; File organization (Computer science); Meta Data Services; computer.file_format; Metadata placement; Randomization; computer.software_genre; Metadata repository; Torrent file; Metadata; File system design; Direct lookup; Hashing; Operating system; Data_FILES; Versioning file system; Metadata performance; computer; Computer science::Operating systems [UPC subject areas]

ESB: Ext2 Split Block Device

2012

Solid State Disks (SSDs) are starting to replace rotating media (hard disks, HDDs) in many areas, but in terms of cost per capacity they are still not efficient enough to replace them completely. One approach to exploiting their superior performance is to use them as a cache for magnetic disks to speed up overall storage operations. In this paper, we present and evaluate a file-system-level optimization based on ext2: we split metadata and data, storing the metadata on an SSD while the data remains on a common HDD. We evaluate our system with filebench under file server, web server, and web proxy scenarios and compare the results with flashcache. We find that many of the scenarios do not contain enough meta…
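The metadata/data split described above amounts to a routing decision per request; a minimal sketch of that decision (device names and request categories are illustrative, not ESB's actual block-layer interface):

```python
def choose_device(request_type: str) -> str:
    """Route ext2-style requests: metadata goes to the fast SSD,
    bulk file data stays on the cheap HDD."""
    metadata_ops = {"inode", "directory", "bitmap", "superblock"}
    return "ssd" if request_type in metadata_ops else "hdd"

print(choose_device("inode"), choose_device("data"))  # -> ssd hdd
```

Unlike a cache such as flashcache, this split is static: metadata never has to be promoted or evicted, but the SSD also cannot absorb hot data blocks.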

File system; Web server; Hardware_MEMORYSTRUCTURES; Computer science; Computer file; Device file; computer.software_genre; Metadata; File server; Data_FILES; Operating system; Flashcache; Cache; computer; 2012 IEEE 18th International Conference on Parallel and Distributed Systems

Comparative Cytogenetics Allows the Reconstruction of Human Chromosome History: The Case of Human Chromosome 13

2019

Comparative cytogenetics permits the identification of chromosomal homologies and rearrangements between humans and other species, allowing the reconstruction of the history of each human chromosome. The aim of this work is to review evolutionary aspects of human chromosome 13. Classic and molecular cytogenetics using comparative banding, chromosome painting, and bacterial artificial chromosome (BAC) mapping can help us formulate hypotheses about ancestral chromosome forms; more recently, sequence data have been integrated as well. Although human chromosome 13 has previously been shown to be conserved with respect to the ancestral primate chromosome, it shows a degree of rearrangement in some primate taxa; fu…

Genetics; medicine.medical_specialty; Chromosome (genetic algorithm); InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; Data_FILES; Cytogenetics; medicine; Settore BIO/08 - Antropologia; Biology; Fish evolution mammals human synteny; General; Literature_REFERENCE (e.g. dictionaries, encyclopedias, glossaries); Chromosome 13

Hypervisor memory acquisition for ARM

2021

Cyber forensics uses memory acquisition in advanced forensics and malware analysis. We propose a hypervisor-based memory acquisition tool. Our implementation extends the Volatility memory forensics framework: it reduces processor consumption, solves the incoherency problem in memory snapshots, and mitigates the pressure the acquisition puts on the network and the disk. We provide benchmarks and an evaluation.

Hardware_MEMORYSTRUCTURES; Computer science; Hypervisor; computer.software_genre; Memory forensics; Computer Science Applications; Pathology and Forensic Medicine; Medical Laboratory Technology; Data_FILES; Operating system; Memory acquisition; Volatility (finance); Malware analysis; Law; computer; Information Systems; Forensic Science International: Digital Investigation

Distributed Data Collection for the ATLAS EventIndex

2015

The ATLAS EventIndex contains records of all events processed by ATLAS, at all processing stages. These records include references to the files containing each event (the GUID of the file) and the internal “pointer” to each event in the file. This information is collected by all jobs that run at Tier-0 or on the Grid and process ATLAS events. Each job produces a snippet of information for each permanent output file. This information is packed and transferred to a central broker at CERN using an ActiveMQ messaging system, then unpacked, sorted, and reformatted in order to be stored and catalogued in a central Hadoop server. This contribution describes in detail the Producer/Consu…
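The flow described above — jobs publish one snippet per output file to a broker, a consumer unpacks and sorts them for the catalogue — can be sketched with a local queue standing in for ActiveMQ. Field names and message format here are illustrative assumptions, not the EventIndex wire format:

```python
import json
import queue

# queue.Queue stands in for the ActiveMQ broker at CERN.
broker: queue.Queue = queue.Queue()

def producer(guid: str, events: list[int]) -> None:
    """A Grid/Tier-0 job publishes one snippet per permanent output file."""
    broker.put(json.dumps({"guid": guid, "events": events}))

def consumer(n_messages: int) -> list[dict]:
    """Unpack snippets from the broker and sort them before cataloguing."""
    records = [json.loads(broker.get()) for _ in range(n_messages)]
    return sorted(records, key=lambda r: r["guid"])

producer("file-B", [3, 1])
producer("file-A", [2])
catalogue = consumer(2)   # sorted snippets, ready for the Hadoop store
```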

History; Data collection; Database; Computer science; Snippet; computer.software_genre; Grid; Computer Science Applications; Education; Metadata; Pointer (computer programming); Data_FILES; computer; Particle Physics - Experiment