Search results for "Data_FILES"
Showing 10 of 197 documents
A Big Data Approach for Sequences Indexing on the Cloud via Burrows Wheeler Transform
2020
Indexing sequence data is important in the context of Precision Medicine, where large amounts of "omics" data have to be collected and analyzed daily in order to categorize patients and identify the most effective therapies. Here we propose an algorithm for the computation of the Burrows-Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our approach is the first that distributes the index computation, and not only the input dataset, allowing full use of the available cloud resources.
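For reference, the transform itself can be computed on a single machine by sorting all rotations of the input. A minimal sketch follows; this is the naive quadratic baseline, not the distributed algorithm the paper proposes:

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Quadratic-space toy for illustration only; practical indexes use
    suffix-array-based or distributed constructions.
    """
    s = text + sentinel  # sentinel guarantees all rotations are distinct
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)  # last column

print(bwt("banana"))  # -> "annb$aa"
```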
Sorting suffixes of a text via its Lyndon Factorization
2013
The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. Sorted suffixes are used, for instance, in the construction of the Burrows-Wheeler transform and the suffix array, both widely used in several fields of Computer Science. For this reason, much recent research has been devoted to finding new strategies for effective suffix-sorting methods. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization: the local suffixes inside the factors detected by this factorization keep their mutual order when extended to the suffixes of the whole word. This property suggests a versatile technique that easily can b…
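For background, the Lyndon factorization itself can be computed in linear time with Duval's algorithm. A small sketch of that standard construction (not the paper's suffix-sorting method) is:

```python
def lyndon_factorization(s: str) -> list[str]:
    """Duval's algorithm: factor s into non-increasing Lyndon words in O(n)."""
    factors, i, n = [], 0, len(s)
    while i < n:
        j, k = i + 1, i
        while j < n and s[k] <= s[j]:
            k = i if s[k] < s[j] else k + 1  # restart or extend the period
            j += 1
        while i <= k:
            factors.append(s[i:i + j - k])   # each factor has length j - k
            i += j - k
    return factors

print(lyndon_factorization("abaab"))  # -> ['ab', 'aab']
```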
Burrows Wheeler Transform on a Large Scale: Algorithms Implemented in Apache Spark
2021
With the rapid growth of Next Generation Sequencing (NGS) technologies, large amounts of "omics" data are collected daily and need to be processed. Indexing and compressing large sequence datasets are among the most important tasks in this context. Here we propose algorithms for the computation of the Burrows-Wheeler transform relying on Big Data technologies, i.e., Apache Spark and Hadoop. Our algorithms are the first to distribute the index computation, and not only the input dataset, allowing full use of the available cloud resources.
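One toy way to see how the rotation sort parallelizes is to let Spark sort rotation start positions by their rotations. The sketch below assumes a local SparkContext and materializes each full rotation, so it only illustrates distributing the sort, not the scalable algorithms the paper develops:

```python
from pyspark import SparkContext

def distributed_bwt_toy(s: str, sc: SparkContext) -> str:
    """Toy distributed BWT: Spark ranks rotation start positions."""
    n = len(s)
    order = (sc.parallelize(range(n))
               .sortBy(lambda i: s[i:] + s[:i])  # lexicographic rotation order
               .collect())
    # last column of the sorted rotation matrix
    return "".join(s[(i - 1) % n] for i in order)

sc = SparkContext("local[*]", "bwt-toy")
print(distributed_bwt_toy("banana$", sc))  # -> "annb$aa"
sc.stop()
```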
Lightweight LCP construction for very large collections of strings
2016
The longest common prefix array is a very useful data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows some combinatorial properties of a string to be computed efficiently, with several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings of any length. The computation is reali…
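For context, once a suffix array is available, the LCP array can be derived in linear time with Kasai's algorithm. A small sketch of that standard in-memory construction (not the lightweight extLCP method of the paper):

```python
def lcp_array(s: str, sa: list[int]) -> list[int]:
    """Kasai's algorithm: LCP array from string and suffix array in O(n)."""
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp, h = [0] * n, 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]  # suffix preceding s[i:] in sorted order
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            h = max(h - 1, 0)    # reuse all but one matched character
        else:
            h = 0
    return lcp

s = "banana"
sa = sorted(range(len(s)), key=lambda i: s[i:])  # naive suffix array for demo
print(lcp_array(s, sa))  # -> [0, 1, 3, 0, 0, 2]
```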
Sorted deduplication: How to process thousands of backup streams
2016
The requirements on deduplication systems have changed in recent years. Early deduplication systems had to process dozens to hundreds of backup streams at the same time; today they must process hundreds to thousands of them. Traditional approaches rely on stream locality, which supports parallelism but easily leads to many non-contiguous disk accesses, as each stream competes with all other streams for the available resources. This paper presents a new exact deduplication approach designed for processing thousands of backup streams at the same time on the same fingerprint index. The underlying approach destroys the traditionally exploited temporal chunk locality and cre…
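The core mechanism shared by such systems is a fingerprint index that stores each unique chunk once while every stream keeps only a recipe of fingerprints. A minimal in-memory sketch, using fixed-size chunking for simplicity (real systems use content-defined chunking and disk-resident, often sorted, indexes):

```python
import hashlib

def deduplicate(streams: dict[str, bytes], chunk_size: int = 4096):
    """Toy exact deduplication over multiple backup streams."""
    index = {}    # fingerprint -> chunk data, shared by all streams
    recipes = {}  # stream name -> list of fingerprints to rebuild it
    for name, data in streams.items():
        recipe = []
        for off in range(0, len(data), chunk_size):
            chunk = data[off:off + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            index.setdefault(fp, chunk)  # store each unique chunk once
            recipe.append(fp)
        recipes[name] = recipe
    return index, recipes

index, recipes = deduplicate({"backup-1": b"A" * 8192, "backup-2": b"A" * 8192})
print(len(index))  # -> 1 unique chunk shared by both streams
```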
Direct lookup and hash-based metadata placement for local file systems
2013
The continuously growing number of files in file systems imposes new challenges on metadata performance. The total amount of metadata can become too big to be cached, potentially leading to multiple storage device accesses for a single metadata lookup operation. This paper examines the limitations of traditional file system designs and discusses an alternative metadata handling approach, using hash-based concepts already established for metadata and data placement in distributed storage systems. Furthermore, a POSIX-compliant prototype implementation based on these concepts is introduced and benchmarked. A variety of file system metadata and data operati…
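The idea behind direct lookup is that hashing the full path identifies the bucket holding a file's metadata in one step, instead of walking the directory tree component by component. A toy sketch with hypothetical names (`HashedMetadataTable` is illustrative, not the paper's prototype):

```python
import hashlib

class HashedMetadataTable:
    """Toy direct-lookup metadata store: hash the full path to a bucket."""

    def __init__(self, n_buckets: int = 1024):
        self.buckets = [dict() for _ in range(n_buckets)]

    def _bucket(self, path: str) -> dict:
        digest = hashlib.md5(path.encode()).digest()
        return self.buckets[int.from_bytes(digest[:4], "little")
                            % len(self.buckets)]

    def set(self, path: str, inode: dict) -> None:
        self._bucket(path)[path] = inode

    def lookup(self, path: str) -> dict:
        return self._bucket(path)[path]  # one hash, one bucket access

table = HashedMetadataTable()
table.set("/home/user/report.txt", {"size": 1024, "mode": 0o644})
print(table.lookup("/home/user/report.txt"))
```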
ESB: Ext2 Split Block Device
2012
Solid State Disks (SSDs) are starting to replace rotating media (hard disks, HDDs) in many areas, but they are still not cost-efficient enough, in terms of capacity, to replace them completely. One approach to exploiting their superior performance is to use them as a cache for magnetic disks to speed up overall storage operations. In this paper, we present and evaluate a file-system-level optimization based on ext2. We split metadata and data, storing the metadata on an SSD while the data remains on a common HDD. We evaluate our system with filebench under file server, web server, and web proxy scenarios and compare the results with flashcache. We find that many of the scenarios do not contain enough meta…
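The split itself amounts to routing each block write to one of two devices depending on whether it carries metadata. A toy sketch, with plain dicts standing in for the two block devices (the real system does this at the block layer inside ext2):

```python
class SplitStore:
    """Toy metadata/data split: metadata blocks on SSD, data blocks on HDD."""

    def __init__(self):
        self.ssd = {}  # small, hot metadata blocks
        self.hdd = {}  # large data blocks

    def write(self, block_no: int, payload: bytes, is_metadata: bool) -> None:
        (self.ssd if is_metadata else self.hdd)[block_no] = payload

    def read(self, block_no: int) -> bytes:
        # serve from the faster device first, fall back to the HDD
        return self.ssd.get(block_no) or self.hdd[block_no]

store = SplitStore()
store.write(7, b"inode table segment", is_metadata=True)
store.write(8, b"file contents block", is_metadata=False)
print(store.read(7), store.read(8))
```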
Comparative Cytogenetics Allows the Reconstruction of Human Chromosome History: The Case of Human Chromosome 13
2019
Comparative cytogenetics permits the identification of human chromosomal homologies and rearrangements between species, allowing the reconstruction of the history of each human chromosome. The aim of this work is to review evolutionary aspects of human chromosome 13. Classic and molecular cytogenetics, using comparative banding, chromosome painting, and bacterial artificial chromosome (BAC) mapping, can help us formulate hypotheses about ancestral chromosome forms; more recently, sequence data have been integrated as well. Although chromosome 13 has previously been shown to be conserved relative to the ancestral primate chromosome, it shows a degree of rearrangement in some primate taxa; fu…
Hypervisor memory acquisition for ARM
2021
Cyber forensics uses memory acquisition in advanced forensics and malware analysis. We propose a hypervisor-based memory acquisition tool. Our implementation extends the Volatility memory forensics framework by reducing processor consumption, solves the incoherency problem in memory snapshots, and mitigates the pressure that acquisition puts on the network and the disk. We provide benchmarks and an evaluation.
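The incoherency problem is that memory keeps changing while it is being copied, so pages read early and late describe different machine states. A toy illustration below reads a saved image file (a stand-in for physical memory) twice and hashes each page to spot the pages that changed between passes; a hypervisor-based tool avoids this by pausing or trapping writes, which this sketch cannot do:

```python
import hashlib

PAGE = 4096

def changed_pages(image_path: str) -> list[int]:
    """Toy coherency check: hash every page on two passes and diff them."""
    def hash_pages() -> list[str]:
        with open(image_path, "rb") as img:
            return [hashlib.sha256(p).hexdigest()
                    for p in iter(lambda: img.read(PAGE), b"")]
    first, second = hash_pages(), hash_pages()
    # page numbers whose contents differed between the two passes
    return [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
```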
Distributed Data Collection for the ATLAS EventIndex
2015
The ATLAS EventIndex contains records of all events processed by ATLAS, in all processing stages. These records include references to the files containing each event (the GUID of the file) and the internal "pointer" to each event in the file. This information is collected by all jobs that run at Tier-0 or on the Grid and process ATLAS events. Each job produces a snippet of information for each permanent output file. This information is packed and transferred to a central broker at CERN using an ActiveMQ messaging system, and is then unpacked, sorted, and reformatted in order to be stored and catalogued in a central Hadoop server. This contribution describes in detail the Producer/Consu…
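A per-file snippet of this kind pairs the file GUID with one record per event. The sketch below shows a plausible shape of such a payload; the field names are illustrative, not the actual ATLAS schema, and the messaging step over ActiveMQ is omitted:

```python
import json

def make_snippet(file_guid: str, events: list[dict]) -> bytes:
    """Toy per-file EventIndex snippet: file GUID plus one record per event."""
    snippet = {
        "guid": file_guid,
        "events": [
            {"run": e["run"], "event": e["event"], "pointer": e["pointer"]}
            for e in events
        ],
    }
    return json.dumps(snippet).encode()  # packed payload for the broker

payload = make_snippet(
    "A1B2C3D4-0000-0000-0000-000000000001",
    [{"run": 358031, "event": 42, "pointer": 7}],
)
print(len(payload), "bytes ready to send")
```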