
AUTHOR

André Brinkmann

AIOC2: A deep Q-learning approach to autonomic I/O congestion control in Lustre

In high-performance computing systems, I/O congestion is a common problem in large-scale distributed file systems. However, current implementations mainly require administrators to manually design low-level implementations and optimizations. We therefore propose an adaptive I/O congestion control framework, named AIOC2, which can not only adaptively tune the I/O congestion control parameters, but also exploit the deep Q-learning method to train the tuning parameters and optimize the tuning for different types of workloads from both the server and the client at the same time. AIOC2 combines feedback-based dynamic I/O congestion control and deep Q-learning parameter tuning technology to …
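
The abstract gives no implementation details, so as a rough, hypothetical illustration of the deep Q-learning idea behind such tuning, the sketch below shows a plain tabular Q-learning update that picks a congestion-control parameter (here an invented RPC rate limit) from an observed I/O state. States, actions, and the reward are assumptions for illustration, not AIOC2's design.

```python
import random
from collections import defaultdict

# Hypothetical sketch: tabular Q-learning for picking an I/O rate limit.
# States, actions, and the reward are illustrative, not taken from AIOC2.

ACTIONS = [64, 128, 256, 512]          # candidate RPC rate limits (requests/s)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2  # learning rate, discount, exploration

q_table = defaultdict(float)           # (state, action) -> estimated value

def choose_action(state):
    """Epsilon-greedy action selection over the candidate rate limits."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table[(state, a)])

def update(state, action, reward, next_state):
    """Standard Q-learning update: Q <- Q + alpha * (r + gamma * max_a' Q' - Q)."""
    best_next = max(q_table[(next_state, a)] for a in ACTIONS)
    td_target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (td_target - q_table[(state, action)])

# Example step: observe congestion level, act, observe throughput as reward.
state = "high_queue_depth"
action = choose_action(state)
reward = 0.8                            # e.g., normalized throughput after acting
update(state, action, reward, "medium_queue_depth")
```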

research product

Quantum chemical meta-workflows in MoSGrid

Quantum chemical workflows can be built up within the science gateway Molecular Simulation Grid. Complex workflows required by the end users are dissected into smaller workflows that can be combined freely to larger meta-workflows. General quantum chemical workflows are described here as well as the real use case of a spectroscopic analysis resulting in an end-user desired meta-workflow. All workflow features are implemented via Web Services Parallel Grid Runtime and Developer Environment and submitted to UNICORE. The workflows are stored in the Molecular Simulation Grid repository and ported to the SHIWA repository. © 2014 John Wiley & Sons, Ltd.

research product

Improving Collective I/O Performance Using Non-volatile Memory Devices

Collective I/O is a parallel I/O technique designed to deliver high performance data access to scientific applications running on high-end computing clusters. In collective I/O, write performance is highly dependent upon the storage system response time and limited by the slowest writer. The storage system response time in conjunction with the need for global synchronisation, required during every round of data exchange and write, severely impacts collective I/O performance. Future Exascale systems will have an increasing number of processor cores, while the number of storage servers will remain relatively small. Therefore, the storage system concurrency level will further increase, worseni…

research product

Extending SSD lifetime in database applications with page overwrites

Flash-based Solid State Disks (SSDs) have been a great success story over the last years and are widely used in embedded systems, servers, and laptops. One often overlooked ability of NAND flash is that flash pages can be overwritten in certain circumstances. This can be used to decrease wear out and increase performance. In this paper, we analyze the potential of overwrites for the most used data structure in database applications: the B-Tree. We show that with overwrites it is possible to significantly reduce flash wear out and increase overall performance.
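
The physical constraint exploited here is that programming NAND flash can only clear bits, so a page can be overwritten in place only if the new content never needs a cleared bit to be set again. A minimal, hypothetical check of that condition (not the paper's B-Tree logic) could look as follows:

```python
def can_overwrite(old: bytes, new: bytes) -> bool:
    """NAND programming can only flip bits from 1 to 0, so an in-place
    overwrite is possible iff the new page never sets a bit that is
    already 0 in the old page (i.e., new has no 1 where old has a 0)."""
    assert len(old) == len(new)
    return all((n & ~o) & 0xFF == 0 for o, n in zip(old, new))

# Example: clearing bits is fine, setting an already-cleared bit is not.
print(can_overwrite(b"\xff\xf0", b"\x0f\xf0"))  # True: only 1 -> 0 transitions
print(can_overwrite(b"\x0f\xf0", b"\xff\xf0"))  # False: would need 0 -> 1
```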

research product

Improving LSM‐trie performance by parallel search

research product

Towards Dynamic Scripted pNFS Layouts

Today's network file systems consist of a variety of complex subprotocols and backend storage classes. The data is typically spread over multiple data servers to achieve higher levels of performance and reliability. A metadata server is responsible for creating the mapping of a file to these data servers. It is hard to map application-specific access patterns to storage system specific features, which can result in degraded I/O performance. We present an NFSv4.1/pNFS protocol extension that integrates the client's ability to provide hints and I/O advice to metadata servers. We define multiple storage classes and allow the client to choose which type of storage fits best for its desired ac…

research product

Online Management of Hybrid DRAM-NVMM Memory for HPC

Non-volatile main memories (NVMMs) offer a comparable performance to DRAM, while requiring lower static power consumption and enabling higher densities. NVMM therefore can provide opportunities for improving both energy efficiency and costs of main memory. Previous hybrid main memory management approaches for HPC either do not consider the unique characteristics of NVMMs, depend on high profiling costs, or need source code modifications. In this paper, we investigate HPC applications' behaviors in the presence of NVMM as part of the main memory. By performing a comprehensive study of HPC applications and based on several key observations, we propose an online hybrid memory architecture for …

research product

GekkoFS - A Temporary Distributed File System for HPC Applications

We present GekkoFS, a temporary, highly-scalable burst buffer file system which has been specifically optimized for new access patterns of data-intensive High-Performance Computing (HPC) applications. The file system provides relaxed POSIX semantics, only offering features which are actually required by most (not all) applications. It is able to provide scalable I/O performance and reaches millions of metadata operations already for a small number of nodes, significantly outperforming the capabilities of general-purpose parallel file systems. The work has been funded by the German Research Foundation (DFG) through the ADA-FS project as part of the Priority Programme 1648. It is also support…

research product

File system scalability with highly decentralized metadata on independent storage devices

This paper discusses using hard drives that integrate a key-value interface and network access in the actual drive hardware (Kinetic storage platform) to supply file system functionality in a large scale environment. Taking advantage of higher-level functionality to handle metadata on the drives themselves, a serverless system architecture is proposed. Skipping path component traversal during the lookup operation is the key technique discussed in this paper to avoid performance degradation with highly decentralized metadata. Scalability implications are reviewed based on a fuse file system implementation.

research product

One Phase Commit: A Low Overhead Atomic Commitment Protocol for Scalable Metadata Services

As the number of client machines in high-end computing clusters increases, a file system using a centralized metadata server cannot keep up with the resulting volume of requests. This problem will be even more prominent with the advent of the exascale computing age. In this context, the centralized metadata server represents a bottleneck for the scaling of the file system performance as well as a single point of failure. To overcome this problem, file systems are evolving from centralized metadata services to distributed metadata services. The metadata distribution raises a number of additional problems that must be taken into account. In this paper we will focus on the problem of managi…

research product

Building a Medical Research Cloud in the EASI-CLOUDS Project

The demand for IT resources is constantly growing in the scientific area. The ability to store and process increasing amounts of data has transformed many research disciplines, like the life-sciences, which now rely on complex data processing and data analytics. Cloud environments are able to integrate and encapsulate possibly distributed resources and allow convenient and on-demand access to the corresponding services, tools, and complete work environments. The European research project EASI-CLOUDS (http://www. easi-clouds.eu) develops a platform for a convenient service delivery with special regard to service integration, monitoring, management, and Service Level Agreement (SLA) negotiati…

research product

Improving checkpointing intervals by considering individual job failure probabilities

Checkpointing is a popular resilience method in HPC and its efficiency highly depends on the choice of the checkpoint interval. Standard analytical approaches optimize intervals for big, long-running jobs that fail with high probability, while they are unable to minimize checkpointing overheads for jobs with a low or medium probability of failing. Nevertheless, our analysis of batch traces of four HPC systems shows that these jobs are extremely common. We therefore propose an iterative checkpointing algorithm to compute efficient intervals for jobs with a medium risk of failure. The method also supports big and long-running jobs by converging to the results of various traditional methods for…
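
For context, the classical baseline such methods are compared against can be stated directly: Young's first-order approximation derives the checkpoint interval from the checkpoint cost C and the mean time between failures M as T ≈ sqrt(2·C·M). A small sketch with made-up numbers (not the paper's iterative algorithm):

```python
import math

def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval:
    T_opt ~ sqrt(2 * C * MTBF). Shown here only as the classical baseline
    that interval-optimization methods converge to for long-running jobs."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Example with made-up numbers: a 5-minute checkpoint and a 24-hour MTBF.
print(young_interval(300, 24 * 3600) / 3600, "hours between checkpoints")  # ~2.0
```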

research product

Constant Time Garbage Collection in SSDs

research product

Building a medical research cloud in the EASI-CLOUDS project

The demand for Information Technology (IT) resources is constantly growing in the scientific area. The ability to store and process increasing amounts of data has transformed many research disciplines like the life sciences, which now rely on complex data processing and data analytics. Cloud computing can provide researchers with scalable and easy-to-use hardware and software resources and allows on-demand access to services, tools, or even complete work environments. The European research project EASI-CLOUDS has developed a service delivery platform with special regard to service integration, monitoring and management, and the negotiation of service level agreements. In order to de…

research product

Fusing storage and computing for the domain of business intelligence and analytics: research opportunities

With the growing importance of external and shared data, the set of requirements for Business Intelligence and Analytics (BIA) is shifting. Current solutions still come with shortcomings, especially in multi-stakeholder environments where sensitive content is exchanged. We argue that a new level in the evolution of BIA can be unlocked by tearing down the barriers between storage and computing based on upcoming storage technologies. In particular, we propose a revitalization of ideas from object-oriented databases. We present results from a joint project that aimed at delineating design options for BIA solutions built upon this idea. The paper outlines the interplay of various architectural layers…

research product

MCD: Overcoming the Data Download Bottleneck in Data Centers

The data download problem in data centers describes the increasingly common task of coordinated loading of identical data to a large number of nodes. Data download is seen as a significant problem in exascale HPC applications. Uncoordinated reading from a central file server creates contention at the file server and its network interconnect. We propose and evaluate a reliable multicast-based approach to solve the data download problem. The MCD system builds a logical multi-rooted tree based on the physical network topology and uses the logical view for a two-phase approach. In the first phase, the data is multicasted to all nodes. In the second phase, the logical tree is used for an effi…

research product

On the Influence of PRNGs on Data Distribution

The amount of digital information produced grows rapidly and constantly. Storage systems use clustered architectures designed to store and process this information efficiently. Their use introduces new challenges in storage systems development, like load-balancing and data distribution. A variety of randomized solutions handling data placement issues have been proposed and utilized. However, to the best of our knowledge, there has not yet been a structured analysis of the influence of pseudo random number generators (PRNGs) on the data distribution. In the first part of this paper we consider Consistent Hashing [1] as a combination of two consecutive phases: distribution of bins and distrib…
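
As a minimal reference point for the two phases discussed (distributing bins, then distributing balls), a consistent-hashing ring can be sketched as below; the PRNG that places the virtual bins is exactly the component whose influence the paper analyzes. The sketch is illustrative and not the paper's experimental setup.

```python
import bisect
import hashlib
import random

class ConsistentHashRing:
    """Minimal consistent-hashing sketch: bins are placed on a ring by a PRNG,
    balls (keys) are hashed onto the ring and assigned to the next bin."""

    def __init__(self, bins, virtual_nodes=100, seed=42):
        rng = random.Random(seed)          # the PRNG whose quality is under study
        self.ring = sorted((rng.random(), b) for b in bins
                           for _ in range(virtual_nodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        pos = (h % 10**12) / 10**12        # map the key hash onto [0, 1)
        idx = bisect.bisect(self.points, pos) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing([f"disk{i}" for i in range(8)])
print(ring.lookup("/data/file-0001"))
```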

research product

And Now for Something Completely Different: Running Lisp on GPUs

The internal parallelism of compute resources increases permanently, and graphics processing units (GPUs) and other accelerators have been gaining importance in many domains. Researchers from life science, bioinformatics or artificial intelligence, for example, use GPUs to accelerate their computations. However, languages typically used in some of these disciplines often do not benefit from the technical developments because they cannot be executed natively on GPUs. Instead existing programs must be rewritten in other, less dynamic programming languages. On the other hand, the gap in programming features between accelerators and common CPUs shrinks permanently. Since accelerators are becomi…

research product

Streamlining distributed Deep Learning I/O with ad hoc file systems

With evolving techniques to parallelize Deep Learning (DL) and the growing amount of training data and model complexity, High-Performance Computing (HPC) has become increasingly important for machine learning engineers. Although many compute clusters already use learning accelerators or GPUs, HPC storage systems are not suitable for the I/O requirements of DL workflows. Therefore, users typically copy the whole training data to the worker nodes or distribute partitions. Because DL depends on randomized input data, prior work stated that partitioning impacts DL accuracy. Their solutions focused mainly on training I/O performance on a high-speed network but did not cover the data stage-in pro…

research product

Deduplication Potential of HPC Applications’ Checkpoints

HPC systems contain an increasing number of components, decreasing the mean time between failures. Checkpoint mechanisms help to overcome such failures for long-running applications. A viable solution to remove the resulting pressure from the I/O backends is to deduplicate the checkpoints. However, there is little knowledge about the potential to save I/Os for HPC applications by using deduplication within the checkpointing process. In this paper, we perform a broad study about the deduplication behavior of HPC application checkpointing and its impact on system design.
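
Such a study essentially measures how many chunk fingerprints repeat across checkpoints. A hypothetical sketch of that measurement with fixed-size chunking (the paper's actual chunking and fingerprinting choices may differ):

```python
import hashlib

def dedup_ratio(data: bytes, chunk_size: int = 4096) -> float:
    """Estimate deduplication potential with fixed-size chunking:
    ratio of logical chunks to unique chunk fingerprints."""
    fingerprints = [hashlib.sha256(data[i:i + chunk_size]).digest()
                    for i in range(0, len(data), chunk_size)]
    return len(fingerprints) / max(1, len(set(fingerprints)))

# Example: a checkpoint-like buffer with heavily repeated content.
checkpoint = b"A" * 4096 * 10 + b"B" * 4096 * 2
print(dedup_ratio(checkpoint))   # 6.0: 12 chunks, 2 unique fingerprints
```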

research product

Accelerating Application Migration in HPC

It is predicted that the number of cores per node will rapidly increase with the upcoming era of exascale supercomputers. As a result, multiple applications will have to share one node and compete for the (often scarce) resources available on this node. Furthermore, the growing number of hardware components causes a decrease in the mean time between failures. Application migration between nodes has been proposed as a tool to mitigate these two problems: Bottlenecks due to resource sharing can be addressed by load balancing schemes which migrate applications; and hardware errors can often be tolerated by the system if faulty nodes are detected and processes are migrated ahead of time.

research product

Distributing Storage in Cloud Environments

Cloud computing has a major impact on today's IT strategies. Outsourcing applications from IT departments to the cloud relieves users from building big infrastructures as well as from building the corresponding expertise, and allows them to focus on their main competences and businesses. One of the main hurdles of cloud computing is that not only the application, but also the data has to be moved to the cloud. Networking speed severely limits the amount of data that can travel between the cloud and the user, between different sites of the same cloud provider, or indeed between different cloud providers. It is therefore important to keep applications near the data itself. This paper investig…

research product

FADaC

Solid state drives (SSDs) implement a log-structured write pattern, where obsolete data remains stored on flash pages until the flash translation layer (FTL) erases them. erase() operations, however, cannot erase a single page, but target entire flash blocks. Since these victim blocks typically store a mix of valid and obsolete pages, FTLs have to copy the valid data to a new block before issuing an erase() operation. This process therefore increases the latencies of concurrent I/Os and reduces the lifetime of flash memory. Data classification schemes identify data pages with similar update frequencies and group them together. FTLs can use this grouping to design garbage collection strategi…

research product

Pure Functions in C: A Small Keyword for Automatic Parallelization

The need for parallel task execution has been steadily growing in recent years since manufacturers mainly improve processor performance by increasing the number of installed cores instead of scaling the processor’s frequency. To make use of this potential, an essential technique to increase the parallelism of a program is to parallelize loops. Several automatic loop nest parallelizers have been developed in the past such as PluTo. The main restriction of these tools is that the loops must be statically analyzable which, among other things, disallows function calls within the loops. In this article, we present a seemingly simple extension to the C programming language which marks fun…
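
The guarantee that a loop body is a pure function (no side effects, result depends only on its arguments) is what makes parallelizing the loop safe. The article's mechanism is a C keyword plus compiler support; the sketch below only illustrates the underlying idea in Python by mapping a pure loop body over a process pool.

```python
from concurrent.futures import ProcessPoolExecutor

def body(i: int) -> int:
    """A 'pure' loop body: no side effects, output depends only on the input,
    so iterations can run in any order and in parallel."""
    return i * i + 3 * i

if __name__ == "__main__":
    # Sequential loop ...
    sequential = [body(i) for i in range(1000)]
    # ... and the same loop parallelized, safe only because body() is pure.
    with ProcessPoolExecutor() as pool:
        parallel = list(pool.map(body, range(1000)))
    assert sequential == parallel
```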

research product

Using On-Demand File Systems in HPC Environments

In modern HPC systems, parallel (distributed) file systems are used to allow fast access from and to the storage infrastructure. However, I/O performance in large-scale HPC systems has failed to keep up with the increase in computational power. As a result, the I/O subsystem which also has to cope with a large number of demanding metadata operations is often the bottleneck of the entire HPC system. In some cases, even a single bad behaving application can be held responsible for slowing down the entire HPC system, disrupting other applications that use the same I/O subsystem. These kinds of situations are likely to become more frequent in the future with larger and more powerful HPC systems…

research product

Scalable Monitoring System for Clouds

Although cloud computing has become an important topic over the last couple of years, the development of cloud-specific monitoring systems has been neglected. This is surprising considering their importance for metering services and, thus, being able to charge customers. In this paper we introduce a monitoring architecture that was developed and is currently implemented in the EASI-CLOUDS project. The demands on cloud monitoring systems are manifold. Regular checks of the SLAs and the precise billing of the resource usage, for instance, require the collection and converting of infrastructure readings in short intervals. To ensure the scalability of the whole cloud, the monitoring system mus…

research product

A configurable rule based classful token bucket filter network request scheduler for the Lustre file system

HPC file systems today work in a best-effort manner where individual applications can flood the file system with requests, effectively leading to a denial of service for all other tasks. This paper presents a classful Token Bucket Filter (TBF) policy for the Lustre file system. The TBF enforces Remote Procedure Call (RPC) rate limitations based on (potentially complex) Quality of Service (QoS) rules. The QoS rules are enforced in Lustre's Object Storage Servers, where each request is assigned to an automatically created QoS class. The proposed QoS implementation for Lustre enables various features for each class including the support for high-priority and real-time requests even under heavy …
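
A token bucket admits a request only if a token is available, with tokens refilled at the configured rate; keeping one bucket per QoS class yields the classful behavior described above. The sketch below is a generic token bucket, not Lustre's TBF implementation.

```python
import time

class TokenBucket:
    """Generic token bucket: refill at `rate` tokens/s up to `burst`,
    admit a request only if a whole token can be taken."""

    def __init__(self, rate: float, burst: float):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per QoS class, e.g. keyed by job ID or client (illustrative only).
buckets = {"interactive": TokenBucket(rate=1000, burst=100),
           "batch": TokenBucket(rate=100, burst=10)}
print(buckets["batch"].allow())
```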

research product

Reducing False Node Failure Predictions in HPC

Future HPC applications must be able to scale to thousands of compute nodes, while running for several days. The increased runtime and node count inconveniently raises the probability of hardware failures that may interrupt computations. Scientists must therefore protect their simulations against hardware failures. This is typically done using frequent checkpoint & restart, which may have significant overheads. Consequently, the frequency in which checkpoints are taken should be minimized. Predicting hardware failures ahead of time is a promising approach to address this problem, but has remaining issues like false alarms at large scales. In this paper, we introduce the probability of unnece…

research product

Challenges and Solutions for Tracing Storage Systems

IBM Spectrum Scale’s parallel file system General Parallel File System (GPFS) has a 20-year development history with over 100 contributing developers. Its ability to support strict POSIX semantics across more than 10K clients leads to a complex design with intricate interactions between the cluster nodes. Tracing has proven to be a vital tool to understand the behavior and the anomalies of such a complex software product. However, the necessary trace information is often buried in hundreds of gigabytes of by-product trace records. Further, the overhead of tracing can significantly impact running applications and file system performance, limiting the use of tracing in a production system. In…

research product

Simurgh

The availability of non-volatile main memory (NVMM) has started a new era for storage systems and NVMM specific file systems can support extremely high data and metadata rates, which are required by many HPC and data-intensive applications. Scaling metadata performance within NVMM file systems is nevertheless often restricted by the Linux kernel storage stack, while simply moving metadata management to the user space can compromise security or flexibility. This paper introduces Simurgh, a hardware-assisted user space file system with decentralized metadata management that allows secure metadata updates from within user space. Simurgh guarantees consistency, durability, and ordering of updat…

research product

MERCURY: A Transparent Guided I/O Framework for High Performance I/O Stacks

The performance gap between processors and I/O represents a serious scalability limitation for applications running on computing clusters. Parallel file systems often provide mechanisms that allow programmers to disclose their I/O pattern knowledge to the lower layers of the I/O stack through a hints API. This information can be used by the file system to boost the application performance. Unfortunately, programmers rarely make use of these features, missing the opportunity to exploit the full potential of the storage system. In this paper we propose MERCURY, a transparent guided I/O framework able to optimize file I/O patterns in scientific applications, allowing users to control the I/O b…

research product

VarySched: A Framework for Variable Scheduling in Heterogeneous Environments

Despite many efforts to better utilize the potential of GPUs and CPUs, it is far from being fully exploited. Although many tasks can be easily sped up by using accelerators, most of the existing schedulers are not flexible enough to really optimize the resource usage of the complete system. The main reasons are (i) that each processing unit requires a specific program code and that this code is often not provided for every task, and (ii) that schedulers may follow the run-until-completion model and, hence, disallow resource changes during runtime. In this paper, we present VarySched, a configurable task scheduler framework tailored to efficiently utilize all available computing resources in…

research product

Deriving and comparing deduplication techniques using a model-based classification

Data deduplication has been a hot research topic and a large number of systems have been developed. These systems are usually seen as an inherently linked set of characteristics. However, a detailed analysis shows independent concepts that can be used in other systems. In this work, we perform this analysis on the main representatives of deduplication systems. We embed the results in a model, which shows two yet unexplored combinations of characteristics. In addition, the model enables a comprehensive evaluation of the representatives and the two new systems. We perform this evaluation based on real world data sets.

research product

Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems

The ever-growing amount of data requires highly scalable storage solutions. The most flexible approach is to use storage pools that can be expanded and scaled down by adding or removing storage devices. To make this approach usable, it is necessary to provide a solution to locate data items in such a dynamic environment. This article presents and evaluates the Random Slicing strategy, which incorporates lessons learned from table-based, rule-based, and pseudo-randomized hashing strategies and is able to provide a simple and efficient strategy that scales up to handle exascale data. Random Slicing keeps a small table with information about previous storage system insert and remove operations…
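
The lookup side of an interval-based placement strategy like this is simple: hash the item into [0, 1) and return the device owning the slice the hash falls into. The sketch below shows only that lookup over a hypothetical, already-built interval table; how Random Slicing builds and updates the table while minimizing data movement is the actual contribution and is not reproduced here.

```python
import bisect
import hashlib

# Hypothetical interval table: the unit range [0, 1) is partitioned into slices,
# each owned by a storage device. Building and updating this table on inserts
# and removals is the core of Random Slicing and is not shown here.
slice_starts = [0.0, 0.25, 0.5, 0.75]
slice_owner  = ["dev0", "dev1", "dev2", "dev3"]

def locate(key: str) -> str:
    """Map a key uniformly into [0, 1) and return the owning device."""
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    x = (h % 10**12) / 10**12
    return slice_owner[bisect.bisect_right(slice_starts, x) - 1]

print(locate("block-42"))
```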

research product

DelveFS - An Event-Driven Semantic File System for Object Stores

Data-driven applications are becoming increasingly important in numerous industrial and scientific fields, growing the need for scalable data storage, such as object storage. Yet, many data-driven applications cannot use object interfaces directly and often have to rely on third-party file system connectors that support only a basic representation of objects as files in a flat namespace. With sometimes millions of objects per bucket, this simple organization is insufficient for users and applications who are usually only interested in a small subset of objects. These huge buckets are not only lacking basic semantic properties and structure, but they are also challenging to manage from a tec…

research product

Randomized renaming in shared memory systems.

Renaming is a task in distributed computing where n processes are assigned new names from a name space of size m. The problem is called tight if m = n, and loose if m > n. In recent years renaming came to the fore again and new algorithms were developed. For tight renaming in asynchronous shared memory systems, Alistarh et al. describe a construction based on the AKS network that assigns all names within O(log n) steps per process. They also show that, depending on the size of the name space, loose renaming can be done considerably faster. For m = (1 + ϵ)·n and constant ϵ, they achieve a step complexity of O(log log n). In this paper we consider tight as well as loos…

research product

Balls into non-uniform bins

Balls-into-bins games for uniform bins are widely used to model randomized load balancing strategies. Recently, balls-into-bins games have been analysed under the assumption that the selection probabilities for bins are not uniformly distributed. These new models are motivated by properties of many peer-to-peer (P2P) networks, which are not able to perfectly balance the load over the bins. While previous evaluations try to find strategies for uniform bins under non-uniform bin selection probabilities, this paper investigates heterogeneous bins, where the "capacities" of the bins might differ significantly. We show that heterogeneous environments can even help to distribute the load more eve…
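
A quick way to build intuition for such results is to simulate the process: bins with heterogeneous capacities are selected with probability proportional to their capacity, and the quantity of interest is the load relative to capacity. The simulation below is an illustrative toy model, not the paper's exact setting.

```python
import random

def simulate(capacities, n_balls, seed=1):
    """Throw n_balls into bins chosen with probability proportional to capacity
    and report the maximum load relative to capacity (toy model only)."""
    rng = random.Random(seed)
    loads = [0] * len(capacities)
    for _ in range(n_balls):
        i = rng.choices(range(len(capacities)), weights=capacities)[0]
        loads[i] += 1
    return max(load / cap for load, cap in zip(loads, capacities))

# Heterogeneous bins: a few large bins and many small ones.
caps = [8] * 10 + [1] * 80
print(simulate(caps, n_balls=10_000))
```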

research product

Improving MLC flash performance and endurance with extended P/E cycles

The traditional usage pattern for NAND flash memory is the program/erase (P/E) cycle: the flash pages that make a flash block are all programmed in order and then the whole flash block needs to be erased before the pages can be programmed again. The erase operations are slow, wear out the medium, and require costly garbage collection procedures. Reducing their number is therefore beneficial both in terms of performance and endurance. The physical structure of flash cells limits the number of opportunities to overcome the 1 to 1 ratio between programming and erasing pages: a bit storing a logical 0 cannot be reprogrammed to a logical 1 before the end of the P/E cycle. This paper presents a t…

research product

Effects and Benefits of Node Sharing Strategies in HPC Batch Systems

Processor manufacturers today scale performance by increasing the number of cores on each CPU. Unfortunately, not all HPC applications can efficiently saturate all cores of a single node, even if they successfully scale to thousands of nodes. For these applications, sharing nodes with other applications can help to stress different resources on the nodes to more efficiently use them. Previous work has shown that the performance impact of node sharing is very application dependent but very little work has studied its effects within batch systems and for complex parallel application mixes. Administrators therefore typically fear the complexity of running a batch system supporting node sharing…

research product

Hyperion

Indexes are essential in data management systems to increase the speed of data retrievals. Widespread data structures to provide fast and memory-efficient indexes are prefix tries. Implementations like Judy, ART, or HOT optimize their internal alignments for cache and vector unit efficiency. While these measures usually improve the performance substantially, they can have a negative impact on memory efficiency. In this paper we present Hyperion, a trie-based main-memory key-value store achieving extreme space efficiency. In contrast to other data structures, Hyperion does not depend on CPU vector units, but scans the data structure linearly. Combined with a custom memory allocator, Hyperion…
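
For readers unfamiliar with prefix tries as index structures, the sketch below shows a deliberately naive, pointer-based trie key-value store; Hyperion's contribution lies precisely in replacing such a layout with a linearly scanned, allocator-aware representation, which is not reproduced here.

```python
class TrieKV:
    """Naive prefix-trie key-value store: one dict node per key character.
    Deliberately simple; real tries (Judy, ART, HOT, Hyperion) optimize the
    node layout for cache efficiency and memory footprint."""

    def __init__(self):
        self.root = {}

    def put(self, key: str, value):
        node = self.root
        for ch in key:
            node = node.setdefault(ch, {})
        node["__value__"] = value

    def get(self, key: str):
        node = self.root
        for ch in key:
            node = node.get(ch)
            if node is None:
                return None
        return node.get("__value__")

kv = TrieKV()
kv.put("user:42:name", "alice")
print(kv.get("user:42:name"))
```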

research product

POSTER: Optimizing scientific file I/O patterns using advice based knowledge

Before us, other works have used data prefetching to boost application performance [1]–[8]. Our approach differs from these works since we do not rely on precise I/O pattern information to predict and prefetch every chunk of data in advance. Instead we use data prefetching to group many small requests into a few big ones, improving application performance and the utilization of the whole storage system. Moreover, we provide the infrastructure that enables users to access file system specific interfaces for guided I/O without modifying applications and hiding the intrinsic complexity that such interfaces introduce.

research product

Persistent software transactional memory in Haskell

Emerging persistent memory in commodity hardware allows byte-granular accesses to persistent state at memory speeds. However, to prevent inconsistent state in persistent memory due to unexpected system failures, different write-semantics are required compared to volatile memory. Transaction-based library solutions for persistent memory facilitate the atomic modification of persistent data in languages where memory is explicitly managed by the programmer, such as C/C++. For languages that provide extended capabilities like automatic memory management, a more native integration into the language is needed to maintain the high level of memory abstraction. It is shown in this paper how persiste…

research product

Evaluation of a hash-compress-encrypt pipeline for storage system applications

Great efforts are made to store data in a secure, reliable, and authentic way in large storage systems. Specialized, system-specific clients help to achieve these goals. Nevertheless, often standard tools for hashing, compressing, and encrypting data are arranged in transparent pipelines. We analyze the potential of Unix shell pipelines with several high-speed and high-compression algorithms that can be used to achieve data security, reduction, and authenticity. Furthermore, we compare the pipelines of standard tools against an in-house pipeline implemented in C++ and show that there is great potential for performance improvement.
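
The kind of pipeline being benchmarked can be approximated with standard-library primitives; the sketch below chains a hashing and a compression stage in Python (an encryption stage is omitted since it would need a third-party library) and is only meant to make the pipeline structure concrete, not to reproduce the paper's measurement setup.

```python
import hashlib
import zlib

def hash_compress(data: bytes, level: int = 6):
    """Stream data through a hash stage and a compression stage, mimicking
    `... | sha256sum` and `... | gzip` in a shell pipeline. An encryption
    stage would follow the same chunk-by-chunk pattern with a cipher object."""
    digest = hashlib.sha256()
    compressor = zlib.compressobj(level)
    out = bytearray()
    chunk_size = 1 << 20
    for off in range(0, len(data), chunk_size):
        chunk = data[off:off + chunk_size]
        digest.update(chunk)                 # authenticity / integrity stage
        out += compressor.compress(chunk)    # data reduction stage
    out += compressor.flush()
    return digest.hexdigest(), bytes(out)

checksum, compressed = hash_compress(b"example payload " * 1000)
print(checksum[:16], len(compressed))
```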

research product

A gearbox model for processing large volumes of data by using pipeline systems encapsulated into virtual containers

Software pipelines enable organizations to chain applications for adding value to contents (e.g., confidentially, reliability, and integrity) before either sharing them with partners or sending them to the cloud. However, the pipeline components add overhead when processing large volumes of data, which can become critical in real-world scenarios. This paper presents a gearbox model for processing large volumes of data by using pipeline systems encapsulated into virtual containers. In this model, the gears represent applications, whereas gearboxes represent software pipelines. This model was implemented as a collaborative system that automatically performs Gear up (by using parallel patterns…

research product

Direct lookup and hash-based metadata placement for local file systems

New challenges to file systems' metadata performance are imposed by the continuously growing number of files existing in file systems. The total amount of metadata can become too big to be cached, potentially leading to multiple storage device accesses for a single metadata lookup operation. This paper takes a look at the limitations of traditional file system designs and discusses an alternative metadata handling approach, using hash-based concepts already established for metadata and data placement in distributed storage systems. Furthermore, a POSIX compliant prototype implementation based on these concepts is introduced and benchmarked. A variety of file system metadata and data operati…
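
The central idea can be illustrated with a small lookup sketch: the full path is hashed directly to a metadata bucket, so a lookup touches one location regardless of path depth instead of traversing every component. This is a conceptual sketch with invented helper names, not the paper's prototype.

```python
import hashlib

N_BUCKETS = 1024
buckets = [dict() for _ in range(N_BUCKETS)]   # hypothetical metadata buckets

def bucket_of(path: str) -> int:
    """Hash the full path to a bucket, avoiding per-component traversal."""
    return int(hashlib.sha1(path.encode()).hexdigest(), 16) % N_BUCKETS

def create(path: str, inode: dict):
    buckets[bucket_of(path)][path] = inode

def lookup(path: str):
    # One hash plus one bucket probe, independent of the number of components.
    return buckets[bucket_of(path)].get(path)

create("/home/alice/projects/paper/draft.tex", {"size": 4096})
print(lookup("/home/alice/projects/paper/draft.tex"))
```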

research product

Sorted deduplication: How to process thousands of backup streams

The requirements of deduplication systems have changed in the last years. Early deduplication systems had to process dozens to hundreds of backup streams at the same time while today they are able to process hundreds to thousands of them. Traditional approaches rely on stream-locality, which supports parallelism, but which easily leads to many non-contiguous disk accesses, as each stream competes with all other streams for the available resources. This paper presents a new exact deduplication approach designed for processing thousands of backup streams at the same time on the same fingerprint index. The underlying approach destroys the traditionally exploited temporal chunk locality and cre…

research product

Advanced Stochastic Petri Net Modeling with the Mercury Scripting Language

Formal models are widely used in performance and dependability studies of computational systems. Graphical modeling tools allow users to compose such models with ease, but they complicate the creation of models with a dynamic/complex structure, the hierarchical arrangement of different models, and the automatic execution of models with different parameter configurations. To overcome this problem, we created a scripting language for the Mercury tool that supports the combination of different modeling approaches (e.g., Stochastic Petri Nets and Reliability Block Diagrams) in a single project. In this paper, we focus on the extensions developed to improve the capabilities of Generalized Stocha…

research product

Lone Star Stack: Architecture of a Disk-Based Archival System

The need for huge storage systems rises with the ever growing creation of data. With growing capacities and shrinking prices, "write once read sometimes" workloads become more common. New data is constantly added, rarely updated or deleted, and every stored byte might be read at any time - a common pattern for digital archives or big data scenarios. We present the Lone Star Stack, a disk based archival storage system building block that is optimized for high reliability and energy efficiency. It provides a POSIX file system interface that uses flash based storage for write-offloading and metadata and the disk-based Lone Star RAID for user data storage. The RAID attempts to spin down disks a…

research product

NVMM-Oriented Hierarchical Persistent Client Caching for Lustre

In high-performance computing (HPC), data and metadata are stored on special server nodes and client applications access the servers’ data and metadata through a network, which induces network latencies and resource contention. These server nodes are typically equipped with (slow) magnetic disks, while the client nodes store temporary data on fast SSDs or even on non-volatile main memory (NVMM). Therefore, the full potential of parallel file systems can only be reached if fast client side storage devices are included into the overall storage architecture. In this article, we propose an NVMM-based hierarchical persistent client cache for the Lustre file system (NVMM-LPCC for short). NVMM-LPC…

research product

ESB: Ext2 Split Block Device

Solid State Disks (SSDs) are starting to replace rotating media (hard disks, HDDs) in many areas, but are still not cost-efficient enough in terms of capacity to completely replace them. One approach to use their superior performance properties is to use them as a cache for magnetic disks to speed up overall storage operations. In this paper, we present and evaluate a file system level optimization based on ext2. We split metadata and data and store the metadata on an SSD while the data remains on a common HDD. We evaluate our system with filebench under a file server, web server, and web proxy scenario and compare the results with flashcache. We find that many of the scenarios do not contain enough meta…

research product

Algorithmic differentiation for cloud schemes (IFS Cy43r3) using CoDiPack (v1.8.1)

Numerical models in atmospheric sciences not only need to approximate the flow equations on a suitable computational grid, they also need to include subgrid effects of many non-resolved physical processes. Among others, the formation and evolution of cloud particles is an example of such subgrid processes. Moreover, to date there is no universal mathematical description of a cloud, hence many cloud schemes have been proposed and these schemes typically contain several uncertain parameters. In this study, we propose the use of algorithmic differentiation (AD) as a method to identify parameters within the cloud scheme, to which the output of the cloud scheme is most sensitive. We il…

research product

LPCC

Most high-performance computing (HPC) clusters use a global parallel file system to enable high data throughput. The parallel file system is typically centralized and its storage media are physically separated from the compute cluster. Compute nodes as clients of the parallel file system are often additionally equipped with SSDs. The node internal storage media are rarely well-integrated into the I/O and compute workflows. How to make full and flexible use of these storage media is therefore a valuable research question. In this paper, we propose a hierarchical Persistent Client Caching (LPCC) mechanism for the Lustre file system. LPCC provides two modes: RW-PCC builds a read-write cache on…

research product

Challenges and Opportunities of User-Level File Systems for HPC

research product

GekkoFS — A Temporary Burst Buffer File System for HPC Applications

Many scientific fields increasingly use high-performance computing (HPC) to process and analyze massive amounts of experimental data while storage systems in today’s HPC environments have to cope with new access patterns. These patterns include many metadata operations, small I/O requests, or randomized file I/O, while general-purpose parallel file systems have been optimized for sequential shared access to large files. Burst buffer file systems create a separate file system that applications can use to store temporary data. They aggregate node-local storage available within the compute nodes or use dedicated SSD clusters and offer a peak bandwidth higher than that of the backend parallel f…

research product

Scheduling shared continuous resources on many-cores

We consider the problem of scheduling a number of jobs on m identical processors sharing a continuously divisible resource. Each job j comes with a resource requirement r_j ∈ [0, 1]. The job can be processed at full speed if granted its full resource requirement. If receiving only an x-portion of r_j, it is processed at an x-fraction of the full speed. Our goal is to find a resource assignment that minimizes the makespan (i.e., the latest completion time). Variants of such problems, relating the resource assignment of jobs to their processing speeds, have been studied under the term discrete-continuous scheduling. Known results are either very pessimistic or heuristic in nature. In this paper, …
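
The speed model is easy to state concretely: a job with processing time p_j and requirement r_j that is granted only g ≤ r_j units of the resource runs at speed x = g / r_j and therefore needs p_j / x time. A one-function sketch of exactly this rule (not of the paper's scheduling algorithms):

```python
def processing_time(p_j: float, r_j: float, granted: float) -> float:
    """A job with requirement r_j granted only `granted` units of the shared
    resource runs at speed x = granted / r_j (capped at 1) and thus needs
    p_j / x time units."""
    x = min(1.0, granted / r_j) if r_j > 0 else 1.0
    return p_j / x

# Example: a job needing 0.8 of the resource but granted only 0.4 runs at
# half speed, doubling its processing time.
print(processing_time(10.0, 0.8, 0.4))   # 20.0
```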

research product

Design of an exact data deduplication cluster

Data deduplication is an important component of enterprise storage environments. The throughput and capacity limitations of single node solutions have led to the development of clustered deduplication systems. Most implemented clustered inline solutions are trading deduplication ratio versus performance and are willing to miss opportunities to detect redundant data, which a single node system would detect. We present an inline deduplication cluster with a joint distributed chunk index, which is able to detect as much redundancy as a single node solution. The use of locality and load balancing paradigms enables the nodes to minimize information exchange. Therefore, we are able to show that, …

research product

LoneStar RAID

The need for huge storage archives rises with the ever growing creation of data. With today’s big data and data analytics applications, some of these huge archives become active in the sense that all stored data can be accessed at any time. Running and evolving these archives is a constant tradeoff between performance, capacity, and price. We present the LoneStar RAID, a disk-based storage architecture, which focuses on high reliability, low energy consumption, and cheap reads. It is designed for MAID systems with up to hundreds of disk drives per server and is optimized for “write once, read sometimes” workloads. We use dedicated data and parity disks, and export the data disks as individu…

research product

Topic 5: Parallel and Distributed Data Management

Nowadays we are facing an exponential growth of new data that is overwhelming the capabilities of companies, institutions and society in general to manage and use it in a proper way. Ever-increasing investments in Big Data, cutting-edge technologies and the latest advances in both application development and underlying storage systems can help deal with data of such magnitude. Especially parallel and distributed approaches will enable new data management solutions that operate effectively at large scale.

research product

The MoSGrid Science Gateway – A Complete Solution for Molecular Simulations

The MoSGrid portal offers an approach to carry out high-quality molecular simulations on distributed compute infrastructures to scientists with all kinds of background and experience levels. A user-friendly Web interface guarantees the ease-of-use of modern chemical simulation applications well established in the field. The usage of well-defined workflows annotated with metadata largely improves the reproducibility of simulations in the sense of good lab practice. The MoSGrid science gateway supports applications in the domains quantum chemistry (QC), molecular dynamics (MD), and docking. This paper presents the open-source MoSGrid architecture as well as lessons learned from its design.

research product

Smart grid-aware scheduling in data centres

In several countries the expansion and establishment of renewable energies result in widely scattered and often weather-dependent energy production, decoupled from energy demand. Large, fossil-fuelled power plants are gradually replaced by many small power stations that transform wind, solar and water power into electrical power. This leads to changes in the historically evolved power grid that favours top-down energy distribution from a backbone of large power plants to widespread consumers. Now, with the increase of energy production in lower layers of the grid, there is also a bottom-up flow within the grid infrastructure, compromising its stability. In order to locally adapt the energy deman…

research product

Migration Techniques in HPC Environments

Process migration is an important feature in modern computing centers as it allows for a more efficient use and maintenance of hardware. Especially in virtualized infrastructures it is successfully exploited by schemes for load balancing and energy efficiency. One can divide the tools and techniques into three groups: Process-level migration, virtual machine migration, and container-based migration.

research product

ADA-FS—Advanced Data Placement via Ad hoc File Systems at Extreme Scales

Today’s High-Performance Computing (HPC) environments increasingly have to manage relatively new access patterns (e.g., large numbers of metadata operations) which general-purpose parallel file systems (PFS) were not optimized for. Burst-buffer file systems aim to solve that challenge by spanning an ad hoc file system across node-local flash storage at compute nodes to relieve the PFS from such access patterns. However, existing burst-buffer file systems still support many of the traditional file system features, which are often not required in HPC applications, at the cost of file system performance.

research product

An Analysis of Flash Page Reuse With WOM Codes

Flash memory is prevalent in modern servers and devices. Coupled with the scaling down of flash technology, the popularity of flash memory motivates the search for methods to increase flash reliability and lifetime. Erasures are the dominant cause of flash cell wear, but reducing them is challenging because flash is a write-once medium: memory cells must be erased prior to writing. An approach that has recently received considerable attention relies on write-once memory (WOM) codes, designed to accommodate additional writes on write-once media. However, the techniques proposed for reusing flash pages with WOM codes are limited in their scope. Many focus on the coding theory alone, whereas o…
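
The canonical example of such a code is the Rivest–Shamir scheme, which stores two successive 2-bit values in three write-once cells using only 0→1 cell transitions between writes (on flash the polarity is mirrored, since programming clears bits). The sketch below is that textbook code, not the flash-management machinery discussed in the paper.

```python
# Rivest-Shamir write-once-memory (WOM) code: two writes of a 2-bit value
# into three write-once cells, using only 0 -> 1 cell transitions.
FIRST  = {0b00: 0b000, 0b01: 0b100, 0b10: 0b010, 0b11: 0b001}
SECOND = {0b00: 0b111, 0b01: 0b011, 0b10: 0b101, 0b11: 0b110}

def decode(cells: int) -> int:
    """Codewords with at most one set cell belong to the first write,
    all others to the second write."""
    table = FIRST if bin(cells).count("1") <= 1 else SECOND
    return {v: k for k, v in table.items()}[cells]

def write(cells: int, value: int) -> int:
    """Encode `value`; only 0 -> 1 transitions are ever required."""
    if decode(cells) == value:
        return cells                      # already stores the value
    new = FIRST[value] if cells == 0b000 else SECOND[value]
    assert new & cells == cells           # never clears a programmed cell
    return new

cells = write(0b000, 0b10)   # first write: value 10 -> cells 010
cells = write(cells, 0b01)   # second write: value 01 -> cells 011, bits only set
print(bin(cells), decode(cells))
```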

research product

Compiler Driven Automatic Kernel Context Migration for Heterogeneous Computing

Computer systems provide different heterogeneous resources (e.g., GPUs, DSPs and FPGAs) that accelerate applications and can reduce their energy consumption. Usually, these resources have an isolated memory and require target-specific code to be written. There exist tools that can automatically generate target-specific code for program parts, so-called kernels. The data objects required for a target kernel execution need to be moved to the target resource memory. It is the programmers' responsibility to serialize these data objects used in the kernel and to copy them to or from the resource's memory. Typically, the programmer writes his own serializing function or uses e…

research product

Algorithmic Differentiation for Cloud Schemes

Numerical models in atmospheric sciences do not only need to approximate the flow equations on a suitable computational grid, they also need to include subgrid effects of many non-resolved physical processes. Among others, the formation and evolution of cloud particles is an example of such subgrid processes. Moreover, to date there is no universal mathematical description of a cloud, hence many cloud schemes were proposed and these schemes typically contain several uncertain parameters. In this study, we propose the use of algorithmic differentiation (AD) as a method to identify parameters within the cloud scheme, to which the output of the cloud scheme is most sensitive.…

research product