Accelerating data queries on Hadoop framework by using compact data formats
Romans Taranovs, Daiga Plase, Laila Niedrite

subject
Distributed database, Database, Plain text, Computer science, Big data, File format, Column (database), Schema evolution, Data access, Binary data

description
Massive amounts of data are generated from IoT devices, online transactions, click streams, emails, logs, posts, social networking interactions, sensors, mobile phones and their applications. The question is where and how to store these data in order to provide faster data access. Understanding and handling Big Data is a major challenge. Research on Big Data projects that use Hadoop technology, MapReduce-style frameworks, and compact data formats such as RCFile, SequenceFile, ORC, Avro and Parquet shows that only two of these formats (Avro and Parquet) support both schema evolution and compression, allowing them to use less storage space. In this paper, file formats such as Avro and Parquet are compared with plain text formats to evaluate data query performance, and different data query patterns are evaluated. Cloudera's open-source Apache Hadoop distribution CDH 5.4 was chosen for the experiments presented in this article. The results show that the compact data formats (Avro and Parquet) take up less storage space than plain text formats because of their binary representation and compression support. Furthermore, queries against the column-based Parquet format are faster than queries against text formats and Avro.
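As a rough illustration of the kind of comparison described above, the sketch below writes the same data set in plain text (CSV), Avro and Parquet, and then times an identical aggregation query against each copy. It is a minimal PySpark sketch, not the authors' CDH 5.4 setup: the input path, the `event_type` column, and the availability of the spark-avro package are all assumptions.

```python
import time

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("file-format-comparison").getOrCreate()

# Hypothetical source data in plain text (CSV); path and schema are assumptions.
df = spark.read.option("header", "true").csv("hdfs:///data/events.csv")

# Store the same data in the three formats being compared.
df.write.mode("overwrite").option("header", "true").csv("hdfs:///data/events_text")
df.write.mode("overwrite").format("avro").save("hdfs:///data/events_avro")  # requires spark-avro
df.write.mode("overwrite").parquet("hdfs:///data/events_parquet")

readers = {
    "text":    lambda: spark.read.option("header", "true").csv("hdfs:///data/events_text"),
    "avro":    lambda: spark.read.format("avro").load("hdfs:///data/events_avro"),
    "parquet": lambda: spark.read.parquet("hdfs:///data/events_parquet"),
}

# Run the same aggregation against each copy and report wall-clock time.
for name, read in readers.items():
    start = time.time()
    read().groupBy("event_type").count().collect()  # "event_type" is a hypothetical column
    print(f"{name}: {time.time() - start:.1f} s")
```

On a column-oriented format such as Parquet, an aggregation over a single column only needs to read that column, which is the main reason the paper reports faster queries for Parquet than for row-oriented text or Avro storage.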
year | journal | country | edition | language
---|---|---|---|---
2016-11-01 | 2016 IEEE 4th Workshop on Advances in Information, Electronic and Electrical Engineering (AIEEE) | | |