Publication:
Development of an efficient and merging of numerous small files algorithm for the Hadoop distributed file system

Date
2023-10-01
Authors
Adnan Ali
Abstract
In the era of Big Data, many sources and environments generate large amounts of data, which requires sophisticated tools and specialized procedures to evaluate the information and predict future changes. Hadoop is widely used to process such data, but it handles large files far more efficiently than many small ones, so a workload of numerous small files degrades the framework's performance. This study addresses the problem with an Enhanced Best Fit Merging algorithm (EBFM) that merges files according to predefined parameters (type and size). The algorithm ensures that the number of blocks is minimized and that the size of each generated file falls within the same target range. Its main goal is to dynamically combine files, using criteria specified per file type, to guarantee the effectiveness and efficiency of the system. Merging takes place before the files are processed by the Hadoop framework, and the generated files are named with specific keywords to prevent data loss through file overwriting. The proposed EBFM generates the smallest possible number of files, reduces the input/output memory load, and aligns with the design of the Hadoop framework. The results of the study show that the proposed EBFM improves framework performance by about 64% across all measured performance variables. The proposed EBFM can be deployed in any environment that uses the Hadoop framework, including smart cities and real-time data analysis.
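The merging strategy described in the abstract can be illustrated with a small sketch. This is not the authors' implementation; it is a generic best-fit bin-packing pass that groups files by type and packs each group into merge bins capped at an assumed HDFS block size (the 128 MB constant, the tuple layout, and all names below are illustrative assumptions):

```python
from collections import defaultdict

# Assumed HDFS block size (128 MB); the paper's actual threshold may differ.
BLOCK_SIZE = 128 * 1024 * 1024

def best_fit_merge(files, block_size=BLOCK_SIZE):
    """Group files by type, then best-fit pack each group into merge bins.

    `files` is a list of (name, ftype, size) tuples. Returns a list of bins,
    each a dict with the packed file names and the remaining free capacity,
    so that every bin's total size stays within block_size.
    """
    by_type = defaultdict(list)
    for name, ftype, size in files:
        by_type[ftype].append((name, size))

    bins = []
    for ftype, group in by_type.items():
        # Packing largest files first tends to improve best-fit quality.
        group.sort(key=lambda f: f[1], reverse=True)
        type_bins = []
        for name, size in group:
            # Best fit: among open bins that can still hold this file,
            # pick the one with the least remaining free space.
            candidates = [b for b in type_bins if b["free"] >= size]
            if candidates:
                target = min(candidates, key=lambda b: b["free"])
            else:
                target = {"type": ftype, "files": [], "free": block_size}
                type_bins.append(target)
            target["files"].append(name)
            target["free"] -= size
        bins.extend(type_bins)
    return bins
```

Packing per type, as the abstract specifies, keeps each merged file homogeneous; the best-fit rule minimizes wasted block capacity and hence the number of generated files.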