Ahmed Elgalb
Independent Researcher, Iowa, United States.
George Samaan
Independent Researcher, Iowa, United States.
DOI: http://doi.org/10.37648/ijrst.v12i04.008
Apache Spark and Hadoop MapReduce are two of the most widely used paradigms for large-scale data processing, and each embodies its own execution model and design philosophy. Spark’s in-memory model delivers faster execution for iterative, interactive, and streaming workloads, while Hadoop MapReduce’s disk-based approach remains a staple for massive one-pass batch jobs. This paper presents an in-depth comparison of the two frameworks based on studies and benchmarks published prior to 2022. By examining their architectures, performance, fault tolerance, and integration with the broader analytics stack, it shows where each framework excels and where the two can work together. It explains the cost of large-scale in-memory caching, why iterative machine learning algorithms benefit most from Spark’s DAG execution model, and why some workloads are still better served by Hadoop’s stable batch structure. Through numerous tables and examples, the paper makes a subtle point: Spark and Hadoop are not competing monoliths but complementary tools suited to distinct workload profiles, each relevant to a range of real-world data engineering and analytics situations.
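To make the abstract’s central claim concrete, the following minimal Scala sketch (the input path, iteration count, and learning rate are illustrative assumptions, not taken from the paper) shows the access pattern that favors Spark’s in-memory model: a dataset is cached once and then scanned repeatedly by an iterative gradient update, whereas an equivalent MapReduce workflow would launch a separate job per iteration, rereading the input from HDFS and writing intermediate results back to disk each time.

import org.apache.spark.sql.SparkSession

object IterativeCachingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("iterative-caching-sketch")
      .getOrCreate()

    // Cache the parsed dataset after the first read; subsequent
    // iterations scan it from executor memory instead of HDFS.
    val points = spark.sparkContext
      .textFile("hdfs:///data/points.txt") // hypothetical input path
      .map(_.split(",").map(_.toDouble))   // each line: "feature,label"
      .cache()

    var weight = 0.0
    for (_ <- 1 to 10) {
      // Each pass is an in-memory scan. With MapReduce, every pass
      // would be a standalone job with its own disk read and write.
      val gradient = points.map(p => p(0) * (p(1) - weight)).sum() / points.count()
      weight += 0.1 * gradient // assumed learning rate of 0.1
    }

    println(s"final weight: $weight")
    spark.stop()
  }
}

The sketch mirrors the structure of iterative machine learning workloads discussed in the paper: the speedup comes entirely from the single cache() call, which removes the per-iteration disk round trip that dominates the cost of a chained MapReduce implementation.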
Keywords: Apache Spark; Hadoop MapReduce; data processing paradigm