A20528094_Assign2

This document summarizes two research papers on big data technologies: one surveys the landscape, technologies, applications, and challenges of big data, while the other analyzes the performance of Apache Hadoop and Apache Spark. The survey highlights key characteristics of big data and its applications across various sectors, while the performance analysis reveals that Spark outperforms Hadoop in specific workloads. Together, these papers provide valuable insights into the current state and future directions of big data research.

Uploaded by

gaurav

1.

Summary
This document summarizes two research papers focused on different facets of big data technologies: a
survey of big data technologies, terminologies, and applications, and a performance analysis of Apache
Hadoop and Apache Spark.

Big Data Survey


The first paper provides a broad overview of the rapidly evolving landscape of big data. It addresses the
increasing importance of managing and processing vast amounts of data for decision-making, scientific
research, and business intelligence. The survey defines big data by its key characteristics: volume,
velocity, variety, veracity, and value (the 5 Vs). It examines various big data technologies related to
storage, processing, and security. These include tools such as NoSQL databases (e.g., Cassandra), Hadoop,
Storm, Spark, Hive, and OpenRefine, as well as components of Hadoop's ecosystem such as Apache Flume,
Sqoop, Pig, and ZooKeeper.
The survey highlights the importance of visual analytics (VA), which depends on three main layers:
visualization, analytics, and data management. Besides the technologies, the survey also explores the
application of big data across sectors such as smart cities, network communication, business
management, IoT, cloud computing, fog computing, edge computing, health care, and agriculture, along
with the challenges associated with each. Most of the papers reviewed used methods such as machine
learning, deep learning, cloud computing, and big data analytics, with big data analytics being the
most commonly used method across the different applications.
Finally, the survey identifies big data management issues, such as the separation of the data storage
layer from the management layer, discusses open challenges, and proposes future research directions
such as dynamic edge computing and ensemble algorithms for classifying data and management layers.

Performance Analysis of Hadoop and Spark


The second paper presents a comprehensive performance comparison between Apache Hadoop and
Apache Spark. These distributed computing frameworks are essential for analyzing large-scale datasets,
and the study aims to identify the parameters (resource utilization, input splits, and shuffle
behavior) that most strongly influence their performance. The analysis is carried out on a real cluster
with a large-scale dataset, using the WordCount and TeraSort workloads.
Performance metrics for this implementation included execution time, throughput, and speedup.
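
To make the terms "input splits" and "shuffle" concrete, the following is a minimal single-process sketch
of the WordCount workload in Python. It is not the Hadoop or Spark implementation benchmarked in the
paper; the function names and the sample input are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def split_input(lines, num_splits):
    """Partition the input lines into num_splits input splits (round-robin)."""
    return [lines[i::num_splits] for i in range(num_splits)]


def map_phase(split):
    """Emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for line in split for word in line.split()]


def shuffle_phase(mapped_splits):
    """Group the emitted counts by word across all splits."""
    grouped = defaultdict(list)
    for pairs in mapped_splits:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped counts to get one total per word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines, num_splits=2):
    """Run split -> map -> shuffle -> reduce on an in-memory list of lines."""
    splits = split_input(lines, num_splits)
    mapped = [map_phase(split) for split in splits]
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical sample input, not taken from the paper's dataset.
sample = ["big data big cluster", "spark and hadoop", "big spark"]
counts = word_count(sample, num_splits=2)
```

In a real cluster each split is processed on a different node and the shuffle moves data over the
network, which is why the number of splits and the shuffle behavior affect execution time.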
The experimental results show that performance is highly dependent on input data size and correct
parameter selection. Spark performs better than Hadoop, and using the default (factory) settings is not
always the optimal approach: with the right reconfiguration, performance can be improved further.
Spark also excels in workloads with small data sets, achieving up to two times the speedup in
WordCount and up to 14 times the speedup in TeraSort workloads.
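
The metrics named above can be sketched as follows, assuming the usual definitions: throughput as input
size divided by execution time, and speedup as the baseline execution time divided by the compared
execution time. The timings and input size below are hypothetical placeholders, not measurements from
the paper.

```python
def throughput(input_bytes, seconds):
    """Bytes processed per second for one run."""
    if seconds <= 0:
        raise ValueError("execution time must be positive")
    return input_bytes / seconds


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate run is than the baseline run."""
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("execution times must be positive")
    return baseline_seconds / candidate_seconds


# Hypothetical values for illustration only (not results from the paper).
input_bytes = 10 * 1024**3   # assumed 10 GiB input
hadoop_seconds = 120.0       # assumed Hadoop execution time
spark_seconds = 40.0         # assumed Spark execution time

spark_speedup = speedup(hadoop_seconds, spark_seconds)
spark_throughput = throughput(input_bytes, spark_seconds)
```

A speedup above 1 means the compared framework finished faster than the baseline on the same workload
and input.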

Combined Insights & Conclusion


Together, these papers provide a holistic view of big data. The survey gives insight into terminologies,
technologies, applications, and challenges, while the performance analysis shows the practical
considerations of big data implementation. Both papers demonstrate the impact of big data and point out
challenges and constraints that should be considered in future research.
2. hadoop fs -ls /
3. /user
4. /user/csp554

5. /user/csp554-5
6. Copy

7. Copy

8. GCS to hadoop master node
