BigData-Assignment1-CSP 554
1.
2.
3.
In February 2014, Google Flu Trends (GFT) was reported to be predicting more than double the proportion of
doctor visits for influenza-like illness estimated by the Centers for Disease Control and Prevention (CDC).
This gap highlights the limits of GFT, which is often presented as a flagship example of big data. The article
examines three themes: Big Data Hubris; Algorithm Dynamics; and Transparency, Granularity, and All-Data.
Big Data Hubris refers to the mistaken belief that big data can replace traditional methods of data collection
and analysis rather than complement them. GFT exemplified this problem by combining massive search
data with a small number of data points from the CDC, which led to overfitting of the models. This overfitting
produced errors in GFT's predictions; notably, GFT failed to detect the H1N1 flu pandemic in 2009. Even
after an update of the algorithm in 2009, GFT continued to overestimate the prevalence of the flu, calling
into question its usefulness as a stand-alone tool. The lesson is that even though big data offers significant
scientific possibilities, it cannot ignore the fundamental principles of measurement validity and reliability.
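As an aside, here is a minimal synthetic sketch of the overfitting mechanism (assuming Python with numpy and scikit-learn; all numbers and names are invented for illustration and do not come from the article): with far more candidate predictors ("search terms") than observations ("weekly CDC reports"), an unregularized model can fit the training data almost perfectly while generalizing poorly.

# Toy illustration of overfitting, not the actual GFT model.
# Assumes numpy and scikit-learn are installed; the data are pure noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_weeks, n_terms = 30, 200                    # few observations, many predictors
X = rng.normal(size=(n_weeks, n_terms))       # fake search-term frequencies
y = rng.normal(size=n_weeks)                  # fake illness rates (noise only)

train, test = slice(0, 20), slice(20, 30)
model = LinearRegression().fit(X[train], y[train])
print("train R^2:", model.score(X[train], y[train]))   # ~1.0: the model memorizes noise
print("test R^2:", model.score(X[test], y[test]))      # typically near or below 0

The near-perfect training fit alongside a poor test score mirrors the article's point that abundant search data cannot compensate for having only a small number of reliable CDC data points.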
Algorithm Dynamics highlights the inherent instability of GFT's flu prediction models, which stems from
changes in the Google search engine and in user behavior. These dynamics make the model's output
unpredictable, because it depends on non-static factors such as media-driven panics and the constant
evolution of the search engine itself. Changes in search patterns are driven by Google's adjustments to its
algorithms, which are meant to improve the user experience but affect the reliability of GFT's predictions.
GFT systematically overestimated the flu's prevalence across several seasons, particularly from 2011
onward. These errors were not random: they showed patterns of temporal autocorrelation and seasonality.
Even after the 2009 update, GFT could not track flu activity accurately, often predicting higher prevalence
rates than those observed by the CDC. This systematic overestimation undermines the utility of GFT as a
stand-alone tool for tracking flu trends.
The last part of the article concerns Transparency, Granularity, and All-Data. The article stresses the
importance of transparency in big data analysis and criticizes the lack of documentation for GFT, which
makes its results hard to replicate. Granularity, meaning the ability to provide precise measurements at a
local level, could greatly improve flu prediction models. Finally, an "All-Data" approach that combines
traditional and new data sources would give a better understanding of the world, while requiring
researchers to monitor the evolution of socio-technical systems such as those of Google, Twitter, or
Facebook.
To conclude, the article underlines that even though big data offers huge potential, it should not be used
in isolation. An approach combining new and traditional data sources is essential to make analyses more
reliable.
4.
Big Data has become a central research topic thanks to its wide range of applications. According to Gartner,
Big Data is characterized by massive volumes of high-velocity and diverse data, requiring new processing
methods to optimize decision-making and knowledge discovery. In 2017, Big Data generated $32.4 billion,
thanks to technological progress in AI and open-source tools. Big Data enables predictions and information-
based analysis in several fields, such as organizational management, medical services, and environmental
conditions, among others. However, Big Data faces a difficult challenge: ensuring the accuracy of predictive
systems. Research continues to explore solutions to improve data management and protection while
maximizing its value in everyday applications.
Big Data differs from usual data by its huge size and by the fact that it mixes structured, semi-structured,
and unstructured information. This particularity requires technological advances: traditional BI (Business
Intelligence) technology cannot handle it. Big Data can be characterized by 5 Vs: Volume (a large amount of
data), Variety (diversity of formats), Velocity (the speed at which data is generated and processed),
Veracity (data reliability), and Value (the added value obtained by analyzing these data). Visual Analytics
(VA) also plays an important role in Big Data: it permits a better exploitation of complex data, taking into
account their volume, variety, velocity, and veracity. VA is defined by three layers: the visualization layer
(visual representation of data), the analytics layer (drawing conclusions from the data), and the data
management layer (ensuring the quality, retrieval, and long-term preservation of data).
This article presents the tools and technologies used for processing Big Data, focusing on Hadoop and
other Apache tools. The objective is to compare these products to help researchers choose the best
solutions for managing large datasets.
It analyzes current solutions such as Hadoop, Spark, Cassandra, and other Apache tools (Flume, Sqoop,
Storm, etc.) to determine which ones are best suited for processing large datasets while preserving
velocity, data integrity, and availability.
Each tool is described with its pros and cons, and their performance is compared in terms of data
management, scalability, and reliability. The tools are evaluated on their ability to handle massive volumes,
heterogeneous data, and real-time data streams.
• NoSQL: Used to handle unstructured data with flexibility, but limited by interface challenges.
• Cassandra: Used for its high availability and data replication capabilities.
• Hadoop: A major framework for batch data processing, with HDFS for storage and MapReduce for
computation (it processes data by spreading the work across many machines).
• Spark: Used for both streaming and batch data processing (faster than Hadoop because it keeps data
in memory; see the word-count sketch after this list).
• Flume and Sqoop: Tools that help collect and transfer data from different sources into Hadoop.
• Hive and Pig: Languages that make it easier to query and manipulate data in Hadoop; they simplify
the use of MapReduce.
• ZooKeeper: A tool for coordinating distributed services.
• Storm and Splunk: Systems for real-time stream processing and for analyzing large datasets.
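To make the batch model concrete, below is a minimal PySpark word-count sketch (a local PySpark installation is assumed, and the input file name is hypothetical). The same job expressed as raw Hadoop MapReduce would need separate mapper and reducer code, while Spark chains the steps and keeps intermediate data in memory.

# Minimal PySpark word count (illustrative sketch; "input.txt" is a hypothetical file).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")                    # read lines (e.g., from HDFS)
      .flatMap(lambda line: line.split())       # map: split lines into words
      .map(lambda word: (word, 1))              # map: emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)          # reduce: sum the counts per word
)

print(counts.take(10))                          # trigger the job and show a sample
spark.stop()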
• Technologies effective for batch processing and real-time data streams: Hadoop and Spark.
• Technologies effective for high availability: Cassandra.
• Technologies that simplify data ingestion: Flume and Sqoop (they require
improvements to ensure proper sequencing of data events).
• Technologies that provide an easier querying interface: Hive and Pig (there is latency on
complex tasks; see the query sketch after this list).
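Hive itself is queried in HiveQL (and Pig in Pig Latin) rather than Python; as a rough analogue of the declarative style these tools put on top of MapReduce, here is a PySpark SQL sketch (the file, table, and column names are all hypothetical).

# Declarative-query sketch in PySpark SQL, shown as an analogue of the
# Hive/Pig style; it is not HiveQL. File and column names are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Suppose "events.csv" has columns: region, category, value.
events = spark.read.csv("events.csv", header=True, inferSchema=True)
events.createOrReplaceTempView("events")

# One declarative statement replaces hand-written map and reduce phases.
top_regions = spark.sql("""
    SELECT region, AVG(value) AS avg_value
    FROM events
    GROUP BY region
    ORDER BY avg_value DESC
    LIMIT 5
""")
top_regions.show()
spark.stop()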
Big Data is applied in many areas, such as smart cities, healthcare, agriculture, business management,
transport, and more. It also relies on various techniques, such as ML (Machine Learning), Deep Learning,
cloud computing, and IoT (Internet of Things).
Some studies show where these techniques are used: ML for analyzing disaster data and predicting
people's needs during emergencies, Deep Learning for analyzing business data to improve sustainability,
and cloud computing for powering smart homes or for helping students access more powerful computing
systems, among others.
BDA (Big Data Analytics) is used more often than ML and Deep Learning in big data applications.
Accuracy, standard deviation (SD), and the false positive rate (FPR) are widely used metrics, but processing
time is the most cited one (especially in recent papers), even though accuracy seems to be the most widely
accepted parameter for evaluating big data applications.
Most of the papers reviewed in this study come from Elsevier, Springer, and the International Journal of Physics.
5.
o The problem with the Google flu detection algorithm is that it overestimates the prevalence
of the flu.
o Big data hubris refers to the mistaken belief that large datasets can replace traditional methods of
data collection and analysis rather than complement them.
o To improve the Google flu detection algorithm, they could employ a more integrated approach, for
example combining GFT with the CDC's data and recalibrating the algorithm as search behavior and
seasonal patterns change.
o Algorithm Dynamics refers to the changes and instability in the performance of a predictive model
caused by evolving user behavior and by changes to the underlying algorithms.
o The aspects of algorithm dynamics that impacted the Google flu detection algorithm were the
continuous changes to the Google search engine (the underlying algorithm) and the evolving
behavior of its users.
6.
7.