BigData-Assignment1-CSP 554
1.
2.
3.
In February 2014, Google Flu Trends (GFT) was reported to be predicting more than double the proportion of
doctor visits for influenza-like illness estimated by the Centers for Disease Control and Prevention (CDC).
This gap highlights the limits of GFT, which is often presented as a flagship example of big data. The article
examines three themes: Big Data Hubris; Algorithm Dynamics; and Transparency, Granularity, and All-Data.
Big Data Hubris refers to the mistaken belief that big data can replace traditional methods of data collection
and analysis rather than complement them. GFT exemplified this problem by combining massive search
data with a small number of data points from the CDC, which led to overfitting of the models. This overfitting
produced errors in GFT's predictions; notably, GFT failed to detect the H1N1 flu pandemic in 2009. Even
after an update of the algorithm in 2009, GFT continued to overestimate the prevalence of the flu, calling
into question its usefulness as a stand-alone tool. The lesson is that even though big data offers significant
scientific possibilities, it cannot ignore the fundamental principles of measurement validity and reliability.
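As an aside, here is a minimal synthetic sketch of the overfitting mechanism (assuming Python with numpy and scikit-learn; all numbers and names are invented for illustration and do not come from the article): with far more candidate predictors ("search terms") than observations ("weekly CDC reports"), an unregularized model can fit the training data almost perfectly while generalizing poorly.

# Toy illustration of overfitting, not the actual GFT model.
# Assumes numpy and scikit-learn are installed; the data are pure noise.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_weeks, n_terms = 30, 200                    # few observations, many predictors
X = rng.normal(size=(n_weeks, n_terms))       # fake search-term frequencies
y = rng.normal(size=n_weeks)                  # fake illness rates (noise only)

train, test = slice(0, 20), slice(20, 30)
model = LinearRegression().fit(X[train], y[train])
print("train R^2:", model.score(X[train], y[train]))   # ~1.0: the model memorizes noise
print("test R^2:", model.score(X[test], y[test]))      # typically near or below 0

The near-perfect training fit alongside a poor test score mirrors the article's point that abundant search data cannot compensate for having only a small number of reliable CDC data points.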
Algorithm Dynamics highlights the inherent instability of GFT's flu prediction models, which stems from
changes in the Google search engine and in user behavior. These dynamics make the model's output
unpredictable, because it depends on non-static factors such as media-driven panics and the constant
evolution of the search engine itself. Changes in search patterns are driven by Google's adjustments to its
algorithms, which are meant to improve the user experience but affect the reliability of GFT's predictions.
GFT systematically overestimated the flu's prevalence across several seasons, particularly from 2011
onward. These errors were not random: they showed patterns of temporal autocorrelation and seasonality.
Even after the 2009 update, GFT could not track flu activity accurately, often predicting higher prevalence
rates than those observed by the CDC. This systematic overestimation undermines the utility of GFT as a
stand-alone tool for tracking flu trends.
The last part of the article concerns Transparency, Granularity, and All-Data. The article stresses the
importance of transparency in big data analysis and criticizes the lack of documentation for GFT, which
makes its results hard to replicate. Granularity, meaning the ability to provide precise measurements at a
local level, could greatly improve flu prediction models. Finally, an "All-Data" approach that combines
traditional and new data sources would give a better understanding of the world, while requiring
researchers to monitor the evolution of socio-technical systems such as those of Google, Twitter, or
Facebook.
To conclude, the article underlines that even though big data offers huge potential, it should not be used
in isolation. An approach combining new and traditional data sources is essential to make analyses more
reliable.
4.
Big Data has become a central research topic thanks to its wide range of applications. According to Gartner,
Big Data is characterized by massive volumes of high-velocity and diverse data, requiring new processing
methods to optimize decision-making and knowledge discovery. In 2017, Big Data generated $32.4 billion,
thanks to technological progress in AI and open-source tools. Big Data enables predictions and information-
based analysis in several fields, such as organizational management, medical services, and environmental
conditions, among others. However, Big Data faces a difficult challenge: ensuring the accuracy of predictive
systems. Research continues to explore solutions to improve data management and protection while
maximizing its value in everyday applications.
Big Data differs from usual data by its huge size and by the fact that it mixes structured, semi-structured,
and unstructured information. This particularity requires technological advances: traditional BI (Business
Intelligence) technology cannot handle it. Big Data can be characterized by 5 Vs: Volume (a large amount of
data), Variety (diversity of formats), Velocity (the speed at which data is generated and processed),
Veracity (data reliability), and Value (the added value obtained by analyzing these data). Visual Analytics
(VA) also plays an important role in Big Data: it permits a better exploitation of complex data, taking into
account their volume, variety, velocity, and veracity. VA is defined by three layers: the visualization layer
(visual representation of data), the analytics layer (drawing conclusions from the data), and the data
management layer (ensuring the quality, retrieval, and long-term preservation of data).
This article presents the tools and technologies used for processing Big Data, focusing on Hadoop and
other Apache tools. The objective is to compare these products to help researchers choose the best
solutions for managing large datasets.
It analyzes current solutions such as Hadoop, Spark, Cassandra, and other Apache tools (Flume, Sqoop,
Storm, etc.) to determine which ones are best suited for processing large datasets while preserving
velocity, data integrity, and availability.
Each tool is described with its pros and cons, and their performance is compared in terms of data
management, scalability, and reliability. The tools are evaluated on their ability to handle massive volumes,
heterogeneous data, and real-time data streams.
• NoSQL: Used to handle unstructured data with flexibility, but limited by interface challenges.
• Cassandra: Used for its high availability and data replication capabilities.
• Hadoop: A major framework for batch data processing, with HDFS for storage and MapReduce for
computation (it processes data by spreading the work across many machines).
• Spark: Used for both streaming and batch data processing (faster than Hadoop because it keeps data
in memory; see the word-count sketch after this list).
• Flume and Sqoop: Tools that help collect and transfer data from different sources into Hadoop.
• Hive and Pig: Languages that make it easier to query and manipulate data in Hadoop; they simplify
the use of MapReduce.
• ZooKeeper: A tool for coordinating distributed services.
• Storm and Splunk: Systems for real-time stream processing and for analyzing large datasets.
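To make the batch model concrete, below is a minimal PySpark word-count sketch (a local PySpark installation is assumed, and the input file name is hypothetical). The same job expressed as raw Hadoop MapReduce would need separate mapper and reducer code, while Spark chains the steps and keeps intermediate data in memory.

# Minimal PySpark word count (illustrative sketch; "input.txt" is a hypothetical file).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

counts = (
    sc.textFile("input.txt")                    # read lines (e.g., from HDFS)
      .flatMap(lambda line: line.split())       # map: split lines into words
      .map(lambda word: (word, 1))              # map: emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)          # reduce: sum the counts per word
)

print(counts.take(10))                          # trigger the job and show a sample
spark.stop()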
• Technologies effective for batch processing and real-time data streams: Hadoop and Spark.
• Technologies effective for high availability: Cassandra.
• Technologies that simplify data ingestion: Flume and Sqoop (they require
improvements to ensure proper sequencing of data events).
• Technologies that provide an easier querying interface: Hive and Pig (there is latency on
complex tasks; see the query sketch after this list).
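Hive itself is queried in HiveQL (and Pig in Pig Latin) rather than Python; as a rough analogue of the declarative style these tools put on top of MapReduce, here is a PySpark SQL sketch (the file, table, and column names are all hypothetical).

# Declarative-query sketch in PySpark SQL, shown as an analogue of the
# Hive/Pig style; it is not HiveQL. File and column names are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Suppose "events.csv" has columns: region, category, value.
events = spark.read.csv("events.csv", header=True, inferSchema=True)
events.createOrReplaceTempView("events")

# One declarative statement replaces hand-written map and reduce phases.
top_regions = spark.sql("""
    SELECT region, AVG(value) AS avg_value
    FROM events
    GROUP BY region
    ORDER BY avg_value DESC
    LIMIT 5
""")
top_regions.show()
spark.stop()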
Big Data is applied in many areas, such as smart cities, healthcare, agriculture, business management,
transport, and more. It also relies on various techniques, such as ML (Machine Learning), Deep Learning,
cloud computing, and IoT (Internet of Things).
Some studies show where these techniques are used: ML for analyzing disaster data and predicting
people's needs during emergencies, Deep Learning for analyzing business data to improve sustainability,
and cloud computing for powering smart homes or for helping students access more powerful computing
systems, among others.
BDA (Big Data Analytics) is used more often than ML and Deep Learning in big data applications.
Accuracy, standard deviation (SD), and the false positive rate (FPR) are widely used metrics, but processing
time is the most cited one (especially in recent papers), even though accuracy seems to be the most widely
accepted parameter for evaluating big data applications.
Most of the papers reviewed in this study come from Elsevier, Springer, and the International Journal of Physics.
5.
o The problem with the Google flu detection algorithm is that it overestimates the prevalence
of the flu.
o Big data hubris refers to the mistaken belief that large datasets can replace traditional methods of
data collection and analysis rather than complement them.
o To improve the Google flu detection algorithm, they could employ a more integrated approach, for
example combining GFT with the CDC's data and recalibrating the algorithm as search behavior and
seasonal patterns change.
o Algorithm Dynamics refers to the changes and instability in the performance of a predictive model
caused by evolving user behavior and by changes to the underlying algorithms.
o The aspects of algorithm dynamics that impacted the Google flu detection algorithm were the
continuous changes to the Google search engine (the underlying algorithm) and the evolving
behavior of its users.
6.
7.