BDA Class1
Lecture 3
Tools for Big Data Analysis
WHAT IS BIG DATA
Big data is the term for a collection of data sets so large and complex
that it becomes difficult to process using on-hand database
management tools or traditional data processing applications.
Big data analytics deals with the tools and techniques used to handle
and analyze such data and to uncover hidden trends, patterns, and
correlations that provide meaningful insight for improved decision
making.
TERMINOLOGIES USED
DATA
Raw facts: things held in mind, numbers, text written in books, bits/bytes
stored in computer memory, statistics.
We get information by analyzing and interpreting data.
Types: Structured (tables, spreadsheets), semi-structured (log files, JSON,
XML), unstructured (audio, video, social media posts)
Types of Data: Data can be categorized into different types based on its
nature. These include:
Numerical Data: Includes quantitative values such as numbers and
measurements.
Categorical Data: Represents categories or labels, often used for grouping or
classification.
Textual Data: Involves written or spoken language, often in the form of
sentences, paragraphs, or documents.
Image Data: Consists of visual information or pictures.
Audio Data: Represents sound and auditory information.
Video Data: Combines visual and auditory information in motion.
Data Sources: Data can come from various sources, including sensors,
devices, surveys, social media, websites, transactions, and more. It can
be generated by humans or collected automatically by machines.
Data Formats: Data can be stored in different formats, such as
databases, spreadsheets, documents, images, audio files, and more.
Each format has its own structure and characteristics.
Data Processing: Data is often processed to extract meaningful insights
and information. This can involve cleaning and organizing the data,
performing calculations, applying statistical analyses, and generating
visualizations.
Big Data: Big data refers to extremely large and complex datasets that
are beyond the capabilities of traditional data processing methods. Big
data technologies enable the storage, processing, and analysis of such
datasets.
Data Analytics: Data analytics involves the examination of data to
uncover patterns, trends, correlations, and other valuable insights. It
includes descriptive analytics (understanding what happened),
predictive analytics (predicting what might happen), and prescriptive
analytics (suggesting actions to take).
Data Privacy and Security: Data privacy refers to the protection of
individuals' personal and sensitive information. Data security involves
safeguarding data from unauthorized access, breaches, and cyber
threats.
Data Mining: Data mining is the process of discovering patterns and
relationships in large datasets. It involves using techniques from
statistics, machine learning, and artificial intelligence to extract
valuable information.
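The Data Processing entry above mentions cleaning the data, performing calculations, and applying statistical analyses. The following is a minimal sketch of those steps using only the Python standard library. The readings and the function names clean and summarize are hypothetical, chosen purely for illustration.

```python
import statistics

# Hypothetical raw readings; None and "n/a" stand in for dirty data.
raw_readings = ["21.5", "22.0", None, "n/a", "23.5", "22.0"]


def clean(values):
    """Keep only the entries that can be parsed as floats."""
    cleaned = []
    for value in values:
        try:
            cleaned.append(float(value))
        except (TypeError, ValueError):
            continue  # skip missing or malformed entries
    return cleaned


def summarize(values):
    """Return simple descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


readings = clean(raw_readings)
summary = summarize(readings)
print(summary)
```

This corresponds to descriptive analytics (understanding what happened); predictive and prescriptive analytics would build models on top of data cleaned in this way.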
A BRIEF HISTORY OF BIG DATA
The concept of big data has evolved over the years, driven by
advancements in technology, data collection methods, and the growing
need to process and make sense of vast amounts of information. Here's
a brief history of the evolution of big data:
Early Days - Pre-2000s:
Data processing and storage were primarily limited by the
capabilities of mainframe computers and early databases.
The term "business intelligence" emerged, referring to the use of
data analysis to aid business decision-making.
Emergence of the Internet - Late 1990s:
The growth of the internet led to a rapid increase in digital data
creation, including web pages, emails, and online transactions.
Search engines like Google introduced methods for indexing and
retrieving information from the web.
The "3V" Model and the Rise of the Term "Big Data" - Early 2000s:
Doug Laney's paper in 2001 introduced the concept of the "3V"
model (Volume, Velocity, Variety) to describe the challenges of
managing large datasets.
The term "big data" started to gain traction as a way to describe
datasets that were too large to be managed using traditional
methods.
Open Source Technologies - Mid-2000s:
Open source projects like Apache Hadoop began to address the
challenges of processing and analyzing massive datasets.
Hadoop introduced a distributed computing framework that could
process data across clusters of computers.
Social Media and Web 2.0 - Late 2000s:
The explosion of social media platforms like Facebook, Twitter, and
YouTube generated enormous amounts of user-generated content,
contributing to the growth of big data.
The concept of "Web 2.0" emphasized user-generated content and
interactivity on the internet.
Mainstream Recognition - Early 2010s:
Big data gained significant attention in the business world, as
organizations realized the potential value of analyzing large datasets
to gain insights and make better decisions.
Companies started investing in data analytics tools and platforms.
Advanced Analytics and Machine Learning - Mid-2010s:
Advances in machine learning and artificial intelligence led to more
sophisticated data analysis techniques.
Organizations began using predictive and prescriptive analytics to
anticipate future trends and make proactive decisions.
IoT and Real-time Data - Late 2010s:
The proliferation of Internet of Things (IoT) devices led to the generation of
real-time data streams from sensors and connected devices.
Organizations focused on processing and analyzing data in real-time to gain
immediate insights.
Throughout its history, big data has transformed from a concept focused on
managing large volumes of data to a complex ecosystem of technologies,
methodologies, and practices that enable organizations to derive insights and drive
innovation from their data.
BUSINESS DRIVERS FOR BIG DATA INNOVATIONS
Businesses are increasingly embracing big data innovations to gain a
competitive edge, make informed decisions, and unlock new
opportunities. Several key business drivers are behind the adoption of
big data innovations:
Velocity: Velocity represents the speed at which data is generated, collected, and
processed. In the era of big data, data is often generated and transmitted rapidly,
sometimes in real time or near-real time. Examples include data streams from
social media, sensors, financial transactions, and more. Organizations need to
have systems that can capture, process, and analyze this data at the required
speed to make timely decisions and gain insights.
Variety: Variety refers to the diverse types and sources of data that exist. Data
comes in various formats such as structured data (like databases and
spreadsheets), unstructured data (like text documents and images), and semi-
structured data (like JSON and XML files). In addition to text, data can include
images, videos, audio recordings, and other forms of multimedia. Managing and
analyzing this diverse data requires flexible and adaptable technologies and
approaches.
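The difference between structured, semi-structured, and unstructured data can be illustrated by how each is read. The following is a rough sketch using only the Python standard library. The sample strings and field names are made up for illustration and do not come from any real dataset.

```python
import csv
import io
import json

# Structured: fixed columns, as in a database table or spreadsheet.
structured_text = "id,amount\n1,19.99\n2,5.00\n"
rows = list(csv.DictReader(io.StringIO(structured_text)))

# Semi-structured: self-describing, fields may be nested or vary by record.
semi_structured_text = (
    '{"user": "alice", "tags": ["sale", "new"], "geo": {"city": "Pune"}}'
)
record = json.loads(semi_structured_text)

# Unstructured: free text with no schema; here we only count the words.
unstructured_text = "Great product, fast delivery. Would buy again!"
word_count = len(unstructured_text.split())

print(rows[0]["amount"], record["geo"]["city"], word_count)
```

Each form needs a different reader (a CSV parser, a JSON parser, plain text handling), which is why variety calls for flexible tooling.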
Veracity: Veracity refers to the quality and trustworthiness of data.
With the abundance of data sources, ensuring data accuracy and
reliability becomes crucial for making informed decisions.
Data Quality and Accuracy: Big data often includes data from various
sources, and ensuring its quality and accuracy can be challenging.
Inaccurate or incomplete data can lead to flawed analyses and decision-
making.
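As an illustration of the data quality concerns above, the sketch below counts incomplete, duplicate, and out-of-range records in a small in-memory dataset. The records, the field names, the quality_report function, and the valid age range of 0 to 120 are all assumptions made for this example, not part of any standard.

```python
# Hypothetical records; field names and the valid age range are assumptions.
records = [
    {"id": 1, "age": 34, "email": "a@example.com"},
    {"id": 2, "age": None, "email": "b@example.com"},  # missing age
    {"id": 3, "age": 212, "email": ""},  # implausible age, empty email
    {"id": 1, "age": 34, "email": "a@example.com"},  # duplicate id
]


def quality_report(rows, required=("id", "age", "email"), age_range=(0, 120)):
    """Count incomplete, duplicate, and out-of-range records."""
    seen_ids = set()
    report = {"total": len(rows), "incomplete": 0, "duplicates": 0, "out_of_range": 0}
    for row in rows:
        # Incomplete: any required field is missing or empty.
        if any(row.get(field) in (None, "") for field in required):
            report["incomplete"] += 1
        # Duplicate: the id has already been seen in an earlier row.
        if row.get("id") in seen_ids:
            report["duplicates"] += 1
        seen_ids.add(row.get("id"))
        # Out of range: the age is present but outside the accepted bounds.
        age = row.get("age")
        if age is not None and not (age_range[0] <= age <= age_range[1]):
            report["out_of_range"] += 1
    return report


report = quality_report(records)
print(report)
```

Running checks like these before analysis helps surface inaccurate or incomplete data early, before it leads to flawed conclusions.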