BDA Unit 1
1. Introduction
Big data refers to larger, more complex data sets, especially from new data sources. These data sets are so voluminous that traditional data processing software cannot manage them, but these massive volumes of data can be used to address business problems that could not be tackled before.
Big Data is a collection of data that is huge in volume and growing exponentially with time. Its size and complexity are so great that no traditional data management tool can store or process it efficiently.
Following are some examples of Big Data:
The New York Stock Exchange generates about one terabyte of new trade data per day.
A single jet engine can generate more than 10 terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types of Big Data
Big data can be classified into three types:
1. Structured
2. Unstructured
3. Semi-structured
Structured
Any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data. Over time, computer science has developed increasingly successful techniques for working with this kind of data (where the format is well known in advance) and for deriving value from it. A table in a relational database is a typical example.
Unstructured
Any data whose form or structure is unknown is classified as unstructured data. Besides being huge in size, unstructured data poses challenges in processing and in deriving value from it. A typical example is a heterogeneous source containing a mix of plain text, images, and videos.
Semi-structured
Semi-structured data can contain both forms of data. It appears structured in form, but it is not defined with, for example, a table definition as in a relational DBMS. A typical example of semi-structured data is data represented in an XML file, as in the sketch below.
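The following minimal Python sketch (illustrative only; the employee records and field names are invented for this example) parses a small XML fragment with the standard-library xml.etree.ElementTree module. It shows the semi-structured character of XML: each record describes itself with tags, but no fixed schema forces every record to have the same fields.

# Illustrative sketch: XML as semi-structured data (invented example records).
import xml.etree.ElementTree as ET

xml_text = """
<employees>
    <employee>
        <name>Asha</name>
        <department>Analytics</department>
    </employee>
    <employee>
        <name>Ravi</name>
        <department>Engineering</department>
        <location>Pune</location>   <!-- extra field the first record lacks -->
    </employee>
</employees>
"""

root = ET.fromstring(xml_text)
for emp in root.findall("employee"):
    # Fields are looked up by tag name; a missing tag simply returns None.
    print(emp.findtext("name"), emp.findtext("location"))
# Prints "Asha None" and then "Ravi Pune".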
Characteristics of Big Data
Big data can be described by the following characteristics:
Volume
Variety
Velocity
Variability
(i) Volume – The name Big Data itself is related to size, which is enormous. The size of data plays a very crucial role in determining its value. Whether particular data can actually be considered Big Data or not also depends on its volume. Hence, ‘Volume’ is one characteristic that needs to be considered while dealing with Big Data solutions.
(ii) Variety – Variety refers to heterogeneous sources and the nature of data, both structured and unstructured. In earlier days, spreadsheets and databases were the only data sources considered by most applications. Nowadays, data in the form of emails, photos, videos, monitoring devices, PDFs, audio, etc. is also considered in analysis applications. This variety of unstructured data poses certain issues for storing, mining, and analyzing the data.
(iii) Velocity – The term ‘velocity’ refers to the speed at which data is generated. How fast the data is generated and processed to meet demands determines the real potential of the data.
Big Data velocity deals with the speed at which data flows in from sources such as business processes, application logs, networks, social media sites, sensors, and mobile devices. The flow of data is massive and continuous.
(iv) Variability – This refers to the inconsistency that data can show at times, which hampers the process of handling and managing the data effectively.
Importance of Big Data
Big data refers to the massive volume of structured and unstructured data that inundates businesses on a day-to-day basis. Its importance lies not in the amount of data itself but in what organisations do with it, as the following sections on analytics describe.
Big Data Analytics
Big data analytics involves using advanced analytics techniques such as predictive
analytics, data mining, machine learning, and statistical analysis to extract insights
from massive datasets. These datasets are often too large or complex for traditional
data processing applications to handle.
Big data analytics helps organisations harness their data and use it to identify new opportunities. That, in turn, leads to smarter business moves, more efficient operations, higher profits, and happier customers. Businesses that use big data with advanced analytics gain value in many ways, such as reducing costs, making faster and better decisions, and developing new products and services that meet customer needs.
Data Science
Data science is a multidisciplinary field that uses scientific methods, algorithms, processes,
and systems to extract knowledge and insights from structured and unstructured data. It
combines aspects of statistics, computer science, and domain expertise to analyse complex
datasets and solve real-world problems. Data scientists employ various techniques such as
data mining, machine learning, predictive analytics, and data visualization to uncover
patterns, trends, and correlations in data that can inform decision-making and drive
innovation in various industries.
The responsibilities of a data scientist can vary depending on the organization and
the specific role, but here are some common tasks and responsibilities:
1. Data Collection and Cleaning: Gathering data from various sources such as
databases, APIs, or web scraping. Cleaning and preprocessing the data to
remove errors, missing values, and inconsistencies.
2. Data Analysis and Exploration: Exploring the data to understand patterns,
trends, and relationships. Using statistical methods and visualization
techniques to gain insights and identify potential areas for further
investigation.
3. Model Development: Developing machine learning models and algorithms to
solve specific business problems or make predictions. This involves selecting
appropriate models, feature engineering, hyperparameter tuning, and
evaluating model performance.
4. Model Deployment: Deploying models into production environments, which
may involve working with software engineers to integrate models into existing
systems or develop new applications.
5. Testing and Validation: Testing the performance of models using validation
techniques such as cross-validation or holdout validation. Ensuring that
models are robust and generalize well to new data.
6. Communication and Visualization: Communicating findings and insights to
stakeholders through reports, presentations, or interactive dashboards.
Visualizing data and model results in a clear and understandable way.
Overall, data scientists play a crucial role in extracting meaningful insights from data to
inform decision-making and drive business value.
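As an illustration only (not taken from the source), the short Python sketch below walks through a toy version of responsibilities 1, 3, and 5 above: cleaning a tiny invented dataset with pandas, fitting a simple scikit-learn classifier, and validating it with cross-validation. The column names and values are made up for the example.

# Hypothetical mini-workflow: cleaning, modelling, and validating a toy dataset.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# 1. Data collection and cleaning: an invented dataset with one missing value.
df = pd.DataFrame({
    "age":    [25, 32, 47, None, 52, 29, 41, 38],
    "income": [30, 45, 80, 52, 95, 38, 70, 60],   # in thousands (made up)
    "bought": [0, 0, 1, 1, 1, 0, 1, 0],           # target label (made up)
})
df["age"] = df["age"].fillna(df["age"].median())  # simple imputation

# 3. Model development: fit a basic classifier on the cleaned features.
X, y = df[["age", "income"]], df["bought"]
model = LogisticRegression()

# 5. Testing and validation: k-fold cross-validation estimates generalization.
scores = cross_val_score(model, X, y, cv=4)
print("Cross-validated accuracy:", round(scores.mean(), 2))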
Key Terminology in Big Data
1. Big Data: Refers to large volumes of data, both structured and unstructured,
that inundates a business on a day-to-day basis.
2. Hadoop: An open-source framework used for distributed storage and
processing of large datasets across clusters of computers.
3. MapReduce: A programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster (a minimal word-count sketch appears after this list).
4. Data Warehouse: A central repository of integrated data from one or more
disparate sources, used for reporting and data analysis.
5. Data Lake: A storage repository that holds a vast amount of raw data in its
native format until it is needed.
6. ETL (Extract, Transform, Load): The process of extracting data from various
sources, transforming it to fit operational needs, and loading it into a data
warehouse or data lake.
7. NoSQL: A type of database that provides a mechanism for storage and
retrieval of data that is modeled in means other than the tabular relations
used in relational databases.
8. SQL (Structured Query Language): A domain-specific language used in programming and designed for managing data held in a relational database management system or for stream processing in a relational data stream management system (see the query example after this list).
9. Data Mining: The process of discovering patterns in large data sets involving
methods at the intersection of machine learning, statistics, and database
systems.
10. Machine Learning: A subset of artificial intelligence that uses statistical
techniques to enable computer systems to learn from and make predictions or
decisions based on data.
11. Data Visualization: The graphical representation of information and data to
communicate complex information clearly and efficiently.
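To make the MapReduce entry above more concrete, here is a minimal, single-process Python sketch of the word-count example usually used to explain the model. It is illustrative only: a real framework such as Hadoop distributes the map and reduce functions across a cluster, and the function names and sample documents here are invented.

# Illustrative word count in the MapReduce style (single-process simulation).
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split.
    for word in document.lower().split():
        yield word, 1

def reduce_phase(pairs):
    # Reduce: after grouping by key, sum the counts for each word.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

documents = ["big data needs big tools", "data tools process data"]
all_pairs = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(all_pairs))   # e.g. {'big': 2, 'data': 3, 'tools': 2, ...}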
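Similarly, the SQL entry can be illustrated with a small, self-contained Python example using the standard-library sqlite3 module. The table name, columns, and rows below are invented for the sketch; it simply shows tabular (relational) data being defined and queried declaratively with SQL.

# Illustrative relational query with SQL, via Python's built-in sqlite3 module.
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database
cur = conn.cursor()

# A fixed schema: every row has the same predeclared columns (structured data).
cur.execute("CREATE TABLE trades (symbol TEXT, price REAL, volume INTEGER)")
cur.executemany(
    "INSERT INTO trades VALUES (?, ?, ?)",
    [("ABC", 10.5, 300), ("XYZ", 22.0, 150), ("ABC", 11.0, 500)],
)

# A declarative query: total traded volume per symbol.
cur.execute("SELECT symbol, SUM(volume) FROM trades GROUP BY symbol")
print(cur.fetchall())                # e.g. [('ABC', 800), ('XYZ', 150)]
conn.close()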