BDA UNIT-1 NOTES
Big data has found its way into diverse sectors, enabling new insights and efficiency.
Healthcare
Personalized Medicine: Big data analytics in genomics allows for treatments that are
tailored to individuals based on their genetic makeup.
Disease Prediction and Prevention: Public health surveillance can track the spread of
diseases in real time, while predictive analytics can help healthcare providers prevent
outbreaks.
Finance
Risk Management: Financial institutions use big data for predictive models that assess
creditworthiness, detect fraud, and optimize investment strategies.
Algorithmic Trading: Big data is used in the creation of algorithms that process and
analyze vast amounts of financial data to make trading decisions in fractions of a second.
Transportation
Traffic Management: Big data systems process data from GPS devices, traffic cameras,
and sensors to optimize traffic flow, reducing congestion and enhancing transportation
efficiency.
Fleet Management: Companies with large fleets (e.g., delivery services) use big data to
optimize routes, reduce fuel consumption, and enhance vehicle maintenance.
Dealing with Big Data is no easy feat, and it presents several obstacles that need to be addressed.
Data Privacy and Security
Sensitive Data: Big data often contains personal, financial, or health-related information
that must be kept secure from breaches or misuse.
Regulations: Laws like GDPR (General Data Protection Regulation) in Europe and
CCPA (California Consumer Privacy Act) in the U.S. impose strict guidelines for
handling personal data, which can complicate big data projects.
Data Quality
Incomplete or Noisy Data: Big data often comes from a variety of sources, and this data
can be messy or incomplete, requiring substantial cleaning and preprocessing to ensure it
is useful.
Inconsistent Data: Data may be collected at different times and in different formats,
making it difficult to combine and analyze effectively.
Data Integration
Variety of Data Sources: Data is often siloed across different departments or sources,
such as transactional data, sensor data, social media feeds, etc. Integrating this data into a
unified system is challenging.
Complexity of Relationships: Complex relationships between data points across
multiple datasets need to be understood and mapped, which can be resource-intensive.
Real-Time Processing
The need for real-time analysis of streaming data (e.g., stock market data, sensor data
from IoT devices) presents challenges in ensuring that systems are fast enough to
process and analyze data as it arrives.
The Vs are commonly used to describe the key attributes of Big Data. The original 3 Vs
(Volume, Velocity, and Variety) are often extended with Veracity and Value:
Volume
The sheer quantity of data being produced is enormous. For example, Facebook processes
4 petabytes of data every day, and the number of data points collected from IoT devices
is increasing rapidly.
Velocity
Data is being generated at unprecedented speeds, such as real-time social media updates,
financial transactions, and streaming sensor data. Processing such high-velocity data
requires specialized tools and algorithms.
Variety
Big Data encompasses various types of data, including structured data (e.g., databases),
semi-structured data (e.g., JSON files), and unstructured data (e.g., text, images, videos).
Veracity
Data in Big Data systems may be uncertain or unreliable due to errors in data collection,
poor-quality sources, or outliers. Handling and cleaning this "noisy" data is crucial for
achieving accurate analysis.
Value
It’s not enough to just collect vast amounts of data; the real benefit comes from extracting
actionable insights that can improve decision-making, boost profitability, or reduce risks.
5. Dimensions of Scalability
Horizontal Scaling (Scaling Out)
Description: Involves adding more machines (nodes) to the system to distribute the data
and computation load. This approach is ideal for big data as it allows for handling
massive datasets across multiple servers.
Example: Hadoop’s HDFS and Apache Spark both use horizontal scaling to handle
large-scale data processing.
Vertical Scaling (Scaling Up)
Description: Involves adding more resources (e.g., CPU, memory, storage) to a single
server or machine to handle a larger volume of data. This is more limited compared to
horizontal scaling because a single machine has finite resources.
Processing Scalability
As the volume of data grows, the ability to process larger datasets in less time becomes
increasingly important. Distributed frameworks like Apache Spark and MapReduce
enable parallel processing, making it possible to scale the computation.
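As a concrete illustration, the sketch below runs a word count in parallel with PySpark's RDD API. It is a minimal sketch, assuming a local Spark installation; the input file "logs.txt" is a hypothetical example, not part of any prescribed setup.

```python
# Minimal PySpark word count illustrating parallel processing.
# Assumes a local Spark installation; "logs.txt" is a hypothetical input file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProcessingScalabilityDemo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("logs.txt")                        # file is split into partitions
counts = (lines.flatMap(lambda line: line.split())     # map work runs in parallel per partition
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))       # reduce aggregates partial counts
print(counts.take(10))
spark.stop()
```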
Storage Scalability
The ability to store more data without significant performance degradation. Distributed
file systems like HDFS allow big data to be spread across multiple machines, each of
which holds part of the dataset.
6. Characteristics of Big Data (The Vs)
Volume
Refers to the massive quantity of data being generated. For example, Google handles
over 3.5 billion searches per day, generating a massive volume of data.
Velocity
The rate at which data is created, processed, and analyzed. Real-time streaming data from
IoT devices or financial markets needs to be processed quickly to make timely decisions.
Variety
Big data is heterogeneous in nature. Structured data (e.g., databases), semi-structured data
(e.g., XML, JSON), and unstructured data (e.g., images, videos, social media posts) all
contribute to big data complexity.
Veracity
The uncertainty or trustworthiness of the data. Not all data is accurate or consistent, and
cleaning and validating data is essential for reliable analysis.
Value
The ultimate goal of big data is to derive actionable insights that are valuable to
businesses, whether that’s predicting trends, optimizing processes, or understanding
customer behavior.
Variability
Data is not always uniform. It can change over time or vary across different sources,
requiring systems to adapt quickly to handle these fluctuations.
7. Data Science and Big Data
Data science is a multidisciplinary field that requires knowledge in statistics, machine learning,
computer science, and domain expertise. The process of extracting value from big data typically
involves the following stages:
Data Exploration: Initial analysis to understand the data’s structure, distribution, and
relationships between variables.
Feature Engineering: Selecting and transforming the raw data into useful features that
can improve model performance.
Model Building: Training machine learning algorithms to create predictive models based
on the data.
Evaluation: Assessing the model’s accuracy, precision, recall, and other performance
metrics.
Deployment: Deploying the model into a production environment where it can make
real-time predictions or provide insights.
8. Steps in the Data Science Process
1. Problem Definition
1. Goal: Understand the business problem and define the project objectives.
2. Tasks:
1. Meet with stakeholders to clarify the business objectives.
2. Define the goals of the data science project in business terms.
3. Identify the metrics and outcomes that will be used to evaluate the success
of the solution.
4. Formulate a hypothesis or a research question to be answered by the data.
3. Outcome: Clear project goals and a focused problem definition.
2. Data Collection
1. Goal: Gather the necessary data from different sources.
2. Tasks:
1. Identify and source relevant data (internal and external data sources).
2. Data can come from structured databases (SQL), unstructured sources
(text files, social media), or APIs (real-time data).
3. Set up a data collection pipeline (e.g., web scraping, using APIs, or
querying databases).
3. Outcome: A diverse set of raw data from various sources.
3. Data Cleaning and Preprocessing
1. Goal: Clean and preprocess the data to prepare it for analysis.
2. Tasks:
1. Handle missing data by imputing, removing, or ignoring incomplete
records.
2. Address inconsistencies or outliers in the dataset.
3. Normalize or standardize the data to ensure uniformity (especially for
machine learning models).
4. Encode categorical variables (e.g., converting "Male" and "Female" into
binary numbers for modeling).
3. Outcome: A clean and processed dataset ready for analysis.
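A minimal sketch of the cleaning tasks in this step, using pandas and scikit-learn. The file name ("customers.csv") and the column names (age, income, gender, churned) are purely illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")                      # hypothetical raw dataset

# Handle missing data: impute numeric gaps, drop rows missing the target
df["income"] = df["income"].fillna(df["income"].median())
df = df.dropna(subset=["churned"])

# Address inconsistencies/outliers (e.g., impossible ages)
df = df[df["age"].between(0, 120)]

# Encode a categorical variable as binary numbers
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})

# Standardize numeric features so they are on a comparable scale
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
```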
4. Exploratory Data Analysis (EDA)
1. Goal: Explore the data to understand its structure, patterns, and relationships.
2. Tasks:
1. Visualize the data using plots (e.g., histograms, scatter plots, box plots).
2. Summarize the data with descriptive statistics (mean, median, standard
deviation).
3. Identify correlations, trends, and outliers in the dataset.
4. Explore the distribution of variables and understand their behavior.
3. Outcome: Insights into the data's structure and distribution, which guide further
analysis.
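A short EDA sketch, continuing with the hypothetical cleaned DataFrame "df" from the previous step; the plotted columns are illustrative.

```python
import matplotlib.pyplot as plt

print(df.describe())                       # mean, std, quartiles for numeric columns
print(df.corr(numeric_only=True))          # pairwise correlations between numeric variables

df["income"].hist(bins=30)                 # distribution of a single variable
plt.xlabel("income")
plt.show()

df.plot.scatter(x="age", y="income")       # relationship between two variables
plt.show()

df.boxplot(column="income", by="churned")  # outliers and group differences
plt.show()
```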
5. Feature Engineering
1. Goal: Create new features that help improve the performance of models.
2. Tasks:
1. Combine multiple existing features to create new ones (e.g., "Age" +
"Experience" to create an "Experience Level" feature).
2. Create polynomial features or interaction terms if needed.
3. Perform dimensionality reduction (e.g., using PCA) if the dataset has too
many features.
3. Outcome: Enhanced dataset with new features that can improve model accuracy.
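A minimal feature-engineering sketch under the same assumptions as above; the "years_experience" column, the degree-2 polynomial terms, and the 2-component PCA are illustrative choices only.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

# Combine existing features into a new one
df["experience_level"] = df["age"] + df["years_experience"]

# Polynomial and interaction terms for selected numeric columns
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["age", "income"]])

# Dimensionality reduction when there are too many features
pca = PCA(n_components=2)
reduced = pca.fit_transform(df[["age", "income", "years_experience", "experience_level"]])
print(pca.explained_variance_ratio_)
```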
6. Model Building
1. Goal: Build a machine learning model to address the problem.
2. Tasks:
1. Choose an appropriate machine learning algorithm (e.g., regression,
classification, clustering).
2. Split the data into training and testing sets to evaluate model performance.
3. Train the model on the training data.
4. Tune hyperparameters using cross-validation to find the best model.
3. Outcome: A trained machine learning model.
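A minimal model-building sketch, assuming the hypothetical "churned" target from earlier steps and an arbitrarily chosen random forest classifier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = df.drop(columns=["churned"])
y = df["churned"]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune hyperparameters with cross-validation
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
model = search.best_estimator_          # trained model with the best parameters found
```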
7. Model Evaluation
1. Goal: Evaluate the model's performance using appropriate metrics.
2. Tasks:
1. Use validation data or cross-validation to assess model performance.
2. Evaluate the model using metrics like accuracy, precision, recall, F1 score,
or RMSE (Root Mean Squared Error).
3. Compare the results with the baseline model or other models.
3. Outcome: Performance metrics that determine the model's ability to generalize to
new, unseen data.
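A short evaluation sketch, continuing from the hypothetical trained "model" and held-out test set above; the majority-class baseline is one simple choice of comparison.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))

# Compare against a trivial baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```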
8. Model Deployment
1. Goal: Deploy the model into a production environment where it can be used for
decision-making.
2. Tasks:
1. Integrate the model into a real-time or batch processing system.
2. Ensure that the model is scalable and reliable.
3. Monitor model performance over time to ensure it is performing as
expected.
4. Set up processes for model retraining if the model's performance declines.
3. Outcome: A deployed model that is ready for use in real-world scenarios.
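One possible deployment sketch: wrapping the trained model in a small Flask HTTP service. The endpoint name, expected JSON shape, and "model.pkl" file are assumptions, not a prescribed setup.

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Model serialized earlier, e.g. with pickle.dump(model, open("model.pkl", "wb"))
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[35, 52000, 10, 62035]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```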
9. Model Monitoring and Maintenance
1. Goal: Continuously monitor and maintain the model in a production environment.
2. Tasks:
1. Track the performance of the deployed model over time.
2. Collect new data and retrain the model if needed to keep it relevant.
3. Detect and address model drift or performance degradation.
4. Periodically update the model to improve accuracy.
3. Outcome: A robust, well-maintained model that adapts to new data and evolving
requirements.
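A minimal monitoring sketch: compare accuracy on recent data against the accuracy recorded at deployment time and flag possible drift. The 5% tolerance is an arbitrary assumption.

```python
from sklearn.metrics import accuracy_score

def check_for_drift(model, X_recent, y_recent, baseline_accuracy, tolerance=0.05):
    """Flag the model for retraining if accuracy on recent data drops too far."""
    current = accuracy_score(y_recent, model.predict(X_recent))
    print(f"baseline={baseline_accuracy:.3f} current={current:.3f}")
    return (baseline_accuracy - current) > tolerance

# e.g. run this check on a schedule; retrain and redeploy when it returns True
```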
Tools and Technologies
Distributed Computing: Systems like Hadoop and Spark distribute the computation
across many machines to efficiently process large datasets.
Programming Languages: Languages like Python, R, and Java are widely used in big
data and data science for their flexibility, libraries, and support for data manipulation and
analysis.
Big Data Storage: NoSQL databases like MongoDB, Cassandra, and HBase are used
to store and manage large volumes of unstructured or semi-structured data.
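A minimal sketch of storing semi-structured (JSON-like) records in MongoDB with pymongo; the connection string, database, and collection names are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")      # placeholder connection string
events = client["bigdata_demo"]["sensor_events"]

# Documents in the same collection can have different fields (flexible schema)
events.insert_one({"sensor_id": 7, "temp_c": 21.4, "ts": "2024-01-01T10:00:00Z"})
events.insert_one({"sensor_id": 7, "temp_c": 22.1, "battery": 0.83})

for doc in events.find({"sensor_id": 7}):
    print(doc)
```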
Hadoop Distributed File System (HDFS)
Overview: HDFS is a distributed file system designed to store vast amounts of data
across multiple machines in a Hadoop cluster. It is the foundation of Hadoop and allows
for the storage of large datasets in a distributed manner.
Architecture:
o NameNode: The master server that manages the file system’s namespace and
metadata, including information about the file blocks and their locations.
o DataNodes: These are worker nodes that store the actual data blocks. Each
DataNode manages the storage on a machine and serves read and write requests.
Features:
o Block Size: In HDFS, files are divided into fixed-size blocks (typically 128MB or
256MB).
o Replication: Data is replicated (usually three times) to provide fault tolerance.
o Write-Once, Read-Many Model: HDFS is optimized for write-once, read-many
use cases, making it ideal for big data workloads where data is mostly written
once and then read multiple times for analysis.
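A back-of-the-envelope sketch of how block size and replication translate into storage, assuming a hypothetical 1 GiB file, 128 MiB blocks, and the usual replication factor of 3:

```python
import math

file_size_mib = 1024       # hypothetical 1 GiB file
block_size_mib = 128       # typical HDFS block size
replication = 3            # common default replication factor

num_blocks = math.ceil(file_size_mib / block_size_mib)   # 8 blocks
total_replicas = num_blocks * replication                # 24 block replicas across DataNodes
raw_storage_mib = file_size_mib * replication            # ~3072 MiB of physical storage used
print(num_blocks, total_replicas, raw_storage_mib)
```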
Use Cases:
o Big Data Analytics: Storing and processing large datasets, such as logs, sensor
data, and historical data for analysis.
o Data Warehousing: Used in conjunction with Hadoop ecosystem tools for
storing and querying massive datasets.
Ceph
Overview: Ceph is a highly scalable and distributed file system that provides object
storage, block storage, and a file system in one unified platform. It is often used in cloud
storage environments.
Features:
o Unified Storage: Supports object storage, block storage, and file storage all in
one system.
o Fault Tolerance: Ceph replicates data across nodes and automatically handles
failure recovery.
o Scalability: Ceph is highly scalable, supporting both small and large deployments
with no single point of failure.
Use Cases:
o Cloud Storage: Ceph is used as a distributed storage backend in cloud
environments.
o Big Data: Ceph can be integrated into big data platforms like Hadoop or used
independently to store large datasets.