BDA UNIT-1 NOTES

The document discusses the significance of big data, highlighting its exponential growth, business value, and applications across various sectors such as e-commerce, healthcare, finance, and telecommunications. It also addresses the challenges associated with big data, including data privacy, quality, integration, and the need for real-time processing. Additionally, it outlines the characteristics of big data, dimensions of scalability, and the data science workflow necessary for extracting value from big data.

1. Big Data: Why and Where

Why Big Data?

 Exponential Growth of Data: With the increasing digitalization of everyday activities, more data is being generated than ever before. According to some estimates, the amount of data created globally doubles approximately every two years. The sheer scale and complexity of this data make traditional methods of data storage, management, and analysis insufficient.
 Business Value: Businesses are realizing the immense potential that big data holds in
deriving actionable insights. Big data analytics allows companies to improve customer
experience, optimize operations, detect patterns, and predict future trends. This leads to
more informed business strategies and competitive advantages.
 Real-Time Insights: Big data analytics enables real-time decision-making. For example,
streaming data from sensors in manufacturing plants can detect issues before they
become critical, allowing for predictive maintenance.

Where Is Big Data Used?

 E-commerce: Retailers analyze user behavior, transaction data, and preferences to recommend products, optimize pricing, and personalize advertisements.
 Healthcare: Large-scale health data (e.g., electronic medical records, genomic
sequences, and patient monitoring data) is analyzed to improve treatment outcomes,
manage diseases, and provide personalized care.
 Finance: Big data enables financial institutions to assess risks, detect fraud, and optimize
trading strategies. Financial data from various sources, including market trends and
consumer behaviors, can inform risk models.
 Telecommunications: Telecom companies use big data for network optimization,
customer churn prediction, and targeted marketing.

2. Applications of Big Data

Big data has found its way into diverse sectors, enabling new insights and efficiency.

Retail and E-commerce

 Customer Analytics: By analyzing customer behavior and preferences, e-commerce platforms can recommend products, personalize discounts, and optimize marketing campaigns.
 Supply Chain Management: Real-time data from inventory systems and shipment
trackers enable retailers to optimize supply chain operations, ensuring timely deliveries
and reducing operational costs.

Healthcare
 Personalized Medicine: Big data analytics in genomics allows for treatments that are
tailored to individuals based on their genetic makeup.
 Disease Prediction and Prevention: Public health surveillance can track the spread of
diseases in real-time, while predictive analytics can help healthcare providers prevent
outbreaks.

Finance

 Risk Management: Financial institutions use big data for predictive models that assess
creditworthiness, detect fraud, and optimize investment strategies.
 Algorithmic Trading: Big data is used in the creation of algorithms that process and
analyze vast amounts of financial data to make trading decisions in fractions of a second.

Transportation

 Traffic Management: Big data systems process data from GPS devices, traffic cameras,
and sensors to optimize traffic flow, reducing congestion and enhancing transportation
efficiency.
 Fleet Management: Companies with large fleets (e.g., delivery services) use big data to
optimize routes, reduce fuel consumption, and enhance vehicle maintenance.

3. Challenges of Big Data

Dealing with Big Data is no easy feat, and it presents several obstacles that need to be addressed.

Data Privacy and Security

 Sensitive Data: Big data often contains personal, financial, or health-related information
that must be kept secure from breaches or misuse.
 Regulations: Laws like GDPR (General Data Protection Regulation) in Europe and
CCPA (California Consumer Privacy Act) in the U.S. impose strict guidelines for
handling personal data, which can complicate big data projects.

Data Quality

 Incomplete or Noisy Data: Big data often comes from a variety of sources, and this data
can be messy or incomplete, requiring substantial cleaning and preprocessing to ensure it
is useful.
 Inconsistent Data: Data may be collected at different times and in different formats,
making it difficult to combine and analyze effectively.

Data Integration
 Variety of Data Sources: Data is often siloed across different departments or sources,
such as transactional data, sensor data, social media feeds, etc. Integrating this data into a
unified system is challenging.
 Complexity of Relationships: Complex relationships between data points across
multiple datasets need to be understood and mapped, which can be resource-intensive.

Real-Time Data Processing

 The need for real-time analysis of streaming data (e.g., stock market data, sensor data
from IoT devices) presents challenges in ensuring that the systems are fast enough to
process and analyze data as it arrives.

4. Characteristics of Big Data

The key attributes of Big Data are commonly described by its Vs. The classic 3 Vs (Volume, Velocity, and Variety) are often extended with Veracity and Value:

Volume

 The sheer quantity of data being produced is enormous. For example, Facebook processes
4 petabytes of data every day, and the number of data points collected from IoT devices
is increasing rapidly.

Velocity

 Data is being generated at unprecedented speeds, such as real-time social media updates,
financial transactions, and streaming sensor data. Processing such high-velocity data
requires specialized tools and algorithms.

Variety

 Big Data encompasses various types of data, including structured data (e.g., databases),
semi-structured data (e.g., JSON files), and unstructured data (e.g., text, images, videos).

Veracity

 Data in Big Data systems may be uncertain or unreliable due to errors in data collection,
poor-quality sources, or outliers. Handling and cleaning this "noisy" data is crucial for
achieving accurate analysis.

Value

 It’s not enough to just collect vast amounts of data; the real benefit comes from extracting
actionable insights that can improve decision-making, boost profitability, or reduce risks.
5. Dimensions of Scalability

Horizontal Scaling (Scale-Out)

 Description: Involves adding more machines (nodes) to the system to distribute the data
and computation load. This approach is ideal for big data as it allows for handling
massive datasets across multiple servers.
 Example: Hadoop’s HDFS and Apache Spark both use horizontal scaling to handle
large-scale data processing.

Vertical Scaling (Scale-Up)

 Description: Involves adding more resources (e.g., CPU, memory, storage) to a single
server or machine to handle a larger volume of data. This is more limited compared to
horizontal scaling because a single machine has finite resources.

Processing Scalability

 As the volume of data grows, the ability to process larger datasets in less time becomes
increasingly important. Distributed frameworks like Apache Spark and MapReduce
enable parallel processing, making it possible to scale the computation.
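
A minimal PySpark sketch of this kind of MapReduce-style parallel processing (the input lines are a small in-memory sample used only for illustration; in practice the RDD would be read from a distributed store such as HDFS):

# Minimal PySpark sketch of MapReduce-style parallel processing (word count).
# The in-memory sample data is an illustrative assumption.
from pyspark import SparkContext

sc = SparkContext(appName="wordcount-sketch")

lines = sc.parallelize([
    "big data needs parallel processing",
    "spark and mapreduce enable parallel processing",
])

counts = (lines.flatMap(lambda line: line.split())   # map: split lines into words
               .map(lambda word: (word, 1))          # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word

print(counts.collect())
sc.stop()

Each partition of the RDD is processed by a different executor, so adding machines to the cluster directly increases the parallelism.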

Storage Scalability

 The ability to store more data without significant performance degradation. Distributed
file systems like HDFS allow big data to be spread across multiple machines, each of
which holds part of the dataset.

6. The Six Vs of Big Data

Volume

 Refers to the massive quantity of data being generated. For example, Google handles
over 3.5 billion searches per day, generating a massive volume of data.

Velocity

 The rate at which data is created, processed, and analyzed. Real-time streaming data from
IoT devices or financial markets needs to be processed quickly to make timely decisions.

Variety
 Big data is heterogeneous in nature. Structured data (e.g., databases), semi-structured data
(e.g., XML, JSON), and unstructured data (e.g., images, videos, social media posts) all
contribute to big data complexity.

Veracity

 The uncertainty or trustworthiness of the data. Not all data is accurate or consistent, and
cleaning and validating data is essential for reliable analysis.

Value

 The ultimate goal of big data is to derive actionable insights that are valuable to
businesses, whether that’s predicting trends, optimizing processes, or understanding
customer behavior.

Variability

 Data is not always uniform. It can change over time or vary across different sources,
requiring systems to adapt quickly to handle these fluctuations.

7. Data Science: Getting Value out of Big Data

Data Science Workflow

Data science is a multidisciplinary field that requires knowledge in statistics, machine learning,
computer science, and domain expertise. The process of extracting value from big data typically
involves the following stages:

 Data Exploration: Initial analysis to understand the data’s structure, distribution, and
relationships between variables.
 Feature Engineering: Selecting and transforming the raw data into useful features that
can improve model performance.
 Model Building: Training machine learning algorithms to create predictive models based
on the data.
 Evaluation: Assessing the model’s accuracy, precision, recall, and other performance
metrics.
 Deployment: Deploying the model into a production environment where it can make
real-time predictions or provide insights.
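
A compact scikit-learn sketch of these stages (the file name data.csv and the binary target column "target" are illustrative assumptions, not a specific dataset):

# Compact sketch of the workflow above using pandas and scikit-learn.
# "data.csv" and the "target" column are illustrative assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("data.csv")

# Data exploration: structure, dtypes, summary statistics
df.info()
print(df.describe())

# Feature engineering: one-hot encode categoricals, keep numerics as-is
X = pd.get_dummies(df.drop(columns=["target"]))
y = df["target"]

# Model building
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluation: accuracy, precision, recall, F1 per class
print(classification_report(y_test, model.predict(X_test)))
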
8. Steps in the Data Science Process

1. Problem Definition
1. Goal: Understand the business problem and define the project objectives.
2. Tasks:
1. Meet with stakeholders to clarify the business objectives.
2. Define the goals of the data science project in business terms.
3. Identify the metrics and outcomes that will be used to evaluate the success
of the solution.
4. Formulate a hypothesis or a research question to be answered by the data.
3. Outcome: Clear project goals and a focused problem definition.
2. Data Collection
1. Goal: Gather the necessary data from different sources.
2. Tasks:
1. Identify and source relevant data (internal and external data sources).
2. Data can come from structured databases (SQL), unstructured sources
(text files, social media), or APIs (real-time data).
3. Set up a data collection pipeline (e.g., web scraping, using APIs, or
querying databases).
3. Outcome: A diverse set of raw data from various sources.
3. Data Cleaning and Preprocessing
1. Goal: Clean and preprocess the data to prepare it for analysis.
2. Tasks:
1. Handle missing data by imputing, removing, or ignoring incomplete
records.
2. Address inconsistencies or outliers in the dataset.
3. Normalize or standardize the data to ensure uniformity (especially for
machine learning models).
4. Encode categorical variables (e.g., converting "Male" and "Female" into
binary numbers for modeling).
3. Outcome: A clean and processed dataset ready for analysis.
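
A minimal pandas sketch of the cleaning tasks above (the file name and columns such as age, salary, and gender are illustrative assumptions):

# Minimal cleaning/preprocessing sketch with pandas.
# File and column names are illustrative assumptions.
import pandas as pd

df = pd.read_csv("raw_data.csv")

# Handle missing data: impute a numeric column, drop rows missing a key field
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["salary"])

# Remove exact duplicates and clip extreme outliers
df = df.drop_duplicates()
df["salary"] = df["salary"].clip(lower=df["salary"].quantile(0.01),
                                 upper=df["salary"].quantile(0.99))

# Standardize a numeric column and encode a categorical one
df["salary_z"] = (df["salary"] - df["salary"].mean()) / df["salary"].std()
df["gender_male"] = (df["gender"] == "Male").astype(int)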
4. Exploratory Data Analysis (EDA)
1. Goal: Explore the data to understand its structure, patterns, and relationships.
2. Tasks:
1. Visualize the data using plots (e.g., histograms, scatter plots, box plots).
2. Summarize the data with descriptive statistics (mean, median, standard
deviation).
3. Identify correlations, trends, and outliers in the dataset.
4. Explore the distribution of variables and understand their behavior.
3. Outcome: Insights into the data's structure and distribution, which guide further
analysis.
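
A minimal EDA sketch using pandas and matplotlib (dataset and column names are illustrative assumptions):

# Minimal EDA sketch: summary statistics, correlations, and basic plots.
# The dataset and column names are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("clean_data.csv")

print(df.describe())                          # mean, median (50%), std, min/max
print(df.select_dtypes("number").corr())      # pairwise correlations

df["age"].hist(bins=30)                       # distribution of a single variable
plt.title("Age distribution")
plt.show()

df.plot.scatter(x="age", y="salary")          # relationship between two variables
plt.show()

df.boxplot(column="salary", by="region")      # outliers per group
plt.show()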
5. Feature Engineering
1. Goal: Create new features that help improve the performance of models.
2. Tasks:
1. Combine multiple existing features to create new ones (e.g., "Age" +
"Experience" to create an "Experience Level" feature).
2. Create polynomial features or interaction terms if needed.
3. Perform dimensionality reduction (e.g., using PCA) if the dataset has too
many features.
3. Outcome: Enhanced dataset with new features that can improve model accuracy.
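
A minimal feature-engineering sketch with pandas and scikit-learn (column names are illustrative assumptions; the PCA step assumes the dataset has several numeric columns):

# Minimal feature-engineering sketch; column names are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("clean_data.csv")

# Combine existing features into a new one
df["experience_level"] = df["age"] + df["experience"]

# Polynomial / interaction terms for selected numeric columns
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["age", "experience"]])

# Dimensionality reduction when there are many numeric features
numeric = StandardScaler().fit_transform(df.select_dtypes("number"))
reduced = PCA(n_components=2).fit_transform(numeric)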
6. Model Building
1. Goal: Build a machine learning model to address the problem.
2. Tasks:
1. Choose an appropriate machine learning algorithm (e.g., regression,
classification, clustering).
2. Split the data into training and testing sets to evaluate model performance.
3. Train the model on the training data.
4. Tune hyperparameters using cross-validation to find the best model.
3. Outcome: A trained machine learning model.
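
A minimal model-building sketch with a train/test split and cross-validated hyperparameter tuning in scikit-learn (a synthetic dataset stands in for the engineered features from the earlier steps, so the sketch is self-contained):

# Minimal model-building sketch: split, train, and tune with cross-validation.
# make_classification generates stand-in data (an illustrative assumption).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 30]}
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, cv=5, scoring="f1")
search.fit(X_train, y_train)

best_model = search.best_estimator_
print("Best parameters:", search.best_params_)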
7. Model Evaluation
1. Goal: Evaluate the model's performance using appropriate metrics.
2. Tasks:
1. Use validation data or cross-validation to assess model performance.
2. Evaluate the model using metrics like accuracy, precision, recall, F1 score,
or RMSE (Root Mean Squared Error).
3. Compare the results with the baseline model or other models.
3. Outcome: Performance metrics that determine the model's ability to generalize to
new, unseen data.
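
Continuing the step 6 sketch (it reuses best_model and the train/test split from above), a minimal evaluation that compares the tuned model against a trivial baseline, as suggested in the tasks:

# Minimal evaluation sketch: tuned model vs. a trivial baseline.
# Reuses best_model, X_train, X_test, y_train, y_test from the step 6 sketch.
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

for name, model in [("baseline", baseline), ("tuned model", best_model)]:
    y_pred = model.predict(X_test)
    print(f"{name}: accuracy={accuracy_score(y_test, y_pred):.3f}, "
          f"f1={f1_score(y_test, y_pred):.3f}")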
8. Model Deployment
1. Goal: Deploy the model into a production environment where it can be used for
decision-making.
2. Tasks:
1. Integrate the model into a real-time or batch processing system.
2. Ensure that the model is scalable and reliable.
3. Monitor model performance over time to ensure it is performing as
expected.
4. Set up processes for model retraining if the model's performance declines.
3. Outcome: A deployed model that is ready for use in real-world scenarios.
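
A minimal deployment sketch that serves a trained model behind a small Flask REST endpoint (Flask, the model file name, and the request format are assumptions used only for illustration; the incoming features must match the layout used at training time):

# Minimal deployment sketch: serve the trained model over HTTP with Flask.
# Model file name and feature layout are illustrative assumptions.
import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)
model = joblib.load("model.joblib")   # model saved after training

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()           # e.g. {"age": 34, "salary": 52000}
    features = pd.DataFrame([payload])
    prediction = model.predict(features)[0]
    return jsonify({"prediction": int(prediction)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)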
9. Model Monitoring and Maintenance
1. Goal: Continuously monitor and maintain the model in a production environment.
2. Tasks:
1. Track the performance of the deployed model over time.
2. Collect new data and retrain the model if needed to keep it relevant.
3. Detect and address model drift or performance degradation.
4. Periodically update the model to improve accuracy.
3. Outcome: A robust, well-maintained model that adapts to new data and evolving
requirements.
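
A simplified monitoring sketch: periodically re-score the deployed model on newly labelled data and flag degradation for retraining (the threshold and file names are illustrative assumptions):

# Simplified monitoring sketch: check recent performance against a threshold.
# Threshold and file names are illustrative assumptions.
import joblib
import pandas as pd
from sklearn.metrics import f1_score

ALERT_THRESHOLD = 0.75

model = joblib.load("model.joblib")
recent = pd.read_csv("recent_labelled_data.csv")

X_recent = recent.drop(columns=["target"])
y_recent = recent["target"]

score = f1_score(y_recent, model.predict(X_recent))
if score < ALERT_THRESHOLD:
    print(f"F1 dropped to {score:.3f}: schedule retraining / investigate drift")
else:
    print(f"F1 is {score:.3f}: model still healthy")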

9. Foundations for Big Data Systems and Programming

To build big data systems, you need:

 Distributed Computing: Systems like Hadoop and Spark distribute the computation
across many machines to efficiently process large datasets.
 Programming Languages: Languages like Python, R, and Java are widely used in big
data and data science for their flexibility, libraries, and support for data manipulation and
analysis.
 Big Data Storage: NoSQL databases like MongoDB, Cassandra, and HBase are used
to store and manage large volumes of unstructured or semi-structured data.
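
A minimal sketch of storing semi-structured records in a NoSQL database, here MongoDB via pymongo (the host, database, and collection names are illustrative assumptions):

# Minimal NoSQL sketch with MongoDB / pymongo.
# Host, database, and collection names are illustrative assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["bigdata_demo"]["clickstream"]

# Documents need no fixed schema: fields can vary from record to record
events.insert_one({"user": "u123", "action": "view", "product": "p42"})
events.insert_one({"user": "u456", "action": "search", "query": "running shoes"})

for doc in events.find({"action": "view"}).limit(5):
    print(doc)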

10. Distributed File Systems


Big data requires scalable, fault-tolerant storage systems, and distributed file systems play a key
role:

Popular Distributed File Systems Used in Big Data

1. HDFS (Hadoop Distributed File System)

 Overview: HDFS is a distributed file system designed to store vast amounts of data
across multiple machines in a Hadoop cluster. It is the foundation of Hadoop and allows
for the storage of large datasets in a distributed manner.
 Architecture:
o NameNode: The master server that manages the file system’s namespace and
metadata, including information about the file blocks and their locations.
o DataNodes: These are worker nodes that store the actual data blocks. Each
DataNode manages the storage on a machine and serves read and write requests.
 Features:
o Block Size: In HDFS, files are divided into fixed-size blocks (typically 128MB or
256MB).
o Replication: Data is replicated (usually three times) to provide fault tolerance.
o Write-Once, Read-Many Model: HDFS is optimized for write-once, read-many
use cases, making it ideal for big data workloads where data is mostly written
once and then read multiple times for analysis.
 Use Cases:
o Big Data Analytics: Storing and processing large datasets, such as logs, sensor
data, and historical data for analysis.
o Data Warehousing: Used in conjunction with Hadoop ecosystem tools for
storing and querying massive datasets.
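
A minimal PySpark sketch of reading from and writing to HDFS (the NameNode host/port and paths are illustrative assumptions; block placement and replication are handled transparently by HDFS itself):

# Minimal sketch: read logs from HDFS, filter them, write results back.
# NameNode host/port and paths are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdfs-sketch").getOrCreate()

# Each 128MB/256MB block of the files is read in parallel from the DataNodes
logs = spark.read.text("hdfs://namenode:9000/data/logs/2024/*.log")

errors = logs.filter(logs.value.contains("ERROR"))
errors.write.mode("overwrite").parquet("hdfs://namenode:9000/analytics/errors")

spark.stop()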

2. Ceph File System (CephFS)

 Overview: Ceph is a highly scalable and distributed file system that provides object
storage, block storage, and a file system in one unified platform. It is often used in cloud
storage environments.
 Features:
o Unified Storage: Supports object storage, block storage, and file storage all in
one system.
o Fault Tolerance: Ceph replicates data across nodes and automatically handles
failure recovery.
o Scalability: Ceph is highly scalable, supporting both small and large deployments
with no single point of failure.
 Use Cases:
o Cloud Storage: Ceph is used as a distributed storage backend in cloud
environments.
o Big Data: Ceph can be integrated into big data platforms like Hadoop or used
independently to store large datasets.

3. Amazon S3 (Simple Storage Service)

 Overview: Amazon S3 is a cloud-based distributed object storage system provided by Amazon Web Services (AWS). It allows users to store and retrieve vast amounts of data with high availability, scalability, and durability.
 Features:
o Scalability: S3 can scale automatically as data volumes increase, handling
petabytes of data.
o Durability: Data is automatically replicated across multiple data centers to ensure
durability.
o Flexibility: S3 supports multiple data formats (e.g., images, videos, logs,
backups).
 Use Cases:
o Data Lake Storage: S3 is commonly used to store raw, unprocessed data in a
data lake architecture.
o Backup and Archiving: Large datasets can be archived and backed up using S3’s
durable infrastructure.
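
A minimal boto3 sketch of using S3 as data-lake storage (the bucket name and object keys are illustrative assumptions; credentials come from the standard AWS configuration):

# Minimal boto3 sketch: upload, list, and download objects in S3.
# Bucket name and keys are illustrative assumptions.
import boto3

s3 = boto3.client("s3")

# Upload a raw log file into the data-lake bucket
s3.upload_file("events-2024-05-01.json", "my-data-lake",
               "raw/events/2024/05/01/events.json")

# List objects under a prefix
response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/events/2024/05/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download a file for local processing
s3.download_file("my-data-lake", "raw/events/2024/05/01/events.json", "local-copy.json")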

4. Google Cloud Storage

 Overview: Google Cloud Storage is a distributed storage system provided by Google Cloud Platform (GCP). It allows for the storage of unstructured data such as images, videos, backups, and big data for analytics.
 Features:
o Global Accessibility: Data can be stored and accessed globally with low latency.
o Durability and Availability: Google Cloud Storage replicates data across
multiple regions and provides redundancy in case of hardware failures.
o Integration with Big Data Tools: Works seamlessly with Google BigQuery,
Google Dataproc, and other data processing tools.
 Use Cases:
o Big Data Analytics: Google Cloud Storage is used for storing data that is later
processed using tools like BigQuery and Apache Beam.
o Backup and Archiving: Similar to S3, Google Cloud Storage is used for long-term storage and backup solutions.
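
A minimal sketch with the google-cloud-storage Python client (the bucket and object names are illustrative assumptions; credentials come from the usual GCP application-default credentials):

# Minimal Google Cloud Storage sketch: upload, list, and download objects.
# Bucket and object names are illustrative assumptions.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-analytics-bucket")

# Upload a local file as an object
blob = bucket.blob("raw/sensor-readings-2024-05-01.csv")
blob.upload_from_filename("sensor-readings-2024-05-01.csv")

# List objects under a prefix
for b in client.list_blobs("my-analytics-bucket", prefix="raw/"):
    print(b.name)

# Download an object for local processing
blob.download_to_filename("local-copy.csv")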
