BDA UNIT-1 NOTES
Big data has found its way into diverse sectors, enabling new insights and efficiency.
Healthcare
Personalized Medicine: Big data analytics in genomics allows for treatments that are
tailored to individuals based on their genetic makeup.
Disease Prediction and Prevention: Public health surveillance can track the spread of
diseases in real time, while predictive analytics can help healthcare providers prevent
outbreaks.
Finance
Risk Management: Financial institutions use big data for predictive models that assess
creditworthiness, detect fraud, and optimize investment strategies.
Algorithmic Trading: Big data is used in the creation of algorithms that process and
analyze vast amounts of financial data to make trading decisions in fractions of a second.
Transportation
Traffic Management: Big data systems process data from GPS devices, traffic cameras,
and sensors to optimize traffic flow, reducing congestion and enhancing transportation
efficiency.
Fleet Management: Companies with large fleets (e.g., delivery services) use big data to
optimize routes, reduce fuel consumption, and enhance vehicle maintenance.
Dealing with Big Data is no easy feat, and it presents several obstacles that need to be addressed.
Data Privacy and Security
Sensitive Data: Big data often contains personal, financial, or health-related information
that must be kept secure from breaches or misuse.
Regulations: Laws like GDPR (General Data Protection Regulation) in Europe and
CCPA (California Consumer Privacy Act) in the U.S. impose strict guidelines for
handling personal data, which can complicate big data projects.
Data Quality
Incomplete or Noisy Data: Big data often comes from a variety of sources, and this data
can be messy or incomplete, requiring substantial cleaning and preprocessing to ensure it
is useful.
Inconsistent Data: Data may be collected at different times and in different formats,
making it difficult to combine and analyze effectively.
Data Integration
Variety of Data Sources: Data is often siloed across different departments or sources,
such as transactional data, sensor data, social media feeds, etc. Integrating this data into a
unified system is challenging.
Complexity of Relationships: Complex relationships between data points across
multiple datasets need to be understood and mapped, which can be resource-intensive.
Real-Time Processing
The need for real-time analysis of streaming data (e.g., stock market data, sensor data
from IoT devices) presents challenges in ensuring that systems are fast enough to
process and analyze data as it arrives.
The Vs are commonly used to describe the key attributes of Big Data. The original 3 Vs
(Volume, Velocity, and Variety) are often extended with Veracity and Value:
Volume
The sheer quantity of data being produced is enormous. For example, Facebook processes
4 petabytes of data every day, and the number of data points collected from IoT devices
is increasing rapidly.
Velocity
Data is being generated at unprecedented speeds, such as real-time social media updates,
financial transactions, and streaming sensor data. Processing such high-velocity data
requires specialized tools and algorithms.
Variety
Big Data encompasses various types of data, including structured data (e.g., databases),
semi-structured data (e.g., JSON files), and unstructured data (e.g., text, images, videos).
Veracity
Data in Big Data systems may be uncertain or unreliable due to errors in data collection,
poor-quality sources, or outliers. Handling and cleaning this "noisy" data is crucial for
achieving accurate analysis.
Value
It’s not enough to just collect vast amounts of data; the real benefit comes from extracting
actionable insights that can improve decision-making, boost profitability, or reduce risks.
5. Dimensions of Scalability
Horizontal Scaling (Scaling Out)
Description: Involves adding more machines (nodes) to the system to distribute the data
and computation load. This approach is ideal for big data as it allows for handling
massive datasets across multiple servers.
Example: Hadoop’s HDFS and Apache Spark both use horizontal scaling to handle
large-scale data processing.
Vertical Scaling (Scaling Up)
Description: Involves adding more resources (e.g., CPU, memory, storage) to a single
server or machine to handle a larger volume of data. This is more limited compared to
horizontal scaling because a single machine has finite resources.
Processing Scalability
As the volume of data grows, the ability to process larger datasets in less time becomes
increasingly important. Distributed frameworks like Apache Spark and MapReduce
enable parallel processing, making it possible to scale the computation.
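As a concrete illustration, the sketch below runs a word count in parallel with PySpark's RDD API. It is a minimal sketch, assuming a local Spark installation; the input file "logs.txt" is a hypothetical example, not part of any prescribed setup.

```python
# Minimal PySpark word count illustrating parallel processing.
# Assumes a local Spark installation; "logs.txt" is a hypothetical input file.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ProcessingScalabilityDemo").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("logs.txt")                        # file is split into partitions
counts = (lines.flatMap(lambda line: line.split())     # map work runs in parallel per partition
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))       # reduce aggregates partial counts
print(counts.take(10))
spark.stop()
```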
Storage Scalability
The ability to store more data without significant performance degradation. Distributed
file systems like HDFS allow big data to be spread across multiple machines, each of
which holds part of the dataset.
6. Characteristics of Big Data (The Vs)
Volume
Refers to the massive quantity of data being generated. For example, Google handles
over 3.5 billion searches per day, generating a massive volume of data.
Velocity
The rate at which data is created, processed, and analyzed. Real-time streaming data from
IoT devices or financial markets needs to be processed quickly to make timely decisions.
Variety
Big data is heterogeneous in nature. Structured data (e.g., databases), semi-structured data
(e.g., XML, JSON), and unstructured data (e.g., images, videos, social media posts) all
contribute to big data complexity.
Veracity
The uncertainty or trustworthiness of the data. Not all data is accurate or consistent, and
cleaning and validating data is essential for reliable analysis.
Value
The ultimate goal of big data is to derive actionable insights that are valuable to
businesses, whether that’s predicting trends, optimizing processes, or understanding
customer behavior.
Variability
Data is not always uniform. It can change over time or vary across different sources,
requiring systems to adapt quickly to handle these fluctuations.
7. Data Science and Big Data
Data science is a multidisciplinary field that requires knowledge in statistics, machine learning,
computer science, and domain expertise. The process of extracting value from big data typically
involves the following stages:
Data Exploration: Initial analysis to understand the data’s structure, distribution, and
relationships between variables.
Feature Engineering: Selecting and transforming the raw data into useful features that
can improve model performance.
Model Building: Training machine learning algorithms to create predictive models based
on the data.
Evaluation: Assessing the model’s accuracy, precision, recall, and other performance
metrics.
Deployment: Deploying the model into a production environment where it can make
real-time predictions or provide insights.
8. Steps in the Data Science Process
1. Problem Definition
1. Goal: Understand the business problem and define the project objectives.
2. Tasks:
1. Meet with stakeholders to clarify the business objectives.
2. Define the goals of the data science project in business terms.
3. Identify the metrics and outcomes that will be used to evaluate the success
of the solution.
4. Formulate a hypothesis or a research question to be answered by the data.
3. Outcome: Clear project goals and a focused problem definition.
2. Data Collection
1. Goal: Gather the necessary data from different sources.
2. Tasks:
1. Identify and source relevant data (internal and external data sources).
2. Data can come from structured databases (SQL), unstructured sources
(text files, social media), or APIs (real-time data).
3. Set up a data collection pipeline (e.g., web scraping, using APIs, or
querying databases).
3. Outcome: A diverse set of raw data from various sources.
3. Data Cleaning and Preprocessing
1. Goal: Clean and preprocess the data to prepare it for analysis.
2. Tasks:
1. Handle missing data by imputing, removing, or ignoring incomplete
records.
2. Address inconsistencies or outliers in the dataset.
3. Normalize or standardize the data to ensure uniformity (especially for
machine learning models).
4. Encode categorical variables (e.g., converting "Male" and "Female" into
binary numbers for modeling).
3. Outcome: A clean and processed dataset ready for analysis.
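A minimal sketch of the cleaning tasks in this step, using pandas and scikit-learn. The file name ("customers.csv") and the column names (age, income, gender, churned) are purely illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customers.csv")                      # hypothetical raw dataset

# Handle missing data: impute numeric gaps, drop rows missing the target
df["income"] = df["income"].fillna(df["income"].median())
df = df.dropna(subset=["churned"])

# Address inconsistencies/outliers (e.g., impossible ages)
df = df[df["age"].between(0, 120)]

# Encode a categorical variable as binary numbers
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})

# Standardize numeric features so they are on a comparable scale
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
```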
4. Exploratory Data Analysis (EDA)
1. Goal: Explore the data to understand its structure, patterns, and relationships.
2. Tasks:
1. Visualize the data using plots (e.g., histograms, scatter plots, box plots).
2. Summarize the data with descriptive statistics (mean, median, standard
deviation).
3. Identify correlations, trends, and outliers in the dataset.
4. Explore the distribution of variables and understand their behavior.
3. Outcome: Insights into the data's structure and distribution, which guide further
analysis.
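A short EDA sketch, continuing with the hypothetical cleaned DataFrame "df" from the previous step; the plotted columns are illustrative.

```python
import matplotlib.pyplot as plt

print(df.describe())                       # mean, std, quartiles for numeric columns
print(df.corr(numeric_only=True))          # pairwise correlations between numeric variables

df["income"].hist(bins=30)                 # distribution of a single variable
plt.xlabel("income")
plt.show()

df.plot.scatter(x="age", y="income")       # relationship between two variables
plt.show()

df.boxplot(column="income", by="churned")  # outliers and group differences
plt.show()
```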
5. Feature Engineering
1. Goal: Create new features that help improve the performance of models.
2. Tasks:
1. Combine multiple existing features to create new ones (e.g., "Age" +
"Experience" to create an "Experience Level" feature).
2. Create polynomial features or interaction terms if needed.
3. Perform dimensionality reduction (e.g., using PCA) if the dataset has too
many features.
3. Outcome: Enhanced dataset with new features that can improve model accuracy.
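A minimal feature-engineering sketch under the same assumptions as above; the "years_experience" column, the degree-2 polynomial terms, and the 2-component PCA are illustrative choices only.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import PolynomialFeatures

# Combine existing features into a new one
df["experience_level"] = df["age"] + df["years_experience"]

# Polynomial and interaction terms for selected numeric columns
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[["age", "income"]])

# Dimensionality reduction when there are too many features
pca = PCA(n_components=2)
reduced = pca.fit_transform(df[["age", "income", "years_experience", "experience_level"]])
print(pca.explained_variance_ratio_)
```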
6. Model Building
1. Goal: Build a machine learning model to address the problem.
2. Tasks:
1. Choose an appropriate machine learning algorithm (e.g., regression,
classification, clustering).
2. Split the data into training and testing sets to evaluate model performance.
3. Train the model on the training data.
4. Tune hyperparameters using cross-validation to find the best model.
3. Outcome: A trained machine learning model.
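A minimal model-building sketch, assuming the hypothetical "churned" target from earlier steps and an arbitrarily chosen random forest classifier.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X = df.drop(columns=["churned"])
y = df["churned"]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune hyperparameters with cross-validation
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
model = search.best_estimator_          # trained model with the best parameters found
```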
7. Model Evaluation
1. Goal: Evaluate the model's performance using appropriate metrics.
2. Tasks:
1. Use validation data or cross-validation to assess model performance.
2. Evaluate the model using metrics like accuracy, precision, recall, F1 score,
or RMSE (Root Mean Squared Error).
3. Compare the results with the baseline model or other models.
3. Outcome: Performance metrics that determine the model's ability to generalize to
new, unseen data.
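A short evaluation sketch, continuing from the hypothetical trained "model" and held-out test set above; the majority-class baseline is one simple choice of comparison.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_pred = model.predict(X_test)
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1 score :", f1_score(y_test, y_pred))

# Compare against a trivial baseline that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("baseline accuracy:", baseline.score(X_test, y_test))
```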
8. Model Deployment
1. Goal: Deploy the model into a production environment where it can be used for
decision-making.
2. Tasks:
1. Integrate the model into a real-time or batch processing system.
2. Ensure that the model is scalable and reliable.
3. Monitor model performance over time to ensure it is performing as
expected.
4. Set up processes for model retraining if the model's performance declines.
3. Outcome: A deployed model that is ready for use in real-world scenarios.
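One possible deployment sketch: wrapping the trained model in a small Flask HTTP service. The endpoint name, expected JSON shape, and "model.pkl" file are assumptions, not a prescribed setup.

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Model serialized earlier, e.g. with pickle.dump(model, open("model.pkl", "wb"))
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[35, 52000, 10, 62035]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)
```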
9. Model Monitoring and Maintenance
1. Goal: Continuously monitor and maintain the model in a production environment.
2. Tasks:
1. Track the performance of the deployed model over time.
2. Collect new data and retrain the model if needed to keep it relevant.
3. Detect and address model drift or performance degradation.
4. Periodically update the model to improve accuracy.
3. Outcome: A robust, well-maintained model that adapts to new data and evolving
requirements.
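A minimal monitoring sketch: compare accuracy on recent data against the accuracy recorded at deployment time and flag possible drift. The 5% tolerance is an arbitrary assumption.

```python
from sklearn.metrics import accuracy_score

def check_for_drift(model, X_recent, y_recent, baseline_accuracy, tolerance=0.05):
    """Flag the model for retraining if accuracy on recent data drops too far."""
    current = accuracy_score(y_recent, model.predict(X_recent))
    print(f"baseline={baseline_accuracy:.3f} current={current:.3f}")
    return (baseline_accuracy - current) > tolerance

# e.g. run this check on a schedule; retrain and redeploy when it returns True
```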
Tools and Technologies
Distributed Computing: Systems like Hadoop and Spark distribute the computation
across many machines to efficiently process large datasets.
Programming Languages: Languages like Python, R, and Java are widely used in big
data and data science for their flexibility, libraries, and support for data manipulation and
analysis.
Big Data Storage: NoSQL databases like MongoDB, Cassandra, and HBase are used
to store and manage large volumes of unstructured or semi-structured data.
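A minimal sketch of storing semi-structured (JSON-like) records in MongoDB with pymongo; the connection string, database, and collection names are placeholders.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")      # placeholder connection string
events = client["bigdata_demo"]["sensor_events"]

# Documents in the same collection can have different fields (flexible schema)
events.insert_one({"sensor_id": 7, "temp_c": 21.4, "ts": "2024-01-01T10:00:00Z"})
events.insert_one({"sensor_id": 7, "temp_c": 22.1, "battery": 0.83})

for doc in events.find({"sensor_id": 7}):
    print(doc)
```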
Hadoop Distributed File System (HDFS)
Overview: HDFS is a distributed file system designed to store vast amounts of data
across multiple machines in a Hadoop cluster. It is the foundation of Hadoop and allows
for the storage of large datasets in a distributed manner.
Architecture:
o NameNode: The master server that manages the file system’s namespace and
metadata, including information about the file blocks and their locations.
o DataNodes: These are worker nodes that store the actual data blocks. Each
DataNode manages the storage on a machine and serves read and write requests.
Features:
o Block Size: In HDFS, files are divided into fixed-size blocks (typically 128MB or
256MB).
o Replication: Data is replicated (usually three times) to provide fault tolerance.
o Write-Once, Read-Many Model: HDFS is optimized for write-once, read-many
use cases, making it ideal for big data workloads where data is mostly written
once and then read multiple times for analysis.
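A back-of-the-envelope sketch of how block size and replication translate into storage, assuming a hypothetical 1 GiB file, 128 MiB blocks, and the usual replication factor of 3:

```python
import math

file_size_mib = 1024       # hypothetical 1 GiB file
block_size_mib = 128       # typical HDFS block size
replication = 3            # common default replication factor

num_blocks = math.ceil(file_size_mib / block_size_mib)   # 8 blocks
total_replicas = num_blocks * replication                # 24 block replicas across DataNodes
raw_storage_mib = file_size_mib * replication            # ~3072 MiB of physical storage used
print(num_blocks, total_replicas, raw_storage_mib)
```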
Use Cases:
o Big Data Analytics: Storing and processing large datasets, such as logs, sensor
data, and historical data for analysis.
o Data Warehousing: Used in conjunction with Hadoop ecosystem tools for
storing and querying massive datasets.
Ceph
Overview: Ceph is a highly scalable and distributed file system that provides object
storage, block storage, and a file system in one unified platform. It is often used in cloud
storage environments.
Features:
o Unified Storage: Supports object storage, block storage, and file storage all in
one system.
o Fault Tolerance: Ceph replicates data across nodes and automatically handles
failure recovery.
o Scalability: Ceph is highly scalable, supporting both small and large deployments
with no single point of failure.
Use Cases:
o Cloud Storage: Ceph is used as a distributed storage backend in cloud
environments.
o Big Data: Ceph can be integrated into big data platforms like Hadoop or used
independently to store large datasets.