Big Data applications span various industries, enhancing efficiency, insights, and product development through monitoring, analysis, and innovative solutions. Key features of Big Data include the five Vs: Volume, Velocity, Variety, Veracity, and Value, which highlight its complexity and importance in decision-making. Technologies like SQL and NoSQL databases, along with distributed and parallel computing, support the management and processing of large-scale data, while foundational systems like Google MapReduce and GFS have shaped modern big data frameworks.

Applications of Big Data

Big Data is used across industries to improve efficiency, gain insights, and develop new
products.

1. Monitoring and Tracking Applications

●​ Public Health Monitoring: Tracks disease outbreaks and improves healthcare data
sharing.
●​ Consumer Sentiment Monitoring: Analyzes social media data to understand customer
opinions.
●​ Asset Tracking: Uses RFID tags to prevent counterfeiting and theft in industries like
defense and retail.
●​ Supply Chain Monitoring: Tracks inventory movement using RFID, ensuring timely
product delivery.
●​ Preventive Machine Maintenance: Uses sensors to predict equipment failures and
reduce downtime.

2. Analysis and Insight Applications

●​ Predictive Policing: Identifies crime hotspots to help law enforcement prevent future
crimes.
●​ Winning Political Elections: Uses voter data analysis to target potential supporters and
optimize campaigns.
●​ Personal Health: AI-based medical analysis improves diagnosis and treatment
recommendations.

3. New Product Development

●​ Flexible Auto Insurance: Adjusts insurance premiums based on real-time driving behavior.
●​ Location-Based Retail Promotion: Sends personalized offers to customers based on
their location.
●​ Recommendation Services: Uses data analytics to suggest products, movies, and
music based on user preferences.

Importance of Big Data (Key Points in Simple Language)

1. Cost Savings
○ Helps reduce costs and improve efficiency.
○ Optimizes quality assurance and testing.
○ Useful in complex industries like biopharmaceuticals and nanotechnology.
2. Time Reduction
○ Uses real-time data analytics for quick decision-making.
○ Tools like Hadoop process large data sets quickly.
3. Understanding Market Conditions
○ Analyzes customer purchase behavior.
○ Identifies popular products to improve business strategies.
○ Helps businesses stay ahead of competitors.
4. Social Media Listening
○ Uses sentiment analysis to monitor brand reputation.
○ Helps businesses understand customer opinions.
○ Improves online presence and customer engagement.
5. Customer Acquisition and Retention
○ Helps understand customer needs and preferences.
○ Identifies buying patterns for better customer experience.
○ Prevents customer loss and improves business growth.
6. Better Advertising and Marketing
○ Helps businesses target the right audience.
○ Improves marketing campaigns using data insights.
○ Modifies product range based on customer demand.
7. Driving Innovation and Product Development
○ Helps create new and improved products.
○ Identifies gaps in the market for innovation.
○ Enhances product features using customer data.

Five Vs of Big Data (Detailed and Simple Explanation)

1.​ Volume (Large Amount of Data)​

○​ Big Data means handling huge amounts of data collected from different sources.
○​ This data is growing every second from social media, online transactions,
sensors, etc.
○ Example: In 2016, global mobile data traffic was about 6.2 exabytes per month, and the total volume of data worldwide was projected to reach roughly 40,000 exabytes (40 zettabytes) by 2020.
2.​ Velocity (Speed of Data Generation)​

○​ Data is being generated at an extremely high speed from various sources like
social media, machines, and IoT devices.
○​ The faster data is collected, the quicker it needs to be processed for real-time
decision-making.
○​ Example: Google handles 3.5 billion searches per day, and Facebook generates
massive amounts of posts, likes, and messages every second.
3.​ Variety (Different Types of Data)​

○​ Big Data comes in multiple formats, making it difficult to store and process in
traditional databases.
○​ Types of data include:
■​ Structured Data: Organized in tables (e.g., databases, spreadsheets).
■​ Semi-structured Data: Partially organized (e.g., JSON files, XML,
emails).
■​ Unstructured Data: No fixed format (e.g., images, videos, social media
posts).
○ Example: Text messages, audio recordings, GPS location data, and video surveillance footage all come in different formats (see the sketch after this list).
4.​ Veracity (Accuracy and Reliability of Data)​

○​ Since Big Data comes from multiple sources, some of it may be inaccurate,
inconsistent, or misleading.
○​ Ensuring data quality and trustworthiness is important for making correct
business decisions.
○​ Example: Social media posts may have fake reviews, duplicate data, or errors
that affect analysis.
5.​ Value (Extracting Useful Insights)​

○​ The main goal of Big Data is to analyze and gain meaningful insights that help in
decision-making.
○​ Organizations use Big Data to improve customer experience, detect fraud,
optimize operations, and predict future trends.
○​ Example: E-commerce websites use Big Data to recommend products based on
customer behavior, increasing sales and satisfaction.
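
To make Variety concrete, here is a minimal Python sketch (standard library only, with invented example records) that reads one structured, one semi-structured, and one unstructured item:

import csv
import io
import json

# Structured: a CSV row with a fixed schema (columns are known in advance).
csv_data = "order_id,amount\n1001,59.99\n"
structured = list(csv.DictReader(io.StringIO(csv_data)))

# Semi-structured: a JSON document; fields may vary from record to record.
semi_structured = json.loads('{"user": "alice", "tags": ["sale", "mobile"]}')

# Unstructured: free text with no fixed format; needs NLP or other processing.
unstructured = "Great phone, but the battery drains too fast!"

print(structured[0]["amount"])      # access by column name
print(semi_structured["tags"])      # access nested/optional fields
print(len(unstructured.split()))    # e.g., a crude word count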

SQL:

●​ Structured Query Language: Uses a standardized language for interacting with the database.
●​ Relational: Data is organized into tables with rows and columns, linked by
relationships.
●​ Fixed Schema: Requires a predefined structure (schema) before data can be
stored.
●​ ACID Properties: Guarantees Atomicity, Consistency, Isolation, and Durability of transactions, with a focus on data integrity (see the transaction sketch after this list).
●​ Vertical Scalability: Scaled by adding more resources (CPU, RAM) to a single
server.
●​ Best For: Applications with structured data, complex queries, and the need for
strong data consistency (e.g., financial systems, e-commerce).
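
As a concrete illustration of the ACID point above, here is a minimal sketch using Python's built-in sqlite3 module; the accounts table and the transfer amounts are invented for illustration, not taken from the text:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.commit()

try:
    with conn:  # the 'with' block commits on success, rolls back on error (atomicity)
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    print("transfer failed, changes rolled back")

print(conn.execute("SELECT id, balance FROM accounts").fetchall())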

NoSQL:

●​ Not Only SQL: Encompasses a variety of database types that don't adhere to
the relational model.
●​ Flexible Data Models: Supports various data formats like documents, key-value
pairs, graphs, etc.
●​ Flexible Schema: The schema can be dynamic or even non-existent, allowing for more flexible data structures (see the document-store sketch after this list).
●​ CAP Theorem: A distributed system can guarantee at most two of Consistency, Availability, and Partition Tolerance; NoSQL systems often prioritize availability over strict consistency.
●​ Horizontal Scalability: Scaled by adding more servers to the database cluster.
●​ Best For: Applications with large volumes of unstructured or semi-structured
data, high traffic, and evolving data requirements (e.g., social media, big data
analytics).
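
The flexible-schema idea can be sketched with nothing more than Python dictionaries acting as a toy in-memory document store; the collection and documents below are invented for illustration:

# Each record is a dictionary, and records need not share the same fields.
users = {}

def insert(doc_id, doc):
    users[doc_id] = doc

insert("u1", {"name": "Alice", "email": "alice@example.com"})
insert("u2", {"name": "Bob", "interests": ["hiking", "ml"], "premium": True})

# Queries must tolerate missing fields, unlike a fixed relational schema.
premium_users = [d["name"] for d in users.values() if d.get("premium")]
print(premium_users)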

Key Differences :

●​ Structure: SQL is structured, NoSQL is flexible.


●​ Scaling: SQL scales up, NoSQL scales out.
●​ Consistency: SQL prioritizes strong consistency, NoSQL often prioritizes
availability.
●​ Queries: SQL uses a standardized language, NoSQL uses various approaches.

When to Use SQL:

●​ Related data
●​ Data integrity is crucial
●​ Complex queries
●​ Transactions are important
●​ Structured data

When to Use NoSQL:

●​ Large volumes of data


●​ Unstructured or semi-structured data
●​ Flexible schema requirements
●​ High availability and scalability are critical
●​ Rapid development cycles

Relational Database Management System (RDBMS)

A Relational Database Management System (RDBMS) is a type of database management system that stores data in a structured format using tables (rows and columns) and manages the relationships between them. It follows the principles of the relational model proposed by E.F. Codd.

Key Features of RDBMS

1. Data Stored in Tables
○ Data is stored in tables consisting of rows (records/tuples) and columns (attributes/fields).
2. Primary Key and Foreign Key
○ A primary key uniquely identifies each record in a table.
○ A foreign key establishes relationships between tables by referencing a primary key from another table.
3. ACID Properties
○ Atomicity: Transactions are fully completed or not done at all.
○ Consistency: The database remains in a valid state before and after transactions.
○ Isolation: Transactions do not interfere with each other.
○ Durability: Once a transaction is completed, it is permanently saved.
4. Structured Query Language (SQL)
○ SQL is used for managing and querying data (e.g., SELECT, INSERT, UPDATE, DELETE).
5. Normalization
○ Reduces data redundancy and improves data integrity by organizing data efficiently.
6. Data Integrity and Security
○ Enforces constraints like NOT NULL, UNIQUE, and CHECK to maintain data accuracy (see the sketch after this list).
○ Provides user authentication and access control for security.
7. Scalability
○ Supports large amounts of data and can scale vertically by increasing hardware resources.
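
A minimal sketch of the keys and constraints mentioned above, again using Python's sqlite3 module; the tables and columns are invented for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce foreign-key constraints in SQLite

conn.execute("""
CREATE TABLE customers (
    id    INTEGER PRIMARY KEY,            -- primary key: uniquely identifies a row
    email TEXT NOT NULL UNIQUE            -- NOT NULL and UNIQUE constraints
)""")
conn.execute("""
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(id),  -- foreign key
    amount      REAL CHECK (amount > 0)   -- CHECK constraint for data accuracy
)""")

conn.execute("INSERT INTO customers (id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO orders (id, customer_id, amount) VALUES (10, 1, 25.0)")

try:
    # violates the foreign-key constraint: customer 99 does not exist
    conn.execute("INSERT INTO orders (id, customer_id, amount) VALUES (11, 99, 5.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)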

Examples of RDBMS

●​ MySQL
●​ PostgreSQL
●​ Oracle Database
●​ Microsoft SQL Server
●​ IBM Db2

RDBMS is widely used in banking, e-commerce, enterprise applications, and other structured data management systems due to its reliability and efficiency.

Distributed and Parallel Computing

Both distributed computing and parallel computing deal with executing multiple
tasks simultaneously, but they differ in how they operate and where they are used.

1. Distributed Computing
Definition:​
Distributed computing involves multiple computers (nodes) working together to solve a
problem by sharing resources and communicating over a network.

Key Features:

●​ Multiple independent systems communicate and collaborate.


●​ Tasks are distributed among different machines.
●​ Nodes operate independently and may fail without stopping the entire system.
●​ Uses message passing for communication (illustrated in the sketch below).
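
A minimal local sketch of the message-passing style: two Python processes exchange a task and a result over a pipe. In a real distributed system the pipe would be a network connection (sockets, RPC, etc.), and the workload here is invented for illustration:

from multiprocessing import Process, Pipe

def worker(conn):
    task = conn.recv()                 # receive a message describing the task
    conn.send(sum(task["numbers"]))    # send the partial result back
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=worker, args=(child_conn,))
    p.start()
    parent_conn.send({"numbers": [1, 2, 3, 4]})   # coordinator sends work
    print("partial result:", parent_conn.recv())  # coordinator collects the result
    p.join()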

Examples of Distributed Computing:

●​ Cloud computing (AWS, Google Cloud, Microsoft Azure)


●​ Blockchain technology (Bitcoin, Ethereum)
●​ Distributed file systems (HDFS in Hadoop)
2. Parallel Computing
Definition:​
Parallel computing involves executing multiple tasks simultaneously using multiple
processors within a single computer or tightly connected systems.

Key Features:

●​ Single system with multiple processors or cores.


●​ Tasks are broken into subtasks and executed in parallel.
●​ Shared memory architecture (common memory for communication).
●​ Improves performance by utilizing multiple CPUs/GPUs (see the sketch after the examples below).

Types of Parallel Computing:

1.​ Shared Memory Model – All processors access the same memory (e.g.,
OpenMP).
2.​ Distributed Memory Model – Each processor has its own memory (e.g., MPI).

Examples of Parallel Computing:

●​ Supercomputers (e.g., IBM Summit, Fugaku)


●​ Parallel processing in AI/ML using GPUs
●​ Weather forecasting simulations
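
A minimal sketch of the parallel model on a single machine, splitting an invented workload (summing squares) into subtasks that run on all available cores via Python's multiprocessing module:

from multiprocessing import Pool, cpu_count

def subtask(chunk):
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n = cpu_count()
    chunks = [data[i::n] for i in range(n)]   # split the task into independent subtasks
    with Pool(processes=n) as pool:
        partials = pool.map(subtask, chunks)  # run subtasks in parallel on multiple cores
    print("total:", sum(partials))            # combine the partial results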

Key Differences Between Distributed and Parallel Computing

Feature | Distributed Computing | Parallel Computing
System Type | Multiple independent computers | Single system with multiple processors
Memory | Separate memory for each node | Shared or distributed memory
Communication | Uses networks for communication | Uses memory for data sharing
Fault Tolerance | More fault-tolerant (node failure does not stop the system) | Less fault-tolerant (a failure can halt execution)
Example | Cloud computing, Blockchain | Supercomputers, AI processing

Both computing models are essential in handling large-scale data processing and
computation-heavy tasks, with distributed computing focusing on networked
systems and parallel computing leveraging multiple processors for speed.

Google MapReduce and Google File System (GFS) White Papers

Google introduced two foundational technologies for handling large-scale data processing: Google MapReduce and the Google File System (GFS). Both were described in research papers published by Google engineers and have influenced modern big data technologies like Hadoop and Apache Spark.

1. Google MapReduce White Paper (2004)


Title: MapReduce: Simplified Data Processing on Large Clusters​
Authors: Jeffrey Dean and Sanjay Ghemawat​
Published by: Google

Overview:

MapReduce is a programming model designed to process large-scale data sets in a distributed and parallel manner across many machines. It simplifies tasks like web indexing, log processing, and data analysis.

Working of MapReduce:

1.​ Map Phase: The input data is divided into key-value pairs and processed in
parallel by multiple nodes.
2.​ Shuffle Phase: The intermediate data is grouped and sorted based on keys.
3.​ Reduce Phase: The grouped data is aggregated and combined to produce the final output (see the word-count sketch below).
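
The three phases can be sketched on a single machine with plain Python; the two input documents are invented, and a real MapReduce/Hadoop job would run the same logic across many nodes:

from collections import defaultdict

documents = ["big data is big", "data drives decisions"]  # illustrative input

# Map phase: emit (key, value) pairs, here (word, 1)
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group intermediate values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key
result = {word: sum(counts) for word, counts in groups.items()}
print(result)   # {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}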

Key Features:

●​ Automatic parallelization across thousands of nodes.


●​ Fault tolerance using replication and re-execution of failed tasks.
●​ Scalability to process petabytes of data.
●​ Optimized data locality by processing data near where it is stored.

Impact:

●​ Inspired Hadoop MapReduce, an open-source implementation.


●​ Used in search indexing, log analysis, and big data analytics.

2. Google File System (GFS) White Paper (2003)


Title: The Google File System​
Authors: Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung​
Published by: Google

Overview:

GFS is a distributed file system designed to store and manage large amounts of
data across multiple servers efficiently.

Architecture:

●​ Master Node: Manages metadata, file system namespace, and chunk locations.
●​ Chunk Servers: Store the actual data in 64 MB chunks and handle read/write requests (see the sketch after this list).
●​ Clients: Access data using file system API requests.
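
A toy sketch of the master's chunk metadata: files are split into 64 MB chunks and each chunk is assigned to three replica servers. The server names and file size are invented for illustration, and a real GFS master does far more (leases, heartbeats, re-replication):

CHUNK_SIZE = 64 * 1024 * 1024           # 64 MB chunks
REPLICATION = 3                         # default: 3 copies of each chunk
chunk_servers = ["cs1", "cs2", "cs3", "cs4"]

def plan_chunks(file_name, file_size):
    """Return master-style metadata mapping each chunk to its replica servers."""
    num_chunks = -(-file_size // CHUNK_SIZE)   # ceiling division
    metadata = {}
    for i in range(num_chunks):
        replicas = [chunk_servers[(i + r) % len(chunk_servers)]
                    for r in range(REPLICATION)]
        metadata[f"{file_name}#chunk{i}"] = replicas
    return metadata

print(plan_chunks("weblog.txt", 200 * 1024 * 1024))   # a 200 MB file -> 4 chunks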

Key Features:
●​ High fault tolerance with data replication (default: 3 copies).
●​ Optimized for large files (GB to TB in size).
●​ Write-once, read-many model to handle high read throughput.
●​ Automatic load balancing and recovery mechanisms.

Impact:

●​ Inspired Hadoop Distributed File System (HDFS), used in big data frameworks.
●​ Used by Google for indexing and storage of web search data.

Comparison of MapReduce and GFS

Feature | Google MapReduce | Google File System (GFS)
Purpose | Parallel data processing | Distributed file storage
Key Concept | Map and Reduce functions | Large file storage with chunking
Fault Tolerance | Automatic task re-execution | Data replication across nodes
Impact | Led to Hadoop MapReduce | Led to Hadoop HDFS

These two technologies together revolutionized big data processing, forming the
foundation for modern distributed computing systems like Hadoop, Spark, and cloud
storage solutions.
Evolution of Big Data

Big Data has evolved over the years due to advancements in storage, processing, and
analytics. The evolution can be divided into different phases:

1. Pre-Big Data Era (Before 2000s)

●​ Data Volume: Limited data from structured sources (databases, spreadsheets).


●​ Storage & Processing: Traditional Relational Database Management
Systems (RDBMS) were used.
●​ Key Technologies: SQL databases, mainframes.

Limitations:

●​ Could not handle unstructured data (text, images, videos).


●​ Data processing was slow and required expensive hardware.

2. Emergence of Big Data (2000-2010)

●​ Data Growth: Explosion of digital data due to the internet, social media, and
e-commerce.
●​ Challenges: RDBMS could not handle the volume, variety, and velocity of data.
●​ Key Innovations:
○​ Google File System (GFS) & MapReduce (2003-2004) – Enabled
distributed data storage and processing.
○​ Hadoop (2006) – Open-source framework inspired by GFS and
MapReduce.
○​ NoSQL Databases (2009-2010) – MongoDB, Cassandra for handling
unstructured data.

Impact:

●​ Enabled large-scale data analytics, real-time processing, and cloud-based storage.

3. Modern Big Data Era (2010-Present)


●​ Data Explosion: Social media, IoT, cloud computing, AI generate massive data.
●​ Advanced Processing: Faster and more efficient frameworks like Apache
Spark (2014) replaced Hadoop MapReduce.
●​ Key Technologies:
○​ Cloud Computing (AWS, Google Cloud, Azure) for scalable storage.
○​ Machine Learning & AI for predictive analytics.
○​ Edge Computing & IoT for real-time data processing.

Current Trends:

●​ Data Lakes & Warehouses (Snowflake, Delta Lake) for centralized storage.
●​ Streaming Analytics (Kafka, Flink) for real-time data processing.
●​ Privacy & Security (GDPR, data encryption) to protect user data.

Future of Big Data

●​ Quantum Computing for even faster processing.


●​ Federated Learning for decentralized AI-driven insights.
●​ Ethical AI & Data Governance to ensure fair data usage.

Big Data continues to evolve with advancements in AI, cloud computing, and
cybersecurity, making data-driven decision-making more powerful.

Comparison of Google’s White Paper Technologies and Current Big Data Technologies

Google’s MapReduce and Google File System (GFS) white papers introduced
foundational technologies for Big Data processing. These have evolved over time,
leading to modern, faster, and more efficient solutions.

1. Storage Systems

Google White Paper Technology | Current Technology
Google File System (GFS) (2003) - Distributed storage system that breaks files into chunks and stores them across multiple machines. | Hadoop Distributed File System (HDFS) - Open-source version inspired by GFS, widely used in Big Data.
Bigtable (2006) - NoSQL database built on GFS, optimized for scalability. | Apache HBase, Cassandra - Modern NoSQL databases inspired by Bigtable, handling massive real-time workloads.
Colossus (next-gen GFS, 2010s) - Google's internal distributed storage with improved speed and reliability. | Cloud object storage (AWS S3, Google Cloud Storage, Azure Blob Storage) - Scalable storage with high availability.

2. Processing Frameworks

Google White Paper Technology | Current Technology
MapReduce (2004) - Batch processing framework dividing tasks into "Map" and "Reduce" phases. | Apache Spark (2014) - Faster, in-memory processing, replacing MapReduce in most use cases.
Pregel (2010) - Google’s graph processing framework for large-scale graphs. | Apache Giraph, GraphX - Open-source alternatives for large-scale graph analytics.
Dremel (2010) - Columnar storage and processing for fast queries. | Apache Drill, Presto, BigQuery - Cloud-based and open-source tools inspired by Dremel.

3. Query & Analytics

Google White Paper Technology | Current Technology
Sawzall (2003) - Google's early query language for log processing. | SQL-based tools (Presto, Trino, BigQuery) - More flexible and scalable alternatives.
Google BigQuery (2010s) - Cloud-based real-time analytics platform. | Modern data warehouses (Snowflake, Amazon Redshift, Databricks SQL) - More advanced, multi-cloud capabilities.

4. Stream Processing

Google White Paper Technology | Current Technology
MillWheel (2013) - Google’s real-time stream processing engine. | Apache Flink, Kafka Streams, Spark Streaming - Open-source, widely adopted streaming platforms.

Key Advancements in Current Technologies

✅ In-Memory Processing: Apache Spark can be up to 100x faster than MapReduce for in-memory workloads (see the PySpark sketch below).
✅ Real-Time Processing: Kafka, Flink, and Spark Streaming replaced batch-oriented systems.
✅ Cloud & Serverless Solutions: BigQuery, Snowflake, and AWS Lambda provide scalable, low-maintenance alternatives.
✅ AI & ML Integration: TensorFlow and PyTorch enable predictive analytics on Big Data.
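
For a sense of the modern API style, here is a minimal PySpark word-count sketch. It assumes a local PySpark installation and a hypothetical input file logs.txt, and is only an illustration of the in-memory, MapReduce-style workflow:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

counts = (spark.sparkContext.textFile("logs.txt")     # hypothetical input file
          .flatMap(lambda line: line.split())          # "map" side: emit words
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))            # "reduce" side: sum counts

print(counts.take(10))
spark.stop()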

Conclusion

Google’s research laid the foundation for Big Data. However, modern technologies have
improved efficiency, speed, and scalability, making Big Data processing more
accessible and cost-effective.
