Big Data
Applications of Big Data
Big Data is used across industries to improve efficiency, gain insights, and develop new products.
● Public Health Monitoring: Tracks disease outbreaks and improves healthcare data
sharing.
● Consumer Sentiment Monitoring: Analyzes social media data to understand customer
opinions.
● Asset Tracking: Uses RFID tags to prevent counterfeiting and theft in industries like
defense and retail.
● Supply Chain Monitoring: Tracks inventory movement using RFID, ensuring timely
product delivery.
● Preventive Machine Maintenance: Uses sensors to predict equipment failures and
reduce downtime.
● Predictive Policing: Identifies crime hotspots to help law enforcement prevent future
crimes.
● Winning Political Elections: Uses voter data analysis to target potential supporters and
optimize campaigns.
● Personal Health: AI-based medical analysis improves diagnosis and treatment
recommendations.
The 5 V's of Big Data
1. Volume (Amount of Data)
○ Big Data means handling huge amounts of data collected from different sources.
○ This data is growing every second from social media, online transactions,
sensors, etc.
○ Example: In 2016, global mobile traffic was about 6.2 exabytes per month, and the world's total volume of digital data was projected to reach roughly 40,000 exabytes (40 zettabytes) by 2020.
2. Velocity (Speed of Data Generation)
○ Data is being generated at an extremely high speed from various sources like
social media, machines, and IoT devices.
○ The faster data is collected, the quicker it needs to be processed for real-time
decision-making.
○ Example: Google handles 3.5 billion searches per day, and Facebook generates
massive amounts of posts, likes, and messages every second.
3. Variety (Different Types of Data)
○ Big Data comes in multiple formats, making it difficult to store and process in
traditional databases.
○ Types of data include:
■ Structured Data: Organized in tables (e.g., databases, spreadsheets).
■ Semi-structured Data: Partially organized (e.g., JSON files, XML,
emails).
■ Unstructured Data: No fixed format (e.g., images, videos, social media
posts).
○ Example: Text messages, audio recordings, GPS location data, and video
surveillance footage all come in different formats.
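As a rough illustration, here is a small Python sketch (with invented sample records) showing how the three kinds of data are typically handled:

import csv, json, io

# Structured: fixed rows and columns (e.g., a table exported as CSV)
structured = io.StringIO("id,name,amount\n1,Alice,250\n2,Bob,125\n")
rows = list(csv.DictReader(structured))      # every row has the same known fields
print(rows[0]["name"])                       # -> Alice

# Semi-structured: self-describing but flexible (e.g., JSON); fields can vary per record
semi = '{"user": "carol", "likes": 42, "tags": ["sale", "promo"]}'
doc = json.loads(semi)
print(doc.get("tags", []))                   # -> ['sale', 'promo']

# Unstructured: no predefined schema (free text, images, video); needs custom processing
unstructured = "Loved the product!!! will buy again :)"
print(len(unstructured.split()))             # a crude word count as a first analysis step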
4. Veracity (Accuracy and Reliability of Data)
○ Since Big Data comes from multiple sources, some of it may be inaccurate,
inconsistent, or misleading.
○ Ensuring data quality and trustworthiness is important for making correct
business decisions.
○ Example: Social media posts may have fake reviews, duplicate data, or errors
that affect analysis.
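A minimal Python sketch of basic data-quality checks (the review records below are invented for illustration): duplicates and obviously invalid entries are dropped before analysis.

# Hypothetical raw review records: duplicates and a malformed rating slip in
raw_reviews = [
    {"user": "a1", "rating": 5, "text": "Great"},
    {"user": "a1", "rating": 5, "text": "Great"},    # exact duplicate
    {"user": "b2", "rating": 11, "text": "Best!!"},   # rating outside the 1-5 range
    {"user": "c3", "rating": 4, "text": "Good value"},
]

seen = set()
clean = []
for r in raw_reviews:
    key = (r["user"], r["text"])
    if key in seen:                   # drop duplicate submissions
        continue
    if not 1 <= r["rating"] <= 5:     # drop records that fail a basic sanity check
        continue
    seen.add(key)
    clean.append(r)

print(len(clean))   # -> 2 trustworthy records out of 4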
5. Value (Extracting Useful Insights)
○ The main goal of Big Data is to analyze and gain meaningful insights that help in
decision-making.
○ Organizations use Big Data to improve customer experience, detect fraud,
optimize operations, and predict future trends.
○ Example: E-commerce websites use Big Data to recommend products based on
customer behavior, increasing sales and satisfaction.
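As a toy example of that recommendation idea, the Python sketch below counts which products are bought together in a handful of invented orders; real systems use far more data and more sophisticated models.

from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: each inner list is one customer's order
orders = [
    ["phone", "case", "charger"],
    ["phone", "case"],
    ["laptop", "mouse"],
    ["phone", "charger"],
]

# Count how often each pair of products is bought together
pair_counts = Counter()
for order in orders:
    for a, b in combinations(sorted(set(order)), 2):
        pair_counts[(a, b)] += 1

# Recommend the items most often bought alongside "phone"
related = Counter()
for (a, b), n in pair_counts.items():
    if a == "phone":
        related[b] += n
    elif b == "phone":
        related[a] += n
print(related.most_common(2))   # e.g. [('case', 2), ('charger', 2)]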
SQL:
● Structured Query Language: Used with relational databases that follow the relational (table-based) model.
● Tabular Data Model: Data is stored in tables of rows and columns with predefined relationships.
● Fixed Schema: The structure of tables must be defined up front and changed explicitly.
● ACID Properties: Prioritizes consistency through Atomicity, Consistency, Isolation, and Durability.
● Vertical Scalability: Typically scaled by adding more power (CPU, RAM) to a single server.
● Best For: Applications with structured, related data, complex queries, and strict integrity requirements (e.g., banking, inventory systems).
NoSQL:
● Not Only SQL: Encompasses a variety of database types that don't adhere to
the relational model.
● Flexible Data Models: Supports various data formats like documents, key-value
pairs, graphs, etc.
● Flexible Schema: Schema can be dynamic or even non-existent, allowing for
more flexible data structures.
● CAP Theorem: Focuses on Consistency, Availability, and Partition Tolerance.
Often prioritizes availability over absolute consistency.
● Horizontal Scalability: Scaled by adding more servers to the database cluster.
● Best For: Applications with large volumes of unstructured or semi-structured
data, high traffic, and evolving data requirements (e.g., social media, big data
analytics).
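A minimal sketch of the flexible, document-based model, assuming the pymongo package and a local MongoDB server on the default port (MongoDB is one of the NoSQL databases mentioned later in these notes):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
posts = client["social_app"]["posts"]   # database and collection are created lazily

# Flexible schema: the two documents below have different fields, and no table
# definition or ALTER statement is needed beforehand.
posts.insert_one({"user": "alice", "text": "Hello!", "likes": 3})
posts.insert_one({"user": "bob", "video_url": "http://example.com/v1", "tags": ["travel"]})

print(posts.find_one({"user": "bob"}))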
Key Differences – SQL is generally the better choice when:
● The data is highly related (many relationships between entities)
● Data integrity is crucial
● Complex queries are needed
● Transactions are important
● The data is structured
○ SQL is used for managing and querying data (e.g., SELECT, INSERT,
UPDATE, DELETE).
5. Normalization
○ Organizes data into separate, related tables to reduce redundancy and keep it consistent (see the sketch below).
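A small sketch using Python's built-in sqlite3 module with an invented two-table schema; it shows the four statements mentioned above and how normalization keeps customer details in one table that orders merely reference:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalization: customer details live in one table; orders reference them by key,
# so a customer's name is stored once instead of being repeated on every order.
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
            "amount REAL, FOREIGN KEY (customer_id) REFERENCES customers(id))")

# The four core statements: INSERT, UPDATE, SELECT, DELETE
cur.execute("INSERT INTO customers (name) VALUES (?)", ("Alice",))
cur.execute("INSERT INTO orders (customer_id, amount) VALUES (?, ?)", (1, 99.50))
cur.execute("UPDATE orders SET amount = ? WHERE id = ?", (89.50, 1))
cur.execute("SELECT c.name, o.amount FROM orders o JOIN customers c ON o.customer_id = c.id")
print(cur.fetchall())                    # -> [('Alice', 89.5)]
cur.execute("DELETE FROM orders WHERE id = ?", (1,))
conn.commit()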
Examples of RDBMS
● MySQL
● PostgreSQL
● Oracle Database
● Microsoft SQL Server
● IBM Db2
Distributed Computing vs. Parallel Computing
Both distributed computing and parallel computing deal with executing multiple
tasks simultaneously, but they differ in how they operate and where they are used.
1. Distributed Computing
Definition:
Distributed computing involves multiple computers (nodes) working together to solve a
problem by sharing resources and communicating over a network.
Key Features:
● Multiple independent machines (nodes), each with its own memory, cooperate over a network to complete a task.
2. Parallel Computing
Definition:
Parallel computing uses multiple processors or cores within a single system to execute parts of a task simultaneously.
Key Features:
1. Shared Memory Model – All processors access the same memory (e.g.,
OpenMP).
2. Distributed Memory Model – Each processor has its own memory (e.g., MPI).
Communication: distributed computing uses networks for communication, while parallel computing uses memory for data sharing.
Both computing models are essential in handling large-scale data processing and
computation-heavy tasks, with distributed computing focusing on networked
systems and parallel computing leveraging multiple processors for speed.
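A rough single-machine illustration of parallel computing using Python's standard multiprocessing module: the work is split into chunks that separate worker processes compute simultaneously (each worker has its own memory, so this is closer to the message-passing style than to OpenMP-style shared memory).

from multiprocessing import Pool

def partial_sum(chunk):
    """Each worker computes its share of the overall result independently."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    # Split the input into 4 chunks, one per worker process
    chunks = [data[i::4] for i in range(4)]
    with Pool(processes=4) as pool:
        results = pool.map(partial_sum, chunks)   # chunks are processed in parallel
    print(sum(results))                           # same answer as a sequential sum of squares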
MapReduce
Overview:
MapReduce is a programming model for processing large datasets in parallel across many machines, in three phases:
1. Map Phase: The input data is divided into key-value pairs and processed in
parallel by multiple nodes.
2. Shuffle Phase: The intermediate data is grouped and sorted based on keys.
3. Reduce Phase: The grouped data is aggregated and combined to produce the
final output.
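A minimal, single-machine Python simulation of the three phases for the classic word-count example (a real MapReduce framework would spread this work across many nodes):

from collections import defaultdict

documents = ["big data is big", "data drives decisions"]

# Map phase: emit (key, value) pairs - here (word, 1) for every word
mapped = []
for doc in documents:
    for word in doc.split():
        mapped.append((word, 1))

# Shuffle phase: group all values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group into the final result
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # -> {'big': 2, 'data': 2, 'is': 1, 'drives': 1, 'decisions': 1}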
Key Features:
● Automatic parallelization and distribution of work across the nodes of a cluster.
● Fault tolerance: failed tasks are re-executed on other nodes.
● Runs on large clusters of commodity hardware.
Impact:
● Inspired Hadoop MapReduce and became the standard model for large-scale batch data processing.
Google File System (GFS)
Overview:
GFS is a distributed file system designed to store and manage large amounts of
data across multiple servers efficiently.
Architecture:
● Master Node: Manages metadata, file system namespace, and chunk locations.
● Chunk Servers: Store actual data in 64 MB chunks and handle read/write
requests.
● Clients: Access data using file system API requests.
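A simplified Python sketch of the read path this architecture implies; the class, file, and server names here are hypothetical, not the real GFS API. The client asks the master which servers hold the chunk containing a given byte offset, then reads directly from one of those chunk servers.

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in GFS

class Master:
    """Holds only metadata: which chunk of which file lives on which servers."""
    def __init__(self):
        # (file name, chunk index) -> list of chunk server addresses (3 replicas by default)
        self.chunk_locations = {
            ("webpages.dat", 0): ["cs-01", "cs-07", "cs-12"],
            ("webpages.dat", 1): ["cs-03", "cs-05", "cs-09"],
        }

    def locate(self, filename, offset):
        chunk_index = offset // CHUNK_SIZE
        return chunk_index, self.chunk_locations[(filename, chunk_index)]

master = Master()
# Client: metadata comes from the master, actual bytes from a chunk server replica
chunk_index, replicas = master.locate("webpages.dat", offset=70 * 1024 * 1024)
print(chunk_index, replicas)   # offset 70 MB falls in chunk 1 -> ['cs-03', 'cs-05', 'cs-09']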
Key Features:
● High fault tolerance with data replication (default: 3 copies).
● Optimized for large files (GB to TB in size).
● Write-once, read-many model to handle high read throughput.
● Automatic load balancing and recovery mechanisms.
Impact:
● Inspired Hadoop Distributed File System (HDFS), used in big data frameworks.
● Used by Google for indexing and storage of web search data.
These two technologies together revolutionized big data processing, forming the
foundation for modern distributed computing systems like Hadoop, Spark, and cloud
storage solutions.
Evolution of Big Data
Big Data has evolved over the years due to advancements in storage, processing, and
analytics. The evolution can be divided into different phases:
Phase 1: Traditional Databases
● Data was mostly structured and managed with relational databases (RDBMS).
● Limitations: these systems scaled poorly as data grew in size and variety.
Phase 2: Rise of Big Data Technologies
● Data Growth: Explosion of digital data due to the internet, social media, and
e-commerce.
● Challenges: RDBMS could not handle the volume, variety, and velocity of data.
● Key Innovations:
○ Google File System (GFS) & MapReduce (2003-2004) – Enabled
distributed data storage and processing.
○ Hadoop (2006) – Open-source framework inspired by GFS and
MapReduce.
○ NoSQL Databases (2009-2010) – MongoDB, Cassandra for handling
unstructured data.
Impact:
● Distributed storage and processing on commodity hardware made large-scale analytics affordable and widely adopted.
Phase 3: Modern Big Data (Cloud, AI, Real-Time Analytics)
Current Trends:
● Data Lakes & Warehouses (Snowflake, Delta Lake) for centralized storage.
● Streaming Analytics (Kafka, Flink) for real-time data processing.
● Privacy & Security (GDPR, data encryption) to protect user data.
Big Data continues to evolve with advancements in AI, cloud computing, and
cybersecurity, making data-driven decision-making more powerful.
Google’s MapReduce and Google File System (GFS) white papers introduced
foundational technologies for Big Data processing. These have evolved over time,
leading to modern, faster, and more efficient solutions.
1. Storage Systems
● GFS inspired the Hadoop Distributed File System (HDFS); centralized storage has since moved toward data lakes and cloud warehouses (e.g., Snowflake, Delta Lake).
2. Processing Frameworks
● MapReduce-style batch processing evolved into faster, in-memory frameworks such as Apache Spark.
4. Stream Processing
● Tools such as Kafka and Flink process data continuously in real time rather than only in batches.
Conclusion
Google’s research laid the foundation for Big Data. However, modern technologies have
improved efficiency, speed, and scalability, making Big Data processing more
accessible and cost-effective.