Big-Data Final
What is HDFS?
The Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications
This storage layer of Hadoop is responsible for storing and managing data across a cluster.
It is designed to handle large datasets by breaking them into smaller blocks and
distributing them across multiple machines.
The data blocks are replicated to ensure fault tolerance and high availability
Hadoop runs code across a cluster of computers. This process includes the following core tasks
that Hadoop performs –
Data is initially divided into directories and files. Files are divided into uniform-sized blocks of
128 MB or 64 MB (preferably 128 MB).
These files are then distributed across various cluster nodes for further processing.
HDFS, being on top of the local file system, supervises the processing.
Blocks are replicated for handling hardware failure.
Checking that the code was executed successfully.
Performing the sort that takes place between the map and reduce stages.
Sending the sorted data to a certain computer.
Writing the debugging logs for each job.
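A minimal sketch (in Python, not from the original notes) of how these steps look from the client side, assuming a running Hadoop cluster with the hdfs command on the PATH; the file and directory names are made up:

import subprocess

def run(cmd):
    # Run an HDFS shell command and fail loudly if it returns an error.
    print("$", " ".join(cmd))
    subprocess.run(cmd, check=True)

# Copy a (hypothetical) local file into HDFS; HDFS splits it into 128 MB
# blocks and replicates each block across the cluster's DataNodes.
run(["hdfs", "dfs", "-put", "sales_2023.csv", "/data/sales_2023.csv"])

# List the file and print its block and replica locations.
run(["hdfs", "dfs", "-ls", "/data"])
run(["hdfs", "fsck", "/data/sales_2023.csv", "-files", "-blocks", "-locations"])

# Set the replication factor of the file to 3 (a common default) and wait for it.
run(["hdfs", "dfs", "-setrep", "-w", "3", "/data/sales_2023.csv"])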
Evolution of Big Data: The major milestones in the evolution of Big Data are described below:
Data Warehousing: In the 1990s, data warehousing emerged as a solution to store and
analyze large volumes of structured data.
Hadoop: Introduced in 2006, Hadoop is an open-source framework that provides distributed
storage and large-scale data processing.
NoSQL Databases: In 2009, NoSQL databases were introduced, which provide a flexible
way to store and retrieve unstructured data.
Cloud Computing: Cloud computing lets companies store their important data in remote data
centers, which saves infrastructure and maintenance costs.
Machine Learning: Machine learning algorithms analyze huge amounts of data to extract
meaningful insights, which has led to the development of artificial intelligence (AI) applications.
2. Database Commands:
Document Manipulation: Commands like insert, update, and delete allow you to create,
modify, and remove documents in collections.
Index Management: MongoDB commands enable you to create, drop, and manage
indexes on collections to optimize query performance.
Data Backup and Restoration: Commands like mongodump and mongorestore allow you
to create backups of databases or collections and restore them when needed.
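A minimal sketch of these commands in Python (assumes a MongoDB server on localhost:27017 and the pymongo driver; the database, collection, and field names are made up):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
users = client["shop"]["users"]

# Document manipulation: create, modify, and remove documents.
users.insert_one({"name": "Alice", "age": 31, "city": "Pune"})
users.update_one({"name": "Alice"}, {"$set": {"age": 32}})
users.delete_one({"name": "Alice"})

# Index management: create and drop an index to speed up queries on "city".
users.create_index([("city", ASCENDING)], name="city_idx")
users.drop_index("city_idx")

# Backup and restoration use the separate command-line tools, e.g.:
#   mongodump --db shop --out /backups/shop
#   mongorestore /backups/shop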
Explain the different tools to extract Big Data
a. Apache Hadoop:
Distributed processing framework for big data.
Utilizes the Hadoop Distributed File System (HDFS) for storing and processing large
datasets across clusters of computers.
b. Hive:
Hive is an open-source big data tool built on top of Hadoop.
It allows programmers to analyze large datasets on Hadoop using SQL-like (HiveQL) queries.
It helps with querying and managing large datasets quickly (see the sketch after this list).
c. MongoDB:
MongoDB is an open-source NoSQL database that is cross-platform compatible and comes
with many built-in features.
It is ideal for businesses that need fast, real-time data for instant decisions.
It is also well suited to users who want a data-driven experience.
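A minimal sketch of a Hive query from Python, assuming a HiveServer2 instance at localhost:10000 and the PyHive package; the sales table and its columns are hypothetical:

from pyhive import hive

conn = hive.Connection(host="localhost", port=10000, username="hadoop")
cursor = conn.cursor()

# HiveQL looks like SQL but is compiled into jobs that run on the Hadoop cluster.
cursor.execute(
    "SELECT city, COUNT(*) AS orders "
    "FROM sales "
    "GROUP BY city "
    "ORDER BY orders DESC "
    "LIMIT 10"
)
for city, orders in cursor.fetchall():
    print(city, orders)

cursor.close()
conn.close()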
1. Key-Value Stores:
Stores data as simple key-value pairs.
Provides high performance and scalability.
Suitable for caching, session management, and user profiles.
Examples: Redis, Riak, Amazon DynamoDB (see the Redis sketch after this list).
2. Document Databases:
Stores data in semi-structured documents (e.g., JSON, XML).
Offers flexibility with dynamic schemas.
Handles complex, hierarchical, and evolving data structures.
Examples: MongoDB, Couchbase, Apache CouchDB.
3. Column-Family Stores:
Stores data in column families or column groups.
Optimized for write-heavy workloads.
Efficiently handles vast amounts of data.
Used for large-scale data analytics and time-series data storage.
Examples: Apache Cassandra, Apache HBase.
4. Graph Databases:
Stores and represents relationships between entities.
Utilizes graph structures with nodes and edges.
Excellent for complex data modeling and graph traversals.
Ideal for social networks, recommendation systems, and fraud detection.
Examples: Neo4j, Amazon Neptune, ArangoDB
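A minimal Python sketch contrasting key-value access (Redis) with document querying (MongoDB), matching the Querying row of the comparison below; it assumes local Redis and MongoDB servers plus the redis and pymongo packages, and all keys, fields, and values are made up:

import json
import redis
from pymongo import MongoClient

# Key-value store: data is fetched by key only, which keeps lookups very fast.
r = redis.Redis(host="localhost", port=6379, db=0)
r.set("session:42", json.dumps({"user": "alice", "cart": ["book", "pen"]}))
r.expire("session:42", 1800)   # typical caching / session-management pattern
session = json.loads(r.get("session:42"))

# Document store: documents can be queried by their contents, not just a key.
profiles = MongoClient("mongodb://localhost:27017")["app"]["profiles"]
profiles.insert_one({"user": "alice", "age": 31, "interests": ["ml", "iot"]})
for doc in profiles.find({"age": {"$gt": 30}, "interests": "ml"}):
    print(doc["user"])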
Key-value Database vs Document Database

Feature | Key-value Database | Document Database
Querying | Limited query capabilities (usually based on key lookup) | Advanced querying using query languages or APIs
Scalability | Excellent scalability and performance for read-heavy workloads | Good scalability for both read- and write-heavy workloads
Consistency | Eventual consistency (replication delay can lead to data inconsistencies) | Eventual consistency (replication delay can lead to data inconsistencies)

Hadoop vs RDBMS

Feature | Hadoop | RDBMS
Efficiency | Works better when the data size is big (i.e., in terabytes and petabytes) | Works better when the volume of data is low (in gigabytes)
Latency / Response time | High latency (batch-oriented processing) | Low latency compared to Hadoop (interactive queries)
Speed | Writes are fast | Reads are fast