
Name: Ayanava Chatterjee

Roll: 144001200127
Sem: 8th SEM
Sub: Big Data
Topic: Big Data with HADOOP
Introduction to Big Data

Definition of Big Data


Characteristics: Volume, Velocity, Variety, Veracity, and
Value
Importance of Big Data in today’s world
Applications of Big Data (e.g., healthcare, finance,
marketing)
Challenges of Big Data

Data Storage and Management


Data Processing Speed
Data Analysis Complexity
Data Security and Privacy Concerns
Hadoop Ecosystem
Components
HDFS (Hadoop Distributed File System): storage layer of Hadoop
MapReduce: programming model for processing large datasets
YARN (Yet Another Resource Negotiator): resource management layer
Hive, Pig, HBase, Sqoop, Flume, Oozie, etc.: tools and frameworks for data processing, querying, and management
How Hadoop Works

 Data storage in HDFS


 Data processing using MapReduce
 Resource management using YARN
 Fault tolerance and scalability in Hadoop
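
To make the storage-plus-processing flow concrete, below is a minimal sketch of the classic word-count job written against Hadoop's Java MapReduce API. The input and output paths are placeholders passed on the command line; cluster settings are assumed to come from the site configuration.

// WordCount.java: count word occurrences across files stored in HDFS.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map phase: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

YARN schedules the map tasks on nodes that hold the input blocks (data locality), and the reduce tasks aggregate the shuffled partial counts.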
Advantages of Hadoop

 Scalability: Easily scales to handle petabytes of data


 Fault tolerance: Data is replicated across different nodes
 Cost-effectiveness: Runs on commodity hardware
 Flexibility: Supports various data formats (structured, semi-
structured, unstructured)
Hadoop Use Cases

 Social Media Data Analysis


 E-Commerce & Recommendation Systems
 Real-Time Analytics
 Data Warehousing
 Internet of Things (IoT) Data Management
Hadoop vs Traditional Databases

Comparison of Hadoop and relational databases (RDBMS):
1) Scalability
2) Flexibility
3) Cost
4) Speed
5) Use cases
Future of Hadoop and Big Data

Integration with Cloud Computing


Advancements in Machine Learning and AI
More adoption in industries like finance, healthcare, and
transportation
Big Data Technologies Overview

Overview of popular Big Data tools and frameworks:


Apache Spark: Fast, in-memory data processing
Apache Flink: Stream processing
Apache Kafka: Real-time data streaming
NoSQL Databases (MongoDB, Cassandra, etc.)
Elasticsearch: Search and analytics engine
Hadoop vs Spark

Apache Spark:
In-memory processing, far faster than disk-based MapReduce
Real-time stream processing
More user-friendly APIs for data analytics and machine learning
Hadoop MapReduce:
Batch processing
Slower due to disk-based storage
Best for large-scale batch jobs
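
For contrast, here is a sketch of the same word count in Spark's Java API. Intermediate data stays in memory between transformations instead of being written to disk between the map and reduce stages, which is where most of the speedup comes from. Paths are placeholders.

// SparkWordCount.java: in-memory word count with Spark's Java API.
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class SparkWordCount {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("word count").getOrCreate();
    JavaRDD<String> lines = spark.read().textFile(args[0]).javaRDD();
    JavaPairRDD<String, Integer> counts = lines
        .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split into words
        .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1) pairs
        .reduceByKey(Integer::sum);                                    // sum per word
    counts.saveAsTextFile(args[1]);
    spark.stop();
  }
}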
Hadoop Distributed File System
(HDFS)

HDFS Overview:
Designed for storing large files across multiple machines
Data replication for fault tolerance
High throughput access to data
HDFS Architecture:
NameNode: Manages metadata and file structure
DataNodes: Store the actual data blocks
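
A brief sketch of how a client program talks to HDFS through the Java FileSystem API; the NameNode address, path, and file content here are illustrative.

// HdfsExample.java: write a file to HDFS and inspect its replication.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "hdfs://namenode:9000");  // illustrative address
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/data/example.txt");
    try (FSDataOutputStream out = fs.create(file)) {   // NameNode records the metadata,
      out.writeUTF("hello hdfs");                      // DataNodes store the blocks
    }
    // Replication factor shows how many DataNodes hold each block
    System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
    fs.close();
  }
}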
YARN (Yet Another Resource
Negotiator)

YARN’s Role in the Hadoop Ecosystem:


Resource management and job scheduling
Enables multiple applications to run on a single Hadoop cluster
Manages and allocates resources dynamically
YARN Components:
ResourceManager: Manages resources across the cluster
NodeManager: Runs on each node and manages resources on that node
Hadoop Ecosystem: Hive

Apache Hive:
A data warehouse system built on top of Hadoop
Provides SQL-like querying capabilities for Hadoop
Supports ETL operations and batch processing
Hive Architecture:
Metastore: Stores schema information
Query Compiler: Converts SQL queries into MapReduce jobs
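
A sketch of issuing a Hive query from Java over JDBC, assuming a running HiveServer2 and the Hive JDBC driver on the classpath; the host, credentials, and web_logs table are invented for illustration.

// HiveQueryExample.java: run a SQL-like query against Hive over JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // HiveServer2 JDBC URL; host, port, and database are illustrative
    String url = "jdbc:hive2://hiveserver:10000/default";
    try (Connection conn = DriverManager.getConnection(url, "user", "");
         Statement stmt = conn.createStatement();
         // Hive's compiler turns this query into jobs executed on the cluster
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS hits FROM web_logs GROUP BY page")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
      }
    }
  }
}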
Hadoop Ecosystem: HBase

HBase:
A NoSQL database built on top of HDFS
Provides random read/write access to large datasets
Scalable and distributed architecture
Use cases: Real-time analytics, serving large-scale data applications
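
A sketch of the random read/write access HBase provides, using its Java client; the users table and info column family are invented for illustration.

// HBaseExample.java: write one cell to an HBase table, then read it back.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("users"))) {
      // Write: row key, column family, qualifier, value
      Put put = new Put(Bytes.toBytes("user-42"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
      table.put(put);

      // Random read of the same row by key
      Result result = table.get(new Get(Bytes.toBytes("user-42")));
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(name));
    }
  }
}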
Hadoop Ecosystem: Pig

Apache Pig:
A high-level platform for creating MapReduce programs
Uses Pig Latin, a scripting language that simplifies data processing
Pig vs MapReduce:
Pig is easier to write, but MapReduce is more flexible for complex
workflows
Ideal for ETL (Extract, Transform, Load) tasks
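
A sketch of an ETL-style pipeline driven from Java through PigServer, which lets a program register Pig Latin statements directly; the paths and field names are invented for illustration.

// PigEtlExample.java: load, filter, group, and store with embedded Pig Latin.
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigEtlExample {
  public static void main(String[] args) throws Exception {
    PigServer pig = new PigServer(ExecType.MAPREDUCE);
    pig.registerQuery("logs = LOAD '/data/logs' AS (level:chararray, msg:chararray);");
    pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
    pig.registerQuery("grouped = GROUP errors BY level;");
    pig.registerQuery("counts = FOREACH grouped GENERATE group, COUNT(errors);");
    pig.store("counts", "/data/error_counts");  // compiles the pipeline to MapReduce
  }
}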
Hadoop Ecosystem: Sqoop and
Flume

Sqoop:
Designed for importing and exporting data between Hadoop and
relational databases
Used for batch processing tasks
Flume:
Collects and aggregates large amounts of log data
Streams data in real-time to Hadoop HDFS
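
A typical Sqoop import invocation might look like the following sketch; the JDBC URL, credentials, table, and target directory are placeholders.

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table orders \
  --target-dir /data/orders \
  --num-mappers 4

Sqoop runs the transfer as parallel map tasks (four here), writing the rows into HDFS.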
Security in Hadoop

Authentication:
Kerberos: A network authentication protocol to secure access to Hadoop
services
Authorization:
Apache Ranger: Provides centralized access control and policy
management
Data Encryption:
Encrypt data at rest (HDFS) and in transit (between components)
Auditing:
Track user access and behavior with auditing tools
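
As a sketch, the core-site.xml settings below switch a cluster to Kerberos authentication and enable service-level authorization; per-service principals and keytab files are configured separately.

<!-- core-site.xml fragment: enable Kerberos and authorization checks -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>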
Real-World Big Data Use Cases

Healthcare:

Predictive analytics for patient outcomes


Managing electronic health records (EHRs) and medical research

Finance:
Fraud detection in real-time financial transactions

High-frequency trading analysis

Telecommunications:

Network performance monitoring and predictive maintenance

Customer churn prediction and service optimization

E-Commerce:

Real-time personalized recommendations

Fraud detection and customer behavior analysis


Hadoop in the Cloud

 Cloud Platforms Supporting Hadoop:


 Amazon EMR (Elastic MapReduce)
 Google Cloud Dataproc
 Microsoft Azure HDInsight
 Benefits of Cloud Hadoop:
 Scalability without infrastructure management
 Pay-per-use model for computing resources
 Easy integration with other cloud services like storage and analytics tools
Hadoop Ecosystem: Oozie

Oozie:
A workflow scheduler system for managing Hadoop jobs
Supports complex job workflows, such as MapReduce, Hive, and Pig
Key Features:
Error handling
Job scheduling and dependency management
Integration with other components like HDFS, HBase, and Hive
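
A sketch of an Oozie workflow definition with a single Hive action and basic error handling; the workflow name, script, and parameters are invented for illustration.

<!-- workflow.xml: one Hive action with ok/error transitions -->
<workflow-app xmlns="uri:oozie:workflow:0.5" name="daily-etl">
  <start to="run-hive"/>
  <action name="run-hive">
    <hive xmlns="uri:oozie:hive-action:0.5">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <script>etl.q</script>
    </hive>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>ETL failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>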
Hadoop Performance Tuning

 Techniques to optimize Hadoop performance:


 Data Locality: Ensuring tasks are executed on nodes where the data
resides
 Compression: Reducing the size of data being stored and transmitted
 Caching: Storing frequently accessed data in memory to speed up tasks
 Increasing Parallelism: Splitting tasks into smaller units and running them
concurrently
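
As a sketch, here are two of these techniques applied when configuring a MapReduce job in Java; the Snappy codec and the reducer count are illustrative starting points, not prescriptions (Snappy also requires the native library to be installed).

// TunedJob.java: compress intermediate map output and raise parallelism.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class TunedJob {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Compress map output to cut shuffle traffic between nodes
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
                  SnappyCodec.class, CompressionCodec.class);
    Job job = Job.getInstance(conf, "tuned job");
    job.setNumReduceTasks(8);  // more reducers -> more reduce-side parallelism
    return job;
  }
}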
Machine Learning with Hadoop

MLlib (Apache Spark) and Mahout (Apache Hadoop):


Machine learning libraries for large-scale data processing
Algorithms for classification, regression, clustering, and recommendation
systems
Use Case: Predictive modeling on big datasets, fraud detection,
recommendation engines
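
A sketch of k-means clustering with Spark MLlib's Java API; the input path and the LIBSVM file format are assumptions for illustration.

// KMeansExample.java: fit k-means on a dataset with a "features" column.
import org.apache.spark.ml.clustering.KMeans;
import org.apache.spark.ml.clustering.KMeansModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KMeansExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("kmeans").getOrCreate();
    // Expects a vector "features" column, e.g. data stored in LIBSVM format
    Dataset<Row> data = spark.read().format("libsvm").load("/data/features.txt");
    KMeansModel model = new KMeans().setK(3).setSeed(1L).fit(data);
    for (Vector center : model.clusterCenters()) {
      System.out.println(center);              // learned cluster centroids
    }
    model.transform(data).show(5);             // rows with predicted cluster labels
    spark.stop();
  }
}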
Hadoop Performance Metrics &
Monitoring

Tools to monitor Hadoop clusters:


Ganglia: Real-time monitoring system
Ambari: Provides cluster management and monitoring for Hadoop
Cloudera Manager: For managing and monitoring Hadoop clusters
Key Metrics:
Resource utilization (CPU, memory, disk, network)
Job performance (MapReduce job statistics, task completion times)
Hadoop Use Case: Log Analysis

Using Hadoop for large-scale log analysis:
1) Collecting data from web servers, databases, and applications
2) Using tools like Flume for data ingestion and Hive for querying logs
Benefits: Scalability and flexibility to process massive log data
Future of Hadoop and Big Data

Evolution of Hadoop:
Integration with cloud computing
Real-time stream processing and machine learning
Other upcoming trends:
AI and Deep Learning for Big Data analytics
IoT (Internet of Things) applications using Big Data tools
Increased adoption of edge computing for data
processing at the source
THANK YOU
