Big Data
Roll: 144001200127
Sem: 8th SEM
Sub: Big Data
Topic: Big Data with HADOOP
Introduction to Big Data
Apache Spark:
In-memory computation, giving much faster processing than MapReduce
Real-time stream processing
More user-friendly APIs for data analytics and machine learning
Hadoop MapReduce:
Batch processing
Slower due to disk-based storage
Best for large-scale batch jobs
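The batch model that MapReduce uses can be sketched in plain Python (this mirrors the map → shuffle → reduce phases conceptually; it is not the Hadoop API):

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit (word, 1) for every word in the input split
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group values by key, as Hadoop does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data with hadoop", "big data is big"]
counts = reduce_phase(shuffle(map_phase(lines)))
```

In real Hadoop each phase runs on many machines and the intermediate results are written to disk, which is exactly why Spark's in-memory execution is faster for iterative jobs.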
Hadoop Distributed File System (HDFS)
HDFS Overview:
Designed for storing large files across multiple machines
Data replication for fault tolerance
High throughput access to data
HDFS Architecture:
NameNode: Manages metadata and file structure
DataNodes: Store the actual data blocks
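The NameNode/DataNode split can be made concrete with a toy sketch (hypothetical code, with a tiny block size for illustration; real HDFS defaults to 128 MB blocks):

```python
BLOCK_SIZE = 8          # real HDFS defaults to 128 MB; tiny here for illustration
REPLICATION = 3         # HDFS's default replication factor

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Files are stored as fixed-size blocks
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    # NameNode-style metadata: block id -> DataNodes holding a replica
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            datanodes[(block_id + r) % len(datanodes)]
            for r in range(replication)
        ]
    return placement

blocks = split_into_blocks(b"hello hadoop distributed fs")
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
```

The NameNode keeps only the `placement`-style metadata; the DataNodes hold the actual block bytes, so losing one DataNode still leaves two replicas of every block.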
YARN (Yet Another Resource Negotiator)
Manages cluster resources (CPU, memory) across the Hadoop cluster
Schedules and monitors application containers
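The resource-negotiation idea can be illustrated with a minimal sketch (plain Python, not the real YARN API): a ResourceManager-style allocator grants containers to applications only while cluster memory lasts.

```python
# Illustrative sketch, not the YARN API: a ResourceManager-style allocator
# that grants containers to applications while cluster memory remains.
class SimpleResourceManager:
    def __init__(self, total_memory_mb):
        self.free_memory_mb = total_memory_mb
        self.containers = []          # (app_id, memory_mb) pairs

    def allocate(self, app_id, memory_mb):
        # Grant a container only if enough memory remains
        if memory_mb > self.free_memory_mb:
            return False
        self.free_memory_mb -= memory_mb
        self.containers.append((app_id, memory_mb))
        return True

rm = SimpleResourceManager(total_memory_mb=4096)
granted = rm.allocate("app-1", 1024)   # fits: granted
denied = rm.allocate("app-2", 8192)    # exceeds free memory: denied
```

Real YARN adds queues, vcores, and preemption on top of this basic accept/deny decision.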
Apache Hive:
A data warehouse system built on top of Hadoop
Provides SQL-like querying capabilities for Hadoop
Supports ETL operations and batch processing
Hive Architecture:
Metastore: Stores schema information
Query Compiler: Converts SQL-like (HiveQL) queries into MapReduce jobs
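What the query compiler does can be sketched conceptually: a HiveQL query such as `SELECT dept, COUNT(*) FROM emp GROUP BY dept` becomes a map phase that emits the grouping key and a reduce phase that aggregates per key (hypothetical illustration, not Hive internals):

```python
from collections import defaultdict

rows = [
    {"name": "ana", "dept": "sales"},
    {"name": "bo",  "dept": "eng"},
    {"name": "cy",  "dept": "sales"},
]

def map_group_by(rows, key):
    # Map: emit (grouping key, 1) for every row
    for row in rows:
        yield (row[key], 1)

def reduce_count(pairs):
    # Reduce: COUNT(*) per grouping key
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

result = reduce_count(map_group_by(rows, "dept"))
```

This is why Hive suits batch ETL: every query pays MapReduce's job-startup and disk costs, in exchange for SQL-style convenience over huge datasets.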
Hadoop Ecosystem: HBase
HBase:
A NoSQL database built on top of HDFS
Provides random read/write access to large datasets
Scalable and distributed architecture
Use cases: Real-time analytics, serving large-scale data applications
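HBase's data model, a sorted map of row key → column family → column → value, can be sketched with a toy table (hypothetical class, not the HBase client API) to show what "random read/write access" means:

```python
class TinyHBaseTable:
    """Toy model of HBase's row-key / column-family / column layout."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed at creation
        self.rows = {}                  # row key -> {family: {column: value}}

    def put(self, row_key, family, column, value):
        # Random write: address any cell directly by its coordinates
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get(self, row_key, family, column):
        # Random read: no full-table scan needed
        return self.rows.get(row_key, {}).get(family, {}).get(column)

table = TinyHBaseTable(families=["info"])
table.put("user#42", "info", "name", "Ada")
name = table.get("user#42", "info", "name")
```

HDFS alone only supports append-style sequential access; HBase layers this cell-addressable model on top of it, which is what enables real-time serving workloads.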
Hadoop Ecosystem: Pig
Apache Pig:
A high-level platform for creating MapReduce programs
Uses Pig Latin, a scripting language that simplifies data processing
Pig vs MapReduce:
Pig is easier to write, but MapReduce is more flexible for complex workflows
Ideal for ETL (Extract, Transform, Load) tasks
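A typical Pig Latin dataflow (LOAD → FILTER → GROUP → FOREACH) can be mirrored step by step in plain Python to show why it is easier to write than raw MapReduce (the operator names in the comments echo Pig Latin; the code itself is an illustrative sketch):

```python
from collections import defaultdict

# LOAD: the raw input relation (name, age)
records = [("alice", 34), ("bob", 17), ("carol", 52), ("dave", 34)]

# FILTER records BY age >= 18
adults = [(name, age) for name, age in records if age >= 18]

# GROUP adults BY age
grouped = defaultdict(list)
for name, age in adults:
    grouped[age].append(name)

# FOREACH grouped GENERATE age, COUNT(names)
counts = {age: len(names) for age, names in grouped.items()}
```

Each line corresponds to one Pig operator; Pig's compiler turns the same pipeline into one or more MapReduce jobs behind the scenes, which is exactly the boilerplate ETL authors are spared.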
Hadoop Ecosystem: Sqoop and Flume
Sqoop:
Designed for importing and exporting data between Hadoop and relational databases
Used for batch processing tasks
Flume:
Collects and aggregates large amounts of log data
Streams data in real-time to Hadoop HDFS
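The batch-versus-streaming contrast between the two tools can be sketched side by side (illustrative functions only, not the real Sqoop or Flume interfaces):

```python
def batch_import(table_rows, chunk_size=2):
    # Sqoop-style: move a whole table in fixed-size batches
    for i in range(0, len(table_rows), chunk_size):
        yield table_rows[i:i + chunk_size]

def stream_events(source, sink):
    # Flume-style: forward each log event to the sink as it arrives
    for event in source:
        sink.append(event)

# Sqoop-like: a finite table, imported in chunks
rows = ["r1", "r2", "r3", "r4", "r5"]
batches = list(batch_import(rows))

# Flume-like: an event stream pushed into an HDFS-like sink
hdfs_sink = []
stream_events(iter(["login", "click", "logout"]), hdfs_sink)
```

The key difference is the data's shape: Sqoop works on bounded relational tables on a schedule, while Flume handles an unbounded stream of events continuously.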
Security in Hadoop
Authentication:
Kerberos: A network authentication protocol to secure access to Hadoop services
Authorization:
Apache Ranger: Provides centralized access control and policy management
Data Encryption:
Encrypt data at rest (HDFS) and in transit (between components)
Auditing:
Track user access and behavior with auditing tools
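Centralized authorization plus auditing can be sketched together (a hypothetical Ranger-style policy check, not the Ranger API): policies map a user and resource to allowed actions, and every decision is appended to an audit trail.

```python
# Policy table: (user, resource) -> set of permitted actions
policies = {
    ("ana", "/data/sales"): {"read", "write"},
    ("bob", "/data/sales"): {"read"},
}
audit_log = []   # auditing: every access decision is recorded

def is_allowed(user, resource, action):
    allowed = action in policies.get((user, resource), set())
    audit_log.append((user, resource, action, allowed))
    return allowed

ok = is_allowed("bob", "/data/sales", "read")      # permitted by policy
denied = is_allowed("bob", "/data/sales", "write")  # not in bob's actions
```

Keeping the policy table in one place is the point of Ranger: each Hadoop service asks the same authority instead of managing its own access lists.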
Real-World Big Data Use Cases
Healthcare: Predictive analytics on patient records and medical imaging
Finance:
Fraud detection in real-time financial transactions
Telecommunications: Network optimization and customer churn prediction
E-Commerce: Recommendation engines and customer behavior analysis
Hadoop Ecosystem: Oozie
Oozie:
A workflow scheduler system for managing Hadoop jobs
Supports complex job workflows, such as MapReduce, Hive, and Pig
Key Features:
Error handling
Job scheduling and dependency management
Integration with other components like HDFS, HBase, and Hive
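Oozie's dependency management can be sketched as a tiny workflow runner (illustrative only, not Oozie's XML workflow format): each job runs only after all of its prerequisites have finished.

```python
def run_workflow(jobs, deps):
    # jobs: job name -> callable; deps: job name -> list of prerequisite jobs
    done, order = set(), []

    def run(name):
        if name in done:
            return
        # Dependency management: finish prerequisites first
        for prerequisite in deps.get(name, []):
            run(prerequisite)
        jobs[name]()          # e.g. launch a MapReduce, Hive, or Pig action
        done.add(name)
        order.append(name)

    for name in jobs:
        run(name)
    return order

executed = run_workflow(
    jobs={"load": lambda: None, "transform": lambda: None, "report": lambda: None},
    deps={"transform": ["load"], "report": ["transform"]},
)
```

Oozie additionally handles time-based triggers, retries on error, and mixed action types (MapReduce, Hive, Pig) within one workflow definition.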
Hadoop Performance Tuning
Evolution of Hadoop:
Integration with cloud computing
Real-time stream processing and machine learning
Other upcoming trends:
AI and Deep Learning for Big Data analytics
IoT (Internet of Things) applications using Big Data tools
Increased adoption of edge computing for data processing at the source
THANK YOU