Data Analytics mid sem notes
Classification of analytics
Descriptive analytics is a statistical method used to search and summarize historical
data in order to identify patterns or meaning. Data aggregation and data mining are two
techniques used in descriptive analytics to explore historical data.
Data is first gathered and sorted through data aggregation in order to make the datasets
more manageable for analysts.
Data mining describes the next step of the analysis and involves a search of the data to
identify patterns and meaning.
Identified patterns are analyzed to discover the specific ways that learners interacted with
the learning content and within the learning environment.
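A minimal sketch of the descriptive step in Python (assuming pandas; the learner-activity table and its column names are hypothetical): records are aggregated first, then summarized to surface patterns.

```python
import pandas as pd

# Hypothetical learner-activity data (names and values are made up for illustration)
activity = pd.DataFrame({
    "learner": ["A", "A", "B", "B", "C", "C"],
    "module":  ["intro", "quiz", "intro", "quiz", "intro", "quiz"],
    "minutes": [12, 30, 8, 45, 20, 25],
})

# Data aggregation: group and summarize the historical records
summary = activity.groupby("module")["minutes"].agg(["count", "mean", "max"])
print(summary)  # descriptive statistics per module, e.g. average time spent
```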
Predictive Analytics is a statistical method that utilizes algorithms and machine learning to
identify trends in data and predict future behaviours.
You can think of Predictive Analytics as using this historical data to develop statistical
models that then forecast future possibilities.
Prescriptive Analytics takes Predictive Analytics a step further: it takes the possible
forecasted outcomes and predicts the consequences of those outcomes.
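A minimal sketch of the predictive step, assuming scikit-learn and made-up historical values: a simple model is fitted on past observations and used to forecast the next period.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: month index vs. an observed metric (values are illustrative)
months = np.array([[1], [2], [3], [4], [5], [6]])
metric = np.array([100, 110, 125, 130, 145, 150])

# Develop a statistical model from the historical data ...
model = LinearRegression().fit(months, metric)

# ... and forecast a future possibility (month 7)
print(model.predict(np.array([[7]])))
```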
▪ Structured data is tabular data (rows and columns) that is very well defined, meaning
that we know which columns there are and what kind of data they contain. Such data is
often stored in databases, where we can use the power of SQL to answer queries about
the data and easily create datasets to use in our data science solutions.
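A small illustration of querying structured data with SQL, using Python's built-in sqlite3 and a hypothetical sales table:

```python
import sqlite3

# In-memory database with a hypothetical, well-defined table (rows and columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 200.0)])

# SQL answers questions about the data and builds datasets for analysis
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
```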
“Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.”
It refers to a massive amount of data that keeps on growing exponentially with time.
It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
It includes data mining, data storage, data analysis, data sharing, and data
visualization.
The term is an all-encompassing one, covering the data itself and data frameworks, along
with the tools and techniques used to process and analyze the data.
1. Volume
Refers to the vast amount of data generated every second, for instance the terabytes of
transactions, logs, and social media posts produced daily.
2. Velocity
Refers to the speed at which data is generated, collected, and processed, for example
streaming clickstream or sensor data.
3. Variety
Refers to the different forms the data can take:
o Structured data: Tabular data such as database records.
o Semi-structured data: Data like JSON or XML files.
o Unstructured data: Data like images, videos, emails, or social media posts.
4. Veracity
Refers to the trustworthiness and quality of the data. Big Data may include
inconsistent or noisy data, requiring cleansing and validation.
5. Value
Refers to the usefulness of the data, i.e. the insights and decisions that can be derived
from it.
Hadoop
Hadoop is an open source framework that allows us to store and process large data
sets in a parallel and distributed manner.
Two main components: HDFS and MapReduce.
Hadoop Distributed File System(HDFS) is the primary data storage system used by
Hadoop applications.
MapReduce is the processing unit of Hadoop.
HDFS
HDFS stores data in a distributed manner, uses replication to prevent data loss, and uses
rack awareness to keep track of which rack or node each piece of data is stored on.
1. Hadoop
What is Hadoop?
Hadoop is an open-source framework developed by the Apache Software
Foundation for storing and processing large datasets across clusters of computers
using simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage.
Why is Hadoop Used?
Hadoop is used for:
Storing huge amounts of structured, semi-structured, and unstructured data.
Processing data in a distributed manner using parallel computing.
Handling scalability issues in traditional databases.
Fault-tolerant data processing.
Supporting various Big Data applications like data warehousing, machine learning,
and analytics.
Key Components of Hadoop
Hadoop has four main components:
1. Hadoop Distributed File System (HDFS)
A distributed file system that stores data across multiple nodes.
Uses a Master-Slave Architecture.
Splits large files into blocks (128MB by default; often configured to 256MB).
Data is replicated across multiple nodes to ensure fault tolerance.
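A hedged sketch of how a file ends up in HDFS, assuming a running cluster with the hdfs CLI on the PATH; the paths and file names are hypothetical. hdfs fsck reports how the file was split into blocks and replicated.

```python
import subprocess

# Copy a local file into HDFS, where it is split into blocks and replicated across nodes
subprocess.run(["hdfs", "dfs", "-put", "local_logs.txt", "/data/logs.txt"], check=True)

# List the directory and inspect how the file was split into blocks and replicated
subprocess.run(["hdfs", "dfs", "-ls", "/data"], check=True)
subprocess.run(["hdfs", "fsck", "/data/logs.txt", "-files", "-blocks"], check=True)
```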
2. MapReduce
A programming model for processing large datasets in parallel.
Map Phase: Splits data into key-value pairs.
Reduce Phase: Aggregates and processes the key-value pairs to produce results.
Works well for batch processing but is slow for real-time analytics.
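A toy, single-machine sketch of the two phases (word count in Python); on a real cluster Hadoop shuffles the key-value pairs between nodes before the reduce phase.

```python
from collections import defaultdict

# Map phase: emit (key, value) pairs, here (word, 1) for each word in each line
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

# Reduce phase: aggregate all values for the same key
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Toy local run; Hadoop would distribute the input splits and the reducers
data = ["big data needs big storage", "spark and hadoop process big data"]
print(reduce_phase(map_phase(data)))
```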
3. YARN (Yet Another Resource Negotiator)
A resource management layer that helps in job scheduling.
Manages computing resources across Hadoop clusters.
Enables multi-tenant data processing.
4. Hadoop Common
Provides shared utilities and libraries required for Hadoop.
Includes Java libraries and necessary dependencies.
Advantages of Hadoop
✅ Scalability – Can handle petabytes of data and scale horizontally.
✅ Cost-Effective – Uses commodity hardware instead of expensive high-end servers.
✅ Fault Tolerance – Replicates data across nodes to prevent data loss.
✅ Flexibility – Supports different types of data (structured, semi-structured,
unstructured).
Disadvantages of Hadoop
❌ Slow Processing – Uses disk-based storage (HDFS), which is slower than in-memory
processing.
❌ Complex to Manage – Requires expertise in distributed computing.
❌ Not Ideal for Small Data – Works best for Big Data; for smaller datasets, traditional
databases are better.
❌ High Latency – Real-time processing is slow compared to Apache Spark.
2. Apache Spark
What is Apache Spark?
Apache Spark is an open-source distributed computing framework designed for fast
and real-time big data processing. It was developed at UC Berkeley’s AMPLab and
later donated to the Apache Software Foundation.
Why is Spark Used?
Up to 100x faster than Hadoop MapReduce for some workloads because it processes data in-memory.
Supports real-time streaming analytics.
Can run on Hadoop, standalone, or in the cloud.
Provides machine learning and graph processing capabilities.
Key Components of Apache Spark
Apache Spark consists of five main components:
1. Spark Core
The foundation of Spark.
Manages memory, task scheduling, and fault recovery.
Handles distributed computing and resource management.
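A minimal PySpark sketch of the Spark Core RDD API, assuming a local pyspark installation; cache() is what keeps the data in memory between actions.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "core-demo")  # local mode for illustration

# Distribute a small dataset, cache it in memory, and run two actions on it
rdd = sc.parallelize(range(1, 1001)).map(lambda x: x * x).cache()
print(rdd.count())  # first action materializes and caches the RDD
print(rdd.sum())    # second action reuses the in-memory data
sc.stop()
```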
2. Spark SQL
Enables users to query data using SQL-like syntax.
Supports integration with traditional databases.
Provides DataFrames and Datasets for optimized query execution.
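A small Spark SQL sketch, assuming pyspark; the in-memory sales data and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical in-memory data turned into a DataFrame with named columns
df = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 200.0)],
    ["region", "amount"],
)

# Query with SQL-like syntax via a temporary view
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
spark.stop()
```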
3. Spark Streaming
Supports real-time data processing from sources like Kafka, Flume, and HDFS.
Breaks data into micro-batches for near real-time analytics.
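A minimal streaming sketch using Structured Streaming (the newer micro-batch API built on Spark SQL), reading lines from a local socket and counting words; the host and port are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Read a stream of lines from a socket (e.g. `nc -lk 9999` on localhost)
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Count words; each micro-batch updates the running counts
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```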
4. MLlib (Machine Learning Library)
A scalable machine learning library.
Includes classification, regression, clustering, and recommendation algorithms.
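A small MLlib sketch using the DataFrame-based pyspark.ml API and made-up labelled data, fitting a logistic regression classifier.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Hypothetical labelled data: two features and a binary label
data = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (1.0, 0.5, 0.0), (2.5, 2.3, 1.0), (3.0, 3.1, 1.0)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into a single vector, then fit the classifier
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(data)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()
spark.stop()
```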
5. GraphX
Provides an API for graph and network analysis.
Used for social network analysis, fraud detection, and recommendation systems.
Advantages of Apache Spark
✅ Super Fast – Uses in-memory computation (RAM) instead of disk-based storage.
✅ Real-time Processing – Supports streaming analytics.
✅ Flexible – Supports multiple languages (Java, Scala, Python, R).
✅ Easy to Use – Comes with a high-level API for data manipulation.
✅ Integrates with Hadoop – Can use HDFS, HBase, Hive, and other data sources.
Disadvantages of Apache Spark
❌ Consumes More Memory – High RAM usage compared to Hadoop.
❌ No Built-in File Storage – Requires external storage like HDFS, Amazon S3, or
Cassandra.
❌ Costly Infrastructure – Needs powerful machines for optimal performance.
Hadoop vs Apache Spark
Processing Speed: Hadoop is slower (disk-based); Spark is faster (in-memory).
Data Processing: Hadoop supports batch processing; Spark supports real-time + batch processing.
Streaming: Hadoop ❌ not supported; Spark ✅ supports real-time streaming.
Machine Learning: Hadoop ❌ requires external tools; Spark ✅ built-in MLlib.
Data Storage: Hadoop ✅ HDFS storage; Spark ❌ requires external storage.
Conclusion
Hadoop is ideal for batch processing, large-scale storage, and distributed
computing.
Apache Spark is best for real-time analytics, fast computation, and machine
learning.
Spark can run on Hadoop and leverage HDFS for storage.
Hadoop is cost-effective for companies with massive data storage needs, while Spark
is best for high-speed analytics.
If you are working with big datasets, need scalability, and don’t require real-time
analytics, Hadoop is a good choice. If you need high-speed processing, machine
learning, or streaming capabilities, Apache Spark is the better option.