Data Analytics mid sem notes

The document provides an overview of data analytics, defining its types: descriptive, predictive, and prescriptive analytics, and discusses the concepts of big data, including its characteristics such as volume, velocity, variety, veracity, and value. It details Hadoop and Apache Spark as major frameworks for big data processing, highlighting their components, advantages, and disadvantages. The conclusion emphasizes the appropriate use cases for each framework, with Hadoop suited for batch processing and Spark for real-time analytics.


Data Analytics

Data analytics is defined as the process of cleaning, transforming, and modelling data to discover useful information for business decision making.

Analytics is the systematic analysis of data to derive meaningful results. It encompasses several types, classified below.

Classification of analytics

Descriptive analytics is a statistical method used to search and summarize historical
data in order to identify patterns or meaning. Data aggregation and data mining are two
techniques used in descriptive analytics to explore historical data.

Data is first gathered and sorted through data aggregation in order to make the datasets
more manageable for analysts.

Data mining describes the next step of the analysis and involves a search of the data to
identify patterns and meaning.

Identified patterns are then analyzed to discover, for example, the specific ways that
learners interacted with the learning content and within the learning environment.

Predictive Analytics is a statistical method that utilizes algorithms and machine learning to
identify trends in data and predict future behaviours.

Prescriptive analytics is a statistical method used to generate recommendations and make
decisions based on the computational findings of algorithmic models.

Descriptive analytics focuses on summarizing historical data.

Predictive analytics forecasts future outcomes based on patterns in the data.

Prescriptive analytics provides recommendations for decision making.

Descriptive analytics is focused solely on historical data. Predictive analytics then uses
this historical data to develop statistical models that forecast future possibilities.
Prescriptive analytics takes predictive analytics a step further: it takes the forecasted
outcomes, evaluates their likely consequences, and recommends actions accordingly.
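To make the distinction concrete, here is a minimal Python sketch (not part of the original notes) that walks through all three types on a toy sales table. It assumes pandas and scikit-learn are installed; the data, the month-7 forecast, and the stock rule are purely illustrative.

```python
# Illustrative only: descriptive, predictive, and prescriptive analytics
# on a tiny made-up sales dataset (assumes pandas and scikit-learn).
import pandas as pd
from sklearn.linear_model import LinearRegression

sales = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "revenue": [100, 120, 130, 150, 170, 180],
})

# Descriptive: summarize historical data (aggregation).
print("Average monthly revenue:", sales["revenue"].mean())

# Predictive: fit a simple model and forecast the next month.
model = LinearRegression().fit(sales[["month"]], sales["revenue"])
forecast = float(model.predict(pd.DataFrame({"month": [7]}))[0])
print("Forecast for month 7:", round(forecast, 1))

# Prescriptive: turn the forecast into a recommendation via a simple rule.
print("Recommendation:", "increase stock" if forecast > 175 else "hold stock")
```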

What is Big Data?


▪ Unstructured data is the rawest form of data. It can be any type of file, for example text,
pictures, sound, or video. This data is often stored in a repository of files; think of it as a
very well-organized directory on your computer’s hard drive. Extracting value out of this
form of data is often the hardest, since you first need to extract structured features that
describe or abstract the content. For example, to use text you might want to extract its
topics and whether the text is positive or negative about them.

▪ Structured data is tabular data (rows and columns) that is very well defined, meaning
we know which columns exist and what kind of data they contain. Such data is often
stored in databases, where we can use the power of SQL to answer queries about the data
and easily create data sets for our data science solutions.

▪ Semi-structured data falls anywhere between unstructured and structured data. A
consistent format is defined, but the structure is not strict: it is not necessarily tabular,
and parts of the data may be incomplete or of differing types. Semi-structured data is
often stored as files; however, some kinds (like JSON or XML) can be stored in
document-oriented databases, which allow you to query the semi-structured data.
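As a small illustration (not from the original notes), the following Python snippet contrasts querying structured data with SQL against handling a semi-structured JSON record. It uses only the standard library, and the table, names, and fields are made up.

```python
# Structured vs. semi-structured data, illustrated with the Python
# standard library only (sqlite3 for SQL, json for a JSON document).
import sqlite3
import json

# Structured: well-defined rows and columns, queried with SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "Asha"), (2, "Ravi")])
print(con.execute("SELECT name FROM users WHERE id = 2").fetchone())  # ('Ravi',)

# Semi-structured: a consistent format (JSON) but no strict tabular schema;
# fields may be missing or of differing types, so code must tolerate that.
record = json.loads('{"id": 3, "name": "Meera", "tags": ["new", "vip"]}')
print(record.get("tags", []))   # use .get() because "tags" is optional
```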

“Big data” is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.

 It refers to a massive amount of data that keeps on growing exponentially with time.
 It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
 It includes data mining, data storage, data analysis, data sharing, and data
visualization.
 The term is an all-comprehensive one including data, data frameworks, along with
the tools and techniques used to process and analyze the data.

1. Volume

 Refers to the vast amount of data generated every second. For instance:

o Social media platforms generate terabytes of data daily.

o Sensors and IoT devices produce continuous streams of data.

2. Velocity

 Represents the speed at which data is generated, processed, and analyzed.

o Example: Real-time data like stock market trends or streaming services.


3. Variety

 Denotes the different formats of data:

o Structured data: Organized data like rows and columns in databases.

o Unstructured data: Data like images, videos, emails, or social media posts.

o Semi-structured data: JSON, XML files, etc.

4. Veracity

 Refers to the trustworthiness and quality of the data. Big Data may include
inconsistent or noisy data, requiring cleansing and validation.

5. Value

 The ultimate goal of Big Data is to derive meaningful insights or actionable value from the data.

Hadoop
Hadoop is an open-source framework that allows us to store and process large data
sets in a parallel and distributed manner.
Its two main components are HDFS and MapReduce.
The Hadoop Distributed File System (HDFS) is the primary data storage system used by
Hadoop applications.
MapReduce is the processing unit of Hadoop.
HDFS

HDFS stores data in a distributed manner, uses replication to prevent data loss, and uses
rack awareness to keep track of which rack or node each block is stored on.

Rack awareness: a rack is a physical collection of nodes (generally 30-40 nodes per rack),
and rack awareness is HDFS's knowledge of which rack each node belongs to, used when
placing block replicas.
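As a rough worked example (not from the original notes), the sketch below shows how block splitting and replication multiply the raw storage a file consumes, using the commonly quoted defaults of a 128 MB block size and a replication factor of 3.

```python
# Plain arithmetic, not an HDFS API call: how many blocks a file splits
# into and how much raw storage its replicas consume (default settings).
import math

file_size_mb = 1024        # a 1 GB file
block_size_mb = 128        # common default HDFS block size
replication_factor = 3     # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
raw_storage_mb = file_size_mb * replication_factor

print(blocks)            # 8 blocks spread across the cluster
print(raw_storage_mb)    # 3072 MB of raw disk used across all replicas
```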
YARN (Yet Another Resource Negotiator) splits resource management and job
scheduling/monitoring into separate daemons. There is one ResourceManager and a
per-application ApplicationMaster; an application can be either a single job or a DAG of jobs.

The ResourceManager has two components, the Scheduler and the ApplicationManager. The
Scheduler is a pure scheduler: it does not track the status of running applications, it only
allocates resources to the various competing applications, and it does not restart a job
after a hardware or application failure. The Scheduler allocates resources based on the
abstract notion of a container, which is simply a fraction of resources such as CPU,
memory, disk, and network.

Tasks of the ApplicationManager:
 Accepts job submissions from clients.
 Negotiates the first container for a specific ApplicationMaster.
 Restarts the ApplicationMaster container after failure.

Responsibilities of the ApplicationMaster:
 Negotiates containers from the Scheduler.
 Tracks container status and monitors progress.

Hadoop and Apache Spark: Everything You Need to Know


Hadoop and Apache Spark are two major frameworks used in the field of Big Data for
storing, processing, and analyzing massive amounts of data. They are widely used in
industries such as finance, healthcare, e-commerce, and cybersecurity.

1. Hadoop
What is Hadoop?
Hadoop is an open-source framework developed by the Apache Software
Foundation for storing and processing large datasets across clusters of computers
using simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage.
Why is Hadoop Used?
Hadoop is used for:
 Storing huge amounts of structured, semi-structured, and unstructured data.
 Processing data in a distributed manner using parallel computing.
 Handling scalability issues in traditional databases.
 Fault-tolerant data processing.
 Supporting various Big Data applications like data warehousing, machine learning,
and analytics.
Key Components of Hadoop
Hadoop has four main components:
1. Hadoop Distributed File System (HDFS)
 A distributed file system that stores data across multiple nodes.
 Uses a Master-Slave Architecture.
 Splits large files into blocks (default: 128MB or 256MB).
 Data is replicated across multiple nodes to ensure fault tolerance.
2. MapReduce
 A programming model for processing large datasets in parallel.
 Map Phase: Splits data into key-value pairs.
 Reduce Phase: Aggregates and processes the key-value pairs to produce results.
 Works well for batch processing but is slow for real-time analytics (a word-count sketch of the map/reduce phases follows this component list).
3. YARN (Yet Another Resource Negotiator)
 A resource management layer that helps in job scheduling.
 Manages computing resources across Hadoop clusters.
 Enables multi-tenant data processing.
4. Hadoop Common
 Provides shared utilities and libraries required for Hadoop.
 Includes Java libraries and necessary dependencies.
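To illustrate the map and reduce phases described above, here is a minimal, purely local Python simulation of the word-count pattern (a sketch of the programming model only, not an actual Hadoop job; the documents are made up).

```python
# A local simulation of the MapReduce word-count pattern: map emits
# (word, 1) pairs, a shuffle groups them by key, reduce sums each group.
from collections import defaultdict

documents = ["big data big insight", "big data tools"]

# Map phase: split data into key-value pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group values by key (Hadoop does this between map and reduce).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate the values for each key to produce results.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)   # {'big': 3, 'data': 2, 'insight': 1, 'tools': 1}
```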
Advantages of Hadoop
✅ Scalability – Can handle petabytes of data and scale horizontally.
✅ Cost-Effective – Uses commodity hardware instead of expensive high-end servers.
✅ Fault Tolerance – Replicates data across nodes to prevent data loss.
✅ Flexibility – Supports different types of data (structured, semi-structured,
unstructured).
Disadvantages of Hadoop
❌ Slow Processing – Uses disk-based storage (HDFS), which is slower than in-memory
processing.
❌ Complex to Manage – Requires expertise in distributed computing.
❌ Not Ideal for Small Data – Works best for Big Data; for smaller datasets, traditional
databases are better.
❌ High Latency – Real-time processing is slow compared to Apache Spark.

2. Apache Spark
What is Apache Spark?
Apache Spark is an open-source distributed computing framework designed for fast
and real-time big data processing. It was developed at UC Berkeley’s AMPLab and
later donated to the Apache Software Foundation.
Why is Spark Used?
 Often much faster than Hadoop MapReduce (up to 100x for some workloads) because it processes data in-memory.
 Supports real-time streaming analytics.
 Can run on Hadoop, standalone, or in the cloud.
 Provides machine learning and graph processing capabilities.
Key Components of Apache Spark
Apache Spark consists of five main components:
1. Spark Core
 The foundation of Spark.
 Manages memory, task scheduling, and fault recovery.
 Handles distributed computing and resource management.
2. Spark SQL
 Enables users to query data using SQL-like syntax.
 Supports integration with traditional databases.
 Provides DataFrames and Datasets for optimized query execution (a short sketch follows this component list).
3. Spark Streaming
 Supports real-time data processing from sources like Kafka, Flume, and HDFS.
 Breaks data into micro-batches for near real-time analytics.
4. MLlib (Machine Learning Library)
 A scalable machine learning library.
 Includes classification, regression, clustering, and recommendation algorithms.
5. GraphX
 Provides an API for graph and network analysis.
 Used for social network analysis, fraud detection, and recommendation systems.
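The short PySpark sketch below (not from the original notes) shows Spark Core and Spark SQL together: a DataFrame is built from a tiny in-memory dataset and then queried with SQL-like syntax. It assumes pyspark is installed and a local Spark session can be started; the data and names are illustrative.

```python
# A minimal PySpark example: create a DataFrame, register it as a view,
# and query it with Spark SQL (assumes pyspark is installed locally).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("notes-demo").getOrCreate()

df = spark.createDataFrame(
    [("Asha", 34000.0), ("Ravi", 52000.0), ("Meera", 47000.0)],
    ["name", "salary"],
)

# Spark SQL: query the same data using SQL-like syntax.
df.createOrReplaceTempView("employees")
spark.sql("SELECT name FROM employees WHERE salary > 40000").show()

spark.stop()
```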
Advantages of Apache Spark
✅ Super Fast – Uses in-memory computation (RAM) instead of disk-based storage.
✅ Real-time Processing – Supports streaming analytics.
✅ Flexible – Supports multiple languages (Java, Scala, Python, R).
✅ Easy to Use – Comes with a high-level API for data manipulation.
✅ Integrates with Hadoop – Can use HDFS, HBase, Hive, and other data sources.
Disadvantages of Apache Spark
❌ Consumes More Memory – High RAM usage compared to Hadoop.
❌ No Built-in File Storage – Requires external storage like HDFS, Amazon S3, or
Cassandra.
❌ Costly Infrastructure – Needs powerful machines for optimal performance.

Hadoop vs. Apache Spark: A Quick Comparison

Feature           | Hadoop                       | Apache Spark
Processing Speed  | Slower (disk-based)          | Faster (in-memory)
Data Processing   | Batch processing             | Real-time + batch processing
Fault Tolerance   | High (data replication)      | High (RDD resiliency)
Ease of Use       | Complex (Java-based)         | Easier (supports Python, Scala, R)
Machine Learning  | Not built-in (needs Mahout)  | Built-in MLlib
Streaming         | Not supported                | Supports real-time streaming
Storage           | HDFS                         | Needs external storage
Cost              | Cheaper                      | Expensive (due to memory usage)

When to Use Hadoop vs. Apache Spark

Use Case                       | Hadoop                      | Apache Spark
Batch Processing               | ✅ Best for batch jobs       | ❌ Not ideal
Real-time Analytics            | ❌ Not supported             | ✅ Best for real-time
Machine Learning               | ❌ Requires external tools   | ✅ Built-in MLlib
Graph Processing               | ❌ Limited support           | ✅ GraphX available
Data Storage                   | ✅ HDFS storage              | ❌ Requires external storage
Streaming Data (Kafka, Flume)  | ❌ Not efficient             | ✅ Spark Streaming

Conclusion
 Hadoop is ideal for batch processing, large-scale storage, and distributed
computing.
 Apache Spark is best for real-time analytics, fast computation, and machine
learning.
 Spark can run on Hadoop and leverage HDFS for storage.
 Hadoop is cost-effective for companies with massive data storage needs, while Spark
is best for high-speed analytics.
If you are working with big datasets, need scalability, and don’t require real-time
analytics, Hadoop is a good choice. If you need high-speed processing, machine
learning, or streaming capabilities, Apache Spark is the better option.