Data Analytics mid sem notes
Classification of analytics
Descriptive analytics is a statistical method used to search and summarize historical
data in order to identify patterns or meaning. Data aggregation and data mining are two
techniques used in descriptive analytics to explore historical data.
Data is first gathered and sorted through data aggregation in order to make the datasets
more manageable for analysts.
Data mining describes the next step of the analysis and involves a search of the data to
identify patterns and meaning.
Identified patterns are analyzed to discover the specific ways that learners interacted with
the learning content and within the learning environment.
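A minimal sketch of the descriptive step in Python (assuming pandas; the learner-activity table and its column names are hypothetical): records are aggregated first, then summarized to surface patterns.

```python
import pandas as pd

# Hypothetical learner-activity data (names and values are made up for illustration)
activity = pd.DataFrame({
    "learner": ["A", "A", "B", "B", "C", "C"],
    "module":  ["intro", "quiz", "intro", "quiz", "intro", "quiz"],
    "minutes": [12, 30, 8, 45, 20, 25],
})

# Data aggregation: group and summarize the historical records
summary = activity.groupby("module")["minutes"].agg(["count", "mean", "max"])
print(summary)  # descriptive statistics per module, e.g. average time spent
```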
Predictive Analytics is a statistical method that utilizes algorithms and machine learning to
identify trends in data and predict future behaviours.
You can think of Predictive Analytics as using this historical data to develop statistical
models that then forecast future possibilities.
Prescriptive Analytics takes Predictive Analytics a step further: it takes the possible
forecasted outcomes and predicts the consequences of those outcomes.
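A minimal sketch of the predictive step, assuming scikit-learn and made-up historical values: a simple model is fitted on past observations and used to forecast the next period.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical data: month index vs. an observed metric (values are illustrative)
months = np.array([[1], [2], [3], [4], [5], [6]])
metric = np.array([100, 110, 125, 130, 145, 150])

# Develop a statistical model from the historical data ...
model = LinearRegression().fit(months, metric)

# ... and forecast a future possibility (month 7)
print(model.predict(np.array([[7]])))
```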
▪ Structured data is tabular data (rows and columns) that is very well defined, meaning
that we know which columns there are and what kind of data they contain. Such data is
often stored in databases, where we can use the power of SQL to answer queries about
the data and easily create datasets to use in our data science solutions.
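A small illustration of querying structured data with SQL, using Python's built-in sqlite3 and a hypothetical sales table:

```python
import sqlite3

# In-memory database with a hypothetical, well-defined table (rows and columns)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("north", 120.0), ("south", 80.0), ("north", 200.0)])

# SQL answers questions about the data and builds datasets for analysis
for row in conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"):
    print(row)
```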
“Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and
decision making.”
It refers to a massive amount of data that keeps on growing exponentially with time.
It is so voluminous that it cannot be processed or analyzed using conventional data
processing techniques.
It includes data mining, data storage, data analysis, data sharing, and data
visualization.
The term is an all-encompassing one, covering the data itself and data frameworks, along
with the tools and techniques used to process and analyze the data.
1. Volume
Refers to the vast amount of data generated every second, for instance the terabytes of
transactions, logs, and social media posts produced daily.
2. Velocity
Refers to the speed at which data is generated, collected, and processed, for example
streaming clickstream or sensor data.
3. Variety
Refers to the different forms the data can take:
o Structured data: Tabular data such as database records.
o Semi-structured data: Data like JSON or XML files.
o Unstructured data: Data like images, videos, emails, or social media posts.
4. Veracity
Refers to the trustworthiness and quality of the data. Big Data may include
inconsistent or noisy data, requiring cleansing and validation.
5. Value
Refers to the usefulness of the data, i.e. the insights and decisions that can be derived
from it.
Hadoop
Hadoop is an open source framework that allows us to store and process large data
sets in a parallel and distributed manner.
Two main components: HDFS and MapReduce.
Hadoop Distributed File System(HDFS) is the primary data storage system used by
Hadoop applications.
MapReduce is the processing unit of Hadoop.
HDFS
HDFS stores data in a distributed manner, uses replication to prevent data loss, and uses
rack awareness to keep track of which rack or node each piece of data is stored on.
1. Hadoop
What is Hadoop?
Hadoop is an open-source framework developed by the Apache Software
Foundation for storing and processing large datasets across clusters of computers
using simple programming models. It is designed to scale up from single servers to
thousands of machines, each offering local computation and storage.
Why is Hadoop Used?
Hadoop is used for:
Storing huge amounts of structured, semi-structured, and unstructured data.
Processing data in a distributed manner using parallel computing.
Handling scalability issues in traditional databases.
Fault-tolerant data processing.
Supporting various Big Data applications like data warehousing, machine learning,
and analytics.
Key Components of Hadoop
Hadoop has four main components:
1. Hadoop Distributed File System (HDFS)
A distributed file system that stores data across multiple nodes.
Uses a Master-Slave Architecture.
Splits large files into blocks (128MB by default; often configured to 256MB).
Data is replicated across multiple nodes to ensure fault tolerance.
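A hedged sketch of how a file ends up in HDFS, assuming a running cluster with the hdfs CLI on the PATH; the paths and file names are hypothetical. hdfs fsck reports how the file was split into blocks and replicated.

```python
import subprocess

# Copy a local file into HDFS, where it is split into blocks and replicated across nodes
subprocess.run(["hdfs", "dfs", "-put", "local_logs.txt", "/data/logs.txt"], check=True)

# List the directory and inspect how the file was split into blocks and replicated
subprocess.run(["hdfs", "dfs", "-ls", "/data"], check=True)
subprocess.run(["hdfs", "fsck", "/data/logs.txt", "-files", "-blocks"], check=True)
```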
2. MapReduce
A programming model for processing large datasets in parallel.
Map Phase: Splits data into key-value pairs.
Reduce Phase: Aggregates and processes the key-value pairs to produce results.
Works well for batch processing but is slow for real-time analytics.
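A toy, single-machine sketch of the two phases (word count in Python); on a real cluster Hadoop shuffles the key-value pairs between nodes before the reduce phase.

```python
from collections import defaultdict

# Map phase: emit (key, value) pairs, here (word, 1) for each word in each line
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

# Reduce phase: aggregate all values for the same key
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

# Toy local run; Hadoop would distribute the input splits and the reducers
data = ["big data needs big storage", "spark and hadoop process big data"]
print(reduce_phase(map_phase(data)))
```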
3. YARN (Yet Another Resource Negotiator)
A resource management layer that helps in job scheduling.
Manages computing resources across Hadoop clusters.
Enables multi-tenant data processing.
4. Hadoop Common
Provides shared utilities and libraries required for Hadoop.
Includes Java libraries and necessary dependencies.
Advantages of Hadoop
✅ Scalability – Can handle petabytes of data and scale horizontally.
✅ Cost-Effective – Uses commodity hardware instead of expensive high-end servers.
✅ Fault Tolerance – Replicates data across nodes to prevent data loss.
✅ Flexibility – Supports different types of data (structured, semi-structured,
unstructured).
Disadvantages of Hadoop
❌ Slow Processing – Uses disk-based storage (HDFS), which is slower than in-memory
processing.
❌ Complex to Manage – Requires expertise in distributed computing.
❌ Not Ideal for Small Data – Works best for Big Data; for smaller datasets, traditional
databases are better.
❌ High Latency – Real-time processing is slow compared to Apache Spark.
2. Apache Spark
What is Apache Spark?
Apache Spark is an open-source distributed computing framework designed for fast
and real-time big data processing. It was developed at UC Berkeley’s AMPLab and
later donated to the Apache Software Foundation.
Why is Spark Used?
Up to 100x faster than Hadoop MapReduce for some workloads because it processes data in-memory.
Supports real-time streaming analytics.
Can run on Hadoop, standalone, or in the cloud.
Provides machine learning and graph processing capabilities.
Key Components of Apache Spark
Apache Spark consists of five main components:
1. Spark Core
The foundation of Spark.
Manages memory, task scheduling, and fault recovery.
Handles distributed computing and resource management.
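A minimal PySpark sketch of the Spark Core RDD API, assuming a local pyspark installation; cache() is what keeps the data in memory between actions.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "core-demo")  # local mode for illustration

# Distribute a small dataset, cache it in memory, and run two actions on it
rdd = sc.parallelize(range(1, 1001)).map(lambda x: x * x).cache()
print(rdd.count())  # first action materializes and caches the RDD
print(rdd.sum())    # second action reuses the in-memory data
sc.stop()
```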
2. Spark SQL
Enables users to query data using SQL-like syntax.
Supports integration with traditional databases.
Provides DataFrames and Datasets for optimized query execution.
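A small Spark SQL sketch, assuming pyspark; the in-memory sales data and column names are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Hypothetical in-memory data turned into a DataFrame with named columns
df = spark.createDataFrame(
    [("north", 120.0), ("south", 80.0), ("north", 200.0)],
    ["region", "amount"],
)

# Query with SQL-like syntax via a temporary view
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()
spark.stop()
```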
3. Spark Streaming
Supports real-time data processing from sources like Kafka, Flume, and HDFS.
Breaks data into micro-batches for near real-time analytics.
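A minimal streaming sketch using Structured Streaming (the newer micro-batch API built on Spark SQL), reading lines from a local socket and counting words; the host and port are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Read a stream of lines from a socket (e.g. `nc -lk 9999` on localhost)
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Count words; each micro-batch updates the running counts
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```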
4. MLlib (Machine Learning Library)
A scalable machine learning library.
Includes classification, regression, clustering, and recommendation algorithms.
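A small MLlib sketch using the DataFrame-based pyspark.ml API and made-up labelled data, fitting a logistic regression classifier.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Hypothetical labelled data: two features and a binary label
data = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (1.0, 0.5, 0.0), (2.5, 2.3, 1.0), (3.0, 3.1, 1.0)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into a single vector, then fit the classifier
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(data)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()
spark.stop()
```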
5. GraphX
Provides an API for graph and network analysis.
Used for social network analysis, fraud detection, and recommendation systems.
Advantages of Apache Spark
✅ Super Fast – Uses in-memory computation (RAM) instead of disk-based storage.
✅ Real-time Processing – Supports streaming analytics.
✅ Flexible – Supports multiple languages (Java, Scala, Python, R).
✅ Easy to Use – Comes with a high-level API for data manipulation.
✅ Integrates with Hadoop – Can use HDFS, HBase, Hive, and other data sources.
Disadvantages of Apache Spark
❌ Consumes More Memory – High RAM usage compared to Hadoop.
❌ No Built-in File Storage – Requires external storage like HDFS, Amazon S3, or
Cassandra.
❌ Costly Infrastructure – Needs powerful machines for optimal performance.
Hadoop vs Apache Spark
Processing Speed: Hadoop is slower (disk-based); Spark is faster (in-memory).
Data Processing: Hadoop supports batch processing; Spark supports real-time + batch processing.
Streaming: Hadoop ❌ not supported; Spark ✅ supports real-time streaming.
Machine Learning: Hadoop ❌ requires external tools; Spark ✅ built-in MLlib.
Data Storage: Hadoop ✅ HDFS storage; Spark ❌ requires external storage.
Conclusion
Hadoop is ideal for batch processing, large-scale storage, and distributed
computing.
Apache Spark is best for real-time analytics, fast computation, and machine
learning.
Spark can run on Hadoop and leverage HDFS for storage.
Hadoop is cost-effective for companies with massive data storage needs, while Spark
is best for high-speed analytics.
If you are working with big datasets, need scalability, and don’t require real-time
analytics, Hadoop is a good choice. If you need high-speed processing, machine
learning, or streaming capabilities, Apache Spark is the better option.