Data Mining Vs Data Exploration UNIT-II

Data mining vs Data Exploration:

 There are two main methodologies or techniques used to retrieve relevant
data from large, unorganized pools: the manual method and the automatic
method.
 The manual method is another name for data exploration, while the
automatic method is also known as data mining.
 Data mining generally refers to gathering relevant data from large
databases. Data exploration, on the other hand, refers to a data user
finding their way through large amounts of data to gather the necessary
information.

What is Data Exploration?

 Data exploration refers to the initial step in data analysis. Data analysts
use data visualization and statistical techniques to describe dataset
characteristics, such as size, quantity, and accuracy, in order to better
understand the nature of the data.

 Data exploration techniques include both manual analysis and automated
data exploration software solutions that visually explore and identify
relationships between different data variables, the structure of the dataset,
the presence of outliers, and the distribution of data values to reveal
patterns and points of interest, enabling data analysts to gain greater
insight into the raw data.
 Data is often gathered in large, unstructured volumes from various
sources. Data analysts must first understand and develop a comprehensive
view of the data before extracting relevant data for further analysis, such
as univariate, bivariate, multivariate, and principal components analysis.

Why is Data Exploration Important?

 Humans process visual data better than numerical data.
 The big challenge for data scientists and data analysts is to assign meaning
to thousands of rows and columns of data points and communicate that
meaning without any visual components.
 Data visualization in data exploration leverages familiar visual cues such
as shapes, dimensions, colors, lines, points, and angles so that data
analysts can effectively visualize and define the metadata and then
perform data cleansing.
 Performing the initial step of data exploration enables data analysts to
better understand and visually identify anomalies and relationships that
might otherwise go undetected.

Data Exploration Tools


 Manual data exploration methods entail writing scripts to analyze raw
data or manually filtering data into spreadsheets.
 Automated data exploration tools, such as data visualization software,
help data scientists easily monitor data sources and perform big data
exploration.
 A popular tool for manual data exploration is Microsoft Excel, whose
spreadsheets can create basic charts for data exploration, view raw data,
and identify the correlation between variables.
 To identify the correlation between two continuous variables in Excel,
use the CORREL() function to return the correlation (a pandas
equivalent is sketched after this list).
 To identify the correlation between two categorical variables in Excel,
the two-way table method, the stacked column chart method, and the
chi-square test are effective.
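For readers working in Python instead of Excel, pandas offers equivalent checks. This is a minimal sketch, assuming a hypothetical employees.csv with continuous columns Salary and Bonus % and categorical columns Gender and Senior Management (the employee dataset used later in this unit):

import pandas as pd

df = pd.read_csv('employees.csv')

# continuous vs. continuous: Pearson correlation (Excel's CORREL equivalent)
print(df['Salary'].corr(df['Bonus %']))

# categorical vs. categorical: a two-way (contingency) table
print(pd.crosstab(df['Gender'], df['Senior Management']))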

What can Data Exploration Do?


The goals of data exploration fall into three categories.

1. Archival: Data exploration can convert data from physical formats (such
as books, newspapers, and invoices) into digital formats (such as
databases) for backup.
2. Transfer of data format: If you want to transfer the data from your
current website into a new website under development, you can collect
data from your own website by extracting it.
3. Data analysis: As the most common goal, the extracted data can be
further analyzed to generate insights. This may sound similar to the data
analysis process in data mining, but note that data analysis is the goal of
data exploration, not part of its process. What's more, the data is analyzed
differently. One example is that e-store owners extract product details
from eCommerce websites like Amazon to monitor competitors'
strategies.
Use Cases of Data Exploration

Data exploration has been widely used in multiple industries serving different
purposes. Besides monitoring prices in eCommerce, data exploration can help
in individual paper research, news aggregation, marketing, real estate, travel and
tourism, consulting, finance, and many more.

o Lead generation: Companies can extract data from directories like Yelp,
Crunchbase, and Yellowpages and generate leads for business
development.
o Content & news aggregation: Content aggregation websites can get
regular data feeds from multiple sources and keep their sites fresh and up-
to-date.
o Sentiment analysis: After extracting the online
reviews/comments/feedback from social media websites like Instagram
and Twitter, people can analyze the underlying attitudes and understand
how they perceive a brand, product, or phenomenon.

What is Data Mining?

Data mining could be called a subset of data analysis. It explores and analyzes
huge volumes of data to find important patterns and rules.

Data mining is a systematic and successive method of identifying and
discovering hidden patterns and information throughout a big dataset.
Moreover, it is used to build machine learning models that are further used in
artificial intelligence.

Exploratory Data Analysis (EDA)

Let’s suppose we want to make a data science project on the employee churn
rate of a company.

But before we build a model on this data, we have to analyze all the
information present across the dataset, such as the salary distribution of
employees, the bonus they are getting, their starting time, and the assigned
team.

These all steps of analyzing and modifying the data come under EDA.
Exploratory Data Analysis (EDA) is an approach that is used to analyze the
data and discover trends, patterns, or check assumptions in data with the help
of statistical summaries and graphical representations.

Types of EDA
Depending on the number of columns we are analyzing, we can divide EDA
into the following types.
Univariate Analysis
 In univariate analysis, we analyze or deal with only one variable at a
time.

 The analysis of univariate data is thus the simplest form of analysis,
since the information deals with only one quantity that changes.

 It does not deal with causes or relationships.

 The main purpose of the analysis is to describe the data and find patterns
that exist within it. A minimal sketch follows this list.
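A minimal univariate sketch in Python, assuming the employee dataset introduced later in this unit and its numeric Salary column:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('employees.csv')

# non-graphical summary of a single variable
print(df['Salary'].describe())

# graphical summary of the same variable: a histogram of its distribution
df['Salary'].plot(kind='hist', bins=20, title='Salary distribution')
plt.show()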

Bi-Variate analysis
 This type of data involves two different variables.

 The analysis of this type of data deals with causes and relationships, and
the analysis is done to find out the relationship between the two
variables. A minimal sketch follows this list.
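A minimal bivariate sketch, again assuming the employee data and treating Salary and Bonus % as the two variables:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('employees.csv')

# numeric summary of the relationship between the two variables
print(df['Salary'].corr(df['Bonus %']))

# graphical summary: a scatter plot of one variable against the other
df.plot(kind='scatter', x='Salary', y='Bonus %')
plt.show()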

Multivariate Analysis
When the data involves three or more variables, it is categorized under
multivariate.
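A minimal multivariate sketch, again assuming the employee data: a pairwise correlation matrix examines all numeric variables at once.

import pandas as pd

df = pd.read_csv('employees.csv')

# pairwise correlations among all numeric columns at once
# (numeric_only=True requires pandas 1.5 or later)
print(df.corr(numeric_only=True))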

Depending on the type of analysis we can also subcategorize EDA into two
parts.
1. Non-graphical Analysis – In non-graphical analysis, we analyze data
using statistical tools like the mean, median, mode, or skewness (a sketch
follows this list).
2. Graphical Analysis – In graphical analysis, we use visualization charts to
visualize trends and patterns in the data.
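A minimal non-graphical sketch, assuming the employee data:

import pandas as pd

df = pd.read_csv('employees.csv')

# basic non-graphical summaries of a single numeric column
print(df['Salary'].mean())    # mean
print(df['Salary'].median())  # median
print(df['Salary'].mode())    # mode (may return more than one value)
print(df['Salary'].skew())    # skewness of the distribution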
Exploratory Data Analysis (EDA) Using Python Libraries
For the simplicity of the article, we will use a single dataset: the employee
data. It contains 8 columns, namely First Name, Gender, Start Date, Last
Login Time, Salary, Bonus %, Senior Management, and Team.
We can get the dataset here: Employees.csv
Let's read the dataset using the Pandas read_csv() function and print the first
five rows. To print the first five rows, we will use the head() function.

import pandas as pd
import numpy as np
# read dataset using pandas
df = pd.read_csv('employees.csv')
df.head()

Output: (first five rows of the dataframe)

Getting Insights About The Dataset

Let's see the shape of the data using the shape attribute.

df.shape

Output:
(1000, 8)
This means that this dataset has 1000 rows and 8 columns.

Python is a great language for doing data analysis, and Pandas is one of the
packages that makes importing and analyzing data much easier.

Pandas DataFrame describe()


Pandas describe() is used to view some basic statistical details like percentile,
mean, std, etc. of a data frame or a series of numeric values. When this method
is applied to a series of strings, it returns a different output which is shown in
the examples below.
Syntax: DataFrame.describe(percentiles=None, include=None,
exclude=None)
Parameters:
 percentiles: list-like of numbers between 0 and 1, giving the
percentiles to return
 include: list of data types to be included while describing the dataframe.
Default is None
 exclude: list of data types to be excluded while describing the dataframe.
Default is None
Return type: statistical summary of the data frame.

Dataset used
The data frame used in the following examples contains data from some NBA
players (nba.csv). Let's have a look at the data by importing it.
import pandas as pd

# read the csv file and print its first rows
data = pd.read_csv('nba.csv')
print(data.head())

Output:
Name Team Number Position Age Height Weight
College Salary
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0
Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0
Marquette 6796117.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 Boston
University NaN
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 Georgia
State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0
NaN 5000000.0

Using Describe function in Pandas


We can easily learn about several statistical measures, including mean,
median, standard deviation, quartiles, and more, by using describe() on a
DataFrame.

print(data.describe())
Number Age Weight Salary
count 457.000000 457.000000 457.000000 4.460000e+02
mean 17.678337 26.938731 221.522976 4.842684e+06
std 15.966090 4.404016 26.368343 5.229238e+06
min 0.000000 19.000000 161.000000 3.088800e+04
25% 5.000000 24.000000 200.000000 1.044792e+06
50% 13.000000 26.000000 220.000000 2.839073e+06
75% 25.000000 30.000000 240.000000 6.500000e+06
max 99.000000 40.000000 307.000000 2.500000e+07

Explanation of the description of numerical columns:

count: total number of non-empty values
mean: mean of the column values
std: standard deviation of the column values
min: minimum value in the column
25%: 25th percentile
50%: 50th percentile (the median)
75%: 75th percentile
max: maximum value in the column
Let's get a quick summary of the dataset using the pandas describe() method.
The describe() function applies basic statistical computations to the dataset,
like extreme values, the count of data points, the standard deviation, etc. Any
missing or NaN value is automatically skipped. The describe() function gives a
good picture of the distribution of the data.

df.describe()

Output: (description of the dataframe)


Note: we can also get the description of the categorical columns of the dataset
if we specify include='all' in the describe function, as sketched below.
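A minimal sketch of this variant:

# include='all' adds categorical summaries (count, unique, top, freq)
# alongside the numeric statistics
df.describe(include='all')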

Now, let’s also see the columns and their data types. For this, we will use
the info() method.

# information about the dataset


df.info()

Output: (information about the dataset)

Changing Dtype from Object to Datetime


Start Date is an important column for employees. However, it is not of much
use if we cannot handle it properly. To handle this type of data, pandas
provides the special function to_datetime(), with which we can change the
object type to DateTime format.
# convert "Start Date" column to datetime data type
df['Start Date'] = pd.to_datetime(df['Start Date'])
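Once the column is a datetime, its components can be pulled out with the .dt accessor. A minimal sketch (Start Year here is a hypothetical derived column, not part of the original dataset):

# extract pieces of the parsed dates, e.g., the year each employee joined
df['Start Year'] = df['Start Date'].dt.year
print(df['Start Year'].head())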

We can see the number of unique elements in our dataset. This will help us in
deciding which type of encoding to choose for converting categorical columns
into numerical columns.

df.nunique()

Output:
First Name 200
Gender 2
Start Date 972
Last Login Time 720
Salary 995
Bonus % 971
Senior Management 2
Team 10
dtype: int64
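Low-cardinality columns such as Gender (2 unique values) and Team (10) are natural candidates for encoding. A minimal sketch using one-hot encoding, one option among several:

# one-hot encode a low-cardinality categorical column
encoded = pd.get_dummies(df, columns=['Team'])
print(encoded.head())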

Handling Missing Values

 Why might a dataset contain missing values?

 Missing values can occur when no information is provided for one or more
items or for a whole unit.
 For example, different users being surveyed may choose not to share their
income, and some users may choose not to share their address; in this way,
many values in a dataset end up missing.
 Missing data is a very big problem in real-life scenarios. Missing data can
also be referred to as NA (Not Available) values in pandas. There are
several useful functions for detecting, removing, and replacing null values
in a Pandas DataFrame:

 isnull()
 notnull()
 dropna()
 fillna()
 replace()
 interpolate()

Pandas isnull() and notnull() Method


While making a DataFrame from a CSV file with Pandas, many blank fields
are imported as null values into the DataFrame, which later creates problems
while operating on that data frame.

Pandas isnull() and notnull() methods are used to check and manage NULL
values in a data frame.

Pandas DataFrame isnull() Method


Syntax: pandas.isnull(dataframe) or DataFrame.isnull()
Pandas dropna() method allows the user to analyze and drop Rows/Columns
with Null values in different ways.

Pandas DataFrame.dropna() Syntax


Syntax: DataFrameName.dropna(axis=0, how=’any’, thresh=None,
subset=None, inplace=False)
Parameters:
 axis: axis takes int or string value for rows/columns. Input can be 0 or 1 for
Integer and ‘index’ or ‘columns’ for String.
 how: how takes a string value of two kinds only ('any' or 'all'). 'any' drops
the row/column if ANY value is null, and 'all' drops only if ALL values
are null.
 thresh: thresh takes an integer value giving the minimum number of
non-NA values required to keep the row/column.
 subset: an array which limits the dropping process to the passed
rows/columns through a list.
 inplace: a boolean which makes the changes in the data frame itself if
True.
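A minimal sketch of dropna() on the employee data (the threshold value here is illustrative):

# drop rows where any value is missing
cleaned_any = df.dropna(how='any')

# drop rows only if every value in the row is missing
cleaned_all = df.dropna(how='all')

# keep only rows that have at least 6 non-NA values
cleaned_thresh = df.dropna(thresh=6)

print(len(df), len(cleaned_any), len(cleaned_all), len(cleaned_thresh))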

df.isnull().sum()

Output: (count of missing values per column)
We can see that every column has a different number of missing values:
Gender, for example, has 145 missing values, while Salary has 0. For handling
these missing values there are several options, like dropping the rows
containing NaN or replacing NaN with the mean, median, mode, or some other
value. Now, let's try to fill in the missing values of Gender with the string "No
Gender".
df["Gender"].fillna("No Gender", inplace = True)

df.isnull().sum()
Output: (Gender now shows 0 missing values)
