Data Mining Vs Data Exploration UNIT-II

Data mining vs Data Exploration:

 There are two main methodologies or techniques used to retrieve relevant
data from large, unorganized pools: the manual method and the automatic
method.
 The manual method is another name for data exploration, while the
automatic method is also known as data mining.
 Data mining generally refers to gathering relevant data from large
databases. Data exploration, on the other hand, refers to a data user
finding their way through large amounts of data to gather the necessary
information.

What is Data Exploration?

 Data exploration refers to the initial step in data analysis. Data analysts
use data visualization and statistical techniques to describe dataset
characteristics, such as size, quantity, and accuracy, in order to better
understand the nature of the data.

 Data exploration techniques include both manual analysis and automated
data exploration software solutions that visually explore and identify
relationships between different data variables, the structure of the dataset,
the presence of outliers, and the distribution of data values to reveal
patterns and points of interest, enabling data analysts to gain greater
insight into the raw data.
 Data is often gathered in large, unstructured volumes from various
sources. Data analysts must first understand and develop a comprehensive
view of the data before extracting relevant data for further analysis, such
as univariate, bivariate, multivariate, and principal components analysis.

Why is Data Exploration Important?

 Humans process visual data better than numerical data.
 The big challenge for data scientists and data analysts is to assign meaning
to thousands of rows and columns of data points and communicate that
meaning without any visual components.
 Data visualization in data exploration leverages familiar visual cues such
as shapes, dimensions, colors, lines, points, and angles so that data
analysts can effectively visualize and define the metadata and then
perform data cleansing.
 Performing the initial step of data exploration enables data analysts to
better understand and visually identify anomalies and relationships that
might otherwise go undetected.

Data Exploration Tools


 Manual data exploration methods entail writing scripts to analyze raw
data or manually filtering data into spreadsheets.
 Automated data exploration tools, such as data visualization software,
help data scientists easily monitor data sources and perform big data
exploration.
 A popular tool for manual data exploration is Microsoft Excel, whose
spreadsheets can create basic charts for data exploration, view raw data,
and identify the correlation between variables.
 To identify the correlation between two continuous variables in Excel,
use the CORREL() function to return the correlation (a pandas
equivalent is sketched after this list).
 To identify the correlation between two categorical variables in Excel,
the two-way table method, the stacked column chart method, and the
chi-square test are effective.
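For readers working in Python instead of Excel, pandas offers equivalent checks. This is a minimal sketch, assuming a hypothetical employees.csv with continuous columns Salary and Bonus % and categorical columns Gender and Senior Management (the employee dataset used later in this unit):

import pandas as pd

df = pd.read_csv('employees.csv')

# continuous vs. continuous: Pearson correlation (Excel's CORREL equivalent)
print(df['Salary'].corr(df['Bonus %']))

# categorical vs. categorical: a two-way (contingency) table
print(pd.crosstab(df['Gender'], df['Senior Management']))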

What can Data Exploration Do?


The goals of data exploration fall into three categories.

1. Archival: Data exploration can convert data from physical formats (such
as books, newspapers, and invoices) into digital formats (such as
databases) for backup.
2. Transfer of data format: If you want to transfer the data from your
current website into a new website under development, you can collect
data from your own website by extracting it.
3. Data analysis: As the most common goal, the extracted data can be
further analyzed to generate insights. This may sound similar to the data
analysis process in data mining, but note that data analysis is the goal of
data exploration, not part of its process. What's more, the data is analyzed
differently. One example is that e-store owners extract product details
from eCommerce websites like Amazon to monitor competitors'
strategies.
Use Cases of Data Exploration

Data exploration has been widely used in multiple industries serving different
purposes. Besides monitoring prices in eCommerce, data exploration can help
in individual paper research, news aggregation, marketing, real estate, travel and
tourism, consulting, finance, and many more.

o Lead generation: Companies can extract data from directories like Yelp,
Crunchbase, and Yellowpages and generate leads for business
development.
o Content & news aggregation: Content aggregation websites can get
regular data feeds from multiple sources and keep their sites fresh and up-
to-date.
o Sentiment analysis: After extracting the online
reviews/comments/feedback from social media websites like Instagram
and Twitter, people can analyze the underlying attitudes and understand
how they perceive a brand, product, or phenomenon.

What is Data Mining?

Data mining could be called a subset of data analysis. It explores and analyzes
huge volumes of data to find important patterns and rules.

Data mining is a systematic and successive method of identifying and
discovering hidden patterns and information throughout a big dataset.
Moreover, it is used to build machine learning models that are further used in
artificial intelligence.

Exploratory Data Analysis (EDA)

Let’s suppose we want to make a data science project on the employee churn
rate of a company.

But before we build a model on this data, we have to analyze all the
information present across the dataset, such as the salary distribution of
employees, the bonus they are getting, their starting time, and the assigned
team.

These all steps of analyzing and modifying the data come under EDA.
Exploratory Data Analysis (EDA) is an approach that is used to analyze the
data and discover trends, patterns, or check assumptions in data with the help
of statistical summaries and graphical representations.

Types of EDA
Depending on the number of columns we are analyzing, we can divide EDA
into the following types.
Univariate Analysis
 In univariate analysis, we analyze or deal with only one variable at a
time.

 The analysis of univariate data is thus the simplest form of analysis,
since the information deals with only one quantity that changes.

 It does not deal with causes or relationships.

 The main purpose of the analysis is to describe the data and find patterns
that exist within it. A minimal sketch follows this list.
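A minimal univariate sketch in Python, assuming the employee dataset introduced later in this unit and its numeric Salary column:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('employees.csv')

# non-graphical summary of a single variable
print(df['Salary'].describe())

# graphical summary of the same variable: a histogram of its distribution
df['Salary'].plot(kind='hist', bins=20, title='Salary distribution')
plt.show()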

Bi-Variate analysis
 This type of data involves two different variables.

 The analysis of this type of data deals with causes and relationships, and
the analysis is done to find out the relationship between the two
variables. A minimal sketch follows this list.
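A minimal bivariate sketch, again assuming the employee data and treating Salary and Bonus % as the two variables:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('employees.csv')

# numeric summary of the relationship between the two variables
print(df['Salary'].corr(df['Bonus %']))

# graphical summary: a scatter plot of one variable against the other
df.plot(kind='scatter', x='Salary', y='Bonus %')
plt.show()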

Multivariate Analysis
When the data involves three or more variables, it is categorized under
multivariate.
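A minimal multivariate sketch, again assuming the employee data: a pairwise correlation matrix examines all numeric variables at once.

import pandas as pd

df = pd.read_csv('employees.csv')

# pairwise correlations among all numeric columns at once
# (numeric_only=True requires pandas 1.5 or later)
print(df.corr(numeric_only=True))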

Depending on the type of analysis we can also subcategorize EDA into two
parts.
1. Non-graphical Analysis – In non-graphical analysis, we analyze data
using statistical tools like the mean, median, mode, or skewness (a sketch
follows this list).
2. Graphical Analysis – In graphical analysis, we use visualization charts to
visualize trends and patterns in the data.
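A minimal non-graphical sketch, assuming the employee data:

import pandas as pd

df = pd.read_csv('employees.csv')

# basic non-graphical summaries of a single numeric column
print(df['Salary'].mean())    # mean
print(df['Salary'].median())  # median
print(df['Salary'].mode())    # mode (may return more than one value)
print(df['Salary'].skew())    # skewness of the distribution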
Exploratory Data Analysis (EDA) Using Python Libraries
For the simplicity of the article, we will use a single dataset: the employee
data. It contains 8 columns, namely First Name, Gender, Start Date, Last
Login Time, Salary, Bonus %, Senior Management, and Team.
We can get the dataset here: Employees.csv
Let's read the dataset using the Pandas read_csv() function and print the first
five rows. To print the first five rows, we will use the head() function.

import pandas as pd
import numpy as np
# read dataset using pandas
df = pd.read_csv('employees.csv')
df.head()

Output: (first five rows of the dataframe)

Getting Insights About The Dataset

Let's see the shape of the data using the shape attribute.

df.shape

Output:
(1000, 8)
This means that this dataset has 1000 rows and 8 columns.

Python is a great language for doing data analysis, and Pandas is one of the
packages that makes importing and analyzing data much easier.

Pandas DataFrame describe()


Pandas describe() is used to view some basic statistical details like percentile,
mean, std, etc. of a data frame or a series of numeric values. When this method
is applied to a series of strings, it returns a different output which is shown in
the examples below.
Syntax: DataFrame.describe(percentiles=None, include=None,
exclude=None)
Parameters:
 percentiles: list-like of numbers between 0 and 1, giving the
percentiles to return
 include: list of data types to be included while describing the dataframe.
Default is None
 exclude: list of data types to be excluded while describing the dataframe.
Default is None
Return type: statistical summary of the data frame.

Dataset used
The data frame used in the following examples contains data from some NBA
players (nba.csv). Let's have a look at the data by importing it.
import pandas as pd

# read the csv file and print its first rows
data = pd.read_csv('nba.csv')
print(data.head())

Output:
Name Team Number Position Age Height Weight
College Salary
0 Avery Bradley Boston Celtics 0.0 PG 25.0 6-2 180.0
Texas 7730337.0
1 Jae Crowder Boston Celtics 99.0 SF 25.0 6-6 235.0
Marquette 6796117.0
2 John Holland Boston Celtics 30.0 SG 27.0 6-5 205.0 Boston
University NaN
3 R.J. Hunter Boston Celtics 28.0 SG 22.0 6-5 185.0 Georgia
State 1148640.0
4 Jonas Jerebko Boston Celtics 8.0 PF 29.0 6-10 231.0
NaN 5000000.0

Using Describe function in Pandas


We can easily learn about several statistical measures, including mean,
median, standard deviation, quartiles, and more, by using describe() on a
DataFrame.

print(data.describe())
Number Age Weight Salary
count 457.000000 457.000000 457.000000 4.460000e+02
mean 17.678337 26.938731 221.522976 4.842684e+06
std 15.966090 4.404016 26.368343 5.229238e+06
min 0.000000 19.000000 161.000000 3.088800e+04
25% 5.000000 24.000000 200.000000 1.044792e+06
50% 13.000000 26.000000 220.000000 2.839073e+06
75% 25.000000 30.000000 240.000000 6.500000e+06
max 99.000000 40.000000 307.000000 2.500000e+07

Explanation of the description of numerical columns:

count: total number of non-empty values
mean: mean of the column values
std: standard deviation of the column values
min: minimum value in the column
25%: 25th percentile
50%: 50th percentile (the median)
75%: 75th percentile
max: maximum value in the column
Let's get a quick summary of the dataset using the pandas describe() method.
The describe() function applies basic statistical computations to the dataset,
like extreme values, the count of data points, the standard deviation, etc. Any
missing or NaN value is automatically skipped. The describe() function gives a
good picture of the distribution of the data.

df.describe()

Output: (description of the dataframe)


Note: we can also get the description of the categorical columns of the dataset
if we specify include='all' in the describe function, as sketched below.
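A minimal sketch of this variant:

# include='all' adds categorical summaries (count, unique, top, freq)
# alongside the numeric statistics
df.describe(include='all')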

Now, let’s also see the columns and their data types. For this, we will use
the info() method.

# information about the dataset


df.info()

Output: (information about the dataset)

Changing Dtype from Object to Datetime


Start Date is an important column for employees. However, it is not of much
use if we cannot handle it properly. To handle this type of data, pandas
provides the special function to_datetime(), with which we can change the
object type to DateTime format.
# convert "Start Date" column to datetime data type
df['Start Date'] = pd.to_datetime(df['Start Date'])
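Once the column is a datetime, its components can be pulled out with the .dt accessor. A minimal sketch (Start Year here is a hypothetical derived column, not part of the original dataset):

# extract pieces of the parsed dates, e.g., the year each employee joined
df['Start Year'] = df['Start Date'].dt.year
print(df['Start Year'].head())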

We can see the number of unique elements in our dataset. This will help us in
deciding which type of encoding to choose for converting categorical columns
into numerical columns.

df.nunique()

Output:
First Name 200
Gender 2
Start Date 972
Last Login Time 720
Salary 995
Bonus % 971
Senior Management 2
Team 10
dtype: int64
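Low-cardinality columns such as Gender (2 unique values) and Team (10) are natural candidates for encoding. A minimal sketch using one-hot encoding, one option among several:

# one-hot encode a low-cardinality categorical column
encoded = pd.get_dummies(df, columns=['Team'])
print(encoded.head())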

Handling Missing Values

 Why might a dataset contain missing values?

 Missing values can occur when no information is provided for one or more
items or for a whole unit.
 For example, different users being surveyed may choose not to share their
income, and some users may choose not to share their address; in this way,
many values in a dataset end up missing.
 Missing data is a very big problem in real-life scenarios. Missing data can
also be referred to as NA (Not Available) values in pandas. There are
several useful functions for detecting, removing, and replacing null values
in a Pandas DataFrame:

 isnull()
 notnull()
 dropna()
 fillna()
 replace()
 interpolate()

Pandas isnull() and notnull() Method


While making a DataFrame from a CSV file with Pandas, many blank fields
are imported as null values into the DataFrame, which later creates problems
while operating on that data frame.

Pandas isnull() and notnull() methods are used to check and manage NULL
values in a data frame.

Pandas DataFrame isnull() Method


Syntax: pandas.isnull(dataframe) or DataFrame.isnull()
Pandas dropna() method allows the user to analyze and drop Rows/Columns
with Null values in different ways.

Pandas DataFrame.dropna() Syntax


Syntax: DataFrameName.dropna(axis=0, how=’any’, thresh=None,
subset=None, inplace=False)
Parameters:
 axis: axis takes int or string value for rows/columns. Input can be 0 or 1 for
Integer and ‘index’ or ‘columns’ for String.
 how: how takes a string value of two kinds only ('any' or 'all'). 'any' drops
the row/column if ANY value is null, and 'all' drops only if ALL values
are null.
 thresh: thresh takes an integer value giving the minimum number of
non-NA values required to keep the row/column.
 subset: an array which limits the dropping process to the passed
rows/columns through a list.
 inplace: a boolean which makes the changes in the data frame itself if
True.
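A minimal sketch of dropna() on the employee data (the threshold value here is illustrative):

# drop rows where any value is missing
cleaned_any = df.dropna(how='any')

# drop rows only if every value in the row is missing
cleaned_all = df.dropna(how='all')

# keep only rows that have at least 6 non-NA values
cleaned_thresh = df.dropna(thresh=6)

print(len(df), len(cleaned_any), len(cleaned_all), len(cleaned_thresh))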

df.isnull().sum()

Output: (count of missing values per column)
We can see that every column has a different number of missing values:
Gender, for example, has 145 missing values, while Salary has 0. For handling
these missing values there are several options, like dropping the rows
containing NaN or replacing NaN with the mean, median, mode, or some other
value. Now, let's try to fill in the missing values of Gender with the string "No
Gender".
df["Gender"].fillna("No Gender", inplace = True)

df.isnull().sum()
Output: (Gender now shows 0 missing values)
