Data Mining Vs Data Exploration UNIT-II
Data exploration refers to the initial step in data analysis. Data analysts
use data visualization and statistical techniques to describe dataset
characteristics, such as size, quantity, and accuracy, in order to better
understand the nature of the data. Common goals of data exploration and
extraction include:
1. Archival: Data exploration can convert data from physical formats (such
as books, newspapers, and invoices) into digital formats (such as
databases) for backup.
2. Transfer the data format: If you want to move the data from your
current website into a new website under development, you can collect
that data by extracting it from your own site.
3. Data analysis: As the most common goal, the extracted data can be
further analyzed to generate insights. This may sound similar to the data
analysis process in data mining, but note that data analysis is the goal of
data exploration, not part of its process. What's more, the data is analyzed
differently: for example, e-store owners extract product details from
eCommerce websites like Amazon to monitor competitors' strategies (a
small extraction sketch follows this list).
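As a minimal, illustrative sketch of this kind of extraction, the snippet below fetches a hypothetical product page and pulls out a title and price with the requests and BeautifulSoup libraries. The URL and CSS selectors are placeholder assumptions, not a real endpoint, and any real site's terms of service must be respected.

import requests
from bs4 import BeautifulSoup

# Hypothetical product page; a real target needs its own URL and selectors.
url = "https://example.com/products/sample-item"
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Assume the page marks up its data with these (hypothetical) CSS classes.
title = soup.select_one(".product-title")
price = soup.select_one(".product-price")

print(title.get_text(strip=True) if title else "title not found")
print(price.get_text(strip=True) if price else "price not found")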
Use Cases of Data Exploration
Data exploration has been widely used in multiple industries, serving different
purposes. Besides monitoring prices in eCommerce, data exploration can help
in academic research, news aggregation, marketing, real estate, travel and
tourism, consulting, finance, and many more.
o Lead generation: Companies can extract data from directories like Yelp,
Crunchbase, and Yellowpages to generate leads for business
development.
o Content & news aggregation: Content aggregation websites can get
regular data feeds from multiple sources and keep their sites fresh and up-
to-date.
o Sentiment analysis: After extracting online reviews, comments, and
feedback from social media websites like Instagram and Twitter, people
can analyze the underlying attitudes and understand how consumers
perceive a brand, product, or phenomenon (a small scoring sketch follows
this list).
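As a minimal sketch of such scoring, the snippet below rates a few made-up comments with the TextBlob library (an assumption: TextBlob is installed; the comments are fabricated examples, not real extracted data).

from textblob import TextBlob

# Fabricated example comments standing in for extracted reviews.
comments = [
    "Absolutely love this product, works perfectly!",
    "Terrible experience, the item broke after one day.",
    "It arrived on time.",
]

for text in comments:
    # polarity ranges from -1 (negative) to +1 (positive)
    polarity = TextBlob(text).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    print(f"{polarity:+.2f} {label}: {text}")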
Data mining can be viewed as a subset of data analysis. It explores and analyzes
large volumes of data to find meaningful patterns and rules.
Let's suppose we want to build a data science project on the employee churn
rate of a company.
Before we build a model on this data, we have to analyze all the information
present across the dataset, such as the salary distribution of employees, the
bonuses they receive, their start dates, and their assigned teams.
All these steps of analyzing and modifying the data come under EDA.
Exploratory Data Analysis (EDA) is an approach used to analyze data and
discover trends and patterns, or to check assumptions, with the help of
statistical summaries and graphical representations.
Types of EDA
Depending on the number of columns we are analyzing, we can divide EDA
into three types.
Univariate Analysis
In univariate analysis, we analyze or deal with only one variable at a
time.
The main purpose of the analysis is to describe the data and find patterns
that exist within it.
Bivariate Analysis
This type of analysis involves two different variables.
It deals with causes and relationships, and it is done to find out the
relationship between the two variables.
Multivariate Analysis
When the data involves three or more variables, it is categorized as
multivariate (short sketches of all three types follow below).
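As a hedged sketch, the snippet below shows one typical plot per type. It assumes the employees.csv file introduced later in this unit (with columns such as Salary and Bonus %) and that matplotlib and seaborn are installed.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('employees.csv')

# Univariate: distribution of a single variable
sns.histplot(df['Salary'])
plt.show()

# Bivariate: relationship between two variables
sns.scatterplot(x='Salary', y='Bonus %', data=df)
plt.show()

# Multivariate: pairwise correlations among all numeric columns
sns.heatmap(df.select_dtypes('number').corr(), annot=True)
plt.show()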
Depending on the type of analysis, we can also subcategorize EDA into two
parts.
1. Non-graphical Analysis – In non-graphical analysis, we analyze data
using statistical measures such as the mean, median, mode, or skewness
(see the sketch below).
2. Graphical Analysis – In graphical analysis, we use visualization charts to
reveal trends and patterns in the data.
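For instance, a quick non-graphical pass over one numeric column of the employee data might look like this (a sketch; it assumes the employees.csv file described below):

import pandas as pd

df = pd.read_csv('employees.csv')

# Non-graphical analysis: numeric summaries of a single column
print(df['Salary'].mean())     # average salary
print(df['Salary'].median())   # middle value
print(df['Salary'].mode()[0])  # most frequent value
print(df['Salary'].skew())     # asymmetry of the distribution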
Exploratory Data Analysis (EDA) Using Python Libraries
For the simplicity of the article, we will use a single dataset: the employee
data in Employees.csv. It contains 8 columns, namely First Name, Gender,
Start Date, Last Login Time, Salary, Bonus %, Senior Management, and Team.
Let's read the dataset using the Pandas read_csv() function and print the
first five rows with the head() function.
import pandas as pd
import numpy as np

# read dataset using pandas
df = pd.read_csv('employees.csv')
df.head()

The head() call displays the first five rows. To check the dimensions of the
dataset, we use the shape attribute:

df.shape

Output:

(1000, 8)

This means that this dataset has 1000 rows and 8 columns.
Python is a great language for doing data analysis, and Pandas is one of the
packages that makes importing and analyzing data much easier.
Dataset used
The following examples use a data frame containing data on some NBA
players (nba.csv). Let's have a look at the data by importing it.
import pandas as pd

# reading and printing csv file
data = pd.read_csv('nba.csv')
print(data.head())
Output:

            Name            Team  Number Position   Age Height  Weight            College     Salary
0  Avery Bradley  Boston Celtics     0.0       PG  25.0    6-2   180.0              Texas  7730337.0
1    Jae Crowder  Boston Celtics    99.0       SF  25.0    6-6   235.0          Marquette  6796117.0
2   John Holland  Boston Celtics    30.0       SG  27.0    6-5   205.0  Boston University        NaN
3    R.J. Hunter  Boston Celtics    28.0       SG  22.0    6-5   185.0      Georgia State  1148640.0
4  Jonas Jerebko  Boston Celtics     8.0       PF  29.0   6-10   231.0                NaN  5000000.0
To get a statistical summary of the numeric columns, we use the describe() method:

print(data.describe())

Output:
Number Age Weight Salary
count 457.000000 457.000000 457.000000 4.460000e+02
mean 17.678337 26.938731 221.522976 4.842684e+06
std 15.966090 4.404016 26.368343 5.229238e+06
min 0.000000 19.000000 161.000000 3.088800e+04
25% 5.000000 24.000000 200.000000 1.044792e+06
50% 13.000000 26.000000 220.000000 2.839073e+06
75% 25.000000 30.000000 240.000000 6.500000e+06
max 99.000000 40.000000 307.000000 2.500000e+07
The same method applies to the employee data: df.describe() returns the
count, mean, standard deviation, minimum, quartiles, and maximum of its
numeric columns.
Now, let's also see the columns and their data types. For this, we will use
the info() method.

df.info()

The output lists each column's name, non-null count, and data type.
Next, we can see the number of unique elements in each column of our dataset.
This will help us decide which type of encoding to choose for converting
categorical columns into numerical columns (a short encoding sketch follows
the output below).
df.nunique()
Output:
First Name 200
Gender 2
Start Date 972
Last Login Time 720
Salary 995
Bonus % 971
Senior Management 2
Team 10
dtype: int64
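As a hedged sketch of that decision: low-cardinality columns such as Gender (2 unique values) or Team (10) suit one-hot encoding, while high-cardinality columns like First Name usually need a different treatment. Using only pandas:

import pandas as pd

df = pd.read_csv('employees.csv')

# One-hot encode the low-cardinality categorical columns;
# high-cardinality columns (e.g. First Name) are left untouched here.
encoded = pd.get_dummies(df, columns=['Gender', 'Team', 'Senior Management'])
print(encoded.shape)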
Pandas provides several methods for detecting and handling NULL values in a
data frame:

o isnull()
o notnull()
o dropna()
o fillna()
o replace()
o interpolate()

The isnull() and notnull() methods are used to check for NULL values; the
remaining methods handle them.
df.isnull().sum()
Output:

We can see that each column has a different number of missing values; for
example, Gender has 145 missing values while Salary has none. To handle these
missing values, we can either drop the rows containing NaN or replace NaN
with the mean, median, mode, or some other value (a short sketch of these
options follows).
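Here is a minimal sketch of those options; each line is independent, and which one is appropriate depends on the column (the -1 sentinel in the last line is an assumption for illustration):

import numpy as np
import pandas as pd

df = pd.read_csv('employees.csv')

dropped = df.dropna()                                # drop any row containing NaN
filled = df['Salary'].fillna(df['Salary'].median())  # fill NaN with the median
interpolated = df['Bonus %'].interpolate()           # linearly interpolate numeric gaps
replaced = df.replace(-1, np.nan)                    # assume -1 marks a missing value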
Now, let’s try to fill in the missing values of gender with the string “No
Gender”.
df["Gender"].fillna("No Gender", inplace = True)
df.isnull().sum()
Output:

After the fill, the Gender column reports 0 missing values.