DATA SCIENCE PORTFOLIO
Jombaba.s7@gmail.com
I enjoy working with data. This portfolio highlights one of the projects I have implemented
using Python-based tools. The project is presented as five sub-projects, each capturing a
specific part of the overall work.
Imagine that a financial institution or bank wishes to solve a ‘Customer Acquisition and
Customer Retention’ problem. As a data scientist, this is my attempt at providing a complete
solution, and the series of five projects illustrates a plausible approach to resolving
the problem.
Exploratory Data Analysis Project:
Purpose:
This is the second project in the series. Here, our purpose is a pre-analysis examination
of the ‘Credit Card Application’ (CAD) dataset, which we cleaned in the previous project.
In that project, we successfully removed all the missing values in the dataset. Now, as a
prelude to data analysis, we ask questions in order to investigate and better understand the
dataset.
In other words, we seek to explain why the phenomenon described in the dataset happened.
We are looking for opportunities to expand the scope of our search for a plausible
model of the class variable.
A diagnostic analysis of the dataset may also reveal unexpected findings along the way.
Dataset:
The dataset is found at: https://archive.ics.uci.edu/ml/datasets/credit+approval
The dataset comprises continuous and nominal attributes with both small and large values.
Originally, there were 67 missing values in the dataset. They were treated in a previous
sub-project that wrangled the dataset.
Here are some specifics of the dataset:
The number of instances (observations) is 690. The number of attributes is 15 (columns A1-A15).
There is one class attribute (column A16): 307 instances (44.5%) are “+” and 383 (55.5%)
are “-”. Below is a schema of the variables in the dataset.
Attribute Label   Value Type     Attribute Label   Value Type
A1                Nominal        A11               Continuous (Integer)
A2                Continuous     A12               Nominal
A3                Continuous     A13               Nominal
A4                Nominal        A14               Continuous (Integer)
A5                Nominal        A15               Continuous (Integer)
A6                Nominal        A16               Class attribute
A7                Nominal
A8                Continuous
A9                Nominal
A10               Nominal
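The dataset specifics above can be sanity-checked with a few lines of Python. This is a minimal sketch that uses only the counts and column types quoted in the text; the variable names are illustrative and nothing is read from the actual file.

```python
# Class counts and column types as quoted in the text (not read from the file).
class_counts = {"+": 307, "-": 383}
numeric_cols = ["A2", "A3", "A8", "A11", "A14", "A15"]
nominal_cols = ["A1", "A4", "A5", "A6", "A7", "A9", "A10", "A12", "A13"]

# Total number of observations and the share of each class in percent.
total = sum(class_counts.values())
shares = {label: round(100 * n / total, 1) for label, n in class_counts.items()}

print(total)                                   # number of instances
print(shares)                                  # class balance in percent
print(len(numeric_cols) + len(nominal_cols))   # number of attributes, A1-A15
```

The class shares are close enough to 50/50 that the dataset can be treated as roughly balanced when comparing the ‘+’ and ‘-’ groups.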
Process
We explore the dataset by asking questions to which answers are provided in the form of tables
and visualizations.
Question 1:
What are the relationships between the various variables of the dataset?
For this dataset, suitable visualizations are plots that emphasize the distributions of, and
the statistical relationships among, the variables.
The objective is to detect differences and similarities between the ‘+’ and ‘-’ values of the
class attribute, A16.
The pairwise bivariate distributions (pair plots) below are produced by this code.
import pandas as pd
import seaborn as sns
sns.set(style='ticks')
sns.pairplot(pd.read_csv('C:/Users/Owner/Desktop/DATA/CAD/ABC-1.csv'), hue='A1')
Out [14]: <seaborn.axisgrid.PairGrid at 0x2061c3a2588>
The plot grid is symmetric about its diagonal, so we consider only the plots above the
diagonal. The diagonal shows the univariate distributions, split by class attribute value,
of the six numeric variables (A2, A3, A8, A11, A14 and A15).
Observe that for variable A2 this distribution is approximately normal and similar for
both ‘+’ and ‘-’. This is not the case for all the numeric variables.
Next, we observe that the class attribute values are more dispersed in some pairwise plots
than in others.
Since our ultimate goal is to model the data efficiently, we take a closer look at pairwise
plots where the boundary between the ‘+’ and ‘-’ values is clearly defined.
For example, let us examine the plots of A2 vs. A14, A3 vs. A8 and A3 vs. A14 further.
Question 2:
What are the statistical characteristics of the four variables (A2, A3, A8 and A14) identified
above?
Generally, the visualizations that best serve this purpose are those that emphasize the
distribution and statistical relationships among the variables. We therefore start by
looking for differences and similarities in the characteristics of the ‘+’ and ‘-’ values
of the class attribute, A16.
The pair plots below reveal the general trend and distribution of variables A2, A3, A8 and A14
as they interact with one another. Take the plot of A2 vs. A3, for example: the ‘-’ class
values are concentrated between A2 = 18 and 60 on the y-axis and between A3 = 0 and 16
on the x-axis.
Also, we observe that in general the number of both ‘+’ and ‘-’ increases steadily as the value of
A3 increases. We can derive similar impressions from the other pair plots as well.
Using this code, we derive the plots below.
sns.pairplot(data=cadnv,hue='A16',vars=['A2','A3','A8','A14'],kind='reg',height=3)
<seaborn.axisgrid.PairGrid at 0x1ad37b93e10>
Based on these observations, one might be tempted to use an “if ... then ... else” rule of
thumb to propose a rough and simple model for the data.
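To make the idea of such a rule of thumb concrete, here is a minimal sketch. It is not the model used in this project: the function name is hypothetical, and the thresholds are simply the A2 and A3 ranges quoted earlier for the region where ‘-’ values are concentrated.

```python
def rule_of_thumb(a2: float, a3: float) -> str:
    """Toy rule (illustrative only): predict '-' inside the dense '-' region."""
    # Thresholds are the ranges read off the A2 vs. A3 pair plot in the text.
    in_dense_region = 18 <= a2 <= 60 and 0 <= a3 <= 16
    return "-" if in_dense_region else "+"


# A few hypothetical applicants, given as (A2, A3) pairs.
for a2, a3 in [(30.0, 5.0), (70.0, 5.0), (30.0, 20.0)]:
    print(a2, a3, rule_of_thumb(a2, a3))
```

Because the ‘+’ values also fall largely inside the same ranges, a rule like this would label most observations ‘-’, which is one reason a rough rule is only a starting point.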
Clearly, we need to examine the dataset further in order to improve on such a model. Hence, we
examine the A2 vs. A3, A2 vs. A8 and A2 vs. A14 pairs further.
Question 3:
What are the statistical characteristics of variables A2, A3, A8 and A14 which were identified
above?
We observe that the univariate distributions of A2 and A14 are similar and those of A3 and A8
are similar as well.
From the wrangling exercise on this dataset, we know that the standard deviation of
variable A8 (sigma = 3.35) is smaller than that of variable A3 (sigma = 4.98), while the
standard deviation of variable A14 (sigma = 173.81) is much larger than that of variable A2
(sigma = 11.96). The same ordering therefore holds for the variances (sigma squared).
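As a small worked check, the quoted sigma values can be squared to obtain variances. This sketch assumes, based on the symbol, that each sigma is a standard deviation; the dictionary names are illustrative.

```python
# Sigma values quoted in the text, assumed here to be standard deviations.
sigmas = {"A2": 11.96, "A3": 4.98, "A8": 3.35, "A14": 173.81}

# Variance is the square of the standard deviation.
variances = {name: round(sigma ** 2, 2) for name, sigma in sigmas.items()}

for name, var in variances.items():
    print(f"{name}: sigma={sigmas[name]}, variance={var}")

# The two comparisons made in the text, expressed on the variance scale.
print(variances["A8"] < variances["A3"])
print(variances["A14"] > variances["A2"])
```

The gap between A14 and A2 becomes far more pronounced on the variance scale, since squaring magnifies the difference in spread.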
We consider these variables further in the matrix below.
To build the correlation coefficient matrix, we used the Pearson product-moment coefficient
to measure the linear correlation between the numeric variables (A2, A3, A8, A11, A14 and A15).
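For reference, the Pearson product-moment coefficient can be computed directly from its definition. The sketch below uses only the standard library and toy data; the function name and sample values are illustrative and are not taken from the CAD dataset.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


# Toy data: a perfectly increasing and a perfectly decreasing relationship.
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [2, 4, 6, 8, 10]))
print(pearson_r(x, [10, 8, 6, 4, 2]))
```

In practice, pandas offers the same computation for all numeric columns at once through `DataFrame.corr()`, whose default method is Pearson.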
For the nominal variables (A4, A5, A6, A7, A9, A10, A12, A13), we used Pearson’s chi-squared
test on the contingency table and normalized the result to the range [0, 1] using Cramér’s V
to measure the association between two variables.
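The chi-squared statistic and its normalization to Cramér’s V can be sketched as follows, using V = sqrt(chi² / (n · (min(rows, cols) − 1))). This is a minimal standard-library version with toy contingency tables; it applies no continuity or bias correction, and whether the original analysis used one is not stated, so treat that as an assumption.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Assumes every row and column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Pearson chi-squared: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return sqrt(chi2 / (n * (k - 1)))


# Toy tables: perfect association and no association.
print(cramers_v([[10, 0], [0, 10]]))
print(cramers_v([[5, 5], [5, 5]]))
```

With pandas, the contingency table for two nominal columns would typically come from `pd.crosstab`. Note that `scipy.stats.chi2_contingency` applies Yates’ correction to 2×2 tables by default, which gives slightly smaller values than this uncorrected sketch.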
The Correlation Coefficient Matrix
(A dot marks a pair for which no coefficient is reported.)
         A1     A2     A3     A4     A5     A6     A7     A8     A9    A10    A11    A12    A13    A14    A15    A16
A1        1      .      .  0.074  0.074  0.353  0.217      .  0.026  0.078      .  0.052  0.085      .      .  0.029
A2        .      1  0.201      .      .      .      .  0.393      .      .  0.186      .      .  -0.08  0.019      .
A3        .  0.201      1      .      .      .      .  0.299      .      .  0.271      .      .  -0.22  0.123      .
A4    0.074      .      .      1      1  0.159  0.366      .  0.151  0.178      .  0.059  0.189      .      .  0.197
A5    0.074      .      .      1      1  0.159  0.366      .  0.151  0.178      .  0.059  0.189      .      .  0.197
A6    0.353      .      .  0.159  0.159      1   0.59      .  0.326   0.28      .  0.187   0.23      .      .  0.377
A7    0.217      .      .  0.366  0.366   0.59      1      .  0.271  0.129      .  0.154  0.199      .      .  0.255
A8        .  0.393  0.299      .      .      .      .      1      .      .  0.322      .      .  -0.08  0.051      .
A9    0.026      .      .  0.151  0.151  0.326  0.271      .      1  0.432      .  0.091  0.149      .      .   0.72
A10   0.078      .      .  0.178  0.178   0.28  0.129      .  0.432      1      .  0.017  0.249      .      .  0.458
A11       .  0.186  0.271      .      .      .      .  0.322      .      .      1      .      .  -0.12  0.064      .
A12   0.052      .      .  0.059  0.059  0.187  0.154      .  0.091  0.017      .      1  0.075      .      .  0.032
A13   0.085      .      .  0.189  0.189   0.23  0.199      .  0.149  0.249      .  0.075      1      .      .  0.115
A14       .  -0.08  -0.22      .      .      .      .  -0.08      .      .  -0.12      .      .      1  0.066      .
A15       .  0.019  0.123      .      .      .      .  0.051      .      .  0.064      .      .  0.066      1      .
A16   0.029      .      .  0.197  0.197  0.377  0.255      .   0.72  0.458      .  0.032  0.115      .      .      1
Using the correlation coefficients, we examined the variance further by determining the R²
(variance explained or predicted) indicator for a number of variable pairs. See the R² matrix
below.
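The R² Matrix (approximate %)
(A dot marks a pair for which no value is reported.)
        A1    A2    A3    A4    A5    A6    A7    A8    A9   A10   A11   A12   A13   A14   A15   A16
A1       0     .     .     .     .     .     .     .     .     .     .     .     .     .     .   0.1
A2       .     0     4     .     .     .     .    15     .     .     3     .     .   0.6     .     .
A3       .     4     0     .     .     .     .     9     .     .     7     .     .     5     2     .
A4       .     .     .     0     .     .     .     .     .     .     .     .     .     .     .     3
A5       .     .     .     .     0     .     .     .     .     .     .     .     .     .     .     3
A6       .     .     .     .     .     0     .     .     .     .     .     .     .     .     .    14
A7       .     .     .     .     .     .     0     .     .     .     .     .     .     .     .     6
A8       .    15     9     .     .     .     .     0     .     .    10     .     .     1   0.3     .
A9       .     .     .     .     .     .     .     .     0    19     .     .     .     .     .  51.8
A10      .     .     .     .     .     .     .     .    19     0     .     .     .     .     .    21
A11      .     3     7     .     .     .     .    10     .     .     0     .     .     1   0.4     .
A12      .     .     .     .     .     .     .     .     .     .     .     0     .     .     .   0.1
A13      .     .     .     .     .     .     .     .     .     .     .     .     0     .     .     1
A14      .   0.6     5     .     .     .     .     1     .     .     1     .     .     0   0.4     .
A15      .  0.15     2     .     .     .     .   0.3     .     .   0.4     .     .   0.4     0     .
A16    0.1     .     .     3     3    14     6     .  51.8    21     .   0.1     1     .     .     0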
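The R² entries follow from squaring the corresponding correlation coefficient and expressing the result as a percentage. The sketch below reproduces a few of them from coefficients listed in the correlation matrix; the helper name is illustrative, and squaring Cramér’s V in the same way as Pearson’s r follows this report’s approach rather than a general convention.

```python
def r_squared_percent(r: float) -> float:
    """Share of variance explained, in percent, for a correlation coefficient r."""
    return 100.0 * r * r


# Coefficients copied from the correlation coefficient matrix.
coefficients = {
    ("A2", "A8"): 0.393,
    ("A3", "A8"): 0.299,
    ("A11", "A8"): 0.322,
    ("A6", "A16"): 0.377,
    ("A9", "A16"): 0.72,
    ("A10", "A16"): 0.458,
}

for (left, right), r in coefficients.items():
    print(f"{left} vs {right}: r={r}, R2={r_squared_percent(r):.1f}%")
```

The printed values agree with the approximate percentages in the matrix once rounded, for example about 51.8% for A9 vs. A16.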
For our purpose, we consider the R² values for the numeric pairs A2 vs. A8 (15%), A3 vs. A8
(9%) and A11 vs. A8 (10%). These values are highlighted in green. Similarly, for the nominal
pairs A6 vs. A16 (14%), A9 vs. A16 (51.8%) and A10 vs. A16 (21%), the corresponding values
are highlighted in red.
Summary and Conclusion:
Our diagnostic analysis has revealed useful information that might easily have been
overlooked. It gives us a broad base of knowledge on which to build a plausible model.
The diagnostic analysis suggests that the behavior of the + and – values of variable A16
is greatly affected by variables A2, A3, A8 and A11. How critical are these effects in
determining a plausible model for the classifier? We explore this further in another project.