Basics of Python for Data Analysis
Pramod Toraskar.
Why learn Python for data analysis?
Here are some reasons which go in favour of learning Python:
● Open Source – free to install
● Awesome online community
● Very easy to learn
● Can become a common language for data science and for building web-based analytics products.
Choosing a development environment
1. Terminal / shell based
2. IDLE (default environment)
3. IPython notebook, similar to R Markdown
IPython environment (Jupyter):
http://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/install.html
Recall Python libraries and data structures
Data structures: lists, strings, tuples, dictionaries.
The following libraries are the ones you will need for most scientific computing and data
analysis:
● NumPy (Numerical Python). Its most powerful feature is the n-dimensional array. The library
also contains basic linear algebra functions, Fourier transforms, advanced random number capabilities,
and tools for integration with low-level languages such as Fortran, C and C++.
● SciPy (Scientific Python). SciPy is built on NumPy. It is one of the most useful libraries for a variety of
high-level science and engineering modules such as discrete Fourier transforms, linear algebra, optimization
and sparse matrices.
● Matplotlib for plotting a wide variety of graphs, from histograms to line plots to heat maps.
In an IPython notebook you can use the Pylab feature (ipython notebook --pylab=inline) to show these
plots inline; newer Jupyter versions use the %matplotlib inline magic instead. Without the inline option,
pylab turns the IPython environment into one very similar to Matlab. You can also use LaTeX commands
to add math to your plots.
● Pandas for structured data operations and manipulation. It is used extensively for data munging and
preparation. Pandas is a relatively recent addition to the Python ecosystem and has been instrumental in
boosting Python's usage in the data science community.
● Scikit-learn for machine learning. Built on NumPy, SciPy and Matplotlib, this library contains many
efficient tools for machine learning and statistical modeling, including classification, regression,
clustering and dimensionality reduction.
● Statsmodels for statistical modeling, Seaborn for statistical data visualization, and Bokeh for creating
interactive plots, dashboards and data applications in modern web browsers. Bokeh lets the user generate
elegant and concise graphics in the style of D3.js.
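As a small illustration of the NumPy features listed above, the sketch below builds a 2-D array, applies element-wise arithmetic, and solves a linear system. The numbers are arbitrary and chosen only for the example.

```python
import numpy as np

# A 2-D array (matrix) and a 1-D array (vector); the values are arbitrary.
a = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])

# Arithmetic is applied element-wise to the whole array at once.
scaled = a * 10

# Basic linear algebra: find x such that a @ x equals b.
x = np.linalg.solve(a, b)

print(a.shape)   # dimensions of the array
print(scaled)
print(x)
```

The same array type is what Pandas and scikit-learn use underneath, which is why NumPy comes first in the list.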
Key phases
The 3 key phases
01 Data Exploration: finding out more about the data we have
● NumPy
● Matplotlib
● Pandas
# Read the dataset into a DataFrame using Pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv("/home/ptoraska/Downloads/Loan_Prediction/train.csv")
Data Exploration
Once you have read the dataset, you can look at the first few rows
using the head() function:
df.head(10)
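Beyond head(), a few other Pandas calls are commonly used for a first look at a dataset. The sketch below uses a small, hypothetical stand-in for train.csv; the column names and values are assumptions made for illustration and are not taken from the real file.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; columns and values are illustrative only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Male"],
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [None, 128.0, 66.0, 120.0],
})

print(df.head(2))    # first two rows
print(df.shape)      # (number of rows, number of columns)

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
summary = df.describe()
print(summary)

# Frequency of each category in a non-numeric column.
gender_counts = df["Gender"].value_counts()
print(gender_counts)
```

Note that the count row of describe() excludes missing values, so it already hints at which columns have gaps.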
The 3 key phases
02 Data Munging: cleaning the data and reshaping it to better suit statistical
modeling.
1. Some variables have missing values. We should
estimate those values carefully, depending on how many are
missing and how important the variable is expected to be.
2. While looking at the distributions, we saw that Applicant
Income and Loan Amount seemed to contain extreme values
at either end. Although these values might make intuitive sense, they
should be treated appropriately.
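One common way to treat extreme values such as those in point 2 is a log transformation, which compresses large values without dropping any rows. This is only a sketch of one option, not necessarily the treatment used in the original analysis; the column name ApplicantIncome and the numbers are assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative incomes with one extreme value; not taken from the real dataset.
df = pd.DataFrame({"ApplicantIncome": [2500, 3000, 4200, 5100, 81000]})

# log1p computes log(1 + x), so an income of zero would not produce -inf.
df["ApplicantIncome_log"] = np.log1p(df["ApplicantIncome"])

# Compare how far the largest value sits from the median before and after.
raw_ratio = df["ApplicantIncome"].max() / df["ApplicantIncome"].median()
log_ratio = df["ApplicantIncome_log"].max() / df["ApplicantIncome_log"].median()

print(df)
print(raw_ratio, log_ratio)
```

After the transform the extreme value is still the largest, but it no longer dominates the scale of the column.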
Check missing values in the dataset
Let us look at the missing values in every variable. Most models
do not work with missing data, and even when they do, imputing the values helps more
often than not. So, let us count the nulls / NaNs in each column of the dataset:
# Number of missing values per column
df.apply(lambda x: sum(x.isnull()), axis=0)
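A minimal sketch of counting and then imputing missing values follows. It assumes a numeric column filled with its median and a categorical column filled with its most frequent value; both choices, and the toy data, are illustrative assumptions rather than the strategy used in the original analysis.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value per column; values are illustrative only.
df = pd.DataFrame({
    "LoanAmount": [120.0, np.nan, 66.0, 141.0],
    "Self_Employed": ["No", None, "No", "Yes"],
})

# Equivalent to the apply/lambda form above: missing values per column.
missing_before = df.isnull().sum()
print(missing_before)

# Numeric column: fill with the median, which is robust to extreme values.
df["LoanAmount"] = df["LoanAmount"].fillna(df["LoanAmount"].median())

# Categorical column: fill with the most frequent category (the mode).
df["Self_Employed"] = df["Self_Employed"].fillna(df["Self_Employed"].mode()[0])

missing_after = df.isnull().sum()
print(missing_after)
```

Which fill value is appropriate depends on how many values are missing and how important the variable is, as noted in the munging step.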
The 3 key phases
03 Predictive Modeling: running the actual algorithms and having fun
Once we have made the data useful for modeling, we can build a predictive
model. scikit-learn (sklearn) is the most commonly used library in Python
for this purpose.
Building a Predictive Model in Python
sklearn requires all inputs to be numeric, so we should convert all our
categorical variables to numeric by encoding the categories.
This can be done using the following code:
from sklearn.preprocessing import LabelEncoder
# Categorical columns to encode as integers
var_mod = ['Gender', 'Married', 'Dependents', 'Education',
           'Self_Employed', 'Property_Area', 'Loan_Status']
le = LabelEncoder()
for i in var_mod:
    # Refit the encoder on each column and replace it with integer codes
    df[i] = le.fit_transform(df[i])
df.dtypes
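To see what the encoding does, the sketch below applies LabelEncoder to a toy frame. Unlike the loop above, it keeps one fitted encoder per column (a design choice assumed here) so that the integer codes can be mapped back to the original categories. The data are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy data; values are illustrative only.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Loan_Status": ["Y", "N", "Y"],
})

# Keep a separate fitted encoder for each column so codes can be decoded later.
encoders = {}
for col in ["Gender", "Loan_Status"]:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    encoders[col] = le

print(df)
print(encoders["Gender"].classes_)                 # categories in code order
print(encoders["Gender"].inverse_transform([1]))   # decode a code back to its label
```

LabelEncoder assigns codes in sorted order of the categories. The scikit-learn documentation describes it as intended for target labels; for input features, OrdinalEncoder or OneHotEncoder are the documented alternatives.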
Models
Logistic Regression: a classification algorithm.
Decision Tree: a type of supervised
learning algorithm (one with a
pre-defined target variable)
that is mostly used in
classification problems.
Random Forest: a versatile machine learning
method capable of performing
both regression and
classification tasks.
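The three models can be tried side by side with scikit-learn. The sketch below is a minimal example on synthetic data from make_classification, used here as an assumed stand-in for the prepared loan dataset; the hyperparameters are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the prepared dataset.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hyperparameters are arbitrary choices for illustration.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Fit each model on the training split and record accuracy on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

Scoring on a held-out split rather than on the training data gives a fairer comparison, since tree-based models can fit the training set almost perfectly.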
Thank you.
