daftCode sp. z o.o.
Statistical inference for (Python) Data Analysis.
An introduction
Piotr Milanowski
Statistical inference? Wait, why?
● Quantify a level of trust for values you obtain
● Compare values
● Infer validity of provided data
Buzz phrases for this talk
● Probability
● Distribution
● Random variable
● Significance
● Hypothesis testing
● Statistic
Part 1. Preparation.
Building Python statistical stack
● Necessary modules:
NumPy
SciPy
● Helpful modules:
Pandas
Matplotlib
NumPy
● http://www.numpy.org
● Numerical library
● Optimized for speed and memory efficiency
● Many useful and intuitive functionalities, and
methods (especially for multidimensional
arrays)
NumPy (Example)
Python
>>> # Vector
>>> v = [1, 2, 3, 4]
>>> # scaling vector 2v
>>> v2 = [2*i for i in v]
>>> # Adding vectors v+v2
>>> v3 = [v[i]+v2[i] for i in range(len(v))]
>>> # Vector normalization (zero mean, unit standard deviation)
>>> mean = sum(v)/len(v)
>>> zero_mean = [(i - mean) for i in v]
>>> std = (sum(i**2 for i in zero_mean)/len(v))**0.5
>>> normalized = [i/std for i in zero_mean]
Python + NumPy
>>> import numpy as np
>>> # Vector
>>> v = np.array([1, 2, 3, 4])
>>> # scaling vector 2v
>>> v2 = 2*v
>>> # Adding vectors v+v2
>>> v3 = v2 + v
>>> # Normalization (zero mean, unit standard deviation)
>>> normalized = (v - v.mean())/v.std()
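As a quick sanity check on the normalization step, the sketch below wraps it in a small helper and confirms that the result has zero mean and unit standard deviation. The name `zscore` and the `ddof` parameter are illustrative assumptions, not part of the slides.

```python
import numpy as np


def zscore(v, ddof=0):
    """Return v shifted to zero mean and scaled to unit standard deviation.

    ddof=0 uses the population standard deviation (NumPy's default);
    ddof=1 would use the sample standard deviation instead.
    """
    v = np.asarray(v, dtype=float)
    return (v - v.mean()) / v.std(ddof=ddof)


# Same toy vector as in the example above.
normalized = zscore([1, 2, 3, 4])
```

With `ddof=0` the result matches the pure-Python version, which divides by `len(v)` rather than `len(v) - 1`.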
SciPy
● http://www.scipy.org
● A set of scientific libraries for signal analysis
(scipy.signal), image analysis (scipy.ndimage),
Fourier transform (scipy.fftpack), linear algebra
(scipy.linalg), integration (scipy.integrate), ...
● Here: scipy.stats
Pandas & Matplotlib
● http://pandas.pydata.org
● Great data structures with helpful methods
● http://matplotlib.org/
● Visualization library
Part 2. Examples
Example 1. Anomaly detection
● Data: number of daily page entries over 3
months
● Question: should we be suspicious if on a given
day we see 800, 850, or 900 entries?
Example 1. Anomaly detection
>>> import numpy as np
>>> values = np.array([…]) # Get values from
>>> # somewhere
>>> values
array([796, 756, 827, 779, 787, 731, 847, 829, 794,
790, 831, 820, 822, 792, 767, 757, 777, 775, 799,
756, 849, 808, 800, 823, 798, 809, 783, 800, 848,
772, 853, 759, 745, 868, 713, 863, 819, 829, 799,
777, 785, 812, 877, 839, 783, 867, 875, 829, 818,
813, 763, 829, 812, 867, 790, 748, 822, 774, 900,
830, 781, 795, 825, 799, 795, 839, 840, 765, 781,
781, 797, 821, 852, 836, 811, 771, 800, 752, 776,
755, 798, 839, 821, 794, 846, 834, 825, 825, 830,
814, 839, 760, 786, 747, 803, 717, 801, 819, 789,
824, 835, 806, 858, 774, 848, 793, 826, 796, 798,
773, 779, 775, 779, 806, 768, 787, 788, 822, 843,
780])
>>> values.max(), values.min()
(900, 713)
Example 1. Anomaly detection
● Assumption: values are drawn from a Poisson
distribution
● What is the probability of obtaining more than 800,
850, or 900 entries under a Poisson distribution
fitted to this data?
● What is the threshold value?
● scipy.stats.poisson (and many other
distributions)
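Before relying on the Poisson assumption, one rough check is that a Poisson variable has variance equal to its mean, so the ratio of the two should be close to 1. The sketch below is an illustration on synthetic counts, not the talk's data; `dispersion_index` is a hypothetical helper name.

```python
import numpy as np


def dispersion_index(counts):
    """Ratio of sample variance to sample mean (about 1 for Poisson data)."""
    counts = np.asarray(counts, dtype=float)
    return counts.var(ddof=1) / counts.mean()


# Synthetic counts used only for illustration (not the data from the slides).
rng = np.random.default_rng(0)
synthetic = rng.poisson(lam=800, size=5000)
ratio = dispersion_index(synthetic)
```

A ratio well above 1 would suggest overdispersion, in which case a Poisson-based threshold is likely to flag too many days.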
Example 1. Anomaly detection
>>> import scipy.stats as ss
>>> # Calculating distribution parameter
>>> mu = values.mean()
>>> # Check for 800: upper-tail probability P(X > 800)
>>> 1 - ss.poisson.cdf(800, mu) # equal to ss.poisson.sf(800, mu)
0.548801
>>> # Check for 900
>>> 1 - ss.poisson.cdf(900, mu)
0.00042
>>> # Check for 850
>>> 1 - ss.poisson.cdf(850, mu)
0.05205
>>> # Threshold for magical 5%
>>> ss.poisson.ppf(0.95, mu)
851
● 3 lines of code (read data, calculate distribution
parameter, calculate threshold), and the detector
is ready!
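The three steps (read data, estimate the parameter, compute the threshold) can be wrapped into a small reusable detector. This is a minimal sketch under the same Poisson assumption; the function names, the `alpha` parameter, and the three-value history are hypothetical and chosen only for illustration.

```python
import numpy as np
import scipy.stats as ss


def fit_poisson_detector(history, alpha=0.05):
    """Return (mu, threshold) for a one-sided Poisson anomaly detector.

    mu is the maximum-likelihood estimate of the Poisson rate (the sample
    mean); threshold is the smallest count k with P(X <= k) >= 1 - alpha.
    """
    mu = float(np.mean(history))
    threshold = int(ss.poisson.ppf(1 - alpha, mu))
    return mu, threshold


def is_anomalous(count, mu, alpha=0.05):
    """Flag a count whose upper-tail probability P(X > count) is below alpha."""
    return bool(ss.poisson.sf(count, mu) < alpha)


# Hypothetical history with a mean of exactly 800 entries per day.
mu, threshold = fit_poisson_detector([790, 800, 810])
flags = {count: is_anomalous(count, mu) for count in (800, 850, 900)}
```

The detector is one-sided: it only flags unusually high counts. Flagging unusually low counts as well would need a second check on the lower tail, `ss.poisson.cdf(count, mu)`.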
Example 2. Confidence intervals
● What is the mean number of entries?
● What is the 95% confidence interval for
calculated mean?
>>> # CI simulation (bootstrap: resample with replacement)
>>> def ci(v, no_reps):
...     for i in range(no_reps):
...         idx = np.random.randint(0, len(v), size=len(v))
...         yield v[idx].mean()
>>> # Get simulated means
>>> gen = ci(values, 10000)
>>> sim_means = np.fromiter(gen, 'float')
>>> # 95% Confidence interval
>>> (ci_low, ci_high) = np.percentile(sim_means, [2.5, 97.5])
>>> print(ci_low, ci_high)
797.942 810.350
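The same percentile bootstrap can be written without a Python-level loop by drawing all resampling indices at once. The sketch below assumes NumPy's `default_rng` generator is available; the function name, the `level` and `seed` parameters, and the synthetic sample are illustrative assumptions rather than the talk's code or data.

```python
import numpy as np


def bootstrap_mean_ci(v, n_reps=10000, level=0.95, seed=None):
    """Percentile bootstrap confidence interval for the mean of v."""
    v = np.asarray(v, dtype=float)
    rng = np.random.default_rng(seed)
    # One row of resampled indices (with replacement) per repetition.
    idx = rng.integers(0, len(v), size=(n_reps, len(v)))
    sim_means = v[idx].mean(axis=1)
    tail = 100 * (1 - level) / 2
    low, high = np.percentile(sim_means, [tail, 100 - tail])
    return low, high


# Synthetic daily counts for illustration only (not the slide data).
sample = np.random.default_rng(1).poisson(lam=800, size=120)
ci_low, ci_high = bootstrap_mean_ci(sample, n_reps=5000, seed=2)
```

The index matrix holds `n_reps * len(v)` integers, so for large samples the generator-based loop from the slide uses far less memory.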
Example 3. Comparing distributions
● Data: two sets of times spent on a page: one set
for fraud data (F), and a second for non-fraud
data (C)
● Question: is there a (significant) difference
between those two distributions?
Example 3. Comparing distributions
>>> ok = np.array(ok) # non-fraud
>>> fraud = np.array(fraud)
>>> np.median(ok)
140261.0
>>> np.median(fraud)
109883.0
● Unknown distributions:
nonparametric test
>>> ss.mannwhitneyu(ok, fraud)
MannwhitneyuResult(statistic=54457079.5,
pvalue=1.05701588547616e-59)
● Equalize sample sizes (just to be
sure)
>>> N = len(fraud)
>>> idx = np.arange(0, len(ok))
>>> np.random.shuffle(idx)
>>> ok_subsample = ok[idx[:N]]
>>> ss.mannwhitneyu(ok_subsample, fraud)
MannwhitneyuResult(statistic=3548976.0,
pvalue=3.1818273295679098e-30)
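When running the Mann-Whitney test it can help to state the alternative hypothesis explicitly, since older SciPy releases used a different default than current ones. The sketch below uses synthetic, skewed samples; the scales and sample sizes are made up and are not the fraud data from the talk.

```python
import numpy as np
import scipy.stats as ss

# Synthetic, skewed "time on page" samples for illustration only;
# the scales are made up and are not the data from the slides.
rng = np.random.default_rng(0)
ok_times = rng.exponential(scale=140_000, size=2000)
fraud_times = rng.exponential(scale=110_000, size=2000)

# Spell out the alternative so the result does not depend on the
# default of the installed SciPy version.
result = ss.mannwhitneyu(ok_times, fraud_times, alternative="two-sided")
u_stat, p_value = result.statistic, result.pvalue
```

A small `p_value` says only that the two samples are unlikely to come from the same distribution. It does not say how large the difference is, so it is worth reporting the medians alongside it.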
Example 4. Bootstrap
● The same data and question as in the previous example
● Test without any built-in tests
● Null hypothesis (H0): both datasets are drawn from the
same distribution
● Mix them together, draw two new datasets (with
replacement), calculate the statistic (difference in
medians)
● Estimate the probability of obtaining a statistic larger
than or equal to the initial one (from the original data)
Example 4. Bootstrap
>>> # Generate statistics under the null hypothesis
>>> def generate_statistics(vec1, vec2, no_reps=10000):
...     all_ = np.r_[vec1, vec2]
...     N, M = len(vec1), len(vec2)
...     for i in range(no_reps):
...         random_indices = np.random.randint(0, M+N, size=M+N)
...         tmp1 = all_[random_indices[:N]] # same size as vec1
...         tmp2 = all_[random_indices[N:]] # same size as vec2
...         yield np.abs(np.median(tmp1) - np.median(tmp2))
>>> # Initial statistic
>>> stat_0 = np.abs(np.median(ok) - np.median(fraud))
>>> gen = generate_statistics(ok, fraud)
>>> stats = np.fromiter(gen, 'float')
>>> # Get the probability of obtaining a statistic at least as large as the initial one
>>> np.sum(stats >= stat_0)/len(stats)
0.0
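A closely related alternative is the permutation test, which shuffles the pooled data without replacement instead of resampling it with replacement. The sketch below is that swapped-in variant, not the bootstrap from the slides; the function name, the add-one correction, and the synthetic samples are assumptions made for illustration.

```python
import numpy as np


def permutation_test_median(a, b, n_reps=2000, seed=None):
    """Two-sided permutation test for a difference in medians."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = abs(np.median(a) - np.median(b))
    count = 0
    for _ in range(n_reps):
        shuffled = rng.permutation(pooled)  # relabel without replacement
        diff = abs(np.median(shuffled[:len(a)]) - np.median(shuffled[len(a):]))
        count += int(diff >= observed)
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (n_reps + 1)


# Synthetic samples for illustration only (not the data from the slides).
rng = np.random.default_rng(0)
sample_a = rng.exponential(scale=140_000, size=500)
sample_b = rng.exponential(scale=70_000, size=500)
p_diff = permutation_test_median(sample_a, sample_b, seed=1)
p_same = permutation_test_median(sample_a, sample_a, n_reps=200, seed=1)
```

The add-one correction reflects that a simulated p-value of exactly 0.0 only means "smaller than about 1/n_reps", which is also how the 0.0 in the bootstrap output should be read.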
Example 5. Naive Bayes
● Can we classify fraud based on time spent on a
page?
● Using Naive Bayes:
P(F|t) ~ P(t|F)P(F)
P(C|t) ~ P(t|C)P(C)
● P(t|F), P(t|C) are sample distributions;
P(C), P(F) are the class priors (class fractions in the data)
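The two proportionalities can be turned into an actual posterior probability by dividing by their sum. The sketch below assumes only two classes, so P(C) = 1 - P(F); the function name and the numbers are made up for illustration.

```python
def posterior_fraud(lik_fraud, lik_ok, prior_fraud):
    """Normalized posterior P(F|t) from the two class likelihoods at t.

    lik_fraud = P(t|F), lik_ok = P(t|C), prior_fraud = P(F); P(C) is
    taken as 1 - P(F) because there are only two classes.
    """
    joint_fraud = lik_fraud * prior_fraud
    joint_ok = lik_ok * (1.0 - prior_fraud)
    evidence = joint_fraud + joint_ok
    if evidence == 0:
        return float("nan")  # t was not observed under either class
    return joint_fraud / evidence


# Made-up numbers: t is three times as likely under F, but F is rare.
p = posterior_fraud(lik_fraud=0.3, lik_ok=0.1, prior_fraud=0.1)
```

Here the posterior is 0.25: even though t is three times as likely under fraud, the small prior keeps P(F|t) below one half, so a rule that picks the larger posterior would still answer "not fraud".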
Example 5. Naive Bayes
[Figure: sample distributions P(t|C) and P(t|F)]
Example 5. Naive Bayes
>>> # Retrieve probability based on histogram
>>> def get_prob(value, bins, probs):
...     # Index of the bin containing value, clamped to a valid bin
...     idx = np.searchsorted(bins, value, side='right') - 1
...     return probs[min(max(idx, 0), len(probs) - 1)]
>>> N = len(ok) + len(fraud)
>>> P_ok = len(ok)/N
>>> P_fraud = len(fraud)/N
>>> # density=True replaces normed=True, which recent NumPy releases removed
>>> probs_ok, bins_ok = np.histogram(ok, bins=500, density=True)
>>> probs_fraud, bins_fraud = np.histogram(fraud, bins=bins_ok, density=True)
>>> # Test!
>>> test_times = np.linspace(0, 1e6, 100000)
>>> detected_as_fraud = 0
>>> detected_as_ok = 0
>>> for el in test_times:
...     p_ok = get_prob(el, bins_ok, probs_ok)*P_ok
...     p_fr = get_prob(el, bins_fraud, probs_fraud)*P_fraud
...     if p_ok > p_fr:
...         detected_as_ok += 1
...     else:
...         detected_as_fraud += 1
>>> print(detected_as_ok, detected_as_fraud)
100000 0
Example 5. Naive Bayes
● NB doesn't seem to work that well in this
example
● Better results come from just setting a threshold
● Still, a few lines of code and the classifier is
ready!
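The threshold idea can be sketched as a search over candidate cut-offs. The sketch assumes that shorter times indicate fraud (the fraud median in Example 3 is lower than the non-fraud median) and scores each cut-off by balanced accuracy. The function name, the scoring choice, and the toy numbers are all hypothetical.

```python
import numpy as np


def best_threshold(ok, fraud):
    """Return (threshold, balanced_accuracy) for the rule 'time <= threshold is fraud'."""
    ok = np.asarray(ok, dtype=float)
    fraud = np.asarray(fraud, dtype=float)
    candidates = np.unique(np.concatenate([ok, fraud]))
    best_t, best_score = None, -1.0
    for t in candidates:
        tpr = np.mean(fraud <= t)  # fraction of fraud correctly flagged
        tnr = np.mean(ok > t)      # fraction of non-fraud correctly passed
        score = 0.5 * (tpr + tnr)
        if score > best_score:
            best_t, best_score = float(t), float(score)
    return best_t, best_score


# Toy data for illustration only (not the data from the slides).
threshold, score = best_threshold(ok=[10, 11, 12, 13], fraud=[1, 2, 3, 11])
```

Balanced accuracy weights both classes equally regardless of their sizes, which avoids the trap seen above where the majority class wins every decision. In practice the threshold should be chosen on one part of the data and evaluated on another.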
Almost at the end. Just one more slide… and it's a
summary!
Summary
● Statistical inference is used to compare and
validate values
● It gives some quantification, but there is still
room for subjective decisions (p-value thresholds, priors)
● It is quite easy to do statistics in Python when
you have proper tools

More Related Content

What's hot (20)

NumPy Refresher
NumPy RefresherNumPy Refresher
NumPy Refresher
Lukasz Dobrzanski
 
NumPy
NumPyNumPy
NumPy
AbhijeetAnand88
 
Introduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersIntroduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning Programmers
Kimikazu Kato
 
Numpy
NumpyNumpy
Numpy
Jyoti shukla
 
Scientific Computing with Python Webinar March 19: 3D Visualization with Mayavi
Scientific Computing with Python Webinar March 19: 3D Visualization with MayaviScientific Computing with Python Webinar March 19: 3D Visualization with Mayavi
Scientific Computing with Python Webinar March 19: 3D Visualization with Mayavi
Enthought, Inc.
 
NUMPY
NUMPY NUMPY
NUMPY
Global Academy of Technology
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
Edureka!
 
Introduction to numpy
Introduction to numpyIntroduction to numpy
Introduction to numpy
Gaurav Aggarwal
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
Karlijn Willems
 
Pythonで機械学習入門以前
Pythonで機械学習入門以前Pythonで機械学習入門以前
Pythonで機械学習入門以前
Kimikazu Kato
 
Python seaborn cheat_sheet
Python seaborn cheat_sheetPython seaborn cheat_sheet
Python seaborn cheat_sheet
Nishant Upadhyay
 
Python for R Users
Python for R UsersPython for R Users
Python for R Users
Ajay Ohri
 
1 seaborn introduction
1 seaborn introduction 1 seaborn introduction
1 seaborn introduction
YuleiLi3
 
Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303
Namgee Lee
 
The Joy of SciPy
The Joy of SciPyThe Joy of SciPy
The Joy of SciPy
kammeyer
 
Pandas Cheat Sheet
Pandas Cheat SheetPandas Cheat Sheet
Pandas Cheat Sheet
ACASH1011
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
Girish Khanzode
 
Pandas pythonfordatascience
Pandas pythonfordatasciencePandas pythonfordatascience
Pandas pythonfordatascience
Nishant Upadhyay
 
Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheet
Nishant Upadhyay
 
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientists
Lambda Tree
 
Introduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning ProgrammersIntroduction to NumPy for Machine Learning Programmers
Introduction to NumPy for Machine Learning Programmers
Kimikazu Kato
 
Scientific Computing with Python Webinar March 19: 3D Visualization with Mayavi
Scientific Computing with Python Webinar March 19: 3D Visualization with MayaviScientific Computing with Python Webinar March 19: 3D Visualization with Mayavi
Scientific Computing with Python Webinar March 19: 3D Visualization with Mayavi
Enthought, Inc.
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
Edureka!
 
Cheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learnCheat Sheet for Machine Learning in Python: Scikit-learn
Cheat Sheet for Machine Learning in Python: Scikit-learn
Karlijn Willems
 
Pythonで機械学習入門以前
Pythonで機械学習入門以前Pythonで機械学習入門以前
Pythonで機械学習入門以前
Kimikazu Kato
 
Python seaborn cheat_sheet
Python seaborn cheat_sheetPython seaborn cheat_sheet
Python seaborn cheat_sheet
Nishant Upadhyay
 
Python for R Users
Python for R UsersPython for R Users
Python for R Users
Ajay Ohri
 
1 seaborn introduction
1 seaborn introduction 1 seaborn introduction
1 seaborn introduction
YuleiLi3
 
Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303Numpy tutorial(final) 20160303
Numpy tutorial(final) 20160303
Namgee Lee
 
The Joy of SciPy
The Joy of SciPyThe Joy of SciPy
The Joy of SciPy
kammeyer
 
Pandas Cheat Sheet
Pandas Cheat SheetPandas Cheat Sheet
Pandas Cheat Sheet
ACASH1011
 
Pandas pythonfordatascience
Pandas pythonfordatasciencePandas pythonfordatascience
Pandas pythonfordatascience
Nishant Upadhyay
 
Python matplotlib cheat_sheet
Python matplotlib cheat_sheetPython matplotlib cheat_sheet
Python matplotlib cheat_sheet
Nishant Upadhyay
 
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientists
Lambda Tree
 

Similar to Statistical inference for (Python) Data Analysis. An introduction. (20)

Python for Scientists
Python for ScientistsPython for Scientists
Python for Scientists
Andreas Dewes
 
Simple APIs and innovative documentation
Simple APIs and innovative documentationSimple APIs and innovative documentation
Simple APIs and innovative documentation
PyDataParis
 
python_programming_NumPy_Pandas_Notes.pptx
python_programming_NumPy_Pandas_Notes.pptxpython_programming_NumPy_Pandas_Notes.pptx
python_programming_NumPy_Pandas_Notes.pptx
sunilsoni446112
 
Python for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo CruzPython for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo Cruz
rpmcruz
 
Numpy_Pandas_for beginners_________.pptx
Numpy_Pandas_for beginners_________.pptxNumpy_Pandas_for beginners_________.pptx
Numpy_Pandas_for beginners_________.pptx
Abhi Marvel
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
source{d}
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Olivier Grisel
 
Numpy_Cheat_Sheet.pdf
Numpy_Cheat_Sheet.pdfNumpy_Cheat_Sheet.pdf
Numpy_Cheat_Sheet.pdf
SkyNerve
 
Arrays basics
Arrays basicsArrays basics
Arrays basics
sudhirvegad
 
Lec1
Lec1Lec1
Lec1
Nikhil Chilwant
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
Yanchang Zhao
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
Positive Hack Days
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
PremaGanesh1
 
app4.pptx
app4.pptxapp4.pptx
app4.pptx
sg4795
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData Stack
Peadar Coyle
 
Introduction to Tensorflow
Introduction to TensorflowIntroduction to Tensorflow
Introduction to Tensorflow
Tzar Umang
 
Machine Learning With R
Machine Learning With RMachine Learning With R
Machine Learning With R
David Chiu
 
Decision Tree.pptx
Decision Tree.pptxDecision Tree.pptx
Decision Tree.pptx
Ramakrishna Reddy Bijjam
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Arnaud Joly
 
NumPy.pptx
NumPy.pptxNumPy.pptx
NumPy.pptx
EN1036VivekSingh
 
Python for Scientists
Python for ScientistsPython for Scientists
Python for Scientists
Andreas Dewes
 
Simple APIs and innovative documentation
Simple APIs and innovative documentationSimple APIs and innovative documentation
Simple APIs and innovative documentation
PyDataParis
 
python_programming_NumPy_Pandas_Notes.pptx
python_programming_NumPy_Pandas_Notes.pptxpython_programming_NumPy_Pandas_Notes.pptx
python_programming_NumPy_Pandas_Notes.pptx
sunilsoni446112
 
Python for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo CruzPython for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo Cruz
rpmcruz
 
Numpy_Pandas_for beginners_________.pptx
Numpy_Pandas_for beginners_________.pptxNumpy_Pandas_for beginners_________.pptx
Numpy_Pandas_for beginners_________.pptx
Abhi Marvel
 
Machine Learning on Code - SF meetup
Machine Learning on Code - SF meetupMachine Learning on Code - SF meetup
Machine Learning on Code - SF meetup
source{d}
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Olivier Grisel
 
Numpy_Cheat_Sheet.pdf
Numpy_Cheat_Sheet.pdfNumpy_Cheat_Sheet.pdf
Numpy_Cheat_Sheet.pdf
SkyNerve
 
Regression and Classification with R
Regression and Classification with RRegression and Classification with R
Regression and Classification with R
Yanchang Zhao
 
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
The System of Automatic Searching for Vulnerabilities or how to use Taint Ana...
Positive Hack Days
 
Comparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptxComparing EDA with classical and Bayesian analysis.pptx
Comparing EDA with classical and Bayesian analysis.pptx
PremaGanesh1
 
app4.pptx
app4.pptxapp4.pptx
app4.pptx
sg4795
 
A Map of the PyData Stack
A Map of the PyData StackA Map of the PyData Stack
A Map of the PyData Stack
Peadar Coyle
 
Introduction to Tensorflow
Introduction to TensorflowIntroduction to Tensorflow
Introduction to Tensorflow
Tzar Umang
 
Machine Learning With R
Machine Learning With RMachine Learning With R
Machine Learning With R
David Chiu
 
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learnNumerical tour in the Python eco-system: Python, NumPy, scikit-learn
Numerical tour in the Python eco-system: Python, NumPy, scikit-learn
Arnaud Joly
 
Ad

Recently uploaded (20)

Bonk coin airdrop_ Everything You Need to Know.pdf
Bonk coin airdrop_ Everything You Need to Know.pdfBonk coin airdrop_ Everything You Need to Know.pdf
Bonk coin airdrop_ Everything You Need to Know.pdf
Herond Labs
 
Artificial Intelligence Applications Across Industries
Artificial Intelligence Applications Across IndustriesArtificial Intelligence Applications Across Industries
Artificial Intelligence Applications Across Industries
SandeepKS52
 
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps Cycles
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps CyclesFrom Chaos to Clarity - Designing (AI-Ready) APIs with APIOps Cycles
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps Cycles
Marjukka Niinioja
 
OpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native BarcelonaOpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native Barcelona
Imma Valls Bernaus
 
Generative Artificial Intelligence and its Applications
Generative Artificial Intelligence and its ApplicationsGenerative Artificial Intelligence and its Applications
Generative Artificial Intelligence and its Applications
SandeepKS52
 
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink TemplateeeeeeeeeeeeeeeeeeeeeeeeeeNeuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
alexandernoetzold
 
Leveraging Foundation Models to Infer Intents
Leveraging Foundation Models to Infer IntentsLeveraging Foundation Models to Infer Intents
Leveraging Foundation Models to Infer Intents
Keheliya Gallaba
 
Top 5 Task Management Software to Boost Productivity in 2025
Top 5 Task Management Software to Boost Productivity in 2025Top 5 Task Management Software to Boost Productivity in 2025
Top 5 Task Management Software to Boost Productivity in 2025
Orangescrum
 
Providing Better Biodiversity Through Better Data
Providing Better Biodiversity Through Better DataProviding Better Biodiversity Through Better Data
Providing Better Biodiversity Through Better Data
Safe Software
 
How Insurance Policy Management Software Streamlines Operations
How Insurance Policy Management Software Streamlines OperationsHow Insurance Policy Management Software Streamlines Operations
How Insurance Policy Management Software Streamlines Operations
Insurance Tech Services
 
Async-ronizing Success at Wix - Patterns for Seamless Microservices - Devoxx ...
Async-ronizing Success at Wix - Patterns for Seamless Microservices - Devoxx ...Async-ronizing Success at Wix - Patterns for Seamless Microservices - Devoxx ...
Async-ronizing Success at Wix - Patterns for Seamless Microservices - Devoxx ...
Natan Silnitsky
 
Who will create the languages of the future?
Who will create the languages of the future?Who will create the languages of the future?
Who will create the languages of the future?
Jordi Cabot
 
Revolutionize Your Insurance Workflow with Claims Management Software
Revolutionize Your Insurance Workflow with Claims Management SoftwareRevolutionize Your Insurance Workflow with Claims Management Software
Revolutionize Your Insurance Workflow with Claims Management Software
Insurance Tech Services
 
Essentials of Resource Planning in a Downturn
Essentials of Resource Planning in a DownturnEssentials of Resource Planning in a Downturn
Essentials of Resource Planning in a Downturn
OnePlan Solutions
 
COBOL Programming with VSCode - IBM Certificate
COBOL Programming with VSCode - IBM CertificateCOBOL Programming with VSCode - IBM Certificate
COBOL Programming with VSCode - IBM Certificate
VICTOR MAESTRE RAMIREZ
 
Marketo & Dynamics can be Most Excellent to Each Other – The Sequel
Marketo & Dynamics can be Most Excellent to Each Other – The SequelMarketo & Dynamics can be Most Excellent to Each Other – The Sequel
Marketo & Dynamics can be Most Excellent to Each Other – The Sequel
BradBedford3
 
Automating Map Production With FME and Python
Automating Map Production With FME and PythonAutomating Map Production With FME and Python
Automating Map Production With FME and Python
Safe Software
 
FME for Climate Data: Turning Big Data into Actionable Insights
FME for Climate Data: Turning Big Data into Actionable InsightsFME for Climate Data: Turning Big Data into Actionable Insights
FME for Climate Data: Turning Big Data into Actionable Insights
Safe Software
 
Key AI Technologies Used by Indian Artificial Intelligence Companies
Key AI Technologies Used by Indian Artificial Intelligence CompaniesKey AI Technologies Used by Indian Artificial Intelligence Companies
Key AI Technologies Used by Indian Artificial Intelligence Companies
Mypcot Infotech
 
Software Engineering Process, Notation & Tools Introduction - Part 3
Software Engineering Process, Notation & Tools Introduction - Part 3Software Engineering Process, Notation & Tools Introduction - Part 3
Software Engineering Process, Notation & Tools Introduction - Part 3
Gaurav Sharma
 
Bonk coin airdrop_ Everything You Need to Know.pdf
Bonk coin airdrop_ Everything You Need to Know.pdfBonk coin airdrop_ Everything You Need to Know.pdf
Bonk coin airdrop_ Everything You Need to Know.pdf
Herond Labs
 
Artificial Intelligence Applications Across Industries
Artificial Intelligence Applications Across IndustriesArtificial Intelligence Applications Across Industries
Artificial Intelligence Applications Across Industries
SandeepKS52
 
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps Cycles
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps CyclesFrom Chaos to Clarity - Designing (AI-Ready) APIs with APIOps Cycles
From Chaos to Clarity - Designing (AI-Ready) APIs with APIOps Cycles
Marjukka Niinioja
 
OpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native BarcelonaOpenTelemetry 101 Cloud Native Barcelona
OpenTelemetry 101 Cloud Native Barcelona
Imma Valls Bernaus
 
Generative Artificial Intelligence and its Applications
Generative Artificial Intelligence and its ApplicationsGenerative Artificial Intelligence and its Applications
Generative Artificial Intelligence and its Applications
SandeepKS52
 
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink TemplateeeeeeeeeeeeeeeeeeeeeeeeeeNeuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
Neuralink Templateeeeeeeeeeeeeeeeeeeeeeeeee
alexandernoetzold
 
Leveraging Foundation Models to Infer Intents
Leveraging Foundation Models to Infer IntentsLeveraging Foundation Models to Infer Intents
Leveraging Foundation Models to Infer Intents
Keheliya Gallaba
 
Top 5 Task Management Software to Boost Productivity in 2025
Top 5 Task Management Software to Boost Productivity in 2025Top 5 Task Management Software to Boost Productivity in 2025
Top 5 Task Management Software to Boost Productivity in 2025
Orangescrum
 
Providing Better Biodiversity Through Better Data
Providing Better Biodiversity Through Better DataProviding Better Biodiversity Through Better Data
Providing Better Biodiversity Through Better Data
Safe Software
 
How Insurance Policy Management Software Streamlines Operations
How Insurance Policy Management Software Streamlines OperationsHow Insurance Policy Management Software Streamlines Operations
How Insurance Policy Management Software Streamlines Operations
Insurance Tech Services
 
Async-ronizing Success at Wix - Patterns for Seamless Microservices - Devoxx ...
Async-ronizing Success at Wix - Patterns for Seamless Microservices - Devoxx ...Async-ronizing Success at Wix - Patterns for Seamless Microservices - Devoxx ...
Async-ronizing Success at Wix - Patterns for Seamless Microservices - Devoxx ...
Natan Silnitsky
 
Who will create the languages of the future?
Who will create the languages of the future?Who will create the languages of the future?
Who will create the languages of the future?
Jordi Cabot
 
Revolutionize Your Insurance Workflow with Claims Management Software
Revolutionize Your Insurance Workflow with Claims Management SoftwareRevolutionize Your Insurance Workflow with Claims Management Software
Revolutionize Your Insurance Workflow with Claims Management Software
Insurance Tech Services
 
Essentials of Resource Planning in a Downturn
Essentials of Resource Planning in a DownturnEssentials of Resource Planning in a Downturn
Essentials of Resource Planning in a Downturn
OnePlan Solutions
 
COBOL Programming with VSCode - IBM Certificate
COBOL Programming with VSCode - IBM CertificateCOBOL Programming with VSCode - IBM Certificate
COBOL Programming with VSCode - IBM Certificate
VICTOR MAESTRE RAMIREZ
 
Marketo & Dynamics can be Most Excellent to Each Other – The Sequel
Marketo & Dynamics can be Most Excellent to Each Other – The SequelMarketo & Dynamics can be Most Excellent to Each Other – The Sequel
Marketo & Dynamics can be Most Excellent to Each Other – The Sequel
BradBedford3
 
Automating Map Production With FME and Python
Automating Map Production With FME and PythonAutomating Map Production With FME and Python
Automating Map Production With FME and Python
Safe Software
 
FME for Climate Data: Turning Big Data into Actionable Insights
FME for Climate Data: Turning Big Data into Actionable InsightsFME for Climate Data: Turning Big Data into Actionable Insights
FME for Climate Data: Turning Big Data into Actionable Insights
Safe Software
 
Key AI Technologies Used by Indian Artificial Intelligence Companies
Key AI Technologies Used by Indian Artificial Intelligence CompaniesKey AI Technologies Used by Indian Artificial Intelligence Companies
Key AI Technologies Used by Indian Artificial Intelligence Companies
Mypcot Infotech
 
Software Engineering Process, Notation & Tools Introduction - Part 3
Software Engineering Process, Notation & Tools Introduction - Part 3Software Engineering Process, Notation & Tools Introduction - Part 3
Software Engineering Process, Notation & Tools Introduction - Part 3
Gaurav Sharma
 
Ad

Statistical inference for (Python) Data Analysis. An introduction.

  • 1. daftCode sp. z o.o. Statistical inference for (Python) Data Analysis. An introduction Piotr Milanowski
  • 2. daftCode sp. z o.o. Statistical inference? Wait, why? ● Quantify a level of trust for values you obtain ● Compare values ● Infer validity of provided data
  • 3. daftCode sp. z o.o. Buzz phrases for this talk ● Probability ● Distribution ● Random variable ● Significance ● Hypothesis testing ● Statistic
  • 4. daftCode sp. z o.o. Part 1. Preparation.
  • 5. daftCode sp. z o.o. Building Python statistical stack ● Necessary modules: Numpy Scipy ● Helpful modules: Pandas Matplotlib
  • 6. daftCode sp. z o.o. NumPy ● http://www.numpy.org ● Numerical library ● Optimized for speed and memory efficiency ● Many useful and intuitive functionalities, and methods (especially for multidimensional arrays)
  • 7. daftCode sp. z o.o. NumPy (Example) Python >>> # Vector >>> v = [1, 2, 3, 4] >>> # scaling vector 2v >>> v2 = [2*i for i in v] >>> # Adding vectors v+v2 >>> v3 = [v[i]+v2[i] for i in range(len(v))] >>> # Vector normalization >>> mean = sum(v)/len(v) >>> zero_mean = [(i – mean) for i in v] >>> std = sum(i**2 for i in zero_mean)/len(v) >>> normalized = [i/std for i in zero_mean] Python + NumPy >>> import numpy as np >>> # Vector >>> v = np.array([1, 2, 3, 4]) >>> # sacling vector 2v >>> v2 = 2*v >>> # Adding vectors v+v2 >>> v3 = v2 + v >>> # Normalization >>> normalized = v.mean()/v.std()
  • 8. daftCode sp. z o.o. SciPy ● http://www.scipy.org ● A set of scientific libraries for signal analysis (scipy.signal), image analysis (scipy.ndimage), Fourier transform (scipy.fftpack), linear algebra (scipy.linalg), integration (scipy.integrate)….. ● Here: scipy.stats
  • 9. daftCode sp. z o.o. Pandas & Matplotlib ● http://pandas.pydata.org ● Great datastructures with helpful methods ● http://matplotlib.org/ ● Visualization library
  • 10. daftCode sp. z o.o. Part 2. Examples
  • 11. daftCode sp. z o.o. Eaxample 1. Anomaly detection. ● Data: number of daily page entries from 3 months ● Question: should we be suspicious if for a given day we have 800, 850 and 900 entries?
  • 12. daftCode sp. z o.o. Example 1. Anomaly detection >>> import numpy as np >>> values = np.array([…]) # Get values from >>> # somewhere >>> values array([796, 756, 827, 779, 787, 731, 847, 829, 794, 790, 831, 820, 822, 792, 767, 757, 777, 775, 799, 756, 849, 808, 800, 823, 798, 809, 783, 800, 848, 772, 853, 759, 745, 868, 713, 863, 819, 829, 799, 777, 785, 812, 877, 839, 783, 867, 875, 829, 818, 813, 763, 829, 812, 867, 790, 748, 822, 774, 900, 830, 781, 795, 825, 799, 795, 839, 840, 765, 781, 781, 797, 821, 852, 836, 811, 771, 800, 752, 776, 755, 798, 839, 821, 794, 846, 834, 825, 825, 830, 814, 839, 760, 786, 747, 803, 717, 801, 819, 789, 824, 835, 806, 858, 774, 848, 793, 826, 796, 798, 773, 779, 775, 779, 806, 768, 787, 788, 822, 843, 780]) >>> values.max(), values.min() (900, 713)
  • 13. daftCode sp. z o.o. Example 1. Anomaly detection ● Assumption: values are drawn from Poisson distribution ● What is the probability of obtaining 800, 850, 900 for Poisson distribution fitted to this data? ● What is threshold value? ● scipy.stats.poisson (and many other distributions)
  • 14. daftCode sp. z o.o. Example 1. Anomaly detection >>> import scipy.stats as ss >>> # Calculating distribution parameter >>> mu = values.mean() >>> # Check for 800 >>> 1 – ss.poisson.cdf(800, mu) # equal to ss.poisson.sf(800, mu) 0.548801 >>> # Check for 900 >>> 1 – ss.poisson.cdf(900, mu) 0.00042 >>> # Check for 850 >>> 1 – ss.poisson.cdf(850, mu) 0.05205 >>> # Threshold for magical 5% >>> ss.poisson.ppf(0.95, mu) 851 ● 3 lines of code (read data, calculate distribution parameter, calculate threshold), and the detector is ready!
  • 15. Example 2. Confidence intervals ● What is the mean number of entries? ● What is the 95% confidence interval for the calculated mean?
    >>> # CI simulation: resample the data with replacement
    >>> def ci(v, no_reps):
    ...     for i in range(no_reps):
    ...         idx = np.random.randint(0, len(v), size=len(v))
    ...         yield v[idx].mean()
    >>> # Get simulated means
    >>> gen = ci(values, 10000)
    >>> sim_means = np.fromiter(gen, 'float')
    >>> # 95% confidence interval
    >>> (ci_low, ci_high) = np.percentile(sim_means, [2.5, 97.5])
    >>> print(ci_low, ci_high)
    797.942 810.350
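The resampling loop in Example 2 can also be done in one vectorized step. The sketch below assumes a hypothetical helper name `bootstrap_ci_mean` and uses synthetic Poisson data in place of the real `values`; it draws all resamples at once as a 2-D index array.

```python
import numpy as np


def bootstrap_ci_mean(v, no_reps=10000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of v."""
    rng = np.random.default_rng(seed)
    v = np.asarray(v, dtype=float)
    # One row of indices per bootstrap replicate, drawn with replacement.
    idx = rng.integers(0, len(v), size=(no_reps, len(v)))
    sim_means = v[idx].mean(axis=1)
    tail = 100 * (1 - level) / 2
    low, high = np.percentile(sim_means, [tail, 100 - tail])
    return low, high


# Synthetic stand-in for the daily counts (an assumption).
sample = np.random.default_rng(1).poisson(800, size=120)
low, high = bootstrap_ci_mean(sample)
print(low, sample.mean(), high)
```

Fixing the seed makes the interval reproducible, which helps when the numbers end up in a report.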
  • 16. Example 3. Comparing distributions ● Data: two sets of time spent on a page, one for fraud data (F) and one for non-fraud data (C) ● Question: is there a (significant) difference between those two distributions?
  • 17. Example 3. Comparing distributions
    >>> ok = np.array(ok)  # non-fraud
    >>> fraud = np.array(fraud)
    >>> np.median(ok)
    140261.0
    >>> np.median(fraud)
    109883.0
    ● Unknown distributions: use a nonparametric test
    >>> ss.mannwhitneyu(ok, fraud)
    MannwhitneyuResult(statistic=54457079.5, pvalue=1.05701588547616e-59)
    ● Equalize sample sizes (just to be sure)
    >>> N = len(fraud)
    >>> idx = np.arange(0, len(ok))
    >>> np.random.shuffle(idx)
    >>> ok_subsample = ok[idx[:N]]
    >>> ss.mannwhitneyu(ok_subsample, fraud)
    MannwhitneyuResult(statistic=3548976.0, pvalue=3.1818273295679098e-30)
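The default alternative hypothesis of `scipy.stats.mannwhitneyu` has changed between SciPy releases, so it is safer to state it explicitly. The sketch below does that on synthetic log-normal samples; `ok_demo` and `fraud_demo` are hypothetical stand-ins for the real time-on-page data, and their parameters are assumptions chosen only for illustration.

```python
import numpy as np
import scipy.stats as ss

rng = np.random.default_rng(0)

# Synthetic stand-ins for time-on-page samples (assumed parameters).
ok_demo = rng.lognormal(mean=11.9, sigma=1.0, size=2000)
fraud_demo = rng.lognormal(mean=11.5, sigma=1.0, size=500)

# Two-sided test: the distributions differ in either direction.
result = ss.mannwhitneyu(ok_demo, fraud_demo, alternative="two-sided")

# Equalize sample sizes by subsampling the larger set without replacement.
ok_sub = rng.choice(ok_demo, size=len(fraud_demo), replace=False)
result_sub = ss.mannwhitneyu(ok_sub, fraud_demo, alternative="two-sided")

print(result.pvalue, result_sub.pvalue)
```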
  • 18. Example 4. Bootstrap ● The same data and question as in the previous example ● Test without any built-in tests ● Null hypothesis (H0): both datasets are drawn from the same distribution ● Mix them together, draw two new datasets (with replacement), calculate the statistic (difference in medians) ● Estimate the probability of obtaining a statistic larger than or equal to the initial one (from the original data)
  • 19. Example 4. Bootstrap
    >>> # Generate statistics under the null hypothesis
    >>> def generate_statistics(vec1, vec2, no_reps=10000):
    ...     all_ = np.r_[vec1, vec2]
    ...     N, M = len(vec1), len(vec2)
    ...     for i in range(no_reps):
    ...         random_indices = np.random.randint(0, M+N, size=M+N)
    ...         # New datasets with the same sizes as vec1 and vec2
    ...         tmp1 = all_[random_indices[:N]]
    ...         tmp2 = all_[random_indices[N:]]
    ...         yield np.abs(np.median(tmp1) - np.median(tmp2))
    >>> # Initial statistic
    >>> stat_0 = np.abs(np.median(ok) - np.median(fraud))
    >>> gen = generate_statistics(ok, fraud)
    >>> stats = np.fromiter(gen, 'float')
    >>> # Probability of obtaining a statistic larger than or equal to the initial one
    >>> np.sum(stats >= stat_0)/len(stats)
    0.0
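A closely related alternative is a permutation test, which shuffles the pooled data without replacement instead of resampling with replacement. The sketch below is that swapped-in variant, not the slide's method; the name `permutation_pvalue` is hypothetical and the data are synthetic. It also adds one to the numerator and denominator, so the estimated p-value is never exactly zero.

```python
import numpy as np


def permutation_pvalue(vec1, vec2, no_reps=2000, seed=0):
    """Permutation test for the absolute difference in medians."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([vec1, vec2])
    n1 = len(vec1)
    stat_0 = abs(np.median(vec1) - np.median(vec2))
    count = 0
    for _ in range(no_reps):
        # Shuffle the pooled data and split it into groups of the original sizes.
        shuffled = rng.permutation(pooled)
        stat = abs(np.median(shuffled[:n1]) - np.median(shuffled[n1:]))
        if stat >= stat_0:
            count += 1
    # The +1 terms count the observed statistic as one of the permutations.
    return (count + 1) / (no_reps + 1)


# Synthetic samples with clearly different medians (an assumption).
rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, size=300)
b = rng.normal(1.0, 1.0, size=300)
p_diff = permutation_pvalue(a, b)
print(p_diff)
```

With 2000 repetitions the smallest reportable p-value is 1/2001, which is a more honest answer than 0.0 for a finite simulation.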
  • 20. Example 5. Naive Bayes ● Can we classify fraud based on time spent on a page? ● Using Naive Bayes: P(F|t) ∝ P(t|F)P(F), P(C|t) ∝ P(t|C)P(C) ● P(t|F), P(t|C) are sample distributions; P(C), P(F) are the class fractions in the sample
  • 21. Example 5. Naive Bayes [Figure: sample distributions P(t|C) and P(t|F)]
  • 22. Example 5. Naive Bayes
    >>> # Retrieve probability density based on a histogram
    >>> def get_prob(value, bins, probs):
    ...     # bins holds len(probs) + 1 edges; find the bin containing value
    ...     idx = np.searchsorted(bins, value) - 1
    ...     # Values outside the histogram range fall back to the edge bins
    ...     return probs[np.clip(idx, 0, len(probs) - 1)]
    >>> N = len(ok) + len(fraud)
    >>> P_ok = len(ok)/N
    >>> P_fraud = len(fraud)/N
    >>> # density=True replaces the normed=True argument of older NumPy
    >>> probs_ok, bins_ok = np.histogram(ok, bins=500, density=True)
    >>> probs_fraud, bins_fraud = np.histogram(fraud, bins=bins_ok, density=True)
    >>> # Test!
    >>> test_times = np.linspace(0, 1e6, 100000)
    >>> detected_as_fraud = 0
    >>> detected_as_ok = 0
    >>> for el in test_times:
    ...     p_ok = get_prob(el, bins_ok, probs_ok)*P_ok
    ...     p_fr = get_prob(el, bins_fraud, probs_fraud)*P_fraud
    ...     if p_ok > p_fr:
    ...         detected_as_ok += 1
    ...     else:
    ...         detected_as_fraud += 1
    >>> print(detected_as_ok, detected_as_fraud)
    100000 0
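The per-element loop in Example 5 can be replaced by array operations. The following is a sketch under stated assumptions: `histogram_density` and `classify` are hypothetical names, the two samples are synthetic and deliberately well separated, and values outside the histogram range get density zero rather than the edge-bin density.

```python
import numpy as np


def histogram_density(x, edges, density):
    """Look up the histogram density for each value in x (0 outside the range)."""
    x = np.asarray(x, dtype=float)
    idx = np.searchsorted(edges, x, side="right") - 1
    # np.histogram treats the last bin as closed on the right.
    idx[x == edges[-1]] = len(density) - 1
    inside = (idx >= 0) & (idx < len(density))
    out = np.zeros(x.shape, dtype=float)
    out[inside] = density[idx[inside]]
    return out


def classify(times, ok_sample, fraud_sample, bins=50):
    """Label each time as 'ok' or 'fraud' by comparing P(t|class) * P(class)."""
    pooled = np.concatenate([ok_sample, fraud_sample])
    edges = np.histogram_bin_edges(pooled, bins=bins)
    dens_ok, _ = np.histogram(ok_sample, bins=edges, density=True)
    dens_fr, _ = np.histogram(fraud_sample, bins=edges, density=True)
    n = len(ok_sample) + len(fraud_sample)
    score_ok = histogram_density(times, edges, dens_ok) * len(ok_sample) / n
    score_fr = histogram_density(times, edges, dens_fr) * len(fraud_sample) / n
    # Ties (including both scores being zero) are labeled 'ok'.
    return np.where(score_fr > score_ok, "fraud", "ok")


# Synthetic, well-separated samples (an assumption for illustration).
rng = np.random.default_rng(0)
ok_demo = rng.normal(100.0, 10.0, size=1000)
fraud_demo = rng.normal(20.0, 5.0, size=1000)
labels = classify([20.0, 100.0], ok_demo, fraud_demo)
print(labels)
```

When the two class distributions overlap heavily, as in the talk's data, the priors dominate and nearly everything is labeled as the majority class.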
  • 23. Example 5. Naive Bayes ● NB doesn't work well in this example: every test time is classified as non-fraud ● A simple threshold on the time gives better results ● But still, several lines of code and a classifier is ready!
  • 24. Almost at the end. Just one more slide… and it's a summary!
  • 25. Summary ● Statistical inference is used to compare and validate values ● It gives some quantification, but there is still room for subjective decisions (p-values, priors) ● It is quite easy to do statistics in Python when you have the proper tools