Kernel Density Estimation (KDE) in Excel Tutorial

Previously, we’ve seen how to use the histogram method to infer the probability density function (PDF) of a random variable (population) using a finite data sample. In this tutorial, we’ll carry on the problem of probability density function inference, but using another method: Kernel density estimation. Kernel density estimates (KDE) are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel.

Uploaded by

NumXL Pro
Copyright
© Attribution Non-Commercial (BY-NC)

Kernel Density Estimation (KDE) Tutorial, Spider Financial Corp, 2013

Kernel Density Estimation (KDE)
Previously, we've seen how to use the histogram method to infer the probability density function (PDF) of a random variable (population) using a finite data sample. In this tutorial, we'll carry on the problem of probability density function inference, but using another method: kernel density estimation.
Kernel density estimates (KDE) are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel.
Why do we care?
One of the main problems in practical applications is that the needed probability distribution is usually not readily available, but rather it must be derived from other existing information (e.g. sample data).
KDEs are similar to histograms in that both are non-parametric methods, so there are no restrictive assumptions about the shape of the density function, but a KDE is generally superior to a histogram in terms of accuracy and continuity.
Overview
Let's consider a finite data sample $\{x_1, x_2, \ldots, x_N\}$ observed from a stochastic (i.e. continuous and random) process. We wish to infer the population probability density function.
In the histogram method, we select the left bound of the histogram ($x_o$) and the bin width ($h$), and then compute the bin $k$ probability estimator $\hat{f}_h(k)$:
- Bin $k$ represents the following interval: $[x_o + (k-1)h,\; x_o + kh)$
- $\hat{f}_h(k) = \dfrac{\sum_{i=1}^{N} I\{(k-1)h \le x_i - x_o < kh\}}{N}$
- $I\{\cdot\}$ is an event function that returns 1 (one) if the condition is true, 0 (zero) otherwise.
The choice of bins, especially the bin width ($h$), has a substantial effect on the shape and other properties of $\hat{f}_h(k)$.
Finally, we can think of the histogram method as follows:
- Each observation (event) is statistically independent of all others, and its occurrence probability is equal to $\frac{1}{N}$.
- $\hat{f}_h(k)$ is simply the integral (sum) of the event probabilities in each bin.
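
As a rough illustration of the histogram estimator, the sketch below counts the observations falling in each bin and divides by N. The function name `histogram_probabilities` and the sample values are made up for this example; they are not part of NumXL.

```python
import math


def histogram_probabilities(sample, x_o, h, n_bins):
    """Estimate the probability of each bin k = 1..n_bins (hypothetical helper)."""
    counts = [0] * n_bins
    for x in sample:
        # Bin k covers [x_o + (k-1)h, x_o + kh), so k = floor((x - x_o) / h) + 1.
        k = math.floor((x - x_o) / h) + 1
        if 1 <= k <= n_bins:
            counts[k - 1] += 1
    n = len(sample)
    # Each observation carries probability 1/N; a bin sums those probabilities.
    return [c / n for c in counts]


# Illustrative data only: six observations, three bins of width 1 starting at 0.
sample = [0.1, 0.4, 0.45, 1.2, 1.9, 2.5]
probs = histogram_probabilities(sample, x_o=0.0, h=1.0, n_bins=3)
print(probs)
```

Changing `h` or `x_o` in this sketch moves observations between bins, which is the sensitivity to bin choice noted above.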

What is a kernel?
A kernel is a non-negative, real-valued, integrable function $K(\cdot)$ satisfying the following two requirements:

$$\int_{-\infty}^{\infty} K(u)\,du = 1$$
$$K(-u) = K(u)$$

And, as a result, the scaled function $K^*(u)$, where $K^*(u) = \lambda K(\lambda u)$ for $\lambda > 0$, is a kernel as well.
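
To see why scaling preserves the kernel properties, here is a short check. It assumes the scaled kernel takes the standard form $K^*(u) = \lambda K(\lambda u)$ with $\lambda > 0$, and uses the substitution $v = \lambda u$:

$$\int_{-\infty}^{\infty} K^*(u)\,du = \int_{-\infty}^{\infty} \lambda K(\lambda u)\,du = \int_{-\infty}^{\infty} K(v)\,dv = 1$$
$$K^*(-u) = \lambda K(-\lambda u) = \lambda K(\lambda u) = K^*(u)$$

Setting $\lambda = 1/h$ gives the kernel $\frac{1}{h}K\left(\frac{u}{h}\right)$ scaled by a bandwidth $h$.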
Now, place a scaled kernel function at each observation in the sample and compute the new probability estimator $\hat{f}_h(x)$ for a value $x$ (compared to an earlier bin in the histogram).

$$\hat{f}_h(x) = \frac{1}{N}\sum_{i=1}^{N} K_h(x - x_i)$$
$$K_h(u) = \frac{1}{h} K\left(\frac{u}{h}\right)$$
$$\hat{f}_h(x) = \frac{1}{Nh}\sum_{i=1}^{N} K\left(\frac{x - x_i}{h}\right)$$


As an example, let $K(\cdot)$ be the standardized Gaussian density function. The KDE looks like the sum of Gaussian curves, each centered on one observation.

Note: For a Gaussian kernel, the bandwidth $h$ is the same as the standard deviation of ($x - x_i$).
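
The Gaussian case can be sketched in a few lines of Python. This is a minimal illustration of the fixed-bandwidth estimator, not NumXL's implementation; the names `gaussian_kernel` and `kde`, the sample values, and the bandwidth are assumptions chosen for the example.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x, sample, h, kernel=gaussian_kernel):
    """Fixed-bandwidth estimate: (1 / (N h)) * sum of K((x - x_i) / h)."""
    n = len(sample)
    return sum(kernel((x - xi) / h) for xi in sample) / (n * h)


# Illustrative data only: four observations and a hand-picked bandwidth.
sample = [-1.0, 0.0, 0.5, 2.0]
h = 0.5
for x in (-1.0, 0.0, 1.0):
    print(f"f_hat({x:+.1f}) = {kde(x, sample, h):.4f}")
```

Each observation contributes one Gaussian bump of standard deviation `h`, and the estimate at `x` is the average of those bumps.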

The KDE method replaces the discrete probability

$$P(x) = \begin{cases} 1/N & x \in \{x_1, x_2, \ldots, x_N\} \\ 0 & x \notin \{x_1, x_2, \ldots, x_N\} \end{cases}$$

with a kernel function. This permits overlap between kernels, thus promoting continuity in the probability estimator.
Why KDE?
Due to our data sampling, we are left with a finite set of values for continuous random variables. Using a kernel instead of discrete probabilities, we preserve the continuous nature of the underlying random variable.
To proceed with KDE, you'll need to decide on two key parameters: kernel function and bandwidth.
Which kernel should I use?
A range of kernel functions is commonly used: uniform, triangular, biweight, triweight and Epanechnikov. The Gaussian kernel is often used; $K(\cdot) = \phi(\cdot)$, where $\phi$ is the standard normal density function.
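
For reference, the kernels named above can be written out as follows. These are the conventional textbook definitions with support on [-1, 1] (except the Gaussian); we assume, but have not verified, that NumXL uses the same scaling. The helper `integrate` is a hypothetical midpoint-rule check that each kernel integrates to one.

```python
import math


def uniform(u):
    return 0.5 if abs(u) <= 1.0 else 0.0


def triangular(u):
    return 1.0 - abs(u) if abs(u) <= 1.0 else 0.0


def epanechnikov(u):
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def biweight(u):
    return (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) <= 1.0 else 0.0


def triweight(u):
    return (35.0 / 32.0) * (1.0 - u * u) ** 3 if abs(u) <= 1.0 else 0.0


def gaussian(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


KERNELS = {
    "uniform": uniform,
    "triangular": triangular,
    "epanechnikov": epanechnikov,
    "biweight": biweight,
    "triweight": triweight,
    "gaussian": gaussian,
}


def integrate(kernel, lo=-8.0, hi=8.0, steps=16000):
    """Midpoint-rule approximation of the integral of kernel over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(kernel(lo + (i + 0.5) * dx) for i in range(steps)) * dx


for name, kernel in KERNELS.items():
    symmetric = kernel(0.3) == kernel(-0.3)
    print(f"{name:13s} integral = {integrate(kernel):.4f}  symmetric = {symmetric}")
```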
How do I properly compute kernel bandwidth?
Intuitively, one wants to choose an $h$ as small as the data allows, but there is a trade-off between the bias of the estimator and its variance.
Selection of the bandwidth of a kernel estimator is a subject of considerable research. We will outline three approaches:
1. Subjective selection: One can experiment by using different bandwidths and simply selecting one that looks right for the type of data under investigation.
2. Selection with reference to some given distribution: Here one selects the bandwidth that would be optimal for a particular PDF. Keep in mind that you are not assuming that $f(x)$ is normal, but rather selecting an $h$ which would be optimal if the PDF were normal. Using a Gaussian kernel, the optimal bandwidth $h_{opt}$ is defined as follows:

   $$h_{opt} = \sigma \sqrt[5]{\frac{4}{3N}}$$

   The normal distribution is not a wiggly distribution; it is unimodal and bell-shaped. It is therefore to be expected that $h_{opt}$ will be too large for multimodal distributions. Furthermore, the sample variance ($s^2$) is not a robust estimator of $\sigma^2$; it overestimates if some outliers (extreme observations) are present. To overcome these problems, Silverman proposed the following bandwidth estimator:

   $$h_{opt} = \frac{0.9\,\hat{\sigma}}{\sqrt[5]{N}}$$
   $$\hat{\sigma} = \min\left(s, \frac{R}{1.34}\right)$$
   $$R = IQR = Q_3 - Q_1$$

   where IQR is the interquartile range and $s$ is the sample standard deviation.


3. Data-driven estimation: This is an area of current research using several different methods: Fourier transform, diffusion-based, etc.
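
The two reference-distribution formulas from item 2 can be sketched with the Python standard library. The function names are made up for this example, the sample standard deviation stands in for $\sigma$ in the normal-reference rule, and `statistics.quantiles` uses its default quartile convention, which may differ slightly from the one Excel or NumXL applies.

```python
import statistics


def normal_reference_bandwidth(sample):
    """h = sigma * (4 / (3N))^(1/5), with sigma estimated by the sample std dev."""
    n = len(sample)
    s = statistics.stdev(sample)
    return s * (4.0 / (3.0 * n)) ** 0.2


def silverman_bandwidth(sample):
    """h = 0.9 * min(s, IQR / 1.34) / N^(1/5)."""
    n = len(sample)
    s = statistics.stdev(sample)
    q1, _, q3 = statistics.quantiles(sample, n=4)  # quartiles Q1, Q2, Q3
    sigma_hat = min(s, (q3 - q1) / 1.34)
    return 0.9 * sigma_hat / n ** 0.2


# Illustrative data only: mostly small values plus one outlier (25.0).
sample = [1.2, 1.9, 2.1, 2.4, 2.8, 3.0, 3.3, 3.7, 4.1, 25.0]
print(f"normal reference: {normal_reference_bandwidth(sample):.4f}")
print(f"Silverman:        {silverman_bandwidth(sample):.4f}")
```

With an outlier in the data, the interquartile term keeps the Silverman bandwidth from inflating along with the sample standard deviation.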
Process
Using the NumXL add-in for Excel, you can compute the KDE values for different kernel functions (e.g. Gaussian, uniform, triangular, etc.) and (optionally) with a bandwidth value.
For our sample data, we are using 50 randomly generated values of the normal distribution (using the random generator in the Excel Analysis Pack). We plotted the histogram for our reference:

Now we are ready to construct our KDE plot. First, select the empty cell in your worksheet where you wish the output table to be generated, then locate and click on the Descriptive Statistics icon in the NumXL tab (or toolbar). Then, select the Kernel density estimation item from the drop-down menu.

The KDE wizard appears.

Select the cells range for the values of the input variable.
Notes:
1. The cells range includes (optionally) the heading (Label) cell, which would be used in the output tables where it references those variables.
2. By default, the output table cells range is set to the currently selected cell in your worksheet.
3. By default, the output graph cells range is set to the 7 cells to the right of the currently selected cell in your worksheet.
Finally, once we select the input data (X) cells range, the Options and Missing Values tabs become available (enabled).
Next, select the Options tab:

Notes:
1. By default, the Gaussian kernel function is selected. Let's leave this option unchanged.
2. By default, the optimal bandwidth option is checked. The KDE function will use the Silverman estimate for the bandwidth. Leave it checked.
3. By default, the output table size is set to 5. Leave it unchanged.
4. Overlay Normal distribution is checked. This option in effect instructs the wizard to generate a second curve for the Gaussian distribution for comparison purposes. Leave this option checked.
Now, click on the Missing Values tab.

In this tab, you can select an approach to handle missing values in the data set (X's). By default, any observation with a missing value would be excluded from the analysis.
This treatment is a good approach for our analysis, so let's leave it unchanged.
Now, click OK to generate the output tables.

Notes:
1. The values of all X are sorted in ascending order.

2. The summary statistics in the 1st row are computed merely to facilitate the creation of the table or computing the overlay Gaussian distribution function.
The generated plot of the KDE is shown below:

Note that the KDE curve (blue) tracks very closely with the Gaussian density (orange) curve.
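
For readers who want to reproduce this comparison outside Excel, the following is a rough Python analogue of the first case, not NumXL's implementation. It draws 50 normal values, applies the Silverman bandwidth, and evaluates both the Gaussian-kernel KDE and the overlay normal density on a grid. The seed, the grid size, and the 3h padding are arbitrary choices for this sketch.

```python
import math
import random
import statistics


def gaussian_pdf(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


random.seed(42)  # arbitrary seed, only so the sketch is repeatable
sample = sorted(random.gauss(0.0, 1.0) for _ in range(50))

n = len(sample)
mean = statistics.mean(sample)
s = statistics.stdev(sample)
q1, _, q3 = statistics.quantiles(sample, n=4)
h = 0.9 * min(s, (q3 - q1) / 1.34) / n ** 0.2  # Silverman bandwidth

# Evaluation grid padded by 3 bandwidths on each side of the data.
lo, hi = sample[0] - 3.0 * h, sample[-1] + 3.0 * h
points = 201
step = (hi - lo) / (points - 1)
grid = [lo + i * step for i in range(points)]

kde_values = [
    sum(gaussian_pdf((x - xi) / h) for xi in sample) / (n * h) for x in grid
]
# Overlay: normal density using the sample mean and standard deviation.
normal_values = [gaussian_pdf((x - mean) / s) / s for x in grid]

# Trapezoid-rule area under the KDE and largest gap between the two curves.
area = sum(0.5 * (a + b) * step for a, b in zip(kde_values, kde_values[1:]))
max_gap = max(abs(k - g) for k, g in zip(kde_values, normal_values))
print(f"h = {h:.4f}, area under KDE = {area:.4f}, max gap = {max_gap:.4f}")
```

Plotting `kde_values` and `normal_values` against `grid` gives the two curves to compare.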
Case 2
Now let's try a non-normal sample data set. We generated 50 random values of a uniform distribution between -3 and 3. Following similar steps, we plotted the histogram and the KDE:

Note that the KDE curve (blue) tracks much more closely with the underlying distribution (i.e. uniform) than the histogram.
Case3
Forour3
rd
case,wegenerated50randomvaluesofabinomialdistribution(p=0.2andbatch
size=20).Followingsimilarsteps,weplottedthehistogramandtheKDE.

Note that the KDE curve (blue) tracks much more closely with the underlying distribution (i.e. binomial) than the histogram.
Conclusion
In this tutorial, we demonstrated the process to generate a kernel density estimation in Excel using NumXL's add-in functions.
The KDE method is a major improvement for inferring the probability density function of the population, in terms of accuracy and continuity of the function. Nevertheless, it introduces a new challenge: selecting a proper bandwidth. In the majority of cases, the Silverman estimator for the bandwidth proves to be satisfactory, but is it optimal? Do we care?
Where do we go from here?
First, to answer the question of optimality, we need to introduce additional algorithms to estimate its values. For example, in Annals of Statistics, Volume 38, Number 5, pages 2916-2957, Z. I. Botev, J. F. Grotowski, and D. P. Kroese described a numerical sample-data-driven method for finding the optimal bandwidth using kernel density estimation via the diffusion approach.
Second, in cases where the range of values that the random variable can take is known to be constrained from one side (e.g. prices, binomial data, etc.), or in a range (e.g. survival rate, default rate, etc.), how do we adapt the KDE to factor in those constraints?
Finally, we defined the KDE probability estimator using a fixed bandwidth ($h$) for all observations. If the bandwidth is not held fixed, but is varied depending upon the location of either the estimate (balloon estimator) or the samples (pointwise estimator), this produces a particularly powerful method known as adaptive or variable bandwidth kernel density estimation.
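
As a small illustration of the sample-dependent variant, the sketch below gives each observation its own bandwidth. How those bandwidths should be chosen is not covered in this tutorial, so the function simply takes them as input; the name `sample_point_kde` and all values are hypothetical.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def sample_point_kde(x, sample, bandwidths):
    """Variable-bandwidth estimate: (1/N) * sum of K((x - x_i) / h_i) / h_i."""
    if len(sample) != len(bandwidths):
        raise ValueError("need exactly one bandwidth per observation")
    n = len(sample)
    return sum(
        gaussian_kernel((x - xi) / hi) / hi for xi, hi in zip(sample, bandwidths)
    ) / n


# Illustrative data only: a wider kernel on the isolated observation at 4.0.
sample = [0.0, 0.2, 0.3, 4.0]
bandwidths = [0.3, 0.3, 0.3, 1.0]
print(f"f_hat(0.1) = {sample_point_kde(0.1, sample, bandwidths):.4f}")
print(f"f_hat(4.0) = {sample_point_kde(4.0, sample, bandwidths):.4f}")
```

When every entry of `bandwidths` is the same value, this reduces to the fixed-bandwidth estimator defined earlier.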
