Revolution Confidential




Introduction to R for Data Mining

2012 Spring Webinar Series

Joseph B. Rickert,
Revolution Analytics
June 5, 2012


Goals for Today’s Webinar




To convince you that:

• R is a serious platform for data mining
• Seriously, it is not difficult to learn enough R to do some serious data mining
• Revolution R Enterprise is the platform for serious data mining

Data Mining

Applications             Actions        Algorithms
Credit Scoring           Acquire Data   CART
Fraud Detection          Prepare        Random Forests
Ad Optimization          Classify       SVM
Targeted Marketing       Predict        KMeans
Gene Detection           Visualize      Hierarchical clustering
Recommendation systems   Optimize       Ensemble Techniques
Social Networks          Interpret

Recent KDnuggets Poll suggests so are a lot of other serious data miners

What Analytics, Data mining, Big Data software you used in the past 12
months for a real project (not just evaluation) [798 voters]

Software (voters)                   % users in 2012    % users in 2011
R (245)                             30.7%              23.3%
Excel (238)                         29.8%              21.8%
Rapid-I RapidMiner (213)            26.7%              27.7%
KNIME (174)                         21.8%              12.1%
Weka / Pentaho (118)                14.8%              11.8%
StatSoft Statistica (112)           14.0%              8.5%
SAS (101)                           12.7%              13.6%
Rapid-I RapidAnalytics (83)         10.4%              Not asked in 2011
MATLAB (80)                         10.0%              7.2%
IBM SPSS Statistics (62)            7.8%               7.2%
IBM SPSS Modeler (54)               6.8%               8.3%
SAS Enterprise Miner (46)           5.8%               7.1%

Learning R

WHAT DOES IT MEAN TO LEARN R?

What does it mean to learn French?

  To get around Paris on the Metro
  To read a Menu
  To carry on a conversation

Learning R

Levels of R Skill (by hours of use, from 10 to 10,000)

  Use a GUI                      R aware
  Use R Functions                R user
  Write functions                R programmer
  Write an R package             R contributor
  Write production level code    R developer

The Malcolm Gladwell “Outlier” Scale

Productive from the Get go!

THE STRUCTURE OF R FACILITATES LEARNING

R is set up to compute functions on data

  lm <- function(x,y) { . . . }     produces the object lm.model:

      lm.model$assign
      lm.model$coefficients
      lm.model$df.residual
      lm.model$effects
      lm.model$fitted.values
      . . .
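
To make the diagram concrete, here is a minimal sketch of calling lm and inspecting what it returns. The mtcars data set and the formula mpg ~ wt are stand-ins chosen for illustration; they are not taken from the webinar scripts.

```r
# Fit a linear model on the built-in mtcars data (an illustrative stand-in).
lm.model <- lm(mpg ~ wt, data = mtcars)

# The fitted model is a list-like object; names() shows its components.
names(lm.model)

# Individual components are reached with the $ operator.
lm.model$coefficients          # intercept and slope
lm.model$df.residual           # residual degrees of freedom
head(lm.model$fitted.values)   # first few fitted values
```

The same pattern (call a function, store the result, pull pieces out with $) applies to most of the modeling functions mentioned later in the deck.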

A little knowledge goes a long way in R

• R’s functional design facilitates performing small tasks
• For the most part, the output of a function depends only on the values of its arguments
• Calling a function multiple times with the same values of its arguments will produce the same result each time
• Minimal side effects means it is much easier to understand and predict the behavior of a program

The trick is knowing which functions to call
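
The points above can be illustrated with a small sketch. The function name standardize and the sample vector are hypothetical, introduced only for this example.

```r
# A small function whose result depends only on its argument.
standardize <- function(x) (x - mean(x)) / sd(x)

v <- c(2, 4, 6, 8)
first  <- standardize(v)
second <- standardize(v)
identical(first, second)   # same arguments, same result

# The call does not modify its argument: v is unchanged.
v

# Random number generators are one of the exceptions behind "for the most
# part": they depend on hidden state unless the seed is fixed first.
set.seed(42); a <- runif(3)
set.seed(42); b <- runif(3)
identical(a, b)
```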

Basic Machine Learning Functions

Task          Function       Library        Description
Cluster       hclust         stats          Hierarchical cluster analysis
              kmeans         stats          Kmeans clustering
Classifiers   glm            stats          Logistic Regression
              rpart          rpart          Recursive partitioning and regression trees
              ksvm           kernlab        Support Vector Machine
Ensemble      ada            ada            Stochastic boosting
              randomForest   randomForest   Random Forests classification and regression
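
As a rough sketch of how the stats functions in the table are called, the block below runs kmeans, hclust and glm on built-in data sets. The choice of iris and mtcars, the number of clusters, and the model formula are assumptions made for illustration, not the webinar's own examples.

```r
# Numeric measurements from the built-in iris data, scaled to unit variance.
x <- scale(iris[, 1:4])

# K-means clustering (stats): 3 clusters, several random starts.
set.seed(1)
km <- kmeans(x, centers = 3, nstart = 10)
table(km$cluster, iris$Species)

# Hierarchical clustering (stats): build the tree, then cut it into 3 groups.
hc <- hclust(dist(x), method = "complete")
groups <- cutree(hc, k = 3)
table(groups)

# Logistic regression (stats): glm with a binomial family on mtcars.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
head(predict(fit, type = "response"))   # fitted probabilities
```

The functions from rpart, kernlab, ada and randomForest follow the same formula-plus-data calling style once their packages are installed and loaded.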

Noteworthy Data Mining Packages

Package   Comment
rattle    A very intuitive GUI for data mining that produces useful R code
caret     Well organized and remarkably complete collection of functions to
          facilitate model building for regression and classification problems

Doing a lot with a little R

TIME TO RUN SOME CODE

Scripts to run

    Script                            Some key Functions
0   Setup                             Load libraries
1   Explore weather data              read.csv, plot
2   Run clustering algorithms         kmeans, hclust
3   Basic decision tree               rpart
4   Boosted Tree                      ada
5   Random Forest                     randomForest
6   Support Vector Machine            randomForest, varImpPlot
7   Big Data Mortgage Default model   rxLogit, rxKmeans
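
The webinar's scripts and weather data are not reproduced in this transcript. As a hedged sketch of what scripts 1 and 3 involve, the block below reads a CSV file and fits a basic decision tree; the temporary file built from iris is a stand-in for the weather file, and the model formula is an assumption.

```r
library(rpart)   # recursive partitioning; a recommended package shipped with R

# Stand-in for the weather file: write iris to a temporary CSV, read it back.
csv.file <- tempfile(fileext = ".csv")
write.csv(iris, csv.file, row.names = FALSE)
dat <- read.csv(csv.file, stringsAsFactors = TRUE)

str(dat)   # first look at the variables and their types

# Basic classification tree predicting Species from the four measurements.
tree <- rpart(Species ~ ., data = dat, method = "class")
print(tree)

# Confusion table of predicted versus actual classes on the training data.
pred <- predict(tree, dat, type = "class")
table(predicted = pred, actual = dat$Species)
```

Note that R is case sensitive, so the function is read.csv, not Read.csv.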

Big Data and R

There are some challenges:
• All of your data and model code must fit into memory
• Big data sets as well as big models (lots of variables) can run out of memory
• Parallel computation might be necessary for models to run in a reasonable time
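
A back-of-the-envelope calculation shows why memory becomes the limit. The sketch below assumes a purely numeric table stored as doubles (8 bytes per value); the helper name estimate.gb and the row and column counts are hypothetical.

```r
# A numeric (double) value takes 8 bytes, so a purely numeric table needs
# roughly rows * columns * 8 bytes before any modeling overhead.
estimate.gb <- function(rows, cols) rows * cols * 8 / 1024^3

estimate.gb(1e8, 20)   # 100 million rows, 20 numeric columns, in GB

# Check the rule of thumb against a small object that does fit in memory.
m <- matrix(0, nrow = 1e5, ncol = 20)
print(object.size(m), units = "MB")
```

Model fitting typically needs additional working copies of the data, so the practical limit is usually reached well before the raw data size equals the available RAM.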

RevoScaleR in Revolution R Enterprise

Can help in a number of ways:
• Manipulate large data sets, perhaps aggregating the data so that it will fit in memory
  - For example, boiling down time-stamped data like a web log to form a time series that will fit in memory
• Run RevoScaleR functions directly on big data sets
• Run R functions in parallel
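
The web log example can be sketched in base R on a small simulated log. This is only an illustration of the aggregation idea, not RevoScaleR code: the simulated data, column names and hourly granularity are all assumptions, and a real big-data version would process the file in chunks rather than hold it in memory.

```r
# Simulated web log: one row per request, stamped to the second (UTC).
set.seed(123)
start <- as.POSIXct("2012-06-05 00:00:00", tz = "UTC")
log <- data.frame(
  time  = start + sort(sample(0:(24 * 3600 - 1), 5000, replace = TRUE)),
  bytes = rpois(5000, lambda = 2000)
)

# Boil the log down to one row per hour: total bytes and request count.
log$hour <- format(log$time, "%Y-%m-%d %H:00", tz = "UTC")
hourly <- aggregate(bytes ~ hour, data = log, FUN = sum)
hourly$requests <- as.vector(table(log$hour))

# The aggregated result is small enough to handle as an ordinary time series.
requests.ts <- ts(hourly$requests, frequency = 24)
requests.ts
```

Five thousand rows collapse to 24, which is the point of the slide: the aggregate, not the raw log, is what has to fit in memory.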

Top RevoScaleR Functions for Data Mining
parallel external memory algorithms

Task                        RevoScaleR function
Data processing             rxDataStep
Descriptive Statistics      rxSummary
Tables and cubes            rxCube, rxCrossTabs
Correlations / covariance   rxCovCor, rxCor, rxCov, rxSSCP
Linear Models               rxLinMod
Logistic regressions        rxLogit
Generalized linear models   rxGlm
K means clustering          rxKmeans
Predictions (scoring)       rxPredict

More than code, R is a community

WHERE TO GO FROM HERE?

Finding your way around the R world

• Machine Learning
• Data Mining
• Visualization
• Finding Packages
  - Task Views
  - crantastic.org
• Blogs
  - Revolutions
  - R-Bloggers
  - Quick-R
• Getting Help
  - StackOverflow
  - @RLangTip
  - Inside-R
  - www.rseek.org
• Finding R People
  - User Groups worldwide
  - #rstats

[Figure: Word Cloud for @inside_R]

Look at some more sophisticated examples

• Thomson Nguyen on the Heritage Health Prize
• Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
• Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
• Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)

Revolution Analytics Training

    http://www.revolutionanalytics.com/products/training/
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Jignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah - The Innovator and Czar of ExchangesJignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah Innovator
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
The Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI IntegrationThe Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI Integration
Re-solution Data Ltd
 
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and MLGyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
Gyrus AI
 
IT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information TechnologyIT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information Technology
SHEHABALYAMANI
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
Financial Services Technology Summit 2025
Financial Services Technology Summit 2025Financial Services Technology Summit 2025
Financial Services Technology Summit 2025
Ray Bugg
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptxDevOpsDays SLC - Platform Engineers are Product Managers.pptx
DevOpsDays SLC - Platform Engineers are Product Managers.pptx
Justin Reock
 
Jignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah - The Innovator and Czar of ExchangesJignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah - The Innovator and Czar of Exchanges
Jignesh Shah Innovator
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
The No-Code Way to Build a Marketing Team with One AI Agent (Download the n8n...
SOFTTECHHUB
 
UiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer OpportunitiesUiPath Agentic Automation: Community Developer Opportunities
UiPath Agentic Automation: Community Developer Opportunities
DianaGray10
 
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier VroomAI x Accessibility UXPA by Stew Smith and Olivier Vroom
AI x Accessibility UXPA by Stew Smith and Olivier Vroom
UXPA Boston
 
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Everything You Need to Know About Agentforce? (Put AI Agents to Work)
Cyntexa
 
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Hybridize Functions: A Tool for Automatically Refactoring Imperative Deep Lea...
Raffi Khatchadourian
 
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
GDG Cloud Southlake #42: Suresh Mathew: Autonomous Resource Optimization: How...
James Anderson
 
The Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI IntegrationThe Future of Cisco Cloud Security: Innovations and AI Integration
The Future of Cisco Cloud Security: Innovations and AI Integration
Re-solution Data Ltd
 
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and MLGyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
GyrusAI - Broadcasting & Streaming Applications Driven by AI and ML
Gyrus AI
 
IT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information TechnologyIT484 Cyber Forensics_Information Technology
IT484 Cyber Forensics_Information Technology
SHEHABALYAMANI
 
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent LasterAI 3-in-1: Agents, RAG, and Local Models - Brent Laster
AI 3-in-1: Agents, RAG, and Local Models - Brent Laster
All Things Open
 
machines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdfmachines-for-woodworking-shops-en-compressed.pdf
machines-for-woodworking-shops-en-compressed.pdf
AmirStern2
 
Q1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor PresentationQ1 2025 Dropbox Earnings and Investor Presentation
Q1 2025 Dropbox Earnings and Investor Presentation
Dropbox
 
Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...Transcript: Canadian book publishing: Insights from the latest salary survey ...
Transcript: Canadian book publishing: Insights from the latest salary survey ...
BookNet Canada
 
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptxWebinar - Top 5 Backup Mistakes MSPs and Businesses Make   .pptx
Webinar - Top 5 Backup Mistakes MSPs and Businesses Make .pptx
MSP360
 
Financial Services Technology Summit 2025
Financial Services Technology Summit 2025Financial Services Technology Summit 2025
Financial Services Technology Summit 2025
Ray Bugg
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...Canadian book publishing: Insights from the latest salary survey - Tech Forum...
Canadian book publishing: Insights from the latest salary survey - Tech Forum...
BookNet Canada
 

Introduction to R for Data Mining

  • 1. Revolution Confidential. Introduction to R for Data Mining. 2012 Spring Webinar Series. Joseph B. Rickert, Revolution Analytics. June 5, 2012
  • 2. Goals for Today’s Webinar. To convince you that: (1) R is a serious platform for data mining; (2) seriously, it is not difficult to learn enough R to do some serious data mining; (3) Revolution R Enterprise is the platform for serious data mining.
  • 3. Data Mining
      Applications: Credit Scoring, Fraud Detection, Ad Optimization, Targeted Marketing, Gene Detection, Recommendation systems, Social Networks
      Actions: Acquire Data, Prepare, Classify, Predict, Visualize, Optimize, Interpret
      Algorithms: CART, Random Forests, SVM, KMeans, Hierarchical clustering, Ensemble Techniques
  • 4. Recent KDnuggets poll suggests so are a lot of other serious data miners. Poll question: What Analytics, Data mining, Big Data software you used in the past 12 months for a real project (not just evaluation) [798 voters]
      Software                        % users in 2012   % users in 2011
      R (245)                         30.7%             23.3%
      Excel (238)                     29.8%             21.8%
      Rapid-I RapidMiner (213)        26.7%             27.7%
      KNIME (174)                     21.8%             12.1%
      Weka / Pentaho (118)            14.8%             11.8%
      StatSoft Statistica (112)       14.0%             8.5%
      SAS (101)                       12.7%             13.6%
      Rapid-I RapidAnalytics (83)     10.4%             Not asked in 2011
      MATLAB (80)                     10.0%             7.2%
      IBM SPSS Statistics (62)        7.8%              7.2%
      IBM SPSS Modeler (54)           6.8%              8.3%
      SAS Enterprise Miner (46)       5.8%              7.1%
  • 5. Learning R. WHAT DOES IT MEAN TO LEARN R?
  • 6. What does it mean to learn French? To get around Paris on the Metro; to read a menu; to carry on a conversation.
  • 7. Learning R: Levels of R Skill, plotted against hours of use from 10 to 10,000 (the Malcolm Gladwell “Outlier” Scale)
      Skill                          Level
      Write production-level code    R developer
      Write an R package             R contributor
      Write functions                R programmer
      Use R functions                R user
      Use a GUI                      R aware
  • 8. Productive from the get-go! THE STRUCTURE OF R FACILITATES LEARNING
  • 9. R is set up to compute functions on data. A function such as lm <- function(x, y) { . . . } produces an object, here lm.model, whose components include lm.model$assign, lm.model$coefficients, lm.model$df.residual, lm.model$effects, lm.model$fitted.values, . . .
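As a minimal sketch of the idea on slide 9, the call below fits a linear model with `lm` and inspects the components of the object it returns. The built-in `cars` data set and the formula are assumptions chosen for illustration; they are not taken from the webinar. The name `lm.model` follows the slide.

```r
# Fit a simple linear regression on the built-in cars data
# (stopping distance as a function of speed). The data set is an
# assumption made for this sketch, not the webinar's data.
lm.model <- lm(dist ~ speed, data = cars)

# The result is a list-like object; its components are reached with $.
print(names(lm.model))
print(lm.model$coefficients)         # intercept and slope
print(lm.model$df.residual)          # residual degrees of freedom
print(head(lm.model$fitted.values))  # first few fitted values
```

Running `names(lm.model)` is usually the quickest way to see which components a fitted model exposes.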
  • 10. A little knowledge goes a long way in R. The trick is knowing which functions to call.
      • R’s functional design facilitates performing small tasks
      • For the most part, the output of a function depends only on the values of its arguments
      • Calling a function multiple times with the same values of its arguments will produce the same result each time
      • Minimal side effects mean it is much easier to understand and predict the behavior of a program
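The points on slide 10 can be illustrated with a short sketch. The helper names `standardize` and `zero_first` are hypothetical, introduced only for this example. The first function shows that repeated calls with the same argument give the same result. The second shows that changing an argument inside a function leaves the caller's data untouched.

```r
# A small function whose output depends only on its argument.
# (standardize is a hypothetical helper written for this sketch.)
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

values <- c(2, 4, 6, 8)

# Calling it twice with the same argument gives the same result.
first  <- standardize(values)
second <- standardize(values)
print(identical(first, second))

# Arguments behave like copies: changing x inside a function
# does not alter the caller's vector.
zero_first <- function(x) {
  x[1] <- 0
  x
}
changed <- zero_first(values)
print(values)    # the original vector is unchanged
print(changed)   # only the returned copy has a zero in front
```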
  • 11. Basic Machine Learning Functions
      Function        Library         Description
      Cluster
      hclust          stats           Hierarchical cluster analysis
      kmeans          stats           Kmeans clustering
      Classifiers
      glm             stats           Logistic Regression
      rpart           rpart           Recursive partitioning and regression trees
      ksvm            kernlab         Support Vector Machine
      Ensemble
      ada             ada             Stochastic boosting
      randomForest    randomForest    Random Forests classification and regression
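A minimal sketch of the three `stats` functions from slide 11 follows. It uses the built-in `iris` and `mtcars` data sets, which are assumptions chosen so the example runs without extra packages; they are not the webinar's data.

```r
# All three functions below ship with R in the stats package.
# The iris and mtcars data sets are built in and used here only
# for illustration; they are not the webinar's data.
set.seed(42)                      # kmeans starts from random centers

measurements <- iris[, 1:4]       # the four numeric columns

# K-means clustering into three groups.
km <- kmeans(measurements, centers = 3)
print(table(km$cluster))

# Hierarchical clustering on Euclidean distances, cut into three groups.
hc <- hclust(dist(measurements))
hc.groups <- cutree(hc, k = 3)
print(table(hc.groups))

# Logistic regression: transmission type (am, coded 0/1) against weight.
logit <- glm(am ~ wt, data = mtcars, family = binomial)
print(coef(logit))
```

The other functions in the table (`rpart`, `ksvm`, `ada`, `randomForest`) also accept a formula and a data frame, but their packages may need to be installed first.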
  • 12. Noteworthy Data Mining Packages
      Package    Comment
      rattle     A very intuitive GUI for data mining that produces useful R code
      caret      Well organized and remarkably complete collection of functions to facilitate model building for regression and classification problems
  • 13. Doing a lot with a little R. TIME TO RUN SOME CODE
  • 14. Scripts to run
      #    Script                                 Some key functions
      0    Setup                                  Load libraries
      1    Explore weather data                   read.csv, plot
      2    Run clustering algorithms              kmeans, hclust
      3    Basic decision tree                    rpart
      4    Boosted Tree                           ada
      5    Random Forest                          randomForest
      6    Support Vector Machine                 randomForest, varImpPlot
      7    Big Data Mortgage Default model        rxLogit, rxKmeans
  • 15. Big Data and R. There are some challenges:
      • All of your data and model code must fit into memory
      • Big data sets as well as big models (lots of variables) can run out of memory
      • Parallel computation might be necessary for models to run in a reasonable time
  • 16. RevoScaleR in Revolution R Enterprise. Can help in a number of ways:
      • Manipulate large data sets, perhaps aggregating data so that it will fit in memory
      • For example, boiling down time-stamped data like a web log to form a time series that will fit in memory
      • Run RevoScaleR functions directly on big data sets
      • Run R functions in parallel
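The web-log example on slide 16 can be sketched in plain base R. This is not RevoScaleR code: it only illustrates the aggregation idea on a small synthetic log, and all names and data below are assumptions made for the example.

```r
# Synthetic stand-in for a web log: 10,000 request times spread over
# one day. Real logs would be read from disk, possibly in chunks.
set.seed(1)
start <- as.POSIXct("2012-06-05 00:00:00", tz = "UTC")
hits  <- start + sort(runif(10000, min = 0, max = 24 * 3600))

# Boil the raw events down to one count per hour.
hour.breaks <- seq(start, by = "hour", length.out = 25)
hourly <- table(cut(hits, breaks = hour.breaks, right = FALSE))

# 24 numbers now summarize 10,000 events and form a time series.
hits.ts <- ts(as.vector(hourly), frequency = 24)
print(hits.ts)
```

The aggregated series is tiny compared with the raw events, which is the point of the slide: summarize first, then model what fits in memory.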
  • 17. Top RevoScaleR Functions for Data Mining (parallel external memory algorithms)
      Task                          RevoScaleR function
      Data processing               rxDataStep
      Descriptive Statistics        rxSummary
      Tables and cubes              rxCube, rxCrossTabs
      Correlations / covariance     rxCovCor, rxCor, rxCov, rxSSCP
      Linear Models                 rxLinMod
      Logistic regressions          rxLogit
      Generalized linear models     rxGlm
      K means clustering            rxKmeans
      Predictions (scoring)         rxPredict
  • 18. More than code, R is a community. WHERE TO GO FROM HERE?
  • 19. Finding your way around the R world
      • Machine Learning
      • Data Mining
      • Visualization
      • Finding Packages: Task Views, crantastic.org
      • Blogs: Revolutions, R-Bloggers, Quick-R
      • Getting Help: StackOverflow, @RLangTip, Inside-R, www.rseek.org
      • Finding R People: User Groups worldwide, #rstats
      [Figure: Word Cloud for @inside_R]
  • 20. Look at some more sophisticated examples
      • Thomson Nguyen on the Heritage Health Prize
      • Shannon Terry & Ben Ogorek (Nationwide Insurance): A Direct Marketing In-Flight Forecasting System
      • Jeffrey Breen: Mining Twitter for Airline Consumer Sentiment
      • Joe Rothermich: Alternative Data Sources for Measuring Market Sentiment and Events (Using R)
  • 21. Revolution Analytics Training: http://www.revolutionanalytics.com/products/training/
  • 22. References