What We Learned Building an
R-Python Hybrid Analytics Pipeline
Niels Bantilan, Pegged Software
NY R Conference April 8th 2016
Help healthcare organizations recruit better
Pegged Software’s Mission:
Core Activities
● Build, evaluate, refine, and deploy predictive models
● Work with Engineering to ingest, validate, and store data
● Work with Product Management to develop data-driven feature sets
How might we build a predictive analytics pipeline that is
reproducible, maintainable, and statistically rigorous?
Anchor Yourself to Problem Statements / Use Cases
1. Define the problem statement
2. Scope out the solution space and trade-offs
3. Make a decision, justify it, and document it
4. Implement chosen solution
5. Evaluate working solution against problem statement
6. Rinse and repeat
Problem-solving Heuristic
R-Python Pipeline
Read Data → Preprocess → Build Model → Evaluate → Deploy
Data Science Stack
Need                      | R        | Python
Maintainable codebase     | Git      | Git
Sync package dependencies | Packrat  | Pip, Pyenv
Call R from Python        | -        | subprocess
Automated Testing         | Testthat | Nose
Reproducible pipeline     | Makefile | Makefile
Need: Maintainable codebase → Git (R and Python)
Git
Why? Because Version Control
● Code quality
● Incremental Knowledge Transfer
● Sanity check
Need: Sync package dependencies → Packrat (R); Pip + Pyenv (Python)
Dependency Management
Why Pip + Pyenv?
1. Easily sync Python package dependencies
2. Easily manage multiple Python versions
3. Create and manage virtual environments
Dependency Management
Why Packrat? (from RStudio)
1. Isolated: separates the system environment from the repo environment
2. Portable: easily sync dependencies across the data science team
3. Reproducible: easily add, remove, upgrade, or downgrade packages as needed
Packrat Internals
datascience_repo
├─ project_folder_a
├─ project_folder_b
├─ datascience_repo.Rproj
...
├─ .Rprofile # points R to packrat
└─ packrat
├─ init.R # initialize script
├─ packrat.lock # package deps
├─ packrat.opts # options config
├─ lib # repo private library
└─ src # repo source files
Understanding packrat
PackratFormat: 1.4
PackratVersion: 0.4.6.1
RVersion: 3.2.3
Repos: CRAN=https://cran.rstudio.com/
...
Package: ggplot2
Source: CRAN
Version: 2.0.0
Hash: 5befb1e7a9c7d0692d6c35fa02a29dbf
Requires: MASS, digest, gtable, plyr,
reshape2, scales
datascience_repo
├─ project_folder_a
├─ project_folder_b
├─ datascience_repo.Rproj
...
├─ .Rprofile
└─ packrat
├─ init.R
├─ packrat.lock # package deps
├─ packrat.opts
├─ lib
└─ src
packrat.lock: package version and deps
Packrat Internals
auto.snapshot: TRUE
use.cache: FALSE
print.banner.on.startup: auto
vcs.ignore.lib: TRUE
vcs.ignore.src: TRUE
load.external.packages.on.startup: TRUE
quiet.package.installation: TRUE
snapshot.recommended.packages: FALSE
packrat.opts: project-specific configuration (these can also be set from R, as shown below)
Packrat Internals
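These options live in packrat/packrat.opts, but they can also be inspected and changed from an R session. A small sketch using packrat's get_opts() / set_opts() helpers (the values shown are illustrative):

> packrat::get_opts("auto.snapshot")        # read a single option
> packrat::set_opts(auto.snapshot = FALSE)  # change it and persist the change to packrat.opts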
datascience_repo
├─ project_folder_a
├─ project_folder_b
├─ datascience_repo.Rproj
...
├─ .Rprofile
└─ packrat
├─ init.R
├─ packrat.lock
├─ packrat.opts # options config
├─ lib
└─ src
● Initialize packrat with packrat::init()
● Toggle packrat in an R session with packrat::on() / packrat::off()
● Save the current state of the project with packrat::snapshot()
● Reconstitute your project with packrat::restore()
● Remove unused libraries with packrat::clean() (see the example session below)
Packrat Workflow
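A minimal example session tying these functions together (ggplot2 is only an illustrative package; the sketch assumes you are working inside the project directory):

> packrat::init()              # set up packrat/ and a private library for the repo
> install.packages("ggplot2")  # installs into the private library
> packrat::snapshot()          # record the new dependency in packrat.lock
> packrat::restore()           # on another machine: rebuild the library from packrat.lock
> packrat::clean()             # remove libraries the project no longer uses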
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
Problem: Unable to find source packages when restoring
This happens when the version recorded in packrat.lock is no longer available as a source package on the R package repository (e.g. CRAN), typically because a newer release has replaced it.
Packrat Issues
> packrat::restore()
Installing knitr (1.11) ...
FAILED
Error in getSourceForPkgRecord(pkgRecord, srcDir(project),
availablePkgs, :
Couldn't find source for version 1.11 of knitr (1.10.5 is
current)
Packrat Issues
Solution 1: Use R’s Installation Procedure
> install.packages(<package_name>)
> packrat::snapshot()
Solution 2: Manually Download Source File
$ wget -P repo/packrat/src <package_source_url>
> packrat::restore()
Need: Call R from Python → subprocess (Python)
Call R from Python: Data Pipeline
Read Data → Preprocess → Build Model → Evaluate → Deploy
Call R from Python: Example

# model_builder.R
cmdargs <- commandArgs(trailingOnly = TRUE)
data_filepath <- cmdargs[1]
model_type <- cmdargs[2]
formula <- cmdargs[3]

build.model <- function(data_filepath, model_type, formula) {
  # read.data() and train.model() are the project's own helper functions
  df <- read.data(data_filepath)
  model <- train.model(df, model_type, formula)
  model
}

# build the model from the command-line arguments
model <- build.model(data_filepath, model_type, formula)
# model_pipeline.py
import subprocess

# data_filepath, model_type, and formula are defined elsewhere in the pipeline
subprocess.call(['path/to/R/executable',  # e.g. the Rscript binary
                 'path/to/model_builder.R',
                 data_filepath, model_type, formula])
Why subprocess?
1. Python for control flow, data manipulation, IO handling
2. R for model build and evaluation computations
3. A main R script (model_builder.R) serves as the entry point into the R layer (see the handoff sketch below)
4. No need for tight Python-R integration
Call R from Python
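The slides don't show how the trained model gets back to the Python layer. One possible convention, sketched here purely as an assumption (the fourth command-line argument and the use of saveRDS() are not from the original pipeline), is to have the R script serialize its result to a file path supplied by Python:

# hypothetical ending of model_builder.R
output_filepath <- cmdargs[4]            # assumed extra argument supplied by Python
saveRDS(model, file = output_filepath)   # Python then passes this path to the next pipeline stage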
Need: Automated Testing → Testthat (R); Nose (Python)
Tolerance to Change
Are we confident that a modification to the codebase will not silently introduce new bugs? (A minimal test example follows below.)
Automated Testing
Working Effectively with Legacy Code - Michael Feathers
1. Identify change points
2. Break dependencies
3. Write tests
4. Make changes
5. Refactor
Automated Testing
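On the R side, a minimal sketch of such a test with testthat; prepare.features() here is a hypothetical preprocessing helper, not a function from the slides:

library(testthat)

# hypothetical preprocessing helper under test
prepare.features <- function(df) {
  df$age_scaled <- as.numeric(scale(df$age))
  df
}

test_that("prepare.features adds a scaled age column", {
  df <- data.frame(age = c(25, 40, 60))
  result <- prepare.features(df)
  expect_true("age_scaled" %in% names(result))
  expect_equal(nrow(result), nrow(df))
})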
Need: Reproducible pipeline → Makefile (R and Python)
Make is a language-agnostic build utility for *nix systems
● Enables a reproducible workflow
● Serves as lightweight documentation for the repo
Build Management: Make

# makefile (note: recipe lines must start with a tab)
build-model:
	python model_pipeline.py \
		-i 'model_input' \
		-m_type 'glm' \
		-formula 'y ~ x1 + x2'

# command-line
$ make build-model

vs. typing the full command by hand each time:

$ python model_pipeline.py \
    -i input_fp \
    -m_type 'glm' \
    -formula 'y ~ x1 + x2'
By adopting the above practices, we:
1. Maintain the codebase more easily
2. Reduce cognitive load and context switching
3. Improve code quality and correctness
4. Facilitate knowledge transfer among team members
5. Encourage reproducible workflows
Big Wins
Necessary Time Investment
1. The learning curve
2. Breaking old habits
3. Creating fixes for issues that come with the chosen solutions
Costs
How might we build a predictive analytics pipeline that is
reproducible, maintainable, and statistically rigorous?
Questions?
niels@peggedsoftware.com
@cosmicbboy