
Introduction to Random Forests for Gene Expression Data

Utah State University – Spring 2012


STAT 5570: Statistical Bioinformatics
Notes 3.4

1
References

• Breiman (2001). Random Forests. Machine Learning 45(1): 5-32.
• Diaz-Uriarte and Alvarez de Andres (2006). BMC Bioinformatics 7:3.
• Cutler, Cutler, and Stevens (2009). In Li and Xu, editors, High-Dimensional Data Analysis in Cancer Research, pages 83-101.
2
Gene Profiling / Selection
• “Observe” gene expression in different conditions – e.g., healthy vs. diseased

• Use simultaneous expression “profiles” of thousands of genes (what are the genes doing across arrays)

• Look at which genes are “important” in “separating” the two conditions; i.e., what determines the conditions’ “signatures”

3
Machine Learning
• Computational & statistical inference processes:
  observed data → reusable algorithms for prediction
• Why “machine”?
  want minimal human involvement
• Why “learning”?
  develop the ability to predict
• Here, supervised learning:
  use knowledge of condition type

4
Machine Learning Methods
• Neural Networks
• SVM (Support Vector Machines)
• RPART (Recursive PArtitioning and Regression Trees)
• CART (Classification and Regression Trees)
• Ensemble learning (average of many trees)
  – Boosting (Schapire et al., 1998)
  – Bagging (Breiman, 1996)
  – Random Forests (Breiman, 2001; Cutler & Stevens 2006; Cutler, Cutler, & Stevens 2008)

5
CART: Classification and Regression Trees
• Each individual (array) has data on many predictors (genes) and one response (disease state)

• Think of a tree, with splits based on levels of specific predictors

  [Tree diagram: root node labeled Diseased (18:21); cases with Gene1 > 8.3 go to a node labeled Healthy (15:10), cases with Gene1 < 8.3 go to a node labeled Diseased (3:11)]

• Choose predictors and split levels to maximize “purity” in new groups; the best split at each node

• Prediction made by passing test cases down the tree (see the rpart sketch below)

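Below is a minimal single-tree sketch (not from the original slides) using the rpart package on simulated data; the objects gene.mat, disease, and the shift added to Gene1 are hypothetical and only illustrate how a classification tree is fit and how cases are passed down it.

# Illustrative CART sketch with rpart (hypothetical simulated data, not the ALL analysis)
library(rpart)
set.seed(42)
gene.mat <- matrix(rnorm(40*10), nrow=40,
                   dimnames=list(NULL, paste0('Gene',1:10)))
disease <- factor(rep(c('Healthy','Diseased'), each=20))
gene.mat[disease=='Diseased',1] <- gene.mat[disease=='Diseased',1] + 2  # make Gene1 informative
tree.dat <- data.frame(disease, gene.mat)
fit <- rpart(disease ~ ., data=tree.dat, method='class')  # splits chosen to maximize node purity
print(fit)                                   # text display of the fitted tree
predict(fit, tree.dat, type='class')[1:5]    # pass cases down the tree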
6
CART generalized: Random Forests
• Rather than using all predictors and all individuals to make a single tree, make a forest of many (ntree) trees, each one based on a random selection of predictors and individuals

• Each tree is fit using a bootstrap sample of the data (drawn with replacement) and ‘grown’ until each node is ‘pure’

• Each node is split using the best among a subset (of size mtry) of predictors randomly chosen at that node (default is the square root of the number of predictors)
  (special case using all predictors: bagging)

• Prediction made by aggregating across the forest (majority vote or average; see the sketch below)

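A short sketch (hypothetical data, not the ALL analysis on slide 13) showing how ntree and mtry map to randomForest() arguments, and bagging as the special case where mtry equals the number of predictors.

# ntree, mtry, and bagging as a special case (hypothetical simulated data)
library(randomForest)
set.seed(42)
x <- matrix(rnorm(40*10), nrow=40, dimnames=list(NULL, paste0('Gene',1:10)))
y <- factor(rep(c('B','T'), each=20))
rf.default <- randomForest(x=x, y=y, ntree=500)                # mtry defaults to floor(sqrt(# predictors))
rf.bag     <- randomForest(x=x, y=y, ntree=500, mtry=ncol(x))  # all predictors at each split: bagging
head(predict(rf.default, x))                                   # majority-vote predictions across the forest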
7
How to measure “goodness”?
• Each tree is fit on a “training” set (a bootstrap sample), or the “bag”

• The left-over cases (“out-of-bag”) can be used as a “test” set for that tree (usually about 1/3 of the original data)

• The “out-of-bag” (OOB) error rate is the % misclassification on these out-of-bag cases (see the sketch below)

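A brief sketch of where the OOB error rate is stored in a fitted randomForest object; it reuses the hypothetical rf.default object from the previous sketch.

# Reading the OOB error rate (hypothetical rf.default from the previous sketch)
rf.default$confusion                           # OOB confusion matrix with per-class error rates
rf.default$err.rate[rf.default$ntree, 'OOB']   # overall OOB error rate after all trees
plot(rf.default)                               # OOB error as the forest grows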
8
What does RF give us?
• Kind of a “black box” – but can look at “variable importance”
• For each tree, look at the OOB data:
  – Permute values of predictor j among all OOB cases
  – Pass OOB data down the tree, save the predictions
  – For case i of OOB and predictor j, get:
    (OOB error rate with variable j permuted) – (OOB error rate before permutation)
• Average across the forest to get overall variable importance for each predictor j (see the sketch below)

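A minimal sketch of the permutation-based measure: randomForest stores the mean decrease in accuracy only when importance=TRUE is requested at fit time (otherwise only the Gini measure is kept). The x and y objects are the hypothetical ones from the earlier sketch.

# Permutation importance requires importance=TRUE (hypothetical x, y from above)
rf.imp <- randomForest(x=x, y=y, ntree=500, importance=TRUE)
head(importance(rf.imp, type=1))   # type=1: mean decrease in accuracy (permutation-based)
head(importance(rf.imp, type=2))   # type=2: mean decrease in Gini
varImpPlot(rf.imp)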
9
Why “variable importance”?

• Recall: identifying conditions’ signatures

• Sort genes by some criterion

• Want the smallest set of genes that achieves good diagnostic ability

10
Recall – ALL Data
• RMA-preprocessed gene expression data
  – Chiaretti et al., Blood (2004) 103(7)
  – 12625 genes (hgu95av2 Affymetrix GeneChip)
  – 128 samples (arrays)
• phenotypic data on all 128 patients (see the sketch below), including:
  – 95 B-cell cancer
  – 33 T-cell cancer

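A quick sketch to verify this class breakdown, assuming the BT column of the ALL phenotype data (which codes B- vs. T-cell stage).

# Check dimensions and B-/T-cell counts (assumes the BT phenotype column)
library(ALL); data(ALL)
dim(exprs(ALL))                      # 12625 genes, 128 arrays
table(substr(pData(ALL)$BT, 1, 1))   # 95 B-cell, 33 T-cell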
11
ALL subset: 4,400 genes, 30 arrays (15 B, 15 T)
Look at top 25 most important genes from Random Forest.
Color scale: purple (low) to orange (high)

12
memory.limit(size=4000)  # Windows-only; raises the memory limit

# load data
library(affy); library(ALL); data(ALL)

# obtain relevant subset of data; similar to slide 11 of Notes 3.2
# (here, filter genes on raw scale, then return to log scale)
# also, keep 30 arrays here JUST for computational convenience (in-class)
library(genefilter); e.mat <- 2^(exprs(ALL)[,c(81:110)])
ffun <- filterfun(pOverA(0.20,100))
t.fil <- genefilter(e.mat,ffun)
small.eset <- log2(e.mat[t.fil,])
dim(small.eset) # 4400 genes, 30 arrays (15 B and 15 T)
group <- c(rep('B',15),rep('T',15)) # classification, in order

# One RF
library(randomForest)
set.seed(1234)
print(date())
rf <- randomForest(x=t(small.eset),y=as.factor(group),ntree=10000)
print(date()) # about 25 seconds

13
# Look at variable importance
imp.temp <- abs(rf$importance[,])
t <- order(imp.temp,decreasing=TRUE)
plot(c(1:nrow(small.eset)),imp.temp[t],log='x',cex.main=1.5,
xlab='gene rank',ylab='variable importance',cex.lab=1.5,
pch=16,main='ALL subset results')

# Get subset of expression values for 25 most 'important' genes
gn.imp <- names(imp.temp)[t]
gn.25 <- gn.imp[1:25] # vector of top 25 genes, in order
t <- is.element(rownames(small.eset),gn.25)
sig.eset <- small.eset[t,]
# matrix of expression values, not necessarily in order

## Make a heatmap, with group differences obvious on plot
library(RColorBrewer)
hmcol <- colorRampPalette(brewer.pal(11,"PuOr"))(256)
colnames(sig.eset) <- group # This will label the heatmap columns
csc <- rep(hmcol[50],30)
csc[group=='T'] <- hmcol[200]
# column side color will be purple for T and orange for B
heatmap(sig.eset,scale="row", col=hmcol,ColSideColors=csc)

14
A Better Variable Importance Plot
varImpPlot(rf, n.var=25, main='ALL Subset Results')

imp.temp <- importance(rf)[,]   # drop to a named vector so names() works below
t <- order(imp.temp,decreasing=TRUE)
gn.imp <- names(imp.temp)[t]

15
Can focus on “variable selection”
• Iteratively fit many forests, each time discarding predictors with low importance from previous iterations

• Use the bootstrap to assess the standard error of error rates

• Choose the forest with the smallest number of genes whose error rate is within u standard errors of the minimum error rate of all forests (u = 0 or 1, typically)

• Reference: Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics (2006) 7:3.

• Online tool: http://genesrf.bioinfo.cnio.es

16
RF Variable Selection on ALL subset

Subset: 30 arrays, 4400 genes
First forest: 10,000 trees; subsequent forests: 2,000 trees
At each iteration, drop 20% of variables (genes) until a 2-variable model is
built; return the best of the series of models considered.

17
# Look at variable selection
library(varSelRF)
set.seed(1234)
print(date())
rfsel <- varSelRF(t(small.eset),as.factor(group),
ntree=10000, ntreeIterat=2000, vars.drop.frac=0.2)
# 50 seconds
print(date())
rf.sig.gn <- rfsel$selected.vars # "38147_at" "38319_at"

# Visualize these two genes
exp.gn.1 <- small.eset[rownames(small.eset)==rf.sig.gn[1],]
exp.gn.2 <- small.eset[rownames(small.eset)==rf.sig.gn[2],]
use.pch <- c(rep(1,15),rep(16,15)) # Define plotting characters
use.col <- c(rep(1,15),rep(2,15)) # Define plotting colors
plot(exp.gn.1,exp.gn.2,col=use.col,main='30 subset arrays',
cex.main=1.5, cex.lab=1.5, xlab=rf.sig.gn[1], ylab=rf.sig.gn[2],
pch=use.pch,cex=2)
legend('bottomright',
c('B-cell','T-cell'),pch=c(1,16),col=c(1,2),cex=1.5)

# Note that set.seed(123) leads to two other genes:
# "2059_s_at" "38319_at"
# Also: rfsel$firstForest is the same as the slide 13 rf object

18
# Did this overfit these 30 arrays?
# Look at the other 98 (first 80 are B-cell, last 18 are T-cell)
eset.2 <- exprs(ALL)[,c(1:80,111:128)]
group.2 <- c(rep(0,80),rep(1,18))
exp.gn.1 <- eset.2[rownames(eset.2)==rf.sig.gn[1],]
exp.gn.2 <- eset.2[rownames(eset.2)==rf.sig.gn[2],]
use.pch.2 <- c(rep(1,80),rep(16,18))
use.col.2 <- c(rep(1,80),rep(2,18))
plot(exp.gn.1,exp.gn.2,col=use.col.2, main='non-subset arrays',
cex.main=1.5,cex=2,cex.lab=1.5, xlab=rf.sig.gn[1],
ylab=rf.sig.gn[2], pch=use.pch.2)
legend('bottomright',
c('B-cell','T-cell'),pch=c(1,16),col=c(1,2),cex=1.5)

19
RF Variable Selection on ALL (full)

Each condition has a profile / signature (across these genes)

full ALL data: 12,625 genes, 128 arrays
20


# RF variable selection with full data set

# set seed and define initial objects
set.seed(123)
eset <- exprs(ALL) # 12625 genes, 128 arrays
cell <- c(rep(0,95),rep(1,33))
# first 95 are B-cell; last 33 are T-cell

print(date())
rf.big <- varSelRF(t(eset),as.factor(cell),
ntree=10000, ntreeIterat=2000, vars.drop.frac=0.2)
print(date()) # about 14 minutes

rf.gn <- rf.big$selected.vars
# "33039_at" "35016_at" "38319_at"

21
# make scatterplot matrix, with points colored by cell type
rf.eset <- t(eset[is.element(rownames(eset),rf.gn),])
use.pch <- c(rep(1,95),rep(16,33))
use.col <- cell+1
pairs(rf.eset,col=use.col,pch=use.pch,cex=1.5)

# Now - make a profile plot
t <- is.element(rownames(eset),rf.gn)
rf.eset <- eset[t,]
library(MASS)
parcoord(t(rf.eset), col=cell+1, lty=cell+1, lwd=3, var.label=TRUE)
legend(1.2,.15,c('B-cell','T-cell'),lty=c(1,2),
lwd=3,col=c(1,2),bty='n')

22
Summary: RF for gene expression data
• Works well even with: many more variables than observations, many-valued categorical predictors, extensive missing values, badly unbalanced data
• Low error (comparable with boosting and SVM)
• Robustness (even with large “noise” in genes)
• Does not overfit
• Fast, and invariant to monotone transformations of the predictors
• Free! (Fortran code by Breiman and Cutler)
• Returns the importance of predictors (genes)
• Little need to tune parameters
• But: no statistical inference

23
