Protein Significance Analysis of Mass Spectrometry-Based Proteomics Experiments With R and MSstats (V3.7.3)
Contents
1. Statistical relative protein quantification: SRM, DDA and DIA experiments
1.1 Applicability
1.2 Statistical functionalities
1.3 Interoperability with existing computational tools
1.4 Availability
1.5 Overview of the functionalities
1.6 Troubleshooting
Reference
1.1 Applicability
MSstats version 3.0 and above is applicable to multiple types of sample preparation, including label-free
workflows, workflows that use stable isotope labeled reference proteins and peptides, and workflows that use
fractionation. It is applicable to targeted Selected Reaction Monitoring (SRM), Data-Dependent Acquisition
(DDA or shotgun), and Data-Independent Acquisition (DIA or SWATH-MS). It is applicable to experiments
that make arbitrarily complex comparisons of experimental conditions or times.
MSstats is currently not applicable to experiments that compare multiple metabolically labeled endogenous
samples within the same run. It is not applicable to experiments with iTRAQ labeling. These experiments will
be supported in the future.
1.2 Statistical functionalities
MSstats version 3.0 and above performs three analysis steps. The first step, data processing, visualization,
and run-level summarization, transforms, normalizes and summarizes the intensities of the peaks per MS run
and per protein, and generates workflow-specific and customizable numeric summaries for data visualization
and quality control.
The second step, statistical modeling and inference, automatically detects the experimental design (e.g. group
comparison, paired design or time course, presence of labeled reference peptides or proteins) from the data.
It then reflects the experimental design and the type of spectral acquisition strategy, and fits an appropriate
linear mixed model by means of lm and lmer functionalities in R. The model is used to detect differentially
abundant proteins or peptides, or to summarize the protein or peptide abundance in a single biological
replicate or condition (that can be used, e.g. as input to clustering or classification).
The third step, statistical experimental design, views the dataset being analyzed as a pilot study of a future
experiment, utilizes the variance components of the current dataset, and calculates the minimal number of
replicates necessary in the future experiment to achieve a pre-specified statistical power.
1.3 Interoperability with existing computational tools
MSstats takes as input data in a tabular .csv format, which can be generated by any spectral processing tool
such as Skyline (MacLean et al. 2010), MaxQuant (Jürgen Cox and Mann 2008), Progenesis QI (Nonlinear
Dynamics/Waters), Proteome Discoverer (Thermo Scientific), MultiQuant (Applied Biosystems), OpenMS (Sturm
et al. 2008), SuperHirn (Mueller et al. 2007), OpenSWATH (Röst et al. 2014) or Spectronaut (Biognosys).
Functions that convert the output of several of these processing tools into the required format are available
as of MSstats v3.6. Details are in the section below.
For statistics experts, MSstats 3.0 and above satisfies the interoperability requirements of Bioconductor,
and takes as input data in the MSnSet format (Gatto and Lilley 2012). The command line-based workflow
is partitioned into a series of independent steps, that facilitate the development and testing of alternative
statistical approaches. It complies with the maintenance and documentation requirements of Bioconductor.
Finally, MSstats 3.0 and above is available as an external tool within Skyline. The external tool support
within Skyline manages MSstats installation, point-and-click execution, parameter collection in Windows
forms and output display. Skyline manages the annotations of the experimental design, and the processing of
raw data. It outputs a custom report, that is fed as a single stream input into MSstats. This design buffers
proteomics users from the details of the R implementation, while enabling rigorous statistical modeling.
1.4 Availability
MSstats is available under the Artistic-2.0 license at msstats.org. MSstats as an external tool for Skyline is
available at http://proteome.gs.washington.edu/software/Skyline/tools.html. MSstats is now also available
in Bioconductor. The most recent version of the package is available at msstats.org or MSstats GitHub.
We suggest using that version if possible. The version of the main package is updated several times a year, to
synchronise with the Bioconductor release.
1.5 Overview of the functionalities
Tool-specific pre-processing (formatting):
SkylinetoMSstatsFormat        Pre-process MSstats report from Skyline
MaxQtoMSstatsFormat           Convert the outputs from MaxQuant into MSstats required format
ProgenesistoMSstatsFormat     Convert the outputs from Progenesis into MSstats required format
PDtoMSstatsFormat             Convert the outputs from Proteome Discoverer into MSstats required format
SpectronauttoMSstatsFormat    Convert the outputs from Spectronaut into MSstats required format
Visualization (dataProcessPlots):
• Profile plot
• Quality control plot
• Condition plot
1.6 Troubleshooting
To help troubleshoot potential problems with installation or functionalities of MSstats, a progress report is
generated in a separate log file msstats.log. The file includes information on the R session (R version, loaded
software libraries), options selected by the user, checks of successful completion of intermediate analysis
steps, and warning messages. If the analysis produces an error, the file contains suggestions for possible
reasons for the errors. If a file with this name already exists in the working directory, a suffix with a number
will be appended to the file name. In this way a record of all the analyses is kept. Please see the file
KnownIssues-Skyline-MSstatsV3.6.pdf in the “Installation” section of the “MSstats” page at msstats.org
for a list of known issues and possible solutions for installation problems of the MSstats external tool in Skyline.
MSstats performs statistical analysis steps, that follow peak identification and quantitation. Therefore, input
to MSstats is the output of other software tools (such as Skyline or MultiQuant) that read raw spectral files
and identify and quantify spectral peaks. The preferred structure of data for use in MSstats is a .csv file
in a “long” format with 10 columns representing the following variables: ProteinName, PeptideSequence,
PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run,
Intensity. The variable names are fixed, but are case-insensitive.
(a) ProteinName: This column needs information about Protein id. Statistical analysis will be done
separately for each unique label in this column. For peptide-level modeling and analysis, use peptide id
in this column.
(b)-(e) PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge: The combination of these
4 columns defines a feature of a protein (in SRM experiments, it is a transition that is identified and
quantified across runs). If the information for one or several of these columns is not available, please do
not discard these columns but use a single fixed value across the entire dataset. For example, if the original
raw data does not contain the information of ProductCharge, assign the value 0 to the entries in the
column ProductCharge for the entire dataset. If the peptide sequences should be distinguished based on
post-translational modifications, this column can be renamed to PeptideModifiedSequence. For example,
this allows us to use the PeptideModifiedSequence column from the Skyline report.
(f) IsotopeLabelType: This column indicates whether this measurement is based on the endogenous
peptides (use “L”) or labeled reference peptides (use “H”).
(g) Condition: For group comparison experiments, this column indicates groups of interest (such as
“Disease” or “Control”). For time-course experiments, this column indicates time points (such as “T1”,
“T2”, etc). If the experimental design contains both distinct groups of subjects and multiple time
points per subject, this column should indicate a combination of these values (such as “Disease_T1”,
“Disease_T2”, “Control_T1”, “Control_T2”, etc.).
(h) BioReplicate: This column should contain a unique identifier for each biological replicate in the
experiment. For example, in a clinical proteomic investigation this should be a unique patient id.
Patients from distinct groups should have distinct ids. MSstats does not require the presence of
technical replicates in the experiment. If technical replicates are present, all samples or runs from the
same biological replicate should have the same id. MSstats automatically detects the presence of technical
replicates and accounts for them in the model-based analysis.
(i) Run: This column contains the identifier of a mass spectrometry run. Each mass spectrometry run
should have a unique identifier, regardless of the origin of the biological sample. In SRM experiments,
if all the transitions of a biological or a technical replicate are split into multiple “methods” due to
the technical limitations, each method should have a separate identifier. When processed by Skyline,
distinct values of runs correspond to distinct input file names. It is possible to use the actual input file
names as values in the column Run.
(j) Intensity: This column should contain the quantified signal of a feature in a run without any
transformation (in particular, no logarithm transform). The signals can be quantified as the peak height
or the peak area under the curve. Any other quantitative representation of abundance can also be used.
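The column requirements (a) to (j) can be checked before the data are passed to MSstats. The following is a minimal base-R sketch and is not part of MSstats: the helper name check_msstats_columns and all values in the toy table are hypothetical, and the only rule it encodes is that the ten variable names are fixed but case-insensitive.

```r
# Required columns of the 10-column "long" format, as listed in the text.
required_cols <- c("ProteinName", "PeptideSequence", "PrecursorCharge",
                   "FragmentIon", "ProductCharge", "IsotopeLabelType",
                   "Condition", "BioReplicate", "Run", "Intensity")

# Hypothetical helper (not an MSstats function): returns the required
# columns that are missing from `df`, ignoring upper/lower case.
check_msstats_columns <- function(df) {
  required_cols[!tolower(required_cols) %in% tolower(names(df))]
}

# Toy input: one transition measured for the endogenous ("L") and the
# labeled reference ("H") peptide in a single run. All values are made up.
toy <- data.frame(
  ProteinName      = "ProteinA",
  PeptideSequence  = "PEPTIDEK",
  PrecursorCharge  = 2,
  FragmentIon      = "y5",
  ProductCharge    = 1,
  IsotopeLabelType = c("L", "H"),
  Condition        = "Disease",
  BioReplicate     = "Subject1",
  Run              = 1,
  Intensity        = c(15234.2, 98211.7),  # raw scale, no log transform
  stringsAsFactors = FALSE
)

missing_cols <- check_msstats_columns(toy)
print(missing_cols)  # empty when every required column is present
```

A non-empty result names the columns that still have to be added, for example with a single fixed value as described for items (b) to (e).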
An example of an acceptable input dataset is shown below. This example dataset is from an SRM experiment
with stable isotope labeled reference peptides. The dataset is stored in a .csv file in a “long” format. Each
row corresponds to a single intensity. More details on assigning the values of Condition, BioReplicate
and Run, depending on the structure of the experimental design, are given below.
The values of Condition, BioReplicate, Run depend on the design of the specific experiment.
1) Group comparison
In a group comparison design, the conditions (e.g., disease states) are profiled across non-overlapping sets
of biological replicates (i.e., subjects). In this example there are 2 conditions, Disease and Control (in
general the number of conditions can vary). There are 3 subjects (i.e., biological replicates) per condition (in
general an equal number of replicates per condition is not required). Each subject has 2 technical replicate
runs (in general technical replicates are not required, and their number per sample may vary). Overall, in
this example there are 2 × 3 × 2 = 12 mass spectrometry runs.
The table below shows the values of the columns Condition, BioReplicate and Run for this situation. It is
important to note two things. First, the order of subjects and conditions in the experiment should be
randomized, and run id does not need to represent the order of spectral acquisition. Second, the values of the
columns are repeated for every quantified transition. For example, if in each run the experiment quantifies
50 endogenous transitions and 50 labeled reference counterparts, then the input file has 12 × 50 × 2 = 1200
lines. When a feature intensity is missing in a run, the data structure should contain a separate row for each
missing value. The rows should include all the information (from ProteinName to Run), and indicate missing
intensities with NA.
Condition   BioReplicate   Run
Disease     Subject1       1
Disease     Subject1       2
Disease     Subject2       3
Disease     Subject2       4
Disease     Subject3       5
Disease     Subject3       6
Control     Subject4       7
Control     Subject4       8
Control     Subject5       9
Control     Subject5       10
Control     Subject6       11
Control     Subject6       12
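The annotation columns for this design can also be generated in R instead of being typed by hand. The sketch below is illustrative only: it assumes the subject labels Subject1 to Subject6 and the run ids 1 to 12 of this example, and it reproduces the row count given in the text.

```r
# Group comparison: 2 conditions x 3 subjects per condition x 2 technical
# replicate runs per subject = 12 MS runs. Labels are illustrative.
design <- data.frame(
  Condition    = rep(c("Disease", "Control"), each = 6),
  BioReplicate = rep(paste0("Subject", 1:6), each = 2),
  Run          = 1:12,
  stringsAsFactors = FALSE
)
print(design)

# In a group comparison every subject belongs to exactly one condition.
conditions_per_subject <- tapply(design$Condition, design$BioReplicate,
                                 function(x) length(unique(x)))
stopifnot(all(conditions_per_subject == 1))

# Number of rows in the input file when each run quantifies 50 endogenous
# transitions and their 50 labeled reference counterparts.
n_rows <- nrow(design) * 50 * 2
print(n_rows)  # 1200
```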
2) Time course
The important feature of a time course experimental design is that the same subject (i.e., biological
replicate) is repeatedly measured across multiple time points. In this example there are 2 time
points, Time1 and Time2 (in general the number of times can vary). There are 4 subjects (i.e., biological
replicates) measured across times (in general an equal number of times per replicate is not required). There
are no technical replicates (in general the number of technical replicates per sample may vary). Overall, in
this example there are 2 × 4 × 1 = 8 mass spectrometry runs.
The table below shows the values of the columns Condition, BioReplicate and Run for this situation. Comments
on the order of the runs, on the number of lines in the input data structure, and on the handling of missing
peak intensities are as in the group comparison design.
3) Paired design
Another frequently used experimental design is a paired design, where measurements from multiple conditions
(such as healthy biopsy and disease biopsy) are taken from the same subject. The statistical model for this
experimental design is the same as in the time course experiment; however, the values in the columns of the
input data may have a different appearance. In this example there are 2 subjects, PatientA and PatientB
(in general the number of patients can vary). There are two conditions per subject, BiopsyHealthy and
BiopsyTumor (in general the number of conditions per subject can exceed two). In this example there are 3
technical replicates of each type (in this example, the technical replicates are biopsies; in general these can
also be replicate sample preparations or replicate mass spectrometry runs). Overall, in this example there are
2 × 2 × 3 = 12 mass spectrometry runs.
The table below shows the values of the columns Condition, BioReplicate and Run for this situation. Comments
on the order of the runs, on the number of lines in the input data structure, and on the handling of missing
peak intensities are as in the group comparison design.
Condition       BioReplicate   Run
BiopsyHealthy   PatientA       1
BiopsyHealthy   PatientA       2
BiopsyHealthy   PatientA       3
BiopsyTumor     PatientA       4
BiopsyTumor     PatientA       5
BiopsyTumor     PatientA       6
BiopsyHealthy   PatientB       7
BiopsyHealthy   PatientB       8
BiopsyHealthy   PatientB       9
BiopsyTumor     PatientB       10
BiopsyTumor     PatientB       11
BiopsyTumor     PatientB       12
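What distinguishes a paired or time course design from a group comparison is visible in the annotation itself: the same BioReplicate appears under more than one Condition. The sketch below illustrates that check in base R. The helper has_repeated_measures is hypothetical and is not the detection logic used inside MSstats.

```r
# Paired design from the example: 2 patients x 2 conditions x 3 technical
# replicates = 12 MS runs.
paired <- data.frame(
  Condition    = rep(rep(c("BiopsyHealthy", "BiopsyTumor"), each = 3),
                     times = 2),
  BioReplicate = rep(c("PatientA", "PatientB"), each = 6),
  Run          = 1:12,
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when at least one biological replicate is
# measured under more than one condition (paired or time course design).
has_repeated_measures <- function(design) {
  n_cond <- tapply(design$Condition, design$BioReplicate,
                   function(x) length(unique(x)))
  any(n_cond > 1)
}

# Runs per subject and condition; values above 1 indicate technical
# replicates.
runs_per_cell <- table(paired$BioReplicate, paired$Condition)
print(runs_per_cell)
print(has_repeated_measures(paired))
```

For the group comparison example, where each subject occurs under a single condition, the same helper would return FALSE.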
MSstats also allows data to be in the format of MSnSet, which is consistent with the requirements of
Bioconductor. The MSnSet format has several components, of which the most commonly accessed are
assayData, phenoData, and featureData. assayData is a matrix of intensities, where each row corresponds
to a transition, and the columns correspond to sample ids. phenoData contains columns that describe the
biological samples and the conditions in the experiment. featureData contains columns describing the peptide
features, such as the name or id of the underlying protein and information on the features.
If the data are stored in the format expressionSet, information for group labels is required. If more than one
variable is listed in the argument group, then a concatenated variable is created based on all of the specified
group variables. The remaining information (peptide feature ids, biological replicate ids, and abundance) can
be extracted from the rows and columns of featureData and phenoData, or assigned by the users based on
the experimental design.
For label-free DDA experiments the required input is the 10-column format, the same as described in section 2.1
for SRM experiments. In DDA experiments spectral features are defined as peptide ions, which are identified
and quantified across runs. Since for label-free DDA experiments some of the columns PeptideSequence,
PrecursorCharge, FragmentIon, and ProductCharge are not relevant, these columns will have a constant
fixed value (such as NA) across the entire dataset. Furthermore, the column IsotopeLabelType will be set to
“L” for the entire dataset.
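As an illustration of the rule above, the sketch below completes a peptide-ion table to the 10-column format. The starting table and its values are made up, and the choice of NA as the constant value follows the example given in the text.

```r
# Toy peptide-ion table, such as a DDA processing tool might export.
dda <- data.frame(
  ProteinName     = c("ProteinA", "ProteinA"),
  PeptideSequence = c("PEPTIDEK", "SAMPLER"),
  PrecursorCharge = c(2, 3),
  Condition       = "C1",
  BioReplicate    = 1,
  Run             = 1,
  Intensity       = c(2.1e6, NA),  # NA marks a missing intensity
  stringsAsFactors = FALSE
)

# Columns that are not relevant for label-free DDA get one constant value
# across the entire dataset; all measurements are endogenous ("L").
dda$FragmentIon      <- NA
dda$ProductCharge    <- NA
dda$IsotopeLabelType <- "L"

# Reorder to the column order used in the description of the format.
dda <- dda[, c("ProteinName", "PeptideSequence", "PrecursorCharge",
               "FragmentIon", "ProductCharge", "IsotopeLabelType",
               "Condition", "BioReplicate", "Run", "Intensity")]
str(dda)
```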
2.2 Label-free DIA
For label-free DIA experiments, the required input is the 10-column format, the same as described in
section 2.1 for SRM experiments. The values of the required columns can be extracted from the output
of signal processing software such as Skyline or OpenSWATH. By default, the combination of the values
in the columns PeptideSequence, PrecursorCharge, FragmentIon, ProductCharge uniquely identifies
each spectral feature (i.e., a fragment ion identified and quantified across multiple runs). If the signal
processing software does not provide the information on some of these columns but provides a unique feature
identifier, it is possible to use this unique identifier instead of one of these columns. Furthermore, the column
IsotopeLabelType is set to “L” for the entire dataset.
An example dataset is shown below. In this example, feature id generated by OpenSWATH is used instead of
ProductCharge to uniquely characterize each feature.
Alternatively, you can download the package from the MSstats installation page and install it as follows:
install.packages(pkgs = 'MSstats_3.7.3.tar.gz', repos = NULL, type = 'source')
Once you have the package installed, load MSstats into an R session and verify that you have the correct
version (3.7.3). Note that in order to use MSstats, the package needs to be loaded every time you restart R.
library('MSstats', warn.conflicts = FALSE, quietly = TRUE, verbose = FALSE)
?MSstats
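One way to perform the version check is with packageVersion from the utils package. The helper has_min_version below is a hypothetical convenience wrapper, not an MSstats function. It returns FALSE rather than raising an error when the package is not installed.

```r
# Hypothetical helper: TRUE when `pkg` is installed at version
# `min_version` or later, FALSE otherwise (including when it is missing).
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# Check the MSstats installation against the version used in this document.
print(has_min_version("MSstats", "3.7.3"))
```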
Finally, set the working directory to where you saved files. Note that you may have a different path on your
computer from the example.
setwd('/Users/meenachoi/Dropbox/MSstats_GitHub_document/MSstats_v3.7.3')
## [1] "/Users/meenachoi/Dropbox/MSstats_GitHub_document/MSstats_v3.7.3"
4. DDA analysis with MSstats
This section describes a typical workflow for DDA analysis with MSstats. Controlled mixture DDA data will
be used for demonstration. This dataset is available in MSstats as the example dataset DDARawData. The
csv file for the same dataset, RawData.DDA.csv, is also available in the MSstats material GitHub in the folder named
‘example dataset/DDA_controlledMixture2009’. It was processed by SuperHirn.
The first step in using MSstats is to format the data as described in Section 2. DDARawData is already
formatted for MSstats input.
# Check the first 6 rows in DDARawData
head(DDARawData)
NOTE At the logarithm transformation step, a zero value in Intensity is problematic. When Intensity=0,
the logarithm transformed intensity is -Inf. Also, logarithm transformed intensities are negative when
Intensity < 1, which can lead to overestimated log fold changes. Therefore,
logarithm transformed intensities for original intensities between 0 and 1 will be replaced with zero after
normalization.
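The behaviour described in this note can be reproduced with a few toy values. The sketch below only illustrates the arithmetic. It skips the normalization step and is not the code used inside dataProcess.

```r
# Toy raw intensities, including a zero and a value between 0 and 1.
intensity <- c(0, 0.5, 1, 1024)

# log2 of zero is -Inf, and log2 of a value below 1 is negative.
log2_int <- log2(intensity)
print(log2_int)

# Sketch of the replacement described above: transformed values that
# come from raw intensities below 1 are set to zero.
log2_int[!is.na(intensity) & intensity < 1] <- 0
print(log2_int)
```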
all possible scenarios. It is important to understand their underlying assumption to avoid misuse. Below is
the additional explanation for main options.
• logTrans : logarithm transformation of the Intensity column, with base 2 by default.
• normalization :
– ‘equalizeMedians’ : The default option for normalization is equalizeMedians, where all the
intensities in a run are shifted by a constant, to equalize the median of intensities across runs for
label-free experiments. This normalization method is appropriate when we can assume that the
majority of proteins do not change across runs. Be cautious when using the equalizeMedians
option for a label-free DDA dataset with only a small number of proteins. For label-based experiments,
equalizeMedians equalizes the median of reference intensities across runs and is generally appropriate
even for a dataset with a small number of proteins.
– ‘globalStandards’ : If you have a spiked-in standard, you may instead set this option to
globalStandards and define the standard with the nameStandards option.
– ‘quantile’ : For label-free experiments, the distribution of all the intensities in each run becomes
the same across runs. For label-based experiments, the distribution of all the reference intensities
becomes the same across runs and all the endogenous intensities are shifted by a constant
corresponding to the reference intensities.
– FALSE : No normalization is performed. If you applied your own normalization before MSstats, you
should use normalization=FALSE.
– NOTE : If there are multiple fractionations or injections for one sample, normalization is performed
separately for each fractionation or for each m/z range from the multiple injections.
• nameStandards : Only for normalization='globalStandards'. The names of the global standard
peptides or proteins, which are assumed to have the same abundance across MS runs, should be assigned
as a vector to this option.
• featureSubset :
– ‘all’ : Use all features in the dataset.
– ‘top3’ : Use top 3 features which have highest average of log2(intensity) across runs.
– ‘topN’ : Use top N features which have highest average of log2(intensity) across runs. It needs
the input for the n_top_feature option (e.g., n_top_feature=5 for the top 5 features).
• summaryMethod : Method for run-level summarization.
– ‘TMP’ : Default. Tukey’s median polish (medpolish function in stats). Robust parameter
estimation method with median across rows and columns.
– ‘linear’ : Linear model (lm function). Average-based summarization.
• MBimpute : whether model-based imputation will be performed or not. Only for summaryMethod='TMP'.
– TRUE : Default. Censored missing values will be imputed by an Accelerated Failure Time (AFT)
model. Censored missing values are determined by the options censoredInt and
maxQuantileforCensored.
– FALSE : No model-based imputation.
• maxQuantileforCensored : Maximum quantile for deciding censored missing value. Default is
0.999. If you don’t want to apply the threshold of noise intensity in your data, you can use
maxQuantileforCensored=NULL.
• censoredInt : Processing tools report missing values differently. This option specifies
which values should be considered missing, and further whether they are censored or missing at random.
– ‘NA’ : Default. It assumes that all NAs in the Intensity column are censored.
– ‘0’ : It assumes that all values between 0 and 1 in the Intensity column are censored. If there are
NAs in Intensity with this option, the NAs will be considered as missing at random.
– NULL : It assumes that all missing values are missing at random.
• cutoffCensored : cutoff value for the AFT model. It is used only with censoredInt='NA' or censoredInt='0'.
With censoredInt=NULL, it is assumed that there are no censored missing values, and no imputation is
performed.
– ‘minFeature’ : cutoff for AFT model will be the minimum value for each feature across runs.
– ‘minRun’ : cutoff for AFT model will be the minimum value for each run across features.
– ‘minFeatureNRun’ : cutoff for AFT model will be the smallest value between minimum value
of corresponding feature and minimum value of corresponding run.
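The idea behind the equalizeMedians option described in this list can be sketched in a few lines of base R: each run is shifted by a constant so that all runs share the same median. The code is an illustration on made-up log2 intensities, not the MSstats implementation. In particular, using the median of the run medians as the common target is an assumption of this sketch.

```r
# Toy log2 intensities: rows are features, columns are MS runs.
log2_int <- cbind(run1 = c(20, 22, 24),
                  run2 = c(21, 23, 25),
                  run3 = c(19, 21, 26))

# Median of each run, and a common target (assumed here to be the
# median of the run medians).
run_medians <- apply(log2_int, 2, median, na.rm = TRUE)
target      <- median(run_medians)

# Shift every run by a constant so that its median equals the target.
normalized <- sweep(log2_int, 2, run_medians - target)

print(run_medians)                                # before normalization
print(apply(normalized, 2, median, na.rm = TRUE)) # equal after the shift
```

Because only a constant is subtracted per run, differences between features within a run are unchanged. This is why the method relies on the assumption that most proteins do not change across runs.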
A typical label-free DDA dataset may have many missing values and noisy features with outliers. MSstats
supports several ways to deal with this. The default option for summarization is TMP (robust parameter
estimation method with median across rows and columns) after imputation by AFT (accelerated failure
time model, MBimpute=TRUE) based on censored intensity for NA (censoredInt="NA") with a cutoff as the
minimum value for a feature (cutoffCensored="minFeature").
This process handles missing values through imputation and reduces the influence of the outliers using the
TMP estimation. Note, however, that those runs with no measurements at all will be removed and not be used
for any calculation.
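The robustness of the TMP summarization to outliers can be seen on a small example with the medpolish function from the stats package. The feature-by-run matrix below is made up, and taking the overall effect plus the column (run) effect as the run-level summary is an assumption of this sketch rather than a statement about MSstats internals.

```r
# Toy log2 intensities of one protein: rows are features, columns are runs.
log2_int <- rbind(
  pep1 = c(20, 21, 22),
  pep2 = c(22, 23, 24),
  pep3 = c(21, 22, 30)   # the last value is an outlier
)
colnames(log2_int) <- paste0("run", 1:3)

# Tukey's median polish: decomposes the matrix into an overall effect,
# row (feature) effects, column (run) effects and residuals.
fit <- medpolish(log2_int, na.rm = TRUE, trace.iter = FALSE)

# Run-level summary, taken here as overall + run effect.
run_summary <- fit$overall + fit$col
print(run_summary)          # 21 22 23

# Average-based summary for comparison: run3 is pulled up by the outlier.
print(colMeans(log2_int))
```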
# default option
DDA2009.proposed <- dataProcess(raw = DDARawData,
normalization = 'equalizeMedians',
summaryMethod = 'TMP',
censoredInt = "NA", cutoffCensored = "minFeature",
MBimpute = TRUE,
maxQuantileforCensored=0.999)
## Log2 intensities under cutoff = 13.456 were considered as censored missing values.
## * Use all features that the dataset origianally has.
##
## Summary of Features :
## count
## # of Protein 6
## # of Peptides/Protein 11-32
## # of Transitions/Peptide 1-1
##
## Summary of Samples :
## C1 C2 C3 C4 C5 C6
## # of MS runs 3 3 3 3 3 3
## # of Biological Replicates 1 1 1 1 1 1
## # of Technical Replicates 3 3 3 3 3 3
##
## Summary of Missingness :
## # transitions are completely missing in one condition: 90
## -> D.GPLTGTYR_23_23_NA_NA, F.HFHWGSSDDQGSEHTVDR_402_402_NA_NA, G.PLTGTYR_8_8_NA_NA, H.SFNVEYDDSQD
##
## # run with 75% missing observations: 0
##
## == Start the summarization per subplot...
## Getting the summarization by Tukey's median polish per subplot for protein bovine ( 1 of 6 )
## Getting the summarization by Tukey's median polish per subplot for protein chicken ( 2 of 6 )
## Getting the summarization by Tukey's median polish per subplot for protein cyc_horse ( 3 of 6 )
## Getting the summarization by Tukey's median polish per subplot for protein myg_horse ( 4 of 6 )
## Getting the summarization by Tukey's median polish per subplot for protein rabbit ( 5 of 6 )
## Getting the summarization by Tukey's median polish per subplot for protein yeast ( 6 of 6 )
##
## == the summarization per subplot is done.
Output of dataProcess
Output of the dataProcess function contains the processed and run-level summarized data as well as relevant
information for the summarization step.
# output of dataProcess includes several data types.
names(DDA2009.proposed)
the decision about censored missing or not, based on censoredInt and maxQuantileforCensored options.
ABUNDANCE with TRUE value in censored column will be considered as censored missing and imputed with
MBimpute=TRUE option. Censored missing will be distinguished in Profile plot from dataProcessPlots.
# run-level summarized data
head(DDA2009.proposed$RunlevelData)
## [1] "TMP"
# use type="QCplot" with all proteins
# change the upper limit of y-axis=35
# set up the size of pdf
dataProcessPlots(data = DDA2009.proposed, type="QCplot", ylimUp=35,
width=5, height=5)
NOTE Don’t worry about warning messages like the ones below. They mean that NA values are not included in the plot,
which is the proper behavior in this case.
Warning messages:
1: Removed 698 rows containing non-finite values
(stat_boxplot).
...
Profile plot
Profile plot helps identify potential sources of variation (both variation of interest and nuisance variation) for
each protein. Such plots should be done after the normalization. Profile plots with summarization present the
effects of the summarization step by showing all individual measurements of a protein and their summarized
intensity. With type="profileplot", two pdfs will be generated. The first pdf includes plots (per protein)
to show individual measurement for each peptide (peptide for DDA, transition for SRM or DIA) across runs,
grouped per condition. Each peptide has a different color/type layout. Disconnected lines show that there are
missing values (NA). To omit these plots, please use the option originalPlot=FALSE. The second pdf, which
is named with the ‘wSummarization’ suffix, shows run-level summarized data per protein. The same peptides (or
transitions) as in the first plot are presented in grey, with the summarized values (by TMP, in this example)
overlaid in red. To omit these plots with summarization, please use the option summaryPlot=FALSE.
dataProcessPlots(data = DDA2009.proposed, type="Profileplot", ylimUp=35,
featureName="NA", width=5, height=5, address="DDA2009_proposed_")
Condition plot
Condition plot visualizes potential systematic differences in protein intensities between conditions. Dots indicate
the mean of log2 intensities for each condition. With the option interval='CI' (default), error bars indicate
the 95% confidence interval for each condition. With the option interval='SD', error
bars indicate the standard deviation among all feature intensities for each condition. The intervals are
for descriptive purposes only, as more refined model-based estimation is obtained as discussed
below. With the option scale=TRUE, the levels of conditions are scaled according to their labels. If
scale=FALSE (default), the conditions on the x-axis are equally spaced.
dataProcessPlots(data = DDA2009.proposed, type="Conditionplot",
width=5, height=5, address="DDA2009_proposed_")
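The quantities drawn in a condition plot can be illustrated with a small calculation. The sketch below computes the mean, the standard deviation and a 95% confidence interval per condition on made-up log2 intensities. The interval is based on the t distribution, which is an assumption of this sketch. The exact formula used by dataProcessPlots is not given in this document.

```r
# Toy log2 intensities for two conditions.
toy <- data.frame(
  condition = rep(c("C1", "C2"), each = 4),
  log2_int  = c(20, 21, 22, 23, 24, 24, 26, 26)
)

# Mean, SD and t-based 95% confidence interval for each condition.
summ <- do.call(rbind, lapply(split(toy$log2_int, toy$condition), function(x) {
  n    <- length(x)
  m    <- mean(x)
  s    <- sd(x)
  half <- qt(0.975, df = n - 1) * s / sqrt(n)
  data.frame(mean = m, sd = s, ci_lower = m - half, ci_upper = m + half)
}))
print(summ)
```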
dataProcessPlots has a number of layout options, including size and description of axes labels, output file
name, etc., for the three types of plots above. The option address specifies the name of the folder storing pdf files
with the plots. With the option address=FALSE, plots will be shown in the graphical window, but not saved
in a file. If a file with this name already exists in the working directory, a suffix with a number will be appended
to the file name. In this way a record of all the analyses is kept.
For more details, visit the help file using the following code.
?dataProcessPlots
summaryMethod = 'TMP',
censoredInt = NULL, MBimpute=FALSE)
While the original profile plots are the same, the summarization plots reveal differences, especially for conditions ‘C1’
and ‘C2’ in the ‘yeast’ protein, which have many missing values. Without imputation, the summarized values in the ‘C1’
group are higher than with imputation of missing values.
Besides summarizing observations with the median polish method, MSstats also offers a summarization
option based on a linear model, summaryMethod="linear". Combined with censoredInt=NULL, it assumes that all
NAs are missing at random and uses lm for parameter estimation.
# linear model (lm) with run and feature
DDA2009.linear <- dataProcess(raw = DDARawData,
normalization = 'equalizeMedians',
summaryMethod = 'linear',
censoredInt = NULL,
MBimpute = FALSE)
Profile plots below can be used to compare different options for summarization (e.g., TMP with or without
imputation vs linear for summarization in dataProcess).
dataProcessPlots(data = DDA2009.linear, type="Profileplot", ylimUp=35,
featureName="NA", width=5, height=5, address="DDA2009_linear_")
While the original profile plots are the same, the summarization plots reveal differences, especially for conditions
‘C1’, ‘C2’, and ‘C6’ in the ‘yeast’ protein, which have many missing values. Summarized values with the linear model
in these groups are much higher than those with TMP, with or without imputation of missing values.
In addition to the processed data, the groupComparison function requires a contrast matrix to define the
comparisons to be made. The contrast matrix is created with one condition per column and one comparison
per row. Note that the conditions are arranged in alphabetical order. The order of conditions that MSstats
recognizes can be shown by using levels:
levels(DDA2009.TMP$ProcessedData$GROUP_ORIGINAL)
comparison6 <- matrix(c(1,0,0,0,0,-1),nrow=1)
comparison<-rbind(comparison1,comparison2,comparison3,comparison4,comparison5,comparison6)
row.names(comparison) <- c("C2-C1","C3-C2","C4-C3","C5-C4","C6-C5","C1-C6")
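The contrast rows named above can also be built programmatically, which avoids sign and position mistakes when there are many conditions. The sketch below assumes six conditions labeled C1 to C6 in alphabetical order. The helper make_contrast and the object names are hypothetical and not part of MSstats.

```r
# Conditions in the (alphabetical) order in which they appear as columns.
conditions <- paste0("C", 1:6)

# Hypothetical helper: one contrast row for "numerator - denominator".
make_contrast <- function(numerator, denominator, levels) {
  row <- setNames(numeric(length(levels)), levels)
  row[numerator]   <-  1
  row[denominator] <- -1
  row
}

# Comparisons matching the row names used above.
wanted <- list(c("C2", "C1"), c("C3", "C2"), c("C4", "C3"),
               c("C5", "C4"), c("C6", "C5"), c("C1", "C6"))

# One row per comparison, one column per condition.
contrast <- t(sapply(wanted, function(p) make_contrast(p[1], p[2], conditions)))
rownames(contrast) <- sapply(wanted, paste, collapse = "-")
print(contrast)

# Every row of a contrast matrix for pairwise comparisons sums to zero.
stopifnot(all(rowSums(contrast) == 0))
```

The last row, C1-C6, equals c(1,0,0,0,0,-1), the same vector as comparison6 above.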
With the contrast matrix specified, group comparison can be performed as follows.
DDA2009.comparisons <- groupComparison(contrast.matrix = comparison, data = DDA2009.proposed)
# get only significant proteins and comparisons among all comparisons
# to simultaneously control the overall FDR at the level 0.05
SignificantProteins <- with(DDA2009.comparisons,
ComparisonResult[ComparisonResult$adj.pvalue < 0.05, ])
nrow(SignificantProteins)
## [1] 34
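The difference between filtering on raw and on adjusted p-values can be seen on a toy vector. The sketch below uses p.adjust from the stats package with the Benjamini-Hochberg method. The p-values are made up, and the choice of this particular adjustment method is an assumption of the sketch.

```r
# Toy raw p-values from six protein-level tests.
p <- c(0.001, 0.008, 0.039, 0.041, 0.20, 0.74)

# Benjamini-Hochberg adjustment controls the false discovery rate.
adj <- p.adjust(p, method = "BH")
print(round(adj, 4))

# Tests declared significant at FDR 0.05, versus raw p-values below 0.05.
significant <- which(adj < 0.05)
print(significant)     # 1 2
print(sum(p < 0.05))   # 4
```

Two of the four tests with a raw p-value below 0.05 no longer pass after adjustment, which is why the filter above is applied to adj.pvalue.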
Results based on the statistical models are accurate as long as the assumptions of the models hold. Here we
focus on the assumption of the Normal distribution of the measurement errors, and also on the assumption of
constant variance of the measurement errors (if this option is specified in the model above). The assumptions
can be checked by examining the residuals of the model fit (i.e., the deviations of the observed intensities of
the transition from their model-based predictions).
The modelBasedQCPlots function generates residual plots and Normal quantile-quantile plots for each protein,
taking as input the results of model fitting and testing in groupComparison. The Normal quantile-quantile
plot (option type='QQPlots') shows deviations from Normality; note that deviations from constant variance
can be mistaken for deviations from Normality. Only large deviations of transition intensities from the
straight line are problematic.
The residual plot (option type='ResidualPlots') shows whether the variance of the residuals is associated
with the mean feature intensity. Any specific pattern, such as residual spread that increases or decreases
with predicted abundance, is problematic.
# normal quantile-quantile plots
modelBasedQCPlots(data=DDA2009.comparisons, type="QQPlots",
width=5, height=5, address="DDA2009_proposed_")
# residual plots
modelBasedQCPlots(data=DDA2009.comparisons, type="ResidualPlots",
width=5, height=5, address="DDA2009_proposed_")
For more details, visit the help file using the following code.
?modelBasedQCPlots
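To make the two diagnostics concrete outside of MSstats, the following base R sketch fits an ordinary linear model to simulated log-intensities and draws the same two kinds of plots. It is an illustration only: MSstats fits its own models per protein, and the simulated data, the lm call and all object names here are assumptions.

```r
set.seed(1)

# Simulated log2 intensities for three conditions (made-up values).
toy <- data.frame(
  condition = factor(rep(c("C1", "C2", "C3"), each = 8)),
  log2int   = c(rnorm(8, 20, 0.3), rnorm(8, 21, 0.3), rnorm(8, 22, 0.3))
)

fit  <- lm(log2int ~ condition, data = toy)
res  <- residuals(fit)   # observed minus model-based prediction
pred <- fitted(fit)      # model-based prediction

# Normal quantile-quantile plot: points should follow the straight line.
qqnorm(res)
qqline(res)

# Residual plot: the spread should not change with predicted abundance.
plot(pred, res, xlab = "Predicted abundance", ylab = "Residual")
abline(h = 0, lty = 2)
```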
Volcano plots
Volcano plots visualize the outcome of one comparison between conditions for all the proteins, and combine
the information on statistical and practical significance. The y-axis displays the FDR-adjusted p-values on the
negative log10 scale, representing statistical significance. The horizontal dashed line shows the FDR cutoff.
The points above the FDR cutoff line are statistically significant proteins that are differentially abundant
across conditions. These points are colored in red and blue for upregulated and downregulated proteins,
respectively. The x-axis is the model-based estimate of fold change on log scale (the base of logarithm
transform is the same as specified in the logTrans option of the dataProcess function), and represents
practical significance. It is possible to specify a practical significance cutoff based on the estimate of fold
change in addition to the statistical significance cutoff. If the fold change cutoff is specified, the points above
the horizontal cutoff line but within the vertical cutoff line will be considered as not differentially abundant
(and will be colored in black). The practical significance cutoff should only be applied in addition to the
statistical significance cutoff (i.e., the fold change alone does not present enough evidence for differential
abundance).
groupComparisonPlots(data = DDA2009.comparisons$ComparisonResult, type = 'VolcanoPlot',
width=5, height=5, address="DDA2009_proposed_")
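The colouring rule described above can be written out as a small classifier. This is a sketch on made-up values, assuming log2 fold changes and a hypothetical fold-change cutoff of 1.5 on the original scale; the function classify_protein and the status labels are illustrative names, not MSstats output.

```r
# Toy comparison results (made-up values).
volcano_toy <- data.frame(
  Protein    = paste0("P", 1:5),
  log2FC     = c(2.0, -1.5, 0.3, 1.8, -0.2),
  adj.pvalue = c(0.001, 0.01, 0.02, 0.20, 0.70)
)

sig_cutoff <- 0.05  # FDR cutoff (horizontal line)
fc_cutoff  <- 1.5   # hypothetical fold-change cutoff (vertical lines)

# Hypothetical helper: a protein counts as differentially abundant only if
# it passes the FDR cutoff and, when given, the fold-change cutoff as well.
classify_protein <- function(log2fc, adjp, sig, fc = NULL) {
  status <- rep("No regulation", length(adjp))
  keep <- adjp < sig
  if (!is.null(fc)) {
    keep <- keep & abs(log2fc) > log2(fc)
  }
  status[keep & log2fc > 0] <- "Up-regulated"
  status[keep & log2fc < 0] <- "Down-regulated"
  status
}

volcano_toy$status <- classify_protein(volcano_toy$log2FC, volcano_toy$adj.pvalue,
                                       sig_cutoff, fc_cutoff)
volcano_toy$neg_log10_adjp <- -log10(volcano_toy$adj.pvalue)  # y-axis value
volcano_toy
```

Protein P3 passes the FDR cutoff but not the fold-change cutoff, so it stays in the "No regulation" class, which corresponds to the black points between the vertical lines.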
‘VolcanoPlot.pdf’ will be saved in the folder you specified. It contains one plot per comparison defined in
contrast.matrix. Please check ?groupComparisonPlots for details, such as labelling protein names, size of
dots, font sizes, etc. Below is one of the volcano plots, for comparison ‘C1-C6’, with protein name labelling.
Protein names are shown for significant proteins and positioned so that the labels do not overlap each other.
Heatmap
Heatmaps illustrate the patterns of up- and down-regulation of proteins in several comparisons. Columns
in the heatmaps are comparison of conditions assigned in contrast.matrix, and rows are proteins. The
heatmaps display signed FDR-adjusted p-values of the tests, colored in red/blue for significantly up-/down-
regulated proteins, while taking into account the specified FDR cutoff and the additional optional fold change
cutoff. Brighter colors indicate stronger evidence in favor of differential abundance. Black color represents
proteins that are not significantly differentially abundant.
NOTE To draw heatmap, at least two comparisons are needed.
The rows and columns of the heatmaps can be ordered with the option clustering, which performs hierarchical
clustering with the Ward method (minimum variance). The option clustering='protein' (default) clusters
the rows (proteins) in the space of comparisons, based on the values of (sign of comparison)·(-log2(adjusted
p-values)). The option clustering='comparison' clusters the columns in the space of proteins, based on
the values of (sign of comparison)·(-log2(adjusted p-value)). The option clustering='both' reorders both
columns and rows.
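The clustering input described above can be reproduced in a few lines of base R. The sketch uses made-up matrices and stats::hclust with method "ward.D2"; which Ward variant MSstats calls internally is not stated here, so that choice is an assumption.

```r
# Toy inputs: rows are proteins, columns are comparisons (made-up values).
log2fc <- matrix(c( 1.2,  0.8, -0.1,
                    1.0,  0.9,  0.2,
                   -1.5, -0.7,  0.1,
                   -1.1, -0.9, -0.3),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("P", 1:4), c("C2-C1", "C3-C2", "C4-C3")))
adjp <- matrix(c(0.001, 0.01, 0.8,
                 0.002, 0.02, 0.6,
                 0.001, 0.03, 0.9,
                 0.004, 0.01, 0.5),
               nrow = 4, byrow = TRUE, dimnames = dimnames(log2fc))

# (sign of comparison) * (-log2(adjusted p-value))
signed_score <- sign(log2fc) * (-log2(adjp))

# Cluster proteins in the space of comparisons, and comparisons in the
# space of proteins, with a Ward (minimum variance) criterion.
row_tree <- hclust(dist(signed_score), method = "ward.D2")
col_tree <- hclust(dist(t(signed_score)), method = "ward.D2")

# Reordered matrix, as it would appear in the heatmap.
signed_score[row_tree$order, col_tree$order]
```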
groupComparisonPlots(data = DDA2009.comparisons$ComparisonResult, type = 'Heatmap')
‘Heatmap.pdf’ will be saved under the folder you assigned. Below is one example, showing the results for
several comparisons simultaneously.
Comparison plots
Comparison plots illustrate model-based estimates of log-fold changes, and the associated uncertainty, in
several comparisons of conditions for one protein. X-axis is the comparison of interest. Y-axis is the log fold
change. The dots are the model-based estimates of log-fold change, and the error bars are the model-based
95% confidence intervals (the option sig can be used to change the significance level). For
simplicity, the confidence intervals are adjusted for multiple comparisons within protein only, using the
Bonferroni approach. For proteins with N comparisons, the individual confidence intervals are at the level of
1-sig/N.
groupComparisonPlots(data=DDA2009.comparisons$ComparisonResult, type="ComparisonPlot",
width=5, height=5, address="DDA2009_proposed_")
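The Bonferroni adjustment of the interval level can be illustrated with a short calculation. The sketch assumes t-based intervals built from an estimate, a standard error and degrees of freedom; the numbers are made up, and the exact computation inside groupComparisonPlots may differ.

```r
sig <- 0.05  # overall significance level for one protein

# Made-up estimates for one protein across N = 3 comparisons.
estimates <- data.frame(
  Label  = c("C2-C1", "C3-C2", "C4-C3"),
  log2FC = c(0.80, -0.30, 1.10),
  SE     = c(0.20, 0.25, 0.30),
  DF     = c(12, 12, 12)
)

n_comp <- nrow(estimates)
level  <- 1 - sig / n_comp  # confidence level of each individual interval

# Two-sided critical value at the Bonferroni-adjusted level.
t_crit <- qt(1 - (sig / n_comp) / 2, df = estimates$DF)

estimates$lower <- estimates$log2FC - t_crit * estimates$SE
estimates$upper <- estimates$log2FC + t_crit * estimates$SE
estimates
```

With three comparisons each interval is drawn at a level of about 98.3%, so it is wider than an unadjusted 95% interval.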
For further details, such as labelling protein names, size of dots, font sizes, etc., visit the help file using the
following code.
?groupComparisonPlots
This last analysis step views the dataset as a pilot study of a future experiment, utilizes its variance
components, and calculates the minimal number of replicates required in a future experiment to achieve
the desired statistical power. The calculation is performed by the function designSampleSize, which takes
as input the fitted model in groupComparison. Sample size calculation assumes same experimental design
(i.e. group comparison, time course or paired design) as in the current dataset, and uses the model fit to
estimate the median variance components across all the proteins. Finally, sample size calculation assumes that
a large proportion of proteins (specifically, 99%) will not change in abundance in the future experiment. This
assumption also provides conservative results. Using the estimated variance components, the function relates
the number of biological replicates per condition (numSample, rounded to a whole number), average statistical power
across all the proteins (power), minimal fold change that we would like to detect (can be specified as a range,
e.g. desiredFC=c(1.1, 2)), and the False Discovery Rate (FDR). The user should specify all these quantities
but one, and the function will solve for the remaining one. The quantity to solve for should be set to TRUE.
# Minimal number of biological replicates per condition
result.sample <- designSampleSize(data=DDA2009.comparisons$fittedmodel, numSample=TRUE,
desiredFC=c(1.25, 3), FDR=0.05, power=0.8)
result.sample
## 4 1.325 22 0.05 0.8 0.007
## 5 1.350 19 0.05 0.8 0.007
## 6 1.375 17 0.05 0.8 0.008
## 7 1.400 16 0.05 0.8 0.009
## 8 1.425 14 0.05 0.8 0.010
## 9 1.450 13 0.05 0.8 0.010
## 10 1.475 12 0.05 0.8 0.011
## 11 1.500 11 0.05 0.8 0.012
## 12 1.525 10 0.05 0.8 0.013
## 13 1.550 9 0.05 0.8 0.014
## 14 1.575 9 0.05 0.8 0.014
## 15 1.600 8 0.05 0.8 0.015
## 16 1.625 7 0.05 0.8 0.017
## 17 1.650 7 0.05 0.8 0.017
## 18 1.675 7 0.05 0.8 0.016
## 19 1.700 6 0.05 0.8 0.019
## 20 1.725 6 0.05 0.8 0.018
## 21 1.750 6 0.05 0.8 0.018
## 22 1.775 5 0.05 0.8 0.022
## 23 1.800 5 0.05 0.8 0.021
## 24 1.825 5 0.05 0.8 0.021
## 25 1.850 5 0.05 0.8 0.021
## 26 1.875 4 0.05 0.8 0.026
## 27 1.900 4 0.05 0.8 0.025
## 28 1.925 4 0.05 0.8 0.025
## 29 1.950 4 0.05 0.8 0.025
## 30 1.975 4 0.05 0.8 0.024
## 31 2.000 4 0.05 0.8 0.024
## 32 2.025 4 0.05 0.8 0.024
## 33 2.050 3 0.05 0.8 0.031
## 34 2.075 3 0.05 0.8 0.031
## 35 2.100 3 0.05 0.8 0.030
## 36 2.125 3 0.05 0.8 0.030
## 37 2.150 3 0.05 0.8 0.030
## 38 2.175 3 0.05 0.8 0.029
## 39 2.200 3 0.05 0.8 0.029
## 40 2.225 3 0.05 0.8 0.029
## 41 2.250 3 0.05 0.8 0.028
## 42 2.275 3 0.05 0.8 0.028
## 43 2.300 3 0.05 0.8 0.028
## 44 2.325 2 0.05 0.8 0.041
## 45 2.350 2 0.05 0.8 0.041
## 46 2.375 2 0.05 0.8 0.040
## 47 2.400 2 0.05 0.8 0.040
## 48 2.425 2 0.05 0.8 0.039
## 49 2.450 2 0.05 0.8 0.039
## 50 2.475 2 0.05 0.8 0.039
## 51 2.500 2 0.05 0.8 0.038
## 52 2.525 2 0.05 0.8 0.038
## 53 2.550 2 0.05 0.8 0.038
## 54 2.575 2 0.05 0.8 0.037
## 55 2.600 2 0.05 0.8 0.037
## 56 2.625 2 0.05 0.8 0.036
## 57 2.650 2 0.05 0.8 0.036
## 58 2.675 2 0.05 0.8 0.036
## 59 2.700 2 0.05 0.8 0.035
## 60 2.725 2 0.05 0.8 0.035
## 61 2.750 2 0.05 0.8 0.035
## 62 2.775 2 0.05 0.8 0.034
## 63 2.800 2 0.05 0.8 0.034
## 64 2.825 2 0.05 0.8 0.034
## 65 2.850 2 0.05 0.8 0.034
## 66 2.875 2 0.05 0.8 0.033
## 67 2.900 2 0.05 0.8 0.033
## 68 2.925 2 0.05 0.8 0.033
## 69 2.950 1 0.05 0.8 0.065
## 70 2.975 1 0.05 0.8 0.064
## 71 3.000 1 0.05 0.8 0.064
# Power calculation
result.power <- designSampleSize(data=DDA2009.comparisons$fittedmodel, numSample=3,
desiredFC=c(1.25, 3), FDR=0.05, power=TRUE)
result.power
## 35 2.100 3 0.05 0.76 0.030
## 36 2.125 3 0.05 0.78 0.030
## 37 2.150 3 0.05 0.80 0.030
## 38 2.175 3 0.05 0.82 0.029
## 39 2.200 3 0.05 0.84 0.029
## 40 2.225 3 0.05 0.86 0.029
## 41 2.250 3 0.05 0.87 0.028
## 42 2.275 3 0.05 0.88 0.028
## 43 2.300 3 0.05 0.90 0.028
## 44 2.325 3 0.05 0.91 0.027
## 45 2.350 3 0.05 0.92 0.027
## 46 2.375 3 0.05 0.93 0.027
## 47 2.400 3 0.05 0.93 0.027
## 48 2.425 3 0.05 0.94 0.026
## 49 2.450 3 0.05 0.95 0.026
## 50 2.475 3 0.05 0.95 0.026
## 51 2.500 3 0.05 0.96 0.026
## 52 2.525 3 0.05 0.96 0.025
## 53 2.550 3 0.05 0.97 0.025
## 54 2.575 3 0.05 0.97 0.025
## 55 2.600 3 0.05 0.98 0.025
## 56 2.625 3 0.05 0.98 0.024
## 57 2.650 3 0.05 0.98 0.024
## 58 2.675 3 0.05 0.98 0.024
## 59 2.700 3 0.05 0.99 0.024
## 60 2.725 3 0.05 0.99 0.023
## 61 2.750 3 0.05 0.99 0.023
## 62 2.775 3 0.05 0.99 0.023
## 63 2.800 3 0.05 0.99 0.023
## 64 2.825 3 0.05 0.99 0.023
## 65 2.850 3 0.05 0.99 0.022
## 66 2.875 3 0.05 0.99 0.022
## 67 2.900 3 0.05 0.99 0.022
## 68 2.925 3 0.05 0.99 0.022
## 69 2.950 3 0.05 0.99 0.022
## 70 2.975 3 0.05 0.99 0.021
## 71 3.000 3 0.05 0.99 0.021
For further details, visit the help file using the following code.
?designSampleSize
For further details, visit the help file using the following code.
?designSampleSizePlots
Many downstream analysis steps (such as clustering or classification of individual samples in the space of
their protein profiles) require summary values of protein abundance in each biological replicate or in each
condition, on a relative scale that is comparable between runs.
The dataProcess function performs model-based run-level summarization. The quantification function enables
subject-level or group-level summarization based on the run-level summarization from dataProcess.
The option type='sample' (default) performs sample quantification, i.e. it outputs the estimates of relative
protein abundance in each biological replicate. If a biological replicate has technical replicates,
sample quantification is the median among the technical replicates. If a biological replicate (sample) has
no technical replicates, sample quantification is the same as the run-level summarization. If the values in
a biological replicate are completely missing, the estimate will be zero.
The option type='group' performs group quantification, i.e. it outputs the estimates of relative protein
abundance in each condition, summarized over the biological replicates (median among sample quantification).
In presence of completely missing values in a condition, the estimates will be zero.
MSstats supports two output formats. The option format='matrix' (default) outputs an array where rows
are proteins, and columns are conditions (for group quantification), or combinations of biological replicate
and condition ids (for sample quantification). The option format='long' produces an array where each row
corresponds to one relative protein abundance, and the columns are Protein, Condition, LogIntensities (and
BioReplicate in the case of sample quantification).
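The median-based summarization and the two output shapes can be mimicked in base R. This is a sketch on made-up run-level values, not the quantification function itself; the column names follow the long format described above, and the object names are hypothetical.

```r
# Made-up run-level summaries: two technical replicate runs per
# biological replicate, two biological replicates per condition.
run_level <- data.frame(
  Protein        = "bovine",
  Condition      = c("C1", "C1", "C1", "C1", "C2", "C2", "C2", "C2"),
  BioReplicate   = c(1, 1, 2, 2, 3, 3, 4, 4),
  Run            = 1:8,
  LogIntensities = c(20.1, 20.3, 20.8, 21.0, 21.5, 21.7, 22.0, 22.4)
)

# Sample quantification: median over technical replicates (long format).
sample_long <- aggregate(LogIntensities ~ Protein + Condition + BioReplicate,
                         data = run_level, FUN = median)

# Group quantification: median over the sample-level values (long format).
group_long <- aggregate(LogIntensities ~ Protein + Condition,
                        data = sample_long, FUN = median)

# Matrix-like format: one row per protein, one column per condition.
group_matrix <- reshape(group_long, idvar = "Protein", timevar = "Condition",
                        direction = "wide")
group_matrix
```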
subQuant <- quantification(DDA2009.proposed)
head(subQuant)
## 5 rabbit 14.89507 15.88492 17.43767 20.19014 21.27964 22.07550
## 6 yeast 17.26792 19.19987 20.71073 22.73666 24.06156 16.38660
groupQuant <- quantification(DDA2009.proposed, type='group')
head(groupQuant)
## Protein C1 C2 C3 C4 C5 C6
## 1 bovine 20.85653 21.60443 14.32690 16.10441 17.63141 19.27802
## 2 chicken 18.48792 19.43204 20.41274 22.42284 15.92462 17.09803
## 3 cyc_horse 20.25927 21.33967 22.22028 15.85252 17.62720 18.45536
## 4 myg_horse 22.66495 14.73701 14.99667 18.61740 20.26392 21.52022
## 5 rabbit 14.89507 15.88492 17.43767 20.19014 21.27964 22.07550
## 6 yeast 17.26792 19.19987 20.71073 22.73666 24.06156 16.38660
For further details, visit the help file using the following code.
?quantification
This section describes steps and considerations to properly format data processed by Skyline, prior to the
MSstats analysis. In the following example, the raw files for the Dynamic benchmark dataset (J. Cox et al. 2014)
are used, searched by Andromeda in MaxQuant, but quantified by Skyline.
This required input data is generated automatically when using the MSstats report format in Skyline. We first
load and access the dataset processed by Skyline. The file saved from Skyline using the MSstats report
format is named ‘Cox.skyline.csv’.
# Read output from skyline : Cox.skyline.csv
raw <- read.csv("Cox.skyline.csv")
The csv file can be read directly, as above. Here we instead load an R data file that contains exactly the same data as the Cox.skyline.csv file.
# Load R data, which is converted from the csv file output by Skyline : Cox.skyline.csv
load("Cox.skyline.RData")
raw <- Cox.skyline
head(raw)
## 74 light NA NA
## 75 light NA NA
## 76 light NA NA
## 77 light NA NA
## 78 light NA NA
## FileName Area StandardType Truncated
## 73 20130510_EXQ1_IgPa_QC_UPS1_01.raw 1222621440 NA False
## 74 20130510_EXQ1_IgPa_QC_UPS1_02.raw 1641793792 NA False
## 75 20130510_EXQ1_IgPa_QC_UPS1_03.raw 1225490048 NA False
## 76 20130510_EXQ1_IgPa_QC_UPS1_04.raw 1777631616 NA False
## 77 20130510_EXQ1_IgPa_QC_UPS2_01.raw 7395562496 NA False
## 78 20130510_EXQ1_IgPa_QC_UPS2_02.raw 6193937408 NA False
## DetectionQValue
## 73 #N/A
## 74 #N/A
## 75 #N/A
## 76 #N/A
## 77 #N/A
## 78 #N/A
Annotation information is required to fill in Condition and BioReplicate for the corresponding Run information.
Users have to prepare a csv or txt file such as ‘Cox_skyline_annotation.csv’, which includes Run, Condition,
and BioReplicate information, and load it in R.
annot <- read.csv("Cox_skyline_annotation.csv", header=TRUE)
annot
The input data for MSstats is required to contain the variables ProteinName, PeptideSequence,
PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, and
Intensity. These variable names are fixed. The MSstats report from Skyline uses a column scheme
close to the MSstats input format. However, it contains several extra columns, and
some columns need to be renamed. The SkylinetoMSstatsFormat function pre-processes Skyline output into
the MSstats input format. For example, it renames some columns and
replaces truncated peak intensities with NA. Another important step is to handle isotopic peaks before using
dataProcess. The output from Skyline for a DDA experiment has several measurements of peak area, from
the monoisotopic, M+1 and M+2 peaks. To get a robust measure of peptide intensity, we can sum over the
isotopic peaks per peptide or use the highest peak. Here we take the sum per peptide ion.
Here is the summary of the pre-processing steps in the SkylinetoMSstatsFormat function (in orange box below).
# Reformatting and pre-processing of Skyline output.
quant <- SkylinetoMSstatsFormat(raw, annotation=annot)
## Peptides, that are used in more than one proteins, are removed.
## Warning in SkylinetoMSstatsFormat(raw, annotation = annot): NAs introduced
## by coercion
## ** Truncated peaks are replaced with NA.
## ** For DDA datasets, three isotopic peaks per feature and run are summed.
head(quant)
For further details, visit the help file using the following code.
?SkylinetoMSstatsFormat
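The two ways of combining isotopic peaks mentioned above (summing them or taking the highest peak) can be sketched in base R as follows. The data are made up, and the FragmentIon labels are an assumption about how the three isotopic peaks are named; this is not the converter's own code.

```r
# Made-up DDA peak areas: three isotopic peaks per peptide ion and run.
peaks <- data.frame(
  ProteinName     = "P1",
  PeptideSequence = "PEPTIDEK",
  PrecursorCharge = 2,
  FragmentIon     = rep(c("precursor", "precursor [M+1]", "precursor [M+2]"),
                        times = 2),
  Run             = rep(c("run1", "run2"), each = 3),
  Intensity       = c(100, 60, 20, 90, 50, 10)
)

# Option 1: sum the isotopic peaks per peptide ion and run.
summed <- aggregate(Intensity ~ ProteinName + PeptideSequence +
                      PrecursorCharge + Run,
                    data = peaks, FUN = sum)

# Option 2: keep only the highest isotopic peak per peptide ion and run.
highest <- aggregate(Intensity ~ ProteinName + PeptideSequence +
                       PrecursorCharge + Run,
                     data = peaks, FUN = max)

summed
highest
```

Note that the formula interface of aggregate drops rows with NA intensities by default, so truncated peaks that were replaced with NA would need explicit handling in a real dataset.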
Skyline output differs from that of other spectral processing tools in that Skyline distinguishes
values missing at random (NA) because of technical issues from low, noisy intensities below the limit of
detection. The output from Skyline can contain both NA (expect few or none) and very small intensities close
to zero (less than 1), and these should be treated as different types of missing values. In dataProcess,
users need to use censoredInt='0' for Skyline output, which distinguishes NA as random
missing from 0 as censored missing.
cox.skyline.proposed <- dataProcess(quant,
normalization='equalizeMedian',
summaryMethod="TMP",
cutoffCensored="minFeature",
censoredInt="0",
MBimpute=TRUE,
maxQuantileforCensored=0.999)
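The distinction between the two kinds of missing values can be made explicit before running dataProcess. The sketch below only labels made-up intensities according to the rule stated in the text (NA is random missing, values below 1 are treated as censored); how dataProcess handles them internally is not shown here.

```r
# Made-up Skyline-style intensities: observed, NA, zero and near-zero values.
intensity <- c(1222621440, NA, 0, 0.4, 7395562496)

missing_type <- ifelse(
  is.na(intensity), "random missing (NA)",
  ifelse(intensity < 1, "censored (below detection)", "observed")
)

table(missing_type)
```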
The following R code chunks show steps to format a MaxQuant output for analysis by MSstats.
Here a controlled mixture dataset with dynamic range benchmark (J. Cox et al. 2014) is used for
demonstration. This dataset is available in MSstats material GitHub under the folder named ‘example
dataset/DDA_controlledMixture20014’.
Three files should be prepared before running MSstats. Two of them, ‘proteinGroups.txt’ and ‘evidence.txt’,
are outputs from MaxQuant.
## First, get protein ID information
proteinGroups <- read.table("Cox_maxquant_proteinGroups.txt", sep = "\t", header = TRUE)
The third file holds annotation information, required to fill in Condition and BioReplicate for the
corresponding Run information. Users have to prepare a csv or txt file such as ‘Cox_maxquant_annotation.csv’,
which includes Run, Condition, and BioReplicate information, and load it in R.
## Read in annotation including condition and biological replicates: annotation.csv
annot <- read.csv("Cox_maxquant_annotation.csv", header = TRUE)
annot
## 6 20130510_EXQ1_IgPa_QC_UPS2_02 UPS2 2 UPS2_02
## 7 20130510_EXQ1_IgPa_QC_UPS2_03 UPS2 2 UPS2_03
## 8 20130510_EXQ1_IgPa_QC_UPS2_04 UPS2 2 UPS2_04
## IsotopeLabelType
## 1 L
## 2 L
## 3 L
## 4 L
## 5 L
## 6 L
## 7 L
## 8 L
The MaxQtoMSstatsFormat function pre-processes MaxQuant output into the MSstats input format.
Basically, this function takes peptide ion intensities from the ‘evidence.txt’ file. In addition, there are
several steps to filter out or to modify the data in order to get the required information.
Here is the summary of the pre-processing steps in the MaxQtoMSstatsFormat function (in orange box below).
## ProteinName PeptideSequence PrecursorCharge FragmentIon
## 1 A5A614 QVAESTPDIPK 2 NA
## 2 O00762ups DPAATSVAAAR 2 NA
## 3 O00762ups FLTPCYHPNVDTQGNICLDILK 2 NA
## 4 O00762ups FLTPCYHPNVDTQGNICLDILK 3 NA
## 5 O00762ups GAEPSGGAAR 2 NA
## 6 O00762ups GISAFPESDNLFK 2 NA
## ProductCharge IsotopeLabelType Condition BioReplicate
## 1 NA L UPS1 1
## 2 NA L UPS1 1
## 3 NA L UPS1 1
## 4 NA L UPS1 1
## 5 NA L UPS1 1
## 6 NA L UPS1 1
## Run Intensity
## 1 20130510_EXQ1_IgPa_QC_UPS1_01 NA
## 2 20130510_EXQ1_IgPa_QC_UPS1_01 1144800000
## 3 20130510_EXQ1_IgPa_QC_UPS1_01 32793000
## 4 20130510_EXQ1_IgPa_QC_UPS1_01 566960000
## 5 20130510_EXQ1_IgPa_QC_UPS1_01 58709000
## 6 20130510_EXQ1_IgPa_QC_UPS1_01 861090000
MaxQuant applies a fixed internal threshold for intensity values as a parameter. Intensities below
the threshold are reported as NA, so all missing values in the output from MaxQuant are NA. In dataProcess,
users need to use censoredInt='NA'. Users can use the same choices for the other options.
cox.maxquant.proposed <- dataProcess(quant,
normalization='equalizeMedian',
summaryMethod="TMP",
cutoffCensored="minFeature", censoredInt="NA",
MBimpute=TRUE,
maxQuantileforCensored=0.999)
This section describes steps and considerations to properly format data processed by Progenesis, prior to the
MSstats analysis. In the following example, the same raw dataset from the previous section is used, but it is
processed by Progenesis.
## 2 # Retention time (min) Charge m/z Measured mass
## 3 897 57.93265 2 459.27857242872 916.542591923679
## 4 1281 118.816733333333 2 1002.03351355306 2002.05247417235
## 5 3867 114.552433333333 2 1002.01296785898 2002.0113827842
## 6 1660 31.1707666666667 2 502.277007963734 1002.53946299371
## X.5 X.6 X.7 X.8
## 1
## 2 Mass error (u) Mass error (ppm) Score Sequence
## 3 0.0043889236793575 4.7885878242628 0.9992 VPYGAVLAK
## 4 0.0555781723510336 27.7613678932665 0.9956 LVITPVDGSDPYEEMIPK
## 5 0.0144867842016083 7.23616716417142 1 LVITPVDGSDPYEEMIPK
## 6 0.00488799370805282 4.87563604282956 1 AAAESSIQVK
## X.9 X.10
## 1
## 2 Modifications Accession
## 3 sp|P0A8T7|RPOC_ECOLI
## 4 sp|P0A8T7|RPOC_ECOLI
## 5 sp|P0A8T7|RPOC_ECOLI
## 6 sp|P0A8T7|RPOC_ECOLI
## X.11
## 1
## 2 Grouped accessions (for this sequence)
## 3
## 4
## 5
## 6
## X.12
## 1
## 2 Description
## 3 DNA-directed RNA polymerase subunit beta' OS=Escherichia coli (strain K12) GN=rpoC PE=1 SV=1
## 4 DNA-directed RNA polymerase subunit beta' OS=Escherichia coli (strain K12) GN=rpoC PE=1 SV=1
## 5 DNA-directed RNA polymerase subunit beta' OS=Escherichia coli (strain K12) GN=rpoC PE=1 SV=1
## 6 DNA-directed RNA polymerase subunit beta' OS=Escherichia coli (strain K12) GN=rpoC PE=1 SV=1
## X.13 X.14 X.15
## 1
## 2 Use in quantitation Max fold change Highest mean condition
## 3 False 1.07024295028376 Condition 1
## 4 False 1.26408361729762 Condition 2
## 5 True 1.12493122091942 Condition 2
## 6 False 2.08957830009021 Condition 1
## X.16 X.17 X.18
## 1
## 2 Lowest mean condition Anova Maximum CV
## 3 Condition 2 0.456859930184477 20.1184072001903
## 4 Condition 1 0.159939805145712 23.8614566196111
## 5 Condition 1 0.349651536742781 17.4104032083395
## 6 Condition 2 0.113361006195836 96.5523112559413
## Normalized.abundance X.19
## 1 Condition 1
## 2 20130510_EXQ1_IgPa_QC_UPS1_01 20130510_EXQ1_IgPa_QC_UPS1_02
## 3 20931810.5655776 21680597.2888134
## 4 160825738.322531 204844686.21804
## 5 73462123.4527211 95179635.7807487
## 6 35548647.7029835 30254160.104057
## X.20 X.21
## 1
## 2 20130510_EXQ1_IgPa_QC_UPS1_03 20130510_EXQ1_IgPa_QC_UPS1_04
## 3 19010264.4603025 19287017.2266766
## 4 159221868.441427 198325215.213719
## 5 66486528.3804692 93200193.8860612
## 6 21703274.487964 21391131.0188304
## X.22 X.23
## 1 Condition 2
## 2 20130510_EXQ1_IgPa_QC_UPS2_01 20130510_EXQ1_IgPa_QC_UPS2_02
## 3 16362526.9389495 15818549.690736
## 4 232357617.95467 167457335.555413
## 5 103977057.096685 74305266.8499236
## 6 25651156.0484212 22037815.8546054
## X.24 X.25
## 1
## 2 20130510_EXQ1_IgPa_QC_UPS2_03 20130510_EXQ1_IgPa_QC_UPS2_04
## 3 24123356.4583506 19294933.8780514
## 4 299234521.249582 215157929.093349
## 5 88240964.864741 102823670.745066
## 6 2576827.30283144 1848645.75613902
## Raw.abundance X.26
## 1 Condition 1
## 2 20130510_EXQ1_IgPa_QC_UPS1_01 20130510_EXQ1_IgPa_QC_UPS1_02
## 3 15105416.9228044 19732517.8634299
## 4 116059708.340506 186438656.471507
## 5 53013856.5563334 86627404.1374054
## 6 25653640.5636365 27535715.3100392
## X.27 X.28
## 1
## 2 20130510_EXQ1_IgPa_QC_UPS1_03 20130510_EXQ1_IgPa_QC_UPS1_04
## 3 13810320.3071721 19287017.2266766
## 4 115669353.662823 198325215.213719
## 5 48300235.6418326 93200193.8860612
## 6 15766701.8793535 21391131.0188304
## X.29 X.30
## 1 Condition 2
## 2 20130510_EXQ1_IgPa_QC_UPS2_01 20130510_EXQ1_IgPa_QC_UPS2_02
## 3 16908752.5099689 12362738.4578786
## 4 240114346.057948 130873644.09502
## 5 107448093.544628 58072111.4178035
## 6 26507461.2763444 17223308.0098951
## X.31 X.32
## 1
## 2 20130510_EXQ1_IgPa_QC_UPS2_03 20130510_EXQ1_IgPa_QC_UPS2_04
## 3 20505908.0138557 16847582.9106248
## 4 254362429.950747 187867503.05488
## 5 75008679.3143534 89781614.6456385
## 6 2190415.9038019 1614165.29570777
## Spectral.counts X.33
## 1 Condition 1
## 2 20130510_EXQ1_IgPa_QC_UPS1_01 20130510_EXQ1_IgPa_QC_UPS1_02
## 3 1 0
## 4 0 0
## 5 4 3
## 6 1 1
## X.34 X.35
## 1
## 2 20130510_EXQ1_IgPa_QC_UPS1_03 20130510_EXQ1_IgPa_QC_UPS1_04
## 3 1 0
## 4 0 0
## 5 7 3
## 6 1 1
## X.36 X.37
## 1 Condition 2
## 2 20130510_EXQ1_IgPa_QC_UPS2_01 20130510_EXQ1_IgPa_QC_UPS2_02
## 3 0 0
## 4 0 1
## 5 3 6
## 6 1 1
## X.38 X.39
## 1
## 2 20130510_EXQ1_IgPa_QC_UPS2_03 20130510_EXQ1_IgPa_QC_UPS2_04
## 3 0 0
## 4 0 0
## 5 2 3
## 6 0 0
One file is for annotation information, required to fill in Condition and BioReplicate for the corresponding
Run information. Users have to prepare a csv or txt file such as ‘Cox_Progenesis_annotation.csv’, which includes
Run, Condition, and BioReplicate information, and load it in R.
## Read in annotation including condition and biological replicates: annotation.csv
annot <- read.csv("Cox_Progenesis_annotation.csv", header = TRUE)
annot
The output from Progenesis includes peptide ion-level quantification for each MS run. The
ProgenesistoMSstatsFormat function pre-processes Progenesis output into the MSstats input format. Basically,
this function reformats the data from wide format to long format. The Progenesis output provides
‘Raw.abundance’, ‘Normalized.abundance’ and ‘Spectral.counts’ columns. The converter uses the
‘Raw.abundance’ columns for Intensity values. In addition,
there are several steps to filter out or to modify the data in order to get the required information.
Here is the summary of the pre-processing steps in the ProgenesistoMSstatsFormat function (in orange box below).
## check options for converting format
?ProgenesistoMSstatsFormat
## ProteinName
## 1 O00762ups|UBE2C_HUMAN_UPS
## 2 O00762ups|UBE2C_HUMAN_UPS
## 3 O00762ups|UBE2C_HUMAN_UPS
## 4 O00762ups|UBE2C_HUMAN_UPS
## 5 O00762ups|UBE2C_HUMAN_UPS
## 6 O00762ups|UBE2C_HUMAN_UPS
## PeptideModifiedSequence PrecursorCharge
## 1 FLTPCYHPNVDTQGNICLDILK[5] C+57.0215|[17] C+57.0215 2
## 2 FLTPCYHPNVDTQGNICLDILK[5] C+57.0215|[17] C+57.0215 3
## 3 FLTPCYHPNVDTQGNICLDILK[5] C+57.0215|[17] C+57.0215 4
## 4 GISAFPESDNLFK 2
## 5 WVGTIHGAAGTVYEDLR 2
## 6 WVGTIHGAAGTVYEDLR 3
## FragmentIon ProductCharge IsotopeLabelType Condition BioReplicate
## 1 NA NA L UPS1 1
## 2 NA NA L UPS1 1
## 3 NA NA L UPS1 1
## 4 NA NA L UPS1 1
## 5 NA NA L UPS1 1
## 6 NA NA L UPS1 1
## Run Intensity
## 1 20130510_EXQ1_IgPa_QC_UPS1_01 3790142.3
## 2 20130510_EXQ1_IgPa_QC_UPS1_01 63386703.5
## 3 20130510_EXQ1_IgPa_QC_UPS1_01 165145.8
## 4 20130510_EXQ1_IgPa_QC_UPS1_01 98738397.1
## 5 20130510_EXQ1_IgPa_QC_UPS1_01 9624505.6
## 6 20130510_EXQ1_IgPa_QC_UPS1_01 12633723.8
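The wide-to-long step can be illustrated in base R with stats::reshape. This is a sketch on a made-up table with simplified column names (run_01, run_02 stand in for the per-run ‘Raw.abundance’ columns); it is not the converter's own code, and the annotation values are assumptions.

```r
# Made-up wide table: one row per peptide ion, one intensity column per run.
wide <- data.frame(
  Accession = c("P1", "P1"),
  Sequence  = c("VPYGAVLAK", "AAAESSIQVK"),
  Charge    = c(2, 2),
  run_01    = c(15105416.9, 25653640.6),
  run_02    = c(19732517.9, 0),
  stringsAsFactors = FALSE
)
run_cols <- c("run_01", "run_02")

# Long format: one row per peptide ion and run.
long <- reshape(wide, direction = "long",
                varying = run_cols, v.names = "Intensity",
                timevar = "Run", times = run_cols,
                idvar = c("Accession", "Sequence", "Charge"))
rownames(long) <- NULL

# Attach Condition and BioReplicate from a (made-up) annotation table.
annot_toy <- data.frame(Run = run_cols,
                        Condition = c("UPS1", "UPS1"),
                        BioReplicate = c(1, 1))
long <- merge(long, annot_toy, by = "Run")
long
```

The zero in run_02 survives the reshape unchanged, which is the kind of value that censoredInt='0' later treats as censored.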
Progenesis reports 0 (zero) for missing values and does not use NA. Therefore, in dataProcess, users need to
use censoredInt='0'. Users can use the same choices for the other options.
cox.progenesis.proposed <- dataProcess(quant,
normalization='equalizeMedian',
summaryMethod="TMP",
cutoffCensored="minFeature",
censoredInt="0",
MBimpute=TRUE,
maxQuantileforCensored=0.999)
This section describes steps and considerations to properly format data processed by Proteome Discoverer,
prior to the MSstats analysis. In the following example, another spike-in dataset processed by Proteome
Discoverer is used for demonstration.
The output from Proteome Discoverer includes datasets at several levels. The PSM sheet should be saved as csv,
as below. Here is the expected input for MSstats.
## Read PSM-level data
raw <- read.csv("spikein_PD_psm.csv")
head(raw)
## 6 GTP cyclohydrolase 1 OS=Escherichia coli (strain K12) GN=folE PE=1 SV=2 - [GCH1_EC
## X..Proteins X..Protein.Groups Protein.Group.Accessions Modifications
## 1 1 1 P00961
## 2 1 1 P60438
## 3 1 1 P60723
## 4 1 1 P0ADY1
## 5 1 1 P07639
## 6 1 1 P0A6T5
## Activation.Type DeltaScore DeltaCn Rank Search.Engine.Rank
## 1 CID 1.0000 0 1 1
## 2 CID 0.5455 0 1 1
## 3 CID 0.0000 0 1 1
## 4 CID 0.4062 0 1 1
## 5 CID 1.0000 0 1 1
## 6 CID 1.0000 0 1 1
## Precursor.Area QuanResultID Decoy.Peptides.Matched Exp.Value
## 1 3.77e+07 NA 11 0.00033
## 2 6.59e+08 NA 6 0.00940
## 3 3.83e+08 NA 17 0.20000
## 4 1.42e+07 NA 4 0.01300
## 5 3.93e+07 NA NA 0.00860
## 6 2.80e+07 NA 7 0.27000
## Homology.Threshold Identity.High Identity.Middle IonScore
## 1 13 13 13 48
## 2 13 13 13 33
## 3 13 13 13 20
## 4 13 13 13 32
## 5 13 13 13 34
## 6 13 13 13 19
## Peptides.Matched X..Missed.Cleavages Isolation.Interference....
## 1 5 0 53
## 2 11 0 8
## 3 19 0 38
## 4 6 0 34
## 5 5 0 13
## 6 4 0 41
## Ion.Inject.Time..ms. Intensity Charge m.z..Da. MH...Da. Delta.Mass..Da.
## 1 4 1700000 2 350.2295 699.4517 0
## 2 2 2520000 2 350.2417 699.4761 0
## 3 5 739000 2 350.7340 700.4607 0
## 4 3 1520000 2 350.7342 700.4611 0
## 5 2 2480000 2 350.7520 700.4968 0
## 6 70 53500 2 351.1900 701.3728 0
## Delta.Mass..PPM. RT..min. First.Scan Last.Scan MS.Order Ions.Matched
## 1 0.68 32.17 8180 8180 MS2 Jun-50
## 2 -0.44 38.77 10907 10907 MS2 May-52
## 3 0.41 27.49 6221 6221 MS2 May-40
## 4 0.93 43.27 12766 12766 MS2 May-40
## 5 -0.03 42.75 12552 12552 MS2 Apr-40
## 6 -0.25 17.39 2693 2693 MS2 Apr-32
## Matched.Ions Total.Ions Spectrum.File
## 1 6 50 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw
## 2 5 52 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw
## 3 5 40 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw
## 4 5 40 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw
## 5 4 40 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw
## 6 4 32 121219_S_CCES_01_01_LysC_Try_1to10_Mixt_1_1.raw
## Annotation
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
One file is for annotation information, required to fill in Condition and BioReplicate for the corresponding
Run information. Users have to prepare a csv or txt file such as ‘spikein_PD_annotation.csv’, which includes
Run, Condition, and BioReplicate information, and load it in R.
## Read in annotation including condition and biological replicates: annotation.csv
annot <- read.csv("spikein_PD_annotation.csv", header = TRUE)
annot
The PDtoMSstatsFormat function pre-processes Proteome Discoverer output into the MSstats input format.
Protein.Group.Accessions is used for ProteinName. The combination of Sequence
and Modifications is used for PeptideSequence. Charge is used for PrecursorCharge. Precursor.Area
is used for Intensity. In addition, there are several steps to filter out or to modify the data in order to get
the required information.
Here is the summary of the pre-processing steps in the PDtoMSstatsFormat function (in orange box below).
## check options for converting format
?PDtoMSstatsFormat
quant <- PDtoMSstatsFormat(raw, annotation=annot)
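The column mapping described above can be spelled out on a toy PSM table. This is a sketch only: the peptide sequences, run name and annotation are made up, taking Run from Spectrum.File and joining Sequence and Modifications with an underscore are assumptions, and the filtering steps of PDtoMSstatsFormat are not reproduced.

```r
# Made-up PSM-level rows with the columns named in the text.
psm <- data.frame(
  Protein.Group.Accessions = c("P00961", "P60438"),
  Sequence       = c("aAAk", "bBBk"),
  Modifications  = c("", "M2(Oxidation)"),
  Charge         = c(2, 2),
  Precursor.Area = c(3.77e7, 6.59e8),
  Spectrum.File  = c("run_A.raw", "run_A.raw"),
  stringsAsFactors = FALSE
)

# Made-up annotation with Run, Condition and BioReplicate.
annot_toy <- data.frame(Run = "run_A.raw", Condition = "Mix1", BioReplicate = 1)

# Map PSM columns onto the MSstats variable names.
msstats_like <- data.frame(
  ProteinName      = psm$Protein.Group.Accessions,
  PeptideSequence  = ifelse(psm$Modifications == "", psm$Sequence,
                            paste(psm$Sequence, psm$Modifications, sep = "_")),
  PrecursorCharge  = psm$Charge,
  FragmentIon      = NA,
  ProductCharge    = NA,
  IsotopeLabelType = "L",
  Run              = psm$Spectrum.File,  # assumption: run taken from the file name
  Intensity        = psm$Precursor.Area,
  stringsAsFactors = FALSE
)

# Fill in Condition and BioReplicate from the annotation.
msstats_like <- merge(msstats_like, annot_toy, by = "Run")
msstats_like
```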
4.5.3 Different options for Proteome Discoverer in dataProcess
Proteome Discoverer reports NA for missing values. Therefore, in dataProcess, users need to use censoredInt='NA'.
Users can use the same choices for the other options.
cox.progenesis.proposed <- dataProcess(quant,
normalization='equalizeMedian',
summaryMethod="TMP",
cutoffCensored="minFeature",
censoredInt="NA",
MBimpute=TRUE,
maxQuantileforCensored=0.999)
This section describes steps and considerations to properly format data processed by Skyline for SWATH/DIA
experiments, prior to the MSstats analysis. In the following example, the raw files for profiling standard
sample set (Bruderer et al. 2015) are quantified by Skyline.
This required input data is generated automatically when using the MSstats report format in Skyline. We first
load and access the dataset processed by Skyline. The file saved from Skyline using the MSstats report
format is named ‘Bruderer.skyline.csv’.
# Read output from skyline : Bruderer.skyline.csv
raw <- read.csv("Bruderer.skyline.csv")
The csv file can be read directly, as above. Here we instead load an R data file that contains exactly the same data as the Bruderer.skyline.csv file.
# Load R data, which is converted from the csv file output by Skyline : Bruderer.skyline.csv
load("Bruderer.skyline.RData")
raw <- Bruderer.skyline
Annotation information in Condition and BioReplicate for the corresponding Run was already filled in using
Skyline > Result grid. If not, users have to prepare a csv or txt file that includes Run, Condition, and
BioReplicate information, and load it in R as in section 4.2.1.
The input data for MSstats is required to contain the variables ProteinName, PeptideSequence,
PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, and
Intensity. These variable names are fixed. The MSstats report from Skyline uses a column scheme close to
the MSstats input format. However, it contains several extra columns, and
some columns need to be renamed.
## ProteinName PeptideSequence PeptideModifiedSequence
## 1 P60174 ELASQPDVDGFLVGGASLKPEFVDIINAK ELASQPDVDGFLVGGASLKPEFVDIINAK
## 2 P60174 ELASQPDVDGFLVGGASLKPEFVDIINAK ELASQPDVDGFLVGGASLKPEFVDIINAK
## 3 P60174 ELASQPDVDGFLVGGASLKPEFVDIINAK ELASQPDVDGFLVGGASLKPEFVDIINAK
## 4 P60174 ELASQPDVDGFLVGGASLKPEFVDIINAK ELASQPDVDGFLVGGASLKPEFVDIINAK
## 5 P60174 ELASQPDVDGFLVGGASLKPEFVDIINAK ELASQPDVDGFLVGGASLKPEFVDIINAK
## 6 P60174 ELASQPDVDGFLVGGASLKPEFVDIINAK ELASQPDVDGFLVGGASLKPEFVDIINAK
## PrecursorCharge PrecursorMz FragmentIon ProductCharge ProductMz
## 1 3 1010.533 y10 1 1145.62
## 2 3 1010.533 y10 1 1145.62
## 3 3 1010.533 y10 1 1145.62
## 4 3 1010.533 y10 1 1145.62
## 5 3 1010.533 y10 1 1145.62
## 6 3 1010.533 y10 1 1145.62
## IsotopeLabelType Condition BioReplicate
## 1 light S1 S1
## 2 light S1 S1
## 3 light S1 S1
## 4 light S2 S2
## 5 light S2 S2
## 6 light S2 S2
## FileName Area StandardType Truncated
## 1 B_D140314_SGSDSsample1_R01_MHRM_T0.raw 17578982 False
## 2 B_D140314_SGSDSsample1_R02_MHRM_T0.raw 19800498 False
## 3 B_D140314_SGSDSsample1_R03_MHRM_T0.raw 16162569 False
## 4 B_D140314_SGSDSsample2_R01_MHRM_T0.raw 19254086 False
## 5 B_D140314_SGSDSsample2_R02_MHRM_T0.raw 16377574 False
## 6 B_D140314_SGSDSsample2_R03_MHRM_T0.raw 15045770 False
## DetectionQValue
## 1 3.5156850231032877E-07
## 2 2.2968222879171662E-07
## 3 1.0004539490182651E-06
## 4 3.2378503078689391E-07
## 5 2.3675326588090684E-07
## 6 9.7241468210995663E-07
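A quick way to see which columns still need attention is to compare the report's column names with the required MSstats names. The sketch below uses base R only; the skyline_cols vector simply transcribes the column names from the output printed above, and the variable names are illustrative.

```r
# Column names required by MSstats, as listed in the text.
required <- c("ProteinName", "PeptideSequence", "PrecursorCharge",
              "FragmentIon", "ProductCharge", "IsotopeLabelType",
              "Condition", "BioReplicate", "Run", "Intensity")

# Column names as they appear in the Skyline report printed above.
skyline_cols <- c("ProteinName", "PeptideSequence", "PeptideModifiedSequence",
                  "PrecursorCharge", "PrecursorMz", "FragmentIon",
                  "ProductCharge", "ProductMz", "IsotopeLabelType",
                  "Condition", "BioReplicate", "FileName", "Area",
                  "StandardType", "Truncated", "DetectionQValue")

# Required columns that are absent, and report columns MSstats does not require.
missing_cols <- setdiff(required, skyline_cols)
extra_cols   <- setdiff(skyline_cols, required)

print(missing_cols)
print(extra_cols)
```

In this report, Run and Intensity are the required names that are absent, which is consistent with the statement that some columns need to be renamed.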
The SkylinetoMSstatsFormat function pre-processes Skyline output into the format required for MSstats
input. For example, it removes iRT proteins, renames some columns, and replaces truncated peak
intensities with NA. Another important step for SWATH/DIA experiments is to filter the data by q-value (column named
DetectionQValue) before using dataProcess. The option filter_with_Qvalue=TRUE
replaces the Intensity value with zero in rows where DetectionQValue is greater than
the value of the qvalue_cutoff option of the SkylinetoMSstatsFormat function.
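The q-value rule described above can be sketched in a few lines of base R. The helper filter_by_qvalue and the toy values are hypothetical; the code only mirrors the described behavior (intensity set to zero when the q-value exceeds the cutoff) and is not the code of SkylinetoMSstatsFormat. The cutoff of 0.01 is used here purely as an illustrative value.

```r
# Hypothetical helper: set Intensity to 0 where DetectionQValue exceeds the cutoff.
filter_by_qvalue <- function(df, qvalue_cutoff = 0.01) {
  above <- !is.na(df$DetectionQValue) & df$DetectionQValue > qvalue_cutoff
  df$Intensity[above] <- 0
  df
}

# Toy data: the first row passes the cutoff, the other two do not.
toy <- data.frame(
  Intensity       = c(17578982, 19800498, 16162569),
  DetectionQValue = c(3.5e-07, 0.02, 0.5)
)

filtered <- filter_by_qvalue(toy, qvalue_cutoff = 0.01)
print(filtered)
```

Because the filtered values become 0 rather than NA, they can later be treated as censored missing values in dataProcess.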
Here is the summary of pre-processing steps for SWATH/DIA experiment in SkylinetoMSstatsFormat
function (in orange box below).
## check options for converting format
?SkylinetoMSstatsFormat
## 3 S1 B_D140314_SGSDSsample1_R03_MHRM_T0.raw 16162569
## 4 S2 B_D140314_SGSDSsample2_R01_MHRM_T0.raw 19254086
## 5 S2 B_D140314_SGSDSsample2_R02_MHRM_T0.raw 16377574
## 6 S2 B_D140314_SGSDSsample2_R03_MHRM_T0.raw 15045770
## StandardType Truncated DetectionQValue
## 1 False 3.515685e-07
## 2 False 2.296822e-07
## 3 False 1.000454e-06
## 4 False 3.237850e-07
## 5 False 2.367533e-07
## 6 False 9.724147e-07
In dataProcess, users need to use censoredInt='0' for Skyline output, which distinguishes between
NA as randomly missing and 0 as censored missing, as described in section 4.2.3. The same summarization
method and imputation options as for DDA experiments (section 4.2.3) are recommended for SWATH/DIA
experiments. The featureSubset option, which uses a subset of features, can be used for SWATH/DIA experiments,
which have a relatively large number of features per protein.
bruderer.skyline.proposed <- dataProcess(quant,
normalization='equalizeMedian',
summaryMethod="TMP",
cutoffCensored="minFeature", censoredInt="0",
MBimpute=TRUE,
maxQuantileforCensored=0.999)
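To illustrate what using a subset of features per protein can mean, the following base-R sketch keeps the N features with the highest average log2 intensity across runs within each protein. This is only an illustration of the idea on hypothetical toy data; it is not the MSstats implementation of featureSubset, and the names toy, top_n, keep, and subset_data are assumptions.

```r
# Toy data: one protein, four features, two runs each (values are made up).
toy <- data.frame(
  ProteinName = "P1",
  Feature     = rep(c("f1", "f2", "f3", "f4"), each = 2),
  Run         = rep(c("r1", "r2"), times = 4),
  Intensity   = c(1000, 1200, 50, 60, 8000, 7600, 300, 280),
  stringsAsFactors = FALSE
)

# Average log2 intensity of each feature across runs.
toy$log2Int <- log2(toy$Intensity)
avg <- aggregate(log2Int ~ ProteinName + Feature, data = toy, FUN = mean)

# Within each protein, keep the top_n features by average log2 intensity.
top_n <- 3
keep <- do.call(rbind, lapply(split(avg, avg$ProteinName), function(d) {
  head(d[order(d$log2Int, decreasing = TRUE), ], top_n)
}))

# Restrict the feature-level data to the selected features.
subset_data <- toy[paste(toy$ProteinName, toy$Feature) %in%
                     paste(keep$ProteinName, keep$Feature), ]
print(subset_data)
```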
This section describes steps and considerations to properly format data processed by Spectronaut for
SWATH/DIA experiments, prior to the MSstats analysis. In the following example, the same raw files as in
section 5.2 for the profiling standard sample set (Bruderer et al. 2015) are quantified by Spectronaut.
We can read the file as above. Here, instead, we load the R data file, which contains exactly the same data as the
Bruderer.spectronaut.xls file.
# Load R data, which is converted, output from spectronaut : Bruderer.SN.RData
load("Bruderer.SN.RData")
raw <- Bruderer.SN
head(raw)
## 3 1.381949 6.769939e-01
## 4 446.975891 2.180690e+03
## 5 1661.265259 9.115102e+03
## 6 3084.851807 1.311769e+04
The input data for MSstats is required to contain the variables ProteinName, PeptideSequence,
PrecursorCharge, FragmentIon, ProductCharge, IsotopeLabelType, Condition, BioReplicate, Run, and
Intensity. These variable names are fixed. Therefore, we need to select the useful columns and
rename them. Several filtering steps are also required. The SpectronauttoMSstatsFormat function
pre-processes Spectronaut output into the format required for MSstats input. First, it uses only noloss
from F.FrgLossType. Otherwise, multiple measurements for a feature and run can occur. Spectronaut
provides the column F.ExcludedFromQuantification, which is based on XIC quality such as interference
between chromatograms. Only features with F.ExcludedFromQuantification == 'False' should be
used. PG.ProteinGroups is used for ProteinName. EG.ModifiedSequence is used for PeptideSequence.
FG.Charge is used for PrecursorCharge. F.FrgIon is used for FragmentIon. F.Charge is used for
ProductCharge. F.PeakArea is used for Intensity by default. Then several filtering steps are
performed.
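The row filters and the column mapping described above can be written out explicitly in base R. The sketch below uses a hypothetical three-row table (toy_sn) with made-up values; it illustrates the described steps only and is not the code of SpectronauttoMSstatsFormat.

```r
# Toy Spectronaut-like table (values are made up for illustration).
toy_sn <- data.frame(
  PG.ProteinGroups             = c("A0AVT1", "A0AVT1", "A0AVT1"),
  EG.ModifiedSequence          = c("_VVQTDETAR_", "_VVQTDETAR_", "_LATSISETLEEK_"),
  FG.Charge                    = c(2, 2, 2),
  F.FrgIon                     = c("y5", "y5", "y8"),
  F.Charge                     = c(1, 1, 1),
  F.FrgLossType                = c("noloss", "H2O", "noloss"),
  F.ExcludedFromQuantification = c("False", "False", "True"),
  F.PeakArea                   = c(3084.85, 120.5, 182.77),
  stringsAsFactors = FALSE
)

# Keep only noloss fragments that are not excluded from quantification.
kept <- toy_sn[toy_sn$F.FrgLossType == "noloss" &
                 toy_sn$F.ExcludedFromQuantification == "False", ]

# Map Spectronaut column names to the MSstats names given in the text.
column_map <- c(PG.ProteinGroups    = "ProteinName",
                EG.ModifiedSequence = "PeptideSequence",
                FG.Charge           = "PrecursorCharge",
                F.FrgIon            = "FragmentIon",
                F.Charge            = "ProductCharge",
                F.PeakArea          = "Intensity")

quant_sketch <- kept[, names(column_map)]
names(quant_sketch) <- unname(column_map)
print(quant_sketch)
```

Only the first toy row survives both filters, so the result has a single row with the six renamed columns.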
Here is the summary of pre-processing steps for SWATH/DIA experiment in SpectronauttoMSstatsFormat
function (in orange box below).
## ** Intensities with great than 0.01 in EG.Qvalue are replaced with zero.
## ** All peptides are unique peptides in proteins.
## ** No multiple measurements in a feature and a run.
## now 'quant' is ready for MSstats
head(quant)
## 6 A0AVT1 _VVQTDETAR_ 2 y5
## 10 A0AVT1 _LATSISETLEEK_ 2 y8
## 11 A0AVT1 _LATSISETLEEK_ 2 y4
## 13 A0AVT1 _VC[+57]PTTETIYNDEFYTK_ 2 y8
## 14 A0AVT1 _VC[+57]PTTETIYNDEFYTK_ 2 y14
## ProductCharge Condition Run BioReplicate
## 1 1 SGSDSsample1 B_D140314_SGSDSsample1_R01_MHRM 1
## 6 1 SGSDSsample1 B_D140314_SGSDSsample1_R01_MHRM 1
## 10 1 SGSDSsample1 B_D140314_SGSDSsample1_R01_MHRM 1
## 11 1 SGSDSsample1 B_D140314_SGSDSsample1_R01_MHRM 1
## 13 1 SGSDSsample1 B_D140314_SGSDSsample1_R01_MHRM 1
## 14 2 SGSDSsample1 B_D140314_SGSDSsample1_R01_MHRM 1
## Intensity IsotopeLabelType
## 1 15882.3604 L
## 6 3084.8518 L
## 10 182.7725 L
## 11 667.2112 L
## 13 2555.1445 L
## 14 1747.8701 L
In dataProcess, users need to use censoredInt='0' for Spectronaut output. Spectronaut output contains
very few NA values. After the q-value filter is applied, zero intensities are generated, and those should be
imputed. The same summarization method and imputation options as for DDA experiments (section 4.2.3)
are recommended for SWATH/DIA experiments. The featureSubset option, which uses a subset of features, can be
used for SWATH/DIA experiments, which have a relatively large number of features per protein.
bruderer.spectronaut.proposed <- dataProcess(quant,
normalization='equalizeMedian',
summaryMethod="TMP",
cutoffCensored="minFeature",
censoredInt="0",
MBimpute=TRUE,
maxQuantileforCensored=0.999)
This section describes steps and considerations to properly format data processed by OpenSWATH for
SWATH experiments, prior to the MSstats analysis. In the following example, the dataset processed and
quantified by OpenSWATH, available as supplementary material in (Röst et al. 2014), is used.
The R package SWATH2stats, available in Bioconductor, reformats SWATH data from OpenSWATH into the MSstats
input format (Blattmann, Heusel, and Aebersold 2016).
library(SWATH2stats)
sep="\t", header=TRUE)
# Users should prepare an annotation file that includes FileName, Condition, BioReplicate and Run,
# and read it in.
annot <- read.csv('DIA_Rost2014_annotation.csv')
head(annot)
## 11 AQUA4SWATH_HMLangeC_LDASLPALLLIR(UniMod:267)/2_run0_split_napedro_L120417_001_SW_combined.f
## 101 AQUA4SWATH_MouseSabido_QEPAAPSLSPAVSAK(UniMod:259)/2_run0_split_napedro_L120417_001_SW_combined.f
## 170 AQUA4SWATH_Lepto_AIAEEVPK(UniMod:259)/2_run0_split_napedro_L120417_001_SW_combined.f
## decoy main_var_xx_swath_prelim_score var_bseries_score
## 11 FALSE 1.053124 1
## 101 FALSE 1.407187 1
## 170 FALSE 2.592268 4
## var_elution_model_fit_score var_intensity_score
## 11 -0.5072533 0.022184300
## 101 0.4999997 0.007779326
## 170 0.9660140 0.106216673
## var_isotope_correlation_score var_isotope_overlap_score
## 11 0.8245458 0.00000000
## 101 0.7895422 0.00000000
## 170 0.9110748 0.06904215
## var_library_corr var_library_rmsd var_log_sn_score var_massdev_score
## 11 0.9989755 0.06604379 2.014903 4.633929
## 101 0.9768712 0.04136264 1.230344 2.806702
## 170 0.9688958 0.12375189 2.521098 6.274398
## var_massdev_score_weighted var_norm_rt_score var_xcorr_coelution
## 11 16.087377 0.02200757 3.648683
## 101 3.424393 0.05639886 1.832796
## 170 6.682840 0.03075860 1.859502
## var_xcorr_coelution_weighted var_xcorr_shape var_xcorr_shape_weighted
## 11 0.7401841 0.1000000 0.7532720
## 101 0.2552277 0.6000000 0.8723862
## 170 0.2428034 0.8764185 0.9650843
## var_yseries_score
## 11 1
## 101 1
## 170 3
## transition
## 11 AQUA4SWATH_HMLangeC_LDASLPALLLIR(UniMod:267)/2_run0_split_napedro_L120417_001_SW_combined.f
## 101 AQUA4SWATH_MouseSabido_QEPAAPSLSPAVSAK(UniMod:259)/2_run0_split_napedro_L120417_001_SW_combined.f
## 170 AQUA4SWATH_Lepto_AIAEEVPK(UniMod:259)/2_run0_split_napedro_L120417_001_SW_combined.f
## run_id
## 11 1_1_split_napedro_L120417_001_SW_combined.featureXML
## 101 1_1_split_napedro_L120417_001_SW_combined.featureXML
## 170 1_1_split_napedro_L120417_001_SW_combined.featureXML
## Filename RT
## 11 split_napedro_L120417_001_SW_combined.featureXML 5706.057
## 101 split_napedro_L120417_001_SW_combined.featureXML 2130.896
## 170 split_napedro_L120417_001_SW_combined.featureXML 1632.214
## id Sequence m.z Intensity assay_rt
## 11 f_11766711863966126076 LDASLPALLLIR 652.912 130 5620.726
## 101 f_7536148822812640906 QEPAAPSLSPAVSAK 730.895 240 2326.711
## 170 f_6525552705944724733 AIAEEVPK 432.749 68842 1530.658
## delta_rt leftWidth norm_RT nr_peaks peak_apices_sum rightWidth
## 11 85.33057 5702.47 116.80076 4 70 5716.13
## 101 -195.81477 2128.42 12.96011 4 150 2135.25
## 170 101.55614 1618.53 -1.52414 4 14452 1652.67
## rt_score sn_ratio total_xic dotprod_score library_dotprod
## 11 2.200757 7.500000 5860 0.4210968 0.9316182
## 101 5.639886 3.422408 30851 0.7415526 0.9619197
## 170 3.075860 12.442253 648128 0.8172930 0.9471398
## library_manhattan manhatt_score xx_lda_prelim_score
## 11 0.7964739 1.2636574 2.647497
## 101 0.2837612 0.7505515 3.030819
## 170 0.3700136 0.6605900 4.604896
## xx_swath_prelim_score aggr_Peak_Apex log10_total_xic LD1
## 11 0 NA;NA;NA;NA 3.767898 1.978064
## 101 0 NA;NA;NA;NA 4.489269 1.245393
## 170 0 NA;NA;NA;NA 5.811661 1.149967
## peak_group_rank d_score m_score
## 11 1 3.399867 0.0007289805
## 101 1 2.649140 0.0071091755
## 170 1 2.551362 0.0093207975
data.transition <- disaggregate(data.filtered)
## One or several columns required by MSstats were not in the data. The columns were created and filled
## Missing columns: ProductCharge, IsotopeLabelType
## IsotopeLabelType was filled with light.
## Warning in convert4MSstats(data.transition): Intensity values that were 0,
## were replaced by NA
## now 'MSstats.input' is ready for MSstats
head(MSstats.input)
## 2 split_napedro_L120417_001_SW_combined.featureXML
## 3 split_napedro_L120417_001_SW_combined.featureXML
## 4 split_napedro_L120417_002_SW_combined.featureXML
## 5 split_napedro_L120417_002_SW_combined.featureXML
## 6 split_napedro_L120417_002_SW_combined.featureXML
In dataProcess, users need to use censoredInt='NA' for OpenSWATH output. The same summarization
method and imputation options as for DDA experiments (section 4.2.3) are recommended for SWATH/DIA
experiments. The featureSubset option, which uses a subset of features, can be used for SWATH/DIA experiments,
which have a relatively large number of features per protein.
goldstandard.proposed <- dataProcess(MSstats.input,
normalization='equalizeMedian',
summaryMethod="TMP",
cutoffCensored="minFeature", censoredInt="NA",
MBimpute=TRUE,
maxQuantileforCensored=0.999)
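The conversion warning shown earlier states that intensity values of 0 were replaced by NA, which is why NA is the censored value for this input. The effect of that replacement can be reproduced on a hypothetical toy vector in base R; this is only an illustration and not the SWATH2stats code.

```r
# Toy intensities containing zeros (values are made up for illustration).
toy <- data.frame(Intensity = c(130, 0, 68842, 0))

# Replace zero intensities with NA, as the conversion warning describes.
toy$Intensity[toy$Intensity == 0] <- NA

# After the replacement, all missing values are encoded as NA.
n_na <- sum(is.na(toy$Intensity))
cat("NA intensities:", n_na, "\n")
```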
Reference
Benjamini, Y., and Y. Hochberg. 1995. “Controlling the false discovery rate: a practical and powerful
approach to multiple testing.” J.R. Statist. Soc. B 57 (1): 289–300.
Blattmann, P., M. Heusel, and R. Aebersold. 2016. “SWATH2stats: An R/Bioconductor Package to Process
and Convert Quantitative SWATH-MS Proteomics Data for Downstream Analysis Tools.” PLoS ONE 11 (4).
doi:10.1371/journal.pone.0153160.
Bruderer, R., O. M. Bernhardt, T. Gandhi, S. M Miladinović, L.-Y. Cheng, S. Messner, T. Ehrenberger, et
al. 2015. “Extending the limits of quantitative proteome profiling with Data-Independent Acquisition and
application to acetaminophen-treated three-dimensional liver microtissues.” Mol. Cell. Proteomics 14 (5):
1400–1410.
Cox, J., M. Y. Hein, C. A. Luber, I. Paron, N. Nagaraj, and M. Mann. 2014. “Accurate proteome-wide
label-free quantification by delayed normalization and maximal peptide ratio extraction, termed MaxLFQ.”
Mol. Cell. Proteomics 13 (9): 2513–26.
Cox, Jürgen, and Matthias Mann. 2008. “MaxQuant enables high peptide identification rates, individualized
p.p.b.-range mass accuracies and proteome-wide protein quantification.” Nature Biotechnology 26 (12):
1367–72.
Gatto, L., and K. S. Lilley. 2012. “MSnbase-an R/Bioconductor package for isobaric tagged mass spectrometry
data visualization, processing and quantitation.” Bioinformatics 28: 288–89.
MacLean, B., D. M. Tomazela, N. Shulman, M. Chambers, G. Finney, B. Frewen, R. Kern, D. L. Tabb, D. C.
Liebler, and M. J. MacCoss. 2010. “Skyline: An open source document editor for creating and analyzing
targeted proteomics experiments.” Bioinformatics 26 (7): 966–68.
Mueller, L. N., O. Rinner, A. Schmidt, S. Letarte, B. Bodenmiller, M.-Y. Brusniak, O. Vitek, R. Aebersold,
and M. Müller. 2007. “SuperHirn - a novel tool for high resolution LC-MS-based peptide/protein profiling.”
Proteomics 7: 3470–80.
Röst, H. L., G. Rosenberger, P. Navarro, L. Gillet, S. M. Miladinović, O. T. Schubert, W. Wolski, et al.
2014. “OpenSWATH enables automated, targeted analysis of data-independent acquisition MS data.” Nat.
Biotechnol. 32 (3): 219–23.
Sturm, Marc, Andreas Bertsch, Clemens Gröpl, Andreas Hildebrandt, Rene Hussong, Eva Lange, Nico Pfeifer,
et al. 2008. “OpenMS – An open-source software framework for mass spectrometry.” BMC Bioinformatics 9
(1): 163.