Selene: A PyTorch-Based Deep Learning Library for Sequence Data
https://doi.org/10.1038/s41592-019-0360-8
To enable the application of deep learning in biology, we present Selene (https://selene.flatironinstitute.org/), a PyTorch-based deep learning library for fast and easy development, training, and application of deep learning model architectures for any biological sequence data. We demonstrate on DNA sequences how Selene allows researchers to easily train a published architecture on new data, develop and evaluate a new architecture, and use a trained model to answer biological questions of interest.

Deep learning describes a set of machine learning techniques that use stacked neural networks to extract complicated patterns from high-dimensional data [1]. These techniques are widely used for image classification and natural language processing, and have led to very promising advances in the biomedical domain, including genomics and chemical synthesis [1–3]. In regulatory genomics, networks trained on high-throughput sequencing data (for example, ChIP-seq), or ‘sequence-based models’, have become the de facto standard for predicting the regulatory and disease impact of mutations [4–7]. While deep-learning-related publications are often accompanied by the associated pre-trained model [6,8,9], a key challenge in both developing new deep learning architectures and training existing architectures on new data is the lack of a comprehensive, generalizable, and user-friendly deep learning library for biology.

Beyond regulatory genomics, sequence-level deep learning models have broad promise in a wide range of research areas, including recent advances on prediction of disease risk of missense mutations in proteins [10] and potential applications to, for example, predicting target site accessibility in genome editing. We must enable the adoption and active development of deep-learning-based methods in biomedical sciences. For example, a biomedical scientist excited by a publication of a model capable of predicting the disease-associated effect of mutations should be able to train a similar model on their own ChIP-seq data focused on their disease of interest. A bioinformatician interested in developing new model architectures should be able to experiment with different architectures and evaluate all of them on the same data. Currently, this requires advanced knowledge specific to deep learning [2,11], substantial new code development, and associated time investment far beyond what most biomedical scientists are able to commit.

Here we present Selene, a framework for developing sequence-level deep learning networks that provides biomedical scientists with comprehensive support for model training, evaluation, and application across a broad range of biological questions. Sequence-level data refers to any type of biological sequence such as DNA, RNA, or protein sequences and their measured properties (for example, binding of transcription factors or RNA-binding proteins, or DNase sensitivity). Selene contains modules for (1) data sampling and training for model development (Fig. 1a) and (2) prediction and visualization for analyses using the trained model (Fig. 1b,c). With Selene, researchers can run model development and analysis workflows out-of-the-box. For more advanced use cases, Selene provides templates for extending modules within each workflow so that users can adapt the library to their particular research questions.

There has been recent work to make deep learning in biology more accessible: DragoNN is a toolkit for teaching deep learning in regulatory genomics; pysster [12] is a Python package for training convolutional neural networks on biological sequence data; and Kipoi [13] is a framework to archive, use, and build on published predictive models in genomics. These resources constitute the nascent software ecosystem for sequence-level deep learning. Selene is our contribution to this ecosystem. Selene supports general model development not constrained to a particular architecture (in contrast to pysster) or task (in contrast to DragoNN) and is designed for users with different levels of computational experience. Users are supported in tasks ranging from simply applying an existing model, to retraining it on new data (tasks also supported by Kipoi), to developing new model architectures (a task that is challenging to do with any other tool). The models developed using Selene can be shared and used through the Kipoi framework.

To demonstrate Selene’s capabilities for developing and evaluating sequence-level deep learning models, we use it to (1) train a published architecture on new data; (2) develop, train, and evaluate a new model (improving a published model); and (3) apply a trained model to data and visualize the resulting predictions in the case studies that follow.

In the first case study, a researcher wants to use the DeepSEA [4] model architecture as a starting point and train the model on different data. Selene is completely general and a user can easily use or specify any model of their choice using modules in PyTorch.

Suppose a cancer researcher is interested in modeling the regulatory elements of the transcription factor GATA1, specifically focusing on proerythroblasts in bone marrow. This is a tissue-specific genomic feature that DeepSEA does not predict. The researcher downloads peak data from Cistrome [14] and a reference genome FASTA file. Once a researcher formats the data to match the documented inputs [15] and fills out the necessary training parameters (for example, batch size or learning rate), they can use Selene to train the DeepSEA architecture on their data with no new lines of Python code. In this example, they find that the model obtains an area under the curve (AUC) of 0.942 on this feature (Fig. 2a).

Selene automatically generates training, testing, and validation samples from the provided input data. The samples generated for each partition can be saved and used in subsequent model development so that comparisons can be made across models with different
¹Flatiron Institute, Simons Foundation, New York, NY, USA. ²Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA. ³Graduate Program in Quantitative and Computational Biology, Princeton University, Princeton, NJ, USA. ⁴Department of Computer Science, Princeton University, Princeton, NJ, USA. ⁵These authors contributed equally: Kathleen M. Chen, Evan M. Cofer. *e-mail: [email protected]
[Figure 1 graphic. Panel a, ‘Model training and evaluation’: a model architecture (for example, DeepSEA) as input and a performance plot with false positive rate on the x axis. Panel b: a variants VCF file and a trained model are passed to ‘Run Selene’, producing a plot of predicted effect against genome coordinates (chr1 to chr22). Panel c, ‘In silico mutagenesis’: a sequences FASTA file and a trained model are passed to ‘Run Selene’.]
Fig. 1 | Overview of Selene. a, As input, the library accepts (left) the model architecture, dataset and (middle) a configuration file that specifies the
necessary input data paths and training parameters. Selene automatically splits the data into training and validation/testing, trains the model, evaluates
it, and (right) generates figures from the results. b, Selene supports variant effect prediction with the same configuration file format and includes
functionality to visualize the variants and their difference scores as a Manhattan plot, where a user can hover over each point to see variant information. c,
Selene calculates mutation effect scores and visualizes the scores as a heat map.
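As a rough illustration of the data split described in panel a, the sketch below assigns labeled genomic intervals to training, validation, and test partitions by holding out whole chromosomes, in the spirit of the chromosomal holdout mentioned in the text. It is not Selene's sampler: the function name `split_by_chromosome`, the interval tuples, and the choice of held-out chromosomes are all assumptions made for this example.

```python
def split_by_chromosome(intervals, validation_chroms, test_chroms):
    """Assign (chrom, start, end, feature) intervals to partitions.

    Whole chromosomes are held out, so no chromosome contributes
    intervals to more than one partition.
    """
    partitions = {"train": [], "validation": [], "test": []}
    for interval in intervals:
        chrom = interval[0]
        if chrom in test_chroms:
            partitions["test"].append(interval)
        elif chrom in validation_chroms:
            partitions["validation"].append(interval)
        else:
            partitions["train"].append(interval)
    return partitions


# Hypothetical peak coordinates, for illustration only.
peaks = [
    ("chr1", 100, 300, "GATA1"),
    ("chr6", 500, 700, "GATA1"),
    ("chr8", 150, 350, "GATA1"),
    ("chr9", 900, 1100, "GATA1"),
]

split = split_by_chromosome(
    peaks,
    validation_chroms={"chr6"},      # assumed holdout, not from the paper
    test_chroms={"chr8", "chr9"},    # assumed holdout, not from the paper
)
for name, members in split.items():
    print(name, [interval[0] for interval in members])
```

Saving the three lists once and reusing them for every architecture is what makes later model comparisons run on identical partitions.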
architectures and/or parameters. Further, Selene automatically evaluates the model on the test set after training and, in this case, generates figures to visualize the model’s performance as receiver operating characteristic and average precision curves.

Now that the researcher has a trained model, they can use Selene to apply in silico mutagenesis, converting every position in the sequence to every other possible base [4] (DNA and RNA) or amino acid (protein sequences), to a set of GATA1 sequences drawn from the test set and examine the consequences of these ‘mutations’. Selene supports visualization of the outputs of in silico mutagenesis as a heat map and/or motif plot. By visualizing the log2 fold change for these sequences in a heat map, the researcher can see that the model detects disruptions in binding at the GATA motif (Fig. 2b).

We provide the code and results for this example in Selene’s GitHub repository (https://github.com/FunctionLab/selene; see case 1 in the ‘manuscript’ folder).

In another use case, a researcher may want to develop and train a new model architecture. For example, a bioinformatician might want to modify a published model architecture to see how that affects performance. First, the researcher uses modules in PyTorch to specify the model architecture they are interested in evaluating; in this case study, they try to enhance the DeepSEA architecture with batch normalization and three additional convolutional layers. The researcher specifies parameters for training and the paths to the model architecture and data in a configuration file and passes this as input to the library’s command-line interface (CLI). Training is automatically completed by Selene; afterward,
[Figure 2 graphic. Panel a: true positive rate against false positive rate. Panel b: two heat maps of bases (A, C, G, T) against position in sequence (400 to 600); one is labeled ‘Center 200 bp of 1,000 bp sequence: chr8 (89567149, 89568149)’.]
Fig. 2 | Visualizations generated by using Selene to train and apply a model to sequences. a, Selene visualization of the performance of the model trained in the first case study. b, Selene visualization of in silico mutagenesis on the case-study-trained model for 20 randomly selected GATA1 sequences in the test set (two representative plots displayed here; all heat maps generated are displayed in the example Jupyter notebook, https://github.com/FunctionLab/selene/blob/master/manuscript/case1/3_visualize_ism_outputs.ipynb). Bases in the original sequence are distinguished by the gray stripes in the heat map cells.
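The in silico mutagenesis shown in panel b can be sketched in a few lines: substitute every position with each alternative base, re-score the sequence, and record the log2 fold change relative to the reference prediction. This is a minimal illustration rather than Selene's implementation. The names `in_silico_mutagenesis` and `toy_gata_model` are hypothetical, the toy scoring function merely stands in for a trained network, and defining the effect as the log2 ratio of predicted probabilities is an assumption of this example.

```python
import math

BASES = "ACGT"


def in_silico_mutagenesis(sequence, predict):
    """Score every single-base substitution of `sequence`.

    Returns a dict mapping (position, alternate_base) to the log2 fold
    change of the mutated prediction relative to the reference prediction.
    """
    reference_score = predict(sequence)
    effects = {}
    for position, ref_base in enumerate(sequence):
        for alt_base in BASES:
            if alt_base == ref_base:
                continue  # the reference base is not a mutation
            mutated = sequence[:position] + alt_base + sequence[position + 1:]
            effects[(position, alt_base)] = math.log2(
                predict(mutated) / reference_score
            )
    return effects


def toy_gata_model(sequence):
    """Stand-in for a trained model: high score only if 'GATA' is present."""
    return 0.9 if "GATA" in sequence else 0.1


effects = in_silico_mutagenesis("CCGATACC", toy_gata_model)

# Positions where at least one substitution lowers the predicted score.
disruptive_positions = sorted(
    {position for (position, _), change in effects.items() if change < 0}
)
print(disruptive_positions)
```

Arranging the returned values as a 4 × L grid (alternate base by position) gives the kind of matrix that a heat map like the one in panel b displays.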
[Figure 3 graphic. Panel a: true positive rate against false positive rate. Panel b: Gaussian-transformed predicted effect scores (y axis, about −0.24 to −0.20) for two variant groups, ‘GWAS nominally significant’ and ‘GWAS nonsignificant’.]
Fig. 3 | Using Selene to train a model and obtain model predictions for variants in an Alzheimer’s GWAS study. a, Selene visualization of the performance of the trained six-convolutional-layer model. b, We visualize the mean and 95% confidence intervals of the quantile-normalized (against the Gaussian distribution) predicted effect scores of the two variant groups for the genomic feature H3K36me3 in K562 cells, the feature in the model with the most significant difference (one-sided Wilcoxon rank-sum test, adjusted P value using Benjamini–Hochberg of 3.89 × 10⁻⁶⁷). After applying the multiple testing correction, 914 of the 919 genomic features that the model predicts showed a significant difference (α < 0.05) between the groups. SNP, single-nucleotide polymorphism.
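The comparison in panel b combines two standard procedures: a one-sided Wilcoxon rank-sum test per genomic feature, followed by Benjamini–Hochberg adjustment across features. The sketch below shows both in plain Python under simplifying assumptions: the rank-sum P value uses the normal approximation without tie or continuity corrections, and the input values are made up. It is not the authors' analysis script, and the function names are chosen for this example only.

```python
import math


def rank_sum_one_sided(x, y):
    """One-sided Wilcoxon rank-sum P value (alternative: x tends to exceed y).

    Uses average ranks for ties and the normal approximation of the
    U statistic, without tie or continuity corrections.
    """
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n_x, n_y = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    mean_u = n_x * n_y / 2
    sd_u = math.sqrt(n_x * n_y * (n_x + n_y + 1) / 12)
    z = (u_x - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail probability


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda index: p_values[index])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest P value down
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up predicted effect scores for one feature in the two variant groups.
nominally_significant = [0.9, 0.8, 0.7, 0.6]
nonsignificant = [0.5, 0.4, 0.3, 0.2]
p_value = rank_sum_one_sided(nominally_significant, nonsignificant)

# Made-up per-feature P values, adjusted together.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p_value, adjusted)
```

With group sizes in the hundreds of thousands, as in this case study, the normal approximation is the usual choice; a statistics library would also apply the tie correction omitted here.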
the researcher can easily use Selene to compare the performance of their new model to the original DeepSEA model on the same chromosomal holdout dataset.

In this case study, the researcher finds that the deeper architecture achieves an average AUC of 0.938 (Fig. 3a) and an average area under the precision recall curve (AUPRC) of 0.362, which is an improvement over the average AUC of 0.933 and AUPRC of 0.342 of the original three-convolutional-layer model. The researcher can share this model with a collaborator and upload it to the Kipoi [13] model zoo, a repository of trained models for regulatory genomics, with which Selene-trained models are fully compatible (an example is available in the GitHub repository at https://github.com/FunctionLab/selene/tree/master/manuscript/case2/3_kipoi_export).

In the final case study, a human geneticist studying Alzheimer’s wants to apply the six-convolutional-layer model developed in the previous case study, so they first assess its ability to prioritize potential disease-associated variants. Specifically, they use Selene to make variant effect predictions for nominally significant variants (P < 0.05, n = 422,398) and non-significant variants (P > 0.50, n = 3,842,725) reported in the International Genomics of Alzheimer’s Project [16] Alzheimer’s disease GWAS [17]. The researcher finds that the predicted effect is significantly higher for GWAS nominally significant variants than for non-significant variants, indicating that the new model is indeed able to prioritize potential disease-associated variants (one-sided Wilcoxon rank-sum test; the most significant feature, H3K36me3 in K562 cells, has an adjusted P value, by Benjamini–Hochberg correction, of 3.89 × 10⁻⁶⁷) (Fig. 3b).

Selene’s modeling capability extends far beyond the case studies described here. The library can be applied to not only DNA but also RNA and protein sequences, and not only chromatin data but any current genome-, transcriptome-, or even proteome-wide measurements. We developed Selene to increase the accessibility of deep learning in biology and facilitate the creation of reproducible workflows and results. Furthermore, Selene is open-source software that will continue to be updated and expanded on the basis of community and user feedback.

Online content
Any methods, additional references, Nature Research reporting summaries, source data, statements of data availability and associated accession codes are available at https://doi.org/10.1038/s41592-019-0360-8.
Received: 8 October 2018; Accepted: 20 February 2019; Published online: 28 March 2019

References
1. LeCun, Y., Bengio, Y. & Hinton, G. Nature 521, 436–444 (2015).
2. Ching, T. et al. J. R. Soc. Interface. 15, 20170387 (2018).
3. Segler, M. H. S., Preuss, M. & Waller, M. P. Nature 555, 604–610 (2018).
4. Zhou, J. & Troyanskaya, O. G. Nat. Meth. 12, 931–934 (2015).
5. Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Nat. Biotechnol. 33, 831–838 (2015).
6. Kelley, D. R., Snoek, J. & Rinn, J. L. Genome Res. 26, 990–999 (2016).
7. Angermueller, C., Lee, H. J., Reik, W. & Stegle, O. Genome Biol. 18, 67 (2017).
8. Kelley, D. R. et al. Genome Res. 28, 739–750 (2018).
9. Quang, D. & Xie, X. Nucleic Acids Res. 44, e107 (2016).
10. Sundaram, L. et al. Nat. Genet. 50, 1161–1170 (2018).
11. Min, S., Lee, B. & Yoon, S. Brief. Bioinform. 18, 851–869 (2017).
12. Budach, S. & Marsico, A. Bioinformatics 34, 3035–3037 (2018).
13. Avsec, Z. et al. bioRxiv Preprint at https://www.biorxiv.org/content/10.1101/375345v1 (2018).
14. Mei, S. et al. Nucleic Acids Res. 45, D658–D662 (2017).
15. Troyanskaya, O. G. et al. Selene CLI operations and outputs. Selene https://selene.flatironinstitute.org/overview/cli.html (2018).
16. Ruiz, A. et al. Transl. Psychiatry 4, e358 (2014).
17. Huang, K.-L. et al. Nat. Neurosci. 20, 1052–1061 (2017).

Acknowledgements
The authors acknowledge all members of the Troyanskaya lab for helpful discussions. In addition, the authors thank D. Simon for setting up the website and automating updates to the site. The authors are pleased to acknowledge that this work was performed using the high-performance computing resources at Simons Foundation and the TIGRESS computer center at Princeton University. This work was supported by NIH grants R01HG005998, U54HL117798, R01GM071966, and T32HG003284; HHS grant HHSN272201000054C; and Simons Foundation grant 395506, all to O.G.T. O.G.T. is a CIFAR fellow.

Author contributions
K.M.C. and J.Z. conceived the Selene library. K.M.C. and E.M.C. designed, implemented, and documented Selene. K.M.C. performed the analyses described in the manuscript. O.G.T. supervised the project. K.M.C., E.M.C., and O.G.T. wrote the manuscript.

Competing interests
The authors declare no competing interests.

Additional information
Supplementary information is available for this paper at https://doi.org/10.1038/s41592-019-0360-8.
Reprints and permissions information is available at www.nature.com/reprints.
Correspondence and requests for materials should be addressed to O.G.T.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2019
Case 2: developing a new architecture and making model comparisons. Steps to train ‘deeper DeepSEA’ on the same exact data as DeepSEA.
(1) Download the code and data bundle from the DeepSEA website (http://deepsea.princeton.edu/media/code/deepsea_train_bundle.v0.9.tar.gz). You only need the .mat files in this directory. We also include a file listing the 919 genomic features that the model predicts. This is from the resources directory in the standalone version of DeepSEA (http://deepsea.princeton.edu/media/code/deepsea.v0.94b.tar.gz). Zenodo record: https://zenodo.org/record/2214970/files/DeepSEA_data.tar.gz.
(2) Fill out the configuration file for Selene’s MultiFileSampler (https://selene.flatironinstitute.org/overview/cli.html#multiple-file-sampler) and specify the path to each .mat file for training, validation and testing.
(3) Run Selene.
Please see the DeepSEA publication [4] for details about data processing and training.
In the main text, we report test performance for the model trained using the online sampler. When training on the same exact data (the .mat files) as DeepSEA, we achieve an average AUC of 0.934 and an average AUPRC of 0.361.

Steps to download and format all the peak data from ENCODE and Roadmap Epigenomics.
(1) Download all chromatin feature profiles used for training DeepSEA, specified in Supplementary Table 1 of the DeepSEA manuscript [4] (https://zenodo.org/record/2214970/files/chromatin_profiles.tar.gz).
(2) For each file, keep the chromosome, start, and end columns. In addition, create a fourth column with the feature’s name. Concatenate all these files and create the distinct features file. We provide a Python script for this step on the GitHub page (https://github.com/FunctionLab/selene/blob/master/manuscript/case2/1_train_with_online_sampler/data/process_chromatin_profiles.py).
(3) Format the data according to the instructions in the ‘Getting started’ tutorial:
i. Sort the file by [chr, start, end]: sort -k1V -k2n -k3n <peak-coordinates-file> > <sorted-coordinates-file>.
ii. Compress the file: bgzip <sorted-coordinates-file>. This compresses the file to a .gz file in place. To separately generate the .gz file, run bgzip -c <sorted-coordinates-file> > <sorted-coordinates-file>.gz.
iii. Tabix index the file: tabix -p bed <sorted-coordinates-file>.gz.
A script containing these steps can be downloaded from https://github.com/FunctionLab/selene/blob/master/manuscript/case2/1_train_with_online_sampler/data/process_data.sh.
(4) Download the hg19 FASTA file (https://www.encodeproject.org/files/male.hg19/@@download/male.hg19.fasta.gz).
(5) Specify the model architecture, loss, and optimizer as a Python file: https://github.com/FunctionLab/selene/blob/master/selene_sdk/utils/example_model.py.

Case 3: applying a new model to variants.
(1) Download the single-nucleotide polymorphisms from the International Genomics of Alzheimer’s Project (https://www.niagads.org/igap-age-onset-survival-analyses-p-value-only).
(2) Group the variants into those with P values below 0.05 (significant) and those with P values above 0.50 (non-significant).
(3) Fill out the configuration file with the paths to the two variant files and the trained model weights file from case 2.
(4) Run Selene.
(5) Follow the script provided for this case to analyze the variant predictions (https://github.com/FunctionLab/selene/blob/master/manuscript/case3/2_variant_groups_comparison.sh).

Statistical analysis. Details of the statistical test used for case study 3 are specified in the associated text and figure legend (Fig. 3b).

Reporting Summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Code availability
Selene is open-source software (license BSD 3-Clause Clear). Project homepage: https://selene.flatironinstitute.org. GitHub: https://github.com/FunctionLab/selene. Archived version: https://github.com/FunctionLab/selene/archive/0.2.0.tar.gz.

Data availability
Cistrome [14], Cistrome file ID 33545, measurements from GSM970258: http://dc2.cistrome.org/api/downloads/eyJpZCI6IjMzNTQ1In0%3A1fujCu%3ArNvWLCNoET6o9SdkL8fEv13uRu4b/. ENCODE [21] and Roadmap Epigenomics [22] chromatin profiles: files listed in Supplementary Table 1 of ref. 4. IGAP age at onset survival [16,17]: https://www.niagads.org/datasets/ng00058 (P-values-only file). The case studies used processed datasets from these sources. They can be downloaded at the following Zenodo links: Cistrome, https://zenodo.org/record/2214130/files/data.tar.gz; ENCODE and Roadmap Epigenomics chromatin profiles, https://zenodo.org/record/2214970/files/chromatin_profiles.tar.gz; IGAP age at onset survival, https://zenodo.org/record/1445556/files/variant_effect_prediction_data.tar.gz. Source data for Figs. 2 and 3 are available online.

References
18. Li, H. et al. Bioinformatics 25, 2078–2079 (2009).
19. Li, H. Bioinformatics 27, 718–719 (2011).
20. ENCODE Project. Reference sequences. ENCODE: Encyclopedia of DNA Elements https://www.encodeproject.org/data-standards/reference-sequences/ (2016).
21. ENCODE Project Consortium. Nature 489, 57–74 (2012).
22. Kundaje, A. et al. Nature 518, 317–330 (2015).
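The peak-formatting steps for the ENCODE and Roadmap Epigenomics data (keep the chromosome, start, and end columns, append the feature name, concatenate, sort, and list the distinct features) can be sketched as follows. This is an illustration, not the authors' `process_chromatin_profiles.py`: the function names, the in-memory input, and the feature labels are assumptions, and `natural_key` only approximates the ordering produced by `sort -k1V`.

```python
import re


def natural_key(chrom):
    """Split 'chr10' into ['chr', 10, ''] so that chr2 sorts before chr10."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", chrom)]


def format_peaks(peak_files):
    """Merge per-feature peak files into sorted four-column rows.

    `peak_files` maps a feature name to an iterable of tab-delimited lines
    whose first three columns are chromosome, start, and end.
    Returns (rows, distinct_features).
    """
    rows = []
    for feature, lines in peak_files.items():
        for line in lines:
            if not line.strip():
                continue  # skip blank lines
            chrom, start, end = line.rstrip("\n").split("\t")[:3]
            rows.append((chrom, int(start), int(end), feature))
    # Sort by chromosome, then start, then end.
    rows.sort(key=lambda row: (natural_key(row[0]), row[1], row[2]))
    distinct_features = sorted({row[3] for row in rows})
    return rows, distinct_features


# Hypothetical input: two features, with extra columns that are dropped.
peak_files = {
    "K562|GATA1": ["chr2\t500\t700\t.\t900", "chr10\t100\t300\t.\t850"],
    "HepG2|CTCF": ["chr2\t50\t250\t.\t700"],
}

rows, distinct_features = format_peaks(peak_files)
for row in rows:
    print("\t".join(str(field) for field in row))
```

The sorted rows would then be written to a file and passed to `bgzip` and `tabix -p bed` as in step (3); those two commands are not reproduced in the sketch.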
Reporting Summary
Nature Research wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency
in reporting. For further information on Nature Research policies, see Authors & Referees and the Editorial Policy Checklist.
Statistical parameters
When statistical analyses are reported, confirm that the following items are present in the relevant location (e.g. figure legend, table legend, main
text, or Methods section).
n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement
An indication of whether measurements were taken from distinct samples or whether the same sample was measured repeatedly
The statistical test(s) used AND whether they are one- or two-sided
Only common tests should be described solely by name; describe more complex techniques in the Methods section.
For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
Give P values as exact values whenever suitable.
For Bayesian analysis, information on the choice of priors and Markov chain Monte Carlo settings
For hierarchical and complex designs, identification of the appropriate level for tests and full reporting of outcomes
Estimates of effect sizes (e.g. Cohen's d, Pearson's r), indicating how they were calculated
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors/reviewers upon request. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Research guidelines for submitting code & software for further information.
Data
Processed datasets from these sources are available at the following Zenodo links:
Cistrome:
https://zenodo.org/record/2214130/files/data.tar.gz
ENCODE and Roadmap Epigenomics chromatin profiles:
https://zenodo.org/record/2214970/files/chromatin_profiles.tar.gz
IGAP age at onset survival:
https://zenodo.org/record/1445556/files/variant_effect_prediction_data.tar.gz
Field-specific reporting
Please select the best fit for your research. If you are not sure, read the appropriate sections before making your selection.
Life sciences Behavioural & social sciences Ecological, evolutionary & environmental sciences
For a reference copy of the document with all sections, see nature.com/authors/policies/ReportingSummary-flat.pdf
Blinding: We did not need blinding since we had no randomized controlled trials.