Commit c7e7f8a: remove md
1 parent 7c066b8

3 files changed: +203 -8 lines changed

docs/source/overview/faq.rst

Lines changed: 52 additions & 3 deletions
FAQ and additional resources
============================

Extending Selene
----------------

The main modules that users may want to extend are:

* ``selene_sdk.samplers.OnlineSampler``
* ``selene_sdk.samplers.file_samplers.FileSampler``
* ``selene_sdk.sequences.Sequence``
* ``selene_sdk.targets.Target``

Please refer to the documentation for these classes.
If you encounter a bug or have a feature request, please post to our GitHub `issues <https://github.com/FunctionLab/selene/issues>`_. E-mail kchen@flatironinstitute.org if you are interested in becoming a contributor to Selene.

Join our `Google group <https://groups.google.com/forum/#!forum/selene-sdk>`_ if you have questions about the package, case studies, or model development.

Exporting a Selene-trained model to Kipoi
-----------------------------------------

We have provided an example of how to prepare a model for upload to `Kipoi's model zoo <http://kipoi.org/>`_ using a model trained during case study 2. You can use `this example <https://github.com/FunctionLab/selene/tree/master/manuscript/case2/3_kipoi_export>`_ as a starting point for preparing your own model for Kipoi. We have provided a script that can help to automate parts of the process.

We are also working on an export function that will be built into Selene and accessible through the CLI.

Hyperparameter optimization
---------------------------

Hyperparameter optimization is the process of finding the set of hyperparameters that yields an optimal model against a predefined score (e.g. minimizing a loss function).
Hyperparameters are the variables that govern the training process (i.e. these parameters are constant during training, in contrast to model parameters, which are optimized/"tuned" by the training process itself).
Hyperparameter tuning works by running multiple trials of a single training run with different values for your chosen hyperparameters, set within some specified limits. Some examples of hyperparameters:

* learning rate
* number of hidden units
* convolutional kernel size

You can select hyperparameters yourself (manually) or automatically.
For automatic hyperparameter optimization, you can look into grid search or random search.

Some resources that may be useful:

* `Hyperopt: Distributed Asynchronous Hyper-parameter Optimization <https://github.com/hyperopt/hyperopt>`_
* `skorch: a scikit-learn compatible neural network library that wraps PyTorch <https://github.com/dnouri/skorch>`_
* `Tune: scalable hyperparameter search <https://ray.readthedocs.io/en/latest/tune.html>`_
* `Spearmint <https://github.com/JasperSnoek/spearmint>`_
* `Weights & Biases <https://www.wandb.com/>`_
* `Comet.ml <https://www.comet.ml/>`_

To use hyperparameter optimization with models being developed in Selene, you could implement a method that runs Selene (via a command-line call) with a set of hyperparameters and then monitors the validation performance based on the output to ``selene_sdk.train_model.validation.txt``.
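As a sketch of that approach, the snippet below parses a validation log and scores a run by its best loss. The tab-delimited format with a ``loss`` column and the CLI invocation mentioned in the comment are assumptions for illustration, not Selene's documented interface; adjust them to match your configuration.

.. code-block:: python

    import csv
    import tempfile
    from pathlib import Path

    def best_validation_loss(log_path):
        """Return the smallest loss recorded in a tab-delimited validation log.

        Assumes the log has a header row containing a "loss" column
        (an assumed format for selene_sdk.train_model.validation.txt).
        """
        with open(log_path) as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            return min(float(row["loss"]) for row in reader)

    # A grid-search driver would run Selene once per hyperparameter setting,
    # e.g. subprocess.run(["python", "-m", "selene_sdk", config_path]), then
    # score each run with best_validation_loss(). Here we only demonstrate
    # the parsing step on a small synthetic log.
    log = Path(tempfile.mkdtemp()) / "validation.txt"
    log.write_text("loss\tAUC\n0.42\t0.91\n0.37\t0.93\n0.40\t0.92\n")
    print(best_validation_loss(log))  # smallest loss in the synthetic log

Each trial's score could then be fed back to any of the optimization libraries listed above.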
Lines changed: 59 additions & 2 deletions

Installation
============

Users can clone and build the repository locally or install Selene through conda/pip.

Please use Selene with Python 3.6+.

Install `PyTorch <https://pytorch.org/get-started/locally/>`_. If you have an NVIDIA GPU, install a version of PyTorch that supports it; Selene will run much faster with a discrete GPU.
Installing with Anaconda
------------------------

To install with conda (recommended for Linux users), run the following command in your terminal:

.. code-block:: sh

    conda install -c bioconda selene-sdk

Installing with pip
-------------------

.. code-block:: sh

    pip install selene-sdk

Note that we do not recommend pip-installing older versions of Selene (below 0.4.0), as these releases were less stable.

We currently only have a source distribution available for pip installation. We are looking into releasing wheels in the future.

Installing from source
----------------------

Selene can also be installed from source.
First, download the latest commits from the source repository:

.. code-block:: sh

    git clone https://github.com/FunctionLab/selene.git

The ``setup.py`` script requires NumPy, Cython, and setuptools. Please make sure these are already installed.

If you plan on working in the ``selene`` repository directly, we recommend `setting up a conda environment <https://conda.io/docs/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file>`_ using ``selene-cpu.yml`` or ``selene-gpu.yml`` (if CUDA is enabled on your machine) and activating it.

Selene contains some Cython files. You can build these by running:

.. code-block:: sh

    python setup.py build_ext --inplace

If you would like to install Selene locally, run:

.. code-block:: sh

    python setup.py install
Additional dependency for running the CLI (versions 0.4.8 and below)
--------------------------------------------------------------------

Please install ``docopt`` before running the command-line script ``selene_cli.py`` provided in the repository.
In newer versions of Selene, the CLI can be run by calling ``selene_sdk`` or ``python -m selene_sdk`` from anywhere in bash (assuming you have installed the library through conda, pip, or a local install with ``python setup.py install``).
docs/source/overview/overview.rst

Lines changed: 92 additions & 3 deletions
Functional overview of the SDK
==============================

The software development kit (SDK), formally known as ``selene_sdk``, is an extensible Python package intended to ease development of new programs that leverage sequence-level models through code reuse.
The package is composed of six submodules: *sequences*, *samplers*, *targets*, *predict*, *interpret*, and *utils*.
It also provides two top-level classes: *TrainModel* and *EvaluateModel*.
In the following sections, we briefly discuss each submodule and top-level class.

Sampling
--------

We start with the modules for sampling data because both training and evaluating a model in Selene require the user to specify the kind of sampler they want to use.

*sequences* submodule (`API <http://selene.flatironinstitute.org/sequences.html>`_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The *sequences* submodule defines the ``Sequence`` type and includes implementations of several subclasses.
These subclasses, ``Genome`` and ``Proteome``, represent different kinds of biological sequences (e.g. DNA, RNA, amino acid sequences). They implement the ``Sequence`` interface's methods for reading the reference sequence from files (e.g. FASTA), querying subsequences of the reference sequence, and converting those queried subsequences into a numeric representation.
Further, each sequence class specifies its own alphabet (e.g. nucleotides, amino acids) to represent query results as strings.
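To make the "numeric representation" concrete, here is a minimal one-hot encoding sketch for DNA. It illustrates the idea only; it is not Selene's actual implementation, which also handles reference lookups and strand.

.. code-block:: python

    # Minimal sketch of one-hot encoding a DNA string, the kind of numeric
    # representation a Sequence subclass produces (illustration only).
    ALPHABET = "ACGT"

    def one_hot(sequence):
        """Encode a DNA string as a list of 4-element rows, one per base.

        Unknown bases (e.g. N) are encoded as 0.25 in every column, a common
        convention for ambiguous positions.
        """
        encoding = []
        for base in sequence.upper():
            if base in ALPHABET:
                encoding.append([1.0 if base == b else 0.0 for b in ALPHABET])
            else:
                encoding.append([0.25] * len(ALPHABET))
        return encoding

    print(one_hot("ACN"))

Stacking these rows gives the sequence-length-by-alphabet-size matrix that models consume.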

*targets* submodule (`API <http://selene.flatironinstitute.org/targets.html>`_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The *targets* submodule defines the ``Target`` class, which specifies the interface for classes that retrieve labels or "targets" for a given query sequence.
At present, we supply a single implementation of this interface: ``GenomicFeatures``.
This class takes a tabix-indexed file of intervals for each label we want our model to predict, and uses this file to identify the labels for a given sequence drawn from the reference.
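That label-retrieval step can be pictured as an interval-overlap query. The sketch below assumes a 50% overlap threshold and an in-memory interval list; the real ``GenomicFeatures`` class uses a tabix index and configurable thresholds.

.. code-block:: python

    # Toy sketch of target retrieval by interval overlap (illustration only;
    # a plain dict stands in for the tabix-indexed intervals file).
    FEATURE_INTERVALS = {
        "CTCF":  [("chr1", 100, 200)],
        "DNase": [("chr1", 150, 400)],
    }

    def get_targets(chrom, start, end, min_overlap=0.5):
        """Return a 0/1 label per feature: 1 if the feature's intervals
        cover at least `min_overlap` of the query region."""
        labels = []
        length = end - start
        for feature, intervals in sorted(FEATURE_INTERVALS.items()):
            covered = sum(
                max(0, min(end, i_end) - max(start, i_start))
                for c, i_start, i_end in intervals if c == chrom
            )
            labels.append(1 if covered / length >= min_overlap else 0)
        return labels

    print(get_targets("chr1", 120, 220))

The returned vector lines up with the (sorted) feature names, one binary label per feature.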

*samplers* submodule (`API <http://selene.flatironinstitute.org/samplers.html>`_)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The *samplers* submodule provides methods and classes for randomly sampling and partitioning datasets for training and evaluation.
The ``Sampler`` interface defines the minimal requirements for a class fulfilling these functions.
In particular, samplers must be able to partition data (i.e. into training, validation, and testing datasets), sample data from each partition, and, if needed, save the sampled data to a file.
Further, a file of names must be provided for the features to be predicted.
We provide several implementations adhering to the ``Sampler`` interface: ``RandomPositionsSampler``, ``IntervalsSampler``, and ``MultiFileSampler``.

``MultiFileSampler`` draws samples from structured data files for each partition.
There is currently support for loading either .bed or .mat files via the ``FileSampler`` classes ``BedFileSampler`` and ``MatFileSampler``, respectively (see the `API docs for file samplers <http://selene.flatironinstitute.org/samplers.file_samplers.html>`_).
It is worth noting that the .bed file used by ``BedFileSampler`` includes the coordinates of each sequence and the indices of each feature for which said sequence is a positive example.
We hope that users will request or contribute classes for other file samplers in the future.
``MultiFileSampler`` does not support saving the sampled data to a file, so calling the ``save_dataset_to_file`` method from this class has no effect.

``RandomPositionsSampler`` and ``IntervalsSampler`` are what we call online samplers.
Online samplers generate examples from the reference sequence (e.g. genome, proteome) on the fly, either across the whole reference sequence (random positions sampler) or from user-specified regions (intervals sampler), using a tabix-indexed .bed file.
These samplers automatically partition the data according to user-specified parameters (e.g. validate on a subset of chromosomes or on some percentage of the data).
Since samples from an ``OnlineSampler`` are randomly generated, we allow the user to save the sampled data to a file.
This file can subsequently be loaded with the ``BedFileSampler``. Online samplers rely on classes from the *sequences* and *targets* submodules for retrieving each sequence and its targets in the proper matrix format.
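The chromosome-holdout partitioning described above can be pictured with the following sketch. The holdout chromosomes and lengths are made-up values, and a real online sampler would also fetch the sequence window and its targets for each sampled position.

.. code-block:: python

    import random

    # Toy sketch of holdout partitioning in an online sampler: validation
    # and test sets are defined by held-out chromosomes, and training draws
    # random positions from everything else (assumed scheme, for
    # illustration only).
    HOLDOUT = {"validate": {"chr6", "chr7"}, "test": {"chr8", "chr9"}}
    CHROM_LENGTHS = {"chr1": 1000, "chr6": 800, "chr8": 600}

    def partition_of(chrom):
        for name, chroms in HOLDOUT.items():
            if chrom in chroms:
                return name
        return "train"

    def sample_position(partition, rng=random):
        """Draw a (chrom, position) pair from the requested partition."""
        pool = [c for c in CHROM_LENGTHS if partition_of(c) == partition]
        chrom = rng.choice(pool)
        return chrom, rng.randrange(CHROM_LENGTHS[chrom])

    print(partition_of("chr6"))          # validate
    print(sample_position("train")[0])   # chr1 is the only training chromosome here

Because positions are drawn at random, saving the sampled coordinates (as the online samplers allow) is what makes a run reproducible later via ``BedFileSampler``.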

Training a model (`API <http://selene.flatironinstitute.org/selene.html#trainmodel>`_)
------------------------------------------------------------------------------------------

The ``TrainModel`` class may be used for training and testing sequence-based models, and provides the core functionality of the CLI's train command.
It relies on an ``OnlineSampler`` (or a subclass of ``OnlineSampler``) to automatically partition the dataset into subsets for training, validation, and testing.
These subsets are then used to automatically train and validate performance for a user-specified number of steps.
The testing subset is used to evaluate the model's performance after training is completed.
The model's loss, area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPRC) are logged during training. (In the future, we plan to support other performance metrics. Please request specific ones or use cases in our `GitHub issues <https://github.com/FunctionLab/selene/issues>`_.)
The frequency of logging is specified by the user.
At the end of evaluation, ``TrainModel`` logs the performance metrics for each feature predicted and produces plots of the precision-recall and receiver operating characteristic curves.
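For intuition about the AUC metric logged during training, it can be computed from rank statistics alone. This is a minimal sketch; a production implementation (e.g. scikit-learn's ``roc_auc_score``) also handles tied scores.

.. code-block:: python

    def roc_auc(labels, scores):
        """Area under the ROC curve via the rank-sum (Mann-Whitney)
        identity: the probability that a random positive example is
        scored above a random negative one. Ties are ignored for brevity."""
        pairs = sorted(zip(scores, labels))
        rank_sum = sum(
            rank for rank, (_, label) in enumerate(pairs, start=1) if label == 1
        )
        n_pos = sum(labels)
        n_neg = len(labels) - n_pos
        return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75

An AUC of 0.5 corresponds to random ranking; 1.0 means every positive outranks every negative.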

Evaluating a model (`API <http://selene.flatironinstitute.org/selene.html#evaluatemodel>`_)
-----------------------------------------------------------------------------------------------

The ``EvaluateModel`` class is used to test the performance of a trained model.
``EvaluateModel`` uses an instance of the ``Sampler`` class (or a subclass) to draw samples from a test set.
After using the provided model to predict labels for said data, ``EvaluateModel`` logs the performance measures (as described in "Training a model") and generates figures and a performance breakdown by feature.

Using a model to make predictions (`API <http://selene.flatironinstitute.org/predict.html>`_)
-------------------------------------------------------------------------------------------------

Selene's ``predict`` submodule includes a number of methods and classes for making predictions with sequence-based models.
The ``AnalyzeSequences`` class is the main class to use.
It leverages a user-specified trained model to make predictions for sequences in a FASTA file, apply *in silico* mutagenesis to sequences in a FASTA file, or perform variant effect prediction on variants in a VCF file.
In each case, the user can specify what ``AnalyzeSequences`` should save: raw predictions, difference scores, absolute difference scores, and/or logit scores.
Note that the aforementioned "scores" can only be computed for *in silico* mutagenesis and variant effect prediction.
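The difference-based scores listed above are element-wise comparisons of the reference and alternate prediction vectors. The sketch below illustrates the idea; the sign convention and the logit definition shown are assumptions for illustration, not Selene's documented formulas.

.. code-block:: python

    import math

    # Sketch of variant-effect scoring: compare the model's per-feature
    # predictions for the reference allele against those for the alternate
    # allele (sign convention assumed: alternate minus reference).
    def diff_scores(ref_preds, alt_preds):
        return [alt - ref for ref, alt in zip(ref_preds, alt_preds)]

    def abs_diff_scores(ref_preds, alt_preds):
        return [abs(d) for d in diff_scores(ref_preds, alt_preds)]

    def logit(p):
        # Log-odds transform, assumed here as the basis of "logit scores".
        return math.log(p / (1 - p))

    ref = [0.10, 0.80]
    alt = [0.30, 0.70]
    print(diff_scores(ref, alt))      # per-feature change in prediction
    print(abs_diff_scores(ref, alt))  # magnitude of change

The same comparison applies to *in silico* mutagenesis, with each mutated sequence playing the role of the alternate allele.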

Visualizing model predictions (`API <http://selene.flatironinstitute.org/interpret.html>`_)
-----------------------------------------------------------------------------------------------

The ``interpret`` submodule of ``selene_sdk`` provides methods for visualizing a sequence-based model's predictions made with ``AnalyzeSequences``.
For example, ``interpret`` includes methods for processing variant effect predictions made with ``AnalyzeSequences`` and subsequently visualizing them with a heatmap or sequence logo.
The functionality in the ``interpret`` submodule is not heavily incorporated into the CLI; it is instead intended for incorporation into user code.

The utilities submodule (`API <http://selene.flatironinstitute.org/utils.html>`_)
-------------------------------------------------------------------------------------

Unlike the aforementioned submodules, each designed around an individual concept, the ``utils`` submodule is a catch-all intended to prevent cluttering of the ``selene_sdk`` top-level namespace.
It provides diverse functionality at varying levels of flexibility.
Some members of ``utils`` are general-purpose (e.g. configuration file parsing) while others have highly specific use cases (e.g. CLI logger initialization).

Help
----

Join our `Google group <https://groups.google.com/forum/#!forum/selene-sdk>`_ if you have questions about the package, case studies, or model development.
