remove md

kathyxchen · kathyxchen · commit c7e7f8a6c058 · 2020-10-29T15:48:55.000-04:00
diff --git a/docs/source/overview/faq.rst b/docs/source/overview/faq.rst
@@ -1,4 +1,53 @@
-Frequently Asked Questions
-===========================
 
-.. mdinclude:: ./faq.md
+FAQ and additional resources
+============================
+
+Extending Selene
+----------------
+
+The main modules that users may want to extend are
+
+
+* ``selene_sdk.samplers.OnlineSampler``
+* ``selene_sdk.samplers.file_samplers.FileSampler``
+* ``selene_sdk.sequences.Sequence``
+* ``selene_sdk.targets.Target``
+
+Please refer to the documentation for these classes.
+If you encounter a bug or have a feature request, please post to our GitHub `issues <https://github.com/FunctionLab/selene/issues>`_. E-mail kchen@flatironinstitute.org if you are interested in being a contributor to Selene.
+
+Join our `Google group <https://groups.google.com/forum/#!forum/selene-sdk>`_ if you have questions about the package, case studies, or model development.
+
+Exporting a Selene-trained model to Kipoi
+-----------------------------------------
+
+We have provided an example of how to prepare a model for upload to `Kipoi's model zoo <http://kipoi.org/>`_ using a model trained during case study 2. You can use `this example <https://github.com/FunctionLab/selene/tree/master/manuscript/case2/3_kipoi_export>`_ as a starting point for preparing your own model for Kipoi. We have provided a script that can help to automate parts of the process.
+
+We are also working on an export function that will be built into Selene and accessible through the CLI. 
+
+Hyperparameter optimization
+---------------------------
+
+Hyperparameter optimization is the process of finding the set of hyperparameters that yields an optimal model against a predefined score (e.g. minimizing a loss function). 
+Hyperparameters are the variables that govern the training process (i.e. they are held constant during a training run, in contrast to model parameters, which are optimized/"tuned" by the training process itself). 
+Hyperparameter tuning works by running multiple training trials, each with different values for your chosen hyperparameters, set within specified limits. Some examples of hyperparameters:
+
+
+* learning rate
+* number of hidden units
+* convolutional kernel size
+
+You can select hyperparameters yourself (manually) or automatically. 
+For automatic hyperparameter optimization, you can look into grid search or random search. 
+
+Some resources that may be useful:
+
+
+* `Hyperopt: Distributed Asynchronous Hyper-parameter Optimization <https://github.com/hyperopt/hyperopt>`_
+* `skorch: a scikit-learn compatible neural network library that wraps PyTorch <https://github.com/dnouri/skorch>`_
+* `Tune: scalable hyperparameter search <https://ray.readthedocs.io/en/latest/tune.html>`_
+* `Spearmint <https://github.com/JasperSnoek/spearmint>`_
+* `weights & biases <https://www.wandb.com/>`_
+* `comet.ml <https://www.comet.ml/>`_
+
+To use hyperparameter optimization on models being developed with Selene, you could implement a method that runs Selene (via a command-line call) with a set of hyperparameters and then monitors the validation performance based on the output to ``selene_sdk.train_model.validation.txt``. 
diff --git a/docs/source/overview/installation.rst b/docs/source/overview/installation.rst
@@ -1,4 +1,61 @@
+
 Installation
-=============
+============
+
+Users can clone and build the repository locally or install Selene through conda/pip. 
+
+Please use Selene with Python 3.6+.
+
+Install `PyTorch <https://pytorch.org/get-started/locally/>`_. If you have an NVIDIA GPU, install a version of PyTorch that supports it---Selene will run much faster with a discrete GPU.
+
+Installing with Anaconda
+------------------------
+
+To install with conda (recommended for Linux users), run the following command in your terminal:
+
+.. code-block:: sh
+
+   conda install -c bioconda selene-sdk
+
+Installing Selene with pip
+--------------------------
+
+.. code-block:: sh
+
+   pip install selene-sdk
+
+Note that we do not recommend pip-installing older versions of Selene (below 0.4.0), as these releases were less stable. 
+
+We currently only have a source distribution available for pip-installation. We are looking into releasing wheels in the future. 
+
+Installing from source
+----------------------
+
+Selene can also be installed from source.
+First, download the latest commits from the source repository:
+
+.. code-block:: sh
+
+   git clone https://github.com/FunctionLab/selene.git
+
+The ``setup.py`` script requires NumPy, Cython, and setuptools. Please make sure you have these already installed.
+
+If you plan on working in the ``selene`` repository directly, we recommend `setting up a conda environment <https://conda.io/docs/user-guide/tasks/manage-environments.html#creating-an-environment-from-an-environment-yml-file>`_ using ``selene-cpu.yml`` or ``selene-gpu.yml`` (if CUDA is enabled on your machine) and activating it.
+
+Selene contains some Cython files. You can build these by running
+
+.. code-block:: sh
+
+   python setup.py build_ext --inplace
+
+If you would like to locally install Selene, you can run
+
+.. code-block:: sh
+
+   python setup.py install
+
+Additional dependency for running the CLI (versions 0.4.8 and below)
+--------------------------------------------------------------------
 
-.. mdinclude:: ./installation.md
+Please install ``docopt`` before running the command-line script ``selene_cli.py`` provided in the repository. 
+In newer versions of Selene, the CLI can be run by calling ``selene_sdk`` or ``python -m selene_sdk`` from anywhere in bash (assuming you have installed the library through conda, pip, or a local install with ``python setup.py install``).
diff --git a/docs/source/overview/overview.rst b/docs/source/overview/overview.rst
@@ -1,4 +1,93 @@
-Overview
-=========
 
-.. mdinclude:: ./overview.md
+Functional overview of the SDK
+==============================
+
+The software development kit (SDK), formally known as ``selene_sdk``\ , is an extensible Python package intended to ease development of new programs that leverage sequence-level models through code reuse.
+The package is composed of six submodules: *sequences*\ , *samplers*\ , *targets*\ , *predict*\ , *interpret*\ , and *utils*.
+It also provides two top-level classes: *TrainModel* and *EvaluateModel*.
+In the following sections, we briefly discuss each submodule and top-level class. 
+
+Sampling
+--------
+
+We start with the modules for sampling data because both training and evaluating a model in Selene require the user to specify the kind of sampler they want to use. 
+
+*sequences* submodule (\ `API <http://selene.flatironinstitute.org/sequences.html>`_\ )
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The *sequences* submodule defines the ``Sequence`` type, and includes implementations for several sub-classes.
+These sub-classes--\ ``Genome`` and ``Proteome``\ --represent different kinds of biological sequences (e.g. DNA, RNA, amino acid sequences), and implement the ``Sequence`` interface’s methods for reading the reference sequence from files (e.g. FASTA), querying subsequences of the reference sequence, and subsequently converting those queried subsequences into a numeric representation.
+Further, each sequence class specifies its own alphabet (e.g., nucleotides, amino acids) to represent query results as strings.
+
+*targets* submodule (\ `API <http://selene.flatironinstitute.org/targets.html>`_\ )
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The *targets* submodule defines the ``Target`` class, which specifies the interface for classes to retrieve labels or “targets” for a given query sequence.
+At present, we supply a single implementation of this interface: ``GenomicFeatures``.
+This class takes a tabix-indexed file of intervals for each label we want our model to predict, and uses this file to identify the labels for a given sequence drawn from the reference.
+
+*samplers* submodule (\ `API <http://selene.flatironinstitute.org/samplers.html>`_\ )
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The *samplers* submodule provides methods and classes for randomly sampling and partitioning datasets for training and evaluation.
+The ``Sampler`` interface defines the minimal requirements for a class fulfilling these functions.
+In particular, samplers must be able to partition data (i.e. into training, validation, and testing datasets), sample data from each partition, and, if needed, save the sampled data to a file.
+Further, a file of names must be provided for the features to be predicted.
+We provide several implementations adhering to the ``Sampler`` interface: the ``RandomPositionsSampler``\ , ``IntervalsSampler``\ , and ``MultiFileSampler``.
+
+``MultiFileSampler`` draws samples from structured data files for each partition.
+There is currently support for loading either .bed or .mat files via the ``FileSampler`` classes ``BedFileSampler`` and ``MatFileSampler``\ , respectively (see `API docs for file samplers <http://selene.flatironinstitute.org/samplers.file_samplers.html>`_\ ).
+It is worth noting that the .bed file used by ``BedFileSampler`` includes the coordinates of each sequence, and the indices corresponding to each feature for which said sequence is a positive example.
+We hope that users will request or contribute classes for other file samplers in the future.
+``MultiFileSampler`` does not support saving the sampled data to a file, so calling the ``save_dataset_to_file`` method from this class will have no effect.
+
+``RandomPositionsSampler`` and ``IntervalsSampler`` are what we call online samplers.
+Online samplers generate examples from the reference sequence (e.g. genome, proteome) on-the-fly--either across the whole reference sequence (random positions sampler), or from user-specified regions (intervals sampler)--using a tabix-indexed .bed file.
+These samplers automatically partition said data according to user-specified parameters (e.g. validate on a subset of chromosomes or on some percentage of the data).
+Since ``OnlineSampler``\ ’s samples are randomly generated, we allow the user to save the sampled data to a file.
+This file can subsequently be loaded with the ``BedFileSampler``. Online samplers rely on classes from the *sequences* and *targets* submodules to retrieve each sequence and its targets in the proper matrix format. 
+
+Training a model (\ `API <http://selene.flatironinstitute.org/selene.html#trainmodel>`_\ )
+------------------------------------------------------------------------------------------
+
+The ``TrainModel`` class may be used for training and testing of sequence-based models, and provides the core functionality of the CLI’s train command.
+It relies on an ``OnlineSampler`` (or a subclass of ``OnlineSampler``\ ) to automatically partition the dataset into subsets for training, validation, and testing.
+These subsets are then used to automatically train and validate performance for a user-specified number of steps.
+The testing subset is used to evaluate the model performance after training is completed.
+The model’s loss, area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPRC) are logged during training. (In the future, we plan to support other performance metrics. Please request specific ones or use cases in our `GitHub issues <https://github.com/FunctionLab/selene/issues>`_.)
+The frequency of logging is provided by the user.
+At the end of evaluation, ``TrainModel`` logs the performance metrics for each feature predicted, and produces plots of the precision-recall and receiver operating characteristic curves.
+
+Evaluating a model (\ `API <http://selene.flatironinstitute.org/selene.html#evaluatemodel>`_\ )
+-----------------------------------------------------------------------------------------------
+
+The ``EvaluateModel`` class is used to test the performance of a trained model. 
+``EvaluateModel`` uses an instance of the ``Sampler`` class or one of its subclasses to draw samples from a test set.
+After using the provided model to predict labels for said data, ``EvaluateModel`` logs the performance measures (as described in "Training a model") and generates figures and a performance breakdown by feature.
+
+Using a model to make predictions (\ `API <http://selene.flatironinstitute.org/predict.html>`_\ )
+-------------------------------------------------------------------------------------------------
+
+Selene’s ``predict`` submodule includes a number of methods and classes for making predictions with sequence-based models. 
+The ``AnalyzeSequences`` class is the main class to use.
+It leverages a user-specified trained model to make predictions for sequences in a FASTA file, apply *in silico* mutagenesis to sequences in a FASTA file, or perform variant effect prediction on variants in a VCF file.
+In each case, the user can specify what ``AnalyzeSequences`` should save: raw predictions, difference scores, absolute difference scores, and/or logit scores.
+Note that the aforementioned “scores” can only be computed for *in silico* mutagenesis and variant effect prediction. 
+
+Visualizing model predictions (\ `API <http://selene.flatironinstitute.org/interpret.html>`_\ )
+-----------------------------------------------------------------------------------------------
+
+The ``interpret`` submodule of ``selene_sdk`` provides methods for visualizing a sequence-based model’s predictions made with ``AnalyzeSequences``.
+For example, ``interpret`` includes methods for processing variant effect predictions made with ``AnalyzeSequences`` and subsequently visualizing them with a heatmap or sequence logo.
+The functionality included in the ``interpret`` submodule is not heavily incorporated into the CLI, but is instead intended for incorporation into user code.
+
+The utilities submodule (\ `API <http://selene.flatironinstitute.org/utils.html>`_\ )
+-------------------------------------------------------------------------------------
+
+Unlike the aforementioned submodules designed around individual concepts, the ``utils`` submodule is a catch-all submodule intended to prevent cluttering of the ``selene_sdk`` top-level namespace. 
+It provides diverse functionality at varying levels of flexibility. 
+Some members of ``utils`` are general-purpose (e.g. configuration file parsing) while others have highly specific use cases (e.g. CLI logger initialization).
+
+Help
+----
+Join our `Google group <https://groups.google.com/forum/#!forum/selene-sdk>`_ if you have questions about the package, case studies, or model development.