add rst files

kathyxchen · kathyxchen · commit 3a185f2efbb7 · 2020-10-28T11:19:38.000-04:00
diff --git a/docs/source/overview/cli.md b/docs/source/overview/cli.md
@@ -119,12 +119,12 @@ train_model: !obj:selene_sdk.TrainModel {
 }
 ```
 
-#### Required parameters
+### Required parameters
 - `batch_size`:  Number of samples in one forward/backward pass (a single step).
 - `max_steps`: Total number of steps for which to train the model. 
 - `report_stats_every_n_steps`: The frequency with which to report summary statistics. You can set this value to be equivalent to a training epoch, where `n_steps * batch_size` is the total number of samples seen by the model so far. Selene evaluates the model on the validation dataset every `report_stats_every_n_steps` steps and, if the model obtains the best performance so far (based on the user-specified loss function), saves the model state to a file called `best_model.pth.tar` in `output_dir`.  
 
-#### Optional parameters
+### Optional parameters
 - `save_checkpoint_every_n_steps`: Default is 1000. The number of steps before Selene saves a new checkpoint model weights file. If this parameter is set to `None`, we will set it to the same value as `report_stats_every_n_steps`.
 - `save_new_checkpoints_after_n_steps`: Default is None. The number of steps after which Selene will continually save new checkpoint model weights files (`checkpoint-<TIMESTAMP>.pth.tar`) every `save_checkpoint_every_n_steps`. Before this, the file `checkpoint.pth.tar` is overwritten every `save_checkpoint_every_n_steps` to limit the memory requirements.
 - `n_validation_samples`: Default is `None`. Specify the number of validation samples in the validation set. If `None`
@@ -147,13 +147,14 @@ train_model: !obj:selene_sdk.TrainModel {
 - `metrics`: Default is a dictionary with `"roc_auc"` mapped to `sklearn.metrics.roc_auc_score` and `"average_precision"` mapped to `sklearn.metrics.average_precision_score`. `metrics` is a dictionary that maps metric names (`str`) to metric functions. In addition to the [loss function you specified with your model architecture](#expected-input-class-and-methods), these are the metrics that you would like to monitor during the training/evaluation process (they all get reported every `report_stats_every_n_steps`). See the [Regression Models in Selene](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_mpra_example.ipynb) tutorial for a different input to the `metrics` parameter. You can `!import` metrics from `scipy`, `scikit-learn`, `statsmodels`. Each metric function should require, in order, the true values and predicted values as input arguments. For example,
   [`sklearn.metrics.average_precision_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) takes `y_true` and `y_score` as input.  
  - `checkpoint_resume`: Default is `None`. If not `None`, you should pass in the path to a model weights file generated by `torch.save` (one that can be read by `torch.load`) to resume training.  
-#### Additional notes
+
+### Additional notes
 Attentive readers might have noticed that in the [documentation for the `TrainModel` class](https://selene.flatironinstitute.org/selene.html#trainmodel) there are more input arguments than are required to instantiate the class through the CLI configuration file. This is because they are assumed to be carried through/retrieved from other configuration keys for consistency. Specifically:
 - `output_dir` can be specified as a top-level key in the configuration. You can specify it within each function-type constructor (e.g. `!obj:selene_sdk.TrainModel`) if you prefer. If `output_dir` exists as a top-level key, Selene uses the top-level `output_dir` and ignores all other `output_dir` keys. **The `output_dir` is omitted in many of the configurations for this reason.**
 - `model`, `loss_criterion`, `optimizer_class`, `optimizer_kwargs` are all retrieved from the path in the [`model` configuration](#model-architecture). 
 - `data_sampler` has its own separate configuration that you will need to specify in the same YAML file. Please see [Sampler configurations](#sampler-configurations) for more information.
 
-#### Expected outputs for training
+### Expected outputs for training
 These outputs will be written to `output_dir` (a top-level parameter that can also be specified within the function-type constructor; see above).
 - `best_model.pth.tar`: the best performing model so far. IMPORTANT: for all `*.pth.tar` files output by Selene right now, we save additional information beyond the model's state dictionary so that users may continue training these models through Selene if they wish. If you would like to save only the state dictionary, you can run `out = torch.load(<*.pth.tar>)` and then save only the `state_dict` key with `torch.save(out["state_dict"], <state_dict_only.pth.tar>)`. 
 - `checkpoint.pth.tar`: model saved every `save_checkpoint_every_n_steps` steps
@@ -183,24 +184,24 @@ evaluate_model: !obj:selene_sdk.EvaluateModel {
 }
 ```
 
-#### Required parameters
+### Required parameters
 - `features`: The list of distinct features the model predicts. (`input_path` to the function-type value that loads the features as a list.)
 - `trained_model_path`: Path to the trained model weights file, which should have been generated/saved using `torch.save`. (For example, you can pass in the saved model file generated by Selene's `TrainModel` class.)
 
-#### Optional parameters
+### Optional parameters
 - `batch_size`: Default is 64. Specify the batch size to process examples. Should be a power of 2.
 - `n_test_samples`: Default is `None`. Use `n_test_samples` if you want to limit the number of samples on which you evaluate your model. If you are using a sampler of type [`selene_sdk.samplers.OnlineSampler`](#samplers-used-for-training-and-evaluation-optionally) (you must specify a test partition in this case), it will default to 640000 test samples if `n_test_samples = None`. If you are using a file sampler ([multiple-file sampler](#multiple-file-sampler) or [BED](#bed-file-sampler)/[matrix](#matrix-file-sampler) file samplers), it will use all samples available in the file.
 - `report_gt_feature_n_positives`: Default is 10. In total, each class/feature must have more than `report_gt_feature_n_positives` positive examples in the test set to be considered in the performance computation. The output file that reports each class's performance will report 'NA' for classes that do not have enough positive samples.
 - `use_cuda`: Default is False. Specify whether CUDA-enabled GPUs are available for torch to use.  
 - `data_parallel`: Default is False. Specify whether multiple GPUs are available for torch to use.
 - `use_features_ord`: Default is None. Specify an ordered list of features for which to run the evaluation. The features in this list must be identical to or a subset of `features`, and in the order you want the resulting `test_targets.npz` and `test_predictions.npz` to be saved.
 
-#### Additional notes
+### Additional notes
 Similar to the `train_model` configuration, any arguments that you find in [the documentation](https://selene.flatironinstitute.org/selene.html#evaluatemodel) that are not present in the function-type value's arguments are automatically instantiated and passed in by Selene.
 
 If you use a [sampler with multiple data partitions](#samplers-used-for-training-and-evaluation-optionally) with the `evaluate_model` configuration, **please make sure that your sampler configuration's `mode` parameter is set to `test`**. 
 
-#### Expected outputs for evaluation
+### Expected outputs for evaluation
 These outputs will be written to `output_dir` (a top-level parameter that can also be specified within the function-type constructor).
 - `test_performance.txt`: The breakdown of performance metrics for each class that the model predicts. Columns are `class` and the metrics you specified (defaults: `roc_auc` and `average_precision`).
 - `test_predictions.npz`: The model predictions for each sample in the test set. Useful if you want to make your own visualizations/figures.
@@ -240,12 +241,12 @@ analyze_sequences: !obj:selene_sdk.predict.AnalyzeSequences {
     write_mem_limit: 5000
 }
 ```
-#### Required parameters
+### Required parameters
 - `trained_model_path`: Path to the trained model weights file, which should have been generated/saved using `torch.save`. (For example, you can pass in the saved model file generated by Selene's `TrainModel` class.)
 - `sequence_length`: The sequence length the model is expecting for each input.
 - `features`: The list of distinct features the model predicts. (`input_path` to the function-type value that loads the features as a list.)
 
-#### Optional parameters
+### Optional parameters
 - `batch_size`: Default is 64. The size of the mini-batches to use.
 - `use_cuda`: Default is `False`. Specify whether CUDA-enabled GPUs are available for torch to use.  
 - `reference_sequence`: Default is the class `selene_sdk.sequences.Genome`. The type of sequence on which this analysis will be performed (must be type `selene_sdk.sequences.Sequence`).
@@ -438,7 +439,6 @@ With the exception of `intervals_path` and `sample_negative`, all other paramete
 - `sample_negative`: Optional, default is False. Specify whether negative examples (i.e. samples with no positive labels) should be drawn. When False, the sampler will check if the `center_bin_to_predict` in the input sequence contains at least 1 of the features/classes the model wants to predict. When True, no such check is made. 
 
 #### Multiple-file sampler
-
 The multi-file sampler loads in the training, validation, and optionally, the testing dataset.  The configuration for this therefore asks that you fill in some keys with the function-type constructors of type `selene_sdk.samplers.file_samplers.FileSampler`. Please consult the following sections for information about these file samplers. 
 
 An example configuration for the multiple-file sampler:
@@ -623,7 +623,7 @@ create_subdirectory: False
 ...
 ```
 
-### Some notes
+#### Some notes
 - For the matrix file sampler, we assume that you know ahead of time the shape of the data matrix. That is, which dimension is the batch dimension? Sequence? Alphabet (should be size 4 for DNA/RNA)? You must specify the keys that end in `axis` unless the shape of the sequences matrix is `(n_samples, n_alphabet, n_sequence_length)` and the shape of the targets matrix is `(n_samples, n_targets)`.
 - In this case, since `create_subdirectory` is False, all outputs from evaluate are written to `output_dir` directly (as opposed to being written in a timestamped subdirectory). Be careful of overwriting files.
 
diff --git a/docs/source/overview/cli.rst b/docs/source/overview/cli.rst
@@ -0,0 +1,4 @@
+Command-line Interface
+=======================
+
+.. mdinclude:: ./cli.md
diff --git a/docs/source/overview/faq.rst b/docs/source/overview/faq.rst
@@ -0,0 +1,4 @@
+Frequently Asked Questions
+===========================
+
+.. mdinclude:: ./faq.md
diff --git a/docs/source/overview/installation.md b/docs/source/overview/installation.md
@@ -13,7 +13,7 @@ To install with conda (recommended for Linux users), run the following command i
 conda install -c bioconda selene-sdk
 ```
 
-### Installing selene with pip:
+## Installing selene with pip:
 
 ```sh
 pip install selene-sdk
@@ -45,6 +45,6 @@ If you would like to locally install Selene, you can run
 python setup.py install
 ```
 
-## Additional dependency for running the CLI 
+## Additional dependency for running the CLI (versions 0.4.8 and below) 
 
 Please install `docopt` before running the command-line script `selene_cli.py` provided in the repository.
diff --git a/docs/source/overview/installation.rst b/docs/source/overview/installation.rst
@@ -0,0 +1,4 @@
+Installation
+=============
+
+.. mdinclude:: ./installation.md
diff --git a/docs/source/overview/overview.rst b/docs/source/overview/overview.rst
@@ -0,0 +1,4 @@
+Overview
+=========
+
+.. mdinclude:: ./overview.md
diff --git a/docs/source/overview/tutorial.rst b/docs/source/overview/tutorial.rst
@@ -0,0 +1,4 @@
+Tutorials
+=========
+
+.. mdinclude:: ./tutorials.md