Commit 3a185f2

add rst files
1 parent fd45176 commit 3a185f2

7 files changed: +34 -14 lines changed

docs/source/overview/cli.md

Lines changed: 12 additions & 12 deletions
Original file line number | Diff line number | Diff line change
@@ -119,12 +119,12 @@ train_model: !obj:selene_sdk.TrainModel {
119119
}
120120
```
121121

122-
#### Required parameters
122+
### Required parameters
123123
- `batch_size`: Number of samples in one forward/backward pass (a single step).
124124
- `max_steps`: Total number of steps for which to train the model.
125125
- `report_stats_every_n_steps`: The frequency with which to report summary statistics. You can set this value to be equivalent to a training epoch, with `n_steps * batch_size` being the total number of samples seen by the model so far. Selene evaluates the model on the validation dataset every `report_stats_every_n_steps` steps and, if the model obtains the best performance so far (based on the user-specified loss function), saves the model state to a file called `best_model.pth.tar` in `output_dir`.
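
As a rough illustration of the epoch relationship described for `report_stats_every_n_steps`, the sketch below computes how many steps cover one pass over a training set. The helper name and the sample count are hypothetical and are not part of Selene.

```python
import math


def steps_per_epoch(n_training_samples: int, batch_size: int) -> int:
    """Return the number of steps needed to see about n_training_samples samples.

    Hypothetical helper: one step processes batch_size samples, so
    n_steps * batch_size is the number of samples seen after n_steps.
    """
    return math.ceil(n_training_samples / batch_size)


# Hypothetical training set size; 64 is only an example batch size.
n_steps = steps_per_epoch(n_training_samples=640_000, batch_size=64)
print(n_steps)  # 10000
```

A value like this could then be used for `report_stats_every_n_steps` if you want statistics reported roughly once per epoch.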
126126

127-
#### Optional parameters
127+
### Optional parameters
128128
- `save_checkpoint_every_n_steps`: Default is 1000. The number of steps before Selene saves a new checkpoint model weights file. If this parameter is set to `None`, we will set it to the same value as `report_stats_every_n_steps`.
129129
- `save_new_checkpoints_after_n_steps`: Default is None. The number of steps after which Selene will continually save new checkpoint model weights files (`checkpoint-<TIMESTAMP>.pth.tar`) every `save_checkpoint_every_n_steps`. Before this, the file `checkpoint.pth.tar` is overwritten every `save_checkpoint_every_n_steps` to limit the memory requirements.
130130
- `n_validation_samples`: Default is `None`. Specify the number of validation samples in the validation set. If `None`
@@ -147,13 +147,14 @@ train_model: !obj:selene_sdk.TrainModel {
147147
- `metrics`: Default is a dictionary with `"roc_auc"` mapped to `sklearn.metrics.roc_auc_score` and `"average_precision"` mapped to `sklearn.metrics.average_precision_score`. `metrics` is a dictionary that maps metric names (`str`) to metric functions. In addition to the [loss function you specified with your model architecture](#expected-input-class-and-methods), these are the metrics that you would like to monitor during the training/evaluation process (they all get reported every `report_stats_every_n_steps`). See the [Regression Models in Selene](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_mpra_example.ipynb) tutorial for a different input to the `metrics` parameter. You can `!import` metrics from `scipy`, `scikit-learn`, `statsmodels`. Each metric function should require, in order, the true values and predicted values as input arguments. For example,
148148
[`sklearn.metrics.average_precision_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) takes `y_true` and `y_score` as input.
149149
- `checkpoint_resume`: Default is `None`. If not `None`, you should pass in the path to a model weights file generated by `torch.save` (and can now be read by `torch.load`) to resume training.
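
To make the argument order for `metrics` concrete, here is a minimal sketch of a custom metric that takes the true values first and the predicted values second. The function `mean_absolute_error` is a hypothetical stand-in written in plain Python; it assumes equal-length numeric sequences and says nothing about how Selene passes arrays internally.

```python
def mean_absolute_error(y_true, y_pred):
    """Hypothetical metric: true values first, predicted values second."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Metric names (str) mapped to metric functions, mirroring the shape of `metrics`.
metrics = {"mean_absolute_error": mean_absolute_error}

y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
for name, metric_fn in metrics.items():
    print(name, metric_fn(y_true, y_pred))
```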
150-
#### Additional notes
150+
151+
### Additional notes
151152
Attentive readers might have noticed that in the [documentation for the `TrainModel` class](https://selene.flatironinstitute.org/selene.html#trainmodel) there are more input arguments than are required to instantiate the class through the CLI configuration file. This is because they are assumed to be carried through/retrieved from other configuration keys for consistency. Specifically:
152153
- `output_dir` can be specified as a top-level key in the configuration. You can specify it within each function-type constructor (e.g. `!obj:selene_sdk.TrainModel`) if you prefer. If `output_dir` exists as a top-level key, Selene uses the top-level `output_dir` and ignores all other `output_dir` keys. **The `output_dir` is omitted in many of the configurations for this reason.**
153154
- `model`, `loss_criterion`, `optimizer_class`, `optimizer_kwargs` are all retrieved from the path in the [`model` configuration](#model-architecture).
154155
- `data_sampler` has its own separate configuration that you will need to specify in the same YAML file. Please see [Sampler configurations](#sampler-configurations) for more information.
155156

156-
#### Expected outputs for training
157+
### Expected outputs for training
157158
These outputs will be written to `output_dir` (a top-level parameter that can also be specified within the function-type constructor; see above).
158159
- `best_model.pth.tar`: the best performing model so far. IMPORTANT: for all `*.pth.tar` files output by Selene right now, we save additional information beyond the model's state dictionary so that users may continue training these models through Selene if they wish. If you would like to save only the state dictionary, you can run `out = torch.load(<*.pth.tar>)` and then save only the `state_dict` key with `torch.save(out["state_dict"], <state_dict_only.pth.tar>)`.
159160
- `checkpoint.pth.tar`: model saved every `save_checkpoint_every_n_steps` steps
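
The interaction between `save_checkpoint_every_n_steps` and `save_new_checkpoints_after_n_steps` can be sketched as a small function that returns the checkpoint file name for a given step. This is only an illustration of the behavior described above, not Selene's implementation: the function name is hypothetical, the timestamp is passed in as a placeholder, and treating the threshold as inclusive is an assumption.

```python
def checkpoint_filename(step, save_checkpoint_every_n_steps,
                        save_new_checkpoints_after_n_steps=None,
                        timestamp="<TIMESTAMP>"):
    """Return the checkpoint file written at `step`, or None if none is written.

    Hypothetical sketch; the inclusive threshold (>=) is an assumption.
    """
    if step % save_checkpoint_every_n_steps != 0:
        return None  # not a checkpoint step
    if (save_new_checkpoints_after_n_steps is not None
            and step >= save_new_checkpoints_after_n_steps):
        # Past the threshold: a new, timestamped file is kept each time.
        return f"checkpoint-{timestamp}.pth.tar"
    # Before the threshold: the same file is overwritten each time.
    return "checkpoint.pth.tar"


for step in (500, 1000, 2000, 3000):
    print(step, checkpoint_filename(step, 1000, 2000, timestamp="T"))
```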
@@ -183,24 +184,24 @@ evaluate_model: !obj:selene_sdk.EvaluateModel {
183184
}
184185
```
185186

186-
#### Required parameters
187+
### Required parameters
187188
- `features`: The list of distinct features the model predicts. (`input_path` to the function-type value that loads the features as a list.)
188189
- `trained_model_path`: Path to the trained model weights file, which should have been generated/saved using `torch.save`. (i.e. you can pass in the saved model file generated by Selene's `TrainModel` class.)
189190

190-
#### Optional parameters
191+
### Optional parameters
191192
- `batch_size`: Default is 64. Specify the batch size to process examples. Should be a power of 2.
192193
- `n_test_samples`: Default is `None`. Use `n_test_samples` if you want to limit the number of samples on which you evaluate your model. If you are using a sampler of type [`selene_sdk.samplers.OnlineSampler`](#samplers-used-for-training-and-evaluation-optionally) (you must specify a test partition in this case), it will default to 640000 test samples if `n_test_samples = None`. If you are using a file sampler ([multiple-file sampler](#multiple-file-sampler) or [BED](#bed-file-sampler)/[matrix](#matrix-file-sampler) file samplers), it will use all samples available in the file.
193194
- `report_gt_feature_n_positives`: Default is 10. In total, each class/feature must have more than `report_gt_feature_n_positives` positive examples in the test set to be considered in the performance computation. The output file that reports each class's performance will report 'NA' for classes that do not have enough positive samples.
194195
- `use_cuda`: Default is False. Specify whether CUDA-enabled GPUs are available for torch to use.
195196
- `data_parallel`: Default is False. Specify whether multiple GPUs are available for torch to use.
196197
- `use_features_ord`: Default is None. Specify an ordered list of features for which to run the evaluation. The features in this list must be identical to or a subset of `features`, and in the order you want the resulting `test_targets.npz` and `test_predictions.npz` to be saved.
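
The `report_gt_feature_n_positives` rule can be illustrated with a short sketch that counts positive examples per feature and flags which features would be reported. The function name, the feature names, and the list-of-lists target layout are assumptions made for this example; Selene's own implementation may differ.

```python
def features_with_enough_positives(targets, features,
                                   report_gt_feature_n_positives=10):
    """Map each feature to True if it has MORE than the threshold of positives.

    targets: one row per test sample, each a list of 0/1 labels aligned
    with `features` (a hypothetical layout for this sketch).
    """
    counts = [0] * len(features)
    for row in targets:
        for i, label in enumerate(row):
            counts[i] += int(label == 1)
    return {name: count > report_gt_feature_n_positives
            for name, count in zip(features, counts)}


# Hypothetical test set: feature "A" has 15 positives, feature "B" has 3.
targets = [[1, 0]] * 12 + [[1, 1]] * 3
reported = features_with_enough_positives(targets, ["A", "B"])
for name, keep in reported.items():
    print(name, "computed" if keep else "NA")
```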
197198

198-
#### Additional notes
199+
### Additional notes
199200
Similar to the `train_model` configuration, any arguments that you find in [the documentation](https://selene.flatironinstitute.org/selene.html#evaluatemodel) that are not present in the function-type value's arguments are automatically instantiated and passed in by Selene.
200201

201202
If you use a [sampler with multiple data partitions](#samplers-used-for-training-and-evaluation-optionally) with the `evaluate_model` configuration, **please make sure that your sampler configuration's `mode` parameter is set to `test`**.
202203

203-
#### Expected outputs for evaluation
204+
### Expected outputs for evaluation
204205
These outputs will be written to `output_dir` (a top-level parameter that can also be specified within the function-type constructor).
205206
- `test_performance.txt`: columns are `class` and whatever other metrics you specified (defaults: `roc_auc` and `average_precision`). The breakdown of performance metrics by each class that the model predicts.
206207
- `test_predictions.npz`: The model predictions for each sample in the test set. Useful if you want to make your own visualizations/figures.
@@ -240,12 +241,12 @@ analyze_sequences: !obj:selene_sdk.predict.AnalyzeSequences {
240241
write_mem_limit: 5000
241242
}
242243
```
243-
#### Required parameters
244+
### Required parameters
244245
- `trained_model_path`: Path to the trained model weights file, which should have been generated/saved using `torch.save`. (i.e. You can pass in the saved model file generated by Selene's `TrainModel` class.)
245246
- `sequence_length`: The sequence length the model is expecting for each input.
246247
- `features`: The list of distinct features the model predicts. (`input_path` to the function-type value that loads the features as a list.)
247248

248-
#### Optional parameters
249+
### Optional parameters
249250
- `batch_size`: Default is 64. The size of the mini-batches to use.
250251
- `use_cuda`: Default is `False`. Specify whether CUDA-enabled GPUs are available for torch to use.
251252
- `reference_sequence`: Default is the class `selene_sdk.sequences.Genome`. The type of sequence on which this analysis will be performed (must be type `selene_sdk.sequences.Sequence`).
@@ -438,7 +439,6 @@ With the exception of `intervals_path` and `sample_negative`, all other paramete
438439
- `sample_negative`: Optional, default is False. Specify whether negative examples (i.e. samples with no positive labels) should be drawn. When False, the sampler will check if the `center_bin_to_predict` in the input sequence contains at least 1 of the features/classes the model wants to predict. When True, no such check is made.
439440

440441
#### Multiple-file sampler
441-
442442
The multi-file sampler loads in the training, validation, and optionally, the testing dataset. The configuration for this therefore asks that you fill in some keys with the function-type constructors of type `selene_sdk.samplers.file_samplers.FileSampler`. Please consult the following sections for information about these file samplers.
443443

444444
An example configuration for the multiple-file sampler:
@@ -623,7 +623,7 @@ create_subdirectory: False
623623
...
624624
```
625625

626-
### Some notes
626+
#### Some notes
627627
- For the matrix file sampler, we assume that you know ahead of time the shape of the data matrix. That is, which dimension is the batch dimension? Sequence? Alphabet (should be size 4 for DNA/RNA)? You must specify the keys that end in `axis` unless the shape of the sequences matrix is `(n_samples, n_alphabet, n_sequence_length)` and the shape of the targets matrix is `(n_samples, n_targets)`.
628628
- In this case, since `create_subdirectory` is False, all outputs from evaluate are written to `output_dir` directly (as opposed to being written in a timestamped subdirectory). Be careful of overwriting files.
629629

docs/source/overview/cli.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
Command-line Interface
2+
=======================
3+
4+
.. mdinclude:: ./cli.md

docs/source/overview/faq.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
Frequently Asked Questions
2+
===========================
3+
4+
.. mdinclude:: ./faq.md

docs/source/overview/installation.md

Lines changed: 2 additions & 2 deletions
@@ -13,7 +13,7 @@ To install with conda (recommended for Linux users), run the following command i
1313
conda install -c bioconda selene-sdk
1414
```
1515

16-
### Installing selene with pip:
16+
## Installing selene with pip:
1717

1818
```sh
1919
pip install selene-sdk
@@ -45,6 +45,6 @@ If you would like to locally install Selene, you can run
4545
python setup.py install
4646
```
4747

48-
## Additional dependency for running the CLI
48+
## Additional dependency for running the CLI (versions 0.4.8 and below)
4949

5050
Please install `docopt` before running the command-line script `selene_cli.py` provided in the repository.
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
1+
Installation
2+
=============
3+
4+
.. mdinclude:: ./installation.md

docs/source/overview/overview.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
Overview
2+
=========
3+
4+
.. mdinclude:: ./overview.md

docs/source/overview/tutorial.rst

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
Tutorials
2+
=========
3+
4+
.. mdinclude:: ./tutorials.md
