- `batch_size`: Number of samples in one forward/backward pass (a single step).
- `max_steps`: Total number of steps for which to train the model.
- `report_stats_every_n_steps`: The frequency with which to report summary statistics. You can set this value to be equivalent to a training epoch; `n_steps * batch_size` is the total number of samples seen by the model so far. Selene evaluates the model on the validation dataset every `report_stats_every_n_steps` steps and, if the model obtains the best performance so far (based on the user-specified loss function), saves the model state to a file called `best_model.pth.tar` in `output_dir`.
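To make the epoch arithmetic concrete, here is a minimal sketch, assuming you know the size of your training set. The helper name `steps_per_epoch` and the sample counts are hypothetical and not part of Selene.

```python
import math


def steps_per_epoch(n_training_samples: int, batch_size: int) -> int:
    """Number of steps needed to see every training sample once."""
    if n_training_samples <= 0 or batch_size <= 0:
        raise ValueError("both arguments must be positive")
    # A final partial batch still costs one step, hence the ceiling.
    return math.ceil(n_training_samples / batch_size)


# Hypothetical numbers: 640,000 training samples, batches of 64.
report_stats_every_n_steps = steps_per_epoch(640_000, 64)
```

Setting `report_stats_every_n_steps` to the value computed this way would make each validation report coincide with roughly one pass over the training data.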
-#### Optional parameters
+### Optional parameters
- `save_checkpoint_every_n_steps`: Default is 1000. The number of steps before Selene saves a new checkpoint model weights file. If this parameter is set to `None`, we will set it to the same value as `report_stats_every_n_steps`.
- `save_new_checkpoints_after_n_steps`: Default is `None`. The number of steps after which Selene will continually save new checkpoint model weights files (`checkpoint-<TIMESTAMP>.pth.tar`) every `save_checkpoint_every_n_steps`. Before this, the file `checkpoint.pth.tar` is overwritten every `save_checkpoint_every_n_steps` to limit disk usage.
- `n_validation_samples`: Default is `None`. Specify the number of validation samples in the validation set. If `None`
- `metrics`: Default is a dictionary with `"roc_auc"` mapped to `sklearn.metrics.roc_auc_score` and `"average_precision"` mapped to `sklearn.metrics.average_precision_score`. `metrics` is a dictionary that maps metric names (`str`) to metric functions. In addition to the [loss function you specified with your model architecture](#expected-input-class-and-methods), these are the metrics that you would like to monitor during the training/evaluation process (they all get reported every `report_stats_every_n_steps`). See the [Regression Models in Selene](https://github.com/FunctionLab/selene/blob/master/tutorials/regression_mpra_example/regression_mpra_example.ipynb) tutorial for a different input to the `metrics` parameter. You can `!import` metrics from `scipy`, `scikit-learn`, or `statsmodels`. Each metric function should take, in order, the true values and the predicted values as input arguments. For example,
[`sklearn.metrics.average_precision_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.average_precision_score.html) takes `y_true` and `y_score` as input.
- `checkpoint_resume`: Default is `None`. If not `None`, pass in the path to a model weights file generated by `torch.save` (and therefore readable by `torch.load`) to resume training.
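As an illustration of the argument order that `metrics` expects, the following is a sketch of a custom metric in plain Python. The function `mean_absolute_error` and the dictionary below are hypothetical; in a Selene configuration you would reference a metric through `!import` rather than define it inline, and this sketch assumes the inputs are flat sequences of numbers.

```python
from typing import Callable, Dict, Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Hypothetical metric: true values first, predicted values second."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Metric names (str) mapped to metric functions, mirroring the shape of `metrics`.
metrics: Dict[str, Callable[[Sequence[float], Sequence[float]], float]] = {
    "mean_absolute_error": mean_absolute_error,
}

# Evaluate every registered metric on a toy example.
scores = {name: fn([1.0, 0.0, 1.0], [0.5, 0.0, 1.0]) for name, fn in metrics.items()}
```

Any function with this `(true values, predicted values)` signature should fit the same slot, which is why the scikit-learn scoring functions can be used directly.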
-#### Additional notes
+
+### Additional notes
Attentive readers might have noticed that in the [documentation for the `TrainModel` class](https://selene.flatironinstitute.org/selene.html#trainmodel) there are more input arguments than are required to instantiate the class through the CLI configuration file. This is because they are assumed to be carried through/retrieved from other configuration keys for consistency. Specifically:
- `output_dir` can be specified as a top-level key in the configuration. You can specify it within each function-type constructor (e.g. `!obj:selene_sdk.TrainModel`) if you prefer. If `output_dir` exists as a top-level key, Selene uses the top-level `output_dir` and ignores all other `output_dir` keys. **The `output_dir` is omitted in many of the configurations for this reason.**
- `model`, `loss_criterion`, `optimizer_class`, `optimizer_kwargs` are all retrieved from the path in the [`model` configuration](#model-architecture).
- `data_sampler` has its own separate configuration that you will need to specify in the same YAML file. Please see [Sampler configurations](#sampler-configurations) for more information.
-#### Expected outputs for training
+### Expected outputs for training
These outputs will be written to `output_dir` (a top-level parameter that can also be specified within the function-type constructor, see above).
- `best_model.pth.tar`: the best performing model so far. IMPORTANT: for all `*.pth.tar` files that Selene currently outputs, we save additional information beyond the model's state dictionary so that users may continue training these models through Selene if they wish. If you would like to save only the state dictionary, you can run `out = torch.load(<*.pth.tar>)` and then save only the `state_dict` key with `torch.save(out["state_dict"], <state_dict_only.pth.tar>)`.
- `checkpoint.pth.tar`: model saved every `save_checkpoint_every_n_steps` steps
- `features`: The list of distinct features the model predicts. (`input_path` to the function-type value that loads the features as a list.)
- `trained_model_path`: Path to the trained model weights file, which should have been generated/saved using `torch.save`. (i.e. you can pass in the saved model file generated by Selene's `TrainModel` class.)
-#### Optional parameters
+### Optional parameters
- `batch_size`: Default is 64. Specify the batch size to process examples. Should be a power of 2.
- `n_test_samples`: Default is `None`. Use `n_test_samples` if you want to limit the number of samples on which you evaluate your model. If you are using a sampler of type [`selene_sdk.samplers.OnlineSampler`](#samplers-used-for-training-and-evaluation-optionally) (you must specify a test partition in this case), it will default to 640000 test samples if `n_test_samples` is `None`. If you are using a file sampler ([multiple-file sampler](#multiple-file-sampler) or [BED](#bed-file-sampler)/[matrix](#matrix-file-sampler) file samplers), it will use all samples available in the file.
- `report_gt_feature_n_positives`: Default is 10. In total, each class/feature must have more than `report_gt_feature_n_positives` positive examples in the test set to be considered in the performance computation. The output file that reports each class's performance will report 'NA' for classes that do not have enough positive samples.
- `use_cuda`: Default is False. Specify whether CUDA-enabled GPUs are available for torch to use.
- `data_parallel`: Default is False. Specify whether multiple GPUs are available for torch to use.
- `use_features_ord`: Default is None. Specify an ordered list of features for which to run the evaluation. The features in this list must be identical to or a subset of `features`, and in the order you want the resulting `test_targets.npz` and `test_predictions.npz` to be saved.
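The `report_gt_feature_n_positives` rule can be sketched as a simple filter. This is not Selene's implementation; the function `reportable_features` and the example counts are hypothetical, and the only behavior taken from the text above is the strict "more than" comparison.

```python
from typing import Dict


def reportable_features(
    positives_per_feature: Dict[str, int],
    report_gt_feature_n_positives: int = 10,
) -> Dict[str, bool]:
    """Mark which features have enough positive examples to be scored."""
    return {
        # Strictly greater than the threshold, per the parameter description.
        feature: n_positives > report_gt_feature_n_positives
        for feature, n_positives in positives_per_feature.items()
    }


# Hypothetical positive counts per feature in a test set.
counts = {"CTCF": 250, "H3K4me3": 10, "DNase": 11}
status = reportable_features(counts)
```

Features marked `False` here correspond to the classes whose performance would be reported as 'NA'.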
-#### Additional notes
+### Additional notes
Similar to the `train_model` configuration, any arguments that you find in [the documentation](https://selene.flatironinstitute.org/selene.html#evaluatemodel) that are not present in the function-type value's arguments are automatically instantiated and passed in by Selene.

If you use a [sampler with multiple data partitions](#samplers-used-for-training-and-evaluation-optionally) with the `evaluate_model` configuration, **please make sure that your sampler configuration's `mode` parameter is set to `test`**.
-#### Expected outputs for evaluation
+### Expected outputs for evaluation
These outputs will be written to `output_dir` (a top-level parameter that can also be specified within the function-type constructor).
- `test_performance.txt`: The breakdown of performance metrics by each class that the model predicts. Columns are `class` and the metrics you specified (defaults: `roc_auc` and `average_precision`).
- `test_predictions.npz`: The model predictions for each sample in the test set. Useful if you want to make your own visualizations/figures.
- `trained_model_path`: Path to the trained model weights file, which should have been generated/saved using `torch.save`. (i.e. you can pass in the saved model file generated by Selene's `TrainModel` class.)
- `sequence_length`: The sequence length the model is expecting for each input.
- `features`: The list of distinct features the model predicts. (`input_path` to the function-type value that loads the features as a list.)
-#### Optional parameters
+### Optional parameters
- `batch_size`: Default is 64. The size of the mini-batches to use.
- `use_cuda`: Default is `False`. Specify whether CUDA-enabled GPUs are available for torch to use.
- `reference_sequence`: Default is the class `selene_sdk.sequences.Genome`. The type of sequence on which this analysis will be performed (must be type `selene_sdk.sequences.Sequence`).
@@ -438,7 +439,6 @@ With the exception of `intervals_path` and `sample_negative`, all other paramete
- `sample_negative`: Optional, default is False. Specify whether negative examples (i.e. samples with no positive labels) should be drawn. When False, the sampler will check if the `center_bin_to_predict` in the input sequence contains at least 1 of the features/classes the model wants to predict. When True, no such check is made.
#### Multiple-file sampler
-
The multi-file sampler loads in the training, validation, and optionally, the testing dataset. The configuration for this therefore asks that you fill in some keys with the function-type constructors of type `selene_sdk.samplers.file_samplers.FileSampler`. Please consult the following sections for information about these file samplers.

An example configuration for the multiple-file sampler:
@@ -623,7 +623,7 @@ create_subdirectory: False
...
```
-### Some notes
+#### Some notes
- For the matrix file sampler, we assume that you know ahead of time the shape of the data matrix. That is, which dimension is the batch dimension? Sequence? Alphabet (should be size 4 for DNA/RNA)? You must specify the keys that end in `axis` unless the shape of the sequences matrix is `(n_samples, n_alphabet, n_sequence_length)` and the shape of the targets matrix is `(n_samples, n_targets)`.
- In this case, since `create_subdirectory` is False, all outputs from evaluate are written to `output_dir` directly (as opposed to being written in a timestamped subdirectory). Be careful of overwriting files.