|
1 | | -Overview |
2 | | -========= |
3 | 1 |
|
4 | | -.. mdinclude:: ./overview.md |
| 2 | +Functional overview of the SDK |
| 3 | +============================== |
| 4 | + |
| 5 | +The software development kit (SDK), formally known as ``selene_sdk``\ , is an extensible Python package intended to ease development of new programs that leverage sequence-level models through code reuse. |
| 6 | +The package is composed of six submodules: *sequences*\ , *samplers*\ , *targets*\ , *predict*\ , *interpret*\ , and *utils*. |
| 7 | +It also provides two top-level classes: *TrainModel* and *EvaluateModel*. |
| 8 | +In the following sections, we briefly discuss each submodule and top-level class. |
| 9 | + |
| 10 | +Sampling |
| 11 | +-------- |
| 12 | + |
| 13 | +We start with the modules for sampling data because both training and evaluating a model in Selene require the user to specify the kind of sampler they want to use. |
| 14 | + |
| 15 | +*sequences* submodule (\ `API <http://selene.flatironinstitute.org/sequences.html>`_\ ) |
| 16 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 17 | + |
| 18 | +The *sequences* submodule defines the ``Sequence`` type, and includes implementations for several sub-classes. |
| 19 | +These sub-classes--\ ``Genome`` and ``Proteome``\ --represent different kinds of biological sequences (e.g. DNA, RNA, amino acid sequences), and implement the ``Sequence`` interface’s methods for reading the reference sequence from files (e.g. FASTA), querying subsequences of the reference sequence, and subsequently converting those queried subsequences into a numeric representation. |
| 20 | +Further, each sequence class specifies its own alphabet (e.g., nucleotides, amino acids) to represent query results as strings. |
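The conversion between a queried subsequence and its numeric representation can be sketched as a one-hot encoding over the alphabet. This is a minimal illustration and not the ``selene_sdk`` implementation: the names ``DNA_ALPHABET``, ``encode_sequence``, and ``decode_sequence`` are hypothetical, and the choice to give unknown characters a uniform row is an assumption made for the example.

```python
# Hypothetical illustration of one-hot encoding; not selene_sdk code.
DNA_ALPHABET = ["A", "C", "G", "T"]


def encode_sequence(sequence, alphabet=DNA_ALPHABET):
    """Return a len(sequence) x len(alphabet) matrix as nested lists."""
    index = {char: i for i, char in enumerate(alphabet)}
    encoding = []
    for char in sequence.upper():
        if char in index:
            row = [0.0] * len(alphabet)
            row[index[char]] = 1.0
        else:
            # Assumption: unknown characters (e.g. N) get a uniform row.
            row = [1.0 / len(alphabet)] * len(alphabet)
        encoding.append(row)
    return encoding


def decode_sequence(encoding, alphabet=DNA_ALPHABET, unknown="N"):
    """Map each row back to a character; tied rows become `unknown`."""
    chars = []
    for row in encoding:
        best = max(row)
        if row.count(best) == 1:
            chars.append(alphabet[row.index(best)])
        else:
            chars.append(unknown)
    return "".join(chars)
```

Swapping in a different alphabet list (for example, the amino acids) is all that is needed to reuse the same two functions for another sequence type.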
| 21 | + |
| 22 | +*targets* submodule (\ `API <http://selene.flatironinstitute.org/targets.html>`_\ ) |
| 23 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 24 | + |
| 25 | +The *targets* submodule defines the ``Target`` class, which specifies the interface for classes to retrieve labels or “targets” for a given query sequence. |
| 26 | +At present, we supply a single implementation of this interface: ``GenomicFeatures``. |
| 27 | +This class takes a tabix-indexed file of intervals for each label we want our model to predict, and uses this file to identify the labels for a given sequence drawn from the reference. |
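The idea of identifying labels from a file of intervals can be sketched as an overlap test between the query region and each labeled interval. The sketch below is an assumption-laden stand-in: it scans an in-memory list instead of querying a tabix index, uses half-open coordinates, and the function name ``labels_for_query`` and its ``min_overlap`` parameter are hypothetical rather than part of ``GenomicFeatures``.

```python
# Hypothetical illustration of interval-based labeling; not selene_sdk code.
def labels_for_query(intervals, feature_names, chrom, start, end, min_overlap=1):
    """Return a 0/1 label per feature for the region chrom:[start, end).

    `intervals` is a list of (chrom, start, end, feature) tuples with
    half-open coordinates. A feature is labeled 1 when at least
    `min_overlap` bases of one of its intervals fall inside the query.
    """
    index = {name: i for i, name in enumerate(feature_names)}
    labels = [0] * len(feature_names)
    for i_chrom, i_start, i_end, feature in intervals:
        if i_chrom != chrom or feature not in index:
            continue
        # Length of the intersection; zero or negative means no overlap.
        overlap = min(end, i_end) - max(start, i_start)
        if overlap >= min_overlap:
            labels[index[feature]] = 1
    return labels
```

A linear scan is only practical for small examples; an indexed file lets the same lookup touch only the intervals near the query.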
| 28 | + |
| 29 | +*samplers* submodule (\ `API <http://selene.flatironinstitute.org/samplers.html>`_\ ) |
| 30 | +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| 31 | + |
| 32 | +The *samplers* submodule provides methods and classes for randomly sampling and partitioning datasets for training and evaluation. |
| 33 | +The ``Sampler`` interface defines the minimal requirements for a class fulfilling these functions. |
| 34 | +In particular, samplers must be able to partition data (i.e. into training, validation, and testing datasets), sample data from each partition, and, if needed, save the sampled data to a file. |
| 35 | +Further, a file of names must be provided for the features to be predicted. |
| 36 | +We provide several implementations adhering to the ``Sampler`` interface: the ``RandomPositionsSampler``\ , ``IntervalsSampler``\ , and ``MultiFileSampler``. |
| 37 | + |
| 38 | +``MultiFileSampler`` draws samples from structured data files for each partition. |
| 39 | +There is currently support for loading either .bed or .mat files via the ``FileSampler`` classes ``BedFileSampler`` and ``MatFileSampler``\ , respectively (see `API docs for file samplers <http://selene.flatironinstitute.org/samplers.file_samplers.html>`_\ ). |
| 40 | +It is worth noting that the .bed file used by ``BedFileSampler`` includes the coordinates of each sequence, and the indices corresponding to each feature for which said sequence is a positive example. |
| 41 | +We hope that users will request or contribute classes for other file samplers in the future. |
| 42 | +``MultiFileSampler`` does not support saving the sampled data to a file, so calling the ``save_dataset_to_file`` method from this class will have no effect. |
| 43 | + |
| 44 | +``RandomPositionsSampler`` and ``IntervalsSampler`` are what we call online samplers. |
| 45 | +Online samplers generate examples from the reference sequence (e.g. genome, proteome) on-the-fly--either across the whole reference sequence (random positions sampler), or from user-specified regions (intervals sampler)--using a tabix-indexed .bed file. |
| 46 | +These samplers automatically partition said data according to user-specified parameters (e.g. validate on a subset of chromosomes or on some percentage of the data). |
| 47 | +Since ``OnlineSampler``\ ’s samples are randomly generated, we allow the user to save the sampled data to file. |
| 48 | +This file can be subsequently loaded with the ``BedFileSampler``. Online samplers rely on classes from the *sequences* and *targets* submodules for retrieving each sequence and its targets in the proper matrix format. |
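Partitioning by held-out chromosomes, one of the user-specified options mentioned above, can be sketched as follows. This is not how the online samplers are implemented: ``partition_by_chromosome`` and ``sample_from`` are hypothetical helpers, and the data is a plain list of ``(chrom, start, end)`` tuples.

```python
# Hypothetical illustration of chromosome hold-out; not selene_sdk code.
import random


def partition_by_chromosome(intervals, validation_chroms, test_chroms):
    """Split (chrom, start, end) tuples into train/validate/test lists."""
    partitions = {"train": [], "validate": [], "test": []}
    for interval in intervals:
        chrom = interval[0]
        if chrom in test_chroms:
            partitions["test"].append(interval)
        elif chrom in validation_chroms:
            partitions["validate"].append(interval)
        else:
            partitions["train"].append(interval)
    return partitions


def sample_from(partition, batch_size, seed=None):
    """Draw `batch_size` intervals with replacement from one partition."""
    rng = random.Random(seed)
    return [rng.choice(partition) for _ in range(batch_size)]
```

Holding out whole chromosomes keeps nearby, highly similar regions from appearing in both the training and evaluation partitions.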
| 49 | + |
| 50 | +Training a model (\ `API <http://selene.flatironinstitute.org/selene.html#trainmodel>`_\ ) |
| 51 | +------------------------------------------------------------------------------------------ |
| 52 | + |
| 53 | +The ``TrainModel`` class may be used for training and testing of sequence-based models, and provides the core functionality of the CLI’s train command. |
| 54 | +It relies on an ``OnlineSampler`` (or a subclass of ``OnlineSampler``\ ) to automatically partition the dataset into subsets for training, validation, and testing. |
| 55 | +These subsets are then used to automatically train and validate performance for a user-specified number of steps. |
| 56 | +The testing subset is used to evaluate the model performance after training is completed. |
| 57 | +The model’s loss, area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPRC) are logged during training. (In the future, we plan to support other performance metrics. Please request specific metrics or use cases in our `GitHub issues <https://github.com/FunctionLab/selene/issues>`_.) |
| 58 | +The frequency of logging is provided by the user. |
| 59 | +At the end of evaluation, ``TrainModel`` logs the performance metrics for each feature predicted, and produces plots of the precision-recall and receiver operating characteristic curves. |
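For readers unfamiliar with the AUC metric logged here, the sketch below computes it for a single feature as the fraction of positive/negative pairs that the model ranks correctly, counting ties as half. It is a pairwise illustration and not the routine ``TrainModel`` uses; the function name ``roc_auc`` is hypothetical.

```python
# Hypothetical illustration of ROC AUC for one feature; not selene_sdk code.
def roc_auc(labels, scores):
    """Return the probability that a positive outscores a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative.")
    correct = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                correct += 1.0
            elif pos == neg:
                correct += 0.5  # ties count as half a correct ranking
    return correct / (len(positives) * len(negatives))
```

The pairwise form is quadratic in the number of examples, so it suits small checks. It also shows why a feature with no positive examples in a partition has no defined AUC.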
| 60 | + |
| 61 | +Evaluating a model (\ `API <http://selene.flatironinstitute.org/selene.html#evaluatemodel>`_\ ) |
| 62 | +----------------------------------------------------------------------------------------------- |
| 63 | + |
| 64 | +The ``EvaluateModel`` class is used to test the performance of a trained model. |
| 65 | +``EvaluateModel`` uses an instance of the ``Sampler`` class or one of its subclasses to draw samples from a test set. |
| 66 | +After using the provided model to predict labels for said data, ``EvaluateModel`` logs the performance measures (as described in "Training a model") and generates figures and a performance breakdown by feature. |
| 67 | + |
| 68 | +Using a model to make predictions (\ `API <http://selene.flatironinstitute.org/predict.html>`_\ ) |
| 69 | +------------------------------------------------------------------------------------------------- |
| 70 | + |
| 71 | +Selene’s ``predict`` submodule includes a number of methods and classes for making predictions with sequence-based models. |
| 72 | +The ``AnalyzeSequences`` class is the main class to use. |
| 73 | +It leverages a user-specified trained model to make predictions for sequences in a FASTA file, apply *in silico* mutagenesis to sequences in a FASTA file, or perform variant effect prediction on variants in a VCF file. |
| 74 | +In each case, the user can specify what ``AnalyzeSequences`` should save: raw predictions, difference scores, absolute difference scores, and/or logit scores. |
| 75 | +Note that the aforementioned “scores” can only be computed for *in silico* mutagenesis and variant effect prediction. |
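The scores need a reference and an alternate prediction, which is why they apply only to mutagenesis and variant effect prediction. The sketch below shows one plausible way to derive them from a pair of per-feature prediction vectors. The exact definitions are assumptions made for illustration (difference as alternate minus reference, logit score as the difference of log-odds), and ``logit`` and ``variant_scores`` are hypothetical names, not ``AnalyzeSequences`` methods.

```python
# Hypothetical illustration of variant scores; not selene_sdk code.
import math


def logit(p, eps=1e-7):
    """Log-odds of p, clipped away from 0 and 1 to stay finite."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def variant_scores(ref_preds, alt_preds):
    """Compare per-feature predictions for reference and alternate alleles."""
    diffs = [alt - ref for ref, alt in zip(ref_preds, alt_preds)]
    return {
        "diffs": diffs,
        "abs_diffs": [abs(d) for d in diffs],
        # Assumed definition: difference of log-odds, alternate minus reference.
        "logits": [logit(alt) - logit(ref)
                   for ref, alt in zip(ref_preds, alt_preds)],
    }
```

Under these assumptions, the log-odds form gives more weight to changes near 0 or 1 than the raw difference does.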
| 76 | + |
| 77 | +Visualizing model predictions (\ `API <http://selene.flatironinstitute.org/interpret.html>`_\ ) |
| 78 | +----------------------------------------------------------------------------------------------- |
| 79 | + |
| 80 | +The ``interpret`` submodule of ``selene_sdk`` provides methods for visualizing a sequence-based model’s predictions made with ``AnalyzeSequences``. |
| 81 | +For example, ``interpret`` includes methods for processing variant effect predictions made with ``AnalyzeSequences`` and subsequently visualizing them with a heatmap or sequence logo. |
| 82 | +The functionality included in the ``interpret`` submodule is not heavily incorporated into the CLI, but is instead intended for incorporation into user code. |
| 83 | + |
| 84 | +The utilities submodule (\ `API <http://selene.flatironinstitute.org/utils.html>`_\ ) |
| 85 | +------------------------------------------------------------------------------------- |
| 86 | + |
| 87 | +Unlike the aforementioned submodules designed around individual concepts, the ``utils`` submodule is a catch-all submodule intended to prevent cluttering of the ``selene_sdk`` top-level namespace. |
| 88 | +It provides diverse functionality at varying levels of flexibility. |
| 89 | +Some members of ``utils`` are general-purpose (e.g. configuration file parsing) while others have highly specific use cases (e.g. CLI logger initialization). |
| 90 | + |
| 91 | +Help |
| 92 | +---- |
| 93 | +Join our `Google group <https://groups.google.com/forum/#!forum/selene-sdk>`_ if you have questions about the package, case studies, or model development. |