Update run_git.rst, run_python.rst, and 4 more files...

qiuosier · qiuosier · commit 61e04ebf03b1 · 2023-02-27T15:43:03.000-05:00
diff --git a/docs/source/user_guide/jobs/run_git.rst b/docs/source/user_guide/jobs/run_git.rst
@@ -3,12 +3,18 @@ Run Code from Git Repo
 
 The :py:class:`~ads.jobs.GitPythonRuntime` allows you to run source code from a Git repository as a job.
 
+.. include:: ../jobs/toc_local.rst
+
+PyTorch Example
+===============
+
 The following example shows how to run a
 `PyTorch Neural Network Example to train third order polynomial predicting y=sin(x) 
 <https://github.com/pytorch/tutorials/blob/master/beginner_source/examples_nn/polynomial_nn.py>`_.
 
 .. include:: ../jobs/tabs/git_runtime.rst
 
+
 Git Repository
 ==============
 
@@ -38,7 +44,7 @@ Entrypoint
 The entrypoint specifies how the source code is invoked.
 The :py:meth:`~ads.jobs.GitPythonRuntime.with_entrypoint` supports the following arguments:
 
-* ``path``: Required. The relative path for the script, module, or file to start the job.
+* ``path``: Required. The relative path of the script/module from the root of the Git repository.
 * ``func``: Optional. The function in the script specified by ``path`` to call.
   If you don't specify it, then the script specified by ``path`` is run as a Python script in a subprocess.
 
@@ -60,6 +66,18 @@ The arguments can be strings, ``list`` of strings or ``dict`` containing only st
 Arguments are not used when the entrypoint is a notebook.
 
 
+Working Directory
+=================
+
+By default, the working directory is the root of the Git repository.
+This can be configured by :py:meth:`~ads.jobs.GitPythonRuntime.with_working_dir`
+using a relative path from the root of the Git repository.
+
+Note that the entrypoint should always be specified as a relative path from the root of the Git repository,
+regardless of the working directory.
+The Python paths and output directory should be specified relative to the working directory.
+
+
 Python Paths
 ============
 
@@ -68,17 +86,19 @@ The working directory is added to the Python paths automatically.
 You can call :py:meth:`~ads.jobs.GitPythonRuntime.with_python_path` to add additional python paths as needed.
 The paths should be relative paths from the working directory.
 
+
 Outputs
 =======
 
-The :py:meth:`~ads.jobs.GitPythonRuntime.with_output` method allows you to specify the output path ``output_path``
+The :py:meth:`~ads.jobs.GitPythonRuntime.with_output` method allows you to specify the output path ``output_dir``
 in the job run and a remote URI (``output_uri``).
-Files in the ``output_path`` are copied to the remote output URI after the job run finishes successfully.
-Note that the ``output_path`` should be a path relative to the working directory.
+Files in the ``output_dir`` are copied to the remote output URI after the job run finishes successfully.
+Note that the ``output_dir`` should be a path relative to the working directory.
 
 OCI object storage location can be specified in the format of ``oci://bucket_name@namespace/path/to/dir``.
 Please make sure you configure the IAM policy to allow the job run dynamic group to use object storage.
 
+
 Metadata
 ========
 The :py:class:`~ads.jobs.GitPythonRuntime` updates metadata as free-form tags of the job run
@@ -93,6 +113,6 @@ after the job run finishes. The following tags are added automatically:
 The new values overwrite any existing tags.
 If you want to skip the metadata update, set ``skip_metadata_update`` to ``True`` when initializing the runtime:
 
-.. code-block:: python3
+.. code-block:: python
 
   runtime = GitPythonRuntime(skip_metadata_update=True)
diff --git a/docs/source/user_guide/jobs/run_python.rst b/docs/source/user_guide/jobs/run_python.rst
@@ -8,6 +8,9 @@ as described in :doc:`infra_and_runtime`. This section shows the additional enha
 
 .. include:: ../jobs/toc_local.rst
 
+Example
+=======
+
 Here is an example to define and run a job using :py:class:`~ads.jobs.PythonRuntime`:
 
 .. include:: ../jobs/tabs/python_runtime.rst
diff --git a/docs/source/user_guide/jobs/run_script.rst b/docs/source/user_guide/jobs/run_script.rst
@@ -15,6 +15,9 @@ Here is an example:
 
 .. include:: ../jobs/tabs/script_runtime.rst
 
+An `example script <https://github.com/oracle-samples/oci-data-science-ai-samples/blob/master/jobs/shell/shell-with-args.sh>`_
+is available in the `Data Science AI Samples GitHub repository <https://github.com/oracle-samples/oci-data-science-ai-samples>`_.
+
 Working Directory
 =================
 
diff --git a/docs/source/user_guide/jobs/tabs/training_job.rst b/docs/source/user_guide/jobs/tabs/training_job.rst
@@ -0,0 +1,92 @@
+.. tabs::
+
+  .. code-tab:: python
+    :caption: Python
+
+    from ads.jobs import Job, DataScienceJob, GitPythonRuntime
+
+    job = (
+        Job(name="Training RNN with PyTorch")
+        .with_infrastructure(
+            DataScienceJob()
+            .with_log_group_id("<log_group_ocid>")
+            .with_log_id("<log_ocid>")
+            .with_shape_name("VM.GPU3.1")
+            # The following infrastructure configurations are optional
+            # if you are in an OCI Data Science notebook session.
+            # The configurations of the notebook session will be used as defaults.
+            .with_compartment_id("<compartment_ocid>")
+            .with_project_id("<project_ocid>")
+            # Default block storage size is 50GB
+            .with_block_storage_size(50)
+        )
+        .with_runtime(
+            GitPythonRuntime(skip_metadata_update=True)
+            # Use service conda pack
+            .with_service_conda("pytorch110_p38_gpu_v1")
+            # Specify training source code from GitHub
+            .with_source(url="https://github.com/pytorch/examples.git", branch="main")
+            # Entrypoint is a relative path from the root of the Git repository
+            .with_entrypoint("word_language_model/main.py")
+            # Pass the arguments as: "--epochs 5 --save model.pt --cuda"
+            .with_argument(epochs=5, save="model.pt", cuda=None)
+            # Set working directory, which will also be added to PYTHONPATH
+            .with_working_dir("word_language_model")
+            # Save the output to OCI object storage
+            # output_dir is relative to working directory
+            .with_output(output_dir=".", output_uri="oci://bucket@namespace/prefix")
+        )
+    )
+
+  .. code-tab:: yaml
+    :caption: YAML
+
+    kind: job
+    spec:
+      name: "Training RNN with PyTorch"
+      infrastructure:
+        kind: infrastructure
+        type: dataScienceJob
+        spec:
+          blockStorageSize: 50
+          compartmentId: <compartment_ocid>
+          jobInfrastructureType: STANDALONE
+          jobType: DEFAULT
+          logGroupId: <log_group_ocid>
+          logId: <log_ocid>
+          projectId: <project_ocid>
+          shapeConfigDetails:
+            memoryInGBs: 16
+            ocpus: 1
+          shapeName: VM.Standard.E3.Flex
+          subnetId: <subnet_ocid>
+      runtime:
+        kind: runtime
+        type: gitPython
+        spec:
+          args:
+          - --epochs
+          - '5'
+          - --save
+          - model.pt
+          - --cuda
+          branch: main
+          conda:
+            slug: pytorch110_p38_gpu_v1
+            type: service
+          entrypoint: word_language_model/main.py
+          outputDir: .
+          outputUri: oci://bucket@namespace/prefix
+          skipMetadataUpdate: true
+          url: https://github.com/pytorch/examples.git
+          workingDir: word_language_model
+
+
+.. code-block:: python
+
+  # Create the job on OCI Data Science
+  job.create()
+  # Start a job run
+  run = job.run()
+  # Stream the job run outputs
+  run.watch()
diff --git a/docs/source/user_guide/jobs/tabs/training_mnist.rst b/docs/source/user_guide/jobs/tabs/training_mnist.rst
diff --git a/docs/source/user_guide/model_training/training_with_oci.rst b/docs/source/user_guide/model_training/training_with_oci.rst
@@ -7,9 +7,39 @@ enables you to define and run repeatable machine learning tasks on a fully manag
 You can have Compute resource on demand and run applications that perform tasks such as
 data preparation, model training, hyperparameter tuning, and batch inference.
 
-Here is an example for training MNIST model with PyTorch using source code directly from GitHub.
+Here is an example of training an RNN for `Word-level Language Modeling <https://github.com/pytorch/examples/tree/main/word_language_model>`_,
+using the source code directly from GitHub.
 
-.. include:: ../jobs/tabs/training_mnist.rst
+.. include:: ../jobs/tabs/training_job.rst
+
+The job run will:
+
+* Set up the PyTorch conda environment
+* Fetch the source code from GitHub
+* Run the training script with the specified arguments
+* Save the outputs to OCI object storage
+
+The following is example output from the job run:
+
+.. code-block:: text
+
+    2023-02-27 20:26:36 - Job Run ACCEPTED
+    2023-02-27 20:27:05 - Job Run ACCEPTED, Infrastructure provisioning.
+    2023-02-27 20:28:27 - Job Run ACCEPTED, Infrastructure provisioned.
+    2023-02-27 20:28:53 - Job Run ACCEPTED, Job run bootstrap starting.
+    2023-02-27 20:33:05 - Job Run ACCEPTED, Job run bootstrap complete. Artifact execution starting.
+    2023-02-27 20:33:08 - Job Run IN_PROGRESS, Job run artifact execution in progress.
+    2023-02-27 20:33:31 - | epoch   1 |   200/ 2983 batches | lr 20.00 | ms/batch  8.41 | loss  7.63 | ppl  2064.78
+    2023-02-27 20:33:32 - | epoch   1 |   400/ 2983 batches | lr 20.00 | ms/batch  8.23 | loss  6.86 | ppl   949.18
+    2023-02-27 20:33:34 - | epoch   1 |   600/ 2983 batches | lr 20.00 | ms/batch  8.21 | loss  6.47 | ppl   643.12
+    2023-02-27 20:33:36 - | epoch   1 |   800/ 2983 batches | lr 20.00 | ms/batch  8.22 | loss  6.29 | ppl   537.11
+    2023-02-27 20:33:37 - | epoch   1 |  1000/ 2983 batches | lr 20.00 | ms/batch  8.22 | loss  6.14 | ppl   462.61
+    2023-02-27 20:33:39 - | epoch   1 |  1200/ 2983 batches | lr 20.00 | ms/batch  8.21 | loss  6.05 | ppl   425.85
+    ...
+    2023-02-27 20:35:41 - =========================================================================================
+    2023-02-27 20:35:41 - | End of training | test loss  4.96 | test ppl   142.94
+    2023-02-27 20:35:41 - =========================================================================================
+    ...
 
 For more details, see: