- This repository is a demo of a plugin built on Metaflow to support Kubernetes as a compute abstraction for running machine learning workflows.
- Works with a full Kubernetes cluster or Minikube. Works with both the local and service-based Metadata Providers.
- To run the `metaflow run` command within a container, follow the instructions given in Using Metaflow with Kubernetes Cluster. The plugin supports running Metaflow within a container or on the local machine to orchestrate workflows on Kubernetes.
- As Metaflow is currently coupled with AWS, S3 is required as the datastore. More datastores will be supported in the future.
- `multi_step_mnist.py` is a demo workflow which showcases how Metaflow can be used with Kubernetes to train multiple models in parallel with different hyperparameters and finally gather the collated results. The flow looks like the following:

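A rough structural sketch of such a parallel-training flow is shown below. This is illustrative only, not the actual code of `multi_step_mnist.py`; the artifact names and hyperparameter values are placeholders.

```python
from metaflow import FlowSpec, Parameter, step


class MNISTSketchFlow(FlowSpec):
    """Illustrative structure only; the real flow lives in multi_step_mnist.py."""

    num_training_examples = Parameter('num_training_examples', default=1000)

    @step
    def start(self):
        # Fan out: one branch per hyperparameter setting (placeholder values).
        self.learning_rates = [0.1, 0.01, 0.001]
        self.next(self.train, foreach='learning_rates')

    @step
    def train(self):
        # Each branch can be pushed to Kubernetes via the @kube decorator
        # or `--with kube:cpu=2,memory=4000,image=...` on the CLI.
        self.lr = self.input
        self.accuracy = 0.0  # placeholder for real training logic
        self.next(self.join)

    @step
    def join(self, inputs):
        # Gather the collated results from the parallel training branches.
        self.results = [(inp.lr, inp.accuracy) for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(self.results)


if __name__ == '__main__':
    MNISTSketchFlow()
```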
- Information on cluster setup and the Metaflow-related services on Kubernetes is available here.
- Install Minikube or use the commands in `kops.sh` to set up a cluster on AWS. For more documentation on `kops` with AWS, check here.
- The plugin requires AWS keys to be set in environment variables (a quick sanity-check snippet follows this list).
- Set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` in `setup.sh`.
- Add the S3 bucket location to the `METAFLOW_DATASTORE_SYSROOT_S3` environment variable in `setup.sh`.
- Run `sh setup.sh` to use Minikube and run the flow with the plugin repo on Kubernetes.
- If you are not using Minikube, a kube config is required.
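Before running `sh setup.sh`, it can help to confirm that the variables mentioned above are actually exported. The snippet below is a small convenience check, not part of the repo; the variable names are the ones this README refers to.

```python
import os

# Variables referenced in this README. METAFLOW_KUBE_CONFIG_PATH and
# METAFLOW_KUBE_NAMESPACE are optional and default to ~/.kube/config and
# "default" (see the caveats section below).
required = [
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'AWS_SESSION_TOKEN',
    'METAFLOW_DATASTORE_SYSROOT_S3',
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit('Missing environment variables: ' + ', '.join(missing))
```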
- Usage is very similar to the `@batch` decorator.
- On top of any `@step`, add the `@kube` decorator, or use `--with kube:cpu=2,memory=4000,image=python:3.7` in the CLI args (a minimal sketch appears after this list).
- To deploy the entire runtime directly into Kubernetes as a job, use the `kube-deploy run` command: `python multi_step_mnist.py --with kube:cpu=3.2,memory=4000,image=tensorflow/tensorflow:latest-py3 kube-deploy run --num_training_examples 1000 --dont-exit`. `--dont-exit` will follow the log trail from the job; otherwise the workflow is deployed as a job on Kubernetes which destroys itself once it ends.
- Direct deploy to Kubernetes only works with a service-based metadata provider.
- Good practice before moving directly to `kube-deploy` would be:
  - Local tests: `python multi_step_mnist.py run --num_training_examples 1000`, with or without Conda.
  - Dry run: `python multi_step_mnist.py --with kube:cpu=3.2,memory=4000,image=tensorflow/tensorflow:latest-py3 run --num_training_examples 1000`.
  - On a successful dry run, run the larger dataset: `python multi_step_mnist.py --with kube:cpu=3.2,memory=4000,image=tensorflow/tensorflow:latest-py3 kube-deploy run --num_training_examples 50000`.
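As referenced in the usage list above, a minimal per-step sketch might look like the following. The import path of the `@kube` decorator is an assumption here (written to mirror Metaflow's built-in `@batch`); check the plugin source for the exact location.

```python
from metaflow import FlowSpec, step
# Assumed import: the plugin is expected to expose `kube` alongside the
# built-in decorators; adjust this import if the plugin places it elsewhere.
from metaflow import kube


class TinyKubeFlow(FlowSpec):

    # cpu/memory/image mirror the `--with kube:cpu=2,memory=4000,image=python:3.7`
    # CLI form shown above.
    @kube(cpu=2, memory=4000, image='python:3.7')
    @step
    def start(self):
        print('running inside a Kubernetes pod')
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    TinyKubeFlow()
```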
- To run with Conda, add `'python-kubernetes':'10.0.1'` to the `libraries` argument of the `@conda_base` flow decorator (see the sketch after this list).
- Use `image=python:3.6` when running with Conda in `--with kube:`. Ideally that should be the Python version used/mentioned in Conda.
- Direct deploy to Kubernetes with a Conda environment is supported: `python multi_step_mnist.py --with kube:cpu=3.2,memory=4000,image=python:3.6 --environment=conda kube-deploy run --num_training_examples 1000 --dont-exit`.
- Ensure you use `image=python:<conda_python_version>`.
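For reference, here is a minimal skeleton showing where `@conda_base` sits and how `python-kubernetes` fits into `libraries`. The library versions are taken from the caveats section below; the flow body is a placeholder.

```python
from metaflow import FlowSpec, conda_base, step


# Pin the Python version to match the image passed via `--with kube:image=python:3.6`.
@conda_base(python='3.6',
            libraries={'numpy': '1.18.1',
                       'tensorflow': '1.4.0',
                       'python-kubernetes': '10.0.1'})
class CondaKubeFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    CondaKubeFlow()
```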
- Instructions for GPU-based setup for Metaflow on Kubernetes can be found here.
- The deployment works for CUDA 10.2 and 9.1.
- Results of the GPU run are in `experiments_analytics_gpumnist.ipynb`.
- `python multi_step_mnist.py kube list`: shows the currently running jobs of the flow.
- `python multi_step_mnist.py kube kill`: kills all jobs on Kube. Any Metaflow runtime accessing those jobs will be gracefully exited.
- `python multi_step_mnist.py kube-deploy run`: runs the Metaflow runtime inside a container on the Kubernetes cluster. Needs the metadata service to work.
- `python multi_step_mnist.py kube-deploy list`: lists any running deployments of the current flow on Kubernetes.
- Check `experiments_analytics.ipynb`.
- Before running the notebook, use `%set_env` in the notebook to set env vars for the AWS keys.
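Outside the notebook, finished runs can also be inspected with the Metaflow client API. The flow and artifact names below are guesses for illustration; substitute the ones defined in `multi_step_mnist.py`.

```python
from metaflow import Flow

# AWS credentials must already be set in the environment (the same keys the
# notebook sets with %set_env).
run = Flow('MultiStepMNIST').latest_successful_run   # flow name is a guess
print(run.id, run.finished)
print(run.data.results)                              # artifact name is illustrative
```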
- Supports an S3-based datastore. Won't work without S3 as the datastore.
- The current version has been tested with GPUs. The current deployment automation scripts support CUDA v9.1.
- Needs the `METAFLOW_KUBE_CONFIG_PATH` and `METAFLOW_KUBE_NAMESPACE` env vars for the Kubernetes config file and namespace. Takes `~/.kube/config` and `default` as defaults.
- Requires AWS secrets in env vars for deploying pods. That needs fixing.
- Supports Conda decorators. The current `setup.sh` uses the tensorflow/tensorflow image.
- To use Conda, add `@conda_base(python=get_python_version(), libraries={'numpy':'1.18.1','tensorflow':'1.4.0','python-kubernetes':'10.0.1'})` above the class definition to make it work with Conda.
- To run with Conda, change the line in `setup.sh` to the following: `.env/bin/python multi_step_mnist.py --environment=conda --with kube:cpu=2,memory=4000,image=python:3.7 run --num_training_examples 2500`.
- Please also ensure that Conda is in the `PATH` env variable.
`pip install https://github.com/valayDave/metaflow/archive/kube_cpu_stable.zip`