- This repository is a demo of a plugin built on Metaflow to support Kubernetes as a compute abstraction for running machine learning workflows.
- Works with a full Kubernetes cluster or Minikube. Works with both the local and service-based Metadata Providers.
- To run the `metaflow run` command within a container, follow the instructions given in Using Metaflow with Kubernetes Cluster. The plugin supports running Metaflow within a container or on the local machine to orchestrate workflows on Kubernetes.
- As Metaflow is currently coupled with AWS, S3 is required as the datastore. More datastores will be supported in the future.
- `multi_step_mnist.py` is a demo workflow which showcases how Metaflow can be used with Kubernetes to train multiple models in parallel with different hyperparameters and finally gather the collated results. The flow looks like the following:

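A rough structural sketch of such a parallel-training flow is shown below. This is illustrative only, not the actual code of `multi_step_mnist.py`; the artifact names and hyperparameter values are placeholders.

```python
from metaflow import FlowSpec, Parameter, step


class MNISTSketchFlow(FlowSpec):
    """Illustrative structure only; the real flow lives in multi_step_mnist.py."""

    num_training_examples = Parameter('num_training_examples', default=1000)

    @step
    def start(self):
        # Fan out: one branch per hyperparameter setting (placeholder values).
        self.learning_rates = [0.1, 0.01, 0.001]
        self.next(self.train, foreach='learning_rates')

    @step
    def train(self):
        # Each branch can be pushed to Kubernetes via the @kube decorator
        # or `--with kube:cpu=2,memory=4000,image=...` on the CLI.
        self.lr = self.input
        self.accuracy = 0.0  # placeholder for real training logic
        self.next(self.join)

    @step
    def join(self, inputs):
        # Gather the collated results from the parallel training branches.
        self.results = [(inp.lr, inp.accuracy) for inp in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(self.results)


if __name__ == '__main__':
    MNISTSketchFlow()
```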
- Information on cluster setup and the Metaflow-related services on Kubernetes is available here.
- Install Minikube or use the commands in `kops.sh` to set up a cluster on AWS. For more documentation on `kops` with AWS, check here.
- The plugin requires AWS keys to be set in environment variables (a quick sanity-check snippet follows this list).
- Set `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN` in `setup.sh`.
- Add the S3 bucket location to the `METAFLOW_DATASTORE_SYSROOT_S3` environment variable in `setup.sh`.
- Run `sh setup.sh` to use Minikube and run the flow with the plugin repo on Kubernetes.
- If you are not using Minikube, a kube config is required.
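Before running `sh setup.sh`, it can help to confirm that the variables mentioned above are actually exported. The snippet below is a small convenience check, not part of the repo; the variable names are the ones this README refers to.

```python
import os

# Variables referenced in this README. METAFLOW_KUBE_CONFIG_PATH and
# METAFLOW_KUBE_NAMESPACE are optional and default to ~/.kube/config and
# "default" (see the caveats section below).
required = [
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'AWS_SESSION_TOKEN',
    'METAFLOW_DATASTORE_SYSROOT_S3',
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit('Missing environment variables: ' + ', '.join(missing))
```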
- Usage is very similar to the `@batch` decorator.
- On top of any `@step`, add the `@kube` decorator, or use `--with kube:cpu=2,memory=4000,image=python:3.7` in the CLI args (a minimal sketch appears after this list).
- To deploy the entire runtime directly into Kubernetes as a job, use the `kube-deploy run` command: `python multi_step_mnist.py --with kube:cpu=3.2,memory=4000,image=tensorflow/tensorflow:latest-py3 kube-deploy run --num_training_examples 1000 --dont-exit`. `--dont-exit` will follow the log trail from the job; otherwise the workflow is deployed as a job on Kubernetes which destroys itself once it ends.
- Direct deploy to Kubernetes only works with a service-based metadata provider.
- Good practice before moving directly to `kube-deploy` would be:
  - Local tests: `python multi_step_mnist.py run --num_training_examples 1000`, with or without Conda.
  - Dry run: `python multi_step_mnist.py --with kube:cpu=3.2,memory=4000,image=tensorflow/tensorflow:latest-py3 run --num_training_examples 1000`.
  - On a successful dry run, run the larger dataset: `python multi_step_mnist.py --with kube:cpu=3.2,memory=4000,image=tensorflow/tensorflow:latest-py3 kube-deploy run --num_training_examples 50000`.
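As referenced in the usage list above, a minimal per-step sketch might look like the following. The import path of the `@kube` decorator is an assumption here (written to mirror Metaflow's built-in `@batch`); check the plugin source for the exact location.

```python
from metaflow import FlowSpec, step
# Assumed import: the plugin is expected to expose `kube` alongside the
# built-in decorators; adjust this import if the plugin places it elsewhere.
from metaflow import kube


class TinyKubeFlow(FlowSpec):

    # cpu/memory/image mirror the `--with kube:cpu=2,memory=4000,image=python:3.7`
    # CLI form shown above.
    @kube(cpu=2, memory=4000, image='python:3.7')
    @step
    def start(self):
        print('running inside a Kubernetes pod')
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    TinyKubeFlow()
```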
- To run with Conda, add `'python-kubernetes':'10.0.1'` to the `libraries` argument of the `@conda_base` flow decorator (see the sketch after this list).
- Use `image=python:3.6` when running with Conda in `--with kube:`. Ideally that should be the Python version used/mentioned in Conda.
- Direct deploy to Kubernetes with a Conda environment is supported: `python multi_step_mnist.py --with kube:cpu=3.2,memory=4000,image=python:3.6 --environment=conda kube-deploy run --num_training_examples 1000 --dont-exit`.
- Ensure you use `image=python:<conda_python_version>`.
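For reference, here is a minimal skeleton showing where `@conda_base` sits and how `python-kubernetes` fits into `libraries`. The library versions are taken from the caveats section below; the flow body is a placeholder.

```python
from metaflow import FlowSpec, conda_base, step


# Pin the Python version to match the image passed via `--with kube:image=python:3.6`.
@conda_base(python='3.6',
            libraries={'numpy': '1.18.1',
                       'tensorflow': '1.4.0',
                       'python-kubernetes': '10.0.1'})
class CondaKubeFlow(FlowSpec):

    @step
    def start(self):
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == '__main__':
    CondaKubeFlow()
```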
- Instructions for GPU-based setup for Metaflow on Kubernetes can be found here.
- The deployment works for CUDA 10.2 and 9.1.
- Results of the GPU run are in `experiments_analytics_gpumnist.ipynb`.
- `python multi_step_mnist.py kube list`: shows the currently running jobs of the flow.
- `python multi_step_mnist.py kube kill`: kills all jobs on Kube. Any Metaflow runtime accessing those jobs will be gracefully exited.
- `python multi_step_mnist.py kube-deploy run`: runs the Metaflow runtime inside a container on the Kubernetes cluster. Needs the metadata service to work.
- `python multi_step_mnist.py kube-deploy list`: lists any running deployments of the current flow on Kubernetes.
- Check `experiments_analytics.ipynb`.
- Before running the notebook, use `%set_env` in the notebook to set env vars for the AWS keys.
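Outside the notebook, finished runs can also be inspected with the Metaflow client API. The flow and artifact names below are guesses for illustration; substitute the ones defined in `multi_step_mnist.py`.

```python
from metaflow import Flow

# AWS credentials must already be set in the environment (the same keys the
# notebook sets with %set_env).
run = Flow('MultiStepMNIST').latest_successful_run   # flow name is a guess
print(run.id, run.finished)
print(run.data.results)                              # artifact name is illustrative
```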
- Supports an S3-based datastore. Won't work without S3 as the datastore.
- The current version has been tested with GPUs. The current deployment automation scripts support CUDA v9.1.
- Needs the `METAFLOW_KUBE_CONFIG_PATH` and `METAFLOW_KUBE_NAMESPACE` env vars for the Kubernetes config file and namespace. Takes `~/.kube/config` and `default` as defaults.
- Requires AWS secrets in env vars for deploying pods. That needs fixing.
- Supports Conda decorators. The current `setup.sh` uses the tensorflow/tensorflow image.
- To use Conda, add `@conda_base(python=get_python_version(), libraries={'numpy':'1.18.1','tensorflow':'1.4.0','python-kubernetes':'10.0.1'})` above the class definition to make it work with Conda.
- To run with Conda, change the line in `setup.sh` to the following: `.env/bin/python multi_step_mnist.py --environment=conda --with kube:cpu=2,memory=4000,image=python:3.7 run --num_training_examples 2500`.
- Please also ensure that Conda is in the `PATH` env variable.
`pip install https://github.com/valayDave/metaflow/archive/kube_cpu_stable.zip`