[![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![PyPI Status Badge](https://badge.fury.io/py/elasticdl-client.svg)](https://pypi.org/project/elasticdl-client/)

- ElasticDL is a Kubernetes-native deep learning framework built on top of
- TensorFlow 2.0 that supports fault-tolerance and elastic scheduling.
+ ElasticDL is a Kubernetes-native deep learning framework
+ that supports fault-tolerance and elastic scheduling.

## Main Features

@@ -16,11 +16,11 @@ Through Kubernetes-native design, ElasticDL enables fault-tolerance and works
with the priority-based preemption of Kubernetes to achieve elastic scheduling
for deep learning tasks.

- ### TensorFlow 2.0 Eager Execution
+ ### Support for TensorFlow and PyTorch

- A distributed deep learning framework needs to know local gradients before the
- model update. Eager Execution allows ElasticDL to do it without hacking into the
- graph execution process.
+ - TensorFlow Estimator
+ - TensorFlow Keras
+ - PyTorch

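To make the Keras path above concrete, here is a minimal sketch of a model-definition file that an ElasticDL job could train. The entry-point names (`custom_model`, `loss`, `optimizer`) are illustrative assumptions, not the documented interface; the exact contract expected by `elasticdl train` is described in the ElasticDL documentation.

```python
# A hypothetical model-definition file; the function names are assumptions,
# not ElasticDL's documented contract.
import tensorflow as tf


def custom_model():
    # A plain Keras model; ElasticDL drives the distributed training around it.
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])


def loss(labels, predictions):
    # Per-batch loss computed on each worker.
    return tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            labels, predictions, from_logits=True
        )
    )


def optimizer(lr=0.1):
    return tf.keras.optimizers.SGD(lr)
```
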
### Minimalist Interface

@@ -37,30 +37,27 @@ elasticdl train \
  --volume="host_path=/data,mount_path=/data"
```

- ### Integration with SQLFlow
-
- ElasticDL will be integrated seamlessly with SQLFlow to connect SQL to
- distributed deep learning tasks with ElasticDL.
-
- ```sql
- SELECT * FROM employee LABEL income INTO my_elasticdl_model
- ```
-
## Quick Start

Please check out our [step-by-step tutorial](docs/tutorials/get_started.md) for
running ElasticDL on a local laptop, an on-prem cluster, or a public cloud such
as Google Kubernetes Engine.

+ - [TensorFlow Estimator on Minikube](docs/tutorials/elasticdl_estimator.md)
+ - [TensorFlow Keras on Minikube](docs/tutorials/elasticdl_local.md)
+ - [PyTorch on Minikube](docs/tutorials/elasticdl_torch.md)
+
## Background

- TensorFlow has its native distributed computing feature that is
+ Both TensorFlow and PyTorch have native distributed computing features that are
fault-recoverable. If some processes fail, the distributed
computing job fails; however, we can restart the job and recover its status
from the most recent checkpoint files.
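
As a rough illustration of that restart-from-checkpoint pattern in plain TensorFlow (not ElasticDL-specific; the checkpoint directory and the training loop are placeholders):

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.SGD(0.01)

# Standard TensorFlow checkpointing: if the job dies, the restarted process
# resumes from the most recent checkpoint instead of starting from scratch.
ckpt = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(ckpt, directory="/tmp/ckpt", max_to_keep=3)
ckpt.restore(manager.latest_checkpoint)  # a no-op on the very first run

for step in range(1000):
    # ... run one training step here ...
    if step % 100 == 0:
        manager.save(checkpoint_number=step)
```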

- ElasticDL, as an enhancement of TensorFlow's distributed training feature,
- supports fault-tolerance. In the case that some processes fail, the job would
+ ElasticDL supports fault-tolerance during distributed training.
+ If some processes fail, the job would
go on running. Therefore, ElasticDL doesn't need to save checkpoints or recover
from them.

@@ -80,11 +77,11 @@ first job completes. In this case, the overall utilization is 100%.

ElasticDL's elastic scheduling comes from its Kubernetes-native
design -- it doesn't rely on Kubernetes extensions like Kubeflow to run
- TensorFlow programs; instead, the master process of an ElasticDL job calls
+ TensorFlow/PyTorch programs; instead, the master process of an ElasticDL job calls
the Kubernetes API to start workers and parameter servers; it also watches events
like process/pod killing and reacts to such events to realize fault-tolerance.

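A rough sketch of that watch-and-react loop with the official Kubernetes Python client (the namespace, label selector, and recovery action are made-up placeholders, not ElasticDL's actual implementation):

```python
from kubernetes import client, config, watch

config.load_incluster_config()  # assumes the master pod runs inside the cluster
v1 = client.CoreV1Api()

# Watch the pods belonging to one ElasticDL job and react when one disappears.
w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod,
                      namespace="default",
                      label_selector="elasticdl-job-name=my-job"):
    pod = event["object"]
    if event["type"] == "DELETED" or pod.status.phase == "Failed":
        # A worker or parameter server was preempted or crashed: the master can
        # relaunch the pod and requeue its unfinished work instead of failing
        # the whole job.
        print("lost pod:", pod.metadata.name)
```
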
- In short, ElasticDL enhances TensorFlow with fault-tolerance and elastic
+ In short, ElasticDL enhances TensorFlow/PyTorch with fault-tolerance and elastic
scheduling when you have a Kubernetes cluster. We provide a tutorial
showing how to set up a Kubernetes cluster on Google Cloud and run ElasticDL
jobs there. We respect TensorFlow's native distributed computing feature, which