This repository contains the artifact for our ASPLOS '23 paper "Lucid: A Non-Intrusive, Scalable and Interpretable Scheduler for Deep Learning Training Jobs". It includes the following parts:
- simulation: It contains code and data for reproducing key results in our paper.
- workloads: The PyTorch implementation of 14 different workloads used in experiments.
- profile: It contains the code to collect traces of each training job type.
simulation (adapted from Helios) contains instructions for reproducing the Venus cluster experiments shown in Section 4. These scripts have been tested on Ubuntu 20.04 with Python 3.9.
The contents of the simulation folder are summarized as follows:
- data/ contains the Venus cluster job trace and cluster configuration used for evaluation.
- analyzer/ contains the Packing Analyze Model and profiled workloads information used in our experiment.
- estimator/ contains the Workload Estimate Model and job duration estimation for both Lucid and QSSF.
- plot/ contains the notebook for visualizing experiment results.
- policy/ contains implementations of the Lucid scheduling policy and of the baseline policies, including FIFO, SJF, QSSF and Tiresias.
- predictor/ contains the Throughput Predict Model and cluster throughput estimation in Venus September.
- profiler/ contains the Least-GPU-First and Auto-Scaling Profiler implementation for Lucid.
- cluster.py, job.py and updater.py contain implementations of the GPU cluster and workload logic.
- simulator.py is the main entry of the simulator.
We suggest using a conda environment to install the dependencies:
conda create -n lucid python=3.9
conda activate lucid
cd simulation
pip install -r requirements.txt
In addition, we recommend running Jupyter notebook (.ipynb) files with VSCode or JupyterLab (conda install jupyterlab).
We train the Throughput Predict Model as a reproduction example. Please follow the steps below:
- Enter the predictor folder and open the predictor.ipynb file.
- Run all cells inside the notebook. It contains the interpretable model (Primo EBM) used in Lucid and other ML baselines (LightGBM, XGBoost, Random Forest, DNN).
- Table 7: Interpretable Model Performance: Check the Result Comparison cell, where the MAE scores of all baselines are listed.
- Figure 13 (a): Throughput Predict Performance: Check the Prediction Visualization cell (or the Venus_throughput.pdf output file), where both the real and predicted throughput are plotted. Generated figures should have similar patterns as the paper. The difference is because we release the Venus Job throughput prediction code but we plot Saturn Job throughput prediction in our paper.
- Figure 7 (a)(b): Global Model Interpretation and Learned Shape Function: Check the Model Interpretation cell (or the interpret_Venus_throughput.pdf & interpret_Venus_shapefunc.pdf output files). Generated figures should have similar patterns as the paper. The difference is because we release the Venus Job throughput prediction code but we plot Saturn GPU throughput prediction in our paper.
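For readers who want to sanity-check the MAE scores reported in the Result Comparison cell, the metric itself can be sketched in a few lines. This is a minimal illustration only: the function name and the sample numbers are assumptions, and the notebook may well compute MAE through a library call instead.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |y_true[i] - y_pred[i]| over all samples."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one sample is required")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the Venus trace.
    real = [10.0, 12.0, 9.0, 15.0]
    predicted = [11.0, 12.0, 7.0, 14.0]
    # Absolute errors are 1, 0, 2, 1, so the mean is 1.0.
    print(mean_absolute_error(real, predicted))
```

A lower MAE means the predicted throughput tracks the real throughput more closely.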
More model training code is also provided (estimator/estimator_lucid.ipynb and analyzer/analyzer.py).
Use the following command to run all baselines simultaneously:
cd simulation
python simulator.py --sweep
The output of this script looks like this:
2022 Oct 08 14:32:57 | MainProcess | Total Job Number in Cluster Training: 23859
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13220000 | Total Job: 7603 | End job: 13 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13220000 | Total Job: 2826 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13230000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13230000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13240000 | Total Job: 7603 | End job: 120 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13220000 | Total Job: 2654 | End job: 1 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13240000 | Total Job: 2826 | End job: 0 | Running job: 1 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13250000 | Total Job: 7603 | End job: 121 | Running job: 4 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13220000 | Total Job: 1452 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-2 | vcWoR | Time: 13250000 | Total Job: 2826 | End job: 0 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-3 | vcHvQ | Time: 13230000 | Total Job: 2654 | End job: 2 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-1 | vcEwI | Time: 13260000 | Total Job: 7603 | End job: 162 | Running job: 9 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13220000 | Total Job: 710 | End job: 0 | Running job: 0 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-4 | vcvGl | Time: 13230000 | Total Job: 1452 | End job: 1 | Running job: 2 | Pending job: 0
2022 Oct 08 14:32:59 | ForkPoolWorker-5 | vc8Gr | Time: 13230000 | Total Job: 710 | End job: 0 | Running job: 1 | Pending job: 0
Similarly, use the following command to run the Lucid policy:
python simulator.py -s lucid
The output of this script looks like this:
2022 Oct 08 14:45:07 | MainProcess | Total Job Number in Cluster Training: 23859
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13220000 | Total Job: 23859 | End job: 17 | Running job: 1 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13230000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13240000 | Total Job: 23859 | End job: 134 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13250000 | Total Job: 23859 | End job: 136 | Running job: 0 | Pending job: 0 | Avail Nodes: 2
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13260000 | Total Job: 23859 | End job: 249 | Running job: 3 | Pending job: 4 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13270000 | Total Job: 23859 | End job: 385 | Running job: 3 | Pending job: 2 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13280000 | Total Job: 23859 | End job: 589 | Running job: 2 | Pending job: 0 | Avail Nodes: 1
2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13290000 | Total Job: 23859 | End job: 780 | Running job: 2 | Pending job: 0 | Avail Nodes: 2
After the program is executed, you can check the result in the log folder. The job log and time sequence of each VC are provided separately.
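The console progress lines shown above share a pipe-separated layout (timestamp, process name, VC name, then "key: value" counters), so they can be turned into records for quick inspection. The following is a hedged sketch based only on the sample output above; parse_progress_line is a hypothetical helper, not part of the simulator, and the files written to the log folder may use a different format.

```python
from typing import Dict, Optional, Union


def parse_progress_line(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Parse one progress line; return None for lines without a VC field."""
    parts = [p.strip() for p in line.strip().split(" | ")]
    # Progress lines hold: timestamp, process, VC name, then "key: value" pairs.
    # The summary line ("Total Job Number in Cluster Training: ...") has no VC field.
    if len(parts) < 4 or ":" in parts[2]:
        return None
    record: Dict[str, Union[str, int]] = {
        "timestamp": parts[0],
        "process": parts[1],
        "vc": parts[2],
    }
    for field in parts[3:]:
        key, _, value = field.partition(":")
        record[key.strip()] = int(value)
    return record


if __name__ == "__main__":
    sample = (
        "2022 Oct 08 14:45:08 | MainProcess | profvc | Time: 13260000 | "
        "Total Job: 23859 | End job: 249 | Running job: 3 | Pending job: 4 | Avail Nodes: 1"
    )
    print(parse_progress_line(sample))
```

Because the counters are read generically, the same helper handles lines with and without the extra Avail Nodes field.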
We provide simulation analysis and plot scripts to generate the figures shown in our paper. Please follow the steps below:
- Enter the plot folder and open the result_plot.ipynb file.
- Run all cells inside the notebook.
- Table 4: Scheduling Performance: Check the Table 4: Result Summary cell (or the result_summary.csv output file), where the Average JCT, Average Queuing Delay and Queuing Delay 99.9 Quantile of all policies are listed.
- Table 5: Scheduling Performance (workload analysis): Check the Table 5: Result Summary of Different Scales of Workloads cell, where the Average JCT and Average Queuing Delay of large and small jobs are listed.
- Figure 8: CDF of JCT: Check the Plot Result 8: JCT cell (or the result_cdf_jct.pdf output file), where the JCT CDFs of all policies are plotted.
- Figure 9: Queue Time in each VC: Check the Plot Result 9: Queue Time in each VC cell (or the result_bar_queue.pdf output file), where the queuing delays of all policies are plotted.
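The metrics listed above can be illustrated with a small, self-contained sketch. It assumes that JCT is the time from submission to completion and that queuing delay is the time from submission to start; the record keys (submit, start, end), the nearest-rank quantile and the toy job values are assumptions made for illustration, and the notebook may use different column names or a different quantile method.

```python
import math
from typing import Dict, List, Sequence, Tuple


def quantile(values: Sequence[float], q: float) -> float:
    """Nearest-rank quantile for 0 < q <= 1."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def summarize(jobs: Sequence[Dict[str, float]]) -> Dict[str, float]:
    """Average JCT, average queuing delay and 99.9% queuing delay quantile."""
    jct = [j["end"] - j["submit"] for j in jobs]
    queue = [j["start"] - j["submit"] for j in jobs]
    return {
        "avg_jct": sum(jct) / len(jct),
        "avg_queue": sum(queue) / len(queue),
        "queue_p999": quantile(queue, 0.999),
    }


def empirical_cdf(values: Sequence[float]) -> List[Tuple[float, float]]:
    """Pairs (value, fraction of samples <= value), sorted by value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


if __name__ == "__main__":
    # Toy jobs in seconds; not taken from the Venus trace.
    jobs = [
        {"submit": 0, "start": 0, "end": 100},
        {"submit": 10, "start": 30, "end": 130},
        {"submit": 20, "start": 60, "end": 100},
    ]
    print(summarize(jobs))
    print(empirical_cdf([j["end"] - j["submit"] for j in jobs]))
```

Plotting the pairs returned by empirical_cdf for each policy gives a curve of the kind shown in a JCT CDF figure.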
The profile part contains code for profiling metrics of multiple workloads.
Note that ./result/ will be created when main_co.py or main_single.py is launched.
Running main_co.py generates the colocated jobs' metrics under ./result/colocate. Running main_single.py generates single jobs' metrics under ./result/. Some specific settings can be set in each workload's profiling file, e.g. profile_cifar.py. The output will look like this:
imagenet + imagenet
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
co-locate:
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 1 mp..
imagenet + cifar10
co-locate:
Files already downloaded and verified
==> Training ResNet18 model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
...
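To keep track of which configurations a profiling run has covered, the "==> Training ..." lines in the console output above can be extracted with a regular expression. This is a sketch based only on the sample output; TrainingRun and parse_training_lines are hypothetical names, and the mp value is kept as the raw integer printed by the script.

```python
import re
from typing import List, NamedTuple


class TrainingRun(NamedTuple):
    model: str
    batch_size: int
    mp: int


# Matches lines such as:
# "==> Training ResNet18 model with 32 batchsize, 0 mp.."
_PATTERN = re.compile(r"==> Training (\S+) model with (\d+) batchsize, (\d+) mp")


def parse_training_lines(text: str) -> List[TrainingRun]:
    """Collect every training configuration announced in the output."""
    runs = []
    for line in text.splitlines():
        match = _PATTERN.search(line)
        if match:
            runs.append(
                TrainingRun(match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return runs


if __name__ == "__main__":
    sample = """imagenet + cifar10
co-locate:
Files already downloaded and verified
==> Training ResNet18 model with 32 batchsize, 0 mp..
==> Training mobilenet_v3_small model with 32 batchsize, 0 mp..
"""
    for run in parse_training_lines(sample):
        print(run)
```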
The data path storing all datasets is specified in ./workloads/settings.py as data_dir. You can also specify the total runtime of some workloads by changing total_runtime.
- CIFAR-10: The cifar10 dataset will be downloaded automatically (if it does not exist) when ./workloads/cifar/profile_cifar.py is run.
- ImageNet: The dataset is generated automatically in ./workloads/imagenet/profile_imagenet.py.
- LSUN: The dataset is generated automatically in ./workloads/dcgan/profile_dcgan.py. You can change the custom image size of generated data via --imageSize. The default value is 64.
- ShapeNet: Use the following commands to download the dataset under the directory data_dir/shapenetcore/:
  wget https://shapenet.cs.stanford.edu/ericyi/shapenetcore_partanno_segmentation_benchmark_v0.zip --no-check-certificate
  unzip shapenetcore_partanno_segmentation_benchmark_v0.zip
- SQuAD: The data can be downloaded with the following link and should be saved under the data_dir/SQUAD_DIR/ directory.
- Wikitext2: The dataset can be downloaded from
  The files test.txt, train.txt and valid.txt should be saved in the data_dir/wikitext-2/ directory.
- Multi30k: First download the Moses tokenizer (http://www.statmt.org/moses/) for data preparation:
  wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/tokenizer/tokenizer.perl
  wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.de
  wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/share/nonbreaking_prefixes/nonbreaking_prefix.en
  sed -i "s/$RealBin\/..\/share\/nonbreaking_prefixes//" tokenizer.perl
  wget https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl
  These files should be downloaded in ./workloads/translation/.
  Then download data in data_dir/multi30k/:
  mkdir -p data/multi30k
  wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz && tar -xf training.tar.gz -C data/multi30k && rm training.tar.gz
  wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz && tar -xf validation.tar.gz -C data/multi30k && rm validation.tar.gz
  wget http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz && tar -xf mmt16_task1_test.tar.gz -C data/multi30k && rm mmt16_task1_test.tar.gz
  Preprocess the data:
  for l in en de; do for f in ~/data/multi30k/*.$l; do if [[ "$f" != *"test"* ]]; then sed -i "$ d" $f; fi; done; done
  for l in en de; do for f in ~/data/multi30k/*.$l; do perl tokenizer.perl -a -no-escape -l $l -q < $f > $f.atok; done; done
  python preprocess.py -train_src ~/data/multi30k/train.en.atok -train_tgt ~/data/multi30k/train.de.atok -valid_src ~/data/multi30k/val.en.atok -valid_tgt ~/data/multi30k/val.de.atok -save_data ~/data/multi30k.atok.low.pt
  Referenced from: https://github.com/Eathoublu/attention-is-all-you-need-pytorch.
- MovieLens: Use the following commands to download the dataset in data_dir/ml-1m/:
  wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.test.negative
  wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.test.rating
  wget https://github.com/hexiangnan/neural_collaborative_filtering/raw/master/Data/ml-1m.train.rating
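Before launching the profiler, it may help to confirm that the manually downloaded datasets are where the list above says they should be. The following sketch checks only the sub-directories named in this list; missing_dataset_dirs is a hypothetical helper, not part of the repository, and the example path is a placeholder for whatever data_dir is set to in ./workloads/settings.py.

```python
import os
from typing import List

# Sub-directories of data_dir named in the dataset list above.
# CIFAR-10, ImageNet and LSUN are omitted because they are downloaded
# or generated automatically by their profiling scripts.
EXPECTED_SUBDIRS = ("shapenetcore", "SQUAD_DIR", "wikitext-2", "multi30k", "ml-1m")


def missing_dataset_dirs(data_dir: str) -> List[str]:
    """Return the expected sub-directories that do not exist under data_dir."""
    return [
        name
        for name in EXPECTED_SUBDIRS
        if not os.path.isdir(os.path.join(data_dir, name))
    ]


if __name__ == "__main__":
    # Placeholder path; replace it with the data_dir value from settings.py.
    data_dir = os.path.expanduser("~/data")
    missing = missing_dataset_dirs(data_dir)
    if missing:
        print("Missing dataset directories:", ", ".join(missing))
    else:
        print("All manually downloaded datasets were found.")
```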