# Jetstream-PyTorch
JetStream Engine implementation in PyTorch


# Install

### 1. Get the jetstream-pytorch code
```bash
git clone https://github.com/pytorch-tpu/jetstream-pytorch.git
```

1.1 (optional) Create a virtual env using `venv` or `conda` and activate it.

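A minimal sketch of this optional step, assuming Python 3 with the built-in `venv` module (the directory name `.venv` is just an example):

```bash
# Create and activate a virtual environment inside the checkout
cd jetstream-pytorch
python -m venv .venv
source .venv/bin/activate
```
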
### 2. Run installation script:

```bash
source install_everything.sh
```

NOTE: the above script exports `PYTHONPATH`, so it must be sourced (rather than run in a subshell) for the change to take effect in the current shell.

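To confirm the install took effect in your current shell, you can check that the variable is set:

```bash
# Should print a non-empty path list after sourcing install_everything.sh
echo $PYTHONPATH
```
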

# Get weights

### First get official llama weights from meta-llama

Follow the instructions here: https://github.com/meta-llama/llama#download

The download also includes a `tokenizer.model` file, which is the tokenizer we will use.

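The run commands below pass `--tokenizer_path=tokenizer.model` as a relative path, so one simple option (a sketch, not a requirement of the tooling) is to copy the tokenizer into the repo root:

```bash
# Adjust the source path to wherever the meta-llama download placed
# tokenizer.model on your machine (the path below is illustrative).
cp /path/to/llama-download/tokenizer.model .
```
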
### Run weight merger to convert (and optionally quantize) the weights
```bash
export input_ckpt_dir=<directory of the original llama weights>
export output_ckpt_dir=<output directory for the converted checkpoint>
export quantize=True  # whether to quantize the weights
python -m convert_checkpoints --input_checkpoint_dir=$input_ckpt_dir --output_checkpoint_dir=$output_ckpt_dir --quantize=$quantize
```

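After conversion, the output directory should contain the checkpoint file that the run commands below point at (the exact listing may vary):

```bash
# The commands below reference $output_ckpt_dir/model.safetensors
ls $output_ckpt_dir
```
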

# Local run

## Llama 7b
```
python benchmarks/run_offline.py --size=7b --batch_size=128 --max_cache_length=2048 --quantize_weights=$quantize --quantize_kv_cache=$quantize --checkpoint_path=$output_ckpt_dir/model.safetensors --tokenizer_path=tokenizer.model
```

## Llama 13b
```
python benchmarks/run_offline.py --size=13b --batch_size=96 --max_cache_length=1280 --quantize_weights=$quantize --quantize_kv_cache=$quantize --checkpoint_path=$output_ckpt_dir/model.safetensors --tokenizer_path=tokenizer.model
```
NOTE: for the 13b model we recommend `--max_cache_length=1280`; this effectively implements sliding-window attention.


# Run the server
NOTE: `--platform=tpu=8` must specify the number of TPU devices (4 for v4-8, 8 for v5light-8).

```bash
python run_server.py --param_size=7b --batch_size=128 --max_cache_length=2048 --quantize_weights=$quantize --quantize_kv_cache=$quantize --checkpoint_path=$output_ckpt_dir/model.safetensors --tokenizer_path=tokenizer.model --platform=tpu=8
```
Now you can send gRPC requests to it.

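For a quick smoke test, one option is the requester tool from the JetStream repo, which an earlier revision of these instructions used; the exact path is an assumption and may have moved:

```bash
# Hypothetical smoke test against the running server; the tool lives in the
# vendored JetStream checkout pulled in by install_everything.sh.
cd deps/JetStream
python jetstream/core/tools/requester.py
```
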

# Run benchmark
Go to the `deps/JetStream` folder (downloaded during `install_everything.sh`) and run the serving benchmark:
```bash
cd deps/JetStream
python benchmark_serving.py --tokenizer /home/hanq/jetstream-pytorch/tokenizer.model --num-prompts 2000 --dataset ~/data/ShareGPT_V3_unfiltered_cleaned_split.json --warmup-first=1 --save-request-outputs
```
Replace the `--tokenizer` path with the location of your own `tokenizer.model`, and `--dataset` with wherever you saved the ShareGPT file. The ShareGPT dataset can be downloaded with:

```bash
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```
Please look at `deps/JetStream/benchmarks/README.md` for more information.