add EEVE-Korean-instruct to neuron

gonsoomoon-ml · gonsoomoon-ml · commit 1310da738c03 · 2024-04-01T00:42:01.000Z
diff --git a/neuron/Readme.md b/neuron/Readme.md
@@ -38,7 +38,7 @@ Last updated: Mar 31, 2024
 - (Feb 2024) [Fine-tuning llama-2-7B with the Dolly dataset on AWS Trainium](hf-optimum/02-Fine-tune-Llama-7B-Trn1/README.md)
 
 ## 2.3. vLLM on Inferentia/Trainium 
-- (Mar 2024) Batch inference with SOLAR-10.7B-instruct and yanolja-KoSOLAR-10.7B: [Batch inference on Inferentia2 (inf2.48xlarge) with vLLM](vLLM/01-offline-inference_neuron/Readme.md)
+- (Mar 2024) Batch inference with SOLAR-10.7B-instruct, yanolja-KoSOLAR-10.7B, and yanolja-EEVE-Korean-Instruct-10.8B: [Batch inference on Inferentia2 (inf2.48xlarge) with vLLM](vLLM/01-offline-inference_neuron/Readme.md)
 
 # 3. Related blogs
 - [View featured blogs](blog/Readme.md)
diff --git a/neuron/vLLM/01-offline-inference_neuron/04-yanolja-EEVE-Korean-Instruct-10.8B-offline_inference_neuron.ipynb b/neuron/vLLM/01-offline-inference_neuron/04-yanolja-EEVE-Korean-Instruct-10.8B-offline_inference_neuron.ipynb
@@ -0,0 +1,258 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Batch inference with yanolja/KoSOLAR-10.7B-v0.2 on Inferentia2 (inf2.48xlarge)\n",
+    "\n",
+    "---\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 1. Prerequisites\n",
+    "- Click the link below and complete the prerequisite steps.\n",
+    "    - [AWS Inferentia2 installation and execution guide](Readme.md)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 2. Run batch inference"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from vllm import LLM, SamplingParams"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 3. Compile and load the yanolja/KoSOLAR-10.7B-v0.2 model\n",
+    "- The parameters below were previously set to 128; this run uses 1024 instead.\n",
+    "    - max_model_len=1024,\n",
+    "    - block_size=1024,"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The cell below took about 70 minutes to run."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "INFO 03-31 12:39:04 llm_engine.py:87] Initializing an LLM engine with config: model='yanolja/KoSOLAR-10.7B-v0.2', tokenizer='yanolja/KoSOLAR-10.7B-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, seed=0)\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2024-03-31 12:42:28.000186:  70201  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache\n",
+      "2024-03-31 12:42:28.000196:  70201  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/084afca8-7030-4d30-bde7-7ada13ca1f93/model.MODULE_9f050279deca50a4ec46+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/084afca8-7030-4d30-bde7-7ada13ca1f93/model.MODULE_9f050279deca50a4ec46+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35\n",
+      "2024-03-31 12:42:28.000407:  70202  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache\n",
+      "2024-03-31 12:42:28.000470:  70202  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/06656340-372a-4b43-8ad6-1c5ebf6a6dbd/model.MODULE_d80e3715f30f9a482a22+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/06656340-372a-4b43-8ad6-1c5ebf6a6dbd/model.MODULE_d80e3715f30f9a482a22+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35\n",
+      "..............................................................................................\n",
+      "Compiler status PASS\n",
+      "................................................................................................................................................\n",
+      "Compiler status PASS\n",
+      "INFO 03-31 13:47:17 llm_engine.py:357] # GPU blocks: 8, # CPU blocks: 0\n"
+     ]
+    }
+   ],
+   "source": [
+    "llm = LLM(\n",
+    "    # model=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n",
+    "    model=\"yanolja/KoSOLAR-10.7B-v0.2\",\n",
+    "    max_num_seqs=8,\n",
+    "    # The max_model_len and block_size arguments are required to be same as\n",
+    "    # max sequence length when targeting neuron device.\n",
+    "    # Currently, this is a known limitation in continuous batching support\n",
+    "    # in transformers-neuronx.\n",
+    "    # TODO(liangfu): Support paged-attention in transformers-neuronx.\n",
+    "    max_model_len=1024,\n",
+    "    block_size=1024,\n",
+    "    # The device can be automatically detected when AWS Neuron SDK is installed.\n",
+    "    # The device argument can be either unspecified for automated detection,\n",
+    "    # or explicitly assigned.\n",
+    "    device=\"neuron\",\n",
+    "    tensor_parallel_size=2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# 4. Batch inference with the model"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "prompts = [\n",
+    "    \"Hello, my name is\",\n",
+    "    \"The president of the United States is\",\n",
+    "    \"The capital of France is\",\n",
+    "    \"The future of AI is\",\n",
+    "]\n",
+    "sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Processed prompts:   0%|          | 0/4 [00:00<?, ?it/s]"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2024-Mar-31 13:47:31.0203 69535:75779 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed\n",
+      "2024-Mar-31 13:47:31.0203 69535:75779 [1] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Processed prompts: 100%|██████████| 4/4 [00:02<00:00,  1.48it/s]"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Prompt: 'Hello, my name is', Generated text: ' Jaime and I am a Writer, Editor, and Social Media Manager.\\n'\n",
+      "Prompt: 'The president of the United States is', Generated text: ' now speaking. He\\'s delivering the final speech of his administration.\\n\"'\n",
+      "Prompt: 'The capital of France is', Generated text: ' home to about 2.2 million people.\\nThe population is growing rapidly'\n",
+      "Prompt: 'The future of AI is', Generated text: ' in our hands: KLF19 Day 1 | GoKh'\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "outputs = llm.generate(prompts, sampling_params)\n",
+    "# Print the outputs.\n",
+    "for output in outputs:\n",
+    "    prompt = output.prompt\n",
+    "    generated_text = output.outputs[0].text\n",
+    "    print(f\"Prompt: {prompt!r}, Generated text: {generated_text!r}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "prompts = [\n",
+    "    \"대한민국의 수도는 어디야?\",\n",
+    "    \"사과의 건강 효능에 대해서 알려줘\",\n",
+    "]\n",
+    "sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Processed prompts: 100%|██████████| 2/2 [00:01<00:00,  1.39it/s]"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Prompt: '대한민국의 수도는 어디야?', Generated text: ' 그 질문에 대한 답이 서울이라면 이미 당신은 한국 역사의 뿌리'\n",
+      "Prompt: '사과의 건강 효능에 대해서 알려줘', Generated text: ' 줘\\n사과의 건강 효능에 대해서 알려줘.\\n중학교'\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "outputs = llm.generate(prompts, sampling_params)\n",
+    "# Print the outputs.\n",
+    "for output in outputs:\n",
+    "    prompt = output.prompt\n",
+    "    generated_text = output.outputs[0].text\n",
+    "    print(f\"Prompt: {prompt!r}, Generated text: {generated_text!r}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python (torch-neuronx)",
+   "language": "python",
+   "name": "aws_neuron_venv_pytorch"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/neuron/vLLM/01-offline-inference_neuron/Readme.md b/neuron/vLLM/01-offline-inference_neuron/Readme.md
@@ -73,6 +73,7 @@ Last Update: Mar 31, 2024
     - [01-offline_inference_neuron.ipynb](01-offline_inference_neuron.ipynb)
     - [02-SOLAR-10.7B-Instruct-offline_inference_neuron.ipynb](02-SOLAR-10.7B-Instruct-offline_inference_neuron.ipynb)
     - [03-yanolja-KoSOLAR-10.7B-v02-offline_inference_neuron.ipynb](03-yanolja-KoSOLAR-10.7B-v02-offline_inference_neuron.ipynb)        
+    - [04-yanolja-EEVE-Korean-Instruct-10.8B-offline_inference_neuron.ipynb](04-yanolja-EEVE-Korean-Instruct-10.8B-offline_inference_neuron.ipynb)
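
Note on the compile time: the notebook states that compiling and loading the model took about 70 minutes. The sketch below checks that figure against the cell output, assuming the two `llm_engine.py` INFO timestamps (engine initialization at 12:39:04 and the block-count line at 13:47:17, both on 03-31) bracket the compile-and-load step. The helper name `elapsed_minutes` and the two constants are illustrative and not part of the commit.

```python
from datetime import datetime

# Timestamps copied from the llm_engine.py INFO lines in the notebook output.
ENGINE_INIT = "03-31 12:39:04"   # "Initializing an LLM engine with config: ..."
BLOCKS_READY = "03-31 13:47:17"  # "# GPU blocks: 8, # CPU blocks: 0"


def elapsed_minutes(start: str, end: str, year: int = 2024) -> float:
    """Return the minutes between two 'MM-DD HH:MM:SS' log timestamps.

    The log lines omit the year, so it is supplied separately
    (assumed to be 2024, matching the commit date).
    """
    fmt = "%Y-%m-%d %H:%M:%S"
    t0 = datetime.strptime(f"{year}-{start}", fmt)
    t1 = datetime.strptime(f"{year}-{end}", fmt)
    return (t1 - t0).total_seconds() / 60


if __name__ == "__main__":
    minutes = elapsed_minutes(ENGINE_INIT, BLOCKS_READY)
    print(f"compile + load: {minutes:.1f} min")
```

On these two timestamps the result is roughly 68 minutes, consistent with the "about 70 minutes" stated in the notebook. A later run that hits the Neuron compile cache at `/var/tmp/neuron-compile-cache` may well be faster, but the log above does not show such a run.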