
Commit 1310da7

add EEVE-Korean-instruct to neuron
1 parent: 3ad9a75

File tree

3 files changed: +260 −1 lines changed


neuron/Readme.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ Last updated: Mar 31, 2024
 - (Feb 2024) [Fine-tuning llama-2-7B with the Dolly Dataset on AWS Trainium](hf-optimum/02-Fine-tune-Llama-7B-Trn1/README.md)

 ## 2.3. vLLM on Inferentia/Trainium
-- (Mar 2024) Ran batch inference with SOLAR-10.7B-instruct and yanolja-KoSOLAR-10.7B: [Batch inference on Inferentia2 (inf2.48xlarge) with vLLM](vLLM/01-offline-inference_neuron/Readme.md)
+- (Mar 2024) Ran batch inference with SOLAR-10.7B-instruct, yanolja-KoSOLAR-10.7B, and 04-yanolja-EEVE-Korean-Instruct-10.8B: [Batch inference on Inferentia2 (inf2.48xlarge) with vLLM](vLLM/01-offline-inference_neuron/Readme.md)

 # 3. Related blogs
 - [View the main blog posts](blog/Readme.md)
neuron/vLLM/01-offline-inference_neuron/04-yanolja-EEVE-Korean-Instruct-10.8B-offline_inference_neuron.ipynb

Lines changed: 258 additions & 0 deletions
@@ -0,0 +1,258 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Batch inference of yanolja/KoSOLAR-10.7B-v0.2 on Inferentia2 (inf2.48xlarge)\n",
    "\n",
    "---\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 1. Prerequisites\n",
    "- Click the link below and complete the prerequisite steps first.\n",
    "  - [AWS Inferentia2 setup and run guide](Readme.md)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Run batch inference"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from vllm import LLM, SamplingParams"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Compile and load the yanolja/KoSOLAR-10.7B-v0.2 model\n",
    "- The parameters below were previously 128; this run changes them to 1024.\n",
    "  - max_model_len=1024,\n",
    "  - block_size=1024,"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The step below took about 70 minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "INFO 03-31 12:39:04 llm_engine.py:87] Initializing an LLM engine with config: model='yanolja/KoSOLAR-10.7B-v0.2', tokenizer='yanolja/KoSOLAR-10.7B-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cpu, seed=0)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-03-31 12:42:28.000186: 70201 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache\n",
      "2024-03-31 12:42:28.000196: 70201 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/084afca8-7030-4d30-bde7-7ada13ca1f93/model.MODULE_9f050279deca50a4ec46+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/084afca8-7030-4d30-bde7-7ada13ca1f93/model.MODULE_9f050279deca50a4ec46+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35\n",
      "2024-03-31 12:42:28.000407: 70202 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache\n",
      "2024-03-31 12:42:28.000470: 70202 INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: neuronx-cc compile --target=trn1 --framework=XLA /tmp/ubuntu/neuroncc_compile_workdir/06656340-372a-4b43-8ad6-1c5ebf6a6dbd/model.MODULE_d80e3715f30f9a482a22+2c2d707e.hlo_module.pb --output /tmp/ubuntu/neuroncc_compile_workdir/06656340-372a-4b43-8ad6-1c5ebf6a6dbd/model.MODULE_d80e3715f30f9a482a22+2c2d707e.neff --model-type=transformer --auto-cast=none --verbose=35\n",
      "..............................................................................................\n",
      "Compiler status PASS\n",
      "................................................................................................................................................\n",
      "Compiler status PASS\n",
      "INFO 03-31 13:47:17 llm_engine.py:357] # GPU blocks: 8, # CPU blocks: 0\n"
     ]
    }
   ],
   "source": [
    "llm = LLM(\n",
    "    # model=\"TinyLlama/TinyLlama-1.1B-Chat-v1.0\",\n",
    "    model=\"yanolja/KoSOLAR-10.7B-v0.2\",\n",
    "    max_num_seqs=8,\n",
    "    # The max_model_len and block_size arguments are required to be the same\n",
    "    # as the max sequence length when targeting a neuron device.\n",
    "    # Currently, this is a known limitation in continuous batching support\n",
    "    # in transformers-neuronx.\n",
    "    # TODO(liangfu): Support paged-attention in transformers-neuronx.\n",
    "    max_model_len=1024,\n",
    "    block_size=1024,\n",
    "    # The device can be automatically detected when the AWS Neuron SDK is installed.\n",
    "    # The device argument can be either unspecified for automated detection,\n",
    "    # or explicitly assigned.\n",
    "    device=\"neuron\",\n",
    "    tensor_parallel_size=2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Batch inference with the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"Hello, my name is\",\n",
    "    \"The president of the United States is\",\n",
    "    \"The capital of France is\",\n",
    "    \"The future of AI is\",\n",
    "]\n",
    "sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 0%| | 0/4 [00:00<?, ?it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2024-Mar-31 13:47:31.0203 69535:75779 [1] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed\n",
      "2024-Mar-31 13:47:31.0203 69535:75779 [1] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 4/4 [00:02<00:00, 1.48it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prompt: 'Hello, my name is', Generated text: ' Jaime and I am a Writer, Editor, and Social Media Manager.\\n'\n",
      "Prompt: 'The president of the United States is', Generated text: ' now speaking. He\\'s delivering the final speech of his administration.\\n\"'\n",
      "Prompt: 'The capital of France is', Generated text: ' home to about 2.2 million people.\\nThe population is growing rapidly'\n",
      "Prompt: 'The future of AI is', Generated text: ' in our hands: KLF19 Day 1 | GoKh'\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "outputs = llm.generate(prompts, sampling_params)\n",
    "# Print the outputs.\n",
    "for output in outputs:\n",
    "    prompt = output.prompt\n",
    "    generated_text = output.outputs[0].text\n",
    "    print(f\"Prompt: {prompt!r}, Generated text: {generated_text!r}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompts = [\n",
    "    \"대한민국의 수도는 어디야?\",\n",
    "    \"사과의 건강 효능에 대해서 알려줘\",\n",
    "]\n",
    "sampling_params = SamplingParams(temperature=0.8, top_p=0.95)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Processed prompts: 100%|██████████| 2/2 [00:01<00:00, 1.39it/s]"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Prompt: '대한민국의 수도는 어디야?', Generated text: ' 그 질문에 대한 답이 서울이라면 이미 당신은 한국 역사의 뿌리'\n",
      "Prompt: '사과의 건강 효능에 대해서 알려줘', Generated text: ' 줘\\n사과의 건강 효능에 대해서 알려줘.\\n중학교'\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\n"
     ]
    }
   ],
   "source": [
    "outputs = llm.generate(prompts, sampling_params)\n",
    "# Print the outputs.\n",
    "for output in outputs:\n",
    "    prompt = output.prompt\n",
    "    generated_text = output.outputs[0].text\n",
    "    print(f\"Prompt: {prompt!r}, Generated text: {generated_text!r}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python (torch-neuronx)",
   "language": "python",
   "name": "aws_neuron_venv_pytorch"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
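Note that although this commit adds the 04-yanolja-EEVE-Korean-Instruct-10.8B notebook, the cells above still reference yanolja/KoSOLAR-10.7B-v0.2 in both the title and the `model` argument. Below is a minimal sketch of pointing the same setup at the EEVE model; the Hugging Face model id `yanolja/EEVE-Korean-Instruct-10.8B-v1.0` is an assumption inferred from the notebook's file name, and every other parameter is copied from the cell above.

```python
# Hedged sketch: the same vLLM-on-Neuron setup pointed at the EEVE model.
# The model id "yanolja/EEVE-Korean-Instruct-10.8B-v1.0" is assumed from the
# notebook file name; all other parameters mirror the KoSOLAR cell above.
from vllm import LLM, SamplingParams

llm = LLM(
    model="yanolja/EEVE-Korean-Instruct-10.8B-v1.0",  # assumed model id
    max_num_seqs=8,
    # On neuron devices, max_model_len and block_size must equal the max
    # sequence length (a current limitation of continuous batching in
    # transformers-neuronx).
    max_model_len=1024,
    block_size=1024,
    device="neuron",
    tensor_parallel_size=2)

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
outputs = llm.generate(["대한민국의 수도는 어디야?"], sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt!r}, Generated text: {output.outputs[0].text!r}")
```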

neuron/vLLM/01-offline-inference_neuron/Readme.md

Lines changed: 1 addition & 0 deletions
@@ -73,6 +73,7 @@ Last Update: Mar 31, 2024
 - [01-offline_inference_neuron.ipynb](01-offline_inference_neuron.ipynb)
 - [02-SOLAR-10.7B-Instruct-offline_inference_neuron.ipynb](02-SOLAR-10.7B-Instruct-offline_inference_neuron.ipynb)
 - [03-yanolja-KoSOLAR-10.7B-v02-offline_inference_neuron.ipynb](03-yanolja-KoSOLAR-10.7B-v02-offline_inference_neuron.ipynb)
+- [04-yanolja-EEVE-Korean-Instruct-10.8B-offline_inference_neuron.ipynb](04-yanolja-EEVE-Korean-Instruct-10.8B-offline_inference_neuron.ipynb)