
Commit 32c6fe3

Try Graviton w PCS recipe. Also added a helper util for looking up instance availability by region.
1 parent e044850 commit 32c6fe3

File tree

9 files changed: +842 -0 lines changed

recipes/pcs/try_graviton/.gitkeep

Whitespace-only changes.

recipes/pcs/try_graviton/Makefile

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
# Target rules
all: build
	@echo "Building try_graviton"

build: assets

assets:
	@echo "Build assets for try_graviton"

run: build
	@echo "Run assets for try_graviton"

test: build

clean:

clobber: clean

recipes/pcs/try_graviton/README.md

Lines changed: 128 additions & 0 deletions
@@ -0,0 +1,128 @@
# Try AWS PCS with Graviton-powered EC2 instances

## Info

This recipe helps you launch a Slurm cluster using AWS Parallel Computing Service, powered by Amazon EC2 instances with Graviton processors.
## Pre-requisites

1. An active AWS account with an administrative user. If you do not have one, see [Sign up for AWS and create an administrative user](https://docs.aws.amazon.com/pcs/latest/userguide/setting-up.html) in the AWS PCS user guide.
2. Sufficient Amazon EC2 service quota to launch the cluster. To check your quotas in the console (or from the CLI, as sketched after this list):
    * Navigate to the [AWS Service Quotas console](https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas).
    * Change to the region where you will use PCS with Graviton instances (`us-east-1`, `ap-northeast-1`, or `eu-west-1`).
    * Search for **Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances**.
    * Make sure your **Applied account-level quota value** is at least 16.
    * Search for **Running On-Demand HPC instances**.
    * Make sure your **Applied quota value** is at least 64, which is enough to run one HPC instance. Each additional running HPC instance requires another 64 vCPUs of service quota.
    * If either quota is too low, choose the **Request increase at account-level** option and wait for your request to be processed. Then, return to this exercise.
    * Note that service quotas are region-specific.
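If you prefer the command line, here is a minimal sketch that checks both quotas with the AWS CLI. It looks the quotas up by their display names (as shown in the console) so no quota codes are hard-coded, and it assumes `us-east-1` - substitute the region you plan to use.

```shell
# Check the two EC2 service quotas this recipe needs.
for QUOTA in \
  "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances" \
  "Running On-Demand HPC instances"; do
  aws service-quotas list-service-quotas \
    --service-code ec2 \
    --region us-east-1 \
    --query "Quotas[?QuotaName=='${QUOTA}'].[QuotaName,Value]" \
    --output text
done
```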
## Create an AWS PCS cluster powered by Graviton processors
To launch a cluster using AWS CloudFormation in a region where Graviton HPC instances are available:

* Choose the quick-create link that corresponds to the region where you will work with PCS and Graviton instances:
    * `us-east-1` (Virginia, United States) [![Launch stack](../../../docs/media/launch-stack.svg)](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?stackName=try-graviton-cfn&templateURL=https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/pcs/try_graviton/assets/cluster.cfn.yaml)
    * `ap-northeast-1` (Tokyo, Japan) [![Launch stack](../../../docs/media/launch-stack.svg)](https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-1#/stacks/create/review?stackName=try-graviton-cfn&templateURL=https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/pcs/try_graviton/assets/cluster.cfn.yaml)
    * `eu-west-1` (Dublin, Ireland) [![Launch stack](../../../docs/media/launch-stack.svg)](https://console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/create/review?stackName=try-graviton-cfn&templateURL=https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/pcs/try_graviton/assets/cluster.cfn.yaml)
* Follow the instructions in the AWS CloudFormation console:
    * Under **Parameters**:
        * (Optional) Customize the stack name.
        * For **SlurmVersion**, choose one of the supported Slurm versions.
        * For **ClientIpCidr**, either leave it at its default value or replace it with a more restrictive CIDR range.
        * Leave the parameters under **HPC Recipes configuration** at their default values.
    * Under **Capabilities and transforms**:
        * Check all three boxes.
* Choose **Create stack**.
* Monitor the status of your stack (e.g. **try-graviton-cfn**).
* When its status reaches `CREATE_COMPLETE`, you can interact with the PCS cluster.
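As an alternative to the console, here is a hedged sketch of launching the same template with the AWS CLI. The template URL and stack name come from the quick-create links above; `<SUPPORTED-SLURM-VERSION>` is a placeholder for one of the versions the console offers; and the three capability flags below correspond to the three checkboxes under **Capabilities and transforms**.

```shell
# Launch the cluster stack (us-east-1 shown; adjust --region as needed).
aws cloudformation create-stack \
  --stack-name try-graviton-cfn \
  --region us-east-1 \
  --template-url https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/pcs/try_graviton/assets/cluster.cfn.yaml \
  --parameters ParameterKey=SlurmVersion,ParameterValue=<SUPPORTED-SLURM-VERSION> \
  --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND

# Block until the stack reaches CREATE_COMPLETE.
aws cloudformation wait stack-create-complete \
  --stack-name try-graviton-cfn --region us-east-1
```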
## Interact with the PCS cluster
You can work with your new cluster using the AWS PCS console, or you can connect to its login node to run jobs and manage data. Your new CloudFormation stack can help you with this. In the [AWS CloudFormation console](https://console.aws.amazon.com/cloudformation/home), choose the stack you have created. Then, navigate to the **Outputs** tab.

There will be three URLs:
* **SshKeyPairSsmParameter** This link takes you to where you can download an SSH key that has been generated to enable SSH access to the cluster. See `Extra: Connecting via SSH` below to learn how to use this information.
* **PcsConsoleUrl** This is a link to the cluster you created, in the PCS console. Go here to explore the cluster, node group, and queue configurations.
* **Ec2ConsoleUrl** This link takes you to a filtered view of the EC2 console that shows the instance(s) managed by the `login` node group.
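You can also read the same outputs from your terminal. A minimal sketch, assuming the stack name `try-graviton-cfn` used above:

```shell
# Print the stack outputs (SshKeyPairSsmParameter, PcsConsoleUrl, Ec2ConsoleUrl).
aws cloudformation describe-stacks \
  --stack-name try-graviton-cfn \
  --region us-east-1 \
  --query "Stacks[0].Outputs[].[OutputKey,OutputValue]" \
  --output table
```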
### Connect to the cluster
You can connect to your PCS cluster login node right in the browser.
1. Navigate to the **Ec2ConsoleUrl** URL.
2. Select an instance and choose **Connect**.
3. On the **Connect to instance** page, choose **Session Manager**.
4. Choose the **Connect** button. You will be taken to a terminal session.
5. Become the `ec2-user` user by typing `sudo su - ec2-user`.
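If you would rather open the Session Manager session from your own terminal, here is a sketch. It assumes you have the AWS CLI and the Session Manager plugin installed locally; the instance ID is a placeholder for the login instance shown at **Ec2ConsoleUrl**.

```shell
# Open an interactive shell on the login node via Session Manager.
# Replace i-0123456789abcdef0 with the login instance ID from Ec2ConsoleUrl.
aws ssm start-session --target i-0123456789abcdef0 --region us-east-1
```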
### Cluster design
There is one Slurm partition on the system: `hpc`. It sends work to the `hpc7g-16xlarge` node group, which features [`hpc7g.16xlarge`](https://aws.amazon.com/ec2/instance-types/hpc7g/) instances that have EFA built in.

Find the queues by running `sinfo` and inspect the nodes with `scontrol show nodes`.

The `/home` and `/fsx` directories are network file systems. The `/home` directory is provided by [Amazon Elastic File System](https://aws.amazon.com/efs/), while the `/fsx` directory is powered by [Amazon FSx for Lustre](https://aws.amazon.com/fsx/lustre/). You can install software in the `/home` or `/fsx` directories. We recommend you run jobs out of the `/fsx` directory.

Verify that these filesystems are present with `df -h`. It will return a screen that resembles this:
```shell
[ec2-user@ip-10-0-8-20 ~]$ df -h
Filesystem                                          Size  Used Avail Use% Mounted on
devtmpfs                                            3.8G     0  3.8G   0% /dev
tmpfs                                               3.8G     0  3.8G   0% /dev/shm
tmpfs                                               3.8G  612K  3.8G   1% /run
tmpfs                                               3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/nvme0n1p1                                       24G   20G  4.2G  83% /
fs-0d0a17eaafcc0d0e6.efs.us-east-1.amazonaws.com:/  8.0E     0  8.0E   0% /home
10.0.10.150@tcp:/xjmflbev                           1.2T  4.5G  1.2T   1% /fsx
tmpfs                                               774M     0  774M   0% /run/user/0
```
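Since jobs should run from `/fsx` and land on the `hpc` partition, here is a minimal, hypothetical batch script you can use to confirm the partition is wired up. The script name and job name are illustrative, not part of the recipe.

```shell
#!/bin/bash
#SBATCH --job-name=hello-graviton
#SBATCH --partition=hpc
#SBATCH --nodes=1

# Print the host and CPU architecture (should report aarch64 on Graviton).
srun hostname
srun uname -m
```

Save it under `/fsx`, submit it with `sbatch hello.sbatch`, and watch it with `squeue`. The first job may wait a few minutes while PCS scales up an `hpc7g.16xlarge` node.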
### Run some jobs
Once you have connected to the login instance, follow along with the **Getting Started with AWS PCS** tutorial, starting at [_Explore the cluster environment in AWS PCS_](https://docs.aws.amazon.com/pcs/latest/userguide/getting-started_explore.html).
## Cleaning Up
When you are done using your PCS cluster, you can delete it and all its associated resources by navigating to the AWS CloudFormation console and deleting the stack you created.
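From the CLI, that is a one-liner plus an optional wait (again assuming the stack name `try-graviton-cfn`):

```shell
# Delete the cluster stack and wait for the deletion to finish.
aws cloudformation delete-stack --stack-name try-graviton-cfn --region us-east-1
aws cloudformation wait stack-delete-complete --stack-name try-graviton-cfn --region us-east-1
```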
However, if you have created additional resources in your cluster, beyond the `login` and `hpc7g-16xlarge` node groups or the `hpc` queue, **you must delete those resources** in the PCS console before deleting the CloudFormation stack. Otherwise, deleting the stack will fail and you will need to manually delete several resources on your own.

If you do need to delete extra resources, go to the detail page for your PCS cluster:
* Delete any queues besides `hpc`.
* Delete any node groups besides `login` and `hpc7g-16xlarge`.

**Note** We do not recommend you create or delete any resources in this demonstration cluster. Get started building your own, fully customizable HPC clusters with [this tutorial](https://docs.aws.amazon.com/pcs/latest/userguide/getting-started.html) in the AWS PCS user guide.
## Extra: Connecting via SSH
By default, we have configured the cluster to support logins via Session Manager, in the browser. If you want to connect using regular SSH instead, here's how.

### Retrieve the SSH key

We generated an SSH key as part of deploying the cluster. It is stored in [AWS Systems Manager Parameter Store](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html). You can download the key and use it to connect to the public IP address of your PCS cluster login node.

* Go to the **SshKeyPairSsmParameter** URL.
* Copy the name of the SSH key - it will look like this: `/ec2/keypair/key-HEXADECIMAL-DATA`
* Use the AWS CLI to download the key (use the region where you launched the stack):

`aws ssm get-parameter --name "/ec2/keypair/key-HEXADECIMAL-DATA" --query "Parameter.Value" --output text --region us-east-1 --with-decryption > key-HEXADECIMAL-DATA.pem`

* Set permissions on the key to owner-readable: `chmod 400 key-HEXADECIMAL-DATA.pem`
### Log in to the cluster
* Log in to the login node public IP, which you can retrieve via **Ec2ConsoleUrl**:

`ssh -i key-HEXADECIMAL-DATA.pem ec2-user@LOGIN-NODE-PUBLIC-IP`
## Resources
Here's some additional reading where you can learn more about HPC at AWS.

* [High Performance Computing at AWS](https://aws.amazon.com/hpc/)
* [AWS HPC Blog](https://aws.amazon.com/blogs/hpc/)
* [Day1HPC](https://day1hpc.com/)
* [HPC TechShorts](https://www.youtube.com/c/hpctechshorts)
* [HPC Recipes for AWS](https://github.com/aws-samples/aws-hpc-recipes)
* [AWS PCS user guide](https://docs.aws.amazon.com/pcs/)

recipes/pcs/try_graviton/assets/.gitkeep

Whitespace-only changes.
