# Try AWS PCS with Graviton-powered EC2 instances

## Info

This recipe helps you launch a Slurm cluster using AWS Parallel Computing Service (AWS PCS), powered by Amazon EC2 instances with Graviton processors.

## Pre-requisites

1. An active AWS account with an administrative user. If you do not have one, see [Sign up for AWS and create an administrative user](https://docs.aws.amazon.com/pcs/latest/userguide/setting-up.html) in the AWS PCS user guide.
2. Sufficient Amazon EC2 service quota to launch the cluster. To check your quotas:
    * Navigate to the [AWS Service Quotas console](https://console.aws.amazon.com/servicequotas/home/services/ec2/quotas).
    * Change to the region where you will use PCS with Graviton instances (`us-east-1`, `ap-northeast-1`, or `eu-west-1`).
    * Search for **Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances**
    * Make sure your **Applied account-level quota value** is at least 16
    * Search for **Running On-Demand HPC instances**
    * Make sure your **Applied quota value** is at least 64 to run one HPC instance. Each additional running HPC instance requires another 64 vCPUs of quota.
    * If either quota is too low, choose the **Request increase at account-level** option and wait for your request to be processed. Then, return to this exercise.
    * Note that service quotas are region-specific. You can also check them from the AWS CLI, as sketched after this list.
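
If you prefer the command line, here is a minimal sketch that lists the relevant applied quota values. The name filters are an assumption (quota names can vary slightly), so verify the results against the console if anything looks off, and run it once per region you plan to use.

```shell
# List applied EC2 quotas whose names mention "Standard" or "HPC" instances.
aws service-quotas list-service-quotas \
    --service-code ec2 \
    --region us-east-1 \
    --query "Quotas[?contains(QuotaName, 'Standard') || contains(QuotaName, 'HPC')].[QuotaName, Value]" \
    --output table
```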

## Create an AWS PCS cluster powered by Graviton processors

To launch a cluster using AWS CloudFormation in a region where Graviton HPC instances are available:

* Choose the quick-create link that corresponds to the region where you will work with PCS and Graviton instances:
  * `us-east-1` (Virginia, United States): [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/create/review?stackName=try-graviton-cfn&templateURL=https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/pcs/try_graviton/assets/cluster.cfn.yaml)
  * `ap-northeast-1` (Tokyo, Japan): [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=ap-northeast-1#/stacks/create/review?stackName=try-graviton-cfn&templateURL=https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/pcs/try_graviton/assets/cluster.cfn.yaml)
  * `eu-west-1` (Dublin, Ireland): [Launch stack](https://console.aws.amazon.com/cloudformation/home?region=eu-west-1#/stacks/create/review?stackName=try-graviton-cfn&templateURL=https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/pcs/try_graviton/assets/cluster.cfn.yaml)
* Follow the instructions in the AWS CloudFormation console:
  * Under **Parameters**
    * (Optional) Customize the stack name
    * For **SlurmVersion**, choose one of the supported Slurm versions
    * For **ClientIpCidr**, either leave the default value or replace it with a more restrictive CIDR range
    * Leave the parameters under **HPC Recipes configuration** at their default values.
  * Under **Capabilities and transforms**
    * Check all three boxes
  * Choose **Create stack**
* Monitor the status of your stack (e.g. **try-graviton-cfn**), either in the console or with the AWS CLI as sketched after this list.
  * When its status reaches `CREATE_COMPLETE`, you can interact with the PCS cluster.
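
If you would rather deploy from the command line, a minimal AWS CLI sketch follows. It assumes the template accepts its defaults for every parameter (including **SlurmVersion**); if a parameter requires an explicit value, add it with `--parameters`.

```shell
# Create the stack in us-east-1 (substitute ap-northeast-1 or eu-west-1 as needed).
# The three capabilities correspond to the three checkboxes in the console.
aws cloudformation create-stack \
    --stack-name try-graviton-cfn \
    --region us-east-1 \
    --template-url https://aws-hpc-recipes.s3.us-east-1.amazonaws.com/main/recipes/pcs/try_graviton/assets/cluster.cfn.yaml \
    --capabilities CAPABILITY_IAM CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND

# Block until the stack reaches CREATE_COMPLETE (or report failure).
aws cloudformation wait stack-create-complete \
    --stack-name try-graviton-cfn \
    --region us-east-1
```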

## Interact with the PCS cluster

You can work with your new cluster using the AWS PCS console, or you can connect to its login node to run jobs and manage data. Your new CloudFormation stack can help you with this. In the [AWS CloudFormation console](https://console.aws.amazon.com/cloudformation/home), choose the stack you created. Then, navigate to the **Outputs** tab.

There will be three URLs:
* **SshKeyPairSsmParameter** This link takes you to the SSH key that was generated to enable SSH access to the cluster. See `Extra: Connecting via SSH` below to learn how to use it.
* **PcsConsoleUrl** This is a link to the cluster you created, in the PCS console. Go here to explore the cluster, node group, and queue configurations.
* **Ec2ConsoleUrl** This link takes you to a filtered view of the EC2 console that shows the instance(s) managed by the `login` node group.
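
You can also read the same outputs from the AWS CLI, assuming the default stack name and a deployment in `us-east-1`:

```shell
# Print the stack outputs (SshKeyPairSsmParameter, PcsConsoleUrl, Ec2ConsoleUrl).
aws cloudformation describe-stacks \
    --stack-name try-graviton-cfn \
    --region us-east-1 \
    --query "Stacks[0].Outputs" \
    --output table
```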

### Connect to the cluster

You can connect to your PCS cluster login node right in the browser (or from the AWS CLI, as sketched after these steps).
1. Navigate to the **Ec2ConsoleUrl** URL.
2. Select an instance and choose **Connect**.
3. On the **Connect to instance** page, choose **Session Manager**.
4. Choose the **Connect** button. You will be taken to a terminal session.
5. Become the `ec2-user` user by typing `sudo su - ec2-user`.
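
If you have the Session Manager plugin installed alongside the AWS CLI, you can open the same session from your own terminal. The instance ID below is a placeholder; look up the real one via the **Ec2ConsoleUrl** link.

```shell
# Start a Session Manager session with the login node (i-0123456789abcdef0 is a placeholder).
aws ssm start-session \
    --target i-0123456789abcdef0 \
    --region us-east-1
```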

### Cluster design

There is one Slurm partition on the system: `hpc`. It sends work to the `hpc7g-16xlarge` node group, which features [`hpc7g.16xlarge`](https://aws.amazon.com/ec2/instance-types/hpc7g/) instances with Elastic Fabric Adapter (EFA) built in.

Find the queues by running `sinfo` and inspect the nodes with `scontrol show nodes`.
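
As a quick functional test, you can run a one-line interactive job on the `hpc` partition. This launches an `hpc7g.16xlarge` instance on demand, so the first job may take several minutes to start.

```shell
# Run `hostname` on one node in the hpc partition to confirm scheduling works.
srun --partition=hpc --nodes=1 hostname
```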

The `/home` and `/fsx` directories are network file systems. The `/home` directory is provided by [Amazon Elastic File System](https://aws.amazon.com/efs/), while the `/fsx` directory is powered by [Amazon FSx for Lustre](https://aws.amazon.com/fsx/lustre/). You can install software in either `/home` or `/fsx`. We recommend you run jobs out of the `/fsx` directory.

Verify that these filesystems are present with `df -h`. It will return output resembling this:

```shell
[ec2-user@ip-10-0-8-20 ~]$ df -h
Filesystem                                          Size  Used Avail Use% Mounted on
devtmpfs                                            3.8G     0  3.8G   0% /dev
tmpfs                                               3.8G     0  3.8G   0% /dev/shm
tmpfs                                               3.8G  612K  3.8G   1% /run
tmpfs                                               3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/nvme0n1p1                                       24G   20G  4.2G  83% /
fs-0d0a17eaafcc0d0e6.efs.us-east-1.amazonaws.com:/  8.0E     0  8.0E   0% /home
10.0.10.150@tcp:/xjmflbev                           1.2T  4.5G  1.2T   1% /fsx
tmpfs                                               774M     0  774M   0% /run/user/0
```

### Run some jobs

Once you have connected to the login instance, follow along with the **Getting Started with AWS PCS** tutorial, starting at [_Explore the cluster environment in AWS PCS_](https://docs.aws.amazon.com/pcs/latest/userguide/getting-started_explore.html).
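
If you want to see a batch job go through before diving into the tutorial, here is a minimal sketch of the sbatch workflow. The file name and job parameters are arbitrary examples, not part of the recipe.

```shell
# Write a one-node test job for the hpc partition and submit it.
cat << 'EOF' > hello.sbatch
#!/bin/bash
#SBATCH --partition=hpc
#SBATCH --nodes=1
#SBATCH --job-name=hello
srun hostname
EOF

sbatch hello.sbatch   # submit the job
squeue                # watch it move from pending to running
```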

## Cleaning Up

When you are done using your PCS cluster, you can delete it and all its associated resources by navigating to the AWS CloudFormation console and deleting the stack you created.

However, if you have created additional resources in your cluster, beyond the `login` and `hpc7g-16xlarge` node groups or the `hpc` queue, **you must delete those resources** in the PCS console before deleting the CloudFormation stack. Otherwise, the stack deletion will fail and you will need to clean up several resources manually.

If you do need to delete extra resources, go to the detail page for your PCS cluster:
* Delete any queues besides `hpc`.
* Delete any node groups besides `login` and `hpc7g-16xlarge`.

**Note** We do not recommend you create or delete any resources in this demonstration cluster. Get started building your own, fully customizable HPC clusters with [this tutorial](https://docs.aws.amazon.com/pcs/latest/userguide/getting-started.html) in the AWS PCS user guide.
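
Assuming you kept the default stack name, created no extra resources, and deployed in `us-east-1`, a CLI sketch of the cleanup looks like this:

```shell
# Delete the stack and wait for teardown to finish
# (this removes the cluster, node groups, queue, and supporting resources).
aws cloudformation delete-stack --stack-name try-graviton-cfn --region us-east-1
aws cloudformation wait stack-delete-complete --stack-name try-graviton-cfn --region us-east-1
```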

## Extra: Connecting via SSH

By default, we have configured the cluster to support logins via Session Manager, in the browser. If you want to connect using regular SSH, here's how.

### Retrieve the SSH key

We generated an SSH key as part of deploying the cluster. It is stored in [AWS Systems Manager Parameter Store](https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html). You can download the key and use it to connect to the public IP address of your PCS cluster login node.

* Go to the **SshKeyPairSsmParameter** URL
* Copy the name of the SSH key - it will look like this: `/ec2/keypair/key-HEXADECIMAL-DATA`
* Use the AWS CLI to download the key, replacing `us-east-1` with the region where you deployed the cluster:

`aws ssm get-parameter --name "/ec2/keypair/key-HEXADECIMAL-DATA" --query "Parameter.Value" --output text --region us-east-1 --with-decryption > key-HEXADECIMAL-DATA.pem`

* Set permissions on the key to owner-readable: `chmod 400 key-HEXADECIMAL-DATA.pem`

### Log in to the cluster

* Log in to the login node's public IP, which you can retrieve via **Ec2ConsoleUrl**:

`ssh -i key-HEXADECIMAL-DATA.pem ec2-user@LOGIN-NODE-PUBLIC-IP`

## Resources

Here's some additional reading where you can learn more about HPC at AWS.

* [High Performance Computing at AWS](https://aws.amazon.com/hpc/)
* [AWS HPC Blog](https://aws.amazon.com/blogs/hpc/)
* [Day1HPC](https://day1hpc.com/)
* [HPC TechShorts](https://www.youtube.com/c/hpctechshorts)
* [HPC Recipes for AWS](https://github.com/aws-samples/aws-hpc-recipes)
* [AWS PCS user guide](https://docs.aws.amazon.com/pcs/)