
aws-solutions-library-samples/guidance-for-protein-language-esm-model-training-with-sagemaker-hyperpod

Guidance for Training Transformer Protein Language Models (ESM-2) with Amazon SageMaker HyperPod on AWS

This guidance instructs users on how to provision SageMaker HyperPod clusters using both Slurm and Kubernetes (Amazon EKS) based orchestration. In addition, it provides code examples for pre-training popular protein language models, such as the second-generation Evolutionary Scale Modeling (ESM-2) model, using the DDP, FSDP, and NVIDIA BioNeMo frameworks on Amazon SageMaker HyperPod clusters.

Table of Contents

Required

  1. Overview
  2. Prerequisites
  3. Deployment Steps
  4. Deployment Validation
  5. Running the Guidance
  6. Next Steps
  7. Cleanup
  8. Revisions
  9. Notices
  10. Authors

Overview

As generative artificial intelligence (generative AI) continues to transform industries, the life sciences sector is leveraging these advanced technologies to accelerate drug discovery. Generative AI tools powered by deep learning models make it possible to analyze massive datasets, identify patterns, and generate insights to aid the search for new drug compounds. However, running these generative AI workloads requires a full-stack approach that combines robust computing infrastructure with optimized domain-specific software that can accelerate time to solution.

With the recent proliferation of new models and tools in this field, researchers are looking for ways to simplify the training, customization, and deployment of these generative AI models, and our high performance computing (HPC) customers are asking how to easily perform distributed training with these models on AWS. In this guidance, we'll demonstrate how to pre-train the Evolutionary Scale Modeling (ESM-2) model using NVIDIA GPUs on Amazon SageMaker HyperPod, a highly available managed platform.

NVIDIA BioNeMo

NVIDIA BioNeMo is a generative AI platform for drug discovery that simplifies and accelerates the training of models using your own data. BioNeMo provides researchers and developers a fast and easy way to build and integrate state-of-the-art generative AI applications across the entire drug discovery pipeline—from target identification to lead optimization—with AI workflows for 3D protein structure prediction, de novo design, virtual screening, docking, and property prediction.

The BioNeMo framework facilitates centralized model training, optimization, fine-tuning, and inference for protein and molecular design. Researchers can build and train foundation models from scratch at scale, or use pre-trained model checkpoints provided with the BioNeMo Framework for fine-tuning on downstream tasks. Currently, BioNeMo supports biomolecular AI architectures that can be scaled to billions of parameters, such as BERT and Striped Hyena, along with models such as ESM-2, Evo 2, and Geneformer.
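
As a point of reference, the BioNeMo Framework is distributed as a container image on NVIDIA NGC; the commands below are a sketch only, the image tag is a placeholder, and pulling the image requires an NGC API key.

# Log in to the NVIDIA NGC registry (the password prompt expects your NGC API key)
docker login nvcr.io --username '$oauthtoken'
# Pull the BioNeMo Framework image; replace <tag> with a current release tag from NGC
docker pull nvcr.io/nvidia/clara/bionemo-framework:<tag>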

Architecture Overview

This section provides architecture diagrams and describes the components deployed with this Guidance.

Architecture and steps for provisioning SageMaker HyperPod SLURM Cluster



Figure 1. Reference Architecture - AWS SageMaker HyperPod SLURM based Cluster

  1. If reserved EC2 capacity is required for the instance types to be provisioned as HyperPod cluster nodes, your account team can reserve compute capacity with an On-Demand Capacity Reservation (ODCR) or Amazon SageMaker HyperPod Flexible Training Plans
  2. Admins/DevOps Engineers use the AWS CloudFormation stack to deploy Virtual Private Cloud (VPC) networking, Amazon Simple Storage Service (S3) and FSx for Lustre (FSxL) storage, and Identity and Access Management (IAM) resources into the customer account
  3. Admins/DevOps Engineers push lifecycle scripts to the S3 bucket created in the previous step
  4. Admins/DevOps Engineers use the AWS CLI to create the SageMaker HyperPod cluster, including the controller node, compute nodes, etc. (see the command sketch after this list)
  5. Admins/DevOps Engineers generate a key pair to establish access to the controller node of the SageMaker HyperPod cluster
  6. Once the SageMaker HyperPod cluster is created, Admins/DevOps Engineers and Data Scientists/ML Engineers can test SSH access to the controller and compute nodes and examine the cluster
  7. Admins/DevOps Engineers configure IAM to use Amazon Managed Prometheus to collect metrics and Amazon Managed Grafana to visualize them
  8. Admins/DevOps Engineers can make further changes to the cluster using the AWS CLI
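
A minimal command sketch for steps 3, 4, and 6 is shown below; the local paths, bucket, file names, and cluster name are hypothetical placeholders, and the exact cluster JSON specification is covered in the Implementation Guide.

# Push lifecycle scripts (placeholder local path) to the S3 bucket created by the CloudFormation stack
aws s3 cp --recursive ./LifecycleScripts/base-config/ s3://<your-bucket>/LifecycleScripts/base-config/

# Create the HyperPod cluster from a JSON specification (hypothetical file name), then poll its status
aws sagemaker create-cluster --cli-input-json file://cluster-config.json
aws sagemaker describe-cluster --cluster-name <your-cluster-name> --query ClusterStatus

# List cluster nodes, then open an SSM session to one of them; the target string combines the
# cluster ID, instance group name, and instance ID returned by list-cluster-nodes
aws sagemaker list-cluster-nodes --cluster-name <your-cluster-name>
aws ssm start-session --target sagemaker-cluster:<cluster-id>_<instance-group>-<instance-id>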

Architecture and steps for provisioning SageMaker HyperPod EKS Cluster



Figure 2. Reference Architecture - AWS SageMaker HyperPod EKS based Cluster

  1. If reserved EC2 capacity is required for the instance types to be provisioned as HyperPod cluster nodes, your account team can reserve capacity with an On-Demand Capacity Reservation (ODCR) or Amazon SageMaker HyperPod Flexible Training Plans.
  2. Admins/DevOps Engineers use the eksctl CLI to provision an Amazon EKS cluster (see the command sketch after this list)
  3. Admins/DevOps Engineers use the SageMaker HyperPod VPC stack to deploy a HyperPod managed node group on the EKS cluster
  4. Admins/DevOps Engineers verify access to the EKS cluster and SSM access to the HyperPod nodes
  5. Admins/DevOps Engineers install the FSx for Lustre CSI driver and mount the file system on the EKS cluster
  6. Admins/DevOps Engineers install the Amazon EFA Kubernetes device plugin
  7. Admins/DevOps Engineers configure IAM to use Amazon Managed Prometheus for the observability stack and metric collection, and Amazon Managed Grafana to display those metrics
  8. Admins/DevOps Engineers can configure Container Insights to push metrics to Amazon CloudWatch
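
A minimal command sketch for steps 2, 4, 5, and 6 is shown below; the config file and cluster name are hypothetical placeholders, and the Helm chart locations are the public upstream repositories at the time of writing.

# Provision the EKS cluster from a config file (hypothetical file name) and verify access
eksctl create cluster -f eks-cluster-config.yaml
aws eks update-kubeconfig --name <your-eks-cluster> --region <region>
kubectl get nodes

# Install the FSx for Lustre CSI driver and the EFA Kubernetes device plugin via Helm
helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
helm install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver --namespace kube-system
helm repo add eks https://aws.github.io/eks-charts
helm install aws-efa-k8s-device-plugin eks/aws-efa-k8s-device-plugin --namespace kube-system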

Cost

You are responsible for the cost of the AWS services used while running this Guidance. We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.

Sample Cost Table

The following tables provide a sample cost breakdown for deploying this guidance with the default parameters in the US East (N. Virginia) Region for one month. As of September 2025, the monthly costs for running this Guidance with the default settings in the US East (N. Virginia) us-east-1 Region are shown below for HyperPod SLURM- and EKS-based clusters, respectively:

HyperPod cluster with SLURM Infrastructure

AWS service Dimensions Cost [USD] / month
Compute 2 * ml.g5.8xlarge 4467.60
Compute 1 * ml.m5.12xlarge 2018.45
Storage S3 (1GB) 0.02
Storage EBS (500GB) 344.87
Storage FSx (1.2TB) 720.07
Network VPC, Subnets, NAT Gateway, VPC Endpoints 513.20
Total $8064.21

Please see cost breakdown details in this AWS Calculator instance

HyperPod cluster with EKS Infrastructure

AWS service Dimensions Cost [USD] / month
Compute EC2 2 * ml.g5.8xlarge 4467.60
Control Plane EKS Control Plane 73.00
Container Registry ECR 1.32
Storage S3 (1GB) 0.02
Storage EBS (500GB) 229.92
Storage FSx (1.2TB) 720.07
Network VPC, Subnets, NAT Gateway, VPC Endpoints 507.80
Total $5999.73

Please see cost breakdown details in this AWS Calculator instance

Prerequisites

Operating System

Amazon SageMaker HyperPod compute nodes support the following operating systems:

  • Amazon Linux 2
  • Ubuntu 20.04
  • Ubuntu 22.04

These Linux-based operating systems are optimized for machine learning workloads and are fully compatible with SageMaker HyperPod's distributed training capabilities. The OS images are managed and maintained by AWS to ensure security and performance optimizations for ML training workloads. We highly recommend using the optimized SageMaker Studio Code Editor environment to run the HyperPod cluster provisioning commands.

Third-party tools

Install the AWS CLI (for both kinds of HyperPod clusters)

Depending on the OS that you are using, run a command similar to:

curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install --update
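
You can optionally confirm the AWS CLI installed correctly with:

aws --version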

Install kubectl (for EKS-orchestrated clusters)

The following commands install the Kubernetes (K8s) API CLI client, kubectl:

curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.30.4/2024-09-11/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
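
You can optionally confirm the client installed correctly with:

kubectl version --client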

Install eksctl CLI utility

The following commands install the eksctl command line utility used to manage EKS-based clusters:

# for ARM systems, set ARCH to: `arm64`, `armv6` or `armv7`
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
# (Optional) Verify checksum
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
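
You can optionally confirm the installation with:

eksctl version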

Install Helm Package manager

Helm is a package manager for Kubernetes that will be used to install various dependencies using Charts, which bundle together all the resources needed to deploy an application to a Kubernetes cluster.

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
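
You can optionally confirm the installation with:

helm version --short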

Acquire long-term AWS access credentials

Using your AWS credentials, run aws configure to add them to your terminal environment. See configure aws credentials for more details.

$ aws configure
AWS Access Key ID [None]: <Access key>
AWS Secret Access Key [None]: <Secret access key>
Default region name [None]: <Region>
Default output format [None]: json
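
To confirm the credentials are valid, you can check the identity they resolve to:

aws sts get-caller-identity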

AWS account requirements

This deployment requires an AWS account with permissions to create and manage the resources used by this Guidance, including:

  • An S3 bucket (for lifecycle scripts and training data)
  • A VPC with subnets, NAT Gateway, and VPC Endpoints
  • IAM roles with the permissions required by SageMaker HyperPod
  • An FSx for Lustre file system
  • A SageMaker HyperPod cluster (and, for the EKS variant, an Amazon EKS cluster)
  • Use of a supported, enabled Region (see Supported Regions below)

Service limits

Here are the key service quota limits for SageMaker HyperPod clusters:

1. Instance-related limits:

  • Maximum instances per HyperPod cluster: Must exceed procured capacity + 1 (for controller node)
  • Total instances across all HyperPod clusters: Must exceed procured capacity + 1
  • ML instance type quota for cluster usage: Must exceed procured capacity
  • ML instance type quota for head node

2. Storage limit:

  • Maximum EBS volume size per cluster instance: 2000 GB (recommended)
  • FSx for Lustre storage capacity increments:
    - Persistent or Scratch 2: 1.2 TiB, or increments of 2.4 TiB
    - Scratch 1: 1.2 TiB, 2.4 TiB, or increments of 3.6 TiB

3. Training plan limits (if using training plans):

  • Training-plan-total_count: Limits the number of training plans per Region
  • Reserved-capacity-ml: Limits the number of instances in reserved capacity across training plans per Region

If you need to increase these limits, you can submit service quota increase requests through the AWS Service Quotas console. These requests are typically reviewed and processed within 1-2 business days.
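
Quota increases can also be requested from the command line; the quota code below is a placeholder that you would first look up for the ML instance type you plan to use.

# Find the quota code for the ML instance type you plan to use (example filter shown)
aws service-quotas list-service-quotas --service-code sagemaker \
    --query "Quotas[?contains(QuotaName, 'ml.g5.8xlarge')].[QuotaName,QuotaCode]"

# Request the increase (replace <quota-code> and the desired value with your capacity needs)
aws service-quotas request-service-quota-increase --service-code sagemaker \
    --quota-code <quota-code> --desired-value 2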

Supported Regions

As of September 2025, the Guidance sample code is supported in the following AWS Regions, based on SageMaker HyperPod and specific EC2 instance availability:

Region Name Region Code
US East (N. Virginia) us-east-1
US East (Ohio) us-east-2
US West (Oregon) us-west-2
Asia Pacific (Mumbai) ap-south-1
Asia Pacific (Seoul) ap-northeast-2
Asia Pacific (Singapore) ap-southeast-1
Asia Pacific (Sydney) ap-southeast-2
Asia Pacific (Tokyo) ap-northeast-1
Europe (Frankfurt) eu-central-1
Europe (Ireland) eu-west-1
Europe (London) eu-west-2
Europe (Paris) eu-west-3
South America (São Paulo) sa-east-1

Please consult the current SageMaker HyperPod documentation for the most up-to-date list of supported AWS Regions.

Quotas

Service quotas, also referred to as limits, are the maximum number of service resources or operations for your AWS account.

Quotas for AWS services in this Guidance

Make sure you have sufficient quota for each of the services implemented in this guidance. For more information, see AWS service quotas.

Specifically, make sure you have sufficient service quota for SageMaker EC2 instances you are planning to deploy with the HyperPod clusters, whether SLURM or EKS orchestrator is used.

To view the service quotas for all AWS services in the documentation without switching pages, view the information in the Service endpoints and quotas page in the PDF instead.

Deployment Steps

Please see details of deployment for both types of SageMaker HyperPod clusters in this section of the Implementation Guide

Deployment Validation

Please see details of validation of deployment and access to provisioned HyperPod clusters in this section of the Implementation Guide

Running the Guidance

Please see details of training of Protein Language (ESM-2) models on both types of HyperPod clusters in the Implementation Guide
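
For orientation only, a distributed training launch on the SLURM cluster typically takes the shape of the sbatch sketch below; the script name, GPU count per node, and arguments are hypothetical placeholders, and the actual job files are provided in this repository and the Implementation Guide.

#!/bin/bash
#SBATCH --job-name=esm2-pretrain
#SBATCH --nodes=2
#SBATCH --exclusive

# Use the first node in the allocation as the rendezvous host for torchrun
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

# train_esm2.py and its arguments stand in for the training scripts shipped with this guidance
srun torchrun \
    --nnodes "$SLURM_JOB_NUM_NODES" \
    --nproc_per_node 1 \
    --rdzv_backend c10d \
    --rdzv_endpoint "${head_node}:29500" \
    train_esm2.py --strategy fsdp

Submit the script with sbatch and monitor the job with squeue.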

Next Steps

NOTE: It is highly recommended to patch the software on your HyperPod clusters on a regular basis to keep your clusters secure and up to date.

Please see details of patching software on HyperPod clusters in this section of the Implementation Guide

NOTE: Also, as this is a very rapidly evolving technology area, please keep checking this repository for updates to both the HyperPod cluster infrastructure and the protein language model training code.

Cleanup

Please see details about uninstallation of SageMaker HyperPod clusters and related components in this section of the Implementation Guide

Revisions

All notable changes to this project are documented in the table below. The changelog format follows Keep a Changelog, and versions follow Semantic Versioning.

Date Version Changes
09/04/2025 1.0 Initial version of README with references to Implementation Guide
09/15/2025 1.1 Validated version of README
09/16/2025 1.2 Validated version of README with references to Implementation Guide

Notices

Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.

Third-Party Dependencies Disclaimer

This sample code utilizes various third-party packages, modules, models, and datasets, including but not limited to:

  • BioNeMo
  • NVIDIA base images
  • Facebook ESM models

Important Notice:

  • Amazon Web Services (AWS) is not affiliated with these third-party entities or their components.
  • The maintenance, updates, and security of these third-party dependencies are the sole responsibility of the customer/user.
  • Users should regularly review and update these dependencies to ensure security and compatibility.
  • Users are responsible for compliance with all applicable licenses and terms of use for these third-party components.

Please review and comply with all relevant licenses and terms of service for each third-party component before using in your applications.

Authors

Daniel Zilberman, Sr SA AWS Tech Solutions
Mark Vinciguerra, Associate WW Specialist SA GenAI
Alex Iankoulski, Principal WW Specialist SA GenAI

License

This library is licensed under the MIT-0 License. See the LICENSE file.

About

Infrastructure deployment automation for SageMaker HyperPod clusters based on EKS and SLURM orchestration, along with protein language (ESM-2) model training job definitions, including NVIDIA BioNeMo.
