April 2025

Creating a Cloud HPC solution for a FAANG company

Cloud HPC solutions, while less performant than on-premises clusters because of cloud constraints, provide a flexible and cost-effective environment.

Offering

HPC Cluster on AWS

Technologies

AWS, Terraform, Slurm

Client Challenge

The AI/ML explosion required more researchers and more GPUs per researcher. On-premises HPC clusters have a few advantages (customizability, performance, security, and control) but many disadvantages (a massive upfront investment, a long wait for ROI, and a build time measured in years, by which point fast-moving hardware is already obsolete).

Cloud HPC solutions, while less performant because of cloud constraints (hardware co-location, storage technologies, network constraints), provide a flexible and cost-effective environment, making them ideal for testing cutting-edge hardware and for handling overflow capacity.

These cloud offerings have limited features, which sometimes makes them hard to adopt.

Integration with internal services was a priority.

Solution Delivered

An HPC Slurm cluster deployed on AWS, using AWS ParallelCluster as the base layer and extended with many custom features to make it production ready:

Secure access for internal users
Unix user management
2FA
S3 data pipelines
Support for multiple FSx for Lustre file systems
Slurm partitions and limits
Slurm accounting
Observability
Hardware testing
Login nodes
Support for multiple tenants on different accounts
Persistent $HOME
Lustre eviction
Capacity planning
Custom safeguards for AWS services

Over time, an Azure cluster was also added to the stack using Azure CycleCloud.

Tech stack:

Terraform
Packer
AWS (EC2 + EFA, FSx, EFS, S3, SES, SNS, SQS, Step Functions, Cognito, DynamoDB, CloudWatch)
PyTorch + NCCL
Duo
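As a rough illustration of the capacity planning mentioned above, the sketch below estimates how many GPU nodes a cluster would need for a given number of researchers. It is not the tooling used in this engagement: the function name `nodes_needed`, the default of 8 GPUs per node, the headroom percentage, and all the numbers in the usage lines are assumptions chosen for the example.

```python
def nodes_needed(researchers: int,
                 gpus_per_researcher: int,
                 gpus_per_node: int = 8,
                 headroom_pct: int = 0) -> int:
    """Estimate the number of GPU nodes required.

    All parameters are hypothetical planning inputs:
      researchers          number of active researchers
      gpus_per_researcher  target GPUs available to each researcher
      gpus_per_node        GPUs in one compute node (8 is an assumed default)
      headroom_pct         extra capacity, as a whole-number percentage
    """
    if researchers < 0 or gpus_per_researcher < 0 or headroom_pct < 0:
        raise ValueError("inputs must be non-negative")
    if gpus_per_node <= 0:
        raise ValueError("gpus_per_node must be positive")

    # Work in integers (scaled by 100) to avoid floating-point rounding.
    scaled_gpus = researchers * gpus_per_researcher * (100 + headroom_pct)
    scaled_node = gpus_per_node * 100

    # Integer ceiling division: a partially used node still has to be provisioned.
    return -(-scaled_gpus // scaled_node)


if __name__ == "__main__":
    # Illustrative values only.
    print(nodes_needed(500, 12))                   # no headroom
    print(nodes_needed(500, 12, headroom_pct=10))  # 10% headroom
```

Rounding up matters here because nodes are provisioned whole; a real plan would also account for instance availability, reservations, and per-tenant quotas, which this sketch ignores.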

Project Results

500+ researchers
20+ clusters
5+ accounts/tenants
6000+ GPUs under management
multiple PB on S3/FSx
AWS ParallelCluster took many ideas from this engagement

