Cloud HPC solutions, while not being as performant due to cloud constraints provide a provide a flexible and cost effective environment.
HPC Cluster on AWS
AWS, Terraform, Slurm
The AI/ML explosion required more researchers and more GPU/Researcher. On premises HPC clusters have a few advantages (customizable, performance, security and control) while having many disadvantages (massive upfront investment, long ROI, takes years to build making fast pacing hardware obsolete).
Cloud HPC solutions, while not being as performant due to cloud constraints (hardware co-location, storage technologies, network constraints) provide a flexible and cost effective environment making it ideal for testing cutting edge hardware and when having overflow capacity.
This cloud offerings have limited features, sometimes making hard to adopt.
Ingratiation with internal services was a priority.
HPC Slurm cluster deployed on AWS using AWS ParallelCluster as base layer and boosted with many custom features to get a production ready environment: Secure access internal users. Unix users management. Secure access. 2FA. S3 data pipelines. Support for Multiple FSx for Lustre. Slurm partitions and limits. Slurm Accounting. Observability. Hardware testing. Login Nodes. Support for multiple tenants on different accounts. Persistent $HOME. Lustre eviction. Capacity planning. Custom safeguards for AWS services. Over time an Azure cluster was also added to the stack using Cycle Cloud. Tech stack: Terraform. Packer. AWS (EC2 + EFA, FSx, EFS, S3, SES, SNS, SQS, Step Functions, Cognito DynamoDB, CloudWatch). PyTorch + NCCL. DUO