The objective of this project was to create and administer an HPC cluster on AWS running Slurm, provisioned with Terraform and configured with Ansible.
This was a final research project for my High Performance Computing class at Trent University.
- Creates the AWS infrastructure defined in the `terraform/cluster.tf` file. This is:
  - one login node,
  - as many compute nodes as are defined in the `terraform/terraform.tfvars` file.
  - The network rules are set so that to ssh into any of the compute nodes, you must proxy through the login node (see the example after this list). This reduces the cyber attack surface to just one machine rather than the entire cluster.
- Runs Python scripts to generate:
  - the Ansible inventory and config file,
  - the `slurm.config` file for Slurm,
  - the `/etc/hosts` file for the machines so they can communicate through Slurm easily.
- Runs Ansible playbooks against each node, which:
  - update each machine,
  - set permissions on the directories Slurm needs,
  - set up Munge (required by Slurm),
  - set up the slurmd and slurmctld services.
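Because only the login node accepts SSH from the internet, reaching a compute node means jumping through it. A minimal sketch using OpenSSH's ProxyJump flag, with placeholder addresses (the real ones come from the Terraform output and the generated Ansible inventory):

```bash
# Jump through the login node to reach a compute node.
# <login-node-public-ip> and <compute-node-private-ip> are placeholders.
ssh -J ubuntu@<login-node-public-ip> ubuntu@<compute-node-private-ip>
```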
- You should be running this on a Linux machine (it could work on macOS or Windows, but I have no experience with them).
- You need an AWS account with EC2 permissions.
  - Specifically, you need the access key and secret access key for this account.
- Install the following software:
- Create SSH keys on your machine with `ssh-keygen`:
  - `~/.ssh/aws-cluster-key.pub`
  - `keys/nodekey.pub`
  - `keys/aws-private-cluster-key.pub`
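  For example, the first key pair can be generated with standard `ssh-keygen` flags; repeat with the appropriate `-f` path for the public keys expected under `keys/` (a sketch, not the project's exact key layout):

  ```bash
  # Writes the private key to ~/.ssh/aws-cluster-key and the
  # public key to ~/.ssh/aws-cluster-key.pub
  ssh-keygen -t ed25519 -f ~/.ssh/aws-cluster-key
  ```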
- Export the AWS credentials as environment variables:
  - `export AWS_ACCESS_KEY_ID="anaccesskey"`
  - `export AWS_SECRET_ACCESS_KEY="asecretkey"`
- Initialize Terraform:
  - `terraform init`
- Modify the `terraform/terraform.tfvars` file to set the number of compute nodes you need and the size of the AWS instance you want.
  - Note: this doesn't work with any of the free-tier options. Slurm has trouble with the network speed of the t2.micro; t3.small is the default and works.
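  The variable names below are hypothetical (use whatever names `terraform/terraform.tfvars` already defines); the idea is a small file along these lines:

  ```hcl
  # Hypothetical variable names -- keep the ones already in terraform.tfvars
  compute_node_count = 4
  instance_type      = "t3.small"
  ```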
- Run the `./create_cluster.sh` script, answering 'yes' to the Terraform prompts and to adding the host keys to `known_hosts`.
- Take note of the public IP printed by Terraform, or get it later from the Ansible inventory; you will need it to ssh into the login node.
- Once Ansible is finished, ssh into the login node with `ssh ubuntu@the-login-node-ip`. Run `sinfo` to see your compute cluster, and run some Slurm commands to schedule your jobs (a few examples follow).
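Once on the login node, a quick sanity check might look like this. It assumes slurmctld runs on the login node (the usual controller placement here) and that the cluster has at least two compute nodes; `myjob.sh` is a placeholder for your own batch script:

```bash
# Check that the Slurm controller and Munge authentication are running
systemctl status slurmctld munge

# Show the partitions and compute nodes Slurm knows about
sinfo

# Run a trivial job on two compute nodes
srun -N 2 hostname

# Submit a batch job script (placeholder name)
sbatch myjob.sh
```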
This project is licensed under the terms of the GNU GENERAL PUBLIC LICENSE.