We are hiring an HPC DevOps Engineer to design, develop, and support HPC clusters for research, financial backtesting, and model optimization.
The role focuses on SLURM-based workload management, cloud and hybrid setups, and high-performance computing infrastructure.
This role is ideal for someone who takes ownership, thrives in high-performance environments, and is eager to build scalable, efficient HPC systems.
Key responsibilities
- Deploy, manage, and optimize HPC clusters using AWS ParallelCluster, SLURM, and parallel file systems.
- Automate cluster provisioning, configuration, and scaling with Ansible, Terraform, and scripting (Bash/Python).
- Implement monitoring, security, and CI/CD pipelines to ensure stability and efficiency.
- Collaborate with cross-functional teams to design, implement, and optimize scalable and reliable infrastructure solutions.
- Develop and maintain automation scripts and tools to streamline operational workflows.
- Document new processes and procedures, and keep existing documentation up to date and relevant.
- Troubleshoot and resolve complex issues related to infrastructure, deployment, and performance.
Requirements
- 4+ years of relevant work experience in an IT Ops role.
- Expertise in Linux performance tuning, job schedulers (SLURM), and HPC storage solutions.
- Understanding of networking concepts and technologies (TCP/IP, firewalls, VPNs, load balancing).
- Hands-on experience with AWS infrastructure, automation tools (Ansible, Terraform), and scripting (Python/Bash).
- Familiarity with containerization (Docker, Kubernetes), monitoring (Prometheus, Grafana, ELK), and CI/CD pipelines (GitLab, Jenkins).
- Familiarity with the following technologies: GitLab, iptables, IPsec, Docker, OpenVPN, Zabbix, Prometheus, Grafana, ELK, Proxmox, AWS.
- Strong communication skills.
- A deep sense of ownership and urgency; a detail-oriented approach to operations.
Would be a plus
- Experience with HPC-specific optimizations, parallel file systems, and cloud-native HPC solutions.
- Knowledge of low-latency networking and high-speed interconnects.
- Experience working with GPUs or other accelerators in HPC/ML/AI environments.
Interview process
- HR interview.
- Technical interview.
- Test assignment.
- Final interview.