Senior HPC AI Cluster Engineer

FAQs

What is the main focus of the Senior HPC AI Cluster Engineer role?

The main focus is building supercomputers and HPC clusters on groundbreaking technologies to advance artificial intelligence and GPU computing.

What qualifications are required for this position?

A degree in Computer Science, Engineering, or a related field, along with 5+ years of experience, is required.

What kinds of technologies will I be working with?

You will work with HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.

What experience is preferred with workload scheduling tools?

Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s) is preferred.

Which operating systems should I be knowledgeable about?

Excellent knowledge of Windows and Linux (specifically Red Hat/CentOS and Ubuntu) is required.

What programming skills are necessary for this role?

Experience with Python programming and Bash scripting is necessary for this role.

Are there specific automation or configuration management tools I should be familiar with?

Yes, familiarity with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef is important.

Is knowledge of networking protocols important for this role?

Yes, deep knowledge of networking technologies and protocols such as InfiniBand and Ethernet is essential.

What storage solutions should I be experienced with?

Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS is needed, along with familiarity with emerging storage technologies.

Will I be involved in Research & Development activities?

Yes, you will support Research & Development activities and take part in proofs of concept (POCs) and proofs of value (POVs) for future improvements.

Is cloud computing experience relevant for this position?

Yes, familiarity with cloud computing platforms such as AWS, Azure, and Google Cloud is relevant.

What are some preferred areas of knowledge that can help me stand out?

Knowledge of CPU and/or GPU architecture, Kubernetes, container-related microservice technologies, and GPU-focused hardware/software (like DGX and CUDA), as well as a background in RDMA fabrics, is preferred.

What kind of infrastructure will I be designing and maintaining?

You will design, implement, and maintain large-scale HPC/AI clusters, including monitoring, logging, and alerting capabilities.

What types of troubleshooting skills are necessary for this position?

You should be able to troubleshoot at every level: bare metal, operating system, software stack, and application.

Are there opportunities for professional development within the company?

Yes, there will be opportunities to develop, redefine, and document standard methodologies and to share that knowledge with internal teams.

What is NVIDIA's stance on diversity and equal opportunity?

NVIDIA values diversity and provides equal opportunity employment, ensuring no discrimination based on race, religion, color, national origin, sex, gender identity, sexual orientation, age, marital status, veteran status, or disability status.