Senior HPC AI Cluster Engineer

NVIDIA

10d ago

  • Job
    Full-time
    Senior Level
  • Engineering
    IT & Cybersecurity

AI generated summary

  • You need a degree in Computer Science/Engineering and 5+ years of HPC/AI experience, knowledge of job schedulers, Windows/Linux networking, storage solutions, Python, automation tools, and cloud platforms.
  • You will design and maintain HPC/AI clusters, manage workloads, automate deployments, deploy monitoring solutions, troubleshoot systems, document best practices, and support R&D activities.

Requirements

  • A degree in Computer Science, Engineering, or a related field and 5+ years of experience
  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) networking (sockets, firewalld, iptables, Wireshark, etc.) and internals, ACLs, OS-level security protection, and common protocols such as TCP, DHCP, and DNS
  • Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS. Familiarity with newer and emerging storage technologies.
  • Python programming and bash scripting experience.
  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef
  • Deep knowledge of networking protocols such as InfiniBand and Ethernet
  • Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix)
  • Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)

Responsibilities

  • Design, implement and maintain large scale HPC/AI clusters with monitoring, logging and alerting
  • Manage Linux job/workload schedulers and orchestration tools
  • Develop and maintain continuous integration and delivery pipelines
  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources
  • Deploy monitoring solutions for the servers, network and storage
  • Troubleshoot from the bottom up, across the bare-metal, operating system, software stack, and application levels
  • Serve as a technical resource: develop, redefine, and document standard methodologies to share with internal teams
  • Support Research & Development activities and engage in POCs/POVs for future improvements

FAQs

What is the main focus of the Senior HPC AI Cluster Engineer role?

The main focus is on building supercomputers and HPC clusters based on groundbreaking technologies to contribute to advancements in artificial intelligence and GPU computing.

What qualifications are required for this position?

A degree in Computer Science, Engineering, or a related field, along with 5+ years of experience is required.

What kinds of technologies will I be working with?

You will work with HPC and AI solution technologies, including CPUs, GPUs, high-speed interconnects, and supporting software.

What experience is preferred with workload scheduling tools?

Experience with job scheduling workloads and orchestration tools such as Slurm and Kubernetes (K8s) is preferred.

Which operating systems should I be knowledgeable about?

Excellent knowledge of Windows and Linux (specifically Red Hat/CentOS and Ubuntu) is required.

What programming skills are necessary for this role?

Python programming and bash scripting experience are necessary for this role.

Are there specific automation or configuration management tools I should be familiar with?

Yes, familiarity with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef is important.

Is knowledge of networking protocols important for this role?

Yes, deep knowledge of networking protocols like InfiniBand and Ethernet is essential.

What storage solutions should I be experienced with?

Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS is needed, along with familiarity with emerging storage technologies.

Will I be involved in Research & Development activities?

Yes, you will support Research & Development activities and engage in POCs/POVs for future improvements.

Is cloud computing experience relevant for this position?

Yes, familiarity with cloud computing platforms such as AWS, Azure, and Google Cloud is relevant.

What are some preferred areas of knowledge that can help me stand out?

Knowledge of CPU and/or GPU architecture, Kubernetes, container-related microservice technologies, and GPU-focused hardware/software (such as DGX and CUDA), along with a background in RDMA fabrics, is preferred.

What kind of infrastructure will I be designing and maintaining?

You will design, implement, and maintain large-scale HPC/AI clusters, including monitoring, logging, and alerting capabilities.

What types of troubleshooting skills are necessary for this position?

You should be able to troubleshoot from the bottom up, across the bare-metal, operating system, software stack, and application levels.

Are there opportunities for professional development within the company?

Yes, there will be opportunities to develop, redefine, and document standard methodologies, sharing knowledge with internal teams.

What is NVIDIA's stance on diversity and equal opportunity?

NVIDIA values diversity and provides equal opportunity employment, ensuring no discrimination based on race, religion, color, national origin, sex, gender identity, sexual orientation, age, marital status, veteran status, or disability status.

Industry: Manufacturing & Electronics
Employees: 10,001+
Founded Year: 1993

Mission & Purpose

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI, and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.