Senior HPC AI Engineer

NVIDIA

8d ago

  • Job
    Full-time
    Senior (5-8 years)
  • Software Engineering
  • Remote

AI generated summary

  • You need 5+ years of HPC and AI experience and expertise in CPU/GPU technologies, networking, storage, scripting, automation tools, virtual systems, and cloud platforms. Knowledge of Kubernetes and GPUs and familiarity with RDMA fabrics are a plus.
  • You will design, implement, and maintain HPC/AI clusters, manage job schedules, develop CI/CD pipelines, automate deployments, troubleshoot at all levels, document best practices, support R&D, and engage in future improvements.

Requirements

  • A degree in Computer Science, Engineering, or a related field and 5+ years of experience
  • Knowledge of HPC and AI solution technologies, from CPUs and GPUs to high-speed interconnects and supporting software
  • Experience with job scheduling and workload orchestration tools such as Slurm and Kubernetes (K8s)
  • Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) internals and networking (sockets, firewalld, iptables, Wireshark, etc.), ACLs and OS-level security protection, and common protocols such as TCP, DHCP, and DNS
  • Experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS, and familiarity with newer and emerging storage technologies
  • Python programming and bash scripting experience.
  • Comfortable with automation and configuration management tools such as Jenkins, Ansible, and Puppet/Chef
  • Deep knowledge of networking protocols such as InfiniBand and Ethernet
  • Deep understanding and experience with virtual systems (for example VMware, Hyper-V, KVM, or Citrix)
  • Familiarity with cloud computing platforms (e.g. AWS, Azure, Google Cloud)
  • Ways to stand out from the crowd:
    Knowledge of CPU and/or GPU architecture
    Knowledge of Kubernetes and container-related microservice technologies
    Experience with GPU-focused hardware/software (DGX, CUDA)
    Background with RDMA (InfiniBand or RoCE) fabrics

Responsibilities

  • Design, implement and maintain large-scale HPC/AI clusters with monitoring, logging and alerting
  • Manage Linux job/workload schedules and orchestration tools
  • Develop and maintain continuous integration and delivery pipelines
  • Develop tooling to automate deployment and management of large-scale infrastructure environments, to automate operational monitoring and alerting, and to enable self-service consumption of resources
  • Deploy monitoring solutions for the servers, network and storage
  • Troubleshoot from the bottom up, from bare metal through the operating system and software stack to the application level
  • As a technical resource, develop, redefine and document standard methodologies to share with internal teams
  • Support Research & Development activities and engage in POCs/POVs for future improvements

FAQs

What is the primary focus of the Senior HPC AI Engineer role at NVIDIA?

The primary focus of the Senior HPC AI Engineer role at NVIDIA is to build supercomputers and HPC clusters based on groundbreaking technologies and to contribute to the latest breakthroughs in artificial intelligence and GPU computing. The role also involves providing insights on at-scale system design and tuning mechanisms for large-scale compute runs, and working with various specialists to architect and develop large-scale performance platforms.

Industry: Manufacturing & Electronics
Employees: 10,001+
Founded Year: 1993

Mission & Purpose

Since its founding in 1993, NVIDIA (NASDAQ: NVDA) has been a pioneer in accelerated computing. The company’s invention of the GPU in 1999 sparked the growth of the PC gaming market, redefined computer graphics, ignited the era of modern AI and is fueling the creation of the metaverse. NVIDIA is now a full-stack computing company with data-center-scale offerings that are reshaping industry.