Utilization and Performance Optimization Intern - Non-CMU Graduate Student - Pittsburgh Supercomputing Center

Applications are closed

  • Internship
    Part-time
    Off-cycle Internship
  • IT & Cybersecurity
  • Pittsburgh

Requirements

You should demonstrate:
  • Strong programming skills, preferably in Python, or similar languages relevant to system monitoring and performance analysis.
  • Basic understanding of high-performance computing (HPC) environments and GPU computing.
  • Ability to work independently and collaboratively in a research-focused environment.
  • Keen interest in computational research and performance optimization.
Qualifications:
  • Candidates must be pursuing a Master’s degree. Relevant majors include computer science, computer engineering, and any major with a significant computational or programming component.
  • Excellent communication skills and ability to work in a team environment.
  • Excellent problem-solving skills and creativity.

Responsibilities

The intern will be responsible for developing and implementing a system to monitor GPU utilization across all jobs running on the Bridges-2 supercomputer. This includes the generation of alerts for researchers when their jobs underutilize the requested resources. The project comprises several key steps, as outlined below:
  • Framework Identification: Identify popular frameworks to establish a baseline for GPU utilization. This step is essential for understanding the current landscape and setting realistic benchmarks.
  • Base Code Development: Create base code examples for efficiently running jobs on Bridges-2, serving as templates for researchers.
  • Configuration Definition: Define optimal node and GPU counts for various types of jobs on the clusters.
  • Performance Benchmarking: Obtain base performance numbers using NGC containers and Bridges-2 modules, comparing these against Bridges-2-validation and reference Nvidia performance numbers.
  • Automated Testing: Create unittest sbatch jobs (or similar) to automate job submissions and performance evaluations.
  • GPU Utilization Monitoring: Develop a method to measure GPU utilization automatically from running jobs and deploy this system to the real cluster environment.
  • Data Ingest Prototype: Implement a prototype Slurm burst buffer configuration for data ingest to improve data handling efficiency.
  • Performance Comparison: Compare job performance when the original data is stored in Ocean, Jet, and any other specified locations, in order to optimize data transfer and processing speeds.
  • Network Optimization: Configure multiple IB interfaces on machines equipped with them and develop methods to easily measure InfiniBand throughput.
  • Performance Re-Evaluation: Re-run jobs under optimized conditions, including RDMA support if applicable, and compare new performance metrics to initial benchmarks.
Our internships offer the opportunity to gain:
  • Practical experience in monitoring and optimizing GPU utilization on one of the most popular supercomputers in the US.
  • Knowledge of high-performance computing (HPC) practices and challenges.
  • Skills in data analysis, software development, and system optimization.
  • An opportunity to contribute to the efficiency and effectiveness of computational research.
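To make the GPU utilization monitoring and alerting responsibility more concrete, here is a minimal Python sketch of one possible approach. It is not the Pittsburgh Supercomputing Center's method. It assumes utilization is sampled with `nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`, and the function names (`sample_gpus`, `parse_sample`, `underutilized_gpus`) and the 30% threshold are hypothetical choices made for illustration.

```python
import subprocess
from statistics import mean

# nvidia-smi query whose output lines look like "0, 87"
# (GPU index, percent utilization).
NVIDIA_SMI_CMD = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def sample_gpus():
    """Run nvidia-smi once and return its raw CSV text (needs an NVIDIA driver)."""
    completed = subprocess.run(
        NVIDIA_SMI_CMD, capture_output=True, text=True, check=True
    )
    return completed.stdout


def parse_sample(csv_text):
    """Parse 'index, utilization' lines into {gpu_index: utilization_percent}."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, util = (field.strip() for field in line.split(","))
        result[int(index)] = float(util)
    return result


def underutilized_gpus(samples, threshold=30.0):
    """Return {gpu_index: mean_utilization} for GPUs averaging below threshold.

    The 30% default is an illustrative assumption, not a site policy.
    """
    per_gpu = {}
    for sample in samples:
        for index, util in sample.items():
            per_gpu.setdefault(index, []).append(util)
    return {
        index: mean(values)
        for index, values in per_gpu.items()
        if mean(values) < threshold
    }


if __name__ == "__main__":
    # Canned text stands in for repeated sample_gpus() calls on a GPU node.
    samples = [parse_sample("0, 95\n1, 5\n"), parse_sample("0, 85\n1, 15\n")]
    print(underutilized_gpus(samples))  # {1: 10.0}
```

In a real deployment the samples would come from periodic `sample_gpus()` calls tied to a running job, and the flagged GPUs would feed whatever alerting channel the center chooses. Both of those parts are outside this sketch.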

We learn, we make, we solve, we create, and we give it to the world.

Industry: Education
Employees: 5001-10,000
Founded Year: 1900

Mission & Purpose

Carnegie Mellon University founder Andrew Carnegie said: "My heart is in the work." No statement better captures the passion and drive of our people to make a real difference. At Carnegie Mellon, we're not afraid of the work. Our educational environment creates problem solvers, drivers of innovation and pioneers in technology and the arts. Employers in every field say our graduates are ready to hit the ground running the day they graduate. So, join us. Whether you're looking for a career or an education. Or both.