Internship

Utilization and Performance Optimization Intern - Non-CMU Graduate Student - Pittsburgh Supercomputing Center

🚀 Off-cycle Internship

Pittsburgh

AI-generated summary

  • You should be pursuing a Master’s degree and have strong programming skills, an understanding of HPC and GPU computing, the ability to work both independently and collaboratively, a keen interest in research and optimization, and excellent communication and problem-solving skills.
  • You will develop, implement, and optimize a GPU utilization monitoring system for the Bridges-2 supercomputer, ensuring that researchers' jobs efficiently use the resources they request. You will gain practical HPC experience and contribute to the efficiency of computational research.

Description

  • We are seeking a motivated and technically skilled intern to join our team for a summer project focused on optimizing GPU utilization on the Bridges-2 supercomputer. The intern will monitor and improve the efficiency of GPU jobs running on Bridges-2. This project is critical for ensuring that researchers fully utilize the resources they request, which in turn improves computational efficiency and research output. The intern will gain hands-on experience in high-performance computing (HPC), data analysis, and software development while contributing to the advancement of computational research capabilities.

Requirements

You should demonstrate:

  • Strong programming skills, preferably in Python or a similar language relevant to system monitoring and performance analysis.
  • A basic understanding of high-performance computing (HPC) environments and GPU computing.
  • The ability to work independently and collaboratively in a research-focused environment.
  • A keen interest in computational research and performance optimization.

Qualifications:

  • Candidates must be pursuing a Master’s degree. Examples of relevant majors are computer science, computer engineering, or any major with a significant computational or programming component.
  • Excellent communication skills and the ability to work in a team environment.
  • Excellent problem-solving skills and creativity.

Education requirements

Currently Studying
Master’s

Area of Responsibilities

IT & Cybersecurity

Responsibilities

The intern will be responsible for developing and implementing a system to monitor GPU utilization across all jobs running on the Bridges-2 supercomputer. This includes generating alerts for researchers when their jobs underutilize the requested resources. The project comprises several key steps, as outlined below:

  • Framework Identification: Identify popular frameworks to establish a baseline for GPU utilization. This step is essential for understanding the current landscape and setting realistic benchmarks.
  • Base Code Development: Create base code examples for efficiently running jobs on Bridges-2, serving as templates for researchers.
  • Configuration Definition: Define the optimal number of nodes and GPUs for various types of jobs on the clusters.
  • Performance Benchmarking: Obtain baseline performance numbers using NGC containers and Bridges-2 modules, and compare them against Bridges-2-validation and reference NVIDIA performance numbers.
  • Automated Testing: Create unittest sbatch jobs or similar to automate job submissions and performance evaluations.
  • GPU Utilization Monitoring: Develop a method to measure GPU utilization automatically from running jobs and deploy this system to the real cluster environment.
  • Data Ingest Prototype: Implement a prototype Slurm burst-buffer configuration for data ingest to improve data handling efficiency.
  • Performance Comparison: Compare performance numbers for jobs run with the original data in Ocean, Jet, and any other specified locations, in order to optimize data transfer and processing speeds.
  • Network Optimization: Configure multiple InfiniBand (IB) interfaces on machines equipped with them and develop methods to easily measure InfiniBand throughput.
  • Performance Re-Evaluation: Re-run jobs under optimized conditions, including RDMA support if applicable, and compare the new performance metrics to the initial benchmarks.
Our internships offer the opportunity to gain:

  • Practical experience in monitoring and optimizing GPU utilization on one of the most popular supercomputers in the US.
  • Knowledge of high-performance computing (HPC) practices and challenges.
  • Skills in data analysis, software development, and system optimization.
  • The chance to contribute to the efficiency and effectiveness of computational research.

Details

Work type

Part-time

Work mode

Office

Location

Pittsburgh