Student Researcher (Machine Learning Sys-US) - 2024 Start (PhD)

ByteDance

Mar 18

Job
Full-time
Entry Level
Data
San Jose

AI-generated summary

PhD student in distributed computing with knowledge of machine learning algorithms, PyTorch, CUDA, and programming languages such as C/C++ and Python. Experience with GPU-based high-performance computing, distributed training optimization, AI compiler stacks, large-scale systems, and CUDA programming is preferred. Graduating in December 2024 or later, with the intent to return to the degree program.
Conduct research to optimize machine learning systems, develop heterogeneous computing architecture, implement model-specific optimizations, and enhance efficiency for large scale distributed training jobs.

Requirements

Currently enrolled in a PhD program, with a strong grasp of distributed and parallel computing principles and knowledge of recent advances in computing, storage, networking, and hardware technologies.
Familiar with machine learning algorithms, platforms, and frameworks such as PyTorch and JAX.
Have a basic understanding of how GPUs and/or ASICs work.
Expert in at least one or two programming languages in a Linux environment: C/C++, CUDA, Python.
Preferred Qualifications:
Graduating in December 2024 or later, with the intent to return to the degree program after completing the position.
The following experiences will be a big plus:
GPU-based high-performance computing and RDMA high-performance networking (MPI, NCCL, ibverbs).
Distributed training framework optimizations such as DeepSpeed, FSDP, Megatron, GSPMD.
AI compiler stacks such as torch.fx, XLA and MLIR.
Large-scale data processing and parallel computing.
Experience designing and operating large-scale systems in cloud computing or machine learning.
In-depth experience with CUDA programming and performance tuning (CUTLASS, Triton).

Responsibilities

Research and develop our machine learning systems, including heterogeneous computing architecture, management, scheduling, and monitoring.
Manage cross-layer optimization across systems, AI algorithms, and hardware (GPU, ASIC) for machine learning.
Implement both general-purpose training framework features and model-specific optimizations (e.g., LLMs, diffusion models).
Improve efficiency and stability for extremely large-scale distributed training jobs.

FAQs

What is the duration of the Student Researcher opportunity at ByteDance?

The Student Researcher position provides flexibility in duration, time commitment, and location of work. The exact duration will be determined based on individual circumstances and will go beyond the constraints of a standard internship program.

ByteDance

Industry: Technology

Employees: 10,001+

Founded Year: 2012

Mission & Purpose

ByteDance is a global incubator of platforms at the cutting edge of commerce, content, entertainment, and enterprise services. Over 2.5bn people interact with ByteDance products, including TikTok. Creation is the core of ByteDance's purpose. Our products are built to help imaginations thrive. This is doubly true of the teams that make our innovations possible. Together, we inspire creativity and enrich life, a mission we work toward every day. At ByteDance, we create together and grow together. That's how we drive impact for ourselves, our company, and the users we serve. We are committed to building a safe, healthy, and positive online environment for all our users.