Description:
100% Remote
Job Title: Senior AI/ML Engineer – Large Language Model Pretraining (100B+ Parameters)
Role Overview
We are seeking Senior AI/ML Engineers with a PhD or Master's degree in Computer Science or a related field. You will lead the pretraining of massive LLMs (100B+ parameters), work that requires deep expertise in distributed training, large-scale optimization, and model architecture. This is a rare opportunity to work with petabyte-scale datasets and cutting-edge compute clusters in a high-impact environment.
Key Responsibilities
- Architect and implement large-scale training pipelines for LLMs with 100B+ parameters.
- Optimize distributed training performance across thousands of GPUs/TPUs.
- Collaborate with research scientists to translate experimental results into production-grade training runs.
- Manage and preprocess petabyte-scale datasets for pretraining.
- Apply state-of-the-art techniques in model parallelism and memory optimization, guided by scaling laws (see the back-of-the-envelope sketch after this list).
- Conduct rigorous benchmarking, profiling, and performance tuning.
- Contribute to the Client's research in LLM architecture, training stability, and efficiency.
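For a flavor of the scaling-law reasoning referenced above, here is a back-of-the-envelope sketch (illustrative only, not part of the role requirements). It assumes two widely used rules of thumb: training compute of roughly 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal budget of about 20 tokens per parameter; the GPU count and sustained throughput figures are hypothetical.

```python
# Back-of-the-envelope scaling-law estimate (illustrative; constants are rules of thumb).
# Assumes C ~= 6 * N * D training FLOPs and a compute-optimal budget of ~20 tokens
# per parameter; the GPU count and sustained throughput below are hypothetical.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * n_params * n_tokens

n_params = 100e9                    # 100B-parameter model
n_tokens = 20 * n_params            # ~2T tokens for compute-optimal training
total = training_flops(n_params, n_tokens)

sustained_per_gpu = 400e12          # hypothetical ~400 TFLOP/s sustained in bf16
n_gpus = 4096
days = total / (sustained_per_gpu * n_gpus) / 86400

print(f"total compute: {total:.2e} FLOPs")
print(f"~{days:.0f} days on {n_gpus} GPUs at {sustained_per_gpu/1e12:.0f} TFLOP/s each")
```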
Required Qualifications
- Advanced degree (PhD or Master's) in Computer Science, Machine Learning, or a related field from a university ranked in the global top 20 for Computer Science.
- 3+ years of hands-on experience with large-scale deep learning model training.
- Proven experience in pretraining models exceeding 10B parameters, preferably 100B+.
- Deep expertise in distributed training frameworks (DeepSpeed, Megatron-LM, PyTorch FSDP, Mesh TensorFlow, JAX on TPUs).
- Proficiency with parallelism strategies (data, tensor, pipeline) and mixed-precision training (see the sketch after this list).
- Experience with large-scale cloud or HPC environments (AWS, Azure, GCP, Slurm, Kubernetes, Ray).
- Strong skills in Python, CUDA, and performance optimization.
- Strong publication record in top-tier ML/AI venues (NeurIPS, ICML, ICLR, ACL, etc.) preferred.
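For context on the stack named above, the following is a minimal sketch of wrapping a toy module in PyTorch FSDP with bf16 mixed precision, assuming a launch via torchrun with one process per GPU. The model, dimensions, and dummy loss are placeholders; a production pretraining pipeline would add an auto-wrap policy, activation checkpointing, sharded checkpointing, and a real data loader.

```python
# Minimal FSDP + bf16 mixed-precision sketch (illustrative; toy model and dummy loss).
# Assumes a launch like: torchrun --nproc_per_node=<gpus> this_script.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a transformer block; real runs would use an auto-wrap
    # policy so each block becomes its own FSDP unit.
    model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,   # compute in bf16
            reduce_dtype=torch.bfloat16,  # gradient reduction in bf16
            buffer_dtype=torch.bfloat16,
        ),
    )
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(128, 8, 1024, device="cuda")  # (seq, batch, d_model)
    loss = model(x).float().pow(2).mean()         # dummy loss for illustration
    loss.backward()
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```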
Preferred Skills
- Experience with LLM fine-tuning (RLHF, LoRA, PEFT); a minimal LoRA sketch follows this list.
- Familiarity with tokenizer development and multilingual pretraining.
- Knowledge of scaling laws and model evaluation frameworks for massive LLMs.
- Hands-on work with petabyte-scale distributed storage systems.
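To illustrate the LoRA technique mentioned under fine-tuning, here is a minimal plain-PyTorch sketch: a frozen linear layer augmented with a trainable low-rank update B·A scaled by alpha/r. The class name, rank, and sizes are illustrative, and real projects would typically use a library such as Hugging Face PEFT.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: update starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha/r) * B A x, with only A and B receiving gradients
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

# Usage: wrap an existing projection and train only the LoRA parameters.
layer = LoRALinear(nn.Linear(1024, 1024), r=8, alpha=16)
out = layer(torch.randn(4, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")   # 2 * 8 * 1024 = 16384
```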