Description:
100% Remote
Job Title: Senior AI/ML Engineer – Large Language Model Pretraining (100B+ Parameters)
Role Overview
We are seeking Senior AI/ML Engineers with a PhD or Master's degree in Computer Science or a related field. You will lead the pretraining of massive LLMs (100B+ parameters), work that requires deep expertise in distributed training, large-scale optimization, and model architecture. This is a rare opportunity to work with petabyte-scale datasets and cutting-edge compute clusters in a high-impact environment.
Key Responsibilities
- Architect and implement large-scale training pipelines for LLMs with 100B+ parameters.
- Optimize distributed training performance across thousands of GPUs/TPUs.
- Collaborate with research scientists to translate experimental results into production-grade training runs.
- Manage and preprocess petabyte-scale datasets for pretraining.
- Apply state-of-the-art techniques in model parallelism and memory optimization, guided by scaling laws (see the back-of-the-envelope sketch after this list).
- Conduct rigorous benchmarking, profiling, and performance tuning.
- Contribute to the Client's research in LLM architecture, training stability, and efficiency.
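For a flavor of the scaling-law reasoning referenced above, here is a back-of-the-envelope sketch (illustrative only, not part of the role requirements). It assumes two widely used rules of thumb: training compute of roughly 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal budget of about 20 tokens per parameter; the GPU count and sustained throughput figures are hypothetical.

```python
# Back-of-the-envelope scaling-law estimate (illustrative; constants are rules of thumb).
# Assumes C ~= 6 * N * D training FLOPs and a compute-optimal budget of ~20 tokens
# per parameter; the GPU count and sustained throughput below are hypothetical.

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute in FLOPs."""
    return 6.0 * n_params * n_tokens

n_params = 100e9                    # 100B-parameter model
n_tokens = 20 * n_params            # ~2T tokens for compute-optimal training
total = training_flops(n_params, n_tokens)

sustained_per_gpu = 400e12          # hypothetical ~400 TFLOP/s sustained in bf16
n_gpus = 4096
days = total / (sustained_per_gpu * n_gpus) / 86400

print(f"total compute: {total:.2e} FLOPs")
print(f"~{days:.0f} days on {n_gpus} GPUs at {sustained_per_gpu/1e12:.0f} TFLOP/s each")
```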
Required Qualifications
- Advanced degree (PhD or Master's) in Computer Science, Machine Learning, or a related field from a university ranked in the global top 20 for Computer Science.
- 3+ years of hands-on experience with large-scale deep learning model training.
- Proven experience in pretraining models exceeding 10B parameters, preferably 100B+.
- Deep expertise in distributed training frameworks (DeepSpeed, Megatron-LM, PyTorch FSDP, Mesh TensorFlow, JAX on TPUs).
- Proficiency with parallelism strategies (data, tensor, pipeline) and mixed-precision training (see the sketch after this list).
- Experience with large-scale cloud or HPC environments (AWS, Azure, GCP, Slurm, Kubernetes, Ray).
- Strong skills in Python, CUDA, and performance optimization.
- Strong publication record in top-tier ML/AI venues (NeurIPS, ICML, ICLR, ACL, etc.) preferred.
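For context on the stack named above, the following is a minimal sketch of wrapping a toy module in PyTorch FSDP with bf16 mixed precision, assuming a launch via torchrun with one process per GPU. The model, dimensions, and dummy loss are placeholders; a production pretraining pipeline would add an auto-wrap policy, activation checkpointing, sharded checkpointing, and a real data loader.

```python
# Minimal FSDP + bf16 mixed-precision sketch (illustrative; toy model and dummy loss).
# Assumes a launch like: torchrun --nproc_per_node=<gpus> this_script.py
import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision

def main():
    dist.init_process_group("nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy stand-in for a transformer block; real runs would use an auto-wrap
    # policy so each block becomes its own FSDP unit.
    model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16).cuda()
    model = FSDP(
        model,
        mixed_precision=MixedPrecision(
            param_dtype=torch.bfloat16,   # compute in bf16
            reduce_dtype=torch.bfloat16,  # gradient reduction in bf16
            buffer_dtype=torch.bfloat16,
        ),
    )
    optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

    x = torch.randn(128, 8, 1024, device="cuda")  # (seq, batch, d_model)
    loss = model(x).float().pow(2).mean()         # dummy loss for illustration
    loss.backward()
    optim.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```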
Preferred Skills
- Experience with LLM fine-tuning (RLHF, LoRA, PEFT); a minimal LoRA sketch follows this list.
- Familiarity with tokenizer development and multilingual pretraining.
- Knowledge of scaling laws and model evaluation frameworks for massive LLMs.
- Hands-on work with petabyte-scale distributed storage systems.
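To illustrate the LoRA technique mentioned under fine-tuning, here is a minimal plain-PyTorch sketch: a frozen linear layer augmented with a trainable low-rank update B·A scaled by alpha/r. The class name, rank, and sizes are illustrative, and real projects would typically use a library such as Hugging Face PEFT.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: update starts at 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha/r) * B A x, with only A and B receiving gradients
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

# Usage: wrap an existing projection and train only the LoRA parameters.
layer = LoRALinear(nn.Linear(1024, 1024), r=8, alpha=16)
out = layer(torch.randn(4, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")   # 2 * 8 * 1024 = 16384
```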