We are looking for a Data Engineer passionate about LLMs, VLMs, post-training, and reinforcement learning. You will design and implement scalable data systems that power dataset generation, filtering, and evaluation for model alignment and agentic reasoning. You’ll collaborate closely with our research and infrastructure teams to ship real systems that train the next generation of intelligent models.
Key Responsibilities
- Build and maintain scalable data pipelines for mid-training and post-training.
- Design high-throughput systems for data collection, deduplication, and quality measurement.
- Work with researchers to implement reward models, benchmarks, and feedback loops.
- Collaborate cross-functionally with infra and research teams to integrate new data modalities and tasks.
Qualifications
- Strong software engineering background.
- Experience with LLMs, RLHF/RLAIF, and/or post-training pipelines (SFT, DPO, PPO, etc.).
- Familiarity with modern data tooling (e.g., PySpark, Ray, Hugging Face Datasets, Arrow, Parquet).
- Comfort with large-scale data manipulation, storage, and retrieval.
- Understanding of data curation principles, filtering heuristics, and annotation workflows.
- (Bonus) Experience with training reward models.
- (Bonus) Experience with coding, tool-use, or agentic LLM datasets.
- (Bonus) Experience building and maintaining hybrid compute clusters (Kubernetes, Slurm).
What We Offer
- Work with a world-class, research-driven team shaping the future of data-centric AI.
- Early technical ownership and influence in a fast-moving, well-funded startup.
- Competitive compensation with equity.
- Hybrid flexibility (SF Bay Area preferred, remote considered).
- Impactful open-source contributions (papers, code) recognized by top research and industry labs.
Job Types: Full-time, Contract, Internship
Projected Total Compensation: $132,000 - $156,000 per year
Benefits:
- 401(k)
- Health insurance
- Vision insurance
Work Location: In person