Data Engineer

Aldea • Full-time • San Francisco, CA, US

About Aldea

Aldea is a multi-modal foundational AI company reimagining the scaling laws of intelligence. We believe today's architectures create unnecessary bottlenecks for the evolution of software. Our mission is to build the next generation of foundational models that power a more expressive, contextual, and intelligent human–machine interface.

The Role

We are hiring a Data Engineer to build the data infrastructure that powers Aldea's multi-modal AI research. You will design and scale data pipelines for pretraining, midtraining, and post-training at trillion-token scale, process diverse data sources across language and speech domains, and generate high-quality synthetic data for model training.

This is a high-impact role where your work directly determines training quality and efficiency. If you're passionate about building data systems that power cutting-edge AI research, this role is for you.

What You'll Do

Build and scale data pipelines for pretraining, midtraining, and post-training at trillion-token scale and beyond, across language and speech domains
Process and curate large-scale datasets, including cleaning, deduplication, quality filtering, and optimization for distributed training
Generate synthetic data for model training and evaluation across diverse tasks and domains
Design efficient data loading systems achieving high throughput across multi-node training clusters
Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
Collaborate with ML engineers and researchers to optimize pipelines and improve data quality

Minimum Qualifications

Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications
Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
Experience with data quality techniques including deduplication, filtering, and validation at scale
Proven ability to optimize data pipelines for performance and throughput in distributed systems
Experience working with large datasets (100 GB to 10 TB+) and an understanding of storage systems and data formats

Preferred Qualifications

Experience building data pipelines for LLM pretraining or large-scale ML training
Hands-on experience with synthetic data generation for language or speech models
Experience with text processing at scale: tokenization, deduplication (MinHash, LSH), and quality assessment
Familiarity with audio/speech data processing and dataset curation
Knowledge of data contamination detection and dataset versioning best practices
Experience optimizing data loaders for PyTorch or TensorFlow at scale
Understanding of distributed storage systems (S3, GCS, HDFS) and data streaming patterns

Compensation & Benefits

Competitive base salary
Performance-based bonus aligned with research and model milestones
Equity participation
Comprehensive health, dental, and vision coverage
Flexible paid time off

Aldea is proud to be an equal-opportunity employer. We are committed to building a diverse and inclusive culture that celebrates authenticity to win as one. We do not discriminate on the basis of race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, disability, protected veteran status, citizenship or immigration status, or any other legally protected characteristic.

Aldea uses E-Verify to confirm employment eligibility in compliance with federal law. For more information, please visit https://www.e-verify.gov.

Please note: We do not accept unsolicited resumes from recruiters or employment agencies and will not be responsible for any fees related to unsolicited resumes.