About Aldea
Aldea is a multi-modal foundational AI company reimagining the scaling laws of intelligence. We believe today's architectures create unnecessary bottlenecks for the evolution of software. Our mission is to build the next generation of foundational models that power a more expressive, contextual, and intelligent human–machine interface.
The Role
We are hiring a Data Engineer to build the data infrastructure that powers Aldea's multi-modal AI research. You will design and scale data pipelines for pretraining, midtraining, and post-training at trillion-token scale, process diverse data sources across language and speech domains, and generate high-quality synthetic data for model training.
This is a high-impact role where your work directly determines training quality and efficiency. If you're passionate about building data systems that power cutting-edge AI research, this role is for you.
What You'll Do
- Build and scale data pipelines for pretraining, midtraining, and post-training at trillion-token scale and beyond, across language and speech domains
- Process and curate large-scale datasets, including cleaning, deduplication, quality filtering, and optimization for distributed training
- Generate synthetic data for model training and evaluation across diverse tasks and domains
- Design efficient data loading systems achieving high throughput across multi-node training clusters
- Build data versioning and reproducibility systems to track dataset compositions and enable reproducible experiments
- Collaborate with ML engineers and researchers to optimize pipelines and improve data quality
Minimum Qualifications
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent practical experience
- 3+ years of experience building large-scale data pipelines for machine learning or data-intensive applications
- Strong programming skills in Python and experience with data processing frameworks (Spark, Dask, Ray, or similar)
- Experience with data quality techniques, including deduplication, filtering, and validation at scale
- Proven ability to optimize data pipelines for performance and throughput in distributed systems
- Experience working with large datasets (100 GB to 10 TB+) and understanding of storage systems and data formats
Preferred Qualifications
- Experience building data pipelines for LLM pretraining or large-scale ML training
- Hands-on experience with synthetic data generation for language or speech models
- Experience with text processing at scale: tokenization, deduplication (MinHash, LSH), and quality assessment
- Familiarity with audio/speech data processing and dataset curation
- Knowledge of data contamination detection and dataset versioning best practices
- Experience optimizing data loaders for PyTorch or TensorFlow at scale
- Understanding of distributed storage systems (S3, GCS, HDFS) and data streaming patterns
Compensation & Benefits
- Competitive base salary
- Performance-based bonus aligned with research and model milestones
- Equity participation
- Comprehensive health, dental, and vision coverage
- Flexible paid time off
Aldea is proud to be an equal-opportunity employer. We are committed to building a diverse and inclusive culture that celebrates authenticity to win as one. We do not discriminate on the basis of race, religion, color, national origin, gender, gender identity, sexual orientation, age, marital status, disability, protected veteran status, citizenship or immigration status, or any other legally protected characteristic.
Aldea uses E-Verify to confirm employment eligibility in compliance with federal law. For more information, please visit: https://www.e-verify.gov
Please note: We do not accept unsolicited resumes from recruiters or employment agencies and will not be responsible for any fees related to unsolicited resumes.