Primary Skills: Databricks, Python, PySpark, Airflow, Apache Spark
Location: San Jose, CA (Hybrid role; 3 days per week in the San Jose office.)
Duration: 5 months
Contract Type: W2 only
Responsibilities
- Design, develop, and maintain scalable and reliable data pipelines to support large-scale data processing.
- Build and optimize data workflows using orchestration tools such as Apache Airflow together with Apache Spark, supporting scheduled and event-driven ETL/ELT processes (see the Airflow sketch after this list).
- Implement complex parsing, cleansing, and transformation logic to normalize data from a variety of structured and unstructured sources.
- Collaborate with data scientists, analysts, and application teams to integrate, test, and validate data products and pipelines.
- Operate and maintain pipelines running on cloud platforms (AWS) and distributed compute environments (e.g., Databricks).
- Monitor pipeline performance, perform root cause analysis, and troubleshoot failures to ensure high data quality and uptime.
- Ensure proper security, compliance, and governance of data across systems and environments.
- Contribute to the automation and standardization of data engineering processes to improve development velocity and operational efficiency.
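As a rough illustration of the orchestration work described above, the sketch below shows a minimal Airflow 2.x DAG that runs a daily cleansing step via a Python task. The DAG id, task name, and placeholder job logic are hypothetical and stand in for whatever Databricks/PySpark jobs the real pipelines would invoke.

```python
# Minimal Airflow 2.x sketch of a scheduled ETL step; dag_id, task_id, and the
# cleansing logic are hypothetical placeholders, not an actual pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_cleansing_job(**context):
    # Placeholder for the real PySpark/Databricks job submission or transformation call.
    print(f"Running cleansing step for {context['ds']}")


with DAG(
    dag_id="example_daily_cleansing",   # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                  # scheduled run; event-driven triggers would use sensors instead
    catchup=False,
) as dag:
    cleanse = PythonOperator(
        task_id="cleanse_raw_data",
        python_callable=run_cleansing_job,
    )
```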
Required Skills
- 9-12 years of experience
- Proficient in Python and PySpark for data processing and scripting (see the PySpark sketch after this list).
- Strong experience with SQL for data manipulation and performance tuning.
- Deep understanding of distributed data processing with Apache Spark.
- Hands-on experience with Airflow or similar orchestration tools.
- Experience with cloud services and data tools in AWS (e.g., S3, Lambda, SQS, Gateway, Networking).
- Expertise with Databricks for collaborative data engineering and analytics.
- Solid understanding of data modeling, data warehousing, and best practices in data pipeline architecture.
- Strong problem-solving skills with the ability to work independently on complex tasks.
- Familiarity with CI/CD practices and version control (e.g., Git) in data engineering workflows.
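For concreteness, below is a minimal PySpark sketch of the kind of cleansing and transformation work the skills above refer to. The S3 paths, schema, and column names are assumptions for illustration only and are not taken from the job description.

```python
# Minimal PySpark sketch: normalize a raw events feed (illustrative only; the
# source path and column names are assumed, not part of the job description).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("example_normalize_events").getOrCreate()

raw = spark.read.json("s3://example-bucket/raw/events/")  # hypothetical S3 location

cleaned = (
    raw
    .withColumn("event_ts", F.to_timestamp("event_time"))   # parse string timestamps
    .withColumn("event_date", F.to_date("event_ts"))        # derive a partition column
    .withColumn("email", F.lower(F.trim(F.col("email"))))   # basic cleansing
    .dropDuplicates(["event_id"])                            # de-duplicate on a key
    .filter(F.col("event_ts").isNotNull())                   # drop unparseable rows
)

cleaned.write.mode("overwrite").partitionBy("event_date").parquet(
    "s3://example-bucket/curated/events/"                    # hypothetical output path
)
```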