Job Summary
As a PySpark Developer, you will be responsible for building and optimizing data pipelines that ingest massive datasets from Hadoop systems. Your primary focus will be on scanning dataset fields to detect Personally Identifiable Information (PII), integrating tokenization services for data anonymization, and ensuring high-performance query execution. This role requires expertise in big data technologies, Python, and Apache Spark, with a strong emphasis on scalability, efficiency, and data security.
Key Responsibilities
- Design, develop, and maintain PySpark-based ETL pipelines that read and process large volumes of data across multiple datasets stored in the Hadoop Distributed File System (HDFS).
- Scan and analyze fields across datasets to identify attributes containing PII, using pattern matching, rules-based logic, or machine learning-assisted detection where applicable (a rules-based scan is sketched after this list).
- Integrate and call external tokenization services to tokenize sensitive PII data for secure storage and processing, and to de-tokenize data when required for authorized access (a service-call pattern is sketched after this list).
- Optimize PySpark queries and data processing workflows to handle huge volumes of data efficiently, minimizing latency and resource consumption.
- Collaborate with data architects, security teams, and stakeholders to ensure compliance with data privacy regulations (e.g., GDPR, CCPA).
- Monitor and troubleshoot data pipeline performance, implementing best practices for partitioning, caching, and join optimizations in PySpark.
- Document code, processes, and data flows to support team knowledge sharing and maintainability.
- Participate in code reviews, testing, and deployment of data solutions in a CI/CD environment.
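To make the PII-scanning responsibility concrete, below is a minimal sketch of a rules-based scan: it samples each string column and flags columns whose sampled values match simple regex patterns. The patterns, sample fraction, match-rate threshold, and HDFS path are illustrative assumptions, not a prescribed implementation.

```python
# A minimal sketch of rules-based PII scanning, assuming regex patterns for
# emails and US SSNs. Sample fraction, threshold, and paths are illustrative.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pii-scan").getOrCreate()

PII_PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "us_ssn": r"\b\d{3}-\d{2}-\d{4}\b",
}

def scan_for_pii(df, sample_fraction=0.01, threshold=0.5):
    """Return {column: [pattern names]} for string columns whose sampled
    values match a pattern at or above the given rate."""
    sample = df.sample(fraction=sample_fraction, seed=42).cache()
    total = sample.count()
    findings = {}
    if total == 0:
        return findings
    string_cols = [f.name for f in df.schema.fields
                   if isinstance(f.dataType, StringType)]
    for col in string_cols:
        for name, pattern in PII_PATTERNS.items():
            hits = sample.filter(F.col(col).rlike(pattern)).count()
            if hits / total >= threshold:
                findings.setdefault(col, []).append(name)
    return findings

df = spark.read.parquet("hdfs:///data/customers")  # hypothetical path
print(scan_for_pii(df))
```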
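The tokenization integration typically reduces to calling the service from within a UDF. The sketch below assumes a hypothetical REST endpoint that accepts a JSON list of values and returns tokens in the same order; a production integration would add authentication, batching limits, retries, and error handling, and pandas UDFs require PyArrow on the cluster. De-tokenization for authorized access would follow the same pattern against the reverse endpoint.

```python
# A minimal sketch of calling an external tokenization service from a pandas
# UDF. The endpoint URL and payload shape are hypothetical; a real integration
# would add auth, batching limits, retries, and error handling.
import pandas as pd
import requests
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("tokenize").getOrCreate()

TOKENIZE_URL = "https://tokenizer.internal/api/v1/tokenize"  # hypothetical endpoint

@pandas_udf("string")
def tokenize_udf(values: pd.Series) -> pd.Series:
    # The service is assumed to accept a JSON list of plaintext values and
    # return {"tokens": [...]} in the same order.
    resp = requests.post(
        TOKENIZE_URL,
        json={"values": values.fillna("").tolist()},
        timeout=30,
    )
    resp.raise_for_status()
    return pd.Series(resp.json()["tokens"])

df = spark.read.parquet("hdfs:///data/customers")  # hypothetical path
tokenized = df.withColumn("ssn_token", tokenize_udf("ssn")).drop("ssn")
```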
Required Qualifications and Skills
- Bachelor's or Master's degree in Computer Science, Data Engineering, or a related field.
- 3+ years of hands-on experience with Apache Spark and PySpark for big data processing.
- Advanced proficiency in Python for data processing, scripting, and integration with Spark applications.
- Proven expertise in working with Hadoop ecosystems, including HDFS, YARN, and related tools.
- Strong understanding of data privacy concepts, including PII identification techniques (e.g., regex patterns, entity recognition).
- Experience integrating APIs or services for tokenization/de-tokenization (e.g., RESTful tokenization services, vault-based tools such as HashiCorp Vault's transform engine, or custom microservices).
- Deep knowledge of handling large-scale data volumes, including data partitioning, shuffling, and broadcast joins in Spark.
- Strong grasp of query optimization strategies, such as cost-based optimization, predicate pushdown, and tuning Spark configurations (e.g., executor memory, parallelism); see the tuning sketch after this list.
- Proficiency in SQL for data querying.
- Experience with version control systems (e.g., Git) and agile methodologies.
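As a concrete illustration of the optimization skills above, the sketch below combines three common moves: writing filters so predicate pushdown can prune the Parquet scan, broadcasting a small dimension table to avoid a shuffle join, and tuning shuffle parallelism. The configuration values, column names, and paths are illustrative, not prescriptive.

```python
# A minimal sketch of common PySpark tuning moves; values are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder
    .appName("optimized-pipeline")
    .config("spark.sql.shuffle.partitions", "400")  # sized to cluster parallelism
    .config("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))
    .getOrCreate()
)

# Filtering immediately after the read lets Spark push the predicate down
# into the Parquet scan, pruning row groups instead of reading everything.
events = (
    spark.read.parquet("hdfs:///data/events")  # hypothetical path
    .filter(F.col("event_date") >= "2024-01-01")
)

# Explicitly broadcasting a small dimension table avoids a shuffle join.
accounts = spark.read.parquet("hdfs:///data/accounts")  # hypothetical path
joined = events.join(F.broadcast(accounts), on="account_id", how="left")

# Repartition on the join/write key to control skew and output file sizes.
(
    joined.repartition(200, "account_id")
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("hdfs:///out/enriched")
)
```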
Preferred Qualifications
- Certifications in big data technologies (e.g., Databricks Certified Developer for Apache Spark, Cloudera Certified Data Engineer).
- Familiarity with cloud platforms like AWS, Azure, or GCP for big data processing.
- Knowledge of additional data security tools or frameworks (e.g., Apache Ranger, Kerberos for authentication).
- Experience with machine learning libraries in PySpark (e.g., MLlib) for advanced PII detection (a minimal example follows this list).
- Background in data governance or compliance roles.
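For the MLlib-based detection mentioned above, one simple approach is to classify field values by character n-grams. The sketch below assumes a labeled set of (value, label) examples exists; the tiny inline training rows are purely illustrative.

```python
# A minimal sketch of ML-assisted PII detection with MLlib, assuming a labeled
# training set is available; the inline rows here stand in for real data.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import RegexTokenizer, NGram, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pii-ml").getOrCreate()

train = spark.createDataFrame(
    [
        ("jane.doe@example.com", 1.0),
        ("order shipped on time", 0.0),
        ("555-01-2345", 1.0),
        ("blue widget, qty 3", 0.0),
    ],
    ["value", "label"],
)

pipeline = Pipeline(stages=[
    # Split each value into single characters, hash character trigrams into a
    # fixed-size feature vector, and fit a simple linear classifier on top.
    RegexTokenizer(inputCol="value", outputCol="chars", pattern=".", gaps=False),
    NGram(n=3, inputCol="chars", outputCol="trigrams"),
    HashingTF(inputCol="trigrams", outputCol="features", numFeatures=4096),
    LogisticRegression(maxIter=20),
])

model = pipeline.fit(train)
scored = model.transform(train).select("value", "prediction")
```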