Job Title: Data Engineer
Company Name: Codebridge
Job URL: https://boards.greenhouse.io/embed/job_app?token=4663264006&utm_source=jobright&jr_id=69b4ce9406c1ba00c54651a7

Job Description:

You will:
- Design, develop, and implement robust ETL/ELT data pipelines for large-scale data ingestion, transformation, and storage.
- Ensure data quality, integrity, and governance by implementing validation techniques, data monitoring, and automated testing.
- Collaborate with cross-functional teams, including data scientists, analysts, platform engineers, and business stakeholders, to develop scalable and reusable data solutions.
- Automate deployments and testing using CI/CD pipelines with Git, Terraform, GitHub Actions, or Jenkins.
- Design and build custom data tools and abstractions to support analytics, machine learning, and real-time data processing.
- Work with DevOps and platform teams to establish efficient deployment and monitoring processes for internal and external data products.
- Develop and implement alerting, monitoring, and observability frameworks for data pipelines to ensure reliability and proactive issue resolution.
- Contribute to the data architecture and strategy, driving improvements in scalability, performance, and cost optimization.
- Stay up to date with emerging technologies and industry best practices to continuously enhance data engineering capabilities.

We are looking for people who have:
- 5+ years of hands-on experience in data engineering, with expertise in distributed data processing and big data frameworks (e.g., Apache Spark, Apache Iceberg, Trino, Apache Airflow, dbt, Dagster).
- Advanced programming skills in Scala or Python for data transformation and automation.
- Experience with real-time data streaming technologies such as Apache Flink, Spark Streaming, or Kafka.
- Strong experience in performance tuning for Spark and optimizing large-scale data workflows.
- Proficiency in SQL and database management, with hands-on experience in Massively Parallel Processing (MPP) databases such as Amazon Redshift, Snowflake, and Teradata.
- Familiarity with cloud-based data services (e.g., AWS RDS, DynamoDB) and containerized infrastructure (EKS, Docker, Kubernetes).
- Hands-on experience integrating DevOps and CI/CD practices in data engineering using GitHub Actions, Jenkins, or Terraform.
- Proven ability to build monitoring, alerting, and observability tools for data pipelines to ensure high availability and reliability.
- Experience in data mapping, validation, and testing frameworks to ensure accuracy and consistency.
- Exposure to machine learning/deep learning using PyTorch, TensorFlow, or Keras (PyTorch preferred).
- Exposure to machine learning workflows and familiarity with MLOps tools for model deployment and lifecycle management.
- A self-starter mindset with strong problem-solving skills, thriving in a fast-paced, agile environment.