Job Url: https://www.linkedin.com/jobs/search/?currentJobId=4342074749&distance=25&f_AL=true&f_TPR=r86400&f_WT=2&geoId=103644278&keywords=software%20engineer&origin=JOB_SEARCH_PAGE_JOB_FILTER&refresh=true&spellCorrectionEnabled=true&start=375

Job Description: Senior Data Engineer
Princeton University · Atlanta Metropolitan Area (Remote) · Contract
Job poster: Kristen DeCaires Gall, MPH, PMP (Research Operations and Program Development)

About the job

1. Overview
The Princeton School of Public and International Affairs seeks proposals for a full-time Senior Data Engineer to support the data engineering backbone of the Accelerator initiative. The engineer will work across our Databricks-based platform: designing new pipelines, improving existing ones, optimizing performance, and driving down compute and storage costs through architectural decisions. The role will also support maintenance and ongoing enhancement of large text-based datasets (~25 TB uncompressed), alongside our primary social-media ingestion pipelines (Telegram, YouTube, and others). The ideal partner will bring deep experience with Databricks, PySpark, Delta Lake, and large-scale ETL systems, with a proven track record of performance and cost optimization.

2. Objectives
- Provide end-to-end data engineering leadership for the Accelerator's Databricks ecosystem.
- Architect, build, and optimize scalable, cost-efficient pipelines for social media, Comscore, and future datasets.
- Establish durable engineering standards, monitoring, and documentation to support long-term sustainability.
- Improve platform performance, ensure data reliability, and support the needs of researchers and internal stakeholders.

3. Scope of Work

A. Pipeline Maintenance & Optimization
- Performance Tuning & Monitoring: Optimize pipelines to reduce compute costs and improve query speed.
- Schema & Metadata Management: Maintain schema consistency and keep documentation up to date.
- Data Refresh Operations: Support ingestion of new datasets, including staging, validation, and promotion.
- Data Validation & QA: Implement automated data quality checks (a sketch of this kind of check follows section 4).
- Documentation: Maintain runbooks, lineage diagrams, and operational dashboards (e.g., job health, cost, and runtime metrics).

B. Researcher & Stakeholder Support
- Respond to researcher tickets related to data access, derived datasets, transformation enhancements, or performance concerns.
- Prepare reference tables, specialized views, or dataset extracts as needed.
- Communicate all changes and updates through Slack, Confluence, and regular engineering meetings.

4. Deliverables
- Stable, scalable pipeline architecture across all Accelerator datasets.
- Improved performance and reliability of all pipelines, with measurable reductions in runtime and DBU consumption (an example tuning step is sketched below).
- Unified observability suite with job monitoring, lineage, and cost visibility.
- Updated documentation: runbooks, architecture diagrams, data dictionaries, and operational procedures.
- Comscore dataset reliably supported and refreshed, with documented validation steps.
- Quarterly technical report summarizing work completed, issues addressed, and strategic recommendations.
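To illustrate the kind of automated data quality check described under Data Validation & QA, here is a minimal PySpark sketch. The table and column names (accelerator.telegram_posts, post_id) are purely hypothetical; the actual tables and keys would come from the Accelerator platform.

    # Minimal automated data quality check sketch (PySpark).
    # Table and column names below are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    def run_basic_quality_checks(table_name: str, key_col: str) -> dict:
        """Row-count, null-key, and duplicate-key checks for one table."""
        df = spark.table(table_name)
        total = df.count()
        return {
            "row_count": total,
            # Null keys usually signal an upstream ingestion problem.
            "null_keys": df.filter(F.col(key_col).isNull()).count(),
            # Duplicate keys usually signal a bad merge or double ingestion.
            "duplicate_keys": total - df.select(key_col).distinct().count(),
        }

    results = run_basic_quality_checks("accelerator.telegram_posts", "post_id")
    assert results["null_keys"] == 0, f"null keys found: {results}"
    assert results["duplicate_keys"] == 0, f"duplicate keys found: {results}"

In practice a check like this would run as a scheduled job after each refresh, with failures surfaced to the operational dashboards named above.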
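The deliverable on runtime and DBU reduction typically comes down to table layout and maintenance. The sketch below shows one common pattern on Databricks: partition a large table by ingestion date, then periodically compact and co-locate files. All table and column names are assumptions for illustration.

    # Sketch of Delta Lake layout and maintenance for cost/performance tuning.
    # Runs on Databricks; table and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Partition by ingestion date so incremental refreshes and date-bounded
    # queries read only the relevant files.
    (spark.table("accelerator.staging_youtube_comments")
        .write.format("delta")
        .mode("overwrite")
        .partitionBy("ingest_date")
        .saveAsTable("accelerator.youtube_comments"))

    # Periodic maintenance: compact small files and cluster rows by a
    # frequently filtered column to cut scan time (and therefore DBUs).
    spark.sql("OPTIMIZE accelerator.youtube_comments ZORDER BY (video_id)")

    # Reclaim storage from files no longer referenced by the table
    # (subject to the default retention window).
    spark.sql("VACUUM accelerator.youtube_comments")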
5. Required Skills
- Expert-level experience with Databricks, including Spark optimization, cluster configuration, Delta Lake internals, DLT, and Unity Catalog.
- Advanced software engineering skills: the ability to write production-quality, well-tested, modular, and maintainable code in Python/PySpark.
- Experience designing and implementing scalable data architectures, including schema evolution, metadata management, partitioning strategies, and high-performance table design (an upsert sketch with schema evolution follows section 7).
- Strong DevOps experience, including familiarity with CI/CD systems (GitHub Actions, Azure DevOps, GitLab CI, or equivalent), and the ability to create and maintain CI/CD pipelines for data workflows, automate deployments, package shared libraries, and improve engineering processes within the data platform.
- Demonstrated skill in cloud cost engineering and Databricks DBU optimization, including autoscaling strategies, caching logic, cluster policy design, and Delta Lake performance tuning.
- Ability to diagnose and resolve complex distributed pipeline failures, performance regressions, and schema inconsistencies.
- Excellent communication and documentation habits, including the ability to translate technical decisions for nontechnical stakeholders.
- Experience working with large datasets (10–100 TB range) with evolving schemas or complex ingestion patterns.

6. Success Metrics
- Reduction in compute and/or storage costs attributable to architectural and performance improvements (a cost-visibility sketch follows section 7).
- Consistent SLA compliance across all core pipelines.
- Comscore dataset refresh completed with validated data and minimal regression issues.
- Positive feedback from researchers and internal stakeholders regarding pipeline usability and responsiveness.
- Complete, maintainable documentation and observability tools in place.

7. Proposal Submission Guidelines
Interested vendors should submit:
- A brief company profile and relevant experience.
- Proposed approach and timeline.
- Key personnel and their qualifications.
- A budget estimate for the 3-month engagement.
- References from similar projects (if available).
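As an illustration of the schema-evolution and table-design experience listed under Required Skills, the sketch below shows an idempotent upsert into a Delta table with automatic schema merging. The staging and target table names and the join key are hypothetical.

    # Sketch of an idempotent upsert with schema evolution (Delta Lake on
    # Databricks). Table names and the record_id key are hypothetical.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Allow new columns arriving in the staging data to be added to the
    # target schema automatically during the merge.
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

    updates = spark.table("accelerator.staging_comscore")
    target = DeltaTable.forName(spark, "accelerator.comscore")

    # Upsert: update existing records, insert new ones. Re-running the same
    # batch produces the same result, which keeps refreshes safe to retry.
    (target.alias("t")
        .merge(updates.alias("s"), "t.record_id = s.record_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())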
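On cost visibility, one plausible starting point, assuming Unity Catalog system tables are enabled in the workspace, is to aggregate DBU usage from Databricks' system.billing.usage table. The query below is a sketch; the exact columns available should be confirmed against the target workspace.

    # Sketch: daily DBU consumption by SKU from Databricks system tables.
    # Assumes system.billing.usage is enabled; verify column names in the
    # target workspace before relying on this.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    dbu_by_day = spark.sql("""
        SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
        FROM system.billing.usage
        GROUP BY usage_date, sku_name
        ORDER BY usage_date DESC
    """)
    dbu_by_day.show()

A view like this can feed the unified observability suite and provide the before/after evidence for the cost-reduction success metric.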