Job Url: https://www.linkedin.com/jobs/search/?currentJobId=4342074749&distance=25&f_AL=true&f_TPR=r86400&f_WT=2&geoId=103644278&keywords=software%20engineer&origin=JOB_SEARCH_PAGE_JOB_FILTER&refresh=true&spellCorrectionEnabled=true&start=375

Job Description: Senior Data Engineer
Princeton University · Atlanta Metropolitan Area (Remote) · Contract
Job poster: Kristen DeCaires Gall, MPH, PMP (Research Operations and Program Development)

About the job

1. Overview
The Princeton School of Public and International Affairs seeks proposals for a full-time Senior Data Engineer to support the data engineering backbone of the Accelerator initiative. The engineer will work across our Databricks-based platform: designing new pipelines, improving existing ones, optimizing performance, and driving down compute and storage costs through architectural decisions. The role will also support maintenance and ongoing enhancement of large text-based datasets (~25 TB uncompressed), alongside our primary social-media ingestion pipelines (Telegram, YouTube, and others). The ideal partner will bring deep experience with Databricks, PySpark, Delta Lake, and large-scale ETL systems, with a proven track record of performance and cost optimization.

2. Objectives
- Provide end-to-end data engineering leadership for the Accelerator's Databricks ecosystem.
- Architect, build, and optimize scalable, cost-efficient pipelines for social media, Comscore, and future datasets.
- Establish durable engineering standards, monitoring, and documentation to support long-term sustainability.
- Improve platform performance, ensure data reliability, and support the needs of researchers and internal stakeholders.

3. Scope of Work

A. Pipeline Maintenance & Optimization
- Performance Tuning & Monitoring: Optimize pipelines to reduce compute costs and improve query speed.
- Schema & Metadata Management: Maintain schema consistency and keep documentation up to date.
- Data Refresh Operations: Support ingestion of new datasets, including staging, validation, and promotion.
- Data Validation & QA: Implement automated data quality checks (a sketch of this kind of check follows section 4).
- Documentation: Maintain runbooks, lineage diagrams, and operational dashboards (e.g., job health, cost, and runtime metrics).

B. Researcher & Stakeholder Support
- Respond to researcher tickets related to data access, derived datasets, transformation enhancements, or performance concerns.
- Prepare reference tables, specialized views, or dataset extracts as needed.
- Communicate all changes and updates through Slack, Confluence, and regular engineering meetings.

4. Deliverables
- Stable, scalable pipeline architecture across all Accelerator datasets.
- Improved performance and reliability of all pipelines, with measurable reductions in runtime and DBU consumption (an example tuning step is sketched below).
- Unified observability suite with job monitoring, lineage, and cost visibility.
- Updated documentation: runbooks, architecture diagrams, data dictionaries, and operational procedures.
- Comscore dataset reliably supported and refreshed, with documented validation steps.
- Quarterly technical report summarizing work completed, issues addressed, and strategic recommendations.
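To illustrate the kind of automated data quality check described under Data Validation & QA, here is a minimal PySpark sketch. The table and column names (accelerator.telegram_posts, post_id) are purely hypothetical; the actual tables and keys would come from the Accelerator platform.

    # Minimal automated data quality check sketch (PySpark).
    # Table and column names below are hypothetical.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    def run_basic_quality_checks(table_name: str, key_col: str) -> dict:
        """Row-count, null-key, and duplicate-key checks for one table."""
        df = spark.table(table_name)
        total = df.count()
        return {
            "row_count": total,
            # Null keys usually signal an upstream ingestion problem.
            "null_keys": df.filter(F.col(key_col).isNull()).count(),
            # Duplicate keys usually signal a bad merge or double ingestion.
            "duplicate_keys": total - df.select(key_col).distinct().count(),
        }

    results = run_basic_quality_checks("accelerator.telegram_posts", "post_id")
    assert results["null_keys"] == 0, f"null keys found: {results}"
    assert results["duplicate_keys"] == 0, f"duplicate keys found: {results}"

In practice a check like this would run as a scheduled job after each refresh, with failures surfaced to the operational dashboards named above.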
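The deliverable on runtime and DBU reduction typically comes down to table layout and maintenance. The sketch below shows one common pattern on Databricks: partition a large table by ingestion date, then periodically compact and co-locate files. All table and column names are assumptions for illustration.

    # Sketch of Delta Lake layout and maintenance for cost/performance tuning.
    # Runs on Databricks; table and column names are hypothetical.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Partition by ingestion date so incremental refreshes and date-bounded
    # queries read only the relevant files.
    (spark.table("accelerator.staging_youtube_comments")
        .write.format("delta")
        .mode("overwrite")
        .partitionBy("ingest_date")
        .saveAsTable("accelerator.youtube_comments"))

    # Periodic maintenance: compact small files and cluster rows by a
    # frequently filtered column to cut scan time (and therefore DBUs).
    spark.sql("OPTIMIZE accelerator.youtube_comments ZORDER BY (video_id)")

    # Reclaim storage from files no longer referenced by the table
    # (subject to the default retention window).
    spark.sql("VACUUM accelerator.youtube_comments")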
5. Required Skills
- Expert-level experience with Databricks, including Spark optimization, cluster configuration, Delta Lake internals, DLT, and Unity Catalog.
- Advanced software engineering skills: the ability to write production-quality, well-tested, modular, and maintainable code in Python/PySpark.
- Experience designing and implementing scalable data architectures, including schema evolution, metadata management, partitioning strategies, and high-performance table design (an upsert sketch with schema evolution follows section 7).
- Strong DevOps experience, including familiarity with CI/CD systems (GitHub Actions, Azure DevOps, GitLab CI, or equivalent), and the ability to create and maintain CI/CD pipelines for data workflows, automate deployments, package shared libraries, and improve engineering processes within the data platform.
- Demonstrated skill in cloud cost engineering and Databricks DBU optimization, including autoscaling strategies, caching logic, cluster policy design, and Delta Lake performance tuning.
- Ability to diagnose and resolve complex distributed pipeline failures, performance regressions, and schema inconsistencies.
- Excellent communication and documentation habits, including the ability to translate technical decisions for nontechnical stakeholders.
- Experience working with large datasets (10–100 TB range) with evolving schemas or complex ingestion patterns.

6. Success Metrics
- Reduction in compute and/or storage costs attributable to architectural and performance improvements (a cost-visibility sketch follows section 7).
- Consistent SLA compliance across all core pipelines.
- Comscore dataset refresh completed with validated data and minimal regression issues.
- Positive feedback from researchers and internal stakeholders regarding pipeline usability and responsiveness.
- Complete, maintainable documentation and observability tools in place.

7. Proposal Submission Guidelines
Interested vendors should submit:
- A brief company profile and relevant experience.
- Proposed approach and timeline.
- Key personnel and their qualifications.
- A budget estimate for the 3-month engagement.
- References from similar projects (if available).
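As an illustration of the schema-evolution and table-design experience listed under Required Skills, the sketch below shows an idempotent upsert into a Delta table with automatic schema merging. The staging and target table names and the join key are hypothetical.

    # Sketch of an idempotent upsert with schema evolution (Delta Lake on
    # Databricks). Table names and the record_id key are hypothetical.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Allow new columns arriving in the staging data to be added to the
    # target schema automatically during the merge.
    spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

    updates = spark.table("accelerator.staging_comscore")
    target = DeltaTable.forName(spark, "accelerator.comscore")

    # Upsert: update existing records, insert new ones. Re-running the same
    # batch produces the same result, which keeps refreshes safe to retry.
    (target.alias("t")
        .merge(updates.alias("s"), "t.record_id = s.record_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())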
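On cost visibility, one plausible starting point, assuming Unity Catalog system tables are enabled in the workspace, is to aggregate DBU usage from Databricks' system.billing.usage table. The query below is a sketch; the exact columns available should be confirmed against the target workspace.

    # Sketch: daily DBU consumption by SKU from Databricks system tables.
    # Assumes system.billing.usage is enabled; verify column names in the
    # target workspace before relying on this.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    dbu_by_day = spark.sql("""
        SELECT usage_date, sku_name, SUM(usage_quantity) AS dbus
        FROM system.billing.usage
        GROUP BY usage_date, sku_name
        ORDER BY usage_date DESC
    """)
    dbu_by_day.show()

A view like this can feed the unified observability suite and provide the before/after evidence for the cost-reduction success metric.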