Job Title: Senior Data Engineer (Streaming + Batch)
Company Name: PraxisPro Inc
Job URL: https://jobs.gusto.com/postings/praxispro-inc-senior-data-engineer-033a0022-d8e0-44b8-b6ab-03cb24d0a558

About PraxisPro Inc
PraxisPro, a data intelligence company, is on a mission to heal the fractured state of Life Sciences commercial data by surfacing undocumented and previously inaccessible datasets to drive novel commercial intelligence and improve patient outcomes. PraxisPro has begun this work with the industry's first purpose-built Learning Experience Platform (LXP). By serving as the commercial intelligence backbone across commercial, medical, and compliance functions, our industry-specific AI models for therapeutic areas and disease states form the foundation for a new standard of commercial intelligence: one that enables disciplined execution at scale while allowing Life Sciences organizations to focus on what matters most, advancing patient outcomes.

Description
Location: Hybrid (San Francisco Bay Area or New York Metropolitan Area)

The Role
We're looking for a Senior Data Engineer (5-7 years of experience) who is fluent in both streaming and batch paradigms on AWS or GCP. You'll design and operate data platforms that power analytics, personalization, and recommendation use cases, partnering closely with ML engineers to move models from notebooks to production.

What You'll Do
- Design & build pipelines: Low-latency streaming and reliable batch ETL/ELT for multi-tenant datasets across AWS or GCP.
- Own data quality: Implement contracts, validation, observability, lineage, backfills, and SLAs/SLOs.
- Operationalize ML: Productionize features, embeddings, and model I/O for personalization/recommendation (feature stores, real-time inference paths, batch retraining).
- Model the warehouse/lake: Create well-governed schemas (e.g., medallion/lakehouse patterns) to support BI and experimentation.
- Harden & scale: Optimize cost/performance; implement autoscaling, partitioning, compaction, and tiered storage; champion reliability and incident response.
- Security & compliance: Build with least-privilege IAM, encryption, PII handling, and auditability aligned to SOC 2 and healthcare data expectations.
- Collaborate: Partner with product, ML, and app teams; contribute to the data platform roadmap and coding standards.

Required Qualifications
- 5-7 years building and running production streaming + batch data pipelines.
- Cloud: Expertise in AWS (Kinesis/MSK, Glue/EMR, Lambda, S3, Redshift) or GCP (Pub/Sub, Dataflow/Dataproc, GCS, BigQuery).
- Polyglot engineering: Strong hands-on Python plus one or more of Scala/Java/Go/TypeScript.
- Distributed processing: Solid with Spark/Flink/Beam and related performance tuning (checkpointing, state, watermarking).
- Orchestration & ELT: Airflow/Dagster and dbt or equivalent; CI/CD for data (tests, contracts).
- ML-adjacent experience: Shipping data features for personalization/recommendations (e.g., candidate generation, ranking features, user/item embeddings, offline/online consistency).
- Data foundations: Schema design, partitioning, CDC, late/duplicate data handling, idempotency, backfills.
- Reliability: Monitoring/alerting, on-call familiarity, cost/performance optimization.
- Communication: Clear written and spoken communication across engineering and product stakeholders.

Nice to Have
- Feature stores (e.g., Feast), vector DBs (Qdrant, Pinecone, FAISS), or real-time retrieval for recommendations.
- Event bus & contract tooling (Kafka + Protobuf/Avro, schema registry).
- Data governance/lineage (OpenLineage, DataHub, Collibra, or similar).
- MLOps/model serving (Vertex AI, SageMaker, Ray Serve, Triton, custom microservices).
- Infra-as-code (Terraform/CDK), containers (Docker/Kubernetes/ECS/GKE).
- Experience with regulated data (HIPAA-adjacent), multi-tenant SaaS, and privacy-preserving analytics.
- Experimentation/platform work for ranking systems (A/B testing, counterfactual logging).

Our Current Stack (illustrative)
- AWS: S3, Kinesis/MSK, Glue/EMR, Lambda, Redshift
- GCP: GCS, Pub/Sub, Dataflow/Dataproc, BigQuery
- Processing: Spark, Flink, Beam
- Transform: dbt
- Orchestration: Airflow or Dagster
- Contracts/Observability: Great Expectations/Deequ, OpenLineage
- Serving: REST/gRPC services, model inference endpoints
- Storage: Postgres, Redis, vector stores (e.g., Qdrant)

Work Style & Hours
Remote-first U.S. team; preference for Pacific Time overlap (West Coast strongly preferred). Collaboration via docs, async updates, and crisp incident/ops playbooks.

PraxisPro is an Equal Opportunity Employer. We celebrate diversity and are committed to an inclusive environment. We consider qualified applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability.

Salary
$190,000 - $250,000 per year