Job Url: https://hiring.cafe/?searchState=%7B%22dateFetchedPastNDays%22%3A4%2C%22locations%22%3A%5B%7B%22id%22%3A%22FxY1yZQBoEtHp_8UEq7V%22%2C%22types%22%3A%5B%22country%22%5D%2C%22address_components%22%3A%5B%7B%22long_name%22%3A%22United+States%22%2C%22short_name%22%3A%22US%22%2C%22types%22%3A%5B%22country%22%5D%7D%5D%2C%22formatted_address%22%3A%22United+States%22%2C%22population%22%3A327167434%2C%22workplace_types%22%3A%5B%22Remote%22%5D%2C%22options%22%3A%7B%22flexible_regions%22%3A%5B%22anywhere_in_continent%22%2C%22anywhere_in_world%22%5D%7D%7D%5D%2C%22securityClearances%22%3A%5B%22None%22%5D%2C%22searchQuery%22%3A%22software+engineer%22%7D

Job Description:
Posted 2d ago
Principal SRE Engineer
@ Lifetouch
United States
$125k-$191k/yr
Remote
Full Time
Responsibilities:
Lead performance analysis; design monitoring and observability; collaborate on architecture and reliability
Requirements Summary:
8+ years in software engineering, SRE or DevOps; strong performance troubleshooting; cloud and edge network expertise; production-grade coding; AI/ML for SRE; excellent communication.
Technical Tools Mentioned:
Python, Go, Java, AWS, Splunk, Datadog, SignalFx, Prometheus, OpenTelemetry, Terraform, CloudFormation, Ansible, Chef, Puppet, CDN (Cloudflare/Akamai/CloudFront/Fastly), WAF, DDoS mitigation, Direct Connect, ExpressRoute, DNS, AIOps

Job Description

Software & Data Engineering 2025-2701
Description

At Shutterfly, we make life’s experiences unforgettable. We believe there is extraordinary power in self-expression. That’s why our family of brands helps customers create products and capture moments that reflect who they uniquely are.

Overview
Shutterfly LLC is undergoing a comprehensive consumer website re-platforming effort, with the Site Reliability Engineering (SRE) team at the forefront of establishing shared infrastructure and enabling future efficiencies and supportability.
 
The Principal SRE Engineer role is responsible for driving the reliability, availability, and performance of our consumer systems at scale. This role extends beyond operational stability—it requires deep expertise in performance troubleshooting, system optimization, and development practices to ensure that Shutterfly’s platforms are resilient, scalable, and cost-efficient.
 
As a senior technical leader within the SRE organization, the Principal SRE will influence architecture and design decisions, mentor senior engineers, and work closely with development and operations teams to solve complex reliability challenges through code, automation, and data-driven insights.
 
Responsibilities
•    Lead advanced performance analysis and troubleshooting across distributed systems, ensuring optimal availability, scalability, and cost efficiency.
•    Design, implement, and maintain monitoring, alerting, and observability solutions that provide proactive insight into application and infrastructure health.
•    Partner with development teams to influence system architecture, ensuring new features and services meet high standards for reliability, scalability, and performance.
•    Drive incident management improvements by analyzing root causes, identifying systemic issues, and implementing long-term reliability solutions.
•    Lead capacity planning and optimization efforts, aligning system performance with business growth and financial goals.
•    Develop automation and tooling that accelerates delivery, improves operational efficiency, and reduces human error.
•    Leverage AI and machine learning technologies to optimize SRE practices such as anomaly detection, predictive scaling, automated incident response, and intelligent alerting.
•    Act as a technical leader and mentor for the SRE team, providing guidance on troubleshooting methodologies, performance optimization, and coding best practices.
•    Collaborate across infrastructure, development, and business teams to establish enterprise standards and best practices.
•    Serve as a subject matter expert during critical incidents, providing leadership, technical depth, and decisive action.
•    Document best practices, solutions, and troubleshooting strategies for team knowledge sharing.
 
Qualifications
•    8+ years of combined experience in software engineering, SRE, or DevOps roles with direct accountability for large-scale, highly available systems.
•    Proven expertise in performance troubleshooting, system profiling, and root cause analysis for distributed applications.
•    Strong development background with proficiency in one or more programming languages (Python, Go, Java, or similar), with the ability to contribute production-quality code.
•    Deep experience with observability platforms (e.g., Splunk, Datadog, SignalFx, Prometheus, OpenTelemetry).
•    Advanced knowledge of AWS services, cost optimization strategies, and large-scale cloud deployments.
•    Extensive experience with edge networking technologies including CDN architecture and optimization (Cloudflare, Akamai, CloudFront, Fastly), Web Application Firewalls (WAF), and DDoS mitigation strategies.
•    Strong understanding of cloud networking concepts including VPC design, transit gateways, private connectivity (Direct Connect, ExpressRoute), DNS management, and global traffic routing strategies.
•    Hands-on experience optimizing edge performance through caching strategies, origin shielding, edge compute, and content delivery optimization for global user bases.
•    Hands-on experience with Infrastructure as Code (Terraform, CloudFormation) and configuration management tools (Ansible, Chef, Puppet).
•    Strong understanding of distributed systems concepts (scalability, high availability, fault tolerance).
•    Knowledge of application security at the edge including bot management, rate limiting, SSL/TLS optimization, and API gateway patterns.
•    Demonstrated leadership in technical projects, incident management, and cross-functional initiatives.
•    Experience applying AI/ML technologies to site reliability engineering (e.g., predictive analytics, automated anomaly detection, AIOps platforms).
•    Experience mentoring senior-level engineers and fostering a culture of continuous learning and improvement.
•    Excellent communication skills with the ability to articulate complex technical issues to both engineering and business audiences.
•    Bachelor’s degree in Computer Science, Engineering, or equivalent experience. Advanced degree or technical certifications preferred.
Supporting a diverse and inclusive workforce is important to Shutterfly not only because it directly reflects our value of Embracing our Differences, but also because it’s the right thing to do for our business and for our people. We welcome all applicants and evaluate them based on their qualifications. Learn more about our commitment to Diversity, Equity, and Inclusion on our Career Site.

This position will accept applications on an ongoing basis until filled.

The compensation package for this role is based on multiple factors, such as job level, responsibilities, location, and candidate experience. The base pay ranges included below are specific to the locations listed, and may not be applicable to other locations.

California: [$125,000-$190,750]

Connecticut and New York: [$125,000-$174,750]

Colorado, Illinois, Minnesota and Washington: [$125,000-$161,750]

Nevada: [$126,750-$174,750]

Maryland and New Jersey: [$145,500-$174,750]

Hawaii: [$126,750-$152,000]

This position may be eligible for a bonus incentive, health benefits, a 401K program, and other employee perks. More details about our company benefits can be found at https://shutterflyinc.com/benefits/.

This opportunity can be remote, but candidates must reside in a state in which Shutterfly is registered to do business. This includes all US states except North Dakota, Mississippi, Rhode Island, Vermont, and Wyoming; the District of Columbia is also excluded.
