Job Title: System Reliability Engineer/DevOps
Company Name: Growe
Job Url: https://job-boards.greenhouse.io/embed/job_app?for=growe&jr_id=69b9824156973837413eff95&token=4807529101&utm_source=jobright
Location: Anywhere

Job Description:

Growe welcomes those who are excited to:
Ensure availability, performance, and scalability of infrastructure and services through monitoring, automation, and operational best practices;
Lead incident response, perform root cause analysis, and implement recovery and long-term fixes;
Manage infrastructure using Terraform, Terragrunt, and automation tools for consistency and repeatability;
Implement and maintain metrics, logs, and tracing solutions (Prometheus, Grafana, Loki, VictoriaMetrics, CloudWatch) to ensure system visibility;
Identify bottlenecks, tune systems, and improve infrastructure performance;
Monitor resources, forecast growth, and implement scaling strategies;
Integrate security best practices into IaC, CI/CD pipelines, and deployments;
Support vulnerability management;
Participate in 24/7 rotations (once a week) for timely resolution of critical incidents;
Work with DevOps, PRE, development, and security teams to improve reliability and design resilient systems;
Maintain operational runbooks, incident reports, and system documentation.

We need your professional experience:
3+ years in a DevOps, SRE, or related role;
Strong hands-on experience with AWS services including EC2, ECS, EKS, RDS, DocumentDB, ElastiCache, Keyspaces, S3, EBS, VPC, Route53, KMS, ACM, and CloudWatch;
Proficiency with Terraform, Terragrunt, and Atlantis for reproducible and version-controlled infrastructure;
Experience with GitLab CI, FluxCD, Argo Rollouts, and automation tools (Ansible, Python, Bash);
Solid experience with Docker, Kubernetes (AWS EKS), and Helm (including custom templates, ChartMuseum);
Familiarity with cluster add-ons such as KEDA, VPA, Karpenter, External-DNS, ingress-nginx, aws-alb-controller, and ebs-csi-driver;
Hands-on experience with Grafana, VictoriaMetrics stack, Tempo, metrics exporters, Pingdom, AWS CloudWatch, and alerting systems like PagerDuty, VMAlert, and Alertmanager;
Proficiency with Grafana Loki, OpenSearch, and Vector Agent for centralized logging;
Strong understanding of networking concepts, AWS networking (VPC, Network Firewall, Transit Gateway, Site-to-Site VPN), identity and access management, certificate management (ACM, Vault, SOPS), and application security best practices;
Familiarity with Cloudflare services, including caching, DNS, and Workers;
Exposure to AWS Cost Explorer, KubeCost, and custom cost export tools;
Certifications: AWS, Terraform, Kubernetes, or Helm are a plus.

We appreciate it if you have these personal qualities:
Problem-Solving Mindset: Approaches complex issues methodically and finds practical solutions under pressure;
Analytical Thinking: Able to interpret metrics, logs, and system behavior to make informed decisions;
Attention to Detail: Ensures accuracy in infrastructure changes, configurations, and deployment processes;
Adaptability: Comfortable learning new tools and technologies and adjusting to changing environments;
Collaboration & Teamwork: Works effectively with cross-functional teams and communicates clearly;
Ownership & Responsibility: Takes accountability for tasks, incidents, and service reliability;
Continuous Learning: Keeps up to date with DevOps, SRE, cloud, and security best practices;
Effective Communication: Can explain technical concepts clearly to both technical and non-technical stakeholders.

We are seeking those who align with our core values:
GROWE TOGETHER: Our team is our main asset. We work together and support each other to achieve our common goals;
DRIVE RESULT OVER PROCESS: We set ambitious, clear, measurable goals that align with our strategy and drive Growe to success;
BE READY FOR CHANGE: We see challenges as opportunities to grow and evolve. We adapt today to win tomorrow.