Posted May 18, 2026

Distinguished Site Reliability Engineer – Cloud

Job Description:
• Lead, design, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
• Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement
• Support services before they go live through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
• Practice sustainable incident response and blameless postmortems
• Be part of an on-call rotation to support production systems

Requirements:
• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
• 16+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
• Experience in one or more of the following: Python, Go, Perl, or Ruby
• In-depth knowledge of Linux, networking, and containers

Benefits:
• Equity
• Benefits