Job Description:
• Lead, design, implement and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging and alerting
• Engage in and improve the whole lifecycle of services—from inception and design through deployment, operation and refinement
• Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
• Maintain services once they are live by measuring and monitoring availability, latency and overall system health
• Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
• Practice sustainable incident response and blameless postmortems
• Be part of an on-call rotation to support production systems
Requirements:
• BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
• 16+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
• Experience in one or more of the following: Python, Go, Perl or Ruby
• In-depth knowledge of Linux, networking and containers
Benefits:
• Equity
• Benefits