Posted May 25, 2026

Senior Site Reliability Engineer (Fleet Management)

Requirements

• Have 6+ years of experience in software development and operating distributed systems
• Are proficient in Go, Python, or a similar language, with a strong commitment to code quality and testing practices (writing unit, integration, and E2E tests)
• Have deep experience using and extending containerization technologies, preferably Kubernetes
• Have a solid understanding of Linux operating system internals and networking concepts (e.g., filesystems, TCP/IP, DNS, TLS)
• Possess a customer-focused mindset, treating internal developers as your primary users
• Have strong operational ownership, including a track record of debugging complex production issues and driving them to resolution
• Prefer automation over manual processes ("allergic to ops work")
• We are a small team of software engineers with a strong bias toward building software solutions to eliminate toil
• (Desirable) Designing and implementing secure, multi-tenant runtime environments from first principles
• (Desirable) Proficiency with Kubernetes ecosystem tools such as Helm, Kustomize, Gatekeeper, Kyverno, and CRDs/Operators, CRI, CSI
• (Desirable) Expertise in cloud infrastructure platforms, including AWS, GCP, or Azure
• (Desirable) Proficiency in provisioning infrastructure using tools like Terraform, Crossplane, and AWS Controllers for Kubernetes (ACK)
• (Desirable) Advanced Linux systems internals and networking concepts specifically relevant to containers, such as namespaces and cgroups

What the job involves

• Platform Engineering is the department within SRE that is responsible for a range of critical infrastructure and operational functions that support the broader engineering organization
• Among these are our multi-cloud-provider Kubernetes infrastructure, networking, load balancing (including our public-facing edge and internal service mesh), and observability and alerting systems
• The Fleet Management team provides the core runtime environment that empowers our developers to build and ship products to delight our customers
• We manage the end-to-end lifecycle of our Kubernetes fleet, alongside the critical components that ensure cluster reliability and security (e.g., CoreDNS, cert-manager, and Gatekeeper)
• As our infrastructure scales to support new use cases and products, we are spearheading a migration from Terraform-based Infrastructure as Code (IaC) to an Operator-driven lifecycle management model
• Contribute to developing and maintaining a scalable and secure runtime environment on top of Kubernetes that supports product needs across MongoDB
• Provide internal support for our Kubernetes ecosystem, partnering with engineering teams to help them solve domain-specific problems
• Participate in a 24/7 on-call rotation to resolve critical issues
• Prioritize blameless post-mortems and dedicate engineering time to systemic fixes, ensuring you aren’t paged for the same issue twice