Requirements
• Have 6+ years of experience in software development and operating distributed systems,
• Are proficient in Go, Python, or a similar language, with a strong commitment to code quality and testing practices (writing unit, integration, and E2E tests),
• Have deep experience using and extending containerization technologies, preferably Kubernetes,
• Have a solid understanding of Linux operating system internals and networking concepts (e.g., filesystems, TCP/IP, DNS, TLS),
• Possess a customer-focused mindset, treating internal developers as your primary users,
• Have strong operational ownership, including a track record of debugging complex production issues and driving them to resolution,
• Prefer automation over manual processes ("allergic to ops work"): we are a small team of software engineers with a strong bias toward building software solutions to eliminate toil,
• (Desirable) Experience designing and implementing secure, multi-tenant runtime environments from first principles,
• (Desirable) Proficiency with Kubernetes ecosystem tools such as Helm, Kustomize, Gatekeeper, and Kyverno, and with extension points such as CRDs/Operators, CRI, and CSI,
• (Desirable) Expertise in cloud infrastructure platforms, including AWS, GCP, or Azure,
• (Desirable) Proficiency in provisioning infrastructure using tools like Terraform, Crossplane, and AWS Controllers for Kubernetes (ACK),
• (Desirable) Advanced knowledge of Linux system internals and networking concepts specifically relevant to containers, such as namespaces and cgroups
What the job involves
• Platform Engineering is the department within SRE that is responsible for a range of critical infrastructure and operational functions that support the broader engineering organization,
• Among these are our multi-cloud-provider Kubernetes infrastructure, networking, load balancing (including our public-facing edge and internal service mesh), and observability and alerting systems,
• The Fleet Management team provides the core runtime environment that empowers our developers to build and ship products to delight our customers,
• We manage the end-to-end lifecycle of our Kubernetes fleet, alongside the critical components that ensure cluster reliability and security (e.g., CoreDNS, cert-manager, and Gatekeeper),
• As our infrastructure scales to support new use cases and products, we are spearheading a migration from Terraform-based Infrastructure as Code (IaC) to an Operator-driven lifecycle management model,
• Contribute to developing and maintaining a scalable and secure runtime environment on top of Kubernetes that supports product needs across MongoDB,
• Provide internal support for our Kubernetes ecosystem, partnering with engineering teams to help them solve domain-specific problems,
• Participate in a 24/7 on-call rotation to resolve critical issues,
• Prioritize blameless post-mortems and dedicate engineering time to systemic fixes, ensuring you aren’t paged for the same issue twice