Description:
• Conduct interviews with engineering teams to identify and remove operational friction in CI/CD, deployments, observability, and cloud environments.
• Design and implement scalable infrastructure-as-code patterns using Terraform to standardize provisioning and reduce configuration drift.
• Own and evolve the Kubernetes platform, including EKS or self-managed environments, so workloads are secure, scalable, and resilient.
• Architect and optimize CI/CD pipelines to improve deployment frequency, reduce lead time, and increase release confidence.
• Lead reliability initiatives such as incident response improvements, root cause analysis, and postmortem practices.
• Design and enforce secure networking, IAM, and secrets management strategies across environments.
• Improve observability through metrics, logs, and tracing using Datadog or similar tooling.
• Optimize cloud costs through rightsizing, autoscaling, and architectural improvements.
• Own disaster recovery planning, backup strategies, and multi-region resilience initiatives.
• Refactor manual or brittle infrastructure into automated, testable, reproducible systems and drive adoption through documentation and hands-on support.
Requirements:
• 8+ years of experience in DevOps, SRE, or Infrastructure Engineering roles.
• Proven experience designing and operating production Kubernetes environments at scale.
• Deep hands-on expertise with AWS infrastructure and cloud networking.
• Strong experience building and maintaining Terraform modules across large cloud environments.
• Demonstrated ownership of CI/CD systems and measurable improvement of DORA metrics.
• Experience leading incident response processes and driving meaningful postmortem outcomes.
• Strong understanding of distributed systems, event-driven architectures with Kafka, and database performance with PostgreSQL.
• Proven ability to modernize legacy infrastructure and eliminate manual operational toil.
• Experience navigating high-ambiguity environments and translating operational friction into prioritized infrastructure roadmaps.
• Nice to have: experience operating high-throughput Kafka clusters, tuning PostgreSQL or Redis, implementing autoscaling, building internal developer platforms, applying security best practices, working with multi-region systems, using Python for automation, or introducing SLO, error-budget, and chaos-testing frameworks.
• All remote hires must be able to travel to Orlando, Florida, at least twice per year, as well as for orientation.
Benefits:
• Health care plan including medical, dental, and vision coverage.
• Retirement plan with 401(k) and IRA options.
• Life insurance.
• Flexible vacation.
• Work-from-home option.
• Wellness resources.
• Free food and snacks in the office.
• Hybrid setup in Orlando, Florida.