Description
We are seeking a Staff DevOps Engineer to drive the reliability, scalability, and performance of our systems. In this role, you will architect and support a large-scale EKS environment powering internal applications and firmware development. You will collaborate with cross-functional teams to evolve our real-time monitoring, logging, and alerting solutions while optimizing the entire lifecycle of our services, from inception and design to deployment and refinement.
Responsibilities
- Resource Optimization: Oversee the lifecycle management of cloud resources, leveraging advanced orchestration techniques to improve efficiency and scalability.
- Observability & Monitoring: Optimize our observability and telemetry platforms for real-time performance monitoring, logging, and alerting, using tools like Prometheus, Grafana, and OpenTelemetry.
- Operational Excellence: Maintain and enhance systems post-deployment by monitoring system health, optimizing availability and latency, and ensuring operational reliability.
- Scalable Automation: Implement automation solutions to scale systems sustainably while driving improvements in reliability and deployment velocity.
- Incident Response: Participate in on-call rotations to support production systems, handle incidents with a sustainable response process, and perform blameless postmortems to refine workflows.
- Tooling & Platforms: Develop and maintain tools, platforms, and self-service frameworks with a user-centric approach to enhance internal team productivity and operational efficiency.
Qualifications
- Educational Background: Bachelor's degree in Computer Science, a related technical field, or equivalent experience.
- Infrastructure Expertise: 5+ years of experience with infrastructure automation, distributed systems, and production-grade private or public cloud systems.
- Observability & Telemetry: Proven track record in implementing and supporting observability platforms using tools like Grafana, Prometheus, and OpenTelemetry.
- Cloud & Kubernetes Knowledge: Deep understanding of Kubernetes (e.g., EKS), Argo CD, Crossplane, and multi-cloud platforms.
- Programming Skills: Proficiency in Python or Go for building automation and operational tools.
- Linux & Networking Proficiency: Expertise in Linux systems, networking concepts, and containerization technologies.
- Problem Solving & Ownership: A systematic approach to debugging and optimizing systems with a strong sense of ownership and attention to detail.