Senior Site Reliability Engineer, Observability

Remote | $129k–$304k | Senior | Full-time | Posted 1 month ago | Quality 9/10

Role in brief

Chainlink Labs seeks a Senior Site Reliability Engineer focused on observability to build and maintain a GitOps environment. This role involves developing a modern OTEL-based observability platform, managing various telemetry types, and ensuring system reliability and performance. Candidates with a strong background in SRE, infrastructure, or platform teams, who can design and manage large real-time systems, should consider applying.

AWS, Terraform, Kubernetes, Prometheus, Grafana, GitHub Actions, Packer

About the role

This role centers on building and orchestrating a modern, OTEL-based observability platform, supporting diverse telemetry types such as metrics, logs, and traces. The work involves defining governance for observability at scale and ensuring that reliability, security, and performance consistently meet or exceed defined service level agreements. A key aspect is leading the design and deployment of monitoring services to proactively detect issues and alert the team.

The Senior SRE will collaborate with engineers across the company to troubleshoot problems, deploy new products, and enhance development velocity while reducing cognitive load. Responsibilities also include ingesting, aggregating, transforming, and utilizing data from multiple sources within a real-time data pipeline. The position oversees the availability, performance, and supportability of the observability infrastructure.

Success in this role means establishing robust processes for alert response operations and supporting the team to ensure reliable delivery of oracle data. The engineer will make recommendations for metric collection with every new feature release and champion reliability and security by focusing on doing work correctly from the outset. This involves a commitment to quality and proactive problem-solving.

The competitive salary for this role ranges from USD 129,000 to USD 304,000.

Skills that matter here

  • AWS: The role involves working within the AWS cloud environment for infrastructure and services.
  • Terraform: Terraform is used for infrastructure as code to manage and provision resources.
  • Kubernetes: Experience with Kubernetes is required for managing and deploying containerized applications and services.
  • Prometheus: Prometheus is utilized for exporting and collecting metrics to monitor system performance.
  • Grafana: Grafana is used for creating dashboards to visualize monitoring data and system health.
  • GitHub Actions: GitHub Actions is part of the GitOps environment, used for automation and continuous integration/delivery.

Who this role suits

  • A person who has spent at least seven years in DevOps, infrastructure, SRE, or platform teams.
  • Someone capable of developing software beyond typical infrastructure configurations.
  • An individual with expert knowledge in designing and managing large real-time systems.
  • A clear communicator who actively participates in planning meetings and code reviews, providing and receiving feedback.

From the employer

  • Build and orchestrate a modern OTEL-based observability platform.
  • Support multiple telemetry types, such as metrics, logs, and traces.
  • Define and support modern governance for observability at scale.
  • Ensure reliability, security, and performance exceed our defined SLAs.
  • Work with engineers from across the company to help troubleshoot issues, deploy new products and services, and increase velocity while decreasing cognitive load.
  • Lead the design and deployment of monitoring/observability services that detect issues and alert the team when action is needed.
  • Ingest, aggregate, transform, and utilize data from a multitude of sources in our real-time data pipeline.
  • Oversee the availability, performance, and supportability of our observability infrastructure.
  • Create processes around alert response operations and support the team to ensure the reliable delivery of oracle data.
  • Make recommendations to ensure sufficient metrics are collected to create alerts with every new feature release.
  • Champion reliability and security by taking the time to do your work right the first time.
  • 7+ years of relevant professional experience. You have probably worked on a DevOps, infrastructure, SRE, and/or platform team before.
  • Ability to develop software outside the scope of typical infrastructure requirements and configurations.
  • Experience programming in C, C++, Java, Python, Go, Perl, or Ruby.
  • Expert knowledge in all aspects of designing, developing, and managing large real-time systems.
  • Experience with monitoring and logging. You know how to export metrics using Prometheus, have built a Grafana dashboard or two, and have experience with a centralized logging solution like an ELK Stack, Splunk, or Grafana Stack.
  • Experience with distributed systems and container orchestration. You have maintained or even built Kubernetes clusters before and feel comfortable deploying completely new services on them.
  • Strong communication skills. You can give and receive constructive feedback, and you do not shy away from planning meetings and code reviews.
  • Competitive salary ranging from USD 129,000 to USD 304,000.
  • Opportunities for growth and learning in a remote environment.
  • Commitment to equal opportunity and support for diverse backgrounds.

Questions about this role

What is the remote work policy for this role?

This is a fully remote position, allowing candidates to work from various locations.

What level of seniority is expected for this position?

This is a senior-level position, requiring at least 7 years of relevant professional experience.

What programming languages are relevant for this role?

Experience programming in languages such as C, C++, Java, Python, Go, Perl, or Ruby is beneficial.

Before you apply

  • Legitimate employers never ask you to pay anything to apply or get hired.
  • Never share seed phrases or private keys. No real job needs them.
  • Do not install software ("test tasks", "trading tools", "video call clients") sent during hiring.
  • Check that the application page's domain really belongs to Chainlink Labs.