Director of Site Reliability Engineering

Stellar Development Foundation

Remote · $210k–$310k · Head level · Full-time · Posted 1 month ago · Quality 8.6/10

Role in brief

The Stellar Development Foundation seeks a Director of Site Reliability Engineering to lead and mentor a team responsible for the Stellar blockchain's robust infrastructure. This role involves setting the SRE vision, defining team objectives, and ensuring system reliability. Ideal for experienced SRE leaders with a background in large-scale distributed systems and a passion for blockchain technology.

Site Reliability Engineering · Infrastructure · Operations · Kubernetes · Terraform · Ansible · Puppet

About the role

This Director role involves establishing the vision and mandate for the Site Reliability Engineering team at Stellar Development Foundation. You will define quarterly objectives, set up collaboration processes with development teams, and create career growth paths for individual contributors. A key part of your work will be to define and track engineering metrics, holding teams accountable for key performance indicators.

The position requires hands-on involvement in designing and building reliable, user-friendly infrastructure. You will monitor and troubleshoot production systems, participate in 24/7 on-call rotations, and mediate technical discussions. This includes reviewing pull requests and contributing code fixes when necessary, ensuring the continuous operation and improvement of the Stellar blockchain network.

Success in this role means keeping the Stellar blockchain running at high scale on robust, efficient services. You will lead efforts to maintain highly available infrastructure for large distributed systems and collaborate across the Stellar ecosystem, engaging with partners and advising on their integrations to support the network's growth and stability.

The base salary for this position ranges from $210,000 to $310,000, with additional lumen-denominated grants and comprehensive benefits.

Skills that matter here

Site Reliability Engineering: This role requires establishing the vision and mandate for the SRE team, defining OKRs, and setting up collaboration processes with development teams.
Infrastructure: You will design, build, and maintain reliable, highly available infrastructure for large distributed systems.
Operations: This position involves monitoring and troubleshooting systems in production, as well as defining and participating in 24/7 on-call rotations.
Kubernetes: Experience with building and maintaining infrastructure using Kubernetes is a requirement for this role.
Terraform: The role calls for experience with configuration management or Infrastructure as Code tooling such as Terraform.
Ansible: Experience with configuration management tools such as Ansible is necessary for this position.

Who this role suits

A leader who can define a clear vision and mandate for a technical team.
An experienced manager capable of coaching and mentoring individual contributors while defining career growth paths.
Someone adept at mediating technical discussions and reviewing code contributions.
A hands-on contributor willing to jump in with code fixes and troubleshooting when needed.

From the employer

In this role, you will:

Establish a clear vision and mandate for the Site Reliability Engineering team
Define the SRE team's quarterly OKRs to best align with the company's goals
Define processes of collaboration between SREs and development teams throughout the software development lifecycle
Define a career growth path for the SRE team, as well as coach and mentor individual contributors on the team
Define and track metrics across engineering and help hold engineering teams accountable for their KPIs
Coordinate priorities with other teams and areas of the organization
Participate in sprint planning and execution, track progress and oversee day-to-day tactical decisions
Design and build reliable systems and infrastructure that are easy for software engineers to use
Monitor and troubleshoot systems in production
Define and participate in 24/7 on-call rotations alongside the team
Mediate technical discussions and review PRs
Jump in as needed with code fixes, troubleshooting and hands-on contributions
Collaborate across the Stellar ecosystem, engaging with key partners and advising on their integration to set them up for success

You have:

3+ years of experience working as a Site Reliability Engineer
3+ years of experience managing an SRE team
Site Reliability Engineering experience:
  Strong track record of collaborating with dev teams at all stages of product development (design, development/CI, beta testing, production)
  Strong track record of collaborating on defining, measuring and driving improvements in KPIs
  Strong track record of assisting teams during Root Cause Analysis and postmortems
Infrastructure and Operations experience:
  Designing and building out the infrastructure for large distributed systems
  Maintaining highly available infrastructure
  Troubleshooting and understanding complex technical problems
  Using configuration management or IaC tooling such as Terraform, Ansible, Puppet
  Building and maintaining infrastructure using Kubernetes
Highly autonomous; able to find clarity in ambiguous circumstances
Excellent communicator; comfortable working with remote team members

We offer competitive pay with a base salary range for this position of $210,000 - $310,000 depending on job-related knowledge, skills, experience, and location. In addition, we offer lumen-denominated grants along with the following perks and benefits:

Competitive health, dental & vision coverage with most plans covered at 100% for the employee + any dependents
Flexible time off + 15 company holidays including a company-wide holiday break
Generous paid parental leave for all parents, plus paid pregnancy disability leave for birthing parents
Gym reimbursement ($80 per month)
Life & AD&D insurance (up to $50K)
Short- & long-term disability
401(k) with 4% match
Health & Dependent Care FSAs
Commuter benefits with $250/month employer contribution
Health Savings Account (HSA) with monthly employer contribution
Family building benefits through Kindbody
Wellbeing benefits (One Medical, Rightway, Headspace)
L&D budget of $1,500/year
Daily lunch and snacks in office
Company retreats

Questions about this role

What is the remote work policy for this position?

This is a fully remote position.

What level of seniority is expected for this role?

This is a head-level position, requiring significant experience in Site Reliability Engineering and team management.

What are the key technical skills required?

Key technical skills include Site Reliability Engineering, Infrastructure, Operations, Kubernetes, Terraform, Ansible, and Puppet.
