Director of Site Reliability Engineering
Role in brief
The Stellar Development Foundation seeks a Director of Site Reliability Engineering to lead and mentor a team responsible for the Stellar blockchain's robust infrastructure. This role involves setting the SRE vision, defining team objectives, and ensuring system reliability. Ideal for experienced SRE leaders with a background in large-scale distributed systems and a passion for blockchain technology.
About the role
This Director role involves establishing the vision and mandate for the Site Reliability Engineering team at Stellar Development Foundation. You will define quarterly objectives, set up collaboration processes with development teams, and create career growth paths for individual contributors. A key part of your work will be to define and track engineering metrics, holding teams accountable for key performance indicators.
The position requires hands-on involvement in designing and building reliable, user-friendly infrastructure. You will monitor and troubleshoot production systems, participate in 24/7 on-call rotations, and mediate technical discussions. This includes reviewing pull requests and contributing code fixes when necessary, ensuring the continuous operation and improvement of the Stellar blockchain network.
Success in this role means ensuring the Stellar blockchain operates at scale with robust and efficient services. You will lead efforts to maintain highly available infrastructure for large distributed systems, collaborating across the Stellar ecosystem. This involves engaging with partners and advising on integrations to support the network's growth and stability.
The base salary for this position ranges from $210,000 to $310,000, with additional lumen-denominated grants and comprehensive benefits.
Skills that matter here
- Site Reliability Engineering: This role requires establishing the vision and mandate for the SRE team, defining OKRs, and setting up collaboration processes with development teams.
- Infrastructure: You will design, build, and maintain reliable, highly available infrastructure for large distributed systems.
- Operations: This position involves monitoring and troubleshooting systems in production, as well as defining and participating in 24/7 on-call rotations.
- Kubernetes: Experience building and maintaining infrastructure using Kubernetes is a requirement for this role.
- Terraform: The role requires experience with Infrastructure as Code (IaC) tooling such as Terraform.
- Ansible: Experience with configuration management tools such as Ansible is necessary for this position.
Who this role suits
- A leader who can define a clear vision and mandate for a technical team.
- An experienced manager capable of coaching and mentoring individual contributors while defining career growth paths.
- Someone adept at mediating technical discussions and reviewing code contributions.
- A hands-on contributor willing to jump in with code fixes and troubleshooting when needed.
From the employer
In this role, you will:
- Establish a clear vision and mandate for the Site Reliability Engineering team
- Define the SRE team's quarterly OKRs to best align with the company's goals
- Define processes of collaboration between SREs and development teams throughout the software development lifecycle
- Define a career growth path for the SRE team, as well as coach and mentor individual contributors on the team
- Define and track metrics across engineering and help hold engineering teams accountable for their KPIs
- Coordinate priorities with other teams and areas of the organization
- Participate in sprint planning and execution, track progress and oversee day-to-day tactical decisions
- Design and build reliable systems and infrastructure that software engineers find easy to use
- Monitor and troubleshoot systems in production
- Define and participate in 24/7 on-call rotations alongside the team
- Mediate technical discussions and review PRs
- Jump in as needed with code fixes, troubleshooting and hands-on contributions
- Collaborate across the Stellar ecosystem, engaging with key partners and advising on their integration to set them up for success
You have:
- 3+ years of experience working as a Site Reliability Engineer
- 3+ years of experience managing an SRE team
- Site Reliability Engineering experience:
  - Strong track record of collaborating with dev teams at all stages of product development (design, development/CI, beta testing, production)
  - Strong track record of collaborating on defining, measuring and driving improvements in KPIs
  - Strong track record of assisting teams during root cause analysis and postmortems
- Infrastructure and Operations experience:
  - Designing and building out the infrastructure for large distributed systems
  - Maintaining highly available infrastructure
  - Troubleshooting and understanding complex technical problems
  - Using configuration management or IaC tooling such as Terraform, Ansible, Puppet
  - Building and maintaining infrastructure using Kubernetes
- Highly autonomous; able to find clarity in ambiguous circumstances
- Excellent communicator; comfortable working with remote team members
We offer competitive pay with a base salary range for this position of $210,000 - $310,000 depending on job-related knowledge, skills, experience, and location. In addition, we offer lumen-denominated grants along with the following perks and benefits:
- Competitive health, dental & vision coverage with most plans covered at 100% for the employee + any dependents
- Flexible time off + 15 company holidays including a company-wide holiday break
- Generous paid parental leave for all parents, plus paid pregnancy disability leave for birthing parents
- Gym reimbursement ($80 per month)
- Life & AD&D insurance (up to $50K)
- Short- & long-term disability
- 401(k) with 4% match
- Health & Dependent Care FSAs
- Commuter benefits with $250/month employer contribution
- Health Savings Account (HSA) with monthly employer contribution
- Family building benefits through Kindbody
- Wellbeing benefits (One Medical, Rightway, Headspace)
- L&D budget of $1,500/year
- Daily lunch and snacks in office
- Company retreats
Questions about this role
What is the remote work policy for this position?
This is a fully remote position.
What level of seniority is expected for this role?
This is a director-level position requiring at least 3 years of experience as a Site Reliability Engineer and at least 3 years managing an SRE team.
What are the key technical skills required?
Key technical skills include Site Reliability Engineering, Infrastructure, Operations, Kubernetes, Terraform, Ansible, and Puppet.