Role in brief
Copper, a digital asset infrastructure company, seeks a Principal Site Reliability Engineer. The role involves defining and implementing SRE principles, automating systems for reliability and deployment, and driving technical excellence across the microservices lifecycle. Candidates with strong distributed systems experience, a focus on operational improvement, and comfort working remotely should apply.
About the role
This Principal Site Reliability Engineer position at Copper focuses on enhancing system reliability and operational excellence. The role requires defining SRE principles, including SLIs, SLOs, and error budgets, and ensuring their adoption across the organization. Success in this area means establishing measurable standards for reliability and fostering a culture of operational accountability.
A key aspect of this role is scaling systems through automation and architectural improvements. This involves consulting on system architecture, developing reusable platforms, planning capacity, and conducting production readiness reviews. The goal is to ensure services launch and operate successfully while increasing deployment velocity and overall system robustness.
The Principal SRE will also drive technical excellence by engaging with the full lifecycle of microservices, from initial design to deployment, operation, and continuous refinement. This includes partnering with engineering and product leadership to integrate reliability into development, leading incident management improvements through blameless postmortems, and mentoring other engineers in SRE practices.
The salary for this position ranges from $120,000 to $200,000 USD.
Skills that matter here
- Site Reliability Engineering: This role is central to shaping SRE practices, defining reliability metrics, and driving adoption across the organization.
- Observability: The position requires defining how the company approaches observability and building the systems that make reliability measurable.
- AWS: Experience running production workloads in AWS is desirable.
- Software development: The role involves improving the lifecycle of microservices from inception through deployment and refinement.
- Incident management: This position requires leading blameless postmortems and driving systemic improvements in incident management.
- Microservices architecture: The role involves designing, analyzing, and troubleshooting distributed systems or microservices architectures.
Who this role suits
- A person who thrives on defining and implementing best practices for system reliability and operational excellence.
- Someone who enjoys mentoring others and influencing technical leadership to embed reliability into product development.
- An individual with a systematic problem-solving approach who is adept at troubleshooting complex distributed systems.
- A professional who is passionate about automation and architectural improvements to enhance system reliability and deployment speed.
From the employer
Key Responsibilities:
- Shape SRE: Define how we think about reliability, observability, and operational excellence. Drive the adoption of SRE principles across the organization while building the systems and processes that make those principles measurable – think SLIs, SLOs and error budgets.
- Scale Through Automation: Champion architectural improvements that enhance both system reliability and deployment velocity. Consult on system architecture, build reusable platforms and frameworks, plan capacity needs, and conduct production readiness reviews to ensure services launch and operate successfully.
- Drive Technical Excellence: Engage in and improve the lifecycle of microservices, from inception through deployment, operation, observability, and continuous refinement.
- Lead Through Influence: Partner with engineering and product leadership to embed reliability into our product development lifecycle. Conduct blameless postmortems and drive systemic improvements in incident management. Mentor engineers across the organisation on SRE practices, helping teams take ownership of their service reliability.
Skills and Experience:
Essential
- Experience in designing, analysing, and troubleshooting distributed systems or microservices architectures.
- Established expertise in observability and incident management.
- Proven experience in driving organizational change.
- Excellent communication skills, with a systematic problem-solving approach.
Desirable
- Experience working with production workloads in AWS.
- Experience working in financial services or similarly regulated environments.
- Interest in blockchain-based technologies and/or ‘decentralised’ finance.
- Master's degree in Computer Science or Engineering.
Benefits:
- 35 days of paid time off per annum, inclusive of annual leave and public holidays. Employees also receive one additional day of annual leave for each year of service.
- Private Health Insurance.
Questions about this role
What is the remote work policy for this position?
This is a fully remote position.
What level of seniority is expected for this role?
This is a senior position at the principal level: Principal Site Reliability Engineer.
What are the core technical skills required for this role?
Essential skills include experience with distributed systems or microservices, observability, incident management, and driving organizational change.