Role in brief
Copperco is seeking a Principal Site Reliability Engineer to define and implement SRE practices, focusing on reliability, observability, and operational excellence. This role involves automating system scaling, improving microservice lifecycles, and influencing engineering teams. It suits experienced SREs who can drive organizational change and mentor others in a remote setting.
About the role
This role is central to establishing and maturing Site Reliability Engineering within Copperco. The Principal SRE will define the company's approach to reliability, observability, and operational excellence. This includes building the systems and processes that make SRE principles measurable, such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.
A key aspect of the position involves scaling systems through automation and championing architectural improvements that boost both reliability and deployment speed. The Principal SRE will consult on system architecture, build reusable platforms, plan capacity needs, and conduct production readiness reviews to ensure successful service launches and operations. They will also improve the lifecycle of microservices from inception through deployment and continuous refinement.
The successful candidate will lead through influence, partnering with engineering and product leadership to embed reliability into the product development process. This includes conducting blameless postmortems to drive systemic improvements in incident management, and mentoring engineers across the organization on SRE practices so that teams take ownership of their service reliability. Although the role begins as an individual contributor position, it will significantly shape Copperco's future SRE strategy.
The annual salary for this position is between $140,000 and $180,000 USD.
Skills that matter here
- Designing, analysing, and troubleshooting distributed systems or microservices architectures: This skill is essential for improving system reliability and deployment velocity.
- Observability and incident management: Expertise in these areas is crucial for defining SRE practices and conducting blameless postmortems.
- Driving organizational change: This role requires leading the adoption of SRE principles across the company and embedding reliability into product development.
- Communication skills: Effective communication is necessary for partnering with leadership and mentoring engineers on SRE practices.
- AWS: Experience with AWS production workloads is desirable for enhancing system reliability in a cloud environment.
- Financial services or similarly regulated environments: Experience in these environments is desirable, which suggests a preference for candidates familiar with stringent operational requirements.
Who this role suits
- A person who thrives on defining and implementing new technical strategies.
- Someone who enjoys mentoring others and driving change through influence rather than direct authority.
- An individual with a systematic problem-solving approach who is comfortable with complex distributed systems.
- A candidate who is proactive in identifying and addressing reliability challenges across an organization.
From the employer
Key Responsibilities:
- Shape SRE: Define how we think about reliability, observability, and operational excellence. Drive the adoption of SRE principles across the organization while building the systems and processes that make those principles measurable – think SLIs, SLOs, and error budgets.
- Scale Through Automation: Champion architectural improvements that enhance both system reliability and deployment velocity. Consult on system architecture, build reusable platforms and frameworks, plan capacity needs, and conduct production readiness reviews to ensure services launch and operate successfully.
- Drive Technical Excellence: Engage in and improve the lifecycle of microservices, from inception through deployment, operation, observability, and continuous refinement.
- Lead Through Influence: Partner with engineering and product leadership to embed reliability into our product development lifecycle. Conduct blameless postmortems and drive systemic improvements in incident management. Mentor engineers across the organization on SRE practices, helping teams take ownership of their service reliability.
While this role begins as an individual contributor (IC) position, it will play a key part in shaping the future of SRE at Copperco.
Skills and Experience:
Essential
- Experience in designing, analysing, and troubleshooting distributed systems or microservices architectures.
- Established expertise in observability and incident management.
- Proven experience in driving organizational change.
- Excellent communication skills, with a systematic problem-solving approach.
Desirable
- Experience working with production workloads in AWS.
- Experience working in financial services or similarly regulated environments.
- Interest in blockchain-based technologies and/or ‘decentralised’ finance.
- Master's degree in Computer Science or Engineering.
Questions about this role
What is the remote work policy for this role?
This is a fully remote position.
What is the expected salary range for this position?
The salary for this role ranges from $140,000 to $180,000 USD annually.
What kind of systems will I be working with?
You will be working with distributed systems and microservices architectures, with a focus on improving their reliability and operational excellence.