Principal Site Reliability Engineer

Remote | $140k–$180k | Full-time | Posted 1 month ago | Quality 7.6/10

Role in brief

Copperco is seeking a Principal Site Reliability Engineer to define and implement SRE practices, focusing on reliability, observability, and operational excellence. This role involves automating system scaling, improving microservice lifecycles, and influencing engineering teams. It suits experienced SREs who can drive organizational change and mentor others in a remote setting.


About the role

This role is central to establishing and maturing Site Reliability Engineering within Copperco. The Principal SRE will be responsible for defining the company's approach to reliability, observability, and operational excellence. This includes building the systems and processes that make SRE principles measurable, such as Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

A key aspect of the position involves scaling systems through automation and championing architectural improvements that boost both reliability and deployment speed. The Principal SRE will consult on system architecture, build reusable platforms, plan capacity needs, and conduct production readiness reviews to ensure successful service launches and operations. They will also improve the lifecycle of microservices from inception through deployment and continuous refinement.

The successful candidate will lead through influence, partnering with engineering and product leadership to embed reliability into the product development process. This includes conducting blameless postmortems to drive systemic improvements in incident management, and mentoring engineers across the organization on SRE practices so that teams take ownership of their service reliability. Although it starts as an individual contributor position, this role will significantly shape Copperco's future SRE strategy.

The annual salary for this position is between $140,000 and $180,000 USD.

Skills that matter here

Designing, analysing, and troubleshooting distributed systems or micro-services architectures: This skill is essential for improving system reliability and deployment velocity.
Observability and incident management: Expertise in these areas is crucial for defining SRE practices and conducting blameless postmortems.
Driving organizational change: This role requires leading the adoption of SRE principles across the company and embedding reliability into product development.
Communication skills: Effective communication is necessary for partnering with leadership and mentoring engineers on SRE practices.
AWS: Experience with AWS production workloads is desirable for enhancing system reliability in a cloud environment.
Financial services or similarly regulated environments: Experience in these environments is desirable, indicating a preference for candidates familiar with stringent operational requirements.

Who this role suits

A person who thrives on defining and implementing new technical strategies.
Someone who enjoys mentoring others and driving change through influence rather than direct authority.
An individual with a systematic problem-solving approach who is comfortable with complex distributed systems.
A candidate who is proactive in identifying and addressing reliability challenges across an organization.

From the employer

Key Responsibilities:

Shape SRE: Define how we think about reliability, observability, and operational excellence. Drive the adoption of SRE principles across the organization while building the systems and processes that make those principles measurable – think SLIs, SLOs, and error budgets.
Scale Through Automation: Champion architectural improvements that enhance both system reliability and deployment velocity. Consult on system architecture, build reusable platforms and frameworks, plan capacity needs, and conduct production readiness reviews to ensure services launch and operate successfully.
Drive Technical Excellence: Engage in and improve the lifecycle of microservices, from inception through deployment, operation, observability, and continuous refinement.
Lead Through Influence: Partner with engineering and product leadership to embed reliability into our product development lifecycle. Conduct blameless postmortems and drive systemic improvements in incident management. Mentor engineers across the organisation on SRE practices, helping teams take ownership of their service reliability.

While this role begins as an IC position, it will play a key part in shaping the future of SRE at Copper.

Skills and Experience:

Essential

Experience in designing, analysing, and troubleshooting distributed systems or micro-services architectures.
Established expertise in observability and incident management.
Proven experience in driving organizational change.
Excellent communication skills, with a systematic problem-solving approach.

Desirable

Experience working with production workloads in AWS.
Experience working in financial services or similarly regulated environments.
Interest in blockchain-based technologies and/or ‘decentralised’ finance.
Master's degree in Computer Science or Engineering.

Questions about this role

What is the remote work policy for this role?

This is a fully remote position.

What is the expected salary range for this position?

The salary for this role ranges from $140,000 to $180,000 USD annually.

What kind of systems will I be working with?

You will be working with distributed systems and micro-services architectures, with a focus on enhancing their reliability and operational excellence.
