Platform Engineer (Site Reliability Engineering)

Remote | $112k–$188k | Full-time | Posted 8 days ago

Role in brief

Bitso is seeking a Platform Engineer specializing in Site Reliability Engineering to strengthen incident management and automation for its crypto platform. The role involves leading incident response, building automation for postmortems and AI-assisted workflows, and improving observability. Candidates with strong Kubernetes, CI/CD, and software development skills, who stay effective under pressure and have an automation mindset, should consider applying.

Kubernetes, CI/CD, DevOps, Python, Java, AI

About the role

This Platform Engineer role focuses on Site Reliability Engineering within Bitso, a major cryptocurrency platform in Latin America with over 9 million users. The primary responsibility involves end-to-end ownership of on-call shifts, including incident declaration, role assignment, communication, and resolution. A key aspect of the role is to build automation around the Sev1/Sev2 postmortem workflow, ensuring timely follow-up on action items and continuous improvement.

A significant part of this position involves leveraging AI to identify incident patterns and propose systemic solutions, such as refining runbooks, optimizing alerts, and strengthening the platform. The engineer will also develop and extend internal automation and tooling, including AI-assisted incident response, to minimize manual effort and accelerate detection and resolution times. This work contributes to a more resilient and efficient platform.

Success in this role means continuously improving the observability ecosystem through dashboards and alert configurations, and proactively identifying early-warning signals. The engineer will collaborate with other engineering teams to address platform risks and implement preventive measures, while also maintaining accurate incident tooling, runbooks, and severity criteria. This ensures the broader engineering organization has reliable resources for incident management.

The annual salary for this position ranges from $112,000 to $188,000 USD.

Skills that matter here

  • Kubernetes: The role requires hands-on experience with Kubernetes for deploying, debugging, and navigating pod-level issues during incident response.
  • CI/CD: A solid understanding of CI/CD pipelines is necessary to manage changes and deployments effectively and reduce deployment-related incidents.
  • DevOps: The position requires familiarity with modern DevOps practices to contribute to platform reliability and automation.
  • Python: Python experience is a plus; more broadly, a software development background in any language is essential for building and extending internal automation and tooling.
  • Java: Java experience is also a plus. Whatever the language, the role requires the ability to read, write, and debug code for platform improvements.
  • AI: The role involves leveraging AI to identify incident patterns, propose systemic fixes, and build AI-assisted incident response workflows.

Who this role suits

  • A person who can remain composed and communicate clearly with stakeholders during live production issues.
  • Someone with a strong inclination to automate repetitive tasks rather than simply performing them manually.
  • An individual who is proactive in learning and can start contributing without needing a fully defined path.
  • A candidate who enjoys collaborating with various engineering teams to identify and mitigate platform risks.

From the employer

Responsibilities

  • Own and execute on-call shifts end-to-end: acknowledge pages within SLA, declare incidents, assign roles, maintain comms cadence, and drive to resolution
  • Build automation that drives the Sev1/Sev2 postmortem workflow, from scheduling and facilitation reminders to action-item assignment, ownership tracking, and due-date enforcement
  • Leverage AI to identify patterns across incidents and propose systemic fixes: runbook improvements, alert tuning, platform hardening, and process changes
  • Build and extend internal automation and tooling, including AI-assisted incident response workflows, to reduce manual toil and accelerate detection and resolution
  • Contribute to and improve the observability ecosystem (dashboards, alert configurations, and early-warning signals) across Bitso’s platform
  • Participate in change and maintenance management processes, applying risk management to reduce deployment-related incidents
  • Collaborate with engineering squads across the company to surface platform risks and drive preventive actions
  • Keep incident tooling, runbooks, and severity criteria accurate, current, and useful for the broader engineering org

Qualifications

  • Proven ability to operate confidently in high-pressure incident scenarios, including communicating clearly with senior stakeholders and leadership while a production issue is live
  • Hands-on experience with Kubernetes: comfortable deploying, debugging, and navigating pod-level issues
  • Solid understanding of CI/CD pipelines and modern DevOps practices
  • Software development background in any language; ability to read, write, and debug code is essential (Python or Java experience is a plus)
  • Strong automation mindset: you identify repetitive toil and your first instinct is to eliminate it, not absorb it
  • Experience building or working with AI agents or LLM-based workflows is highly desirable
  • Strong interpersonal and written communication skills
  • Self-directed learner who doesn’t need a fully defined path to start contributing
  • Fintech or crypto industry background is a plus; familiarity with the domain vocabulary accelerates onboarding and incident triage

Benefits

  • Me Time program, including unlimited paid time off.
  • Remote-first work environment.
  • Employee Stock Option program.
  • Zero trading fees through our Bitso Alpha app.
  • Extended Family Leave Policy: all birthing parents, non-birthing parents, and adopting parents are eligible for a four-month leave.
  • Premium health, dental, and life insurance in Mexico, Gibraltar, Colombia, the USA, Brazil, and Argentina.

Questions about this role

What is the remote work policy for this role?

The role is listed as remote, and Bitso describes its work environment as remote-first. The listing does not specify which locations are eligible.

What level of experience is expected for this position?

The role requires a proven ability to operate in high-pressure incident scenarios and hands-on experience with technologies such as Kubernetes and CI/CD. This points to mid- to senior-level expertise, although the listing gives no specific seniority title.

How much does this role pay?

The salary range for this position is between $112,000 and $188,000 USD.

Before you apply

  • Legitimate employers never ask you to pay anything to apply or get hired.
  • Never share seed phrases or private keys. No real job needs them.
  • Do not install software ("test tasks", "trading tools", "video call clients") sent during hiring.
  • Check that the application page's domain really belongs to Bitso.