Vice President Site Reliability Engineering (Data Centers)
Role in brief
Galaxydigitalservices seeks a Vice President of Site Reliability Engineering to lead automation and infrastructure efforts for its digital assets and data center operations. The role involves overseeing a specialized SRE team, establishing IaC standards, and driving automated lifecycle management. Candidates with deep expertise in Terraform and Ansible, experience with VMware and cloud providers such as Azure and AWS, and strong scripting skills should apply.
About the role
This Vice President of Site Reliability Engineering will lead a dedicated team focused on designing, deploying, and maintaining automation toolsets and the systems they interact with. A core responsibility involves establishing and enforcing Infrastructure as Code (IaC) standards to ensure consistent, secure deployments across the entire infrastructure ecosystem. The role demands strong proficiency in Terraform and leadership in configuration management strategies, optimizing Ansible playbooks and Packer image pipelines for Windows, Linux, and ESXi platforms.
The position also involves managing the monitoring and health of automation platforms, implementing SLIs/SLOs to ensure high availability and performance of tools that build servers. The VP will drive the automated lifecycle of physical and virtual assets, from template creation to patching, scaling, and decommissioning. This includes developing custom scripts and internal providers using Python, Go, PowerShell, and Bash to enhance insights and tooling for the systems.
Success in this role means collaborating with the wider datacenter team and supporting its needs, analyzing system behavior and resource utilization in virtual environments to optimize automated deployments, and providing technical guidance and career mentorship to SREs. The goal is to cultivate an 'automate-first' culture of continuous improvement, supporting Galaxy's mission in digital assets and data center infrastructure for finance and AI.
The competitive salary for this position ranges from $120,000 to $200,000.
Skills that matter here
- Terraform: This role requires deep proficiency in Terraform for Infrastructure as Code (IaC) governance, including providers, modules, and state management.
- Ansible: The VP will lead the strategy for automated configuration and state management, optimizing Ansible playbooks for various platforms.
- Packer: Experience with Packer is essential for building standardized and hardened images for both Windows and Linux in hybrid environments.
- VMware: The role involves managing and automating virtual platforms like VMware (vSphere/vCenter).
- Python: High-level scripting skills in Python are required for developing custom tools and internal providers.
- Observability Tools: Experience with observability tools such as Splunk, ELK, Prometheus, or Grafana is needed to monitor infrastructure health and automation telemetry.
Who this role suits
- A leader with 6-10 years of experience in Infrastructure, SRE, or DevOps, specifically focused on infrastructure automation at scale.
- Someone who can establish and enforce technical standards for consistent and secure deployments.
- An individual adept at mentoring SREs and fostering a culture of automation and continuous improvement.
- A collaborator who can work effectively with other datacenter teams and address their needs.
From the employer
Responsibilities
- Automation Platform Leadership: Oversee a specialized SRE team focused on the design, deployment, and maintenance of automation toolsets as well as the systems they interact with.
- Infrastructure as Code (IaC) Governance: Establish and enforce standards for IaC to ensure consistent, repeatable, and secure deployments across an entire infrastructure ecosystem. Strong proficiency in Terraform is required.
- Configuration Management: Lead the strategy for automated configuration and state management, ensuring Ansible playbooks and Packer image pipelines are optimized for Windows, Linux, and ESXi platforms.
- Monitoring & Observability: Manage the monitoring and health of the automation platforms themselves. Implement SLIs/SLOs to ensure the "tools that build the servers" are highly available and performant.
- Lifecycle Management: Drive the automated lifecycle of both physical and virtual assets, from initial template creation/deployment to automated patching, scaling, and decommissioning.
- Custom Tooling & Scripting: Lead the development of custom scripts and internal providers (Python, Go, PowerShell, Bash) to provide better insights and tooling for our systems.
- Collaboration: Beyond the automation team, collaborate and build workflows with the rest of the Datacenter team, and help address the needs of the team as a whole.
- Capacity & Performance: Analyze system behavior and resource utilization in virtual environments to optimize the performance of automated deployments.
- Mentorship & Growth: Provide technical guidance and career mentorship to SREs, fostering a culture of "automate-first" and continuous improvement.
Requirements
- 6-10 years’ experience in Infrastructure, SRE or DevOps, specifically focused on infrastructure automation at scale.
- Deep proficiency with Terraform (providers, modules, state management) and Ansible (roles, playbooks, Tower/AWX).
- Hands-on experience with image creation (e.g. Packer, Ansible, SCCM) to build standardized, hardened images for both Windows and Linux in hybrid environments.
- Strong experience managing and automating virtual platforms such as VMware (vSphere/vCenter) as well as Cloud providers such as Azure and AWS.
- High-level scripting skills in languages such as Python, Go, PowerShell, and Bash.
- Experience with observability tools (Splunk, ELK, Prometheus, or Grafana) to monitor infrastructure health and automation telemetry.
- Good understanding of network topology and design, as well as experience with platforms such as Juniper Networks or Palo Alto.
- Mastery of Git (branching strategies, PR workflows) and CI/CD platforms (Jenkins, GitLab CI, or GitHub Actions).
- Equal comfort managing, troubleshooting, and tuning performance for both Windows Server and Linux.
Conditions
- Competitive salary range of $120K-$200K.
- Remote work environment.
- Opportunities for career mentorship and growth.
Questions about this role
What is the remote work policy for this position?
This is a fully remote position.
What level of experience is required for this role?
Candidates should have 6-10 years of experience in Infrastructure, SRE, or DevOps, with a specific focus on infrastructure automation at scale.
What are the key technical skills needed for this role?
Key technical skills include deep proficiency with Terraform and Ansible, hands-on experience with image creation tools like Packer, experience managing virtual platforms such as VMware and cloud providers such as Azure and AWS, strong scripting skills in Python, Go, PowerShell, and Bash, and experience with observability tools.