Senior IaaS / Kubernetes Platform Engineer
Remote
$115k–$196k
senior
2 months ago
full-time
What You Will Do
- Kubernetes Platform Engineering (Primary Focus — 40%)
- Design, build, and operate a multi-tenant Kubernetes platform using Cluster API (CAPI) with bare-metal providers (Metal3/Sidero).
- Implement hard multi-tenancy using vCluster (Loft Labs) or similar technology, providing isolated Kubernetes API servers per tenant.
- Deploy and manage KubeVirt for VM orchestration within Kubernetes, including CPU pinning, NUMA awareness, and HugePages configuration.
- Implement GitOps-driven infrastructure using ArgoCD or Flux as the single source of truth for all cluster configurations.
- Deploy and manage Policy-as-Code using Kyverno or OPA Gatekeeper for admission control, resource quotas, and security policies.
- Build self-service capabilities using Crossplane or similar Kubernetes-native infrastructure provisioning tools.
- Storage Engineering (20%)
- Operate and optimize Ceph distributed storage clusters (currently 1 PiB raw, 149 OSDs, Quincy 17.2.5).
- Manage Rook-Ceph operator deployments at scale on modern Kubernetes (v1.28+).
- Implement storage tiering: Ceph for bulk storage, local NVMe for high-IOPS workloads, LINSTOR/DRBD or TopoLVM for ultra-fast replicated storage.
- Design and implement per-VM / per-tenant I/O isolation on shared Ceph clusters.
- Manage CDI (Containerized Data Importer) for VM image lifecycle in KubeVirt environments.
- Networking (15%)
- Deploy and manage overlay networks for pod networking, micro-segmentation, and WireGuard/IPsec encryption.
- Implement Cluster Mesh for multi-datacenter pod-to-pod connectivity.
- Configure Multus CNI and SR-IOV for multi-NIC VM support in KubeVirt.
- Work with physical network infrastructure: Juniper switches (JunOS), BGP (eBGP/iBGP), EVPN/VXLAN, VLANs.
- Maintain IPsec site-to-site connectivity between datacenters.
- Reliability and Operations (15%)
- Practice SRE discipline: define and maintain SLOs with error budgets, and implement proactive capacity management with 6- to 12-month forecasting.
- Design and execute chaos engineering experiments to validate system resilience.
- Participate in the on-call rotation for IaaS infrastructure (OpenNebula, Ceph, networking).
- Write and maintain runbooks, disaster recovery plan (DRP) documentation, and postmortem analyses.
- Drive proactive improvement: identify reliability risks, performance bottlenecks, and toil — then propose and implement solutions without waiting for incidents.
- Infrastructure as Code and Automation (10%)
- Develop and maintain Terraform/OpenTofu modules for multi-cloud infrastructure provisioning.
- Write Ansible playbooks for bare-metal server configuration and fleet management.
- Automate infrastructure lifecycle: PXE
Requirements
- Proven experience in Kubernetes platform engineering and IaaS.
- Strong understanding of cloud infrastructure, networking, and storage solutions.
- Experience with GitOps practices and tools (ArgoCD, Flux).
- Familiarity with Ceph and distributed storage management.
- Proficiency in Terraform and Ansible for automation and infrastructure as code.
- Ability to work independently and collaboratively in a remote team environment.
What We Offer
- Competitive salary ranging from $115,000 to $195,500 USD.
- Fully remote work environment.
- Supportive team culture focused on collaboration and success.
- Opportunities for professional growth and development.