This job posting has expired

Expired on April 4, 2026

Site Reliability Engineer

Full-time
GCP, GKE, Pub/Sub, Cloud Run, Node.js, MongoDB Atlas, Kubernetes, Puppeteer, Cloud Monitoring

Job Description

Our low-code platform is preparing for an immediate scale-up to 3,000,000 concurrent users. We currently operate on a GKE-based architecture with 78 microservices and a MongoDB Atlas backend. We need a Lead Site Reliability Engineer who can transform our current synchronous system into a high-concurrency, asynchronous engine capable of surviving massive traffic spikes without database or compute failure.

Responsibilities

  • Transition synchronous API flows to Google Cloud Pub/Sub
  • Implement and own the database 'Speed Limit' (rate limiting on traffic reaching MongoDB Atlas)
  • Configure subscriber-side flow control in Node.js and the Kubernetes Horizontal Pod Autoscaler (HPA)
  • Isolate heavy Puppeteer/Chrome workloads using Cloud Run or dedicated Spot VM node pools
  • Build a 'Nerve Center' using Cloud Monitoring
  • Optimize container footprints using Vertical Pod Autoscaling (VPA)

Qualifications

  • Deep experience with GKE, Pub/Sub, and Cloud Run
  • Knowledge of how to request and manage high-scale CPU quotas
  • Advanced Node.js knowledge
  • Experience with MongoDB Atlas M60/M80 tiers
  • Experience implementing backpressure and circuit breakers

Job Information

Posted

February 3, 2026

Experience Level

Lead

Status

Expired