Site Reliability Engineer

Full-time

GCP, GKE, Pub/Sub, Cloud Run, Node.js, MongoDB Atlas, Kubernetes, Puppeteer, Cloud Monitoring

Job Description

Our low-code platform is preparing for an immediate scale-up to 3,000,000 concurrent users. We currently operate on a GKE-based architecture with 78 microservices and a MongoDB Atlas backend. We need a Lead Site Reliability Engineer who can transform our current synchronous system into a high-concurrency, asynchronous engine capable of surviving massive traffic spikes without database or compute failure.

Responsibilities

Transition synchronous API flows to Google Cloud Pub/Sub
Implement and own the 'Speed Limit' for the database
Configure subscriber-side flow control in Node.js and the Kubernetes Horizontal Pod Autoscaler (HPA)
Isolate heavy Puppeteer/Chrome workloads using Cloud Run or dedicated Spot VM node pools
Build a 'Nerve Center' using Cloud Monitoring
Optimize container footprints using Vertical Pod Autoscaling (VPA)

Qualifications

Deep experience with GKE, Pub/Sub, and Cloud Run
Knowledge of how to request and manage high-scale CPU quotas
Advanced Node.js knowledge
Experience with MongoDB Atlas M60/M80 tiers
Experience implementing Backpressure and Circuit Breakers

Job Information

Posted

February 3, 2026

Experience Level

Lead

Status

Expired