Role Type

On-site • Permanent • Full-time • Mid-level to Senior

Description

BigGeo is the Spatial Cloud.

We help companies manage and access the world’s spatial data.

Any size, any slice, any insight.

Delivered in seconds.

We’re building something that hasn’t existed before: a new layer of the internet where the “where” and “when” behind every decision are instantly clear, programmable, and actionable. Our platform removes the complexity that has kept spatial data locked in silos for decades and replaces it with speed, precision, and control.

We’re a Calgary-based company, early and moving fast, with real customers, real infrastructure, and a clear point of view on where the world is going.

Why BigGeo Exists and Why People Build Here

Most companies are spatially blind. They know what their data says, but not where or when things actually happen. That gap costs real money, creates real risk, and limits what AI can do in the physical world.

BigGeo exists to close that gap.

We’re not building another tool. We’re building the rails that connect the planet’s moving data to the systems that run the world. That’s a big problem, and it takes people who care about doing things right, not just fast.

People build here because:

The problem is real and the category is open. We’re not competing for the middle of an existing market. We’re defining a new one. Your work shapes what the category becomes.
Your fingerprints are on the architecture. We’re at the stage where the decisions you make today become the foundation tomorrow. What you ship matters.
We run on clarity, not politics. We move with purpose. No bureaucratic drag, no HiPPO decisions, just a team that agrees on the mission and gets to work.
You’ll grow fast because the problems are hard. Spatial data at scale is a genuinely difficult domain. If you want to be stretched, you’ll be stretched.
We’re building for longevity. We’re not chasing hype cycles. We’re building infrastructure, the kind that compounds in value over time and earns the trust of the companies that depend on it.

The Role

BigGeo is looking for a Lead Cloud Reliability Engineer to design and operate the systems that keep the Spatial Cloud running reliably at scale. This role sits at the intersection of hands-on infrastructure engineering and technical leadership, and it carries real ownership over how dependable our platform feels to the customers, systems, and AI agents that run on top of it.

You’ll be responsible for the reliability architecture that supports spatial compute, data pipelines, and platform services across the Spatial Cloud. Working side by side with platform engineers, data engineers, and spatial compute teams, you’ll make sure the systems we ship are observable, resilient, and ready to handle large-scale spatial workloads in production.

This is also a leadership seat. You’ll help set the reliability practices, operational standards, and automation systems that keep the platform stable as it scales across industries and global datasets. If you want to shape how a category-defining infrastructure company runs in production, this is the role.

Key Responsibilities

Reliability Architecture

Design reliability patterns for distributed services across the Spatial Cloud, including failure isolation, graceful degradation, and multi-region resilience.
Ensure systems are fault-tolerant, production-ready, and capable of meeting well-defined SLOs and error budgets.
Guide architectural decisions that materially improve platform stability, throughput, and predictability under load.

Observability and Monitoring

Build and maintain monitoring, logging, and tracing systems that give every engineer clear visibility into system health, latency, and saturation.
Define and maintain meaningful SLIs, SLOs, and alert thresholds that catch real problems without creating noise.
Create dashboards, runbooks, and alerting systems that turn raw telemetry into operational awareness the whole team can act on.

Incident Response and Recovery

Lead investigation and resolution of reliability incidents, including high-severity production events.
Improve detection, escalation, and recovery processes so service disruptions are shorter, smaller, and less frequent.
Run blameless post-incident reviews, capture the real root causes, and drive preventative improvements to completion.

Infrastructure Automation

Develop automated infrastructure management, provisioning, and deployment systems that remove manual work from the critical path.
Build and extend infrastructure-as-code workflows (Terraform, Pulumi, or equivalent) with rigorous review, testing, and rollback paths.
Improve reliability through automated scaling, self-healing behaviour, and operational tooling that operators can trust.

Performance and Capacity Management

Monitor infrastructure performance and anticipate scaling needs across compute, storage, and network.
Tune and optimize platform systems to support high-throughput spatial workloads and real-time query patterns.
Ensure spatial compute and data systems operate efficiently at scale, with clear cost, performance, and capacity signals.

Technical Leadership

Mentor reliability and platform engineers, raising the reliability engineering bar across the team.
Establish operational standards, on-call practices, and reliability engineering best practices for the company.
Contribute actively to architecture discussions across platform, data, and spatial compute teams.

AI-Enabled Operations

Use modern AI tools to accelerate incident diagnosis, log and trace analysis, and system debugging.
Integrate AI-assisted workflows into runbook execution, on-call support, and post-incident review.
Help build internal AI-assisted operational agents that reduce toil and shorten mean time to resolution.

Cross-Functional Collaboration

Work closely with platform engineers, spatial compute teams, and data engineers to bake reliability into system design from day one.
Partner with product and engineering leadership to translate reliability posture into clear customer commitments.
Support engineering teams in operating production systems responsibly, with shared ownership of outcomes.

What You Bring

Required:

6 to 10 years of experience in reliability engineering, infrastructure engineering, or platform operations.
Bachelor’s degree in Computer Science, Information Technology, or a related field.
Demonstrated experience operating large-scale distributed systems in production.
Strong experience with monitoring, observability, and operational tooling in cloud-native environments.
Experience designing and implementing infrastructure automation, CI/CD pipelines, and deployment reliability practices.
Strong working knowledge of modern cloud infrastructure, containerized environments, and orchestration platforms.
Track record of diagnosing and resolving complex, high-severity infrastructure issues under pressure.
Experience defining and operating against SLOs, error budgets, and incident response practices.
Strong written and verbal communication skills, with the ability to explain reliability tradeoffs to technical and non-technical audiences.

Nice to Have:

Experience supporting large-scale data platforms, geospatial systems, or high-throughput compute infrastructure.
Experience working with real-time systems, streaming data pipelines, or low-latency query workloads.
Familiarity with spatial or geospatial data platforms, including indexing strategies and spatial query patterns.
Experience building reliability systems in startup or high-growth environments, where the rules are still being written.
Experience supporting AI or data-intensive infrastructure workloads in production.
Exposure to cost and capacity management practices (FinOps, usage-based scaling, workload right-sizing).

Success Measures

First 30 days:

Build deep context on BigGeo’s platform architecture, current reliability posture, and active operational risks.
Onboard into the on-call rotation and shadow incidents to understand real failure modes and team response patterns.
Identify the top reliability gaps across observability, automation, and incident response, and share a prioritized view with engineering leadership.

First 60 days:

Lead at least one meaningful reliability improvement to completion, with measurable impact on stability or operability.
Establish or refine SLIs and SLOs for one or more critical services, and wire up supporting dashboards and alerts.
Drive improvements to incident response processes, including escalation, communication, and post-incident review.
