AWS Site Reliability Engineer (Data Platform)
Fully onsite, London or Glasgow
12-month contract
Inside IR35

Role Summary

We are seeking an AWS Site Reliability Engineer (SRE) to support, scale, and improve a cloud-native data platform built on AWS, Snowflake, and Databricks. This role focuses on enhancing platform reliability through automation, disaster recovery testing, resiliency engineering, observability best practices, and proactive SLO/SLI/SLA management.

Key Responsibilities

Design, build, and maintain automation for infrastructure provisioning, platform operations, and incident response using Infrastructure as Code (IaC) and CI/CD.
Lead resiliency and disaster recovery initiatives, including scheduled DR drills, fault injection, and validation of recovery processes across AWS and data platform components.
Define, implement, and manage SLIs, SLOs, and SLAs for critical data pipelines and platform services; leverage error budgets to guide reliability-focused improvements.
Build and operate end-to-end observability solutions (metrics, logs, traces, alerts) for AWS services, Snowflake, and Databricks workloads.
Partner with data engineering and platform teams to embed reliability-by-design into architectural decisions and delivery practices.
Perform root cause analysis (RCA) and drive continuous improvement to reduce operational toil and enhance platform availability and performance.
Own and drive resolution of platform-related incidents and service requests, ensuring ef