Senior Site Reliability Engineer SRE/DevOps
This job does not exist anymore.
Location | Canada - Full Remote |
Area | Alberta, Canada; British Columbia, Canada; Manitoba, Canada; New Brunswick, Canada; Newfoundland & Labrador, Canada; Nova Scotia, Canada; Ontario, Canada; Prince Edward Island, Canada; Québec, Canada; Saskatchewan, Canada |
Sector | 38 |
Salary | 115K-125K CAD + Equity |
Currency | CAD |
Start Date | ASAP |
Advertiser | Aston Robinson |
Job Ref | ARQC.003 |
- Description
- Our (start-up) client is building the future of e-commerce customer service for over 8,000 companies, helping transform support from painful to exceptional using multiple data points and comms channels. Streamlining data such as purchase and delivery times gives them the edge, thanks to their multi-channel platform.
And it's working. Following their Series B just over a year ago, the number of users has tripled in the last 12 months, and that is bringing infrastructure challenges. That's where you step in.
The job role
Their main challenge is now scaling. The number of users of their platform has exploded in the last 12 months, and they are now running multiple K8s clusters in production across several GCP regions, which requires some changes in tooling and monitoring.
Your role will be to focus on reliability, especially around their P99 response time and error rate. Another big challenge is improving their monitoring and tracing to find incident root causes faster. Finally, you will need to bring your development skills to the table to design efficient DevOps tooling around deployments and CI pipelines.
The platform is cloud native on GCP and K8s with no legacy to worry about, so you will focus on performance optimizations, stability improvements and scalability issues. Your team is also responsible for Identity and Access Management, and security in general.
Tech stack
GCP, Kubernetes (GKE, Helm), Terraform, Datadog, ArgoCD, GitLab CI, Python/Go
Environment - Linux OS, Postgres/SQL (50TB of data), distributed systems, high traffic (8k users)
About the Team
Since hiring their first SRE back in 2019, the team has grown to 5 people: 3 in Europe, 1 in Canada, and 1 in the US. We are now looking for someone in North America, to have a healthy balance of 3 people on each side of the Pond, which helps with on-call duties. The main tasks of the SRE team are to make sure the platform is functional, available and fast enough. You empower the product teams to efficiently run services by reducing human error, aggressively focusing on automation, and providing deep insight into application behaviour and health. This is done by incorporating aspects of software engineering and applying them to infrastructure and operations problems as a way to build and manage scalable and reliable distributed software systems.
About you
Experienced SRE/DevOps Engineer (5 years+)
Independent mindset (working remotely)
Agile and able to switch topics swiftly between incident handling (reactive) and tooling improvement (proactive)
Strong experience with public cloud (GCP, AWS or Azure), K8s, Python/Go skills, CI/CD (e.g. GitLab), monitoring (Datadog, Prometheus/Grafana, ELK etc.)
Fluent in English