Job in - Aston Robinson

Description: Our (start-up) client is building the future of e-commerce customer service for over 8,000 companies, helping transform support from painful to exceptional using multiple data points and comms channels. Streamlining data such as purchase and delivery times is giving them the edge, thanks to their multi-channel platform.
And it’s working. Following their Series B just over a year ago, the number of users has tripled in the last 12 months, and that is bringing infrastructure challenges. That’s where you step in.

The job role

Their main challenge is now scaling. The number of users of their platform has exploded in the last 12 months, and they are now running multiple K8s clusters in production across several GCP regions, which requires some changes in tooling and monitoring.
Your role will be to focus on reliability, especially around their P99 response time and error rate. Another big challenge is improving their monitoring and tracing to find incident root causes faster. Finally, you will need to bring your development skills to the table to design efficient DevOps tooling around deployments and CI pipelines.

The platform is cloud native on GCP and K8s with no legacy to worry about, so you will focus on performance optimizations, stability improvements and scalability issues. Your team is also responsible for Identity and Access Management, and security in general.

Tech stack
GCP, Kubernetes (GKE, Helm), Terraform, Datadog, ArgoCD, GitLab CI, Python/Go
Environment - Linux OS, Postgres/SQL (50TB of data), distributed systems, high traffic (8k users)

About the Team

From their first SRE back in 2019, the team has grown to 5 people: 3 in Europe, 1 in Canada, and 1 in the US. We are now looking for someone in North America, to have a healthy balance of 3 people on each side of the Pond, which helps with on-call duties. The main tasks of the SRE team are to make sure the platform is functional, available and fast enough. You empower the product teams to efficiently run services by reducing human error, aggressively focusing on automation, and providing deep insight into application behaviour and health. This is done by incorporating aspects of software engineering and applying them to infrastructure and operations problems as a way to build and manage scalable and reliable distributed software systems.

About you

Experienced SRE/DevOps Engineer (5 years+)
Independent mindset (working remotely)
Agile and able to switch topics swiftly between incident handling (reactive) and tooling improvement (proactive)
Strong experience with public cloud (GCP, AWS or Azure), K8s, Python/Go skills, CI/CD (e.g. GitLab), monitoring (Datadog, Prometheus/Grafana, ELK, etc.)
Fluent in English

Location	Canada - Full Remote
Area	Alberta, Canada; British Columbia, Canada; Manitoba, Canada; New Brunswick, Canada; Newfoundland & Labrador, Canada; Nova Scotia, Canada; Ontario, Canada; Prince Edward Island, Canada; Québec, Canada; Saskatchewan, Canada
Sector	38
Salary	115k-125k CAD + Equity
Currency	CAD
Start Date	ASAP
Advertiser	Aston Robinson
Job Ref	ARQC.003


Senior Site Reliability Engineer SRE/DevOps

This job does not exist anymore.
