Operational Automation Helping Netflix developers sleep at night! Jean-Sebastien Jeannotte / Sayli Karmarkar

Netflix Winston meetup presentation 2015-11-18

Page 1: Netflix Winston meetup presentation 2015-11-18

Operational Automation
Helping Netflix developers sleep at night!

Jean-Sebastien Jeannotte / Sayli Karmarkar

Sayli Karmarkar: Change the title to 'Operational Automation At Netflix'?
Jean-Sebastien Jeannotte: Add slides about looking into Facebook, LinkedIn and Dropbox projects
Page 2: Netflix Winston meetup presentation 2015-11-18

Speakers

Jean-Sebastien Jeannotte – JS
Senior Software Engineer, Platform Automation Engineering
@jsjeannotte
http://www.linkedin.com/in/jsjeannotte

Sayli Karmarkar
Senior Software Engineer, Platform Automation Engineering
@HikerTechy
https://www.linkedin.com/in/saylikarmarkar

Page 3: Netflix Winston meetup presentation 2015-11-18

AWS Reboot: September 2014, Every AZ

Page 4: Netflix Winston meetup presentation 2015-11-18
Page 5: Netflix Winston meetup presentation 2015-11-18
Page 6: Netflix Winston meetup presentation 2015-11-18
Page 7: Netflix Winston meetup presentation 2015-11-18

Our Stack in 2014

[Diagram labels: C* with Priam (shown three times), Atlas (shown twice)]

Page 8: Netflix Winston meetup presentation 2015-11-18

Atlas Dashboard

Page 9: Netflix Winston meetup presentation 2015-11-18

Healthcheck Script

[Flowchart, run every 30 min. The arrows between boxes did not survive extraction; the boxes are listed below.]

Decisions (each with Yes / No branches):
- Disappearing instance?
- Is the C* ring healthy?
- Are all instances healthy?
- Can we fix automatically?
- First failure?
- Is there an offline maintenance?

Actions:
- Launch new instance
- Replace bad instance
- Sleep for X minutes and retry
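
The healthcheck flowchart above can be sketched as a single decision function. This is only an illustration: the slide's arrows were lost, so the order of the checks below is an assumption, and `ClusterState`, `Action`, `healthcheck_pass`, and the `ESCALATE` outcome are hypothetical names that do not come from the presentation.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    LAUNCH_NEW_INSTANCE = "launch new instance"
    REPLACE_BAD_INSTANCE = "replace bad instance"
    SLEEP_AND_RETRY = "sleep for X minutes and retry"
    NOTHING = "nothing to do"
    ESCALATE = "escalate to a human"  # assumed outcome; not a box on the slide


@dataclass
class ClusterState:
    """Hypothetical snapshot of what one healthcheck run observes."""
    instance_disappeared: bool = False
    ring_healthy: bool = True
    all_instances_healthy: bool = True
    can_fix_automatically: bool = False
    offline_maintenance: bool = False


def healthcheck_pass(state: ClusterState, first_failure: bool) -> Action:
    """One pass of the script (the slide says it runs every 30 min).

    The ordering of the checks is an assumption made for illustration.
    """
    if state.instance_disappeared:
        return Action.LAUNCH_NEW_INSTANCE
    if state.ring_healthy and state.all_instances_healthy:
        return Action.NOTHING
    if state.offline_maintenance:
        # Assumed: a known maintenance explains the unhealthy state.
        return Action.NOTHING
    if first_failure:
        # Give a transient problem a chance to clear before acting.
        return Action.SLEEP_AND_RETRY
    if state.can_fix_automatically:
        return Action.REPLACE_BAD_INSTANCE
    return Action.ESCALATE
```

For example, `healthcheck_pass(ClusterState(ring_healthy=False), first_failure=True)` returns `Action.SLEEP_AND_RETRY` under these assumptions.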

Page 10: Netflix Winston meetup presentation 2015-11-18

AWS Reboot: September 2014, Every AZ

Page 11: Netflix Winston meetup presentation 2015-11-18

How Did the Healthcheck Script Handle It?

[Same flowchart as the Healthcheck Script slide on page 9]

Page 12: Netflix Winston meetup presentation 2015-11-18
Page 13: Netflix Winston meetup presentation 2015-11-18

Let’s Take a Step Back

Page 14: Netflix Winston meetup presentation 2015-11-18

On-call, Without Automation

Timeline: 2:00 AM, 2:02 AM, 2:07 AM, 2:10 AM, 2:15 AM, 2:20 AM, 2:30 AM

Steps shown: Alert, PagerDuty, Engineer wakes up, Logs in and ACK, Studies the alert, Checks runbook, Runs diagnostics, Fixes the problem

Page 15: Netflix Winston meetup presentation 2015-11-18

Non-Automated On-Call Pain Points

MTTR

Productivity

Page 16: Netflix Winston meetup presentation 2015-11-18

New Direction

Page 17: Netflix Winston meetup presentation 2015-11-18

Failure / Alert Automation

Automation using Building Blocks

Integrations with Netflix Ecosystem

Platform as a Service

Event-driven Automation Platform

Page 18: Netflix Winston meetup presentation 2015-11-18

How Are Others Approaching This Problem?

Page 19: Netflix Winston meetup presentation 2015-11-18

Evaluation

Page 20: Netflix Winston meetup presentation 2015-11-18

Winston

+ Inbound Integrations
+ Outbound Integrations
...

As a Service

Page 21: Netflix Winston meetup presentation 2015-11-18

Winston Deployment

[Architecture diagram labels]
- Atlas Telemetry Platform
- SQS queue
- Atlas SQS Sensor (Poll)
- Atlas Alert Trigger
- Rules Engine
- Rule Definitions
- RabbitMQ
- StackStorm Action Runners: Action A, Action B, Workflow C
- MongoDB Replica Set
- ...
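
The flow named on this slide (a sensor polls a queue, emits a trigger, the rules engine matches it against rule definitions, and an action runs) can be sketched in plain Python. This is a minimal stand-in, not StackStorm code: `queue.Queue` takes the place of SQS, and the trigger name `atlas.alert`, the rule fields, the alert payload keys, and the `diagnose_cassandra` action are all hypothetical.

```python
import json
import queue
from typing import Callable, Dict, List


def poll_sensor(q: "queue.Queue[str]") -> List[dict]:
    """Drain the queue and turn each JSON message into a trigger instance."""
    triggers = []
    while True:
        try:
            raw = q.get_nowait()
        except queue.Empty:
            break
        triggers.append({"trigger": "atlas.alert", "payload": json.loads(raw)})
    return triggers


def diagnose_cassandra(payload: dict) -> str:
    """Hypothetical action; a real one would gather logs, metrics, etc."""
    return f"collected diagnostics for {payload['cluster']}"


# Hypothetical rule definitions: trigger type + exact-match criteria -> action name.
RULES = [
    {
        "name": "cassandra_node_down",
        "trigger": "atlas.alert",
        "criteria": {"alert_name": "cassandra_node_down"},
        "action": "diagnose_cassandra",
    },
]

ACTIONS: Dict[str, Callable[[dict], str]] = {
    "diagnose_cassandra": diagnose_cassandra,
}


def run_rules(trigger: dict, rules: List[dict],
              actions: Dict[str, Callable[[dict], str]]) -> List[str]:
    """Run the action of every rule whose trigger type and criteria match."""
    results = []
    for rule in rules:
        if rule["trigger"] != trigger["trigger"]:
            continue
        payload = trigger["payload"]
        if all(payload.get(k) == v for k, v in rule["criteria"].items()):
            results.append(actions[rule["action"]](payload))
    return results
```

Putting a JSON alert such as `{"alert_name": "cassandra_node_down", "cluster": "cass_demo"}` on the queue, calling `poll_sensor`, and passing each trigger to `run_rules` runs the matching action; alerts that match no rule produce an empty result list.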

Page 22: Netflix Winston meetup presentation 2015-11-18

Cassandra Monitoring with Winston

[Diagram labels: C* with Priam (shown three times), Atlas (shown twice)]

Page 23: Netflix Winston meetup presentation 2015-11-18

On-call, Without Automation

Timeline: 2:00 AM, 2:02 AM, 2:07 AM, 2:10 AM, 2:15 AM, 2:20 AM, 2:30 AM

Steps shown: Alert, PagerDuty, Engineer wakes up, Logs in and ACK, Studies the alert, Checks runbook, Runs diagnostics, Fixes the problem

Page 24: Netflix Winston meetup presentation 2015-11-18

On-call With Winston

Timeline: 2:00 AM, 2:05 AM, 2:05 AM, 2:15 AM

Labels shown: Winston, False Positive, Assisted Diagnostics, Fixed the problem

Page 25: Netflix Winston meetup presentation 2015-11-18

Runbook patterns

Page 26: Netflix Winston meetup presentation 2015-11-18

False Positive

Page 27: Netflix Winston meetup presentation 2015-11-18

Assisted Diagnostics

Page 28: Netflix Winston meetup presentation 2015-11-18

Auto Remediation
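
The three runbook patterns named on the preceding slides (False Positive, Assisted Diagnostics, Auto Remediation) could be combined in one alert handler along these lines. The slides give only the pattern names, so the control flow, the `Outcome` and `Result` types, and the callback parameters are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    FALSE_POSITIVE = "false positive"
    AUTO_REMEDIATED = "auto remediated"
    ESCALATED_WITH_DIAGNOSTICS = "escalated with diagnostics"


@dataclass
class Result:
    outcome: Outcome
    diagnostics: str = ""


def handle_alert(
    alert: dict,
    still_firing: Callable[[dict], bool],
    collect_diagnostics: Callable[[dict], str],
    try_fix: Optional[Callable[[dict], bool]] = None,
) -> Result:
    """Apply the three patterns in an assumed order."""
    # False Positive: re-check the condition before waking anyone up.
    if not still_firing(alert):
        return Result(Outcome.FALSE_POSITIVE)

    # Assisted Diagnostics: gather context so a human starts with data.
    diagnostics = collect_diagnostics(alert)

    # Auto Remediation: attempt a known fix when one is registered.
    if try_fix is not None and try_fix(alert):
        return Result(Outcome.AUTO_REMEDIATED, diagnostics)

    return Result(Outcome.ESCALATED_WITH_DIAGNOSTICS, diagnostics)
```

In this sketch a page reaches the on-call engineer only in the last case, and it arrives with the diagnostics already attached.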

Page 29: Netflix Winston meetup presentation 2015-11-18

Impact

● Product
  ○ Reduced MTTR (Mean Time To Recover)
  ○ Safety - Reduce risk of human errors
  ○ Capture operational knowledge as code

● People
  ○ Reduced pager fatigue for developers
  ○ Increase in productivity
  ○ Morale

Page 30: Netflix Winston meetup presentation 2015-11-18

StackStorm Docs - http://docs.stackstorm.com/
StackStorm Slack Channel - https://stackstorm-community.slack.com/
Netflix OpenSource - https://netflix.github.io/

Check out our https://jobs.netflix.com page for current openings

Page 31: Netflix Winston meetup presentation 2015-11-18

… no more questions

Page 32: Netflix Winston meetup presentation 2015-11-18