Operational Automation: Helping Netflix developers sleep at night!
Jean-Sebastien Jeannotte / Sayli Karmarkar
Jean-Sebastien Jeannotte – JS
Senior Software Engineer, Platform Automation Engineering
[email protected] / @jsjeannotte
http://www.linkedin.com/in/jsjeannotte
Sayli Karmarkar
Senior Software Engineer, Platform Automation Engineering
[email protected] / @HikerTechy
https://www.linkedin.com/in/saylikarmarkar
Speakers
AWS Reboot: September 2014, Every AZ
[Diagram labels: C* with Priam (x3), Atlas (x2)]
Our Stack in 2014
Atlas Dashboard
Healthcheck Script
[Flowchart, evaluated every 30 min. Decision nodes and the actions attached to them:]
Disappearing instance? If yes: launch new instance.
Is the C* ring healthy?
Are all instances healthy?
Can we fix automatically? If yes: replace bad instance.
First failure? If yes: sleep for X minutes and retry.
Is there an offline maintenance?
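The healthcheck flow above can be sketched as a single decision function. This is a minimal illustration, not the script Netflix ran: the order of the checks, the `ClusterState` fields, the `Outcome` names, and the final `ESCALATE` fallback are all assumptions, since the slide does not show how each check was implemented or what happened when nothing else applied.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    HEALTHY = "healthy"
    LAUNCH_NEW_INSTANCE = "launch new instance"
    REPLACE_BAD_INSTANCE = "replace bad instance"
    SLEEP_AND_RETRY = "sleep for X minutes and retry"
    SKIP_MAINTENANCE = "skip: offline maintenance"  # assumed outcome
    ESCALATE = "escalate to a human"  # assumed fallback, not on the slide


@dataclass
class ClusterState:
    # Hypothetical inputs; the slides do not say how each one is measured.
    instance_disappeared: bool = False
    ring_healthy: bool = True
    all_instances_healthy: bool = True
    can_fix_automatically: bool = False
    offline_maintenance: bool = False


def healthcheck(state: ClusterState, first_failure: bool) -> Outcome:
    """One pass of the periodic check (the slide says every 30 min)."""
    if state.instance_disappeared:
        return Outcome.LAUNCH_NEW_INSTANCE
    if state.ring_healthy and state.all_instances_healthy:
        return Outcome.HEALTHY
    # Assumed placement: a known maintenance window suppresses further action.
    if state.offline_maintenance:
        return Outcome.SKIP_MAINTENANCE
    if state.can_fix_automatically:
        return Outcome.REPLACE_BAD_INSTANCE
    if first_failure:
        return Outcome.SLEEP_AND_RETRY
    return Outcome.ESCALATE
```

Keeping the decision logic free of side effects makes each branch easy to test on its own; the caller would map each `Outcome` to the real action.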
AWS Reboot: September 2014, Every AZ
How Did the Healthcheck Script Handle It?
[Same healthcheck flowchart as on the Healthcheck Script slide]
Let’s Take a Step Back
2:00 AM PagerDuty alert
2:02 AM Engineer wakes up
2:07 AM Logs in and ACK
2:10 AM Studies the alert
2:15 AM Checks runbook
2:20 AM Runs diagnostics
2:30 AM Fixes the problem
On-call, Without Automation
Non-Automated On-Call Pain Points
MTTR
Productivity
New Direction
Failure / Alert Automation
Automation using Building Blocks
Integrations with Netflix Ecosystem
Platform as a Service
Event-driven Automation Platform
How Are Others Approaching This Problem?
Evaluation
Winston
+ Inbound Integrations
+ Outbound Integrations
...
As a Service
Winston Deployment
[Architecture diagram labels:]
Atlas Telemetry Platform
SQS queue
Atlas SQS Sensor
Poll
RabbitMQ
Atlas Alert Trigger
Rules Engine
Rule Definitions
Stackstorm Action Runners: Action A, Action B, Workflow C
MongoDB Replica Set
...
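The deployment diagram suggests an event-driven flow: a sensor polls a queue of alerts, emits a trigger, and a rules engine decides which actions to run. The sketch below imitates that flow with the Python standard library only. It does not use the StackStorm or AWS APIs; `Trigger`, `Rule`, `poll_sensor`, `run_rules`, and the trigger name `atlas.alert` are hypothetical names chosen for illustration, and an in-memory `queue.Queue` stands in for the SQS queue.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Trigger:
    name: str
    payload: Dict[str, str]


@dataclass
class Rule:
    trigger_name: str                          # which trigger this rule listens to
    criteria: Callable[[Dict[str, str]], bool]  # should the rule fire for this payload?
    action: Callable[[Dict[str, str]], str]     # what to run when it fires


def poll_sensor(alert_queue: queue.Queue) -> Optional[Trigger]:
    """Poll the queue once; wrap a waiting message in a trigger, or return None."""
    try:
        message = alert_queue.get_nowait()
    except queue.Empty:
        return None
    return Trigger(name="atlas.alert", payload=message)


def run_rules(trigger: Trigger, rules: List[Rule]) -> List[str]:
    """Run the action of every rule whose trigger name and criteria match."""
    results = []
    for rule in rules:
        if rule.trigger_name == trigger.name and rule.criteria(trigger.payload):
            results.append(rule.action(trigger.payload))
    return results
```

In a real deployment the actions would be dispatched to separate runners over a message bus rather than called inline; calling them directly keeps the sketch short.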
Cassandra Monitoring with Winston
[Diagram labels: C* with Priam (x3), Atlas (x2)]
2:00 AM PagerDuty alert
2:02 AM Engineer wakes up
2:07 AM Logs in and ACK
2:10 AM Studies the alert
2:15 AM Checks runbook
2:20 AM Runs diagnostics
2:30 AM Fixes the problem
On-call, Without Automation
False Positive
Winston
2:00 AM
2:05 AM
2:05 AM
2:15 AM
Assisted Diagnostics
Fixed the problem
On-call With Winston
Runbook patterns
False Positive
Assisted Diagnostics
Auto Remediation
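The three runbook patterns can be read as successive stages of one automated runbook. The sketch below is one possible arrangement, not Winston's implementation: `Alert`, `RunbookResult`, `run_runbook`, and the rule that only the assisted-diagnostics path pages an engineer are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Alert:
    name: str
    still_firing: bool  # hypothetical: result of re-checking the alert condition


@dataclass
class RunbookResult:
    pattern: str
    page_engineer: bool
    notes: List[str]


def run_runbook(
    alert: Alert,
    diagnostics: List[Callable[[Alert], str]],
    remediation: Optional[Callable[[Alert], bool]] = None,
) -> RunbookResult:
    # False positive: re-validate the alert before waking anyone up.
    if not alert.still_firing:
        return RunbookResult("false positive", False, ["alert no longer firing"])
    # Collect diagnostics up front so they are available either way.
    notes = [check(alert) for check in diagnostics]
    # Auto remediation: try the known fix, if one exists and it succeeds.
    if remediation is not None and remediation(alert):
        return RunbookResult("auto remediation", False, notes)
    # Assisted diagnostics: page the engineer with the findings attached.
    return RunbookResult("assisted diagnostics", True, notes)
```

A call such as `run_runbook(Alert("node_down", True), [lambda a: "ring status collected"])` takes the assisted-diagnostics path, because no remediation was supplied.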
● Product
○ Reduced MTTR (Mean Time To Recover)
○ Safety - Reduce risk of human errors
○ Capture operational knowledge as code
● People
○ Reduced pager fatigue for developers
○ Increase in productivity
○ Morale
Impact
Stackstorm Docs - http://docs.stackstorm.com/
Stackstorm Slack Channel - https://stackstorm-community.slack.com/
Netflix OpenSource: https://netflix.github.io/
Check out our https://jobs.netflix.com page for current openings
… no more questions