Netflix Winston meetup presentation 2015-11-18

Operational Automation
Helping Netflix developers sleep at night!

Jean-Sebastien Jeannotte / Sayli Karmarkar

Speakers

Jean-Sebastien Jeannotte – JS
Senior Software Engineer, Platform Automation Engineering
jjeannotte@netflix.com, @jsjeannotte
http://www.linkedin.com/in/jsjeannotte

Sayli Karmarkar
Senior Software Engineer, Platform Automation Engineering
skarmarkar@netflix.com, @HikerTechy
https://www.linkedin.com/in/saylikarmarkar

AWS re:Boot: September 2014, Every AZ

Our Stack in 2014

[Diagram: C* nodes, each labeled with Priam, and Atlas]

Atlas Dashboard

Healthcheck Script

[Flowchart, runs every 30 min. Decision points and actions:]

● Disappearing instance? Yes: launch new instance
● Is the C* ring healthy?
● Are all instances healthy?
● Can we fix automatically? Yes: replace bad instance
● First failure? Yes: sleep for X minutes and retry
● Is there an offline maintenance?
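
The decision points in the healthcheck flowchart can be read as one pass of a fixed decision tree. The following is a minimal Python sketch under stated assumptions, not the actual Netflix script: the order of the checks, the `ClusterState` fields, and the returned action names are all hypothetical, since the extracted slide names the questions but not every branch.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    """Hypothetical snapshot of what one healthcheck pass observes."""
    instance_disappeared: bool = False
    ring_healthy: bool = True
    all_instances_healthy: bool = True
    auto_fixable: bool = False
    offline_maintenance: bool = False


def healthcheck_pass(state: ClusterState, first_failure: bool) -> str:
    """Return the action for one pass (the slide runs a pass every 30 min).

    The branch order below is an assumption made for illustration.
    """
    if state.instance_disappeared:
        return "launch_new_instance"
    if state.ring_healthy and state.all_instances_healthy:
        return "ok"
    if state.offline_maintenance:
        # Assumed: planned maintenance is not treated as a failure.
        return "skip_maintenance"
    if state.auto_fixable:
        return "replace_bad_instance"
    if first_failure:
        # "Sleep for X minutes and retry" on the slide.
        return "sleep_and_retry"
    # Assumed escalation step; the slide does not name it.
    return "page_on_call"
```

Every new failure mode in a script like this needs another hard-coded branch, which is the limitation the rest of the deck moves away from.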

AWS re:Boot: September 2014, Every AZ

How Did the Healthcheck Script Handle It?

[Flowchart: the Healthcheck Script flowchart, repeated from the earlier slide]

Let’s Take a Step Back

On-call, Without Automation

2:00 AM  PagerDuty alert
2:02 AM  Engineer wakes up
2:07 AM  Logs in and ACKs
2:10 AM  Studies the alert
2:15 AM  Checks runbook
2:20 AM  Runs diagnostics
2:30 AM  Fixes the problem

Non-Automated On-Call Pain Points

MTTR

Productivity

New Direction

Failure / Alert Automation

Automation using Building Blocks

Integrations with Netflix Ecosystem

Platform as a Service

Event-driven Automation Platform

How Are Others Approaching This Problem?

Evaluation

Winston

+ Inbound Integrations
+ Outbound Integrations
...

As a Service

Winston Deployment

[Architecture diagram. Components shown:]

● Atlas Telemetry Platform
● SQS queue
● Atlas SQS Sensor (Poll)
● Atlas Alert Trigger
● RabbitMQ
● Rules Engine, Rule Definitions
● StackStorm Action Runners: Action A, Action B, Workflow C
● MongoDB Replica Set
● ...
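
The components in the deployment diagram suggest an event-driven flow: a sensor polls a queue, emits a trigger, a rules engine matches the trigger against rule definitions, and an action runner executes the matching action. The sketch below illustrates that flow in plain Python under stated assumptions. It does not use the StackStorm, SQS, or RabbitMQ APIs: the in-memory queues, the `RULES` and `ACTIONS` tables, and all names are hypothetical stand-ins.

```python
import queue

# In-memory stand-ins (assumptions): alert_queue plays the SQS queue and
# trigger_bus plays RabbitMQ. No real AWS or AMQP client is involved.
alert_queue = queue.Queue()
trigger_bus = queue.Queue()


def atlas_sqs_sensor():
    """Poll the alert queue and emit one trigger per alert found."""
    while True:
        try:
            alert = alert_queue.get_nowait()
        except queue.Empty:
            return
        trigger_bus.put({"trigger": "atlas.alert", "payload": alert})


# Hypothetical rule definitions: a criteria predicate plus an action name.
RULES = [
    {
        "criteria": lambda p: p.get("app") == "cassandra",
        "action": "diagnose_cassandra",
    },
]

# Hypothetical action registry standing in for the action runners.
ACTIONS = {
    "diagnose_cassandra": lambda p: f"ran diagnostics on {p['cluster']}",
}


def rules_engine():
    """Match each trigger against the rules and run the matching actions."""
    results = []
    while not trigger_bus.empty():
        payload = trigger_bus.get()["payload"]
        for rule in RULES:
            if rule["criteria"](payload):
                results.append(ACTIONS[rule["action"]](payload))
    return results


# Two alerts arrive; only the Cassandra one matches a rule.
alert_queue.put({"app": "cassandra", "cluster": "cass_demo"})
alert_queue.put({"app": "other", "cluster": "misc"})
atlas_sqs_sensor()
results = rules_engine()
print(results)
```

Keeping rules as data, separate from the sensor and the actions, is what lets new alert types be handled by adding a rule rather than editing a script.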

Cassandra Monitoring with Winston

[Diagram: C* nodes, each labeled with Priam, and Atlas]

On-call With Winston

[Timeline. Time markers: 2:00 AM, 2:05 AM, 2:05 AM, 2:15 AM. Steps shown: Winston, False Positive, Assisted Diagnostics, Fixed the problem]

Runbook patterns

False Positive

Assisted Diagnostics

Auto Remediation
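
One way to read the three runbook patterns is as successive stages of a single alert handler. This is a minimal sketch under that assumption, not Winston's implementation: `handle_alert` and the three callables passed to it are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional


def handle_alert(
    alert: dict,
    is_false_positive: Callable[[dict], bool],
    collect_diagnostics: Callable[[dict], str],
    remediate: Optional[Callable[[dict], bool]] = None,
) -> dict:
    """Apply the three runbook patterns in order (all callables hypothetical)."""
    # False Positive: validate the alert before paging anyone.
    if is_false_positive(alert):
        return {"pattern": "false_positive", "page": False}

    # Assisted Diagnostics: gather context so a human starts with the data.
    report = collect_diagnostics(alert)

    # Auto Remediation: try a known fix; page only if it is missing or fails.
    if remediate is not None and remediate(alert):
        return {"pattern": "auto_remediation", "page": False, "report": report}

    return {"pattern": "assisted_diagnostics", "page": True, "report": report}
```

For example, calling `handle_alert(alert, check, diagnose)` without a `remediate` callable always ends in a page, but the on-call engineer receives the diagnostics report along with it.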

Impact

● Product
  ○ Reduced MTTR (Mean Time To Recover)
  ○ Safety - Reduced risk of human errors
  ○ Capture operational knowledge as code

● People
  ○ Reduced pager fatigue for developers
  ○ Increase in productivity
  ○ Morale

StackStorm Docs - http://docs.stackstorm.com/
StackStorm Slack Channel - https://stackstorm-community.slack.com/
Netflix OpenSource: https://netflix.github.io/

Check out our https://jobs.netflix.com page for current openings

… no more questions