A presentation at DevOpsDays Houston in in Houston, TX, USA by fen aldrich
Heresy & Evangelism Schism in the church of monitoring (@elastic) - Aaron Aldrich (@crayzeigh) 1
Hi! ! • • Community Advocate • aaron.aldrich@elastic.co • @CrayZeigh • noti.st/crayzeigh (slides will be here) 2
A word from our sponsor (@elastic) - Aaron Aldrich (@crayzeigh) 3
• We make: • Elasticsearch • Logstash • Kibana • Beats • Elastic APM (open tracing, ooo) • We host: • Elastic Search Service • Site Search • App Search • You can run it all where ever • Core is Free and/or free • We’re hiring (Fully Distributed, oooh, aaah) • Talk to us at the booth (@elastic) - Aaron Aldrich (@crayzeigh) 4
Let’s find out where we’re at. (@elastic) - Aaron Aldrich (@crayzeigh) 5
How many of you deal with monitoring as a job function? (@elastic) - Aaron Aldrich (@crayzeigh) 6
How many of you touch monitoring in some way? (@elastic) - Aaron Aldrich (@crayzeigh) 7
Uptime Performance/ Resource Utilization Response time? (@elastic) - Aaron Aldrich (@crayzeigh) 8
Why? (@elastic) - Aaron Aldrich (@crayzeigh) 9
Things Fall Apart * something about a slouching beast (@elastic) - Aaron Aldrich (@crayzeigh) 10
Incidents Suck (@elastic) - Aaron Aldrich (@crayzeigh) 11
(@elastic) - Aaron Aldrich (@crayzeigh) 12
(@elastic) - Aaron Aldrich (@crayzeigh) 13
(@elastic) - Aaron Aldrich (@crayzeigh) 14
(@elastic) - Aaron Aldrich (@crayzeigh) 15
(@elastic) - Aaron Aldrich (@crayzeigh) 16
100% (@elastic) - Aaron Aldrich (@crayzeigh) 17
99.999% (@elastic) - Aaron Aldrich (@crayzeigh) 18
Just a minute! (@elastic) - Aaron Aldrich (@crayzeigh) 19
Eine Minute, bitte! ! Stolen Joke, if you know where it’s from we’re probably friends (@elastic) - Aaron Aldrich (@crayzeigh) 20
NINES don’t matter… ~ Charity Majors (@mipsytipsy) (@elastic) - Aaron Aldrich (@crayzeigh) 21
(@elastic) - Aaron Aldrich (@crayzeigh) 22
NINES don’t matter when USERS aren’t HAPPY ~ Charity Majors (@mipsytipsy) (@elastic) - Aaron Aldrich (@crayzeigh) 23
She doesn’t care whether or not [the datacenter is literally on fire], just as long as the ship’s coming in. (@elastic) - Aaron Aldrich (@crayzeigh) 24
How does your business make money? (@elastic) - Aaron Aldrich (@crayzeigh) 25
How do you help? (@elastic) - Aaron Aldrich (@crayzeigh) 26
It is not enough to do your best; you must know what to do, and then do your best. - W. Edwards Deming (@elastic) - Aaron Aldrich (@crayzeigh) 27
DevOps is about delivering Value (@elastic) - Aaron Aldrich (@crayzeigh) 28
(@elastic) - Aaron Aldrich (@crayzeigh) 29
SRE (@elastic) - Aaron Aldrich (@crayzeigh) 30
(@elastic) - Aaron Aldrich (@crayzeigh) SLI SLO SLA 31
Services not systems (@elastic) - Aaron Aldrich (@crayzeigh) 32
(@elastic) - Aaron Aldrich (@crayzeigh) 33
Site Reliability Engineering • (SLI) What is availability? • (SLO) How much do we actually need? • (SLA) What happens when we’re not meeting this target? (@elastic) - Aaron Aldrich (@crayzeigh) 34
Site Reliability Engineering • (SLI) What is availability? • (SLO) How much do we actually need? • (SLA) What happens when we’re not meeting this target? (@elastic) - Aaron Aldrich (@crayzeigh) 35
Service Level Indicators • Is it up? • 200OK • latency • percentiles or medians for meaning (@elastic) - Aaron Aldrich (@crayzeigh) 36
Service Level Indicators • Is it up? ! • 200OK • latency • percentiles or medians for meaning ! Never trust averages, they hide data (@elastic) - Aaron Aldrich (@crayzeigh) 37
twitter slide: Never trust averages, they hide data (@elastic) - Aaron Aldrich (@crayzeigh) 38
The 99th percentile latency of requests received in the last five minutes <300 ms and responded to with a 200 status (@elastic) - Aaron Aldrich (@crayzeigh) 39
Service Level Objectives How much availability do we need? (@elastic) - Aaron Aldrich (@crayzeigh) 40
99% (@elastic) - Aaron Aldrich (@crayzeigh) 41
99.9% (@elastic) - Aaron Aldrich (@crayzeigh) 42
99.99% (@elastic) - Aaron Aldrich (@crayzeigh) 43
99.999% (@elastic) - Aaron Aldrich (@crayzeigh) 44
Each 9 is exponentially more expensive to provide (@elastic) - Aaron Aldrich (@crayzeigh) 45
availability avg per year avg per day 99% 3.65 days 14.4 minutes 99.9% 8.76 hours 1.44 minutes 99.99% 52.56 minutes 8.64 seconds 99.999% 5.25 minutes 863 ms (@elastic) - Aaron Aldrich (@crayzeigh) 46
A good SLO barely keeps users happy (these should be driving your alerts) (@elastic) - Aaron Aldrich (@crayzeigh) 47
Error Budgets (@elastic) - Aaron Aldrich (@crayzeigh) 48
It’s GOOD to have errors (@elastic) - Aaron Aldrich (@crayzeigh) 49
(@elastic) - Aaron Aldrich (@crayzeigh) 50
(@elastic) - Aaron Aldrich (@crayzeigh) 51
SLAs = (@elastic) - Aaron Aldrich (@crayzeigh) 52
SLAs = (@elastic) - Aaron Aldrich (@crayzeigh) 53
What about the fire? (@elastic) - Aaron Aldrich (@crayzeigh) 54
(@elastic) - Aaron Aldrich (@crayzeigh) 55
(@elastic) - Aaron Aldrich (@crayzeigh) 56
(@elastic) - Aaron Aldrich (@crayzeigh) 57
(@elastic) - Aaron Aldrich (@crayzeigh) 58
(@elastic) - Aaron Aldrich (@crayzeigh) 59
Observability (@elastic) - Aaron Aldrich (@crayzeigh) 60
O11y (@elastic) - Aaron Aldrich (@crayzeigh) 61
Isn’t it just monitoring with another name? (@elastic) - Aaron Aldrich (@crayzeigh) 62
no. (@elastic) - Aaron Aldrich (@crayzeigh) 63
Observability A system is observable when you can ask arbitrary questions about it and receive meaningful answers without having to resort to writing new code or command line tools. It lets you discover unknown-unknowns and debug in production. (@elastic) - Aaron Aldrich (@crayzeigh) 64
our tools must change with our systems. (@elastic) - Aaron Aldrich (@crayzeigh) 65
(@elastic) - Aaron Aldrich (@crayzeigh) 66
Traditional Architecture • Predictable • Obvious relationships • able to be easily modeled • System Health is an accurate predictor of user experience • Dashboards are useful and valuable • known-unknowns cover most issues (@elastic) - Aaron Aldrich (@crayzeigh) 67
(@elastic) - Aaron Aldrich (@crayzeigh) 68
Complex Systems • Always changing • Difficult or impossible to model • emergent behavior (unknown-unknowns) • non-linear relationships • feedback loops • can adapt and have memory • can be nested • System health and user experience are no longer directly related (@elastic) - Aaron Aldrich (@crayzeigh) 69
Root Cause is a myth (@elastic) - Aaron Aldrich (@crayzeigh) 70
(@elastic) - Aaron Aldrich (@crayzeigh) 71
Three Pillars of Observability • Metrics • Logs • APM (@elastic) - Aaron Aldrich (@crayzeigh) 72
These aren’t pillars. (@elastic) - Aaron Aldrich (@crayzeigh) 73
(@elastic) - Aaron Aldrich (@crayzeigh) 74
Three Pillars of Carpentry? stahp. (@elastic) - Aaron Aldrich (@crayzeigh) 75
They’re tools, not pillars You need to know how to use them (@elastic) - Aaron Aldrich (@crayzeigh) 76
Metrics • Great, not on their own ! • largely contextless • need further notation to be valuable (tags) • Easy to store lots of them • collection can be a pain ! Check out Open Metrics! openmetrics.io (@elastic) - Aaron Aldrich (@crayzeigh) 77
High Cardinality Data • UUIDs • raw queries • comments • firstname, lastname • PID/PPID • app ID • device ID • build ID • IP:port • shopping cart ID • userid (@elastic) - Aaron Aldrich (@crayzeigh) 78
One-in-a-million chances crop up nine times out of ten ~ Terry Pratchett (@elastic) - Aaron Aldrich (@crayzeigh) 79
What’s better at carrying Cardinality? (@elastic) - Aaron Aldrich (@crayzeigh) 80
Events! (@elastic) - Aaron Aldrich (@crayzeigh) 81
(Logs) (@elastic) - Aaron Aldrich (@crayzeigh) 82
But please not these: 64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] “GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1” 401 12846 64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] “GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1” 200 4523 64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] “GET /mailman/listinfo/hsdivision HTTP/1.1” 200 6291 64.242.88.10 - - [07/Mar/2004:16:11:58 -0800] “GET /twiki/bin/view/TWiki/WikiSyntax HTTP/1.1” 200 7352 64.242.88.10 - - [07/Mar/2004:16:20:55 -0800] “GET /twiki/bin/view/Main/DCCAndPostFix HTTP/1.1” 200 5253 64.242.88.10 - - [07/Mar/2004:16:23:12 -0800] “GET /twiki/bin/oops/TWiki/AppendixFileSystem?template=oopsmore¶m1=1.12¶m2=1.12 HTTP/1.1” 200 11382 64.242.88.10 - - [07/Mar/2004:16:24:16 -0800] “GET /twiki/bin/view/Main/PeterThoeny HTTP/1.1” 200 4924 64.242.88.10 - - [07/Mar/2004:16:29:16 -0800] “GET /twiki/bin/edit/Main/Header_checks?topicparent=Main.ConfigurationVariables HTTP/1.1” 401 12851 64.242.88.10 - - [07/Mar/2004:16:30:29 -0800] “GET /twiki/bin/attach/Main/OfficeLocations HTTP/1.1” 401 12851 64.242.88.10 - - [07/Mar/2004:16:31:48 -0800] “GET /twiki/bin/view/TWiki/WebTopicEditTemplate HTTP/1.1” 200 3732 64.242.88.10 - - [07/Mar/2004:16:32:50 -0800] “GET /twiki/bin/view/Main/WebChanges HTTP/1.1” 200 40520 64.242.88.10 - - [07/Mar/2004:16:33:53 -0800] “GET /twiki/bin/edit/Main/Smtpd_etrn_restrictions?topicparent=Main.ConfigurationVariables HTTP/1.1” 401 12851 64.242.88.10 - - [07/Mar/2004:16:35:19 -0800] “GET /mailman/listinfo/business HTTP/1.1” 200 6379 64.242.88.10 - - [07/Mar/2004:16:36:22 -0800] “GET /twiki/bin/rdiff/Main/WebIndex?rev1=1.2&rev2=1.1 HTTP/1.1” 200 46373 64.242.88.10 - - [07/Mar/2004:16:37:27 -0800] “GET /twiki/bin/view/TWiki/DontNotify HTTP/1.1” 200 4140 64.242.88.10 - - [07/Mar/2004:16:39:24 -0800] “GET /twiki/bin/view/Main/TokyoOffice HTTP/1.1” 200 3853 64.242.88.10 - - [07/Mar/2004:16:43:54 -0800] “GET /twiki/bin/view/Main/MikeMannix HTTP/1.1” 200 3686 64.242.88.10 - - [07/Mar/2004:16:45:56 -0800] “GET /twiki/bin/attach/Main/PostfixCommands HTTP/1.1” 401 12846 64.242.88.10 - - [07/Mar/2004:16:47:12 -0800] “GET /robots.txt HTTP/1.1” 200 68 64.242.88.10 - - [07/Mar/2004:16:47:46 -0800] “GET /twiki/bin/rdiff/Know/ReadmeFirst?rev1=1.5&rev2=1.4 HTTP/1.1” 200 5724 64.242.88.10 - - [07/Mar/2004:16:49:04 -0800] “GET /twiki/bin/view/Main/TWikiGroups?rev=1.2 HTTP/1.1” 200 5162 64.242.88.10 - - [07/Mar/2004:16:50:54 -0800] “GET /twiki/bin/rdiff/Main/ConfigurationVariables HTTP/1.1” 200 59679 64.242.88.10 - - [07/Mar/2004:16:52:35 -0800] “GET /twiki/bin/edit/Main/Flush_service_name?topicparent=Main.ConfigurationVariables HTTP/1.1” 401 12851 64.242.88.10 - - [07/Mar/2004:16:53:46 -0800] “GET /twiki/bin/rdiff/TWiki/TWikiRegistration HTTP/1.1” 200 34395 64.242.88.10 - - [07/Mar/2004:16:54:55 -0800] “GET /twiki/bin/rdiff/Main/NicholasLee HTTP/1.1” 200 7235 64.242.88.10 - - [07/Mar/2004:16:56:39 -0800] “GET /twiki/bin/view/Sandbox/WebHome?rev=1.6 HTTP/1.1” 200 8545 64.242.88.10 - - [07/Mar/2004:16:58:54 -0800] “GET /mailman/listinfo/administration HTTP/1.1” 200 6459 lordgun.org - - [07/Mar/2004:17:01:53 -0800] “GET /razor.html HTTP/1.1” 200 2869 64.242.88.10 - - [07/Mar/2004:17:09:01 -0800] “GET /twiki/bin/search/Main/SearchResult?scope=text®ex=on&search=Joris%20*Benschop[^A-Za-z] HTTP/1.1” 200 4284 (@elastic) - Aaron Aldrich (@crayzeigh) 83
Structured Data ! { “message”:”user_deleted”, “user”: { “id”:6, “email”:”crayzeigh@example.com”, “created_at”:”2015-12-11T04:31:46.828Z”, “updated_at”:”2015-12-11T04:32:18.340Z”, “name”:”crayzeigh”, “role”:”user”, “invitation_token”:null, “invitation_created_at”:null, “invitation_sent_at”:null, “invitation_accepted_at”:null, “invitation_limit”:null, “invited_by_id”:null, “invited_by_type”:null, “invitations_count”:0 }, “@timestamp”:”2015-12-11T13:35:50.070+00:00”, “@version”:”1”, “severity”:”INFO”, “host”:”app1-web1”, “type”:”apps” } ! from James Turnbull: https://www.kartar.net/2015/12/structured-logging/ (@elastic) - Aaron Aldrich (@crayzeigh) 84
Generate LOTS of events use sampling to store them (@elastic) - Aaron Aldrich (@crayzeigh) 85
OK let’s talk about APM (@elastic) - Aaron Aldrich (@crayzeigh) 86
Distributed Tracing ! Check out Open tracing fron CNCF: opentracing.io (@elastic) - Aaron Aldrich (@crayzeigh) 87
Instrumentation: SLIs are a good place to start (@elastic) - Aaron Aldrich (@crayzeigh) 88
Kill Staging: Test in Production (@elastic) - Aaron Aldrich (@crayzeigh) 89
(@elastic) - Aaron Aldrich (@crayzeigh) 90
This doesn’t eliminate QA or testing (please test before prod) (@elastic) - Aaron Aldrich (@crayzeigh) 91
Kill your staging environment • always out of sync • can’t replicate prod traffic anyway • definitely can’t replicate real users • replace with feature flags and canary deploys ! Launch Darkly talks about this a lot. You should listen to what they have to say. (@elastic) - Aaron Aldrich (@crayzeigh) 92
O11y ❤ ‘s QA Start leveraging a common toolset (@elastic) - Aaron Aldrich (@crayzeigh) 93
Every Dashboard sucks (@elastic) - Aaron Aldrich (@crayzeigh) 94
(@elastic) - Aaron Aldrich (@crayzeigh) 95
Not really, some dashboards are pretty good (@elastic) - Aaron Aldrich (@crayzeigh) 96
(@elastic) - Aaron Aldrich (@crayzeigh) 97
It’s about Storytelling know your audience (@elastic) - Aaron Aldrich (@crayzeigh) 98
Ops & Incident Response • Interactive • Iterative • Involve search bars (@elastic) - Aaron Aldrich (@crayzeigh) 99
Vendor Warning: Search & Common Data Schema (@elastic) - Aaron Aldrich (@crayzeigh) 100
Making O11y Evangelists (@elastic) - Aaron Aldrich (@crayzeigh) 101
Don’t just start making changes (@elastic) - Aaron Aldrich (@crayzeigh) 102
(@elastic) - Aaron Aldrich (@crayzeigh) 103
History is important (@elastic) - Aaron Aldrich (@crayzeigh) 104
Change conducted poorly breaks organizations (@elastic) - Aaron Aldrich (@crayzeigh) 105
top-down mandated change never works ☠ Did you know “defenestration” is the act of throwing someone out a window? (@elastic) - Aaron Aldrich (@crayzeigh) 106
Talk to other parts of the business to understand what stories they value (@elastic) - Aaron Aldrich (@crayzeigh) 107
LISTEN It’s all about context (@elastic) - Aaron Aldrich (@crayzeigh) 108
Start measuring business values (@elastic) - Aaron Aldrich (@crayzeigh) 109
Who else might care about dashboards? (@elastic) - Aaron Aldrich (@crayzeigh) 110
What data can we expose to the rest of the business? (@elastic) - Aaron Aldrich (@crayzeigh) 111
112
113
114
Dashboards help tell stories with context (@elastic) - Aaron Aldrich (@crayzeigh) 115
Share results Good and Bad (@elastic) - Aaron Aldrich (@crayzeigh) 116
Are your systems up? Are they responding acceptably? (@elastic) - Aaron Aldrich (@crayzeigh) 117
Who cares? (@elastic) - Aaron Aldrich (@crayzeigh) 118
(@elastic) - Aaron Aldrich (@crayzeigh) 119
Are your services delivering value? (@elastic) - Aaron Aldrich (@crayzeigh) 120
Monitoring and the Church of 9s has a new competing ideology, Observability and the SLO. Without taking a full 99 theses to explore the differences, this talk will explore the differences separating them and build new o11y evangelists.
The following resources were mentioned during the presentation or are useful additional information.
Landing page for CNCFs Open Metrics project
Landing page for the Open Tracing project.
Here’s what was said about this presentation on social media.
"Do you actually understand how your business makes money?" - @crayzeigh #HoustonDoesDevOps
— MadBlkMan @ DevOpsDays - Houston (@MadBlkMan) April 16, 2019
"We're not operating systems, we're building services" - @crayzeigh #HoustonDoesDevOps @DevOpsDaysHTown
— MadBlkMan @ DevOpsDays - Houston (@MadBlkMan) April 16, 2019
"Each 9 is exponentially more expensive to provide" -@crayzeigh #HoustonDoesDevOps @DevOpsDaysHTown
— MadBlkMan @ DevOpsDays - Houston (@MadBlkMan) April 16, 2019
I need to talk to @crayzeigh more about error budgets. I think he would have some good tips for me to add to my talk for @RefactrTech in June. #HoustonDoesDevOps
— MadBlkMan @ DevOpsDays - Houston (@MadBlkMan) April 16, 2019
Do you understand the difference between observability and monitoring? Cause I honestly didn't before @crayzeigh just explained it. #HoustonDoesDevOps
— MadBlkMan @ DevOpsDays - Houston (@MadBlkMan) April 16, 2019
Lots of valuable heresy being dropped by @crayzeigh at #devopsdayshtown @DevOpsDaysHTown
— Bearded Beauty @ DevOpsDays Houston (@gwaldo) April 16, 2019
"If you have a complex system then one-in-a-million chances happen all the time" - @crayzeigh #HoustonDoesDevOps
— MadBlkMan @ DevOpsDays - Houston (@MadBlkMan) April 16, 2019
The homie @crayzeigh! pic.twitter.com/V031TpfyWb
— jay @ DevOpsDays Houston (@jaydestro) April 16, 2019
Observability is basically Monitoring
— Bearded Beauty @ DevOpsDays Houston (@gwaldo) April 16, 2019
Events are basically Logs
Test in Production. *DO IT LIVE*@crayzeigh at #devopsdayshtown @DevOpsDaysHTown
DEATH TO STAGING ENVIRONMENTS!!! At least according to @crayzeigh #HoustonDoesDevOps
— MadBlkMan @ DevOpsDays - Houston (@MadBlkMan) April 16, 2019
“NOOOOBODY EXPECTS THE DEVOPS INQUISITION!!!”
— Bearded Beauty @ DevOpsDays Houston (@gwaldo) April 16, 2019
— not quite said by @crayzeigh at #HoustonDoesDevOps @DevOpsDaysHTown
Good talk by @crayzeigh on o11y (I learned a new x11y word!) #HoustonDoesDevOps
— Josh Masterson (@Josh_Masterson) April 16, 2019