By Our Powers Combined: Observability for Developers

A presentation at CodeMash 2020 in January 2020 in Sandusky, OH 44870, USA by Aaron Aldrich

Slide 1

BY OUR POWERS COMBINED: OBSERVABILITY FOR DEVELOPERS @CrayZeigh — #CodeMash2020

Slide 2

HI, CODEMASH!

theelasticast.com
@CrayZeigh
aaron.aldrich@elastic.co
noti.st/crayzeigh

Slide 3

OBSERVABILITY

Slide 4

DEVOPS

Slide 5

DEVOPS

Slide 6

Slide 7

Slide 8

DEVOPS

Slide 9

DEVOPS
> Wave 1: Ops learns code & automation

Slide 10

DEVOPS
> Wave 1: Ops learns code & automation
> Wave 2: Dev owns code through production [1]
[1] https://vimeo.com/341142053

Slide 11

*Simon Wardley: https://twitter.com/swardley/status/1014883354481741825?lang=en

Slide 12

https://dev.to/molly_struve/making-on-call-not-suck-490

Slide 13

SHARED LANGUAGE

Slide 14

SHARED TOOLS

Slide 15

SHARED SOURCE OF TRUTH

Slide 16

“Isn’t it just Monitoring with better SEO?” - You

Slide 17

YOU’RE NOT WRONG...

Slide 18

Slide 19

Slide 20

MONITORING
Tracking system health and watching for known failure conditions. Good for tracking known unknowns and generating failure alerts for operational issues. Debugging often requires further investigation of systems and failure state recreation. Finds horses.

Slide 21

Slide 22

Slide 23

THERE IS NO ROOT CAUSE

Slide 24

OBSERVABILITY
A system is observable when you can ask arbitrary questions about it and receive meaningful answers without having to resort to writing new code or command-line tools. It lets you discover unknown unknowns and debug in production. Helps debug zebras.

Slide 25

Software is inherently opaque; we have to instrument it to output meaningful information.

Slide 26

THE THREE PILLARS OF OBSERVABILITY

Slide 27

THE THREE PILLARS OF OBSERVABILITY
1. Logs

Slide 28

THE THREE PILLARS OF OBSERVABILITY
1. Logs
2. Metrics

Slide 29

THE THREE PILLARS OF OBSERVABILITY
1. Logs
2. Metrics
3. APM

Slide 30

THE THREE PILLARS OF OBSERVABILITY
1. Logs Events
2. Metrics
3. APM

Slide 31

THE THREE PILLARS OF OBSERVABILITY
1. Logs Events
2. Metrics
3. APM
4. Distributed Tracing

Slide 32

THE THREE PILLARS OF OBSERVABILITY
1. Logs Events
2. Metrics
3. APM & Distributed Tracing
4. Distributed Tracing

Slide 33

Slide 34

LOGS/METRICS/APM ARE THE MEDIA WE WORK IN

Slide 35

METRICS

Slide 36

Slide 37

EVENTS

Slide 38

CARDINALITY & YOU

Slide 39

EXISTING LOGS
66.249.65.159 - - [06/Nov/2014:19:10:38 +0600] "GET /news/53f8d72920ba2744fe873ebc.html HTTP/1.1" 404 177 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.3 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
66.249.65.62 - - [06/Nov/2014:19:12:14 +0600] "GET /?q=%E0%A6%A6%E0%A7%8B%E0%A7%9F%E0%A6%BE HTTP/1.1" 200 4356 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
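Lines like these carry useful fields, but only as positional text. A minimal sketch of pulling them out, assuming the lines follow the common "combined" access-log layout seen above; the function name `parseAccessLog` and the output field names are illustrative choices, not part of the talk:

```javascript
// Matches: ip ident user [timestamp] "METHOD path protocol" status bytes "referrer" "user agent"
const COMBINED_LOG =
  /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+|-) "([^"]*)" "([^"]*)"$/;

// Hypothetical helper: turns one access-log line into an object, or null if it does not match.
function parseAccessLog(line) {
  const m = COMBINED_LOG.exec(line);
  if (m === null) return null;
  return {
    client_ip: m[1],
    timestamp: m[4],
    method: m[5],
    path: m[6],
    protocol: m[7],
    status: Number(m[8]),
    bytes: m[9] === "-" ? 0 : Number(m[9]), // "-" means no body was sent
    referrer: m[10],
    user_agent: m[11],
  };
}

const sample =
  '66.249.65.3 - - [06/Nov/2014:19:11:24 +0600] "GET /?q=%E0%A6%AB%E0%A6%BE%E0%A7%9F%E0%A6%BE%E0%A6%B0 HTTP/1.1" 200 4223 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"';

const event = parseAccessLog(sample);
console.log(JSON.stringify(event, null, 2));
```

Parsing after the fact works, but every consumer has to agree on the pattern, and a line that deviates from it yields nothing.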

Slide 40

STRUCTURED LOGGING
{
  "@timestamp": "2019-08-04T12:30:04.000Z",
  …
  "container": {
    "image.id": "48f5af6667f3457be0a2c7814caefe21ed3c94fb94bd6243096b3a61ea502b1d",
    "version": "version"
  },
  "build.id": "efdd0b5e69b0742fa5e5bad0771df4d1df2459d1",
  …
  "transaction": "transaction_ID",
  "user": "importantPerson",
  "account": "0129",
  "os": "osx",
  …
  "api_endpoint": "endpoint",
  …
  "response": 400,
  …
  "message": "Some informative thing, probably more human readable and friendly, but difficult to parse"
}
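One way an application might emit events shaped like the one above is to merge a fixed context (build, OS) with per-request fields and write a single JSON object per line. This is a rough sketch, not the speaker's implementation and not any particular logging library's API; `buildEvent` and `createLogger` are hypothetical names:

```javascript
// Merge shared context and per-call fields into one flat event with a timestamp.
function buildEvent(context, fields, now = new Date()) {
  return { "@timestamp": now.toISOString(), ...context, ...fields };
}

// Hypothetical logger factory. `write` defaults to stdout, one JSON object per line.
function createLogger(context, write = (line) => console.log(line)) {
  return {
    log(fields) {
      const event = buildEvent(context, fields);
      write(JSON.stringify(event));
      return event;
    },
  };
}

// Context that is the same for every event from this process.
const logger = createLogger({
  "build.id": "efdd0b5e69b0742fa5e5bad0771df4d1df2459d1",
  os: "osx",
});

// Per-request fields, including identifiers such as user and transaction.
logger.log({
  transaction: "transaction_ID",
  user: "importantPerson",
  account: "0129",
  api_endpoint: "endpoint",
  response: 400,
  message: "Some informative thing",
});
```

Because each field is a named key, a question such as "which accounts saw a 400 on this endpoint for this build?" becomes a filter rather than a text search.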

Slide 41

APM & TRACING

Slide 42

Slide 43

Slide 44

Slide 45

SHARED LANGUAGE

Slide 46

SHARED LANGUAGE
> Learning to speak Prod

Slide 47

SHARED LANGUAGE
> Learning to speak Prod
> Teaching Prod to speak Dev (Structured Logs; Traces)

Slide 48

SLI/SLO/SLA

Slide 49

SYSTEMS RELIABILITY ENGINEERING

Slide 50

SYSTEMS RELIABILITY ENGINEERING
> SLI: Service Level Indicator

Slide 51

SYSTEMS RELIABILITY ENGINEERING
> SLI: Service Level Indicator
> SLO: Service Level Objective

Slide 52

SYSTEMS RELIABILITY ENGINEERING
> SLI: Service Level Indicator
> SLO: Service Level Objective
> SLA: Service Level Agreement

Slide 53

SYSTEMS RELIABILITY ENGINEERING
> SLI: Service Level Indicator
> SLO: Service Level Objective
> SLA: Service Level Agreement
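To show how the three terms relate, here is a small worked sketch. It assumes a request-based availability SLI (good requests divided by total requests) and a 99.9% SLO. The numbers, the function names, and the "error budget" figure are illustrative assumptions rather than anything stated on the slides. The SLA is the contractual layer on top and is not computed here.

```javascript
// SLI: the measured indicator, here the fraction of requests that succeeded.
function availabilitySli(good, total) {
  if (total === 0) return 1; // assumption: no traffic counts as fully available
  return good / total;
}

// SLO: the internal target the SLI is compared against.
function meetsSlo(sli, slo) {
  return sli >= slo;
}

// Share of the allowed failures still unused; negative once the budget is overspent.
function errorBudgetRemaining(good, total, slo) {
  const allowedFailures = (1 - slo) * total;
  const actualFailures = total - good;
  if (allowedFailures === 0) return actualFailures === 0 ? 1 : 0;
  return 1 - actualFailures / allowedFailures;
}

// Hypothetical month of traffic: 100,000 requests, 50 of them failed.
const good = 99950;
const total = 100000;
const slo = 0.999;

const sli = availabilitySli(good, total);
console.log("SLI:", sli);
console.log("Meets SLO:", meetsSlo(sli, slo));
console.log("Error budget remaining:", errorBudgetRemaining(good, total, slo));
```

With these inputs roughly half of the allowed failures have been used, which is the kind of figure both dev and ops can reason about when deciding whether to ship or stabilize.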

Slide 54

SHARED LANGUAGE
> Learning to speak Prod
> Teaching Prod to speak Dev (Structured Logs; Traces)

Slide 55

SHARED LANGUAGE
> Learning to speak Prod
> Teaching Prod to speak Dev (Structured Logs; Traces)
> Speaking directly to business value & Customer Experience

Slide 56

SHARED TOOLS

Slide 57

SHARED TOOLS
> Debugging in Prod

Slide 58

SHARED TOOLS
> Debugging in Prod
> Ops skills transferable and replicable

Slide 59

SHARED TOOLS
> Debugging in Prod
> Ops skills transferable and replicable
> New knowledge and methods shareable

Slide 60

CONVERGED TOOLSETS

Slide 61

CONVERGED TOOLSETS
> Single platform for all data means easy, improved debugging

Slide 62

CONVERGED TOOLSETS
> Single platform for all data means easy, improved debugging
> Any arbitrary business data can be added and correlated

Slide 63

CONVERGED TOOLSETS
> Single platform for all data means easy, improved debugging
> Any arbitrary business data can be added and correlated
> SIEM & InfoSec

Slide 64

Slide 65

CONVERGED TOOLSETS
⚠ Vendor Warning

Slide 66

CONVERGED TOOLSETS
> Single platform for all data means easy, improved debugging
⚠ Vendor Warning

Slide 67

CONVERGED TOOLSETS
> Single platform for all data means easy, improved debugging
> Any arbitrary business data can be added and correlated (ECS)
⚠ Vendor Warning

Slide 68

CONVERGED TOOLSETS
> Single platform for all data means easy, improved debugging
> Any arbitrary business data can be added and correlated (ECS)
> SIEM & InfoSec
⚠ Vendor Warning

Slide 69

SHARED SOURCE OF TRUTH

Slide 70

SHARED SOURCE OF TRUTH
> Real production data

Slide 71

SHARED SOURCE OF TRUTH
> Real production data
> Draw better lines from code to prod

Slide 72

SHARED SOURCE OF TRUTH
> Real production data
> Draw better lines from code to prod
> Write better, production-ready code

Slide 73

WHERE DO WE GO FROM HERE?

Slide 74

Testing & Experimentation

Slide 75

TEST IN PRODUCTION

Slide 76

Slide 77

FEATURE FLAGS
https://martinfowler.com/articles/feature-toggles.html

Slide 78

config.json
{
  "featureFlags": {
    "newThing": true
  }
}

app.js
// Bundlers resolve JSON imports directly; Node's native ES modules also need an import attribute.
import config from "./config.json"

const { featureFlags } = config

if (featureFlags.newThing) {
  // do the new thing!
} else {
  // do the old thing!
}

Slide 79

DON’T DEBATE, EXPERIMENT

Slide 80

Speaking of Testing [2]
[2] QA: https://theelasticast.com/episodes/0017-qa/

Slide 81

Slide 82

Chaos Engineering is about refining our mental models

Slide 83

DEVOPS
> Wave 1: Ops learns code & automation
> Wave 2: Dev owns code through production [1]
[1] https://vimeo.com/341142053

Slide 84

DEV OWNS CODE THROUGH PRODUCTION

Slide 85

DEV OWNS CODE THROUGH PRODUCTION
> Better, more production-ready code

Slide 86

DEV OWNS CODE THROUGH PRODUCTION
> Better, more production-ready code
> Real World experimentation

Slide 87

DEV OWNS CODE THROUGH PRODUCTION
> Better, more production-ready code
> Real World experimentation
> Improved operational resiliency

Slide 88

THE POWER IS YOURS

Slide 89

THANKS!
> Slides & References: noti.st/crayzeigh
> Trial: ela.st/aaron-aldrich-trial
> Come say, “Hi!” at the Elastic booth (N6)
> Check out Elastic APM @ 14:00 in Salon A!