Yet Another List of DevOps Resources

This is a list of DevOps resources that have influenced me. It’s more focused on AWS and data, because that is where I’ve spent the majority of my career, but I’ve included content on a variety of topics.

Big Picture Resources

DevOps Handbook: A great overview of the DevOps landscape. It focuses on workflows and how to make them smooth.
DORA Report: What an elite software delivery team looks like and does.
DevOps Culture: What really matters, in many ways, is having people and a culture that cares about operations. Rouan Wilsenach explains what that means.
SRE Vs DevOps: The differences between SRE and DevOps, and how they’re related.
Platform Engineering: Where ops is going and my favorite article explaining platform engineering.

Security

STRIDE model: I’ve found the STRIDE model really helpful for thinking about security questions. It provides a basic framework for what could go wrong.
OWASP Top 10: The most common security vulnerabilities in web applications.

Databases and Data Modeling

Use The Index Luke: Database performance is often treated like a dark art. It isn’t. Use The Index Luke is an excellent start to actually understanding how indexes work, and how to improve database performance for your application.
Modeling Data In DynamoDB: A step by step walkthrough of how to model your data in DynamoDB. This also applies to other key value stores.
Advanced Design Patterns for DynamoDB: The talk that the previous article is based on. Absolutely worth watching in its own right, as it shows the real wizardry that is possible.

Continuous Delivery

Going Faster With Continuous Delivery: Amazon talking about the improvements they saw with continuous delivery.
Ensuring Rollback Safety During Deployments: How to deploy stateful applications, and still be able to roll back.
10+ Deploys A Day: This talk by John Allspaw and Paul Hammond is DevOps canon at this point.

Containers

Dockerfile Best Practices: If you’re using containers, then they’re probably Docker images - make them suck less.
Kubernetes The Hard Way: Kelsey Hightower’s guide to setting up Kubernetes from scratch. Super useful for understanding what is actually going on under the hood.
Katacoda Kubernetes: Guided exercises on how to use Kubernetes. Super short and to the point.
Kubernetes Best Practices: Recommendations for maintaining and using Kubernetes clusters.

Networking

Beej’s Guide To Network Programming: A classic book on sockets and network programming using C.
IP Addressing and Subnetting For New Users: Networking is one of the last priesthoods left in computers. This gives you the basics.
Questions: The amazing Julia Evans has created flash cards that will help you learn about a wide variety of subjects, and it includes a bunch of important networking topics including HTTP and DNS.
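
For a feel of what subnetting means in practice, here is a minimal sketch using Python's standard `ipaddress` module. The address range and the `subnet_for` helper are my own illustrative choices, not taken from any of the resources above.

```python
import ipaddress

# 192.0.2.0/24 is a documentation-only range, used here purely as an example.
network = ipaddress.ip_network("192.0.2.0/24")

# Splitting a /24 into /26 blocks yields four subnets of 64 addresses each.
subnets = list(network.subnets(new_prefix=26))


def subnet_for(address):
    """Return the /26 subnet containing the given address, or None."""
    ip = ipaddress.ip_address(address)
    for subnet in subnets:
        if ip in subnet:
            return subnet
    return None
```

Calling `subnet_for("192.0.2.70")` returns the second block, `192.0.2.64/26`, because 70 falls between 64 and 127.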

Cloud

Load Shedding: An important technique for designing systems that maintain high throughput under overload, letting availability degrade gracefully instead of rapidly sliding to 0.
AWS White Papers: AWS’s white papers are an excellent source for how things in AWS work, and ideas for making your own cloud deploys better.
So You Want To Migrate To Another Region: Techniques to make migrating to another AWS region possible.
Amazon Builder’s Library: More resources from Amazon for building high scale applications.
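
To make the load shedding idea concrete, here is a rough sketch of one simple form of it: cap the number of in-flight requests and reject the rest quickly. The `LoadShedder` class, the `handle` function, and the 503-style response are hypothetical names for illustration, not how any particular AWS service implements it.

```python
import threading


class LoadShedder:
    """Rejects new work once a fixed number of requests are in flight."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Admit the request only if there is spare capacity.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            self._in_flight -= 1


def handle(shedder, work):
    """Run work() if capacity allows; otherwise fail fast."""
    if not shedder.try_acquire():
        # Shedding: a cheap rejection instead of queueing more work.
        return 503, "overloaded, try again later"
    try:
        return 200, work()
    finally:
        shedder.release()
```

The point of failing fast is that admitted requests keep finishing at normal speed, so throughput holds steady while the excess is turned away.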

Linux and Operating Systems

Operating Systems in Three Easy Pieces: An excellent introduction to how operating systems actually work. If you read the book and do the lab exercises, it will vastly increase your OS knowledge.
The Linux System Administrator’s Guide: Out of date in some places now, but still useful for understanding what is going on when operating a Linux system.
The Art Of The Command Line: Learning how to use the command line is one of those high value things that often isn’t formally taught.

Debugging and Reliability

How I got better at debugging: Julia Evans’s excellent advice on debugging and how to get better at it.
Computers Can Be Understood: Because computers don’t actually require you to sacrifice a goat in a pentagram to make them work - it just feels that way sometimes.
SLI SLA and SLO: All three of those acronyms get thrown around far too often in a confused manner. Here is what they mean.
Debugging Under Fire: An amazing talk by Bryan Cantrill about debugging a major production outage.

Systems Thinking and Design

Designing Data Intensive Applications: A classic on building distributed systems and an excellent blend of theory and practice about how they actually work.
Conway’s Law: How software architecture reflects the architecture of the organization that produces the software.
An Introduction to Wardley Value Chain Mapping: Wardley mapping is a useful exercise to understand a space, and what actions you should be taking in it.

Human Factors

Postmortem Culture: An explanation of what a good blameless postmortem looks like.
High-Performing Teams Need Psychological Safety. Here’s How to Create It: Psychological safety is one of those terms that I feel is pretty overused, but there’s value in the idea.
The Null Process: Why having process is important, and not just for giant companies!

Code Review

Google On Code Review: Code review best practices from Google. Lots of good food for thought on how to conduct and ask for code review.
How To Make Good Code Reviews Better: What a good code review looks like, and how to do it better.

Agile

Detecting Agile BS: If the Department of Defense can do it, so can you.
Why Scaling Agile Doesn’t Work: SAFe is bad. Convince your boss to watch this talk, and maybe get out of doing SAFe.

Automation

XKCD Automation: How much time savings are needed to justify automation.
Resilient Test Automation: How to automate stuff and convince people that it’ll be okay!
Eliminating Toil: From the Google SRE book on how to reduce the time spent on manual tasks.
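
The "how much time savings" question is just arithmetic, so here is a small sketch of it. The function name and the five-year default horizon are assumptions I'm making for illustration; pick whatever horizon fits how long you expect the task to stick around.

```python
def break_even_hours(seconds_saved_per_run, runs_per_day, horizon_days=5 * 365):
    """Hours you can spend automating before it costs more than it saves."""
    total_seconds_saved = seconds_saved_per_run * runs_per_day * horizon_days
    return total_seconds_saved / 3600
```

For example, shaving 30 seconds off a task you run 5 times a day works out to roughly 76 hours over five years, so anything under that is a net win.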

Incident Response

Being On Call: Being on call is stressful. This is an excellent guide to setting up teams for success.
Incident Management: From Atlassian, on the steps to take when managing an incident, as well as some good cultural practices for incident management.

Monitoring and Observability

A Next Step Beyond Test Driven Development: An excellent introduction to thinking about observability from Honeycomb, who are truly defining how it should be done.
Performance Is A Shape, Not A Number: A guide to thinking about and measuring performance.
Alerting on SLOs: The Google SRE book on how to handle alerting. An excellent introduction to the various options that you have for choosing when to trigger alerts.
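
One of the options for SLO alerting is to alert on burn rate: how fast you are consuming your error budget relative to plan. Here is a minimal sketch of that calculation. The function names and the threshold in `should_alert` are hypothetical; real thresholds and time windows are something you would tune for your own service.

```python
def error_budget(slo):
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(errors, total, slo):
    """Observed error rate divided by the error budget.

    A value of 1.0 means the budget is being spent exactly on schedule;
    higher values mean it will run out early.
    """
    if total == 0:
        return 0.0
    return (errors / total) / error_budget(slo)


def should_alert(errors, total, slo, threshold=2.0):
    # The threshold is an illustrative default, not a recommendation.
    return burn_rate(errors, total, slo) >= threshold
```

With a 99.9% SLO, 10 failures in 1,000 requests is a burn rate of about 10, which would exhaust a month's budget in roughly three days if it continued.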