Can a system ever truly be considered reliable if it isn’t fundamentally secure? Or can it be considered secure if it’s unreliable?
Successfully designing, implementing, and maintaining systems requires a commitment to the full system lifecycle. This commitment is possible only when security and reliability are central elements in the architecture of systems. Yet both are often afterthoughts, considered only after an incident occurs, resulting in expensive and sometimes difficult improvements.
Security by design is increasingly important in a world where many products are connected to the internet, and where cloud technologies are becoming more prevalent. The more we come to rely on those systems, the more reliable they need to be; the more trust we place in their security, the more secure they need to be.
We wanted to write a book that focuses on integrating security and reliability directly into the software and system lifecycle, both to highlight technologies and practices that protect systems and keep them reliable, and to illustrate how those practices interact with each other. The aim of this book is to provide insights about system design, implementation, and maintenance from practitioners who specialize in security and reliability.
We’d like to explicitly acknowledge that some of the strategies this book recommends require infrastructure support that simply may not exist where you’re currently working. Where possible, we recommend approaches that can be tailored to organizations of any size. However, we felt that it was important to start a conversation about how we can all evolve and improve existing security and reliability practices, as all the members of our growing and skilled community of professionals can learn a lot from one another. We hope other organizations will also be eager to share their successes and war stories with the community. As ideas about security and reliability evolve, the industry can benefit from a diverse set of implementation examples. Security and reliability engineering are still rapidly evolving fields. We constantly find conditions and cases that cause us to revise (or in some cases, replace) previously firmly held beliefs.
Because security and reliability are everyone’s responsibility, we’re targeting a broad audience: people who design, implement, and maintain systems. We’re challenging the dividing lines between the traditional professional roles of developers, architects, Site Reliability Engineers (SREs), systems administrators, and security engineers. While we’ll dive deeply into some subjects that might be more relevant to experienced engineers, we invite you—the reader—to try on different hats as you move through the chapters, imagining yourself in roles you (currently) don’t have and thinking about how you could improve your systems.
We argue that everyone should be thinking about the fundamentals of reliability and security from the very beginning of the development process, and integrating those principles early in the system lifecycle. This is a crucial concept that shapes this entire book. There are many lively active discussions in the industry about security engineers becoming more like software developers, and SREs and software developers becoming more like security engineers.1 We invite you to join in the conversation.
When we say “you” in the book, we mean the reader, independent of a particular job or experience level. This book challenges the traditional expectations of engineering roles and aims to empower you to be responsible for security and reliability throughout the whole product lifecycle. You shouldn’t worry about using all of the practices described here in your specific circumstances. Instead, we encourage you to return to this book at different stages of your career or throughout the evolution of your organization, considering whether ideas that didn’t seem valuable at first might be newly meaningful.
Building and adopting the widespread best practices we recommend in this book requires a culture that is supportive of such change. We feel it is essential that you address the culture of your organization in parallel with the technology choices you make to focus on both security and reliability, so that any adjustments you make are persistent and resilient. In our opinion, organizations that don’t embrace the importance of both security and reliability need to change—and revamping the culture of an organization in itself often demands an up-front investment.
We’ve woven technical best practices throughout the book and we support them with data, but it’s not possible to include data-backed cultural best practices. While this book calls out approaches that we think others can adapt or generalize, every organization has a distinct and unique culture. We discuss how Google has tried to work within its culture, but this may not be directly applicable to your organization. Instead, we encourage you to extract your own practical applications from the high-level recommendations we’ve included in this book.
While this book includes plenty of examples, it’s not a cookbook. It presents Google and industry stories, and shares what we’ve learned over the years. Everyone’s infrastructure is different, so you may need to significantly adapt some of the solutions we present, and some solutions may not apply to your organization at all. We try to present high-level principles and practical solutions that you can implement in a way that suits your unique environment.
We recommend you start with Chapter 1 and Chapter 2, and then read the chapters that most interest you. Most chapters begin with a boxed preface or executive summary that outlines the following:
The problem statement
When in the software development lifecycle you should apply these principles and practices
The intersections of and/or tradeoffs between reliability and security to consider
Within each chapter, topics are generally ordered from the most fundamental to the most sophisticated. We also call out deep dives and specialized subjects with an alligator icon.
This book recommends many tools or techniques considered to be good practice in the industry. Not every idea will be suitable for your particular use case, so you should evaluate the requirements of your project and design solutions adapted to your particular risk landscape.
While this book aims to be self-contained, you will find references to Site Reliability Engineering and The Site Reliability Workbook, where experts from Google describe how reliability is fundamental to service design. Reading these books may give you a deeper understanding of certain concepts but is not a prerequisite.
We hope you enjoy this book, and that some of the information in these pages can help you improve the reliability and security of your systems.
The following typographical conventions are used in this book:
Constant width
Constant width bold
Constant width italic
This element signifies a general note.
This icon indicates a deep dive.
For more than 40 years, O'Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.
Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O'Reilly and 200+ other publishers. For more information, please visit https://oreilly.com.
Please address comments and questions concerning this book to the publisher:
We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://shop.oreilly.com/product/0636920297550.do.
Email bookquestions@oreilly.com to comment or ask technical questions about this book.
For more information about our books, courses, conferences, and news, see our website at https://www.oreilly.com.
Find us on Facebook: https://facebook.com/oreilly
Follow us on Twitter: https://twitter.com/oreillymedia
Watch us on YouTube: https://www.youtube.com/oreillymedia
This book is the product of the enthusiastic and generous contributions of about 150 people, including authors, tech writers, chapter managers, and reviewers from engineering, legal, and marketing. The contributors span 18 time zones throughout the Americas, Europe, and Asia-Pacific, and more than 10 offices. We’d like to take a moment to thank everyone who isn’t already listed on a per-chapter basis.
As the leaders of Google Security and SRE, Gordon Chaffee, Royal Hansen, Ben Lutch, Sunil Potti, Dave Rensin, Benjamin Treynor Sloss, and Michael Wildpaner were the executive sponsors within Google. Their belief in a project that focuses on integrating security and reliability directly into the software and system lifecycle was essential to making this book happen.
This book would never have come to be without the drive and dedication of Ana Oprea. She recognized the value a book like this could have, initiated the idea at Google, evangelized it to SRE and Security leaders, and organized the vast amount of work necessary to make it happen.
We’d like to recognize the people who contributed by providing thoughtful input, discussion, and review. In chapter order, they are:
Chapter 1, The Intersection of Security and Reliability: Felipe Cabrera, Perry The Cynic, and Amanda Walker
Chapter 2, Understanding Adversaries: John Asante, Shane Huntley, and Mike Koivunen
Chapter 3, Case Study: Safe Proxies: Amaya Booker, Michał Czapiński, Scott Dier, and Rainer Wolafka
Chapter 4, Design Tradeoffs: Felipe Cabrera, Douglas Colish, Peter Duff, Cory Hardman, Ana Oprea, and Sergey Simakov
Chapter 5, Design for Least Privilege: Paul Guglielmino and Matthew Sachs
Chapter 6, Design for Understandability: Douglas Colish, Paul Guglielmino, Cory Hardman, Sergey Simakov, and Peter Valchev
Chapter 7, Design for a Changing Landscape: Adam Bacchus, Brandon Baker, Amanda Burridge, Greg Castle, Piotr Lewandowski, Mark Lodato, Dan Lorenc, Damian Menscher, Ankur Rathi, Daniel Rebolledo Samper, Michee Smith, Sampath Srinivas, Kevin Stadmeyer, and Amanda Walker
Chapter 8, Design for Resilience: Pierre Bourdon, Perry The Cynic, Jim Higgins, August Huber, Piotr Lewandowski, Ana Oprea, Adam Stubblefield, Seth Vargo, and Toby Weingartner
Chapter 9, Design for Recovery: Ana Oprea and JC van Winkel
Chapter 10, Mitigating Denial-of-Service Attacks: Zoltan Egyed, Piotr Lewandowski, and Ana Oprea
Chapter 11, Case Study: Designing, Implementing, and Maintaining a Publicly Trusted CA: Heather Adkins, Betsy Beyer, Ana Oprea, and Ryan Sleevi
Chapter 12, Writing Code: Douglas Colish, Felix Gröbert, Christoph Kern, Max Luebbe, Sergey Simakov, and Peter Valchev
Chapter 13, Testing Code: Douglas Colish, Daniel Fabian, Adrien Kunysz, Sergey Simakov, and JC van Winkel
Chapter 14, Deploying Code: Brandon Baker, Max Luebbe, and Federico Scrinzi
Chapter 15, Investigating Systems: Oliver Barrett, Pierre Bourdon, and Sandra Raicevic
Chapter 16, Disaster Planning: Heather Adkins, John Asante, Tim Craig, and Max Luebbe
Chapter 17, Crisis Management: Heather Adkins, Johan Berggren, John Lunney, James Nettesheim, Aaron Peterson, and Sara Smollet
Chapter 18, Recovery and Aftermath: Johan Berggren, Matt Linton, Michael Sinno, and Sara Smollett
Chapter 19, Case Study: Chrome Security Team: Abhishek Arya, Will Harris, Chris Palmer, Carlos Pizano, Adrienne Porter Felt, and Justin Schuh
Chapter 20, Understanding Roles and Responsibilities: Angus Cameron, Daniel Fabian, Vera Haas, Royal Hansen, Jim Higgins, August Huber, Artur Janc, Michael Janosko, Mike Koivunen, Max Luebbe, Ana Oprea, Andrew Pollock, Laura Posey, Sara Smollett, Peter Valchev, and Eduardo Vela Nava
Chapter 21, Building a Culture of Security and Reliability: David Challoner, Artur Janc, Christoph Kern, Mike Koivunen, Kostya Serebryany, and Dave Weinstein
We’d also like to especially thank Andrey Silin for his guidance throughout the book.
The following reviewers provided valuable insight and feedback to guide us along the way: Heather Adkins, Kristin Berdan, Shaudy Danaye-Armstrong, Michelle Duffy, Jim Higgins, Rob Mann, Robert Morlino, Lee-Anne Mulholland, Dave O’Connor, Charles Proctor, Olivia Puerta, John Reese, Pankaj Rohatgi, Brittany Stagnaro, Adam Stubblefield, Todd Underwood, and Mia Vu. A special thanks to JC van Winkel for performing a book-level consistency review.
We are also grateful to the following contributors, who supplied significant expertise or resources, or had some otherwise excellent effect on this work: Ava Katushka, Kent Kawahara, Kevin Mould, Jennifer Petoff, Tom Supple, Salim Virji, and Merry Yen.
External directional review from Eric Grosse helped us strike a good balance between novelty and practical advice. We very much appreciate his guidance, along with the thoughtful feedback we received from industry reviewers for the whole book: Blake Bisset, David N. Blank-Edelman, Jennifer Davis, and Kelly Shortridge. The in-depth reviews of the following people made each chapter better targeted to an external audience: Kurt Andersen, Andrea Barberio, Akhil Behl, Alex Blewitt, Chris Blow, Josh Branham, Angelo Failla, Tony Godfrey, Marco Guerri, Andrew Hoffman, Steve Huff, Jennifer Janesko, Andrew Kalat, Thomas A. Limoncelli, Allan Liska, John Looney, Niall Richard Murphy, Lukasz Siudut, Jennifer Stevens, Mark van Holsteijn, and Wietse Venema.
We would like to extend a special thanks to Shylaja Nukala and Paul Blankinship, who generously committed the time and skills of the SRE and security technical writing teams.
Finally, we’d like to thank the following contributors who worked on content that doesn’t appear directly in this book: Heather Adkins, Amaya Booker, Pierre Bourdon, Alex Bramley, Angus Cameron, David Challoner, Douglas Colish, Scott Dier, Fanuel Greab, Felix Gröbert, Royal Hansen, Jim Higgins, August Huber, Kris Hunt, Artur Janc, Michael Janosko, Hunter King, Mike Koivunen, Susanne Landers, Roxana Loza, Max Luebbe, Thomas Maufer, Shylaja Nukala, Ana Oprea, Massimiliano Poletto, Andrew Pollock, Laura Posey, Sandra Raicevic, Fatima Rivera, Steven Roddis, Julie Saracino, David Seidman, Fermin Serna, Sergey Simakov, Sara Smollett, Johan Strumpfer, Peter Valchev, Cyrus Vesuna, Janet Vong, Jakub Warmuz, Andy Warner, and JC van Winkel.
Thanks also to the O’Reilly Media team—Virginia Wilson, Kristen Brown, John Devins, Colleen Lobner, and Nikki McDonald—for their help and support in making this book a reality. Thanks to Rachel Head for a fantastic copyediting experience!
Finally, the book’s core team would also like to personally thank the following people: