Table of Contents

  • Praise and Dedication
  • Foreword by Royal Hansen
  • Foreword by Michael Wildpaner
  • Preface
    • Why We Wrote This Book
    • Who This Book Is For
    • A Note About Culture
    • How to Read This Book
    • Conventions Used in This Book
    • O'Reilly Online Learning
    • How to Contact Us
    • Acknowledgments
  • Part I. Introductory Material
  • 1. The Intersection of Security and Reliability
    • On Passwords and Power Drills
    • Reliability Versus Security: Design Considerations
    • Confidentiality, Integrity, Availability
      • Confidentiality
      • Integrity
      • Availability
    • Reliability and Security: Commonalities
      • Invisibility
      • Assessment
      • Simplicity
      • Evolution
      • Resilience
      • From Design to Production
      • Investigating Systems and Logging
      • Crisis Response
      • Recovery
    • Conclusion
  • 2. Understanding Adversaries
    • Attacker Motivations
    • Attacker Profiles
      • Hobbyists
      • Vulnerability Researchers
      • Governments and Law Enforcement
      • Activists
      • Criminal Actors
      • Automation and Artificial Intelligence
      • Insiders
    • Attacker Methods
      • Threat Intelligence
      • Cyber Kill Chains™
      • Tactics, Techniques, and Procedures
    • Risk Assessment Considerations
    • Conclusion
  • Part II. Designing Systems
  • 3. Case Study: Safe Proxies
    • Safe Proxies in Production Environments
    • Google Tool Proxy
    • Conclusion
  • 4. Design Tradeoffs
    • Design Objectives and Requirements
      • Feature Requirements
      • Nonfunctional Requirements
      • Features Versus Emergent Properties
      • Example: Google Design Document
    • Balancing Requirements
      • Example: Payment Processing
    • Managing Tensions and Aligning Goals
      • Example: Microservices and the Google Web Application Framework
      • Aligning Emergent-Property Requirements
    • Initial Velocity Versus Sustained Velocity
    • Conclusion
  • 5. Design for Least Privilege
    • Concepts and Terminology
      • Least Privilege
      • Zero Trust Networking
      • Zero Touch
    • Classifying Access Based on Risk
    • Best Practices
      • Small Functional APIs
      • Breakglass
      • Auditing
      • Testing and Least Privilege
      • Diagnosing Access Denials
      • Graceful Failure and Breakglass Mechanisms
    • Worked Example: Configuration Distribution
      • POSIX API via OpenSSH
      • Software Update API
      • Custom OpenSSH ForceCommand
      • Custom HTTP Receiver (Sidecar)
      • Custom HTTP Receiver (In-Process)
      • Tradeoffs
    • Authorization Decisions
      • Using Advanced Authorization Controls
      • Investing in a Widely Used Authorization Framework
      • Avoiding Potential Pitfalls
    • Advanced Controls
      • Multi-Party Authorization (MPA)
      • Three-Factor Authorization (3FA)
      • Business Justifications
      • Temporary Access
      • Proxies
    • Tradeoffs and Tensions
      • Increased Security Complexity
      • Impact on Collaboration and Company Culture
      • Quality Data and Systems That Impact Security
      • Impact on User Productivity
      • Impact on Developer Complexity
    • Conclusion
  • 6. Design for Understandability
    • Why Is Understandability Important?
      • System Invariants
      • Analyzing Invariants
      • Mental Models
    • Designing Understandable Systems
      • Complexity Versus Understandability
      • Breaking Down Complexity
      • Centralized Responsibility for Security and Reliability Requirements
    • System Architecture
      • Understandable Interface Specifications
      • Understandable Identities, Authentication, and Access Control
      • Security Boundaries
    • Software Design
      • Using Application Frameworks for Service-Wide Requirements
      • Understanding Complex Data Flows
      • Considering API Usability
    • Conclusion
  • 7. Design for a Changing Landscape
    • Types of Security Changes
    • Designing Your Change
    • Architecture Decisions to Make Changes Easier
      • Keep Dependencies Up to Date and Rebuild Frequently
      • Release Frequently Using Automated Testing
      • Use Containers
      • Use Microservices
    • Different Changes: Different Speeds, Different Timelines
      • Short-Term Change: Zero-Day Vulnerability
      • Medium-Term Change: Improvement to Security Posture
      • Long-Term Change: External Demand
    • Complications: When Plans Change
    • Example: Growing Scope—Heartbleed
    • Conclusion
  • 8. Design for Resilience
    • Design Principles for Resilience
    • Defense in Depth
      • The Trojan Horse
      • Google App Engine Analysis
    • Controlling Degradation
      • Differentiate Costs of Failures
      • Deploy Response Mechanisms
      • Automate Responsibly
    • Controlling the Blast Radius
      • Role Separation
      • Location Separation
      • Time Separation
    • Failure Domains and Redundancies
      • Failure Domains
      • Component Types
      • Controlling Redundancies
    • Continuous Validation
      • Validation Focus Areas
      • Validation in Practice
    • Practical Advice: Where to Begin
    • Conclusion
  • 9. Design for Recovery
    • What Are We Recovering From?
      • Random Errors
      • Accidental Errors
      • Software Errors
      • Malicious Actions
    • Design Principles for Recovery
      • Design to Go as Quickly as Possible (Guarded by Policy)
      • Limit Your Dependencies on External Notions of Time
      • Rollbacks Represent a Tradeoff Between Security and Reliability
      • Use an Explicit Revocation Mechanism
      • Know Your Intended State, Down to the Bytes
      • Design for Testing and Continuous Validation
    • Emergency Access
      • Access Controls
      • Communications
      • Responder Habits
    • Unexpected Benefits
    • Conclusion
  • 10. Mitigating Denial-of-Service Attacks
    • Strategies for Attack and Defense
      • Attacker’s Strategy
      • Defender’s Strategy
    • Designing for Defense
      • Defendable Architecture
      • Defendable Services
    • Mitigating Attacks
      • Monitoring and Alerting
      • Graceful Degradation
      • A DoS Mitigation System
      • Strategic Response
    • Dealing with Self-Inflicted Attacks
      • User Behavior
      • Client Retry Behavior
    • Conclusion
  • Part III. Implementing Systems
  • 11. Case Study: Designing, Implementing, and Maintaining a Publicly Trusted CA
    • Background on Publicly Trusted Certificate Authorities
    • Why Did We Need a Publicly Trusted CA?
    • The Build or Buy Decision
    • Design, Implementation, and Maintenance Considerations
      • Programming Language Choice
      • Complexity Versus Understandability
      • Securing Third-Party and Open Source Components
      • Testing
      • Resiliency for the CA Key Material
      • Data Validation
    • Conclusion
  • 12. Writing Code
    • Frameworks to Enforce Security and Reliability
      • Benefits of Using Frameworks
      • Example: Framework for RPC Backends
    • Common Security Vulnerabilities
      • SQL Injection Vulnerabilities: TrustedSqlString
      • Preventing XSS: SafeHtml
    • Lessons for Evaluating and Building Frameworks
      • Simple, Safe, Reliable Libraries for Common Tasks
      • Rollout Strategy
    • Simplicity Leads to Secure and Reliable Code
      • Avoid Multilevel Nesting
      • Eliminate YAGNI Smells
      • Repay Technical Debt
      • Refactoring
    • Security and Reliability by Default
      • Choose the Right Tools
      • Use Strong Types
      • Sanitize Your Code
    • Conclusion
  • 13. Testing Code
    • Unit Testing
      • Writing Effective Unit Tests
      • When to Write Unit Tests
      • How Unit Testing Affects Code
    • Integration Testing
      • Writing Effective Integration Tests
    • Dynamic Program Analysis
    • Fuzz Testing
      • How Fuzz Engines Work
      • Writing Effective Fuzz Drivers
      • An Example Fuzzer
      • Continuous Fuzzing
    • Static Program Analysis
      • Automated Code Inspection Tools
      • Integration of Static Analysis in the Developer Workflow
      • Abstract Interpretation
      • Formal Methods
    • Conclusion
  • 14. Deploying Code
    • Concepts and Terminology
    • Threat Model
    • Best Practices
      • Require Code Reviews
      • Rely on Automation
      • Verify Artifacts, Not Just People
      • Treat Configuration as Code
    • Securing Against the Threat Model
    • Advanced Mitigation Strategies
      • Binary Provenance
      • Provenance-Based Deployment Policies
      • Verifiable Builds
      • Deployment Choke Points
      • Post-Deployment Verification
    • Practical Advice
      • Take It One Step at a Time
      • Provide Actionable Error Messages
      • Ensure Unambiguous Provenance
      • Create Unambiguous Policies
      • Include a Deployment Breakglass
    • Securing Against the Threat Model, Revisited
    • Conclusion
  • 15. Investigating Systems
    • From Debugging to Investigation
      • Example: Temporary Files
      • Debugging Techniques
      • What to Do When You’re Stuck
      • Collaborative Debugging: A Way to Teach
      • How Security Investigations and Debugging Differ
    • Collect Appropriate and Useful Logs
      • Design Your Logging to Be Immutable
      • Take Privacy into Consideration
      • Determine Which Security Logs to Retain
      • Budget for Logging
    • Robust, Secure Debugging Access
      • Reliability
      • Security
    • Conclusion
  • Part IV. Maintaining Systems
  • 16. Disaster Planning
    • Defining “Disaster”
    • Dynamic Disaster Response Strategies
    • Disaster Risk Analysis
    • Setting Up an Incident Response Team
      • Identify Team Members and Roles
      • Establish a Team Charter
      • Establish Severity and Priority Models
      • Define Operating Parameters for Engaging the IR Team
      • Develop Response Plans
      • Create Detailed Playbooks
      • Ensure Access and Update Mechanisms Are in Place
    • Prestaging Systems and People Before an Incident
      • Configuring Systems
      • Training
      • Processes and Procedures
    • Testing Systems and Response Plans
      • Auditing Automated Systems
      • Conducting Nonintrusive Tabletops
      • Testing Response in Production Environments
      • Red Team Testing
      • Evaluating Responses
    • Google Examples
      • Test with Global Impact
      • DiRT Exercise Testing Emergency Access
      • Industry-Wide Vulnerabilities
    • Conclusion
  • 17. Crisis Management
    • Is It a Crisis or Not?
      • Triaging the Incident
      • Compromises Versus Bugs
    • Taking Command of Your Incident
      • The First Step: Don’t Panic!
      • Beginning Your Response
      • Establishing Your Incident Team
      • Operational Security
      • Trading Good OpSec for the Greater Good
      • The Investigative Process
    • Keeping Control of the Incident
      • Parallelizing the Incident
      • Handovers
      • Morale
    • Communications
      • Misunderstandings
      • Hedging
      • Meetings
      • Keeping the Right People Informed with the Right Levels of Detail
    • Putting It All Together
      • Triage
      • Declaring an Incident
      • Communications and Operational Security
      • Beginning the Incident
      • Handover
      • Handing Back the Incident
      • Preparing Communications and Remediation
      • Closure
    • Conclusion
  • 18. Recovery and Aftermath
    • Recovery Logistics
    • Recovery Timeline
    • Planning the Recovery
      • Scoping the Recovery
      • Recovery Considerations
      • Recovery Checklists
    • Initiating the Recovery
      • Isolating Assets (Quarantine)
      • System Rebuilds and Software Upgrades
      • Data Sanitization
      • Recovery Data
      • Credential and Secret Rotation
    • After the Recovery
      • Postmortems
    • Examples
      • Compromised Cloud Instances
      • Large-Scale Phishing Attack
      • Targeted Attack Requiring Complex Recovery
    • Conclusion
  • Part V. Organization and Culture
  • 19. Case Study: Chrome Security Team
    • Background and Team Evolution
    • Security Is a Team Responsibility
    • Help Users Safely Navigate the Web
    • Speed Matters
    • Design for Defense in Depth
    • Be Transparent and Engage the Community
    • Conclusion
  • 20. Understanding Roles and Responsibilities
    • Who Is Responsible for Security and Reliability?
      • The Roles of Specialists
      • Understanding Security Expertise
      • Certifications and Academia
    • Integrating Security into the Organization
      • Embedding Security Specialists and Security Teams
      • Example: Embedding Security at Google
      • Special Teams: Blue and Red Teams
      • External Researchers
    • Conclusion
  • 21. Building a Culture of Security and Reliability
    • Defining a Healthy Security and Reliability Culture
      • Culture of Security and Reliability by Default
      • Culture of Review
      • Culture of Awareness
      • Culture of Yes
      • Culture of Inevitably
      • Culture of Sustainability
    • Changing Culture Through Good Practice
      • Align Project Goals and Participant Incentives
      • Reduce Fear with Risk-Reduction Mechanisms
      • Make Safety Nets the Norm
      • Increase Productivity and Usability
      • Overcommunicate and Be Transparent
      • Build Empathy
    • Convincing Leadership
      • Understand the Decision-Making Process
      • Build a Case for Change
      • Pick Your Battles
      • Escalations and Problem Resolution
    • Conclusion
  • Conclusion
  • Appendix. A Disaster Risk Assessment Matrix
  • Index
  • About the Editors