Reliability Engineering in the Cloud by Pearson

Course details

In this course, develop valuable, in-demand skills that you can apply as a reliability engineer. Discover engineering strategies for promoting chaos engineering practices, observability and monitoring techniques, disaster recovery exercises, reliability metrics, fast data-driven decision-making, and more. Along the way, learn how to automate operations to improve time to restore and time to detect using modern cloud services, LLMs, and best-in-class tools. This course is an ideal fit for software engineers and development teams responsible for designing, deploying, or maintaining cloud-native applications.

Learning objectives
Set an enterprise-wide cloud reliability engineering (CRE) strategy for thousands of applications and dependencies.
Evaluate methods to increase the reliability and scalability of systems in the cloud.
Accelerate incident response while automating operations to improve time to restore and time to detect as far as possible.
Recognize that operational agility and chaos experimentation bring a culture of continuous improvement built on collaboration and knowledge sharing between teams.
Build effective strategies that promote chaos engineering practices, observability and monitoring techniques, disaster recovery exercises, reliability metrics, and fast data-driven decision-making, supported by practical examples of techniques and tooling.
Identify domain-specific approaches and review examples that you can apply to your organization and teams.

Concepts

Introduction

Cloud-based reliability engineering

How to Design, Build, Operate, and Stress-Test Highly Reliable Systems

Learning objectives
Defining resilience, reliability, engineering, and engineering excellence
Ensuring engineering excellence in the cloud - Why your business can't succeed without it
Understanding how to design and build resilient and reliable systems
Understanding how to test the resilience of your applications
Responding to and mitigating potential issues
Understanding how to leverage artificial intelligence (AI) and large language models (LLMs)
Lesson 1 review and an exercise

Defining Engineering Strategies for Building Resilient, Available, and Scalable Systems

Learning objectives
Understanding the foundational concepts of reliability, such as fault tolerance, high availability, scalability, and recovery
Choosing between alternatives for uptime and architecture design
Implementing service level objectives (SLOs) and service level indicators (SLIs) as performance measurements
Exploring immutable infrastructure, containerization, and event-driven architecture
Validating application and infrastructure resilience with chaos engineering and other modern techniques
Lesson 2 review and an exercise

The Power of Artificial Intelligence, Value Streams, and Cloud Reliability Engineering (CRE)

Learning objectives
Understanding foundational AI components
Applying ML and GenAI to CRE
Incorporating value streams and the CRE strategy
Fostering a culture of innovation - leadership, ownership, and fast decision-making
Lesson 3 review and an exercise

Leveraging Observability, Monitoring, and Reliability Metrics

Learning objectives
Defining observability and monitoring
Deploying a 10-step process to create effective monitoring
Surveying monitoring and alerting tools from leading cloud providers
Recognizing and proactively mitigating known service disruptions
Establishing objectives and key results (OKRs)
Lesson 4 review and an exercise

CRE Tooling and Chaos Engineering

Learning objectives
Distributing load with autoscaling and load balancing
Enabling automatic failovers for high availability
Implementing continuous deployments with rollback strategies
Leveraging chaos engineering for resilience testing
Lesson 5 review and an exercise

Incident Response for Fast Recovery

Learning objectives
Understanding incident response foundational concepts
Implementing a structured approach to incident response and CRE tools
Understanding incident handling in CRE
Defining time to detect (TTD) and time to recover (TTR)
Understanding playbooks and runbooks
Lesson 6 review and an exercise

Operational Excellence and Change Management

Learning objectives
Defining operational excellence in CRE
Identifying processes, people, and tools for operational excellence
Establishing key performance indicators
Understanding root cause analysis (RCA) and the correction of error (CoE) form
Identifying tools for operational excellence assessments
Lesson 7 review and an exercise

Conclusion

Summary and next steps