Reliability Engineering in the Cloud by Pearson

4h 55m | Intermediate | 2026-02-19

Authors

Pearson

Course details

In this course, develop valuable, in-demand skills that you can apply as a reliability engineer. Discover engineering strategies for promoting chaos engineering practices, observability and monitoring techniques, disaster recovery exercises, reliability metrics, fast data-driven decision-making, and more. Along the way, learn how to automate operations to improve time to restore and time to detect using modern cloud services, LLMs, and best-in-class tools. This course is an ideal fit for software engineers and development teams responsible for designing, deploying, or maintaining cloud-native applications.

Learning objectives
Set an enterprise-wide CRE strategy for thousands of applications and dependencies.
Evaluate methods to increase the reliability and scalability of systems in the cloud.
Ignite faster incident response while automating operations to improve time to restore and time to detect as far as possible.
Recognize that operational agility and chaos experimentation bring a culture of continuous improvement built on collaboration and knowledge sharing between teams.
Build effective strategies that promote chaos engineering practices, observability and monitoring techniques, disaster recovery exercises, reliability metrics, and fast data-driven decision-making, supported by practical examples of techniques and tooling.
Identify domain-specific approaches and review examples to apply to your organizations and teams.

Concepts

Introduction

  • Cloud-based reliability engineering

How to Design, Build, Operate, and Stress-Test Highly Reliable Systems

  • Learning objectives
  • Defining resilience, reliability, engineering, and engineering excellence
  • Ensuring engineering excellence in the cloud - Why your business can't succeed without it
  • Understanding how to design and build resilient and reliable systems
  • Understanding how to test the resilience of your applications
  • Responding to and mitigating potential issues
  • Understanding how to leverage artificial intelligence (AI) and large language models (LLMs)
  • Lesson 1 review and an exercise

Defining Engineering Strategies for Building Resilient, Available, and Scalable Systems

  • Learning objectives
  • Understanding the foundational concepts of reliability, such as fault tolerance, high availability, scalability, and recovery
  • Choosing between alternatives for uptime and architecture design
  • Implementing service level objectives (SLOs) and service level indicators (SLIs) as performance measurements
  • Exploring immutable infrastructure, containerization, and event-driven architecture
  • Validating application and infrastructure resilience with chaos engineering and other modern techniques
  • Lesson 2 review and an exercise

The Power of Artificial Intelligence, Value Streams, and Cloud Reliability Engineering (CRE)

  • Learning objectives
  • Understanding foundational AI components
  • Applying ML and GenAI to CRE
  • Incorporating value streams and the CRE strategy
  • Fostering a culture of innovation - leadership, ownership, and fast decision-making
  • Lesson 3 review and an exercise

Leveraging Observability, Monitoring, and Reliability Metrics

  • Learning objectives
  • Defining observability and monitoring
  • Deploying a 10-step process to create effective monitoring
  • Surveying monitoring and alerting tools from leading cloud providers
  • Recognizing and proactively mitigating known service disruptions
  • Establishing objectives and key results (OKRs)
  • Lesson 4 review and an exercise

CRE Tooling and Chaos Engineering

  • Learning objectives
  • Distributing load with autoscaling and load balancing
  • Enabling automatic failovers for high availability
  • Implementing continuous deployments with rollback strategies
  • Leveraging chaos engineering for resilience testing
  • Lesson 5 review and an exercise

Incident Response for Fast Recovery

  • Learning objectives
  • Understanding incident response foundational concepts
  • Implementing a structured approach to incident response and CRE tools
  • Understanding incident handling in CRE
  • Defining time to detect (TTD) and time to recover (TTR)
  • Understanding playbooks and runbooks
  • Lesson 6 review and an exercise

Operational Excellence and Change Management

  • Learning objectives
  • Defining operational excellence in CRE
  • Identifying processes, people, and tools for operational excellence
  • Establishing key performance indicators
  • Understanding root cause analysis (RCA) and the correction of error (CoE) form
  • Identifying tools for operational excellence assessments
  • Lesson 7 review and an exercise

Conclusion

  • Summary and next steps
