The Azure Well-Architected Framework is a set of guidelines spanning five key pillars that can be used to optimise your workloads. In the previous blogs we covered Reliability, Security and Cost Optimisation alongside relevant services, processes and assessments. This time we’ll focus on the Operational Excellence pillar of the framework.
Overview of Operational Excellence
The services and technologies you use in the cloud differ hugely from those on-premises. What doesn’t differ is the requirement that all deployments and environments are reliable and predictable. Operational Excellence is the fourth pillar of the Well-Architected Framework and covers the operational processes you need to ensure applications continue to operate.
The key processes that fall within operational excellence are Workload Automation, Workload Release, Monitoring and Testing. The end goal is to achieve superior operational practices.
Similar to the previous Security and Cost Optimisation pillars, Operational Excellence must be thought about throughout the lifecycle of a workload, including design and architecture phases, but especially once the workload is running. The management of a service and the related processes should not be retrofitted to environments or services. Think about these areas early on, as doing so reduces management overhead in the long term.
A Well-Architected workload viewed through the lens of Operational Excellence is one that is released in an automated manner, then monitored and tested efficiently, so that the application provides value not just to your customers but also to your internal development and operations teams.
Specific to Operational Excellence, at a high-level you should be thinking about the following areas and processes:
- Design, build and orchestrate workloads with DevOps principles in mind
- Monitor workloads efficiently using Azure Monitor
- Understand Application Performance Management
- Automate as many processes as possible
- Create and automate repeatable infrastructure
- Prepare for the unexpected by testing workloads
Operational Excellence Principles
When designing for Operational Excellence in Azure, the Framework sets out a number of principles that you must think about. Those principles include:
- Optimise build and release processes by embracing software engineering disciplines. Deploy infrastructure via code (IaC), and use continuous integration and delivery (CI/CD) pipelines for build and release. Automate testing plans and avoid configuration drift by using configuration as code. Azure DevOps and Azure Policy are two tools that can assist greatly in optimising build and release and in controlling configuration drift.
- Understand operational health by using tools and processes that monitor all aspects of a workload including but not limited to build and release processes, infrastructure health and application health. Allow your teams to be proactive instead of reactive by observing workloads and correlating events to truly understand the workload health and performance.
- Rehearse recovery and practice failure by running disaster recovery (DR) drills at regular intervals to validate and understand the effectiveness of your recovery processes, and the responsibilities of internal teams. Use chaos engineering practices to identify weak points in applications via services such as Azure Chaos Studio.
- Embrace continuous operational improvement to reduce complexity and ambiguity where possible by continuously evaluating and refining operational processes and tasks. It’s important that processes keep evolving over time and that inefficiencies are removed. Most importantly, always learn from your failures.
- Use loosely coupled architectures such as microservices and serverless technologies that allow teams to build and deploy services independently to minimise service failures or impact on a large scale. It’s also important to think about cloud design patterns such as circuit breakers, load-levelling and throttling.
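To make the circuit breaker pattern from the last principle more concrete, here is a minimal sketch in plain Python. It is not taken from the Framework or from any Azure SDK: the class name, the `failure_threshold` and `reset_timeout` parameters and their defaults are all assumptions chosen for illustration. The idea is simply that after a number of consecutive failures, calls to a struggling dependency are rejected straight away for a cooling-off period, and a single trial call is then allowed through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    """Illustrative circuit breaker; names and defaults are assumptions."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so the breaker can be tested
        self._failures = 0           # consecutive failures seen so far
        self._opened_at = None       # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"       # cooling-off period over, allow a trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("dependency is failing; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Trip (or re-trip after a failed trial call) once the threshold is reached.
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller would wrap each request to a downstream service in `breaker.call(...)` and treat `CircuitOpenError` as a signal to fall back or fail fast rather than queue up more work against a dependency that is already struggling.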
Operational Excellence Recommendations & Tips
Some of the best tips or recommendations for operational excellence are as follows:
Azure Policy is a free Azure service that allows you to enforce resource-level rules across your Azure estate, which can assist in the adoption of operational best practices. Azure Policy is also a great tool for managing and monitoring configuration drift. For example, Azure Policy can ensure all workloads adhere to a specific set of security rules, such as requiring HTTPS or a minimum TLS version.
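The kind of rule described above can be illustrated with a small stand-alone Python sketch. This is not the Azure Policy definition language or any Azure API. The resource dictionaries, the `https_only` and `min_tls_version` keys and the `audit` function are hypothetical names used only to show the idea of evaluating every resource against a declared rule and reporting those that have drifted from it.

```python
def _version_tuple(version):
    """Turn a version string such as '1.2' into (1, 2) for comparison."""
    return tuple(int(part) for part in version.split("."))


def check_resource(resource, min_tls="1.2"):
    """Return a list of rule violations for one (hypothetical) resource dict."""
    violations = []
    if not resource.get("https_only", False):
        violations.append("HTTPS is not enforced")
    tls = resource.get("min_tls_version")
    if tls is None:
        violations.append("minimum TLS version is not set")
    elif _version_tuple(tls) < _version_tuple(min_tls):
        violations.append(f"minimum TLS version {tls} is below {min_tls}")
    return violations


def audit(resources, min_tls="1.2"):
    """Map each non-compliant resource name to its violations."""
    report = {}
    for resource in resources:
        violations = check_resource(resource, min_tls)
        if violations:
            report[resource["name"]] = violations
    return report
```

Running `audit` over an inventory of resources returns only the non-compliant ones, which is roughly the view a compliance report gives you. In a real estate the rules would be expressed as policy definitions and evaluated by the platform rather than by your own code.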
Azure Advisor is a fantastic resource that provides a set of Azure Policy recommendations that, in turn, can be used to identify opportunities to implement best practices across your workloads.
Use the DevOps checklist to review your design and management from a DevOps standpoint. The checklist covers culture, development, testing, release, monitoring and management. The checklist can be found here.
Strangler Fig is a cloud design pattern for incrementally migrating a legacy system by gradually replacing specific pieces of functionality with new apps or services. Eventually, the older system is ‘strangled’ by the new system, which takes over completely.
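A common way to apply the Strangler Fig pattern is to put a routing facade in front of the legacy system and move one piece of functionality at a time behind it. The sketch below shows that idea in plain Python under some assumptions: the `StranglerFacade` class, its `migrate` and `handle` methods and the path prefixes are hypothetical, and handlers are simple callables standing in for real services.

```python
class StranglerFacade:
    """Routes requests to new handlers where they exist, else to the legacy system."""

    def __init__(self, legacy_handler):
        self._legacy = legacy_handler
        self._migrated = {}  # path prefix -> handler for functionality already moved

    def migrate(self, path_prefix, new_handler):
        """Register a new service as the owner of everything under path_prefix."""
        self._migrated[path_prefix] = new_handler

    def handle(self, path):
        # The longest matching prefix wins, so more specific routes take priority.
        for prefix in sorted(self._migrated, key=len, reverse=True):
            if path.startswith(prefix):
                return self._migrated[prefix](path)
        # Anything not yet migrated still goes to the legacy system.
        return self._legacy(path)
```

Callers only ever talk to the facade, so each call to `migrate` shifts traffic for one area of functionality without the callers changing. Once every route has been migrated, the legacy handler receives no traffic and can be retired.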
Take time to understand and plan your operating model and internal teams. For example, managing loosely coupled architecture requires procedural decoupling as teams shouldn’t have to depend on partner teams to support, approve or operate their workloads.
Review your workloads
We will continue to cover the remaining pillars throughout this series of blogs. As highlighted in previous posts, you can review your current posture against the five Well-Architected pillars. The tool is free and can be accessed here.
For a more in-depth Architecture Review or a specific Operational Excellence Review, feel free to reach out to our Azure Cloud Experts.