Improving Operational Resilience With A Modern Observability Strategy

Drafted by Ben Saunders

Roughly a 2-minute read

Monitoring has traditionally been a dull corner of the IT operations world.  It historically involved deploying heavyweight enterprise tools to monitor for situations such as disk space running low or unresponsive servers.  If something went wrong, email alerts would be sent to support staff, who would step in and resolve the issue, ideally before customers were impacted.  

The monitoring world has experienced significant innovation over recent years.  This is something that needed to happen as application architectures have become more complex, and as digital channels have become a more significant source of revenue for businesses.  

Instead of monitoring, today we aspire to “Observability”.  This is an evolution from monitoring the basics, such as availability, through to a more detailed understanding of how our systems are operating, the customer and employee experience they are supporting, and increasingly the business metrics.  People sometimes describe monitoring as telling you that something is wrong, whilst observability tells you what is wrong, and gives you the tools to debug and understand the historical situation.  

Three Pillars Of Observability

It is said that there are three pillars that need to be in place before we can reach the vision of high-quality observability.  These are referred to as logs, metrics, and traces.   

  • Logs - Logs describe what has happened over time.  They typically describe an event, such as a user logging in, a new order being placed, or an application crash.  They are typically timestamped so we can understand the time relationships between these events during subsequent analysis.  

  • Metrics - Metrics are aggregated measurements rather than individual events.  This could include technical measures such as memory consumption, disk space, or currently active users, or business metrics such as the number or value of transactions flowing through the systems. 
    Because this information is aggregated, it tends to give you a picture of the overall health of the system, whilst also producing less data than logs.  

  • Traces - A trace describes how a number of services and systems have interacted to produce a given output.  This is important in a complex technology environment such as a large FSI organisation.  
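
The three pillars can be illustrated with a short, self-contained sketch.  This is an illustrative example only, using just the Python standard library rather than any particular observability product; the event names, metric names, and trace ID handling are assumptions for the sake of the example.

```python
import logging
import uuid

# --- Logs: timestamped events describing what has happened ---
logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)
log = logging.getLogger("orders")

# --- Metrics: aggregated measurements rather than individual events ---
order_count = 0        # number of transactions (hypothetical metric)
order_value_total = 0.0  # value flowing through the system (hypothetical metric)

def place_order(value, trace_id):
    """Record one order: emit a log event and update aggregate metrics."""
    global order_count, order_value_total
    # Log: one timestamped event, tagged with the trace ID
    log.info("order placed value=%.2f trace_id=%s", value, trace_id)
    # Metrics: aggregate rather than store every detail
    order_count += 1
    order_value_total += value

# --- Traces: a shared ID ties together work done across services ---
# In a real system this ID would be propagated between services so the
# whole request path can be reconstructed later.
trace_id = uuid.uuid4().hex
place_order(42.50, trace_id)
place_order(19.99, trace_id)

print(order_count)                   # 2 orders recorded
print(round(order_value_total, 2))   # 62.49 total value
```

Note how the metrics carry far less data than the logs: two counters summarise any number of order events, which is why metrics tend to be the cheapest signal to retain long-term.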

Logs, metrics and traces potentially add up to an enormous amount of data.  For this reason, we have seen the rapid evolution and adoption of cloud-based, cloud-native monitoring solutions such as Prometheus, Datadog, New Relic, and Elastic, which help us to ingest and analyse this data, alert on it, detect combinations of situations of interest, and provide slice-and-dice analytics.

Observability and Operational Resilience 

Observability is a critical tool to improve operational resilience at source.  When we have high-quality observability, we can really understand how well underlying systems are behaving, and respond quickly when incidents do occur.  

These observability metrics can also be rolled up into historical information.  For instance, if we observe that a service is giving us only 98% uptime instead of the contracted SLA, we can alert a user to this and use the actual observed value rather than the SLA one.  This way, we get a highly accurate view in real time, and can cross-reference multiple data sources that are evidence-based, not subjective.
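
As a minimal sketch of this roll-up, observed uptime can be computed directly from measured downtime and compared against the contract.  The figures below are illustrative assumptions (a 99.5% contracted SLA and a 30-day window), chosen so the observed value comes out at the 98% mentioned above.

```python
# Compare evidence-based observed uptime against a contracted SLA.
CONTRACTED_SLA = 0.995  # e.g. 99.5% uptime promised in the contract (assumed)

# Inputs measured by our own probes, not taken from the vendor's reporting.
window_minutes = 30 * 24 * 60    # a 30-day reporting window
observed_downtime_minutes = 864  # measured outage time in that window

observed_uptime = 1 - observed_downtime_minutes / window_minutes
print(f"observed uptime: {observed_uptime:.1%}")  # 98.0%

# Alert on the observed value, not the contracted one.
if observed_uptime < CONTRACTED_SLA:
    print("ALERT: observed uptime is below the contracted SLA")
```

The key design point is that the alert fires on what was actually measured, giving an evidence-based view that can be cross-referenced with other data sources.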
