12 Key Indicators to Optimise Your Firm's Operational Resilience

Drafted by Ben Saunders

Roughly a 10-minute Read

In my previous life as a consultant, I spent many years working with highly regulated enterprises as they embedded DevOps ways of working into their software delivery lifecycles (SDLCs). This was often done in the public cloud, across multiple cloud service providers, technology stacks, and application architectures. It dictated the use of multiple toolchains and interactions with different teams across many continents, all whilst delivering change at speed without breaking things. Oh, and there were regulatory considerations to enforce across different regions and jurisdictions as well! 

Needless to say, these changes were hard. Both technically and culturally. If they weren’t, then every enterprise in the world would be cutting code quicker than the digital unicorns. 

Whenever we aimed to introduce highly automated solutions into a customer's SDLC, it was very much a hearts-and-minds process: often a case of building confidence incrementally with material risk-takers that we could provide greater stability and control through an automated CI/CD pipeline, especially when compared to a manual, human-intensive release and deployment process. 

With OpRes, we are embarking on a similar journey. We want to remove the subjectivity from operational resilience and allow firms of all sizes to move towards a more objective, data-driven approach. Our aim is to enable firms to achieve this in real-time by aggregating many of their existing, disparate data sources, more often than not extracting data and insights from their existing toolchains and technology investments. 

In this blog, we will provide examples of the types of data sets and measurements we recommend firms start to capture in order to build a richer understanding of their operational resilience posture. We believe that capturing these data points will allow firms to build a portfolio of what we call key resilience indicators (KRIs) across their important business services. 

Operational Resilience - Key Resilience Indicators

We define a KRI as a metric or measurement that a firm can use to quantify its aggregated operational resilience posture. We've outlined 12 areas below that we believe firms should measure as part of their operational resilience agenda, to form a baseline and an ongoing framework from which to improve. 

  • IT Asset End of Life (EOL) & Support Periods:

    Firms should measure the number of assets across their IT estate and whether these are still running on supported system versions or releases. This stretches across both hardware and software-related assets. Firms should also assess assets that are in support but fast approaching EOL and measure what percentage of their total estate this represents. Finally, firms should identify systems that are running on technology stacks that are no longer supported by the original supplier/provider.
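
    As a rough illustration of how this could be tracked, the sketch below calculates the share of an asset inventory that is out of support or approaching EOL. The asset records and the 180-day window are hypothetical assumptions rather than recommendations.

```python
# Illustrative EOL coverage calculation. The inventory and the 180-day
# "approaching EOL" window are assumptions for the purpose of the example.
from datetime import date, timedelta

assets = [  # hypothetical inventory extract: (asset_name, end_of_support_date)
    ("core-banking-db", date(2023, 6, 30)),
    ("payments-gateway", date(2026, 1, 15)),
    ("legacy-batch-host", date(2022, 12, 31)),
]

today = date.today()
window = timedelta(days=180)

out_of_support = [name for name, eol in assets if eol < today]
approaching_eol = [name for name, eol in assets if today <= eol < today + window]

print(f"Out of support: {len(out_of_support) / len(assets):.0%}")
print(f"Approaching EOL (<180 days): {len(approaching_eol) / len(assets):.0%}")
```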

  • Patching Coverage Rate:

    Firms should capture and continuously measure the percentage of critical systems running without up-to-date patches, alongside its complement, the "patch coverage rate": the proportion of critical systems whose patches are up to date.
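
    A minimal sketch of this calculation, assuming a simple inventory extract that flags each critical system as patched or unpatched:

```python
# Illustrative only: patch coverage rate computed from a hypothetical
# inventory extract, where each critical system is flagged patched/unpatched.
critical_systems = {
    "payments-api": True,       # patches up to date
    "fx-pricing-engine": False,
    "customer-portal": True,
}

patched = sum(critical_systems.values())
coverage_rate = patched / len(critical_systems)

print(f"Patch coverage rate: {coverage_rate:.0%}")
print(f"Critical systems without up-to-date patches: {1 - coverage_rate:.0%}")
```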

  • Frequency & Severity of Outages:

    Across each business service, firms should baseline the total number of incidents and production outages that have caused intolerable harm to their standard modes of operation. This should consider the duration of the outage, the systems impacted, and the number of customers/users affected. If incidents are recurring on an all-too-frequent basis, firms should track the collective mean time between failures (MTBF), so as to enable pattern detection and the preemptive identification of potential incidents before they occur. 
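
    As an illustration, MTBF can be derived from little more than incident start times. The timestamps below are hypothetical and would normally be pulled from an ITSM or monitoring tool.

```python
# Rough sketch of mean time between failures (MTBF) for one business service,
# using hypothetical incident start times.
from datetime import datetime

incident_starts = sorted([
    datetime(2021, 3, 2, 9, 15),
    datetime(2021, 3, 18, 22, 40),
    datetime(2021, 4, 6, 7, 5),
])

# Hours between consecutive incidents.
gaps = [
    (later - earlier).total_seconds() / 3600
    for earlier, later in zip(incident_starts, incident_starts[1:])
]

mtbf_hours = sum(gaps) / len(gaps)
print(f"MTBF: {mtbf_hours:.1f} hours")
```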

  • Mean Time to Restoration (MTTR):

    Things break; there is no getting away from that! However, it is important for firms to have run-books and standard operating procedures in place to restore service as quickly as possible. As such, firms should measure their mean time to restoration: if an outage does occur, how long does it take the firm to restore normal levels of service in line with the defined performance needs of the business and end customers? Where possible, firms should also track their MTTR on a month-by-month basis to ensure that, when incidents do occur, service restoration is getting faster, not slower. 
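
    A short sketch of a month-by-month MTTR trend, assuming incident records with hypothetical start and restore timestamps:

```python
# Sketch of a month-by-month MTTR trend; the incident records and field
# names are illustrative assumptions.
from collections import defaultdict
from datetime import datetime

incidents = [
    {"start": datetime(2021, 3, 2, 9, 15), "restored": datetime(2021, 3, 2, 11, 45)},
    {"start": datetime(2021, 3, 18, 22, 40), "restored": datetime(2021, 3, 19, 1, 10)},
    {"start": datetime(2021, 4, 6, 7, 5), "restored": datetime(2021, 4, 6, 7, 50)},
]

durations_by_month = defaultdict(list)
for inc in incidents:
    month = inc["start"].strftime("%Y-%m")
    minutes = (inc["restored"] - inc["start"]).total_seconds() / 60
    durations_by_month[month].append(minutes)

for month, durations in sorted(durations_by_month.items()):
    print(f"{month}: MTTR {sum(durations) / len(durations):.0f} minutes")
```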

  • System Availability:

    Firms should measure the amount of time that all systems are available to customers and end-users to enable the delivery of important business services across a calendar month. Ideally, this data should be able to be categorised into granular measurements of minutes, hours, days, and weeks if required. Indeed, there may well be core trading hours for certain business services, or time frames where a firm is willing to accept some disruption owing to lower demand for its services. As an example, an electronic trading platform accessed by users across the London markets might be able to accept an outage between 1:00 am and 2:00 am.
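
    The sketch below shows one way availability could be expressed once an accepted disruption window is excluded; the outage figure and the one-hour tolerated window are illustrative assumptions.

```python
# Illustrative availability calculation for one business service over a
# 30-day month, excluding an accepted 01:00-02:00 disruption window.
minutes_in_month = 30 * 24 * 60
accepted_window_minutes = 30 * 60    # one tolerated hour per day
unplanned_outage_minutes = 95        # hypothetical figure from monitoring

measured_minutes = minutes_in_month - accepted_window_minutes
availability = (measured_minutes - unplanned_outage_minutes) / measured_minutes

print(f"Availability (core hours): {availability:.3%}")
```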


  • System Downtime:

    In the same vein as monitoring system uptime, firms should also monitor system downtime. This comes in two forms: planned and unplanned. Unplanned downtime can be caused by an external influence that disrupts a business service, whilst planned downtime is often the result of a scheduled event such as monthly maintenance or a system upgrade. When unplanned outages occur across a business service, firms should correlate the percentage and duration of the unplanned downtime with the severity of the outage. Furthermore, where planned downtime takes place but not all tasks or upgrade activities are completed, firms should capture this information to understand how maintenance windows could begin to cause potential disruption to their impact tolerances. 

  • System Capacity:

    Across their technology landscape, firms and their suppliers should measure the total number of IT assets and their respective performance thresholds, which, if exceeded, could disrupt operational resilience. These measurements could cover areas such as disk space, storage, memory allocation, and the frequency of usage spikes. Firms should have the capability to measure these data points in transactions or requests per second, with an ability to identify errors, the frequency of those errors, and the speed at which areas like storage are becoming over-utilised. 
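
    As a simple illustration, capacity indicators can be reduced to threshold checks over monitoring samples. The metrics and threshold values below are assumptions made for the purpose of the example.

```python
# Simple threshold check over hypothetical capacity samples. The 80%/90%
# and 1,000 requests-per-second limits are illustrative, not prescriptive.
samples = {
    "disk_utilisation": 0.86,
    "memory_utilisation": 0.64,
    "requests_per_second": 1250,
}
thresholds = {
    "disk_utilisation": 0.80,
    "memory_utilisation": 0.90,
    "requests_per_second": 1000,
}

for metric, value in samples.items():
    if value > thresholds[metric]:
        print(f"WARNING: {metric} at {value} exceeds threshold {thresholds[metric]}")
```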

  • Network Performance:

    Across the entirety of a financial services organisation, there are multiple networks, suppliers, switches, and end-points which power the financial system. Arguably, network connectivity is the lowest common denominator in any technology supply chain. That, and electricity! As such, there are a number of resilience indicators firms should measure to determine their posture in this space:  

  • Network Availability: The amount of actual time where core networks are available (measured in minutes) against the amount of total time, as an overall percentage across a given day, week, month or year. 

  • Network Bandwidth: The average utilisation of a firm's core networks, sampled at pre-defined intervals and measured as a ratio of the network's total available bandwidth. As an example, is a network seeing an increasingly large surge in consumption at a given time each day? 

  • Network Spikes and Utilisation Bursts: The number of occurrences where the firm's network bandwidth has exceeded acceptable thresholds, leading to saturation, latency, and errors in a business service's normal operations.

  • Network Hardware Performance: In the same way that firms should monitor their infrastructure and database resilience indicators, they should replicate this practice across their network hardware. In this instance, firms should capture the utilisation of each network device and again surface data that identifies when acceptable performance thresholds have been breached, indicating a degradation in service. 
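
    To make the network indicators above more concrete, here is a minimal sketch that derives average bandwidth utilisation, utilisation bursts, and network availability from hypothetical five-minute samples.

```python
# Sketch of the network indicators above, using hypothetical five-minute
# utilisation samples (each a ratio of total available bandwidth).
utilisation_samples = [0.42, 0.55, 0.61, 0.97, 0.93, 0.48]
burst_threshold = 0.90                 # illustrative acceptable limit

average_utilisation = sum(utilisation_samples) / len(utilisation_samples)
bursts = sum(1 for u in utilisation_samples if u > burst_threshold)

network_minutes_available = 1438       # hypothetical figure for one day
network_availability = network_minutes_available / (24 * 60)

print(f"Average bandwidth utilisation: {average_utilisation:.0%}")
print(f"Utilisation bursts above {burst_threshold:.0%}: {bursts}")
print(f"Network availability: {network_availability:.2%}")
```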

  • Service Level Agreement (SLA) Conformance:

    Over the course of our recent blogs, we have stressed the importance of firms setting their impact tolerances, which are tightly associated with SLAs. As such, firms should analyse and measure the following data points to underpin a sound resilience strategy. 

  • Business Services Without a Defined SLA:

    Firms should capture the total number of business services, systems and suppliers where an SLA has not been documented and aim to remediate these gaps as quickly as possible. This should again be documented as a percentage of the overall technology estate. 

  • Service Provider SLA Conformance:

    Firms should actively monitor and track the performance of their suppliers, whether these be software, hardware, or services-based partners, and ensure that they are conforming with the SLAs stipulated in their contracts and agreements. These suppliers could be internally deployed IT functions or third- and fourth-party suppliers outside of the firm's control. Furthermore, firms should track the number of active disputes, incidents, or defects they have open with their suppliers. 

  • Systems Running Without Maintenance Support:

    Firms should also measure the total number of systems and suppliers that underpin their important business services but do not have an active maintenance agreement in place. We would generally recommend that firms categorise their suppliers as either "Critical" or "Important". This should be dictated by the role they play across a business service and whether intolerable harm could be caused as a consequence of service disruption. Remember, a single supplier could be Critical for one business service and merely Important for another, based on the role it plays in each! 

  • Volume of Changes:

    The one thing that is constant in any firm is change. As such, it is critical that financial organisations capture the total number of changes being deployed across their estate at a given time, whilst also accurately tracking the percentage of successful and failed changes over a defined period. The latter can be defined as the "change failure percentage". This is where we would typically advocate a highly automated configuration management strategy for firms. Enforcing controls like "separation of duties" and "four-eyes checks" with automated quality control gates that are codified with pre-defined acceptance criteria can really help strengthen audit trails in this respect. 
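
    A minimal sketch of the change failure percentage, using hypothetical change-record counts from a configuration management or ticketing tool:

```python
# Illustrative change failure percentage over one reporting period.
total_changes = 240
failed_changes = 9   # hypothetical count of rolled-back or incident-causing changes

change_failure_percentage = failed_changes / total_changes
print(f"Change failure percentage: {change_failure_percentage:.1%}")
```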

  • Backup and Restore Procedures:

    Financial services firms often need to ensure that their systems have a zero-second recovery point objective (RPO). This means that if a system fails, it can recover to the point in time when the disruption occurred, without any loss or corruption of data. That said, even with non-stop solutions in place, firms must still have a sound backup and restore strategy to support data recovery requirements. Whilst having a best-practice backup strategy is a key requirement, we recommend that firms also measure the critical systems that do not have an automated backup solution in place, as well as the frequency with which backups fail or do not complete within their defined backup windows. 
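
    The sketch below illustrates two of the backup measurements described above, using a hypothetical list of backup jobs.

```python
# Minimal sketch for backup health: which critical systems lack an automated
# backup, and which automated backups missed their window. Data is hypothetical.
backup_jobs = [
    {"system": "payments-db", "automated": True, "completed_in_window": True},
    {"system": "fx-rates-cache", "automated": True, "completed_in_window": False},
    {"system": "legacy-ledger", "automated": False, "completed_in_window": False},
]

no_automation = [j["system"] for j in backup_jobs if not j["automated"]]
missed_window = [
    j["system"] for j in backup_jobs if j["automated"] and not j["completed_in_window"]
]

print(f"Critical systems without automated backup: {no_automation}")
print(f"Backups missing their window: {missed_window}")
```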

  • Malware Scanning & Security Conformance:

    Security in itself is a whole category of operational resilience. In fact, we could devote an entire series of blogs to the security domain alone. At the very least, firms should measure the following for operational resilience purposes:

  • Frequency of security incidents and their point of origination. 

  • The severity of the security breach or incident and the impact on customers/users. 

  • Mean time to detect malware, security breaches, and incidents.

  • Mean time to close malware, security breaches, or incidents. 

  • Percentage of IT assets that have not received a full malware scan across a defined period of time. 

  • Percentage of IT assets that are not running any anti-malware controls/software. 

In Closing: 

Over the course of this blog, we have highlighted a set of key resilience indicators that firms should measure in order to determine their technology estate's operational resilience. These are by no means definitive, and there are other indicators firms should consider based on the criticality of their systems. 

It is important for firms to be able to correlate historical records with real-time events and configuration data in order to build a meaningful portrayal of their resilience posture. Over time, if firms can optimise the methods by which they capture, classify, and record these data points, there are opportunities to introduce pattern detection methods and prevent incidents with AI and ML capabilities. Finally, firms can become more informed about the types of changes and operational resilience gaps across their estate that carry the most risk, and put in place the required controls and quality gates to prevent service disruption and outages. 

Stay tuned in the coming weeks as we release more material about OpRes and the wider operational resilience agenda across financial services. 

Thanks for reading as ever,

Ben S
