In our previous blog post, we discussed how the regulatory environment has affected public cloud usage in the financial services industry. One of the key themes that we drew upon was operational resilience. In this blog post, we will further explore how firms can achieve resilience and demonstrate adherence to regulatory and compliance obligations.
The significance of IT operational resiliency in the financial services industry is constantly increasing. Regulators are now taking a broader view of operational resilience, rather than focusing on specific, identified risks. As the Bank of England recognises, “We want firms to plan on the assumption that any part of their infrastructure could be impacted, whatever the reason.” This is a fundamental shift in how firms should think about and approach operational resilience.
Historically, there has been more emphasis on preventing operational disruptions, and less on responding to or recovering from them. Rather than focusing solely on the ‘prevent’ part, firms should also consider the ‘respond’ and ‘recover’ sides of the equation. Some fundamental questions to ask are:
- What problems are you trying to solve?
- What specific aspects of the application require specific levels of availability?
- What is the amount of cumulative downtime that this workload can realistically accumulate in one year?
- What is the actual impact of unavailability?
One approach is to express these goals as recovery time objectives (RTOs) and recovery point objectives (RPOs). Other measurements for the reliability of systems can include mean time to repair (MTTR) or mean time between failures (MTBF). Whichever measure you use, it is important to have clear definitions of your service-level objectives (SLOs), service-level indicators (SLIs), and service-level agreements (SLAs), which form part of your system requirements. These elements should tie back to your business objectives.
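To make the downtime question above concrete, here is a minimal Python sketch that converts an availability target into a yearly downtime budget, and derives availability from MTBF and MTTR using the standard steady-state formula MTBF / (MTBF + MTTR). The function names are illustrative, and the calculation assumes a 365-day year with no planned maintenance windows excluded.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def availability_from_mtbf_mttr(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction: MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Illustrative targets only; your own SLOs should come from business needs.
    for target in (95.0, 99.9, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability allows {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.999% target leaves a little over five minutes of downtime per year, while 95% leaves roughly 18 days. Seeing the budget in minutes often makes the conversation with the business about the actual impact of unavailability much easier.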
Critical or non-critical?
In some situations, the SLO is non-negotiable as it is set externally. For example, systemically important applications such as payment, clearing, settlement, central bank, and market systems have a specific RTO in the Principles for Financial Market Infrastructures (PFMI) standard:
“The [business continuity] plan should incorporate the use of a secondary site and should be designed to ensure that critical information technology (IT) systems can resume operations within two hours following disruptive events. The plan should be designed to enable the FMI to complete settlement by the end of the day of the disruption, even in case of extreme circumstances.”
Membership of payment schemes like Faster Payments also comes with availability targets and steep penalties for non-compliance, whilst card payments providers require certain response times for pre-authorisation and other interactions.
From principles to implementation
It is considered good practice to design systems with independent and isolated components to provide redundancy and allow for varying SLAs. Teams should design and build their systems in an incremental fashion to gradually improve the reliability of systems. Rather than starting at 99.999%, aim for a lower target of around 95% to build up confidence. Once these targets are consistently met, teams can begin to look at further techniques to improve the resiliency of their systems and meet the requirements. At higher availability, this usually means automatic and alert-based recovery with chaos testing to ensure that complex or critical recovery paths do work.
It is also important to have good monitoring in place. The worst failure is ‘silent’, where the application is no longer working but there is no clear way to detect it, and your customer knows about it before you do. Alerting and monitoring should be decoupled from the systems as much as possible. This would ideally be done in independent scalable systems, with redundancy that provides end-to-end monitoring or health probes from the edge.
These health checks must offer as close a representation as possible to real-world usage. For instance, if the application serves consumers in different regions, you will need multiple probes from each region for an accurate picture. Beyond simple availability health checks, you may want to have content-based or functional checks for more complete validations of application responses.
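As a rough illustration of a content-based check across regions, the Python sketch below treats a probe as healthy only when it returns HTTP 200 and the response body contains an expected piece of content. The `ProbeResult` type, the region names, and the expected text are all hypothetical; in practice the results would come from whatever external probing service you use.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """Outcome of one synthetic request issued from a given region."""
    region: str
    status_code: int
    body: str


def is_healthy(result: ProbeResult, expected_text: str) -> bool:
    """Functional check: a 200 status AND the expected content in the body."""
    return result.status_code == 200 and expected_text in result.body


def unhealthy_regions(results: list[ProbeResult], expected_text: str) -> list[str]:
    """Return the regions whose probe failed the functional check."""
    return [r.region for r in results if not is_healthy(r, expected_text)]


if __name__ == "__main__":
    # Hypothetical probe results; region names are placeholders.
    results = [
        ProbeResult("region-a", 200, "Account balance: 10.00"),
        ProbeResult("region-b", 200, "Something went wrong"),
        ProbeResult("region-c", 503, ""),
    ]
    print(unhealthy_regions(results, "Account balance"))
```

Note that the second probe returns a 200 status but would still be flagged, which is exactly the kind of ‘silent’ failure a simple availability check would miss.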
Downtime is often a result of human and process error, rather than faulty architecture. Well-defined SLAs must come with clear contracts for long-term and on-call support. Teams should also aim to streamline and automate their operational procedures as much as possible. This means solving deployment problems using continuous delivery patterns (feature flags, canary deployment, blue/green) and building up operational experience with mission control. This involves members from different service teams coming together on rotas to build up visibility and alerting.
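One way to picture a canary or percentage-based feature flag is the Python sketch below, which deterministically assigns each user to a bucket from 0 to 99 and enables the new behaviour only for buckets below the rollout percentage. This is a simplified illustration, not a substitute for a feature-flag product; the `in_canary` function and the use of a user ID as the key are assumptions for the example.

```python
import hashlib


def in_canary(user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in the canary group based on a hash bucket."""
    if not 0 <= rollout_percent <= 100:
        raise ValueError("rollout_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < rollout_percent


if __name__ == "__main__":
    # Hypothetical user IDs; the same user always lands in the same bucket.
    for user in ("user-1", "user-2", "user-3"):
        print(user, in_canary(user, 10))
```

Because the bucket depends only on the user ID, raising the percentage gradually widens the canary group without moving anyone who was already in it back out, and setting it to zero acts as an immediate rollback.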
Public cloud and operational resilience
Public cloud providers usually offer multiple availability zones and regions, allowing architects and engineering teams to design services and applications with higher availability at a lower cost. Depending on requirements, organisations may look at different resiliency models such as intra-zone, multi-zone, or multi-region. Each offers a higher level of resilience than the last, but all come with trade-offs such as network latency and data residency.
Some cloud services also come with high availability guarantees, such as AWS S3 and Aurora. However, many teams do not have experience in designing systems that consume these services correctly, so engineering teams must have a good understanding of the key capabilities of those cloud services – for example, synchronous vs. asynchronous replication of data and eventual vs. strong consistency. This sometimes requires additional training and experimenting with these services in conjunction with reading documentation from the cloud providers. It is important to take extra care with SLA calculation when combining cloud products with separate guarantees to deliver a critical system. This is due to the compound effect, which we will explore further in a later article.
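As a first approximation of the compound effect, the Python sketch below combines availabilities for components that are all required (in series) and for redundant replicas where any one is sufficient. It assumes failures are independent, which is rarely fully true in practice, and the percentages are illustrative rather than any provider's published SLA.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """All components are required: multiply the individual availabilities."""
    return prod(availabilities)


def redundant_availability(availabilities: list[float]) -> float:
    """Any one replica suffices: 1 minus the chance that all fail together."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Three required services, each at an illustrative 99.9%.
    chain = serial_availability([0.999, 0.999, 0.999])
    print(f"Three services in series: {chain:.4%}")

    # Two redundant replicas, each at an illustrative 99%.
    pair = redundant_availability([0.99, 0.99])
    print(f"Two redundant replicas: {pair:.4%}")
```

With these illustrative numbers, three 99.9% services chained together deliver only about 99.7% overall, while two redundant 99% replicas reach about 99.99%. This is why a system built entirely from highly available services can still fall short of its own target.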
The above approaches work well with greenfield cloud deployments, but financial institutions with existing legacy deployments will need further considerations. Organisations might consider moving some workloads to the cloud using Platform as a Service and Software as a Service (PaaS and SaaS) models to lower the operational overhead, while supporting a similar level of resiliency. Selecting the right service and environment for each workload is a complex topic, and our ultimate guide to Hybrid Cloud covers this in greater depth.
Leveraging cloud for disaster recovery (DR) often appeals for data backups due to the durability and low cost of object storage. There is increasing talk of cloud data backups and cross-cloud replication, further driving multi-cloud in the enterprise. However, using cloud for systems DR in practice can be exceedingly difficult and is often less cost-effective. DR on the cloud means translating existing on-premises applications to be cloud-compatible, which can be complex for stateful and distributed systems.
In a nutshell
The best way for firms to be operationally resilient is to be proactive and not reactive. If you are building systems from scratch, make sure that service-level metrics are part of your system requirements. If you already have a production system but have not had them clearly defined, aim to make this your highest priority.