A Look at Cyber Resilience

Explore how cyber resilience differs from traditional disaster recovery and learn practical frameworks for building organizational resilience. This guide covers essential concepts like RTO/RPO, maturity models, and proven standards from NIST and ISO to help you prepare for, respond to, and recover from cyber attacks while maintaining critical business operations.

Overview

In the olden days of computing it was a great sign of IT maturity if there was traditional Disaster Backup and Recovery (DBaR) and Business Continuity Planning (BCP). These concepts focused on preparing for and recovering from significant disruptions, such as natural disasters or major system failures. However, as cyber threats have evolved and become more sophisticated, the need for a more comprehensive approach to resilience has emerged.

Today, a cyber attack is arguably more likely to cause disruption than a natural disaster, and it creates different kinds of problems than a traditional disaster does. For example, a ransomware attack may require not just restoring data from backups, but also dealing with data integrity issues, legal implications, and communication strategies. Let's take a look at how you can approach this challenge using the concept of cyber resilience.

Cyber resilience is the ability of an organization to prepare for, respond to, and recover from cyber attacks while maintaining essential functions. It encompasses not only technical measures but also organizational processes, people, and culture.

TL;DR

Cyber Resilience goes beyond traditional disaster recovery. Modern organizations face cyber threats that require specialized preparation, response, and recovery strategies. We'll cover seven key frameworks (NIST CSF, SP 800-160/184/34/61, ISO 22301, ISO/IEC 27031), explain critical concepts like RTO and RPO, present a 6-level maturity model, and share practical guidance for building a comprehensive cyber resilience program. Whether you're starting from scratch or enhancing existing capabilities, we'll look at how to work with stakeholders, organize your program, report meaningful metrics to leadership, and maintain resilience through iterative testing and improvement.

Key Cyber Resilience Frameworks and Standards

One of the great things about modern times is that you can typically ask: "Are there any frameworks and/or standards around this topic?" and most of the time there are! In the case of cyber resilience, there are several well-regarded frameworks and standards that organizations can leverage to enhance their cyber resilience posture. Here are the key ones:

1. NIST Cybersecurity Framework (CSF) 2.0

This covers high-level outcomes for Respond and Recover, and is tied to governance and risk. For example, it includes:

  • Governance: Establishing policies and procedures for cyber resilience.
  • Risk Management: Identifying and mitigating cyber risks.
  • Respond: Developing and implementing response plans for cyber incidents.
  • Recover: Establishing recovery plans to restore operations after a cyber incident.

Link: https://www.nist.gov/cyberframework

2. NIST SP 800-160 Vol. 2

This covers resilient system engineering, including:

  • Segmentation: Dividing systems into segments to contain cyber threats.
  • Diversity: Using diverse technologies and approaches to reduce vulnerabilities.
  • Isolation: Implementing isolation techniques to protect critical systems.
  • Recovery Patterns: Establishing recovery strategies to restore system functionality.

Link: https://csrc.nist.gov/pubs/sp/800/160/v2/r1/final

3. NIST SP 800-184

This covers cyber event recovery, including:

  • Playbooks: Developing detailed response plans for specific cyber events.
  • Testing: Regularly testing recovery plans to ensure effectiveness.
  • Sequencing: Establishing the order of recovery actions.
  • Dependencies: Identifying and managing dependencies during recovery.

Link: https://csrc.nist.gov/publications/detail/sp/800-184/final

4. NIST SP 800-34

This covers contingency planning, including:

  • Business Impact Analysis: Assessing the impact of disruptions on business operations.
  • Recovery Time Objectives (RTO): Defining acceptable downtime for systems.
  • Recovery Point Objectives (RPO): Establishing acceptable data loss thresholds.
  • Disaster Recovery Integration: Integrating contingency plans with disaster recovery strategies.

Link: https://csrc.nist.gov/publications/detail/sp/800-34/rev-1/final

5. NIST SP 800-61

This covers the incident response lifecycle, which includes:

  • Preparation: Establishing incident response capabilities.
  • Detection and Analysis: Identifying and analyzing cyber incidents.
  • Containment, Eradication, and Recovery: Implementing strategies to contain, eradicate, and recover from incidents.
  • Post-Incident Activity: Conducting lessons learned to improve future response efforts.

Link: https://csrc.nist.gov/pubs/sp/800/61/r3/final

6. ISO 22301

This covers business continuity for critical services, which includes:

  • Impact Tolerances: Defining acceptable levels of disruption for critical services.
  • Continuity Strategies: Developing strategies to ensure the continuity of critical services during disruptions.

Link: https://www.iso.org/standard/75106.html and

Free Preview: https://www.iso.org/obp/ui/#iso:std:iso:22301:ed-2:v1:en

7. ISO/IEC 27031

This covers Information and Communication Technology (ICT) service continuity, including:

  • ICT Recovery: Establishing procedures for recovering IT services.
  • Test Cadence: Implementing regular testing of IT service continuity plans.

Link: https://www.iso.org/standard/27031 and

Free Preview: https://www.iso.org/obp/ui/#iso:std:iso-iec:27031:ed-1:v1:en

Framework and Standards Summary

Taken together, these seven frameworks and standards cover cyber resilience from different angles, giving organizations a fairly comprehensive toolkit to build and maintain a cyber resilience program.

A Note on Standards

While NIST publications are freely available, many ISO standards require a purchase to view the full document. The "Free Preview" links provided in this post are alternate ways to view the specification.

Here is another way to think of how these work together and complement each other to establish the domains of cyber resilience:

| Framework/Standard | Focus Area | Key Aspects |
|---|---|---|
| 1. NIST CSF 2.0 | Overall Cybersecurity | Governance, Risk Management, Respond, Recover |
| 2. NIST SP 800-160 Vol. 2 | System Resilience | Segmentation, Diversity, Isolation, Recovery |
| 3. NIST SP 800-184 | Cyber Event Recovery | Playbooks, Testing, Sequencing, Dependencies |
| 4. NIST SP 800-34 | Contingency Planning | Business Impact, RTO/RPO, DR Integration |
| 5. NIST SP 800-61 | Incident Response | Lifecycle Management, Lessons Learned |
| 6. ISO 22301 | Business Continuity | Impact Tolerances, Continuity Strategies |
| 7. ISO/IEC 27031 | IT Service Continuity | ICT Recovery, Test Cadence |

About RTO and RPO

Two important concepts in cyber resilience and disaster recovery are Recovery Time Objective (RTO) and Recovery Point Objective (RPO). In short, RTO is how long you can afford to be down and RPO is how much data you can afford to lose.

RTO vs RPO

Recovery Time Objective (RTO)

RTO is the maximum acceptable amount of time that a system, application, or process can be down after a disruption before it must be restored to avoid unacceptable consequences. Defining it helps organizations prioritize recovery efforts and allocate resources effectively. Each service owner needs to define what "unacceptable" means, and what the consequences are if that service is down for that long.

RTO is about the amount of downtime that is acceptable.

Example: Technical Support Call Center

Imagine a technical support call center that has 300 employees. If the call center goes down, the organization risks losing customer trust, but it also might impact a Service Level Agreement (SLA) where the organization needs to pay a penalty. So, the downtime of the call center means lost trust, lost productivity, and potential financial penalties.

Therefore, the organization might set an RTO of 4 hours for the call center to be back up and running. Four hours is when the organization believes the consequences become unacceptable.

Example: E-Commerce Site

Imagine an e-commerce website. If the website goes down, the organization risks losing sales and customer trust. So, the organization might set an RTO of 1 hour for the website to be back up and running. There is often a financial figure associated too, for example: the company loses $X per minute of downtime.

Therefore, based on the calculation, the organization determines that 1 hour is the maximum acceptable downtime before the consequences become unacceptable.

The idea is to work with the stakeholders to do all of this analysis ahead of time, before an incident occurs. That is, do the math, and determine what the RTO should be for that particular service.
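The "do the math" step can be sketched as a small calculation. This is only a minimal sketch, not a prescribed method: it assumes downtime cost grows at a constant rate, and the function name, the loss rate, and the loss tolerance are all hypothetical values chosen for illustration.

```python
def max_tolerable_downtime_minutes(loss_per_minute: float,
                                   max_acceptable_loss: float) -> float:
    """Return the downtime (in minutes) at which cumulative loss hits the tolerance.

    Simplifying assumption: loss accrues at a constant rate per minute.
    """
    if loss_per_minute <= 0:
        raise ValueError("loss_per_minute must be positive")
    return max_acceptable_loss / loss_per_minute


# Hypothetical figures: the site loses $500 per minute of downtime, and the
# business decides it cannot absorb more than $30,000 from a single outage.
candidate_rto = max_tolerable_downtime_minutes(loss_per_minute=500,
                                               max_acceptable_loss=30_000)
print(f"Candidate RTO: {candidate_rto:.0f} minutes")
```

In practice the loss curve is rarely linear (SLA penalties and reputational damage tend to kick in at thresholds), so a figure like this is a starting point for the stakeholder conversation rather than the final answer.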

Recovery Point Objective (RPO)

RPO is the maximum acceptable amount of data loss measured in time. It defines the point in time to which data must be restored after a disruption to avoid unacceptable consequences. By defining the RPO for a service, organizations can then determine appropriate backup strategies, data protection, and data retention measures. This requires each service owner to define what constitutes "unacceptable" data loss and its consequences.

RPO is about the amount of data loss that is acceptable.

Example: Financial Transactions Database

Imagine a financial transactions database that processes thousands of transactions per minute. If the database goes down, the organization risks losing critical financial data.

Therefore, the organization might set an RPO of 15 minutes, meaning that in the event of a disruption, they can afford to lose up to 15 minutes worth of transaction data.

Example: Customer Relationship Management (CRM) System

Imagine a Customer Relationship Management (CRM) system that stores customer data and interactions. If the CRM system goes down, the organization risks losing valuable customer information.

Therefore, the organization might set an RPO of 1 hour, meaning that in the event of a disruption, they can afford to lose up to 1 hour worth of customer data.

As with RTO, the idea is to work with stakeholders to do this analysis before an incident happens. That is, do the math and determine what the RPO should be for each particular service.
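One way to sanity-check an RPO is to compare it against the backup schedule. The sketch below assumes the worst-case data loss equals the interval between backups, which ignores backup duration and failed backups; the function names and intervals are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    """Worst case: the incident hits just before the next backup would have run.

    Simplifying assumption: backups complete instantly and are always restorable.
    """
    return backup_interval


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """A schedule meets the RPO if its worst-case loss does not exceed it."""
    return worst_case_data_loss(backup_interval) <= rpo


rpo = timedelta(minutes=15)
schedules = [
    ("daily backup", timedelta(hours=24)),
    ("5-minute log shipping", timedelta(minutes=5)),
]
for label, interval in schedules:
    status = "meets" if meets_rpo(interval, rpo) else "misses"
    print(f"{label}: {status} the 15-minute RPO")
```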

Expectations vs Reality

This upfront analysis is where the rubber meets the road. It bridges the gap between business expectations and technical reality. For example, the process might reveal a service requires a one-hour RTO, but the existing infrastructure can only support a 24-hour recovery.

Or you might have a service that cannot tolerate losing more than 15 minutes of data (RPO), but the current backup strategy only allows for daily backups. If the backup runs at midnight and an incident hits at 5pm, you lose 17 hours of data!

Discovering this mismatch before an incident forces an important decision about the risk, either:

  • Mitigate: The organization invests in the necessary technology to meet the objective, or
  • Accept: The business must formally accept the risk (and consequences) of a longer outage or greater data loss.

Without this exercise, the organization is operating on dangerous assumptions, which often leads to a painful and preventable failure during a real crisis.
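The expectations-versus-reality comparison can be expressed as a simple gap check per service. This is a sketch under stated assumptions: the `ServiceObjectives` record and `find_gaps` function are hypothetical names, the achievable RTO is assumed to come from restore tests or estimates, and the backup interval is used as a stand-in for worst-case data loss.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServiceObjectives:
    name: str
    required_rto: timedelta     # agreed with the business
    achievable_rto: timedelta   # measured or estimated from restore tests
    required_rpo: timedelta     # agreed with the business
    backup_interval: timedelta  # stand-in for worst-case data loss


def find_gaps(service: ServiceObjectives) -> list[str]:
    """Return a description of each objective the current setup cannot meet."""
    gaps = []
    if service.achievable_rto > service.required_rto:
        gaps.append(f"RTO gap: requires {service.required_rto}, "
                    f"can deliver {service.achievable_rto}")
    if service.backup_interval > service.required_rpo:
        gaps.append(f"RPO gap: requires {service.required_rpo}, "
                    f"backups run every {service.backup_interval}")
    return gaps


# Mirrors the mismatches described above: a 1-hour RTO against a 24-hour
# recovery, and a 15-minute RPO against daily backups.
service = ServiceObjectives(
    name="example-service",
    required_rto=timedelta(hours=1),
    achievable_rto=timedelta(hours=24),
    required_rpo=timedelta(minutes=15),
    backup_interval=timedelta(hours=24),
)
for gap in find_gaps(service):
    print(f"{service.name}: {gap} -> decide: mitigate or accept")
```

Each reported gap maps to the decision above: either fund the work to close it, or record a formal risk acceptance.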

Cyber Resilience Maturity Model

Cyber Resilience maturity refers to how developed an organization's ability to prepare for, withstand, and recover from cyber attacks is. While there isn't a single, universally adopted standard, we can construct a practical maturity model by drawing upon the principles of several key industry frameworks and standards. These provide the foundation for assessing and improving resilience posture:

  • CERT Resilience Management Model (CERT-RMM): A framework from the Software Engineering Institute (SEI) to establish, manage, and measure operational resilience.
  • NIST Cybersecurity Framework (CSF): Provides a high-level, strategic view of cybersecurity outcomes, including the "Respond" and "Recover" functions that are central to resilience.
  • Cybersecurity Maturity Model Certification (CMMC): A model for enforcing the protection of sensitive data. Its maturity levels (from basic hygiene to advanced) offer a parallel for resilience progression.
  • CIS Controls: A prioritized set of safeguards to mitigate the most prevalent cyber attacks, which forms the technical backbone of a resilient infrastructure.
  • ISO/IEC Standards: A family of standards that provide a formal structure for information security and business continuity. Key standards include:
    • ISO/IEC 27001: For Information Security Management Systems (ISMS).
    • ISO/IEC 27031: For ICT readiness for business continuity.
    • ISO 22301: For Business Continuity Management Systems (BCMS).

Drawing from these, we can define the following maturity levels:

Level 0 - Ad hoc

Symptoms:

  • No dedicated cyber-resilience function or personnel.
  • Backups may exist but are inconsistently performed and rarely, if ever, tested.
  • No documented recovery playbooks.
  • No defined Recovery Time Objectives (RTOs) or Recovery Point Objectives (RPOs).
  • Little to no understanding of which business services are critical.
  • Incident response and disaster recovery are completely separate, if they exist at all.

This is the "hope is our strategy" stage.

Level 1 - Basic DR

Symptoms:

  • Some disaster recovery plans exist but are focused on traditional scenarios (e.g., hardware failure) and are managed system-by-system.
  • RTOs/RPOs are informal, unrealistic, or not aligned with business needs.
  • Cyber-specific scenarios like ransomware, identity compromise, or a cloud provider outage are not modeled.
  • Restore tests may happen annually but are superficial and not comprehensive.
  • Business Continuity Planning (BCP) is treated as a separate, non-technical function, disconnected from IT and Security.
  • Processes are document-heavy with little to no automation.

This is "classic DR but not cyber-aware."

Level 2 - Cyber-aware DR

Symptoms:

  • The organization acknowledges that cyber attacks have unique characteristics requiring specialized recovery procedures.
  • Annual or semi-annual cyber tabletop exercises are introduced to simulate attacks.
  • Cyber-specific runbooks begin to appear (e.g., for ransomware recovery or identity restoration).
  • Backups are hardened using techniques like immutability, offline copies (air-gapping), or isolated recovery environments.
  • Critical infrastructure like Active Directory / Entra ID and DNS begins to be treated as Tier-0 assets requiring enhanced protection and dedicated recovery plans.

The organization is still fragmented, but the gap between traditional DR and cyber resilience is now visible and being addressed.

Level 3 - Managed Cyber Resilience

Symptoms:

  • A formal, dedicated team or program for cyber resilience now exists with clear ownership.
  • A clear mapping of "critical business services" to underlying applications and infrastructure is maintained.
  • RTOs and RPOs are formally defined, documented, and agreed upon with business stakeholders for all critical systems.
  • Regular, automated restore validations occur (e.g., quarterly or monthly for key applications).
  • Cyber recovery playbooks are comprehensive, regularly tested, and integrated with incident response.
  • Metrics on recovery readiness and test outcomes begin to flow up to leadership.

This is where most mature organizations, such as those in the Fortune 100, are aiming to be.

Level 4 - Integrated Resilience & BCP

Symptoms:

  • Cyber recovery and Business Continuity Planning (BCP) operate as a single, cohesive ecosystem.
  • Dependency mapping is accurate and automated, tracking the full stack from business process down to infrastructure: business process > application > identity > DNS > network > storage.
  • Recovery processes are orchestrated and measurable from end-to-end.
  • Every critical service undergoes periodic, rigorous DR tests that include cyber-attack simulations.
  • Risk and audit teams have direct visibility into the organization's resilience posture through shared dashboards and metrics.
  • The resilience of the cloud and critical third-party suppliers is assessed systematically.

This level of integration dramatically reduces the probability of a major, business-impacting outage.
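A dependency map like the one described at this level can also drive recovery sequencing: components with no unmet dependencies are restored first. The sketch below uses Python's standard-library `graphlib.TopologicalSorter` (Python 3.9+); the component names and the specific dependency edges are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key depends on the components in its set.
# Names and edges are hypothetical, for illustration only.
dependencies = {
    "order-processing (business process)": {"orders-app"},
    "orders-app": {"identity", "storage"},
    "identity": {"dns"},
    "dns": {"network"},
    "storage": {"network"},
    "network": set(),
}

# static_order() yields every component after all of its dependencies,
# which gives a valid order in which to restore them.
recovery_order = list(TopologicalSorter(dependencies).static_order())

for step, component in enumerate(recovery_order, start=1):
    print(f"{step}. restore {component}")
```

If the map contains a circular dependency, `TopologicalSorter` raises `graphlib.CycleError`, which is itself a useful finding to surface before an incident.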

Level 5 - Adaptive, Threat-Informed Resilience

Symptoms:

  • Recovery testing is continuous and automated wherever possible, becoming part of the standard development lifecycle (DevSecOps).
  • Threat intelligence and detection data (e.g., from MITRE ATT&CK mappings) directly inform and adapt recovery engineering.
  • Chaos engineering principles are used to proactively test the resilience of critical components like identity, cloud services, and storage subsystems.
  • Recovery strategies automatically adapt based on changes in the threat landscape.
  • Metrics reported to the board are predictive (e.g., "What is our exposure to a new ransomware variant?") rather than just descriptive.
  • The organization can withstand a full identity compromise, cloud region failure, or large-scale ransomware attack without existential risk.

This is the pinnacle of maturity - extremely rare, but the gold standard for resilience.

Working With Stakeholders

Within an organization, cyber resilience requires collaboration across multiple stakeholders, including IT, security, business units, and executive leadership. Hopefully, the organization already knows what its critical applications and services are: the ones that, if they go down, put the organization at risk of significant harm and/or going out of business. In Disaster Recovery Planning (DRP) and Business Continuity Planning (BCP), these are typically identified through a Business Impact Analysis (BIA) and classified as Tier-1 or Tier-0 applications.

Once you have a list of critical applications and services, you can work with the stakeholders to determine the RTO and RPO for each. This will help you prioritize recovery efforts and allocate resources effectively. Doing the analysis upfront to determine the RTO and RPO is incredibly helpful for planning, budgeting, and also when an incident occurs.
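As a rough illustration of prioritization, the agreed objectives can be kept in a small inventory and sorted by tier and then by RTO. This is a sketch, not a recommended schema: the `CriticalService` record is hypothetical, and while a few values echo the examples earlier in this post, the rest (including the tier assignments) are made up.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class CriticalService:
    name: str
    tier: int        # 0 = most critical, as assigned during the BIA
    rto: timedelta
    rpo: timedelta


# Illustrative inventory; tiers and several objectives are hypothetical.
inventory = [
    CriticalService("CRM", tier=1, rto=timedelta(hours=8), rpo=timedelta(hours=1)),
    CriticalService("Identity", tier=0, rto=timedelta(hours=2), rpo=timedelta(minutes=15)),
    CriticalService("E-commerce site", tier=1, rto=timedelta(hours=1), rpo=timedelta(minutes=15)),
    CriticalService("Call center", tier=1, rto=timedelta(hours=4), rpo=timedelta(hours=1)),
]


def recovery_priority(services):
    """Order services by tier first, then by the tightest RTO."""
    return sorted(services, key=lambda s: (s.tier, s.rto))


for service in recovery_priority(inventory):
    print(f"Tier {service.tier}: {service.name} (RTO {service.rto}, RPO {service.rpo})")
```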

Organizing the Cyber Resilience Program

One of the bigger problems organizations face is how to organize the cyber resilience program. There is a tremendous amount of data to gather, analyze, and report on. Remember that during an incident, the cyber resilience tools and data will be very helpful to the incident response team, if you can make them accessible and easy to use.

Although a lot of data may end up being stored in Word docs and spreadsheets, the core RTO/RPO data and app inventory details can be stored in platforms like RSA Archer, ServiceNow GRC, or other GRC platforms. These platforms can help manage the data and workflows, and help with your reporting needs.

Reporting Metrics to Leadership

A key role of cyber resilience programs is to be able to report metrics to leadership to demonstrate the organization's resilience posture and progress. Key Performance Indicators (KPIs) and Key Risk Indicators (KRIs) are essential for this purpose. So, what should we report?

Here are some examples of KPIs and KRIs that can be reported to leadership:

| Metric | Description |
|---|---|
| Percentage of Critical Systems with Tested Recovery Plans | The percentage of critical systems that have recovery plans that have been tested within the last 6 months. |
| Average Recovery Time for Critical Systems | The average time taken to recover critical systems during tests or actual incidents. |
| Number of Cyber Resilience Exercises Conducted | The number of cyber resilience exercises (e.g., tabletop exercises, simulations) conducted in the last quarter. |
| Percentage of Backups Successfully Restored | The percentage of backups that have been successfully restored during tests. |
| RTO/RPO Compliance Rate | The percentage of critical systems that meet their defined RTO and RPO objectives. |
| Number of Identified Resilience Gaps | The number of gaps identified in the cyber resilience posture through assessments and audits. |
| Time to Detect and Respond to Cyber Incidents | The average time taken to detect and respond to cyber incidents. |
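Several of these metrics can be computed directly from recovery test records. The sketch below is one possible approach under stated assumptions: the `SystemRecord` fields and sample data are hypothetical, "6 months" is approximated as 182 days, only the RTO half of the compliance rate is shown, and a system that has never been tested is counted as non-compliant.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class SystemRecord:
    name: str
    last_recovery_test: Optional[date]       # None if never tested
    last_recovery_minutes: Optional[float]   # measured recovery time, if any
    rto_minutes: float


def pct_tested_recently(systems, today: date, window_days: int = 182) -> float:
    """Percentage of systems with a recovery test inside the window."""
    if not systems:
        return 0.0
    cutoff = today - timedelta(days=window_days)
    tested = sum(1 for s in systems
                 if s.last_recovery_test is not None and s.last_recovery_test >= cutoff)
    return 100.0 * tested / len(systems)


def average_recovery_minutes(systems) -> Optional[float]:
    """Average measured recovery time across systems that have a measurement."""
    times = [s.last_recovery_minutes for s in systems if s.last_recovery_minutes is not None]
    return sum(times) / len(times) if times else None


def rto_compliance_rate(systems) -> float:
    """Percentage of systems whose measured recovery time is within their RTO."""
    if not systems:
        return 0.0
    ok = sum(1 for s in systems
             if s.last_recovery_minutes is not None
             and s.last_recovery_minutes <= s.rto_minutes)
    return 100.0 * ok / len(systems)


# Hypothetical sample data.
systems = [
    SystemRecord("system-a", date(2025, 10, 1), 45, 60),
    SystemRecord("system-b", date(2025, 3, 1), 300, 240),
    SystemRecord("system-c", None, None, 480),
    SystemRecord("system-d", date(2025, 11, 15), 30, 120),
]
today = date(2025, 12, 1)
print(f"Tested in last 6 months: {pct_tested_recently(systems, today):.0f}%")
print(f"Average recovery time: {average_recovery_minutes(systems):.0f} minutes")
print(f"RTO compliance rate: {rto_compliance_rate(systems):.0f}%")
```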

Further Considerations

Below are just a few more topics to consider when building out your cyber resilience program.

The Human Element: Your First Line of Defense

While technology and frameworks are critical, remember that cyber resilience is also about people. Your employees can be either your weakest link or your strongest asset. A comprehensive resilience program should include regular security awareness training, phishing simulations, and clear communication channels to build a culture where everyone feels responsible for security.

Supply Chain and Third-Party Risk

Your organization's resilience is only as strong as the weakest link in your supply chain. A vulnerability in a SaaS provider, cloud service, or other critical vendor can have a direct impact on your operations. A mature resilience program must include processes for assessing and managing the risks associated with third-party suppliers.

Conclusion

Building true cyber resilience is a journey, not a destination. It moves an organization from a reactive stance to a proactive one, preparing not just for if a cyber attack happens, but for when. By leveraging established frameworks from NIST and ISO, defining clear RTOs and RPOs with business stakeholders, and measuring progress through meaningful metrics, you can create a robust program that protects critical operations.

Remember two key principles for long-term success:

  1. Keep the Data Current: Your resilience plan is only as good as the information it's built on. Ensure your application inventory, dependency maps, and contact lists are regularly updated.
  2. Test, Test, and Test Again: A plan that hasn't been tested is just a document. Regular testing - from tabletop exercises to full simulations - is the only way to build confidence and uncover weaknesses before a real crisis.

By embracing this iterative approach, you can significantly enhance your organization's ability to withstand and recover from cyber attacks, turning a potential disaster into a manageable event.

This post is licensed under CC BY 4.0 by the author.
robertsinfosec © 2025  | Version: v2025.1204.1431-8912fd6