Monday, January 25, 2021

 

The team at Cyber Protection Group experienced an internet outage last week.  During the outage we could not use our office systems or phone systems, and our routine work was interrupted.  Although it was a disruption, we could still complete our work remotely.  However, if a different system or service had gone down, we might not have been able to do our jobs properly.  What are some simple things to think about when a system goes down?

 

Maximum Acceptable Downtime

The term maximum acceptable outage (MAO), also called maximum acceptable downtime, designates the longest period of time a system can be down that the organization can tolerate.  The more value a system adds to the organization, the shorter the MAO should be.  For instance, it is critical for payment processing systems to be up and running.  The organization is unable to receive payments and income without these services, so an outage has a large, direct negative impact.  On the other hand, the organization can tolerate a longer downtime for a server hosting a blog web page.

Furthermore, there is a figure called the recovery point objective (RPO) that defines the maximum acceptable data loss in terms of time.  If a file server is assigned an RPO of one hour, its data must be backed up at least every hour, which signals that the data is highly valuable to the organization.  The longer the RPO, the more data the organization can tolerate losing.
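To make the RPO idea concrete, here is a minimal Python sketch.  The function names (data_loss_window, meets_rpo) and the timestamps are hypothetical, chosen only for illustration.  It measures how much work would be lost if a system failed at a given moment and compares that against a one-hour RPO.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, failure: datetime) -> timedelta:
    """Return the span of work lost if the system fails at `failure`."""
    if failure < last_backup:
        raise ValueError("failure time precedes the last backup")
    return failure - last_backup


def meets_rpo(last_backup: datetime, failure: datetime, rpo: timedelta) -> bool:
    """True when the data lost at `failure` stays within the RPO."""
    return data_loss_window(last_backup, failure) <= rpo


# Hypothetical example: file server with a one-hour RPO, last backed up at 09:00.
rpo = timedelta(hours=1)
last_backup = datetime(2021, 1, 25, 9, 0)

print(meets_rpo(last_backup, datetime(2021, 1, 25, 9, 45), rpo))   # True
print(meets_rpo(last_backup, datetime(2021, 1, 25, 10, 30), rpo))  # False
```

A failure 45 minutes after the last backup stays within the RPO, while a failure 90 minutes after it does not, which is why the backup interval cannot be longer than the RPO.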

Lastly, there is a time frame called the recovery time objective (RTO).  This figure defines how long it should take to recover a downed system.  The RTO should be less than or equal to the defined MAO to ensure that the system is back up and running before any intolerable impact occurs.
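The rule that RTO should not exceed MAO is easy to check mechanically.  The sketch below is one possible way to do it in Python.  The RecoveryTargets class, the system names, and every time value are assumptions made up for this example, not figures from a real recovery plan.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTargets:
    """Hypothetical record of one system's recovery figures."""
    name: str
    mao: timedelta  # longest outage the organization can tolerate
    rto: timedelta  # planned time to recover the system

    def is_consistent(self) -> bool:
        """The plan is sound only when recovery finishes within the MAO."""
        return self.rto <= self.mao


# Illustrative values only.
systems = [
    RecoveryTargets("payment-processing", mao=timedelta(hours=1), rto=timedelta(minutes=30)),
    RecoveryTargets("blog-server", mao=timedelta(days=2), rto=timedelta(days=3)),
]

for system in systems:
    status = "ok" if system.is_consistent() else "RTO exceeds MAO"
    print(f"{system.name}: {status}")
```

In this made-up data the blog server is flagged because its planned recovery time is longer than the outage the organization says it can tolerate, so either the plan or the MAO needs revisiting.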

 

Cost-Benefit Analysis

Cost is the biggest factor when creating recovery plans and implementing measures to keep a system running.  The more a system needs to be available, the more costly it will be to keep it running.  For example, if a system needs to run 24/7, the organization must put many elements in place to keep it from going down: multiple power systems, security measures, data backups, redundant systems, and much more.  Additionally, if that system goes down, it will cost much more to get it back on its feet as fast as possible.  In contrast, a system that only operates, say, five hours a day, five days a week, does not need nearly as many costly elements to ensure its availability.

As seen in Figure 1, the shorter the tolerated outage time, the higher the initial costs and recovery process costs, but the lower the disruption costs and impacts.  The longer the tolerated outage time, the lower the initial and recovery process costs, but the greater the disruption and negative impacts.  Ideally, the organization should find a happy medium between outage time and cost.

Figure 1: System Outage Cost Graph
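The trade-off in Figure 1 can also be sketched in a few lines of Python.  Every number and cost formula below is hypothetical and chosen only to reproduce the general shape of the two curves: prevention and recovery spending that falls as the tolerated outage grows, and disruption cost that rises with it.  A real analysis would use the organization's own figures.

```python
def prevention_cost(outage_hours: float) -> float:
    """Hypothetical initial and recovery spending; falls as tolerated outage grows."""
    return 100_000 / outage_hours


def disruption_cost(outage_hours: float) -> float:
    """Hypothetical business loss; grows linearly with outage length."""
    return 2_500 * outage_hours


def total_cost(outage_hours: float) -> float:
    """Sum of both curves for a given tolerated outage time."""
    return prevention_cost(outage_hours) + disruption_cost(outage_hours)


# Candidate tolerated outage times, in hours (illustrative).
candidates = [1, 2, 4, 8, 12, 24, 48]

for hours in candidates:
    print(f"{hours:>2} h tolerated outage -> total cost {total_cost(hours):,.0f}")

best = min(candidates, key=total_cost)
print(f"Lowest total cost among candidates: {best} h")
```

With these made-up curves the lowest total among the candidates is at 8 hours.  Tolerating less costs more up front, and tolerating more costs more in disruption.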

Avoiding Downtime

When our internet service was down, we were still able to continue working remotely.  However, some downed systems will affect the daily work process.  To avoid downtime, organizations can implement redundant systems (for example, backup servers, multiple power sources, and backup cooling systems for servers).  Security measures can be installed to deter attacks that might impact a system (for example, firewalls, intrusion detection and prevention, virus scanners, and keeping systems up to date).  Organizations should also back up data regularly to avoid data loss.  Finally, test whatever gets implemented.  Monitor your systems and ensure that backup systems work, recovery policies are in place, and employees know their duties.  A well-prepared organization greatly reduces the impact of system outages.