Disaster Recovery Plan
LAST UPDATED January 13, 2024
1. Executive Summary
This Disaster Recovery Plan has been formulated to ensure the resilience and continuity of CourseKata's educational platform, particularly in the face of potential disasters and risks. The plan outlines the procedures and strategies to mitigate the impact of unforeseen events, ensuring minimal disruption to our services.
CourseKata is led by a tight-knit group of educators, former educators, researchers, and developers, so we take education seriously and personally. This disaster recovery plan was created with stable student learning as an overarching goal. It covers the essential aspects of our operations, including student data management, instructional support, and content delivery. This document serves as a guide for the organization's staff and stakeholders to respond effectively to various disaster scenarios.
The plan draws from and incorporates best practices from the Center for Internet Security (CIS) Critical Security Controls.
2. Organization and Tool Overview
CourseKata is a leading provider in the educational technology sector, specializing in educational materials that are continuously improved. Our platform periodically builds content from GitHub repositories, which lets us create and update educational books. These materials are delivered to students through Learning Tools Interoperability (LTI) integrations; students interact with the content, and their responses are recorded for analysis.
Instructors are provided with a dashboard and updated reports on student progress and engagement. Additionally, we support researchers by providing de-identified data, although this function is considered less critical to our operations.
3. Critical Assets and Functions
The critical assets and functions of our educational platform that require protection include:
Student Access and Data Generation: Ensuring students can continuously access the platform and generate data through their interactions is paramount.
Instructor Access to Data and Progress Tracking: Instructors rely on our platform for real-time data and insights into student progress.
Delivery of Educational Materials: The core functionality of our platform, where the timely and accurate delivery of educational content is essential.
LTI Registration and Installation Process: Particularly critical at the start of each academic quarter, this process facilitates the integration of our platform with various learning management systems.
4. Risk Assessment
The risk assessment for CourseKata's educational platform identifies potential threats that could impact our critical assets and functions:
Natural Disasters: Data centers in Virginia (AWS US East) and California (AWS US West) for the main application, and Iowa (GCP US Central) for the JupyterHub, may be affected by natural events like earthquakes, floods, or storms.
Cyber Attacks: Risks include hacking, phishing, malware, and denial-of-service attacks, which remain possible despite strong security measures such as two-factor authentication (2FA) and robust hosting environments.
System Failures: Possible causes include software bugs, hardware malfunctions, or human errors within our CI/CD process and deployment infrastructure.
Data Breaches: Unauthorized access to student and instructor data, potentially impacting data integrity and privacy.
5. Impact Analysis
An analysis of the potential impacts of these risks on our operations reveals:
Natural Disasters: An outage in a primary data center could lead to temporary service disruption. Recovery would involve switching to a backup data center, causing potential downtime of a few hours.
Cyber Attacks: A successful cyber attack may necessitate taking the system offline for assessment and repairs, potentially impacting service availability for a full day or more.
System Failures: Unexpected failures could be addressed through immediate rollbacks to stable versions, with a focus on rectifying data inconsistencies that may arise.
Data Breaches: Depending on the nature of the breach, responses may range from system lockdowns to targeted endpoint closures, with varying degrees of impact on service continuity.
6. Preventive Measures
To mitigate these risks, CourseKata has implemented several preventive measures:
Regular Data Backups: We conduct frequent and regular backups of all critical data across different regions and services, aligning with CIS Control 10 (Data Recovery Capabilities), ensuring the ability to quickly restore lost data.
Security Protocols and Training: Our security measures are continuously updated, including employee training on recognizing and responding to threats. This aligns with CIS Control 17 (Implement a Security Awareness and Training Program), focusing on building a security-aware culture.
Disaster Recovery Drills: We perform periodic simulation exercises to test and refine our response to different disaster scenarios, in line with CIS Control 10, which emphasizes the importance of not only having recovery capabilities but also testing them to ensure their effectiveness.
Cross-Region Redundancy: The implementation of cross-region redundancy, particularly for AWS, ensures continuity in case of regional outages. This practice is consistent with CIS Control 9 (Limitation and Control of Network Ports, Protocols, and Services), which involves controlling the network infrastructure to prevent disruptions.
Compliance Audits: CourseKata is committed to conducting periodic audits to ensure adherence to data protection laws and educational regulations. We are aligning our practices with CIS Control 6 (Maintenance, Monitoring, and Analysis of Audit Logs), which calls for continuous monitoring and analysis of network and system activity for policy violations and cybersecurity threats, but full implementation of this control is still in progress. We are actively gathering the funds and resources needed to complete it, which will strengthen our ability to maintain a secure and compliant educational platform.
By aligning our preventive measures with the CIS Critical Security Controls (the control numbers and names cited above follow version 7.1 of the CIS Controls), CourseKata demonstrates a strong commitment to maintaining robust security protocols and ensuring the resilience of our educational platform.
7. Response Strategy
CourseKata's response strategy to various disaster scenarios includes:
Natural Disasters: In case of a data center outage, we will activate our backup and recovery procedures, involving switching to a secondary data center and restoring operations from the latest snapshot. The lead developer, in coordination with the CTO, will oversee this process.
Cyber Attacks: Upon detection of a cyber attack, the lead developer will assess the situation and recommend a course of action. The CTO will make decisions regarding access restrictions or system downtime. A site banner will be posted to inform users, and a detailed report will follow for prolonged issues.
System Failures and Software Bugs: The lead developer is responsible for initiating rollbacks to stable versions in the event of system failures or software bugs. The CI/CD process ensures minimal impact, and manual interventions are possible when necessary.
Data Breaches: In the event of a data breach, we will notify all affected parties as required. If the breach includes PII, we will restrict system access while the breach source is identified and sealed. For breaches involving de-identified information, only the affected endpoints will be closed.
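The response strategy above can be sketched as a small lookup table, which may be convenient for an on-call checklist. This is an illustrative sketch only: the names Incident, Response, RUNBOOK, and first_steps are hypothetical, the steps are paraphrased from this section, and roles are left as None where the plan does not name one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class Incident(Enum):
    DATA_CENTER_OUTAGE = "data center outage"
    CYBER_ATTACK = "cyber attack"
    SYSTEM_FAILURE = "system failure or software bug"
    BREACH_PII = "data breach involving PII"
    BREACH_DEIDENTIFIED = "data breach involving de-identified data"


@dataclass(frozen=True)
class Response:
    lead: Optional[str]          # role that runs the response; None if not named
    escalates_to: Optional[str]  # role the lead coordinates with; None if not named
    steps: Tuple[str, ...]       # first actions, paraphrased from the plan


# Hypothetical encoding of section 7; not an official CourseKata artifact.
RUNBOOK = {
    Incident.DATA_CENTER_OUTAGE: Response(
        "lead developer", "CTO",
        ("switch to the secondary data center",
         "restore from the latest snapshot"),
    ),
    Incident.CYBER_ATTACK: Response(
        "lead developer", "CTO",
        ("assess and recommend a course of action",
         "post a site banner",
         "publish a detailed report if the issue is prolonged"),
    ),
    Incident.SYSTEM_FAILURE: Response(
        "lead developer", None,
        ("roll back to the last stable version",),
    ),
    Incident.BREACH_PII: Response(
        None, None,
        ("notify affected parties as required",
         "restrict system access until the source is identified and sealed"),
    ),
    Incident.BREACH_DEIDENTIFIED: Response(
        None, None,
        ("notify affected parties as required",
         "close only the affected endpoints"),
    ),
}


def first_steps(incident: Incident) -> Response:
    """Look up who leads and what happens first for an incident type."""
    return RUNBOOK[incident]
```

For example, first_steps(Incident.BREACH_DEIDENTIFIED).steps lists notification followed by closing only the affected endpoints, which mirrors the distinction the plan draws between PII and de-identified data.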
8. Disaster Recovery Procedures
For our main application and data, CourseKata uses the pilot light strategy:
Data is replicated from one region to another, and a copy of the core workload infrastructure is provisioned in the second region. Resources required to support data replication and backup, such as databases and object storage, are always on. Other elements, such as application servers, are loaded with application code and configurations but are switched off, and are used only during testing or when disaster recovery failover is invoked.
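The split between always-on and switched-off resources in a pilot light setup can be sketched as a simple inventory. This is a minimal illustration, assuming three generic component names taken from the description above; Component, DR_REGION_COMPONENTS, and start_on_failover are hypothetical and do not describe the actual CourseKata resource list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Component:
    name: str
    always_on: bool  # True: runs in the recovery region at all times


# Hypothetical inventory for the recovery region.
DR_REGION_COMPONENTS = [
    Component("database replica", always_on=True),
    Component("object storage replica", always_on=True),
    Component("application servers", always_on=False),
]


def start_on_failover(components: List[Component]) -> List[str]:
    """Return the components that must be switched on when failover is invoked."""
    return [c.name for c in components if not c.always_on]
```

Under these assumptions, only the application servers need to be started during a failover or a test, because the data-bearing components are already running.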
AWS (main application) Environment Recovery (Terraform Procedure):
In case of an incident in the main region (Virginia, us-east-1), activate the disaster recovery plan in California (us-west-1).
Follow the step-by-step Terraform procedure outlined at https://coursekata.atlassian.net/wiki/spaces/COURSEKATA/pages/10523836: pull the repository (https://github.com/UCLATALL/coursekata-terraform), initialize Terraform, apply the plan, and deploy the applications.
Document and save terminal logs for audits and compliance.
Post-recovery, switch the DNS records to the new environment.
(Test this procedure monthly, including VPC creation and Elastic Beanstalk environment setup but excluding the DNS switch.)
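The Terraform steps above, together with the requirement to save terminal logs, can be sketched as follows. This is a hedged sketch, not the procedure from the wiki page: the function names, the working directory, the plan file name dr.tfplan, and the log format are assumptions, and the repository's own way of selecting the recovery region and deploying applications is not shown. The Terraform options used (-chdir, plan -out, and apply with a saved plan file) are standard CLI options.

```python
import subprocess
from datetime import datetime, timezone

REPO_URL = "https://github.com/UCLATALL/coursekata-terraform"


def recovery_commands(workdir="coursekata-terraform"):
    """Build the command sequence: fetch the repo, init, plan, apply."""
    return [
        ["git", "clone", REPO_URL, workdir],
        ["terraform", f"-chdir={workdir}", "init"],
        # Save the plan so that apply runs exactly what was reviewed.
        ["terraform", f"-chdir={workdir}", "plan", "-out=dr.tfplan"],
        ["terraform", f"-chdir={workdir}", "apply", "dr.tfplan"],
    ]


def run_and_log(commands, log_path, dry_run=True):
    """Append each command, and its output when executed, to a log file.

    With dry_run=True (the default) nothing is executed; only the
    command lines are recorded, which is useful for rehearsals.
    """
    with open(log_path, "a", encoding="utf-8") as log:
        for cmd in commands:
            stamp = datetime.now(timezone.utc).isoformat()
            log.write(f"[{stamp}] $ {' '.join(cmd)}\n")
            if dry_run:
                continue
            result = subprocess.run(
                cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
            )
            log.write(result.stdout)
            result.check_returncode()  # stop at the first failing step
```

Calling run_and_log(recovery_commands(), "dr.log") records the planned commands without running them; passing dry_run=False executes them and keeps the combined output for audit purposes.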
GCP (JupyterHub) Environment Recovery:
In case of an incident in the main location (Iowa, zone us-central1-a), activate the disaster recovery plan in Oregon (zone us-west1-a).
Follow the GCP documentation to restore from backup (https://cloud.google.com/compute/docs/disks/restore-snapshot).
Re-deploy the application using the deployment procedure documented in the source code repository (https://github.com/coursekata/ckhub-k8s).
Document and save terminal logs for audits and compliance.
Post-recovery, switch the DNS records to the new environment.
(Test this procedure monthly, excluding the DNS switch.)
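The snapshot restore step above can be sketched as a helper that picks the newest snapshot and builds the matching gcloud command. This is a minimal sketch under assumptions: the function names, disk name, and snapshot names are hypothetical, and the snapshot records are assumed to have the name and creationTimestamp fields that gcloud compute snapshots list --format=json returns. The re-deployment from the ckhub-k8s repository is not covered here.

```python
from datetime import datetime

DR_ZONE = "us-west1-a"  # recovery zone named in this plan


def latest_snapshot(snapshots):
    """Return the name of the most recently created snapshot.

    Timestamps are parsed rather than compared as strings, because
    they may carry different UTC offsets.
    """
    newest = max(
        snapshots, key=lambda s: datetime.fromisoformat(s["creationTimestamp"])
    )
    return newest["name"]


def restore_disk_command(disk_name, snapshot_name, zone=DR_ZONE):
    """Build the gcloud command that creates a disk from a snapshot."""
    return [
        "gcloud", "compute", "disks", "create", disk_name,
        f"--source-snapshot={snapshot_name}",
        f"--zone={zone}",
    ]
```

The command is returned as a list rather than executed, so it can be reviewed, logged, and then passed to subprocess.run during a drill or a real failover.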
9. Testing and Maintenance
To ensure the effectiveness and readiness of our disaster recovery strategies, CourseKata commits to the following testing and maintenance routines:
Regular Testing Schedules: We will conduct monthly tests of the disaster recovery procedures, especially for AWS and GCP environments, to ensure that all systems and protocols function as expected in an emergency. This includes testing backups, recovery processes, and failover mechanisms.
Procedure Updates: The disaster recovery plan and procedures will be reviewed and updated semi-annually or following significant changes in our infrastructure or operations. This ensures that the plan remains relevant and effective.
Audit and Compliance: We will perform annual internal audits of our disaster recovery plan to ensure compliance with relevant regulations and industry standards.
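The cadences above (monthly tests, semi-annual reviews, annual audits) can be tracked with a small check like the one below. This is an illustrative sketch: the task names, the MAX_AGE table, and the day counts of 31, 183, and 365 are assumptions that approximate the stated schedule, not values defined by this plan.

```python
from datetime import date, timedelta

# Assumed maximum time allowed since each task was last completed.
MAX_AGE = {
    "dr_test": timedelta(days=31),          # monthly recovery test
    "plan_review": timedelta(days=183),     # semi-annual plan review
    "internal_audit": timedelta(days=365),  # annual internal audit
}


def overdue(last_done, today):
    """Return the sorted names of tasks whose last completion is too old.

    last_done maps each task name in MAX_AGE to the date it was last done.
    """
    return sorted(
        name for name, limit in MAX_AGE.items()
        if today - last_done[name] > limit
    )
```

For example, if all three tasks were last completed on 2024-01-13, then on 2024-03-01 only the recovery test is overdue, and by 2024-08-01 the plan review is overdue as well.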
10. Conclusion
This Disaster Recovery Plan underscores CourseKata's earnest and steady commitment as a small non-profit to maintain a resilient and secure educational platform. In our journey of continuous improvement, we are dedicated to thoughtful planning, regular testing, and the ongoing refinement of our disaster recovery strategies. Our approach is not just about preparedness; it reflects our best effort to uphold the trust placed in us by our students and educators.
At CourseKata, we understand the challenges and limitations that come with our size and resources. Nevertheless, we strive to minimize the impact of unforeseen events, ensuring the integrity and reliability of our services. This plan represents our promise to do our utmost, within our means, to provide a stable and dependable educational experience. We recognize the vital role of this commitment in our operational integrity and in the support we offer to our community.
Legal Disclaimer
This document is provided for information purposes only and the contents hereof are subject to change without notice. CourseKata does not warrant that this document is error free, nor does it provide any other warranties or conditions, whether expressed orally or implied in law, including implied warranties and conditions of merchantability or fitness for a particular purpose. CourseKata specifically disclaims any liability with respect to this document and no contractual obligations are formed either directly or indirectly by this document.