Activity	OPERATIONS
Department	Sales & Marketing, Engineering, Customer Support, SAST
Process	AWS Region outage

1 Description
- 1.1 Objective
- 1.2 Scope
- 1.3 Definitions
2 List of activities
3 Description of activities
4 Files

Description

Objective

To respond to an AWS region outage impacting Clinical Brain's infrastructure, ensuring rapid service restoration, minimal operational disruption, and clear communication with stakeholders. The goal is to maintain business continuity during such incidents.

Scope

This procedure exists to keep Clinical Brain's services running without interruption. It ensures that our systems can recover quickly from an AWS region outage, avoiding long downtimes and keeping operations efficient. This aligns with our business goal of providing a reliable and consistent service to our users.

Definitions

N/A

List of activities

Incident identification
Stakeholder communication
Automated recovery process initiation
DNS entry update for Disaster Recovery
Service restoration verification
Post-incident review

Description of activities

Activity #1 - Incident identification

Description	Identify and confirm an AWS region outage affecting Clinical Brain's services
Resources	Seq (Clinical Brain (AWS) signal); AWS status pages
Owner	Engineering
Substitute	TODO
Step by step
1. Monitor the AWS status pages and Seq.
2. Create a JIRA task to record the diagnosis and analysis conducted.
3. Create an entry in https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669
Stakeholders	Sales & Marketing, Customer Support, SAST, Customers

Communication plan

Information	Frequency	Sender	Recipient	Channel
Confirmation of outage	Once	Engineering	Sales & Marketing, Customer Support, SAST	Email
Confirmation of outage	Once	Engineering	Customers	Confluence (https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669)

Activity #2 - Stakeholder communication

Description	Communicate with internal and external stakeholders about the incident and ongoing response actions
Resources	https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377847813; TODO: create contact list
Owner	Sales & Marketing
Substitute	TODO
Step by step	TODO: migrate the steps from https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/376635393 to this page
Stakeholders	Customers

Communication plan

Information	Frequency	Sender	Recipient	Channel
TODO	TODO	TODO	TODO	TODO

Activity #3 - Automated recovery process initiation

Description	Initiate automated processes for disaster recovery
Resources	Azure DevOps (requires permission to create branches); Amazon Web Services (AWS) (requires access to the `medicineone_clinicalbrain-prod` account with the role Disaster_Recovery_Permissions); pgAdmin software
Owner	Engineering
Substitute	TODO
Step by step
1. Go to Clinical Brain tags.
2. Look through the list of tags to find the one with the highest value. This tag represents the version of the infrastructure currently running in production.
3. Go to Clinical Brain branches.
4. Launch the Create a branch wizard by clicking the New branch button.
5. In the Name field, enter `disaster-recovery/<major.minor.patch>`. Replace `<major.minor.patch>` with the version numbers of the highest tag you identified earlier. For example, if the highest tag was `1.0.0`, your branch name should be `disaster-recovery/1.0.0`.
6. In the Based on field, select the "tags" tab and choose the same tag you identified earlier as having the highest value. This ensures that the new branch is based on the current production version.
7. Click the Create button. This creates the new disaster-recovery branch and also starts a pipeline that automatically deploys the infrastructure to the disaster recovery region.
8. After initiating the deployment, go to Clinical Brain pipeline to monitor the progress.
9. Keep an eye on the pipeline, as an error is expected to occur at this point. It is caused by a credentials mismatch: when the RDS database is restored from a production snapshot into the disaster recovery region, it retains the roles from the original database. Consequently, the database still references those roles' credentials from the production account, while new credentials are generated and stored in the disaster recovery region's AWS Parameter Store. These outdated roles also prevent the lambdas that interact with the database from authenticating.
10. To fix the error, navigate to Amazon Web Services (AWS).
11. Log in to the `medicineone_clinicalbrain-prod` account using the Disaster_Recovery_Permissions role.
12. Select the Paris region from the region selection menu.
13. Access the Parameter Store service.
14. Locate and open the parameter `/databases_connection_strings/clinical_brain/clinical_brain_user`.
15. Click Show decrypted value to reveal its content.
16. Note down the Server value, which is needed to connect to the disaster recovery database.
17. Note down the Password value. You will need it to update the database credentials in the SQL script described in the steps below.
18. Return to the Parameter Store service.
19. Open the parameter `/databases_connection_strings/clinical_brain/lambda_user`.
20. Click Show decrypted value and record the Password. This is also required for the SQL script.
21. Again in the Parameter Store, find and open the parameter `/databases_connection_strings/master_user`.
22. Click Show decrypted value and note the displayed Credentials, which are needed to authenticate against the disaster recovery database.
23. Launch the pgAdmin software.
24. Right-click Servers and navigate to Register → Server.
25. In the General tab, enter `clinical-brain-dr` in the Name field.
26. Switch to the Connection tab.
27. In the Host name/address field, enter the Server value you noted earlier.
28. Use the Credentials from the Parameter Store for the Username and Password fields.
29. Click Save.
30. Expand the clinical-brain-dr server.
31. Right-click the clinical_brain database and select Query Tool.
32. Paste the following script:
    `ALTER ROLE clinical_brain WITH PASSWORD '<replace_by_clinical_brain_password>'; -- replace with the password obtained from /databases_connection_strings/clinical_brain/clinical_brain_user`
    `ALTER ROLE lambda WITH PASSWORD '<replace_by_lambda_password>'; -- replace with the password obtained from /databases_connection_strings/clinical_brain/lambda_user`
33. Replace `<replace_by_clinical_brain_password>` with the password obtained earlier from `/databases_connection_strings/clinical_brain/clinical_brain_user`.
34. Replace `<replace_by_lambda_password>` with the password obtained earlier from `/databases_connection_strings/clinical_brain/lambda_user`, then execute the script.
35. Now that the database credentials are updated, navigate to Clinical Brain pipeline.
36. Click the Run pipeline button.
37. Select the previously created disaster recovery branch in the Branch/tag field and click Run.
38. Monitor the pipeline and wait for it to complete successfully.
39. Access Amazon Web Services (AWS) again.
40. Navigate to the API Gateway service.
41. In the left menu, select Custom domain names.
42. Find and select the custom domain name `clinicalbrain.medicineone.cloud`.
43. In the Configurations tab, locate the API Gateway domain name and take note of its value. This value is sent to SAST for updating the DNS entry, as detailed in the communication plan.
Stakeholders	Sales & Marketing, Customer Support, SAST, Customers
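
Picking the highest tag by eye becomes error-prone once a version component reaches two digits, because `1.10.0` sorts before `1.9.0` when compared as plain text. The following is a minimal sketch, not part of the official pipeline, of how the branch name could be derived from a list of tag names. It assumes the tags are plain `major.minor.patch` strings; the function names `highest_version` and `disaster_recovery_branch` are hypothetical.

```python
import re

# Assumption: production tags look exactly like "1.0.0" (major.minor.patch).
TAG_PATTERN = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def highest_version(tags):
    """Return the highest (major, minor, patch) tuple found in the tag names."""
    versions = []
    for tag in tags:
        match = TAG_PATTERN.match(tag.strip())
        if match:
            # Compare numerically, so that 1.10.0 ranks above 1.9.0.
            versions.append(tuple(int(part) for part in match.groups()))
    if not versions:
        raise ValueError("no major.minor.patch tags found")
    return max(versions)


def disaster_recovery_branch(tags):
    """Build the branch name used by this procedure from the highest tag."""
    major, minor, patch = highest_version(tags)
    return f"disaster-recovery/{major}.{minor}.{patch}"
```

For example, `disaster_recovery_branch(["1.0.0", "1.9.0", "1.10.0"])` returns `disaster-recovery/1.10.0`. Tags that do not match the pattern are ignored rather than guessed at.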

Communication plan

Information	Frequency	Sender	Recipient	Channel
Recovery process completed, indicating the value of the API Gateway domain name	Once	Engineering	Sales & Marketing, Customer Support, SAST	Email
Provide regular updates	Regular intervals or as new information becomes available	Engineering	Customers	Confluence (https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669)
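
The API Gateway domain name reported in the recovery email can also be read programmatically instead of from the console. The sketch below rests on several assumptions: the custom domain is exposed through the API Gateway REST (v1) API, the client is a boto3 `apigateway` client created for the Paris region (`eu-west-3`) with the Disaster_Recovery_Permissions role, and the helper name `api_gateway_target` is hypothetical. The client is passed in as a parameter so the function can be exercised without AWS access.

```python
def api_gateway_target(client, domain_name="clinicalbrain.medicineone.cloud"):
    """Return the API Gateway domain name behind a custom domain name.

    `client` is assumed to behave like a boto3 "apigateway" client, e.g.:
        import boto3
        client = boto3.Session(region_name="eu-west-3").client("apigateway")
    """
    response = client.get_domain_name(domainName=domain_name)
    # Regional endpoints report "regionalDomainName"; edge-optimized
    # endpoints report "distributionDomainName" instead.
    target = response.get("regionalDomainName") or response.get("distributionDomainName")
    if not target:
        raise LookupError(f"no API Gateway domain name found for {domain_name}")
    return target
```

If the custom domain turns out to be managed through the HTTP API (v2) interface, the response shape differs and this helper would need to be adapted.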

Activity #4 - DNS entry update for Disaster Recovery

Description	Update the DNS entry to redirect traffic to the new API Gateway in the disaster recovery region.
Resources	TODO
Owner	SAST
Substitute	TODO
Step by step
1. Obtain the new API Gateway address from the email sent by Engineering.
2. Access the DNS management tool.
3. Update the DNS entry to point to the new API Gateway address.
4. Verify that traffic is redirected to the disaster recovery region.
Stakeholders	Sales & Marketing, Customer Support, SAST, Customers
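
The final verification step can be approximated by checking that the custom domain and the new API Gateway domain name resolve to at least one common IP address. This is a sketch under assumptions, not an established part of the procedure: the function names `resolve_ips` and `dns_points_to` are hypothetical, and the `resolver` parameter exists only so the logic can be tested without network access. Cached DNS records or rotating endpoint addresses can produce a false negative, so a negative result is a hint to re-check rather than proof of failure.

```python
import socket


def resolve_ips(hostname, resolver=socket.getaddrinfo):
    """Return the set of IP addresses that a hostname currently resolves to."""
    results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address for both IPv4 and IPv6.
    return {info[4][0] for info in results}


def dns_points_to(custom_domain, gateway_domain, resolver=socket.getaddrinfo):
    """Check whether both names share at least one resolved IP address."""
    custom_ips = resolve_ips(custom_domain, resolver)
    gateway_ips = resolve_ips(gateway_domain, resolver)
    return bool(custom_ips & gateway_ips)
```

A typical call would pass `clinicalbrain.medicineone.cloud` as `custom_domain` and the API Gateway domain name received by email as `gateway_domain`.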

Communication plan

Information	Frequency	Sender	Recipient	Channel
DNS update completed	Once	SAST	Engineering, Sales & Marketing, Customer Support	Email
DNS update completed	Once	SAST	Customers	Confluence (https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669)

Activity #5 - Service restoration verification

Description	Verify the restoration of services post-recovery
Resources	Seq (Clinical Brain (AWS) signal)
Owner	Engineering
Substitute	TODO
Step by step
1. Check the service status through Seq (Clinical Brain (AWS) signal).
Stakeholders	Sales & Marketing, Customer Support, SAST, Customers

Communication plan

Information	Frequency	Sender	Recipient	Channel
Service verification results	Once	Engineering	Engineering, Sales & Marketing, Customer Support	Email
Service verification results	Once	Engineering	Customers	Confluence (https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669)

Activity #6 - Post-incident review

Description	Conduct a review of the incident response and document lessons learned
Resources	Confluence
Owner	Engineering
Substitute	TODO
Step by step
1. Compile a detailed incident report in the Jira task.
2. Document lessons learned and recommendations.
3. Update the operational procedure as needed.
Stakeholders	Sales & Marketing, Customer Support, SAST

Communication plan

Information	Frequency	Sender	Recipient	Channel
TODO	TODO	TODO	TODO	TODO

Files

File	Modified
image-20231215-194347.png (PNG)	Jan 23, 2024 by Fernando Tinoco
image-20231222-103749.png (PNG)	Jan 23, 2024 by Fernando Tinoco
