Operational procedures | AWS Region outage

Activity

OPERATIONS

Department

SALES & MARKETING, ENGINEERING, CUSTOMER SUPPORT, SAST

Process

AWS Region outage


Description

Objective

To respond to an AWS region outage impacting Clinical Brain's infrastructure, ensuring rapid service restoration, minimal operational disruption, and clear communication with stakeholders. The goal is to maintain business continuity during such incidents.

Scope

This procedure exists to keep Clinical Brain's services available when an AWS region fails. It ensures that our systems can recover quickly from a region outage and avoid prolonged downtime, in line with our business goal of providing a reliable and consistent service to our users.

Definitions

N/A


List of activities

  1. Incident identification

  2. Stakeholder communication

  3. Automated recovery process initiation

  4. DNS entry update for Disaster Recovery

  5. Service restoration verification

  6. Post-incident review


Description of activities

Activity #1 - Incident identification

Description

Identify and confirm an AWS region outage affecting Clinical Brain's services.

Resources

Responsible

ENGINEERING

Substitute

todo

Step by step

  1. Monitor AWS status pages and Seq

  2. Create a Jira task to record the diagnosis and analysis conducted

  3. Create an entry on the page https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669

Stakeholders

SALES & MARKETING, CUSTOMER SUPPORT, SAST, Customers

Communication plan

Information | Frequency | Sender | Recipient | Channel
Confirmation of outage | Once | ENGINEERING | SALES & MARKETING, CUSTOMER SUPPORT, SAST | Email
Confirmation of outage | Once | ENGINEERING | Customers | Confluence (https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669)


Activity #2 - Stakeholder communication

Description

Communicate with internal and external stakeholders about the incident and ongoing response actions.

Resources

Responsible

SALES & MARKETING

Substitute

todo

Step by step

todo - migrate the steps from https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/376635393 to this page

Stakeholders

Customers

Communication plan

Information | Frequency | Sender | Recipient | Channel
todo | todo | todo | todo | todo


Activity #3 - Automated recovery process initiation

Description

Initiate the automated processes for disaster recovery.

Resources

Responsible

ENGINEERING

Substitute

todo

Step by step

  1. Go to Clinical Brain tags

  2. Look through the list of tags to find the one with the highest value. This tag represents the version of the infrastructure currently running in production

  3. Go to Clinical Brain branches

  4. Launch the Create a branch wizard by clicking the New branch button

    1. In the Name field, enter disaster-recovery/<major.minor.patch>. Replace <major.minor.patch> with the version numbers of the highest tag you identified earlier. For example, if the highest tag was 1.0.0, your branch name should be disaster-recovery/1.0.0

    2. In the Based on field, select the "tags" tab and choose the same tag you identified earlier as having the highest value. This step ensures that your new branch is based on the current production version

    3. Click on the Create button. This action will not only create the new disaster-recovery branch but also initiate a pipeline that automatically deploys the infrastructure to the disaster recovery region

  5. After initiating the deployment, go to Clinical Brain pipeline to monitor the progress

  6. Keep an eye on the pipeline, as the following error is expected to occur:

    • This is due to a credentials mismatch. When the RDS instance is restored from a production snapshot into the disaster recovery region, it retains the roles from the original database. Consequently, the database still holds those roles' credentials from the production account, while new credentials are generated and stored in the AWS Parameter Store of the disaster recovery region. These outdated role credentials prevent the lambdas that interact with the database from authenticating.

  7. To fix the error, navigate to Amazon Web Services (AWS)

  8. Log in to the medicineone_clinicalbrain-prod account, utilizing the Disaster_Recovery_Permissions role

  9. Select the Paris region from the region selection menu

  10. Access the Parameter Store service

  11. Locate and open the parameter /databases_connection_strings/clinical_brain/clinical_brain_user

  12. Click on Show decrypted value to reveal its content

    image-20231222-103749.png

  13. Note down the Server value, which is needed to connect to the disaster recovery database

  14. Note down the Password value. You'll need this for updating the database credentials in an SQL script, the details of which will be provided in the subsequent steps

  15. Return to the Parameter Store service

  16. Open the parameter /databases_connection_strings/clinical_brain/lambda_user

  17. Click on Show decrypted value and record the Password. This will also be required for the SQL script mentioned later

  18. Again in the Parameter Store, find and open the /databases_connection_strings/master_user parameter

  19. Click on Show decrypted value and note the displayed Credentials, which are needed to authenticate against the disaster recovery database

  20. Launch the pgAdmin software

  21. Right-click on Servers and navigate to Register → Server

    • In the General tab, enter clinical-brain-dr in the Name field

    • Switch to the Connection tab

    • In the Host name/address field, input the Server value you noted earlier

    • Use the Credentials from the Parameter Store for the Username and Password fields

    • Click on Save

  22. Expand the clinical-brain-dr server

  23. Right-click on the clinical_brain database and select Query Tool

  24. Paste the following script:

    ALTER ROLE clinical_brain WITH PASSWORD '<replace_by_clinical_brain_password>'; -- replace with the password obtained from /databases_connection_strings/clinical_brain/clinical_brain_user
    ALTER ROLE lambda WITH PASSWORD '<replace_by_lambda_password>'; -- replace with the password obtained from /databases_connection_strings/clinical_brain/lambda_user

    • Replace <replace_by_clinical_brain_password> with the password obtained earlier for /databases_connection_strings/clinical_brain/clinical_brain_user

    • Replace <replace_by_lambda_password> with the password obtained earlier for /databases_connection_strings/clinical_brain/lambda_user

  25. Now that the database credentials are updated, navigate to Clinical Brain pipeline

  26. Click on the button Run pipeline

  27. Select the previously created disaster recovery branch in the Branch/tag field and click on Run

  28. Monitor the pipeline and wait for it to complete successfully

  29. Access Amazon Web Services (AWS) again

  30. Navigate to the API Gateway service

  31. In the left menu, select Custom domain names

  32. Find and select the custom domain name clinicalbrain.medicineone.cloud

  33. In the Configurations tab, locate the API Gateway domain name and take note of its value. This information will be provided to SAST for updating the DNS entry, which is detailed in the communication plan
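
Steps 2 to 4 above depend on picking the highest tag and deriving the branch name from it. As a minimal sketch, assuming tags follow a plain major.minor.patch scheme (optionally prefixed with v), the selection could be double-checked with a short Python script. The function names and the sample tag list are hypothetical and not part of the Clinical Brain repository. The point it illustrates is that versions should be compared numerically, so 1.10.0 ranks above 1.9.3, which an alphabetical sort would get wrong.

```python
import re

# Assumption: tags look like "1.2.3" or "v1.2.3"; anything else is ignored.
TAG_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_tag(tag):
    """Return (major, minor, patch) as integers, or None if the tag does not match."""
    match = TAG_PATTERN.match(tag.strip())
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def highest_tag(tags):
    """Return the highest version among the tags, compared numerically."""
    versions = [v for v in (parse_tag(t) for t in tags) if v is not None]
    if not versions:
        raise ValueError("no tag matches major.minor.patch")
    return max(versions)


def disaster_recovery_branch(tags):
    """Build the branch name disaster-recovery/<major.minor.patch>."""
    major, minor, patch = highest_tag(tags)
    return f"disaster-recovery/{major}.{minor}.{patch}"


if __name__ == "__main__":
    # Hypothetical tag list, for illustration only.
    print(disaster_recovery_branch(["1.0.0", "1.9.3", "1.10.0"]))
    # prints "disaster-recovery/1.10.0"
```

The script only helps confirm the branch name; the branch itself is still created through the wizard as described in step 4.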

Stakeholders

SALES & MARKETING, CUSTOMER SUPPORT, SAST, Customers

Communication plan

Information | Frequency | Sender | Recipient | Channel
Recovery process completed, indicating the value of the API Gateway domain name | Once | ENGINEERING | SALES & MARKETING, CUSTOMER SUPPORT, SAST | Email
Provide regular updates | Regular intervals or as new information becomes available | ENGINEERING | Customers | Confluence (https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669)


Activity #4 - DNS entry update for Disaster Recovery

Description

Update the DNS entry to redirect traffic to the new API Gateway in the disaster recovery region.

Resources

todo

Responsible

SAST

Substitute

todo

Step by step

  1. Obtain the new API Gateway address from the email sent by ENGINEERING

  2. Access the DNS management tool

  3. Update the DNS entry to point to the new API Gateway address

  4. Verify the redirection of traffic to the disaster recovery region
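
Step 4 above can be partly automated. The following is a rough sketch, not the procedure SAST actually uses: it compares the IPv4 addresses that the custom domain and the API Gateway domain name resolve to, using socket.gethostbyname_ex from the Python standard library. The function names are hypothetical, and the check is only a heuristic, because cached DNS answers (until the record's TTL expires) or rotating AWS addresses can make a correct update look unapplied for a while.

```python
import socket


def resolve_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses a hostname currently resolves to.

    gethostbyname_ex returns (hostname, aliaslist, ipaddrlist) and raises
    socket.gaierror when the name cannot be resolved.
    """
    _name, _aliases, addresses = resolver(hostname)
    return set(addresses)


def points_to_target(custom_domain, target_domain, resolver=socket.gethostbyname_ex):
    """Return True when both names share at least one resolved IPv4 address."""
    custom_ips = resolve_ips(custom_domain, resolver)
    target_ips = resolve_ips(target_domain, resolver)
    return bool(custom_ips & target_ips)
```

For example, points_to_target("clinicalbrain.medicineone.cloud", "<API Gateway domain name>") would be called with the value noted by ENGINEERING at the end of activity #3. The resolver parameter exists only so the check can be exercised without network access; a False result shortly after the change may simply mean the old record is still cached.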

Stakeholders

SALES & MARKETING, CUSTOMER SUPPORT, SAST, Customers

Communication plan

Information | Frequency | Sender | Recipient | Channel
DNS update completed | Once | SAST | ENGINEERING, SALES & MARKETING, CUSTOMER SUPPORT | Email
DNS update completed | Once | SAST | Customers | Confluence (https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669)


Activity #5 - Service restoration verification

Description

Verify that services have been restored after recovery.

Resources

Responsible

ENGINEERING

Substitute

todo

Step by step

  1. Check service status through Seq (Clinical Brain (AWS) signal)

Stakeholders

SALES & MARKETING, CUSTOMER SUPPORT, SAST, Customers

Communication plan

Information | Frequency | Sender | Recipient | Channel
Service verification results | Once | ENGINEERING | ENGINEERING, SALES & MARKETING, CUSTOMER SUPPORT | Email
Service verification results | Once | ENGINEERING | Customers | Confluence (https://medicineone.atlassian.net/wiki/spaces/CUSTOMERSUPPORT/pages/377585669)


Activity #6 - Post-incident review

Description

Conduct a review of the incident response and document lessons learned.

Resources

Responsible

ENGINEERING

Substitute

todo

Step by step

  1. Compile a detailed incident report in the Jira task

  2. Document lessons learned and recommendations

  3. Update the operational procedure as needed.

Stakeholders

SALES & MARKETING, CUSTOMER SUPPORT, SAST

Communication plan

Information | Frequency | Sender | Recipient | Channel
todo | todo | todo | todo | todo

Files

File | Modified
image-20231215-194347.png (PNG) | Jan 23, 2024 by Fernando Tinoco
image-20231222-103749.png (PNG) | Jan 23, 2024 by Fernando Tinoco