Risk Assessment: Database Strategy

Introduction

This document assesses the risks associated with the different database organization strategies proposed for the UniMus:Natur migration to Specify. The assessment focuses on five main options ranging from a completely centralized model to a highly fragmented one.

Risk Evaluation Matrix

Risks are evaluated based on Likelihood (Low, Medium, High) and Impact (Low, Medium, High).

Risk Level Description
High Critical issue that could halt operations or significantly degrade data quality/usability. Requires mitigation.
Medium Significant issue that complicates workflows or administration. Monitoring required.
Low Minor issue with acceptable trade-offs.

Detailed Analysis of Options

Option 1: Single Database (All Museums, All Collections)

Description: One Specify instance containing data for all 60+ collections across all museums.

Risk Category Risk Description Likelihood Impact Overall Level Mitigation
Operational “All or Nothing” Restore: A data corruption event in one collection requires rolling back the entire database, affecting all other museums/collections. Low High High - Point-in-time recovery strategies.
- Strict access controls to prevent accidental mass-updates.
Performance Performance degradation due to table size (millions of records) and concurrent user load. Medium Medium Medium - Database indexing optimization.
- Hardware scaling.
Governance Governance conflicts: Museums disagreeing on shared standards (e.g., Geography tree, agents). High Medium High - Strong central governance committee with decision-making power.
Security Complex permission schemes required to isolate sensitive data between museums. High High High - Granular Role-Based Access Control (RBAC).

Option 2: One Database per Museum (5 Databases)

Description: Separate databases for each museum (Oslo, Bergen, Trondheim, Tromsø, Stavanger).

Risk Category Risk Description Likelihood Impact Overall Level Mitigation
Data Integrity Cross-Museum Standardization: Creating a national standard becomes harder as each museum manages its own lists (e.g., Geography). Medium Medium Medium - National coordination group.
- Shared configuration templates.
User Experience Siloed Access: Researchers needing access to collections in multiple museums need separate accounts or logins (mitigated by FEIDE/SSO). Low Low Low - SSO (Single Sign-On).
Collaboration Shared Resources: Cannot share “Person” or “agents” across museums, leading to duplicate agent records for the same collector nationally. High Low Medium - National Agent Registry service (external).

Option 3: One Database per Main Organism Group (4 Databases)

Description: Databases organized by discipline (e.g., Botany, Zoology) shared across all museums.

Risk Category Risk Description Likelihood Impact Overall Level Mitigation
Governance Complex Governance: Museums must agree on standards within a discipline. A decision in “Botany” affects all 5 museums. High Medium High - Strong discipline-specific councils.
Data Integrity Cross-Discipline Silos: Cannot link resources between disciplines (e.g., a host plant in Botany linked to an insect in Entomology). Medium Medium Medium - External linking via persistent identifiers (UUIDs).

Option 4: One Database per Main Organism Group per Museum (20 Databases)

Description: Databases organized by discipline AND museum (e.g., Oslo Botany, Bergen Botany, etc.).

Risk Category Risk Description Likelihood Impact Overall Level Mitigation
Operational High Fragmentation: Managing 20+ databases increases upgrade and maintenance complexity significantly compared to 1-5 DBs. High Medium High - Automation (Ansible).
User Experience Fragmented Experience: A user in Oslo working on both Botany and Zoology needs two different environments. High Low Medium - SSO.
Data Integrity Maximum Duplication: Agents, Geography, and Taxonomy references are duplicated across 20 silos. Synchronization is very difficult. High High High - Aggressive external authority management.

Option 5: One Database per Collection (~60 Databases)

Description: Maximum isolation, treating each collection as a separate Specify instance.

Risk Category Risk Description Likelihood Impact Overall Level Mitigation
Operational Administration Overhead: Updating schemas, managing users, and backing up 60+ separate databases is labor-intensive. High High High - Infrastructure-as-Code (Ansible/Terraform) for automation.
- Centralized user management.
Data Functionality Loss of Cross-Collection Queries: Cannot easily query “All samples collected by Person X” across the institution. High High High - Build a separate “Data Warehouse” or discovery portal for aggregate queries.
Standardization Schema Drift: Over time, individual collections diverge in field usage, making future aggregation impossible. High High High - Enforce strict template management policies.

Summary recommendation

From a Risk Management perspective:

  • Option 1 (Single DB) carries high governance and “blast radius” risks (one error affects everyone), but offers the best data integrity and lowest admin overhead.
  • Option 5 (Many DBs) carries extreme operational and data fragmentation risks, requiring significant investment in automation and external tooling to mitigate.

Conclusion: If resources for automation and building external aggregation tools are low, Option 1 or 2 (fewer databases) presents the most manageable risk profile despite the governance challenges.


This site uses Just the Docs, a documentation theme for Jekyll.