Migration Strategy: Phased Approach to a Single Specify 7 Database

The Core Problem

We are migrating multiple collections from a source Oracle database with partially duplicated shared data (taxonomy, geography, persons) into a single Specify 7 database. Each collection needs to be testable independently before the next one goes in.

The challenge is that shared data — agents, taxonomy, geography — is global in Specify 7 and referenced by all collections. It cannot be migrated incrementally. Specimen data can.

Specify 7 Database Structure (Brief Recap)

Institution
└── Discipline  (e.g. "Botany", "Zoology")
    ├── TaxonTree          ← SHARED within discipline
    ├── GeographyTree      ← SHARED across institution
    ├── AgentTable         ← SHARED across institution
    └── Collection         ← ONE PER DATASET (karplanter, mosses, marine…)
        └── CollectionObject  ← migrated per dataset

Key rule: Agent, Geography, and Taxon records are shared across all collections. A CollectionObject in “Karplanter” and one in “Mosses” can point to the same Agent (collector) and the same Taxon node. This is exactly what we want — but it means these shared tables must be fully in place before any specimen migration begins.

Infrastructure-as-code: hierarchy YAML

After the database has been bootstrapped once (institution and first guided setup), additional divisions, disciplines, and collections can be kept in version control and applied idempotently:

Config: config/specify_structure/unimus_natur.yaml (edit or add sibling files per environment).
Flow: flows/sync_specify_structure.py — Prefect entrypoint sync_specify_structure_flow. Uses the same DB_* environment variables as other Specify flows (via flows/lib/specify_setup.py). Default is dry_run: true; set dry_run: false only when applying to a target database intentionally.
Deployment: sync-specify-structure-dev in prefect.yaml.

Reports are written under the specify-structure-sync category in the migration-reports S3 prefix when S3_BUCKET is set (same pattern as other migration flows).

Dataset Groups (Collections in Specify 7)

In the Oracle source, specimens group into datasets via:

Oracle field	Where	What it means
`FUNNETIKETT.HERB_ID` → `HERBARIE.HERB_FORK`	USD schemas	The herbarium sub-collection abbreviation (e.g. “V”, “M”, “L”)
`COLLECTING_EVENT.COLLECTIONTYPE_ID`	MUSIT schemas	Organism group / collection type
`MUSEUM_OBJECT.SUB_COLLECTION_ID` + `USER_COLLECTION_SEQS.SUBCOLLECTION`	MUSIT schemas	Sub-collection numbering series

Expected dataset → Specify Collection mapping (verify against actual HERBARIE contents):

Source dataset (HERB_FORK / organism group)	Likely Specify Collection name	Discipline
Karplanter (vascular plants)	Karplanter	Botany
Mosser (bryophytes)	Mosser	Botany
Lav (lichens)	Lav	Botany
Alger (algae)	Alger	Botany
Sopp (fungi)	Sopp	Botany
Kryptogamer (cryptogams, Svalbard)	Kryptogamer	Botany
Marine inv. / fish	Marin	Zoology/Entomology
Entomologi	Entomologi	Zoology/Entomology

⚠️ Confirm exact groupings by running SELECT HERB_ID, HERB_FORK, NAVN FROM HERBARIE in each USD schema before mapping.

Two-Phase Migration

Phase 1 — Shared Foundation (migrate ONCE, in dependency order)

This phase populates Specify 7 with all the shared reference data. No specimens yet. This is a one-time operation that must complete and be validated before Phase 2 starts.

Step 1.1 — Agents
Step 1.2 — Geography  
Step 1.3 — Taxonomy (one tree per discipline)
Step 1.4 — Application Users + SpecifyUser accounts

Each step is both a migration and a validation checkpoint — see “Validation Gates” below.

Step 1.1 — Agents

Source: MUSIT_BOTANIKK_FELLES.ACTOR + PERSON_NAME + GROUPMEMBERSHIP + AUTHORSTRINGS
Also: USD_BOTANIKK_*.PERSONER, USD_BOTANIKK_*.AUTORPERSON, USD_NAT_TAXAREG.AUTORPERSON
Target: Specify Agent table

An implemented subset — MUSIT ACTOR + PERSON_NAME for MUSIT_BOTANIKK_FELLES and MUSIT_ZOOLOGI_ENTOMOLOGI — is loaded by the Prefect flow migrate_musit_agents_flow (flows/migrate_musit_agents.py). Scope, idempotency, and gaps (USD persons, authors, deduplication) are documented in MUSIT collection agents migration.

Merge strategy:

Start with MUSIT ACTOR as canonical — it has the most structured data (birth/death, ORCID, institution).
Match USD PERSONER against ACTOR by name string similarity → link or create new Agent.
Match AUTORPERSON / AUTOR_LISTE against Agents — these are often abbreviations (e.g. “L.”, “Sw.”) so match rules are different.
Preserve original source IDs in Specify Agent.Remarks or a custom field for traceability.

Key fields to map:

Oracle	Specify
`ACTOR.ACTOR_TYPE` (0=person, 1=org, 2=group)	`Agent.AgentType`
`PERSON_NAME.PERSON_SURNAME`	`Agent.LastName`
`PERSON_NAME.PERSON_GIVEN_NAME`	`Agent.FirstName`
`ACTOR.BIRTHDATE` / `DEATHDATE`	`Agent.DateOfBirth` / `DateOfDeath`
`ACTOR.INSTITUTION`	linked `Agent` (org)
`GROUPMEMBERSHIP`	`Agent` group members
`AUTHORSTRINGS.AUTHORSTRING`	`Agent.Abbreviation`

Step 1.2 — Geography

Source: MUSIT_BOTANIKK_FELLES.ADMINISTRATIVE_PLACE + USD_BOTANIKK_*.ADMINISTRATIVTSTED + USD_BOTANIKK_*.GEOREG
Target: Specify Geography tree (custom — built from MUSIT, not from an external standard)

Approach: MUSIT-first, preserve historical names

We build our own geography tree from the Oracle data rather than using a standard hierarchy (GeoNames etc.). The reason is historical fidelity: specimens collected in 1887 in “Christiania”, or in a municipality that was merged or split in the 2020 kommunereform, must remain permanently linked to the name that was correct when the collecting happened. Future digitisation of unregistered old records will also need these historical names.

MUSIT already solved this — ADMINISTRATIVE_PLACE was built to hold historical names alongside current ones, not replace them.

The Norwegian administrative change problem:

Era	Issue	Impact on geography tree
Pre-1960	Old names: “Christiania” (→ Oslo), pre-reform county names (Akershus etc.)	Must be in tree as own nodes
2020 kommunereform	~430 municipalities merged to ~356	Both old and new municipality names must exist
2024 re-splits	Several 2020 mergers reversed	Third layer of names for same geographic area
Future digitisation	Old undigitised records reference any historical name	Tree must remain open and extensible

Strategy:

Build from MUSIT admin-place sources as-is — do not normalise to “current” administrative names. Prefer ADMINISTRATIVE_PLACE when it is populated (ADMPLACE_TYPE = level, PLACE_ID_PARTOF = parent). In Oracle PROD checks (2026-04-15, migration reporting user), ADMINISTRATIVE_PLACE was empty while PLACE_HIERARCHICAL_PLACE → HIERARCHICAL_PLACE_OLD carried almost all admin names on PLACE rows; if your environment matches, import geography nodes from HIERARCHICAL_PLACE_OLD (via HIERACHICAL_TYPE → TYPES) in addition to USD. See Oracle botany datasets — Geolocation.
Supplement from USD ADMINISTRATIVTSTED — each per-museum schema has its own administrative place table; add any names not already present in MUSIT. Match on name + type + parent to avoid duplicates.
GEOREG (the old UTM-grid-based geographic register in USD schemas) contains municipality codes (kommnr) and names. Use as a cross-reference to catch additional historical names not in MUSIT admin tables. Not every USD botany schema exposes GEOREG to the same Oracle user—discover with ALL_TABLES.
Do not delete or merge historical nodes — a “Trondheim” from 1900 and a “Trondheim” that is a post-2020 merged municipality may coexist in the tree. Specify’s Geography tree supports this.
Mark status optionally — a custom GeographyStatus field (CURRENT / HISTORICAL / MERGED_INTO) on the Geography table can help users understand which nodes are current administrative units. This is optional but useful.

Source tables and what they provide:

Source	Table	Content
MUSIT	`ADMINISTRATIVE_PLACE`	Hierarchical admin units; `ADMPLACE_TYPE` = level; `PLACE_ID_PARTOF` = parent (verify populated in your DB)
MUSIT	`HIERARCHICAL_PLACE_OLD`	Hierarchical admin names; `HIERACHICAL_TYPE` = level; `PLACE_ID_PARTOF` = parent; linked from `PLACE` via `PLACE_HIERACHICAL_PLACE`
MUSIT	`PLACE_HIERACHICAL_PLACE`	Junction: which `HIERARCH_PLACE_ID` applies to each collecting `PLACE_ID`
MUSIT	`MUSIT_NATHIST_FELLES.BIO_GEOGRAFISK_REGION`	Shared biogeographic region vocabulary; linked from `PLACE_BIO_GEOGRAFISK_REGION`
USD each schema	`ADMINISTRATIVTSTED`	Per-museum admin place table; `STED_TYPE` = level; `LAND_ID`/`FYLKE_ID`/`KOMMUNE_ID` FK chain
USD each schema	`GEOREG`	Old UTM-zone area register; `KOMMNR` (municipality number), `NAVN`, `LAND`/`FYLKE`/`KOMMUNE` text fields
USD each schema	`FYLKER`	County list with `FYLKENR` (county number)
USD each schema	`KOMMUNER`	Municipality list (where present)
USD each schema	`COUNTRIES`	Country list

What does NOT go in Geography tree:

KOORDINATE_PLACE → maps to Specify Locality (specific collecting sites with coordinates), not to Geography nodes. Localities live at Collection level; Geography nodes are shared. KOORDINATE_PLACE_ID is not global across Oracle schemas (same integer can mean different coordinates in botany vs entomology); always qualify with the owning schema.
INDEXED_LOCALITY, LOCALITY_PLACE → also Specify Locality, not Geography.

Hierarchy depth in Specify:

Specify’s default Geography ranks: Planet → Continent → Country → State/Province → County → Municipality. Norwegian data maps as:

Specify rank	Norwegian equivalent	ADMPLACE_TYPE value (to confirm)
Continent	Kontinent	type 1
Country	Land	type 2
State/Province	Fylke (county)	type 3
County	Kommuneregion	type 4 (if used)
Municipality	Kommune	type 5

⚠️ Confirm which admin model is populated: SELECT COUNT(*) FROM MUSIT_BOTANIKK_FELLES.ADMINISTRATIVE_PLACE vs counts on HIERARCHICAL_PLACE_OLD / PLACE_HIERACHICAL_PLACE. If ADMINISTRATIVE_PLACE is empty, map HIERACHICAL_TYPE (with TYPES) instead of ADMPLACE_TYPE for hierarchy levels.

⚠️ Geography nodes are shared across all Specify collections. Build this tree once, completely, before any specimens are migrated. All four botany museums and the zoology collection will reference the same nodes.

Implementation note

Geography and locality loading now run as part of per-dataset migration, not as a dedicated standalone Prefect flow. Keep the same idempotency rule: resolve by stable source identifiers and update existing Specify rows instead of inserting duplicates.

Step 1.3 — Taxonomy

Source: NorTaxa (Artsdatabanken) via REST API + Oracle LATIN_NAMES for unmatched species (Phase 2 / future)
Target: Specify Taxon trees (one TaxonTreeDef per Discipline)

Approach: NorTaxa-first (API sync)

Rather than migrating Oracle taxonomy trees wholesale, we use NorTaxa as the canonical authority and load curated slices per Specify discipline. Matching legacy data uses LATIN_NAMES.ADB_TAXON_ID (= NorTaxa scientificNameId = Specify taxonomicserialnumber).

Operational documentation: see NorTaxa taxon trees for the full sync design, field mapping, changelog behaviour, purge flows, and expected future scenarios.

Summary of the implemented flow (nortaxa-discipline-trees-dev):

Extract — NorTaxa DataTransfer/Export per discipline root scientificNameId (see flows/lib/nortaxa_discipline_root_specs.py).
Bootstrap — create TaxonTreeDef, full rank ladder (Life → Species …), and root Taxon per discipline if missing.
Merge — insert/update accepted taxa and synonyms; mark orphans (yesno1=false) when a name drops out of the slice.
Changelog — incremental TaxonName/ChangeLog sync: auto-apply safe changes; queue Merge/Split/Delete for curator review.

Key field mapping (NorTaxa → Specify):

NorTaxa / Oracle	Specify
`scientificNameId` / `ADB_TAXON_ID`	`Taxon.taxonomicserialnumber`
`taxonId` (concept)	`Taxon.text2`
`NameString` / rank-local epithet	`Taxon.name`
(computed)	`Taxon.fullname` via `set_fullnames`
`scientificNameAuthorship`	`Taxon.author`
`vernacularNameBokmaal`	`Taxon.commonname`
`taxonRank`	`Taxon.definitionitem`
—	`Taxon.source` = `"NorTaxa"` for managed rows

Flagging non-NorTaxa taxa (planned for Oracle-only inserts):

Add a custom boolean field IsExtraNorTaxa (or NorTaxaStatus varchar) to the Specify Taxon table. Set it on insert for any taxon added outside the base NorTaxa import.

Two sub-categories worth distinguishing:

Category	Example	Flag value
Norwegian species not yet in NorTaxa	Recently described, awaiting review	`PENDING`
Genuinely foreign / extra-limital species	Tropical holotypes, Arctic borderline spp.	`EXTRA_LIMITAL`
NorTaxa match	Any species found in ADB	(null / unset)

Legacy Oracle fields (for unmatched taxa during specimen migration):

Oracle	Specify
`LATIN_NAMES.LATIN_NAME`	epithet / name (rank-dependent)
`TAXON_CATHEGORY.TAX_CATH_CODE`	`Taxon.RankID`
`LATIN_NAMES.PARENT_LATIN_NAME_ID`	`Taxon.Parent` (non-NorTaxa inserts only)
`LATIN_NAMES.IS_VALID`	`Taxon.IsAccepted`
`AUTHORSTRINGS.AUTHORSTRING`	`Taxon.Author`
`LATIN_NAMES.ADB_TAXON_ID`	match key → `taxonomicserialnumber`
`LATIN_NAMES.NHM_TAXON_ID`	secondary key; custom field

⚠️ NorTaxa covers Norwegian-relevant taxa. For marine and entomology collections there will be a long tail of foreign species (holotypes, Arctic material, imported specimens). Budget time for reviewing the EXTRA_LIMITAL tail before going live.

Step 1.4 — Users

Source: USD_METADATA.BRUKARAR + BRUKERNAVN_GRUPPE + GRUPPE
Target: Specify SpecifyUser + Agent

For each row in BRUKARAR: create a SpecifyUser.
Match or create the corresponding Agent (from Step 1.1) via name/email.
Map GRUPPE.MUSEUM → Specify Collection access.

The Prefect flow that performs the load writes a JSON summary artifact, migration_report.json (counts, errors, and a museum-group inventory). Field definitions, S3 layout, and how to interpret dry-run vs live runs are documented in User migration report.

Phase 2 — Specimen Migration (one dataset at a time)

Once Phase 1 is validated, migrate specimens collection by collection. Each iteration follows the same steps and produces a testable result in Specify.

For each dataset (e.g. "Karplanter TRH"):
  Step 2.1 — Collecting Events + Localities
  Step 2.2 — Collection Objects (specimens)
  Step 2.3 — Determinations (+ taxon links)
  Step 2.4 — Attachments (media from USD_FELLES)
  ── VALIDATE ──
  → proceed to next dataset

Filtering by dataset

Source filter query (example for Karplanter in Trondheim):

-- USD source
SELECT f.* 
FROM USD_BOTANIKK_TRONDHEIM.FUNNETIKETT f
JOIN USD_BOTANIKK_TRONDHEIM.HERBARIE h ON f.HERB_ID = h.HERB_ID
WHERE h.HERB_FORK = 'V'   -- 'V' = Vaskulærplanter/Karplanter

-- MUSIT source  
SELECT mo.* 
FROM MUSIT_BOTANIKK_FELLES.MUSEUM_OBJECT mo
WHERE mo.SUB_COLLECTION_ID = <karplanter_id>

The actual HERB_FORK values and SUB_COLLECTION_ID values need to be confirmed from live DB. Enumerate them as the first step of each collection migration.

Iterating across museums

Each dataset × museum combination is a separate batch. Suggested sequence:

Batch	Source	Specify Collection	~Size estimate
1	USD_BOTANIKK_TRONDHEIM / Karplanter	Karplanter (TRH)	Pilot
2	USD_BOTANIKK_TROMSO / Karplanter	Karplanter (TMS)
3	USD_BOTANIKK_BERGEN / Karplanter	Karplanter (BRG)
4	USD_BOTANIKK_SVALBARD / Karplanter	Karplanter (SVA)
5	*/Mosser	Mosser
6	*/Lav	Lav
…	…	…

For each batch, the same Specify Collection (e.g. “Karplanter”) receives records from all four museum schemas — they all share the same taxonomy and geography nodes loaded in Phase 1.

Validation Gates

Each phase and batch needs a defined validation checkpoint before proceeding.

Phase 1 gates

After step	Check
1.1 Agents	Row count in Specify Agent ≈ expected; spot-check 10 known collectors by name; no duplicate agents for same person
1.2 Geography	Norway hierarchy complete to municipality level; spot-check known localities
1.3 Taxonomy	Tree depth/structure correct; known species findable; synonym links intact
1.4 Users	All active users can log in to Specify; permissions correct per collection

Phase 2 gates (per dataset batch)

Check	How
Row count	`COUNT()` in source vs. `COUNT()` in Specify for the collection
Null FK check	No `CollectionObject` with null `Collector`, null `Taxon`, or null `Locality` where source had data
Sample spot-check	Pick 20 specimens, open in Specify, verify all fields match source
Orphan check	No `CollectingEvent` records without linked `CollectionObject`
Duplicate check	No duplicate `CatalogNumber` within a collection

Handling the “Shared Data Updates” Problem

Once Phase 1 is done and Phase 2 begins, new agents or taxa might appear in later datasets that weren’t present in the initial foundation load. This is expected.

Rule: New agents/taxa discovered during a specimen batch are added to the shared tables on the fly during that batch migration. The shared tables grow incrementally, but each item is only ever added once (check before insert).

This means the specimen migration flows must:

Look up the Agent/Taxon/Geography by stable ID (Oracle PK → Specify PK mapping table).
If not found: insert it, record the mapping.
If found: link to existing.

A cross-reference table (maintained in the migration environment, not in Specify) is essential:

oracle_to_specify_map:
  oracle_schema    VARCHAR   -- e.g. 'MUSIT_BOTANIKK_FELLES'
  oracle_table     VARCHAR   -- e.g. 'ACTOR'
  oracle_id        NUMBER    -- e.g. 12345
  specify_table    VARCHAR   -- e.g. 'Agent'
  specify_id       NUMBER    -- e.g. 67890

This map is the migration’s “memory” — it allows any phase to look up whether an Oracle record has already been imported.

Summary: What Gets Migrated When

Phase 1 (once, before any specimens)
├── 1.1  Agents                   ← all schemas merged and deduplicated
├── 1.2  Geography tree           ← custom built from MUSIT to preserve historical names
├── 1.3  Taxonomy tree(s)         ← NorTaxa as primary backbone + unmatched Oracle taxa
└── 1.4  SpecifyUsers             ← from USD_METADATA.BRUKARAR

Phase 2 (repeated N times, one per dataset-batch)
├── Batch 1  Karplanter / TRH     ← first pilot, most carefully validated
├── Batch 2  Karplanter / TMS
├── Batch 3  Karplanter / BRG
├── Batch 4  Karplanter / SVA
├── Batch 5  Mosser / all museums
├── Batch 6  Lav / all museums
├── Batch 7  Sopp / all museums
├── Batch 8  Alger / all museums
├── Batch 9  Marine / Marin
└── Batch N  Entomologi / …

Cross-cutting concern (maintained throughout)
└── oracle_to_specify_map         ← ID mapping table, lives in migration env