MUSIT collection agents migration
This page documents how collection agents in Oracle relate to application users, what the Migrate MUSIT Actors Prefect flow does, and how it differs from User migration report / BRUKARAR flow.
Users vs agents in Oracle (and Specify)
MUSIT keeps two largely separate notions of “people”:
| Concept in MUSIT | Typical Oracle source | Role | Specify target |
|---|---|---|---|
| Application user (login) | USD_METADATA.BRUKARAR (+ groups) | Staff who sign into the apps | SpecifyUser + linked Agent for attribution |
| Collection agent (specimen graph) | MUSIT_*.ACTOR + PERSON_NAME (+ USD PERSONER / authors, not in this flow yet) | Collectors, determiners, organisations on events | Agent only (SpecifyUser usually null) |
The same person may appear in both ACTOR and BRUKARAR. Specify also uses Agent for both login persons and specimen roles, so over time you may want to merge duplicates by name or email; that merge is not implemented in the current flows.
Strategic background and table lists: Oracle Schema Overview — Persons, Agents & Users. Phased order: Migration Strategy — Step 1.1 vs 1.4.
Flow: migrate_musit_agents_flow
- Module:
flows/migrate_musit_agents.py - Prefect name:
Migrate MUSIT Actors - Phase: 1.1 (shared collection
Agentrows before specimen migration).
Implementation note (memory and Prefect)
Oracle returns on the order of hundreds of thousands of ACTOR rows. The flow uses a single Prefect task (extract_and_load_musit_agents_task) that streams rows from Oracle and applies the load in the same process, instead of returning a giant Python list across a task boundary (which would force Prefect to serialize the full result and roughly double memory use). Dry-run progress is logged at INFO every 5 000 rows; per-row lines are DEBUG only so the worker is not flooded with hundreds of thousands of log events (which can overwhelm log backends and look like a crash). The dev worker pod has a 1 Gi memory limit in the default chart; if live (dry_run: false) runs OOM, raise prefect.devWorker.resources.limits.memory in Helm.
Source
For each selected schema, the flow reads ACTOR and joins PERSON_NAME:
- Prefer
ACTOR.VALID_PERSON_NAME_IDwhen set. - Otherwise use the minimum
PERSON_NAME_IDfor thatACTOR_ID(deterministic fallback when there is no valid flag).
Allowed schema names (whitelist inside the flow):
MUSIT_BOTANIKK_FELLESMUSIT_ZOOLOGI_ENTOMOLOGI
Target
Specify Agent via Django ORM:
agenttype: MUSITACTOR_TYPE0 (person) → Specify 1 (Person); MUSIT 1 (organisation) or 2 (group) → Specify 0 (Organization).- Names:
PERSON_GIVEN_NAME/PERSON_SURNAME(truncated to Specify column limits). If a person has no name parts,ACTORNAMEis used as last name. Organisations useACTORNAMEas last name when surname is empty. - Other fields: email, title, middle name, birth/death dates, division (first
Divisionin the DB, same pattern asmigrate_users). specifyuser: always null here (these are not login rows).- Idempotency:
remarksis set to a fixed marker
MUSIT-migration: ACTOR; schema=<SCHEMA>; ACTOR_ID=<id>
An existingAgentwith the sameremarksis skipped on re-run.
Parameters
| Parameter | Default | Meaning |
|---|---|---|
oracle_env | PROD | Oracle env prefix for credentials (ORACLE_<ENV>_…). |
dry_run | true | If true, only logs intended creates; no inserts. |
musit_schemas | both schemas | JSON list of schema names to include (subset of the two allowed values). |
Report artifact
When S3_BUCKET is set, a JSON summary is uploaded to:
{S3_MIGRATION_REPORTS_PREFIX}/collection-agents-musit-actor-person-name/<timestamp>/report.json
(Default prefix: migration-reports — see Migration reports on S3; not under oracle-schema.)
The payload mirrors the user-migration style: shared metadata (report_version, flow, migration_phase, generated_at_utc, oracle_env, dry_run) plus flow-specific counts and diagnostics. See Migration reports on S3.
| Field | Type | Description |
|---|---|---|
report_version | integer | JSON shape version (1). |
flow | string | migrate_musit_agents. |
migration_phase | string | 1.1. |
generated_at_utc | string | UTC folder timestamp. |
oracle_env | string | Oracle env prefix. |
dry_run | boolean | No Agent rows written when true. |
musit_schemas | array of strings | Schemas included in the run. |
oracle_actors_extracted | integer | Rows returned from Oracle for all schemas. |
oracle_rows_per_schema | object | Count of Oracle rows per schema key. |
oracle_actor_type_counts | object | Raw MUSIT ACTOR_TYPE counts (keys "0", "1", "2", "null"). |
agents_created | integer | Agents inserted or simulated. |
agents_skipped | integer | Already present (matching remarks marker). |
agents_linked | integer | Always 0 here (no SpecifyUser on these agents). |
schemas_processed | array | Distinct schemas seen while loading. |
errors | array of strings | Per-row failures. |
What this flow does not do yet
- USD
PERSONER/LEGSAMLER/DETBESTEMMER/AUTORPERSON— still per the strategy doc; separate extract/merge logic is planned for later. AUTHORSTRINGS/ taxonomic author abbreviation — not joined in this version.- Cross-schema deduplication — the same human could exist in botany and entomology
ACTORwith different IDs; both rows become two Specify agents unless you merge manually or extend the flow. - Linking to
SpecifyUsercreated bymigrate_users— user migration still creates its ownAgentper login; reconciling loginAgentwith an existing MUSITACTORAgentis future work.
Recorded outcome: first production load (2026-04-10)
The following summarizes the live run against Oracle PROD (dry_run: false) executed on 2026-04-10.
Both schemas were processed in a single flow run. Botany processing ran approximately 12:10–20:12 UTC; entomology followed immediately, 20:12–21:11 UTC (~9 hours total elapsed).
| Field | Value |
|---|---|
oracle_actors_extracted (total) | 250,274 |
oracle_rows_per_schema — MUSIT_BOTANIKK_FELLES | 235,124 |
oracle_rows_per_schema — MUSIT_ZOOLOGI_ENTOMOLOGI | 15,150 |
agents_created | 250,274 |
agents_skipped | 0 |
errors | 0 |
All extracted actors mapped to Specify agenttype 1 (Person). There were no organisation (ACTOR_TYPE 1 or 2) rows in either schema. No duplicate markers were encountered (clean first-time load). The run is treated as an operational success for Phase 1.1.
Deployment
Registered in prefect.yaml as migrate-musit-agents-dev (see that file for work_pool / parameters). Run from the CLI with PREFECT_API_URL pointed at your server, for example:
prefect deployment run "Migrate MUSIT Actors/migrate-musit-agents-dev" --param dry_run=false
Recommended order
For a greenfield database, run collection agents (migrate_musit_agents_flow) before application users (migrate_users_flow) if you later want a single Agent per person across specimens and logins. If users were migrated first, you may temporarily have two Agent rows for some staff until a merge pass is defined.