MUSIT collection agents migration

This page documents how collection agents in Oracle relate to application users, what the Migrate MUSIT Actors Prefect flow does, and how it differs from User migration report / BRUKARAR flow.

Users vs agents in Oracle (and Specify)

MUSIT keeps two largely separate notions of “people”:

Concept in MUSIT Typical Oracle source Role Specify target
Application user (login) USD_METADATA.BRUKARAR (+ groups) Staff who sign into the apps SpecifyUser + linked Agent for attribution
Collection agent (specimen graph) MUSIT_*.ACTOR + PERSON_NAME (+ USD PERSONER / authors, not in this flow yet) Collectors, determiners, organisations on events Agent only (SpecifyUser usually null)

The same person may appear in both ACTOR and BRUKARAR. Specify also uses Agent for both login persons and specimen roles, so over time you may want to merge duplicates by name or email; that merge is not implemented in the current flows.

Strategic background and table lists: Oracle Schema Overview — Persons, Agents & Users. Phased order: Migration Strategy — Step 1.1 vs 1.4.

Flow: migrate_musit_agents_flow

  • Module: flows/migrate_musit_agents.py
  • Prefect name: Migrate MUSIT Actors
  • Phase: 1.1 (shared collection Agent rows before specimen migration).

Implementation note (memory and Prefect)

Oracle returns on the order of hundreds of thousands of ACTOR rows. The flow uses a single Prefect task (extract_and_load_musit_agents_task) that streams rows from Oracle and applies the load in the same process, instead of returning a giant Python list across a task boundary (which would force Prefect to serialize the full result and roughly double memory use). Dry-run progress is logged at INFO every 5 000 rows; per-row lines are DEBUG only so the worker is not flooded with hundreds of thousands of log events (which can overwhelm log backends and look like a crash). The dev worker pod has a 1 Gi memory limit in the default chart; if live (dry_run: false) runs OOM, raise prefect.devWorker.resources.limits.memory in Helm.

Source

For each selected schema, the flow reads ACTOR and joins PERSON_NAME:

  • Prefer ACTOR.VALID_PERSON_NAME_ID when set.
  • Otherwise use the minimum PERSON_NAME_ID for that ACTOR_ID (deterministic fallback when there is no valid flag).

Allowed schema names (whitelist inside the flow):

  • MUSIT_BOTANIKK_FELLES
  • MUSIT_ZOOLOGI_ENTOMOLOGI

Target

Specify Agent via Django ORM:

  • agenttype: MUSIT ACTOR_TYPE 0 (person) → Specify 1 (Person); MUSIT 1 (organisation) or 2 (group) → Specify 0 (Organization).
  • Names: PERSON_GIVEN_NAME / PERSON_SURNAME (truncated to Specify column limits). If a person has no name parts, ACTORNAME is used as last name. Organisations use ACTORNAME as last name when surname is empty.
  • Other fields: email, title, middle name, birth/death dates, division (first Division in the DB, same pattern as migrate_users).
  • specifyuser: always null here (these are not login rows).
  • Idempotency: remarks is set to a fixed marker
    MUSIT-migration: ACTOR; schema=<SCHEMA>; ACTOR_ID=<id>
    An existing Agent with the same remarks is skipped on re-run.

Parameters

Parameter Default Meaning
oracle_env PROD Oracle env prefix for credentials (ORACLE_<ENV>_…).
dry_run true If true, only logs intended creates; no inserts.
musit_schemas both schemas JSON list of schema names to include (subset of the two allowed values).

Report artifact

When S3_BUCKET is set, a JSON summary is uploaded to:

{S3_MIGRATION_REPORTS_PREFIX}/collection-agents-musit-actor-person-name/<timestamp>/report.json

(Default prefix: migration-reports — see Migration reports on S3; not under oracle-schema.)

The payload mirrors the user-migration style: shared metadata (report_version, flow, migration_phase, generated_at_utc, oracle_env, dry_run) plus flow-specific counts and diagnostics. See Migration reports on S3.

Field Type Description
report_version integer JSON shape version (1).
flow string migrate_musit_agents.
migration_phase string 1.1.
generated_at_utc string UTC folder timestamp.
oracle_env string Oracle env prefix.
dry_run boolean No Agent rows written when true.
musit_schemas array of strings Schemas included in the run.
oracle_actors_extracted integer Rows returned from Oracle for all schemas.
oracle_rows_per_schema object Count of Oracle rows per schema key.
oracle_actor_type_counts object Raw MUSIT ACTOR_TYPE counts (keys "0", "1", "2", "null").
agents_created integer Agents inserted or simulated.
agents_skipped integer Already present (matching remarks marker).
agents_linked integer Always 0 here (no SpecifyUser on these agents).
schemas_processed array Distinct schemas seen while loading.
errors array of strings Per-row failures.

What this flow does not do yet

  • USD PERSONER / LEGSAMLER / DETBESTEMMER / AUTORPERSON — still per the strategy doc; separate extract/merge logic is planned for later.
  • AUTHORSTRINGS / taxonomic author abbreviation — not joined in this version.
  • Cross-schema deduplication — the same human could exist in botany and entomology ACTOR with different IDs; both rows become two Specify agents unless you merge manually or extend the flow.
  • Linking to SpecifyUser created by migrate_users — user migration still creates its own Agent per login; reconciling login Agent with an existing MUSIT ACTOR Agent is future work.

Recorded outcome: first production load (2026-04-10)

The following summarizes the live run against Oracle PROD (dry_run: false) executed on 2026-04-10.

Both schemas were processed in a single flow run. Botany processing ran approximately 12:10–20:12 UTC; entomology followed immediately, 20:12–21:11 UTC (~9 hours total elapsed).

Field Value
oracle_actors_extracted (total) 250,274
oracle_rows_per_schemaMUSIT_BOTANIKK_FELLES 235,124
oracle_rows_per_schemaMUSIT_ZOOLOGI_ENTOMOLOGI 15,150
agents_created 250,274
agents_skipped 0
errors 0

All extracted actors mapped to Specify agenttype 1 (Person). There were no organisation (ACTOR_TYPE 1 or 2) rows in either schema. No duplicate markers were encountered (clean first-time load). The run is treated as an operational success for Phase 1.1.

Deployment

Registered in prefect.yaml as migrate-musit-agents-dev (see that file for work_pool / parameters). Run from the CLI with PREFECT_API_URL pointed at your server, for example:

prefect deployment run "Migrate MUSIT Actors/migrate-musit-agents-dev" --param dry_run=false

For a greenfield database, run collection agents (migrate_musit_agents_flow) before application users (migrate_users_flow) if you later want a single Agent per person across specimens and logins. If users were migrated first, you may temporarily have two Agent rows for some staff until a merge pass is defined.


This site uses Just the Docs, a documentation theme for Jekyll.