Complete Workflow

This page walks through the entire data transformation pipeline from source fisheries data to a standards-compliant Darwin Core Archive.

Overview Diagram

graph LR
    A[Source Data<br/>For example: ERDDAP/CSV] --> B[LinkML Schema<br/>Data Dictionary]
    B --> C[LinkML Mappings<br/>to Darwin Core]
    C --> D[Darwin Core Archive<br/>OBIS/GBIF Ready]

    style B fill:#e1f5ff
    style C fill:#fff4e1

Step-by-Step Process

1. Source Data: ERDDAP Datasets

https://rowlrs-data.marine.rutgers.edu/erddap

The workflow begins with three datasets hosted on a Rutgers ERDDAP server:

  • bottom_trawl_survey_ow1_tows: Station/tow metadata (time, location, coordinates)
  • bottom_trawl_survey_ow1_catch: Species catch data (weights, counts, lengths)
  • species_id_codes: Taxonomic lookup (common names → scientific names → ITIS TSNs)

Challenge: These datasets use fisheries-specific terminology and structures that don't directly match Darwin Core.
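Because ERDDAP exposes each dataset through its tabledap endpoint, the raw tables can be pulled straight into pandas. A minimal sketch (the dataset IDs come from the list above; ERDDAP's CSV output carries units on the second row, which is skipped here):

import pandas as pd

ERDDAP = "https://rowlrs-data.marine.rutgers.edu/erddap/tabledap"

def fetch(dataset_id: str) -> pd.DataFrame:
    """Download a full ERDDAP dataset as CSV, skipping the units row."""
    return pd.read_csv(f"{ERDDAP}/{dataset_id}.csv", skiprows=[1])

tows = fetch("bottom_trawl_survey_ow1_tows")
catch = fetch("bottom_trawl_survey_ow1_catch")
species = fetch("species_id_codes")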

2. Make a LinkML Schema (AKA data model, AKA data dictionary)

Here we are answering two questions: what terms do we have, and how should those terms be grouped?

Note: if you already have a SQL schema, there are tools to turn it into a LinkML schema automatically.

What terms do we have?

Again, this is essentially a well-structured data dictionary. Terms might look like this:

latitude:
    description: Latitude coordinates entered at start of trawl
    range: float
    required: true
    unit:
      ucum_code: deg
    slot_uri: ow1_catch:latitude
    annotations:
      erddap_source: "latitude (start_latitude)"
      erddap_standard_name: "latitude"
      erddap_units: "degrees_north"
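One payoff shows up immediately: downstream code can read units and source annotations from the schema instead of hard-coding them. A small sketch using linkml-runtime's SchemaView; the schema file name ow1_catch.yaml is an assumption, and the unit metaslot needs a reasonably recent linkml-runtime:

from linkml_runtime import SchemaView

sv = SchemaView("ow1_catch.yaml")   # hypothetical schema file name
lat = sv.get_slot("latitude")

print(lat.description)                        # Latitude coordinates entered at start of trawl
print(lat.unit.ucum_code)                     # deg
print(lat.annotations["erddap_units"].value)  # degrees_north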

How should those terms be grouped?

These are like tables in a database. We define classes, and there is nothing wrong with a term being used in multiple classes.

classes:
  TowRecord:
    description: A single trawl tow/station event
    slots:
      - cruise
      - station
      - time
      - latitude
      - longitude
      - end_latitude
      - end_longitude

Why is this a good approach?

  • Gives us a machine-readable data dictionary
  • Documents units and sources
  • Drives data validation for QC, e.g. checking that all dates are actually dates (see the sketch below)
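Because the schema is machine-readable, that validation can run before any mapping happens. A minimal sketch, assuming a recent LinkML release that provides linkml.validator.validate, a schema file named ow1_catch.yaml, and made-up example values:

from linkml.validator import validate

# One tow row as a plain dict; keys follow the TowRecord slots defined above.
row = {
    "cruise": "OW1-2023-01",        # hypothetical example values
    "station": "S01",
    "time": "2023-06-15T14:30:00Z",
    "latitude": 39.42,
    "longitude": -74.12,
    "end_latitude": 39.43,
    "end_longitude": -74.10,
}

report = validate(row, "ow1_catch.yaml", target_class="TowRecord")
if report.results:
    for result in report.results:
        print(result.message)       # e.g. a date that is not actually a date
else:
    print("row conforms to TowRecord")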

3. LinkML Mappings to Darwin Core

How does Darwin Core compare to our model? Here is an example of the Darwin Core term eventDate, which happens to match our trawl term time exactly. No additional context is needed.

slots:
  eventDate:
    description: The date-time during which an Event occurred
    slot_uri: dwc:eventDate
    exact_mappings:
      - ow1_catch:time  # Shows source field

But what about when they don't match up exactly? Here we begin to decide what to do. Importantly, we're capturing the information in a structured way so it can be leveraged later.

  decimalLongitude:
    description: >-
      The geographic longitude in decimal degrees of the geographic center of a Location.
    range: float
    slot_uri: dwc:decimalLongitude
    unit:
      ucum_code: deg
    related_mappings:
      - ow1_catch:start_longitude
      - ow1_catch:end_longitude
    comments:
      - "Calculate midpoint from TowRecord.start_longitude and TowRecord.end_longitude"
      - "Or use start_longitude as representative point"

4. Create a Darwin Core Archive: Ready for OBIS/GBIF

Here we have Python code that takes our LinkML instructions and creates a valid Darwin Core Archive for sharing with OBIS and GBIF.

What it handles automatically:

Some things can be reused easily across data sources (see the sketch after this list):

  • Simple 1:1 field renames (e.g., time → eventDate)
  • Type conversions (string, float, integer)
  • Validation against target schema
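A sketch of that reusable piece: a rename-and-coerce pass driven by a mapping table that could be generated from the schema's exact_mappings (written out by hand here and kept deliberately generic):

import pandas as pd

# In practice this dict would be built from the schema's exact_mappings.
EXACT_MAPPINGS = {"time": "eventDate"}
TARGET_TYPES = {"eventDate": "string"}

def apply_simple_mappings(df: pd.DataFrame) -> pd.DataFrame:
    """Handle 1:1 renames and type coercion; calculated fields are left for custom logic."""
    out = df.rename(columns=EXACT_MAPPINGS)
    for column, dtype in TARGET_TYPES.items():
        if column in out.columns:
            out[column] = out[column].astype(dtype)
    return out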

What requires custom logic:

Other things require careful consideration of how the particular dataset works (see the sketch after this list):

  • Complex field generation (IDs, hierarchies)
  • Calculated fields (midpoints, WKT geometries)
  • Multi-source field merging
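By contrast, the dataset-specific pieces need to know how a tow works. A sketch of two such helpers, a WKT geometry for the tow track and a hierarchical event ID; the ID pattern and field names are purely illustrative:

def tow_footprint_wkt(start_lon: float, start_lat: float,
                      end_lon: float, end_lat: float) -> str:
    """Represent the tow track as a WKT LINESTRING (longitude first)."""
    return f"LINESTRING ({start_lon} {start_lat}, {end_lon} {end_lat})"

def tow_event_id(cruise: str, station: str) -> str:
    """Build a hierarchical eventID; the separator and ordering are assumptions."""
    return f"{cruise}:{station}"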

The bottom line

As much as possible, we use templates and generic functions so this work is reusable and quick to implement for a new fisheries dataset.

The Darwin Core Archive Output

The pipeline produces a ZIP file containing:

ow1_dwca.zip
├── event.txt                        # 82 records (2 cruises + 80 tows)
├── occurrence.txt                   # 1,181 species observations
├── extendedmeasurementorfact.txt    # 5,623 measurements
├── meta.xml                         # Archive structure descriptor
└── eml.xml                          # Dataset metadata
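Packaging itself is mechanical: the three data tables and the two descriptors are zipped into a flat archive. A minimal sketch using only the standard library, assuming the five files already exist in an output directory:

import zipfile
from pathlib import Path

DWCA_MEMBERS = ["event.txt", "occurrence.txt",
                "extendedmeasurementorfact.txt", "meta.xml", "eml.xml"]

def build_archive(out_dir: Path, archive_name: str = "ow1_dwca.zip") -> Path:
    """Zip the data tables and descriptors into a flat Darwin Core Archive."""
    archive_path = out_dir / archive_name
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in DWCA_MEMBERS:
            zf.write(out_dir / name, arcname=name)
    return archive_path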

Publication to OBIS/GBIF

The resulting archive can be:

  • Uploaded to an Integrated Publishing Toolkit (IPT) instance for publication to OBIS/GBIF
  • Validated using GBIF Data Validator
  • Published to make data globally discoverable