Aligning BIML Database to Darwin Core for Publication to GBIF

Authors

Paulina Marie Jones

Stephen Formel

Published

May 20, 2025

There are several key components involved in sharing the bee occurrence records associated with the BIML project on the Global Biodiversity Information Facility (GBIF) platform.

This document outlines each step in the data workflow. It is a living document and remains under development.

Step 1: Accessing Data from the USGS/USFWS Interagency Native Bee Lab (BIML)

The occurrence records are provided by BIML. Each specimen is entered into the BIML SQL Server database using an established workflow. The BIML database captures occurrence and sampling event details using custom fields that do not conform to the Darwin Core standard. For example, project-identifying information is stored in a field titled email. For example, when specimens from the GLRI project are entered, the value USFWS_GLRI is used in that field.

BIML also maintains a table titled Project_Identifiers_Table (PIT) to capture project-level metadata for identified specimens. The PIT fields follow Darwin Core standards and supply relevant record-level information. For example:

Project Identifiers Table
ID datasetID collectionID datasetName institutionCode institutionID ownerInstitutionCode publisher
1 FWS_R4_IMSERPP FWS_R4_IMSERPP FWS Region 4 Southeast Regionial Pollinator Program FWS https://ror.org/04k7dar27 USGS FWS
2 USFWS GLRI USFWS GLRI USFWS Great Lakes Restoration Initiative FWS https://ror.org/04k7dar27 USGS FWS
3 FWS_FF06RCMR00_Bees_IndaziflamCMR FWS_FF06RCMR00_Bees_IndaziflamCMR USFWS Charles M Russell NWR IAG Treatments FWS https://ror.org/04k7dar28 USGS FWS

Currently, this table is not integrated into the MS Access database and must be exported manually to a shared location accessible by team members involved in the Darwin Core publishing workflow.

Once specimen identifications are finalized, the data are ready to be published to GBIF. The data are exported from the SQL Server as a dollar sign ($)-delimited flat file. This file is created via a workflow that includes preliminary QA/QC steps. Before data processing begins, the flat file is manually uploaded to a shared location accessible to the data manager and BIML team.

Step 2: Cleaning the Data and Mapping to Darwin Core

publishing_workflow.R

In general, this is the only script that needs to be run. It references several supporting scripts (detailed below) and outputs seven files into the output/ folder. Three of these are Darwin Core–formatted tables: an event table, an occurrence table, and an extended measurement or fact (eMoF) table. The other four describe quality issues to be addressed in future publishing cycles.


# Output QAQC of original data ------------------------------------

source('scripts/BIML_QAQC.R')

# Run filter_and_join_tables_BIML_all.R -----------------------------------

source('scripts/filter_and_join_tables_BIML_all.R')

# crosswalk data and write out for publication ---------------------------

source('scripts/crosswalk_BIML.R')

The publishing_workflow.R script draws from three other R scripts, found in the GitHub repository folder scripts.

filter_and_join_tables_BIML_all.R

  1. Reads and cleans two datasets
    • Imports a biodiversity dataset (USGS_DRO_flat.txt.gz) and renames the email field to datasetID.
    • Imports a project metadata table (Project_Identifiers_Table.csv), renames collectionID to collectionCode, and manually appends a row for the BIML dataset.
  2. Filters and standardizes taxonomic data
    • Removes records lacking species-level names.
    • Replaces unmatched datasetID values with "BIML" to ensure proper linkage.
  3. Joins metadata with occurrence records
    • Merges the cleaned species data with the project metadata table using datasetID as the join key.

crosswalk_BIML.R

  1. Cleans and standardizes collection data
    • Filters records requiring QA/QC.
    • Standardizes geographic coordinates (decimalLatitude, decimalLongitude).
    • Parses and reconstructs incomplete timestamps into ISO 8601 format.
    • Adds time zone information based on spatial coordinates.
    • Harmonizes categorical fields such as sex and prepares all fields for Darwin Core compliance.
  2. Queries external taxonomic and collection metadata
    • Uses the GBIF Backbone API to validate scientific names via rgbif::name_backbone_checklist().
    • Accesses the GRSciColl API to retrieve collection metadata by datasetID.
  3. Generates Darwin Core Archive components
    • Builds and exports three Darwin Core tables:
      • event: sampling events, locations, protocols, and time context
      • occ: taxonomic and occurrence-level data
      • emof: extended measurement or fact data (e.g., trap volume, liquid type)
    • Outputs all files as .csv.gz into the output/ directory.

BIML_QAQC.R

Summary of Quality Assurance Script

This script performs validation and quality checks on a biodiversity data export from BIML.

  1. Loads and parses raw data
    • Imports the USGS_DRO_flat.txt.gz file containing biological occurrence records.
    • Logs problems encountered during read-in using problems(), saved as BIML_problems_identified_during_read-in.csv.
  2. Flags invalid scientific names
    • Uses a regular expression to identify names not conforming to binomial (or trinomial) format.
    • Outputs summary of flagged names to BIML_flag_summary_binomial_names.csv.
  3. Performs QA/QC flagging
    • Parses and formats time1 and time2 fields.
    • Constructs ISO 8601 eventDate strings.
    • Flags records with:
      • Out-of-range coordinates
      • Inconsistent or missing timestamps
      • Pre-1900 dates
      • Invalid country or improperly coded fields
  4. Summarizes flagged records
    • Transforms wide-format flags into long format.
    • Aggregates multiple flags into a pipe-separated string per record.
    • Outputs:
      • BIML_flag_summary_table.csv: distinct flag combinations with counts
      • BIML_flag_table.csv: full table of flagged records

Step 3: Creating the GBIF Project and Mapping Data in the IPT

The occurrence records are published to GBIF via the GBIF-US Integrated Publishing Toolkit (IPT). Within the IPT, authorized managers can create a Darwin Core Archive (DwC-A) and register it with GBIF. This includes:

  1. Uploading the Darwin Core tables
  2. Mapping columns to Darwin Core terms
  3. Updating dataset metadata (which becomes the EML file for the archive)

If you are publishing for the first time, you’ll also need to:

  1. Set visibility and publication status
  2. Register the resource with GBIF
  3. Assign Resource Manager permissions to relevant individuals

Resource Manager permissions are required to publish data, update metadata, and adjust visibility settings. These permissions can be granted by an IPT administrator.