Aligning BIML Database to Darwin Core for Publication to GBIF
There are several key components involved in sharing the bee occurrence records associated with the BIML project on the Global Biodiversity Information Facility (GBIF) platform.
This document outlines each step in the data workflow. It is a living document and remains under development.
Step 1: Accessing Data from the USGS/USFWS Interagency Native Bee Lab (BIML)
The occurrence records are provided by BIML. Each specimen is entered into the BIML SQL Server database using an established workflow. The BIML database captures occurrence and sampling event details in custom fields that do not conform to the Darwin Core standard; project-identifying information, for example, is stored in a field titled email. When specimens from the GLRI project are entered, that field holds the value USFWS_GLRI.
BIML also maintains a table titled Project_Identifiers_Table (PIT) to capture project-level metadata for identified specimens. The PIT fields follow Darwin Core standards and supply relevant record-level information. For example:
| ID | datasetID | collectionID | datasetName | institutionCode | institutionID | ownerInstitutionCode | publisher |
|---|---|---|---|---|---|---|---|
| 1 | FWS_R4_IMSERPP | FWS_R4_IMSERPP | FWS Region 4 Southeast Regional Pollinator Program | FWS | https://ror.org/04k7dar27 | USGS | FWS |
| 2 | USFWS GLRI | USFWS GLRI | USFWS Great Lakes Restoration Initiative | FWS | https://ror.org/04k7dar27 | USGS | FWS |
| 3 | FWS_FF06RCMR00_Bees_IndaziflamCMR | FWS_FF06RCMR00_Bees_IndaziflamCMR | USFWS Charles M Russell NWR IAG Treatments | FWS | https://ror.org/04k7dar28 | USGS | FWS |
Currently, this table is not integrated into the MS Access database and must be exported manually to a shared location accessible by team members involved in the Darwin Core publishing workflow.
Once specimen identifications are finalized, the data are ready for publication to GBIF. The data are exported from SQL Server as a dollar sign ($)-delimited flat file via a workflow that includes preliminary QA/QC steps. Before data processing begins, the flat file is manually uploaded to a shared location accessible to the data manager and the BIML team.
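As an illustration, a dollar-delimited export can be read with readr, which also decompresses `.gz` files transparently. The tiny in-memory record below is invented; the real workflow reads `USGS_DRO_flat.txt.gz` the same way.

```r
library(readr)

# Tiny invented record in the same dollar-delimited layout;
# the real workflow points read_delim() at USGS_DRO_flat.txt.gz instead
demo <- "name$email$latitude\nBombus impatiens$USFWS_GLRI$44.76\n"

biml_raw <- read_delim(I(demo), delim = "$", show_col_types = FALSE)

# problems() returns rows that failed to parse cleanly; saving this
# table is the basis of the read-in QA/QC log
read_issues <- problems(biml_raw)
```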
Step 2: Cleaning the Data and Mapping to Darwin Core
publishing_workflow.R
In general, this is the only script that needs to be run. It references several supporting scripts (detailed below) and outputs seven files into the output/ folder. Three of these are Darwin Core–formatted tables: an event table, an occurrence table, and an extended measurement or fact (eMoF) table. The other four describe quality issues to be addressed in future publishing cycles.
```r
# Output QAQC of original data ------------------------------------
source('scripts/BIML_QAQC.R')

# Run filter_and_join_tables_BIML_all.R -----------------------------------
source('scripts/filter_and_join_tables_BIML_all.R')

# crosswalk data and write out for publication ---------------------------
source('scripts/crosswalk_BIML.R')
```

The publishing_workflow.R script draws on three supporting R scripts, found in the scripts folder of the GitHub repository.
filter_and_join_tables_BIML_all.R
- Reads and cleans two datasets
  - Imports a biodiversity dataset (`USGS_DRO_flat.txt.gz`) and renames the `email` field to `datasetID`.
  - Imports a project metadata table (`Project_Identifiers_Table.csv`), renames `collectionID` to `collectionCode`, and manually appends a row for the BIML dataset.
- Filters and standardizes taxonomic data
  - Removes records lacking species-level names.
  - Replaces unmatched `datasetID` values with `"BIML"` to ensure proper linkage.
- Joins metadata with occurrence records
  - Merges the cleaned species data with the project metadata table using `datasetID` as the join key.
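The rename-and-join logic above can be sketched with dplyr. The toy tables and their values are invented; only the column names (`email`, `datasetID`) come from the workflow description.

```r
library(dplyr)

# Invented stand-ins for the two inputs
occurrences <- tibble(
  name  = c("Bombus impatiens", "Lasioglossum sp."),
  email = c("USFWS_GLRI", "SOME_UNLISTED_PROJECT")
) |>
  rename(datasetID = email)  # email -> datasetID

pit <- tibble(
  datasetID   = c("USFWS_GLRI", "BIML"),
  datasetName = c("USFWS Great Lakes Restoration Initiative", "BIML")
)

cleaned <- occurrences |>
  # Drop records without a species-level (binomial) name
  filter(!grepl("\\bsp\\.", name)) |>
  # Fall back to "BIML" for datasetIDs absent from the PIT
  mutate(datasetID = if_else(datasetID %in% pit$datasetID,
                             datasetID, "BIML")) |>
  left_join(pit, by = "datasetID")
```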
crosswalk_BIML.R
- Cleans and standardizes collection data
  - Filters records requiring QA/QC.
  - Standardizes geographic coordinates (`decimalLatitude`, `decimalLongitude`).
  - Parses and reconstructs incomplete timestamps into ISO 8601 format.
  - Adds time zone information based on spatial coordinates.
  - Harmonizes categorical fields such as `sex` and prepares all fields for Darwin Core compliance.
- Queries external taxonomic and collection metadata
  - Uses the GBIF Backbone API to validate scientific names via `rgbif::name_backbone_checklist()`.
  - Accesses the GRSciColl API to retrieve collection metadata by `datasetID`.
- Generates Darwin Core Archive components
  - Builds and exports three Darwin Core tables:
    - `event`: sampling events, locations, protocols, and time context
    - `occ`: taxonomic and occurrence-level data
    - `emof`: extended measurement or fact data (e.g., trap volume, liquid type)
  - Outputs all files as `.csv.gz` into the `output/` directory.
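The timestamp reconstruction can be sketched as below. The `date` argument name is illustrative (`time1` is a real BIML field), and the GBIF Backbone and GRSciColl calls are omitted here because they require network access.

```r
# Reconstruct an ISO 8601 eventDate from separate date and time parts.
# When the time part is missing, fall back to the date alone, as
# Darwin Core permits reduced-precision dates.
make_event_date <- function(date, time1) {
  ifelse(is.na(time1) | time1 == "",
         format(as.Date(date), "%Y-%m-%d"),
         paste0(format(as.Date(date), "%Y-%m-%d"), "T", time1))
}

make_event_date("2021-06-03", "09:30")  # "2021-06-03T09:30"
make_event_date("2021-06-03", NA)       # "2021-06-03"
```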
BIML_QAQC.R
Summary of Quality Assurance Script
This script performs validation and quality checks on a biodiversity data export from BIML.
- Loads and parses raw data
  - Imports the `USGS_DRO_flat.txt.gz` file containing biological occurrence records.
  - Logs problems encountered during read-in using `problems()`, saved as `BIML_problems_identified_during_read-in.csv`.
- Flags invalid scientific names
  - Uses a regular expression to identify names not conforming to binomial (or trinomial) format.
  - Outputs a summary of flagged names to `BIML_flag_summary_binomial_names.csv`.
- Performs QA/QC flagging
  - Parses and formats `time1` and `time2` fields.
  - Constructs ISO 8601 `eventDate` strings.
  - Flags records with:
    - Out-of-range coordinates
    - Inconsistent or missing timestamps
    - Pre-1900 dates
    - Invalid country or improperly coded fields
- Summarizes flagged records
  - Transforms wide-format flags into long format.
  - Aggregates multiple flags into a pipe-separated string per record.
  - Outputs:
    - `BIML_flag_summary_table.csv`: distinct flag combinations with counts
    - `BIML_flag_table.csv`: full table of flagged records
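A compact sketch of the flagging-and-summarizing steps, using dplyr and tidyr. The toy records and the binomial regular expression are approximations for illustration, not the script's exact logic.

```r
library(dplyr)
library(tidyr)

# Invented records: valid binomial, bad name + bad latitude, valid trinomial
recs <- tibble(
  id   = 1:3,
  name = c("Bombus impatiens", "bombus", "Apis mellifera mellifera"),
  lat  = c(44.7, 95.0, 38.9)
)

flags <- recs |>
  mutate(
    # Approximate binomial/trinomial check: capitalized genus,
    # then one or two lowercase epithets
    flag_name  = !grepl("^[A-Z][a-z]+( [a-z]+){1,2}$", name),
    flag_coord = lat < -90 | lat > 90
  ) |>
  # Wide flags -> long format, keeping only raised flags
  pivot_longer(starts_with("flag_"),
               names_to = "flag", values_to = "raised") |>
  filter(raised) |>
  # Collapse multiple flags into one pipe-separated string per record
  group_by(id) |>
  summarise(flags = paste(flag, collapse = "|"))
```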
Step 3: Creating the GBIF Project and Mapping Data in the IPT
The occurrence records are published to GBIF via the GBIF-US Integrated Publishing Toolkit (IPT). Within the IPT, authorized managers can create a Darwin Core Archive (DwC-A) and register it with GBIF. This includes:
- Uploading the Darwin Core tables
- Mapping columns to Darwin Core terms
- Updating dataset metadata (which becomes the EML file for the archive)
If you are publishing for the first time, you’ll also need to:
- Set visibility and publication status
- Register the resource with GBIF
- Assign Resource Manager permissions to relevant individuals
Resource Manager permissions are required to publish data, update metadata, and adjust visibility settings. These permissions can be granted by an IPT administrator.