
Standardise an Occurrence dataset
Dax Kellie & Martin Westgate
2025-06-06
Source: vignettes/occurrences-example.Rmd
Data on species observations are referred to as occurrence data. In Living Atlases such as the Atlas of Living Australia (ALA), this is the default type of data stored.
Occurrence-based datasets assume that all observations are independent of each other. The benefit of this assumption is that observational data can remain simple in structure: every observation is made at a specific place and time. This simplicity allows all occurrence-based data to be aggregated and used together.
Let’s see how to build an occurrence-based dataset using galaxias.
The dataset
Let’s use a small example dataset of bird observations taken from 4
different site locations. This dataset has many different types of data,
like landscape type and age class. Importantly for standardising to
Darwin Core, this dataset contains the scientific name (species),
coordinate location (lat & lon) and date of observation (date).
library(galaxias)
library(dplyr)
library(readr)
obs <- read_csv("dummy-dataset-sb.csv",
                show_col_types = FALSE) |>
  janitor::clean_names()

obs |>
  gt::gt() |>
  gt::opt_interactive(page_size_default = 5)
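The CSV file itself is not reproduced in this article. For readers who want to follow along without it, here is a minimal stand-in, assuming only the twelve column names that the dataset contains after clean_names(). The object name obs_demo and every value in it are hypothetical and were invented for illustration; they are not the original data.

```r
library(tibble)

# Hypothetical stand-in for the example data: the column names follow the
# vignette, but every value below is invented for illustration.
obs_demo <- tibble(
  sample_id     = c("S001", "S002", "S003"),
  site          = c("A", "A", "B"),
  landscape     = c("urban", "urban", "forest"),
  species       = c("Eolophus roseicapilla", "Eolophus roseicapilla",
                    "Malurus cyaneus"),
  species_code  = c("EORO", "EORO", "MACY"),
  sex           = c("male", "female", "male"),
  molecular_sex = c("male", "female", NA),
  age_class     = c("adult", "adult", "juvenile"),
  lat           = c(-35.31, -35.31, -35.27),
  lon           = c(149.12, 149.12, 149.08),
  date          = c("2023-01-15", "2023-01-15", "2023-02-02"),
  comments      = c(NA, "banded", NA)
)

# Number of rows and columns in the stand-in
dim(obs_demo)
```

A stand-in like this is only useful for trying out the function calls; the checks and messages shown in this article come from the real file.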
Standardise to Darwin Core
To determine what we need to do to standardise our dataset, let’s use
suggest_workflow(). The output tells us we have one matching Darwin Core
term in our data already (sex), but we are missing all minimum required
Darwin Core terms.
obs |>
  suggest_workflow()
#>
#> ── Matching Darwin Core terms ──────────────────────────────────────────────────
#> Matched 1 of 12 column names to DwC terms:
#> ✔ Matched: sex
#> ✖ Unmatched: age_class, comments, date, landscape, lat, lon, molecular_sex,
#> sample_id, site, species, species_code
#>
#> ── Minimum required Darwin Core terms ──────────────────────────────────────────
#>
#> Type Matched term(s) Missing term(s)
#> ✖ Identifier (at least one) - occurrenceID, catalogNumber, recordNumber
#> ✖ Record type - basisOfRecord
#> ✖ Scientific name - scientificName
#> ✖ Location - decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters
#> ✖ Date/Time - eventDate
#>
#> ── Suggested workflow ──────────────────────────────────────────────────────────
#>
#> To make your data Darwin Core compliant, use the following workflow:
#>
#> df |>
#> set_occurrences() |>
#> set_datetime() |>
#> set_coordinates() |>
#> set_scientific_name()
#>
#> ── Additional functions
#> Based on your matched terms, you can also add to your pipe:
#> • `set_individual_traits()`
#> ℹ See all `set_` functions at
#> http://corella.ala.org.au/reference/index.html#add-rename-or-edit-columns-to-match-darwin-core-terms
Under “Suggested workflow”, the output above suggests a series of piped
set_ functions that we can use to rename, modify or add columns that are
missing from obs but required by Darwin Core. set_ functions are
specialised wrappers around dplyr::mutate(), with additional
functionality to support using the Darwin Core Standard.
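To make the "wrapper around dplyr::mutate()" idea concrete, here is a rough plain-dplyr approximation of renaming a column to a Darwin Core term. It is a sketch only: the toy data frame is hypothetical, and the real set_ functions also validate the column, which this does not attempt.

```r
library(dplyr)

# Toy data frame; the column name `species` mirrors the vignette's dataset,
# the values are hypothetical
toy <- tibble::tibble(
  species = c("Malurus cyaneus", "Eolophus roseicapilla"),
  site    = c("A", "B")
)

# Plain-dplyr approximation of renaming to a Darwin Core term:
# create `scientificName` from `species`, then drop the column that was used
toy_dwc <- toy |>
  mutate(scientificName = species, .keep = "unused")

names(toy_dwc)
```

The .keep = "unused" argument is what removes the source column here, so that the old name no longer appears alongside the new one.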
For simplicity, let’s do the easy part first: renaming columns we
already have in our dataset to use accepted standard Darwin Core terms.
set_ functions will automatically check to make sure each column is
correctly formatted. We’ll save our modified dataframe as obs_dwc.
obs_dwc <- obs |>
  set_scientific_name(scientificName = species) |>
  set_coordinates(decimalLatitude = lat,
                  decimalLongitude = lon) |>
  set_datetime(eventDate = lubridate::ymd(date)) # specify year-month-day format
#> ⠙ Checking 1 column: scientificName
#> ⠹ Checking 1 column: scientificName
#> ✔ Checking 1 column: scientificName [326ms]
#>
#> ⠙ Checking 2 columns: decimalLatitude and decimalLongitude
#> ✔ Checking 2 columns: decimalLatitude and decimalLongitude [621ms]
#>
#> ⠙ Checking 1 column: eventDate
#> ✔ Checking 1 column: eventDate [311ms]
#>
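The messages above report that each column was checked. As an illustration of the kind of validation involved, here is a sketch in base R plus lubridate, assuming simple range and type rules. These rules are assumptions for illustration, not the package's actual checks, and the values are hypothetical.

```r
# Hypothetical values standing in for the vignette's lat, lon and date columns
lat  <- c(-35.31, -35.27)
lon  <- c(149.12, 149.08)
date <- c("2023-01-15", "2023-02-02")

# Parse year-month-day strings; values that cannot be parsed become NA
event_date <- lubridate::ymd(date)

# Simple validity checks, in the spirit of (not identical to) the package's
lat_ok  <- is.numeric(lat) && all(lat >= -90 & lat <= 90)
lon_ok  <- is.numeric(lon) && all(lon >= -180 & lon <= 180)
date_ok <- inherits(event_date, "Date") && !anyNA(event_date)

c(lat_ok, lon_ok, date_ok)
```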
Running suggest_workflow() again will reflect our progress and show us
what’s left to do. Now the output tells us that we still need to add
several columns to our dataset to meet minimum Darwin Core requirements.
obs_dwc |>
  suggest_workflow()
#>
#> ── Matching Darwin Core terms ──────────────────────────────────────────────────
#> Matched 5 of 12 column names to DwC terms:
#> ✔ Matched: decimalLatitude, decimalLongitude, eventDate, scientificName, sex
#> ✖ Unmatched: age_class, comments, landscape, molecular_sex, sample_id, site,
#> species_code
#>
#> ── Minimum required Darwin Core terms ──────────────────────────────────────────
#>
#> Type Matched term(s) Missing term(s)
#> ✔ Scientific name scientificName -
#> ✔ Date/Time eventDate -
#> ✖ Identifier (at least one) - occurrenceID, catalogNumber, recordNumber
#> ✖ Record type - basisOfRecord
#> ✖ Location decimalLatitude, decimalLongitude geodeticDatum, coordinateUncertaintyInMeters
#>
#> ── Suggested workflow ──────────────────────────────────────────────────────────
#>
#> To make your data Darwin Core compliant, use the following workflow:
#>
#> df |>
#> set_occurrences() |>
#> set_coordinates()
#>
#> ── Additional functions
#> Based on your matched terms, you can also add to your pipe:
#> • `set_individual_traits()`
#> ℹ See all `set_` functions at
#> http://corella.ala.org.au/reference/index.html#add-rename-or-edit-columns-to-match-darwin-core-terms
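The "Missing term(s)" column in the output above amounts to a set difference between the required terms and the current column names. Here is a minimal base R sketch of that idea, with both vectors copied by hand from the output; it is not how suggest_workflow() is implemented, and it simplifies the "at least one" identifier rule by treating occurrenceID as the chosen identifier.

```r
# Column names of obs_dwc at this stage, copied from the output above
current_cols <- c(
  "decimalLatitude", "decimalLongitude", "eventDate", "scientificName", "sex",
  "age_class", "comments", "landscape", "molecular_sex", "sample_id", "site",
  "species_code"
)

# Required terms as listed in the output; only one identifier is needed,
# so occurrenceID stands in for that group here
required <- c(
  "occurrenceID", "basisOfRecord", "scientificName",
  "decimalLatitude", "decimalLongitude",
  "geodeticDatum", "coordinateUncertaintyInMeters", "eventDate"
)

# Terms that are required but not yet present as columns
missing_terms <- setdiff(required, current_cols)
missing_terms
```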
Here’s a rundown of the columns we need to add:
- occurrenceID: Unique identifiers of each record. This ensures that we can identify the specific record for any future updates or corrections. We can use composite_id(), sequential_id() or random_id() to add a unique ID to each row.
- basisOfRecord: The type of record (e.g. human observation, specimen, machine observation). See a list of acceptable values with corella::basisOfRecord_values().
- geodeticDatum: The Coordinate Reference System (CRS) projection of your data (for example, the CRS of Google Maps is “WGS84”).
- coordinateUncertaintyInMeters: The radius of uncertainty around your observation, in metres. You may know this value based on your method of data collection.
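To show what a unique, human-readable identifier per row can look like, here is a base R sketch that combines a running number with two descriptive columns. It assumes hypothetical site and landscape values, and the separator and zero-padding are illustrative choices; it is not the implementation or the output format of composite_id() or sequential_id().

```r
# Hypothetical site and landscape labels for four records
site      <- c("A", "A", "B", "B")
landscape <- c("urban", "urban", "forest", "forest")

# Zero-padded running number, one per row
seq_part <- sprintf("%03d", seq_along(site))

# Combine the parts; the separator and ordering are illustrative choices
occurrence_id <- paste(seq_part, site, landscape, sep = "-")

occurrence_id

# TRUE when every ID is unique
anyDuplicated(occurrence_id) == 0
```

Including the running number is what guarantees uniqueness here, since several records can share the same site and landscape.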
Now let’s add these columns using set_occurrences() and
set_coordinates(). We can also add the suggested function
set_individual_traits(), which will automatically identify the matched
column name sex and check the column’s format.
obs_dwc <- obs_dwc |>
  set_occurrences(
    occurrenceID = composite_id(sequential_id(), site, landscape),
    basisOfRecord = "humanObservation"
  ) |>
  set_coordinates(
    geodeticDatum = "WGS84",
    coordinateUncertaintyInMeters = 30
    # coordinateUncertaintyInMeters = with_uncertainty(method = "phone")
  ) |>
  set_individual_traits()
#> ⠙ Checking 2 columns: occurrenceID and basisOfRecord
#> ⠹ Checking 2 columns: occurrenceID and basisOfRecord
#> ✔ Checking 2 columns: occurrenceID and basisOfRecord [632ms]
#>
#> ⠙ Checking 4 columns: decimalLatitude, decimalLongitude, coordinateUncertaintyI…
#> ✔ Checking 4 columns: decimalLatitude, decimalLongitude, coordinateUncertaintyI…
#>
#> ⠙ Checking 1 column: sex
#> ✔ Checking 1 column: sex [314ms]
#>
Running suggest_workflow() once more will confirm that our dataset is
ready to be used in a Darwin Core Archive!
obs_dwc |>
  suggest_workflow()
#>
#> ── Matching Darwin Core terms ──────────────────────────────────────────────────
#> Matched 9 of 16 column names to DwC terms:
#> ✔ Matched: basisOfRecord, coordinateUncertaintyInMeters, decimalLatitude,
#> decimalLongitude, eventDate, geodeticDatum, occurrenceID, scientificName, sex
#> ✖ Unmatched: age_class, comments, landscape, molecular_sex, sample_id, site,
#> species_code
#>
#> ── Minimum required Darwin Core terms ──────────────────────────────────────────
#>
#> Type Matched term(s) Missing term(s)
#> ✔ Identifier (at least one) occurrenceID -
#> ✔ Record type basisOfRecord -
#> ✔ Scientific name scientificName -
#> ✔ Location decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters -
#> ✔ Date/Time eventDate -
#>
#>
#> 🥇 All minimum column requirements met!
#>
#> ── Suggested workflow ──────────────────────────────────────────────────────────
#>
#> 🥇 Your dataframe is Darwin Core compliant!
#> Run checks, or use your dataframe to build a Darwin Core Archive with galaxias:
#> df |>
#> check_dataset()
#>
#> ── Additional functions
#> Based on your matched terms, you can also add to your pipe:
#> • `set_individual_traits()`
#> ℹ See all `set_` functions at
#> http://corella.ala.org.au/reference/index.html#add-rename-or-edit-columns-to-match-darwin-core-terms
To submit our dataset, let’s select columns with valid occurrence term
names and save this dataframe to the file occurrences.csv. Importantly,
we will save our CSV in a folder called data-processed, which galaxias
looks for automatically when building a Darwin Core Archive.
obs_dwc <- obs_dwc |>
  select(any_of(occurrence_terms())) # select any matching terms

obs_dwc |>
  gt::gt() |>
  gt::opt_interactive(page_size_default = 5)
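The select(any_of(...)) step keeps only the columns whose names appear in a given vector, and silently ignores names that are not present in the data. Here is a small sketch of that behaviour, assuming a hypothetical toy data frame and a shortened, hand-written vector named terms in place of the full result of occurrence_terms().

```r
library(dplyr)

# Toy data frame mixing Darwin Core terms with project-specific columns
toy <- tibble::tibble(
  occurrenceID   = c("001", "002"),
  scientificName = c("Malurus cyaneus", "Eolophus roseicapilla"),
  site           = c("A", "B"),
  comments       = c(NA, "banded")
)

# Hypothetical, shortened stand-in for the vector from occurrence_terms()
terms <- c("occurrenceID", "basisOfRecord", "scientificName", "eventDate")

# any_of() keeps the columns that are present and ignores names that are not
toy_selected <- toy |>
  select(any_of(terms))

names(toy_selected)
```

Using any_of() rather than all_of() matters here: all_of() would raise an error for terms such as basisOfRecord that the data frame does not contain.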
Our final step is to save this to our ‘publishing’ directory:
# Save in ./data-processed
use_data_occurrences(obs_dwc)
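As described above, the result of this step is a file named occurrences.csv inside a data-processed folder. Here is a rough base R sketch of that outcome, assuming a temporary directory in place of the project root and a hypothetical toy data frame in place of obs_dwc. It approximates the end result only; use_data_occurrences() may do more than this.

```r
# Toy data frame standing in for obs_dwc
toy <- data.frame(
  occurrenceID   = c("001", "002"),
  scientificName = c("Malurus cyaneus", "Eolophus roseicapilla")
)

# Use a temporary directory as a stand-in for the project root
project_dir <- tempdir()
out_dir     <- file.path(project_dir, "data-processed")
dir.create(out_dir, showWarnings = FALSE)

# Write the file under the name the vignette mentions
out_file <- file.path(out_dir, "occurrences.csv")
write.csv(toy, out_file, row.names = FALSE)

file.exists(out_file)
```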
All done! See the Quick start guide vignette for how to build a Darwin Core Archive.