Standardising an Occurrence Dataset

Amanda Buyan, Dax Kellie & Martin Westgate

In Living Atlases like the Atlas of Living Australia (ALA), the default type of data is occurrence data, where each record refers to the presence or absence of an organism or taxon in a particular place at a specific time. This is a relatively simple data structure, in which each record is assumed to be independent of every other record. This simplicity also makes occurrence-based data easy to aggregate.

Here, we’ll go through the steps to standardise and build an occurrence dataset using galaxias.

The dataset

Example Occurrences

The data we’ll use are bird observations from sites in 4 different landscapes. As these are occurrence data, the dataset contains evidence of the presence of certain bird species (species) at particular locations (lat, lon) at specific times (date). It also contains additional information about the landscape type, and the sex and age class of the birds.

>>> import galaxias
>>> import pandas as pd
>>> occurrences = pd.read_csv('dummy-dataset-sb.csv')
>>> # set all column names to lowercase
>>> occurrences.columns = map(str.lower, occurrences.columns)
>>> occurrences

   landscape     site        date                    species        lat         lon             comments species code sex age class sample id molecular sex
     KKNP  KKNP-01  2008-05-05       Climacteris picumnus -36.867536  143.258586     secondaries worn          BTC   F        Ad      SB01             F
     KKNP  KKNP-01  2008-08-22       Climacteris picumnus -36.877767  143.254638                  NaN          BTC   F         J      SB02             F
     KKNP  KKNP-02  2008-02-05       Climacteris picumnus -36.850162  143.293905                  NaN          BTC   F         J      SB03             F
     KKNP  KKNP-02  2008-08-22       Climacteris picumnus -36.852531  143.313389                  NaN          BTC   F        Ad      SB04             F
     KKNP  KKNP-03  2008-02-06        Artamus cyanopterus -36.881165  143.306694  heavy parasite load          DWS   ?        Ad      SB05             M
     KKNP  KKNP-03  2008-08-22          Acanthiza lineata -36.891223  143.311887  heavy parasite load          STT   M         J      SB06             M
     PRSP  PRSP-01  2008-08-21        Artamus cyanopterus -37.069201  143.690593                  NaN          DWS   ?         J      SB07             F
     PRSP  PRSP-01  2008-02-06       Acanthiza reguloides -37.077008  143.693082                  NaN          BRT   M         J      SB08             M
     PRSP  PRSP-02  2008-02-06            Malurus cyaneus -37.084711  143.701537                  NaN          SFW   M         J      SB09             M
     PRSP  PRSP-02  2008-02-06            Malurus cyaneus -37.087244  143.696344                  NaN          SFW   M        Ad      SB10             M
    PRSP  PRSP-03  2008-02-05       Acanthiza reguloides -37.104941  143.691580                  NaN          BRT   M        Ad      SB11             M
    PRSP  PRSP-03  2008-02-06          Acanthiza lineata -37.108501  143.689778                  NaN          STT   F        Ad      SB12             F
    MBRP  MBRP-01  2008-02-06            Ptilotula fusca -37.069663  143.720473                  NaN          FHE   ?        Ad      SB13             F
    MBRP  MBRP-01  2008-02-06            Ptilotula fusca -37.070489  143.721127                  NaN          FHE   ?        Ad      SB14             M
    MBRP  MBRP-02  2008-02-06  Melithreptus brevirostris -37.067090  143.725065                  NaN          BHE   ?        Ad      SB15             F
    MBRP  MBRP-02  2008-08-22  Melithreptus brevirostris -37.067780  143.727297                  NaN          BHE   ?        Ad      SB16             M
    RSNC  RSNC-01  2008-08-21            Malurus cyaneus -37.177554  144.070257                  NaN          SFW   F        Ad      SB17             F
    RSNC  RSNC-01  2008-02-06            Ptilotula fusca -37.162234  144.063391                  NaN          FHE   ?         J      SB18             M
    RSNC  RSNC-02  2008-02-06       Pardalotus punctatus -37.232654  144.046911                  NaN          SPP   F        Ad      SB19             F
    RSNC  RSNC-02  2008-02-06       Pardalotus punctatus -37.286758  144.066481                  NaN          SPP   F        Ad      SB20             F
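If you don’t have the CSV to hand, a few rows of the table above are enough to follow the loading step. The sketch below rebuilds a three-row excerpt in memory. The capitalised headers are an assumption about the raw file (which is not reproduced here), chosen so that the lowercasing step has something to do; `excerpt` and `sites_per_landscape` are illustrative names only.

```python
import io

import pandas as pd

# Three rows copied from the table above. The capitalised headers are an
# assumption about the raw file, which is not reproduced in this article.
raw_csv = io.StringIO(
    "Landscape,Site,Date,Species,Lat,Lon\n"
    "KKNP,KKNP-01,2008-05-05,Climacteris picumnus,-36.867536,143.258586\n"
    "KKNP,KKNP-02,2008-02-05,Climacteris picumnus,-36.850162,143.293905\n"
    "PRSP,PRSP-01,2008-08-21,Artamus cyanopterus,-37.069201,143.690593\n"
)
excerpt = pd.read_csv(raw_csv)

# Same effect as map(str.lower, ...), using the pandas string accessor
excerpt.columns = excerpt.columns.str.lower()

# Count the distinct sites recorded within each landscape
sites_per_landscape = excerpt.groupby("landscape")["site"].nunique().to_dict()

print(list(excerpt.columns))
print(sites_per_landscape)
```

Counting sites per landscape like this is a quick way to confirm how the sampling is nested before any renaming takes place.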

Standardise to Darwin Core

We can use suggest_workflow() to determine what we need to do to standardise this dataset.

>>> galaxias.suggest_workflow(occurrences=occurrences)

── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── All DwC terms ──

Matched 1 of 12 column names to DwC terms:

✓ Matched: sex
✗ Unmatched: date, comments, species, lat, age class, species code, molecular sex, landscape, lon, sample id, site

── Minimum required DwC terms occurrences ──

Type                       Matched term(s)    Missing term(s)
-------------------------  -----------------  -------------------------------------------------------------------------------
Identifier (at least one)  -                  occurrenceID OR catalogNumber OR recordNumber
Record type                -                  basisOfRecord
Scientific name            -                  scientificName
Location                   -                  decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters
Date/Time                  -                  eventDate

── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Occurrences ──

To make your occurrences Darwin Core compliant, use the following workflow:

corella.set_occurrences()
corella.set_scientific_name()
corella.set_coordinates()
corella.set_datetime()

Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()

Calling suggest_workflow() tells us that one column in the dataset (sex) matches a Darwin Core term, and that we are missing all of the minimum required Darwin Core terms. We’re also given a suggested workflow consisting of a series of set_ functions for renaming, modifying, or adding missing columns. set_ functions are specialised wrappers around the pandas package, with additional functionality to support the Darwin Core Standard.

Let’s start by renaming existing columns to align with Darwin Core terms. set_ functions will automatically check to make sure each column is correctly formatted.

>>> occurrences = galaxias.set_scientific_name(dataframe=occurrences,
...                                            scientificName='species')
>>> occurrences = galaxias.set_coordinates(dataframe=occurrences,
...                                        decimalLatitude='lat',
...                                        decimalLongitude='lon')
>>> occurrences = galaxias.set_datetime(dataframe=occurrences,
...                                     eventDate='date',
...                                     string_to_datetime=True,
...                                     yearfirst=True)
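To make the idea of a "wrapper around pandas" concrete, here is a rough plain-pandas equivalent of the three calls above: rename the columns, parse the dates, and run basic range checks. This is a sketch of the general idea under our own assumptions, not the implementation used by galaxias or corella, and the two-row data frame is a stand-in for the full dataset.

```python
import pandas as pd

# Two rows from the example dataset, as a stand-in for the full table
df = pd.DataFrame({
    "species": ["Climacteris picumnus", "Malurus cyaneus"],
    "lat": [-36.867536, -37.084711],
    "lon": [143.258586, 143.701537],
    "date": ["2008-05-05", "2008-02-06"],
})

# Rename existing columns to their Darwin Core terms
df = df.rename(columns={
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "eventDate",
})

# Convert year-first date strings into pandas datetimes
df["eventDate"] = pd.to_datetime(df["eventDate"], yearfirst=True)

# Basic validity checks on decimal-degree coordinates
lat_ok = bool(df["decimalLatitude"].between(-90, 90).all())
lon_ok = bool(df["decimalLongitude"].between(-180, 180).all())

print(df.dtypes)
print(lat_ok, lon_ok)
```

The set_ functions bundle the renaming and the checking into one call, which is why a failed check surfaces at the moment a column is assigned rather than later.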

Calling suggest_workflow() again accounts for our progress and shows us what still needs to be done. Here, we can see that we’re still missing four of the minimum required terms: an identifier, basisOfRecord, geodeticDatum and coordinateUncertaintyInMeters.

>>> galaxias.suggest_workflow(occurrences=occurrences)

── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── All DwC terms ──

Matched 5 of 12 column names to DwC terms:

✓ Matched: eventDate, scientificName, decimalLatitude, decimalLongitude, sex
✗ Unmatched: molecular sex, landscape, site, sample id, comments, species code, age class

── Minimum required DwC terms occurrences ──

Type                       Matched term(s)                    Missing term(s)
-------------------------  ---------------------------------  ---------------------------------------------
Identifier (at least one)  -                                  occurrenceID OR catalogNumber OR recordNumber
Record type                -                                  basisOfRecord
Scientific name            scientificName                     -
Location                   decimalLatitude, decimalLongitude  geodeticDatum, coordinateUncertaintyInMeters
Date/Time                  eventDate                          -

── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Occurrences ──

To make your occurrences Darwin Core compliant, use the following workflow:

corella.set_occurrences()
corella.set_coordinates()

Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()

Here’s a rundown of the columns we need to add:

occurrenceID: Unique identifier for each record, which ensures that we can identify specific records for future updates or corrections. We can use composite_id(), sequential_id(), or random_id() to add a unique ID to each row.
basisOfRecord: The type of record (e.g. human observation, specimen from a museum collection, machine observation). See a list of acceptable values with corella.basisOfRecord_values().
geodeticDatum: The geographic coordinate reference system (CRS), which is a framework for representing spatial data (for example, the CRS of Google Maps is “WGS84”).
coordinateUncertaintyInMeters: The radius, in metres, of the area of uncertainty around your observation, which you may be able to infer from your data collection method.
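As an illustration of what a unique identifier can look like, the sketch below builds sequential, composite and random IDs with plain pandas and the standard library. It is an assumption-laden stand-in rather than the package's own ID helpers; the column name `randomID` is hypothetical and used only to keep the two variants side by side.

```python
import uuid

import pandas as pd

# Three rows of site and landscape codes from the example dataset
records = pd.DataFrame({
    "site": ["KKNP-01", "KKNP-01", "PRSP-02"],
    "landscape": ["KKNP", "KKNP", "PRSP"],
})

# Sequential: the row position, stored as a string
sequential = pd.Series(range(len(records)), index=records.index).astype(str)

# Composite: row position joined to existing columns, e.g. "0-KKNP-01-KKNP"
records["occurrenceID"] = (
    sequential + "-" + records["site"] + "-" + records["landscape"]
)

# Random: one UUID per row (hypothetical column name, for comparison only)
records["randomID"] = [str(uuid.uuid4()) for _ in range(len(records))]

print(records["occurrenceID"].tolist())
```

Including the row position in a composite ID keeps it unique even when several records share the same site and landscape, as the first two rows do here.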

As suggested, let’s add these columns using set_occurrences() and set_coordinates(). We can also use an optional function, set_individual_traits(), which will automatically identify the matched column name sex and check the column’s format.

>>> occurrences = galaxias.set_occurrences(occurrences=occurrences,
...                                        occurrenceID=['sequential', 'site', 'landscape'],
...                                        basisOfRecord='HumanObservation')
>>> occurrences = galaxias.set_coordinates(dataframe=occurrences,
...                                        geodeticDatum='WGS84',
...                                        coordinateUncertaintyInMeters=30)
>>> occurrences = galaxias.set_individual_traits(dataframe=occurrences)

Running suggest_workflow() once more confirms that our dataset has all the required information to be put into a Darwin Core Archive!

>>> galaxias.suggest_workflow(occurrences=occurrences)

── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── All DwC terms ──

Matched 9 of 16 column names to DwC terms:

✓ Matched: occurrenceID, eventDate, scientificName, decimalLatitude, decimalLongitude, sex, basisOfRecord, geodeticDatum, coordinateUncertaintyInMeters
✗ Unmatched: site, comments, landscape, molecular sex, species code, sample id, age class

── Minimum required DwC terms occurrences ──

Type                       Matched term(s)                                                                  Missing term(s)
-------------------------  -------------------------------------------------------------------------------  -----------------
Identifier (at least one)  occurrenceID                                                                     -
Record type                basisOfRecord                                                                    -
Scientific name            scientificName                                                                   -
Location                   decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters  -
Date/Time                  eventDate                                                                        -

── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Congratulations! You have the required Darwin Core terms for occurrences. Use corella.check_occurrences() to check whether your data is also Darwin Core compliant.
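The logic behind the "Minimum required DwC terms" table can be sketched in a few lines of plain Python: one group needs at least one of its terms, the others need all of them. The function name `missing_required_terms` and the two dictionaries are our own, transcribed from the tables printed above, and are not part of galaxias or corella.

```python
# Requirement groups transcribed from the "Minimum required DwC terms" table
ANY_OF = {
    "Identifier": ["occurrenceID", "catalogNumber", "recordNumber"],
}
ALL_OF = {
    "Record type": ["basisOfRecord"],
    "Scientific name": ["scientificName"],
    "Location": [
        "decimalLatitude",
        "decimalLongitude",
        "geodeticDatum",
        "coordinateUncertaintyInMeters",
    ],
    "Date/Time": ["eventDate"],
}


def missing_required_terms(columns):
    """Return {requirement type: [missing terms]} for a list of column names."""
    present = set(columns)
    missing = {}
    for label, terms in ANY_OF.items():
        # Satisfied as soon as any one of the terms is present
        if not present.intersection(terms):
            missing[label] = list(terms)
    for label, terms in ALL_OF.items():
        # Every term in the group must be present
        absent = [term for term in terms if term not in present]
        if absent:
            missing[label] = absent
    return missing


# Column names as they stood after the first round of renaming
partway = ["eventDate", "scientificName", "decimalLatitude",
           "decimalLongitude", "sex"]
print(missing_required_terms(partway))
```

An empty dictionary means every requirement group is satisfied, which corresponds to the final table above where the "Missing term(s)" column contains only dashes.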

Here, you can do one of two things:

Select only the columns which are currently Darwin Core compliant.
Use optional functions to ensure other parts of your data are Darwin Core compliant, and include those in your final dataset.

To see which Darwin Core terms are included in checks in galaxias, consult the list below.

Supported Darwin Core Terms and Their Associated Functions

Darwin Core Term	`set` Function
basisOfRecord	set_occurrences()
occurrenceID	set_occurrences()
scientificName	set_scientific_name()
decimalLatitude	set_coordinates()
decimalLongitude	set_coordinates()
geodeticDatum	set_coordinates()
coordinateUncertaintyInMeters	set_coordinates()
eventDate	set_datetime()
kingdom	set_taxonomy()
phylum	set_taxonomy()
class	set_taxonomy()
order	set_taxonomy()
family	set_taxonomy()
genus	set_taxonomy()
specificEpithet	set_taxonomy()
vernacularName	set_taxonomy()
taxonRank	set_scientific_name()
scientificNameAuthorship	set_scientific_name()
recordedBy	set_observer()
recordedByID	set_observer()
measurementID	set_measurements()
measurementType	set_measurements()
measurementValue	set_measurements()
measurementUnit	set_measurements()
continent	set_locality()
country	set_locality()
countryCode	set_locality()
stateProvince	set_locality()
locality	set_locality()
license	set_license()
rightsHolder	set_license()
accessRights	set_license()
sex	set_individual_traits()
lifeStage	set_individual_traits()
reproductiveCondition	set_individual_traits()
vitality	set_individual_traits()
individualID	set_individual_traits()
eventID	set_events()
parentEventID	set_events()
eventType	set_events()
eventTime	set_datetime()
year	set_datetime()
month	set_datetime()
day	set_datetime()
coordinatePrecision	set_coordinates()
datasetID	set_collection()
datasetName	set_collection()
catalogNumber	set_collection()
individualCount	set_abundance()
organismQuantity	set_abundance()
organismQuantityType	set_abundance()
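One way to put the table to work is as a lookup from column name to the function that checks it. The sketch below transcribes only a handful of rows; `TERM_TO_FUNCTION` and `functions_for` are hypothetical names for illustration and do not exist in galaxias.

```python
# A subset of the table above, transcribed as a term -> function lookup
TERM_TO_FUNCTION = {
    "occurrenceID": "set_occurrences()",
    "basisOfRecord": "set_occurrences()",
    "scientificName": "set_scientific_name()",
    "decimalLatitude": "set_coordinates()",
    "decimalLongitude": "set_coordinates()",
    "eventDate": "set_datetime()",
    "sex": "set_individual_traits()",
    "lifeStage": "set_individual_traits()",
}


def functions_for(columns):
    """Group recognised column names by the set_ function that checks them."""
    grouped = {}
    for column in columns:
        function = TERM_TO_FUNCTION.get(column)
        if function is not None:
            grouped.setdefault(function, []).append(column)
    return grouped


# "comments" is not a supported term, so it is left out of the result
grouped = functions_for(["sex", "lifeStage", "eventDate", "comments"])
print(grouped)
```

Columns that do not appear in the lookup, such as comments here, are the ones that would need renaming to a supported term before any check can apply.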

To select only the columns that are Darwin Core compliant, run the following snippet of code:

>>> occ_terms = list(galaxias.occurrence_terms())
>>> occ_terms_dwca = list(set(occ_terms).intersection(list(occurrences.columns)))
>>> occurrences_final = occurrences[occ_terms_dwca]
>>> occurrences_final

    eventDate  decimalLongitude geodeticDatum  decimalLatitude     basisOfRecord  coordinateUncertaintyInMeters     occurrenceID sex             scientificName
2008-05-05        143.258586         WGS84       -36.867536  HumanObservation                             30   0-KKNP-01-KKNP   F       Climacteris picumnus
2008-08-22        143.254638         WGS84       -36.877767  HumanObservation                             30   1-KKNP-01-KKNP   F       Climacteris picumnus
2008-02-05        143.293905         WGS84       -36.850162  HumanObservation                             30   2-KKNP-02-KKNP   F       Climacteris picumnus
2008-08-22        143.313389         WGS84       -36.852531  HumanObservation                             30   3-KKNP-02-KKNP   F       Climacteris picumnus
2008-02-06        143.306694         WGS84       -36.881165  HumanObservation                             30   4-KKNP-03-KKNP   ?        Artamus cyanopterus
2008-08-22        143.311887         WGS84       -36.891223  HumanObservation                             30   5-KKNP-03-KKNP   M          Acanthiza lineata
2008-08-21        143.690593         WGS84       -37.069201  HumanObservation                             30   6-PRSP-01-PRSP   ?        Artamus cyanopterus
2008-02-06        143.693082         WGS84       -37.077008  HumanObservation                             30   7-PRSP-01-PRSP   M       Acanthiza reguloides
2008-02-06        143.701537         WGS84       -37.084711  HumanObservation                             30   8-PRSP-02-PRSP   M            Malurus cyaneus
2008-02-06        143.696344         WGS84       -37.087244  HumanObservation                             30   9-PRSP-02-PRSP   M            Malurus cyaneus
2008-02-05        143.691580         WGS84       -37.104941  HumanObservation                             30  10-PRSP-03-PRSP   M       Acanthiza reguloides
2008-02-06        143.689778         WGS84       -37.108501  HumanObservation                             30  11-PRSP-03-PRSP   F          Acanthiza lineata
2008-02-06        143.720473         WGS84       -37.069663  HumanObservation                             30  12-MBRP-01-MBRP   ?            Ptilotula fusca
2008-02-06        143.721127         WGS84       -37.070489  HumanObservation                             30  13-MBRP-01-MBRP   ?            Ptilotula fusca
2008-02-06        143.725065         WGS84       -37.067090  HumanObservation                             30  14-MBRP-02-MBRP   ?  Melithreptus brevirostris
2008-08-22        143.727297         WGS84       -37.067780  HumanObservation                             30  15-MBRP-02-MBRP   ?  Melithreptus brevirostris
2008-08-21        144.070257         WGS84       -37.177554  HumanObservation                             30  16-RSNC-01-RSNC   F            Malurus cyaneus
2008-02-06        144.063391         WGS84       -37.162234  HumanObservation                             30  17-RSNC-01-RSNC   ?            Ptilotula fusca
2008-02-06        144.046911         WGS84       -37.232654  HumanObservation                             30  18-RSNC-02-RSNC   F       Pardalotus punctatus
2008-02-06        144.066481         WGS84       -37.286758  HumanObservation                             30  19-RSNC-02-RSNC   F       Pardalotus punctatus
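Because a Python set has no guaranteed order, the intersection above can return the columns in a shuffled order, as seen in the output. If you would rather keep the columns in their original order, a list comprehension is one alternative. In this sketch `dwc_terms` is a short hand-written stand-in for the list of supported terms, and the one-row data frame is illustrative only.

```python
import pandas as pd

# Hand-written stand-in for the full list of supported Darwin Core terms
dwc_terms = {"occurrenceID", "basisOfRecord", "scientificName",
             "eventDate", "sex"}

# One illustrative row mixing Darwin Core and non-Darwin Core columns
df = pd.DataFrame({
    "landscape": ["KKNP"],
    "eventDate": ["2008-05-05"],
    "scientificName": ["Climacteris picumnus"],
    "comments": ["secondaries worn"],
    "sex": ["F"],
    "occurrenceID": ["0-KKNP-01-KKNP"],
})

# Keep Darwin Core columns only, preserving their order in the data frame
keep = [column for column in df.columns if column in dwc_terms]
final = df[keep]

print(list(final.columns))
```

Both approaches select the same columns; only the column order differs.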

We can specify that we wish to use these occurrences in our Darwin Core Archive with use_data(), which will save them as a csv file, occurrences.csv, in the default directory data-publish.

>>> galaxias.use_data(occurrences=occurrences_final)
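In file terms, the step described above amounts to writing the data frame to occurrences.csv inside a data-publish folder. The sketch below does this by hand with pandas in a temporary directory, so that nothing is left behind. It is an assumption about the resulting file layout, not a reproduction of how use_data() is implemented.

```python
import tempfile
from pathlib import Path

import pandas as pd

# One standardised record, as a stand-in for the final dataset
final = pd.DataFrame({
    "occurrenceID": ["0-KKNP-01-KKNP"],
    "basisOfRecord": ["HumanObservation"],
    "scientificName": ["Climacteris picumnus"],
    "eventDate": ["2008-05-05"],
})

with tempfile.TemporaryDirectory() as tmp:
    # Mirror the default directory and file name described above
    publish_dir = Path(tmp) / "data-publish"
    publish_dir.mkdir()
    out_path = publish_dir / "occurrences.csv"

    # index=False keeps the pandas row index out of the published file
    final.to_csv(out_path, index=False)

    # Read the file back to confirm what was written
    round_trip = pd.read_csv(out_path)

print(list(round_trip.columns))
```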

In data terms, that’s it! Don’t forget to add metadata. An explanation of how to add metadata is here.