Standardising an Occurrence Dataset

Amanda Buyan, Dax Kellie & Martin Westgate

In Living Atlases like the Atlas of Living Australia (ALA), the default type of data is occurrence data, where each record refers to the presence or absence of an organism or taxon in a particular place at a specific time. This is a relatively simple data structure, in which each observation or record is assumed to be independent of the others. This simplicity also allows occurrence-based data to be easily aggregated.

Here, we’ll go through the steps to standardise and build an occurrence dataset using galaxias.

The dataset

Example Occurrences

The data we’ll use are bird observations from 10 sites across 4 landscapes. As these are occurrence data, the dataset contains evidence of the presence of certain bird species (species) at particular locations (lat, lon) at specific times (date). It also contains additional information about the landscape, and the sex and age class of the birds.

>>> import galaxias
>>> import pandas as pd
>>> occurrences = pd.read_csv('dummy-dataset-sb.csv')
>>> # set all column names to lowercase
>>> occurrences.columns = occurrences.columns.str.lower()
>>> occurrences
   landscape     site        date                    species        lat         lon             comments species code sex age class sample id molecular sex
0       KKNP  KKNP-01  2008-05-05       Climacteris picumnus -36.867536  143.258586     secondaries worn          BTC   F        Ad      SB01             F
1       KKNP  KKNP-01  2008-08-22       Climacteris picumnus -36.877767  143.254638                  NaN          BTC   F         J      SB02             F
2       KKNP  KKNP-02  2008-02-05       Climacteris picumnus -36.850162  143.293905                  NaN          BTC   F         J      SB03             F
3       KKNP  KKNP-02  2008-08-22       Climacteris picumnus -36.852531  143.313389                  NaN          BTC   F        Ad      SB04             F
4       KKNP  KKNP-03  2008-02-06        Artamus cyanopterus -36.881165  143.306694  heavy parasite load          DWS   ?        Ad      SB05             M
5       KKNP  KKNP-03  2008-08-22          Acanthiza lineata -36.891223  143.311887  heavy parasite load          STT   M         J      SB06             M
6       PRSP  PRSP-01  2008-08-21        Artamus cyanopterus -37.069201  143.690593                  NaN          DWS   ?         J      SB07             F
7       PRSP  PRSP-01  2008-02-06       Acanthiza reguloides -37.077008  143.693082                  NaN          BRT   M         J      SB08             M
8       PRSP  PRSP-02  2008-02-06            Malurus cyaneus -37.084711  143.701537                  NaN          SFW   M         J      SB09             M
9       PRSP  PRSP-02  2008-02-06            Malurus cyaneus -37.087244  143.696344                  NaN          SFW   M        Ad      SB10             M
10      PRSP  PRSP-03  2008-02-05       Acanthiza reguloides -37.104941  143.691580                  NaN          BRT   M        Ad      SB11             M
11      PRSP  PRSP-03  2008-02-06          Acanthiza lineata -37.108501  143.689778                  NaN          STT   F        Ad      SB12             F
12      MBRP  MBRP-01  2008-02-06            Ptilotula fusca -37.069663  143.720473                  NaN          FHE   ?        Ad      SB13             F
13      MBRP  MBRP-01  2008-02-06            Ptilotula fusca -37.070489  143.721127                  NaN          FHE   ?        Ad      SB14             M
14      MBRP  MBRP-02  2008-02-06  Melithreptus brevirostris -37.067090  143.725065                  NaN          BHE   ?        Ad      SB15             F
15      MBRP  MBRP-02  2008-08-22  Melithreptus brevirostris -37.067780  143.727297                  NaN          BHE   ?        Ad      SB16             M
16      RSNC  RSNC-01  2008-08-21            Malurus cyaneus -37.177554  144.070257                  NaN          SFW   F        Ad      SB17             F
17      RSNC  RSNC-01  2008-02-06            Ptilotula fusca -37.162234  144.063391                  NaN          FHE   ?         J      SB18             M
18      RSNC  RSNC-02  2008-02-06       Pardalotus punctatus -37.232654  144.046911                  NaN          SPP   F        Ad      SB19             F
19      RSNC  RSNC-02  2008-02-06       Pardalotus punctatus -37.286758  144.066481                  NaN          SPP   F        Ad      SB20             F

Standardise to Darwin Core

We can use suggest_workflow() to determine what we need to do to standardise this dataset.

>>> galaxias.suggest_workflow(occurrences=occurrences)
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── All DwC terms ──

Matched 1 of 12 column names to DwC terms:

✓ Matched: sex
✗ Unmatched: date, comments, species, lat, age class, species code, molecular sex, landscape, lon, sample id, site

── Minimum required DwC terms occurrences ──

Type                       Matched term(s)    Missing term(s)
-------------------------  -----------------  -------------------------------------------------------------------------------
Identifier (at least one)  -                  occurrenceID OR catalogNumber OR recordNumber
Record type                -                  basisOfRecord
Scientific name            -                  scientificName
Location                   -                  decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters
Date/Time                  -                  eventDate

── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Occurrences ──

To make your occurrences Darwin Core compliant, use the following workflow:

corella.set_occurrences()
corella.set_scientific_name()
corella.set_coordinates()
corella.set_datetime()

Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()

Calling suggest_workflow() tells us that one column in the dataset (sex) matches a Darwin Core term, and that we are missing all of the minimum required Darwin Core terms. We’re also given a suggested workflow consisting of a series of set_ functions for renaming, modifying, or adding missing columns. set_ functions are specialised wrappers around the pandas package, with additional functionality to support the Darwin Core Standard.

Let’s start by renaming existing columns to align with Darwin Core terms. set_ functions will automatically check to make sure each column is correctly formatted.

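>>> # pass the dataframe explicitly, as in the later set_coordinates() call
>>> occurrences = galaxias.set_scientific_name(dataframe=occurrences,
...                                            scientificName='species')
>>> occurrences = galaxias.set_coordinates(dataframe=occurrences,
...                                        decimalLatitude='lat',
...                                        decimalLongitude='lon')
>>> occurrences = galaxias.set_datetime(dataframe=occurrences,
...                                     eventDate='date',
...                                     string_to_datetime=True,
...                                     yearfirst=True)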
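
To picture what the renaming step amounts to, here is a minimal sketch in plain pandas. It is not how the set_ functions are implemented (they also run format checks); the two-row dataframe and the names dwc_names and renamed are made up for illustration.

```python
import pandas as pd

# Two rows copied from the example dataset above
df = pd.DataFrame({
    "species": ["Climacteris picumnus", "Artamus cyanopterus"],
    "lat": [-36.867536, -36.881165],
    "lon": [143.258586, 143.306694],
    "date": ["2008-05-05", "2008-02-06"],
})

# Map the original column names to their Darwin Core terms
dwc_names = {
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "eventDate",
}
renamed = df.rename(columns=dwc_names)

# Parse the year-first date strings into datetime values
renamed["eventDate"] = pd.to_datetime(renamed["eventDate"], format="%Y-%m-%d")

print(list(renamed.columns))
```

The mapping dictionary plays the same role as the keyword arguments above: the key is the existing column and the value is the Darwin Core term it becomes.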

Calling suggest_workflow() again accounts for our progress and shows us what still needs to be done. Here, we can see that we’re still missing a couple of minimum required terms.

>>> galaxias.suggest_workflow(occurrences=occurrences)
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── All DwC terms ──

Matched 5 of 12 column names to DwC terms:

✓ Matched: eventDate, scientificName, decimalLatitude, decimalLongitude, sex
✗ Unmatched: molecular sex, landscape, site, sample id, comments, species code, age class

── Minimum required DwC terms occurrences ──

Type                       Matched term(s)                    Missing term(s)
-------------------------  ---------------------------------  ---------------------------------------------
Identifier (at least one)  -                                  occurrenceID OR catalogNumber OR recordNumber
Record type                -                                  basisOfRecord
Scientific name            scientificName                     -
Location                   decimalLatitude, decimalLongitude  geodeticDatum, coordinateUncertaintyInMeters
Date/Time                  eventDate                          -

── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Occurrences ──

To make your occurrences Darwin Core compliant, use the following workflow:

corella.set_occurrences()
corella.set_coordinates()

Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()

Here’s a rundown of the columns we need to add:

  • occurrenceID: Unique identifier for each record, which ensures that we can identify specific records for future updates or corrections. set_occurrences() can build a unique ID for each row from a sequential number, a random value, existing columns, or a combination of these.

  • basisOfRecord: The type of record (e.g. human observation, specimen from a museum collection, machine observation). See a list of acceptable values with corella.basisOfRecord_values().

  • geodeticDatum: The geographic coordinate reference system (CRS), which is a framework for representing spatial data (for example, the CRS of Google Maps is “WGS84”).

  • coordinateUncertaintyInMeters: The radius, in metres, of the area of uncertainty around your observation, which you may be able to infer from your data collection method.
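
As a rough picture of the terms listed above, the sketch below adds the constant-valued columns with plain pandas and runs a few sanity checks on the location fields. It is only an illustration under assumed values (HumanObservation, WGS84, 30 metres), and the checks are not the ones the set_ functions perform.

```python
import pandas as pd

df = pd.DataFrame({
    "decimalLatitude": [-36.867536, -37.069201],
    "decimalLongitude": [143.258586, 143.690593],
})

# Constant columns: every record shares the same value (values are illustrative)
df["basisOfRecord"] = "HumanObservation"
df["geodeticDatum"] = "WGS84"
df["coordinateUncertaintyInMeters"] = 30

# Simple sanity checks on the location columns
lat_ok = df["decimalLatitude"].between(-90, 90).all()
lon_ok = df["decimalLongitude"].between(-180, 180).all()
uncertainty_ok = (df["coordinateUncertaintyInMeters"] > 0).all()
print(lat_ok, lon_ok, uncertainty_ok)
```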

As suggested, let’s add these columns using set_occurrences() and set_coordinates(). We can also use an optional function, set_individual_traits(), which will automatically identify the matched column name sex and check the column’s format.

Creating unique occurrenceID

There are multiple ways of creating unique IDs for your occurrences. To create sequential or random IDs, you only have to provide the keyword sequential or random.

To create a unique ID using column names, provide the column names in a list to the occurrenceID argument, and include the keyword sequential or random in that list at the position where the sequential or random component should appear.

>>> # example: sequential ID first, followed by the site and landscape columns
>>> occurrences = galaxias.set_occurrences(occurrences=occurrences,
...                                        occurrenceID=['sequential', 'site', 'landscape'])
>>>
>>> # example: random ID last, preceded by the site and landscape columns
>>> occurrences = galaxias.set_occurrences(occurrences=occurrences,
...                                        occurrenceID=['site', 'landscape', 'random'])

>>> # add the occurrence ID and record type
>>> occurrences = galaxias.set_occurrences(occurrences=occurrences,
...                                        occurrenceID=['sequential', 'site', 'landscape'],
...                                        basisOfRecord='HumanObservation')
>>> # add the coordinate reference system and coordinate uncertainty
>>> occurrences = galaxias.set_coordinates(dataframe=occurrences,
...                                        geodeticDatum='WGS84',
...                                        coordinateUncertaintyInMeters=30)
>>> # check the format of the existing sex column
>>> occurrences = galaxias.set_individual_traits(dataframe=occurrences)
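
To see why a sequential or random component matters, here is a small standard-library sketch of composite IDs. The hyphen-joined layout mirrors the IDs shown in the final dataset (for example 0-KKNP-01-KKNP); using a UUID for the random part is an assumption made for illustration, not a statement about what galaxias generates.

```python
import uuid

rows = [
    {"site": "KKNP-01", "landscape": "KKNP"},
    {"site": "KKNP-01", "landscape": "KKNP"},
    {"site": "PRSP-02", "landscape": "PRSP"},
]

# Sequential number first, then the column values, joined with hyphens
sequential_ids = [f"{i}-{r['site']}-{r['landscape']}" for i, r in enumerate(rows)]

# Column values first, then a random UUID (assumed format for the random part)
random_ids = [f"{r['site']}-{r['landscape']}-{uuid.uuid4()}" for r in rows]

print(sequential_ids)
# Column values alone repeat, so the extra component is what makes IDs unique
print(len(set(sequential_ids)) == len(rows))
```

The first two rows share the same site and landscape, so an ID built from those columns alone would collide.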

What if my lat/long are in degrees, minutes, seconds?

The Atlas of Living Australia requires that lat/longs be in decimal degrees. If your lat/longs are in degrees, minutes and seconds (DMS), there is a Python package that will convert your lat/longs into decimal degrees: lat_lon_parser.

Below is a code snippet used on an example dataframe to convert from DMS to decimal degrees. To do this on an occurrences dataframe in a dwca object, replace the variable occ with <NAME_OF_DWCA_OBJECT>.occurrences.

>>> from lat_lon_parser import parse
>>> import pandas as pd
>>> occ = pd.DataFrame(
...     {
...         'decimalLatitude': ["35° 50' 11\"", "45° 51' 13\"", "30° 20' 10\""],
...         'decimalLongitude': ["138° 01' 26\"", "139° 11' 16\"", "128° 05' 29\""]
...     }
... )
>>> for i, row in occ.iterrows():
...     occ.at[i, 'decimalLatitude'] = round(parse(row['decimalLatitude']),2)
...     occ.at[i, 'decimalLongitude'] = round(parse(row['decimalLongitude']),2)
>>> occ
  decimalLatitude decimalLongitude
0           35.84           138.02
1           45.85           139.19
2           30.34           128.09
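
The arithmetic behind this conversion is simple: decimal degrees = degrees + minutes/60 + seconds/3600, with a negative sign for the southern and western hemispheres. The sketch below shows it with a hypothetical helper, dms_to_decimal(), which is not part of lat_lon_parser or galaxias.

```python
def dms_to_decimal(degrees, minutes, seconds, hemisphere="N"):
    """Convert degrees, minutes, seconds to decimal degrees."""
    decimal = degrees + minutes / 60 + seconds / 3600
    # South and west are negative in decimal degrees
    return -decimal if hemisphere in ("S", "W") else decimal

lat = round(dms_to_decimal(35, 50, 11), 2)
lon = round(dms_to_decimal(138, 1, 26), 2)
south = round(dms_to_decimal(35, 50, 11, "S"), 2)
print(lat, lon, south)
```

Note that latitudes in Australia are south of the equator, so real data would usually carry an S hemisphere marker or a negative sign.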

Running suggest_workflow() once more confirms that our dataset has all the required information to be put into a Darwin Core Archive!

>>> galaxias.suggest_workflow(occurrences=occurrences)
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── All DwC terms ──

Matched 9 of 16 column names to DwC terms:

✓ Matched: occurrenceID, eventDate, scientificName, decimalLatitude, decimalLongitude, sex, basisOfRecord, geodeticDatum, coordinateUncertaintyInMeters
✗ Unmatched: site, comments, landscape, molecular sex, species code, sample id, age class

── Minimum required DwC terms occurrences ──

Type                       Matched term(s)                                                                  Missing term(s)
-------------------------  -------------------------------------------------------------------------------  -----------------
Identifier (at least one)  occurrenceID                                                                     -
Record type                basisOfRecord                                                                    -
Scientific name            scientificName                                                                   -
Location                   decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters  -
Date/Time                  eventDate                                                                        -

── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

Congratulations! You have the required Darwin Core terms for occurrences. Use corella.check_occurrences() to check whether your data is also Darwin Core compliant.

Here, you can do one of two things:

  1. Select only the columns which are currently Darwin Core compliant

  2. Use optional functions to ensure other parts of your data are Darwin Core compliant, and include those in your final dataset.

To see which Darwin Core terms are included in checks in galaxias, consult the list below.

Supported Darwin Core Terms and Their Associated Functions

Darwin Core Term               set Function
-----------------------------  -----------------------
basisOfRecord                  set_occurrences()
occurrenceID                   set_occurrences()
scientificName                 set_scientific_name()
decimalLatitude                set_coordinates()
decimalLongitude               set_coordinates()
geodeticDatum                  set_coordinates()
coordinateUncertaintyInMeters  set_coordinates()
eventDate                      set_datetime()
kingdom                        set_taxonomy()
phylum                         set_taxonomy()
class                          set_taxonomy()
order                          set_taxonomy()
family                         set_taxonomy()
genus                          set_taxonomy()
specificEpithet                set_taxonomy()
vernacularName                 set_taxonomy()
taxonRank                      set_scientific_name()
scientificNameAuthorship       set_scientific_name()
recordedBy                     set_observer()
recordedByID                   set_observer()
measurementID                  set_measurements()
measurementType                set_measurements()
measurementValue               set_measurements()
measurementUnit                set_measurements()
continent                      set_locality()
country                        set_locality()
countryCode                    set_locality()
stateProvince                  set_locality()
locality                       set_locality()
license                        set_license()
rightsHolder                   set_license()
accessRights                   set_license()
sex                            set_individual_traits()
lifeStage                      set_individual_traits()
reproductiveCondition          set_individual_traits()
vitality                       set_individual_traits()
individualID                   set_individual_traits()
eventID                        set_events()
parentEventID                  set_events()
eventType                      set_events()
eventTime                      set_datetime()
year                           set_datetime()
month                          set_datetime()
day                            set_datetime()
coordinatePrecision            set_coordinates()
datasetID                      set_collection()
datasetName                    set_collection()
catalogNumber                  set_collection()
individualCount                set_abundance()
organismQuantity               set_abundance()
organismQuantityType           set_abundance()

To select only the columns that are Darwin Core compliant, run the following snippet of code:

>>> occ_terms = list(galaxias.occurrence_terms())
>>> occ_terms_dwca = list(set(occ_terms).intersection(list(occurrences.columns)))
>>> occurrences_final = occurrences[occ_terms_dwca]
>>> occurrences_final
    eventDate  decimalLongitude geodeticDatum  decimalLatitude     basisOfRecord  coordinateUncertaintyInMeters     occurrenceID sex             scientificName
0  2008-05-05        143.258586         WGS84       -36.867536  HumanObservation                             30   0-KKNP-01-KKNP   F       Climacteris picumnus
1  2008-08-22        143.254638         WGS84       -36.877767  HumanObservation                             30   1-KKNP-01-KKNP   F       Climacteris picumnus
2  2008-02-05        143.293905         WGS84       -36.850162  HumanObservation                             30   2-KKNP-02-KKNP   F       Climacteris picumnus
3  2008-08-22        143.313389         WGS84       -36.852531  HumanObservation                             30   3-KKNP-02-KKNP   F       Climacteris picumnus
4  2008-02-06        143.306694         WGS84       -36.881165  HumanObservation                             30   4-KKNP-03-KKNP   ?        Artamus cyanopterus
5  2008-08-22        143.311887         WGS84       -36.891223  HumanObservation                             30   5-KKNP-03-KKNP   M          Acanthiza lineata
6  2008-08-21        143.690593         WGS84       -37.069201  HumanObservation                             30   6-PRSP-01-PRSP   ?        Artamus cyanopterus
7  2008-02-06        143.693082         WGS84       -37.077008  HumanObservation                             30   7-PRSP-01-PRSP   M       Acanthiza reguloides
8  2008-02-06        143.701537         WGS84       -37.084711  HumanObservation                             30   8-PRSP-02-PRSP   M            Malurus cyaneus
9  2008-02-06        143.696344         WGS84       -37.087244  HumanObservation                             30   9-PRSP-02-PRSP   M            Malurus cyaneus
10 2008-02-05        143.691580         WGS84       -37.104941  HumanObservation                             30  10-PRSP-03-PRSP   M       Acanthiza reguloides
11 2008-02-06        143.689778         WGS84       -37.108501  HumanObservation                             30  11-PRSP-03-PRSP   F          Acanthiza lineata
12 2008-02-06        143.720473         WGS84       -37.069663  HumanObservation                             30  12-MBRP-01-MBRP   ?            Ptilotula fusca
13 2008-02-06        143.721127         WGS84       -37.070489  HumanObservation                             30  13-MBRP-01-MBRP   ?            Ptilotula fusca
14 2008-02-06        143.725065         WGS84       -37.067090  HumanObservation                             30  14-MBRP-02-MBRP   ?  Melithreptus brevirostris
15 2008-08-22        143.727297         WGS84       -37.067780  HumanObservation                             30  15-MBRP-02-MBRP   ?  Melithreptus brevirostris
16 2008-08-21        144.070257         WGS84       -37.177554  HumanObservation                             30  16-RSNC-01-RSNC   F            Malurus cyaneus
17 2008-02-06        144.063391         WGS84       -37.162234  HumanObservation                             30  17-RSNC-01-RSNC   ?            Ptilotula fusca
18 2008-02-06        144.046911         WGS84       -37.232654  HumanObservation                             30  18-RSNC-02-RSNC   F       Pardalotus punctatus
19 2008-02-06        144.066481         WGS84       -37.286758  HumanObservation                             30  19-RSNC-02-RSNC   F       Pardalotus punctatus
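
Because the snippet above builds the column list from a set intersection, the resulting column order is arbitrary. If you would rather keep the columns in their original order, one possible variation is sketched below in plain pandas. The dwc_terms set is a hypothetical, shortened stand-in for the full list of terms, used here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "site": ["KKNP-01"],
    "scientificName": ["Climacteris picumnus"],
    "comments": ["secondaries worn"],
    "decimalLatitude": [-36.867536],
    "decimalLongitude": [143.258586],
})

# Hypothetical, shortened list of Darwin Core terms for illustration
dwc_terms = {"scientificName", "decimalLatitude", "decimalLongitude", "eventDate"}

# Iterate over the dataframe's columns so their original order is preserved
keep = [col for col in df.columns if col in dwc_terms]
selected = df[keep]
print(list(selected.columns))
```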

We can specify which data we wish to use in our Darwin Core Archive with use_data(). Here we pass only our occurrences, which will be saved as occurrences.csv in the default directory data-publish.

>>> galaxias.use_data(occurrences=occurrences_final)

In data terms, that’s it! Don’t forget to add metadata. An explanation of how to add metadata is here.