Standardising an Occurrence Dataset
Amanda Buyan, Dax Kellie & Martin Westgate
In Living Atlases like the Atlas of Living Australia (ALA), the default type of data is occurrence data, where each record refers to the presence or absence of an organism or taxon in a particular place at a specific time. This is a relatively simple data structure, in which each observation or record is assumed to be independent of the others. This simplicity also allows occurrence-based data to be easily aggregated.
Here, we’ll go through the steps to standardise and build an occurrence dataset using galaxias.
The dataset
The data we’ll use are bird observations from 10 sites across 4 landscapes. As these are occurrence data,
this dataset contains evidence of the presence of certain bird species (species) at particular
locations (lat, lon) at specific times (date). It also contains additional information
about the landscape type, and the sex and age class of each bird.
>>> import galaxias
>>> import pandas as pd
>>> occurrences = pd.read_csv('dummy-dataset-sb.csv')
>>> # set all titles to lowercase
>>> occurrences.columns = map(str.lower, occurrences.columns)
>>> occurrences
landscape site date species lat lon comments species code sex age class sample id molecular sex
0 KKNP KKNP-01 2008-05-05 Climacteris picumnus -36.867536 143.258586 secondaries worn BTC F Ad SB01 F
1 KKNP KKNP-01 2008-08-22 Climacteris picumnus -36.877767 143.254638 NaN BTC F J SB02 F
2 KKNP KKNP-02 2008-02-05 Climacteris picumnus -36.850162 143.293905 NaN BTC F J SB03 F
3 KKNP KKNP-02 2008-08-22 Climacteris picumnus -36.852531 143.313389 NaN BTC F Ad SB04 F
4 KKNP KKNP-03 2008-02-06 Artamus cyanopterus -36.881165 143.306694 heavy parasite load DWS ? Ad SB05 M
5 KKNP KKNP-03 2008-08-22 Acanthiza lineata -36.891223 143.311887 heavy parasite load STT M J SB06 M
6 PRSP PRSP-01 2008-08-21 Artamus cyanopterus -37.069201 143.690593 NaN DWS ? J SB07 F
7 PRSP PRSP-01 2008-02-06 Acanthiza reguloides -37.077008 143.693082 NaN BRT M J SB08 M
8 PRSP PRSP-02 2008-02-06 Malurus cyaneus -37.084711 143.701537 NaN SFW M J SB09 M
9 PRSP PRSP-02 2008-02-06 Malurus cyaneus -37.087244 143.696344 NaN SFW M Ad SB10 M
10 PRSP PRSP-03 2008-02-05 Acanthiza reguloides -37.104941 143.691580 NaN BRT M Ad SB11 M
11 PRSP PRSP-03 2008-02-06 Acanthiza lineata -37.108501 143.689778 NaN STT F Ad SB12 F
12 MBRP MBRP-01 2008-02-06 Ptilotula fusca -37.069663 143.720473 NaN FHE ? Ad SB13 F
13 MBRP MBRP-01 2008-02-06 Ptilotula fusca -37.070489 143.721127 NaN FHE ? Ad SB14 M
14 MBRP MBRP-02 2008-02-06 Melithreptus brevirostris -37.067090 143.725065 NaN BHE ? Ad SB15 F
15 MBRP MBRP-02 2008-08-22 Melithreptus brevirostris -37.067780 143.727297 NaN BHE ? Ad SB16 M
16 RSNC RSNC-01 2008-08-21 Malurus cyaneus -37.177554 144.070257 NaN SFW F Ad SB17 F
17 RSNC RSNC-01 2008-02-06 Ptilotula fusca -37.162234 144.063391 NaN FHE ? J SB18 M
18 RSNC RSNC-02 2008-02-06 Pardalotus punctatus -37.232654 144.046911 NaN SPP F Ad SB19 F
19 RSNC RSNC-02 2008-02-06 Pardalotus punctatus -37.286758 144.066481 NaN SPP F Ad SB20 F
Standardise to Darwin Core
We can use suggest_workflow() to determine what we need to do to standardise this dataset.
>>> galaxias.suggest_workflow(occurrences=occurrences)
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── All DwC terms ──
Matched 1 of 12 column names to DwC terms:
✓ Matched: sex
✗ Unmatched: date, comments, species, lat, age class, species code, molecular sex, landscape, lon, sample id, site
── Minimum required DwC terms occurrences ──
Type Matched term(s) Missing term(s)
------------------------- ----------------- -------------------------------------------------------------------------------
Identifier (at least one) - occurrenceID OR catalogNumber OR recordNumber
Record type - basisOfRecord
Scientific name - scientificName
Location - decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters
Date/Time - eventDate
── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── Occurrences ──
To make your occurrences Darwin Core compliant, use the following workflow:
corella.set_occurrences()
corella.set_scientific_name()
corella.set_coordinates()
corella.set_datetime()
Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()
Calling suggest_workflow() tells us that one column in the dataset (sex) matches a Darwin Core
term, and that we are missing all of the minimum required Darwin Core terms. We’re also
given a suggested workflow consisting of a series of set_ functions for renaming,
modifying, or adding missing columns. set_ functions are specialised wrappers around the
pandas package, with additional functionality to support the Darwin Core Standard.
Let’s start by renaming existing columns to align with Darwin Core terms. set_ functions
will automatically check to make sure each column is correctly formatted.
>>> occurrences = galaxias.set_scientific_name(dataframe=occurrences,
...                                            scientificName='species')
>>> occurrences = galaxias.set_coordinates(dataframe=occurrences,
...                                        decimalLatitude='lat',
...                                        decimalLongitude='lon')
>>> occurrences = galaxias.set_datetime(dataframe=occurrences,
...                                     eventDate='date',
...                                     string_to_datetime=True,
...                                     yearfirst=True)
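Since the set_ functions are described as wrappers around pandas, it can help to see roughly what this renaming step amounts to at the column level. The sketch below uses only pandas on a two-row excerpt of the dataset. It is an illustration under that assumption, not how galaxias is implemented, and it performs none of the Darwin Core format checks.

```python
import pandas as pd

# Two rows copied from the dataset above (only the columns being renamed).
raw = pd.DataFrame({
    "species": ["Climacteris picumnus", "Artamus cyanopterus"],
    "lat": [-36.867536, -36.881165],
    "lon": [143.258586, 143.306694],
    "date": ["2008-05-05", "2008-02-06"],
})

# Map the original column names onto Darwin Core terms.
dwc_names = {
    "species": "scientificName",
    "lat": "decimalLatitude",
    "lon": "decimalLongitude",
    "date": "eventDate",
}
renamed = raw.rename(columns=dwc_names)

# Parse the year-first date strings into datetime values.
renamed["eventDate"] = pd.to_datetime(renamed["eventDate"], yearfirst=True)

print(list(renamed.columns))
print(renamed["eventDate"].dtype)
```

Renaming alone does not validate anything, which is why the set_ functions are worth using on real data.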
Calling suggest_workflow() again accounts for our progress and shows us what still needs to be done. Here, we can see that we’re still missing a couple of minimum required terms.
>>> galaxias.suggest_workflow(occurrences=occurrences)
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── All DwC terms ──
Matched 5 of 12 column names to DwC terms:
✓ Matched: eventDate, scientificName, decimalLatitude, decimalLongitude, sex
✗ Unmatched: molecular sex, landscape, site, sample id, comments, species code, age class
── Minimum required DwC terms occurrences ──
Type Matched term(s) Missing term(s)
------------------------- --------------------------------- ---------------------------------------------
Identifier (at least one) - occurrenceID OR catalogNumber OR recordNumber
Record type - basisOfRecord
Scientific name scientificName -
Location decimalLatitude, decimalLongitude geodeticDatum, coordinateUncertaintyInMeters
Date/Time eventDate -
── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── Occurrences ──
To make your occurrences Darwin Core compliant, use the following workflow:
corella.set_occurrences()
corella.set_coordinates()
Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()
Here’s a rundown of the columns we need to add:
occurrenceID: Unique identifier for each record, which ensures that we can identify specific records for future updates or corrections. We can use composite_id(), sequential_id(), or random_id() to add a unique ID to each row.
basisOfRecord: The type of record (e.g. human observation, specimen from a museum collection, machine observation). See a list of acceptable values with corella::basisOfRecord_values().
geodeticDatum: The geographic coordinate reference system (CRS), which is a framework for representing spatial data (for example, the CRS of Google Maps is “WGS84”).
coordinateUncertaintyInMeters: The area of uncertainty around your observation, which you may be able to infer based on your data collection method.
As suggested, let’s add these columns using set_occurrences() and set_coordinates(). We
can also use an optional function, set_individual_traits(), which will automatically
identify the matched column name sex and check the column’s format.
Creating unique occurrenceID
There are multiple ways of creating unique IDs for your occurrences. To create either
sequential or random IDs, you only have to provide the keyword 'sequential' or 'random'.
To build a unique ID from column values, provide the column names in a list to the
occurrenceID argument, and the values are joined with hyphens in the order listed
(for example, 0-KKNP-01-KKNP for ['sequential', 'site', 'landscape']).
>>> # example: sequential ID first, followed by the site and landscape values
>>> occurrences = galaxias.set_occurrences(occurrences=occurrences,
...                                        occurrenceID=['sequential', 'site', 'landscape'])
>>>
>>> # example: random ID last, preceded by the site and landscape values
>>> occurrences = galaxias.set_occurrences(occurrences=occurrences,
...                                        occurrenceID=['site', 'landscape', 'random'])
>>>
>>> # the call used in this walkthrough: sequential ID first, plus basisOfRecord
>>> occurrences = galaxias.set_occurrences(occurrences=occurrences,
...                                        occurrenceID=['sequential', 'site', 'landscape'],
...                                        basisOfRecord='HumanObservation')
>>> occurrences = galaxias.set_coordinates(dataframe=occurrences,geodeticDatum = 'WGS84',
... coordinateUncertaintyInMeters = 30)
>>> occurrences = galaxias.set_individual_traits(dataframe=occurrences)
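To make the resulting columns concrete, here is a pandas-only sketch of a composite ID built from a sequential number, site and landscape, plus the constant-valued terms. It mirrors the ID format shown in the final dataset (e.g. 0-KKNP-01-KKNP) but is not galaxias's own code. The UUID-based random ID and the randomID column name are assumptions for illustration; the source does not say how galaxias generates random IDs.

```python
import uuid

import pandas as pd

df = pd.DataFrame({
    "site": ["KKNP-01", "KKNP-01", "KKNP-02"],
    "landscape": ["KKNP", "KKNP", "KKNP"],
})

# Sequential part: the row position, starting at 0, as text.
sequential = pd.Series(range(len(df)), index=df.index).astype(str)

# Composite ID: sequential number, then site, then landscape, joined by "-".
df["occurrenceID"] = sequential + "-" + df["site"] + "-" + df["landscape"]

# Constant-valued terms are identical for every record.
df = df.assign(
    basisOfRecord="HumanObservation",
    geodeticDatum="WGS84",
    coordinateUncertaintyInMeters=30,
)

# Assumed alternative: one random UUID per row (hypothetical column name).
df["randomID"] = [str(uuid.uuid4()) for _ in range(len(df))]

print(df["occurrenceID"].tolist())
# ['0-KKNP-01-KKNP', '1-KKNP-01-KKNP', '2-KKNP-02-KKNP']
```

Including the sequential number is what makes the ID unique here: site and landscape alone repeat across rows.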
What if my lat/long are in degrees, minutes, seconds?
The Atlas of Living Australia requires that lat/longs be in decimal degrees. If your lat/longs are in degrees, minutes and seconds (DMS), there is a Python package that will convert your lat/longs into decimal degrees: lat_lon_parser.
Below is a code snippet used on an example dataframe to convert from DMS to
decimal degrees. To do this on an occurrences dataframe in a dwca
object, replace the variable occ with <NAME_OF_DWCA_OBJECT>.occurrences.
>>> from lat_lon_parser import parse
>>> import pandas as pd
>>> occ = pd.DataFrame(
... {
... 'decimalLatitude': ["35° 50' 11\"", "45° 51' 13\"", "30° 20' 10\""],
... 'decimalLongitude': ["138° 01' 26\"", "139° 11' 16\"", "128° 05' 29\""]
... }
... )
>>> for i, row in occ.iterrows():
... occ.at[i, 'decimalLatitude'] = round(parse(row['decimalLatitude']),2)
... occ.at[i, 'decimalLongitude'] = round(parse(row['decimalLongitude']),2)
>>> occ
decimalLatitude decimalLongitude
0 35.84 138.02
1 45.85 139.19
2 30.34 128.09
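For reference, the arithmetic behind the conversion is degrees + minutes/60 + seconds/3600. The helper below is a hypothetical stand-alone function, not part of lat_lon_parser, and it reproduces the first row of the output above. It takes the sign from the degrees value, so it would not handle a value such as -0° 30' correctly.

```python
def dms_to_decimal(degrees: float, minutes: float, seconds: float) -> float:
    """Convert degrees, minutes, seconds to decimal degrees.

    The sign is taken from `degrees`; minutes and seconds must be non-negative.
    """
    sign = -1.0 if degrees < 0 else 1.0
    return sign * (abs(degrees) + minutes / 60 + seconds / 3600)


lat = dms_to_decimal(35, 50, 11)   # 35 + 50/60 + 11/3600
lon = dms_to_decimal(138, 1, 26)   # 138 + 1/60 + 26/3600
print(round(lat, 2), round(lon, 2))
# 35.84 138.02
```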
Running suggest_workflow() once more confirms that our dataset has all the required
information to be put into a Darwin Core Archive!
>>> galaxias.suggest_workflow(occurrences=occurrences)
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── All DwC terms ──
Matched 9 of 16 column names to DwC terms:
✓ Matched: occurrenceID, eventDate, scientificName, decimalLatitude, decimalLongitude, sex, basisOfRecord, geodeticDatum, coordinateUncertaintyInMeters
✗ Unmatched: site, comments, landscape, molecular sex, species code, sample id, age class
── Minimum required DwC terms occurrences ──
Type Matched term(s) Missing term(s)
------------------------- ------------------------------------------------------------------------------- -----------------
Identifier (at least one) occurrenceID -
Record type basisOfRecord -
Scientific name scientificName -
Location decimalLatitude, decimalLongitude, geodeticDatum, coordinateUncertaintyInMeters -
Date/Time eventDate -
── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Congratulations! You have the required Darwin Core terms for occurrences. Use corella.check_occurrences() to check whether your data is also Darwin Core compliant.
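The minimum-required table in these reports is, at heart, a comparison between the column names and a list of terms. The sketch below shows that comparison in plain Python and pandas, using the required terms listed in the report above. The helper name is hypothetical, and it checks column names only, not whether the values are valid.

```python
import pandas as pd

# Required term groups, as listed in the suggest_workflow() report.
IDENTIFIERS = {"occurrenceID", "catalogNumber", "recordNumber"}  # at least one
REQUIRED = {
    "basisOfRecord", "scientificName", "decimalLatitude", "decimalLongitude",
    "geodeticDatum", "coordinateUncertaintyInMeters", "eventDate",
}


def missing_required_terms(df: pd.DataFrame) -> list[str]:
    """Return required Darwin Core terms absent from df's columns (hypothetical helper)."""
    columns = set(df.columns)
    missing = sorted(REQUIRED - columns)
    if not (IDENTIFIERS & columns):
        missing.append("occurrenceID OR catalogNumber OR recordNumber")
    return missing


# Column names as they stood after the renaming step.
partial = pd.DataFrame(columns=[
    "scientificName", "decimalLatitude", "decimalLongitude", "eventDate", "sex",
])
print(missing_required_terms(partial))

# Column names once every required term has been added.
complete = pd.DataFrame(columns=sorted(REQUIRED | {"occurrenceID"}))
print(missing_required_terms(complete))
```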
Here, you can do one of two things:
Select only the columns which are currently Darwin Core compliant.
Use optional functions to ensure other parts of your data are Darwin Core compliant, and include those in your final dataset.
To see which Darwin Core terms are included in checks in galaxias, consult the list below.
Supported Darwin Core Terms and Their Associated Functions
| Darwin Core Term | Associated Function |
|---|---|
| basisOfRecord | set_occurrences() |
| occurrenceID | set_occurrences() |
| scientificName | set_scientific_name() |
| decimalLatitude | set_coordinates() |
| decimalLongitude | set_coordinates() |
| geodeticDatum | set_coordinates() |
| coordinateUncertaintyInMeters | set_coordinates() |
| eventDate | set_datetime() |
| kingdom | set_taxonomy() |
| phylum | set_taxonomy() |
| class | set_taxonomy() |
| order | set_taxonomy() |
| family | set_taxonomy() |
| genus | set_taxonomy() |
| specificEpithet | set_taxonomy() |
| vernacularName | set_taxonomy() |
| taxonRank | set_scientific_name() |
| scientificNameAuthorship | set_scientific_name() |
| recordedBy | set_observer() |
| recordedByID | set_observer() |
| measurementID | set_measurements() |
| measurementType | set_measurements() |
| measurementValue | set_measurements() |
| measurementUnit | set_measurements() |
| continent | set_locality() |
| country | set_locality() |
| countryCode | set_locality() |
| stateProvince | set_locality() |
| locality | set_locality() |
| license | set_license() |
| rightsHolder | set_license() |
| accessRights | set_license() |
| sex | set_individual_traits() |
| lifeStage | set_individual_traits() |
| reproductiveCondition | set_individual_traits() |
| vitality | set_individual_traits() |
| individualID | set_individual_traits() |
| eventID | set_events() |
| parentEventID | set_events() |
| eventType | set_events() |
| eventTime | set_datetime() |
| year | set_datetime() |
| month | set_datetime() |
| day | set_datetime() |
| coordinatePrecision | set_coordinates() |
| datasetID | set_collection() |
| datasetName | set_collection() |
| catalogNumber | set_collection() |
| individualCount | set_abundance() |
| organismQuantity | set_abundance() |
| organismQuantityType | set_abundance() |
To select only the columns that are Darwin Core compliant, run the following snippet of code:
>>> occ_terms = list(galaxias.occurrence_terms())
>>> occ_terms_dwca = list(set(occ_terms).intersection(list(occurrences.columns)))
>>> occurrences_final = occurrences[occ_terms_dwca]
>>> occurrences_final
eventDate decimalLongitude geodeticDatum decimalLatitude basisOfRecord coordinateUncertaintyInMeters occurrenceID sex scientificName
0 2008-05-05 143.258586 WGS84 -36.867536 HumanObservation 30 0-KKNP-01-KKNP F Climacteris picumnus
1 2008-08-22 143.254638 WGS84 -36.877767 HumanObservation 30 1-KKNP-01-KKNP F Climacteris picumnus
2 2008-02-05 143.293905 WGS84 -36.850162 HumanObservation 30 2-KKNP-02-KKNP F Climacteris picumnus
3 2008-08-22 143.313389 WGS84 -36.852531 HumanObservation 30 3-KKNP-02-KKNP F Climacteris picumnus
4 2008-02-06 143.306694 WGS84 -36.881165 HumanObservation 30 4-KKNP-03-KKNP ? Artamus cyanopterus
5 2008-08-22 143.311887 WGS84 -36.891223 HumanObservation 30 5-KKNP-03-KKNP M Acanthiza lineata
6 2008-08-21 143.690593 WGS84 -37.069201 HumanObservation 30 6-PRSP-01-PRSP ? Artamus cyanopterus
7 2008-02-06 143.693082 WGS84 -37.077008 HumanObservation 30 7-PRSP-01-PRSP M Acanthiza reguloides
8 2008-02-06 143.701537 WGS84 -37.084711 HumanObservation 30 8-PRSP-02-PRSP M Malurus cyaneus
9 2008-02-06 143.696344 WGS84 -37.087244 HumanObservation 30 9-PRSP-02-PRSP M Malurus cyaneus
10 2008-02-05 143.691580 WGS84 -37.104941 HumanObservation 30 10-PRSP-03-PRSP M Acanthiza reguloides
11 2008-02-06 143.689778 WGS84 -37.108501 HumanObservation 30 11-PRSP-03-PRSP F Acanthiza lineata
12 2008-02-06 143.720473 WGS84 -37.069663 HumanObservation 30 12-MBRP-01-MBRP ? Ptilotula fusca
13 2008-02-06 143.721127 WGS84 -37.070489 HumanObservation 30 13-MBRP-01-MBRP ? Ptilotula fusca
14 2008-02-06 143.725065 WGS84 -37.067090 HumanObservation 30 14-MBRP-02-MBRP ? Melithreptus brevirostris
15 2008-08-22 143.727297 WGS84 -37.067780 HumanObservation 30 15-MBRP-02-MBRP ? Melithreptus brevirostris
16 2008-08-21 144.070257 WGS84 -37.177554 HumanObservation 30 16-RSNC-01-RSNC F Malurus cyaneus
17 2008-02-06 144.063391 WGS84 -37.162234 HumanObservation 30 17-RSNC-01-RSNC ? Ptilotula fusca
18 2008-02-06 144.046911 WGS84 -37.232654 HumanObservation 30 18-RSNC-02-RSNC F Pardalotus punctatus
19 2008-02-06 144.066481 WGS84 -37.286758 HumanObservation 30 19-RSNC-02-RSNC F Pardalotus punctatus
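Because the selection above goes through a set intersection, the column order of the result is not guaranteed to be the same from one run to the next. If a stable order matters, one option is to filter the dataframe's own columns instead, as sketched below. The short dwc_terms list is a stand-in for whatever galaxias.occurrence_terms() returns and is an assumption made for illustration only.

```python
import pandas as pd

# Stand-in for the terms returned by galaxias.occurrence_terms();
# this short list is an assumption for illustration only.
dwc_terms = ["occurrenceID", "basisOfRecord", "scientificName", "eventDate", "sex"]
term_set = set(dwc_terms)

df = pd.DataFrame({
    "site": ["KKNP-01"],
    "scientificName": ["Climacteris picumnus"],
    "eventDate": ["2008-05-05"],
    "comments": ["secondaries worn"],
    "sex": ["F"],
    "occurrenceID": ["0-KKNP-01-KKNP"],
})

# Keep Darwin Core columns in the order they already appear in the dataframe.
keep = [column for column in df.columns if column in term_set]
selected = df[keep]

print(list(selected.columns))
# ['scientificName', 'eventDate', 'sex', 'occurrenceID']
```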
We can specify that we wish to use these occurrences in our Darwin Core Archive
with use_data(), which will save them as a csv file named occurrences.csv
in the default directory data-publish.
>>> galaxias.use_data(occurrences=occurrences_final)
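As described above, use_data() saves the occurrences as occurrences.csv in a data-publish directory. A rough pandas-only equivalent of that file-writing step is sketched below. The helper name is hypothetical, it writes under a temporary directory so it does not touch a real project, and anything else use_data() may do is not reproduced.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_occurrences(df: pd.DataFrame, root: Path) -> Path:
    """Write df to <root>/data-publish/occurrences.csv (hypothetical helper)."""
    out_dir = root / "data-publish"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / "occurrences.csv"
    df.to_csv(out_path, index=False)  # index=False keeps the row index out of the file
    return out_path


df = pd.DataFrame({
    "occurrenceID": ["0-KKNP-01-KKNP"],
    "scientificName": ["Climacteris picumnus"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = save_occurrences(df, Path(tmp))
    round_trip = pd.read_csv(path)  # read back while the temporary file still exists

print(round_trip.columns.tolist())
```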
In data terms, that’s it! Don’t forget to add metadata. An explanation of how to add metadata is here.