set_occurrences#
One of the functions you can use to check certain columns of your data is set_occurrences().
This function aims to check that you have the following Darwin Core Vocabulary Terms:
- basisOfRecord: how the occurrence was recorded (was it observed by a human? by a machine? is it part of a collection?)
- occurrenceID, catalogNumber or recordNumber: a unique identifier for the record (only one of these is necessary)
- occurrenceStatus (OPTIONAL): whether a species is present or absent. Not required for data submission.
Specifying basisOfRecord value#
As mentioned above, basisOfRecord is a required field for an observation, as it lets others know how the record was made.
For example, was it observed by a machine? By a human? Is it a specimen that’s part of a collection?
The value you provide depends on your answers to these questions.
Darwin Core has a predefined vocabulary for this field, and galaxias lists it with the following function:
>>> galaxias.basisOfRecord_values()
basisOfRecord values
0 humanObservation
1 machineObservation
2 livingSpecimen
3 preservedSpecimen
4 fossilSpecimen
5 materialCitation
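As a rough illustration of what checking a value against this vocabulary involves, here is a plain-Python sketch. It is not galaxias code: `BASIS_OF_RECORD_VALUES` and `is_valid_basis_of_record` are hypothetical names, the list is copied from the output above, and the case-insensitive comparison is an assumption made so that both `humanObservation` and `HumanObservation` are accepted.

```python
# Hypothetical helper, not part of galaxias.
# The values are copied from the basisOfRecord_values() output above.
BASIS_OF_RECORD_VALUES = [
    "humanObservation",
    "machineObservation",
    "livingSpecimen",
    "preservedSpecimen",
    "fossilSpecimen",
    "materialCitation",
]


def is_valid_basis_of_record(value):
    """Return True if `value` matches a vocabulary entry, ignoring case.

    Ignoring case is an assumption of this sketch, not documented behaviour.
    """
    known = {v.lower() for v in BASIS_OF_RECORD_VALUES}
    return value.strip().lower() in known


print(is_valid_basis_of_record("HumanObservation"))
print(is_valid_basis_of_record("seen by me"))
```

A check like this catches free-text values before they reach a data submission.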
For this exercise, let’s assume a human has seen these, which equates to a value of HumanObservation.
We can then set the basisOfRecord argument to HumanObservation, and by default this value is applied to every row of the dataframe.
>>> my_archive.set_occurrences(
... basisOfRecord='HumanObservation'
... )
>>> my_archive.occurrences.head()
Species Latitude Longitude Collection_date basisOfRecord
0 Corymbia latifolia -13.04 131.07 29/3/2022 HumanObservation
1 Eucalyptus tectifica -13.04 131.07 13/9/2022 HumanObservation
2 Banksia aemula -33.60 150.72 15/8/2022 HumanObservation
3 Eucalyptus sclerophylla -33.60 150.72 16/6/2022 HumanObservation
4 Persoonia laurina -33.60 150.72 19/10/2022 HumanObservation
How to generate occurrence IDs#
Note
- If you have occurrence IDs already in your dataset, you can specify the name of the column that contains your IDs, and galaxias will rename that column to comply with the Darwin Core Vocabulary Standard.
- catalogNumber and/or recordNumber are normally used for collections, so it is best to go with occurrenceID if you’re generating them using galaxias.
Every occurrence needs a unique identifier so that it can be identified later. If your
occurrences don’t have an occurrenceID, catalogNumber or recordNumber,
you can pass True to the occurrenceID argument. You then have to specify which kind of ID you want:
a randomly generated UUID for each occurrence (random_id), composite IDs (composite_id) or sequential IDs (sequential_id).
The example here uses random IDs; a separate vignette covers generating IDs in more detail.
>>> my_archive.set_occurrences(
... basisOfRecord='HumanObservation',
... occurrenceID=True,
... random_id=True
... )
>>> my_archive.occurrences.head()
occurrenceID Species Latitude Longitude Collection_date basisOfRecord
0 0b0ef912-52c6-481d-8a10-ddbd7dbf97b6 Corymbia latifolia -13.04 131.07 29/3/2022 HumanObservation
1 c0f38a6b-9f7f-44f5-be10-302de750083c Eucalyptus tectifica -13.04 131.07 13/9/2022 HumanObservation
2 7b4dba12-dba3-4be7-b761-803c8c0a33b6 Banksia aemula -33.60 150.72 15/8/2022 HumanObservation
3 5cd50060-7e66-4ddf-b585-e62479b33540 Eucalyptus sclerophylla -33.60 150.72 16/6/2022 HumanObservation
4 0e00432a-6122-481f-a12d-d4b0a9c65b10 Persoonia laurina -33.60 150.72 19/10/2022 HumanObservation
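To make the three ID strategies more concrete, here is a standard-library sketch of what each could look like. The function names, the zero-padding width and the way the composite ID is assembled (chosen fields plus a running number) are all assumptions for illustration; this page does not describe how galaxias builds its IDs internally.

```python
import uuid


def random_ids(n):
    """One random (version 4) UUID string per record."""
    return [str(uuid.uuid4()) for _ in range(n)]


def sequential_ids(n, width=4):
    """Zero-padded running numbers: 0001, 0002, ..."""
    return [str(i).zfill(width) for i in range(1, n + 1)]


def composite_ids(records, fields, sep="-"):
    """Join the chosen fields of each record and append a running number,
    so records that share the same field values still get distinct IDs."""
    return [
        sep.join([str(rec[f]) for f in fields] + [str(i)])
        for i, rec in enumerate(records, start=1)
    ]


# Two rows taken from the example data, as plain dictionaries.
records = [
    {"Species": "Corymbia latifolia", "Collection_date": "29/3/2022"},
    {"Species": "Eucalyptus tectifica", "Collection_date": "13/9/2022"},
]

ids = random_ids(len(records))
# Whichever strategy is used, the identifiers must be unique.
assert len(set(ids)) == len(ids)

print(sequential_ids(len(records)))
print(composite_ids(records, ["Collection_date"]))
```

Random UUIDs need no knowledge of the data, whereas composite IDs stay human-readable at the cost of depending on the fields you choose.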
Specifying occurrenceStatus column#
Note
This is an optional field, but we include it here to show how this argument works and how it renames your column.
Sometimes you may want to include the occurrenceStatus field in your observations.
For example, if you expected to see a species in a particular area, or have seen it there in the past, but
did not see it on a particular day, you can use this field to record that it was absent.
If your data has a column that denotes whether a species was present or absent, you can
provide the name of that column, and galaxias will rename it to conform with the
Darwin Core standard. The example below instead passes the value PRESENT, which appears in every row of the output.
>>> my_archive.set_occurrences(
... basisOfRecord='HumanObservation',
... occurrenceStatus='PRESENT'
... )
>>> my_archive.occurrences.head()
occurrenceID Species Latitude Longitude Collection_date basisOfRecord occurrenceStatus
0 05a20950-691d-4c82-9e8c-e67c37c0683b Corymbia latifolia -13.04 131.07 29/3/2022 HumanObservation PRESENT
1 956f448f-e583-47df-bb8f-55c5d8d351c7 Eucalyptus tectifica -13.04 131.07 13/9/2022 HumanObservation PRESENT
2 bd86192c-d530-4cd5-89f8-025127dc70bf Banksia aemula -33.60 150.72 15/8/2022 HumanObservation PRESENT
3 26000299-5928-4419-b6c7-6d6803a7f442 Eucalyptus sclerophylla -33.60 150.72 16/6/2022 HumanObservation PRESENT
4 d49b3c32-6921-4f21-9f28-b919327ce90b Persoonia laurina -33.60 150.72 19/10/2022 HumanObservation PRESENT
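The two ways of supplying occurrenceStatus (renaming an existing column, or applying one value to every row) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not galaxias code: `set_occurrence_status` is a hypothetical function, it splits the single galaxias argument into separate `column` and `value` parameters to keep the logic explicit, and it assumes the only accepted values are PRESENT and ABSENT.

```python
def set_occurrence_status(records, column=None, value=None):
    """Return new records that carry an occurrenceStatus field.

    Pass `column` to rename an existing field, or `value` to apply one
    status to every record. Exactly one of the two must be given.
    """
    if (column is None) == (value is None):
        raise ValueError("provide exactly one of `column` or `value`")
    allowed = {"PRESENT", "ABSENT"}  # assumed vocabulary for this sketch
    result = []
    for rec in records:
        new = dict(rec)  # copy, so the input records are left untouched
        status = new.pop(column) if column is not None else value
        status = str(status).upper()
        if status not in allowed:
            raise ValueError(f"unexpected occurrenceStatus: {status!r}")
        new["occurrenceStatus"] = status
        result.append(new)
    return result


# 'status' is a hypothetical column name used only for this example.
observed = [
    {"Species": "Banksia aemula", "status": "present"},
    {"Species": "Persoonia laurina", "status": "absent"},
]
print(set_occurrence_status(observed, column="status"))
print(set_occurrence_status([{"Species": "Banksia aemula"}], value="PRESENT"))
```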
What do check_dataset and suggest_workflow say now?#
Note
Each of the set_* functions checks your data for compliance with the
Darwin Core standard, but it’s always good to double-check your data.
Now that we’ve taken care of the pieces of information set_occurrences() is responsible
for, we can assign the new dataframe to a variable:
>>> occ = my_archive.set_occurrences(
... basisOfRecord='HumanObservation',
... occurrenceStatus='status',
... occurrenceID=True,
... random_id=True  # choose the ID type, as in the earlier example
... )
Now, we can check that this new dataframe complies with the Darwin Core standard for the basisOfRecord,
occurrenceStatus, occurrenceID, catalogNumber and recordNumber columns.
>>> my_archive.check_dataset()
Number of Errors Pass/Fail Column name
------------------ ----------- ----------------
0 ✓ occurrenceID
0 ✓ basisOfRecord
0 ✓ occurrenceStatus
══ Results ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Errors: 0 | Passes: 3
✗ Data does not meet minimum Darwin core requirements
Use corella.suggest_workflow()
However, since we don’t have all of the required columns, we can run suggest_workflow()
again to see what other functions we can use to check our data:
>>> my_archive.suggest_workflow()
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── All DwC terms ──
Matched 3 of 7 column names to DwC terms:
✓ Matched: occurrenceID, basisOfRecord, occurrenceStatus
✗ Unmatched: Species, Latitude, Collection_date, Longitude
── Minimum required DwC terms occurrences ──
Type Matched term(s) Missing term(s)
------------------------- ----------------- ------------------------------------------------
Identifier (at least one) occurrenceID -
Record type basisOfRecord -
Scientific name - scientificName
Location - decimalLatitude, decimalLongitude, geodeticDatum
Date/Time - eventDate
── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── Occurrences ──
To make your occurrences Darwin Core compliant, use the following workflow:
corella.set_scientific_name()
corella.set_coordinates()
corella.set_datetime()
Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()
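The "Minimum required DwC terms" table in the report above can be reproduced in a few lines of plain Python, which may help when reading it. This is only a sketch: `REQUIRED_TERMS` and `missing_required_terms` are hypothetical names, the groups are copied from the table, and the three identifier terms come from the list at the top of this page. It does not reflect how the library actually matches columns.

```python
# Requirement groups copied from the report above.
# The boolean says whether every term in the group is needed (True)
# or whether any one of them is enough (False).
REQUIRED_TERMS = {
    "Identifier (at least one)": (["occurrenceID", "catalogNumber", "recordNumber"], False),
    "Record type": (["basisOfRecord"], True),
    "Scientific name": (["scientificName"], True),
    "Location": (["decimalLatitude", "decimalLongitude", "geodeticDatum"], True),
    "Date/Time": (["eventDate"], True),
}


def missing_required_terms(columns):
    """Map each requirement group to the terms still missing from `columns`."""
    present = set(columns)
    missing = {}
    for group, (terms, need_all) in REQUIRED_TERMS.items():
        absent = [t for t in terms if t not in present]
        if not need_all and len(absent) < len(terms):
            absent = []  # one matching identifier satisfies the group
        missing[group] = absent
    return missing


# Column names of the example dataframe after set_occurrences().
columns = ["occurrenceID", "Species", "Latitude", "Longitude",
           "Collection_date", "basisOfRecord", "occurrenceStatus"]

for group, terms in missing_required_terms(columns).items():
    print(f"{group}: {', '.join(terms) if terms else '-'}")
```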
Other functions#
To learn more about how to use other functions, see the following pages:
- Optional functions
- Creating Unique IDs
- Passing Dataset