Standardising an Events Dataset#
In a research project, data collection can take place at multiple locations and times. At each location and time, there are often multiple samples collected to capture variation across a study area or time period. In Darwin Core, data collected from this type of project are Event-based.
Events are any action that “occurs at some location during some time” (from TDWG). Each sample, for example, is a unique Event, with its own environmental attributes (like topography, tree cover and soil composition) that affect which organisms occur there and how likely they are to occur. Observations of organisms take place within each Event. As such, Events add hierarchy to a dataset by grouping simultaneous observations together, whereas Occurrence-only data are processed as if all occurrences were independent. Event-based data collection adds richness to ecological data that can be useful for more advanced modelling techniques.
Here we will demonstrate how to convert Event-based data to the Darwin Core
standard. To do so, we will create two csv files, events.csv and occurrences.csv,
to build a Darwin Core Archive.
The dataset#
For this example, we’ll use a dataset of frog observations from a
2015 paper in PLOS ONE.
Data were collected by volunteers using 5-minute audio surveys; each row documents
whether each frog species was detected during that recording, coded as present
(1) or absent (0). For the purpose of this vignette, we have downloaded the source data
from Dryad, reduced the number
of rows, and converted the original Excel spreadsheet to three .csv files:
sites.csv, observations.csv and species_list.csv.
Sites#
The sites spreadsheet contains columns that describe each survey location (e.g. depth,
water_type, latitude, longitude) and overall presence/absence of each frog
species in a site (e.g. cpar, csig, limdum). We won’t use the aggregated
species data stored here - we’ll instead export the raw observations - but we’ll still
import the data, because it’s the only place that spatial information is stored.
>>> import pandas as pd
>>> sites = pd.read_csv('sites.csv')
>>> sites
site_code log_size depth water_type latitude longitude cpar csig limdum limper limtas lper lver ulae richness
0 AMA100 2.914535 deep moving -35.168178 149.119212 1 1 1 0 1 1 1 0 6
1 AMH100 0.554287 shallow moving -35.235300 149.045351 0 0 0 1 1 0 0 0 2
2 ANB100 2.012837 deep moving -35.278130 149.110090 0 1 1 1 0 0 0 0 3
3 ANU004 1.753792 deep moving -35.280547 149.111692 0 1 1 0 1 0 0 0 3
4 ANU018 2.377488 deep still -35.281900 149.112100 0 1 0 1 0 0 0 0 2
5 ANU020 2.155047 deep still -35.279600 149.116900 1 1 1 1 1 0 0 0 5
6 ANU022 0.957248 shallow moving -35.278586 149.120161 1 1 0 1 1 0 0 1 5
7 ARA100 2.301030 deep still -35.276400 149.077900 1 1 0 0 1 1 0 0 4
Observations#
The observations spreadsheet contains columns that describe the sample’s physical
properties (e.g. water_type, veg_canopy), linked to sites by the site_code
column. More importantly, it documents whether each species in the region was recorded
during that particular survey (e.g. cpar, csig, limdum).
>>> import pandas as pd
>>> observations = pd.read_csv('observations.csv')
>>> observations
site_code year depth water_type log_size veg_aqu veg_canopy veg_bank pc_urban pc_canopy cpar csig limdum limper limtas lper lver ulae
0 AMA100 2004 deep moving 2.914535 1 0 1 0.2260 0.0427 1 0 0 0 1 1 0 0
1 AMA100 2007 deep moving 2.914535 1 0 1 0.2067 0.0987 1 0 1 0 1 0 0 0
2 AMA100 2007 deep moving 2.914535 1 0 1 0.2067 0.0987 1 0 1 0 1 0 0 0
3 AMA100 2005 deep moving 2.914535 1 0 1 0.2813 0.0567 1 1 1 0 1 0 0 0
4 AMA100 2008 deep moving 2.914535 1 0 1 0.2067 0.0607 1 0 1 0 0 1 0 0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
118 ARA100 2009 deep still 2.301030 1 0 1 0.0779 0.0329 1 0 0 0 1 0 0 0
119 ARA100 2010 deep still 2.301030 1 0 1 0.0307 0.1450 1 0 0 0 1 1 0 0
120 ARA100 2008 deep still 2.301030 1 0 1 0.1593 0.0307 0 0 0 0 0 0 0 0
121 ARA100 2010 deep still 2.301030 1 0 1 0.0307 0.1450 1 0 0 0 1 0 0 0
122 ARA100 2011 deep still 2.301030 1 0 1 0.1171 0.0607 1 0 0 0 0 1 0 0
[123 rows x 18 columns]
Species list#
Finally, the species list spreadsheet lists the eight frog species recorded in this
dataset, and the abbreviation column contains the abbreviated column name used
in the observations dataset.
>>> import pandas as pd
>>> species = pd.read_csv('species_list.csv')
>>> species
scientific_name common_name abbreviation
0 Crinia parinsignifera Plains Froglet cpar
1 Crinia signifera Common Eastern Froglet csig
2 Limnodynastes dumerilii Pobblebonk limdum
3 Limnodynastes peronii Striped Marsh Frog limper
4 Limnodynastes tasmaniensis Spotted Grass Frog limtas
5 Litoria peronii Emerald Spotted Frog lper
6 Litoria verreauxii Alpine Tree Frog lver
7 Uperoleia laevigata Smooth Toadlet ulae
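Because these abbreviations are our key into the observations table, it’s worth confirming that every abbreviation really does appear as a column there. This is a quick check in plain pandas (not a galaxias function):
>>> # every abbreviation should match a column in observations;
>>> # an empty set means the two tables line up
>>> set(species['abbreviation']) - set(observations.columns)
set()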
Preparing data for both events.csv and occurrences.csv#
For events.csv, we want the following types of data to be represented:

- Geographic location of the site (latitude/longitude)
- The coordinate reference system of the device used to measure that location
- The uncertainty of the location measurement
- The date of the event
- A unique identifier for each event

For occurrences.csv, we want the following types of data to be represented:

- Scientific name (and common name, if available)
- How the record was observed
- Presence/absence
- An identifier linking each observation to an event
- A unique identifier for each observation
Figuring out how to link all this information can be difficult! Fortunately, the Darwin Core standard provides a field that links events.csv and occurrences.csv: eventID. This unique identifier denotes each unique event in events.csv, and links the sightings in occurrences.csv to specific events in your study.
Preparing eventID to link events.csv and occurrences.csv#
As mentioned above, the eventID field will be the common key that links your events.csv and occurrences.csv files at the end of the workflow. For our example, we will first do some preparation and merging of the three dataframes above, so that all the information we need to link everything together sits in one table.
Looking at the tables, we can see that site_code links the sites and observations dataframes together, and that the values in the abbreviation column of the species dataframe are column names in the observations table.
Now that we have two ways to link our tables, we know we want the site_code and year columns represented in our composite table, as well as the columns with species’ name abbreviations.
We will also create a composite eventID from a sequential number, the site_code and
the year in which the site was visited.
>>> import galaxias
>>>
>>> # get list of all species abbreviations
>>> species_abb = list(species['abbreviation'])
>>>
>>> # select the site_code, year and species columns from observations
>>> observations_site = observations[['site_code', 'year'] + species_abb]
>>>
>>> # add a composite eventID built from a sequential number, site_code and year
>>> observations_site_id = galaxias.set_events(dataframe=observations_site,
...                                            eventID=['sequential', 'site_code', 'year'])
>>> observations_site_id
eventID site_code year cpar csig limdum limper limtas lper lver ulae
0 0-AMA100-2004 AMA100 2004 1 0 0 0 1 1 0 0
1 1-AMA100-2007 AMA100 2007 1 0 1 0 1 0 0 0
2 2-AMA100-2007 AMA100 2007 1 0 1 0 1 0 0 0
3 3-AMA100-2005 AMA100 2005 1 1 1 0 1 0 0 0
4 4-AMA100-2008 AMA100 2008 1 0 1 0 0 1 0 0
.. ... ... ... ... ... ... ... ... ... ... ...
118 118-ARA100-2009 ARA100 2009 1 0 0 0 1 0 0 0
119 119-ARA100-2010 ARA100 2010 1 0 0 0 1 1 0 0
120 120-ARA100-2008 ARA100 2008 0 0 0 0 0 0 0 0
121 121-ARA100-2010 ARA100 2010 1 0 0 0 1 0 0 0
122 122-ARA100-2011 ARA100 2011 1 0 0 0 0 1 0 0
[123 rows x 11 columns]
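For intuition, the composite identifier that set_events() produced here amounts to concatenating a sequential row number with site_code and year. A plain-pandas sketch of the same construction (an illustration only, not how galaxias implements it):
>>> # rebuild the composite ID by hand to see how its parts fit together
>>> seq = pd.Series(observations_site.index, index=observations_site.index).astype(str)
>>> ids = seq + '-' + observations_site['site_code'] + '-' + observations_site['year'].astype(str)
>>> ids.head(3).tolist()
['0-AMA100-2004', '1-AMA100-2007', '2-AMA100-2007']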
Preparing events.csv#
In your events.csv, you will be capturing the “where” and “when” of the data, as outlined
in the “Preparing data for both events.csv and occurrences.csv” section. We already
have the unique identifier of the event, as well as a date for the event (the year column).
All that is left to do is to include the geographic location of these sites, and then write the
events.csv file to your disk.
Note
We are including the geographic location in events.csv because the coordinates for each observation are tied to the location of the site in question. If your coordinates instead describe individual sightings rather than sites, they are more appropriately placed in the occurrences.csv file.
>>> observations_site_id_coords = observations_site_id.merge(sites[['site_code', 'latitude', 'longitude']],
...                                                          on='site_code', how='left')
>>> events_all = galaxias.set_coordinates(dataframe=observations_site_id_coords,
...                                       decimalLatitude='latitude',
...                                       decimalLongitude='longitude',
...                                       coordinateUncertaintyInMeters=30,
...                                       geodeticDatum='WGS84')
>>> events_all
eventID site_code year cpar csig limdum limper limtas lper lver ulae decimalLatitude decimalLongitude geodeticDatum coordinateUncertaintyInMeters
0 0-AMA100-2004 AMA100 2004 1 0 0 0 1 1 0 0 -35.168178 149.119212 WGS84 30
1 1-AMA100-2007 AMA100 2007 1 0 1 0 1 0 0 0 -35.168178 149.119212 WGS84 30
2 2-AMA100-2007 AMA100 2007 1 0 1 0 1 0 0 0 -35.168178 149.119212 WGS84 30
3 3-AMA100-2005 AMA100 2005 1 1 1 0 1 0 0 0 -35.168178 149.119212 WGS84 30
4 4-AMA100-2008 AMA100 2008 1 0 1 0 0 1 0 0 -35.168178 149.119212 WGS84 30
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
118 118-ARA100-2009 ARA100 2009 1 0 0 0 1 0 0 0 -35.276400 149.077900 WGS84 30
119 119-ARA100-2010 ARA100 2010 1 0 0 0 1 1 0 0 -35.276400 149.077900 WGS84 30
120 120-ARA100-2008 ARA100 2008 0 0 0 0 0 0 0 0 -35.276400 149.077900 WGS84 30
121 121-ARA100-2010 ARA100 2010 1 0 0 0 1 0 0 0 -35.276400 149.077900 WGS84 30
122 122-ARA100-2011 ARA100 2011 1 0 0 0 0 1 0 0 -35.276400 149.077900 WGS84 30
[123 rows x 15 columns]
Here, you can do one of two things:

- Select only the columns that are currently Darwin Core compliant
- Use optional functions to ensure other parts of your data are Darwin Core compliant, and include those in your final dataset

To see which Darwin Core terms galaxias checks, consult the list below.
Supported Darwin Core Terms and Their Associated Functions

| Darwin Core Term | Function |
|---|---|
| basisOfRecord | set_occurrences() |
| occurrenceID | set_occurrences() |
| scientificName | set_scientific_name() |
| decimalLatitude | set_coordinates() |
| decimalLongitude | set_coordinates() |
| geodeticDatum | set_coordinates() |
| coordinateUncertaintyInMeters | set_coordinates() |
| eventDate | set_datetime() |
| kingdom | set_taxonomy() |
| phylum | set_taxonomy() |
| class | set_taxonomy() |
| order | set_taxonomy() |
| family | set_taxonomy() |
| genus | set_taxonomy() |
| specificEpithet | set_taxonomy() |
| vernacularName | set_taxonomy() |
| taxonRank | set_scientific_name() |
| scientificNameAuthorship | set_scientific_name() |
| recordedBy | set_observer() |
| recordedByID | set_observer() |
| measurementID | set_measurements() |
| measurementType | set_measurements() |
| measurementValue | set_measurements() |
| measurementUnit | set_measurements() |
| continent | set_locality() |
| country | set_locality() |
| countryCode | set_locality() |
| stateProvince | set_locality() |
| locality | set_locality() |
| license | set_license() |
| rightsHolder | set_license() |
| accessRights | set_license() |
| sex | set_individual_traits() |
| lifeStage | set_individual_traits() |
| reproductiveCondition | set_individual_traits() |
| vitality | set_individual_traits() |
| individualID | set_individual_traits() |
| eventID | set_events() |
| parentEventID | set_events() |
| eventType | set_events() |
| eventTime | set_datetime() |
| year | set_datetime() |
| month | set_datetime() |
| day | set_datetime() |
| coordinatePrecision | set_coordinates() |
| datasetID | set_collection() |
| datasetName | set_collection() |
| catalogNumber | set_collection() |
| individualCount | set_abundance() |
| organismQuantity | set_abundance() |
| organismQuantityType | set_abundance() |
To select only the columns that are Darwin Core compliant, run the following snippet of code:
>>> event_terms = list(galaxias.event_terms())
>>> event_terms_dwca = list(set(event_terms).intersection(list(events_all.columns)))
>>> events = events_all[event_terms_dwca]
>>> events
coordinateUncertaintyInMeters year decimalLatitude geodeticDatum decimalLongitude eventID
0 30 2004 -35.168178 WGS84 149.119212 0-AMA100-2004
1 30 2007 -35.168178 WGS84 149.119212 1-AMA100-2007
2 30 2007 -35.168178 WGS84 149.119212 2-AMA100-2007
3 30 2005 -35.168178 WGS84 149.119212 3-AMA100-2005
4 30 2008 -35.168178 WGS84 149.119212 4-AMA100-2008
.. ... ... ... ... ... ...
118 30 2009 -35.276400 WGS84 149.077900 118-ARA100-2009
119 30 2010 -35.276400 WGS84 149.077900 119-ARA100-2010
120 30 2008 -35.276400 WGS84 149.077900 120-ARA100-2008
121 30 2010 -35.276400 WGS84 149.077900 121-ARA100-2010
122 30 2011 -35.276400 WGS84 149.077900 122-ARA100-2011
[123 rows x 6 columns]
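Because the sequential component makes every identifier distinct (even when a site was surveyed twice in the same year, as AMA100 was in 2007), eventID should now uniquely identify each survey. A quick plain-pandas check:
>>> # True means every eventID is distinct
>>> events['eventID'].is_unique
True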
Preparing occurrences.csv#
Let’s return to the observations_site_id variable, which contains the eventID field
that will link our events.csv and occurrences.csv. To format these occurrences, we
need the species abbreviations to be values in a single column, rather than column
names. But how do we do this while keeping all the information we need for linking
the files?

Luckily, we can use the melt method available on every pandas dataframe. We will specify
the eventID column as our id_vars, as this is the key we want associated with each observation.
We will set var_name to abbreviation: this new column will hold the former column names
(the abbreviations for each species). Finally, we will set value_name to status, as this
column will contain all of our presence/absence data.
>>> observations_site_id_select = observations_site_id[['eventID'] + species_abb]
>>> observations_site_id_abbr = observations_site_id_select.melt(id_vars=['eventID'],
... var_name='abbreviation',
... value_name='status')
>>> observations_site_id_abbr
eventID abbreviation status
0 0-AMA100-2004 cpar 1
1 1-AMA100-2007 cpar 1
2 2-AMA100-2007 cpar 1
3 3-AMA100-2005 cpar 1
4 4-AMA100-2008 cpar 1
.. ... ... ...
979 118-ARA100-2009 ulae 0
980 119-ARA100-2010 ulae 0
981 120-ARA100-2008 ulae 0
982 121-ARA100-2010 ulae 0
983 122-ARA100-2011 ulae 0
[984 rows x 3 columns]
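As a quick sanity check (plain pandas), melting should have produced exactly one row per event-species pair: 123 events × 8 species = 984 rows, which matches the output above.
>>> # each eventID should appear once per species, i.e. 8 times
>>> observations_site_id_abbr.groupby('eventID').size().unique()
array([8])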
Now we’ll attach the full species names to our frog observations by joining species with observations_site_id_abbr.
>>> observations_site_id_spec = observations_site_id_abbr.merge(species, on='abbreviation', how='left')
>>> observations_site_id_spec
eventID abbreviation status scientific_name common_name
0 0-AMA100-2004 cpar 1 Crinia parinsignifera Plains Froglet
1 1-AMA100-2007 cpar 1 Crinia parinsignifera Plains Froglet
2 2-AMA100-2007 cpar 1 Crinia parinsignifera Plains Froglet
3 3-AMA100-2005 cpar 1 Crinia parinsignifera Plains Froglet
4 4-AMA100-2008 cpar 1 Crinia parinsignifera Plains Froglet
.. ... ... ... ... ...
979 118-ARA100-2009 ulae 0 Uperoleia laevigata Smooth Toadlet
980 119-ARA100-2010 ulae 0 Uperoleia laevigata Smooth Toadlet
981 120-ARA100-2008 ulae 0 Uperoleia laevigata Smooth Toadlet
982 121-ARA100-2010 ulae 0 Uperoleia laevigata Smooth Toadlet
983 122-ARA100-2011 ulae 0 Uperoleia laevigata Smooth Toadlet
[984 rows x 5 columns]
Now we can rename our columns to valid Darwin Core terms using set_ functions.
Importantly, the Darwin Core Standard requires that we add a unique occurrenceID
and record the type of observation in the basisOfRecord column.
>>> # first, change the 1's and 0's to PRESENT and ABSENT
>>> observations_site_id_spec['status'] = observations_site_id_spec['status'].map({1: 'PRESENT', 0: 'ABSENT'})
>>>
>>> # now, we will reformat the data to use valid Darwin Core columns
>>> obs = galaxias.set_occurrences(occurrences=observations_site_id_spec,
...                                occurrenceID=['eventID', 'sequential'],
...                                basisOfRecord='HumanObservation',
...                                occurrenceStatus='status')
>>> obs_name = galaxias.set_scientific_name(dataframe=obs, scientificName='scientific_name')
>>> obs_dwc = galaxias.set_taxonomy(dataframe=obs_name, vernacularName='common_name')
>>> obs_dwc
occurrenceID eventID abbreviation occurrenceStatus scientificName vernacularName basisOfRecord
0 0-AMA100-2004-0 0-AMA100-2004 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
1 1-AMA100-2007-1 1-AMA100-2007 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
2 2-AMA100-2007-2 2-AMA100-2007 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
3 3-AMA100-2005-3 3-AMA100-2005 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
4 4-AMA100-2008-4 4-AMA100-2008 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
.. ... ... ... ... ... ... ...
979 118-ARA100-2009-979 118-ARA100-2009 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
980 119-ARA100-2010-980 119-ARA100-2010 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
981 120-ARA100-2008-981 120-ARA100-2008 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
982 121-ARA100-2010-982 121-ARA100-2010 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
983 122-ARA100-2011-983 122-ARA100-2011 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
[984 rows x 7 columns]
We now have a dataframe with observations organised at the occurrence level. Our
final step is to reduce obs_dwc to include only columns whose names are valid in
Occurrence-based datasets. This drops the abbreviation column from our dataframe.
>>> occ_terms = list(galaxias.occurrence_terms())
>>> occ_terms_dwca = list(set(occ_terms).intersection(list(obs_dwc.columns)))
>>> occurrences = obs_dwc[occ_terms_dwca]
>>> occurrences
occurrenceStatus basisOfRecord scientificName eventID occurrenceID vernacularName
0 PRESENT HumanObservation Crinia parinsignifera 0-AMA100-2004 0-AMA100-2004-0 Plains Froglet
1 PRESENT HumanObservation Crinia parinsignifera 1-AMA100-2007 1-AMA100-2007-1 Plains Froglet
2 PRESENT HumanObservation Crinia parinsignifera 2-AMA100-2007 2-AMA100-2007-2 Plains Froglet
3 PRESENT HumanObservation Crinia parinsignifera 3-AMA100-2005 3-AMA100-2005-3 Plains Froglet
4 PRESENT HumanObservation Crinia parinsignifera 4-AMA100-2008 4-AMA100-2008-4 Plains Froglet
.. ... ... ... ... ... ...
979 ABSENT HumanObservation Uperoleia laevigata 118-ARA100-2009 118-ARA100-2009-979 Smooth Toadlet
980 ABSENT HumanObservation Uperoleia laevigata 119-ARA100-2010 119-ARA100-2010-980 Smooth Toadlet
981 ABSENT HumanObservation Uperoleia laevigata 120-ARA100-2008 120-ARA100-2008-981 Smooth Toadlet
982 ABSENT HumanObservation Uperoleia laevigata 121-ARA100-2010 121-ARA100-2010-982 Smooth Toadlet
983 ABSENT HumanObservation Uperoleia laevigata 122-ARA100-2011 122-ARA100-2011-983 Smooth Toadlet
[984 rows x 6 columns]
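Before writing anything to disk, it’s worth confirming that every occurrence points at an event that actually exists in events. A plain-pandas referential-integrity check:
>>> # every eventID in occurrences should resolve to a row in events
>>> bool(occurrences['eventID'].isin(events['eventID']).all())
True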
We can specify that we wish to use occurrences and events in our Darwin Core Archive
with use_data(), which will save both occurrences and events as individual csv
files in the default directory data-publish. The names of these files are, by
default, occurrences.csv and events.csv.
>>> galaxias.use_data(occurrences=occurrences,events=events)
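As a final check, the two tables can be rejoined on eventID, just as a downstream user of the archive would do (plain pandas, using the in-memory dataframes):
>>> # 984 occurrence rows, with 6 + 6 columns sharing the eventID key
>>> occurrences.merge(events, on='eventID', how='left').shape
(984, 11)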
In data terms, that’s it! Don’t forget to add metadata. An explanation of how to add metadata is here.
Summary#
The hierarchical structure of Event-based data (i.e. Site -> Sample -> Occurrence) adds richness, allowing information like repeated sampling and presence/absence to be preserved. This richness can enable more nuanced probabilistic analyses like species distribution models and occupancy models. We encourage users with Event-based data to use galaxias to standardise their data for publication and sharing.