Standardising an Events Dataset#

In a research project, data collection can take place at multiple locations and times. At each location and time, there are often multiple samples collected to capture variation across a study area or time period. In Darwin Core, data collected from this type of project are Event-based.

Events are any action that “occurs at some location during some time” (from TDWG). Each sample, for example, is a unique Event, with its own environmental attributes (like topography, tree cover and soil composition) that affect which organisms occur there and how likely they are to be detected. Observations of organisms take place within each Event. As such, Events add hierarchy to a dataset by grouping simultaneous observations together, whereas Occurrence-only data are processed as if all occurrences were independent. Event-based data collection adds richness to ecological data that can be useful for more advanced modelling techniques.
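For example, a single five-minute survey might be one Event, with each species detection (or non-detection) during that survey recorded as an Occurrence linked to it. A toy sketch of this structure (invented values, using Darwin Core fields that appear later in this vignette):

>>> # Toy illustration: one Event groups several simultaneous Occurrences,
>>> # all linked by a shared eventID.
>>> event = {'eventID': 'E1', 'year': 2004,
...          'decimalLatitude': -35.17, 'decimalLongitude': 149.12}
>>> occurrences_in_event = [
...     {'occurrenceID': 'E1-1', 'eventID': 'E1',
...      'scientificName': 'Crinia signifera', 'occurrenceStatus': 'PRESENT'},
...     {'occurrenceID': 'E1-2', 'eventID': 'E1',
...      'scientificName': 'Litoria peronii', 'occurrenceStatus': 'ABSENT'},
... ]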

Here we will demonstrate how to convert Event-based data to the Darwin Core standard. To do so, we will create two csv files, events.csv and occurrences.csv, to build a Darwin Core Archive.

The dataset#


For this example, we’ll use a dataset of frog observations from a 2015 paper in PLOS ONE. Data were collected by volunteers using 5-minute audio surveys; each row documents whether each frog species was detected during one 5-minute recording, recorded as present (1) or absent (0). For the purpose of this vignette, we have downloaded the source data from Dryad, reduced the number of rows, and converted the original Excel spreadsheet to three .csv files: sites.csv, observations.csv and species_list.csv.

Sites#

The sites spreadsheet contains columns that describe each survey location (e.g. depth, water_type, latitude, longitude) and the overall presence/absence of each frog species at a site (e.g. cpar, csig, limdum). We won’t use the aggregated species data stored here (we’ll export the raw observations instead), but we still need to import this file because it’s the only place where spatial information is stored.

>>> import pandas as pd
>>> sites = pd.read_csv('sites.csv')
>>> sites
  site_code  log_size    depth water_type   latitude   longitude  cpar  csig  limdum  limper  limtas  lper  lver  ulae  richness
0    AMA100  2.914535     deep     moving -35.168178  149.119212     1     1       1       0       1     1     1     0         6
1    AMH100  0.554287  shallow     moving -35.235300  149.045351     0     0       0       1       1     0     0     0         2
2    ANB100  2.012837     deep     moving -35.278130  149.110090     0     1       1       1       0     0     0     0         3
3    ANU004  1.753792     deep     moving -35.280547  149.111692     0     1       1       0       1     0     0     0         3
4    ANU018  2.377488     deep      still -35.281900  149.112100     0     1       0       1       0     0     0     0         2
5    ANU020  2.155047     deep      still -35.279600  149.116900     1     1       1       1       1     0     0     0         5
6    ANU022  0.957248  shallow     moving -35.278586  149.120161     1     1       0       1       1     0     0     1         5
7    ARA100  2.301030     deep      still -35.276400  149.077900     1     1       0       0       1     1     0     0         4

Observations#

The observations spreadsheet contains columns that describe each sample’s physical properties (e.g. water_type, veg_canopy), linked to sites by the site_code column. More importantly, it records whether each species in the region was detected during that particular survey (e.g. cpar, csig, limdum).

>>> import pandas as pd
>>> observations = pd.read_csv('observations.csv')
>>> observations
    site_code  year depth water_type  log_size  veg_aqu  veg_canopy  veg_bank  pc_urban  pc_canopy  cpar  csig  limdum  limper  limtas  lper  lver  ulae
0      AMA100  2004  deep     moving  2.914535        1           0         1    0.2260     0.0427     1     0       0       0       1     1     0     0
1      AMA100  2007  deep     moving  2.914535        1           0         1    0.2067     0.0987     1     0       1       0       1     0     0     0
2      AMA100  2007  deep     moving  2.914535        1           0         1    0.2067     0.0987     1     0       1       0       1     0     0     0
3      AMA100  2005  deep     moving  2.914535        1           0         1    0.2813     0.0567     1     1       1       0       1     0     0     0
4      AMA100  2008  deep     moving  2.914535        1           0         1    0.2067     0.0607     1     0       1       0       0     1     0     0
..        ...   ...   ...        ...       ...      ...         ...       ...       ...        ...   ...   ...     ...     ...     ...   ...   ...   ...
118    ARA100  2009  deep      still  2.301030        1           0         1    0.0779     0.0329     1     0       0       0       1     0     0     0
119    ARA100  2010  deep      still  2.301030        1           0         1    0.0307     0.1450     1     0       0       0       1     1     0     0
120    ARA100  2008  deep      still  2.301030        1           0         1    0.1593     0.0307     0     0       0       0       0     0     0     0
121    ARA100  2010  deep      still  2.301030        1           0         1    0.0307     0.1450     1     0       0       0       1     0     0     0
122    ARA100  2011  deep      still  2.301030        1           0         1    0.1171     0.0607     1     0       0       0       0     1     0     0

[123 rows x 18 columns]

Species list#

Finally, the species list spreadsheet lists the eight frog species recorded in this dataset, and the abbreviation column contains the abbreviated column name used in the observations dataset.

>>> import pandas as pd
>>> species = pd.read_csv('species_list.csv')
>>> species
              scientific_name             common_name abbreviation
0       Crinia parinsignifera          Plains Froglet         cpar
1            Crinia signifera  Common Eastern Froglet         csig
2     Limnodynastes dumerilii              Pobblebonk       limdum
3       Limnodynastes peronii      Striped Marsh Frog       limper
4  Limnodynastes tasmaniensis      Spotted Grass Frog       limtas
5             Litoria peronii    Emerald Spotted Frog         lper
6          Litoria verreauxii        Alpine Tree Frog         lver
7         Uperoleia laevigata          Smooth Toadlet         ulae

Preparing data for both events.csv and occurrences.csv#

For events.csv, we want the following types of data to be represented:

  • Geographic location of site (latitude/longitude)

  • The coordinate reference system of the device used to measure geographic location

  • Uncertainty of the measurement of your location

  • Date of the event

  • A unique identifier for each event

For occurrences.csv, we want the following types of data to be represented:

  • Scientific name (common name as well if available)

  • How the record was observed

  • Presence/absence

  • Identifier linking this observation to an event

  • A unique identifier for each observation

Figuring out how to link all of this information can be difficult! Fortunately, the Darwin Core standard provides a field that links events.csv and occurrences.csv: eventID. This unique identifier denotes each unique event in events.csv, and links each sighting in occurrences.csv to a specific event in your study. In this vignette, for example, each eventID takes the form <row number>-<site_code>-<year> (e.g. 0-AMA100-2004), and every occurrence from that event carries the same string.

Preparing events.csv#

In your events.csv, you will capture the “where” and “when” of the data, as outlined in the “Preparing data for both events.csv and occurrences.csv” section. We already have a unique identifier for each event, as well as a date for the event (the year column). All that is left is to include the geographic location of these sites, and then write the events.csv file to disk.

Note

We are including the geographic location in the events.csv, as the coordinates for each observation are tied to the location of the site in question. If you have sites, but you instead want to include coordinates for individual sightings, then coordinates are more appropriately placed in the occurrences.csv file.
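Note that the code below assumes an observations_site_id dataframe that already carries a unique eventID column (visible in the output as <row number>-<site_code>-<year>). That step isn’t shown above; a minimal pandas sketch of it, under that assumed ID format, is:

>>> # Assumed earlier step (not shown above): keep the linking and species
>>> # columns, and build an eventID of the form "<row number>-<site_code>-<year>"
>>> # (e.g. "0-AMA100-2004"). galaxias' set_events() can also generate eventIDs;
>>> # this is a plain-pandas sketch for illustration.
>>> observations_site_id = observations[['site_code', 'year'] + list(species['abbreviation'])].copy()
>>> observations_site_id.insert(0, 'eventID',
...     [f'{i}-{sc}-{yr}' for i, (sc, yr)
...      in enumerate(zip(observations['site_code'], observations['year']))])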

>>> observations_site_id_coords = observations_site_id.merge(sites[['site_code','latitude','longitude']],on='site_code',how='left')
>>> events_all = galaxias.set_coordinates(dataframe=observations_site_id_coords,
...                                       decimalLatitude='latitude',
...                                       decimalLongitude='longitude',
...                                       coordinateUncertaintyInMeters = 30,
...                                       geodeticDatum = 'WGS84')
>>> events_all
             eventID site_code  year  cpar  csig  limdum  limper  limtas  lper  lver  ulae  decimalLatitude  decimalLongitude geodeticDatum  coordinateUncertaintyInMeters
0      0-AMA100-2004    AMA100  2004     1     0       0       0       1     1     0     0       -35.168178        149.119212         WGS84                             30
1      1-AMA100-2007    AMA100  2007     1     0       1       0       1     0     0     0       -35.168178        149.119212         WGS84                             30
2      2-AMA100-2007    AMA100  2007     1     0       1       0       1     0     0     0       -35.168178        149.119212         WGS84                             30
3      3-AMA100-2005    AMA100  2005     1     1       1       0       1     0     0     0       -35.168178        149.119212         WGS84                             30
4      4-AMA100-2008    AMA100  2008     1     0       1       0       0     1     0     0       -35.168178        149.119212         WGS84                             30
..               ...       ...   ...   ...   ...     ...     ...     ...   ...   ...   ...              ...               ...           ...                            ...
118  118-ARA100-2009    ARA100  2009     1     0       0       0       1     0     0     0       -35.276400        149.077900         WGS84                             30
119  119-ARA100-2010    ARA100  2010     1     0       0       0       1     1     0     0       -35.276400        149.077900         WGS84                             30
120  120-ARA100-2008    ARA100  2008     0     0       0       0       0     0     0     0       -35.276400        149.077900         WGS84                             30
121  121-ARA100-2010    ARA100  2010     1     0       0       0       1     0     0     0       -35.276400        149.077900         WGS84                             30
122  122-ARA100-2011    ARA100  2011     1     0       0       0       0     1     0     0       -35.276400        149.077900         WGS84                             30

[123 rows x 15 columns]

Here, you can do one of two things:

  1. Select only the columns that are already Darwin Core compliant

  2. Use optional functions to make other parts of your data Darwin Core compliant, and include those in your final dataset.

To see which Darwin Core terms are checked by galaxias, consult the table below.

Supported Darwin Core Terms and Their Associated Functions

Darwin Core Term                 set Function
basisOfRecord                    set_occurrences()
occurrenceID                     set_occurrences()
scientificName                   set_scientific_name()
decimalLatitude                  set_coordinates()
decimalLongitude                 set_coordinates()
geodeticDatum                    set_coordinates()
coordinateUncertaintyInMeters    set_coordinates()
eventDate                        set_datetime()
kingdom                          set_taxonomy()
phylum                           set_taxonomy()
class                            set_taxonomy()
order                            set_taxonomy()
family                           set_taxonomy()
genus                            set_taxonomy()
specificEpithet                  set_taxonomy()
vernacularName                   set_taxonomy()
taxonRank                        set_scientific_name()
scientificNameAuthorship         set_scientific_name()
recordedBy                       set_observer()
recordedByID                     set_observer()
measurementID                    set_measurements()
measurementType                  set_measurements()
measurementValue                 set_measurements()
measurementUnit                  set_measurements()
continent                        set_locality()
country                          set_locality()
countryCode                      set_locality()
stateProvince                    set_locality()
locality                         set_locality()
license                          set_license()
rightsHolder                     set_license()
accessRights                     set_license()
sex                              set_individual_traits()
lifeStage                        set_individual_traits()
reproductiveCondition            set_individual_traits()
vitality                         set_individual_traits()
individualID                     set_individual_traits()
eventID                          set_events()
parentEventID                    set_events()
eventType                        set_events()
eventTime                        set_datetime()
year                             set_datetime()
month                            set_datetime()
day                              set_datetime()
coordinatePrecision              set_coordinates()
datasetID                        set_collection()
datasetName                      set_collection()
catalogNumber                    set_collection()
individualCount                  set_abundance()
organismQuantity                 set_abundance()
organismQuantityType             set_abundance()

To select only the columns that are Darwin Core compliant (option 1), run the following snippet of code:

>>> event_terms = list(galaxias.event_terms())
>>> event_terms_dwca = list(set(event_terms).intersection(list(events_all.columns)))
>>> events = events_all[event_terms_dwca]
>>> events
     coordinateUncertaintyInMeters  year  decimalLatitude geodeticDatum  decimalLongitude          eventID
0                               30  2004       -35.168178         WGS84        149.119212    0-AMA100-2004
1                               30  2007       -35.168178         WGS84        149.119212    1-AMA100-2007
2                               30  2007       -35.168178         WGS84        149.119212    2-AMA100-2007
3                               30  2005       -35.168178         WGS84        149.119212    3-AMA100-2005
4                               30  2008       -35.168178         WGS84        149.119212    4-AMA100-2008
..                             ...   ...              ...           ...               ...              ...
118                             30  2009       -35.276400         WGS84        149.077900  118-ARA100-2009
119                             30  2010       -35.276400         WGS84        149.077900  119-ARA100-2010
120                             30  2008       -35.276400         WGS84        149.077900  120-ARA100-2008
121                             30  2010       -35.276400         WGS84        149.077900  121-ARA100-2010
122                             30  2011       -35.276400         WGS84        149.077900  122-ARA100-2011

[123 rows x 6 columns]

Preparing occurrences.csv#

Let’s return to the observations_site_id variable, which contains the eventID field that will link our events.csv and occurrences.csv. When formatting these occurrences, we need the species abbreviations to appear as values under a single column, rather than as column names themselves. But how do we do this while keeping all the information we need for linking the files?

Luckily, we can use the melt() method available on every pandas dataframe. We will specify the eventID column as our id_vars, as this is the key we want associated with each observation. We will choose the name abbreviation for var_name; this is the new column that will hold the former column names (i.e. the abbreviation of each species). Finally, value_name will be named status, as this column will contain all of our presence/absence data.
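The species_abb variable used below holds the species abbreviation column names; if you haven’t defined it yet, a one-line sketch is:

>>> # The eight abbreviation column names (cpar, csig, ..., ulae),
>>> # taken from the species list.
>>> species_abb = list(species['abbreviation'])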

>>> observations_site_id_select = observations_site_id[['eventID'] + species_abb]
>>> observations_site_id_abbr = observations_site_id_select.melt(id_vars=['eventID'],
...                                                              var_name='abbreviation',
...                                                              value_name='status')
>>> observations_site_id_abbr
             eventID abbreviation  status
0      0-AMA100-2004         cpar       1
1      1-AMA100-2007         cpar       1
2      2-AMA100-2007         cpar       1
3      3-AMA100-2005         cpar       1
4      4-AMA100-2008         cpar       1
..               ...          ...     ...
979  118-ARA100-2009         ulae       0
980  119-ARA100-2010         ulae       0
981  120-ARA100-2008         ulae       0
982  121-ARA100-2010         ulae       0
983  122-ARA100-2011         ulae       0

[984 rows x 3 columns]
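As a quick sanity check, the melted dataframe should contain one row per event per species: 123 events x 8 species = 984 rows, matching the output above.

>>> # One row per (event, species) pair: 123 * 8 == 984.
>>> len(observations_site_id_abbr) == len(observations_site_id) * len(species_abb)
True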

Now we’ll attach the full names to our frog species by joining species with observations_site_id_abbr on the abbreviation column.

>>> observations_site_id_spec = observations_site_id_abbr.merge(species,on='abbreviation',how='left')
>>> observations_site_id_spec
             eventID abbreviation  status        scientific_name     common_name
0      0-AMA100-2004         cpar       1  Crinia parinsignifera  Plains Froglet
1      1-AMA100-2007         cpar       1  Crinia parinsignifera  Plains Froglet
2      2-AMA100-2007         cpar       1  Crinia parinsignifera  Plains Froglet
3      3-AMA100-2005         cpar       1  Crinia parinsignifera  Plains Froglet
4      4-AMA100-2008         cpar       1  Crinia parinsignifera  Plains Froglet
..               ...          ...     ...                    ...             ...
979  118-ARA100-2009         ulae       0    Uperoleia laevigata  Smooth Toadlet
980  119-ARA100-2010         ulae       0    Uperoleia laevigata  Smooth Toadlet
981  120-ARA100-2008         ulae       0    Uperoleia laevigata  Smooth Toadlet
982  121-ARA100-2010         ulae       0    Uperoleia laevigata  Smooth Toadlet
983  122-ARA100-2011         ulae       0    Uperoleia laevigata  Smooth Toadlet

[984 rows x 5 columns]

Now we can reformat our data to use valid Darwin Core column names using set_ functions. Importantly, the Darwin Core Standard requires that we add a unique occurrenceID and record the type of observation in the basisOfRecord column.

>>> # first, change the 1's and 0's to PRESENT and ABSENT
>>> observations_site_id_spec['status'] = observations_site_id_spec['status'].map({1: 'PRESENT', 0: 'ABSENT'})
>>>
>>> # now, we will reformat the data to use valid Darwin Core columns
>>> obs = galaxias.set_occurrences(occurrences=observations_site_id_spec,
...                                occurrenceID=['eventID','sequential'],
...                                basisOfRecord = 'HumanObservation',
...                                occurrenceStatus = 'status')
>>> obs_name = galaxias.set_scientific_name(dataframe=obs,scientificName='scientific_name')
>>> obs_dwc = galaxias.set_taxonomy(dataframe=obs_name,vernacularName='common_name')
>>> obs_dwc
            occurrenceID          eventID abbreviation occurrenceStatus         scientificName  vernacularName     basisOfRecord
0        0-AMA100-2004-0    0-AMA100-2004         cpar          PRESENT  Crinia parinsignifera  Plains Froglet  HumanObservation
1        1-AMA100-2007-1    1-AMA100-2007         cpar          PRESENT  Crinia parinsignifera  Plains Froglet  HumanObservation
2        2-AMA100-2007-2    2-AMA100-2007         cpar          PRESENT  Crinia parinsignifera  Plains Froglet  HumanObservation
3        3-AMA100-2005-3    3-AMA100-2005         cpar          PRESENT  Crinia parinsignifera  Plains Froglet  HumanObservation
4        4-AMA100-2008-4    4-AMA100-2008         cpar          PRESENT  Crinia parinsignifera  Plains Froglet  HumanObservation
..                   ...              ...          ...              ...                    ...             ...               ...
979  118-ARA100-2009-979  118-ARA100-2009         ulae           ABSENT    Uperoleia laevigata  Smooth Toadlet  HumanObservation
980  119-ARA100-2010-980  119-ARA100-2010         ulae           ABSENT    Uperoleia laevigata  Smooth Toadlet  HumanObservation
981  120-ARA100-2008-981  120-ARA100-2008         ulae           ABSENT    Uperoleia laevigata  Smooth Toadlet  HumanObservation
982  121-ARA100-2010-982  121-ARA100-2010         ulae           ABSENT    Uperoleia laevigata  Smooth Toadlet  HumanObservation
983  122-ARA100-2011-983  122-ARA100-2011         ulae           ABSENT    Uperoleia laevigata  Smooth Toadlet  HumanObservation

[984 rows x 7 columns]

We now have a dataframe with observations organised at the occurrence level. Our final step is to reduce obs_dwc to only include columns with valid names for Occurrence-based datasets. This drops the abbreviation column from our dataframe.

>>> occ_terms = list(galaxias.occurrence_terms())
>>> occ_terms_dwca = list(set(occ_terms).intersection(list(obs_dwc.columns)))
>>> occurrences = obs_dwc[occ_terms_dwca]
>>> occurrences
    occurrenceStatus     basisOfRecord         scientificName          eventID         occurrenceID  vernacularName
0            PRESENT  HumanObservation  Crinia parinsignifera    0-AMA100-2004      0-AMA100-2004-0  Plains Froglet
1            PRESENT  HumanObservation  Crinia parinsignifera    1-AMA100-2007      1-AMA100-2007-1  Plains Froglet
2            PRESENT  HumanObservation  Crinia parinsignifera    2-AMA100-2007      2-AMA100-2007-2  Plains Froglet
3            PRESENT  HumanObservation  Crinia parinsignifera    3-AMA100-2005      3-AMA100-2005-3  Plains Froglet
4            PRESENT  HumanObservation  Crinia parinsignifera    4-AMA100-2008      4-AMA100-2008-4  Plains Froglet
..               ...               ...                    ...              ...                  ...             ...
979           ABSENT  HumanObservation    Uperoleia laevigata  118-ARA100-2009  118-ARA100-2009-979  Smooth Toadlet
980           ABSENT  HumanObservation    Uperoleia laevigata  119-ARA100-2010  119-ARA100-2010-980  Smooth Toadlet
981           ABSENT  HumanObservation    Uperoleia laevigata  120-ARA100-2008  120-ARA100-2008-981  Smooth Toadlet
982           ABSENT  HumanObservation    Uperoleia laevigata  121-ARA100-2010  121-ARA100-2010-982  Smooth Toadlet
983           ABSENT  HumanObservation    Uperoleia laevigata  122-ARA100-2011  122-ARA100-2011-983  Smooth Toadlet

[984 rows x 6 columns]

We can specify that we wish to use occurrences and events in our Darwin Core Archive with use_data(), which will save both occurrences and events as individual csv files in the default directory data-publish. The names of these files are, by default, occurrences.csv and events.csv.

>>> galaxias.use_data(occurrences=occurrences,events=events)
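Before adding metadata, an optional sanity check (plain pandas, not a galaxias step) confirms that every occurrence links back to exactly one event:

>>> # Each eventID in occurrences should match exactly one row in events;
>>> # validate='many_to_one' raises if events contains duplicate eventIDs.
>>> check = occurrences.merge(events, on='eventID', how='left', validate='many_to_one')
>>> bool(check['decimalLatitude'].notna().all())
True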

In data terms, that’s it! Don’t forget to add metadata. An explanation of how to add metadata is here.

Summary#

The hierarchical structure of Event-based data (i.e. Site -> Sample -> Occurrence) adds richness, allowing information like repeated sampling and presence/absence to be preserved. This richness enables more nuanced probabilistic analyses such as species distribution models and occupancy models. We encourage users with Event-based data to use galaxias to standardise their data for publication and sharing.