Standardising an Events Dataset#
In a research project, data collection can take place at multiple locations and times. At each location and time, there are often multiple samples collected to capture variation across a study area or time period. In Darwin Core, data collected from this type of project are Event-based.
Events are any action that “occurs at some location during some time” (from TDWG). Each sample, for example, is a unique Event, with its own environmental attributes (like topography, tree cover and soil composition) that affect which organisms occur there and how likely they are to occur. Observations of organisms take place within each Event. As such, Events add hierarchy to a dataset by grouping simultaneous observations together, whereas Occurrence-only data are processed as if all occurrences were independent. Event-based data collection adds richness to ecological data that can be useful for more advanced modelling techniques.
Here we will demonstrate how to convert Event-based data to the Darwin Core
standard. To do so, we will create two csv files, events.csv and occurrences.csv,
to build a Darwin Core Archive.
The dataset#
For this example, we’ll use a dataset of frog observations from a
2015 paper in PLOS ONE.
Data were collected by volunteers using 5-minute audio surveys; each row documents
whether each frog species was detected during that recording, coded as present
(1) or absent (0). For the purpose of this vignette, we have downloaded the source data
from Dryad, reduced the number
of rows, and converted the original Excel spreadsheet to three .csv files:
sites.csv, observations.csv and species_list.csv.
Sites#
The sites spreadsheet contains columns that describe each survey location (e.g. depth,
water_type, latitude, longitude) and overall presence/absence of each frog
species in a site (e.g. cpar, csig, limdum). We won’t use the aggregated
species data stored here - we’ll instead export the raw observations - but we’ll still
import the data, because it’s the only place that spatial information is stored.
>>> import pandas as pd
>>> sites = pd.read_csv('sites.csv')
>>> sites
site_code log_size depth water_type latitude longitude cpar csig limdum limper limtas lper lver ulae richness
0 AMA100 2.914535 deep moving -35.168178 149.119212 1 1 1 0 1 1 1 0 6
1 AMH100 0.554287 shallow moving -35.235300 149.045351 0 0 0 1 1 0 0 0 2
2 ANB100 2.012837 deep moving -35.278130 149.110090 0 1 1 1 0 0 0 0 3
3 ANU004 1.753792 deep moving -35.280547 149.111692 0 1 1 0 1 0 0 0 3
4 ANU018 2.377488 deep still -35.281900 149.112100 0 1 0 1 0 0 0 0 2
5 ANU020 2.155047 deep still -35.279600 149.116900 1 1 1 1 1 0 0 0 5
6 ANU022 0.957248 shallow moving -35.278586 149.120161 1 1 0 1 1 0 0 1 5
7 ARA100 2.301030 deep still -35.276400 149.077900 1 1 0 0 1 1 0 0 4
Observations#
The observations spreadsheet contains columns that describe the sample’s physical
properties (e.g. water_type, veg_canopy), linked to sites by the site_code
column. More importantly, it documents whether each species in the region was recorded
during that particular survey (e.g. cpar, csig, limdum).
>>> import pandas as pd
>>> observations = pd.read_csv('observations.csv')
>>> observations
site_code year depth water_type log_size veg_aqu veg_canopy veg_bank pc_urban pc_canopy cpar csig limdum limper limtas lper lver ulae
0 AMA100 2004 deep moving 2.914535 1 0 1 0.2260 0.0427 1 0 0 0 1 1 0 0
1 AMA100 2007 deep moving 2.914535 1 0 1 0.2067 0.0987 1 0 1 0 1 0 0 0
2 AMA100 2007 deep moving 2.914535 1 0 1 0.2067 0.0987 1 0 1 0 1 0 0 0
3 AMA100 2005 deep moving 2.914535 1 0 1 0.2813 0.0567 1 1 1 0 1 0 0 0
4 AMA100 2008 deep moving 2.914535 1 0 1 0.2067 0.0607 1 0 1 0 0 1 0 0
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
118 ARA100 2009 deep still 2.301030 1 0 1 0.0779 0.0329 1 0 0 0 1 0 0 0
119 ARA100 2010 deep still 2.301030 1 0 1 0.0307 0.1450 1 0 0 0 1 1 0 0
120 ARA100 2008 deep still 2.301030 1 0 1 0.1593 0.0307 0 0 0 0 0 0 0 0
121 ARA100 2010 deep still 2.301030 1 0 1 0.0307 0.1450 1 0 0 0 1 0 0 0
122 ARA100 2011 deep still 2.301030 1 0 1 0.1171 0.0607 1 0 0 0 0 1 0 0
[123 rows x 18 columns]
Species list#
Finally, the species list spreadsheet lists the eight frog species recorded in this
dataset, and the abbreviation column contains the abbreviated column name used
in the observations dataset.
>>> import pandas as pd
>>> species = pd.read_csv('species_list.csv')
>>> species
scientific_name common_name abbreviation
0 Crinia parinsignifera Plains Froglet cpar
1 Crinia signifera Common Eastern Froglet csig
2 Limnodynastes dumerilii Pobblebonk limdum
3 Limnodynastes peronii Striped Marsh Frog limper
4 Limnodynastes tasmaniensis Spotted Grass Frog limtas
5 Litoria peronii Emerald Spotted Frog lper
6 Litoria verreauxii Alpine Tree Frog lver
7 Uperoleia laevigata Smooth Toadlet ulae
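Because these abbreviations are our key into the observations table, it’s worth confirming that every abbreviation really does appear as a column there. This is a quick check in plain pandas (not a galaxias function):
>>> # every abbreviation should match a column in observations;
>>> # an empty set means the two tables line up
>>> set(species['abbreviation']) - set(observations.columns)
set()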
Preparing data for both events.csv and occurrences.csv#
For events.csv, we want the following types of data to be represented:

- Geographic location of the site (latitude/longitude)
- The coordinate reference system of the device used to measure that location
- The uncertainty of the location measurement
- The date of the event
- A unique identifier for each event

For occurrences.csv, we want the following types of data to be represented:

- Scientific name (and common name, if available)
- How the record was observed
- Presence/absence
- An identifier linking each observation to an event
- A unique identifier for each observation
Figuring out how to link all this information can be difficult! Fortunately, the Darwin Core standard provides a field that links events.csv and occurrences.csv: eventID. This unique identifier denotes each unique event in events.csv, and links the sightings in occurrences.csv to specific events in your study.
Preparing eventID to link events.csv and occurrences.csv#
As mentioned above, the eventID field will be the common key that links your events.csv and occurrences.csv files at the end of the workflow. For our example, we will first do some preparation and merging of the three dataframes above, so that all the information we need to link everything together sits in one table.
Looking at the tables, we can see that site_code links the sites and observations dataframes together, and that the values in the abbreviation column of the species dataframe are column names in the observations table.
Now that we have two ways to link our tables, we know we want the site_code and year columns represented in our composite table, as well as the columns with species’ name abbreviations.
We will also create a composite eventID from a sequential number, the site_code and
the year in which the site was visited.
>>> import galaxias
>>>
>>> # get list of all species abbreviations
>>> species_abb = list(species['abbreviation'])
>>>
>>> # select the site_code, year and species columns from observations
>>> observations_site = observations[['site_code', 'year'] + species_abb]
>>>
>>> # add a composite eventID built from a sequential number, site_code and year
>>> observations_site_id = galaxias.set_events(dataframe=observations_site,
...                                            eventID=['sequential', 'site_code', 'year'])
>>> observations_site_id
eventID site_code year cpar csig limdum limper limtas lper lver ulae
0 0-AMA100-2004 AMA100 2004 1 0 0 0 1 1 0 0
1 1-AMA100-2007 AMA100 2007 1 0 1 0 1 0 0 0
2 2-AMA100-2007 AMA100 2007 1 0 1 0 1 0 0 0
3 3-AMA100-2005 AMA100 2005 1 1 1 0 1 0 0 0
4 4-AMA100-2008 AMA100 2008 1 0 1 0 0 1 0 0
.. ... ... ... ... ... ... ... ... ... ... ...
118 118-ARA100-2009 ARA100 2009 1 0 0 0 1 0 0 0
119 119-ARA100-2010 ARA100 2010 1 0 0 0 1 1 0 0
120 120-ARA100-2008 ARA100 2008 0 0 0 0 0 0 0 0
121 121-ARA100-2010 ARA100 2010 1 0 0 0 1 0 0 0
122 122-ARA100-2011 ARA100 2011 1 0 0 0 0 1 0 0
[123 rows x 11 columns]
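For intuition, the composite identifier that set_events() produced here amounts to concatenating a sequential row number with site_code and year. A plain-pandas sketch of the same construction (an illustration only, not how galaxias implements it):
>>> # rebuild the composite ID by hand to see how its parts fit together
>>> seq = pd.Series(observations_site.index, index=observations_site.index).astype(str)
>>> ids = seq + '-' + observations_site['site_code'] + '-' + observations_site['year'].astype(str)
>>> ids.head(3).tolist()
['0-AMA100-2004', '1-AMA100-2007', '2-AMA100-2007']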
Preparing events.csv#
In your events.csv, you will be capturing the “where” and “when” of the data, as outlined
in the “Preparing data for both events.csv and occurrences.csv” section. We already
have the unique identifier of the event, as well as a date for the event (the year column).
All that is left to do is to include the geographic location of these sites, and then write the
events.csv file to your disk.
Note
We are including the geographic location in events.csv because the coordinates for each observation are tied to the location of the site in question. If your coordinates instead describe individual sightings rather than sites, they are more appropriately placed in the occurrences.csv file.
>>> observations_site_id_coords = observations_site_id.merge(sites[['site_code', 'latitude', 'longitude']],
...                                                          on='site_code', how='left')
>>> events_all = galaxias.set_coordinates(dataframe=observations_site_id_coords,
...                                       decimalLatitude='latitude',
...                                       decimalLongitude='longitude',
...                                       coordinateUncertaintyInMeters=30,
...                                       geodeticDatum='WGS84')
>>> events_all
eventID site_code year cpar csig limdum limper limtas lper lver ulae decimalLatitude decimalLongitude geodeticDatum coordinateUncertaintyInMeters
0 0-AMA100-2004 AMA100 2004 1 0 0 0 1 1 0 0 -35.168178 149.119212 WGS84 30
1 1-AMA100-2007 AMA100 2007 1 0 1 0 1 0 0 0 -35.168178 149.119212 WGS84 30
2 2-AMA100-2007 AMA100 2007 1 0 1 0 1 0 0 0 -35.168178 149.119212 WGS84 30
3 3-AMA100-2005 AMA100 2005 1 1 1 0 1 0 0 0 -35.168178 149.119212 WGS84 30
4 4-AMA100-2008 AMA100 2008 1 0 1 0 0 1 0 0 -35.168178 149.119212 WGS84 30
.. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
118 118-ARA100-2009 ARA100 2009 1 0 0 0 1 0 0 0 -35.276400 149.077900 WGS84 30
119 119-ARA100-2010 ARA100 2010 1 0 0 0 1 1 0 0 -35.276400 149.077900 WGS84 30
120 120-ARA100-2008 ARA100 2008 0 0 0 0 0 0 0 0 -35.276400 149.077900 WGS84 30
121 121-ARA100-2010 ARA100 2010 1 0 0 0 1 0 0 0 -35.276400 149.077900 WGS84 30
122 122-ARA100-2011 ARA100 2011 1 0 0 0 0 1 0 0 -35.276400 149.077900 WGS84 30
[123 rows x 15 columns]
Here, you can do one of two things:

- Select only the columns that are currently Darwin Core compliant
- Use optional functions to ensure other parts of your data are Darwin Core compliant, and include those in your final dataset

To see which Darwin Core terms galaxias checks, consult the list below.
Supported Darwin Core Terms and Their Associated Functions

| Darwin Core Term | Function |
|---|---|
| basisOfRecord | set_occurrences() |
| occurrenceID | set_occurrences() |
| scientificName | set_scientific_name() |
| decimalLatitude | set_coordinates() |
| decimalLongitude | set_coordinates() |
| geodeticDatum | set_coordinates() |
| coordinateUncertaintyInMeters | set_coordinates() |
| eventDate | set_datetime() |
| kingdom | set_taxonomy() |
| phylum | set_taxonomy() |
| class | set_taxonomy() |
| order | set_taxonomy() |
| family | set_taxonomy() |
| genus | set_taxonomy() |
| specificEpithet | set_taxonomy() |
| vernacularName | set_taxonomy() |
| taxonRank | set_scientific_name() |
| scientificNameAuthorship | set_scientific_name() |
| recordedBy | set_observer() |
| recordedByID | set_observer() |
| measurementID | set_measurements() |
| measurementType | set_measurements() |
| measurementValue | set_measurements() |
| measurementUnit | set_measurements() |
| continent | set_locality() |
| country | set_locality() |
| countryCode | set_locality() |
| stateProvince | set_locality() |
| locality | set_locality() |
| license | set_license() |
| rightsHolder | set_license() |
| accessRights | set_license() |
| sex | set_individual_traits() |
| lifeStage | set_individual_traits() |
| reproductiveCondition | set_individual_traits() |
| vitality | set_individual_traits() |
| individualID | set_individual_traits() |
| eventID | set_events() |
| parentEventID | set_events() |
| eventType | set_events() |
| eventTime | set_datetime() |
| year | set_datetime() |
| month | set_datetime() |
| day | set_datetime() |
| coordinatePrecision | set_coordinates() |
| datasetID | set_collection() |
| datasetName | set_collection() |
| catalogNumber | set_collection() |
| individualCount | set_abundance() |
| organismQuantity | set_abundance() |
| organismQuantityType | set_abundance() |
To select only the columns that are Darwin Core compliant, run the following snippet of code:
>>> event_terms = list(galaxias.event_terms())
>>> event_terms_dwca = list(set(event_terms).intersection(list(events_all.columns)))
>>> events = events_all[event_terms_dwca]
>>> events
coordinateUncertaintyInMeters year decimalLatitude geodeticDatum decimalLongitude eventID
0 30 2004 -35.168178 WGS84 149.119212 0-AMA100-2004
1 30 2007 -35.168178 WGS84 149.119212 1-AMA100-2007
2 30 2007 -35.168178 WGS84 149.119212 2-AMA100-2007
3 30 2005 -35.168178 WGS84 149.119212 3-AMA100-2005
4 30 2008 -35.168178 WGS84 149.119212 4-AMA100-2008
.. ... ... ... ... ... ...
118 30 2009 -35.276400 WGS84 149.077900 118-ARA100-2009
119 30 2010 -35.276400 WGS84 149.077900 119-ARA100-2010
120 30 2008 -35.276400 WGS84 149.077900 120-ARA100-2008
121 30 2010 -35.276400 WGS84 149.077900 121-ARA100-2010
122 30 2011 -35.276400 WGS84 149.077900 122-ARA100-2011
[123 rows x 6 columns]
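Because the sequential component makes every identifier distinct (even when a site was surveyed twice in the same year, as AMA100 was in 2007), eventID should now uniquely identify each survey. A quick plain-pandas check:
>>> # True means every eventID is distinct
>>> events['eventID'].is_unique
True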
Preparing occurrences.csv#
Let’s return to the observations_site_id variable, which contains the eventID field
that will link our events.csv and occurrences.csv. To format these occurrences, we
need the species abbreviations to be values in a single column, rather than column
names. But how do we do this while keeping all the information we need for linking
the files?

Luckily, we can use the melt method available on every pandas dataframe. We will specify
the eventID column as our id_vars, as this is the key we want associated with each observation.
We will set var_name to abbreviation: this new column will hold the former column names
(the abbreviations for each species). Finally, we will set value_name to status, as this
column will contain all of our presence/absence data.
>>> observations_site_id_select = observations_site_id[['eventID'] + species_abb]
>>> observations_site_id_abbr = observations_site_id_select.melt(id_vars=['eventID'],
... var_name='abbreviation',
... value_name='status')
>>> observations_site_id_abbr
eventID abbreviation status
0 0-AMA100-2004 cpar 1
1 1-AMA100-2007 cpar 1
2 2-AMA100-2007 cpar 1
3 3-AMA100-2005 cpar 1
4 4-AMA100-2008 cpar 1
.. ... ... ...
979 118-ARA100-2009 ulae 0
980 119-ARA100-2010 ulae 0
981 120-ARA100-2008 ulae 0
982 121-ARA100-2010 ulae 0
983 122-ARA100-2011 ulae 0
[984 rows x 3 columns]
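As a quick sanity check (plain pandas), melting should have produced exactly one row per event-species pair: 123 events × 8 species = 984 rows, which matches the output above.
>>> # each eventID should appear once per species, i.e. 8 times
>>> observations_site_id_abbr.groupby('eventID').size().unique()
array([8])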
Now we’ll attach the full species names to our frog observations by joining species with observations_site_id_abbr.
>>> observations_site_id_spec = observations_site_id_abbr.merge(species, on='abbreviation', how='left')
>>> observations_site_id_spec
eventID abbreviation status scientific_name common_name
0 0-AMA100-2004 cpar 1 Crinia parinsignifera Plains Froglet
1 1-AMA100-2007 cpar 1 Crinia parinsignifera Plains Froglet
2 2-AMA100-2007 cpar 1 Crinia parinsignifera Plains Froglet
3 3-AMA100-2005 cpar 1 Crinia parinsignifera Plains Froglet
4 4-AMA100-2008 cpar 1 Crinia parinsignifera Plains Froglet
.. ... ... ... ... ...
979 118-ARA100-2009 ulae 0 Uperoleia laevigata Smooth Toadlet
980 119-ARA100-2010 ulae 0 Uperoleia laevigata Smooth Toadlet
981 120-ARA100-2008 ulae 0 Uperoleia laevigata Smooth Toadlet
982 121-ARA100-2010 ulae 0 Uperoleia laevigata Smooth Toadlet
983 122-ARA100-2011 ulae 0 Uperoleia laevigata Smooth Toadlet
[984 rows x 5 columns]
Now we can rename our columns to valid Darwin Core terms using set_ functions.
Importantly, the Darwin Core Standard requires that we add a unique occurrenceID
and record the type of observation in the basisOfRecord column.
>>> # first, change the 1's and 0's to PRESENT and ABSENT
>>> observations_site_id_spec['status'] = observations_site_id_spec['status'].map({1: 'PRESENT', 0: 'ABSENT'})
>>>
>>> # now, we will reformat the data to use valid Darwin Core columns
>>> obs = galaxias.set_occurrences(occurrences=observations_site_id_spec,
...                                occurrenceID=['eventID', 'sequential'],
...                                basisOfRecord='HumanObservation',
...                                occurrenceStatus='status')
>>> obs_name = galaxias.set_scientific_name(dataframe=obs, scientificName='scientific_name')
>>> obs_dwc = galaxias.set_taxonomy(dataframe=obs_name, vernacularName='common_name')
>>> obs_dwc
occurrenceID eventID abbreviation occurrenceStatus scientificName vernacularName basisOfRecord
0 0-AMA100-2004-0 0-AMA100-2004 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
1 1-AMA100-2007-1 1-AMA100-2007 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
2 2-AMA100-2007-2 2-AMA100-2007 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
3 3-AMA100-2005-3 3-AMA100-2005 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
4 4-AMA100-2008-4 4-AMA100-2008 cpar PRESENT Crinia parinsignifera Plains Froglet HumanObservation
.. ... ... ... ... ... ... ...
979 118-ARA100-2009-979 118-ARA100-2009 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
980 119-ARA100-2010-980 119-ARA100-2010 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
981 120-ARA100-2008-981 120-ARA100-2008 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
982 121-ARA100-2010-982 121-ARA100-2010 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
983 122-ARA100-2011-983 122-ARA100-2011 ulae ABSENT Uperoleia laevigata Smooth Toadlet HumanObservation
[984 rows x 7 columns]
We now have a dataframe with observations organised at the occurrence level. Our
final step is to reduce obs_dwc to include only columns whose names are valid in
Occurrence-based datasets. This drops the abbreviation column from our dataframe.
>>> occ_terms = list(galaxias.occurrence_terms())
>>> occ_terms_dwca = list(set(occ_terms).intersection(list(obs_dwc.columns)))
>>> occurrences = obs_dwc[occ_terms_dwca]
>>> occurrences
occurrenceStatus basisOfRecord scientificName eventID occurrenceID vernacularName
0 PRESENT HumanObservation Crinia parinsignifera 0-AMA100-2004 0-AMA100-2004-0 Plains Froglet
1 PRESENT HumanObservation Crinia parinsignifera 1-AMA100-2007 1-AMA100-2007-1 Plains Froglet
2 PRESENT HumanObservation Crinia parinsignifera 2-AMA100-2007 2-AMA100-2007-2 Plains Froglet
3 PRESENT HumanObservation Crinia parinsignifera 3-AMA100-2005 3-AMA100-2005-3 Plains Froglet
4 PRESENT HumanObservation Crinia parinsignifera 4-AMA100-2008 4-AMA100-2008-4 Plains Froglet
.. ... ... ... ... ... ...
979 ABSENT HumanObservation Uperoleia laevigata 118-ARA100-2009 118-ARA100-2009-979 Smooth Toadlet
980 ABSENT HumanObservation Uperoleia laevigata 119-ARA100-2010 119-ARA100-2010-980 Smooth Toadlet
981 ABSENT HumanObservation Uperoleia laevigata 120-ARA100-2008 120-ARA100-2008-981 Smooth Toadlet
982 ABSENT HumanObservation Uperoleia laevigata 121-ARA100-2010 121-ARA100-2010-982 Smooth Toadlet
983 ABSENT HumanObservation Uperoleia laevigata 122-ARA100-2011 122-ARA100-2011-983 Smooth Toadlet
[984 rows x 6 columns]
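Before writing anything to disk, it’s worth confirming that every occurrence points at an event that actually exists in events. A plain-pandas referential-integrity check:
>>> # every eventID in occurrences should resolve to a row in events
>>> bool(occurrences['eventID'].isin(events['eventID']).all())
True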
We can specify that we wish to use occurrences and events in our Darwin Core Archive
with use_data(), which will save both occurrences and events as individual csv
files in the default directory data-publish. The names of these files are, by
default, occurrences.csv and events.csv.
>>> galaxias.use_data(occurrences=occurrences,events=events)
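As a final check, the two tables can be rejoined on eventID, just as a downstream user of the archive would do (plain pandas, using the in-memory dataframes):
>>> # 984 occurrence rows, with 6 + 6 columns sharing the eventID key
>>> occurrences.merge(events, on='eventID', how='left').shape
(984, 11)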
In data terms, that’s it! Don’t forget to add metadata. An explanation of how to add metadata is here.
Summary#
The hierarchical structure of Event-based data (i.e. Site -> Sample -> Occurrence) adds richness, allowing information like repeated sampling and presence/absence to be preserved. This richness can enable more nuanced probabilistic analyses like species distribution models and occupancy models. We encourage users with Event-based data to use galaxias to standardise their data for publication and sharing.