
Quick start guide
Martin Westgate
2024-12-12
galaxias is an R package to help users build repositories that are optimised for storing, documenting and sharing biodiversity data. These repositories can be seamlessly converted into a Darwin Core Archive, the data standard used by the Global Biodiversity Information Facility (GBIF) and its partner nodes.
Start a new project
To start a new data project using galaxias, call:
library(galaxias)
galaxias_project("myprojectname")
If you are using RStudio, this should launch a new RStudio instance for that project, which will have the following structure:
├── README.md : Description of the repository
├── metadata.md : Boilerplate metadata statement for this project
├── myprojectname.Rproj : RStudio project file
├── data-raw : Folder to store source data
└── data : Folder to store processed data
It is a good idea to write some basic information in your README.md file first, as this provides guidance to your users as to what your repository contains, as well as what they are allowed to use it for.
Adding data to your project
We recommend that you first add your data to the data-raw folder, then use a script within that folder to manipulate it to Darwin Core format. Let's assume, for example, that your data looks like this:
library(tibble)

df <- tibble(
  latitude = c(-35.310, -35.273),
  longitude = c(149.125, 149.133),
  date = c("14-01-2023", "15-01-2023"),
  time = c("10:23", "11:25"),
  species = c("Callocephalon fimbriatum", "Eolophus roseicapilla"))
(Note that normally you'd import your data from an external file (e.g. using readr::read_csv()), but we've constructed one here for example purposes.)
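For instance, reading a source file stored in data-raw might look like the sketch below; the file name is hypothetical and is only there to illustrate the import step.

library(readr)

# hypothetical file name; replace with the source file you placed in data-raw
df <- read_csv("data-raw/observations.csv")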
We recommend using functions from the corella package for converting tibbles to Darwin Core. corella is automatically loaded when you load galaxias. A minimally complete set of formatted observations might look like this:
library(lubridate)

occurrences <- df |>
  # basic requirements of Darwin Core
  use_occurrences(occurrenceID = sequential_id(),
                  basisOfRecord = "humanObservation") |>
  # place and time
  use_coordinates(decimalLatitude = latitude,
                  decimalLongitude = longitude) |>
  use_locality(country = "Australia",
               locality = "Canberra") |>
  use_datetime(eventDate = lubridate::dmy(date),
               eventTime = lubridate::hm(time)) |>
  # taxonomy
  use_scientific_name(scientificName = species,
                      taxonRank = "species") |>
  use_taxonomy(kingdom = "Animalia",
               phylum = "Aves")
print(occurrences, n = 5)
## # A tibble: 2 × 12
##   basisOfRecord   occurrenceID decimalLatitude decimalLongitude country locality
##   <chr>           <chr>                  <dbl>            <dbl> <chr>   <chr>
## 1 humanObservati… 01                     -35.3             149. Austra… Canberra
## 2 humanObservati… 02                     -35.3             149. Austra… Canberra
## # ℹ 6 more variables: eventDate <date>, eventTime <Period>,
## #   scientificName <chr>, taxonRank <chr>, kingdom <chr>, phylum <chr>
Note that this deliberately includes some redundancy. The coordinate data are useful by themselves, for example, but in case of ambiguity it is useful to add a text string giving some information on location. Similarly, the scientificName field should be sufficient to identify the taxon in question, but adding higher taxonomic information makes the identification less ambiguous.
We have saved our tibble as an object with a Darwin Core-specific name, so we can subset it to only Darwin Core terms and save it out using write_csv():
library(readr)
library(dplyr)  # for select() and any_of()

occurrences |>
  select(any_of(occurrence_terms())) |>
  write_csv("data/occurrences.csv")
If your data are already in Darwin Core format, you can simply place them in the data folder. You can then use build_schema() to create a 'schema' file, which is an xml document that tells users what data is present in your archive.
Adding package metadata
A critical part of a Darwin Core archive is a metadata statement; this tells users who owns the data, what the data were collected for, and what uses they can be put to (i.e. a data licence). To get an example statement, call use_metadata():
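# creates a blank metadata statement (metadata.md) in your project
use_metadata()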
This creates a blank statement called ‘metadata.md’, which looks like this:
## ## Dataset
##
## ### Title
##
## A Sentence Giving Your Dataset Title In Title Case
##
## ### Abstract
##
## A paragraph outlining the content of the dataset
##
## ### Creator
##
## #### Individual name
Once you have edited the statement to reflect the information you want to convey, you can convert it to EML, the machine-readable metadata format used in Darwin Core Archives.
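As a sketch of that step, assuming the conversion is handled by a function named build_metadata() (the function name is not shown in this vignette, so check the galaxias reference if it differs):

# assumed function name; converts metadata.md to EML (eml.xml) for the archive
build_metadata()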
A second piece of metadata required by the Darwin Core standard is the 'schema' document. This is a machine-readable xml document that describes the content of the archive's data files. You can generate one using:
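# creates the schema file (meta.xml) describing the csv files in the data folder
build_schema()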
Build an archive
At the end of the above process, you should have a folder named data that contains at least three files:
- One or more .csv files containing data
- a meta.xml file containing your schema
- an eml.xml file containing your metadata
If that is true, then you can run build_archive() to zip your data into a Darwin Core Archive:
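# zips the contents of the data folder into a Darwin Core Archive
build_archive()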