
Quick start guide
Martin Westgate
2024-12-12
galaxias is an R package to help users build repositories that are optimised for storing, documenting and sharing biodiversity data. These repositories can be seamlessly converted into a Darwin Core Archive, the data standard used by the Global Biodiversity Information Facility (GBIF) and its partner nodes.
Start a new project
To start a new data project using galaxias, call:
library(galaxias)
galaxias_project("myprojectname")
If you are using RStudio, this should launch a new RStudio instance for that project, which will have the following structure:
├── README.md : Description of the repository
├── metadata.md : Boilerplate metadata statement for this project
├── myprojectname.Rproj : RStudio project file
├── data-raw : Folder to store source data
└── data : Folder to store processed data
It is a good idea to write some basic information in your README.md file first, as this provides guidance to your users as to what your repository contains, as well as what they are allowed to use it for.
Adding data to your project
We recommend that you first add your data to the data-raw folder, then use a script within that folder to manipulate it to Darwin Core format. Let's assume, for example, that your data looks like this:
library(tibble)

df <- tibble(
  latitude = c(-35.310, -35.273),
  longitude = c(149.125, 149.133),
  date = c("14-01-2023", "15-01-2023"),
  time = c("10:23", "11:25"),
  species = c("Callocephalon fimbriatum", "Eolophus roseicapilla"))
(Note that normally you'd import your data from an external file (e.g. using readr::read_csv()), but we've constructed one here for example purposes.)
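For instance, reading a source file stored in data-raw might look like the sketch below; the file name is hypothetical and is only there to illustrate the import step.

library(readr)

# hypothetical file name; replace with the source file you placed in data-raw
df <- read_csv("data-raw/observations.csv")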
We recommend using functions from the corella package for converting tibbles to Darwin Core. corella is automatically loaded when you load galaxias. A minimally complete set of formatted observations might look like this:
library(lubridate)

occurrences <- df |>
  # basic requirements of Darwin Core
  use_occurrences(occurrenceID = sequential_id(),
                  basisOfRecord = "humanObservation") |>
  # place and time
  use_coordinates(decimalLatitude = latitude,
                  decimalLongitude = longitude) |>
  use_locality(country = "Australia",
               locality = "Canberra") |>
  use_datetime(eventDate = lubridate::dmy(date),
               eventTime = lubridate::hm(time)) |>
  # taxonomy
  use_scientific_name(scientificName = species,
                      taxonRank = "species") |>
  use_taxonomy(kingdom = "Animalia",
               phylum = "Aves")
print(occurrences, n = 5)
## # A tibble: 2 × 12
##   basisOfRecord   occurrenceID decimalLatitude decimalLongitude country locality
##   <chr>           <chr>                  <dbl>            <dbl> <chr>   <chr>
## 1 humanObservati… 01                     -35.3             149. Austra… Canberra
## 2 humanObservati… 02                     -35.3             149. Austra… Canberra
## # ℹ 6 more variables: eventDate <date>, eventTime <Period>,
## #   scientificName <chr>, taxonRank <chr>, kingdom <chr>, phylum <chr>
Note that this deliberately includes some redundancy. The coordinate data are useful by themselves, for example, but in case of ambiguity it is useful to add a text string giving some information on location. Similarly, the scientificName field should be sufficient to identify the taxon in question, but adding higher taxonomic information makes the identification less ambiguous.
We have saved our tibble as an object with a Darwin Core-specific name, so we can subset it to only Darwin Core terms and save it out using write_csv():
library(readr)
library(dplyr)  # for select() and any_of()

occurrences |>
  select(any_of(occurrence_terms())) |>
  write_csv("data/occurrences.csv")
If your data are already in Darwin Core format, you can simply place them in the data folder. You can then use build_schema() to create a 'schema' file, which is an xml document that tells users what data is present in your archive.
Adding package metadata
A critical part of a Darwin Core archive is a metadata statement; this tells users who owns the data, what the data were collected for, and what uses they can be put to (i.e. a data licence). To get an example statement, call use_metadata():
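# creates a blank metadata statement (metadata.md) in your project
use_metadata()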
This creates a blank statement called ‘metadata.md’, which looks like this:
## ## Dataset
##
## ### Title
##
## A Sentence Giving Your Dataset Title In Title Case
##
## ### Abstract
##
## A paragraph outlining the content of the dataset
##
## ### Creator
##
## #### Individual name
Once you have edited the statement to reflect the information you want to convey, you can convert it to EML, the machine-readable metadata format used in Darwin Core Archives.
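As a sketch of that step, assuming the conversion is handled by a function named build_metadata() (the function name is not shown in this vignette, so check the galaxias reference if it differs):

# assumed function name; converts metadata.md to EML (eml.xml) for the archive
build_metadata()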
A second piece of metadata required by the Darwin Core standard is the 'schema' document. This is a machine-readable xml document that describes the content of the archive's data files. You can generate one using:
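# creates the schema file (meta.xml) describing the csv files in the data folder
build_schema()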
Build an archive
At the end of the above process, you should have a folder named data that contains at least three files:
- One or more .csv files containing data
- a meta.xml file containing your schema
- an eml.xml file containing your metadata
If that is true, then you can run build_archive() to zip your data into a Darwin Core Archive:
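# zips the contents of the data folder into a Darwin Core Archive
build_archive()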