Initial Data Check
When you’re ready to start submitting your data, there are a number of things to check so that ingestion into the ALA goes smoothly. These include ensuring that your column names conform to the Darwin Core vocabulary and that your data is in the correct format (e.g. numerical columns are actually numerical).
For these examples, we will be using the dataset linked above. If, however, you want to go through this workflow using your own data, please feel free to do so!
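As a quick pre-check before handing a file to galaxias, you can confirm that columns meant to be numerical really parse as numbers. The sketch below is not part of galaxias and uses only the Python standard library. The sample rows are invented for illustration, the helper name non_numeric_cells is hypothetical, and the column names follow the example dataset used on this page.

```python
import csv
import io

# Hypothetical sample rows, invented for illustration only.
SAMPLE_CSV = """Species,Latitude,Longitude,Collection_date
Eolophus roseicapilla,-35.31,149.124,2023-01-14
Eolophus roseicapilla,-35.273,149.1 E,2023-01-15
"""


def non_numeric_cells(rows, columns):
    """Return (row_number, column, value) for every cell that float() rejects."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for column in columns:
            try:
                float(row[column])
            except (TypeError, ValueError):
                # TypeError covers a missing cell (None); ValueError covers bad text.
                problems.append((row_number, column, row[column]))
    return problems


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = non_numeric_cells(rows, ["Latitude", "Longitude"])
for row_number, column, value in problems:
    print(f"row {row_number}: {column}={value!r} is not numerical")
```

Row numbers count data rows, not the header. Note that float() also accepts strings such as "nan" and "inf", so a stricter check may be needed for real coordinates.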
To read in the data you want to use, you’re going to use pandas to read in the CSV file as a table.
>>> import galaxias
>>> import pandas as pd
>>> my_archive = galaxias.dwca(occurrences='<YOUR-FILENAME>.csv')
Now that you have a dataframe with data in it, you can check the data using the function galaxias.check_dataset().
>>> galaxias.check_dataset()
Number of Errors Pass/Fail Column name
------------------ ----------- -------------
══ Results ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════
Errors: 0.0 | Passes: 0
✗ Data does not meet minimum Darwin core requirements
Use corella.suggest_workflow()
None
For our initial example, the checks report no errors, but unfortunately this is because no column names were checked: the names of the columns are not part of the standard Darwin Core vocabulary. Thankfully, we have created a series of functions that can help you get your data into the Darwin Core standard. To show which galaxias functions can help you do this, we have developed an all-purpose function called suggest_workflow(). Here are the results for this particular dataset:
>>> galaxias.suggest_workflow()
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── All DwC terms ──
Matched 0 of 4 column names to DwC terms:
✓ Matched:
✗ Unmatched: Species, Latitude, Longitude, Collection_date
── Minimum required DwC terms occurrences ──
Type                       Matched term(s)    Missing term(s)
-------------------------  -----------------  ------------------------------------------------
Identifier (at least one)  -                  occurrenceID OR catalogNumber OR recordNumber
Record type                -                  basisOfRecord
Scientific name            -                  scientificName
Location                   -                  decimalLatitude, decimalLongitude, geodeticDatum
Date/Time                  -                  eventDate
── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
── Occurrences ──
To make your occurrences Darwin Core compliant, use the following workflow:
corella.set_occurrences()
corella.set_scientific_name()
corella.set_coordinates()
corella.set_datetime()
Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()
None
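To make the report above more concrete, the following sketch reproduces the column-name matching by hand. It is an illustration only and not how galaxias or corella implement it. The term sets are limited to the minimum required terms listed in the report (the full Darwin Core vocabulary is much larger), the RENAMES mapping is a hypothetical choice for this dataset, and the helper names are made up.

```python
# Minimum required Darwin Core terms, copied from the report above.
ANY_ONE_OF = ["occurrenceID", "catalogNumber", "recordNumber"]
ALL_OF = {"basisOfRecord", "scientificName", "decimalLatitude",
          "decimalLongitude", "geodeticDatum", "eventDate"}

# Hypothetical mapping from this dataset's column names to Darwin Core terms.
RENAMES = {
    "Species": "scientificName",
    "Latitude": "decimalLatitude",
    "Longitude": "decimalLongitude",
    "Collection_date": "eventDate",
}


def match_columns(columns):
    """Split column names into exact (case-sensitive) matches and non-matches."""
    known = ALL_OF | set(ANY_ONE_OF)
    matched = [c for c in columns if c in known]
    unmatched = [c for c in columns if c not in known]
    return matched, unmatched


def missing_terms(columns):
    """Return the required terms that are still absent from the columns."""
    present = set(columns)
    missing = sorted(ALL_OF - present)
    if not present & set(ANY_ONE_OF):
        missing.append(" OR ".join(ANY_ONE_OF))
    return missing


original = ["Species", "Latitude", "Longitude", "Collection_date"]
renamed = [RENAMES.get(c, c) for c in original]

for label, columns in (("before", original), ("after", renamed)):
    matched, unmatched = match_columns(columns)
    print(f"{label}: matched {len(matched)} of {len(columns)}; "
          f"still missing: {missing_terms(columns)}")
```

Before renaming, none of the four columns match, which mirrors the "Matched 0 of 4" line in the report. After renaming, all four match, but an identifier, basisOfRecord and geodeticDatum are still missing, so those values have to be supplied separately.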
To learn more about how to use other functions, go to:
Optional functions:
set_individual_traits <set_individual_traits.html>
Creating Unique IDs:
Passing Dataset: