Initial Data Check

When you’re ready to start submitting your data, there are a few things to check so that ingestion into the ALA goes smoothly. These include making sure that your column names conform to the Darwin Core vocabulary and that your data is in the correct format (e.g. numerical columns are actually numerical).

For these examples, we will be using the dataset linked above. If, however, you want to go through this workflow using your own data, please feel free to do so!

To read in the data you want to use, you’re going to use pandas to read the CSV file in as a table.

>>> import galaxias
>>> import pandas as pd
>>> my_archive = galaxias.dwca(occurrences='<YOUR-FILENAME>.csv')
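Before running any checks, it can help to look at the table with pandas directly and confirm that columns meant to be numerical really were read as numbers. The following is a minimal sketch in plain pandas, not part of the galaxias API. The rows are made up for illustration and stand in for your own file; only the column names come from the example dataset.

```python
import io

import pandas as pd

# Hypothetical rows standing in for '<YOUR-FILENAME>.csv'; only the column
# names are taken from the example dataset used on this page.
csv_text = """Species,Latitude,Longitude,Collection_date
Callocephalon fimbriatum,-35.310,149.125,14/01/2023
Eolophus roseicapilla,-35.273,149.113,15/01/2023
"""

# With a real file you would pass its path to pd.read_csv() instead.
df = pd.read_csv(io.StringIO(csv_text))

# Columns we expect to hold numbers (an assumption for this example).
numeric_columns = ["Latitude", "Longitude"]

# Collect any expected-numeric column that pandas did not parse as numeric.
non_numeric = [
    col for col in numeric_columns
    if not pd.api.types.is_numeric_dtype(df[col])
]

print(df.dtypes)
print("Columns that should be numeric but are not:", non_numeric)
```

If a column shows up in that list, a stray character in one of its cells (for example a unit or a compass letter) is a common cause.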

Now that you have a dataframe with data in it, you can check the data using the function galaxias.check_dataset().

>>> galaxias.check_dataset()
Number of Errors    Pass/Fail    Column name
------------------  -----------  -------------


══ Results ════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════


Errors: 0.0 | Passes: 0

✗ Data does not meet minimum Darwin core requirements
Use corella.suggest_workflow()

None

For our initial data example, the data tests do not show any errors, but unfortunately this is only because no column names were checked: none of the column names are part of the standard Darwin Core vocabulary. Thankfully, we have created a series of functions that can help you get your data into the Darwin Core standard. To show which of the functions in galaxias can help you do this, we have developed an all-purpose function called suggest_workflow(). Here are the results for this particular dataset:

>>> galaxias.suggest_workflow()
── Darwin Core terms ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── All DwC terms ──

Matched 0 of 4 column names to DwC terms:

✓ Matched: 
✗ Unmatched: Species, Latitude, Longitude, Collection_date

── Minimum required DwC terms occurrences ──

Type                       Matched term(s)    Missing term(s)
-------------------------  -----------------  ------------------------------------------------
Identifier (at least one)  -                  occurrenceID OR catalogNumber OR recordNumber
Record type                -                  basisOfRecord
Scientific name            -                  scientificName
Location                   -                  decimalLatitude, decimalLongitude, geodeticDatum
Date/Time                  -                  eventDate

── Suggested workflow ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────

── Occurrences ──

To make your occurrences Darwin Core compliant, use the following workflow:

corella.set_occurrences()
corella.set_scientific_name()
corella.set_coordinates()
corella.set_datetime()

Additional functions: set_abundance(), set_collection(), set_individual_traits(), set_license(), set_locality(), set_taxonomy()
None
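To make the report above concrete, here is a minimal sketch of the same idea in plain pandas: compare the column names against the minimum required terms, rename the four columns to their Darwin Core equivalents, and compare again. This is not how galaxias or corella implement the check, and the set_*() functions listed above remain the suggested route. The helper missing_required_terms() and the rename mapping are hypothetical and written only for this illustration.

```python
import pandas as pd

# Minimum required terms, copied from the suggest_workflow() table above.
REQUIRED_ALL = [
    "basisOfRecord", "scientificName", "decimalLatitude",
    "decimalLongitude", "geodeticDatum", "eventDate",
]
# At least one of these identifier terms must be present.
REQUIRED_ANY_ID = ["occurrenceID", "catalogNumber", "recordNumber"]


def missing_required_terms(columns):
    """Return the required DwC terms absent from `columns` (hypothetical helper)."""
    cols = set(columns)
    missing = [term for term in REQUIRED_ALL if term not in cols]
    if not cols.intersection(REQUIRED_ANY_ID):
        missing.append(" OR ".join(REQUIRED_ANY_ID))
    return missing


# An empty table with the example dataset's column names.
df = pd.DataFrame(columns=["Species", "Latitude", "Longitude", "Collection_date"])

# Assumed mapping from the original names to Darwin Core terms.
rename_map = {
    "Species": "scientificName",
    "Latitude": "decimalLatitude",
    "Longitude": "decimalLongitude",
    "Collection_date": "eventDate",
}
renamed = df.rename(columns=rename_map)

print("Before renaming:", missing_required_terms(df.columns))
print("After renaming: ", missing_required_terms(renamed.columns))
```

Renaming alone does not satisfy the minimum requirements: an identifier, basisOfRecord and geodeticDatum are not in the original table at all, so they still have to be added as new columns.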

To learn more about how to use other functions, go to

Optional functions:

Creating Unique IDs:

Passing Dataset: