A ‘How-To’ Guide for Extracting GloBI Data 😃

GloBI has a ton of useful data in it and can be used in many different ways. However, it can also be a bit overwhelming with so much data. This page offers some helpful links, hacks, and instructions for how to extract the information available in a useful format.

Do you have web programming skills? Extra time on your hands? Want to help make easy-to-push buttons, search boxes, or other web interface improvements to GloBI? PLEASE contact Jorrit! We’d love the extra help!


❗ ⚠️ Page under construction ⚠️ ❗



Introduction to using GloBI

Did you know there is a video tutorial on how to use GloBI?
A Practical Exploration of Biotic Interaction Data Management and Information Retrieval through TPT and GloBI (video)

There are also detailed step-by-step instructions from the Species Interaction Data Workshop on how to extract information from GloBI.

  1. Getting Interaction Data
  2. Working with the Whole Dataset
  3. Exploring Ixodes (tick) Records By Pointing and Clicking
  4. Data Sources: Interaction Data Record Review
  5. Data Sources: Taxonomic Name Review





Pre-compiled Datasets

Did you know, GloBI has a number of datasets and files pre-compiled and ready to download?!

Just go to the GloBI Sources page, find the dataset of a collection or group you are interested in, and click the “review” button on the left with the green checkmark. Then copy the link of the file you want and paste it into your web browser’s address bar. The file will start to download automatically 😃

Here are some multiple collection precompiled datasets you may find useful:

Terrestrial Parasite Tracker (TPT) data

Terrestrial Parasite Tracker is an NSF-funded project that aims to digitize natural history collection records related to parasites and their vertebrate hosts.

SCAN data

Symbiota Collections of Arthropods Network (SCAN) (https://scan-bugs.org) is “A Data Portal Built to Visualize, Manipulate, and Export Species Occurrences.”

Big-Bee data

The Big Bee project (https://big-bee.net/) aims to “Extend Anthophila research through image and trait digitization.”

Big-Bee publishes a quarterly report of global bee interactions indexed by GloBI that includes additional data curation such as the removal of duplicate records. This publication includes interactions from museum specimens, journal publications, and observations in both comma and tab-delimited formats.

Katja C. Seltmann. (2022). Global Bee Interaction Data (v1.0.1) [Data set]. Zenodo. https://doi.org/10.5281/zenodo.6564738 on May 19, 2022





General Searches

Search from home page

Search from browse page

To search from the GloBI browse page:





Using R

For those that are comfortable using R, install and use rglobi for more precise and filtered datasets.





Using the Command Line

General datasets

If you are comfortable with using command line scripts and code (i.e., in terminal, shell, etc.) you may find the following “Big Data Cheatsheet” useful.

After you download the dataset you need (see Pre-compiled Datasets, General Searches, or the GloBI data page), you can modify the following code to fit your dataset needs:

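# First pass: keep records where a fungus (source) interacts with an oak (target),
# and write the two name columns to a new file.
$ time cat data/stable/interactions.csv.gz \
  | gunzip \
  | mlr --csv filter '$sourceTaxonKingdomName == "Fungi"' \
  | mlr --csv filter '$targetTaxonGenusName == "Quercus"' \
  | mlr --csv cut -f sourceTaxonName,targetTaxonName \
  > data/oakfungi.csv

# Second pass: the same search with the oak as source and the fungus as target.
# cut -o writes the columns in the order listed, so the fungus stays in the first column;
# tail -n+2 drops the header row before the records are appended.
$ time cat data/stable/interactions.csv.gz \
  | gunzip \
  | mlr --csv filter '$targetTaxonKingdomName == "Fungi"' \
  | mlr --csv filter '$sourceTaxonGenusName == "Quercus"' \
  | mlr --csv cut -o -f targetTaxonName,sourceTaxonName \
  | tail -n+2 \
  >> data/oakfungi.csv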
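
For readers who do not have Miller (mlr) installed, a similar filter can be sketched with awk. This is only an illustration under two assumptions that are not from the original page: the data are tab-delimited (awk does not understand quoted commas in CSV files), and the three sample rows are made up to keep the example self-contained.

```shell
# Made-up tab-delimited sample, compressed like the real download.
workdir=$(mktemp -d)
printf 'sourceTaxonName\tsourceTaxonKingdomName\ttargetTaxonName\ttargetTaxonGenusName\n' > "$workdir/sample.tsv"
printf 'Erysiphe alphitoides\tFungi\tQuercus robur\tQuercus\n' >> "$workdir/sample.tsv"
printf 'Apis mellifera\tAnimalia\tQuercus alba\tQuercus\n' >> "$workdir/sample.tsv"
printf 'Puccinia graminis\tFungi\tTriticum aestivum\tTriticum\n' >> "$workdir/sample.tsv"
gzip "$workdir/sample.tsv"

# Look up column positions from the header row, then keep rows where the
# source kingdom is Fungi and the target genus is Quercus.
gunzip -c "$workdir/sample.tsv.gz" \
  | awk -F '\t' -v OFS='\t' '
      NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i
                print "sourceTaxonName", "targetTaxonName"; next }
      $(col["sourceTaxonKingdomName"]) == "Fungi" &&
      $(col["targetTaxonGenusName"]) == "Quercus" {
        print $(col["sourceTaxonName"]), $(col["targetTaxonName"]) }
    ' > "$workdir/oakfungi.tsv"

cat "$workdir/oakfungi.tsv"
```

Looking the columns up by name rather than by number means the sketch keeps working if the column order in the download changes.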

You can also load a dataset from GloBI into a sqlite3 database on your personal computer by using/modifying the following code (after downloading the dataset):

$ cat interactions.csv.gz\
| gunzip\
| sqlite3 -csv globi.db '.import /dev/stdin interactions'
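
Once the data are in SQLite, they can be explored with ordinary SQL. The sketch below is illustrative only: it builds a throwaway database from three made-up rows so that it is self-contained, and the column name interactionTypeName is assumed to match the header of the file you downloaded.

```shell
# Skip quietly if the sqlite3 command-line tool is not installed.
if command -v sqlite3 >/dev/null 2>&1; then
  workdir=$(mktemp -d)
  # Made-up sample rows standing in for the real interactions.csv.gz.
  printf '%s\n' \
    'sourceTaxonName,interactionTypeName,targetTaxonName' \
    'Ixodes scapularis,parasiteOf,Odocoileus virginianus' \
    'Apis mellifera,visitsFlowersOf,Helianthus annuus' \
    'Ixodes ricinus,parasiteOf,Capreolus capreolus' \
    | gzip > "$workdir/interactions.csv.gz"

  # Same import as above, into a temporary database.
  gunzip -c "$workdir/interactions.csv.gz" \
    | sqlite3 -csv "$workdir/globi.db" '.import /dev/stdin interactions'

  # Count records per interaction type, most frequent first.
  sqlite3 -header -csv "$workdir/globi.db" \
    'SELECT interactionTypeName, COUNT(*) AS n
       FROM interactions
      GROUP BY interactionTypeName
      ORDER BY n DESC;' > "$workdir/counts.csv"
  cat "$workdir/counts.csv"
fi
```

For your own data, point the query at the globi.db file created by the import command and adjust the column names to the ones in your file.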

Or for SCAN specific data:

$ cat interactions.csv.gz\
| gunzip\
| grep "globalbioticinteractions/scan" > globi-scan.csv

Or, to count the SCAN records instead of saving them:

$ cat interactions.csv.gz\
| gunzip\
| grep "globalbioticinteractions/scan"\
| wc -l
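
Note that grep keeps only the matching lines, so the header row with the column names is not written to globi-scan.csv. A minimal sketch of one way to keep it follows; the sample rows and column layout are made up for illustration, and only the search string comes from the commands above.

```shell
workdir=$(mktemp -d)
# Made-up sample standing in for interactions.csv.gz.
printf '%s\n' \
  'sourceTaxonName,targetTaxonName,sourceNamespace' \
  'Ixodes pacificus,Sceloporus occidentalis,globalbioticinteractions/scan' \
  'Apis mellifera,Salvia mellifera,globalbioticinteractions/example' \
  | gzip > "$workdir/interactions.csv.gz"

# NR == 1 passes the header line through; the pattern keeps the SCAN records.
gunzip -c "$workdir/interactions.csv.gz" \
  | awk 'NR == 1 || /globalbioticinteractions\/scan/' \
  > "$workdir/globi-scan.csv"

cat "$workdir/globi-scan.csv"
```

With the header in place, the resulting file can be opened directly in a spreadsheet or imported into SQLite with named columns.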

To reduce the size of a sqlite3 (or other) database, you can drop columns before importing the data, using power tools like cut or mlr (Miller). See the importing csv files to sqlite page.
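
As a rough sketch of dropping columns with cut: the example below lists the columns by number and then keeps only the first three. It assumes tab-delimited data, because cut splits on every delimiter and would mishandle quoted commas in CSV files; the sample row and the choice of columns are made up for illustration.

```shell
workdir=$(mktemp -d)
# Made-up tab-delimited sample with one column we do not need.
printf 'sourceTaxonName\tinteractionTypeName\ttargetTaxonName\treferenceCitation\n' > "$workdir/sample.tsv"
printf 'Ixodes scapularis\tparasiteOf\tPeromyscus leucopus\tmade-up citation\n' >> "$workdir/sample.tsv"

# List column numbers next to their names to see which ones to keep.
head -n 1 "$workdir/sample.tsv" | tr '\t' '\n' | nl

# Keep columns 1-3 and drop the rest (cut uses tab as its default delimiter).
cut -f 1-3 "$workdir/sample.tsv" > "$workdir/slim.tsv"
cat "$workdir/slim.tsv"
```

The slimmed file can then be imported in place of the full one.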

If you can provide further step-by-step instructions on how to use these scripts, please add them to the working guide and I will add them to this page. I’m not a command line person, so any help adding to this section would be much appreciated!


Custom Taxon List Search

Want association information for multiple taxa without searching each name individually? Want to be able to download a csv with this data? Well, now you can!

  1. Create a simple text (.txt) file with the names you want to find associations for.
    • File should have a single column of names
    • For species names that have spaces in them, replace all spaces with “%20”. You can do this with a find and replace all procedure. Example: Cremnops montrealensis → Cremnops%20montrealensis
    • Save file with extension .txt
  2. Open a command line terminal on your computer (on Macs there is a built-in one called “Terminal”)
  3. Navigate to the folder you want the resulting files to be saved to. Example for Macs:
    cd YourDocuments/YourFolderName/ProjectsFolder
    
  4. Once in the folder you want your results to save to, run the following code in terminal:
    • This will produce and save a csv file with all associations in GloBI for the taxa in the list you made as well as the records the associations came from.
      cat YourNameList.txt\
      | sed 's+^+https://api.globalbioticinteractions.org/interaction.csv?includeObservations=true\&sourceTaxon=+g'\
      | xargs -L1 curl\
      > results-YourNameList.csv
      

      Where:

    sed 's+^+https://api.globalbioticinteractions.org/interaction.csv?includeObservations=true\&sourceTaxon=+g'

    • Turns each name into a URL request for individual records that involve the specified taxon. The backslash before “&” is needed because sed otherwise treats “&” as a special character.

    xargs -L1 curl

    • Executes one request at a time using “curl” (a command-line tool for fetching URLs)

    > results-...csv

    • Saves the results in a file called “results-[something].csv”
  5. If you only want the taxon level information (not information about the individual records the associations came from) you can modify the code by omitting the includeObservations=true. For example:
    cat YourTaxaList.txt\
    | sed 's+^+https://api.globalbioticinteractions.org/interaction.csv?sourceTaxon=+g'\
    | xargs -L1 curl\
    > results-YourInteractions.csv
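
The list-preparation steps above can be sketched as a small script. This is a hedged illustration, not an official GloBI tool: the two names and the file names are placeholders, and the curl step is left commented out so that the URLs can be inspected before any requests are sent.

```shell
workdir=$(mktemp -d)
# Step 1: one name per line (placeholder names for illustration).
printf '%s\n' 'Cremnops montrealensis' 'Apis mellifera' > "$workdir/YourNameList.txt"

# Replace spaces with %20 and put the API address in front of each name.
# awk is used here so that the "&" in the address needs no escaping.
base='https://api.globalbioticinteractions.org/interaction.csv?includeObservations=true&sourceTaxon='
awk -v base="$base" '{ gsub(/ /, "%20"); print base $0 }' \
  "$workdir/YourNameList.txt" > "$workdir/urls.txt"
cat "$workdir/urls.txt"

# Step 4, left commented out here: fetch each URL in turn.
# xargs -L1 curl < "$workdir/urls.txt" > "$workdir/results-YourNameList.csv"
```

Doing the %20 replacement in the script means the name list can be kept in its normal, readable form.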
    

Things to note about this procedure:




GloBI Hacks

No-download data viewing

You can view datasets from GloBI (or any other .csv/.tsv files online) without actually downloading them! To do this, copy a file link from the GloBI Sources page, remove the “.gz” extension from the end of the link, and paste it into the following formula in a Google Sheets cell:

=IMPORTDATA("YOUR FILE NAME")

Example:

=IMPORTDATA("https://depot.globalbioticinteractions.org/reviews/globalbioticinteractions/scan/indexed-interactions-sample.tsv")
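
If you have copied a link ending in “.gz”, the formula can also be generated from it on the command line. A small sketch follows; the “.gz” link used here is an assumption, formed by adding “.gz” to the example address above.

```shell
# Link as copied from the GloBI Sources page (assumed to end in .gz).
url='https://depot.globalbioticinteractions.org/reviews/globalbioticinteractions/scan/indexed-interactions-sample.tsv.gz'

# ${url%.gz} strips a trailing ".gz" and leaves other links unchanged.
formula="=IMPORTDATA(\"${url%.gz}\")"
printf '%s\n' "$formula"
```

The printed line can be pasted into a Google Sheets cell as is.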

Name matching with other databases

To match or cross-reference names in GloBI with names in other databases such as ITIS or NCBI, check out the tool Nomer.





Hopefully, this page had some useful content to help you navigate GloBI!!

💡 Have something helpful to add to this page?

➡️ Please add it to the working guide we are creating to help pull data out of GloBI.

💡 Have a problem or something we need to add?

➡️ Please submit a request on the issue page!


🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧 🚧

Need additional help that the working guide or issue page doesn’t cover? Contact page editor Erika Tucker.

This page is supported in part by the Terrestrial Parasite Tracker group’s efforts to produce sustainable, open access, digitization methods and related research tools (NSF award# 1901932).