Day Two Group Assignment Part Two
Overview
Teaching: 20 min
Exercises: 55 min
Questions
What are the likely side-effects in a data analysis pipeline?
How can I create a bipartite visualization in R?
What does it take to calculate ecological network metrics?
Objectives
Articulate benefits / downsides of data pre-processing
Discover limitations of traditional analysis and visualizations
In this session, we’ll go through the steps from acquiring data, to “cleaning” data, to visualizing and analyzing the results.
First, we’ll have a look at likely side-effects of preparing data for analysis and visualization.
Side-effects of Pre-processing Data
Data may be reformatted and “cleaned” to help facilitate analysis, visualization, and re-use.
In this exercise, we’ll look at a specific dataset and its transformations in the GloBI processing pipeline.
This example pipeline likely reflects other research data processing (automated or manual) techniques in use today.
Exercise 1. Data Processing Side-effects
Re-visit the GloBI process page at https://globalbioticinteractions.org/process .
Locate the original dataset related to Olito, Colin; Fox, Jeremy W. (2015), Data from: Species traits and abundances predict metrics of plant–pollinator network structure, but not pairwise interactions, Dryad, Dataset, https://doi.org/10.5061/dryad.7st32 .
Now, locate the manually transcribed version interactions.tsv of this dataset at https://github.com/zedomel/olito2015 .
Inspect a version of the same dataset as seen by GloBI before taxonomic linking at https://depot.globalbioticinteractions.org/reviews/zedomel/olito2015/indexed-interactions.csv
Inspect the version of the dataset after GloBI’s taxonomic linking at olito2015.csv
Compare the different versions of the dataset and describe the similarities and changes. Note where in the process diagram the datasets live.
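The comparison in the last step can start with a quick structural summary. The following is a minimal sketch in base R, not part of the GloBI tooling: the two data frames are made-up stand-ins, and their column names and values are assumptions chosen for illustration only.

```r
# Hypothetical stand-ins for two versions of the same dataset.
# Column names and values are illustrative assumptions, not real records.
original <- data.frame(
  plant = c("Plant A", "Plant B"),
  pollinator = c("Pollinator 1", "Pollinator 2"),
  stringsAsFactors = FALSE
)
indexed <- data.frame(
  sourceTaxonName = c("Pollinator 1", "Pollinator 2", "Pollinator 1"),
  interactionTypeName = c("pollinates", "pollinates", "pollinates"),
  targetTaxonName = c("Plant A", "Plant B", "Plant B"),
  stringsAsFactors = FALSE
)

# Summarize how two versions differ in size and in column names.
compare_versions <- function(a, b) {
  list(
    rows = c(a = nrow(a), b = nrow(b)),
    only_in_a = setdiff(names(a), names(b)),
    only_in_b = setdiff(names(b), names(a)),
    shared = intersect(names(a), names(b))
  )
}

summary_ab <- compare_versions(original, indexed)
print(summary_ab)
```

To apply this to the real files, load them first, for example with read.delim() for interactions.tsv and read.csv() for the CSV versions, then pass the resulting data frames to compare_versions().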
Now, let us have a look at visualizing and analyzing network data.
Visualization: Bipartite and Pre-canned Data
Exercise 2. Looking at Pre-canned Data
Locate a popular R package by doing a web search for “bipartite R package”
In the R package page, notice the “starting with bipartite” vignette.
If feasible, install the package in your R environment
Reproduce the bipartite visualization vignette example with the olito2015 network.
Notice how much time it takes to set up an environment and reproduce a “getting started” example.
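For orientation, here is a minimal sketch of what a bipartite plot needs as input: a matrix with one group of species in the rows, the other group in the columns, and interaction counts in the cells. The edge list below is made up for illustration. The plotting call uses plotweb from the bipartite package and is skipped if the package is not installed.

```r
# Made-up edge list: one row per observed interaction (illustrative only).
edges <- data.frame(
  plant = c("Plant A", "Plant A", "Plant A", "Plant B", "Plant C", "Plant C"),
  pollinator = c("Bee 1", "Bee 1", "Fly 1", "Bee 1", "Bee 2", "Bee 1"),
  stringsAsFactors = FALSE
)

# Cross-tabulate into a plants x pollinators count matrix.
# unclass() drops the "table" class so a plain matrix remains.
web <- unclass(table(edges$plant, edges$pollinator))
print(web)

# Plot only if the bipartite package is available.
if (requireNamespace("bipartite", quietly = TRUE)) {
  bipartite::plotweb(web)
}
```

Building the matrix by hand like this makes it easier to see what the package's example datasets already contain when you load them.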
Visualization: Bipartite and “Real” Data
Some biodiversity data infrastructures (like GloBI, GBIF) hide the complexities of working with big datasets by offering Web-accessible APIs (Application Programming Interfaces). Instead of getting all the data on your system, you ask for a specific subset of the data, and let some remote server do the heavy lifting. In this exercise, we’ll use an API that GloBI provides through the rglobi package.
Exercise 3. Plot a bipartite visualization with "real" data
use the rglobi::get_interactions_by_taxa function to retrieve records describing Fungi interacting with Oak trees (Quercus). Alternatively, use the GloBI Browser to do a similar web query.
save the results in a csv file
count the number of records in the csv file
use this csv file to re-create the bipartite visualization of the second exercise
Feel free to use the cheatsheet.
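The steps above can be sketched roughly as follows. This is one possible approach under stated assumptions, not a reference solution. The rglobi call is wrapped in a function and not executed here because it needs the rglobi package and network access. The small data frame is a made-up stand-in for the API result, and its column names are an assumption about what the API returns.

```r
# Wrapper around the API call; defined but not run in this sketch.
# Requires the rglobi package and a network connection.
fetch_oak_fungi <- function() {
  rglobi::get_interactions_by_taxa(sourcetaxon = "Fungi", targettaxon = "Quercus")
}

# Made-up stand-in for the retrieved records (assumed column names).
records <- data.frame(
  source_taxon_name = c("Fungus 1", "Fungus 2", "Fungus 1"),
  interaction_type = c("interactsWith", "interactsWith", "interactsWith"),
  target_taxon_name = c("Quercus sp. 1", "Quercus sp. 1", "Quercus sp. 2"),
  stringsAsFactors = FALSE
)

# Save the results in a csv file (a temporary file in this sketch).
csv_path <- file.path(tempdir(), "oakfungi-example.csv")
write.csv(records, csv_path, row.names = FALSE)

# Count the number of records in the csv file.
saved <- read.csv(csv_path, stringsAsFactors = FALSE)
n_records <- nrow(saved)
print(n_records)

# Turn the records into a source x target matrix for a bipartite plot.
web <- unclass(table(saved$source_taxon_name, saved$target_taxon_name))
print(web)
```

With rglobi installed, replacing the stand-in with records <- fetch_oak_fungi() should feed real data through the same steps, provided the returned columns match the assumed names.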
Analyzing Lots of Data
To prepare for this workshop, you downloaded one of GloBI’s data products, the interactions.csv.gz from the https://globalbioticinteractions.org/data page.
Exercise 4. Counting all the things
count the number of records in interactions.csv.gz
(extra credit) count the number of records that contain “Fungi” in them
(extra credit) count the number of records that contain both “Fungi” and “Quercus” in them
Please feel free to use any tool you’d like. Also, please see the Big Data Cheatsheet.
Also, for your convenience, please see files oakfungi-sample.csv and oakfungi.csv for examples of results.
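One way to count records without loading the whole file into memory is to read the compressed file in chunks. The sketch below uses only base R. The function name count_matching_lines and the tiny demo file are made up for illustration. It assumes one record per line and a header on the first line, so records with line breaks inside quoted fields would be miscounted.

```r
# Count lines in a gzip-compressed text file that contain all given patterns.
# With no patterns, every line (including the header) is counted.
count_matching_lines <- function(path, patterns = character(0), chunk_size = 10000L) {
  con <- gzfile(path, open = "rt")
  on.exit(close(con))
  total <- 0
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    keep <- rep(TRUE, length(lines))
    for (p in patterns) {
      # fixed = TRUE matches the pattern as a literal string, not a regex.
      keep <- keep & grepl(p, lines, fixed = TRUE)
    }
    total <- total + sum(keep)
  }
  total
}

# Build a tiny made-up demo file so the sketch runs on its own.
demo_path <- file.path(tempdir(), "interactions-demo.csv.gz")
out <- gzfile(demo_path, open = "wt")
writeLines(c(
  "source,interaction,target",
  "Fungi sp. 1,interactsWith,Quercus sp.",
  "Fungi sp. 2,interactsWith,Pinus sp.",
  "Apis sp.,pollinates,Quercus sp."
), out)
close(out)

n_records <- count_matching_lines(demo_path) - 1  # subtract the header line
n_fungi <- count_matching_lines(demo_path, "Fungi")
n_both <- count_matching_lines(demo_path, c("Fungi", "Quercus"))
print(c(records = n_records, fungi = n_fungi, both = n_both))
```

Pointing count_matching_lines() at your downloaded interactions.csv.gz follows the same pattern. Expect it to take noticeably longer than on the demo file.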
Exercise 5. Visualizing all the things
create a bipartite graph for oakfungi-sample.csv
now, create a bipartite graph for oakfungi.csv
compare the visualizations and notice the differences
The bipartite R package contains various methods to quantitatively describe networks.
Exercise 6. Exploring Network metrics
re-visit the bipartite vignette pdf
in the vignette, look for the network, group, link, and species metrics (or indices)
(extra credit) calculate some indices using oakfungi-sample.csv or oakfungi.csv datasets.
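To get a feel for what such indices measure, the sketch below computes two simple ones by hand in base R: connectance (realized links divided by possible links) and species degree (number of partners per species). The matrix is made up for illustration. The final call uses networklevel from the bipartite package and is skipped if the package is not installed.

```r
# Made-up plants x pollinators interaction matrix (counts of interactions).
web <- matrix(
  c(2, 0, 1,
    1, 0, 0,
    1, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("Plant A", "Plant B", "Plant C"),
    c("Bee 1", "Bee 2", "Fly 1")
  )
)

# Network-level index: connectance = realized links / possible links.
links <- sum(web > 0)
connectance <- links / (nrow(web) * ncol(web))

# Species-level index: degree = number of distinct partners.
plant_degree <- rowSums(web > 0)
pollinator_degree <- colSums(web > 0)

print(connectance)
print(plant_degree)
print(pollinator_degree)

# Compare against the package, if it is available.
if (requireNamespace("bipartite", quietly = TRUE)) {
  print(bipartite::networklevel(web, index = "connectance"))
}
```

The hand-computed values are a useful sanity check when you move on to the larger oakfungi datasets, where the indices are harder to verify by eye.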
Now that we’ve tried a couple of ways to access, visualize, and analyze data, let’s reflect on how these methods fit into a research workflow.
Discussion
What are the benefits of using a whole dataset like GloBI’s interactions.csv.gz?
What are the benefits of using the GloBI Web API instead?
What method would you choose for your publication?
How would you cite / publish your results?
How would you assess the quality of the retrieved data?
What’s Next?
Please see the schedule for what’s next.
Key Points
Working with big datasets often requires different tools and skills
Data processing can introduce errors and bias
Many tools are suited for small datasets only