Day Two Group Assignment Part Two

Overview

Teaching: 20 min
Exercises: 55 min

Questions

What are the likely side-effects in a data analysis pipeline?

How can I create bipartite visualizations in R?

What does it take to calculate ecological network metrics?

Objectives

Articulate benefits / downsides of data pre-processing

Discover limitations of traditional analysis and visualizations

In this session, we’ll go through the steps from acquiring data, to “cleaning” data, to visualizing and analyzing the results.

First, we’ll have a look at likely side-effects of preparing data for analysis and visualization.

Side-effects of Pre-processing Data

Data may be reformatted and “cleaned” to help facilitate analysis, visualization, and re-use.

In this exercise, we’ll look at a specific dataset and its transformations in the GloBI processing pipeline.

This example pipeline likely resembles other research data processing techniques (automated or manual) in use today.

Exercise 1. Data Processing Side-effects

Re-visit the GloBI process page at https://globalbioticinteractions.org/process .

Locate the original dataset related to Olito, Colin; Fox, Jeremy W. (2015), Data from: Species traits and abundances predict metrics of plant–pollinator network structure, but not pairwise interactions, Dryad, Dataset, https://doi.org/10.5061/dryad.7st32 .

Now, locate the manually transcribed version of this dataset, interactions.tsv, at https://github.com/zedomel/olito2015.

Inspect a version of the same dataset as seen by GloBI before taxonomic linking at https://depot.globalbioticinteractions.org/reviews/zedomel/olito2015/indexed-interactions.csv

Inspect the version of the dataset after GloBI’s taxonomic linking at olito2015.csv

Compare the different versions of the dataset and describe their similarities and differences. Note where in the process diagram each version lives.
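One way to make such a comparison systematic is to summarize each version by its column names and row count, then list what was added or dropped. The sketch below is written in Python purely for illustration. The two miniature tables and all of their column names are hypothetical stand-ins, not the actual contents of the olito2015 files.

```python
import csv
import io


def read_table(text, delimiter):
    """Parse delimited text into a (header, rows) pair."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    rows = [row for row in reader if row]  # ignore blank lines
    return header, rows


def compare_versions(before, after):
    """Summarize how two (header, rows) tables differ."""
    header_a, rows_a = before
    header_b, rows_b = after
    return {
        "columns_only_in_before": sorted(set(header_a) - set(header_b)),
        "columns_only_in_after": sorted(set(header_b) - set(header_a)),
        "shared_columns": sorted(set(header_a) & set(header_b)),
        "row_counts": (len(rows_a), len(rows_b)),
    }


# Hypothetical miniature stand-ins for two versions of one dataset.
transcribed_tsv = (
    "sourceTaxonName\ttargetTaxonName\n"
    "Insect X\tPlant A\n"
    "Insect Y\tPlant B\n"
)
linked_csv = (
    "sourceTaxonName,sourceTaxonId,targetTaxonName,targetTaxonId\n"
    "Insect X,ID:1,Plant A,ID:2\n"
    "Insect Y,ID:3,Plant B,ID:4\n"
)

summary = compare_versions(
    read_table(transcribed_tsv, "\t"),
    read_table(linked_csv, ","),
)
print(summary)
```

To try this on the real files, read their contents into strings and pass the matching delimiter (tab for .tsv, comma for .csv).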

Now, let us have a look at visualizing and analyzing network data.

Visualization: Bipartite and Pre-canned Data

Exercise 2. Looking at Pre-canned Data

Locate a popular R package by doing a web search for “bipartite R package”.

In the R package page, notice the “starting with bipartite” vignette.

If feasible, install the package in your R environment.

Reproduce the bipartite visualization vignette example with the olito2015 network.

Notice how much time it takes to set up an environment and reproduce a “getting started” example.

Visualization: Bipartite and “Real” Data

Some biodiversity data infrastructures (like GloBI, GBIF) hide the complexities of working with big datasets by offering Web-accessible APIs (Application Programming Interfaces). Instead of downloading all the data to your system, you ask for a specific subset of the data and let a remote server do the heavy lifting. In this exercise, we’ll use an API that GloBI provides through the rglobi package.

Exercise 3. Plot a bipartite visualization with "real" data

Use the rglobi::get_interactions_by_taxa function to retrieve records describing Fungi interacting with Oak trees (Quercus). Alternatively, use the GloBI Browser to do a similar web query.

Save the results in a csv file.

Count the number of records in the csv file.

Use this csv file to re-create the bipartite visualization of the second exercise.

Feel free to use the cheatsheet.
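Interaction records come as one row per observation, while the plotting functions in the bipartite package work on a matrix with one group of species as rows and the other as columns. The reshaping step can be sketched as follows. It is written in Python purely for illustration; the field names source_taxon_name and target_taxon_name are assumptions about the retrieved columns, and the records are made up.

```python
from collections import Counter


def build_web(records, lower_field, higher_field):
    """Turn interaction records into (lower names, higher names, count matrix).

    Each matrix cell holds the number of records linking that pair.
    """
    pair_counts = Counter(
        (record.get(lower_field), record.get(higher_field))
        for record in records
        if record.get(lower_field) and record.get(higher_field)
    )
    lower = sorted({low for low, _ in pair_counts})
    higher = sorted({high for _, high in pair_counts})
    matrix = [[pair_counts.get((low, high), 0) for high in higher] for low in lower]
    return lower, higher, matrix


# Made-up records; real ones would come from the saved csv file.
records = [
    {"source_taxon_name": "Fungus A", "target_taxon_name": "Oak 1"},
    {"source_taxon_name": "Fungus A", "target_taxon_name": "Oak 1"},
    {"source_taxon_name": "Fungus A", "target_taxon_name": "Oak 2"},
    {"source_taxon_name": "Fungus B", "target_taxon_name": "Oak 2"},
]

lower, higher, web = build_web(records, "target_taxon_name", "source_taxon_name")
print(lower)   # row labels
print(higher)  # column labels
print(web)     # interaction counts
```

Counting repeated records as link weights is one possible design choice; depending on the question, a presence/absence matrix may be more appropriate.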

Analyzing Lots of Data

To prepare for this workshop, you downloaded one of GloBI’s data products, the interactions.csv.gz from the https://globalbioticinteractions.org/data page.

Exercise 4. Counting all the things

Count the number of records in interactions.csv.gz.

(extra credit) Count the number of records that contain “Fungi” in them.

(extra credit) Count the number of records that contain both “Fungi” and “Quercus” in them.

Please feel free to use any tool you’d like. Also, please see the Big Data Cheatsheet.

Also, for your convenience, please see files oakfungi-sample.csv and oakfungi.csv for examples of results.
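Because any tool is allowed here, one possible approach is to stream the compressed file row by row so that it never has to fit in memory. The Python sketch below assumes the file is a gzip-compressed, UTF-8 csv with a single header row. To stay self-contained it builds a tiny made-up demo file instead of reading the real interactions.csv.gz.

```python
import csv
import gzip
import os
import tempfile


def count_records(path, *terms):
    """Count data rows (header excluded) in a gzipped csv.

    A row is counted only if every term occurs in at least one of its fields.
    With no terms given, all data rows are counted.
    """
    count = 0
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        for row in reader:
            if all(any(term in field for field in row) for term in terms):
                count += 1
    return count


# Made-up demo rows; column names are illustrative only.
demo_rows = [
    ["source_taxon_name", "source_taxon_path", "target_taxon_name"],
    ["Fungus A", "Fungi | Fungus A", "Quercus one"],
    ["Fungus B", "Fungi | Fungus B", "Pinus one"],
    ["Insect X", "Animalia | Insect X", "Quercus two"],
]

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as handle:
        csv.writer(handle).writerows(demo_rows)
    total = count_records(demo_path)
    fungi = count_records(demo_path, "Fungi")
    fungi_and_quercus = count_records(demo_path, "Fungi", "Quercus")

print(total, fungi, fungi_and_quercus)
```

For the real file, call count_records("interactions.csv.gz", "Fungi", "Quercus") from the directory that contains the download. Note that this matches the terms anywhere in a row, which is a cruder filter than a query on specific taxon columns.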

Exercise 5. Visualizing all the things

Create a bipartite graph for oakfungi-sample.csv.

Now, create a bipartite graph for oakfungi.csv.

Compare the two visualizations and note the differences.

The bipartite R package contains various functions to quantitatively describe networks.

Exercise 6. Exploring Network metrics

Re-visit the bipartite vignette pdf.

In the vignette, look for the network, group, link, and species metrics (or indices).

(extra credit) Calculate some indices using the oakfungi-sample.csv or oakfungi.csv datasets.
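To get a feel for what such indices measure, two of the simplest can be computed by hand: connectance (a network-level index, the proportion of possible links that are realized) and degree (a species-level index, the number of partners per species). The Python sketch below is only an illustration on a made-up matrix, not a replacement for the bipartite package's own functions.

```python
def connectance(web):
    """Realized links divided by the number of cells in the matrix."""
    n_rows = len(web)
    n_cols = len(web[0]) if web else 0
    if n_rows == 0 or n_cols == 0:
        raise ValueError("web must have at least one row and one column")
    realized = sum(1 for row in web for cell in row if cell > 0)
    return realized / (n_rows * n_cols)


def degrees(web):
    """Number of partners for each row species and each column species."""
    row_degrees = [sum(1 for cell in row if cell > 0) for row in web]
    col_degrees = [
        sum(1 for row in web if row[j] > 0) for j in range(len(web[0]))
    ]
    return row_degrees, col_degrees


# Made-up interaction counts: rows are lower-level species, columns higher-level.
web = [
    [2, 0],
    [1, 1],
]

print(connectance(web))
print(degrees(web))
```

Here three of the four possible links are realized, so the connectance is 0.75.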

Now that we’ve tried a couple of ways to access, visualize, and analyze data, let’s reflect on how these methods fit into a research workflow.

Discussion

What are the benefits of using a whole dataset like GloBI’s interactions.csv.gz?

What are the benefits of using the GloBI Web API instead?

What method would you choose for your publication?

How would you cite / publish your results?

How would you assess the quality of the retrieved data?

What’s Next?

Please see the schedule for what’s next.

Key Points

Working with big datasets often requires different tools and skills

Data processing introduces errors and bias

Many tools are suited for small datasets only
