Welcome back to the Riffomonas Reproducible Research Tutorial Series. I hope you were able to make it through the last tutorial, in which we discussed how to use a single Bash script to automate our analysis. It took a little effort, but it was great to see how we could accidentally burn our project to the ground and, with the help of Git and Bash, rebuild it with minimal further input from us.
In that tutorial, we used Mothur to process the raw sequence data to generate files that we'll be using in this and future lessons. If you've never used Mothur before, you should definitely check it out. Other tools for analyzing 16S rRNA gene sequence data are out there, but they tend to require a lot of other dependencies. These dependencies can be really difficult to track down; this is frequently called dependency hell for those trying to reproduce someone else's analysis.
If you go to the Mothur GitHub repository, you can get any version of Mothur you want and it's a self-contained executable. Similarly, our wiki has all past versions of the SILVA, RDP, and Greengenes databases that will work well with Mothur. We've really gone out of our way to make Mothur a great tool for encouraging reproducible analyses. In addition, in Mothur we have a command called make.sra which has been a game changer in terms of helping people to post their sequence data to NCBI's Sequence Read Archive, also called the SRA. This is the place your sequence data should be deposited before you submit your manuscript.
In today's tutorial, we'll be discussing some of the best practices for analyzing your processed data using a scripting language. Here you have various options for tools that you can use; many people will use R or Python. I'll be discussing R, although again I don't expect you to know much about R for this tutorial.
Now join me in opening the slides for today's tutorial which you can find within the Reproducible Research Tutorial Series at the riffomonas.org website.
So before we get going and talking about scripting our analysis, let's remember what we talked about in the previous tutorial. So, take a couple of moments and think to yourself, maybe jot it down on a piece of paper or write it into a Word document.
What are some of the limitations of using tools with a graphical user interface, or GUI as they are called, for doing reproducible analysis? On the other hand, can you also think of some limitations of using bash scripts for doing reproducible data analysis? So a couple of thoughts come to mind when thinking about the limitations of both GUIs as well as using bash scripts.
So for GUIs, it can often be difficult to document all the mouse movements and toggles that need to be checked, as well as, you know, what you're entering into the formula bar, things like that. It's also difficult to maintain under version control, and so if you wanted to see where a bug was introduced, that's going to be really difficult.
And as we talked about in the previous tutorial, Git doesn't work very well with file formats like xlsx or docx files; it really excels at working with text files. What are some of the limitations of using bash scripts? So although I think bash scripts are the way to go, I'm willing to admit that there are certainly limitations to using bash scripts.
So first of all, there's a learning curve. You have to know bash, you have to know your way around the command line. In addition, the code can also be pretty cryptic. Most people know how to use Excel; one of the beauties of Excel, and why it's so popular, is that it's so easy to use that one of my kids who'd never seen a spreadsheet before could pick it up and figure out how it works.
If I gave them bash code to read, it might be harder for them to read and harder for them to figure out what's going on, and I think the same is true for anybody that's coming to replicate our analysis. And so, again, while the analysis might be reproducible because we can, you know, run our analysis driver bash script and recreate our project like we did at the end of the last tutorial, if somebody dug down and looked inside that file, they might be hard-pressed to tell you what exactly is going on with every line.
And so, again, there are ways around that. We can think about how we comment our code, we can educate people better, things like that, to improve the transparency of what's going on in those bash scripts. So, although there are strengths and weaknesses of both approaches, I'm going to tell you that the best for reproducibility is using a bash script, although hopefully you also appreciate that one of the limitations of using a bash script for driving our analysis is that it's kind of dumb, right?
There's no intelligence baked into it to tell the computer, to tell Linux, where you're at in the workflow. And so, you know, when we reran the bash script, it started from the very beginning again, even though some of those reference files had already been downloaded.
And so what we'll talk about in a couple of tutorials is a tool called Make which is similar in idea to our bash script but has built-in intelligence to figure out where we're at in our dependency tree, what things have already been downloaded, what things have already been processed, and so where we're ready to pick up in the analysis.
http://www.riffomonas.org/reproducible_research/scripting_analyses/
In today's tutorial, we're going to talk more about scripting our analyses. Whereas before we used bash to script, kind of pulling together elements from different software and different tools, in this tutorial we're going to look at a scripting language such as R or Python. We're going to develop a set of best practices that you can use for scripting the analysis of your data, we're going to define and identify one of those called the DRY principle, which is short for "Don't Repeat Yourself," and we'll apply tools that will make our analyses more reproducible when using random data.
We've already seen this a bit with using Mothur, but I didn't tell you about it. We'll also use some tools to protect ourselves and others from possibly encountering weird data, the idea of defensive programming. So to get you thinking about where we're going here, I want you to think about this case study where you've generated the visualization of an ordination diagram for your latest microbiome analysis and you've done it using a point-and-click package like Microsoft Excel or Prism.
So you've mimicked the boring shape and coloring scheme that was in the original Kozich paper, I think it was black and white circles, pretty nondescript. You show it to your PI, which gets them really excited, okay? PIs tend to do this, I know I do. And so the PI starts asking you questions: can you change the color and the shape of the plotting symbols?
Is there a way to plot this in 3D using three axes rather than in 2D with two axes? Can you plot...this is temporal data, so can we plot it in 3D where one of the dimensions is time? Could we perhaps think about how we could animate this over time? And so make a movie of how the murine microbiota was changing over time, okay?
So these are just a handful of questions that somebody might come up with. And pretty quickly, you might think, "Well, yeah, those would be awesome but I wouldn't know where to begin," and part of that is because tools like Excel or Prism lock you into doing your analysis their way.
Some of the other drawbacks of point-and-click methods like Excel or Prism are that they tend to be expensive. As a trainee, you might not realize this because someone else is paying for it, but these tools are really expensive. There's also centralized development: there is an Excel way to make a plot, there is a Prism way to make a plot, and if somebody comes along with a new way to visualize data, or something that you see that you want to incorporate into your analysis, well, if the folks at Microsoft don't incorporate that into Excel, you're kind of out of luck.
They also can't be automated. How difficult would it be to change the plotting symbols of your figure? In Excel, you're going to have to go back into Excel, click on the points, and then go somewhere to change the settings. It's not super easy, and then you're going to have to export that into a new document and perhaps do some type of staging using Illustrator, which is another really expensive tool. All of that, because it's not automated, lowers the likelihood that you could reproduce it. They're also not flexible; everything is kind of in their format, their way of doing things.
So in contrast, tools like R or, say, Python tend to be open source, which means that they're free. They're decentralized, which means that anybody can contribute code. You can look under the hood and see how it makes a plot, you can modify that code, you can add code to change how plotting is done. These are things that are not possible with something like Excel or Prism. They're also automatable, right? We can run R or Python from the command line, and so I could change a line of my code and, voila, that will change the plot without me having to muck around too much with formatting or things like that.
It's also very flexible, and you might get frustrated because it's perhaps overly flexible: there might be hundreds of parameter settings...maybe not hundreds, but a dozen parameter settings for things that you can change to alter the way that your plot looks. It's also extensible: you could take a simple plot in R and turn it into a gif to add animation, you could make an mp4 video of a plot, or you could incorporate HTML into it so that it becomes interactive.
This is one of the beauties of open-source tools: there are large numbers of data scientists out there, people like you and people like me, who are contributing code and making packages to make the language better and customizable to our needs.
So again, if we wanted to change an aesthetic like the plotting symbol, the plotting color, we can change the option value and rerun the script. We could incorporate this into our analysis driver bash file. It's pretty straightforward. We also have access to the entire color palette, so I don't know about you but when I go into Excel or any of the Microsoft software and I'm trying to pick a color, sometimes I can't remember what exact shade of blue I used because I'm trying to pick out of their kind of very categorical color palette.
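As a minimal, hypothetical sketch of what "change the option value and rerun the script" looks like in practice (the coordinates here are made up purely for illustration):

# Made-up ordination coordinates, purely for illustration
axis1 <- c(0.2, -0.1, 0.4, -0.3)
axis2 <- c(0.1, 0.3, -0.2, -0.4)

plot(axis1, axis2,
     col = "#1F78B4",   # a specific hex color; edit this value and rerun to recolor every point
     pch = 19,          # the plotting symbol; edit this value and rerun to change the shape
     xlab = "PCoA axis 1", ylab = "PCoA axis 2")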
Well, if I can give it a specific code for a specific color I want, then that's what I get, and that's possible with a scripting tool like R or Python. But people have, again, taken that simple hexadecimal color palette and riffed on it, and they've done really awesome, fun things.
And so one of those is here, the Wes Anderson palettes that are inspired by the movie The Royal Tenenbaums or The Darjeeling Limited or any of the Wes Anderson movies. My lab has also made one that corresponds to sports teams, so if I make a plot with blue and red, I can use the Chicago Cubs version of blue and red or University of Michigan's maize and blue, it's really adaptable.
If you're a Beyonce fan, there are color palettes inspired by various pictures of Beyonce and the clothes that she's wearing. This, again, seems a bit silly but it's fun, it shows you good colors that work well together, and it helps to make our plots look a little bit more attractive than basic red, green, blue, black color schemes.
People have also in R made packages to change plotting symbols, and so you can make a Catterplot where instead of using circles or squares or diamonds or what-have-you, you can put cats in different poses into your plots and people are doing this, again, with other types of symbols.
And, again, this is silly, I don't know that you would make a Catterplot for your next manuscript, but it shows you the flexibility that you can customize the plotting symbols, you can customize the colors to do anything you want. Go try to make a Catterplot in Excel, good luck. It also allows you to think differently. There are numerous packages in R, Python, and other languages for data visualization to do things like interactive plots and analysis, again, incorporating HTML and JavaScript to make your plots interactive.
And again, these are all generated by developers around the world in a myriad of fields, from, say, economics to microbiology, who had a need, solved the problem, and then made it available for others to use or improve upon. Again, this is the beauty of reproducible research: they made their methods available for others to then build upon.
One of the packages I like is called RGL which is a...I don't know what RGL stands for actually but it allows you to generate 3D visualizations and if you were to open this in R, you could put your mouse on the image and you could then spin the image around any way you want.
So some people get excited about making 3D plots in a 2D medium which I think is totally worthless, but here what I have is a gif of the visualization from the Kozich data analysis where I've got the red and blue balls and I have it animated to spin the ordination around and so through this gif, you now see it in three dimensions which is pretty cool.
We're not to the point where we're ready to submit this type of figure for a manuscript because most of our manuscripts are too tied to a PDF type format but, again, for a presentation, this type of gif and this type of visualization could be very attractive and allow people to see things in three dimensions.
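If you wanted to build something like that spinning ordination yourself, a rough sketch with rgl might look like the following; the data frame, its column names, and the output file name are all invented for illustration, and rendering the gif may require extra system tools such as ImageMagick.

library(rgl)

# Hypothetical data frame with three ordination axes and an early/late label
axes <- data.frame(axis1 = rnorm(20), axis2 = rnorm(20), axis3 = rnorm(20),
                   period = rep(c("early", "late"), each = 10))

# Plot the points in 3D; in an interactive session you can drag with the mouse to spin the view
plot3d(axes$axis1, axes$axis2, axes$axis3,
       col = c(early = "red", late = "blue")[axes$period],
       xlab = "Axis 1", ylab = "Axis 2", zlab = "Axis 3")

# Render the rotation out as an animated gif (play3d(spin3d()) just spins it on screen)
movie3d(spin3d(axis = c(0, 0, 1), rpm = 6), duration = 10, movie = "ordination_spin")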
Another example is the gganimate package that builds upon a series of plotting tools called ggplot and here is data from the famous Gapminder dataset looking at the life expectancy of different countries by their GDP per capita.
Each dot represents a different country, the color is according to the continent they're on, and the size of the point indicates the population and it's animating over five-year increments. So again, this is a really cool way...perhaps not perfect for every type of analysis but it's a really cool way to represent data and instead of seeing images as being static like they are, say, in Excel, we can now animate these images to get a better representation of, say, temporal patterns. So this plot got one, two, three, four, five different variables going on in a fairly simple and attractive way.
So again, that's fancy and all, but it's eye candy; that's an extra advantage of R or Python. Don't lose track of the fact that what makes an analysis more reproducible is that we can programmatically generate these plots.
With several lines of code, we're able to generate those very complicated plots and if we wanted to change something, we could alter the line of code, rerun the code, and get a new version of the plot. That's the advantage of the scripting languages, the eye candy is an extra benefit. So there's a difference between what I will call interactive versus programmatic analyses.
You can run R and other scripting languages in an interactive mode where you go into R and write commands at the console, and this is similar to, perhaps, Excel, where you go into Excel, start manipulating things, and generate a plot. Interactive use of R has many of the same limitations as Excel: unless you have a way to document the code you're entering at that command prompt within R, it's really no better or different than using Excel.
The difference between an interactive analysis and a programmatic one is that a programmatic one will use a script to store the commands for repeated and automated use. So the big difference I want to drive home to you is that if you're using R or Python to make plots, it's critical that you're recording those commands in some type of script file that we'll then be able to use as part of our analysis driver file.
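To make that concrete, here is a hedged sketch of what such a script might look like; the script name, the input axes file, and the output figure name are all invented for illustration.

# code/plot_nmds.R (hypothetical script name)
axes <- read.table("data/mothur/stability.nmds.axes", header = TRUE)   # hypothetical input file

png("results/nmds_figure.png")                                          # hypothetical output file
plot(axes$axis1, axes$axis2, col = "blue", pch = 19,
     xlab = "NMDS axis 1", ylab = "NMDS axis 2")
dev.off()

Because the commands live in that file rather than only at the console, the bash driver from the last tutorial could regenerate the figure at any time with something like Rscript code/plot_nmds.R.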
So we've already made a couple of scripts using bash and Mothur's syntax. Scripts are text-based files for automating analyses; they're a way to tell the computer, so to speak, how to make a paper airplane or something more useful. There are many formal languages out there for programming.
So we've already used bash, that's a type of language. I've mentioned R and Python. There are others like Perl, Julia, Go, Java, C/C++, Fortran, Pascal; these all have varying uses and varying popularity across the field, and I don't know all of them.
So which language should you choose? I don't know.
So I'll help you to narrow the list. I would tell you to choose between R or Python, okay? I know R pretty well, I can read Python, I can't code in Python. I know, maybe that's a disappointment to you. I really like R and I find it does everything I want.
The others are either significantly harder to use, not widely used, or not fully developed yet. If you go on social media, you'll see people getting excited about the latest, greatest programming language; resist the temptation to go run and play with that. I think it's best to pick one language to learn really well and get the most out of it that you can, and once you know that language pretty well, well, then go pick up a different language.
I would encourage you to also start asking questions as you're picking a language to use. What do your lab mates use? The people around you, your community, those other people on campus around you, what do they use? As someone that's a PI and has various people in the lab with different skill sets, it's really important and valuable to me for everybody to know the same language, and so my lab is an R lab; everybody in the lab uses R.
And the advantage then is that if you need to solve a problem or you need someone to check your code, you know who to ask. People in my lab, if they run into a problem, know that I know R and that their lab mates know R, so they can go to those people and get help from them. But if you're the only person on your campus using some new programming language, well, where are you going to get help?
Who's going to check it out? Who's going to help you to grow as a programmer?
So I would encourage you to find your community and to nourish it, to find those around you that are doing similar types of analysis, figure out what they're doing, and go with that. That is really the number one critical factor in picking a programming language to use.
The other thing I would tell you is to beware of brogrammers and so-called language wars. If you do a Google search for R versus Python say, you'll get numerous blog posts that are just flame wars about why one language is better than the other.
Those are horrible, I hate those blog posts because they're not productive and usually the points that they're differentiating between R and Python are so nuanced that a beginner isn't going to benefit from those. Yeah, I know there's problems with R but you know what? I know how to work around them and I can work around them a lot easier than, say, learning Python or learning a different language.
So pick a language where you've got a great community around you and go with that. The other thing about brogrammers is that these are bros, these are typically guys who are online and are very happy to tell you how smart they are and why you're doing it wrong. Be aware of that, but also be aware that they're not everybody in the community; there are brogrammers in the R community and brogrammers in the Python community, but they're really not representative of the community. And so, from me, I'm sorry if you run into those people, but no, that's not most of us who are using R or Python.
So like I said, in my lab and my institution, most of us are using R. There are people here using Python, that's great, we learn a lot from them, but the lingua franca in my lab is R. I really like the broader R community, a lot of great people, people that have helped me out as I've grown as a data analyst. I'm not going to try to teach you R here, there are other resources available for doing that. One I would point you to is my minimalR tutorial that's also at the riffomonas.org website.
Pretty much anything I'll say for R is also true for Python, okay? Again, there are nooks and crannies where there's differences but those frequently aren't super relevant for people getting going in programming. My goal here today in this tutorial is to demonstrate how we can programmatically analyze data, the tooling that we use to do that isn't super important.
So there's some general principles that I want to go over with you in this tutorial, some of these are specific to R but most of them, like I said, are pretty general. And so we're going to talk about worrying about the outcome, not the path, don't repeat yourself, defensive programming and testing, setting the random number generator seed, package versions, and then some R quirks to avoid.
So the number one thing that I tell people is that the beauty of a language like R is that there are many ways to achieve the same goal. That's also perhaps the biggest problem for many people: there is no one way to do something. And so I would tell you not to worry so much about whether you wrote the most efficient program; you might write an analysis that takes a hundred lines and I might write it in one, but if we get the same answer in the end, that's what we're after.
You can go back later and improve and tighten up the code. Your code that might be a hundred lines might be a lot better documented, mine might not be so well documented, it might be too clever by half. At the same time, your hundred lines of code is going to be a lot more difficult to maintain than, say, my one line of code. If your analysis is taking a long time to run, then worry about your efficiency.
There's a lot of tools out there and tricks for speeding up your R code and, again, beware of the language wars where people get off on telling you how much faster their language is than another language. What I find is a lot of those language war fights over speed, they're typically cooking the analysis to make their language look a lot better than another language, so beware.
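As a toy illustration of that outcome-over-path point (the count matrix here is made up), both of these approaches give the same totals, and either is fine to start with:

counts <- matrix(1:12, nrow = 3, ncol = 4)   # made-up count table: 3 OTUs x 4 samples

# The long way: loop over the samples
totals_loop <- numeric(ncol(counts))
for(i in seq_len(ncol(counts))){
  totals_loop[i] <- sum(counts[, i])
}

# The short way: one vectorized call
totals_vector <- colSums(counts)

all(totals_loop == totals_vector)   # TRUE - same outcome, different paths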
So the next principle is don't repeat yourself, we call this the DRY principle. So you might have lines of code that you use multiple times in a script or across scripts, so the DRY principle is that you take those lines that you're repeating and you put them together in a function or use them to define a variable and then whenever that code is present in your code, you'll replace that with a call to the function or a call to the variable.
And so that way then, instead of maintaining, say, five copies of a piece of code, you now only need to maintain one, and this significantly limits the headaches and difficulty in maintaining your code. One example of where I see this come up frequently is in defining the color and symbol schemes in plots: I might like to have simple variables to define my colors or my plotting symbols.
figure_colors <- c(early="red", late="blue")
figure_pchs <- c(early=19, late=19)
calc_shannon <- function(otu_count_vector){
  relative_abundance <- otu_count_vector / sum(otu_count_vector)
  shannon <- -1 * relative_abundance * sum(log(relative_abundance))
  return(shannon)
}
It also helps to have utility functions like this one here for calculating the Shannon index. It's a pretty simple function, I'll admit, but if I had to calculate the Shannon index three or four times across the data analysis, it sure would be nice to only have to worry about one version of it in case bugs creep into my code.
calc_shannon <- function(otu_count_vector){
  relative_abundance <- otu_count_vector / sum(otu_count_vector)
  shannon <- -1 * relative_abundance * sum(log(relative_abundance))
  return(shannon)
}
otus <- c(50, 40, 30, 20, 10, 5, 2, 1, 1, 1, 1, 1)
calc_shannon(otus)
 [1] 13.1012820 10.4810256  7.8607692  5.2405128  2.6202564  1.3101282
 [7]  0.5240513  0.2620256  0.2620256  0.2620256  0.2620256  0.2620256
And sure enough, when I use this function, I find an error. So here's my function, calc_shannon. I've defined some OTU counts, I then plug these into calc_shannon, and I get a vector of numbers out when Shannon should only give me one number; it should give me the diversity. And so if I look at my code, can you see where there's a bug? I see it here. I have -1 times relative abundance times the sum of the log relative abundance.
calc_shannon <- function(otu_count_vector){
  relative_abundance <- otu_count_vector / sum(otu_count_vector)
  shannon <- -1 * sum(relative_abundance * log(relative_abundance))
  return(shannon)
}
calc_shannon(otus)
[1] 1.769286
This relative abundance term is supposed to be inside the summation, it's supposed to be -1 times the sum of the relative abundance times the log of the relative abundance, okay?
To correct a bug in the calculation of the Shannon index or make it more efficient, I only need to change these lines of code - not where they're repeated
So, again, if I had this repeated five times across my code for my analysis, I'd have to be sure I updated it in every place. Instead, if I have calc_shannon, I only need to change it in one place, and that's what I've done here, and now I have the correct value coming out of calc_shannon. So, again, to correct the bug in the calculation of this index, or to make it more efficient, say, I only need to change these lines of code, not everywhere that it's repeated. I have fallen into the trap of repeating my code and finding that when I go back and fix a bug, I don't always replace every instance of that bug.
And so, really, having this principle in mind, the DRY principle, don't repeat yourself, really will make your life so much easier. So as your analysis grows, assume that you've got three figures from this murine microbiome study and you want to use the same coloring and symbol scheme for each figure.
So to make things compartmentalized, you put the code for each figure in a different file, and I've already shown that: when we looked at my GitHub repository, I had in my code directory files called Generate Figure 1, Generate Figure 2, Generate Figure 3, 4, and 5, right? And so I might have five scripts for generating figures, and three or four of those figures might use the same coloring scheme because I'm looking at, say, early and late time points in different ways, but I'd like to use the same coloring scheme across all my plots so that when a reader looks at them, they quickly know what's an early and what's a late time point.
As an example, assume that you have three figures and you want to use the same coloring and symbol scheme for each. To make things compartmentalized, you put the code for each figure in a separate R script.
You'd have to define figure_colors and figure_pchs in each R script. Not DRY.
So I would have to define figure_colors and figure_pchs in each R script, and that is not DRY. So alternatively, I could create a file in my code directory called utilities.R. This utilities file then would contain things that I'm going to use across multiple scripts, things like figure_colors or figure_pchs.
That way, in each of my five generate_figure (or whatever) scripts, at the top I could put the command source('code/utilities.R'), and figure_colors and figure_pchs would be brought into that script, which would then have that color scheme.
code/utilities.R file:

#code/utilities.R
figure_colors <- c(early="red", late="blue")
figure_pchs <- c(early=19, late=19)
Then you can run this in each of your scripts:
source('code/utilities.R')
Good practice to put this source command and any library loads at the top of the script
It really is a useful practice I think to put these source commands and any library loads at the top of the script. Again, the advantage of this is that your PI might say, "Red and blue are a bit too bright, why don't we use red and Dodger blue?" And so instead of having to change blue to Dodger blue in each of your five R scripts, you change it here and then all the five R scripts that source this file will now be using Dodger blue instead of the brighter blue color.
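As a hedged sketch of how one of those figure scripts might start (the script name, the input file, and the grouping column are invented for illustration):

# code/plot_figure_1.R (hypothetical script name)
source('code/utilities.R')   # brings in figure_colors and figure_pchs

axes <- read.table("data/mothur/stability.nmds.axes", header = TRUE)   # hypothetical file
period <- rep(c("early", "late"), length.out = nrow(axes))             # hypothetical grouping

plot(axes$axis1, axes$axis2,
     col = figure_colors[period],
     pch = figure_pchs[period],
     xlab = "NMDS axis 1", ylab = "NMDS axis 2")

To switch from "blue" to "dodgerblue", you'd edit code/utilities.R once, and every figure script that sources it would pick up the change the next time it runs.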
So as you do this and as you get good at using R to script your analyses, your lab mates might find that they really like some of the functions you've written, they really like how you've made some plots or various color schemes you've used and they want to use them.
Collaborators on your papers might really like some of the functions you've written as well.
Some unknown person on the other side of the world might really like something you've done as well.
Don't laugh, it happens. And so what do you do? Well, you can write a package. And so things like ggplot or RGL or gganimate or Wes Anderson or the Beyonce color palette, these are packages that other people have developed to share with others.
There's also been some grumblings within the R stats community that a package might be a useful tool for thinking about how to disseminate a reproducible data analysis. And so if you think about it, a package contains data and it contains functions for working with that data, and so the idea would then be that you could encapsulate all the data and code for a project into a package.
I tend to find that my projects are a bit bigger and more complicated than a typical package and so this really hasn't worked for me, but if you click on this link for making a package out of it, you might get some good ideas that really inspire your own reproducible research practices.
The next principle to think about is defensive programming and tests. So I am my own worst enemy; I do all sorts of dumb things that I'm telling you not to do here, or I ignore things that I'm telling you to do here, and I introduce all sorts of crazy things, and so it's best to be defensive, to anticipate craziness coming at you. In maintaining Mothur, Sara Westcott, who works with me, and I interact with a lot of users who just do weird things that we never anticipated.
And so we try to anticipate these weird things. If someone gives your function the wrong type of variable, say someone hands that Shannon function a string of text, maybe it should complain. Another problem that we frequently run into is dividing by zero. So when we calculate the relative abundance in that calc_shannon function, what should that function do if the denominator is zero, right?
So these are weird things that you might not anticipate happening, and it might not be super critical for your own personal use, but as you share your code with others or others grab your code, having defensive programming built into your code really helps. You can also use a package from R called testthat, T-E-S-T-T-H-A-T, that does automated testing to make sure that as you change your code, the behavior of your functions continues to work.
calc_shannon <- function(otu_count_vector){
  relative_abundance <- otu_count_vector / sum(otu_count_vector)
  shannon <- -1 * sum(relative_abundance * log(relative_abundance))
  return(shannon)
}
data <- c(-1,-2,-3,-4)
calc_shannon(data)
#[1] 1.279854
Hmmm. That's not good.
So again, as an example, we have our calc_shannon function and if I, say, gave it an OTU vector of data that were all negative numbers, somehow calc_shannon still gives me a value back and that's probably because when I calculate the relative abundance, the negative on top cancels the negative on the bottom and everything becomes positive.
calc_shannon <- function(otu_count_vector){
  if(all(otu_count_vector >= 0)){
    relative_abundance <- otu_count_vector / sum(otu_count_vector)
    shannon <- -1 * sum(relative_abundance * log(relative_abundance))
  } else {
    warning("One or more of the values were less than zero.")
    shannon <- NA
  }
  return(shannon)
}
data <- c(-1,-2,-3,-4)
calc_shannon(data)
#[1] NA
#Warning message:
#In calc_shannon(data) : One or more of the values were less than zero.
Better.
But perhaps calc_shannon should be smart enough to know that this is nonsense data; it should be smart enough to know that all of our counts should be positive. And so what we might do is run a test that asks whether all of our otu_count_vector values are greater than or equal to zero, okay? If they are, then we run those lines of code; otherwise, we send out a warning that says, "One or more of the values were less than zero," and it returns a Shannon value of NA, and so you see what the output here would look like.
if() ... else if() ... else - the final else is a catch-all for situations you can't anticipate
length and is.xxxxx functions (e.g. is.numeric, is.character, is.logical)
all and any
stopifnot - script stops if something is not true
testthat package for automating code tests

So again, these are defensive programming practices to test that you're getting the right input and that your code is doing the right thing. Some tools that we have, which I showed in that example with the Shannon calculator, are the "if," "else if," and "else" blocks; that final "else" is there to catch any situations you can't anticipate.
Again, users do weird things and you are your weirdest user. I also find that the "length" function as well as the "is" functions, things like is.numeric, is.character, and is.logical, are useful for testing to make sure that your functions are getting the right data. Functions like "all" and "any" are great, and there's a function called "stopifnot" which will stop your code if a certain condition is not met.
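As a hedged sketch of how a couple of these tools might be used with the Shannon function above (the checked function name and the tests are just for illustration):

# Defensive check up front: stop with an informative error on bad input
calc_shannon_checked <- function(otu_count_vector){
  stopifnot(is.numeric(otu_count_vector), all(otu_count_vector >= 0))
  relative_abundance <- otu_count_vector / sum(otu_count_vector)
  -1 * sum(relative_abundance * log(relative_abundance))
}

# Automated tests that can be rerun every time the function changes
library(testthat)
test_that("calc_shannon_checked behaves sensibly", {
  expect_equal(length(calc_shannon_checked(c(5, 3, 2))), 1)   # returns a single number
  expect_error(calc_shannon_checked(c(-1, -2, -3)))           # rejects negative counts
})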
And as I mentioned, you can test your code with known examples but it's better to automate those code tests using the testthat package. So the next thing to think about is setting a random number generator seed. A lot of the analyses we do use nonparametric statistics where we're using a random number generator to test the significance of our observations.
get_shared_otus.batch:

set.dir(input=data/mothur, output=data/mothur, seed=19760620)

19760620 (my birthday in ISO 8601 format)

And so you might remember from our get_shared_otus.batch file that on the top line we had set.dir, and there was an option of seed 19760620. Well, many of the functions in Mothur depend upon a random number generator, so when we do our clustering, there's randomness, when we do the NMDS, there's randomness, when we do rarefaction, there's randomness.
And so by setting a seed for the random number generator, we will still get pseudo-random numbers to come out of the random number generator but they will all be in the same order. And so if we run that analysis repeatedly with the same seed, we'll get the same results over and over again but if we don't set the seed, then we'll get variation.
And so I tell people to pick a consistent seed. I use my birthdate; I was born on June 20th of 1976. You might pick that or some other number, but don't try to hack your analysis by picking a seed that gives you the right result, that's bad. Pick a number, stick with it, use it across all your projects, and your results should really be insensitive to the seed that you pick.
> runif(5)
[1] 0.8870255 0.8145864 0.7129467 0.6600283 0.1066385
> runif(5)
[1] 0.1168198 0.5150428 0.1028263 0.9711794 0.7676322
> runif(5)
[1] 0.05080043 0.51089731 0.99899027 0.84039403 0.08163530
> set.seed(19760620)
> runif(5)
[1] 0.96847218 0.11941491 0.42317798 0.09717774 0.94767339
> set.seed(19760620)
> runif(5)
[1] 0.96847218 0.11941491 0.42317798 0.09717774 0.94767339
> set.seed(19760620)
> runif(5)
[1] 0.96847218 0.11941491 0.42317798 0.09717774 0.94767339
And so again, here's an example in R using the runif command. runif generates random numbers between 0 and 1, and if you give runif the number 5, it'll generate five numbers. So if we run runif three times, we'll get 15 different random numbers, but if I set a seed and then run runif, I get five random numbers, and if I set the same seed and run it again, I get the same five random numbers, okay?
So this makes the analysis much more reproducible. So, again, pick a number, stick with it, it could be your birthday, your anniversary, one, and stick to it across your analyses.
Another area to think about in terms of programming your analyses is to keep track of the package versions that you use. So packages change either through incorporating or deprecating options or by changing the underlying algorithms. Again, Mothur is on Version 1.40 and things have changed from Version 1 to Version 40. We get emails from people saying, "You know, I can't run this certain function," and we say, "Well, what version are you using?" And they say, "Well, I'm using Version 10."
It's like, well, you know, that was from, like, six, seven years ago, a lot's changed. And so if somebody used Version 10 for their analysis then, yeah, you might want to go back and use Version 10 to use the same software they were using back then. And so the same happens with any type of code that you use, so we need to know the version numbers of the code being used so that you and others can replicate the code and perhaps understand why there are differences between the versions.
The output from sessionInfo would ideally go into the general README file
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] knitr_1.15.10 rmarkdown_1.3

loaded via a namespace (and not attached):
 [1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   tools_3.3.2
 [5] htmltools_0.3.5 Rcpp_0.12.8     stringi_1.1.2   stringr_1.2.0
 [9] digest_0.6.12   evaluate_0.10
Or for specific packages
> packageDescription("dplyr")$Version[1] "0.5.0"
There's a function in R called sessionInfo whose output would ideally go into the general README file. So if you run sessionInfo, you're going to get output about the version of R you're using, the base packages you're using, and other packages that are getting loaded. If you want to know the specific version of a package, then you could say packageDescription, name of the package, $Version.
devtools package:

library(devtools)
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")
And so, again, these are useful tools for documenting the versions you use, and calling sessionInfo and copying and pasting the output into your README file would be really valuable. In R you can get old versions of packages using the devtools package, and this line, for example, would allow you to get an old version of ggplot2; again, you can get older versions of other packages the same way.
packrat package: Installs specific versions of packages in an isolated manner that can be transferred between people
Another package that's useful for this is called packrat. Packrat will install specific versions of packages in an isolated manner that can be transferred between people. So if you go into a project, instead of updating your package across your entire computer, it updates the package or uses a specific package within your single project and then those packages that you're using can be transferred and be replicated by others.
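A minimal sketch of how packrat is typically used, run from the project root (exact behavior may vary by version):

install.packages("packrat")
packrat::init()       # give this project its own private package library
# ...install and use packages as usual; they now go into the project's library...
packrat::snapshot()   # record the exact package versions used by the project
packrat::restore()    # someone else cloning the project installs those same versions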
A bunch of complex systems that only demonstrate how hard software versioning is (e.g. Docker)
There's also a bunch of complex systems that really only demonstrate, I think, how hard software versioning is. So there's the thing called Docker, we've talked a bit about Amazon Machine Images, as ways to keep track of versions that are being used in a data analysis workflow.
setwd - NEVER
attach - NEVER
.Rprofile - If you have one, it must accompany the project, and don't put much/anything in ~/.Rprofile
R --no-save --no-restore-data
There are also some R quirks to appreciate because of their effects on reproducibility. There are a couple of functions in R that you should never use; I think the original R developers regret putting these into R. These are setwd and attach. setwd is the one you'll see more frequently; it's basically a way of changing directories within R.
So, again, we want to use relative paths relative to the root of your project directory and so when you write your R code, assume that you're running that R code from your root of the project directory, don't use setwd to move in. Just like we don't want to use cd to move into different directories using bash, we don't want to use setwd to change our working directory.
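In other words (the absolute path and file name below are made up for illustration), prefer the second form and run R or Rscript from the project root:

# Avoid this: hard-codes one person's machine and hides a directory change
# setwd("/Users/pat/my_project")
# axes <- read.table("stability.nmds.axes", header = TRUE)

# Prefer this: a path relative to the project root, run from the project root
axes <- read.table("data/mothur/stability.nmds.axes", header = TRUE)   # hypothetical file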
There's another hidden file called the .Rprofile file. R will run the code in that file when it starts, so you can hide stuff in there. Our project has a .Rprofile file that has some stuff in it, so if you give people your R code, you need to be transparent about what's in that .Rprofile file, because any R code that you're running as part of your project is also running what's in that .Rprofile file.
And so you'll see in our repository that the .Rprofile file is part of what we're keeping under version control. Also, when you run R, do not save on quitting. When you're running R in an interactive mode, it will ask if you want to save.
You do not want to save your workspace, and you do not want to restore it when you start R. That's because you might be running an analysis in which you define a variable A, then quit and save; when you start R again and it restores the data, A is still alive and might have a value that you don't anticipate. Or, if you give me your code, I might not have A, or might not have the same version of A that you have.
So it's really best to not save the data from your session and not to restore the data from a previous session. That saving and restoring really limits the ability to reproduce other people's analyses. There's also other R tooling that's useful for helping with reproducibility.
There are R functions and packages that allow you to interact with files from other software packages like Excel. So I'll get Excel files from my collaborators and, again, I want to keep my raw data raw; I don't want to muck with that Excel file. Well, there are R packages that allow me to read in data from an Excel file to then work with in my data analysis, so I don't have to mess with that Excel file or modify it in any way.
There are also tools for checking your metadata formatting to make sure that if you have a date it's properly formatted, or that if you have weights of individuals you don't have negative weights or weights that suggest somebody weighs three tons. And there are other tools for plotting, all sorts of, you know, amazing stuff that allows you to generate all sorts of different types of data visualizations.
readr package for importing data from a variety of text formats
readxl package for reading and writing Excel files
haven package for reading in SAS, SPSS, and Stata data files
googlesheets package for accessing and managing Google spreadsheets
Packages for accessing web content and APIs (httr, rvest, rentrez)
So again, in thinking about raw data staying raw, we have a variety of read commands in R for importing data from other formats. As I've mentioned, openxlsx allows you to read in Excel files, there's a haven package for reading things in from SAS or other statistics software, there's a googlesheets package for allowing you to access and manage Google spreadsheets, and there's a variety of tools and APIs for accessing web content.
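For example, reading a collaborator's spreadsheet without ever touching the original file might look something like this sketch with the readxl package (the file name is hypothetical):

library(readxl)
# read the first worksheet; the collaborator's xlsx file itself is never modified
metadata <- read_excel("data/raw/collaborator_metadata.xlsx", sheet = 1)
head(metadata)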
rOpenSci is a group of R developers who are working with public databases to give you access from within R, so I can use this last one, rentrez, to do searches of GenBank and PubMed within R, and that's a really beautiful way to interact with other databases.
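A minimal sketch of what that looks like with the rentrez package (the search term is just an example):

library(rentrez)
# search PubMed and pull back the IDs of the matching records
res <- entrez_search(db = "pubmed", term = "16S rRNA gene sequencing")
res$ids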
So if we're checking metadata formatting...I love this example from Christie Bahlai who I've mentioned in an earlier tutorial where she did an analysis and she did a statistical test to find that corn is different from corn but where corn in the second case somehow had a space added after it and R thought that those were two different variables, okay?
So this is a fun hashtag, "otherpeoplesdata," if you feel like you've got it bad, go check out what people are posting under this hashtag and you see all sorts of crazy things that people are accidentally doing. And so some tools that we can use for checking metadata include using the "summary" function to make sure that we only have corn, we don't have corn-space, that things are in the right range, that things are the right data type.
summary(data$crop)
table(data$crop)
gsub('corn ', 'corn', data$crop)
validate package allows you to check_that variables meet certain criteria
You can also use "table" as a function to check for the number of different types and their frequency. You can use things like "gsub" to clean up the text. gsub is a function for finding and replacing text and so, again, if you have programmatically listed out your steps for cleaning up the data, then you can rerun that code without touching your raw data.
So Christie could run this gsub command on her data set and, without changing her raw data, the typo is preserved in the raw file but corrected for in her analysis. And as I mentioned, there's a package called validate that allows you to check that variables meet certain criteria. And then you can also use Google Forms to control data entry, and then, using some of the googlesheets documentation, you can use R to interact with Google Forms as well as with Google Spreadsheets.
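A rough sketch of what such a check could look like with the validate package; the column names and the weight limits here are hypothetical:

library(validate)
# flag any rows where the crop label or the weight falls outside the expected values
checks <- check_that(data, crop %in% c("corn", "soy"), weight_kg > 0, weight_kg < 500)
summary(checks)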
From within R
> source('code/plot_nmds.R')
From command line
$ R -e "source('code/plot_nmds.R')"
So, great, how do we run R outside of R? Well, we can...you know, from within R, we could say like we did earlier, source code/plot_nmds.R, we just saw this with running that utilities file. From the command line, we could type R, -e for execute, and then this line of code, source code/plot_nmds.R.
Another example
$ R -e "source('code/utilities.R'); calc_shannon(c(1,1,1,1,2,4,5,6,7))"
We could stitch together multiple R commands with a semicolon.
What are we doing again? We've talked a lot about using R and the wonderful things we can do with R and, of course, to be fair, you know, many of these things...pretty much all of these things we can do with Python and other languages as well. But our goal here is to generate code that is robust to errors that we might incorporate or others might incorporate so that the analyses can be replicated, but also to make code that's better and more reproducible.
And so by automating our code, we can make that code in our overall analysis much more reproducible.
################################################################################
#
# plot_nmds.R
#
# Here we take in the *.nmds.axes file from the mouse stability analysis and
# plot it in R as we did in Figure 4 of Kozich et al.
#
# Dependencies: 2-D axes file generated by the nmds command in mothur
# Produces: results/figures/nmds_figure.png
#
################################################################################

plot_nmds <- function(axes_file){

    axes <- read.table(file=axes_file, header=T, row.names=1)
    day <- as.numeric(gsub(".*D(\\d*)$", "\\1", rownames(axes)))

    early <- day <= 10
    late <- day >= 140 & day <= 150

    plot_axes <- axes[early | late, ]
    plot_day <- day[early | late]
    plot_early <- early[early | late]
    plot_late <- late[early | late]

    pch <- vector()
    pch[plot_early] <- 21
    pch[plot_late] <- 19

    output_file_name <- "results/figures/nmds_figure.png"
    png(file=output_file_name)

    plot(plot_axes$axis2~plot_axes$axis1, pch=pch,
         xlab="PCoA Axis 1", ylab="PCoA Axis 2")

    legend(x=max(plot_axes$axis1)-0.125, y=min(plot_axes$axis2)+0.125,
           legend=c("Early", "Late"), pch=c(21,19))

    dev.off()
}
Again, I didn't mean for this to be a tutorial on how to program in R, but I wanted to highlight some of the tooling and approaches that you might take to make your analyses more reproducible and more robust. At long last, we're going to go back into AWS and we're going to add some R code as we close out this session, so I'll go to AWS.
And hopefully, you've gotten accustomed to doing this like second nature. Grab the IP address. And we're in, great.
So we'll cd into Kozich, and I'm going to create a new file in code called plot_nmds, so I'll do nano code/plot_nmds.R, and I'm going to copy this code in here and save it.
Add a line to the analysis_driver.bash script to run the R file from the command line
Add comments to code/plot_nmds.R and add one or two things to make the code more defensive (e.g. confirm number of columns is 2; what if the names don't have a "D"?)
And to run this, I want to double check that I know what the dependencies are and what it produces, so the dependencies are a 2-D axes file generated by the nmds command in Mothur.
And so when I run plot_nmds, I need to give it the name of that axes file, it's going to produce a file called results/figures/nmds_figure.png. Great, so this will be very useful information as we put this into our analysis_driver.bash file, so I'll save this and come out.
And so I will go to the bottom of this file and I'll say, "Construct NMDS png file," and I will then do R -e source('code/plot_nmds.R'). And then I need to call this function, so plot_nmds, parentheses, and then in single quotes, I'm going to put the name of the NMDS file and then that's going to end in double quotes.
But I forgot the name of the file, so let's save this and back out and then do ls data/mothur and then it ends in nmds.axes. And so here is the horribly long file name, so I will copy this and I will go back into my analysis_driver.bash file, come back to the end, and inside those single quotes, I will paste that name of the file.
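Put together, the lines added to the end of analysis_driver.bash look roughly like this, where LONG_FILE_NAME is a stand-in for that horribly long mothur-generated name:

# Construct NMDS png file
R -e "source('code/plot_nmds.R'); plot_nmds('data/mothur/LONG_FILE_NAME.nmds.axes')"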
And so I'm going to copy this line because I don't want to run the entire analysis_driver.bash file, I just want to run this line, and so I will quit out, I will save it, and I will run that. It says there's an unexpected end of input and I think that's because I have some weird spacing going on when I copied and pasted, so I will get rid of that space and run it again.
Oh, there's another space up here; these are the problems with copying and pasting. You know what? I'm going to show you a different way to do this. A useful command as you're building your analysis_driver.bash file, if you want to do this copying and pasting, is tail, so if we do "tail analysis_driver.bash," we get the last few lines of our file and so I can copy this now.
I think copying and pasting from within nano is introducing some weird line breaks. So I can paste this in, it runs, and it looks like it built some type of plot; we remember from the header of our plot_nmds.R file that it put the figure into results/figures/nmds_figure.png.
And so if you forget the path, what I'm doing is if I hit Tab twice, it'll tell me the directories that are within results, so then I could do results, figures, hit Tab twice, it shows me what's there: nmds_figure.png. But I don't know how to open that, do you remember how to open that?
Right, we can use FileZilla, so let's see if we can use FileZilla to open up this nmds_figure file to see what it looks like. So I'm going to come into my applications and, for some reason, I don't have this in my dock, but I'm going to go ahead and open FileZilla.
And I want to connect and I need to put in my new IP address in here and so to do that, I need to come over to the EC2 and highlight this and copy and then go back to FileZilla and paste it in here and then say, "Connect," "OK."
And there is my Kozich directory. Double-click on that, we said it was in "results," "figures." Here's the moment of truth, how's it look? It looks great, so that's what the PNG file looks like, okay? And so, again, we can use FileZilla to access the files and move things back and forth and carry on.
So I'm going to stop here with FileZilla and AWS and so I'm going to close all these things and I'm going to go back to my terminal window and do exit, exit, exit, and over here, I'm going to stop my instance.
Actually, you know what? Before I stop my instance, I forgot to do something really important, which was to commit my changes. So I'm going to log back in, cd into Kozich, and do git status. I see that I've got some other weird things going on in here now that I didn't anticipate, things like code.plot.nmds.swp.
I'm going to delete that because I don't...it's not under version control, I'm not using it, I'll do rm that. And if I do git status...I'm not sure what that R directory is or that R file is, so I'm going to add that to my gitignore and up here I'm going to add R.
Quit that, git status, and so now I see I've got four files so I'm going to git add gitignore analysis_driver.bash code/plot_nmds.R results/figures/nmds_figure, git status. So those are all ready to be committed, I can do git commit, "Generate NMDS figure," git status, we're good to go, and I'll go ahead and git push and add my credentials.
Now we're ready to quit, so exit, exit, and I will then come back to "Actions" here, "Instance State," "Stop," "Yes, Stop." So we've already done the first exercise I was planning on asking you to do, adding a line to the analysis_driver.bash script to run the R file from the command line, and we did commit it.
So what I'd like you to also do is to add some comments. You did this previously in an earlier tutorial, so add those comments to the code/plot_nmds.R file and add perhaps one or two things to make the code more defensive. Perhaps you could confirm that the number of columns in the file is two, or think about what happens if the names don't have a "D" in them, and then run the code and commit the new file and changes.
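If you'd like a starting point, here's a sketch of the kind of defensive checks the exercise is asking about; this is just one possible approach, and the helper function name is made up:

# a hypothetical helper that validates the axes file before plotting
read_axes <- function(axes_file){
    axes <- read.table(file=axes_file, header=TRUE, row.names=1)

    # confirm the nmds axes file has exactly two axes (columns)
    stopifnot(ncol(axes) == 2)

    # sample names are expected to end in "D" plus a day number; names that don't
    # match become NA after the conversion, so stop with a clear message
    day <- suppressWarnings(as.numeric(gsub(".*D(\\d*)$", "\\1", rownames(axes))))
    if(any(is.na(day))){
        stop("Some sample names do not end in 'D' followed by a day number")
    }

    list(axes=axes, day=day)
}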
I hope you now have a better understanding for why using tools like R or Python can help make your analyses more reproducible than, say, using a tool like Microsoft Excel. Beyond reproducibility, I love using R because there's a community of wonderful data scientists who are constantly striving to make the language better by expanding the tools we have for working with data and for making cool data visualizations.
If you're interested in learning more about R, I would suggest that you check out my minimalR tutorial which is also available on the riffomonas.org website. There's an R package that has been a total game changer for my research group, it's called R Markdown. As we'll see in the next tutorial, this is the way to blend written text with R code.
Have you ever updated your analyses and then found that you need to update all the p-values and summary statistics, or perhaps you have a table with a bunch of numbers in it that you have to update as well? That can be really tedious, right? It's also prone to a lot of errors. R Markdown is a way to avoid all that tedium and error. I can't wait to share with you how R Markdown has been so instrumental in improving the reproducibility of the manuscripts coming out of my research group.