Hi. My name is Pat Schloss. I'm a professor in the Department of Microbiology and Immunology here at the University of Michigan. I'm really excited about a new project that I've been working on that deals with reproducibility and how we can use various tools to improve the reproducibility of our analyses.
These tutorials will be available at our website as a series of slide decks. Over the next few weeks, I'll be releasing videos in which I talk you through the slides and do demos of the tools and the practices I discuss. I've previously taught the tutorials that I will be presenting at a variety of workshops. These workshops have gone pretty well, and I've used the feedback from those sessions to remove and add content to make them better.
Like I said, I'm really excited to share them with you. If you have any comments, please don't hesitate to contact me or to leave a comment on the YouTube website. Perhaps we can circle back to some of these questions and comments at a future tutorial to answer any questions people might have. The title of this project is the Riffomonas Project.
The name comes from the practice of music where one musician takes a theme from themselves or others to vary it, layer things on top of it, and perhaps present it in a different context. Similarly, I hope to show that one of the benefits of focusing on making your research more reproducible is that if others can reproduce your work, they will be able to use your methods on their data, or perhaps their methods on your data to help move science forward.
If you will, scientific riffing. My hope is that we can approach reproducibility from a positive perspective rather than a negative one. Instead of seeing research as being in a state of crisis where it is perceived that people are doing garbage science, perhaps we could see reproducibility in a positive way.
We can appreciate the effort that scientists go through to make their research more reproducible and fodder for our own scientific riffing. The concepts and tools that I'm going to be talking about are pretty general and can be used in a variety of types of data analysis. Because I'm a microbiologist who's interested in the role that bacteria play in shaping human health and disease, I'm going to use examples throughout the series that are taken from microbial ecology and the human microbiome literature in particular.
But I'll also use a series of goofy examples, including folding paper airplanes and predicting people's ages based on their names. There should be something here for everyone interested in making their data analysis more reproducible. Come along with me, and let me show you the project's website where you can find the slides that we'll be using in this series, and begin to introduce you to the content in the tutorial series that we'll be doing over the next few weeks.
Wonderful. So the first thing that I want to introduce you to about the Riffomonas Project is our website. You can get to the Riffomonas website by going to http://www.riffomonas.org. This will be a launch pad for disseminating different instructional materials and different bits of information about how we can improve the reproducibility of our research in microbiology, and eventually, perhaps other fields as well.
If you look up here at the top navigation bar and click on Training modules, you'll see that currently there are two different training modules in here. The first is a module called minimalR, and this is an R tutorial that I regularly teach from to help people get up to speed with R, assuming that they know nothing as they start, but also getting them going without overwhelming them with a lot of features and jargon off the bat.
It really is the minimal R, and people find over the course of doing several of these modules within the minimalR series that, very quickly, they get up to speed on a diverse array of features within the R programming language. But what we're here to talk about is the Reproducible Research tutorial series.
And as it says here in the intro, this is a series of tutorials on improving the reproducibility of data analysis for those doing microbial ecology research. Now the data set that we're going to be working with a lot through this series is from human microbiome research.
That's really not that relevant. What's important is that we're working with microbiome data, microbial ecology data, sequence data, really any type of data, but in particular data that needs to be analyzed through a complex series of steps. And we're going to think about, "How could we make these analyses more reproducible?"
Again, because my group developed the mothur software package, we're going to be using mothur. It's not a requirement that you know mothur or R, but it would certainly help as you move through these tutorials. I'm not going to be teaching you R or mothur in this series of tutorials, okay? What we're going to learn are a series of practical tools, but also concepts and ways of thinking about reproducibility and the factors that impede our ability to carry out reproducible research.
And so we'll use tools like the bash command line, we'll use high-performance computing clusters, we'll talk about scripting languages like mothur and R, we'll use version control, specifically Git and GitHub, we'll talk about automation using a tool called Make, and we'll cover a concept that has really been a big contributor to my research group, which is literate programming using R Markdown.
And so, again, these are things that my lab is using. I created these tutorials initially to onboard people coming into my research group, and so now my goal is to give them to you to help onboard you, so to speak, into the area of microbiome research and making it reproducible. And so before we launch into the initial tutorial, I want to call your attention down here to these dependencies. A big pain in making analyses reproducible is, "What software are we using? Do we have the right versions of software?"
And we'll talk about all these things as we go along. But as we get into the computational aspects of this series, we will primarily be using Amazon Web Services. The cost will be fairly minimal, but you might also want to try this on the local high-performance computing facility at your institution.
Here at Michigan, we've got one called Flux. You might have one at your institution. Generally, the cluster at your home institution is a bit cheaper than Amazon. Regardless, once you get there, you're going to need certain types of software. Alternatively, you could also run it on your laptop without going to one of these clusters.
But I think in the long run, it's going to be worth your effort to learn how to use these other computing resources like Amazon, or your local computer cluster. And so if you're going to do it locally, or on your local high-performance computer cluster, you're going to need tools like R, Make, Git, Wget, and Atom or Nano installed.
Okay? Part of my justification for using Amazon, for example, is that all these tools are generally already installed, whereas if you're trying to bring this up on your own laptop or your own computer, there's a bit of frustration in getting going there.
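If you do set things up locally, a quick check of which of those tools are already on your PATH can save some of that frustration. The following is only a minimal sketch, not part of the tutorial materials: the executable names (`R`, `make`, `git`, `wget`, `atom`, `nano`) are assumptions about how the tools are installed on your system, and the `missing_tools` helper is a hypothetical name used for illustration.

```python
import shutil

# Executable names are assumptions; they may differ on your system.
REQUIRED = ["R", "make", "git", "wget"]
EDITORS = ["atom", "nano"]  # any one of these editors is enough


def missing_tools(required=REQUIRED, editors=EDITORS):
    """Return the names of tools that cannot be found on the PATH."""
    # shutil.which returns None when no matching executable is found.
    missing = [name for name in required if shutil.which(name) is None]
    # Only one editor is needed, so report the group if none is present.
    if not any(shutil.which(name) for name in editors):
        missing.append(" or ".join(editors))
    return missing


if __name__ == "__main__":
    gaps = missing_tools()
    if gaps:
        print("Not found on PATH:", ", ".join(gaps))
    else:
        print("All expected tools were found.")
```

This only tells you whether a tool is present, not whether it is the right version, which is a separate question.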
So let's go ahead up here to the tutorials and let's click on this first link for the Introduction. This is going to be the format of the slides I will be using in this tutorial. These are HTML-based slides, and if you want, you can go ahead and hit the F key and that will give you a full-screen view.
Also, as it says down here in the lower left corner, you can press the H key to open the Help menu to help you identify how to navigate around the slides. Hit Esc to get out of that. One shortcut that you might find useful is P. Hitting P brings you to this Presenter View.
And so you might be familiar with this from, say, a tool like PowerPoint where on the left side, we have the slides I'm going to talk about, and on the right side are going to be the notes. And so if you want to follow along, you can see the notes that I'm using here. But I'm going to go out of the Presenter View, and I'm going to go ahead and open this up. All right. So let's get going.
So the goals of this introductory session are to be relatively light and to help orient you to where we're going to go. And so I'm going to summarize the motivations for this series, and I'm going to help you to understand where the tutorial is going and what we're going to get out of it. And then I'm going to give you some preliminary readings. There are a number of them, five or six papers, but none of them are super long, and they're all meant to provoke you into thinking about reproducibility and your own practices.