```R
calc_shannon <- function(otu_count_vector){

	if(all(otu_count_vector >= 0)){
		relative_abundance <- otu_count_vector / sum(otu_count_vector)
		shannon <- -1 * sum(relative_abundance * log(relative_abundance))
	} else {
		warning("One or more of the values were less than zero.")
		shannon <- NA
	}

	return(shannon)
}

data <- c(-1,-2,-3,-4)
calc_shannon(data)
#[1] NA
#Warning message:
#In calc_shannon(data) : One or more of the values were less than zero.
```

Better.

???

But perhaps `calc_shannon` should be smart enough to know that this is nonsense data; it should be smart enough to know that all of our counts should be positive. And so what we might do is run a test that asks whether all of our `otu_count_vector` values are greater than or equal to zero, okay? If they are, then we run those lines of code; otherwise, we send out a warning that says, "One or more of the values were less than zero," and the function returns a Shannon value of `NA`. And so you can see what the output here would look like.

---

## Defensive programming and testing: Tools

* `if() ... else if() ... else` - the final `else` is a catch all for situations you can't anticipate
* `length` and `is.xxxxx` functions (e.g. `is.numeric`, `is.character`, `is.logical`)
* `all` and `any`
* `stopifnot` - script stops if something is not true
* Test code with known examples
* [`testthat` package](https://github.com/hadley/testthat) for automating code tests

???

So again, these are defensive programming practices to test that you're getting the right input and that your code is doing the right thing. Some of the tools I showed in that example with the Shannon calculator are the `if` blocks: `if`, `else if`, and `else`; that final `else` is there to catch any situations you can't anticipate. Again, users do weird things and you are your weirdest user. You'll also find that the `length` function, as well as the `is.` functions (things like `is.numeric`, `is.character`, `is.logical`), are useful for testing to make sure that your functions are getting the right data. Functions like `all` and `any` are great, and there's a function called `stopifnot` which will stop your code if a certain condition is not met. And as I mentioned, you can test your code with known examples, but it's better to automate those code tests using the `testthat` package; a short sketch using these tools appears below.

So the next thing to think about is setting a random number generator seed. A lot of the analyses we do use nonparametric statistics where we're using a random number generator to test the significance of our observations.

---

## Setting random number generator seed

* Some functions make use of random numbers to perform an analysis
* Making use of the ability to set the random number generator seed will ensure that you get the same randomness each time
* Remember `get_shared_otus.batch`?

```bash
set.dir(input=data/mothur, output=data/mothur, seed=19760620)
```

* This set the mothur random number generator to `19760620` (my birthday in [ISO 8601 format](https://en.wikipedia.org/wiki/ISO_8601))

???

And so you might remember from our `get_shared_otus.batch` file that at the top line we had `set.dir`, and there was an option of `seed=19760620`. Well, many of the functions in mothur depend upon a random number generator: when we do our clustering, there's randomness; when we do the NMDS, there's randomness; when we do rarefaction, there's randomness.
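As an aside, here is the sketch of those defensive tools that I mentioned: a version of `calc_shannon` that uses `stopifnot` instead of a warning, plus a couple of `testthat` checks. This is only an illustration; the checks and test values are not from the project code, and it assumes the `testthat` package is installed.

```R
# A sketch of a more defensive calc_shannon; stops instead of warning.
# The specific checks and test values below are illustrative only.
calc_shannon <- function(otu_count_vector){
	stopifnot(is.numeric(otu_count_vector),
	          length(otu_count_vector) > 0,
	          all(otu_count_vector >= 0))

	relative_abundance <- otu_count_vector / sum(otu_count_vector)
	-1 * sum(relative_abundance * log(relative_abundance))
}

# Automated checks with testthat (assumes the package is installed)
library(testthat)
expect_equal(calc_shannon(c(1, 1, 1, 1)), log(4))   # known example: 4 equal counts
expect_error(calc_shannon(c(-1, -2, -3, -4)))       # nonsense data should fail loudly
```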
And so by setting a seed for the random number generator, we will still get pseudo-random numbers out of the random number generator, but they will all come out in the same order. And so if we run that analysis repeatedly with the same seed, we'll get the same results over and over again, but if we don't set the seed, then we'll get variation. And so I tell people to pick a consistent seed; I use my birthdate, so I was born on June 20th of 1976. You might pick that or some other number, but don't try to hack your analysis by shopping for a seed that gives you the result you want; that's bad. Pick a number, stick with it, use it across all your projects, and your results should really be insensitive to the seed that you pick.

---

## Setting random number generator seed in R

```R
> runif(5)
[1] 0.8870255 0.8145864 0.7129467 0.6600283 0.1066385
> runif(5)
[1] 0.1168198 0.5150428 0.1028263 0.9711794 0.7676322
> runif(5)
[1] 0.05080043 0.51089731 0.99899027 0.84039403 0.08163530
> set.seed(19760620)
> runif(5)
[1] 0.96847218 0.11941491 0.42317798 0.09717774 0.94767339
> set.seed(19760620)
> runif(5)
[1] 0.96847218 0.11941491 0.42317798 0.09717774 0.94767339
> set.seed(19760620)
> runif(5)
[1] 0.96847218 0.11941491 0.42317798 0.09717774 0.94767339
```

* The general gist of your analysis should be insensitive to the random seed
* Pick a number (e.g. birthday, anniversary, 1) and stick to it across your analyses

???

And so again, here's an example in R using the `runif` command: `runif` generates random numbers between 0 and 1, and if you give `runif` the number 5, it'll generate five of them. So if we run `runif` three times, we'll get 15 different random numbers. But if I set a seed and then run `runif`, I get five random numbers, and if I set the seed and do it again, I get the same five random numbers, okay? So this makes the analysis much more reproducible. So, again, pick a number and stick with it; it could be your birthday, your anniversary, the number one; and stick to it across your analyses.

---

## Package versions

* Packages change either through incorporating or deprecating options or changing underlying algorithms
* Need to know version numbers of code being used so that others (you!) can replicate your work or understand why there are differences

???

Another area to think about in terms of programming your analyses is to keep track of the package versions that you use. Packages change either through incorporating or deprecating options or by changing the underlying algorithms. Again, mothur is on version 1.40, and a lot has changed between those versions. We get emails from people saying, "You know, I can't run this certain function," and we say, "Well, what version are you using?" And they say, "Well, I'm using version 10." It's like, well, you know, that was from six or seven years ago; a lot's changed. And so if somebody used version 10 for their analysis, then, yeah, you might want to go back and use version 10 to use the same software they were using back then. The same happens with any type of code that you use, so we need to know the version numbers of the code being used so that you and others can replicate the work and perhaps understand why there are differences between the versions.
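To make the seed point concrete, here is a minimal sketch of the kind of randomization test I have in mind; the data and group labels are made up purely for illustration. Because the seed is set first, the permutations, and therefore the p-value, are identical every time the script runs.

```R
# Illustrative only: made-up weights for two made-up groups
set.seed(19760620)

weights <- c(20, 22, 25, 19, 30, 33, 31, 29)
group   <- c("early", "early", "early", "early", "late", "late", "late", "late")

# Observed difference in group means
observed <- diff(tapply(weights, group, mean))

# Null distribution from shuffling the group labels
permuted <- replicate(1000, {
	shuffled <- sample(group)
	diff(tapply(weights, shuffled, mean))
})

# Same seed -> same permutations -> same p-value on every run
mean(abs(permuted) >= abs(observed))
```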
---

## Tooling

The output from `sessionInfo` would ideally go into the general `README` file

```R
> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X El Capitan 10.11.6

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] knitr_1.15.10 rmarkdown_1.3

loaded via a namespace (and not attached):
[1] backports_1.0.5 magrittr_1.5    rprojroot_1.2   tools_3.3.2
[5] htmltools_0.3.5 Rcpp_0.12.8     stringi_1.1.2   stringr_1.2.0
[9] digest_0.6.12   evaluate_0.10
```

Or for specific packages

```R
> packageDescription("dplyr")$Version
[1] "0.5.0"
```

???

There's a function in R called `sessionInfo` whose output would ideally go in the general README file. So if you run `sessionInfo`, you're going to get output about the version of R you're using, the base packages you're using, and the other packages that are getting loaded. If you want to know the specific version of a single package, then you could say `packageDescription("package name")$Version`.

---

## Tooling for package versions

* Install old versions of packages using the [`devtools` package](https://github.com/hadley/devtools)

```R
library(devtools)
install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")
```

???

And so, again, these are useful tools for documenting the versions you use, and calling `sessionInfo` and copying and pasting its output into your README file would be really valuable. In R you can get old versions of packages using the devtools package; this line, for example, would allow you to install an old version of ggplot2.

--

* [`packrat` package](https://rstudio.github.io/packrat/): Installs specific versions of packages in an isolated manner that can be transferred between people

???

Another package that's useful for this is called packrat. Packrat will install specific versions of packages in an isolated manner that can be transferred between people. So when you go into a project, instead of updating a package across your entire computer, it updates or uses a specific package version within that single project, and then the packages you're using can be transferred and replicated by others.

--

* A bunch of complex systems that only demonstrate how hard software versioning is (e.g. [Docker](https://www.docker.com))

???

There's also a bunch of complex systems that really only demonstrate, I think, how hard software versioning is. There's the thing called Docker, and we've talked a bit about Amazon Machine Images, as ways to keep track of the versions that are being used in a data analysis workflow.

---

## R quirks to appreciate for effects on reproducibility

* `setwd` - NEVER
* `attach` - NEVER
* `.Rprofile` - If you have one, it must accompany the project and don't put much/anything in `~/.Rprofile`
* Do not save on quitting:

```bash
R --no-save --no-restore-data
```

???

I have this line in my `~/.bashrc` file: `alias R='R --no-save --no-restore-data'`

???

There are also some R quirks to appreciate because of their effects on reproducibility. There are a couple of functions in R that you should never use; I think the original R developers regret putting these into R. These are `setwd` and `attach`. `setwd`, which you'll see more frequently, is basically a way of changing directories within R.
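A minimal illustration of the alternative, with a hypothetical file name: instead of hard-coding a directory with `setwd`, assume the script is run from the project root and use a relative path.

```R
# Fragile: hard-codes one person's machine and breaks for everyone else
# setwd("/Users/pat/projects/kozich_reanalysis/data/mothur")

# Better: assume R is started from the root of the project directory and
# use a path relative to it (the file name below is hypothetical)
axes <- read.table(file="data/mothur/example.nmds.axes", header=TRUE, row.names=1)
```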
So, again, we want to use paths that are relative to the root of your project directory, and when you write your R code, assume that you're running it from the root of the project directory; don't use `setwd` to move around. Just like we don't want to use `cd` to move into different directories in our bash scripts, we don't want to use `setwd` to change our working directory.

There's another hidden file called the `.Rprofile` file; R will run the code in that file when R starts, and so you can hide stuff in there. Our project has an `.Rprofile` file that has some stuff in it, and so if you give people your R code, you need to be transparent about what's in that `.Rprofile` file, because any R code that you're running as part of your project is also running what's in that `.Rprofile` file. And so you'll see in our repository that the `.Rprofile` file is part of what we're keeping under version control.

Also, when you run R, do not save on quitting. When you're running R in interactive mode, it will ask whether you want to save; you do not want to save, and you do not want to restore when you start R. That's because you might be running an analysis where you define variable A, you quit, you save, and then when you run R again, if it restores the data, A is still alive and might have a value that you don't anticipate; or if you give me your code, I might not have A, or not the version of A that you have. So it's really best not to save the data from your session and not to restore the data from a previous session. That saving and restoring really limits the ability to reproduce other people's analyses.

There's other R tooling that is useful for helping with reproducibility.

---

## Ways to leverage R tooling

* Interact with files from other software (e.g. Excel)
* Checking metadata formatting
* Plotting

???

There are R functions and R packages that allow you to interact with files from other software packages like Excel. I'll get Excel files from my collaborators and, again, I want to keep my raw data raw; I don't want to muck with that Excel file. Well, there are R packages that allow me to read in data from an Excel file to then work with in my data analysis, so I don't have to mess with that Excel file or modify it in any way. There are also tools for checking your metadata formatting, to make sure that if you have a date, it's properly formatted, or that if you have weights of individuals, you don't have negative weights or weights that suggest somebody weighs three tons. And there are other tools for plotting: all sorts of amazing stuff that allows you to generate all sorts of different types of data visualizations.

---

## Remember that raw data should stay raw

* [`readr` package](http://readr.tidyverse.org) for importing data from a variety of text formats
* [`readxl` package](http://readxl.tidyverse.org) for reading Excel files
* [`haven` package](https://github.com/tidyverse/haven) for reading in SAS, SPSS, and Stata data files
* [`googlesheets` package](https://github.com/jennybc/googlesheets) for accessing and managing Google spreadsheets
* Variety of tools and APIs for accessing web content (e.g. [`httr`](https://github.com/hadley/httr), [`rvest`](https://github.com/hadley/rvest), [`rentrez`](https://github.com/ropensci/rentrez))

???

So again, in thinking about keeping raw data raw, we have a variety of read commands in R for importing data from other formats.
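For example, here is a hedged sketch of reading a collaborator's spreadsheet with the `readxl` package without ever touching the original file; the file name and sheet are made up for illustration.

```R
library(readxl)   # assumes the readxl package is installed

# Read the raw Excel file directly into R; the file itself is never edited
# (the path below is hypothetical)
metadata <- read_excel("data/raw/collaborator_metadata.xlsx", sheet = 1)
head(metadata)
```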
As I've mentioned, the readxl package allows you to read in Excel files; there's the haven package for reading things in from SAS and other statistics software; there's the googlesheets package for accessing and managing Google spreadsheets; and there's a variety of tools and APIs for accessing web content. rOpenSci is a group of R developers who are working to provide access to public databases from within R, so I can use this last one, rentrez, to do searches of GenBank and PubMed within R, and that's a really beautiful way to interact with other databases.

---

class: middle center

[![Other people's data](/reproducible_research/assets/images/otherpeoplesdata.png)](https://twitter.com/cbahlai/status/649285357016584192)

???

So, on checking metadata formatting: I love this example from Christie Bahlai, who I've mentioned in an earlier tutorial. She did an analysis and ran a statistical test only to find that "corn" was different from "corn ", where the second "corn" somehow had a space added after it, and R thought those were two different categories, okay? So this is a fun hashtag, "otherpeoplesdata": if you feel like you've got it bad, go check out what people are posting under this hashtag and you'll see all sorts of crazy things that people are accidentally doing. Some tools that we can use for checking metadata include the `summary` function, to make sure that we only have "corn" and not "corn " with a space, that things are in the right range, and that things are the right data type.

---

## Checking metadata formatting

* `summary(data$crop)`
  - Check for data type (e.g. numeric, logic, character)
  - Check for range (e.g. weights of zero)
  - Check for NA values
* `table(data$crop)`
  - Check for number of different types and their frequency
* `gsub('corn ', 'corn', data$crop)`
* [`validate` package](https://cran.r-project.org/web/packages/validate/vignettes/intro.html) allows you to `check_that` variables meet certain criteria
* [googleforms](https://www.google.com/forms/about/) for controlled data entry

???

You can also use the `table` function to check the number of different types and their frequency. You can use things like `gsub` to clean up text; `gsub` is a function for finding and replacing text, and so, again, if you have programmatically listed out your steps for cleaning up the data, you can rerun that code without touching your raw data. So Christie could run this `gsub` command on her data set and, without changing her raw data, it would preserve that typo in the raw file while her analysis corrects for it. And as I mentioned, there's a package called validate that allows you to check that variables meet certain criteria. You can also use Google Forms to control data entry, and then, using some of the googlesheets documentation, you can use R to interact with Google Forms as well as with Google Sheets.

---

## Running R for programmatic analysis

From within R

```R
> source('code/plot_nmds.R')
```

--

From the command line

```bash
$ R -e "source('code/plot_nmds.R')"
```

???

So, great, how do we run R outside of R? Well, from within R, we could say, like we did earlier, `source('code/plot_nmds.R')`; we just saw this when running that utilities file. From the command line, we could type `R`, `-e` to execute an expression, and then that line of code, `source('code/plot_nmds.R')`.

--

Another example

```bash
$ R -e "source('code/utilities.R'); calc_shannon(c(1,1,1,1,2,4,5,6,7))"
```

???
We could stitch together multiple R commands with a semicolon.

---

class: middle center

![Too much talking, let's do something](/reproducible_research/assets/images/a_lot_of_talking.gif)

???

What are we doing again? We've talked a lot about using R and the wonderful things we can do with R, and, of course, to be fair, pretty much all of these things can be done with Python and other languages as well. But our goal here is to generate code that is robust to errors that we or others might introduce, so that the analyses can be replicated, and also to make code that's better and more reproducible. And so by automating our code, we can make that code and our overall analysis much more reproducible.

---

class: middle

```R
################################################################################
#
# plot_nmds.R
#
# Here we take in the *.nmds.axes file from the mouse stability analysis and
# plot it in R as we did in Figure 4 of Kozich et al.
#
# Dependencies: 2-D axes file generated by the nmds command in mothur
# Produces: results/figures/nmds_figure.png
#
################################################################################

plot_nmds <- function(axes_file){

	axes <- read.table(file=axes_file, header=T, row.names=1)
	day <- as.numeric(gsub(".*D(\\d*)$", "\\1", rownames(axes)))

	early <- day <= 10
	late <- day >= 140 & day <= 150

	plot_axes <- axes[early | late, ]
	plot_day <- day[early | late]
	plot_early <- early[early | late]
	plot_late <- late[early | late]

	pch <- vector()
	pch[plot_early] <- 21
	pch[plot_late] <- 19

	output_file_name <- "results/figures/nmds_figure.png"
	png(file=output_file_name)

	plot(plot_axes$axis2~plot_axes$axis1, pch=pch,
		xlab="PCoA Axis 1", ylab="PCoA Axis 2")

	legend(x=max(plot_axes$axis1)-0.125, y=min(plot_axes$axis2)+0.125,
		legend=c("Early", "Late"), pch=c(21,19))

	dev.off()
}
```

???

Again, I didn't mean for this to be a tutorial on how to program in R, but I wanted to highlight some of the tooling and approaches that you might take to make your analyses more reproducible and more robust. At long last, we're going to go back into AWS and add some R code as we close out this session, so I'll go to AWS. Hopefully, you've gotten accustomed to doing this like second nature. Grab the IP address. And we're in, great. So we'll cd into Kozich, and I'm going to create a new file in code called plot_nmds, so I'll do `nano code/plot_nmds.R`, and I'm going to copy this code in there and save it.

---

## Exercise

* Add a line to the `analysis_driver.bash` script to run the R file from the command line
* Run the code and commit the new file and changes
* Add comments to `code/plot_nmds.R` and add one or two things to make the code more defensive (e.g. confirm number of columns is 2; what if the names don't have a "D"?)
* Run the code and commit the new file and changes

???

To run this, I want to double check that I know what the dependencies are and what it produces: the dependency is a 2-D axes file generated by the nmds command in mothur, so when I run plot_nmds, I need to give it the name of that axes file, and it's going to produce a file called results/figures/nmds_figure.png. Great, this will be very useful information as we put this into our analysis_driver.bash file, so I'll save this and come out. I will go to the bottom of that file, add a comment that says "Construct NMDS png file," and then write `R -e "source('code/plot_nmds.R')"`.
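For reference, the assembled line in `analysis_driver.bash` will end up looking roughly like this; the axes file name below is only a placeholder for the long `*.nmds.axes` name that mothur actually produced.

```bash
# Construct NMDS png file
# (substitute the actual *.nmds.axes file name found in data/mothur/)
R -e "source('code/plot_nmds.R'); plot_nmds('data/mothur/PLACEHOLDER.nmds.axes')"
```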
And then I need to call this function, so plot_nmds, parentheses, and then in single quotes I'm going to put the name of the NMDS file, and then that's going to end in double quotes. But I forgot the name of the file, so let's save this and back out and then do `ls data/mothur`; it ends in nmds.axes. And so here is the horribly long file name, so I will copy this, go back into my analysis_driver.bash file, come back to the end, and inside those single quotes I will paste that file name. I'm going to copy this whole line because I don't want to run the entire analysis_driver.bash file, I just want to run this one line, so I will quit out, save it, and run that. It says there's an unexpected end of input, and I think that's because I have some weird spacing going on from when I copied and pasted, so I will get rid of that space and run it again. Oh, there's another space up here; these are the problems with copying and pasting.

You know what? I'm going to show you a different way to do this. A useful command as you're building your analysis_driver.bash file, if you want to do this copying and pasting, is tail: if we do `tail analysis_driver.bash`, we get the last few lines of our file, and so I can copy this now. I think copying and pasting from nano was introducing some weird line breaks. So I can paste this in, it runs, and it looks like it built some type of plot. We remember from the header of our plot_nmds.R file that it put the figure into results, figures, and then nmds_figure.png. And if you forget the path, what I'm doing is hitting Tab twice, which tells me the directories that are within results; then I can do results, figures, hit Tab twice, and it shows me what's there: nmds_figure.png.

But I don't know how to open that; do you remember how to open that? Right, we can use FileZilla, so let's see if we can use FileZilla to open up this nmds_figure file to see what it looks like. I'm going to come into my applications, and for some reason I don't have this in my dock, but I'm going to go ahead and open FileZilla. I want to connect, and I need to put my new IP address in here, so to do that I need to come over to EC2, highlight this and copy it, then go back to FileZilla, paste it in, and say "Connect," "OK." And there is my Kozich directory. Double-click on that; we said it was in "results," "figures." Here's the moment of truth, how's it look? It looks great, so that's what the PNG file looks like, okay? And so, again, we can use FileZilla to access the files, move things back and forth, and carry on.

So I'm going to stop here with FileZilla and AWS; I'm going to close all these things, go back to my terminal window, and do exit, exit, exit, and over here I'm going to stop my instance. Actually, you know what? Before I stop my instance, I forgot to do something really important: I forgot to commit my changes. So I'm going to log back in, cd into Kozich, and do git status. I see that I've got some other weird things going on in here that I didn't anticipate, things like code.plot.nmds.swp. I'm going to delete that because it's not under version control and I'm not using it, so I'll rm that. And if I do git status again, I'm not sure what that R directory or R file is, so I'm going to add that to my gitignore, and up here I'm going to add R.
Quit that, git status, and now I see I've got four files, so I'm going to git add the gitignore, analysis_driver.bash, code/plot_nmds.R, and results/figures/nmds_figure, then git status. Those are all ready to be committed, so I can do git commit, "Generate NMDS figure," git status, we're good to go, and I'll go ahead and git push and add my credentials. Now we're ready to quit, so exit, exit, and I will then come back to "Actions" here, "Instance State," "Stop," "Yes, Stop."

So we've already done the first exercise I was planning on asking you to do, adding a line to the analysis_driver.bash script to run the R file from the command line, and we committed it. What I'd like you to also do is add some comments; you did this previously in an earlier tutorial, so add those comments to the code/plot_nmds.R file, and add one or two things to make the code more defensive: perhaps you could confirm that the number of columns in the file is two, or handle the case where the names don't have a "D" in them. Then run the code and commit the new file and changes.

I hope you now have a better understanding of why using tools like R or Python can help make your analyses more reproducible than, say, using a tool like Microsoft Excel. Beyond reproducibility, I love using R because there's a community of wonderful data scientists who are constantly striving to make the language better by expanding the tools we have for working with data and for making cool data visualizations. If you're interested in learning more about R, I would suggest that you check out my minimalR tutorial, which is also available on the riffomonas.org website.

There's an R package that has been a total game changer for my research group; it's called R Markdown. As we'll see in the next tutorial, this is a way to blend written text with R code. Have you ever updated your analyses and then found that you needed to update all the p-values and summary statistics, or perhaps a table with a bunch of numbers in it as well? That can be really tedious, right? It's also prone to a lot of errors. R Markdown is a way to avoid all that tedium and error. I can't wait to share with you how R Markdown has been so instrumental in improving the reproducibility of the manuscripts coming out of my research group.