Scripting an analysis from the command line

July 20, 2020 • PD Schloss • 5 min read

Normally when we think about computer programs, we think of something like a web browser or an app on your phone. In the past few episodes we have used a variety of programs, including git and Atom. But a computer program is really just a set of instructions telling a computer how to do something. It can be very simple or quite complex. I think of a computational research project as writing a computer program. At first it doesn’t seem very complicated: a few commands to get my raw data, a few more to process it, and a few more to generate figures and run statistical tests. As I slowly build up the project, it becomes a pretty complicated computer program. This is not a metaphor! We can piece together existing programs, along with programs we write ourselves, to instruct the computer how to convert raw data into a finished product. You might think that you can’t possibly write your own computer program. You can! In today’s episode we will take the first steps of converting our work from the past few episodes into small programs that we can execute from our command line interface.

We may not always appreciate it, but when we’re working at the command line we are using two programs: a terminal program and bash, the program that provides the command line interface. Today we will learn how to write simple programs, also called scripts, to automatically download our data and put it in the correct location. As we go forward through our project, we will use this approach to create an automated and reproducible workflow. Writing these types of programs is critical to achieving our overarching goal of understanding the sensitivity and specificity of amplicon sequence variants. Even if you don’t have a clue what amplicon sequence variants are, I’m sure you’ll get a lot out of this episode. Also, at the end of this episode, I have several exercises for you to work on. If you haven’t been following along but would like to, please check out the blog post that accompanies this video, where you will find instructions on catching up.
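To make the idea of a download script concrete, here is a minimal sketch of what one might look like. It is not the script from the episode: the function name download_to, the choice of curl as the download tool, and the file names and paths mentioned afterwards are all assumptions chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: download one file and store it in a chosen directory.
# The function name and the use of curl are illustrative assumptions,
# not the approach taken in the episode.

set -e  # stop as soon as any command fails

download_to() {
    local url="$1"         # where the file lives
    local target_dir="$2"  # where it should end up

    # Create the destination directory if it does not exist yet
    mkdir -p "$target_dir"

    # Name the local file after the last part of the URL
    local file_name
    file_name="$(basename "$url")"

    # --fail: treat HTTP errors as failures; --location: follow redirects
    curl --fail --location --silent --show-error \
        --output "$target_dir/$file_name" "$url"

    echo "Saved $target_dir/$file_name"
}

# Run the download only when a URL and a directory were given
if [ "$#" -eq 2 ]; then
    download_to "$1" "$2"
fi
```

If this were saved as code/get_data.sh (a made-up name), running it with bash and giving it a URL followed by data/raw would place the downloaded file in data/raw, creating that directory first if needed. Because the steps live in a file rather than in my memory, anyone can rerun them and get the same result.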

Please take the time to watch today’s episode, follow along on your own computer, and attempt the exercises. Don’t worry if you aren’t sure how to solve the exercises; I will provide solutions at the end.

The reference notes and links that follow are a supplement to the material in the video. The exercises and their solutions are provided at the bottom.

Important things to remember

Installations

If you haven’t been following along, you can get caught up by doing the following:

Exercises

1. Create and close an issue to write a script that installs mothur.

2. We need to align our sequences to make sure they’re in the correct orientation and that they start and end at the same alignment coordinates. Also, once they’re aligned, it will be easier to extract variable regions from all of the sequences. Create and close an issue for aligning our rrnDB fasta sequences to our SILVA SEED reference alignment.