Using cut, sort, and unique to explore data with bash

August 6, 2020 • PD Schloss • 5 min read • •

A lot of bioinformatics is getting data from one format to another. Sometimes that conversion can be pretty sophisticated and other times it can be pretty simple. For example, I have a file with a bunch of aligned sequences and I want to know how many times each sequence occurs in each genome. I know that the header for each sequence in my FASTA file has the information I need to know which genome the sequence comes from, but I don’t know which field it is. To figure this out, I’m going to use some of my favorite tools to help me. These tools will include grep, which we saw in the last episode, as well as cut, sort, and uniq.

Much of what we’ll do today you could also do in R or Python. However, using these bash commands will allow me to get to my answer in a single line of code, whereas R or Python will require a lot more effort. The cut command “cuts” text out of a line based on how we define different regions of the line. For example, we’ve seen that the header lines start with the genus and species name of the organism that was sequenced. We previously put an underscore (i.e. _) between the genus and species names. We can use cut to extract the genus name by pulling out anything that occurs before the underscore. The sort command sorts the lines we give the command. We can also sort lines based on specific fields within the line. For example, we could sort header lines based on the GenBank RefSeq accession number. Finally, the uniq function deconvolutes a set of lines to return the unique lines. We can also ask it to count the number of times each line occurred in the original set of lines. Hopefully, you can see from my description of these three commands that we could count the number of genera represented in the dataset, the number of 16S rRNA gene copies per genome, or numerous other questions about our data.

We’ll use these commands to help us figure out which bit of information between the pipe characters corresponds to a unique identifier for each genome. This will allow us to use mothur’s count.seqs function to count the number of times each unique 16S rRNA gene sequence appears in each genome and the number of times it appears in more distantly related genomes. If we can generate this information, we can use it to quantify how specific an amplicon sequence variant or ASV is to a genome, species, genus, or any other taxonomic level.

Even if you’re only watching this video to learn more about bash commands and don’t know what a 16S rRNA gene is, I’m sure you’ll get a lot out of today’s video. Please take the time to follow along on your own computer and attempt the exercises. Don’t worry if you aren’t sure how to solve the exercises, at the end of the video I will provide solutions. If you haven’t been following along but would like to, please check out the notes below where you’ll find instructions on catching up, reference notes, and links to supplemental material. You can find my version of the project on GitHub.

Important things to remember

Common `cut` arguments

-f: which field you want
-d: the delimiter used to define fields

Common `sort` arguments

-n: numerical sort
-r: reverse sort
-k: which field to sort on
-t: the delimiter to define fields based on (default is a space character)

Common `uniq` arguments

default: return the unique lines
-c: the number of times each unique line appears

Installations

If you haven’t been following along, you can get caught up by doing the following:

(windows) Install the Ubuntu Linux BASH shell for Windows 10

(mac) Install homebrew and wget

  /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  brew install wget

To get to where we are at the beginning of this episode (you won’t have the same issue numbers at Pat)…
- Set up a GitHub account
- Create a new GitHub repository
  - Call it “Schloss_rrnAnalysis_XXXX_2020” (feel free to use your own last name)
  - Make it Public
  - Don’t check the box next to “Initialize this repository with a README”
  - Click the green “Create repository” button
- Go to your command line and enter the following replacing <your_github_id> with your GitHub user id
```
git clone git@github.com:SchlossLab/Schloss_rrnAnalysis_mSphere_2020.git
cd Schloss_rrnAnalysis_XXXX_2020
git reset --hard 77e314f5
git remote set-url origin git@github.com:<your_github_id>/Schloss_rrnAnalysis_XXXX_2020.git
git push -u origin master  
```
- Return to GitHub and refresh your browser
- Within Schloss_rrnAnalysis_XXXX_2020/ run
```
make data/v19/rrnDB.align
make data/v4/rrnDB.align
```
- Then you should now be good to go

Exercises

1. How many genomes are in the rrnDB archive that are from the Thermus?

2. Which organism has the most copies of the 16S rRNA gene per genome?

3. After the species coli, which Escherichia species has the most genomes represented in the rrnDB?