Laying out a plan for our project
Former US president and Army general Dwight Eisenhower once said, “Plans are worthless, but planning is everything”. That is especially true for research projects. If we don’t have a plan for getting somewhere, then we’ll be sure to get nowhere. I hate working in the midst of chaos, can you tell? We’ve spent two episodes getting our project organized and we haven’t touched any data or code yet. If you have ever taken over a project from a former member of your research group, then you’ll know the frustration of trying to find things and make sense of what they did. Your first step would be to get the project organized. Right?! Well, today we’ll talk a bit more about organization: how we can use issues in GitHub to give ourselves a to-do list and how we can use branches in git to systematically work through those issues. Don’t worry, today we’ll also download the data that we’ll be using to determine to what degree inter- and intra-genomic variation limit our ability to interpret amplicon sequence variants (ASVs) as the biologically coherent entities that proponents of ASVs claim them to be. As we go along, we’ll reuse the command line and git commands that we saw in the last two episodes so that we can continue to practice those skills while learning new git tools.
Please take the time to watch today’s episode, follow along on your own computer, and attempt the exercises. Don’t worry if you aren’t sure how to solve the exercises, at the end I will provide solutions. The reference notes and links that follow are a supplement to the material in the video.
git and GitHub revisited
The notes below are meant to supplement the video presentation of the Code Club episode.
GitHub flow
GitHub flow is a process used by software developers that allows them to work on multiple new features at the same time. We can adapt the idea behind GitHub flow for data analysis. This will allow us to systematically identify and address the different steps in our data analysis plan. It makes use of an “issue tracker”, which accompanies each GitHub repository. To implement this process, we’ll learn a few more git commands, including git branch, git checkout, and git merge. We’ll learn GitHub flow through the process of downloading our reference files.
- Create an issue in your repository’s issue tracker on GitHub
- Create and check out a branch in your local repository
git branch issue_1
git branch
git checkout issue_1
git branch
git status
- Work on the issue. As you go through the issue, feel free to add to the thread for the issue on GitHub
- Commit the change when you are done with the issue. In the commit message, include the statement “closes #[issue_number]”
- Check out your master branch
git checkout master
- Merge your issue branch into the master branch. If you have conflicts, open the offending files, resolve the conflicts, and commit the change.
git merge issue_1
- Push to your remote repository
git push
- Refresh the issue and see that it has closed or close it yourself
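The steps above can be tried end to end in a throwaway repository before using them on the real project. This is only a sketch under a few assumptions: the scratch directory, the README text, the data/references/ path, and the user name and email are all made up for illustration, and a local bare repository stands in for GitHub, so no issue will actually be closed by the "closes #1" message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical scratch locations: a throwaway project plus a local bare
# repository that stands in for the GitHub remote
workdir="$(mktemp -d)"
git init -q --bare "$workdir/remote.git"
git init -q "$workdir/project"
cd "$workdir/project"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is master
git config user.name "Example User"       # placeholder identity for the demo
git config user.email "user@example.com"
git remote add origin "$workdir/remote.git"

# Starting point: a single commit on master
echo "# Project" > README.md
git add README.md
git commit -q -m "Initial commit"

# Create and check out a branch for the issue
git branch issue_1
git checkout -q issue_1

# Work on the issue, then commit with the closing statement in the message
echo "Reference files will live in data/references/" >> README.md
git add README.md
git commit -q -m "Describe reference download, closes #1"

# Return to master, merge the issue branch, and push
git checkout -q master
git merge -q issue_1
git push -q -u origin master

# Both commits should now be on master
git log --oneline
```

In the real project, the remote is your GitHub repository, so the push is what lets GitHub see the closing statement and close the issue.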
.gitignore
The .gitignore file sits in the root of our project directory. This file is a listing of files and directories that we want git to ignore. We generally want to ignore things that live inside of data/ since these tend to be large files. GitHub blocks files larger than 100 MB and recommends keeping a repository under 1 GB. Software developers will also recommend ignoring anything that is not code since it can easily be recreated. For bioinformatics data analyses, however, regenerating these files can take a long time and significant resources. I tend to commit smaller summary files that others might want to use without the headache of rerunning my analysis themselves.
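As a rough illustration of that approach, the sketch below writes a .gitignore that ignores everything inside data/ except one small summary file, then asks git which files it would still pick up. The file names (big_reference.fasta, summary.tsv) are hypothetical placeholders, and the demo runs in a scratch repository rather than the real project.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Scratch repository for the demo (hypothetical location)
repo="$(mktemp -d)"
git init -q "$repo"
cd "$repo"

# Write the ignore rules; lines starting with # are comments in .gitignore
cat > .gitignore <<'EOF'
# Ignore everything inside data/ ...
data/*
# ... except a small summary file worth sharing
!data/summary.tsv
EOF

# Placeholder files standing in for a large download and a small summary
mkdir -p data
echo "placeholder" > data/big_reference.fasta
echo "placeholder" > data/summary.tsv

# Show which rule causes the large file to be ignored
git check-ignore -v data/big_reference.fasta

# List the untracked files that git would still offer to add
git ls-files --others --exclude-standard
```

The pattern is data/* rather than data/ because git cannot re-include a file when its parent directory itself is excluded.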
Exercises
1. Create a new issue. In this issue, we want to start accumulating a bibliography of papers describing oligotypes and amplicon and exact sequence variants. As we find new references, we can add them to the issue’s thread.
2. Use GitHub flow to resolve Issue 2.
3. Use GitHub flow to resolve Issue 3.