How to Implement a Bioinformatics Workflow at UW-Madison: A Guide for Researchers


The purpose of this document is to outline the general steps that researchers in the Department of Bacteriology can follow to take advantage of high-throughput computing at UW-Madison. While these servers provide immense computing power, there is a steep learning curve before one can run analyses on them, and how steep depends on the researcher’s prior knowledge. This document describes some of the general steps I can walk you through, and the order in which the different topics can be learned. 

While this guide assumes you will run analyses on CHTC’s remote servers, much of the information applies to running analyses on any server: understanding the bigger picture, having the right software on your laptop to facilitate your workflow, and a basic understanding of how to use the command-line terminal to interact with programs. 

CHTC uses HTCondor, a workload management system for compute-intensive jobs, and some of the examples of submit and executable scripts in this guide are specific to it. Another popular workload management system is SLURM, which some other on-campus servers use. HTCondor was developed at UW-Madison, but it is used globally, so learning it is a useful and transferable skill even if you one day move on from UW-Madison. 

Because this is a shared system, with people from all over campus using it, it’s important to be aware of the usage guidelines. As a user, you should think through which analyses you want to run, which software you want to install, and how you will run your jobs. I am happy to answer your questions about this, so please ask me if you’re unsure!

Likewise, one major area of learning when transitioning from local to remote computing is understanding how software is installed. In this guide, we’ll go over three ways: compiling from source with make, conda environments, and Docker containers. Many bioinformatics packages (over 10,000!) are available through conda, and containers are becoming more and more popular, so those last two will be the most applicable methods for bioinformatics software.

In terms of how to run analyses, understanding how environments (conda) and containers (Docker, Singularity, Apptainer) work is useful whether you’re hoping to run analyses on your laptop or on any remote server, because they make your analyses reproducible. 

Intended use

I hope the guide can serve as a written reference that researchers can use to “refresh” on how to do the tasks associated with the computational life-cycle of a project, from project planning before even getting data, to post-publication plans. A guide must balance being general enough that information is easy to find and applicable to most researchers against being detailed enough to be immediately useful. As a trade-off, it cannot go into much detail on specific bugs, errors, troubleshooting techniques, and strategies, so I hope it will be used in addition to 1-on-1 mentoring and teaching of how to do these tasks. This guide also only minimally goes into the scientific discourse and subject-matter expertise required to assess which bioinformatics analyses are appropriate to answer specific microbiological research questions. 

General Overview

  1. Understanding the bigger picture:
    1. Your scientific research question
      1. Project planning and experimental design
      2. Expected data
      3. Planned analyses
      4. Post-analysis plans 
  2. Learning about your computing and storage options
  3. How do these “abstract” things connect?
  4. Getting set up with appropriate software to facilitate workflow
    1. Install suggested software on your laptop/computer:
      1. [for moving files] 
        1. Cyberduck: (Mac)
        2. Filezilla: (Windows)
      2. [for accessing services off-campus] Global VPN: 
      3. [for accessing servers, text file editor] Visual Studio Code (optional) 
      4. [simple text file editor] 
        1. BBEdit: (Mac)
        2. Notepad++: (windows)
    2. Create an account on CHTC: 
    3. Understand guidelines:
  5. Learning how to use the command line
      1. Recommended resources: 
      2. “CheatSheet” : 
  6. Getting started with CHTC:
    1. Using ssh to connect:
      1. The first time you connect, ssh will ask you to confirm that you trust the server (host key verification); type “yes”.
      2. You need your netID and Duo 2-factor authentication to login to CHTC. You also need VPN if you’re off campus.
    2. After connecting for the first time, you can use FileZilla or Cyberduck to transfer files.
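For example, a first connection could look like the following sketch. The hostname is a placeholder: CHTC assigns you a specific submit server (e.g., something like ap2001.chtc.wisc.edu, though yours may differ) when your account is created.

```
# Replace <netid> with your NetID and <submit-server> with the submit
# server hostname CHTC assigned to your account
ssh <netid>@<submit-server>
# You will be prompted for your NetID password and a Duo two-factor approval
```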
  7. Learning how to move files between a computer, research drive, and CHTC
    1. Each Bacteriology Lab/PI has a 25 TB ResearchDrive share:
      1. How to manage ResearchDrive permissions (can only be done by the PI): 
      2. A PI can give a collaborator access to a single folder (or a few folders) from their ResearchDrive by changing the permissions in Globus. 
      3. You can transfer files between a PI’s ResearchDrive and the CHTC directories (/home or /staging) of your own account by following these instructions: 
    2. Between your laptop and CHTC: 
      1. Using a File Transfer program like Cyberduck, FileZilla
      2. You can transfer files using the command line directly to chtc: 
    3. Between folders within CHTC: 
      1. Using the mv and cp commands on the command line 
    4. Between ResearchDrive and your Laptop: 
      1. Finder > Go > Connect to Server (Mac):
        1. Type this in the box: smb:// (Mac)
      2. (On Windows)
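As a sketch, command-line transfers with scp and within-CHTC moves look like the following. File names, paths, and the submit server hostname are all placeholders; substitute your own.

```
# Laptop -> CHTC: copy a file into your /home directory
scp mydata.fastq.gz <netid>@<submit-server>:/home/<netid>/

# CHTC -> laptop: copy results back to the current directory
scp <netid>@<submit-server>:/home/<netid>/results.tar.gz .

# Within CHTC: move or copy files between folders with mv and cp
mv results.tar.gz /staging/<netid>/
cp -r my_project/ my_project_backup/
```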
  8. Moving large files between ResearchDrive and CHTC using Globus
    1. How to connect the /staging folder on CHTC to your Globus service:  
  9. Learning how to run a first job on CHTC
      1. The basics of a submit file and an sh file:
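As a minimal sketch (file names and resource requests here are illustrative, not required values), a submit file and its executable script could look like:

```
## --- myjob.sub (a minimal HTCondor submit file) ---
executable     = myjob.sh
log            = myjob.log
output         = myjob.out
error          = myjob.err
request_cpus   = 1
request_memory = 2GB
request_disk   = 2GB
queue

## --- myjob.sh (the script that runs on the execute node;
##     the file itself starts at the #!/bin/bash line and
##     must be made executable with chmod +x myjob.sh) ---
#!/bin/bash
echo "Starting analysis on $(hostname)"
```

You submit the job with condor_submit myjob.sub, and the log, output, and error files appear in the same directory as the job runs.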
  10. Useful commands to submit and track a job on CHTC: 
    1. condor_submit <jobname>.sub (submit a job)
    2. condor_submit -i <jobname>.sub (submit an interactive job)
    3. condor_hold <jobID> (place a job on hold)
    4. condor_q -better-analyze <jobID> (diagnose why a job is idle or held)
    5. condor_ssh_to_job <jobID> (open a shell on the machine where a job is running)
  11. The interactive mode (-i) and how to use it for troubleshooting
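For instance, assuming a submit file named myjob.sub (a placeholder name), an interactive troubleshooting session looks like:

```
# Request an interactive session described by your submit file; HTCondor
# places you in a shell on an execute node with your input files staged
condor_submit -i myjob.sub

# ...run your commands by hand to see where they fail, then...
exit   # leave the node and release the interactive slot
```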
  12. Learning about how to install software
    1. How to know which methodology to use
      1. Reading the scientific literature, the methods section of papers, benchmark software papers, Github wiki pages
      2. Subject-matter experts
    2. How to read and understand software-specific installation instructions
    3. Conda environments: 
      1. What is conda and how to use it 
      2. Using conda environments on CHTC: 
      3. General conda cheat sheet: 
      4. The bioconda channel:  
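A typical conda session might look like this sketch. The environment name is arbitrary, and samtools is just one example of a bioconda package; substitute whatever tool you need.

```
# Create an environment and install a tool from the bioconda channel
conda create -n mytools -c conda-forge -c bioconda samtools

conda activate mytools
samtools --version      # confirm the install worked
conda deactivate        # leave the environment when done
```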
    4. From “source”
      1. Using a .tar.gz file and compiling it ourselves
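The usual "from source" pattern is sketched below. The URL and tool name are hypothetical; the real steps for any given tool are in its README or INSTALL file, and not every tool ships a configure script.

```
# Download and unpack the source archive
wget https://example.com/sometool-1.0.tar.gz
tar -xzf sometool-1.0.tar.gz
cd sometool-1.0

# Build and install into your home directory (no root access needed)
./configure --prefix=$HOME/software/sometool
make
make install
```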
    5. Containers (Docker, Apptainer)
      1. Docker, Apptainer, Singularity – what are these things? What are their commonalities and differences?
      2. How to use Docker containers on CHTC:
        1. Testing them on CHTC: 
        2. Running them: 
      3. How to use Apptainer on CHTC: 
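For example, pointing an HTCondor job at a Docker container is mostly a matter of two extra submit-file lines; the image name below is a placeholder, not a real image.

```
# Fragment of an HTCondor submit file that runs the job inside a
# Docker container pulled from a registry
universe     = docker
docker_image = <registry>/<image>:<tag>
executable   = myjob.sh
queue
```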

  13. Bioinformatics-specific databases available:

    1. Common databases: /staging/ptran5/bacteriology_tran_data hosts the NCBI nr database (v 12/14/2023), the GTDB-tk r214 database, the CheckM database (v 1/13/2015). CHTC also hosts the AlphaFold databases.
    2. Docker for non-local databases
    3. /home vs /staging
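In practice, large inputs live in /staging and are copied into the job's working directory by the executable script itself, rather than being transferred through the submit file. A sketch (the paths and file names are placeholders):

```
#!/bin/bash
# Inside your job script: copy a large database from /staging, unpack it
# locally on the execute node, run the analysis, then clean up before
# the job ends so the files are not transferred back
cp /staging/<netid>/big_database.tar.gz ./
tar -xzf big_database.tar.gz

# ...run your analysis against ./big_database here...

rm -r big_database.tar.gz big_database/
```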

  14. Learning to manage data after an analysis is completed

      1. How to transfer data using the command line: 
      2. Move data outside of CHTC (e.g. to ResearchDrive, etc.)
      3. Keep copies of all essential files (e.g. software version numbers, log files, raw data, intermediate files, output files)
      4. FAIR data principles: 
      5. Popular Repositories in Microbiology:
        1. How to submit your data to NCBI (BioProject, BioSamples, Genbank, SRA): 
      6. How to publish your code on GitHub (hosting available to anyone) or GitLab (UW-Madison has an institutional license).
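Bundling the essential files into a single archive makes the move to ResearchDrive (and later retrieval) easier. A self-contained sketch, with illustrative directory and file names:

```shell
# Gather essential files (logs, version records, outputs) into one folder
mkdir -p myproject/logs myproject/output
echo "sample output" > myproject/output/result.txt
echo "sometool v1.2.3" > myproject/logs/versions.txt

# Bundle and compress the folder into a single archive
tar -czf myproject_archive.tar.gz myproject/

# List the archive contents to verify everything made it in
tar -tzf myproject_archive.tar.gz
```

One archive per project (or per analysis) also keeps the file count low, which makes Globus and scp transfers noticeably faster than moving thousands of small files.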