The following tutorials and documentation will be helpful for researchers considering similar approaches.
Using the command line
A command line lets you perform diverse operations programmatically: you can do a lot with it, and you can automate the process so that it is easy to repeat, alter, and share. You should learn how to use a command line for bioinformatics, because many bioinformatics programs are run from the command line rather than from a graphical user interface. One complication is that the Windows command line uses its own language, whereas Mac and Linux machines, which are Unix-based or Unix-like, have command lines that use the more common bash language (or a closely related shell such as zsh).
- For Mac and Unix operating systems, use the local command line.
- For Windows operating systems, you can try Git Bash or Ubuntu (available on recent versions of Windows through the Windows Subsystem for Linux). Ideally, though, you would establish a remote connection (SSH) to a Unix machine, perhaps a computer cluster maintained by your organization.
Some resources for Unix:
- Commands for manipulating FASTA and FASTQ files
- Commands useful for bioinformaticians generally
- Commands “every data scientist should know”
- A tutorial on awk, which selects fields within data
- @AstrobioMike’s introduction to bash
If you’re a Python novice, I encourage you to take a workshop and then get practice by using Python to analyze and plot your own data. These guides are helpful references and will expand your toolkit.
Python Data Science Handbook by Jake VanderPlas. An online book in Jupyter notebook form, with clear instructions and code examples, focused on different Python packages for data science (Jupyter Notebook, NumPy, pandas, Matplotlib, and machine learning with scikit-learn).
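As a first exercise in analyzing your own data with Python, here is a minimal sketch that reads FASTA records and reports each sequence's length and GC fraction. It uses only the standard library. The function names (`read_fasta`, `gc_fraction`) and the two example contigs are made up for illustration; they do not come from any of the guides above.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may wrap, so collect them until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_fraction(seq):
    """Return the fraction of G and C bases in a sequence (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


# Hypothetical two-record FASTA; replace with open("your_file.fasta").
example = StringIO(">contig_1 made-up example\nATGCGC\nGGAT\n>contig_2\nATATAT\n")
records = list(read_fasta(example))
for header, seq in records:
    print(header, len(seq), round(gc_fraction(seq), 2))
```

For real projects, an established parser such as the one in Biopython is likely a better choice. Writing a small one yourself is mainly useful as practice.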
Metagenomics (assembly + genomics)
Going from sequencing to assembled genomes requires (1) assembly, (2) read-mapping, (3) binning genomes, and (4) bin refinement and quality check.
A tutorial on using Anvi’o for metagenomics by A. Murat Eren. Anvi’o is the best platform for binning genomes from assembly-based metagenomics, and its creators continue to add new functions for comparative genomics.
CheckM is great for assessing genome completeness and contamination.
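Once you have completeness and contamination estimates, a common next step is to keep only the bins that pass some quality cutoff. The sketch below assumes you have already exported the estimates to a tab-separated table. The column names (`bin_id`, `completeness`, `contamination`), the example values, and the cutoffs of 90% and 5% are all illustrative assumptions, not CheckM's actual output format or a recommendation from its authors.

```python
import csv
from io import StringIO

# Illustrative cutoffs only; choose values that suit your own project.
MIN_COMPLETENESS = 90.0
MAX_CONTAMINATION = 5.0


def filter_bins(handle, min_completeness=MIN_COMPLETENESS,
                max_contamination=MAX_CONTAMINATION):
    """Return the IDs of bins that meet both quality cutoffs.

    Expects a tab-separated table with the hypothetical columns
    bin_id, completeness, and contamination (both in percent).
    """
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        completeness = float(row["completeness"])
        contamination = float(row["contamination"])
        if completeness >= min_completeness and contamination <= max_contamination:
            kept.append(row["bin_id"])
    return kept


# Made-up table standing in for a real quality report.
table_text = (
    "bin_id\tcompleteness\tcontamination\n"
    "bin_01\t97.5\t1.2\n"
    "bin_02\t62.0\t0.5\n"
    "bin_03\t95.1\t12.4\n"
)
good_bins = filter_bins(StringIO(table_text))
print(good_bins)
```

Keeping the cutoffs as function arguments makes it easy to rerun the filter with looser or stricter values and compare how many bins survive.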
Metagenomics (amplicon sequencing)
Current best practice is to remove sequencing errors from reads, which yields amplicon sequence variants (ASVs) instead of operational taxonomic units (OTUs). The DADA2 tutorial worked well for me.