Mon Apr 01 2024
What happens between metagenome sequencing results and actual features
Sequencing (reads ⇒ contigs)
Metagenome sequencing gives us the original “reads”. Reads are usually short, scattered fragments of the underlying genomes. We then assemble the high-quality reads into contigs, which we treat as continuous stretches of sequence that carry enough information.
Many tools can assemble reads into contigs, including:
- IDBA-UD (doi:10.1093/bioinformatics/bts174)
- MEGAHIT
- metaSPAdes
- Ray Meta: handy for very large datasets
No database (for labeling names or functions) is used in this step. IDBA-UD, for example, relies only on a de Bruijn graph approach, which uses nothing more than base-level overlap between reads, especially at the positions where they connect.
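To make the de Bruijn graph idea concrete, here is a minimal sketch. It is not how IDBA-UD is implemented: it assumes error-free toy reads, a single fixed k, and no branching, and the names build_de_bruijn and extend_contig are made up for illustration. It only shows that contigs can be grown from read overlaps alone, without any database.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def extend_contig(graph, start):
    """Walk from `start` while the path is unambiguous (exactly one successor)."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:  # stop instead of looping forever on a cycle
            break
        contig += nxt[-1]
        seen.add(nxt)
        node = nxt
    return contig

# Toy, error-free reads that overlap each other (an assumption for illustration).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, start="ATG")
print(contig)  # prints ATGGCGTGCAAT
```

Real assemblers additionally handle sequencing errors, branches caused by repeats, and uneven coverage, which this sketch ignores.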
Labeling (contigs ⇒ genes)
Contigs still cannot be treated as units for further analysis.
As we learned in biology class, the genetic code is degenerate: different codons can encode the same amino acid. This also gives rise to codon usage bias between species, but that is beyond the scope of this discussion. Moreover, most mutations in a gene sequence, even those that alter the protein sequence, do not necessarily affect its function.
Given these facts, many contigs should be treated as one unit (with an abundance) in the following analysis, even if their sequences differ.
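As a small illustration of degeneracy, the sketch below translates two DNA sequences that differ at several positions into the same protein. The codon table is only the subset of the standard genetic code needed here, and the two sequences are invented for the example.

```python
# Subset of the standard genetic code; only the codons used below.
CODON_TABLE = {
    "ATG": "M",
    "GCT": "A", "GCG": "A",
    "CTG": "L", "TTA": "L",
    "AAA": "K", "AAG": "K",
    "TAA": "*",  # stop codon
}

def translate(dna):
    """Translate a DNA string codon by codon (length must be a multiple of 3)."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

seq_a = "ATGGCTCTGAAATAA"  # hypothetical sequence
seq_b = "ATGGCGTTAAAGTAA"  # same protein, different codons

differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
print(differences)                            # prints 4
print(translate(seq_a), translate(seq_b))     # prints MALK* MALK*
```

At the nucleotide level the two sequences differ at four positions, yet they encode the same peptide, which is why it makes sense to count them as one unit.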
Thanks to previous work, there are many powerful databases containing a large number of genes whose sequences (genotype) and functions (phenotype) have already been studied. Contigs can therefore be conveniently classified into genes by sequence alignment against these databases.
However, quite a few contigs that play important roles in research will still be missing from the databases. This is where sequence clustering comes in: clustering algorithms are applied to these unnamed contigs and group them into units for further analysis. CD-HIT and many other tools are designed for this step.
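The clustering idea can be sketched as a greedy, longest-first procedure. This is not CD-HIT's implementation: the similarity score below comes from Python's difflib as a crude stand-in for a real alignment identity, and the 0.9 threshold, the function names, and the toy sequences are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def identity(a, b):
    """Crude similarity in [0, 1]; a stand-in for a real alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def greedy_cluster(seqs, threshold=0.9):
    """Process sequences longest first. Each sequence joins the first cluster
    whose representative it matches at >= threshold, else it starts a new one."""
    clusters = []  # list of (representative, members)
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

s1 = "ATGGCTCTGAAAGGTCCATAA"
s2 = "ATGGCTCTGAAAGGTCCTTAA"  # one substitution relative to s1
s3 = "TTTTCCCCGGGGAAAATTTT"   # unrelated sequence

clusters = greedy_cluster([s1, s2, s3])
for rep, members in clusters:
    print(rep, len(members))
```

Here s1 and s2 end up in one cluster and s3 forms its own, so two units remain for the downstream analysis instead of three sequences.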
At this point we have dealt with variation at the gene level, but not yet with the abundance of each gene. Note that a gene's abundance is counted from the reads aligned to that gene, not from the contigs. Using reads is reasonable because contigs discard the read abundance present in the original sequencing result: assembly only focuses on extending reads into continuous sequences that are as long as possible. SOAPaligner (Short Oligonucleotide Alignment Program) is an improved ultrafast tool for short-read alignment and is suitable for this case.
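A minimal sketch of counting abundance from read alignments is shown below. The input format (a list of read/gene pairs) is hypothetical and is not the output format of any particular aligner. Dividing by gene length is an extra assumption that the text above does not state; it is added so that longer genes do not look more abundant just because they attract more reads.

```python
from collections import Counter

def gene_abundance(alignments, gene_lengths):
    """alignments: iterable of (read_id, gene_id) pairs, one per aligned read.
    Returns the relative abundance per gene: read count divided by gene length,
    then scaled so that all values sum to 1."""
    counts = Counter(gene_id for _, gene_id in alignments)
    per_base = {g: counts[g] / gene_lengths[g] for g in counts}
    total = sum(per_base.values())
    return {g: v / total for g, v in per_base.items()}

# Hypothetical alignment result: four reads hit geneA, two reads hit geneB.
alignments = [
    ("r1", "geneA"), ("r2", "geneA"), ("r3", "geneA"), ("r4", "geneA"),
    ("r5", "geneB"), ("r6", "geneB"),
]
gene_lengths = {"geneA": 2000, "geneB": 1000}  # in base pairs, made up

abundance = gene_abundance(alignments, gene_lengths)
print(abundance)  # prints {'geneA': 0.5, 'geneB': 0.5}
```

geneA collects twice as many reads as geneB but is also twice as long, so under this normalization both genes come out equally abundant.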
CAGs (genes ⇒ gene groups)
Co-abundance gene groups (CAGs) are clusters of genes that tend to have similar abundance patterns across different samples, for example abundances that increase or decrease together, as well as other, more complex shared patterns.
Grouping genes this way is reasonable because co-abundant genes may be functionally related: their abundances change together depending on how strong that function is in each sample. Genes with similar roles can therefore show coordinated changes in abundance across diverse biological contexts, which provides insight into the underlying regulatory mechanisms and functional dynamics.
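One simple way to sketch co-abundance grouping is to correlate gene abundance profiles across samples and put strongly correlated genes into the same group. This is only an illustration under stated assumptions: Pearson correlation, a 0.9 threshold, greedy assignment to the first matching group, and made-up abundance values. Real CAG pipelines use more robust clustering.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def co_abundance_groups(profiles, threshold=0.9):
    """Greedy grouping: a gene joins the first group whose seed gene (the first
    member) it correlates with at >= threshold, else it starts a new group."""
    groups = []
    for gene, profile in profiles.items():
        for group in groups:
            if pearson(profiles[group[0]], profile) >= threshold:
                group.append(gene)
                break
        else:
            groups.append([gene])
    return groups

# Hypothetical abundances of four genes across four samples.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.1, 3.9, 6.2, 8.0],  # rises together with gene1
    "gene3": [4.0, 3.0, 2.0, 1.0],
    "gene4": [8.0, 6.1, 4.0, 2.2],  # falls together with gene3
}

groups = co_abundance_groups(profiles)
print(groups)  # prints [['gene1', 'gene2'], ['gene3', 'gene4']]
```

gene1 and gene2 rise together across the samples while gene3 and gene4 fall together, so the sketch yields two groups.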