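Annotating all the genes of a genome against the Nr database is an essential step. Because Nr is very large, running BLAST against it consumes enormous computing resources and time. DIAMOND is reported to run 500 to 20,000 times faster while giving results that are largely consistent with BLAST.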
Installing the software
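wget https://github.com/bbuchfink/diamond/releases/download/v0.9.24/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz

You can add diamond to your PATH. If you do not, you have to give the full path to the binary every time you run it. Run diamond help or diamond version to check that the installation succeeded.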
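Adding diamond to PATH could look roughly like the sketch below. The directory ~/tools/diamond and the helper name prepend_path are assumptions made for illustration; point DIAMOND_DIR at wherever the tarball was actually extracted.

```shell
# Assumption: the tarball was extracted into ~/tools/diamond (hypothetical path).
DIAMOND_DIR="${DIAMOND_DIR:-$HOME/tools/diamond}"

# Prepend a directory to PATH only if it is not already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

prepend_path "$DIAMOND_DIR"

# Report whether the shell can now find the binary.
if command -v diamond >/dev/null 2>&1; then
    diamond version
else
    echo "diamond not found in $DIAMOND_DIR" >&2
fi
```

To make the change permanent, the same export line can go into the shell start-up file (for example ~/.bashrc).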
Downloading the database
Download the Nr database from NCBI:
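wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
gunzip -c nr.gz > nr.faa

Building the database index
Use diamond's makedb subcommand:
/home/jilei/anaconda3/bin/diamond makedb --in nr.faa -d nr

The input file is in FASTA format, and the command produces a .dmnd file (here nr.dmnd).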
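Since indexing Nr takes a while, it can be worth checking the FASTA file first. The following is a minimal sketch, assuming diamond is on PATH and the decompressed file is called nr.faa as in the makedb command; count_fasta_records is a hypothetical helper, not part of DIAMOND.

```shell
# Count FASTA records by counting header lines (those starting with ">").
count_fasta_records() {
    grep -c '^>' "$1"
}

FASTA="${FASTA:-nr.faa}"   # input name taken from the makedb command in the text

if [ -s "$FASTA" ] && command -v diamond >/dev/null 2>&1; then
    echo "records in $FASTA: $(count_fasta_records "$FASTA")"
    diamond makedb --in "$FASTA" -d nr
else
    echo "skipping makedb: $FASTA is missing or empty, or diamond is not on PATH" >&2
fi
```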
Running the alignment
/home/jilei/anaconda3/bin/diamond blastx -d /data2/hanmz/hanmz_home/database/nr/nr -q /data1/jilei/dini/CP1S1/final.contigs.fa -p 30 -f 100 --sensitive -e 1e-5 --id 90 -k 50 -o CP1S1.daa

Here -f 100 selects the DAA (DIAMOND alignment archive) output format, so the output file is given a .daa extension.

Aligner options:
--query (-q)            input query file
--max-target-seqs (-k)  maximum number of target sequences to report alignments for
--top                   report alignments within this percentage range of top alignment score (overrides --max-target-seqs)
--compress              compression for output files (0=none, 1=gzip)
--evalue (-e)           maximum e-value to report alignments
--min-score             minimum bit score to report alignments (overrides e-value setting)
--id                    minimum identity% to report an alignment
--query-cover           minimum query cover% to report an alignment
--sensitive             enable sensitive mode (default: fast)
--more-sensitive        enable more sensitive mode (default: fast)
--block-size (-b)       sequence block size in billions of letters (default=2.0)
--index-chunks (-c)     number of chunks for index processing
--tmpdir (-t)           directory for temporary files
--gapopen               gap open penalty (default=11 for protein)
--gapextend             gap extension penalty (default=1 for protein)
--matrix                score matrix for protein alignment (default=BLOSUM62)
--custom-matrix         file containing custom scoring matrix
--lambda                lambda parameter for custom matrix
--K                     K parameter for custom matrix
--seg                   enable SEG masking of queries (yes/no)
--salltitles            print full subject titles in output files
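When several samples need the same settings, the command can be generated from a few variables. This is only a sketch: blastx_cmd is a hypothetical helper, and the assumption that every sample lives in its own directory containing final.contigs.fa is extrapolated from the single CP1S1 path used here.

```shell
# Paths below are copied from the command in the text; adjust them to your system.
DB="/data2/hanmz/hanmz_home/database/nr/nr"
QUERY_DIR="/data1/jilei/dini"
THREADS=30

# Print the diamond blastx command for one sample (hypothetical helper).
blastx_cmd() {
    sample="$1"
    printf 'diamond blastx -d %s -q %s/%s/final.contigs.fa -p %s -f 100 --sensitive -e 1e-5 --id 90 -k 50 -o %s.daa\n' \
        "$DB" "$QUERY_DIR" "$sample" "$THREADS" "$sample"
}

# Dry run: show the command for sample CP1S1.
blastx_cmd CP1S1
```

Printing the command instead of running it makes it easy to review before committing to a long job; piping the output to sh would execute it.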
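The run above takes a long time. When it finishes, a file like the one shown in the figure below is created in the current directory. A DAA file like this can then be imported into MEGAN for visual analysis of taxonomic composition.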
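If a plain-text table of the hits is also wanted, the DAA file can be converted with diamond's view subcommand. The sketch below assumes the output file is named CP1S1.daa and that diamond is on PATH; tab_name_for and the .m8 extension are illustrative choices, not something DIAMOND requires.

```shell
# Derive the tabular output name from a DAA file name (hypothetical helper).
tab_name_for() {
    printf '%s.m8\n' "${1%.daa}"
}

DAA="${DAA:-CP1S1.daa}"

if [ -s "$DAA" ] && command -v diamond >/dev/null 2>&1; then
    # -f 6 selects BLAST tabular output.
    diamond view -a "$DAA" -f 6 -o "$(tab_name_for "$DAA")"
else
    echo "skipping: $DAA not found or diamond is not on PATH" >&2
fi
```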
