5.1.1 Assessment
As with other types of sequencing data analysis, quality assessment of nascent RNA sequencing data is a necessary first step: it improves the quality and accuracy of the data and forms the basis for subsequent analysis and discovery. Existing programs such as cutadapt, fastp, and trimmomatic are commonly used for this purpose. However, nascent RNA reads are distributed unevenly across the genome, with a large concentration of reads near transcription start and stop sites. These reads are biologically meaningful, so dedicated metrics must be calculated and visualized during preprocessing to assess the quality of the sequencing data accurately.
5.1.2 What does it do?
The assessment module is a wrapper script that preprocesses FASTQ files for quality control and read mapping (single-end or paired-end sequencing).
5.1.3 Features
Preprocess (from raw reads to clean reads)
remove low quality reads
remove adapter
remove polyX
trim two ends
Bowtie2 alignment and split
clean fastq -> original.sam
original.sam -> 1 unmap.sam 2 map.sam
map.sam -> 1 low_quality.sam 2 high_quality.sam
high_quality.sam -> 1 unique_map.sam 2 multiple_map.sam
unique_map.sam -> 1 mito.sam 2 chr.sam
chr.sam -> 1 assign.sam 2 unassign.sam
Obtain strand-specific genome tracks (bigWig)
unique_map.sam -> postive.bw, reverse.bw
unique_map.sam -> 5end_postive.bw, 5end_reverse.bw
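The alignment split listed above can be pictured as a cascade of per-record decisions. The following is a minimal sketch of that cascade, not nasap's actual implementation: the function name `classify_sam_record`, the MAPQ cutoff of 10, the mitochondrial contig names, and the use of Bowtie2's `XS:i` tag to flag multi-mapped reads are all assumptions made for illustration. The final assign/unassign step needs the GTF annotation and is left out.

```python
# Illustrative only: cutoff and contig names are assumptions, not nasap settings.
MAPQ_CUTOFF = 10
MITO_NAMES = {"chrM", "MT", "chrMT"}


def classify_sam_record(line, mapq_cutoff=MAPQ_CUTOFF):
    """Return the bucket that a single SAM alignment line would fall into."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])    # column 2: bitwise FLAG
    rname = fields[2]        # column 3: reference sequence name
    mapq = int(fields[4])    # column 5: mapping quality

    if flag & 0x4:           # FLAG bit 0x4 means the read is unmapped
        return "unmap"
    if mapq < mapq_cutoff:
        return "low_quality"
    # Bowtie2 emits XS:i only when it found a second-best alignment,
    # so its presence is used here as a proxy for multi-mapping.
    if any(tag.startswith("XS:i:") for tag in fields[11:]):
        return "multiple_map"
    if rname in MITO_NAMES:
        return "mito"
    return "chr"
```

Each bucket name mirrors one of the SAM files in the list, so counting the returned labels over a whole file would give the per-step read counts.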
5.1.4 Example
nasap assessment --output_root ./tmp --read1 ./data/test_r1.fq.gz --cores 12 --adapter1 TGGAATTCTCGGGTGCCAAGG --bowtie_index /home/meta_data/index/index_hg38_bowtie2/index_hg38_bowtie2 --gtf /home/meta_data/annotation/Homo_sapiens.GRCh38.93.gtf
Parameters
parameter | description |
---|---|
--bowtie_index (Required) | bowtie2 index file. |
--gtf (Required) | gtf file. |
--read1 (Required) | Sample FASTQ(gz) file. |
--read2 (Optional) | Mate (read 2) FASTQ(gz) file for paired-end data. |
--adapter1 (Optional) | Adapter sequence. |
--adapter2 (Optional) | Adapter sequence for read2. |
--umi (Optional) | UMI location. |
--cores (Optional) | Number of processes (cores) to use. |
--output_root (Optional) | Output root directory. |
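When the module is driven from a script, the required and optional parameters in the table can be assembled programmatically. The sketch below is one possible way to do that; the helper name `build_assessment_command` is hypothetical, and the read 2 path in the usage lines is an assumed file name. Only flags listed in the table are used.

```python
import shlex


def build_assessment_command(bowtie_index, gtf, read1, read2=None,
                             adapter1=None, adapter2=None,
                             cores=None, output_root=None):
    """Build an argument list for `nasap assessment` (hypothetical helper)."""
    # Required parameters always appear.
    cmd = ["nasap", "assessment",
           "--bowtie_index", bowtie_index,
           "--gtf", gtf,
           "--read1", read1]
    # Optional parameters are added only when a value was supplied.
    optional = [("--read2", read2), ("--adapter1", adapter1),
                ("--adapter2", adapter2), ("--cores", cores),
                ("--output_root", output_root)]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Paired-end usage; ./data/test_r2.fq.gz is an assumed file name.
cmd = build_assessment_command(
    bowtie_index="/home/meta_data/index/index_hg38_bowtie2/index_hg38_bowtie2",
    gtf="/home/meta_data/annotation/Homo_sapiens.GRCh38.93.gtf",
    read1="./data/test_r1.fq.gz",
    read2="./data/test_r2.fq.gz",
    cores=12,
    output_root="./tmp",
)
print(shlex.join(cmd))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting problems with paths that contain spaces.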
--bowtie_index:
Download the index files for the species of interest from the official Bowtie2 website.
--gtf:
Download the GTF annotation file for the species from the Ensembl database. Note that the downloaded file must be decompressed first.
--adapter1/--adapter2:
If no adapter is specified, the adapter sequence is detected automatically. However, this is not recommended because automatic detection can be inaccurate. If the adapter sequence of the sequencing data is unknown, it can be found on this website.
When the adapter sequence is known, use --adapter1 to specify it. For paired-end data in which read 2 carries its own adapter, set that sequence with --adapter2.
For example:
Single-end sequencing:
--adapter1 ATACAGCGGT
Paired-end sequencing:
--adapter1 ATACAGCGGT --adapter2 CAGGTACGAT
--umi_loc:
Preprocesses data that carry unique molecular identifiers (UMIs) by shifting the UMI into the sequence name. To activate UMI processing, use the command-line option --umi_loc.
--umi_loc accepts one of read1, read2, or per_read.
For example:
Single-end sequencing:
--umi_loc read1 --umi_len 8
Paired-end sequencing:
--umi_loc per_read --umi_len 8
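To make "shift UMI to sequence name" concrete, here is a minimal sketch of the operation on one FASTQ record. It assumes the UMI occupies the first `umi_len` bases of the read and is appended to the read identifier after a colon; the function name `shift_umi_to_name` and the separator are illustrative assumptions, and the exact name format written by the tool may differ.

```python
def shift_umi_to_name(name, seq, qual, umi_len):
    """Move the leading UMI bases of a read into its name (illustrative)."""
    if umi_len <= 0 or umi_len > len(seq):
        raise ValueError("umi_len must be between 1 and the read length")
    umi = seq[:umi_len]
    # Keep any comment after the first space untouched and attach the UMI
    # to the identifier part of the header line.
    read_id, sep, comment = name.partition(" ")
    new_name = f"{read_id}:{umi}{sep}{comment}"
    # The UMI bases and their quality scores are removed from the read.
    return new_name, seq[umi_len:], qual[umi_len:]
```

With `umi_len=8`, a 12-base read keeps only its last 4 bases, and the 8-base UMI travels in the read name so that later steps can use it for deduplication.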
5.1.5 Results
Adapter ratio
Measure | Value | Recommend |
---|---|---|
Reads with adapter | 15872809 | - |
Uninformative adapter reads | 1036494 | - |
Percent of uninformative adapter reads | 3.414% | <5% |
Peak adapter insertion size | 50 | - |
Adapter loss rate | 0.2508587895771456% | <5% |
RNA integrity
Measure | Value | Recommend |
---|---|---|
Degradation ratio | 0.6724109152607194% | <1 |
RNA insert sizes distribution plot
Insert sizes below 20 nucleotides in the read length distribution indicate poor-quality samples with degraded RNA. The plot assesses RNA integrity using the distribution of RNA insert sizes.
Read length distribution plot
The plot gives a detailed view of the nascent RNA read length distribution at each preprocessing step.
Preprocess summary stack plot
The plot summarises the nascent RNA read length distribution at each preprocessing step.
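The insert size check can be illustrated with a small tally. The sketch below counts insert sizes and reports the fraction shorter than 20 nucleotides, following the threshold stated in the text. It is not the formula behind the "Degradation ratio" value in the table, which is not defined in this section; the function names are hypothetical.

```python
from collections import Counter


def insert_size_distribution(insert_sizes):
    """Count how many reads have each insert size (the data behind the plot)."""
    return Counter(insert_sizes)


def fraction_below(insert_sizes, cutoff=20):
    """Fraction of reads whose insert is shorter than `cutoff` nucleotides."""
    sizes = list(insert_sizes)
    if not sizes:
        return 0.0
    return sum(1 for size in sizes if size < cutoff) / len(sizes)
```

A sample in which a large share of inserts falls below the cutoff would show a pronounced left shoulder in the distribution plot.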
Library complexity
Measure | Value | Recommend |
---|---|---|
NRF | 0.805873 | 0.5 < NRF < 0.8 |
PBC1 | 0.855506 | 0.5 < PBC1 < 0.8 |
PBC2 | 7.671887 | 1 < PBC2 < 3 |
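For readers unfamiliar with these metrics, the sketch below computes them using the ENCODE definitions: NRF is distinct positions over total reads, PBC1 is positions covered by exactly one read over distinct positions, and PBC2 is positions covered by exactly one read over positions covered by exactly two. Whether nasap computes them in precisely this way is an assumption, and the function name `library_complexity` is hypothetical.

```python
from collections import Counter


def library_complexity(positions):
    """Compute NRF, PBC1 and PBC2 from one position key per mapped read.

    Each element identifies where a read maps, for example a
    (chromosome, start, strand) tuple.
    """
    counts = Counter(positions)
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("no reads supplied")
    distinct = len(counts)                                # positions with >= 1 read
    one_read = sum(1 for c in counts.values() if c == 1)  # exactly one read
    two_reads = sum(1 for c in counts.values() if c == 2) # exactly two reads
    return {
        "NRF": distinct / total_reads,
        "PBC1": one_read / distinct,
        "PBC2": one_read / two_reads if two_reads else float("inf"),
    }
```

For instance, eight reads covering five positions with per-position counts 1, 1, 1, 2, and 3 give NRF = 5/8, PBC1 = 3/5, and PBC2 = 3/1.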
QC trend
Sequencing quality score plot
Sequencing quality is evaluated as the percentage of bases with a quality score greater than 20 (Q20) and greater than 30 (Q30). The plot compares nascent RNA read quality scores at each preprocessing step.
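As an illustration of how Q20 and Q30 percentages are derived, the sketch below decodes FASTQ quality strings and counts bases at or above each threshold. It assumes Phred+33 encoding and uses the common "at least 20/30" convention; switch the comparisons to strict inequalities if the strict reading is intended. The function name `q20_q30_percent` is hypothetical.

```python
def q20_q30_percent(quality_strings, offset=33):
    """Return (Q20 %, Q30 %) over all bases in the given quality strings."""
    total = q20 = q30 = 0
    for qual in quality_strings:
        for char in qual:
            score = ord(char) - offset   # Phred+33: ASCII code minus 33
            total += 1
            if score >= 20:
                q20 += 1
            if score >= 30:
                q30 += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * q20 / total, 100.0 * q30 / total
```

Running this on the FASTQ files written after each preprocessing step would give the series of values that a trend plot of this kind compares.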
Nascent RNA purity
Measure | Value | Recommend |
---|---|---|
Reads assigned to known genes | 16529808 | - |
Reads mapped to chrM | 676979 | - |
mRNA contamination | 0.2742952586685951 | 1 < value < 1.8 |
File | Directory |
---|---|
Exon intron ratio | csv/exon_intron_ratio.csv |
Exon to intron read density ratio plot
The plot evaluates mRNA contamination using the exon-to-intron read density ratio, based on read coverage over the exons and introns of protein-coding genes. It is used to check whether read coverage is uniform and whether there is any exon bias.
The smoothed line (lower panel) and Pearson correlation coefficients (upper panel) of exon and intron coverage are computed over all protein-coding genes.
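The exon-to-intron read density ratio can be sketched per gene as follows. This assumes density means read count divided by feature length in bases, which is a common definition but is not confirmed by this section; the function name and the example numbers are hypothetical.

```python
def exon_intron_density_ratio(exon_reads, exon_length,
                              intron_reads, intron_length):
    """Ratio of exon read density to intron read density for one gene."""
    if exon_length <= 0 or intron_length <= 0:
        raise ValueError("feature lengths must be positive")
    exon_density = exon_reads / exon_length        # reads per exonic base
    intron_density = intron_reads / intron_length  # reads per intronic base
    if intron_density == 0:
        return float("inf")                        # no intronic signal at all
    return exon_density / intron_density
```

A gene with 200 reads over 2,000 exonic bases and 900 reads over 18,000 intronic bases has densities of 0.1 and 0.05, giving a ratio of 2. Mature mRNA lacks introns, so a ratio well above the recommended range would point to mRNA contamination.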