5.1 Assessment

5.1.1 Assessment

As with other types of sequencing data analysis, assessing nascent RNA sequencing data is necessary: it improves the quality and accuracy of the data and forms the basis for subsequent analysis and discovery. Existing programs such as cutadapt, fastp, and trimmomatic are commonly used for this purpose. However, nascent RNA reads are distributed unevenly across the genome, with a large concentration near transcription start and stop sites. These reads are biologically significant, so specific metrics must be calculated and visualized to assess the quality of the sequencing data accurately during preprocessing.

5.1.2 What does it do?

The assessment module is a wrapper script that preprocesses FASTQ files for quality control and read mapping (single-end or paired-end sequencing).

5.1.3 Features:

Preprocess (from raw reads to clean reads)

remove low quality reads
remove adapter
remove polyX
trim two ends

Bowtie2 alignment and split

clean fastq -> original.sam
original.sam -> (1) unmap.sam, (2) map.sam
map.sam -> (1) low_quality.sam, (2) high_quality.sam
high_quality.sam -> (1) unique_map.sam, (2) multiple_map.sam
unique_map.sam -> (1) mito.sam, (2) chr.sam
chr.sam -> (1) assign.sam, (2) unassign.sam
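The first splits in the cascade above can be sketched as a decision function over single SAM records. This is only an illustration, not the module's actual code: the MAPQ cutoff of 10, the use of Bowtie2's `XS:i` tag as the multi-mapping signal, and the set of mitochondrial contig names are all assumptions. The final assign/unassign step needs the GTF annotation and is left out.

```python
# Hypothetical helper; thresholds and contig names are assumptions.
MITO_NAMES = {"chrM", "MT", "chrMT", "M"}


def classify_sam_record(line, min_mapq=10):
    """Return the bucket a single SAM alignment line would fall into."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])    # bitwise FLAG
    rname = fields[2]        # reference sequence name
    mapq = int(fields[4])    # mapping quality
    tags = fields[11:]       # optional TAG:TYPE:VALUE fields

    if flag & 0x4:           # 0x4 = segment unmapped
        return "unmap"
    if mapq < min_mapq:
        return "low_quality"
    # Bowtie2 only emits XS:i when it found a second alignment for the read.
    if any(tag.startswith("XS:i:") for tag in tags):
        return "multiple_map"
    if rname in MITO_NAMES:
        return "mito"
    return "chr"
```

Each check mirrors one arrow in the list, so a record that falls through every test corresponds to chr.sam.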

Obtain strand-specific genome tracks (bigWig)

unique_map.sam -> postive.bw, reverse.bw
unique_map.sam -> 5end_postive.bw, 5end_reverse.bw
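For the 5'-end tracks, each read contributes a single position rather than its full coverage. A minimal sketch of how that position could be derived from a SAM record follows; the function names are hypothetical, and any protocol-specific strand flipping (some nascent RNA libraries sequence the reverse complement of the RNA) is not handled here.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(cigar):
    """Count reference bases covered by an alignment.

    Only M, D, N, = and X consume the reference; insertions and clips do not.
    """
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")


def five_prime_end(pos, flag, cigar):
    """Return (strand, 1-based 5' position) for one aligned read."""
    if flag & 0x10:  # 0x10 = read aligned to the reverse strand
        # On the reverse strand the 5' end is the rightmost aligned base.
        return "-", pos + reference_span(cigar) - 1
    return "+", pos
```

Counting these positions per strand gives the signal that a 5'-end bigWig track would carry.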

5.1.4 Example

nasap assessment --output_root ./tmp --read1 ./data/test_r1.fq.gz --cores 12 --adapter1 TGGAATTCTCGGGTGCCAAGG --bowtie_index /home/meta_data/index/index_hg38_bowtie2/index_hg38_bowtie2 --gtf /home/meta_data/annotation/Homo_sapiens.GRCh38.93.gtf

Parameters

parameter	description
--bowtie_index (Required)	bowtie2 index file.
--gtf (Required)	gtf file.
--read1 (Required)	Sample FASTQ(gz) file.
--read2 (Optional)	Mate (read 2) FASTQ(gz) file for paired-end data.
--adapter1 (Optional)	Adapter sequence.
--adapter2 (Optional)	Adapter sequence for read2.
--umi_loc (Optional)	UMI location (read1, read2, or per_read).
--cores (Optional)	Number of processes to use.
--output_root (Optional)	Output root directory.

--bowtie_index:
Download the index file for the target species from the official bowtie2 website.

--gtf:
Download the GTF annotation file for the species from the Ensembl database. Note that the downloaded file must be decompressed first.

--adapter1/--adapter2:
The adapter sequence can be detected automatically if it is not specified, but this is not recommended because the detection is inaccurate. If the adapter sequence of the sequencing data is unknown, it can be found on this website. When the adapter sequence is known, specify it with --adapter1. For paired-end reads that also carry an adapter on read 2, specify that sequence with --adapter2.

For example:
single-end sequencing:
--adapter1 ATACAGCGGT
paired-end sequencing:
--adapter1 ATACAGCGGT --adapter2 CAGGTACGAT

--umi_loc:
Preprocess data that contain unique molecular identifiers (UMIs) by shifting the UMI into the sequence name. To activate UMI processing, use the command-line option --umi_loc, which accepts read1, read2, or per_read.

For example:
single-end sequencing:
--umi_loc read1 --umi_len 8
paired-end sequencing:
--umi_loc per_read --umi_len 8
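To make "shift UMI to sequence name" concrete, here is a minimal sketch for one FASTQ record with the UMI at the start of the read. The function name and the `:UMI` suffix appended to the read identifier (the layout fastp uses) are assumptions for illustration; the module's exact output format is not documented here.

```python
def shift_umi_to_name(name, seq, qual, umi_len):
    """Move the first umi_len bases of a read into its name.

    Returns the new (name, sequence, quality) triple.
    """
    umi = seq[:umi_len]
    # Keep any comment after the first space (e.g. "1:N:0") untouched.
    read_id, sep, comment = name.partition(" ")
    new_name = f"{read_id}:{umi}{sep}{comment}"
    # Trim the UMI bases and their quality scores from the read itself.
    return new_name, seq[umi_len:], qual[umi_len:]
```

With `--umi_len 8`, a read named `@r1 1:N:0` that begins with `ACGTACGT` would become `@r1:ACGTACGT 1:N:0` under this assumed layout, and the first eight bases would no longer take part in alignment.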

5.1.5 Results

Adapter ratio

Measure	Value	Recommend
Reads with adapter	15872809	-
Uninformative adapter reads	1036494	-
Percent of uninformative adapter reads	3.414%	<5%
Peak adapter insertion size	50	-
Adapter loss rate	0.2508587895771456%	<5%

RNA integrity

Measure	Value	Recommend
Degradation ratio	0.6724109152607194%	<1

RNA insert size distribution plot

adapter_insertion_distribution
Insert sizes below 20 nucleotides in the read length distribution indicate poor-quality samples with degraded RNA. The plot assesses RNA integrity using the RNA insert size distribution.

Read length distribution plot

reads_distribution
The plot gives a detailed view of the nascent RNA read length distribution at each preprocessing step.

Preprocess summary stack plot

reads_ratio
The plot summarises the nascent RNA read length distribution at each preprocessing step.

Library complexity

Measure	Value	Recommend
NRF	0.805873	0.5 < NRF < 0.8
PBC1	0.855506	0.5 < PBC1 < 0.8
PBC2	7.671887	1 < PBC2 < 3
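The three complexity metrics follow the ENCODE definitions: NRF is the number of distinct read positions divided by the total number of reads, PBC1 is the number of positions covered by exactly one read divided by the number of distinct positions, and PBC2 is the number of positions covered by exactly one read divided by the number covered by exactly two. A minimal sketch, assuming each uniquely mapped read has already been reduced to a hashable position such as a (chromosome, start, strand) tuple; how the module itself keys positions is not specified here.

```python
from collections import Counter


def library_complexity(positions):
    """Return (NRF, PBC1, PBC2) for an iterable of read positions."""
    counts = Counter(positions)
    total_reads = sum(counts.values())
    if total_reads == 0:
        raise ValueError("no reads supplied")
    distinct = len(counts)
    one_read = sum(1 for c in counts.values() if c == 1)
    two_reads = sum(1 for c in counts.values() if c == 2)

    nrf = distinct / total_reads
    pbc1 = one_read / distinct
    # PBC2 is undefined when no position has exactly two reads.
    pbc2 = one_read / two_reads if two_reads else float("inf")
    return nrf, pbc1, pbc2
```

For example, eight reads falling on five distinct positions, three of them seen once and one seen twice, give NRF = 5/8, PBC1 = 3/5, and PBC2 = 3/1.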

QC trend

Sequencing quality score plot

qc_trend
Sequencing quality is evaluated as the percentage of bases with a quality score greater than 20 (Q20) and greater than 30 (Q30). The plot compares nascent RNA read quality scores at each preprocessing step.
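The Q20 and Q30 fractions can be computed directly from the quality lines of a FASTQ file. The sketch below assumes Phred+33 encoding and counts scores at or above each threshold, which is the common convention; where exactly the module places the boundary is an assumption.

```python
def q20_q30(quality_strings, offset=33):
    """Return the fractions of bases with Phred score >= 20 and >= 30."""
    total = q20 = q30 = 0
    for qual in quality_strings:
        for char in qual:
            score = ord(char) - offset  # decode the Phred score
            total += 1
            if score >= 20:
                q20 += 1
            if score >= 30:
                q30 += 1
    if total == 0:
        raise ValueError("no bases supplied")
    return q20 / total, q30 / total
```

Running this once on the raw reads and once after each preprocessing step would give the per-step values that such a trend plot compares.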

Nascent RNA purity

Measure	Value	Recommend
Reads assigned to known genes	16529808	-
Reads mapped to chrM	676979	-
mRNA contamination	0.2742952586685951	1 < value < 1.8

File	Directory
Exon intron ratio	csv/exon_intron_ratio.csv

Exon to intron read density ratio plot

exon_intron
The plot evaluates mRNA contamination using the exon-to-intron read density ratio. It shows read coverage over the exons and introns of protein-coding genes and is used to check whether read coverage is uniform and whether there is any exon bias. The smoothed line (lower panel) and Pearson correlation coefficients (upper panel) of exon and intron coverage are computed over all protein-coding genes.
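As an illustration of an exon-to-intron read density ratio, the sketch below normalizes read counts by feature length before taking the ratio. The function name, its inputs, and the per-base normalization are assumptions; the exact formula behind csv/exon_intron_ratio.csv is not stated in this section.

```python
def exon_intron_density_ratio(exon_reads, exon_length, intron_reads, intron_length):
    """Return (reads per exonic base) / (reads per intronic base) for one gene."""
    if exon_length <= 0 or intron_length <= 0:
        raise ValueError("feature lengths must be positive")
    if intron_reads == 0:
        raise ValueError("ratio is undefined without intronic reads")
    exon_density = exon_reads / exon_length
    intron_density = intron_reads / intron_length
    return exon_density / intron_density
```

A gene with 200 reads over 2,000 exonic bases and 900 reads over 18,000 intronic bases would have a ratio of 2.0 under this definition. Because nascent transcripts are largely unspliced, exon and intron densities are expected to be similar, and a strong excess over exons suggests mature mRNA in the library.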