Differences between revisions 10 and 14 (spanning 4 versions)

RNA-seq Data Analysis

Next Generation Sequencing (NGS)

Platforms:
- Illumina's Genome Analyzer
- Applied Biosystems' SOLiD
- Roche's 454 Life Sciences
Terminology
- Sequencing Depth or Coverage: Total number of reads mapped to the genome/transcriptome, also known as library size.
- Transcript/gene length: Number of bases in a gene.
- Read counts: Number of reads mapping to a given gene/transcript (the expression measurement). Reads are typically 30 to 400 bp long, depending on the DNA-sequencing technology used.
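The three terms above combine into the RPKM measure (reads per kilobase per million mapped reads). A minimal sketch, assuming toy data: the gene names, counts, and lengths below are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical toy data: gene -> (read count, gene length in bases).
genes = {
    "geneA": (500, 2000),
    "geneB": (1500, 1000),
    "geneC": (8000, 4000),
}

# Sequencing depth (library size): total number of mapped reads.
library_size = sum(count for count, _ in genes.values())


def rpkm(count, length_bp, library_size):
    """Reads per kilobase of gene length per million mapped reads."""
    return count / (length_bp / 1_000) / (library_size / 1_000_000)


rpkm_values = {
    gene: rpkm(count, length, library_size)
    for gene, (count, length) in genes.items()
}

for gene, value in rpkm_values.items():
    print(f"{gene}: RPKM = {value:.1f}")
```

Dividing by gene length and by library size makes values comparable across genes of different lengths and across samples of different depths.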
Illumina's sequencing technology
- One flow cell: 8 lanes, one of which is often used for the control sample.
- Multiplexing (see Conceptual overview)
  - Pooling: Sequencing multiple samples on a single unit (an Illumina lane or flow cell)
  - Barcoding: Individual barcode sequences are added to each sample so they can be distinguished and sorted during data analysis.
    - Barcodes (barcode sequences) are used to separate and de-multiplex reads from each sample.
    - Barcodes in a single unit: 12 different samples can be indexed with unique subsequences and loaded onto each lane. In total, 96 samples (12 samples per lane x 8 lanes) can be sequenced per run.
  - Highlights:
    1. Fast, High-Throughput Strategy: Large sample numbers can be simultaneously sequenced during a single experiment
    2. Cost-Effective Method: Multisample pooling improves productivity by reducing time and reagent use
    3. High-Quality Data: Accurate maintenance of read length of unknown sequences (why?)
    4. Balanced Blocked Designs: Multiplexing makes it feasible to construct balanced blocked designs for testing differential expression.
- Sequencing modes
  - Single-end Read: One read sequenced from one end of each cDNA insert
  - Paired-end Read: Two reads sequenced from each cDNA insert (one from each end)
    - Paired-end sequencing costs more than single-end sequencing
- Quantitative standards (spike-ins, see ENCODE Consortium 2011)
  - It is highly desirable to include a ladder of RNA spike-ins to calibrate quantification, sensitivity, coverage and linearity.
- Questions about the sequencing
  - Is the cost calculated by the number of lanes or flow cells used?
  - About $2,000 per lane?
  - How many samples can be in one lane?
  - How many reads can be obtained from one lane?
  - How many reads per sample can be obtained from one lane or one flow cell?
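The barcoding step described under Multiplexing can be illustrated with a small de-multiplexing sketch. It assumes, purely for illustration, that the barcode is the first few bases of each read; on real instruments the index is usually read separately, and the barcode sequences and sample names below are hypothetical.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample map; real index sequences depend on the kit.
BARCODES = {"ACGT": "sample1", "TGCA": "sample2"}
BARCODE_LEN = 4


def demultiplex(reads, barcodes=BARCODES, barcode_len=BARCODE_LEN):
    """Group reads by sample, assuming the barcode is the read's prefix."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:barcode_len], read[barcode_len:]
        # Reads with an unknown barcode are set aside rather than guessed.
        sample = barcodes.get(tag, "undetermined")
        by_sample[sample].append(insert)
    return dict(by_sample)


pooled_reads = ["ACGTTTGACC", "TGCAGGATCC", "ACGTCCATGA", "NNNNACGTAC"]
result = demultiplex(pooled_reads)

for sample, inserts in result.items():
    print(sample, len(inserts), inserts)
```

Counting the reads per sample after this step gives a direct answer to the reads-per-sample question for a given lane.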

RNA Sequencing Pipeline

RNA Sequencing: Experimental Design

Sequencing variations
- Different genes have different variances and are potentially subject to different errors and biases.
- Sources of variation affecting only a minority of genes should be integrated into the design as well (e.g. PCR-based GC bias).
- Technical variability (experimental errors and biases): Estimated from repeated measurements of the same biological sample in multiple lanes or flow cells. For example, running the same biological sample in different lanes provides information about lane-to-lane variability. Two main sources of variation may contribute to confounding effects:
  1. Batch effects: errors introduced between random fragmentation of the RNA and loading onto the flow cell (e.g. PCR, reverse transcription).
  2. Lane effects: errors introduced between loading the flow cell and obtaining data from the sequencing machine (e.g. bad sequencing cycles, base-calling).
- Biological variability (see ENCODE Consortium 2011)
  1. A biological replicate is defined as an independent growth of cells/tissue and subsequent analysis.
  2. Experiments should be performed with two or more biological replicates, unless there is a compelling reason why this is impractical or wasteful (e.g. overlapping time points with high temporal resolution).
  3. A typical R^2 (Pearson) correlation of gene expression (RPKM) between two biological replicates, for RNAs that are detected in both samples using RPKM or read counts, should be between 0.92 and 0.98. Experiments with biological correlations that fall below 0.9 should either be repeated or explained.
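The replicate-correlation check above can be sketched as follows. This is a minimal illustration, not the ENCODE procedure itself: the expression values are hypothetical, "detected" is assumed to mean a value greater than zero, and the 0.9 threshold is the one quoted in the guideline.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def replicate_r2(rep1, rep2):
    """R^2 over genes detected (value > 0, an assumption) in both replicates."""
    pairs = [(a, b) for a, b in zip(rep1, rep2) if a > 0 and b > 0]
    xs, ys = zip(*pairs)
    return pearson_r(xs, ys) ** 2


# Hypothetical RPKM values for five genes in two biological replicates.
rep1 = [10.0, 25.0, 0.0, 80.0, 150.0]
rep2 = [12.0, 22.0, 3.0, 85.0, 140.0]

r2 = replicate_r2(rep1, rep2)
needs_review = r2 < 0.9  # below 0.9: repeat or explain the experiment
print(f"R^2 = {r2:.3f}, needs review: {needs_review}")
```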
Sequencing depth (see ENCODE Consortium 2011)
- The amount of sequencing needed for a given sample is determined by the goals of the experiment and the nature of the RNA sample.
- Experiments whose purpose is to evaluate the similarity between the transcriptional profiles of two polyA+ samples may require only modest depths of sequencing (e.g. 30M paired-end reads of length > 30 nt, of which 20-25M are mappable to the genome or known transcriptome).
- Experiments whose purpose is the discovery of novel transcribed elements and strong quantification of known transcript isoforms require more extensive sequencing. The ability to reliably detect low-copy-number transcripts/isoforms depends on the depth of sequencing and on a sufficiently complex library.
- For experiments from a typical mammalian tissue or in which sensitivity of detection is important, a minimum depth of 100-200 M 2 x 76 bp or longer reads is currently recommended.
Purposes of the experimental design
- avoid or eliminate the technical variation (possible confounding factors)
- estimate the biological variation
Sequencing design
- Sampling: subject sampling, RNA sampling, and fragment sampling
- Randomization: assigning individuals at random to groups (guards against systematic bias and confounding)
- Replication: biological replicates allow for estimation of within-treatment group (biological) variability, which is needed for making inferences between treatment groups.
- Blocking: experimental units are grouped into homogeneous clusters (blocks)
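Randomization, replication, and blocking can be combined in a randomized block layout. The sketch below is one possible way to do this, under assumed names: the blocks ("batch1", "batch2"), unit labels, and treatment names are hypothetical, and a fixed seed is used only to make the example reproducible.

```python
import random
from collections import Counter


def randomized_block_assignment(blocks, treatments, seed=0):
    """Within each block, assign treatments to units in random order."""
    rng = random.Random(seed)
    assignment = {}
    for block, units in blocks.items():
        shuffled = list(units)
        rng.shuffle(shuffled)  # randomization within the block
        for i, unit in enumerate(shuffled):
            # Cycling through treatments keeps each block balanced.
            assignment[unit] = (block, treatments[i % len(treatments)])
    return assignment


# Hypothetical experimental units grouped into two homogeneous blocks.
blocks = {
    "batch1": ["s1", "s2", "s3", "s4"],
    "batch2": ["s5", "s6", "s7", "s8"],
}
design = randomized_block_assignment(blocks, ["control", "treated"])
per_block = Counter(design.values())

for unit, (block, treatment) in sorted(design.items()):
    print(unit, block, treatment)
```

Each block ends up with two replicates per treatment, so treatment comparisons are not confounded with block differences.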
Balanced block designs
- Barcoding: DNA fragments can be labeled (barcoded) with sample-specific sequences that allow multiple samples to be included in the same sequencing lane.
- Pooling: All the samples of RNA are pooled into the same batch and then sequenced in one lane of a flow cell.
- Any batch and lane effects are the same for all the samples.
Balanced incomplete block designs (BIBD)
- Technical constraints and the scientific hypotheses:
  1. the number of treatments (I)
  2. the number of biological replicates per treatment (J)
  3. the number of unique barcodes (s) that can be included in a single lane (block)
  4. the number of lanes available for sequencing (L)
- When s < I, i.e., the number of unique barcodes in one lane is less than the number of treatments or samples,
  - a complete block design is not possible.
  - In these cases, a BIBD is suggested.
- Balanced incomplete block:
  - Incomplete: cannot fit all treatments (samples) in each block
  - Balanced: each pair of treatments occurs together the same number of times, so the variance of the difference between any two treatments is constant.
- BIBD:
  - The total number of possible technical replicates per biological replicate is T = sL/(JI).
  - The number of times each pair of treatments occurs together is k = J(s-1)/(I-1), which must be an integer.
  - Extensive lists of BIBDs can be found in Fisher and Yates (1963) and Cochran and Cox (1957).
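The BIBD conditions above can be checked numerically. The sketch below is an illustration under stated assumptions, not the cited authors' procedure: it computes T = sL/(JI) as given, and for the pair count it uses the standard BIBD identity lambda = r(s-1)/(I-1) with r = J*T appearances per treatment, which reduces to the expression k = J(s-1)/(I-1) when T = 1. The function and variable names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations


def bibd_parameters(n_treatments, n_bio_reps, n_barcodes, n_lanes):
    """Return (T, pair_count) as exact fractions.

    T = sL/(JI); pair_count = r(s-1)/(I-1) with r = J*T (assumed identity).
    """
    tech_reps = Fraction(n_barcodes * n_lanes, n_bio_reps * n_treatments)
    appearances = n_bio_reps * tech_reps  # r: times each treatment is sequenced
    pair_count = appearances * (n_barcodes - 1) / (n_treatments - 1)
    return tech_reps, pair_count


def is_feasible(n_treatments, n_bio_reps, n_barcodes, n_lanes):
    """A BIBD needs s < I and integer values for both parameters."""
    tech_reps, pair_count = bibd_parameters(
        n_treatments, n_bio_reps, n_barcodes, n_lanes
    )
    return (
        n_barcodes < n_treatments
        and tech_reps.denominator == 1
        and pair_count.denominator == 1
    )


# Example: I = 3 treatments, J = 1 replicate, s = 2 barcodes, L = 3 lanes.
T, pair_count = bibd_parameters(3, 1, 2, 3)

# One concrete layout: every pair of treatments shares exactly one lane.
lanes = list(combinations(["A", "B", "C"], 2))
observed = Counter(frozenset(pair) for lane in lanes for pair in combinations(lane, 2))

print("T =", T, "pair count =", pair_count)
print("lanes:", lanes)
```

With two lanes instead of three, T would be 4/3, so no balanced layout exists for that configuration.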

RNA Sequencing: Statistical Analysis

RNAseq Data Analysis using edgeR package

Filtering genes with extremely low expression using a CPM cutoff
- CPM cutoff
  1. median CPMs
  2. grid search over CPM cutoffs in [1, 50], comparing the similarity of sample distributions (boxplots, density plots, and wilcox.test)
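A CPM-based filter can be sketched as below. This is one possible reading of the "median CPMs" option, assumed for illustration: a gene is kept when its median CPM across samples reaches the cutoff. The counts and library sizes are hypothetical, and the library sizes stand for total mapped reads per sample rather than the sum of the three genes shown.

```python
from statistics import median

# Hypothetical data: gene -> raw counts in three samples.
counts = {
    "geneA": [0, 1, 0],
    "geneB": [120, 90, 150],
    "geneC": [5, 8, 2],
}
# Hypothetical library sizes (total mapped reads per sample).
library_sizes = [1_000_000, 800_000, 1_200_000]


def cpm(gene_counts, library_sizes):
    """Counts per million mapped reads, one value per sample."""
    return [c / n * 1_000_000 for c, n in zip(gene_counts, library_sizes)]


def filter_by_median_cpm(counts, library_sizes, cutoff=1.0):
    """Keep genes whose median CPM across samples is at least `cutoff`."""
    return {
        gene: gene_counts
        for gene, gene_counts in counts.items()
        if median(cpm(gene_counts, library_sizes)) >= cutoff
    }


kept = filter_by_median_cpm(counts, library_sizes, cutoff=1.0)
kept_strict = filter_by_median_cpm(counts, library_sizes, cutoff=10.0)
print("cutoff 1:", sorted(kept))
print("cutoff 10:", sorted(kept_strict))
```

Repeating the filter over a grid of cutoffs and comparing the resulting sample distributions corresponds to the second option in the list.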
Normalization
Differential expression testing
- Design matrix
- Comparison or contrast
- Dispersion estimation
- Statistical tests (model fitting)
  1. ET (exact tests): exactTest
  2. LRT: glmFit, glmLRT
  3. QLF: glmQLFit, glmQLFTest
- Interpreting the testing results
  1. adjusted p-value (FDR): topTags
  2. QQ-plots of p-values, z-values, or chi-squared values: qqpvalue, qqstats
    - conservative or liberal
    - confounding or unknown effects
  3. MA plot, volcano plot
  4. p-value or FDR scores
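The adjusted p-values (FDR) listed above are, by default in edgeR's topTags, Benjamini-Hochberg adjusted values. The sketch below re-implements that adjustment in plain Python to show the arithmetic; it is an illustration, not edgeR code, and the p-values are hypothetical.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.005]
fdr = bh_adjust(raw)

for p, q in zip(raw, fdr):
    print(f"p = {p:.3f}  FDR = {q:.3f}")
```

Genes are then typically called differentially expressed when the adjusted value falls below a chosen FDR threshold.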

References

Auer and Doerge (2010) Statistical Design and Analysis of RNA Sequencing Data, Genetics
Illumina's technical note: Estimating Sequencing Coverage
Some Issues of Statistical Design & Analysis in RNA-seq Experiment, Shaheena Bashir (2012)
Standards, Guidelines and Best Practices for RNA-Seq V1.0. The ENCODE Consortium (June 2011)

- Revision 10 as of 2015-01-29 22:48:36
  Size: 7484
  Editor: ChangjiangXu
  Comment:
+ Revision 14 as of 2016-01-18 04:51:13
  Size: 8947
  Editor: ChangjiangXu
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 1:
-#acl CscGroup:read
+#acl All:read
 Line 18:
-   * One flow cell: 8 lanes
   * One lane is often used for the control sample.
   * Multiplexing:
     * a way to save money by sequencing multiple samples on a single unit (an Illumina's flow cell)
     * a feasibility to construct balanced blocked designs for the purpose of testing differential expression.
   * Barcoding: 
     * used to separate inputs.
     * The output can be deconvoluted to individual samples.
     * Many barcodes in a single unit: 12 different samples can be indexed with unique subsequences and loaded onto each lane. In total, 96 samples can be sequenced per run.
   * Quantitative standards (spike-ins, see ENCODE Consortium 2011)
     * It is highly desirable to include a ladder of RNA spike-ins to calibrate quantification, sensitivity, coverage and linearity.
+   * One flow cell: 8 lanes, one of which is often used for the control sample.
   * Multiplexing (see [[http://www.illumina.com/technology/next-generation-sequencing/multiplexing-sequencing-assay.html | Conceptual overview]])
     * Pooling: Sequencing multiple samples on a single unit (an Illumina's lane or flow cell)
     * Barcoding: Individual barcode sequences are added to each sample so they can be distinguished and sorted during data analysis.
       * Barcodes (barcode sequences) are used to separate and de-multiplex reads from each sample.
       * Barcodes in a single unit: 12 different samples can be indexed with unique subsequences and loaded onto each lane. In total, 96 samples can be sequenced per run.
     * Highlights:
       1. Fast, High-Throughput Strategy: Large sample numbers can be simultaneously sequenced during a single experiment
       1. Cost-Effective Method: Multisample pooling improves productivity by reducing time and reagent use
       1. High-Quality Data: Accurate maintenance of read length of unknown sequences (why?)
       1. Balanced Blocked Designs: A feasibility to construct balanced blocked designs for the purpose of testing differential expression.
 Line 33:
-   * Questions about the sequencing cost
+   * Quantitative standards (spike-ins, see ENCODE Consortium 2011)
     * It is highly desirable to include a ladder of RNA spike-ins to calibrate quantification, sensitivity, coverage and linearity.
   * Questions about the sequencing
-Line 91:
+Line 93:
+=== RNAseq Data Analysis using edgeR package ===
 * Filtering extremely low expressed genes using a CPM cutoff
   * CPM cutoff 
     1. median CPMs
     1. search in CPM grids of [1, 50] by comparing the similarity of sample distributions (boxplots, density plots, and wilcox.test)
 * Normalization
 * Differential expression testing
   * Design matrix
   * Comparison or contract
   * Dispersion estimation
   * Statistical testings (model fitting)
     1. ET (exact tests): exactTest
     1. LRT: glmFit, glmLRT
     1. QLF: glmQLFit, glmQLFTest
   * Interpreting the testing results
     1. adjusted pvalue (FDR): topTags
     1. QQ-plots of pvalues, z-values, or chi-squared values: qqpvalue, qqstats
        * conservative or liberal
        * confounding or unknown effects
     1. MAplot, volcano plot
     1. pvalue or FDR scores
-Line 92:
+Line 115:
-'''References'''
+== References ==