CNV Enrichment Test Code

Reference Paper

Functional impact of global rare copy number variation in autism spectrum disorders.
Pinto D, Pagnamenta AT, Klei L, Anney R, Merico D, Regan R, et al.
Nature. 2010 Jun 9 (Epub ahead of print)

Brief Description

CNVs can occur both in control and case patients, about with the same frequency
whenever a gene is affected by a CNV, every gene-set that gene belongs to will (have a probability to) be "perturbed"
every patient contributes to a gene-set perturbation score
- the patient contribution can be of 0 or 1:
- 0 = no CNV-affected gene in this gene-set for this patient
- 1 = one or more CNV-affected genes in this gene-set for this patient
- note: perturbation scores can be statistically treated as counts
we perform a Fisher's Exact Test on perturbation scores, comparing the case to the control class
we specifically test if perturbations are significantly higher in cases than in controls, for every gene-set
we add a permutation-based "empirical" FDR (False Discovery Rate)
- permutation of sample-class associations

For a more detailed description, please refer to the paper supplementaries

Code

The script below calls the functions in the zipped file at the bottom.

Script

# CNV GENE-SET ENRICHMENT [LITE]

# INPUT DATA

# CNV data

# 'CNV.df' is a data.frame with the following columns:
# $Class (values: "case", "control")
# $SampleID [i.e. patient identifier]
# $Chr [i.e. CNV genomic coordinate: chromosome]
# $Coord_i [i.e. CNV genomic coordinate: begin position] 
# $Coord_f [i.e. CNV genomic coordinate: end position] 
# $Length [i.e. CNV genomic coordinate: length]
# $Type (values: '0' = DEL, '1' = DEL, '3' = DUP)  
# $Genes_eg (use ';' to separate multiple values) [i.e. genes overlapped by the CNV, Entrez-gene identifier] 
# $Gene_count [i.e. number of genes in the previous field]

# 'Sample2Class.df' can be extracted from the previous:
# it is a data.frame with the following columns:
# $SampleID
# $Class

# Gene sets

# 'GS2Genes.ls' is a list, with 
# - gene-set IDs as names 
# - genes (entrez-gene IDs) as slot content
# 'GS2Name.chv' is a character vector, where 
# - names are the gene-set IDs 
# - values are the gene-set descriptions

# These can be generated by parsing the GMT-formatted gene-sets at this page:
# http://baderlab.org/GeneSetDB_02

# LOAD FUNCTIONS
# - Load all functions

function.names <- c (
        "Filter_CNV_01.R", 
        "CNV_Key_01.R", "Sample2genes_from_CNV_01.R", "Sample2GS_from_Sample2genes_01.R",       "FisherTestGS_01.R", 
        "Add_GSname_01.R", "AddGSsize_toEnrdf_01.R", 
        "GeneCounts_from_CNV_01.R", "AddSuppGenes_toEnrdf_01.R"
        )

for (f.name in function.names)
        {source (f.name)}

# EXECUTE MAIN

# 1) Set parameters

# To keep only CNV-DEL, of any gene length
CNV_Limits.nv <- c (+Inf, 0)
names (CNV_Limits.nv) <- c ("DEL", "DUP")

# To keep both CNV-DEL and CNV-DUP, of any gene length
CNV_Limits.nv <- c (+Inf, +Inf)
names (CNV_Limits.nv) <- c ("DEL", "DUP")

# Add any Entrez-gene you want to remove from the analysis
# (use only for hypothesized false positives, e.g. instable regions)
Blacklist.eg   <- ""

# Number of randomizations for FDR computation
Iter.n         <- 5

# 2) Filter
        
CNV_f.df <- f.Filter_CNV2gene (
                                CNV.df = CNV.df, 
                                CNV_limits.nv = CNV_Limits.nv,
                                blacklist.eg  = Blacklist.eg
                                )

# 3) Generate 'sample 2 gene-sets' table

CNV_f.df <- f.MakeCNV_key (CNV_f.df)    

Samples2Genes.ls <- 
                f.Comp_Sample2genes (CNV.df = CNV_f.df)
                
Sample2GS.tab <- f.Comp_Sample2GS (
                                        GS2genes.ls = GS2Genes.ls, 
                                        Sample2genes.ls = Samples2Genes.ls, 
                                        Sample2class.df = Sample2Class.df
                                        )

# 4) Run Test
                
Enr_1.df <- f.FisherTestGS_FDR_Wrap (
                                        Sample2GS.tab   = Sample2GS.tab, 
                                        Sample2class.df = Sample2Class.df, 
                                        iter.n = Iter.n
                                        )                                       

# 5) Add gene-set attributes

Enr_2.df <- f.AddGeneSize (
                                        GS2genes.ls = GS2Genes.ls, 
                                        Enr.df      = Enr_1.df
                                        )

Enr_2.df <- f.Add_GSname (
                                        GS2name.chv = GS2Name.chv, 
                                        Enr.df       = Enr_2.df
                                        )
                                        
# 6) Add support genes
#    - defined as the genes with more mutations in cases than ctrls

CNV_Gene.tab <- 
                f.CompGeneCounts (CNV.df = CNV_f.df)    


Enr_3.df <- f.AddSupportGenes (
                                        GS_enr.df       = Enr_2.df, 
                                        Genes2class.tab = CNV_Gene.tab, 
                                        GS2genes.ls     = GS2Genes.ls
                                        )

Enr_final.df <- Enr_3.df

# EXECUTE ANCILLARY

# 1) Compute Gene Tables
#    - these have stats by-gene

source ("GeneTable_01.R")

GeneTable.ls <- f.MakeGeneTable (
                                        Enr.df         = Enr_final.df, 
                                        eg2sy.chv      = Ann_eg2sy.chv, 
                                        GeneCounts.tab = CNV_Gene.tab, 
                                        GS2genes.ls    = GS2Genes.ls, 
                                        CNV.df         = CNV_f.df
                                        )

Functions