CNV Enrichment Test Code
Reference Paper
Functional impact of global rare copy number variation in autism spectrum disorders.
Pinto D, Pagnamenta AT, Klei L, Anney R, Merico D, Regan R, et al.
Nature. 2010 Jun 9 (Epub ahead of print)
Brief Description
This test was designed to assess which gene-sets are hit by CNVs, comparing cases to controls. It is a "self-contained" type test; unlike "competitive" tests, it does not rely on a gene-level statistics, but directly constructs a statistic at the gene-set level. It requires an input table connecting CNVs to patients (cases and controls). Please see the code below for a more detailed description of the input requirements.
Essential description of the test procedure:
- CNVs can occur both in control and case patients (about with the same frequency for the Autism study)
- whenever a gene is affected by a CNV, we hypothesize every gene-set including that gene will (have a non-zero probability to) be "perturbed"
every patient contributes to the gene-set perturbation score as follows
- 0 = no CNV-affected gene in this gene-set for this patient
- 1 = one or more CNV-affected genes in this gene-set for this patient
- note: perturbation scores can be statistically treated as counts
we perform a Fisher's Exact Test on perturbation scores, comparing the case to the control class
we specifically test if perturbations are significantly higher in cases than in controls, for every gene-set- we also compute an "empirical" FDR (False Discovery Rate), by permuting patient-to-case/control associations
For a more detailed description, please refer to the paper supplementaries
Code
The script below calls the functions in the zipped file at the bottom.
Script
# CNV GENE-SET ENRICHMENT [LITE] # (by DANIELE MERICO, Sept 2010) # INPUT DATA # CNV data # 'CNV.df' is a data.frame with the following columns: # $Class (values: "case", "control") # $SampleID [i.e. patient identifier] # $Chr [i.e. CNV genomic coordinate: chromosome] # $Coord_i [i.e. CNV genomic coordinate: begin position] # $Coord_f [i.e. CNV genomic coordinate: end position] # $Length [i.e. CNV genomic coordinate: length] # $Type (values: '0' = DEL, '1' = DEL, '3' = DUP) # $Genes_eg (use ';' to separate multiple values) [i.e. genes overlapped by the CNV, Entrez-gene identifier] # $Gene_count [i.e. number of genes in the previous field] # 'Sample2Class.df' can be extracted from the previous: # it is a data.frame with the following columns: # $SampleID # $Class # Gene sets # 'GS2Genes.ls' is a list, with # - gene-set IDs as names # - genes (entrez-gene IDs) as slot content # 'GS2Name.chv' is a character vector, where # - names are the gene-set IDs # - values are the gene-set descriptions # These can be generated by parsing the GMT-formatted gene-sets at this page: # http://baderlab.org/GeneSetDB_02 # - NB: don't forget filtering for size; # we originally discarded gene-sets smaller than 5 genes and larger than 700 genes # LOAD FUNCTIONS # - Load all functions function.names <- c ( "Filter_CNV_01.R", "CNV_Key_01.R", "Sample2genes_from_CNV_01.R", "Sample2GS_from_Sample2genes_01.R", "FisherTestGS_01.R", "Add_GSname_01.R", "AddGSsize_toEnrdf_01.R", "GeneCounts_from_CNV_01.R", "AddSuppGenes_toEnrdf_01.R" ) for (f.name in function.names) {source (f.name)} # EXECUTE MAIN # 1) Set parameters # To keep only CNV-DEL, of any gene length CNV_Limits.nv <- c (+Inf, 0) names (CNV_Limits.nv) <- c ("DEL", "DUP") # To keep both CNV-DEL and CNV-DUP, of any gene length CNV_Limits.nv <- c (+Inf, +Inf) names (CNV_Limits.nv) <- c ("DEL", "DUP") # Add any Entrez-gene you want to remove from the analysis # (use only for hypothesized false positives, e.g. instable regions) Blacklist.eg <- "" # Number of randomizations for FDR computation Iter.n <- 5 # 2) Filter CNV_f.df <- f.Filter_CNV2gene ( CNV.df = CNV.df, CNV_limits.nv = CNV_Limits.nv, blacklist.eg = Blacklist.eg ) # 3) Generate 'sample 2 gene-sets' table CNV_f.df <- f.MakeCNV_key (CNV_f.df) Samples2Genes.ls <- f.Comp_Sample2genes (CNV.df = CNV_f.df) Sample2GS.tab <- f.Comp_Sample2GS ( GS2genes.ls = GS2Genes.ls, Sample2genes.ls = Samples2Genes.ls, Sample2class.df = Sample2Class.df ) # 4) Run Test Enr_1.df <- f.FisherTestGS_FDR_Wrap ( Sample2GS.tab = Sample2GS.tab, Sample2class.df = Sample2Class.df, iter.n = Iter.n ) # 5) Add gene-set attributes Enr_2.df <- f.AddGeneSize ( GS2genes.ls = GS2Genes.ls, Enr.df = Enr_1.df ) Enr_2.df <- f.Add_GSname ( GS2name.chv = GS2Name.chv, Enr.df = Enr_2.df ) # 6) Add support genes # - defined as the genes with more mutations in cases than ctrls CNV_Gene.tab <- f.CompGeneCounts (CNV.df = CNV_f.df) Enr_3.df <- f.AddSupportGenes ( GS_enr.df = Enr_2.df, Genes2class.tab = CNV_Gene.tab, GS2genes.ls = GS2Genes.ls ) Enr_final.df <- Enr_3.df # EXECUTE ANCILLARY # 1) Compute Gene Tables # - these have stats by-gene source ("GeneTable_01.R") GeneTable.ls <- f.MakeGeneTable ( Enr.df = Enr_final.df, eg2sy.chv = Ann_eg2sy.chv, GeneCounts.tab = CNV_Gene.tab, GS2genes.ls = GS2Genes.ls, CNV.df = CNV_f.df )