CNV Enrichment Test Code
Reference Paper
Functional impact of global rare copy number variation in autism spectrum disorders.
Pinto D, Pagnamenta AT, Klei L, Anney R, Merico D, Regan R, et al.
Nature. 2010 Jun 9 (Epub ahead of print)
Brief Description
- CNVs can occur both in control and case patients, about with the same frequency
- whenever a gene is affected by a CNV, every gene-set that gene belongs to will (have a probability to) be "perturbed"
every patient contributes to a gene-set perturbation score
- the patient contribution can be of 0 or 1:
- 0 = no CNV-affected gene in this gene-set for this patient
- 1 = one or more CNV-affected genes in this gene-set for this patient
- note: perturbation scores can be statistically treated as counts
we perform a Fisher's Exact Test on perturbation scores, comparing the case to the control class
we specifically test if perturbations are significantly higher in cases than in controls, for every gene-set- we add a permutation-based "empirical" FDR (False Discovery Rate)
- permutation of sample-class associations
For a more detailed description, please refer to the paper supplementaries
Code
The script below calls the functions in the zipped file at the bottom.
- Script
# CNV GENE-SET ENRICHMENT [LITE] # INPUT DATA # CNV data # 'CNV.df' is a data.frame with the following columns: # $Class (values: "case", "control") # $SampleID [i.e. patient identifier] # $Chr [i.e. CNV genomic coordinate: chromosome] # $Coord_i [i.e. CNV genomic coordinate: begin position] # $Coord_f [i.e. CNV genomic coordinate: end position] # $Length [i.e. CNV genomic coordinate: length] # $Type (values: '0' = DEL, '1' = DEL, '3' = DUP) # $Genes_eg (use ';' to separate multiple values) [i.e. genes overlapped by the CNV, Entrez-gene identifier] # $Gene_count [i.e. number of genes in the previous field] # 'Sample2Class.df' can be extracted from the previous: # it is a data.frame with the following columns: # $SampleID # $Class # Gene sets # 'GS2Genes.ls' is a list, with # - gene-set IDs as names # - genes (entrez-gene IDs) as slot content # 'GS2Name.chv' is a character vector, where # - names are the gene-set IDs # - values are the gene-set descriptions # These can be generated by parsing the GMT-formatted gene-sets at this page: # http://baderlab.org/GeneSetDB_02 # LOAD FUNCTIONS # - Load all functions function.names <- c ( "Filter_CNV_01.R", "CNV_Key_01.R", "Sample2genes_from_CNV_01.R", "Sample2GS_from_Sample2genes_01.R", "FisherTestGS_01.R", "Add_GSname_01.R", "AddGSsize_toEnrdf_01.R", "GeneCounts_from_CNV_01.R", "AddSuppGenes_toEnrdf_01.R" ) for (f.name in function.names) {source (f.name)} # EXECUTE MAIN # 1) Set parameters # To keep only CNV-DEL, of any gene length CNV_Limits.nv <- c (+Inf, 0) names (CNV_Limits.nv) <- c ("DEL", "DUP") # To keep both CNV-DEL and CNV-DUP, of any gene length CNV_Limits.nv <- c (+Inf, +Inf) names (CNV_Limits.nv) <- c ("DEL", "DUP") # Add any Entrez-gene you want to remove from the analysis # (use only for hypothesized false positives, e.g. instable regions) Blacklist.eg <- "" # Number of randomizations for FDR computation Iter.n <- 5 # 2) Filter CNV_f.df <- f.Filter_CNV2gene ( CNV.df = CNV.df, CNV_limits.nv = CNV_Limits.nv, blacklist.eg = Blacklist.eg ) # 3) Generate 'sample 2 gene-sets' table CNV_f.df <- f.MakeCNV_key (CNV_f.df) Samples2Genes.ls <- f.Comp_Sample2genes (CNV.df = CNV_f.df) Sample2GS.tab <- f.Comp_Sample2GS ( GS2genes.ls = GS2Genes.ls, Sample2genes.ls = Samples2Genes.ls, Sample2class.df = Sample2Class.df ) # 4) Run Test Enr_1.df <- f.FisherTestGS_FDR_Wrap ( Sample2GS.tab = Sample2GS.tab, Sample2class.df = Sample2Class.df, iter.n = Iter.n ) # 5) Add gene-set attributes Enr_2.df <- f.AddGeneSize ( GS2genes.ls = GS2Genes.ls, Enr.df = Enr_1.df ) Enr_2.df <- f.Add_GSname ( GS2name.chv = GS2Name.chv, Enr.df = Enr_2.df ) # 6) Add support genes # - defined as the genes with more mutations in cases than ctrls CNV_Gene.tab <- f.CompGeneCounts (CNV.df = CNV_f.df) Enr_3.df <- f.AddSupportGenes ( GS_enr.df = Enr_2.df, Genes2class.tab = CNV_Gene.tab, GS2genes.ls = GS2Genes.ls ) Enr_final.df <- Enr_3.df # EXECUTE ANCILLARY # 1) Compute Gene Tables # - these have stats by-gene source ("GeneTable_01.R") GeneTable.ls <- f.MakeGeneTable ( Enr.df = Enr_final.df, eg2sy.chv = Ann_eg2sy.chv, GeneCounts.tab = CNV_Gene.tab, GS2genes.ls = GS2Genes.ls, CNV.df = CNV_f.df )