The Search for Genetic Markers and the Context-Dependence Problem
Generally speaking, the search for genetic/molecular markers of human diseases aims at finding genetic/molecular features (e.g. Single Nucleotide Polymorphisms or other mutations of the DNA sequence, transcript levels, levels of specific protein isoforms) that are specifically associated with a disease state versus a healthy state. In human studies, the main entities are patients (i.e. human individuals), from whom a biological sample is drawn (possibly from different organs/tissues, or at different time points) and then analyzed to measure the presence, the level, or any other parameter associated with genetic/molecular markers. A very straightforward approach to the problem is to:
- group the samples into healthy and disease classes (samples rather than patients, since the same patient can be in different states at different times)
- carry out a statistical test (e.g. a two-sample t-test for quantitative features, or Fisher's exact test for count data) to find features that differ between the two classes
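To make the second step concrete for count data, here is a minimal sketch of a two-sided Fisher's exact test written with the Python standard library only. The function name and the layout of the 2x2 table (rows: disease / healthy samples; columns: feature present / absent) are my own illustrative choices, and the numbers in the usage line are made up, not taken from any study. In real work an established statistics package would be the more sensible choice.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout (illustrative):
        a = disease samples with the feature,  b = disease samples without it
        c = healthy samples with the feature,  d = healthy samples without it
    """
    row1, row2 = a + b, c + d      # number of disease / healthy samples
    col1 = a + c                   # number of samples carrying the feature
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that x of the feature carriers
        # fall in the disease class, with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is at most as likely as the
    # observed one (small tolerance guards against floating-point ties).
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Toy table: 8 of 10 disease samples carry the feature, 1 of 6 healthy ones does.
p = fisher_exact_two_sided(8, 2, 1, 5)
print(f"p = {p:.4f}")  # p = 0.0350
```

A small p-value here only says that the feature is unevenly distributed between the two classes in this sample; when many features are screened, a multiple-testing correction would be needed on top of this.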
However, this approach is only partially successful, as not all patients share the same markers. Is that related to uncontrolled experimental variables and human variability? For sure. As a matter of fact, it is reasonable to expect much greater reproducibility when the search is carried out in a different context, i.e. animal-model medical research:
- the pathology is an experimental model of pathology (thus more uniform)
- the animals are grown in lab conditions (thus environmental factors are controlled)
- the genetic background of the animals is uniform
All this sounds like common sense to every life scientist. However, I would like to point out the underlying issue more explicitly, tagging this problem as context-dependence. As a matter of fact, a molecular marker is directly or indirectly correlated with the molecular mechanism of the disease, which is also why marker searches are useful not just for diagnostic or prognostic purposes, but also for understanding the pathogenesis process and its underlying physiological causes. Indeed, given a set of individuals that can be regarded as healthy, it is reasonable to assume that their physiological systems are in different particular states, that their genetic backgrounds differ, and that they are exposed to different environmental factors; this whole body of factors makes up the context. As a consequence, these individuals can develop a pathology following different causal paths, and thus exhibit different markers. How is this counteracted in actual marker searches? In my opinion, two main heuristics are adopted:
- A. search for combinations of features (e.g. SNP Rs-102354 AND SNP Rs-560234 occurring with value T and C)
- B. restrict to patients with common contexts
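As an illustration of the first heuristic, the sketch below counts how often a combination of features occurs in each class, producing the 2x2 contingency table that a count-data test would then take as input. The SNP identifiers are the ones used as examples above; the genotypes, the sample labels and the function names are invented for the illustration and carry no biological meaning.

```python
def carries_all(genotype, required):
    """True if the sample shows every required allele (logical AND)."""
    return all(genotype.get(snp) == allele for snp, allele in required.items())


def combination_table(samples, required):
    """Count the samples carrying the whole combination, per class.

    Returns (disease_with, disease_without, healthy_with, healthy_without).
    """
    counts = {
        ("disease", True): 0, ("disease", False): 0,
        ("healthy", True): 0, ("healthy", False): 0,
    }
    for genotype, state in samples:
        counts[(state, carries_all(genotype, required))] += 1
    return (
        counts[("disease", True)], counts[("disease", False)],
        counts[("healthy", True)], counts[("healthy", False)],
    )


# Made-up toy data: (genotype per SNP, sample state).
samples = [
    ({"Rs-102354": "T", "Rs-560234": "C"}, "disease"),
    ({"Rs-102354": "T", "Rs-560234": "C"}, "disease"),
    ({"Rs-102354": "T", "Rs-560234": "G"}, "disease"),
    ({"Rs-102354": "T", "Rs-560234": "G"}, "healthy"),
    ({"Rs-102354": "A", "Rs-560234": "C"}, "healthy"),
    ({"Rs-102354": "A", "Rs-560234": "G"}, "healthy"),
]
required = {"Rs-102354": "T", "Rs-560234": "C"}

print(combination_table(samples, required))  # (2, 1, 0, 3)
```

In this toy set neither SNP alone separates the classes (T and C each occur in healthy samples too), while the combination occurs only in disease samples. Note that the number of candidate combinations grows very quickly with the number of features, so an exhaustive search of this kind is only practical for small feature sets or with some prior filtering.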
My experience with the first strategy is quite limited. For the second strategy, a full array of context-homogeneity criteria can be applied to refine the patient classes:
- environmental factors acting as risk-factors
- drug treatment
- co-occurring pathologies acting as risk-factors
- demographic data related to the physiological state (e.g. sex, age)
- demographic data related to the genetic background (e.g. ethnic group, or family)
- phenotypic data (e.g. reported symptoms, observed signs, exam results)
- etiological groups (e.g. Atherosclerotic Ischemic Stroke vs Lacunar Ischemic Stroke), when the cause of the pathology is known
- transcriptional state of the sampled cellular system (i.e. in transcriptional microarray studies)
- histological profile of the cellular system (e.g. epithelial neoplasia vs bone-marrow neoplasia)
Of course, the downside of narrowing the classes is the loss of generality (e.g. being unable to find the common features of a pathology) or of statistical support (e.g. ending up with a very limited number of individuals, insufficient to perform any statistical test). Moreover, it is not known a priori which refinement strategy is going to work best (at least according to my own general experience, though special heuristics may be available in particular cases). For that reason, it may be useful to rely on both strategies. In addition, looking for combinations of features may provide interesting insights, also useful to complete/analyze gene networks and pathways; e.g. a full analogy can be drawn between the notion of genetic interaction in synthetic lethality experiments in yeast and genetic interaction for co-occurring markers in human disease.
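The trade-off between narrower classes and statistical support can be seen in a small sketch of the second heuristic. It groups patients by a chosen set of context variables and keeps only the groups that still contain enough individuals of both classes. The patient records, the field names (`sex`, `smoker`, `state`) and the threshold `min_per_class` are all hypothetical and chosen only for the illustration.

```python
from collections import defaultdict


def stratify(patients, context_keys):
    """Group patients that share the same values for the given context keys."""
    strata = defaultdict(list)
    for patient in patients:
        strata[tuple(patient[key] for key in context_keys)].append(patient)
    return dict(strata)


def usable_strata(strata, min_per_class):
    """Keep only strata with at least min_per_class disease AND healthy members."""
    usable = {}
    for context, members in strata.items():
        n_disease = sum(1 for p in members if p["state"] == "disease")
        n_healthy = len(members) - n_disease
        if min(n_disease, n_healthy) >= min_per_class:
            usable[context] = members
    return usable


# Made-up toy cohort.
patients = [
    {"id": 1, "sex": "F", "smoker": True, "state": "disease"},
    {"id": 2, "sex": "F", "smoker": True, "state": "healthy"},
    {"id": 3, "sex": "F", "smoker": False, "state": "disease"},
    {"id": 4, "sex": "F", "smoker": False, "state": "healthy"},
    {"id": 5, "sex": "M", "smoker": True, "state": "disease"},
    {"id": 6, "sex": "M", "smoker": True, "state": "disease"},
    {"id": 7, "sex": "M", "smoker": False, "state": "healthy"},
    {"id": 8, "sex": "M", "smoker": False, "state": "healthy"},
]

by_sex = usable_strata(stratify(patients, ["sex"]), min_per_class=2)
by_sex_and_smoking = usable_strata(stratify(patients, ["sex", "smoker"]), min_per_class=2)
print(len(by_sex), len(by_sex_and_smoking))  # 2 0
```

With one context variable both strata remain testable; adding a second one leaves no stratum with two individuals per class, which is exactly the loss of statistical support described above.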
I consider these issues not only interesting, but also important for my projects here at CCBR, as they bear on the search for relevant sub-networks and patterns that distinguish healthy and disease states.