bioinfo-statistics
Realizing the promise of genome-wide association studies for effector gene prediction. Nat Genet (2025)
spnz3 2025. 6. 21. 20:02
Costanzo, M.C., Harris, L.W., Ji, Y. et al. Realizing the promise of genome-wide association studies for effector gene prediction. Nat Genet (2025). https://doi.org/10.1038/s41588-025-02210-5
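Some thoughts
- Effector gene prediction for a variant (working out what the variant affects) and using variants to establish gene-disease relationships (Mendelian randomization and similar) are slightly different problems.
- If a GWAS finds a variant in the coding region of a gene, and it is even a protein-altering variant, is it certain that the gene is causal for the disease?
- Being protein-altering and resolving the LD problem seem to be separate issues: the variant may still be in LD with a variant in another gene, and that other gene could be the one linked to the disease.
- Shouldn't Mendelian randomization results also follow unified analysis methods and reporting conventions?
- Wouldn't it be faster to simply collect GWAS results and redo effector gene prediction for all of them with a single unified pipeline?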
Introduction
‘Variant-to-function’ (V2F) problem
Although V2F is not necessary for every application of GWAS (for example, polygenic risk scores or Mendelian randomization), it is a crucial step toward mechanistic understanding of disease.
Arguably, the most important step within V2F is the identification of effector genes, as genes and their products offer the most direct clues into biological mechanisms and are the targets of most therapies. Consequently, in recent years, it has become increasingly common for GWAS to include lists of predicted effector genes as a major study outcome. ...
However, a rapid increase in published lists, with little consistency in the evidence types, may sow confusion rather than clarity. We argue that it is now time to develop guidelines or standards for constructing and reporting these effector gene lists.
History of effector gene prediction
Determining the effector gene for a GWAS association is a challenging task8,9,10. Linkage disequilibrium makes it difficult to localize an association to the underlying causal variant(s); most associations are outside of protein-coding regions14, and understanding the regulatory effects of causal variants requires a variety of assays in different cells and tissues10. For these reasons, the earliest GWAS publications annotated each locus with the gene closest to the strongest-associated variant, even though it was understood that the nearest gene was not necessarily the effector11.
Over time, researchers began to apply more sophisticated approaches to prioritize genes nearby GWAS associations (Fig. 1). We define ‘gene prioritization’ as the activity of aggregating multiple lines of evidence across GWAS significant loci to rank all genes at each locus by each evidence type. Gene prioritization on its own does not determine the directional relationship between gene perturbation and a trait (that is, whether gene activation or inhibition would be protective from disease), an important secondary question that requires additional data and approaches. This is a necessary first step toward an outcome of ‘effector gene prediction’, which we define as integrating the combined weight of evidence to identify the gene most likely to be the effector at each locus (Fig. 1 and Box 1). Some of the earliest high-profile systematic gene prioritization efforts were conducted by the GIANT consortium in 2010, for example, amassing evidence from the literature, pathway analyses and other criteria to evaluate 95 genes found near 48 independent GWAS loci for waist–hip ratio15.
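To make the distinction between 'gene prioritization' and 'effector gene prediction' concrete, here is a minimal sketch in Python. The evidence table, the gene and locus names, and the mean-rank integration rule are all assumptions for illustration; the paper does not prescribe a particular scoring scheme.

```python
from collections import defaultdict

# Hypothetical evidence table: (locus, gene, evidence_type, score).
# Higher score = stronger support. All names and values are made up.
evidence = [
    ("locus1", "GENE_A", "proximity", 0.9),
    ("locus1", "GENE_B", "proximity", 0.4),
    ("locus1", "GENE_A", "qtl", 0.2),
    ("locus1", "GENE_B", "qtl", 0.8),
    ("locus1", "GENE_A", "coding", 1.0),
    ("locus1", "GENE_B", "coding", 0.0),
]

def prioritize(evidence):
    """Gene prioritization: rank genes within each locus, separately per evidence type."""
    grouped = defaultdict(list)
    for locus, gene, ev_type, score in evidence:
        grouped[(locus, ev_type)].append((gene, score))
    ranks = {}
    for (locus, ev_type), items in grouped.items():
        ordered = sorted(items, key=lambda item: -item[1])
        for rank, (gene, _) in enumerate(ordered, start=1):
            ranks[(locus, ev_type, gene)] = rank
    return ranks

def predict_effector(ranks):
    """Effector gene prediction: integrate the per-type ranks (here, by mean rank,
    an assumed rule) and keep the best gene at each locus."""
    per_gene = defaultdict(list)
    for (locus, _ev_type, gene), rank in ranks.items():
        per_gene[(locus, gene)].append(rank)
    best = {}
    for (locus, gene), gene_ranks in per_gene.items():
        mean_rank = sum(gene_ranks) / len(gene_ranks)
        if locus not in best or mean_rank < best[locus][1]:
            best[locus] = (gene, mean_rank)
    return {locus: gene for locus, (gene, _) in best.items()}

ranks = prioritize(evidence)
print(predict_effector(ranks))
```

Note that neither step says anything about direction of effect: the output names a gene but not whether activating or inhibiting it would be protective.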
In recent years, lists of the most likely effector genes for GWAS traits, generated by gene prioritization and subsequent effector gene prediction, have become a focal point of many studies.
Surveying the landscape of gene prioritization
The authors collected and surveyed GWAS papers. They divide the evidence into two kinds, corresponding to ‘bottom-up’ and ‘top-down’ approaches, respectively.

Variant-centric evidence
Types of evidence
Frequently, a first step is to perform statistical fine-mapping20,21 to predict which variants are likely to be causal, potentially using functional priors22 or trans-ethnic analyses23 to improve resolution.
- Proximity of a GWAS variant to a gene (Table 1).
- In the <10% of cases24 in which a causal variant alters a protein-coding sequence, a prediction that the change is deleterious (for example, by tools such as Variant Effect Predictor25 and others) is usually considered to provide a clear link between the variant and that gene. When a causal variant lies in the noncoding genome, the simple assumption that it affects the promoter of the nearest gene may often be correct19,26,27.
- Significant gene-level aggregate association of common variants in and near a gene, estimated using tools such as MAGMA (Multi-Marker Analysis of Genomic Annotation)28.
- For noncoding variants, links to genes are inferred from three-dimensional physical contact with promoters or enhancers29,30,31, from their location within regions annotated as regulatory elements32,33,34 or from their statistical associations with molecular properties of genes or their protein products at quantitative trait loci (QTL)35,36,37,38.
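The simplest item in the list above, proximity, can be sketched in a few lines of Python. The gene coordinates are made up, and measuring distance to the gene body is an assumption; real pipelines may instead use distance to the transcription start site.

```python
# Hypothetical gene coordinates on one chromosome: name -> (start, end) in base pairs.
genes = {
    "GENE_A": (1_000, 5_000),
    "GENE_B": (8_000, 12_000),
    "GENE_C": (20_000, 26_000),
}

def distance_to_gene(pos, start, end):
    """Return 0 if the variant lies inside the gene body, else the distance to the nearer edge."""
    if start <= pos <= end:
        return 0
    return min(abs(pos - start), abs(pos - end))

def nearest_gene(pos, genes):
    """Annotate a variant position with the closest gene (proximity evidence only)."""
    return min(genes, key=lambda name: distance_to_gene(pos, *genes[name]))

print(nearest_gene(6_000, genes))   # GENE_A (1,000 bp away vs 2,000 bp for GENE_B)
print(nearest_gene(17_000, genes))  # GENE_C (3,000 bp away vs 5,000 bp for GENE_B)
```

As the excerpt on the history of effector gene prediction points out, the nearest gene is not necessarily the effector, so this is only one line of evidence.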
Caveats
- All these approaches implicitly assume that the variant impacts the effector gene in cis, rather than genes far away in trans, a reasonable assumption supported by evidence of primarily local regulation, although the true extent of trans effects remains uncertain.
- As gene regulation can only provide evidence linking a gene to a disease when it is shown to occur in a disease-relevant tissue, the cellular context of variant-centric regulatory evidence is critical. Here, single-cell approaches provide much higher resolution than those using bulk tissue samples32,39.
Gene-centric evidence
Gene-centric evidence begins with the set of genes that are inside a GWAS locus and considers evidence about each gene and its product, rather than about the causal variant in the locus.
- Guilt-by-association evidence is based on the idea that genes relevant to the same disease exhibit similarities to or interactions with each other
- Perturbational evidence arises from phenotypes conferred by impairment of a gene or its product, for example, Mendelian (monogenic) mutations or knockout experiments in model organisms40 or human cell lines, ‘burden tests’ of rare coding variants identified from whole-exome sequencing41,42 or functional impairment of a gene product by a drug.
- Literature
Computational pipelines
DEPICT (data-driven expression prioritized integration for complex traits)44, FUMA (functional mapping and annotation), SNP2GENE45, PoPS (polygenic priority score)26 and L2G (locus-to-gene)46 ...
Broad inconsistency in evidence types and presentation

Rarely did studies use the same set of evidence categories, with 73 distinct sets occurring across the 169 papers and the most common set (variant location and QTL only) shared by only ten papers. We did not identify any clear trends in the usage of different evidence categories over time, nor were we able to group studies into distinctive ‘types’.
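The "distinct sets of evidence categories" count can be reproduced on any paper-by-category table. A minimal sketch in Python, using a made-up four-paper table rather than the paper's actual 169-paper survey (paper IDs and category names are hypothetical):

```python
from collections import Counter

# Hypothetical survey: paper ID -> set of evidence categories it used.
papers = {
    "paper1": {"variant_location", "qtl"},
    "paper2": {"qtl", "variant_location"},
    "paper3": {"variant_location", "qtl", "literature"},
    "paper4": {"perturbation"},
}

# frozenset makes each combination hashable and order-independent,
# so paper1 and paper2 count as the same set.
set_counts = Counter(frozenset(categories) for categories in papers.values())

n_distinct = len(set_counts)
most_common_set, n_papers = set_counts.most_common(1)[0]

print(n_distinct)                          # 3
print(sorted(most_common_set), n_papers)   # ['qtl', 'variant_location'] 2
```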
Empirically, therefore, the most common type of predicted effector gene list, to the extent that commonality can be found, uses three or four evidence types (most likely including the QTL and variant location criteria) and includes evidence for all genes at each identified locus but without a quantitative scoring system (Fig. 3d).
Concordance of independently created lists
Most of these lists did not indicate the most likely effector gene at each locus, instead prioritizing multiple genes for each locus without a scoring system for the strength of evidence, so that the strongest predictions could not be identified and compared. Pairs of lists for four traits (Alzheimer’s disease51,52, heart failure53,54, stroke55,56 and estimated glomerular filtration rate57,58) did identify the top gene per locus, allowing us to compare their predictions (Supplementary Note).
...
While these comparisons indicate that predictions for shared loci across different studies are clearly more concordant than would be expected by chance, concordance rates of 50–75% are low from the perspective of producing ‘canonical’ gene lists that can be trusted by researchers who perform downstream studies of these genes.
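A concordance rate of this kind can be sketched as the fraction of shared loci at which two lists name the same top gene. The Python below assumes loci have already been matched between the two studies and share an identifier, which in practice is a nontrivial step; study contents and gene names are made up.

```python
def concordance(list_a, list_b):
    """Fraction of shared loci where both lists predict the same top gene.

    Each list maps locus ID -> top predicted effector gene.
    Returns None when the two lists share no loci.
    """
    shared = list_a.keys() & list_b.keys()
    if not shared:
        return None
    agree = sum(1 for locus in shared if list_a[locus] == list_b[locus])
    return agree / len(shared)

# Hypothetical top-gene lists from two independent studies of the same trait.
study_a = {"locus1": "GENE_A", "locus2": "GENE_B", "locus3": "GENE_C", "locus4": "GENE_D"}
study_b = {"locus1": "GENE_A", "locus2": "GENE_X", "locus3": "GENE_C", "locus5": "GENE_E"}

rate = concordance(study_a, study_b)
print(f"{rate:.2f}")  # 0.67: agreement at 2 of the 3 shared loci
```

Loci found by only one study (locus4, locus5 here) do not enter the rate, so it says nothing about how much the two lists overlap in the first place.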
Future directions
Our analysis of the inputs to these lists illustrates the main reason for their heterogeneity: effector gene prediction is an evolving science, with numerous evidence sources and methods available as inputs for predicting genes. Multiple efforts are underway to leverage this breadth of input data to more accurately predict effector genes17, including pipelines such as cS2G (combined SNP2Gene)65, Ei (effector index)66, FLAMES (fine-mapped locus assessment model of effector genes)27 and CALDERA (calling disease-related genes)67. Advances in machine learning and artificial intelligence may also soon revolutionize the entire field of effector gene prediction68. The availability of benchmarking sets of ‘gold standard’ genes will be crucial for these efforts, both for training the models they use and for evaluating the quality of their output, and, as more genes undergo detailed functional characterization, more of these sets will become available.
Concluding remarks
We envision a scenario in which such lists accompany GWAS publications as frequently as Manhattan plots and are submitted to catalogs similar to the GWAS Catalog in formats that can be easily integrated within downstream approaches, such as knowledge graphs or machine learning (https://go.nature.com/4koXsbd).
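As a rough illustration of what an "easily integrated" format could look like, here is a sketch of one machine-readable record per predicted effector gene, in Python. Every field name and value is hypothetical; this is not the format proposed by the authors or used by the GWAS Catalog.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import Optional

@dataclass
class EffectorGenePrediction:
    """One row of a hypothetical effector gene list (field names are assumptions)."""
    trait: str
    locus: str
    gene: str
    evidence: dict = field(default_factory=dict)  # evidence category -> supported?
    score: Optional[float] = None                 # None when no quantitative score exists

record = EffectorGenePrediction(
    trait="example trait",
    locus="locus1",
    gene="GENE_A",
    evidence={"variant_location": True, "qtl": False},
    score=0.8,
)

# Serialize to JSON so the record can be loaded by downstream tools.
serialized = json.dumps(asdict(record), sort_keys=True)
print(serialized)
```

Keeping the evidence categories and the score as explicit fields would address the two gaps noted earlier: lists that do not say which evidence was used, and lists that cannot be ranked.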