bioinfo-statistics
Bioinformatics Analysis Methods: A List of Statistical Principles
ChatGPT 4o version. To be updated.
🧬 1. Genomic Association Analysis
| Analysis | Statistical Methods | Statistical Principle |
|---|---|---|
| GWAS | Linear regression, Logistic regression, Linear Mixed Models (LMMs), GLMMs, Firth regression, Bayesian fine-mapping | Linear and logistic models estimate marginal SNP effects via maximum likelihood; LMMs add random effects to model relatedness, with variance components estimated by REML; Firth regression penalizes the likelihood to reduce bias for rare variants or unbalanced case-control ratios; Bayesian fine-mapping computes posterior inclusion probabilities under sparsity priors. |
| Meta-analysis of GWAS | Fixed-effects, Random-effects, Inverse variance weighting, Bayesian meta-analysis | Combines effect sizes assuming shared (fixed) or varying (random) true effects; weights estimates by inverse variance; Bayesian models use hierarchical priors on effect heterogeneity. |
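| Rare variant association | Burden test, SKAT, SKAT-O, C-alpha, REGENIE | Burden tests regress the phenotype on an aggregated (optionally weighted) rare-variant count per gene; SKAT uses a variance-component score test in a mixed-model framework; SKAT-O optimally combines the burden and SKAT statistics; REGENIE fits a two-step whole-genome regression, using block-wise ridge regression to approximate the LMM polygenic effect before association testing. |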
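
The inverse-variance weighting used in fixed-effects GWAS meta-analysis can be sketched in a few lines. This is a minimal illustration, not any specific tool's implementation; the function name `ivw_fixed_effects` and the per-study numbers are hypothetical.

```python
import math

def ivw_fixed_effects(betas, ses):
    """Pool per-study effect sizes for one SNP with inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in ses]            # w_i = 1 / SE_i^2
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)                        # SE of the pooled estimate
    return beta, se, beta / se                         # estimate, SE, z-score

# Hypothetical effect sizes and standard errors from three studies
pooled_beta, pooled_se, z = ivw_fixed_effects([0.10, 0.20, 0.15], [0.05, 0.10, 0.05])
print(pooled_beta, pooled_se, z)
```

A random-effects model would additionally add an estimate of between-study variance to each study's variance before weighting.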
🧪 2. Expression and QTL Mapping
| Analysis | Statistical Methods | Statistical Principle |
|---|---|---|
| eQTL/sQTL/pQTL analysis | Linear regression, ANOVA, FastQTL, Matrix eQTL, Bayesian QTL mapping | Tests genotype-expression association via linear models; Bayesian methods infer posterior SNP effects using sparsity and sharing priors; Matrix eQTL applies vectorized least squares. |
| Allele-specific expression (ASE) | Binomial/Beta-binomial models | Models allelic imbalance via discrete distributions; overdispersion is handled with beta-binomial likelihood. |
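| Multi-tissue QTL analysis | Meta-Tissue, mashr, TensorQTL | Meta-Tissue combines mixed models with random-effects meta-analysis across tissues; mashr performs adaptive shrinkage via empirical Bayes with a mixture of multivariate normal priors that capture effect sharing; TensorQTL runs linear-regression QTL mapping with GPU-accelerated tensor operations, typically one tissue at a time. |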
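
As a rough illustration of the binomial model for allele-specific expression, the sketch below computes an exact two-sided binomial p-value for allelic imbalance at one heterozygous site. The function name `ase_binomial_pvalue` and the read counts are hypothetical, and the null of equal allelic expression (`p_ref=0.5`) is an assumption; real data usually call for a beta-binomial model to absorb overdispersion.

```python
from math import comb

def ase_binomial_pvalue(ref_reads, alt_reads, p_ref=0.5):
    """Exact two-sided binomial test of allelic imbalance.

    Sums the probabilities of all outcomes that are no more likely than
    the observed reference-read count under the null proportion p_ref.
    Intended for modest read depths (very large n would underflow).
    """
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * p_ref ** k * (1 - p_ref) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Small relative tolerance guards against floating-point ties
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))

# Hypothetical read counts at one heterozygous SNP
print(ase_binomial_pvalue(ref_reads=18, alt_reads=6))
```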
🔍 3. Causal Inference and Mediation
| Analysis | Statistical Methods | Statistical Principle |
|---|---|---|
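| Mendelian Randomization (MR) | IVW, MR-Egger, Weighted median/mode, GSMR, Bayesian MR | IVW combines per-variant Wald ratios by inverse-variance-weighted regression through the origin, which is asymptotically equivalent to two-stage least squares; MR-Egger adds an intercept that estimates average directional pleiotropy and adjusts the slope for it; weighted median/mode estimators remain consistent when a subset of instruments is invalid; Bayesian MR uses prior-informed inference on causal parameters. |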
| Mediation MR / Two-step MR | Product of coefficients, Sobel test, Delta method, Parametric bootstrap | Estimates indirect effects as product of path coefficients; standard errors from asymptotic delta method or empirical resampling; assumes linear causal model and independence of instruments. |
| Causal discovery | PC algorithm, GES, LiNGAM, FCI, DAG-GNN | Constraint-based methods (PC, FCI) use conditional independence tests; score-based methods (GES) search over DAGs to optimize a likelihood-based score; LiNGAM identifies causal direction in linear models by assuming non-Gaussian noise; DAG-GNN fits structural causal models via variational inference under an acyclicity constraint. |
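
The IVW and MR-Egger estimators for summary-level MR can be sketched as weighted regressions of SNP-outcome effects on SNP-exposure effects. This is a simplified illustration with first-order weights (1 / SE of the outcome effect squared) and no standard error for the Egger terms; the function names and the summary statistics are hypothetical.

```python
import math

def ivw_estimate(bx, by, se_y):
    """IVW causal estimate: weighted mean of Wald ratios by/bx."""
    weights = [x * x / (s * s) for x, s in zip(bx, se_y)]
    total = sum(weights)
    beta = sum(w * (y / x) for w, x, y in zip(weights, bx, by)) / total
    return beta, math.sqrt(1.0 / total)      # estimate, fixed-effect SE

def mr_egger(bx, by, se_y):
    """MR-Egger: weighted regression of by on bx with a free intercept."""
    # Orient every variant so that its exposure effect is positive
    signs = [1.0 if x >= 0 else -1.0 for x in bx]
    x = [s * v for s, v in zip(signs, bx)]
    y = [s * v for s, v in zip(signs, by)]
    w = [1.0 / (s * s) for s in se_y]
    sw = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx                         # pleiotropy-adjusted causal estimate
    intercept = y_bar - slope * x_bar         # average directional pleiotropy
    return intercept, slope

# Hypothetical SNP-exposure (bx) and SNP-outcome (by) effects for three instruments
bx, by, se_y = [0.1, 0.2, 0.3], [0.05, 0.10, 0.15], [0.01, 0.02, 0.01]
print(ivw_estimate(bx, by, se_y))
print(mr_egger(bx, by, se_y))
```

An Egger intercept far from zero suggests directional pleiotropy, in which case the IVW estimate is likely biased.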
🔬 4. Transcriptomics and Epigenomics
| Analysis | Statistical Methods | Statistical Principle |
|---|---|---|
| Differential gene expression (DGE) | DESeq2, edgeR, limma | Negative binomial models (DESeq2, edgeR) and linear modeling with empirical Bayes moderation (limma); shrinkage of dispersion or variance estimates. |
| Alternative splicing analysis | rMATS, MAJIQ, SUPPA2 | Generalized linear models and likelihood-ratio tests for splicing event inclusion levels (PSI); MAJIQ uses Bayesian local splicing variations. |
| DNA methylation analysis | Beta regression, M-value linear regression, limma | Methylation proportions (beta values) are modeled directly with a beta distribution, or logit-transformed to M-values and fitted with linear models; limma adds empirical Bayes shrinkage of the variance estimates. |
| ATAC-seq/ChIP-seq differential peak | DiffBind, csaw, DESeq2 | Negative binomial or sliding window models for read counts; significance via likelihood-ratio or Wald tests. |
| RNA editing analysis | REDItools, GLMs for mismatch rates | Binomial/Poisson modeling of editing proportions at specific loci; sometimes beta-binomial for overdispersion. |
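
The M-value transformation mentioned for methylation analysis is a base-2 logit of the beta value. The sketch below shows the transform and its inverse; the function names and the clipping constant `eps` are illustrative assumptions, not taken from any particular package.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M-value (log2 odds)."""
    beta = min(max(beta, eps), 1.0 - eps)    # clip to avoid log of 0 or division by 0
    return math.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Convert an M-value back to a beta value."""
    return 2.0 ** m / (1.0 + 2.0 ** m)

# Hypothetical beta values for three CpG sites
for b in (0.1, 0.5, 0.8):
    print(b, beta_to_m(b))
```

M-values are unbounded and closer to homoscedastic, which is why they suit linear models better than raw beta values.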
🧫 5. Multi-Omics Integration
| Analysis | Statistical Methods | Statistical Principle |
|---|---|---|
| Multi-omics factor analysis | MOFA, iCluster, SNF | MOFA uses Bayesian group factor analysis with variational inference; iCluster fits penalized Gaussian latent variable models; SNF fuses similarity graphs across omics. |
| Multi-omics regression / prediction | Elastic net, Random forest, SVM, Kernel Ridge Regression | Regularized linear models with L1/L2 penalties (elastic net); non-linear ensemble models; kernel-based methods for high-dimensional data. |
| Multi-omics QTL integration | eQTL + meQTL + pQTL | Joint linear models or Bayesian multivariate models; conditional independence and co-mapping of variants across omics layers. |
| Pathway-based integration | PARADIGM, NetGSA | Bayesian factor graph models (PARADIGM) and multivariate linear models with structured gene set priors (NetGSA). |
| Mediation with multi-omics | Two-step MR, structural equation modeling | Estimates indirect effects across omics layers using path analysis, product of coefficients, or SEM with latent variables. |
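
To make the L1/L2 penalties of the elastic net concrete, the sketch below evaluates the penalty term and the soft-thresholding operator that coordinate-descent solvers apply to each coefficient. It assumes the common parametrization lam * (alpha * L1 + (1 - alpha) / 2 * L2 squared); the function names and coefficients are hypothetical.

```python
def elastic_net_penalty(coefs, lam, alpha):
    """Penalty lam * (alpha * ||b||_1 + (1 - alpha) / 2 * ||b||_2^2)."""
    l1 = sum(abs(b) for b in coefs)
    l2_sq = sum(b * b for b in coefs)
    return lam * (alpha * l1 + 0.5 * (1.0 - alpha) * l2_sq)

def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

# Hypothetical coefficients for two omics features
print(elastic_net_penalty([1.0, -2.0], lam=1.0, alpha=0.5))
print(soft_threshold(3.0, 1.0), soft_threshold(-0.5, 1.0))
```

Setting `alpha=1` gives the LASSO penalty and `alpha=0` gives ridge; the soft-thresholding step is what drives small coefficients exactly to zero.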
🤖 6. Machine Learning and Predictive Modeling
| Analysis | Statistical Methods | Statistical Principle |
|---|---|---|
| Classification / regression | Logistic regression, Random forest, SVM, XGBoost | Logistic regression uses maximum likelihood; random forests are ensemble decision trees using bootstrapping; SVM maximizes margin; XGBoost uses gradient-boosted trees. |
| Survival analysis | Cox regression, DeepSurv | Cox models estimate hazard ratios via partial likelihood; DeepSurv uses neural networks to model risk functions under right-censoring. |
| Dimensionality reduction | PCA, t-SNE, UMAP, Autoencoders | PCA decomposes variance via SVD; t-SNE and UMAP are non-linear neighbor-embedding methods that preserve local neighborhood structure; autoencoders learn latent representations via neural networks. |
| Feature selection | LASSO, RFE, Boruta | LASSO penalizes coefficients with the L1 norm, shrinking some exactly to zero; RFE performs backward elimination based on model weights; Boruta compares each feature's random-forest importance against that of permuted shadow features. |
| Deep learning for sequences | CNN, RNN, Transformers | CNN captures local patterns; RNN models temporal dependencies; Transformers learn attention-based representations in sequence data. |
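
The partial likelihood behind Cox regression can be written out directly for a single covariate. This is a minimal sketch (tied event times are handled as in the Breslow approximation), with a hypothetical function name and toy data; fitting a model would mean maximizing this quantity over `beta`.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Cox partial log-likelihood for one covariate.

    times  : follow-up time per subject
    events : 1 if the event was observed, 0 if right-censored
    x      : covariate value per subject
    """
    loglik = 0.0
    for t_i, d_i, x_i in zip(times, events, x):
        if not d_i:
            continue                          # censored subjects only enter risk sets
        # Risk set: everyone still under observation at time t_i
        risk = sum(math.exp(beta * x_j) for t_j, x_j in zip(times, x) if t_j >= t_i)
        loglik += beta * x_i - math.log(risk)
    return loglik

# Hypothetical data: four subjects, one of them censored
times, events, x = [2.0, 3.0, 5.0, 7.0], [1, 0, 1, 1], [1.0, 0.0, 1.0, 0.0]
print(cox_partial_loglik(0.5, times, events, x))
```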
🧠 7. Population and Evolutionary Genetics
| Analysis | Statistical Methods | Statistical Principle |
|---|---|---|
| Ancestry inference | PCA, ADMIXTURE, STRUCTURE | PCA performs eigen-decomposition of the genotype covariance (genetic relationship) matrix; ADMIXTURE uses maximum likelihood estimation of ancestry proportions; STRUCTURE uses Bayesian clustering with MCMC. |
| Phasing and imputation | SHAPEIT, Beagle, IMPUTE | Hidden Markov Models (HMMs) for haplotype state transitions; genotype imputation via forward-backward or phasing-assisted likelihoods. |
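| Selection scans | iHS, XP-EHH, Fst, PBS | Detects selection via extended haplotype homozygosity within a population (iHS) or between populations (XP-EHH), population differentiation (Fst), or population-specific branch lengths derived from pairwise Fst among three populations (PBS). |
| IBD / ROH detection | KING, GERMLINE, PLINK | KING estimates pairwise kinship from genotype sharing; GERMLINE finds IBD segments by hashing and extending matching phased haplotypes; PLINK calls runs of homozygosity with a sliding-window scan. |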
| Demographic inference | MSMC, fastsimcoal2, dadi | Coalescent models and diffusion approximations; infer demographic history via composite likelihood over site frequency spectra or pairwise coalescent rates. |
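
The relationship between Fst and PBS can be sketched as follows. The Fst here is a simple two-population heterozygosity-based version that assumes equal sample sizes, which is a simplification of the estimators used in practice; the function names and allele frequencies are hypothetical.

```python
import math

def wright_fst(p1, p2):
    """Fst = (H_T - H_S) / H_T for one biallelic site and two equal-sized populations."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0                            # site is monomorphic in both populations
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0
    return (h_total - h_within) / h_total

def pbs(fst_ab, fst_ac, fst_bc):
    """Population branch statistic for population A relative to B and C."""
    def branch(fst):
        fst = min(max(fst, 0.0), 1.0 - 1e-12) # keep the log argument positive
        return -math.log(1.0 - fst)
    return (branch(fst_ab) + branch(fst_ac) - branch(fst_bc)) / 2.0

# Hypothetical allele frequencies in populations A, B and C
pa, pb, pc = 0.9, 0.2, 0.25
print(pbs(wright_fst(pa, pb), wright_fst(pa, pc), wright_fst(pb, pc)))
```

A large PBS for population A indicates allele-frequency change specific to its branch, which is the signal that selection scans look for.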