Bioinformatics Analysis Methods: A List of Statistical Principles

spnz3 2025. 5. 11. 23:08

ChatGPT 4o version. To be updated.

 

🧬 1. Genomic Association Analysis

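| Analysis | Statistical Methods | Statistical Principle |
| --- | --- | --- |
| GWAS | Linear regression, logistic regression, linear mixed models (LMMs), GLMMs, Firth regression, Bayesian fine-mapping | Linear and logistic models estimate marginal (one SNP at a time) effects by maximum likelihood; LMMs add a random effect with a kinship-based covariance to model relatedness, with variance components typically estimated by REML; Firth regression penalizes the likelihood to reduce bias for rare variants and unbalanced case-control ratios; Bayesian fine-mapping computes posterior inclusion probabilities under sparsity priors. |
| Meta-analysis of GWAS | Fixed-effects, random-effects, inverse variance weighting, Bayesian meta-analysis | Combines per-study effect sizes assuming either one shared true effect (fixed effects) or study-specific true effects drawn from a common distribution (random effects); estimates are weighted by inverse variance; Bayesian models place hierarchical priors on effect heterogeneity. |
| Rare variant association | Burden test, SKAT, SKAT-O, C-alpha, REGENIE | Burden tests collapse the rare variants in a gene or region into one aggregate score and regress the phenotype on it; SKAT is a variance-component score test in a mixed-model framework; SKAT-O combines the burden and SKAT statistics; REGENIE fits stacked block-wise ridge regressions in a first step to approximate an LMM, then tests single variants or gene-based masks in a second step. |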
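
The inverse variance weighting in the meta-analysis row can be sketched in a few lines. This is a minimal illustration of the fixed-effects case only, assuming independent studies and a normal approximation for the test statistic; the function name `fixed_effect_meta` and the per-study numbers are hypothetical.

```python
import math

def fixed_effect_meta(betas, ses):
    """Inverse-variance-weighted fixed-effects meta-analysis for one variant."""
    weights = [1.0 / se ** 2 for se in ses]           # w_i = 1 / SE_i^2
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)                       # SE of the pooled estimate
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))            # two-sided normal p-value
    return beta, se, z, p

# Hypothetical effect sizes and standard errors for one SNP in three studies.
betas = [0.10, 0.20, 0.15]
ses = [0.05, 0.10, 0.05]
beta, se, z, p = fixed_effect_meta(betas, ses)
print(f"pooled beta={beta:.4f}, SE={se:.4f}, z={z:.2f}, p={p:.2e}")
```

Studies with smaller standard errors dominate the pooled estimate. A random-effects model would add a between-study variance term to each weight's denominator, which this sketch does not do.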
 

🧪 2. Expression and QTL Mapping

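| Analysis | Statistical Methods | Statistical Principle |
| --- | --- | --- |
| eQTL/sQTL/pQTL analysis | Linear regression, ANOVA, FastQTL, Matrix eQTL, Bayesian QTL mapping | Tests the association between genotype and a molecular phenotype (expression, splicing, or protein level) with linear models; Matrix eQTL recasts the least-squares tests as large matrix operations for speed; FastQTL approximates permutation p-values with a beta distribution; Bayesian methods infer posterior SNP effects using sparsity and sharing priors. |
| Allele-specific expression (ASE) | Binomial/beta-binomial models | Models allelic imbalance as reference versus alternative read counts at heterozygous sites using a binomial distribution; a beta-binomial likelihood handles overdispersion. |
| Multi-tissue QTL analysis | Meta-Tissue, mashr, TensorQTL | Meta-Tissue combines a mixed model with random-effects meta-analysis across tissues; mashr performs adaptive shrinkage by empirical Bayes, using a mixture of multivariate normal priors learned from the data; TensorQTL is a GPU-accelerated implementation of regression-based QTL mapping rather than a multi-tissue model in itself. |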
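
The linear-model test in the eQTL row reduces, for a single SNP-gene pair without covariates, to ordinary least squares of expression on genotype dosage. The sketch below assumes an additive 0/1/2 coding and made-up expression values; `simple_eqtl` is a hypothetical helper, not the API of any of the tools listed.

```python
import math

def simple_eqtl(genotypes, expression):
    """OLS of expression on genotype dosage; returns slope, intercept, t-statistic."""
    n = len(genotypes)
    g_mean = sum(genotypes) / n
    e_mean = sum(expression) / n
    sxx = sum((g - g_mean) ** 2 for g in genotypes)
    sxy = sum((g - g_mean) * (e - e_mean) for g, e in zip(genotypes, expression))
    slope = sxy / sxx                                  # effect per alternative allele
    intercept = e_mean - slope * g_mean
    rss = sum((e - intercept - slope * g) ** 2
              for g, e in zip(genotypes, expression))
    se = math.sqrt(rss / (n - 2) / sxx)                # standard error of the slope
    return slope, intercept, slope / se

# Hypothetical dosages (0/1/2) and normalized expression for six samples.
genotypes = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 1.9, 2.1, 3.0, 3.2]
slope, intercept, t = simple_eqtl(genotypes, expression)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, t={t:.2f}")
```

The t-statistic has n - 2 degrees of freedom here. Real pipelines also regress out covariates such as ancestry principal components and hidden expression factors before or during this step.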
 

🔍 3. Causal Inference and Mediation

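| Analysis | Statistical Methods | Statistical Principle |
| --- | --- | --- |
| Mendelian Randomization (MR) | IVW, MR-Egger, weighted median/mode, GSMR, Bayesian MR | IVW combines per-variant Wald ratios with inverse variance weights, which is equivalent to a weighted regression of variant-outcome effects on variant-exposure effects through the origin (and, with individual-level data, asymptotically to two-stage least squares); MR-Egger adds an intercept to that regression to detect and adjust for directional pleiotropy; weighted median and mode estimators stay consistent when a subset of instruments is invalid; Bayesian MR places priors on the causal parameters. |
| Mediation MR / Two-step MR | Product of coefficients, Sobel test, delta method, parametric bootstrap | Estimates the indirect effect as the product of path coefficients (exposure to mediator, mediator to outcome); standard errors come from the asymptotic delta method (the Sobel test is its first-order form) or from resampling; assumes a linear causal model and independent instruments. |
| Causal discovery | PC algorithm, GES, LiNGAM, FCI, DAG-GNN | Constraint-based methods (PC, FCI) use conditional independence tests, with FCI allowing for latent confounders; score-based methods (GES) greedily optimize a score over equivalence classes of DAGs; LiNGAM assumes a linear model with non-Gaussian noise, which makes the causal direction identifiable; DAG-GNN fits a structural equation model with a graph neural network variational autoencoder under a continuous acyclicity constraint. |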
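
The product-of-coefficients estimate and its delta-method standard error from the mediation row can be written out directly. This sketch assumes the two path estimates are independent and approximately normal; the function name `indirect_effect` and the input values are hypothetical.

```python
import math

def indirect_effect(a, se_a, b, se_b):
    """Indirect effect a*b with first-order delta-method (Sobel) standard error.

    a: exposure -> mediator estimate, b: mediator -> outcome estimate.
    """
    indirect = a * b
    se = math.sqrt(a ** 2 * se_b ** 2 + b ** 2 * se_a ** 2)
    z = indirect / se
    p = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
    return indirect, se, z, p

# Hypothetical path estimates, e.g. from two separate MR analyses.
indirect, se, z, p = indirect_effect(a=0.5, se_a=0.1, b=0.4, se_b=0.1)
print(f"indirect={indirect:.3f}, SE={se:.4f}, z={z:.2f}, p={p:.3g}")
```

The normal approximation can be poor because a product of two normal estimates is itself skewed, which is one reason the bootstrap is listed as an alternative.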

🔬 4. Transcriptomics and Epigenomics

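| Analysis | Statistical Methods | Statistical Principle |
| --- | --- | --- |
| Differential gene expression (DGE) | DESeq2, edgeR, limma | DESeq2 and edgeR fit negative binomial GLMs to read counts with empirical Bayes shrinkage of dispersion estimates; limma fits linear models with empirical Bayes moderation of the variance estimates. |
| Alternative splicing analysis | rMATS, MAJIQ, SUPPA2 | rMATS applies a likelihood-ratio test to exon inclusion levels (PSI) under a hierarchical model; MAJIQ estimates posterior distributions of PSI for local splicing variations (LSVs) in a Bayesian framework; SUPPA2 derives PSI from transcript abundances and compares ΔPSI against the variability observed between replicates. |
| DNA methylation analysis | Beta regression, M-value linear regression, limma | Methylation proportions (beta values) are modeled with a beta distribution, or logit-transformed to M-values and analyzed with linear models; empirical Bayes is used for variance shrinkage. |
| ATAC-seq/ChIP-seq differential peak | DiffBind, csaw, DESeq2 | DiffBind tests consensus peaks with DESeq2 or edgeR negative binomial models; csaw counts reads in sliding windows and tests them with edgeR; significance comes from Wald, likelihood-ratio, or quasi-likelihood F-tests. |
| RNA editing analysis | REDItools, GLMs for mismatch rates | Binomial or Poisson modeling of edited versus unedited read counts at specific loci; beta-binomial models are sometimes used for overdispersion. |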
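
The M-value transformation in the DNA methylation row is the base-2 logit of the beta value. The sketch below shows both directions; the clipping constant `eps` is an assumption added here to avoid taking the log of zero and is not part of the definition.

```python
import math

def beta_to_m(beta, eps=1e-6):
    """Convert a methylation beta value in [0, 1] to an M-value (log2 odds)."""
    beta = min(max(beta, eps), 1.0 - eps)   # keep away from exactly 0 or 1
    return math.log2(beta / (1.0 - beta))

def m_to_beta(m):
    """Inverse transform: M-value back to a beta value."""
    return 2.0 ** m / (2.0 ** m + 1.0)

for b in (0.1, 0.5, 0.8):
    m = beta_to_m(b)
    print(f"beta={b:.2f} -> M={m:+.3f} -> beta={m_to_beta(m):.2f}")
```

M-values are unbounded and closer to homoscedastic, which suits linear models, while beta values are easier to interpret as a methylated fraction.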
 

🧫 5. Multi-Omics Integration

| Analysis | Statistical Methods | Statistical Principle |
| --- | --- | --- |
| Multi-omics factor analysis | MOFA, iCluster, SNF | MOFA uses Bayesian group factor analysis with variational inference; iCluster fits penalized Gaussian latent variable models; SNF builds one sample-similarity network per omics layer and fuses them iteratively into a single network. |
| Multi-omics regression / prediction | Elastic net, random forest, SVM, kernel ridge regression | Elastic net is a regularized linear model with combined L1 and L2 penalties; random forests are non-linear tree ensembles; SVMs and kernel ridge regression use kernels to handle high-dimensional data. |
| Multi-omics QTL integration | eQTL + meQTL + pQTL | Joint linear models or Bayesian multivariate models; conditional independence tests and co-mapping of variants across omics layers. |
| Pathway-based integration | PARADIGM, NetGSA | PARADIGM uses Bayesian factor graph models of pathway activity; NetGSA uses mixed linear models that incorporate the network structure among genes. |
| Mediation with multi-omics | Two-step MR, structural equation modeling | Estimates indirect effects across omics layers using path analysis, product of coefficients, or SEM with latent variables. |
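
The combined L1/L2 penalty in the elastic net row can be made concrete by writing out the objective that is minimized. The sketch follows the parameterization used by scikit-learn's `ElasticNet` (`alpha` for overall strength, `l1_ratio` for the L1 share) but is written in plain Python; the toy design matrix and coefficients are hypothetical, and no intercept is fitted.

```python
def elastic_net_objective(X, y, coef, alpha=1.0, l1_ratio=0.5):
    """Squared-error loss plus elastic net penalty for a given coefficient vector."""
    n = len(y)
    residuals = [yi - sum(c * x for c, x in zip(coef, row))
                 for row, yi in zip(X, y)]
    loss = sum(r * r for r in residuals) / (2.0 * n)
    l1 = sum(abs(c) for c in coef)            # L1 norm: drives coefficients to zero
    l2 = sum(c * c for c in coef)             # squared L2 norm: shrinks smoothly
    return loss + alpha * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)

# Toy data: three samples, two features (say, one transcript and one CpG site).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 2.0, 3.0]
perfect_fit = elastic_net_objective(X, y, [1.0, 2.0], alpha=0.1, l1_ratio=0.5)
null_model = elastic_net_objective(X, y, [0.0, 0.0], alpha=0.1, l1_ratio=0.5)
print(perfect_fit, null_model)
```

Setting `l1_ratio=1.0` gives the LASSO penalty and `l1_ratio=0.0` gives ridge. The fitted model is whichever coefficient vector minimizes this objective, usually found by coordinate descent.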
 

🤖 6. Machine Learning and Predictive Modeling

| Analysis | Statistical Methods | Statistical Principle |
| --- | --- | --- |
| Classification / regression | Logistic regression, random forest, SVM, XGBoost | Logistic regression is fitted by maximum likelihood; random forests average decision trees grown on bootstrap samples with random feature subsets; SVM maximizes the margin between classes; XGBoost fits regularized gradient-boosted trees. |
| Survival analysis | Cox regression, DeepSurv | Cox models estimate hazard ratios by maximizing the partial likelihood; DeepSurv replaces the linear predictor with a neural network trained on the same partial likelihood, so right-censored data are handled in the same way. |
| Dimensionality reduction | PCA, t-SNE, UMAP, autoencoders | PCA finds orthogonal directions of maximal variance via SVD; t-SNE and UMAP are non-linear manifold learning methods that preserve local neighborhood structure; autoencoders learn latent representations with neural networks. |
| Feature selection | LASSO, RFE, Boruta | LASSO penalizes coefficients with the L1 norm, setting some exactly to zero; RFE repeatedly refits a model and drops the least important features; Boruta compares the random forest importance of each feature against permuted "shadow" copies. |
| Deep learning for sequences | CNN, RNN, Transformers | CNNs capture local patterns such as sequence motifs; RNNs model sequential dependencies; Transformers use self-attention to capture long-range dependencies in sequence data. |
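
The partial likelihood in the survival analysis row can be evaluated by hand for a small data set. The sketch below assumes one covariate and no tied event times; `cox_partial_loglik` and the toy data are hypothetical, and a real fit would maximize this function over `beta`.

```python
import math

def cox_partial_loglik(beta, times, events, x):
    """Cox partial log-likelihood for a single covariate, assuming no tied event times."""
    loglik = 0.0
    for i, (t_i, observed) in enumerate(zip(times, events)):
        if not observed:
            continue   # censored subjects only appear in other subjects' risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = sum(math.exp(beta * x[j])
                   for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += beta * x[i] - math.log(risk)
    return loglik

# Toy data: higher covariate values fail earlier, so a positive beta fits better.
times = [2.0, 4.0, 6.0]
events = [1, 1, 1]          # 1 = event observed, 0 = right-censored
x = [2.0, 1.0, 0.0]
print(cox_partial_loglik(0.0, times, events, x))
print(cox_partial_loglik(1.0, times, events, x))
```

The baseline hazard never appears in the function, which is why the Cox model can estimate hazard ratios without specifying it.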
 

🧠 7. Population and Evolutionary Genetics

| Analysis | Statistical Methods | Statistical Principle |
| --- | --- | --- |
| Ancestry inference | PCA, ADMIXTURE, STRUCTURE | PCA performs eigendecomposition of the genotype matrix; ADMIXTURE estimates ancestry proportions by maximum likelihood; STRUCTURE uses Bayesian clustering with MCMC. |
| Phasing and imputation | SHAPEIT, Beagle, IMPUTE | Hidden Markov models (HMMs) treat each haplotype as a mosaic of reference haplotypes, with state transitions modeling recombination; missing genotypes are imputed from posterior state probabilities computed by the forward-backward algorithm. |
| Selection scans | iHS, XP-EHH, Fst, PBS | Detects selection via extended haplotype homozygosity (iHS/XP-EHH), population differentiation (Fst), or population-specific branch lengths derived from pairwise Fst (PBS). |
| IBD / ROH detection | KING, GERMLINE, PLINK | KING estimates pairwise kinship from genome-wide genotypes; GERMLINE finds shared IBD segments by hashing short haplotype seeds and extending the matches; PLINK detects runs of homozygosity with a sliding-window scan. |
| Demographic inference | MSMC, fastsimcoal2, dadi | MSMC infers coalescence rates over time from a few genomes with a sequentially Markovian coalescent HMM; dadi fits the site frequency spectrum using a diffusion approximation and fastsimcoal2 fits it using coalescent simulations, both by maximizing a composite likelihood. |
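
The Fst and PBS entries in the selection scans row can be illustrated for a single biallelic site. This is a simplified sketch: it uses the heterozygosity-based definition of Fst with equal population weights and known allele frequencies, whereas real scans use sample-size-aware estimators and average over windows. The function names and frequencies are hypothetical.

```python
import math

def fst_two_pops(p1, p2):
    """Fst = (H_T - H_S) / H_T for one biallelic site and two equally weighted populations."""
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)                            # total heterozygosity
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # mean within-population
    return 0.0 if h_t == 0.0 else (h_t - h_s) / h_t

def branch_length(fst):
    """Divergence-time-like transform T = -log(1 - Fst); requires Fst < 1."""
    return -math.log(1.0 - fst)

def pbs(fst_ab, fst_ac, fst_bc):
    """Population branch statistic for population A relative to B and C."""
    return (branch_length(fst_ab) + branch_length(fst_ac)
            - branch_length(fst_bc)) / 2.0

# Hypothetical allele frequencies: A has diverged, B and C are identical.
p_a, p_b, p_c = 0.8, 0.2, 0.2
value = pbs(fst_two_pops(p_a, p_b), fst_two_pops(p_a, p_c), fst_two_pops(p_b, p_c))
print(f"Fst(A,B)={fst_two_pops(p_a, p_b):.2f}, PBS(A)={value:.4f}")
```

A large PBS for population A means the allele frequency change is concentrated on A's branch rather than shared with B or C.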