class: center, middle, inverse, title-slide # Taking a network view of EHR and Biobank data to find explainable multivariate patterns ### Nick Strayer ### 2019-04-03 --- # Talk outline ![:space 6]() - Electronic medical records - PheWAS - Changing how we ![:colorText orangered](think about) these data - Multimorbidity explorer - Changing how we ![:colorText orangered](model) these data - The stochastic block model - Applying to real data - Future directions --- # Electronic medical records (EHR) ![:space 1]() In an effort to make healthcare more efficient EHR systems have become common in the US. ![:space 3]() .pull-left[
While originally made for billing purposes there is still a huge sum of information that, with careful effort, _hopefully_, can be extracted for research. ] .pull-right[
In this presentation I will focus on the subset of EHR pertaining to billing codes: ICD9, ICD10, and Phecodes. ] ![:space 4]() ![:centerPic 35](https://securecdn.pymnts.com/wp-content/uploads/2018/03/bigdata.jpg) ![:small 0.6]([image source](https://www.pymnts.com/data/2018/datatorrent-jeff-bettencourt-real-time-big-data/)) --- # Biobanks ![:space 6]
Some hospitals have repositories of biological samples that can be matched to their EHR. ![:space 8]
Data could be anything from plain unprocessed-plasma all the way to full single-cell sequencing. ![:space 8]
Here I will focus on plain SNP-chip readings, aka presence or absence of a given marker at multiple points on the genome. --- # PheWAS ![:space 5] In an effort to extract information from these data the technique PheWAS was made. ![:space 8] ![:borderedCenterPic 80](figures/phewas-paper-head.png) --- ## Concept ![:space 5] ![:centerPic 90](figures/phewas-explainer.svg) --- ## The univariate problem ![:space 15]()
PheWAS looks at one genotype
phenotype association at a time. ![:space 12]() .pull-left[
This gives us the multiple-comparisons problem. ] .pull-right[
Also, does the world work like this? ] --- # Changing how we ![:colorText orangered](think about) these data ![:centerPic 95](figures/obs-2-network.svg) --- # Multimorbidity explorer ![:space 14]() ![:borderedCenterPic 80](figures/me_paper_abstract.png) --- ## What it is Application that allows researchers to explore the results of PheWAS studies along with investigating individual-level data that produced those results using ![:colorText orangered](interactive visualizations.)
--- ## Why it is ![:space 15]() .pull-left[ ####
Interact with results PheWAS results are typically delivered with static plots and tables. ME allows researchers to interact with those results. ] .pull-right[ ####
Expand past plain associations By giving researcher's the ability to look at the network behavior of genotype-phenotype associations, it can provide more nuanced insights from the data than a table. ] --- ## How it is ![:space 4]()
Central package contains all the building blocks of a typical ME deployment. ![:indent]()
Visualizations custom built in Javascript. ![:space 5]()
Creating a custom app is an exercise in combining the neccesary components. ![:space 5]()
Hosted on lab RStudio Connect instance running on AWS. ![:indent]()
If more security is desired, Docker image is available. --- class: center, middle ## Demo --- # Changing how we ![:colorText orangered](model) these data ![:space 8]()
If visualizing the data as a network can help us understand them, why not model them as a network, too? ![:space 4]()
Mathematical and statistical models of networks are a diverse and booming field. ![:space 4]()
Unfortunately, many of the methods sit on fragile methodological foundations. ![:indent]()
Luckily, a new model type has taken the stage recently... --- ## The ![:colorText orangered](S)tochastic ![:colorText orangered](B)lock ![:colorText orangered](M)odel ![:space 3]() ![:centerPic 90](figures/the-sbm.svg) --- ## SBM formula ![:centerPic 100](figures/sbm-formula.svg) --- ## Bipartite expansion ![:space 2]() In its basic form the SBM only works with a single type of node, but we have two (phenotype and patients)... ![:space 4]() ![:borderedCenterPic 80](figures/bipartite-sbm-paper.png) ![:space 6]()
To fix this we can add a constraint to the model to only cluster nodes of the same type. ![:indent]() <img src='figures/bell-curve.svg' width = 40px/> This is equivalent to an infinitely strong prior on cluster 'types'. --- ## ![:colorText orangered](Bi)partite ![:colorText orangered](SBM) formula ![:centerPic 100](figures/bipartite-sbm-formula.svg) --- ## Statistical features Model is fit in a bayesian manor ![:borderedCenterPic 55](figures/bayesian_sbm_paper.png) ![:space 2]() .pull-left[ ### Priors ![:centerPic 40](figures/bayes.svg) Priors are set on both the group/cluster structure and the edge counts between groups. ] .pull-right[ ### Fitting .center[
] Metropolis-Hastings is used to fit model using MCMC Very similar to how a dirichlet mixture model is fit ] --- ### Prior on clusters ![:centerPic 90](figures/group-prior.svg) --- ### Prior on edge counts ![:centerPic 90](figures/edge-prior.svg) --- ### MCMC acception formula ![:centerPic 90](figures/mcmc-formula.svg) --- ## Simulation results ![:space 2]() ![:centerPic 65](figures/biSBM_simulations.png) The BiSBM outperforms dirichlet mixture models even when the true generating distribution is dirichlet. ![:space 3]() .right[ ![:small 0.9](Figure from A network approach to topic models, Gerlach et al 2019) ] --- class: center, middle # Applying in the real world --- ### Looking for [patterns in Myeloid disease](https://prod.tbilab.org/connect/#/apps/20/access) ![:centerPic 80](figures/mds_circles.png) --- ### Investigating large-scale patient structure 55k patients over 1500 phecodes, dimension reduced using BiSBM, visualized with UMAP. ![:centerPic 85](figures/hf_embeddings.png) --- # Future directions ![:space 18]() - Simulations! - Turning into a semi-supervized method - More battle-testing with real data - Optimizing visualizations for information transfer --- # Aknowledgements ![:space 2]() A thanks to those who have helped build these ideas to this point - Yaomin Xu - TBILab members - Wells Lab - Savona Lab - Travis Spaulding - Denny Lab - Vanderbilt drug repurposing group - Vanderbilt PheWAS group ![:space 7]() And thanks to the funding sources who have paid me to play around with javascript - NIH Big Data to Knowledge training grant - Vanderbilt Biostatistics department 2018 development grant --- # References These slides: ![:indent]()
[nickstrayer.me/biostat_seminar](http://nickstrayer.me/biostat_seminar/) ![:indent]()
[github.com/nstrayer/biostat_seminar](https://github.com/nstrayer/biostat_seminar) ![:space 8]() Relevant papers: - ![:small 0.7](Gerlach, Martin, Tiago P. Peixoto, and Eduardo G. Altmann. 2018. “A Network Approach to Topic Models.” Science Advances) - ![:small 0.7](Larremore, Daniel B., Aaron Clauset, and Abigail Z. Jacobs. 2014. “Efficiently Inferring Community Structure in Bipartite Networks.” Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics) - ![:small 0.7](Peixoto, Tiago P. 2016. “Nonparametric Bayesian Inference of the Microcanonical Stochastic Block Model.” arXiv [physics.data-An]. arXiv. http://arxiv.org/abs/1610.02703.) - ![:small 0.7](Peixoto, Tiago P. 2017. “Bayesian Stochastic Blockmodeling.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1705.10225.)