Understanding Probabilistic Graphical Models
Why do we need Probabilistic Graphical Models? They give us intuitive diagrams of complex relationships between stochastic variables. They are also convenient from computational point of view, since we already have a lot of algorithms for working with graphs and statistics. Using PGM we can simulate dynamics of nuclear power plant, create models of chemical components, generate music and many other things.
Basics
Every Probabilistic Graphical Model(PGM) consists of two types of components: nodes as random variables, and edges as statistical dependencies between them. Sometimes all of those variables may be available for observations, while sometimes only their subset.
Let’s start with a simple example. Consider you have 3 binary(Yes/No) variables representing presence of rain(R), water on windows(W1) and water on the road(W2). Without any additional information we know that there is a causal relationship between rain and two other variables:
What is important about this example:
- Only rain can cause wet windows and roads, but not vice versa. Also, there is no cycles. This is a Directed Acyclic Graphical(DAG) model, most common type of PGM.
- If we observe that R=Yes, we know that probabilities of W1=Yes and W2=Yes are high. When R=No, probabilities of W1=Yes and W2=Yes are low.
- When we know the value of R, probabilities of wet windows and wet roads become independent. They may be both high or low, but these events do not influence each other anymore.
- In addition, this is a Singly Connected DAG(SC-DAG), which means that there is only a single path(ignoring directions of edges) between any pair of nodes.
Some other examples:
We may also represent our model as Conditional Probability Tables(CPT), where each table represents conditional probability of observing one variable given another.
For water on the windows given rain:
And for water on the road given rain:
Bayes or Belief Networks(BN) are special cases of DAG. They may be singly or multiply connected, and use Bayesian logic for interpretation of probabilities. In short, it means that they start with some prior assumptions about dependencies and use Bayes rule to update their beliefs about data. Similarly, most of DAG models are actually Bayesian.
Hidden Markov Models(HMM) and Linear Gaussian State-Space Models(GSS) are also special kinds of SC-DAG.
Undirected Graphical Models
Also known as Markov Random Fields(MRF). Like directed models, they represent conditional dependencies between random variables, but do not imply ordered causality. Let’s look at a simplified model: Happiness scores among people.
Happiness level of each person in this example affects those scores of other people. When we observe one of them, it also gives some information about happiness of the other people, since we are all connected.
Among other popular types of UGM are Conditional Random Fields and Restricted Boltzmann Machines.
By the way, DAG provides simpler inference procedure and easier to interpret, while undirected models have more potential to describe complex relationships.
Factor Graph is also very similar to MRF. Moreover, traditional factor models may be extended to include both Directed and Undirected models. There is actually a lot of similar research going with a goal to find more deeper connections between PGMs, NNs and other statistical tools.
PGMs and Neural Networks
Traditional Probabilistic Graphical Models work well with discrete variables. However, NN-based PGMs extend these abilities to continuous high-dimensional data. Generative Adversarial Networks, Variational Auto-encoders, Boltzmann Machines, Deep Belief Networks and many other neural nets may be considered as types of PGM.
Just like with Neural Networks, there are a lot of distinct forms of PGMs for different domains. But they all share those basic principles.
You may also find some interactive online examples of conditional probabilities and Markov models in the Playground.
Inference and Learning
If you have observed some of the variables, what distributions will have other variables? Common method is to use Bayes rule — P(A given B=b) = P(A and B=b)/P(B=b). The intuition behind it is: to get a probability of event A given that B has taken value b, you need to take the probability that both A and B=b happened and scale it up by the probability of B=b, since you already know that it occurred.
Likewise, by calculating joint probabilities like A and B and marginal distributions like B=b from training data you can infer CPTs. When some variables are hidden you can estimate their distributions from observed data using Expectation Maximization(EM) or Markov Chain Monte Carlo(MCMC) methods.
Furthermore, inference computations in PGM can be done locally. Information about influence that nodes send to each other usually called messages. They are usually divided in Pi(carrying prior probabilities) and Lambda(carrying likelihood probabilities) types. Common algorithm for exchange of those messages is Belief Propagation(BP), which is also a generalization of Forward-Backward algorithm for HMM and Kalman Smoothing algorithm for GSS. In the case of cyclic or undirected graphs it’s called Loopy BP, because those messages can travel in close loops.
In conclusion, Probabilistic Graphical Models are very common in Machine Learning and AI in general. They help us to build interpretable models of complex systems and to make useful predictions in a wide range of problems.