## Get All Weeks Probabilistic Graphical Models 2: Inference Coursera Quiz Answers

Probabilistic graphical models (PGMs) are a rich framework for encoding probability distributions over complex domains: joint (multivariate) distributions over large numbers of random variables that interact with each other. These representations sit at the intersection of statistics and computer science, relying on concepts from probability theory, graph algorithms, machine learning, and more.

They are the basis for the state-of-the-art methods in a wide variety of applications, such as medical diagnosis, image understanding, speech recognition, natural language processing, and many, many more. They are also a foundational tool in formulating many machine learning problems.

This course is the second in a sequence of three. Following the first course, which focused on representation, this course addresses the question of probabilistic inference: how a PGM can be used to answer questions. Even though a PGM generally describes a very high dimensional distribution, its structure is designed so as to allow questions to be answered efficiently. The course presents both exact and approximate algorithms for different types of inference tasks, and discusses where each could best be applied.

The (highly recommended) honors track contains two hands-on programming assignments, in which key routines of the most commonly used exact and approximate algorithms are implemented and applied to a real-world problem.

### Probabilistic Graphical Models 2: Inference Coursera Quiz Answers

### Week 1 Quiz Answers

#### Quiz 1: Variable Elimination

Q1. Intermediate Factors. Consider running variable elimination on the following Bayesian network over binary variables. Which of the nodes, if eliminated first, results in the largest intermediate factor? By largest factor we mean the factor with the largest number of entries.

- $X_3$
- $X_5$
- $X_2$
- $X_4$

Q2. Elimination Orderings. Which of the following characteristics of the variable elimination algorithm are affected by the choice of elimination ordering? You may select 1 or more options.

- Runtime of the algorithm
- Which marginals can be computed correctly
- Memory usage of the algorithm
- Size of the largest intermediate factor
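
To make the role of the ordering concrete, here is a minimal sketch that tracks only the scopes of the intermediate factors during variable elimination. The star-shaped network (a hub `H` with leaves `A` to `D`) and the function names are hypothetical, chosen for illustration; they are not the network from the quiz.

```python
from typing import FrozenSet, Iterable, List


def intermediate_scopes(factor_scopes: Iterable[Iterable[str]],
                        ordering: List[str]) -> List[FrozenSet[str]]:
    """Return the scope of the intermediate factor psi built at each step.

    Only scopes are tracked, not values: the scope size alone determines
    how many entries an intermediate factor has.
    """
    factors = [frozenset(s) for s in factor_scopes]
    scopes = []
    for var in ordering:
        used = [f for f in factors if var in f]
        rest = [f for f in factors if var not in f]
        psi = frozenset().union(*used)  # scope of the product of factors mentioning var
        scopes.append(psi)
        tau = psi - {var}               # scope left after summing out var
        factors = rest + ([tau] if tau else [])
    return scopes


def largest_factor_entries(scopes: List[FrozenSet[str]], cardinality: int = 2) -> int:
    """Number of entries in the largest intermediate factor."""
    return max(cardinality ** len(s) for s in scopes)


# Hypothetical pairwise network: hub H connected to leaves A, B, C, D.
star = [("H", "A"), ("H", "B"), ("H", "C"), ("H", "D")]
leaves_first = intermediate_scopes(star, ["A", "B", "C", "D", "H"])
hub_first = intermediate_scopes(star, ["H", "A", "B", "C", "D"])
print(largest_factor_entries(leaves_first))  # 4
print(largest_factor_entries(hub_first))     # 32
```

Both orderings eliminate the same variables, but eliminating the hub first multiplies all four factors at once, so the largest intermediate factor grows from 4 to 32 entries.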

Q3. Marginalization. Suppose we run variable elimination on a Bayesian network where we eliminate all the variables in the network. What number will the algorithm produce?


Q4. Marginalization. Suppose we run variable elimination on a Markov network where we eliminate all the variables in the network. What number will the algorithm produce?

- $1/Z$, where $Z$ is the partition function for the network.
- $Z$, the partition function for the network.
- A positive number, not necessarily between 0 and 1, which depends on the structure of the network.
- A positive number, always between 0 and 1, which depends on the structure of the network.
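
The two marginalization questions can be explored on toy models. The sketch below is a small table-based sum-product variable elimination for binary variables; the factor representation, the function names, and both example networks are assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple

Factor = Tuple[Tuple[str, ...], Dict[Tuple[int, ...], float]]


def multiply(f: Factor, g: Factor) -> Factor:
    """Factor product over the union of the two scopes (binary variables)."""
    f_vars, f_tab = f
    g_vars, g_tab = g
    scope = f_vars + tuple(v for v in g_vars if v not in f_vars)
    table = {}
    for values in product((0, 1), repeat=len(scope)):
        row = dict(zip(scope, values))
        table[values] = (f_tab[tuple(row[v] for v in f_vars)]
                         * g_tab[tuple(row[v] for v in g_vars)])
    return scope, table


def sum_out(f: Factor, var: str) -> Factor:
    """Marginalize var out of factor f."""
    f_vars, f_tab = f
    idx = f_vars.index(var)
    scope = f_vars[:idx] + f_vars[idx + 1:]
    table: Dict[Tuple[int, ...], float] = {}
    for values, weight in f_tab.items():
        key = values[:idx] + values[idx + 1:]
        table[key] = table.get(key, 0.0) + weight
    return scope, table


def eliminate_all(factors: List[Factor], ordering: List[str]) -> float:
    """Eliminate every variable; assumes each variable appears in some factor."""
    factors = list(factors)
    for var in ordering:
        used = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        psi = used[0]
        for f in used[1:]:
            psi = multiply(psi, f)
        factors.append(sum_out(psi, var))
    result = 1.0
    for _, table in factors:  # only empty-scope factors remain
        result *= table[()]
    return result


# Hypothetical Bayesian network A -> B, given as CPD tables.
p_a: Factor = (("A",), {(0,): 0.6, (1,): 0.4})
p_b_given_a: Factor = (("A", "B"), {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8})
bn_result = eliminate_all([p_a, p_b_given_a], ["A", "B"])

# Hypothetical Markov network with one unnormalized pairwise potential.
phi_ab: Factor = (("A", "B"), {(0, 0): 10.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 10.0})
mn_result = eliminate_all([phi_ab], ["A", "B"])
```

In this toy run, the CPDs of the Bayesian network sum to 1, while the unnormalized potential sums to 22, which is the partition function of the one-factor Markov network.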

Q5. Intermediate Factors. If we perform variable elimination on the graph shown below with the variable ordering $B, A, C, F, E, D$, what is the intermediate factor produced by the third step (just before summing out $C$)?

- $\psi(C,D,E,F)$
- $\psi(A,B,C,D,F)$
- $\psi(C,D,F)$
- $\psi(C,F)$

Q6. Induced Graphs. If we perform variable elimination on the graph shown below with the variable ordering $B, A, C, F, E, D$, what is the induced graph for the run?

- None of these

Q7. *Time Complexity of Variable Elimination. Consider a Bayesian network taking the form of a chain of $n$ variables, $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$

- $O(nk^3)$
- $O(kn^2)$
- $O(k^n)$
- $O(nk^2)$

Q8. Time Complexity of Variable Elimination. Suppose we eliminate all the variables in a Markov network using the variable elimination algorithm. Which of the following could affect the runtime of the algorithm? You may select 1 or more options.

- Number of factors in the network
- Number of values each variable can take
- The values of the factor entries (assuming that all entries are still positive)

Q9. Intermediate Factors. If we perform variable elimination on the graph shown below with the variable ordering $F, E, D, C, B, A$, what is the intermediate factor produced by the third step (just before summing out $D$)?

- $\psi(B,C,D,E,F)$
- $\psi(B,C,D,E)$
- $\psi(B,C,D)$
- $\psi(B,C)$
- $\psi(A,B,C,D)$

### Week 2 Quiz Answers

#### Quiz 1: Message Passing in Cluster Graphs

Q1. Cluster Graph Construction. Consider the pairwise MRF, H, shown below with potentials over {A,B}, {B,C}, {A,D}, {B,E}, {C,F}, {D,E} and {E,F}.

Which of the following is/are valid cluster graph(s) for H? (A cluster graph is valid if it satisfies the running intersection property and family preservation. You may select 1 or more options).

Q2. Message Passing in a Cluster Graph. Suppose we wish to perform inference over the Markov network $M$ as shown below. Each of the variables $X_i$ is binary, and the only potentials in the network are the pairwise potentials $\phi_{i,j}(X_i, X_j)$, with one potential for each pair of variables $X_i, X_j$ connected by an edge in $M$. Which of the following expressions correctly computes the message $\delta_{3 \rightarrow 6}$ that cluster $C_3$ will send to cluster $C_6$ during belief propagation? Assume that the variables in the sepsets are equal to the intersection of the variables in the adjacent cliques.

Q3. Message Passing Computation. Consider the Markov network $M$ from the previous question. If the initial factors in the Markov network $M$ are of the form shown in the table below, regardless of the specific value of $i, j$ (we basically wish to encourage variables that are connected by an edge to share the same assignment), compute the message $\delta_{3 \rightarrow 6}$, assuming that it is the first message passed during loopy belief propagation. Assume that the messages are all initialized to the 1 message, i.e. all the entries are initially set to 1.

Separate the entries of the message with spaces. Order the entries by lexicographic variable order: for example, if the message is over one variable $X_i$

Q4. *Extracting Marginals at Convergence. Given that you can renormalize the messages at any point during belief propagation and still obtain correct marginals, consider the message $\delta_{3 \rightarrow 6}$ that you computed. Use this observation to compute the final and possibly approximate marginal probability $P(X_4 = 1, X_5 = 1)$ ($X_4$ and $X_5$ are the variables in the previous question) in cluster $C_6$ at convergence (as extracted from the cluster beliefs), giving your answer to 2 decimal places.

Q5. Family Preservation. Suppose we have a factor $P(A \mid C)$ that we wish to include in our sum-product message passing inference. We should:

- Assign the factor to all cliques that contain $A$ or $C$
- Assign the factor to one clique that contains $A$ and $C$
- Assign the factor to one clique that contains $A$ or $C$
- Assign the factor to all cliques that contain $A$ and $C$

#### Quiz 2: Clique Tree Algorithm

Q1. Message Ordering. In the clique tree below which of the following starting message-passing orders is/are valid? (Note: These are not necessarily full sweeps that result in calibration. You may select 1 or more options.)

- $C_1 \rightarrow C_2, C_2 \rightarrow C_3, C_5 \rightarrow C_3, C_3 \rightarrow C_4$
- $C_4 \rightarrow C_3, C_3 \rightarrow C_2, C_2 \rightarrow C_1$
- $C_4 \rightarrow C_3, C_5 \rightarrow C_3, C_2 \rightarrow C_3$
- $C_1 \rightarrow C_2, C_2 \rightarrow C_3, C_3 \rightarrow C_4, C_3 \rightarrow C_5$

Q2. Message Passing in a Clique Tree. In the clique tree above, what is the correct form of the message from clique 3 to clique 2, $\delta_{3 \rightarrow 2}$?

- $\sum_{B,D,G,H} \psi_3(C_3) \times \delta_{4 \rightarrow 3} \times \delta_{5 \rightarrow 3}$
- $\sum_{G,H} \psi_3(C_3) \times \delta_{4 \rightarrow 3} \times \delta_{5 \rightarrow 3}$
- $\sum_{B,D} \psi_3(C_3) \times \delta_{4 \rightarrow 3} \times \delta_{5 \rightarrow 3}$
- $\sum_{B,D} \psi_3(C_3) \times \sum_{D,H} \left(\psi_4(C_4) \times \delta_{4 \rightarrow 3}\right) \times \sum_{B,H} \left(\psi_5(C_5) \times \delta_{5 \rightarrow 3}\right)$

Q3. Clique Tree Properties. Consider the following Markov Network over potentials $\phi_{A,B}, \phi_{B,C}, \phi_{A,D}, \phi_{B,E}, \phi_{C,F}, \phi_{D,E}, \ldots$

Which of the following properties are necessary for a valid clique tree for the above network, but are NOT satisfied by this graph:

You may select 1 or more options.

- No loops
- Running intersection property
- Node degree less than or equal to 2
- Family preservation

Q4. Cluster Graphs vs. Clique Trees. Suppose that we ran sum-product message passing on a cluster graph $G$ for a Markov network $M$ and that the algorithm converged. Which of the following statements is true only if $G$ is a clique tree and is not necessarily true otherwise?

- $G$ is calibrated.
- If there are $E$ edges in $G$, there exists a message ordering that guarantees convergence after passing $2E$ messages.
- All the options are true for cluster graphs in general.
- The sepsets in $G$ are the product of the two messages passed between the clusters adjacent to the sepset.
- The beliefs and sepsets of $G$ can be used to compute the joint distribution defined by the factors of $M$.

Q5. Clique Tree Calibration. Which of the following is true? You may select more than one option.

- If there exists a pair of adjacent cliques that are max-calibrated, then a clique tree is max-calibrated.
- After we complete one upward pass of the max-sum message passing algorithm, the clique tree is max-calibrated.
- If a clique tree is max-calibrated, then all pairs of cliques are max-calibrated.
- If a clique tree is max-calibrated, then all pairs of adjacent cliques are max-calibrated.

### Week 3 Quiz Answers

#### Quiz 1: MAP Message Passing

Q1. Real-World Applications of MAP Estimation. Suppose that you are in charge of setting up a soccer league for a bunch of kindergarten kids, and your job is to split the $N$ children into $K$ teams. The parents are very controlling and also uptight about which friends their kids associate with. So some of them bribe you to set up the teams in certain ways.

The parents’ bribe can take two forms: For some children $i$, the parent says “I will pay you $A_{ij}$ dollars if you put my kid $i$ on the same team as kid $j$”; in other cases, the parent of child $i$ says “I will pay you $B_i$ dollars if you put my kid on team $k$.” In our notation, this translates to factor $f_{i,j}(x_i,x_j) = A_{ij} \cdot \mathbf{1}\{x_i = x_j\}$ or $g_i(x_i) = B_i \cdot \mathbf{1}\{x_i = k\}$, respectively, where $x_i$ is the assigned team of child $i$ and $\mathbf{1}\{\}$ is the indicator function. More formally, if we define $x_i$ to be the assigned team of child $i$, the amount of money you get for the first type of bribe will be $f_{i,j}(x_i,x_j)$.

Being greedy and devoid of morality, you want to make as much money as possible from these bribes. What are you trying to find?

- $\textrm{argmax}_{\bar{x}} \sum_i g_i(x_i) + \sum_{i,j} f_{i,j}(x_i,x_j)$
- $\textrm{argmax}_{\bar{x}} \prod_i g_i(x_i) \cdot \prod_{i,j} f_{i,j}(x_i,x_j)$
- $\textrm{argmax}_{\bar{x}} \sum_i g_i(x_i)$
- $\textrm{argmax}_{\bar{x}} \prod_i g_i(x_i)$

Q2. *Decoding MAP Assignments. You want to find the optimal solution to the above problem using a clique tree over a set of factors $\phi$. How could you accomplish this such that you are guaranteed to find the optimal solution? (Ignore issues of tractability, and assume that if you specify a set of factors $\phi$, you will be given a valid clique tree of minimum tree width.)

- Set $\phi_{i,j} = f_{i,j}$, $\phi_i = g_i$, get the clique tree over this set of factors, run max-sum message passing on this clique tree, and decode the marginals.
- Set $\phi_{i,j} = f_{i,j}$, $\phi_i = g_i$, get the clique tree, run sum-product message passing, and decode the marginals.
- Set $\phi_{i,j} = \exp(f_{i,j})$, $\phi_i = \exp(g_i)$, get the clique tree over this set of factors, run max-sum message passing on this clique tree, and decode the marginals.
- Set $\phi_{i,j} = \exp(f_{i,j})$, $\phi_i = \exp(g_i)$, get the clique tree, run sum-product message passing, and decode the marginals.
- The optimal solution is not guaranteed to be found in this manner using clique trees.
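
The relationship between adding bribes and multiplying exponentiated factors can be checked by brute force on a tiny instance. This is only a sketch: the bribe amounts are invented, and exhaustive enumeration stands in for clique tree message passing, which it does not implement.

```python
import math
from itertools import product
from typing import Dict, Tuple

Assignment = Tuple[int, ...]


def total_bribe(x: Assignment,
                pair_bribes: Dict[Tuple[int, int], float],
                team_bribes: Dict[Tuple[int, int], float]) -> float:
    """Sum of f_ij(x_i, x_j) and g_i(x_i) for one team assignment x."""
    total = 0.0
    for (i, j), amount in pair_bribes.items():
        if x[i] == x[j]:           # A_ij * 1{x_i = x_j}
            total += amount
    for (i, k), amount in team_bribes.items():
        if x[i] == k:              # B_i * 1{x_i = k}
            total += amount
    return total


def exp_product(x: Assignment,
                pair_bribes: Dict[Tuple[int, int], float],
                team_bribes: Dict[Tuple[int, int], float]) -> float:
    """Product of exp(f_ij) and exp(g_i), i.e. exp of the total bribe."""
    score = 1.0
    for (i, j), amount in pair_bribes.items():
        score *= math.exp(amount if x[i] == x[j] else 0.0)
    for (i, k), amount in team_bribes.items():
        score *= math.exp(amount if x[i] == k else 0.0)
    return score


# Hypothetical instance: 3 children, 2 teams.
pair_bribes = {(0, 1): 5.0, (1, 2): 3.0}   # keyed by (child i, child j)
team_bribes = {(0, 0): 2.0, (2, 1): 4.0}   # keyed by (child i, team k)
assignments = list(product(range(2), repeat=3))

best_by_sum = max(assignments, key=lambda x: total_bribe(x, pair_bribes, team_bribes))
best_by_product = max(assignments, key=lambda x: exp_product(x, pair_bribes, team_bribes))
print(best_by_sum, best_by_product)  # (1, 1, 1) (1, 1, 1)
```

Because `exp` is monotonic, maximizing the sum of the bribes and maximizing the product of the exponentiated factors pick the same assignment here.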

#### Quiz 2: Sampling Methods

Q1. Forward Sampling. One strategy for obtaining an estimate of the conditional probability $P(\mathbf{y} \mid \mathbf{e})$ is to use forward sampling to estimate $P(\mathbf{y}, \mathbf{e})$ and $P(\mathbf{e})$ separately and then compute the ratio. We can use the Hoeffding bound to obtain a bound on both the numerator and the denominator. Assume $M$ is large. When does the resulting bound provide meaningful guarantees? Think about the difference between the true value and our estimate. Recall that we need $M \geq \frac{\ln(2/\delta)}{2\epsilon^2}$ to get an additive error bound $\epsilon$ that holds with probability $1-\delta$ for our estimate.

- It always provides meaningful guarantees.
- It provides a meaningful guarantee, but only when $\delta$ is small relative to $P(\mathbf{e})$ and $P(\mathbf{y}, \mathbf{e})$
- It provides a meaningful guarantee, but only when $\epsilon$ is small relative to $P(\mathbf{e})$ and $P(\mathbf{y}, \mathbf{e})$
- It never provides a meaningful guarantee.
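
As a rough illustration of what an additive guarantee costs, the sketch below evaluates the standard two-sided Hoeffding sample-size bound $M \geq \ln(2/\delta) / (2\epsilon^2)$. The function name and the example numbers are assumptions for illustration.

```python
import math


def hoeffding_samples(epsilon: float, delta: float) -> int:
    """Smallest M with M >= ln(2/delta) / (2 * epsilon^2).

    With this many samples, the estimate is within additive error epsilon
    of the true probability with probability at least 1 - delta.
    """
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


print(hoeffding_samples(0.01, 0.05))  # 18445
```

An additive error of 0.01 is tight when the probability being estimated is around 0.5, but says very little when the probability is itself on the order of 0.001, which is the tension the question points at.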

Q2. Rejecting Samples. Consider the process of rejection sampling to generate samples from the posterior distribution $P(X \mid e)$. If we want to obtain $M$ samples, what is the expected number of samples that would need to be drawn from $P(X)$?

- $M \cdot (1 - P(e))$
- $M \cdot P(e)$
- $M / P(e)$
- $M / (1 - P(e))$
- $M \cdot P(X \mid e)$
- $M \cdot (1 - P(X \mid e))$
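
The cost of rejection sampling can also be estimated empirically. In this sketch each draw from the prior is modeled as agreeing with the evidence with probability `p_e`; the function name, the seed, and the value `p_e = 0.2` are arbitrary choices for illustration.

```python
import random


def draws_until_accepted(m: int, p_e: float, rng: random.Random) -> int:
    """Count prior samples drawn until m of them are consistent with the evidence.

    Each prior sample is treated as consistent with e with probability p_e,
    which is the only property of the sample that rejection sampling checks.
    """
    draws = 0
    accepted = 0
    while accepted < m:
        draws += 1
        if rng.random() < p_e:
            accepted += 1
    return draws


rng = random.Random(0)
total_draws = draws_until_accepted(2000, 0.2, rng)
print(total_draws, 2000 / 0.2)
```

The simulated count can be compared against each candidate expression in the list above to see which one it tracks.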

Q3. Stationary Distributions. Consider the simple Markov chain shown in the figure below. By definition, a stationary distribution $\pi$ for this chain must satisfy which of the following properties? You may select 1 or more options.
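
Since the figure is not reproduced here, the following sketch uses a hypothetical two-state chain to show the defining property of a stationary distribution: applying the transition model to $\pi$ returns $\pi$. The transition matrix and function names are assumptions.

```python
from typing import List


def step(pi: List[float], T: List[List[float]]) -> List[float]:
    """One application of the transition model: new[j] = sum_i pi[i] * T[i][j]."""
    n = len(T)
    return [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]


def stationary(T: List[List[float]], iters: int = 1000) -> List[float]:
    """Approximate the stationary distribution by repeated application of T."""
    pi = [1.0 / len(T)] * len(T)
    for _ in range(iters):
        pi = step(pi, T)
    return pi


# Hypothetical chain: T[i][j] is the probability of moving from state i to state j.
T = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stationary(T)
print(pi)  # approximately [5/6, 1/6]
```

For this chain the balance condition $0.1\,\pi_0 = 0.5\,\pi_1$ together with $\pi_0 + \pi_1 = 1$ gives $\pi = (5/6, 1/6)$, which matches the iterated result.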

Q4. *Gibbs Sampling in a Bayesian Network. Suppose we have the Bayesian network shown in the image below.

If we are sampling the variable $X_{23}$ as a substep of Gibbs sampling, what is the closed form equation for the distribution we should use over the value $x_{23}'$? By closed form, we mean that all computation such as summations are tractable and that we have access to all terms without requiring extra computation.

- $P(x_{23}' \mid x_{22}, x_{24}) P(x_{15} \mid x_{23}', x_{14}, x_{9}, x_{25})$
- $P(x_{23}' \mid x_{-23})$, where $x_{-23}$ is all variables except $x_{23}$
- $P(x_{23}' \mid x_{22}, x_{24})$
- $\frac{P(x_{23}' \mid x_{22}, x_{24}) P(x_{15} \mid x_{23}', x_{14}, x_{9}, x_{25})}{\sum_{x_9'', x_{14}'', x_{22}'', x_{24}'', x_{25}''} P(x_{23}' \mid x_{22}'', x_{24}'') P(x_{15}'' \mid x_{23}', x_{14}'', x_{9}'', x_{25}'')}$
- $\frac{P(x_{23}' \mid x_{22}, x_{24}) P(x_{15} \mid x_{23}', x_{14}, x_{9}, x_{25})}{\sum_{x_{23}''} P(x_{23}'' \mid x_{22}, x_{24}) P(x_{15} \mid x_{23}'', x_{14}, x_{9}, x_{25})}$

Q5. Gibbs Sampling. Suppose we are running the Gibbs sampling algorithm on the Bayesian network $X \rightarrow Y \rightarrow Z$. If the current sample is $\langle x_0, y_0, z_0 \rangle$

- $P(y_1 \mid x_0, z_0)$
- $P(x_0, z_0 \mid y_1)$
- $P(x_0, y_1, z_0)$
- $P(y_1 \mid x_0)$
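
A Gibbs step that resamples $Y$ in the chain $X \rightarrow Y \rightarrow Z$ draws from the distribution of $Y$ given the current values of the other variables, which is proportional to $P(y \mid x_0)\,P(z_0 \mid y)$. The sketch below computes that distribution for binary variables; the CPD values are made up for illustration.

```python
import random
from typing import List


def gibbs_conditional_y(p_y_given_x: List[List[float]],
                        p_z_given_y: List[List[float]],
                        x0: int, z0: int) -> List[float]:
    """P(Y | x0, z0), proportional to P(y | x0) * P(z0 | y).

    p_y_given_x[x][y] and p_z_given_y[y][z] are CPD tables; P(X) cancels
    in the normalization because it does not depend on y.
    """
    weights = [p_y_given_x[x0][y] * p_z_given_y[y][z0]
               for y in range(len(p_z_given_y))]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical CPDs for binary X, Y, Z.
p_y_given_x = [[0.7, 0.3], [0.2, 0.8]]
p_z_given_y = [[0.9, 0.1], [0.4, 0.6]]

dist = gibbs_conditional_y(p_y_given_x, p_z_given_y, x0=0, z0=1)
y1 = random.Random(0).choices([0, 1], weights=dist)[0]  # the resampled value of Y
print(dist)  # approximately [0.28, 0.72]
```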

Q6. Collecting Samples. Assume we have a Markov chain that we have run for a sufficient burn-in time, and now wish to collect samples and use them to estimate the probability that

- No, once we collect one sample, we have to continue running the chain in order to “re-mix” it before we get another sample.
- Yes, and if we collect $m$ consecutive samples, we can use the Hoeffding bound to provide (high-probability) bounds on the error in our estimated probability.
- Yes, that would give a correct estimate of the probability. However, we cannot apply the Hoeffding bound to estimate the error in our estimate.
- No, Markov chains are only good for one sample; we have to restart the chain (and burn-in) before we can collect another sample.

Q7. Markov Chain Mixing. Which of the following classes of chains would you expect to have the shortest mixing time in general?

- Markov chains for networks with nearly deterministic potentials.
- Markov chains with distinct regions in the state space that are connected by low probability transitions.
- Markov chains with many distinct and peaked probability modes.
- Markov chains where state spaces are well connected and transitions between states have high probabilities.

#### Quiz 3: Sampling Methods PA Quiz

Q1. This quiz is a companion quiz to the Sampling Methods Programming Assignment. Please refer to the writeup for the programming assignment for instructions on how to complete this quiz.

Let’s run an experiment using our Gibbs sampling method. As before, use the toy image network and set the on-diagonal weight of the pairwise factor (in ConstructToyNetwork.m) to be 1.0 and the off-diagonal weight to be 0.1. Now run Gibbs sampling a few times, first initializing the state to be all 1’s and then initializing the state to be all 2’s. What effect does the initial assignment have on the accuracy of Gibbs sampling? Why does this effect occur?

- The initial state is not an important factor in our result as Gibbs can make large moves of multiple variables to quickly escape this bad state.
- The initial state has a significant impact on the result of our sampling, which makes sense as strong correlation makes mixing time long and we remain close to the initial assignment for a long time.
- The initial state has a significant impact on the result as, though our chain mixes quickly, it will mix to a distribution far from the actual distribution and close to the initial assignment.
- The initial state has a significant impact on the result of our sampling as Gibbs will never switch variables because the pairwise potentials enforce strong agreement, so we are in a local optimum.

Q2. Set the on-diagonal weight of our toy image network to 1 and off-diagonal weight to .2. Now visualize multiple runs with each of Gibbs, MHUniform, Swendsen-Wang variant 1, and Swendsen-Wang variant 2 using VisualizeMCMCMarginals.m (see TestToy.m for how to do this). How do the mixing times of these chains compare? How do the final marginals compare to the exact marginals? Why?

- The Swendsen-Wang variants outperform the other approaches, with faster mixing and better final marginals. This is likely due to the block-flipping nature of Swendsen-Wang which allows us to flip blocks and quickly mix in environments with strong agreeing potentials.
- All variants perform poorly in the case of strong pairwise potentials. All algorithms are subject to positive feedback loops with the tight loops in our grid and strong pairwise agreement potentials, preventing appropriate mixing.
- Having strong pairwise potentials enforcing agreement is not a problem for any of these sampling methods and all perform equally well — mixing quickly and ending up close to the final marginals.
- Gibbs outperforms the other variants in this instance. Gibbs has some issues with strong pairwise potentials, but is not nearly as bad as MH where blocks end up stuck with the same level so we cannot mix appropriately.

Q3. Set the on-diagonal weight of our toy image network to .5 and off-diagonal weight to .5. Now visualize multiple runs with each of Gibbs, MHUniform, Swendsen-Wang variant 1, and Swendsen-Wang variant 2 using VisualizeMCMCMarginals.m (see TestToy.m for how to do this). How do the mixing times of these chains compare? How do the final marginals compare to the exact marginals? Why?

- All variants perform equally well. They all mix quickly and have very low variance throughout their runs — remaining close to the true marginals. This is because the pairwise marginals do not force us into preferring agreement when we should not.
- Gibbs and MHUniform perform very well and are somewhat better than the Swendsen-Wang variants. This is because the first two variants use local moves so the local marginals remained consistently close to the true marginals, while SW allows big swings over multiple variables that perturb the distribution.
- Gibbs performs poorly relative to the other variants — exhibiting slower mixing time and marginals further from the exact ones. This difference is likely due to the Gibbs strong global dependence that prevents it from acting appropriately unless all variables are relatively well synced to their true marginals.
- Swendsen-Wang outperforms the other variants, though all perform relatively well. SW is better because its larger block moves allow for faster mixing and mean it reaches marginal estimates closer to the true marginals faster.

Q4. When creating our proposal distribution for Swendsen-Wang, if you set all the $q_{i,j}$

- Switching $q_{i,j}$ to 0 is equivalent to a randomized variant of Gibbs sampling where we are allowed to take a random, rather than fixed, order.
- Switching $q_{i,j}$ to 0 is equivalent to MH-Uniform.
- Switching $q_{i,j}$ to 0 is equivalent to the first variant of Swendsen-Wang.
- Switching $q_{i,j}$ to 0 leaves us without a valid proposal distribution and is not a feasible sampling algorithm.

#### Quiz 4: Inference in Temporal Models

Q1. Unrolling DBNs. Which independencies hold in the unrolled network for the following 2-TBN for all $t$?

(Hint: it may be helpful to draw the unrolled DBN for several slices)

- $(Weather^t \perp Velocity^t \mid Weather^{(t-1)}, Obs^{1 \ldots t})$
- $(Weather^t \perp Velocity^t \mid Obs^{1 \ldots t})$
- None of these
- $(Weather^t \perp Location^t \mid Velocity^t, Obs^{1 \ldots t})$
- $(Failure^t \perp Location^t \mid Obs^{1 \ldots t})$
- $(Failure^t \perp Velocity^t \mid Obs^{1 \ldots t})$

Q2. *Limitations of Inference in DBNs. What makes inference in DBNs difficult?

- Standard clique tree inference cannot be applied to a DBN
- As $t$ grows large, we generally lose independencies of the form $(X^{(t)} \perp Y^{(t)} \mid \ldots)$
- As $t$ grows large, we generally lose all independencies in the ground network
- In many networks, maintaining an exact belief state over the variables requires a full joint distribution over all variables in each time slice

Q3. Entanglement in DBNs. Which of the following are consequences of entanglement in Dynamic Bayesian Networks over discrete variables?

- The belief state never factorizes.
- All variables in the unrolled DBN become correlated.
- The size of an exact representation of the belief state is exponentially large in the number of variables.
- The size of an exact representation of the belief state is quadratic in the number of variables.

### Week 5 Quiz Answers

#### Quiz 1: Inference Final Exam

Q1. Reparameterization. Suppose we have a calibrated clique tree $T$ and calibrated cluster graph $G$ for the same Markov network, and have thrown away the original factors. Now we wish to reconstruct the joint distribution over all the variables in the network only from the beliefs and sepsets. Is it possible for us to do so from the beliefs and sepsets in $T$? Separately, is it possible for us to do so from the beliefs and sepsets in $G$?

- It is possible in $G$ but not in $T$.
- It is possible in both $T$ and $G$.
- It is not possible in $T$ or $G$.
- It is possible in $T$ but not in $G$.

Q2. *Markov Network Construction. Consider the unrolled network for the plate model shown below, where we have $n$ students and $m$ courses. Assume that we have observed the grade of all students in all courses. In general, what does a pairwise Markov network that is a minimal I-map for the conditional distribution look like? (Hint: the factors in the network are the CPDs reduced by the observed grades. We are interested in modeling the conditional distribution, so we do not need to explicitly include the Grade variables in this new network. Instead, we model their effect by appropriately choosing the factor values in the new network.)

- A fully connected graph with instantiations of the Difficulty and Intelligence variables.
- Impossible to tell without more information on the exact grades observed.
- A fully connected bipartite graph where instantiations of the Difficulty variables are on one side and instantiations of the Intelligence variables are on the other side.
- A graph over instantiations of the Difficulty variables and instantiations of the Intelligence variables, not necessarily bipartite; there could be edges between different Difficulty variables, and there could also be edges between different Intelligence variables.
- A bipartite graph where instantiations of the Difficulty variables are on one side and instantiations of the Intelligence variables are on the other side. In general, this graph will not be fully connected.

Q3. **Clique Tree Construction. Consider a pairwise Markov network that consists of a graph with $m$ variables on one side and $n$ on the other. This graph is bipartite but fully connected, in that each of the $m$ variables on the one side is connected to all and only the $n$ variables on the other side. Define the size of a clique to be the number of variables in the clique. There exists a clique tree $T^*$ for the pairwise Markov network such that the size of the largest clique in $T^*$ is the smallest amongst all possible clique trees for this network. What is the size of the largest sepset in $T^*$?

Note: if you’re wondering why we would ever care about this, remember that the complexity of inference depends on the number of entries in the largest factor produced in the course of message passing, which in turn, is affected by the size of the largest clique in the network, amongst other things.

Hint: Use the relationship between sepsets and conditional independence to derive a lower bound for the size of the largest sepset, then construct a clique tree that achieves this bound.

- $\max(m,n)+1$
- $mn$
- $m+n$
- $\min(m,n)+1$
- $\min(m,n)$
- $mn+1$
- $\max(m,n)$
- $m+n+1$

Q4. Uses of Variable Elimination. Which of the following quantities can be computed using the sum-product variable elimination algorithm? (In the options, let $X$ be a set of query variables, and $E$ be a set of evidence variables in the respective networks.) You may select 1 or more options.

- $P(X)$ in a Markov network
- The partition function for a Markov network
- $P(X)$ in a Bayesian network
- The most likely assignment to the variables in a Bayesian network.

Q5. *Time Complexity of Variable Elimination. Consider a Bayesian network taking the form of a chain of $n$ variables, $X_1 \rightarrow X_2 \rightarrow \cdots \rightarrow X_n$, where each of the $X_i$ can take on $k$ values. Assume we eliminate the $X_i$ starting from $X_2$, going to $X_3, \ldots, X_n$ and then back to $X_1$. What is the computational cost of running variable elimination with this ordering?

- $O(nk)$
- $O(kn^2)$
- $O(k^n)$
- $O(nk^3)$

Q6. *Numerical Issues in Belief Propagation. In practice, one of the issues that arises when we propagate messages in a clique tree is that when we multiply many small numbers, we quickly run into the precision limits of floating-point numbers, resulting in arithmetic underflow. One possible approach for addressing this problem is to renormalize each message, as it’s passed, such that its entries sum to 1. Assume that we do not store the renormalization factor at each step. Which of the following statements describes the consequence of this approach?

- We will be unable to extract the partition function, but the variable marginals that are obtained from renormalizing the beliefs at each clique will still be correct.
- This does not change the results of the algorithm: when the clique tree is calibrated, we can obtain from it both the partition function and the correct marginals.
- This renormalization will give rise to incorrect marginals at calibration.
- Calibration will not even be achieved using this scheme.
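
The effect of renormalizing messages can be tried on a two-clique chain with cliques $\{A,B\}$ and $\{B,C\}$. This is a minimal sketch with invented potentials and function names; it passes messages in one direction only and reads off the belief over $C$.

```python
from typing import List


def normalize(msg: List[float]) -> List[float]:
    total = sum(msg)
    return [v / total for v in msg]


def pass_message(incoming: List[float], potential: List[List[float]]) -> List[float]:
    """Sum-product message: out[j] = sum_i incoming[i] * potential[i][j]."""
    return [sum(incoming[i] * potential[i][j] for i in range(len(incoming)))
            for j in range(len(potential[0]))]


def belief_over_c(phi_ab: List[List[float]], phi_bc: List[List[float]],
                  renormalize: bool) -> List[float]:
    """Unnormalized belief over C after passing messages A -> B -> C."""
    msg = pass_message([1.0, 1.0], phi_ab)   # sums A out of phi_ab
    if renormalize:
        msg = normalize(msg)                 # scale factor is discarded
    msg = pass_message(msg, phi_bc)          # sums B out
    if renormalize:
        msg = normalize(msg)
    return msg


# Hypothetical pairwise potentials over binary variables.
phi_ab = [[4.0, 1.0], [2.0, 3.0]]
phi_bc = [[5.0, 1.0], [1.0, 2.0]]

raw = belief_over_c(phi_ab, phi_bc, renormalize=False)
scaled = belief_over_c(phi_ab, phi_bc, renormalize=True)
print(sum(raw))          # 48.0, the partition function of this toy network
print(normalize(raw))    # marginal over C
print(normalize(scaled)) # same marginal, but the scale 48 is no longer recoverable
```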

Q7. Convergence in Belief Propagation. Suppose we ran belief propagation on a cluster graph $G$ and a clique tree $T$ for the same Markov network that is a perfect map for a distribution $P$. Assume that both $G$ and $T$ are valid, i.e., they satisfy family preservation and the running intersection property. Which of the following statements regarding the algorithm are true? You may select 1 or more options.

- Assuming the algorithm converges, if a variable $X$ appears in two clusters in $G$, the marginals $P(X)$ computed from the two cluster beliefs must agree.
- If the algorithm converges, the final clique beliefs in $T$, when renormalized to sum to 1, are true marginals of $P$.
- If the algorithm converges, the final cluster beliefs in $G$, when renormalized to sum to 1, are true marginals of $P$.
- Assuming the algorithm converges, if a variable $X$ appears in two cliques in $T$, the marginals $P(X)$ computed from the two clique beliefs must agree.

Q8. Metropolis-Hastings Algorithm. Assume we have an $n \times n$ grid-structured MRF over the variables $X_{i,j}$. Let $\mathbf{X}_i = \{X_{i,1}, \ldots, X_{i,n}\}$ and $\mathbf{X}_{-i} = \mathbf{X} - \mathbf{X}_i$. Consider the following instance of the Metropolis-Hastings algorithm: at each step, we take our current assignment $\mathbf{x}_{-i}$ and use exact inference to compute the conditional probability $P(\mathbf{X}_i \mid \mathbf{x}_{-i})$. We then sample $\mathbf{x}_i'$ from this posterior distribution, and use that as our proposal. What is the correct acceptance probability for this proposal?

Hint: what is the relationship between this and Gibbs sampling?
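
To experiment with the hint, the sketch below evaluates the Metropolis-Hastings acceptance probability $\min\left(1, \frac{P(x')\,Q(x' \rightarrow x)}{P(x)\,Q(x \rightarrow x')}\right)$ when the proposal for coordinate $i$ is its exact conditional given everything else. It uses a single variable rather than a whole grid row, and the joint table is invented for illustration.

```python
from typing import Dict, Tuple

State = Tuple[int, ...]


def conditional(joint: Dict[State, float], i: int, x: State) -> float:
    """P(X_i = x[i] | all other coordinates of x), from an unnormalized joint."""
    same_rest = [k for k in joint
                 if all(k[j] == x[j] for j in range(len(x)) if j != i)]
    return joint[x] / sum(joint[k] for k in same_rest)


def acceptance(joint: Dict[State, float], i: int, x: State, x_new: State) -> float:
    """MH acceptance probability when x and x_new differ only in coordinate i."""
    forward = conditional(joint, i, x_new)   # Q(x -> x_new)
    backward = conditional(joint, i, x)      # Q(x_new -> x)
    ratio = (joint[x_new] * backward) / (joint[x] * forward)
    return min(1.0, ratio)


# Hypothetical unnormalized joint over two binary variables.
joint = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 6.0}
a = acceptance(joint, 0, (0, 1), (1, 1))
b = acceptance(joint, 1, (1, 1), (1, 0))
print(a, b)
```

Printing the acceptance probability for a few different moves makes the relationship to Gibbs sampling easy to see.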

Q9. *Value of Information. In the influence diagram on the right, when does performing LabTest have value? That is, when would you want to observe the LabTest variable?

Hint: Think about when information is valuable in making a decision.

- When there is some treatment $t$ such that $V(D, t)$ is different for different diseases $D$.
- When there is some disease $d$ such that $\textrm{argmax}_t V(d, t) \neq \textrm{argmax}_t \sum_d P(d) V(d, t)$
- When there is some lab value $l$ such that $\textrm{argmax}_t \sum_d P(d \mid l) V(d, t) \neq \textrm{argmax}_t \sum_d P(d) V(d, t)$
- When $P(D \mid L)$ is different from $P(D)$.
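
The value of observing a test can be computed directly as the gain in maximum expected utility. The sketch below assumes a disease variable, a lab result, and a utility table $V(d, t)$; the diagram from the quiz is not reproduced, so all names and numbers here are hypothetical.

```python
from typing import List


def value_of_information(p_d: List[float],
                         p_l_given_d: List[List[float]],
                         V: List[List[float]]) -> float:
    """MEU when the lab result is observed before treating, minus MEU without it.

    p_d[d] is the prior, p_l_given_d[d][l] the test model, V[d][t] the utility.
    """
    n_d, n_l, n_t = len(p_d), len(p_l_given_d[0]), len(V[0])
    meu_prior = max(sum(p_d[d] * V[d][t] for d in range(n_d)) for t in range(n_t))
    meu_observed = 0.0
    for l in range(n_l):
        joint = [p_d[d] * p_l_given_d[d][l] for d in range(n_d)]  # P(d, l)
        # Best treatment for this lab value; using P(d, l) folds in the weight P(l).
        meu_observed += max(sum(joint[d] * V[d][t] for d in range(n_d))
                            for t in range(n_t))
    return meu_observed - meu_prior


V = [[10.0, 0.0], [0.0, 10.0]]  # treatment t helps only when it matches disease d

informative = value_of_information([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], V)
useless = value_of_information([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], V)
# Here the test shifts P(D | L) away from P(D) but never changes the best treatment.
never_changes = value_of_information([0.9, 0.1], [[0.6, 0.4], [0.4, 0.6]], V)
print(informative, useless, never_changes)
```

In these toy numbers the test is worth 4 utility units in the first case and nothing in the other two, including the third case where the posterior differs from the prior but the chosen treatment stays the same.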

Q10. *Belief Propagation. Say you had a probability distribution encoded in a set of factors $\Phi$, and that you constructed a loopy cluster graph $C$ to do inference in it. While you were performing loopy belief propagation on this graph, lightning struck and your computer shut down; to your horror, when you booted it back up, the only information you could recover were the graph structure $C$ and the cluster beliefs at the current iteration. (For each cluster, the cluster belief is its initial potential multiplied by all incoming messages. You don’t have access to the sepset beliefs, the messages, or the original factors $\Phi$.) Assume the lightning struck before you had finished, i.e., the graph is not yet calibrated. Can you still recover the original distribution $P_\Phi$ from this? Why?

- We can reconstruct the original distribution by taking the product of cluster beliefs and normalizing it.
- We can reconstruct the (unnormalized) original distribution by taking the ratio of the product of cluster beliefs to sepset beliefs, and the sepset beliefs can be obtained by marginalizing the cluster beliefs.
- We can’t reconstruct the (unnormalized) original distribution because we don’t have the sepset beliefs to compute the ratio of the product of cluster beliefs to sepset beliefs.
- We can’t reconstruct the original distribution because we were performing loopy belief propagation, and the reparameterization property doesn’t hold when it’s loopy.

##### Probabilistic Graphical Models 2: Inference Course Review

In our experience, the Probabilistic Graphical Models 2: Inference course is worth enrolling in: you can gain new skills from professionals for free, and we are confident it will be worth your time.

If you are stuck on a quiz or a graded assessment in the Probabilistic Graphical Models 2: Inference course, just visit Networking Funda to get the Probabilistic Graphical Models 2: Inference Coursera Quiz Answers.

##### Conclusion:

I hope these Probabilistic Graphical Models 2: Inference Coursera Quiz Answers help you learn something new from this course. If they helped you, don’t forget to bookmark our site for more Quiz Answers.

This course is intended for audiences of all experiences who are interested in learning about new skills in a business context; there are no prerequisite courses.

Keep Learning!

##### Get All Course Quiz Answers of Probabilistic Graphical Models Specialization

Probabilistic Graphical Models 1: Representation Coursera Quiz Answers

Probabilistic Graphical Models 2: Inference Coursera Quiz Answers

Probabilistic Graphical Models 3: Learning Coursera Quiz Answers