### Get All Weeks Probabilistic Graphical Models 3: Learning Coursera Quiz Answers

#### Probabilistic Graphical Models 3: Learning Coursera Quiz Answers

### Week 1 Quiz Answers

#### Quiz 1: Learning in Parametric Models

Q1. Computing Sufficient Statistics. Suppose that you are playing Dungeons & Dragons, and you suspect that the 4-sided die that your Dungeon Master is using is biased. In the past 60 times that you have attacked with your dagger and the 4-sided die was rolled to calculate how many hit points of damage you inflicted, 20 times it has come up 1, 15 times it has come up 2, 15 times it has come up 3, and 10 times it has come up 4.

Let θ1 be the true probability of the die landing on 11, and similarly for θ2, θ3, θ4. You want to estimate these parameters from the past 60 rolls that you observed using a simple multinomial model. Which of the following is a sufficient statistic for this data?

- The total amount of damage you inflicted with your trusty dagger (i.e., the sum of all die rolls).
- None of these are sufficient statistics.
- The total number of times you viciously attacked a monster with your dagger (i.e., the total number of times that the dice was rolled).
- A vector with with four components, with the i^{th}i th

component being the number of times you dealt ii hit points worth of damage.

Q2. MLE Parameter Estimation. In the context of the previous question, what is the unique Maximum Likelihood Estimate (MLE) of the parameter \theta_1θ1

? Give your answers rounded to the nearest ten-thousandth (i.e. 1/6 should be 0.1667).

Enter answer here

Q3. Likelihood Functions. For a Naive Bayes network with one parent node, XX, and 3 children nodes, Y_1, Y_2, Y_3Y

1

*L*(*θ*:*D*)=∏*m*=1*M**P*(*y*1[*m*]∣*x*[*m*]:*θY*1∣*X*)*P*(*y*2[*m*]∣*x*[*m*]:*θY*2∣*X*)*P*(*y*3[*m*]:*θY*2∣*X*)*P*(*x*[*m*]∣*y*1[*m*],*y*2[*m*],*y*3[*m*]:*θ*)- L(\theta:D) = \prod_{m=1}^M P(x[m], y_1[m], y_2[m], y_3[m]:\theta)
*L*(*θ*:*D*)=∏*m*=1*M**P*(*x*[*m*],*y*1[*m*],*y*2[*m*],*y*3[*m*]:*θ*) - L(\theta:D) = (\prod_{m=1}^M P(x[m]:\theta_X))(\prod_{m=1}^M P(y_1[m]|x[m]:\theta_{Y_1|X}))(\prod_{m=1}^M P(y_2[m]|x[m]:\theta_{Y_2|X}))(\prod_{m=1}^M P(y_3[m]|x[m]:\theta_{Y_3|X}))
*L*(*θ*:*D*)=(∏*m*=1*M**P*(*x*[*m*]:*θX*))(∏*m*=1*M**P*(*y*1[*m*]∣*x*[*m*]:*θY*1∣*X*))(∏*m*=1*M**P*(*y*2[*m*]∣*x*[*m*]:*θY*2∣*X*))(∏*m*=1*M**P*(*y*3[*m*]∣*x*[*m*]:*θY*3∣*X*)) - L(\theta:D) = \prod_{m=1}^M P(y_1[m], y_2[m], y_3[m]:\theta)P(x[m]|y_1[m],y_2[m],y_3[m]:\theta)
*L*(*θ*:*D*)=∏*m*=1*M**P*(*y*1[*m*],*y*2[*m*],*y*3[*m*]:*θ*)*P*(*x*[*m*]∣*y*1[*m*],*y*2[*m*],*y*3[*m*]:*θ*)

Q4. MLE for Naive Bayes. Using a Naive Bayes model for spam classification with the vocabulary V={“SECRET”, “OFFER”, “LOW”, “PRICE”, “VALUED”, “CUSTOMER”, “TODAY”, “DOLLAR”, “MILLION”, “SPORTS”, “IS”, “FOR”, “PLAY”, “HEALTHY”, “PIZZA”}. We have the following example spam messages SPAM = {“MILLION DOLLAR OFFER”, “SECRET OFFER TODAY”, “SECRET IS SECRET”} and normal messages, NON-SPAM = {“LOW PRICE FOR VALUED CUSTOMER”, “PLAY SECRET SPORTS TODAY”, “SPORTS IS HEALTHY”, “LOW PRICE PIZZA”}.

We create a multinomial naive Bayes model for the data given above. This can be modeled as a parent node taking values SPAM and NON-SPAM and a child node for each word in the vocabulary. The \thetaθ values are estimated based on the number of times that a word appears in the vocabulary. Give the MLE for \theta_{\textrm{SPAM}}θ

SPAM

. Enter the value as a decimal rounded to the nearest ten-thousandth (0.xxxx).

Enter answer here

Q5. MLE for Naive Bayes. Using the same data and model above, give the MLE for \theta_{\textrm{SECRET}|\textrm{SPAM}}θ

SECRET∣SPAM

. Enter the value as a decimal rounded to the nearest ten-thousandth (0.xxxx).

Enter answer here

Q6. MLE for Naive Bayes. Using the same data and model above, give the MLE for \theta_{\textrm{SECRET}|\textrm{NON-SPAM}}θ

SECRET∣NON-SPAM

. Enter the value as a decimal rounded to the nearest ten-thousandth (0.xxxx).

Enter answer here

Q7. Learning Setups. Consider the following scenario: You have been given a dataset that contains patients and their gene expression data for 10 genes. You are also given a 0/1 label where 1 means that patient has disease AA and 0 means the patient does not.

Your goal is to learn a classification algorithm that could predict these labels with high accuracy. You split the data into three sets:

1: Set of patients used for fitting the classifier parameters (e.g., the weights and bias of a logistic regression classifier).

2: Set of patients used for tuning the hyperparameters of the classifier (e.g., how much regularization to apply).

3: Set of patients used to assess the performance of the classifier.

What are these sets called?

1 & 2: Training Set, 3: Validation Set.

1: Training Set, 2: Validation Set, 3: Test Set

1: Validation Set, 2: Test Set, 3: Training Set.

1: Training Set, 2: Test Set, 3: Validation Set

Q8. Constructing CPDs. Assume that we are trying to construct a CPD for a random variable whose value labels a document (e.g., an email) as belonging to one of two categories (e.g., spam or non-spam). We have identified KK words whose presence (or absence) in the document each changes the distribution over labels (e.g., the presence of the word “free” is more likely to indicate that the email is spam). Assume that we have MM labeled documents that we use to estimate the parameters for the CPD of the label given indicator variables representing the appearance of words in the document. We plan to use maximum likelihood estimation to select the parameters of this CPD.

If M=1,000M=1,000 and K=30K=30, which of the following CPD types are most likely to provide the best generalization performance to unseen data? Mark all that apply.

- None of these CPDs would work.
- A linear Gaussian CPD
- A sigmoid CPD
- A table CPD

Q9. Constructing CPDs. For the same scenario as described in the previous question,

if M=100,000M=100,000 and K=3K=3, which of the following CPD types is most likely to provide the best generalization performance to unseen data?

- A sigmoid CPD
- A table CPD
- A linear Gaussian CPD
- A tree CPD with K=3K=3 leaves

#### Quiz 2: Bayesian Priors for BNs

Q1. **BDe Priors. **The following is a common approach for defining a parameter prior for a Bayesian network, and is referred to as the BDe prior. Let P_0*P*0 be some distribution over possible assignments x_1,\ldots,x_n*x*1,…,*xn*, and select some fixed \alpha*α*. For a node X*X* with parents \bf{U}**U** we define \alpha_{x|\bf{u}} = \alpha P_0(x,\bf{u} )*αx*∣**u**=*αP*0(*x*,**u**).

For this question, assume X*X* takes one of m*m* values and that X*X* has k*k* parents, each of which takes d*d* values. If we choose P_0*P*0 to be the uniform distribution, then what is the value of \alpha_{x|\bf{u}}*αx*∣**u**?

*α**α*/(*md*)*α*/(*mdk*)*α*/(*mkd*)

Q2. **Learning with a Dirichlet Prior. **Suppose we are interested in estimating the distribution over the English letters. We assume an alphabet that consists of 26 letters and the space symbol, and we ignore all other punctuation and the upper/lower case distinction. We model the distribution over the 27 symbols as a multinomial parametrized by \theta = (\theta_1,\ldots,\theta_{27})*θ*=(*θ*1,…,*θ*27) where \sum_i \theta_i = 1∑*i**θi*=1 and all \theta_i \geq 0*θi*≥0.

Now we go to Stanford’s Green library and repeat the following experiment: randomly pick up a book, open a page, pick a spot on the page, and write down the nearest symbol that is in our alphabet. We use X[m]*X*[*m*] to denote the letter we obtain in the m*m*th experiment.

In the end, we have collected a dataset D = \{x[1],\ldots,x[2000]\}*D*={*x*[1],…,*x*[2000]} consisting of 2000 symbols, among which “e” appears 260 times. We use a Dirichlet prior over \theta*θ*, i.e. P(\theta) = Dirichlet(\alpha_1,\ldots,\alpha_{27})*P*(*θ*)=*Dirichlet*(*α*1,…,*α*27) where each \alpha_i = 10*αi*=10. What is the predictive probability that letter “e” occurs with this prior? (i.e., what is P(X[2001] = *P*(*X*[2001]= “e*e*” \mid D)∣*D*)?) Write your answer as a decimal rounded to the nearest **ten thousandth** (0.xxxx).

Q3. **Learning with a Dirichlet Prior. **In the setting of the previous question, suppose we had collected M = 2000*M*=2000 symbols, and the number of times “a” appeared was 100, while the number of times “p” appeared was 87. Now suppose we draw 2 more samples, X[2001] and X[2002]. If we use \alpha_i=10*αi*=10 for all i *i*, what is the probability of P(X[2001] = *P*(*X*[2001]=”p”, X[2002] = ,*X*[2002]= “a”|D)∣*D*)? (round your answer to the nearest **millionth**, 0.xxxxxx)

Q4. ***Learning with a Dirichlet Prior. **In the setting of previous two questions, suppose we have collected M*M* symbols, and let \alpha = \sum_i \alpha_i*α*=∑*i**αi* (we no longer assume that each \alpha_i = 10*αi*=10). In which situation(s) does the Bayesian predictive probability using the Dirichlet prior ( i.e., P(X[M+1]\mid D)*P*(*X*[*M*+1]∣*D*) ) converge to the MLE estimation for any distribution over M*M*? You may select 1 or more options.

- \alpha \rightarrow 0
*α*→0 and M*M*is fixed - None of the above
- M \rightarrow \infty
*M*→∞ and \alpha*α*is fixed - \frac{\alpha}{M} \rightarrow \infty
*Mα*→∞ - M \rightarrow 0
*M*→0 and \alpha*α*is fixed and non-zero

### Week 2 Quiz Answers

#### Quiz 1: Parameter Estimation in MNs

Q1. *MN Parameter Estimation and Inference. Consider the process of gradient-ascent training for a log-linear model with kk features, given a data set DD with MM instances. Assume for simplicity that the cost of computing a single feature over a single instance in our data set is constant, as is the cost of computing the expected value of each feature once we compute a marginal over the variables in its scope. Also assume that we can compute each required marginal in constant time after we have a calibrated clique tree.

Assume that we use clique tree calibration to compute the expected sufficient statistics in this model and that the cost of doing this is cc. Also, assume that we need rr iterations for the gradient process to converge.

What is the cost of this procedure? Recall that in big-O notation, same or lower-complexity terms are collapsed.

- O(Mk + r(c + k))O(Mk+r(c+k))
- O(r(Mc + k))O(r(Mc+k))
- O(Mk + rck)O(Mk+rck)
- O(Mk + rc)O(Mk+rc)

Q2. *CRF Parameter Estimation. Consider the process of gradient-ascent training for a CRF log-linear model with kk features, given a data set DD with MM instances.

Assume for simplicity that the cost of computing a single feature over a single instance in our data set is constant, as is the cost of computing the expected value of each feature once we compute a marginal over the variables in its scope. Also assume that we can compute each required marginal in constant time after we have a calibrated clique tree.

Assume that we use clique tree calibration to compute the expected sufficient statistics in this model, and that the cost of running clique tree calibration is cc. Assume that we need rr iterations for the gradient process to converge.

What is the cost of this procedure?

Recall that in big-O notation, same or lower-complexity terms are collapsed.

- O(Mk + rck)O(Mk+rck)
- O(Mk + Mrc)O(Mk+Mrc)
- O(Mk + rc)O(Mk+rc)
- O(rMc + rk)O(rMc+rk)

Q3. Parameter Learning in MNs vs BNs. Compared to learning parameters in Bayesian networks, learning in Markov networks is generally…

- equally difficult, as both require an inference step at each iteration.
- less difficult because we must separately optimize decoupled portions of the likelihood function in a Bayes Net, while we can optimize portions together in a Markov network.
- equally difficult, though MN inference will be better by a constant factor difference in the computation time as we do not need to worry about directionality.
- more difficult because we cannot push in sums to decouple the likelihood function, allowing independent parallel optimizations, as we can in Bayes Nets.

### Week 3 Quiz Answers

#### Quiz 1: Structure Scores

Q1. Likelihood Scores. Consider the following 4 graphs:

Which of the following statements about the likelihood scores of the different graphs is/are true? You may choose more than 1 option.

\mathrm{Score}_L(G1 : D) \geq \mathrm{Score}_L(G4 : D)Score

L

(G1:D)≥Score

L

(G4:D) for every dataset DD

\mathrm{Score}_L(G2 : D) \geq \mathrm{Score}_L(G1 : D)Score

L

(G2:D)≥Score

L

(G1:D) for every dataset DD

\mathrm{Score}_L(G4 : D) \geq \mathrm{Score}_L(G3 : D)Score

L

(G4:D)≥Score

L

(G3:D) for every dataset DD

\mathrm{Score}_L(G2 : D) \geq \mathrm{Score}_L(G4 : D)Score

L

(G2:D)≥Score

L

(G4:D) for every dataset DD

Q2. BIC Scores. Consider the same 4 graphs as in the previous question, but now think about the BIC score. Which of the following statements is/are true?

\mathrm{Score}*{BIC}(G1 : D) \geq \mathrm{Score}*{BIC}(G2 : D)Score

BIC

(G1:D)≥Score

BIC

(G2:D) for every dataset DD

\mathrm{Score}*{BIC}(G1 : D) \geq \mathrm{Score}*{BIC}(G4 : D)Score

BIC

(G1:D)≥Score

BIC

(G4:D) for every dataset DD

\mathrm{Score}*{BIC}(G1 : D) = \mathrm{Score}*{BIC}(G3 : D)Score

BIC

(G1:D)=Score

BIC

(G3:D) for every dataset DD

\mathrm{Score}*{BIC}(G2 : D) \ne \mathrm{Score}*{BIC}(G3 : D)Score

BIC

(G2:D)

=Score

BIC

(G3:D) for every dataset DD

Q3. Likelihood Guarantees. Consider graphs G2G2 and G3G3.

We have a dataset DD generated from some probability distribution PP, and the likelihood scores for G2G2 and G3G3 are \mathrm{Score}_L(G2 : D)Score

L

(G2:D) and \mathrm{Score}_L(G3 : D)Score

L

(G3:D), respectively.

Let \theta^*_{D,2}θ

D,2

∗

and \theta^*_{D,3}θ

D,3

∗

be the maximum likelihood parameters for each network, taken with respect to the dataset DD.

Now let L(X : G, \theta)L(X:G,θ) represent the likelihood of dataset XX given the graph GG and parameters \thetaθ, so \mathrm{Score}_L(G2 : D) = L(D : G2, \theta_{D,2}^*)Score

L

(G2:D)=L(D:G2,θ

D,2

∗

) and \mathrm{Score}*L(G3 : D) = L(D : G3, \theta*{D,3}^*)Score

L

(G3:D)=L(D:G3,θ

D,3

∗

).

Suppose that L(D : G2, \theta_{D,2}^*) > L(D : G3, \theta_{D,3}^*)L(D:G2,θ

D,2

∗

)>L(D:G3,θ

D,3

∗

). If we draw a new dataset EE from the distribution PP, which of the following statements can we guarantee?

If more than one statement holds, choose the more general statement.

*L*(*E*:*G*2,*θD*,2∗)=*L*(*E*:*G*3,*θD*,3∗)- None of the others
- L(E : G2, \theta_{D,2}^*) \gt L(E : G3, \theta_{D,3}^*)
*L*(*E*:*G*2,*θD*,2∗)>*L*(*E*:*G*3,*θD*,3∗) - L(E : G2, \theta_{D,2}^*) \lt L(E : G3, \theta_{D,3}^*)
*L*(*E*:*G*2,*θD*,2∗)<*L*(*E*:*G*3,*θD*,3∗) - L(E : G2, \theta_{D,2}^*) \leq L(E : G3, \theta_{D,3}^*)
*L*(*E*:*G*2,*θD*,2∗)≤*L*(*E*:*G*3,*θD*,3∗)

Q4. Hidden Variables. Consider the case where the generating distribution has a naive Bayes structure, with an unobserved class variable CC and its binary-valued children X_1,\ldots,X_{100}X

1

,…,X

100

. Assume that CC is strongly correlated with each of its children (that is, distinct classes are associated with fairly different distributions over each X_iX

i

). Now suppose we try to learn a network structure directly on X_1,\ldots,X_{100}X

1

,…,X

100

, without including CC in the network. What network structure are we likely to learn if we have 10,000 data instances, and we are using table CPDs with the likelihood score as the structure learning criterion?

- A fully connected network, i.e., one with an edge between every pair of nodes.
- The empty network, i.e., a network consisting of only the variables but no edges between them.
- Some connected network over X_1,\ldots,X_{100}X

that is not fully connected nor empty.

Q5. Hidden Variables. Now suppose that we use the BIC score instead of the likelihood score in the previous question. What network structure are we likely to learn with the same 10,000 data instances?

- The empty network, i.e., a network consisting of only the variables but no edges between them.
- Some connected network over X_1, that is not fully connected nor empty.
- A fully connected network, i.e., one with an edge between every pair of nodes.

#### Quiz 2: Tree Learning and Hill Climbing

Q1. Detective! You are a detective tracking down a serial robber who has already stolen from 1,000 victims, thereby giving you a large enough training set.

No one else has been able to catch him (or her), but you are certain that there is a method (specifically, a Bayesian network) in this madness. You decide to model the robber’s activity with a tree-structured network (meaning that each node has at most one parent): this network has (observed) variables such as the location of the previous crime, the gender and occupation of the previous victim, the current day of the week, etc., and a single unobserved variable, which is the location of the next robbery. The aim is to predict the location of the next robbery. Unfortunately, you have forgotten all of your classical graph algorithms, but fortunately, you have a copy of The Art of Computer Programming (volume 4B) next to you. Which graph algorithm do you look up to help you find the optimal tree-structured network? Assume that the structure score we are using satisfies score decomposability and score equivalence.

- Finding a directed spanning forest with the largest diameter (i.e., the longest distance between any pair of nodes).
- Finding an undirected spanning forest with the largest diameter (i.e., the longest distance between any pair of nodes).
- Finding the shortest path between all pairs of points.
- Finding the maximum-weight undirected spanning forest (i.e., a set of undirected edges such that there is at most one path between any pair of nodes).
- Finding the size of the largest clique in a graph.

Q2. *Recovering Directionality. Once again, assume that our structure score satisfies score decomposability and score equivalence. After we find the optimal undirected spanning forest (containing nn nodes), how can we recover the optimal directed spanning forest (and catch the robber)?

If more than one option is correct, pick the faster option; if the options take the same amount of time, pick the more general option.

- Pick any arbitrary direction for each edge, which takes O(n)O(n) time.
- Because of score equivalence, all possible directed versions of the optimal undirected spanning forest have the same score, so this is valid.
- Pick any arbitrary root, and direct all edges away from it. This takes O(n)O(n) time.
- Evaluate all possible directions for the edges by iterating over them.
- This takes O(2^n)O(2 ) time, since there are at most 2^n possible sets of edge directions in the spanning forest.
- Evaluate all possible directions for the edges.
- While there are at most 2^n
- possible sets of edge directions, we can exploit score decomposability to find the best directed spanning forest in O(n)O(n) time.

Q3. *Augmenting Trees. It turns out that the tree-structured network we learnt in the preceding questions was not sufficient to apprehend the robber, allowing him to claim his 1001th victim.

- M \cdot (I_{\hat{P}}(X_i; X_j,C) – I_{\hat{P}}(X_i; X_j))
*M*⋅(*IP*^(*Xi*;*Xj*,*C*)−*IP*^(*Xi*;*Xj*)) - M \cdot (I_{\hat{P}}(X_i; X_j,C) – I_{\hat{P}}(X_i; C)) – H_{\hat{P}}(X_i)
*M*⋅(*IP*^(*Xi*;*Xj*,*C*)−*IP*^(*Xi*;*C*))−*HP*^(*Xi*) - M \cdot (I_{\hat{P}}(X_i; X_j,C) – I_{\hat{P}}(X_i; X_j)) – H_{\hat{P}}(X_i)
*M*⋅(*IP*^(*Xi*;*Xj*,*C*)−*IP*^(*Xi*;*Xj*))−*HP*^(*Xi*) - M \cdot I_{\hat{P}}(X_i; X_j, C) – H_{\hat{P}}(X_i)
*M*⋅*IP*^(*Xi*;*Xj*,*C*)−*HP*^(*Xi*) - M \cdot (I_{\hat{P}}(X_i; X_j,C) – I_{\hat{P}}(X_i; C))
*M*⋅(*IP*^(*Xi*;*Xj*,*C*)−*IP*^(*Xi*;*C*)) - M \cdot I_{\hat{P}}(X_i; X_j, C)
*M*⋅*IP*^(*Xi*;*Xj*,*C*)

(A,B) is the mutual information in the empirical distribution of the variables in set \mathbf{A}A with the variables in set \mathbf{B}B.

Q4. Trees vs. Forests. Congratulations! Your hybrid naive-Bayes/tree-structured network managed to correctly predict where the criminal would be next, allowing the police to catch him (or her) before the 1002th victim got robbed.

The grateful populace beg you to return to studying probabilistic graphical models. While re-watching the video lectures, you begin to wonder if the algorithm we have been using to learn tree-structured networks can produce a forest, rather than a single tree. Assume that we use the likelihood score, and also assume that the maximum spanning forest algorithm breaks ties (between equal-scoring trees) arbitrarily. Which of the following is true? In this question, interpret “forest” to mean a set of two or more disconnected trees.

- It’s possible for the algorithm to produce a forest, since there are cases in which a forest will have a higher score than any tree.
- It’s theoretically possible for the algorithm to produce a forest. However, this will only occur in very contrived and unrealistic circumstances, not in practice.
- This algorithm will never produce a forest, since there will always be a tree that has strictly higher score.
- It’s possible for the algorithm to produce a forest even though trees will always score more highly, since the algorithm need not find the structure that is globally optimal (relative to the likelihood score).

### Week 4 Quiz Answers

#### Quiz 1: Learning with Incomplete Data

Q1. Missing At Random.

Suppose we are conducting a survey of job offers and salaries for Stanford graduates. We already have the major of each of these students recorded, so in the survey form, each graduating student is only asked to list up to two job offers and salaries he/she received. Which of the following scenarios is/are missing at random (MAR)?

- CS students get more offers than other majors and found selecting only two too stressful, so they neglected to list any.
- The database software ignored salaries that were large prime numbers because they could not be easily stored as products of smaller numbers.
- The database software ignored salaries of 0 submitted by students who were taking unpaid positions with community service organizations.
- The person recording the information didn’t care about humanities students and neglected to record their salaries.

Q2. Computing Sufficient Statistics.

Given the network and data instances shown below, how do we compute the expected sufficient statistics for a particular value of the parameters?

- \bar{M}[x_0,y_0,z_0] = P(y_0,z_0\mid x_0, \theta) + P(z_0\mid x_0, y_0, \theta) + P(z_0\mid x_1, y_1, \theta) + P(z_0\mid x_0, y_1, \theta) + P(x_0,z_0\mid y_0, \theta)
*M*ˉ[*x*0,*y*0,*z*0]=*P*(*y*0,*z*0∣*x*0,*θ*)+*P*(*z*0∣*x*0,*y*0,*θ*)+*P*(*z*0∣*x*1,*y*1,*θ*)+*P*(*z*0∣*x*0,*y*1,*θ*)+*P*(*x*0,*z*0∣*y*0,*θ*) - \bar{M}[x_0,y_0,z_0] = P(z_0\mid x_0, \theta) + P(z_0\mid x_0, y_0, \theta) + P(z_0\mid y_0, \theta)
*M*ˉ[*x*0,*y*0,*z*0]=*P*(*z*0∣*x*0,*θ*)+*P*(*z*0∣*x*0,*y*0,*θ*)+*P*(*z*0∣*y*0,*θ*) - \bar{M}[x_0,y_0,z_0] = P(y_0,z_0\mid x_0, \theta) + P(z_0\mid x_0, y_0, \theta) + P(x_0,z_0\mid y_0, \theta)
*M*ˉ[*x*0,*y*0,*z*0]=*P*(*y*0,*z*0∣*x*0,*θ*)+*P*(*z*0∣*x*0,*y*0,*θ*)+*P*(*x*0,*z*0∣*y*0,*θ*) - \bar{M}[x_0,y_0,z_0] = 3
*M*ˉ[*x*0,*y*0,*z*0]=3

Q3. Likelihood of Observed Data.

In a Bayesian Network with partially observed training data, computing the likelihood of observed data for a given set of parameters…

- requires probabilistic inference, while it DOES NOT in the case of fully observed data.
- cannot be achieved by probabilistic inference, while it CAN in the case of fully observed data.
- requires probabilistic inference, AS IN the case of fully observed data.

Q4. PGM with latent variables.

Adding hidden variables to a model can significantly increase the expressiveness of a model.

However, there are also some issues that arise when we try to add hidden variables.

For which of these problems can we learn a reasonable model by simply choosing the parameters that maximize training likelihood?

Assume that all variables, hidden or (partially) observed, are discrete and follow a table CPD.

You may choose more than one option.

- Choosing which edges involving the hidden nodes to add to the graph.
- Given a fixed set of edges, learning the parameters in the table CPDs of observed nodes that have hidden nodes as parents.
- Choosing the number of hidden variables to add to the graphical model.
- Given a fixed set of edges, learning the parameters in the table CPDs of each hidden node.

#### Quiz 2: Expectation Maximization

Q1. **Bayesian Clustering using Normal Distributions.**

Suppose we are doing Bayesian clustering with K*K* classes, and multivariate normal distributions as our class-conditional distributions.

Let \mathbf{X}\in\mathbf{R}^n**X**∈**R***n* represent a single data point, and C\in\{1,2,\ldots,K\}*C*∈{1,2,…,*K*} its unobserved class.

Which of the following statement(s) is/are always true in the general case?

- X_i \perp C \ |\ \mathbf{X_{-i}} \quad\forall i\in\{1,2,\ldots,n\}
*Xi*⊥*C*∣**X**−**i**∀*i*∈{1,2,…,*n*}, i.e., for any given data point, - if we know all but one coordinate, then knowing the class that it comes from doesn’t give us any information about the last coordinate.
- X_i \perp X_j \ |\ C \quad\forall i,j\in\{1,2,\ldots,n\}, i\neq j
*Xi*⊥*Xj* ∣*C*∀*i*,*j*∈{1,2,…,*n*},*i*=*j*, i.e., for any given data point, - if we know which class the data point comes from, then knowing one coordinate does not give us any information about another coordinate.
- P(\mathbf{X}|C=c) \sim N(\mu_c, I_n) \quad\forall c\in\{1,2,\ldots,K\}
*P*(**X**∣*C*=*c*)∼*N*(*μc*,*In*)∀*c*∈{1,2,…,*K*}, for some class-specific parameter \mu_c*μc* that represents the distribution of data coming from the class c*c*. (I_n*In* is the n\times n*n*×*n*identity matrix.) - P(\mathbf{X}|C=c) \sim N(\mu_c, \Sigma_c) \quad\forall c\in\{1,2,\ldots,K\}
*P*(**X**∣*C*=*c*)∼*N*(*μc*,Σ*c*)∀*c*∈{1,2,…,*K*}, for some class-specific parameters \mu_c*μc* and \Sigma_cΣ*c* that represent the distribution of data coming from the class c*c*. - P(\mathbf{X}) \sim N(\mu, \Sigma)
*P*(**X**)∼*N*(*μ*,Σ), for some parameters \mu*μ*and \SigmaΣ that represent the overall distribution of the data.

**Q2. Hard Assignment EM.** Continuing from the previous question, let us now fix each class-conditional distribution to have the identity matrix as its covariance matrix.

If we use hard-assignment EM to estimate the class-dependent mean vectors, which of the following can we say about the resulting algorithm?

- It is an instance of k-means, but using a different distance metric rather than standard Euclidean distance.
- It is an an algorithm that cannot be viewed as an instance of k-means.
- It is equivalent to running standard k-means clustering with K
*K*clusters.

Q3. ***Hard Assignment EM.** Now suppose that we fix each class-conditional distribution to have the same diagonal matrix D*D* as its covariance matrix, where D*D* is **not** the identity matrix.

- If we use hard-assignment EM to estimate the class-dependent mean vectors, which of the following can we say about the resulting algorithm?
- It is an instance of k-means, but using a different distance metric rather than standard Euclidean distance.
- It is an an algorithm that cannot be viewed as an instance of k-means.
- It is equivalent to running standard k-means clustering with K
*K*clusters.

Q4. **EM Running Time. **Assume that we are trying to estimate parameters for a Bayesian network structured as a binary (directed) tree (**not** a polytree) with n*n* variables, where each node has at most one parent. We parameterize the network using table CPDs. Assume that each variable has d*d* values. We have a data set with M*M* instances, where some observations in each instances are missing. What is the tightest asymptotic bound you can give for the worst case running time of EM on this data set for K*K* iterations? In this and following questions, concentrate on the EM part of the learning algorithm only. You don’t need to consider the running time of additional steps if the full learning algorithm needs any.

- O(KMn^2d)
*O*(*KMn*2*d*) - O(KMnd^2)
*O*(*KMnd*2) - O(KMn^2d^2)
*O*(*KMn*2*d*2) - Can’t tell using only the information given

**Q5. EM Running Time. **Use the setting of the previous question, but now we assume that the network is a polytree, in which some variables have several parents. What is the cost of running EM on this data set for K*K* iterations?

- O(KMn d^2)
*O*(*KMnd*2) - O(KMn^2d^2)
*O*(*KMn*2*d*2) - Can’t tell using only the information given.
- O(KMn^2 d)
*O*(*KMn*2*d*)

Q6. ***Optimizing EM.**

Now, going back to the tree setting, where each node has at most one parent, assume that we are in a situation where at most 2 variables in each data instance are unobserved (not necessarily the same 2 in each instance). Can we implement EM more efficiently? If so, which of the following reduced complexities can you achieve?

- O(KMnd^2)
*O*(*KMnd*2) - O(KMd^2)
*O*(*KMd*2) - No computational savings can be achieved.
- O(K(M + n)d^2)
*O*(*K*(*M*+*n*)*d*2)

**Q7. *Optimizing EM.**

Still in the tree setting, now assume that we are in a situation where at most 2 variables in each data instance are unobserved, but it’s the same 2 each instance. Can we implement EM more efficiently? If so, which of the following reduced complexities can you achieve?

- O(KMd^2)
*O*(*KMd*2) - O(KMnd^2)
*O*(*KMnd*2) - O(K(M + n)d^2)
*O*(*K*(*M*+*n*)*d*2) - No computational savings can be achieved.

#### Quiz 5: Learning: Final Exam

Q1. Bayesian Clustering using Normal Distributions.

```
Suppose we are doing Bayesian clustering with KK classes, and multivariate normal distributions as our class-conditional distributions.
Let \mathbf{X}\in\mathbf{R}^nX∈R
```

n

represent a single data point, and C\in{1,2,\ldots,K}C∈{1,2,…,K} its unobserved class.

```
Which of the following statement(s) is/are always true in the general case?
X_i \perp C \ |\ \mathbf{X_{-i}} \quad\forall i\in\{1,2,\ldots,n\}X
```

i

⊥C ∣ X

−i

∀i∈{1,2,…,n}, i.e., for any given data point,

```
if we know all but one coordinate, then knowing the class that it comes from doesn't give us any information about the last coordinate.
X_i \perp X_j \ |\ C \quad\forall i,j\in\{1,2,\ldots,n\}, i\neq jX
```

i

⊥X

j

∣ C∀i,j∈{1,2,…,n},i

=j, i.e., for any given data point,

```
if we know which class the data point comes from, then knowing one coordinate does not give us any information about another coordinate.
P(\mathbf{X}|C=c) \sim N(\mu_c, I_n) \quad\forall c\in\{1,2,\ldots,K\}P(X∣C=c)∼N(μ
```

c

,I

n

)∀c∈{1,2,…,K}, for some class-specific parameter \mu_cμ

c

that represents the distribution of data coming from the class cc. (I_nI

n

is the n\times nn×n identity matrix.)

` P(\mathbf{X}|C=c) \sim N(\mu_c, \Sigma_c) \quad\forall c\in\{1,2,\ldots,K\}P(X∣C=c)∼N(μ `

c

,Σ

c

)∀c∈{1,2,…,K}, for some class-specific parameters \mu_cμ

c

and \Sigma_cΣ

c

that represent the distribution of data coming from the class cc.

` P(\mathbf{X}) \sim N(\mu, \Sigma)P(X)∼N(μ,Σ), for some parameters \muμ and \SigmaΣ that represent the overall distribution of the data.`

Q2. Hard Assignment EM. Continuing from the previous question, let us now fix each class-conditional distribution to have the identity matrix as its covariance matrix.

If we use hard-assignment EM to estimate the class-dependent mean vectors, which of the following can we say about the resulting algorithm?

```
It is an instance of k-means, but using a different distance metric rather than standard Euclidean distance.
It is an an algorithm that cannot be viewed as an instance of k-means.
It is equivalent to running standard k-means clustering with KK clusters.
```

Q3. *Hard Assignment EM. Now suppose that we fix each class-conditional distribution to have the same diagonal matrix DD as its covariance matrix, where DD is not the identity matrix.

```
If we use hard-assignment EM to estimate the class-dependent mean vectors, which of the following can we say about the resulting algorithm?
It is an instance of k-means, but using a different distance metric rather than standard Euclidean distance.
It is an an algorithm that cannot be viewed as an instance of k-means.
It is equivalent to running standard k-means clustering with KK clusters.
```

Q4. EM Running Time. Assume that we are trying to estimate parameters for a Bayesian network structured as a binary (directed) tree (not a polytree) with nn variables, where each node has at most one parent. We parameterize the network using table CPDs. Assume that each variable has dd values. We have a data set with MM instances, where some observations in each instances are missing. What is the tightest asymptotic bound you can give for the worst case running time of EM on this data set for KK iterations? In this and following questions, concentrate on the EM part of the learning algorithm only. You don’t need to consider the running time of additional steps if the full learning algorithm needs any.

O(KMn^2d)O(KMn

2

d)

O(KMnd^2)O(KMnd

2

)

O(KMn^2d^2)O(KMn

2

d

2

)

Can’t tell using only the information given

Q5. EM Running Time. Use the setting of the previous question, but now we assume that the network is a polytree, in which some variables have several parents. What is the cost of running EM on this data set for KK iterations?

O(KMn d^2)O(KMnd

2

)

O(KMn^2d^2)O(KMn

2

d

2

)

Can’t tell using only the information given.

O(KMn^2 d)O(KMn

2

d)

Q6. *Optimizing EM.

Now, going back to the tree setting, where each node has at most one parent, assume that we are in a situation where at most 2 variables in each data instance are unobserved (not necessarily the same 2 in each instance). Can we implement EM more efficiently? If so, which of the following reduced complexities can you achieve?

O(KMnd^2)O(KMnd

2

)

O(KMd^2)O(KMd

2

)

No computational savings can be achieved.

O(K(M + n)d^2)O(K(M+n)d

2

)

Q7. *Optimizing EM.

Still in the tree setting, now assume that we are in a situation where at most 2 variables in each data instance are unobserved, but it’s the same 2 each instance. Can we implement EM more efficiently? If so, which of the following reduced complexities can you achieve?

.

Question 1

*Multiplexer CPDs. What is the form of the independence that is implied by the multiplexer CPD and that we used in our derivation of the posterior over the parameters of the simple Bayesian Network x \rightarrow yx→y? (i.e. the factorization of P( \bf{\theta}*x , \bf{\theta}*{Y|x^1} , \bf{\theta}_{Y|x^0} \mid D )P(θ

x

,θ

Y∣x

1

,θ

Y∣x

0

∣D)). Recall that a CPD is defined as a multiplexer if it has the structure P(Y|A,Z_1,…,Z_k)={\bf I}{Y=Z_a} P(Y∣A,Z

1

,…,Z

k

)=I{Y=Z

a

} where the values of AA are the natural numbers 1 through kk. Also note that the answer is specific to multiplexer CPDs and is not implied by the graph structure alone.

\bf{\theta}*{Y|x^0}, \bf{\theta}*{Y|x^1} \perp Xθ

Y∣x

0

,θ

Y∣x

1

⊥X

\bf{\theta}*{Y|x^0} \perp \bf{\theta}*{Y|x^1} \mid X,Yθ

Y∣x

0

⊥θ

Y∣x

1

∣X,Y

\bf{\theta}*{Y|X} \perp X \mid \bf{\theta}*{X}θ

Y∣X

⊥X∣θ

X

\bf{\theta}*{Y|x^0} \perp \bf{\theta}*{Y|x^1} \mid Xθ

Y∣x

0

⊥θ

Y∣x

1

∣X

\bf{\theta}_{Y|X} \perp \bf{\theta}_Xθ

Y∣X

⊥θ

X

Q2. *Score Consistency. Assume that the dataset DD has mm examples, each drawn independently from the distribution P^*P

∗

, for which the graph G^*G

∗

is a perfect map.

` What do we mean when we say that the BIC score \mathrm{Score}_{BIC}(G : D)Score `

BIC

(G:D), measured with respect to DD, is consistent?

Hint: We are looking for a definition that will always be true, not just probably be true.

As m\rightarrow\inftym→∞, with probability 1 we will draw a dataset DD from P^*P

∗

such that the inequality

` \mathrm{Score}_{BIC}(G^* : D) \gt \mathrm{Score}_{BIC}(G : D)Score `

BIC

(G

∗

:D)>Score

BIC

(G:D)

` holds for all other graphs GG which are not I-equivalent to G^*G `

∗

.

As m\rightarrow\inftym→∞, with probability 1 we will draw a dataset DD from P^*P

∗

such that the inequality

` \mathrm{Score}_{BIC}(G^* : D) \gt \mathrm{Score}_{BIC}(G : D)Score `

BIC

(G

∗

:D)>Score

BIC

(G:D)

` holds for all other graphs G \neq G^*G `

=G

∗

.

As m\rightarrow\inftym→∞, no matter which examples were drawn from P^*P

∗

into the dataset DD, the inequality

` \mathrm{Score}_{BIC}(G^* : D) \gt \mathrm{Score}_{BIC}(G : D)Score `

BIC

(G

∗

:D)>Score

BIC

(G:D)

` will always be true for all graphs GG which are not I-equivalent to G^*G `

∗

.

As m\rightarrow\inftym→∞, no matter which examples were drawn from P^*P

∗

into the dataset DD, the inequality

` \mathrm{Score}_{BIC}(G^* : D) \gt \mathrm{Score}_{BIC}(G : D)Score `

BIC

(G

∗

:D)>Score

BIC

(G:D)

` will always be true for all other graphs G \neq G^*G `

=G

∗

.

Q3. EM and Convergence.

```
When checking for the convergence of the EM algorithm, we can choose to measure changes in either the log-likelihood function or in the parameters.
For a generic application, we typically prefer to check for convergence using the log-likelihood function.
However, this is not always the case, especially when the values of the parameters are important in and of themselves.
In which situations would we also be concerned about reaching convergence in terms of the parameters? Do not worry about the implementation details in the following models.
We are building a graphical model to represent a biological network, where each node corresponds to a gene. We want to learn the interactions between genes by finding the parameters that maximize the likelihood of a given training dataset of gene expression measurements. The interactions we find will then be further studied by biologists.
We have a graphical model in which each node is a superpixel, and we are using EM to learn the parameters that specify the relations between superpixels.
Our end-goal is to build an image segmentation pipeline that is highly accurate.
```

We are building a graphical model for medical diagnosis, where nodes can represent symptoms, diseases, predisposing factors, and so on. This system will not be deployed in the clinic; our only aim is to understand how various predisposing factors can interact with each other in increasing disease risk.

We are building a graphical model for medical diagnosis, where nodes can represent symptoms, diseases, predisposing factors, and so on. Our only aim is to maximize our chances of correctly predicting diseases that patients are suffering from.

```
We are trying to transcribe human speech by building a Hidden Markov Model (HMM) and learning its parameters with the EM algorithm. The end-goal is correctly transcribing raw audio input into words.
We are trying to better understand high-energy physics by using a graphical model to analyze time-series data from particle accelerators. The hope is to elucidate the types of interactions between different particle types.
We have a graphical model in which each node represents an object part, and we are using EM to learn the parameters that specify the relations between object parts.
Our end-goal is to build an image classification system that can accurately recognize the image as one of several known objects.
```

Q4. Parameter Estimation with Missing Data.

The process of learning Bayesian Network parameters with missing data (partially observed instances) is more difficult than learning with complete data for which of the following reasons? You may select one or more options.

We lose local decomposition, whereby each CPD can be estimated independently.

While there is still always a single optimal value for the parameters, it can only be found using an iterative method.

Because there can be multiple optimal values, we must always run our learning algorithm multiple times from different initializations to make sure we find ALL of them.

We require more training data, because we must throw out all incomplete instances.

Q5. Optimality of Hill Climbing. Jack and Jill come up to you one day with a worried look on their face. “All this while we’ve been climbing hills, trying to improve upon our graph structure,” they say. “We’ve been considering edge deletions, reversals, and additions at each step. Today, we found that no single edge deletion, reversal, or addition could give us a higher-scoring structure. Are we guaranteed that our current graph is the best graph structure?”

` What should you tell them? You may assume that their dataset is sufficiently large, and that your answer should hold for a general graph.`

Yes – greedy hill-climbing provably finds the true graph structure, provided our dataset is large enough.

Yes, but only if we use random restarts and tabu search.

No – greedy hill-climbing will find only local maxima of the scoring function with respect to our available moves. While it might find the true graph structure on occasion, we cannot guarantee this.

Yes, but only if we extend our range of available moves to allow for pairs of edges to be changed simultaneously.

No – greedy hill-climbing will only find the true graph structure if we restrict the number of parents for each node to at most 2.

No – greedy hill-climbing can never find the true graph structure, only local maxima of the scoring function with respect to our available moves.

Q6. *Latent Variable Cardinality.

```
Assume that we are doing Bayesian clustering, and want to select the cardinality of the hidden class variable.
Which of these methods can we use?
Assume that the structure of the graph has already been fixed.
You may choose more than one option.
Training several models, each with a different cardinality for that hidden variable.
For each model, we choose the (table CPD) parameters that maximize the likelihood on the training set.
We then pick the model that performs the best on some external evaluation task, using a held-out validation set.
For example, say we are using Bayesian clustering to classify customers visiting an online store, with the aim of giving class-specific product recommendations.
We could run each model in an alpha-beta testing framework (where different customers may see the result of different models), and measure the percentage of customers that end up purchasing what each model recommends.
If we have relevant prior knowledge, we can simply use this to set the cardinality by hand.
Training several models, each with a different cardinality for that hidden variable.
For each model, we choose the (table CPD) parameters that maximize the likelihood on the training set.
We then pick the model with the highest training set likelihood.
Training several models, each with a different cardinality for that hidden variable.
For each model, we choose the (table CPD) parameters that maximize the likelihood on the training set.
We then pick the model with the highest test set likelihood.
Training several models, each with a different cardinality for that hidden variable.
For each model, we choose the (table CPD) parameters that maximize the likelihood on the training set.
We then pick the model with the highest likelihood on a held-out validation set.
```

Q7. EM Stopping Criterion.

` When learning the parameters \theta\in\mathbf{R}^nθ∈R `

n

of a graphical model using the EM algorithm, an important design decision is choosing when to stop training.

` Let \ell_{\mathrm{Train}}(\theta)ℓ `

Train

(θ), \ell_{\mathrm{Valid}}(\theta)ℓ

Valid

(θ), and \ell_{\mathrm{Test}}(\theta)ℓ

Test

(θ)

```
be the log-likelihood of the parameters \thetaθ on the training set, a held-out validation set, and the test set, respectively.
Let \theta^tθ
```

t

be the parameters at the tt-th iteration of the EM algorithm.

` We can denote the change in the dataset log-likelihoods at each iteration with \Delta\ell_{\mathrm{Train}}^t = \ell_{\mathrm{Train}}(\theta^t) - \ell_{\mathrm{Train}}(\theta^{t-1})Δℓ `

Train

t

=ℓ

Train

(θ

t

)−ℓ

Train

(θ

t−1

)

` and the corresponding analogues for the validation set and the test set. Likewise, let \Delta\theta^t = \theta^t - \theta^{t-1}Δθ `

t

=θ

t

−θ

t−1

be the vector of changes in the parameters at time step tt.

```
Which of the following would be reasonable conditions for stopping training at iteration tt? You may choose more than one option.
\|\Delta\theta^t\|_2^2∥Δθ
```

t

∥

2

2

becomes small, i.e., it falls below a certain tolerance \epsilon>0ϵ>0.

` Note: The \ell_2ℓ `

2

norm, also known as the Euclidean norm, is defined for any vector x\in\mathbf{R}^nx∈R

n

as

` \|x\|_2^2=\sum_{i=1}^nx_i^2∥x∥ `

2

2

=∑

i=1

n

x

i

2

.

` \Delta\ell_{\mathrm{Train}}Δℓ `

Train

becomes negative.

` \Delta\ell_{\mathrm{Train}}^tΔℓ `

Train

t

becomes small, i.e., it falls below a certain tolerance \epsilon>0ϵ>0.

` \Delta\ell_{\mathrm{Test}}Δℓ `

Test

becomes small, i.e., it falls below a certain tolerance \epsilon>0ϵ>0

Q8.

Question 8

EM Parameter Selection.

` Once again, we are using EM to estimate parameters of a graphical model. We use nn random starting points \{\theta^0_i\}_{i=1,2,\ldots,n}{θ `

i

0

}

i=1,2,…,n

, and run

` EM to convergence from each of them to obtain a set of candidate parameters \{\theta_i\}_{i=1,2,\ldots,n}{θ `

i

}

i=1,2,…,n

. We wish to select one of these candidate parameters for

` use. As in the previous question, let \ell_{\mathrm{Train}}(\theta)ℓ `

Train

(θ), \ell_{\mathrm{Valid}}(\theta)ℓ

Valid

(θ), and \ell_{\mathrm{Test}}(\theta)ℓ

Test

(θ)

```
be the log-likelihood of the parameters \thetaθ on the training set, a held-out validation set, and the test set, respectively.
Which of the following methods of selecting final parameters \thetaθ would be a reasonable choice? You may pick more than one option.
Pick \theta = \mathrm{argmax}_{i=1,2,\ldots,n} \ell_{\mathrm{Train}}(\theta_i)θ=argmax
```

i=1,2,…,n

ℓ

Train

(θ

i

).

` Pick \theta = \mathrm{argmax}_{i=1,2,\ldots,n} \ell_{\mathrm{Test}}(\theta_i)θ=argmax `

i=1,2,…,n

ℓ

Test

(θ

i

).

` Any one; the \theta_iθ `

i

are all equivalent, since all of them are local maxima of the log-likelihood function.

` Pick \theta = \mathrm{argmax}_{i=1,2,\ldots,n} \ell_{\mathrm{Valid}}(\theta_i)θ=argmax `

i=1,2,…,n

ℓ

Valid

(θ

i

).

Q9. Greedy Hill-Climbing.

Your friend is performing greedy hill-climbing structure search over a network with three variables using three possible operations and the BIC score with dataset D:

Add an edge

Delete an edge

Reverse an edge

She tells you that after examining D, she took a single step and got the following graph:

She also tells you that for the next step she has determined that there is a unique optimal greedy operation oo to take. Which of the following steps could oo be?

Hint: The fact that it is unique eliminates some possibilities for oo.

Add edge B\rightarrow CB→C

Add edge A\rightarrow CA→C

Delete edge A\rightarrow BA→B

Add edge C\rightarrow AC→A

Add edge C\rightarrow BC→B

Reverse edge A\rightarrow BA→B

Q10. Graph Structure Search.

Consider performing graph structure search using a decomposable score. Suppose our current candidate is graph GG below.

We want to compute the changes of scores associated with applying three different operations:

Delete the edge A\rightarrow DA→D

Reverse the edge C\rightarrow EC→E

Add the edge F\rightarrow EF→E

Let \delta(G : o_1), \delta(G : o_2), \delta(G : o_3)δ(G:o

1

),δ(G:o

2

),δ(G:o

3

) denote the score changes associated with each of these three operations, respectively. Which of the following equations is/are true for all datasets D?

δ(G:o3)=FamScore(E,{C,F}:D) − FamScore(E,{C}:D)

δ(G:o3)=FamScore(C,{A,E}:D)

δ(G:o1)=FamScore(D,{B,C}:D)

δ(G:o1)=FamScore(D,{B,C}:D) − FamScore(D,{A,B,C}:D)

δ(G:o2)=FamScore(C,{A,E}:D) − FamScore(C,{A}:D) − FamScore(E,{C}:D)

δ(G:o2)=FamScore(C,{A,E}:D) +FamScore(E,∅:D)− FamScore(C,{A}:D) − FamScore(E,{C}:D)

Q11. Structure Learning with Incomplete Data.

After implementing the pose clustering algorithm in PA9, your friend tries to pick the number of pose clusters KK for her data by running EM and evaluating the log-likelihood of her data for different values of KK. What happens to her log-likelihood as she varies KK?

The log-likelihood (almost) always increases as KK increases.

The log-likelihood (almost) always decreases as KK increases.

The log-likelihood remains the same regardless of KK.

Impossible to say – depends on the data and on what KK is.

Q12. Calculating Likelihood Differences. While doing a hill-climbing search, you run into the following two graphs, and need to choose between them using the likelihood score.

There are five nodes in both G1 and G2, node A through node E. In G1, there are directed edges from A to C, from B to C, from C to D, and from D to E. In G2, there are directed edges from A to both B and C, from B to C, from C to D, and from E to D.

What is the difference in likelihood scores, \mathrm{score}_L(G_1 : \textbf{D}) – \mathrm{score}_L(G_2 : \textbf{D})score

L

(G

1

:D)−score

L

(G

2

:D), given a dataset \textbf{D}D of size MM?

` Give your answer in terms of the entropy HH and mutual information II. The subscripts below denote empirical values according to \textbf{D}D: for example, H_\textbf{D}(X)H `

D

(X) is the empirical entropy of the variable XX in the dataset \textbf{D}D.

M \times [I_\textbf{D}(C;A,B) + I_\textbf{D}(D;C) + I_\textbf{D}(E;D) – I_\textbf{D}(B;A) – I_\textbf{D}(D;C,E)]M×[I

D

(C;A,B)+I

D

(D;C)+I

D

(E;D)−I

D

(B;A)−I

D

(D;C,E)]

M \times [I_\textbf{D}(D;C) + I_\textbf{D}(E;D) – I_\textbf{D}(A;B) – I_\textbf{D}(D;C,E) – H_\textbf{D}(A,B,C,D,E)]M×[I

D

(D;C)+I

D

(E;D)−I

D

(A;B)−I

D

(D;C,E)−H

D

(A,B,C,D,E)]

M \times [I_\textbf{D}(D;C) + I_\textbf{D}(E;D) – I_\textbf{D}(B;A) – I_\textbf{D}(D;C,E)]M×[I

D

(D;C)+I

D

(E;D)−I

D

(B;A)−I

D

(D;C,E)]

M \times [I_\textbf{D}(A;B) – H_\textbf{D}(A,B)]M×[I

D

(A;B)−H

D

(A,B)]

M \times I_\textbf{D}(A;B)M×I

D

(A;B)

O(KMd^2)O(KMd

2

)

O(KMnd^2)O(KMnd

2

)

O(K(M + n)d^2)O(K(M+n)d

2

)

No computational savings can be achieved.

##### Probabilistic Graphical Models 3: Learning Course Review

In our experience, we suggest you enroll in Probabilistic Graphical Models 3: Learning courses and gain some new skills from Professionals completely free and we assure you will be worth it.

Probabilistic Graphical Models 3: Learning Course for free, if you are stuck anywhere between a quiz or a graded assessment quiz, just visit Networking Funda to get Probabilistic Graphical Models 3: Learning Coursera Quiz Answers.

Probabilistic Graphical Models 1: Representation Coursera Quiz Answers

Probabilistic Graphical Models 2: Inference Coursera Quiz Answers

Probabilistic Graphical Models 3: Learning Coursera Quiz Answers