## All Weeks Bayesian Statistics: Techniques and Models Quiz Answers

**Lesson 1**

Q1. Which objective of statistical modeling is best illustrated by the following example?

You fit a linear regression of monthly stock values for your company. You use the estimates and recent stock history to calculate a forecast of the stock’s value for the next three months.

- Quantify uncertainty
- Inference
- Hypothesis testing
- **Prediction**

Q2. Which objective of statistical modeling is best illustrated by the following example?

A biologist proposes a treatment to decrease genetic variation in plant size. She conducts an experiment and asks you (the statistician) to analyze the data to conclude whether a 10% decrease in variation has occurred.

- Quantify uncertainty
- Inference
- **Hypothesis testing**
- Prediction

Q3. Which objective of statistical modeling is best illustrated by the following example?

The same biologist from the previous question asks you how many experiments would be necessary to have a 95% chance at detecting a 10% decrease in plant variation.

- **Quantify uncertainty**
- Inference
- Hypothesis testing
- Prediction

Q4. Which of the following scenarios best illustrates the statistical modeling objective of inference?

- **A social scientist collects data and detects positive correlation between sleep deprivation and traffic accidents.**
- A natural language processing algorithm analyzes the first four words of a sentence and provides words to complete the sentence.
- A venture capitalist uses data about several companies to build a model and makes recommendations about which company to invest in next based on growth forecasts.
- A model inputs academic performance of 1000 students and predicts which student will be valedictorian after another year of school.

Q5. Which step in the statistical modeling cycle was **not** followed in the following scenario?

Susan gathers data recording heights of children and fits a linear regression predicting height from age. To her surprise, the model does not predict well the heights for ages 14-17 (because the growth rate changes with age), both for children included in the original data as well as other children outside the model training data.

- Fit the model
- Plan and properly collect relevant data
- Use the model
- **Explore the data**

Q6. Which of the following is a possible consequence of failure to plan and properly collect relevant data?

- You may not be able to visually explore the data.
- Your selected model will not be able to fit the data.
- You will not produce enough data to make conclusions with a sufficient degree of confidence.
- **Your analysis may produce incomplete or misleading results.**

Q7. For Questions 7 and 8, consider the following:

Xie operates a bakery and wants to use a statistical model to determine how many loaves of bread he should bake each day in preparation for weekday lunch hours. He decides to fit a Poisson model to count the demand for bread. He selects two weeks which have typical business, and for those two weeks, counts how many loaves are sold during the lunch hour each day. He fits the model, which estimates that the daily demand averages 22.3 loaves.

Over the next month, Xie bakes 23 loaves each day, but is disappointed to find that on most days he has excess bread and on a few days (usually Mondays), he runs out of loaves early.

Which of the following steps of the modeling process did Xie skip?

- Understand the problem
- Postulate a model
- Fit the model
- **Check the model and iterate**
- Use the model

Q8. What might you recommend Xie do next to fix this omission and improve his predictive performance?

- Abandon his statistical modeling initiative.
- Collect three more weeks of data from his bakery and other bakeries throughout the city. Re-fit the same model to the extra data and follow the results based on more data.
- **Plot daily demand and model predictions against the day of the week to check for patterns that may account for the extra variability. Fit and check a new model which accounts for this.**
- Trust the current model and continue to produce 23 loaves daily, since in the long-run average, his error is zero.

**Lesson 2**

Q1. Which of the following is one major difference between the frequentist and Bayesian approach to modeling data?

- The frequentist paradigm treats the data as fixed while the Bayesian paradigm considers data to be random.
- Frequentist models require a guess of parameter values to initialize models while Bayesian models require initial distributions for the parameters.
- Frequentist models are deterministic (don’t use probability) while Bayesian models are stochastic (based on probability).
- **Frequentists treat the unknown parameters as fixed (constant) while Bayesians treat unknown parameters as random variables.**

Q2. Suppose we have a statistical model with unknown parameter θ, and we assume a normal prior θ ∼ N(μ₀, σ₀²), where μ₀ is the prior mean and σ₀² is the prior variance. What does increasing σ₀² say about our prior beliefs about θ?

- Increasing the variance of the prior **widens** the range of what we think θ might be, indicating **greater** confidence in our prior mean guess μ₀.
- Increasing the variance of the prior **narrows** the range of what we think θ might be, indicating **greater** confidence in our prior mean guess μ₀.
- Increasing the variance of the prior **narrows** the range of what we think θ might be, indicating **less** confidence in our prior mean guess μ₀.
- **Increasing the variance of the prior widens the range of what we think θ might be, indicating less confidence in our prior mean guess μ₀.**

Q3. In the lesson, we presented Bayes' theorem for the case where parameters are continuous. What is the correct expression for the posterior distribution of θ if it is discrete (takes on only specific values)?

- p(θ) = ∫ p(θ ∣ y) ⋅ p(y) dy
- p(θ ∣ y) = p(y ∣ θ) ⋅ p(θ) / [∫ p(y ∣ θ) ⋅ p(θ) dθ]
- **p(θ_j ∣ y) = p(y ∣ θ_j) ⋅ p(θ_j) / [Σ_j p(y ∣ θ_j) ⋅ p(θ_j)]**
- p(θ) = Σ_j p(θ ∣ y_j) ⋅ p(y_j)

Q4. For Questions 4 and 5, refer to the following scenario.

In the quiz for Lesson 1, we described Xie's model for predicting demand for bread at his bakery. During the lunch hour on a given day, the number of orders (the response variable) follows a Poisson distribution. All days have the same mean (expected number of orders). Xie is a Bayesian, so he selects a conjugate gamma prior for the mean with shape 3 and rate 1/15. He collects data on Monday through Friday for two weeks.

Which of the following hierarchical models represents this scenario?

- y_i ∣ μ ∼ iid N(μ, 1.0²) for i = 1, …, 10; μ ∼ N(3, 15²)
- y_i ∣ λ_i ∼ ind Pois(λ_i) for i = 1, …, 10; λ_i ∣ α ∼ Gamma(α, 1/15); α ∼ Gamma(3.0, 1.0)
- y_i ∣ λ ∼ iid Pois(λ) for i = 1, …, 10; λ ∣ μ ∼ Gamma(μ, 1/15); μ ∼ N(3, 1.0²)
- **y_i ∣ λ ∼ iid Pois(λ) for i = 1, …, 10; λ ∼ Gamma(3, 1/15)**

Q5. Which of the following graphical depictions represents the model from Xie’s scenario?

- a) *(figure)*
- b) *(figure)*
- c) *(figure)*
- d) *(figure)*

Q6. Graphical representations of models generally do not identify the distributions of the variables (nodes), but they do reveal the structure of dependence among the variables.

Identify which of the following hierarchical models is depicted in the graphical representation below. *(figure)*

- x_{i,j} ∣ α_j, β ∼ ind Gamma(α_j, β), i = 1, …, n, j = 1, …, m; β ∼ Exp(b₀); α_j ∣ ϕ ∼ iid Exp(ϕ), j = 1, …, m; ϕ ∼ Exp(r₀)
- x_{i,j} ∣ α, β ∼ iid Gamma(α, β), i = 1, …, n, j = 1, …, m; β ∼ Exp(b₀); α ∼ Exp(a₀); ϕ ∼ Exp(r₀)
- x_{i,j} ∣ α_j, β ∼ ind Gamma(α_j, β), i = 1, …, n, j = 1, …, m; β ∼ Exp(b₀); α_j ∼ Exp(a₀), j = 1, …, m; ϕ ∼ Exp(r₀)
- x_{i,j} ∣ α_i, β_j ∼ ind Gamma(α_i, β_j), i = 1, …, n, j = 1, …, m; β_j ∣ ϕ ∼ iid Exp(ϕ), j = 1, …, m; α_i ∣ ϕ ∼ iid Exp(ϕ), i = 1, …, n; ϕ ∼ Exp(r₀)

Q7. Consider the following model for a binary outcome y:

y_i ∣ θ_i ∼ ind Bern(θ_i), i = 1, …, 6
θ_i ∣ α ∼ iid Beta(α, b₀), i = 1, …, 6
α ∼ Exp(r₀)

where θ_i is the probability of success on trial i. What is the expression for the joint distribution of all variables, written as p(y₁, …, y₆, θ₁, …, θ₆, α) and denoted by p(⋯)? You may ignore the indicator functions specifying the valid ranges of the variables (although the expressions are technically incorrect without them).

**Hint:**

The PMF for a Bernoulli random variable is f(y ∣ θ) = θ^y (1 − θ)^{1−y} for y = 0 or y = 1 and 0 < θ < 1.

The PDF for a Beta random variable is f(θ ∣ α, β) = [Γ(α + β) / (Γ(α) Γ(β))] θ^{α−1} (1 − θ)^{β−1}, where Γ() is the gamma function, 0 < θ < 1, and α, β > 0.

The PDF for an exponential random variable is f(α ∣ λ) = λ exp(−λα) for λ, α > 0.

- p(⋯) = ∏_{i=1}^{6} [ θ_i^{y_i} (1 − θ_i)^{1−y_i} (Γ(α + b₀) / (Γ(α) Γ(b₀))) θ_i^{α−1} (1 − θ_i)^{b₀−1} r₀ exp(−r₀α) ]
- p(⋯) = ∏_{i=1}^{6} [ θ_i^{y_i} (1 − θ_i)^{1−y_i} ] ⋅ (Γ(α + b₀) / (Γ(α) Γ(b₀))) θ^{α−1} (1 − θ)^{b₀−1} ⋅ r₀ exp(−r₀α)
- p(⋯) = ∏_{i=1}^{6} [ θ_i^{y_i} (1 − θ_i)^{1−y_i} (Γ(α + b₀) / (Γ(α) Γ(b₀))) θ_i^{α−1} (1 − θ_i)^{b₀−1} ]
- **p(⋯) = ∏_{i=1}^{6} [ θ_i^{y_i} (1 − θ_i)^{1−y_i} (Γ(α + b₀) / (Γ(α) Γ(b₀))) θ_i^{α−1} (1 − θ_i)^{b₀−1} ] ⋅ r₀ exp(−r₀α)**

Q8. In a Bayesian model, let y denote all the data and θ denote all the parameters. Which of the following statements about the relationship between the joint distribution of all variables p(y, θ) = p(⋯) and the posterior distribution p(θ ∣ y) is true?

- **They are proportional to each other, so that p(y, θ) = c ⋅ p(θ ∣ y), where c is a constant that doesn't involve θ at all.**
- The joint distribution p(y, θ) is equal to the posterior distribution times a function f(θ) which contains the modification (update) of the prior.
- Neither is sufficient alone; they are both necessary to make inferences about θ.
- They are actually equal to each other, so that p(y, θ) = p(θ ∣ y).

**Lesson 3**

Q1. If a random variable X follows a standard uniform distribution (X ∼ Unif(0, 1)), then the PDF of X is p(x) = 1 for 0 ≤ x ≤ 1.

We can use Monte Carlo simulation of X to approximate the following integral: ∫₀¹ x² dx = ∫₀¹ x² ⋅ 1 dx = ∫₀¹ x² ⋅ p(x) dx = E(X²).

If we simulate 1000 independent samples from the standard uniform distribution and call them x_i* for i = 1, …, 1000, which of the following calculations will approximate the integral above?

- ( (1/1000) Σ_{i=1}^{1000} x_i* )²
- (1/1000) Σ_{i=1}^{1000} (x_i* − x̄*)², where x̄* is the calculated average of the x_i* samples.
- **(1/1000) Σ_{i=1}^{1000} (x_i*)²**
- (1/1000) Σ_{i=1}^{1000} x_i*
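The integral equals E(X²), which Monte Carlo approximates by the average of the squared draws. A quick check in Python (assuming NumPy; in R, `mean(runif(1000)^2)` is equivalent):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, size=1000)  # 1000 draws from Unif(0, 1)

# Monte Carlo estimate of E(X^2) = integral of x^2 from 0 to 1 = 1/3
estimate = np.mean(x**2)
print(estimate)  # should land near 1/3
```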

Q2. Suppose we simulate 1000 samples from a Unif(0, π) distribution (which has PDF p(x) = 1/π for 0 ≤ x ≤ π) and call the samples x_i* for i = 1, …, 1000.

If we use these samples to calculate (1/1000) Σ_{i=1}^{1000} sin(x_i*), what integral are we approximating?

- ∫_{−∞}^{∞} sin(x) dx
- ∫₀¹ π sin(x) dx
- ∫₀¹ sin(x) dx
- **∫₀^π (1/π) sin(x) dx**
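Averaging sin over Unif(0, π) draws estimates E(sin X) = ∫₀^π (1/π) sin(x) dx = 2/π ≈ 0.637, which identifies the integral. A quick check in Python (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, np.pi, size=100_000)

# the mean of sin(X) for X ~ Unif(0, pi) estimates (1/pi) * integral of
# sin(x) over (0, pi), which is 2/pi
est = np.mean(np.sin(x))
print(est)  # near 2/pi ~ 0.637
```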

Q3. Suppose random variables X and Y have a joint probability distribution p(X, Y). Suppose we simulate 1000 samples from this distribution, which gives us 1000 (x_i*, y_i*) pairs.

If we count how many of these pairs satisfy the condition x_i* < y_i* and divide the result by 1000, what quantity are we approximating via Monte Carlo simulation?

- **Pr[X < Y]**
- E(XY)
- Pr[X < E(Y)]
- Pr[E(X) < E(Y)]

Q4. If we simulate 100 samples from a Gamma(2, 1) distribution, what is the approximate distribution of the sample average x̄* = (1/100) Σ_{i=1}^{100} x_i*?

**Hint**: the mean and variance of a Gamma(a, b) random variable are a/b and a/b², respectively.

- Gamma(2, 0.01)
- N(2, 2)
- **N(2, 0.02)**
- Gamma(2, 1)
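This follows from the central limit theorem: the sample average is approximately normal with mean a/b = 2 and variance (a/b²)/100 = 2/100 = 0.02. A simulation sketch (Python, assuming NumPy) that repeats the experiment many times:

```python
import numpy as np

rng = np.random.default_rng(1)

# 10,000 replications of: average 100 draws from Gamma(shape=2, rate=1)
# (NumPy parameterizes by scale = 1/rate)
means = rng.gamma(shape=2.0, scale=1.0, size=(10_000, 100)).mean(axis=1)

print(means.mean())  # near 2, the Gamma(2, 1) mean
print(means.var())   # near 0.02 = (a / b^2) / 100
```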

Q5. For Questions 5 and 6, consider the following scenario:

Laura keeps record of her loan applications and performs a Bayesian analysis of her success rate θ. Her analysis yields a Beta(5, 3) posterior distribution for θ.

The posterior mean for θ is equal to 5/(5+3) = 0.625. However, Laura likes to think in terms of the odds of succeeding, defined as θ/(1 − θ), the probability of success divided by the probability of failure.

Use R to simulate a large number of samples (more than 10,000) from the posterior distribution for θ and use these samples to approximate the posterior mean for Laura's odds of success, E(θ/(1 − θ)).

Report your answer to at least one decimal place.
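In R this is `theta <- rbeta(1e5, 5, 3); mean(theta / (1 - theta))`, transforming the same draws. An equivalent sketch in Python (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(7)
theta = rng.beta(5, 3, size=100_000)  # posterior draws for the success rate

odds = theta / (1 - theta)
print(odds.mean())  # Monte Carlo estimate of E(theta / (1 - theta))
```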

Q6. Laura also wants to know the posterior probability that her odds of success on loan applications are greater than 1.0 (in other words, better than 50:50 odds).

Use your Monte Carlo sample from the distribution of θ to approximate the probability that θ/(1 − θ) is greater than 1.0.

Report your answer to at least two decimal places.
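Since θ/(1 − θ) > 1 exactly when θ > 0.5, this is just the fraction of posterior draws above 0.5. A sketch (Python, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(7)
theta = rng.beta(5, 3, size=100_000)  # posterior draws for the success rate

# P(theta / (1 - theta) > 1) = P(theta > 0.5)
p = np.mean(theta > 0.5)
print(p)
```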

Q7. Use a (large) Monte Carlo sample to approximate the 0.3 quantile of the standard normal distribution (N(0, 1)), the number such that the probability of being less than it is 0.3.

Use the `quantile` function in R. You can of course check your answer using the `qnorm` function.

Report your answer to at least two decimal places.
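A sketch of the Monte Carlo approach (Python, assuming NumPy; in R the analogue is `quantile(rnorm(1e5), 0.3)`, checked against `qnorm(0.3)`):

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal(100_000)  # draws from N(0, 1)

# empirical 0.3 quantile of the simulated draws
q30 = np.quantile(z, 0.3)
print(q30)
```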

Q8. To measure how accurate our Monte Carlo approximations are, we can use the central limit theorem. If the number of samples drawn m is large, then the Monte Carlo sample mean θ̄* used to estimate E(θ) approximately follows a normal distribution with mean E(θ) and variance Var(θ)/m. If we substitute the sample variance for Var(θ), we can get a rough estimate of our Monte Carlo standard error (or standard deviation).

Suppose we have 100 samples from our posterior distribution for θ, called θ_i*, and that the sample variance of these draws is 5.2. A rough estimate of our Monte Carlo standard error would then be √(5.2/100) ≈ 0.228, so our estimate θ̄* is probably within about 0.456 (two standard errors) of the true E(θ).

What does the standard error of our Monte Carlo estimate become if we increase our sample size to 5,000? Assume that the sample variance of the draws is still 5.2.

Report your answer to at least three decimal places.
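The calculation mirrors the worked example, just with m = 5000. A minimal check in Python:

```python
import math

sample_var = 5.2  # sample variance of the posterior draws

se_100 = math.sqrt(sample_var / 100)    # the 0.228 from the example
se_5000 = math.sqrt(sample_var / 5000)  # same variance, 50x the samples
print(round(se_100, 3), round(se_5000, 3))
```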

**Week 01: Markov chains**

Q1. All but one of the following scenarios describes a valid Markov chain. Which one is not a Markov chain?

- **Suppose you have a special savings account which accrues interest according to the following rules: the total amount deposited in a given month will earn 10(1/2)^(r−1)% interest in the r-th month after the deposit. For example, if the deposits in January total $100, then you will earn $10 interest in January, $5 interest at the end of February, $2.50 in March, etc. In addition to the interest from January, if you deposit $80 in February, you will earn an additional $8 at the end of February, $4 at the end of March, and so forth. The total amount of money deposited in a given month follows a gamma distribution. Let X_t be the total dollars in your account, including all deposits and interest up to the end of month t.**
- While driving through a city with square blocks, you roll a six-sided die each time you come to an intersection. If the die shows 1, 2, 3, or 4, then you turn left. If the die shows 5 or 6, you turn right. Each time you reach an intersection, you report your coordinates X_t.
- Three friends take turns playing chess with the following rules: the player who sits out the current round plays against the winner in the next round. Player A, who has 0.7 probability of winning any game regardless of opponent, keeps track of whether he plays in game t with an indicator variable X_t.
- At any given hour, the number of customers entering a grocery store follows a Poisson distribution. The number of customers in the store who leave during that hour also follows a Poisson distribution (only up to as many people as are in the store). A clerk reports the total number of customers in the store X_t at the end of hour t.

Q2. Which of the following gives the transition probability matrix for the chess example in the previous question? The first row and column correspond to X=0*X*=0 (player A not playing) while the second row and column correspond to X=1*X*=1 (player A playing).

(Matrices written row by row, rows separated by semicolons.)

- **[ 0, 1 ; 0.3, 0.7 ]**
- [ 0, 0.3 ; 1, 0.7 ]
- [ 0.7, 0.3 ; 0, 1 ]
- [ 0.3, 0.7 ; 0, 1 ]

Q3. Continuing the chess example, suppose that the first game is between Players B and C. What is the probability that Player A will play in Game 4? Round your answer to two decimal places.
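With state 0 meaning A sits out and state 1 meaning A plays, the chess rules give transition matrix P with rows (0, 1) and (0.3, 0.7); starting from state 0 in Game 1, the answer is the (0 → 1) entry of the three-step matrix P³. A sketch (Python, assuming NumPy):

```python
import numpy as np

# rows/cols: state 0 = A not playing, state 1 = A playing
P = np.array([[0.0, 1.0],
              [0.3, 0.7]])

P3 = np.linalg.matrix_power(P, 3)  # three transitions: Game 1 -> Game 4
print(P3[0, 1])  # P(A plays Game 4 | A sits out Game 1) = 0.79
```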

Q4. Which of the following is the stationary distribution for X*X* in the chess example?

- ( .750, .250 )
- **( .231, .769 )**
- ( 0.0, 1.0 )
- ( .250, .750 )
- ( .769, .231 )

Q5. If the players draw from the stationary distribution in Question 4 to decide whether Player A participates in Game 1, what is the probability that Player A will participate in Game 4? Round your answer to two decimal places.
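The stationary distribution solves πP = π, i.e. it is the left eigenvector of P for eigenvalue 1, normalized to sum to 1; and if the chain starts from it, the marginal distribution never changes, so the Game-4 probability is just the stationary probability that A plays. A sketch (Python, assuming NumPy):

```python
import numpy as np

P = np.array([[0.0, 1.0],
              [0.3, 0.7]])

# left eigenvector of P with eigenvalue 1, rescaled to sum to 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()
print(pi)  # approximately (0.231, 0.769)

# starting from pi, the distribution is unchanged after any number of games
print(pi @ np.linalg.matrix_power(P, 3))
```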

**Week 02: MCMC**

Q1. For Questions 1 through 3, consider the following model for data that take on values between 0 and 1:

x_i ∣ α, β ∼ iid Beta(α, β), i = 1, …, n,
α ∼ Gamma(a, b),
β ∼ Gamma(r, s),

where α and β are independent a priori. Which of the following gives the full conditional density for α, up to proportionality?

- p(α ∣ β, x) ∝ [Γ(α + β)^n / Γ(α)^n] [∏_{i=1}^{n} x_i]^{α−1} α^{a−1} e^{−bα} I(0 < α < 1)
- **p(α ∣ β, x) ∝ [Γ(α + β)^n / Γ(α)^n] [∏_{i=1}^{n} x_i]^{α−1} α^{a−1} e^{−bα} I(α > 0)**
- p(α ∣ β, x) ∝ [∏_{i=1}^{n} x_i]^{α−1} α^{a−1} e^{−bα} I(α > 0)
- p(α ∣ β, x) ∝ [Γ(α + β)^n / (Γ(α)^n Γ(β)^n)] [∏_{i=1}^{n} x_i]^{α−1} [∏_{i=1}^{n} (1 − x_i)]^{β−1} α^{a−1} e^{−bα} β^{r−1} e^{−sβ} I(0 < α < 1) I(0 < β < 1)

Q2. Suppose we want posterior samples for \alpha*α* from the model in Question 1. What is our best option?

- The full conditional for α is not a proper distribution (it doesn't integrate to 1), so we cannot sample from it.
- The full conditional for α is proportional to a common distribution which we can sample directly, so we can draw from that.
- The joint posterior for α and β is a common probability distribution which we can sample directly. Thus we can draw Monte Carlo samples for both parameters and keep the samples for α.
- **The full conditional for α is not proportional to any common probability distribution, and the marginal posterior for β is not any easier, so we will have to resort to a Metropolis-Hastings sampler.**

Q3. If we elect to use a Metropolis-Hastings algorithm to draw posterior samples for α, the Metropolis-Hastings candidate acceptance ratio is computed using the full conditional for α as

[ (Γ(α* + β)^n / Γ(α*)^n) [∏_{i=1}^{n} x_i]^{α*} (α*)^{a−1} e^{−bα*} q(α ∣ α*) I(α* > 0) ] / [ (Γ(α + β)^n / Γ(α)^n) [∏_{i=1}^{n} x_i]^{α} α^{a−1} e^{−bα} q(α* ∣ α) I(α > 0) ]

where α* is a candidate value drawn from the proposal distribution q(α* ∣ α). Suppose that instead of the full conditional for α, we use the full joint posterior distribution of α and β and simply plug in the current (or known) value of β. What is the Metropolis-Hastings ratio in this case?

- [ (α*)^{a−1} e^{−bα*} q(α ∣ α*) I(α* > 0) ] / [ α^{a−1} e^{−bα} q(α* ∣ α) I(α > 0) ]
- [ Γ(α* + β)^n [∏_{i=1}^{n} x_i]^{α*−1} [∏_{i=1}^{n} (1 − x_i)]^{β−1} (α*)^{a−1} e^{−bα*} β^{r−1} e^{−sβ} q(α ∣ α*) I(α* > 0) I(β > 0) ] / [ Γ(α*)^n Γ(β)^n q(α* ∣ α) ]
- **[ (Γ(α* + β)^n / Γ(α*)^n) [∏_{i=1}^{n} x_i]^{α*} (α*)^{a−1} e^{−bα*} q(α ∣ α*) I(α* > 0) ] / [ (Γ(α + β)^n / Γ(α)^n) [∏_{i=1}^{n} x_i]^{α} α^{a−1} e^{−bα} q(α* ∣ α) I(α > 0) ]**
- [ (Γ(α* + β)^n / Γ(α*)^n) [∏_{i=1}^{n} x_i]^{α*} q(α ∣ α*) I(α* > 0) ] / [ (Γ(α + β)^n / Γ(α)^n) [∏_{i=1}^{n} x_i]^{α} q(α* ∣ α) I(α > 0) ]

Q4. For Questions 4 and 5, re-run the Metropolis-Hastings algorithm from Lesson 4 to draw posterior samples from the model for mean company personnel growth for six new companies: (-0.2, -1.5, -5.3, 0.3, -0.8, -2.2). Use the same prior as in the lesson.

Below are four possible values for the standard deviation of the normal proposal distribution in the algorithm. Which one yields the best sampling results?

- 0.5
- 1.5
- 3.0
- 4.0

Q5. Report the posterior mean point estimate for μ, the mean growth, using these six data points. Round your answer to two decimal places.
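The quiz asks you to re-run the lesson's R code. A rough Python sketch of a random-walk Metropolis sampler for this problem, assuming the Lesson 4 setup of y_i ∣ μ ∼ iid N(μ, 1) with a t(0, 1, 1) (Cauchy) prior on μ:

```python
import numpy as np

def log_post(mu, y):
    # log posterior up to a constant, assuming (as in the lesson) a
    # N(mu, 1) likelihood and a t(0, 1, 1) (Cauchy) prior on mu
    return -0.5 * np.sum((y - mu) ** 2) - np.log(1.0 + mu ** 2)

def metropolis(y, n_iter=50_000, prop_sd=1.5, seed=11):
    rng = np.random.default_rng(seed)
    mu, draws, n_accept = 0.0, np.empty(n_iter), 0
    for i in range(n_iter):
        cand = rng.normal(mu, prop_sd)  # random-walk normal proposal
        if np.log(rng.uniform()) < log_post(cand, y) - log_post(mu, y):
            mu, n_accept = cand, n_accept + 1  # accept the candidate
        draws[i] = mu
    return draws, n_accept / n_iter

y = np.array([-0.2, -1.5, -5.3, 0.3, -0.8, -2.2])
draws, rate = metropolis(y)
post_mean = draws[5_000:].mean()  # discard burn-in, then average
print(post_mean, rate)
```

Comparing acceptance rates across proposal standard deviations (as in Q4) is a matter of re-running with different `prop_sd` values.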

**Week 03: Common models and multiple factor ANOVA**

Q1. For Questions 1 and 2, consider the Anscombe data from the `car` package in R, which we analyzed in the quizzes for Lesson 7.

In the original model, we used normal priors for the three regression coefficients. Here we will consider using Laplace priors centered on 0. The parameterization used in JAGS for the Laplace (double exponential) distribution has an inverse scale parameter τ. This is related to the variance v of the prior in the following way: v = 2/τ². Suppose we want the Laplace prior to have variance v = 2. What value of τ should we use in the JAGS code?

Q2. When using an informative variable selection prior like the Laplace, we typically center and scale the data:

```
library("car")        # provides the Anscombe data set
data("Anscombe")
head(Anscombe)
?Anscombe             # help page describing the variables

# center and scale all variables, including the response
Xc = scale(Anscombe, center=TRUE, scale=TRUE)
str(Xc)
data_jags = as.list(data.frame(Xc))  # list format for passing to JAGS
```

Because we subtracted the mean from all (continuous) variables including the response, this is a rare case where we do not need an intercept. Fit the model in JAGS using the Laplace prior with variance 2 for each of the three coefficients, and an inverse gamma prior for the observation variance with effective sample size 1 and prior guess 1.

How do the inferences for the coefficients compare to the original model fit in the quiz for Lesson 7 (besides that their scale has changed due to scaling the data)?

- The inferences are essentially unchanged. The first two coefficients (for income and percentage youth) are significantly positive and the percent urban coefficient is still negative.
- **The inferences are similar, with one exception. The first two coefficients (for income and percentage youth) are significantly positive and the percent urban coefficient's posterior looks like the Laplace prior, with a spike near 0. This indicates that the percent urban "effect" is very weak.**
- The inferences are vastly different. The marginal posteriors for all three coefficients look like their Laplace priors, with a spike near 0. This indicates that the "effect" associated with each covariate is very weak.
- Inexplicably, the signs of all coefficients have changed (from positive to negative and from negative to positive).

Q3. Consider an ANOVA model for subjects’ responses to three experimental factor variables related to a proposed health supplement: dose, frequency, and physical activity. Dose has two levels: 100mg and 200mg. Frequency has three levels: “daily,” “twice-weekly,” and “weekly.” Physical activity has two levels: “low” and “high.” If these are the only covariates available and we assume that responses are iid normally distributed, what is the maximum number of parameters we could potentially use to uniquely describe the mean response?

Q4. If we have both categorical *and* continuous covariates, then it is common to use the linear model parameterization instead of the cell means model. If it is unclear how to set it up, you can use the `model.matrix` function in R as we have in the lessons.

Suppose that in addition to the experimental factors in the previous question, we have two continuous covariates: weight in kg and resting heart rate in beats per minute. If we use 100mg dose, daily frequency, and low physical activity as the baseline group, which of the following gives the linear model parameterization for an additive model with no interactions?

- E(y_i) = μ_{g_i} + β₁ weight_i + β₂ heart_i for g_i ∈ {1, 2, …, 7}
- E(y_i) = μ_{g_i} + β₁ weight_i + β₂ heart_i for g_i ∈ {1, 2, …, 12}
- E(y_i) = β₀ + β₁ I(dose_i = 100) + β₂ I(freq_i = daily) + β₃ I(phys_i = low) + β₄ weight_i + β₅ heart_i
- **E(y_i) = β₀ + β₁ I(dose_i = 200) + β₂ I(freq_i = twice-weekly) + β₃ I(freq_i = weekly) + β₄ I(phys_i = high) + β₅ weight_i + β₆ heart_i**

Q5. The reading in this honors section describes an analysis of the warp breaks data. Of the models fit, we concluded that the full cell means model was most appropriate. However, we did not assess whether constant observation variance across all groups was appropriate. Re-fit the model with a separate variance for each group. For each variance, use an Inverse-Gamma(1/2, 1/2) prior, corresponding to prior sample size 1 and prior guess 1 for each variance.

Report the DIC value for this model, rounded to the nearest whole number.

**Week 04: Predictive distributions and mixture models**

Q1. Consider the Poisson process model we fit in the quiz for Lesson 10, which estimates calling rates of a retailer's customers. The data are attached below.

Re-fit the model and use your posterior samples to simulate predictions of the number of calls by a new 29 year old customer from Group 2 whose account is active for 30 days. What is the probability that this new customer calls at least three times during this period? Round your answer to two decimal places.
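The exact coefficients come from your Lesson 10 fit, but the posterior predictive step is generic: for each posterior draw of the log-rate coefficients, compute this customer's rate, simulate a Poisson count for 30 days of exposure, and average the indicator of three or more calls. A hypothetical sketch (Python, assuming NumPy; the draws `b0`, `b_age`, and `b_grp2` are placeholders standing in for your actual posterior samples):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

# Hypothetical posterior draws -- replace with samples from your fitted model
b0 = rng.normal(-4.0, 0.1, n)       # intercept (log scale)
b_age = rng.normal(0.02, 0.005, n)  # age coefficient
b_grp2 = rng.normal(0.5, 0.1, n)    # Group 2 effect

days, age = 30, 29
lam = days * np.exp(b0 + b_age * age + b_grp2)  # expected calls in 30 days
y_pred = rng.poisson(lam)                       # posterior predictive counts

prob = np.mean(y_pred >= 3)  # P(new customer calls at least three times)
print(prob)
```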

Q2. Suppose we fit a single component normal distribution to the data whose histogram is shown below.

If we use a noninformative prior for μ and σ² and plot the fit distribution evaluated at the posterior means (in blue), what would the fit look like? Is this model appropriate for these data?

- *(figure)* The single normal fit ignores the smaller component, fitting the cluster of points with most data. Consequently, the model places almost no probability in the region of the smaller component.
- *(figure)* The single normal fit accommodates the bi-modality in the data, but fails to capture the imbalance in the two components. It is not appropriate.
- *(figure)* A single normal distribution does not allow bi-modality. Consequently, the fit places a lot of probability in a region with no data. It is not appropriate.
- *(figure)* The single normal fit nicely captures the features of the data. It is appropriate.

Q3. Which of the following histograms shows data that might require a mixture model to fit?

- A) *(figure)*
- B) *(figure)*
- C) *(figure)*
- D) *(figure)*

Q4. The Dirichlet distribution with parameters α₁ = α₂ = … = α_K = 1 is uniform over its support, the values for which the random vector contains a valid set of probabilities. If θ contains five probabilities corresponding to five categories and has a Dirichlet(1, 1, 1, 1, 1) prior, what is the effective sample size of this prior?

**Hint**: If θ has a Dirichlet(α₁, α₂, …, α_K) prior, and the counts of multinomial data in each category are x₁, x₂, …, x_K, then the posterior of θ is Dirichlet(α₁ + x₁, α₂ + x₂, …, α_K + x_K). The data sample size is clearly Σ_{k=1}^{K} x_k.

Q5. Recall that in the Bayesian formulation of a mixture model, it is often convenient to introduce latent variables z_i which indicate "population" membership of y_i (the "population" may or may not have meaning in the context of the data). One possible hierarchical formulation is given by:

y_i ∣ z_i, θ ∼ ind f_{z_i}(y ∣ θ), i = 1, …, n
Pr(z_i = j ∣ w) = w_j, j = 1, …, J
w ∼ Dirichlet(a)
θ ∼ p(θ)

where f_j(y ∣ θ) is a probability density for y for mixture component j, and w = (w₁, w₂, …, w_J) is a vector of prior probabilities of membership.

What is the full conditional distribution for z_i?

- Pr(z_i = j ∣ ⋯) = f_j(y_i ∣ θ) / [Σ_{ℓ=1}^{J} f_ℓ(y_i ∣ θ)], j = 1, …, J
- **Pr(z_i = j ∣ ⋯) = w_j f_j(y_i ∣ θ) / [Σ_{ℓ=1}^{J} w_ℓ f_ℓ(y_i ∣ θ)], j = 1, …, J**
- Pr(z_i = j ∣ ⋯) = w_j, j = 1, …, J
- Pr(z_i = j ∣ ⋯) = w_j^{I(z_i = j)} (1 − w_j)^{1−I(z_i = j)} / [Σ_{ℓ=1}^{J} w_ℓ^{I(z_i = ℓ)} (1 − w_ℓ)^{1−I(z_i = ℓ)}], j = 1, …, J
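A Gibbs update for the latent labels draws each z_i from these normalized probabilities, which weight each component density by its prior membership probability. A sketch (Python, assuming NumPy) for a two-component normal mixture:

```python
import numpy as np

def norm_pdf(y, mu, sd):
    return np.exp(-0.5 * ((y - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def sample_z(y, w, mus, sds, rng):
    # full conditional: Pr(z_i = j | ...) proportional to w_j * f_j(y_i | theta)
    probs = np.array([w[j] * norm_pdf(y, mus[j], sds[j])
                      for j in range(len(w))]).T
    probs /= probs.sum(axis=1, keepdims=True)  # normalize over components
    return np.array([rng.choice(len(w), p=p) for p in probs])

rng = np.random.default_rng(2)
y = np.array([-2.1, -1.9, 0.1, 3.0, 3.2])
z = sample_z(y, w=[0.5, 0.5], mus=[-2.0, 3.0], sds=[1.0, 1.0], rng=rng)
print(z)  # a component label (0 or 1) for each observation
```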
