## All Weeks Fitting Statistical Models to Data with Python Quiz Answers

### Fitting Statistical Models to Data with Python Week 01 Quiz Answers

#### Week 1: Fitting Statistical Models to Data with Python

Q1. The figure below presents the fits of four different regression models to the same set of data, where there is a predictor variable (x) and a dependent variable (y) of interest. Which of the four plots reflects a model that fits the data well?

- Plot (a)
- Plot (b)
- Plot (c)
- Plot (d)

Q2. A researcher at a large pharmaceutical company wishes to study the effect of a new experimental drug on the amount of pain suffered by people who frequently experience migraines. The researcher conducts a randomized experiment, where half of the migraine sufferers who volunteer receive the experimental drug, and half receive a placebo pill. The researcher then records a pain score two hours later, and wants to formally model the pain score as a function of experimental group and other confounding variables (age, BMI, gender, and race/ethnicity).

What is the dependent variable in this model, and what are the independent variables?

- Dependent = Experimental Group, Independent = Pain Score
- Dependent = Experimental Group, Independent = Pain Score, Age, BMI, Gender, Race/Ethnicity
- Dependent = Pain Score, Independent = Experimental Group
- Dependent = Pain Score, Independent = Experimental Group, Age, BMI, Gender, Race/Ethnicity
- None of the above

Q3. Suppose that the migraine researcher was also interested in pain score trends immediately following administration of the drug. The researcher continues to collect pain score measurements at four, six, and eight hours after administration of either the experimental drug or the placebo pill. The researcher wants to then model the pain score as a function of the previously mentioned variables in addition to time since administration.

What needs to change about the model that the researcher is now fitting?

- The correlation of the repeated measures needs to be taken into account, and time since administration needs to be added to the model as an independent variable.
- The correlation of the repeated measures needs to be taken into account, but nothing else needs to be changed.
- Nothing; the same model mentioned in Problem 2 will be appropriate for studying the group differences.
- Time needs to be added to the model as an independent variable; nothing else needs to be changed.
- Time needs to be added to the model as a dependent variable; nothing else needs to be changed.

Q4. After performing the analysis, the researcher writes a press release describing the results of the experiment, and claims that the new experimental drug will reduce pain by 25% based on the results of the modeling.

What else does the researcher need to say about this finding?

- A. Nothing; this is an interesting effect that should lead migraine sufferers to use the drug.
- B. The researcher should provide the predicted pain scores in both groups based on the model, in addition to measures of uncertainty in the predicted scores, for reference.
- C. The researcher should provide a confidence interval for this predicted effect based on the fitted model.
- Answers B and C

#### Week 2: Fitting Statistical Models to Data with Python

#### Quiz 1: Linear Regression Quiz

Q1. For which of the following scatterplot(s) would fitting a linear regression model to the data be appropriate? (Select all that apply.)

- a
- b
- c
- d
- e

Q2. Which of the following scatterplot(s) would have a correlation coefficient that is close to 0? (Select all that apply.)

- a
- b
- c
- d
- e

Q3. Which of the following scatterplots would have the highest absolute correlation (i.e., shows the strongest linear relationship)?

- a
- b
- c
- d

Q4. What distribution do the true errors need to follow in order to perform various inference procedures in linear regression?

- True errors must be N(0,1)
- True errors must be N(0, σ²)
- True errors must be Uniformly distributed
- True errors do not need any specific distribution

Q5. Which of the following are assumptions needed for conducting a hypothesis test on the population slope in a linear regression analysis? (Select all that apply.)

- True errors must be normally distributed.
- True errors have constant variance.
- The population relationship between the dependent variable and the explanatory variable is in fact linear.

Q6. A study was conducted to model the linear relationship between Las Vegas nightly hotel cost (dollars) and hotel rating (on a 100 point scale). Nightly hotel cost will be used to predict hotel rating. A random sample of 30 Las Vegas hotels was collected and an estimated slope (b1) was found to be 0.21. Which of the following is a correct interpretation of the estimated slope (b1)?

- When a hotel’s nightly cost is $0 dollars the hotel’s rating is expected to be 0.21 points.
- When a hotel rating is 0 points the hotel’s nightly cost is expected to be $0.21 dollars.
- The hotel rating is estimated to increase by 0.21 points for every additional dollar spent on nightly hotel cost, on average.
- The nightly hotel cost is estimated to increase by $0.21 dollars for every additional hotel rating point, on average.

Q7. Background for Questions 7 – 13

In 1905, R.J. Gladstone conducted a study of the relationship between brain weight and size of the head. Brain weight (grams) and head size (cubic cm) measurements were performed for 237 adults. Two categorical variables for Sex (0=male, 1=female) and Age (0=young, 20-46 years old, 1=old, 46+ years old) are available. The linear regression results for regressing brain weight on the head size are summarized below.

One subject in the study has a head size of 3500 cm³ and a brain weight of 1430.86 grams. What is the value of the observed error (residual) for this subject?

- -183.3 grams
- 183.3 grams
- -4195.752 cm³
- 4195.752 cm³

Q8. The study relating brain weight (grams) and head size (cubic cm) yielded an R-squared of 0.6393. Which of the following is a correct interpretation of the R-squared?

- 0.6393% of the variation in brain weight can be accounted for by the linear relationship with head size.
- 63.93% of the variation in brain weight can be accounted for by the linear relationship with head size.
- We would expect brain weight to increase by 0.6393 grams for every additional cubic cm in head size, on average.
- We would expect head size to increase by 0.6393 cubic cm for every additional gram in brain weight, on average.
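To illustrate how an R-squared value is computed and read as a share of explained variation, here is a minimal sketch. The data are made-up toy numbers (not the Gladstone measurements), and the helper names `fit_line` and `r_squared` are hypothetical.

```python
# Hypothetical toy data; these are NOT the Gladstone measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_line(x, y):
    """Return (intercept, slope) of the ordinary least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def r_squared(x, y):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    b0, b1 = fit_line(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


r2 = r_squared(x, y)
# R-squared is a proportion, so multiply by 100 to state it as a percentage.
print(f"{100 * r2:.2f}% of the variation in y is accounted for by the line")
```

The value is a proportion between 0 and 1, which is why an R-squared of 0.6393 is stated as a percentage of variation rather than as a slope.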

Q9. What is the appropriate p-value for testing if there is a significant positive linear relationship between brain weight and head size?

- 4.61e-11
- <2e-16
- 2.305e-11
- <1e-16

Q10. A 95% confidence interval for the mean brain weight for all adults in 1905 with a head size of 3400 cm³ was calculated to be (1210.14 grams, 1232.33 grams). How would the width of the 95% prediction interval for the brain weight for an individual adult in 1905 with a head size of 3400 cm³ compare to this one?

- Wider
- Narrower
- Stays the same

Q11. A 95% confidence interval for the mean brain weight for all adults in 1905 with a head size of 3400 cm³ was calculated to be (1210.14 grams, 1232.33 grams). How would the width of the 95% confidence interval for the mean brain weight for all adults in 1905 with a head size of 3600 cm³ compare to this one?

- Wider
- Narrower
- Stays the same
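The widths of both kinds of interval in simple linear regression are proportional to a standard error, so comparing the standard errors is enough to compare widths. The sketch below uses the textbook formulas for the standard error of the estimated mean response and of an individual prediction. All numeric inputs (`s`, `n`, `x_bar`, `sxx`, and the `x0` values) are hypothetical and are not taken from the brain weight study.

```python
import math


def se_mean_response(x0, s, n, x_bar, sxx):
    """Standard error of the estimated mean of y at x0."""
    return s * math.sqrt(1 / n + (x0 - x_bar) ** 2 / sxx)


def se_prediction(x0, s, n, x_bar, sxx):
    """Standard error for predicting one new y at x0 (adds individual noise)."""
    return s * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / sxx)


# Hypothetical summary values, chosen only for illustration:
# s = residual standard deviation, n = sample size,
# x_bar = mean of x, sxx = sum of squared deviations of x from x_bar.
s, n, x_bar, sxx = 70.0, 200, 50.0, 8000.0

se_near = se_mean_response(52.0, s, n, x_bar, sxx)  # x0 close to x_bar
se_far = se_mean_response(70.0, s, n, x_bar, sxx)   # x0 far from x_bar
se_pred = se_prediction(52.0, s, n, x_bar, sxx)     # same x0 as se_near

print(f"mean response, x0 near x_bar: {se_near:.2f}")
print(f"mean response, x0 far from x_bar: {se_far:.2f}")
print(f"individual prediction, x0 near x_bar: {se_pred:.2f}")
```

Two patterns follow from the formulas: the prediction standard error always exceeds the mean-response standard error at the same `x0`, and the mean-response standard error grows as `x0` moves away from the sample mean of x. Whether one head size gives a wider interval than another therefore depends on which is farther from the mean head size in the data.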

Q12. The head size of an 8-year-old child is found to be 1800 cm³. What caution(s) should be noted if asked to predict this child’s brain weight? (Select all that apply.)

- Correlation does not imply causation for brain weight.
- Extrapolation – A head size of 1800 cm³ is outside the range of our data.
- Extrapolation – The model was created using only data for adults, not children.
- We do not know if the child is male or female.
- No cautions need to be noted; it is fine to plug the 1800 cm³ into our estimated regression line to make the prediction.

Q13. A new model was fit, this time adding in the two categorical variables Sex (0=male, 1=female) and Age (0=young, 20-46 years old, 1=old, 46+ years old). The model summary is shown below.

Which of the following is an appropriate interpretation of the estimated coefficient for age of -23.97 in the above table?

- The average brain weight for younger subjects is estimated to be 23.97 grams less than the average brain weight for older subjects.
- Keeping head size and sex constant, the average brain weight for younger subjects is estimated to be 23.97 grams less than the average brain weight for older subjects.
- The average brain weight for older subjects is estimated to be 23.97 grams less than the average brain weight for younger adults.
- Keeping head size and sex constant, the average brain weight for older subjects is estimated to be 23.97 grams less than the average brain weight for younger adults.

#### Quiz 2: Logistic Regression Quiz

Q1. Imagine that you are collecting variables while participants attempted to shoot a soccer ball. Which of the following collected variables could be predicted using a logistic regression model?

- Sex (male vs. female)
- Scoring a soccer goal on a given shot
- Height
- Whether a shot on goal traveled more than 20 feet
- Age (years)

Q2. Which of the following is a possible form/shape for a logistic regression model, where the y-axis represents the probability of success?

- Graph:

- Graph:

- Graph:

- Graph:

Q3. Two probabilities have been transformed using the logit function. The two values after transformation are -2 and 0.25. Which of the two values corresponds to a higher original probability?

- -2
- 0.25
- They are the same
- Can’t tell
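Because the logit function is strictly increasing, transformed values can be compared by mapping them back to probabilities. The following minimal sketch defines the logit and its inverse (the function names are illustrative) and applies the inverse to the two values from the question.

```python
import math


def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1 - p))


def inverse_logit(z):
    """Map a log-odds value z back to a probability."""
    return 1 / (1 + math.exp(-z))


p_a = inverse_logit(-2)
p_b = inverse_logit(0.25)
print(f"logit value -2   -> probability {p_a:.4f}")
print(f"logit value 0.25 -> probability {p_b:.4f}")
```

A logit of 0 corresponds to a probability of 0.5, so negative logits map to probabilities below one half and positive logits to probabilities above it.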

Q4. NHANES records whether an individual has smoked 100 cigarettes or more. The next few questions will focus on fitting models to predict whether someone has smoked 100+ cigarettes.

First, a model is fit using body mass index (BMI) as the variable to predict smoking status. The output is here:

What does the coefficient of 0.0037 mean?

- For each increase by one in BMI, the probability of smoking 100 cigarettes increases by about 0.0037, on average.
- For each increase by one in BMI, the odds of smoking 100 cigarettes increases by about 0.0037, on average.
- For each increase of one in BMI, the log odds of smoking 100 cigarettes increases by about 0.0037, on average.
- For each increase of one in BMI, the odds of smoking 100 cigarettes increases multiplicatively by about 0.0037, on average.
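A logistic regression coefficient lives on the log-odds scale, and exponentiating it gives the multiplicative change in the odds. The sketch below shows this relationship for the coefficient 0.0037 quoted in the question. The starting log odds of -0.5 is a hypothetical value used only to show that the ratio does not depend on the starting point.

```python
import math

coef = 0.0037  # log-odds coefficient quoted in the question

# Additive change on the log-odds scale = multiplicative change in the odds.
odds_ratio = math.exp(coef)

# Hypothetical starting log odds, for illustration only.
base_log_odds = -0.5
odds_before = math.exp(base_log_odds)
odds_after = math.exp(base_log_odds + coef)

print(f"odds ratio per one-unit increase: {odds_ratio:.5f}")
print(f"ratio of odds after/before:       {odds_after / odds_before:.5f}")
```

Adding the coefficient to the log odds and multiplying the odds by `exp(coef)` are the same operation, which is what separates the "log odds" interpretation from the "odds" and "probability" interpretations in the options.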

Q5. Next, a model is fit adding Age as an additional covariate to BMI as the variables predicting smoking status. The output is here:

What does the coefficient of 0.0169 mean in context?

- For each increase of one in BMI, the odds of smoking 100 cigarettes increases by about 0.0169, on average.
- For each increase of one in Age, the odds of smoking 100 cigarettes increases by about 0.0169, on average.
- For each increase of one in Age, the log odds of smoking 100 cigarettes increases by about 0.0169, on average.
- For each increase of one in Age, the log odds of smoking 100 cigarettes increases by about 0.0169 while holding BMI constant, on average.

Q6. Based on the logistic regression with both Age and BMI as covariates, are the coefficients statistically significant at a two-sided 10% significance level?

- Both coefficients are significant
- Neither coefficient is significant
- Only the coefficient for BMI is significant
- Only the coefficient for Age is significant

Q7. The 95% confidence interval for the coefficient for Age is given above as (0.014, 0.020). If instead we wanted a 90% confidence interval, how would the width of the interval change?

- It would be wider
- It would be narrower
- It would stay the same
- Can’t tell
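How the confidence level affects interval width can be checked with a normal-based interval of the form estimate ± z × standard error. This is a sketch under assumptions: the estimate 0.0169 is the Age coefficient quoted earlier, but the standard error of 0.0015 is a hypothetical value, and the use of a normal multiplier is assumed rather than taken from the model output.

```python
from statistics import NormalDist


def z_multiplier(confidence):
    """Two-sided standard normal multiplier for a given confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)


def normal_interval(estimate, std_err, confidence):
    """Normal-based interval: estimate +/- z * standard error."""
    z = z_multiplier(confidence)
    return estimate - z * std_err, estimate + z * std_err


estimate = 0.0169  # Age coefficient quoted in the quiz
std_err = 0.0015   # hypothetical standard error, for illustration only

low95, high95 = normal_interval(estimate, std_err, 0.95)
low90, high90 = normal_interval(estimate, std_err, 0.90)
width95 = high95 - low95
width90 = high90 - low90

print(f"95% interval width: {width95:.5f}")
print(f"90% interval width: {width90:.5f}")
```

The estimate and standard error are the same in both intervals, so only the multiplier changes with the confidence level.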

Q8. We’d like to predict the log odds of smoking 100+ cigarettes for a given individual using the logistic regression model with the two variables: BMI and Age. For an individual with a BMI of 22 who is 45 years old, what would the predicted log odds be?

- -0.417
- 0.8265
- 0.327
- -0.7367
- Can’t tell

Q9. The sample of adults surveyed in NHANES contains adults age 20-80 with BMIs of 14.5-64.6. For the individual with a BMI of 22 who is 45 years old, do you trust the predicted log odds calculated above as being reasonable?

- No, this is extrapolation
- No, this is interpolation
- Yes, this is extrapolation
- Yes, this is interpolation

Q10. Fill in the blanks. With 95% confidence, I estimate that the increase in log odds of smoking 100+ cigarettes for each increase by one in BMI, while holding Age constant, is between ____ and ____, on average.

- -1.2435 and 0.149
- 0.014 and 0.020
- -1.535 and -0.952
- -0.005 and 0.011
- Can’t tell

#### Quiz 3: Python Assessment

Q1. What is the value of the coefficient for predictor RM?

Your answer should be written in this format: #.####

Answers :

Q2. Are the predictors for this model statistically significant, yes or no? (Hint: What are their p-values?)

- Yes
- No

Q3. What most likely happened to our R-Squared value when we added the third predictor LSTAT to our initial model?

- Decreased
- Increased
- Stayed the same

Q4. What type of model should we use when our target outcome, or dependent variable, is continuous?

- Logistic regression
- Linear Regression
- Confidence intervals

Q5. Which of our predictors has the largest coefficient?

- Intercept
- DMDEDUC2x[T.HS]
- DMDEDUC2x[T.SomeCollege]
- DMDEDUC2x[T.x9_11]

Q6. Which values for DMDEDUC2x and RIAGENDRx are represented in our intercept, or what is our reference level?

- Male and Some College
- Female and Age
- Female and College
- Male and HS

Q7. What model should we use when our target outcome, or dependent variable is binary, or only has two outputs, 0 and 1?

- Hypothesis Tests
- Linear Regression
- Logistic Regression
- None of the above

#### Week 3: Fitting Statistical Models to Data with Python

#### Quiz 1: Name That Model

Q1. You are interested in predicting the probability that an NCAA men’s basketball team wins their first round game in the annual NCAA men’s basketball tournament, where potential predictors of the binary indicator of winning the first game include a variety of team-level variables measured for each of the 64 teams competing in the first round. There is only one observation per team, and the dependent variable is a binary indicator (1, 0) of whether the team won their first round game.

What type of model would you fit?

- Linear regression model
- Logistic regression model
- Multilevel linear regression model with random team effects
- Multilevel logistic regression model with random team effects
- Marginal linear model, fitted using GEE
- Marginal logistic model, fitted using GEE

Q2. You are interested in estimating the relationship between gender (the IV) and a binary indicator of ever having experienced a major depressive disorder (the DV), where both variables were collected from a large national sample that involved area cluster sampling. You also wish to estimate between-cluster variance in the probability of having experienced a major depressive disorder, and explain this variance with the fixed effects of cluster-level covariates.

What type of model would you fit?

- Linear regression model
- Logistic regression model
- Multilevel linear regression model with random cluster effects
- Multilevel logistic regression model with random cluster effects
- Marginal linear model, fitted using GEE
- Marginal logistic model, fitted using GEE

Q3. You want to fit a model that enables the prediction of a continuous measure of birth weight for all of the newborns at a single large hospital. The data arise from a simple random sample of 500 births, and the predictors include information collected from both the mother and the father.

What type of model would you fit?

- Linear regression model
- Logistic regression model
- Multilevel linear regression model with random hospital effects
- Multilevel logistic regression model with random hospital effects
- Marginal linear model, fitted using GEE
- Marginal logistic model, fitted using GEE

Q4. After publishing a research paper describing the results from the model fitted for Question #3, you are contacted by 20 other large hospitals, and they wish to contribute to the estimation of a model for predicting birth weight. The team agrees that estimation of the variance in expected birth weight between hospitals and explanation of that variance with hospital-level covariates is a key objective.

What type of model would you fit?

- Linear regression model
- Logistic regression model
- Multilevel linear regression model with random hospital effects
- Multilevel logistic regression model with random hospital effects
- Marginal linear model, fitted using GEE
- Marginal logistic model, fitted using GEE

Q5. You wish to fit a model to a “forced choice” binary dependent variable measuring political party preference (if you had to pick a political party, which would you select: Democratic or Republican?), and examine the relationship of parental political attitudes with the preference of the respondents. Based on the study design, there are multiple respondents measured from each of several neighborhoods, and respondents within the same neighborhood may have shared political views, but you aren’t interested in explicitly estimating between-neighborhood variance. You only wish to estimate the overall relationship of interest in the larger population, and account for possible within-neighborhood correlation in the DV.

What type of model would you fit?

- Linear regression model
- Logistic regression model
- Multilevel linear regression model with random neighborhood effects
- Multilevel logistic regression model with random neighborhood effects
- Marginal linear model, fitted using GEE
- Marginal logistic model, fitted using GEE

#### Quiz 2: Python Assessment

Q1. What is clustered data?

- Clustered data is when there are observations that are the exact same in a dataset.
- Data is considered clustered when our dataset features have a low variance.
- Clustered data is when one group in a dataset is over-represented.
- Data is considered clustered when observations are correlated within groups, sometimes related to study designs.

Q2. Which of the following features has the highest correlation between two observations in the same cluster?

- BPXSY1
- SDMVSTRA
- RIDAGEYR
- BMXBMI
- smq

Q3. What is true about multiple linear regression and marginal linear models when dependence is present in data?

- The standard error in multiple linear regression and marginal linear models tends to be the same.
- A multiple linear regression model is theoretically justified when there is dependence in the data.
- Marginal linear model estimates and standard errors are meaningful due to the dependence of data, but only when dependence is strictly between observations within the same group.
- The standard error in multiple linear regression tends to be higher than in marginal linear models.

Q4. Multilevel models are expressed in terms of ____.

- Mixed effects
- Correlation coefficients
- Random effects
- Fixed effects

Q5. Which of the following is NOT true regarding reasons why we fit marginal models?

- Quicker computational times; faster estimation
- Robust standard errors that reflect the specified correlation structure
- Easier accommodation of non-normal outcomes
- All the above are true

#### Week 4: Python Assessment

Q1. For this problem, we are going to be using the above code to recreate some of the mathematics behind the Introduction to Bayesian Statistics lecture. The math has already been worked out for you, so you will only have to manipulate code, but if you are curious about the math behind the update for the mean of a distribution, you can look here: https://en.wikipedia.org/wiki/Conjugate_prior. The math for this problem is located under the continuous distributions section, where our model parameter is mu and we have a known variance sigma^2.

Before we get started, we need to get some values.

First, what is the mean of the prior that we are using?

Answers :

Q2. What is the standard deviation of the prior?

Answers :

Q3. Let’s say that we observe a person with an IQ of 125, as we did in the lecture. Which way should the posterior distribution, after our Bayesian update, shift?

- Left
- Right
- Stay the Same

Q4. Now, let’s say that I observe two more people and I see that they also have IQs of 110. So we have three people with IQs of 110. How does the variance of my estimate change from my prior? We can do this in the code by setting `new_data = [110, 110, 110]`.

- The variance decreases
- The variance increases
- The variance stays the same

Q5. What is the posterior mean after observing three people with an IQ of 110 in a row?

Answers :

Q6. If I now observe five people, where the first three have an IQ of 110 and the last two have an IQ of 125, which of the following are true?

- The posterior mean is the average of 110, 110, 110, 125, and 125
- The posterior mean is equal to 110
- The posterior mean is equal to 100
- The posterior mean is equal to 125
- The posterior mean is equal to 115.717
- The posterior standard deviation is the same as the prior standard deviation
- The posterior standard deviation is greater than the prior standard deviation
- The posterior standard deviation is less than the prior standard deviation
- The posterior standard deviation is equal to 10
- The posterior standard deviation is equal to 1.768
- The posterior standard deviation is equal to 3
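The update these questions refer to is the conjugate normal update for a mean with known data variance. The course code is not reproduced on this page, so the following is only a minimal sketch of that standard formula. The function name `normal_update` and the values for the prior mean, prior standard deviation, and data standard deviation are hypothetical placeholders, not the values used in the course notebook.

```python
def normal_update(prior_mean, prior_sd, data, data_sd):
    """Posterior (mean, sd) for a normal mean with known data sd.

    Precisions (1 / variance) add: the posterior precision is the prior
    precision plus n times the precision of a single observation.
    """
    prior_precision = 1 / prior_sd ** 2
    data_precision = len(data) / data_sd ** 2
    post_var = 1 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision + sum(data) / data_sd ** 2)
    return post_mean, post_var ** 0.5


# Hypothetical placeholder values; NOT the ones from the course notebook.
prior_mean, prior_sd, data_sd = 100.0, 15.0, 15.0

new_data = [110, 110, 110]
post_mean, post_sd = normal_update(prior_mean, prior_sd, new_data, data_sd)

print(f"posterior mean: {post_mean:.3f}")
print(f"posterior sd:   {post_sd:.3f}")
```

Under this formula the posterior mean is a precision-weighted average of the prior mean and the data, so it moves toward the observations, and each added observation increases the posterior precision, so the posterior standard deviation falls below the prior standard deviation. The numeric answers in the quiz depend on the prior and variance set in the course code.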
