Get Mastering Data Analysis in Excel Coursera Quiz Answers
Mastering Data Analysis in Excel Week 01 Quiz Answers
Excel Essentials Practice
Q1. Background Information: You are provided with an Excel spreadsheet that gives one year’s daily continuously compounded returns for two chemical company stocks, Dow and Dupont, and the S&P 500, a weighted index of 500 large-company stocks (Week 1 Practice Quiz Spreadsheet, XLSX file download).
Excel Problem Type: Summing a column
Problem Information: Daily continuously compounded returns can be summed to obtain returns over longer time intervals. Sum the daily returns to calculate annual continuously compounded returns for 2010. Give each result in percent, rounded to two digits to the right of the decimal place – for example, 11.76%.
Solve: What is the Dow Chemical Annual return?
- 20.51%
- 18.65%
- 23.23%
- 26.15%
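Because continuously compounded (log) returns are additive over time, the spreadsheet sum can be sanity-checked in a few lines. Here is a minimal Python sketch with hypothetical daily values (the quiz uses the full 252-day series from the spreadsheet):

```python
# Continuously compounded (log) returns are additive over time,
# so the annual return is just the sum of the daily returns.
daily_returns = [0.012, -0.008, 0.005, 0.003]  # hypothetical daily cc returns
annual_return = sum(daily_returns)
print(f"{annual_return:.2%}")
```

In Excel this is simply `=SUM(B2:B253)` formatted as a percentage.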
Q2. Use the spreadsheet provided at the beginning of this practice quiz to answer the question.
Excel Problem Type: Calculating correlation for a two-column array
Question: What is the correlation between daily continuously compounded returns for Dow Chemical and for the S&P 500 Index? Round your answer to two digits to the right of the decimal place – for example, .84
- .78
- .57
- .48
- .79
Q3. Use the spreadsheet provided at the beginning of this practice quiz to answer the question.
Excel Problem Type: Identifying the maximum value in a column and sorting multiple columns while preserving rows.
Question: On what day in 2010 did Dow Chemical returns outperform S&P 500 Index returns the most?
- February 1, 2010
- October 25, 2010
- April 28, 2010
- February 9, 2010
Q4. Use the spreadsheet provided at the beginning of this practice quiz to answer the question.
Excel Problem Type: Using Excel “If” statements to determine how many days in 2010 Dow Chemical returns were higher than Dupont returns.
Problem Information: Assuming Dow Chemical Returns are in Column B and Dupont Returns in Column C, the “If” statements will be of the form =IF(B3>C3, 1, 0).
Set up a column of “If” statements; each day where the Dow return is greater than the Dupont return will then have a value of 1, and 0 otherwise. Summing that column gives the count of days.
Question: How many days out of the 252 trading days in 2010 did Dow outperform Dupont?
- 122
- 125
- 124
- 128
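The helper-column-of-IFs approach can be mirrored directly in Python. A minimal sketch with hypothetical sample data (the quiz uses the full 252-day columns):

```python
# Python equivalent of a helper column of =IF(B3>C3, 1, 0) formulas,
# summed to count the days Dow beat Dupont (hypothetical sample data).
dow    = [0.010, -0.002, 0.004, -0.001]
dupont = [0.008,  0.001, 0.006, -0.003]
days_dow_wins = sum(1 for d, p in zip(dow, dupont) if d > p)
print(days_dow_wins)
```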
Q5. Use the spreadsheet provided at the beginning of this practice quiz to answer the question.
Excel Problem Type: Sorting multiple columns while preserving rows
Question: What was the fifth-worst performing day for the S&P 500 Index in 2010?
- May 10, 2010
- February 4, 2010
- June 29, 2010
- May 20, 2010
Q6. Use the spreadsheet provided at the beginning of this practice quiz to answer the question.
Excel Problem Type: Defining the Sharpe Ratio
Problem Information: A “Sharpe Ratio” is a way of measuring the performance of an investment asset that takes into account both returns and the standard deviation (also called the volatility) of returns over time. A stock’s Sharpe ratio is the difference between its return and the return of a risk-free investment, such as a government bond, divided by the standard deviation of the stock’s returns. For example, if a stock returns 15% per year, the risk-free asset returns 3% per year, and the volatility of the stock is 18% per year, the Sharpe Ratio is 12%/18% = .67.
Question: Assume a risk-free asset returns 2% per year, and the standard deviation of returns of Dupont stock is 20%. What is the Sharpe Ratio for Dupont stock for 2010? Give the answer to two digits to the right of the decimal place.
- .93
- .88
- .84
- .83
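The definition above reduces to one line of arithmetic. A quick Python check of the worked example from the problem text:

```python
def sharpe_ratio(asset_return, risk_free_return, volatility):
    # Excess return per unit of risk
    return (asset_return - risk_free_return) / volatility

# Worked example from the problem text: (15% - 3%) / 18% = .67
print(round(sharpe_ratio(0.15, 0.03, 0.18), 2))
```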
Q7. Excel Problem Type: Optimization using the “Solver” plug-in
Problem Information: Assume that at a particular gas station, the quantity of automobile fuel sold in a week is a function of the fuel’s retail price.
The quantity of fuel sold in a week (in gallons) = (1,000 – 300x), where x is the price in dollars per gallon.
The function f(x) for revenues from weekly sales, in dollars, will equal x*(1000 – 300x) = 1000x – 300x^2.
Without using calculus or any other advanced math, Excel’s Solver add-in can be used to find the input value x that results in a maximum value for a function f(x). The price x goes in the Solver “variable cell,” and the function 1000x – 300x^2 is the Solver “objective.”
Question: What is the price x that maximizes weekly revenues?
- $1.45 per gallon
- $1.67 per gallon
- $14.50 per gallon
- $16.67 per gallon
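As a cross-check on the Solver result (this is a brute-force stand-in, not the course’s Solver method), the revenue function can be scanned over a grid of candidate prices:

```python
# Brute-force stand-in for Solver: scan candidate prices in 1-cent steps
# and keep the one that maximizes revenue f(x) = x * (1000 - 300x).
revenue = lambda x: x * (1000 - 300 * x)
prices = [i / 100 for i in range(0, 501)]   # $0.00 .. $5.00
best_price = max(prices, key=revenue)
print(f"${best_price:.2f} per gallon")
```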
Q8. Use the spreadsheet provided at the beginning of this practice quiz to answer the question.
Excel Problem Type: Scatter plots and trend line options
Solve: Generate a scatter plot that pairs the daily returns of Dow Chemical (y axis) against the S&P 500 returns (x axis). The slope of the regression line is also called “Beta.”
Question: What is Beta for Dow Chemical? Give the answer rounded two digits to the right of the decimal place.
- 1.55
- 1.66
- 1.00
- 1.62
Quiz : Excel Essentials
Q1. Please download the following workbook for the Excel Essentials Quiz: Excel-Essentials-Quiz(1) (XLSX file download).
This spreadsheet contains monthly continuously compounded returns for two stock indexes – RSP and SPY – and two individual stocks – Amazon and Duke Energy – for the 12 years from May 2003 to May 2015.
Use Excel’s chart function to generate a scatter plot of SPY index monthly returns (y axis) against Amazon monthly returns (x axis).
When you use the “trendline” option for the slope, R-squared, and the y-intercept, double-check your results against the equivalent cell-formula answers.
Question 1: What is the slope of the best-fit line (rounded to two decimal places)?
- 0.11
- 0.12
- 0.15
- 0.18
Q2. What is the coefficient of determination (R-squared)? Use the “rsq” Excel function (Trendline in Excel may give an inaccurate value for R-squared).
- 0.18
- 0.20
- 0.22
- 0.24
Q3. What is the Y-intercept, in percent? Use the “trendline” but double-check against the “intercept” function.
- 0.25%
- 0.35%
- 0.45%
- 0.55%
Q4. Answer Questions 4 and 5 based on the information below:
The annual “Sharpe Ratio” is a metric that combines profitability and risk – it measures units of profitability per unit of risk.
First, calculate the difference between the annual return of a stock and the annual return of a risk-free investment in government bonds. Second, divide that difference by the annualized population standard deviation of the stock’s returns.
For example, if the annual return of a stock is 10%, the annual risk-free bond return is 2%, and the annualized population standard deviation of returns of the stock is 16%, then the Sharpe Ratio = 8%/16% = 0.5.
For this problem, you can estimate the annualized standard deviation of returns by multiplying your calculated value for the monthly population standard deviation of returns by the square root of 12.
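The recipe above can be sketched in Python. The monthly standard deviation input here is hypothetical; the check at the end reproduces the worked example from the text (10% return, 2% risk-free, 16% annualized sd):

```python
from math import sqrt

def annual_sharpe(annual_return, risk_free, monthly_sd):
    # Annualize the monthly population sd by sqrt(12), per the text's rule
    annual_sd = monthly_sd * sqrt(12)
    return (annual_return - risk_free) / annual_sd

# Worked example: a 16% annual sd corresponds to a monthly sd of 16%/sqrt(12)
print(round(annual_sharpe(0.10, 0.02, 0.16 / sqrt(12)), 2))
```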
Question 4: Assuming the risk-free rate is 1.5% per year over the full 12-year interval measured, which asset had the higher Sharpe ratio: SPY or RSP?
- SPY
- RSP
Q5. For the asset you chose in Question 4, what was the Sharpe ratio? Round your results to two decimal places
- 0.56
- 0.53
- 0.50
- 0.48
Q6. In the month ending on which date did Amazon achieve the highest returns?
Note: Use “paste special” and choose “values and number formats” to keep return values from changing.
- September 1, 2010
- October 1, 2009
- April 2, 2007
- July 3, 2006
Q7. What was the monthly return from the question above?
- 22.9%
- 24.11%
- 43.27%
- 51.87%
Q8. What was Duke Energy’s return that same month?
- 0.51%
- 1.13%
- 3.04%
- 3.18%
Q9. Using the Solver plug-in (Solver Add-In) for Excel, answer Questions 9 and 10 based on the information below (Solver Add-In workbook, XLSX file download).
Between possible prices of $5 per pound and $25 per pound, the quantity of coffee Egger’s Roast Coffee can sell each month is a linear function of the retail selling price per pound. The linear function is (quantity sold in pounds) = (-400*(Price per pound)) + 10,000.
Question 9: What is the revenue-maximizing selling price per pound for Egger’s Roast Coffee?
If this question is too challenging, there is another example to review: the Solver_optimization_example(1) workbook (XLSX file download). This can also be found in “Course Resources” as a quick reference.
- $5.00
- $12.50
- $13.50
- $25.00
Q10. What is the monthly revenue at that price per pound? (The comma indicates thousands.)
- $15,100
- $62,500
- $62,100
- $40,000
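As with the gas-station problem, a brute-force scan (a stand-in for Solver, not the course’s method) confirms both the revenue-maximizing price and the revenue at that price:

```python
# Revenue = price * quantity, with quantity = -400*price + 10,000,
# scanned in 1-cent steps over the allowed range $5.00 .. $25.00.
revenue = lambda p: p * (-400 * p + 10_000)
prices = [5 + i / 100 for i in range(0, 2001)]
best = max(prices, key=revenue)
print(f"best price ${best:.2f}, monthly revenue ${revenue(best):,.0f}")
```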
Mastering Data Analysis in Excel Week 02 Quiz Answers
Binary Classification practice
Q1. What is one reason False Positive classifications were expensive in the Battle of Britain?
- German bombers would expect to be intercepted.
- German bombers were always present.
- Pilots needed more practice.
- Aviation fuel for fighter planes was scarce.
Q2. The portion of test outcomes that are True Negatives plus the portion that are False Negatives must equal:
- One, minus the classification incidence/test incidence
- The False Negative (FN) Rate
- One, minus the condition incidence
- The Negative Predictive Value (NPV)
Q3. Use the Cancer Diagnosis Spreadsheet (XLSX file download) to answer Questions 3 and 4.
This spreadsheet gives 10,000 pairs of scores – the level of a (fictional) cancer diagnostic protein in Column A – along with the actual condition: 1 = cancer, 0 = no cancer in Column C.
Change the cost per False Negative classification to $20,000 [cell G3]. Change the cost per False Positive classification to $1,000 [cell H3].
Question: What is the new minimum cost per event/cost per test (rounded to the nearest dollar)?
- $181
- $187
- $185
- $183
Q4. What is the lowest level of protein that should be classified “Positive” to achieve the minimum cost per test at the new costs per error given above?
- 18202.407
- 18204.545
- 18204.498
- 18213.7
Q5. Can a change in classification threshold change a diagnostic test’s True Positive Rate? Use logic – no need to calculate any numbers.
- No
- Yes
Q6. “Condition Incidence” is the portion of a population that actually has the Condition being studied. Can a change in threshold change the Condition incidence? Use logic – no need to calculate any numbers.
- Yes
- No
Q7. Does the change in threshold change the test’s “classification incidence” (also called “test incidence”)? Use logic – no need to calculate any numbers.
- No
- Yes
Q8. Does the change in threshold change the test’s Area under the ROC Curve? Use logic – no need to calculate any numbers.
- No
- Yes
Q9. Use the Forecasting Soldier Performance Spreadsheet (XLSX file download) to answer this question:
What is the False Positive Rate if we use the sum of standardized height and standardized weight as the score; and set a threshold at -1.28?
- 0.67
- 0.4
- 0.6
- 0.33
Quiz : Binary Classification (graded)
Q1. A test for “driving while intoxicated” was given 100 times. 20 people tested were actually intoxicated, and 10 people were misclassified as intoxicated. What would the False Positive rate be?
- 10%
- 12.5%
- 50%
- 30%
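A common trap here is dividing by the total tested. One plausible reading of the numbers, computed explicitly:

```python
# The False Positive rate's denominator is the number of actual
# negatives (people who were NOT intoxicated), not the total tested.
total_tested    = 100
actually_drunk  = 20
false_positives = 10   # sober people classified as intoxicated
fpr = false_positives / (total_tested - actually_drunk)
print(f"{fpr:.1%}")
```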
Q2. If a fire alarm malfunctions and fails to go off when there actually is a fire, that is a:
- False Negative
- True Positive
- False Positive
- True Negative
Q3. Use the Binary Classification Metrics Spreadsheet Definitions to answer the following:
If the “classification incidence/test incidence” is 10% for the whole population, and the true “condition incidence” is 12% for the whole population, the True Positive rate:
- must be 0%
- cannot be 100%
- must be 100%
- can be 100%
Q4. Use the Cancer Diagnosis Spreadsheet (XLSX file download) to answer Questions 4 to 6.
Keep the cost per False Positive test set at $500. Use MS Solver to determine the maximum cost per False Negative test that permits an average cost per test of $100.
- $17,082
- $12,262
Q5. Assume a cost of $15,000 per False Negative (FN) and $100 per False Positive (FP). What is the minimum average cost per test?
- $1.00
- $259,800
- $16,551
- $25.98
Q6. If, instead of assuming a cost of $15,000 per FN and $100 per FP, the costs are assumed to be $7,500 per FN and $50 per FP, what changes?
- The minimum cost threshold of 16,551.930
- The minimum Cost per Test
- The False Positive Rate
- The True Positive Rate
Q7. Use logic and the definitions in the Binary Performance Metrics Spreadsheet (XLSX file download) to answer the following question.
In general, increasing the cost per FN while keeping the cost per FP constant will cause the cost-minimizing threshold score to:
- Increase
- Stay the Same
- Decrease
Q8. Make a copy of the Bombers and Seagulls Spreadsheet (XLSX file download) to answer Questions 8-10.
Modify the spreadsheet data so that there are 4 bombers instead of 3, and 16 seagulls instead of 17, by changing the actual condition for the radar score of 66 from a 0 to a 1 in cell D43.
What is the new Area Under the Curve?
- 0.72
- 0.824
- 0.75
- 0.78
Q9. Assuming the costs for classification errors are 5 million pounds per FN and 4 million pounds per FP, how much does changing the value at Cell D43 from 0 to 1 change the minimum cost per event?
- Increases by 250,000 pounds
- Increases by 950,000 pounds
- Increases by 5 million pounds.
- Unknown
Q10. Change the cost per FN to 50 million pounds. How does changing the data in cell D43 from a 0 to a 1 change the cost-minimizing threshold?
- Decreases it from 75 to 66.
- Increases it from 66 to 75.
- Decreases it from 75 to 62
- Decreases it from 75 to 70
Q11. Use the Binary Performance Metrics Spreadsheet (XLSX file download) definitions to answer the following question.
A population tested for “driving while intoxicated” has a Condition incidence of 20%. If the test has a true positive rate of 70% and a false positive rate of 10%, what is the test’s Positive Predictive Value (PPV)?
- 0.50
- 0.36
- 0.64
- 0.60
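PPV is a Bayes'-rule calculation: of everyone who tests positive, what fraction actually has the condition? A direct check from the three given rates:

```python
# PPV = P(condition | positive test), via Bayes' rule over the population
incidence, tpr, fpr = 0.20, 0.70, 0.10
true_pos  = tpr * incidence          # fraction of population: true positives
false_pos = fpr * (1 - incidence)    # fraction of population: false positives
ppv = true_pos / (true_pos + false_pos)
print(round(ppv, 2))
```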
Q12. Use the Forecasting Soldier Performance Spreadsheet (XLSX file download) to answer Question 12.
Rank the outcomes using the soldiers’ age as the score, with the oldest at the top. A threshold of 24 years represents what point on the ROC Curve?
- .33, .67
- .67, .33
- .25, .75
- .5, .5
Mastering Data Analysis in Excel Week 03 Quiz Answers
Using the Information Gain Calculator Spreadsheet (practice)
Q1. Use the Information Gain Calculator spreadsheet (XLSX file download); an explanation of how to use the Information Gain Calculator is also provided, and you may find it helpful to review beforehand. Without changing any inputs in the confusion matrix, what is the conditional probability of getting a Positive Test if you have a defective chip?
- 14%
- 50%
- 25%
- 37.5%
Q2. The conditional probability of getting a Positive Test if you have a defective chip can be written p(Test POS | “+”). What is this probability called on the Confusion Matrix?
- The False Negative Rate
- The False Positive Rate
- The True Positive Rate
- The True Negative Rate
Q3. What is the remaining uncertainty or entropy of the test classification if we learn a chip is truly defective?
- .5917 bits
- .8113 bits
- 1 bit
- .9183 bits
Q4. What is the probability that a chip chosen at random from the assembly line is defective?
- .3
- .7
- .2
- .8
Q5. What is the conditional Probability of Getting a “Negative” Test classification if you have a non-defective chip?
- 14%
- 50%
- 75%
- 25%
Q6. The conditional probability of getting a Negative Test if you have a non-defective chip can be written P(Y = “NEG” | X = “-”). What is this probability called on the Confusion Matrix?
- True Negative Rate
- False Positive Rate
- True Positive Rate
- False Negative Rate
Q7. Challenging question: What is the remaining uncertainty, or entropy, of the Test Classification, if we know that a chip is not-defective?
- 1 bit
- .5917 bits
- .9183 bits
- .8113 bits
Q8. How frequently will a non-defective chip occur?
- .2
- .7
- .3
- .8
Q9. What is the expected, or average, uncertainty (entropy) remaining regarding a Test Outcome, given knowledge of whether or not a chip is defective?
- .8490 bits
- .0323 bits
- 1 bit
- .8813 bits
Q10. The optical scanner breaks down and begins to classify 30% of all chips as defective completely at random. What is the random test’s True Positive Rate and False Positive Rate?
- 30% and 70%
- 70% and 70%
- 70% and 30%
- 30% and 30%
Quiz : Information Measures (graded)
Q1. Suppose we have two coins: one “fair” coin, where p(heads) = p(tails) = .5; and an “unfair” coin where p(heads) does not equal p(tails). Which coin has a larger entropy prior to observing the outcome?
- The fair coin
- The unfair coin
Q2. If you roll one fair die (6-sided), what is its entropy before the result is observed?
- 2.58 bits
- 2.32 bits
- 0.43 bits
- 0.46 bits
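For a uniform discrete distribution with n equally likely outcomes, the entropy is log2(n) bits, which is easy to verify:

```python
from math import log2

# Entropy of a fair 6-sided die: six equally likely outcomes
entropy_die = log2(6)
print(round(entropy_die, 2))
```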
Q3. If your friend picks one number between 1001 and 5000, under the strategy used in the video “Entropy of a Guessing Game,” what is the maximum number of questions you need to ask to find out that number?
- 12
- 10
- 11
- 13
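The guessing strategy halves the candidate set with each yes/no question, so the worst case is the ceiling of log2 of the number of candidates:

```python
from math import ceil, log2

# Halving the candidate set with each yes/no question needs
# ceil(log2(n)) questions in the worst case.
n = 5000 - 1001 + 1     # 4000 possible numbers
print(ceil(log2(n)))
```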
Q4. Use the “Information Gain Calculator” spreadsheet (XLSX file download) to calculate the “Conditional Entropy” H(X|Y) given a = 0.4, c = 0.5, e = 0.11.
- 0.87 bits
- 0.90 bits
- 0.97 bits
- 1.87 bits
Q5. On the “Information Gain Calculator” spreadsheet (XLSX file download), given a = 0.3, c = 0.2, suppose now we also know that H(X,Y) = H(X) + H(Y). What is the joint probability e?
- 0.06
- 0.3
- 0.5
- 0.04
Q6. Given a = 0.2, c = 0.5 on the Information Gain Calculator Spreadsheet (XLSX file download), suppose now we also know the true positive rate is 0.18. What is the Mutual Information?
- 0.72 bits
- 0.13 bits
- 0.08 bits
- 1.64 bits
Q7. Consider the Monty Hall problem, but instead of the usual 3 doors, assume there are 5 doors to choose from. You first choose door #1. Monty opens doors #2 and #3. What is the new probability that there is a prize behind door #4?
- 0.5
- 0.67
- 0.2
- 0.4
Q8. Again, consider the Monty Hall problem, but with 5 doors to choose from instead of 3. You pick door #1, and Monty opens 2 of the other 4 doors. How many bits of information are communicated to you by Monty when you observe which two doors he opens?
- 1.52 bits
- 2.32 bits
- 0.80 bits
- 0.67 bits
Q9. B stands for “the coin is fair”, ~B stands for “the coin is crooked”. The p(heads | B) = 0.5, and p(heads | ~B) = 0.4. Your friend tells you that he often tests people to see if they can guess whether he is using the fair coin or the crooked coin, but that he is careful to use the crooked coin 70% of the time. He tosses the coin once and it comes up heads.
What is your new best estimate of the probability that the coin he just tossed is fair?
- 0.15
- 0.35
- 0.40
- 0.43
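This is a direct application of Bayes' rule, with the 70%/30% usage rates as the prior:

```python
# Bayes' rule: P(fair | heads) = P(heads | fair) * P(fair) / P(heads)
p_fair, p_crooked = 0.30, 0.70          # crooked coin used 70% of the time
p_h_fair, p_h_crooked = 0.50, 0.40
p_heads = p_fair * p_h_fair + p_crooked * p_h_crooked
posterior_fair = p_fair * p_h_fair / p_heads
print(round(posterior_fair, 2))
```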
Q10. Suppose you are given either a fair die or an unfair die (6-sided). You have no basis for considering either die more likely before you roll it and observe an outcome. For the fair die, the chance of observing “3” is 1/6. For the unfair die, the chance of observing “3” is 1/3. After rolling the unknown die, you observe the outcome to be 3.
What is the new probability that the die you rolled is fair?
- 0.08
- 0.23
- 0.33
- 0.36
Mastering Data Analysis in Excel Week 04 Quiz Answers
Regression Models and PIG (practice)
Q1. Given ordered data sets X = {12, 23, 4, 36, 10, 67, 58, 40, 33} and Y = {1.5, 10, 8.3, 4, 1.4, 1.8, 2.2, 4, 3}, what is the correlation of their ordered pairs after standardization?
- 0.36
- 0.58
- -0.58
- -0.36
Q2. If a linear regression between standardized ordered pairs (both Gaussian distributions) has mean = 0, and root mean square error (standard deviation) = 0.5, what is the coefficient of determination (R-squared)?
- 0.25
- 0.75
- .5
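On standardized data the total variance of y is 1, so the coefficient of determination follows immediately from the error variance:

```python
# Standardized data: total variance of y is 1, so
# R-squared = explained variance = 1 - (error variance)
rmse = 0.5
r_squared = 1 - rmse ** 2
print(r_squared)
```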
Q3. Given correlation of R = 0.87 on a standardized linear regression (with Gaussian distributions), what is the Percentage Information Gain (P.I.G.)?
Hint: Use the Correlation and P.I.G. Spreadsheet (XLSX file download).
- 44.8%
- 49.8%
- 38.8%
- 55.3%
Q4. True/False: After standardization, the slope of the regression line, “beta,” will equal the covariance and the correlation between the ordered pairs (x, y).
Hint: Refer to the Standardization Spreadsheet (XLSX file download).
- False
- True
Q5. Assume a regression line on standardized data with slope = 0.7. At the point x = 1, the point estimate for y = 0.7.
What upper bound for the confidence interval of y will ensure that 95% of all possible outcomes for y will be below it?
Hint: Use the Correlation and Model Error Spreadsheet (XLSX file download).
- 0.77
- None of the above
- 1.47
- 1.87
Q6. True/False: For a parameterized Gaussian model with continuous distributions, if the correlation R between random variables X and Y is increased, the percentage information gain from the linear regression also increases, until it becomes infinite at R = 1.
- False
- True
Q7. Assume you work for a “big box” retail chain selling general home-construction supplies. Your supervisor points out that there is a near-linear association between the retail price of an item and how long on average it remains on the shelf. The correlation between price and time held in inventory is .82. The current model has a standard deviation of model error of 12.4 hours.
To reduce the standard deviation of model error to 8 hours, what would the linear correlation need to be?
- .93
- .97
- .83
- .53
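Assuming the standard regression relation sd_error = sd_y * sqrt(1 - R^2), the current model pins down sd_y, and from there the required correlation can be backed out (this relation is an assumption of the sketch, consistent with the course's Gaussian model framework):

```python
from math import sqrt

# Back out sd_y from the current model (R = .82, error sd = 12.4 h),
# then solve sd_error = sd_y * sqrt(1 - R^2) for R at 8 hours.
sd_y = 12.4 / sqrt(1 - 0.82 ** 2)
r_needed = sqrt(1 - (8 / sd_y) ** 2)
print(round(r_needed, 2))
```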
Q8. The latency [time delay for a signal to be processed] of a fiber optic network is closely correlated to the physical distance a signal must travel across the network. A network provider uses a linear model for expected latency in nanoseconds. The model is: Nanoseconds = (3.34*physical distance in meters) + 25. The standard deviation of model error is 8 nanoseconds. You are asked to give the range in nanoseconds within which 99% of all latencies will fall.
- 20.61 nanoseconds
- 26.31 nanoseconds
- 51.21 nanoseconds
- 41.21 nanoseconds
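A symmetric 99% interval for a Gaussian spans z ≈ ±2.576 standard deviations, so the full width in nanoseconds is:

```python
from statistics import NormalDist

# Two-sided 99% interval: z at the 99.5th percentile, times 2, times sd
sd_error = 8                              # model-error sd in nanoseconds
z = NormalDist().inv_cdf(0.995)
print(round(2 * z * sd_error, 2))
```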
Q9. Assume a linear regression model to forecast the profitability of future company clients had a standard deviation of model error of $3,885.
Assume this model error remains constant on new data. If you used the linear model to forecast the profitability for 200 completely new customers, and calculated the mean profitability for the 200, what would the standard deviation of model error for that mean profitability in that group be?
Give your answer to the nearest dollar.
- $388
- $19
- $275
- $3885
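By the central limit theorem, the standard deviation of the mean of n independent forecasts shrinks by the square root of n:

```python
from math import sqrt

# Standard error of the mean of n independent forecasts:
# sd_of_mean = sd_single / sqrt(n)
sd_single, n = 3885, 200
se_mean = sd_single / sqrt(n)
print(round(se_mean))
```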
Quiz : Parametric Models for Regression (graded)
Q1. A manufacturer has developed a specialized metal alloy for use in jet engines. In its pure form, the alloy starts to soften at 1500 F. However, small amounts of impurities in production cause the actual temperature at which the alloy starts to lose strength to vary around that mean, in a Gaussian distribution with standard deviation = 10.5 degrees F.
If the manufacturer wants to ensure that no more than 1 in 10,000 of its commercial products will suffer from softening, what should it set as the maximum temperature to which the alloy can be exposed?
Hint: Refer to the Excel NormS Functions Spreadsheet (XLSX file download).
- 1496.281
- 39.0497 F
- 1539.0497
- 1460.9503 F
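The question asks for the 1-in-10,000 lower quantile of the softening-temperature distribution. Python's `statistics.NormalDist` plays the role of Excel's NORM.INV here:

```python
from statistics import NormalDist

# Temperature below which only 1 in 10,000 units begin to soften
softening_temp = NormalDist(mu=1500, sigma=10.5)
max_exposure = softening_temp.inv_cdf(1 / 10_000)
print(round(max_exposure, 4))
```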
Q2. A carefully machined wire comes off an assembly line within a certain tolerance. Its diameter is 100 microns, and all the wires produced have a uniform distribution of error, between -11 microns and +29 microns.
A testing machine repeatedly draws samples of 180 wires and measures the sample mean. What is the distribution of sample means?
Hint: Use the CLT and Excel Rand() Spreadsheet (XLSX file download).
- A Uniform Distribution with mean = 109 microns and standard deviation = .8607 microns.
- A Gaussian distribution that, in Phi notation, is written, ϕ(109, 133.33).
- A Uniform Distribution with mean = 109 microns and standard deviation = 11.54 microns.
- A Gaussian Distribution that, in Phi notation, is written ϕ(109, .7407).
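The key facts: a uniform distribution of width w has standard deviation w/sqrt(12), and by the CLT the mean of 180 samples is near-Gaussian with that sd shrunk by sqrt(180):

```python
from math import sqrt

# Uniform error band of width 40 microns centered at +9 around 100
width = 29 - (-11)                 # 40-micron error band
mean = 100 + (29 + (-11)) / 2      # 109 microns
sd_mean = width / sqrt(12) / sqrt(180)
print(mean, round(sd_mean, 4), round(sd_mean ** 2, 4))
```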
Q3. A population of people suffering from Tachycardia (occasional rapid heart rate), agrees to test a new medicine that is supposed to lower heart rate. In the population being studied, before taking any medicine the mean heart rate was 120 beats per minute, with standard deviation = 15 beats per minute.
After being given the medicine, a sample of 45 people had an average heart rate of 112 beats per minute. What is the probability that this much variation from the mean could have occurred by chance alone?
Hint: Use the Typical Problem with NormSDist Spreadsheet (XLSX file download).
- 1.73%
- .0173%
- 29.690%
- 99.9827%
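This is a one-sided z-test on a sample mean; `NormalDist().cdf` stands in for Excel's NORM.S.DIST:

```python
from math import sqrt
from statistics import NormalDist

# How unlikely is a sample mean of 112 bpm (n = 45) if the medicine
# did nothing and the true mean is 120 bpm with sd 15?
se = 15 / sqrt(45)                 # standard error of the sample mean
z = (112 - 120) / se
p = NormalDist().cdf(z)            # lower-tail probability
print(f"{p:.4%}")
```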
Q4. Two stocks have the following expected annual returns:
Oil stock – expected return = 9% with standard deviation = 13%
IT stock – expected return = 14% with standard deviation = 25%
The stocks’ prices have a small negative correlation: R = -.22.
What is the Covariance of the two stocks?
Hint: Use the Algebra with Gaussians Spreadsheet (XLSX file download).
- -.00219
- -.00573
- -.0286
- -.00715
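Covariance follows directly from the correlation and the two standard deviations:

```python
# Covariance from correlation: cov(X, Y) = R * sd_X * sd_Y
r, sd_oil, sd_it = -0.22, 0.13, 0.25
cov = r * sd_oil * sd_it
print(round(cov, 5))
```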
Q5. Two stocks have the following expected annual returns:
Oil stock – expected return = 9% with standard deviation = 13%
IT stock – expected return = 14% with standard deviation = 25%
The stocks’ prices have a small negative correlation: R = -.22.
Assume return data for the two stocks is standardized so that each is represented as having mean 0 and standard deviation 1. Oil is plotted against IT on the (x,y) axis.
What is the covariance?
Hint: Use the Standardization Spreadsheet (XLSX file download).
- -.22
- -.00573
- 0
- -1
Q6. Two stocks have the following expected annual returns:
Oil stock – expected return = 9% with standard deviation = 13%
IT stock – expected return = 14% with standard deviation = 25%
The stocks’ prices have a small negative correlation: R = -.22.
What is the standard deviation of a portfolio consisting of 70% Oil and 30% IT?
Hint: Use either the Algebra with Gaussians Spreadsheet or the Markowitz Portfolio Optimization Spreadsheet (XLSX file downloads).
- 12.68%
- 11.79%
- 17.93%
- 10.44%
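The portfolio standard deviation comes from the standard two-asset variance formula, which can be evaluated directly:

```python
from math import sqrt

# Two-asset portfolio variance:
# var = (w1*s1)^2 + (w2*s2)^2 + 2*w1*w2*cov(1,2)
w_oil, w_it = 0.70, 0.30
sd_oil, sd_it, r = 0.13, 0.25, -0.22
cov = r * sd_oil * sd_it
var = (w_oil * sd_oil) ** 2 + (w_it * sd_it) ** 2 + 2 * w_oil * w_it * cov
print(f"{sqrt(var):.2%}")
```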
Q7. Two stocks have the following expected annual returns:
Oil stock – expected return = 9% with standard deviation = 13%
IT stock – expected return = 14% with standard deviation = 25%
The stocks’ prices have a small negative correlation: R = -.22.
Use MS Solver and the Markowitz Portfolio Optimization Spreadsheet (XLSX file downloads: Solver Add-In; Markowitz Portfolio Optimization) to find the weighted portfolio of the two stocks with the lowest volatility.
What is the minimum volatility?
- 11.58%
- 10.36%
- 10.43%
- 9.5%
Q8. You are a data-analyst for a restaurant chain and are asked to forecast first-year revenues from new store locations. You use census tract data to develop a linear model.
Your first model has a standard deviation of model error of $25,000 at a correlation of R = .30. Your boss asks you to keep working on improving the model until the new standard deviation of model error is $15,000 or less.
What positive correlation R would you need to have a model error of $15,000?
(Note: you can answer this question by making small additions to the Correlation and Model Error Spreadsheet, XLSX file download.)
- R = .500
- R = .8200
- R = .572
- R = .428
Q9. An automobile parts manufacturer uses a linear regression model to forecast the dollar value of next year’s orders from current customers as a function of a weighted sum of their past years’ orders. The model error is assumed Gaussian with a standard deviation of $130,000.
If the correlation is R = .33, and the point forecast is for orders of $5.1 million, what is the probability that the customer will order more than $5.3 million?
Hint: Use the Typical Problem with NormSDist Spreadsheet (XLSX file download).
- 93.8%
- 12.4%
- 4.3%
- 6.2%
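Given a Gaussian model error around the point forecast, this is an upper-tail probability:

```python
from statistics import NormalDist

# Probability the actual order exceeds $5.3M when the point forecast
# is $5.1M and the model-error sd is $130,000
z = (5_300_000 - 5_100_000) / 130_000
p_exceed = 1 - NormalDist().cdf(z)
print(f"{p_exceed:.1%}")
```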
Q10. An automobile parts manufacturer uses a linear regression model to forecast the dollar value of next year’s orders from current customers as a function of a weighted sum of that customer’s past years’ orders. The linear correlation is R = .33.
After standardizing the x and y data, what portion of the uncertainty about a customer’s order size is eliminated by their historical data combined with the model?
Hint: Use the Correlation and P.I.G. Spreadsheet (XLSX file download).
- 4.2%
- 3.5%
- 4.5%
- 5.2%
Q11. A restaurant offers different dinner “specials” each weeknight. The mean cash register receipt per table on Wednesdays is $75.25 with standard deviation of $13.50. The restaurant experiments one Wednesday with changing the “special” from blue fish to lobster. The average amount spent by 85 customers is $77.20.
How probable is it that Wednesday’s receipts were better than average by chance alone?
Hint: Use the Typical Problem with NormSDist Spreadsheet (XLSX file download).
- 9.15%
- 90.85%
- 8.30%
- 9.05%
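Structurally this is the same sample-mean z-test as the heart-rate problem, but for the upper tail:

```python
from math import sqrt
from statistics import NormalDist

# Upper-tail probability that an 85-customer average of $77.20
# arises by chance when the true mean is $75.25 with sd $13.50
se = 13.50 / sqrt(85)
z = (77.20 - 75.25) / se
p = 1 - NormalDist().cdf(z)
print(f"{p:.2%}")
```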
Q12. Your company currently has no way to predict how long visitors will spend on the Company’s web site. All that is known is that the average time spent is 55 seconds, with an approximately Gaussian distribution and a standard deviation of 9 seconds. It would be possible, after investing some time and money in analytics tools, to gather and analyze information about visitors and build a linear predictive model with a standard deviation of model error of 4 seconds.
What would the P.I.G. of that model be?
Hint: Use the Correlation and P.I.G. Spreadsheet (XLSX file download); see also “How to use the AUC calculator” (PDF file).
- 48.2%
- 57.2%
- 53.3%
- 61.5%
Mastering Data Analysis in Excel Week 05 Quiz Answers
Quiz : Probability, AUC, and Excel Linest Function
Q1. Keep the 125 outcomes in the Histograms Spreadsheet (XLSX file download) unchanged. Change the bin ranges so that bin 1 is [-3, -1), bin 2 is [-1, 1), and bin 3 is [1, 3).
What is the approximate probability that a new outcome will fall within bin 1?
- 5%
- 5
- 4%
- .4
Q2. Use the Excel Probability Functions Spreadsheet (Excel_Probability_Functions, XLSX file download).
Assume a continuous uniform probability distribution over the range [47, 51.5].
What is the skewness of the probability distribution?
- 2.17
- 0
- 1.69
- 49.25
Q3. Use the Excel Probability Functions Spreadsheet, provided in question #2.
Assume a continuous uniform probability distribution over the range [-12, 20]
What is the entropy of this distribution?
- 6 bits
- 4 bits
- 3 bits
- 5 bits
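For a continuous uniform distribution over a range of width w, the differential entropy is log2(w) bits, which matches the spreadsheet's calculation:

```python
from math import log2

# Differential entropy of a continuous uniform distribution
# over a range of width 32 is log2(32) bits
width = 20 - (-12)
print(log2(width))
```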
Q4. Use the Excel Probability Functions Spreadsheet that was previously provided.
Assume a Gaussian probability function with mean = 3 and standard deviation = 4.
What is the value of the density f(x) at x = 3.5?
- 4.05
- .352
- .099
- .550
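This is the density Excel returns from NORM.DIST(3.5, 3, 4, FALSE). The same value from the Gaussian formula directly:

```python
# Gaussian density f(x) = exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi)),
# equivalent to Excel's NORM.DIST(3.5, 3, 4, FALSE).
import math

mu, sigma, x = 3.0, 4.0, 3.5
f = math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
# f ≈ 0.099
```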
Q5. Use the Excel Probability Functions Spreadsheet previously provided in this quiz.
Assume a Gaussian Probability Distribution with mean = 3 and standard deviation = 4.
What is the cumulative distribution at x = 7?
- 1.00
- .960
- .841
- .060
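This is Excel's NORM.DIST(7, 3, 4, TRUE). Since z = (7 − 3)/4 = 1, the result is the standard normal CDF at 1, about .841; a one-liner using the error function:

```python
# Gaussian cumulative distribution via the error function, equivalent to
# Excel's NORM.DIST(7, 3, 4, TRUE).
import math

mu, sigma, x = 3.0, 4.0, 7.0
z = (x - mu) / sigma                              # z = 1
cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))      # Phi(1) ≈ 0.841
```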
Q6. Use the AUC Calculator Spreadsheet (AUC_Calculator and Review of AUC Curve.xlsx).
If the “modification factor” in the original example in the AUC Calculator Spreadsheet is changed from -1 to -2, how does the actual Area Under the ROC Curve change?
- The area decreases
- No change
- The area increases
Q7. Use the AUC Calculator Spreadsheet provided in question #6.
If the “modification factor” in the original example given in the AUC Calculator Spreadsheet is changed from -1 to -2, what is the threshold (row 10) that results in the lowest cost per event?
- 1.3
- .45
- 3.5
- .9
Q8. Refer to the AUC Calculator Spreadsheet previously provided.
Assume a binary classification model is trained on 200 ordered pairs of scores and outcomes and has an AUC of .91 on this “training set.” The same model, on 5,000 new scores and outcomes, has an AUC of .5.
Which statement is most likely to be correct?
- The model overfit the training set data and will need to be improved to work better on the new data.
- The original model identified signal as noise and has no predictive value on new data.
- The original model is expected to perform worse on test set data and is functioning acceptably.
Q9. Refer to the Excel Linest Function Spreadsheet (Excel Linest Function.xlsx).
If a multivariate linear regression gives a weight beta(1) of 0.4 on x(1) = “age in years,” and a new input x(7) of “age in months” is added to the regression data, which of the following statements is false?
- If the x(1) data are removed, the new beta(7) on the new x(7) data will be 0.4.
- Using Excel linest, and including x(1) and x(7) data, the new beta(7) on the age in months will be 0.
- If the x(1) data are removed, the new beta(7) on the new x(7) data will be .033
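The key fact behind this question: a least-squares slope rescales inversely with the predictor's units, so 0.4 per year of age becomes 0.4/12 ≈ .033 per month. A sketch on synthetic single-variable data (not the spreadsheet's):

```python
# When a predictor's units change, the fitted slope rescales inversely:
# age in months is 12x age in years, so the weight on it is 0.4 / 12 ≈ 0.033.
# Synthetic illustration, not the spreadsheet's data.
import numpy as np

rng = np.random.default_rng(0)
age_years = rng.uniform(20, 60, size=100)
y = 0.4 * age_years + rng.normal(0, 0.5, size=100)   # true slope: 0.4 per year

slope_years = np.polyfit(age_years, y, 1)[0]         # ≈ 0.4
slope_months = np.polyfit(age_years * 12, y, 1)[0]   # same ages measured in months
# slope_months is exactly slope_years / 12 (up to floating-point noise)
```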
Q10. Use the Excel Linest Function Spreadsheet that was provided in question #9.
What is the Correlation, R for the linear regression shown in the example?
- .778 or -.778
- .367
- .606
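Worth remembering when reading LINEST output: LINEST reports R² in its statistics block, and the correlation R is ±√R²; for a single-variable regression the sign matches the slope's sign, which is why one R² value is compatible with two correlations. Assuming, for illustration only, an R² of .606:

```python
# Correlation from a coefficient of determination. The R^2 value here is an
# illustrative assumption, not read from the spreadsheet.
import math

r_squared = 0.606
r = math.sqrt(r_squared)   # |R| ≈ 0.778; the sign comes from the slope's direction
```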
Mastering Data Analysis in Excel Week 06 Quiz Answers
Part 1: Building your Own Binary Classification Model
Q1. First Binary Classification Model (Data_Final Project.xlsx)
You work for a bank as a business data analyst in the credit card risk-modeling department. Your bank conducted a bold experiment three years ago: for a single day it quietly issued credit cards to everyone who applied, regardless of their credit risk, until the bank had issued 600 cards without screening applicants.
After three years, 150, or 25%, of those card recipients defaulted: they failed to pay back at least some of the money they owed. However, the bank collected very valuable proprietary data that it can now use to optimize its future card-issuing process.
The bank initially collected six pieces of data about each person:
· Age
· Years at current employer
· Years at current address
· Income over the past year
· Current credit card debt, and
· Current automobile debt
In addition, the bank now has a binary outcome: default = 1, and no default = 0.
Your first assignment is to analyze the data and create a binary classification model to forecast future defaults.
You will combine data from the above six inputs to output a single “score.” Use the Forecasting Soldier Performance spreadsheet for a simple example of combining multiple inputs.
The relative rank-ordering of scores will determine the model’s effectiveness. For convenience (in particular, so that you can use the AUC Calculator Spreadsheet), you are asked to use a scale for your score that has a maximum < 3.5 and a minimum > -3.5.
At first you are not told your bank’s own best estimates for its cost per False Negative (an accepted applicant who becomes a defaulting customer) and per False Positive (a rejected applicant who would not have defaulted).
Therefore, the best you can do is to design your model to maximize the Area Under the ROC Curve, or AUC.
You are told that if your model is effective (“high enough” AUC, not defined further) and “robust” (again not defined, but in general this means relatively little decrease in AUC across multiple sets of new data) then it may be adopted by the bank as its predictive model for default, to determine which future applicants will be issued credit cards.
You are first given a “Training Set” of 200 out of the 600 people in the experiment. The Data_For_Final_Project (below) has both the training set and test set you will need.
Design your model using the Training Set. Standardized versions of the input data are also provided for your convenience. You may combine the six inputs by adding them to, or subtracting them from, each other, taking simple ratios, etc. Exclude inputs that are not helpful, and then experiment with how to combine the most informative inputs.
Note that you will need some of your quiz answers again later, so please write them down and keep track of them as you go along.
Question: What is your model? Give it as a function of two or more of the six inputs. For example: (Age + Years at Current Address)/Income [not a great model!].
Your model should have at least two inputs.
Q2. What is your model’s AUC on the Training Set? Use two digits to the right of the decimal place.
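AUC can be computed directly from scores and outcomes as the probability that a randomly chosen defaulter outscores a randomly chosen non-defaulter, ties counting half; this is what the AUC Calculator Spreadsheet approximates. A sketch with made-up scores:

```python
# Rank-based AUC: probability that a random positive (defaulter) scores higher
# than a random negative (non-defaulter), with ties counted as half a win.
def auc(scores, outcomes):
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical scores): higher score = more likely to default.
example = auc([2.1, 0.3, 0.9, -0.8, 1.2], [1, 0, 1, 0, 0])   # 5 of 6 pairs ranked correctly
```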
Q3. Initial Assessment for Over-fitting (testing your model on new data)
Next, test your model, without changing any parameters, on the Test Set of 200 additional applicants. See the Test Set sheet; it is part of the Data_Final Project spreadsheet, which has both the training and test sets.
Hint: Make and use a second copy of the AUC Calculator Spreadsheet so that you can compare Test Set and Training Set results easily.
Q4. What is your model’s new AUC on the Test Set? Give two digits to the right of the decimal place.
Finding the Cost-Minimizing Threshold for your Model
Now that you have, hopefully, developed your model to the point where it is relatively “robust” across the training set and test set, your boss at the bank finally gives you its current rough estimate of the bank’s average costs for each type of classification error.
[Note that all bank models here include only profits and losses within three years of when a card is issued, so the impact of out-years (years beyond 3) can be ignored.]
Cost Per False Negative: $5000
Cost Per False Positive: $2500
For the 600 individuals that were automatically given cards without being classified, the total cost of the experiment turned out to be 25%*($5000)*600 or $750,000. This is $1,250 per event.
Only models with lower cost per event than $1,250 should have any value.
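The arithmetic behind those figures:

```python
# Cost of the no-screening experiment: 25% of 600 cardholders defaulted, at
# $5,000 per default, giving $750,000 total, or $1,250 averaged over all 600.
default_rate = 0.25
cost_per_false_negative = 5000
cards_issued = 600

total_cost = default_rate * cost_per_false_negative * cards_issued   # $750,000
cost_per_event = total_cost / cards_issued                           # $1,250
```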
Question: What is the threshold score on the Training Set data for your model that minimizes Cost per Event? You will need this number to answer later questions.
Hint: Using the AUC Calculator Spreadsheet, identify which column displays the same cost-per-event (row 17) as the overall minimum cost-per-event shown in Cell J2. The threshold is shown in row 10 of that column. The threshold means that at and above this score, everything is classified as a “default.”
Q5. Finding the Minimum Cost Per Event
Question: Again referring only to the Training Set data, what is the overall minimum cost-per-event?
Hint: You will need this number to answer later questions. If you used the AUC Calculator, the overall minimum cost per event will be displayed in Cell J2.
Note: for Coursera to interpret your answer correctly you must give your answer as an integer – no decimals or dollar sign.
For Example – enter $800.00 as “800”
Q6. Comparing the New Minimum Cost Per Event on Test Set Data
When you compared AUC for the Training and Test Sets, all that was necessary was to look up the two different values in Cell G8. But to get an accurate measure of the cost savings from using the original model on new data, you cannot simply use the new threshold that results in the overall lowest cost-per-event on the Test Set.
Remember that your model is being tested for its ability to forecast – but the new optimal threshold will be known only after the outcomes for the entire Test Set are known.
All you can use is the model you developed on the Training Set data and the threshold from the Training Set that you should have recorded when answering Question 4.
Question: At that same threshold score (NOT the threshold score that would minimize costs for the new Test Set, but the “old” threshold score that minimized costs on the Training Set) what is the cost per event on the test set?
Hint: Using the AUC Calculator Spreadsheet previously provided, locate the column on the Training Set data that has the lowest cost-per-event. That same column and threshold in the Test Set copy of the AUC Calculator will have a new cost-per-event, displayed in row 17. This is almost always higher than the minimum cost-per-event on the Training Set, and also higher than what the minimum cost-per-event would be on the Test Set if one could know the new optimal threshold in advance. This number is the actual cost per event when applying the model-and-threshold developed with the Training Set to the new Test Set data.
Note: for Coursera to interpret your answer correctly you must give your answer as an integer – no decimals or dollar sign.
For Example – enter $800.00 as “800”
Q7. Putting a Dollar Value on Your Model Plus the Data
Assume your Test Set cost-per-event results from Question 6 are sustainable long term.
Question: How much money does the bank save, per event, using your model and its data-inputs, instead of issuing credit cards to everyone who asks?
Hint: the cost of issuing credit cards to everyone (no model, no forecast) has been determined to be 25%*$5000 = $1,250 per event. Dollar value of the model-plus-data is the difference between $1,250 and your number.
Note: for Coursera to interpret your answer correctly you must give your answer as an integer – no decimals or dollar sign.
For Example – enter $800.00 as “800”
Q8. Payback Period for Your Model
Question: Given that it apparently cost the bank $750,000 to conduct the three-year experiment, and that the bank processes 1,000 credit card applicants per day on average, how many days will it take for the future savings to pay back the bank’s initial investment?
Give number rounded to the nearest day (integer value).
Hint: multiply your answer to Question 7 – the cost savings per applicant – by 1000 to get the savings per day.
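The steps in the hint, sketched with a placeholder savings figure ($400 per event is purely illustrative; substitute your own Question 7 answer):

```python
# Payback period: days until per-applicant savings recoup the $750,000
# experiment, at 1,000 applicants per day. `savings_per_event` is a
# hypothetical placeholder for the Question 7 answer.
import math

investment = 750_000
applicants_per_day = 1000
savings_per_event = 400                              # hypothetical; use your Q7 value

savings_per_day = savings_per_event * applicants_per_day
payback_days = math.ceil(investment / savings_per_day)   # round up to whole days
```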
Q9. Any model that is reducing uncertainty will have a True Positive Rate…
- …Less than the Test Incidence (% of outcomes classified as “default”)
- …Greater than the Test Incidence (% of outcomes classified as “default”)
- …Equal to the Test Incidence (% of outcomes classified as “default”)
Q10. Given that the base rate of default in the population is 25%, any test that is reducing uncertainty will have a Positive Predictive Value (PPV)…
- …Equal to .25
- …Greater than .25
- …Less than .25
Q11. Given that the base rate of default in the population is 25%, any test that is reducing uncertainty will have a Negative Predictive Value (NPV)…
- …Equal to .75
- …Less than .75
- …Greater than .75
Q12. Confusion Matrix Metrics. To determine all performance metrics for a binary classification, it is sufficient to have three values:
- The Condition Incidence (here the default rate of 25%)
- The probability of True Positives (the True Positive rate multiplied by the Condition Incidence)
- The “Test Incidence” (also called “classification incidence” – the sum of the probability of True Positives and False Positives)
These three values can all be obtained from the AUC Calculator Spreadsheet and then used as inputs to the Information Gain Calculator Spreadsheet to determine all other performance metrics.
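The subtraction logic is simple enough to sketch; the three input probabilities below are illustrative, not your model's actual numbers:

```python
# From (1) condition incidence, (2) probability of a True Positive, and
# (3) test incidence, every confusion-matrix cell follows by subtraction,
# and every standard metric follows from the cells.
def confusion_metrics(p_condition, p_true_positive, p_test_incidence):
    p_fp = p_test_incidence - p_true_positive   # classified positive, actually negative
    p_fn = p_condition - p_true_positive        # classified negative, actually positive
    p_tn = 1.0 - p_true_positive - p_fp - p_fn
    return {
        "TPR": p_true_positive / p_condition,        # sensitivity
        "FPR": p_fp / (1.0 - p_condition),
        "PPV": p_true_positive / p_test_incidence,   # precision
        "NPV": p_tn / (1.0 - p_test_incidence),
    }

# Hypothetical values: base rate 25%, P(TP) = 18%, test incidence 30%.
m = confusion_metrics(p_condition=0.25, p_true_positive=0.18, p_test_incidence=0.30)
```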
Question: What is your model’s True Positive Rate?
Q13. Save this answer as it will be needed again for Part 3 (Quiz 3)
Question: What is your model’s “test incidence”?
Save this answer as it will be needed again for Part 3 (Quiz 3)
Part 2: Should the Bank Buy Third-Party Credit Information?
Q1. Introduction
Part 2 is intended to illustrate how binary classification performance metrics make it possible for you to put an exact value, in dollars per event, on new information that relates to a predictive model.
Note that new information will be worth far more if it is compared to no forecasting model rather than the state of partial knowledge available from the current model. Sellers of information (and data science consultants!) love to take credit for any information gain they achieve over the base rate.
Very often some intermediate state of knowledge is already available for which no additional spending is required. Evaluating the realistic incremental financial gain from new information, whether licensing a third-party commercial database or collecting new data internally, is therefore of great practical value, as this sets an upper bound on what your Company should be willing to pay to license or create the new information.
In this case study, your boss has been in discussions with an advanced machine-learning credit-risk analytics company that claims to score individual probability of default with very high information gain. Let’s call the company Eggertopia. Eggertopia sales representatives claim their pre-processed risk scores can achieve AUC values of .85 or even higher. However, Eggertopia scores are sold per event, and they are expensive!
Your boss asks you to determine the incremental financial value to the bank of purchasing Eggertopia risk scores on future credit-card applicants.
Eggertopia agrees to apply its algorithms to generate credit scores for the 400 individuals in the Training and Test Sets. Eggertopia scores do not need to be combined with anything else to make a model. However, since the scores range from approximately -600 (best credit risk) to 4900 (most likely to default), they will need to be standardized and adjusted to fit the -3.5 to 3.5 range of the AUC Calculator Spreadsheet.
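A minimal sketch of that standardization, assuming plain z-scoring with a final clip to ±3.5 (the six raw scores below are hypothetical):

```python
# Z-score standardization squeezes the raw Eggertopia range (about -600 to
# 4900) toward the -3.5..3.5 window the AUC Calculator expects; clipping is
# a final guard against extreme outliers.
import statistics

raw_scores = [-600, -150, 300, 1200, 2500, 4900]   # hypothetical sample
mu = statistics.mean(raw_scores)
sd = statistics.pstdev(raw_scores)                 # population standard deviation

standardized = [max(-3.5, min(3.5, (s - mu) / sd)) for s in raw_scores]
```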
You will determine the sustainable AUC of the Eggertopia scores, the sustainable cost-per-event, and the savings per event, when comparing Eggertopia data to the base rate forecast.
You will then calculate the incremental savings per event if you compare use of Eggertopia data to use of your current model developed in Part 1.
Question: What is the AUC of the Eggertopia Scores on the Training Set? Give your answer to two digits to the right of the decimal point.
- .95
- .85
- .88
- .83
Q2. What is the optimum threshold on the training set to minimize the average cost per test?
- .2
- .25
- .1
- .15
Q3. What is the average cost-per-event at the Training Set optimum threshold?
- $500
- $640
- $600
- $540
Q4. What is the AUC of the Eggertopia scores on the Test Set?
- .80
- .85
- .75
- .88
Q5. Using the same threshold as used on the training set, what is the cost per event of the Eggertopia scores on the Test Set? Round to the nearest dollar.
- $803
- $833
- $823
- $838
Q6. If the bank did not have your model, or any other way of forecasting default, what is the maximum (break-even) price per event that the bank could theoretically pay for Eggertopia scores? In other words, what are Eggertopia’s scores’ absolute savings-per-event?
Hint: Calculate the difference between the cost-per-event at a 25% default rate, and the cost-per-event using Eggertopia scores
- $423
- $418
- $412
- $425
Q7. What is the True Positive rate of the forecasting model using Eggertopia Scores?
- .70
- .72
- .74
- .76
Q8. What is the Positive Predictive Value (PPV) of the forecasting model using Eggertopia scores?
Hint: To calculate the PPV, divide the proportion of True Positives by the total proportion of Positive classifications. Review the confusion matrix definitions and letter designations on the Information Gain Calculator Spreadsheet [PPV is defined at Cell G41], obtain the True Positive and False Positive rates from the AUC Calculator Spreadsheet, and use algebra to solve.
- .54
- .48
- .52
- .50
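The algebra the hint describes, written out (with hypothetical TPR and FPR values standing in for the AUC Calculator's):

```python
# PPV from rates and the base rate p:
#   PPV = P(TP) / P(classified positive)
#       = (TPR * p) / (TPR * p + FPR * (1 - p)).
# The TPR and FPR below are illustrative assumptions, not the model's numbers.
p = 0.25      # base rate of default
tpr = 0.72    # hypothetical True Positive rate
fpr = 0.16    # hypothetical False Positive rate

ppv = (tpr * p) / (tpr * p + fpr * (1 - p))
```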
Q9. Incremental Financial Value of Eggertopia Scores
You calculated a cost per event for your own predictive model on Test Set data to answer Quiz 1 – Part 1, Question 6.
Question: Assuming that the performance of the Eggertopia model and your model both remain stable on any future data (a big assumption), what is the maximum, or break-even, price that the bank could pay per score for Eggertopia, given that it already has your model and data?
Get All Course Quiz Answers of Excel to MySQL: Analytic Techniques for Business Specialization
Business Metrics for Data-Driven Companies Quiz Answers
Mastering Data Analysis in Excel Coursera Quiz Answers
Data Visualization and Communication with Tableau Quiz Answers
Managing Big Data with MySQL Coursera Quiz Answers