## All Weeks Machine Learning With Big Data Coursera Quiz Answers

### Quiz 1: Machine Learning Overview

Q1. What is NOT machine learning?

- Learning from data
- **Explicit, step-by-step programming**
- Data-driven decisions
- Discover hidden patterns

Q2. Which of the following is NOT a category of machine learning?

- Cluster Analysis
- Classification
- Regression
- Association Analysis
- **Algorithm Prediction**

Q3. Which categories of machine learning techniques are supervised?

- **classification and regression**
- regression and association analysis
- classification and cluster analysis
- cluster analysis and association analysis

Q4. In unsupervised approaches,

- the target is unlabeled.
- **the target is unknown or unavailable.**
- the target is provided.
- the target is what is being predicted.

Q5. What is the sequence of the steps in the machine learning process?

- **Acquire -> Prepare -> Analyze -> Report -> Act**
- Acquire -> Prepare -> Analyze -> Act -> Report
- Prepare -> Acquire -> Analyze -> Report -> Act
- Prepare -> Acquire -> Analyze -> Act -> Report

Q6. Are the steps in the machine learning process apply-once or iterative?

- Apply-once
- **Iterative**
- The first two steps, Acquire and Prepare, are apply-once, and the other steps are iterative.

Q7. Phase 2 of CRISP-DM is Data Understanding. In this phase,

- **we acquire as well as explore the data that is related to the problem.**
- we define the problem or opportunity to be addressed.
- we prepare the data for analysis.

Q8. What is the main difference between KNIME and Spark MLlib?

- KNIME requires programming, while Spark MLlib does not.
- KNIME requires programming in Java, while Spark MLlib requires programming in Python.
- **KNIME is a graphical user interface-based machine learning tool, while Spark MLlib provides a programming-based distributed platform for scalable machine learning algorithms.**
- KNIME originated in Germany, while Spark MLlib was created in California, USA.

### Quiz 2: Data Exploration

Q1. Which of these statements is true about samples and variables?

- A sample is an instance or example of an entity in your data.
- **All of these statements are true.**
- A sample can have many variables to describe it.
- A variable describes a specific characteristic of an entity in your data.

Q2. Other names for ‘variable’ are

- categorical, nominal
- **feature, column, attribute**
- sample, row, observation
- numerical, quantitative

Q3. What is the purpose of exploring data?

- **To gain a better understanding of your data.**
- To gather your data into one repository.
- To digitize your data.
- To generate labels for your data.

Q4. What are the two main categories of techniques for exploring data? Choose two.

- Histogram
- Outliers
- **Visualization**
- Trends
- Correlations
- **Summary statistics**

Q5. Which of the following are NOT examples of summary statistics?

- mean, median, mode
- **data sources, data locations**
- standard deviation, range, variation
- skewness, kurtosis

Q6. What are the two measures for measuring shape as mentioned in the lecture? Choose two.

- **Kurtosis**
- **Skewness**
- Contingency Table
- Range
- Mode
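
As a rough illustration of the two shape measures, the sketch below computes skewness and kurtosis from standardized moments using only the Python standard library. The function names and sample values are made up for illustration, and the population formulas (no sample correction) are an assumption; statistical packages may apply different corrections.

```python
from statistics import mean, pstdev

def skewness(values):
    """Third standardized moment: 0 for symmetric data, > 0 for a long right tail."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])

def kurtosis(values):
    """Fourth standardized moment (not excess kurtosis): larger means heavier tails."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 4 for v in values])

symmetric = [1, 2, 3, 4, 5]      # made-up values, symmetric around 3
right_skewed = [1, 1, 1, 2, 10]  # made-up values with one large value

print(skewness(symmetric), skewness(right_skewed))
print(kurtosis(symmetric))
```

Skewness captures asymmetry and kurtosis captures how heavy the tails are, which is why both count as measures of shape rather than of location or spread.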

Q7. Which of the following would NOT be a good reason to use a box plot?

- To show and compare distribution values
- To show data distribution shapes such as asymmetry and skewness.
- **To show correlations between two variables.**

Q8. All of the following are true about data visualization EXCEPT

- **Is more important than summary statistics for data exploration**
- Should be used with summary statistics for data exploration.
- Is useful for communicating results.
- Provides an intuitive way to look at data.

### Quiz 3: Data Exploration in KNIME and Spark

Q1. What is the maximum of the average wind speed measurements at 9 am (to 2 decimal places)?

- **23.55**
- 29.84
- 5.50
- 4.55

Q2. How many rows containing rain accumulation at 9 am measurements have missing values?

- **6**
- 4
- 3
- 2

Q3. What is the correlation between the relative humidity at 9 am and at 3 pm (to 2 decimal places, and without removing or imputing missing values)?

- **0.88**
- 1.00
- -0.45
- 0.19

Q4. If the histogram for the air temperature at 9 am has 50 bins, what is the number of elements in the bin with the most elements (without removing or imputing missing values)?

- **57**
- 224
- 49
- 166

Q5. What is the approximate maximum max_wind_direction_9am when the maximum max_wind_speed_9am occurs?

- **70**
- 30
- 312

### Quiz 4 – Data Preparation

Q1. Which of the following is NOT a data quality issue?

- Inconsistent data
- **Scaled data**
- Missing values
- Duplicate data

Q2. Imputing missing data means to

- **replace missing values with something reasonable.**
- drop samples with missing values.
- replace missing values with outliers.
- merge samples with missing values.

Q3. A data sample with values that are considerably different than the rest of the other data samples in the dataset is called an/a _____________.

- **Outlier**
- Invalid data
- Noise
- Inconsistent data

Q4. Which one of the following examples illustrates the use of domain knowledge to address a data quality issue?

- Simply discard the samples that lie significantly outside the distribution of your data
- Drop samples with missing values
- **Merge duplicate records while retaining relevant data**
- None of these

Q5. Which of the following is NOT an example of feature selection?

- Adding an in-state feature based on an applicant’s home state.
- Re-formatting an address field into separate street address, city, state, and zip code fields.
- Removing a feature with a lot of missing values.
- **Replacing a missing value with the variable mean.**

Q6. Which one of the following is the best feature set for your analysis?

- **Feature set with the smallest set of features that best capture the characteristics of the data for the intended application**
- Feature set with the smallest number of features
- Feature set with the largest number of features
- Feature set that contains exclusively re-coded features

Q7. The mean value and the standard deviation of a zero-normalized feature are

- mean = 0 and standard deviation = 0
- mean = 1 and standard deviation = 0
- **mean = 0 and standard deviation = 1**
- mean = 1 and standard deviation = 1
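
To make the zero-normalization answer concrete, here is a minimal sketch that rescales a feature by subtracting its mean and dividing by its standard deviation. The function name and the sample values are hypothetical, and the population standard deviation is an assumption.

```python
from statistics import mean, pstdev

def zero_normalize(values):
    """Rescale values so that the result has mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]

temps = [60.0, 65.0, 70.0, 75.0, 80.0]  # made-up feature values
scaled = zero_normalize(temps)

print(mean(scaled), pstdev(scaled))
```

Whatever the original units, the rescaled feature is centered at 0 with unit spread, which keeps features with large raw ranges from dominating distance-based methods.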

Q8. Which of the following is NOT true about PCA?

- PCA stands for principal component analysis
- PC1 and PC2, the first and second principal components, respectively, are always orthogonal to each other.
- PC1, the first principal component, captures the largest amount of variance in the data along a single dimension.
- **PCA is a dimensionality reduction technique that removes a feature that is very correlated with another feature.**

### Quiz 5 – Handling Missing Values in KNIME and Spark

Q1. If we remove all missing values from the data, how many air pressure at 9 am measurements have values between 911.736 and 914.67?

- **77**
- 287
- 80

Q2. If we impute the missing values with the minimum value, how many air temperature at 9 am measurements are less than 42.292?

- **28**
- 23
- 1
- 5

Q3. How many samples have missing values for air_pressure_9am?

- **3**
- 5
- 1092
- 0

Q4. Which column in the weather dataset has the most number of missing values?

- **rain_accumulation_9am**
- number
- They are all the same
- air_temp_9am

Q5. When we remove all the missing values from the dataset, the number of rows is 1064, yet the variable with the most missing values has 1089 rows. Why did the number of rows decrease so much?

- **Because the missing values in each column are not necessarily in the same row**
- Because rows with missing values as well as rows with 0s are removed
- Because rows with missing values as well as rows with duplicate values are removed
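
The effect described in Q5 can be reproduced on a toy table. The sketch below uses plain Python dictionaries with made-up values (only the column names come from the weather dataset) to show that dropping incomplete rows can remove more rows than any single column has missing values.

```python
# Made-up rows; None stands for a missing value.
rows = [
    {"air_temp_9am": 74.8, "rain_accumulation_9am": 0.0},
    {"air_temp_9am": None, "rain_accumulation_9am": 0.0},
    {"air_temp_9am": 71.4, "rain_accumulation_9am": None},
    {"air_temp_9am": 60.6, "rain_accumulation_9am": 0.2},
]

# Count missing values column by column.
missing_per_column = {
    col: sum(row[col] is None for row in rows) for col in rows[0]
}

# Keep only rows where every column has a value.
complete_rows = [row for row in rows if all(v is not None for v in row.values())]

print(missing_per_column)
print(len(rows) - len(complete_rows), "rows dropped")
```

Each column is missing one value, but the gaps sit in different rows, so two rows are dropped.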

### Quiz 6 – Classification

Q1. Which of the following is a TRUE statement about classification?

- **Classification is a supervised task.**
- Classification is an unsupervised task.
- In a classification problem, the target variable has only two possible outcomes.

Q2. In which phase are model parameters adjusted?

- Testing phase
- **Training phase**
- Data preparation phase
- Model parameters are constant throughout the modeling process.

Q3. Which classification algorithm uses a probabilistic approach?

- **naive bayes**
- none of the above
- decision tree
- k-nearest-neighbors

Q4. What does the ‘k’ stand for in k-nearest-neighbors?

- the number of samples in the dataset
- **the number of nearest neighbors to consider in classifying a sample**
- the distance between neighbors: All neighboring samples that are ‘k’ distance apart from the sample are considered in classifying that sample.
- the number of training datasets
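
A small sketch may help show what `k` controls. The classifier below is a bare-bones illustration, not the KNIME or Spark implementation; the function name, the two-feature points, and the labels are all hypothetical.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """Return the majority label among the k training samples closest to query.

    train is a list of (features, label) pairs.
    """
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up training samples: (feature vector, class label).
train = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((5.0, 5.0), "high"),
    ((5.5, 4.5), "high"),
    ((4.8, 5.2), "high"),
]

print(knn_predict(train, (1.1, 1.0), k=3))
print(knn_predict(train, (5.0, 4.9), k=3))
```

With `k=3`, each query is labeled by a vote among its three closest training samples, so `k` is a count of neighbors and not a distance.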

Q5. During the construction of a decision tree, there are several criteria that can be used to determine when a node should no longer be split into subsets. Which one of the following is NOT applicable?

- The tree depth reaches a maximum threshold.
- The number of samples in the node reaches a minimum threshold.
- All (or X% of) samples have the same class label.
- **The value of the Gini index reaches a maximum threshold.**

Q6. Which statement is true of tree induction?

- You want to split the data in a node into subsets that are as homogeneous as possible
- **All of these statements are true of tree induction.**
- An impurity measure is used to determine the best split for a node.
- For each node, splits on all variables are tested to determine the best split for the node.
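
To illustrate how an impurity measure ranks candidate splits, the following sketch computes the Gini index of a node and the weighted impurity of a two-way split. The helper names and labels are made up, and using Gini here is only one possible choice of impurity measure.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 0 when all labels agree, higher when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Impurity of a split: Gini of each subset weighted by its share of the samples."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

node = ["a", "a", "b", "b"]  # made-up class labels in one node

print(gini(node))                                  # mixed node
print(split_impurity(["a", "a"], ["b", "b"]))      # perfectly homogeneous subsets
print(split_impurity(["a", "b"], ["a", "b"]))      # split that separates nothing
```

Tree induction prefers the split with the lowest resulting impurity, which is why a node stops splitting when impurity is low enough and not when it reaches a maximum.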

Q7. What does ‘naive’ mean in Naive Bayes?

- The full Bayes’ Theorem is not used. The ‘naive’ in naive bayes specifies that a simplified version of Bayes’ Theorem is used.
- The Bayes’ Theorem makes estimating the probabilities easier. The ‘naïve’ in the name of classifier comes from this ease of probability calculation.
- **The model assumes that the input features are statistically independent of one another. The ‘naïve’ in the name of classifier comes from this naïve assumption.**

Q8. The feature independence assumption in Naive Bayes simplifies the classification problem by

- assuming that the prior probabilities of all classes are independent of one another.
- assuming that classes are independent of the input features.
- ignoring the prior probabilities altogether.
- **allowing the probability of each feature given the class to be estimated individually.**

### Quiz 7 – Classification in KNIME and Spark

Q1. KNIME: In configuring the Numeric Binner node, what would happen if the definition for the humidity_low bin is changed from

```
] -infinity ... 25.0 [
to
] -infinity ... 25.0 ]
```

(i.e., the last bracket is changed from [ to ])?

- **The definition for the humidity_low bin would change from excluding 25.0 to including 25.0**
- The definition for the humidity_low bin would change from having 25.0 as the endpoint to having 25.1 as the endpoint
- Nothing would change

Q2. KNIME: Considering the Numeric Binner node again, what would happen if the “Append new column” box is not checked?

- **The relative_humidity_3pm variable will become a categorical variable**
- The relative_humidity_3pm variable will remain unchanged, and a new unnamed categorical variable will be created
- The relative_humidity_3pm variable will become undefined, and an error will occur

Q3. KNIME: How many samples had a missing value for air_temp_9am before missing values were addressed?

- **5**
- 3
- 0

Q4. KNIME: How many samples were placed in the test set after the dataset was partitioned into training and test sets?

- **213**
- 851
- 20

Q5. KNIME: What are the target and predicted class labels for the first sample in the test set?

- **Both are humidity_not_low**
- Target class label is humidity_not_low, and predicted class label is humidity_low
- Target class label is humidity_low, and predicted class label is humidity_not_low

Q6. Spark: What values are in the number column?

- **Integer values starting at 0**
- Time and date values
- Random integer values

Q7. Spark: With the original dataset split into 80% for training and 20% for test, how many of the first 20 samples from the test set were correctly classified?

- **19**
- 10
- 1

Q8. Spark: If we split the data using 70% for training data and 30% for test data, how many samples would the training set have (using seed 13234)?

- **730**
- 334
- 70

### Quiz 8 – Model Evaluation

Q1. A model that generalizes well means that

- The model is overfitting.
- The model does a good job of fitting to the noise in the data.
- **The model performs well on data not used in training.**
- The model performs well on data used to adjust its parameters.

Q2. What indicates that the model is overfitting?

- High training error and low generalization error
- **Low training error and high generalization error**
- High training error and high generalization error
- Low training error and low generalization error

Q3. Which method is used to avoid overfitting in decision trees?

- Post-pruning
- None of these
- Pre-pruning
- **Pre-pruning and post-pruning**

Q4. Which of the following best describes a way to create and use a validation set to avoid overfitting?

- leave-one-out cross-validation
- random sub-sampling
- k-fold cross-validation
- **All of these**

Q5. Which of the following statements is NOT correct?

- The test set is used to evaluate model performance on new data.
- The validation set is used to determine when to stop training the model.
- The training set is used to adjust the parameters of the model.
- **The test set is used for model selection to avoid overfitting.**

Q6. How is the accuracy rate calculated?

- Add the number of true positives and the number of false negatives.
- Divide the number of true positives by the number of true negatives.
- **Divide the number of correct predictions by the total number of predictions**
- Subtract the number of correct predictions from the total number of predictions.

Q7. Which evaluation metrics are commonly used for evaluating the performance of a classification model when there is a class imbalance problem?

- **precision and recall**
- precision and accuracy
- accuracy and error
- precision and error

Q8. How do you determine the classifier accuracy from the confusion matrix?

- Divide the sum of the diagonal values in the confusion matrix by the sum of the off-diagonal values.
- Divide the sum of all the values in the confusion matrix by the total number of samples.
- **Divide the sum of the diagonal values in the confusion matrix by the total number of samples.**
- Divide the sum of the off-diagonal values in the confusion matrix by the total number of samples.
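
The calculations behind Q6 to Q8 can be sketched on a small confusion matrix. The matrix values and function names below are made up, and the layout (rows are actual classes, columns are predicted classes) is an assumption that should be checked against whatever tool produced the matrix.

```python
def accuracy(matrix):
    """Sum of the diagonal values divided by the total number of samples."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def precision_recall(matrix, positive=1):
    """Precision and recall for one class, assuming rows = actual, columns = predicted."""
    true_positive = matrix[positive][positive]
    predicted_positive = sum(row[positive] for row in matrix)  # column total
    actual_positive = sum(matrix[positive])                    # row total
    return true_positive / predicted_positive, true_positive / actual_positive

# Made-up counts: 100 samples, two classes.
confusion = [[50, 10],
             [5, 35]]

acc = accuracy(confusion)
prec, rec = precision_recall(confusion)
print(acc, prec, rec)
```

Accuracy uses only the diagonal and the total, while precision and recall look at one class at a time, which is why they stay informative when one class heavily outnumbers the other.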

### Quiz 9 – Model Evaluation in KNIME and Spark

Q1. KNIME: In the confusion matrix as viewed in the Scorer node, low_humidity_day is:

- **the target class label**
- the predicted class label
- the only input variable that is categorical

Q2. KNIME: In the confusion matrix, what is the difference between low_humidity_day and Prediction(low_humidity_day)?

- **low_humidity_day is the target class label, and Prediction(low_humidity_day) is the predicted class label**
- low_humidity_day is the predicted class label, and Prediction(low_humidity_day) is the target class label
- There is no difference. The two are the same

Q3. KNIME: In the Table View of the Interactive Table, each row is color-coded. Blue specifies:

- **that the target class label for the sample is humidity_not_low**
- that the target class label for the sample is humidity_low
- that the predicted class label for the sample is humidity_not_low
- that the predicted class label for the sample is humidity_low

Q4. KNIME: To change the colors used to color-code each sample in the Table View of the Interactive Table node:

- **change the color settings in the Color Manager node**
- change the color settings in the Interactive Table dialog
- It is not possible to change these colors

Q5. KNIME: In the Table View of the Interactive Table, the values in RowID are not consecutive because:

- **the RowID values are from the original dataset, and only the test samples are displayed here**
- the samples are randomly ordered in the table
- only a few samples from the test set are randomly selected and displayed here

Q6. Spark: To get the error rate for the decision tree model, use the following code:

```
print ("Error = %g " % (1.0 - accuracy)) [X]
```

```
evaluator = MulticlassClassificationEvaluator(
labelCol="label",
predictionCol="prediction",
metricName="error")
```

```
error = evaluator.evaluate(1 - predictions)
```

Q7. Spark: To print out the accuracy as a percentage, use the following code:

```
print ("Accuracy = %.2g" % (accuracy * 100)) [X]
```

```
print ("Accuracy = %100g" % (accuracy))
```

```
print ("Accuracy = %100.2g" % (accuracy))
```

Q8. Spark: In the last line of code in Step 4, the confusion matrix is printed out. If the “transpose()” is removed, the confusion matrix will be displayed as:

```
array([[87., 14.], [X]
[26., 83.]])
```

```
array([[83., 26.],
[14., 87.]])
```

```
array([[83., 87.],
[14., 26.]])
```
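
The effect of removing `transpose()` can be checked on the numbers from the marked answer. This sketch uses plain nested lists and `zip` as a stand-in for the NumPy array shown in the quiz; the variable names are hypothetical.

```python
# Matrix as displayed without transpose(), taken from the marked answer.
without_transpose = [[87.0, 14.0],
                     [26.0, 83.0]]

# Transposing swaps rows and columns: element [i][j] moves to [j][i].
with_transpose = [list(column) for column in zip(*without_transpose)]

# The diagonal (correct predictions) is unchanged by the transpose.
diagonal_before = [without_transpose[i][i] for i in range(2)]
diagonal_after = [with_transpose[i][i] for i in range(2)]

print(with_transpose)
print(diagonal_before == diagonal_after)
```

Only the off-diagonal counts trade places, so the accuracy computed from the matrix is the same either way; what changes is which axis is read as the target and which as the prediction.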

### Quiz 10 – Regression, Cluster Analysis, & Association Analysis

Q1. What is the main difference between classification and regression?

- In classification, you’re predicting a number, and in regression, you’re predicting a category.
- There is no difference since you’re predicting a numeric value from the input variables in both tasks.
- **In classification, you’re predicting a category, and in regression, you’re predicting a number.**
- In classification, you’re predicting a categorical variable, and in regression, you’re predicting a nominal variable.

Q2. Which of the following is NOT an example of regression?

- Predicting the price of a stock
- Estimating the amount of rain
- **Determining whether power usage will rise or fall**
- Predicting the demand for a product

Q3. In linear regression, the least squares method is used to

- Determine the distance between two pairs of samples.
- Determine whether the target is categorical or numerical.
- **Determine the regression line that best fits the samples.**
- Determine how to partition the data into training and test sets.
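
For simple linear regression, the least squares line has a closed-form solution. The sketch below is a minimal illustration with one input variable; the function name and data points are made up.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for one input variable."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up samples that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

The fitted line minimizes the sum of squared vertical distances between the samples and the line, which is what "best fits" means in the answer above.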

Q4. How does simple linear regression differ from multiple linear regression?

- In simple linear regression, the input has only categorical variables. In multiple linear regression, the input can be a mix of categorical and numerical variables.
- **In simple linear regression, the input has only one variable. In multiple linear regression, the input has more than one variable.**
- In simple linear regression, the input has only categorical variables. In multiple linear regression, the input has only numerical variables.
- They are just different terms for linear regression with one input variable.

Q5. The goal of cluster analysis is

- To segment data so that differences between samples in the same cluster are maximized and differences between samples of different clusters are minimized.
- To segment data so that all samples are evenly divided among the clusters.
- To segment data so that all categorical variables are in one cluster, and all numerical variables are in another cluster.
- **To segment data so that differences between samples in the same cluster are minimized and differences between samples of different clusters are maximized.**

Q6. Cluster results can be used to

- Determine anomalous samples
- Segment the data into groups so that each group can be analyzed further
- Classify new samples
- Create labeled samples for a classification task
- **All of these choices are valid uses of the resulting clusters.**

- The mean of all the samples in the two closest clusters.
- **The mean of all the samples in the cluster**
- The mean of all the samples in the two farthest clusters.
- The mean of all the samples in all clusters

Q8. The main steps in the k-means clustering algorithm are

- **Assign each sample to the closest centroid, then calculate the new centroid.**
- Calculate the centroids, then determine the appropriate stopping criterion depending on the number of centroids.
- Calculate the distances between the cluster centroids, then find the two closest centroids.
- Count the number of samples, then determine the initial centroids.
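
The two alternating steps can be sketched in a few lines. This is a bare-bones illustration, not Spark's `KMeans`; the function name, the fixed iteration count, the starting centroids, and the points are all assumptions made for the example.

```python
from math import dist

def kmeans(points, centroids, iterations=10):
    """Alternate between assigning samples and recomputing centroids."""
    for _ in range(iterations):
        # Step 1: assign each sample to the closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            closest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
            clusters[closest].append(point)
        # Step 2: the new centroid is the mean of all samples in the cluster.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return centroids, clusters

# Made-up 2-D samples forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
print(centroids)
```

A real implementation would usually stop when the centroids no longer move instead of running a fixed number of iterations, and would choose the starting centroids randomly, which is where a seed comes in.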

Q9. The goal of association analysis is

- To find the most complex rules to explain associations between as many items as possible in the data.
- To find the number of outliers in the data
- **To find rules to capture associations between items or events**
- To find the number of clusters for cluster analysis

Q10. In association analysis, an item set is

- A transaction or set of items that occur together
- A set of transactions that occur a certain number of times in the data
- A set of items that two rules have in common
- A set of items that infrequently occur together

Q11. The support of an item set

- **Captures the frequency of that item set**
- Captures how many times that item set is used in a rule
- Captures the number of items in that item set
- Captures the correlation between the items in that item set

Q12. Rule confidence is used to

- Identify frequent item sets
- Determine the rule with the most items
- Measure the intuitiveness of a rule
- **Prune rules by eliminating rules with low confidence**
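
Support and confidence are easy to compute by hand on a small example. The transactions, function names, and the 0.7 threshold below are made up for illustration.

```python
# Made-up market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the item set."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

rules = [({"bread"}, {"milk"}), ({"beer"}, {"milk"})]
min_confidence = 0.7  # hypothetical pruning threshold

kept = [rule for rule in rules if confidence(*rule) >= min_confidence]
print(support({"bread", "milk"}), confidence({"bread"}, {"milk"}))
print(kept)
```

Support measures how often an item set occurs, and confidence is then used to discard rules that rarely hold when their left-hand side is present.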

### Quiz 11 – Cluster Analysis in Spark

Q1. What percentage of samples have 0 for rain_accumulation?

- **157812 / 158726 = 99.4%**
- 157237 / 158726 = 99.1%
- There is not enough information to determine this

Q2. Why is it necessary to scale the data (Step 4)?

- Since the values of the features are on different scales, all features need to be scaled so that all values will be positive.
- **Since the values of the features are on different scales, all features need to be scaled so that no one feature dominates the clustering results.**
- Since the values of the features are on different scales, all features need to be scaled so that the cluster centers can be displayed on the same plot for easier analysis.

Q3. If we wanted to create a data subset by taking every 5th sample instead of every 10th sample, how many samples would be in that subset?

- **317,452**
- 1,587,257
- 158,726
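
The subset sizes can be checked with simple arithmetic. The sketch below assumes the full dataset has 1,587,257 rows (a figure that appears among the options and is consistent with the 158,726 rows in the every-10th subset) and that sampling starts at the first row; the helper name is hypothetical.

```python
total_rows = 1_587_257  # assumed size of the full dataset

def subset_size(total, step):
    """Number of rows kept when taking every step-th row, starting at row 0."""
    return len(range(0, total, step))

every_10th = subset_size(total_rows, 10)
every_5th = subset_size(total_rows, 5)
print(every_10th, every_5th)
```

Halving the step roughly doubles the subset, with the exact count depending on how the remainder rounds.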

Q4. This line of code creates a k-means model with 12 clusters:

```
kmeans = KMeans (k=12, seed=1)
```

What is the significance of “seed=1”?

- **This sets the seed to a specific value, which is necessary to reproduce the k-means results**
- This means that this is the first iteration of k-means. The seed value is incremented by 1 every time k-means is executed
- This specifies that the first cluster centroid is set to sample #1

Q5. Just by looking at the values for the cluster centers, which cluster contains samples with the lowest relative humidity?

- **Cluster 4**
- Cluster 3
- Cluster 9

Q6. What do clusters 7, 8, and 11 have in common?

- **They capture weather patterns associated with warm and dry days**
- They capture weather patterns associated with high air pressure
- They capture weather patterns associated with very strong winds

Q7. If we perform clustering with 20 clusters (and seed = 1), which cluster appears to identify Santa Ana conditions (lowest humidity and highest wind speeds)?

- **Cluster 12**
- Cluster 1
- Cluster 16

Q8. We did not include the minimum wind measurements in the analysis since they are highly correlated with the average wind measurements. What is the correlation between min_wind_speed and avg_wind_speed (to two decimals)? (Compute this using one-tenth of the original dataset, and dropping all rows with missing values.)

