All Weeks Machine Learning With Big Data Coursera Quiz Answers
Want to make sense of the volumes of data you have collected? Need to incorporate data-driven decisions into your process? This course provides an overview of machine learning techniques to explore, analyze, and leverage data. You will be introduced to tools and algorithms you can use to create machine learning models that learn from data and scale those models up to big data problems.
Quiz 1: Machine Learning Overview
- Learning from data
- Explicit, step-by-step programming
- Data-driven decisions
- Discover hidden patterns
- Cluster Analysis
- Association Analysis
- Algorithm Prediction
- classification and regression
- regression and association analysis
- classification and cluster analysis
- cluster analysis and association analysis
- the target is unlabeled.
- the target is unknown or unavailable.
- the target is provided.
- the target is what is being predicted.
- Acquire -> Prepare -> Analyze -> Report -> Act
- Acquire -> Prepare -> Analyze -> Act -> Report
- Prepare -> Acquire -> Analyze -> Report -> Act
- Prepare -> Acquire -> Analyze -> Act -> Report
- The first two steps, Acquire and Prepare, are apply-once, and the other steps are iterative.
- we acquire as well as explore the data that is related to the problem.
- we define the problem or opportunity to be addressed.
- we prepare the data for analysis.
- KNIME requires programming, while Spark MLlib does not.
- KNIME requires programming in Java, while Spark MLlib requires programming in Python.
- KNIME is a graphical user interface-based machine learning tool, while Spark MLlib provides a programming-based distributed platform for scalable machine learning algorithms.
- KNIME originated in Germany, while Spark MLlib was created in California, USA.
Quiz 2: Data Exploration
- A sample is an instance or example of an entity in your data.
- All of these statements are true.
- A sample can have many variables to describe it.
- A variable describes a specific characteristic of an entity in your data.
- categorical, nominal
- feature, column, attribute
- sample, row, observation
- numerical, quantitative
- To gain a better understanding of your data.
- To gather your data into one repository.
- To digitize your data.
- To generate labels for your data.
- Summary statistics
- mean, median, mode
- data sources, data locations
- standard deviation, range, variation
- skewness, kurtosis
- Contingency Table
- To show and compare distribution values
- To show data distribution shapes such as asymmetry and skewness.
- To show correlations between two variables.
- Is more important than summary statistics for data exploration
- Should be used with summary statistics for data exploration.
- Is useful for communicating results.
- Provides an intuitive way to look at data.
Quiz 3: Data Exploration in KNIME and Spark
Quiz 4: Data Preparation
- Inconsistent data
- Scaled data
- Missing values
- Duplicate data
- replace missing values with something reasonable.
- drop samples with missing values.
- replace missing values with outliers.
- merge samples with missing values.
- Invalid data
- Inconsistent data
- Simply discard the samples that lie significantly outside the distribution of your data
- Drop samples with missing values
- Merge duplicate records while retaining relevant data
- None of these
- Adding an in-state feature based on an applicant’s home state.
- Re-formatting an address field into separate street address, city, state, and zip code fields.
- Removing a feature with a lot of missing values.
- Replacing a missing value with the variable mean.
- Feature set with the smallest set of features that best capture the characteristics of the data for the intended application
- Feature set with the smallest number of features
- Feature set with the largest number of features
- Feature set that contains exclusively re-coded features
- mean = 0 and standard deviation = 0
- mean = 1 and standard deviation = 0
- mean = 0 and standard deviation = 1
- mean = 1 and standard deviation = 1
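Scaling a feature to mean 0 and standard deviation 1 is usually done with z-score normalization. A minimal pure-Python sketch (the sample values are illustrative):

```python
def zscore(values):
    """Standardize values to mean 0 and standard deviation 1 (z-score)."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]

scaled = zscore([10.0, 20.0, 30.0, 40.0])
mean_scaled = sum(scaled) / len(scaled)
std_scaled = (sum(v ** 2 for v in scaled) / len(scaled)) ** 0.5
```

After scaling, `mean_scaled` is 0 and `std_scaled` is 1, which is what the correct answer above describes.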
- PCA stands for principal component analysis
- PC1 and PC2, the first and second principal components, respectively, are always orthogonal to each other.
- PC1, the first principal component, captures the largest amount of variance in the data along a single dimension.
- PCA is a dimensionality reduction technique that removes a feature that is very correlated with another feature.
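The two true properties above — PC1 capturing the most variance and PC1 ⟂ PC2 — can be verified for the two-feature case with a small pure-Python eigen-decomposition of the covariance matrix (the data points are made up for illustration):

```python
import math

def pca_2d(xs, ys):
    """PCA for two features via the eigen-decomposition of the 2x2 covariance matrix."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Population covariance matrix [[sxx, sxy], [sxy, syy]]
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    # Eigenvalues from the characteristic polynomial of a 2x2 symmetric matrix
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    l1 = tr / 2 + math.sqrt(tr * tr / 4 - det)  # variance along PC1 (largest)
    l2 = tr / 2 - math.sqrt(tr * tr / 4 - det)  # variance along PC2
    # Unit eigenvector for l1; PC2 is orthogonal to it by construction
    v1 = (sxy, l1 - sxx) if abs(sxy) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(v1[0], v1[1])
    v1 = (v1[0] / norm, v1[1] / norm)
    v2 = (-v1[1], v1[0])
    return (l1, v1), (l2, v2)

(l1, pc1), (l2, pc2) = pca_2d([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.9, 5.1])
```

Here `l1 >= l2` (PC1 captures the larger variance) and the dot product of `pc1` and `pc2` is 0 (orthogonality).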
Quiz 5: Handling Missing Values in KNIME and Spark
- They are all the same
- Because the missing values in each column are not necessarily in the same row
- Because rows with missing values as well as rows with 0s are removed
- Because rows with missing values as well as rows with duplicate values are removed
Quiz 6: Classification
- Classification is a supervised task.
- Classification is an unsupervised task.
- In a classification problem, the target variable has only two possible outcomes.
- Testing phase
- Training phase
- Data preparation phase
- Model parameters are constant throughout the modeling process.
- naive bayes
- none of the above
- decision tree
- the number of samples in the dataset
- the number of nearest neighbors to consider in classifying a sample
- the distance between neighbors: All neighboring samples that are ‘k’ distance apart from the sample are considered in classifying that sample.
- the number of training datasets
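As the correct answer states, `k` is the number of nearest training samples that vote on a query's label. A minimal sketch with made-up 2-D points:

```python
from collections import Counter

def knn_classify(train, query, k):
    """Classify `query` by majority vote among the k nearest training samples.
    `train` is a list of ((x, y), label) pairs."""
    by_distance = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
label = knn_classify(train, (0.5, 0.5), k=3)  # the 3 nearest samples are all "A"
```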
Q5. During the construction of a decision tree, there are several criteria that can be used to determine when a node should no longer be split into subsets. Which one of the following is NOT applicable?
- The tree depth reaches a maximum threshold.
- The number of samples in the node reaches a minimum threshold.
- All (or X% of) samples have the same class label.
- The value of the Gini index reaches a maximum threshold.
- You want to split the data in a node into subsets that are as homogeneous as possible
- All of these statements are true of tree induction.
- An impurity measure is used to determine the best split for a node.
- For each node, splits on all variables are tested to determine the best split for the node.
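The Gini index mentioned in Q5 is a common impurity measure: a split is good when the resulting child nodes are homogeneous, i.e. their weighted Gini impurity is low (not when Gini reaches a maximum). A small sketch with hypothetical labels:

```python
def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions (0 = pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def weighted_gini(left, right):
    """Impurity of a candidate split, weighted by child-node sizes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

pure = gini(["yes", "yes", "yes"])                    # 0.0: all one class
mixed = gini(["yes", "no", "yes", "no"])              # 0.5: maximally impure for two classes
split = weighted_gini(["yes", "yes"], ["no", "no"])   # 0.0: a perfect split
```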
- The full Bayes’ Theorem is not used. The ‘naive’ in Naive Bayes indicates that a simplified version of Bayes’ Theorem is used.
- Bayes’ Theorem makes estimating the probabilities easier. The ‘naive’ in the classifier’s name comes from this ease of probability calculation.
- The model assumes that the input features are statistically independent of one another. The ‘naive’ in the classifier’s name comes from this naive assumption.
- assuming that the prior probabilities of all classes are independent of one another.
- assuming that classes are independent of the input features.
- ignoring the prior probabilities altogether.
- allowing the probability of each feature given the class to be estimated individually.
Quiz 7: Classification in KNIME and Spark
Q1. KNIME: Considering the Numeric Binner node, what would happen if the humidity_low bin definition were changed from `] -infinity ... 25.0 [` to `] -infinity ... 25.0 ]` (i.e., the last bracket changed from `[` to `]`)?
- The definition for the humidity_low bin would change from excluding 25.0 to including 25.0
- The definition for the humidity_low bin would change from having 25.0 as the endpoint to having 25.1 as the endpoint
- Nothing would change
Q2. KNIME: Considering the Numeric Binner node again, what would happen if the “Append new column” box is not checked?
- The relative_humidity_3pm variable will become a categorical variable
- The relative_humidity_3pm variable will remain unchanged, and a new unnamed categorical variable will be created
- The relative_humidity_3pm variable will become undefined, and an error will occur
Q3. KNIME: How many samples had a missing value for air_temp_9am before missing values were addressed?
Q4. KNIME: How many samples were placed in the test set after the dataset was partitioned into training and test sets?
Q5. KNIME: What are the target and predicted class labels for the first sample in the test set?
- Both are humidity_not_low
- Target class label is humidity_not_low, and predicted class label is humidity_low
- Target class label is humidity_low, and predicted class label is humidity_not_low
Q6. Spark: What values are in the number column?
- Integer values starting at 0
- Time and date values
- Random integer values
Q7. Spark: With the original dataset split into 80% for training and 20% for test, how many of the first 20 samples from the test set were correctly classified?
Q8. Spark: If we split the data using 70% for training data and 30% for test data, how many samples would the training set have (using seed 13234)?
Quiz 8: Model Evaluation
- The model is overfitting.
- The model does a good job of fitting to the noise in the data.
- The model performs well on data not used in training.
- The model performs well on data used to adjust its parameters.
- High training error and low generalization error
- Low training error and high generalization error
- High training error and high generalization error
- Low training error and low generalization error
- None of these
- Pre-pruning and post-pruning
- leave-one-out cross-validation
- random sub-sampling
- k-fold cross-validation
- All of these
- The test set is used to evaluate model performance on new data.
- The validation set is used to determine when to stop training the model.
- The training set is used to adjust the parameters of the model.
- The test set is used for model selection to avoid overfitting.
- Add the number of true positives and the number of false negatives.
- Divide the number of true positives by the number of true negatives.
- Divide the number of correct predictions by the total number of predictions
- Subtract the number of correct predictions from the total number of predictions.
- precision and recall
- precision and accuracy
- accuracy and error
- precision and error
- Divide the sum of the diagonal values in the confusion matrix by the sum of the off-diagonal values.
- Divide the sum of all the values in the confusion matrix by the total number of samples.
- Divide the sum of the diagonal values in the confusion matrix by the total number of samples.
- Divide the sum of the off-diagonal values in the confusion matrix by the total number of samples.
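The correct answer — diagonal sum divided by total samples — can be sketched directly in Python (the confusion-matrix counts are hypothetical):

```python
def accuracy_from_confusion(matrix):
    """matrix[i][j] = number of samples of true class i predicted as class j.
    Accuracy is the sum of the diagonal divided by the total number of samples."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Hypothetical binary confusion matrix: rows = actual, columns = predicted
cm = [[87, 14],
      [26, 83]]
acc = accuracy_from_confusion(cm)  # (87 + 83) / 210
```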
Quiz 9: Model Evaluation in KNIME and Spark
- the target class label
- the predicted class label
- the only input variable that is categorical
- low_humidity_day is the target class label, and Prediction(low_humidity_day) is the predicted class label
- low_humidity_day is the predicted class label, and Prediction(low_humidity_day) is the target class label
- There is no difference. The two are the same
- that the target class label for the sample is humidity_not_low
- that the target class label for the sample is humidity_low
- that the predicted class label for the sample is humidity_not_low
- that the predicted class label for the sample is humidity_low
- change the color settings in the Color Manager node
- change the color settings in the Interactive Table dialog
- It is not possible to change these colors
- the RowID values are from the original dataset, and only the test samples are displayed here
- the samples are randomly ordered in the table
- only a few samples from the test set are randomly selected and displayed here
- `print("Error = %g " % (1.0 - accuracy))` [X]
- `evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="error")` followed by `error = evaluator.evaluate(1 - predictions)`
- `print("Accuracy = %.2g" % (accuracy * 100))` [X]
- `print("Accuracy = %100g" % (accuracy))`
- `print("Accuracy = %100.2g" % (accuracy))`
- `array([[87., 14.], [26., 83.]])` [X]
- `array([[83., 26.], [14., 87.]])`
- `array([[83., 87.], [14., 26.]])`
Quiz 10: Regression, Cluster Analysis, and Association Analysis
- In classification, you’re predicting a number, and in regression, you’re predicting a category.
- There is no difference since you’re predicting a numeric value from the input variables in both tasks.
- In classification, you’re predicting a category, and in regression, you’re predicting a number.
- In classification, you’re predicting a categorical variable, and in regression, you’re predicting a nominal variable.
- Predicting the price of a stock
- Estimating the amount of rain
- Determining whether power usage will rise or fall
- Predicting the demand for a product
- Determine the distance between two pairs of samples.
- Determine whether the target is categorical or numerical.
- Determine the regression line that best fits the samples.
- Determine how to partition the data into training and test sets.
- In simple linear regression, the input has only categorical variables. In multiple linear regression, the input can be a mix of categorical and numerical variables.
- In simple linear regression, the input has only one variable. In multiple linear regression, the input has more than one variable.
- In simple linear regression, the input has only categorical variables. In multiple linear regression, the input has only numerical variables.
- They are the just different terms for linear regression with one input variable.
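Simple linear regression fits a line y = b0 + b1·x to a single input variable by least squares; with more than one input it becomes multiple linear regression. A closed-form sketch for the one-variable case (illustrative data):

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept

# Points lying exactly on y = 2x + 1, so the fit recovers slope 2, intercept 1
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```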
- To segment data so that differences between samples in the same cluster are maximized and differences between samples of different clusters are minimized.
- To segment data so that all samples are evenly divided among the clusters.
- To segment data so that all categorical variables are in one cluster, and all numerical variables are in another cluster.
- To segment data so that differences between samples in the same cluster are minimized and differences between samples of different clusters are maximized.
- Determine anomalous samples
- Segment the data into groups so that each group can be analyzed further
- Classify new samples
- Create labeled samples for a classification task
- All of these choices are valid uses of the resulting clusters.
- The mean of all the samples in the two closest clusters.
- The mean of all the samples in the cluster
- The mean of all the samples in the two farthest clusters.
- The mean of all the samples in all clusters
- Assign each sample to the closest centroid, then calculate the new centroid.
- Calculate the centroids, then determine the appropriate stopping criterion depending on the number of centroids.
- Calculate the distances between the cluster centroids, then find the two closest centroids.
- Count the number of samples, then determine the initial centroids.
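One k-means iteration assigns each sample to its closest centroid and then recomputes each centroid as the mean of its assigned samples. A 1-D sketch with illustrative values:

```python
def kmeans_step(samples, centroids):
    """One k-means iteration on 1-D data: assign each sample to the nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    clusters = {c: [] for c in centroids}
    for s in samples:
        nearest = min(centroids, key=lambda c: abs(s - c))
        clusters[nearest].append(s)
    return [sum(members) / len(members) for members in clusters.values() if members]

samples = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
converged = kmeans_step(samples, [2.0, 11.0])   # centroids already at the cluster means
moved = kmeans_step(samples, [0.0, 5.0])        # centroids move toward the two groups
```

When the centroids are already the cluster means, the step leaves them unchanged, which is the usual stopping criterion.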
- To find the most complex rules to explain associations between as many items as possible in the data.
- To find the number of outliers in the data
- To find rules to capture associations between items or events
- To find the number of clusters for cluster analysis
- A transaction or set of items that occur together
- A set of transactions that occur a certain number of times in the data
- A set of items that two rules have in common
- A set of items that infrequently occur together
- Captures the frequency of that item set
- Captures how many times that item set is used in a rule
- Captures the number of items in that item set
- Captures the correlation between the items in that item set
- Identify frequent item sets
- Determine the rule with the most items
- Measure the intuitiveness of a rule
- Prune rules by eliminating rules with low confidence
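Support and confidence can be computed directly from transaction counts: support(X) is the fraction of transactions containing X, and confidence(X → Y) = support(X ∪ Y) / support(X). A sketch over made-up market-basket data:

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

def confidence(transactions, antecedent, consequent):
    """confidence(X -> Y) = support(X and Y together) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s = support(baskets, {"bread", "milk"})       # 2 of 4 baskets contain both
c = confidence(baskets, {"bread"}, {"milk"})  # 2 of the 3 bread baskets also have milk
```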
Quiz 11: Cluster Analysis in Spark
- 157812 / 158726 = 99.4%
- 157237 / 158726 = 99.1%
- There is not enough information to determine this
- Since the values of the features are on different scales, all features need to be scaled so that all values will be positive.
- Since the values of the features are on different scales, all features need to be scaled so that no one feature dominates the clustering results.
- Since the values of the features are on different scales, all features need to be scaled so that the cluster centers can be displayed on the same plot for easier analysis.
`kmeans = KMeans(k=12, seed=1)`
What is the significance of `seed=1`?
- This sets the seed to a specific value, which is necessary to reproduce the k-means results
- This means that this is the first iteration of k-means. The seed value is incremented by 1 every time k-means is executed
- This specifies that the first cluster centroid is set to sample #1
- Cluster 4
- Cluster 3
- Cluster 9
- They capture weather patterns associated with warm and dry days
- They capture weather patterns associated with high air pressure
- They capture weather patterns associated with very strong winds
- Cluster 12
- Cluster 1
- Cluster 16
Q8. We did not include the minimum wind measurements in the analysis since they are highly correlated with the average wind measurements. What is the correlation between min_wind_speed and avg_wind_speed (to two decimals)? (Compute this using one-tenth of the original dataset, and dropping all rows with missing values.)
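The value asked for in Q8 is the Pearson correlation coefficient. The course dataset is not reproduced here, so the sketch below uses illustrative numbers to show the computation only:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical wind readings: the two series move together, so r is close to 1
r = pearson([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
```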
Machine Learning With Big Data Course Review:
Based on our experience, we suggest you enroll in the Machine Learning With Big Data course to gain new skills from professionals, completely free, and we assure you it will be worth it.
The Machine Learning With Big Data course is available on Coursera for free. If you get stuck on any quiz or graded assessment, just visit Networking Funda for the Machine Learning With Big Data Coursera quiz answers.
I hope these Machine Learning With Big Data Coursera quiz answers help you learn something new from this course. If they helped you, don’t forget to bookmark our site for more Coursera quiz answers.
This course is intended for audiences of all experience levels who are interested in learning new skills in a business context; there are no prerequisite courses.