Data Engineering and Machine Learning using Spark Week 01 Quiz Answers
Graded Quiz: Spark for Data Engineering
Q1. Select the option where all four statements about streaming data characteristics are correct.
- Data is generated in finite, small batches; often originates from more than one source; is often available as a complete data set; requires incremental processing.
- Data is generated incrementally; often originates from more than one source; is unavailable as a complete data set; requires incremental processing.
- Data is generated incrementally; often originates from more than one source; is unavailable as a complete data set; requires batch processing.
- Data is generated continuously; often originates from more than one source; is unavailable as a complete data set; requires incremental processing.
Q2. Select the data sink option that is not fault-tolerant and that is recommended for debugging only.
- Console and Memory
- Files
- Foreach and ForeachBatch
- Kafka
Q3. Select the option that best completes the following statement:
Apache Spark Structured Streaming processes a data stream with the Spark SQL engine _______________.
- Extended SQL APIs
- Dataset and DataFrame APIs
- RDD APIs
- Structured Streaming specific APIs
Q4. Select the website where you can find and download the GraphFrames package.
- On the sparkpackages.org website
- On the spark-packages.org website
- On the Spark.com website
- On the GraphFrames.com website
Q5. Identify which options correctly describe a directed graph and an undirected graph. (Multiple answers)
- A directed graph contains edges with a single direction between two vertices, indicating a one-way relationship, illustrated using lines without arrows.
- Undirected graphs have edges representing a relationship without a direction, illustrated using lines with arrows.
- Undirected graphs have edges representing a relationship without a direction, illustrated using lines without arrows.
- A directed graph contains edges with a single direction between two vertices, indicating a one-way relationship, illustrated using lines with arrows.
Q6. Select the option that lists the correct order of these ETL workflow items.
Step 1: The first data processing step loads a Parquet file to create a DataFrame with a “Telephone number” column.
Step 2: Data stored in the “Telephone” column is cleaned and transformed into three columns to separate the country code, the area code, and the local phone number.
Step 3: A data processing step creates a second DataFrame with other information, such as age, from a database.
Step 4: These two DataFrames are joined and loaded into the data warehouse for further analysis.
- Step 4, Step 2, Step 1, Step 3
- Step 1, Step 3, Step 2, Step 4
- Step 1, Step 4, Step 3, Step 2
- Step 1, Step 2, Step 3, Step 4
Q7. Select the answers that define and describe Graph Theory. (Multiple answers)
- Graph theory for Apache Spark is the study of graphs generated from parametric specifications.
- The graph is a construct that contains a set of vertices with pairwise edges that connect one vertex to another.
- The graph is a construct that contains an X, Y, and Z-axis.
- Graph theory is the mathematical study of modeling pairwise relationships between objects.
Q8. Select the options that define watermarking. (Multiple answers)
- Updates results after initial data processing.
- Enables the inclusion of late-arriving data stream processing
- Is the process that manages and tags first-arriving data
- Is the process that manages late data
Q9. Select the statements that are true about using GraphFrames. (Multiple answers)
- Is ideal for modeling data with connecting relationships and computes relationship strength and direction
- Provides one DataFrame for graph vertices and one DataFrame for edges that can be used with SparkSQL for analysis
- Comes with popular built-in graph algorithms for use with the edge and vertex DataFrames
- Performs Motif finding, which searches the graph for structural patterns. Motif finding is supported in GraphFrames with the `find()` method that uses domain specific language (DSL) to specify the search query in terms of edges and vertices.
Q10. Select the built-in data sources from which Spark can extract data.
- Parquet
- JDBC
- Microsoft Excel
- Apache ORC
Data Engineering and Machine Learning using Spark Week 02 Quiz Answers
Graded Quiz: SparkML
Q1. Select the best definition of a machine learning system.
- A machine learning system consists of already trained data models that predict results on previously unseen data.
- A machine learning system trains data models and uses that information to calculate results on the known data.
- A machine learning system consists of already trained data models that predict results on known data.
- A machine learning system applies a specific machine learning algorithm to train data models. After training the model, the system infers or “predicts” results on previously unseen data.
Q2. Which of the following options are true about Spark ML inbuilt utilities?
- Spark ML inbuilt utilities includes a statistics package.
- Spark ML utilities help during the intermediate steps of data processing, cleaning, and building models.
- Spark ML inbuilt utilities includes a linear algebra package.
- Spark ML inbuilt utilities includes the Feature module.
Q3. Select the statements that are true about Spark’s support for machine learning data sources.
- Has standard libraries to support images and LIBSVM data types
- Supports both feature vector and label column data
- LIBSVM loads the "libsvm" data files and creates a DataFrame with two columns including the feature vector and label.
- Images are not a common data source
Q4. How do you perform supervised machine learning classification on Apache Spark?
- The Spark ML library provides the spark.ml.classification library for classifications.
- The Spark ML library provides the spark.classification library for classifications
- The Spark ML library provides the spark.ml.regression library for regressions
- The Spark ML library provides the spark.regression.library for regressions
Q5. Select the statements that are true for classification using Apache Spark.
- Classification is a form of an implicit function approximation where the model predicts real valued outputs for a given input.
- Classification examples include weather predictions, stock market price predictions, house value estimation, and others.
- The Spark ML model predicts each object’s target category or “class.”
- Producing a prediction from a discrete set of possible outcomes from the task is called classification.
Q6. Select the statements that are true about regression using Apache Spark ML.
- The predicted value is usually a continuous real number, such as a float or integer
- Examples of regression analysis include Weather predictions, stock market price predictions, house value estimation, and others.
- Examples of regression analysis include predicting a sports tournament winner, heads, or tails on a coin toss, classifying images with a pre-set number of distinct categories
- Regression is a form of an implicit function approximation where the model predicts real valued outputs for a given input.
Q7. Select the answers that correctly fill in the blank. Unsupervised learning _________.
- Does not require explicit labels mapped to features
- Requires explicit labels mapped to features
- Automatically learns patterns and latent spaces in the data
- Is a subset of machine learning algorithms
Q8. View the following code samples and place the code in the order needed to perform clustering using Spark ML.
#1 Perform predictions on test data
test_data = spark.read.format("libsvm").load("test_data.txt")
predictions = model.transform(test_data)
#2 Create a model and train it
kmeans = KMeans().setK(5)
model = kmeans.fit(data)
#3 Load data
data = spark.read.format("libsvm").load("data.txt")
- #2, #3, #1
- #1, #2, #3
- #3, #1, #2
- #3, #2, #1
Q9. Select the answer that correctly fills in the blank. Spark MLlib provides a clustering library located at _______________
- (clustering.spark)
- (spark.clustering)
- (spark.ml.clustering)
- (clustering.ml.spark)
Q10. Select the clustering algorithms for which Spark MLlib provides functions.
- Gaussian Mixture Models
- k-means
- Early Dirichlet Allocation
- Latent Dirichlet Allocation