Distributed Computing with Spark SQL Coursera Quiz Answer

All Weeks Distributed Computing with Spark SQL Coursera Quiz Answer

Week 01 : Distributed Computing with Spark SQL Coursera Quiz Answer

Quiz 01: Assignment #1 Quiz – Queries in Spark SQL

Q 1. What is the first value for “Incident Number”?

Answer: comment the answer

Q 2. What is the first value for “Incident Number” on April 4th, 2016?

Answer: comment the answer

Q 3. Is the first fire call in this table on Brooke or Conor’s birthday? Conor’s birthday is 4/4 and Brooke’s is 9/27 (in MM/DD format).

Brooke’s birthday
Conor’s birthday

Q 4. What is the “Station Area” for the first fire call in this table? Note that this table is a subset of the dataset.

Answer: comment the answer

Q 5. How many incidents were on Conor’s birthday in 2016?

Answer: comment the answer

Q 6. How many fire calls had an “Ignition Cause” of “4 act of nature”?

Answer: comment the answer

Q 7. What is the most common “Ignition Cause”?

Hint: Put the entire string.

Answer: comment the answer

Q 8. What is the total number of incidents from the two joined tables?

Answer: comment the answer

Quiz 02: Module 1 Quiz

Q 1. Which of the following are true when it comes to the business value of big data? (Select all that apply.)

Answer:

Businesses are increasingly making data-driven decisions
The size of the data businesses collect is growing

Q 2. Spark uses… (Select all that apply.)

Answer:

Your database technology (e.g., Postgres or SQL Server) to run Spark queries
One very large computer that is able to run computation against large databases
A distributed cluster of networked computers made of a driver node and many executor nodes
A driver node to distribute work across a number of executor nodes

Q 3. How does Spark execute code backed by DataFrames? (Select all that apply.)

Answer:

It optimizes your query by figuring out the best “how” to execute what you want
It iterates over all of the source data to exhaustively evaluate queries
It executes code determined in advance

Q 4. What are the properties of Spark DataFrames? (Select all that apply.)

Answer:

Distributed: Computed across multiple nodes
Resilient: Fault-tolerant
Dataset: Collection of partitioned data
Tables: Operates as any table in SQL environments

Q 5. What is the difference between Spark and database technologies? (Select all that apply.)

Answer:

Spark does not interact with databases but uses its proprietary DataFrame technology instead
Spark is a computation engine and is not for data storage
Spark is a highly optimized compute engine and is not a database

Q 6. What is Amdahl’s law of scalability? (Select all that apply.)

A formula that gives the number of processors (or other unit of parallelism) needed to complete a task
A formula that gives the theoretical speedup as a function of the size of a partition (or subset) of data
A formula that gives the expected speed of a single processor performing a computation
Amdahl’s law states that the speedup of a task is a function of how much of that task can be parallelized
A formula that gives the theoretical speedup as a function of the percentage of a computation that can be parallelized
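
The formula behind this question can be worked through numerically. The sketch below assumes the usual statement of Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of parallel workers. The function name `amdahl_speedup` is made up for illustration and is not part of the course material.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Theoretical speedup when `parallel_fraction` of a task runs on `workers` units."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # The serial part cannot be sped up, so it bounds the overall gain.
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 90% of the work parallelizable, adding workers helps less and less:
# the speedup approaches, but never reaches, 1 / 0.1 = 10.
speedups = {n: amdahl_speedup(0.9, n) for n in (1, 2, 8, 64, 1024)}
```

The point of the exercise is that the serial fraction, not the worker count, sets the ceiling on speedup.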

Q 7. Spark offers a unified approach to analytics. What does this include? (Select all that apply.)

Spark is able to connect to data where it lives in any number of sources, unifying the components of a data application
Spark allows analysts, data scientists, and data engineers to all use the same core technology
Spark code can be written in the following languages: SQL, Scala, Java, Python, and R
Spark unifies applications such as SQL queries, streaming, and machine learning
Spark unifies databases with optimized computation allowing for faster computation against the data it stores

Q 8. What is a Databricks notebook?

A single Spark query
A collaborative, interactive workspace that allows you to execute Spark queries at scale
A cluster that executes Spark code
A Spark instance that executes queries

Q 9. How can you get data into Databricks? (Select all that apply.)

By connecting to Dropbox or Google Drive
By registering the data as a table
By uploading it through the user interface
By “mounting” data backed by cloud storage

Q 10. What are the qualities of big data? (Select all that apply.)

Variety: the diversity of data
Volume: the amount of data
Valorous: the positive impact of data
Veracity: the reliability of data
Velocity: the speed of data

Week 02 : Distributed Computing with Spark SQL Coursera Quiz Answer

Quiz 01: Assignment #2 Quiz – Spark Internals

Q 1. How many fire calls are in our table?

Answer: Comment the answer

Q 2. How large is our fireCalls dataset in memory? Input just the numeric value (e.g. 51.2)

Answer: Comment the answer

Q 3. Which Unit Type is most common?

ENGINE
MEDIC
TRUCK
RESCUE CAPTAIN

Q 4. What type of transformation, wide or narrow, did the GROUP BY and ORDER BY queries result in?

Narrow
Wide

Q 5. Looking at the query below, how many tasks are in the last stage of the last job?

Answer: Comment the Answer

Quiz 02: Module 2 Quiz

Q 1. What are the different units of parallelism? (Select all that apply.)

Answer:

Core
Task
Executor
Partition

Q 2. What is a partition?

A division of computation that executes a query
A synonym for “task”
A portion of a large distributed set of data
The result of data filtered by a WHERE clause

Q 3. What is the difference between in-memory computing and other technologies? (Select all that apply.)

In-memory operates from RAM while other technologies operate from disk
In-memory computing is slower than other types of computing
In-memory operations were not realistic in older technologies when memory was more expensive

Q 4. Why is caching important?

It reformats data already stored in RAM for faster access
It improves queries against data read one or more times
It stores data on the cluster to improve query performance
It always stores data in-memory to improve performance

Q 5. Which of the following is a wide transformation? (Select all that apply.)

ORDER BY
GROUP BY
SELECT
WHERE
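
The narrow versus wide distinction can be illustrated without a cluster. The toy model below is plain Python, not Spark: each inner list stands in for one partition, and the helper names `narrow_filter` and `wide_group_count` are hypothetical. A WHERE-style filter touches each partition independently, while a GROUP BY-style count first has to move rows so that equal keys end up together.

```python
from collections import defaultdict

# Toy "partitions": each inner list stands in for the rows held in one partition.
partitions = [
    [("ENGINE", 1), ("MEDIC", 1), ("ENGINE", 1)],
    [("TRUCK", 1), ("ENGINE", 1), ("MEDIC", 1)],
]


def narrow_filter(parts, predicate):
    """WHERE-style step: every partition is processed on its own, no rows move."""
    return [[row for row in part if predicate(row)] for part in parts]


def wide_group_count(parts, num_output_partitions=2):
    """GROUP BY-style step: rows are redistributed so equal keys share a partition."""
    shuffled = [defaultdict(int) for _ in range(num_output_partitions)]
    for part in parts:
        for key, value in part:
            # Deterministic toy partitioner (an assumption, not Spark's hash function).
            target = sum(key.encode()) % num_output_partitions
            shuffled[target][key] += value
    return [dict(bucket) for bucket in shuffled]


engines_only = narrow_filter(partitions, lambda row: row[0] == "ENGINE")
counts_by_partition = wide_group_count(partitions)
```

In the filter, the output partitions line up one-to-one with the input partitions. In the grouped count, a single output partition can depend on rows from every input partition, which is what makes the step wide.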

Q 6. Broadcast joins…

Shuffle both of the tables, minimizing computational resources
Shuffle both of the tables, minimizing data transfer by transferring data in parallel
Transfer the smaller of two tables to the larger, increasing data transfer requirements
Transfer the smaller of two tables to the larger, minimizing data transfer
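
A broadcast join can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not Spark's implementation: the function name `broadcast_hash_join` and the sample rows are invented for illustration. The small table is turned into a lookup that every partition of the large table receives in full, so the large table's rows stay where they are.

```python
def broadcast_hash_join(large_partitions, small_table, key_index=0):
    """Inner-join each partition of the large table against a full copy of the small table."""
    # Build the lookup once; in a cluster this dictionary is what would be
    # copied ("broadcast") to every executor.
    lookup = {key: value for key, value in small_table}
    joined = []
    for part in large_partitions:
        # Each partition joins locally; no rows of the large table are moved.
        joined.append([
            row + (lookup[row[key_index]],)
            for row in part
            if row[key_index] in lookup
        ])
    return joined


# Hypothetical data: (unit id, incident number) rows split over two partitions,
# and a small (unit id, unit type) lookup table.
large = [[("E01", 101), ("M02", 102)], [("E01", 103), ("T03", 104)]]
small = [("E01", "ENGINE"), ("M02", "MEDIC")]
result = broadcast_hash_join(large, small)
```

Only the small table crosses the network, once per executor, which is why this strategy avoids shuffling the large table.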

Q 7. Adaptive Query Execution uses runtime statistics to:

Dynamically coalesce shuffle partitions
Dynamically switch join strategies
Dynamically optimize skew joins
Dynamically cache data

Q 8. Which of the following are bottlenecks you can detect with the Spark UI? (Select all that apply.)

Data Skew
Incompatible data formats

Q 9. What is a stage boundary?

Any transition between Spark tasks
An action caused by a SQL query’s predicate
When all of the slots or available units of processing have to sync with one another
A narrow transformation

Q 10. What happens when Spark code is executed in local mode?

The executor and driver are on the same machine
The code is executed against a local cluster
The code is executed in the cloud
A cluster of virtual machines is used rather than physical machines

Week 03 : Distributed Computing with Spark SQL Coursera Quiz Answer

Quiz 01: Assignment #3 Quiz – Engineering Data Pipelines

Q 1. What type of table is “newTable”?

EXTERNAL
MANAGED

Q 2. How many rows are in “newTable”?

Answer: Comment the Answer.

Q 3. What is the “Battalion” of the first entry in the sorted table?

Answer: Comment the Answer.

Q 4. Was this query faster or slower on the table with increased partitions?

Slower
Faster

Q 5. Does the data stored within the table still exist at the original location (‘dbfs:/tmp/newTableLoc’) after you dropped the table?

Answer:

Quiz 02: Module 3 Quiz

Q 1. Decoupling storage and compute means storing data in one location and processing it using a separate resource. What are the benefits of this design principle? (Select all that apply.)

Resources are isolated and therefore more manageable and debuggable
It results in copies of the data in case of a data center outage
It allows for elastic resources so larger storage or compute resources are used only when needed
It makes updates to new software versions easier

Q 2. You want to run a report entailing summary statistics on a large dataset sitting in a database. What is the main resource limitation of this task?

IO: the transfer of data is more demanding than the computation
IO: computation is more demanding than the data transfer
CPU: the transfer of data is more demanding than the computation
CPU: computation is more demanding than the data transfer

Q 3. Processing virtual shopping cart orders in real time is an example of

Online Transaction Processing (OLTP)
Online Analytical Processing (OLAP)

Q 4. When are BLOB stores an appropriate place to store data? (Select all that apply.)

For cheap storage
For storing large files
For a “data lake” of largely unstructured data
For online transaction processing on a website

Q 5. JDBC is the standard protocol for interacting with databases in the Java environment. How do parallel connections work between Spark and a database using JDBC?

Specify a column, number of partitions, and the column’s minimum and maximum values. Spark then divides that range of values between parallel connections.
Specify the numPartitions configuration setting. Spark then creates one parallel connection for each partition.
Specify the number of partitions using COALESCE. Spark then creates one parallel connection for each partition.
Specify the number of partitions using REPARTITION. Spark then creates one parallel connection for each partition.
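
The idea of dividing a column's range between parallel connections can be sketched as follows. This is a simplified approximation, not Spark's exact logic: the function `partition_predicates` is hypothetical, and it only shows how a column name, a lower bound, an upper bound, and a partition count could be turned into one WHERE clause per connection.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound) into one WHERE clause per parallel connection."""
    if num_partitions < 1 or upper_bound <= lower_bound:
        raise ValueError("need num_partitions >= 1 and upper_bound > lower_bound")
    if num_partitions == 1:
        return ["1 = 1"]  # a single connection reads everything
    stride = max((upper_bound - lower_bound) // num_partitions, 1)
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end = start + stride
        if i == 0:
            # The first slice is left open below so small values and NULLs are not lost.
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            # The last slice is left open above so large values are not lost.
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


clauses = partition_predicates("id", 0, 100, 4)
```

Note that the bounds only decide how the range is sliced. Because the first and last slices are open-ended, rows outside the bounds are still read, they just all land in the edge partitions.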

Q 6. What are some of the advantages of the file format Parquet over CSV? (Select all that apply.)

Corruptible
Compression
Parallelism
Columnar

Q 7. SQL is normally used to query tabular (or “structured”) data. Semi-structured data like JSON is common in big data environments. Why? (Select all that apply.)

It does not need a formal structure
It allows for easy joins between relational JSON tables
It allows for missing data
It allows for complex data types
It allows for data change over time

Q 8. Data writes in Spark can happen in serial or in parallel. What controls this parallelism?

The number of stages in a write operation
The number of data partitions in a DataFrame
The numPartitions setting in the Spark configuration
The number of jobs in a write operation

Q 9. Fill in the blanks with the appropriate response below:

A _______ table manages _______ and a DROP TABLE command will result in data loss.

Managed, both the data and metadata such as the schema and data location
Unmanaged, only the metadata such as the schema and data location
Unmanaged, both the data and metadata such as the schema and data location
Managed, only the metadata such as the schema and data location

Week 04 : Distributed Computing with Spark SQL Coursera Quiz Answer

Assignment #4 Quiz – Lakehouse

Q 1. How many folders were created? Enter the number of records you see from the output below (include the _delta_log in your count)

Answer: 9

Q 2. Delete all the records where City is null. How many records are left in the delta table?

Answer: 416869

Q 3. After you deleted all records where the City is null, how many files were removed? Hint: Look at operationsMetrics in the transaction log using the DESCRIBE HISTORY command.

Answer: 22

Q 4. There are quite a few missing Call_Type_Group values. Use the UPDATE command to replace any null values with Non Life-threatening.

After you replace the null values, how many Non Life-threatening call types are there?

Answer: 302506

Q 5. Travel back in time to the earliest version of the Delta table (version 0). How many records were there?

Answer: 417419
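
Time travel works because the table's transaction log records which data files each commit added and removed, so an earlier version can be rebuilt by replaying the log only up to that commit. The toy model below illustrates that idea in plain Python. It is a sketch under stated assumptions, not the Delta Lake API: the log layout, the file names, and the function `files_at_version` are all invented for illustration.

```python
def files_at_version(log, version):
    """Replay commits 0..version and return the set of data files that are live."""
    if not 0 <= version < len(log):
        raise ValueError("version is not in the log")
    live = set()
    for commit in log[: version + 1]:
        # Each commit lists the files it retires and the files it introduces.
        live -= set(commit.get("remove", []))
        live |= set(commit.get("add", []))
    return live


# Hypothetical two-commit history: an initial write, then a delete that
# rewrites one file without the deleted rows.
log = [
    {"operation": "WRITE", "add": ["part-0", "part-1"]},
    {"operation": "DELETE", "remove": ["part-1"], "add": ["part-2"]},
]
version_0 = files_at_version(log, 0)
version_1 = files_at_version(log, 1)
```

In this model the delete does not erase `part-1` from history. It only stops later versions from reading it, which is why version 0 can still be queried afterwards.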

Module 4 Quiz

Q 1. What are the ACID properties?

Atomicity, Consistency, Isolation, and Durability
Atomicity, Consistency, Idempotent, and Durability
Atomicity, Consistency, Isolation, and Duration
Atomicity, Congruency, Isolation, and Durability

Q 2. Which of the following are true statements about data warehouses?

They use closed protocols and proprietary software
They enable machine learning workloads
They provide the structure needed for BI applications
They have a high degree of flexibility

Q 3. Which of these features does Delta Lake support? (Select all that apply.)

Cluster Creation
Delete
Time Travel
Schema Evolution
Space Travel

Q 4. Which of the following are true statements about data lakes?

They provide the structure needed for BI applications
They use closed protocols and proprietary software
They enable machine learning workloads
They have a high degree of flexibility

Q 5. Which of the following are valid data models?

Relational
Non-relational
Query-oriented
Star
Medallion

Q 6. What are the benefits a lakehouse architecture provides?

Combine scalability and low-cost storage of data lakes with the speed and ACID transactional guarantees of data warehouses
Combine scalability and ACID transactional guarantees of data lakes with the speed and low-cost storage of data warehouses
Combine scalability and low-cost storage of data warehouses with the speed and ACID transactional guarantees of data lakes
Combine speed and low-cost storage of data lakes with the scalability and ACID transactional guarantees of data warehouses

Q 7. Machine learning is suited to solve which of the following tasks? (Select all that apply.)

Image Recognition
Financial Forecasting
Reporting
Fraud Detection
Natural Language Processing
A/B Testing
Churn Analysis

Q 8. What is Machine Learning? (Select all that apply.)

A function that maps features to an output
Learning patterns in your data without being explicitly programmed
Hand-coded logic
Statistical moments calculated against a dataset

Q 9. Fill in the blanks with the appropriate answer below.

Predicting whether a website user is fraudulent or not is an example of _________ machine learning. It is a __________ task.

unsupervised, regression
supervised, classification
unsupervised, classification
supervised, regression

Q 10. Linear regression is one algorithm used for machine learning. What is this algorithm learning?

It learns the line of best fit through the data
It learns the average of the label you’re trying to predict
It learns the median of the label you’re trying to predict
It learns the most similar other datapoints in that dataset to the ones you provide
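
Fitting a line to data can be shown with ordinary least squares for a single feature. The sketch below assumes that closed-form solution: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the two means. The function name `fit_line` and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean; zero means every x is identical.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical, slope is undefined")
    # How x and y vary together around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points generated from y = 2x + 1, so the fit should recover those two numbers.
slope, intercept = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

What the algorithm learns is just these two numbers, which is why a prediction for a new x is simply slope * x + intercept.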

Get All Course Quiz Answers of Learn SQL Basics for Data Science Specialization

SQL for Data Science Coursera Quiz Answers

Data Wrangling, Analysis and AB Testing with SQL Coursera Quiz Answers

Distributed Computing with Spark SQL Coursera Quiz Answer
