Distributed Computing with Spark SQL Coursera Quiz Answers (All Weeks)
Week 01: Distributed Computing with Spark SQL Coursera Quiz Answers
Quiz 01: Assignment #1 Quiz – Queries in Spark SQL
Q 1. What is the first value for “Incident Number”?
Answer: comment the answer
Q 2. What is the first value for “Incident Number” on April 4th, 2016?
Answer: comment the answer
Q 3. Is the first fire call in this table on Brooke or Conor’s birthday? Conor’s birthday is 4/4 and Brooke’s is 9/27 (in MM/DD format).
- Brooke’s birthday
- Conor’s birthday
Q 4. What is the “Station Area” for the first fire call in this table? Note that this table is a subset of the dataset.
Answer: comment the answer
Q 5. How many incidents were on Conor’s birthday in 2016?
Answer: comment the answer
Q 6. How many fire calls had an “Ignition Cause” of “4 act of nature”?
Answer: comment the answer
Q 7. What is the most common “Ignition Cause”?
Hint: Put the entire string.
Answer: comment the answer
Q 8. What is the total incidents from the two joined tables?
Answer: comment the answer
Quiz 02: Module 1 Quiz
Q 1. Which of the following are true when it comes to the business value of big data? (Select all that apply.)
Answer:
- Businesses are increasingly making data-driven decisions
- The size of the data businesses collect is growing
Q 2. Spark uses… (Select all that apply.)
Answer:
- Your database technology (e.g., Postgres or SQL Server) to run Spark queries
- One very large computer that is able to run computation against large databases
- A distributed cluster of networked computers made of a driver node and many executor nodes
- A driver node to distribute work across a number of executor nodes
Q 3. How does Spark execute code backed by DataFrames? (Select all that apply.)
Answer:
- It optimizes your query by figuring out the best “how” to execute what you want
- It iterates over all of the source data to exhaustively evaluate queries
- It executes code determined in advance
Q 4. What are the properties of Spark DataFrames? (Select all that apply.)
Answer:
- Distributed: Computed across multiple nodes
- Resilient: Fault-tolerant
- Dataset: Collection of partitioned data
- Tables: Operates as any table in SQL environments
Q 5. What is the difference between Spark and database technologies? (Select all that apply.)
Answer:
- Spark does not interact with databases but uses its proprietary DataFrame technology instead
- Spark is a computation engine and is not for data storage
- Spark is a highly optimized compute engine and is not a database
Q 6. What is Amdahl’s law of scalability? (Select all that apply.)
- A formula that gives the number of processors (or other unit of parallelism) needed to complete a task
- A formula that gives the theoretical speedup as a function of the size of a partition (or subset) of data
- A formula that gives the expected speed of a single processor performing a computation
- Amdahl’s law states that the speedup of a task is a function of how much of that task can be parallelized
- A formula that gives the theoretical speedup as a function of the percentage of a computation that can be parallelized
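Amdahl's law, as the last two options phrase it, ties the theoretical speedup of a task to the fraction of that task that can be parallelized. A minimal sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), is shown here. The function name `amdahl_speedup` and its parameter names are illustrative and do not come from the course material.

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Theoretical speedup under Amdahl's law.

    parallel_fraction: share of the work that can run in parallel (0.0 to 1.0).
    num_processors: number of processors (or other units of parallelism).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if num_processors < 1:
        raise ValueError("num_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is unaffected; the parallel part is divided across processors.
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)


if __name__ == "__main__":
    for n in (1, 2, 8, 64):
        print(n, round(amdahl_speedup(0.95, n), 2))
```

The serial fraction caps the result: with 95% of the work parallelizable, the speedup can never exceed 1 / 0.05 = 20, no matter how many processors are added.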
Q 7. Spark offers a unified approach to analytics. What does this include? (Select all that apply.)
- Spark is able to connect to data where it lives in any number of sources, unifying the components of a data application
- Spark allows analysts, data scientists, and data engineers to all use the same core technology
- Spark code can be written in the following languages: SQL, Scala, Java, Python, and R
- Spark unifies applications such as SQL queries, streaming, and machine learning
- Spark unifies databases with optimized computation allowing for faster computation against the data it stores
Q 8. What is a Databricks notebook?
- A single Spark query
- A collaborative, interactive workspace that allows you to execute Spark queries at scale
- A cluster that executes Spark code
- A Spark instance that executes queries
Q 9. How can you get data into Databricks? (Select all that apply.)
- By connecting to Dropbox or Google Drive
- By registering the data as a table
- By uploading it through the user interface
- By “mounting” data backed by cloud storage
Q 10. What are the qualities of big data? (Select all that apply.)
- Variety: the diversity of data
- Volume: the amount of data
- Valorous: the positive impact of data
- Veracity: the reliability of data
- Velocity: the speed of data
Week 02: Distributed Computing with Spark SQL Coursera Quiz Answers
Quiz 01: Assignment #2 Quiz – Spark Internals
Q 1. How many fire calls are in our table?
Answer: Comment the answer
Q 2. How large is our fireCalls dataset in memory? Input just the numeric value (e.g. 51.2)
Answer: Comment the answer
Q 3. Which Unit Type is most common?
- ENGINE
- MEDIC
- TRUCK
- RESCUE CAPTAIN
Q 4. What type of transformation, wide or narrow, did the GROUP BY and ORDER BY queries result in?
- Narrow
- Wide
Q 5. Looking at the query below, how many tasks are in the last stage of the last job?
Answer: Comment the Answer
Quiz 02: Module 2 Quiz
Q 1. What are the different units of parallelism? (Select all that apply.)
Answer:
- Core
- Task
- Executor
- Partition
Q 2. What is a partition?
- A division of computation that executes a query
- A synonym for “task”
- A portion of a large distributed set of data
- The result of data filtered by a WHERE clause
Q 3. What is the difference between in-memory computing and other technologies? (Select all that apply.)
- In-memory operates from RAM while other technologies operate from disk
- In-memory computing is slower than other types of computing
- In-memory operations were not realistic in older technologies when memory was more expensive
Q 4. Why is caching important?
- It reformats data already stored in RAM for faster access
- It improves queries against data read one or more times
- It stores data on the cluster to improve query performance
- It always stores data in-memory to improve performance
Q 5. Which of the following is a wide transformation? (Select all that apply.)
- ORDER BY
- GROUP BY
- SELECT
- WHERE
Q 6. Broadcast joins…
- Shuffle both of the tables, minimizing computational resources
- Shuffle both of the tables, minimizing data transfer by transferring data in parallel
- Transfer the smaller of two tables to the larger, increasing data transfer requirements
- Transfer the smaller of two tables to the larger, minimizing data transfer
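The idea behind a broadcast join, sending a full copy of the small table to wherever each partition of the large table lives so the large table is never shuffled, can be sketched in plain Python. This is only an illustration of the concept, not Spark's implementation; `broadcast_join` is a hypothetical helper, the partitions are ordinary lists, and keys in the small table are assumed to be unique.

```python
def broadcast_join(large_partitions, small_table, key):
    """Inner-join each partition of a large table against a copied small table.

    large_partitions: list of partitions, each a list of dict rows.
    small_table: list of dict rows, assumed small enough to copy everywhere
                 and assumed to have unique values for `key`.
    """
    # Build the lookup once; in a cluster this is the copy sent to every executor.
    lookup = {row[key]: row for row in small_table}
    joined = []
    for partition in large_partitions:
        # Each partition is joined locally, so rows of the large table never move.
        for row in partition:
            match = lookup.get(row[key])
            if match is not None:
                joined.append({**row, **match})
    return joined


if __name__ == "__main__":
    calls = [
        [{"station": 1, "call": "a"}, {"station": 2, "call": "b"}],
        [{"station": 3, "call": "c"}, {"station": 1, "call": "d"}],
    ]
    stations = [{"station": 1, "name": "North"}, {"station": 3, "name": "South"}]
    for row in broadcast_join(calls, stations, "station"):
        print(row)
```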
Q 7. Adaptive Query Execution uses runtime statistics to:
- Dynamically coalesce shuffle partitions
- Dynamically switch join strategies
- Dynamically optimize skew joins
- Dynamically cache data
Q 8. Which of the following are bottlenecks you can detect with the Spark UI? (Select all that apply.)
- Data Skew
- Incompatible data formats
Q 9. What is a stage boundary?
- Any transition between Spark tasks
- An action caused by a SQL query’s predicate
- When all of the slots or available units of processing have to sync with one another
- A narrow transformation
Q 10. What happens when Spark code is executed in local mode?
- The executor and driver are on the same machine
- The code is executed against a local cluster
- The code is executed in the cloud
- A cluster of virtual machines is used rather than physical machines
Week 03: Distributed Computing with Spark SQL Coursera Quiz Answers
Quiz 01: Assignment #3 Quiz – Engineering Data Pipelines
Q 1. What type of table is “newTable”?
- EXTERNAL
- MANAGED
Q 2. How many rows are in “newTable”?
Answer: Comment the Answer.
Q 3. What is the “Battalion” of the first entry in the sorted table?
Answer: Comment the Answer.
Q 4. Was this query faster or slower on the table with increased partitions?
- Slower
- Faster
Q 5. Does the data stored within the table still exist at the original location (‘dbfs:/tmp/newTableLoc’) after you dropped the table?
Answer:
- No
- Yes
Quiz 02: Module 3 Quiz
Q 1. Decoupling storage and compute means storing data in one location and processing it using a separate resource. What are the benefits of this design principle? (Select all that apply.)
- Resources are isolated and therefore more manageable and debuggable
- It results in copies of the data in case of a data center outage
- It allows for elastic resources so larger storage or compute resources are used only when needed
- It makes updates to new software versions easier
Q 2. You want to run a report entailing summary statistics on a large dataset sitting in a database. What is the main resource limitation of this task?
- IO: the transfer of data is more demanding than the computation
- IO: computation is more demanding than the data transfer
- CPU: the transfer of data is more demanding than the computation
- CPU: computation is more demanding than the data transfer
Q 3. Processing virtual shopping cart orders in real time is an example of
- Online Transaction Processing (OLTP)
- Online Analytical Processing (OLAP)
Q 4. When are BLOB stores an appropriate place to store data? (Select all that apply.)
- For cheap storage
- For storing large files
- For a “data lake” of largely unstructured data
- For online transaction processing on a website
Q 5. JDBC is the standard protocol for interacting with databases in the Java environment. How do parallel connections work between Spark and a database using JDBC?
- Specify a column, number of partitions, and the column’s minimum and maximum values. Spark then divides that range of values between parallel connections.
- Specify the numPartitions configuration setting. Spark then creates one parallel connection for each partition.
- Specify the number of partitions using COALESCE. Spark then creates one parallel connection for each partition.
- Specify the number of partitions using REPARTITION. Spark then creates one parallel connection for each partition.
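The column-and-bounds approach in the first option can be pictured as splitting a numeric range into one predicate per parallel connection. The sketch below is a simplified illustration under that assumption, not Spark's actual code; `partition_predicates` and its parameter names are hypothetical.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause predicate per parallel connection.

    The bounds only decide how the range is sliced. The first and last
    predicates are left open-ended so rows outside the bounds are still read.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions == 1:
        return ["1=1"]  # a single connection reads everything
    # Integer width of each slice; at least 1 so slices never collapse.
    stride = max((upper_bound - lower_bound) // num_partitions, 1)
    predicates = []
    for i in range(num_partitions):
        start = lower_bound + i * stride
        end = start + stride
        if i == 0:
            predicates.append(f"{column} < {end} OR {column} IS NULL")
        elif i == num_partitions - 1:
            predicates.append(f"{column} >= {start}")
        else:
            predicates.append(f"{column} >= {start} AND {column} < {end}")
    return predicates


if __name__ == "__main__":
    for predicate in partition_predicates("id", 0, 100, 4):
        print(predicate)
```

Each predicate would be appended to the query issued over its own connection, so a skewed column or poorly chosen bounds leads to unevenly sized partitions.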
Q 6. What are some of the advantages of the file format Parquet over CSV? (Select all that apply.)
- Corruptible
- Compression
- Parallelism
- Columnar
Q 7. SQL is normally used to query tabular (or “structured”) data. Semi-structured data like JSON is common in big data environments. Why? (Select all that apply.)
- It does not need a formal structure
- It allows for easy joins between relational JSON tables
- It allows for missing data
- It allows for complex data types
- It allows for data change over time
Q 8. Data writes in Spark can happen in serial or in parallel. What controls this parallelism?
- The number of stages in a write operation
- The number of data partitions in a DataFrame
- The numPartitions setting in the Spark configuration
- The number of jobs in a write operation
Q 9. Fill in the blanks with the appropriate response below:
A _______ table manages _______ and a DROP TABLE command will result in data loss.
- Managed, both the data and metadata such as the schema and data location
- Unmanaged, only the metadata such as the schema and data location
- Unmanaged, both the data and metadata such as the schema and data location
- Managed, only the metadata such as the schema and data location
Week 04: Distributed Computing with Spark SQL Coursera Quiz Answers
Quiz 01: Assignment #4 Quiz – Lakehouse
Q 1. How many folders were created? Enter the number of records you see from the output below (include the _delta_log in your count)
Answer: 9
Q 2. Delete all the records where City is null. How many records are left in the delta table?
Answer: 416869
Q 3. After you deleted all records where the City is null, how many files were removed? Hint: Look at operationsMetrics in the transaction log using the DESCRIBE HISTORY command.
Answer: 22
Q 4. There are quite a few missing Call_Type_Group values. Use the UPDATE command to replace any null values with Non Life-threatening. After you replace the null values, how many Non Life-threatening call types are there?
Answer: 302506
Q 5. Travel back in time to the earliest version of the Delta table (version 0). How many records were there?
Answer: 417419
Quiz 02: Module 4 Quiz
Q 1. What are the ACID properties?
- Atomicity, Consistency, Isolation, and Durability
- Atomicity, Consistency, Idempotent, and Durability
- Atomicity, Consistency, Isolation, and Duration
- Atomicity, Congruency, Isolation, and Durability
Q 2. Which of the following are true statements about data warehouses?
- They use closed protocols and proprietary software
- They enable machine learning workloads
- They provide the structure needed for BI applications
- They have a high degree of flexibility
Q 3. Which of these features does Delta Lake support? (Select all that apply.)
- Cluster Creation
- Delete
- Time Travel
- Schema Evolution
- Space Travel
Q 4. Which of the following are true statements about data lakes?
- They provide the structure needed for BI applications
- They use closed protocols and proprietary software
- They enable machine learning workloads
- They have a high degree of flexibility
Q 5. Which of the following are valid data models?
- Relational
- Non-relational
- Query-oriented
- Star
- Medallion
Q 6. What are the benefits a lakehouse architecture provides?
- Combine scalability and low-cost storage of data lakes with the speed and ACID transactional guarantees of data warehouses
- Combine scalability and ACID transactional guarantees of data lakes with the speed and low-cost storage of data warehouses
- Combine scalability and low-cost storage of data warehouses with the speed and ACID transactional guarantees of data lakes
- Combine speed and low-cost storage of data lakes with the scalability and ACID transactional guarantees of data warehouses
Q 7. Machine learning is suited to solve which of the following tasks? (Select all that apply.)
- Image Recognition
- Financial Forecasting
- Reporting
- Fraud Detection
- Natural Language Processing
- A/B Testing
- Churn Analysis
Q 8. What is Machine Learning? (Select all that apply.)
- A function that maps features to an output
- Learning patterns in your data without being explicitly programmed
- Hand-coded logic
- Statistical moments calculated against a dataset
Q 9. Fill in the blanks with the appropriate answer below.
Predicting whether a website user is fraudulent or not is an example of _________ machine learning. It is a __________ task.
- unsupervised, regression
- supervised, classification
- unsupervised, classification
- supervised, regression
Q 10. Linear regression is one algorithm used for machine learning. What is this algorithm learning?
- It learns the line of best fit through the data
- It learns the average of the label you’re trying to predict
- It learns the median of the label you’re trying to predict
- It learns the most similar other datapoints in that dataset to the ones you provide
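The “line of best fit” mentioned in the first option is commonly found by ordinary least squares, which picks the slope and intercept that minimize the squared vertical distances to the data points. A minimal single-feature sketch follows; `fit_line` is an illustrative name and the sample data is made up.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares (one feature)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    # Co-variation of x and y around their means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(slope, intercept)
```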