Google Cloud Professional Data Engineer: Get Certified Quiz Answers

1. You are consulting with a company that provides a software-as-a-service (SaaS) platform for collecting and analyzing data from agricultural IoT sensors. The company currently uses Bigtable to store the data but is finding performance to be less than expected. You suspect the problem may be hot-spotting so you look into the structure of the row key. The current rowkey is the concatenation of the following: the DateTime of sensor reading, customer ID, sensor ID. What alternative rowkey would you suggest?

Datetime of Sensor Reading, sensor ID, Customer
Random Number, Datetime of Sensor Reading, sensor ID, Customer
Customer ID, Sensor ID, Datetime of Sensor Reading
Sensor ID, Datetime of Sensor Reading, Customer ID
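
Bigtable stores rows sorted lexicographically by row key, so the leading field decides which rows sit next to each other. The sketch below is only an illustration of that effect: the `#` separator, the timestamp format, and the customer and sensor IDs are assumptions, not part of the question.

```python
from datetime import datetime, timezone

def key_time_first(ts, customer_id, sensor_id):
    # Timestamp leads, so readings taken at the same moment share one prefix.
    return f"{ts:%Y%m%d%H%M%S}#{customer_id}#{sensor_id}"

def key_ids_first(ts, customer_id, sensor_id):
    # Identifiers lead, so simultaneous readings land in different key ranges.
    return f"{customer_id}#{sensor_id}#{ts:%Y%m%d%H%M%S}"

ts = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
readings = [("cust-a", "s1"), ("cust-m", "s7"), ("cust-z", "s3")]  # hypothetical IDs

time_first = sorted(key_time_first(ts, c, s) for c, s in readings)
ids_first = sorted(key_ids_first(ts, c, s) for c, s in readings)

# Count distinct leading fields among writes that arrive at the same moment.
time_prefixes = {k.split("#")[0] for k in time_first}
id_prefixes = {k.split("#")[0] for k in ids_first}
print(len(time_prefixes), len(id_prefixes))
```

With the timestamp first, every simultaneous write shares the same leading field and so targets the same region of the key space; with identifiers first, the writes are spread out.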

2. Compliance with regulations requires that you keep copies of logs generated by applications that perform financial transactions for 3 years. You currently run applications on-premises but will move them to Google Cloud. You want to keep the logs for three years as inexpensively as possible. You do not expect to query the logs but must be able to provide access to files on demand. How would you configure GCP resources to meet this requirement?

send application logs to Cloud Logging and create a Cloud Storage sink to store the logs for the long term
send application logs to Cloud Logging and leave them there
send application logs to Cloud Logging and leave them there and create a data lifecycle management policy to delete logs over 3 years old.
send application logs to Cloud Logging and create a Bigtable sink to store the logs for the long term

3. A team of developers is consolidating several data pipelines used by an insurance company to process claims. The claims processing logic is complex and already encoded in a Java library. The current data pipelines run in batch mode but the insurance company wants to process claims as soon as they are created. What GCP service would you recommend using?

Cloud Datastore
Cloud Dataflow
Cloud Dataprep
Cloud Pub/Sub

4. A regional auto dealership is migrating its business applications to Google Cloud. The company currently uses a third-party application that uses PostgreSQL for storing data. The CTO wants to reduce the cost of supporting this application and the database. What would you recommend to the CTO as the best option to reduce the cost of maintaining and operating the database?

Use Cloud Datastore
Use Cloud Spanner
Use Cloud SQL
Use a SQL Server database

5. You are developing a machine learning model to predict the likelihood of a device failure. The device generates a stream of metrics every thirty seconds. The metrics include 3 categorical values, 5 integer values, and 1 floating-point value. The floating-point value ranges from 0 to 100. For the purposes of the model, the floating-point value is more precise than needed. Mapping that value to a feature with possible values “high”, “medium”, and “low” is sufficient. What feature engineering technique would you use to transform the floating-point value to high, medium, or low?

L1 Regularization
Bucketing
Clustering
Normalization
L2 Regularization
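
As a minimal sketch of mapping a continuous value in the 0 to 100 range onto three labels, the function below uses fixed cut points. The thresholds 33 and 66 are assumptions chosen for illustration; the question does not say where the boundaries should fall.

```python
def bucket_metric(value: float, low_max: float = 33.0, medium_max: float = 66.0) -> str:
    """Map a reading in [0, 100] to 'low', 'medium', or 'high'.

    The cut points are hypothetical defaults, not values from the question.
    """
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"value out of range: {value}")
    if value <= low_max:
        return "low"
    if value <= medium_max:
        return "medium"
    return "high"

readings = [4.2, 50.0, 97.5]
labels = [bucket_metric(r) for r in readings]
print(labels)
```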

6. A company is migrating its backend services to Google Cloud. Services are implemented in Java and Kafka is used as a messaging platform between services. The DevOps team would like to reduce its operational overhead. What managed GCP service might they use as an alternative to Kafka?

Cloud Dataflow
Cloud Dataproc
Cloud Pub/Sub
Cloud Datastore

7. You are implementing a data warehouse using BigQuery. A data modeler, unfamiliar with BigQuery, developed a model that is highly normalized. You are concerned that a highly normalized model will require the frequent joining of tables to respond to common queries. You want to denormalize the data model but still want to be able to represent 1-to-many relations. How could you do this with BigQuery?

Rather than store associated data in another table, store them in ARRAYS within the primary table.
Rather than store associated data in another table, store them in STRUCTS within the primary table
Use partitioning and clustering to denormalize
Model entities using wide-column tables

8. A team of machine learning engineers is developing deep learning models using TensorFlow. They have extremely large data sets and must frequently retrain models. They are currently using a managed instance group with a fixed number of VMs and they are not meeting SLAs for retraining. What would you suggest the machine learning engineers try next?

Enable autoscaling of the managed instance group and set a high maximum number of VMs
Keep the same number of VMs in the managed instance group but use larger machine types
Attach TPUs to the Compute Engine VMs
Deploy the training service in containers and use Kubernetes Engine to scale as needed

9. You have a BigQuery table partitioned by ingestion time and want to create a view that returns only rows ingested in the last 7 days. Which of the following statements would you use in the WHERE clause of the view definition to limit results to include only the most recent seven days of data?

PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY) AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(),DAY);
_PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY) AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(),DAY);
_INGESTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY) AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(),DAY);
INGESTTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR),DAY) AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(),DAY);
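
The date arithmetic shared by all four options can be checked outside BigQuery. The sketch below mirrors `TIMESTAMP_SUB` followed by `TIMESTAMP_TRUNC(..., DAY)` with Python's `datetime`; the function name and the sample "now" value are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

def seven_day_bounds(now: datetime):
    """Return (start, end) as day-truncated UTC timestamps.

    Mirrors TIMESTAMP_TRUNC(TIMESTAMP_SUB(now, INTERVAL 7 * 24 HOUR), DAY)
    and TIMESTAMP_TRUNC(now, DAY).
    """
    def truncate_to_day(ts: datetime) -> datetime:
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)

    start = truncate_to_day(now - timedelta(hours=7 * 24))
    end = truncate_to_day(now)
    return start, end

now = datetime(2024, 3, 15, 13, 45, tzinfo=timezone.utc)  # hypothetical current time
start, end = seven_day_bounds(now)
print(start.isoformat(), end.isoformat())
```

Because `BETWEEN` is inclusive at both ends, a filter built from these bounds matches the daily partition for today plus the seven before it.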

10. Your department currently uses HBase on a Hadoop cluster for an analytics database. You want to migrate that data to Google Cloud. Only one workload runs on the Hadoop cluster, and it uses the HBase API. You would like to avoid having to manage a Spark and Hadoop cluster but you do not want to change the application code of the one workload running on the cluster. How could you move the workload to GCP, use a managed service, and not change the application?

Migrate the data to Cloud Storage and use its HBase API
Migrate the data to Bigtable and use the HBase API
Migrate the data to Datastore and use the HBase API
Migrate the data to BigQuery and use its HBase API

11. A data pipeline is not performing well enough to meet SLAs. You have determined that long-running database queries are slowing processing. You decide to try to use a read-through cache. You want the cache to support sets and sorted sets as well. What Google Cloud Service would you use?

Cloud Memorystore with Memcached
Cloud Memorystore with Redis
Cloud Memorystore with SQL Server
Cloud Datastore

12. Sensor data from manufacturing machines is ingested through Pub/Sub and read by a Cloud Dataflow job, which analyzes the data. The data arrives in one-minute intervals and includes a timestamp and measures of temperature, vibration, and ambient humidity. Industrial engineers have determined that if the average temperature exceeds 10% of the maximum safe operating temperature for more than 10 minutes and the average ambient humidity is above 90% for more than 10 minutes then the machine should be shut down. What operation would you perform on the stream of data to determine when to trigger an alert to shut down the machine?

Set a 10-minute watermark and when the watermark is reached, trigger an alert.
Create a 10-minute tumbling window, compute the average temperature and average humidity, and if both exceed the specified thresholds, then trigger an alert.
Create a 10-minute sliding window, compute the average temperature and average humidity, and if both exceed the specified thresholds, then trigger an alert.
Create a Redis cache using Memcache, use an ordered list data structure, write a Java or Python function to compute 10-minute averages for temperature and humidity, and if both exceed the specified thresholds trigger an alert.
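
Tumbling and sliding windows differ in how they slice a stream: tumbling windows do not overlap, while sliding windows advance by a step smaller than the window size. The plain-Python sketch below (not Dataflow or Apache Beam code) shows the difference on one metric; the readings and the threshold of 95 are invented for illustration.

```python
from statistics import mean

def tumbling_windows(values, size):
    """Non-overlapping windows: [0:size], [size:2*size], ..."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, size)]

def sliding_windows(values, size, step=1):
    """Overlapping windows that advance by `step` readings."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, step)]

# One reading per minute: a 10-minute hot spell that starts at minute 5.
temps = [50] * 5 + [100] * 10 + [50] * 5
THRESHOLD = 95  # hypothetical alert threshold

tumbling_alert = any(mean(w) > THRESHOLD for w in tumbling_windows(temps, 10))
sliding_alert = any(mean(w) > THRESHOLD for w in sliding_windows(temps, 10))
print(tumbling_alert, sliding_alert)
```

In this data the hot spell straddles the boundary between the two tumbling windows, so each averages 75 and neither crosses the threshold, while the sliding window that starts at minute 5 covers the whole spell.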

13. Autonomous vehicles stream data about vehicle performance to a Cloud Pub/Sub queue for ingestion. You want to randomly sample the data stream to collect 0.01% of the data for your own analysis. You want to do this with the least amount of new code and infrastructure while still having access to the data as soon as possible. What is the best option for doing this?

Create a sink from the Cloud Pub/Sub topic to a Cloud Storage bucket and write the data to files on an hourly basis. Create a containerized application running in App Engine to read the latest hourly data file and randomly sample 0.01% of the data.
Create a Cloud Function that executes when a message is written to the Cloud Pub/Sub topic. Randomly generate a number between 0 and 1 in the function and if the random number is less than 0.0001 (that is, 0.01%), then write the message to another topic that you created to act as the source of data for your analysis.
Create an App Engine application that executes continuously and polls the Cloud Pub/Sub topic. When a message is written to the Cloud Pub/Sub topic, randomly generate a number between 0 and 1 in the application and, if the random number is less than 0.0001 (that is, 0.01%), write the message to another topic that you created to act as the source of data for your analysis.
Create a sink from the Cloud Pub/Sub topic to a Cloud Spanner database table. Create a containerized application running in App Engine to read the data continuously and randomly sample 0.01% of the data.
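
The sampling decision itself is a one-line comparison. Note that 0.01% corresponds to a probability of 0.0001. The sketch below is a local simulation only; it makes no Pub/Sub or Cloud Functions calls, and the function name, seed, and message count are assumptions.

```python
import random

SAMPLE_RATE = 0.0001  # 0.01% expressed as a probability

def should_sample(rng: random.Random) -> bool:
    """Return True for roughly SAMPLE_RATE of calls."""
    return rng.random() < SAMPLE_RATE

# Simulate one million messages with a fixed seed so the run is repeatable.
rng = random.Random(42)
total = 1_000_000
kept = sum(should_sample(rng) for _ in range(total))
print(f"kept {kept} of {total} messages")
```

On a million messages the expected sample is about 100, with run-to-run variation of roughly 10 either way.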

14. You are training a deep learning neural network. You are using gradient descent to find optimal weights. You want to update the weights after each instance is analyzed. Which type of gradient descent would you use?

Batch gradient descent
Stochastic gradient descent
Mini-batch gradient descent
Max-batch gradient descent
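
The variants differ in how many instances contribute to each weight update. The sketch below contrasts a per-instance update with a full-batch update on a one-weight linear model with squared error; the data, learning rate, and function names are assumptions for illustration.

```python
def per_instance_epoch(w, data, lr):
    """Update the weight after every single (x, y) instance."""
    for x, y in data:
        grad = 2 * (w * x - y) * x  # gradient of (w*x - y)**2 with respect to w
        w -= lr * grad
    return w

def full_batch_epoch(w, data, lr):
    """Update the weight once, using the mean gradient over all instances."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # generated from y = 2x
w_instance = w_batch = 0.0
for _ in range(50):
    w_instance = per_instance_epoch(w_instance, data, lr=0.05)
    w_batch = full_batch_epoch(w_batch, data, lr=0.05)
print(round(w_instance, 4), round(w_batch, 4))
```

One pass over three instances produces three weight updates in the first function and one in the second; both approach the true weight of 2 on this toy data.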

15. You are designing a Bigtable database for a multi-tenant analytics service that requires low latency writes at extremely high volumes of data. As an experienced relational data modeler, you are familiar with the process of normalizing data models. Your data model consists of 15 tables with each table having at most 20 columns. Your initial tests with 20% of expected load indicate much higher than expected latency and some potential issues with connection overhead. What can you do to address these problems?

Further normalize the data model to reduce latency and add more memory to address connection overhead issues.
Denormalize the data model to use a single, wide-column table in Bigtable to reduce latency and address connection issues.
Keep the same data model but use BigQuery, which is specifically an analytical database.
Denormalize the data model to use a single, wide-column table but implement that table in BigQuery to reduce latency and address connection issues.

16. You are designing a time series database. Data will arrive from thousands of sensors at one-minute intervals. You want to model this time series data using recommended practices. Which of the following would you implement?

Design rows to store the set of measurements from one sensor at one point in time.
Design rows to store the set of measurements from all sensors for one point in time.
Design rows to store the set of measurements from one sensor over a one hour period
Design rows to store the set of measurements from one sensor over as long a period of time as possible while not exceeding 100 MB per row.

17. You are migrating a data warehouse to BigQuery and want to optimize the data types used in BigQuery. You have many columns in the existing data warehouse that store absolute point-in-time values. They are implemented using 8-byte integers in the existing data warehouse. What data type would you use in BigQuery?

Long integer
Datetime
Timestamp
Time

18. Your organization is migrating an enterprise data warehouse from an on-premises PostgreSQL database to Google Cloud to use BigQuery. The data warehouse is used by 7 different departments each of which has its own data, workloads, and reports. You would like to follow recommended data warehouse migration practices. Which of the following procedures would you follow as the first steps in the migration process?

Export data from the on-premises data warehouse, transfer the data to Cloud Storage, load data from Cloud Storage into BigQuery. Next transfer all workloads and then transfer all reporting jobs to GCP.
Export data from the on-premises data warehouse, transfer the data to Cloud Storage, load data from Cloud Storage into Bigtable. Next transfer all report jobs and then transfer all workloads to GCP.
Transfer groups of tables related to one use case at a time. Denormalize the tables in the process to take advantage of clustering. Configure and test downstream processes to read from Bigtable.
Transfer groups of tables related to one use case at a time. Do not modify tables in the process. Configure and test downstream processes to read from BigQuery.

19. The Chief Finance Officer of your company has requested a set of data warehouse reports for use by end-users who are not proficient in SQL. You want to use Google Cloud Services. Which of the following are services you could use to create the reports?

Looker (Correct)
Tableau
Data Studio
Cloud Dataprep
Cloud Fusion

20. A team of data scientists is using a Redis cache provided by Cloud Memorystore to store a large data set in memory. They have a custom Python application for analyzing the data. While optimizing the program they found that a significant amount of time is spent counting the number of distinct elements in sets. They are willing to use less precise numbers if they can get an approximate answer faster. Which Redis data type would you recommend they use?

Sorted Sets
Stochastic Sets
HyperLogLog
List

21. You are migrating an on-premises Spark and Hadoop cluster to Google Cloud using Cloud Dataproc. The on-premises cluster uses HDFS and attached storage for persistence. The cluster runs continually, 24×7. You understand that it is common to use ephemeral Spark and Hadoop clusters in Google Cloud but are concerned about the time it would take to load data into HDFS each time a cluster is created. What would you do to ensure data is accessible to a new cluster as soon as possible?

Store the data in Bigtable and copy data to HDFS when the cluster is created.
Store the data in Cloud Storage and copy the data to HDFS when the cluster is created.
Use the Cloud Storage Connector to read data directly from Cloud Storage
Create snapshots of each disk before shutting down a cluster and use them as disk images when creating a new cluster.

22. A Spark job is failing but you cannot identify the problem from the contents of the log file. You want to run the job again and get more logging information. Which of the following command fragments would you use as part of a command to submit a job to Spark and have it log more detail than the default amount?

gcloud dataproc jobs submit spark --driver-log-levels
gcloud dataproc submit jobs spark --driver-log-levels
gcloud dataproc jobs submit spark --enable-debug
gcloud dataproc submit jobs spark --enable-debug

23. You have migrated a Spark cluster from on-premises to Cloud Dataproc. You are following the best practice of using ephemeral clusters to keep costs down. When the cluster starts, data is copied to the cluster HDFS before jobs start running. You would like to minimize the time between creating a cluster and starting jobs running on that cluster. Which of the following could do the most to reduce that time without increasing cost?

Use SSDs
Use the Cloud Storage Connector and keep data in Cloud Storage instead of copying it each time to HDFS.
Use Cloud SQL to persist data when clusters are not running.
Create a managed instance group of VMs with 1 vCPU and 4 GB of memory and attach sufficient persistent disk to store the data when clusters are not running and then read the data directly from the managed instance group.

24. A Python ETL process that loads a data warehouse is not meeting ingestion SLAs. The service that performs the ingestion and initial processing cannot keep up with incoming data at peak times. The peak times do not last longer than one minute and occur at most once per day but data is sometimes lost during those times. You need to ensure data is not lost to the ingestion process. What would you try first to prevent data loss?

Rewrite the ETL process in Java or C
Ingest data into a Cloud Pub/Sub topic using a push processing model
Ingest data into a Cloud Pub/Sub topic using a pull subscription
Ingest data into a Cloud Dataflow topic using a pull subscription

25. You support an ETL process on-premises and need to migrate it to a virtual machine running in Google Cloud. The process sometimes fails without warning. You do not have time to diagnose and correct the problem before migrating. What can you do to discover failure as soon as possible?

Create a process to run in App Engine that analyzes the list of processes running on the virtual machine to ensure the process name always appears in the list and if not, send a notification to you.
Create a Cloud Monitoring uptime check and if the uptime check fails send a notification to you.
Create a Cloud Monitoring alert with a condition that checks for CPU utilization below 5%. If CPU utilization drops below 5% for more than 1 minute, send a notification to you.
Create an alert based on Cloud Logging to alert you when Cloud Logging stops receiving log data from the process

26. Many applications and services are running in several Google Cloud services. You would like to know if all services’ logs are up to date with ingesting data into Cloud Logging. How would you get this information with the least effort?

Write a Python script to call the Cloud Logging API to get ingestion status
View the Cloud Logging Resource page in Google Cloud Console
View the Cloud Logging Router page in Google Cloud Console
Write a custom Logs View query to get the information

27. You have concluded that symbolic machine learning algorithms will not perform well on a classification problem. You have decided to build a model based on a deep learning network. Several features are categorical variables with 3 to 7 distinct values each. How would you represent these features when presenting data to the network?

Feature cross
One-hot encoding
Regression
Standardization
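
As a minimal sketch of turning a categorical variable with a handful of distinct values into a numeric vector with one position per category, the function below builds the vector by hand. The category names are hypothetical; a real pipeline would more likely rely on a library encoder.

```python
def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

colors = ["red", "green", "blue"]  # hypothetical 3-value categorical feature
print(one_hot("green", colors))
```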

28. A Cloud Dataflow job will need to list files and copy those files from a Cloud Storage bucket. What is the best way to ensure the job will have access when it tries to read data from those buckets? The job will not write data to Cloud Storage.

Assign the job the Storage Object Viewer role
Create a Cloud Identity account and grant it Storage Object Viewer role
Create a service account and grant it the Storage Object Viewer role
Create a service account and grant it a custom role that has storage.objects.get permission only.

29. Your team is setting up a development environment to create a proof of concept system. You will use the environment for one week. Only members of the team will have access. No confidential or sensitive data will be used. You want to grant most members of the team the ability to modify resources and read data. Only one member of the team should have administrator capabilities, such as the ability to modify permissions. The administrator should have all permissions other members of the team have. What role would you assign to the team member with the administrator role?

The Owner primitive role
The Editor primitive role
The roles/cloudasset.owner predefined role
The roles/cloudasset.viewer predefined role

30. As an administrator of a BigQuery data warehouse, you grant access to users according to their responsibilities in the organization. You follow the Principle of Least Privilege when granting access. Several users need to be able to read and update data in a BigQuery table as well as delete tables in a dataset. What role would you assign to those users?

roles/bigquery.dataViewer
roles/bigquery.dataEditor
roles/bigquery.metadataViewer
roles/bigquery.metadataOwner

31. A colleague has asked for your advice about tuning a classifier built using random forests. What hyperparameter or hyperparameters would you suggest adjusting to improve accuracy?

Number of trees only
Number of trees and depth of trees
Learning rate
Number of clusters

32. When training a neural network, what parameter is learned?

Weights on input values to a node
Learning rate
Optimal activation function
Number of layers in the network

33. You are building a classifier to identify customers most likely to buy additional products when presented with an offer. You have approximately 20 features. The model is not performing as well as needed. You suspect the model is missing some relationships that are determined by a combination of two features. What feature engineering technique would you try to improve the quality of the model?

Normalization
Regularization
Feature cross
Bucketing
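
One way to let a model see the combination of two features is to build a synthetic feature from their pairing. The sketch below combines two categorical features into a single crossed feature and encodes it as a 0/1 vector; the feature names, values, and the `_x_` separator are assumptions for illustration.

```python
from itertools import product

def feature_cross(a: str, b: str) -> str:
    """Combine two categorical values into one synthetic category."""
    return f"{a}_x_{b}"

regions = ["north", "south"]   # hypothetical feature 1
tiers = ["basic", "premium"]   # hypothetical feature 2

# Every pairing of the two features becomes its own category.
crossed_vocab = [feature_cross(r, t) for r, t in product(regions, tiers)]

def encode(region: str, tier: str):
    """Encode the crossed feature with one position per pairing."""
    crossed = feature_cross(region, tier)
    return [1 if crossed == c else 0 for c in crossed_vocab]

print(crossed_vocab)
print(encode("south", "basic"))
```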

34. The CTO of your organization wants to reduce the amount of money spent on running Hadoop clusters in the cloud but does not want to adversely impact the time it takes for jobs to run. When workloads run, they utilize 86% of CPU and 92% of memory. A single cluster is used for all workloads and it runs continuously. What are some options for reducing costs without significantly impacting performance?

Reduce the number and size of virtual machines in the cluster.
Use preemptible worker nodes and use ephemeral clusters.
Use preemptible worker nodes and Shielded VMs.
Reduce the number of virtual machines and use ephemeral clusters

35. You have been asked to help diagnose a deep learning neural network that has been trained with a large dataset over hundreds of epochs but the accuracy, precision, and recall are below the levels required on both training and test data sets. You start by reviewing the features and see that all the features are numeric. Some are on the scale of 0 to 1, some are on the scale of 0 to 100, and several are on the scale of 0 to 10,000. What feature engineering technique would you use and why?

Regularization, to map all features to the same 0 to 1 scale
Normalization, to map all features to the same 0 to 1 scale
Regularization, to reduce the amount of information captured in the model
Backpropagation to reduce the amount of information captured in the model
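
Rescaling features that live on very different ranges onto a common 0 to 1 scale can be sketched with min-max scaling, shown below. This is one common formula among several, and the sample values are invented.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

small_scale = [0.0, 0.5, 1.0]
large_scale = [0, 5000, 10000]
print(min_max_scale(small_scale), min_max_scale(large_scale))
```

After scaling, both features span the same range, so neither dominates simply because its raw numbers are larger.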

36. You have trained a deep learning model. After training is complete, the model scores high on accuracy, precision, and recall when measured using training data; however, when validation data is used, the accuracy, precision, and recall are much lower. This is an example of what kind of problem?

Underfitting
Overfitting
Insufficiently complex model
Learning rate is too small

37. A business intelligence analyst is running many BigQuery queries that are scanning large amounts of data, which leads to higher BigQuery costs. What would you recommend the analyst do to better understand the cost of queries before executing them?

Use the bq query command with the SQL statement and the --dry_run option
Use the bq query command with the SQL statement and the --estimate option
Use the bq query command with the SQL statement and the --max_rows_per_request option
Use the gcloud bigquery command with the SQL statement and the --max_rows_per_request option

38. A business intelligence analyst wants to build a machine learning model to predict the number of units of a product that will be sold in the future based on dozens of features. The features are all stored in a relational database. The business analyst is familiar with reporting tools but not programming in general. What service would you recommend the analyst use to build a model?

TensorFlow
Spark ML
AutoML Tables
Bigtable ML

39. A team of machine learning engineers wants to use Kubernetes to run their models. They would like to use standard practices for machine learning workflows. What tool would you recommend they use?

Kubeflow
TensorFlow
Spark ML
Scikit-Learn

40. When testing a regression model, you notice that small changes in a few features can lead to large differences in the output. This is an example of what kind of problem?

High variance
Low variance
High bias
Low bias

41. A machine learning engineer has built a deep learning network to classify medical radiology images. When evaluated, the model performed well with 95% accuracy and high precision and recall. The engineer noted that the training took an unusually long time and asked you how to decrease the training time without adding additional computing resources or risk reducing the quality of the model. What would you recommend?

Reduce the number of layers in the model.
Reduce the number of nodes in each layer of the model.
Increase the learning rate.
Decrease the learning rate.

42. A number of machine learning models used by your company are producing questionable results, particularly with some demographic groups. You suspect there may be an unfairness bias in these models. Which of the following could you use to assess the possibility of unfairness and bias?

Anti-classification
Classification parity
Regularization
Normalization

43. A regression model developed three months ago is no longer performing as well as it originally did. What could be the cause of this?

Data skew
Underfitting
Increased latency
Decreased recall

44. A data scientist is developing a machine learning model to predict the toxicity of drug candidates. The training data set consists of a large number of chemical and physical attributes and there is a large number of instances. Training takes almost a week on an n2-standard-16 virtual machine. What would you recommend to reduce the training time without compromising the quality of the model?

Randomly sample 5% of the training set and train on that smaller data set
Attach a GPU to the virtual machine
Increase the machine size to make more memory available
Increase the machine size to make more CPUs available

45. Your company has an organization with several folders and several projects defined in the Resource Hierarchy. You want to limit access to all VMs created within a project. How would you specify those restrictions?

Create a policy and attach it to the project
Create a policy and attach to each VM as it is created
Create a custom role and attach it to a group that contains all identities with access to the project
Create a custom role and attach it to each identity with access to the project

46. An insurance company needs to keep logs of applications used to make underwriting decisions. Industry regulations require the company to store logs for seven years. The logs are not likely to be accessed. Approximately 12 TB of log data is generated per year. What is the most cost-effective way to store this data?

Use Nearline Cloud Storage
Use Multi-regional Cloud Storage
Use Firestore mode of Cloud Datastore
Use Coldline Storage

47. A multi-national enterprise uses Cloud Spanner for an inventory management system. After some investigation, you find that hot-spotting is adversely impacting the performance of the Cloud Spanner database. Which two of the following options could be used to avoid hot-spotting?

Use an auto-incrementing value as the primary key
Bit-reverse sequential values used as the primary key
Promote low cardinality attributes in multi-attribute primary keys
Promote high cardinality attributes in multi-attribute primary keys
Further normalize the data model
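
Bit reversal turns consecutive integers into values that are far apart, which spreads sequential keys across the key space. The sketch below reverses the bits of an unsigned 64-bit value; the function name is an assumption, and mapping the result into a signed 64-bit column is left out.

```python
def bit_reverse_64(n: int) -> int:
    """Reverse the bit order of an unsigned 64-bit integer."""
    if not 0 <= n < 2 ** 64:
        raise ValueError("expected an unsigned 64-bit value")
    # Format as 64 binary digits, reverse the string, parse it back.
    return int(f"{n:064b}"[::-1], 2)

sequential_ids = [1, 2, 3]
reversed_ids = [bit_reverse_64(i) for i in sequential_ids]
print(reversed_ids)
```

Reversing twice returns the original value, so the sequence number can still be recovered from the stored key.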

48. An online gaming company is building a prototype data store for a player information system using Cloud Datastore. Developers have created a database with 10,000 fictitious player records. The attributes include a player identifier, a list of possessions, a health status, and a team identifier. Queries that return player identifier and list of possessions filtered by health status return results correctly, however, queries that return player identifier and team identifier filtered by health status and team identifier do not return any results even when there are entities in the database that satisfy the filter. What would you first check when troubleshooting this problem?

Verify two indexes exist, one on the player identifier and one on the team identifier
Verify a single composite index exists on the player identifier and the team identifier
Verify that both the player identifier and the team identifier are defined as integer data types
Verify the SCAN_ENABLED database parameter is set to True

49. An online game company is developing a service that combines gaming with math tutoring for children ages 8 to 13. The company plans to collect some personally identifying information from the children. The game will be released in the European Union only. What regulation would the company need to take into consideration as it develops the game?

Child Online Protection Act
General Data Protection Regulation (GDPR)
Sarbanes-Oxley
FedRAMP

50. You have migrated a data warehouse from on-premises to BigQuery. You have not modified the ETL process other than to change the target database to BigQuery. The overall load performance is slower than expected and you have been asked to tune the process. You have determined that the most time-consuming part of the load process is the final step of the ETL process. It loads data from CSV files compressed using Snappy compression into BigQuery. The files are stored in Cloud Storage. What change would save the most time in the load process?