Google Cloud Professional Data Engineer
1. You are consulting with a company that provides a software-as-a-service (SaaS) platform for collecting and analyzing data from agricultural IoT sensors. The company currently uses Bigtable to store the data but is finding performance to be less than expected. You suspect the problem may be hot-spotting, so you look into the structure of the row key. The current row key is the concatenation of the following: the datetime of the sensor reading, customer ID, and sensor ID. What alternative row key would you suggest?
- Datetime of sensor reading, sensor ID, customer ID
- Random number, datetime of sensor reading, sensor ID, customer ID
- Customer ID, sensor ID, datetime of sensor reading
- Sensor ID, datetime of sensor reading, customer ID
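To illustrate the idea behind leading with non-sequential, high-cardinality identifiers and putting the timestamp last, here is a minimal sketch (identifiers and format are hypothetical):

```python
from datetime import datetime, timezone

def make_row_key(customer_id: str, sensor_id: str, reading_time: datetime) -> str:
    """Build a Bigtable row key that avoids hot-spotting.

    Leading with customer ID and sensor ID spreads concurrent writes
    across the key space; the timestamp goes last so readings for one
    sensor still sort chronologically within that sensor's key range.
    """
    return f"{customer_id}#{sensor_id}#{reading_time.strftime('%Y%m%d%H%M%S')}"

key = make_row_key("cust42", "sensor007",
                   datetime(2024, 5, 1, 12, 30, 0, tzinfo=timezone.utc))
# key == "cust42#sensor007#20240501123000"
```

A key that starts with the reading datetime would send all concurrent writes to the same tablet, which is the hot-spotting symptom described in the question.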
2. Compliance with regulations requires that you keep copies of logs generated by applications that perform financial transactions for 3 years. You currently run applications on-premises but will move them to Google Cloud. You want to keep the logs for three years as inexpensively as possible. You do not expect to query the logs but must be able to provide access to files on demand. How would you configure GCP resources to meet this requirement?
- send application logs to Cloud Logging and create a Cloud Storage sink to store the logs for the long term
- send application logs to Cloud Logging and leave them there
- send application logs to Cloud Logging and leave them there and create a data lifecycle management policy to delete logs over 3 years old.
- send application logs to Cloud Logging and create a Bigtable sink to store the logs for the long term
3. A team of developers is consolidating several data pipelines used by an insurance company to process claims. The claims processing logic is complex and already encoded in a Java library. The current data pipelines run in batch mode but the insurance company wants to process claims as soon as they are created. What GCP service would you recommend using?
- Cloud Datastore
- Cloud Dataflow
- Cloud Dataprep
- Cloud Pub/Sub
4. A regional auto dealership is migrating its business applications to Google Cloud. The company currently uses a third-party application that uses PostgreSQL for storing data. The CTO wants to reduce the cost of supporting this application and the database. What would you recommend to the CTO as the best option to reduce the cost of maintaining and operating the database?
- Use Cloud Datastore
- Use Cloud Spanner
- Use Cloud SQL
- Use a SQL Server database
5. You are developing a machine learning model to predict the likelihood of a device failure. The device generates a stream of metrics every thirty seconds. The metrics include 3 categorical values, 5 integer values, and 1 floating-point value. The floating-point value ranges from 0 to 100. For the purposes of the model, the floating-point value is more precise than needed. Mapping that value to a feature with possible values “high”, “medium”, and “low” is sufficient. What feature engineering technique would you use to transform the floating-point value to high, medium, or low?
- L1 Regularization
- L2 Regularization
6. A company is migrating its backend services to Google Cloud. Services are implemented in Java and Kafka is used as a messaging platform between services. The DevOps team would like to reduce its operational overhead. What managed GCP service might they use as an alternative to Kafka?
- Cloud Dataflow
- Cloud Dataproc
- Cloud Pub/Sub
- Cloud Datastore
7. You are implementing a data warehouse using BigQuery. A data modeler, unfamiliar with BigQuery, developed a model that is highly normalized. You are concerned that a highly normalized model will require the frequent joining of tables to respond to common queries. You want to denormalize the data model but still want to be able to represent 1-to-many relations. How could you do this with BigQuery?
- Rather than store associated data in another table, store them in ARRAYS within the primary table.
- Rather than store associated data in another table, store them in STRUCTS within the primary table
- Use partitioning and clustering to denormalize
- Model entities using wide-column tables
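As a sketch of the denormalized shape (field names hypothetical), a 1-to-many relationship can live inside the parent row as a repeated nested field, which BigQuery models as an ARRAY of STRUCTs:

```python
# A denormalized order record: the 1-to-many line items are kept in a
# repeated nested field (BigQuery's ARRAY of STRUCTs) rather than a
# separate table that would need a join.
order = {
    "order_id": "o-1001",
    "customer": "cust42",
    "line_items": [                      # ARRAY of STRUCT
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}

# Common queries can aggregate over the nested items without a join.
total_qty = sum(item["qty"] for item in order["line_items"])
```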
8. A team of machine learning engineers is developing deep learning models using TensorFlow. They have extremely large data sets and must frequently retrain models. They are currently using a managed instance group with a fixed number of VMs, and they are not meeting SLAs for retraining. What would you suggest the machine learning engineers try next?
- Enable autoscaling of the managed instance group and set a high maximum number of VMs
- Keep the same number of VMs in the managed instance group but use larger machine types
- Attach TPUs to the Compute Engine VMs
- Deploy the training service in containers and use Kubernetes Engine to scale as needed
9. You have a BigQuery table partitioned by ingestion time and want to create a view that returns only rows ingested in the last 7 days. Which of the following statements would you use in the WHERE clause of the view definition to limit results to include only the most recent seven days of data?
- PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR), DAY) AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)
- _PARTITIONTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR), DAY) AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)
- _INGESTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR), DAY) AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)
- INGESTTIME BETWEEN TIMESTAMP_TRUNC(TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 * 24 HOUR), DAY) AND TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY)
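Ingestion-time partitioned tables expose the pseudo-column `_PARTITIONTIME`. As a rough Python sketch of the two bounds that `TIMESTAMP_TRUNC(..., DAY)` computes (seven days back, and today, each truncated to midnight UTC):

```python
from datetime import datetime, timedelta, timezone

def partition_window(now: datetime):
    """Python equivalent of the two day-truncated bounds in the predicate:
    TIMESTAMP_SUB(now, 7 days) truncated to DAY, and now truncated to DAY."""
    trunc = lambda t: t.replace(hour=0, minute=0, second=0, microsecond=0)
    return trunc(now - timedelta(days=7)), trunc(now)

lo, hi = partition_window(datetime(2024, 5, 10, 15, 30, tzinfo=timezone.utc))
# lo is midnight 2024-05-03 UTC, hi is midnight 2024-05-10 UTC
```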
10. Your department currently uses HBase on a Hadoop cluster for an analytics database. You want to migrate that data to Google Cloud. Only one workload runs on the Hadoop cluster, and it uses the HBase API. You would like to avoid having to manage a Spark and Hadoop cluster, but you do not want to change the application code of the one workload running on the cluster. How could you move the workload to GCP, use a managed service, and not change the application?
- Migrate the data to Cloud Storage and use its HBase API
- Migrate the data to Bigtable and use the HBase API
- Migrate the data to Datastore and use the HBase API
- Migrate the data to BigQuery and use its HBase API
11. A data pipeline is not performing well enough to meet SLAs. You have determined that long-running database queries are slowing processing. You decide to try to use a read-through cache. You want the cache to support sets and sorted sets as well. What Google Cloud Service would you use?
- Cloud Memorystore with Memcached
- Cloud Memorystore with Redis
- Cloud Memorystore with SQL Server
- Cloud Datastore
12. Sensor data from manufacturing machines is ingested through Pub/Sub and read by a Cloud Dataflow job, which analyzes the data. The data arrives in one-minute intervals and includes a timestamp and measures of temperature, vibration, and ambient humidity. Industrial engineers have determined that if the average temperature exceeds 10% of the maximum safe operating temperature for more than 10 minutes and the average ambient humidity is above 90% for more than 10 minutes, then the machine should be shut down. What operation would you perform on the stream of data to determine when to trigger an alert to shut down the machine?
- Set a 10-minute watermark and when the watermark is reached, trigger an alert.
- Create a 10-minute tumbling window, compute the average temperature and average humidity, and if both exceed the specified thresholds, then trigger an alert.
- Create a 10-minute sliding window, compute the average temperature and average humidity, and if both exceed the specified thresholds, then trigger an alert.
- Create a Redis cache using Cloud Memorystore, use an ordered list data structure, write a Java or Python function to compute 10-minute averages for temperature and humidity, and if both exceed the specified thresholds, trigger an alert.
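A sliding window recomputes the averages each minute over the trailing 10 readings, so a sustained exceedance triggers as soon as the tenth consecutive minute completes. A minimal sketch of the window logic (threshold values hypothetical):

```python
from collections import deque

def should_shut_down(readings, temp_limit, humidity_limit, window=10):
    """Slide a 10-reading window over per-minute (temperature, humidity)
    pairs; return True the first time both window averages exceed limits."""
    temps = deque(maxlen=window)   # deque drops the oldest reading as it slides
    hums = deque(maxlen=window)
    for temp, hum in readings:
        temps.append(temp)
        hums.append(hum)
        if len(temps) == window:
            if (sum(temps) / window > temp_limit
                    and sum(hums) / window > humidity_limit):
                return True
    return False

# Ten minutes of hot, humid readings trip the alert; cooler readings do not.
should_shut_down([(220.0, 95.0)] * 10, temp_limit=210.0, humidity_limit=90.0)
```

A tumbling window could miss an exceedance that straddles two windows, which is why the sliding window is the better fit here.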
13. Autonomous vehicles stream data about vehicle performance to a Cloud Pub/Sub queue for ingestion. You want to randomly sample the data stream to collect 0.01% of the data for your own analysis. You want to do this with the least amount of new code and infrastructure while still having access to the data as soon as possible. What is the best option for doing this?
- Create a sink from the Cloud Pub/Sub topic to a Cloud Storage bucket and write the data to files on an hourly basis. Create a containerized application running in App Engine to read the latest hourly data file and randomly sample 0.01% of the data.
- Create a Cloud Function that executes when a message is written to the Cloud Pub/Sub topic. Randomly generate a number between 0 and 1 in the function, and if the random number is less than 0.0001, then write the message to another topic that you created to act as the source of data for your analysis.
- Create an App Engine application that executes continuously and polls the Cloud Pub/Sub topic. When a message is read from the topic, randomly generate a number between 0 and 1, and if the random number is less than 0.0001, then write the message to another topic that you created to act as the source of data for your analysis.
- Create a sink from the Cloud Pub/Sub topic to a Cloud Spanner database table. Create a containerized application running in App Engine to read the data continuously and randomly sample 0.01% of the data.
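Sampling 0.01% of messages corresponds to keeping those where a uniform draw on [0, 1) falls below 0.0001. A sketch of the per-message decision such a function might make (`publish` is a stand-in for the Pub/Sub client call, not a real API):

```python
import random

SAMPLE_RATE = 0.0001  # 0.01% expressed as a probability

def maybe_sample(message, publish):
    """Forward roughly 0.01% of messages to a second analysis topic.

    `publish` is a placeholder callable for publishing to that topic."""
    if random.random() < SAMPLE_RATE:
        publish(message)
        return True
    return False

random.seed(0)  # seeded only to make this demonstration repeatable
forwarded = sum(maybe_sample({"id": i}, lambda m: None) for i in range(100_000))
# forwarded is on the order of 10 (0.01% of 100,000)
```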
14. You are training a deep learning neural network. You are using gradient descent to find optimal weights. You want to update the weights after each instance is analyzed. Which type of gradient descent would you use?
- Batch gradient descent
- Stochastic gradient descent
- Mini-batch gradient descent
- Max-batch gradient descent
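Stochastic gradient descent updates the weights after every individual instance rather than after the full batch. A toy sketch fitting y = w·x on three points:

```python
def sgd_fit(data, lr=0.05, epochs=50):
    """Stochastic gradient descent for y = w * x: the weight is updated
    immediately after each training instance is analyzed."""
    w = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x   # d/dw of the squared error (w*x - y)^2
            w -= lr * grad               # per-instance update, the SGD hallmark
    return w

w = sgd_fit([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)])
# w converges toward 2.0, the true slope of the toy data
```

Batch gradient descent would instead accumulate the gradient over all three instances before making a single update per epoch.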
15. You are designing a Bigtable database for a multi-tenant analytics service that requires low latency writes at extremely high volumes of data. As an experienced relational data modeler, you are familiar with the process of normalizing data models. Your data model consists of 15 tables with each table having at most 20 columns. Your initial tests with 20% of the expected load indicate much higher than expected latency and some potential issues with connection overhead. What can you do to address these problems?
- Further normalize the data model to reduce latency and add more memory to address connection overhead issues.
- Denormalize the data model to use a single, wide-column table in Bigtable to reduce latency and address connection issues.
- Keep the same data model but use BigQuery, which is specifically an analytical database.
- Denormalize the data model to use a single, wide-column table but implement that table in BigQuery to reduce latency and address connection issues.
16. You are designing a time series database. Data will arrive from thousands of sensors at one-minute intervals. You want to model this time series data using recommended practices. Which of the following would you implement?
- Design rows to store the set of measurements from one sensor at one point in time.
- Design rows to store the set of measurements from all sensors for one point in time.
- Design rows to store the set of measurements from one sensor over a one hour period
- Design rows to store the set of measurements from one sensor over as long a period of time as possible while not exceeding 100 MB per row.
17. You are migrating a data warehouse to BigQuery and want to optimize the data types used in BigQuery. You have many columns in the existing data warehouse that store absolute point-in-time values. They are implemented using 8-byte integers in the existing data warehouse. What data type would you use in BigQuery?
- Long integer
18. Your organization is migrating an enterprise data warehouse from an on-premises PostgreSQL database to Google Cloud to use BigQuery. The data warehouse is used by 7 different departments each of which has its own data, workloads, and reports. You would like to follow recommended data warehouse migration practices. Which of the following procedures would you follow as the first steps in the migration process?
- Export data from the on-premises data warehouse, transfer the data to Cloud Storage, and load the data from Cloud Storage into BigQuery. Next transfer all workloads and then transfer all reporting jobs to GCP.
- Export data from the on-premises data warehouse, transfer the data to Cloud Storage, and load the data from Cloud Storage into Bigtable. Next transfer all reporting jobs and then transfer all workloads to GCP.
- Transfer groups of tables related to one use case at a time. Denormalize the tables in the process to take advantage of clustering. Configure and test downstream processes to read from Bigtable.
- Transfer groups of tables related to one use case at a time. Do not modify tables in the process. Configure and test downstream processes to read from BigQuery.
19. The Chief Finance Officer of your company has requested a set of data warehouse reports for use by end-users who are not proficient in SQL. You want to use Google Cloud Services. Which of the following are services you could use to create the reports?
- Data Studio
- Cloud Dataprep
- Cloud Data Fusion
20. A team of data scientists is using a Redis cache provided by Cloud Memorystore to store a large data set in memory. They have a custom Python application for analyzing the data. While optimizing the program, they found that a significant amount of time is spent counting the number of distinct elements in sets. They are willing to use less precise numbers if they can get an approximate answer faster. Which Redis data type would you recommend they use?
- Sorted Sets
- Stochastic Sets
21. You are migrating an on-premises Spark and Hadoop cluster to Google Cloud using Cloud Dataproc. The on-premises cluster uses HDFS and attached storage for persistence. The cluster runs continually, 24×7. You understand that it is common to use ephemeral Spark and Hadoop clusters in Google Cloud but are concerned about the time it would take to load data into HDFS each time a cluster is created. What would you do to ensure data is accessible to a new cluster as soon as possible?
- Store the data in Bigtable and copy data to HDFS when the cluster is created.
- Store the data in Cloud Storage and copy the data to HDFS when the cluster is created.
- Use the Cloud Storage Connector to read data directly from Cloud Storage
- Create snapshots of each disk before shutting down a cluster and use them as disk images when creating a new cluster.
22. A Spark job is failing but you cannot identify the problem from the contents of the log file. You want to run the job again and get more logging information. Which of the following command fragments would you use as part of a command to submit a job to Spark and have it log more detail than the default amount?
- gcloud dataproc jobs submit spark --driver-log-levels
- gcloud dataproc submit jobs spark --driver-log-levels
- gcloud dataproc jobs submit spark --enable-debug
- gcloud dataproc submit jobs spark --enable-debug
23. You have migrated a Spark cluster from on-premises to Cloud Dataproc. You are following the best practice of using ephemeral clusters to keep costs down. When the cluster starts, data is copied to the cluster HDFS before jobs start running. You would like to minimize the time between creating a cluster and starting jobs running on that cluster. Which of the following could do the most to reduce that time without increasing cost?
- Use SSDs
- Use the Cloud Storage Connector and keep data in Cloud Storage instead of copying it each time to HDFS.
- Use Cloud SQL to persist data when clusters are not running.
- Create a managed instance group of VMs with 1 vCPU and 4 GB of memory and attach sufficient persistent disk to store the data when clusters are not running and then read the data directly from the managed instance group.
24. A Python ETL process that loads a data warehouse is not meeting ingestion SLAs. The service that performs the ingestion and initial processing cannot keep up with incoming data at peak times. The peak times do not last longer than one minute and occur at most once per day, but data is sometimes lost during those times. You need to ensure data is not lost to the ingestion process. What would you try first to prevent data loss?
- Rewrite the ETL process in Java or C
- Ingest data into a Cloud Pub/Sub topic using a push processing model
- Ingest data into a Cloud Pub/Sub topic using a pull subscription
- Ingest data into a Cloud Dataflow topic using a pull subscription
25. You support an ETL process on-premises and need to migrate it to a virtual machine running in Google Cloud. The process sometimes fails without warning. You do not have time to diagnose and correct the problem before migrating. What can you do to discover failure as soon as possible?
- Create a process to run in App Engine that analyzes the list of processes running on the virtual machine to ensure the process name always appears in the list and, if not, sends a notification to you.
- Create a Cloud Monitoring uptime check and, if the uptime check fails, send a notification to you.
- Create a Cloud Monitoring alert with a condition that checks for CPU utilization below 5%. If CPU utilization drops below 5% for more than 1 minute, send a notification to you.
- Create an alert based on Cloud Logging to alert you when Cloud Logging stops receiving log data from the process
26. Many applications and services are running in several Google Cloud services. You would like to know whether all services are up to date ingesting log data into Cloud Logging. How would you get this information with the least effort?
- Write a Python script to call the Cloud Logging API to get ingestion status
- View the Cloud Logging Resource page in Google Cloud Console
- View the Cloud Logging Router page in Google Cloud Console
- Write a custom Logs View query to get the information
27. You have concluded that symbolic machine learning algorithms will not perform well on a classification problem. You have decided to build a model based on a deep learning network. Several features are categorical variables with 3 to 7 distinct values each. How would you represent these features when presenting data to the network?
- Feature cross
- One-hot encoding
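One-hot encoding maps each of the 3 to 7 category values to its own binary input, which is the standard way to present low-cardinality categorical features to a neural network. A minimal sketch:

```python
def one_hot(value, categories):
    """Encode a categorical value as a vector with a single 1 in the
    position of the matching category and 0 everywhere else."""
    return [1 if c == value else 0 for c in categories]

one_hot("green", ["red", "green", "blue"])  # [0, 1, 0]
```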
28. A Cloud Dataflow job will need to list files in a Cloud Storage bucket and copy those files from the bucket. What is the best way to ensure the job will have access when it tries to read data from the bucket? The job will not write data to Cloud Storage.
- Assign the job the Storage Object Viewer role
- Create a Cloud Identity account and grant it Storage Object Viewer role
- Create a service account and grant it the Storage Object Viewer role
- Create a service account and grant it a custom role that has storage.objects.get permission only.
29. Your team is setting up a development environment to create a proof of concept system. You will use the environment for one week. Only members of the team will have access. No confidential or sensitive data will be used. You want to grant most members of the team the ability to modify resources and read data. Only one member of the team should have administrator capabilities, such as the ability to modify permissions. The administrator should have all permissions other members of the team have. What role would you assign to the team member with the administrator role?
- The Owner primitive role
- The Editor primitive role
- The roles/cloudasset.owner predefined role
- The roles/cloudasset.viewer predefined role
30. As an administrator of a BigQuery data warehouse, you grant access to users according to their responsibilities in the organization. You follow the Principle of Least Privilege when granting access. Several users need to be able to read and update data in a BigQuery table as well as delete tables in a dataset. What role would you assign to those users?
31. A colleague has asked for your advice about tuning a classifier built using random forests. What hyperparameter or hyperparameters would you suggest adjusting to improve accuracy?
- Number of trees only
- Number of trees and depth of trees
- Learning rate
- Number of clusters
32. When training a neural network, what parameter is learned?
- Weights on input values to a node
- Learning rate
- Optimal activation function
- Number of layers in the network
33. You are building a classifier to identify customers most likely to buy additional products when presented with an offer. You have approximately 20 features. The model is not performing as well as needed. You suspect the model is missing some relationships that are determined by a combination of two features. What feature engineering technique would you try to improve the quality of the model?
- Feature cross
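A feature cross forms a synthetic feature from the combination of two (or more) features, so a model can pick up relationships that only appear in the pair. A sketch with hypothetical feature values:

```python
def cross(a: str, b: str) -> str:
    """Combine two categorical feature values into one crossed feature
    value; the model then learns a weight per combination."""
    return f"{a}_x_{b}"

cross("mobile", "weekend")  # "mobile_x_weekend"
```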
34. The CTO of your organization wants to reduce the amount of money spent on running Hadoop clusters in the cloud but does not want to adversely impact the time it takes for jobs to run. When workloads run, they utilize 86% of CPU and 92% of memory. A single cluster is used for all workloads and it runs continuously. What are some options for reducing costs without significantly impacting performance?
- Reduce the number and size of virtual machines in the cluster.
- Use preemptible worker nodes and use ephemeral clusters.
- Use preemptible worker nodes and Shielded VMs
- Reduce the number of virtual machines and use ephemeral clusters
35. You have been asked to help diagnose a deep learning neural network that has been trained with a large dataset over hundreds of epochs, but the accuracy, precision, and recall are below the required levels on both training and test data sets. You start by reviewing the features and see that all the features are numeric. Some are on the scale of 0 to 1, some are on the scale of 0 to 100, and several are on the scale of 0 to 10,000. What feature engineering technique would you use and why?
- Regularization, to map all features to the same 0 to 1 scale
- Normalization, to map all features to the same 0 to 1 scale
- Regularization, to reduce the amount of information captured in the model
- Backpropagation to reduce the amount of information captured in the model
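Min-max normalization rescales each feature to a common 0-to-1 range, so a 0-to-10,000 feature no longer dominates a 0-to-1 feature during training. A sketch:

```python
def min_max_normalize(values):
    """Min-max normalization: map a feature's values onto [0, 1] using
    (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

min_max_normalize([0, 2500, 5000, 10000])  # [0.0, 0.25, 0.5, 1.0]
```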
36. You have trained a deep learning model. After training is complete, the model scores high on accuracy, precision, and recall when measured using training data; however, when validation data is used, the accuracy, precision, and recall are much lower. This is an example of what kind of problem?
- Insufficiently complex model
- Learning rate is too small
37. A business intelligence analyst is running many BigQuery queries that are scanning large amounts of data, which leads to higher BigQuery costs. What would you recommend the analyst do to better understand the cost of queries before executing them?
- Use the bq query command with the SQL statement and the --dry-run option
- Use the bq query command with the SQL statement and the --estimate option
- Use the bq query command with the SQL statement and the --max-rows-per-request option
- Use the gcloud bigquery command with the SQL statement and the --max-rows-per-request option
38. A business intelligence analyst wants to build a machine learning model to predict the number of units of a product that will be sold in the future based on dozens of features. The features are all stored in a relational database. The business analyst is familiar with reporting tools but not programming in general. What service would you recommend the analyst use to build a model?
- Spark ML
- AutoML Tables
- Bigtable ML
39. A team of machine learning engineers wants to use Kubernetes to run their models. They would like to use standard practices for machine learning workflows. What tool would you recommend they use?
- Spark ML
40. When testing a regression model, you notice that small changes in a few features can lead to large differences in the output. This is an example of what kind of problem?
- High variance
- Low variance
- High bias
- Low bias
41. A machine learning engineer has built a deep learning network to classify medical radiology images. When evaluated, the model performed well with 95% accuracy and high precision and recall. The engineer noted that the training took an unusually long time and asked you how to decrease the training time without adding additional computing resources or risk reducing the quality of the model. What would you recommend?
- Reduce the number of layers in the model.
- Reduce the number of nodes in each layer of the model.
- Increase the learning rate.
- Decrease the learning rate.
42. A number of machine learning models used by your company are producing questionable results, particularly with some demographic groups. You suspect there may be an unfairness bias in these models. Which of the following could you use to assess the possibility of unfairness and bias?
- Classification parity
43. A regression model developed three months ago is no longer performing as well as it originally did. What could be the cause of this?
- Data skew
- Increased latency
- Decreased recall
44. A data scientist is developing a machine learning model to predict the toxicity of drug candidates. The training data set consists of a large number of chemical and physical attributes and there is a large number of instances. Training takes almost a week on an n2-standard-16 virtual machine. What would you recommend to reduce the training time without compromising the quality of the model?
- Randomly sample 5% of the training set and train on that smaller data set
- Attach a GPU to the virtual machine
- Increase the machine size to make more memory available
- Increase the machine size to make more CPUs available
45. Your company has an organization with several folders and several projects defined in the Resource Hierarchy. You want to limit access to all VMs created within a project. How would you specify those restrictions?
- Create a policy and attach it to the project
- Create a policy and attach to each VM as it is created
- Create a custom role and attach it to a group that contains all identities with access to the project
- Create a custom role and attach it to each identity with access to the project
46. An insurance company needs to keep logs of applications used to make underwriting decisions. Industry regulations require the company to store logs for seven years. The logs are not likely to be accessed. Approximately 12 TB of log data is generated per year. What is the most cost-effective way to store this data?
- Use Nearline Cloud Storage
- Use Multi-regional Cloud Storage
- Use Firestore mode of Cloud Datastore
- Use Coldline Storage
47. A multi-national enterprise uses Cloud Spanner for an inventory management system. After some investigation, you find that hot-spotting is adversely impacting the performance of the Cloud Spanner database. Which two of the following options could be used to avoid hot-spotting?
- Use an auto-incrementing value as the primary key
- Bit-reverse sequential values used as the primary key
- Promote low cardinality attributes in multi-attribute primary keys
- Promote high cardinality attributes in multi-attribute primary keys
- Further normalize the data model
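Bit-reversing a sequential value keeps keys unique while scattering consecutive inserts across the whole key space instead of piling them onto one split. A sketch for 64-bit IDs:

```python
def bit_reverse_64(n: int) -> int:
    """Reverse the bits of a 64-bit integer. Applied to a sequential ID,
    this spreads consecutive values far apart, avoiding the hot-spotting
    caused by a monotonically increasing primary key."""
    result = 0
    for _ in range(64):
        result = (result << 1) | (n & 1)  # shift in the lowest bit of n
        n >>= 1
    return result

[bit_reverse_64(i) for i in (1, 2, 3)]
# consecutive inputs map to keys near the top of the 64-bit range
```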
48. An online gaming company is building a prototype data store for a player information system using Cloud Datastore. Developers have created a database with 10,000 fictitious player records. The attributes include a player identifier, a list of possessions, a health status, and a team identifier. Queries that return player identifier and list of possessions filtered by health status return results correctly, however, queries that return player identifier and team identifier filtered by health status and team identifier do not return any results even when there are entities in the database that satisfy the filter. What would you first check when troubleshooting this problem?
- Verify two indexes exist, one on the player identifier and one on the team identifier
- Verify a single composite index exists on the player identifier and the team identifier
- Verify that both the player identifier and the team identifier are defined as integer data types
- Verify the SCAN_ENABLED database parameter is set to True
49. An online game company is developing a service that combines gaming with math tutoring for children ages 8 to 13. The company plans to collect some personally identifying information from the children. The game will be released in the European Union only. What regulation would the company need to take into consideration as it develops the game?
- Child Online Protection Act
- General Data Protection Regulation (GDPR)
50. You have migrated a data warehouse from on-premises to BigQuery. You have not modified the ETL process other than to change the target database to BigQuery. The overall load performance is slower than expected, and you have been asked to tune the process. You have determined that the most time-consuming part is the final step of the ETL process, which loads data from CSV files compressed with Snappy into BigQuery. The files are stored in Cloud Storage. What change would save the most time in the load process?
- Use LZO compression instead of Snappy compression with the CSV files
- Use uncompressed Avro files instead of compressed CSV
- Use compressed JSON files instead of compressed CSV
- Use uncompressed JSON files instead of compressed CSV