Feature Engineering Coursera Quiz Answers
Apache Beam and Cloud Dataflow Quiz Answers
Q1. Which of these accurately describes the relationship between Apache Beam and Cloud Dataflow?
- Ans: Apache Beam is the API for building data pipelines in Java or Python, and Cloud Dataflow is the implementation and execution framework.
Q2. TRUE or FALSE: The Filter method can be carried out in parallel and autoscaled by the execution framework:
- Ans: True. Filter is applied to each element independently, so the execution framework can parallelize and autoscale it.
Q3. What is the purpose of a Cloud Dataflow connector?
- Connectors allow you to output the results of a pipeline to a specific data sink like Bigtable, Google Cloud Storage, flat file, BigQuery, and more…
Q4. Below you’ll find a Cloud Dataflow preprocessing graph. Correctly identify the terms for A, B, and C
- A is a data source, B are transformation steps, and C is a data sink
Q5. To run a pipeline you need something called a __
- Ans: runner
Q6. Your development team is about to execute this code block. What is your team about to do?
- We are compiling our Cloud Dataflow pipeline written in Java and are submitting it to the cloud for execution
Q7. TRUE or FALSE: A ParDo acts on all items at once (like a Map in MapReduce)
- Ans: False. A ParDo processes each element independently, one at a time.
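The Beam answers above all rest on one idea: a pipeline is a data source feeding element-wise transforms (Map, Filter, ParDo) into a sink, and because each element is processed independently, the runner can parallelize and autoscale the work. Below is a plain-Python stand-in for that flow; it is a conceptual sketch, not the Apache Beam API, and the `run_pipeline` helper and sample data are made up for illustration.

```python
# Conceptual stand-in for a Beam pipeline: source | Map(parse) | Filter(keep) | sink.
# Plain Python for illustration only -- not the Apache Beam API.

def run_pipeline(source):
    parsed = (line.split(",") for line in source)        # Map: per-element parse
    kept = (row for row in parsed if int(row[1]) > 10)   # Filter: per-element predicate
    return list(kept)                                    # sink: materialize results

rows = run_pipeline(["a,5", "b,20", "c,15"])
# Each element is handled independently, which is why Map/Filter (and ParDo)
# can be parallelized and autoscaled by the execution framework.
```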
Feature crosses Quiz Answers
Q1. You are building a model to predict the number of points (“margin”) by which Team A will beat Team B in a basketball game. Your input features are (1) whether or not it is a home game for Team A (2) the average number of points Team A scored in its past 7 games and (3) the average number of points Team B scored in its past 7 games. Which of these is a linear model suitable for machine learning?
- Ans: margin = b + w1 * is_home_game + w2 * avg_points_A + w3 * avg_points_B; margin = (avg_points_A – avg_points_B); margin = w1 * is_home + w2 * (avg_points_A – avg_points_B)^3
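A quick worked example may make "linear model" concrete: the model is linear in the learned weights, so a prediction is just a weighted sum of the features. The weight values below are made up for illustration, not learned values.

```python
# Evaluate margin = b + w1*is_home_game + w2*avg_points_A + w3*avg_points_B
# with illustrative (not learned) weights.

def predict_margin(is_home_game, avg_points_a, avg_points_b,
                   b=0.0, w1=3.0, w2=1.0, w3=-1.0):
    return b + w1 * is_home_game + w2 * avg_points_a + w3 * avg_points_b

margin = predict_margin(1, 102.0, 98.0)  # home game; A averages 102, B averages 98
# 0 + 3*1 + 1*102 + (-1)*98 = 7.0
```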
Q2. Feature crosses are more common in modern machine learning because:
- Ans: Feature crosses memorize, and that is okay only if you have extremely large datasets.
Q3. The function tf.feature_column.crossed_column requires:
- Ans: A list of categorical or bucketized features
Q4. You might create an embedding of a feature cross in order to:
- Ans: Identify similar sets of inputs for clustering, Reuse weights learned in one problem in another problem, Create a lower-dimensional representation of the input space
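Under the hood, a feature cross like the one `tf.feature_column.crossed_column` builds combines two categorical values into a single key and hashes it into a fixed number of buckets. The sketch below mimics that idea in plain Python; the hash function and the day/hour feature names are illustrative, not TensorFlow's actual implementation.

```python
# Plain-Python sketch of a hashed feature cross: join two categorical
# values into one key, then hash into a fixed bucket range.
import hashlib

def crossed_bucket(day_of_week, hour_bucket, hash_bucket_size=24 * 7):
    key = f"{day_of_week}_X_{hour_bucket}".encode()
    # md5 here is just an illustrative deterministic hash.
    return int(hashlib.md5(key).hexdigest(), 16) % hash_bucket_size

b = crossed_bucket("Tue", "morning")
# Every crossed value lands deterministically in [0, hash_bucket_size).
```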
Preprocessing and Feature Creation Quiz Answers
Q1. You are training a model to predict how long it will take to sell a house. The list price of the house, with numeric values from 20,000 to 500,000, is one of the inputs to the model. Which of these is a good practice?
- Ans: Rescale the real valued feature like a price to a range from 0 to 1
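Rescaling to [0, 1] is just min-max normalization. A minimal sketch, assuming the min (20,000) and max (500,000) are computed from the training data:

```python
# Min-max rescaling of a price-like feature to [0, 1].
# lo and hi should come from the training data, not the serving request.

def rescale(price, lo=20_000.0, hi=500_000.0):
    return (price - lo) / (hi - lo)

rescale(20_000)   # -> 0.0
rescale(500_000)  # -> 1.0
rescale(260_000)  # -> 0.5
```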
Q2. Which of these tools are commonly used for data pre-processing? (Select 3 correct responses)
- Ans: BigQuery, Apache Beam, TensorFlow
Q3. Which one of these is NOT something you would commonly do in data preprocessing?
- Ans: Tune your ML model hyperparameters
Q4. In your TensorFlow model you are calculating the distance between two points on a map as a new feature. How do you ensure the preprocessing you’re doing for model training is also done in exactly the same way at prediction time?
- Ans: Wrap features in the training/evaluation input function AND wrap features in the serving input function.
Q5. The below code preprocesses the latitude and longitude using feature columns. What is the point of the 38.0 and 42.0 in the column buckets?
- Ans: Latitudes between 38.0 and 42.0 will be discretized into the specified number of bins.
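The bucketization that `tf.feature_column.bucketized_column` performs can be sketched in plain Python: a value is mapped to the index of the bucket its boundaries place it in. The boundary list below spans 38.0 to 42.0 as in the quiz; the number of bins is illustrative.

```python
# Plain-Python sketch of bucketized_column behavior: discretize a
# latitude against boundaries spanning 38.0 to 42.0.
import bisect

boundaries = [38.0, 39.0, 40.0, 41.0, 42.0]  # illustrative bin edges

def bucketize(latitude):
    # Values below 38.0 fall in bucket 0; above 42.0, in bucket len(boundaries).
    return bisect.bisect_right(boundaries, latitude)

bucketize(38.5)  # -> 1
bucketize(41.9)  # -> 4
```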
Q6. What are two advantages of using TensorFlow to preprocess your features instead of building an Apache Beam pipeline? (Select two correct responses)
- In TensorFlow the same pipelines can be used in both training and serving
- In TensorFlow you will have access to helper APIs to help automatically bucketize and process features instead of writing your own Java or Python code
Q7. What is one key advantage of preprocessing your features using Apache Beam?
- Ans: The same code you use to preprocess features in training and evaluation can also be used in serving
Preprocessing with Cloud Dataprep Quiz Answers
Q1. What are some of the advantages to exploring datasets with a UI tool like Cloud Dataprep?
- Dataprep uses Dataflow behind-the-scenes and you can create your transformations in a UI tool instead of writing Java or Python
- Dataprep has a number of transformation steps available that you can chain together as part of a recipe
- Dataprep supports outputting your data into BigQuery, Google Cloud Storage, or flat files
Q2. TRUE or FALSE: You can automatically setup pipelines to run at defined intervals with Cloud Dataprep
- Ans: True
Raw Data and Features Quiz Answers
Q1. What are the characteristics of a good feature?
- Be related to the objective
- Have enough examples in the data
- Be numeric with meaningful magnitude
- Be knowable at prediction time
Q2. I want to build a model to predict whether Team A will win its basketball game against Team B. I will train my model on features computed on historical basketball games. One of my features is how many games this season Team A has won. How should I compute this feature?
- Ans: Compute num_games_won / num_games_played through the (N-1)th game in order to train with the label for the Nth game
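The point of the answer above is avoiding label leakage: the feature for game N must use only games 1 through N-1, so it is knowable at prediction time. A minimal sketch of computing that feature, with made-up game outcomes:

```python
# Compute "fraction of games won so far" using only games played *before*
# each game, so the feature does not leak the current game's label.

def win_fraction_features(results):
    """results: list of 1/0 outcomes for Team A's games, in order.
    Returns the feature value for each game N: wins/games through game N-1."""
    features = []
    wins = played = 0
    for outcome in results:
        features.append(wins / played if played else 0.0)  # state before game N
        wins += outcome
        played += 1
    return features

win_fraction_features([1, 1, 0, 1])  # -> [0.0, 1.0, 1.0, 2/3]
```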
Q3. I want to build a model to predict whether Team A will win its basketball game against Team B. Which of these attributes (computed on historical basketball games) are good features? Assume that these features are all computed appropriately without taking into account non-causal data.
- How often Team A wins games
- How often Team A wins games where its opponent is ranked in the top 10
- How many of the last 7 games that Team A played that it has won
Representing Features questions Quiz Answers
Q1. What is one-hot encoding?
- Ans: One-hot encoding is a process by which categorical variables are converted into a numeric form that can be provided to neural networks so they can do a better job in prediction.
Q2. Which of these offers the best way to encode categorical data that is already indexed, i.e. has integers in [0-N]?
- Ans: tf.feature_column.categorical_column_with_identity
Q3. What do you use the tf.feature_column.bucketized_column function for?
- Ans: To discretize floating point values into a smaller number of categorical bins
tf.transform Quiz Answers
Q1. What is a common use case for where you would use tf.transform instead of a Cloud Dataflow pipeline or regular TensorFlow for preprocessing?
- Ans: You want to scale your inputs based on min/max value in the dataset, You need to compute the vocabulary list for categorical columns from your training dataset
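Both use cases above share one property: the statistic (a min/max, a vocabulary) must be computed over the *entire* training dataset in an analyze step, then applied identically at training and serving time, which is exactly what tf.transform provides. A plain-Python stand-in for that two-phase pattern (the field names and sample rows are made up for illustration):

```python
# Plain-Python sketch of the tf.transform analyze/transform split:
# analyze() computes full-dataset statistics; transform() applies them
# identically to any row, in training or in serving.

def analyze(rows):
    prices = [r["price"] for r in rows]
    return {
        "price_min": min(prices),
        "price_max": max(prices),
        "city_vocab": sorted({r["city"] for r in rows}),
    }

def transform(row, stats):
    scaled = (row["price"] - stats["price_min"]) / (stats["price_max"] - stats["price_min"])
    city_id = stats["city_vocab"].index(row["city"])
    return {"price_scaled": scaled, "city_id": city_id}

data = [{"price": 100.0, "city": "b"}, {"price": 300.0, "city": "a"}]
stats = analyze(data)
transform(data[0], stats)  # -> {"price_scaled": 0.0, "city_id": 1}
```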
Q2. The Analyze phase of tf.transform is carried out via:
- Ans: A Python Beam pipeline that contains TensorFlow functions
Q3. The Transform phase of tf.transform is carried out via:
- Ans: A Beam pipeline while creating a training or evaluation dataset, and inside a TensorFlow serving input function during prediction