Convolutional Neural Networks Coursera Quiz Answers – Practice & Graded Quizzes

Welcome to your go-to guide for Convolutional Neural Networks Coursera quiz answers! Whether you’re working through practice quizzes to refine your understanding or preparing for graded quizzes to test your knowledge, this guide has you covered.

Covering all course modules, this resource will help you master Convolutional Neural Networks (CNNs), which are pivotal in image recognition, computer vision, and deep learning applications.

Convolutional Neural Networks Coursera Quiz Answers for All Modules

Convolutional Neural Networks Module 01 Quiz Answers

Q1: What do you think applying this filter to a grayscale image will do?

Answer: Detect vertical edges

Explanation: Filters (such as the Sobel filter) designed to detect edges typically respond to changes in intensity in a specific direction. A filter for detecting vertical edges would highlight regions where the image intensity changes horizontally, i.e., edges that run vertically.


Q2: Suppose your input is a 300 by 300 color (RGB) image, and you are not using a convolutional network. If the first hidden layer has 100 neurons, each one fully connected to the input, how many parameters does this hidden layer have (including the bias parameters)?

Answer: 27,000,100

Explanation: Each pixel of the 300×300 RGB image has 3 color channels, so there are 300 * 300 * 3 = 270,000 input features. With 100 neurons, each neuron has 270,000 weights, resulting in 270,000 * 100 = 27,000,000 weights. Additionally, each neuron has a bias term, so we add 100 bias parameters. The total number of parameters is 27,000,100.
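
For readers who want to verify the arithmetic, here is a minimal check in plain Python (an illustrative sketch, not part of the quiz):

```python
# Parameter count for a fully connected layer over a flattened 300x300 RGB image.
n_inputs = 300 * 300 * 3       # 270,000 input features
n_neurons = 100

weights = n_inputs * n_neurons   # 27,000,000 weights
biases = n_neurons               # one bias per neuron
print(weights + biases)          # 27,000,100
```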


Q3: Suppose your input is a 300 by 300 color (RGB) image, and you use a convolutional layer with 100 filters that are each 5×5. How many parameters does this hidden layer have (including the bias parameters)?

Answer: 7600

Explanation: For each filter, you have a 5×5 filter size and 3 color channels (since it’s a color image), resulting in 5 * 5 * 3 = 75 parameters per filter. With 100 filters, the total number of parameters is 75 * 100 = 7500. Additionally, there is one bias term for each filter, so you add 100 bias parameters. The total number of parameters is 7500 + 100 = 7600.
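
If PyTorch happens to be available, you can cross-check the count against an actual layer; this is only an illustrative sanity check, not part of the course:

```python
import torch.nn as nn

# By hand: (5*5*3 weights + 1 bias) per filter, times 100 filters.
print((5 * 5 * 3 + 1) * 100)  # 7600

# Cross-check with a real convolutional layer (assumes PyTorch is installed).
conv = nn.Conv2d(in_channels=3, out_channels=100, kernel_size=5)
print(sum(p.numel() for p in conv.parameters()))  # 7600
```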


Q4: You have an input volume that is 63x63x16, and convolve it with 32 filters that are each 7×7, using a stride of 2 and no padding. What is the output volume?

Answer: 29x29x32

Explanation: Using the formula for the output size:

Output size = ((Input size − Filter size) / Stride) + 1

For each spatial dimension: ((63 − 7) / 2) + 1 = 29

So the output volume is 29×29, with 32 channels (one for each filter). The final output volume is 29x29x32.
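
The same formula can be wrapped in a small helper function; a quick sketch in Python:

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Spatial output size of a convolution: floor((n + 2*pad - f) / stride) + 1."""
    return (n + 2 * pad - f) // stride + 1

# 63x63 input, 7x7 filter, stride 2, no padding:
print(conv_output_size(63, 7, stride=2))  # 29
```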


Q5: You have an input volume that is 15x15x8, and pad it using “pad=2.” What is the dimension of the resulting volume (after padding)?

Answer: 19x19x8

Explanation: Padding increases the height and width by 2 pixels in each direction, resulting in a 15 + 2 * 2 = 19×19 size. The depth (number of channels) remains the same at 8. The resulting volume is 19x19x8.
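
A quick way to see this, assuming NumPy is available, is to pad a dummy volume and inspect its shape:

```python
import numpy as np

volume = np.zeros((15, 15, 8))
# pad=2 adds two pixels on every side of the height and width, but not the channels.
padded = np.pad(volume, pad_width=((2, 2), (2, 2), (0, 0)))
print(padded.shape)  # (19, 19, 8)
```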


Q6: You have an input volume that is 63x63x16, and convolve it with 32 filters that are each 7×7, and stride of 1. You want to use a “same” convolution. What is the padding?

Answer: 3

Explanation: In a “same” convolution, padding is chosen such that the output size matches the input size. For a stride of 1, the formula for the padding is:

Padding = (Filter size − 1) / 2

For a 7×7 filter, the padding required is (7 – 1) / 2 = 3.
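
Expressed as a tiny helper (a sketch, assuming an odd filter size):

```python
def same_padding(f):
    """Padding that keeps a stride-1 convolution's output the same size (odd f)."""
    return (f - 1) // 2

print(same_padding(7))  # 3
# Check: (63 + 2*3 - 7) // 1 + 1 = 63, so a 63x63 input stays 63x63.
```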


Q7: You have an input volume that is 32x32x16, and apply max pooling with a stride of 2 and a filter size of 2. What is the output volume?

Answer: 16x16x16

Explanation: Max pooling with a 2×2 filter and stride of 2 reduces each dimension by a factor of 2. The output size is 32 / 2 = 16 for both height and width. The depth (number of channels) remains the same at 16, so the output volume is 16x16x16.
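
If PyTorch is available, the shape change is easy to confirm on a dummy tensor (an illustrative check only):

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 16, 32, 32)                 # (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
print(pool(x).shape)                           # torch.Size([1, 16, 16, 16])
```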


Q8: Because pooling layers do not have parameters, they do not affect the backpropagation (derivatives) calculation.

Answer: False

Explanation: Pooling layers have no weights to learn, but they still affect backpropagation: the gradient must be routed back through the pooling operation, to the position of the maximum value for max pooling (or spread evenly across the window for average pooling).


Q9: In lecture we talked about “parameter sharing” as a benefit of using convolutional networks. Which of the following statements about parameter sharing in ConvNets are true? (Check all that apply.)

Answer:

  • It reduces the total number of parameters, thus reducing overfitting.
  • It allows a feature detector to be used in multiple locations throughout the whole input image/input volume.
    Explanation:
    Parameter sharing in ConvNets means using the same filter (weight set) across the entire input image, which reduces the number of parameters and helps the network generalize better by detecting features at various locations in the image.

Q10: In lecture we talked about “sparsity of connections” as a benefit of using convolutional layers. What does this mean?

Answer: Each activation in the next layer depends on only a small number of activations from the previous layer.

Explanation: Sparsity of connections means that each neuron in a convolutional layer is connected to only a small subset of neurons from the previous layer (based on the filter size), making the model more efficient and reducing the number of parameters.

Convolutional Neural Networks Module 02 Quiz Answers

Q1: Which of the following do you typically see in a ConvNet? (Check all that apply.)

Answer:

  • FC layers in the last few layers
  • Multiple CONV layers followed by a POOL layer
    Explanation:
    In Convolutional Neural Networks (ConvNets), it’s common to see fully connected (FC) layers towards the end of the network for classification tasks, and multiple convolution (CONV) layers followed by pooling (POOL) layers for feature extraction. FC layers are not typically used in the first few layers of a ConvNet.

Q2: In order to be able to build very deep networks, we usually only use pooling layers to downsize the height/width of the activation volumes while convolutions are used with “valid” padding. Otherwise, we would downsize the input of the model too quickly.

Answer: False

Explanation: Convolutions with “valid” padding also shrink the height and width of the activation volume, so it is not true that only the pooling layers do the downsizing in this setup. To build very deep networks, convolutions are normally used with “same” padding so that they preserve the spatial dimensions, and pooling layers handle the downsizing; otherwise the input would be downsized too quickly.


Q3: Training a deeper network (for example, adding additional layers to the network) allows the network to fit more complex functions and thus almost always results in lower training error. For this question, assume we’re referring to “plain” networks.

Answer: False

Explanation: While adding more layers allows the network to fit more complex functions, it can also lead to overfitting or vanishing gradients, which can hinder training. A deeper network does not always guarantee better performance on training data.


Q4: The following equation captures the computation in a ResNet block. What goes into the two blanks above?

Answer:

  • a^[l] and 0, respectively
    Explanation:
    In a ResNet block, the skip connection adds the input a^[l] to z^[l+2] before the final nonlinear activation, so a^[l+2] = g(z^[l+2] + a^[l]). Nothing extra is added after the activation, so the second blank is 0.

Q5: Which ones of the following statements on Residual Networks are true? (Check all that apply.)

Answer:

  • Using a skip-connection helps the gradient to backpropagate and thus helps you to train deeper networks
  • The skip-connection makes it easy for the network to learn an identity mapping between the input and the output within the ResNet block.
    Explanation:
    Skip connections in ResNets allow gradients to flow more easily during backpropagation, making it easier to train deeper networks. They also help the network learn identity mappings where appropriate, which helps prevent the network from learning unnecessary transformations.

Q6: Suppose you have an input volume of dimension n_H × n_W × n_C. Which of the following statements do you agree with? (Assume that “1×1 convolutional layer” below always uses a stride of 1 and no padding.)

Answer:

  • You can use a 2D pooling layer to reduce n_H and n_W, but not n_C.
  • You can use a 1×1 convolutional layer to reduce n_C, but not n_H or n_W.
    Explanation:
    A 2D pooling layer reduces the spatial dimensions (height and width) but does not affect the depth (number of channels). A 1×1 convolutional layer operates on each channel independently and can be used to change the number of channels, but it does not affect the height or width of the input.
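
The same point can be checked on dummy tensors; a short sketch assuming PyTorch is installed:

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 16, 32, 32)                   # n_C=16, n_H=n_W=32 (PyTorch uses channels-first)
print(nn.MaxPool2d(kernel_size=2)(x).shape)      # [1, 16, 16, 16]: n_H and n_W shrink, n_C unchanged
print(nn.Conv2d(16, 8, kernel_size=1)(x).shape)  # [1, 8, 32, 32]: only n_C changes
```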

Q7: Which ones of the following statements on Inception Networks are true? (Check all that apply.)

Answer:

  • A single inception block allows the network to use a combination of 1×1, 3×3, 5×5 convolutions and pooling.
  • Inception blocks usually use 1×1 convolutions to reduce the input data volume’s size before applying 3×3 and 5×5 convolutions.
    Explanation:
    Inception networks use multiple filters (1×1, 3×3, 5×5) to capture different features at various scales, and 1×1 convolutions are used to reduce the number of input channels before applying larger convolutions, reducing computational complexity.
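
The computational saving is easy to quantify. The sketch below uses the example discussed in the lectures (a 28×28×192 input, a 5×5 “same” convolution producing 32 channels, and a 1×1 bottleneck down to 16 channels); the multiplication counts are illustrative:

```python
# Multiplication counts: direct 5x5 convolution vs. a 1x1 "bottleneck" followed by 5x5.
H, W, C_in, C_out, f = 28, 28, 192, 32, 5

direct = H * W * C_out * f * f * C_in              # ~120 million multiplications
bottleneck = 16                                    # 1x1 conv reduces 192 -> 16 channels
reduce_cost = H * W * bottleneck * C_in            # cost of the 1x1 convolution
expand_cost = H * W * C_out * f * f * bottleneck   # 5x5 convolution on the reduced volume
print(direct, reduce_cost + expand_cost)           # ~120.4M vs ~12.4M
```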

Q8: Which of the following are common reasons for using open-source implementations of ConvNets (both the model and/or weights)? Check all that apply.

Answer:

  • It is a convenient way to get working with an implementation of a complex ConvNet architecture.
  • Parameters trained for one computer vision task are often useful as pretraining for other computer vision tasks.
    Explanation:
    Open-source implementations allow for efficient reuse of trained models, which saves time and computational resources. Pretrained models can also be adapted to new tasks, making transfer learning easier.

Q9: In Depthwise Separable Convolution you:

Answer:

  • Perform two steps of convolution.
  • For the “Depthwise” computations, each filter convolves with only one corresponding color channel of the input image.
  • The final output is of dimension n_out × n_out × n_c′ (where n_c′ is the number of filters used in the pointwise convolution step).
    Explanation:
    Depthwise separable convolutions break the standard convolution into two steps: one convolution for each channel (depthwise) and another pointwise convolution (1×1) to mix the channels. This reduces computational cost while maintaining performance.
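
A small worked example (6×6×3 input, 3×3 filters, 5 output channels, stride 1, no padding, so the output is 4×4) shows the saving; the numbers are for illustration:

```python
# Multiplication counts: normal convolution vs. depthwise separable convolution.
n_out, f, n_c, n_c_prime = 4, 3, 3, 5

normal = f * f * n_c * n_out * n_out * n_c_prime   # 2160 multiplications
depthwise = f * f * n_out * n_out * n_c            # 432: one 3x3 filter per input channel
pointwise = n_c * n_out * n_out * n_c_prime        # 240: 1x1 convolutions mix the channels
print(normal, depthwise + pointwise)               # 2160 vs 672
```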

Q10: Fill in the missing dimensions shown in the image below (marked W, Y, Z).

Answer:

  • W = 30, Y = 30, Z = 5
    Explanation:
    The dimensions of the convolutional and pooling layers depend on the kernel size, stride, and padding. Based on these parameters, the output dimensions for a given input volume can be calculated, and in this case, the appropriate dimensions are W = 30, Y = 30, and Z = 5.

Convolutional Neural Networks Module 03 Quiz Answers

Q1: You are building a 3-class object classification and localization algorithm. The classes are: pedestrian (c=1), car (c=2), motorcycle (c=3). What should y be for the image below?

Answer: y = [1, ?, ?, ?, ?, 0, 0, 0]

Explanation: The label has the form y = [p_c, b_x, b_y, b_h, b_w, c_1, c_2, c_3]. The first component p_c = 1 indicates that an object is present, the next four entries are the bounding-box coordinates, and the last three are the class indicators for pedestrian, car, and motorcycle. The “?” entries are don’t-cares: the loss function simply ignores those components.


Q2: What is the most appropriate set of output units for your neural network in the factory automation task?

Answer: Logistic unit, b_x, b_y, b_h (since b_w = b_h)

Explanation: Since the bounding box is always square, you can use a logistic unit to detect whether the object (the soft-drink can) is present, together with b_x, b_y, and b_h for the bounding box, as the width is the same as the height.


Q3: If you build a neural network that inputs a picture of a person’s face and outputs N landmarks on the face, how many output units will the network have?

Answer: 2N

Explanation: For each landmark, you need two coordinates: x and y. Therefore, if there are N landmarks, the network needs 2N output units (the x and y coordinates of each landmark).


Q4: When training one of the object detection systems described in lecture, you need a training set that contains many pictures of the object(s) you wish to detect. However, bounding boxes do not need to be provided in the training set, since the algorithm can learn to detect the objects by itself.

Answer: False

Explanation: For object detection, you must provide bounding box annotations in the training set so the model can learn to localize and classify the objects. Without bounding boxes, the network cannot learn to associate the object with a specific location in the image.


Q5: What is the IoU between these two boxes? The upper-left box is 2×2, and the lower-right box is 2×3. The overlapping region is 1×1.

Answer: 1/9

Explanation: IoU (Intersection over Union) is calculated as:

IoU = Area of overlap / Area of union

Area of overlap = 1 × 1 = 1.
Area of union = (2 × 2) + (2 × 3) − 1 = 4 + 6 − 1 = 9.
Thus, IoU = 1/9.
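
The calculation generalizes to any pair of axis-aligned boxes; a minimal Python sketch (the (x1, y1, x2, y2) corner format is an assumption for illustration):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A 2x2 box and a 2x3 box that overlap in a 1x1 region, as in the question:
print(iou((0, 0, 2, 2), (1, 1, 3, 4)))  # 0.111... = 1/9
```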


Q6: Suppose you run non-max suppression on the predicted boxes. The parameters you use for non-max suppression are that boxes with probability ≤ 0.4 are discarded, and the IoU threshold for deciding if two boxes overlap is 0.5. How many boxes will remain after non-max suppression?

Answer: 5

Explanation: Non-max suppression eliminates boxes with low probability (≤ 0.4) and removes boxes that overlap too much (based on IoU threshold of 0.5). After applying these rules, the remaining number of boxes is 5.
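
For reference, here is a minimal greedy non-max suppression sketch using those two thresholds (the box format and helper names are illustrative, not from the course):

```python
def non_max_suppression(boxes, scores, score_threshold=0.4, iou_threshold=0.5):
    """Greedy NMS: discard low-probability boxes, then suppress heavy overlaps."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        union = (a[2]-a[0])*(a[3]-a[1]) + (b[2]-b[0])*(b[3]-b[1]) - inter
        return inter / union if union else 0.0

    # Keep only boxes above the probability threshold, highest score first.
    candidates = sorted(
        ((b, s) for b, s in zip(boxes, scores) if s > score_threshold),
        key=lambda pair: pair[1], reverse=True)
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with an already-kept box.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept
```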


Q7: Suppose you are using YOLO on a 19×19 grid, on a detection problem with 20 classes, and with 5 anchor boxes. During training, for each image, you will need to construct an output volume y as the target value for the neural network; this corresponds to the last layer of the neural network. What is the dimension of this output volume?

Answer: 19x19x(5×25)

Explanation: YOLO divides the image into a 19×19 grid, and for each grid cell it predicts 5 anchor boxes. Each anchor box is described by 25 numbers: an objectness score p_c, the four bounding-box coordinates (b_x, b_y, b_h, b_w), and 20 class probabilities. Therefore the output volume has dimensions 19x19x(5×25), i.e., 19×19×125.
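
As a quick arithmetic check (plain Python):

```python
grid, anchors, classes = 19, 5, 20
values_per_anchor = 1 + 4 + classes               # p_c, (b_x, b_y, b_h, b_w), 20 class scores
print((grid, grid, anchors * values_per_anchor))  # (19, 19, 125)
```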


Q8: What is Semantic Segmentation?

Answer: Locating objects in an image by predicting each pixel as to which class it belongs to.

Explanation: Semantic segmentation assigns a class label to each pixel in the image, distinguishing objects based on pixel-wise classification.


Q9: Using the concept of Transpose Convolution, fill in the values of X, Y, and Z below.

Answer: X = 2, Y = -6, Z = -4

Explanation: Transpose convolution works by reversing the operation of a regular convolution, and the padding, stride, and filter size determine the output dimensions and values. The exact calculation for the missing values depends on the padding and stride parameters provided in the question.


Q10: Suppose your input to a U-Net architecture is h × w × 3, where 3 denotes your number of channels (RGB). What will be the dimension of your output?

Answer: h x w x n, where n = number of output classes

Explanation: In U-Net, the output consists of per-pixel class scores, so the output dimensions are h × w × n, where n is the number of classes (e.g., n = 1 for binary segmentation with a sigmoid output).

Convolutional Neural Networks Module 04 Quiz Answers

Q1: Face verification requires comparing a new picture against one person’s face, whereas face recognition requires comparing a new picture against K persons’ faces.

Answer: True

Explanation: Face verification checks whether the given image matches a specific person’s identity, while face recognition involves comparing the image against multiple faces (K faces) to identify the person.


Q2: Why do we learn a function d(img1, img2) for face verification? (Select all that apply.)

Answer:

  • This allows us to learn to recognize a new person given just a single image of that person.
  • We need to solve a one-shot learning problem.
    Explanation:
    In face verification, we want to determine whether two images are from the same person. This requires learning a function that can compare the images effectively. One-shot learning is used because the network is expected to recognize a new person after seeing just one image of that person.

Q3: In order to train the parameters of a face recognition system, it would be reasonable to use a training set comprising 100,000 pictures of 100,000 different persons.

Answer: False

Explanation: To train a face recognition system (for example with the triplet loss), the training set needs multiple pictures of each person so that anchor–positive pairs can be formed. A dataset of 100,000 pictures of 100,000 different persons contains only one image per person, so it is not a reasonable training set.


Q4: Which of the following is a correct definition of the triplet loss?

Answer: max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0)

Explanation: The triplet loss is used in deep metric learning. It minimizes the distance between an anchor A and a positive example P while pushing the distance between the anchor A and a negative example N to be larger, enforced with a margin α.
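
A direct translation of the formula into NumPy, for a single triplet (the embeddings below are made up purely for illustration):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + alpha, 0) for one triplet."""
    pos = np.sum((f_a - f_p) ** 2)   # squared distance anchor -> positive
    neg = np.sum((f_a - f_n) ** 2)   # squared distance anchor -> negative
    return max(pos - neg + alpha, 0.0)

# Toy embeddings where the anchor is much closer to the positive than to the negative:
a, p, n = np.array([1.0, 0.0]), np.array([0.9, 0.1]), np.array([0.0, 1.0])
print(triplet_loss(a, p, n))  # 0.0, the margin is already satisfied
```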


Q5: Consider the following Siamese network architecture: The upper and lower neural networks have different input images, but have exactly the same parameters.

Answer: True

Explanation: In a Siamese network, both branches share the same parameters, meaning that they perform the same operations on different inputs (images in this case). This is what allows the network to learn a similarity or distance measure between the two inputs.


Q6: You train a ConvNet on a dataset with 100 different classes. You wonder if you can find a hidden unit which responds strongly to pictures of cats. (I.e., a neuron so that, of all the input/training images that strongly activate that neuron, the majority are cat pictures.) You are more likely to find this unit in layer 4 of the network than in layer 1.

Answer: True

Explanation: In deeper layers of a ConvNet, neurons tend to respond to more complex patterns and abstract features (such as “cats”), while neurons in earlier layers typically respond to low-level features like edges and textures. Therefore, it is more likely to find a unit responding to a complex feature like “cat” in layer 4.


Q7: Neural style transfer is trained as a supervised learning task in which the goal is to input two images (x), and train a network to output a new, synthesized image (y).

Answer: False

Explanation: Neural style transfer is typically not a supervised learning task but an optimization problem where the goal is to generate a new image that combines the content of one image and the style of another. It does not involve training a network in a supervised manner, but rather using a pre-trained network (like VGG).


Q8: In the deeper layers of a ConvNet, each channel corresponds to a different feature detector. The style matrix G^[l] measures the degree to which the activations of different feature detectors in layer l vary (or correlate) together with each other.

Answer: True

Explanation: The style matrix in neural style transfer captures the correlations between the activations of different feature detectors. This matrix helps measure the “style” of the image by capturing how features co-occur across the layers of the network.


Q9: In neural style transfer, what is updated in each iteration of the optimization algorithm?

Answer: The pixel values of the generated image G

Explanation: In neural style transfer, during optimization, the pixel values of the generated image are updated iteratively to minimize the difference between the content and style features extracted from the content and style images.


Q10: You are working with 3D data. You are building a network layer whose input volume has size 32x32x32x16 (this volume has 16 channels), and applies convolutions with 32 filters of dimension 3x3x3 (no padding, stride 1). What is the resulting output volume?

Answer: 30x30x30x32

Explanation: With no padding and a stride of 1, the output dimensions for each spatial dimension (height, width, and depth) are reduced by 2 (due to the 3x3x3 filter). The resulting volume will have dimensions 30x30x30 for the spatial size, and 32 channels because we applied 32 filters.
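
The same shapes can be verified with a dummy tensor, assuming PyTorch is installed (note that PyTorch places the channel dimension before the spatial dimensions):

```python
import torch
import torch.nn as nn

x = torch.zeros(1, 16, 32, 32, 32)   # (batch, channels, depth, height, width)
conv3d = nn.Conv3d(in_channels=16, out_channels=32, kernel_size=3)  # stride 1, no padding
print(conv3d(x).shape)               # torch.Size([1, 32, 30, 30, 30])
```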

Frequently Asked Questions (FAQ)
Are the Convolutional Neural Networks Coursera quiz answers accurate?

Yes, these answers have been carefully reviewed to ensure they align with the latest course material and CNN principles.

Can I use these answers for both practice and graded quizzes?

Absolutely! These answers are suitable for both practice quizzes and graded assessments, ensuring you’re well-prepared for all quizzes.

Does this guide include answers for all modules of the course?

Yes, this guide provides answers for every module, ensuring comprehensive preparation for the entire course.

Will this guide help me better understand CNN architectures and applications?

Yes, in addition to providing quiz answers, this guide reinforces essential concepts such as convolutional layers, pooling, backpropagation, and CNN applications in image classification and object detection.

Conclusion

We hope this guide to Convolutional Neural Networks Coursera Quiz Answers helps you master CNNs and their applications in deep learning and computer vision.

Bookmark this page for quick access and share it with your peers. Ready to dive deep into the world of CNNs and ace your quizzes? Let’s get started!

Source: Convolutional Neural Networks

Get all Course Quiz Answers of Deep Learning Specialization

Course 01: Neural Networks and Deep Learning Coursera Quiz Answers

Course 02: Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization Quiz Answers

Course 03: Structuring Machine Learning Projects Coursera Quiz Answers

Course 04: Convolutional Neural Networks Coursera Quiz Answers

Course 05: Sequence Models Coursera Quiz Answers
