Fundamentals of Scalable Data Science Coursera Quiz Answers

Fundamentals of Scalable Data Science Week 01 Quiz Answers

Quiz : Challenges, terminology, methods and technology

Q1. Which programming languages are supported for writing programs on top of ApacheSpark?

Swift
Scala
Java
Go
JavaScript
C/C++
C#
R
Python

Fundamentals of Scalable Data Science Week 02 Quiz Answers

Quiz : Data storage solutions, and ApacheSpark

Q1. Please consider the following code.

rdd = sc.parallelize(range(100))
rdd2 = range(100)

Where is data in “rdd” stored physically?

On the local Driver machine
In main-memory of ApacheSpark worker nodes

Q2. Please consider the following code.

rdd = sc.parallelize(range(100))
rdd2 = range(100)

Where is data in “rdd2” stored physically?

In main-memory of ApacheSpark worker nodes
On the local Driver machine

Q3. What is the parallel version of the following code?

len(range(9999999999))

sc.parallelize(range(9999999999)).count()
parallelize(range(9999999999)).count()
len(sc.parallelize(range(9999999999)))
size(sc.parallelize(range(9999999999)))
count(sc.parallelize(range(9999999999)))
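
The idea behind a distributed count can be sketched locally without a Spark cluster: split the data into partitions, count each partition independently, then add the partial counts. The sketch below is only an illustration using Python threads. The helper names partition and parallel_count are hypothetical and are not part of the ApacheSpark API.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(r, n_parts):
    """Split a range into at most n_parts contiguous sub-ranges (hypothetical helper)."""
    size = max(1, -(-len(r) // n_parts))  # ceiling division, at least 1
    return [r[i:i + size] for i in range(0, len(r), size)]

def parallel_count(r, n_parts=4):
    """Count each partition in a worker thread, then add the partial counts."""
    parts = partition(r, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_counts = list(pool.map(lambda p: sum(1 for _ in p), parts))
    return sum(partial_counts)

total = parallel_count(range(1000))
print(total)  # 1000
```

In ApacheSpark the partitions would live on worker nodes rather than in local threads, but the count is combined from partial results in the same spirit.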

Q4. Which storage solutions support seamless modification of schemas? (Select all that apply)

ObjectStorage
NoSQL
SQL/Relational Databases

Q5. Which storage solutions support dynamic scaling on storage? (Select all that apply)

ObjectStorage
NoSQL
SQL/Relational Databases

Q6. Which storage solutions support normalization and integrity checks on data out of the box? (Select all that apply)

ObjectStorage
NoSQL
SQL/Relational Databases

Quiz : Programming language options and functional programming

Q1. Which programming languages can be used for using GraphX, the ApacheSpark graph processing engine? (Select all that apply)

Scala
R
Python

Q2. What is the result of the following code?

def f(x):
    return (x+2)*2
l = [1,2,3,4]
map(f,l)

Hint: “map” is the Python equivalent to “apply” used in the lecture

This is an example of a correct input format (although others are accepted as well):

[3, 6, 1, 25]

[6,8,10,12]
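
To check the result locally, the snippet can be run as below. This is a minimal sketch assuming Python 3, where map returns a lazy iterator, so it is wrapped in list to show the values. The variable name result is introduced here only for illustration.

```python
def f(x):
    return (x + 2) * 2

l = [1, 2, 3, 4]

# map is lazy in Python 3, so materialize it into a list
result = list(map(f, l))
print(result)  # [6, 8, 10, 12]
```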

Q3. What is the result of the following Python code running on top of ApacheSpark (pyspark)?

sc.parallelize([1,2,3,4,5]).reduce(lambda a,b:a+b)

Please have a look at the following API documentation if necessary:

https://spark.apache.org/docs/latest/rdd-programming-guide.html#basics
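
The RDD reduce action aggregates the elements with a two-argument function. As a rough local stand-in that needs no Spark context, functools.reduce from the Python standard library folds a plain list with the same lambda. This is only a sketch of the semantics, not the distributed execution.

```python
from functools import reduce

values = [1, 2, 3, 4, 5]

# Same lambda as in the question, applied to a local list instead of an RDD
total = reduce(lambda a, b: a + b, values)
print(total)  # 15
```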

Q4. What has to be changed in an ApacheSpark program using only functions on the RDD API which has been tested on 1 KB of data if it should run on 1 PB of data?

Nothing
You have to implement it in a parallel way

Q5. In which programming languages do you find the most libraries for data science specific tasks?

Java
Scala
Python
R

Q6. Data frames are the central data structure in R and also in Python’s Pandas. But they are not using ApacheSpark in the background. What, therefore, is the major limitation of data frames used in R or Python’s Pandas?

They are not as cool as ApacheSpark
They are not running in parallel on multiple machines
They are not OpenSource
They are very unstable

Quiz : ApacheSparkSQL and Cloudant

Q1. Which statements are true about Cloudant? (Select all that apply)

Cloudant is based on ApacheCouchDB
Cloudant is a SQL database
Cloudant is a NoSQL database
Cloudant is a very fast and scalable key-value store
Cloudant is meant for storing JSON documents effectively
BigCouch is a tool to inflate storage on CouchDB
BigCouch is a component between the client and a set of CouchDB services used for horizontal scaling

Q2. Please have a look at the following flow:

Which nodes are actually simulating sensors of a hypothetical IoT device?

(Please select Fluid Simulator, Voltage Sensor Simulator, and Mechanical Sensor Simulator. This is a legacy from a previous version of this course, and we can’t change the quiz at this point for fairness reasons.)

Fluid Simulator
Mechanical Sensor Simulator
Fluid Data
Drum Data
Voltage Sensor Simulator
msg.payload
Voltage Data
Washer01

Q3. In the “End-to-End Scenario”, where does all the data get stored?

(Please select Cloudant (ApacheCouchDB). This is a legacy from a previous version of this course, and we can’t change the quiz at this point for fairness reasons.)

Cloudant (ApacheCouchDB)
ApacheSpark
Object Storage
OpenStack Swift

Q4. How does the Catalyst optimizer work internally?

Abbreviations:

AST – Abstract Syntax Tree

LEP – Logical Execution Plan

PEP – Physical Execution Plan

An AST is created from an SQL LEP. This AST is transformed (optimised). Then multiple PEPs are created from the optimised LEP. Finally, based on cost-based statistics, an optimal PEP is chosen to be executed.
A LEP is created from an SQL AST. This LEP is transformed (optimised). Then multiple PEPs are created from the optimised LEP. Finally, based on cost-based statistics, an optimal PEP is chosen to be executed.
An AST is created from an SQL PEP. This AST is transformed (optimised). Then multiple LEPs are created from the optimised PEP. Finally, based on cost-based statistics, an optimal LEP is chosen to be executed.
A PEP is created from an SQL AST. This AST is transformed (optimised). Then multiple PEPs are created from the optimised LEP. Finally, based on cost-based statistics, an optimal PEP is chosen to be executed.

Q5. What is the advantage of using ApacheSparkSQL over RDDs? (select all that apply)

ApacheSparkSQL bypasses the RDD interface which has been proven to be very complicated
SQL is simpler than RDD but has some performance drawbacks
Catalyst and Tungsten are able to optimise the execution, so a query is likely to execute more quickly than if you had implemented something equivalent using the RDD API.
The API is simpler and doesn’t require specific functional programming skills

Fundamentals of Scalable Data Science Week 03 Quiz Answers

Quiz : Averages and standard deviation

Q1. What is the advantage of median over mean?

Median is more outlier resistant. Odd values influence median less than mean.
Mean is more outlier resistant. Odd values influence mean less than median.

Q2. What is the mean of the following list?

1,2,4,5,34,1,32,4,34,2,1,3

Please use a decimal point instead of a comma

10.25 is the mean of this list
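
The arithmetic can be checked with a few lines of plain Python. This is a minimal sketch, and the variable names are chosen here for illustration. The twelve values sum to 123, and 123 / 12 gives the mean.

```python
values = [1, 2, 4, 5, 34, 1, 32, 4, 34, 2, 1, 3]

# Arithmetic mean: sum of the values divided by how many there are
mean = sum(values) / len(values)
print(mean)  # 10.25
```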

Q3. What is the median of the following list?

1,2,4,5,34,1,32,4,34,2,1,3

Please use a decimal point instead of a comma
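
A minimal sketch of the median computation, assuming the usual convention that for an even number of values the median is the average of the two middle values after sorting:

```python
values = [1, 2, 4, 5, 34, 1, 32, 4, 34, 2, 1, 3]

ordered = sorted(values)  # [1, 1, 1, 2, 2, 3, 4, 4, 5, 32, 34, 34]
n = len(ordered)
mid = n // 2

# Odd count: take the middle value. Even count: average the two middle values.
median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
print(median)  # 3.5
```

The three large values (32, 34, 34) pull the mean well above the median, which illustrates why the median is the more outlier-resistant measure.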

Q4. Which of the following two plots has a higher standard deviation?

Plot 1

Plot 2

Plot 1
Plot 2

Q5. What is the standard deviation of the following list?

34,1,23,4,3,3,12,4,3,1

Please enter at least 3 digits after the decimal

Please use a decimal point instead of a comma
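
A sketch of the computation in plain Python follows. Which convention the quiz grader expects is an assumption: the population form divides by n, the sample form divides by n - 1, so both are computed here.

```python
from math import sqrt

values = [34, 1, 23, 4, 3, 3, 12, 4, 3, 1]
n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean
squared_dev = sum((x - mean) ** 2 for x in values)

population_sd = sqrt(squared_dev / n)        # divides by n
sample_sd = sqrt(squared_dev / (n - 1))      # divides by n - 1

print(round(population_sd, 3), round(sample_sd, 3))
```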

Quiz : Skewness and kurtosis

Q1.

Plot 1

Plot 2

Which of the two plots indicates a higher kurtosis value?

Plot 1
Plot 2

Q2. What is the kurtosis of the following list?

34,1,23,4,3,3,12,4,3,1

Please enter at least three digits after the decimal
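
One common definition of kurtosis is the fourth central moment divided by the squared variance. The sketch below uses that definition with population moments, which is an assumption: some libraries report excess kurtosis instead, which subtracts 3 from this value.

```python
values = [34, 1, 23, 4, 3, 3, 12, 4, 3, 1]
n = len(values)
mean = sum(values) / n

# Second and fourth central moments (population form, dividing by n)
m2 = sum((x - mean) ** 2 for x in values) / n
m4 = sum((x - mean) ** 4 for x in values) / n

kurtosis = m4 / m2 ** 2
excess_kurtosis = kurtosis - 3

print(round(kurtosis, 3), round(excess_kurtosis, 3))
```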

Q3. The higher the kurtosis value, the longer the “tails” of the distribution are. So, kurtosis measures the outlier content. The higher the kurtosis value, the more outliers are in the dataset, because the farther a value is from the mean, the more it contributes to the kurtosis. In other words, the distribution has long tails. Which are examples of long-tailed datasets?

Velocity values recorded from all connected cars over one year in a country
Velocity values recorded from one single connected car over one hour
Latitude coordinates of all rain drops fallen on earth for the last 60 minutes
Number of minutes a lift in a smart building was waiting at each floor over the last 24h
Hour of the day a smart light bulb has been turned on and off over the last year

Q4.

What is true about this value distribution?

This distribution is positively skewed
This distribution is negatively skewed

Q5. Consider a connected car. We are measuring the car’s velocity 600 times per minute. Note that during the time intervals in which the car is standing still, a velocity of zero is measured. If we now plot the distribution of velocity values, is this distribution positively or negatively skewed?

Some further explanation from the discussion forum:

Just imagine a car driving in Bangalore. If you measure its velocity, most of the time it is zero, sometimes it is between 3 and 5 km/h, and rarely above. Please imagine what such a chart looks like if you have velocity on the x axis and frequency (how often you’ve measured that velocity) on the y axis.

negatively skewed
positively skewed
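
The sign of the skewness can be checked on made-up data. The velocity list below is purely hypothetical and only mimics the situation described in the question (mostly zeros, some low speeds, a few high ones). Skewness is computed as the third central moment divided by the variance raised to the power 1.5.

```python
# Hypothetical velocity samples in km/h, invented for illustration only
velocities = [0] * 70 + [3] * 10 + [4] * 10 + [5] * 8 + [40] * 2

n = len(velocities)
mean = sum(velocities) / n

# Second and third central moments (population form)
m2 = sum((v - mean) ** 2 for v in velocities) / n
m3 = sum((v - mean) ** 3 for v in velocities) / n

skewness = m3 / m2 ** 1.5
print(skewness > 0)  # True: the long tail is on the right-hand side
```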

Quiz : Covariance, correlation and multidimensional Vector Spaces

Q1. What is the coordinate of this data point in vector notation?

x=2;y=3
(2,3)
(3,2)
[2,3]

Q2. What is the coordinate of this data point in vector notation?

Note: The point is in the center of the cube, only the coordinate system changed. An X,Y,Z coordinate in vector notation looks like this: (X,Y,Z)

(3,1,2)
(3,2,1)
(1,2,3)
(2,1,3)

Q3. Consider the following plot with two data points and a plane separating those points. This is a process which is called binary classification which will be covered in the next course.

Which of the following vectors are lying on the plane of separation and are therefore hard to classify?

Remember: We are using vector notation of (X,Y,Z)

(2.5,1.5,1.5)
(0,0,0)
(1.5,1.5,1.5)
(1.5,2,3)
(1,2,1.5)
(2,1.5,3)

Q4. Given the following plot you can clearly see that there are two clusters.

Please select the correct answers regarding properties of these data.

Remember: We are using vector notation of (X,Y,Z)

There is one cluster centroid at (1,1,1)
There is one cluster centroid at (2,2,2)
There is one cluster centroid at (1,2,1)
Data point (2.5,2.5,2.5) lies within a cluster
There is one cluster centroid at (1,2,2)
Data point (1.5,1.5,1.5) lies within a cluster
Data point (2.1,1.9,2.2) lies within a cluster

Q5. What is the correlation between the two lists?

1,2,3,4,5,6,7,8,9,10

7,6,5,4,5,6,7,8,9,10

Please enter at least three digits after the decimal
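
A sketch of the Pearson correlation coefficient in plain Python, written out from its definition (sum of co-deviations divided by the square root of the product of the summed squared deviations). The function name pearson is chosen here for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(dev_x * dev_y)

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [7, 6, 5, 4, 5, 6, 7, 8, 9, 10]

correlation = pearson(a, b)
print(round(correlation, 3))
```

The normalisation by n cancels out in this ratio, so the population and sample conventions give the same correlation.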

Q6. What is the covariance between the two lists?

1,2,3,4,5,6,7,8,9,10

7,6,5,4,5,6,7,8,9,10

Please enter at least three digits after the decimal
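
A sketch of the covariance computation follows. Which normalisation the quiz grader expects is an assumption, so both the population form (divide by n) and the sample form (divide by n - 1) are shown.

```python
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [7, 6, 5, 4, 5, 6, 7, 8, 9, 10]

n = len(a)
mean_a = sum(a) / n
mean_b = sum(b) / n

# Sum of the products of paired deviations from the respective means
co_dev = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))

population_cov = co_dev / n        # divides by n
sample_cov = co_dev / (n - 1)      # divides by n - 1

print(round(population_cov, 3), round(sample_cov, 3))
```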

Q7. The correlation between the following two lists is zero, can you explain why?

1,2,3,4,5,6,7

7,6,5,4,5,6,7

The correlation over the first half of the lists is negative and over the second half is positive, so they cancel out
The second list is totally random with respect to the first list, therefore they don’t correlate at all
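
The cancellation can be made visible by computing the correlation for the whole lists and for each half separately. This is a minimal sketch; the helper pearson is written out here for illustration, and the split at the fourth element is a choice made for this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_dev = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sum((x - mean_x) ** 2 for x in xs)
    dev_y = sum((y - mean_y) ** 2 for y in ys)
    return co_dev / sqrt(dev_x * dev_y)

a = [1, 2, 3, 4, 5, 6, 7]
b = [7, 6, 5, 4, 5, 6, 7]

whole = pearson(a, b)                # approximately 0
first_half = pearson(a[:4], b[:4])   # b falls while a rises
second_half = pearson(a[3:], b[3:])  # b rises together with a

print(round(whole, 6), round(first_half, 6), round(second_half, 6))
```

A correlation of zero therefore does not mean the lists are unrelated: the relationship here is V-shaped, which a linear measure cannot capture.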

Q8. Please select all data with potentially high correlation

Color of car + Speed of car
Crop fitness + Amount of water available
Outside temperature + Power consumption of AC
Age of driver + Number of accidents
Season of year + Amount of rain

Fundamentals of Scalable Data Science Week 04 Quiz Answers

Quiz : Visualization and dimension reduction

Q1. Besides dimension reduction, which techniques exist to visualize more than three dimensions in a single plot? (Select all that apply)

Color coding for discrete dimensions
Using different symbols apart from simple points for discrete dimensions
Sum Kurtosis and Skewness together for each row
Calculate the Standard Deviation of each row
Apply Fast Fourier Transformation
Add multiple lines to a run chart

Q2. What properties are true about PCA for dimension reduction? (Select all that apply)

PCA removes the dimensions with the highest correlation because their information content is lowest
PCA is a linear transformation, this means most of the original properties of your data are preserved
PCA transforms your dataset so that the first k dimensions have the lowest correlation among each other
PCA is a non-linear transformation, this means most of the original properties of your data set are lost

Q3. Which of the following statements is true about information loss using PCA?