# Fundamentals of Scalable Data Science Coursera Quiz Answers

## All Weeks Fundamentals of Scalable Data Science Coursera Quiz Answers

Apache Spark is the de-facto standard for large-scale data processing. This is the first course in a series leading to the IBM Advanced Data Science Specialization. We strongly believe that learning a scalable data science platform early is crucial for success, since memory and CPU constraints are the most limiting factors when it comes to building advanced machine learning models.

In this course, we teach you the fundamentals of Apache Spark using Python and PySpark. We’ll introduce Apache Spark in the first two weeks and learn how to apply it to basic exploratory and data pre-processing tasks in the last two weeks. Through this exercise, you’ll also be introduced to the most fundamental statistical measures and data visualization technologies.

### Fundamentals of Scalable Data Science Week 01 Quiz Answers

#### Quiz : Challenges, terminology, methods and technology

Q1. Which programming languages are supported for writing programs on top of ApacheSpark?

- Swift
- **Scala**
- **Java**
- Go
- JavaScript
- C/C++
- C#
- **R**
- **Python**

### Fundamentals of Scalable Data Science Week 02 Quiz Answers

#### Quiz : Data storage solutions, and ApacheSpark

Q1. Please consider the following code.

`rdd = sc.parallelize(range(100))`

`rdd2 = range(100)`

Where is the data in “**rdd**” stored physically?

- On the local Driver machine
- **In main-memory of ApacheSpark worker nodes**

Q2. Please consider the following code.

`rdd = sc.parallelize(range(100))`

`rdd2 = range(100)`

Where is the data in “**rdd2**” stored physically?

- In main-memory of ApacheSpark worker nodes
- **On the local Driver machine**

Q3. What is the parallel version of the following code? `len(range(9999999999))`

- **sc.parallelize(range(9999999999)).count()**
- parallelize(range(9999999999)).count()
- len(sc.parallelize(range(9999999999)))
- size(sc.parallelize(range(9999999999)))
- count(sc.parallelize(range(9999999999)))

Q4. Which storage solutions support seamless modification of schemas? (Select all that apply)

- **ObjectStorage**
- **NoSQL**
- SQL/Relational Databases

Q5. Which storage solutions support dynamic scaling on storage? (Select all that apply)

- **ObjectStorage**
- **NoSQL**
- SQL/Relational Databases

Q6. Which storage solutions support normalization and integrity checks on data out of the box? (Select all that apply)

- ObjectStorage
- NoSQL
- **SQL/Relational Databases**

#### Quiz : Programming language options and functional programming

Q1. Which programming languages can be used for using GraphX, the ApacheSpark graph processing engine? (Select all that apply)

- **Scala**
- R
- Python

Q2. What is the result of the following code?

`def f(x): return (x+2)*2`

`l = [1,2,3,4]`

`map(f,l)`

Hint: “map” is the python equivalent to “apply” used in the lecture

This is an example of a correct input format (although others are accepted as well):

[3, 6, 1, 25]

**[6,8,10,12]**
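A quick way to check this answer locally is the sketch below. It assumes Python 3, where `map` returns a lazy iterator, so the result is wrapped in `list` to materialise it. The variable name `values` is ours (the quiz uses `l`).

```python
def f(x):
    # Add 2, then double the result.
    return (x + 2) * 2

values = [1, 2, 3, 4]

# In Python 3, map() is lazy; list() forces evaluation.
result = list(map(f, values))
print(result)  # [6, 8, 10, 12]
```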

Q3. What is the result of the following Python code running on top of ApacheSpark (PySpark)?

`sc.parallelize([1,2,3,4,5]).reduce(lambda a,b:a+b)`

Please have a look at the following API documentation if necessary:

https://spark.apache.org/docs/latest/rdd-programming-guide.html#basics
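The quiz code needs a live SparkContext (`sc`). As a local stand-in, and only to illustrate what the fold computes, the sketch below applies the same lambda with Python's built-in `functools.reduce`. Addition is associative and commutative, so the order in which Spark combines partitions should not change the sum. The name `numbers` is ours.

```python
from functools import reduce

numbers = [1, 2, 3, 4, 5]

# Same binary function as in the quiz, folded over a plain Python list.
total = reduce(lambda a, b: a + b, numbers)
print(total)  # 15
```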

Q4. What has to be changed in an ApacheSpark program using only functions on the RDD API which has been tested on 1 KB of data if it should run on 1 PB of data?

- Nothing
- You have to implement it in a parallel way

Q5. In which programming languages do you find the most libraries for data science specific tasks?

- Java
- Scala
- **Python**
- R

Q6. Data frames are the central data structure in R and also in python’s Pandas. But they are not using ApacheSpark in the background, therefore what is the major limitation of data frames used in R or python’s Pandas?

- They are not as cool as ApacheSpark
- **They are not running in parallel on multiple machines**
- They are not OpenSource
- They are very unstable

#### Quiz : ApacheSparkSQL and Cloudant

Q1. Which statements are true about Cloudant? (Select all that apply)

- Cloudant is based on ApacheCouchDB
- Cloudant is a SQL database
- Cloudant is a NoSQL database
- Cloudant is a very fast and scalable key-value store
- Cloudant is meant for storing JSON documents effectively
- BigCouch is a tool to inflate storage on CouchDB
- BigCouch is a component between the client and a set of CouchDB services used for horizontal scaling

Q2. Please have a look at the following flow:

Which nodes are actually simulating sensors of a hypothetical IoT device?

(Please select Fluid Simulator, Voltage Sensor Simulator, Mechanical Sensor Simulator, this is some legacy from a previous version of this course and we can’t change the quiz at that point for fairness reasons)

- **Fluid Simulator**
- **Mechanical Sensor Simulator**
- Fluid Data
- Drum Data
- **Voltage Sensor Simulator**
- msg.payload
- Voltage Data
- Washer01

Q3. In the “End-to-End Scenario”, where does all the data get stored in?

(Please select Cloudant (ApacheCouchDB), this is some legacy from a previous version of this course and we can’t change the quiz at that point for fairness reasons)

- **Cloudant (ApacheCouchDB)**
- ApacheSpark
- Object Storage
- OpenStack Swift

Q4. How does the Catalyst optimizer work internally?

Abbreviations:

AST – Abstract Syntax Tree

LEP – Logical Execution Plan

PEP – Physical Execution Plan

- An AST is created from an SQL LEP. This AST is transformed (optimised). Then multiple PEPs are created from the optimised LEP. Finally, based on cost based statistics an optimal PEP is chosen to be executed.
- An LEP is created from an SQL AST. This LEP is transformed (optimised). Then multiple PEPs are created from the optimised LEP. Finally, based on cost based statistics an optimal PEP is chosen to be executed.
- An AST is created from an SQL PEP. This AST is transformed (optimised). Then multiple LEPs are created from the optimised PEP. Finally, based on cost based statistics an optimal LEP is chosen to be executed.
- A PEP is created from an SQL AST. This AST is transformed (optimised). Then multiple PEPs are created from the optimised LEP. Finally, based on cost based statistics an optimal PEP is chosen to be executed.

Q5. What is the advantage of using ApacheSparkSQL over RDDs? (select all that apply)

- ApacheSparkSQL bypasses the RDD interface which has been proven to be very complicated
- SQL is simpler than RDD but has some performance drawbacks
- **Catalyst and Tungsten are able to optimise the execution, so queries are more likely to execute more quickly than if you had implemented something equivalent using the RDD API.**
- **The API is simpler and doesn’t require specific functional programming skills**

### Fundamentals of Scalable Data Science Week 03 Quiz Answers

#### Quiz : Averages and standard deviation

Q1. What is the advantage of median over mean?

- **Median is more outlier resistant. Odd values influence median less than mean.**
- Mean is more outlier resistant. Odd values influence mean less than median.

Q2. What is the mean of the following list?

1,2,4,5,34,1,32,4,34,2,1,3

Please use a decimal point instead of a comma

**10.25**
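The arithmetic can be checked with a few lines of plain Python. This is a minimal sketch using the standard-library `statistics` module; the variable names are ours.

```python
from statistics import mean

values = [1, 2, 4, 5, 34, 1, 32, 4, 34, 2, 1, 3]

total = sum(values)       # sum of all entries
count = len(values)       # number of entries
list_mean = mean(values)  # total / count

print(total, count, list_mean)  # 123 12 10.25
```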

Q3. What is the median of the following list?

1,2,4,5,34,1,32,4,34,2,1,3

Please use a decimal point instead of a comma
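For the median, sorting the list makes the result easy to verify by hand. The sketch below assumes the usual convention for an even number of values (average of the two middle elements), which is what `statistics.median` implements.

```python
from statistics import median

values = [1, 2, 4, 5, 34, 1, 32, 4, 34, 2, 1, 3]

ordered = sorted(values)
print(ordered)  # [1, 1, 1, 2, 2, 3, 4, 4, 5, 32, 34, 34]

# 12 values: the median is the average of the 6th and 7th (3 and 4).
list_median = median(values)
print(list_median)  # 3.5
```

Note how far the median (3.5) sits from the mean (10.25): the three large values pull the mean up but barely move the median.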

Q4. Which of the following two plots has a higher standard deviation?

Plot 1

Plot 2

- Plot 1
- Plot 2

Q5. What is the standard deviation of the following list?

34,1,23,4,3,3,12,4,3,1

*Please enter at least 3 digits after the decimal*

Please use a decimal point instead of a comma
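Which value the grader expects depends on the definition used. The sketch below computes both the population form (divide by n) and the sample form (divide by n - 1) with the standard library. We assume, without confirmation from the quiz itself, that the course uses the population form.

```python
from statistics import pstdev, stdev

values = [34, 1, 23, 4, 3, 3, 12, 4, 3, 1]

population_sd = pstdev(values)  # divides by n,     about 10.562
sample_sd = stdev(values)       # divides by n - 1, about 11.134

print(round(population_sd, 3), round(sample_sd, 3))
```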

#### Quiz : Skewness and kurtosis

Q1.

Plot 1

Plot 2

Which of the two plots indicates a higher kurtosis value?

- Plot 1
- Plot 2

Q2. What is the kurtosis of the following list?

34,1,23,4,3,3,12,4,3,1

*Please enter at least three digits after the decimal*
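Kurtosis is not in Python's standard library, so the sketch below computes it from its definition as the fourth central moment divided by the squared variance (population moments). Whether the quiz expects this plain value or the "excess" variant (minus 3) is an assumption to check against the lecture; both are shown.

```python
values = [34, 1, 23, 4, 3, 3, 12, 4, 3, 1]

n = len(values)
mu = sum(values) / n

# Second and fourth central moments (population form, divide by n).
m2 = sum((x - mu) ** 2 for x in values) / n
m4 = sum((x - mu) ** 4 for x in values) / n

kurtosis = m4 / m2 ** 2          # about 3.663
excess_kurtosis = kurtosis - 3   # about 0.663

print(round(kurtosis, 3), round(excess_kurtosis, 3))
```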

Q3. The higher the kurtosis value, the longer the “tails” of the distribution are. So, kurtosis measures the outlier content. The higher the kurtosis value, the more outliers are in the dataset, because the farther a value is from the mean, the more it contributes to the kurtosis. In other words, the distribution has long tails. Which are examples of long-tailed datasets?

- Velocity values recorded from all connected cars over one year in a country
- Velocity values recorded from one single connected car over one hour
- Latitude coordinates of all rain drops fallen on earth for the last 60 minutes
- Number of minutes a lift in a smart building was waiting at each floor over the last 24h
- Hour of the day a smart light bulb has been turned on and off over the last year

Q4.

What is true about this value distribution?

- This distribution is positively skewed
- This distribution is negatively skewed

Q5. Consider a connected car. We are measuring the car’s velocity 600 times per minute. Note that during intervals in which the car is standing still, a velocity of zero is measured. If we now plot the distribution of velocity values, is this distribution positively or negatively skewed?

Some further explanation from the discussion forum:

Just imagine a car driving in Bangalore: if you measure its velocity, most of the time it is zero, sometimes it is between 3-5 km/h, and rarely above. Picture what such a chart looks like with velocity on the x axis and frequency (how often you’ve measured that velocity) on the y axis.

- negatively skewed
- positively skewed
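To make the forum hint concrete, the sketch below builds a made-up velocity sample in the spirit of the description (mostly zeros, some slow driving, a few faster readings) and computes its skewness as the third central moment divided by the variance to the power 1.5. The data are hypothetical and only illustrate the sign of the result.

```python
# Hypothetical readings in km/h: mostly standing still, rarely fast.
velocities = [0] * 70 + [3, 4, 5] * 8 + [20, 40, 60]

n = len(velocities)
mu = sum(velocities) / n

# Second and third central moments (population form).
m2 = sum((v - mu) ** 2 for v in velocities) / n
m3 = sum((v - mu) ** 3 for v in velocities) / n

skewness = m3 / m2 ** 1.5
print(f"mean={mu:.2f}, skewness={skewness:.2f}")
```

A long tail to the right of a bulk near zero gives a positive value here, which is what "positively skewed" means.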

#### Quiz : Covariance, correlation and multidimensional Vector Spaces

Q1. What is the coordinate of this data point in vector notation?

- x=2;y=3
- (2,3)
- (3,2)
- [2,3]

Q2. What is the coordinate of this data point in vector notation?

Note: The point is in the center of the cube, only the coordinate system changed. An X,Y,Z coordinate in vector notation looks like this: (X,Y,Z)

- (3,1,2)
- (3,2,1)
- (1,2,3)
- (2,1,3)

Q3. Consider the following plot with two data points and a plane separating those points. This is a process which is called binary classification which will be covered in the next course.

Which of the following vectors are lying on the plane of separation and are therefore hard to classify?

Remember: We are using vector notation of (X,Y,Z)

- (2.5,1.5,1.5)
- (0,0,0)
- (1.5,1.5,1.5)
- (1.5,2,3)
- (1,2,1.5)
- (2,1.5,3)

Q4. Given the following plot you can clearly see that there are two clusters.

Please select the correct answers regarding properties about these data

Remember: We are using vector notation of (X,Y,Z)

- There is one cluster centroid at (1,1,1)
- There is one cluster centroid at (2,2,2)
- There is one cluster centroid at (1,2,1)
- Data point (2.5,2.5,2.5) lies within a cluster
- There is one cluster centroid at (1,2,2)
- Data point (1.5,1.5,1.5) lies within a cluster
- Data point (2.1,1.9,2.2) lies within a cluster

Q5. What is the correlation between the two lists?

1,2,3,4,5,6,7,8,9,10

7,6,5,4,5,6,7,8,9,10

*Please enter at least three digits after the decimal*
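The Pearson correlation can be computed directly from its definition. The sketch below uses only plain Python; the helper name `pearson` is ours, not part of the course material.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The 1/n (or 1/(n-1)) factors cancel, so they are omitted.
    return sxy / math.sqrt(sxx * syy)

a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [7, 6, 5, 4, 5, 6, 7, 8, 9, 10]

r = pearson(a, b)
print(round(r, 4))  # about 0.7093
```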

Q6. What is the covariance between the two lists?

1,2,3,4,5,6,7,8,9,10

7,6,5,4,5,6,7,8,9,10

*Please enter at least three digits after the decimal*
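As with the standard deviation, the covariance depends on whether you divide by n or by n - 1. The sketch below computes both from the definition. Which one the grader expects is not stated here, so treat the choice as an assumption to verify.

```python
a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b = [7, 6, 5, 4, 5, 6, 7, 8, 9, 10]

n = len(a)
mean_a = sum(a) / n
mean_b = sum(b) / n

# Sum of products of deviations from the two means.
cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))

population_cov = cross / n        # about 3.65
sample_cov = cross / (n - 1)      # about 4.056

print(round(population_cov, 3), round(sample_cov, 3))
```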

Q7. The correlation between the following two lists is zero, can you explain why?

1,2,3,4,5,6,7

7,6,5,4,5,6,7

- The correlation in the first half of the list is negative and in the second half positive, so they cancel out
- The second list is totally random in respect to the first list, therefore they don’t correlate at all
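One way to explore this question is to compute the correlation on each half separately and then on the whole list. The sketch below does that in plain Python; the helper `pearson` and the way the lists are split at the middle element are our choices for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation computed from deviations around the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

a = [1, 2, 3, 4, 5, 6, 7]
b = [7, 6, 5, 4, 5, 6, 7]

r_first = pearson(a[:4], b[:4])   # falling part: 7, 6, 5, 4
r_second = pearson(a[3:], b[3:])  # rising part:  4, 5, 6, 7
r_all = pearson(a, b)

print(round(r_first, 6), round(r_second, 6), round(r_all, 6))
```

The two halves are perfectly linear in opposite directions, and over the full list the contributions cancel, which supports the "cancel out" reading. The second list is clearly not random with respect to the first.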

Q8. Please select all data with potentially high correlation

- Color of car + Speed of car
- Crop fitness + Amount of water available
- Outside temperature + Power consumption of AC
- Age of driver + Number of accidents
- Season of Year+Amount of Rain

### Fundamentals of Scalable Data Science Week 04 Quiz Answers

#### Quiz : Visualization and dimension reduction

Q1. Besides dimension reduction, which techniques exist to visualize more than three dimensions in a single plot? (Select all that apply)

- Color coding for discrete dimensions
- Using different symbols apart from simple points for discrete dimensions
- Sum Kurtosis and Skewness together for each row
- Calculate the Standard Deviation of each row
- Apply Fast Fourier Transformation
- Add multiple lines to a run chart

Q2. What properties are true about PCA for dimension reduction? (Select all that apply)

- PCA removes the dimensions with the highest correlation because their information content is lowest
- PCA is a linear transformation, this means most of the original properties of your data are preserved
- PCA transforms your dataset so that the first k dimensions have the lowest correlation among each other
- PCA is a non-linear transformation, this means most of the original properties of your data set are lost

Q3. Which of the following statements is true about information loss using PCA?

- PCA dimensionality reduction is lossless
- PCA dimensionality reduction is lossy
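A small experiment can illustrate the information-loss question. The sketch below is not the course's implementation: it is a hand-rolled PCA for made-up 2-D points, using the closed-form angle of the first principal axis of a 2x2 covariance matrix. It projects the data onto one dimension, maps it back, and measures the reconstruction error.

```python
import math

# Hypothetical 2-D data, roughly along a line but with some scatter.
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.3), (4.0, 3.8), (5.0, 5.1)]
n = len(points)

mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Entries of the 2x2 covariance matrix (population form).
sxx = sum(cx * cx for cx, _ in centered) / n
syy = sum(cy * cy for _, cy in centered) / n
sxy = sum(cx * cy for cx, cy in centered) / n

# Direction of the first principal component (largest variance).
theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
pc1 = (math.cos(theta), math.sin(theta))

# Reduce to one dimension: keep only the coordinate along pc1.
scores = [cx * pc1[0] + cy * pc1[1] for cx, cy in centered]

# Map the 1-D scores back into 2-D and measure what was lost.
reconstructed = [(s * pc1[0] + mean_x, s * pc1[1] + mean_y) for s in scores]
error = sum((x - rx) ** 2 + (y - ry) ** 2
            for (x, y), (rx, ry) in zip(points, reconstructed)) / n

total_variance = sxx + syy
print(f"mean squared reconstruction error: {error:.4f}")
print(f"share of variance lost: {error / total_variance:.2%}")
```

The error is small but not zero, because the dropped component still carried some variance. Only data lying exactly on a line would be reconstructed without loss, which is why dropping components is generally described as lossy.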

Q4. How many outliers are present in this dataset?

Q5. Which statements are true about this plot? (select all that apply)

- This is a box plot
- This is a star ship of space invaders
- There are more than 15 outliers present
- The distribution is centered around mean zero
- Kurtosis is zero
- Kurtosis is >1
- Skew is zero
- Skew is > 1
- >50% of all values are <30
- Median is >20
- >50% of all values are >30

##### Fundamentals of Scalable Data Science Coursera Course Review:

We suggest enrolling in the Fundamentals of Scalable Data Science course to gain new skills from professionals for free; in our experience it is worth it.

The Fundamentals of Scalable Data Science course is available on Coursera for free. If you get stuck on a quiz or graded assessment, visit Networking Funda for the Fundamentals of Scalable Data Science Coursera Quiz Answers.

##### Conclusion:

I hope these Fundamentals of Scalable Data Science Coursera Quiz Answers help you learn something new from this course. If they helped you, don’t forget to bookmark our site for more Coursera Quiz Answers.

This course is intended for audiences of all experience levels who are interested in learning new skills in a business context; there are no prerequisite courses.

Keep Learning!

#### Get All Course Quiz Answers of Advanced Data Science with IBM Specialization

Fundamentals of Scalable Data Science Coursera Quiz Answers

Advanced Machine Learning and Signal Processing Quiz Answers

Applied AI with DeepLearning Coursera Quiz Answers

Advanced Data Science Capstone Coursera Quiz Answers