Big Data Integration and Processing Coursera Quiz Answers

Quiz 1 – Retrieving Big Data Quiz

Q1. What does it mean for a query language to be declarative?

  • The language specifies the process of how to obtain the data.
  • The language specifies both the process of how to obtain the data and specifies what data to obtain.
  • The language specifies what data to obtain.
  • A language specific declaration of data types in order to define the method of data retrieval.

Q2. Use the following table, named “user_table”, to answer the next two questions.

userId  username  email
1       admin     [email protected]
2       h4xor     [email protected]

How would you query the entire username column, however many rows it contains?

  • SELECT user_table FROM username
  • SELECT username FROM user_table
  • SELECT username FROM user_table WHERE userId=1
  • SELECT username FROM userId WHERE *

Q3. How would you go about querying the entire database table (please refer to question 2’s table)?

  • SELECT user_table FROM *
  • SELECT * FROM * WHERE user_table
  • SELECT username, email FROM userId
  • SELECT * FROM user_table
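
The two query shapes in Q2 and Q3 can be tried out with Python's built-in sqlite3 module. This is only a minimal sketch: the email values are placeholders (the source shows them obscured), and the ORDER BY clauses are added solely to make the output order predictable.

```python
import sqlite3

# In-memory database standing in for the quiz's user_table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_table (userId INTEGER, username TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO user_table VALUES (?, ?, ?)",
    [
        (1, "admin", "admin@example.com"),  # placeholder email (assumption)
        (2, "h4xor", "h4xor@example.com"),  # placeholder email (assumption)
    ],
)

# One column, every row: no WHERE clause is involved.
usernames = [
    row[0]
    for row in conn.execute("SELECT username FROM user_table ORDER BY userId")
]

# Every column, every row.
all_rows = conn.execute("SELECT * FROM user_table ORDER BY userId").fetchall()

print(usernames)
print(all_rows)
```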

Q4. What is the global indexing table?

  • A global table that uses a specific technique called indexing and the table uses an index as the primary key.
  • An index table in order to keep track of a given data type that might exist within multiple machines.
  • An index table in order to keep track of data records within one machine.
  • An index table in order to keep track of a given data type that might exist within one machine.

Q5. What are the three computing steps of a semi-join?

  • Project, Ship, Reduce
  • Project, Decompose, Send
  • Index, Join, Display
  • Query, Join, Display
  • None Applicable

Q6. What is the purpose of a semi-join?

  • Another name for join: an operation to combine two tables by column.
  • Increase the efficiency of sending data across multiple machines.
  • Increase the speed of the join for trade-off of increased data transmission cost.
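
The project, ship, reduce sequence named in Q5 can be sketched in plain Python. The two "machines" below are just lists, and the table names and rows are hypothetical; the point is only to show that a small set of join keys is shipped instead of a whole table.

```python
# Hypothetical tables living on two different machines.
site_a_orders = [  # (order_id, customer_id)
    (100, 1), (101, 3), (102, 1), (103, 7),
]
site_b_customers = [  # (customer_id, name)
    (1, "Ada"), (2, "Bo"), (3, "Cy"), (4, "Di"), (5, "Ed"),
]

# Step 1 (project): keep only the distinct join-column values at site A.
join_keys = {customer_id for _, customer_id in site_a_orders}

# Step 2 (ship): send just those keys to site B (simulated by reusing the set).
shipped_keys = join_keys

# Step 3 (reduce): site B keeps only the rows that can take part in the join.
reduced_customers = [row for row in site_b_customers if row[0] in shipped_keys]

# Only the reduced rows travel back to site A, where the join is completed.
names = dict(reduced_customers)
joined = [
    (order_id, cid, names[cid])
    for order_id, cid in site_a_orders
    if cid in names
]

print(reduced_customers)
print(joined)
```

In this toy data only two of the five customer rows need to cross the network, which is the saving a semi-join is after.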

Q7. What is a subquery?

  • A query statement within another query.
  • A shorter query than normal.
  • An alternative query that acts as a substitute for another query.

Q8. What is a correlated subquery?

  • A type of query that contains a subquery that requires information from a query one level up.
  • A type of query that contains a relationship between a variable attribute x and a variable attribute y. The two variables have a dependent relationship causing a correlation.
  • A type of query that requires two tables in order to calculate values.

Q9. What is the purpose of GROUP BY queries?

  • Enables calculations based on specific columns of the table.
  • Enables queries within queries.
  • Required before you can use functions like AVG, SUM, MIN, MAX, COUNT.
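
The three query forms asked about in Q7 to Q9 can be compared side by side with sqlite3. The sales table and its values are hypothetical, chosen only so that the plain subquery, the correlated subquery, and the GROUP BY query return different results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")  # hypothetical table
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10), ("east", 30), ("west", 5), ("west", 7), ("west", 9)],
)

# Plain subquery: the inner query runs independently of the outer one.
above_overall_avg = conn.execute(
    "SELECT region, amount FROM sales "
    "WHERE amount > (SELECT AVG(amount) FROM sales) "
    "ORDER BY amount"
).fetchall()

# Correlated subquery: the inner query refers to the outer row (s.region).
above_region_avg = conn.execute(
    "SELECT region, amount FROM sales AS s "
    "WHERE amount > (SELECT AVG(amount) FROM sales WHERE region = s.region) "
    "ORDER BY region, amount"
).fetchall()

# GROUP BY: one aggregate result per distinct region.
per_region = conn.execute(
    "SELECT region, COUNT(*), SUM(amount) FROM sales "
    "GROUP BY region ORDER BY region"
).fetchall()

print(above_overall_avg)
print(above_region_avg)
print(per_region)
```

Note that the first query uses AVG without any GROUP BY, so the aggregate is computed over the whole table.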

Q10. Consider the following generic statement for questions 10-12:

db.<collection>.find(<query filter>, <projection>).<cursor modifier>

Which part of the statement would reflect that of the FROM statement in SQL as illustrated in the lecture?

  • <query filter>
  • <collection>
  • <cursor modifier>
  • <projection>

Q11. Which part of the statement would reflect that of the SELECT statement in SQL as illustrated in the lecture?

  • <query filter>
  • <projection>
  • <cursor modifier>
  • <collection>

Q12. Which part of the statement would reflect that of the WHERE statement in SQL as illustrated in the lecture?

  • <projection>
  • <cursor modifier>
  • <query filter>
  • <collection>

Q13. A sample part of the data structure is as follows:

{ _id: 1, userIndex: 10, email: "[email protected]", retainRate: 2 }

Which statement would most likely retrieve the email info for user indexes greater than 24?

  • db.userIndex.find({email:{$gt:24}}, {_id:0})
  • db.email.find({userIndex:{$gt:24}}, {email:1, _id:0})
  • db.userIndex.find({email:{$lte:24}}, {_id:0})
  • db.email.find({userIndex:{$lte:24}}, {email:1, _id:0})

Q14. What does it mean to have _id:0 in the query statement?

  • Grab the first object in the results.
  • Grab as many objects as possible.
  • It has no effect; it is simply a convention kept for compatibility.
  • Tell MongoDB not to return a document id.
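
To see how the query filter and the projection in Q10 to Q14 interact, here is a small pure-Python emulation. It is not pymongo and not MongoDB itself: the find function below is a hypothetical stand-in that supports only $gt and $lte filters and the simple 0/1 projections used in the quiz, and the documents use placeholder emails.

```python
OPS = {
    "$gt": lambda value, target: value > target,
    "$lte": lambda value, target: value <= target,
}

def matches(doc, query_filter):
    """WHERE-like step: every condition on every field must hold."""
    return all(
        OPS[op](doc[field], target)
        for field, cond in query_filter.items()
        for op, target in cond.items()
    )

def project(doc, projection):
    """SELECT-like step: _id is returned unless the projection says _id: 0."""
    included = [k for k, flag in projection.items() if flag == 1 and k != "_id"]
    if included:
        keys = (["_id"] if projection.get("_id", 1) == 1 else []) + included
    else:
        keys = [k for k in doc if projection.get(k, 1) != 0]
    return {k: doc[k] for k in keys}

def find(docs, query_filter, projection):
    """Hypothetical stand-in for db.<collection>.find(<query filter>, <projection>)."""
    return [project(d, projection) for d in docs if matches(d, query_filter)]

docs = [  # placeholder documents (assumption)
    {"_id": 1, "userIndex": 10, "email": "a@example.com", "retainRate": 2},
    {"_id": 2, "userIndex": 25, "email": "b@example.com", "retainRate": 1},
    {"_id": 3, "userIndex": 40, "email": "c@example.com", "retainRate": 3},
]

emails_only = find(docs, {"userIndex": {"$gt": 24}}, {"email": 1, "_id": 0})
emails_with_id = find(docs, {"userIndex": {"$gt": 24}}, {"email": 1})

print(emails_only)
print(emails_with_id)
```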

Quiz 2 – Postgres, MongoDB and Pandas

Q1. What is the highest level that the team has reached in game clicks? (Hint: use the MAX operation in Postgres).

  • 6
  • 8
  • 9
  • 10
  • 7

Q2. How many user IDs (repeats allowed) have reached the highest level found in the previous question? (Hint: in Postgres, you may use either two queries or a subquery.)

  • 106436
  • 67271
  • 122757
  • 51294
  • 98823

Q3. How many user IDs (repeats allowed) reached the highest level in game clicks and also clicked the highest price in buy clicks? Hint: Refer to question 4 for ideas.

  • 66887
  • 32747
  • 23301
  • 73226

Q4. What does the following line of code do in Postgres?

SELECT count(userid) FROM (SELECT buyclicks.userId, teamLevel, price FROM buyclicks JOIN gameclicks on buyclicks.userId = gameclicks.userId) temp WHERE price=3 and teamLevel=5;

  • Displays the users who have bought items worth $3 and have had a team with level 5.
  • This is an invalid line of code, the subquery is not formatted properly.
  • Counts the users who exist in both the gameclicks and buyclicks files.
  • Finds the total number of user IDs (repeats allowed) in buy-clicks that have bought items priced at $3 and were in a team with level 5 at some point in time.
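
The effect of "repeats allowed" in a join-then-count query like the one in Q4 is easier to see on a tiny data set. The sketch below runs an adapted version of the query in sqlite3 rather than Postgres; the rows are made up, and the subquery alias is renamed to joined with an explicit AS userId, which are changes made for this illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buyclicks (userId INTEGER, price INTEGER)")
conn.execute("CREATE TABLE gameclicks (userId INTEGER, teamLevel INTEGER)")

# Made-up rows: user 1 has two qualifying purchases and two qualifying clicks.
conn.executemany("INSERT INTO buyclicks VALUES (?, ?)",
                 [(1, 3), (1, 3), (2, 3), (3, 20)])
conn.executemany("INSERT INTO gameclicks VALUES (?, ?)",
                 [(1, 5), (1, 5), (2, 4), (3, 5)])

subquery = (
    "SELECT buyclicks.userId AS userId, teamLevel, price "
    "FROM buyclicks JOIN gameclicks ON buyclicks.userId = gameclicks.userId"
)

# Counts joined rows, so a user appears once per (purchase, click) pair.
with_repeats = conn.execute(
    f"SELECT count(userId) FROM ({subquery}) joined "
    "WHERE price = 3 AND teamLevel = 5"
).fetchone()[0]

# Counting distinct users instead gives a different number.
distinct_users = conn.execute(
    f"SELECT count(DISTINCT userId) FROM ({subquery}) joined "
    "WHERE price = 3 AND teamLevel = 5"
).fetchone()[0]

print(with_repeats, distinct_users)
```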

Q5. In the MongoDB data set, what is the username of the Twitter account that has a tweet_followers_count of exactly 8973882?

  • CreateImga
  • Autocenterit
  • FIFAcom
  • SasSpear

Quiz 3 – Information Integration

Q1. What is the main problem with big data information integration?

  • Pay-as-you-go model
  • Probabilistic Schema Mapping
  • Many sources
  • Mediated Schema

Q2. What would be the two possible solutions associated with “big data” information integration as mentioned in lecture? (Choose 2)

  • Probabilistic Schema Mapping
  • Customer Transactions
  • Pay-as-you-go Model
  • Mediated Schema
  • Attribute Grouping

Q3. What are mediated schemas?

  • Schemas created from customer info.
  • Schemas created entirely from attribute grouping.
  • A type of probabilistic schema mapping.
  • Schema created from integrating two or more schemas.

Q4. In attribute grouping, how would one evaluate if two attributes should go together? (Choose 2)

  • Probability of Two Attributes Co-occurring
  • Integrated Views
  • Similarity of Attributes
  • Customer Interaction
  • Candidate Designs

Q5. What is a data item?

  • Data found in a customer transaction.
  • Data that represents an aspect of a real-world entity.
  • The real worth of a data value.
  • Data found in a mediated schema.

Q6. What is data fusion?

  • Extracting a global value from a data source.
  • Extracting true sources from a data source.
  • Extracting the true value of a data item.
  • Another term for customer analytics.

Q7. What is a potential problem of having too many data sources as mentioned in lecture?

  • Too much data processing required for compression.
  • Too many data values.
  • Schema mapping becomes impossible.
  • None, the problem is not a problem when using big data methodologies.

Q8. What do we mean when we say “the true value of a data item”?

  • Extrapolated data from a data item that represents the worth of that item.
  • Data created from statistical estimations.
  • Another term for data fusion.

Q9. What is a potential method to deal with too many data sources as mentioned in lecture?

  • Compare and weigh each source by their trustworthiness.
  • Randomly select a sample of sources to represent the various data sources.
  • None, the more the better.
  • Take fewer samples per tick.

Quiz 4 – Hands-On with Splunk

Q1. Which of the queries below will return the average population of the counties in Georgia (be careful not to include the population of the state of Georgia itself)?

  • None of the above
  • source="census.csv" CTYNAME != "Georgia" STNAME="Georgia" | stats sum(CENSUS2010POP)
  • source="census.csv" CTYNAME != "Georgia" STNAME="Georgia" | stats mean(CENSUS2010POP)
  • source="census.csv" STNAME="Georgia" | stats mean(CENSUS2010POP)

Q2. What is the average population of the counties in the state of Georgia (be careful not to include the population of the state of Georgia itself)?

  • 394383.53786
  • 45373.454788
  • 243767.4564
  • 60928.635220

Q3. Of the options below, which query allows you to find the state with the most counties?

  • source="census.csv" | stats count by CENSUS2010POP | sort count
  • stats count by STNAME | sort -count
  • source="census.csv" | stats count by CTYNAME | sort num(count)
  • source="census.csv" | stats count by STNAME | sort count desc
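
What the filter, mean, and count-by steps in these Splunk questions compute can be mirrored in plain Python. This is not Splunk, only an emulation on hypothetical rows shaped like census.csv; the population values are made up, and each state has one state-level row whose CTYNAME equals the state name, as the questions imply.

```python
from collections import Counter
from statistics import mean

rows = [  # hypothetical rows; populations are invented for the example
    {"STNAME": "Georgia", "CTYNAME": "Georgia", "CENSUS2010POP": 600},  # state-level row
    {"STNAME": "Georgia", "CTYNAME": "Fulton County", "CENSUS2010POP": 400},
    {"STNAME": "Georgia", "CTYNAME": "Cobb County", "CENSUS2010POP": 200},
    {"STNAME": "Texas", "CTYNAME": "Texas", "CENSUS2010POP": 900},  # state-level row
    {"STNAME": "Texas", "CTYNAME": "Harris County", "CENSUS2010POP": 500},
    {"STNAME": "Texas", "CTYNAME": "Dallas County", "CENSUS2010POP": 300},
    {"STNAME": "Texas", "CTYNAME": "Bexar County", "CENSUS2010POP": 100},
]

# Like: STNAME="Georgia" CTYNAME != "Georgia" | stats mean(CENSUS2010POP)
county_mean = mean(
    r["CENSUS2010POP"]
    for r in rows
    if r["STNAME"] == "Georgia" and r["CTYNAME"] != "Georgia"
)

# Without the CTYNAME filter, the state-level row skews the average.
mean_with_state_row = mean(
    r["CENSUS2010POP"] for r in rows if r["STNAME"] == "Georgia"
)

# Like: stats count by STNAME, then sorted by count in descending order.
rows_per_state = Counter(r["STNAME"] for r in rows).most_common()

print(county_mean, mean_with_state_row)
print(rows_per_state)
```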

Q4. What state contains the most counties?

  • Texas
  • California
  • Georgia
  • Alaska

Q5. Of the options below, which query allows you to find the most populated counties in the state of Texas?

  • STNAME="Texas" CENSUS2010POP > 100000 | sort -CENSUS2010POP | table CENSUS2010POP,CTYNAME
  • STNAME="Texas" CENSUS2010POP > 100000 | sort CENSUS2010POP desc | table CENSUS2010POP,CTYNAME
  • Both
  • Neither

Q6. What is the most populated county in the state of Texas?

  • Harris
  • Dallas
  • Travis
  • Bexar

Quiz 5 – Pipeline and Tools

Q1. What is data-parallelism as defined in the lecture?

  • Having multiple data pipelines running at the same time.
  • Simultaneously processing input data from multiple cores.
  • Running the same function simultaneously for the partitions of a data set on multiple cores.
  • At each step of the data pipeline, process values simultaneously by using multiple cores.

Q2. Of the following, which procedure best generalizes big data procedures such as (but not limited to) the map-reduce process?

  • split->sort->merge
  • split->do->merge
  • split->map->shuffle and sort->reduce
  • split->shuffle and sort->map->reduce
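
A generic split, do, merge shape, with the same function applied to every partition as in Q1, might look like the sketch below. It uses a thread pool purely to show the structure; the function names split, do, and merge are chosen for the illustration, and real big data systems distribute the partitions across cores or machines instead.

```python
from concurrent.futures import ThreadPoolExecutor

def split(data, n_parts):
    """Split: cut the data set into roughly equal partitions."""
    size = -(-len(data) // n_parts)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def do(partition):
    """Do: the same function runs on every partition (here, a sum of squares)."""
    return sum(x * x for x in partition)

def merge(partials):
    """Merge: combine the per-partition results into one answer."""
    return sum(partials)

data = list(range(1, 11))
partitions = split(data, 3)

# Each partition is handed to a worker; pool.map keeps the partition order.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(do, partitions))

total = merge(partials)
print(partitions, partials, total)
```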

Q3. What are the three layers for the Hadoop Ecosystem? (Choose 3)

  • Data Manipulation and Integration
  • Data Management and Storage
  • Data Integration and Processing
  • Coordination and Workflow Management
  • Data Creation and Storage

Q4. What are the 5 key points in order to categorize big data systems?

  • Execution model, Latency, Scalability, Programming Language, Fault Tolerance
  • Coordination, Latency, Productivity, Speed, Fault Tolerance
  • Execution model, Speed, Scalability, Flexibility, Fault Tolerance
  • Coordination, Latency, Productivity, Flexibility, Fault Tolerance

Q5. What is the lambda architecture as shown in lecture?

  • A type of hybrid data processing architecture.
  • A type of architecture that only contains part of the data processing method.
  • A type of swappable data processing layer.
  • An architecture that natively supports lambda calculus.

Q6. Which of the following scenarios is NOT an aggregation operation?

  • Counting the total number of data per type.
  • Averaging the total number of data per type.
  • Removing undefined values.
  • Counting the total number of data.

Q7. What usually happens to data when aggregated as mentioned in lecture?

  • Data becomes organized.
  • Data becomes smaller.
  • Data becomes personalized.
  • Data becomes faster to process.

Q8. What is K-means clustering?

  • Divide samples using k lines.
  • Classify data by k decisions.
  • Group samples into k clusters.
  • Classify data by k actions.

Q9. Why is Hadoop not a good platform for machine learning as mentioned in lecture? (Choose 4)

  • Too massive.
  • Requires nodes and multiple machines.
  • Bottleneck using HDFS.
  • Map and Reduce Based Computation.
  • Unable to support machine learning.
  • No interactive shell and streaming.
  • Java support only.

Q10. What are the layers (parts) of Spark? (Choose 5)

  • SparkSQL
  • GraphX
  • MLlib
  • Spark Graph
  • Spark Core
  • Spark RDD
  • Spark Streaming
  • Worker Node

Q11. What is in-memory processing?

  • Having the pipeline completely in disk.
  • Writing data to disk between pipeline steps.
  • Writing data to memory between pipeline steps.
  • Having the pipeline completely in memory.
  • Having the input completely in disk.
  • Having the input completely in memory.

Quiz 6 – WordCount in Spark

Q1. What does the following line of code do?

words = lines.flatMap(lambda line: line.split(" "))

  • Each line in the document is split up into words.
  • Each line in the document is split into various Spark partitions.
  • Each word in each line is counted.
  • Each word is merged into lines to be counted later.

Q2. What does the following line of code imply about the state of partitions before the action is performed?

words = lines.flatMap(lambda line: line.split(" "))

  • Each Spark partition corresponds to a line in the document.
  • Each Spark partition corresponds to a word in the document.
  • There is only one single partition containing the full document.

Q3. When the following command is executed, where is the file written and how can it be accessed?

counts.coalesce(1).saveAsTextFile('hdfs:/user/cloudera/wordcount/outputDir')

  • HDFS and through the system directory with the “cd” terminal command.
  • HDFS and through the “hadoop fs” command.
  • The local file system and through the “hadoop fs” command.
  • The local file system and through the directory with the “cd” terminal command.

Q4. What does the number one (1) allow us to do in the following line of code?

tuples = words.map(lambda word: (word,1))

  • The number represents the number of partitions in charge of counting each line.
  • The number represents the number of partitions in charge of keeping track of each word.
  • None, completely arbitrary in order to apply an algorithm that requires a tuple.
  • Treat each word with a weight of one during the counting process.
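
The WordCount steps in this quiz can be traced without a Spark cluster. The sketch below is plain Python, not PySpark: it mimics what flatMap and map do to the data, and the final summing loop stands in for the reduce step, which the quiz itself does not show. The input lines are hypothetical.

```python
from collections import defaultdict
from itertools import chain

lines = ["to be or not", "to be"]  # hypothetical input document, one string per line

# Like lines.flatMap(lambda line: line.split(" ")):
# every line yields several words, flattened into a single sequence.
words = list(chain.from_iterable(line.split(" ") for line in lines))

# Like words.map(lambda word: (word, 1)): each word carries a count of one.
tuples = [(word, 1) for word in words]

# Stand-in for the reduce step: add up the ones for each distinct word.
totals = defaultdict(int)
for word, one in tuples:
    totals[word] += one
counts = dict(totals)

print(words)
print(counts)
```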

Quiz 7 – More on Spark

Q1. Which part of Spark is in charge of creating RDDs?

  • Driver Program
  • Local CPU
  • Storage
  • Spark Executor
  • Worker Node

Q2. How does lazy evaluation work in Spark?

  • Transformations are queued and executed at a certain threshold.
  • Transformations are not executed until the action stage.
  • Actions are queued and executed at a certain threshold.
  • Actions are not executed until the transformation stage.

Q3. What are the consequences of lazy evaluation as mentioned in lecture?

  • Errors sometimes do not show up until the action stage.
  • Hiccups within the system during queue execution.
  • There are no consequences.
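
Python generators give a rough analogy for the lazy evaluation discussed in Q2 and Q3. This is not Spark code; it is a sketch in which building the pipeline plays the role of the transformations and list() plays the role of an action, with a deliberately bad record in hypothetical input.

```python
def parse(records):
    """A 'transformation': nothing runs until a consumer pulls values."""
    for record in records:
        yield int(record)  # fails on bad input, but only when evaluated

raw = ["1", "2", "oops", "4"]  # hypothetical input with one bad record

# Building the pipeline raises nothing, even though the data is bad.
pipeline = (value * 10 for value in parse(raw))
built_without_error = True

try:
    result = list(pipeline)  # the 'action' forces evaluation
    error_at_action = None
except ValueError as exc:
    error_at_action = str(exc)

print(built_without_error, error_at_action)
```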

Q4. What is a wide transformation?

  • A transformation that requires data shuffling across node partitions.
  • Transformations that take a lot of nodes to complete.
  • A longer time-taking transformation compared to narrow transformations.
  • The name for the most used transformations.

Q5. Where is the data from each worker node sent after a collect function is called?

  • Other Worker Nodes
  • Spark Streaming
  • Spark Context
  • None; Stays in the Same Node
  • Spark SQL

Q6. What are DataFrames?

  • A special type of data node that contains framework to manipulate SQL.
  • A column-like data format that can be read by Spark SQL.
  • A type of narrow transformation.

Q7. Can RDDs be converted into DataFrames directly without manipulation?

  • Yes
  • No: lines have to be converted into rows.
  • No: RDDs need to be made relational first.
  • No: RDDs cannot be converted into DataFrames.

Q8. What is the function of Spark SQL as mentioned in lecture? (Choose 3)

  • Efficient data manipulation using an SQL-like structure.
  • Enables relational queries on Spark.
  • Deploy business intelligence tools over Spark.
  • Connect to a variety of databases.
  • Better ability to manipulate big data.
  • Better worker node interpolation.

Q9. What is a triplet in GraphX?

  • A type of data to contain vertex info.
  • A type of data to contain the information on connections between vertices and edges.
  • A type of data to contain both edge and vertex info.
  • A type of data to contain edge info.

Quiz 8 – SparkSQL and Spark Streaming

Q1. What does the following filter line of code do?

df.filter(df["teamlevel"] > 1)

  • Filter each row to show only team levels larger than 1.
  • Filter each column to show only team levels larger than 1.
  • Select the first two columns of the data and filter each column to show only team levels larger than 1.
  • Select the first two columns of the data and displays only team levels greater than 1.

Q2. What does the following do?

df.select("userid", "teamlevel").show(5)

  • Select the rows named “userid” and “teamlevel” and display first 5 rows.
  • Display all rows except “userid” and “teamlevel”.
  • Select the columns named “userid” and “teamlevel” and display first 5 rows.
  • Display all columns except “userid” and “teamlevel”.

Q3. What does the 1 represent in the following line of code?

ssc = StreamingContext(sc,1)

  • To create only one partition to manage the stream.
  • To specify debug output.
  • To create one single context.
  • A batch interval of 1 second.

Q4. What does the following code do?

window = vals.window(10, 5)

  • Creates a window that combines 10 seconds worth of data and moves by 5 seconds.
  • Creates 10 windows with 5 seconds worth of data in them.
  • Creates 10 windows with 5 batch intervals in between.
  • Creates a batch interval between 10 seconds and 5 seconds.
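
The two arguments to window in Q4 are a window length and a slide interval. The sketch below emulates that behaviour in plain Python, assuming a batch interval of 1 second as in Q3; sliding_windows is a hypothetical helper written for this illustration, not a Spark Streaming function.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Yield (end_time, combined_data) for each window.

    batches[i] holds the data that arrived during second i + 1,
    i.e. the batch interval is assumed to be 1 second.
    """
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        combined = [item for batch in batches[start:end] for item in batch]
        yield end, combined

# Twenty one-second batches, each holding its own timestamp as data.
batches = [[second] for second in range(1, 21)]

windows = list(sliding_windows(batches, window_length=10, slide_interval=5))

for end, combined in windows:
    print(end, combined)
```

Each window after the first overlaps the previous one by 5 seconds of data, because the window is 10 seconds long but moves forward by only 5.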

Quiz 9 – Check Your Query Results

Q1. How many tweets have location not null?

  • 6937
  • 6945
  • No option applicable.
  • 5957
  • 6973

Q2. How many people have more followers than friends? (Hint: use this.user instead of user).

  • 6238
  • 5809
  • 5590
  • 6673
  • 5206

Q3. Perform a query that returns the text of tweets which have the string “http://”. Which of the following substrings do NOT occur in the results? (Choose all that apply)

  • @Infosmessi_
  • @DundalkFC
  • @Ass0Star
  • @espn
  • @TerraceImages

Q4. Query: Return all the tweets which contain text “England” but not “UEFA”. In these results the string “Euro 2016” appears in…

  • 2 tweets
  • 3 tweets
  • 0 tweets
  • More than 6 tweets.
  • 5 tweets

Q5. Query: Get all the tweets from the location “Ireland” which also contain the string “UEFA”. In this result the user with the highest friends count is…

  • Pauldonaghue
  • ProfitwatchInfo
  • irishexaminer
  • DerekRantsGames
  • Insight4News4

Quiz 10 – Check your Analysis Results

Q1. How many different countries are mentioned in at least one tweet?

  • 44
  • 112
  • 211
  • 64

Q2. How many times is any country mentioned in a tweet?

  • 52
  • 211
  • 397
  • 26634

Q3. What are the three countries with the highest mention counts?

  • Nigeria, Slovakia, Germany
  • Thailand, Iceland, Mexico
  • Norway, Nigeria, France
  • Thailand, Mexico, Denmark

Q4. How many times was France mentioned in a tweet?

  • 25
  • 8
  • 42
  • 30

Q5. Which country was mentioned most: Kenya, Wales, or the Netherlands?

  • Netherlands
  • Wales
  • Kenya

Q6. What is the average number of times a country is mentioned? (Round to the nearest integer)

  • 44
  • 15
  • 9
  • 3
