Big Data Modeling and Management Systems Quiz Answers

Week 1: Quiz Answers

Q1. (Questions 1-3 pertain to the video lecture “Exploring the Relational Data Model of CSV”) What is the approximate population of La Paz County in the state of Arizona for the CENSUS2010POP (column H)? (Choose the best answer.)

15000
25000
10000
20000

Q2. What county in the state of Wyoming has the smallest estimated population?

Platte
Uinta
Niobrara
Sweetwater

Q3. At 2:45 in the video, the instructor creates a filter for all of the counties in California with a population greater than 1,000,000. However, the results also include the entire state of California. This anomalous value might skew our analysis if, for example, we wanted to compute the average population of these results. What additional filter might work to resolve this problem?

Add a filter to detect and remove results which do not include the word “County” in column G.
Add a filter which finds all counties with population greater than 100,000 AND less than 10,000,000 for column H (CENSUS2010POP).
Add a filter where the value in column E is greater than 1,000,000.
None of the above
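
To make the first kind of filter listed above concrete (dropping rows whose name column lacks the word "County"), here is a minimal Python sketch. The rows are made-up placeholders rather than census figures, and the header name CTYNAME is an assumption standing in for column G; only CENSUS2010POP (column H) is named in the question.

```python
import csv
import io

# Made-up placeholder rows, NOT real census figures. CTYNAME is an assumed
# header for column G; CENSUS2010POP is column H as named in the question.
SAMPLE_CSV = """STNAME,CTYNAME,CENSUS2010POP
Examplestate,Examplestate,5000000
Examplestate,Alpha County,1200000
Examplestate,Beta County,300000
Examplestate,Gamma County,2500000
"""

def large_counties(csv_text, min_population=1_000_000):
    """Return (name, population) pairs for county rows above a threshold."""
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip state-level summary rows: their name lacks the word "County".
        if "County" not in row["CTYNAME"]:
            continue
        population = int(row["CENSUS2010POP"])
        if population > min_population:
            results.append((row["CTYNAME"], population))
    return results

counties = large_counties(SAMPLE_CSV)
# Average over counties only; the state-level row no longer skews it.
average = sum(pop for _, pop in counties) / len(counties)
print(counties, average)
```

Without the name check, the state-level row would pass the population threshold and inflate the average.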

Q4. (Questions 4 and 5 pertain to the video “Exploring Sensor Data”) How often (in seconds) do the R5 measurements occur?

Q5. What is the field for rain accumulation?

Q6. (Questions 6 and 7 pertain to the video lecture “Exploring the Array Data Model of an Image”) What is the (Red, Green, Blue) pixel value for location 500, 2000?

(163, 118, 79)
(134, 145, 46)
(50, 156, 182)
(100, 123, 149)

Q7. Is this value likely to be land or ocean?

Land
Ocean
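
As a small illustration of the array data model behind Questions 6 and 7, the sketch below stores a tiny made-up image as rows x columns x (R, G, B) and looks up one pixel. Treating a location as (row, column) is an assumption, and `dominant_channel` is a hypothetical helper that offers only a naive hint about what a pixel might show.

```python
# A tiny made-up 2 x 3 image: a list of rows, each row a list of (R, G, B) tuples.
image = [
    [(10, 20, 30), (40, 50, 60), (70, 80, 90)],
    [(15, 25, 35), (45, 55, 65), (75, 85, 95)],
]

def pixel_at(img, row, col):
    """Return the (R, G, B) tuple stored at the given row and column."""
    return img[row][col]

def dominant_channel(pixel):
    """Name the channel with the largest value (first one wins on ties)."""
    names = ("red", "green", "blue")
    return names[pixel.index(max(pixel))]

red, green, blue = pixel_at(image, 1, 2)
print((red, green, blue), dominant_channel((red, green, blue)))
```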

Q8. (Questions 8 and 9 pertain to the video lecture “Exploring the Semistructured Data Model of JSON”) Given a tweet, what path would you most likely enter to obtain a count of the number of followers for a user?

user/followers_count
user/statuses_count
user/listed_count
None of the above
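
The idea of a path into a nested JSON document can be sketched in a few lines of Python. The tweet below is a heavily trimmed, made-up example with placeholder values, and `get_path` is a hypothetical helper, not part of any viewer or library used in the lecture.

```python
import json

# A heavily trimmed, made-up tweet; all values are placeholders.
TWEET_JSON = """
{
  "text": "hello",
  "user": {"followers_count": 42, "statuses_count": 7, "listed_count": 1}
}
"""

def get_path(document, path):
    """Follow a slash-separated path such as 'user/followers_count'."""
    node = document
    for key in path.split("/"):
        node = node[key]  # raises KeyError if the path does not exist
    return node

tweet = json.loads(TWEET_JSON)
followers = get_path(tweet, "user/followers_count")
print(followers)
```

Each path segment descends one level in the nesting, so the same helper works for any depth.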

Q9. Which of the following fields are nested within the ‘entities’ field (select all that apply)?

tweets
user_mentions
events
views
symbols
urls

Week 2: Data Models Quiz

Q1. What is a possible pitfall of utilizing Excel as a way to manipulate small databases?

Excel does not enforce many principles of relational data models.
Excel is a user program and thus cannot run on a server.
Excel does not allow algorithms for data manipulation.

Q2. What does the term “atomic” mean in the context of relational databases?

Fixed schema of a particular database.
A tuple that cannot be reduced.
A column or row of data. Depends on the context.
One unit of information that cannot be decomposed.

Q3. What is the Pareto-Optimality problem?

Find the shortest path from source node to target node.
Find the best possible path given two or more optimization criteria, where the criteria cannot all be fully optimized simultaneously.
Find the optimal path that requires going through specific nodes given by the user.
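
To illustrate what trading off two criteria looks like, the sketch below keeps only the candidate paths that no other path beats on both criteria at once. The path names and the (travel time, toll cost) pairs are hypothetical, and lower is assumed to be better for both.

```python
def dominates(a, b):
    """True if a is no worse than b on every criterion and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better

def pareto_front(candidates):
    """Keep the candidates that are not dominated by any other candidate."""
    return {
        name: cost
        for name, cost in candidates.items()
        if not any(dominates(other, cost) for other in candidates.values())
    }

# Hypothetical paths with (travel_time, toll_cost); lower is better for both.
paths = {
    "A": (10, 5),
    "B": (12, 3),
    "C": (11, 6),
    "D": (15, 3),
}
front = pareto_front(paths)
print(front)
```

Here A is faster and B is cheaper, so neither beats the other; C and D are each dominated and drop out.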

Q4. What constitutes a community within a graph?

High density of nodes at a certain location.
A neighborhood defined by an integer constant K around a specific node. All K+1 nodes belong in another community.
A dense amount of edge connections between nodes in a community and a few connections across communities.
Many anomalous neighborhoods within the same vicinity.

Q5. Why are trees useful for semi-structured data such as XML and JSON?

Computers can easily visualize the data with a tree structure.
It is not always the case that XML and JSON can be represented as trees.
Trees take advantage of the parent-child relationship of the data for easy navigation.
They are only useful for XML data, as the tree-like structure is apparent with tags, while JSON does not contain a tree-like structure because it contains arrays.

Q6. What is the general purpose of modeling data as vectors?

Enables weighting of the query.
The ability to normalize vectors allowing probability distributions.
Enables image searching.
Results can be ordered by similarity using vector projection.
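
Ordering results by similarity can be sketched as follows. Cosine similarity is used here as one common measure; that choice is an assumption and not necessarily the measure the course has in mind. The document vectors are made up.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up term vectors for three documents and one query.
documents = {
    "doc1": [1, 0, 0],
    "doc2": [1, 1, 0],
    "doc3": [0, 0, 1],
}
query = [1, 0, 0]

# Most similar document first.
ranked = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranked)
```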

Q7. For questions 7, 8, and 9, suppose a registration website creates data with the following fields for each person registered (note: if the user does not input a value, NULL is stored instead): Name, Date, Address, and Account Number.

Suppose we collect data month by month. Each month, we would have a batch of data containing the fields listed above. At the end of the year, we want to summarize our registrant activities for the entire year, so we would remove redundancies in our data by removing any records with duplicate account numbers from month to month. What type of operation do we use in this scenario?

Join
Not an Operation
Subsetting
Union
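
One way to read the scenario is as a set-style combination of the monthly batches in which each account number appears only once. The sketch below shows that reading; the records, the field names, and the rule of keeping the first occurrence are all assumptions made for illustration.

```python
def union_by_account(*batches):
    """Combine batches, keeping the first record seen for each account number."""
    seen = set()
    merged = []
    for batch in batches:
        for record in batch:
            account = record["account_number"]
            if account in seen:
                continue  # duplicate account from an earlier batch
            seen.add(account)
            merged.append(record)
    return merged

# Made-up monthly batches; None stands in for NULL.
january = [
    {"name": "Ann", "date": "2016-01-05", "address": None, "account_number": 1001},
    {"name": "Bob", "date": "2016-01-09", "address": "1 Main St", "account_number": 1002},
]
february = [
    {"name": "Bob", "date": "2016-02-02", "address": "1 Main St", "account_number": 1002},
    {"name": "Cy", "date": "2016-02-11", "address": None, "account_number": 1003},
]

yearly = union_by_account(january, february)
accounts = [record["account_number"] for record in yearly]
print(accounts)
```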

Q8. From the information given in question 7, what are the constraints, if any, which we have placed on the Account Number field for the end of year collection?

Account should have at most n digits.
If we had n duplicate Account Numbers then we will remove n-1 duplicate fields.
There are no constraints.
Account Number should be unique.

Q9. Suppose 100 people sign up for our system and 60 of them did not input an address. The system lists the values as NULL for these empty entries in the address field. Would this situation still have structure for our data?

No, because the majority of the data do not have a specific field filled, thus our originally defined structure is lost.
Yes, the data has structure because we have placed a structural constraint on the data, thus the data will always have the originally defined structure.

Week 3: Data Formats and Streaming Data Quiz

Q1. What is true between data modeling and the formatting of the data?

There is a one to one correspondence between formatting data and data modeling. For every model of data, there is only one way to store the data.
There is always one specific schema for storing model data that is the best and preferred method for the specific data representation.
The data does not necessarily need to be formatted in a way that represents the data model, so long as the model can be extrapolated from it.

Q2. What is streaming?

Calculating results using real time data otherwise known as streaming data.
Using static data stored from a real time source in order to process and guide the application.
Utilizing real time data to compute and change the state of an application continuously.
Using sensors to manipulate the system, such as a smart car being able to drive by itself using sensors to detect road hazards.

Q3. Of the following, what best describes the properties of working with streaming data?

Small time windows for working with data.
Data is always utilized for streaming the application.
Data manipulation is near real time.
Independent computations that do not rely on previous or future data.
Always unbounded in sequence; in other words, data is not guaranteed to be in order.
Does not ping the source interactively for a response upon receiving the data.

Q4. What is a characteristic of streaming data?

Data is unbounded in size but requires only finite time and space to process it.
The data is unbounded in size and the size determines the time and space of processing the data.
The data is finite and requires only finite time and space to process the data.
Data is finite in size and size determines the time and space of processing the data.
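
The notion of processing an unbounded stream with bounded memory can be illustrated with a running mean that stores only a count and the current mean. The `sensor_readings` generator is a hypothetical stand-in for a real source, and only the first five values are consumed so the example terminates.

```python
import itertools

class RunningMean:
    """Incremental mean that keeps constant state regardless of stream length."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Fold one new value into the mean without storing past values.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

def sensor_readings():
    """Hypothetical unbounded source: yields 1, 2, 3, ... forever."""
    return itertools.count(1)

stats = RunningMean()
for reading in itertools.islice(sensor_readings(), 5):
    stats.update(reading)
print(stats.count, stats.mean)
```

Each update touches a fixed amount of state, so the cost per element does not grow with the length of the stream.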

Q5. What type of algorithm is required for analyzing streaming data?

Accurate and Consistent
Accurate and Memory Efficient
Fast and Complex
Fast and Simple

Q6. What is lambda architecture?

A specific method for processing streaming data using special real time processes.
A specific hardware architecture for a server made specifically for processing real time data.
A method to process streaming data by utilizing batch processing and real time processing.
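
A very reduced sketch of combining batch and real-time processing is shown below: a batch view is recomputed from stored events, a speed layer counts events that arrived since, and a query merges the two. The page-view events and all function and class names are hypothetical.

```python
from collections import Counter

def build_batch_view(master_dataset):
    """Batch layer: recompute page counts from all stored events."""
    return Counter(event["page"] for event in master_dataset)

class SpeedLayer:
    """Real-time layer: counts events that arrived after the last batch run."""

    def __init__(self):
        self.view = Counter()

    def ingest(self, event):
        self.view[event["page"]] += 1

def query(page, batch_view, speed_view):
    """Serving step: merge the batch result with the recent increments."""
    return batch_view[page] + speed_view[page]

# Made-up events already stored before the last batch run.
master_dataset = [{"page": "home"}, {"page": "home"}, {"page": "about"}]
batch_view = build_batch_view(master_dataset)

# One event arrives after the batch view was built.
speed = SpeedLayer()
speed.ingest({"page": "home"})

total_home = query("home", batch_view, speed.view)
print(total_home)
```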

Q7. Of the following, which best represents the challenge regarding the size and frequency of data?

The size and frequency of the streaming data may be too small.
The size and frequency of the streaming data may be sporadic.
There may not be data to produce the notion of size and frequency.

Q8. What is the difference between data lakes and data warehouses?

Data lakes house raw data while data warehouses contain pre-formatted data.
Data lakes contain only files while data warehouses contain only databases.
Data lakes utilize hierarchical systems while data warehouses use object storage.

Q9. What is schema-on-read?

The process where formatted data is given structure when read.
Another name for data lakes.
Data is stored as raw data until it is read by an application where the application assigns structure.
The process where data is pre-formatted prior to being read but the schema is loaded on read.
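
The idea of assigning structure only when data is read can be sketched like this: raw lines are stored untouched, and the reading application decides on field names, types, and defaults. The raw lines and the chosen schema are made up for illustration.

```python
import json

# Raw records kept exactly as they arrived; no structure enforced at write time.
RAW_LINES = [
    '{"name": "Ann", "account": "1001", "joined": "2016-01-05"}',
    '{"name": "Bob", "account": "1002"}',
]

def read_with_schema(raw_lines):
    """Apply the reader's own schema: typed fields, with None for a missing date."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            "name": str(record["name"]),
            "account": int(record["account"]),  # stored as text, read as int
            "joined": record.get("joined"),     # optional field
        }

rows = list(read_with_schema(RAW_LINES))
print(rows)
```

A different application could read the same raw lines with a different schema, which is the point of deferring structure to read time.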

Week 4: BDMS Quiz

Q1. The desired characteristics of a BDMS include (select all that apply):

Narrow range of query sizes
Continuous data ingestion
Support for common “Big Data” data types
Support for ACID
A full query language
A flexible semi-structured data model

Q2. Fill in the blank with the best answer: The CAP theorem states that _________ all at once within a distributed computer system.

it is impossible to have consistency, accuracy, and partial tolerance
it is necessary to have consistency, accuracy, and partial tolerance
it is necessary to have consistency, availability, and partition tolerance
it is impossible to have consistency, availability, and partition tolerance

Q3. What is the purpose of the acronym BASE?

The same as ACID.
To overcome CAP theorem.
To impose properties on a BDMS in order to guarantee certain results.
Enables stricter enforcement of ACID type design.

Q4. What are ziplists in Redis?

A special type of data type that can store up to 512 MB of image data.
A lookup table that is stored as a value in the database. The lookup table points to actual values in memory.
A compressed list that is stored within the value of the database.
A special type of data type that can store hashes that point to multiple attributes.

Q5. What is one of the main features of Aerospike?

Images as values within the database.
Enables real time data streaming from external sources.
Support for geospatial data storage and geospatial queries.
Better equipped for string based search applications.
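
To give a sense of what a geospatial query computes, the sketch below finds the points within a radius of a center using the haversine great-circle distance. This is plain Python and does not use or imitate any database's API; the user names, coordinates, and the 6371 km Earth radius are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def users_within(users, center, radius_km):
    """Return the names of users whose location lies within radius_km of center."""
    lat0, lon0 = center
    return [
        name
        for name, (lat, lon) in users.items()
        if haversine_km(lat0, lon0, lat, lon) <= radius_km
    ]

# Made-up user locations as (latitude, longitude) in degrees.
users = {"near": (0.0, 0.5), "far": (0.0, 2.0)}
nearby = users_within(users, (0.0, 0.0), 100.0)
print(nearby)
```

A database with geospatial support would typically answer such a query with an index rather than by scanning every point as this sketch does.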

Q6. What database would be best suited for the following scenario: An app development company is trying to implement a cloud based storage system for their new map-based app. The cloud will manage the longitude and latitude of the data in order to track user location.