Introduction to Big Data Quiz Answers – 100% Correct Answers

Quiz 1: Why Big Data and Where Did it Come From

Q1. Which of the following is an example of big data utilized in action today?

  • The Internet
  • Social Media
  • Wi-Fi Networks
  • Individual, Unconnected Hospital Database

Q2. What reasoning was given for the following: why is the “data storage to price ratio” relevant to big data?

  • It isn’t, it was just an arbitrary example on big data usage.
  • Larger storage means easier accessibility to big data for every user because it allows users to download in bulk.
  • Companies can’t afford to own, maintain, and spend the energy to support large data storage unless the cost is sufficiently low.
  • Access to larger storage becomes easier for everyone, which means client-facing services require very large data storage.

Q3. What is the best description of personalized marketing enabled by big data?

  • Being able to use the data from each customer for marketing needs.
  • Marketing to each customer on an individual level and suiting to their needs.
  • Being able to obtain and use customer information for specific groups and utilize them for marketing needs.

Q4. Of the following, which are some examples of personalized marketing related to big data?

  • Facebook revealing posts that cater towards similar interests.
  • A survey that asks your age and markets to you a specific brand.
  • News outlets gathering information from the internet in order to report it to the public.

Q5. What is the workflow for working with big data?

  • Theory -> Models -> Precise Advice
  • Big Data -> Better Models -> Higher Precision
  • Extrapolation -> Understanding -> Reproducing

Q6. Which is the most compelling reason why mobile advertising is related to big data?

  • Mobile advertising in and of itself is always associated with big data.
  • Mobile advertising benefits from data integration with location which requires big data.
  • Mobile advertising allows massive cellular/mobile texting to a wide audience, thus providing large amounts of data.
  • Since almost everyone owns a cell/mobile phone, the mobile advertising market is large and thus requires big data to contain all the information.

Q7. What are the three types of diverse data sources?

  • Machine Data, Map Data, and Social Media
  • Information Networks, Map Data, and People
  • Machine Data, Organizational Data, and People
  • Sensor Data, Organizational Data, and Social Media

Q8. What is an example of machine data?

  • Social Media
  • Weather station sensor output.
  • Sorted data from Amazon regarding customer info.

Q9. What is an example of organizational data?

  • Satellite Data
  • Social Media
  • Disease data from the Centers for Disease Control.

Q10. Of the three data sources, which is the hardest to implement and streamline into a model?

  • People
  • Machine Data
  • Organizational Data

Q11. Which of the following summarizes the process of using data streams?

  • Theory -> Models -> Precise Advice
  • Integration -> Personalization -> Precision
  • Big Data -> Better Models -> Higher Precision
  • Extrapolation -> Understanding -> Reproducing

Q12. Where does the real value of big data often come from?

  • Size of the data.
  • Combining streams of data and analyzing them for new insights.
  • Using the three major data sources: Machines, People, and Organizations.
  • Having data-enabled decisions and actions from the insights of new data.

Q13. What does it mean for a device to be “smart”?

  • Must have a way to interact with the user.
  • Connect with other devices and have knowledge of the environment.
  • Having a specific processing speed in order to keep up with the demands of data processing.

Q14. What does the term “in situ” mean in the context of big data?

  • Accelerometers.
  • In the situation
  • The sensors used in airplanes to measure altitude.
  • Bringing the computation to the location of the data.

Q15. Which of the following are reasons mentioned for why data generated by people are hard to process?

  • Very unstructured data.
  • They cannot be modeled and stored.
  • The velocity of the data is very high.
  • Skilled people to analyze the data are hard to come by.

Q16. What is the purpose of retrieval and storage; pre-processing; and analysis in order to convert multiple data sources into valuable data?

  • To enable ETL methods.
  • Designed to work like the ETL process.
  • To allow scalable analytical solutions to big data.
  • Since the multi-layered process is built into the Neo4j database connection.
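
The three layers named in the question (retrieval and storage; pre-processing; analysis) can be illustrated as a minimal ETL-style pipeline. This is a toy sketch: the inline CSV string and the city/temperature fields are invented stand-ins for a real data source.

```python
import csv
import io

# Stage 1: retrieval and storage -- an inline CSV string stands in for a
# remote source or database (assumption for illustration).
raw = "city,temp\nLa Jolla,18\nSan Diego,n/a\nLa Jolla,20\n"

def retrieve():
    return list(csv.DictReader(io.StringIO(raw)))

def preprocess(rows):
    # Stage 2: pre-processing -- drop rows whose temp field is not numeric
    # and convert the rest to integers.
    return [dict(r, temp=int(r["temp"])) for r in rows if r["temp"].isdigit()]

def analyze(rows):
    # Stage 3: analysis -- a simple aggregate: mean temperature per city.
    totals = {}
    for r in rows:
        totals.setdefault(r["city"], []).append(r["temp"])
    return {city: sum(ts) / len(ts) for city, ts in totals.items()}

result = analyze(preprocess(retrieve()))
print(result)  # → {'La Jolla': 19.0}
```

Each stage only consumes the previous stage's output, which is what lets such pipelines scale out: any layer can be swapped for a distributed implementation without touching the others.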

Q17. Which of the following are benefits for organization-generated data?

  • Higher Sales
  • High Velocity
  • Improved Safety
  • Better Profit Margins
  • Customer Satisfaction

Q18. What are data silos and why are they bad?

  • Highly unstructured data. Bad because it does not provide meaningful results for organizations.
  • Data produced from an organization that is spread out. Bad because it creates unsynchronized and invisible data.
  • A giant centralized database to house all the data production within an organization. Bad because it hinders opportunity for data generation.
  • A giant centralized database to house all the data produced within an organization. Bad because it is hard to maintain as highly structured data.

Q19. Which of the following are the benefits of data integration?

  • Monitoring of data.
  • Adds value to big data.
  • Increase data availability.
  • Unify your data system.
  • Reduce data complexity.
  • Increase data collaboration.

Quiz 2: V for the V’s of Big Data

Q1. Amazon has been collecting review data for a particular product. They have realized that almost 90% of the reviews were mostly a 5/5 rating. However, of the 90%, they realized that 50% of them were customers who did not have proof of purchase or customers who did not post serious reviews about the product. Of the following, which is true about the review data collected in this situation?

  • High Veracity
  • High Volume
  • Low Veracity
  • High Valence
  • Low Valence
  • Low Volume

Q2. As mentioned in the slides, what are the challenges to data with a high valence?

  • Reliability of Data
  • Difficult to Integrate
  • Complex Data Exploration Algorithms

Q3. Which of the following are the 6 V’s in big data?

  • Variety
  • Volume
  • Valence
  • Value
  • Veracity
  • Velocity
  • Vision

Q4. What is the veracity of big data?

  • The size of the data.
  • The connectedness of data.
  • The speed at which data is produced.
  • The abnormality or uncertainties of data.

Q5. What are the challenges of data with high variety?

  • Hard to integrate.
  • The quality of data is low.
  • Hard in utilizing group event detection.
  • Hard to perform emergent behavior analysis.

Q6. Which of the following is the best way to describe why it is crucial to process data in real-time?

  • More accurate.
  • Prevents missed opportunities.
  • More expensive to batch process.
  • Batch processing is an older method that is not as accurate as real-time processing.

Q7. What are the challenges with big data that has high volume?

  • Effectiveness and Cost
  • Storage and Accessibility
  • Speed Increase in Processing
  • Cost, Scalability, and Performance

Quiz 3: Data Science 101

Q1. Which of the following are parts of the 5 P’s of data science and what is the additional P introduced in the slides?

  • People
  • Purpose
  • Product
  • Perception
  • Process
  • Platforms
  • Programmability

Q2. Which of the following are part of the four main categories to acquire, access, and retrieve data?

  • Text Files
  • Web Services
  • Remote Data
  • NoSQL Storage
  • Traditional Databases

Q3. What are the steps required for data analysis?

  • Investigate, Build Model, Evaluate
  • Classification, Regression, Analysis
  • Regression, Evaluate, Classification
  • Select Technique, Build Model, Evaluate
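
The three steps in the last option above (select technique, build model, evaluate) can be walked through end to end. A toy sketch, assuming simple linear regression as the selected technique and made-up ad-spend/sales numbers:

```python
from statistics import mean

# Toy data: ad spend vs. sales (invented), split into train and test sets.
train = [(1, 2.1), (2, 4.0), (3, 6.2), (4, 7.9)]
test = [(5, 10.1), (6, 11.8)]

# 1. Select technique: simple linear regression via least squares.
xs = [x for x, _ in train]
x_bar = mean(xs)
y_bar = mean(y for _, y in train)

# 2. Build model: fit slope and intercept on the training data.
slope = sum((x - x_bar) * (y - y_bar) for x, y in train) / \
        sum((x - x_bar) ** 2 for x in xs)
intercept = y_bar - slope * x_bar
predict = lambda x: slope * x + intercept

# 3. Evaluate: mean squared error on the held-out test data.
mse = mean((predict(x) - y) ** 2 for x, y in test)
print(round(slope, 2), round(mse, 3))  # → 1.96 0.017
```

Evaluating on data the model never saw is what makes step 3 meaningful; a low training error alone says nothing about generalization.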

Q4. Of the following, which is a technique mentioned in the videos for building a model?

  • Validation
  • Evaluation
  • Analysis
  • Investigation

Q5. What is the first step in finding the right problem to tackle in data science?

  • Define the Problem
  • Define Goals
  • Assess the Situation
  • Ask the Right Questions

Q6. What is the first step in determining a big data strategy?

  • Business Objectives
  • Collect Data
  • Build In-House Expertise
  • Organizational Buy-In

Q7. According to Ilkay, why is exploring data crucial to better modeling?

  • Data exploration…
  • enables a description of data which allows visualization.
  • enables understanding of general trends, correlations, and outliers.
  • leads to data understanding which allows an informed analysis of the data.
  • enables histograms and other graphs as data visualization.

Q8. Why is data science mainly about teamwork?

  • Analytic solutions are required.
  • Engineering solutions are preferred.
  • Exhibition of curiosity is required.
  • Data science requires a variety of expertise in different fields.

Q9. What are the ways to address data quality issues?

  • Remove outliers.
  • Data Wrangling
  • Merge duplicate records.
  • Remove data with missing values.
  • Generate best estimates for invalid values.
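
The cleanup steps listed above (merging duplicates, removing outliers, and generating best estimates for missing values) can be sketched on a toy record set. The temperature readings and the -50..60 °C plausibility range are assumptions for illustration:

```python
from statistics import mean

readings = [
    {"id": 1, "temp": 21.5},
    {"id": 2, "temp": 22.0},
    {"id": 2, "temp": 22.0},   # duplicate record
    {"id": 3, "temp": None},   # missing value
    {"id": 4, "temp": 500.0},  # sensor glitch (outlier)
    {"id": 5, "temp": 20.5},
]

# Merge duplicate records: keep the first occurrence of each id.
seen, deduped = set(), []
for r in readings:
    if r["id"] not in seen:
        seen.add(r["id"])
        deduped.append(r)

# Remove outliers: drop temps outside an assumed plausible range (-50..60).
cleaned = [r for r in deduped if r["temp"] is None or -50 <= r["temp"] <= 60]

# Generate best estimates for missing values: fill with the mean of the rest.
valid = [r["temp"] for r in cleaned if r["temp"] is not None]
fill = round(mean(valid), 2)
cleaned = [dict(r, temp=r["temp"] if r["temp"] is not None else fill)
           for r in cleaned]
print(cleaned)
```

The ordering matters: estimating missing values *after* dropping the outlier keeps the 500.0 glitch from polluting the fill-in mean.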

Q10. What is done to the data in the preparation stage?

  • Build Models
  • Retrieve Data
  • Select Analytical Techniques
  • Identify Data Sets and Query Data
  • Understanding Nature of Data and Preliminary Analysis

Quiz 4: Foundations for Big Data

Q1. Which of the following is the best description of why it is important to learn about the foundations for big data?

  • Foundations stand the test of time.
  • Foundations are all that is required to show a mastery of big data concepts.
  • Foundations allow for the understanding of practical concepts in Hadoop.
  • Foundations help you revisit calculus concepts required in the understanding of big data.

Q2. What is the benefit of a commodity cluster?

  • Enables fault tolerance
  • Prevents network connection failure
  • Prevents individual component failures
  • Much faster than a traditional supercomputer

Q3. What is a way to enable fault tolerance?

  • Distributed Computing
  • System Wide Restart
  • Better LAN Connection
  • Data Parallel Job Restart

Q4. What are the specific benefit(s) to a distributed file system?

  • Large Storage
  • High Concurrency
  • Data Scalability
  • High Fault Tolerance

Q5. Which of the following are general requirements for a programming language in order to support big data models?

  • Handle Fault Tolerance
  • Utilize Map Reduction Methods
  • Support Big Data Operations
  • Enable Adding of More Racks
  • Optimization of Specific Data Types

Quiz 5: Intro to MapReduce

Q1. What does IaaS provide?

  • Hardware Only
  • Software On-Demand
  • Computing Environment

Q2. What does PaaS provide?

  • Hardware Only
  • Computing Environment
  • Software On-Demand

Q3. What does SaaS provide?

  • Hardware Only
  • Computing Environment
  • Software On-Demand

Q4. What are the two key components of HDFS and what are they used for?

  • NameNode for block storage and DataNode for metadata.
  • NameNode for metadata and DataNode for block storage.
  • FASTA for genome sequence and Rasters for geospatial data.

Q5. What is the job of the NameNode?

  • For gene sequencing calculations.
  • Coordinates operations and assigns tasks to DataNodes
  • Listens to DataNodes for block creation, deletion, and replication.
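
The NameNode/DataNode split from the two questions above can be mimicked with plain dictionaries. A toy single-process sketch (the file path, node names, and block IDs are invented; real HDFS does this over a network):

```python
# NameNode: metadata only -- which blocks make up each file,
# and which DataNodes hold replicas of each block.
namenode = {
    "file_to_blocks": {"/logs/day1.txt": ["blk_1", "blk_2"]},
    "block_locations": {"blk_1": ["datanode-a", "datanode-b"],
                        "blk_2": ["datanode-b", "datanode-c"]},
}

# DataNodes: the actual block storage (replication factor 2 here).
datanodes = {
    "datanode-a": {"blk_1": b"hello "},
    "datanode-b": {"blk_1": b"hello ", "blk_2": b"world"},
    "datanode-c": {"blk_2": b"world"},
}

def read_file(path):
    """A client read: ask the NameNode for metadata, then fetch each
    block directly from one of the DataNodes that stores it."""
    data = b""
    for block in namenode["file_to_blocks"][path]:
        node = namenode["block_locations"][block][0]  # pick first replica
        data += datanodes[node][block]
    return data

print(read_file("/logs/day1.txt"))  # → b'hello world'
```

Note that file bytes never pass through the NameNode: it answers the metadata lookup, and the client reads blocks from DataNodes directly. Losing `datanode-a` would still leave `blk_1` readable from `datanode-b`, which is the point of replication.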

Q6. What is the order of the three steps to Map Reduce?

  • Map -> Shuffle and Sort -> Reduce
  • Shuffle and Sort -> Map -> Reduce
  • Map -> Reduce -> Shuffle and Sort
  • Shuffle and Sort -> Reduce -> Map
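
The Map -> Shuffle and Sort -> Reduce sequence from the first option can be sketched in plain Python. This is a single-machine toy word count, not Hadoop itself, but each function mirrors one phase:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield (word, 1)

def shuffle_and_sort(pairs):
    """Shuffle and sort: bring all pairs with the same key together."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                              key=itemgetter(0)):
        yield (key, [v for _, v in group])

def reduce_phase(grouped):
    """Reduce: combine each key's values -- here, sum the counts."""
    for word, counts in grouped:
        yield (word, sum(counts))

lines = ["big data big models", "big precision"]
result = dict(reduce_phase(shuffle_and_sort(map_phase(lines))))
print(result)  # → {'big': 3, 'data': 1, 'models': 1, 'precision': 1}
```

In a real cluster the map and reduce calls run in parallel on many nodes, and the shuffle-and-sort step is the framework-managed network exchange between them; the per-phase logic stays this simple.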

Q7. What is a benefit of using pre-built Hadoop images?

  • Guaranteed hardware support.
  • Fewer software choices to choose from.
  • Quick prototyping, deploying, and validating of projects.
  • Quick prototyping, deploying, and guaranteed bug free.

Q8. What is an example of open-source tools built for Hadoop and what does it do?

  • Giraph, for SQL-like queries.
  • Zookeeper, analyze social graphs.
  • Pig, for real-time and in-memory processing of big data.
  • Zookeeper, a management system for animal-named components.

Q9. What is the difference between low-level interfaces and high-level interfaces?

  • Low level deals with storage and scheduling while high level deals with interactivity.
  • Low level deals with interactivity while high level deals with storage and scheduling.

Q10. Which of the following are problems to look out for when integrating your project with Hadoop?

  • Random Data Access
  • Data Level Parallelism
  • Task Level Parallelism
  • Advanced Algorithms
  • Infrastructure Replacement

Q11. As covered in the slides, which of the following are the major goals of Hadoop?

  • Enable Scalability
  • Handle Fault Tolerance
  • Provide Value for Data
  • Latency Sensitive Tasks
  • Facilitate a Shared Environment
  • Optimized for a Variety of Data Types

Q12. What is the purpose of YARN?

  • Implementation of Map Reduce.
  • Enables large scale data across clusters.
  • Allows various applications to run on the same Hadoop cluster.

Q13. What are the two main components for a data computation framework that were described in the slides?

  • Node Manager and Container
  • Resource Manager and Container
  • Applications Master and Container
  • Node Manager and Applications Master
  • Resource Manager and Node Manager
