All Weeks Hadoop Platform and Application Framework Quiz Answers
Hadoop Platform and Application Framework Week 01 Quiz Answers
Basic Hadoop Stack Quiz Answers
Q1. What does SQOOP stand for?
- System Quality Object Oriented Process
- SQL to Hadoop
- Does not stand for anything specific
- ‘Sqooping’ the data.
Q2. What is not part of the basic Hadoop Stack ‘Zoo’?
Q3. What is considered to be part of the Apache Basic Hadoop Modules?
Q4. What are the two major components of the MapReduce layer?
Q5. What does HDFS stand for?
- Hadoop Data File System
- Hadoop Distributed File System
- Hadoop Data File Scalability
- Hadoop Datanode File Security
Q6. What are the two major types of nodes in HDFS?
Q7. What is Yarn used as an alternative to in Hadoop 2.0 and higher versions of Hadoop?
Q8. Could you run an existing MapReduce application using Yarn?
Q9. What are the two basic layers comprising the Hadoop Architecture?
- ZooKeeper and MapReduce
- HDFS and Hive
- MapReduce and HDFS
- Impala and HDFS
Q10. What are Hadoop advantages over a traditional platform?
Hadoop Platform and Application Framework Week 02 Quiz Answers
Overview of Hadoop Stack Quiz Answers
Q1. Choose features introduced in Hadoop2 HDFS
- Multiple DataNodes
- Heterogeneous storage including SSD, RAM_DISK
- Multiple namespaces
- HDFS Federation
Q2. In Hadoop2 HDFS a namespace can generate block IDs for new blocks without coordinating with other namespaces.
Q3. This is a new feature in YARN:
- High Availability ResourceManager
- web services REST APIs
Q4. Apache Tez can run independent of YARN
Q5. In Hadoop2 with YARN
- ResourceManagers are running on every compute node
- Each application has its own ApplicationMaster
- Only MapReduce jobs can be run
- Each application has its own ResourceManager
Hadoop Execution Environment Quiz Answers
Q1. Apache Spark cannot operate without YARN?
Q2. Apache Tez can support dynamic DAG changes?
Q3. Give an example of an execution framework that supports cyclic data flow?
Q4. The Fairshare scheduler can support queues/sub-queues?
Q5. The Capacity Scheduler can use ACLs to control security?
Q6. Mark choices that apply for Apache Spark:
- Can run integrated with YARN
- Supports in memory computing
- Can be accessed/used from high level languages like Java, Scala, Python, and R.
Q7. Which of the following choices apply for Apache Tez?
- Supports complex directed acyclic graph (DAG) of tasks
- Supports in memory caching of data
- Improves resource usage efficiency
Hadoop Applications Quiz Answers
Q1. Check all databases/stores applications that can run within Hadoop
Q2. Name the high level language that is a main part of Apache Pig?
- Pig Latin
Q3. Apache Pig can only be run using scripts
Q4. Check options that are methods of using/accessing Hive.
Q5. Check features that apply for HBase.
- Non-relational distributed database
Q6. List methods of accessing HBase
- Apache HBase shell
- HBase External API
- HBase API
Hadoop Platform and Application Framework Week 03 Quiz Answers
HDFS Architecture Quiz Answers
Q1. HDFS is strictly POSIX compliant.
Q2. The following issues may be caused by a lot of small files in HDFS
- NameNode memory usage increases significantly
- Network load decreases
- The number of map tasks needed to process the same amount of data will be larger.
Q3. 10 GB / 128 MB ≈ 80 blocks
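The arithmetic behind this answer (number of HDFS blocks for a 10 GB file at the default 128 MB block size) can be checked in a couple of lines of Python:

```python
import math

# 10 GB file, default 128 MB HDFS block size (figures from the quiz)
file_size_mb = 10 * 1024   # 10 GB expressed in MB
block_size_mb = 128

# HDFS rounds up: a partial final block still occupies a block slot
num_blocks = math.ceil(file_size_mb / block_size_mb)
print(num_blocks)  # 80
```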
Q5. What is the first step in a write process from a HDFS client?
- Immediately contact the NameNode
Q6. HDFS NameNode is not rack aware when it places the replica blocks.
HDFS Performance, Tuning, and Robustness Quiz Answers
Q1. Name the configuration file which holds HDFS tuning parameters
Q2. Name the parameter that controls the replication factor in HDFS:
Q3. Check answers that apply when replication is lowered
- HDFS is less robust
- data is less likely to be local to workers
- more free space is available
Q4. Check answers that apply when NameNode fails to receive heartbeat from a DataNode
- DataNode is marked dead
- No new I/O is sent to particular DataNode that missed heartbeat check
- Blocks below replication factor are re-replicated on other DataNodes
Q5. How is data corruption mitigated in HDFS?
- Checksums are computed on file creation and stored in the HDFS namespace for verification when data is retrieved.
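The mechanism in this answer can be sketched in plain Python. This is only an illustration: `zlib.crc32` stands in for the CRC checksums HDFS actually uses, and the dict is a toy stand-in for a stored block plus its metadata.

```python
import zlib

def write_with_checksum(data: bytes):
    # Compute a checksum at write time and store it alongside the data.
    return {"data": data, "checksum": zlib.crc32(data)}

def read_with_verification(record):
    # Re-compute the checksum on read and compare with the stored one.
    if zlib.crc32(record["data"]) != record["checksum"]:
        raise IOError("corrupt block: HDFS would read another replica instead")
    return record["data"]

record = write_with_checksum(b"hello hdfs")
assert read_with_verification(record) == b"hello hdfs"

record["data"] = b"hellp hdfs"  # simulate silent on-disk corruption
try:
    read_with_verification(record)
except IOError:
    print("corruption detected")
```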
Accessing HDFS Quiz Answers
Q1. Which of the following are valid access mechanisms for HDFS
- Can be accessed via hdfs binary/script
- Accessed via Java API
- Accessed via HTTP
- Mounted as a filesystem using NFS Gateway
Q2. Which of the following is not a valid command to handle data in HDFS?
- hdfs dfs -mkdir /user/test
- hdfs dfs -ls /
- cp -r /user/data /user/test/
- hdfs fsck /user/test/test.out
Q3. Which of the following commands will give information on the status of DataNodes
- hdfs dfs -status datanodes
- hdfs -status
- hdfs datanode -status
- hdfs dfsadmin -report
Q4. Which of the following is not a method in FSDataInputStream
Q5. You can only read data in HDFS via HTTP
Q6. What are some webhdfs REST API related parameters in HDFS
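WebHDFS requests follow the pattern `http://<host>:<port>/webhdfs/v1/<path>?op=...` with extra query parameters such as `user.name`. A small sketch that only builds such a URL (the host name `namenode` and port `50070` are placeholders, and no request is actually sent):

```python
from urllib.parse import urlencode

def webhdfs_url(host, port, path, op, params=None):
    """Build a WebHDFS REST URL; `op` names the operation (OPEN, LISTSTATUS, ...)."""
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

print(webhdfs_url("namenode", 50070, "/user/test/test.out", "OPEN"))
print(webhdfs_url("namenode", 50070, "/tmp", "LISTSTATUS",
                  {"user.name": "hadoop"}))
```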
Hadoop Platform and Application Framework Week 04 Quiz Answers
Lesson 1 Review Quiz Answers
Q1. Which of these kinds of data motivated the Map/Reduce framework?
- Large number of internet documents that need to be indexed for searching by words
Q2. What is the organizing data structure for map/reduce programs?
- A list of identification keys and some value associated with that identifier
Q3. In map/reduce framework, which of these logistics does Map/Reduce do with the map function?
- Distribute map to cluster nodes, run map on the data partitions at the same time
Q4. Map/Reduce performs a ‘shuffle’ and grouping. That means it…
- Shuffles pairs into different partitions according to the key value, and sorts within the partitions by key.
Q5. In the word count example, what is the key?
- The word itself.
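The word-count pattern behind Q2 through Q5 can be sketched in plain Python (a local stand-in for the framework: the word is the key, the map phase emits `(word, 1)` pairs, a sort plays the role of the shuffle/grouping, and the reduce phase sums per key):

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word: the word itself is the key.
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return (word, sum(counts))

lines = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase, applied to every input line
pairs = [pair for line in lines for pair in mapper(line)]
# Shuffle/grouping: sort pairs by key so equal keys are adjacent
pairs.sort(key=itemgetter(0))
# Reduce phase: one call per distinct key
counts = [reducer(word, [c for _, c in group])
          for word, group in groupby(pairs, key=itemgetter(0))]
print(counts)  # [('brown', 1), ('dog', 1), ('fox', 2), ('lazy', 1), ('quick', 1), ('the', 3)]
```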
Q6. Streaming map/reduce allows mappers and reducers to be written in what languages:
- All of the above
Q7. The assignment asked you to run with 2 reducers. When you use 2 reducers instead of 1 reducer, what is the difference in global sort order?
- With 1 reducer, but not 2 reducers, the word counts are in global sort order by word.
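The reason behind this answer can be simulated locally: with 2 reducers, keys are hash-partitioned, each reducer sorts only its own partition, so the concatenated output is not in global order. A sketch (using `zlib.crc32` as a deterministic stand-in for the partitioner's hash):

```python
import zlib

words = ["banana", "apple", "cherry", "date", "fig"]

# Hash-partition words across 2 reducers
partitions = {0: [], 1: []}
for w in words:
    partitions[zlib.crc32(w.encode()) % 2].append(w)

# Each reducer sorts its own partition...
for p in partitions.values():
    p.sort()

# ...so each partition is in order, but concatenating partition 0 with
# partition 1 does not, in general, give a globally sorted word list.
print(partitions[0], partitions[1])
```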
Hadoop Platform and Application Framework Week 05 Quiz Answers
Spark Lesson 1 Quiz Answers
Q1. Apache Spark was developed in order to provide solutions to shortcomings of another project, and eventually replace it. What is the name of this project?
Q2. Why is Hadoop MapReduce slow for iterative algorithms?
- It needs to read off disk for every iteration
Q3. What is the most important feature of Apache Spark to speedup iterative algorithms?
- Caching datasets in memory
Q4. Which other Hadoop project can Spark rely on to provision and manage the cluster of nodes?
Q5. When Spark reads data out of HDFS, what is the process that interfaces directly with HDFS?
Q6. Under which circumstances is it preferable to run Spark in Standalone mode instead of relying on YARN?
- When you only plan on running Spark jobs
Spark Lesson 2 Quiz Answers
Q1. How can you create an RDD? Mark all that apply
- Reading from a local file available both on the driver and on the workers
- Reading from HDFS
- Apply a transformation to an existing RDD
Q2. How does Spark make RDDs resilient in case a partition is lost?
- Tracks the history of each partition and reruns what is needed to restore it
Q3. Which of the following sentences about flatMap and map are true?
- flatMap accepts a function that returns multiple elements, those elements are then flattened out into a continuous RDD.
- map transforms elements with a 1 to 1 relationship, 1 input – 1 output
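The distinction in these two answers can be shown with a plain-Python analogue (in PySpark the equivalents would be `rdd.map(lambda l: l.split())` versus `rdd.flatMap(lambda l: l.split())`):

```python
lines = ["hello world", "hadoop spark"]

# map: one output element per input element (1 input -> 1 output),
# so splitting each line yields a sequence of lists
mapped = [line.split() for line in lines]
print(mapped)        # [['hello', 'world'], ['hadoop', 'spark']]

# flatMap: each input may yield many elements, and the results are
# flattened into one continuous sequence of words
flat_mapped = [word for line in lines for word in line.split()]
print(flat_mapped)   # ['hello', 'world', 'hadoop', 'spark']
```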
Q4. Check all wide transformations
- Repartition, even if it triggers a shuffle, can improve performance of your pipeline by balancing the data distribution after a heavy filtering operation
Spark Lesson 3 Quiz Answers
Q1. Check all true statements about the Directed Acyclic Graph Scheduler
- The DAG is managed by the cluster manager
- A DAG is used to track dependencies of each partition of each RDD
Q2. Why is building a DAG necessary in Spark but not in MapReduce?
- Because MapReduce always has the same type of workflow, while Spark needs to accommodate diverse workflows.
Q3. What are the differences between an action and a transformation? Mark all that apply
- A transformation is from worker nodes to worker nodes, an action between worker nodes and the Driver (or a data source like HDFS)
- A transformation is lazy, an action instead executes immediately.
Q4. Generally, which are good stages to mark a RDD for caching in memory?
- The first RDD, just after reading from disk, so we avoid reading from disk again.
- At the start of an iterative algorithm.
Q5. What are good cases for using a broadcast variable? Mark all that apply
- Copy a small/medium sized RDD for a join
- Copy a large lookup table to all worker nodes
- Copy a large configuration dictionary to all worker nodes
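The lookup-table case can be illustrated with a plain-Python map-side join. In Spark the table would be wrapped with `sc.broadcast(lookup)` so it ships to each worker once; here a local dict and list stand in for the broadcast variable and the large RDD:

```python
# Small lookup table that every worker needs (the broadcast candidate);
# the country codes and values below are made-up illustration data.
lookup = {"us": "United States", "fr": "France", "it": "Italy"}

# Large dataset of (code, value) records, standing in for an RDD
records = [("us", 100), ("fr", 42), ("it", 7)]

# Map-side join: each record is resolved against the broadcast table,
# avoiding a shuffle-heavy join between two large RDDs
joined = [(lookup[code], value) for code, value in records]
print(joined)  # [('United States', 100), ('France', 42), ('Italy', 7)]
```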
Q6. We would like to count the number of invalid entries in this example dataset:
invalid = sc.accumulator(0)
sc.parallelize(["3", "23", "S", "99", "TT"]).foreach(count_invalid)
What would be a good implementation of the count_invalid function?
def count_invalid(element):
    try:
        int(element)
    except ValueError:
        invalid.add(1)
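The logic can be verified without a Spark cluster by replacing `sc.accumulator` with a minimal local counter class (a toy stand-in, not Spark's API):

```python
# Plain-Python stand-in for sc.accumulator: a mutable counter object
class Accumulator:
    def __init__(self, value=0):
        self.value = value
    def add(self, n):
        self.value += n

invalid = Accumulator(0)

def count_invalid(element):
    # Entries that cannot be parsed as integers count as invalid
    try:
        int(element)
    except ValueError:
        invalid.add(1)

# Local loop standing in for sc.parallelize(...).foreach(count_invalid)
for element in ["3", "23", "S", "99", "TT"]:
    count_invalid(element)

print(invalid.value)  # 2  ("S" and "TT" are not integers)
```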