Big Data Analytics MCQs
Practice Big Data Analytics questions with answers and explanations.
A. A worker process that stores every HDFS block
B. The process that runs the application logic and coordinates distributed execution
C. The YARN ResourceManager
D. A database connector only
Correct Answer: B. The process that runs the application logic and coordinates distributed execution
Explanation:
The driver builds execution plans, schedules tasks on executors, and tracks application metadata.
It stays in communication with the executors for the lifetime of the application.
A. Each child reads exactly one parent partition
B. A child partition depends on data from many parent partitions
C. The RDD has many columns
D. The executor uses large memory
Correct Answer: B. A child partition depends on data from many parent partitions
Explanation:
groupByKey and repartition create wide dependencies.
They require data exchange across the cluster.
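The difference between a narrow and a wide dependency can be sketched without Spark at all. The following is a toy model in plain Python, not Spark code: partitions are ordinary lists, the function names `narrow_map` and `wide_group_by_key` are hypothetical, and keys are assumed to be integers routed by `key % num_children`.

```python
from collections import defaultdict


def narrow_map(partitions, fn):
    """Apply fn to each record; child partition i reads only parent partition i."""
    children = [[fn(rec) for rec in part] for part in partitions]
    deps = {i: {i} for i in range(len(partitions))}
    return children, deps


def wide_group_by_key(partitions, num_children):
    """Group (key, value) records by key into num_children partitions.

    Assumption: integer keys, routed by key % num_children.
    """
    children = [defaultdict(list) for _ in range(num_children)]
    deps = defaultdict(set)
    for parent_idx, part in enumerate(partitions):
        for key, value in part:
            child_idx = key % num_children
            children[child_idx][key].append(value)
            # Record which parent partitions this child had to read from.
            deps[child_idx].add(parent_idx)
    return [dict(c) for c in children], dict(deps)


parents = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")], [(2, "e"), (3, "f")]]
_, narrow_deps = narrow_map(parents, lambda kv: (kv[0], kv[1].upper()))
grouped, wide_deps = wide_group_by_key(parents, num_children=2)
```

In this sketch every child of the map reads exactly one parent, while each grouped child reads from several parents, which is why the grouping step needs a data exchange.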
A. It sends more data over the network
B. It performs map-side aggregation before the shuffle
C. It requires all values on the driver
D. It works only with one partition
Correct Answer: B. It performs map-side aggregation before the shuffle
Explanation:
reduceByKey combines values locally within each mapper partition.
This substantially reduces shuffle volume for aggregations.
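To make the effect of map-side aggregation concrete, here is a small plain-Python simulation (not Spark code). It counts how many records would cross the shuffle with and without a local combine step; the function names and the sample data are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def shuffle_without_combine(partitions):
    """Every (key, value) record crosses the shuffle, then gets summed."""
    shuffled = [rec for part in partitions for rec in part]
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


def shuffle_with_combine(partitions):
    """Sum per key inside each partition first, then shuffle the partial sums."""
    shuffled = []
    for part in partitions:
        local = defaultdict(int)
        for key, value in part:
            local[key] += value
        shuffled.extend(local.items())  # one record per key per partition
    totals = defaultdict(int)
    for key, value in shuffled:
        totals[key] += value
    return dict(totals), len(shuffled)


word_counts = [
    [("spark", 1), ("spark", 1), ("hdfs", 1), ("spark", 1)],
    [("hdfs", 1), ("hdfs", 1), ("spark", 1)],
]
plain_totals, plain_records = shuffle_without_combine(word_counts)
combined_totals, combined_records = shuffle_with_combine(word_counts)
```

Both paths produce the same totals, but the combined path shuffles one record per key per partition instead of one record per input row.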
A. Only reduces partitions without shuffle
B. Redistributes data to a specified number of partitions using a full shuffle
C. Changes executor count
D. Creates an HDFS replica
Correct Answer: B. Redistributes data to a specified number of partitions using a full shuffle
Explanation:
Repartition can increase or decrease partition count and rebalance data.
Its shuffle makes it relatively expensive.
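The rebalancing behaviour can be illustrated with a toy function in plain Python. This is a sketch, not Spark's implementation: the name `repartition` is reused here only for readability, and round-robin assignment is an assumption standing in for whatever distribution Spark actually applies.

```python
def repartition(partitions, num_partitions):
    """Redistribute every record round-robin into num_partitions new partitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    new_parts = [[] for _ in range(num_partitions)]
    position = 0
    for part in partitions:
        for rec in part:
            # Every record may land in a different partition: a full shuffle.
            new_parts[position % num_partitions].append(rec)
            position += 1
    return new_parts


skewed = [list(range(10)), [10], [11]]   # one large partition, two tiny ones
more = repartition(skewed, 4)            # increase the partition count
fewer = repartition(skewed, 2)           # decrease the partition count
```

In both directions the result is evenly sized, at the price of touching every record.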
A. Reducing executor count
B. Removing RDD lineage
C. Skipping data partitions whose partition values cannot satisfy a query
D. Deleting table partitions
Correct Answer: C. Skipping data partitions whose partition values cannot satisfy a query
Explanation:
Filters on partition columns allow Spark to avoid opening unrelated directories.
Good partition design can greatly reduce scanned data.
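A rough plain-Python sketch of the idea follows. It assumes a Hive-style directory layout (`column=value` path segments); the table contents, the helper names `parse_partition` and `prune`, and the filter are all hypothetical.

```python
def parse_partition(path):
    """Turn 'year=2024/month=01' into {'year': '2024', 'month': '01'}."""
    return dict(segment.split("=", 1) for segment in path.split("/"))


def prune(partition_paths, predicate):
    """Keep only the directories whose partition values satisfy the filter."""
    return [p for p in partition_paths if predicate(parse_partition(p))]


# Hypothetical table: partition directory -> rows stored in that directory.
table = {
    "year=2023/month=12": [("order-1", 40.0)],
    "year=2024/month=01": [("order-2", 15.5), ("order-3", 8.0)],
    "year=2024/month=02": [("order-4", 99.0)],
}

# The filter is decided from the directory names alone, before any rows are read.
selected = prune(table, lambda values: values["year"] == "2024")
rows_scanned = [row for path in selected for row in table[path]]
```

Only two of the three directories are opened here, because the decision uses the path and never the row data.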
A. HDFS block placement
B. Streaming watermark calculation
C. Collaborative-filtering recommendations
D. SQL schema inference
Correct Answer: C. Collaborative-filtering recommendations
Explanation:
ALS factorizes a sparse user-item interaction matrix into latent factors.
It scales through alternating distributed optimization steps.
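The alternating structure can be sketched on a single machine with one latent factor per user and item. This is a simplified illustration in plain Python, not the Spark MLlib implementation: the function `als_rank1`, the rank of 1, the regularization value, and the ratings are all assumptions made for the example.

```python
def als_rank1(ratings, num_iters=10, reg=0.1):
    """Rank-1 alternating least squares over observed (user, item) ratings."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}

    def loss():
        err = sum((r - user_f[u] * item_f[i]) ** 2 for (u, i), r in ratings.items())
        penalty = sum(f * f for f in user_f.values()) + sum(f * f for f in item_f.values())
        return err + reg * penalty

    history = [loss()]
    for _ in range(num_iters):
        # Step 1: hold item factors fixed, solve each user's ridge problem exactly.
        for u in users:
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Step 2: hold user factors fixed, solve each item's ridge problem exactly.
        for i in items:
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
        history.append(loss())
    return user_f, item_f, history


# Sparse interactions: each user has rated only two of the three films.
ratings = {
    ("ann", "film_a"): 5.0, ("ann", "film_b"): 4.0,
    ("bob", "film_a"): 4.0, ("bob", "film_c"): 2.0,
    ("cat", "film_b"): 5.0, ("cat", "film_c"): 3.0,
}
user_f, item_f, history = als_rank1(ratings)
predicted = user_f["ann"] * item_f["film_c"]  # score for an unseen pair
```

Within each step the per-user (or per-item) updates are independent of one another, which is the property that lets the work be split across a cluster.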
A. The process that submits every cluster application
B. The HDFS namespace manager
C. A worker process that runs tasks and stores cached data
D. A Scala compiler plugin
Correct Answer: C. A worker process that runs tasks and stores cached data
Explanation:
Executors perform computation for one Spark application.
They also provide memory and disk storage for persisted partitions.
A. Reading a local collection in random order
B. Replicating all RDDs to the driver
C. Redistributing data across partitions and executors
D. Restarting the SparkSession
Correct Answer: C. Redistributing data across partitions and executors
Explanation:
Shuffles occur for operations requiring new key grouping or partitioning.
They involve network, disk, serialization, and sorting costs.
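Where those costs come from can be seen in a toy shuffle written in plain Python. This is only a sketch under stated assumptions: `pickle` stands in for Spark's serializers, in-memory byte strings stand in for shuffle files and network transfer, and the helper names and the `crc32`-based partitioner are hypothetical.

```python
import pickle
import zlib


def bucket_for(key, num_reducers):
    """Deterministic hash partitioner (crc32 avoids Python's randomized str hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_write(map_partition, num_reducers):
    """One map task: split records into per-reducer buckets and serialize each."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in map_partition:
        buckets[bucket_for(key, num_reducers)].append((key, value))
    return [pickle.dumps(bucket) for bucket in buckets]


def shuffle_read(all_blocks, reducer_idx):
    """One reduce task: fetch its block from every map task, deserialize, sort."""
    records = []
    for blocks in all_blocks:
        records.extend(pickle.loads(blocks[reducer_idx]))
    return sorted(records)  # ordered by key, then value


map_partitions = [
    [("spark", 1), ("hdfs", 2), ("yarn", 3)],
    [("hdfs", 4), ("spark", 5)],
]
all_blocks = [shuffle_write(part, 2) for part in map_partitions]
bytes_moved = sum(len(block) for blocks in all_blocks for block in blocks)
reduced = [shuffle_read(all_blocks, r) for r in range(2)]
```

Each reduce task has to contact every map task, and every record is serialized, moved, deserialized, and sorted along the way.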
A. Each key paired with one reduced value
B. A globally sorted list of keys only
C. Each key paired with all of its values
D. A driver-side dictionary automatically
Correct Answer: C. Each key paired with all of its values
Explanation:
groupByKey gathers values across partitions.
It can use substantial network and memory when groups are large.
A. Always increases partitions evenly
B. Collects partitions to the driver
C. Reduces the number of partitions with less data movement than a full repartition
D. Sorts all data globally
Correct Answer: C. Reduces the number of partitions with less data movement than a full repartition
Explanation:
Coalesce merges existing partitions into fewer ones and, by default, avoids a full shuffle.
On RDDs, the optional shuffle flag can be enabled when balanced partitions matter more than avoiding data movement.
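The trade-off can be sketched with a toy function in plain Python. It is not Spark's algorithm: the name `coalesce` is reused only for readability, and merging neighbouring partitions by index is an assumption made for the illustration.

```python
def coalesce(partitions, num_partitions):
    """Merge whole neighbouring partitions into fewer groups; never splits one."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions >= len(partitions):
        # Without a shuffle the partition count cannot grow; keep it as it is.
        return [list(p) for p in partitions]
    merged = [[] for _ in range(num_partitions)]
    for idx, part in enumerate(partitions):
        target = idx * num_partitions // len(partitions)
        merged[target].extend(part)  # whole partition moves as one unit
    return merged


parts = [[1, 2, 3, 4, 5, 6], [7], [8], [9]]
two = coalesce(parts, 2)
unchanged = coalesce(parts, 8)
```

Because partitions are only glued together, any existing skew survives the merge, which is the situation where a shuffle-based rebalance becomes worth its cost.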
A. The values of every row
B. The contents of executor memory
C. The HDFS replication history
D. Its logical and physical execution plans
Correct Answer: D. Its logical and physical execution plans
Explanation:
Execution plans show scans, joins, exchanges, and other operators.
They are essential for diagnosing performance.
A. To preserve executor processes permanently
B. To avoid storing input schemas
C. To change training labels later
D. To apply the identical learned transformations and model in deployment
Correct Answer: D. To apply the identical learned transformations and model in deployment
Explanation:
Persisting the fitted pipeline records category mappings, scaling parameters, and model weights.
Loading it helps maintain consistent prediction behavior.
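A minimal plain-Python sketch shows why the learned state has to be saved rather than recomputed. It is not the Spark ML API: the helper names, the JSON file format, and the two features (`city`, `age`) are hypothetical, and the model stage is left out so that only the fitted transformations are shown.

```python
import json
import os
import tempfile


def fit_pipeline(rows):
    """Learn a category index and min/max scaling parameters from training rows."""
    categories = sorted({r["city"] for r in rows})
    ages = [r["age"] for r in rows]
    return {
        "city_index": {c: i for i, c in enumerate(categories)},
        "age_min": min(ages),
        "age_max": max(ages),
    }


def transform(fitted, row):
    """Apply the learned mapping and scaling; nothing is re-learned here."""
    span = (fitted["age_max"] - fitted["age_min"]) or 1
    return [fitted["city_index"][row["city"]], (row["age"] - fitted["age_min"]) / span]


def save_pipeline(fitted, path):
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(fitted, fh)


def load_pipeline(path):
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


training = [
    {"city": "Oslo", "age": 30},
    {"city": "Lima", "age": 50},
    {"city": "Pune", "age": 40},
]
fitted = fit_pipeline(training)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "fitted_pipeline.json")
    save_pipeline(fitted, path)
    restored = load_pipeline(path)  # what a serving process would do

new_row = {"city": "Pune", "age": 35}
features = transform(restored, new_row)
```

Refitting on different data at serving time could assign a different index to a category or a different scale to a feature, so the restored parameters are what keep predictions consistent with training.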