Spark

Apache Spark Q&A

1. What is Spark?
Answer: Distributed processing engine for batch, SQL, ML, and streaming workloads.
2. Why is Spark fast?
Answer: It keeps intermediate data in memory where possible and optimizes query plans before running them.
3. RDD vs. DataFrame?
Answer: An RDD is the low-level API; a DataFrame is a high-level structured API whose queries Spark can optimize.
4. What is Spark SQL?
Answer: Module for SQL queries over structured data.
5. What is lazy evaluation?
Answer: Transformations only build an execution plan; nothing runs until an action is triggered.
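The idea of lazy evaluation can be sketched in plain Python. This is a toy model, not Spark's API: the class `LazyDataset`, its methods, and the helper `double` are hypothetical names chosen for illustration. Transformations only record a step; the `collect` action is what finally runs the recorded plan.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source   # any iterable of records
        self._steps = steps     # recorded transformations, in order

    def map(self, fn):
        # Transformation: record the step and return a new plan; no data is touched.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only recorded.
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        # Action: only now is the recorded plan executed, row by row.
        out = []
        for row in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


plan = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: nothing has executed yet
result = plan.collect()           # the action triggers execution
```

Building `plan` never calls `double`; the function runs only once `collect()` is invoked.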
6. What is partitioning?
Answer: Dividing a dataset into partitions that executors process in parallel.
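A minimal plain-Python sketch of the partitioning idea, assuming a round-robin split and using threads as a stand-in for executors. The function names `split_into_partitions` and `partial_sum` are hypothetical, and none of this is Spark's API.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists."""
    partitions = [[] for _ in range(num_partitions)]
    for i, rec in enumerate(records):
        partitions[i % num_partitions].append(rec)
    return partitions


def partial_sum(partition):
    # Work done independently on one partition (the role a task plays).
    return sum(partition)


records = list(range(1, 101))
partitions = split_into_partitions(records, 4)

# Threads stand in for executors in this sketch.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum, partitions))

total = sum(partials)  # combine the per-partition results
```

Each partition is processed without reference to the others, which is what makes the parallelism possible.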
7. What causes a shuffle?
Answer: Operations such as groupBy and join, which must redistribute data across partitions so that matching keys end up together.
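Why grouping forces redistribution can be illustrated with a toy model in plain Python. The helpers `target_partition`, `shuffle_by_key`, and `sum_by_key` are hypothetical names, and the hash-modulo placement is an assumption made for this sketch rather than a description of Spark internals. Before the shuffle, key "a" is scattered over all three partitions; afterwards each key lives in exactly one.

```python
import zlib


def target_partition(key, num_partitions):
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_by_key(partitions, num_partitions):
    """Redistribute (key, value) pairs so each key lives in exactly one partition."""
    shuffled = [[] for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            shuffled[target_partition(key, num_partitions)].append((key, value))
    return shuffled


def sum_by_key(partition):
    # Per-partition aggregation; correct only because keys are co-located.
    totals = {}
    for key, value in partition:
        totals[key] = totals.get(key, 0) + value
    return totals


before = [
    [("a", 1), ("b", 2)],
    [("a", 3), ("c", 4)],
    [("b", 5), ("a", 6)],
]
after = shuffle_by_key(before, 2)
per_partition = [sum_by_key(p) for p in after]

merged = {}
for totals in per_partition:
    merged.update(totals)  # safe: no key appears in two partitions
```

In a real cluster the redistribution step moves data over the network, which is why shuffles tend to dominate job cost.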
8. How do you optimize a Spark job?
Answer: Tune the partition count, cache only data that is reused, minimize shuffles, and use broadcast joins when one side is small.
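The broadcast-join idea from the answer above can be sketched in plain Python: instead of shuffling both sides by key, a full copy of the small table is handed to every partition of the large one. The function `broadcast_join` and the sample data are hypothetical, and this is a toy model rather than Spark's API.

```python
def broadcast_join(large_partitions, small_table):
    """Join each partition of the large side against a full copy of the small side."""
    lookup = dict(small_table)  # the "broadcast" copy, small enough to replicate
    joined = []
    for part in large_partitions:  # large-side rows never leave their partition
        joined.append([
            (key, value, lookup[key])
            for key, value in part
            if key in lookup       # inner-join semantics: unmatched rows are dropped
        ])
    return joined


orders = [
    [("us", 10), ("de", 20)],
    [("fr", 30), ("us", 40)],
]
countries = [("us", "United States"), ("de", "Germany")]

result = broadcast_join(orders, countries)
```

The trade-off is memory: the approach only pays off while the small side fits comfortably on every worker.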
9. What is Spark Structured Streaming?
Answer: A high-level streaming API built on the DataFrame abstraction.
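One way to picture a streaming query over a table-like abstraction is as an aggregation that is updated incrementally as each small batch of rows arrives. The sketch below is a plain-Python toy under that assumption; the class `RunningCount` and its `process_batch` method are hypothetical names and not part of Spark.

```python
from collections import Counter


class RunningCount:
    """Toy incremental count-by-value over a stream of small batches."""

    def __init__(self):
        self.state = Counter()  # aggregation state kept between batches

    def process_batch(self, rows):
        # Fold only the new rows into the existing state, then emit a snapshot.
        self.state.update(rows)
        return dict(self.state)


query = RunningCount()
first = query.process_batch(["a", "b", "a"])
second = query.process_batch(["b", "c"])
```

The second result reflects both batches even though only the new rows were processed in the second call.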
10. One-line summary?
Answer: Spark is a scalable unified engine for large-scale data processing.