
Apache Spark Basics

Learn how Spark processes large datasets in memory using RDDs and DataFrames, and see a few simple PySpark examples.

What is Apache Spark?

Apache Spark is a fast, general-purpose engine for large-scale data processing. It handles batch processing, streaming, machine learning, and SQL workloads within a single framework.

  • Runs on clusters (YARN, Kubernetes, Standalone; Mesos support is deprecated as of Spark 3.2).
  • APIs in Scala, Python (PySpark), Java, and R.
  • Uses in-memory computation to avoid repeated disk I/O between stages.

Simple PySpark Example

Create SparkSession and DataFrame
# Run this in a PySpark environment (or Jupyter with pyspark installed)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("SimpleExample") \
    .getOrCreate()

data = [
    ("Alice", 25, 50000),
    ("Bob", 30, 60000),
    ("Charlie", 35, 70000)
]

columns = ["name", "age", "salary"]
df = spark.createDataFrame(data, columns)

df.show()

# Keep only rows whose salary exceeds 55000
df_filtered = df.filter(col("salary") > 55000)
df_filtered.show()

# Stop the session when done
spark.stop()