
Apache Spark Basics

Learn how Spark processes large datasets in memory using RDDs and DataFrames, and see a few simple PySpark examples.

What is Apache Spark?

Apache Spark is a fast, general-purpose engine for large-scale data processing. It handles batch processing, streaming, machine learning, and SQL workloads within a single framework.

  • Runs on clusters (YARN, Kubernetes, Standalone; Mesos support is deprecated as of Spark 3.2).
  • APIs in Scala, Python (PySpark), Java, and R.
  • Uses in-memory computation to avoid repeated disk I/O between stages.

Simple PySpark Example

Create SparkSession and DataFrame
# Run this in a PySpark environment (or Jupyter with pyspark installed)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("SimpleExample") \
    .getOrCreate()

data = [
    ("Alice", 25, 50000),
    ("Bob", 30, 60000),
    ("Charlie", 35, 70000)
]

columns = ["name", "age", "salary"]
df = spark.createDataFrame(data, columns)

df.show()

# Keep only rows whose salary exceeds 55000
df_filtered = df.filter(col("salary") > 55000)
df_filtered.show()

# Stop the session when done
spark.stop()