Apache Spark Basics
Learn how Spark processes large datasets in memory using RDDs and DataFrames, and see a few simple PySpark examples.
What is Apache Spark?
Spark is a fast, general engine for big data processing. It can handle batch processing, streaming, machine learning, and SQL workloads.
- Runs on clusters via YARN, Kubernetes, or its built-in standalone manager (Mesos support is deprecated).
- Offers APIs in Scala, Python (PySpark), Java, and R.
- Keeps intermediate data in memory, which makes it much faster than disk-based MapReduce for iterative workloads.
Simple PySpark Example
Create SparkSession and DataFrame
# Run this in a PySpark environment (or Jupyter with pyspark installed)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("SimpleExample") \
    .getOrCreate()

data = [
    ("Alice", 25, 50000),
    ("Bob", 30, 60000),
    ("Charlie", 35, 70000),
]
columns = ["name", "age", "salary"]

df = spark.createDataFrame(data, columns)
df.show()

# Filter and transform
df_filtered = df.filter(col("salary") > 55000)
df_filtered.show()