Explore a detailed PySpark cheat sheet covering functions, DataFrame operations, RDD basics, and commands. Perfect for data engineers and big data enthusiasts.
PySpark is the Python API for Apache Spark, an open-source, distributed computing system. PySpark allows data engineers and data scientists to process large datasets efficiently and integrate with Hadoop and other big data technologies. This cheat sheet is designed to provide an overview of the most frequently used PySpark functionalities, organized for ease of reference.
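Before diving into the individual commands, here is a minimal end-to-end sketch of a PySpark workflow (the application name, column names, and values are purely illustrative):

from pyspark.sql import SparkSession

# Start a local session, build a tiny DataFrame, and run a simple transformation
spark = SparkSession.builder.appName("QuickStart").getOrCreate()
df = spark.createDataFrame([("Alice", 25), ("Bob", 31)], ["name", "age"])
df.filter(df["age"] > 30).show()
spark.stop()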
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("App Name").getOrCreate()
From RDD:
from pyspark.sql import Row

rdd = spark.sparkContext.parallelize([Row(name="Alice", age=25)])
df = rdd.toDF()
From a file:
df = spark.read.csv("path/to/file.csv", header=True, inferSchema=True)
Display:
df.show()
Schema:
df.printSchema()
Select columns:
df.select("column_name").show()
Filter rows:
df.filter(df["column_name"] > value).show()
Add a column:
df.withColumn("new_column", df["existing_column"] + 10).show()
df.createOrReplaceTempView("table_name")
spark.sql("SELECT * FROM table_name").show()
from pyspark.sql.functions import col, avg, max, min, count, countDistinct

# Group by and aggregate
df.groupBy("column1").agg(avg("column2"), max("column2")).show()

# Count distinct values
df.select(countDistinct("column1")).show()

# Aggregate without grouping
df.agg(min("column1"), max("column2")).show()
from pyspark.sql.functions import upper

df.select(upper(df["column_name"])).show()
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank

# Define a window
window_spec = Window.partitionBy("column1").orderBy("column2")

# Add row numbers
df.withColumn("row_number", row_number().over(window_spec)).show()

# Add ranks
df.withColumn("rank", rank().over(window_spec)).show()
from pyspark.sql.functions import lit, concat

df.withColumn("concatenated", concat(df["col1"], lit("_"), df["col2"])).show()
from pyspark.sql.functions import current_date, date_add

df.withColumn("today", current_date()) \
  .withColumn("tomorrow", date_add(current_date(), 1)) \
  .show()
from pyspark.sql.functions import sum, count

df.groupBy("group_col").agg(sum("value_col"), count("*")).show()
df = spark.read.json("path/to/json/file.json")
df = spark.read.parquet("path/to/parquet/file")
df.write.csv("path/to/save.csv", header=True)
df.write.json("path/to/save.json")
df.cache()      # cache the DataFrame with the default storage level
df.persist()    # equivalent to cache() when called without arguments
df.unpersist()  # remove the DataFrame from the cache
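persist() also accepts an explicit storage level when the default is not what you want. A minimal sketch, assuming the DataFrame is reused across several actions:

from pyspark import StorageLevel

# Keep the data in memory, spilling to disk only when it does not fit
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()        # an action triggers the actual caching
df.unpersist()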
# Inner join
df1.join(df2, df1["key"] == df2["key"], "inner").show()

# Left outer join
df1.join(df2, df1["key"] == df2["key"], "left").show()

# Full outer join
df1.join(df2, df1["key"] == df2["key"], "outer").show()
# Sort by column
df.sort("column1").show()

# Sort descending
df.orderBy(df["column1"].desc()).show()
# Drop rows with null values
df.na.drop().show()

# Fill null values
df.na.fill({"column1": 0, "column2": "missing"}).show()

# Replace null values in a column
df.fillna(0, subset=["column1"]).show()
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
rddFromFile = spark.sparkContext.textFile("path/to/file.txt")
map: Applies a function to each element.
rdd.map(lambda x: x * 2).collect()
filter: Filters elements based on a condition.
rdd.filter(lambda x: x % 2 == 0).collect()
flatMap: Maps each element to zero or more elements and flattens the result.
rdd.flatMap(lambda x: [x, x * 2]).collect()
collect: Returns all elements.
rdd.collect()
count: Returns the number of elements.
rdd.count()
reduce: Aggregates elements using a function.
rdd.reduce(lambda x, y: x + y)
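Transformations and actions are typically chained. A minimal sketch using only the operations above (the values are illustrative):

rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])

total = (
    rdd.filter(lambda x: x % 2 == 0)  # keep even numbers -> [2, 4]
       .map(lambda x: x * 10)         # scale them        -> [20, 40]
       .reduce(lambda x, y: x + y)    # sum the results   -> 60
)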
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Define a UDF
def multiply_by_two(x):
    return x * 2

multiply_udf = udf(multiply_by_two, IntegerType())

# Apply UDF
df.withColumn("new_column", multiply_udf(df["column"])).show()
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Save model (VectorAssembler is typically used beforehand to build the
# "features" vector column that LinearRegression expects in training_data)
model = LinearRegression().fit(training_data)
model.save("path/to/model")

# Load model
from pyspark.ml.regression import LinearRegressionModel

loaded_model = LinearRegressionModel.load("path/to/model")
df = df.repartition(10)  # repartition returns a new DataFrame; reassign to use it
jdbc_url = "jdbc:mysql://host:port/db"
df = spark.read.format("jdbc").option("url", jdbc_url).option("dbtable", "table").load()
PySpark is a versatile tool for handling big data. This cheat sheet covers RDDs, DataFrames, SQL queries, and built-in functions essential for data engineering. Using these commands effectively can optimize data processing workflows, making PySpark indispensable for scalable, efficient data solutions.