
PySpark Optimizations

Optimize your code and avoid skewed data

It is 100% certain that at some point your code will get stuck while executing within a notebook.
Here are the main tricks for optimizing your code in PySpark once you hit an out-of-memory error in the Driver or in the Executor, two fundamental concepts that you need to know as a big data engineer.

DRIVER AND EXECUTOR

The driver is the central coordinator and controller of a Spark application. It is responsible for Job and Task Scheduling: the driver splits the application into smaller tasks and sends them to the executors for execution.

Typically, the driver runs the main application code, where operations like map(), reduce(), and filter() are defined before being executed on the executors.

On the other side, an executor is a worker process in the cluster that performs the actual data processing.
Executors are responsible for Task Execution: each executor is assigned tasks by the driver, and these tasks perform computations on the data (e.g., transformations and actions).
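
To make the driver/executor split concrete, here is a minimal sketch (the file numbers.csv and its value column are hypothetical):

from pyspark.sql import SparkSession

# Everything in this script runs on the driver, which builds the execution plan
spark = SparkSession.builder.appName("driver-executor-demo").getOrCreate()

# Defining transformations happens on the driver; nothing is computed yet
df = spark.read.csv("numbers.csv", header=True, inferSchema=True)
filtered = df.filter(df["value"] > 10)

# The action makes the driver schedule tasks on the executors, which do
# the actual filtering and counting on their partitions of the data
print(filtered.count())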

OUT OF MEMORY ERROR

  1. The most direct way is to increase the driver or executor memory (see the config sketch after this list).
  2. Avoid using collect() on large datasets, and try to avoid wide transformations by using reduceByKey() instead of groupByKey() (sketch below).
  3. Broadcast smaller datasets that will later be joined with larger datasets:
from pyspark.sql.functions import broadcast  # needed for the broadcast hint
small_df = spark.read.csv("small_file.csv", header=True)
# Ship small_df to every executor so the join avoids shuffling the big table
joined_df = large_df.join(broadcast(small_df), "key")  # large_df and "key" are placeholders
  4. Persist or cache intermediate results so recomputation is avoided (sketch below).
  5. If your partitions are unbalanced or badly sized, try to repartition to distribute the workload better (sketch below). Here is a very good video to clarify your doubts:
    https://youtu.be/hvF7tY2-L3U?si=vsGd7-W4ZYkLn6gr
  6. Optimize shuffling by going to the query plan and checking whether there is some join you could avoid (sketch below). Here is another video on the topic of shuffling:
    https://youtu.be/ffHboqNoW_A?si=n8IIuX1zgZAa3zjS
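
For point 1, a minimal sketch of raising the memory settings when building the session; the sizes here are arbitrary examples, so tune them to your cluster:

from pyspark.sql import SparkSession

# Example sizes only; spark.driver.memory must be set before the driver
# JVM starts (e.g., via spark-submit), or it will have no effect
spark = (SparkSession.builder
         .appName("memory-tuning")
         .config("spark.driver.memory", "4g")
         .config("spark.executor.memory", "8g")
         .getOrCreate())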
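
For point 2, a toy RDD comparison: groupByKey() shuffles every record, while reduceByKey() pre-aggregates on each partition, so far less data crosses the network:

rdd = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3)])

# groupByKey() moves every (key, value) pair across the network first
sums_slow = rdd.groupByKey().mapValues(sum)

# reduceByKey() combines values locally, shuffling only the partial sums
sums_fast = rdd.reduceByKey(lambda x, y: x + y)
print(sums_fast.collect())  # e.g. [('a', 4), ('b', 2)]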
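
For point 4, a sketch of persisting a DataFrame (df stands for any DataFrame you reuse across several actions):

from pyspark import StorageLevel

# Keep the data in memory and spill to disk if it does not fit
df.persist(StorageLevel.MEMORY_AND_DISK)

df.count()  # the first action computes and caches the DataFrame
df.show(5)  # later actions reuse the cached data instead of recomputing

df.unpersist()  # release the memory once you are done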
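
For point 5, a sketch of repartitioning; the partition count and the column name are assumptions:

# Hash-partition the data on a column, spreading it over 200 partitions
df2 = df.repartition(200, "customer_id")
print(df2.rdd.getNumPartitions())  # 200

# coalesce() is the cheaper choice when you only need FEWER partitions
df3 = df2.coalesce(50)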
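
For point 6, explain() prints the physical plan; Exchange nodes in the output are the shuffles you want to minimize (orders_df and customers_df are placeholders):

joined = orders_df.join(customers_df, "customer_id")

# Look for 'Exchange' operators (shuffles); a BroadcastHashJoin avoids them
joined.explain()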

HOW TO FIX SKEWED DATA

Skewed data means that when you make partitions (subsets of data) from your dataset based on one column, the data may end up distributed unevenly across the partitions.

For example, if you partition by date and it turns out that for the month of December your sales are much higher because of Christmas and you have way more records, one of the partitions could be overloaded, resulting in an executor out-of-memory error.

What to do in these cases?

  1. Try to change, if possible, the partition key/column so that the data is evenly distributed.
  2. Try to increase the number of partitions, as this may relieve the overwhelmed partition that has to take care of the month of December.
  3. Use broadcast joins if one of the tables is small enough: you will avoid shuffling data and lower the chances of overloading the executors.
  4. Use the salting technique, where you add a random value (salt) to the skewed key to spread the data across multiple partitions, thus reducing memory overload on any single partition (see the sketch below).
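
Here is a minimal sketch of salting for point 4, assuming a large DataFrame df skewed on the column key and a small dimension table dim_df that joins on the same key; NUM_SALTS = 10 is an arbitrary choice:

from pyspark.sql import functions as F

NUM_SALTS = 10  # how many buckets to split each hot key into

# Big, skewed side: append a random salt (0-9) to the join key
salted = (df
          .withColumn("salt", (F.rand() * NUM_SALTS).cast("int"))
          .withColumn("salted_key",
                      F.concat_ws("_", F.col("key"), F.col("salt").cast("string"))))

# Small side: replicate each row once per salt value so every bucket finds a match
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
dim_salted = (dim_df
              .crossJoin(salts)
              .withColumn("salted_key",
                          F.concat_ws("_", F.col("key"), F.col("salt").cast("string"))))

# Join on the salted key: December's rows are now spread over NUM_SALTS partitions
result = salted.join(dim_salted.drop("key", "salt"), "salted_key")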