Free extract from the Manning book Data Analysis with Python and PySpark.

11 Faster PySpark: Understanding Spark’s query planning

 

This chapter covers

  • How Spark uses CPU, RAM, and hard drive resources
  • Using memory resources better to speed up (or avoid slowing down) computations
  • Using the Spark UI to review useful information about your Spark installation
  • How Spark splits a job into stages and how to profile and monitor those stages
  • Classifying transformations into narrow and wide operations and how to reason about them
  • Using caching judiciously and avoiding the performance drop caused by improper caching

Imagine the following scenario: you write a readable, well-thought-out PySpark program. You submit it to your Spark cluster, and it runs. You wait.

How can we peek under the hood and see how our program is progressing? How can we find out which step is taking a lot of time? This chapter is about accessing information about our Spark instance, such as its configuration and layout (CPU, memory, etc.). We also follow the execution of a program from raw Python code to optimized Spark instructions. This knowledge will remove a lot of the magic from your program; you’ll be in a position to know what’s happening at every stage of your PySpark job. If your program takes too long, this chapter will show you where (and how) to look for the relevant information.
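As a small preview of that kind of inspection, here is a minimal sketch; it is not one of the chapter's own listings. It assumes PySpark 3.x running in local mode, and the names `resource_settings` and `peek_under_the_hood`, the application name, and the toy data frame are all hypothetical, chosen only for illustration. It prints the address of the Spark UI, the configuration entries that relate to CPU and memory, and the parsed, analyzed, optimized, and physical plans of a small aggregation.

```python
def resource_settings(conf_pairs):
    """Keep only the (key, value) settings that describe CPU and memory.

    The prefix list is an illustrative assumption, not an exhaustive one.
    """
    prefixes = ("spark.master", "spark.driver.memory", "spark.executor.")
    return {key: value for key, value in conf_pairs if key.startswith(prefixes)}


def peek_under_the_hood():
    """Start a local session, then print its UI address, resources, and plans."""
    # Imported here so the helper above stays usable without PySpark installed.
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = (
        SparkSession.builder
        .appName("peek-under-the-hood")  # hypothetical application name
        .master("local[*]")              # assumption: local mode, all cores
        .getOrCreate()
    )
    sc = spark.sparkContext

    # Address of the Spark UI for this application.
    print("Spark UI:", sc.uiWebUrl)

    # Settings explicitly present in the configuration of this session.
    print(resource_settings(sc.getConf().getAll()))

    # A toy aggregation: groupBy forces Spark to exchange data between partitions.
    counts = (
        spark.range(1_000)
        .withColumn("bucket", F.col("id") % 10)
        .groupBy("bucket")
        .count()
    )

    # Prints the parsed, analyzed, and optimized logical plans, then the physical plan.
    counts.explain(extended=True)

    spark.stop()
```

Calling `peek_under_the_hood()` on a machine with PySpark and a Java runtime installed prints the information described above. Settings that were never set explicitly may not appear in the filtered dictionary, because `getConf().getAll()` only returns entries present in the configuration.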

11.1 Open sesame: Navigating the Spark UI to understand the environment

11.1.1 Reviewing the configuration: The Environment tab

11.1.2 Greater than the sum of its parts: The Executors tab and resource management

11.1.3 Look at what you’ve done: Diagnosing a completed job via the Spark UI

11.1.4 Mapping the operations via Spark query plans: The SQL tab

11.1.5 The core of Spark: The parsed, analyzed, optimized, and physical plans

11.2 Thinking about performance: Operations and memory

11.2.1 Narrow vs. wide operations

11.2.2 Caching a data frame: Powerful, but often deadly (for perf)

Summary
