One Spark task

08/26/2024

If you see a long-running stage with just one task, that’s likely a sign of a problem. While this one task is running, only one CPU is utilized and the rest of the cluster may sit idle. This happens most frequently in the following situations:

Expensive UDF on small data
Window function without a PARTITION BY clause
Reading from an unsplittable file type. This means the file cannot be read in multiple parts, so you end up with one big task. Gzip is an example of an unsplittable file type.
Setting the multiLine option when reading a JSON or CSV file
Schema inference of a large file
Use of repartition(1) or coalesce(1)
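
To make the unsplittable-file case concrete, here is a minimal sketch in plain Python (standard library only, no Spark involved). It assumes a hypothetical in-memory sample of 10,000 CSV-like lines, and the helper name can_decode_from is invented for illustration. The idea: a gzip stream has one header at the start and later bytes depend on earlier ones, so a reader that starts in the middle has nothing to work with, whereas uncompressed text can be cut at any newline and each piece parsed on its own.

```python
import gzip
import zlib


def can_decode_from(blob: bytes, offset: int) -> bool:
    """Return True if gzip decompression succeeds starting at `offset`."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # Covers a missing gzip header, a truncated stream, or corrupt data.
        return False


# Hypothetical sample data: 10,000 short CSV-like lines.
raw = b"".join(f"{i},value-{i}\n".encode() for i in range(10_000))
blob = gzip.compress(raw)

# Reading the compressed bytes from the start works; starting at the
# midpoint does not, which is why one task has to read the whole file.
print("gzip from start:   ", can_decode_from(blob, 0))
print("gzip from midpoint:", can_decode_from(blob, len(blob) // 2))

# Uncompressed text can be cut at a newline near the midpoint, and the
# two halves can be parsed independently (for example, by two tasks).
cut = raw.index(b"\n", len(raw) // 2) + 1
first, second = raw[:cut], raw[cut:]
rows = len(first.splitlines()) + len(second.splitlines())
print("rows recovered from two independent halves:", rows)
```

This is only an illustration of the file-format property, not of how Spark reads files internally. In practice it suggests that storing large inputs uncompressed, or in a splittable format, gives Spark the chance to read them with more than one task.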
