Many small Spark jobs

If you see many small jobs, it’s likely you’re doing many operations on relatively small data (<10GB). Small operations only take a few seconds each, but they add up, and the time spent in overhead per operation also adds up.

The best approach to speeding up small jobs is to run multiple operations in parallel. Delta Live Tables does this for you automatically.

Other options include:

  • Separate your operations into multiple notebooks and run them in parallel on the same cluster by using multi-task jobs (see the first sketch after this list).
  • Use SQL warehouses if all your queries are written in SQL. SQL warehouses scale well for many parallel queries because they were designed for this type of workload.
  • Parameterize your notebook and use the For each task to run it multiple times in parallel, using the Concurrency setting to control the level of parallelism (see the second sketch after this list). This works well with serverless compute.