Sorting Data: ExtSort vs. FastSort, which one is better for me?

I often get asked why CloverDX offers two sort components instead of just one and what’s the right key for determining which one is better for a particular purpose.

The reason for having two sort components in CloverDX is simply to keep things as easy as possible. Since the inner natures of ExtSort and FastSort are quite different it would be really difficult to implement a nice and clean universal one.

Luckily, the decision is simple and straightforward. In case you can dedicate enough system resources (CPU cores and/or memory) for the graph doing the sorting, FastSort is the clear option. On the other hand, if you’re short on resources and want a more conservative behavior, pick ExtSort which will give you steady performance at minimum system requirements.

FastSort is a very powerful tool, but to truly witness its power, users must set it up correctly to use their hardware's maximum potential. We will now dive into the settings behind this impressive component and learn how to max out it's ability while being careful to avoid crashes.

Tweaking FastSort

FastSort is greedy for both memory and CPU cores and in case the system does not have enough of either, FastSort can quite easily crash with out-of-memory, especially if the records you’re going to sort are big (long string fields, tens or hundreds of fields, etc.).

Parallelism

Unlike ExtSort, FastSort can utilize potentially unlimited number of CPU cores to do its job. You can control how many worker threads are used by overriding default value for “Concurrency (threads)”. My experience showshowever, that unless you’re able to use really fast disk drives, going for more than 2 threads does not necessarily help and can even slow the process back down a bit. So basically you don’t need to worry about parallelism at all unless you have the hardware to take advantage of it. Remember, that parallelism adds extra memory load for each additional thread!

Memory

FastSort can be a bit tricky with memory, since there are multiple settings which affect it. The most important is the “Run size (records)” which denotes the size of the data chunk being sorted at a time. Note, that actual record size and level of parallelism increase the overall memory consumption – so be careful with this setting. The default is 20k records, if you set the “Estimated record count” – which is your rough guess on total count of records to be sorted, the Run size is computed for you based on a experimentally derived formula. This formula tries to get the right “Run size” based on number of records and amount of available memory (which you can limit with “Maximum memory” – defaults to unlimited). This “computed guess” works in most cases, but can fail under certain conditions. You need to test and tweak on your data a bit to get the best result. Run size is definitely a parameter worth playing with!

Be sure to have enough memory dedicated to your JVM – with large, numerous records. You want to give FastSort plenty of free memory – going for 512 MB up to 2 gigs is worth it! (e.g. –Xmx1536m) With a lot of memory, FastSort will do an amazing job. However with default 64 MB heap space setting, FastSort can crash.

'In memory only sorting' is an option you can use in case you’re sure that all data will fit into your memory – you can either force it (and then possibly crash due to out-of-memory) or leave it to default auto. Auto means that at first, FastSort tries to sort the data in memory and if that fails, on disk sorting is used instead.

Other limits and valuable parameters

Apart from memory settings, you can impose more limits on FastSort to reflect your needs. For example, if your system works with disk quotas which limit the number of open files, you can cap temp files of FastSort with “Max open files”. Note that FastSort uses LOTS of files – hundreds, thousands. If you cap it too much (500 or less) FastSort will continue to work, but its performance will decrease significantly. So should you need to limit the number of open files, consider switching to ExtSort.

Settings you can forget

There are other advanced options for FastSort, but you can leave them to default values unless you are really trying to optimize your sort. Number of read buffers defines how many chunks of data will be held in memory at a time – which must be at least the number of Concurrency – otherwise some of the workers wouldn’t have data to work on. Using too large a number, you’ll end up with out-of-memory – the default is based on current concurrency setting and is just fine.

Average record size is nothing else than a helperguess on average byte size of records in the data – if not set, FastSort computes this automatically from the real data so it’s usually more precise than setting an explicit value.

Tape buffer is a buffer used for each worker for filling the output and slightly affects performance, but the default is fine in almost allcases.

The last two options control how temp files are created, they can be either compressed (defaults to false) and you can even control the charset of string fields (default UTF16). Both are there for space saving purposes (space occupied by temp files during graph execution) and decrease performance.

The Decision

FatSort is very powerful sorting component and can significantly speed up your transformation process. But it has to be set upcorrectly. So, if you are not sure and you want the always safe and simple sort, go with ExtSort. On the other hand, if you know your hardware and want to utilize it to optimize your sort for speed, dive into FastSort and explore it a bit. The results can be extraordinary.

Tweaking ExtSort

While FastSort is optimized to deliver stunning performance in resource-rich environment, ExtSort is a modest hard worker used to working with limited resources. This proves to be useful especially in tranformations with lots of parallel branches with sorting.

Again I won’t discuss the case where there are only a few records and all sorting can be done in memory. For external sorting, ExtSort organizes its temp data in sorted chunks and places them onto a fixed number of tapes – each tape being a single temporary file. Since ExtSort is single-threaded, multiple ExtSort-s running at once do not consume that much resources as FastSort does.

With ExtSort, there isn’t that much magic as with tweaking FastSort. First what you can do is specify a number of temporary directories where tapes will be stored. This is useful when you’re able to harness multiple physical drives under multiple mount points, otherwise just make sure you have enough free space on the drive.

The only two attributes you can fiddle with is Buffer capacity and Number of tapes. There’s also the Sorter initial capacity but this will be deprecated soon and nowadays acts exactly the same as Buffer capacity. Buffer capacity determines the size of each chunk and thus needs to fit into memory. If you’re familiar with FastSort, Run size is the equivalent parameter there. Increasing Buffer capacity is generally a good idea for boosting ExtSort’s performance, but expect increased memory requirements, although still far from those of FastSort.

Number of tapes defaults to 6 and generally will yield best results, however again, increasing the number slightly does not take too much resources and can help with performance.

Compared to FastSort, ExtSort is pretty easy to configure and will work almost under any conditions. However, there’s always a catch – with ExtSort it is the rather inferior performance compared to its stronger brother.

By CloverDX

June 15, 2010

Data Transformation

Quick start

CloverDX Academy

Sorting Data: ExtSort vs. FastSort, which one is better for me?

Tweaking FastSort

Parallelism

Memory

Other limits and valuable parameters

Settings you can forget

The Decision

Tweaking ExtSort

More from Tech Blog

CloverDX Server diagnostics with ChatGPT or Claude via MCP

Data Quality Examples on GitHub

Introducing CloverDX Examples on GitHub

Sending emails via Azure Communication Services SMTP

Connecting to REST APIs (OpenAPI)

Performance tuning: How to troubleshoot database-related performance issues in CloverDX

Visit CloverDX Blog

Using large language models in ETL pipelines: Best practices for integrating OpenAI and other LLMs

The vital importance of data governance in the age of AI

What great data teams stop doing

Using AI in CloverDX to speed up data onboarding without compromising data security

Troubleshoot your CloverDX server faster with the Support Assistant

Read On

Efficient data modelling with DBT and ETL data pipeline

Understanding the CloverDX Project Lifecycle