Introduction

Benchmarking rzip64 turned out to be more difficult than expected.
On the one hand, each run of rzip64 (with realistic test file sizes) takes a considerable amount of time. While this mitigates statistical noise in the measurements, the overall measurement procedure took several weeks to complete.
On the other hand, rzip64 can create an exceptionally high system load. This stress applies not only to the CPU cores and caches but also to the disk I/O subsystem. When multiple cores are involved, the I/O capacity of the system can quickly become a bottleneck.
Test platform

The benchmarks were carried out on the following test system:
Tyan Tempest i5400PW (S5397) mainboard (dual-socket system)

A 103 GByte tar file containing a full Linux backup is used as the test file for compression; after each successful decompression the result is verified against the original (via diff).
Measurement results

A single run was executed for each number of cores to obtain the following results. Times were measured with the Linux /usr/bin/time command.
Discussion

The results become clearer after normalizing them to their respective maxima and plotting them next to each other.
The black line answers the question "How does compression time scale with an increasing number of cores?".
Like many data compression programs, rzip64 operates blockwise. Each block has a size of nearly 1 GByte and is read into memory with a single mmap() operation. Immediate access to the data in the block then causes major page faults with subsequent disk I/O. The optional flag MAP_POPULATE can avoid some of these page faults by reading the file ahead at mapping time, but that flag is not available on many platforms and is therefore not used at present.
The yellow line (paging activity) also gives some insight into how the Linux kernel distributes the compression processes over the available cores. This is critical, since the placement heavily influences the cache performance of the cores.
As stated above, the system is equipped with two quad-core CPUs. The first four cores share a single die and therefore a single cache. When more cores are added, the kernel starts to distribute the processes to the second CPU socket as well. This results in an increase in paging activity, visible in the yellow line starting at 5 cores.
As it turns out, there is no additional benefit in running more parallel tasks than there are physical cores; on the contrary, a 9th thread slows down the compression significantly.
Running rzip64 on a larger number of cores imposes a heavy load on the disk I/O subsystem: each core loads a chunk of nearly 1 GByte in parallel with the others, which quickly saturates the available I/O bandwidth of the test system.
Some tests with a large RAM disk suggest that the underlying mass storage system has a very large impact. A properly configured RAID array can also be used to reduce the overall compression time.