Benchmarking rzip64

Introduction

Benchmarking rzip64 turns out to be more difficult than expected at first.

On the one hand, each run of rzip64 (with realistic test file sizes) takes a considerable amount of time. While this mitigates statistical effects in the measurements, the overall measurement procedure took us several weeks to complete.

On the other hand, rzip64 can create an exceptionally high system load. This stress applies not only to the CPU cores and caches but also to the disk I/O system. When multiple cores are involved, the I/O capacity of the system may quickly become a bottleneck.

Test platform

The benchmarks were carried out on the following test system:
Tyan Tempest i5400PW (S5397) Mainboard (dual-socket system)
2 x Intel Xeon Quadcore CPUs E5430 @ 2.66 GHz
4 x 4 GByte FBDIMMs
Single 1.2 TByte Seagate SATA Disk, XFS-Filesystem
openSUSE 11.3 Linux (x86_64, Kernel 2.6.34.7-0.5)
A 103 GByte tar file containing a full Linux backup was used as the test file for compression and for verification after successful decompression (via diff).

Measurement results

A single run was executed for each number of cores to obtain the following results. Times were measured using the Linux /usr/bin/time command; a small sketch of how these counters can be collected programmatically follows the legend below.
Cores                              1          2          3          4          5          6          7          8          9         10         11
User time / s                  48495      55721      61335      69775      78702      83806      85587      83424      76198      72251      71431
Sys time / s                     539        750        823        988       1180       1361       1488       1514       1373       1283       1311
CPU / cores                     0.96       1.85       2.76       3.62       4.44       4.91       5.16       5.08       4.02       3.56       3.33
Wall clock                  14:10:33   08:26:38   06:15:05   05:25:47   04:59:18   04:48:36   04:40:49   04:38:11   05:21:34   05:43:40   06:03:55
Max. res. size / bytes       3991056    3990160    3990176    3990128    3990128    3990112    3990112    3990128    3990128    3990112    3990128
Major page faults               2658      12359      12879      15112      45550     113853     127866     204997     553168     694700     845394
Minor page faults          166283569  169967026  170033646  170084382  171211348  173937994  177253731  178905591  179185466  178702755  179589184
Context sw., voluntary       1243093    1268051    1246331    1255282    1244928    1304375    1334221    1439029    1795810    1935975    2101555
Context sw., involuntary     5354850    7034956    7493687    9358767   10755566   12464211   12369724   12014293   10752282   10145605   10038880
I/O in                     308357112  294916048  295312536  309596824  319252904  342623640  369515360  391274832  449503616  481193800  501712536
I/O out                    214908728  214883216  214877880  214879448  214888672  214889192  214895520  214888728  214880008  214879528  214877968
  • User time: time spent in user space
  • Sys time: time spent in kernel space
  • CPU: average CPU utilization in cores (8 = all eight cores busy the whole time)
  • Wall clock: total wait time for the user
  • Max. resident size: maximum number of occupied memory bytes
  • Major page faults: page faults that resulted in disk I/O
  • Minor page faults: page faults without disk I/O
  • Context switches, voluntary: caused by waiting for I/O operations
  • Context switches, involuntary: caused by an elapsed time slice
  • I/O in: number of bytes read from disk
  • I/O out: number of bytes written to disk
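
For illustration, the following minimal C sketch (an assumption of this write-up, not the actual measurement setup) collects the same accounting quantities for a child command via wait4()/getrusage(), which is essentially the interface /usr/bin/time relies on:

    /* Sketch only: run a child command and print its resource usage counters. */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s COMMAND [ARGS...]\n", argv[0]); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                       /* child: run the benchmarked command */
            execvp(argv[1], &argv[1]);
            perror("execvp");
            _exit(127);
        }

        int status;
        struct rusage ru;
        if (wait4(pid, &status, 0, &ru) < 0) { perror("wait4"); return 1; }

        printf("user time      : %ld.%06ld s\n", (long)ru.ru_utime.tv_sec, (long)ru.ru_utime.tv_usec);
        printf("sys time       : %ld.%06ld s\n", (long)ru.ru_stime.tv_sec, (long)ru.ru_stime.tv_usec);
        printf("max res. size  : %ld\n", ru.ru_maxrss);   /* maximum resident set size */
        printf("major faults   : %ld\n", ru.ru_majflt);
        printf("minor faults   : %ld\n", ru.ru_minflt);
        printf("ctx sw, vol.   : %ld\n", ru.ru_nvcsw);
        printf("ctx sw, invol. : %ld\n", ru.ru_nivcsw);
        printf("fs inputs      : %ld\n", ru.ru_inblock);  /* block input operations */
        printf("fs outputs     : %ld\n", ru.ru_oublock);  /* block output operations */
        return 0;
    }

Invoked with the benchmarked command and its arguments, it prints the raw counters for a single run.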

Discussion

The results become clearer after normalizing each quantity to its respective maximum and plotting them next to each other.

(Figure: Normalized benchmark results)
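
As a small illustration (not part of rzip64 or the original evaluation), the following C sketch applies this normalization to the wall-clock column of the table above, converted to seconds:

    /* Illustrative sketch: normalize a measurement series to its maximum so that
     * quantities with different units can share a common 0..1 scale in one plot. */
    #include <stdio.h>

    int main(void)
    {
        /* wall-clock times in seconds for 1..11 cores, converted from the table above */
        double wall[] = { 51033, 30398, 22505, 19547, 17958, 17316,
                          16849, 16691, 19294, 20620, 21835 };
        int n = sizeof(wall) / sizeof(wall[0]);

        double max = wall[0];
        for (int i = 1; i < n; i++)
            if (wall[i] > max)
                max = wall[i];

        for (int i = 0; i < n; i++)
            printf("%2d cores: %.3f\n", i + 1, wall[i] / max);

        return 0;
    }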

The black line answers the question "How does it scale with an increasing number of cores?".
Adding a second core cuts the total wall clock time by approximately 40 % (from 14:10:33, about 51000 s, down to 08:26:38, about 30400 s). Add a third core and your compression jobs will complete in less than half of the single-core time. The gain of each additional core, however, decreases as the number of cores grows.

Like many data compression programs, rzip64 operates blockwise. Each block has a size of nearly 1 GByte and is read into memory by a single mmap() operation. Immediate access to the data in the block then causes major page faults with subsequent disk I/O. The optional flag MAP_POPULATE can avoid some of these page faults by reading the file ahead immediately, but it is not available on many platforms and is therefore currently not used.
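
A minimal C sketch of this access pattern (simplified, and not taken from the rzip64 sources) might look as follows:

    /* Sketch only, not rzip64's actual code: map one block of the input file
     * and touch its pages.  Without MAP_POPULATE the first access to each page
     * causes a (major) page fault and the corresponding disk read. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/stat.h>

    #define BLOCK_SIZE ((size_t)1 << 30)      /* roughly 1 GByte per block */

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }
        size_t len = st.st_size < (off_t)BLOCK_SIZE ? (size_t)st.st_size : BLOCK_SIZE;

        int flags = MAP_PRIVATE;
    #ifdef MAP_POPULATE
        flags |= MAP_POPULATE;                /* Linux-specific: pre-fault the pages */
    #endif
        unsigned char *block = mmap(NULL, len, PROT_READ, flags, fd, 0);
        if (block == MAP_FAILED) { perror("mmap"); return 1; }

        /* With MAP_POPULATE this loop no longer triggers major page faults. */
        unsigned long sum = 0;
        for (size_t i = 0; i < len; i += 4096)
            sum += block[i];
        printf("touched %zu bytes, checksum %lu\n", len, sum);

        munmap(block, len);
        close(fd);
        return 0;
    }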

The yellow line (paging activity) also gives some insight into how the Linux kernel distributes the compression processes over the available cores. This is critical since it heavily influences the cache performance of the cores.

As stated above, the system is equipped with two quad-core CPUs. The first four cores share a single die and therefore a single cache. When more cores are added, the kernel starts to distribute the processes to the other CPU socket, too. That results in an increase in paging activity, as shown by the yellow line starting at 5 cores.
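
One simple way to observe this placement (a sketch for illustration only, not part of the benchmark; the mapping of core numbers to sockets is an assumption and can be verified in /proc/cpuinfo) is to let a few CPU-bound worker processes report the core they are currently running on via sched_getcpu():

    /* Illustrative only: observe on which CPU the kernel places CPU-bound workers. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/wait.h>

    #define WORKERS 8

    int main(void)
    {
        for (int w = 0; w < WORKERS; w++) {
            if (fork() == 0) {
                volatile unsigned long x = 0;
                for (int round = 0; round < 5; round++) {
                    for (unsigned long i = 0; i < 200000000UL; i++)
                        x += i;                   /* burn some CPU time */
                    printf("worker %d (pid %d) round %d on cpu %d\n",
                           w, getpid(), round, sched_getcpu());
                }
                _exit(0);
            }
        }
        while (wait(NULL) > 0)
            ;                                     /* reap all workers */
        return 0;
    }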

As it turns out, there is no additional benefit in running more parallel tasks than there are physical cores. On the contrary, a ninth task slows down the compression significantly.

Running rzip64 in parallel on a larger number of cores imposes a high load on the disk I/O subsystem. Each core loads a chunk of nearly 1 GByte in parallel with the other cores. That quickly saturates the available I/O bandwidth of the test system.

Some tests with a large RAM disk suggest that the underlying mass storage system has a very large impact. A properly configured RAID may therefore also be used to speed up the overall compression time.