Speedup of Quicksort with multi-core programming
Most modern computers have multiple cores, and multi-core processors are now available even in portable devices. Yet most software does not take advantage of them: processors sit idle, doing nothing, most of the time.
Many frameworks, libraries, language extensions, and even new programming languages have been proposed to take advantage of the new hardware. One of them is Threading Building Blocks (TBB) from Intel, a library that can be added to C++ programs to speed up execution on multi-core processors.
In this post, I’m going to investigate the effectiveness of TBB in accelerating program execution on multi-core processors. For my experiments, I chose quicksort, which lends itself naturally to being parallelized.
This is by no means a rigorously scientific experiment. As the designers of TBB claim:
Intel® Threading Building Blocks (TBB) offers a rich and complete approach to expressing parallelism in a C++ program. It is a library that helps you take advantage of multi-core processor performance without having to be a threading expert. Threading Building Blocks is not just a threads-replacement library. It represents a higher-level, task-based parallelism that abstracts platform details and threading mechanisms for scalability and performance.
Therefore, I’m going to do something really simple here: pretend that I am a newbie programmer who is not familiar with multi-threaded programming, yet wants to take advantage of the new tools to make my program run faster.
For the experiments, I wrote two simple quicksort programs. The first is a single-threaded program that invokes std::sort() to sort a list of randomly generated integers. The second invokes tbb::parallel_sort() to sort the same kind of list. Both are template functions based on the quicksort algorithm.
The program is very simple: randomly generate a list of integers, then sort it. Only the sorting portion is parallelized; the code that generates the random integers is not. Although that part could be parallelized as well, we chose not to at this point, to see whether TBB would add any overhead to the normal, serialized code.
The experiments were performed on a desktop computer with AMD Phenom(tm) II X4 940 Processor, and 4GB of RAM. The operating system is Ubuntu 10.04 x86_64. The program is compiled with TBB v2.2.
For small data sets, there was no benefit at all in using the full power of multiple cores. Therefore, our data sets started at one million integers and went up to 100 million integers. The following graph shows the execution time of both programs:
Since we pretended to be newbies, we didn’t actually know how many threads TBB would create on a quad-core processor to sort the integers, so we assumed that four threads were created.
For both programs, we recorded the time to generate the list of random integers and the time to sort it. All execution times were recorded in milliseconds. As we had expected, for large data sets the effect of parallel processing was more obvious.
The following graph shows the speedup (and, should I say, slowdown) of the programs:
We were expecting TBB to add some overhead to the serialized portion of the code, but the overhead was much larger than expected: after adding TBB, the code that generates random integers ran at roughly 60% of its original speed.
We only parallelized the code that sorts the integer list, and as we can see, we did gain speed there. Since we were invoking tbb::parallel_sort() as provided by the TBB library, we expected, probably naively, that it had been highly optimized with some heuristic thread-coordination algorithm. However, in this experiment, on a quad-core processor, the average speedup is around 2.5, which is less than the number of CPU cores. To be frank, I wouldn’t expect the speedup to be 4 either, but deep down I was still hoping it could be higher.
As the data set gets larger, the average speedup gets higher too. However, I wanted to know whether the parallelized code retains any advantage once the system runs out of memory and starts swapping. I had 4GB of RAM on the machine, so I loaded one billion integers. As the size of an int is 4 bytes, that’s a total of 4 billion bytes, which should eat up all my RAM. As expected, parallelized or not, it made no difference: the whole system was swapping like crazy, and the hard disk was thrashing. Actually, the TBB version did have an effect, but a negative one. Because it had more threads, context switching added to the workload while the system was busy swapping. So in this case, sorting one billion integers with the parallelized program took longer than with the single-threaded program.
I tested different cases, with data sizes in the range of hundreds of millions of integers. As long as the data size stays below the threshold at which the system starts thrashing the hard drive, the parallelized version can take advantage of multi-core processing. Therefore, the amount of physical memory is very important.