For this project, you should use MPI as well as OpenMP. You are free to choose how to combine the two for performanceC++ Programming

CS 484 - Parallel Programming


Learning Goals

You will learn the following:

  • Histogram sorting in a distributed memory environment

  • Hybrid OpenMP + MPI parallelization and performance analysis


Assignment Tasks

The basic workflow of this assignment is as follows: Clone your turnin repo to VMFarm 1

Your repo should be visible in a web browser and clonable with git from .

  • You may need to iterate:

    • Implement the algorithms / code for the various programming

    • Build and test on

    • Check in and push your complete, correct

    • Benchmark on Campus Cluster (scripts/batch run)

  • Check in any benchmarking results that you wish to and final versions of code

  • Write and check in your


1     Distributed Memory Histogramming Sort

For this project, you should use MPI as well as OpenMP. You are free to choose how to combine the two for performance, but you should at least have as many MPI ranks as the number of physical nodes used in benchmarking.

You will start with data consisting of unsigned 64-bit integers, whose type is defined as dist sort t, that is distributed non-uniformly across MPI ranks. You can find the function prototypes in solution.h, and corresponding function bodies in solution.cpp. You may also want to refer to main sort.cpp, where the overall execution flow of the program is defined.

You will first balance data by redistributing it across processes without worrying about sorted-ness, and then sort it in parallel using all MPI ranks with a repeated histogramming algorithm. At the end, each MPI rank should hold data that is locally sorted, and any data on rank i should be smaller than the smallest data item on rank i+1 (i.e. data should be globally sorted). The entire set of data items, including any duplications, must remain the same.


1.1  Rebalancing Data

Your first task is to balance data across all the ranks without worrying about sorted-ness. This should be implemented in the rebalance() function. The first two parameters are the original data array (randomly generated) and the number of elements in the array. The last two parameters are pointers to the array that will contain the rebalanced data and the number of elements that will be in the rebalanced array. This function should be implemented such that all ranks have roughly the same number of data elements, where each data element originates from one of the imbalanced arrays (the total dataset should remain the same).

For a precise description of pre and post conditions, see the comments in src/solution.h .

1You are free to use the Docker container uiuccs484parallelprog/cs484 student , but we will provide no support for Docker itself.


1.2   Histogramming

You will implement the findSplitters() function which attempts to find a histogram with (roughly) equally counts in its bins by varying bin boundaries. You will use the repeated histogramming algorithm. There will be N splitters for N bins, with the last splitter being the maximum value in the entire dataset. Do not assume that N will be equal to the number of ranks, it may not be. After calling findSplitters(), the values in splitters shall be monotonically increasing, counts[i] shall contain the global number of elements in all the data arrays such that splitters[i − 1] < x ≤ splitters[i]. That is:

counts[i] = |{x|    splitters[i − 1] <              x ≤ splitters[i], x ∈ DATASET} |

counts[0] = |{x|                                     x ≤ splitters[0], x ∈ DATASET} |

over the entire, distributed dataset.

You should read the paper [1] for more details on the iterative process of histogramming and finding the splitters. For those of you that would like to optimize the algorithm even further, we recommend reading [2]. Both papers have been uploaded to the Wiki, in Week 13 of the Lecture Schedule page.

This problem is much simpler if,  internally,  you consider it as a problem of finding equally spaced quantiles, rather  than equally weighted bins. Nevertheless, output must conform to this specification.  (It is easy to transform between the  two problems.)

For a precise description of pre and post conditions, see the comments in src/solution.h .


1.3    Moving the Data Elements

Finally, you will implement the moveData function that utilizes the splitters and counts arrays to send the data elements to the right ranks. At the end, each machine should have numbers that fall in the half-open interval

(splitters[i-1], splitters[i]]

, with the zeroth rank holding all data in the closed interval

[0, splitters[0]]


. The final elements should be placed in *recvData (you should allocate memory for it and store the new pointer) and the number of elements in *rDataCount.

For a precise description of pre and post conditions, see the comments in src/solution.h .


1.4 Local Sorting

You have been provided a correct, working, single-threaded sort algorithm in sort() . You may replace this with your own sort implementation if you feel you can do better.

Note that while no points will be awarded for the correctness of your sort(), points will be deducted for an incorrect

sort() function.


2   Benchmarking

The project will be graded on correctness as well as the performance of each phase, and good use of OpenMP and MPI primitives. You may create and test alternate implementations of some or all phases using competing MPI primitives. If you choose to do so, your writeup should compare the different approaches.

More specifically, once histogramming has identified the correct splitter keys, data needs to be moved from every process to every other process to where it belongs. For this purpose, you may use the following three alternatives (or more if you can think of any), and compare them:

  • All-to-all collectives

  • Regular point-to-point messages

  • One-sided messaging (using MPI Put, etc.)

Whether or not you choose to explore all the options, your final program should choose one, and you should highlight that in your writeup.

We have provided you with a batch script perform unit tests of your implemented functions and time your program. This time, however, make sure to modify or add lines with s mpirun in the benchmarking part of the script if necessary, according to how you mix MPI and OpenMP. The current script assumes that you are not using OpenMP and are allocating 1 MPI rank per CPU core. As always, you should run this on the campus cluster. The timing result will be stored in writeup/benchmark.txt.

You should vary the number of CPU cores and physical nodes used in the program and plot these results, evaluating the performance and scalability in your writeup.


Instructions Files

C++ Programming Experts

Dinh Hong P.
C++ Programming

20 Answers

Eric Bell
C++ Programming

88 Answers

View More Experts

The ready solutions purchased from Library are already used solutions. Please do not submit them directly as it may lead to plagiarism. Once paid, the solution file download link will be sent to your provided email. Please either use them for learning purpose or re-write them in your own language. In case if you haven't get the email, do let us know via chat support.

Get Free Quote!

251 Experts Online