MPI Broadcast and Collective Communication

Author: Wes Kendall
So far in the MPI tutorials, we have examined point-to-point communication, which is communication between two processes. This lesson is the start of the collective communication section. Collective communication is a method of communication that involves the participation of all processes in a communicator. In this lesson, we will discuss the implications of collective communication and go over a standard collective routine - broadcasting.
Note - All of the code for this site is on GitHub. This tutorial’s code is under tutorials/mpi-broadcast-and-collective-communication/code.
Collective communication and synchronization points
One of the things to remember about collective communication is that it implies a synchronization point among processes. This means that all processes must reach a point in their code before they can all begin executing again.
Before going into detail about collective communication routines, let’s examine synchronization in more detail. As it turns out, MPI has a special function that is dedicated to synchronizing processes:
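```c
MPI_Barrier(MPI_Comm communicator)
```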
The name of the function is quite descriptive - the function forms a barrier, and no processes in the communicator can pass the barrier until all of them call the function. Here’s an illustration. Imagine the horizontal axis represents execution of the program and the circles represent different processes:
Process zero first calls MPI_Barrier at the first time snapshot (T1). While process zero is hung up at the barrier, processes one and three eventually make it (T2). When process two finally makes it to the barrier (T3), all of the processes then begin execution again (T4).
MPI_Barrier can be useful for many things. One of its primary uses is to synchronize a program so that portions of the parallel code can be timed accurately.
Want to know how MPI_Barrier is implemented? Sure you do :-) Do you remember the ring program from the sending and receiving tutorial? To refresh your memory, we wrote a program that passed a token around all processes in a ring-like fashion. This type of program is one of the simplest methods to implement a barrier, since the token can’t complete its trip around the ring until every process has reached the same point.
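To make the idea concrete, here is a minimal sketch of a ring-style barrier built on MPI_Send and MPI_Recv. It is illustrative only (real MPI libraries use more scalable algorithms), and the ring_barrier helper name is ours, not part of MPI:

```c
// Illustrative ring barrier, not how production MPI libraries do it.
// The token makes two trips around the ring: the first trip proves that
// every process has entered the barrier; the second trip releases them.
void ring_barrier(MPI_Comm comm) {
  int rank, size, token = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  if (size == 1) return;

  int next = (rank + 1) % size;
  int prev = (rank - 1 + size) % size;
  if (rank == 0) {
    // Start the token and wait for it to come all the way around, twice.
    MPI_Send(&token, 1, MPI_INT, next, 0, comm);
    MPI_Recv(&token, 1, MPI_INT, prev, 0, comm, MPI_STATUS_IGNORE);
    MPI_Send(&token, 1, MPI_INT, next, 0, comm);
    MPI_Recv(&token, 1, MPI_INT, prev, 0, comm, MPI_STATUS_IGNORE);
  } else {
    // Forward the token on both trips; the second receive is the release.
    MPI_Recv(&token, 1, MPI_INT, prev, 0, comm, MPI_STATUS_IGNORE);
    MPI_Send(&token, 1, MPI_INT, next, 0, comm);
    MPI_Recv(&token, 1, MPI_INT, prev, 0, comm, MPI_STATUS_IGNORE);
    MPI_Send(&token, 1, MPI_INT, next, 0, comm);
  }
}
```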
One final note about synchronization - Always remember that every collective call you make is synchronized. In other words, if you can’t successfully complete an MPI_Barrier, then you also can’t successfully complete any collective call. If you try to call MPI_Barrier or other collective routines without ensuring all processes in the communicator will also call it, your program will idle. This can be very confusing for beginners, so be careful!
Broadcasting with MPI_Bcast
A broadcast is one of the standard collective communication techniques. During a broadcast, one process sends the same data to all processes in a communicator. One of the main uses of broadcasting is to send out user input to a parallel program, or send out configuration parameters to all processes.
The communication pattern of a broadcast looks like this:
In this example, process zero is the root process, and it has the initial copy of data. All of the other processes receive the copy of data.
In MPI, broadcasting can be accomplished by using MPI_Bcast. The function prototype looks like this:
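```c
MPI_Bcast(
    void* data,
    int count,
    MPI_Datatype datatype,
    int root,
    MPI_Comm communicator)
```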
Although the root process and receiver processes do different jobs, they all call the same MPI_Bcast function. When the root process (in our example, it was process zero) calls MPI_Bcast, the data variable will be sent to all other processes. When all of the receiver processes call MPI_Bcast, the data variable will be filled in with the data from the root process.
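For example, a minimal sketch of broadcasting one integer from process zero might look like this (world_rank is assumed to have been obtained from MPI_Comm_rank):

```c
int data;
if (world_rank == 0) {
  // Only the root needs to initialize the value before the broadcast.
  data = 100;
}
// Every process, root and receivers alike, makes the same call.
MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);
// After the call, data == 100 on every process in MPI_COMM_WORLD.
```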
Broadcasting with MPI_Send and MPI_Recv
At first, it might seem that MPI_Bcast is just a simple wrapper around MPI_Send and MPI_Recv. In fact, we can make this wrapper function right now. Our function, called my_bcast, is located in bcast.c. It takes the same arguments as MPI_Bcast and looks like the following sketch (see bcast.c for the exact version):
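```c
void my_bcast(void* data, int count, MPI_Datatype datatype, int root,
              MPI_Comm communicator) {
  int world_rank;
  MPI_Comm_rank(communicator, &world_rank);
  int world_size;
  MPI_Comm_size(communicator, &world_size);

  if (world_rank == root) {
    // If we are the root process, send our data to everyone else.
    int i;
    for (i = 0; i < world_size; i++) {
      if (i != root) {
        MPI_Send(data, count, datatype, i, 0, communicator);
      }
    }
  } else {
    // If we are a receiver process, receive the data from the root.
    MPI_Recv(data, count, datatype, root, 0, communicator,
             MPI_STATUS_IGNORE);
  }
}
```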
The root process sends the data to everyone else while the others receive from the root process. Easy, right? If you run the my_bcast program from the tutorials directory of the repo, the output should look similar to this.
Believe it or not, our my_bcast function is actually very inefficient! Imagine that each process has only one outgoing/incoming network link. Our function is only using one network link from process zero to send all the data. A smarter implementation is a tree-based communication algorithm that can use more of the available network links at once. For example:
In this illustration, process zero starts off with the data and sends it to process one. Similar to our previous example, process zero also sends the data to process two in the second stage. The difference with this example is that process one is now helping out the root process by forwarding the data to process three. During the second stage, two network connections are being utilized at a time. The network utilization doubles at every subsequent stage of the tree communication until all processes have received the data.
Do you think you can code this? Writing this code is a bit outside of the purpose of the lesson. If you are feeling brave, Parallel Programming with MPI is an excellent book with a complete example of the problem with code.
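That said, here is a rough sketch of the doubling scheme from the illustration. The tree_bcast name and structure are ours (it assumes the same arguments as MPI_Bcast), so treat it as a starting point rather than a polished implementation:

```c
// Sketch of a tree-based broadcast. At stage k (mask = 2^k), every process
// that already has the data sends it to the process mask ranks away, so
// the number of active network links doubles each stage.
void tree_bcast(void* data, int count, MPI_Datatype datatype, int root,
                MPI_Comm comm) {
  int rank, size, mask;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm_size(comm, &size);
  int vrank = (rank - root + size) % size;  // rank relative to the root

  for (mask = 1; mask < size; mask <<= 1) {
    if (vrank < mask) {
      // We already hold the data; forward it to our partner at this stage.
      int dest = vrank + mask;
      if (dest < size) {
        MPI_Send(data, count, datatype, (dest + root) % size, 0, comm);
      }
    } else if (vrank < 2 * mask) {
      // Our turn to receive the data from the partner one stage below.
      MPI_Recv(data, count, datatype, (vrank - mask + root) % size, 0,
               comm, MPI_STATUS_IGNORE);
    }
  }
}
```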
Comparison of MPI_Bcast with MPI_Send and MPI_Recv
The MPI_Bcast implementation utilizes a similar tree broadcast algorithm for good network utilization. How does our broadcast function compare to MPI_Bcast? We can run compare_bcast, an example program included in the lesson code (compare_bcast.c). Before looking at the code, let’s first go over one of MPI’s timing functions - MPI_Wtime. MPI_Wtime takes no arguments, and it simply returns a floating-point number of seconds since a set time in the past. Similar to C’s time function, you can call MPI_Wtime at multiple points in your program and subtract the results to obtain the timing of code segments.
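The usual pattern is simply to bracket the code you want to time:

```c
double start = MPI_Wtime();
// ... code segment to time ...
double elapsed = MPI_Wtime() - start;
```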
Let’s take a look at the part of the code that compares my_bcast to MPI_Bcast (sketched below; see compare_bcast.c for the full version):
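```c
// Sketch of the timing loop; the variable names (num_trials, num_elements,
// and the two time accumulators) are assumed to match compare_bcast.c.
for (i = 0; i < num_trials; i++) {
  // Time my_bcast. Synchronize before starting the timer so that all
  // processes begin the broadcast together.
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time -= MPI_Wtime();
  my_bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  // Synchronize again before taking the final time.
  MPI_Barrier(MPI_COMM_WORLD);
  total_my_bcast_time += MPI_Wtime();

  // Time MPI_Bcast in exactly the same way.
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time -= MPI_Wtime();
  MPI_Bcast(data, num_elements, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Barrier(MPI_COMM_WORLD);
  total_mpi_bcast_time += MPI_Wtime();
}
```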
In this code, num_trials is a variable stating how many timing experiments should be executed. We keep track of the accumulated time of each function in two different variables. The average times are printed at the end of the program. To see the entire code, just look at compare_bcast.c in the lesson code.
If you run the compare_bcast program from the tutorials directory of the repo, the output should look similar to this.
The run script executes the code using 16 processors, 100,000 integers per broadcast, and 10 trial runs for timing results. As you can see, my experiment using 16 processors connected via ethernet shows significant timing differences between our naive implementation and MPI’s implementation. Here are the timing results at different scales.
As you can see, there is no difference between the two implementations at two processors. This is because MPI_Bcast’s tree implementation does not provide any additional network utilization when using two processors. However, the differences can clearly be observed when going up to even as few as 16 processors.
Try running the code yourself and experiment at larger scales!
Conclusions / up next
Feel a little better about collective routines? In the next MPI tutorial, I go over other essential collective communication routines - gathering and scattering.
For all lessons, go back to the MPI tutorials page.
This site is hosted entirely on GitHub. This site is no longer being actively contributed to by the original author (Wes Kendall), but it was placed on GitHub in the hopes that others would write high-quality MPI tutorials. Click here for more information about how you can contribute.