MPI: What and Why? (Part 01)

High-Performance Computing

Yasas Rangika Mendis
10 min read · Nov 21, 2019
High-Performance Computing[¹] (image source: https://www.testbytes.net)

In the late 1990s, there was still a question of whether the large vector supercomputers, with their specialized memory systems, could resist the assault from the ever-increasing clock rates of microprocessors. During the same period, there was also a question of whether the fast, expensive, and power-hungry RISC architectures would win out over commodity Intel microprocessors and commodity memory technologies.

By 2006, the Intel architecture had eliminated the RISC-architecture processors by significantly increasing clock rates and decisively winning the increasingly important “Floating Point Operations per Watt” competition. Once users figured out how to use loosely coupled processors effectively, the low overall cost and reduced energy consumption of commodity microprocessors became the overriding factors in the marketplace.

Over the past few decades, the definition of what is called high-performance computing has changed dramatically. In 1988, a personal computer costing $3,000 could perform 0.25 million floating-point operations per second, a workstation costing $20,000 could deliver 3 million floating-point operations per second, and a supercomputer costing $3 million could perform 100 million floating-point operations per second. So why couldn’t we simply connect 400 personal computers together and achieve the same performance as the supercomputer for $1.2 million?

High-performance RISC processors are designed to be easily inserted into a multiple-processor system with 2 to 64 CPUs accessing a single memory using symmetric multiprocessing (SMP). Programming multiple processors to solve a single problem adds its own set of challenges: the programmer must be aware of how the processors operate together and how work can be divided efficiently among them.

Even though each processor is potent and small numbers of processors can be put into a single enclosure, there will often be applications so large that they need to span multiple enclosures. To cooperate in solving these larger applications, the enclosures are linked with a high-speed network to function as a network of workstations (NOW). A NOW can be used individually through a batch queuing system, or as a single large multicomputer using a message-passing tool such as Parallel Virtual Machine (PVM) or the Message Passing Interface (MPI).

Some examples of parallelism, in order of increasing grain size, are:

  • When performing a 32-bit integer addition, by using a carry-lookahead adder, you can partially add bits 0 and 1 at the same time as bits 2 and 3.
  • On a pipelined processor, while decoding one instruction, you can fetch the next instruction.
  • On a two-way superscalar processor, you can execute any combination of an integer and a floating-point instruction in a single cycle.
  • On a multiprocessor, you can divide the iterations of a loop among the four processors of the system.
  • You can split a vast array across four workstations attached to a network. Each workstation can operate on its local portion and then exchange boundary values at the end of each time step.

High-performance computing spans a broad range of systems, from our desktop computers up to large parallel processing systems.

[1] — Some of this content is available at: https://cnx.org/contents/10pAFv4r@3/

Shared Memory Versus Distributed Memory

Shared Memory

Now imagine trying to put all the pieces of a very large jigsaw puzzle together all by yourself. This would undoubtedly take you hours, especially if it is your first time with the puzzle.

Now imagine trying to solve the same puzzle with a group of your friends, all sitting around a table and working together.

Depending on the size of your group, you can even split everyone into small sub-teams to help coordinate the work, and you will be able to reduce the overall time needed to solve the puzzle. However, there are things to consider. First of all, given the size of your table, you will reach a limit at which you can no longer fit more people around the workspace (for example, you run out of chairs).

Furthermore, two teams cannot use the same piece of the puzzle at the same time.

Finally, you may face issues where sub-teams are not communicating well with one another.

The main advantage of the shared-memory approach lies in its simplicity: it is easy to use. On the other hand, it has limited scalability and a high coherence overhead, and as the number of workers grows it becomes trickier to keep everyone synchronized.

In computer architecture, shared memory can look something like this,

[Figure: several CPUs connected to a single shared memory (image credits: Google)]

The workers are the central processing units (CPUs) and the table is the memory. In this context, all CPUs share access to the same memory.
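To make that concrete, here is a minimal sketch of shared-memory parallelism in C using OpenMP (just one possible shared-memory API, chosen for illustration; it is not covered further in this series, and the array size and values are arbitrary). Every thread reads and writes the very same array, just as every worker reaches onto the same table:

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double table[N];   /* the shared "table": one array visible to every thread */
    double sum = 0.0;

    /* each thread fills and sums its own slice of the same shared array */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        table[i] = 0.5 * i;
        sum += table[i];
    }

    printf("sum = %f, computed by up to %d threads sharing one memory\n",
           sum, omp_get_max_threads());
    return 0;
}
```

Because all the threads see the same memory, no explicit messages are needed; this is exactly the simplicity, and the synchronization burden, described above.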

Distributed Memory

In a distributed-memory scenario, each member has their own workspace for their portion of the puzzle. Here the number of tables and chairs is scalable; in other words, we have scalable resources, and there is less contention because each member works in a private memory space.

Going back to CPUs and memory: in a distributed-memory scenario, each CPU-memory pair is connected to a network, allowing the members to communicate with each other.

[Figure: CPU-memory pairs connected by a network (image credits: Google)]

In reality, modern computers have many CPU cores, so we can imagine that each table hosts not just a single member but a whole team of workers.

In a computer network environment, it will look something like this,

[Figure: multi-core nodes, each with its own memory, connected by a network (image credits: Google)]

When a team wants to communicate with another team, it sends its message over the network.

What is MPI?

As computing tasks get larger and larger, whether we want to run bigger and bigger simulations or analyze larger and larger piles of data, our needs can quickly exceed the capabilities of one computer. We might want to solve bigger problems that require more memory or storage than is available in one machine. We might need the computation to go much faster than it would on one computer, and therefore need access to more processors. Or we might simply want to do more computations than is feasible on one computer: for instance, enormous parameter sweeps that would take months or years if run sequentially on a single desktop.

  • Bigger: more memory and storage than a single machine offers
  • Faster: the computation finishes sooner than it would on one processor
  • More: many computations run simultaneously

Message passing is one programming model that has proven very useful for arranging computations across multiple computers. In message passing, each process has its own data, which cannot be seen by the other processes. When data needs to be communicated between processes, they send and receive messages to and from one another. From these basic communication primitives, scientific programmers can build up multi-computer computations. When multiple computers are used together, it is common to refer to the separate, independent computers as nodes; a given node likely has more than one processor, just like your desktop or laptop.
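As a minimal sketch of this model in C (the calls below are from the MPI standard; the value 42 and the assumption of exactly two processes are just for illustration), each process starts with its own private variable, and process 0 sends its value to process 1:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, value = 0;

    MPI_Init(&argc, &argv);                   /* start the MPI runtime           */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which process am I? (0..size-1) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes in total?    */

    if (rank == 0) {
        value = 42;                           /* data private to process 0       */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();                           /* shut the runtime down           */
    return 0;
}
```

With most MPI distributions this compiles with mpicc and launches with something like mpirun -n 2 ./send_recv, though the exact commands depend on the implementation installed on your system.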

MPI is a standard library for message passing. It is ubiquitous: it is probably already installed on, or readily available for, whatever computers you have access to, and it runs on the world’s largest supercomputers. But it can also run efficiently and usefully on your laptop or desktop. Nearly every big academic or commercial package for simulation or data analysis that uses multiple nodes simultaneously uses MPI, directly or indirectly.

If message passing is just sending and receiving messages between computers then why do we bother with MPI?

Why MPI?

This just sounds like network communication, and there are already dozens of libraries out there to handle such things, so why would we need another? First, the MPI interface is designed and implemented with performance in mind. Second, an MPI implementation will take advantage of the fastest transport available to it when sending messages. For instance, to communicate between two processes on the same node, MPI will use shared memory (within a computer) to send and receive messages rather than network communication. On the fast interconnects within a high-performance computing cluster, it already knows how to take advantage of transports like InfiniBand or Myrinet for communication with processes on other nodes, and only if all else fails will it fall back to standard TCP/IP. This represents a massive body of networking code, covering many interfaces and protocols, that you don’t have to implement yourself :)

MPI provides other guarantees as well, such as ensuring that messages arrive rather than being lost and requiring retries, and that messages arrive in order. This enormously simplifies programming. In the end, MPI is designed for multi-node technical computing, which means we can spend our time figuring out how to decompose our scientific problem rather than worrying about network protocols. It is based on a standards process and is the most widely used interface of its type, and it is also easily available. Additionally, for those of us with a particular interest in technical computing, it comes with specialized routines for the collective communications frequently needed in science and engineering computations. The usual message-passing pattern is point-to-point communication, where communication is one-to-one: one process has some data, which it sends to another process, possibly on another node. This is useful because we don’t have to worry about network details, but there are other modes of communication that are frequently needed in parallel computations.

Broadcast (one-to-many)

It is often useful to broadcast data, which is a one-to-many operation: one process has a piece of data and broadcasts it to many or all of the other processes.

[Figure: a broadcast, copying one value from a root process to all others (image courtesy: Software Carpentry)]
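A minimal sketch of a broadcast in C using the standard MPI_Bcast call (the root rank 0 and the value 3.14 are arbitrary choices for illustration):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 3.14;                 /* only the root has the data initially */

    /* after this call, every process's copy of 'value' holds 3.14 */
    MPI_Bcast(&value, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    printf("rank %d now has value %f\n", rank, value);

    MPI_Finalize();
    return 0;
}
```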

Scatter (one-to-many)

A close relative of broadcast is scatter, where one process divides a set of values among many others.

[Figure: a scatter, dividing a root process's values among all processes (image courtesy: Software Carpentry)]
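A minimal sketch of a scatter in C using the standard MPI_Scatter call (the buffer size of 64 is an arbitrary assumption that the job runs with at most 64 processes):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    int sendbuf[64];                  /* significant only on the root; assumes size <= 64 */
    int mine;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            sendbuf[i] = 100 + i;     /* one value destined for each rank */

    /* each process receives exactly one element of the root's array */
    MPI_Scatter(sendbuf, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d got %d\n", rank, mine);

    MPI_Finalize();
    return 0;
}
```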

Gather (many-to-one)

The inverse of scatter is gather, in which many processes hold different parts of the overall picture, and those parts are brought together on one process.

[Figure: a gather, collecting one value from every process onto the root (image courtesy: Software Carpentry)]
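The mirror image of the scatter sketch above, using the standard MPI_Gather call (again assuming at most 64 processes so the root's receive buffer is large enough):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    int recvbuf[64];                  /* only meaningful on the root; assumes size <= 64 */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = rank * rank;           /* each rank's piece of the overall picture */

    /* rank 0 ends up with one element from every process, in rank order */
    MPI_Gather(&mine, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; i++)
            printf("piece from rank %d: %d\n", i, recvbuf[i]);

    MPI_Finalize();
    return 0;
}
```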

Reduction (many-to-one)

Reduction combines communication with computation. Many useful operations, such as finding a global minimum, maximum, sum, or product, are fundamentally reduction operations. Consider computing a global sum of some data: each process calculates its partial sum, and then the partial sums are combined into a global sum on one process.

[Figure: a reduction, combining partial results from every process into one result on the root (image courtesy: Software Carpentry)]
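A minimal sketch of exactly that global sum in C, using the standard MPI_Reduce call with the MPI_SUM operation (here each process's partial sum is just its own rank, as a stand-in for a real computation):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double partial = (double) rank;   /* this process's partial sum    */
    double global  = 0.0;             /* will hold the combined result */

    /* combine the partial sums into a single global sum on rank 0 */
    MPI_Reduce(&partial, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f\n", global);

    MPI_Finalize();
    return 0;
}
```

If every process needs the result rather than only rank 0, the closely related MPI_Allreduce call performs the same combination and leaves the answer on every process.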

Is MPI high-level or low-level?

It is often said that MPI is a very low-level way to program, yet abstracting away networking details and providing global operations sounds relatively high-level. Compared to other networking libraries, MPI is indeed quite high level: you do not have to think about transport details, and it implements those nice collectives. From a technical computing point of view, however, it is still very low level, because what we actually care about is doing things like fluid simulations or bioinformatics analyses.

Low level for scientists:

  • using MPI means we have to figure out how to break up the problem amongst multiple computers (decomposition) and
  • write code for each communication that has to occur between them (manually write code for every communication)

So should we use MPI to do everything?

No

It is probably the best choice for doing a single large computation across similar machines on a reliable network, such as a number of desktops in a lab or a specialized computer cluster. But it would be a nightmare to try to use it to write a distributed chat client or a BitTorrent network, if that were even possible. Message-passing approaches can work very well for such systems; Erlang is an example of a programming language built around message passing that is very good at these sorts of tasks, but MPI is not, because it is not designed to handle the complexity of clients continually connecting to and disconnecting from the network. At the other extreme, if you have lots of independent computations and you just want to parcel them out to run in parallel as a task farm, MPI can work for that, but it is quite a heavyweight tool for a fairly simple task, and other approaches may be easier.

I hope this has given you a basic idea of what MPI is :)

Let’s meet soon in the next part of this blog, ‘How to work with MPI?’ Until then, have a nice time 😃

Special thanks: Ms. M. A. Kalyani (Lecturer, Department of Computer Science, University of Ruhuna, Sri Lanka)


Yasas Rangika Mendis

Software Engineer WSO2 Lanka Private Limited | Bachelor of Computer Science (Special)