
MPI programming

Main implementations

mpich

MPICH is an MPI library developed at Argonne National Laboratory (ANL): www.mcs.anl.gov/mpi/

The original implementation of MPICH is called MPICH1 and it implements the MPI-1.1 standard. The latest implementation is called MPICH2 and it implements the MPI-2.0 standard.

There is a different distribution for each network interconnection type: MPICH for MPI communication over TCP/IP, MPICH-MX for communication over Myrinet and MVAPICH for InfiniBand.

OpenMPI

OpenMPI http://www.open-mpi.org/ is an MPI library project combining technologies and resources from several other projects (FT-MPI, LA-MPI, LAM/MPI, and PACX-MPI).

The Open MPI library detects the local environment during the initialization of each process and uses the fastest available communication type to deliver messages (e.g. InfiniBand, TCP, shared memory).

mpi-selector

If different MPI flavours are installed on a cluster, the mpi-selector tool sets up your MPI environment for you and sets the paths correctly. Example:

mpi-selector --list
mpi-selector --set openmpi_gcc-1.4.1 

The following Open MPI command shows the available network types (BTL: Byte Transfer Layer):

> ompi_info | grep btl
       MCA btl: openib (MCA v2.0, API v2.0, Component v1.4.1) #Infiniband
       MCA btl: self (MCA v2.0, API v2.0, Component v1.4.1)   #loopback
       MCA btl: sm (MCA v2.0, API v2.0, Component v1.4.1)     #Shared memory
       MCA btl: tcp (MCA v2.0, API v2.0, Component v1.4.1)    #tcp
mpirun

The paradigm for writing MPI applications is SPMD (Single Program Multiple Data): N instances (processes) of the same program are executed on different nodes.

mpirun is the command executed on a “launch host” to start the remote processes via ssh (or rsh).

mpirun -np 2 -host aserv1,aserv2 hostname

The processes can cooperate thanks to the communication and synchronization primitives provided by the MPI library. The user can force MPI to transfer messages using only a subset of the BTLs. Example:

mpirun --mca btl tcp,self  -host aserv1,aserv2   my-mpiprog

If two (or more) network interfaces are active, MPI tries to use all of them in order to balance the network load. The user can force MPI to use a single interface:

mpirun --mca btl tcp,self  -host aserv1,aserv2  --mca btl_tcp_if_include eth1  my-mpiprog

The hosts list can be specified in a hostfile:

cat <<EOF  >> myhostfile
aserv1
aserv2
EOF

mpirun --mca btl tcp,self  -hostfile myhostfile   my-mpiprog
rankfile

Open MPI (version 1.3.x) provides a mechanism for process affinity (based on sched_getaffinity() and sched_setaffinity()). The slot mapping is based on the specification provided in a “rankfile”.

OpenMPI supports memory affinity, meaning that it generally tries to allocate all memory local to the processor that asked for it.

The syntax of the rankfile is similar to that of a hostfile, with the addition of slot specifications for each rank in the following format:

rank N=hostA slot=cpu_num
rank M=hostB slot=socket_num:core_num 

Example:

mpirun -np 4 -hostfile hostfile -rf rankfile ./app
#cat rankfile
rank 0=anode240 slot=0      #run on anode240 bound to CPU0
rank 1=anode240 slot=4-7    #run on anode240 bound to CPUs from CPU4 to CPU7
rank 2=anode241 slot=0:*    #run on anode241 bound to socket0 any core
rank 3=anode241 slot=1:1    #run on anode241 bound to socket1 core 1

MPI programming in C

Environment management routines

The first MPI routine called in any MPI program must be

MPI_Init() 

MPI_Init must be called by every MPI program before any other MPI routine.

An MPI program should call

MPI_Finalize() 

when all communications have completed. Once MPI_Finalize has been called, no other MPI calls can be made.

The routine

MPI_Wtime()

returns the current wall-clock time as a double-precision floating-point number of seconds. This value represents the elapsed time since some arbitrary point in the past. The precision is 0.000001 s.
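A minimal sketch of how MPI_Wtime() can be used to time a section of code (the section being timed is just a placeholder comment here):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        double t_start, t_end;

        MPI_Init(&argc, &argv);

        t_start = MPI_Wtime();          /* timestamp before the work */
        /* ... work to be timed ... */
        t_end = MPI_Wtime();            /* timestamp after the work */

        printf("Elapsed time: %f seconds\n", t_end - t_start);

        MPI_Finalize();
        return 0;
}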

The MPI library provides 2 routines

MPI_Comm_rank() 
MPI_Comm_size()

used by processes to query a communicator for their own rank and for the total number of processes in it (see the Hello World example below).

MPI communicator

MPI_Init() defines a communicator called MPI_COMM_WORLD for every process that calls it.

All MPI communication calls require a communicator argument.

MPI processes can only communicate if they share a communicator.

Each process has its own rank within the communicator: an integer identifier assigned by the system, starting from 0.

MPI helloworld.c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        int numtasks, rank;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        printf("Hello World from process %d of %d\n", rank, numtasks);

        MPI_Finalize();
        return 0;
}
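To compile and run the example (assuming the source file is saved as helloworld.c):

mpicc helloworld.c -o helloworld
mpirun -np 2 -host aserv1,aserv2  helloworld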
Master/worker model

One process acts as the master, while the other processes are workers:

int main(int argc, char **argv)
{
        if (/* process is assigned the master role */) {
                /* assign work, coordinate the workers and collect the results */
                MasterRoutine(/* arguments */);
        } else {
                /* worker process: interact with the master and the other workers,
                   do the work and send the results to the master */
                WorkerRoutine(/* arguments */);
        }
}
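A minimal runnable sketch of this pattern, assuming rank 0 is the master and each worker simply sends back its own rank, using the point-to-point routines described further below:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        int rank, size, i, data;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
                /* master: collect one integer from every worker */
                for (i = 1; i < size; i++) {
                        MPI_Recv(&data, 1, MPI_INT, MPI_ANY_SOURCE, 0,
                                 MPI_COMM_WORLD, &status);
                        printf("Master received %d from rank %d\n",
                               data, status.MPI_SOURCE);
                }
        } else {
                /* worker: do the work (here it just reports its own rank) */
                data = rank;
                MPI_Send(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
}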

Example mpi_name.c

mpicc mpi_name.c -o mpi_name
mpirun -np 2 -host aserv1,aserv2  mpi_name

MPI data types

MPI datatype         C datatype           Bytes
MPI_CHAR             signed char          1
MPI_SHORT            signed short int     2
MPI_INT              signed int           4
MPI_LONG             signed long int      4
MPI_UNSIGNED_CHAR    unsigned char        1
MPI_UNSIGNED_SHORT   unsigned short int   2
MPI_UNSIGNED         unsigned int         4
MPI_UNSIGNED_LONG    unsigned long int    4
MPI_FLOAT            float                4
MPI_DOUBLE           double               8
MPI_LONG_DOUBLE      long double          12
MPI_BYTE             8 binary digits      1
MPI_PACKED           data packed with MPI_Pack(), unpacked with MPI_Unpack()

(Sizes refer to a typical 32-bit x86 Linux platform.)

MPI communication routines

Point-to-point communication

Source process sends message to destination process

Destination process is identified by its rank in the communicator

Blocking send/receive
MPI_Send (&buf,count,datatype,dest,tag,comm)    

# SEND: the routine returns only after the application buffer in the sending task is free for reuse.

MPI_Recv (&buf,count,datatype,source,tag,comm,&status) #source can be MPI_ANY_SOURCE

# RECEIVE: Receive a message and block until the requested data is available in the application buffer in the receiving task.

MPI_Ssend (&buf,count,datatype,dest,tag,comm)

# SYNCHRONOUS SEND: send a message and block until the application buffer in the sending task is free for reuse and the destination process has started to receive the message.

MPI_Sendrecv(sbuf, scount, s_dtype, dest, stag, dbuf, dcount,d_type,src,dtag, comm, &status)

# SENDRECEIVE: Send a message and post a receive before blocking. Will block until the sending application buffer is free for reuse and until the receiving application buffer contains the received message.
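A minimal sketch of a blocking exchange between two processes (run with at least 2 processes, e.g. mpirun -np 2): rank 0 sends an integer to rank 1 with MPI_Send, and rank 1 receives it with MPI_Recv.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        int rank, msg;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
                msg = 42;
                /* blocking send to rank 1, tag 0 */
                MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
                /* blocking receive from rank 0, tag 0 */
                MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
                printf("Rank 1 received %d\n", msg);
        }

        MPI_Finalize();
        return 0;
}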

Non-Blocking send/receive
MPI_Irecv(&buf,count,datatype,source,tag,comm,&request) 

#I-RECEIVE: returns (almost) immediately, without waiting for the message to be received and copied into the application buffer. A communication request handle is returned for tracking the status of the pending message. A subsequent call to MPI_Wait or MPI_Test indicates when the non-blocking receive has completed.

MPI_Isend (&buf,count,datatype,dest,tag,comm,&request)  

# I-SEND: returns (almost) immediately, without waiting for the message to be copied out of the application buffer. A communication request handle is returned for tracking the status of the pending message. A subsequent call to MPI_Wait or MPI_Test indicates when the non-blocking send has completed.

MPI_Test (&request,&flag,&status)

#TEST: checks the status of a non-blocking send/receive request: flag=1 → operation completed, flag=0 → operation not completed.

MPI_Wait (&request,&status)

#WAIT: like MPI_Test, but blocks until the send/receive operation is completed.
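A minimal sketch of the non-blocking calls (run with at least 2 processes): rank 0 posts an MPI_Isend and rank 1 an MPI_Irecv, both can do other work while the message is in flight, and MPI_Wait marks the point after which the buffers may safely be reused.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        int rank, msg, recv_msg;
        MPI_Request request;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
                msg = 7;
                MPI_Isend(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request);
                /* ... overlap computation with communication here ... */
                MPI_Wait(&request, &status);   /* send buffer reusable only after this */
        } else if (rank == 1) {
                MPI_Irecv(&recv_msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
                /* ... do other work while the message is in flight ... */
                MPI_Wait(&request, &status);   /* message available only after this */
                printf("Rank 1 received %d\n", recv_msg);
        }

        MPI_Finalize();
        return 0;
}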

Exercises

PingPong
  • Write a program in which two processes repeatedly pass a message back and forth.
  • Insert timing calls to measure the time taken for one message.
  • Investigate how the time taken to exchange messages varies with the size of the message.

mpi_pingpong.c

Ring

Each process receives an integer from its left neighbour, adds its own rank and sends the result to its right neighbour. Use the MPI_Sendrecv routine (a possible sketch is shown after this paragraph).
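One possible solution sketch, slightly rephrased: each process starts by sending its own rank, then at every step passes on the value it received from the left while adding it to a running sum; after size steps every process holds the sum of all ranks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        int rank, size, left, right, i;
        int send_val, recv_val, sum = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        right = (rank + 1) % size;          /* neighbour to the right */
        left  = (rank - 1 + size) % size;   /* neighbour to the left  */

        send_val = rank;
        for (i = 0; i < size; i++) {
                /* send to the right and receive from the left in a single call */
                MPI_Sendrecv(&send_val, 1, MPI_INT, right, 0,
                             &recv_val, 1, MPI_INT, left, 0,
                             MPI_COMM_WORLD, &status);
                sum += recv_val;            /* accumulate what arrives from the left */
                send_val = recv_val;        /* pass the received value along */
        }

        printf("Rank %d: sum of all ranks = %d\n", rank, sum);
        MPI_Finalize();
        return 0;
}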

Collective communication routines

Collective communication must involve all processes in the scope of a communicator. By default, all processes are members of the communicator MPI_COMM_WORLD. Collective operations are blocking.

Types of collective operations:

Barriers
MPI_Barrier (comm) 

Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call.
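A common use of the barrier is to synchronize all processes before a timed region, so that MPI_Wtime() measures the same interval everywhere; a minimal sketch:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        int rank;
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);   /* all processes start the timed region together */
        t0 = MPI_Wtime();

        /* ... work to be timed ... */

        MPI_Barrier(MPI_COMM_WORLD);   /* wait until every process has finished the work */
        t1 = MPI_Wtime();

        if (rank == 0)
                printf("Timed region took %f seconds\n", t1 - t0);

        MPI_Finalize();
        return 0;
}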

Data movement
MPI_Bcast (&buffer,count,datatype,root,comm) 

Broadcasts (sends) a message from the process with rank “root” to all other processes in the group.

MPI_Scatter (&sendbuf,sendcnt,sendtype,&recvbuf, recvcnt,recvtype,root,comm) 

Distributes distinct messages from a single source task to each task in the group.

MPI_Gather (&sendbuf,sendcnt,sendtype,&recvbuf, recvcount,recvtype,root,comm) 

Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.

MPI_Allgather (&sendbuf,sendcount,sendtype,&recvbuf, recvcount,recvtype,comm) 

Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
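A minimal sketch combining MPI_Scatter and MPI_Gather: the root distributes a block of N integers to every process, each process doubles its block, and the root gathers the results back in rank order (N and the doubling are arbitrary choices for this illustration).

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 4     /* elements per process (arbitrary for this sketch) */

int main(int argc, char **argv)
{
        int rank, size, i;
        int *sendbuf = NULL, *gatherbuf = NULL;
        int recvbuf[N];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
                /* root prepares size*N elements to distribute */
                sendbuf = malloc(size * N * sizeof(int));
                gatherbuf = malloc(size * N * sizeof(int));
                for (i = 0; i < size * N; i++)
                        sendbuf[i] = i;
        }

        /* each process receives its own block of N elements */
        MPI_Scatter(sendbuf, N, MPI_INT, recvbuf, N, MPI_INT, 0, MPI_COMM_WORLD);

        for (i = 0; i < N; i++)
                recvbuf[i] *= 2;            /* local work on the block */

        /* the root collects the processed blocks back, in rank order */
        MPI_Gather(recvbuf, N, MPI_INT, gatherbuf, N, MPI_INT, 0, MPI_COMM_WORLD);

        if (rank == 0) {
                for (i = 0; i < size * N; i++)
                        printf("%d ", gatherbuf[i]);
                printf("\n");
                free(sendbuf);
                free(gatherbuf);
        }

        MPI_Finalize();
        return 0;
}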

Collective Computation (reductions)
MPI_Reduce (&sendbuf,&recvbuf,count,datatype,op,root,comm) 

One member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.

Main operations for reduce:

OP        operation     C type
MPI_MAX   maximum       integer, float
MPI_MIN   minimum       integer, float
MPI_SUM   sum           integer, float
MPI_PROD  product       integer, float
MPI_LAND  logical AND   integer
MPI_BAND  bit-wise AND  integer, MPI_BYTE
MPI_LOR   logical OR    integer
MPI_BOR   bit-wise OR   integer, MPI_BYTE
Reduction example
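A minimal sketch of MPI_Reduce, in which every process contributes its own rank and the root (rank 0) receives the sum (MPI_SUM):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
        int rank, size, sum = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* every process contributes its rank; rank 0 receives the total */
        MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
                printf("Sum of ranks 0..%d = %d\n", size - 1, sum);

        MPI_Finalize();
        return 0;
}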