======= Speed Testing =======

Here I report notes on my activity (kept in this directory tree) and all the results I obtain while characterizing the performance of the Einstein Toolkit on the various machines I have access to. This logbook covers all my activity starting from December 7th, 2012. This section contains various parts that help to understand how well Cactus behaves on different platforms. The main purpose of this testing is to find out how to run simulations on "Fermi".

The main directory where I store the results of the Cactus speed tests is "/work/staff/roberto.depietri/OrstedSpeedTest"


===== General Consideration =====

I decided to use the November 2012 version, announced as follows:
We are pleased to announce the sixth release (code name "Ørsted") of the Einstein Toolkit, an open, 
community developed software infrastructure for relativistic astrophysics.

The main problems in the previous tests I did were strange scaling properties of Carpet
when going to 256 or more processors, and the lack of a proper log of my activity.
Thanks to Frank Loeffler I realized that the main scaling problem I observed
was due to CARPET IOASCII for 1D output. He pointed out to me that all the processors
write to the 1D files in an ordered sequence, so the writing time scales
linearly with the number of MPI processes involved. Lesson learned: do not produce output
when testing speed and scaling. Do separate I/O testing and do not mix up the two
types of speed testing.

The good lesson I learned from the previous tests is the need to have standardized
configurations to compare against and use as a reference. Always do both a strong and a weak scaling
check.

===== UNIGRID tests =====

First check UNIGRID: borders at 60,60,60, doing 32 integration steps.

    PUGH:   PUGHit32.rpar   generate par files like PUGHdx1.000it32.par      
    CARPET: CARPETit32.rpar generate par files like CARPETdx1.000it32.par
    #################################################################################
    ### dx=[1.5 ....... 0.15]; nx=(60./dx *2 +1 +4);vol=nx.^3;[dx ;nx;vol/vol(1.5PUGH)]
    ##################################################################################
    ##  2.00  1.50  1.00  0.75  0.625  0.60  0.50  0.40  0.30  0.25  0.20  0.15  0.125
    ##    65    85   125   165    197   205   245   305   405   485   605   805    965
    ##  0.44  1.00  3.18  7.31  12.45  14.1  23.9  46.2   108   185   361   849   1463
    ##
    ##  dx=1.0 Carpet requires 4312.518 MB
    ##  dx=1.5 Carpet requires 1356.006 MB
    ##  dx=2.0 Carpet requires  606.390 MB
    #################################################################################
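The grid-size formula quoted in the block above can be checked with a few lines of Python. This is only a sketch: the function names are mine, and reading the "+4" as extra boundary/ghost points per dimension is an assumption, since the note gives only the bare formula.

```python
def unigrid_points(dx, half_width=60.0, extra=4):
    """Points per dimension, following nx = 60/dx*2 + 1 + 4.

    `extra` is the "+4" of the formula; treating it as boundary/ghost
    points is an assumption, the note does not say what it stands for.
    """
    return round(2.0 * half_width / dx) + 1 + extra


def relative_volume(dx, ref_dx=1.5):
    """Total number of grid points relative to the dx=1.5 reference grid."""
    return unigrid_points(dx) ** 3 / unigrid_points(ref_dx) ** 3


if __name__ == "__main__":
    for dx in (2.0, 1.5, 1.0, 0.75, 0.5, 0.25, 0.125):
        print(f"dx={dx:<6} nx={unigrid_points(dx):4d} vol={relative_volume(dx):8.2f}")
```

The relative volume is the natural first guess for how memory and run time grow as dx is reduced.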

===== CARPET tests =====

Then check 3 refinement levels, with the outer borders at 120 and the subgrids at 60 and 30.
Also in this case we do 32 integration steps on the finest grid. The resolution
dx refers to the finest grid.

    CARPET: CARPET_RL3_it32.rpar generate par files like CARPET_RL3_dx1.000it32.par
    #################################################################################
    ### dx=[1.5 ....... 0.15]; nx=(120./(4*dx) *2 +1 +4);vol=3*nx.^3;[dx ;nx;vol/vol(1.5PUGH)]
    ##################################################################################
    ##  2.00  1.50  1.00  0.75  0.625  0.60  0.50  0.40  0.30  0.25  0.20  0.15  0.125
    ##   35     45    65    85    101   105   125   205   245   305   405   485    605
    ##  0.21  0.45  1.34     3   5.03  5.66  9.54  42.1  71.8   139   324   557   1081
    ##
    ##  dx=0.75 Carpet requires 6468.078 MB
    ##  dx=1.0 Carpet requires 3318.366 MB
    ##  dx=1.5 Carpet requires 1471.679 MB
    ##  dx=2.0 Carpet requires 1068.414 MB
    ##     Total time for simulation  (np1..t1)= 701 sec (11 minutes)
    ## On Blue Gene/Q, assuming perfect scaling, this will require (1024 cores)
    ##  2.00  1.50  1.00  0.75  0.625  0.60  0.50  0.40  0.30  0.25  0.20  0.15  0.125
    ##  0.01  0.03  0.09  0.23  0.37         0.73  1.42  3.38  5.84  11.4  27.1  46.7
    #################################################################################


===== General problem with the testing =====


The first tests showed that the use of

  ActiveThorns = "TimerReport"
  TimerReport::output_all_timers_readable ="yes"
  TimerReport::out_every=32
  TimerReport::out_filename = "TimerReport"
  TimerReport::output_schedule_timers = "no"
  TimerReport::output_all_timers = "no"

strongly affects the test results. For example, "CARPET_RL3_dx0.400it32.par" gives the following results for the timer "Total time for simulation":

       Blue Gene Size    Np  Nt  Total time for simulation
  With TimerReport  64   64  16  249 s
                    64  128   8  251 s
                    64  256   4  272 s
                    64  512   2  316 s
                    64 1024   1  414 s
                   128 2048   1  564 s
  Without           64 1024   1  273 s
                   128 2048   1  278 s
                   256 4096   1  362 s

All the speed tests will be performed without the activation of "TimerReport".
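To put a number on the effect, the two configurations that were run both with and without "TimerReport" (Np=1024 and Np=2048, Nt=1) can be compared directly. The snippet below is a small sketch using only the values from the table above; the variable and function names are my own.

```python
# Total time for simulation in seconds, keyed by number of MPI processes (Nt=1).
# Values copied from the table above.
with_timer = {1024: 414.0, 2048: 564.0}
without_timer = {1024: 273.0, 2048: 278.0}


def overhead_percent(t_with, t_without):
    """Extra run time caused by TimerReport, as a percentage of the clean run."""
    return 100.0 * (t_with - t_without) / t_without


overheads = {
    nproc: overhead_percent(with_timer[nproc], without_timer[nproc])
    for nproc in sorted(with_timer)
}

if __name__ == "__main__":
    for nproc, pct in overheads.items():
        print(f"Np={nproc}: TimerReport overhead = {pct:.1f}%")
```

The overhead grows from roughly one half of the run time at 1024 processes to about the whole run time again at 2048.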


===== Second stage of Testing =====

The second stage of testing involved only the output of various reductions of "rho", with no other output.

^ Parfile: CARPET_RL3_dx(...)it32.par ^^^^^^^
^ dx ^ BG Size  ^ # of cores ^ OMP size ^ simulation ^ CCTK_EVOL ^ WALL Time ^ 
| 0.150  |  256|  4096|   1|  867|  555|  2995| 
^ ^^^^^^^
| 0.200  |  256|  4096|   1|  581|  280|  2707| 
| 0.200(*) |  128|  2048|   1|  696|  510|  1358| 
^ ^^^^^^^
| 0.250  |  256|  4096|   1|  467|  186|  2593| 
| 0.250  |  128|  2048|   1|  472|  306|  1130| 
| 0.250  |   64|  1024|   1|  632|  518|   827|
^ ^^^^^^^
| 0.300  |  256|  4096|   1|  409|  131|  2536| 
| 0.300  |  128|  2048|   1|  270|  220|  1313| 
| 0.300  |   64|  1024|   1|  437|  340|   732|
^ ^^^^^^^
| 0.400  |  256|  4096|   1|  362|   90|  2499| 
| 0.400  |  128|  2048|   1|  279|  129|  1222| 
| 0.400  |   64|  1024|   1|  273|  194|   609|
^ ^^^^^^^
| 0.500  |  256|  4096|   1|  371|   72|  2497| 
| 0.500  |  128|  2048|   1|  228|   98|   885| 
| 0.500  |   64|  1024|   1|  204|  132|   398|
^ OpenMP vs pure MPI ^^^^^^^ 
| 0.250  |   64|  1024|   1|  632|  518|   827|
| 0.250  |   64|  1024|   2|  594|  508|   661| 
| 0.250  |   64|  1024|   4|  583|  507|   614|
| 0.250  |   64|  1024|   8|  610|  537|   630|
| 0.250  |   64|  1024|  16|  677|  597|   694| 
^ ^^^^^^^
| 0.500  |   64|  1024|   1|  204|  132|   398|
| 0.500  |   64|  1024|   2|  185|  134|   254| 
| 0.500  |   64|  1024|   4|  172|  127|   202|
| 0.500  |   64|  1024|   8|  173|  131|   194|
| 0.500  |   64|  1024|  16|  184|  143|   203| 

(*) This run was also performed with four times as many time-integration steps (it=128);
the corresponding CCTK_EVOL time changed from 510 to 2100 and the simulation time from 696 to 2524.
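As an illustration of the strong scaling check, the evolution timings for dx = 0.250 in the table above (518, 306 and 186 on 1024, 2048 and 4096 cores) can be turned into speedup and parallel efficiency. This is a minimal sketch; the helper name and the choice of the smallest core count as baseline are my own assumptions.

```python
def strong_scaling(times_by_cores):
    """Return {cores: (speedup, efficiency)} relative to the smallest core count.

    speedup    = T(base) / T(cores)
    efficiency = speedup / (cores / base_cores), 1.0 meaning perfect scaling.
    """
    base_cores = min(times_by_cores)
    base_time = times_by_cores[base_cores]
    result = {}
    for cores, elapsed in sorted(times_by_cores.items()):
        speedup = base_time / elapsed
        efficiency = speedup / (cores / base_cores)
        result[cores] = (speedup, efficiency)
    return result


# Evolution timings for dx = 0.250, OMP size 1, copied from the table above.
evol_dx025 = {1024: 518.0, 2048: 306.0, 4096: 186.0}
scaling = strong_scaling(evol_dx025)

if __name__ == "__main__":
    for cores, (speedup, efficiency) in scaling.items():
        print(f"{cores:5d} cores: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

The same function can be applied to the other dx blocks of the table to compare how the efficiency changes with problem size.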

===== Evaluation of the time to checkpoints =====