======= Speed Testing =======

This log book collects my notes on (and, in this directory tree, all the results from) characterizing the performance of the Einstein Toolkit on the various machines I have access to. It covers all my activity starting from December 7th, 2012.

This section contains several parts that help to understand how well Cactus behaves on different platforms. The main purpose of this testing is to find out how to run simulations on "Fermi".

The main directory where I store the results of the Cactus speed tests is "/work/staff/roberto.depietri/OrstedSpeedTest".

===== General Consideration =====

I decided to use the November 2012 version, announced as follows:

We are pleased to announce the sixth release (code name "Ørsted") of the Einstein Toolkit, an open, community developed software infrastructure for relativistic astrophysics.

The main problems in my previous tests were the strange scaling properties of Carpet when going to 256 or more processors, and the lack of a proper log of what I did. Thanks to Frank Loeffler I realized that the main scaling problem I observed was due to the Carpet IOASCII 1D output. He pointed out to me that all the processors write to the 1D files in an ordered sequence, so the writing time scales linearly with the number of MPI processes involved.

Lesson learned: do not produce any output when testing speed and scaling. Test I/O separately and do not mix the two types of speed test.

The good lesson I learned from the previous tests is the need for standardized configurations to compare against and use as a reference. Always do both a strong and a weak scaling check.

===== UNIGRID tests =====

First check UNIGRID: boundaries at 60,60,60, doing 32 integration steps.

  * PUGH: PUGHit32.rpar generates par files like PUGHdx1.000it32.par
  * CARPET: CARPETit32.rpar generates par files like CARPETdx1.000it32.par

  #################################################################################
  ### dx=[1.5 ....... 0.15]; nx=(60./dx *2 +1 +4);vol=nx.^3;[dx ;nx;vol/vol(1.5PUGH)]
  #################################################################################
  ## dx     2.00   1.50   1.00   0.75  0.625   0.60   0.50   0.40   0.30   0.25   0.20   0.15  0.125
  ## nx       65     85    125    165    197    205    245    305    405    485    605    805    965
  ## vol    0.44   1.00   3.18   7.31  12.45   14.1   23.9   46.2    108    185    361    849   1463
  ##
  ## dx=1.0 Carpet requires 4312.518 MB
  ## dx=1.5 Carpet requires 1356.006 MB
  ## dx=2.0 Carpet requires  606.390 MB
  #################################################################################

===== CARPET tests =====

Then check 3 refinement levels: outer boundaries at 120 and sub-grids at 60 and 30. In this case too we do 32 integration steps on the finest grid. The resolution dx refers to the finest grid.

  * CARPET: CARPET_RL3_it32.rpar generates par files like CARPET_RL3_dx1.000it32.par

  #################################################################################
  ### dx=[1.5 ....... 0.15]; nx=(120./(4*dx) *2 +1 +4);vol=3*nx.^3;[dx ;nx;vol/vol(1.5PUGH)]
  #################################################################################
  ## dx     2.00   1.50   1.00   0.75  0.625   0.60   0.50   0.40   0.30   0.25   0.20   0.15  0.125
  ## nx       35     45     65     85    101    105    125    155    205    245    305    405    485
  ## vol    0.21   0.45   1.34      3   5.03   5.66   9.54   18.2   42.1   71.8    139    324    557
  ##
  ## dx=0.75 Carpet requires 6468.078 MB
  ## dx=1.0  Carpet requires 3318.366 MB
  ## dx=1.5  Carpet requires 1471.679 MB
  ## dx=2.0  Carpet requires 1068.414 MB
  ##
  ## Total time for simulation (np1..t1)= 701 sec (11 minutes)
  ## On Blue Gene/Q, assuming perfect scaling, it will require (1024 cores)
  ## 2.00 1.50 1.00 0.75 0.625 0.60 0.50 0.40 0.30 0.25 0.20 0.15 0.125
  ## 0.01 0.03 0.09 0.23 0.37 0.73 1.42 3.38 5.84 11.4 27.1 46.7
  #################################################################################

===== General problem with the testing =====

The first tests showed that using

  ActiveThorns = "TimerReport"
  TimerReport::output_all_timers_readable = "yes"
  TimerReport::out_every = 32
  TimerReport::out_filename = "TimerReport"
  TimerReport::output_schedule_timers = "no"
  TimerReport::output_all_timers = "no"

strongly affects the test results. For example, "CARPET_RL3_dx0.400it32.par" gives the following results for the timer "Total time for simulation":

^ Blue Gene Size ^ Np ^ Nt ^ Total time for simulation ^
^ With TimerReport ^^^^
|  64|  64|  16|  249 s|
|  64|  128|  8|  251 s|
|  64|  256|  4|  272 s|
|  64|  512|  2|  316 s|
|  64|  1024|  1|  414 s|
|  128|  2048|  1|  564 s|
^ Without TimerReport ^^^^
|  64|  1024|  1|  273 s|
|  128|  2048|  1|  278 s|
|  256|  4096|  1|  362 s|

All the speed tests will be performed without activating "TimerReport".

===== Second stage of Testing =====

The second stage of testing involved only the output of various reductions of "rho". No other output.

^ Parfile: CARPET_RL3_dx(...)it32.par ^^^^^^^
^ dx ^ BG Size ^ # of cores ^ OMP size ^ simulation ^ CCTK_EVOL ^ WALL Time ^
| 0.150 |  256|  4096|  1|  867|  555|  2995|
^ ^^^^^^^
| 0.200 |  256|  4096|  1|  581|  280|  2707|
| 0.200(*) |  128|  2048|  1|  696|  510|  1358|
^ ^^^^^^^
| 0.250 |  256|  4096|  1|  467|  186|  2593|
| 0.250 |  128|  2048|  1|  472|  306|  1130|
| 0.250 |  64|  1024|  1|  632|  518|  827|
^ ^^^^^^^
| 0.300 |  256|  4096|  1|  409|  131|  2536|
| 0.300 |  128|  2048|  1|  270|  220|  1313|
| 0.300 |  64|  1024|  1|  437|  340|  732|
^ ^^^^^^^
| 0.400 |  256|  4096|  1|  362|  90|  2499|
| 0.400 |  128|  2048|  1|  279|  129|  1222|
| 0.400 |  64|  1024|  1|  273|  194|  609|
^ ^^^^^^^
| 0.500 |  256|  4096|  1|  371|  72|  2497|
| 0.500 |  128|  2048|  1|  228|  98|  885|
| 0.500 |  64|  1024|  1|  204|  132|  398|
^ OpenMP vs pure MPI ^^^^^^^
| 0.250 |  64|  1024|  1|  632|  518|  827|
| 0.250 |  64|  1024|  2|  594|  508|  661|
| 0.250 |  64|  1024|  4|  583|  507|  614|
| 0.250 |  64|  1024|  8|  610|  537|  630|
| 0.250 |  64|  1024|  16|  677|  597|  694|
^ ^^^^^^^
| 0.500 |  64|  1024|  1|  204|  132|  398|
| 0.500 |  64|  1024|  2|  185|  134|  254|
| 0.500 |  64|  1024|  4|  172|  127|  202|
| 0.500 |  64|  1024|  8|  173|  131|  194|
| 0.500 |  64|  1024|  16|  184|  143|  203|

(*) This run was also performed with four times as many time integration steps (it=128); the corresponding CCTK_EVOL changed from 510 to 2100 and simulation from 696 to 2524.

===== Evaluation of the time to checkpoints =====