ROSE Compiler Framework/Arithmetic intensity measuring tool
Overview
A tool to help measure the arithmetic intensity (floating-point operations per byte of memory traffic) of loops. It does so by
- statically estimating floating-point operations and load/store bytes per iteration for user-specified loops
- instrumenting the loops with statements that capture loop iteration counts and calculate FLOPs and memory footprints (load/store bytes)
- letting users run the instrumented code to generate the final reports.
Quick information
- tool location: https://github.com/rose-compiler/rose-develop/tree/master/projects/ArithmeticMeasureTool
- testing: type "make check" within the corresponding build tree
Download and Installation
It is recommended to obtain the tool from the rose-develop repo to have the latest updates.
The first step is to download and install ROSE as usual
- Latest instructions: http://rosecompiler.org/ROSE_HTML_Reference/installation.html
Then
- cd rose-build-tree/projects/ArithmeticMeasureTool
- make && make install
An executable file named measureTool will then be installed within ROSE_INSTALLATION_PATH/bin
Now prepare your environment so the tool can be invoked
    # set.rose file, source it to set up environment variables
    ROSE_INS=/home/liao6/workspace/masterDevClean/install
    export ROSE_INS
    PATH=$ROSE_INS/bin:$PATH
    export PATH
    LD_LIBRARY_PATH=$ROSE_INS/lib:$LD_LIBRARY_PATH
    export LD_LIBRARY_PATH
Command line options
List
- -help: print out help information
- -debug: enable debugging mode, generating screen output showing progress and internal results
- -annot your_annotation_file: accept user specified function side effect annotations, complement compiler analysis
- -static-counting-only : a special execution mode in which the tool scans all loop bodies and writes the counting results into a report file
- -report-file your_report_file.txt : specify your own report file name; otherwise the default file ai_tool_report.txt is used.
- -use-algorithm-v2: use the second version of the algorithm in static counting mode, a bottom-up synthesized traversal that counts FLOPs; still under development
Function side effect annotation
Compiler analysis cannot determine the side effects of all functions. This can be caused by lack of access to library source code or by complex pointer use in the source code. To solve this problem, the tool accepts a function side effect annotation file via the -annot option.
Annotation file format
    operator abs(int val) {
      modify none; read{val}; alias none;
    }
    operator max(double val1, double val2) {
      modify none; read{val1, val2}; alias none;
    }
example command line
- measureTool -c -annot /path/to/functionSideEffect.annot your_input.c
Execution mode 1: static analysis only
This is a special mode in which the tool only finds all loops and counts FLOPs for the loop bodies. The reported numbers are for a single iteration only.
The load/store bytes are represented in two ways
- expression format: such as 3*sizeof(float) + 5*sizeof(double)
- evaluated final integer values: 52
The result is written to a text report file.
Example use
./measureTool -c -static-counting-only -annot ../../../sourcetree/projects/ArithmeticMeasureTool/src/functionSideEffect.annot -I. ../../../sourcetree/projects/ArithmeticMeasureTool/test/jacobi.c
Excerpt of the generated report. Note that the loop at line 129 has two FP addition operations and two FP multiplication operations. It loads 0 bytes and stores one double element (usually 8 bytes). So the final arithmetic intensity (AI) is 4/8 = 0.5 ops/byte.
Content of generated report file: ai_tool_report.txt
    ----------Floating Point Operation Counts---------------------
    SgForStatement@ /home/liao6/workspace/ExReDi/ai_tool/sourcetree/projects/ArithmeticMeasureTool/test/jacobi.c:129:10
    fp_plus:2
    fp_minus:0
    fp_multiply:2
    fp_divide:0
    fp_total:4
    ----------Memory Operation Counts---------------------
    Loads: NULL
    Loads int: 0
    Stores:1 * sizeof(double )
    Store int: 8
    ----------Arithmetic Intensity---------------------
    AI=0.5
Currently
- AI is set to -1.0 if it is uninitialized
- AI is set to 9999.9 if the division would be by zero bytes
User pragma to verify results
In this mode, the translator can verify the tool-generated results by comparing them to the counts given by pragmas in the input code.
The user-provided pragma has the form
    #pragma aitool fp_plus(10) fp_minus(10) fp_multiply(10) fp_divide (10) fp_total(40)
    for () ...

    void error_check ( )
    {
      int i,j;
      double xx,yy,temp,error;

      dx = 2.0 / (n-1);
      dy = 2.0 / (m-1);
      error = 0.0 ;

    #pragma aitool fp_plus(3) fp_minus(3) fp_multiply(6)
      for (i=0;i<n;i++)
        for (j=0;j<m;j++)
        {
          xx = -1.0 + dx * (i-1);
          yy = -1.0 + dy * (j-1);
          temp = u[i][j] - (1.0-xx*xx)*(1.0-yy*yy);
          error = error + temp*temp;
        }
      error = sqrt(error)/(n*m);
      printf("Solution Error :%E \n",error);
    }
fp_total is required while the clauses of other kinds of FP operations are optional.
Execution mode 2: analyze and instrument the code
This is the default mode.
Manually instrument the input code
The tool currently works in collaboration with user-added code instrumentation, using the following steps:
- declare four global counters with specific variable names, which will later be recognized by the tool
- add chiterations = .. before each loop whose FLOPs and load/store bytes you want to count
- print out the results: printf ("chflops =%lu chloads =%lu chstores=%lu\n", chflops, chloads, chstores);
    #include <stdio.h>
    #define SIZE 10

    // Instrumentation 1: add a few global variables
    unsigned long int chiterations = 0;
    unsigned long int chloads = 0;
    unsigned long int chstores = 0;
    unsigned long int chflops = 0;

    double ref[2] = {9.2, 5.4};
    double coarse[SIZE][SIZE][SIZE];
    int main()
    {
      double refScale = 1.0 / (ref[0] * ref[1]);
      int iboxlo1 = 0, iboxlo0 = 0, iboxhi1 = SIZE-1, iboxhi0 = SIZE-1;
      int var;
      int ic1=0, ic0=0;
      int ip0 = ic0 * ref[0];
      int ip1 = ic1 * ref[1];
      double coarseSum = 0.0;
      int ii1, ii0;

      for (var =0; var < SIZE ; var++)
      {
        //Instrumentation 2: pass in loop iteration for the loop to be counted
        chiterations = (1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0);
        for (ic1 = iboxlo1; ic1< iboxhi1 +1; ic1++)
          for (ic0 = iboxlo0; ic0< iboxhi0 +1; ic0++)
          {
            int ibreflo1 = 0, ibreflo0 = 0, ibrefhi1 = SIZE-1, ibrefhi0 = SIZE-1;
            //Instrumentation 3: pass in loop iteration for the loop to be counted
            chiterations = (1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0);
            for (ii1 = ibreflo1; ii1< ibrefhi1 +1; ii1++)
              for (ii0 = ibreflo0; ii0< ibrefhi0 +1; ii0++)
              {
                coarseSum = coarseSum + coarse[ii1][ii0][ii1] +(ip0 + ii0) + (ip1 + ii1) + var;
              }
            coarse[ic0][ic1][var] = coarseSum * refScale;
          }
      }
      //Instrumentation 4: print out results
      printf ("chflops =%lu chloads =%lu chstores=%lu\n", chflops, chloads, chstores);
      return 0;
    }
Use the tool to transform the code
./measureTool -c -annot ../../../sourcetree/projects/ArithmeticMeasureTool/src/functionSideEffect.annot nestedloops.c
The tool will
- count the FLOPs and load store bytes for the specified loop
- add counter accumulation statements, using different counters for different loops
    #include <stdio.h>
    #define SIZE 10
    // Instrumentation 1: add a few global variables
    unsigned long chiterations = 0;
    unsigned long chloads = 0;
    unsigned long chstores = 0;
    unsigned long chflops = 0;
    double ref[2] = {(9.2), (5.4)};
    double coarse[10][10][10];

    int main()
    {
      double refScale = 1.0 / (ref[0] * ref[1]);
      int iboxlo1 = 0;
      int iboxlo0 = 0;
      int iboxhi1 = 10 - 1;
      int iboxhi0 = 10 - 1;
      int var;
      int ic1 = 0;
      int ic0 = 0;
      int ip0 = (ic0 * ref[0]);
      int ip1 = (ic1 * ref[1]);
      double coarseSum = 0.0;
      int ii1;
      int ii0;
      unsigned long chiterations_1;
      unsigned long chiterations_2;
      for (var = 0; var < 10; var++) {
        //Instrumentation 2: pass in loop iteration for the loop to be counted
        chiterations_2 = (1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0);
        for (ic1 = iboxlo1; ic1 < iboxhi1 + 1; ic1++) {
          for (ic0 = iboxlo0; ic0 < iboxhi0 + 1; ic0++) {
            int ibreflo1 = 0;
            int ibreflo0 = 0;
            int ibrefhi1 = 10 - 1;
            int ibrefhi0 = 10 - 1;
            //Instrumentation 3: pass in loop iteration for the loop to be counted
            chiterations_1 = (1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0);
            for (ii1 = ibreflo1; ii1 < ibrefhi1 + 1; ii1++) {
              for (ii0 = ibreflo0; ii0 < ibrefhi0 + 1; ii0++) {
                coarseSum = coarseSum + coarse[ii1][ii0][ii1] + (ip0 + ii0) + (ip1 + ii1) + var;
              }
            }
            /* aitool generated Loads counting statement ... */
            chloads = chloads + chiterations_1 * (1 * sizeof(double ));
            /* aitool generated FLOPS counting statement ... */
            chflops = chflops + chiterations_1 * 4;
            coarse[ic0][ic1][var] = coarseSum * refScale;
          }
        }
        /* aitool generated Stores counting statement ... */
        chstores = chstores + chiterations_2 * (1 * sizeof(double ));
        /* aitool generated FLOPS counting statement ... */
        chflops = chflops + chiterations_2 * 1;
      }
      //Instrumentation 4: pass in loop iteration for the loop to be counted
      printf("chflops =%lu chloads =%lu chstores=%lu\n",chflops,chloads,chstores);
      return 0;
    }
Compile and run the transformed code
gcc -O3 rose_nestedloops.c -o nestedloops.out
./nestedloops.out
The result looks like
chflops =401000 chloads =800000 chstores=8000
Limitations
The tool does not support Fortran loops with function calls for now
- ROSE's Fortran procedure/routine representation is not accurate enough (missing parameter type info.) to hook up with function side effect annotations designed to match C/C++ functions.
Internals
Execution mode variable running_mode
- e_analysis_and_instrument
- e_static_counting
FP operations
class FPCounters: public AstAttribute {}; to store analysis results
void CountFPOperations() from src/ai_measurement.cpp
    Rose_STL_Container<SgNode*> nodeList = NodeQuery::querySubTree(input, V_SgBinaryOp);
    for (Rose_STL_Container<SgNode *>::iterator i = nodeList.begin(); i != nodeList.end(); i++)
    {
      fp_operation_kind_enum op_kind = e_unknown;
      // bool isFPType = false;
      // check operation type
      SgBinaryOp* bop= isSgBinaryOp(*i);
      switch (bop->variantT())
      {
        case V_SgAddOp:
        case V_SgPlusAssignOp:
          op_kind = e_plus;
          break;
        case V_SgSubtractOp:
        case V_SgMinusAssignOp:
          op_kind = e_minus;
          break;
        case V_SgMultiplyOp:
        case V_SgMultAssignOp:
          op_kind = e_multiply;
          break;
        case V_SgDivideOp:
        case V_SgDivAssignOp:
          op_kind = e_divide;
          break;
        default:
          break;
      } //end switch
      ...
    }
Load/Store bytes
The main functions are defined in ai_measurement.cpp:
- std::pair <SgExpression*, SgExpression*> CountLoadStoreBytes (SgLocatedNode* input, bool includeScalars /* = true */, bool includeIntType /* = true */)
- SgExpression* calculateBytes (std::set<SgInitializedName*>& name_set, SgStatement* lbody, bool isRead)
Both functions return expressions that calculate the value, not the actual values, since sizeof(type) is machine dependent.
Configuration
- By default: only array references are counted. Scalars are ignored.
Algorithm
- Call side effect analysis to find the variables read and written; some references may trigger both read and write accesses. If the analysis succeeds, proceed. Otherwise, a warning is issued.
- Accesses to the same array or scalar variable are grouped into one read (or write) access: e.g. array[i][j], array[i][j+1], array[i][j-1], etc. are counted as a single access.
- Group accesses by type: accesses of the same type increment the same counter, which shortens the generated expression.
- Iterate over the results to generate an expression such as 2*sizeof(float) + 5*sizeof(double).
- As an approximation, a simple analysis is used here that assumes no function calls.
    // Obtain per-iteration load/store bytes calculation expressions
    // excluding scalar types to match the manual version
    //CountLoadStoreBytes (SgLocatedNode* input, bool includeScalars = true, bool includeIntType = true);
    std::pair <SgExpression*, SgExpression*> load_store_count_pair = CountLoadStoreBytes (loop_body, false, true);
    // chstores=chstores+chiterations*8
    if (load_store_count_pair.second!= NULL)
    {
      SgExprStatement* store_byte_stmt = buildCounterAccumulationStmt("chstores", new_iter_var_name, load_store_count_pair.second, scope);
      insertStatementAfter (loop, store_byte_stmt);
      attachComment(store_byte_stmt," aitool generated Stores counting statement ...");
    }
    // handle loads stmt 2nd so it can be inserted as the first after the loop
    // build chloads=chloads+chiterations*2*8
    if (load_store_count_pair.first != NULL)
    {
      SgExprStatement* load_byte_stmt = buildCounterAccumulationStmt("chloads", new_iter_var_name, load_store_count_pair.first, scope);
      insertStatementAfter (loop, load_byte_stmt);
      attachComment(load_byte_stmt," aitool generated Loads counting statement ...");
    }
Nested loops
Scientific applications usually have nested loops. Naive instrumentation causes two problems
- double counting of the nested loop body
- the chiterations = .. statement is used for all loop levels, so the inner loop's chiterations overwrites the value set for the outer loop.
Solutions
- The translator uses a bottom-up traversal order: inner loops are processed first, then outer loops.
- To avoid double counting FP operations within nested loops, all visited FP operation expressions are stored in a lookup table. Later counting checks whether an operation has already been counted and, if so, skips it.
- To avoid double counting variables used in nested loops when counting an outer loop body, the handling is slightly different from that of FP operation expressions. All variables counted in inner loops are found and excluded when counting for an outer loop. Note: a variable a counted in an inner loop is excluded entirely, rather than flagging individual references to a and excluding those references later.
- Note: static counting mode does not perform this exclusion, since redundant execution of the counting statements is not a concern there. The loop body's FLOPs are still counted for both inner and outer loops if they are nested.
- Rewrite chiterations = .. to chiterations_loopId = .. , so each loop has its own iteration count variable.
    // global chiterations is changed to two local variables: each for one loop
    unsigned long chiterations_1;
    unsigned long chiterations_2;
    for (var = 0; var < 10; var++) {
      //Instrumentation 2: pass in loop iteration for the loop to be counted
      chiterations_2 = ((1 + iboxhi1 - iboxlo1) * (1 + iboxhi0 - iboxlo0) * 1);
      for (ic1 = iboxlo1; ic1 < iboxhi1 + 1; ic1++) {
        for (ic0 = iboxlo0; ic0 < iboxhi0 + 1; ic0++) {
          int ibreflo1 = 0;
          int ibreflo0 = 0;
          int ibrefhi1 = 10 - 1;
          int ibrefhi0 = 10 - 1;
          //Instrumentation 3: pass in loop iteration for the loop to be counted
          chiterations_1 = ((1 + ibrefhi1 - ibreflo1) * (1 + ibrefhi0 - ibreflo0) * 1);
          for (ii1 = ibreflo1; ii1 < ibrefhi1 + 1; ii1++) {
            for (ii0 = ibreflo0; ii0 < ibrefhi0 + 1; ii0++) {
              coarseSum = coarseSum + coarse[ii1][ii0][ii1] + (ip0 + ii0) + (ip1 + ii1) + var;
            }
          }
          /* aitool generated Loads counting statement ... */
          chloads = chloads + chiterations_1 * (1 * sizeof(double ));
          /* aitool generated FLOPS counting statement ... */
          chflops = chflops + chiterations_1 * 4;
          coarse[ic0][ic1][var] = coarseSum * refScale;
        }
      }
      /* aitool generated Stores counting statement ... */
      chstores = chstores + chiterations_2 * (1 * sizeof(double ));
      /* aitool generated FLOPS counting statement ... */
      chflops = chflops + chiterations_2 * 1;
    }
Testing
Run all built-in tests
- make check
Run tests for static analysis only
- make check-static
Manual testing
- [liao6@tux322:~/workspace/ExReDi/ai_tool.git/translator]m && ./measureTool -c -annot ./src/functionSideEffect.annot -I. ./test/jacobi-v3.c