ROSE Compiler Framework/autoPar
This is a tool which can automatically insert OpenMP pragmas into input serial C/C++ codes.
Overview
[edit | edit source]This tool is an implementation of automatic parallelization using OpenMP. It can automatically insert OpenMP directives into input serial C/C++ codes. For input programs with existing OpenMP directives, the tool will double check the correctness when the right option is turned on.
- The source files are currently located in rose/projects/autoParallelization.
- A standalone executable program (named autoPar ) is generated and installed to the installation tree of ROSE (under ROSE INS/bin).
- Test input files are located at rose/projects/autoParallelization/tests
- You can test the tool in rose_build_tree/projects/autoParallelization by typing "make check"
- There is a section in ROSE manual: 12.7 Automatic Parallelization pdf
- It is used to explore semantics-aware automatic parallelization, as described in our papers:
- A workshop paper: Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock and Thomas Panas, Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore, In Proceedings of the 5th international Workshop on OpenMP: Evolving OpenMP in An Age of Extreme Parallelism (Dresden, Germany, June 3–05, 2009). pdf
- A journal version of the paper: Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock and Thomas Panas, Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions, Journal of Parallel Programming, Accepted in Jan. 2010 pdf
Similar to ROSE, autoPar is released under the BSD license.
Installation and configuration
[edit | edit source]autoPar is released as part of the ROSE compiler. Please install ROSE by following the instructions at
- Installation
- type make install-core -j8 to install core ROSE libraries and prebuilt tools.
An executable file autoPar is created during the process of building ROSE. Users can use it to parallelize their codes. autoPar will be installed to the path specified by –prefix=ROSE INSTALLATION PATH. Here are some instructions.
During configuration, you might want to specify a prefix path, such as –prefix=/home/youraccount/opt/roserev-8345. Then, follow the ROSE installation guide to build and install ROSE, often by typing make and make install.
The next step is to set environment variables for search paths for ROSE executables (including autoPar ) and shared libraries. Assume you are using bash, put the following lines into a file (e.g. set.rose) and source it (typing source set.rose)
ROSE_INS=/home/youraccount/opt/rose-rev-8345 export ROSE_INS PATH=$ROSE_INS/bin:$PATH export PATH LD_LIBRARY_PATH=$ROSE_INS/lib:$LD_LIBRARY_PATH export LD_LIBRARY_PATH
Now you can use autoPar to parallelize your code. Assume you have a sequential file: file.c, just type autoPar -c file.c. An output file rose file.c will be generated. If autoPar can find anything parallelizable, the output file should have OpenMP pragmas generated.
Alternative: using the virtual machine image
[edit | edit source]If you don't want to install everything from scratch, you can download the vmware image of ROSE and directly use the installed copy of ROSE inside of it.
More information about this is at ROSE_Compiler_Framework/Virtual_Machine_Image.
Within the virtual machine, you can cd into
/home/demo/buildrose/projects/autoParallelization
and type
make check
You will see autoPar is used to process input files.
Command line options
[edit | edit source][~build64/projects/autoParallelization/tests]../autoPar—help
AUTOPAR(1) ROSE Command-line Tools AUTOPAR(1) Name autoPar - This tool automatically inserts OpenMP directives into sequential codes. Synopsis autoPar switches files... Description This tool is an implementation of automatic parallelization using OpenMP. It can automatically insert OpenMP directives into input serial C/C++ codes. For input programs with existing OpenMP directives, the tool will double check the correctness when requested. Switches General switches --assert how Determines how a failed assertion behaves. The choices are "abort", "exit" with a non-zero value, or "throw" a Rose::Diagnostics::FailedAssertion exception. The default behavior depends on how ROSE was configured. --help; -h Show this documentation. --log config Configures diagnostics. Use "--log=help" and "--log=list" to get started. --threads n Number of threads to use for algorithms that support multi-threading. The default is 0. A value of zero means use the same number of threads as there is hardware concurrency (or one thread if the hardware concurrency can't be determined). --version; -V Shows the dotted quad ROSE version and then exits. See also --version-long, which prints much more information. --version-long Shows version information for ROSE and various dependencies and then exits. The shorter --version switch shows only the dotted quad of the ROSE library itself. autoPar's switches These switches control the autoPar tool. --[rose:autopar:]annot string Specify annotation file for semantics of abstractions --[rose:autopar:]dumpannot Dump annotation file content for debugging purposes. --[rose:autopar:]enable_debug Enable the debugging mode. --[rose:autopar:]enable_diff Compare user defined OpenMP pragmas to auto parallelization generated ones. --[rose:autopar:]enable_distance Report the absolute dependence distance of a dependence relation preventing parallelization. --[rose:autopar:]enable_patch Enable generating patch files for auto parallelization --[rose:autopar:]failure_report string Specify the report file for logging files autoPar cannot process --[rose:autopar:]keep_going Allow auto parallelization to keep going if errors happen --[rose:autopar:]no_aliasing Assuming no pointer aliasing exists. --[rose:autopar:]success_report string Specify the report file for logging files autoPar can process --[rose:autopar:]unique_indirect_index Assuming all arrays used as indirect indices have unique elements (no overlapping) ROSE builtin switches --rose:help Show the old-style ROSE help. 0.9.9.123 Friday October 13 10:16:56 2017
Additional useful rose flags
- -rose:skipfinalCompileStep // skip invoking the backend compiler to compile the transformed code, this is useful to workaround some bugs
- --edg:no_warnings // suppress warnings from the EDG C++ frontend
Example usage
[edit | edit source]Testing input files can be found at https://github.com/rose-compiler/rose-develop/tree/master/projects/autoParallelization/tests
The corresponding generated testing output files can be found at: https://github.com/chunhualiao/autoPar-demo
We provide two samples below:
Without using annotations
[edit | edit source]Input:
/* Only the inner loop can be parallelized */ void foo() { int n=100, m=100; double b[n][m]; int i,j; for (i=0;i<n;i++) for (j=0;j<m;j++) b[i][j]=b[i-1][j-1]; }
Command line and screen output
../autoPar -rose:C99—edg:no_warnings -w -rose:verbose 0 --edg:restrict -rose:autopar:unique_indirect_index -rose:autopar:enable_patch -I../../../../sourcetree/src/frontend/SageIII -c ../../../../sourcetree/projects/autoParallelization/tests/inner_only.c
Enabling generating patch files for auto parallelization ... Assuming all arrays used as indirect indices have unique elements (no overlapping) ... Unparallelizable loop at line:8 due to the following dependencies: 2*2 TRUE_DEP; commonlevel = 2 CarryLevel = (0,0) Is precise SgPntrArrRefExp:b[i][j]@10:14->SgPntrArrRefExp:b[i - 1][j - 1]@10:21 == -1;* 0;||* 0;== -1;||:: Automatically parallelized a loop at line:9
Output:
/* Only the inner loop can be parallelized */ #include "omp.h" void foo() { int n = 100; int m = 100; double b[n][m]; int i; int j; for (i = 0; i <= n - 1; i += 1) { #pragma omp parallel for private (j) firstprivate (n,m,i) for (j = 0; j <= m - 1; j += 1) b[i][j] = b[i - 1][j - 1]; } }
Another input:
int i,j; int a[100][100]; void foo() { for (i=0;i<100;i++) for (j=0;j<100;j++) a[i][j]=a[i][j]+1; }
Output:
#include "omp.h" int i; int j; int a[100UL][100UL]; void foo() { #pragma omp parallel for private (i,j) for (i = 0; i <= 100 - 1; i += 1) { #pragma omp parallel for private (j) firstprivate (i) for (j = 0; j <= 100 - 1; j += 1) a[i][j] = (a[i][j] + 1); } }
Using annotations
[edit | edit source]Annotation files provide additional program information which is traditionally hard for compilers to extract, such as aliasing, side effect information.
Multiple annotations files can be loaded in one command line
../autoPar --edg:no_warnings -w -rose:verbose 0 --edg:restrict -rose:autopar:unique_indirect_index -I/home/liao6/rose/freshmaster/sourcetree/src/frontend/SageIII -c -annot /home/liao6/rose/freshmaster/sourcetree/projects/autoParallelization/tests/floatArray.annot -annot /home/liao6/rose/freshmaster/sourcetree/projects/autoParallelization/tests/funcs.annot -annot /home/liao6/rose/freshmaster/sourcetree/projects/autoParallelization/tests/Index.annot /home/liao6/rose/freshmaster/sourcetree/projects/autoParallelization/tests/interp1_elem2.C
Assuming all arrays used as indirect indices have unique elements (no overlapping) ...
Automatically parallelized a loop at line:13
Input
//#include <A++.h> #include "simpleA++.h" void interpolate1D(class floatArray &fineGrid,class floatArray &coarseGrid) { int _var_0,i; int _var_1; int fineGridSize = fineGrid.length(0); int coarseGridSize = coarseGrid.length(0); // Interior fine points class Range If(2,_var_1 = (fineGridSize - 2),2); class Range Ic(1,(coarseGridSize - 1),1); for (i = 1; i < _var_0; i += 1) { fineGrid.elem(i) =fineGrid.elem(i)+1; } }
Annotation file
- https://github.com/rose-compiler/rose-develop/blob/master/projects/autoParallelization/tests/floatArray.annot
- https://github.com/rose-compiler/rose-develop/blob/master/projects/autoParallelization/tests/funcs.annot
Output file
//#include <A++.h> #include "simpleA++.h" #include "omp.h" void interpolate1D(class floatArray &fineGrid,class floatArray &coarseGrid) { int _var_0; int i; int _var_1; int fineGridSize = fineGrid . length (0); int coarseGridSize = coarseGrid . length (0); // Interior fine points class Range If(2,_var_1 = fineGridSize - 2,2); class Range Ic(1,coarseGridSize - 1,1); #pragma omp parallel for private (i) firstprivate (_var_0) for (i = 1; i <= _var_0 - 1; i += 1) { fineGrid . elem (i) = fineGrid . elem (i) + 1; } }
Generating patches
[edit | edit source]autoPar accepts an option -rose:autopar:enable_patch to generate a patch file. So users want to examine the directives to be inserted first. Then users can apply the inspected patch to their application.
Input file: https://github.com/rose-compiler/rose-develop/blob/master/projects/autoParallelization/tests/jacobi_seq.c
Command line:
- autoPar -rose:C99 --edg:no_warnings -w -rose:verbose 0 --edg:restrict -rose:autopar:unique_indirect_index -rose:autopar:enable_patch -c jacobi_seq.c
Screen output
Enabling generating patch files for auto parallelization ... Assuming all arrays used as indirect indices have unique elements (no overlapping) ... Automatically parallelized a loop at line:83 Unparallelizable loop at line:84 due to the following dependencies: 1*1 OUTPUT_DEP; commonlevel = 1 CarryLevel = (0,0) Is precise SgPntrArrRefExp:f[i][0]@89:6->SgPntrArrRefExp:f[i][0]@89:6 <= -1;||:: 1*1 OUTPUT_DEP; commonlevel = 1 CarryLevel = (0,0) Is precise SgPntrArrRefExp:f[i][0]@89:6->SgPntrArrRefExp:f[i][0]@89:6 <= -1;||:: Automatically parallelized a loop at line:143 Automatically parallelized a loop at line:144 Automatically parallelized a loop at line:147 Automatically parallelized a loop at line:148 Automatically parallelized a loop at line:185 Automatically parallelized a loop at line:186
Generated patch file: cat jacobi_seq.c.patch
diff -ar /home/liao6/workspace/autoPar/sourcetree/projects/autoParallelization/tests/jacobi_seq.c rose_jacobi_seq.c 0a1 > #include <omp.h> 82a83 > #pragma omp parallel for private (xx,yy,i,j) firstprivate (n,m) 142a143 > #pragma omp parallel for private (i,j) 143a144 > #pragma omp parallel for private (j) 146a147 > #pragma omp parallel for private (resid,i,j) reduction (+:error) 147a148 > #pragma omp parallel for private (resid,j) reduction (+:error) firstprivate (omega,ax,ay,b) 184a185 > #pragma omp parallel for private (xx,yy,temp,i,j) reduction (+:error) 185a186 > #pragma omp parallel for private (xx,yy,temp,j) reduction (+:error) firstprivate (dx,dy)
After inspecting the patch, you can apply the patch to the original source file
- patch jacobi_seq.c -i jacobi_seq.c.patch -o jacobi_patched.c
Checking Correctness
[edit | edit source]AutoPar examines pre-existing OpenMP directives in an application and verifies that they have correctly accounted for private, reductions and other OMP data-sharing attributes.
Sample input
#include <stdio.h> #include <omp.h> int main(int argc, char *argv[]) { int N = 20; int total ; int i,j; #pragma omp parallel for for (j=0;j<N;j++) { for (i=0;i<N;i++) { total += i+j ; } } printf("%d\n", total) ; return 0; }
The code above contains a real OpenMP bug someone struggled with while trying to add OMP annotations to their code, submitted here: http://openmp.org/forum/viewtopic.php?f=3&t=821
Running the autoPar OpenMP directive checker produces this result:
% autoPar -rose:unparse_tokens -rose:autopar:unique_indirect_index -rose:autopar:enable_diff -fopenmp -c OMPcheck.cc <<<<<<<<< user defined : #pragma omp parallel for -------- OMPcheck.cc:11 compiler generated:#pragma omp parallel for private (i) reduction (+:total) >>>>>>>>
The output above from autoPar indicates that the OMP pragma is missing an OpenMP 'private' declaration for the variable 'i', and a reduction annotation for the variable 'total'.
Algorithm
[edit | edit source]autoPar is designed to handle both conventional loops operating on primitive arrays and modern applications using high-level abstractions. The parallelizer uses the following algorithm: The loops may contain variables of either primitive data types or STL container types, or both.
overview
[edit | edit source]1. Preparation and Preprocessing
- (Optional) Read a specification file for known abstractions and semantics.
- (Optional) Apply optional custom transformations based on input code semantics, such as converting tree traversals to loop iterations on memory pools.
- (Optional) if unique index indices semantics are assumed, turn on normalization of indirect array access (void uniformIndirectIndexedArrayRefs() )
- indirect_loop_index = array_Y [current_loop_index] ;array_X[indirect_loop_index] ..; becomes array_X [array_Y [current_loop_index]] ..
- Normalize loops, including those using iterators.
- Apply constant folding to simplify the normalized code
- Find candidate array computation loops with canonical forms (for omp for) or loops and functions operating on individual elements (for omp task).
2. For each candidate:
- Skip the target if there are function calls without known semantics or side effects.
- Call dependence analysis and liveness analysis.
- Classify OpenMP variables (autoscoping) to be private, firstprivate, lastprivate, reduction variables. recognize references to the current element, and find order independent write accesses.
- Eliminate dependencies associated with autoscoped variables, those involving only the current elements, and output dependencies caused by order-independent write accesses.
- Insert the corresponding OpenMP constructs if no dependencies remain.
The key idea of the algorithm is to capture dependencies within a target and eliminate them later on as much as possible based on various rules. Parallelization is safe if there are no remaining dependencies.
Canonical loops
[edit | edit source]In page 51 or (59/320): section 2.6, OpenMP 4.0 (http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf), it is defined as
for (init-expr; test-expr; incr-expr) structured-block
init-expr One of the following:
var = lb integer-type var = lb random-access-iterator-type var = lb pointer-type var = lb
test-expr One of the following:
var relational-op b b relational-op var
relational-op One of the following:
< <= > >=
incr-expr One of the following:
++var var++ --var var-- var += incr var -= incr var = var + incr var = incr + var var = var - incr
var One of the following:
A variable of a signed or unsigned integer type. For C++, a variable of a random access iterator type. For C, a variable of a pointer type.
If this variable would otherwise be shared, it is implicitly made private in the loop construct. This variable must not be modified during the execution of the for-loop other than in incr-expr. Unless the variable is specified lastprivate on the loop construct, its value after the loop is unspecified.
lb and b: Loop invariant expressions of a type compatible with the type of var.
incr: A loop invariant integer expression.
Dependence analysis
[edit | edit source]Dependence analysis is the basis for the parallelizer to decide whether a loop is parallelizable. autoPar invokes the dependence analysis from the loop optimizer, which implements algorithms proposed in [35, 36] to effectively transform both perfectly nested loops and non-perfectly nested loops. An extended direction matrix (EDM) dependence representation is used to cover non-common loop nests that surround only one of the two statements in order to handle non-perfectly nested loops. For array accesses within loops, a Gaussian elimination algorithm is used to solve a set of linear integer equations of loop induction variables.
This step relies on the dependence analysis implemented in
- src/midend/programTransformation/loopProcessing/depGraph
A simplified interface function is provided in
//Compute dependence graph for a loop, using ArrayInterface and ArrayAnnoation LoopTreeDepGraph* ComputeDependenceGraph(SgNode* loop, ArrayInterface*, ArrayAnnotation* annot);
Side effect analysis
[edit | edit source]Dependence analysis relies on the variable references information collected from side effect analysis
The analysis's source header file is midend/programTransformation/loopProcessing/depInfo/DepInfoAnal.h
- bool AnalyzeStmtRefs( AstInterface& fa, const AstNodePtr& n, CollectObject<AstNodePtr> &wRefs, CollectObject<AstNodePtr> &rRefs);
Several SageInterface functions are provided to make this analysis more accessible to developers
// Collect all read and write references within stmt, which can be a function, a scope statement, or a single statement. Note that a reference can be both read and written, like i++. bool SageInterface::collectReadWriteRefs (SgStatement *stmt, std::vector< SgNode * > &readRefs, std::vector< SgNode * > &writeRefs) //Collect unique variables which are read or written within a statement. Note that a variable can be both read and written. The statement can be either of a function, a scope, or a single line statement. bool SageInterface::collectReadWriteVariables (SgStatement *stmt, std::set< SgInitializedName * > &readVars, std::set< SgInitializedName * > &writeVars) // Collect read only variables within a statement. The statement can be either of a function, a scope, or a single line statement. void SageInterface::collectReadOnlyVariables (SgStatement *stmt, std::set< SgInitializedName * > &readOnlyVars) //Collect read only variable symbols within a statement. The statement can be either of a function, a scope, or a single line statement. void SageInterface::collectReadOnlySymbols (SgStatement *stmt, std::set< SgVariableSymbol * > &readOnlySymbols)
Autoscoping/variable classification
[edit | edit source]This step categories variables for their data-sharing attributes, based on their live-in (before the execution of a loop) and live-out (after the execution of a loop) analysis results. For instance, a private variable inside a loop is neither live-in nor live-out of the loop, which means the variable is immediately killed (redefined) inside the loop and then used inside the loop somehow, but is never going to be used anywhere after the loop. All loop index variables are also classified as OpenMP private variables to an avoid possible race condition. On the other hand, shared variables are live at both the beginning and the end of the loop. Firstprivate and lastprivate variables are live at either only the beginning or only the end of the loop, respectively.
Reduction variables are handled specially to maximize the opportunities for parallelization. A typical reduction operation inside a loop, such as sum = sum + a[i], causes a loop-carried output dependence, a loop-carried anti-dependence, and a loop independent anti-dependence. We use an idiom recognition analysis to capture such typical operations and exclude the associated loop-carried dependences when deciding if a loop is parallelizable.
This step relies on Liveness analysis, implemented in
- src/midend/programAnalysis/defUseAnaysis/LivenessAnalysis.cpp
- There is a SageInterface function to call this analysis:
//!Get liveIn and liveOut variables for a for loop void SageInterface::getLiveVariables(LivenessAnalysis * liv, SgForStatement* loop, std::set<SgInitializedName*>& liveIns, std::set<SgIn itializedName*> & liveOuts)
Algorithm: private and reduction variables cause dependences (being written). firstprivate and lastprivate variables are never being written in the loop (no dependences)
live-in live-out shared Y Y no written, no dependences: no special handling, shared by default private N N written (depVars), need privatization: depVars- liveIns - liveOuts firstprivate Y N liveIns - LiveOus - writtenVariables lastprivate N Y liveOuts - LiveIns reduction Y Y depVars Intersection (liveIns Intersection liveOuts)
To recognize reduction variables: RecognizeReduction()
- Implemented as void SageInterface::ReductionRecognition(SgForStatement* loop, std::set< std::pair <SgInitializedName*, VariantT> > & results) from https://github.com/rose-compiler/rose-develop/blob/master/src/frontend/SageIII/sageInterface/sageInterface.C
/* * Algorithms: * for each scalar candidate which are both live-in and live-out for the loop body * and which is not the loop invariant variable. * Consider those with only 1 or 2 references * 1 reference * the operation is one of x++, ++x, x--, --x, x binop= expr * 2 references belonging to the same operation * operations: one of x= x op expr, x = expr op x (except for subtraction) * Also according to the specification. * x is not referenced in exp * expr has scalar type (no array, objects etc) * x: scalar only, aggregate types (including arrays), pointer types and reference types may not appear in a reduction clause. * op is not an overloaded operator, but +, *, -, &, ^ ,|, &&, || * binop is not an overloaded operator but: +, *, -, &, ^ ,| * */
Semantic specification via annotations
[edit | edit source]We use a separated file to store annotations for input code. The file essentially specifies the semantics of the input code so the tool can leverage to parallelize the code.
Some annotation files (*.annot) can be found in the project directory at https://github.com/rose-compiler/rose-develop/tree/master/projects/autoParallelization/tests
- https://github.com/rose-compiler/rose/blob/master/projects/autoParallelization/annot/Cmath.annot 156 C standard library functions
- https://github.com/rose-compiler/rose-develop/blob/master/projects/autoParallelization/tests/floatArray.annot array semantics
- https://github.com/rose-compiler/rose-develop/blob/master/projects/autoParallelization/tests/std_vector.annot std::vector semantics
- https://github.com/rose-compiler/rose-develop/blob/master/projects/autoParallelization/tests/funcs.annot user functions
- https://github.com/rose-compiler/rose-develop/blob/master/projects/autoParallelization/tests/Index.annot
Sample annotations are:
operator atoll(const char * val) { modify none; read{val}; alias none; } class InternalIndex { has_value { stride = this.stride; base = this.base; length = this.length; } } class Index { has_value { stride = this.stride; base = this.base; length = this.length; } } class Range { has_value { stride = this.stride; base = this.base; length = this.length; } } operator Index::Index( int lb, int l, int step) { modify none; read {lb, l, step}; alias none; restrict_value { result = { base = lb; length = l; stride = step; } }; } operator Range::Range( int lb, int ub, int step) { modify none; read {lb, ub, step}; alias none; restrict_value { result = { base = lb; length = (ub-lb+1)/step; stride = step; } }; } operator InternalIndex::InternalIndex( int lb, int l, int step) { modify none; read {lb, l, step}; alias none; restrict_value { result = { base = lb; length = l; stride = step; } }; } operator operator+ ( const InternalIndex & lhs, int x ) { modify none; read {lhs, x}; alias none; restrict_value { result = {stride = lhs.stride; base = lhs.base + x; length = lhs.length}; }; } operator operator- ( const InternalIndex & lhs, int x ) { modify none; read {lhs, x}; alias none; restrict_value { result = {stride = lhs.stride; base = lhs.base - x; length = lhs.length}; }; }
Array abstractions:
class floatArray { array { dimension = 6; length(i) = this.Array_Descriptor.Array_Domain.Size[i]; elem(i:dim:1:dimension) = this(i$dim); reshape(i:dim:1:dimension) = this.resize(i$dim); }; array_optimize { define { float* _pointer = this.getDataPointer(); int _size:dim:1:dimension = this.Array_Descriptor.Array_Domain.Size[dim-1]; int _stride:dim:1:dimension = this.Array_Descriptor.Array_Domain.Stride[dim-1]; int _length:dim:1:dimension = this.Array_Descriptor.Array_Domain.getLength(dim-1) } length(i) = _length$(i+1); elem(i:dim:1:dimension) = _pointer[i$1 + repeat(x,2,dimension, i$x * _stride$(x-1) * _size$(x-1))]; }; has_value { dimension = 6; length_0 = this.length(0); length_1 = this.length(1); length_2 = this.length(2); }; } operator floatArray::operator() (int index) { modify none; read {index}; alias none; restrict_value { this = { dimension = 1; } }; inline { this.elem(index) }; } operator floatArray::operator() ( const InternalIndex& index) { modify none; read {this, index}; alias { (result, this) }; restrict_value { this = { dimension = 1; }; result = {dimension = 1; length_0 = index.length;}; }; construct_array (this) { dimension = 1; length(0) = index.length; elem(i) = this.elem(i * index.stride + index.base); }; } operator floatArray::operator()(const InternalIndex& index1, const InternalIndex& index2) { modify none; read {this, index1, index2}; alias { (result, this) }; restrict_value { this = { dimension = 2; }; result = {dimension = 2; length_0 = index1.length; length_1 = index2.length}; }; construct_array (this) { dimension = 2; length(0) = index1.length; length(1) = index2.length; elem(i,j) = this.elem(i * index1.stride + index1.base, j * index2.stride + index2.base); }; } operator floatArray::operator()(const InternalIndex& index1, const InternalIndex& index2, const InternalIndex& index3) { modify none; read {this, index1, index2, index3}; alias { (result, this) }; restrict_value { this = { dimension = 3; }; result = {dimension = 3; length_0 = index1.length; length_1 = index2.length; length_2 = index3.length; }; }; construct_array (this) { dimension = 3; length(0) = index1.length; length(1) = index2.length; length(2) = index3.length; elem(i,j,k) = this.elem(i * index1.stride + index1.base, j * index2.stride + index2.base, k * index3.stride + index3.base); }; } operator floatArray::operator= (const floatArray& that) { modify {this}; read {that}; alias none; restrict_value { result = { dimension = that.dimension; length_0 = that.length_0;}; }; modify_array (this) { dimension = that.dimension; length(i) = that.length(i); elem(i:dim:1:dimension) = that.elem(i$dim); }; } operator floatArray::operator= (float val) { modify {this}; read {val}; alias none; modify_array (this) { elem(i:dim:1:dimension) = val; }; } operator floatArray::getLength( int dim) { modify none; read {dim}; alias none; inline { this.length(dim) }; }
Dependence elimination
[edit | edit source]Not all dependences prevent us from parallelizing loops. We exclude some unharmful dependences in our consideration.
void DependenceElimination(SgNode* sg_node, LoopTreeDepGraph* depgraph, std::vector<DepInfo>& remainings, OmpSupport::OmpAttribute* att, std::map<SgNode*, bool> & indirect_table, ArrayInterface* array_interface/*=0*/, ArrayAnnotation* annot/*=0*/)
- from file https://github.com/rose-compiler/rose/blob/master/projects/autoParallelization/autoParSupport.C
Algorithm, eliminate the following dependencies
- caused by locally declared variables: already private to each iteration
- either source or sink variable is thread local variable
- ignore a dependence relationship with empty source/sink nodes
- commonlevel ==0, no common enclosing loops
- carry level !=0, loop independent,
- at least one array reference in a dependence relationship, but dependence type is SCALAR_DEP or SCALAR_BACK_DEP
- dependencies caused by autoscoped variables (private, firstprivate, lastprivate, reduction, etc.)
- dependencies caused by a pair of indirect indexed array reference, if we know indirect index arrays have unique elements.
Note that why carry level must be 0 to prevent parallelization in our analysis
- This is valid since we run dep analysis on each level of loop.
- For a current loop to be considered, only the dependences carried by the current loop matters.
- The current loop is always the outermost level loop , carry level id is 0.
An example with debugging info
[edit | edit source]autoPar -rose:autopar:enable_debug sourcetree/projects/autoParallelization/tests/jacobi_seq.c
Search for "this dep relation cannot be eliminated. saved into remaining depedence set." to find loops with loop-carried dependency which cannot be eliminated.
// Input code from Jacobi, expected to generate // #pragma omp parallel for private(i,j,xx,yy,temp) reduction(+:error) 185 for (i = 0; i < n; i++) 186 for (j = 0; j < m; j++) 187 { 188 xx = -1.0 + dx * (i - 1); 189 yy = -1.0 + dy * (j - 1); 190 temp = u[i][j] - (1.0 - xx * xx) * (1.0 - yy * yy); 191 error = error + temp * temp; 192 } Debug: ComputeDependenceGraph() dumps the dependence graph for the loop at line :185 dep SgExprStatement:xx = - 1.0 + dx *(i - 1); SgExprStatement:xx = - 1.0 + dx *(i - 1); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise SgVarRefExp:xx@188:2-> SgVarRefExp:xx@188:2 == 0;* 0;||* 0;== 0;||:: dep SgExprStatement:xx = - 1.0 + dx *(i - 1); SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise SgVarRefExp:xx@188:2-> SgPntrArrRefExp:u[i][j]@190:13 * 0;* 0;||* 0;* 0;||:: dep SgExprStatement:xx = - 1.0 + dx *(i - 1); SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise SgVarRefExp:xx@188:2-> SgVarRefExp:xx@190:26 == 0;* 0;||* 0;== 0;||:: dep SgExprStatement:xx = - 1.0 + dx *(i - 1); SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise SgVarRefExp:xx@188:2-> SgVarRefExp:xx@190:31 == 0;* 0;||* 0;== 0;||:: .. dep SgExprStatement:error = error + temp * temp; SgExprStatement:error = error + temp * temp; 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise SgVarRefExp:error@191:2-> SgVarRefExp:error@191:10 == 0;* 0;||* 0;<= -1;||:: ... dep SgExprStatement:error = error + temp * temp; SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise SgVarRefExp:temp@191:25-> SgVarRefExp:temp@190:2 == 0;* 0;||* 0;<= -1;||:: dep SgExprStatement:error = error + temp * temp; SgExprStatement:temp = u[i][j] -(1.0 - xx * xx) *(1.0 - yy * yy); 0x7fffb5d03db0 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise SgVarRefExp:error@191:2-> SgPntrArrRefExp:u[i][j]@190:13 * 0;* 0;||* 0;* 0;||:: -------------------------------------------------------- Live-in variables for loop:186 error ::m ::n Live-out variables for loop before line:193 error ::m ::n Final Live-in variables for loop: error ::m ::n Final Live-out variables for loop: error ::m ::n Debug after CollectVisibleVaribles (): 0x7f1640ef0678 ::n 0x7f1640ef07c0 ::m 0x7f1640ef1200 ::dx 0x7f1640ef1348 ::dy 0x7f1640ef2680 j 0x7f1640ef27c8 xx 0x7f1640ef2910 yy 0x7f1640ef2a58 temp 0x7f1640ef2ba0 error Debug after CollectVariablesWithDependence(): 0x7f1640ef27c8 xx 0x7f1640ef2910 yy 0x7f1640ef2a58 temp 0x7f1640ef2ba0 error Debug dump private: 0x7f1640ef27c8 xx 0x7f1640ef2910 yy 0x7f1640ef2a58 temp 0x7f1640ef2538 i 0x7f1640ef2680 j Debug dump lastprivate: Debug: A candidate used twice:error Debug dump firstprivate: Entering DependenceElimination () ---------------------------------- Eliminating a dep relation due to at least one autoscoped variables 0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise SgVarRefExp:xx@188:2-> SgVarRefExp:xx@188:2 == 0;* 0;||* 0;== 0;||:: Eliminating a dep relation due to scalar dep type for at least one array variable 0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise SgVarRefExp:xx@188:2-> SgPntrArrRefExp:u[i][j]@190:13 * 0;* 0;||* 0;* 0;||:: Eliminating a dep relation due to at least one autoscoped variables 0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (2,2) Not precise SgVarRefExp:xx@188:2-> SgVarRefExp:xx@190:26 == 0;* 0;||* 0;== 0;||:: Eliminating a dep relation due to scalar dep type for at least one array variable 0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise SgVarRefExp:yy@189:2-> SgPntrArrRefExp:u[i][j]@190:13 * 0;* 0;||* 0;* 0;||:: .. Eliminating a dep relation due to at least one autoscoped variables 0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise SgVarRefExp:yy@190:44-> SgVarRefExp:yy@189:2 == 0;* 0;||* 0;<= -1;||:: ... Eliminating a dep relation due to scalar dep type for at least one array variable 0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_DEP; commonlevel = 2 CarryLevel = (0,0) Not precise SgPntrArrRefExp:u[i][j]@190:13-> SgVarRefExp:error@191:2 * 0;* 0;||* 0;* 0;||:: ... Eliminating a dep relation due to at least one autoscoped variables 0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise SgVarRefExp:error@191:2-> SgVarRefExp:error@191:10 == 0;* 0;||* 0;<= -1;||:: Eliminating a dep relation due to at least one autoscoped variables 0x7fffb5d03f80 Distance Matrix size:2*2 SCALAR_BACK_DEP; commonlevel = 2 CarryLevel = (1,1) Not precise SgVarRefExp:temp@191:18-> SgVarRefExp:temp@190:2 == 0;* 0;||* 0;<= -1;||:: Exiting DependenceElimination () ---------------------------------- Automatically parallelized a loop at line:185 debug patch generation: attaching OMP att to sg_node SgForStatement at line 185
Annotations encoding semantics
[edit | edit source]Used to encode semantic information for types (class) and their operations (operator)
Source files: https://github.com/rose-compiler/rose-develop/tree/master/src/midend/astUtil/annotation
// annotations are separated by ; <annot> ::= <annot1> | <annot1>;<annot> // two main categories: class and operator <annot1> ::= class <cls annot> | operator <op annot> //---------class------------ // If this is an array-like class? // e.g: class MyArray: <cls annot> ::= <clsname>:<cls annot1>; <cls annot1>::= <cls annot2> | <cls annot2> <cls annot1> //from bool TypeDescriptor:: read(istream& in) in AnnotDescriptors.C // type name can have const modifier, with qualifiers, and of reference or pointer type <clasname>::= [const] name1[::name2[...]][&|*] // array information, inheritable or not <cls annot2>::= <arr annot> | inheritable <arr annot> | has-value { <val def> } //Array information: dimension, length, element, reshape <arr annot>::= is-array{ <arr def>} | is-array{define{<stmts>}<arr def>} // define loop-invariant operations that should be executed before enumerating array elements <arr def> ::= <arr attr def> | <arr attr def> <arr def> <arr attr def> ::= <arr attr>=<expression>; <arr attr> ::= dim // at most 'dim' dimensions | len (<param>) // length of each dimension: parameter is used to specify which dimension | elem(<paramlist>) //i$x:0:dim-1 ==> $x is from 0 to dim-1 ==> list of parameters i_0, i_1, i_dim-1 | reshape(<paramlist>) //Data members for types/classes //Could have expressions <val def> ::= <name>; | <name>;<val def> | <name> = <expression> ; | <name> = <expression> ; <val def> //---------operator------------ // side effects: mod, read, alias <op annot> ::= <opdecl> : <op annot1> ; <op annot1> ::=<op annot2> | <op annot2> <op annot1> <op annot2> ::= modify <namelist> // side effects on scalar:e.g. modify none | new-array (<aliaslist>){<arr def>} // side effects on arrays: creational | modify-array (<name>) {<arr def>} // side effects on arrays: modifying only | restrict-value {<val def list>} // How to get member values from this operation's semantics( may be from parameters values) | read <namelist> // e.g.: read {a,b,c} | alias <nameGrouplist> // e.g.: alias none | allow-alias <nameGrouplist> | inline <expression> // this operator is semantically equal to another operator/expression, used to substitute an operation with another // from OperatorDescriptors.h NameGropu, OperatorAliasDescriptor <nameGrouplist> e.g: {(x,y,z), (a,b), (m,n)}
examples
[edit | edit source]Arrays and array optimizations (rewrite to raw arrays)
class floatArray { array { // high level array abstraction with range from 1 to dimension dimension = 6; // max dimension length(i) = this.Array_Descriptor.Array_Domain.Size[i]; // length of each dimension //i:dim:1:dimension/ i$dim means a list of parameters i_1, i_2, .., i_dimension? // i:dim:1:dimension also defines a range variable dim, which is reused later in i$dim elem(i:dim:1:dimension) = this(i$dim); reshape(i:dim:1:dimension) = this.resize(i$dim); }; array_optimize { // low level array with range from 0 to dimension-1 define { float* _pointer = this.getDataPointer(); int _size:dim:1:dimension = this.Array_Descriptor.Array_Domain.Size[dim-1]; int _stride:dim:1:dimension = this.Array_Descriptor.Array_Domain.Stride[dim-1]; int _length:dim:1:dimension = this.Array_Descriptor.Array_Domain.getLength(dim-1) } length(i) = _length$(i+1); elem(i:dim:1:dimension) = _pointer[i$1 + repeat(x,2,dimension, i$x * _stride$(x-1) * _size$(x-1))]; }; has_value { dimension = 6; length_0 = this.length(0); length_1 = this.length(1); length_2 = this.length(2); }; } dump ()-------------- arrays: floatArray : {dimension=6; length=((i:::):[](.(.(.(this,Array_Descriptor),Array_Domain),Size),i)); elem=((i:dim:1:dimension):this($(i,dim))) } reshape = ((i:dim:1:dimension):FunctionPtrCall(.(this,resize),$(i,dim))) array optimizations: floatArray : {float* _pointer:::=FunctionPtrCall(.(this,getDataPointer)); int _size:dim:1:dimension=[](.(.(.(this,Array_Descriptor),Array_Domain),Size),+(-1dim)); int _stride:dim:1:dimension=[](.(.(.(this,Array_Descriptor),Array_Domain),Stride),+(-1dim)); int _length:dim:1:dimension=FunctionPtrCall(.(.(.(this,Array_Descriptor),Array_Domain),getLength),+(-1dim)) } {dimension=; length=((i:::):$(_length,+(1i))); elem=((i:dim:1:dimension):[](_pointer,+($(i,1)repeat(x,2,dimension,*($(i,x)$(_stride,+(-1x))$(_size,+(-1x))))))) }
Function side effect annotation: variable sets for read, modify, and alias
operator VectorXY::VectorXY(double xx, double yy) { modify none; read {xx,yy}; } // side effects annotation operator floatArray::operator() ( const InternalIndex& index) { modify none; read {this, index}; } // Dump(): mangled_function_name: (all variables): {modified variable } floatArray::operator()_constInternalIndex& : (this,index):{} // another annotation, side effects, alias, value operator Index::Index( int lb, int l, int step) { modify none; read {lb, l, step}; alias none; restrict_value { result = { base = lb; length = l; stride = step; } }; } // Dump() read: Index::Index_int_int_int : (lb,l,step):{l,lb,step} // allow_alias: used only for a function operating on two or more parameters? tell if parameters can be alias to each other? operator interpolate2D(floatArray & fineGrid,floatArray & coarseGrid) { allow_alias none; }
has_value and restrict _value annotations: what is "this" and "result" ?
// has_value: Only used for within class annotation, class floatArray { .. has_value { dimension = 6; length_0 = this.length(0); length_1 = this.length(1); length_2 = this.length(2); }; class Range { has_value { stride = this.stride; base = this.base; length = this.length; } } //Dump: floatArray : {dimension:6;length_0:FunctionPtrCall(.(this,length),0);length_1:FunctionPtrCall(.(this,length),1);length_2:FunctionPtrCall(.(this,length),2)} } Range : {base:.(this,base);length:.(this,length);stride:.(this,stride)} // restrict_value: infer the value of member variable from the semantics of current operations: used for operation annotations. could have inferred information for one "this" and multiple "result" depending on function parameters operator floatArray::operator() (int index) { .. restrict_value { this = { dimension = 1; } }; // this element access operator using one integer parameter can tell us that the current object (this) has a dimension value equaling to 1 } operator floatArray::operator()(const InternalIndex& index1, const InternalIndex& index2, const InternalIndex& index3) { restrict_value { this = { dimension = 3; }; result = {dimension = 3; length_0 = index1.length; length_1 = index2.length; length_2 = index3.length; }; }; }
c function annotations
[edit | edit source]operator sqrt( double val) { modify none; read{val}; alias none; } operator sqrt( float val) { modify none; read{val}; alias none; } operator abs(int val) { modify none; read{val}; alias none; }
Publications
[edit | edit source]- old technical report: Chunhua Liao, Daniel J Quinlan, Jeremiah J Willcock, and Thomas Panas. Automatic parallelization using OpenMP based on STL semantics. Technical report, Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States), 2008.
- A workshop paper: Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock and Thomas Panas, Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore, In Proceedings of the 5th international Workshop on OpenMP: Evolving OpenMP in An Age of Extreme Parallelism (Dresden, Germany, June 3–05, 2009). pdf
- A journal version of the paper: Chunhua Liao, Daniel J. Quinlan, Jeremiah J. Willcock and Thomas Panas, Semantic-Aware Automatic Parallelization of Modern Applications Using High-Level Abstractions, Journal of Parallel Programming, Accepted in Jan. 2010 pdf