magicfoot wrote:The values of ie and je lie in the range 1000 to 100000.
There is no timing data for the single loop but I can derive that. There are three of these loops in the program, all with different variables, and these loops use 98% of the total execution time.
That seems large enough that the overhead of the parallel region (typically in the 10s to 100s of microseconds range) is likely to be negligible.
MarkB wrote:Are you considering memory affinity or cache coherence issues? Is there some way to stabilise that with OpenMP?
On a multi-socket machine it can be important to get the distribution of data in main memory right. This means that the first access to large arrays (typically initialisation) should be made inside a parallel region. Your code might be getting some cache reuse (at least in L3), so making sure the same thread accesses the same data items in different parallel loops might help.
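As a rough illustration of the first-touch idea, here is a minimal sketch in C. I haven't seen your full code, so the function names (alloc_first_touch, scale_add), the flat row-major layout and the update itself are all made up for the example. It also assumes the operating system places each page on the socket of the thread that first writes to it, which is the usual default on Linux but not guaranteed everywhere.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: allocate an ie-by-je array (row-major) and
   initialise it inside a parallel loop, so that each page is first
   written by the thread that will work on those rows later. */
double *alloc_first_touch(size_t ie, size_t je, double value)
{
    assert(ie > 0 && je > 0);
    double *a = malloc(ie * je * sizeof *a);
    if (a == NULL)
        return NULL;

    /* schedule(static) gives each thread one contiguous block of rows. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)ie; i++)
        for (size_t j = 0; j < je; j++)
            a[(size_t)i * je + j] = value;

    return a;
}

/* Hypothetical bandwidth-bound update, standing in for the real loop.
   It uses the same loop bounds and the same static schedule as the
   initialisation, so each thread revisits the rows it touched first. */
void scale_add(double *a, const double *b, size_t ie, size_t je, double c)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)ie; i++)
        for (size_t j = 0; j < je; j++)
            a[(size_t)i * je + j] += c * b[(size_t)i * je + j];
}
```

The important part is that every parallel loop over these arrays uses the same bounds and schedule(static), with the same number of threads. Pinning threads as well (for example by setting OMP_PROC_BIND=true) stops them migrating away from their data between loops.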
The loop you posted looks very bandwidth-intensive, so you may simply be running into the limits of the hardware bandwidth scalability.
Hope that helps,