Optimizing compiler. Auto parallelization
Software prefetching
Prefetching is loading data from relatively slow memory into the cache before it is required by the processor. Software prefetching is the insertion of special prefetch instructions into the code.
There are several ways to use prefetching:
- Explicit insertion of prefetch instructions into the code.
- Implicit insertion by the compiler with the -opt-prefetch option (/Qopt-prefetch on Windows), known as the auto-prefetch compiler feature.
Prefetch intrinsic functions are declared in xmmintrin.h and have the form:

    #include <xmmintrin.h>
    enum _mm_hint {
        _MM_HINT_T0  = 3,   /* (L1) */
        _MM_HINT_T1  = 2,   /* (L2) */
        _MM_HINT_T2  = 1,   /* (L3) */
        _MM_HINT_NTA = 0
    };
    void _mm_prefetch(void *p, enum _mm_hint h);
It loads the cache line containing the specified address (the cache line size is 64 bytes).
Inside Fortran programs, use CALL mm_prefetch(P, HINT).
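As an illustration (not from the original example set), here is a minimal C sketch of explicit prefetch insertion: a loop walks an array with a large constant stride and prefetches the element it will need a few iterations later. STRIDE, AHEAD and strided_sum are made-up names for this sketch.

    #include <xmmintrin.h>

    #define STRIDE 1024   /* elements between consecutive accesses        */
    #define AHEAD  4      /* how many iterations ahead to prefetch        */

    /* Sum every STRIDE-th element of a[0..n-1]; before using the current
       element, request the one needed AHEAD iterations from now so its
       cache line is loaded while useful work is being done.              */
    float strided_sum(const float *a, long n)
    {
        float s = 0.0f;
        for (long i = 0; i < n; i += STRIDE) {
            if (i + AHEAD * STRIDE < n)
                _mm_prefetch((const char *)&a[i + AHEAD * STRIDE], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }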
Why software prefetching can be useful
There is a hardware prefetch mechanism that tries to identify the memory access pattern and choose an appropriate preloading scheme. It works fine when memory is accessed with a constant stride and this stride is relatively small.
Software prefetch instructions have their price. The computing system can ignore software prefetch instructions when the system bus is busy.
Don't use software prefetch instructions:
- when the hardware prefetching mechanism is able to help
- when there are many memory requests and the system bus is busy
- when all the needed memory is already cached
There are many cases where the programmer can help preload the required memory:
- large constant stride
- work with chains (see the sketch after this list)
- variable-stride access to memory
- many different memory objects (?)
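A hedged sketch of the "work with chains" case (the node layout and the function name are assumptions, not part of the original material): while the current node of a linked chain is being processed, the next node is prefetched. The hardware prefetcher cannot predict this pattern because successive node addresses have no constant stride.

    #include <xmmintrin.h>
    #include <stddef.h>

    struct node {
        struct node *next;
        float payload;
    };

    /* Walk a chain of nodes; while the payload of the current node is
       processed, request the next node so its cache line is (hopefully)
       already loaded when the walk reaches it.                          */
    float chain_sum(const struct node *p)
    {
        float s = 0.0f;
        while (p != NULL) {
            if (p->next != NULL)
                _mm_prefetch((const char *)p->next, _MM_HINT_T0);
            s += p->payload;   /* useful work overlaps the prefetch */
            p = p->next;
        }
        return s;
    }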
Using VTune can help identify slowdowns caused by inefficient memory access.
    SUBROUTINE CALC(A,B,C,N,K,SEC)
      INTEGER N,K,SEC,I,J
      REAL A(K,N),B(K,N),C(K,N)
      DO I=1,K
        DO J=1,N
          ! The inner loop runs over the second index, so consecutive
          ! iterations touch memory with a stride of K elements.
          A(I,J)=A(I,J)/(A(I,J)+B(I,J)*C(I,J))
#ifdef PERF
          ! Request the elements needed SEC iterations from now
          ! (hint 3 corresponds to _MM_HINT_T0).
          CALL mm_prefetch(A(I,J+SEC),3)
          CALL mm_prefetch(B(I,J+SEC),3)
          CALL mm_prefetch(C(I,J+SEC),3)
#endif
        END DO
      END DO
    END SUBROUTINE CALC
Memory is accessed with a large constant stride.
Can we obtain a performance gain?
Would it differ for different SEC values?
    INTEGER N,K
    REAL, ALLOCATABLE :: A(:,:),B(:,:),C(:,:),D(:,:)
    REAL T1,T2
    INTEGER REP,SEC
    READ *,N,SEC
    READ *,K
    ALLOCATE(A(K,N),B(K,N),C(K,N))
    ! Writing the large auxiliary array D evicts A, B and C from the
    ! caches before the timed call.
    ALLOCATE(D(10000,10000))
    A=1
    B=1
    C=1
    D=0
    CALL CPU_TIME(T1)
    CALL CALC(A,B,C,N,K,SEC)
    CALL CPU_TIME(T2)
    PRINT *,T2-T1
    END
Execution results
ifort /fpp -DPERF -Od pref.f90 -Feperf_pref.exe
ifort /fpp -Od pref.f90 -Feperf.exe
Input: N = 4000, K = 4000; SEC is varied.
SEC == 1 with software prefetch: 0.56 s. SEC == 4 with software prefetch: 0.48 s.
Perhaps the cost of the prefetch instructions exceeds the gain from prefetching.
Let's increase the amount of computation inside the loop:
    A(I,J) = A(I,J)/(A(I,J) + B(I,J)*C(I,J))
is replaced by
    A(I,J) = (EXPONENT(A(I,J)) + EXPONENT(B(I,J)) + EXPONENT(C(I,J)))/(A(I,J)*B(I,J)*C(I,J))
SEC == 1 with software prefetch: 1.07 s. SEC == 4 with software prefetch: 0.98 s.
Conclusion
It is hard to determine whether prefetch instructions will be helpful on every computing system. The performance of the memory subsystem depends on many factors, such as the amount of cache memory, memory latency, bandwidth, etc. A prefetch instruction has its price and increases the amount of data that must be passed through the system bus. Therefore the result of software prefetching can differ between computing systems.
Auto software prefetching options: -opt-prefetch[=n] (Linux and Mac OS X), /Qopt-prefetch[:n] (Windows)
- n = 1 to 4: enables increasing levels of software prefetching. If you do not specify a value for n, the default is 2 on IA-32 and Intel® 64 architectures; the default is 3 on IA-64 architecture. Use lower values to reduce the amount of prefetching.
- n = 0: disables software prefetching. This is the same as specifying -no-opt-prefetch (Linux and Mac OS X) or /Qopt-prefetch- (Windows).
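For example (option spelling follows the names given above for this generation of the Intel compiler; treat the exact form as an assumption):

    ifort -opt-prefetch=3 pref.f90        (Linux and Mac OS X)
    ifort /Qopt-prefetch:3 pref.f90       (Windows)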