Optimizing compiler. Auto parallelization
Software prefetching
Prefetching is loading data from relatively slow memory into the cache before it is required by the processor. Software prefetching is the insertion of special prefetch instructions into the code.
There are several ways to use prefetching:
- Explicit insertion of prefetch instructions into the code.
- Implicit insertion by the compiler with the -opt-prefetch option (/Qopt-prefetch on Windows), known as the auto-prefetch compiler feature.
Prefetch intrinsic functions are declared in xmmintrin.h and have the form:

    #include <xmmintrin.h>
    enum _mm_hint {
        _MM_HINT_T0  = 3,   /* (L1) */
        _MM_HINT_T1  = 2,   /* (L2) */
        _MM_HINT_T2  = 1,   /* (L3) */
        _MM_HINT_NTA = 0
    };
    void _mm_prefetch(void *p, enum _mm_hint h);
It loads the cache line containing the specified address (the cache line size is 64 bytes).
Inside Fortran programs, use CALL mm_prefetch(P, HINT).
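As an illustration (not from the original example set), here is a minimal C sketch of explicit prefetch insertion: a loop walks an array with a large constant stride and prefetches the element it will need a few iterations later. STRIDE, AHEAD and strided_sum are made-up names for this sketch.

    #include <xmmintrin.h>

    #define STRIDE 1024   /* elements between consecutive accesses        */
    #define AHEAD  4      /* how many iterations ahead to prefetch        */

    /* Sum every STRIDE-th element of a[0..n-1]; before using the current
       element, request the one needed AHEAD iterations from now so its
       cache line is loaded while useful work is being done.              */
    float strided_sum(const float *a, long n)
    {
        float s = 0.0f;
        for (long i = 0; i < n; i += STRIDE) {
            if (i + AHEAD * STRIDE < n)
                _mm_prefetch((const char *)&a[i + AHEAD * STRIDE], _MM_HINT_T0);
            s += a[i];
        }
        return s;
    }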
Why software prefetching can be useful
There is a hardware prefetch mechanism that tries to identify the memory access pattern and choose an appropriate preloading scheme. It works fine when memory is accessed with a constant stride and this stride is relatively small.
Software prefetch instructions have their price. The computing system can ignore software prefetch instructions when the system bus is busy.
Don't use software prefetch instructions:
- when the hardware prefetching mechanism is able to help
- when there are many memory requests and the system bus is busy
- when all the needed memory is already cached
There are many cases where the programmer can help preload the required memory:
- large constant stride
- work with chains (see the sketch after this list)
- variable-stride access to memory
- many different memory objects (?)
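A hedged sketch of the "work with chains" case (the node layout and the function name are assumptions, not part of the original material): while the current node of a linked chain is being processed, the next node is prefetched. The hardware prefetcher cannot predict this pattern because successive node addresses have no constant stride.

    #include <xmmintrin.h>
    #include <stddef.h>

    struct node {
        struct node *next;
        float payload;
    };

    /* Walk a chain of nodes; while the payload of the current node is
       processed, request the next node so its cache line is (hopefully)
       already loaded when the walk reaches it.                          */
    float chain_sum(const struct node *p)
    {
        float s = 0.0f;
        while (p != NULL) {
            if (p->next != NULL)
                _mm_prefetch((const char *)p->next, _MM_HINT_T0);
            s += p->payload;   /* useful work overlaps the prefetch */
            p = p->next;
        }
        return s;
    }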
Using VTune can help identify slowdowns caused by inefficient memory access.
    SUBROUTINE CALC(A,B,C,N,K,SEC)
      INTEGER N,K,SEC,I,J
      REAL A(K,N),B(K,N),C(K,N)
      DO I=1,K
        DO J=1,N
          ! The inner loop runs over the second index, so consecutive
          ! iterations touch memory with a stride of K elements.
          A(I,J)=A(I,J)/(A(I,J)+B(I,J)*C(I,J))
#ifdef PERF
          ! Request the elements needed SEC iterations from now
          ! (hint 3 corresponds to _MM_HINT_T0).
          CALL mm_prefetch(A(I,J+SEC),3)
          CALL mm_prefetch(B(I,J+SEC),3)
          CALL mm_prefetch(C(I,J+SEC),3)
#endif
        END DO
      END DO
    END SUBROUTINE CALC
Memory is accessed with a large constant stride.
Can we obtain a performance gain?
Would it differ for different SEC values?
    INTEGER N,K
    REAL, ALLOCATABLE :: A(:,:),B(:,:),C(:,:),D(:,:)
    REAL T1,T2
    INTEGER REP,SEC
    READ *,N,SEC
    READ *,K
    ALLOCATE(A(K,N),B(K,N),C(K,N))
    ! Writing the large auxiliary array D evicts A, B and C from the
    ! caches before the timed call.
    ALLOCATE(D(10000,10000))
    A=1
    B=1
    C=1
    D=0
    CALL CPU_TIME(T1)
    CALL CALC(A,B,C,N,K,SEC)
    CALL CPU_TIME(T2)
    PRINT *,T2-T1
    END
Execution results
ifort /fpp -DPERF -Od pref.f90 -Feperf_pref.exe
ifort /fpp -Od pref.f90 -Feperf.exe
Input: N = 4000, K = 4000; SEC is varied.
SEC == 1 with software prefetch: 0.56 s. SEC == 4 with software prefetch: 0.48 s.
Perhaps the cost of the prefetch instructions exceeds the gain from prefetching.
Let's increase the amount of computation inside the loop:
    A(I,J) = A(I,J)/(A(I,J) + B(I,J)*C(I,J))
is replaced by
    A(I,J) = (EXPONENT(A(I,J)) + EXPONENT(B(I,J)) + EXPONENT(C(I,J)))/(A(I,J)*B(I,J)*C(I,J))
SEC == 1 with software prefetch: 1.07 s. SEC == 4 with software prefetch: 0.98 s.
Conclusion
It is hard to determine whether prefetch instructions will be helpful on every computing system. The performance of the memory subsystem depends on many factors, such as the amount of cache memory, memory latency, bandwidth, etc. A prefetch instruction has its price and increases the amount of data that must be passed through the system bus. Therefore the result of software prefetching can differ between computing systems.
Auto software prefetching options: -opt-prefetch[=n] (Linux and Mac OS X), /Qopt-prefetch[:n] (Windows)
- n = 1 to 4: enables increasing levels of software prefetching. If you do not specify a value for n, the default is 2 on IA-32 and Intel® 64 architectures; the default is 3 on IA-64 architecture. Use lower values to reduce the amount of prefetching.
- n = 0: disables software prefetching. This is the same as specifying -no-opt-prefetch (Linux and Mac OS X) or /Qopt-prefetch- (Windows).
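For example (option spelling follows the names given above for this generation of the Intel compiler; treat the exact form as an assumption):

    ifort -opt-prefetch=3 pref.f90        (Linux and Mac OS X)
    ifort /Qopt-prefetch:3 pref.f90       (Windows)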