std::vector performance regression when enabling C++11

lottogame 2020. 4. 13. 08:08

I found an interesting performance regression in a small C++ snippet when I enable C++11:

#include <cstddef>
#include <vector>

struct Item
{
  int a;
  int b;
};

int main()
{
  const std::size_t num_items = 10000000;
  std::vector<Item> container;
  container.reserve(num_items);
  for (std::size_t i = 0; i < num_items; ++i) {
    container.push_back(Item());
  }
  return 0;
}

With g++ (GCC) 4.8.2 20131219 (prerelease) and C++03 I get:

milian:/tmp$ g++ -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        35.206824 task-clock                #    0.988 CPUs utilized            ( +-  1.23% )
                4 context-switches          #    0.116 K/sec                    ( +-  4.38% )
                0 cpu-migrations            #    0.006 K/sec                    ( +- 66.67% )
              849 page-faults               #    0.024 M/sec                    ( +-  6.02% )
       95,693,808 cycles                    #    2.718 GHz                      ( +-  1.14% ) [49.72%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
       95,282,359 instructions              #    1.00  insns per cycle          ( +-  0.65% ) [75.27%]
       30,104,021 branches                  #  855.062 M/sec                    ( +-  0.87% ) [77.46%]
            6,038 branch-misses             #    0.02% of all branches          ( +- 25.73% ) [75.53%]

      0.035648729 seconds time elapsed                                          ( +-  1.22% )

With C++11, on the other hand, performance degrades significantly:

milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        86.485313 task-clock                #    0.994 CPUs utilized            ( +-  0.50% )
                9 context-switches          #    0.104 K/sec                    ( +-  1.66% )
                2 cpu-migrations            #    0.017 K/sec                    ( +- 26.76% )
              798 page-faults               #    0.009 M/sec                    ( +-  8.54% )
      237,982,690 cycles                    #    2.752 GHz                      ( +-  0.41% ) [51.32%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
      135,730,319 instructions              #    0.57  insns per cycle          ( +-  0.32% ) [75.77%]
       30,880,156 branches                  #  357.057 M/sec                    ( +-  0.25% ) [75.76%]
            4,188 branch-misses             #    0.01% of all branches          ( +-  7.59% ) [74.08%]

    0.087016724 seconds time elapsed                                          ( +-  0.50% )

Can someone explain this? So far my experience has been that the STL gets faster when C++11 is enabled, especially thanks to move semantics.

Edit: As suggested, using container.emplace_back(); instead brings performance on par with the C++03 version. How can the C++03 version achieve the same for push_back?

milian:/tmp$ g++ -std=c++11 -O3 main.cpp && perf stat -r 10 ./a.out

Performance counter stats for './a.out' (10 runs):

        36.229348 task-clock                #    0.988 CPUs utilized            ( +-  0.81% )
                4 context-switches          #    0.116 K/sec                    ( +-  3.17% )
                1 cpu-migrations            #    0.017 K/sec                    ( +- 36.85% )
              798 page-faults               #    0.022 M/sec                    ( +-  8.54% )
       94,488,818 cycles                    #    2.608 GHz                      ( +-  1.11% ) [50.44%]
  <not supported> stalled-cycles-frontend 
  <not supported> stalled-cycles-backend  
       94,851,411 instructions              #    1.00  insns per cycle          ( +-  0.98% ) [75.22%]
       30,468,562 branches                  #  840.991 M/sec                    ( +-  1.07% ) [76.71%]
            2,723 branch-misses             #    0.01% of all branches          ( +-  9.84% ) [74.81%]

   0.036678068 seconds time elapsed                                          ( +-  0.80% )

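I can reproduce your results on my machine with the options you wrote in your post.

However, if I also enable link-time optimization (I also pass the -flto flag to gcc 4.7.2), the results are identical:

(I am compiling your original code, with container.push_back(Item());)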

$ g++ -std=c++11 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 

 Performance counter stats for './a.out' (10 runs):

         35.426793 task-clock                #    0.986 CPUs utilized            ( +-  1.75% )
                 4 context-switches          #    0.116 K/sec                    ( +-  5.69% )
                 0 CPU-migrations            #    0.006 K/sec                    ( +- 66.67% )
            19,801 page-faults               #    0.559 M/sec                  
        99,028,466 cycles                    #    2.795 GHz                      ( +-  1.89% ) [77.53%]
        50,721,061 stalled-cycles-frontend   #   51.22% frontend cycles idle     ( +-  3.74% ) [79.47%]
        25,585,331 stalled-cycles-backend    #   25.84% backend  cycles idle     ( +-  4.90% ) [73.07%]
       141,947,224 instructions              #    1.43  insns per cycle        
                                             #    0.36  stalled cycles per insn  ( +-  0.52% ) [88.72%]
        37,697,368 branches                  # 1064.092 M/sec                    ( +-  0.52% ) [88.75%]
            26,700 branch-misses             #    0.07% of all branches          ( +-  3.91% ) [83.64%]

       0.035943226 seconds time elapsed                                          ( +-  1.79% )



$ g++ -std=c++98 -O3 -flto regr.cpp && perf stat -r 10 ./a.out 

 Performance counter stats for './a.out' (10 runs):

         35.510495 task-clock                #    0.988 CPUs utilized            ( +-  2.54% )
                 4 context-switches          #    0.101 K/sec                    ( +-  7.41% )
                 0 CPU-migrations            #    0.003 K/sec                    ( +-100.00% )
            19,801 page-faults               #    0.558 M/sec                    ( +-  0.00% )
        98,463,570 cycles                    #    2.773 GHz                      ( +-  1.09% ) [77.71%]
        50,079,978 stalled-cycles-frontend   #   50.86% frontend cycles idle     ( +-  2.20% ) [79.41%]
        26,270,699 stalled-cycles-backend    #   26.68% backend  cycles idle     ( +-  8.91% ) [74.43%]
       141,427,211 instructions              #    1.44  insns per cycle        
                                             #    0.35  stalled cycles per insn  ( +-  0.23% ) [87.66%]
        37,366,375 branches                  # 1052.263 M/sec                    ( +-  0.48% ) [88.61%]
            26,621 branch-misses             #    0.07% of all branches          ( +-  5.28% ) [83.26%]

       0.035953916 seconds time elapsed

As for the reasons, one needs to look at the generated assembly code (g++ -std=c++11 -O3 -S regr.cpp). In C++11 mode the generated code is significantly more cluttered than in C++98 mode, and inlining the function
void std::vector<Item,std::allocator<Item>>::_M_emplace_back_aux<Item>(Item&&)
fails in C++11 mode with the default inline-limit.

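This failed inline has a domino effect. Not because this function is being called (it is not even called in this case!) but because the code has to be prepared for it: if it is called, the function arguments (Item.a and Item.b) must already be in the right place. This leads to pretty messy code.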

Here is the relevant part of the generated code for the case where inlining succeeds:

.L42:
    testq   %rbx, %rbx  # container$D13376$_M_impl$_M_finish
    je  .L3 #,
    movl    $0, (%rbx)  #, container$D13376$_M_impl$_M_finish_136->a
    movl    $0, 4(%rbx) #, container$D13376$_M_impl$_M_finish_136->b
.L3:
    addq    $8, %rbx    #, container$D13376$_M_impl$_M_finish
    subq    $1, %rbp    #, ivtmp.106
    je  .L41    #,
.L14:
    cmpq    %rbx, %rdx  # container$D13376$_M_impl$_M_finish, container$D13376$_M_impl$_M_end_of_storage
    jne .L42    #,

This is a nice and compact for loop. Now let's compare this to the failed-inline case:

.L49:
    testq   %rax, %rax  # D.15772
    je  .L26    #,
    movq    16(%rsp), %rdx  # D.13379, D.13379
    movq    %rdx, (%rax)    # D.13379, *D.15772_60
.L26:
    addq    $8, %rax    #, tmp75
    subq    $1, %rbx    #, ivtmp.117
    movq    %rax, 40(%rsp)  # tmp75, container.D.13376._M_impl._M_finish
    je  .L48    #,
.L28:
    movq    40(%rsp), %rax  # container.D.13376._M_impl._M_finish, D.15772
    cmpq    48(%rsp), %rax  # container.D.13376._M_impl._M_end_of_storage, D.15772
    movl    $0, 16(%rsp)    #, D.13379.a
    movl    $0, 20(%rsp)    #, D.13379.b
    jne .L49    #,
    leaq    16(%rsp), %rsi  #,
    leaq    32(%rsp), %rdi  #,
    call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

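This code is cluttered, and there is a lot more going on in the loop than in the previous case. Before the function call (last line shown), the arguments must be placed appropriately: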

leaq    16(%rsp), %rsi  #,
leaq    32(%rsp), %rdi  #,
call    _ZNSt6vectorI4ItemSaIS0_EE19_M_emplace_back_auxIIS0_EEEvDpOT_   #

Even though this call is never actually executed, the loop sets up the item beforehand on every iteration:

movl    $0, 16(%rsp)    #, D.13379.a
movl    $0, 20(%rsp)    #, D.13379.b

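This is what leads to the messy code. If there is no function call because inlining succeeds, the loop contains only 2 move instructions and there is no messing with %rsp (the stack pointer). If inlining fails, however, we get 6 moves and a lot of traffic through %rsp.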

Just to substantiate my theory (note the -finline-limit), both runs in C++11 mode:

 $ g++ -std=c++11 -O3 -finline-limit=105 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

         84.739057 task-clock                #    0.993 CPUs utilized            ( +-  1.34% )
                 8 context-switches          #    0.096 K/sec                    ( +-  2.22% )
                 1 CPU-migrations            #    0.009 K/sec                    ( +- 64.01% )
            19,801 page-faults               #    0.234 M/sec                  
       266,809,312 cycles                    #    3.149 GHz                      ( +-  0.58% ) [81.20%]
       206,804,948 stalled-cycles-frontend   #   77.51% frontend cycles idle     ( +-  0.91% ) [81.25%]
       129,078,683 stalled-cycles-backend    #   48.38% backend  cycles idle     ( +-  1.37% ) [69.49%]
       183,130,306 instructions              #    0.69  insns per cycle        
                                             #    1.13  stalled cycles per insn  ( +-  0.85% ) [85.35%]
        38,759,720 branches                  #  457.401 M/sec                    ( +-  0.29% ) [85.43%]
            24,527 branch-misses             #    0.06% of all branches          ( +-  2.66% ) [83.52%]

       0.085359326 seconds time elapsed                                          ( +-  1.31% )

 $ g++ -std=c++11 -O3 -finline-limit=106 regr.cpp && perf stat -r 10 ./a.out

 Performance counter stats for './a.out' (10 runs):

         37.790325 task-clock                #    0.990 CPUs utilized            ( +-  2.06% )
                 4 context-switches          #    0.098 K/sec                    ( +-  5.77% )
                 0 CPU-migrations            #    0.011 K/sec                    ( +- 55.28% )
            19,801 page-faults               #    0.524 M/sec                  
       104,699,973 cycles                    #    2.771 GHz                      ( +-  2.04% ) [78.91%]
        58,023,151 stalled-cycles-frontend   #   55.42% frontend cycles idle     ( +-  4.03% ) [78.88%]
        30,572,036 stalled-cycles-backend    #   29.20% backend  cycles idle     ( +-  5.31% ) [71.40%]
       140,669,773 instructions              #    1.34  insns per cycle        
                                             #    0.41  stalled cycles per insn  ( +-  1.40% ) [88.14%]
        38,117,067 branches                  # 1008.646 M/sec                    ( +-  0.65% ) [89.38%]
            27,519 branch-misses             #    0.07% of all branches          ( +-  4.01% ) [86.16%]

       0.038187580 seconds time elapsed                                          ( +-  2.05% )

Indeed, if we ask the compiler to try just a little bit harder to inline that function, the difference in performance goes away.

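So what is the takeaway from this story? Failed inlines can cost you a lot, and you should make full use of the compiler's capabilities: I can only recommend link-time optimization. It gave a significant performance boost to my programs (up to 2.5x), and all I needed to do was pass the -flto flag. That's a pretty good deal! ;)

However, I do not recommend littering your code with the inline keyword; let the compiler decide what to do. (The optimizer is allowed to treat the inline keyword as white space anyway.)

Great question, +1!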

Source: https://stackoverflow.com/questions/20977741/stdvector-performance-regression-when-enabling-c11
