Mirror of https://github.com/pytorch/pytorch.git (synced 2025-10-20 12:54:11 +08:00)
Don't split 256-bit AVX2 load/store intrinsics (#20609)
Summary: Recent versions of GCC split unaligned 256-bit load and store intrinsics into two 128-bit instructions. On old processors (Sandy Bridge) this was a bit faster for unaligned data, but a bit slower for aligned data. On newer processors (Intel Haswell+, recent AMD), splitting loads is slower for both aligned and unaligned data. Clang, MSVC, and ICC do not split unaligned load and store intrinsics. There's a good explanation here: https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd#tab-top

Splitting load and store intrinsics makes no sense in our AVX2 configuration, because the CPUs that support AVX2 instructions are exactly the CPUs where splitting is disadvantageous for all data alignments. Note that this doesn't change the AVX configuration (used by CPUs that support AVX but not AVX2). It's possible this would be beneficial for that configuration too (our data is usually 32-byte aligned), but I'd prefer the conservative change for now.

torch.add generated assembly (hot loop, GCC 7.3.0):
before: https://gist.github.com/colesbury/066376537bccd514daf8fe4ab54d8295
after: https://gist.github.com/colesbury/8b4b948145001d44b225c51d2428bb91

Timing of `torch.add(x, y, out=z)` for size 10240 (1 thread, Broadwell, no turbo):
before: 7.35 us
after: 6.39 us

(Take the torch.add timings with a grain of salt. The difference in timings is much larger than I would expect.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/20609
Differential Revision: D15385800
Pulled By: colesbury
fbshipit-source-id: 66415b148a3b19360b9de9881af594ab46547b6f
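For illustration, below is a minimal C++ sketch (not taken from the PR; the function name and the assumption that n is a multiple of 8 are hypothetical) of the kind of 256-bit unaligned load/store loop affected. With GCC's default tuning, each _mm256_loadu_ps may be lowered to a 128-bit vmovups plus a vinsertf128 (and each store to vmovups plus vextractf128); compiling with the no-split flags keeps them as single 256-bit moves.

// add_avx2.cpp -- hypothetical example, not part of the PR.
// Build with GCC:
//   g++ -O2 -mavx2 -mfma -mno-avx256-split-unaligned-load \
//       -mno-avx256-split-unaligned-store -c add_avx2.cpp
#include <immintrin.h>
#include <cstddef>

// Adds two float arrays elementwise; assumes n is a multiple of 8 for brevity.
void add_floats(const float* a, const float* b, float* out, std::size_t n) {
  for (std::size_t i = 0; i < n; i += 8) {
    __m256 va = _mm256_loadu_ps(a + i);                // 256-bit unaligned load
    __m256 vb = _mm256_loadu_ps(b + i);                // (split into two 128-bit ops under GCC's default tuning)
    _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));  // 256-bit unaligned store
  }
}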
Committed by: Facebook Github Bot
Parent: 000d73ccde
Commit: b90790ab1b
@@ -101,11 +101,21 @@ if (INTERN_BUILD_ATEN_OPS)
   IF(CXX_AVX2_FOUND)
     SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DHAVE_AVX2_CPU_DEFINITION")
+
+    # Some versions of GCC pessimistically split unaligned load and store
+    # instructions when using the default tuning. This is a bad choice on
+    # new Intel and AMD processors so we disable it when compiling with AVX2.
+    # See https://stackoverflow.com/questions/52626726/why-doesnt-gcc-resolve-mm256-loadu-pd-as-single-vmovupd#tab-top
+    check_cxx_compiler_flag("-mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store" COMPILER_SUPPORTS_NO_AVX256_SPLIT)
+    IF(COMPILER_SUPPORTS_NO_AVX256_SPLIT)
+      SET(CPU_NO_AVX256_SPLIT_FLAGS "-mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store")
+    ENDIF(COMPILER_SUPPORTS_NO_AVX256_SPLIT)
+
     LIST(APPEND CPU_CAPABILITY_NAMES "AVX2")
     IF(MSVC)
       LIST(APPEND CPU_CAPABILITY_FLAGS "${OPT_FLAG}/arch:AVX2")
     ELSE(MSVC)
-      LIST(APPEND CPU_CAPABILITY_FLAGS "${OPT_FLAG} -mavx2 -mfma")
+      LIST(APPEND CPU_CAPABILITY_FLAGS "${OPT_FLAG} -mavx2 -mfma ${CPU_NO_AVX256_SPLIT_FLAGS}")
     ENDIF(MSVC)
   ENDIF(CXX_AVX2_FOUND)
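As a usage note, the check_cxx_compiler_flag() command used above comes from CMake's CheckCXXCompilerFlag module, which must be included before the call; the diff assumes it is already available in this build file. A minimal standalone sketch of the same probe-then-append pattern (variable names outside the diff are hypothetical) looks like this:

include(CheckCXXCompilerFlag)

# Probe whether the compiler accepts the no-split flags (GCC does; other compilers may not).
check_cxx_compiler_flag(
  "-mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store"
  COMPILER_SUPPORTS_NO_AVX256_SPLIT)

set(CPU_NO_AVX256_SPLIT_FLAGS "")
if(COMPILER_SUPPORTS_NO_AVX256_SPLIT)
  set(CPU_NO_AVX256_SPLIT_FLAGS "-mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store")
endif()

# Append AVX2/FMA plus the optional no-split flags for the AVX2 kernel translation units.
set(CPU_CAPABILITY_FLAGS "-mavx2 -mfma ${CPU_NO_AVX256_SPLIT_FLAGS}")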