pytorch/c10/util/Unroll.h
mingfeima a427d90411 add int4 packed gemm support on CPU device (#117475)
This patch adds int4 packed gemm support on CPU; both `avx512` and `avx2` are supported. It is used to speed up https://github.com/pytorch-labs/gpt-fast

The baseline performance measured on an Intel(R) Xeon(R) CPU Max 9480, single socket (56 cores), is `16.13 sec total, 12.40 tokens/sec`

* WOQ int4 on avx512: `5.92 sec total, 33.79 tokens/sec`
* WOQ int4 on avx2: `6.90 sec total, 29.00 tokens/sec`

WOQ int4 is measured with the method described at: https://github.com/pytorch-labs/gpt-fast?tab=readme-ov-file#int4-weight-only-quantization

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117475
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-03-06 16:25:53 +00:00


#pragma once

#include <c10/macros/Macros.h>

#include <type_traits>

// Utility to guarantee complete unrolling of a loop where the bounds are known
// at compile time. Various pragmas achieve similar effects, but are not as
// portable across compilers.
//
// Example: c10::ForcedUnroll<4>{}(f); is equivalent to f(0); f(1); f(2); f(3);

namespace c10 {

// Recursive case: invokes f on indices [0, n). The index is passed as a
// std::integral_constant<int, i>, so it is a constant expression inside f.
template <int n>
struct ForcedUnroll {
  template <typename Func, typename... Args>
  C10_ALWAYS_INLINE void operator()(const Func& f, Args... args) const {
    ForcedUnroll<n - 1>{}(f, args...);
    f(std::integral_constant<int, n - 1>{}, args...);
  }
};

// Base case: a single invocation with index 0.
template <>
struct ForcedUnroll<1> {
  template <typename Func, typename... Args>
  C10_ALWAYS_INLINE void operator()(const Func& f, Args... args) const {
    f(std::integral_constant<int, 0>{}, args...);
  }
};

} // namespace c10
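
For illustration, here is a minimal usage sketch (not part of the header or the PR; `dot4` is a hypothetical helper). It shows the idiomatic pattern: pass a generic lambda whose first parameter receives the `std::integral_constant` index, which implicitly converts to `int` for array indexing while remaining usable in constant expressions.

```cpp
#include <c10/util/Unroll.h>
#include <cstdio>

// Hypothetical example: a 4-element dot product with the loop fully
// unrolled at compile time via c10::ForcedUnroll.
float dot4(const float* a, const float* b) {
  float acc[4] = {0.f, 0.f, 0.f, 0.f};
  c10::ForcedUnroll<4>{}([&](auto i) {
    // i is std::integral_constant<int, 0..3>; it converts to int here.
    acc[i] = a[i] * b[i];
  });
  return acc[0] + acc[1] + acc[2] + acc[3];
}

int main() {
  const float a[4] = {1.f, 2.f, 3.f, 4.f};
  const float b[4] = {5.f, 6.f, 7.f, 8.f};
  std::printf("%f\n", dot4(a, b)); // prints 70.000000
}
```

Passing the index as `std::integral_constant` rather than a plain `int` is the key design choice: the index stays a compile-time constant, so the functor body can use it as a template argument or in `if constexpr` branches, which is what vectorized kernels (such as the int4 gemm microkernels in this PR) rely on.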