mirror of
https://github.com/pytorch/pytorch.git
synced 2025-10-20 21:14:14 +08:00
Fix the logic of set_cpu_affinity (#154503)
While investigating https://github.com/pytorch/pytorch/issues/152566, I found two issues with how the CPU affinity is set in the benchmark job:

* The current logic doesn't work with cgroups slices, the mechanism behind the multi-tenant runner:
    * Using `lscpu` returns all CPUs and not the ones available from cgroups. On the other hand, `nproc` works correctly. For example, on H100, `lscpu` returns 192 CPUs while `nproc` returns 24 (192 / 8).
    * Setting `taskset -c 0-N` blindly is wrong because CPU 0 is only available to the first tenant, aka alice. For example, running `taskset -c 0 ls` on any other tenant will fail. To fix this, the IDs of the available CPUs can be fetched by calling `os.sched_getaffinity(0)`.
* The last bug is that `taskset` works with logical CPUs (https://www.man7.org/linux/man-pages/man1/taskset.1.html), so using the result from `test_inductor_get_core_number` is also wrong because that function returns the number of physical CPUs.

### Testing

CPU benchmark jobs look OK:

* [aarch64 torch.compile benchmark](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2021%20May%202025%2016%3A40%3A28%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A40%3A28%20GMT&granularity=hour&mode=inference&dtype=bfloat16&deviceName=cpu%20(aarch64)&lBranch=fix-cpu-affinity-cgroups&lCommit=9a6288e083d650c470623f5fe136b1060824021c&rBranch=main&rCommit=dec5ab8d984b8a608140911351d877b9ddb141c2)
* [x86 micro benchmark](https://hud.pytorch.org/benchmark/llms?startTime=Wed%2C%2021%20May%202025%2016%3A41%3A26%20GMT&stopTime=Wed%2C%2028%20May%202025%2016%3A41%3A26%20GMT&granularity=day&lBranch=main&lCommit=c1b7dbc52aaa49f4cd147bbe5935110a4a10e3e3&rBranch=refs/tags/ciflow/inductor-micro-benchmark-cpu-x86/154503&rCommit=9a6288e083d650c470623f5fe136b1060824021c&repoName=pytorch%2Fpytorch&benchmarkName=&modelName=All%20Models&backendName=All%20Backends&modeName=All%20Modes&dtypeName=All%20DType&deviceName=cpu%20(x86_64)&archName=All%20Platforms)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/154503
Approved by: https://github.com/Skylion007, https://github.com/malfet
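
The affinity arithmetic described in the commit message can be sketched in Python. This is only an illustration, not code from the PR: the helper names `taskset_cpu_range` and `physical_cores` are hypothetical, and the sample tenant (logical CPUs 24-47 with 2 threads per core) is an assumed layout chosen to match the H100 example of 24 CPUs per tenant. The extra check for a slice that is too small is also an addition of this sketch.

```python
import os


def taskset_cpu_range(affinity, threads_per_core):
    """Return (start_cpu, end_cpu) for a `taskset -c start-end` call.

    Follows the idea in the commit message: start at the lowest logical
    CPU ID the process may run on, and stop one physical core (that is,
    `threads_per_core` logical CPUs) short of the highest one.
    """
    if not affinity:
        raise ValueError("affinity set must not be empty")
    if threads_per_core < 1:
        raise ValueError("threads_per_core must be at least 1")
    start_cpu = min(affinity)
    end_cpu = max(affinity) - threads_per_core
    if end_cpu < start_cpu:
        # Assumption of this sketch: refuse slices too small to spare a core.
        raise ValueError("slice is too small to leave one physical core free")
    return start_cpu, end_cpu


def physical_cores(logical_cpus, threads_per_core):
    """Convert a logical CPU count (what `nproc` reports) to physical cores."""
    return logical_cpus // threads_per_core


if __name__ == "__main__":
    # Hypothetical second tenant on a 192-CPU host split 8 ways.
    tenant = set(range(24, 48))
    start, end = taskset_cpu_range(tenant, threads_per_core=2)
    print(f"taskset -c {start}-{end}")           # taskset -c 24-45
    print(physical_cores(len(tenant), 2))        # 12

    # On Linux, the real affinity set of the current process:
    if hasattr(os, "sched_getaffinity"):
        print(sorted(os.sched_getaffinity(0)))
```

Note that the range starts at 24 rather than 0, which is the point of the fix: a hard-coded `0-N` range would name CPUs that this tenant is not allowed to use.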
@@ -820,16 +820,7 @@ test_inductor_torchbench_smoketest_perf() {
   done
 }
-
-test_inductor_get_core_number() {
-  if [[ "${TEST_CONFIG}" == *aarch64* ]]; then
-    echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"
-  else
-    echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"
-  fi
-}
-
 test_inductor_set_cpu_affinity(){
   #set jemalloc
   JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"
   export LD_PRELOAD="$JEMALLOC_LIB":"$LD_PRELOAD"
   export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
@@ -841,14 +832,23 @@ test_inductor_set_cpu_affinity(){
     export KMP_AFFINITY=granularity=fine,compact,1,0
     export KMP_BLOCKTIME=1
   fi
-  cores=$(test_inductor_get_core_number)
-  # Set number of cores to 16 on Aarch64 for performance runs.
+
+  # Use nproc here instead of lscpu because it takes into account cgroups slice
+  cpus=$(nproc)
+  thread_per_core=$(lscpu | grep 'Thread(s) per core:' | awk '{print $4}')
+  cores=$((cpus / thread_per_core))
+
+  # Set number of cores to 16 on aarch64 for performance runs
   if [[ "${TEST_CONFIG}" == *aarch64* && $cores -gt 16 ]]; then
     cores=16
   fi
   export OMP_NUM_THREADS=$cores
-  end_core=$((cores-1))
-  export TASKSET="taskset -c 0-$end_core"
+
+  # Handle cgroups slice start and end CPU
+  start_cpu=$(python -c 'import os; print(min(os.sched_getaffinity(0)))')
+  # Leaving one physical CPU for other tasks
+  end_cpu=$(($(python -c 'import os; print(max(os.sched_getaffinity(0)))') - thread_per_core))
+  export TASKSET="taskset -c $start_cpu-$end_cpu"
 }
 
 test_inductor_torchbench_cpu_smoketest_perf(){
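
The new `TASKSET` value describes the cgroups slice as a single `start_cpu-end_cpu` range. If the affinity set ever had gaps, that range would also name CPUs outside the slice. As a hedged aside that is not part of this commit, an explicit CPU list could be built instead, since `taskset -c` accepts comma-separated numbers and ranges (for example `0,5,8-11`). The helper name `format_cpu_list` below is hypothetical.

```python
def format_cpu_list(cpus):
    """Collapse CPU IDs into a taskset-style list such as '0,5,8-11'."""
    ordered = sorted(set(cpus))
    if not ordered:
        raise ValueError("cpus must not be empty")

    def render(first, last):
        # A run of one CPU is written as a bare number, otherwise as a range.
        return str(first) if first == last else f"{first}-{last}"

    parts = []
    run_start = prev = ordered[0]
    for cpu in ordered[1:]:
        if cpu == prev + 1:
            # Still inside the current consecutive run.
            prev = cpu
            continue
        # Gap found: close the current run and start a new one.
        parts.append(render(run_start, prev))
        run_start = prev = cpu
    parts.append(render(run_start, prev))
    return ",".join(parts)


if __name__ == "__main__":
    print(format_cpu_list([0, 5, 8, 9, 10, 11]))  # 0,5,8-11
    print(format_cpu_list(range(24, 46)))         # 24-45
```

For a contiguous slice this produces the same `start-end` string as the diff, so it would only change behavior when the affinity set is fragmented.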