
Worse performance on aarch64 when building latest from source vs. older versions from yum repos

Describe the Bug

We're seeing very different CPU usage profiles for compression based on where we get the zstd binary.

Our own 1.5.0 binary built from source seems to use 4-5% CPU, while a 1.4.2 binary from epel uses 2-3%, and a 1.3.3 binary from amzn2-core uses 1-1.5%.

This difference is confusing us, and it matters because we're trying to maximize the CPU available to our main application. We'd like to know where our 1.5.0 build has gone wrong. I suspect it could be a matter of gcc version, but I'm hoping for better insight.

To Reproduce

We don't have a good reproducible case, but we're hoping that there's some obvious thing to try here. Context is provided below.

Expected Behavior

We'd like to see the 1-1.5% CPU from the 1.3.3 binary, or better, but with 1.5.0 for all its features (notably, better --long mode).

Full Context

Each zstd invocation uses compression level 5 (-5) and 5 worker threads (-T5). They run in Docker containers restricted to 6 vCPUs via CFS, with the underlying host being Amazon's c6g.8xlarge instance type (32 vCPU, 64GB memory, aarch64 arch). CPU throttling by the OS (Amazon Linux 2, kernel 5.10.47-39.130.amzn2.aarch64) is negligible.
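For concreteness, the invocation described above would look roughly like this; the input and output paths are placeholders, not our real pipeline:

```shell
# Hypothetical sketch of our zstd usage: compression level 5, 5 worker threads.
# Paths are illustrative only.
zstd -5 -T5 /path/to/input.log -o /path/to/input.log.zst
```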

The outputs below are taken with perf top -F 100 -d 20 and letting it run for a few minutes. Feel free to ignore the kernel and JIT parts, as these are from the main Java application we're running, while zstd is part of a supporting process.

We see similar high CPU results if we build 1.3.3 and 1.4.2 ourselves just like we do with 1.5.0. If needed, I can recompile 1.5.0 with debug symbols and get some data with names instead of memory addresses.
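If symbol names would help, this is roughly how I'd rebuild and re-profile; the MOREFLAGS variable is the zstd Makefile hook for appending extra compiler/linker flags, and the perf commands are an assumed sketch, not what we've run so far:

```shell
# Rebuild the zstd CLI with debug info so perf can resolve symbols.
cd /zstd/programs
make clean
V=1 MOREFLAGS="-g -fno-omit-frame-pointer" make

# Record samples from the running zstd process, then view symbolized output
# instead of raw addresses.
perf record -F 100 -g -p "$(pidof zstd)" -- sleep 20
perf report --stdio | head -40
```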

1.3.3 (zstd-1.3.3-1.amzn2.0.1.aarch64.rpm from the amzn2-core repo)

2.52%  [kernel]          [k] __softirqentry_text_start
1.33%  [JIT] tid 6119    [.] 0x0000ffff88add1b4
1.03%  [JIT] tid 6119    [.] 0x0000ffff88aa6da4
0.85%  [JIT] tid 6119    [.] 0x0000ffff88aa709c
0.68%  [JIT] tid 6119    [.] 0x0000ffff88bd94d4
0.66%  zstd              [.] 0x0000000000017458
0.60%  [kernel]          [k] arch_local_irq_enable
0.54%  [JIT] tid 6119    [.] 0x0000ffff88bd94c8
0.53%  [JIT] tid 6119    [.] 0x0000ffff88d51fb4
0.52%  [JIT] tid 6119    [.] 0x0000ffff88add1c0
0.50%  [JIT] tid 6119    [.] 0x0000ffff88add3ac
0.49%  [JIT] tid 6119    [.] 0x0000ffff88d80b9c
0.43%  [kernel]          [k] __wake_up_common_lock
0.38%  [kernel]          [k] ____nf_conntrack_find
0.37%  [JIT] tid 6119    [.] 0x0000ffff8095e7e0
0.35%  [JIT] tid 6119    [.] 0x0000ffff88add190
0.34%  zstd              [.] 0x000000000001b458
0.32%  [JIT] tid 6119    [.] 0x0000ffff8095e640
0.32%  [JIT] tid 6119    [.] 0x0000ffff88c046f4
0.31%  [JIT] tid 6119    [.] 0x0000ffff8095e790
0.29%  zstd              [.] 0x00000000000174bc

1.4.2 (zstd-1.4.2-1.el7.aarch64.rpm from the epel repo)

2.66%  [kernel]          [k] __softirqentry_text_start
1.34%  zstd              [.] 0x0000000000025b48
1.29%  [JIT] tid 6204    [.] 0x0000ffff94adb834
1.16%  [JIT] tid 6204    [.] 0x0000ffff94aa61a4
0.94%  [JIT] tid 6204    [.] 0x0000ffff94aa649c
0.64%  zstd              [.] 0x000000000002c010
0.59%  [JIT] tid 6204    [.] 0x0000ffff94bae654
0.58%  [JIT] tid 6204    [.] 0x0000ffff94cdeeb8
0.58%  [JIT] tid 6204    [.] 0x0000ffff94bae648
0.55%  [JIT] tid 6204    [.] 0x0000ffff94d5f078
0.53%  zstd              [.] 0x0000000000025bf0
0.53%  [kernel]          [k] arch_local_irq_enable
0.49%  [JIT] tid 6204    [.] 0x0000ffff94adb840
0.47%  [JIT] tid 6204    [.] 0x0000ffff94adba2c
0.39%  [kernel]          [k] ____nf_conntrack_find
0.38%  [JIT] tid 6204    [.] 0x0000ffff8c95e8cc
0.38%  [kernel]          [k] __wake_up_common_lock
0.37%  [JIT] tid 6204    [.] 0x0000ffff8c95e7e0
0.33%  [JIT] tid 6204    [.] 0x0000ffff94adbbb4
0.32%  [JIT] tid 6204    [.] 0x0000ffff8c95e640
0.31%  zstd              [.] 0x000000000002bff4

1.5.0 (built from source in a Docker build)

3.26%  zstd              [.] 0x000000000004cc0c
2.66%  [kernel]          [k] __softirqentry_text_start
1.25%  [JIT] tid 6321    [.] 0x0000ffff98b19b74
1.04%  [JIT] tid 6321    [.] 0x0000ffff98ae86a4
0.90%  zstd              [.] 0x000000000004cc8c
0.80%  [JIT] tid 6321    [.] 0x0000ffff98ae899c
0.67%  [JIT] tid 6321    [.] 0x0000ffff9095e808
0.61%  [JIT] tid 6321    [.] 0x0000ffff986acd00
0.60%  [JIT] tid 6321    [.] 0x0000ffff9849986c
0.59%  [JIT] tid 6321    [.] 0x0000ffff98d9b9b8
0.58%  [JIT] tid 6321    [.] 0x0000ffff98bdbfd4
0.57%  [JIT] tid 6321    [.] 0x0000ffff98bdbfc8
0.52%  [kernel]          [k] arch_local_irq_enable
0.51%  zstd              [.] 0x000000000004cdbc
0.48%  [JIT] tid 6321    [.] 0x0000ffff9095e7b8
0.46%  [JIT] tid 6321    [.] 0x0000ffff98b19b7c
0.44%  [JIT] tid 6321    [.] 0x0000ffff9afb3320
0.37%  [kernel]          [k] ____nf_conntrack_find
0.37%  [kernel]          [k] __wake_up_common_lock
0.35%  zstd              [.] 0x000000000004cd40

This is basically how we're building 1.5.0. The gcc-c++ version ends up being 4.8.5-44.el7 from the base repo, and the make version is 1:3.82-24.el7, also from the base repo. We're using Docker to cross-compile from an x86_64 platform to an aarch64 target, but we're only using buildx/qemu-user-static out of the box, so we don't anticipate this is the issue.

FROM centos:7
RUN yum install -y git unzip boost-devel gcc-c++ make
RUN git clone --depth 1 --branch v1.5.0 https://github.com/facebook/zstd.git /zstd
RUN cd /zstd/programs && V=1 make

Verbose output shows lots of cc -O3 and -DZSTD_MULTITHREAD as expected, e.g.

cc -O3 -DBACKTRACE_ENABLE=0 -DXXH_NAMESPACE=ZSTD_ -DDEBUGLEVEL=0  -DZSTD_MULTITHREAD    -DZSTD_LEGACY_SUPPORT=5 -pthread  obj/conf_b6344f96651bb372db7f078dd65da0a1/debug.o obj/conf_b6344f96651bb372db7f078dd65da0a1/entropy_common.o obj/conf_b6344f96651bb372db7f078dd65da0a1/error_private.o obj/conf_b6344f96651bb372db7f078dd65da0a1/fse_decompress.o obj/conf_b6344f96651bb372db7f078dd65da0a1/pool.o obj/conf_b6344f96651bb372db7f078dd65da0a1/threading.o obj/conf_b6344f96651bb372db7f078dd65da0a1/xxhash.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_common.o obj/conf_b6344f96651bb372db7f078dd65da0a1/fse_compress.o obj/conf_b6344f96651bb372db7f078dd65da0a1/hist.o obj/conf_b6344f96651bb372db7f078dd65da0a1/huf_compress.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_compress.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_compress_literals.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_compress_sequences.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_compress_superblock.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_double_fast.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_fast.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_lazy.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_ldm.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_opt.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstdmt_compress.o obj/conf_b6344f96651bb372db7f078dd65da0a1/huf_decompress.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_ddict.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_decompress.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_decompress_block.o obj/conf_b6344f96651bb372db7f078dd65da0a1/cover.o obj/conf_b6344f96651bb372db7f078dd65da0a1/divsufsort.o obj/conf_b6344f96651bb372db7f078dd65da0a1/fastcover.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zdict.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_v05.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_v06.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd_v07.o obj/conf_b6344f96651bb372db7f078dd65da0a1/dibio.o 
obj/conf_b6344f96651bb372db7f078dd65da0a1/datagen.o obj/conf_b6344f96651bb372db7f078dd65da0a1/fileio.o obj/conf_b6344f96651bb372db7f078dd65da0a1/benchzstd.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstdcli_trace.o obj/conf_b6344f96651bb372db7f078dd65da0a1/timefn.o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstdcli.o obj/conf_b6344f96651bb372db7f078dd65da0a1/util.o obj/conf_b6344f96651bb372db7f078dd65da0a1/benchfn.o  -o obj/conf_b6344f96651bb372db7f078dd65da0a1/zstd
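To test the gcc-version theory cheaply, one variant of our Dockerfile we could try swaps in a newer toolchain from Software Collections. This is a hedged sketch; we haven't verified that the devtoolset-9 packages are available for this aarch64 base image:

```dockerfile
FROM centos:7
# Assumption: centos-release-scl / devtoolset-9 packages exist for this arch.
RUN yum install -y centos-release-scl && \
    yum install -y devtoolset-9-gcc-c++ make git
RUN git clone --depth 1 --branch v1.5.0 https://github.com/facebook/zstd.git /zstd
# Build with gcc 9 instead of the base gcc 4.8.5.
RUN cd /zstd/programs && scl enable devtoolset-9 -- make V=1
```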

Thanks in advance for any response!


Answer from terrelln

Hi @mkjois, can you post the output of the zstd benchmark on the data you're compressing? All three of these zstd binaries support the benchmark tool:

# Benchmark level 5
zstd -b5 -e5 /path/to/data

If you can't easily grab your real data, you can try the silesia corpus.
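A possible end-to-end run with the silesia corpus might look like the following; the download URL is the commonly cited mirror and the binary paths are placeholders, so adjust both to your setup:

```shell
# Fetch and pack the silesia corpus (assumption: this mirror is still up).
curl -O http://sun.aei.polsl.pl/~sdeor/corpus/silesia.zip
unzip silesia.zip -d silesia
tar -cf silesia.tar silesia/

# Compare the binaries at level 5; -T5 matches the production thread count.
/path/to/zstd-1.3.3 -b5 -T5 silesia.tar
/path/to/zstd-1.5.0 -b5 -T5 silesia.tar
```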
