Nick Terrell (terrelln), @facebook, zstd developer

terrelln/linux 5

Linux kernel source tree

terrelln/automata 4

Simulating and visualizing finite automata

pviolette3/jamstacks 0

full stack system for jamming stacks (www.jamstacks.club)

terrelln/abseil-cpp 0

Abseil Common Libraries (C++)

terrelln/air-types 0

Air Typing

terrelln/bril 0

an educational compiler intermediate representation

terrelln/btrfs-progs 0

Mirror of official repository of userspace BTRFS tools, plus development branches.

terrelln/buildroot 0

Buildroot, making embedded Linux easy. Note that this is not the official repository, but only a mirror. The official Git repository is at http://git.buildroot.net/buildroot/. Do not open issues or file pull requests here.

terrelln/cfg-expander 0

Expands CFG grammars

terrelln/clang 0

Mirror of official clang git repository located at http://llvm.org/git/clang. Updated every five minutes.

pull request comment facebook/zstd

Reduce size of dctx by reutilizing dst buffer

Awesome! The performance looks good to me!

binhdvo

comment created time in 5 hours

create branch terrelln/zstd

branch : lazy-64-fix

created branch time in 7 hours

pull request comment facebook/zstd

[lazy] Speed up compilation times

The only downside I can think of is that it's generally preferred not to use macros when an alternative is possible, and this code is heavy on template-by-macros. But in this case the measurements prove your point: compilation is considerably faster and the binary size improves by a sizable amount. So that's a small price to pay.

Yeah, I think it is a reasonable tradeoff. The logic lives in inlined functions, and macros generate the functions that bind the compile-time constant "template parameters". That keeps all the logic in macro-free functions, and the macro magic is limited to selecting which function to call.
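To make that mechanism concrete, here is a minimal sketch of the pattern in C. It is not zstd's actual code: `searchMax_body`, `GEN_SEARCH`, and `selectSearchFn` are hypothetical names, and the "search" itself is a placeholder. The shape is the point: a macro stamps out one small outlined function per compile-time parameter combination, each calling a shared inline body with constant arguments, and runtime dispatch only picks a function.

```c
#include <stddef.h>

/* Sketch only: searchMax_body, GEN_SEARCH, and selectSearchFn are hypothetical
 * names illustrating the pattern, not zstd's real identifiers. */

typedef enum { dictMode_none = 0, dictMode_dict = 1 } dictMode_e;

/* All of the logic lives here, macro-free. It is inlined into each generated
 * wrapper, where `mls` and `dictMode` are compile-time constants, so dead
 * branches are eliminated per specialization. */
static inline size_t searchMax_body(const void* src, size_t srcSize,
                                    int mls, dictMode_e dictMode)
{
    size_t score = srcSize % 251;            /* stand-in for the real search */
    if (dictMode == dictMode_dict) score += 1;
    (void)src;
    return score + (size_t)mls;
}

/* The macro's only job is to stamp out one small outlined function per
 * (mls, dictMode) combination, binding the "template parameters" as constants. */
#define GEN_SEARCH(mls, dm)                                                   \
    static size_t searchMax_##mls##_##dm(const void* src, size_t srcSize)     \
    {                                                                         \
        return searchMax_body(src, srcSize, mls, dictMode_##dm);              \
    }

GEN_SEARCH(4, none) GEN_SEARCH(4, dict)
GEN_SEARCH(5, none) GEN_SEARCH(5, dict)

typedef size_t (*searchMax_f)(const void* src, size_t srcSize);

/* Runtime dispatch is reduced to selecting a function; none of the logic is
 * expanded here, which is what keeps each translation unit small. */
static searchMax_f selectSearchFn(int mls, dictMode_e dictMode)
{
    static const searchMax_f table[2][2] = {
        { searchMax_4_none, searchMax_4_dict },
        { searchMax_5_none, searchMax_5_dict },
    };
    return table[mls - 4][dictMode];
}
```

Compared with one giant function that inlines every specialization behind a `switch`, the compiler optimizes each small wrapper independently, which is where the compile-time and binary-size wins reported in the PR come from.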

terrelln

comment created time in 9 hours

push event terrelln/zstd

Nick Terrell

commit sha 13cad3abb1bea4419fef9c4e5d4f79a327e89071

[lazy] Speed up compilation times

Speed up compilation times by moving each specialized search function into its own function. This is faster because compilers can handle many smaller functions much faster than one gigantic function. The previous approach generated one giant function with `switch` statements and inlining to select the implementation.

| Compiler | Flags | Dev Time (s) | PR Time (s) | Delta |
|----------|-------|--------------|-------------|-------|
| gcc | -O3 | 16.5 | 5.6 | -66% |
| gcc | -O3 -g -fsanitize=address,undefined | 158.9 | 38.2 | -75% |
| clang | -O3 | 36.5 | 5.5 | -85% |
| clang | -O3 -g -fsanitize=address,undefined | 27.8 | 17.5 | -37% |

This also reduces the binary size because the search functions are no longer inlined into the main body.

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc | 1563868 | 1308844 | -16% |
| clang | 1924372 | 1376020 | -28% |

Finally, performance is not significantly impacted by this change; in fact, we generally see a small speed boost.

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta |
|----------|-------|------------------|-----------------|-------|
| gcc | 5 | 110.6 | 110.0 | -0.5% |
| gcc | 7 | 70.4 | 72.2 | +2.5% |
| gcc | 9 | 53.2 | 53.5 | +0.5% |
| gcc | 13 | 12.7 | 12.9 | +1.5% |
| clang | 5 | 113.9 | 110.4 | -3.0% |
| clang | 7 | 67.7 | 70.6 | +4.2% |
| clang | 9 | 51.9 | 52.2 | +0.5% |
| clang | 13 | 12.4 | 13.3 | +7.2% |

The compression strategy is unmodified in this PR, so the compressed size should be exactly the same. I may have a follow-up PR to slightly improve the compression ratio, if it doesn't cost too much speed.

view details

push time in 9 hours

push event terrelln/zstd

Nick Terrell

commit sha a194e31cf285b9c6db34b7d37c291bea21805f44

[lazy] Speed up compilation times (full commit message identical to the one shown above)

view details

push time in 9 hours

PR opened facebook/zstd

[lazy] Speed up compilation times

Speed up compilation times by moving each specialized search function into its own function. This is faster because compilers can handle many smaller functions much faster than one gigantic function. The previous approach generated one giant function with switch statements and inlining to select the implementation.

| Compiler | Flags | Dev Time (s) | PR Time (s) | Delta |
|----------|-------|--------------|-------------|-------|
| gcc | -O3 | 16.5 | 5.6 | -66% |
| gcc | -O3 -g -fsanitize=address,undefined | 158.9 | 38.2 | -75% |
| clang | -O3 | 36.5 | 5.5 | -85% |
| clang | -O3 -g -fsanitize=address,undefined | 27.8 | 17.5 | -37% |

This also reduces the binary size because the search functions are no longer inlined into the main body.

| Compiler | Dev libzstd.a Size (B) | PR libzstd.a Size (B) | Delta |
|----------|------------------------|-----------------------|-------|
| gcc | 1563868 | 1308844 | -16% |
| clang | 1924372 | 1376020 | -28% |

Finally, performance is not significantly impacted by this change; in fact, we generally see a small speed boost.

| Compiler | Level | Dev Speed (MB/s) | PR Speed (MB/s) | Delta |
|----------|-------|------------------|-----------------|-------|
| gcc | 5 | 110.6 | 110.0 | -0.5% |
| gcc | 7 | 70.4 | 72.2 | +2.5% |
| gcc | 9 | 53.2 | 53.5 | +0.5% |
| gcc | 13 | 12.7 | 12.9 | +1.5% |
| clang | 5 | 113.9 | 110.4 | -3.0% |
| clang | 7 | 67.7 | 70.6 | +4.2% |
| clang | 9 | 51.9 | 52.2 | +0.5% |
| clang | 13 | 12.4 | 13.3 | +7.2% |

The compression strategy is unmodified in this PR, so the compressed size should be exactly the same. I may have a follow-up PR to slightly improve the compression ratio, if it doesn't cost too much speed.

+146 -216

0 comments

1 changed file

pr created time in 9 hours

push event terrelln/zstd

Nick Terrell

commit sha b105c85bd1b8411bad1fdbcd2204d1b4562004fe

[lazy] Speed up compilation times (full commit message identical to the one shown above)

view details

push time in 9 hours

push event terrelln/zstd

Nick Terrell

commit sha eef58f2cbf85e00c644e98d0412c25a5a3d3b235

[lazy] Speed up compilation times (full commit message identical to the one shown above)

view details

push time in 9 hours

create branch terrelln/zstd

branch : lazy-compile

created branch time in 10 hours

create branch terrelln/zstd

branch : lazy-compile-2021-10-21-1245

created branch time in 11 hours

create branch terrelln/zstd

branch : lazy-compile-2021-10-22-1111

created branch time in 11 hours

issue comment facebook/zstd

Compression ratio for --fast=2 and higher became significantly worse. Expected?

Thanks for the feedback! It's especially helpful before a release, so we have time to change things if we want.

So far we haven't really stabilized the ratio/speed tradeoff of the negative compression levels. They're getting monotonically faster, but that's what we provide so far. So we'll have to discuss and decide whether we want to provide further stability.

Does switching from --fast=3 to --fast=2 approximate the old --fast=3 setting?
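For readers mapping the CLI to the library: `--fast=N` corresponds to negative compression level `-N`. Below is a minimal sketch of compressing with level -2 through the public C API; the payload and error handling are illustrative, not taken from the thread.

```c
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(void)
{
    /* `zstd --fast=2` on the CLI corresponds to compression level -2 in the
     * library. Negative levels trade ratio for speed, and as noted above,
     * their exact ratio/speed tradeoff is not a stability guarantee. */
    const char src[] = "example payload, repeated payload, repeated payload";
    size_t const srcSize = sizeof(src);
    size_t const dstCapacity = ZSTD_compressBound(srcSize);
    void* const dst = malloc(dstCapacity);
    if (dst == NULL) return 1;

    size_t const cSize = ZSTD_compress(dst, dstCapacity, src, srcSize, -2);
    if (ZSTD_isError(cSize)) {
        fprintf(stderr, "compression failed: %s\n", ZSTD_getErrorName(cSize));
        free(dst);
        return 1;
    }
    printf("compressed %zu -> %zu bytes at level -2\n", srcSize, cSize);
    free(dst);
    return 0;
}
```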

danlark1

comment created time in 12 hours

push event facebook/zstd

Nick Terrell

commit sha abd717a5fa90af6e35a6023fc88c31f45dc66901

[asm] Switch to C style comments

Switch to C style comments for increased portability and consistency.

view details

Nick Terrell

commit sha dad8a3cf34cbf7d2a51c43142b92d400c91336e7

Merge pull request #2825 from terrelln/huf-asm-comments [asm] Switch to C style comments

view details

push time in 2 days

PR merged facebook/zstd

[asm] Switch to C style comments (CLA Signed)

Switch to C style comments for increased portability and consistency.

This PR is only formatting, and doesn't include any functional changes.

+525 -515

0 comments

1 changed file

terrelln

pr closed time in 2 days

PR opened facebook/zstd

[asm] Switch to C style comments

Switch to C style comments for increased portability and consistency.

+525 -515

0 comments

1 changed file

pr created time in 2 days

create branch terrelln/zstd

branch : huf-asm-comments

created branch time in 2 days

pull request review event

create branch terrelln/zstd

branch : dctx-split-speed

created branch time in 4 days

issue comment facebook/zstd

How can I compress a directory?

You mean you used PZSTD_NUM_THREADS?

By default PZSTD_NUM_THREADS is undefined and unused, so the default number of threads is std::thread::hardware_concurrency(). See:

https://github.com/facebook/zstd/blob/12c045f74d922dc934c168f6e1581d72df983388/contrib/pzstd/Options.cpp#L25-L31

g666gle

comment created time in 7 days

create branch terrelln/zstd

branch : v1.5.1-fb

created branch time in 7 days

issue comment facebook/zstd

gcc `bmi2` option doesn't contain `bmi`

This could be doable. I wouldn't enable it by implicitly enabling AVX2 / BMI1, but by changing our bmi2 check to require all three: ZSTD_cpuid_bmi1(cpuid) && ZSTD_cpuid_bmi2(cpuid) && ZSTD_cpuid_avx2(cpuid).

On my machine, I don't see a significant difference when enabling BMI1 & lzcnt & avx2. I actually see worse performance, likely because of alignment. So I'd want to be sure the gains were actually because the compiler was able to use a new instruction, and not because of random noise.
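A minimal sketch of the combined runtime gate proposed above, using the cpuid accessors named in the comment (the availability of `ZSTD_cpuid_avx2` alongside the bmi1/bmi2 helpers is assumed). `ZSTD_hasBmi2Bundle` is a hypothetical name and `cpu.h` refers to zstd's internal CPU-feature header, so treat this as an illustration of the check rather than a drop-in patch.

```c
#include "cpu.h"  /* zstd's internal lib/common/cpu.h: ZSTD_cpuid() + feature accessors (assumed) */

/* Require BMI1, BMI2, and AVX2 together before selecting code paths built
 * with the extra target features, instead of gating on bmi2 alone. */
static int ZSTD_hasBmi2Bundle(void)
{
    ZSTD_cpuid_t const cpuid = ZSTD_cpuid();
    return ZSTD_cpuid_bmi1(cpuid)
        && ZSTD_cpuid_bmi2(cpuid)
        && ZSTD_cpuid_avx2(cpuid);
}
```

As the comment notes, any such change should be justified by measured wins from the newly usable instructions rather than by noise.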

animalize

comment created time in 8 days

issue comment facebook/zstd

What is the correct way to pass the "--fast=X" flag for compression in BTRFS mount options?

The lowest compression level available for BtrFS is 1. We added BtrFS support before negative compression levels were added, so it doesn't support them.

There isn't currently a way to say compression=zstd:-1 or compression=zstd_fast:1. We have considered changing the compression level mapping, and mapping zstd:1 to -1.

I'm currently working on updating the zstd version in the kernel. Once that work has landed, it will be possible to use negative compression levels, and we could consider how to support them in BtrFS.

bkdwt

comment created time in 10 days

pull request comment facebook/zstd

Reduce size of dctx by reutilizing dst buffer

I measured zstd -b1e3 -B128K silesia.tar. I chose a 128K chunk size because the decoder spends most of its time in the split literals loop. Level 1 spends 80% of its time in the split loop, and levels 2 & 3 spend 70%.

| Compiler | Level | Skylake OnDemand | DevVM Broadwell | I9-9900K Coffeelake |
|----------|-------|------------------|-----------------|---------------------|
| gcc | 1 | 0.0% | -2.1% | -10.2% |
| gcc | 2 | -0.3% | -2.0% | -12.1% |
| gcc | 3 | -0.3% | -3.0% | -11.8% |
| clang | 1 | -2.7% | -4.0% | -6.1% |
| clang | 2 | -2.8% | -3.8% | -6.3% |
| clang | 3 | -2.9% | -3.5% | -6.2% |

Skylake OnDemand

  • gcc.par
  • clang.par
  • Skylake
  • Stable results - used the first measurement
  • Pinned to CPU 0

DevVM Broadwell

  • gcc.par
  • clang.par
  • Broadwell
  • Noisy results - took the average of 10 runs of 10 seconds each
  • Pinned to CPU 0 + nice -n -20
  • Even with the noise, all the averages were lower. If it were just noise I would've expected it to sometimes be positive. Additionally, I took the max speed of all the runs, and all the max speeds were lower too.

I9-9900K Coffeelake

  • gcc-11.1.0
  • clang-12.0.1
  • Personal server
  • Stable results - used the first measurement
  • No turbo, disabled ASLR, pinned to CPU0/1 which are isolated and have a disabled SMT pair

I'll edit in results from silesia.tar without -B128K to measure the performance of the regular non-split loop, which accounts for 100% of the time in that configuration. It is looking neutral on the OnDemand machine, negative on my server, and the DevVM results aren't clear yet.

binhdvo

comment created time in 10 days

push event terrelln/linux

Nick Terrell

commit sha ecea7adad80d9d230df766345e5f8061792da00d

lib: zstd: Upgrade to latest upstream zstd version 1.4.10

Upgrade to the latest upstream zstd version 1.4.10. This patch is 100% generated from upstream zstd commit 20821a46f412 [0]. This patch is very large because it is transitioning from the custom kernel zstd to using upstream directly. The new zstd follows upstream's file structure, which is different. Future update patches will be much smaller because they will only contain the changes from one upstream zstd release.

As an aid for review I've created a commit [1] that shows the diff between upstream zstd as-is (which doesn't compile) and the zstd code imported in this patch. The version of zstd in this patch is generated from upstream with changes applied by automation to replace upstream's libc dependencies, remove unnecessary portability macros, replace `/**` comments with `/*` comments, and use the kernel's xxhash instead of bundling it.

The benefits of this patch are as follows:

1. Using upstream directly, with an automated script to generate the kernel code. This allows us to update the kernel every upstream release, so the kernel gets the latest bug fixes and performance improvements, and doesn't get 3 years out of date again. The automation and the translated code are tested every upstream commit to ensure they continue to work.
2. Upgrades from a custom zstd based on 1.3.1 to 1.4.10, getting 3 years of performance improvements and bug fixes. On x86_64 I've measured 15% faster BtrFS and SquashFS decompression+read speeds, 35% faster kernel decompression, and 30% faster ZRAM decompression+read speeds.
3. Zstd-1.4.10 supports negative compression levels, which allow zstd to match or subsume lzo's performance.
4. Maintains the same kernel-specific wrapper API, so no callers have to be modified with zstd version updates.

One concern that was brought up was stack usage. Upstream zstd had already removed most of its heavy stack usage functions, but I just removed the last functions that allocate arrays on the stack. I've measured the high water mark for both compression and decompression before and after this patch. Decompression is approximately neutral, using about 1.2KB of stack space. Compression levels up to 3 regressed from 1.4KB -> 1.6KB, and higher compression levels regressed from 1.5KB -> 2KB. We've added unit tests upstream to prevent further regression. I believe that this is a reasonable increase, and if it does end up causing problems, this commit can be cleanly reverted, because it only touches zstd.

I chose the bulk update instead of replaying upstream commits because there have been ~3500 upstream commits since the 1.3.1 release, zstd wasn't ready to be used in the kernel as-is before a month ago, and not all upstream zstd commits build. The bulk update preserves bisectability because bugs can be bisected to the zstd version update. At that point the update can be reverted, and we can work with upstream to find and fix the bug.

Note that upstream zstd release 1.4.10 doesn't exist yet. I have cut a staging branch at 20821a46f412 [0] and will apply any changes requested to the staging branch. Once we're ready to merge this update I will cut a zstd release at the commit we merge, so we have a known zstd release in the kernel.

The implementation of the kernel API is contained in zstd_compress_module.c and zstd_decompress_module.c.

[0] https://github.com/facebook/zstd/commit/20821a46f4122f9abd7c7b245d28162dde8129c9
[1] https://github.com/terrelln/linux/commit/e0fa481d0e3df26918da0a13749740a1f6777574

Signed-off-by: Nick Terrell <terrelln@fb.com>

view details

Nick Terrell

commit sha 464413496acb619a380364a26912f02c8b709d29

MAINTAINERS: Add maintainer entry for zstd

Adds a maintainer entry for zstd listing myself as the maintainer for all zstd code, pointing to the upstream issues tracker for bugs, and listing my linux repo as the tree.

Signed-off-by: Nick Terrell <terrelln@fb.com>

view details

push time in 11 days

push event facebook/zstd

Nick Terrell

commit sha 695181c2e006fc8285f84f90510edfc293e5a251

[multiple-ddicts] Fix NULL checks

The bug was reported by Dan Carpenter and found by Smatch static checker.
https://lore.kernel.org/all/20211008063704.GA5370@kili/

view details

Nick Terrell

commit sha 2c94f9fc614f5adc6a0a9aaa319e975cc4d81f1f

[nit] Fix buggy indentation

The bug was reported by Dan Carpenter and found by Smatch static checker.
https://lore.kernel.org/all/20211008063704.GA5370@kili/

view details

Nick Terrell

commit sha 5a3e16f0c210791a70e1665810d081efd3da0821

[ldm] Fix ZSTD_c_ldmHashRateLog bounds check

There is no minimum value check, so the parameter could be negative. Switch to the standard pattern of using `BOUNDCHECK()`.

The bug was reported by Dan Carpenter and found by Smatch static checker.
https://lore.kernel.org/all/20211008063704.GA5370@kili/

view details

Nick Terrell

commit sha c0c38ba1db6a245b944b9c226647c00e5d657c4b

[binary-tree] Fix underflow of nbCompares

Fix underflow of `nbCompares` by switching to an `int` and comparing `nbCompares > 0`. This is a minimal fix, because I don't want to change the logic. These loops seem to be doing `nbCompares + 1` comparisons.

The bug was reported by Dan Carpenter and found by Smatch static checker.
https://lore.kernel.org/all/20211008063704.GA5370@kili/

view details

push time in 11 days

push event facebook/zstd

Nick Terrell

commit sha c6c482fe07fb4f0e5007641ed9f6b758900c93c1

[binary-tree] Fix underflow of nbCompares

Fix underflow of `nbCompares` by switching to an `int` and comparing `nbCompares > 0`. This is a minimal fix, because I don't want to change the logic. These loops seem to be doing `nbCompares + 1` comparisons.

The bug was reported by Dan Carpenter and found by Smatch static checker.
https://lore.kernel.org/all/20211008063704.GA5370@kili/

view details

Nick Terrell

commit sha b77d95b053552f4a319690cc56500d66a89ac1e8

Merge pull request #2820 from terrelln/nb-compares [binary-tree] Fix underflow of nbCompares

view details

push time in 12 days

PR merged facebook/zstd

[binary-tree] Fix underflow of nbCompares

Fix underflow of nbCompares by switching to a for loop, which is clearer and consistent with the hash-chain and row-based match finders.

Also, add asserts that underflow isn't happening, which were easily triggered by the OSS-Fuzz fuzzers before the fix was added.

Clang performance is neutral to 2-3% worse, and gcc performance is neutral to 3-5% positive. So overall it is about a wash.

The bug was reported by Dan Carpenter and found by Smatch static checker.

https://lore.kernel.org/all/20211008063704.GA5370@kili/
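A schematic illustration of the counter bug described above and the shape of the fix. This is not the zstd match-finder code: `candidateIsGood` and both loops are stand-ins, and the real change touches zstd's binary-tree search. The point is only that decrementing an unsigned budget that can already be zero wraps around, while a signed counter checked with `> 0` (or an equivalent bounded for loop) cannot.

```c
#include <stddef.h>

/* Stand-in for the real "is this match worth keeping?" logic, which the fix
 * deliberately does not change. */
static int candidateIsGood(size_t idx) { return (idx % 3) == 0; }

/* Buggy shape: an unsigned budget decremented inside the loop. If it is
 * already 0 when the decrement runs, it wraps to UINT_MAX and the
 * "budget exhausted" break is effectively never taken. */
static size_t searchBuggy(unsigned nbCompares, size_t matchIndex)
{
    size_t best = 0;
    while (matchIndex > 0) {
        nbCompares--;                      /* wraps when nbCompares was 0 */
        if (candidateIsGood(matchIndex)) best = matchIndex;
        if (nbCompares == 0) break;
        matchIndex--;
    }
    return best;
}

/* Fixed shape, matching the commit description: a signed counter with
 * `nbCompares > 0` in the loop condition, so it can never underflow. */
static size_t searchFixed(int nbCompares, size_t matchIndex)
{
    size_t best = 0;
    for (; nbCompares > 0 && matchIndex > 0; --nbCompares, --matchIndex) {
        if (candidateIsGood(matchIndex)) best = matchIndex;
    }
    return best;
}
```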

+31 -27

3 comments

3 changed files

terrelln

pr closed time in 12 days

push event facebook/zstd

Nick Terrell

commit sha 1bbb372e3e2453a8f16813fc9d65c90f32eb9080

[ldm] Fix ZSTD_c_ldmHashRateLog bounds check

There is no minimum value check, so the parameter could be negative. Switch to the standard pattern of using `BOUNDCHECK()`.

The bug was reported by Dan Carpenter and found by Smatch static checker.
https://lore.kernel.org/all/20211008063704.GA5370@kili/

view details

Nick Terrell

commit sha 26486db9ab20fdd8b8c77cdbafd46a338a3e1199

Merge pull request #2819 from terrelln/ldm-hash-rate-log [ldm] Fix ZSTD_c_ldmHashRateLog bounds check

view details

push time in 14 days

PR merged facebook/zstd

[ldm] Fix ZSTD_c_ldmHashRateLog bounds check

There is no minimum value check, so the parameter could be negative. Switch to the standard pattern of using BOUNDCHECK().

The bug was reported by Dan Carpenter and found by Smatch static checker.

https://lore.kernel.org/all/20211008063704.GA5370@kili/
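For illustration, here is the general shape of the bounds-check pattern the fix moves to, written against zstd's public API rather than its internals (the real fix uses the library-internal `BOUNDCHECK()` macro inside the parameter-setting code). `checkAndSetParam` is a hypothetical caller-side helper; the key point is that both the lower and the upper bound are validated, which is the half that was missing for `ZSTD_c_ldmHashRateLog`.

```c
#include <stdio.h>
#include <zstd.h>

/* Hypothetical helper: query the documented bounds for a parameter and
 * reject values outside [lowerBound, upperBound] before applying them. */
static int checkAndSetParam(ZSTD_CCtx* cctx, ZSTD_cParameter param, int value)
{
    ZSTD_bounds const bounds = ZSTD_cParam_getBounds(param);
    if (ZSTD_isError(bounds.error)) return -1;
    if (value < bounds.lowerBound || value > bounds.upperBound) {
        /* The bug above: the `value < lowerBound` half of this check was
         * missing, so a negative ZSTD_c_ldmHashRateLog slipped through. */
        fprintf(stderr, "value %d outside [%d, %d]\n",
                value, bounds.lowerBound, bounds.upperBound);
        return -1;
    }
    return ZSTD_isError(ZSTD_CCtx_setParameter(cctx, param, value)) ? -1 : 0;
}
```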

+2 -2

0 comments

1 changed file

terrelln

pr closed time in 14 days

push event facebook/zstd

Nick Terrell

commit sha 399644b1f12cbdb61511fe3ad3139956fa658096

[nit] Fix buggy indentation

The bug was reported by Dan Carpenter and found by Smatch static checker.
https://lore.kernel.org/all/20211008063704.GA5370@kili/

view details

Nick Terrell

commit sha 802745e88af53d7224bb81de3ebdaa72ac165975

Merge pull request #2818 from terrelln/indentation-fix [nit] Fix buggy indentation

view details

push time in 14 days