Carl Woffenden (cwoffenden), Basel, Switzerland. Wrote software you may have seen. Writes software your kids will have seen. Does stuff, fixes things. Contains metal parts.

cwoffenden/hello-webgpu 79

Cross-platform C++ example for WebGPU and Dawn

cwoffenden/panacdc 1

Panasonic CX-DP60 Emulator

cwoffenden/basis_universal 0

Basis Universal GPU Texture Codec

cwoffenden/dp2usbc 0

DisplayPort to USB-C Carrier

cwoffenden/emscripten 0

Emscripten: An LLVM-to-WebAssembly Compiler

cwoffenden/mango 0

mango fun framework

cwoffenden/psd_sdk 0

A C++ library that directly reads Photoshop PSD files.

cwoffenden/zstd 0

Zstandard - Fast real-time compression algorithm

started Viladoman/StructLayout

started time in 9 hours

started mwenge/defender

started time in 12 days

started jemalloc/jemalloc

started time in a month

started MrL314/BooView

started time in a month

push event cwoffenden/psd_sdk

Oluseyi Sonaiya

commit sha 913c1d091a8dea38119fad52e55e0ed2bd3c6033

Crude initial port to macOS. Compiles, but misparses PSD file header on executing PsdSamples. Path to sample PSD currently hard-coded, because Xcode build locations are complex ;-)

view details

Oluseyi Sonaiya

commit sha 9eee6ae4213c5e90728ac9f0af3a033e1cb2d754

Match coding style (braces on new lines)

view details

Oluseyi Sonaiya

commit sha 3660732fc4745317ee31dfa935aa664f1651dbf5

Platform specific files to use capitalized naming suffix (e.g. "_Mac")

view details

MolecularMatters

commit sha 83a12e024b5a6547a5210916ff1bc477594831ed

added support for non-POD types in memory utility

view details

Oluseyi Sonaiya

commit sha 25817319cfb06ae1e4739acd8a2262b110fe0bf7

Merge branch 'master' into fork-master

view details

Oluseyi Sonaiya

commit sha 561e21400090ec188d958064a9e3b59079e3e2fe

posix_memalign requires that alignment be >= sizeof(void *), and a power of 2. Some of our invocation paths were passing in sizeof(char), so we find the smallest compliant alignment value.

view details
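The alignment rule described in the posix_memalign commit above can be sketched as follows. This is a minimal illustration, not the code from psd_sdk: the helper names `compliant_alignment` and `aligned_alloc_compat` are hypothetical, and the sketch assumes `sizeof(void *)` is itself a power of two, as it is on mainstream platforms.

```c
#define _POSIX_C_SOURCE 200112L /* expose posix_memalign under strict ISO modes */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: smallest power of two that is both >= requested
 * and >= sizeof(void *), which is what posix_memalign() accepts.
 * Assumes sizeof(void *) is a power of two. */
static size_t compliant_alignment(size_t requested)
{
    size_t alignment = sizeof(void *);
    /* Guard against overflow when doubling. */
    assert(requested <= SIZE_MAX / 2 + 1);
    while (alignment < requested) {
        alignment <<= 1;
    }
    return alignment;
}

/* Hypothetical wrapper: callers may pass any alignment (even sizeof(char));
 * it is rounded up to a compliant value before allocating.
 * Returns NULL on failure; release the memory with free(). */
static void *aligned_alloc_compat(size_t size, size_t requested_alignment)
{
    void *ptr = NULL;
    if (posix_memalign(&ptr, compliant_alignment(requested_alignment), size) != 0) {
        return NULL;
    }
    return ptr;
}
```

With this shape, a call path that passes `sizeof(char)` ends up requesting `sizeof(void *)` alignment instead of failing with `EINVAL`.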

Oluseyi Sonaiya

commit sha c5a1c3e24e6c07c73dd926dcbec3db3fe4b68a9e

Prefer [Grand Central] Dispatch over NSFileHandle

view details

Oluseyi Sonaiya

commit sha ed159ce59e5fc7580517e2f6bd1430c4bf330860

"wchar_t was a mistake" :-/

view details

Oluseyi Sonaiya

commit sha a8d92f7cab8961f62f77ce8fd892e6a150fef3bc

Stubbing this out: probably need to call fstat() on the file, but we only have the file descriptor. Still thinking.

view details

MolecularMatters

commit sha 4bbd7436408a7926a3c28a41642fdaf116415e49

added helper functions for sample code. The helper functions make it easier to store sample output data in different directories on different platforms.

view details

Oluseyi Sonaiya

commit sha 8a25ba012c95a429e08873627736f5603e91daa1

Merge branch 'master' into fork-master

view details

Oluseyi Sonaiya

commit sha 702b7a9968aff15fc13b83bdbf8cec4bf5f4dfdf

Apply create and truncate open flags, and set mode to `u+rw,g+r`

view details
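As a rough sketch of what the open-flags commit above describes, the POSIX call below creates or truncates a file with mode `u+rw,g+r`. It also shows that `fstat()` works directly on a file descriptor, which is relevant to the earlier "only have the file descriptor" stub. The helper names are hypothetical and this is not the psd_sdk implementation.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open path for writing, creating it if needed and
 * truncating any existing content. The mode is u+rw,g+r (0640), which the
 * process umask may restrict further. Returns a descriptor or -1. */
static int open_for_write(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR | S_IRGRP);
}

/* Hypothetical helper: size in bytes of the file behind fd, or -1 on
 * failure. fstat() takes the descriptor itself, so no path is needed. */
static off_t file_size_from_fd(int fd)
{
    struct stat info;
    if (fstat(fd, &info) != 0) {
        return -1;
    }
    return info.st_size;
}
```

Opening the same path a second time with `open_for_write` resets its size to zero because of `O_TRUNC`.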

Oluseyi Sonaiya

commit sha c8c8754492a12adc6c0601b22131c0ea5d26e2f0

dispatch_data_t parameter to handle is NULL on successful write

view details

Oluseyi Sonaiya

commit sha 9cf822bada7db01e860f5b8dcba40c65c90dbbdf

We can use the same path logic on macOS (and, presumably, Linux) since forward separators have been adopted. Note that macOS/Xcode will still require that a custom working directory be set by editing the build scheme.

view details

MolecularMatters

commit sha ef2006af951fca70cea0a2b6457f152f0773d8ad

Merge pull request #3 from oluseyi/master Crude initial port to macOS.

view details

MolecularMatters

commit sha c6146ef5b70912e352f5e1d5248cf582e98bd255

Update README.md

view details

MolecularMatters

commit sha aa3bd0cbc98a1d6802579cc7a560d148d79eeabf

fixed identification of different VS versions

view details

push time in a month

issue comment facebook/zstd

pure javascript implementation

@felixhandte Most of these flags are already in the single file build script that, AFAIK, the JS builds are using:

https://github.com/facebook/zstd/blob/8a3bdfaa7bb2944dbed1ce9dede39f512d6da3f9/build/single_file_libs/zstddeclib-in.c#L31

(At the time I weighed size against performance and settled on these.)

cordovapolymer

comment created time in 2 months

delete branch cwoffenden/zstd

delete branch : sse-x86

delete time in 2 months

delete branch cwoffenden/zstd

delete branch : msvc-w4-fixes

delete time in 2 months

push event cwoffenden/zstd

senhuang42

commit sha 939276cd0c8f6f5b6eede93bb4db3779742ca778

Add ldm and block splitter auto-enable to old api

view details

sen

commit sha 0a96d0006427e9275d0711b978495da919416ff4

Merge pull request #2684 from senhuang42/old_api_ldm_blocksplit Add ldm and block splitter auto-enable to old api

view details

Binh Vo

commit sha 1e17184ad043fde95548dd7580fee62d7d6e5bdb

Add documentation for --patch-from

view details

Yann Collet

commit sha cefafc0b6efc1cf31b57c8f7f99a7aa88344644d

Merge pull request #2693 from binhdvo/bootcamp Add documentation for --patch-from

view details

senhuang42

commit sha 88acf0ac6556631bd150b2339487eca0b3112a79

Make regression test run on every PR

view details

sen

commit sha 5fb3884f330b953d450f5a4a69544a548c025ed7

Merge pull request #2691 from senhuang42/per_pr_regressiontest Make regression test run on every PR

view details

Carl Woffenden

commit sha cc9f43f9461f64865425d54c6b79b24e45c9784c

Merge branch 'facebook:dev' into msvc-w4-fixes

view details

push time in 2 months

PR closed facebook/zstd

SSE/Neon path for MSVC x86 and ARM (CLA Signed)

This takes what #2653 started and extends it to x86 and MS ARM64 targets. To do this I fake the __SSE2__ or __ARM_NEON defines for MSVC (this was preferable to having the longer tests everywhere else) and change the signature of ZSTD_Vec256_cmpMask8 (more on that later).
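For illustration, here is one way such a detection could be written. It is a sketch under stated assumptions, not the PR's code: instead of defining `__SSE2__` or `__ARM_NEON` itself, it sets hypothetical wrapper macros (`EXAMPLE_HAS_SSE2`, `EXAMPLE_HAS_NEON`) from MSVC's own predefined macros `_M_X64`, `_M_IX86_FP` and `_M_ARM64`.

```c
#include <assert.h>

/* GCC and Clang define __SSE2__ themselves. MSVC does not, but it defines
 * _M_X64 for x64 targets (SSE2 is always available there) and sets
 * _M_IX86_FP to 2 when 32-bit code is built with /arch:SSE2 or higher. */
#if defined(__SSE2__) \
    || (defined(_MSC_VER) && (defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)))
#  define EXAMPLE_HAS_SSE2 1
#else
#  define EXAMPLE_HAS_SSE2 0
#endif

/* GCC and Clang define __ARM_NEON; MSVC defines _M_ARM64 for ARM64 targets. */
#if defined(__ARM_NEON) || (defined(_MSC_VER) && defined(_M_ARM64))
#  define EXAMPLE_HAS_NEON 1
#else
#  define EXAMPLE_HAS_NEON 0
#endif

/* C11: a target should never report both instruction sets. */
static_assert(!(EXAMPLE_HAS_SSE2 && EXAMPLE_HAS_NEON), "conflicting SIMD targets");

#if EXAMPLE_HAS_SSE2
#  include <emmintrin.h> /* SSE2 intrinsics */
#elif EXAMPLE_HAS_NEON
#  include <arm_neon.h>  /* Neon intrinsics */
#endif

/* Hypothetical helper: name of the code path this build would select. */
static const char *simd_path_name(void)
{
#if EXAMPLE_HAS_SSE2
    return "sse2";
#elif EXAMPLE_HAS_NEON
    return "neon";
#else
    return "scalar";
#endif
}
```

Using separate, prefixed macros avoids redefining compiler-reserved names, at the cost of updating every place that currently tests `__SSE2__` or `__ARM_NEON`.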

First some benchmarks! This is x86 without the SSE2 path, on a 3990X (with 127 idle cores!):

C:\Volumes\Data\Work\Native\Zstd\build\VS2010>bin\Win32_Release\zstd.exe -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322),  43.6 MB/s , 413.7 MB/s
 6#silesia.tar       : 211971584 ->  62984414 (3.365),  42.3 MB/s , 425.3 MB/s
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  31.1 MB/s , 454.4 MB/s
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  25.8 MB/s , 469.3 MB/s
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  22.7 MB/s , 475.6 MB/s
10#silesia.tar       : 211971584 ->  59301912 (3.574),  19.7 MB/s , 475.6 MB/s
11#silesia.tar       : 211971584 ->  59159449 (3.583),  15.0 MB/s , 475.9 MB/s
12#silesia.tar       : 211971584 ->  58648764 (3.614),  11.0 MB/s , 485.2 MB/s

And this is with the SSE2 path enabled:

C:\Volumes\Data\Work\Native\Zstd\build\VS2010>bin\Win32_Release\zstd.exe -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322),  52.6 MB/s , 424.5 MB/s
 6#silesia.tar       : 211971584 ->  62984414 (3.365),  50.7 MB/s , 436.3 MB/s
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  40.3 MB/s , 466.6 MB/s
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  33.5 MB/s , 482.3 MB/s
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  28.9 MB/s , 487.9 MB/s
10#silesia.tar       : 211971584 ->  59301912 (3.574),  27.0 MB/s , 486.4 MB/s
11#silesia.tar       : 211971584 ->  59159449 (3.583),  19.7 MB/s , 487.0 MB/s
12#silesia.tar       : 211971584 ->  58648764 (3.614),  17.0 MB/s , 495.9 MB/s

I took the best of five runs, and we see a 20-50% improvement. For this to work I needed to change ZSTD_Vec256_cmpMask8 to take pointers to the 256-bit type: on 32-bit targets, depending on the version of MSVC (tested with 2010-2019), passing it by value fails with the error "formal parameter with requested alignment of 16 won't be aligned". I worried this would hurt performance by not making best use of the wider SSE registers, but after many runs comparing the x64 build with and without the change, the pointer variant was always slightly faster (the numbers varied, but on a generally good run the pointer version always beat pass-by-value). I suspect this wouldn't be the case with a real 256-bit type.
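To make the signature change concrete, here is a minimal sketch of a byte-wise compare mask that takes its 256-bit operands by pointer. The names `example_vec256`, `example_vec256_load` and `example_vec256_cmpmask8` are hypothetical stand-ins rather than zstd's actual definitions, and a scalar fallback is included so the sketch builds on targets without SSE2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#  include <emmintrin.h>
#  define EXAMPLE_USE_SSE2 1
#else
#  define EXAMPLE_USE_SSE2 0
#endif

/* Hypothetical 256-bit value made of two 128-bit halves (no real 256-bit type). */
typedef struct {
#if EXAMPLE_USE_SSE2
    __m128i lo, hi;
#else
    uint8_t bytes[32];
#endif
} example_vec256;

/* Fill dst from 32 bytes at src; src needs no particular alignment. */
static void example_vec256_load(example_vec256 *dst, const uint8_t *src)
{
#if EXAMPLE_USE_SSE2
    dst->lo = _mm_loadu_si128((const __m128i *)src);
    dst->hi = _mm_loadu_si128((const __m128i *)(src + 16));
#else
    memcpy(dst->bytes, src, 32);
#endif
}

/* Bit i of the result is set when byte i of x equals byte i of y.
 * Operands are passed by pointer, so no 16-byte-aligned struct has to be
 * passed as a formal parameter on 32-bit MSVC. */
static uint32_t example_vec256_cmpmask8(const example_vec256 *x, const example_vec256 *y)
{
#if EXAMPLE_USE_SSE2
    uint32_t lo = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(x->lo, y->lo));
    uint32_t hi = (uint32_t)_mm_movemask_epi8(_mm_cmpeq_epi8(x->hi, y->hi));
    return lo | (hi << 16);
#else
    uint32_t mask = 0;
    for (int i = 0; i < 32; ++i) {
        if (x->bytes[i] == y->bytes[i]) {
            mask |= (uint32_t)1 << i;
        }
    }
    return mask;
#endif
}
```

A caller loads both operands into locals and passes their addresses; when the function is inlined, the compiler is free to keep the halves in registers.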

The same run on 3990X as x64, for comparison:

C:\Volumes\Data\Work\Native\Zstd\build\VS2010>bin\x64_Release\zstd.exe -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 107.5 MB/s , 600.5 MB/s
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 101.7 MB/s , 616.4 MB/s
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  72.2 MB/s , 655.0 MB/s
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  56.8 MB/s , 674.8 MB/s
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  47.1 MB/s , 683.6 MB/s
10#silesia.tar       : 211971584 ->  59301912 (3.574),  45.2 MB/s , 681.1 MB/s
11#silesia.tar       : 211971584 ->  59159449 (3.583),  41.9 MB/s , 681.7 MB/s
12#silesia.tar       : 211971584 ->  58648764 (3.614),  32.8 MB/s , 691.6 MB/s

Since I had one on my desk I also threw this at a Surface Pro X with ARM64. Here's the "before" run, using the fallback path:

C:\Users\carl\OneDrive\Documents\Zstd>zstd-fallback.exe -b5e12 silesia.tar
 5#silesia.tar       : 211972608 ->  63811033 (3.322),  52.5 MB/s , 593.4 MB/s
 6#silesia.tar       : 211972608 ->  62984688 (3.365),  50.7 MB/s , 602.5 MB/s
 7#silesia.tar       : 211972608 ->  61489289 (3.447),  35.0 MB/s , 646.3 MB/s
 8#silesia.tar       : 211972608 ->  60918998 (3.480),  27.2 MB/s , 664.9 MB/s
 9#silesia.tar       : 211972608 ->  59934838 (3.537),  18.6 MB/s , 660.1 MB/s
10#silesia.tar       : 211972608 ->  59302036 (3.574),  14.4 MB/s , 621.5 MB/s
11#silesia.tar       : 211972608 ->  59159575 (3.583),  12.7 MB/s , 627.2 MB/s
12#silesia.tar       : 211972608 ->  58648894 (3.614), 10.18 MB/s , 621.9 MB/s

And here's after with the Neon path:

C:\Users\carl\OneDrive\Documents\Zstd>zstd-neon.exe -b5e12 silesia.tar
 5#silesia.tar       : 211972608 ->  63811033 (3.322),  60.6 MB/s , 595.2 MB/s
 6#silesia.tar       : 211972608 ->  62984688 (3.365),  58.2 MB/s , 602.8 MB/s
 7#silesia.tar       : 211972608 ->  61489289 (3.447),  41.9 MB/s , 650.7 MB/s
 8#silesia.tar       : 211972608 ->  60918998 (3.480),  33.9 MB/s , 669.2 MB/s
 9#silesia.tar       : 211972608 ->  59934838 (3.537),  22.4 MB/s , 656.5 MB/s
10#silesia.tar       : 211972608 ->  59302036 (3.574),  15.8 MB/s , 632.7 MB/s
11#silesia.tar       : 211972608 ->  59159575 (3.583),  13.4 MB/s , 619.5 MB/s
12#silesia.tar       : 211972608 ->  58648894 (3.614),  11.2 MB/s , 626.5 MB/s

Around a 10% improvement.
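The percentages quoted for these tables can be recomputed from the compression-speed column (the first MB/s figure on each row). The helper below is a small sketch for doing so; the function name `speedup_percent` is made up for this example.

```c
#include <assert.h>

/* Hypothetical helper: relative improvement, in percent, of a new
 * throughput over an old one (both in MB/s). */
static double speedup_percent(double before_mb_s, double after_mb_s)
{
    assert(before_mb_s > 0.0);
    return (after_mb_s / before_mb_s - 1.0) * 100.0;
}
```

For example, the level 12 rows of the Surface Pro X runs give `speedup_percent(10.18, 11.2)`, roughly 10%, while the level 5 rows give `speedup_percent(52.5, 60.6)`, roughly 15%.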

I also ran the same benchmark on other x86 and x64 hardware with the same result. I haven't as of yet run this on Apple ARM hardware with Clang for comparison, but I will, and then update this PR.

I'm not 100% happy with the fake defines, but IMO it's no different from faking __has_builtin() and others. Suggestions welcome.

+43 -18

5 comments

4 changed files

cwoffenden

pr closed time in 2 months

pull request comment facebook/zstd

SSE/Neon path for MSVC x86 and ARM

Closing since #2681 makes this redundant.

cwoffenden

comment created time in 2 months

PR opened BinomialLLC/basis_universal

Single file transcoder fixes

These are minor and also fix #230 (which Clang and VS pick up when testing the script).

+11 -16

0 comments

5 changed files

pr created time in 2 months

push event cwoffenden/basis_universal

Carl Woffenden

commit sha f5722a80259ff760c3b15f3abd94d9b341d7e005

Removed/moved unused vars. Relates to issue #230

view details

Carl Woffenden

commit sha 150b7fb24f459397d9524bb1eb8b702ed7b0cb8e

Tidy docs

view details

push time in 2 months

push event cwoffenden/basis_universal

Carl Woffenden

commit sha b0c447b47945b31e736b3c97262c29b12fa6e452

Minor: fixed quotes in example

view details

push time in 2 months

create branch cwoffenden/basis_universal

branch : single-file-fixes

created branch time in 2 months

push event cwoffenden/basis_universal

Carl Woffenden

commit sha d4b4b522ce50d28bb8837bcd6ff0d37d340a2594

Fixed broken merge with upstream

view details

push time in 2 months

push event cwoffenden/basis_universal

Donovan Hutchence

commit sha ef7d326654eff5b1b13a63391ca8a46714e4e3d0

apply swizzle patch

view details

Rich Geldreich

commit sha 638c348294614aadfbfa39d786f5ab77df284dc5

Adding container independent transcoding support. See methods basisu_lowlevel_etc1s_transcoder::transcode_image() and basisu_lowlevel_uastc_transcoder::transcode_image().

view details

Rich Geldreich

commit sha 2952f6dc67cf50f75004619a9a6637a2b87ff81f

Merge pull request #187 from slimbuck/rgba-swizzle General RGBA swizzle param

view details

Jonathan Behrens

commit sha ad5daec2fc8071e3835862c93a8767b9e3d5b1db

Fix comment typo in basisu_transcoder.h

view details

Rich Geldreich

commit sha 62539278e10d106f52b9a05134ef3e8267b22a02

Moving encoder-related files to the "encoder" directory

view details

Rich Geldreich

commit sha e88073088d451902ea8fb1eba8a376bb7dfe00e4

Missing file

view details

Rich Geldreich

commit sha 5eebb7e4f989420d862b8540a457f283063e16ad

new files

view details

Rich Geldreich

commit sha 9ebf01854c9a907b92be0466e85739bc4cd26b57

new file

view details

Rich Geldreich

commit sha 85fa18ec940b369791f2f39410da4c79f8c5108f

Adding encoder WebAssembly support, encode_test sample. Encoder is now a library in the "encoder" directory. JavaScript wrappers now expose the entire codec: encoder, transcoder, container independent transcoding, and .basis file information.

view details

Rich Geldreich

commit sha 79245d48d5b0348d337c3b73836c77d37653f43b

Update README.md

view details

Rich Geldreich

commit sha a21ea7f25bc55f39a0a04a51ae5b6e62034fba7d

Adding missing files

view details

Rich Geldreich

commit sha f51ea57463fe3231cb87ae6a228e570d8b026d82

Update README.md

view details

Rich Geldreich

commit sha 5b299684cd76d83a381e3cc932930d15a2dc846d

Update README.md

view details

Rich Geldreich

commit sha ea668ac2c3815032f6f0f684ed3d54c90a1f8785

Updating preview image

view details

Rich Geldreich

commit sha 5d2120b085cdab692c2d17c5f52771a8af92f923

Update README.md

view details

Rich Geldreich

commit sha 2d63617d7171837b58b2862289c585e90b13bb3e

Update README.md

view details

Rich Geldreich

commit sha 8fceed472ab595de689396dad07cfbe19578d349

Update README.md

view details

Rich Geldreich

commit sha 107ae4b9361eab1032abbc3b57763f84041c0e33

Update README.md

view details

Rich Geldreich

commit sha 45fa7d6d219b0f54485a8ca5f3973ab2eeaaf3cf

Updating REUSE file

view details

Rich Geldreich

commit sha 33fe39f9cc4bcb76ac7f46dba42409100fdf6563

Update README.md

view details

push time in 2 months

push event cwoffenden/basis_universal

Carl Woffenden

commit sha f014a14fb595beaf979396bec9ac514c9c1798f5

Added UASTC to test

view details

push time in 2 months

push event cwoffenden/zstd

senhuang42

commit sha bb0cd722b6417017d5c54b6403369cbbae5a9172

Migrate travis CI tests

view details

sen

commit sha 414bcf239b840a2f35840d8c088cb9007c23b7f5

Merge pull request #2675 from senhuang42/ci_overhaul [CI][1/2] Re-do the github actions workflows, migrate various travis and appveyor tests.

view details

senhuang42

commit sha d278bede332ace7406168c3c6159afca834560db

Update apt-get prior to tests that install packages

view details

sen

commit sha 2ee2cf9cdf7c70d6b3041e0fd0c154989d8d3e19

Merge pull request #2682 from senhuang42/armbuild_fixtest Make GH Actions CI tests run apt-get update before apt-get install

view details

senhuang42

commit sha 56b7dd121c8d97428df96ae5d320c904545183b5

Add arm64 fuzz test to travis

view details

sen

commit sha 18d02cbf2e0654de08093094f1a77cfd231f11d7

Merge pull request #2686 from senhuang42/arm64fuzztest Add arm64 fuzz test to travis

view details

push time in 2 months

push event cwoffenden/zstd

Carl Woffenden

commit sha cfc8a61d57e259e184547c76be2649a214b1e7e8

ZSTD prefixed SSE and Neon macros

view details

push time in 2 months

push event cwoffenden/zstd

W. Felix Handte

commit sha 51708b2c621bc74223feb21f5cdc6c3c59221cfe

Fix CircleCI Config to Fully Remove `publish-github-release` Job

view details

TrianglesPCT

commit sha 25bda9053add6218a58d88c5b8119afa63165231

Add files via upload msvc suport avx2 path

view details

TrianglesPCT

commit sha 52f44bb365337054d30a8e0edf83dd7c612b4d32

Add files via upload msvc

view details

TrianglesPCT

commit sha 77d54eb3b3116bf9606426730de818f33907aec3

Add files via upload

view details

TrianglesPCT

commit sha 0b9f4bb0ff1f313bea9e9166f693ec64a3a6a43e

Update zstd_lazy.c use 8bit

view details

TrianglesPCT

commit sha 69ac124b1209ccf12c8c5969f0d7a2124ccbe554

Update zstd_lazy.c

view details

TrianglesPCT

commit sha 0e071214b5721f3415611d07d33469f8026c3bb0

Update zstd_lazy.c switch to unaligned load as I don't know if buffer will always be aligned to 32 bytes, and compilers aside from MSVC might actually use aligned loads

view details
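The aligned-versus-unaligned distinction in the commit above can be illustrated with SSE2's 128-bit loads. This is a sketch only: the commit concerns 32-byte buffers, whereas this example uses the 16-byte `_mm_loadu_si128` so that it builds without AVX2, and the helper name `load16_unaligned` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#  include <emmintrin.h>
#  define EXAMPLE_USE_SSE2 1
#else
#  define EXAMPLE_USE_SSE2 0
#endif

/* Hypothetical helper: copy 16 bytes from src to out through a vector
 * register, without assuming anything about the alignment of src. */
static void load16_unaligned(uint8_t out[16], const uint8_t *src)
{
#if EXAMPLE_USE_SSE2
    /* _mm_load_si128 would require src to be 16-byte aligned and may
     * fault otherwise; _mm_loadu_si128 has no such requirement. */
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)out, v);
#else
    memcpy(out, src, 16); /* scalar fallback for non-SSE2 targets */
#endif
}
```

Reading from `buffer + 1` of a byte array is the kind of access that is only safe with the unaligned form.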

TrianglesPCT

commit sha 8f7ea1afeba1f3762e99413424df95b9faf4d2d8

Update zstd_lazy.c Switch to other comment style

view details

TrianglesPCT

commit sha a62856bf65f381eb2f99d056005b4b39cb7c8725

Update zstd_lazy.c Remove the AVX2 part

view details

TrianglesPCT

commit sha bb1cdd8c63046e66de63cf76448868bbc1dc6b72

Update zstd_lazy.c add space

view details

TrianglesPCT

commit sha d688ab1e0cfcbe5a894f07bab4033978d99bebd3

Add files via upload AVX2

view details

TrianglesPCT

commit sha bee0ef56475eb567b7c801812a1a5f8215b4ffd9

Update zstd_lazy.c It put the changes back when I tried to make a separate pull request, i don't understand githubs interface at all.

view details

Yann Collet

commit sha 61afa154cdd987de525c6f717a872e31af79b1a8

improve tar compatibility This patch is supposed to improve compatibility with less featured tar variants "when the tar program used does not support historical options (without hyphen) nor the '-z' option." Patch proposed by Antonio Diaz Diaz

view details

senhuang42

commit sha 38ffe9658ea831cdf71f95a999916fa1d5f9b844

[ci] Use *-latest for platforms to test on

view details

senhuang42

commit sha 5a75417d2bb01d682fcc912c91d5ddf43906708b

[ci] Add ARM tests back into CI

view details

Yann Collet

commit sha 156145de1c4da01d941d5b8d5c4da49404cef087

Merge pull request #2660 from facebook/diaz improve tar compatibility

view details

Yann Collet

commit sha 02ece5d59fdb49cf794791700b95af4a5fa99a07

Merge pull request #2653 from TrianglesPCT/dev Enable SSE2 compression path to work on MSVC

view details

Yann Collet

commit sha d2c1c00712877c43f3edee8550fba9b5ec710f5d

Merge pull request #2649 from felixhandte/circleci-release-job-fix Fix CircleCI Config to Fully Remove `publish-github-release` Job

view details

sen

commit sha 9fe10722294b240b2a8646492550bc1307d942b7

Merge pull request #2668 from senhuang42/update_ci_platforms [CI] Fix zlib-wrapper test

view details

sen

commit sha d92fef0f0a43852805ae8cdf75041f40181c2aeb

Merge pull request #2667 from senhuang42/arm_tests_ci [CI] Add ARM tests back into CI

view details

push time in 2 months

push event cwoffenden/zstd

senhuang42

commit sha bb0cd722b6417017d5c54b6403369cbbae5a9172

Migrate travis CI tests

view details

sen

commit sha 414bcf239b840a2f35840d8c088cb9007c23b7f5

Merge pull request #2675 from senhuang42/ci_overhaul [CI][1/2] Re-do the github actions workflows, migrate various travis and appveyor tests.

view details

senhuang42

commit sha d278bede332ace7406168c3c6159afca834560db

Update apt-get prior to tests that install packages

view details

sen

commit sha 2ee2cf9cdf7c70d6b3041e0fd0c154989d8d3e19

Merge pull request #2682 from senhuang42/armbuild_fixtest Make GH Actions CI tests run apt-get update before apt-get install

view details

senhuang42

commit sha 56b7dd121c8d97428df96ae5d320c904545183b5

Add arm64 fuzz test to travis

view details

sen

commit sha 18d02cbf2e0654de08093094f1a77cfd231f11d7

Merge pull request #2686 from senhuang42/arm64fuzztest Add arm64 fuzz test to travis

view details

Carl Woffenden

commit sha f78e32fde10b06deaafdcc06887ef3971c5f4e37

Merge remote-tracking branch 'upstream/dev' into sse-x86

view details

push time in 2 months

pull request comment facebook/zstd

SSE/Neon path for MSVC x86 and ARM

"does MSVC need arm64_neon.h?"

@aqrit With MSVC targeting 64-bit ARM, arm_neon.h includes arm64_neon.h.

cwoffenden

comment created time in 2 months

pull request comment facebook/zstd

SSE/Neon path for MSVC x86 and ARM

And here's the same test on an ARM Mac. The runs on this M1 Mini had little deviation:

carl@m1 Native % lipo -i zstd-dev 
Non-fat file: zstd-dev is architecture: arm64
carl@m1 Native % ./zstd-dev -b5e12 silesia.tar 
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 184.6 MB/s ,1287.7 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 177.1 MB/s ,1324.3 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447), 124.6 MB/s ,1409.6 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  98.3 MB/s ,1446.9 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  83.0 MB/s ,1481.3 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  81.3 MB/s ,1493.2 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  75.3 MB/s ,1497.0 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  58.3 MB/s ,1528.5 MB/s
carl@m1 Native % lipo -i zstd-sse-pr 
Non-fat file: zstd-sse-pr is architecture: arm64
carl@m1 Native % ./zstd-sse-pr -b5e12 silesia.tar 
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 184.6 MB/s ,1288.5 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 177.1 MB/s ,1324.6 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447), 124.6 MB/s ,1409.5 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  98.3 MB/s ,1447.5 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  83.0 MB/s ,1481.8 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  81.3 MB/s ,1494.5 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  75.3 MB/s ,1497.7 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  58.2 MB/s ,1529.0 MB/s 

The only really interesting takeaway is how much the M1 trounces the 3990X in this test (though the Threadripper here is under-clocked at the moment).

cwoffenden

comment created time in 2 months

pull request comment facebook/zstd

SSE/Neon path for MSVC x86 and ARM

Testing this PR (zstd-sse-pr) against dev (zstd-dev) building with Clang 12.0.0 on an Intel MacBook Pro I see these results, alternating runs between the two builds:

carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-dev -b5e12 silesia.tar   
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 144.0 MB/s ,1051.6 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 136.0 MB/s ,1042.1 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  94.1 MB/s ,1110.6 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  77.7 MB/s ,1159.3 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  59.4 MB/s ,1181.6 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.8 MB/s ,1121.5 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  47.1 MB/s ,1152.9 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.6 MB/s ,1195.2 MB/s 
carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-sse-pr -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 144.6 MB/s ,1028.0 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 137.6 MB/s ,1055.6 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  94.2 MB/s ,1141.4 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  76.5 MB/s ,1151.0 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  61.3 MB/s ,1200.6 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.7 MB/s ,1105.2 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  47.1 MB/s ,1176.3 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.6 MB/s ,1193.4 MB/s 
carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-dev -b5e12 silesia.tar   
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 144.2 MB/s ,1019.4 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 137.7 MB/s ,1092.1 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  93.3 MB/s ,1129.9 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  77.7 MB/s ,1159.1 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  59.7 MB/s ,1167.2 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.6 MB/s ,1168.8 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  47.3 MB/s ,1101.2 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.7 MB/s ,1116.0 MB/s 
carl@dunkel Shared % /Volumes/Data/Work/Native/zstd-sse-pr -b5e12 silesia.tar
 5#silesia.tar       : 211971584 ->  63810797 (3.322), 146.0 MB/s ,1028.8 MB/s 
 6#silesia.tar       : 211971584 ->  62984414 (3.365), 138.0 MB/s ,1083.5 MB/s 
 7#silesia.tar       : 211971584 ->  61489071 (3.447),  91.8 MB/s ,1131.5 MB/s 
 8#silesia.tar       : 211971584 ->  60918862 (3.480),  77.9 MB/s ,1192.2 MB/s 
 9#silesia.tar       : 211971584 ->  59934752 (3.537),  61.2 MB/s ,1166.0 MB/s 
10#silesia.tar       : 211971584 ->  59301912 (3.574),  51.4 MB/s ,1119.5 MB/s 
11#silesia.tar       : 211971584 ->  59159449 (3.583),  46.8 MB/s ,1131.4 MB/s 
12#silesia.tar       : 211971584 ->  58648764 (3.614),  36.6 MB/s ,1163.5 MB/s 

TL;DR: the change to ZSTD_Vec256_cmpMask8() doesn't appear to have any adverse effects.

cwoffenden

comment created time in 2 months

PR opened facebook/zstd

SSE/Neon path for MSVC x86 and ARM

+36 -13

0 comments

4 changed files

pr created time in 2 months

push event cwoffenden/zstd

Carl Woffenden

commit sha 42003ff4dcfa7d2d69f93213e68f51805a99cc24

Minor docs

view details

push time in 2 months

push event cwoffenden/zstd

Carl Woffenden

commit sha c5a4998bcbe16f4b920a511d73866379f4ee4f41

Expanded to cover MSVC ARM64

view details

push time in 2 months

create branch cwoffenden/zstd

branch : sse-x86

created branch time in 2 months

push event cwoffenden/zstd

W. Felix Handte

commit sha 51708b2c621bc74223feb21f5cdc6c3c59221cfe

Fix CircleCI Config to Fully Remove `publish-github-release` Job

view details

TrianglesPCT

commit sha 25bda9053add6218a58d88c5b8119afa63165231

Add files via upload msvc suport avx2 path

view details

TrianglesPCT

commit sha 52f44bb365337054d30a8e0edf83dd7c612b4d32

Add files via upload msvc

view details

TrianglesPCT

commit sha 77d54eb3b3116bf9606426730de818f33907aec3

Add files via upload

view details

TrianglesPCT

commit sha 0b9f4bb0ff1f313bea9e9166f693ec64a3a6a43e

Update zstd_lazy.c use 8bit

view details

TrianglesPCT

commit sha 69ac124b1209ccf12c8c5969f0d7a2124ccbe554

Update zstd_lazy.c

view details

TrianglesPCT

commit sha 0e071214b5721f3415611d07d33469f8026c3bb0

Update zstd_lazy.c switch to unaligned load as I don't know if buffer will always be aligned to 32 bytes, and compilers aside from MSVC might actually use aligned loads

view details

TrianglesPCT

commit sha 8f7ea1afeba1f3762e99413424df95b9faf4d2d8

Update zstd_lazy.c Switch to other comment style

view details

TrianglesPCT

commit sha a62856bf65f381eb2f99d056005b4b39cb7c8725

Update zstd_lazy.c Remove the AVX2 part

view details

TrianglesPCT

commit sha bb1cdd8c63046e66de63cf76448868bbc1dc6b72

Update zstd_lazy.c add space

view details

TrianglesPCT

commit sha d688ab1e0cfcbe5a894f07bab4033978d99bebd3

Add files via upload AVX2

view details

TrianglesPCT

commit sha bee0ef56475eb567b7c801812a1a5f8215b4ffd9

Update zstd_lazy.c It put the changes back when I tried to make a separate pull request, i don't understand githubs interface at all.

view details

Yann Collet

commit sha 61afa154cdd987de525c6f717a872e31af79b1a8

improve tar compatibility This patch is supposed to improve compatibility with less featured tar variants "when the tar program used does not support historical options (without hyphen) nor the '-z' option." Patch proposed by Antonio Diaz Diaz

view details

senhuang42

commit sha 38ffe9658ea831cdf71f95a999916fa1d5f9b844

[ci] Use *-latest for platforms to test on

view details

senhuang42

commit sha 5a75417d2bb01d682fcc912c91d5ddf43906708b

[ci] Add ARM tests back into CI

view details

Yann Collet

commit sha 156145de1c4da01d941d5b8d5c4da49404cef087

Merge pull request #2660 from facebook/diaz improve tar compatibility

view details

Yann Collet

commit sha 02ece5d59fdb49cf794791700b95af4a5fa99a07

Merge pull request #2653 from TrianglesPCT/dev Enable SSE2 compression path to work on MSVC

view details

Yann Collet

commit sha d2c1c00712877c43f3edee8550fba9b5ec710f5d

Merge pull request #2649 from felixhandte/circleci-release-job-fix Fix CircleCI Config to Fully Remove `publish-github-release` Job

view details

sen

commit sha 9fe10722294b240b2a8646492550bc1307d942b7

Merge pull request #2668 from senhuang42/update_ci_platforms [CI] Fix zlib-wrapper test

view details

sen

commit sha d92fef0f0a43852805ae8cdf75041f40181c2aeb

Merge pull request #2667 from senhuang42/arm_tests_ci [CI] Add ARM tests back into CI

view details

push time in 2 months