EliRibble/dreampylib 0

A python library for interacting with Dreamhost's API

mkrupcale/.emacs.d 0

My Emacs configuration

mkrupcale/.gitconfig 0

My global git config

mkrupcale/.lyx 0

My LyX user directory

mkrupcale/.unison 0

My Unison profiles

mkrupcale/ansible 0

Ansible is a radically simple IT automation platform that makes your applications and systems easier to deploy. Avoid writing scripts or custom code to deploy and update your applications— automate in a language that approaches plain English, using SSH, with no agents to install on remote systems.

mkrupcale/ansible-modules-extras 0

Ansible extra modules - these modules ship with ansible

mkrupcale/BLAKE3 0

the official Rust and C implementations of the BLAKE3 cryptographic hash function

mkrupcale/dreampylib 0

A python library for interacting with Dreamhost's API

mkrupcale/eccodes 0

ECMWF's GRIB and BUFR encoding/decoding library

started shibatch/sleef

started time in 3 days

started typesense/typesense

started time in 5 days

started wilsonfreitas/awesome-quant

started time in 9 days

started jpbruyere/vkvg

started time in 10 days

started jk-jeon/fp

started time in 16 days

started woboq/utf8sse4

started time in 17 days

started stgatilov/utf8lut

started time in 17 days

started parallaxsecond/parsec

started time in 24 days

started shuveb/loti

started time in 25 days

started axboe/liburing

started time in 25 days

started ctz/rustls

started time in a month

started David-Haim/concurrencpp

started time in a month

started stratum/stratum

started time in a month

started opencomputeproject/OpenNetworkLinux

started time in a month

started emersion/libliftoff

started time in 2 months

started Motion-Project/motion

started time in 2 months

started ZoneMinder/zoneminder

started time in 2 months

started jk-jeon/dragonbox

started time in 2 months

Pull request review comment BLAKE3-team/BLAKE3

Add SSE2 implementations

 pub fn sse41_detected() -> bool {
     false
 }
+#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
+#[inline(always)]
+pub fn sse2_detected() -> bool {
+    // A testing-only short-circuit.
+    if cfg!(feature = "no_sse2") {
+        return false;
+    }
+    // Static check, e.g. for building with target-cpu=native.
+    #[cfg(target_feature = "sse2")]
+    {
+        return true;
+    }

Rustc almost always assumes that SSE2 is supported, but you can convince it not to and trigger a build break like this:

Okay, that's what I suspected.

I've gone ahead and fixed this in #116.

Thanks. I had something like that in my original version, but I wasn't sure how to get the tests to pass with the compiler warning that this part was unreachable.

mkrupcale

comment created time in 2 months

PullRequestReviewEvent

pull request comment BLAKE3-team/BLAKE3

Add SSE2 implementations

I've made all the simplifications and changes suggested in the latest revision.

mkrupcale

comment created time in 2 months

push event mkrupcale/BLAKE3

Matthew Krupcale

commit sha be2da69b6b293764867c42fcbc278627271d9710

C: asm: simplify pblendw emulation

Use statically calculated ~mask. This reduces the number of moves and registers necessary at the expense of an extra memory load. This is probably a good trade-off since we are not bound by memory uops in this loop.

push time in 2 months

push event mkrupcale/BLAKE3

Matthew Krupcale

commit sha c592e9a3f604fa6c62ef547639723b5962529885

C: asm: remove blendvps usage altogether

This simplifies the operation by removing the need to use blendvps at all.

Matthew Krupcale

commit sha 47e415c7f19d97b3a39720f9c892288e82d4bd99

C: asm: simplify pinsrd emulation

Use punpckl{,q}dq instead of pinsrw.

Matthew Krupcale

commit sha 52d63fc131c3d68149b33b2d655f8f05375dd99b

C: asm: simplify pblendw emulation

Use statically calculated ~mask. This reduces the number of moves and registers necessary at the expense of an extra memory load. This is probably a good trade-off since we are not bound by memory uops in this loop.

push time in 2 months

started matt-42/lithium

started time in 2 months

started alandefreitas/matplotplusplus

started time in 2 months

pull request comment BLAKE3-team/BLAKE3

Add SSE2 implementations

Supporting SSE2 was something I was not too sure about, given the vanishingly small percentage of machines still running without SSE4.1 (on the Steam survey 98.11% of machines have it).

Yeah, I've seen the Steam survey as well, but an SSE2 baseline implementation would ensure 100% coverage of x86_64 with optimized assembly. And incidentally my Phenom II falls into the category of CPUs without SSE4.1 :(.

In any case, this is a solid patch.

Thanks, I'm glad I was able to produce something reasonable based on your work, having never before written assembly myself.

I haven't gone over the whole thing, but one nit: the _mm_blend_epi16 emulation routine is a bit complicated, especially if the compiler doesn't see past the expensive multiplication (the Microsoft compiler can't, from the looks of it). I suggest rewriting it as ...

Done in the latest commit.

The other nit is less technical: when emulating rotations with shift-shift-or, I find it preferable to use por instead of pxor to make it easily distinguished from the "real" xors in the algorithm.

Makes sense.

Though in fairness, we may have been not entirely consistent everywhere with that.

Perhaps this can be addressed in a separate cleanup PR covering all implementations.

mkrupcale

comment created time in 2 months

push event mkrupcale/BLAKE3

Matthew Krupcale

commit sha c33a8462d1e1770f91a1aa4c4854ae000ed865ae

Write _mm_blend_epi16 emulation without multiplication

Use _mm_and_si128 and _mm_cmpeq_epi16 rather than the expensive multiplication (_mm_mullo_epi16 with _mm_srai_epi16), which the compiler may not be able to optimize.

push time in 2 months
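
As a rough illustration of the approach this commit message describes, the following C++ sketch emulates a 16-bit blend using only SSE2 intrinsics: the immediate is broadcast, each lane's selector bit is isolated with `_mm_and_si128`, and `_mm_cmpeq_epi16` turns it into a full-lane mask. The names `blend_epi16_sse2` and `lanes16` are hypothetical and chosen for this example; the code in the PR itself may differ in detail.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics only
#include <array>
#include <cstdint>

// Hypothetical helper: takes 16-bit lane i from b when bit i of imm8 is set,
// otherwise from a (the behaviour of SSE4.1's _mm_blend_epi16).
inline __m128i blend_epi16_sse2(__m128i a, __m128i b, int imm8) {
    // One distinct selector bit per lane; lane 0 holds 0x01, lane 7 holds 0x80.
    const __m128i bits =
        _mm_set_epi16(0x80, 0x40, 0x20, 0x10, 0x08, 0x04, 0x02, 0x01);
    // Broadcast the immediate, keep only each lane's own bit, then compare
    // against that bit to get an all-ones or all-zeros mask per lane.
    __m128i mask = _mm_set1_epi16(static_cast<short>(imm8));
    mask = _mm_and_si128(mask, bits);
    mask = _mm_cmpeq_epi16(mask, bits);
    // (mask & b) | (~mask & a)
    return _mm_or_si128(_mm_and_si128(mask, b), _mm_andnot_si128(mask, a));
}

// Hypothetical helper for inspecting a vector as eight 16-bit lanes.
inline std::array<std::uint16_t, 8> lanes16(__m128i v) {
    std::array<std::uint16_t, 8> out;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.data()), v);
    return out;
}
```

For example, `blend_epi16_sse2(a, b, 0xCC)` picks lanes 2, 3, 6, and 7 from `b` and the remaining lanes from `a`. No multiplication is involved, so the compiler does not need to see through one to produce reasonable code.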

push event mkrupcale/.lyx

Matthew Krupcale

commit sha b0627e00dde2b9ffb697bb405616e63454f9cb69

Add REVTeX 4.2 layout

* layouts/revtex4-2.layout: Add REVTeX 4.2 layout based on official REVTeX 4.1 layout

push time in 2 months

push event mkrupcale/BLAKE3

Matthew Krupcale

commit sha 90e2a924a4eed8c5cac3b37210777e9faa71b37d

Fix Windows MSVC undefined symbol errors

MSVC returns "error A2006:undefined symbol : FFFFFFFFH", so use 0FFFFFFFFH instead. Also use 0 prefix for 0H to align things.

push time in 2 months

push event mkrupcale/BLAKE3

Matthew Krupcale

commit sha e581035bd38185c16de97f3c8cff8e3a47b23638

Put PBLENDW masks in the RDATA section

Previously, these masks were undefined because they were outside of the RDATA section.

push time in 2 months

push event mkrupcale/BLAKE3

Matthew Krupcale

commit sha 00849f8625f3200ef7cac4b33d459005ccd7fe78

Fix Windows MSVC undefined symbol errors

MSVC returns "error A2006:undefined symbol : B1H", so use 0B1H instead.

push time in 2 months

push event mkrupcale/BLAKE3

Matthew Krupcale

commit sha c32660099a86a574155a81767dc49ba08d05667d

Fix unreachable expression compiler warning

SSE2 target_feature appears to always be present for x86_64.

push time in 2 months

PR opened BLAKE3-team/BLAKE3

Add SSE2 implementations

Introduction

This adds SSE2 intrinsic and assembly implementations based heavily on the respective SSE4.1 implementations. It simply emulates any SSE4.1 or SSSE3 intrinsics/instructions using SSE2 intrinsics/instructions. I'm certainly a novice in this area, so I can't claim that the implementations are optimal, but I hope that they are close, and the benchmarks certainly show an improvement over the portable versions.

Details

There were two intrinsics which required emulation:

  1. _mm_blend_epi16 (SSE4.1): Used in both the Rust and C intrinsic implementations.
  2. _mm_shuffle_epi8 (SSSE3): Used in the C intrinsic implementation for rot8 and rot16.

There were four instructions which required emulation:

  1. pblendw (SSE4.1)
  2. blendvps (SSE4.1)
  3. pinsrd (SSE4.1)
  4. pshufb (SSSE3)

Fortunately, these all had relatively simple SSE2 implementations.
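
To give a sense of how small such an emulation can be, here is a minimal C++ sketch of the rot8 and rot16 cases, assuming right rotations of 32-bit lanes as in BLAKE3's mixing function. Instead of a byte shuffle (`_mm_shuffle_epi8`, SSSE3), each rotation is built from two SSE2 shifts combined with OR. The names `rot8_sse2`, `rot16_sse2`, and `lanes32` are hypothetical and not taken from the PR.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics only
#include <array>
#include <cstdint>

// Rotate every 32-bit lane right by 16 bits: shift right, shift left, OR.
inline __m128i rot16_sse2(__m128i x) {
    return _mm_or_si128(_mm_srli_epi32(x, 16), _mm_slli_epi32(x, 32 - 16));
}

// Rotate every 32-bit lane right by 8 bits with the same shift-shift-or pattern.
inline __m128i rot8_sse2(__m128i x) {
    return _mm_or_si128(_mm_srli_epi32(x, 8), _mm_slli_epi32(x, 32 - 8));
}

// Hypothetical helper for inspecting a vector as four 32-bit lanes.
inline std::array<std::uint32_t, 4> lanes32(__m128i v) {
    std::array<std::uint32_t, 4> out;
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out.data()), v);
    return out;
}
```

With an input lane of `0x11223344`, `rot8_sse2` yields `0x44112233` and `rot16_sse2` yields `0x33441122`. Because the two shifted halves never overlap, OR and XOR give the same result here; OR is used in this sketch so the combining step is not mistaken for one of the algorithm's own XORs.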

Testing

I've tested the SSE2 implementation on my own Linux (Fedora 32) machine (Phenom II X4) using the following methods:

  1. cargo test
  2. cargo test --features 'prefer_intrinsics'
  3. cargo bench
  4. make -f Makefile.testing all
  5. make -f Makefile.testing test
  6. make -f Makefile.testing test_asm

For the most part, things seem to work. Tests 1-3 above return some warnings like the following during compilation:

warning: unreachable statement
   --> src/platform.rs:408:5
    |
404 |           return true;
    |           ----------- any code following this expression is unreachable
...
408 | /     {
409 | |         if is_x86_feature_detected!("sse2") {
410 | |             return true;
411 | |         }
412 | |     }
    | |_____^ unreachable statement
    |
    = note: `#[warn(unreachable_code)]` on by default

warning: 1 warning emitted

I'm not sure if this is because SSE2 is always assumed to be present on x86_64 or if I'm missing something, but I essentially did the exact same thing as the SSE4.1 platform detection.

Test 4 fails to link with missing references to g{,et}_cpu_features:

$ make -f Makefile.testing all
gcc -O3 -Wall -Wextra -std=c11 -pedantic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -fPIE -fvisibility=hidden -Wa,--noexecstack -c blake3_sse2.c -o blake3_sse2.o -msse2
gcc -O3 -Wall -Wextra -std=c11 -pedantic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -fPIE -fvisibility=hidden -Wa,--noexecstack -c blake3_sse41.c -o blake3_sse41.o -msse4.1
gcc -O3 -Wall -Wextra -std=c11 -pedantic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -fPIE -fvisibility=hidden -Wa,--noexecstack -c blake3_avx2.c -o blake3_avx2.o -mavx2
gcc -O3 -Wall -Wextra -std=c11 -pedantic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -fPIE -fvisibility=hidden -Wa,--noexecstack -c blake3_avx512.c -o blake3_avx512.o -mavx512f -mavx512vl
gcc -O3 -Wall -Wextra -std=c11 -pedantic -fstack-protector-strong -D_FORTIFY_SOURCE=2 -fPIE -fvisibility=hidden -Wa,--noexecstack blake3.c blake3_dispatch.c blake3_portable.c main.c blake3_sse2.o blake3_sse41.o blake3_avx2.o blake3_avx512.o -o blake3 -pie -Wl,-z,relro,-z,now
/usr/bin/ld: /tmp/cchzaxEt.o: in function `main':
main.c:(.text.startup+0x2ab): undefined reference to `get_cpu_features'
/usr/bin/ld: main.c:(.text.startup+0x2f5): undefined reference to `g_cpu_features'
collect2: error: ld returned 1 exit status
make: *** [Makefile.testing:46: all] Error 1

But this was already the case on the master branch on my machine.

Tests 5-6 seemed to work correctly.

I don't have a Windows machine to test the Windows GNU or MSVC assembly implementations, but they are nearly identical to the Unix implementation, so in principle they should work as well.

Benchmark results

I used the BLAKE3-specs criterion benchmarks to compare the master branch to my sse2 branch[1]. For small inputs (<= 1 KiB), the changes were either statistically insignificant, absent, or minor regressions of ~1-2%. For medium inputs (1 KiB < n < 10 KiB), performance improved by ~2% to ~200%, and the improvement eventually reached ~230% for the largest inputs.

[1] https://mkrupcale.gitlab.io/BLAKE3-specs-benchmark-results

+8809 -14

0 comment

18 changed files

pr created time in 2 months

create branch mkrupcale/BLAKE3

branch : sse2

created branch time in 2 months

fork mkrupcale/BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function

fork in 2 months

started jlblancoc/nanoflann

started time in 2 months

started mozilla/rr

started time in 2 months

started jk-jeon/Grisu-Exact

started time in 2 months

push event mkrupcale/.lyx

Matthew Krupcale

commit sha 149f96d56f558e481f3a4ac9435ed9400e77f4bc

Add REVTeX 4.2 layout

* layouts/revtex4-2.module: Add REVTeX 4.2 layout based on official REVTeX 4.1 layout

push time in 3 months

started p-ranav/psched

started time in 3 months

PR opened jtv/libpqxx

test: Properly order tests executed by test runner

The order of tests run is determined by the test map order, and since some tests depend on the output from previous tests, the map must be ordered based on lexicographical comparison. Using char const* for the map keys results in the map order being determined by the pointer order, not the lexicographical string order. Additionally, the pointer order is indeterminate since the C-string storage and static initialization order is indeterminate[1]. Thus, use a std::string key to create a test map in lexicographical order and run the tests in the correct order.

  • test/runner.cxx:
    • Use std::string for test map keys
    • Fix some includes
  • test/test_helpers.hxx: Fix some includes

[1] https://en.cppreference.com/w/cpp/language/initialization#Non-local_variables
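
The ordering problem described above can be reproduced in isolation. The following C++ sketch is not the libpqxx test runner: the test names, the `storage` buffer, and the helper functions are all hypothetical. It places two names in memory so that address order and alphabetical order disagree, then shows which order each kind of map iterates in.

```cpp
#include <map>
#include <string>
#include <vector>

// Two names stored back to back, deliberately NOT in alphabetical order,
// so that address order and string order disagree.
static char const storage[] = "test_b\0test_a";
static char const *const name_b = storage;      // "test_b", lower address
static char const *const name_a = storage + 7;  // "test_a", higher address

// Collect the keys of a map as strings, in the map's iteration order.
template <typename Map>
std::vector<std::string> keys_in_order(Map const &m) {
    std::vector<std::string> keys;
    for (auto const &entry : m) keys.push_back(std::string(entry.first));
    return keys;
}

// Keys are pointers, so std::less compares addresses, not characters.
std::vector<std::string> pointer_keyed_order() {
    std::map<char const *, int> tests;
    tests.emplace(name_a, 1);
    tests.emplace(name_b, 2);
    return keys_in_order(tests);
}

// Keys are std::string, so the map is ordered lexicographically.
std::vector<std::string> string_keyed_order() {
    std::map<std::string, int> tests;
    tests.emplace(name_a, 1);
    tests.emplace(name_b, 2);
    return keys_in_order(tests);
}
```

Here `pointer_keyed_order()` visits "test_b" before "test_a" because of where the strings happen to live, while `string_keyed_order()` visits them alphabetically. In a real program the addresses of string literals are chosen by the compiler and linker, which is why the pointer-keyed order cannot be relied on.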

+5 -3

0 comment

2 changed files

pr created time in 3 months

push event mkrupcale/libpqxx

Matthew Krupcale

commit sha a2c240d0fb03f82e0a7b520e6c8d153402f67f54

test: Use std::string for test map keys

The order of tests run is determined by the test map order, and since some tests depend on the output from previous tests, the map must be ordered based on lexicographical comparison. Using `char const*` for the map keys results in the map order being determined by the pointer order, not the lexicographical string order. Additionally, the pointer order is indeterminate since the C-string storage and static initialization order is indeterminate[1]. Thus, use a std::string key to create a test map in lexicographical order and run the tests in the correct order.

* test/runner.cxx: Use std::string for test map keys

[1] https://en.cppreference.com/w/cpp/language/initialization#Non-local_variables

push time in 3 months

create branch mkrupcale/libpqxx

branch : ordered-test-runner

created branch time in 3 months

started Genymobile/scrcpy

started time in 3 months

started continental/ecal

started time in 3 months

started LuxCoreRender/LuxCore

started time in 3 months
