If you are wondering where the data of this site comes from, please visit https://api.github.com/users/nadimkobeissi/events. GitMemory does not store any data, but only uses NGINX to cache data for a period of time. The idea behind GitMemory is simply to give users a better reading experience.
Nadim Kobeissi nadimkobeissi @capsulesocial Paris, France https://nadim.computer

Inria-Prosecco/proscript-messaging 33

Supporting materials for our EuroS&P paper: Automated Verification for Secure Messaging Protocols and their Implementations: A Symbolic and Computational Approach.

ad-l/djcl 11

DJS Crypto Library

Inria-Prosecco/acme-model 9

A Formal Model for ACME: Analyzing Domain Validation over Insecure Channels

nadimkobeissi/cpr 2

cp with progress bar and other stats

nadimkobeissi/formatforest 1

Simple parser-generator blogging engine written in Go.

nadimkobeissi/diskgem 0

Command-line SFTP client written in Go with a terminal-based user interface.

nadimkobeissi/js-ipfs 0

IPFS implementation in JavaScript

release CMiksche/gitea-auto-update

2.0.9

released time in 4 days

release nodejs/node

v14.17.1

released time in 5 days

release facebook/flow

v0.153.0

released time in 9 days

release microsoft/vscode

1.57.0

released time in 9 days

push event pq-crystals/kyber

Basil Hess

commit sha 49a2e02f30061b24ee3d234dd4240ce84d2f6de0

Corrects claimed security level for Kyber768-90s to 3 (in yml)


Gregor Seiler

commit sha fd83229e9dcc7c235a5ea8bb320d1fbade812452

Merge pull request #40 from bhess/master Corrects claimed security level for Kyber768-90s to 3 (in yml)


push time in 13 days

issue comment pq-crystals/kyber

ARM assembly support?

I'm fully aware that the specialty of AVX2 is that it can multiply high or low in a single instruction

Note that MVE and SVE also have separate high/low multiply instructions. It is also noteworthy that in contrast to AVX2, those instructions aren't limited to 16-bit lanes, which is of interest e.g. for Dilithium or an NTT-based implementation of Saber.

When looking through your code, I noticed that you often duplicate twiddle factors across vectors. Have you considered loading multiple twiddle factors into a single vector and using the lane-indexed instruction variants? For layers 0-5, where multiplication by twiddles happens at the granularity of vectors, this will save you a few instructions and hopefully cycles.

tomleavy

comment created time in 17 days

issue comment pq-crystals/kyber

ARM assembly support?

I state exactly the issue you mentioned in a submitted paper (mine is going to be published soon), and you solved my bottleneck in fqmul.

Great! Here's a complete patch, by the way:

#define fqmul(out, in, zeta, t)                                                                  \
    t.val[0] = (int16x8_t)vqdmulhq_s16(in, zeta);                  /* (2*a)_H */                 \
    t.val[1] = (int16x8_t)vmulq_s16(in, zeta);                     /* a_L */                     \
    t.val[2] = vmulq_s16(t.val[1], neon_qinv);                     /* a_L = a_L * QINV */        \
    t.val[3] = (int16x8_t)vqdmulhq_s16(t.val[2], neon_kyberq);     /* (2*a_L*Q)_H */             \
    out = vhsubq_s16(t.val[0], t.val[3]);                          /* ((2*a)_H - (2*a_L*Q)_H)/2 */

This, of course, doesn't yet implement that constant merging, though.

It would be interesting to see what difference it actually makes on various CPUs. Since the bottleneck may be the multiplication sequence, rather than the boilerplate around it, the gain might not be as large as one would expect from the mere instruction count.

Please let me know about your paper once you publish it; I am happy to cite it.

Yes, will do!

tomleavy

comment created time in 17 days

issue comment pq-crystals/kyber

ARM assembly support?

Hi @hanno-arm,

Yes, you're right. It is the multiply instruction I was looking for. I'm fully aware that the specialty of AVX2 is that it can multiply high or low in a single instruction. I tried to look for a similar instruction in the past but, because of time limits, I don't know much about the rounding and saturating variants. I state exactly the issue you mentioned in a submitted paper (mine is going to be published soon), and you solved my bottleneck in fqmul. Please let me know about your paper once you publish it; I am happy to cite it. I have a precise Kyber range analysis that improves upon Bas Westerbaan's paper, When to Barrett reduce in the inverse NTT, and I hope I can apply it to the new approach.

Thank you very much. I will definitely mention you, and cite your work.

tomleavy

comment created time in 17 days

issue comment pq-crystals/kyber

ARM assembly support?

Thanks a lot @cothan for sharing your work on developing optimized PQC implementations for Arm, it's great to see more interest and progress in this area.

Note that you can further improve performance by using the fixed-point doubling multiply-high VQDMULH to implement @gregorseiler's approach to Montgomery multiplication from Faster AVX2 optimized NTT multiplication for Ring-LWE lattice cryptography (use the doubling VQDMULH for the high multiply, VMUL for the low multiply, and the halving subtract VHSUB to compensate for the doubling). This reduces the Montgomery multiplication sequence from 9 to 5 instructions and should be a drop-in replacement for your fqmul macro. If you further precompute twisted constants following NTTRU: Truly Fast NTRU Using NTT, you'll get to 4 instructions.

Finally, you can use the fixed-point multiply-high-accumulate VQRDMLAH (v8.1-A onwards) to get to 3 instructions, but this is quite a bit trickier because of the carry-in mentioned in @gregorseiler's paper, as well as the rounding, doubling, and especially impacted range analysis in case of the Kyber prime.

If you publish or present your code with those modifications, a mention would be appreciated. A cite-able paper on the use of fixed point instructions for modular arithmetic -- which also applies to MVE, SVE and SVE2 -- will hopefully appear shortly.
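The five- and four-instruction sequences described in this comment can be sanity-checked without Arm hardware. The following is a minimal scalar sketch, not code from the thread: `sqdmulh`, `mul_lo` and `shsub` are hypothetical one-lane models of VQDMULH, VMUL and VHSUB, `twist` is an assumed helper for the precomputed "twisted" constant, and the comparison target is a textbook signed Montgomery reduction with q = 3329 and q^-1 = -3327 mod 2^16. It assumes arithmetic right shift of negative values and wrapping narrowing conversions, as GCC and Clang provide.

```c
#include <assert.h>
#include <stdint.h>

#define KYBER_Q 3329
#define QINV (-3327) /* q^-1 mod 2^16, since 3329 * -3327 = 1 - 169 * 2^16 */

/* Keep the low 16 bits of x, read as a signed value. */
static int16_t wrap16(int32_t x) { return (int16_t)(uint16_t)x; }

/* Reference: returns a * 2^-16 mod q (a - t*q is an exact multiple of 2^16). */
static int16_t montgomery_reduce(int32_t a) {
    int16_t t = wrap16((int32_t)wrap16(a) * QINV);
    return (int16_t)((a - (int32_t)t * KYBER_Q) / 65536);
}

/* One-lane model of VQDMULH: saturate((2*a*b) >> 16). */
static int16_t sqdmulh(int16_t a, int16_t b) {
    if (a == INT16_MIN && b == INT16_MIN) return INT16_MAX; /* only saturating case */
    return (int16_t)((2 * (int32_t)a * b) >> 16);
}

/* One-lane model of VMUL: low 16 bits of the product. */
static int16_t mul_lo(int16_t a, int16_t b) { return wrap16((int32_t)a * b); }

/* One-lane model of VHSUB: (a - b) >> 1, computed without overflow. */
static int16_t shsub(int16_t a, int16_t b) {
    return (int16_t)(((int32_t)a - b) >> 1);
}

/* Five-instruction sequence: doubling high multiply, low multiply,
 * multiply by QINV, doubling high multiply by q, halving subtract. */
static int16_t fqmul5(int16_t a, int16_t zeta) {
    int16_t hi = sqdmulh(a, zeta);
    int16_t lo = mul_lo(a, zeta);
    int16_t t  = mul_lo(lo, QINV);
    int16_t th = sqdmulh(t, KYBER_Q);
    return shsub(hi, th);
}

/* Hypothetical helper: precompute zeta * QINV mod 2^16 once per constant. */
static int16_t twist(int16_t zeta) { return mul_lo(zeta, QINV); }

/* Four-instruction sequence: the two low multiplies merge into one. */
static int16_t fqmul4(int16_t a, int16_t zeta, int16_t zeta_twisted) {
    int16_t hi = sqdmulh(a, zeta);
    int16_t t  = mul_lo(a, zeta_twisted);
    int16_t th = sqdmulh(t, KYBER_Q);
    return shsub(hi, th);
}

/* Compare both sequences with the reference for every 16-bit input and a
 * sample of constants in [-1664, 1664]; returns the number of mismatches. */
long count_mismatches(void) {
    long bad = 0;
    for (int zeta = -1664; zeta <= 1664; zeta += 13) {
        int16_t zt = twist((int16_t)zeta);
        for (int a = INT16_MIN; a <= INT16_MAX; a++) {
            int16_t ref = montgomery_reduce((int32_t)a * zeta);
            if (fqmul5((int16_t)a, (int16_t)zeta) != ref) bad++;
            if (fqmul4((int16_t)a, (int16_t)zeta, zt) != ref) bad++;
        }
    }
    return bad;
}
```

Calling `count_mismatches()` from a small `main` is enough to exercise the model. The halving subtract is exact here because both doubled high halves carry the same bit out of the shared low 16 bits, so their difference is always even.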

tomleavy

comment created time in 17 days

PR opened pq-crystals/kyber

Elim unused param and use getrandom API.

The function randombytes_init included an unused parameter which has been removed. In 2017, glibc added an API call for getrandom. This update checks the glibc version and uses the getrandom API call if available instead of making a syscall.
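The version check described here might look roughly like the sketch below. It is not the code from the pull request: the function name `fill_random` and the macro `USE_GETRANDOM_WRAPPER` are made up for illustration, the sketch is Linux-specific, and it assumes the `getrandom` wrapper first shipped in glibc 2.25.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for syscall() on older glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: glibc 2.25 and later declare getrandom() in <sys/random.h>. */
#if defined(__GLIBC__) && (__GLIBC__ > 2 || (__GLIBC__ == 2 && __GLIBC_MINOR__ >= 25))
#include <sys/random.h>
#define USE_GETRANDOM_WRAPPER 1
#else
#include <sys/syscall.h>
#endif

/* Fill out[0..outlen) with kernel randomness; returns 0 on success, -1 on error. */
static int fill_random(uint8_t *out, size_t outlen) {
    while (outlen > 0) {
#ifdef USE_GETRANDOM_WRAPPER
        ssize_t ret = getrandom(out, outlen, 0);
#else
        ssize_t ret = (ssize_t)syscall(SYS_getrandom, out, outlen, 0);
#endif
        if (ret < 0) {
            if (errno == EINTR) continue; /* interrupted by a signal: retry */
            return -1;
        }
        /* getrandom may return fewer bytes than requested. */
        out += ret;
        outlen -= (size_t)ret;
    }
    return 0;
}
```

Both paths end up in the same kernel call; the wrapper only removes the need for a raw `syscall` and its headers.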

+7 -9

0 comments

4 changed files

pr created time in a month

issue comment pq-crystals/kyber

License clarification

fizanaaz1995 wrote:

Kindly confirm which of the two licenses below is applicable.

Public Domain (License text - Public domain code is not subject to any license. )
Creative Commons Zero v1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

Neither "public domain" nor CC0 are licenses. We use the CC0 waiver to waive our author rights to the maximum extent possible, and thus place the work into the public domain as far as this is possible. For example, by European law it is not possible to waive all author rights -- the only way for any work to truly enter the public domain is for the author(s) to die and then a certain period of time to pass.

fizanaaz1995

comment created time in a month

issue opened pq-crystals/kyber

License clarification

Kindly confirm which of the two licenses below is applicable.

Public Domain (License text - Public domain code is not subject to any license. )
Creative Commons Zero v1.0 Universal (https://creativecommons.org/publicdomain/zero/1.0/)

created time in a month

issue comment pq-crystals/kyber

ARM assembly support?

Hi @marco-palumbi ,

Here is the configuration of my Raspberry Pi 4:

cothan@manjaro-rp4 ~> uname -a
Linux manjaro-rp4 5.10.20-1-MANJARO-ARM #1 SMP PREEMPT Thu Mar 4 14:36:16 CST 2021 aarch64 GNU/Linux

cothan@manjaro-rp4 ~> gcc -v 
Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0/lto-wrapper
Target: aarch64-unknown-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://github.com/archlinuxarm/PKGBUILDs/issues --enable-languages=c,c++,fortran,go,lto,objc,obj-c++,d --with-isl --with-linker-hash-style=gnu --with-system-zlib --enable-__cxa_atexit --enable-checking=release --enable-clocale=gnu --enable-default-pie --enable-default-ssp --enable-gnu-indirect-function --enable-gnu-unique-object --enable-install-libiberty --enable-linker-build-id --enable-lto --enable-plugin --enable-shared --enable-threads=posix --disable-libssp --disable-libstdcxx-pch --disable-libunwind-exceptions --disable-multilib --disable-werror --host=aarch64-unknown-linux-gnu --build=aarch64-unknown-linux-gnu --with-arch=armv8-a --enable-fix-cortex-a53-835769 --enable-fix-cortex-a53-843419 gdc_include_dir=/usr/include/dlang/gdc
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 10.2.0 (GCC) 

cothan@manjaro-rp4 ~> clang -v 
clang version 11.1.0
Target: aarch64-unknown-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
Found candidate GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Found candidate GCC installation: /usr/lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Selected GCC installation: /usr/bin/../lib/gcc/aarch64-unknown-linux-gnu/10.2.0
Candidate multilib: .;@m64
Selected multilib: .;@m64

I used Clang for all my results.

tomleavy

comment created time in a month

issue comment pq-crystals/kyber

ARM assembly support?

@cothan Which OS and compiler did you use on the Raspberry Pi?

tomleavy

comment created time in a month

started cloudflare/quiche

started time in 2 months

started Defi-Cartel/salmonella

started time in 2 months

started cryptochill/tss-ecdsa-cli

started time in 2 months

started ZenGo-X/white-city

started time in 2 months

started supranational/blst

started time in 2 months

fork veorq/CMTA20

Smart contract to tokenize a Swiss corporation's shares on Ethereum

https://cmta.ch

fork in 2 months

started nadimkobeissi/formatforest

started time in 2 months

issue comment pq-crystals/kyber

ARM assembly support?

Kyber A72 NEON                neon Level 1  neon Level 3  neon Level 5
gen_matrix                          27,152        59,800       108,518
neon_poly_getnoise_eta1_2x           9,247         4,422         4,416
neon_poly_getnoise_eta2              2,679         2,679         2,683
poly_tomsg                             976           979           980
poly_frommsg                           810           816           815
neon_ntt                             1,496         1,473         1,476
neon_invntt                          1,659         1,661         1,657
crypto_kem_keypair                  72,015       116,390       184,965
crypto_kem_enc                      95,287       150,959       223,786
crypto_kem_dec                      94,069       149,845       220,671

Kyber A72 REF                  ref Level 1   ref Level 3   ref Level 5
gen_matrix                          28,921        70,117       127,776
ref_poly_getnoise_eta1               4,397         3,317         3,315
ref_poly_getnoise_eta2               3,311         3,314         3,317
poly_tomsg                             976           979           979
poly_frommsg                           810           815           817
ref_ntt                              8,496         8,500         8,551
ref_invntt                          12,530        12,533        12,409
crypto_kem_keypair                 136,934       237,601       371,906
crypto_kem_enc                     184,533       298,928       440,645
crypto_kem_dec                     223,359       349,122       503,820

Yes, sure. I obtained the results using a patch of PAPI for Cortex-A72: https://github.com/cothan/PAPI_ARMv8_Cortex_A72

Note: neon_poly_getnoise_eta1_2x includes the NEON implementation of NEON_SHA2x on Cortex-A72. When comparing it with ref_poly_getnoise_eta1 at Level 3 and Level 5, you have to divide by 2.
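To read the Cortex-A72 figures as speed-ups, one can divide each ref entry by the matching neon entry. The sketch below does this for the Level 3 column only; the struct, array, and function names are illustrative and the numbers are copied from the two tables in this comment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One table row: the same operation measured with the ref and NEON code. */
struct row {
    const char *name;
    long ref_count;
    long neon_count;
};

/* Level 3 figures copied from the Cortex-A72 tables in the comment. */
static const struct row a72_level3[] = {
    {"ntt",                  8500,   1473},
    {"invntt",              12533,   1661},
    {"crypto_kem_keypair", 237601, 116390},
    {"crypto_kem_enc",     298928, 150959},
    {"crypto_kem_dec",     349122, 149845},
};

/* Speed-up factor of the NEON code over the reference code. */
static double speedup(long ref_count, long neon_count) {
    assert(neon_count > 0);
    return (double)ref_count / (double)neon_count;
}

/* Print one line per operation, e.g. for a quick comparison across CPUs. */
void print_speedups(void) {
    size_t n = sizeof a72_level3 / sizeof a72_level3[0];
    for (size_t i = 0; i < n; i++) {
        printf("%-20s %.2fx\n", a72_level3[i].name,
               speedup(a72_level3[i].ref_count, a72_level3[i].neon_count));
    }
}
```

Calling `print_speedups()` from `main` lists the factors. The NTT rows come out far higher than the full KEM operations, which also include work other than the NTT, such as gen_matrix, where the NEON and ref figures are much closer.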

tomleavy

comment created time in 2 months

issue comment pq-crystals/kyber

ARM assembly support?

@cothan Could you also report the cycle counts (CCs) for your latest code running on your Raspberry Pis?

tomleavy

comment created time in 2 months

started cronokirby/safenum

started time in 3 months

issue comment pq-crystals/kyber

ARM assembly support?

After several iterations, my NEON ARMv8 code is finally complete. I also have some benchmarks on Apple M1, where the speed-up is a bit better.

Kyber M1 NEON                 neon Level 1  neon Level 3  neon Level 5
gen_matrix                           7,680        17,944        30,743
neon_ntt                               413           413           413
neon_invntt                            428           428           428
crypto_kem_keypair                  22,958        36,342        55,951
crypto_kem_enc                      32,522        49,134        71,579
crypto_kem_dec                      29,412        45,100        67,026

Kyber M1 REF                   ref Level 1   ref Level 3   ref Level 5
gen_matrix                          11,989        26,894        47,558
ref_ntt                              3,171         3,223         3,217
ref_invntt                           5,171         5,118         5,148
crypto_kem_keypair                  59,622       105,058       163,075
crypto_kem_enc                      76,513       120,766       175,568
crypto_kem_dec                      90,254       138,813       198,509

I think this NEON version is ready. What else should I do to make a pull request?

tomleavy

comment created time in 3 months

started nadimkobeissi/cpr

started time in 3 months

issue comment pq-crystals/kyber

ARM assembly support?

Hello @tomleavy @cryptojedi and @gregorseiler,

We have made a fork of the Kyber repo and made an ARM32 and ARM64 implementation of Kyber: https://github.com/BeechatNetworkSystemsLtd/Kyber-ARM

We have also made a Java wrapper for Kyber: https://github.com/BeechatNetworkSystemsLtd/JNIPQC

We are also currently working on a Python wrapper for Kyber.

We hope this helps, and we hope to contribute to the Kyber project and to its implementation on multiple platforms.

Cheers,

The Beechat Network team

tomleavy

comment created time in 3 months