sen (senhuang42) @Facebook · New York, NY · Yale '20

senhuang42/sqz 2

New lossless audio compression codec

senhuang42/zstd 1

Zstandard - Fast real-time compression algorithm

senhuang42/benchmarkzstd 0

some super messy code to do benchmarking on zstd changes :))

senhuang42/Bsh 0

Bsh is a simple shell, a baby brother of the Bourne-again shell bash, and offers a limited subset of bash's functionality (plus some extras)

senhuang42/dontforget 0

A cute lil' Chrome extension

senhuang42/dooxsite 0

doox site

senhuang42/garrysmodscriptedeffects 0

Some scripted weapons for Garry's Mod that create beautiful FX programmatically generated from Half Life 2 effects. Looks great

senhuang42/languagemodeling 0

Python Natural Language Toolkit: PoS tagging, Hidden Markov Models, Language Modeling

senhuang42/learningc_ohgod 0

Solving the bin packing problem using C

senhuang42/NEURON_test6 0

Neuron flow numerical simulations and methods

issue closed facebook/zstd

Advice for training on a small corpus

I’m looking for guidance on how best to create a trained dictionary, or if I really should be using one at all.

I’m working on a new protocol fingerprinting scheme. Examples of protocol fingerprints include p0f, JA3, and HASSH. Most protocol fingerprinting schemes work by picking protocol features of interest, and then coming up with a fingerprinting format that represents those features. I want to take a somewhat different approach.

Instead of creating a fingerprint format per se, the fingerprint will itself be a valid protocol message. This enables protocol parsers to be reused, with little to no modification, as fingerprint parsers. This works by redacting, and in some cases normalizing, protocol messages.

Consider an HTTP request sent by curl:

GET /resource HTTP/1.1
Host: www.example.com
User-Agent: curl/7.68.0
Accept: */*

In turning this into a fingerprint, there are decisions to make about which features to preserve. For example, a fingerprint focused on identifying the client implementation isn’t interested in the URL path or the contents of the Host header, so we’ll redact those. There are plenty of other considerations, like which other header values to preserve or redact, but to keep this example simple let’s just redact those two values by replacing them with a “-”:

GET - HTTP/1.1
Host: -
User-Agent: curl/7.68.0
Accept: */*

That’s the strawman HTTP fingerprint. (I picked HTTP here since, as an ASCII protocol, it’s easier to demonstrate with.) The fingerprint is longer than it could be and not easily shareable: it’s hard to paste into a tweet without accidentally copying too much or too little of it, or messing up the whitespace. So I want to reduce the fingerprint size, for example by building a corpus of fingerprints along these lines and training a zstd dictionary on that corpus. With the dictionary, protocol messages can be redacted and compressed, and the result hex-encoded for easy sharing.
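To sketch the mechanics, here’s roughly the pipeline I have in mind using zstd’s public ZDICT/ZSTD APIs (the sample fingerprints, buffer sizes, and compression level are placeholder values of my own, and a corpus this tiny would of course hit the same training warning, or fail outright):

#include <stdio.h>
#include <string.h>
#include <zstd.h>   /* ZSTD_compress_usingDict */
#include <zdict.h>  /* ZDICT_trainFromBuffer */

int main(void)
{
    /* Placeholder corpus: in practice, each sample is one redacted fingerprint. */
    const char* samples[] = {
        "GET - HTTP/1.1\r\nHost: -\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\n\r\n",
        "GET - HTTP/1.1\r\nHost: -\r\nUser-Agent: Wget/1.20.3 (linux-gnu)\r\nAccept: */*\r\n\r\n",
    };
    unsigned const nbSamples = sizeof(samples) / sizeof(samples[0]);

    /* Concatenate all samples into one buffer, recording each sample's size. */
    char samplesBuffer[4096];
    size_t sampleSizes[2];
    size_t offset = 0;
    for (unsigned i = 0; i < nbSamples; i++) {
        sampleSizes[i] = strlen(samples[i]);
        memcpy(samplesBuffer + offset, samples[i], sampleSizes[i]);
        offset += sampleSizes[i];
    }

    /* Train a dictionary; the target size here is arbitrary. */
    char dict[1024];
    size_t const dictSize = ZDICT_trainFromBuffer(dict, sizeof(dict),
                                                  samplesBuffer, sampleSizes, nbSamples);
    if (ZDICT_isError(dictSize)) {
        fprintf(stderr, "training failed: %s\n", ZDICT_getErrorName(dictSize));
        return 1;
    }

    /* Redact a message (redaction omitted here), compress it with the dictionary,
     * then hex-encode the result for sharing. */
    const char* fingerprint = samples[0];
    char compressed[1024];
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t const cSize = ZSTD_compress_usingDict(cctx, compressed, sizeof(compressed),
                                                 fingerprint, strlen(fingerprint),
                                                 dict, dictSize, 19);
    ZSTD_freeCCtx(cctx);
    if (ZSTD_isError(cSize)) {
        fprintf(stderr, "compression failed: %s\n", ZSTD_getErrorName(cSize));
        return 1;
    }
    for (size_t i = 0; i < cSize; i++) printf("%02x", (unsigned char)compressed[i]);
    printf("\n");
    return 0;
}

The consuming side would then ship the same (versioned) dictionary, reverse the hex encoding, and call ZSTD_decompress_usingDict() to recover the fingerprint.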

I’ve coded most of this approach up, though for a different protocol: TLS. I have what I thought was a decent corpus of TLS pcaps from various TLS clients. However, after going through redaction/normalization I only end up with about 3,600 ClientHello fingerprints and 340 ServerHello fingerprints. When I train with this small corpus size, zstd gives me a warning:

!  Warning : data size of samples too small for target dictionary size 
!  Samples should be about 100x larger than target dictionary size 
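For a rough sense of scale (assuming ~250 bytes per redacted ClientHello, which is just a guess at the average):

3,600 samples x ~250 bytes ≈ 900 KB of training data
900 KB / 100 ≈ 9 KB suggested target dictionary size
(well below the CLI’s default --maxdict of about 110 KB, hence the warning)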

This warning makes me wonder whether I’m taking the right approach, or what I could do to produce a better dictionary. Part of the longer-term plan is to version the fingerprint, so that better dictionaries can be produced over time, but I need to come up with a “v0” dictionary to get started.

The corpus I have is pretty real-world, so that’s good, but I’m wondering if I should mix in more obscure possibilities. There’s a project called utls (https://github.com/refraction-networking/utls) that gives a great deal of control over the ClientHello, so I'm thinking about using it to produce a wider range of ClientHellos to include in the corpus.

That’s what I’m going to try next. Is it a good or bad idea to add still-valid but more obscure examples to the corpus? Are there other tips or tricks for training with a smaller corpus?

Thank you!

closed time in 9 days

bhiggins

issue comment facebook/zstd

Advice for training on a small corpus

Closing since no followup questions.

bhiggins

comment created time in 9 days

push event senhuang42/zstd

senhuang42

commit sha d85befe84ac763eb344a7cc92011f3e0434ba058

Separated out, qsort

view details

push time in 9 days

create branch senhuang42/zstd

branch : bucket_huf

created branch time in 9 days

started resume/resume.github.com

started time in 12 days

create branch senhuang42/zstd

branch : sb_compress

created branch time in 19 days

push event senhuang42/zstd

senhuang42

commit sha f18cb448dfb3d56f69a9ba4d12f4227ba4dffeb9

Move MacOS test to github actions

view details

push time in 24 days

push event senhuang42/zstd

senhuang42

commit sha 709218e82945758cc33182296502edd698a7aec1

Move MacOS test to github actions

view details

push time in 24 days

push event senhuang42/zstd

senhuang42

commit sha ea6c0738bbce21ffe96138acc33213fbe7236679

Move MacOS test to github actions

view details

push time in 25 days

push event senhuang42/zstd

senhuang42

commit sha 96ee6a7ac508626e84cc873389b2bea82abc1ab4

Move MacOS test to github actions

view details

push time in 25 days

push event senhuang42/zstd

senhuang42

commit sha 51c8c1cf1280e64451bad40330b4c5f3418d72f4

Move MacOS test to github actions

view details

push time in 25 days

push event facebook/zstd

senhuang42

commit sha f5f6cc2e483a3af49a98a063d014b2e2e7e9606c

Remove folder when done with test

view details

sen

commit sha d90bc0e0b6ad7a9b6152a83c472b551cf0d3b53e

Merge pull request #2720 from senhuang42/remove_folder Remove folder when done with test

view details

push time in 25 days

PR merged facebook/zstd

Remove folder when done with test (CLA Signed)

The CLI tests were leaving behind this precompressedFilterTestDir folder.

+1 -0

0 comment

1 changed file

senhuang42

pr closed time in 25 days

push event senhuang42/zstd

senhuang42

commit sha 7ea2c329e34bcd5d4f1ff1be58dbf36fc93a2516

Move MacOS test to github actions

view details

push time in 25 days

push event senhuang42/zstd

senhuang42

commit sha 186e94cf37fd52b4336cc366a8f3be08de0f3755

Move MacOS test to github actions

view details

push time in 25 days

push event senhuang42/zstd

senhuang42

commit sha 2be382516d4b7d39e60e9c755c27a146e69d1d30

Move MacOS test to github actions

view details

push time in 25 days

push event senhuang42/zstd

senhuang42

commit sha 4a5e8921bfcbb1e3c02bbea2e4c86f869b736a96

Move MacOS test to github actions

view details

push time in 25 days

push event senhuang42/zstd

senhuang42

commit sha 755b41c49922bc04d556c776c3ef05cc9fa3f292

Move MacOS test to github actions

view details

push time in 25 days

PR opened facebook/zstd

Remove folder when done with test

The CLI tests were leaving behind this precompressedFilterTestDir folder.

+1 -0

0 comment

1 changed file

pr created time in 25 days

push event senhuang42/zstd

senhuang42

commit sha 5debf7c7f913eb37407227380b177a3b0bfcc50a

Move MacOS test to github actions

view details

push time in 25 days

PR opened senhuang42/zstd

Remove folder when done with test
+1 -0

0 comment

1 changed file

pr created time in 25 days

PR opened senhuang42/zstd

Move MacOS test to github actions
+1 -7

0 comment

2 changed files

pr created time in 25 days

create branch senhuang42/zstd

branch : remove_folder

created branch time in 25 days

create branch senhuang42/zstd

branch : amcso

created branch time in 25 days

push event facebook/zstd

senhuang42

commit sha 76466dfadf871babdaebcecb448cc00c08fbfd78

Add simple API for converting ZSTD_Sequence into seqStore

view details

sen

commit sha 45d707e908c302fee84bb0f6d34b12acc3d14e0f

Merge pull request #2715 from senhuang42/sequence_api_3 [RFC] Add internal API for converting ZSTD_Sequence into seqStore

view details

push time in a month

PR merged facebook/zstd

[RFC] Add internal API for converting ZSTD_Sequence into seqStore (CLA Signed)

It turns out this functionality already exists; this PR just adds a small wrapper around it to provide a clean interface. Currently, on silesia.tar with sequences generated from compression level 3 and ZSTD_generateSequences(), ZSTD_convertBlockSequencesToSeqStore runs at around 2100 MB/s averaged across all blocks. Most of this time seems to be spent in ZSTD_finalizeOffCode().

More importantly, this PR proposes a general set of guidelines for dealing with hardware-accelerated matchfinders and how they should be integrated into the library at a per-block level; suggestions are welcome here.

Generally, a hardware-accelerated matchfinder must adhere to the below function signature, storing its result in an array of ZSTD_Sequence.

// Generic function signature for hardware matchfinders.
// Accepts a void* pointer for a "bag" of parameters that the matchfinder may use,
// possibly derived from the ZSTD_CCtx parameters.
//
// As an example, one could define a function
// size_t ZSTD_accelerated_findMatches(ZSTD_Sequence* sequences, size_t sequencesCapacity,
//                               void* params, const void* src, size_t srcSize);
//
// Returns number of sequences generated, storing the result in `sequences`, or a zstd error.
typedef size_t (*ZSTD_hardwareMatchFinder) 
     (ZSTD_Sequence* sequences, size_t sequencesCapacity, void* params,
      const void* src, size_t srcSize);

The reasoning is that, down the line, we can then define the following function, which could potentially select between multiple accelerated matchfinders:

// This function selects the final hardware match finder used, depending on the
// parameters in the ZSTD_CCtx. 
//
// ZSTD_selectHardwareMatchFinder() then will return ZSTD_accelerated_findMatches.
ZSTD_hardwareMatchFinder ZSTD_selectHardwareMatchFinder(const ZSTD_CCtx* zc);

Finally, the code could be integrated along these lines in ZSTD_compressBlock_internal() (and of course, a first implementation can hard-code many of these dynamic decisions for testing purposes).

static size_t ZSTD_compressBlock_internal(ZSTD_CCtx* zc,
                                        void* dst, size_t dstCapacity,
                                        const void* src, size_t srcSize, U32 frame)
{
    /* This is the upper bound for the length of an rle block.
     * This isn't the actual upper bound. Finding the real threshold
     * needs further investigation.
     */
    const U32 rleMaxLength = 25;
    size_t cSize;
    const BYTE* ip = (const BYTE*)src;
    BYTE* op = (BYTE*)dst;
    DEBUGLOG(5, "ZSTD_compressBlock_internal (dstCapacity=%u, dictLimit=%u, nextToUpdate=%u)",
                (unsigned)dstCapacity, (unsigned)zc->blockState.matchState.window.dictLimit,
                (unsigned)zc->blockState.matchState.nextToUpdate);
                
    // HARDWARE ACCELERATED MATCHFINDING PATH HERE
    // ZSTD_useHardwareAccelerator() is a hypothetical function that determines
    // whether we use a hardware-accelerated approach for matchfinder, depending
    // on factors such as compression parameters and whatnot. The decision to use a hardware accelerator
    // could be predetermined/finalized during parameter initialization, and stored as a variable in the cctx.
    if (ZSTD_useHardwareAccelerator(zc)) {
        // Now, select a hardware matchfinder, based on parameters in ZSTD_CCtx
        ZSTD_hardwareMatchFinder matchFinder = ZSTD_selectHardwareMatchFinder(zc);
        
        // Reset the existing seqStore
        ZSTD_resetSeqStore(&zc->seqStore);

        // Call through the function pointer to the accelerated matchfinder to generate sequences.
        // `params` stands in for a custom struct of all required parameters for the particular matchfinder.
        // `zc->hardwareSequences` is presumed already allocated and `zc->hardwareSequencesCapacity`
        // already determined, likely when hardware-accelerated match-finding was chosen during
        // parameter finalization.
        size_t const nbSeqs = matchFinder(zc->hardwareSequences, zc->hardwareSequencesCapacity, &params, src, srcSize);
        
        // Generated sequences passed to new API, which gives us our final `zc->seqStore`
        FORWARD_IF_ERROR(ZSTD_convertBlockSequencesToSeqStore(...), "");
    } else {
        const size_t bss = ZSTD_buildSeqStore(zc, src, srcSize);
        FORWARD_IF_ERROR(bss, "ZSTD_buildSeqStore failed");
        if (bss == ZSTDbss_noCompress) { cSize = 0; goto out; }
    }
    ...
+17 -6

0 comment

1 changed file

senhuang42

pr closed time in a month

pull request review event

push event senhuang42/zstd

senhuang42

commit sha 76466dfadf871babdaebcecb448cc00c08fbfd78

Add simple API for converting ZSTD_Sequence into seqStore

view details

push time in a month

PR opened facebook/zstd

[RFC] Add internal API for converting ZSTD_Sequence into seqStore

+11 -6

0 comment

1 changed file

pr created time in a month

push event senhuang42/zstd

senhuang42

commit sha bebaf6031221723d6891a40f9ed47cfd52158574

Add simple API for converting ZSTD_Sequence into seqStore

view details

push time in a month