Advice for training on a small corpus
I’m looking for guidance on how best to create a trained dictionary, or if I really should be using one at all.
I’m working on a new protocol fingerprinting scheme. Examples of protocol fingerprints include p0f, JA3, and HASSH. Most protocol fingerprinting schemes work by picking protocol features of interest, and then coming up with a fingerprinting format that represents those features. I want to take a somewhat different approach.
Instead of creating a fingerprint format per se, the fingerprint will itself be a valid protocol message. This lets existing protocol parsers double as fingerprint parsers with little to no modification. It works by redacting, and in some cases normalizing, protocol messages.
Consider an HTTP request sent by curl:
```
GET /resource HTTP/1.1
Host: www.example.com
User-Agent: curl/7.68.0
Accept: */*
```
In turning this into a fingerprint, there are decisions to make about which features to preserve. For example, as a protocol fingerprint focused more on identifying the client implementation, we are not interested in the URL path or the contents of the Host header, so we’ll redact those. There are plenty of other considerations like which other header values to preserve or redact, but to keep this example simple let’s just go with redacting those two values by replacing them with a “-”:
```
GET - HTTP/1.1
Host: -
User-Agent: curl/7.68.0
Accept: */*
```
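As a sketch, the redaction step for this HTTP example could look like the following (a hypothetical helper, not my actual implementation; real redaction would need a proper parser and per-header policy):

```python
def redact_http_request(raw: str) -> str:
    """Replace the URL path and the Host header value with '-'."""
    lines = raw.split("\r\n")
    # Request line: keep the method and version, redact the path.
    method, _path, version = lines[0].split(" ", 2)
    out = [f"{method} - {version}"]
    for line in lines[1:]:
        if line.lower().startswith("host:"):
            out.append("Host: -")  # redact the Host value
        else:
            out.append(line)      # preserve everything else as-is
    return "\r\n".join(out)
```

The output is still a syntactically valid HTTP request, which is the whole point: an unmodified HTTP parser can consume the fingerprint.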
That’s the strawman HTTP fingerprint. (I picked HTTP since, as an ASCII protocol, it’s easier to demonstrate with.) The fingerprint is longer than it could be and not easily shareable: it’s hard to paste into a tweet without accidentally copying too much or too little, or mangling the whitespace. So I want to reduce the fingerprint size, for example by building a corpus of fingerprints along these lines and training a zstd dictionary on that corpus. With the dictionary, protocol messages can be redacted and compressed, then the result hex encoded for easy sharing.
I’ve coded most of this approach up, though for a different protocol: TLS. I have what I thought was a decent corpus of TLS pcaps from various TLS clients. However, after going through redaction/normalization I only end up with about 3,600 ClientHello fingerprints and 340 ServerHello fingerprints. When I train with this small corpus size, zstd gives me a warning:
```
! Warning : data size of samples too small for target dictionary size
! Samples should be about 100x larger than target dictionary size
```
This makes me wonder if I’m taking the right approach or what I could do to produce a better dictionary. Part of the longer-term plan is to have the fingerprint be versioned, so that better dictionaries can be produced over time, but I need to come up with a “v0” dictionary to get started.
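For what it’s worth, the versioning I have in mind could be as simple as a one-byte prefix on the encoded fingerprint that selects which dictionary to use. This is just a hypothetical layout, nothing decided:

```python
def encode_versioned(version: int, compressed: bytes) -> str:
    """Prefix the hex fingerprint with a one-byte dictionary version."""
    if not 0 <= version <= 255:
        raise ValueError("version must fit in one byte")
    return bytes([version]).hex() + compressed.hex()

def decode_versioned(fp: str) -> tuple[int, bytes]:
    """Split a versioned fingerprint into (version, compressed bytes)."""
    raw = bytes.fromhex(fp)
    return raw[0], raw[1:]
```

The decoder would then look up the dictionary matching the version byte before decompressing, so old fingerprints stay readable as new dictionaries are published.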
The corpus I have is pretty real-world, so that’s good, but I’m wondering if I should mix in more obscure possibilities. There’s a project called utls (https://github.com/refraction-networking/utls) that gives a great deal of control over the ClientHello, so I'm thinking about using it to produce a wider range of ClientHellos to include in the corpus.
That’s what I’m going to try next. Is it a good or bad idea to add still-valid but more obscure examples to the corpus? Are there other tips or tricks for training on a smaller corpus?
Answer (senhuang42):
Closing since no followup questions.