Henri Sivonen (hsivonen), @mozilla, Helsinki, Finland, https://hsivonen.fi/. Making Firefox focus out-of-process iframes.

hsivonen/encoding_rs 177

A Gecko-oriented implementation of the Encoding Standard in Rust

hsivonen/chardetng 23

A character encoding detector for legacy Web content.

hsivonen/charset 10

Thunderbird-compatible character encoding decoding for email in Rust

hsivonen/detone 4

Decompose Vietnamese tone marks

hsivonen/encoding_c 4

C bindings for encoding_rs

hsivonen/encoding_bench 3

Performance testing framework for encoding_rs

hsivonen/browserphotos 2

Single-page HTML/JavaScript photo & video slideshow viewer

hsivonen/codepage 2

Placeholder for a mapping between Windows code page identifiers and encoding_rs Encodings

hsivonen/encoding_visualization 2

Script for visualizing the holes in CJK indices

hsivonen/encoding_rs_compat 1

rust-encoding API support for encoding_rs

issue comment unicode-org/icu4x

Fill in MIT copyright notices before first release

@sffc, can you figure out what notice should go there on Google's part? Not having a year would ease future administrative burden.

hsivonen

comment created time in 6 days

issue opened unicode-org/icu4x

Fill in MIT copyright notices before first release

In the LICENSE file, we still have "[Copyright notices to be filled in as crates are imported]" that hasn't been filled in. For Mozilla's part, our notice is "Copyright Mozilla Foundation". I'll let Googlers figure out what should go in the copyright notice slot on Google's part.

created time in 6 days

issue comment unicode-org/icu4x

Add CI copyright headers check

I created a PR that updates CONTRIBUTING.md to document the license header language that was discussed.

For CI, this means checking that .rs files start with

// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/master/LICENSE ).

and .toml files start with

# This file is part of ICU4X. For terms of use, please see the file
# called LICENSE at the top level of the ICU4X source tree
# (online at: https://github.com/unicode-org/icu4x/blob/master/LICENSE ).
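
As a sketch of what such a check could look like (hypothetical; not the actual ICU4X CI code, and the file path in main is a placeholder):

// Hypothetical sketch of the header check described above: verify that a
// .rs file starts with the agreed license header comment.
use std::fs;
use std::path::Path;

const RS_HEADER: &str = "\
// This file is part of ICU4X. For terms of use, please see the file
// called LICENSE at the top level of the ICU4X source tree
// (online at: https://github.com/unicode-org/icu4x/blob/master/LICENSE ).";

fn has_license_header(path: &Path) -> std::io::Result<bool> {
    Ok(fs::read_to_string(path)?.starts_with(RS_HEADER))
}

fn main() -> std::io::Result<()> {
    // A real check would walk the whole tree; one hard-coded file keeps
    // the sketch short.
    let ok = has_license_header(Path::new("src/lib.rs"))?;
    std::process::exit(if ok { 0 } else { 1 });
}
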
echeran

comment created time in 6 days

PR opened unicode-org/icu4x

Document license header requirements in CONTRIBUTING.md

Per email and meeting, document the license header comments requirements.

+53 -2

0 comments

1 changed file

pr created time in 6 days

create branch hsivonen/icu4x

branch: contributinglicenseheader

created branch time in 6 days

push event hsivonen/icu4x

  • hagbard, commit sha c97811628bcc1961a5ad06b3196eba6e6491588b: First draft, probably needs markup fixing
  • hagbard, commit sha b1a6448c32f938de9504cbb22f23895007dacd1d: Fixing markup and a few typos.
  • hagbard, commit sha dd43c6dd11e30fb3fbf9dc8dc18b86bb214d22d3: Fixing markup and a few typos.
  • Shane F. Carr, commit sha 3dd5d4341bbed32e53f4c043885bcb4e6de62a4a: Add icu-util project
  • Shane F. Carr, commit sha 7cef64a20028acfb5dcdaffbb3bcc155a4a949f9: Add icu-datap-json project
  • Shane F. Carr, commit sha bb1e1b515aac3accfe412cd42f8e2a892c39adf6: Add an initial data provider trait definition
  • Shane F. Carr, commit sha 52845b0613f8b0f6f76c02695312f32084374861: Improving no_std support; adding serde; adding datap_json
  • Shane F. Carr, commit sha 996b397b24348a72c3fce187f9bb88b3be3d14d0: Unit test for JSON-based data provider
  • Shane F. Carr, commit sha 0147b2439651b2cba8dba7ad0d353b3b756f180c: Use dynamic type for response payload
  • Shane F. Carr, commit sha b495fcbbc0c21166dbbf5b6dad3fbde9b6cb5790: Remove icu_util::std
  • Shane F. Carr, commit sha 7cff876b0edcac354c866dbee86ad713fb598ef4: Remove obsolete code
  • Shane F. Carr, commit sha 9d0298d12ff0f1d792ac31937dc01bb6607228c6: Rename payload2 to payload
  • hagbard, commit sha 748d578a7b16b5a781d75f44e7e7aaab7c378a0c: Some feedback integrated
  • Shane F. Carr, commit sha 5950fdb4b228b2d345313d6ec9727c86ec0bef66: Auto-implement Bovine for all data traits
  • Shane F. Carr, commit sha 457d8ad110a303a9418916fbce0a0dfb4ddc51bc: Rename Bovine to ClonableAny
  • Shane F. Carr, commit sha 1791debd14aec527cd078bd288ba31020f937c56: Additional polishing of Response struct
  • Shane F. Carr, commit sha dbb767d7a3415298c3b6eb22abbae51a022f7d88: Fix spelling of Cloneable
  • Shane F. Carr, commit sha 070cc8bba52f97a35408e5a1f301008396e31ae7: Add no_std tests for JSON Data Provider
  • Shane F. Carr, commit sha 1ed302a2c597edfececcfc514b4de0d848c8f852: Adding take_payload, plus other cleanup
  • Shane F. Carr, commit sha 4cd595a85fe947258f6c9f201c240d2d78d89347: Running cargo fmt

push time in 6 days

issue comment unicode-org/icu4x

BiDi crate desiderata

is_ltr_only() functionality

It's in a rather odd crate thematically, but encoding_rs::mem has functions for this (the ones whose names have bidi as a substring).
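
For illustration, a minimal sketch of calling those encoding_rs::mem functions (the function names are from the crate's documentation; the example strings are mine):

// Minimal sketch of the encoding_rs::mem bidi checks mentioned above.
use encoding_rs::mem::{is_str_bidi, is_utf8_bidi};

fn main() {
    // Purely left-to-right text.
    assert!(!is_str_bidi("Hello, world!"));
    // Hebrew is right-to-left, so the bidi check fires.
    assert!(is_str_bidi("שלום"));
    assert!(is_utf8_bidi("שלום".as_bytes()));
}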

raphlinus

comment created time in 6 days

issue comment privacycg/proposals

Speculative Request Control

I was referring to a temporary performance decrease in the sense that this decrease can eventually be mitigated or even fixed by sites being refactored to adopt best practices regarding async loading.

I understood that, but I don't believe the decrease would consistently turn out to be temporary.

Also, if the speculative fetches are turned off, it's really hard or impossible to recover the performance by other means. The closest option would be listing the resources as rel=preload right after the script responsible for the client-side CSP, but even that would make the fetches start later than they do at present. Of course, if the site has the capability of listing its resources that way from the server, it probably wouldn't need the client-side CSP in the first place.

eligrey

comment created time in 8 days

issue opened web-platform-tests/wpt

Add tests for focusing different-site iframes by click

We should have some tests for the event sequence when focusing a different-site iframe by clicking inside it.

At present, my hypothesis is that the Web doesn't require the JS-observable focus event to fire from the same event loop task as the JS-observable user input event that caused the focusing, e.g. mousedown.

@mfreed7, do you know if Chrome guarantees anything about same or different task here, especially in the case where an out-of-process iframe doesn't have focus and gets focused by click? Any lore on whether there are Web compat constraints on this? (At present, Firefox fires the events from the same task in the in-process case and from different tasks in the out-of-process case.)

created time in 8 days

issue comment whatwg/html

Charset scanner does not match at least Chromium

  • Chromium will ignore cases like
<script>
// It would be terrible if this document contained <meta charset=shift_jis> but fortunately it does not.
</script>

This is news to me. It's disappointing that Chromium has complexity like this compared to the spec. Does WebKit do this as well?

Apparently how Chromium works, at a high level, is that there's a feedback loop between the tokenizer and the byte-to-string decoder. The byte-to-string decoder dynamically switches its encoding as it goes, and to do so, it uses the tokenizer code on the decoded bytes it's seen so far. This lets it make more sophisticated decisions than the spec's purely byte-based prescan.

Does it re-decode and re-tokenize if the encoding changes? If not, how does it handle non-ASCII before the meta?

However folks (such as @hsivonen) might be interested in better interop/spec structure for this.

Indeed. Thanks for disclosing this.

That said, Chromium is a little hacky in this regard, in that it will scan 1024 bytes into the response even if it's past the end of head, with comments indicating that this concession is for legacy compat, and only bail when both 1k have been read and the parser is no longer in the head section.

This is not news to me.

domenic

comment created time in 8 days

issue comment privacycg/proposals

Speculative Request Control

site owners may be okay with a temporary perf decrease to help with privacy compliance needs

This assumes two things, neither of which is necessarily true:

  1. That the decrease is "temporary".
  2. That being OK with a perf decrease is up to site owners as opposed to users and browser vendors.

Considering that users want performance, in terms of adoption, it's bad to suggest a feature that makes the first-mover browser appear slower. Also, it's a sign of badly placed mechanism when a mechanism that is supposed to make the browser load fewer things could end up making the browser slower. (Generally, when the browser itself decides not to load stuff, things get faster.)

eligrey

comment created time in 8 days

pull request comment whatwg/html

Use GBK as fallback, not gb18030

FWIW, the only case I'm aware of where Firefox still guesses the encoding from the UI language is the display of non-ASCII file names in FTP directory listings. When Firefox guesses something GBK/gb18030-ish from the TLD or content, though, it guesses GBK, so I'm in favor of this change.

annevk

comment created time in 9 days

pull request comment rust-lang/rfcs

[RFC]: Portable SIMD Libs Project Group

SLEEF is only for x86_64 and for specialized functionality. It's also disabled by default. It would make sense to start without the operations that call into SLEEF.

KodrAus

comment created time in 9 days

issue comment privacycg/proposals

Speculative Request Control

The doc now says:

The motivating use case for this feature is to increase the ease at which sites could adopt a CSP based on locally-stored consent provided by a third party JS library. In this use case, we can assume that the library vendor and site owner have taken the time explicitly preload resources asynchronously where appropriate, as they must knowingly disable eager speculative requests.

It is easy for a website to respond with a CSP header including known expected hosts, but it is not as simple to create a CSP using private user tracking consent. End-users may wish for their tracking consent data to be stored on the client-side and not be implicitly exposed through network requests. It is possible to create a client-side JavaScript library (e.g. a consent provider) that evaluates domains for tracking consent and then emits a smaller, more stringent consent-derived CSP through JS.

Considering how pervasive third-party consent scripts are, I think it's completely unrealistic to assume that every site that includes them has "taken the time explicitly preload resources asynchronously where appropriate". Also, considering how pervasive such scripts are, this feature could have a negative performance impact on significantly many sites.

eligrey

comment created time in 10 days

issue comment whatwg/sg

Proposed for browser testing spec

There are no specific plans for additional APIs at this time

While I'm not committing to implementing the Gecko side of such an API myself, I'd like to have the ability to write WPTs that generate mouse clicks at CSS-pixel coordinates relative to the top-left corner of the content area and WPTs that generate key events (particularly tab key and shift-tab). Specifically, I care much more about the ability to synthesize clicks at a location than about the ability to synthesize specific mouse movements.

jgraham

comment created time in 12 days

issue comment privacycg/proposals

Parser Speculation Control

@hsivonen what you describe is then not relevant for speculative parsing, but for speculative fetches that happen from the normal HTML parser, between the time the tree builder processes a token and the time the parser decides to insert the element to the DOM. Yes?

Yes.

So, it's possible for a script to insert a CSP meta in between?

Yes, e.g. from setTimeout if the DOM insertion batch is so long relative to CPU speed that the event loop gets to spin.

In general, trying to use script to undo fetches that, absent the script, would be caused by what's in the HTML source from the network results in a bad time, and I think we should not change the Web Platform to facilitate such attempts.

eligrey

comment created time in 13 days

issue comment privacycg/proposals

Parser Speculation Control

Notably, the DOM insertions look at the wall clock to decide whether to let the event loop spin before a script is seen. That in particular would allow non-speculative prefetches to end up happening a longer time before the corresponding DOM insertions.

eligrey

comment created time in 13 days

issue comment privacycg/proposals

Parser Speculation Control

Also, I was under the impression that speculative parsing only starts if there's a non-defer non-async non-module script src element. Is this not the case?

That applies for document.write-inserted content in Gecko. However, for content arriving from the network, the prefetches start before the corresponding DOM insertions even when the parsing is not speculative. The time difference between those fetches starting and the corresponding DOM insertions happening is just very short.

eligrey

comment created time in 13 days

issue comment privacycg/proposals

Parser Speculation Control

It seems to me that this hinges a lot on whether one considers CSP via runtime meta as a misfeature or not. To me, it seems questionable not to provide CSP via HTTP headers.

async/defer etc. are rather beside the point. Markup that's visible to the parser can be dealt with on a per-resource basis without having to turn the whole preloader off.

It's possible that there exist super-competent Web devs who can do better than the browser. However, it's all too likely that turning the preloader off becomes a cargo cult applied by less competent developers who end up sabotaging the overall perf.

eligrey

comment created time in 13 days

create branch hsivonen/packed_simd

branch: rust_1_48

created branch time in 14 days

Pull request review comment rust-lang/rfcs

[RFC]: Portable SIMD Libs Project Group

+- Feature Name: stdsimd_project_group
+- Start Date: 2020-08-28
+- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
+- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)
+
+# Summary
+[summary]: #summary
+
+This is a project group RFC version of [`lang-team#29`].
+
+This RFC establishes a new project group, under the libs team, to produce a portable SIMD API in a new `rust-lang/stdsimd` repository, exposed through a new `std::simd` (and `core::simd`) module in the standard library in the same manner as [`stdarch`]. The output of this project group will be the finalization of [RFC 2948] and stabilization of `std::simd`.
+
+# Motivation
+[motivation]: #motivation
+
+The current stable `core::arch` module is described by [RFC 2325], which considers a portable API desirable but out-of-scope. The current [RFC 2948] provides a good motivation for this API. Various ecosystem implementations of portable SIMD have appeared over the years, including [`packed_simd`], and [`wide`], each taking a different set of trade-offs in implementation while retaining some similarities in their public API. The group will pull together a "blessed" implementation in the standard library with the explicit goal of stabilization for the [2021 edition].
+
+# Charter
+[charter]: #charter
+
+## Goals
+
+- Determine the shape of the portable SIMD API.
+- Get an unstable `std::simd` and `core::simd` API in the standard library. This may mean renaming `packed_simd` to `stdsimd` and working directly on it, or creating a new repository and pulling in chunks of code as needed.

I think run-time selection is scope creep. It also applies to other things already in the standard library: count_ones in particular. I think it should be developed separately and in a way that covers count_ones, too.
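
For context, here is a minimal sketch of what run-time selection looks like today with std's is_x86_feature_detected! macro; the kernels are placeholders of my own, not code from any of the crates discussed:

// Run-time feature selection with the existing std macro; placeholder kernels.
fn sum(data: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // Safety: the AVX2 kernel is only called when AVX2 is present.
            return unsafe { sum_avx2(data) };
        }
    }
    sum_scalar(data)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(data: &[f32]) -> f32 {
    // A real kernel would use core::arch intrinsics; delegating to the
    // scalar version keeps this sketch short.
    sum_scalar(data)
}

fn sum_scalar(data: &[f32]) -> f32 {
    data.iter().sum()
}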

KodrAus

comment created time in 14 days


issue comment whatwg/html

Specify speculative HTML parsing (preload scanner)

  * SVG `script` with `xlink:href` is speculatively fetched, but SVG `script` with `href` is not, even though `href` is [normally supported](http://software.hixie.ch/utilities/js/live-dom-viewer/saved/8413).

Filed. Thanks! Also: The name of the image test suggests that it's testing src rather than href. The spec says href. Apparently the tests haven't landed yet, so I failed to locate the test source to check.

zcorpan

comment created time in 14 days

Pull request review comment rust-lang/rfcs

[RFC]: Portable SIMD Libs Project Group

+- Feature Name: stdsimd_project_group
+- Start Date: 2020-08-28
+- RFC PR: [rust-lang/rfcs#0000](https://github.com/rust-lang/rfcs/pull/0000)
+- Rust Issue: [rust-lang/rust#0000](https://github.com/rust-lang/rust/issues/0000)
+
+# Summary
+[summary]: #summary
+
+This is a project group RFC version of [`lang-team#29`].
+
+This RFC establishes a new project group, under the libs team, to produce a portable SIMD API in a new `rust-lang/stdsimd` repository, exposed through a new `std::simd` (and `core::simd`) module in the standard library in the same manner as [`stdarch`]. The output of this project group will be the finalization of [RFC 2948] and stabilization of `std::simd`.
+
+# Motivation
+[motivation]: #motivation
+
+The current stable `core::arch` module is described by [RFC 2325], which considers a portable API desirable but out-of-scope. The current [RFC 2948] provides a good motivation for this API. Various ecosystem implementations of portable SIMD have appeared over the years, including [`packed_simd`], and [`wide`], each taking a different set of trade-offs in implementation while retaining some similarities in their public API. The group will pull together a "blessed" implementation in the standard library with the explicit goal of stabilization for the [2021 edition].
+
+# Charter
+[charter]: #charter
+
+## Goals
+
+- Determine the shape of the portable SIMD API.
+- Get an unstable `std::simd` and `core::simd` API in the standard library. This may mean renaming `packed_simd` to `stdsimd` and working directly on it, or creating a new repository and pulling in chunks of code as needed.

Shouldn't we evaluate options other than packed_simd and stdsimd too or try experimenting more? Or are these already good enough?

I think more experimentation carries a significant risk of not reaching the goal of getting to stable.

Notably, simd already existed before packed_simd, the initial design of packed_simd omitted something that simd already had (boolean/mask vectors) and ended up adding them back, so packed_simd ended up being like a more comprehensive version of simd with the internals rewritten.

That is, portable SIMD in Rust has already gone through the cycle of a change in the person working on it resulting in a rewrite that, while making things more comprehensive, on the high level ended up close to the old thing. While each such cycle might have improved the design marginally, starting over means the distance to getting to stable gets reset to zero.

KodrAus

comment created time in 15 days


issue comment mozilla-mobile/fenix

[Bug] [New search] autocomplete does not take history into account

Notably, the suggestions in the location field itself are sites that I've never visited. They look like a prepopulated list of top sites.

miDeb

comment created time in 15 days

pull request comment rust-lang/rfcs

[RFC]: Portable SIMD Libs Project Group

Establishes a new project group under libs for the production and stabilization of a core::simd and std::simd module.

Thank you for moving this forward! LGTM.

Sorry about the slow response this month.

Based on the work of @hsivonen.

@gnzlbg should get the credit for packed_simd.

KodrAus

comment created time in 16 days


pull request comment whatwg/encoding

Meta: cleanup visualize.py

Does it still work with Apple-supplied Python afterwards?

annevk

comment created time in 16 days

issue comment whatwg/encoding

EUC-JP encoding is currently ambiguous

Step 2 of the ISO-2022-JP encoder seems to be contrary to the behavior of Chrome and Firefox and soon-to-be WebKit, but that's probably a different issue.

I don't see how it is contrary to the Firefox behavior. That step appears to say that if the encoder is already in the ASCII state, there's nothing more to do when the stream ends.

achristensen07

comment created time in 22 days

issue comment whatwg/encoding

EUC-JP encoding is currently ambiguous

This is ambiguous because there are several code points where there are several pointers to the same code point, such as 0xFA16, which has two. I've observed that Chrome and Firefox always choose the larger of the two.

Can you show clearer steps to reproduce for Firefox choosing the larger index for EUC-JP? The larger index values can't even be encoded in the EUC-JP code space.

If I create an EUC-JP document with a form, enter 猪¬ into a form field and submit the form, I see %FB%A3%A2%CC in the query string, as the spec requires. With Shift_JIS, I get %FB%5E%81%CA, which is also per spec.

(The definition of Shift_JIS pointer excludes the lower copy of IBM kanji from the search. That is, the logic isn't the highest index but the exclusion of a certain range as the behavior of ¬ shows.)

AFAICT, these definitions of "index pointer" and "index Shift_JIS pointer" are not ambiguous.

The test document I used was data:text/html;charset=EUC-JP,<form action=https://example.com><input name=v><input type=submit></form> (It appears that GitHub won't linkify it.)
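
For reference, the same observations expressed as a check with encoding_rs (the expected bytes are copied from the percent-encodings above):

// Checks the form-submission bytes described above using encoding_rs.
use encoding_rs::{EUC_JP, SHIFT_JIS};

fn main() {
    let (bytes, _, _) = EUC_JP.encode("猪¬");
    assert_eq!(bytes.as_ref(), b"\xFB\xA3\xA2\xCC");
    let (bytes, _, _) = SHIFT_JIS.encode("猪¬");
    assert_eq!(bytes.as_ref(), b"\xFB\x5E\x81\xCA");
}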

achristensen07

comment created time in 22 days

push event hsivonen/encoding_rs

  • Ralf Jung, commit sha b5a04264bfbbb8ef6f5f6eb14b175bf4d209c2c7: CI: run tests in Miri

push time in 24 days

PR merged hsivonen/encoding_rs

CI: run tests in Miri
+21 -0

2 comments

2 changed files

RalfJung

pr closed time in 24 days

pull request comment hsivonen/encoding_rs

CI: run tests in Miri

Thank you!

(Travis status reporting to GH seems to be broken for this repo; I also see no status for the commits on master.)

I see https://travis-ci.org/github/hsivonen/encoding_rs/builds/721288307

RalfJung

comment created time in 24 days

pull request comment hsivonen/encoding_rs

adjust tests to Miri

Do you also want CI set up to run Miri (probably without Stacked Borrows, I doubt we have 7GB of RAM on Travis)?

That would be nice, yes.

RalfJung

comment created time in a month

pull request comment hsivonen/encoding_rs

adjust tests to Miri

Thank you!

RalfJung

comment created time in a month

push event hsivonen/encoding_rs

  • Ralf Jung, commit sha 76c10b756da957bc4585f43cb9bd62673ea6c7b5: adjust tests to Miri
  • Ralf Jung, commit sha 9eb726fd6dd4185b6228f1dae98225c18b6f1775: adjust code generation to make tests bail out early in Miri

push time in a month

PR merged hsivonen/encoding_rs

adjust tests to Miri

This patch makes cargo miri test complete in reasonable time for me (around 10min). However this takes around 7GB of RAM. By adding -Zmiri-disable-stacked-borrows, the RAM usage goes down to <1GB and the time to around 5min.

If you want, I can also hook this up to CI.

+131 -60

0 comments

10 changed files

RalfJung

pr closed time in a month

Pull request review comment hsivonen/encoding_rs

adjust tests to Miri

 use super::*;
 pub fn decode(encoding: &'static Encoding, bytes: &[u8], expect: &str) {
     let mut vec = Vec::with_capacity(bytes.len() + 32);
     let mut string = String::with_capacity(expect.len() + 32);
-    for i in 0usize..32usize {
+    let range = if cfg!(miri) { 0usize..4usize } else { 0usize..32usize };

Let's keep it as 4, then. By ALU alignment I meant the alignment of a usize and trying all byte alignments smaller than that.

RalfJung

comment created time in a month


Pull request review comment validator/htmlparser

Conform encoding-label matching to Encoding spec

             Charset cs = entry.getValue();
             String name = toNameKey(cs.name());
             String canonName = toAsciiLowerCase(cs.name());
-            if (!isBanned(name)) {
+            if (!isBanned(stripDashAndUnderscore(name))) {

If the set of supported encodings changed to be the set of encodings from the Encoding Standard, there wouldn't be a need to maintain an isBanned blocklist at all, since there would only be an allowlist. In theory, that would be a breaking change in case there's someone parsing non-Encoding Standard HTML somewhere. However, it would at least be OK to make that change for 2.0. It's a bit unclear to me where this patch is presently going in terms of compliance if isBanned stays.

sideshowbarker

comment created time in a month


Pull request review comment hsivonen/encoding_rs

adjust tests to Miri

 mod tests {
     fn test_single_byte_decode() {
         decode_single_byte(IBM866, &data::SINGLE_BYTE_DATA.ibm866);
         decode_single_byte(ISO_8859_10, &data::SINGLE_BYTE_DATA.iso_8859_10);
+        if cfg!(miri) { return; } // Miri is too slow

The loop that generates this is at https://github.com/hsivonen/encoding_rs/blob/adcb9b300428e9a06a5074c1cb1f23550c93ee01/generate-encoding-data.py#L1449

RalfJung

comment created time in a month


Pull request review comment hsivonen/encoding_rs

adjust tests to Miri

 use super::*;
 pub fn decode(encoding: &'static Encoding, bytes: &[u8], expect: &str) {
     let mut vec = Vec::with_capacity(bytes.len() + 32);
     let mut string = String::with_capacity(expect.len() + 32);
-    for i in 0usize..32usize {
+    let range = if cfg!(miri) { 0usize..4usize } else { 0usize..32usize };

Would this be too slow if it went to 8usize? That would end up testing all the ALU alignment possibilities.

RalfJung

comment created time in a month


created tag hsivonen/encoding_rs

tag v0.8.24

A Gecko-oriented implementation of the Encoding Standard in Rust

created time in a month

pull request comment hsivonen/encoding_rs

fix OOB arithmetic in ASCII encoding

Thank you! I'll audit the other case before pushing another release.

RalfJung

comment created time in a month

push event hsivonen/encoding_rs

  • Ralf Jung, commit sha 322ae2e0e5372f030fc585da453d74e5ed8901d3: fix OOB arithmetic in ASCII encoding
  • Ralf Jung, commit sha 1676109b081db5a9be8df2e990bd2cfeb69e7205: more wrapping arithmetic in ascii.rs

push time in a month

PR merged hsivonen/encoding_rs

fix OOB arithmetic in ASCII encoding

Just a few lines down from what https://github.com/hsivonen/encoding_rs/pull/53 fixed, the same issue exists again. Here's the Miri error:

error: Undefined Behavior: inbounds test failed: pointer must be in-bounds at offset 12, but is outside bounds of alloc2066040 which has size 2
    --> /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/library/core/src/ptr/const_ptr.rs:225:18
     |
225  |         unsafe { intrinsics::offset(self, count) }
     |                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ inbounds test failed: pointer must be in-bounds at offset 12, but is outside bounds of alloc2066040 which has size 2
     |
     = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
     = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
             
     = note: inside `std::ptr::const_ptr::<impl *const u16>::offset` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/library/core/src/ptr/const_ptr.rs:225:18
     = note: inside `std::ptr::const_ptr::<impl *const u16>::add` at /home/r/.rustup/toolchains/miri/lib/rustlib/src/rust/library/core/src/ptr/const_ptr.rs:499:18
note: inside `ascii::basic_latin_to_ascii` at src/ascii.rs:223:29
    --> src/ascii.rs:223:29
     |
223  |                         if (src.add(dst_until_alignment) as usize) & ALU_ALIGNMENT_MASK != 0 {
     |                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
1476 |         basic_latin_alu!(basic_latin_to_ascii, u16, u8, basic_latin_to_ascii_stride_alu);
     |         --------------------------------------------------------------------------------- in this macro invocation
note: inside `handles::Utf16Source::copy_ascii_to_check_space_two` at src/handles.rs:1244:17
    --> src/handles.rs:1244:17
     |
1244 |                 basic_latin_to_ascii(src_remaining.as_ptr(), dst_remaining.as_mut_ptr(), length)
     |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `big5::Big5Encoder::encode_from_utf16_raw` at src/macros.rs:1079:19
    --> src/macros.rs:1079:19
     |
1079 |               match $source.$copy_ascii(&mut dest) {
     |                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     | 
    ::: src/big5.rs:195:5
     |
195  | /     ascii_compatible_encoder_functions!(
196  | |         {
197  | |             // For simplicity, unified ideographs
198  | |             // in the pointer range 11206...11212 are handled
...    |
259  | |         false
260  | |     );
     | |______- in this macro invocation
note: inside `variant::VariantEncoder::encode_from_utf16_raw` at src/variant.rs:311:48
    --> src/variant.rs:311:48
     |
311  |             VariantEncoder::Big5(ref mut v) => v.encode_from_utf16_raw(src, dst, last),
     |                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `Encoder::encode_from_utf16_without_replacement` at src/lib.rs:4732:9
    --> src/lib.rs:4732:9
     |
4732 |         self.variant.encode_from_utf16_raw(src, dst, last)
     |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `Encoder::encode_from_utf16` at src/lib.rs:4661:43
    --> src/lib.rs:4661:43
     |
4661 |               let (result, read, written) = self.encode_from_utf16_without_replacement(
     |  ___________________________________________^
4662 | |                 &src[total_read..],
4663 | |                 &mut dst[total_written..effective_dst_len],
4664 | |                 last,
4665 | |             );
     | |_____________^
note: inside `testing::encode_from_utf16` at src/testing.rs:219:40
    --> src/testing.rs:219:40
     |
219  |     let (complete, read, written, _) = encoder.encode_from_utf16(string, &mut dest, true);
     |                                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `testing::encode_without_padding` at src/testing.rs:65:5
    --> src/testing.rs:65:5
     |
65   |     encode_from_utf16(encoding, &utf16_from_utf8(string)[..], expect);
     |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `testing::encode` at src/testing.rs:59:9
    --> src/testing.rs:59:9
     |
59   |         encode_without_padding(encoding, &string[..], &vec[..]);
     |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `big5::tests::encode_big5` at src/big5.rs:276:9
    --> src/big5.rs:276:9
     |
276  |         encode(BIG5, string, expect);
     |         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
note: inside `big5::tests::test_big5_encode` at src/big5.rs:363:9
    --> src/big5.rs:363:9
     |
363  |         encode_big5("", b"");
     |         ^^^^^^^^^^^^^^^^^^^^
note: inside closure at src/big5.rs:361:5
    --> src/big5.rs:361:5
     |
361  | /     fn test_big5_encode() {
362  | |         // Empty
363  | |         encode_big5("", b"");
364  | |
...    |
386  | |         encode_big5("\u{2550}", b"\xF9\xF9");
387  | |     }
     | |_____^
+3 -3

1 comment

1 changed file

RalfJung

pr closed time in a month

push event hsivonen/encoding_rs

  • Henri Sivonen, commit sha 1ef7d5896fa79d8c0b3ef23810d06f5e29a7c653: Link to oem_cp from CONTRIBUTING.md.

push time in a month

issue comment hsivonen/encoding_rs

Feature Request: Support IBM OEM code pages (e.g. CP437)

Nice! I added links to the documentation of this crate.

tats-u

comment created time in a month

push event hsivonen/encoding_rs

  • Henri Sivonen, commit sha ab1a2d213616155dc2323a280c4e2b7896723526: Link to the oem_cp crate.

push time in a month

issue comment hsivonen/encoding_rs

Potential Unsound: 1 out-of-bound read and 5 unaligned memory access.

@RalfJung @YoshikiTakashima Thank you. The OOB case indeed materialized an OOB pointer, though it didn't dereference it. Fixed in #53. I'm leaving this issue open to remind me to check the SIMD code for the same pattern.

YoshikiTakashima

comment created time in a month

push event hsivonen/encoding_rs

  • Henri Sivonen, commit sha 5146c23f2c2434883a77d6c67483c62b2dd79ff7: Increment version number to 0.8.24.

push time in a month

push event hsivonen/encoding_rs

  • Henri Sivonen, commit sha fecae29389d579dbaee1e7b6528781622a5cb8bb: Document version 0.8.24 in the README.

push time in a month

push event hsivonen/encoding_rs

  • Yoshiki Takashima, commit sha 08a1a7ea9f8f7ba7d0639ca838f82b6d60a294a8: Fixed UB flagged by Miri.

push time in a month

PR merged hsivonen/encoding_rs

Fixed UB in big5::tests::test_big5_decode flagged by Miri.

Fixes the out-of-bound UB in #52.

Looks like replacing dst.add with dst.wrapping_add is sufficient to stop UB from occurring, while passing all existing test cases.

Run cargo miri test -- -Zmiri-disable-alignment-check -- big5::tests::test_big5_decode to confirm that the UB is fixed.

+1 -1

1 comment

1 changed file

YoshikiTakashima

pr closed time in a month

pull request comment hsivonen/encoding_rs

Fixed UB in big5::tests::test_big5_decode flagged by Miri.

Thank you for looking into this.

Indeed, add here may compute a pointer that's outside the allocation bounds even if that pointer is never dereferenced, and wrapping_add is the right fix. The name of the wrapping_add method is kinda weird for a pointer, but the documentation checks out.
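
As a standalone illustration (my own minimal example, not the encoding_rs code):

// Computing an out-of-bounds pointer with `add` is UB even without a
// dereference; `wrapping_add` makes the same address computation defined.
fn main() {
    let buf = [0u8; 2];
    let p = buf.as_ptr();
    // UB, and what Miri flags: `add` requires the result to stay within
    // the allocation or one past its end.
    // let oob = unsafe { p.add(12) };
    // Defined: `wrapping_add` places no in-bounds requirement on the
    // result, as long as the pointer isn't dereferenced.
    let oob = p.wrapping_add(12);
    println!("{:p}", oob);
}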

YoshikiTakashima

comment created time in a month

Pull request review comment validator/htmlparser

Conform encoding-label matching to Encoding spec

 public static String toNameKey(String str) {
             if (c >= 'A' && c <= 'Z') {
                 c += 0x20;
             }
-            if (!((c >= '\t' && c <= '\r') || (c >= '\u0020' && c <= '\u002F')
-                    || (c >= '\u003A' && c <= '\u0040')
-                    || (c >= '\u005B' && c <= '\u0060') || (c >= '\u007B' && c <= '\u007E'))) {
+            if (!Arrays.asList('\t','\n','\f','\r','\u0020').contains(c)) {

The performance characteristics of Arrays.asList are a bit unclear to me, but it seems bad to call it again on each loop iteration. I think (c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r') would probably be the best thing to do here.

sideshowbarker

comment created time in a month


pull request comment rust-lang/rfcs

Portable packed SIMD vector types

I'm trying to understand what the real blockers to moving forward with this RFC are.

AFAICT, the main blocker is that this lacks a champion within the libs team. (I don't have the bandwidth to become that person at this time.)

How relevant is the question of whether to build on LLVM intrinsics vs std::arch for offering a portable SIMD API in std?

As noted earlier, I believe that in terms of the amount of work and who is willing to do the work, the only realistic path forward is to use the LLVM (or, in the future, other back end) facilities as packed_simd currently does. Previous decisions indicate that exposing those LLVM facilities directly is a no-go, as it would tie Rust to LLVM. packed_simd is just enough abstraction to allow for other back ends in the future, which is why I think packed_simd itself in the form of core::simd (yes, this would belong in core rather than std) is the right abstraction.
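
As a concrete illustration of the level of abstraction in question, here is a minimal sketch using packed_simd's existing API (the example values are mine):

// Lane-wise portable SIMD with packed_simd; the compiler back end decides
// which instructions this lowers to on a given target.
use packed_simd::f32x4;

fn main() {
    let a = f32x4::new(1.0, 2.0, 3.0, 4.0);
    let b = f32x4::splat(10.0);
    let c = a * b + f32x4::splat(0.5); // (10.5, 20.5, 30.5, 40.5)
    assert_eq!(c.sum(), 102.0); // horizontal reduction across the lanes
}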

While building on std::arch is theoretically possible, I believe no one is going to volunteer to reimplement things that LLVM already provides for the whole breadth of packed_simd. (A demo of one or two features isn't enough. It's already known that it's theoretically possible. The issue is doing all the work.)

hsivonen

comment created time in a month

issue comment sparklemotion/nokogiri

RFC: Explore alternatives to libxml2 for HTML parsing

Back in 2010 my plan was to generate a non-Gecko C++ version of the Validator.nu HTML Parser. That ended up not happening due to tasks that were of higher priority to Mozilla.

Instead, @rubys added a way to use the Validator.nu HTML Parser from Ruby via gcj. gcj itself is now gone. Also, I believe that @rubys migrated his app to use Gumbo.

One alternative to adding a non-Gecko target for the C++ translation of the Validator.nu HTML Parser, and to using Gumbo after updating it to spec, would be to create an FFI wrapper for html5ever.

flavorjones

comment created time in a month

issue comment hsivonen/encoding_rs

Potential Unsound: 1 out-of-bound read and 5 unaligned memory access.

Thank you!

My results so far:

  • I can't reproduce the out_of_bound case locally in Miri.
  • I have stepped through cases 118, 140, and 596 in x86 debug executions, and the computations were OK.

Based on reading the error messages and based on my previous discussion with @RalfJung, I believe these to be instances of the Miri issue mentioned in the error messages.

Because it predates align_to in the standard library, encoding_rs does its own alignment math. As I understand it, the Miri errors arise from pointers carrying alignment metadata in the Miri execution, as opposed to Miri actually checking whether the pointer math has been done correctly when casting from a pointer with a smaller alignment to a pointer with a larger alignment.

Also, as I understand it, migrating to align_to would result in the interesting code paths getting tested but would result in Miri taking the least interesting code path.
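
For illustration, here is a minimal sketch (mine, not the actual encoding_rs internals) of the align_to approach:

// align_to splits a byte slice into an unaligned prefix, a usize-aligned
// middle, and a suffix; the middle can then be scanned word-at-a-time.
fn is_ascii_wordwise(bytes: &[u8]) -> bool {
    // Sound here because reinterpreting u8s as usize involves no invalid
    // bit patterns.
    let (prefix, middle, suffix) = unsafe { bytes.align_to::<usize>() };
    // 0x80 in every byte of the word (truncates correctly on 32-bit).
    const HIGH_BITS: usize = 0x8080_8080_8080_8080_u64 as usize;
    prefix.iter().all(|&b| b < 0x80)
        && middle.iter().all(|&w| w & HIGH_BITS == 0)
        && suffix.iter().all(|&b| b < 0x80)
}

fn main() {
    assert!(is_ascii_wordwise(b"Hello, world! This is all ASCII."));
    assert!(!is_ascii_wordwise("Hyvää päivää".as_bytes()));
}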

Does this analysis look correct to you?

YoshikiTakashima

comment created time in a month

issue comment whatwg/html

HTML: Cap nesting depth

would you be willing to submit a pull request?

Unfortunately, I expect not to have time for this in the near future. However, if someone wants to volunteer, I still recommend looking at the Java patch.

DemiMarie

comment created time in a month

issue comment validator/htmlparser

Decide on the way forward

Fair enough, so I think we can agree on the following:

* 3 Java modules: `nu.validator.htmlparser`, `nu.validator.htmlparser.xom`, `nu.validator.saxtree`

* no classes are moved (i.e. forget about my proposal to move `Interner` et al.) & all packages are exported to maintain 100% backward compatibility

* organised in a single Git repo with a Maven multi-module project

OK.

And we have one open question: what to do with jchardet? I'm OK with any proposal (mine being: move it into this repo as an additional nu.validator.jchardet module), as long as we don't have to rely on the automatic module name.

There's also another question whose answer might inform the answer to that question: What to do with the ICU4J normalization checking dependency. https://github.com/unicode-org/icu/blob/master/icu4j/manifest.stub says Automatic-Module-Name: com.ibm.icu, so I assume nu.validator.htmlparser would have requires static com.ibm.icu. Correct?

In any case, I think jchardet doesn't belong in this repo. I'd still like to avoid having to repackage it if possible.

@carlosame, can you corroborate the incompatibility of the existing jchardet with jlink?

anthonyvdotbe

comment created time in a month

issue comment validator/htmlparser

Decide on the way forward

classes which are not of interest to users don't show up in the Javadoc, IDE autocompletion suggestions, etc.

The clutter in Javadoc doesn't look too bad. Not sure about how much the internals show up in IDE autocomplete in practice for people who don't work on the internals.

maintainers have maximum flexibility w.r.t. backward compatibility

AFAICT, in the 13-year lifespan of this project, the one API-breaking change (as opposed to parser behavior correctness change) is the removal of support for the HTML4 mode, which would not have become non-breaking had the change you propose been made ahead of time.

nu.validator.htmlparser.xom.HtmlBuilder has these nu.validator imports:

import nu.validator.htmlparser.common.CharacterHandler;
import nu.validator.htmlparser.common.DocumentModeHandler;
import nu.validator.htmlparser.common.Heuristics;
import nu.validator.htmlparser.common.TokenHandler;
import nu.validator.htmlparser.common.TransitionHandler;
import nu.validator.htmlparser.common.XmlViolationPolicy;
import nu.validator.htmlparser.impl.ErrorReportingTokenizer;
import nu.validator.htmlparser.impl.Tokenizer;
import nu.validator.htmlparser.io.Driver;

nu.validator.htmlparser.xom.XOMTreeBuilder has these nu.validator imports:

import nu.validator.htmlparser.common.DocumentMode;
import nu.validator.htmlparser.impl.CoalescingTreeBuilder;
import nu.validator.htmlparser.impl.HtmlAttributes;

It is the design intent of the parser that a third party is allowed to write the kind of wrapper that nu.validator.htmlparser.xom is. Hiding nu.validator.htmlparser.impl seems contrary to that goal. (Whether impl is a good name is too late to debate: If we're not changing fully-qualified names, we're stuck with that name.)

Maybe I'm misunderstanding your question, but I doubt it's possible to decouple any piece of the parser from java.xml, since e.g. SAXException is used in the public API of pretty much every package.

Good point. I had forgotten about that.

Additional observation: Enabling normalization checking depends on ICU4J.

@carlosame, @sideshowbarker, do you see value in the common/base split that @anthonyvdotbe suggests?

anthonyvdotbe

comment created time in a month

Pull request review comment whatwg/sg

Update IPR policy to introduce BSD 3-Clause license

 Documents other than [Living Standards][Living Standard] and [Review Drafts][rev
 [contributor]: ./IPR%20Policy.md#contributor
 [cw-agreement]: https://participate.whatwg.org/agreement
 [review-draft]: ./Workstream%20Policy.md#review-draft
+[BSD 3-Clause License]: ./BSD%203-Clause%20License.md

@othermaciej Indeed, the outbound language at https://github.com/whatwg/sg/pull/115/files#diff-d485eedd5b4f85f1d8d32af46e9e4dfbR279 is clear. I withdraw my previous comment in this thread. Sorry about the noise.

foolip

comment created time in 2 months

issue comment validator/htmlparser

Decide on the way forward

Actually, I would like to move some internal classes in nu.validator.htmlparser.common (e.g. the Interner interface) to a non-exported package as part of the modularization in 2.0.

What problem would this solve?

So in theory this would require source changes, but in practice I doubt anyone would (and believe no one should) be relying on any of those internal classes.

TransitionHandler there was added for NetBeans. Not sure what its usage is today.

nu.validator.htmlparser.impl is deliberately public so that you can write TreeBuilder subclasses that aren't provided by this project. (Also, spinning off XOM into a different module would require TreeBuilder to be subclassable from that module.) Things under common are visible in the signatures of impl.

What makes you say it's disappointing?

Mainly having to replace one jar with many when upgrading, and the resulting proliferation of jars. Maybe that's not a real problem.

Yes. Either jchardet support should be removed, or a new version with a module descriptor should be published. An alternative is to just "fork" it (e.g. create a new repo under https://github.com/validator, or just add it as a separate module in this repo).

It's disappointing if the existing jar can't be used as-is. Considering that I'd like to get rid of jchardet eventually, I'm really not keen on de facto becoming responsible for jchardet's distribution beyond its usage by htmlparser. Realistically, though, the replacement I want is vaporware without a concrete path into becoming real software, so in that sense "eventually" can be expected to be far off anyway.

(That is, porting encoding_rs and chardetng to Java is not going to happen as part of my job. Writing a JNI wrapper for them seems like an educational hobby project of realistic scope, but a JNI dependency wouldn't be viewed favorably by the Java community for real-world deployment. Doing a translation that would result in either .java files or Java bytecode seems too large for the hobby-project time I have.)

Is there really no backward compatibility mechanism that would allow the existing jchardet to be used as-is? @carlosame's most recent comment suggests that there is.

Finally, I'd like to reiterate that, in my opinion, Q1 in my OP is the essential question here.

I'm leaning towards three Java Modules:

nu.validator.htmlparser, nu.validator.saxtree, and nu.validator.htmlparser.xom.

However, this is based on a pre-Modules view of the world where DOM and SAX are assumed to be always present, so perhaps it would make sense to treat sax and dom the same way as xom, i.e.:

  • nu.validator.htmlparser.sax depends on nu.validator.htmlparser, nu.validator.saxtree, java.xml, and java.base.
  • nu.validator.htmlparser.dom depends on nu.validator.htmlparser, java.xml, and java.base.
  • nu.validator.htmlparser.xom depends on nu.validator.htmlparser, nu.xom, and java.base.

Considering that nu.xom depends on SAX, it's unclear to me what problem this would solve other than allowing custom tree builders that use none of SAX, DOM, or XOM not to depend on java.xml.

Since you've used Modules and I haven't, do you see any benefit from decoupling the core of the parser from java.xml Module-wise?

Regardless of the module division, I'd prefer all these to stay in this git repo. (As noted, we used to have more repos for the validator as a whole and moved towards having fewer.) What that means for Maven, I don't know.

anthonyvdotbe

comment created time in 2 months

issue comment whatwg/encoding

Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input

@annevk, @ricea Thanks. I've requested submission of the document to the Unicode Document Register.

@EarlgreyPicard Bug 1652388

hsivonen

comment created time in 2 months

issue comment validator/htmlparser

Decide on the way forward

First of all, I acknowledge the annoyance of having a main branch owner who isn't up to speed with Java Modules (it's taken me way more calendar time than I hoped to get some idea of Java Modules) and who is also as slow to respond as I have been. Sorry.

Currently, technically the compile-time-hard and run-time-optional dependency on both encoding detectors and on XOM are handled the same way. I see an issue open about XOM, but XOM seems to have more degrees of freedom for the solution, since if you use XOM, you already choose a XOM-specific API entry point anyway. However, the encoding detectors are something you can optionally enable regardless of entry point, so in that sense solving those should solve XOM, too, unless it's a problem for XOM types to appear in the outward API.

Considering the way Java worked up to and including 8, it would have been a backward compatibility bug to change the fully-qualified name of a class that remains otherwise compatible. That is, if you wrote an app with htmlparser.jar in the Java 5 days (and didn't explicitly call the HTML4-enablement methods that will be removed for 2.0), I think it's prima facie a bug if you have to make source changes to your app when you drop in the Java Modules-enabled future nu-validator-htmlparser.jar.

So far, I've understood that there are now restrictions on what packages nu-validator-htmlparser.jar is allowed to provide, and that it's not allowed to provide nu.validator.saxtree. Correct? OK, minting another .jar for that, while disappointing, isn't that bad, since callers don't need to change any fully-qualified names.

https://github.com/validator/htmlparser/issues/25 says that while requires static would work, it would defeat the purpose of modularization for the public API of the nu.validator.htmlparser module to expose types from nu.xom if nu.xom is optional.

While I can guess that "formally wrong" is bad, is there some articulation of how the badness would manifest in practice?

OTOH, https://github.com/carlosame/htmlparser/tree/xom-removed/src/nu/validator/htmlparser and https://github.com/carlosame/htmlparser-xom/tree/master/src/nu/validator/htmlparser/xom suggest that the module system does allow a module called nu.validator.htmlparser to ship packages called nu.validator.htmlparser.foo while a module called nu.validator.htmlparser.xom ships a sibling package named nu.validator.htmlparser.xom. Since that still doesn't change any fully-qualified names, I guess that isn't any worse than having to ship nu-validator-saxtree.jar as a separate jar.

Considering that the validator project has over the span of its existence gone from me spreading stuff over multiple repos to @sideshowbarker merging some of the repos, I'm a bit uneasy about additional repos and am leaning towards keeping the XOM stuff in this repo if permitted by the rules of Modules and Maven.

Regarding a Maven dependency, I'd hope that after all changes, it would still be possible to point Eclipse to the source directories and the dependency jars and have stuff build without Maven.

As for the JDK target, I believe the source code is still Java 5-compatible. Not using new language constructs matters for the Java-to-C++ translator, but actually running the code on a Java 5 JVM is unlikely to be a use case worth supporting. Running on OpenJDK 8 and recent-ish Android does seem relevant, though. Do I understand correctly that modularized jars can be released with the bytecode compatibility level set to 8, and then a pre-Modules user just dumps all the jars in the classpath, the older JVM ignores the module manifests, and that's it?

Going back to the encoding detectors: Ideally, the parser would depend on a Java port of chardetng, but one does not exist. The ICU detector isn't very good. In the absence of a Java port of chardetng, jchardet can make some sense. Do I understand correctly that the blocker for depending on it even via requires static is that the project itself doesn't publish with a module manifest?

I've never used the RPM and OSGi stuff myself. Someone contributed them at some point. I have no idea if someone still cares. Probably prudent to leave that stuff in for 1.5 if it's easy to do so, drop that stuff for 2.0 and see if anyone complains.

anthonyvdotbe

comment created time in 2 months

issue comment whatwg/encoding

Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input

Oops. That edit was internally inconsistent. Tried again.

hsivonen

comment created time in 2 months

issue comment whatwg/encoding

Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input

Based on off-issue comments, I revised the draft to conclude to go for "End state 1" right away instead of leaving it as non-committal.

hsivonen

comment created time in 2 months

issue comment whatwg/encoding

Concatenating two ISO-2022-JP outputs from a conforming encoder doesn't result in conforming input

No replies in one and a half years. @annevk, does the document linked to from the previous sentence look OK to you for submission to the Unicode Technical Committee?

(This issue was reported as a Thunderbird bug again.)

hsivonen

comment created time in 2 months

Pull request review comment whatwg/sg

Update IPR policy to introduce BSD 3-Clause license

 Documents other than [Living Standards][Living Standard] and [Review Drafts][rev
 [contributor]: ./IPR%20Policy.md#contributor
 [cw-agreement]: https://participate.whatwg.org/agreement
 [review-draft]: ./Workstream%20Policy.md#review-draft
+[BSD 3-Clause License]: ./BSD%203-Clause%20License.md

Saying "and" and then having a following sentence that turns the AND into an OR seems more stressful for readers than working the first sentence to use the word "or" between the licenses. That is, "Foo or Bar, at your option" or "Foo and Bar, at the recipient's option" seem much clearer.

That is, I suggest formulating this such that the English use of "and" and "or" matches the SPDX meaning of "AND" and "OR".

foolip

comment created time in 2 months

issue closed hsivonen/encoding_rs

Feature Request: Support IBM OEM code pages (e.g. CP437)

IBM OEM code pages live on today, persistently in zip archive file names, for example. The OEM code pages of (South)East Asian languages are the same as the ANSI code pages, which are included in this library now, but those of other languages, including European languages, are not.

Code pages list (OEM code pages are included): https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers?source=docs
Characters list (CP437; replace 437 for other code pages): https://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/PC/CP437.TXT

closed time in 2 months

tats-u

issue comment hsivonen/encoding_rs

Feature Request: Support IBM OEM code pages (e.g. CP437)

This crate is explicitly scoped to the Encoding Standard.

However, another crate could support more encodings such that operations on the encodings that are in the Encoding Standard are delegated to this crate and the extra encodings are implemented by the wrapper crate itself.
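
A minimal sketch of what such a wrapper could look like (the enum, function names, and CP437 stub here are hypothetical, not any real crate's API):

// Hypothetical wrapper: Encoding Standard encodings delegate to
// encoding_rs; extra encodings are handled by the wrapper itself.
use std::borrow::Cow;

enum AnyEncoding {
    Standard(&'static encoding_rs::Encoding),
    Cp437, // implemented by the wrapper crate itself
}

fn decode<'a>(enc: &AnyEncoding, bytes: &'a [u8]) -> Cow<'a, str> {
    match enc {
        AnyEncoding::Standard(e) => e.decode(bytes).0,
        AnyEncoding::Cp437 => decode_cp437(bytes),
    }
}

fn decode_cp437(bytes: &[u8]) -> Cow<'_, str> {
    // Stub: a real implementation would map each byte through the CP437 table.
    Cow::Owned(
        bytes
            .iter()
            .map(|&b| if b < 0x80 { b as char } else { '\u{FFFD}' })
            .collect(),
    )
}

fn main() {
    let enc = AnyEncoding::Standard(encoding_rs::UTF_8);
    assert_eq!(decode(&enc, b"abc"), "abc");
}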

tats-u

comment created time in 2 months

pull request comment validator/htmlparser

Conform ambiguous-ampersand reporting to HTML spec

This happens when the tokenizer is fed one UTF-16 code unit at a time. That's why the Java harness doesn't catch this.

When there is a buffer boundary between &gt0 and YY, the ambiguous ampersand state drops the current input character (digit zero) on the floor such that it doesn't get restored in the variable c when tokenization of the next buffer continues.

sideshowbarker

comment created time in 2 months

issue comment rust-lang/lang-team

Portable SIMD project group

Also, I don't see this as a particularly "embedded" thing: E.g. my use cases for this involve desktop and mobile. (Apple Silicon is going to make this even more relevant on desktop than it already is.)

hsivonen

comment created time in 2 months

issue comment rust-lang/lang-team

Portable SIMD project group

“C Parity, interop, and embedded”

I think "C parity" mischaracterizes this: core::arch is C parity. This is about being better.

Controversial, ties us to LLVM

I think exposing the intrinsics that would allow packed_simd to be written in the crate ecosystem would tie us to LLVM. The crucial point of putting this in std::simd is not getting tied to LLVM. My understanding is that conceptually GCC already has similar capabilities and to the extent Cranelift will eventually support WASM SIMD, Cranelift must develop similar capability at least for 128-bit vectors.

hsivonen

comment created time in 2 months

pull request comment rust-lang/rfcs

Portable packed SIMD vector types

Can you please explain why portable SIMD has to be tied to LLVM intrinsics and not implemented in terms of arch intrinsics? I understand that it will be easier to start with them, but IIUC there are no fundamental blockers for the latter.

This is covered in the FAQ. While there is no theoretical obstacle, there's a huge practical obstacle: Rewriting a large chunk of code that LLVM has (and, I believe, GCC has) and Cranelift either has or is going to have makes no sense as a matter of realistic allocation of development effort, and it's hard to motivate anyone to do the work given that the work could be avoided by delegating to an existing compiler back end.

More bluntly: Those who suggest re-implementing the functionality on top of core::arch haven't yet actually volunteered to do it and shown existence proof of the result. It's far easier to say that the implementation should be done that way than to actually do it. OTOH, the implementation that delegates to an existing compiler back end actually exists.

hsivonen

comment created time in 2 months

pull request comment validator/htmlparser

Conform ambiguous-ampersand reporting to HTML spec

Hmm. entities02 works in the Java-only case. I'll try to figure out what's going on.

sideshowbarker

comment created time in 2 months

PR opened html5lib/html5lib-tests

Test zero byte after hyphen in comment
+9 -0

0 comments

1 changed file

pr created time in 2 months

create branch html5lib/html5lib-tests

branch: hyphenzero

created branch time in 2 months

pull request commentvalidator/htmlparser

Conform ambiguous-ampersand reporting to HTML spec

Still fails the entities02 test from html5lib.

sideshowbarker

comment created time in 2 months

pull request commentvalidator/htmlparser

Conform ambiguous-ampersand reporting to HTML spec

Thanks! I triggered a new try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=216de62190b7d034640d584bfcf40c5a738b7da3

sideshowbarker

comment created time in 2 months

push eventvalidator/htmlparser

Matt Brundage

commit sha 3f48926a96a4a658f319c1680cff417255677145

Fix typo in error message about over-deep tree

view details

push time in 2 months

issue commentmozilla/standards-positions

Constant bitrate audio encoding with MediaRecorder

I note that neither the explainer nor the other linked documents explain a use case. The person who requested this said they "need" it, and later comments took it as a given that CBR for audio is common.

However, it was not explained what use cases are enabled by CBR that now fail due to MediaRecorder producing VBR.

simoncent

comment created time in 2 months

MemberEvent

delete branch validator/htmlparser

delete branch : gecko-sync

delete time in 2 months

push eventvalidator/htmlparser

Michael[tm] Smith

commit sha 24659e4c2c615fb86636fe551b2a3da81d24625e

Fix grammar problem in HTML parser error message

view details

Michael[tm] Smith

commit sha 4f9f58317c347dc84435a1fef9cc4ae49c108126

Fix "non-space characters insided a table" typo

view details

Michael[tm] Smith

commit sha d18cec8aa2e15da35b5ae30405662675c7d967af

Report 1024 as byte limit for meta charset sniff

Fixes https://github.com/validator/validator/issues/232

view details

Michael[tm] Smith

commit sha 2db740cd4e7ef5165b12d4e33a522095606aa004

Report error always for Transitional doctype

Fixes https://github.com/validator/validator/issues/408
Thanks @still-dreaming-1

view details

Simon Pieters

commit sha 4acb71a514f8249cafeb75b73f6a6b054fc84f4f

Remove warning about comments before doctype

This only applied to IE6..9, and does not apply to conditional comments.

view details

Michael[tm] Smith

commit sha ac9db58c2b156914462276e7642e5cd95e876569

Stop reporting HTML4-specific parse errors

view details

Henri Sivonen

commit sha 719ceb2808e0b15e2f8386c0253bc1b050215aa6

Fixup for NOCPP directives for HTML4 support removal.

view details

Michael[tm] Smith

commit sha 7f2a1448f792cc31889f155ccc01ba290ecbd80c

Emit error (not warning) for HTML4/XHTML1 doctype

Since https://github.com/validator/htmlparser/commit/2594493 HTML4/XHTML1 doctypes have caused the warning `Obsolete doctype. Expected <!DOCTYPE html>`. This change makes that message an error rather than just a warning — bringing the parser into conformance with https://github.com/whatwg/html/commit/31c20af (https://github.com/whatwg/html/pull/2049).

view details

Michael[tm] Smith

commit sha 89296cd12696541f6ff53ce7f9aeeb1528be4b61

Drop parse error for missing </caption> end tag

See https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-incaption:stack-of-open-elements-2
The parse error for this case was dropped from the spec in https://github.com/whatwg/html/commit/759ee62
Relates to https://github.com/validator/validator/issues/593 and https://bugzilla.mozilla.org/show_bug.cgi?id=1429415

view details

Michael[tm] Smith

commit sha d2627e3a422109c0d374482a0e510cb6d1025047

Require UTF-8

See https://html.spec.whatwg.org/multipage/semantics.html#charset:utf-8 and https://github.com/whatwg/html/commit/fae77e3

view details

Michael[tm] Smith

commit sha 8ff29a4358dfac9628a5c5e703bb588676f98347

Add suppress "unchecked" for a TreeBuilder method

In `nu.validator.htmlparser.impl.TreeBuilder`, this change adds `@SuppressWarnings("unchecked")` to the `getUnusedStackNode()` method.

view details

Simon Pieters

commit sha 2f6f559a5fe0fffb40a12ae21dda4ac6c56c773f

Improve message: bad start tag in noscript in head

view details

Mike Bremford

commit sha 2f61c94cebcee834f7fbc30f6efde281b41ee9ba

Ensure every Locator is also a Locator2 (#10)

* Replace org.xml.sax.helpers.LocatorImpl with nu.validator.htmlparser.impl.LocatorImpl
* instanceof before cast to Locator2

view details

push time in 2 months

PR merged validator/htmlparser

Import patches from the validator-nu branch that have landed in m-c as of 2020-08-03

These are the changesets picked from the validator-nu branch whose corresponding Gecko patches are on Gecko autoland at this time. I suggest merging this when autoland merges to mozilla-central. (It's possible that there's a fancier way to do this with git, and I'm just bad at git.)

+282 -765

4 comments

19 changed files

hsivonen

pr closed time in 2 months

pull request commentvalidator/htmlparser

Conform ambiguous-ampersand reporting to HTML spec

New try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=439fd78375fe4d83a5999e2e0c9d30075da4efb3

sideshowbarker

comment created time in 2 months

pull request commentvalidator/htmlparser

Import patches from the validator-nu branch that have landed in m-c as of 2020-08-03

> And it turns out there's something in here that the translator does not like. I'll try to figure out what.

Now resolved. I had forgotten the patch I wrote myself a month ago.

hsivonen

comment created time in 2 months

push eventvalidator/htmlparser

Henri Sivonen

commit sha ca91185e106313c4855de0886f9cd9f747a7fb82

Fixup for NOCPP directives for HTML4 support removal.

view details

Michael[tm] Smith

commit sha a2acf2b8a28a0c76fa1c93dfcd965f14ead18b81

Emit error (not warning) for HTML4/XHTML1 doctype

Since https://github.com/validator/htmlparser/commit/2594493 HTML4/XHTML1 doctypes have caused the warning `Obsolete doctype. Expected <!DOCTYPE html>`. This change makes that message an error rather than just a warning — bringing the parser into conformance with https://github.com/whatwg/html/commit/31c20af (https://github.com/whatwg/html/pull/2049).

view details

Michael[tm] Smith

commit sha 498e43e4b8fdbb203561c2fd81f15476fba1aef8

Drop parse error for missing </caption> end tag

See https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-incaption:stack-of-open-elements-2
The parse error for this case was dropped from the spec in https://github.com/whatwg/html/commit/759ee62
Relates to https://github.com/validator/validator/issues/593 and https://bugzilla.mozilla.org/show_bug.cgi?id=1429415

view details

Michael[tm] Smith

commit sha 271831c38c746e16b5ab1aff3c9cffc5a8b9c574

Require UTF-8

See https://html.spec.whatwg.org/multipage/semantics.html#charset:utf-8 and https://github.com/whatwg/html/commit/fae77e3

view details

Michael[tm] Smith

commit sha 6ad48999a2db2fd0b5f51aeefb3c1cde1a495734

Add suppress "unchecked" for a TreeBuilder method

In `nu.validator.htmlparser.impl.TreeBuilder`, this change adds `@SuppressWarnings("unchecked")` to the `getUnusedStackNode()` method.

view details

Simon Pieters

commit sha 8a3d088fdbbeb35859ee5088bd1b6be04fcdbe2f

Improve message: bad start tag in noscript in head

view details

Mike Bremford

commit sha d1e941d59bdc17674c60d414ce58df8de8c7bde2

Ensure every Locator is also a Locator2 (#10)

* Replace org.xml.sax.helpers.LocatorImpl with nu.validator.htmlparser.impl.LocatorImpl
* instanceof before cast to Locator2

view details

push time in 2 months
