
marcusklaas/Achtung--die-Kurve- 11

Remake of the classic Achtung die Kurve using HTML5 websockets

marcusklaas/lisp-interpreter 7

A bare bones lisp parser & interpreter

JordyMoos/elm-yapm-client 5

Yet another password manager client in Elm

marcusklaas/backbonzo 3

secure backup automater in rust

marcusklaas/grunt-inline-alt 1

Brings externally referenced resources into a single file.

marcusklaas/logical-verification-project 1

proving some basic theorems in modal logic

marcusklaas/modal-logic-cheat-sheet 1

things one should know going into the exam

started not-fl3/miniquad

started time in 2 hours

issue comment raphlinus/pulldown-cmark

Text events in code block start with newlines

There are no Rust compiler shenanigans going on with the newlines. The language has no concept of line endings in strings. It's all just codepoints.

I beg to differ: eol_conversion

This is the reason why I initially couldn't reproduce the problem in a minimal example, because rustc gobbles up the \r part of the newlines. The repro I posted in the comment above works though, because I give the input in raw bytes.

It's actually pulldown doing this. We are normalizing line endings ourselves.

Yes, pulldown-cmark does the thing where it emits more text events when \r is in the markdown string.

BenjaminRi

comment created time in 2 hours

created repository kriskowal/bbbb

Basic Binary Board Book

created time in 4 hours

started captainbrosset/inactive-css

started time in 9 hours

issue comment raphlinus/pulldown-cmark

Text events in code block start with newlines

Okay, so I played around on master (e97974b8d76195c953f0d427e8725ef9ad1a0c17) and this is highly interesting.

The issue is hard to reproduce because Rust does some string magic. Check this out:

fn main() {
    let markdown_input: &str = "```
test

test
```";

    println!("{:?}", markdown_input.as_bytes());
}

Output:

[96, 96, 96, 10, 116, 101, 115, 116, 10, 10, 116, 101, 115, 116, 10, 96, 96, 96]

This little program will always print a byte sequence with \n, rather than \r\n, even if the original source file has \r\n line endings. In other words, the Rust compiler automatically strips \r from your string even on Windows when you use such line endings.

However, you can trick Rust into using Windows line endings by manually giving it the bytes and converting those into a string. Here is a repro of the problem:

use pulldown_cmark::{html, Event, Options, Parser};

fn main() {
    let binary_input = &[
        0x60, 0x60, 0x60, 0x0d, 0x0a, 0x74, 0x65, 0x73, 0x74, 0x0d, 0x0a, 0x0d, 0x0a, 0x74, 0x65,
        0x73, 0x74, 0x0d, 0x0a, 0x60, 0x60, 0x60,
    ];
    let markdown_input = std::str::from_utf8(binary_input).unwrap();
    let parser = Parser::new_ext(markdown_input, Options::empty()).map(|event| match event {
        Event::Text(text) => {
            println!("Text: {:?}", &text);
            Event::Text(text)
        }
        _ => event,
    });

    let mut html_output = String::new();
    html::push_html(&mut html_output, parser);
}

Output:

Text: Borrowed("test")
Text: Borrowed("\n")
Text: Borrowed("\ntest")
Text: Borrowed("\n")

I stumbled into this problem because I'm using an SQLite database that contains such Windows line endings.
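One caller-side workaround, sketched here on the assumption that an extra allocation is acceptable, is to normalize \r\n to \n before the text reaches the parser. The helper name normalize_newlines is made up for this illustration and is not part of pulldown-cmark.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not part of pulldown-cmark): turns every "\r\n"
/// into "\n". Input that contains no '\r' at all is returned borrowed,
/// so the common case does not allocate.
fn normalize_newlines(input: &str) -> Cow<'_, str> {
    if input.contains('\r') {
        Cow::Owned(input.replace("\r\n", "\n"))
    } else {
        Cow::Borrowed(input)
    }
}
```

The normalized string would then be what gets passed to Parser::new_ext in place of the raw database value.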

BenjaminRi

comment created time in a day

issue comment raphlinus/pulldown-cmark

Text events in code block start with newlines

You are spot on. I am on Windows 10 and it has to do with \r\n newlines.

Binary input with \n (hex-bytes):

60 60 60 0a 74 65 73 74 0a 0a 74 65 73 74 0a 60 60 60

Text event from Parser:

Borrowed("test\n\ntest\n")

However,

Binary input with \r\n (hex-bytes):

60 60 60 0d 0a 74 65 73 74 0d 0a 0d 0a 74 65 73 74 0d 0a 60 60 60

Text events from Parser:

Borrowed("test")
Borrowed("\n")
Borrowed("\ntest")
Borrowed("\n")

Also, I've been thinking... Are there any guarantees provided by pulldown-cmark regarding how text is emitted? Could I technically get an event for every single character? In other words, does my code have to handle arbitrarily split text when I interface with the Parser? I think that's an important part of the interface. I looked over the documentation but couldn't find anything; it just says it's "a text node".
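Without a documented guarantee, one defensive option is to not depend on the chunking at all and merge adjacent text events before using them. The sketch below uses a simplified stand-in Event enum instead of pulldown-cmark's real Event type so that it is self-contained; merge_text is a made-up name.

```rust
/// Simplified stand-in for pulldown_cmark::Event (an assumption made so
/// that this sketch compiles on its own).
#[derive(Debug, PartialEq)]
enum Event {
    Start,
    Text(String),
    End,
}

/// Concatenates every run of consecutive Text events into a single one.
/// All other events pass through unchanged and in order.
fn merge_text(events: impl IntoIterator<Item = Event>) -> Vec<Event> {
    let mut out: Vec<Event> = Vec::new();
    for ev in events {
        if let Event::Text(next) = &ev {
            if let Some(Event::Text(prev)) = out.last_mut() {
                // The previous event was text too: extend it in place.
                prev.push_str(next);
                continue;
            }
        }
        out.push(ev);
    }
    out
}
```

Fed the four text events from the \r\n case above, this yields the same single "test\n\ntest\n" text that the \n input produces, at the cost of copying.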

BenjaminRi

comment created time in a day

started countvajhula/rigpa

started time in a day

fork hidde/aframe-boilerplate

[DISCONTINUED] Hello, WebVR starter kit for A-Frame.

http://glitch.com/~aframe

fork in 2 days


issue comment raphlinus/pulldown-cmark

Line breaks in code blocks

@BenjaminRi given that the CommonMark spec is defined in terms of HTML, compliance with that is still possible by pushing the new-line in the HTML serializer (for Event::End(Tag::CodeBlock(_)), only when the prior event was Event::Text and not empty), weakening your argument.

Further to this point, line-breaks in HTML source don't necessarily correspond to line-breaks in the output: it appears to make no difference in rendered HTML whether the last line of content is followed by a line-break inside <code>...</code>.

Adding the above rule to the HTML serializer would be annoying complexity I know, but handling line-breaks for Unicode output is also annoyingly complex. (In fact, even the semantics of line-breaks in Unicode are complex, as @raphlinus well knows.)

Summary: pulldown-cmark's output is not HTML, so the CommonMark spec does not apply (directly). An additional layer of specification is needed, and there is still some room for manoeuvre in what that spec says.
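The serializer rule proposed above could look roughly like the following. This is only a sketch under the assumption that the parser stops emitting the trailing newline itself. The Event enum is a simplified stand-in for pulldown-cmark's types, and render_code_block and escape_html are hypothetical names.

```rust
/// Simplified stand-in for the parser's events (an assumption for this sketch).
enum Event<'a> {
    StartCodeBlock,
    Text(&'a str),
    EndCodeBlock,
}

/// Minimal HTML escaping for code block content.
fn escape_html(text: &str) -> String {
    text.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Renders code block events to HTML. The newline before the closing tags
/// is pushed by the serializer, and only when the event just before the end
/// of the block was a non-empty text event.
fn render_code_block(events: &[Event<'_>]) -> String {
    let mut html = String::new();
    let mut prev_was_nonempty_text = false;
    for event in events {
        match event {
            Event::StartCodeBlock => {
                html.push_str("<pre><code>");
                prev_was_nonempty_text = false;
            }
            Event::Text(text) => {
                html.push_str(&escape_html(text));
                prev_was_nonempty_text = !text.is_empty();
            }
            Event::EndCodeBlock => {
                if prev_was_nonempty_text {
                    html.push('\n');
                }
                html.push_str("</code></pre>");
                prev_was_nonempty_text = false;
            }
        }
    }
    html
}
```

With this rule the serialized HTML for a non-empty block still ends in a newline before the closing tags, and an empty block still serializes with no newline, while the text events themselves carry no trailing newline.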

dhardy

comment created time in 2 days

started eth-p/bat-extras

started time in 3 days

started wbthomason/packer.nvim

started time in 3 days

issue opened raphlinus/pulldown-cmark

Text events in code block start with newlines

As can be seen in issue #457, if you parse the following code block

test

test

you get

Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test"))
Text(Borrowed("\n"))
Text(Borrowed("\ntest"))
Text(Borrowed("\n"))
End(CodeBlock(Fenced(Borrowed(""))))

The strange behaviour here is that the lines start with \n, but don't end with \n. I think this is highly unusual and makes the strings harder to work with than necessary. Desired behaviour would be:

Start(CodeBlock(Fenced(Borrowed(""))))
Text(Borrowed("test\n"))
Text(Borrowed("\n"))
Text(Borrowed("test\n"))
End(CodeBlock(Fenced(Borrowed(""))))

This behaviour would make much more sense (it is also more natural because it reflects the actual lines seen in the code block) and provides enhanced compatibility to other libraries like syntect which usually parse code on a line-by-line basis, where a line is defined as a string terminated by \n.

I am currently running into issues with this and the only remedy seems to be string slicing and copying, which costs performance.
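For comparison, when the code block text is available as one contiguous string, newline-terminated lines can be sliced out of it without copying. A minimal sketch, assuming Rust 1.51 or later for str::split_inclusive; lines_with_endings is a made-up name:

```rust
/// Splits text into lines that keep their trailing '\n'. Each returned
/// slice borrows from the input, so nothing is copied.
fn lines_with_endings(text: &str) -> Vec<&str> {
    text.split_inclusive('\n').collect()
}
```

Applied to "test\n\ntest\n" this gives exactly the three pieces listed as the desired behaviour. The difficulty described in this issue is that the parser's events do not arrive in that shape to begin with.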

created time in 3 days

issue comment raphlinus/pulldown-cmark

Line breaks in code blocks

I think the spurious newline will cause more harm than good for library users. People who want the extra newline could easily emit an extra one in Event::End(Tag::CodeBlock(_)).

Unfortunately, the CommonMark spec is clear about emitting this line break.

All the examples, such as example 89, display a spurious newline at the end of the CodeBlock:

```
<
 >
```

transforms to:

<pre><code>&lt;
 &gt;
</code></pre>

The only exception is the empty code block (example 100):

```
```

which transforms to:

<pre><code></code></pre>

Interestingly, this means that it's impossible to make a non-empty code block containing a one-liner without newline.

Removing the newline would violate the spec. So, while the behaviour is essentially broken, it is broken by design, and unfortunately the spec must be adhered to if we want any compatibility in the ecosystem.

dhardy

comment created time in 3 days

fork Darksonn/tokio-core

I/O primitives and event loop for async I/O in Rust

fork in 3 days

started matijs/probable-succotash

started time in 3 days

started marcusklaas/advent20

started time in 3 days

pull request comment raphlinus/pulldown-cmark

Use `std::fmt::Write` instead of custom `StrWrite` trait

From that topic:

simulacrum: It is perhaps worth noting that io::Write::write_fmt does preserve the io Error in the raw form (https://doc.rust-lang.org/nightly/src/std/io/mod.rs.html#1498-1513)
simulacrum: so if you're using write! or similar with io sources, and not the fmt trait directly, you'll get the error

From my understanding, our use case falls under the former category, so is this still considered a blocker?
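To make the quoted point concrete: an adapter can implement std::fmt::Write on top of an std::io::Write sink and stash the underlying io::Error, since fmt::Error itself carries no detail. The following is only a sketch of that pattern; IoAdapter and write_display are made-up names, and this is not code from the pull request.

```rust
use std::{fmt, io};

/// Hypothetical adapter: lets an io::Write sink be used through fmt::Write
/// while remembering the io::Error that caused a failure.
struct IoAdapter<W: io::Write> {
    inner: W,
    error: Option<io::Error>,
}

impl<W: io::Write> IoAdapter<W> {
    fn new(inner: W) -> Self {
        IoAdapter { inner, error: None }
    }
}

impl<W: io::Write> fmt::Write for IoAdapter<W> {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        match self.inner.write_all(s.as_bytes()) {
            Ok(()) => Ok(()),
            Err(e) => {
                // fmt::Error has no payload, so keep the real error here.
                self.error = Some(e);
                Err(fmt::Error)
            }
        }
    }
}

/// Formats `value` into `out` and reports failures as io::Error.
fn write_display<W: io::Write>(out: W, value: &dyn fmt::Display) -> io::Result<()> {
    let mut adapter = IoAdapter::new(out);
    match fmt::Write::write_fmt(&mut adapter, format_args!("{}", value)) {
        Ok(()) => Ok(()),
        Err(_) => Err(adapter
            .error
            .take()
            .unwrap_or_else(|| io::Error::new(io::ErrorKind::Other, "formatter error"))),
    }
}
```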

camelid

comment created time in 4 days

fork MerlijnWajer/tesseract

Tesseract Open Source OCR Engine (main repository)

https://tesseract-ocr.github.io/

fork in 4 days

Pull request review comment raphlinus/pulldown-cmark

Miscellaneous code quality improvements

+use criterion::{criterion_group, criterion_main, Criterion};
+use pulldown_cmark::{html, Parser};
+use std::fs::{read_dir, read_to_string};
+
+pub fn spec_samples(c: &mut Criterion) {
+    let folder = read_dir("./benches/spec_samples").unwrap();
+    for entry in folder {
+        let entry = entry.unwrap();
+
+        if entry.metadata().unwrap().is_file() {
+            let filename = &entry.file_name().into_string().unwrap();
+            let corpus = read_to_string(entry.path()).unwrap();
+            let mut result = String::with_capacity(corpus.len() * 3 / 2);
+
+            c.bench_function(filename, |b| {

From my understanding, the throughput benchmarks are usually done with multiple inputs, e.g. how it scales with more input, so I don't think it works here since we'd only have one data point (the size of the input).

I've uploaded an example report of the current code though (extract and view report.html) -- perhaps it's sufficient?

edward-shen

comment created time in 4 days

Pull request review comment raphlinus/pulldown-cmark

Miscellaneous code quality improvements

+CommonMark

The newly added samples are now correctly attributed to markdown-it, and a license has been added to reflect this.

edward-shen

comment created time in 4 days

created repository mtn/advent20

Advent of Code 2020

created time in 4 days

fork Heliosmaster/clj-bugsnag

Fully fledged Bugsnag client for Clojure. Supports ex-data and ring middleware.

https://clojars.org/clj-bugsnag

fork in 4 days

started OisinMoran/quinetweet

started time in 5 days

started lunatic-lang/lunatic

started time in 5 days

PR opened raphlinus/pulldown-cmark

Miscellaneous code quality improvements

This set of patches incorporates multiple smaller improvements whose overall theme is "improved code quality". Each commit should be atomic, so I encourage looking at each commit individually, rather than the patch as a whole.

Changelog:

  • Implement FusedIterator marker trait for Parser. This is a non-breaking optimization that allows .fuse() methods on the iterator on the parser to be a no-op.
  • extern crate syntax is entirely optional for Rust 2018 and implicitly discouraged, so I've removed all uses of that syntax, including in examples.
  • Looks like #269 suggested migrating to criterion and we already use it, so I've just converted the lib.rs benchmark to use criterion now.
  • On that note, I've also added a corpus to bench against. Specifically, I've added the JS reference benchmarks. I didn't find the other linked items to be as meaningful as the JS benchmarks, but I'm open to adding more. I've written it so that adding more is very easy.
  • Also on that note, I've refactored the lib.rs to use more of criterion's tools. We should now be able to see how pathological codeblocks behave over longer inputs. That being said, I didn't do a perfect migration as benching 1000 times seems excessive, and I don't have full context on the benchmark, so I'd request extra scrutiny here.
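As a generic illustration of the first changelog item (using a toy iterator, not the Parser itself): FusedIterator is a marker trait with no methods, implemented by iterators that keep returning None once they are exhausted. Countdown is a made-up type for this sketch.

```rust
use std::iter::FusedIterator;

/// Toy iterator that counts down from its starting value to 1.
struct Countdown(u32);

impl Iterator for Countdown {
    type Item = u32;

    fn next(&mut self) -> Option<u32> {
        if self.0 == 0 {
            // Exhausted: every further call returns None as well.
            None
        } else {
            let current = self.0;
            self.0 -= 1;
            Some(current)
        }
    }
}

// Marker impl: promises that next() never yields Some again after None.
impl FusedIterator for Countdown {}
```

Because the trait only encodes a promise, implementing it changes no observable behaviour; it merely lets .fuse() skip its own bookkeeping.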
+800 -62

0 comment

38 changed files

pr created time in 5 days

started laike9m/Cyberbrain

started time in 7 days

started rtosholdings/riptable

started time in 8 days

started loadimpact/k6

started time in 8 days

started nextjournal/codemirror.next-clojure

started time in 8 days
