Andrew Gallant (BurntSushi) | @salesforce | Marlborough, MA | https://burntsushi.net | "I love to code."

BurntSushi/byteorder 571

Rust library for reading/writing numbers in big-endian and little-endian.

BurntSushi/aho-corasick 376

A fast implementation of Aho-Corasick in Rust.

BurntSushi/chan 360

Multi-producer, multi-consumer concurrent channel for Rust.

BurntSushi/bstr 337

A string type for Rust that is not required to be valid UTF-8.

BurntSushi/advent-of-code 312

Rust solutions to AoC 2018

BurntSushi/cargo-benchcmp 226

A small utility to compare Rust micro-benchmarks.

BurntSushi/chan-signal 124

Respond to OS signals with channels.

BurntSushi/clibs 88

A smattering of miscellaneous C libraries. Includes sane argument parsing, a thread-safe multi-producer/multi-consumer queue, and implementations of common data structures (hashmaps, vectors, and linked lists).

BurntSushi/critcmp 79

A command line tool for comparing benchmarks run by Criterion.

BurntSushi/blog 28

My blog.

issue comment BurntSushi/aho-corasick

Duplicate match when using find_overlapping_iter() with ascii_case_insensitive

Looks like a bug to me, yes. There have been a lot of problems with the ASCII case insensitive functionality that result in strange outcomes, so it wouldn't surprise me. Not sure when I'll have a chance to look at this though. Thanks for the small reproduction!

jwass

comment created time in 4 hours

issue closed BurntSushi/xsv

Feature: Add option to specify number of header rows

I have some csv files where the headers are 2 rows. It would be really nice if I could specify somewhere that the first 2 rows are headers and not just the first.

closed time in 5 hours

belst

issue comment BurntSushi/xsv

Feature: Add option to specify number of header rows

Sorry, but I don't think xsv should support this. The amount of work required to do this isn't worth it IMO.

belst

comment created time in 5 hours

issue comment BurntSushi/ripgrep

request to benchmark Kazahana against ripgrep

I don't think it's specific to nix tools. I've never seen substring iteration do anything other than non-overlapping search outside of specialized circumstances. Overlapping search is weird unless that's specifically what you want.

Sanmayce

comment created time in 2 days

issue comment BurntSushi/ripgrep

request to benchmark Kazahana against ripgrep

grep is correct. From a Python interpreter:

>>> string = '9' * 286
>>> string.count('999999')
47

You're probably counting overlapping matches, which makes sense, since the length of the string is 286, the length of the needle is 6 and (286 - 6) + 1 = 281.

You're welcome to define your semantics this way, but most substring iterators look for non-overlapping matches.
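
For illustration, here's a small Rust sketch of the two counting semantics (assuming, as above, a 286-character run of '9's; the helper functions are made up for this example):

fn count_non_overlapping(haystack: &str, needle: &str) -> usize {
    // Equivalent to Python's str.count: each match consumes its bytes.
    haystack.matches(needle).count()
}

fn count_overlapping(haystack: &str, needle: &str) -> usize {
    // Slide one byte at a time and test every starting position.
    (0..=haystack.len().saturating_sub(needle.len()))
        .filter(|&i| haystack[i..].starts_with(needle))
        .count()
}

fn main() {
    let haystack = "9".repeat(286);
    assert_eq!(count_non_overlapping(&haystack, "999999"), 47);
    assert_eq!(count_overlapping(&haystack, "999999"), 281); // (286 - 6) + 1
}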

Sanmayce

comment created time in 2 days

issue comment BurntSushi/ripgrep

request to benchmark Kazahana against ripgrep

Try running GNU grep as well. A count of 885 looks correct. ripgrep, GNU grep and ack all agree:

$ rg 999999 pi-billion.txt -o | wc -l
885
$ grep 999999 pi-billion.txt -o | wc -l
885
$ ack 999999 pi-billion.txt -o | wc -l
885

The -o/--only-matching option prints each match on its own line. Since the entire input is just one huge 1GB line (another very poor case for line oriented search tools like grep), you would otherwise just get a count of 1. wc -l counts the number of lines, and thus, the total number of matches.

And no, I wouldn't expect -a to make a difference here because pi-billion.txt contains no NUL bytes:

$ rg -a --no-unicode '\x00' pi-billion.txt
$
Sanmayce

comment created time in 2 days

push event BurntSushi/ripgrep

Brandon Adams

commit sha ba3f9673ad2dcf42a837f3eb00c4927a6e66b071

ignore/types: generalize bazel type a bit Bazel supports `BUILD.bazel` as well as `WORKSPACE.bazel`. In addition, it is common to ship BUILD/WORKSPACE templates for external repositories suffixed with .bazel for easier tool recognition. Co-authored-by: Brandon Adams <brandon.adams@imc.com> PR #1716

view details

push time in 3 days

PR merged BurntSushi/ripgrep

Generalize bazel type to include *.bazel

Bazel supports BUILD.bazel as well as WORKSPACE.bazel. In addition, it is common to ship BUILD/WORKSPACE templates for external repositories suffixed with .bazel for easier tool recognition.

+1 -1

0 comments

1 changed file

emidln

pr closed time in 3 days

PullRequestReviewEvent

Pull request review comment rust-lang/rust

Rename/Deprecate LayoutErr in favor of LayoutError

 mod layout;
 #[stable(feature = "global_alloc", since = "1.28.0")]
 pub use self::global::GlobalAlloc;
 #[stable(feature = "alloc_layout", since = "1.28.0")]
-pub use self::layout::{Layout, LayoutErr};
+pub use self::layout::Layout;
+#[stable(feature = "alloc_layout", since = "1.28.0")]
+#[rustc_deprecated(
+    since = "1.51.0",
+    reason = "use LayoutError instead",
+    suggestion = "LayoutError"
+)]

Same comment as in https://github.com/rust-lang/rust/pull/77691/files#r510956574

exrook

comment created time in 3 days

PullRequestReviewEvent

Pull request review comment rust-lang/rust

Rename/Deprecate LayoutErr in favor of LayoutError

 impl Layout {
     /// padding is inserted, the alignment of `next` is irrelevant,
     /// and is not incorporated *at all* into the resulting layout.
     ///
-    /// On arithmetic overflow, returns `LayoutErr`.
+    /// On arithmetic overflow, returns `LayoutError`.
     #[unstable(feature = "alloc_layout_extra", issue = "55724")]
     #[inline]
-    pub fn extend_packed(&self, next: Self) -> Result<Self, LayoutErr> {
-        let new_size = self.size().checked_add(next.size()).ok_or(LayoutErr { private: () })?;
+    pub fn extend_packed(&self, next: Self) -> Result<Self, LayoutError> {
+        let new_size = self.size().checked_add(next.size()).ok_or(LayoutError { private: () })?;
         Layout::from_size_align(new_size, self.align())
     }

     /// Creates a layout describing the record for a `[T; n]`.
     ///
-    /// On arithmetic overflow, returns `LayoutErr`.
+    /// On arithmetic overflow, returns `LayoutError`.
     #[stable(feature = "alloc_layout_manipulation", since = "1.44.0")]
     #[inline]
-    pub fn array<T>(n: usize) -> Result<Self, LayoutErr> {
+    pub fn array<T>(n: usize) -> Result<Self, LayoutError> {
         let (layout, offset) = Layout::new::<T>().repeat(n)?;
         debug_assert_eq!(offset, mem::size_of::<T>());
         Ok(layout.pad_to_align())
     }
 }

+#[stable(feature = "alloc_layout", since = "1.28.0")]
+#[rustc_deprecated(
+    since = "1.51.0",
+    reason = "use LayoutError instead",

I think this should probably say something like, "name doesn't follow std convention." Currently, it's redundant with the suggestion below.

exrook

comment created time in 3 days

PullRequestReviewEvent

issue comment BurntSushi/ripgrep

ignore crate doesn't handle common recursive directory ignore pattern

Oh cool. I forgot about that API!

jazzdan

comment created time in 3 days

issue closed BurntSushi/ripgrep

libresolv dependency on mac precompiled binary

Is a mac release binary already built with rust with the following patch applied? https://github.com/rust-lang/rust/issues/46797 I still see libresolv dependency in the latest mac release binary.

closed time in 4 days

smekkley

issue comment BurntSushi/ripgrep

libresolv dependency on mac precompiled binary

The macOS binary was built with the latest nightly release at the time of the most recent release, which was on May 29, 2020.

The PR you linked was closed in 2018. So yes, ripgrep was compiled with a Rust release that fixed that bug. Whether that bug is still fixed or there is some other issue is something I don't know, and it's not a ripgrep issue.

smekkley

comment created time in 4 days

issue comment BurntSushi/ripgrep

ignore crate doesn't handle common recursive directory ignore pattern

Errm, to clarify, the ignore crate provides recursive directory traversal.

jazzdan

comment created time in 4 days

issue comment BurntSushi/ripgrep

ignore crate doesn't handle common recursive directory ignore pattern

The only way to do that with the ignore crate is to iterate over parent paths.

I think people probably don't realize just how limited the ignore crate is. If you aren't doing filtered recursive directory traversal, then maybe parts of the crate will be useful to you, but you'll likely need to work hard to make it come together.

By the way, I'm guessing that target/ works with git because of some intricacy of how git handles files?

Yeah, git handles ignore rules pretty differently than ripgrep. Obviously I did my best to make them match, but I did that in the context of recursive directory traversal because that was the problem I needed to solve. But for example, if a file matches a gitignore pattern but is committed, then git won't ignore it, but ripgrep will.

jazzdan

comment created time in 4 days

issue comment BurntSushi/ripgrep

ignore crate doesn't handle common recursive directory ignore pattern

Looks correct to me. The pattern target/ will ignore the path target. So when using ignore for recursive traversal, it should ignore the target directory before descending into it. If you aren't doing recursive search, then you may need to match each parent path up to the level at which the gitignore file resides.
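
Here's a minimal sketch of that distinction using the ignore crate's gitignore matcher (the repository root and file paths are invented for the example):

use ignore::gitignore::GitignoreBuilder;

fn main() {
    // Build a matcher as if a .gitignore at /repo contained the single pattern "target/".
    let mut builder = GitignoreBuilder::new("/repo");
    builder.add_line(None, "target/").unwrap();
    let gi = builder.build().unwrap();

    // The directory itself matches, so recursive traversal prunes it up front.
    assert!(gi.matched("/repo/target", true).is_ignore());

    // A file inside it does not match the pattern directly...
    assert!(!gi.matched("/repo/target/debug/foo", false).is_ignore());

    // ...so non-recursive callers need to check parent paths as well.
    assert!(gi
        .matched_path_or_any_parents("/repo/target/debug/foo", false)
        .is_ignore());
}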

jazzdan

comment created time in 4 days

pull request comment BurntSushi/ripgrep

Remove regex unicode dependency from globset

Whoops, I forgot to put out a new release of globset with this change. This PR is now in globset 0.4.6 on crates.io.

ajeetdsouza

comment created time in 5 days

created tag BurntSushi/ripgrep

tag: globset-0.4.6

ripgrep recursively searches directories for a regex pattern while respecting your gitignore

created time in 5 days

push event BurntSushi/ripgrep

Andrew Gallant

commit sha c777e2cd5766128e11f7fd5dffd79e1ba8a753fb

globset-0.4.6

view details

push time in 5 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha cb06e010b4c1ddcbedcb47ee4600f4f3fe4aa205

progress

view details

push time in 6 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha 925b8c2549231f8e4f1f25c126005b72dc4e28fb

progress

view details

push time in 6 days

issue comment BurntSushi/ripgrep

Cannot perform empty search in PowerShell Core

I'm not sure then. My guess is that this is a PowerShell issue and not a ripgrep issue, but it will be a while before I'm able to spin up a Windows machine and test this for myself. Others are encouraged to weigh in here!

emiliano-ruiz

comment created time in 6 days

issue comment BurntSushi/ripgrep

Cannot perform empty search in PowerShell Core

Try rg "".

emiliano-ruiz

comment created time in 6 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha 8420d140e46142d6cb4b89a5c4af5285ca7b5250

progress

view details

push time in 6 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha 907aef6f56da9f5fdd62de725406e99c56df5662

progress

view details

push time in 7 days

PR closed BurntSushi/regex-automata

[WIP] RegexSet-like functionality

This is a work-in-progress, but I felt it was to a point where it was worth discussing. This specific PR is focused entirely on DFAs, implementing multiple-pattern functionality up to the point of overlapping_find_at.

The biggest issues I have currently are with the semantics of find for multiple-pattern regexes. I'm not sure what it should do, especially if it's unanchored.

Please take a look, let me know what you think of the overall approach, point out any obvious mistakes, etc.

Known work to go:

  • [ ] More test cases
  • [ ] Fix serialization/deserialization
  • [ ] Support non-allocating versions of the match map (AsRef a bunch of things)

Intended, eventually, to close #4

+756 -35

3 comments

6 changed files

awygle

pr closed time in 7 days

issue comment BurntSushi/regex-automata

Support for RegexSet-like cases

As mentioned here, I do have this working in my branch containing an in-progress rewrite of most of the crate. It will be a while yet before the 0.2 release, but the functionality is there. And at least for DFAs, it is much more expressive than the regex crate's RegexSet. It supports reporting the offsets of matches, even when doing overlapping searches. And you can even provide a pattern ID to execute an anchored search for a specific pattern in the DFA.
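
For contrast, a short sketch of what the regex crate's RegexSet reports today, namely which patterns matched but not where (the patterns here are just examples):

use regex::RegexSet;

fn main() {
    let set = RegexSet::new(&[r"\w+@\w+", r"[0-9]{4}"]).unwrap();
    let matched: Vec<usize> =
        set.matches("reach me at foo@example in 2020").into_iter().collect();

    // RegexSet reports *which* patterns matched, but not *where* they matched.
    assert_eq!(matched, vec![0, 1]);
}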

RReverser

comment created time in 7 days

pull request comment BurntSushi/regex-automata

[WIP] RegexSet-like functionality

OK, so I have managed to get this done in my branch that rewrites most of the crate. Getting this to work correctly ended up being incredibly difficult just because of all of the new additions required to the automaton.

It will unfortunately be a while yet before my rewrite is done, but the functionality is at least implemented.

awygle

comment created time in 7 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha aa7bc5539a0ef732852b4b2ae82cd6d3071e05e3

progress

view details

push time in 7 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha 4555a4e3e383f888b116377729867eb81622619b

progress

view details

push time in 7 days

issue comment BurntSushi/ripgrep

Blank lines to delineate different files should not be emitted when using -NI flags

I see, yes. People are sometimes caught off guard by the change in formatting depending on whether ripgrep is printing to a tty or not. But even core commands do this occasionally. Compare the output of ls and ls | cat for example. It's subtle!

I'll switch this over to a documentation bug, although it's a little tricky. Maybe the tty style behavior can be mentioned more prominently in the docs. But I think we're getting close to a point where there are a few things being mentioned prominently, which of course weakens the notion of "prominent." :-/

sheisnicola

comment created time in 7 days

issue comment BurntSushi/ripgrep

Blank lines to delineate different files should not be emitted when using -NI flags

So I actually wrote out a bug fix for this, but now I'm second-guessing whether this is really a bug or not. It just seems weird to me to disable headings in this particular case, but leave them present when --no-filename is given but line numbers are displayed. I guess in my view, having that gap between matches from different files might actually be desirable? Or at least, if this bug were fixed, having that gap would be impossible. But today, you can remove the heading with the --no-heading flag (or by redirecting ripgrep's output away from a tty).

I'm inclined to leave ripgrep's behavior as is.

sheisnicola

comment created time in 7 days

push event BurntSushi/ripgrep

Ajeet D'Souza

commit sha e5639cf22d138d86b04f1097f42e0f9e13438aa5

globset: remove regex unicode dependency Since the translation from a glob to a regex always disables Unicode in the regex, it follows that we shouldn't need regex's Unicode features enabled. Now, ripgrep enables Unicode features in its regex dependency and of course uses them, which will cause globset to have it enabled in the ripgrep build as well. So this doesn't actually change anything for ripgrep. But this does slim things down for folks using globset independently of ripgrep. PR #1712

view details

push time in 7 days

PR merged BurntSushi/ripgrep

Remove regex unicode dependency from globset

I don't think globset requires Unicode support in regex for path matching (correct me if I'm wrong). I removed the dependency to make the crate lighter.

This won't affect the size of ripgrep, since it depends on Unicode in regex elsewhere. However, I've been meaning to use globset in zoxide, and I'd like to keep the binary size as small as possible.

I tested a trivial program before and after the change, and the executable size went from 2544 KB to 2084 KB.

Thanks!

+1 -1

2 comments

1 changed file

ajeetdsouza

pr closed time in 7 days

pull request comment BurntSushi/ripgrep

Remove regex unicode dependency from globset

Aye, yeah, passing the test suite is good enough for me. I expect we'll find out if this was an unwise choice after the next release! Hah. But I think it should be fine.

ajeetdsouza

comment created time in 7 days

PullRequestReviewEvent

push event BurntSushi/ripgrep

Dương Đỗ Minh Châu

commit sha 86c843a44bc70b377c724a2bf9a6251da1f5f5b9

ignore/types: add a type for minified files Fixes #1710, PR #1711

view details

push time in 7 days

issue closed BurntSushi/ripgrep

Add a type for minified files

When searching, usually the matches from minified files are unwanted because each match is extremely long and the match can be found in the original file as well. Because matches from minified files are rarely desired, and currently rg --type-list has no type for minified files, I suggest adding a minified type so people can ignore minified files while searching by using --type-not minified. This new type should be printed as minified: *.min.js, *.min.css, *.min.html when we run rg --type-list.

closed time in 7 days

duongdominhchau

PR merged BurntSushi/ripgrep

Add a type for minified files

Well, I don't know Rust and the directory structure looks weird. I thought it would be hard to make the change, but rg told me I was wrong: rg '\*.toml' shows me exactly what I want. Thank you for this awesome tool!

This PR is for issue #1710

+1 -0

0 comments

1 changed file

duongdominhchau

pr closed time in 7 days

PullRequestReviewEvent

issue opened rust-lang/regex

fuzz: compiling '\P{any}' panics by tripping an assertion in the compiler

Specifically, this one: https://github.com/rust-lang/regex/blob/a7ef5f452ec1dcb856fae99d56c6db08bf23d1ff/src/compile.rs#L419

Normally, regexes like [^\w\W] with empty classes are banned at translation time. But it looks like \P{any} (which is empty) slipped through. So we should just improve the ban to cover that case.

However, empty character classes are occasionally useful constructs for injecting a "fail" sub-pattern into a regex, typically in the context of cases where regexes are generated. Indeed, the NFA compiler in regex-automata handles this case fine:

$ regex-cli debug nfa thompson '\P{any}' -B
      parse time:  48.809µs
  translate time:  17.48µs
compile nfa time:  18.638µs
   pattern count:  1

thompson::NFA(
>000000: alt(2, 1)
 000001: \x00-\xff => 0
^000002: sparse()
 000003: MATCH(0)
)

Where it's impossible to ever move past state 2. Arguably, it might be nicer if it were an explicit "fail" instruction, but an empty sparse instruction (a state with no outgoing transitions) serves the purpose as well.

So once #656 is done, we should be able to relax this restriction.
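
For reference, a small sketch of the existing ban (the \P{any} case is left commented out since, per the above, it currently panics):

use regex::Regex;

fn main() {
    // Empty character classes like this one are rejected at translation time today.
    assert!(Regex::new(r"[^\w\W]").is_err());

    // `\P{any}` is also an empty class and should be rejected the same way;
    // per this report, it currently trips an assertion in the compiler instead.
    // let _ = Regex::new(r"\P{any}");
}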

created time in 7 days

issue comment BurntSushi/ripgrep

Add a type for minified files

This seems reasonable to me. I don't usually track open issues for file types. Just submitting a PR is fine.

Also, you might be interested in the -M/--max-columns flag. It is particularly helpful with things like minified files since it will suppress extremely long lines.

duongdominhchau

comment created time in 7 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha fa7c2c7e7b5d1aef9c850d91f2bf3a53093c1ab5

progress

view details

push time in 8 days

created tag BurntSushi/bstr

tag: 0.2.14

A string type for Rust that is not required to be valid UTF-8.

created time in 8 days

push event BurntSushi/bstr

Andrew Gallant

commit sha 7f0ad15d9628c0abec8cf6b7585539cae63e6d5b

0.2.14

view details

push time in 8 days

delete branch BurntSushi/bstr

delete branch : ag/rollup

delete time in 8 days

push event BurntSushi/bstr

Mara Bos

commit sha 8e659921312830b91d2a48aafa36fb1a49cba5bc

impl: improve ByteVec::into_string_lossy It now re-uses to_str_lossy() (which is more efficient than looping over v.chars() like this did), and no longer allocates in the case where it was already valid utf-8. Closes #76

view details

Todd Mortimer

commit sha 87a39778af5036b02ecb8905022acb6a34668cc8

doc: improve example for trim_end_with This makes the example illustrate the function. Closes #74

view details

GrayJack

commit sha acc45f1bbd96ba7145249ab38d62f7588a4ede76

impl: support formatting alignment This adds support for the various '{:5}' syntaxes defined by std::fmt. We ended up having to re-roll it instead of using std's facilities because we're working with byte slices and not &str. It would be simple to just lossily convert a BStr into a String, but this would invalidate the implementation in no_std contexts. Fixes #69, Closes #70

view details

Mara Bos

commit sha bad9e087fe88cbf04bf3d5f0111165df1460897f

api: add Utf8Chunk::incomplete method This makes it possible to easily tell whether a particular chunk either contains invalid UTF-8 or possibly contains a prefix of valid UTF-8. This is useful in the context of processing streams. Closes #68

view details

CAD97

commit sha 06295eea6d66db8539e15daccd658157fe24e0fc

api: add From impls between Box<[u8]> and Box<BStr> Fixes #66, Closes #67

view details

Alex Touchet

commit sha 4ff48c25425c9fc4049b9b2a834e6083743009be

doc: update various urls in README This mostly switches to https and tweaks docs.rs links. Closes #63

view details

Erik Zscheile

commit sha 16e782a7291a7f357ce4bfbf6d2155d65483e705

api: add as_slice, size_hint and FusedIterator to Bytes Fixes #61, Closes #62

view details

Paul Colomiets

commit sha 8e2041ed5481078f25635dd7989a96abd87721ce

impl: improve Debug impl for BStr Previously, any non-printable Unicode scalar value would get printed using the '\u{...}' syntax. While not horrible, it's a bit weird to do this for ASCII characters and especially for the NUL and various whitespace characters. So this commit tweaks the Debug impl for BStr to handle those cases specially in a way that's a bit more consistent with what folks expect. Fixes #56, Closes #58

view details

Andrew Gallant

commit sha eafb4951c651c4b4eab94621c259f80b217803ee

impl: fix replacement codepoint handling in Debug impl Previously, if an actual replacement codepoint was found in UTF-8, it would print the UTF-8 encoding of the replacement codepoint instead of the replacement codepoint itself. This is because the replacement codepoint is used as a sentinel to determine whether invalid UTF-8 was found. Of course, if the replacement codepoint is encoded itself, then it's valid UTF-8. So we fix that by checking whether the raw bytes correspond to a valid UTF-8 encoding of the replacement codepoint. If so, we print it just like any other scalar value. Fixes #72

view details

Andrew Gallant

commit sha 809c8b8237afd7c8c435268f72b3d2c03c6fcec5

api: impl FusedIterator for CharIndices Fixes #71

view details

push time in 8 days

issue closed BurntSushi/bstr

Implement FusedIterator for CharIndices?

Implementing FusedIterator means consumers can rely on CharIndices to return None forever when it is exhausted.

Is this not already the case today?
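
A quick sketch of what the guarantee buys, assuming the impl requested here (it landed in bstr 0.2.14):

use bstr::ByteSlice;

fn main() {
    let mut it = b"ab".char_indices();
    assert_eq!(it.next(), Some((0, 1, 'a')));
    assert_eq!(it.next(), Some((1, 2, 'b')));
    assert_eq!(it.next(), None);
    // With `FusedIterator` implemented, callers may rely on the iterator
    // continuing to return `None` forever once it is exhausted.
    assert_eq!(it.next(), None);
}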

closed time in 8 days

lopopolo

issue closed BurntSushi/bstr

What is the intended Debug escape behavior for the BStrs that contain the Unicode replacement character?

I've based some ident parsing code on the fmt::Debug impl for &BStr, which currently looks like this:

https://github.com/BurntSushi/bstr/blob/f56685bc0b52fcd713286355baf488709120b0ce/src/impls.rs#L331-L339

For my use case, this wasn't quite right because all Unicode characters outside of the ASCII range are valid ident characters. This means the replacement character itself is a valid ident char if it appears in the source byteslice. I ended up with something like this:

// The snippet below assumes bstr's ByteSlice for `char_indices` on `&[u8]`,
// std's REPLACEMENT_CHARACTER, and the UTF-8 encoding of U+FFFD.
use bstr::ByteSlice;
use std::char::REPLACEMENT_CHARACTER;

const REPLACEMENT_CHARACTER_BYTES: [u8; 3] = [0xEF, 0xBF, 0xBD];

fn is_ident_char(ch: char) -> bool {
    ch.is_alphanumeric() || ch == '_' || !ch.is_ascii()
}

fn is_ident_until(name: &[u8]) -> Option<usize> {
    // Empty strings are not idents.
    if name.is_empty() {
        return Some(0);
    }
    for (start, end, ch) in name.char_indices() {
        match ch {
            // `char_indices` uses the Unicode replacement character to indicate
            // the current char is invalid UTF-8. However, the replacement
            // character itself _is_ valid UTF-8 and a valid Ruby identifier.
            //
            // If `char_indices` yields a replacement char and the byte span
            // matches the UTF-8 encoding of the replacement char, continue.
            REPLACEMENT_CHARACTER if name[start..end] == REPLACEMENT_CHARACTER_BYTES[..] => {}
            // Otherwise, we've gotten invalid UTF-8, which means this is not an
            // ident.
            REPLACEMENT_CHARACTER => return Some(start),
            ch if !is_ident_char(ch) => return Some(start),
            _ => {}
        }
    }
    None
}

The current implementation of Debug will always output the replacement character as three byte escapes. Is this intended?

closed time in 8 days

lopopolo

PR closed BurntSushi/bstr

Better (shorter, clearer) debug implementation

Fixes #56

Note: apart from what is discussed here, I've also included the \0 escape. In Rust this is not an octal escape but a special case similar to \n and \t. Given that it's quite often used in binary data, and is shorter this way, I think it justifies the special case.

+26 -5

0 comments

1 changed file

tailhook

pr closed time in 8 days

issue closed BurntSushi/bstr

More compatible escapes for control characters

Current Debug implementation yields something like this:

"\u{0}\u{0}\u{0} ftypisom\u{0}\u{0}\u{2}\u{0}isomiso2avc1mp"

And I'd like it to be like this:

"\0\0\0 ftypisom\0\0\x02\0isomiso2avc1mp"

Edit: these are the first bytes of an mp4 file, so it's quite a real-world example. This is not only compatible with Rust (even for str constants), but also with C, Python, Bash, and probably more languages. And it's also a bit shorter.

  1. What do you think of changing the representation?
  2. If changing is okay, what to do with non-printable chars > 0xff? There are two choices: keep them as the current \u{..} or always escape with \x. I would prefer the latter, because when working with binary data, looking at individual bytes makes more sense, and when working with textual data, these character codes are rarely informative enough.

closed time in 8 days

tailhook

PR closed BurntSushi/bstr

improve the `bstr::ext_slice::Bytes` iterator

Added methods/traits:

  • as_slice (fix #61)
  • Iterator::size_hint (uphold contract with ExactSizeIterator)
  • impl FusedIterator (may allow some additional optimizations)
+18 -6

1 comment

1 changed file

zserik

pr closed time in 8 days

issue closed BurntSushi/bstr

the `Bytes` iterator should have an `as_slice` method

The underlying iterator, std::slice::Iter, has an as_slice method, and it would be very convenient if that method were re-exported on Bytes. That would allow easy conversion between &[u8] and Bytes, which would make it easier to write parsers, which often need to parse a few bytes and then return the unparsed remainder of the byte slice (example use case, currently without usage of the Bytes iterator).
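
A small sketch of that parsing pattern, assuming the as_slice method added in the pull request above (bstr 0.2.14):

use bstr::ByteSlice;

fn main() {
    let mut bytes = b"abc xyz".bytes();
    // Pull a few bytes off the front...
    bytes.next();
    bytes.next();
    bytes.next();
    bytes.next();
    // ...and `as_slice` returns the unparsed remainder as a plain `&[u8]`.
    assert_eq!(bytes.as_slice(), &b"xyz"[..]);
}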

closed time in 8 days

zserik

PR closed BurntSushi/bstr

Update URLs

Update some URLs and link to latest versions of doc pages.

+7 -7

0 comments

1 changed file

atouchet

pr closed time in 8 days

PR closed BurntSushi/bstr

Add conversions between boxed bytes and BStr

Closes #66

This might require the 1.41 adjustments of local impl checking to be coherent. (Let's ask CI!) If that turns out to be the case, these impls will need to be gated behind a check of the Rust version (or the MSRV bumped) and potentially exposed not just as the trait methods. EDIT: CI says this works on MSRV 🎉

+28 -0

0 comments

2 changed files

CAD97

pr closed time in 8 days

issue closed BurntSushi/bstr

Feat: Box<[u8]> -> Box<BStr>

It's currently possible to do &[u8] -> &BStr and Vec<u8> -> BString, but a way to construct owned, heap-allocated but fixed-capacity bstrings, Box<[u8]> -> Box<BStr>, is not yet provided.
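
A short sketch of the requested conversions, assuming the From impls between Box<[u8]> and Box<BStr> that were added for this issue (bstr 0.2.14):

use bstr::BStr;

fn main() {
    let bytes: Box<[u8]> = Box::from(&b"hello"[..]);

    // Box<[u8]> -> Box<BStr>: an owned, fixed-capacity byte string.
    let bstr_box: Box<BStr> = Box::from(bytes);

    // ...and back again.
    let bytes_again: Box<[u8]> = Box::from(bstr_box);
    assert_eq!(&*bytes_again, &b"hello"[..]);
}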

closed time in 8 days

CAD97

PR closed BurntSushi/bstr

Add Utf8Chunk::incomplete().

This makes .utf8_chunks() usable for incrementally processed streams.

+76 -3

0 comments

1 changed file

m-ou-se

pr closed time in 8 days

PR closed BurntSushi/bstr

Factor the align and width for BStr Display implementation

This fixes #69.

+172 -5

0 comments

1 changed file

GrayJack

pr closed time in 8 days

issue closed BurntSushi/bstr

BStr Display implementation doesn't consider width and fill/align

I tried to use width and fill/align to print a BStr and was surprised when it didn't work.

Was this intended?
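
A sketch of the formatting this is about, assuming the alignment support added by the fix (bstr 0.2.14); the expected padded output is shown in the assertions:

use bstr::BStr;

fn main() {
    let s = BStr::new("abc");
    // Width, fill and alignment now flow through to the Display impl.
    assert_eq!(format!("[{:>5}]", s), "[  abc]");
    assert_eq!(format!("[{:-<5}]", s), "[abc--]");
}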

closed time in 8 days

GrayJack

PR closed BurntSushi/bstr

Improve example for trim_end_with().

Trivial diff to make the example illustrate the function.

+1 -1

0 comments

1 changed file

mordak

pr closed time in 8 days

PR closed BurntSushi/bstr

Make ByteVec::into_string_lossy() more efficient.

It now re-uses to_str_lossy() (which is more efficient than looping over v.chars() like this did), and no longer allocates in the case where it was already valid UTF-8.

+7 -8

0 comments

1 changed file

m-ou-se

pr closed time in 8 days

PR merged BurntSushi/bstr

rollup most PRs and squash a few small issues
+387 -37

0 comments

6 changed files

BurntSushi

pr closed time in 8 days

push event BurntSushi/bstr

Andrew Gallant

commit sha 842fae214355f49fd701c9cd2e8d6fdf788df5dd

api: impl FusedIterator for CharIndices Fixes #71

view details

push time in 8 days

Pull request review comment BurntSushi/bstr

rollup most PRs and squash a few small issues

 impl<'a> DoubleEndedIterator for CharIndices<'a> {     } } +#[cfg(feature = "std")]+impl<'a> ::std::iter::FusedIterator for CharIndices<'a> {}

Derp. Thank you.

BurntSushi

comment created time in 8 days

PullRequestReviewEvent

issue closed BurntSushi/bstr

A helper macro for bytestring concatenation/formatting

Hi there,

I've just released an article about bytestrings and encoding in general here: https://www.reddit.com/r/rust/comments/gz33u6/not_everything_is_utf8/, at the end of which I mention my intention of creating a format_bytes! macro to... format bytes.

I was wondering whether you had built that functionality somewhere I could re-use, or, if not, whether you think it would be a good fit for the bstr crate, barring maybe a cargo feature for the added compilation dependencies for proc macros.

(Copy-pasted from an earlier exchange to preserve its history)

closed time in 8 days

Alphare

issue comment BurntSushi/bstr

A helper macro for bytestring concatenation/formatting

I'm going to close this out since I don't have any current plans to pursue or adopt this. I think the thing scaring me away from this is the implementation complexity. I think it would be biting off more than I can chew.

However, I'm happy to be shown that this can be simpler than I think, in which case, I would consider merging it into bstr.

Alphare

comment created time in 8 days

push event BurntSushi/bstr

Andrew Gallant

commit sha df70fbef84b5dc0156effcab56f3396d867bffb4

api: impl FusedIterator for CharIndices Fixes #71

view details

push time in 8 days

issue closed BurntSushi/bstr

Question: For what functionality does `bstr` need `regex-automata` and `lazy-static`?

https://github.com/BurntSushi/bstr/blob/91edb3fb3e1ef347b30e5bd792bb4d29ee19d163/Cargo.toml#L25

I'm considering using this crate, but it seems to have quite some heavy dependencies by default (I'm aware that I can turn it off). What are these dependencies used for?

closed time in 8 days

tbu-

issue comment BurntSushi/bstr

Question: For what functionality does `bstr` need `regex-automata` and `lazy-static`?

I'm going to close this out. I still feel largely the same as I did when I wrote my comments above, and I don't necessarily see that changing.

tbu-

comment created time in 8 days

PR opened BurntSushi/bstr

rollup most PRs and squash a few small issues
+387 -37

0 comments

6 changed files

pr created time in 8 days

create branch BurntSushi/bstr

branch : ag/rollup

created branch time in 8 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha 2c4dd3ff1235906678072d08add8da9b03f1a75b

progress

view details

push time in 8 days

issue closed BurntSushi/bstr

Add more From impls for constructing Cow

I'm constructing a set of error messages that use Cow<'static, [u8]> because the vast majority of messages are static byte strings, but some require dynamic allocation (for example by deferring to the Display impl of a wrapped error). The actual message field is a Cow<'static, BStr> to make use of BStr's debug implementation.

The constructors for the error structs are all From impls on owned and borrowed String and Vec<u8>.

Trying to construct a Cow<'_, BStr> from a Vec<u8> or &'a [u8] is a bit verbose because the compiler gets confused when you write buf.into().into(). I end up having to write:

impl From<Vec<u8>> for Message {
    fn from(message: Vec<u8>) -> Self {
        Self(Cow::Owned(message.into()))
    }
}

impl From<&'static [u8]> for Message {
    fn from(message: &'static [u8]) -> Self {
        Self(Cow::Borrowed(message.into()))
    }
}

Can we add a direct set of impls for constructing a Cow?

impl<'a> From<Vec<u8>> for Cow<'a, BStr> {
    fn from(bytes: Vec<u8>) -> Self {
        Cow::Owned(bytes.into())
    }
}
impl<'a> From<&'a [u8]> for Cow<'a, BStr> {
    fn from(bytes: &'a [u8]) -> Self {
        Cow::Borrowed(bytes.into())
    }
}

And maybe impl<'a> From<Cow<'a, [u8]>> for Cow<'a, BStr>.

These impls would also make it easier to turn a String/&str Cow into a BStr Cow.

closed time in 8 days

lopopolo

issue comment BurntSushi/bstr

Add more From impls for constructing Cow

Unfortunately, no, none of those impls are allowed because of coherence (I just checked). For example, in the case of From<Vec<u8>> for Cow<'a, BStr>, neither Vec<u8> nor Cow is defined in bstr, which means the impl isn't allowed.

There are a lot of impls in src/impls.rs. If you find a missing impl that is legal, then please feel free to just submit a PR. :-) Thanks.

lopopolo

comment created time in 8 days

issue comment BurntSushi/bstr

Implement FusedIterator for CharIndices?

Indeed. I have this fixed in an upcoming PR. For simple cases like this, just submitting a PR is fine. :-) Thanks!

lopopolo

comment created time in 8 days

pull request comment BurntSushi/bstr

Add a `bstr::literal!` macro for constructing a `const` BStr

Looking at this again, I wonder if we can achieve this with a const fn instead. There's been a lot of work there recently. It would require a big MSRV bump to do it any time soon though, but I think I'd be okay with that.

You do mention that traits don't work in const fns, but it seems like we should be able to make a &BStr without using traits? In fact, I wonder if we can make B a const fn, although I guess that particular API would require using traits... Hmm...

thomcc

comment created time in 8 days

PullRequestReviewEvent

Pull request review comment BurntSushi/bstr

Make ByteVec::into_string_lossy() more efficient.

 pub trait ByteVec: Sealed {
     where
         Self: Sized,
     {
-        let v = self.as_vec();
-        if let Ok(allutf8) = v.to_str() {
-            return allutf8.to_string();
+        match self.as_vec().to_str_lossy() {
+            Cow::Borrowed(_) => unsafe { self.into_string_unchecked() },

This LGTM, but could you add a safety note here? Like this:

// SAFETY: blah blah

I believe every other use of unsafe in this crate has a safety annotation. Also, I try not to use the word unsafe in the description so that grep reports fewer false positives.

Thanks!

m-ou-se

comment created time in 8 days

PullRequestReviewEvent
PullRequestReviewEvent

issue comment BurntSushi/rust-csv

document usage of #[serde(flatten)] more thoroughly

@bchallenor Could you please provide a complete Rust program that compiles, along with any relevant input, your expected output and the actual output? If possible, please briefly describe the problem you're trying to solve. (To be honest, I think it might be better to open a new issue, unless you're 100% confident that this one is the same.)

Your suggested work-around of adding a crate feature to disable inference is probably a non-starter because it likely breaks other functionality. This sort of subtle interaction between Cargo features is not the kind of complexity I'd like to add to the crate. Instead, I'd like to look at your problem from first principles; however, I do not see a complete Rust program from you here or in the linked issue.

LPGhatguy

comment created time in 8 days

issue closed rust-lang/regex

Remove "perf" and "unicode" from default-features

README.md explains that one can disable default features to reduce binary size, but it is practically impossible to do with a large dependency tree. As long as one of the crates in the tree includes the regex crate without specifying default-features = false, the perf feature is enabled. To disable it, you have to go to each dependency maintainer and convince them to disable default features.

For such a widely used library it makes sense to disable these features by default and let application maintainers enable it if they want. Some libraries may also enable "unicode", but only if they really depend on it.

closed time in 8 days

link2xt

issue comment rust-lang/regex

Remove "perf" and "unicode" from default-features

The problem you point out is certainly real, and I knew it would be a problem when I added the features about a year ago. While you've provided a solution to the problem, you have not provided any argument for why the downsides of your solution are better than the downsides of the status quo. In particular, if I did what you're suggesting, then:

  1. It would be a breaking change, and thus require a regex 2 release.
  2. In this hypothetical regex 2 release, regexes like \w would fail to compile. Specifically, from the docs: "Stated differently, enabling or disabling any of the features below can only add or subtract from the total set of valid regular expressions. Enabling or disabling a feature will never modify the match semantics of a regular expression." One can only imagine how confused users would be, and I can empathize with their frustration as they seek to figure out how to fix it. (By re-enabling the features.) Suffice it to say, having Regex::new(r"\w").unwrap() fail in the default configuration of this crate seems bonkers to me.
  3. Disabling these features is primarily geared towards reducing binary size and compilation time, at the expense of performance and correctness. IMO, and as the maintainer of this crate, I generally prioritize both performance and correctness over binary size and compilation time. With that said, where possible, being able to tweak that trade off is a nice convenience to offer. Hence why these crate features exist. That they can be difficult to disable in a large dependency tree is an unfortunate circumstance, but I don't really see another way.

To disable it, you have to go to each dependency maintainer and convince them to disable default features.

Yup, exactly. If a library crate you're using depends on regex and doesn't need or want these extra features, then it should be treated as a bug (or possibly feature enhancement), just like anything else. Sorry, but that's just the way the cookie crumbles in the current Cargo feature system.

link2xt

comment created time in 8 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha 170f175dca05d5929bbf9e43db93125a2eec2116

progress

view details

push time in 9 days

push event BurntSushi/regex-automata

Andrew Gallant

commit sha 412ac5233aaa9262a295032244fd006b0d97def5

progress

view details

push time in 9 days

issue comment BurntSushi/ripgrep

--smart-case breaks Unicode range search

When I post an issue, I take the time to post it right.

Then fill out the issue template that the maintainer of the project has requested, please. It's not some bureaucratic nonsense. It's to help streamline the identification of bugs. The exact formatting isn't necessary (although it is helpful), but the information in it is, and your initial issue didn't have it. I didn't know what you were searching or why you thought it was wrong or what you were seeing.

Your other open source work looks wonderful, but that doesn't really matter here. When someone files a bug, unless I have a previous rapport with them, they're a mystery to me. I don't know their background, their context, what they do or don't know. That's the whole point of the issue template: to eliminate assumptions and provide the detail necessary to understand and fix a bug. It's to eliminate me having to guess what people mean. Instead, people should show what they mean. That's the point of the questions in the template.

If you ran my command on a file with even a single lowercase k, then boom, issue reproduced.

No, it wouldn't. All you said was that it didn't work. I don't know what that means.

but I don't think any ASCII should be case-matching anything in the multibyte Unicode range

You'll have to take that up with Unicode. Probably the "most correct" answer is locale support with specialized tailoring, but ripgrep doesn't and likely will never have it.

nu8

comment created time in 9 days

push event BurntSushi/dotfiles

Andrew Gallant

commit sha d56be37693917f31316119b04985dee4ded2417f

bin/arch-install: add unzip package

view details

push time in 9 days

issue comment BurntSushi/ripgrep

--smart-case breaks Unicode range search

I'm glad you edited your comment, because I don't appreciate the hostility. I was on mobile when I was responding to you earlier, so no, I didn't have a chance to try anything. I always try to triage issues when I can, so that when I do sit down to look more closely, I have all the information I need. And I note that even after repeated requests to fill out the template, you still didn't quite do it completely. You left out the --debug output, and you didn't really explain why you thought the output was incorrect, which turned out to be important in this case.

Sometimes, bugs elude simple reproductions. But not in this case. It is extremely frustrating when I've gone through the trouble to setup bug reporters for success by asking for information that leads to a reproduction, only to have the reporter completely ignore it. I'm doing this in my spare time, and if every reporter was as difficult as you, I'd be spending a lot more time dealing with issues than I would like.

As for the issue you've reported, the output you're seeing is correct. I don't know whether your environment has colors enabled, but in mine, it's pretty clear what ripgrep is matching here:

[screenshot: ripgrep-1708-smart-case-match]

As you can see, ripgrep is matching a k in each line of the file, which is correct, because k is the lowercase form of U+212A, or K, which is the Kelvin sign, and not U+004B, which is the Latin uppercase letter K.

Perhaps though, you're wondering why smart case is even applying here. That's because smart case applies when two conditions are met: there is at least one literal character in the pattern and none of the literals in the pattern are considered uppercase with respect to Unicode. In your pattern, you have literals (two of them, \u07FF and \U0010FFFF) and neither of those literals are uppercase according to Unicode:

$ printf '\xDF\xBF' | rg '\p{Uppercase}'
$ printf '\xF4\x8F\xBF\xBF' | rg '\p{Uppercase}'
$

Thus, smart case kicks in and enables case insensitive search. Since U+212A is in the range U+07FF-U+0010FFFF, the case folded range includes U+006B, which is the lowercase latin letter k.

Your proposed work-around is the simplest way to avoid the issue. Another possible way would be to disable Unicode case folding and just use ASCII case folding, but ripgrep exposes no way to do that while simultaneously using other Unicode features (in this case, the \u syntax).

I've updated the docs of --smart-case to lay out these rules more explicitly so that this is clearer in the future. Incidentally, this is one of the many reasons why ripgrep does not enable smart case by default.
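
The case-folding relationship itself is easy to check from Rust (a standalone sketch, independent of ripgrep):

fn main() {
    // U+212A KELVIN SIGN lowercases to U+006B LATIN SMALL LETTER K,
    // so a case-insensitive class containing U+212A also matches 'k'.
    let kelvin = '\u{212A}';
    assert_eq!(kelvin.to_lowercase().collect::<String>(), "k");

    // And with the regex crate's Unicode case folding:
    let re = regex::Regex::new(r"(?i)[\x{07FF}-\x{10FFFF}]").unwrap();
    assert!(re.is_match("k"));
}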

nu8

comment created time in 9 days

push event BurntSushi/ripgrep

Andrew Gallant

commit sha 2b1637d1db5f8d20f417db11e3ef157b02556429

doc: clarify how -S/--smart-case works Whether or not smart case kicks in can be a little subtle in some cases. So we document the specific conditions in which it applies. These conditions were taken directly from the public API docs of the `grep-regex` crate: https://docs.rs/grep-regex/0.1.8/grep_regex/struct.RegexMatcherBuilder.html#method.case_smart Fixes #1708

view details

push time in 9 days

issue closed BurntSushi/ripgrep

--smart-case breaks Unicode range search

If I do a search like this:

rg '[\u07FF-\U0010FFFF]'

it works fine. But if I do this:

rg --smart-case '[\u07FF-\U0010FFFF]'

it fails. I don't actually do the above, but I do have a file:

C:\Users\Steven\ripgrep.txt

and this:

$env:RIPGREP_CONFIG_PATH = 'C:\Users\Steven\ripgrep.txt'

Is this the expected result? If so, what is the best way to handle this situation? I don't want to delete my config file just to do certain searches.

closed time in 9 days

nu8

issue comment BurntSushi/ripgrep

--smart-case breaks Unicode range search

Please fill out the complete issue template. I'm not going to spend my time trying to guess what you're saying.

nu8

comment created time in 9 days

issue comment BurntSushi/ripgrep

--smart-case breaks Unicode range search

Why did you delete the issue template? Please fill it out. Your issue is missing at least a few essential pieces of information that the issue template for reporting a bug requests.

And please share your config file.

In the future, when filing bugs on any project, please consider providing more detail than "it breaks" or "doesn't work." That doesn't mean anything to me. That's why the issue template exists. I don't understand why people ignore it.

nu8

comment created time in 9 days

issue comment BurntSushi/toml

Is this project still maintained?

@theoretick IMO, anyone choosing a dependency should do their due diligence on whether it's appropriate to use or not. That should at least include inspecting commit activity and the issue tracker. It should be clear from that what the current status of this project is. And in particular, the README does state the version of TOML that this package supports near the top.

With that said, you're right, it would be more helpful if I posted an explicit acknowledgment. And that is now done.

Thanks for the prompting.

theoretick

comment created time in 9 days

push event BurntSushi/toml

Andrew Gallant

commit sha ea60c4def909bde529d41a7e0674e31eba751da3

readme: add 'unmaintained' notice Closes #268

view details

push time in 9 days

issue closed BurntSushi/toml

Is this project still maintained?

Hello and thank you for this library.

I just noticed that the most recent update was from over 2 years ago. With over 40 open issues, no support for the most recent TOML 0.5 spec, and a number of stalled PRs, I was hoping to clarify the project's status.

Should this project be given a deprecation notice? Are there any plans to support new features? Are you looking for new maintainers?

closed time in 9 days

theoretick