Carol (Nichols || Goulding) carols10cents Pittsburgh, PA, USA http://carol-nichols.com

carols10cents/aoc-rs-2019 20

My Advent of Code 2019 solutions (in Rust, of course). Not guaranteed to be done each day or totally completed.

carols10cents/cargo-open 16

A third-party cargo extension to allow you to open a dependent crate in your $EDITOR

carols10cents/carolshubot 10

My hubot setup

carols10cents/adventofcode-rs 3

My Advent of Code 2016 solutions in Rust. Tags for each day's solutions.

bonitoo-io/influxdb-client-rust 1

InfluxDB (v2+) Client Library for Rust

carols10cents/book 1

The Rust Programming Language

carols10cents/capybara 1

webrat alternative which aims to support all browser simulators

carols10cents/carol-test 1

A crate I use to test publishing

carols10cents/24pullrequests 0

Giving back little gifts of code for Christmas

create branch integer32llc/arrow

branch : unignore-some-tests

created branch time in 11 hours

PR closed apache/arrow

One definition/repetition level test lang-rust

Hey @nevi-me, before I go write a bunch of these, is this what would be useful for testing levels? Is there an easier way to create the arrays?

I'm basing these on tests in the C++ implementation that have a nice JSON constructor, and I tried using the JSON Reader, but I couldn't get what I built with the JSON Reader to match what I currently have here :-/

Thank you for any feedback you have!

+4682 -885

5 comments

83 changed files

carols10cents

pr closed time in 11 hours
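
For context on the kind of arrays being discussed in this PR, here is a minimal sketch, not taken from the PR, of how a nested list column with nulls can be built with the arrow crate's builder API. It assumes the arrow 2.x-era builder signatures (newer releases drop the Result return values), and the function name and values are made up:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Builder, ListBuilder};
use arrow::error::Result;

/// Builds a list<int32> column shaped like [[1, 2], null, [3, null]] so a
/// writer test can check the definition/repetition levels it produces.
fn build_nested_column() -> Result<ArrayRef> {
    let mut builder = ListBuilder::new(Int32Builder::new(8));

    // row 0: [1, 2]
    builder.values().append_value(1)?;
    builder.values().append_value(2)?;
    builder.append(true)?;

    // row 1: null list
    builder.append(false)?;

    // row 2: [3, null]
    builder.values().append_value(3)?;
    builder.values().append_null()?;
    builder.append(true)?;

    Ok(Arc::new(builder.finish()))
}
```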

pull request comment apache/arrow

One definition/repetition level test

I think I'm going to close this one for now; I don't think it's useful until the def/rep levels are a bit further along.

carols10cents

comment created time in 11 hours

delete branch integer32llc/arrow

delete branch : dict

delete time in 12 hours

delete branch integer32llc/arrow

delete branch : update-cpp-comment

delete time in 12 hours

pull request comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

@nevi-me Rebased and fixed the last few things! I pulled the comment fix into its own PR, thanks for the tip on CI. Hoping for all greens now!

carols10cents

comment created time in 2 days

PR closed integer32llc/arrow

Complete dictionary support

This simplifies dictionary support by moving the primitive casts to one place. The idea is to not cast Parquet primitive types, as these can be mapped to 4 Arrow types (i32, i64, f32, f64). Once these 4 primitive Arrow arrays are created, we can leverage the machinery in arrow::compute::cast to cast to many Arrow types.

I've left some TODOs for technical debt which I'd love for us to address in this PR, lest we never get to it.

+5071 -965

1 comment

85 changed files

nevi-me

pr closed time in 2 days
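
The approach described in the PR summary, reading Parquet primitives as a small set of canonical Arrow primitive arrays and letting arrow::compute::cast handle the rest, can be illustrated with a minimal sketch. This is not the reader code itself; the function name and the Int64 target type are made up for illustration:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, Int32Array};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

// Materialize the Parquet physical type as one of the canonical Arrow
// primitives (i32 here), then let the cast kernel produce the logical type
// the schema asks for (Int64 as a stand-in target).
fn primitive_then_cast(values: Vec<i32>) -> Result<ArrayRef> {
    let primitive: ArrayRef = Arc::new(Int32Array::from(values));
    cast(&primitive, &DataType::Int64)
}
```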

pull request comment integer32llc/arrow

Complete dictionary support

Cherry-picked this to https://github.com/apache/arrow/pull/8402 :)

nevi-me

comment created time in 2 days

PR opened apache/arrow

[ARROW-10397] Update comment to match change made in b1a7a73ff2

Dictionaries can be indexed by either signed or unsigned integers.

+1 -2

0 comments

1 changed file

pr created time in 2 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha 8f621d0649dec5a7f9f5451150776c5863c23817

Add a failing test for string dictionary indexed by an unsinged int

view details

Carol (Nichols || Goulding)

commit sha be62e4a7061f208b619a541abbd32865654dc4e2

Extract a method for converting dictionaries

view details

Carol (Nichols || Goulding)

commit sha 5f330b2713fba9800eca1998200b6da8d929f408

Extract a macro for string dictionary conversion

view details

Carol (Nichols || Goulding)

commit sha 45600e650155e52c7f266a20bd4ce9241be47c01

Convert string dictionaries indexed by unsigned integers too

view details

Carol (Nichols || Goulding)

commit sha 4cde14eb1d88c680790cb9d9b83a1961107492ac

Convert one kind of primitive dictionary

view details

Carol (Nichols || Goulding)

commit sha e45265c8192f800e3fd5453641edc6cb351fbeb4

Update based on rebase

view details

Carol (Nichols || Goulding)

commit sha 3d27a0e1716f1edccc4bc07ebbeda4217329f7eb

cargo fmt

view details

Neville Dipale

commit sha f2f94fd8088a254eabf3d059578d4a7afba6cff2

Complete dictionary support

view details

Carol (Nichols || Goulding)

commit sha 9d692484aafee180023b16608012eaf917ed2b5d

Switch from general_err to unreachable

view details

Carol (Nichols || Goulding)

commit sha f3b287dfbb7fd41722c9659a61484e5cf948a3f1

Change match with one arm to an if let

view details

Carol (Nichols || Goulding)

commit sha bb5d5d7ba9187e9ab71be5eab2f1aad1b7ef912e

Remove some type aliases and calls to cast

view details

Carol (Nichols || Goulding)

commit sha a1c153f2ea097a8a732e1d0a35afca417a9d64d4

Remove RecordReader cast and the CastRecordReader trait

view details

Carol (Nichols || Goulding)

commit sha bfe76698ea7fce9e9d4b673639d755e9cf00701e

Remove some more type aliases

view details

Carol (Nichols || Goulding)

commit sha e15ecf79d1f56d3b72f6fb0396c4766e260adc4b

Move the CastConverter code into PrimitiveArrayReader

view details

Carol (Nichols || Goulding)

commit sha 7e3d54a2a573af35693bb3b183fdc8f2c29864ba

Remove now unneeded CastConverter and BoolConverter

view details

Carol (Nichols || Goulding)

commit sha c90485d7d7cce5a96109c58aed8e885eb58b8324

Remove a resolved TODO

view details

Carol (Nichols || Goulding)

commit sha 1cc53e447b8b532b6e32a22dda792b59054f3f81

Change a panic to unreachable

view details

push time in 2 days

create branch integer32llc/arrow

branch : update-cpp-comment

created branch time in 2 days

Pull request review comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

 fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(key_type, value_type) => {
+            use arrow_array::{
+                Int16DictionaryArray, Int32DictionaryArray, Int64DictionaryArray,
+                Int8DictionaryArray, PrimitiveArray, StringArray, UInt16DictionaryArray,
+                UInt32DictionaryArray, UInt64DictionaryArray, UInt8DictionaryArray,
+            };
+            use ArrowDataType::*;
+            use ColumnWriter::*;
+
+            let array = &**array;
+            let mut col_writer = get_col_writer(&mut row_group_writer)?;
+            let levels = levels.pop().expect("Levels exhausted");
+
+            macro_rules! dispatch_dictionary {
+                ($($kt: pat, $vt: pat, $w: ident => $kat: ty, $vat: ty,)*) => (
+                    match (&**key_type, &**value_type, &mut col_writer) {
+                        $(($kt, $vt, $w(writer)) => write_dict::<$kat, $vat, _>(array, writer, levels),)*
+                        (kt, vt, _) => panic!("Don't know how to write dictionary of <{:?}, {:?}>", kt, vt),

I think this should probably be unreachable!; similar to this spot, the code shouldn't be able to get here.

carols10cents

comment created time in 2 days
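
As a small illustration of the suggestion above (a sketch, not code from the PR), unreachable! documents that an arm cannot be hit given earlier validation, whereas panic! reads as an expected runtime failure; the function and type names here are hypothetical:

```rust
// Hypothetical dispatch: by the time we get here the schema has already been
// validated, so an unsupported key type indicates a bug, not a user error.
fn key_width(key_type: &str) -> usize {
    match key_type {
        "Int8" | "UInt8" => 1,
        "Int16" | "UInt16" => 2,
        "Int32" | "UInt32" => 4,
        "Int64" | "UInt64" => 8,
        other => unreachable!("dictionary keys are always integers, got {:?}", other),
    }
}
```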

PR closed rust-lang/book

A typo?
+1 -1

2 comments

1 changed file

imbolc

pr closed time in 2 days

pull request comment rust-lang/book

A typo?

Yes, I read this as "you'll see the server respond quickly" which sounds right to me. Going to leave this as-is. Thank you though!

imbolc

comment created time in 2 days

pull request comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

@nevi-me thank you so so so much! The code is SO much nicer now. I rebased this branch on the rust-parquet-arrow-writer branch, cherry-picked your latest commit, and made some further changes. I was able to get rid of the CastConverters and CastRecordReader!!

I left comments on the spots that I'm not sure how to resolve...

And yes, as you noted, I cherry-picked the "We need a custom comparison of ArrayData" commit from your ARROW-7842-cherry branch so that more tests would work on this branch. Do you think that commit is ready to go, even if the other commits on that branch aren't?

carols10cents

comment created time in 2 days

Pull request review comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

 where
             data_buffer.into_iter().map(Some).collect()
         };
 
-        self.converter.convert(data)
+        // TODO: I did this quickly without thinking through it, there might be edge cases to consider
+        let mut array = self.converter.convert(data)?;
+
+        if let ArrowType::Dictionary(_, _) = self.data_type {
+            array = arrow::compute::cast(&array, &self.data_type)?;

This is really the part I just couldn't see... the code is so much better now!!!!!

carols10cents

comment created time in 2 days

Pull request review comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

 where
             data_buffer.into_iter().map(Some).collect()
         };
 
-        self.converter.convert(data)
+        // TODO: I did this quickly without thinking through it, there might be edge cases to consider

This looks fine to me because this function originally returned this Result; I'm not sure what edge cases there might be...

carols10cents

comment created time in 2 days

Pull request review comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

 impl<T: DataType> ArrayReader for PrimitiveArrayReader<T> {
             }
         }
 
-        // convert to arrays
-        let array =
-            match (&self.data_type, T::get_physical_type()) {
-                (ArrowType::Boolean, PhysicalType::BOOLEAN) => {
-                    BoolConverter::new(BooleanArrayConverter {})
-                        .convert(self.record_reader.cast::<BoolType>())
-                }
-                (ArrowType::Int8, PhysicalType::INT32) => {
-                    Int8Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int16, PhysicalType::INT32) => {
-                    Int16Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int32, PhysicalType::INT32) => {
-                    Int32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt8, PhysicalType::INT32) => {
-                    UInt8Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt16, PhysicalType::INT32) => {
-                    UInt16Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::UInt32, PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Int64, PhysicalType::INT64) => {
-                    Int64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::UInt64, PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::Float32, PhysicalType::FLOAT) => Float32Converter::new()
-                    .convert(self.record_reader.cast::<FloatType>()),
-                (ArrowType::Float64, PhysicalType::DOUBLE) => Float64Converter::new()
-                    .convert(self.record_reader.cast::<DoubleType>()),
-                (ArrowType::Timestamp(unit, _), PhysicalType::INT64) => match unit {
-                    TimeUnit::Millisecond => TimestampMillisecondConverter::new()
-                        .convert(self.record_reader.cast::<Int64Type>()),
-                    TimeUnit::Microsecond => TimestampMicrosecondConverter::new()
-                        .convert(self.record_reader.cast::<Int64Type>()),
-                    _ => Err(general_err!("No conversion from parquet type to arrow type for timestamp with unit {:?}", unit)),
-                },
-                (ArrowType::Date32(unit), PhysicalType::INT32) => match unit {
-                    DateUnit::Day => Date32Converter::new()
-                        .convert(self.record_reader.cast::<Int32Type>()),
-                    _ => Err(general_err!("No conversion from parquet type to arrow type for date with unit {:?}", unit)),
-                }
-                (ArrowType::Time32(unit), PhysicalType::INT32) => {
-                    match unit {
-                        TimeUnit::Second => {
-                            Time32SecondConverter::new().convert(self.record_reader.cast::<Int32Type>())
-                        }
-                        TimeUnit::Millisecond => {
-                            Time32MillisecondConverter::new().convert(self.record_reader.cast::<Int32Type>())
-                        }
-                        _ => Err(general_err!("Invalid or unsupported arrow array with datatype {:?}", self.get_data_type()))
-                    }
-                }
-                (ArrowType::Time64(unit), PhysicalType::INT64) => {
-                    match unit {
-                        TimeUnit::Microsecond => {
-                            Time64MicrosecondConverter::new().convert(self.record_reader.cast::<Int64Type>())
-                        }
-                        TimeUnit::Nanosecond => {
-                            Time64NanosecondConverter::new().convert(self.record_reader.cast::<Int64Type>())
-                        }
-                        _ => Err(general_err!("Invalid or unsupported arrow array with datatype {:?}", self.get_data_type()))
-                    }
-                }
-                (ArrowType::Interval(IntervalUnit::YearMonth), PhysicalType::INT32) => {
-                    UInt32Converter::new().convert(self.record_reader.cast::<Int32Type>())
-                }
-                (ArrowType::Interval(IntervalUnit::DayTime), PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (ArrowType::Duration(_), PhysicalType::INT64) => {
-                    UInt64Converter::new().convert(self.record_reader.cast::<Int64Type>())
-                }
-                (arrow_type, physical_type) => Err(general_err!(
-                    "Reading {:?} type from parquet {:?} is not supported yet.",
-                    arrow_type,
-                    physical_type
-                )),
-            }?;
+        let arrow_data_type = match T::get_physical_type() {
+            PhysicalType::BOOLEAN => ArrowBooleanType::DATA_TYPE,
+            PhysicalType::INT32 => ArrowInt32Type::DATA_TYPE,
+            PhysicalType::INT64 => ArrowInt64Type::DATA_TYPE,
+            PhysicalType::FLOAT => ArrowFloat32Type::DATA_TYPE,
+            PhysicalType::DOUBLE => ArrowFloat64Type::DATA_TYPE,
+            PhysicalType::INT96
+            | PhysicalType::BYTE_ARRAY
+            | PhysicalType::FIXED_LEN_BYTE_ARRAY => {
+                unreachable!(
+                    "PrimitiveArrayReaders don't support complex physical types"
+                );
+            }
+        };
+
+        // Convert to arrays by using the Parquet phyisical type.
+        // The physical types are then cast to Arrow types if necessary
+
+        let mut record_data = self.record_reader.consume_record_data()?;
+
+        if T::get_physical_type() == PhysicalType::BOOLEAN {
+            let mut boolean_buffer = BooleanBufferBuilder::new(record_data.len());
+
+            for e in record_data.data() {
+                boolean_buffer.append(*e > 0)?;
+            }
+            record_data = boolean_buffer.finish();
+        }
+
+        let mut array_data = ArrayDataBuilder::new(arrow_data_type)
+            .len(self.record_reader.num_values())
+            .add_buffer(record_data);
+
+        if let Some(b) = self.record_reader.consume_bitmap_buffer()? {
+            array_data = array_data.null_bit_buffer(b);
+        }
+
+        let array = match T::get_physical_type() {
+            PhysicalType::BOOLEAN => {
+                Arc::new(PrimitiveArray::<ArrowBooleanType>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT32 => {
+                Arc::new(PrimitiveArray::<ArrowInt32Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT64 => {
+                Arc::new(PrimitiveArray::<ArrowInt64Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::FLOAT => {
+                Arc::new(PrimitiveArray::<ArrowFloat32Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::DOUBLE => {
+                Arc::new(PrimitiveArray::<ArrowFloat64Type>::from(array_data.build()))
+                    as ArrayRef
+            }
+            PhysicalType::INT96
+            | PhysicalType::BYTE_ARRAY
+            | PhysicalType::FIXED_LEN_BYTE_ARRAY => {
+                unreachable!(
+                    "PrimitiveArrayReaders don't support complex physical types"
+                );
+            }
+        };
+
+        // cast to Arrow type
+        // TODO: we need to check if it's fine for this to be fallible.
+        // My assumption is that we can't get to an illegal cast as we can only
+        // generate types that are supported, because we'd have gotten them from
+        // the metadata which was written to the Parquet sink

I'm not sure what needs to be checked to resolve this TODO :-/

carols10cents

comment created time in 2 days

Pull request review comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

 static Status GetDictionaryEncoding(FBB& fbb, const std::shared_ptr<Field>& fiel
                                     const DictionaryType& type, int64_t dictionary_id,
                                     DictionaryOffset* out) {
   // We assume that the dictionary index type (as an integer) has already been
-  // validated elsewhere, and can safely assume we are dealing with signed
-  // integers
+  // validated elsewhere, and can safely assume we are dealing with integers

This is just changing a comment to align with what the code is actually doing. I initially read the comment and thought the Rust code should assume it's dealing only with signed integers, and then I read the CPP code and realized the comment was out of date. This should have been updated with b1a7a73.

I'm happy to pull this commit out into a separate PR if you'd like?

carols10cents

comment created time in 2 days

push event integer32llc/arrow

Krisztián Szűcs

commit sha 0aa20697bcd3dbdf0daadb3409b6347f533be563

[Release] Update CHANGELOG.md for 2.0.0

view details

Krisztián Szűcs

commit sha e46a3c6f27eaf6ebe019336a8cbe92b747f0689a

[Release] Update .deb/.rpm changelogs for 2.0.0

view details

Krisztián Szűcs

commit sha 59434212298fdd3d3251c0bd46d6ba5207b41d3e

[Release] Update versions for 2.0.0

view details

Krisztián Szűcs

commit sha 478286658055bb91737394c2065b92a7e92fb0c1

[maven-release-plugin] prepare release apache-arrow-2.0.0

view details

Krisztián Szűcs

commit sha b1f36acca85d0845c1e64c0a3270651d4a1467b7

[Release] Update versions for 3.0.0-SNAPSHOT

view details

Krisztián Szűcs

commit sha f72575c4bf858a866984692fc1f939b56ec4069a

[Release] Update .deb package names for 3.0.0

view details

Yibo Cai

commit sha a3a35b232a1bb73673128645af09641d8d936f81

ARROW-10263: [C++][Compute] Improve variance kernel numerical stability Improve variance merging method to address stability issue when merging short chunks with approximate mean value. Improve reference variance accuracy by leveraging Kahan summation. Closes #8437 from cyb70289/variance-stability Authored-by: Yibo Cai <yibo.cai@arm.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Jorge C. Leitao

commit sha 91b5f07cbd50254aceab181c288bb9c5b3db8400

ARROW-10293: [Rust] [DataFusion] Fixed benchmarks The benchmarks were only benchmarking planning, not execution, of the plans. This PR fixes this. Closes #8452 from jorgecarleitao/bench Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Jorge C. Leitao

commit sha a030fc5d1bc5b1d6dac18b4d4e68492d00b94edf

ARROW-10295 [Rust] [DataFusion] Replace Rc<RefCell<>> by Box<> in accumulators. This PR replaces `Rc<RefCell<>>` by `Box<>`. We do not need interior mutability on the accumulations. Closes #8456 from jorgecarleitao/box Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Neville Dipale

commit sha 34533b6629fae74c1d5d1ea4809ac5a958685b96

ARROW-10289: [Rust] Read dictionaries in IPC streams We were reading dictionaries in the file reader, but not in the stream reader. This was a trivial change, as we needed to add the dictionary to the stream when we encounter it, and then read the next message until we reach a record batch. I tested with the 0.14.1 golden file, I'm going to test with later versions (1.0.0-littleendian) when I get to `arrow::ipc::MetadataVersion::V5` support, hopefully soon. Closes #8450 from nevi-me/ARROW-10289 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

view details

Jorge C. Leitao

commit sha 7209ffcb58e619ecdb26dc065b616b4f45248e92

ARROW-10292: [Rust] [DataFusion] Simplify merge Currently, `mergeExec` uses `tokio::spawn` to parallelize the work, by calling `tokio::spawn` once per logical thread. However, `tokio::spawn` returns a task / future, which `tokio` runtime will then schedule on its thread pool. Therefore, there is no need to limit the number of tasks to the number of logical threads, as tokio's runtime itself is responsible for that work. In particular, since we are using [`rt-threaded`](https://docs.rs/tokio/0.2.22/tokio/runtime/index.html#threaded-scheduler), tokio already declares a thread pool from the number of logical threads available. This PR removes the coupling, in `mergeExec`, between the number of logical threads (`max_concurrency`) and the number of created tasks. I observe no change in performance: <details> <summary>Benchmark results</summary> ``` Switched to branch 'simplify_merge' Your branch is up to date with 'origin/simplify_merge'. Compiling datafusion v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/datafusion) Finished bench [optimized] target(s) in 38.02s Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/aggregate_query_sql-5241a705a1ff29ae Gnuplot not found, using plotters backend aggregate_query_no_group_by 15 12 time: [715.17 us 722.60 us 730.19 us] change: [-8.3167% -5.2253% -2.2675%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) high mild 2 (2.00%) high severe aggregate_query_group_by 15 12 time: [5.6538 ms 5.6695 ms 5.6892 ms] change: [+0.1012% +0.5308% +0.9913%] (p = 0.02 < 0.05) Change within noise threshold. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe aggregate_query_group_by_with_filter 15 12 time: [2.6598 ms 2.6665 ms 2.6751 ms] change: [-0.5532% -0.1446% +0.2679%] (p = 0.51 > 0.05) No change in performance detected. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) high mild 4 (4.00%) high severe ``` </details> Closes #8453 from jorgecarleitao/simplify_merge Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

view details

Neal Richardson

commit sha 9e671ac105aebe9c3235e75aa367f59af40ed828

ARROW-10270: [R] Fix CSV timestamp_parsers test on R-devel Also adds a GHA job that tests on R-devel so we catch issues like this sooner. Closes #8447 from nealrichardson/r-timestamp-test Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

H-Plus-Time

commit sha 8f302d33bbb8408502e0a6fd2323f84ef49f8806

ARROW-9479: [JS] Fix Table.from for zero-item serialized tables, Table.empty for schemas containing compound types (List, FixedSizeList, Map) Steps for reproduction: ```js const foo = new arrow.List(new arrow.Field('bar', new arrow.Float64())) const table = arrow.Table.empty(foo) // ⚡ ``` The Data constructor assumes childData is either falsey, a zero-length array (still falsey, but worth distinguishing) or a non-zero length array of valid instances of Data or objects with a data property. Coercing undefineds to empty arrays a little earlier for compound types (List, FixedSizeList, Map) avoids this. Closes #7771 from H-Plus-Time/ARROW-9479 Authored-by: H-Plus-Time <Nicholas.Roberts.au@gmail.com> Signed-off-by: Brian Hulette <bhulette@google.com>

view details

Benjamin Kietzman

commit sha 03c7c023e639b8cae4c2b209c7ea0d2670970bdc

ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string Closes #8462 from bkietz/10145-Integer-like-partition-fi Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

view details

Benjamin Wilhelm

commit sha a7ef5d2e2084375303cc18dd39c391022f8e28ac

ARROW-10174: [Java] Fix reading/writing dict structs When translating between the memory FieldType and message FieldType for dictionary encoded vectors the children of the dictionary field were not handled correctly. * When going from memory format to message format the Field must have the children of the dictionary field. * When going from message format to memory format the Field must have no children but the dictionary must have the mapped children Closes #8363 from HedgehogCode/bug/ARROW-10174-dict-structs Authored-by: Benjamin Wilhelm <benjamin.wilhelm@knime.com> Signed-off-by: liyafan82 <fan_li_ya@foxmail.com>

view details

alamb

commit sha 3f69ad2004fa5afbaddcda61a9bb6fb3cca266e5

ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion This is a PR incorporating the feedback from @nevi-me and @jorgecarleitao from https://github.com/apache/arrow/pull/8400 It adds 1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao / @nevi-me in https://github.com/apache/arrow/pull/8400#discussion_r501850814) that encodes the valid type casting 2. A test that ensures `can_cast_types` and `cast` remain in sync 3. Bug fixes that the test above uncovered (I'll comment inline) 4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in https://github.com/apache/arrow/pull/8399 and this PR. See also https://github.com/apache/arrow/pull/8340#discussion_r501257096 for more discussion. I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`. Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see https://github.com/apache/arrow/pull/8400#discussion_r504278902), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync. Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2 Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

liyafan82

commit sha 22027c7b141fa2fbe168c2d31b5aa236caf62bd7

ARROW-10294: [Java] Resolve problems of DecimalVector APIs on ArrowBufs Unlike other fixed width vectors, DecimalVectors have some APIs that directly manipulate an ArrowBuf (e.g. `void set(int index, int isSet, int start, ArrowBuf buffer)`. After supporting 64-bit ArrowBufs, we need to adjust such APIs so that they work properly. Closes #8455 from liyafan82/fly_1012_dec Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

Hongze Zhang

commit sha cb5814605d170c0c5ccc812216b0149041cf7015

ARROW-9475: [Java] Clean up usages of BaseAllocator, use BufferAllocator in… …stead Issue link: https://issues.apache.org/jira/browse/ARROW-9475. Closes #7768 from zhztheplayer/ARROW-9475 Authored-by: Hongze Zhang <hongze.zhang@intel.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

Antoine Pitrou

commit sha 3f96cc0fc1e7692a77589d4aba044c9916f32f48

ARROW-10313: [C++] Faster UTF8 validation for small strings This improves CSV string conversion performance by about 30%. Closes #8470 from pitrou/ARROW-10313-faster-utf8-validate Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Projjal Chanda

commit sha 36bf7a43eefd3bc5563fcfa8d04bc35a49c97978

ARROW-9898: [C++][Gandiva] Fix linking issue with castINT/FLOAT functions Moving the castint/float functions to gdv_function_stubs outside of precompiled module Closes #8096 from projjal/castint and squashes the following commits: 85179a593 <Projjal Chanda> moved castInt to gdv_fn_stubs c09077e92 <Projjal Chanda> fixed castfloat function ddc429d74 <Projjal Chanda> added java test case f666f5488 <Projjal Chanda> fix error handling in castint Authored-by: Projjal Chanda <iam@pchanda.com> Signed-off-by: Praveen <praveen@dremio.com>

view details

push time in 2 days

delete branch integer32llc/arrow

delete branch : path-internal

delete time in 2 days

PR closed apache/arrow

[Rust] [Parquet] Start porting path_internal from C++ to handle def_levels and rep_levels lang-rust

Hey @nevi-me, I was looking into helping out with the def_levels and rep_levels handling in get_levels in arrow_writer.rs, and the logic is... quite complex! I honestly have no idea how you're planning to get it to be the same as the C++ code without a direct port of the C++ algorithm; the code looks completely different right now so I feel like I have no chance of helping to fix the Rust code by looking at the C++ code.

I feel like you've mentioned that you're working on fixing the def/rep level stuff somewhere and that it was taking a while; what do you think of a more direct correspondence with the C++ code that I've started here? Are there reasons not to go this way?

I wanted to see if this kind of direction would be helpful in getting the rust-parquet-arrow-writer branch merged into master, or if this is too much or too little or the wrong direction. Please advise, thank you!

+4655 -885

2 comments

84 changed files

carols10cents

pr closed time in 2 days
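
For readers unfamiliar with the def_levels and rep_levels discussed in this PR, here is a small worked example of the Dremel-style encoding; it is not taken from the PR, and the column shape and values are hypothetical:

```rust
#[test]
fn levels_for_optional_list_of_optional_int32() {
    // Column: an optional list of optional int32 values, so
    // max definition level = 3 (optional list, repeated entry, optional value)
    // max repetition level = 1 (one repeated level in the path).
    //
    // Rows: [[1, 2], null, [], [3, null]]
    //
    //   value        rep  def
    //   1            0    3   first value of a new row, fully defined
    //   2            1    3   continues the same list
    //   (null row)   0    0   the list itself is null
    //   (empty list) 0    1   the list exists but has no elements
    //   3            0    3   first value of the fourth row
    //   (null elem)  1    2   the element slot exists but the value is null
    let expected_rep_levels: Vec<i16> = vec![0, 1, 0, 0, 0, 1];
    let expected_def_levels: Vec<i16> = vec![3, 3, 0, 1, 3, 2];

    assert_eq!(expected_rep_levels.len(), expected_def_levels.len());
}
```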

pull request comment apache/arrow

[Rust] [Parquet] Start porting path_internal from C++ to handle def_levels and rep_levels

Talked with nevi out-of-band, going to close this :) Sorry for the noise!

carols10cents

comment created time in 2 days

PR opened apache/arrow

[Rust] [Parquet] Start porting path_internal from C++ to handle def_levels and rep_levels

Hey @nevi-me, I was looking into helping out with the def_levels and rep_levels handling in get_levels in arrow_writer.rs, and the logic is... quite complex! I honestly have no idea how you're planning to get it to be the same as the C++ code without a direct port of the C++ algorithm; the code looks completely different right now so I feel like I have no chance of helping to fix the Rust code by looking at the C++ code.

I feel like you've mentioned that you're working on fixing the def/rep level stuff somewhere and that it was taking a while; what do you think of a more direct correspondence with the C++ code that I've started here? Are there reasons not to go this way?

I wanted to see if this kind of direction would be helpful in getting the rust-parquet-arrow-writer branch merged into master, or if this is too much or too little or the wrong direction. Please advise, thank you!

+86 -0

0 comments

2 changed files

pr created time in 5 days

create branch integer32llc/arrow

branch : path-internal

created branch time in 5 days

pull request comment apache/arrow

One definition/repetition level test

@nevi-me So @shepmaster and I wrote some more tests, but they're failing and we're not sure if our setup is wrong or if they're expected to fail? What do you think?

carols10cents

comment created time in 5 days

push event integer32llc/arrow

Benjamin Kietzman

commit sha ae396b9d4c26621cba2cce955f1d55f43e8faab9

ARROW-9782: [C++][Dataset] More configurable Dataset writing Python: - ParquetFileFormat.write_options has been removed - Added classes {,Parquet,Ipc}FileWriteOptions - FileWriteOptions are constructed using FileFormat.make_write_options(...) - FileWriteOptions are passed as a parameter to _filesystemdataset_write() R: - FileWriteOptions$create(...) to make write options; no subclasses exposed in R - A filter() on the dataset is applied to restrict written rows. C++: - FileSystemDataset::Write's parameters have been consolidated into - A Scanner, from which the batches to be written are pulled - A FileSystemDatasetWriteOptions, which is an options struct specifying - destination filesystem - base directory - partitioning - basenames (via a string template, ex "dat_{i}.feather") - format specific write options - Format specific write options are represented using the FileWriteOptions hierarchy. An instance of these can be constructed from a format using FileFormat::DefaultWriteOptions(), after which the instance can be modified. - ParquetFileFormat::{writer_properties, arrow_writer_properties} have been moved to ParquetFileWriteOptions, an implementation of FileWriteOptions. Internal C++: - Individual files can now be incrementally written using a FileWriter, constructible from a format using FileFormat::MakeWriter - FileSystemDataset::Write now parallelizes across scan tasks rather than fragments, so there will be no difference in performance for different arrangements of tables/batches/lists of tables and batches when writing from memory - FileSystemDataset::Write::WriteQueue provides a threadsafe channel for batches awaiting write, allowing threads to produce batches as another thread flushes the queue to disk. Closes #8305 from bkietz/9782-more-configurable-writing Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

view details

Benjamin Kietzman

commit sha 1150c385278b0ea49596326f39421bf5e317d338

ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups Closes #8317 from bkietz/10134-Add-ParquetFileFragmentnu Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

view details

kanga333

commit sha 878c53455bf9de610e184fd92c93517e137b5228

ARROW-10227: [Ruby] Use a table size as the default for parquet chunk_size A chunk_size that is too small will cause metadata bloat in the parquet file, leading to poor read performance. Set the chunk_size to be the same value as the table size so that one file becomes one row_group. Closes #8391 from kanga333/ruby-use-table-size-for-chunk-size Authored-by: kanga333 <e411z7t40w@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Micah Kornfield

commit sha d4cbc4b7aab5d37262b83e972af4bd7cb44c7a5c

ARROW-10229: [C++] Remove errant log line noticed this on rereviewing the merged code. Closes #8392 from emkornfield/remove_log Authored-by: Micah Kornfield <emkornfield@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

naman1996

commit sha 54199ecad388d322b4ac0a8f96ab2b3ebc730b9d

ARROW-10023: [C++][Gandiva] Implement split_part function in gandiva Closes #8231 from naman1996/ARROW-10023 and squashes the following commits: 13ef67943 <naman1996> force inline and out_len refactor 7dadc4d8a <naman1996> Changes to set out_len = 0 3ba0a3bcc <naman1996> Fixing ASAN warning d6da07a7d <naman1996> fixing small lint issue a17eedd1d <naman1996> Fixing typo b6c85503a <naman1996> Removing usage of std::string and adding some unit test for UTF8 style strings cb843670b <naman1996> fixing lint errores 899ccf075 <naman1996> changes for split_part function 3d716f61b <naman1996> adding split_string function with unit tests Authored-by: naman1996 <namanudasi160196@gmail.com–> Signed-off-by: Praveen <praveen@dremio.com>

view details

arw2019

commit sha ba7ee65422033e11d453ba63bb0eb0108d5183be

ARROW-9967: [Python] Add compute module docs + expose more option classes #8163 exposes `pyarrow.compute` kernels and generates their docstrings. This PR adds documentation for the module in the User Guide and the Python API reference. Closes #8145 from arw2019/ARROW-7871 Lead-authored-by: arw2019 <andrew.r.wieteska@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Jörn Horstmann

commit sha 20f2bd49fc95e3ebd73ba6aa7fdf8f1451b7dd40

ARROW-10040: [Rust] Iterate over and combine boolean buffers with arbitrary offsets @nevi-me this is the chunked iterator based approach i mentioned in #8223 I'm not fully satisfied with the solution yet: - I'd prefer to move all the bit-based functions into `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`. - I need to do some benchmarking about how much the `packed_simd` implementation actually helps. If it's not a big difference I'd propose to remove it to simplify the code. Closes #8262 from jhorstmann/ARROW-10040-unaligned-bit-buffers Authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

alamb

commit sha 8447bb12b550610842b6a88b0d92e4178a63bf20

ARROW-10235: [Rust][DataFusion] Improve documentation for type coercion The code / comments for type coercion are a little confusing and don't make the distinction between coercion and casting clear – this PR attempts to clarify the intent, channeling the information from @jorgecarleitao here: https://github.com/apache/arrow/pull/8340#discussion_r501257096 Closes #8399 from alamb/alamb/coercion-docs Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

view details

Romain Francois

commit sha 74903919625220f5d67a96cf21411acb645c789f

ARROW-6537 [R]: Pass column_types to CSV reader Either passing down NULL or a Schema. But perhaps a schema is confusing because the only thing that is being controlled by it here is the types, not their order etc .. which I believe feels implied if you supply a schema. Closes #7807 from romainfrancois/ARROW-6537/column_types Lead-authored-by: Romain Francois <romain@rstudio.com> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

alamb

commit sha 4bbb74713c6883e8523eeeb5ac80a1e1f8521674

ARROW-10233: [Rust] Make array_value_to_string available in all Arrow builds This PR makes `array_value_to_string` available to all arrow builds. Currently it is only available if the `feature = "prettyprint"` is enabled which is not the default. The full `print_batches` and `pretty_format_batches` (and the libraries they depend on) are still only available of the feature flag is set. The rationale for making this change is that I want to be able to use `array_value_to_string` to write tests (such as on https://github.com/apache/arrow/pull/8346) but currently it is only available when `feature = "prettyprint"` is enabled. It appears that @nevi-me made prettyprint compilation optional so that arrow could be compiled for wasm in https://github.com/apache/arrow/pull/7400. https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to some dependency of pretty-table; `array_value_to_string` has no needed dependencies. Note I tried to compile ARROW again using the `wasm32-unknown-unknown` target on master and it fails (perhaps due to a new dependency that was added?): <details> <summary>Click to expand!</summary> ``` alamb@ip-192-168-0-182 rust % git log | head -n 1 git log | head -n 1 commit d4cbc4b7aab5d37262b83e972af4bd7cb44c7a5c alamb@ip-192-168-0-182 rust % git status git status On branch master Your branch is up to date with 'upstream/master'. nothing to commit, working tree clean alamb@ip-192-168-0-182 rust % alamb@ip-192-168-0-182 rust % cargo build --target=wasm32-unknown-unknown cargo build --target=wasm32-unknown-unknown Compiling cfg-if v0.1.10 Compiling lazy_static v1.4.0 Compiling futures-core v0.3.5 Compiling slab v0.4.2 Compiling futures-sink v0.3.5 Compiling once_cell v1.4.0 Compiling pin-utils v0.1.0 Compiling futures-io v0.3.5 Compiling itoa v0.4.5 Compiling bytes v0.5.4 Compiling fnv v1.0.7 Compiling iovec v0.1.4 Compiling unicode-width v0.1.7 Compiling pin-project-lite v0.1.7 Compiling ppv-lite86 v0.2.8 Compiling atty v0.2.14 Compiling dirs v1.0.5 Compiling smallvec v1.4.0 Compiling regex-syntax v0.6.18 Compiling encode_unicode v0.3.6 Compiling hex v0.4.2 Compiling tower-service v0.3.0 error[E0433]: failed to resolve: could not find `unix` in `os` --> /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18 | 41 | use std::os::unix::ffi::OsStringExt; | ^^^^ could not find `unix` in `os` error[E0432]: unresolved import `unix` --> /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5 | 6 | use unix; | ^^^^ no `unix` in the root Compiling alloc-no-stdlib v2.0.1 Compiling adler32 v1.0.4 error[E0599]: no function or associated item named `from_vec` found for struct `std::ffi::OsString` in the current scope --> /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:48:34 | 48 | Some(PathBuf::from(OsString::from_vec(out))) | ^^^^^^^^ function or associated item not found in `std::ffi::OsString` | = help: items from traits can only be used if the trait is in scope = note: the following trait is implemented but not in scope; perhaps add a `use` for it: `use std::sys_common::os_str_bytes::OsStringExt;` error: aborting due to 3 previous errors Some errors have detailed explanations: E0432, E0433, E0599. For more information about an error, try `rustc --explain E0432`. error: could not compile `dirs`. To learn more, run the command again with --verbose. warning: build failed, waiting for other jobs to finish... 
error: build failed alamb@ip-192-168-0-182 rust % ``` </details> Closes #8397 from alamb/alamb/consolidate-array-value-to-string Lead-authored-by: alamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

view details

Sutou Kouhei

commit sha 945f649c2a67c684ebab6a8e91901638d4c11b7a

ARROW-9414: [Packaging][deb][RPM] Enable S3 Closes #8394 from kou/packaing-linux-s3 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Jörn Horstmann

commit sha 0100121f92299d68b348206288f12c43c44110e4

ARROW-10015: [Rust] Simd aggregate kernels Built on top of [ARROW-10040][1] (#8262) Benchmarks (run on a Ryzen 3700U laptop with some thermal problems) Current master without simd: ``` $ cargo bench --bench aggregate_kernels sum 512 time: [3.9652 us 3.9722 us 3.9819 us] change: [-0.2270% -0.0896% +0.0672%] (p = 0.23 > 0.05) No change in performance detected. Found 14 outliers among 100 measurements (14.00%) 4 (4.00%) high mild 10 (10.00%) high severe sum nulls 512 time: [9.4577 us 9.4796 us 9.5112 us] change: [+2.9175% +3.1309% +3.3937%] (p = 0.00 < 0.05) Performance has regressed. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) high mild 3 (3.00%) high severe ``` This branch without simd (speedup probably due to accessing the data via a slice): ``` sum 512 time: [1.1066 us 1.1113 us 1.1168 us] change: [-72.648% -72.480% -72.310%] (p = 0.00 < 0.05) Performance has improved. Found 13 outliers among 100 measurements (13.00%) 7 (7.00%) high mild 6 (6.00%) high severe sum nulls 512 time: [1.3279 us 1.3364 us 1.3469 us] change: [-86.326% -86.209% -86.085%] (p = 0.00 < 0.05) Performance has improved. Found 20 outliers among 100 measurements (20.00%) 4 (4.00%) high mild 16 (16.00%) high severe ``` This branch with simd: ``` sum 512 time: [108.58 ns 109.47 ns 110.57 ns] change: [-90.164% -90.033% -89.850%] (p = 0.00 < 0.05) Performance has improved. Found 11 outliers among 100 measurements (11.00%) 1 (1.00%) high mild 10 (10.00%) high severe sum nulls 512 time: [249.95 ns 250.50 ns 251.06 ns] change: [-81.420% -81.281% -81.157%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe ``` [1]: https://issues.apache.org/jira/browse/ARROW-10040 Closes #8370 from jhorstmann/ARROW-10015-simd-aggregate-kernels Lead-authored-by: Jörn Horstmann <git@jhorstmann.net> Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Daniel Russo

commit sha 1c7581cf7779e316da484c20c4aecaa6c89ea23f

ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col) This is a proposal for an initial and partial implementation of the `DISTINCT` keyword. Only `COUNT(DISTINCT)` is supported, with the following conditions: (a) only one argument, i.e. `COUNT(DISTINCT col)`, but not `COUNT(DISTINCT col, other)`, (b) the argument is an integer type, and (c) the query must have a `GROUP BY` clause. **Implementation Overview:** The `Expr::AggregateFunction` variant has a new field, `distinct`, which mirrors the `distinct` flag from `SQLExpr::Function` (up until now this flag was unused). Any `Expr::AggregateFunction` may have its `distinct` flag switched to `true` if the keyword is present in the SQL query. However, the physical planner respects it only for `COUNT` expressions. The count distinct aggregation slots into the existing physical plans as a new set of `AggregateExpr`. To demonstrate, below are examples of the physical plans for the following query, where `c1` may be any data type, and `c2` is a `UInt8` column: ``` SELECT c1, COUNT(DISTINCT c2) FROM t1 GROUP BY c1 ``` (a) Multiple Partitions: HashAggregateExec: mode: Final group_expr: Column(c1) aggr_expr: DistinctCountReduce(Column(c2)) schema: c1: any c2: UInt64 input: MergeExec: input: HashAggregateExec: mode: Partial group_expr: Column(c1) aggr_expr: DistinctCount(Column(c2)) schema: c1: any c2: LargeList(UInt8) input: CsvExec: schema: c1: any c2: UInt8 The `DistinctCount` accumulates each `UInt8` into a list of distinct `UInt8`. No counts are collected yet, this is a partial result: lists of distinct values. In the `RecordBatch`, this is a `LargeListArray<UInt8>` column. After the `MergeExec`, each list in `LargeListArray<UInt8>` is accumulated by `DistinctCountReduce` (via `accumulate_batch()`), producing the _final_ sets of distinct values. Finally, given the finalized sets of distinct values, the counts are computed (always as `UInt64`). (b) Single Partition: HashAggregateExec: mode: NoPartial group_expr: Column(c1) aggr_expr: DistinctCountReduce(Column(c2)) schema: c1: any c2: UInt64 input: CsvExec: schema: c1: any c2: UInt8 This scenario is unlike the multiple partition scenario: `DistinctCount` is _not_ used, and there are no partial sets of distinct values. Rather, in a single `HashAggregateExec` stage, each `UInt8` is accumulated into a distinct value set, then the counts are computed at the end of the stage. `DistinctCountReduce` is used, but note that unlike the multiple partition case, it accumulates scalars via `accumulate_scalar()`. There is a new aggregation mode: `NoPartial`. In summary, the modes are: - `NoPartial`: used in single-stage aggregations - `Partial`: used as the first stage of two-stage aggregations - `Final`: used as the second stage of two-stage aggregaions Prior to the new `NoPartial` mode, `Partial` was handling both of what are now the responsibilities of `Partial` and `NoPartial`. No distinction was required, because _non-distinct_ aggregations (such as count, sum, min, max, and avg) do not need the distinction: the first aggregation stage is always the same, regardless of whether the aggregation is one-stage or two-stage. This is not the case for a _distinct_ count aggregation, and we can see that in the physical plans above. Closes #8222 from drusso/ARROW-10043 Authored-by: Daniel Russo <danrusso@gmail.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

alamb

commit sha 4c101efed5d698bbd6c005dbd68b40df06596ae7

ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel This PR adds support to the rust compute kernel casting `DictionaryArray` to/from `PrimitiveArray`/`StringArray` It does not include full support for other types such as `LargeString` or `Binary` (though the code could be extended fairly easily following the same pattern). However, my usecase doesn't need `LargeString` or `Binary` so I am trying to get the support I need in rather than fully flesh out the library Closes #8346 from alamb/alamb/ARROW-10164-dictionary-casts-2 Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Eric Erhardt

commit sha beb031f3a49d6db0c88788ce4d05d2daa1afda02

ARROW-10238: [C#] List<Struct> is broken When reading the flatbuffer type information, we are reading the ListType incorrectly. We should be using the childFields parameter, like we do for StructTypes. I also took this opportunity to redo how we compare schema information in our tests. For nested types, we need to recursively compare types all the way down. @pgovind @HashidaTKS @chutchinson Closes #8404 from eerhardt/FixListOfStruct Authored-by: Eric Erhardt <eric.erhardt@microsoft.com> Signed-off-by: Eric Erhardt <eric.erhardt@microsoft.com>

view details

Benjamin Kietzman

commit sha 109f701d805b761710f8b08f76be17ff16b3c3fd

ARROW-10237: [C++] Duplicate dict values cause corrupt parquet Fix suggested by @pitrou Closes #8403 from bkietz/10237-Duplicate-values-in-a-dic Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

view details

Sutou Kouhei

commit sha f0f7593f7035e990da81d6878b9025ae96e43c4e

ARROW-10239: [C++] Add missing zlib dependency to aws-sdk-cpp Closes #8406 from kou/cpp-aws-sdk-cpp-zlib Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Uwe L. Korn

commit sha d908bc83ad1c5310c66c314a476275aa660cb46f

ARROW-9879: [Python] Add support for numpy scalars to ChunkedArray.__getitem__ FYI @marc9595 Closes #8072 from xhochy/ARROW-9879 Lead-authored-by: Uwe L. Korn <uwelk@xhochy.com> Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

naman1996

commit sha f2ad6a95acf8ff27906b0a64d330c74387767789

ARROW-9956: [C++] [Gandiva] Implementation of binary_string function in gandiva Function takes in a normal string or a hexadecimal encoded string( Eg: \xca\xfe\xba\xbe) and converts it to VARBINARY (byte array). Closes #8201 from naman1996/ARROW-9956 and squashes the following commits: f3abfd627 <naman1996> Force inlining helper function cf335be68 <naman1996> Removing convert_fromUTF8_binary 746f993a5 <naman1996> removing include of arrow/util/string.h af6e12e36 <naman1996> Changes to remove parseHexValue 5d1c90a10 <naman1996> Correcting typo f970214a3 <naman1996> Removing error thrown by ParseHexValue for parity with java implementation 11572ceb4 <naman1996> setting out_len to 0 90a7798f1 <naman1996> Making char array null terminated for failing unit test in ubuntu af3785a1d <naman1996> Fixing small linting error 4d09d154e <naman1996> Changes to remove std::string d3afcb91a <naman1996> refactor to use arrow::ParseHexValue 8b4f563cd <naman1996> fixing test dec947544 <naman1996> fixing test issue 52a1708de <naman1996> fixing lint errors 562b285c2 <naman1996> correcting linting errors 1f7fb91eb <naman1996> handling null string case 58efbb93c <naman1996> adding binary string function Authored-by: naman1996 <namanudasi160196@gmail.com–> Signed-off-by: Praveen <praveen@dremio.com>

view details

Joris Van den Bossche

commit sha 599b458c68dfcba38fe5448913d4bb69723e1439

ARROW-9518: [Python] Deprecate pyarrow serialization Closes #8255 from jorisvandenbossche/ARROW-9518-deprecate-serialize Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

push time in 5 days

pull request comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

@carols10cents -- one idea I had which might be less efficient at runtime but possibly be less complicated to implement, would be to use the arrow cast kernels here: https://github.com/apache/arrow/blob/master/rust/arrow/src/compute/kernels/cast.rs

So rather than going directly from ParquetType to DesiredArrowType we could go from ParquetType --> CanonicalArrowType and then from CanonicalArrowType --> DesiredArrowType

So for example, to generate a Dictionary<UInt8, Utf8> from a parquet column of Utf8 you could always create Dictionary<Uint64, Utf8> and then use cast to go to the desired arrow type

Does that make sense?

Not really, because I am using the cast kernels in the Converter: 4b59fc9 (#8402) in the style of the other converters in that file, so I'm not sure how to rearrange that to reduce complexity :-/ Could you possibly put together a code sketch of what you mean?

carols10cents

comment created time in 5 days
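
A minimal sketch of the two-step idea quoted above: build a canonical Arrow array first, then let the cast kernel produce the requested dictionary type. This is illustrative only; it assumes the cast kernel's dictionary support from ARROW-10164, and the function name and UInt8 key width are made up:

```rust
use std::sync::Arc;

use arrow::array::{ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

// Build a plain Utf8 array (the canonical representation), then cast it to
// the dictionary-encoded type the schema asks for.
fn to_small_dictionary(values: Vec<&str>) -> Result<ArrayRef> {
    let canonical: ArrayRef = Arc::new(StringArray::from(values));
    let target =
        DataType::Dictionary(Box::new(DataType::UInt8), Box::new(DataType::Utf8));
    cast(&canonical, &target)
}
```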

pull request comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

@vertexclique @nevi-me I'm feeling stuck on converting primitive dictionaries...

I have a solution that works for one key/value type, but I tried to expand that to all the types and it involves listing out all the possible combinations (😱) and overflows the stack (😱😱😱).

I have tried to find a different abstraction, though, and the type checker doesn't like anything I've come up with. Do you have any suggestions?

carols10cents

comment created time in 6 days
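
To make the combinatorial problem described above concrete, here is a toy sketch, not the PR code, of the kind of macro that expands one match arm per (key type, value type) pair; the enums and function are stand-ins for the real Arrow types, and the point is that the arm list grows multiplicatively with the number of supported combinations:

```rust
// Toy stand-ins for the real dictionary key and value types.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum Key { U8, U16, U32 }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
enum Value { Int32, Utf8 }

// In the real code this would be something like write_dict::<K, V, _>(...).
fn write_dict(key: Key, value: Value) {
    println!("would write Dictionary<{:?}, {:?}>", key, value);
}

// One match arm per supported (key type, value type) pair.
macro_rules! dispatch_dictionary {
    ($k:expr, $v:expr; $(($kp:pat, $vp:pat)),* $(,)?) => {{
        let (key, value) = ($k, $v);
        match (key, value) {
            $(($kp, $vp) => write_dict(key, value),)*
            _ => unreachable!("unsupported dictionary <{:?}, {:?}>", key, value),
        }
    }};
}

fn main() {
    // Listing every supported pair means (number of key types) x (number of
    // value types) arms; with all of Arrow's integer key types and value
    // types, that expansion is what became unwieldy.
    dispatch_dictionary!(
        Key::U8, Value::Utf8;
        (Key::U8, Value::Utf8),
        (Key::U16, Value::Utf8),
        (Key::U32, Value::Int32),
    );
}
```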

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha 4b59fc952336bee757cf27ae596b56edbccf099a

Convert one kind of primitive dictionary

view details

Carol (Nichols || Goulding)

commit sha 79b78d97f5457895f2e96a39dcc341e29a588058

Try to support all key/value combinations; this overflows the stack

view details

push time in 6 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha 2bf54f9a45654e1b816e69f287b41cdc7801dbb7

failing test

view details

push time in 6 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha 443efedfbd1b530bb60c0a938002e9aa27c2fe2c

Serialize unsigned int dictionary index types As the C++ implementation was updated to do in b1a7a73ff2, and as supported by the unsigned integer types that implement ArrowDictionaryKeyType.

view details

Carol (Nichols || Goulding)

commit sha 4a0c8362d8598f8e12d1a958075a76885697634b

Update comment to match change made in b1a7a73ff2 Dictionaries can be indexed by either signed or unsigned integers.

view details

Carol (Nichols || Goulding)

commit sha eec817606d34e6a350a249db8c3c8526f9761d8f

Add a failing test for string dictionary indexed by an unsinged int

view details

Carol (Nichols || Goulding)

commit sha 3e94ca680af11f22980f9e2a7ad6bfc78df814c9

Extract a method for converting dictionaries

view details

Carol (Nichols || Goulding)

commit sha acf6a72cda61de510ff853b28a15e056cc153a8e

Extract a macro for string dictionary conversion

view details

Carol (Nichols || Goulding)

commit sha 59f3bb93aee9d18e0de7aac4691c62dd7a367ef6

Convert string dictionaries indexed by unsigned integers too

view details

Carol (Nichols || Goulding)

commit sha 27427fff91717896ecf08ab722eec3602391d052

failing test

view details

Carol (Nichols || Goulding)

commit sha 32f724ea39ca58251125ccef67d6f2e938aa02b5

progress

view details

Carol (Nichols || Goulding)

commit sha 06733614a9653e9a753ad6d43243e0a8c253312c

Rearrange to handle all dictionary types first

view details

Carol (Nichols || Goulding)

commit sha 3b80244089c6436f1b257807f25951c83c0b3f90

omg monomorphized and hacky converted but the test passes

view details

Carol (Nichols || Goulding)

commit sha 9e3b0d486b0eaae10df14e536e246ba9339ff1d0

got rid of the hacky cast

view details

Carol (Nichols || Goulding)

commit sha 7f22a20f6f273b1633e43765bb7cc5a74050d32b

a little better

view details

Carol (Nichols || Goulding)

commit sha 6b5c5a8abcead0e4f1b73645bd6704004d87bc44

more

view details

Carol (Nichols || Goulding)

commit sha 727d7a610e1bab3fb118fc1b638fa83b2dbd699d

separate strings more for a second

view details

Carol (Nichols || Goulding)

commit sha 7608e7a26ac264570534a0e35f7f5aff08f4d94c

getting there

view details

Carol (Nichols || Goulding)

commit sha 67757e17ac6a564c839404f39c8053229a91714c

Add some type aliases

view details

push time in 7 days

PR closed rust-lang/book

add missing word
+1 -1

1 comment

1 changed file

Uniminin

pr closed time in 9 days

pull request comment rust-lang/book

add missing word

Sorry, I prefer this the way it is, so I won't be merging this pull request.

Uniminin

comment created time in 9 days

push event integer32llc/arrow

alamb

commit sha 1d10f2290da1bd2af6cc8305e4ae55fd6790e13a

ARROW-10236: [Rust] Add can_cast_types to arrow cast kernel, use in DataFusion This is a PR incorporating the feedback from @nevi-me and @jorgecarleitao from https://github.com/apache/arrow/pull/8400 It adds 1. a `can_cast_types` function to the Arrow cast kernel (as suggested by @jorgecarleitao / @nevi-me in https://github.com/apache/arrow/pull/8400#discussion_r501850814) that encodes the valid type casting 2. A test that ensures `can_cast_types` and `cast` remain in sync 3. Bug fixes that the test above uncovered (I'll comment inline) 4. Change DataFuson to use `can_cast_types` so that it plans casting consistently with what arrow allows Previously the notions of coercion and casting were somewhat conflated in DataFusion. I have tried to clarify them in https://github.com/apache/arrow/pull/8399 and this PR. See also https://github.com/apache/arrow/pull/8340#discussion_r501257096 for more discussion. I am adding this functionality so DataFusion gains rudimentary support `DictionaryArray`. Codewise, I am concerned about the duplication in logic between the match statements in `cast` and `can_cast_types. I have some thoughts on how to unify them (see https://github.com/apache/arrow/pull/8400#discussion_r504278902), but I don't have time to implement that as it is a bigger change. I think this approach with some duplication is ok, and the test will ensure they remain in sync. Closes #8460 from alamb/alamb/ARROW-10236-casting-rules-2 Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

liyafan82

commit sha 18495e073d9d8b4d9a6f7b08f37860cc09d2637a

ARROW-10294: [Java] Resolve problems of DecimalVector APIs on ArrowBufs Unlike other fixed width vectors, DecimalVectors have some APIs that directly manipulate an ArrowBuf (e.g. `void set(int index, int isSet, int start, ArrowBuf buffer)`. After supporting 64-bit ArrowBufs, we need to adjust such APIs so that they work properly. Closes #8455 from liyafan82/fly_1012_dec Authored-by: liyafan82 <fan_li_ya@foxmail.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

Hongze Zhang

commit sha 7189b91ebe246dac6cbbafc03e1bba48985e430c

ARROW-9475: [Java] Clean up usages of BaseAllocator, use BufferAllocator in… …stead Issue link: https://issues.apache.org/jira/browse/ARROW-9475. Closes #7768 from zhztheplayer/ARROW-9475 Authored-by: Hongze Zhang <hongze.zhang@intel.com> Signed-off-by: Micah Kornfield <emkornfield@gmail.com>

view details

Antoine Pitrou

commit sha 2510f4fd32cefe333ac7340f5dca9a5907b114e5

ARROW-10313: [C++] Faster UTF8 validation for small strings This improves CSV string conversion performance by about 30%. Closes #8470 from pitrou/ARROW-10313-faster-utf8-validate Authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Projjal Chanda

commit sha f58db451f27610577f47ca6787bd7ff17e556355

ARROW-9898: [C++][Gandiva] Fix linking issue with castINT/FLOAT functions Moving the castint/float functions to gdv_function_stubs outside of precompiled module Closes #8096 from projjal/castint and squashes the following commits: 85179a593 <Projjal Chanda> moved castInt to gdv_fn_stubs c09077e92 <Projjal Chanda> fixed castfloat function ddc429d74 <Projjal Chanda> added java test case f666f5488 <Projjal Chanda> fix error handling in castint Authored-by: Projjal Chanda <iam@pchanda.com> Signed-off-by: Praveen <praveen@dremio.com>

view details

Krisztián Szűcs

commit sha 487895fe10540488f99d7d26f0a3b5e77c097122

ARROW-10311: [Release] Update crossbow verification process - Fix the verification build setups - Expose `--param` options to crossbow.py submit to override jinja parameters - Expose the same option to the comment bot, so `crossbow submit -p release=2.0.0 -p rc=2 -g verify-rc` will work next time Closes #8464 from kszucs/release-verification Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Frank Du

commit sha ab62c28dd60bd956be034c353c4117063bb4ad06

ARROW-10321: [C++] Use check_cxx_source_compiles for AVX512 detect in compiler Also build the SIMD files as ARROW_RUNTIME_SIMD_LEVEL. Signed-off-by: Frank Du <frank.du@intel.com> Closes #8478 from jianxind/avx512_runtime_level_build Authored-by: Frank Du <frank.du@intel.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

Neville Dipale

commit sha 2b8dc084b5bc600a1e96e31227cd3c5ed8cf3650

ARROW-8289: [Rust] Parquet Arrow writer with nested support **Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Neville Dipale

commit sha 923d23b617ce386b8b5680598a5a1116f026e596

ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust. If the schema can't be deserialised, the normal Parquet > Arrow schema conversion is performed. Closes #7917 from nevi-me/ARROW-8243 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Carol (Nichols || Goulding)

commit sha 2f8178567221d920018e8c43104766357f5c7617

ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's encode_arrow_schema with ipc changes Note that this PR is deliberately filed against the rust-parquet-arrow-writer branch, not master!! Hi! 👋 I'm looking to help out with the rust-parquet-arrow-writer branch, and I just pulled it down and it wasn't compiling because in 75f804efbfe367175fef5a2238d9cd2d30ed3afe, `schema_to_bytes` was changed to take `IpcWriteOptions` and to return `EncodedData`. This updates `encode_arrow_schema` to use those changes, which should get this branch compiling and passing tests again. I'm kind of guessing which JIRA ticket this should be associated with; honestly I think this commit can just be squashed with https://github.com/apache/arrow/commit/8f0ed91469f2e569472edaa3b69ffde051088555 next time this branch gets rebased. Please let me know if I should change anything, I'm happy to! Closes #8274 from carols10cents/update-with-ipc-changes Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Carol (Nichols || Goulding)

commit sha 6e237bcc20836f336800b51f50adfa3879560586

ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes #8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Carol (Nichols || Goulding)

commit sha b7b45d1525401627541f26681928c2e16ce51edd

ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes Note that this PR goes to the rust-parquet-arrow-writer branch, not master. Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc These perform round-trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values of each the supported data types and some of the unsupported ones. Tests that currently fail are either marked with `#[should_panic]` (if the reason they fail is because of a panic) or `#[ignore]` (if the reason they fail is because the values don't match). I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure that this is appropriate or not because for some data types, the `null_bitmap` isn't matching and I'm not sure if it's supposed to or not. So I would love advice on that front, and I would love to know if these tests are useful or not! Closes #8330 from carols10cents/roundtrip-tests Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Carol (Nichols || Goulding)

commit sha 3a22d3dd7e0423f71cc325d48e64a1515b45cd8b

ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available @nevi-me This is one commit on top of https://github.com/apache/arrow/pull/8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Neville Dipale

commit sha ead5e14ca026954e53f6d98d5c9215f24130bfc1

ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip Closes #8388 from nevi-me/ARROW-10225 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Neville Dipale

commit sha 453f9789aeb5903ff32d2040cd2f974ff76b0ab9

ARROW-10334: [Rust] [Parquet] NullArray roundtrip This allows writing an Arrow NullArray to Parquet. Support was added a few years ago in Parquet, and the C++ implementation supports writing null arrays. The array is stored as an int32 which has all values set as null. In order to implement this, we introduce a `null -> int32` cast, which creates a null int32 of same length. Semantically, the write is the same as writing an int32 that's all null, but we create a null writer to preserve the data type. Closes #8484 from nevi-me/ARROW-10334 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Neville Dipale

commit sha 8ccd9c3a8c52a219d556a1b1618010f2f913e5d0

ARROW-7842: [Rust] [Parquet] Arrow list reader This is a port of #6770 to the parquet-writer branch. We'll have more of a chance to test this reader,and ensure that we can roundtrip on list types. Closes #8449 from nevi-me/ARROW-7842-cherry Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Carol (Nichols || Goulding)

commit sha 561e2bb526d14801743d5874d2ce86803858e16c

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts This adds more support for: - When converting Arrow -> Parquet containing an Arrow Dictionary, materialize the Dictionary values and send to Parquet to be encoded with a dictionary or not according to the Parquet settings (not supported: converting an Arrow Dictionary directly to Parquet DictEncoding, also only supports Int32 index types in this commit, also removes NULLs) - When converting Parquet -> Arrow, noticing that the Arrow schema metadata in a Parquet file has a Dictionary type and converting the data to an Arrow dictionary (right now this only supports String dictionaries)

view details

Carol (Nichols || Goulding)

commit sha 3767051044c134948b6f15dd62c751d661a457a1

Change variable name from index_type to key_type

view details

Carol (Nichols || Goulding)

commit sha bd4d4a8b8a817c2be721dcbde5905a8e96be4178

cargo fmt

view details

Carol (Nichols || Goulding)

commit sha bae6e7447c6265543a94d568258abf8c59bd490f

Change an unwrap to an expect

view details

push time in 9 days

delete branch integer32llc/arrow

delete branch : ARROW-7842-cherry

delete time in 10 days

pull request comment nevi-me/arrow

[Rust] [Parquet] LargeListArray support and why I think the tests are still failing

Just saw that you rebased your branch so I rebased on your branch :)

carols10cents

comment created time in 12 days

push event integer32llc/arrow

Neville Dipale

commit sha 97d21cf15208fcc3b86be05284653239b99e4877

ARROW-7842: [Rust] [Parquet] Arrow list reader This is a port of #6770 to the parquet-writer branch. We'll have more of a chance to test this reader, and ensure that we can roundtrip on list types.

view details

Carol (Nichols || Goulding)

commit sha c9f9abff56d728657315fabcac6ff16fcb2dcd1d

Support reading LargeListArrays by making ListArrayReader generic over OffsetSize

view details

Carol (Nichols || Goulding)

commit sha b7336fff81a2970a099f45806a28dacf1e1d0439

Update comment to match the actual values in this test; probably copy-paste

view details

Carol (Nichols || Goulding)

commit sha 37028e32454d7731d3a70765a1555c2b6b0c92c8

Document why I think the test setup isn't quite right

view details

push time in 12 days

PR opened nevi-me/arrow

LargeListArray support and why I think the tests are still failing datafusion lang-rust

Hi, I decided to open the PR over here rather than on the apache/arrow repo since I'm making changes to your branch... I did rebase this on the upstream rust-parquet-arrow-writer branch though, so there's some noise in this PR's diff :-/

I added support for LargeListArrays in the ListArrayReader, and I added a test in array_reader and it's passing!

However, the roundtrip tests are still failing, and my current suspicion is that something isn't quite right with the test setup, because the null count isn't what I expected it to be. I've documented with comments and assertions (some of which pass and some of which fail) my current reasoning-- I have to go right now but I'll keep investigating later. If you notice something that's obvious to you about the test setup, please let me know!
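
For reference, the shape of the reader change is roughly this (a simplified sketch with hypothetical field names, not the actual code): the offset width becomes a type parameter instead of being hardcoded to i32, so ListArray and LargeListArray share one implementation. In the real change the parameter is bounded by arrow's OffsetSize trait.

use std::marker::PhantomData;
use arrow::datatypes::DataType;

// Simplified sketch: one reader type covers both List (i32 offsets) and LargeList (i64 offsets).
struct ListArrayReader<OffsetSize> {
    item_type: DataType,
    _offset: PhantomData<OffsetSize>,
}

// The concrete readers are then just instantiations:
type ListReader = ListArrayReader<i32>; // built for DataType::List
type LargeListReader = ListArrayReader<i64>; // built for DataType::LargeList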

+120 -25

1 comment

2 changed files

pr created time in 12 days

Pull request review comment apache/arrow

ARROW-7842: [Rust] [Parquet] Arrow list reader

 mod tests {
     }

     #[test]
-    #[should_panic(
-        expected = "Reading parquet list array into arrow is not supported yet!"
-    )]
+    #[ignore = "Roundtrip failing, reason likely reader, not yet investigated"]

@nevi-me I opened a PR on your repo with some further work -- feel free to cherry pick or whatever, but the tests aren't quite passing yet over there, just failing for a different reason ;)

nevi-me

comment created time in 12 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha 160b99ec8de4682ff0fccb42600ce80c645339a9

Support reading LargeListArrays by making ListArrayReader generic over OffsetSize

view details

Carol (Nichols || Goulding)

commit sha fd925f0a6cb94da987546f663ab88e45c2bcab8c

Update comment to match the actual values in this test; probably copy-paste

view details

Carol (Nichols || Goulding)

commit sha ca9df6735b40670ec361f4cc2ac0273d12081604

Document why I think the test setup isn't quite right

view details

push time in 12 days

create branch integer32llc/arrow

branch : ARROW-7842-cherry

created branch time in 12 days

pull request comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Status update: The other index types are done, but primitive dictionaries are not yet.

carols10cents

comment created time in 12 days

issue comment Manishearth/namespacing-rfc

Drawback: typosquatting

Ok, I haven't thought through the complete implications of this yet; I'll think some more later. But what if we introduced a new syntax in Cargo.toml that maps to "foo/bar"? (Old Cargos could still use "foo/bar" to depend on crates like this; I feel like supporting old Cargos is probably where this idea is going to fall down, but stick with me for a second.)

either something like:

[dependencies]
serde = { sub_crate = "json", version = "1.0" }

or:

[dependencies.serde]
version = "1.0" # to depend on the parent crate, omit this if you aren't depending on the parent
json = "1.0" # to depend on the serde/json crate

to make it more obvious and explicit and different when your intention is to use a sub-crate?

Manishearth

comment created time in 12 days

issue comment Manishearth/namespacing-rfc

Decision: Separator mapping

This would have to be renamed in Cargo.toml

Ahhh right.

Correct, however someone could maliciously do this to confuse people, e.g. by publishing serde-gelf containing a bitcoin miner, whereas serde/gelf is the Actual Good crate.

LOL it's really difficult to consider this and #1 and #3 separately... hmmm....

Manishearth

comment created time in 12 days

Pull request review comment apache/arrow

ARROW-7842: [Rust] [Parquet] Arrow list reader

 mod tests {
     }

     #[test]
-    #[should_panic(
-        expected = "Reading parquet list array into arrow is not supported yet!"
-    )]
+    #[ignore = "Roundtrip failing, reason likely reader, not yet investigated"]

Looking at this now!

nevi-me

comment created time in 12 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha 0dcf149392bcff328ddc393356c9977e135a2799

Add a test and update comment to explain why it's ok to drop nulls

view details

push time in 12 days

Pull request review comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

 fn write_leaves(
             }
             Ok(())
         }
+        ArrowDataType::Dictionary(k, v) => {
+            // Materialize the packed dictionary and let the writer repack it
+            let any_array = array.as_any();
+            let (k2, v2) = match &**k {
+                ArrowDataType::Int32 => {
+                    let typed_array = any_array
+                        .downcast_ref::<arrow_array::Int32DictionaryArray>()
+                        .expect("Unable to get dictionary array");
+
+                    (typed_array.keys(), typed_array.values())
+                }
+                o => unimplemented!("Unknown key type {:?}", o),
+            };
+
+            let k3 = k2;
+            let v3 = v2
+                .as_any()
+                .downcast_ref::<arrow_array::StringArray>()
+                .unwrap();
+
+            // TODO: This removes NULL values; what _should_ be done?
+            // FIXME: Don't use `as`
+            let materialized: Vec<_> = k3
+                .flatten()
+                .map(|k| v3.value(k as usize))
+                .map(ByteArray::from)
+                .collect();
+            //

@vertexclique I updated the roundtrip dictionary test to include some None values, and it passes, so I think this code is fine-- it seems that the None values are handled by the definition levels, so we don't need to handle them here. Do I have that right or am I still missing something?
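
To spell out my reasoning with a toy illustration (not the writer's actual code, just the shape of the idea): for a flat optional column, the null slots only show up in the definition levels, never in the value stream, so dropping them from the materialized values doesn't lose anything.

// Definition level 1 = value present, 0 = null (flat optional column).
fn main() {
    let arrow_values: Vec<Option<&str>> = vec![Some("a"), None, Some("b")];
    let def_levels: Vec<i16> = arrow_values
        .iter()
        .map(|v| if v.is_some() { 1 } else { 0 })
        .collect();
    // Only the non-null values end up in the column's value stream.
    let written_values: Vec<&str> = arrow_values.iter().flatten().copied().collect();
    assert_eq!(def_levels, vec![1, 0, 1]);
    assert_eq!(written_values, vec!["a", "b"]);
}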

carols10cents

comment created time in 12 days

push event integer32llc/arrow

Krisztián Szűcs

commit sha 70ae16115692a1234ba177ea912727bb97fb8227

ARROW-10290: [C++] List POP_BACK is not available in older CMake versions Closes #8451 from kszucs/cmake-compat Authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

Yibo Cai

commit sha f07a415d251a1a32629a299d7f7f1999887a25b0

ARROW-10263: [C++][Compute] Improve variance kernel numerical stability Improve variance merging method to address stability issue when merging short chunks with approximate mean value. Improve reference variance accuracy by leveraging Kahan summation. Closes #8437 from cyb70289/variance-stability Authored-by: Yibo Cai <yibo.cai@arm.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Jorge C. Leitao

commit sha c5280a550d49023d26c058127f4693bdb863f004

ARROW-10293: [Rust] [DataFusion] Fixed benchmarks The benchmarks were only benchmarking planning, not execution, of the plans. This PR fixes this. Closes #8452 from jorgecarleitao/bench Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Jorge C. Leitao

commit sha 818593f46f4900afca129f6f2286c55ef2d253aa

ARROW-10295 [Rust] [DataFusion] Replace Rc<RefCell<>> by Box<> in accumulators. This PR replaces `Rc<RefCell<>>` by `Box<>`. We do not need interior mutability on the accumulations. Closes #8456 from jorgecarleitao/box Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Neville Dipale

commit sha becf329fda73d0a21692e568a2cd31e107b29833

ARROW-10289: [Rust] Read dictionaries in IPC streams We were reading dictionaries in the file reader, but not in the stream reader. This was a trivial change, as we needed to add the dictionary to the stream when we encounter it, and then read the next message until we reach a record batch. I tested with the 0.14.1 golden file, I'm going to test with later versions (1.0.0-littleendian) when I get to `arrow::ipc::MetadataVersion::V5` support, hopefully soon. Closes #8450 from nevi-me/ARROW-10289 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

view details

Jorge C. Leitao

commit sha ea29f65e0d580b1b3badcd429c246e158ecd92d6

ARROW-10292: [Rust] [DataFusion] Simplify merge Currently, `mergeExec` uses `tokio::spawn` to parallelize the work, by calling `tokio::spawn` once per logical thread. However, `tokio::spawn` returns a task / future, which `tokio` runtime will then schedule on its thread pool. Therefore, there is no need to limit the number of tasks to the number of logical threads, as tokio's runtime itself is responsible for that work. In particular, since we are using [`rt-threaded`](https://docs.rs/tokio/0.2.22/tokio/runtime/index.html#threaded-scheduler), tokio already declares a thread pool from the number of logical threads available. This PR removes the coupling, in `mergeExec`, between the number of logical threads (`max_concurrency`) and the number of created tasks. I observe no change in performance: <details> <summary>Benchmark results</summary> ``` Switched to branch 'simplify_merge' Your branch is up to date with 'origin/simplify_merge'. Compiling datafusion v2.0.0-SNAPSHOT (/Users/jorgecarleitao/projects/arrow/rust/datafusion) Finished bench [optimized] target(s) in 38.02s Running /Users/jorgecarleitao/projects/arrow/rust/target/release/deps/aggregate_query_sql-5241a705a1ff29ae Gnuplot not found, using plotters backend aggregate_query_no_group_by 15 12 time: [715.17 us 722.60 us 730.19 us] change: [-8.3167% -5.2253% -2.2675%] (p = 0.00 < 0.05) Performance has improved. Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) high mild 2 (2.00%) high severe aggregate_query_group_by 15 12 time: [5.6538 ms 5.6695 ms 5.6892 ms] change: [+0.1012% +0.5308% +0.9913%] (p = 0.02 < 0.05) Change within noise threshold. Found 10 outliers among 100 measurements (10.00%) 4 (4.00%) high mild 6 (6.00%) high severe aggregate_query_group_by_with_filter 15 12 time: [2.6598 ms 2.6665 ms 2.6751 ms] change: [-0.5532% -0.1446% +0.2679%] (p = 0.51 > 0.05) No change in performance detected. Found 7 outliers among 100 measurements (7.00%) 3 (3.00%) high mild 4 (4.00%) high severe ``` </details> Closes #8453 from jorgecarleitao/simplify_merge Authored-by: Jorge C. Leitao <jorgecarleitao@gmail.com> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

view details

Neal Richardson

commit sha 249adb448ec3287542dc42186875006705941b8d

ARROW-10270: [R] Fix CSV timestamp_parsers test on R-devel Also adds a GHA job that tests on R-devel so we catch issues like this sooner. Closes #8447 from nealrichardson/r-timestamp-test Authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

H-Plus-Time

commit sha ac14e91551c32769f9ab0d2d81aa12c35f1aa1d3

ARROW-9479: [JS] Fix Table.from for zero-item serialized tables, Table.empty for schemas containing compound types (List, FixedSizeList, Map) Steps for reproduction: ```js const foo = new arrow.List(new arrow.Field('bar', new arrow.Float64())) const table = arrow.Table.empty(foo) // ⚡ ``` The Data constructor assumes childData is either falsey, a zero-length array (still falsey, but worth distinguishing) or a non-zero length array of valid instances of Data or objects with a data property. Coercing undefineds to empty arrays a little earlier for compound types (List, FixedSizeList, Map) avoids this. Closes #7771 from H-Plus-Time/ARROW-9479 Authored-by: H-Plus-Time <Nicholas.Roberts.au@gmail.com> Signed-off-by: Brian Hulette <bhulette@google.com>

view details

Benjamin Kietzman

commit sha ed8b1bce034eb9e389d2ea069a2f80460c6e31cc

ARROW-10145: [C++][Dataset] Assert integer overflow in partitioning falls back to string Closes #8462 from bkietz/10145-Integer-like-partition-fi Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>

view details

Benjamin Wilhelm

commit sha 35ace395d4dede8a1b954dfdc453c2598cbc9af4

ARROW-10174: [Java] Fix reading/writing dict structs When translating between the memory FieldType and message FieldType for dictionary encoded vectors the children of the dictionary field were not handled correctly. * When going from memory format to message format the Field must have the children of the dictionary field. * When going from message format to memory format the Field must have no children but the dictionary must have the mapped children Closes #8363 from HedgehogCode/bug/ARROW-10174-dict-structs Authored-by: Benjamin Wilhelm <benjamin.wilhelm@knime.com> Signed-off-by: liyafan82 <fan_li_ya@foxmail.com>

view details

Neville Dipale

commit sha 4e6a836b42b064a50582bcc9d6cfca2b7e77a46a

ARROW-8289: [Rust] Parquet Arrow writer with nested support **Note**: I started making changes to #6785, and ended up deviating a lot, so I opted for making a new draft PR in case my approach is not suitable. ___ This is a draft to implement an arrow writer for parquet. It supports the following (no complete test coverage yet): * writing primitives except for booleans and binary * nested structs * null values (via definition levels) It does not yet support: - Boolean arrays (have to be handled differently from numeric values) - Binary arrays - Dictionary arrays - Union arrays (are they even possible?) I have only added a test by creating a nested schema, which I tested on pyarrow. ```jupyter # schema of test_complex.parquet a: int32 not null b: int32 c: struct<d: double, e: struct<f: float>> not null child 0, d: double child 1, e: struct<f: float> child 0, f: float ``` This PR potentially addresses: * https://issues.apache.org/jira/browse/ARROW-8289 * https://issues.apache.org/jira/browse/ARROW-8423 * https://issues.apache.org/jira/browse/ARROW-8424 * https://issues.apache.org/jira/browse/ARROW-8425 And I would like to propose either opening new JIRAs for the above incomplete items, or renaming the last 3 above. ___ **Help Needed** I'm implementing the definition and repetition levels on first principle from an old Parquet blog post from the Twitter engineering blog. It's likely that I'm not getting some concepts correct, so I would appreciate help with: * Checking if my logic is correct * Guidance or suggestions on how to more efficiently extract levels from arrays * Adding tests - I suspect we might need a lot of tests, so far we only test writing 1 batch, so I don't know how paging would work when writing a large enough file I also don't know if the various encoding levels (dictionary, RLE, etc.) and compression levels are applied automagically, or if that'd be something we need to explicitly enable. CC @sunchao @sadikovi @andygrove @paddyhoran Might be of interest to @mcassels @maxburke Closes #7319 from nevi-me/arrow-parquet-writer Lead-authored-by: Neville Dipale <nevilledips@gmail.com> Co-authored-by: Max Burke <max@urbanlogiq.com> Co-authored-by: Andy Grove <andygrove73@gmail.com> Co-authored-by: Max Burke <maxburke@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Neville Dipale

commit sha 12b11b29293b6ace9cb99cffa93fd1b74b4849be

ARROW-8423: [Rust] [Parquet] Serialize Arrow schema metadata This will allow preserving Arrow-specific metadata when writing or reading Parquet files created from C++ or Rust. If the schema can't be deserialised, the normal Parquet > Arrow schema conversion is performed. Closes #7917 from nevi-me/ARROW-8243 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Carol (Nichols || Goulding)

commit sha ebe81729da4e983ec0f580ea0582059c0237d41c

ARROW-10095: [Rust] Update rust-parquet-arrow-writer branch's encode_arrow_schema with ipc changes Note that this PR is deliberately filed against the rust-parquet-arrow-writer branch, not master!! Hi! 👋 I'm looking to help out with the rust-parquet-arrow-writer branch, and I just pulled it down and it wasn't compiling because in 75f804efbfe367175fef5a2238d9cd2d30ed3afe, `schema_to_bytes` was changed to take `IpcWriteOptions` and to return `EncodedData`. This updates `encode_arrow_schema` to use those changes, which should get this branch compiling and passing tests again. I'm kind of guessing which JIRA ticket this should be associated with; honestly I think this commit can just be squashed with https://github.com/apache/arrow/commit/8f0ed91469f2e569472edaa3b69ffde051088555 next time this branch gets rebased. Please let me know if I should change anything, I'm happy to! Closes #8274 from carols10cents/update-with-ipc-changes Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Carol (Nichols || Goulding)

commit sha 58855a888438fe2063db8499659d83fe1bb91b66

ARROW-8426: [Rust] [Parquet] Add support for writing dictionary types In this commit, I: - Extracted a `build_field` function for some code shared between `schema_to_fb` and `schema_to_fb_offset` that needed to change - Uncommented the dictionary field from the Arrow schema roundtrip test and add a dictionary field to the IPC roundtrip test - If a field is a dictionary field, call `add_dictionary` with the dictionary field information on the flatbuffer field, building the dictionary as [the C++ code does][cpp-dictionary] and describe with the same comment - When getting the field type for a dictionary field, use the `value_type` as [the C++ code does][cpp-value-type] and describe with the same comment The tests pass because the Parquet -> Arrow conversion for dictionaries is [already supported][parquet-to-arrow]. [cpp-dictionary]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L426-L440 [cpp-value-type]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/cpp/src/arrow/ipc/metadata_internal.cc#L662-L667 [parquet-to-arrow]: https://github.com/apache/arrow/blob/477c1021ac013f22389baf9154fb9ad0cf814bec/rust/arrow/src/ipc/convert.rs#L120-L127 Closes #8291 from carols10cents/rust-parquet-arrow-writer Authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Carol (Nichols || Goulding)

commit sha de95e847bc82cbd28b3963edc843baaa10bb99ab

ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes Note that this PR goes to the rust-parquet-arrow-writer branch, not master. Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc These perform round-trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values of each the supported data types and some of the unsupported ones. Tests that currently fail are either marked with `#[should_panic]` (if the reason they fail is because of a panic) or `#[ignore]` (if the reason they fail is because the values don't match). I am comparing the RecordBatch's column's data before and after the round trip directly; I'm not sure that this is appropriate or not because for some data types, the `null_bitmap` isn't matching and I'm not sure if it's supposed to or not. So I would love advice on that front, and I would love to know if these tests are useful or not! Closes #8330 from carols10cents/roundtrip-tests Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Carol (Nichols || Goulding)

commit sha e1b613b1ec9239bf58d5882081aeeb75fa06c3d3

ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available @nevi-me This is one commit on top of https://github.com/apache/arrow/pull/8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I *think* this will bring the Rust implementation more in line with C++, but I'm not certain. I tried removing the `#[ignore]` attributes from the `LargeArray` and `LargeUtf8` tests, but they're still failing because the schemas don't match yet-- it looks like [this code](https://github.com/apache/arrow/blob/b2842ab2eb0d7a7a633049a5591e1eaa254d4446/rust/parquet/src/arrow/array_reader.rs#L595-L638) will need to be changed as well. That `build_array_reader` function's code looks very similar to the code I've changed here, is there a possibility for the code to be shared or is there a reason they're separate? Closes #8354 from carols10cents/schema-roundtrip Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com> Co-authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Neville Dipale

commit sha f70e6db575cce746d2a4cd1c9e5a99629c27926c

ARROW-10225: [Rust] [Parquet] Fix null comparison in roundtrip Closes #8388 from nevi-me/ARROW-10225 Authored-by: Neville Dipale <nevilledips@gmail.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

Carol (Nichols || Goulding)

commit sha a12171cc0b1651694485f92831ddb04c7b00b164

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts This adds more support for: - When converting Arrow -> Parquet containing an Arrow Dictionary, materialize the Dictionary values and send to Parquet to be encoded with a dictionary or not according to the Parquet settings (not supported: converting an Arrow Dictionary directly to Parquet DictEncoding, also only supports Int32 index types in this commit, also removes NULLs) - When converting Parquet -> Arrow, noticing that the Arrow schema metadata in a Parquet file has a Dictionary type and converting the data to an Arrow dictionary (right now this only supports String dictionaries)

view details

Carol (Nichols || Goulding)

commit sha 27c9c7203a4ad37954e25356d6b1d0c5d7054413

Change variable name from index_type to key_type

view details

Carol (Nichols || Goulding)

commit sha f184cc7a1aa0ef5590b2750d1a304ae09c046e4d

cargo fmt

view details

push time in 12 days

issue comment Manishearth/namespacing-rfc

Decision: Separator mapping

The typosquatting section referenced:

Dash typosquatting

This proposal does not prevent anyone from taking foo-bar after you publish foo/bar. Given that the Rust crate import syntax for foo/bar is foo_bar, same as foo-bar, it's totally possible for a user to accidentally type foo-bar in Cargo.toml instead of foo/bar, and pull in the wrong, squatted, crate.

We currently prevent foo-bar and foo_bar from existing at the same time. We could do this here as well, but it would only go in one direction: if foo/bar exists foo-bar/foo_bar cannot be published, but not vice versa. This limits the "damage" to cases where someone pre-squats foo-bar before you publish foo/bar, and the damage can be mitigated by checking to see if such a clashing crate exists when publishing, if you actually care about this attack vector. There are some tradeoffs there that we would have to explore.

One thing that could mitigate foo/bar mapping to the potentially ambiguous foo_bar is using something like foo::crate::bar or ~foo::bar or foo::/bar in the import syntax.

I was confused about the multiple negatives and slashes in this sentence, especially when I lost the markdown formatting the first time I copied it (I've fixed it now):

if foo/bar exists foo-bar/foo_bar cannot be published, but not vice versa.

To clarify, I would remove the / and spell out the "vice versa" and the reasoning so that this says:

- if `foo/bar` exists `foo-bar`/`foo_bar` cannot be published, but not vice versa.
+ if `foo/bar` exists, neither `foo-bar` nor `foo_bar` may be published. However, if `foo-bar` or `foo_bar` exist, we would choose to allow `foo/bar` to be published, because we don't want to limit the use of names within a crate namespace due to crates outside the namespace existing.

Have I understood correctly?

To put this another way, this would mean crates.io could have the crates serde_gelf and serde/gelf (if serde_gelf exists first, which it would). If for some reason you wanted to depend on both, your Cargo.toml would contain:

[dependencies]
serde_gelf = "0.1.6"
"serde/gelf" = "1.0.0"

and to bring items from each of these into scope, your code would need:

use serde_gelf::stuff; // brings in from `serde_gelf`
use serde_gelf as serde_somethingelse_gelf; // need to choose a new name for one of them
use serde_somethingelse_gelf::other_stuff; // can then bring more names into scope with the alias

Seems... non-ambiguous to the compiler, but potentially a little annoying for the programmer; probably not common, though?

Manishearth

comment created time in 12 days

issue comment Manishearth/namespacing-rfc

Decision: Separator choice

Ugh. No great choices here. I'm currently leaning toward something new and weird to avoid ambiguities with existing code; we'll get used to the way it looks.

Manishearth

comment created time in 12 days

issue comment rust-lang/rustc-perf

artifact ID column sequence overflowing

Looks like you're right! I tried making a table just like artifacts with:

create table artifact( 
    id integer primary key generated always as identity,
    name text not null unique, 
    date timestamptz,
    type text not null
);

Then I did two inserts with the same name, then one with a different name:

# insert into artifact (name, date, type) values ('hi', now(), 'try') on conflict do nothing returning id;
 id 
----
  1
(1 row)

INSERT 0 1
# insert into artifact (name, date, type) values ('hi', now(), 'try') on conflict do nothing returning id;
 id 
----
(0 rows)

INSERT 0 0
# insert into artifact (name, date, type) values ('bye', now(), 'try') on conflict do nothing returning id;
 id 
----
  3
(1 row)

INSERT 0 1

So id 2 is gone! I think your plan to select first, then insert with on conflict do nothing, sounds good.
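
A rough sketch of what that could look like (hypothetical code, not what rustc-perf actually has, and I'm assuming a tokio-postgres connection here):

use tokio_postgres::{Client, Error};

// Sketch: only attempt the insert when the artifact doesn't already exist, so
// conflicting inserts stop consuming identity values.
async fn artifact_id(client: &Client, name: &str, kind: &str) -> Result<i32, Error> {
    // Fast path: already present, no insert attempted, no sequence value burned.
    if let Some(row) = client
        .query_opt("select id from artifact where name = $1", &[&name])
        .await?
    {
        return Ok(row.get(0));
    }
    // Not found: insert it; on conflict do nothing guards against a concurrent insert.
    if let Some(row) = client
        .query_opt(
            "insert into artifact (name, date, type) values ($1, now(), $2) \
             on conflict do nothing returning id",
            &[&name, &kind],
        )
        .await?
    {
        return Ok(row.get(0));
    }
    // Lost a race: another connection inserted it between our select and insert.
    let row = client
        .query_one("select id from artifact where name = $1", &[&name])
        .await?;
    Ok(row.get(0))
}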

Mark-Simulacrum

comment created time in 13 days

issue comment rust-lang/rustc-perf

artifact ID column sequence overflowing

Sorry for the delay-- the most likely reason, as I understand it, is rolled back transactions. I see the artifact_id function that does the inserting into artifacts, and I see that function being called within a transaction here, but I don't immediately see why that transaction would be rolled back (nor am I sure that's the only place this is called within a transaction).

I have to go, but I'll take another look later... does that sound possible, though?

Mark-Simulacrum

comment created time in 13 days

issue closed rust-lang/crates.io

Crate name request

Hello, I would like to use the crates.io name compute (https://crates.io/crates/compute). How can I do so?

The current owner has the following on the page:

I consent to the transfer of this crate to the first person who requests it.

Thanks! :)

closed time in 13 days

al-jshen

issue comment rust-lang/crates.io

Crate name request

The compute crate is all yours!

al-jshen

comment created time in 13 days

Pull request review comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

 impl Converter<Vec<Option<ByteArray>>, LargeBinaryArray> for LargeBinaryArrayCon
     }
 }

+pub struct DictionaryArrayConverter {}
+
+impl<K: ArrowDictionaryKeyType> Converter<Vec<Option<ByteArray>>, DictionaryArray<K>>
+    for DictionaryArrayConverter
+{
+    fn convert(&self, source: Vec<Option<ByteArray>>) -> Result<DictionaryArray<K>> {
+        let data_size = source
+            .iter()
+            .map(|x| x.as_ref().map(|b| b.len()).unwrap_or(0))
+            .sum();
+
+        let keys_builder = PrimitiveBuilder::<K>::new(source.len());
+        let values_builder = StringBuilder::with_capacity(source.len(), data_size);
+
+        let mut builder = StringDictionaryBuilder::new(keys_builder, values_builder);
+        for v in source {
+            match v {
+                Some(array) => {
+                    builder.append(array.as_utf8()?)?;

Do you mean make this one line again, like:

Some(array) => let _ = builder.append(array.as_utf8()?)?,

carols10cents

comment created time in 14 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha 24a13b0e67aa38a361c1b21cfab253f225f9deb9

Change an unwrap to an expect

view details

push time in 14 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha 9b15043bc8ba6b06a49e7e814d206e9c46e2e60e

cargo fmt

view details

push time in 14 days

push event integer32llc/arrow

Benjamin Kietzman

commit sha ae396b9d4c26621cba2cce955f1d55f43e8faab9

ARROW-9782: [C++][Dataset] More configurable Dataset writing Python: - ParquetFileFormat.write_options has been removed - Added classes {,Parquet,Ipc}FileWriteOptions - FileWriteOptions are constructed using FileFormat.make_write_options(...) - FileWriteOptions are passed as a parameter to _filesystemdataset_write() R: - FileWriteOptions$create(...) to make write options; no subclasses exposed in R - A filter() on the dataset is applied to restrict written rows. C++: - FileSystemDataset::Write's parameters have been consolidated into - A Scanner, from which the batches to be written are pulled - A FileSystemDatasetWriteOptions, which is an options struct specifying - destination filesystem - base directory - partitioning - basenames (via a string template, ex "dat_{i}.feather") - format specific write options - Format specific write options are represented using the FileWriteOptions hierarchy. An instance of these can be constructed from a format using FileFormat::DefaultWriteOptions(), after which the instance can be modified. - ParquetFileFormat::{writer_properties, arrow_writer_properties} have been moved to ParquetFileWriteOptions, an implementation of FileWriteOptions. Internal C++: - Individual files can now be incrementally written using a FileWriter, constructible from a format using FileFormat::MakeWriter - FileSystemDataset::Write now parallelizes across scan tasks rather than fragments, so there will be no difference in performance for different arrangements of tables/batches/lists of tables and batches when writing from memory - FileSystemDataset::Write::WriteQueue provides a threadsafe channel for batches awaiting write, allowing threads to produce batches as another thread flushes the queue to disk. Closes #8305 from bkietz/9782-more-configurable-writing Lead-authored-by: Benjamin Kietzman <bengilgit@gmail.com> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

view details

Benjamin Kietzman

commit sha 1150c385278b0ea49596326f39421bf5e317d338

ARROW-10134: [Python][Dataset] Add ParquetFileFragment.num_row_groups Closes #8317 from bkietz/10134-Add-ParquetFileFragmentnu Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

view details

kanga333

commit sha 878c53455bf9de610e184fd92c93517e137b5228

ARROW-10227: [Ruby] Use a table size as the default for parquet chunk_size A chunk_size that is too small will cause metadata bloat in the parquet file, leading to poor read performance. Set the chunk_size to be the same value as the table size so that one file becomes one row_group. Closes #8391 from kanga333/ruby-use-table-size-for-chunk-size Authored-by: kanga333 <e411z7t40w@gmail.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Micah Kornfield

commit sha d4cbc4b7aab5d37262b83e972af4bd7cb44c7a5c

ARROW-10229: [C++] Remove errant log line noticed this on rereviewing the merged code. Closes #8392 from emkornfield/remove_log Authored-by: Micah Kornfield <emkornfield@gmail.com> Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

naman1996

commit sha 54199ecad388d322b4ac0a8f96ab2b3ebc730b9d

ARROW-10023: [C++][Gandiva] Implement split_part function in gandiva Closes #8231 from naman1996/ARROW-10023 and squashes the following commits: 13ef67943 <naman1996> force inline and out_len refactor 7dadc4d8a <naman1996> Changes to set out_len = 0 3ba0a3bcc <naman1996> Fixing ASAN warning d6da07a7d <naman1996> fixing small lint issue a17eedd1d <naman1996> Fixing typo b6c85503a <naman1996> Removing usage of std::string and adding some unit test for UTF8 style strings cb843670b <naman1996> fixing lint errores 899ccf075 <naman1996> changes for split_part function 3d716f61b <naman1996> adding split_string function with unit tests Authored-by: naman1996 <namanudasi160196@gmail.com–> Signed-off-by: Praveen <praveen@dremio.com>

view details

arw2019

commit sha ba7ee65422033e11d453ba63bb0eb0108d5183be

ARROW-9967: [Python] Add compute module docs + expose more option classes #8163 exposes `pyarrow.compute` kernels and generates their docstrings. This PR adds documentation for the module in the User Guide and the Python API reference. Closes #8145 from arw2019/ARROW-7871 Lead-authored-by: arw2019 <andrew.r.wieteska@gmail.com> Co-authored-by: Antoine Pitrou <antoine@python.org> Signed-off-by: Antoine Pitrou <antoine@python.org>

view details

Jörn Horstmann

commit sha 20f2bd49fc95e3ebd73ba6aa7fdf8f1451b7dd40

ARROW-10040: [Rust] Iterate over and combine boolean buffers with arbitrary offsets @nevi-me this is the chunked iterator based approach i mentioned in #8223 I'm not fully satisfied with the solution yet: - I'd prefer to move all the bit-based functions into `Bitmap`, but creating a `Bitmap` from a `&Buffer` would involve cloning an `Arc`. - I need to do some benchmarking about how much the `packed_simd` implementation actually helps. If it's not a big difference I'd propose to remove it to simplify the code. Closes #8262 from jhorstmann/ARROW-10040-unaligned-bit-buffers Authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Signed-off-by: Neville Dipale <nevilledips@gmail.com>

view details

alamb

commit sha 8447bb12b550610842b6a88b0d92e4178a63bf20

ARROW-10235: [Rust][DataFusion] Improve documentation for type coercion The code / comments for type coercion are a little confusing and don't make the distinction between coercion and casting clear – this PR attempts to clarify the intent, channeling the information from @jorgecarleitao here: https://github.com/apache/arrow/pull/8340#discussion_r501257096 Closes #8399 from alamb/alamb/coercion-docs Authored-by: alamb <andrew@nerdnetworks.org> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

view details

Romain Francois

commit sha 74903919625220f5d67a96cf21411acb645c789f

ARROW-6537 [R]: Pass column_types to CSV reader Either passing down NULL or a Schema. But perhaps a schema is confusing because the only thing that is being controlled by it here is the types, not their order etc .. which I believe feels implied if you supply a schema. Closes #7807 from romainfrancois/ARROW-6537/column_types Lead-authored-by: Romain Francois <romain@rstudio.com> Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com> Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>

view details

alamb

commit sha 4bbb74713c6883e8523eeeb5ac80a1e1f8521674

ARROW-10233: [Rust] Make array_value_to_string available in all Arrow builds This PR makes `array_value_to_string` available to all arrow builds. Currently it is only available if the `feature = "prettyprint"` is enabled which is not the default. The full `print_batches` and `pretty_format_batches` (and the libraries they depend on) are still only available of the feature flag is set. The rationale for making this change is that I want to be able to use `array_value_to_string` to write tests (such as on https://github.com/apache/arrow/pull/8346) but currently it is only available when `feature = "prettyprint"` is enabled. It appears that @nevi-me made prettyprint compilation optional so that arrow could be compiled for wasm in https://github.com/apache/arrow/pull/7400. https://issues.apache.org/jira/browse/ARROW-9088 explains that this is due to some dependency of pretty-table; `array_value_to_string` has no needed dependencies. Note I tried to compile ARROW again using the `wasm32-unknown-unknown` target on master and it fails (perhaps due to a new dependency that was added?): <details> <summary>Click to expand!</summary> ``` alamb@ip-192-168-0-182 rust % git log | head -n 1 git log | head -n 1 commit d4cbc4b7aab5d37262b83e972af4bd7cb44c7a5c alamb@ip-192-168-0-182 rust % git status git status On branch master Your branch is up to date with 'upstream/master'. nothing to commit, working tree clean alamb@ip-192-168-0-182 rust % alamb@ip-192-168-0-182 rust % cargo build --target=wasm32-unknown-unknown cargo build --target=wasm32-unknown-unknown Compiling cfg-if v0.1.10 Compiling lazy_static v1.4.0 Compiling futures-core v0.3.5 Compiling slab v0.4.2 Compiling futures-sink v0.3.5 Compiling once_cell v1.4.0 Compiling pin-utils v0.1.0 Compiling futures-io v0.3.5 Compiling itoa v0.4.5 Compiling bytes v0.5.4 Compiling fnv v1.0.7 Compiling iovec v0.1.4 Compiling unicode-width v0.1.7 Compiling pin-project-lite v0.1.7 Compiling ppv-lite86 v0.2.8 Compiling atty v0.2.14 Compiling dirs v1.0.5 Compiling smallvec v1.4.0 Compiling regex-syntax v0.6.18 Compiling encode_unicode v0.3.6 Compiling hex v0.4.2 Compiling tower-service v0.3.0 error[E0433]: failed to resolve: could not find `unix` in `os` --> /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:41:18 | 41 | use std::os::unix::ffi::OsStringExt; | ^^^^ could not find `unix` in `os` error[E0432]: unresolved import `unix` --> /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:6:5 | 6 | use unix; | ^^^^ no `unix` in the root Compiling alloc-no-stdlib v2.0.1 Compiling adler32 v1.0.4 error[E0599]: no function or associated item named `from_vec` found for struct `std::ffi::OsString` in the current scope --> /Users/alamb/.cargo/registry/src/github.com-1ecc6299db9ec823/dirs-1.0.5/src/lin.rs:48:34 | 48 | Some(PathBuf::from(OsString::from_vec(out))) | ^^^^^^^^ function or associated item not found in `std::ffi::OsString` | = help: items from traits can only be used if the trait is in scope = note: the following trait is implemented but not in scope; perhaps add a `use` for it: `use std::sys_common::os_str_bytes::OsStringExt;` error: aborting due to 3 previous errors Some errors have detailed explanations: E0432, E0433, E0599. For more information about an error, try `rustc --explain E0432`. error: could not compile `dirs`. To learn more, run the command again with --verbose. warning: build failed, waiting for other jobs to finish... 
error: build failed alamb@ip-192-168-0-182 rust % ``` </details> Closes #8397 from alamb/alamb/consolidate-array-value-to-string Lead-authored-by: alamb <andrew@nerdnetworks.org> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org> Signed-off-by: Jorge C. Leitao <jorgecarleitao@gmail.com>

view details

Sutou Kouhei

commit sha 945f649c2a67c684ebab6a8e91901638d4c11b7a

ARROW-9414: [Packaging][deb][RPM] Enable S3 Closes #8394 from kou/packaing-linux-s3 Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Jörn Horstmann

commit sha 0100121f92299d68b348206288f12c43c44110e4

ARROW-10015: [Rust] Simd aggregate kernels Built on top of [ARROW-10040][1] (#8262) Benchmarks (run on a Ryzen 3700U laptop with some thermal problems) Current master without simd: ``` $ cargo bench --bench aggregate_kernels sum 512 time: [3.9652 us 3.9722 us 3.9819 us] change: [-0.2270% -0.0896% +0.0672%] (p = 0.23 > 0.05) No change in performance detected. Found 14 outliers among 100 measurements (14.00%) 4 (4.00%) high mild 10 (10.00%) high severe sum nulls 512 time: [9.4577 us 9.4796 us 9.5112 us] change: [+2.9175% +3.1309% +3.3937%] (p = 0.00 < 0.05) Performance has regressed. Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) high mild 3 (3.00%) high severe ``` This branch without simd (speedup probably due to accessing the data via a slice): ``` sum 512 time: [1.1066 us 1.1113 us 1.1168 us] change: [-72.648% -72.480% -72.310%] (p = 0.00 < 0.05) Performance has improved. Found 13 outliers among 100 measurements (13.00%) 7 (7.00%) high mild 6 (6.00%) high severe sum nulls 512 time: [1.3279 us 1.3364 us 1.3469 us] change: [-86.326% -86.209% -86.085%] (p = 0.00 < 0.05) Performance has improved. Found 20 outliers among 100 measurements (20.00%) 4 (4.00%) high mild 16 (16.00%) high severe ``` This branch with simd: ``` sum 512 time: [108.58 ns 109.47 ns 110.57 ns] change: [-90.164% -90.033% -89.850%] (p = 0.00 < 0.05) Performance has improved. Found 11 outliers among 100 measurements (11.00%) 1 (1.00%) high mild 10 (10.00%) high severe sum nulls 512 time: [249.95 ns 250.50 ns 251.06 ns] change: [-81.420% -81.281% -81.157%] (p = 0.00 < 0.05) Performance has improved. Found 4 outliers among 100 measurements (4.00%) 3 (3.00%) high mild 1 (1.00%) high severe ``` [1]: https://issues.apache.org/jira/browse/ARROW-10040 Closes #8370 from jhorstmann/ARROW-10015-simd-aggregate-kernels Lead-authored-by: Jörn Horstmann <git@jhorstmann.net> Co-authored-by: Jörn Horstmann <joern.horstmann@signavio.com> Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Daniel Russo

commit sha 1c7581cf7779e316da484c20c4aecaa6c89ea23f

ARROW-10043: [Rust][DataFusion] Implement COUNT(DISTINCT col)

This is a proposal for an initial and partial implementation of the `DISTINCT` keyword. Only `COUNT(DISTINCT)` is supported, with the following conditions: (a) only one argument, i.e. `COUNT(DISTINCT col)`, but not `COUNT(DISTINCT col, other)`, (b) the argument is an integer type, and (c) the query must have a `GROUP BY` clause.

**Implementation Overview:**

The `Expr::AggregateFunction` variant has a new field, `distinct`, which mirrors the `distinct` flag from `SQLExpr::Function` (up until now this flag was unused). Any `Expr::AggregateFunction` may have its `distinct` flag switched to `true` if the keyword is present in the SQL query. However, the physical planner respects it only for `COUNT` expressions.

The count distinct aggregation slots into the existing physical plans as a new set of `AggregateExpr`. To demonstrate, below are examples of the physical plans for the following query, where `c1` may be any data type, and `c2` is a `UInt8` column:

```
SELECT c1, COUNT(DISTINCT c2) FROM t1 GROUP BY c1
```

(a) Multiple Partitions:

    HashAggregateExec: mode: Final
      group_expr: Column(c1)
      aggr_expr: DistinctCountReduce(Column(c2))
      schema:
        c1: any
        c2: UInt64
      input:
        MergeExec:
          input:
            HashAggregateExec: mode: Partial
              group_expr: Column(c1)
              aggr_expr: DistinctCount(Column(c2))
              schema:
                c1: any
                c2: LargeList(UInt8)
              input:
                CsvExec:
                  schema:
                    c1: any
                    c2: UInt8

The `DistinctCount` accumulates each `UInt8` into a list of distinct `UInt8`. No counts are collected yet; this is a partial result: lists of distinct values. In the `RecordBatch`, this is a `LargeListArray<UInt8>` column. After the `MergeExec`, each list in `LargeListArray<UInt8>` is accumulated by `DistinctCountReduce` (via `accumulate_batch()`), producing the _final_ sets of distinct values. Finally, given the finalized sets of distinct values, the counts are computed (always as `UInt64`).

(b) Single Partition:

    HashAggregateExec: mode: NoPartial
      group_expr: Column(c1)
      aggr_expr: DistinctCountReduce(Column(c2))
      schema:
        c1: any
        c2: UInt64
      input:
        CsvExec:
          schema:
            c1: any
            c2: UInt8

This scenario is unlike the multiple partition scenario: `DistinctCount` is _not_ used, and there are no partial sets of distinct values. Rather, in a single `HashAggregateExec` stage, each `UInt8` is accumulated into a distinct value set, then the counts are computed at the end of the stage. `DistinctCountReduce` is used, but note that unlike the multiple partition case, it accumulates scalars via `accumulate_scalar()`.

There is a new aggregation mode: `NoPartial`. In summary, the modes are:

- `NoPartial`: used in single-stage aggregations
- `Partial`: used as the first stage of two-stage aggregations
- `Final`: used as the second stage of two-stage aggregations

Prior to the new `NoPartial` mode, `Partial` was handling both of what are now the responsibilities of `Partial` and `NoPartial`. No distinction was required, because _non-distinct_ aggregations (such as count, sum, min, max, and avg) do not need the distinction: the first aggregation stage is always the same, regardless of whether the aggregation is one-stage or two-stage. This is not the case for a _distinct_ count aggregation, and we can see that in the physical plans above.

Closes #8222 from drusso/ARROW-10043

Authored-by: Daniel Russo <danrusso@gmail.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

alamb

commit sha 4c101efed5d698bbd6c005dbd68b40df06596ae7

ARROW-10164: [Rust] Add support for DictionaryArray to cast kernel

This PR adds support to the Rust compute cast kernel for casting `DictionaryArray` to/from `PrimitiveArray`/`StringArray`. It does not include full support for other types such as `LargeString` or `Binary` (though the code could be extended fairly easily following the same pattern). However, my use case doesn't need `LargeString` or `Binary`, so I am trying to get the support I need in rather than fully flesh out the library.

Closes #8346 from alamb/alamb/ARROW-10164-dictionary-casts-2

Authored-by: alamb <andrew@nerdnetworks.org>
Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details
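A minimal sketch of the kind of round trip the cast kernel change above enables; the `Int32`/`Utf8` dictionary types are chosen purely for illustration, and the `cast(&ArrayRef, &DataType)` signature is assumed from the arrow crate of this era.

```
// Hedged sketch, not code from the commit: round-trip a string column through
// a dictionary encoding and back using the public cast kernel.
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, StringArray};
use arrow::compute::cast;
use arrow::datatypes::DataType;
use arrow::error::Result;

fn main() -> Result<()> {
    let strings: ArrayRef = Arc::new(StringArray::from(vec!["foo", "bar", "foo"]));

    // StringArray -> DictionaryArray<Int32, Utf8>
    let dict_type =
        DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8));
    let dict = cast(&strings, &dict_type)?;

    // DictionaryArray<Int32, Utf8> -> StringArray again
    let back = cast(&dict, &DataType::Utf8)?;

    println!(
        "{:?} -> {:?} -> {:?}",
        strings.data_type(),
        dict.data_type(),
        back.data_type()
    );
    Ok(())
}
```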

Eric Erhardt

commit sha beb031f3a49d6db0c88788ce4d05d2daa1afda02

ARROW-10238: [C#] List<Struct> is broken When reading the flatbuffer type information, we are reading the ListType incorrectly. We should be using the childFields parameter, like we do for StructTypes. I also took this opportunity to redo how we compare schema information in our tests. For nested types, we need to recursively compare types all the way down. @pgovind @HashidaTKS @chutchinson Closes #8404 from eerhardt/FixListOfStruct Authored-by: Eric Erhardt <eric.erhardt@microsoft.com> Signed-off-by: Eric Erhardt <eric.erhardt@microsoft.com>

view details

Benjamin Kietzman

commit sha 109f701d805b761710f8b08f76be17ff16b3c3fd

ARROW-10237: [C++] Duplicate dict values cause corrupt parquet Fix suggested by @pitrou Closes #8403 from bkietz/10237-Duplicate-values-in-a-dic Authored-by: Benjamin Kietzman <bengilgit@gmail.com> Signed-off-by: Benjamin Kietzman <bengilgit@gmail.com>

view details

Sutou Kouhei

commit sha f0f7593f7035e990da81d6878b9025ae96e43c4e

ARROW-10239: [C++] Add missing zlib dependency to aws-sdk-cpp Closes #8406 from kou/cpp-aws-sdk-cpp-zlib Authored-by: Sutou Kouhei <kou@clear-code.com> Signed-off-by: Sutou Kouhei <kou@clear-code.com>

view details

Uwe L. Korn

commit sha d908bc83ad1c5310c66c314a476275aa660cb46f

ARROW-9879: [Python] Add support for numpy scalars to ChunkedArray.__getitem__ FYI @marc9595 Closes #8072 from xhochy/ARROW-9879 Lead-authored-by: Uwe L. Korn <uwelk@xhochy.com> Co-authored-by: Krisztián Szűcs <szucs.krisztian@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

naman1996

commit sha f2ad6a95acf8ff27906b0a64d330c74387767789

ARROW-9956: [C++] [Gandiva] Implementation of binary_string function in gandiva

Function takes in a normal string or a hexadecimal encoded string (e.g. \xca\xfe\xba\xbe) and converts it to VARBINARY (byte array).

Closes #8201 from naman1996/ARROW-9956 and squashes the following commits:

f3abfd627 <naman1996> Force inlining helper function
cf335be68 <naman1996> Removing convert_fromUTF8_binary
746f993a5 <naman1996> removing include of arrow/util/string.h
af6e12e36 <naman1996> Changes to remove parseHexValue
5d1c90a10 <naman1996> Correcting typo
f970214a3 <naman1996> Removing error thrown by ParseHexValue for parity with java implementation
11572ceb4 <naman1996> setting out_len to 0
90a7798f1 <naman1996> Making char array null terminated for failing unit test in ubuntu
af3785a1d <naman1996> Fixing small linting error
4d09d154e <naman1996> Changes to remove std::string
d3afcb91a <naman1996> refactor to use arrow::ParseHexValue
8b4f563cd <naman1996> fixing test
dec947544 <naman1996> fixing test issue
52a1708de <naman1996> fixing lint errors
562b285c2 <naman1996> correcting linting errors
1f7fb91eb <naman1996> handling null string case
58efbb93c <naman1996> adding binary string function

Authored-by: naman1996 <namanudasi160196@gmail.com>
Signed-off-by: Praveen <praveen@dremio.com>

view details

Joris Van den Bossche

commit sha 599b458c68dfcba38fe5448913d4bb69723e1439

ARROW-9518: [Python] Deprecate pyarrow serialization Closes #8255 from jorisvandenbossche/ARROW-9518-deprecate-serialize Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Krisztián Szűcs <szucs.krisztian@gmail.com>

view details

push time in 14 days

pull request comment apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

Do you want to work on other index types and supporting primitive Arrow dictionaries? We could keep this PR open for longer, as long as it's not blocking any additional unit of work.

Yup, I'm happy to do that! I'll be rebasing, addressing feedback, and adding to this on Wednesday.

carols10cents

comment created time in 16 days

pull request comment apache/arrow

One definition/repetition level test

Yup, we can keep this as a draft, no problem! I'll be rebasing and adding to this on Wednesday.

carols10cents

comment created time in 16 days

issue closed rust-lang/crates.io

Add possibility to rename crates

I would suggest making it possible to rename crates. As long as no other crate has taken the old name, the old name should redirect to the new name.

Changes that only affect capitalization or the characters _ and - should also be possible, since some crates have their names written with the wrong capitalization and their authors would like to change this.

closed time in 16 days

deeprobin

issue comment rust-lang/crates.io

Add possibility to rename crates

We already change all crate names to lowercase before considering uniqueness-- in other words, a crate named MyCrate and a crate named mycrate are not allowed to both exist.

In order for Cargo to use case insensitive lookup when downloading crates specified in a Cargo.toml, changes need to be made to Cargo's resolution code. That issue is https://github.com/rust-lang/cargo/issues/5678.

Then there is allowing a change to a crate's canonical capitalization, which is https://github.com/rust-lang/crates.io/issues/1451.

So, if I understand correctly, that leaves the feature request to allow renaming of crates, and if someone has the old name in their Cargo.toml and runs cargo update, to "redirect" and resolve the versions using the new name.

I do think this should go through the RFC process; I'm concerned about the perceived security angle. I think it would be quite unexpected to ask for package foo and silently get package bar. Perhaps there could be a warning saying something like "you're depending on package foo but it has been renamed to bar; change foo to bar to get future updates", but that would need a way to silence that warning if you didn't want to update... these are the sorts of details I'd want someone to explore with the community through an RFC.

So I'm closing this issue for now, pending an RFC. Thanks!

deeprobin

comment created time in 16 days

PR opened rust-lang/rust-central-station

Backfill MAJOR.MINOR Rustup channel manifests with a script this time

This is a replacement for #949

Now that https://github.com/rust-lang/rust/pull/76107 has been merged, new releases will also write their manifests to channel-rust-1.x.toml as well as channel-rust-1.x.y.toml to enable rustup install 1.48 to get the latest patch release in a minor release series.

This commit adds an idempotent script to copy manifests and their signatures for the last patch release in every minor release series to the corresponding minor manifest files so that past minor versions will work with the rustup functionality too.

This script should only need to be run once, but should be safe to run more than once.

It starts at 1.8 because we don't have manifests for 1.0-1.7, and it ends with 1.47 because 1.48 will be the first stable release to write out the 1.x channel manifest.
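The script itself isn't included here; the following is a hedged, local-filesystem sketch in Rust of the copy step described above. The directory layout, the signature file extensions (`.asc`, `.sha256`), and the example version list are assumptions for illustration, not details from the actual script.

```
// Hedged sketch of the backfill idea: for each minor series, copy the newest
// patch release's channel manifest (and its signature files) to the
// MAJOR.MINOR name. Re-running just overwrites the destinations with
// identical content, so it stays idempotent.
use std::fs;
use std::io;
use std::path::Path;

fn backfill(dist_dir: &Path, latest_patches: &[(u32, u32, u32)]) -> io::Result<()> {
    for &(major, minor, patch) in latest_patches {
        for ext in &["toml", "toml.asc", "toml.sha256"] {
            let src = dist_dir.join(format!("channel-rust-{}.{}.{}.{}", major, minor, patch, ext));
            let dst = dist_dir.join(format!("channel-rust-{}.{}.{}", major, minor, ext));
            if src.exists() {
                fs::copy(&src, &dst)?;
            }
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Hypothetical inputs: the last patch release of each minor series.
    backfill(Path::new("dist"), &[(1, 8, 0), (1, 47, 0)])
}
```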

r? @pietroalbini <3

+43 -0

0 comment

1 changed file

pr created time in 18 days

create branch carols10cents/rust-central-station

branch : backfill-manifests-script

created branch time in 18 days

PR opened apache/arrow

One definition/repetition level test

Hey @nevi-me, before I go write a bunch of these, is this what would be useful for testing levels? Is there an easier way to create the arrays?

I'm basing these on tests in the C++ implementation that have a nice JSON constructor, and I tried using the JSON Reader but I couldn't get what I created with the JSON Reader to what I created that I currently have here :-/

Thank you for any feedback you have!

+29 -0

0 comment

1 changed file

pr created time in 20 days

create branch integer32llc/arrow

branch : def-rep-level-tests

created branch time in 20 days

PR opened apache/arrow

ARROW-8426: [Rust] [Parquet] - Add more support for converting Dicts

This adds more support for:

  • When converting Arrow -> Parquet and the Arrow data contains a Dictionary, materialize the Dictionary values and send them to Parquet to be dictionary-encoded (or not) according to the Parquet settings (not supported: converting an Arrow Dictionary directly to Parquet dictionary encoding; this commit also only supports Int32 index types and removes NULLs)
  • When converting Parquet -> Arrow, notice when the Arrow schema metadata in a Parquet file has a Dictionary type and convert the data to an Arrow dictionary (right now this only supports String dictionaries)

I'm not sure if this is in a good enough state to merge or not yet, please let me know @nevi-me !
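For context, here is a hedged sketch of the kind of input the first bullet describes, a string dictionary with Int32 keys; the builder names and signatures are assumed from the arrow crate at the time of this branch and may have changed since.

```
// Hedged sketch, not code from this PR: build a DictionaryArray<Int32, Utf8>.
// Per the description above, writing such an array to Parquet materializes
// the dictionary values first and drops NULL slots.
use arrow::array::{Array, PrimitiveBuilder, StringBuilder, StringDictionaryBuilder};
use arrow::datatypes::Int32Type;
use arrow::error::Result;

fn main() -> Result<()> {
    let keys = PrimitiveBuilder::<Int32Type>::new(4);
    let values = StringBuilder::new(4);
    let mut builder = StringDictionaryBuilder::new(keys, values);

    builder.append("foo")?; // first occurrence creates a dictionary entry
    builder.append("bar")?;
    builder.append_null()?;
    builder.append("foo")?; // reuses the existing "foo" entry

    let dict_array = builder.finish();
    println!("{} rows, data type {:?}", dict_array.len(), dict_array.data_type());
    Ok(())
}
```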

+214 -16

0 comment

3 changed files

pr created time in 20 days

create branch integer32llc/arrow

branch : dict

created branch time in 20 days


PR closed integer32llc/arrow

add option to project root columns from schema

The default Parquet projection works at the leaf level, such that the schema below would have 4 leaf fields:

a: Struct<b: String, c: Int32>
d: List<Float32>
e: Int64
----
leaf 1: a.b
leaf 2: a.c
leaf 3: d
leaf 4: e

By default, when selecting fields 1 and 3, we don't get a and e, but we get a.b and d. This is often undesirable for users who might want to select a, without knowing that it's spread out into 2 leaf indices.

This adds the option to select fields by their root.
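To illustrate the difference, here is a hedged usage sketch; the `get_schema_by_columns(indices, leaf_columns)` signature is taken from the review diff quoted further down on this page, and the index-to-field mapping assumes the example schema above. This is not code from the PR itself.

```
// Hedged sketch: projecting by root fields vs. by leaf columns for the schema
// `a: Struct<b, c>, d: List<Float32>, e: Int64`. Constructing the reader from
// a Parquet file is elided here.
use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::errors::Result;

fn project(reader: &mut ParquetFileArrowReader) -> Result<()> {
    // leaf_columns = false: indices are root fields, so 0 and 2 select `a` and `e`.
    let roots = reader.get_schema_by_columns(vec![0, 2], false)?;
    // leaf_columns = true: indices are leaf columns, so 0 and 2 select `a.b` and `d`.
    let leaves = reader.get_schema_by_columns(vec![0, 2], true)?;
    println!("roots: {:?}\nleaves: {:?}", roots, leaves);
    Ok(())
}
```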

We also read all Arrow types in the roundtrip test. Lists and structs continue to fail, and have been commented out.

+730 -78

1 comment

9 changed files

nevi-me

pr closed time in 21 days

pull request comment integer32llc/arrow

add option to project root columns from schema

I cherry-picked onto schema-roundtrip because I force pushed that 🤭

nevi-me

comment created time in 21 days

pull request comment apache/arrow

ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available

@nevi-me I added your commit onto this branch!

carols10cents

comment created time in 21 days

push event integer32llc/arrow

Neville Dipale

commit sha 69b474330aba3382de79bcfa616ef959f0d44bec

add option to project root columns from schema

view details

push time in 21 days

Pull request review comment integer32llc/arrow

add option to project root columns from schema

 mod tests {
                 ),
                 Field::new("c32", DataType::LargeBinary, true),
                 Field::new("c33", DataType::LargeUtf8, true),
+                // Field::new(
+                //     "c34",
+                //     DataType::LargeList(Box::new(DataType::List(Box::new(
+                //         DataType::Struct(vec![
+                //             Field::new("a", DataType::Int16, true),
+                //             Field::new("b", DataType::Float64, true),
+                //         ]),
+                //     )))),
+                //     true,
+                // ),
+            ],
+            metadata,
+        );
+
+        // write to an empty parquet file so that schema is serialized
+        let file = get_temp_file("test_arrow_schema_roundtrip.parquet", &[]);
+        let mut writer = ArrowWriter::try_new(
+            file.try_clone().unwrap(),
+            Arc::new(schema.clone()),
+            None,
+        )?;
+        writer.close()?;
+
+        // read file back
+        let parquet_reader = SerializedFileReader::try_from(file)?;
+        let mut arrow_reader = ParquetFileArrowReader::new(Rc::new(parquet_reader));
+        let read_schema = arrow_reader.get_schema()?;
+        assert_eq!(schema, read_schema);
+
+        // read all fields by columns
+        let partial_read_schema =
+            arrow_reader.get_schema_by_columns(0..(schema.fields().len()), false)?;

Nice! I hadn't yet figured out why there were a different number of Parquet columns and Arrow fields; I think I'm starting to understand now :)

nevi-me

comment created time in 21 days


pull request comment apache/arrow

ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available

@nevi-me I saw it just after :) I'm looking at it now! I don't think there are conflicts, and I think my last commit is addressing a different issue than your last commit?

carols10cents

comment created time in 21 days

pull request comment apache/arrow

ARROW-10168: [Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available

Ok @nevi-me, I rebased this PR on the branch and I think this is ready for review now. It pushes more type information from the arrow metadata schema down into the reading code... the LargeBinary and LargeUtf8 tests are still failing, but no longer because their schemas don't match ;)

carols10cents

comment created time in 21 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha e456dfc6f2d4519a1bf2ccca9531a75065216c2f

ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

Note that this PR goes to the rust-parquet-arrow-writer branch, not master. Inspired by tests in cpp/src/parquet/arrow/arrow_reader_writer_test.cc, these perform a round trip Arrow -> Parquet -> Arrow of a single RecordBatch with a single column of values for each of the supported data types and some of the unsupported ones. Tests that currently fail are either marked with `#[should_panic]` (if the reason they fail is a panic) or `#[ignore]` (if the reason they fail is that the values don't match).

I am comparing the RecordBatch's column data before and after the round trip directly; I'm not sure whether this is appropriate, because for some data types the `null_bitmap` isn't matching and I'm not sure if it's supposed to or not. So I would love advice on that front, and I would love to know if these tests are useful or not!

Closes #8330 from carols10cents/roundtrip-tests

Lead-authored-by: Carol (Nichols || Goulding) <carol.nichols@gmail.com>
Co-authored-by: Neville Dipale <nevilledips@gmail.com>
Signed-off-by: Andy Grove <andygrove@nvidia.com>

view details

Carol (Nichols || Goulding)

commit sha 51ce6130a519f5ea03d97472582477ef8ff10fef

ARROW-10168: [Rust] [Parquet] Use Arrow schema by column too Previously, if an Arrow schema was present in the Parquet metadata, that schema would always be returned when requesting all columns via `parquet_to_arrow_schema` and would never be returned when requesting a subset of columns via `parquet_to_arrow_schema_by_columns`. Now, if a valid Arrow schema is present in the Parquet metadata and a subset of columns is requested by Parquet column index, the `parquet_to_arrow_schema_by_columns` function will try to find a column of the same name in the Arrow schema first, and then fall back to the Parquet schema for that column if there isn't an Arrow Field for that column. This is part of what is needed to be able to restore Arrow types like LargeUtf8 from Parquet.

view details
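The fallback described in the commit message above can be pictured roughly as follows; the helper function is hypothetical (the real change lives inside the schema conversion code), with `Schema::field_with_name` being the only arrow API assumed here.

```
// Hedged sketch, not the commit's code: prefer the field from the embedded
// Arrow schema when one with the same name exists, otherwise fall back to a
// field converted from the Parquet column.
use arrow::datatypes::{Field, Schema};

fn field_for_parquet_column(
    arrow_schema: Option<&Schema>,
    parquet_column_name: &str,
    convert_from_parquet: impl FnOnce() -> Field,
) -> Field {
    arrow_schema
        .and_then(|schema| schema.field_with_name(parquet_column_name).ok().cloned())
        .unwrap_or_else(convert_from_parquet)
}
```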

Neville Dipale

commit sha 332f44036805a5d31f0e21aa778de7e540382b17

run cargo +stable fmt (and clippy)

view details

Carol (Nichols || Goulding)

commit sha 30e3e41edba3e2f38c3a96f81bd22d4c74f81acd

ARROW-10168: [Rust] [Parquet] Convert LargeString and LargeBinary types back from Parquet

view details

push time in 21 days

PR closed rust-lang/rust-central-station

Backfill MAJOR.MINOR Rustup channel manifests

Now that https://github.com/rust-lang/rust/pull/76107 has been merged, new releases will also write their manifests to channel-rust-1.x.toml as well as channel-rust-1.x.y.toml to enable rustup install 1.48 to get the latest patch release in a minor release series.

This commit adds an idempotent function to copy manifests for the last patch release in every minor release series to the corresponding minor manifest so that past minor versions will work with the rustup functionality too.

This function should only need to be run once, but should be safe to run more than once.

r? @pietroalbini or @Mark-Simulacrum

+49 -0

2 comments

1 changed file

carols10cents

pr closed time in 21 days

pull request comment rust-lang/rust-central-station

Backfill MAJOR.MINOR Rustup channel manifests

Yep, I can do that! Going to close this in the meantime :) Thank you for the feedback!

carols10cents

comment created time in 21 days

create branch carols10cents/rust-central-station

branch : backfill-manifests

created branch time in 23 days

PR opened rust-lang/rust-central-station

Backfill MAJOR.MINOR Rustup channel manifests

Now that https://github.com/rust-lang/rust/pull/76107 has been merged, new releases will also write their manifests to channel-rust-1.x.toml as well as channel-rust-1.x.y.toml to enable rustup install 1.48 to get the latest patch release in a minor release series.

This commit adds an idempotent function to copy manifests for the last patch release in every minor release series to the corresponding minor manifest so that past minor versions will work with the rustup functionality too.

This function should only need to be run once, but should be safe to run more than once.

r? @pietroalbini or @Mark-Simulacrum

+49 -0

0 comment

1 changed file

pr created time in 23 days

pull request comment apache/arrow

ARROW-10191: [Rust] [Parquet] Add roundtrip Arrow -> Parquet tests for all supported Arrow DataTypes

@nevi-me Done! https://issues.apache.org/jira/secure/ViewProfile.jspa?name=carols10cents

carols10cents

comment created time in 23 days

PR opened apache/arrow

[Rust] [Parquet] Schema roundtrip - use Arrow schema from Parquet metadata when available

@nevi-me This is one commit on top of https://github.com/apache/arrow/pull/8330 that I'm opening to get some feedback from you on about whether this will help with ARROW-10168. I think this will bring the Rust implementation more in line with C++, but I'm not certain.

I tried removing the #[ignore] attributes from the LargeBinary and LargeUtf8 tests, but they're still failing because the schemas don't match yet; it looks like this code will need to be changed as well.

That build_array_reader function's code looks very similar to the code I've changed here; is there a possibility for the code to be shared, or is there a reason they're separate?

+563 -48

0 comment

5 changed files

pr created time in 23 days

create branch integer32llc/arrow

branch : schema-roundtrip

created branch time in 23 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha c3f3597efc96641577bc84aac66120a22c169675

Remove unused import

view details

push time in 23 days

push event integer32llc/arrow

Carol (Nichols || Goulding)

commit sha 8fe210b1f5d93aa14774917835588ab83d9f4e70

Update to use new Iterator support by calling next instead of next_batch

view details

push time in 23 days

PR closed integer32llc/arrow

fix a few failing roundtrip tests

This is on top of @carols10cents' PR (https://github.com/apache/arrow/pull/8330). I've fixed some of the failing tests that were ignored.

+1494 -83

1 comment

9 changed files

nevi-me

pr closed time in 23 days

pull request comment integer32llc/arrow

fix a few failing roundtrip tests

I merged this in to integer32llc/roundtrip-tests.

nevi-me

comment created time in 23 days
