Hannes Mühleisen (hannesmuehleisen) · CWI Amsterdam · http://hannes.muehleisen.org · Database Architectures, DuckDB

cwida/duckdb 2494

DuckDB is an in-process SQL OLAP Database Management System

cwida/public_bi_benchmark 32

BI benchmark with user generated data and queries

hannesmuehleisen/clickhouse-r 28

Rstats client for ClickHouse (https://clickhouse.yandex)

co0p/codekata 1

let's practise coding ... by coding :)

co0p/totalRecall 0

you call us, we do the calling for you

hannesmuehleisen/aisdecoder 0

AIS decoder module for node.js

issue opened cwida/duckdb

Can you provide a row iterator interface in Python for query results?

All of the row result methods here (https://duckdb.org/docs/api/python) — fetchdf, fetchall, and fetchnumpy — store the query results in memory. If the DuckDB database is sufficiently large, we will run out of memory.

Can you provide an iterator-based method that won't create a list the way fetchall does? I would recommend just having fetchall return an iterator; that way, you can still call list(fetchall()) if you want a list of results.
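
A minimal sketch of a workaround that is possible today, assuming the Python client exposes a DB-API-style fetchone(); the database file and table name are purely illustrative:

import duckdb

con = duckdb.connect("large.duckdb")    # illustrative database file
con.execute("SELECT * FROM big_table")  # illustrative table

# Pull one row at a time instead of materializing everything with fetchall()
while True:
    row = con.fetchone()
    if row is None:
        break
    print(row)  # stand-in for per-row processing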

created time in 4 hours

issue comment cwida/duckdb

Can you add a Python Appender API?

I would recommend two different APIs for appending.

  1. Via dataframes: Append a dataframe to an existing table. This is already possible via the APIs you built, but it could be slightly simpler for users, e.g.:
table.append_df(df)

That would basically create a virtual reference to the dataframe, insert the values, and then delete the virtual reference (a rough sketch of this follows below).

  2. Via the appender API you recommended

While you are correct that this will add latency because of Python object allocation overhead, that might still pale in comparison to the overhead of writing to disk?

This will be especially useful when writing entries to a DuckDB database that are too big to fit into memory. The appender API would let us add rows in batches (in memory) and then flush them to disk.
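
As a rough sketch of what such an append_df helper could do under the hood (the helper, its name, and the view name are hypothetical; it assumes the Python client's register() for creating the temporary "virtual reference" described above):

import duckdb
import pandas as pd

def append_df(con, table_name, df):
    # Hypothetical helper: append a pandas dataframe to an existing table
    view_name = "__append_df_tmp"  # temporary virtual reference to the dataframe
    con.register(view_name, df)    # expose the dataframe as a view
    con.execute("INSERT INTO " + table_name + " SELECT * FROM " + view_name)
    con.unregister(view_name)      # drop the virtual reference again (if supported)

con = duckdb.connect()  # in-memory database for the example
con.execute("CREATE TABLE t (i INTEGER, j INTEGER)")
append_df(con, "t", pd.DataFrame({"i": [1, 2], "j": [3, 4]}))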

fintron

comment created time in 8 hours

pull request comment cwida/duckdb

Refactor and nested types support for Parquet Reader

It seems that there is a bug: after all data has been Fetched, the parquet reader returns nullptr instead of an empty DataChunk

Could you please post a full example of what goes wrong in a new issue? Not sure I understand.

Ah, actually, disregard that; it seems that the behaviour of duckdb has changed in query_result. It used to be:

	//! Fetches a DataChunk from the query result. Returns an empty chunk if the result is empty, or nullptr on failure.
	virtual unique_ptr<DataChunk> Fetch() = 0;

Now:

	//! Fetches a DataChunk of normalized (flat) vectors from the query result.
	//! Returns nullptr if there are no more results to fetch.
	DUCKDB_API virtual unique_ptr<DataChunk> Fetch();

So it used to return an empty chunk; now it returns nullptr.

hannesmuehleisen

comment created time in 11 hours

issue opened cwida/duckdb

Checking out tag v0.2.3 still produces a 0.2.4 version

https://github.com/cwida/duckdb/blob/436f6455f6e48b571bf5ba0812332f08d0bd65f4/CMakeLists.txt#L138

➜  ./build/release/duckdb -version
0.2.4-dev0 436f6455f
➜  duckdb git:(436f6455f) git describe --tags --abbrev=0
v0.2.3
➜  duckdb git:(436f6455f) git describe --tags --long
v0.2.3-0-g436f6455f

created time in 11 hours

pull request comment cwida/duckdb

Refactor and nested types support for Parquet Reader

It seems that there is a bug: after all data has been Fetched, the parquet reader returns nullptr instead of an empty DataChunk

hannesmuehleisen

comment created time in 12 hours

issue comment cwida/duckdb

Prepare slower than Appender

Oh, I meant nothing so specific, just some basic tips like the Appender or transactions, but for the read side.. but since you insist.. :) .. I wrote a simple read test using the old SQL queries. The improvement compared to SQLite is about 20x. Very impressive. The heavy part of my DB is a sort of timed series (key, blob, timestamp); I get millions of these tuples in a recording session. When replaying the data, I need to retrieve the last blob for a given (key, time), or retrieve all tuples in a given time range. The real thing is not exactly like this, but this explains it a little more.

So, back to my test, I executed a read benchmark:

  • a loop calling Execute -> about 21sec
  • a loop calling Query -> about 6sec
  • SQLite -> 120sec

The query for the Prepare is "SELECT type, value, MAX(time) FROM attribute WHERE object_id = $1 AND time <= $2 GROUP BY type, value;".

The Query call instead is: auto result = con->Query("SELECT type, value, MAX(time) FROM attribute WHERE object_id = " + std::to_string(object_id) + " AND time <= " + std::to_string(time) + " GROUP BY type, value;");

I don't understand why the Prepare + Execute is slower than the simple Query call.

TheGiamig

comment created time in 13 hours

issue closed cwida/duckdb

Syntax error for WITH RECURSIVE query

As discovered in PR #1075, the following query runs into an infinite loop on the LDBC SNB SF0.1 data set.

WITH RECURSIVE search_graph(link, depth) AS (
		SELECT 17592186044856, 0 -- the big number is the start person Id
		UNION ALL
		(SELECT distinct k_person2id, x.depth+1
		FROM knows, search_graph x
		WHERE x.link = k_person1id)
)
select * from search_graph

The same code terminates under Postgres (v12.4):

      link      | depth 
----------------+-------
 17592186044856 |     0
(1 row)

closed time in 14 hours

szarnyasg

issue comment cwida/duckdb

Syntax error for WITH RECURSIVE query

PR #1239 fixed this; the query in https://github.com/cwida/duckdb/issues/1088#issuecomment-723650858 now works:

┌────────────┐
│ max(depth) │
├────────────┤
│ 2          │
└────────────┘

Thanks @lnkuiper!

szarnyasg

comment created time in 14 hours

issue comment cwida/duckdb

Prepare slower than Appender

That is hard to answer without more details about what you are doing/trying to do/what API you are using. Perhaps you could expand a bit on your problem, the speed you are encountering and the speed you expect?

TheGiamig

comment created time in 15 hours

pull request comment cwida/duckdb

Use ccache action

The "R package Windows" workflow consistently fails, also on the main branch. I'll take a look.

krlmlr

comment created time in 15 hours

PR opened cwida/duckdb

Fix for NULL values in IN clauses
+15 -1

0 comment

2 changed files

pr created time in 15 hours

issue comment cwida/duckdb

Prepare slower than Appender

Thank you Mark! Yeah, I know transactions remove some overhead (same for SQLite, a huge speed improvement). Do you have some tips for speeding up reads? Is a transaction a no-op in that case? Thanks

TheGiamig

comment created time in 15 hours

issue comment cwida/duckdb

Prepare slower than Appender

This is fully expected; the appender is optimized for exactly this use case. Prepare avoids e.g. re-running the optimization and planning phase but still needs to do more than just the Appender. For example, it sets up the query context and creates a result object for every query that is run. There is likely room for improving that somewhat, but using the appender is preferable since the code there is much simpler and faster.

Another thing you should do is wrap multiple statements in a BEGIN TRANSACTION and COMMIT block. This also applies to the appender. If you do not do this but run in auto-commit mode, there will be the additional overhead of starting and committing transactions, which includes syncing data to disk if you are not running in in-memory mode. This is extremely slow. That is, your code should look like this:

con.BeginTransaction();
for(...) {
   con.Execute(1, 2, 3, 4);
}
con.Commit();

Or like this with the appender:

con.BeginTransaction();
for(...) {
	appender.AppendRow(1, 2, 3, 4);
}
con.Commit();

TheGiamig

comment created time in 16 hours

issue comment cwida/duckdb

Prepare slower than Query

Sorry, I wrote connection.Query but I was talking about the Appender.

TheGiamig

comment created time in 16 hours

issue opened cwida/duckdb

Prepare slower than Query

Hi, I'm trying to speed up a data recorder/replayer based on SQLite.

I coded an application to transfer data from the old DB to DuckDB. As a first attempt I used the connection.Prepare + Execute methods to insert the data, but the transfer speed wasn't that high.. Also using the C API the transfer seems really slow. Then I rewrote the thing using connection.Query and obtained much better performance.

Is there something I completely missed, or is this behaviour correct? Thanks

created time in 16 hours

PR opened cwida/duckdb

Use ccache action

This speeds up the build time for the debug build from more than 7 minutes to just shy of 2 minutes.

Build with cold cache: https://github.com/krlmlr/duckdb/actions/runs/514705674

Build with warm cache: https://github.com/krlmlr/duckdb/actions/runs/514741712

We should consider one of the following options, to avoid corruption of the GitHub action used here:

  • Fork https://github.com/hendrikmuhs/ccache-action in the cwida organization and use our forked action
  • Embed the action in this repository and use it with the syntax ./.github/actions/ccache-action
  • Use a full SHA1 to refer to the current state of the action

We also need to enable ccache for the other builds. The size of the cache for the debug build is ~200 MB, which means there's enough space to store caches for the other build variants. (The limit is 5 GB per repository; after that, caches are evicted.)

+8 -3

0 comment

1 changed file

pr created time in 17 hours

started facebook/rocksdb

started time in 17 hours

issue comment cwida/duckdb

Can you add a Python Appender API?

You can also insert into a table from any query, including from a pandas dataframe. For example:

INSERT INTO test_df_table SELECT * FROM test_df_view

Does that resolve the issue, or would you propose a different method of appending?

The problem with porting the Appender as-is is that constructing Python objects is really expensive. Perhaps it would help if you could tell us what you envision this API looking like. A straightforward wrapper, e.g.:

appender = con.append("table");
appender.append(1, 2);
appender.append(1, 2);

Would still be very slow because of all the Python object allocation and de-allocation overhead.
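
For reference, a minimal sketch of the INSERT-from-dataframe route mentioned above, assuming the Python client's register() to expose a pandas dataframe as a view (the table and view names mirror the example above):

import duckdb
import pandas as pd

con = duckdb.connect()  # in-memory database for illustration
con.execute("CREATE TABLE test_df_table (a INTEGER, b INTEGER)")

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
con.register("test_df_view", df)  # make the dataframe visible as a view
con.execute("INSERT INTO test_df_table SELECT * FROM test_df_view")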

fintron

comment created time in 18 hours

issue comment cwida/duckdb

Filter issue with `IN` and `NULL`

Likely caused by my IN optimization. @hannesmuehleisen please assign me.

hannesmuehleisen

comment created time in 18 hours

PR opened cwida/duckdb-web

Adding new functions and new data types
+7 -0

0 comment

3 changed files

pr created time in 19 hours

issue opened cwida/duckdb

Can you add a Python Appender API?

It would be really helpful to be able to use the Appender API (https://duckdb.org/docs/data/appender) in Python.

Unfortunately, the best option in Python right now is the Pandas dataframe route (https://duckdb.org/docs/api/python), which isn't great when you want to bulk append to an existing table.
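
In the meantime, a hedged sketch of bulk-appending rows via the DB-API-style executemany, assuming it is available in the Python client (the table and column definitions are illustrative):

import duckdb

con = duckdb.connect()  # in-memory database for illustration
con.execute("CREATE TABLE events (id INTEGER, payload VARCHAR)")

rows = [(1, "a"), (2, "b"), (3, "c")]
# One prepared INSERT reused for many rows instead of per-row Python calls
con.executemany("INSERT INTO events VALUES (?, ?)", rows)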

created time in 20 hours

PR opened cwida/duckdb

Add 'regexp_split_to_array' alias to 'string_split_regex'

Follow-up to #1006.

This commit adds the Postgres-compatible regexp_split_to_array alias to string_split_regex.

cc @lnkuiper
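
A small usage sketch of the new alias, issued through the Python client (the exact shape of the returned value is an assumption):

import duckdb

con = duckdb.connect()
# regexp_split_to_array is the Postgres-compatible alias for string_split_regex
result = con.execute("SELECT regexp_split_to_array('a,b;c', '[,;]')").fetchall()
print(result)  # expected: a single row holding the list ['a', 'b', 'c'] (illustrative)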

+1 -1

0 comment

1 changed file

pr created time in a day

issue comment cwida/duckdb

Auto Increment Primary Key And/or Serial

never mind, this worked well:

ALTER TABLE gh ADD COLUMN i BIGINT DEFAULT NEXTVAL('seq');

willium

comment created time in a day

issue comment cwida/duckdb

How to tell DuckDB that the csv is compressed

Thank you very much

djouallah

comment created time in a day

issue comment cwida/duckdb

Auto Increment Primary Key And/or Serial

CREATE SEQUENCE seq;
CREATE TABLE gh (i BIGINT DEFAULT NEXTVAL('seq'), time TIME, count INTEGER);
INSERT INTO gh SELECT * FROM read_csv_auto('github.csv');

Error: Binder Error: table gh has 3 columns but 2 values were supplied

-- Is there a way to handle this in cases where we don't know the column names for the given csv?

INSERT INTO gh(time,count) SELECT * FROM read_csv_auto('github.csv');
SELECT * FROM gh LIMIT 5;
┌───┬──────────┬───────┐
│ i │   time   │ count │
├───┼──────────┼───────┤
│ 1 │ 01:00:00 │ 2     │
│ 2 │ 04:00:00 │ 3     │
│ 3 │ 05:00:00 │ 1     │
│ 4 │ 08:00:00 │ 1     │
│ 5 │ 09:00:00 │ 3     │
└───┴──────────┴───────┘

willium

comment created time in a day

issue opened cwida/duckdb

Can you safely copy the file while it’s being written to?

Somewhat of a follow-up to this question: https://github.com/cwida/duckdb/issues/1330

If you have one process continually writing to a DB, can you copy that file and be sure that it won't be in a corrupt or somewhat indeterminate state?

(Obviously, it would be hard to predict which writes made it in or not. Copying the DB file while it's being written might leave it in a corrupt state. Basically, I'm not sure whether writes are atomic, etc.)

created time in a day

issue opened cwida/duckdb

Does DuckDB work with single writer, multiple readers?

I have a use case where one process is continually writing to a DB and multiple readers are reading from it.

Does DuckDB support this use case? The docs say multiple readers are OK, but I'm not sure whether that also allows one writer.
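
For context, a minimal sketch of how a reader would open the file in the Python client using the read_only flag (the file and table names are illustrative; whether such readers can coexist with a separate writer process is exactly the question being asked):

import duckdb

# A reader process: open an existing database file read-only and run queries;
# multiple read-only connections to the same file are the documented "multireader" setup.
con = duckdb.connect("data.duckdb", read_only=True)
print(con.execute("SELECT count(*) FROM events").fetchall())  # 'events' is an illustrative table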

created time in a day

PR opened cwida/duckdb

Allow CTEs under set operation nodes

This PR moves the SelectStatement::cte_map to QueryNode::cte_map, so that queries like the following work:

SELECT 1 UNION ALL (WITH cte AS (SELECT 42) SELECT * FROM cte);

Before, an error would be thrown because the binder was not able to find cte.

This was found through #1088, which should now also be fixed.

Happy to receive any feedback.

+120 -87

0 comment

20 changed files

pr created time in a day

push event cwida/duckdb

Pedro Holanda

commit sha 14872eed0601c1cbcc145e5287186639d16112aa

Adding initial implementation for unsigned types

view details

Pedro Holanda

commit sha 695de731ba997528060c41e7a0408fa57a342dae

More on casting of unsigned types and arithmetic ops

view details

Pedro Holanda

commit sha 63c697e93cfb0aab17adbcdd02f9085aeb4a72a3

More of casting and unassigned in python/arrow

view details

Pedro Holanda

commit sha 9294c2e0a394f5bf019f7988a99a080c797d7315

Tests passing

view details

Pedro Holanda

commit sha 730ae9c8c2370e8e196ae3212acd04010e2336f6

Merge branch 'master' into unsignedtypes

view details

Pedro Holanda

commit sha 079d3bb0fd412f5a0538905a66443ba5fac77fd6

Parquet read and python tests for unsigned

view details

Pedro Holanda

commit sha b58964761950977fe9c62395ec4e123858541b89

Adding fixes for sqlancer

view details

Pedro Holanda

commit sha 77ce538b82732ff095af589b93745188d0227f65

Fixing build

view details

Pedro Holanda

commit sha 89ae2116da6d35f5502f456e184e50914979f719

Changes requested in the PR

view details

Mark

commit sha 8b03e470e13f3cb692708c91d23f0f30a1e777f6

Merge pull request #1325 from pdet/unsignedtypes Implementing Unsigned Types

view details

push time in a day

PR merged cwida/duckdb

Implementing Unsigned Types

Task #1023

Since the unsigned types are built-in, I think the most relevant changes are in the cast operators. Let me know if I forgot any interface or if I should add any other relevant tests.

+3026 -42

6 comments

68 changed files

pdet

pr closed time in a day
