cwida/duckdb 2494
DuckDB is an in-process SQL OLAP Database Management System
issue opened cwida/duckdb
Can you provide a row iterator interface in Python for query results?
All of the row result methods here: https://duckdb.org/docs/api/python
- fetchdf
- fetchall
- fetchnumpy
will store the query results in memory. If the DuckDB database is sufficiently large, we will run out of memory.
Can you provide an iterator-based method that won't create a list like fetchall does? I would recommend just having fetchall return an iterator. That way, you can still call list(fetchall) if you want a list of results.
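For illustration, a minimal sketch of what such row-at-a-time iteration could look like with today's Python API, assuming the connection exposes a DB-API-style fetchone(); if that method is not available in your version, treat this purely as pseudocode for the proposed interface:
import duckdb

def iter_rows(con, query):
    # Yield result rows one at a time instead of materializing a full list.
    con.execute(query)
    while True:
        row = con.fetchone()  # assumes a DB-API-style fetchone() is available
        if row is None:
            break
        yield row

con = duckdb.connect()  # in-memory database, just for illustration
con.execute("CREATE TABLE t AS SELECT * FROM range(5)")
for row in iter_rows(con, "SELECT * FROM t"):
    print(row)
If fetchall itself returned such a generator, list(fetchall()) would recover the old behaviour.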
created time in 4 hours
issue comment cwida/duckdb
Can you add a Python Appender API?
I would recommend two different APIs for appending.
- Via dataframes: Append a dataframe to an existing table. This is already possible via the APIs you built, but it could be slightly simpler for users. E.g.:
table.append_df(df)
That would basically create a virtual reference to the dataframe, insert the values in, and then delete the virtual reference (see the sketch after this list).
- Via the appender API you recommended
While you are correct that this will add latency b/c of Python object allocation overhead, this might still pale in comparison to the overhead of writing to disk?
This will be especially useful when writing entries to the DuckDB that are too big to fit into memory. This appender API would let us basically add things in batch (into memory) and then flush it to disk.
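To make the append_df idea concrete, here is a rough sketch using the existing Python API, assuming con.register() and con.unregister() are available in your version; the append_df helper itself is hypothetical and not part of DuckDB:
import duckdb
import pandas as pd

def append_df(con, table_name, df):
    # Hypothetical helper: register the dataframe as a virtual reference,
    # bulk-insert it into the target table, then drop the reference again.
    con.register("tmp_append_view", df)
    con.execute("INSERT INTO " + table_name + " SELECT * FROM tmp_append_view")
    con.unregister("tmp_append_view")

con = duckdb.connect()
con.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
append_df(con, "t", pd.DataFrame({"a": [1, 2], "b": [3, 4]}))
The INSERT ... SELECT does the bulk copy inside the engine, so the per-row Python object overhead discussed above is avoided.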
comment created time in 8 hours
pull request comment cwida/duckdb
Refactor and nested types support for Parquet Reader
It seems that there is a bug: after all data has been Fetched, the parquet reader returns nullptr instead of an empty DataChunk.
Could you please post a full example of what goes wrong in a new issue? Not sure I understand.
Ah, actually, disregard that, it seems that the behaviour of duckdb has changed in query-result:
Used to be:
//! Fetches a DataChunk from the query result. Returns an empty chunk if the result is empty, or nullptr on failure.
virtual unique_ptr<DataChunk> Fetch() = 0;
Now:
//! Fetches a DataChunk of normalized (flat) vectors from the query result.
//! Returns nullptr if there are no more results to fetch.
DUCKDB_API virtual unique_ptr<DataChunk> Fetch();
So, it used to return an empty chunk, now it returns a nullptr.
comment created time in 11 hours
issue opened cwida/duckdb
Checkout of tag v0.2.3 still produces 0.2.4 version
https://github.com/cwida/duckdb/blob/436f6455f6e48b571bf5ba0812332f08d0bd65f4/CMakeLists.txt#L138
➜ ./build/release/duckdb -version
0.2.4-dev0 436f6455f
➜ duckdb git:(436f6455f) git describe --tags --abbrev=0
v0.2.3
➜ duckdb git:(436f6455f) git describe --tags --long
v0.2.3-0-g436f6455f
created time in 11 hours
pull request comment cwida/duckdb
Refactor and nested types support for Parquet Reader
It seems that there is a bug: after all data has been Fetched, the parquet reader returns nullptr instead of an empty DataChunk.
comment created time in 12 hours
issue comment cwida/duckdb
Oh, I mean something not so specific, just some basic tips like the Appender or transactions, but for the read side.. ..but since you insist.. :) .. I wrote a simple read test using the old SQL queries. The improvement compared to SQLite is about 20X. Very impressive. The heavy part of my DB is a sort of time series (key, blob, timestamp). I get millions of these tuples in a recording session. When replaying the data, I need to retrieve the last blob for a given (key, time) or retrieve all tuples in a given time range. The real thing is not exactly like this, but this is just to explain a little more.
So, back to my test, I executed a read benchmark:
- a loop calling Execute -> about 21sec
- a loop calling Query -> about 6sec
- SQLite -> 120sec
The query for the Prepare is "SELECT type, value, MAX(time) FROM attribute WHERE object_id = $1 AND time <= $2 GROUP BY type, value;".
The Query call instead is this:
auto result = con->Query("SELECT type, value, MAX(time) FROM attribute WHERE object_id = " + std::to_string(object_id) + " AND time <= " + std::to_string(time) + " GROUP BY type, value;");
I don't understand why the Prepare + Execute is slower than the simple Query call.
comment created time in 13 hours
issue closed cwida/duckdb
Syntax error for WITH RECURSIVE query
As discovered in PR #1075, the following query runs into an infinite loop on the LDBC SNB SF0.1 data set.
WITH RECURSIVE search_graph(link, depth) AS (
SELECT 17592186044856, 0 -- the big number is the start person Id
UNION ALL
(SELECT distinct k_person2id, x.depth+1
FROM knows, search_graph x
WHERE x.link = k_person1id)
)
select * from search_graph
The same code terminates under Postgres (v12.4):
link | depth
----------------+-------
17592186044856 | 0
(1 row)
closed time in 14 hours
szarnyasg issue comment cwida/duckdb
Syntax error for WITH RECURSIVE query
PR #1239 fixed this, the query in https://github.com/cwida/duckdb/issues/1088#issuecomment-723650858 now works:
┌────────────┐
│ max(depth) │
├────────────┤
│ 2 │
└────────────┘
Thanks @lnkuiper!
comment created time in 14 hours
issue comment cwida/duckdb
That is hard to answer without more details about what you are doing/trying to do/what API you are using. Perhaps you could expand a bit on your problem, the speed you are encountering and the speed you expect?
comment created time in 15 hours
pull request comment cwida/duckdb
The "R package Windows" workflow consistently fails, also on the main branch. I'll take a look.
comment created time in 15 hours
PR opened cwida/duckdb
pr created time in 15 hours
issue comment cwida/duckdb
Thank you Mark! Yeah, I know transactions will remove some overhead (same for SQLite, huge speed improvement). Do you have some tips for me to speed up the reading? Is a transaction a no-op in this case? Thanks
comment created time in 15 hours
issue comment cwida/duckdb
This is fully expected; the appender is optimized for exactly this use case. Prepare avoids e.g. re-running the optimization and planning phase but still needs to do more than just the Appender. For example, it sets up the query context and creates a result object for every query that is run. There is likely room for improving that somewhat, but using the appender is preferable since the code there is much simpler and faster.
Another thing you should do is wrap multiple statements in a BEGIN TRANSACTION and COMMIT block. This also applies to the appender. If you do not do this but run in auto-commit mode, there will be the additional overhead of starting and committing transactions, which includes syncing data to disk if you are not running in in-memory mode. This is extremely slow. I.e. your code should look like this:
con.BeginTransaction();
for(...) {
con.Execute(1, 2, 3, 4);
}
con.Commit();
Or like this with the appender:
con.BeginTransaction();
for(...) {
appender.AppendRow(1, 2, 3, 4);
}
con.Commit();
comment created time in 16 hours
issue comment cwida/duckdb
Sorry, I wrote connection.Query but I was talking about the Appender.
comment created time in 16 hours
issue opened cwida/duckdb
Hi, I'm trying to speed up a data recorder/replayer based on SQLite.
I coded an application to transfer data from the old DB to DuckDB. As a first attempt I used the connection.Prepare + Execute methods to insert data. The transfer speed wasn't that high.. Also using the C API, the transfer seems really slow. Then I re-wrote the thing using connection.Query, obtaining much better performance.
Is there something I completely missed, or is this behaviour correct? Thanks
created time in 16 hours
PR opened cwida/duckdb
This speeds up build time for the debug build from > 7 to just shy of 2 minutes.
Build with cold cache: https://github.com/krlmlr/duckdb/actions/runs/514705674
Build with warm cache: https://github.com/krlmlr/duckdb/actions/runs/514741712
We should consider one of the following options, to avoid corruption of the GitHub action used here:
- Fork https://github.com/hendrikmuhs/ccache-action in the cwida organization and use our forked action
- Embed the action in this repository and use it with the syntax
./.github/actions/ccache-action
- Use a full SHA1 to refer to the current state of the action
We also need to enable ccache for the other builds. The size of the cache for the debug build is ~200 MB, which means there's enough space to store caches for other build variants. (The limit is 5 GB per repository; after that, caches are evicted.)
pr created time in 17 hours
started facebook/rocksdb
started time in 17 hours
issue comment cwida/duckdb
Can you add a Python Appender API?
You can also insert into a table from any query, including from a pandas dataframe. For example:
INSERT INTO test_df_table SELECT * FROM test_df_view
Does that resolve the issue, or would you propose a different method of appending?
The problem with porting the Appender as-is is that constructing Python objects is really expensive. Perhaps it would help if you could tell us what you would envision this API to look like. A straightforward wrapper like e.g.:
appender = con.append("table");
appender.append(1, 2);
appender.append(1, 2);
Would still be very slow because of all the Python object allocation and de-allocation overhead.
comment created time in 18 hours
issue comment cwida/duckdb
Filter issue with `IN` and `NULL`
Likely caused by my IN optimization. @hannesmuehleisen please assign me.
comment created time in 18 hours
issue opened cwida/duckdb
Can you add a Python Appender API?
It would be really helpful to be able to use the Appender API (https://duckdb.org/docs/data/appender) in Python. Unfortunately, the best way to do this in Python right now is the Pandas dataframe option (https://duckdb.org/docs/api/python). This isn't a great option when you want to bulk append to an existing table.
created time in 20 hours
PR opened cwida/duckdb
Followup of #1006.
This commit adds the Postgres-compatible regexp_split_to_array alias to string_split_regex.
cc @lnkuiper
pr created time in a day
issue comment cwida/duckdb
Auto Increment Primary Key And/or Serial
never mind, this worked well:
ALTER TABLE gh ADD COLUMN i BIGINT DEFAULT NEXTVAL('seq');
comment created time in a day
issue comment cwida/duckdb
How to tell DuckDB that the csv is compressed
Thank you very much
comment created time in a day
issue comment cwida/duckdb
Auto Increment Primary Key And/or Serial
CREATE SEQUENCE seq;
CREATE TABLE gh (i BIGINT DEFAULT NEXTVAL('seq'), time TIME, count INTEGER);
INSERT INTO gh SELECT * FROM read_csv_auto('github.csv');
Error: Binder Error: table gh has 3 columns but 2 values were supplied
-- Is there a way to handle this in cases where we don't know the column names for the given csv?
INSERT INTO gh(time,count) SELECT * FROM read_csv_auto('github.csv');
SELECT * FROM gh LIMIT 5;
┌───┬──────────┬───────┐
│ i │ time │ count │
├───┼──────────┼───────┤
│ 1 │ 01:00:00 │ 2 │
│ 2 │ 04:00:00 │ 3 │
│ 3 │ 05:00:00 │ 1 │
│ 4 │ 08:00:00 │ 1 │
│ 5 │ 09:00:00 │ 3 │
└───┴──────────┴───────┘
comment created time in a day
issue opened cwida/duckdb
Can you safely copy the file while it’s being written to?
Somewhat of a follow-up to this question: https://github.com/cwida/duckdb/issues/1330
If you have one process continually writing to a DB, can you copy that file and be sure that it won't be in a corrupt or somewhat indeterminate state?
(Obviously, it would be hard to predict which writes made it in or not. I'm worried that copying the DB file while it's being written might leave it in a corrupt state. Basically, I'm not sure if writes are atomic, etc.)
created time in a day
issue opened cwida/duckdb
Does DuckDB work with single writer, multiple readers?
I have a use-case where I have one process continually writing to a DB and multiple readers who are reading from the DB.
Does DuckDB support this use-case? The docs say multiple readers are OK, but I'm not sure if that also allows for one writer.
created time in a day
PR opened cwida/duckdb
This PR moves the SelectStatement::cte_map to QueryNode::cte_map, so that queries like the following work:
SELECT 1 UNION ALL (WITH cte AS (SELECT 42) SELECT * FROM cte);
Before, an error would be thrown because the binder was not able to find cte.
This was found due to #1088, which should now also be fixed.
Happy to receive any feedback.
pr created time in a day
push event cwida/duckdb
commit sha 14872eed0601c1cbcc145e5287186639d16112aa
Adding initial implementation for unsigned types
commit sha 695de731ba997528060c41e7a0408fa57a342dae
More on casting of unsigned types and arithmetic ops
commit sha 63c697e93cfb0aab17adbcdd02f9085aeb4a72a3
More of casting and unassigned in python/arrow
commit sha 9294c2e0a394f5bf019f7988a99a080c797d7315
Tests passing
commit sha 730ae9c8c2370e8e196ae3212acd04010e2336f6
Merge branch 'master' into unsignedtypes
commit sha 079d3bb0fd412f5a0538905a66443ba5fac77fd6
Parquet read and python tests for unsigned
commit sha b58964761950977fe9c62395ec4e123858541b89
Adding fixes for sqlancer
commit sha 77ce538b82732ff095af589b93745188d0227f65
Fixing build
commit sha 89ae2116da6d35f5502f456e184e50914979f719
Changes requested in the PR
commit sha 8b03e470e13f3cb692708c91d23f0f30a1e777f6
Merge pull request #1325 from pdet/unsignedtypes Implementing Unsigned Types
push time in a day
PR merged cwida/duckdb
Task #1023
Since the unsigned types are built-in I think the most relevant changes are in the cast operators. Let me know if I forgot any interface or if I should add any other relevant tests.
pr closed time in a day