Mark Mytherin
CWI Amsterdam, Netherlands
www.markraasveldt.com
I'm a Postdoc at the CWI. I like databases.

cwida/duckdb 2485

DuckDB is an in-process SQL OLAP Database Management System

MonetDB/MonetDBLite-Python 26

MonetDBLite as a Python Package

MonetDB/MonetDBLite-C 25

MonetDB as a shared library with a C API

diegomestre2/tpchQ01_GPU 4

TPC-H Query 01 Implementation Optimized for CPU-GPU co-processing

Mytherin/Panther 3

Panther is an open-source, highly efficient text editor written from scratch in C++.

cwida/duckdb-benchmark-data 2

Repository with extra data for DuckDB benchmarking

hannesmuehleisen/sqlancer 2

Detecting Logic Bugs in DBMS

Mytherin/MonetDBLiteBenchmarks 2

Benchmarks for the paper MonetDBLite: An Embedded Analytical Database

lnkuiper/duckdb 1

Fork of cwida/duckdb

pull request comment cwida/duckdb

Parser clean up: no longer transform multi-node CASE and NULLIF in the transformer

Ha, I think I may have been responsible for those in the first place. All that hard work 😭. Great that this is cleaned up!

Mytherin

comment created time in 11 hours

pull request comment cwida/duckdb

Add support for Lambda functions to parser

Nice. Thank you, Mark.

Mytherin

comment created time in 12 hours

issue comment cwida/duckdb

Data conversion: Virtual table streaming and exporting with EXPORT DATABASE

Excellent. We missed that it is possible with COPY, but we have found all the info we needed here. We can probably close this issue.
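
For reference, a minimal sketch of the COPY / EXPORT DATABASE route referred to above; the table name, file name, and directory name are illustrative, not taken from the thread:

-- write a single table to a Parquet file
COPY my_table TO 'my_table.parquet' (FORMAT PARQUET);
-- or export the entire database with Parquet as the storage format
EXPORT DATABASE 'export_dir' (FORMAT PARQUET);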

dforsber

comment created time in 13 hours

fork pdet/sqlancer

Detecting Logic Bugs in DBMS

fork in 13 hours

pull request comment cwida/duckdb

Refactor and nested types support for Parquet Reader

Performance went down ~10%, so we need to investigate what's going on there.

hannesmuehleisen

comment created time in 17 hours

PR opened cwida/duckdb

Refactor and nested types support for Parquet Reader

This PR refactors and extends the Parquet reader. A major feature addition is support for nested types in Parquet files, which are mapped to DuckDB's STRUCT and LIST types. Under the hood, the Parquet reader now does zero-copy reads of strings, which should increase performance.
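
As a rough illustration of what the nested-type support enables, a hedged sketch of querying such a file; the file name and column names are hypothetical, and the exact accessor syntax may differ:

-- s is a STRUCT column (dot access), l is a LIST column (1-based indexing)
SELECT s.field_a, l[1]
FROM parquet_scan('nested.parquet');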

+2193 -980

0 comment

31 changed files

pr created time in a day

pull request comment cwida/duckdb

Add support for Lambda functions to parser

Mark, thanks for explaining. To clarify the terminology: the lambda in filter(array_column, x -> x > int_column) has a single capture, i.e. int_column, and a single argument, i.e. x. In map_filter(m, (k, v) -> k > 10 AND v < 0) we have a lambda with no captures and two arguments: k and v. Hence, I think we should rename capture_name above to something like argument_names and make it a vector, not a single value.
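
To restate the distinction as a side-by-side sketch (the expressions are the ones used in this thread):

-- one argument (x) and one capture (int_column)
filter(array_column, x -> x > int_column)
-- two arguments (k, v) and no captures
map_filter(m, (k, v) -> k > 10 AND v < 0)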

Mytherin

comment created time in a day

pull request comment cwida/duckdb

Add support for Lambda functions to parser

+optional type specification for lambda

Mytherin

comment created time in a day

pull request comment cwida/duckdb

Add support for Lambda functions to parser

Mark, this is great. To confirm, does this PR include support for multiple arguments for lambda, e.g. map_filter(m, (k, v) -> k > 10 AND v < 0) and does it include support for captures, e.g. filter(array_column, x -> x > int_column)?

Mytherin

comment created time in a day

pull request comment cwida/duckdb

Add support for Lambda functions to parser

CC @mbasmanova

Mytherin

comment created time in a day

pull request comment cwida/duckdb

Avoid using arithmetic on strings in dbgen (minor compilation fix)

Thank you, Mark.

Mytherin

comment created time in 2 days

issue comment cwida/duckdb

Data conversion: Virtual table streaming and exporting with EXPORT DATABASE

That is extremely interesting, but it leads to some questions.

  1. I am assuming the above would be single-threaded, is that correct? Are there portions of the workload that run in parallel?
  2. Could the "SELECT * FROM read_csv_auto('test.csv')" include extended SQL syntax (joins, filters, aggregates, expressions, ordering)? (See the sketch below.)
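
As a sketch of the kind of query asked about in point 2; the column names are made up, but read_csv_auto can generally be used like any other table in a query:

SELECT col_a, count(*) AS n
FROM read_csv_auto('test.csv')
WHERE col_b > 10
GROUP BY col_a
ORDER BY col_a;
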
dforsber

comment created time in 2 days

issue opened cwida/duckdb

Data conversion: Virtual table streaming and exporting with EXPORT DATABASE

DuckDB is powerful in that it can write Parquet as well as read it, among other formats. This naturally raises the question of whether DuckDB can be used to stream data from a source to a destination while doing "local" transformations that do not require the full table to be present (unlike, for example, sorting without an index).

In our tests (@jupiter) we have noticed that DuckDB uses quite a lot of memory (compared to, e.g., a Node.js streaming solution) when reading from a CSV file (read_csv_auto) and exporting it to Parquet without any transformations or sorting. Does this mean that DuckDB's memory consumption scales with the source data size? What if the CSV file were "infinite", a never-ending file?
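
For concreteness, the pipeline described here can be written as a single statement (file names are illustrative); whether it runs fully streaming, i.e. without materializing the whole CSV in memory, is exactly the question raised above:

-- read a CSV and write it back out as Parquet in one statement
COPY (SELECT * FROM read_csv_auto('test.csv'))
TO 'test.parquet' (FORMAT PARQUET);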

created time in 2 days

issue comment cwida/duckdb

regexp_matches() does not recognise "(?!"

Yes, I think RE2 follows POSIX semantics and differs from PCRE in this respect. Some negative-lookahead regexps can be converted to positive ones by adding NOT at the SQL level, so this helps us. But I think it would be nicer to have PCRE supported.
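
A rough sketch of the NOT-based rewrite mentioned above, reusing the pattern from the issue; the table name is hypothetical, and this is not an exact equivalent of the original lookahead expression:

SELECT *
FROM requests
WHERE NOT regexp_matches(header_content_type, '[tT][eE][xX][tT]/[Hh][Tt][Mm][Ll]');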

dforsber

comment created time in 2 days

issue comment cwida/duckdb

Syntax error for WITH RECURSIVE query

@lnkuiper if you look at this, consider cleaning up https://github.com/cwida/duckdb/tree/master/test/ldbc which has an old version of the LDBC queries.

szarnyasg

comment created time in 2 days

pull request comment cwida/duckdb

Read-only mode and shutdown for R client

Hi, not sure if this is still relevant. I managed to properly shut down a DuckDB database on Windows, so that I did not get an error when reopening it (as in issue #323):

library(duckdb)

## FAILS
con <- dbConnect(duckdb(), dbdir = "test.duckdb")
dbWriteTable(con, "iris", iris, overwrite = TRUE)
dbDisconnect(con)
con <- dbConnect(duckdb(), dbdir = "test.duckdb")
## Error in initialize(value, ...) :
##   duckdb_startup_R: Failed to open database

## REMEDIES THE FAILURE, THEN SUCCEEDS
dbDisconnect(con, shutdown = TRUE)
## Warning message:
##   Connection already closed.
con <- dbConnect(duckdb(), dbdir = "test.duckdb")
dbWriteTable(con, "iris", iris, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)

## SUCCEEDS :)
con <- dbConnect(duckdb(), dbdir = "test.duckdb")
dbWriteTable(con, "iris", iris, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)

con <- dbConnect(duckdb(), dbdir = "test.duckdb")  ## no error
dbWriteTable(con, "iris", iris, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)

hannesmuehleisen

comment created time in 2 days

issue opened cwida/duckdb

regexp_matches() does not recognise "(?!"

We have this kind of regular expression, which works e.g. with NodeJS string.match(), but not with DuckDB.

regexp_matches(header_content_type, ' *(?![tT][eE][xX][tT]/[Hh][Tt][Mm][Ll]).*') 
invalid perl operator: (?!

Is this a bug or a feature? :)

created time in 3 days

started jasonge27/fastQuantile

started time in 3 days

issue closed cwida/duckdb

Auto Increment Primary Key And/or Serial

While auto-incrementing ids are more useful, common, and idiomatic in an OLTP store, they can be very useful for tracking changesets (especially for caching) in OLAP analytical tasks. Towards that end, it would be great to have the ability to specify an AUTO INCREMENT policy on a column (or something more advanced like PostgreSQL's SERIAL flag). While it's easy enough to do this manually with a prior COUNT(*) query, a write lock, and bulk insert statements, the only way to add such a column when using a scanner/reader like read_csv is to add a new column and manually UPDATE into that column (thereby ~defeating the purpose of those fast import mechanisms). Thoughts?

closed time in 3 days

willium

issue comment cwida/duckdb

Provide Android Packages

In general, we should provide Android packages. It appears that different SDKs need to be used to create those builds, and there are some CPU differences in the resulting builds. If possible, those binaries could be integrated into the normal JDBC driver, but I have my doubts.

Grufy

comment created time in 3 days

issue closed cwida/duckdb

run C++ example link fail

I'm a C++ beginner. I am trying to run the C++ example in CLion and debug it (Win10 64-bit).

I got an error like "-lduckdb failed", so I downloaded the DuckDB library from the website and put it in my MinGW directory.

Now CLion can find the library, but the file format is not recognized:

"D:\JetBrains\CLion 2020.3.1\bin\cmake\win\bin\cmake.exe" --build D:\code\duckdb\examples\embedded-c++\cmake-build-debug-mingw --target all -- -j 6
[ 50%] Linking CXX executable example.exe
D:/mingw/mingw32/bin/../lib/gcc/i686-w64-mingw32/8.1.0/../../../../lib/duckdb.dll: file not recognized: File format not recognized
collect2.exe: error: ld returned 1 exit status
mingw32-make.exe[2]: *** [CMakeFiles\example.dir\build.make:106: example.exe] Error 1
mingw32-make.exe[1]: *** [CMakeFiles\Makefile2:95: CMakeFiles/example.dir/all] Error 2
mingw32-make.exe: *** [Makefile:103: all] Error 2

Are there any suggestions to solve this problem?

closed time in 3 days

BowenXiao1999

issue closed cwida/duckdb

Regression Analysis

Does DuckDB provide regression analysis functions?

closed time in 3 days

waynelapierre

PR opened cwida/duckdb

Implementing Filter Clause for aggregates

This PR gives an implementation of the FILTER clause for aggregates (#896). The binding process is potentially over-complicated, so please check that part.
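
For readers unfamiliar with the feature, a minimal sketch of what the FILTER clause allows; the table and column names are made up:

-- count all rows and, in the same pass, only the rows matching a predicate
SELECT count(*) AS all_rows,
       count(*) FILTER (WHERE b > 42) AS big_rows
FROM a;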

+728 -145

0 comment

41 changed files

pr created time in 3 days

Pull request review comment cwida/duckdb

Pre-filtering data in zonemaps and #1303

+#include "duckdb/execution/expression_executor.hpp"+#include "duckdb/optimizer/rule/in_clause_simplification.hpp"+#include "duckdb/planner/expression/list.hpp"+#include "duckdb/planner/expression/bound_operator_expression.hpp"++namespace duckdb {++InClauseSimplificationRule::InClauseSimplificationRule(ExpressionRewriter &rewriter) : Rule(rewriter) {+	// match on InClauseExpression that has a ConstantExpression as a check+	auto op = make_unique<InClauseExpressionMatcher>();+	op->policy = SetMatcher::Policy::SOME;+	root = move(op);+}++unique_ptr<Expression> InClauseSimplificationRule::Apply(LogicalOperator &op, vector<Expression *> &bindings,+                                                         bool &changes_made) {+	D_ASSERT(bindings[0]->expression_class == ExpressionClass::BOUND_OPERATOR);+	auto expr = (BoundOperatorExpression *)bindings[0];+	if (expr->children[0]->expression_class != ExpressionClass::BOUND_CAST) {+		return nullptr;+	}+	auto cast_expression = (BoundCastExpression *)expr->children[0].get();+	if (cast_expression->child->expression_class != ExpressionClass::BOUND_COLUMN_REF) {+		return nullptr;+	}+	//! Here we check if we can apply the expression on the constant side+	auto target_type = cast_expression->source_type();+	if (!BoundCastExpression::CastIsInvertible(target_type, cast_expression->return_type)) {+		return nullptr;+	}+	for (size_t i{1}; i < expr->children.size(); i++) {+		if (expr->children[i]->expression_class != ExpressionClass::BOUND_CONSTANT) {+			return nullptr;+		}+		D_ASSERT(expr->children[i]->IsFoldable());+		auto constant_value = ExpressionExecutor::EvaluateScalar(*expr->children[i]);+		auto new_constant = constant_value.TryCastAs(target_type);+		if (new_constant) {+			//! We can cast, so we move the new constant+			auto new_constant_expr = make_unique<BoundConstantExpression>(constant_value);+			expr->children[i] = move(new_constant_expr);

good catch

pdet

comment created time in 3 days

issue comment cwida/duckdb

Auto Increment Primary Key And/or Serial

Certainly:

echo -e '42\n43\n44' > /tmp/dummy
COPY a(b) FROM '/tmp/dummy';
SELECT * FROM a;
┌───┬────┐
│ i │ b  │
├───┼────┤
│ 1 │ 42 │
│ 2 │ 43 │
│ 3 │ 44 │
└───┴────┘
willium

comment created time in 4 days

issue comment cwida/duckdb

Auto Increment Primary Key And/or Serial

oh neat! is there any way to use this alongside read_csv/COPY?

willium

comment created time in 4 days

issue comment cwida/duckdb

Return empty json array in case of no results returned

Again, it is the SQLite shell that does this, not DuckDB.

burtgulash

comment created time in 4 days

issue comment cwida/duckdb

Regression Analysis

Two options: 1) pull those columns into R and run lm there, or 2) implement a recursive CTE that computes the fit.
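
As a sketch of the spirit of option 2: for a single predictor, a simple linear fit can also be computed with plain aggregates rather than a recursive CTE; the table and column names are hypothetical:

-- ordinary least squares for y ~ x using the covariance/variance formulas
SELECT (avg(x * y) - avg(x) * avg(y)) / (avg(x * x) - avg(x) * avg(x)) AS slope,
       avg(y) - (avg(x * y) - avg(x) * avg(y)) / (avg(x * x) - avg(x) * avg(x)) * avg(x) AS intercept
FROM t;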

waynelapierre

comment created time in 4 days

issue comment cwida/duckdb

Auto Increment Primary Key And/or Serial

How about using a sequence? For example

CREATE SEQUENCE seq;
CREATE TABLE a (i INTEGER DEFAULT NEXTVAL('seq'), b INTEGER);
INSERT INTO a (b) VALUES (42), (43);
SELECT * FROM a;

Result:

┌───┬────┐
│ i │ b  │
├───┼────┤
│ 1 │ 42 │
│ 2 │ 43 │
└───┴────┘
willium

comment created time in 4 days

issue opened cwida/duckdb

Auto Increment Primary Key And/or Serial

While auto-incrementing ids are more useful, common, and idiomatic in an OLTP store, they can be very useful for tracking changesets (especially for caching) in OLAP analytical tasks. Towards that end, it would be great to have the ability to specify an AUTO INCREMENT policy on a column (or something more advanced like PostgreSQL's SERIAL flag). While it's easy enough to do this manually with a prior COUNT(*) query, a write lock, and bulk insert statements, the only way to add such a column when using a scanner/reader like read_csv is to add a new column and manually UPDATE into that column (thereby ~defeating the purpose of those fast import mechanisms). Thoughts?

created time in 4 days
