rezacsedu/DeepKneeOAExplainer_

Explainable Knee Osteoarthritis Diagnosis from Radiographs & MRIs

tdoehmen/ASO-ACA-2014

Ant Clustering Algorithm using Pheromones (NetLogo)

tdoehmen/datasketches-cpp

Core C++ Sketch Library

tdoehmen/duckdb

DuckDB is an embeddable SQL OLAP Database Management System

tdoehmen/EC_2014_team11

Evolutionary Algorithm which solves a polynomial regression problem

push event cwida/duckdb

Mark Raasveldt

commit sha 8f468d429ec11653202b2cb8dfe87bb01008f001

Add support for Lambda functions to parser

Mark Raasveldt

commit sha f8a702564d54682401ef3805d577298348724bba

Remove redundant blank line

Mark Raasveldt

commit sha d818a8b9abff27902bed93bfcedfb4d81d1e029e

Add support for lambda functions with multiple parameters

Mark Raasveldt

commit sha 2c8b106da531b1dff76d5e90efc33eb1eeeb7c4d

Merge branch 'master' into lambdas

Mark Raasveldt

commit sha 80d723ad404771c6dc9fa028af5a254b60165e43

Remove print and increase lambda operator precedence further so lambdas such as x -> x > 10 AND z < 20 are correctly parsed

Mark Raasveldt

commit sha 132ac405e5d966c83da86bb7d12a4ec6beb8fa02

Fix for single file compilation

Mark Raasveldt

commit sha e5a64a52f56083f989e2675a287d28f2f96adf69

Automatically replace generated calls to fprintf and exit in src_backend_parser_scan.cpp to avoid triggering R CRAN warnings

Mark

commit sha f79660c66b8d97e598e30390f6b638dd5ffd6ad2

Merge pull request #1313 from Mytherin/lambdas

Add support for Lambda functions to parser

pushed 9 hours ago

PR merged cwida/duckdb

Add support for Lambda functions to parser

This PR adds basic support for lambda functions to the parser. They are not supported anywhere else yet and are not yet bound, but the plan is to use them later on in functions that apply to lists.

Lambda expressions look like this:

class LambdaExpression : public ParsedExpression {
public:
	string capture_name;
	unique_ptr<ParsedExpression> expression;
};

Example syntax:

SELECT map(i, x -> x + 1) FROM (VALUES (list_value(1, 2, 3))) tbl(i);

+13456 -13161

9 comments

30 changed files

Mytherin

PR closed 9 hours ago

pull request comment cwida/duckdb

Parser clean up: no longer transform multi-node CASE and NULLIF in the transformer

Ha I think I may have been responsible for those in the first place. All that hard work 😭. Great this is cleaned up!

Mytherin

comment created 10 hours ago

PR opened cwida/duckdb

Parser clean up: no longer transform multi-node CASE and NULLIF in the transformer

Previously, a CASE statement with multiple WHEN ... THEN ... nodes would be transformed into a chain of case statements in the transformer. In this PR we change this so that the flattening is only performed during binding. The case statement now looks like this after the transformer phase:

struct CaseCheck {
	unique_ptr<ParsedExpression> when_expr;
	unique_ptr<ParsedExpression> then_expr;
};

//! The CaseExpression represents a CASE expression in the query
class CaseExpression : public ParsedExpression {
	vector<CaseCheck> case_checks;
	unique_ptr<ParsedExpression> else_expr;
};

NULLIF(a, b) used to be transformed into CASE WHEN a=b THEN NULL ELSE a END in the transformer phase. In this PR it is instead transformed into a regular function call nullif(a, b), and a macro performs the CASE expansion during binding.

+206 -95

0 comments

13 changed files

PR created 10 hours ago
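
To make the representation above concrete, here is a minimal self-contained sketch of how a multi-WHEN CASE fills the new structure; the stub ParsedExpression stands in for DuckDB's real expression hierarchy, so none of this is actual DuckDB code:

#include <memory>
#include <string>
#include <vector>
using namespace std;

// Stub standing in for DuckDB's ParsedExpression hierarchy.
struct ParsedExpression {
	string text;
	explicit ParsedExpression(string t) : text(move(t)) {}
};

struct CaseCheck {
	unique_ptr<ParsedExpression> when_expr;
	unique_ptr<ParsedExpression> then_expr;
};

struct CaseExpression {
	vector<CaseCheck> case_checks;
	unique_ptr<ParsedExpression> else_expr;
};

int main() {
	// CASE WHEN a THEN b WHEN c THEN d ELSE e END becomes a single
	// CaseExpression with two CaseChecks instead of a nested chain.
	CaseExpression expr;
	expr.case_checks.push_back(CaseCheck{make_unique<ParsedExpression>("a"), make_unique<ParsedExpression>("b")});
	expr.case_checks.push_back(CaseCheck{make_unique<ParsedExpression>("c"), make_unique<ParsedExpression>("d")});
	expr.else_expr = make_unique<ParsedExpression>("e");
	return 0;
}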

pull request comment cwida/duckdb

Add support for Lambda functions to parser

Nice. Thank you, Mark.

Mytherin

comment created 12 hours ago

pull request comment cwida/duckdb

Add support for Lambda functions to parser

All the changes are implemented now; lambda functions now look like this:

class LambdaExpression : public ParsedExpression {
	vector<string> parameters;
	unique_ptr<ParsedExpression> expression;
};

I also fixed several operator precedence rules so that lambda arrows take priority over other operators, which causes e.g. x -> x + 1 AND y + 1 to be parsed with the entire x + 1 AND y + 1 as the lambda body, without requiring brackets.

select map(i, (x, y) -> x + y) from tbl;
-- lambda: parameters { x, y }, function: x + y
select map(i, x -> x + 1) from (values (list_value(1, 2, 3))) tbl(i);
-- lambda: parameters { x }, function: x + 1
select map(i, x -> x + 1 AND y + 1) from (values (list_value(1, 2, 3))) tbl(i);
-- lambda: parameters { x }, function: x + 1 AND y + 1
Mytherin

comment created 12 hours ago
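
For illustration, the parse results annotated above can be modeled with a minimal self-contained sketch; the stub types stand in for DuckDB's real classes and are not the actual implementation:

#include <memory>
#include <string>
#include <vector>
using namespace std;

// Stub expression type; real DuckDB expressions form a class hierarchy.
struct ParsedExpression {
	string text;
	explicit ParsedExpression(string t) : text(move(t)) {}
};

struct LambdaExpression : ParsedExpression {
	vector<string> parameters;
	unique_ptr<ParsedExpression> expression;
	LambdaExpression(vector<string> params, unique_ptr<ParsedExpression> body)
	    : ParsedExpression("lambda"), parameters(move(params)), expression(move(body)) {}
};

int main() {
	// (x, y) -> x + y: parameters { x, y }, body x + y
	LambdaExpression multi({"x", "y"}, make_unique<ParsedExpression>("x + y"));
	// x -> x + 1 AND y + 1: parameters { x }, body x + 1 AND y + 1
	LambdaExpression single({"x"}, make_unique<ParsedExpression>("x + 1 AND y + 1"));
	return 0;
}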

issue comment cwida/duckdb

Data conversion: Virtual table streaming and exporting with EXPORT DATABASE

Excellent. We missed that this is possible with COPY, but we have found all the info we needed here. We can probably close this issue.

dforsber

comment created 12 hours ago

pull request comment cwida/duckdb

Refactor and nested types support for Parquet Reader

Performance went down ~10%, so we need to investigate what's going on there.

hannesmuehleisen

comment created 17 hours ago

PR opened cwida/duckdb

Refactor and nested types support for Parquet Reader

This PR refactors and extends the Parquet reader. A major feature addition is support for nested types in Parquet files, which are mapped to DuckDB's STRUCT and LIST types. Under the hood, the Parquet reader now reads strings zero-copy, which should increase performance.

+2193 -980

0 comments

31 changed files

PR created a day ago
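
Reading such a file through DuckDB's embeddable C++ API could look like the sketch below; nested.parquet is a hypothetical file whose schema contains struct and list columns, and Parquet support is assumed to be compiled into the build:

#include "duckdb.hpp"
using namespace duckdb;

int main() {
	DuckDB db(nullptr); // in-memory database
	Connection con(db);
	// The refactored reader maps nested Parquet types to STRUCT and LIST.
	auto result = con.Query("SELECT * FROM parquet_scan('nested.parquet')");
	result->Print();
	return 0;
}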

pull request comment cwida/duckdb

Add support for Lambda functions to parser

That makes a lot of sense; will do. Thanks for the feedback!

Mytherin

comment created a day ago

pull request comment cwida/duckdb

Add support for Lambda functions to parser

Mark, thanks for explaining. To clarify the terminology, the lambda in filter(array_column, x -> x > int_column) has a single capture (int_column) and a single argument (x). In map_filter(m, (k, v) -> k > 10 and v < 0) we have a lambda with no captures and 2 arguments: k and v. Hence, I think we should rename capture_name above to something like argument_names and make it a vector, not a single value.

Mytherin

comment created a day ago

pull request comment cwida/duckdb

Add support for Lambda functions to parser

+optional type specification for lambda

Mytherin

comment created a day ago

pull request comment cwida/duckdb

Add support for Lambda functions to parser

As for captures, the parser will not do anything besides transforming the expression; it is up to the binder to actually resolve columns. E.g. filter(array_column, x -> x > int_column) will pass the parser just fine and generate a lambda expression containing the following:

capture_name: x
expression: `Comparison(Column(x), Column(int_column), GREATER_THAN)`

The binder is then in charge of resolving "x" back to the lambda, and "int_column" to another data source (e.g. a table).

Mytherin

comment created a day ago
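
A minimal sketch of the division of labor described above, assuming the binder checks the lambda's own parameters before falling back to outer data sources; the names and structure are illustrative stubs, not DuckDB's binder API:

#include <string>
#include <vector>
using namespace std;

enum class Source { LAMBDA_PARAMETER, TABLE, UNRESOLVED };

// Resolve a column reference: lambda parameters shadow outer columns.
Source Resolve(const string &name, const vector<string> &lambda_params,
               const vector<string> &table_columns) {
	for (auto &p : lambda_params) {
		if (p == name) {
			return Source::LAMBDA_PARAMETER;
		}
	}
	for (auto &c : table_columns) {
		if (c == name) {
			return Source::TABLE;
		}
	}
	return Source::UNRESOLVED;
}

int main() {
	// For filter(array_column, x -> x > int_column):
	vector<string> params = {"x"};
	vector<string> columns = {"array_column", "int_column"};
	Resolve("x", params, columns);          // -> LAMBDA_PARAMETER
	Resolve("int_column", params, columns); // -> TABLE
	return 0;
}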

pull request comment cwida/duckdb

Add support for Lambda functions to parser

It supports only single argument captures right now, e.g. filter(array_column, x -> x > column) works, but map_filter(m, (k, v) -> k > 10 and v < 0) does not. I can have a look at extending the lambdas to support multiple captures.

Mytherin

comment created a day ago

pull request comment cwida/duckdb

Add support for Lambda functions to parser

Mark, this is great. To confirm, does this PR include support for multiple arguments for lambdas, e.g. map_filter(m, (k, v) -> k > 10 AND v < 0), and does it include support for captures, e.g. filter(array_column, x -> x > int_column)?

Mytherin

comment created a day ago

pull request comment cwida/duckdb

Add support for Lambda functions to parser

CC @mbasmanova

Mytherin

comment created a day ago

PR opened cwida/duckdb

Add support for Lambda functions to parser

This PR adds basic support for lambda functions to the parser. They are not supported anywhere else yet and are not yet bound, but the plan is to use them later on in functions that apply to lists.

Lambda expressions look like this:

class LambdaExpression : public ParsedExpression {
public:
	string capture_name;
	unique_ptr<ParsedExpression> expression;
};

Example syntax:

SELECT map(i, x -> x + 1) FROM (VALUES (list_value(1, 2, 3))) tbl(i);

+190 -8

0 comments

18 changed files

PR created a day ago

push event cwida/duckdb

Mark Raasveldt

commit sha c7cd7bcee3b3c0213afbe1cb5c635507ef03d8e5

Avoid using arithmetic on strings in dbgen

Mark

commit sha d47baa52f1618bbf7a6f7dd97c0b658a959ffd72

Merge pull request #1312 from Mytherin/dbgenfix

Avoid using arithmetic on strings in dbgen (minor compilation fix)

pushed a day ago

pull request comment cwida/duckdb

Avoid using arithmetic on strings in dbgen (minor compilation fix)

Thank you, Mark.

Mytherin

comment created a day ago

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

 void RemoveUnusedColumns::VisitOperator(LogicalOperator &op) {
 			auto &aggr = (LogicalAggregate &)op;
 			ClearUnusedExpressions(aggr.expressions, aggr.aggregate_index);
-			if (aggr.expressions.size() == 0 && aggr.groups.size() == 0) {
-				// removed all expressions from the aggregate: push a COUNT(*)
-				auto count_star_fun = CountStarFun::GetFunction();
-				aggr.expressions.push_back(
-				    AggregateFunction::BindAggregateFunction(context, count_star_fun, {}, false));
-			}
-		}
+            if (aggr.expressions.size() != 0 || aggr.groups.size() != 0) {

The change here to an empty if followed by an else doesn't make much sense. Perhaps a left-over from earlier code?

pdet

comment created 2 days ago

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

+# name: test/sql/filter/test_filter_clause.test
+# description: Test aggregation with filter clause

I would like some more test cases:

  • Query with many different filter clauses (e.g. 5 aggregates, 5 different filters)
  • Filter with some more complex aggregates: COVAR_POP (multiple input columns), STRING_AGG (strings) and ARRAY_AGG (lists)
  • DISTINCT aggregates

Also, looking at these tests I would not be surprised if all of them use the perfect hash aggregate. You can force the regular hash aggregate to be used by using very spaced-out groups (e.g. [0, 10000000, 20000000, ....]).

pdet

comment created 2 days ago

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

 class PhysicalHashAggregate : public PhysicalSink {
 	//! Pointers to the aggregates
 	vector<BoundAggregateExpression *> bindings;
+    //! Map between payload index and input index for filters
+	unordered_map<Expression*,std::pair<bool,unordered_map<size_t,size_t>>> filter_map;

This seems overly complicated; why not just add the filter right after the regular payload of an aggregate?

pdet

comment created 2 days ago

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

 PhysicalPlanGenerator::ExtractAggregateExpressions(unique_ptr<PhysicalOperator>
 	vector<unique_ptr<Expression>> expressions;
 	vector<LogicalType> types;
-	for (idx_t group_idx = 0; group_idx < groups.size(); group_idx++) {
-		auto &group = groups[group_idx];
+	for (auto &group : groups) {
 		auto ref = make_unique<BoundReferenceExpression>(group->return_type, expressions.size());
 		types.push_back(group->return_type);
 		expressions.push_back(move(group));
-		groups[group_idx] = move(ref);
+		group = move(ref);
 	}
 
 	for (auto &aggr : aggregates) {
 		auto &bound_aggr = (BoundAggregateExpression &)*aggr;
-		for (idx_t child_idx = 0; child_idx < bound_aggr.children.size(); child_idx++) {
-			auto &child = bound_aggr.children[child_idx];
-			auto ref = make_unique<BoundReferenceExpression>(child->return_type, expressions.size());
-			types.push_back(child->return_type);
-			expressions.push_back(move(child));
-			bound_aggr.children[child_idx] = move(ref);
+		for (auto &child_ : bound_aggr.children) {
+			bool already_in = false;
+			for (size_t i = 0; i < expressions.size(); i++) {
+				auto *base_expr = (BaseExpression *)expressions[i].get();
+				if (child_->Equals(base_expr)) {

Is this necessary for correctness purposes, or is it just an optimization? Not that I disagree with adding this; just asking for clarification. I would like to move it to a function, use an expression_map_t instead of a vector, and also use it for the bound_aggr.filter.

pdet

comment created 2 days ago

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

 void PerfectAggregateHashTable::AddChunk(DataChunk &groups, DataChunk &payload)
 
 	// after finding the group location we update the aggregates
 	idx_t payload_idx = 0;
-	for (idx_t aggr_idx = 0; aggr_idx < aggregates.size(); aggr_idx++) {
-		auto &aggr = aggregates[aggr_idx];
-		auto input_count = (idx_t)aggr.child_count;
-		aggr.function.update(input_count == 0 ? nullptr : &payload.data[payload_idx], input_count, addresses,
-		                     payload.size());
+	for (auto &aggregate : aggregates) {
+		auto input_count = (idx_t)aggregate.child_count;
+		if (aggregate.filter) {
+			ExpressionExecutor filter_execution(aggregate.filter);
+			SelectionVector true_sel(STANDARD_VECTOR_SIZE);

This seems like the exact same code as the regular AggregateHashtable. I would unify it with that code by using a static function (AggregateHashtable::UpdateAggregate(...)).

pdet

comment created 2 days ago
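
For context, the pattern the diff above implements can be reduced to a minimal self-contained sketch: evaluate the filter predicate per row, then update the aggregate state only for the rows that pass. The bool mask below stands in for DuckDB's SelectionVector, and none of this is DuckDB code:

#include <cstddef>
#include <vector>
using namespace std;

int main() {
	// Input column and the per-row result of FILTER (WHERE v >= 10).
	vector<int> values = {1, 20, 3, 40};
	vector<bool> keep = {false, true, false, true};
	long long sum = 0; // aggregate state for SUM(v)
	for (size_t row = 0; row < values.size(); row++) {
		if (keep[row]) { // only selected rows update the aggregate
			sum += values[row];
		}
	}
	// sum == 60, matching SELECT SUM(v) FILTER (WHERE v >= 10)
	return 0;
}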

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

 string PhysicalPerfectHashAggregate::ParamsToString() const {
 		result += groups[i]->GetName();
 	}
 	for (idx_t i = 0; i < aggregates.size(); i++) {
-		if (i > 0 || groups.size() > 0) {
+		if (i > 0 || !groups.empty()) {
 			result += "\n";
 		}
 		result += aggregates[i]->GetName();
+		auto &aggregate = (BoundAggregateExpression &)*aggregates[i];
+		if (aggregate.filter){
+			result += aggregate.filter->GetName();

"FILTER " + ...

pdet

comment created 2 days ago

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

 void PhysicalSimpleAggregate::GetChunkInternal(ExecutionContext &context, DataCh
 string PhysicalSimpleAggregate::ParamsToString() const {
 	string result;
 	for (idx_t i = 0; i < aggregates.size(); i++) {
+		auto &aggregate = (BoundAggregateExpression &)*aggregates[i];
 		if (i > 0) {
 			result += "\n";
 		}
 		result += aggregates[i]->GetName();
+		if (aggregate.filter){
+			result += aggregate.filter->GetName();

"FILTER " + ...

pdet

comment created 2 days ago

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

 void PhysicalPerfectHashAggregate::Sink(ExecutionContext &context, GlobalOperato
 		group_chunk.data[group_idx].Reference(input.data[bound_ref_expr.index]);
 	}
 	idx_t aggregate_input_idx = 0;
-	for (idx_t i = 0; i < aggregates.size(); i++) {
-		auto &aggr = (BoundAggregateExpression &)*aggregates[i];
+	for (auto & aggregate : aggregates) {
+		auto &aggr = (BoundAggregateExpression &)*aggregate;
 		for (auto &child_expr : aggr.children) {
 			D_ASSERT(child_expr->type == ExpressionType::BOUND_REF);
 			auto &bound_ref_expr = (BoundReferenceExpression &)*child_expr;
 			aggregate_input_chunk.data[aggregate_input_idx++].Reference(input.data[bound_ref_expr.index]);
 		}
+		if (aggr.filter) {
+			vector<LogicalType> types;
+			vector<vector<Expression *>> bound_refs;
+			BoundAggregateExpression::GetColumnRef(aggr.filter.get(), bound_refs, types);

Same here; can't this just refer to the filter as computed in the projection above it?

pdet

comment created 2 days ago

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

 string PhysicalHashAggregate::ParamsToString() const {
 		result += groups[i]->GetName();
 	}
 	for (idx_t i = 0; i < aggregates.size(); i++) {
-		if (i > 0 || groups.size() > 0) {
+		auto &aggregate = (BoundAggregateExpression &)*aggregates[i];
+		if (i > 0 || !groups.empty()) {
 			result += "\n";
 		}
 		result += aggregates[i]->GetName();
+		if (aggregate.filter) {
+			result += aggregate.filter->GetName();

Maybe add "FILTER " before this to the output, to make it clear that this is a filter op (similar to how it appears in a SQL statement).

pdet

comment created 2 days ago

Pull request review comment cwida/duckdb

Implementing Filter Clause for aggregates

 idx_t GroupedAggregateHashTable::AddChunk(DataChunk &groups, Vector &group_hashe
 				}
 
 				distinct_addresses.Verify(new_group_count);
-
-				aggr.function.update(input_count == 0 ? nullptr : &payload.data[payload_idx], input_count,
-				                     distinct_addresses, new_group_count);
+				if (aggr.filter) {

This seems duplicated from below. Perhaps better to move this into a function called "UpdateAggregate(...)" that is called from both places?

pdet

comment created 2 days ago
