What happens?
I have several DuckLake tables that use column partitioning. One of them is main.populasi_clean.
-- column partitioning query
-- cluster contains unique values: 'CLUSTER 1' to 'CLUSTER 6'
-- respondent_type contains unique values: 'head', 'member'
ALTER TABLE main.populasi_clean SET PARTITIONED BY (cluster, respondent_type);
This table is a replica of the original table in PostgreSQL. Data from PostgreSQL is inserted into DuckLake using a MERGE query as follows:
# python code
query = f'''
MERGE INTO ducklake.main.populasi_clean as target
USING df as source
ON target.id_populasi_raw = source.id_populasi_raw
WHEN MATCHED THEN UPDATE SET
{update_clause}
WHEN NOT MATCHED BY TARGET THEN INSERT BY NAME
;
'''
I have verified that the source data is unique. However, the MERGE result produces duplicate records in DuckLake.
I’m not entirely sure what is causing this issue, since I have other tables using the same update script that do not produce duplicates.
This screenshot is a summary of the Parquet data used to update DuckLake, taken right before the MERGE process is executed:
* file_source: indicates which Parquet file the data is read from
* _last_sync: indicates when the merge process was executed
* datetime_insert: indicates when the record was inserted into its Postgres table
* datetime_update: indicates when the record was last updated in its Postgres table
This is the result in DuckLake
I tried periodically removing all duplicate rows and rerunning the merge script; immediately after a rerun there were no duplicates. However, after the script has been executed several more times by cron jobs, the table contains duplicate values again.
To Reproduce
Generally, the way to reproduce is to create the Postgres and DuckLake tables and then follow these steps:
- Create postgres table
-- create postgres table
CREATE TABLE public.populasi_clean (
datetime_insert timestamp DEFAULT date_trunc('second'::text, CURRENT_TIMESTAMP) NULL,
datetime_update timestamp DEFAULT date_trunc('second'::text, CURRENT_TIMESTAMP) NULL,
id_populasi_raw int4 NOT NULL,
respondent_type text,
cluster text,
CONSTRAINT populasi_clean_pkey PRIMARY KEY (id_populasi_raw)
);
-- create postgres function
CREATE OR REPLACE FUNCTION public.update_datetime_update_column()
RETURNS trigger
LANGUAGE plpgsql
AS $function$
BEGIN
NEW.datetime_update = date_trunc('seconds', NOW());
RETURN NEW;
END;
$function$
;
-- Table Triggers
CREATE TRIGGER update_populasi_clean_datetime_update
BEFORE UPDATE ON public.populasi_clean
FOR EACH ROW EXECUTE FUNCTION update_datetime_update_column();
- Create ducklake table and its partition
-- create ducklake table
CREATE TABLE main.populasi_clean (
datetime_insert timestamp,
datetime_update timestamp,
_last_sync timestamp,
id_populasi_raw int,
respondent_type varchar,
cluster varchar
);
-- alter table
ALTER TABLE main.populasi_clean SET PARTITIONED BY (cluster, respondent_type);
ALTER TABLE main.populasi_clean ALTER id_populasi_raw SET NOT NULL;
- Insert some random values into the Postgres table
- Run this Python script to execute the merge query. It is not exactly the same as the production script, but the core logic is similar: Migrate2Ducklake.py
- This is the query used to fetch data from the Postgres source: data_migration.sql
- Update any random record, or insert a new record, in Postgres
- Redo step 4
As I mentioned previously, I am not really sure what produces the duplicates, because other tables using the same script do not seem to have this issue.
OS:
Ubuntu 22.04
DuckDB Version:
1.5
DuckLake Version:
0.4
DuckDB Client:
Python 3.12
Hardware:
No response
Full Name:
M Fariz A
Affiliation:
Frontier Group
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
No - I cannot share the data sets because they are confidential
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?