MERGE query produces duplicates even though the source is unique #1004

@m-fariz-a

Description

What happens?

I have several DuckLake tables that use column partitioning. One of them is main.populasi_clean.

-- column partitioning query
-- cluster contains unique values: 'CLUSTER 1' to 'CLUSTER 6'
-- respondent_type contains unique values: 'head', 'member'
ALTER TABLE main.populasi_clean SET PARTITIONED BY (cluster, respondent_type);

This table is a replica of the original table in PostgreSQL. Data from PostgreSQL is inserted into DuckLake using a MERGE query as follows:

# python code
query = f'''
    MERGE INTO ducklake.main.populasi_clean as target
    USING df as source
    ON target.id_populasi_raw = source.id_populasi_raw
    WHEN MATCHED THEN UPDATE SET
        {update_clause}
    WHEN NOT MATCHED BY TARGET THEN INSERT BY NAME
    ;
'''
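For reference, an `update_clause` like the one interpolated above is typically generated from the table's non-key column names. The sketch below is hypothetical (the issue does not show how the real script builds it), with an illustrative column list:

```python
# Hypothetical sketch of how an update_clause such as the one above might be
# generated. The column list here is illustrative; the actual script's
# columns are not shown in the issue.
columns = ["datetime_insert", "datetime_update", "_last_sync",
           "respondent_type", "cluster"]

# One "target.col = source.col" assignment per non-key column,
# joined with commas for the WHEN MATCHED THEN UPDATE SET clause.
update_clause = ",\n        ".join(
    f"target.{col} = source.{col}" for col in columns
)
print(update_clause)
```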

I have verified that the source data is unique. However, the MERGE result produces duplicate records in DuckLake.
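A uniqueness check on the merge key can be done along these lines (a minimal stdlib sketch over made-up rows; the real check was presumably a `COUNT`/`HAVING` query against the source DataFrame):

```python
from collections import Counter

# Illustrative source rows keyed by id_populasi_raw (values are made up).
rows = [
    {"id_populasi_raw": 1, "cluster": "CLUSTER 1", "respondent_type": "head"},
    {"id_populasi_raw": 2, "cluster": "CLUSTER 1", "respondent_type": "member"},
    {"id_populasi_raw": 3, "cluster": "CLUSTER 2", "respondent_type": "head"},
]

# Count occurrences of the merge key; any key seen more than once would make
# the MERGE's ON condition match multiple source rows for one target row.
key_counts = Counter(r["id_populasi_raw"] for r in rows)
duplicates = {k: n for k, n in key_counts.items() if n > 1}
print(duplicates)  # an empty dict means the source is unique on the key
```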

I’m not entirely sure what is causing this issue, since I have other tables using the same update script that do not produce duplicates.

This screenshot summarizes the Parquet data used to update DuckLake, captured right before the MERGE process is executed:

[Screenshot: summary of the Parquet source data]
* file_source: indicates which Parquet file the data was read from
* _last_sync: indicates when the merge process was executed
* datetime_insert: indicates when the record was inserted into its Postgres table
* datetime_update: indicates when the record was last updated in its Postgres table

This is the result in DuckLake

[Screenshot: resulting duplicate rows in DuckLake]

I tried periodically removing all duplicate rows and rerunning the merge script manually, and that did not produce any duplicates. However, after the script had been executed several more times by cron jobs, the table contained duplicate values again.
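The periodic cleanup amounts to keeping the latest copy per key. A minimal in-memory sketch (illustrative rows; assuming `datetime_update` decides which copy survives):

```python
from datetime import datetime

# Duplicate rows for the same id_populasi_raw, as observed in DuckLake
# (dates are made up for illustration).
rows = [
    {"id_populasi_raw": 1, "datetime_update": datetime(2024, 1, 1)},
    {"id_populasi_raw": 1, "datetime_update": datetime(2024, 2, 1)},
    {"id_populasi_raw": 2, "datetime_update": datetime(2024, 1, 5)},
]

# Keep only the most recently updated copy of each key.
latest = {}
for r in rows:
    key = r["id_populasi_raw"]
    if key not in latest or r["datetime_update"] > latest[key]["datetime_update"]:
        latest[key] = r

deduped = list(latest.values())
print(len(deduped))  # 2 rows remain: one per key
```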

To Reproduce

Generally, the steps to reproduce are to create the Postgres and DuckLake tables and then follow these steps:

  1. Create postgres table
-- create postgres table
CREATE TABLE public.populasi_clean (
	datetime_insert timestamp DEFAULT date_trunc('second'::text, CURRENT_TIMESTAMP) NULL,
	datetime_update timestamp DEFAULT date_trunc('second'::text, CURRENT_TIMESTAMP) NULL,
	id_populasi_raw int4 NOT NULL,
	respondent_type text,
	cluster text,
	CONSTRAINT populasi_clean_pkey PRIMARY KEY (id_populasi_raw)
);

-- create postgres function
CREATE OR REPLACE FUNCTION public.update_datetime_update_column()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
BEGIN
NEW.datetime_update = date_trunc('seconds', NOW());
RETURN NEW; 
END; 
$function$
;

-- Table Triggers
CREATE TRIGGER update_populasi_clean_datetime_update
BEFORE UPDATE ON public.populasi_clean
FOR EACH ROW EXECUTE FUNCTION update_datetime_update_column();
  2. Create the DuckLake table and its partition
-- create ducklake table
CREATE TABLE main.populasi_clean (
	datetime_insert timestamp,
	datetime_update timestamp,
	_last_sync timestamp,
	id_populasi_raw int,
	respondent_type varchar,
	cluster varchar
);

-- alter table
ALTER TABLE main.populasi_clean SET PARTITIONED BY (cluster, respondent_type);
ALTER TABLE main.populasi_clean ALTER id_populasi_raw SET NOT NULL;
  3. Insert any random values into the Postgres table
  4. Run this Python script to execute the MERGE query. It is not exactly the same as the production script, but the core logic is similar: Migrate2Ducklake.py
  5. This is the query used to fetch data from the Postgres source: data_migration.sql
  6. Update any random record or insert a new record into Postgres
  7. Redo step 6
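For reference, the expected MERGE semantics for the steps above can be modelled as a plain upsert keyed on `id_populasi_raw`: repeated runs should never grow the table beyond one row per key. An in-memory sketch (not DuckLake itself, just the intended behavior):

```python
# In-memory model of MERGE ... WHEN MATCHED THEN UPDATE /
# WHEN NOT MATCHED THEN INSERT, keyed on id_populasi_raw.
def merge(target: dict, source_rows: list) -> None:
    for row in source_rows:
        # Update if the key matches an existing row, insert otherwise.
        target[row["id_populasi_raw"]] = row

target = {}
batch = [{"id_populasi_raw": 1, "cluster": "CLUSTER 1"},
         {"id_populasi_raw": 2, "cluster": "CLUSTER 2"}]

# Running the same merge repeatedly (as the cron job does) must be idempotent.
merge(target, batch)
merge(target, batch)
print(len(target))  # still 2: no duplicates under correct MERGE semantics
```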

As mentioned previously, I am not really sure what produces the duplicates, because other tables using the same script do not seem to have this issue.

OS:

Ubuntu 22.04

DuckDB Version:

1.5

DuckLake Version:

0.4

DuckDB Client:

Python 3.12

Hardware:

No response

Full Name:

M Fariz A

Affiliation:

Frontier Group

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
