MERGE query produces duplicates even though the source is unique #1004

@m-fariz-a

Description

What happens?

I have several DuckLake tables that use column partitioning. One of them is main.populasi_clean.

-- column partitioning query
-- cluster contains unique values: 'CLUSTER 1' to 'CLUSTER 6'
-- respondent_type contains unique values: 'head', 'member'
ALTER TABLE main.populasi_clean SET PARTITIONED BY (cluster, respondent_type);

This table is a replica of the original table in PostgreSQL. Data from PostgreSQL is inserted into DuckLake using a MERGE query as follows:

# python code
query = f'''
    MERGE INTO ducklake.main.populasi_clean as target
    USING df as source
    ON target.id_populasi_raw = source.id_populasi_raw
    WHEN MATCHED THEN UPDATE SET
        {update_clause}
    WHEN NOT MATCHED BY TARGET THEN INSERT BY NAME
    ;
'''
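For reference, an `update_clause` like the one interpolated above is typically generated from the table's non-key column names. The sketch below is hypothetical (the issue does not show how the real script builds it), with an illustrative column list:

```python
# Hypothetical sketch of how an update_clause such as the one above might be
# generated. The column list here is illustrative; the actual script's
# columns are not shown in the issue.
columns = ["datetime_insert", "datetime_update", "_last_sync",
           "respondent_type", "cluster"]

# One "target.col = source.col" assignment per non-key column,
# joined with commas for the WHEN MATCHED THEN UPDATE SET clause.
update_clause = ",\n        ".join(
    f"target.{col} = source.{col}" for col in columns
)
print(update_clause)
```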

I have verified that the source data is unique. However, the MERGE result produces duplicate records in DuckLake.
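A uniqueness check on the merge key can be done along these lines (a minimal stdlib sketch over made-up rows; the real check was presumably a `COUNT`/`HAVING` query against the source DataFrame):

```python
from collections import Counter

# Illustrative source rows keyed by id_populasi_raw (values are made up).
rows = [
    {"id_populasi_raw": 1, "cluster": "CLUSTER 1", "respondent_type": "head"},
    {"id_populasi_raw": 2, "cluster": "CLUSTER 1", "respondent_type": "member"},
    {"id_populasi_raw": 3, "cluster": "CLUSTER 2", "respondent_type": "head"},
]

# Count occurrences of the merge key; any key seen more than once would make
# the MERGE's ON condition match multiple source rows for one target row.
key_counts = Counter(r["id_populasi_raw"] for r in rows)
duplicates = {k: n for k, n in key_counts.items() if n > 1}
print(duplicates)  # an empty dict means the source is unique on the key
```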

I’m not entirely sure what is causing this issue, since I have other tables using the same update script that do not produce duplicates.

This screenshot summarizes the Parquet data used to update DuckLake, captured right before the MERGE process is executed:

[Screenshot: summary of the Parquet source data]
* file_source: indicates which Parquet file the data was read from
* _last_sync: indicates when the merge process was executed
* datetime_insert: indicates when the record was inserted into its Postgres table
* datetime_update: indicates when the record was last updated in its Postgres table

This is the result in DuckLake

[Screenshot: resulting duplicate rows in DuckLake]

I tried periodically removing all duplicate rows and rerunning the merge script manually, and that did not produce any duplicates. However, after the script had been executed several more times by cron jobs, the table contained duplicate values again.
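The periodic cleanup amounts to keeping the latest copy per key. A minimal in-memory sketch (illustrative rows; assuming `datetime_update` decides which copy survives):

```python
from datetime import datetime

# Duplicate rows for the same id_populasi_raw, as observed in DuckLake
# (dates are made up for illustration).
rows = [
    {"id_populasi_raw": 1, "datetime_update": datetime(2024, 1, 1)},
    {"id_populasi_raw": 1, "datetime_update": datetime(2024, 2, 1)},
    {"id_populasi_raw": 2, "datetime_update": datetime(2024, 1, 5)},
]

# Keep only the most recently updated copy of each key.
latest = {}
for r in rows:
    key = r["id_populasi_raw"]
    if key not in latest or r["datetime_update"] > latest[key]["datetime_update"]:
        latest[key] = r

deduped = list(latest.values())
print(len(deduped))  # 2 rows remain: one per key
```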

To Reproduce

Generally, the steps to reproduce are to create the Postgres and DuckLake tables and then follow these steps:

  1. Create postgres table
-- create postgres table
CREATE TABLE public.populasi_clean (
	datetime_insert timestamp DEFAULT date_trunc('second'::text, CURRENT_TIMESTAMP) NULL,
	datetime_update timestamp DEFAULT date_trunc('second'::text, CURRENT_TIMESTAMP) NULL,
	id_populasi_raw int4 NOT NULL,
	respondent_type text,
	cluster text,
	CONSTRAINT populasi_clean_pkey PRIMARY KEY (id_populasi_raw)
);

-- create postgres function
CREATE OR REPLACE FUNCTION public.update_datetime_update_column()
 RETURNS trigger
 LANGUAGE plpgsql
AS $function$
BEGIN
NEW.datetime_update = date_trunc('seconds', NOW());
RETURN NEW; 
END; 
$function$
;

-- Table Triggers
CREATE TRIGGER update_populasi_clean_datetime_update
BEFORE UPDATE ON public.populasi_clean
FOR EACH ROW EXECUTE FUNCTION update_datetime_update_column();
  2. Create the DuckLake table and its partition
-- create ducklake table
CREATE TABLE main.populasi_clean (
	datetime_insert timestamp,
	datetime_update timestamp,
	_last_sync timestamp,
	id_populasi_raw int,
	respondent_type varchar,
	cluster varchar
);

-- alter table
ALTER TABLE main.populasi_clean SET PARTITIONED BY (cluster, respondent_type);
ALTER TABLE main.populasi_clean ALTER id_populasi_raw SET NOT NULL;
  3. Insert any random values into the Postgres table
  4. Run this Python script to execute the MERGE query. It is not exactly the same as the production script, but the core logic is similar: Migrate2Ducklake.py
  5. This is the query used to fetch data from the Postgres source: data_migration.sql
  6. Update any random record or insert a new record into Postgres
  7. Redo step 6
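For reference, the expected MERGE semantics for the steps above can be modelled as a plain upsert keyed on `id_populasi_raw`: repeated runs should never grow the table beyond one row per key. An in-memory sketch (not DuckLake itself, just the intended behavior):

```python
# In-memory model of MERGE ... WHEN MATCHED THEN UPDATE /
# WHEN NOT MATCHED THEN INSERT, keyed on id_populasi_raw.
def merge(target: dict, source_rows: list) -> None:
    for row in source_rows:
        # Update if the key matches an existing row, insert otherwise.
        target[row["id_populasi_raw"]] = row

target = {}
batch = [{"id_populasi_raw": 1, "cluster": "CLUSTER 1"},
         {"id_populasi_raw": 2, "cluster": "CLUSTER 2"}]

# Running the same merge repeatedly (as the cron job does) must be idempotent.
merge(target, batch)
merge(target, batch)
print(len(target))  # still 2: no duplicates under correct MERGE semantics
```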

As mentioned previously, I am not really sure what produces the duplicates, because other tables using the same script do not seem to have this issue.

OS:

Ubuntu 22.04

DuckDB Version:

1.5

DuckLake Version:

0.4

DuckDB Client:

Python 3.12

Hardware:

No response

Full Name:

M Fariz A

Affiliation:

Frontier Group

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

No - I cannot share the data sets because they are confidential

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
