#9106: Add PDiff similarity score to Version Tracking #9107

Open
clearbluejar wants to merge 2 commits into NationalSecurityAgency:master from clearbluejar:pdiff-similarity-score

clearbluejar commented Apr 7, 2026

Summary

Add a PDiff (basic-block mnemonic hash) similarity score to the Version Tracking match
table. The score is computed at match creation time, persisted in the database, and
exposed via a new Similarity column and filter in the UI.

Related: #9106, #5859

Problem

Version Tracking correlators each produce their own similarity scores, but these reflect
each correlator's matching algorithm rather than actual structural similarity. For example,
the Symbol Name correlator assigns a perfect 1.0 when names match, even if the functions
differ significantly. There is no correlator-independent metric for how structurally
similar two matched functions are at the basic-block level, making it difficult to
prioritize matches when patch diffing.

Solution

New PDiff similarity score — computed once at match creation time and stored in the DB.

Score formula

  • 95% basic-block mnemonic hash similarity: for each basic block, the instruction
    mnemonic hashes are collected, sorted (to tolerate compiler instruction reordering),
    and combined into a per-block hash. A sorted merge then counts the blocks whose
    hashes match between source and destination.
  • 5% stack frame size similarity: the min(a,b)/max(a,b) ratio of the two frame sizes; a
    subtle tiebreaker that distinguishes otherwise-identical functions with different
    local variable allocation.
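
The formula above can be sketched in plain Java. This is a minimal illustration, not the code in BasicBlockMnemonicFunctionBulker: the class name `PdiffScoreSketch`, the use of `List.hashCode()` as the per-block hash, the normalization of the match count by the larger block count, and the handling of two zero-sized frames are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of the 95/5 score; not the PR's actual implementation. */
final class PdiffScoreSketch {

    /** Order-insensitive hash of one basic block: sort the mnemonics, then hash the list. */
    static long blockHash(List<String> mnemonics) {
        List<String> sorted = new ArrayList<>(mnemonics);
        Collections.sort(sorted);
        return sorted.hashCode();
    }

    /** Sorted per-block hashes of a function given as a list of basic blocks. */
    static long[] blockHashes(List<List<String>> blocks) {
        long[] hashes = blocks.stream().mapToLong(PdiffScoreSketch::blockHash).toArray();
        Arrays.sort(hashes);
        return hashes;
    }

    /** Sorted merge: counts hashes that occur in both sorted arrays. */
    static int countMatches(long[] a, long[] b) {
        int i = 0, j = 0, matches = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) { matches++; i++; j++; }
            else if (a[i] < b[j]) { i++; }
            else { j++; }
        }
        return matches;
    }

    /** Assumption: matching blocks are normalized by the larger block count. */
    static double blockSimilarity(List<List<String>> src, List<List<String>> dst) {
        long[] a = blockHashes(src);
        long[] b = blockHashes(dst);
        int max = Math.max(a.length, b.length);
        return max == 0 ? 1.0 : (double) countMatches(a, b) / max;
    }

    /** min/max ratio of frame sizes. Assumption: two empty frames count as identical. */
    static double frameSimilarity(int srcFrame, int dstFrame) {
        int max = Math.max(srcFrame, dstFrame);
        return max == 0 ? 1.0 : (double) Math.min(srcFrame, dstFrame) / max;
    }

    /** Weighted combination: 95% block similarity, 5% stack frame similarity. */
    static double score(List<List<String>> src, int srcFrame,
                        List<List<String>> dst, int dstFrame) {
        return 0.95 * blockSimilarity(src, dst) + 0.05 * frameSimilarity(srcFrame, dstFrame);
    }

    public static void main(String[] args) {
        List<List<String>> src = List.of(List.of("PUSH", "MOV", "CALL"), List.of("RET"));
        List<List<String>> dst = List.of(List.of("MOV", "PUSH", "CALL"), List.of("RET"));
        System.out.println(score(src, 16, dst, 32));
    }
}
```

Because mnemonics are sorted before hashing, the two functions in `main` have identical block hashes even though the first block's instructions are reordered; only the differing frame sizes lower the score.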

New UI components

  • Similarity column — displays the stored PDiff score in the match table
  • Similarity filter — filter matches by PDiff score range (bypasses DATA and null-score
    matches)
  • Best PDiff Match filter — deduplicates matches across correlators, keeping the best
    score per source/destination function pair
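
The deduplication performed by the Best PDiff Match filter can be pictured as a "best per key" pass over the match list. The sketch below is hypothetical and does not use the VT API: the `Match` class, the string addresses, ranking null scores lowest, and keeping the first match on ties are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of cross-correlator deduplication; not BestPDiffMatchFilter itself. */
final class BestMatchSketch {

    /** Simplified stand-in for a match row; score is null when no PDiff score is stored. */
    static final class Match {
        final String source;
        final String destination;
        final String correlator;
        final Double score;

        Match(String source, String destination, String correlator, Double score) {
            this.source = source;
            this.destination = destination;
            this.correlator = correlator;
            this.score = score;
        }
    }

    /** Assumption: a missing score ranks below every real score. */
    private static double rank(Match m) {
        return m.score == null ? -1.0 : m.score;
    }

    /** Keeps one match per source/destination pair: the one with the highest score. */
    static List<Match> bestPerPair(List<Match> matches) {
        Map<String, Match> best = new LinkedHashMap<>();
        for (Match m : matches) {
            String key = m.source + "->" + m.destination;
            Match current = best.get(key);
            // Strict comparison keeps the first-seen match on ties.
            if (current == null || rank(m) > rank(current)) {
                best.put(key, m);
            }
        }
        return new ArrayList<>(best.values());
    }

    public static void main(String[] args) {
        List<Match> matches = List.of(
            new Match("0x1000", "0x2000", "Symbol Name", 0.80),
            new Match("0x1000", "0x2000", "Exact Bytes", 0.95),
            new Match("0x1100", "0x2100", "Symbol Name", 0.60));
        for (Match m : bestPerPair(matches)) {
            System.out.println(m.source + " -> " + m.destination + " via " + m.correlator);
        }
    }
}
```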

Backward compatibility

  • Auto-upgrade: v0 sessions are automatically migrated to v1 schema on open
  • Backfill: After programs are loaded, any function match with a null PDiff score is
    automatically computed and persisted — upgraded sessions get full scores on first open
  • Filter safety: Null scores (DATA matches, pre-migration) pass through the Similarity
    filter instead of being rejected
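
The null pass-through rule can be expressed as a small predicate. This is a sketch under assumptions, not SimilarityFilter itself: the class and method names are hypothetical, and treating both range bounds as inclusive is a guess.

```java
/** Hypothetical sketch of the null-safe range check; not the PR's SimilarityFilter. */
final class SimilarityRangeSketch {

    /**
     * Returns true when a match should stay visible.
     * A null score (DATA match or not yet backfilled) always passes.
     * Assumption: lower and upper bounds are inclusive.
     */
    static boolean passes(Double score, double lower, double upper) {
        if (score == null) {
            return true;
        }
        return score >= lower && score <= upper;
    }

    public static void main(String[] args) {
        System.out.println(passes(null, 0.5, 1.0));
        System.out.println(passes(0.3, 0.5, 1.0));
        System.out.println(passes(0.7, 0.5, 1.0));
    }
}
```

Passing nulls through means a freshly upgraded session never hides matches merely because their scores have not been backfilled yet.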

Changes (15 files, 846 insertions, 13 deletions)

Core DB schema

  • VTMatchTableDBAdapter.java — Add PDIFF_SIMILARITY_SCORE_COL, bump schema 0→1
  • VTMatchTableDBAdapterV0.java — Accept v0 and v1, auto-upgrade v0→v1, write new column
  • VTMatchDB.java — Read/write stored PDiff score
  • VTSessionDB.java — Wrap adapter init in transaction (for upgrade), backfill on program open

Domain model

  • VTMatch.java — Add getPdiffSimilarityScore() interface method
  • VTMatchInfo.java — Add pdiffSimilarityScore field with getter/setter

Score computation

  • VTMatchSetDB.java — Compute PDiff score in addMatch() for FUNCTION matches
  • BasicBlockMnemonicFunctionBulker.java — Per-basic-block mnemonic hashing + combined
    similarity
  • FunctionBulker.java — Interface for function hash strategies

UI components

  • AbstractVTMatchTableModel.java — New Similarity column reading stored value
  • SimilarityFilter.java — New filter by PDiff score range
  • BestPDiffMatchFilter.java — New deduplication filter across correlators
  • VTMatchTableModel.java — Register Similarity column
  • VTMatchTableProvider.java — Register Similarity and BestPDiff filters

Support

  • MatchMapper.java — Delegate getPdiffSimilarityScore() for implied matches

Test plan

  • All 6 existing VT database tests pass (./gradlew :VersionTracking:test)
  • Build compiles clean (./gradlew jar)
  • New VT sessions compute and display Similarity scores immediately
  • Old v0 sessions auto-upgrade and backfill scores on open
  • Similarity filter works responsively
  • DATA matches and null-score matches pass through filter correctly
  • BestPDiff filter correctly deduplicates across correlators

Compute the PDiff similarity score (95% basic-block mnemonic hash
similarity + 5% stack frame size similarity) once at match creation
time and persist it in the match table DB. This eliminates expensive
on-the-fly recomputation every time the Similarity column renders or
the Similarity filter runs.

Schema changes:
- Add PDIFF_SIMILARITY_SCORE_COL to match table (schema v0 -> v1)
- Auto-upgrade v0 tables on session open (recreate with new column)
- Backfill scores for migrated matches once programs are loaded

Key files:
- VTMatchSetDB.addMatch(): computes score for all FUNCTION matches
- VTMatchTableDBAdapterV0: v0->v1 schema migration
- VTSessionDB.backfillPdiffScores(): one-time backfill on open
- BulkBBSimilarityTableColumn: reads stored score instead of computing
- SimilarityFilter: reads stored score, passes null scores through
- BasicBlockMnemonicFunctionBulker: adjusted weights to 95/5
- BestPDiffMatchFilter: deduplicates matches across correlators

Correlators reuse a single VTMatchInfo across all addMatch() calls.
The null check on getPdiffSimilarityScore() meant the score was only
computed for the first match — all subsequent matches inherited that
stale value. Remove the null guard so the score is always recomputed
for FUNCTION matches.
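
The stale-value problem described in this commit can be reproduced in isolation. The sketch below is a simplified model, not the VT code: `MatchInfo`, `addAll`, and the string "pairs" are hypothetical stand-ins for the reused VTMatchInfo and the addMatch() loop.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical model of a reused, mutable match-info object. */
final class StaleScoreSketch {

    /** Stand-in for the info object that a correlator reuses for every match. */
    static final class MatchInfo {
        String pair;
        Double pdiffScore;
    }

    /**
     * Simulates repeated addMatch() calls on one shared MatchInfo.
     * With nullGuard = true the score is computed only while it is null,
     * so every later match stores the first match's score.
     */
    static List<Double> addAll(List<String> pairs, ToDoubleFunction<String> scorer,
                               boolean nullGuard) {
        MatchInfo info = new MatchInfo(); // one instance reused across all matches
        List<Double> stored = new ArrayList<>();
        for (String pair : pairs) {
            info.pair = pair;
            if (!nullGuard || info.pdiffScore == null) {
                info.pdiffScore = scorer.applyAsDouble(pair);
            }
            stored.add(info.pdiffScore);
        }
        return stored;
    }

    public static void main(String[] args) {
        List<String> pairs = List.of("a", "bb", "ccc");
        ToDoubleFunction<String> scorer = p -> p.length(); // toy score: string length
        System.out.println(addAll(pairs, scorer, true));   // guarded: first score repeats
        System.out.println(addAll(pairs, scorer, false));  // unguarded: recomputed each time
    }
}
```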
Labels

Feature: Version Tracking
Status: Triage (Information is being gathered)

Development

Successfully merging this pull request may close these issues.

Add correlator-independent similarity score to Version Tracking match table

3 participants