This repository hosts the implementation and benchmarking of Evolving Sketch, an adaptively parameterized time-decaying frequency sketch for online streams.
- Programming Language: C++
- Build Tool: CMake
- Logging: spdlog
- Testing: doctest
- Others: Clang-Tidy / Clang-Format
This project uses CMake to build, and uses CPM.cmake to automatically download dependencies during the build, so you can build the project without worrying about dependencies.
Make sure your C++ compiler supports C++ 23 standard before building. This project has been tested with GCC 15 and Clang 20 on Linux.
To build the program in debug mode, run the following command in the project root directory:
mkdir -p build
cd build
cmake -G Ninja -DCMAKE_BUILD_TYPE=Debug ..
cmake --build . --config Debug
cd ..To build the program in release mode, run the following command in the project root directory:
mkdir -p build
cd build
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release ..
cmake --build . --config Release
cd ..Several real-world datasets and a synthetic dataset are used in the benchmarking of Evolving Sketch. Follow the instructions below to retrieve and prepare the datasets.
Preprocessing scripts are implemented in TypeScript and are located in the scripts/ directory, and you can run them with Node.js ≥ 24. The following code snippets assume you run commands in the project root directory.
-
Cache traces. The cache traces we used are from a recent paper in caching systems. Follow the link mentioned in its reproducing guide to download the cache traces. To fetch the MSR trace used in the paper:
mkdir -p data wget https://ftp.pdl.cmu.edu/pub/datasets/twemcacheWorkload/cacheDatasets/msr/msr_prxy_1.oracleGeneral.zst -O data/msr.oracleGeneral.zst zstd -d data/msr.oracleGeneral.zst -o data/msr.oracleGeneral rm data/msr.oracleGeneral.zst
The Meta KV trace linked in the paper is no longer hosted. You can recreate an equivalent trace from Meta’s CacheBench dataset on its AWS S3 bucket and convert it with the helper program below. Before running the commands, make sure AWS CLI is installed (e.g.,
sudo apt install awsclion Debian/Ubuntu).mkdir -p data aws s3 cp --no-sign-request s3://cachelib-workload-sharing/pub/kvcache/202206/kvcache_traces_1.csv data/meta.csv g++ -std=c++23 -O3 scripts/convert_meta.cpp -o convert_meta ./convert_meta data/meta.csv data/meta.oracleGeneral rm convert_meta rm data/meta.csv
-
H&M Personalized Fashion Recommendations from Kaggle. You should first download
articles.csv.zipandtransactions_train.csv.zipfrom the Kaggle competition page, and then unzip them intodata/articles.csvanddata/transactions_train.csvrespectively. Then you can use the following command to process the data:node scripts/csv.ts select data/articles.csv 0,1 --output data/articles_selected.csv rm data/articles.csv node scripts/csv.ts select data/transactions_train.csv 0,2 --output data/transactions_train_selected.csv rm data/transactions_train.csv node scripts/csv.ts join data/transactions_train_selected.csv data/articles_selected.csv 1 0 --output data/hm_joined.csv rm data/transactions_train_selected.csv rm data/articles_selected.csv node scripts/csv.ts select data/hm_joined.csv 0,2 --output data/hm.csv rm data/hm_joined.csv
-
Synthetic e-commerce stream. Use the provided script to generate a synthetic e-commerce stream with Node.js ≥ 24 (run from the project root):
node scripts/synth.ts --output data/synthetic.csv
After building and preparing the datasets, you can run the benchmark program to evaluate the performance of Evolving Sketch.
$ ./build/benchmark
Usage:
./build/benchmark caching [--help] [--version] [--parallel] [--output VAR] trace_path cache_size_ratio adapt_intervals alphas
./build/benchmark hm [--help] [--version] [--parallel] [--output VAR] trace_path cache_size_ratio top_k adapt_intervals alphas
$ ./build/benchmark hm
Usage: ./build/benchmark hm [--help] [--version] [--parallel] [--output VAR] trace_path cache_size_ratio top_k adapt_intervals alphas
Positional arguments:
trace_path The path to the cache trace file
cache_size_ratio The ratio of the cache size to the number of unique objects in the trace
top_k The number of top results to return
adapt_intervals Comma-separated list of adaptation intervals to use (only used by EvolvingSketch) (e.g., '1000,10000,100000')
alphas Comma-separated list of alpha values to use (e.g., '0.1,0.2,0.3')
Optional arguments:
-h, --help shows help message and exits
-v, --version prints version information and exits
-p, --parallel Run all experiments in parallel
-o, --output Output file path (as CSV) [nargs=0..1] [default: ""]
$ ./build/benchmark_hm
Usage:
./build/benchmark_hm {CMS|ADA|EVO_PRUNING_ONLY|EVO} ...For example, to run Evolving Sketch on the cache trace data/msr.oracleGeneral with a cache size ratio of 0.01, an adaptation interval of 10,000, and decay factors 0.5 and 1.0, run:
./build/benchmark caching data/msr.oracleGeneral 0.01 10000 0.5,1.0Logs print to stdout. To save benchmark results as CSV, pass --output <file.csv>.
We also provide a figures/visualize.ipynb Jupyter notebook to visualize the benchmark results saved as CSV files. The notebook is written in TypeScript and run in Jupyter Kernel for Deno, employing several libraries such as Polars and Observable Plot, so you need to install Deno first and follow the instructions to install Jupyter Kernel for Deno.
All plots in the paper are generated by the figures/visualize.ipynb notebook. To reproduce the results:
Make sure the datasets are prepared as described in “Data Retrieval” before running the commands below.
-
Build the project in Release as described in the “Build” section.
-
Generate the changing DCG curve on the synthetic e-commerce stream for different decay parameters (alpha) to produce the figure used in the introduction:
mkdir -p output ./build/benchmark_hm CMS data/synthetic.csv 4000 1000 10000 0 --trace output/synthetic_cms.trace.csv --progress ./build/benchmark_hm EVO_PRUNING_ONLY data/synthetic.csv 4000 1000 10000 0.05 --trace output/synthetic_alpha0.05.trace.csv --progress ./build/benchmark_hm EVO_PRUNING_ONLY data/synthetic.csv 4000 1000 10000 500 --trace output/synthetic_alpha500.trace.csv --progress
-
Run the benchmarks on the cache traces:
./build/benchmark caching data/meta.oracleGeneral 0.1 1000,10000,100000 0.01,0.02,0.05,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000 -o output/meta_large.csv ./build/benchmark caching data/meta.oracleGeneral 0.01 1000,10000,100000 0.01,0.02,0.05,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000 -o output/meta_small.csv ./build/benchmark caching data/msr.oracleGeneral 0.1 1000,10000,100000 0.01,0.02,0.05,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000 -o output/msr_large.csv ./build/benchmark caching data/msr.oracleGeneral 0.01 1000,10000,100000 0.01,0.02,0.05,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000 -o output/msr_small.csv
-
Run the benchmarks on the online ranking traces (H&M and synthetic e-commerce):
./build/benchmark hm data/hm.csv 0.1 1000 1000,10000,100000 0.01,0.02,0.05,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000 -o output/hm_large.csv ./build/benchmark hm data/hm.csv 0.1 100 1000,10000,100000 0.01,0.02,0.05,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000 -o output/hm_small.csv ./build/benchmark hm data/synthetic.csv 0.1 1000 1000,10000,100000 0.01,0.02,0.05,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000 -o output/synthetic_large.csv ./build/benchmark hm data/synthetic.csv 0.1 100 1000,10000,100000 0.01,0.02,0.05,0.1,0.2,0.5,1,2,5,10,20,50,100,200,500,1000 -o output/synthetic_small.csv
-
Record traces to generate the adaptation timeline:
./build/benchmark_caching W-TinyLFU_EVO data/meta.oracleGeneral 176705 10000 1 --trace output/meta.trace.csv --progress ./build/benchmark_caching W-TinyLFU_EVO data/msr.oracleGeneral 4902 10000 1 --trace output/msr.trace.csv --progress ./build/benchmark_hm EVO data/synthetic.csv 4000 100 10000 1 --trace output/synthetic.trace.csv --progress ./build/benchmark_hm EVO data/hm.csv 4683 100 10000 1 --trace output/hm.trace.csv --progress
-
Open
figures/visualize.ipynbin Jupyter and run all cells to generate the plots.