Publications

MicroscopeSketch: Accurate Sliding Estimation Using Adaptive Zooming.

Published in Knowledge Discovery and Data Mining (KDD) 2023, 2023

High-accuracy real-time data stream estimations are critical for various applications, and sliding-window-based techniques have attracted wide attention. However, existing solutions struggle to achieve high accuracy, generality, and low memory usage simultaneously. To overcome these limitations, we present MicroscopeSketch, a high-accuracy sketch framework. Our key technique, called adaptive zooming, dynamically adjusts the granularity of counters to maximize accuracy while minimizing memory usage. By applying MicroscopeSketch to three specific tasks—frequency estimation, top-𝑘 frequent items discovery, and top-𝑘 heavy changes identification—we demonstrate substantial improvements over existing methods, reducing errors by roughly 4 times for frequency estimation and 3 times for identifying top-𝑘 items. The relevant source code is available in a GitHub repository.

Download here

SketchPolymer: Estimate Per-item Tail Quantile Using One Sketch.

Published in Knowledge Discovery and Data Mining (KDD) 2023, 2023

Estimating the quantile of distribution, especially tail distribution, is an interesting topic in data stream models, and has obtained extensive interest from many researchers. In this paper, we propose a novel sketch, namely SketchPolymer to accurately estimate per-item tail quantile. SketchPolymer uses a technique called Early Filtration to filter infrequent items, and another technique called VSS to reduce error. Our experimental results show that the accuracy of SketchPolymer is on average 32.67 times better than state-of-the-art techniques. We also implement our SketchPolymer on P4 and FPGA platforms to verify its deployment flexibility. All our codes are available at GitHub.

Download here

Finding Persistent and Infrequent Flows

Published in Transactions on Networking (ToN) 2023, 2023

Finding persistent and low-active activity periods is very helpful in practice, for example to detect intrusion activities. Most of the literature focuses on finding persistent flows or frequent flows. No previous work is able to find persistent and infrequent flows. In this paper, we propose a novel sketch data structure, PISketch, to find persistent and infrequent flows in real time. The key idea of PISketch is to define a weight and its Reward and Penalty System for each flow to combine and balance the information of both persistency and infrequency, and to keep high-weighted flows in a limited space through a strategy. We implement PISketch on P4, FPGA, and CPU platforms, and compare the performance of PISketch with two strawman solutions (On-Off + CM sketch, and PIE + CM sketch), in terms of finding persistent and infrequent flows. Our experimental results demonstrate the advantage of PISketch, by comparing it to two strawman solutions: 1) The F1 Score of PISketch is around 22.1% and 57.6% higher than two strawman solutions, respectively; 2) The Average Relative Error (ARE) of PISketch is around 820.9 (up to 1188.8) and 126.2 (up to 265.6) times lower than two strawman solutions, respectively; 3) The insertion throughput of PISketch is around 1.23 and 16.5 times higher than two strawman solutions, respectively. Moreover, we implement two concrete cases of PISketch through end-to-end experiments. All of our codes are available at GitHub.

Download here

Published in , 1900

SHE: A Generic Framework for Data Stream Mining over Sliding Windows.

Published in International Conference on Parallel Processing (ICPP) 2022, 2022

Data stream mining over a sliding window is a fundamental problem in many applications, such as financial data trackers, intrusion detection and QoS. To meet the demand for high throughput of high speed data streams, hardware platforms (FPGA/ASIC) have been designed. These hardware platforms have three constraints for algorithms running on, which are 1) small memory usage 2) single stage memory access and 3) limited concurrent memory access. Algorithms perfectly fit in with these constraints will enable a highest utilization of these hardware platforms. However, no existing sliding window algorithm is specifically designed for hardware platforms. In this paper, we propose the Sliding Hardware Estimator (SHE), which is a generic framework that extends existing fixed window algorithms to sliding windows for hardware platforms. The key idea of SHE is that, during insertions we approximately delete out-dated information with little time and space overhead, while during queries we design sophisticated techniques to minimize error. We have fully implemented our SHE on FPGA, achieving a throughput of 544 Mips. We apply SHE to four typical data stream mining tasks. Experimental results show that, when compared with the state-of-the-art which cannot be implemented in hardware, SHE reduces the error by up to 100 times in membership queries. All related source codes are anonymously released at Github.

Download here

PISketch: Finding Persistent and Infrequent Flows

Published in ACM SIGCOMM Workshop on FFSPIN 2022., 2022

Data stream mining over a sliding window is a fundamental problem in many applications, such as financial data trackers, intrusion detection and QoS. To meet the demand for high throughput of high speed data streams, hardware platforms (FPGA/ASIC) have been designed. These hardware platforms have three constraints for algorithms running on, which are 1) small memory usage 2) single stage memory access and 3) limited concurrent memory access. Algorithms perfectly fit in with these constraints will enable a highest utilization of these hardware platforms. However, no existing sliding window algorithm is specifically designed for hardware platforms. In this paper, we propose the Sliding Hardware Estimator (SHE), which is a generic framework that extends existing fixed window algorithms to sliding windows for hardware platforms. The key idea of SHE is that, during insertions we approximately delete out-dated information with little time and space overhead, while during queries we design sophisticated techniques to minimize error. We have fully implemented our SHE on FPGA, achieving a throughput of 544 Mips. We apply SHE to four typical data stream mining tasks. Experimental results show that, when compared with the state-of-the-art which cannot be implemented in hardware, SHE reduces the error by up to 100 times in membership queries. All related source codes are anonymously released at Github.

Download here

Precise Error Estimation for Sketch-based Flow Measurement.

Published in November 02, 2021

As a class of approximate measurement approaches, sketching algorithms have significantly improved the estimation of network flow information using limited resources. While these algorithms enjoy sound error-bound analysis under worst-case scenarios, their actual errors can vary significantly with the incoming flow distribution, making their traditional error bounds too “loose” to be useful in practice. In this paper, we propose a simple yet rigorous error estimation method to more precisely analyze the errors for posterior sketch queries by leveraging the knowledge from the sketch counters. This approach will enable network operators to understand how accurate the current measurements are and make appropriate decisions accordingly (e.g., identify potential heavy users or answer “what-if” questions to better provision resources). Theoretical analysis and trace-driven experiments show that our estimated bounds on sketch errors are much tighter than previous ones and match the actual error bounds in most cases.

Download here

MapEmbed: Perfect Hashing with High Load Factor and Fast Update.

Published in August 14, 2021

Perfect hashing is a hash function that maps a set of distinct keys to a set of continuous integers without collision. However, most existing perfect hash schemes are static, which means that they cannot support incremental updates, while most datasets in practice are dynamic. To address this issue, we propose a novel hashing scheme, namely MapEmbed Hashing. Inspired by divide-and-conquer and map-and-reduce, our key idea is named map-and-embed and includes two phases: 1) Map all keys into many small virtual tables; 2) Embed all small tables into a large table by circular move. Our experimental results show that under the same experimental setting, the state-of-the-art perfect hashing (dynamic perfect hashing) can achieve around 15% load factor, around 0.3 Mops update speed, while our MapEmbed achieves around 90% ∼ 95% load factor, and around 8.0 Mops update speed per thread. All codes of ours and other algorithms are open-sourced at GitHub.

Download here

Elastic Bloom Filter: Deletable and ExpandableFilter Using Elastic Fingerprints.

Published in March 22 , 2021

The Bloom filter, answering whether an item is in a set, has achieved great success in various fields, including networking, databases, and bioinformatics. However, the Bloom filter has two main shortcomings: no support of item deletion and no support of expansion. Existing solutions either support deletion at the cost of using additional memory, or support expansion at the cost of increasing the false positive rate and decreasing the query speed. Unlike existing solutions, we propose the Elastic Bloom filter (EBF) to address the two shortcomings simultaneously. Importantly, when EBF expands, the false positives decrease. Our key technique is Elastic Fingerprints, which dynamically absorb and release bits during compression and expansion. To support deletion, EBF can first delete the corresponding fingerprint and then update the corresponding bit in the Bloom filter. To support expansion, Elastic Fingerprints release bits and insert them to the Bloom filter. Our experimental results show that the Elastic Bloom filter significantly outperforms existing works.

Download here