TY - JOUR
T1 - Computing discounted multidimensional hierarchical aggregates using modified Misra Gries algorithm
AU - Shah, Zubair
AU - Mahmood, Abdun Naser
AU - Barlow, Michael
N1 - Publisher Copyright:
© 2015 Elsevier B.V.
PY - 2015/9/1
Y1 - 2015/9/1
N2 - Finding the "Top k" list or heavy hitters is an important function in many computing applications, including database joins, data warehousing (e.g., OLAP), web caching and hits, network usage monitoring, and detecting DDoS attacks. While most applications work on traditional "flat" data, many domains contain data whose attributes can take values from different hierarchies, e.g., time and geographic location, or source and destination IP addresses. In addition, hierarchical heavy hitters generated from hierarchical attributes offer an insightful and generalized view of the data. When data arrives in a stream, infrequent data elements need to be deleted due to space constraints. However, unlike traditional heavy hitters, which do not consider deleted elements, some of these deleted elements could form a hierarchical heavy hitter at a higher level of the hierarchy. Furthermore, unlike traditional heavy hitters, hierarchical heavy hitters have an ancestor-descendant relationship, which requires the counts of descendant heavy hitters to be discounted at higher levels of the hierarchy. This is particularly challenging in a streaming environment because the incoming data items cannot all be stored or revisited. This problem is generally addressed by accepting an error constraint ε on the precision of the counts of hierarchical heavy hitters, since accurate counts cannot be guaranteed under the constrained memory imposed by the streaming environment. In this work, we propose a new streaming ε-approximation algorithm (HHH-MG) for computing Hierarchical Heavy Hitters based on a modified Misra-Gries heavy hitter algorithm. The proposed algorithm guarantees ε-approximation precision with improved worst-case time and space bounds compared to previous algorithms. It requires O(Formula presented.) space overall, and O(η) updates per element of the data, where η is a small constant.
We provide theoretical proofs of the space and time requirements. We have also compared the proposed algorithm experimentally with benchmark techniques. Experimental results demonstrate that the proposed algorithm requires fewer updates per data element and, on average, less memory. For the experimental validation, we used both synthetic data derived from an open-source generator and real benchmark datasets from an international Internet Service Provider.
AB - Finding the "Top k" list or heavy hitters is an important function in many computing applications, including database joins, data warehousing (e.g., OLAP), web caching and hits, network usage monitoring, and detecting DDoS attacks. While most applications work on traditional "flat" data, many domains contain data whose attributes can take values from different hierarchies, e.g., time and geographic location, or source and destination IP addresses. In addition, hierarchical heavy hitters generated from hierarchical attributes offer an insightful and generalized view of the data. When data arrives in a stream, infrequent data elements need to be deleted due to space constraints. However, unlike traditional heavy hitters, which do not consider deleted elements, some of these deleted elements could form a hierarchical heavy hitter at a higher level of the hierarchy. Furthermore, unlike traditional heavy hitters, hierarchical heavy hitters have an ancestor-descendant relationship, which requires the counts of descendant heavy hitters to be discounted at higher levels of the hierarchy. This is particularly challenging in a streaming environment because the incoming data items cannot all be stored or revisited. This problem is generally addressed by accepting an error constraint ε on the precision of the counts of hierarchical heavy hitters, since accurate counts cannot be guaranteed under the constrained memory imposed by the streaming environment. In this work, we propose a new streaming ε-approximation algorithm (HHH-MG) for computing Hierarchical Heavy Hitters based on a modified Misra-Gries heavy hitter algorithm. The proposed algorithm guarantees ε-approximation precision with improved worst-case time and space bounds compared to previous algorithms. It requires O(Formula presented.) space overall, and O(η) updates per element of the data, where η is a small constant.
We provide theoretical proofs of the space and time requirements. We have also compared the proposed algorithm experimentally with benchmark techniques. Experimental results demonstrate that the proposed algorithm requires fewer updates per data element and, on average, less memory. For the experimental validation, we used both synthetic data derived from an open-source generator and real benchmark datasets from an international Internet Service Provider.
KW - Data streams
KW - Heavy hitters
KW - Hierarchical heavy hitters
KW - Network traffic summarization
UR - http://www.scopus.com/inward/record.url?scp=84939258353&partnerID=8YFLogxK
U2 - 10.1016/j.peva.2015.06.011
DO - 10.1016/j.peva.2015.06.011
M3 - Article
AN - SCOPUS:84939258353
SN - 0166-5316
VL - 91
SP - 170
EP - 186
JO - Performance Evaluation
JF - Performance Evaluation
M1 - 1826
ER -