TY - GEN
T1 - Finding local anomalies in very high dimensional space
AU - De Vries, Timothy
AU - Chawla, Sanjay
AU - Houle, Michael E.
PY - 2010
Y1 - 2010
N2 - Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of data is very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest neighbour sets in the a reduced dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300000 elements and 102600 dimensions.
AB - Time, cost and energy efficiency are critical factors for many data analysis techniques when the size and dimensionality of data is very large. We investigate the use of Local Outlier Factor (LOF) for data of this type, providing a motivating example from real world data. We propose Projection-Indexed Nearest-Neighbours (PINN), a novel technique that exploits extended nearest neighbour sets in the a reduced dimensional space to create an accurate approximation for k-nearest-neighbour distances, which is used as the core density measurement within LOF. The reduced dimensionality allows for efficient sub-quadratic indexing in the number of items in the data set, where previously only quadratic performance was possible. A detailed theoretical analysis of Random Projection (RP) and PINN shows that we are able to preserve the density of the intrinsic manifold of the data set after projection. Experimental results show that PINN outperforms the standard projection methods RP and PCA when measuring LOF for many high-dimensional real-world data sets of up to 300000 elements and 102600 dimensions.
KW - Anomaly detection
KW - Dimensionality reduction
UR - http://www.scopus.com/inward/record.url?scp=79951739637&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2010.151
DO - 10.1109/ICDM.2010.151
M3 - Conference contribution
AN - SCOPUS:79951739637
SN - 9780769542560
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 128
EP - 137
BT - Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010
T2 - 10th IEEE International Conference on Data Mining, ICDM 2010
Y2 - 14 December 2010 through 17 December 2010
ER -