An Investigation of Anomaly Detection Methods in Machine Learning for High Dimensional Datasets

Anomaly detection is the task of identifying observations that differ significantly from the rest of a dataset, to the point that they are suspected of having been generated by a different mechanism. Anomalies are rare by nature. They are usually caused by sensor or human error, such as measurement or recording mistakes, but they sometimes indicate a significant underlying problem or an unexpected condition. The subject is studied under several names, including novelty detection, outlier detection, noise detection, deviation detection, exception mining, and outlier mining. In practice, anomaly detection is applied to fault diagnosis, healthcare informatics and medical diagnostics, fraud detection, intrusion detection, activity monitoring, and novel topic detection in text mining. Visualization and classical statistical methods are sufficient for low-dimensional datasets, whereas numerous machine learning-based methods have been developed for high-dimensional datasets. This study presents three anomaly detection methods in detail to show how their approaches to the problem differ. To that end, it examines the theoretical aspects of the Local Outlier Factor, a density-based method; Isolation Forest, an ensemble method based on Random Forests; and One-Class Support Vector Machines. Implementation details of these methods in scikit-learn, a popular Python machine learning library, are also given.

Keywords: Anomaly Detection, Local Outlier Factor, Isolation Forest, One-Class Support Vector Machines
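
As a rough illustration of the scikit-learn workflow the abstract refers to, the sketch below fits the three named detectors on synthetic data. The data, the parameter values (n_neighbors, n_estimators, nu, contamination) and all variable names are assumptions chosen for illustration, not settings taken from the chapter.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

# Synthetic data (an assumption for illustration): 200 inliers from a standard
# normal in 20 dimensions, plus 10 points scattered uniformly over a wider box.
rng = np.random.default_rng(0)
n_inliers, n_outliers, n_features = 200, 10, 20
inliers = rng.normal(loc=0.0, scale=1.0, size=(n_inliers, n_features))
outliers = rng.uniform(low=-8.0, high=8.0, size=(n_outliers, n_features))
X = np.vstack([inliers, outliers])  # injected outliers occupy the last rows

# The true outlier fraction is known here only because the data is synthetic.
contamination = n_outliers / (n_inliers + n_outliers)

detectors = {
    # Density-based: compares each point's local density with its neighbours'.
    "LocalOutlierFactor": LocalOutlierFactor(
        n_neighbors=20, contamination=contamination
    ),
    # Tree ensemble: anomalies are isolated by fewer random splits.
    "IsolationForest": IsolationForest(
        n_estimators=100, contamination=contamination, random_state=0
    ),
    # Kernel method: nu upper-bounds the fraction of training points
    # treated as errors; 0.1 is an arbitrary illustrative choice.
    "OneClassSVM": OneClassSVM(kernel="rbf", gamma="scale", nu=0.1),
}

labels = {}
for name, detector in detectors.items():
    # fit_predict returns +1 for points judged inliers and -1 for outliers.
    labels[name] = detector.fit_predict(X)
    n_flagged = int((labels[name] == -1).sum())
    n_hits = int((labels[name][n_inliers:] == -1).sum())
    print(f"{name}: flagged {n_flagged} points, "
          f"{n_hits} of the {n_outliers} injected outliers among them")
```

All three estimators share the fit_predict interface, so they can be swapped behind the same loop. In real data the contamination rate is rarely known, and the scikit-learn documentation notes that One-Class SVM is sensitive to outliers in the training set, so it is often fitted on data believed to be clean and then applied to new observations.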

DOI: 10.26650/B/SS28ET06.2023.006.10 | IUP: 10.26650/B/SS28ET06.2023.006.10

Yüksek Boyutlu Veri Kümeleri İçin Makine Öğreniminde Anomali Saptama Yöntemlerinin İncelenmesi

Şenol Emir

Chapter in: Global Studies on Management Information Systems