CHAPTER


DOI: 10.26650/B/SS28ET06.2023.006.10   IUP: 10.26650/B/SS28ET06.2023.006.10

An Investigation of Anomaly Detection Methods in Machine Learning for High Dimensional Datasets

Şenol Emir

Anomaly detection is the task of identifying observations that differ significantly from the rest of a dataset. Such observations are so inconsistent with the remaining data that they are suspected of having been generated by a different mechanism. Anomalies are rare by nature. Most are caused by sensors or humans, for example measurement or recording errors, but some indicate a significant underlying problem or an unexpected condition. The literature treats the subject under several names, including novelty detection, outlier detection, noise detection, deviation detection, exception mining, and outlier mining. In practice, anomaly detection is applied to fault diagnosis, healthcare informatics and medical diagnostics, fraud detection, intrusion detection, activity monitoring, and novel topic detection in text mining. Visualization and classical statistical methods are sufficient for low-dimensional datasets, whereas numerous machine learning-based methods have been developed for high-dimensional datasets. This study presents three anomaly detection methods in detail to show how their approaches to the problem differ. To that end, it examines the theoretical aspects of the Local Outlier Factor, a density-based method; Isolation Forest, a tree-based ensemble method related to Random Forests; and One-Class Support Vector Machines. Implementation details of these methods in scikit-learn, a popular Python machine learning library, are also given.
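As a rough illustration of how the three methods named in the abstract can be called from scikit-learn, the sketch below fits each detector on the same data and counts the points it flags. It is not the chapter's own code: the synthetic dataset, the 20-dimensional setting, and every hyperparameter value (`n_neighbors`, `contamination`, `nu`, `gamma`) are assumptions chosen only for demonstration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Synthetic data (an assumption for illustration): 200 inliers in 20
# dimensions plus 10 points drawn far from the bulk of the data.
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 20))
X_shifted = rng.normal(loc=6.0, scale=1.0, size=(10, 20))
X = np.vstack([X_inliers, X_shifted])

# One detector per method; all hyperparameter values are illustrative guesses.
detectors = {
    "LOF": LocalOutlierFactor(n_neighbors=20, contamination=0.05),
    "IsolationForest": IsolationForest(
        n_estimators=100, contamination=0.05, random_state=0
    ),
    "OneClassSVM": OneClassSVM(kernel="rbf", gamma="scale", nu=0.05),
}

labels = {}
for name, detector in detectors.items():
    # fit_predict returns +1 for points treated as inliers and -1 for
    # points flagged as anomalies.
    labels[name] = detector.fit_predict(X)
    n_flagged = int((labels[name] == -1).sum())
    print(f"{name}: flagged {n_flagged} of {X.shape[0]} points")
```

All three estimators expose the same `fit_predict` interface, which makes it easy to compare them on one dataset. The number of flagged points depends strongly on `contamination` and `nu`, so these values would normally be tuned to the application.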










Istanbul University Press aims to contribute to the dissemination of ever-growing scientific knowledge by publishing high-quality scientific journals and books in accordance with international publishing standards and ethics. Istanbul University Press follows an open-access, non-commercial, scholarly publishing model.