CHAPTER


DOI: 10.26650/B/ET07.2023.005.15   IUP: 10.26650/B/ET07.2023.005.15

Dataset Balancing with Synthetic Data Generation in Health Studies

Ahmet Fatih Deveci, M. Fevzi Esen

Obtaining and using personal health data raises ethical, bureaucratic and operational difficulties in areas that depend on sensitive health data, such as health care planning, clinical trials, and research and development studies. The cost and time required to obtain data from clinical and field studies, and especially the restrictions surrounding the security of electronic personal health records and personal data privacy, make it necessary to produce synthetic data that is as close as possible to real data. Given the importance of synthetic data generation in light of the growing need for data in the health field, this study compares the performance of the SMOTE, SMOTEENN, BorderlineSMOTE, SMOTETomek and ADASYN methods used in synthetic data production. Two datasets that differ in their numbers of observations and classes were used: one with 15 variables for 390 patients, and one with 16 variables for 19,212 COVID-19 patients. The results indicate that SMOTE is more successful at balancing the dataset with the larger number of observations and classes (multiclass classification), and that it can be used effectively in synthetic data generation compared with hybrid techniques.



In fields that require the use of health data, such as health care planning, clinical trials, and research and development, there are ethical, bureaucratic and operational difficulties in obtaining and using personal health data. Restrictions on the security of electronic personal health records and on personal data privacy, together with the cost and time involved in obtaining data from clinical and field studies, make it necessary to generate artificial data that is as close as possible to real data. Addressing the importance of synthetic data in view of the recently growing need for data in the health field, this study aims to compare the performance of the SMOTE, SMOTEENN, BorderlineSMOTE, SMOTETomek and ADASYN methods used in synthetic data generation. The study used two publicly available datasets that differ in their numbers of observations and classes: one consisting of 15 variables for 390 patients, and one consisting of 16 variables for 19,212 COVID-19 patients. The study concludes that the SMOTE technique is more successful at balancing the dataset with the larger number of observations and classes, and that it can be used effectively in synthetic data generation compared with hybrid techniques.
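As a rough illustration of the oversampling idea behind the compared methods, the following plain-Python sketch shows the core SMOTE step: a synthetic sample is placed at a random point on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is not the authors' implementation. The function names `smote` and `balance`, the default `k=5`, and the toy three-class data are assumptions made only for this example. The other methods named in the abstract build on the same step: SMOTEENN and SMOTETomek follow it with a cleaning pass (Edited Nearest Neighbours and Tomek link removal, respectively), while BorderlineSMOTE and ADASYN change which minority samples are used as starting points. None of those refinements are shown here.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (core SMOTE step)."""
    if n_new <= 0:
        return []
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of the base point within the same class
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


def balance(X, y, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(tuple(row))
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [tuple(row) for row in X], list(y)
    for label, rows in by_class.items():
        new_rows = smote(rows, target - len(rows), k=k, seed=seed)
        X_out.extend(new_rows)
        y_out.extend([label] * len(new_rows))
    return X_out, y_out


# Hypothetical toy data: three classes with 6, 3 and 2 samples.
X = [
    (0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.2, 0.4), (0.4, 0.2),
    (1.0, 1.0), (1.1, 1.2), (1.2, 1.0),
    (2.0, 2.0), (2.2, 2.1),
]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2]

X_bal, y_bal = balance(X, y)
print({label: y_bal.count(label) for label in sorted(set(y_bal))})
```

In this sketch the original samples are kept and only the smaller classes receive synthetic points, so every class ends up with as many samples as the largest one. Because each synthetic point lies between two real samples of the same class, it stays inside the region already covered by that class.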









Istanbul University Press aims to contribute to the dissemination of ever-growing scientific knowledge by publishing high-quality scientific journals and books in accordance with international publishing standards and ethics. Istanbul University Press follows an open-access, non-commercial, scholarly publishing model.