DOI: 10.26650/B/ET06.2020.011.07   IUP: 10.26650/B/ET06.2020.011.07

Data Pre-processing in Text Mining

Tuğçe Aksoy, Serra Çelik, Sevinç Gülseçen

Because any user can now generate data easily and at any time, data mining has grown in importance. The vast majority of available data is unstructured, and text is the most common form of it, which explains the growing interest in text mining and the large number of studies in the field. However, text is an unstructured data type that differs greatly from machine language, so it must be given more structure before a machine can process it. This is where data pre-processing, which accounts for a large part of the entire text mining process, becomes essential. This chapter explains the text pre-processing phase at a basic level, with supporting visuals. It first introduces text mining and describes in detail the characteristics of the data being processed. It then examines the difficulties this data creates and explains the pre-processing steps used to overcome them. The chapter is thus a descriptive review of the data pre-processing phase in text mining, drawing on some of the earlier studies on the subject.




Istanbul University Press aims to contribute to the dissemination of ever-growing scientific knowledge by publishing high-quality scientific journals and books in accordance with international publishing standards and ethics. Istanbul University Press follows an open-access, non-commercial, scholarly publishing model.