Text Document Clustering for Topic Extraction Using Unsupervised Learning

In today's data-driven age, rapid technological advances, the growth of the Internet, powerful data servers, and the sheer volume of online information pose challenges that we encounter daily. The International Data Corporation (IDC) has released a report anticipating 175 zettabytes of data worldwide by 2025 [1]. Such voluminous data accumulate in mainframes, servers, and public cloud environments, and a significant share of them is represented in text format. Various text mining applications have been introduced in the existing literature, including enhancing the query results returned by search engines, unsupervised text organization systems, knowledge discovery processes, and information retrieval services [1]. Many approaches have also been proposed to organize unlabeled text documents for efficient use.

Topic extraction (TE)

can be useful for many real-world applications [2]. For example, by examining recent publications in computer science, areas of growing importance can be identified and their trends and popularity predicted for the foreseeable future. In addition, TE, as a fundamental problem in information retrieval, can help decision makers efficiently detect meaningful topics. It has therefore attracted much attention in areas such as public opinion monitoring, decision support, and emergency management.

However, there is considerable uncertainty about how these topics should be defined and an ongoing debate about how to extract them automatically. Moreover, extracting topics manually is slow, expensive, and error-prone. One of the most commonly used techniques for identifying topics is to cluster the documents into groups that each represent a related subject matter; the most relevant terms are then extracted from each cluster and ranked.
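The cluster-then-rank procedure just described can be sketched in a few lines of Python. This is a toy illustration, not the method of any cited paper: the cluster assignments are assumed to be given (in practice they come from a clustering algorithm), and terms are ranked by raw frequency within each cluster.

```python
from collections import Counter

# Toy corpus already partitioned into two clusters; the assignments
# are assumed here rather than computed by a clustering algorithm.
clusters = {
    0: ["neural networks learn representations",
        "deep neural networks require training data"],
    1: ["stock markets react to interest rates",
        "interest rates influence stock prices"],
}

STOPWORDS = {"to", "the", "a", "of"}  # tiny illustrative stopword list

def top_terms(docs, k=3):
    """Rank a cluster's terms by raw frequency, ignoring stopwords."""
    counts = Counter(
        term
        for doc in docs
        for term in doc.lower().split()
        if term not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(k)]

for label, docs in clusters.items():
    print(label, top_terms(docs))
```

Real systems would replace raw frequency with a weighting such as Tf-Idf and use a proper stopword list, but the shape of the pipeline is the same: partition first, then extract and rank the representative terms of each partition.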

Text documents clustering (TDC)

is one of the most powerful and efficient unsupervised learning techniques in text mining. It generally constitutes the first step of TE, identifying the documents that address a related subject matter. TDC aims to divide documents into groups (also called clusters), where similar documents are placed in the same cluster and dissimilar documents in different clusters. This technique helps construct meaningful partitions of massive amounts of heterogeneous digital documents. Partitional text document clustering (PTDC) is defined by [3] as "the process of partitioning a collection of documents into several sub-collections based on their similarity of contents". TDC has been widely studied in recent years for two main reasons. First, it is infeasible to assign enormous numbers of documents manually in order to extract meaningful information. Second, it avoids personal bias in judging which field or category a document belongs to. According to [4], automatic clustering requires no intervention, even from human experts, which entails that no prior knowledge about the texts is needed (i.e., the class labels of the documents are never consulted).
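The notion of "similar documents in the same cluster" is usually operationalized with a vector-space similarity measure. A minimal sketch in plain Python, assuming raw term-frequency vectors and cosine similarity (a common choice, though not the only one; the example sentences are invented):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency dictionaries."""
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

d1 = Counter("text clustering groups similar text documents".split())
d2 = Counter("clustering assigns similar documents to groups".split())
d3 = Counter("stock prices rose sharply today".split())

print(cosine(d1, d2))  # high: many shared terms
print(cosine(d1, d3))  # 0.0: no shared terms
```

A partitional clustering algorithm such as k-means repeatedly uses a measure like this (or the equivalent distance) to decide which cluster each document vector belongs to.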

Generally, each document in TDC is represented as a vector using the vector space model (VSM). A widely used representation is the bag-of-words model [5], where each distinct term present in the document collection is treated as a feature. A document is thus represented in a multi-dimensional feature space, where the value of each dimension is a weight, e.g., the term frequency-inverse document frequency (Tf-Idf), of the corresponding term within the document. This transformation produces hundreds of thousands of features, both informative and uninformative (i.e., irrelevant, redundant, and noisy).
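As an illustration of the weighting scheme mentioned above, here is a bare-bones Tf-Idf computation on a toy three-document corpus, using the textbook formula tf(t, d) * log(N / df(t)); production systems typically use smoothed or sublinear variants, and the corpus here is made up for the example.

```python
import math

def tf_idf(corpus):
    """Weight each term t in document d by tf(t, d) * log(N / df(t))."""
    n = len(corpus)
    docs = [doc.lower().split() for doc in corpus]
    # Document frequency: number of documents containing each term.
    df = {}
    for d in docs:
        for t in set(d):
            df[t] = df.get(t, 0) + 1
    weights = []
    for d in docs:
        tf = {t: d.count(t) / len(d) for t in set(d)}  # normalized tf
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

corpus = [
    "topic extraction from text",
    "text clustering for topic discovery",
    "weather forecast for tomorrow",
]
w = tf_idf(corpus)
# A term spread across documents ("text") receives a lower weight
# than a rarer term ("extraction") in the same document.
```

Each dictionary in `w` is one row of the document-term matrix; stacking them (with zeros for absent terms) yields exactly the high-dimensional VSM representation discussed in the text.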

The high-dimensional feature space of the VSM is one of the most important challenges in text clustering because it increases computational time while degrading clustering performance [6]. Therefore, a dimension reduction (DR) technique is necessary to remove irrelevant, redundant, and noisy features without sacrificing the performance of the underlying algorithm. Feature selection (FS) techniques are robust DR methods used to determine an optimal subset of informative text features. Filter methods use statistical analysis to evaluate the selected subset of features from the original large set. They are typically less computationally expensive than other methods because they are independent of any learning algorithm and can operate without any foreknowledge of the documents' class labels [7].
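As a minimal example of a filter-style criterion that needs no class labels, terms can be scored by the variance of their frequency across documents: terms that occur uniformly or barely at all carry little clustering signal. This is only one of many filter statistics, and the corpus and cutoff below are invented for illustration.

```python
from collections import Counter

def term_variance(corpus):
    """Score each term by the variance of its raw frequency across
    documents -- a simple unsupervised filter criterion that uses no
    class labels and no learning algorithm."""
    docs = [Counter(doc.lower().split()) for doc in corpus]
    n = len(docs)
    vocab = {t for d in docs for t in d}
    scores = {}
    for t in vocab:
        freqs = [d[t] for d in docs]  # Counter returns 0 if absent
        mean = sum(freqs) / n
        scores[t] = sum((f - mean) ** 2 for f in freqs) / n
    return scores

corpus = [
    "topic topic extraction from text",
    "text clustering for topic discovery",
    "weather forecast for tomorrow",
]
scores = term_variance(corpus)
# Keep the k highest-scoring terms (k = 4 is an arbitrary cutoff here).
selected = sorted(scores, key=scores.get, reverse=True)[:4]
```

Because the score depends only on the term-frequency statistics, the filter runs once before clustering, which is why such methods are cheaper than wrapper approaches that repeatedly invoke the learning algorithm.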


Dr. Ammar Kamal Abasi

Ph.D. in Artificial Intelligence & Software Engineering

Software Engineer and Artificial Intelligence (AI) Researcher, BrainTech Malaysia.


Scopus: https://www.scopus.com/authid/detail.uri?authorId=57208488241

Scholar: https://scholar.google.com/citations?user=oyfzJ90AAAAJ&hl=en


[1] Emrouznejad, A., & Yang, G. L. (2018). A survey and analysis of the first 40 years of scholarly literature in DEA: 1978–2016. Socio-economic planning sciences, 61, 4-8.

[2] Zhang, Y., Zhang, G., Chen, H., Porter, A. L., Zhu, D., & Lu, J. (2016). Topic analysis and forecasting for science, technology and innovation: Methodology with a case study focusing on big data research. Technological Forecasting and Social Change, 105, 179-191.

[3] Bouras, C., & Tsogkas, V. (2012). A clustering technique for news articles using WordNet. Knowledge-Based Systems, 36, 115-128.

[4] Abasi, A. K., Khader, A. T., Al-Betar, M. A., Naim, S., Makhadmeh, S. N., & Alyasseri, Z. A. A. (2020). Link-based multi-verse optimizer for text documents clustering. Applied Soft Computing, 87, 106002.

[5] Abasi, A. K., Khader, A. T., Al-Betar, M. A., Naim, S., Alyasseri, Z. A. A., & Makhadmeh, S. N. (2020). A novel hybrid multi-verse optimizer with K-means for text documents clustering. Neural Computing and Applications, 32(23), 17703-17729.

[6] Abasi, A. K., Khader, A. T., Al-Betar, M. A., Naim, S., Makhadmeh, S. N., & Alyasseri, Z. A. A. (2019, April). A text feature selection technique based on binary multi-verse optimizer for text clustering. In 2019 IEEE Jordan international joint conference on electrical engineering and information technology (JEEIT) (pp. 1-6). IEEE.

[7] Abasi, A. K., Khader, A. T., Al-Betar, M. A., Naim, S., Makhadmeh, S. N., & Alyasseri, Z. A. A. (2021). An improved text feature selection for clustering using binary grey wolf optimizer. In Proceedings of the 11th national technical seminar on unmanned system technology 2019 (pp. 503-516). Springer, Singapore.
