International Journal of Systematic Innovation (IJoSI), Volume 8, Issue 4. DOI: 10.6977/IJoSI.202412_8(4).0007
ARTICLE

Codeswitching: exploring perplexity and coherence metrics for optimizing topic models of historical documents

Muhammad Abdullah Yusof1, Suhaila Saee1
1 Faculty of Computer Science and Information Technology, University Malaysia Sarawak, Sarawak, Malaysia
Submitted: 4 March 2024 | Revised: 30 September 2024 | Accepted: 1 October 2024 | Published: 30 December 2024
© 2024 by the Author(s). Licensee AccScience Publishing, USA. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/).
Abstract

The Latent Dirichlet Allocation (LDA) model has two important hyperparameters: alpha (α), which controls the document-topic distribution, and beta (β), which controls the topic-word distribution. Finding suitable values for both hyperparameters is essential for producing accurate topic clusters, and relying on a single evaluation method to determine the optimal values is insufficient given the size and complexity of the dataset. An experiment was therefore conducted to study the relationship between these hyperparameters and the perplexity and coherence scores, and to establish a baseline for further topic modelling studies. This is the first topic modelling study to focus on the multiple languages present in the Sarawak Gazette data. The study was conducted on LDA using the Gensim package. The results show that while perplexity scores are a good indicator of the model's ability to predict new or hidden data, the word clusters within a topic do not always reflect the similarity or relationships between words, which compromises topic interpretation. The lowest perplexity score was observed when α was set to 5 and β to 0.4. The coherence evaluation indicated the optimal number of topics for each set of hyperparameter values, although its relationship with hidden words remains unclear; the coherence score was highest when the number of topics was 5 and 4. In conclusion, perplexity scores are effective indicators of word prediction accuracy for each hyperparameter setting, while coherence captures the optimal number of topics needed to produce a highly coherent word cluster within each topic. Combining both evaluation methods ensures optimal results, producing topics that are both accurate and interpretable.
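The kind of setup described above can be sketched in a few lines of Gensim. The example below is illustrative only: the toy token lists stand in for the preprocessed, code-switched Sarawak Gazette documents (not shown here), and the values α = 5, β = 0.4 and 5 topics are simply the settings cited in the abstract. Note that Gensim exposes the β prior under the keyword `eta`.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

# Hypothetical tokenised documents standing in for the preprocessed
# Sarawak Gazette corpus (not the actual data).
docs = [
    ["padi", "harvest", "kampung", "river", "trade"],
    ["trade", "sarawak", "river", "boat", "market"],
    ["colonial", "officer", "report", "kampung", "census"],
    ["market", "boat", "harvest", "report", "river"],
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# LDA with the hyperparameter setting reported in the abstract:
# alpha (α) is the document-topic prior, eta (β) the topic-word prior.
lda = LdaModel(
    corpus=corpus,
    id2word=dictionary,
    num_topics=5,
    alpha=5.0,
    eta=0.4,
    passes=10,
    random_state=42,
)

# Perplexity: Gensim returns the per-word likelihood bound; its own
# logging reports perplexity as 2**(-bound). Lower is better.
bound = lda.log_perplexity(corpus)
print("perplexity:", np.exp2(-bound))

# Topic coherence (c_v) over the top words of each topic; higher is better.
cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary, coherence="c_v")
print("c_v coherence:", cm.get_coherence())
```

In practice both scores would be computed over a grid of α, β and topic-count values, and the combination that balances low perplexity with high coherence retained, which is the strategy the abstract argues for.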

Keywords
Hyperparameter
Latent Dirichlet Allocation
perplexity
topic coherence
topic modelling