Home|Journals|Articles by Year|Audio Abstracts
 

Original Article

JJCIT. 2022; 8(1): 1-17


CLUSTERING VIETNAMESE CONVERSATIONS FROM FACEBOOK PAGE TO BUILD TRAINING DATASET FOR CHATBOT

Trieu Hai Nguyen, Thi-Kim-Ngoan Pham, Thi-Hong-Minh Bui, Thanh-Quynh-Chau Nguyen.




Abstract

The biggest challenge of building chatbots is training data. The required data must be realistic and large enough to train chatbots. We create a tool to get actual training data from Facebook messenger of a Facebook page. After text preprocessing steps, the newly obtained dataset generates FVnC and Sample dataset. We use the Retraining of BERT for Vietnamese (PhoBERT) to extract features of our text data. K-Means and DBSCAN clustering algorithms are used for clustering tasks based on output embeddings from PhoBERT$_{base}$. We apply V-measure score and Silhouette score to evaluate the performance of clustering algorithms. We also demonstrate the efficiency of PhoBERT compared to other models in feature extraction on the Sample dataset and wiki dataset. A GridSearch algorithm that combines both clustering evaluations is also proposed to find optimal parameters. Thanks to clustering such a number of conversations, we save a lot of time and effort to build data and storylines for training chatbot.

Key words: BERT, Clustering, Language models, Feature Extraction, Word Embeddings.






Full-text options


Share this Article


Online Article Submission
• ejmanager.com




ejPort - eJManager.com
Refer & Earn
JournalList
About BiblioMed
License Information
Terms & Conditions
Privacy Policy
Contact Us

The articles in Bibliomed are open access articles licensed under Creative Commons Attribution 4.0 International License (CC BY), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.