TY - GEN
T1 - Evaluating LLM-Generated Topics from Survey Responses
T2 - 2024 IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC 2024
AU - Tamime, Reham Al
AU - Salminen, Joni
AU - Jung, Soon Gyo
AU - Jansen, Bernard
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - The evolution of generative artificial intelligence (AI) technologies, particularly large language models (LLMs), has led to consequences for the field of Human-Computer Interaction (HCI) in areas such as personalization, predictive analytics, automation, and data analysis. This research aims to evaluate LLM-generated topics derived from survey responses in comparison with topics suggested by humans, specifically participants recruited through a crowdsourcing experiment. We present evaluation results comparing LLM-generated topics with human-generated topics in terms of Quality, Usefulness, Accuracy, Interestingness, and Completeness. The study involves three stages: (1) Design and Generate Topics with an LLM (OpenAI's GPT-4); (2) Crowdsourcing Human-Generated Topics; and (3) Evaluation of Human-Generated Topics and LLM-Generated Topics. However, a feasibility study with 33 crowdworkers indicated challenges in using participants for LLM evaluation, particularly in inviting human participants to suggest topics based on open-ended survey answers. We highlight several challenges in recruiting crowdworkers to generate topics from survey responses. We recommend using well-trained human experts rather than crowdsourcing to generate human baselines for LLM evaluation.
AB - The evolution of generative artificial intelligence (AI) technologies, particularly large language models (LLMs), has led to consequences for the field of Human-Computer Interaction (HCI) in areas such as personalization, predictive analytics, automation, and data analysis. This research aims to evaluate LLM-generated topics derived from survey responses in comparison with topics suggested by humans, specifically participants recruited through a crowdsourcing experiment. We present evaluation results comparing LLM-generated topics with human-generated topics in terms of Quality, Usefulness, Accuracy, Interestingness, and Completeness. The study involves three stages: (1) Design and Generate Topics with an LLM (OpenAI's GPT-4); (2) Crowdsourcing Human-Generated Topics; and (3) Evaluation of Human-Generated Topics and LLM-Generated Topics. However, a feasibility study with 33 crowdworkers indicated challenges in using participants for LLM evaluation, particularly in inviting human participants to suggest topics based on open-ended survey answers. We highlight several challenges in recruiting crowdworkers to generate topics from survey responses. We recommend using well-trained human experts rather than crowdsourcing to generate human baselines for LLM evaluation.
KW - Challenges in Recruitment
KW - Crowdsourcing for Human-centric Computing
KW - Feasibility Study
KW - LLM Evaluation
UR - http://www.scopus.com/inward/record.url?scp=85207852861&partnerID=8YFLogxK
U2 - 10.1109/VL/HCC60511.2024.00064
DO - 10.1109/VL/HCC60511.2024.00064
M3 - Conference contribution
AN - SCOPUS:85207852861
T3 - Proceedings of IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC
SP - 412
EP - 416
BT - Proceedings - 2024 IEEE Symposium on Visual Languages and Human-Centric Computing, VL/HCC 2024
PB - IEEE Computer Society
Y2 - 2 September 2024 through 6 September 2024
ER -