A Method for Synthesizing Ontology-Based Textual Design Datasets: Evaluating the Potential of Large Language Model in Domain-Specific Dataset Generation
Source: Journal of Mechanical Design, 2025, volume 147, issue 4, page 41707-1
DOI: 10.1115/1.4067478
Publisher: The American Society of Mechanical Engineers (ASME)
Abstract: In engineering disciplines, leveraging generative language models requires using specialized datasets for training or fine-tuning the preexisting models. Compiling these domain-specific datasets is a complex endeavor, demanding significant human effort and resources. To address the problem of domain-specific dataset scarcity, this study investigates the potential of generative large language models (LLMs) in creating synthetic domain-specific textual datasets for engineering design domains. By harnessing the advanced capabilities of LLMs, such as GPT-4, a systematic methodology was developed to create high-fidelity datasets using designed prompts, evaluated against a manually labeled benchmark dataset through various computational measurements without human intervention. Findings suggest that well-designed prompts can significantly enhance the quality of domain-specific synthetic datasets with reduced manual effort. The research highlights the importance of prompt design in eliciting precise, domain-relevant information and discusses the balance between dataset robustness and richness. It is demonstrated that a language model trained on synthetic datasets can achieve a level of performance comparable to that of human-labeled, domain-specific datasets in terms of quality, offering a strategic solution to the limitations imposed by dataset shortages in engineering domains. The implications for design thinking processes are particularly noteworthy, with the potential to assist designers through GPT-4's structured reasoning capabilities. This work presents a complete guide for domain-specific dataset generation, automated evaluation metrics, and insights into the interplay between data robustness and comprehensiveness.
contributor author | Qiu, Yunjian
contributor author | Jin, Yan
date accessioned | 2025-04-21T10:03:35Z
date available | 2025-04-21T10:03:35Z
date copyright | 2025-01-29
date issued | 2025
identifier issn | 1050-0472
identifier other | md_147_4_041707.pdf
identifier uri | http://yetl.yabesh.ir/yetl1/handle/yetl/4305404
publisher | The American Society of Mechanical Engineers (ASME)
title | A Method for Synthesizing Ontology-Based Textual Design Datasets: Evaluating the Potential of Large Language Model in Domain-Specific Dataset Generation
type | Journal Paper
journal volume | 147
journal issue | 4
journal title | Journal of Mechanical Design
identifier doi | 10.1115/1.4067478
journal first page | 41707-1
journal last page | 41707-14
pages | 14
tree | Journal of Mechanical Design; 2025; volume 147; issue 4
contenttype | Fulltext