A Method for Synthesizing Ontology-Based Textual Design Datasets: Evaluating the Potential of Large Language Model in Domain-Specific Dataset Generation
Source: Journal of Mechanical Design, 2025, volume 147, issue 4, page 41707-1
DOI: 10.1115/1.4067478
Publisher: The American Society of Mechanical Engineers (ASME)
Abstract: In engineering disciplines, leveraging generative language models requires using specialized datasets for training or fine-tuning the preexisting models. Compiling these domain-specific datasets is a complex endeavor, demanding significant human effort and resources. To address the problem of domain-specific dataset scarcity, this study investigates the potential of generative large language models (LLMs) in creating synthetic domain-specific textual datasets for engineering design domains. By harnessing the advanced capabilities of LLMs, such as GPT-4, a systematic methodology was developed to create high-fidelity datasets using designed prompts, evaluated against a manually labeled benchmark dataset through various computational measurements without human intervention. Findings suggest that well-designed prompts can significantly enhance the quality of domain-specific synthetic datasets with reduced manual effort. The research highlights the importance of prompt design in eliciting precise, domain-relevant information and discusses the balance between dataset robustness and richness. It is demonstrated that a language model trained on synthetic datasets can achieve a level of performance comparable to that of human-labeled, domain-specific datasets in terms of quality, offering a strategic solution to the limitations imposed by dataset shortages in engineering domains. The implications for design thinking processes are particularly noteworthy, with the potential to assist designers through GPT-4's structured reasoning capabilities. This work presents a complete guide for domain-specific dataset generation, automated evaluation metrics, and insights into the interplay between data robustness and comprehensiveness.
contributor author | Qiu, Yunjian
contributor author | Jin, Yan
date accessioned | 2025-04-21T10:03:35Z
date available | 2025-04-21T10:03:35Z
date copyright | 2025-01-29
date issued | 2025
identifier issn | 1050-0472
identifier other | md_147_4_041707.pdf
identifier uri | http://yetl.yabesh.ir/yetl1/handle/yetl/4305404
publisher | The American Society of Mechanical Engineers (ASME)
title | A Method for Synthesizing Ontology-Based Textual Design Datasets: Evaluating the Potential of Large Language Model in Domain-Specific Dataset Generation
type | Journal Paper
journal volume | 147
journal issue | 4
journal title | Journal of Mechanical Design
identifier doi | 10.1115/1.4067478
journal first page | 41707-1
journal last page | 41707-14
pages | 14
tree | Journal of Mechanical Design; 2025; volume 147; issue 4
contenttype | Fulltext