Evaluating Large Language Models in Exercises of UML Class Diagram Modeling

The goal of this work is to evaluate the capability of LLM agents to correctly generate UML class diagrams in Requirements Modeling activities in the field of Software Engineering. Our aim is to evaluate LLMs in an educational setting, i.e., to understand how the results produced by LLMs compare to those produced by human actors, and how useful LLMs can be for generating sample solutions to provide to students.

For that purpose, we collected 20 exercises from a diverse set of web sources and compared the models generated by a human solver and an LLM solver in terms of syntactic, semantic, and pragmatic quality, as well as distance from a provided reference solution. Our results show that the solutions generated by the LLM solver typically present a significantly higher number of semantic errors and a larger textual distance from the provided reference solution, while no significant difference is found in syntactic and pragmatic quality.
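The excerpt does not specify how the textual distance from the reference solution is computed. The following is a minimal sketch, assuming the diagrams are serialized as PlantUML text and using a normalized similarity ratio from Python's standard library as a stand-in metric; the `textual_distance` function and the two example diagrams are illustrative, not the paper's artifacts.

```python
# Minimal sketch of a textual-distance comparison between a generated UML
# model and a reference solution. Assumption: diagrams are serialized as
# PlantUML text; difflib's similarity ratio stands in for the paper's
# (unspecified) distance metric.
from difflib import SequenceMatcher

def textual_distance(generated: str, reference: str) -> float:
    """Return a distance in [0, 1]: 0 = identical text, 1 = no overlap."""
    similarity = SequenceMatcher(None, generated, reference).ratio()
    return 1.0 - similarity

# Hypothetical diagrams for illustration only.
generated_uml = """\
@startuml
class Order {
  +total: float
}
Customer "1" -- "*" Order
@enduml
"""

reference_uml = """\
@startuml
class Order {
  +totalAmount: float
  +placedAt: Date
}
Customer "1" -- "0..*" Order
@enduml
"""

print(f"distance = {textual_distance(generated_uml, reference_uml):.3f}")
```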

3.2 Research Questions and Metrics

We define the following set of research questions to frame the experiment design:

• RQ1: Does the use of LLM agents have an impact on the quality and correctness of UML Class Diagrams generated from natural language requirements?

• RQ2: What aspects of the natural language requirements have an impact on the quality of generated UML Class Diagrams?

To measure the quality of the generated class diagrams, we focus on the quality aspects defined by Bolloju et al. [2] (a small error-tallying sketch follows these lists):

• Syntactic Quality: assesses whether the class diagrams follow the syntactic structure of UML Class Diagrams. The rules verified for syntactic quality are:

– Missing cardinality details;
– Inappropriate naming of classes and associations;
– Incorrect use of UML symbols.

• Semantic Quality: evaluates the accuracy and completeness of the diagrams in representing the intended domain. It is measured in terms of: (i) Validity: all elements and relationships in the diagrams should accurately represent the domain; (ii) Completeness: the diagrams should include all necessary elements and relationships. The rules verified for semantic quality are:

– Incorrect cardinality;
– Aggregation in place of association;
– Wrong location of attributes or operations;
– Operations that cannot be realized using existing attributes and relationships.

• Pragmatic Quality: focuses on the understandability of the diagrams from the perspective of stakeholders. The rules verified for pragmatic quality are:

– Redundant attributes and associations;
– Specialization with no distinction among subclasses;
– Inconsistency in styling and conventions.
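To make the checklist concrete, here is a hypothetical sketch of how per-dimension error counts could be tallied once a reviewer has annotated a diagram with violated rules. The rule identifiers and the `QUALITY_DIMENSIONS` mapping are our own naming; the annotations themselves are assumed to come from manual inspection, as in the paper.

```python
# Hypothetical tallying of annotated rule violations per quality dimension.
# Rule identifiers mirror the checklist above; names are illustrative.
from collections import Counter

QUALITY_DIMENSIONS = {
    "syntactic": {
        "missing_cardinality",
        "inappropriate_naming",
        "incorrect_uml_symbols",
    },
    "semantic": {
        "incorrect_cardinality",
        "aggregation_for_association",
        "misplaced_attribute_or_operation",
        "unrealizable_operation",
    },
    "pragmatic": {
        "redundant_attributes_or_associations",
        "indistinct_specialization",
        "inconsistent_styling",
    },
}

def tally_errors(violations: list[str]) -> Counter:
    """Map a flat list of annotated rule violations to per-dimension counts."""
    counts = Counter()
    for rule in violations:
        for dimension, rules in QUALITY_DIMENSIONS.items():
            if rule in rules:
                counts[dimension] += 1
    return counts

# Example annotation for one diagram (hypothetical):
print(tally_errors(["incorrect_cardinality", "missing_cardinality",
                    "incorrect_cardinality", "inconsistent_styling"]))
# Counter({'semantic': 2, 'syntactic': 1, 'pragmatic': 1})
```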

The highest difference is observed for Semantic Quality (4.85 errors on average for the LLM solver against 1.75 for the human solver), whereas the smallest difference is measured for Syntactic Quality (0.9 errors against 0.5). Regarding syntactic and pragmatic quality, it is worth underlining that, while having a higher average number of errors, the LLM agent made at most 2 and 3 errors, respectively, against a maximum of 4 errors for both qualities for the human agent.

The number of attributes and operations (methods) in the reference solution, instead, has a negative correlation with the number of semantic errors and with the distance from the solution for human agents. Our interpretation of this result is twofold: first, human agents may have a tendency to capture many attributes from the textual documentation; second, the reference solutions may be sketched with a reduced number of attributes inside classes. The combination of these two attitudes inevitably leads to an increase in the distance from the reference solution.
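The excerpt does not name the correlation test used. The sketch below uses Spearman's rank correlation, a common choice for small samples such as the 20 exercises here; this is an assumption, and the data arrays are hypothetical placeholders, not the paper's measurements.

```python
# Sketch of the correlation analysis described above. Assumption: Spearman's
# rank correlation; the paper's actual test is not stated in this excerpt.
# All values below are hypothetical placeholders.
from scipy.stats import spearmanr

reference_attr_counts = [4, 7, 12, 5, 9, 15, 6, 8]   # attributes + operations
human_semantic_errors = [3, 2, 1, 3, 2, 0, 2, 1]     # per-diagram error counts

rho, p_value = spearmanr(reference_attr_counts, human_semantic_errors)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")  # rho < 0: negative correlation
```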

The results of this paper highlight that LLM agents still achieve worse results overall than human solvers when generating UML diagrams from requirements. The difference is, however, limited for several of the evaluated quality and correctness aspects, hinting at a possible use of LLM agents for generating sample solutions for students or preliminary diagram sketches to accompany requirements in software projects. All the results in this study were gathered after a single-pass interaction with the LLM agents, so there is considerable room for improving the results provided by such agents.

As future work, we envision evolving the basic, static prompts used to generate the diagrams and exploring architectures of cooperating LLM agents that iteratively improve a generated diagram.
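As a rough illustration of that future direction, the following is a hypothetical generator-critic loop. `call_llm` is a placeholder for any chat-completion client, and the prompts and stopping rule are assumptions rather than the paper's design.

```python
# Hypothetical sketch of the cooperating-agents idea mentioned as future
# work: a generator agent drafts a diagram, a critic agent reviews it, and
# the loop iterates until the critic finds no issues or a round limit hits.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def iterative_modeling(requirements: str, max_rounds: int = 3) -> str:
    # Initial single-pass generation, as in the study's setup.
    diagram = call_llm(
        f"Generate a UML class diagram in PlantUML for:\n{requirements}"
    )
    for _ in range(max_rounds):
        critique = call_llm(
            "Review this UML class diagram for syntactic, semantic, and "
            f"pragmatic issues. Reply DONE if none.\n{diagram}"
        )
        if critique.strip() == "DONE":
            break
        diagram = call_llm(
            f"Revise the diagram to address these issues:\n{critique}\n\n"
            f"Current diagram:\n{diagram}"
        )
    return diagram
```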