AI-Driven Code Documentation: Comparative Evaluation of LLMs for Commit Message Generation

Mohamed Mehdi Trigui; Wasfi G. Al-Khatib; Mohammad Amro; Fatma Mallouli

doi:10.3390/computers15020087

Mohamed Mehdi Trigui*, Wasfi G. Al-Khatib, Mohammad Amro*, Fatma Mallouli

*Corresponding author for this work

Computer Center

King Fahd University of Petroleum and Minerals

Research output: Contribution to journal › Article › peer-review

Abstract

Commit messages are essential for understanding software evolution and maintaining traceability of projects; however, their quality varies across repositories. Recent Large Language Models provide a promising path to automate this task by generating concise context-sensitive commit messages directly from code diffs. This paper provides a comparative study of three paradigms of large language models: zero-shot prompting, retrieval-augmented generation, and fine-tuning, using the large-scale CommitBench dataset that spans six programming languages. We assess the performance of the models with automatic metrics, namely BLEU, ROUGE-L, METEOR, and Adequacy, and a human assessment of 100 commits. In the latter, experienced developers rated each generated commit message for Adequacy and Fluency on a five-point Likert scale. The results show that fine-tuning and domain adaptation yield models that perform consistently better than general-purpose baselines across all evaluation metrics, thus generating commit messages with higher semantic adequacy and clearer phrasing than zero-shot approaches. The correlation analysis suggests that the Adequacy and BLEU scores are closer to human judgment, while ROUGE-L and METEOR tend to underestimate the quality in cases where the models generate stylistically diverse or paraphrased outputs. Finally, the study outlines a conceptual integration pathway for incorporating such models into software development workflows, emphasizing a human-in-the-loop approach for quality assurance.

Original language	English
Article number	87
Journal	Computers
Volume	15
Issue number	2
DOIs	https://doi.org/10.3390/computers15020087
State	Published - Feb 2026

Keywords

automatic and human evaluation
commit message generation
CommitBench
large language models
retrieval-augmented generation
transformer-based models

Cite this

@article{c24f595054c74fd49735d2acadd99965,

title = "AI-Driven Code Documentation: Comparative Evaluation of LLMs for Commit Message Generation",

abstract = "Commit messages are essential for understanding software evolution and maintaining traceability of projects; however, their quality varies across repositories. Recent Large Language Models provide a promising path to automate this task by generating concise context-sensitive commit messages directly from code diffs. This paper provides a comparative study of three paradigms of large language models: zero-shot prompting, retrieval-augmented generation, and fine-tuning, using the large-scale CommitBench dataset that spans six programming languages. We assess the performance of the models with automatic metrics, namely BLEU, ROUGE-L, METEOR, and Adequacy, and a human assessment of 100 commits. In the latter, experienced developers rated each generated commit message for Adequacy and Fluency on a five-point Likert scale. The results show that fine-tuning and domain adaptation yield models that perform consistently better than general-purpose baselines across all evaluation metrics, thus generating commit messages with higher semantic adequacy and clearer phrasing than zero-shot approaches. The correlation analysis suggests that the Adequacy and BLEU scores are closer to human judgment, while ROUGE-L and METEOR tend to underestimate the quality in cases where the models generate stylistically diverse or paraphrased outputs. Finally, the study outlines a conceptual integration pathway for incorporating such models into software development workflows, emphasizing a human-in-the-loop approach for quality assurance.",

keywords = "automatic and human evaluation, commit message generation, CommitBench, large language models, retrieval-augmented generation, transformer-based models",

author = "Trigui, Mohamed Mehdi and Al-Khatib, Wasfi G. and Amro, Mohammad and Mallouli, Fatma",

note = "Publisher Copyright: {\textcopyright} 2026 by the authors.",

year = "2026",

month = feb,

doi = "10.3390/computers15020087",

language = "English",

volume = "15",

journal = "Computers",

issn = "2073-431X",

number = "2",

}

TY - JOUR

T1 - AI-Driven Code Documentation: Comparative Evaluation of LLMs for Commit Message Generation

AU - Trigui, Mohamed Mehdi

AU - Al-Khatib, Wasfi G.

AU - Amro, Mohammad

AU - Mallouli, Fatma

PY - 2026/2

Y1 - 2026/2

N2 - Commit messages are essential for understanding software evolution and maintaining traceability of projects; however, their quality varies across repositories. Recent Large Language Models provide a promising path to automate this task by generating concise context-sensitive commit messages directly from code diffs. This paper provides a comparative study of three paradigms of large language models: zero-shot prompting, retrieval-augmented generation, and fine-tuning, using the large-scale CommitBench dataset that spans six programming languages. We assess the performance of the models with automatic metrics, namely BLEU, ROUGE-L, METEOR, and Adequacy, and a human assessment of 100 commits. In the latter, experienced developers rated each generated commit message for Adequacy and Fluency on a five-point Likert scale. The results show that fine-tuning and domain adaptation yield models that perform consistently better than general-purpose baselines across all evaluation metrics, thus generating commit messages with higher semantic adequacy and clearer phrasing than zero-shot approaches. The correlation analysis suggests that the Adequacy and BLEU scores are closer to human judgment, while ROUGE-L and METEOR tend to underestimate the quality in cases where the models generate stylistically diverse or paraphrased outputs. Finally, the study outlines a conceptual integration pathway for incorporating such models into software development workflows, emphasizing a human-in-the-loop approach for quality assurance.

KW - automatic and human evaluation

KW - commit message generation

KW - CommitBench

KW - large language models

KW - retrieval-augmented generation

KW - transformer-based models

UR - https://www.scopus.com/pages/publications/105031406258

U2 - 10.3390/computers15020087

DO - 10.3390/computers15020087

M3 - Article

AN - SCOPUS:105031406258

SN - 2073-431X

VL - 15

JO - Computers

JF - Computers

IS - 2

M1 - 87

ER -
