
A Multi-Objective Statistical Framework for Evaluating LLM-Based Code Modernization: Transformation Pattern Analysis and Effect Size Validation

Research output: Contribution to journal; Article; peer-reviewed

Abstract

Automated legacy code modernization using Large Language Models (LLMs) lacks rigorous evaluation frameworks and multi-objective quality assessment methodologies. Existing research suffers from three critical deficiencies: single-metric evaluation paradigms that create pathological optimization incentives, statistical validation limited to p-values without effect size analysis, and the absence of systematic transformation pattern taxonomies explaining what works and why. We present a novel multi-objective statistical framework that jointly assesses Cyclomatic Complexity (CC) and Maintainability Index (MI) while providing comprehensive effect size analysis, addressing gaps in software engineering research. Applied to 47 legacy Java samples from Apache Ant (version 1.10.x, commit rel/1.10.14), our framework achieves 97.9% metric-level improvement with very large practical effects (Cohen’s (Formula presented.), 95% CI [1.36, 2.35], (Formula presented.)) for maintainability, substantially exceeding prior work and conventional significance thresholds. We note that this success rate reflects quality metric improvement; functional equivalence was verified through syntactic validation and manual inspection of a 20% random sample, while comprehensive automated test-based verification remains a limitation to be addressed in future work. We contribute: (1) the first multi-objective quality assessment framework for code modernization, with weighted composite scoring and sensitivity analysis, (2) a rigorous statistical methodology with effect size analysis beyond p-values, (3) a systematic transformation pattern taxonomy identifying four successful patterns and three failure modes with predictive value (inter-rater agreement (Formula presented.)), and (4) a negative result showing that iterative refinement provides no benefit ((Formula presented.), (Formula presented.)), sparing the community wasted effort.
Our transformation taxonomy enables practitioners to predict the likelihood of success from code characteristics, while our statistical framework provides a replicable methodology for evaluating LLM-based software engineering tools. The very large effect size indicates that the metric-level improvements are materially meaningful for real-world software maintenance, not merely statistically detectable.
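
To illustrate what effect size analysis beyond p-values can look like in practice, the following is a minimal Python sketch that computes Cohen's d for paired before/after metric values together with a percentile bootstrap confidence interval. The abstract does not state which variant of Cohen's d or which interval method the authors used, so the paired formulation, the bootstrap, the function names `cohens_d_paired` and `bootstrap_ci`, and the sample values are all assumptions made for illustration only.

```python
import random
import statistics


def cohens_d_paired(before, after):
    """Cohen's d for paired samples: mean of differences / SD of differences.

    Assumption: a paired (within-sample) design; the article may use a
    different variant of Cohen's d.
    """
    diffs = [a - b for a, b in zip(after, before)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def bootstrap_ci(before, after, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the paired Cohen's d."""
    rng = random.Random(seed)
    pairs = list(zip(before, after))
    estimates = []
    for _ in range(n_resamples):
        # Resample (before, after) pairs with replacement.
        sample = [rng.choice(pairs) for _ in pairs]
        b, a = zip(*sample)
        try:
            estimates.append(cohens_d_paired(b, a))
        except (statistics.StatisticsError, ZeroDivisionError):
            # Skip degenerate resamples whose differences have zero variance.
            continue
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[int((1 - alpha / 2) * len(estimates)) - 1]
    return lower, upper


# Hypothetical Maintainability Index values, NOT data from the article.
mi_before = [60, 62, 58, 65, 61]
mi_after = [70, 75, 66, 78, 70]

d = cohens_d_paired(mi_before, mi_after)
ci_low, ci_high = bootstrap_ci(mi_before, mi_after)
print(f"Cohen's d = {d:.2f}, 95% bootstrap CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

Reporting the interval alongside the point estimate shows how precisely the effect is estimated, which a p-value alone does not convey.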

Original language: English
Article number: 148
Journal: Computers
Volume: 15
Issue number: 3
State: Published - Mar 2026

Keywords

  • automated software engineering
  • code quality metrics
  • effect size analysis
  • large language models
  • legacy modernization
  • multi-objective optimization
  • statistical validation
  • transformation patterns
