Enhanced Neighborhood Metric for Spreadsheet Fault Prediction

Abstract

Purpose: Spreadsheets are widely used in business and scientific domains, yet they are prone to input errors that can lead to significant risks. Faults often occur due to the use of formulas that are syntactically correct but semantically incorrect. This issue is particularly challenging for formula cells that are physically close and exhibit minor logical differences, which traditional fault prediction methods struggle to detect.

Methods: To address these challenges, this paper introduces an enhanced neighborhood metric approach, which extends traditional formula-based metrics by incorporating neighborhood-based metrics. This approach analyzes the dependencies between adjacent formula cells, considering factors such as formula diversity, content dissimilarity, and structural consistency. Building upon the metric framework of Koch et al. (2019), this study introduces eight additional neighborhood-based spreadsheet metrics to enhance fault prediction.

Results: Extensive experiments conducted on three widely used datasets—Enron, INFO1, and EUSES—demonstrated that integrating the enhanced neighborhood metrics with traditional ones significantly improves fault prediction performance. The approach shows notable improvements in precision, recall, and F1 scores, particularly for medium and large datasets.

Conclusion: This study highlights the importance of incorporating neighborhood metrics for spreadsheet fault detection. The enhanced neighborhood metric approach improves fault detection accuracy by capturing subtle logical variations between formula cells that are physically close. This method offers a robust and effective framework for improving the reliability of spreadsheets and can be applied in various real-world data analysis tasks.

Datasets

The experiments were conducted on three public spreadsheet corpora. Each is available for download below, as originally published.

Dataset Enron: [Download Dataset Enron (ZIP)] [Download Dataset Enron (overview file)]
Dataset INFO1: [Download Dataset INFO1 (ZIP)]
Dataset EUSES: [Download Dataset EUSES (ZIP)]

Research Questions

RQ1: Do spreadsheet formula cells typically have neighboring formulas that share the same logical structure?
RQ2: Do the proposed ENM (enhanced neighborhood metric) metrics provide distinct structural information beyond the original 64 baseline metrics?
RQ3: Can incorporating ENM metrics into predictive models significantly enhance the detection of faulty formula cells?
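
To make the idea behind RQ1 concrete, the following is a minimal Python sketch of one way to check whether a formula cell shares its logical structure with its neighbors. It is an illustration only, not one of the ENM metrics defined in the paper. The function names (`to_relative`, `same_structure_ratio`), the choice of the four directly adjacent cells as the neighborhood, and the simplified reference parser (plain A1 references only, no sheet names, string literals not protected) are all assumptions made for this example.

```python
import re

# Assumption: only simple A1-style references (B2, $C$10) are handled.
# The lookarounds avoid matching function names such as LOG10(...).
REF = re.compile(r"(?<![A-Z0-9_])(\$?)([A-Z]{1,3})(\$?)(\d+)(?![A-Z0-9_(])")


def col_to_num(col_name):
    """Convert a column label (A, B, ..., AA) to a 1-based index."""
    n = 0
    for ch in col_name:
        n = n * 26 + ord(ch) - ord("A") + 1
    return n


def to_relative(formula, row, col):
    """Rewrite A1 references as offsets from the cell at (row, col).

    Copied formulas such as =B1*C1 and =B2*C2 then compare equal.
    """
    def repl(match):
        col_abs, col_name, row_abs, row_num = match.groups()
        c, r = col_to_num(col_name), int(row_num)
        r_part = f"R{r}" if row_abs else f"R[{r - row}]"
        c_part = f"C{c}" if col_abs else f"C[{c - col}]"
        return r_part + c_part

    return REF.sub(repl, formula.upper())


def same_structure_ratio(cells, row, col):
    """Share of adjacent formula cells with the same relative form.

    `cells` maps (row, col) to a formula string. Returns None when the
    cell has no neighboring formula cells.
    """
    own = to_relative(cells[(row, col)], row, col)
    offsets = ((-1, 0), (1, 0), (0, -1), (0, 1))  # assumed 4-neighborhood
    neighbors = [(row + dr, col + dc) for dr, dc in offsets]
    forms = [to_relative(cells[p], *p) for p in neighbors if p in cells]
    if not forms:
        return None
    return sum(form == own for form in forms) / len(forms)


# Hypothetical column D: the formula in D3 uses "+" where its neighbors use "*".
cells = {
    (1, 4): "=B1*C1",
    (2, 4): "=B2*C2",
    (3, 4): "=B3+C3",
    (4, 4): "=B4*C4",
}
print(same_structure_ratio(cells, 2, 4))  # 0.5: D1 matches, D3 does not
print(same_structure_ratio(cells, 3, 4))  # 0.0: neither neighbor matches
```

A low ratio, as for D3 here, marks a cell whose formula deviates from the formulas around it. This is the kind of subtle local difference that the abstract describes as hard for purely formula-based metrics to capture.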

Cite

Name: Enhanced neighborhood metric for spreadsheet fault prediction
Visit: https://link.springer.com/article/10.1007/s10515-025-00552-2
Description: Adds neighborhood-based metrics on top of traditional formula metrics to improve the prediction of faulty formulas in spreadsheets.
DOI: https://doi.org/10.1007/s10515-025-00552-2
Cite:
- Sun H, Wang Y, Yu H, et al. Enhanced neighborhood metric for spreadsheet fault prediction[J]. Automated Software Engineering, 2026, 33(1): 1-37.
- Sun, H., Wang, Y., Yu, H., & Zhu, Z. (2026). Enhanced neighborhood metric for spreadsheet fault prediction. Automated Software Engineering, 33(1), 1-37.
- @article{sun2026enhanced, title={Enhanced neighborhood metric for spreadsheet fault prediction}, author={Sun, Haitao and Wang, Ying and Yu, Hai and Zhu, Zhiliang}, journal={Automated Software Engineering}, volume={33}, number={1}, pages={1--37}, year={2026}, publisher={Springer} }

You can access the repository here: https://github.com/dangerwolf/ENM.