Haitao Sun Enhanced Neighborhood Metric for Spreadsheet Fault Prediction

This page provides the supplementary material for our paper titled 'Enhanced Neighborhood Metric for Spreadsheet Fault Prediction'.

Enhanced Neighborhood Metric for Spreadsheet Fault Prediction

Haitao Sun (1), Ying Wang (1, 2), Hai Yu (1, 2), Zhiling Zhu (1, 2)

(1) Software College, Northeastern University
(2) National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Northeastern University


Abstract

Purpose: Spreadsheets are widely used in business and scientific domains, yet they are prone to input errors that can lead to significant risks. Faults often arise from formulas that are syntactically correct but semantically incorrect. Such faults are particularly hard to find when formula cells are physically close and differ only slightly in their logic, a case that traditional fault prediction methods struggle to detect.

Methods: To address these challenges, this paper introduces an enhanced neighborhood metric approach, which extends traditional formula-based metrics by incorporating neighborhood-based metrics. This approach analyzes the dependencies between adjacent formula cells, considering factors such as formula diversity, content dissimilarity, and structural consistency. Building upon the metric framework of Koch et al. (2019), this study introduces eight additional neighborhood-based spreadsheet metrics to enhance fault prediction.
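To make the idea of a neighborhood-based metric concrete, the following is a minimal sketch, not one of the eight metrics defined in the paper. It assumes formulas are stored in relative (R1C1-style) notation so that copied formulas compare equal. The function name `neighborhood_dissimilarity`, the `radius` parameter, and the sample sheet are all hypothetical.

```python
from typing import Dict, Tuple

Cell = Tuple[int, int]  # (row, column), zero-based


def neighborhood_dissimilarity(
    formulas: Dict[Cell, str], cell: Cell, radius: int = 1
) -> float:
    """Fraction of nearby formula cells whose formula differs from `cell`.

    Hypothetical illustration only; not a metric taken from the paper.
    `formulas` maps coordinates to formulas in relative (R1C1-style) notation.
    """
    row, col = cell
    target = formulas[cell]
    # Collect formulas of all formula cells in the square window around `cell`.
    neighbors = [
        formulas[(r, c)]
        for r in range(row - radius, row + radius + 1)
        for c in range(col - radius, col + radius + 1)
        if (r, c) != cell and (r, c) in formulas
    ]
    if not neighbors:
        return 0.0  # no formula neighbors: nothing to compare against
    differing = sum(1 for f in neighbors if f != target)
    return differing / len(neighbors)


# Hypothetical sheet: a column of copied sums where one cell breaks the pattern.
sheet = {
    (0, 2): "=RC[-2]+RC[-1]",
    (1, 2): "=RC[-2]+RC[-1]",
    (2, 2): "=RC[-2]-RC[-1]",  # deviates from its neighbors
    (3, 2): "=RC[-2]+RC[-1]",
}

for cell in sorted(sheet):
    print(cell, neighborhood_dissimilarity(sheet, cell))
```

In this toy sheet the deviating cell at (2, 2) differs from every formula neighbor, but so does the correct cell at (3, 2), whose only neighbor is the deviating one. A single local score of this kind is therefore ambiguous on its own, which is one reason to combine several neighborhood features with formula-level metrics in a learned predictor.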

Results: Extensive experiments on three widely used datasets (Enron, INFO1, and EUSES) show that integrating the enhanced neighborhood metrics with the traditional ones significantly improves fault prediction performance. The approach yields notable improvements in precision, recall, and F1 score, particularly on medium and large datasets.
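For reference, precision, recall, and F1 follow their standard definitions over per-cell predictions. The sketch below computes them from confusion counts. The function name and the counts in the usage line are made up for illustration and are not results from the paper.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from confusion counts.

    tp: faulty cells flagged as faulty
    fp: correct cells flagged as faulty
    fn: faulty cells that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, chosen only to exercise the function.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```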

Conclusion: This study highlights the importance of incorporating neighborhood metrics for spreadsheet fault detection. The enhanced neighborhood metric approach improves fault detection accuracy by capturing subtle logical variations between formula cells that are physically close. This method offers a robust and effective framework for improving the reliability of spreadsheets and can be applied in various real-world data analysis tasks.


Datasets

The experiments were conducted on three public datasets. You can download them from the following links.

Spreadsheet corpora as originally published.


RQ

Cite


You can access the repository here: https://github.com/dangerwolf/ENM.