SpreadsheetLLM: Encoding Spreadsheets for Large Language Models
Recently, the rapid development of Large Language Models (LLMs) has opened new frontiers in table processing (Li et al., 2023b) and reasoning (Cheng et al., 2022). However, spreadsheets pose unique challenges for LLMs due to their expansive grids that usually exceed the token limitations of popular LLMs, as well as their inherent two-dimensional layouts and structures, which are poorly suited to linear and sequential input. Furthermore, LLMs often struggle with spreadsheet-specific features such as cell addresses and formats, complicating their ability to effectively parse and utilize spreadsheet data, as detailed in Appendix A.
Structural Anchors for Efficient Layout Understanding: Observations indicate that large spreadsheets often contain numerous homogeneous rows or columns, which contribute minimally to understanding the layout and structure (see left panel in Figure 2 (a)). To address this, we identify structural anchors—heterogeneous rows and columns at possible table boundaries that offer substantial layout insights, as depicted in Figure 2 (b). Then we remove distant, homogeneous rows and columns, producing a condensed "skeleton" version of the spreadsheet, as illustrated in Figure 2 (c).
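The anchor-and-skeleton idea can be illustrated with a simplified sketch. Here a row or column is treated as a candidate boundary (anchor) when its coarse type signature differs from its neighbor's, and only lines within distance k of an anchor are kept; the function names (`cell_kind`, `anchor_indices`, `extract_skeleton`) and the type-signature heuristic are hypothetical stand-ins for the paper's actual heterogeneity criterion.

```python
def cell_kind(value):
    """Coarse cell type used to judge homogeneity (a simplifying assumption)."""
    if value is None or value == "":
        return "empty"
    if isinstance(value, (int, float)):
        return "number"
    return "text"

def anchor_indices(lines, k=1):
    """Indices of lines whose type signature differs from the previous line
    (possible table boundaries), expanded by a k-neighborhood; distant
    homogeneous lines are dropped."""
    sigs = [tuple(cell_kind(c) for c in line) for line in lines]
    anchors = {0} | {i for i in range(1, len(sigs)) if sigs[i] != sigs[i - 1]}
    keep = set()
    for a in anchors:
        keep.update(range(max(0, a - k), min(len(lines), a + k + 1)))
    return sorted(keep)

def extract_skeleton(grid, k=1):
    """Condensed 'skeleton' spreadsheet: rows and columns near anchors only."""
    rows = anchor_indices(grid, k)
    cols = anchor_indices(list(zip(*grid)), k)
    return [[grid[r][c] for c in cols] for r in rows]
```

On a table with a header, five identical-signature data rows, and a trailing empty row, this keeps the header, the first and last data rows, and the empty boundary row, discarding the interior repeats.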
Inverted-Index Translation for Token Efficiency: The vanilla encoding method becomes token-consuming when handling spreadsheets with numerous empty cells and repetitive values, as shown in Figure 2 (c). To improve efficiency, we depart from traditional row-by-row and column-by-column serialization and employ a lossless inverted-index translation in JSON format. This method creates a dictionary that indexes non-empty cell texts and merges addresses with identical text, optimizing token usage while preserving data integrity.
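A minimal sketch of such an inverted index follows: empty cells are dropped entirely, and cells sharing identical text are merged under one dictionary key. The helper names and the single-letter column assumption are illustrative, not the paper's implementation.

```python
import json

def a1(row, col):
    """0-based (row, col) to an A1-style address (assumes fewer than 26 columns)."""
    return f"{chr(ord('A') + col)}{row + 1}"

def inverted_index(grid):
    """Lossless inverted index: non-empty cell text -> list of cell addresses,
    merging all cells that share identical text into one entry."""
    index = {}
    for r, row in enumerate(grid):
        for c, val in enumerate(row):
            if val is None or val == "":
                continue  # empty cells cost zero tokens in this encoding
            index.setdefault(str(val), []).append(a1(r, c))
    return json.dumps(index)
```

Because repeated values collapse to a single key and empties vanish, the encoding shrinks sharply on sparse, repetitive sheets while remaining invertible back to the original non-empty cells.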
Data Format Aggregation for Numerical Cells: Adjacent numerical cells often share similar number formats. Recognizing that exact numerical values are less crucial for grasping spreadsheet structure, we extract number format strings and data types from these cells. Then adjacent cells with the same formats or types are clustered together. This method is visualized in the right example of Figure 2, where rectangular regions are represented by uniform format strings and data types, streamlining the understanding of numerical data distribution without excessive token expenditure.

We conducted a comprehensive evaluation of our method on a variety of LLMs. Our experiments show that SHEETCOMPRESSOR significantly reduces token usage for spreadsheet encoding by 96%. Moreover, SPREADSHEETLLM has shown exceptional performance in spreadsheet table detection,
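The aggregation step can be sketched as follows. This simplified version infers a coarse data-type token per cell and merges horizontally adjacent cells with the same token into a single range entry; the token names (`Integer`, `Float`, `Date`, `Text`), the regex-based inference, and the row-wise (rather than rectangular) merging are assumptions made for brevity.

```python
import re

def a1(row, col):
    """0-based (row, col) to an A1-style address (assumes fewer than 26 columns)."""
    return f"{chr(ord('A') + col)}{row + 1}"

def format_token(value):
    """Coarse data-type token standing in for a real number format string."""
    s = str(value)
    if re.fullmatch(r"-?\d+", s):
        return "Integer"
    if re.fullmatch(r"-?\d+\.\d+", s):
        return "Float"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", s):
        return "Date"
    return "Text"

def aggregate_formats(grid):
    """Merge horizontally adjacent cells sharing a format token into ranges,
    mapping each range (e.g. 'B1:C1') to its token."""
    regions = {}
    for r, row in enumerate(grid):
        c = 0
        while c < len(row):
            tok = format_token(row[c])
            start = c
            while c + 1 < len(row) and format_token(row[c + 1]) == tok:
                c += 1  # extend the run while the format token repeats
            key = a1(r, start) if start == c else f"{a1(r, start)}:{a1(r, c)}"
            regions[key] = tok
            c += 1
    return regions
```

Replacing runs of concrete numbers with one range-to-format entry preserves the structural signal (where the numeric region is and what it holds) at a fraction of the token cost.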