Modeling Code: Is Text All You Need?

Paper · arXiv 2507.11467 · Published July 15, 2025

Code LLMs have become extremely popular recently for modeling source code across a variety of tasks, such as generation, translation, and summarization. However, transformer-based models are limited in their ability to reason about structured, analytical properties of code, such as control and data flow. Previous work has explored the modeling of these properties with structured data and graph neural networks. However, these approaches lack the generative capabilities and scale of modern LLMs. In this work, we introduce a novel approach that combines the strengths of modeling code both as text and in more structured forms.

In this paper, we propose a solution that integrates graph-based reasoning into LLMs using a GNN soft prompting approach. Our method learns to encode structured code representations into prompts that can be consumed by powerful pre-trained LLMs. To accomplish this, we propose a novel graph representation of LLVM intermediate representation (IR) that can be learnably mapped into a language model’s embedding space. By bridging key information from graph-based analyses directly into the model’s latent space, we preserve the fidelity of structured reasoning while achieving the flexibility and scale of modern LLMs.
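The soft-prompting idea above can be sketched numerically: a GNN produces node embeddings for the code graph, a learned projection maps a pooled graph summary into a few vectors in the LLM's embedding space, and those vectors are prepended to the token embeddings. A minimal NumPy sketch, with all dimensions, variable names, and the mean-pooling choice being illustrative assumptions rather than details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
d_gnn, d_llm = 64, 128     # GNN node dim, LLM embedding dim
n_nodes, k_prompt = 10, 4  # graph size, number of soft-prompt tokens

# Stand-in for GNN output: one embedding per graph node.
node_embeddings = rng.normal(size=(n_nodes, d_gnn))

# Learned projection mapping the pooled graph summary into the
# LLM embedding space, one matrix per soft-prompt position.
W = rng.normal(size=(k_prompt, d_gnn, d_llm)) * 0.01

pooled = node_embeddings.mean(axis=0)            # (d_gnn,)
soft_prompt = np.einsum("d,kde->ke", pooled, W)  # (k_prompt, d_llm)

# Prepend the soft prompt to the LLM's token embeddings;
# the LLM then attends over prompt vectors and text tokens alike.
token_embeddings = rng.normal(size=(20, d_llm))  # 20 text tokens
llm_input = np.concatenate([soft_prompt, token_embeddings], axis=0)
print(llm_input.shape)  # (24, 128)
```

In a real system the projection would be trained end-to-end while the LLM stays frozen, which is what lets the approach reuse a powerful pre-trained model.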

4.1. Design of the IRGraph Format

The graph format is inspired by similar representations of code, such as ProGraML and PerfoGraph; however, it uses a finer granularity, splitting IR statements into multiple graph elements. It further adds more node and edge types to better distinguish information on the graph. The final graph format, IRGraph, has six node types and eight edge types, which are described below.

Node Types

Value: Represents individual LLVM IR values, such as variables or constants.

Type: Encodes type information, such as integer or floating-point types, associated with LLVM IR values.

Size: Models container sizes or dimensionality information derived from type properties.

Module: Represents the LLVM IR module as a global context for values and functions.

Attributes: Captures function and argument attributes, including linkage, visibility, and calling conventions.

Instruction: Represents individual instructions in the LLVM IR, including their operation codes and alignment information.
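The six node types can be captured as a simple enumeration. A minimal sketch; the class and member names are our own labeling of the types above, not code from the paper:

```python
from enum import Enum, auto

class IRGraphNodeType(Enum):
    """The six IRGraph node types described above."""
    VALUE = auto()        # individual LLVM IR values (variables, constants)
    TYPE = auto()         # type information associated with values
    SIZE = auto()         # container sizes / dimensionality from types
    MODULE = auto()       # the LLVM IR module as global context
    ATTRIBUTES = auto()   # function and argument attributes
    INSTRUCTION = auto()  # instructions, with opcodes and alignment

print([t.name for t in IRGraphNodeType])
```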

Edge Types

Type: Connects a value to its associated type (value → type).

Dataflow: Captures data dependencies, including definitions (instruction → value) and uses (value → instruction).

Attribute: Links values to their attributes (value → attribute).

CFG: Represents control flow between instructions (instruction → instruction).

Size: Maps types to their associated sizes (type → size).

Symbol: Connects the module to global values (module → value).

Includes: Connects contained types (type → type).

Contains: Connects constants and global variables to operands and initializers (value → value).

This structure provides more comprehensive detail about the LLVM IR code beyond control and data flow, which are typically the only focus of graph code representations such as ProGraML and PerfoGraph. By incorporating more granular details, the representation is better able to capture nuanced relationships between different elements of the LLVM IR code. For example, attributes, symbols, and size information are crucial to understanding performance properties of code, which is a common task for graph-based code representations, yet related works either omit these details or encapsulate them only in a limited manner.