SAILER: Structure-aware Pre-trained Language Model for Legal Case Retrieval

Paper · arXiv 2304.11370 · Published April 22, 2023

To address these issues, in this paper we propose SAILER, a new Structure-Aware pre-traIned language model for LEgal case Retrieval. It stands out in three aspects: (1) SAILER fully utilizes the structural information contained in legal case documents and pays more attention to key legal elements, similar to how legal experts browse legal case documents. (2) SAILER employs an asymmetric encoder-decoder architecture to integrate several different pre-training objectives. In this way, rich semantic information across tasks is encoded into dense vectors. (3) SAILER has strong discriminative ability, even without any legal annotation data: it can accurately distinguish legal cases with different charges.

Table 1: The two paragraphs are very similar in text, but legally irrelevant to each other due to their key legal elements.

Paragraph A: Person X, 24 years old, male, height 180 cm. In 2018, he entered the shopping mall five times, where he stole one cell phone and two tablet computers, worth a total of 10,000,000. At five o’clock, he returned home and gave the above items to Person Y.

Paragraph B: Person X, 24 years old, male, height 180 cm. In 2018, he entered the shopping mall five times, where he purchased one cell phone and two tablet computers, worth a total of 10,000,000. At five o’clock, he returned home and gave the above items to Person Y.

Challenge 1. Legal case documents are typically long text sequences with intrinsic writing logic. As shown in Figure 1, case documents in both the Case Law system and the Statute Law system usually consist of five parts: Procedure, Fact, Reasoning, Decision, and Tail (we discuss the details of these parts in Section 3). Each part represents a specific topic and ranges from hundreds to thousands of words. These parts are written in standard legal logic and are usually correlated with each other. Existing PLMs either possess limited text modeling capacity, hindering their ability to model long documents [9, 18], or neglect the structure of legal case documents, preventing them from capturing the long-distance dependencies in legal writing logic [34]. Consequently, the performance of vanilla PLM-based retrievers is constrained.
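For exposition, the five-part segmentation could be held in a simple container like the sketch below; this is a hypothetical structure (field names mirror the section labels above), not an artifact released with the paper.

```python
from dataclasses import dataclass

# Hypothetical container for a parsed legal case document; the parser that
# fills it is assumed. Field names follow the five parts named above.
@dataclass
class LegalCaseDocument:
    procedure: str  # how the case reached the court
    fact: str       # factual findings; the part most relevant for retrieval
    reasoning: str  # the court's legal analysis of the key legal elements
    decision: str   # charges, verdict, and sentence
    tail: str       # closing formalities (panel, date, etc.)
```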

Challenge 2. The concept of relevance in the legal domain is quite different from that in general search. In legal case retrieval, the relevance between two legal cases is sensitive to their key legal elements (e.g., “forcibly took other people’s property”, “arbitrarily damaged other people’s belongings”). Here, key legal elements include key circumstances and the legal-concept abstraction of key circumstances or factors [24]. Cases without key legal elements, or with different key legal elements, may lead to different judgments. For example, as shown in Table 1, the two paragraphs would usually be considered relevant in ad-hoc retrieval because they share a significant number of keywords and sentences. However, in legal case retrieval, these two paragraphs are irrelevant and could lead to completely different judgments due to the influence of key legal elements. Without guidance, open-domain PLM-based neural retrieval models have difficulty understanding key legal elements, resulting in suboptimal retrieval performance in the legal domain.
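To make the point concrete, here is a small self-contained check (paragraph text taken from Table 1): under a bag-of-words view the two paragraphs are near-duplicates, even though substituting a single verb reverses the legal outcome.

```python
# Why lexical matching is misleading on the Table 1 pair: bag-of-words
# Jaccard similarity rates the paragraphs as near-duplicates, although
# swapping "stole" for "purchased" flips the legal judgment.
para_a = ("Person X, 24 years old, male, height 180 cm. In 2018, he entered "
          "the shopping mall five times, where he stole one cell phone and "
          "two tablet computers, worth a total of 10,000,000. At five "
          "o'clock, he returned home and gave the above items to Person Y.")
para_b = para_a.replace("stole", "purchased")

tokens_a, tokens_b = set(para_a.lower().split()), set(para_b.lower().split())
jaccard = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
print(f"Jaccard similarity: {jaccard:.2f}")  # ~0.95: lexically near-identical
```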

To tackle these challenges, we propose a Structure-Aware pre-traIned language model for LEgal case Retrieval (SAILER). SAILER adopts an encoder-decoder architecture to explicitly model and capture the dependencies between the Fact and the other parts of a legal case document (Challenge 1). Meanwhile, SAILER utilizes the legal knowledge in the Reasoning and Decision paragraphs to enhance the understanding of key legal elements (Challenge 2). Specifically, we encode the Fact paragraph into dense vectors using a deep encoder. Then, with the help of the Fact vector, two shallow decoders are applied to reconstruct the aggressively-masked texts of the Reasoning and Decision paragraphs, respectively. In this way, SAILER takes full advantage of the logical relationships in legal case documents and the knowledge in the different parts.
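The following PyTorch sketch illustrates this asymmetric setup; layer counts, dimensions, and names are assumptions for exposition, not the released SAILER implementation.

```python
import torch.nn as nn

# Illustrative sketch of the asymmetric encoder-decoder described above.
# Sizes and layer counts are assumed, not taken from the paper's release.
class SAILERSketch(nn.Module):
    def __init__(self, vocab_size=30000, d_model=768, nhead=12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Deep encoder: compresses the Fact section into dense vectors.
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers=12)
        # Two shallow (single-layer) decoders, one per target section.
        self.reasoning_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers=1)
        self.decision_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers=1)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, fact_ids, masked_reasoning_ids, masked_decision_ids):
        # [CLS]-style dense vector summarizing the Fact paragraph.
        fact_vec = self.encoder(self.embed(fact_ids))[:, :1, :]  # (B, 1, d)
        # Each decoder sees only its aggressively-masked section plus the
        # Fact vector, so reconstruction must pull the legal knowledge of
        # the Reasoning/Decision parts through that single vector.
        r = self.reasoning_decoder(self.embed(masked_reasoning_ids),
                                   memory=fact_vec)
        d = self.decision_decoder(self.embed(masked_decision_ids),
                                  memory=fact_vec)
        return self.lm_head(r), self.lm_head(d)  # logits for the MLM losses
```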

Pre-trained Language Models for Dense Retrieval

Dense retrieval (DR) typically uses a dual encoder to encode queries and documents separately and computes the relevance score by a simple similarity function (cosine or dot product). Many researchers have further improved the performance of DR by negative sampling and distillation [6, 12, 26, 37, 42].
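For concreteness, below is a minimal sketch of the scoring step (generic dense retrieval, not specific to any cited system; the random tensors stand in for encoder outputs).

```python
import torch
import torch.nn.functional as F

def relevance_scores(query_vec: torch.Tensor, doc_vecs: torch.Tensor,
                     metric: str = "dot") -> torch.Tensor:
    """query_vec: (d,), doc_vecs: (n, d) -> (n,) relevance scores."""
    if metric == "cosine":
        return F.cosine_similarity(query_vec.unsqueeze(0), doc_vecs, dim=-1)
    return doc_vecs @ query_vec  # dot product

q = torch.randn(768)            # stand-in for an encoded query
docs = torch.randn(1000, 768)   # stand-in for a pre-computed document index
top10 = relevance_scores(q, docs).topk(10).indices  # ids of the 10 best docs
```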

Researchers in the IR community have already begun to design pre-training methods tailored for DR [14, 15, 19, 21, 22, 32, 33]. These approaches primarily aim to better represent contextual semantics with the [CLS] embedding. They are based on the intuition that the [CLS] embedding should encode the important information in the given text to achieve robust matching. For example, Condenser [14] and coCondenser [15] design skip connections in the last layers of the encoder to force information to aggregate into the [CLS] token.

Recently, autoencoder-based pre-training has drawn much attention. The input sentences are encoded into embeddings that are then used to reconstruct the originals, forcing the encoder to generate better sentence representations. SEED-Encoder [21] proposes to use a weak decoder for reconstruction. SimLM [32] and RetroMAE [19] modify the decoding approach to strengthen the information bottleneck, which improves the quality of the generated embeddings.
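A minimal sketch of this bottleneck idea, under assumed masking rates (the 15%/50% split is illustrative, not taken from the cited papers): the encoder input is lightly masked, while the decoder must reconstruct a heavily masked copy conditioned only on the sentence embedding.

```python
import torch

def mask_ids(ids: torch.Tensor, rate: float, mask_token_id: int = 103):
    """Replace a random `rate` fraction of token ids with [MASK]."""
    positions = torch.rand(ids.shape) < rate
    masked = ids.clone()
    masked[positions] = mask_token_id
    return masked, positions

ids = torch.randint(1000, 30000, (1, 128))     # toy token ids
enc_input, _ = mask_ids(ids, rate=0.15)        # mild masking for the encoder
dec_input, targets = mask_ids(ids, rate=0.50)  # aggressive masking for decoder
# training loss (not shown): cross-entropy of the decoder's predictions at
# `targets`, with the encoder's [CLS] embedding as the only extra context
```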

Despite their success, autoencoder-based models cannot fully comprehend the logical relationships between the different structures in legal case documents, as they rely mainly on limited information in the corpus. Moreover, since case texts contain numerous facts that are not essential for judging the relevance of cases, reconstructing the original text may decrease the discriminative power of the dense vectors.