LLM Augmentations to support Analytical Reasoning over Multiple Documents
Our key contributions are:
- We conduct the first investigation of the feasibility of using LLMs in intelligence analysis (IA), where both evidence-based reasoning and analytical creativity are of utmost importance.
- We develop a three-step augmentation to support the use of LLMs for IA: i) Dynamic Evidence Trees (DETs), a memory module to help organize evidence; ii) data condensation via LLMs; and iii) an LLM-driven search and retrieval process.
- Applying our framework to multiple IA datasets, we show that while our augmentations help orchestrate evidence and improve narratives on large datasets, LLMs still lack the analytical creativity to craft convincing arguments.
- We outline detailed recommendations for applying LLMs to IA tasks, specifically how they can serve as modules in subtasks such as evidence marshalling and narrative generation.
Memory modules are increasingly being used to augment LLMs in various applications [16]. We therefore devise a memory module in the form of trees to help LLMs orchestrate evidence as the reasoning proceeds. As an improvement on the second front, we also run our tests at two different granularities of the reports: we propose condensing the reports into concise information chunks.
With the help of DETs, LLMs build up hypotheses and investigation threads. Intelligence analysts go through a process of discovery, connecting the dots to build up various hypotheses. Much as human analysts do, DETs keep track of all the information dots and keep connecting them with new relevant information in a tree-like structure. Each hypothesis dot represents an investigation thread that may or may not get added to another related investigation thread.
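The tree-like structure described above can be sketched as a simple node type, where hypothesis dots sit above the evidential dots that support them. This is an illustrative minimal sketch, not the paper's actual implementation; all names are ours.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DETNode:
    """A node in a Dynamic Evidence Tree: an evidential dot (leaf)
    or a hypothesis dot whose children form an investigation thread."""
    text: str
    children: List["DETNode"] = field(default_factory=list)
    parent: Optional["DETNode"] = None

    def attach(self, child: "DETNode") -> None:
        """Connect a new information dot to this investigation thread."""
        child.parent = self
        self.children.append(child)

    def thread(self) -> List[str]:
        """All dots on the path from the root hypothesis down to this node."""
        path, node = [], self
        while node is not None:
            path.append(node.text)
            node = node.parent
        return list(reversed(path))

# Example: one hypothesis thread that keeps absorbing related dots.
root = DETNode("Hypothesis: suspects A and B coordinated a meeting")
dot1 = DETNode("Report 12: A phoned B on May 3")
root.attach(dot1)
dot1.attach(DETNode("Report 27: B booked a hotel near A's office"))
print(dot1.children[0].thread())
```

As new relevant information arrives, it is attached to the matching branch, so each root-to-leaf path reads as one coherent investigation thread.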
We utilize the language-modeling capabilities of LLMs to digest the set of reports into usable information dots before generating reports. These condensed information dots can be merged with other information dots to create hypotheses. We empirically test various system prompts to break down each report in a zero-shot fashion, such that a report r_i turns into an evidential dot ed_i. We use dynamic pre-processing to break it down further if the LLM deems it necessary. In our experiments, every report yielded a single dot, which also reflects the concise reporting practices of the intelligence community.
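The condensation step can be sketched as a single zero-shot call per report. Here `call_llm` is a hypothetical stand-in for whatever chat-completion client is in use, and the prompt wording is illustrative, not the paper's exact system prompt.

```python
# Hedged sketch of zero-shot report condensation (r_i -> ed_i).
CONDENSE_PROMPT = (
    "Condense the following intelligence report into one or more "
    "self-contained evidential dots, one per line. Split only if the "
    "report covers clearly separate events.\n\nReport:\n{report}"
)

def condense(report: str, call_llm) -> list[str]:
    """Turn a report r_i into evidential dots ed_i (usually just one)."""
    reply = call_llm(CONDENSE_PROMPT.format(report=report))
    # One dot per non-empty line of the model's reply.
    return [line.strip() for line in reply.splitlines() if line.strip()]

# Toy usage with a mocked LLM that returns one condensed dot.
mock_llm = lambda prompt: "A wired $10k to B on May 3."
print(condense("Bank records show that on May 3 ...", mock_llm))
```

In practice a second pass (the dynamic pre-processing mentioned above) would re-invoke the model on any dot the LLM flags as still too coarse.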
F. Third augmentation: LLMs for Retrieval

To build up a DET, we require search and retrieval capabilities over the existing evidence list, given new evidence. We augment the LLMs with a two-step retrieval pipeline. The first step is powered by an LLM-based embedding vector database and similarity-based search. Following the recent use of LLM-based embeddings for retrieval [17], [18], we adopt an instructor embedding model [19] for our database. The second step uses an LLM to filter the extracted set of dots.
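The two-step pipeline can be sketched as similarity search followed by an LLM relevance filter. For a self-contained example, step 1 below uses a toy bag-of-words cosine similarity as a stand-in for the instructor embedding model, and `llm_filter` is a hypothetical LLM-as-judge callable (mocked here).

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for the instructor embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, dots: list[str], llm_filter, k: int = 3) -> list[str]:
    """Step 1: similarity search over the dot database.
    Step 2: keep only the candidates the LLM judges relevant."""
    q = embed(query)
    ranked = sorted(dots, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]
    return [d for d in ranked if llm_filter(query, d)]

dots = ["A phoned B on May 3", "weather report for Tuesday", "B met A at the docks"]
keep_all = lambda query, dot: True  # mock LLM filter
print(retrieve("calls between A and B", dots, keep_all, k=2))
```

In the real pipeline, the embedding step narrows thousands of stored dots to a handful of candidates cheaply, and the LLM filter spends its more expensive judgment only on that shortlist.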
The augmented version of the experiment is designed to process intelligence reports sequentially and keep track of the evidence in DETs. This sequential approach helps in three ways: first, it conforms to the dynamic nature of how intelligence reports become available and are assigned over time; second, it helps build up DETs to keep track of the emerging evidence; third, it works around the context limitations of the models. It also follows the natural workflow of analysts, who must keep hypotheses running until enough evidence has accrued and a line of reasoning can be established for each hypothesis. New reports and the resulting information can either support or disprove existing hypotheses. The DET-based augmented pipeline accounts for this and inputs the reports iteratively to simulate the temporal continuum of intelligence analysis.
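The sequential ingestion loop amounts to a few lines. In this sketch, `condense_fn` and `insert_into_det` are hypothetical stand-ins for the condensation and tree-building steps; only the loop structure reflects the pipeline described above.

```python
# Iterative ingestion: reports arrive one at a time, mimicking the
# temporal continuum of intelligence analysis and sidestepping the
# models' context limits (each step sees one report, not the corpus).
def run_pipeline(report_stream, condense_fn, insert_into_det):
    det = []  # evolving evidence store (the DET)
    for report in report_stream:          # one report per "time step"
        for dot in condense_fn(report):   # usually a single dot
            insert_into_det(det, dot)     # search, attach, update threads
    return det

# Toy usage with trivial stand-ins for both steps.
reports = ["report 1 ...", "report 2 ..."]
det = run_pipeline(reports,
                   condense_fn=lambda r: [r.upper()],
                   insert_into_det=lambda store, d: store.append(d))
print(det)
```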
Each report goes through a dataset-specific pre-processing step, from which a set of information dots is extracted. We initialize a database and keep updating it with new information dots at runtime. During the tree-building operation, the system first searches for relevant dots in the DETs. The extracted candidate dots then go through a parent-level hypothesis-dot identification and consolidation process: we take the lowest common parent hypothesis node and assign the new dot to it. This ensures that the new dot is assigned to the most relevant branch of the evidence tree. Afterwards, each evidential dot of the resulting new branch is extracted, and the LLM synthesizes all of these dots to create a short narrative for that particular investigation thread. From the resulting DETs, we isolate the largest chain of events and use it as the main DET for the input. The full generated DETs, along with their disjoint nodes, also demonstrate how LLMs are augmented to keep track of different documents and how each document is utilized through the chain of events in building up the final hypothesis. Fig. 5 shows the final steps of the proposed architecture, and Algorithms 1 and 2 show the overall build and merge operations.
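The consolidation step, picking the lowest common parent hypothesis of the retrieved candidates, can be sketched as follows. This is an illustrative reconstruction under our own naming, with nodes kept as plain dicts for brevity.

```python
def ancestors(node):
    """Path from a node up to the root (node first)."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node["parent"]
    return chain

def lowest_common_parent(candidates):
    """Deepest node appearing on every candidate's root path; the new
    dot is attached here, i.e., to the most relevant branch."""
    shared = set(map(id, ancestors(candidates[0])))
    for c in candidates[1:]:
        shared &= set(map(id, ancestors(c)))
    for node in ancestors(candidates[0]):  # walking upward: first hit is deepest
        if id(node) in shared:
            return node
    return None

# Toy tree: root hypothesis -> thread 1 -> {dot a, dot b}; dot c under root.
root = {"text": "root hypothesis", "parent": None}
h1 = {"text": "thread 1", "parent": root}
d1 = {"text": "dot a", "parent": h1}
d2 = {"text": "dot b", "parent": h1}
d3 = {"text": "dot c", "parent": root}

print(lowest_common_parent([d1, d2])["text"])  # thread 1
print(lowest_common_parent([d1, d3])["text"])  # root hypothesis
```

When the candidates all sit in one thread, the new dot lands deep in that thread; when they span threads, it attaches higher up, keeping loosely related evidence from being forced into one branch.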
These works typically treat the story as a connection between a set of entities established through graph properties. The tasks of sense-making, uncovering the bigger story, and narrative generation are still left to the analysts.
From the quantitative and qualitative experiments, it is apparent that the generation of LLMs studied here struggles with in-depth analytical reasoning. Across all datasets, we see strong summarization but little analytical creativity or speculative/imaginative reasoning (notwithstanding LLMs' ability to hallucinate information). We briefly describe these points here. Steering speculative reasoning is hard: LLMs are adept at summarizing documents into groups. However, our repeated attempts with multiple types of prompts and parameter sweeps failed to invoke the capability required to properly speculate about connections. Complementing this finding, we also experimented with a shorter use case involving brief descriptions of four persons, as shown in Fig. 9. When asked about the connection between only two persons, LLMs were able to speculate based on the similarity of their names. However, with two more persons added to the mix, LLMs failed to invoke the previous speculation. We also noticed that the position of the target entities mattered. Our finding is consistent with recent research, e.g., the needle-in-a-haystack [41] and lost-in-the-middle [42] studies. Moreover, our findings suggest that LLMs still struggle at a much smaller scale than previously thought, depending on both task complexity (i.e., the number of entities in IA) and context length.
As a ray of hope, preliminary tests on OpenAI's o1 model [43] (unavailable at submission time) show substantial improvement, which can be attributed to the additional chain-of-thought reasoning steps invoked during generation. This reinforces that additional reasoning augmentations can be helpful for IA tasks.
LLMs are good organizers: Even lacking in-depth analytical reasoning, LLMs were able to group related entities and events together, especially on smaller datasets. With large datasets, augmentation is needed to orchestrate some of the required grouping because of context limitations and loss of attention [15].
Orchestrating the evidence is a major challenge: For data-centric analytical tasks like IA, it is hard for an LLM to take everything into context in a single prompt. Augmentation is necessary to organize the evidence, and we propose an LLM-driven framework for these tasks. LLMs can also help in the search and retrieval process of such frameworks. Frameworks designed to expose the intermediate reasoning steps can also help analysts develop mental models.