Can Large Language Models do Analytical Reasoning?
Our divide-and-conquer approach breaks play-by-play data into smaller, more manageable segments, solves each segment individually, and then aggregates the partial results. Beyond divide-and-conquer, we also explore the Chain-of-Thought (CoT) strategy, which markedly improves outcomes for certain models, notably GPT-4 and Claude-2.1, whose accuracy rates increase significantly; however, CoT has negligible or even detrimental effects on other models such as GPT-3.5 and Gemini-Pro. Second, to our surprise, we observe that most models, including GPT-4, struggle to accurately count the total scores for NBA quarters despite performing strongly when counting NFL quarter scores. This leads us to investigate, through extensive experiments, the factors that drive the complexity of analytical reasoning tasks; we conclude that task complexity depends on the length of the context, the information density, and the presence of related information. Our research provides valuable insights into the complexity of analytical reasoning tasks and points to directions for developing future large language models.
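The divide-and-conquer pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `ask` callable stands in for whatever LLM API is used, and the prompt text, `score_by_segments` name, and default segment size are all illustrative assumptions.

```python
from typing import Callable

def score_by_segments(plays: list[str],
                      ask: Callable[[str], int],
                      segment_size: int = 20) -> int:
    """Divide play-by-play lines into segments, score each, then aggregate."""
    # 1. Divide: chunk the play-by-play log into manageable segments.
    segments = [plays[i:i + segment_size]
                for i in range(0, len(plays), segment_size)]
    # 2. Conquer: ask the model for the points scored in each segment alone
    #    (`ask` is a hypothetical wrapper that returns a parsed integer).
    partial_scores = [
        ask("How many points were scored in these plays?\n" + "\n".join(seg))
        for seg in segments
    ]
    # 3. Aggregate: sum the per-segment answers into a quarter or game total.
    return sum(partial_scores)
```

Keeping each prompt short is the point of the decomposition: the model only ever reasons over one segment at a time, and the error-prone long-context summation is replaced by an exact `sum` outside the model.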
Based on the analysis of the plot, all curves exhibit a downward trend as the context length increases, with the notable exception of the whole-game data points. This deviation could potentially be attributed to the prevalence of whole-game scores in news reports, suggesting that these outliers result from background knowledge acquired during the model's training phase. Consequently, it is reasonable to infer that, within the same task, task complexity escalates as context length increases.
The data points are classified into three tiers: low, medium, and high density, corresponding respectively to the bottom, middle, and top thirds of the information density values. Across all models and reasoning methods, a consistent trend emerges: model performance declines as the information density level rises. From this we infer that the complexity of the reasoning task increases with information density.
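The tertile split described above can be implemented with quantile boundaries. This is a sketch of one plausible binning scheme, not necessarily the paper's exact procedure; the function name is illustrative.

```python
import numpy as np

def density_tiers(densities: np.ndarray) -> np.ndarray:
    """Label each data point 'low', 'medium', or 'high' by density tertile."""
    # Tertile boundaries: the 1/3 and 2/3 quantiles of the density values.
    lo, hi = np.quantile(densities, [1 / 3, 2 / 3])
    # Points at or below `lo` are low density, at or below `hi` are medium,
    # and everything above `hi` is high density.
    return np.where(densities <= lo, "low",
                    np.where(densities <= hi, "medium", "high"))
```

Binning by value quantiles (rather than fixed thresholds) guarantees roughly equal-sized tiers, so the per-tier accuracy comparison is not dominated by one sparsely populated bucket.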