Comparing Apples to Apples: Generating Aspect-Aware Comparative Sentences from User Reviews
Deciding on a product to purchase can be a time-consuming process. Every user has specific quality preferences, budget restrictions, or values different item features. To distill the important information, users typically have to read the specifications and reviews of several different (but often very similar) products. For example, a user interested in digital pianos might want to know which item has the best price-to-value ratio, longevity, or sound quality. If the sound quality of an electronic piano is important to a user, the sentence “This piano sounds more natural than my Sony NWZ-A855.” conveys richer information than “This piano sounds natural.”
We define a comparative sentence as a sentence that highlights differences between items to ground an item's relative perception. A comparative sentence typically includes a comparative adjective or adverb (such as “bigger”, “faster”, “more beautiful”, or “less expensive”). Comparative sentences assist users in their relative perception of items and provide a richer information context by pointing out specific features that are superior or inferior to those of similar products. Recommendation explanations such as justifications (Ni et al., 2019) or ‘tips’ (Li et al., 2019) assess the qualities of a single product and are evaluative by nature, but do not place the generated explanation in a larger context. A few prior works tangentially study the problem of generating indirect comparisons (Yang et al., 2021; Ginty and Smyth, 2002; Chen et al., 2022), but we are the first to generate abstractive comparative explanations directly for this task: rather than relying on templates or extractions from previously written reviews, our method abstracts over previous reviews to generate new, personalized comparative recommendation explanations.
Our work generates sentences that express relative comparisons to facilitate product purchase decisions. Our method can additionally highlight outstanding product features for user-specific personalization.
To compile a larger corpus of comparative sentences, we automatically label instances from a subset of Amazon Musical Instruments and Amazon Electronics, extracting comparative sentences for each item. To find comparative sentences reliably and automatically, we train a BERT classification model on our previously manually labeled dataset. We tokenize the data instances with the BERT tokenization scheme (Devlin et al., 2018) and pad sentences to a fixed length. The model is then fine-tuned for 4 epochs with a linear classification layer to predict comparative sentences. The labeled data was split into training, validation, and test sets, and the best model was chosen on the validation set.
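As an illustration, the fixed-length padding and dataset split described above can be sketched as follows. This is a minimal pure-Python sketch: `max_len`, `pad_id`, the split fractions, and the seed are assumed hyperparameters not stated in the text, and a real pipeline would use the BERT tokenizer and a fine-tuning library rather than these stand-in functions.

```python
import random

def pad_to_fixed_length(token_ids, max_len=128, pad_id=0):
    """Truncate or right-pad a token-id sequence to a fixed length,
    mirroring BERT-style preprocessing (pad_id=0 is BERT's [PAD])."""
    ids = list(token_ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle labeled sentences and split them into
    train / validation / test portions (fractions are assumptions)."""
    rng = random.Random(seed)
    data = list(examples)
    rng.shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test
```

The fine-tuned classifier then scores each padded sequence, and the checkpoint with the best validation performance is kept.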
To infuse control into our language generation task and personalize generation toward item features that are relevant to a particular user, we obtain aspects from all reviews of an item following the method from Zhang et al. (2014b). We build a sentiment lexicon to extract fine-grained aspects from the reviews of the dataset and their associated sentiments. For example, a review like “I like the sound of these headphones" would have a positive sentiment towards the aspect “sound". We extract both positive and negative aspects in our dataset, but proceed to use only aspects extracted with a positive sentiment to feed into our language generation pipeline. To summarize, our dataset consists of item reviews, item aspects and their sentiments, comparative sentences for each item, as well as product and user information.
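A toy sketch of the lexicon-based extraction step is shown below. The aspect terms and opinion-word polarities here are illustrative placeholders, as is the sentence-level polarity sum; the actual lexicon and sentiment assignment are induced from the review corpus following Zhang et al. (2014b).

```python
# Illustrative placeholders -- a real lexicon is learned from the corpus.
ASPECTS = {"sound", "price", "battery", "keys"}
OPINION_POLARITY = {"like": 1, "love": 1, "great": 1,
                    "hate": -1, "poor": -1, "muddy": -1}

def extract_positive_aspects(review):
    """Return the aspects mentioned in a review whose overall opinion-word
    polarity is positive; a crude stand-in for lexicon-based extraction."""
    tokens = review.lower().replace(".", " ").split()
    sentiment = sum(OPINION_POLARITY.get(t, 0) for t in tokens)
    aspects = {t for t in tokens if t in ASPECTS}
    return aspects if sentiment > 0 else set()
```

For the review “I like the sound of these headphones”, this sketch would extract the aspect “sound” with positive sentiment; only such positively labeled aspects are fed into the generation pipeline.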
We provide a dataset of 258,816 comparative sentences, their associated reviews, and user information. We validate our automatic labeling approach in a human evaluation study in which we randomly select 200 comparative sentence instances to be labeled by three different crowd workers. For 92% of instances, at least two of the three crowd workers agreed that the sentence is comparative.
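The agreement criterion is a simple 2-of-3 majority vote over worker labels; a minimal sketch (the function name and label strings are our own, for illustration) is:

```python
from collections import Counter

def majority_agreement_rate(labels_per_item, positive="comparative", k=2):
    """Fraction of items for which at least k of the collected worker
    labels equal the positive label (here, 2-of-3 majority agreement)."""
    hits = sum(1 for labels in labels_per_item
               if Counter(labels)[positive] >= k)
    return hits / len(labels_per_item)
```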
4.3.2 Human Evaluation

To further evaluate our method, we conduct human experiments on (1) Comparativeness, where we ask human raters whether a sentence compares the item or a feature of the item; (2) Relevance, where we ask whether the generated sentence is relevant to the item(s) in question; and (3) Fidelity, where we ask whether the generated sentence is truthful with respect to the review and aspect information provided.