Differential weight and meaning of review stars
One of the most common online retail experiences is seeing a product or service rated with one to five stars. But should we weight each rating equally? What makes a given rating more or less helpful for discerning whether the product is any good, or what its critical deficiencies are?
The least helpful of all reviews is a five-star review with no clarifying content. Somebody thought an experience was perfect but didn’t articulate why. Thousands of five-star reviews do provide some degree of signal, but there are just too many poor incentives for both the merchant and the platform to pad such reviews.
This review padding shows up in all kinds of interesting ways, such as vendors offering free gifts to customers who write five-star reviews. This is usually against a platform’s terms of service but almost never results in punitive action, because it’s in the platform’s interest for items to carry high ratings. You can also see this in many service-oriented businesses, where anything less than five stars is treated as a huge deal. Consumers get pressured in all kinds of ways to give only five-star reviews. All of these factors mean that a five-star review carries very little signal unless it is accompanied by a detailed explanation and analysis of what happened that was authentically so amazing.
This makes one-star reviews by definition more believable, since they work against the interests of the merchant and the platform. Something that has 100 five-star reviews and 50 one-star reviews probably actually exists as a product, versus something that has 50 five-star reviews and no one-star reviews. One-star reviews are somewhat suspect, however, because it’s pretty rare that something really was abysmally bad. There is a strong selection bias in review systems: only the people who had a really terrible experience will bother writing anything meaningful or giving less than five stars, which means one-star reviews are disproportionately written by cranks and don’t accurately reflect the value of the exchange.
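One way to make this intuition concrete is a toy scoring function. The function, thresholds, and weights below are my own illustration rather than any real platform’s formula; the only idea taken from the argument above is that an all-five-star histogram is suspect, while the presence of one-star reviews suggests real, unfiltered buyers.

```python
def believability(histogram):
    """Score how believable a review histogram looks, on a 0-1 scale.

    histogram: dict mapping star value (1-5) to review count.
    A purely illustrative heuristic: exclusively five-star catalogs
    get a low fixed score, and one-star reviews raise credibility
    because they cut against merchant and platform incentives.
    """
    total = sum(histogram.values())
    if total == 0:
        return 0.0
    fives = histogram.get(5, 0)
    ones = histogram.get(1, 0)
    if fives == total:
        # Nothing but five-star reviews: likely padded.
        return 0.2
    # Baseline credibility plus a bonus for the one-star share.
    return min(1.0, 0.5 + ones / total)

# The two hypothetical products from the text:
mixed = believability({5: 100, 1: 50})   # has dissenting voices
padded = believability({5: 50})          # suspiciously perfect
```

Under this sketch, the 100-and-50 product scores well above the 50-five-stars-only product, matching the "probably actually exists" judgment above.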
So now we get to two-star reviews, which acknowledge some modicum of underlying value but are still deliberately, incredibly punitive. This is where the reviews start to get interesting, because if you leave a two-star review it’s usually with context and commentary about a mixture of disappointment and positivity, leaning towards disappointment. On the other end of the scale, four-star reviews offer nuance about how an experience was exciting but not quite perfect. These are usually highly believable reviews.
My absolute favorites are probably the three-star reviews. To rank something as three stars requires believing that a substantial number of the offering’s components are valuable and interesting and were worth your time to evaluate, while holding strong beliefs about how the offering could be improved.
For services, the scale gets compressed even further: five stars means somebody did a basically OK but not necessarily great job, four stars means they did a very poor job and need immediate retraining, and three stars or below is basically a statement that the person needs to be fired immediately.
Absolute ranking of quality is difficult for the above reasons. One way around this that hasn’t been widely adopted yet is comparative ranking, which people tend to give much more readily. At scale we’ve seen rating averages on Yelp compress to between roughly 3.5 and 4.2: below 3.5, something is probably awful; above 4.2, it’s probably great; but this narrow window contains nearly all companies and services that have accumulated a reasonable number of reviews. Within it, the difference between a 3.7 and a 3.9 average is pretty difficult to discern and may depend as much on location and demographics as on objective quality. So in effect Yelp produces only four tiers of rating: bad, ok, great, and not-enough-data.
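The four-tier collapse described above can be sketched as a tiny classifier. The 3.5 and 4.2 thresholds come from the text; the minimum review count is an assumption I've added to stand in for "a reasonable number of reviews."

```python
def yelp_tier(avg_rating, review_count, min_reviews=30):
    """Collapse a Yelp-style average into the four effective tiers.

    Thresholds 3.5 and 4.2 are the window described in the text;
    min_reviews is an illustrative cutoff for "enough data."
    """
    if review_count < min_reviews:
        return "not-enough-data"
    if avg_rating < 3.5:
        return "bad"
    if avg_rating > 4.2:
        return "great"
    # The narrow window where nearly everything lands.
    return "ok"
```

Note that both a 3.7 and a 3.9 average fall into the same "ok" tier here, which is the point: the decimal precision Yelp displays implies more resolution than the signal actually carries.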
I could imagine an interesting implementation of comparative rating in online retail, where some small percentage of shoppers receive two competing items and get a nice bonus when they return the one they liked least, along with an explanation of why it wasn’t as good as the other. Naturally, there is a cost to running this experiment: floating the extra item, the logistics and reverse logistics of shipping it to and from the consumer, and of course the breakage that will inevitably result from transport, as well as items that never come back. But done judiciously, my hunch is that a lot of value could be derived from this truly objective signal of consumer preference in key categories. It’s hard to do with everything; a consumer isn’t going to want to install two different fridges and return the one they like less. That’s where comparative review sites can help. But for a midsize purchase like a Bluetooth speaker, “buy one, get two, return the one you liked less” could work well.
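Aggregating the results of such a program is straightforward: each keep/return decision is a head-to-head matchup, and items can be ranked by the share of matchups they won. This is my own minimal sketch of that aggregation step, not part of the proposal above; a production system would likely want something like a Bradley-Terry model to handle sparse, unbalanced pairings.

```python
from collections import defaultdict

def rank_by_wins(comparisons):
    """Rank items from the hypothetical buy-two-return-one program.

    comparisons: list of (kept_item, returned_item) pairs, where the
    kept item "wins" the head-to-head matchup. Items are ranked by
    their win rate across all matchups they appeared in.
    """
    wins = defaultdict(int)
    appearances = defaultdict(int)
    for kept, returned in comparisons:
        wins[kept] += 1
        appearances[kept] += 1
        appearances[returned] += 1
    return sorted(appearances,
                  key=lambda item: wins[item] / appearances[item],
                  reverse=True)

# Three hypothetical speaker models, A kept over B and C, B kept over C:
ranking = rank_by_wins([("A", "B"), ("A", "C"), ("B", "C")])
```

The appeal is that each data point is a forced choice between concrete alternatives, so none of the star-inflation incentives discussed earlier apply.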
What’s your take on how we can get the clearest signal on the quality of a product or service without misaligned incentives?