Tips, Tricks, and Hacks
Programming should not really be thought of as an amalgamation of different tips, tricks, and hacks. But also, I would be wrong to deny the power of tips, tricks, and hacks. They are an important piece of the pie. And every once in a while you are just missing one crucial piece of information to get you where you need to be.
The frontend world in particular seems to have an affinity for these nuggets of knowledge (a la CSS-tricks). I also see a lot of helpful tips and tricks with terminal, bash aliases, git, certain hotkey combinations, etc.
Today, however, I am coming to you today from the SQL world. Hopefully this trick will be as helpful to you as it has been to me.
The Situation
Once in a blue moon there arises some business logic that seems like it might be impossible for SQL to solve—or, perhaps more accurately, the solutions that are coming to mind just seem to be too complex, or messy, and you don’t want to deal with them.
I was in this situation not too long ago, and the end result of that situation ended up in me adding a new tool to my SQL toolbelt. And it was so simple that I felt like people needed to have easier access to it!
Introducing SELECT DISTINCT ON
Here’s how it goes.
I have a database of dogs
and each dog has a breed
, and one or many weights
which are stored in the dog_weight
’s table. Each weight has a value_in_pounds
and a best_in_show
boolean.
(I’m guessing that if you clicked on and are reading a SQL blog post, you probably also want to see the schema in SQL)
CREATE TABLE dog (
id INT,
breed TEXT
);
CREATE TABLE dog_weight (
id INT,
dog_id INT,
value_in_pounds INT,
best_in_show BOOL
);
The one last detail is that there can only be one best_in_show
weight per dog
breed.
With this schema, my goal is to select what can be thought of as the “best-in-show weight” for each dog breed. BUT, some breed
’s have never won best-in-show, so they don’t have a best-in-show weight—instead they have one or more rows which have best_in_show
set to FALSE
. And in this case, we actually want to return the highest weight
for that breed
. So really, I’m trying to select the “best-in-show weight, or the highest weight on record for that dog”. So how do we do this?
SELECT dog.id, dog.breed, dog_weight.value_in_pounds FROM dog INNER JOIN dog_weight ON dog.id = dog_weight.dog_id WHERE dog_weight.best_in_show = true;
This query is sort of almost correct, but will not return the breeds which have not won best-in-show (which is really where the complexity comes from, and returning the heaviest dog of that breed in that circumstance).
As always in programming, and in this case with SQL, there’s many different paths that could arrive at the correct query. But one of the simplest paths is with this query:
SELECT DISTINCT ON (dog.breed) dog.id, dog.breed, dog_weight.value_in_pounds FROM dog INNER JOIN dog_weight ON dog.id = dog_weight.dog_id ORDER BY dog.breed DESC, dog_weight.best_in_show DESC, dog_weight.value_in_pounds DESC;
What’s really happening here is that the results are being ordered in a very specific way, such that the first result in each grouping of dog breeds is the one we want to select.
They are ordered first by their breed
(which groups them together by breed), and next by best_in_show
(which puts the first result for each window of breeds as the dog with best_in_show
, if it’s present), and finally by their weight (putting the heaviest dog first as the result for that dog breed, in the case that best_in_show
is not present). Finally, it selects DISTINCT ON (dog.breed)
, which selects only the first result for each dog breed, which is exactly what we are aiming to return.
The PostgreSQL docs can explain it better than I can:
SELECT DISTINCT ON ( expression [, ...] )
keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY.. Note that the “first row” of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. For example:
SELECT DISTINCT ON (location) location, time, report
FROM weather_reports,
ORDER BY location, time DESC;
retrieves the most recent weather report for each location. But if we had not used ORDER BY to force descending order of time values for each location, we’d have gotten a report from an unpredictable time for each location.
Conclusion
Ordering your results and using SELECT DISTINCT ON
can be a very useful way to reduce a complex query. I think one of the most interesting takeaways from this is that although we normally think of ordering results as being useful in, perhaps, finding the most expensive “something” or most recently created “other thing”, ordering can also be used as a tool to group results.
Hopefully you find use for this variation of SELECT DISTINCT
in your SQL life.
Loved the article? Hated it? Didn’t even read it?
We’d love to hear from you.
The real issue and a headache is that you can have a few best_in_show dogs for one breed in this scheme!
That’s true. There should really be a unique constraint or a slightly different SQL schema but I didn’t feel like adding that to the schema definition. We’ll have to suspend our disbelief for this one!
thanks for sharing!