How do you write a SQL statement?
This article covers some best practices for writing SQL queries for data analysts and data scientists. Most of our discussion will concern SQL in general, but we’ll include some notes on features specific to Metabase that make writing SQL a breeze.
Correctness, readability, then optimization: in that order

The standard warning against premature optimization applies here. Avoid tuning your SQL query until you know your query returns the data you’re looking for. And even then, only prioritize optimizing your query if it’s run frequently (like powering a popular dashboard), or if the query traverses a large number of rows. In general, prioritize accuracy (does the query produce the intended results?) and readability (can others easily understand and modify the code?) before worrying about performance.

Make your haystacks as small as possible before searching for your needles

Arguably, we’re already getting into optimization here, but the goal should be to tell the database to scan the minimum number of values necessary to retrieve your results. Part of SQL’s beauty is its declarative nature. Instead of telling the database how to retrieve records, you need only tell the database which records you need, and the database should figure out the most efficient way to get that information. Consequently, much of the advice about improving the efficiency of queries is simply about showing people how to use the tools in SQL to articulate their needs with more precision. We’ll review the general order of query execution, and include tips along the way to reduce your search space. Then we’ll talk about three essential tools to add to your utility belt.

First, get to know your data

Familiarize yourself with your data before you write a single line of code by studying the metadata to make sure that a column really does contain the data you expect. The SQL editor in Metabase features a handy data reference tab (accessible via the book icon), where you can browse through the tables in your database, and view their columns and connections (figure 1).

Fig. 1. Use the Data Reference sidebar to view a table's fields.

You can also view sample values for specific columns (figure 2).

Fig. 2. Use the Data Reference sidebar to view sample data.

Metabase gives you many different ways to explore your data: you can X-ray tables, compose questions using the query builder and Notebook Editor, convert a saved question to SQL code, or build from an existing native query. We cover this in other articles; for now, let’s go through the general workflow of a query.

Developing your query

Everyone’s method will differ, but here’s an example workflow to follow when developing a query.
The general order of query execution

Before we get into individual tips on writing SQL code, it’s important to have a sense of how databases will carry out your query. This differs from the reading order (left to right, top to bottom) you use to compose your query. Query optimizers can change the order of the following list, but this general lifecycle of a SQL query is good to keep in mind when writing SQL. We’ll use the execution order to group the tips on writing good SQL that follow. The rule of thumb here is this: the earlier in this list you can eliminate data, the better.
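As a rough sketch, the commonly cited logical order is FROM (with any JOINs), WHERE, GROUP BY, HAVING, SELECT, ORDER BY, and finally LIMIT. The query below annotates each clause with its step in that order. The `orders` table and its rows are hypothetical sample data, meant for an empty scratch database, and `LIMIT` assumes a database that supports it (SQLite, PostgreSQL, and MySQL do; others use different syntax).

```sql
-- Hypothetical sample data, for illustration only.
CREATE TABLE orders (
  id INTEGER PRIMARY KEY,
  product_id INTEGER,
  total NUMERIC
);

INSERT INTO orders (id, product_id, total) VALUES
  (1, 1, 5),
  (2, 1, 40),
  (3, 1, 70),
  (4, 2, 30),
  (5, 2, 20),
  (6, 3, 200);

-- Clauses appear in reading order; the comments give the
-- commonly cited logical processing order.
SELECT                       -- 5. pick the output columns
  product_id,
  SUM(total) AS revenue
FROM
  orders                     -- 1. choose the source table(s)
WHERE
  total >= 10                -- 2. drop rows before any aggregation
GROUP BY
  product_id                 -- 3. group the remaining rows
HAVING
  SUM(total) > 60            -- 4. drop groups after aggregation
ORDER BY
  revenue DESC               -- 6. sort what is left
LIMIT 10;                    -- 7. cap the number of rows returned
```

Against the sample rows, this returns product 3 (revenue 200) followed by product 1 (revenue 110). The 5-unit order is dropped by WHERE before any aggregation happens, while product 2 (revenue 50) is only dropped by HAVING after its rows have been grouped and summed.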
And, of course, there will always be occasions where the query optimizer for your particular database will devise a different query plan, so don’t get hung up on this order.

Some query guidelines (not rules)

The following tips are guidelines, not rules, intended to keep you out of trouble. Each database handles SQL differently, has a slightly different set of functions, and takes different approaches to optimizing queries. And that’s before we even get into comparing traditional transactional databases with analytics databases that use columnar storage formats, which have vastly different performance characteristics.

Comment your code, especially the why

Help people out (including yourself three months from now) by adding comments that explain different parts of the code. The most important thing to capture here is the “why.” For example, it’s obvious that the code below filters for orders with an ID greater than 10, but the reason it’s doing that is that the first 10 orders are used for testing.
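A minimal sketch of such a comment might look like the following. The `orders` table, its columns, and the sample rows are assumptions made for illustration, and the block is meant for an empty scratch database.

```sql
-- Hypothetical table and rows, for illustration only.
CREATE TABLE orders (
  id INTEGER PRIMARY KEY,
  product TEXT
);

INSERT INTO orders (id, product) VALUES
  (3, 'Test widget'),
  (10, 'Test gadget'),
  (11, 'Widget'),
  (12, 'Gadget');

SELECT
  id,
  product
FROM
  orders
-- The first 10 orders were created for testing, so exclude them.
WHERE
  id > 10;
```

The comment records the reason for the filter rather than restating the condition, which is the part a reader cannot work out from the code alone.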
The catch here is that you introduce a little maintenance overhead: if you change the code, you need to make sure that the comment is still relevant and up to date. But that’s a small price to pay for readable code.

SQL best practices for FROM

Join tables using the ON keyword

Although it’s possible to “join” two tables using a WHERE clause (that is, to perform an implicit join by listing both tables in FROM and matching their keys in WHERE), you should instead prefer an explicit JOIN.
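To make the contrast concrete, here is a minimal sketch that runs the same lookup both ways. The sample rows and the `total > 20` filter are assumptions for illustration, and the block is meant for an empty scratch database.

```sql
-- Hypothetical sample data, for illustration only.
CREATE TABLE products (
  id INTEGER PRIMARY KEY,
  vendor TEXT
);

CREATE TABLE orders (
  id INTEGER PRIMARY KEY,
  product_id INTEGER,
  total NUMERIC
);

INSERT INTO products (id, vendor) VALUES
  (1, 'Acme'),
  (2, 'Globex');

INSERT INTO orders (id, product_id, total) VALUES
  (1, 1, 25),
  (2, 2, 40),
  (3, 1, 15);

-- Implicit join: the join condition sits in WHERE, mixed in with the filter.
SELECT
  o.id,
  o.total,
  p.vendor
FROM
  orders AS o,
  products AS p
WHERE
  o.product_id = p.id
  AND o.total > 20;

-- Explicit join: ON holds the join condition, WHERE holds only the filter.
SELECT
  o.id,
  o.total,
  p.vendor
FROM
  orders AS o
  JOIN products AS p ON o.product_id = p.id
WHERE
  o.total > 20;
```

Both queries return the same two orders (IDs 1 and 2). The difference is in how quickly a reader can tell which condition links the tables and which one narrows the results.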
This is mostly for readability, as the JOIN + ON syntax distinguishes joins from WHERE clauses intended to filter the results.

Alias multiple tables

When querying multiple tables, use aliases, and employ those aliases in your select statement, so the database (and your reader) doesn’t need to parse which column belongs to which table. Note that if you have columns with the same name across multiple tables, you will need to explicitly reference them with either the table name or alias.
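As a minimal sketch of the aliasing advice, the two queries below read the same data, first with unqualified column names and then with aliases. The `books` and `authors` tables and their rows are hypothetical, chosen only for illustration, and the block is meant for an empty scratch database.

```sql
-- Hypothetical sample data, for illustration only.
CREATE TABLE authors (
  id INTEGER PRIMARY KEY,
  first_name TEXT,
  last_name TEXT
);

CREATE TABLE books (
  id INTEGER PRIMARY KEY,
  title TEXT,
  author_id INTEGER
);

INSERT INTO authors (id, first_name, last_name) VALUES
  (1, 'Ursula', 'Le Guin'),
  (2, 'Octavia', 'Butler');

INSERT INTO books (id, title, author_id) VALUES
  (1, 'The Dispossessed', 1),
  (2, 'Kindred', 2);

-- Avoid: the reader has to look up which table each column lives in.
-- Adding a bare "id" to this list would fail as ambiguous, because
-- both tables have a column with that name.
SELECT
  title,
  last_name,
  first_name
FROM
  books
  LEFT JOIN authors ON books.author_id = authors.id;

-- Prefer: every column states its source, so b.id is unambiguous.
SELECT
  b.id,
  b.title,
  a.last_name,
  a.first_name
FROM
  books AS b
  LEFT JOIN authors AS a ON b.author_id = a.id;
```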
This is a trivial example, but when the number of tables and columns in your query increases, your readers won’t have to track down which column is in which table. On top of that, your queries might break if you join a table with an ambiguous column name (e.g., both tables include a field with the same name).

Note that field filters are incompatible with table aliases, so you’ll need to remove aliases when connecting filter widgets to your Field Filters.

SQL best practices for WHERE

Filter with WHERE before HAVING

Use a WHERE clause to filter superfluous rows, so you don’t have to compute those values in the first place. Only after removing irrelevant rows, and after aggregating those rows and grouping them, should you include a HAVING clause to filter out aggregates.

Avoid functions on columns in WHERE clauses

Using a function on a column in a WHERE clause can really slow down your query, as the function makes the query non-sargable (i.e., it prevents the database from using an index to speed up the query). Instead of using the index to skip to the relevant rows, the function on the column forces the database to run the function on each row of the table.

And remember, the concatenation operator || is also a function, so don’t get fancy trying to concat strings to filter multiple columns. Prefer multiple conditions instead.
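A minimal sketch of the difference, assuming a hypothetical `superheroes` table with an index on `hero`, in an empty scratch database. It also assumes `||` concatenates strings, as it does in SQLite and PostgreSQL (MySQL treats it as a logical OR by default).

```sql
-- Hypothetical sample data, for illustration only.
CREATE TABLE superheroes (
  id INTEGER PRIMARY KEY,
  hero TEXT,
  sidekick TEXT
);

CREATE INDEX idx_superheroes_hero ON superheroes (hero);

INSERT INTO superheroes (id, hero, sidekick) VALUES
  (1, 'Batman', 'Robin'),
  (2, 'Green Hornet', 'Kato'),
  (3, 'Bat', 'manRobin');

-- Avoid: the concatenation must be computed for every row, so the
-- index on hero typically cannot be used. It also matches row 3,
-- since 'Bat' || 'manRobin' produces the same string.
SELECT
  hero,
  sidekick
FROM
  superheroes
WHERE
  hero || sidekick = 'BatmanRobin';

-- Prefer: plain comparisons on the columns, which the database can
-- match against the index on hero.
SELECT
  hero,
  sidekick
FROM
  superheroes
WHERE
  hero = 'Batman'
  AND sidekick = 'Robin';
```

With these rows, the first query returns rows 1 and 3, while the second returns only row 1. Separate conditions are both friendlier to the index and more precise about what is being matched.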
Prefer an explicit JOIN with table aliases:
SELECT
  o.id,
  o.total,
  p.vendor
FROM
  orders AS o
  JOIN products AS p ON o.product_id = p.id