To answer the OP: no. Your compiler will never do the optimization you want it to do, no matter how far up the abstraction you try to move.
The fundamental issue isn't just that your compiler doesn't understand SQL; it's that your compiler doesn't understand how the data is, or will be, stored. It's blind to the current state of the dataset.
For example, maybe the data is stored in a hashtable and is rarely written. In that case, N+1 queries might actually be the right way to read the data, since they avoid excessive read locks across the database.
Perhaps the data has one, two, or several indexes created against it. Those indexes can change at any time, and they have to be created deliberately, since building them can take a lot of resources.
An RDBMS builds up a huge amount of statistics for optimization purposes. That's all information you'd have to have the compiler JIT into the application to get similar performance, or even to make a well-informed decision about how to do the optimization.
They are consistency boundaries too.
> This problem is usually caused by a leaky abstraction; the ORM, or whatever database abstraction you are using, can’t anticipate that it would need to send N queries, so it can’t automatically optimize this down to a single query.
If you do an if-statement with an ORM before issuing an update, the result of the if-statement can already be stale by the time the update occurs. If you skip the ORM and express your condition as a WHERE clause instead, it's much less of a problem.
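A minimal sketch of the difference in Django terms, assuming a hypothetical `Account` model with a `balance` field (the model and names are illustrative, not from the comment):

```python
from django.db.models import F

from myapp.models import Account  # hypothetical model with a `balance` field

# Race-prone: the balance checked here can be stale by the time save()
# issues its UPDATE; another transaction may have changed it in between.
account = Account.objects.get(pk=1)
if account.balance >= 100:
    account.balance -= 100
    account.save()

# Less of a problem: the condition travels with the update, so the database
# evaluates it atomically, roughly:
#   UPDATE account SET balance = balance - 100
#   WHERE id = 1 AND balance >= 100
rows = Account.objects.filter(pk=1, balance__gte=100).update(
    balance=F("balance") - 100
)
```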
> However, this only works since Haskell is declarative / pure; the low-level operational semantics (like evaluation order) are abstracted away from the programmer, and as such are amenable to optimization.
What else is declarative / pure? SQL. Not ORMs.
> What else is declarative / pure? SQL.
Thank you. People so often forget (or don’t realize) that their RDBMS is also doing a ton of optimization on their query, and a primary reason it’s able to do so in real time is that SQL is declarative.
The author is right to note that Haskell can optimise across module (abstraction) boundaries. However I remember from my childhood that Debray [1] did a lot of work on link-time optimisation (late 1980s). And of course there's the classic Self work that fed into the JVM [2], and the whole-of-program compilers that are receiving renewed attention now; MLton [3] is a classic of the genre, with "supercompilation" being the active area of research AIUI. So at least sometimes abstraction boundaries are transparent to optimisers.
On the other hand, the classic data abstraction story (signatures/interfaces for structures/modules) naturally allows for selecting or optimising implementations depending on how they're used. There was some great work done in the early 2000s on that (see [4]), and I'm sure the state of the art has moved on since [5].
[1] https://dblp.org/pid/d/SKDebray.html
[2] https://en.wikipedia.org/wiki/Self_(programming_language)
[3] http://mlton.org/
[4] https://dblp.org/pid/59/4501.html
[5] https://en.wikipedia.org/wiki/Interprocedural_optimization
> However, what if we raise the abstraction boundary and make the ORM part of the language? This means that we could formulate rewrite rules for the ORM, allowing it to eg merge the N queries into a single query.
It's correct that abstraction boundaries are optimization boundaries, but I don't think you need to make queries part of the language itself to raise the boundary.
To give a concrete example, take the Django ORM in Python. If you write a function that returns a single database record, then calling that function many times in a loop naturally results in an N+1 query. However, if you instead return a QuerySet, what you're returning is a lazily evaluated sequence of records. The caller can then choose whether to evaluate the query immediately (when they only need one record) or to collect a bunch of QuerySets together and union them into a single query.
In other words, we give the caller more control, and thus more opportunity to optimize for their use case.
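A rough sketch of that pattern, assuming a hypothetical `Book` model (the model, field, and function names are mine, not from the comment):

```python
from myapp.models import Book  # hypothetical model with an `author_id` field

def books_by_author(author_id):
    # Returns a lazy QuerySet; no SQL has been sent to the database yet.
    return Book.objects.filter(author_id=author_id)

# Caller that needs a single author's books: evaluate right away (one query).
titles = [b.title for b in books_by_author(42)]

# Caller that needs several authors' books: combine the lazy QuerySets first,
# then evaluate once. OR-ing QuerySets (or using .union()) produces a single
# SELECT with combined conditions instead of N separate queries.
combined = books_by_author(1) | books_by_author(2) | books_by_author(3)
titles = [b.title for b in combined]
```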