Consider using an array
with value_expression in (select unnest(?))
instead of issuing SQL statements in a batch or building dynamic statements
I refer to PostgreSQL in this article, but the unnest
function, the in
subquery expression
and the in
comparison are all standard SQL.
Contents
Problem statement
You have a set of n
identifiers (IDs) that cannot be represented by a range,
and you want to
delete
/update
all rows containing these IDs from/in a relational database. How would you do this? What if n
is huge?
Solutions
n
static prepared statements with one parameter in a batch
You could issue n
SQL prepared statements
with one parameter1, a.k.a. bind variable:
delete from my_table where id = ?;
Of course, you would not want to issue the commands one by one—it is better to organize them into a batch.
With JDBC this can be done by using
java.sql.PreparedStatement.addBatch()
/java.sql.Statement.executeBatch()
/java.sql.Statement.executeLargeBatch()
.
Despite the commands being issued more efficiently this way, you still request n
commands which a DBMS executes one by one.
It is reasonable to assume that executing n
commands takes more time than executing a single one that does the same thing as those n
commands,
and it seems to be true according to
"The Performance Difference Between SQL Row-by-row Updating, Batch Updating, and Bulk Updating" by
Lukas Eder working with JOOQ.
A dynamic prepared statement with n
parameters
You may dynamically build a single SQL statement with n
parameters specified for the in
comparison:
delete from my_table where id in (?, ?, ...);
However, for a large enough n
you may face a limit imposed by a (poor?) implementation of a JDBC driver,
like the one described here:
java.io.IOException: Tried to send an out-of-range integer as a 2-byte value: 100000
at org.postgresql.core.PGStream.SendInteger2(PGStream.java:201)
which happened when 100_000 values were specified for the in
comparison.
Another problem with this approach, is that you may end up generating many similar SQL statements which differ only by the number of bind variables.
If there is a cache of execution plans in the DBMS or a cache of prepared statements in the JDBC API implementation / JDBC driver,
then not only you can hardly benefit from it, but you also pollute the cache.
Hibernate ORM tries to mitigate this by using in
clause parameter padding
(see "How to improve statement caching efficiency with IN clause parameter padding" by Vlad Mihalcea
and hibernate.query.in_clause_parameter_padding
).
A static prepared statement with one parameter of the array
type of size n
As a result of all the aforementioned, it appears to me that a good option may be to use
the SQL array
type represented by
java.sql.Array
in JDBC:
delete from my_table where id in (select unnest(?));
We can create an array
with
java.sql.Connection.createArrayOf(String typeName, Object[] elements)
and specify it by using
PreparedStatement.setArray(int parameterIndex, Array x)
.
The reason for using the unnest
function is that the in
subquery expression expects (you guessed it)
a subquery which returns a set of rows, a.k.a. a table, not an array
.
Note that previously we were using the in
comparison, while now we are using the in
subquery expression.
The function unnest
converts an array
to a set of rows, this is also called "flattening".
What about the performance of the in
comparison with n
parameters and
the in
subquery expression with a single parameter of the array
type?
I am so glad the measurements have already been done and described in
"SQL IN Predicate: With IN List or With Array? Which is Faster?" by
Lukas Eder working with JOOQ. In short:
- for PostgreSQL, the approach with
n
parameters seem to result in a smaller latency than the approach with anarray
forn
< 128, and the situations changes in favour of using anarray
forn
>= 128; - for Oracle Database, the approach with an
array
is at least not worse than the approach withn
parameters if we ask it to determine thearray
cardinality with the/*+gather_plan_statistics*/
hint.
As we can see, there is no simple answer showing the performance of one approach being always better than the performance of the other approach,
but looks like at least for big enough n
the approach with an array
results in smaller latencies while also having the benefit of not polluting caches.
By the way, Hibernate ORM may also use this technique in the future.
JDBC example
This technique turns out especially handy when you have multiple sets of IDs and want to request different updates for each set. It allows you to have a single SQL statement for each set of IDs and issue all such commands in a batch. Here is an example code demonstrating the situation:
Map<String, Set<Long>> valueToIds = ...;
JdbcTemplate jdbcTemplate = ...;
jdbcTemplate.execute((Connection connection) -> {
try (PreparedStatement statement = connection.prepareStatement(
"update my_table set value = ? where id in (select unnest(?))")) {
valueToIds.forEach((value, ids) -> {
try {
statement.setString(1, value);
statement.setArray(2, connection.createArrayOf(
JDBCType.BIGINT.getName(), ids.toArray()));
statement.addBatch();
} catch (SQLException e) {
throw new RuntimeException(e);
}
});
return statement.executeBatch();
}
});
A temporary table with n
rows
A cardinally different and significantly less convenient approach is to create a temporary table which contains all n
IDs
and then use inner join
by utilizing PostgreSQL-specific using
clause:
delete from my_table as t using tmp_table as tmp where t.id = tmp.id;
or with standard SQL syntax:
delete from my_table as t where t.id in (select id from tmp_table);
This technique is also described here. It is not easy to imagine a situation in which this approach may be needed, but this comment seems to describe one such situation.
Notes
"100x faster Postgres performance by changing 1 line" by Alexis Lê-Quôc a co-founder of Datadog reports poor performance of
value_expression = any(array[v1, v2, ...])
in contrast with
value_expression = any(values (v1), (v2), ...)
for PostgreSQL 9.0, but says the problem was fixed in PostgreSQL 9.3.
The reason I am mentioning this difference between the any
comparison and the any
subquery expression here is that
the in
subquery expression is equivalent to the =
any
subquery expression according to the docs;
therefore, the mentioned performance bug probably also affected SQL commands with the in
subquery expression.
Note also that in PostgreSQL the any
comparison accepts an array
expression, while the in
comparison accepts a list of
value expressions, a.k.a. scalar expressions.
-
JDBC API Specification also supports prepared statements via
java.sql.PreparedStatement
. Section "13.2 The PreparedStatement Interface" of the JDBC API Specification 4.3 states"Parameter markers, represented by "
I am using this JDBC SQL syntax in the article. ?
" in the SQL string, are used to specify input values to the statement that may vary at runtime."