Consider using an `array` with `value_expression in (select unnest(?))` instead of issuing SQL statements in a batch or building dynamic statements

SQL²

Posted on 2019-02-14, last updated on 2021-12-19

I refer to PostgreSQL in this article, but the unnest function, the in subquery expression and the in comparison are all standard SQL.

Problem statement
Solutions
Notes

Problem statement

You have a set of n identifiers (IDs) that cannot be represented by a range, and you want to delete/update all rows containing these IDs from/in a relational database. How would you do this? What if n is huge?

Solutions

`n` static prepared statements with one parameter in a batch

You could issue n SQL prepared statements with one parameter¹, a.k.a. bind variable:

delete from my_table where id = ?;

Of course, you would not want to issue the commands one by one—it is better to organize them into a batch. With JDBC this can be done by using java.sql.PreparedStatement.addBatch()/ java.sql.Statement.executeBatch()/ java.sql.Statement.executeLargeBatch(). Despite the commands being issued more efficiently this way, you still request n commands which a DBMS executes one by one. It is reasonable to assume that executing n commands takes more time than executing a single one that does the same thing as those n commands, and it seems to be true according to "The Performance Difference Between SQL Row-by-row Updating, Batch Updating, and Bulk Updating" by Lukas Eder working with JOOQ.

A dynamic prepared statement with `n` parameters

You may dynamically build a single SQL statement with n parameters specified for the in comparison:

delete from my_table where id in (?, ?, ...);

However, for a large enough n you may face a limit imposed by a (poor?) implementation of a JDBC driver, like the one described here:

java.io.IOException: Tried to send an out-of-range integer as a 2-byte value: 100000
    at org.postgresql.core.PGStream.SendInteger2(PGStream.java:201)

which happened when 100_000 values were specified for the in comparison.

Another problem with this approach, is that you may end up generating many similar SQL statements which differ only by the number of bind variables. If there is a cache of execution plans in the DBMS or a cache of prepared statements in the JDBC API implementation / JDBC driver, then not only you can hardly benefit from it, but you also pollute the cache. Hibernate ORM tries to mitigate this by using in clause parameter padding (see "How to improve statement caching efficiency with IN clause parameter padding" by Vlad Mihalcea and hibernate.query.in_clause_parameter_padding).

A static prepared statement with one parameter of the `array` type of size `n`

As a result of all the aforementioned, it appears to me that a good option may be to use the SQL array type represented by java.sql.Array in JDBC:

delete from my_table where id in (select unnest(?));

We can create an array with java.sql.Connection.createArrayOf(String typeName, Object[] elements) and specify it by using PreparedStatement.setArray(int parameterIndex, Array x). The reason for using the unnest function is that the in subquery expression expects (you guessed it) a subquery which returns a set of rows, a.k.a. a table, not an array. Note that previously we were using the in comparison, while now we are using the in subquery expression. The function unnest converts an array to a set of rows, this is also called "flattening".

What about the performance of the in comparison with n parameters and the in subquery expression with a single parameter of the array type? I am so glad the measurements have already been done and described in "SQL IN Predicate: With IN List or With Array? Which is Faster?" by Lukas Eder working with JOOQ. In short:

for PostgreSQL, the approach with n parameters seem to result in a smaller latency than the approach with an array for n < 128, and the situations changes in favour of using an array for n >= 128;
for Oracle Database, the approach with an array is at least not worse than the approach with n parameters if we ask it to determine the array cardinality with the /*+gather_plan_statistics*/ hint.

As we can see, there is no simple answer showing the performance of one approach being always better than the performance of the other approach, but looks like at least for big enough n the approach with an array results in smaller latencies while also having the benefit of not polluting caches. By the way, Hibernate ORM may also use this technique in the future.

JDBC example

This technique turns out especially handy when you have multiple sets of IDs and want to request different updates for each set. It allows you to have a single SQL statement for each set of IDs and issue all such commands in a batch. Here is an example code demonstrating the situation:

Map<String, Set<Long>> valueToIds = ...;
JdbcTemplate jdbcTemplate = ...;
jdbcTemplate.execute((Connection connection) -> {
    try (PreparedStatement statement = connection.prepareStatement(
            "update my_table set value = ? where id in (select unnest(?))")) {
        valueToIds.forEach((value, ids) -> {
            try {
                statement.setString(1, value);
                statement.setArray(2, connection.createArrayOf(
                        JDBCType.BIGINT.getName(), ids.toArray()));
                statement.addBatch();
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        });
        return statement.executeBatch();
    }
});

A temporary table with `n` rows

A cardinally different and significantly less convenient approach is to create a temporary table which contains all n IDs and then use inner join by utilizing PostgreSQL-specific using clause:

delete from my_table as t using tmp_table as tmp where t.id = tmp.id;

or with standard SQL syntax:

delete from my_table as t where t.id in (select id from tmp_table);

This technique is also described here. It is not easy to imagine a situation in which this approach may be needed, but this comment seems to describe one such situation.

Notes

"100x faster Postgres performance by changing 1 line" by Alexis Lê-Quôc a co-founder of Datadog reports poor performance of

value_expression = any(array[v1, v2, ...])

in contrast with

value_expression = any(values (v1), (v2), ...)

for PostgreSQL 9.0, but says the problem was fixed in PostgreSQL 9.3.

The reason I am mentioning this difference between the any comparison and the any subquery expression here is that the in subquery expression is equivalent to the = any subquery expression according to the docs; therefore, the mentioned performance bug probably also affected SQL commands with the in subquery expression.

Note also that in PostgreSQL the any comparison accepts an array expression, while the in comparison accepts a list of value expressions, a.k.a. scalar expressions.

JDBC API Specification also supports prepared statements via java.sql.PreparedStatement. Section "13.2 The PreparedStatement Interface" of the JDBC API Specification 4.3 states "Parameter markers, represented by "?" in the SQL string, are used to specify input values to the statement that may vary at runtime." I am using this JDBC SQL syntax in the article.

Consider using an array with value_expression in (select unnest(?)) instead of issuing SQL statements in a batch or building dynamic statements

Contents

Problem statement

Solutions

n static prepared statements with one parameter in a batch

A dynamic prepared statement with n parameters

A static prepared statement with one parameter of the array type of size n