We were executing the full statement (SELECT rebalance_table_shards('distributed_table_name');) against the distributed table when we received the error. We ended up opening an Azure Support ticket. They first said there was an issue with ca.pem creation while the new nodes were being deployed, and they applied a hot-fix as well as an actual fix to be rolled out across all regions by the end of the week. The same error occurred in spite of this. This was the final fix and so far everything is working now, including rebalancing after scaling out additional nodes again.
At our end, we found that it may happen when the operation doesn't require data transfer.
We ran the following tests at our end.
A SELECT master_move_shard_placement() call doesn't use logical replication by default, and it continues to give the same error when we run it with 'force_logical':
citus=> select master_move_shard_placement(102538, '10.0.0.34', 5432, '10.0.0.33', 5432, 'force_logical');
ERROR: could not connect to the publisher: SSL error: tlsv1 alert unknown ca
CONTEXT: while executing command on 10.0.0.33:5432
Weirdly, this happens only when we need to move shards between newly added workers. So, the following queries work fine.
select master_move_shard_placement(102538, '10.0.0.34', 5432, '10.0.0.15', 5432, 'force_logical');
master_move_shard_placement
-----------------------------
(1 row)
select master_move_shard_placement(102538, '10.0.0.15', 5432, '10.0.0.33', 5432, 'force_logical');
master_move_shard_placement
-----------------------------
(1 row)
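Before picking a source and target for a manual move like the ones above, it can help to confirm which worker currently holds the shard. A minimal sketch, assuming the pg_dist_shard_placement metadata view is available on the coordinator in your Citus version; the shard ID is the one from the tests above:

```sql
-- Show which worker node(s) currently hold shard 102538.
-- Assumes the pg_dist_shard_placement view exists in this Citus version.
SELECT shardid, nodename, nodeport
FROM pg_dist_shard_placement
WHERE shardid = 102538;
```

The nodename and nodeport returned here are the values to pass as the source arguments of master_move_shard_placement.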
Therefore, for the moment, engineering suggested that you try once more, which you already did, and it is working again. According to engineering, if the rebalancing doesn't require data transfer from another newly added worker to w6 (10.0.0.33), the operation would fail.
We had the same incident before and there is an issue about this in the citus-enterprise repository. We do not know the fix for this yet.
As a last resort, if it fails again, you can use the rebalance_table_shards function with the option shard_transfer_mode := 'block_writes' and see if that works for you.
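The last-resort call could look roughly like the following. This is a sketch rather than a verified fix: 'distributed_table_name' is the placeholder from the original statement, and the named-argument form assumes your Citus version accepts shard_transfer_mode on rebalance_table_shards.

```sql
-- Rebalance without logical replication.
-- Note: 'block_writes' blocks writes to each shard while it is being moved.
SELECT rebalance_table_shards(
    'distributed_table_name',
    shard_transfer_mode := 'block_writes'
);
```

Since writes to a shard are blocked for the duration of its move, it is probably best to run this during a low-traffic window.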
Source: MSDN