你当前正在访问 Microsoft Azure Global Edition 技术文档网站。如果需要访问由世纪互联运营的 Microsoft Azure 中国技术文档网站，请访问 https://docs.azure.cn。

排查和识别 Azure Database for PostgreSQL 灵活服务器中运行缓慢的查询

项目
01/09/2025

适用于： Azure Database for PostgreSQL 灵活服务器

本文介绍如何识别和诊断查询运行缓慢的根本原因。

本文介绍以下内容：

如何识别运行缓慢的查询。
如何识别与之一起的运行缓慢的过程。在属于同一运行缓慢的存储过程的查询列表中识别运行缓慢的查询。

先决条件

按照使用故障排除指南中所述的步骤启用故障排除指南。
通过加载 auto_explain 扩展并将其加入允许列表来配置该扩展。
配置 auto_explain 扩展后，更改以下服务器参数，这些参数可控制扩展的行为：
- 将 auto_explain.log_analyze 指定为 ON。
- 将 auto_explain.log_buffers 指定为 ON。
- 根据场景合理需求为 auto_explain.log_min_duration。
- 将 auto_explain.log_timing 指定为 ON。
- 将 auto_explain.log_verbose 指定为 ON。

注意

如果将 auto_explain.log_min_duration 设置为 0，扩展会开始记录服务器上执行的所有查询。这可能会影响数据库的性能。必须进行适当的审慎调查才能得出在服务器上被视为缓慢的值。例如，如果所有查询在 30 秒内完成，并且应用程序可以接受，则建议将参数更新为 30000 毫秒。这样就可以记录完成时间超过 30 秒的所有查询。

识别速度缓慢的查询

有了故障排除指南和 auto_explain 扩展，我们可以通过一个示例来描述该场景。

我们遇到 CPU 使用率飙升到 90% 的情况，想要确定造成飙升的原因。若要对这种情况进行调试，请执行以下步骤：

一旦收到 CPU 场景警报，请在受影响的 Azure Database for PostgreSQL 灵活服务器实例的资源菜单中，选择“监视”部分下的“故障排除指南”。
选择“CPU”选项卡。“优化 CPU 使用率过高的情况”故障排除指南随即打开。
在“分析时段(当地时间)”下拉列表中，选择要要重点分析的时间范围。
选择“查询”选项卡。其中显示了在 CPU 使用率达到 90% 的时间间隔内运行的所有查询的详细信息。从快照来看，该时间间隔内平均执行时间最慢的查询约为 2.6 分钟，该查询在该时间间隔内运行了 22 次。该查询很可能是 CPU 飙升的原因。
若要检索实际查询文本，请连接到 azure_sys 数据库并执行以下查询。

    psql -h <server>.postgres.database.azure.com -U <user> -d azure_sys

     SELECT query_sql_text
     FROM query_store.query_texts_view
     WHERE query_text_id = <query_id>;

在所考虑的示例中，发现运行缓慢的查询是：

SELECT  c_id, SUM(c_balance) AS total_balance
FROM customer
GROUP BY c_w_id, c_id
ORDER BY c_w_id;

若要了解生成的确切说明计划，请使用 Azure Database for PostgreSQL 灵活服务器日志。每次查询在该时间窗口内完成执行时，auto_explain 扩展就会在日志中写入一个条目。在资源菜单中的“监视”部分下，选择“日志”。
选择出现 CPU 利用率达到 90% 的时间范围。
执行以下查询，以检索所识别查询的 EXPLAIN ANALYZE 输出。

AzureDiagnostics
| where Category contains  'PostgreSQLLogs'
| where Message contains "<snippet of SQL text identified or name of any of the tables referenced in the query>"
| project TimeGenerated, Message

消息列存储执行计划，如以下输出所示：

2024-11-10 19:56:46 UTC-6525a8e7.2e3d-LOG: duration: 150692.864 ms plan:

Query Text: SELECT c_id, SUM(c_balance) AS total_balance
FROM customer
GROUP BY c_w_id,c_id
order by c_w_id;
GroupAggregate (cost=1906820.83..2131820.83 rows=10000000 width=40) (actual time=70930.833..129231.950 rows=10000000 loops=1)
Output: c_id, sum(c_balance), c_w_id
Group Key: customer.c_w_id, customer.c_id
Buffers: shared hit=44639 read=355362, temp read=77521 written=77701
-> Sort (cost=1906820.83..1931820.83 rows=10000000 width=13) (actual time=70930.821..81898.430 rows=10000000 loops=1)
Output: c_id, c_w_id, c_balance
Sort Key: customer.c_w_id, customer.c_id
Sort Method: external merge Disk: 225104kB
Buffers: shared hit=44639 read=355362, temp read=77521 written=77701
-> Seq Scan on public.customer (cost=0.00..500001.00 rows=10000000 width=13) (actual time=0.021..22997.697 rows=10000000 loops=1)
Output: c_id, c_w_id, c_balance

查询运行了大约 2.5 分钟，如故障排除指南中所示。所提取的执行计划中的值 duration 150692.864 毫秒已确认该结果。使用 EXPLAIN ANALYZE 的输出进一步排除故障并优化查询。

注意

注意，查询在间隔期间运行了 22 次，并且显示的日志包含时间间隔内捕获的此类条目之一。

识别存储过程中运行缓慢的查询

有了故障排除指南和 auto_explain 扩展，我们将借助示例来解释该场景。

我们遇到 CPU 使用率飙升到 90% 的情况，想要确定造成飙升的原因。若要对这种情况进行调试，请执行以下步骤：

一旦收到 CPU 场景警报，请在受影响的 Azure Database for PostgreSQL 灵活服务器实例的资源菜单中，选择“监视”部分下的“故障排除指南”。
选择“CPU”选项卡。“优化 CPU 使用率过高的情况”故障排除指南随即打开。
在“分析时段(当地时间)”下拉列表中，选择要要重点分析的时间范围。
选择“查询”选项卡。其中显示了在 CPU 使用率达到 90% 的时间间隔内运行的所有查询的详细信息。从快照来看，该时间间隔内平均执行时间最慢的查询约为 2.6 分钟，该查询在该时间间隔内运行了 22 次。该查询很可能是 CPU 飙升的原因。
连接到 azure_sys 数据库并执行查询，以使用以下脚本检索实际查询文本。

    psql -h <server>.postgres.database.azure.com -U <user> -d azure_sys

     SELECT query_sql_text
     FROM query_store.query_texts_view
     WHERE query_text_id = <query_id>;

在所考虑的示例中，发现速度缓慢的查询是一个存储过程：

    call autoexplain_test ();

若要了解生成的确切说明计划，请使用 Azure Database for PostgreSQL 灵活服务器日志。每次查询在该时间窗口内完成执行时，auto_explain 扩展就会在日志中写入一个条目。在资源菜单中的“监视”部分下，选择“日志”。然后，在“时间范围:”中，选择要集中分析的时间窗口。

执行以下查询，以检索所标识查询的说明分析输出。

AzureDiagnostics
| where Category contains  'PostgreSQLLogs'
| where Message contains "<snippet of SQL text identified or name of any of the tables referenced in the query>"
| project TimeGenerated, Message

该过程有多个查询，下面突出显示了这些查询。将记录存储过程中使用的每个查询的解释分析输出，以便进一步分析和排除故障。记录的查询的执行时间可用于识别存储过程中最慢的查询。

2024-11-10 17:52:45 UTC-6526d7f0.7f67-LOG: duration: 38459.176 ms plan:

Query Text: insert into customer_balance SELECT c_id, SUM(c_balance) AS total_balance FROM customer GROUP BY c_w_id,c_id order by c_w_id
Insert on public.customer_balance (cost=1906820.83..2231820.83 rows=0 width=0) (actual time=38459.173..38459.174 rows=0 loops=1)Buffers: shared hit=10108203 read=454055 dirtied=54058, temp read=77521 written=77701 WAL: records=10000000 fpi=1 bytes=640002197
    -> Subquery Scan on "*SELECT*" (cost=1906820.83..2231820.83 rows=10000000 width=36) (actual time=20415.738..29514.742 rows=10000000 loops=1)
        Output: "*SELECT*".c_id, "*SELECT*".total_balance Buffers: shared hit=1 read=400000, temp read=77521 written=77701
            -> GroupAggregate (cost=1906820.83..2131820.83 rows=10000000 width=40) (actual time=20415.737..28574.266 rows=10000000 loops=1)
                Output: customer.c_id, sum(customer.c_balance), customer.c_w_id Group Key: customer.c_w_id, customer.c_id Buffers: shared hit=1 read=400000, temp read=77521 written=77701
                -> Sort (cost=1906820.83..1931820.83 rows=10000000 width=13) (actual time=20415.723..22023.515 rows=10000000 loops=1)
                    Output: customer.c_id, customer.c_w_id, customer.c_balance Sort Key: customer.c_w_id, customer.c_id Sort Method: external merge Disk: 225104kB Buffers: shared hit=1 read=400000, temp read=77521 written=77701
                     -> Seq Scan on public.customer (cost=0.00..500001.00 rows=10000000 width=13) (actual time=0.310..15061.471 rows=10000000 loops=1) Output: customer.c_id, customer.c_w_id, customer.c_balance Buffers: shared hit=1 read=400000

2024-11-10 17:52:07 UTC-6526d7f0.7f67-LOG: duration: 61939.529 ms plan:
Query Text: delete from customer_balance
Delete on public.customer_balance (cost=0.00..799173.51 rows=0 width=0) (actual time=61939.525..61939.526 rows=0 loops=1) Buffers: shared hit=50027707 read=620942 dirtied=295462 written=71785 WAL: records=50026565 fpi=34 bytes=2711252160
    -> Seq Scan on public.customer_balance (cost=0.00..799173.51 rows=15052451 width=6) (actual time=3185.519..35234.061 rows=50000000 loops=1)
        Output: ctid Buffers: shared hit=27707 read=620942 dirtied=26565 written=71785 WAL: records=26565 fpi=34 bytes=11252160


2024-11-10 17:51:05 UTC-6526d7f0.7f67-LOG: duration: 10387.322 ms plan:
Query Text: select max(c_id) from customer
Finalize Aggregate (cost=180185.84..180185.85 rows=1 width=4) (actual time=10387.220..10387.319 rows=1 loops=1) Output: max(c_id) Buffers: shared hit=37148 read=1204 written=69
    -> Gather (cost=180185.63..180185.84 rows=2 width=4) (actual time=10387.214..10387.314 rows=1 loops=1)
        Output: (PARTIAL max(c_id)) Workers Planned: 2 Workers Launched: 0 Buffers: shared hit=37148 read=1204 written=69
        -> Partial Aggregate (cost=179185.63..179185.64 rows=1 width=4) (actual time=10387.012..10387.013 rows=1 loops=1) Output: PARTIAL max(c_id) Buffers: shared hit=37148 read=1204 written=69
            -> Parallel Index Only Scan using customer_i1 on public.customer (cost=0.43..168768.96 rows=4166667 width=4) (actual time=0.446..7676.356 rows=10000000 loops=1)
               Output: c_w_id, c_d_id, c_id Heap Fetches: 24 Buffers: shared hit=37148 read=1204 written=69

注意

为了便于演示，仅显示该过程中使用的少数查询的 EXPLAIN ANALYZE 输出。思路是可以收集日志中所有查询的 EXPLAIN ANALYZE 输出，识别其中最慢的查询，并尝试优化它们。

通过

排查和识别 Azure Database for PostgreSQL 灵活服务器中运行缓慢的查询

先决条件

识别速度缓慢的查询

识别存储过程中运行缓慢的查询

反馈

其他资源

通过

排查和识别 Azure Database for PostgreSQL 灵活服务器中运行缓慢的查询

先决条件

识别速度缓慢的查询

识别存储过程中运行缓慢的查询

相关内容

反馈

其他资源