```sql
CREATE TABLE `project-id.dataset.user_table` (
  user_id INT64,
  name STRING,
  region STRING
)
PARTITION BY RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 1000000, 10000))
OPTIONS (
  require_partition_filter = true
);
```
Partition type comparison:

| Feature | TimePartitioning | RangePartitioning |
|---|---|---|
| Field type | DATE or TIMESTAMP | INT64 |
| Partition basis | Time ranges | Numeric ranges |
| Typical scenarios | Logs, time-series data | User grouping, order IDs |
| Query optimization | Filtering by time range | Filtering by numeric range |
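How `RANGE_BUCKET(user_id, GENERATE_ARRAY(0, 1000000, 10000))` assigns rows to partitions can be simulated locally. This is a hypothetical Python sketch of the bucketing rule, not BigQuery's actual implementation; the `range_bucket` helper is an illustration only:

```python
def range_bucket(value, start=0, end=1_000_000, interval=10_000):
    """Mimic RANGE_BUCKET over GENERATE_ARRAY(start, end, interval):
    return the lower bound of the partition the value falls into, or
    None for out-of-range values (BigQuery routes those to a special
    unpartitioned bucket)."""
    if value is None or value < start or value >= end:
        return None
    return start + (value - start) // interval * interval

print(range_bucket(123_456))    # 120000: the partition [120000, 130000)
print(range_bucket(-5))         # None: outside the declared range
```

With `require_partition_filter = true`, queries must constrain `user_id`, so the engine can prune to the buckets the filter touches.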
Step 4: Create a Materialized View

Materialized views speed up queries and reduce cost:
```sql
-- Create a materialized view
CREATE MATERIALIZED VIEW `project-id.dataset.mv_daily_summary`
PARTITION BY event_date
AS
SELECT
  DATE(event_time) AS event_date,
  user_id,
  COUNT(*) AS event_count,
  SUM(amount) AS total_amount
FROM `project-id.dataset.events`
GROUP BY event_date, user_id;

-- Query the materialized view (precomputed results are used automatically)
SELECT *
FROM `project-id.dataset.mv_daily_summary`
WHERE event_date = '2025-04-29';
```
Advantages of materialized views:

- ⚡ Precomputed results make queries faster
- 💰 Less data scanned lowers cost
- 🔄 Automatic incremental refresh
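What the view precomputes can be illustrated locally. This hypothetical Python sketch builds the same `(event_date, user_id)` aggregates that `mv_daily_summary` would maintain, so a date-filtered query reads a handful of summary rows instead of rescanning the raw events:

```python
from collections import defaultdict
from datetime import date

# Sample rows from a hypothetical events table: (event date, user_id, amount)
events = [
    (date(2025, 4, 29), 1, 10.0),
    (date(2025, 4, 29), 1, 5.0),
    (date(2025, 4, 29), 2, 7.5),
    (date(2025, 4, 30), 1, 2.0),
]

# GROUP BY event_date, user_id: [COUNT(*), SUM(amount)] per key
summary = defaultdict(lambda: [0, 0.0])
for event_date, user_id, amount in events:
    row = summary[(event_date, user_id)]
    row[0] += 1
    row[1] += amount

# "Querying the view" for 2025-04-29 touches only the precomputed rows
for key, (count, total) in sorted(summary.items()):
    if key[0] == date(2025, 4, 29):
        print(key[1], count, total)
```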
Limitations and Caveats

CDC table limitations:

- ❌ Search indexes are not supported
- ⏱️ Must be used together with the max_staleness option
- 📊 Query results can lag (typically seconds to minutes)

Using max_staleness:
Note that max_staleness is a table option, not a per-query clause, so it is set with ALTER TABLE:

```sql
-- Allow query results to be up to 15 minutes stale
ALTER TABLE `project-id.dataset.cdc_table`
SET OPTIONS (max_staleness = INTERVAL 15 MINUTE);

SELECT *
FROM `project-id.dataset.cdc_table`
WHERE updated_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);
```
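The trade-off max_staleness controls can be sketched as a freshness check: if the last background apply of pending changes is within the allowed staleness window, the query can serve the cheaper precomputed baseline; otherwise pending changes must be merged at query time. A hypothetical Python illustration (not BigQuery's internal logic):

```python
from datetime import datetime, timedelta

def can_serve_stale(last_apply_time, now, max_staleness=timedelta(minutes=15)):
    """Return True if the baseline is fresh enough to serve without
    merging pending CDC changes at query time."""
    return now - last_apply_time <= max_staleness

now = datetime(2025, 4, 29, 12, 0)
print(can_serve_stale(now - timedelta(minutes=10), now))  # True: cheap, fast read
print(can_serve_stale(now - timedelta(minutes=20), now))  # False: runtime merge needed
```

A larger window means cheaper, faster queries at the price of older results; a smaller window does the opposite.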
```python
from google.cloud import bigquery
import functions_framework

@functions_framework.http
def merge_cdc_data(request):
    """Cloud Function that runs the MERGE operation on a schedule."""
    client = bigquery.Client()
    merge_query = """
        MERGE `project-id.dataset.item` AS target
        USING `project-id.dataset.staging_item` AS source
        ON target.id = source.id
        WHEN MATCHED AND source.operation_type = 'UPDATE' THEN
          UPDATE SET
            category_id = source.category_id,
            updated_at = source.create_date
        WHEN MATCHED AND source.operation_type = 'DELETE' THEN
          DELETE
        WHEN NOT MATCHED AND source.operation_type = 'INSERT' THEN
          INSERT (id, category_id, create_date, updated_at)
          VALUES (source.id, source.category_id, source.create_date, source.sync_time)
    """
    job = client.query(merge_query)
    job.result()  # wait for the job to finish
    return f"Merged {job.total_bytes_processed} bytes"
```
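The three WHEN branches of the MERGE can be checked locally. This hypothetical Python sketch applies staged CDC operations to an in-memory target keyed by `id`, mirroring the UPDATE / DELETE / INSERT cases above:

```python
# target: id -> row; staging: CDC operations in arrival order
target = {1: {"category_id": 10}, 2: {"category_id": 20}}
staging = [
    {"id": 1, "operation_type": "UPDATE", "category_id": 11},
    {"id": 2, "operation_type": "DELETE"},
    {"id": 3, "operation_type": "INSERT", "category_id": 30},
]

for op in staging:
    matched = op["id"] in target  # the ON target.id = source.id condition
    if matched and op["operation_type"] == "UPDATE":
        target[op["id"]]["category_id"] = op["category_id"]
    elif matched and op["operation_type"] == "DELETE":
        del target[op["id"]]
    elif not matched and op["operation_type"] == "INSERT":
        target[op["id"]] = {"category_id": op["category_id"]}

print(target)  # {1: {'category_id': 11}, 3: {'category_id': 30}}
```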
Suggested scheduling frequency:

- High freshness requirements: every 5-15 minutes
- Typical workloads: hourly
- Low-frequency workloads: daily
Option 3: Direct Updates

Applicable scenarios

- 📝 Low update frequency (up to a few hundred per day)
- 📊 Small data volumes (under a few million rows)
- ⏰ No strict real-time requirements

Basic operations
```sql
-- Update a single record
UPDATE `project-id.dataset.item`
SET
  category_id = 234,
  category_log = 'Updated category',
  updated_at = CURRENT_TIMESTAMP()
WHERE id = 1;

-- Batch update
UPDATE `project-id.dataset.item`
SET
  status = 'inactive',
  updated_at = CURRENT_TIMESTAMP()
WHERE last_login < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY);

-- Delete
DELETE FROM `project-id.dataset.item`
WHERE status = 'deleted'
  AND updated_at < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);
```
Optimization Tips

Use partition filters:
```sql
-- ✅ Good query (prunes to one partition)
UPDATE `project-id.dataset.item`
SET status = 'processed'
WHERE DATE(create_date) = '2025-04-29'
  AND id IN (1, 2, 3);

-- ❌ Bad query (full table scan)
UPDATE `project-id.dataset.item`
SET status = 'processed'
WHERE id IN (1, 2, 3);
```
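Why the partition filter matters can be quantified: an UPDATE that names a single date partition only scans that partition, while one filtering on `id` alone must scan every partition. A hypothetical back-of-the-envelope sketch (table sizes are made up for illustration):

```python
def scanned_bytes(partition_sizes, partitions_hit):
    """Bytes scanned = sum of the sizes of every partition the filter touches."""
    return sum(partition_sizes[p] for p in partitions_hit)

# Hypothetical table: 365 daily partitions of ~2 GiB each
sizes = {day: 2 * 1024**3 for day in range(365)}

with_filter = scanned_bytes(sizes, [120])            # WHERE DATE(create_date) = ...
without_filter = scanned_bytes(sizes, sizes.keys())  # WHERE id IN (...) only
print(without_filter // with_filter)  # 365: the unfiltered UPDATE scans 365x more
```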
```sql
-- Create a materialized view to cut the cost of repeated queries
CREATE MATERIALIZED VIEW `project-id.dataset.mv_user_stats`
AS
SELECT
  user_id,
  COUNT(*) AS event_count,
  SUM(amount) AS total_amount,
  MAX(event_time) AS last_event_time
FROM `project-id.dataset.events`
GROUP BY user_id;
```
Monitoring and Quotas

Set project quotas:

1. Open the BigQuery Console
2. Go to project settings
3. Set a daily query quota limit

Estimating query cost:
Note that `--dry_run` is a bq command-line flag rather than SQL syntax, so the estimate is run from the CLI:

```shell
# Estimate query cost with a dry run (reports bytes processed, bills nothing)
bq query --use_legacy_sql=false --dry_run \
'SELECT *
 FROM `project-id.dataset.large_table`
 WHERE DATE(timestamp) >= "2025-01-01"'
```
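The byte count a dry run reports converts directly to a cost estimate. A minimal sketch, assuming the on-demand rate of $6.25 per TiB (verify current pricing for your region and edition):

```python
def on_demand_cost_usd(bytes_processed, usd_per_tib=6.25):
    """Convert dry-run bytes_processed into an on-demand cost estimate.
    The $6.25/TiB rate is an assumption; check current BigQuery pricing."""
    return bytes_processed / 1024**4 * usd_per_tib

# A dry run reporting 512 GiB (= 0.5 TiB) would bill about $3.125
print(round(on_demand_cost_usd(512 * 1024**3), 3))  # 3.125
```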