希望通过以下介绍可以回答以下问题:
Clickstream Data wareHouse
ClickHouse是一个用于在线分析处理 (OLAP) 的面向列的 SQL 数据库管理系统 (DBMS)
CK的设计目标:回答两个问题 Filter&aggregate 。也即主要解决group by类查询
相比于通用性,CK更注重效率。每一个任务都有各自的特征,不能简单的丢进去随便执行。CK的设计是具体任务具体分析。仅Hash Table在ck中就有30多种变化。同样在算法选择上,也会根据不同的任务选择合适的算法。
为什么不是B(+)-tree
稀疏索引
CREATE TABLE hits_UserID_URL
(
`UserID` UInt32,
`URL` String,
`EventTime` DateTime
)
ENGINE = MergeTree
PRIMARY KEY (UserID, URL)
ORDER BY (UserID, URL, EventTime)
SETTINGS index_granularity = 8192, index_granularity_bytes = 0;
数据存储在磁盘上,按照primary Key column排序
数据处理的基本单位:granule
Primary index
Granule定位:mark file(多了一步,但是节约内存资源)
定位granual示例
SELECT URL, count(URL) AS Count
FROM hits_UserID_URL
WHERE UserID = 749927693
GROUP BY URL
ORDER BY Count DESC
LIMIT 10;
查询流程
数据写入预排序算法:LSM
LSM算法的几个核心步骤:
数据写入存储系统前首先记录日志,防止系统崩溃
记录完日志后在内存中以供使用,当内存达到极限后写入磁盘,记录合并次数Level为0(L=0)。已经写入磁盘的文件不可变。
每过一段时间将磁盘上L和L+1的文件合并
我们用一个示例来展示下整个过程
T=0时刻,数据库为空。
T=1时刻,clickhouse收到一条500条insert的插入请求,这500条数据时乱序的。此时,clickhouse开始插入操作。首先将500条插入请求一次性写入日志。接着在内存中进行排序,排序完成后将有序的结果写入磁盘,此时L=0;
T=2时刻,clickhouse收到一条800条insert的插入请求,这800条数据时乱序的。此时,clickhouse开始插入操作。首先将800条插入请求一次性写入日志。接着在内存中进行排序,排序完成后将有序的结果写入磁盘,此时L=0;
T=3时刻,clickhouse开始合并,此时此刻,磁盘上存在两个L=0的文件。这两个文件每个文件内部有序,但可能存在重合。(例如第一批500条的范围是300-400,第二批800条数据的范围是350-700)。因此需要合并。clickhouse在后台完成合并后,产生了一个新的L=1的文件。将两个L=0的文件标记为删除。
T=4时刻,clickhouse开始清理,将两个被标记为删除的文件真正地物理删除。
T=5时刻,clickhouse收到一条100条insert的插入请求,这100条数据时乱序的。此时,clickhouse开始插入操作。首先将100条插入请求一次性写入日志。接着在内存中进行排序,排序完成后将有序的结果写入磁盘,此时L=0;
T=6时刻,clickhouse开始合并,此时此刻,磁盘上存在1个L=0的文件和1个L=1的文件。这两个文件每个文件内部有序,但不存在重合。(例如L0文件的范围是100-200,L1文件的范围是300-700)。因此不需要合并。clickhouse在后台将L=0的文件升级成L=1,此时数据库内存在两个L=1的互不重合的文件。
……
以上就是LSM算法在clickhouse上的应用,我们总结一下,clickhouse使用LSM算法将乱序的数据在内存中排序为有序的数据,然后写入磁盘保存,并且定期合并有重合的磁盘文件。
CK的预排序设计带来的弊病:
查询时,有多个level,返回数据是怎样的?如何解决。
如果查询发生在合并前,就有可能数据分布在两个数据文件内。此时clickhouse默认会返回两个列表,这两个列表内部有序,但相互之间却会有重合
合并后查询
当查询在作为复合键的一部分并且是第一个键列的列上进行过滤时,ClickHouse 将在键列的索引标记上运行二进制搜索算法。
但是当不能命中primary index时。clickhouse会怎么办
SELECT UserID, count(UserID) AS Count
FROM hits_UserID_URL
WHERE URL = 'http://public_search'
GROUP BY UserID
ORDER BY Count DESC
LIMIT 10;
10 rows in set. Elapsed: 0.086 sec.
Processed 8.81 million rows,
799.69 MB (102.11 million rows/s., 9.27 GB/s.)
1.通用排除搜索算法(Generic exclusion search algorithm)
排除规则:
两种情况:
通常情况下:如果userId的基数较大时,不管是排除算法和skip index都不会有好的效果。这时有以下三种解决方案。不同点在于附加表的查询和插入对用户的透明度
2. Secondary Tables
使用不同的主键创建第二个表。查询时必须显示指定具体的表,并且查询时必须两张表显示写入以保持同步
CREATE TABLE hits_URL_UserID
(
`UserID` UInt32,
`URL` String,
`EventTime` DateTime
)
ENGINE = MergeTree
PRIMARY KEY (URL, UserID)
ORDER BY (URL, UserID, EventTime)
SETTINGS index_granularity = 8192, index_granularity_bytes = 0;
INSERT INTO hits_URL_UserID
SELECT * from hits_UserID_URL;
OPTIMIZE TABLE hits_URL_UserID FINAL;
3. Materialized Views
使用物化视图,附加表是隐式创建的,并且数据在两个表之间自动保持同步
CREATE MATERIALIZED VIEW mv_hits_URL_UserID
ENGINE = MergeTree()
PRIMARY KEY (URL, UserID)
ORDER BY (URL, UserID, EventTime)
POPULATE
AS SELECT * FROM hits_UserID_URL;
SELECT UserID, count(UserID) AS Count
FROM mv_hits_URL_UserID
WHERE URL = 'http://public_search'
GROUP BY UserID
ORDER BY Count DESC
LIMIT 10;
4.Projections
投影是最透明的选项,除了可以隐式的创建和同步附加表,还会自动选择最有效的表版本进行查询
ALTER TABLE hits_UserID_URL
ADD PROJECTION prj_url_userid
(
SELECT *
ORDER BY (URL, UserID)
);
ALTER TABLE hits_UserID_URL
MATERIALIZE PROJECTION prj_url_userid;
5.Data Skipping index
跳过一定没有匹配值的数据块。只适用于mergeTree引擎
CREATE TABLE skip_table
(
my_key UInt64,
my_value UInt64
)
ENGINE MergeTree primary key my_key
SETTINGS index_granularity=8192;
INSERT INTO skip_table SELECT number, intDiv(number,4096) FROM numbers(100000000);
ALTER TABLE skip_table ADD INDEX vix my_value TYPE set(100) GRANULARITY 2;
ALTER TABLE skip_table MATERIALIZE INDEX vix;
SELECT * FROM skip_table WHERE my_value IN (125, 700)
Skipping index 并不总是有效,和数据在granual的分布强相关
1.limit优化
ALTER TABLE uk_price_paid_with_projections
ADD PROJECTION prj_oby_town_price
(
SELECT
*
ORDER BY
town, price
);
ALTER TABLE uk_price_paid_with_projections
MATERIALIZE PROJECTION prj_oby_town_price SETTINGS mutations_sync = 1;
2.聚合优化
ALTER TABLE uk_price_paid_with_projections
ADD PROJECTION prj_gby_county
(
SELECT
county,
avg(price)
GROUP BY
county
);
ALTER TABLE uk_price_paid_with_projections
MATERIALIZE PROJECTION prj_gby_county SETTINGS mutations_sync = 1;
思考一下:如何有效识别单行
内存限制:
max_memory_usage The maximum amount of RAM to use for running a query on a single server.
max_memory_usage_for_user The maximum amount of RAM to use for running a user’s queries on a single server.
max_server_memory_usage Limits total RAM usage by the ClickHouse server.
max_server_memory_usage_to_ram_ratio
线程
max_thread_pool_size(10000) ClickHouse uses threads from the Global Thread pool to process queries
max_thread_pool_free_size(1000) 空闲线程数,超过这个值就会有线程被释放
thread_pool_queue_size(10000)
background_pool_size(16) background merges and mutations
concurrent_threads_soft_limit_num(2022,为了提高查询QPS)
concurrent_threads_soft_limit_ratio_to_cores(2022)
max_concurrent_queries(100)
max_concurrent_insert_queries
max_concurrent_select_queries
max_concurrent_queries_for_user
max_concurrent_queries_for_all_users
max_threads The maximum number of query processing threads, excluding threads for retrieving data from remote servers.Default value: the number of physical CPU cores.max_insert_threads
The maximum number of threads to execute the INSERT SELECT query.
始终使用performance scaling governor
处理器可能会过热。 使用 dmesg 查看 CPU 的时钟速率是否因过热而受到限制。 也可以在数据中心级别外部设置限制。 您可以使用 turbostat 在负载下对其进行监控。
少量数据,采用和数据量一样的内存
大量数据,内存越多越好。
不要禁用overcommit(cat /proc/sys/vm/overcommit_memory)
使用 perf top 查看内核在内存管理上花费的时间。
建议内存至少32g+
SSD > HDD 优先选择带有本地硬盘驱动器的大量服务器,而不是带有附加磁盘架的少量服务器。
使用 HDD 时,您可以组合它们的 RAID-10、RAID-5、RAID-6 或 RAID-50。 对于 Linux,软件 RAID 更好(使用 mdadm)。 我们不建议使用 LVM。 创建 RAID-10 时,选择远布局。 如果您的预算允许,请选择 RAID-10。
如果您有超过 4 个磁盘,请使用 RAID-6(首选)或 RAID-50,而不是 RAID-5。 使用 RAID-5、RAID-6 或 RAID-50 时,始终增加 stripe_cache_size,因为默认值通常不是最佳选择。
对于大多数RAID配置,64 KB的块大小就足够了。clickhouse服务器的平均写入大小约为1 MB(1024 KB),因此建议的条带大小也为1 MB。如果将块大小设置为1 MB除以RAID阵列中非奇偶校验磁盘的数量,则可以根据需要优化块大小,以便在所有可用的非奇偶校验盘上并行执行每个写入操作。切勿将块大小设置得太小或太大。
使用长队列启用NCQ。对于HDD,选择CFQ调度器,对于SSD,选择noop。不要减少“预读”设置。对于HDD,启用写缓存。
确保为操作系统中的NVME和SSD磁盘启用fstrim(通常使用cronjob或systemd服务实现)。
Ext4是最可靠的选项,将装载选项设置为:noatime XFS也支持 可以在NFS上工作,但不太好 FAT-32、exFAT不支持 不要使用带压缩或加密的文件系统,CK自带的有,效果更好
不要使用过时的linux 内核
如果您使用的是IPv6,请增加路由缓存的大小。3.2之前的Linux内核在IPv6实现方面存在许多问题。
如果可能,至少使用10 GB网络。1 Gb也可以工作,但对于修补具有数十TB数据的副本,或处理具有大量中间数据的分布式查询,情况会更糟。
旧的Linux内核上,禁用透明大页,会干扰内存分配,导致性能限制下降。 较新的Linux内核上开启透明大页是可以的
OpenStack: nova.conf cpu_mode=host-passthrough
libvert:
这对于ClickHouse能够使用cpuid指令获得正确的信息非常重要。否则,当管理程序在旧CPU型号上运行时,可能会出现非法指令崩溃。
在CK集群中建议ClickHouse代替Zookeeper
如果使用防病毒软件,请将其配置为跳过带有ClickHouse数据文件(/var/lib/ClickHouse)的文件夹,否则可能会降低性能,并且在数据摄取和后台合并过程中可能会遇到意外错误。
数据集
过去100年特定地点和特定时间的天气记录
表结构
CREATE TABLE noaa_codec_v1
(
`station_id` String COMMENT 'Id of the station at which the measurement as taken',
`date` Date32,
`tempAvg` Int64 COMMENT 'Average temperature (tenths of a degrees C)',
`tempMax` Int64 COMMENT 'Maximum temperature (tenths of degrees C)',
`tempMin` Int64 COMMENT 'Minimum temperature (tenths of degrees C)',
`precipitation` Int64 COMMENT 'Precipitation (tenths of mm)',
`snowfall` Int64 COMMENT 'Snowfall (mm)',
`snowDepth` Int64 COMMENT 'Snow depth (mm)',
`percentDailySun` Int64 COMMENT 'Daily percent of possible sunshine (percent)',
`averageWindSpeed` Int64 COMMENT 'Average daily wind speed (tenths of meters per second)',
`maxWindSpeed` Int64 COMMENT 'Peak gust wind speed (tenths of meters per second)',
`weatherType` String,
`location` Point,
`elevation` Float64,
`name` String
) ENGINE = MergeTree() ORDER BY (station_id, date)
导入数据大约10亿行。磁盘占用量如下:
SELECT
name,
formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE table = 'noaa_codec_v1'
GROUP BY name
ORDER BY sum(data_compressed_bytes) DESC
name compressed_size uncompressed_size ratio
date 2.24 GiB 3.93 GiB 1.76
tempMax 522.77 MiB 7.87 GiB 15.41
tempMin 519.53 MiB 7.87 GiB 15.51
precipitation 494.41 MiB 7.87 GiB 16.29
tempAvg 130.69 MiB 7.87 GiB 61.64
snowDepth 62.33 MiB 7.87 GiB 129.26
weatherType 37.87 MiB 6.87 GiB 185.7
snowfall 32.94 MiB 7.87 GiB 244.56
location 14.89 MiB 15.73 GiB 1081.94
averageWindSpeed 14.64 MiB 7.87 GiB 550.29
maxWindSpeed 11.09 MiB 7.87 GiB 726.54
name 9.63 MiB 14.58 GiB 1549.63
elevation 7.95 MiB 7.87 GiB 1012.79
station_id 7.60 MiB 11.80 GiB 1589.03
percentDailySun 6.59 MiB 7.87 GiB 1222.67
15 rows in set. Elapsed: 0.005 sec.
SELECT
formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE table = 'noaa_codec_v1'
compressed_size uncompressed_size ratio
4.07 GiB 131.58 GiB 32.36
1 row in set. Elapsed: 0.004 sec.
SELECT
elevation_range,
uniq(station_id) AS num_stations,
max(tempMax) / 10 AS max_temp,
min(tempMin) / 10 AS min_temp,
sum(precipitation) AS total_precipitation,
avg(percentDailySun) AS avg_percent_sunshine,
max(maxWindSpeed) AS max_wind_speed,
sum(snowfall) AS total_snowfall
FROM noaa_codec_v1
WHERE (date > '1970-01-01') AND (station_id IN (
SELECT station_id
FROM stations
WHERE country_code = 'US'
))
GROUP BY floor(elevation, -2) AS elevation_range
ORDER BY elevation_range ASC
FORMAT `Null`
Ok.
0 rows in set. Elapsed: 1.615 sec. Processed 331.11 million rows, 23.19 GB (204.98 million rows/s., 14.36 GB/s.)
通过修改type,降低磁盘占用量
CREATE TABLE noaa_codec_v3
(
`station_id` LowCardinality(String) COMMENT 'Id of the station at which the measurement as taken',
`date` Date32,
`tempAvg` Int16 COMMENT 'Average temperature (tenths of a degrees C)',
`tempMax` Int16 COMMENT 'Maximum temperature (tenths of degrees C)',
`tempMin` Int16 COMMENT 'Minimum temperature (tenths of degrees C)',
`precipitation` UInt16 COMMENT 'Precipitation (tenths of mm)',
`snowfall` UInt16 COMMENT 'Snowfall (mm)',
`snowDepth` UInt16 COMMENT 'Snow depth (mm)',
`percentDailySun` UInt8 COMMENT 'Daily percent of possible sunshine (percent)',
`averageWindSpeed` UInt16 COMMENT 'Average daily wind speed (tenths of meters per second)',
`maxWindSpeed` UInt16 COMMENT 'Peak gust wind speed (tenths of meters per second)',
`weatherType` Enum8('Normal' = 0, 'Fog' = 1, 'Heavy Fog' = 2, 'Thunder' = 3, 'Small Hail' = 4, 'Hail' = 5, 'Glaze' = 6, 'Dust/Ash' = 7, 'Smoke/Haze' = 8, 'Blowing/Drifting Snow' = 9, 'Tornado' = 10, 'High Winds' = 11, 'Blowing Spray' = 12, 'Mist' = 13, 'Drizzle' = 14, 'Freezing Drizzle' = 15, 'Rain' = 16, 'Freezing Rain' = 17, 'Snow' = 18, 'Unknown Precipitation' = 19, 'Ground Fog' = 21, 'Freezing Fog' = 22),
`lat` Float32,
`lon` Float32,
`elevation` Int16,
`name` LowCardinality(String)
) ENGINE = MergeTree() ORDER BY (station_id, date)
SELECT
name,
formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE table = 'noaa_codec_v3'
GROUP BY name
ORDER BY sum(data_compressed_bytes) DESC
name compressed_size uncompressed_size ratio
date 2.24 GiB 3.94 GiB 1.76
precipitation 469.11 MiB 1.97 GiB 4.3
tempMax 451.33 MiB 1.97 GiB 4.47
tempMin 437.15 MiB 1.97 GiB 4.61
tempAvg 120.28 MiB 1.97 GiB 16.76
snowDepth 42.80 MiB 1.97 GiB 47.1
snowfall 32.61 MiB 1.97 GiB 61.81
weatherType 16.48 MiB 1008.00 MiB 61.16
averageWindSpeed 10.27 MiB 1.97 GiB 196.24
maxWindSpeed 6.31 MiB 1.97 GiB 319.57
name 3.99 MiB 1.92 GiB 492.99
lat 3.57 MiB 3.94 GiB 1127.84
lon 3.57 MiB 3.94 GiB 1130.25
station_id 3.40 MiB 1.92 GiB 577.5
elevation 1.89 MiB 1.97 GiB 1065.35
percentDailySun 1.51 MiB 1008.00 MiB 667.67
SELECT
formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE table = 'noaa_codec_v3'
compressed_size uncompressed_size ratio
3.81 GiB 35.34 GiB 9.28
优化codec
CREATE TABLE noaa_codec_optimal
(
`station_id` LowCardinality(String),
`date` Date32 CODEC(DoubleDelta, ZSTD(1)),
`tempAvg` Int16 CODEC(T64, ZSTD(1)),
`tempMax` Int16 CODEC(T64, ZSTD(1)),
`tempMin` Int16 CODEC(T64, ZSTD(1)) ,
`precipitation` UInt16 CODEC(T64, ZSTD(1)) ,
`snowfall` UInt16 CODEC(T64, ZSTD(1)) ,
`snowDepth` UInt16 CODEC(ZSTD(1)),
`percentDailySun` UInt8,
`averageWindSpeed` UInt16 CODEC(T64, ZSTD(1)),
`maxWindSpeed` UInt16 CODEC(T64, ZSTD(1)),
`weatherType` Enum8('Normal' = 0, 'Fog' = 1, 'Heavy Fog' = 2, 'Thunder' = 3, 'Small Hail' = 4, 'Hail' = 5, 'Glaze' = 6, 'Dust/Ash' = 7, 'Smoke/Haze' = 8, 'Blowing/Drifting Snow' = 9, 'Tornado' = 10, 'High Winds' = 11, 'Blowing Spray' = 12, 'Mist' = 13, 'Drizzle' = 14, 'Freezing Drizzle' = 15, 'Rain' = 16, 'Freezing Rain' = 17, 'Snow' = 18, 'Unknown Precipitation' = 19, 'Ground Fog' = 21, 'Freezing Fog' = 22),
`lat` Float32,
`lon` Float32,
`elevation` Int16,
`name` LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (station_id, date)
SELECT
name,
formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE table = 'noaa_codec_optimal'
GROUP BY name
ORDER BY sum(data_compressed_bytes) DESC
name compressed_size uncompressed_size ratio
precipitation 416.51 MiB 2.01 GiB 4.93
tempMax 400.70 MiB 2.01 GiB 5.13
tempMin 388.00 MiB 2.01 GiB 5.29
tempAvg 101.50 MiB 2.01 GiB 20.24
snowDepth 43.48 MiB 2.01 GiB 47.24
snowfall 28.72 MiB 2.01 GiB 71.51
date 24.28 MiB 4.01 GiB 169.17
weatherType 16.90 MiB 1.00 GiB 60.76
averageWindSpeed 9.37 MiB 2.01 GiB 219.32
maxWindSpeed 6.17 MiB 2.01 GiB 332.67
name 5.07 MiB 1.98 GiB 400.41
station_id 4.52 MiB 1.97 GiB 447.45
lat 3.64 MiB 4.01 GiB 1128.65
lon 3.63 MiB 4.01 GiB 1130.98
elevation 1.93 MiB 2.01 GiB 1066.81
percentDailySun 1.56 MiB 1.00 GiB 658.76
SELECT
formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.columns
WHERE table = 'noaa_codec_optimal'
compressed_size uncompressed_size ratio
1.42 GiB 36.05 GiB 25.36
SELECT
elevation_range,
uniq(station_id) AS num_stations,
max(tempMax) / 10 AS max_temp,
min(tempMin) / 10 AS min_temp,
sum(precipitation) AS total_precipitation,
avg(percentDailySun) AS avg_percent_sunshine,
max(maxWindSpeed) AS max_wind_speed,
sum(snowfall) AS total_snowfall
FROM noaa_codec_optimal
WHERE (date > '1970-01-01') AND (station_id IN (
SELECT station_id
FROM stations
WHERE country_code = 'US'
))
GROUP BY floor(elevation, -2) AS elevation_range
ORDER BY elevation_range ASC
FORMAT `Null`
0 rows in set. Elapsed: 1.235 sec. Processed 330.20 million rows, 6.48 GB (267.28 million rows/s., 5.25 GB/s.)
优化结果总结
磁盘占用(压缩前后)4.07 GiB/131.58 GiB -> 1.42 GiB/36.05 GiB
相同的语句查询速度1.615 sec -> 1.235 sec
type和codec优化建议
选择使用哪种编解码器和压缩算法最终归结为了解数据的特征以及编解码器和压缩算法的属性。多测试、以实际测试结果为准。
以下为一般准则:
页面更新:2024-04-22
本站资料均由网友自行发布提供,仅用于学习交流。如有版权问题,请与我联系,QQ:4156828
© CopyRight 2008-2024 All Rights Reserved. Powered By bs178.com 闽ICP备11008920号-3
闽公网安备35020302034844号