Commit 1d57628

yangdongsheng authored and Mikulas Patocka committed
dm-pcache: add persistent cache target in device-mapper
This patch introduces dm-pcache, a new DM target that places a DAX-capable
persistent-memory device in front of any slower block device and uses it as
a high-throughput, low-latency cache.

Design highlights
-----------------
- DAX data path – data is copied directly between DRAM and the pmem
  mapping, bypassing the block layer’s overhead.

- Segmented, crash-consistent layout
  - all layout metadata are dual-replicated CRC-protected.
  - atomic kset flushes; key replay on mount guarantees cache integrity
    even after power loss.

- Striped multi-tree index
  - Multi‑tree indexing for high parallelism.
  - overlap-resolution logic ensures non-intersecting cached extents.

- Background services
  - write-back worker flushes dirty keys in order, preserving
    backing-device crash consistency. This is important for checkpoint in
    cloud storage.
  - garbage collector reclaims clean segments when utilisation exceeds a
    tunable threshold.

- Data integrity – optional CRC32 on cached payload; metadata always
  protected.

Comparison with existing block-level caches
---------------------------------------------------------------------------------------------------------------------------------
| Feature                          | pcache (this patch)             | bcache                       | dm-writecache             |
|----------------------------------|---------------------------------|------------------------------|---------------------------|
| pmem access method               | DAX                             | bio (block I/O)              | DAX                       |
| Write latency (4 K rand-write)   | ~5 µs                           | ~20 µs                       | ~5 µs                     |
| Concurrency                      | multi subtree index             | global index tree            | single tree + wc_lock     |
| IOPS (4K randwrite, 32 numjobs)  | 2.1 M                           | 352 K                        | 283 K                     |
| Read-cache support               | YES                             | YES                          | NO                        |
| Deployment                       | no re-format of backend         | backend devices must be      | no re-format of backend   |
|                                  |                                 | reformatted                  |                           |
| Write-back ordering              | log-structured;                 | no ordering guarantee        | no ordering guarantee     |
|                                  | preserves app-IO-order          |                              |                           |
| Data integrity checks            | metadata + data CRC(optional)   | metadata CRC only            | none                      |
---------------------------------------------------------------------------------------------------------------------------------

Signed-off-by: Dongsheng Yang <dongsheng.yang@linux.dev>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
1 parent 499cbe0 commit 1d57628

23 files changed, +5450 -0 lines changed
Documentation/admin-guide/device-mapper/dm-pcache.rst

Lines changed: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
.. SPDX-License-Identifier: GPL-2.0

=================================
dm-pcache — Persistent Cache
=================================

*Author: Dongsheng Yang <dongsheng.yang@linux.dev>*

This document describes *dm-pcache*, a Device-Mapper target that lets a
byte-addressable *DAX* (persistent-memory, “pmem”) region act as a
high-performance, crash-persistent cache in front of a slower block
device. The code lives in ``drivers/md/dm-pcache/``.

Quick feature summary
=====================

* *Write-back* caching (the only mode currently supported).
* *16 MiB segments* allocated on the pmem device.
* *Data CRC32* verification (optional, per cache).
* Crash-safe: every metadata structure is duplicated
  (``PCACHE_META_INDEX_MAX == 2``) and protected with a CRC and sequence
  numbers.
* *Multi-tree indexing* (index trees sharded by logical address) for high
  pmem parallelism.
* Pure *DAX path* I/O – no extra BIO round-trips.
* *Log-structured write-back* that preserves backend crash-consistency.


Constructor
===========

::

   pcache <cache_dev> <backing_dev> [<number_of_optional_arguments> <cache_mode writeback> <data_crc true|false>]

========================= ====================================================
``cache_dev``             Any DAX-capable block device (``/dev/pmem0``…).
                          All metadata *and* cached blocks are stored here.

``backing_dev``           The slow block device to be cached.

``cache_mode``            Optional. Only ``writeback`` is accepted at the
                          moment.

``data_crc``              Optional; defaults to ``false``.

                          * ``true``  – store a CRC32 for every cached entry
                            and verify it on reads
                          * ``false`` – skip CRC (faster)
========================= ====================================================

Example
-------

.. code-block:: shell

   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

The first time a pmem device is used, dm-pcache formats it automatically
(super-block, cache_info, etc.).
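
In the example above the optional-argument count (``4``) equals the number
of words that follow it. The sketch below assembles a table line on that
basis. It is only an illustration: the function name ``pcache_table`` is
hypothetical, and it assumes that the count follows the usual device-mapper
convention of counting every following word and that ``data_crc`` may be
omitted on its own. The sector count is passed in so the helper can be tried
without real devices.

.. code-block:: shell

   # pcache_table <sectors> <cache_dev> <backing_dev> [data_crc]
   # Hypothetical helper, not part of dm-pcache. Prints a table line
   # suitable for "dmsetup create --table". Each option adds two words
   # (name and value) to the optional-argument count.
   pcache_table() {
       sectors=$1 cache_dev=$2 backing_dev=$3 data_crc=${4:-}
       opts="cache_mode writeback"
       count=2
       if [ -n "$data_crc" ]; then
           opts="$opts data_crc $data_crc"
           count=$((count + 2))
       fi
       echo "0 $sectors pcache $cache_dev $backing_dev $count $opts"
   }

Called with the output of ``blockdev --getsz /dev/sdb`` as the first
argument and ``true`` as the last, it reproduces the table string used in
the example.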


Status line
===========

``dmsetup status <device>`` (``STATUSTYPE_INFO``) prints:

::

   <sb_flags> <seg_total> <cache_segs> <segs_used> \
   <gc_percent> <cache_flags> \
   <key_head_seg>:<key_head_off> \
   <dirty_tail_seg>:<dirty_tail_off> \
   <key_tail_seg>:<key_tail_off>

Field meanings
--------------

===============================  =============================================
``sb_flags``                     Super-block flags (e.g. endian marker).

``seg_total``                    Number of physical *pmem* segments.

``cache_segs``                   Number of segments used for cache.

``segs_used``                    Segments currently allocated (bitmap weight).

``gc_percent``                   Current GC high-water mark (0-90).

``cache_flags``                  * Bit 0 – DATA_CRC enabled
                                 * Bit 1 – INIT_DONE (cache initialised)
                                 * Bits 2-5 – cache mode (0 == WB)

``key_head``                     Where new key-sets are being written.

``dirty_tail``                   First dirty key-set that still needs
                                 write-back to the backing device.

``key_tail``                     First key-set that may be reclaimed by GC.
===============================  =============================================
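
The field layout above can be split and decoded with plain shell
arithmetic. The following is a minimal sketch, not a tool shipped with
dm-pcache: both function names are hypothetical, the input is assumed to be
only the target-specific fields (with the leading ``<start> <length>
pcache`` columns that ``dmsetup status`` prints already removed), and
``cache_flags`` is assumed to be printed in a form the shell can parse as an
integer.

.. code-block:: shell

   # pcache_parse_status "<fields>"
   # Hypothetical helper: names the nine status fields in order.
   pcache_parse_status() {
       # Deliberately unquoted: word-split the line into $1..$9.
       set -- $1
       echo "sb_flags=$1 seg_total=$2 cache_segs=$3 segs_used=$4"
       echo "gc_percent=$5 cache_flags=$6"
       echo "key_head=$7 dirty_tail=$8 key_tail=$9"
   }

   # pcache_decode_flags <cache_flags>
   # Hypothetical helper: extracts the sub-fields listed in the table.
   pcache_decode_flags() {
       flags=$1
       data_crc=$(( flags & 1 ))            # bit 0
       init_done=$(( (flags >> 1) & 1 ))    # bit 1
       mode=$(( (flags >> 2) & 15 ))        # bits 2-5, four bits wide
       echo "data_crc=$data_crc init_done=$init_done cache_mode=$mode"
   }

For instance, a ``cache_flags`` value of 3 decodes to DATA_CRC enabled,
INIT_DONE set and cache mode 0 (write-back).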


Messages
========

*Change GC trigger*

::

   dmsetup message <dev> 0 gc_percent <0-90>
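
Because the accepted range is 0-90, a wrapper can reject bad values before
they reach the kernel. This is a sketch only: the function name
``pcache_set_gc_percent`` and the ``DMSETUP`` override variable are
hypothetical conveniences for dry runs, and how the target itself reports an
out-of-range value is not covered here.

.. code-block:: shell

   # pcache_set_gc_percent <dev> <percent>
   # Hypothetical wrapper: validates the value, then sends the message.
   # Set DMSETUP=echo to print the command instead of running it.
   pcache_set_gc_percent() {
       dev=$1 pct=$2
       case $pct in
           ''|*[!0-9]*)
               echo "gc_percent must be a non-negative integer" >&2
               return 1 ;;
       esac
       if [ "$pct" -gt 90 ]; then
           echo "gc_percent must be in the range 0-90" >&2
           return 1
       fi
       ${DMSETUP:-dmsetup} message "$dev" 0 gc_percent "$pct"
   }

With ``DMSETUP=echo`` set, the wrapper prints the arguments it would pass to
``dmsetup``, which makes it safe to try on a machine without a pcache
device.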


Theory of operation
===================

Sub-devices
-----------

====================  =========================================================
backing_dev           Any block device (SSD/HDD/loop/LVM, etc.).
cache_dev             DAX device; must expose direct-access memory.
====================  =========================================================

Segments and key-sets
---------------------

* The pmem space is divided into *16 MiB segments*.
* Each write allocates space from a per-CPU *data_head* inside a segment.
* A *cache-key* records a logical range on the origin and where it lives
  inside pmem (segment + offset + generation).
* 128 keys form a *key-set* (kset); ksets are written sequentially in pmem
  and are themselves crash-safe (CRC).
* The pair *(key_tail, dirty_tail)* delimits clean/dirty and live/dead ksets.
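
Since segments are a fixed 16 MiB, the size of the pmem device bounds how
many of them can exist. The sketch below computes that upper bound. It is an
estimate under stated assumptions: the function name is hypothetical, and
the space taken by the super-block and other metadata is not subtracted, so
the ``seg_total`` reported by the target may be smaller.

.. code-block:: shell

   SEG_SIZE=$((16 * 1024 * 1024))   # 16 MiB segment size, in bytes

   # pcache_max_segments <pmem_size_in_bytes>
   # Hypothetical helper: upper bound on the number of whole segments.
   pcache_max_segments() {
       echo $(( $1 / SEG_SIZE ))
   }

A 1 GiB device (1073741824 bytes) therefore holds at most 64 segments. On a
real system the size can be taken from ``blockdev --getsize64 /dev/pmem0``.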

Write-back
----------

Dirty keys are queued into a tree; a background worker copies data
back to the backing_dev and advances *dirty_tail*. A FLUSH/FUA bio from the
upper layers forces an immediate metadata commit.

Garbage collection
------------------

GC starts when ``segs_used >= seg_total * gc_percent / 100``. It walks
from *key_tail*, frees segments whose every key has been invalidated, and
advances *key_tail*.
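
The trigger condition can be checked from user space with the values
reported on the status line. This is a minimal sketch: the function name
``pcache_gc_due`` is hypothetical, and truncating integer division is
assumed to match how the driver evaluates the expression.

.. code-block:: shell

   # pcache_gc_due <segs_used> <seg_total> <gc_percent>
   # Hypothetical helper: exit status 0 when
   # segs_used >= seg_total * gc_percent / 100, non-zero otherwise.
   pcache_gc_due() {
       [ "$1" -ge $(( $2 * $3 / 100 )) ]
   }

With 64 segments and ``gc_percent`` set to 70, the threshold evaluates to 44
segments, so ``pcache_gc_due 45 64 70`` succeeds while
``pcache_gc_due 43 64 70`` fails.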

CRC verification
----------------

If ``data_crc`` is enabled, dm-pcache computes a CRC32 over every cached data
range when it is inserted and stores it in the on-media key. Reads
validate the CRC before copying to the caller.


Failure handling
================

* *pmem media errors* – all metadata copies are read with
  ``copy_mc_to_kernel``; an uncorrectable error is logged and aborts
  initialisation.
* *Cache full* – if no free segment can be found, writes return ``-EBUSY``;
  dm-pcache retries internally (request deferral).
* *System crash* – on attach, the driver replays ksets from *key_tail* to
  rebuild the in-core trees; every segment’s generation guards against
  use-after-free keys.


Limitations & TODO
==================

* Only *write-back* mode; other modes planned.
* Only FIFO cache invalidation; other policies (LRU, ARC...) planned.
* Table reload is not currently supported.
* Discard support is planned.


Example workflow
================

.. code-block:: shell

   # 1. Create devices
   dmsetup create pcache_sdb --table \
     "0 $(blockdev --getsz /dev/sdb) pcache /dev/pmem0 /dev/sdb 4 cache_mode writeback data_crc true"

   # 2. Put a filesystem on top
   mkfs.ext4 /dev/mapper/pcache_sdb
   mount /dev/mapper/pcache_sdb /mnt

   # 3. Tune GC threshold to 80 %
   dmsetup message pcache_sdb 0 gc_percent 80

   # 4. Observe status
   watch -n1 'dmsetup status pcache_sdb'

   # 5. Shutdown
   umount /mnt
   dmsetup remove pcache_sdb


``dm-pcache`` is under active development; feedback, bug reports and patches
are very welcome!

Documentation/admin-guide/device-mapper/index.rst

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Device Mapper
     dm-integrity
     dm-io
     dm-log
+    dm-pcache
     dm-queue-length
     dm-raid
     dm-service-time

MAINTAINERS

Lines changed: 8 additions & 0 deletions
@@ -7051,6 +7051,14 @@ S: Maintained
 F: Documentation/admin-guide/device-mapper/vdo*.rst
 F: drivers/md/dm-vdo/
 
+DEVICE-MAPPER PCACHE TARGET
+M: Dongsheng Yang <dongsheng.yang@linux.dev>
+M: Zheng Gu <cengku@gmail.com>
+L: dm-devel@lists.linux.dev
+S: Maintained
+F: Documentation/admin-guide/device-mapper/dm-pcache.rst
+F: drivers/md/dm-pcache/
+
 DEVLINK
 M: Jiri Pirko <jiri@resnulli.us>
 L: netdev@vger.kernel.org

drivers/md/Kconfig

Lines changed: 2 additions & 0 deletions
@@ -659,4 +659,6 @@ config DM_AUDIT
 
 source "drivers/md/dm-vdo/Kconfig"
 
+source "drivers/md/dm-pcache/Kconfig"
+
 endif # MD

drivers/md/Makefile

Lines changed: 1 addition & 0 deletions
@@ -71,6 +71,7 @@ obj-$(CONFIG_DM_RAID) += dm-raid.o
 obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
 obj-$(CONFIG_DM_VERITY) += dm-verity.o
 obj-$(CONFIG_DM_VDO) += dm-vdo/
+obj-$(CONFIG_DM_PCACHE) += dm-pcache/
 obj-$(CONFIG_DM_CACHE) += dm-cache.o
 obj-$(CONFIG_DM_CACHE_SMQ) += dm-cache-smq.o
 obj-$(CONFIG_DM_EBS) += dm-ebs.o

drivers/md/dm-pcache/Kconfig

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
config DM_PCACHE
2+
tristate "Persistent cache for Block Device (Experimental)"
3+
depends on BLK_DEV_DM
4+
depends on DEV_DAX
5+
help
6+
PCACHE provides a mechanism to use persistent memory (e.g., CXL persistent memory,
7+
DAX-enabled devices) as a high-performance cache layer in front of
8+
traditional block devices such as SSDs or HDDs.
9+
10+
PCACHE is implemented as a kernel module that integrates with the block
11+
layer and supports direct access (DAX) to persistent memory for low-latency,
12+
byte-addressable caching.
13+
14+
Note: This feature is experimental and should be tested thoroughly
15+
before use in production environments.
16+
17+
If unsure, say 'N'.

drivers/md/dm-pcache/Makefile

Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
dm-pcache-y := dm_pcache.o cache_dev.o segment.o backing_dev.o cache.o cache_gc.o cache_writeback.o cache_segment.o cache_key.o cache_req.o

obj-m += dm-pcache.o
