
Commit 0793916

veod32, ainoneko, and patiencedaur authored
Restructure Storage engines chapter and write the memtx engine overview (#2243)
* Restructure Storage engines chapter. Add initial structure and topics in brief for memtx. Part of #1632
* Update description and doc structure. Part of #1632
* Return excluded vinyl.rst back. Part of #1632
* Update memtx overview page and storage engine chapter index page. Part of #1632
* Correct content after review. Part of #1632
* Corrections after review. Part of #1632
* Update doc/book/box/engines/memtx.rst
* Corrections after review. Part of #1632
* Update translations
* Update translations

Co-authored-by: ainoneko <ainoneko@users.noreply.github.com>
Co-authored-by: patiencedaur <patiencedaur@gmail.com>
1 parent 631368f commit 0793916

File tree

15 files changed: +4236 -1571 lines changed


conf.py

Lines changed: 0 additions & 1 deletion
@@ -62,7 +62,6 @@
  'book/connectors/__*',
  'book/replication/*_1.rst',
  'book/replication/*_2.rst',
- 'book/box/engines/vinyl.rst',
  'getting_started/using_package_manager.rst',
  'getting_started/using_docker.rst',
  'dev_guide/box_protocol.rst',

doc/book/box/engines/index.rst

Lines changed: 18 additions & 30 deletions
@@ -1,44 +1,34 @@
  .. _engines-chapter:

- ********************************************************************************
  Storage engines
- ********************************************************************************
+ ===============

- A storage engine is a set of very-low-level routines which actually store and
- retrieve tuple values. Tarantool offers a choice of two storage engines:
+ A storage engine is a set of low-level routines which actually store and
+ retrieve :term:`tuple <tuple>` values. Tarantool offers a choice of two storage engines:

- * memtx (the in-memory storage engine) is the default and was the first to
-   arrive.
+ * :doc:`memtx <memtx>` is the in-memory storage engine used by default.
+ * :doc:`vinyl <vinyl>` is the on-disk storage engine.

- * vinyl (the on-disk storage engine) is a working key-value engine and will
-   especially appeal to users who like to see data go directly to disk, so that
-   recovery time might be shorter and database size might be larger.
+ Below is a brief comparison of the two engines.
+ All the details on how each engine works can be found in the dedicated
+ sections:

- On the other hand, vinyl lacks some functions and options that are available
- with memtx. Where that is the case, the relevant description in this manual
- contains a note beginning with the words "Note re storage engine".
+ .. toctree::
+    :maxdepth: 1

- Further in this section we discuss the details of storing data using
- the vinyl storage engine.
-
- To specify that the engine should be vinyl, add the clause ``engine = 'vinyl'``
- when creating a space, for example:
-
- .. code-block:: lua
-
-     space = box.schema.space.create('name', {engine='vinyl'})
+    memtx
+    vinyl

  .. _vinyl_diff:

- ================================================================================
- Differences between memtx and vinyl storage engines
- ================================================================================
+ Difference between memtx and vinyl storage engines
+ --------------------------------------------------

- The primary difference between memtx and vinyl is that memtx is an "in-memory"
- engine while vinyl is an "on-disk" engine. An in-memory storage engine is
+ The primary difference between memtx and vinyl is that memtx is an in-memory
+ engine while vinyl is an on-disk engine. An in-memory storage engine is
  generally faster (each query is usually run under 1 ms), and the memtx engine
- is justifiably the default for Tarantool, but on-disk engine such as vinyl is
- preferable when the database is larger than the available memory and adding more
+ is justifiably the default for Tarantool. But an on-disk engine such as vinyl is
+ preferable when the database is larger than the available memory, and adding more
  memory is not a realistic option.

  .. container:: table

@@ -69,5 +59,3 @@ memory is not a realistic option.

  | yield | Does not yield on the select requests unless the | Yields on the select requests or on its equivalents: |
  | | transaction is committed to WAL | get() or pairs() |
  +---------------------------------------------+------------------------------------------------------+------------------------------------------------------+
-
- .. include:: vinyl.rst
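For reference, the engine is chosen per space at creation time; a minimal sketch (the space names are only examples):

.. code-block:: lua

    -- memtx is the default engine
    space = box.schema.space.create('test')

    -- the on-disk vinyl engine can be requested explicitly
    space = box.schema.space.create('test_on_disk', {engine = 'vinyl'})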

doc/book/box/engines/memtx.rst

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
.. _engines-memtx:

Storing data with memtx
=======================

The ``memtx`` storage engine is used in Tarantool by default. It keeps all data in random-access memory (RAM), and therefore has very low read latency.

The obvious question here is:
if all the data is stored in memory, how can you prevent data loss in case of an emergency such as an outage or a Tarantool instance failure?

First of all, Tarantool persists all data changes by writing requests to the write-ahead log (WAL) that is stored on disk.
Read more about that in the :ref:`memtx-persist` section.
In the case of a distributed application, there is an option of synchronous replication that keeps the data consistent on a quorum of replicas.
Although replication is not directly a storage engine topic, it is a part of the answer regarding data safety. Read more in the :ref:`memtx-replication` section.

In this chapter, the following topics are discussed in brief with references to other chapters that explain the subject matter in detail.

.. contents::
   :local:
   :depth: 1

.. _memtx-memory:

Memory model
------------

There is a fixed number of independent :ref:`execution threads <atomic-threads_fibers_yields>`.
The threads don't share state. Instead, they exchange data using low-overhead message queues.
While this approach limits the number of cores that the instance uses,
it removes competition for the memory bus and ensures peak scalability of memory access and network throughput.

Only one thread, namely, the **transaction processor thread** (further, **TX thread**)
can access the database, and there is only one TX thread for each Tarantool instance.
In this thread, transactions are executed in a strictly consecutive order.
Multi-statement transactions exist to provide isolation:
each transaction sees a consistent database state and commits all its changes atomically.
At commit time, a yield happens and all transaction changes are written to the :ref:`WAL <internals-wal>` in a single batch.
In case of errors during transaction execution, a transaction is rolled back completely.
Read more in the following sections: :ref:`atomic-transactions`, :ref:`atomic-transactional-manager`.
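A minimal sketch of these transaction semantics, assuming a space named ``test`` already exists:

.. code-block:: lua

    -- a multi-statement transaction: both inserts become visible to
    -- other fibers only after box.commit(), which writes them to the
    -- WAL in a single batch
    box.begin()
    box.space.test:insert{1, 'first'}
    box.space.test:insert{2, 'second'}
    box.commit()

    -- if an error is raised inside the transaction,
    -- all its changes can be discarded with box.rollback()
    box.begin()
    local ok, err = pcall(function()
        box.space.test:insert{3, 'third'}
        error('something went wrong')
    end)
    if not ok then
        box.rollback()
    end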

Within the TX thread, there is a memory area allocated for Tarantool to store data. It's called **Arena**.

.. image:: memtx/arena2.svg
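The size of the Arena is set by the ``memtx_memory`` configuration option; for example (the value below is only an illustration):

.. code-block:: lua

    -- reserve 256 MB of memory for tuples and indexes
    box.cfg{memtx_memory = 256 * 1024 * 1024}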

Data is stored in :term:`spaces <space>`. Spaces contain database records—:term:`tuples <tuple>`.
To access and manipulate the data stored in spaces and tuples, Tarantool builds :doc:`indexes </book/box/indexes>`.

Special `allocators <https://github.com/tarantool/small>`__ manage memory allocations for spaces, tuples, and indexes within the Arena.
The slab allocator is the main allocator used to store tuples.
Tarantool has a built-in module called ``box.slab`` which provides the slab allocator statistics
that can be used to monitor the total memory usage and memory fragmentation.
For details, see the ``box.slab`` module :doc:`reference </reference/reference_lua/box_slab>`.
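For example, a quick way to look at the allocator statistics (the exact set of returned fields may vary between versions):

.. code-block:: lua

    -- aggregate arena statistics: used and allocated memory, quota
    box.slab.info()

    -- per-slab statistics, useful for spotting fragmentation
    box.slab.stats()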

.. image:: memtx/spaces_indexes.svg

Also inside the TX thread, there is an event loop. Within the event loop, there are a number of :ref:`fibers <fiber-fibers>`.
Fibers are cooperative primitives that allow interaction with spaces, that is, reading and writing the data.
Fibers can interact with the event loop and with each other directly or by using special primitives called channels.
Due to the use of fibers and :ref:`cooperative multitasking <atomic-cooperative_multitasking>`, the ``memtx`` engine is lock-free in typical situations.

.. image:: memtx/fibers-channels.svg
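A minimal sketch of two fibers exchanging data through a channel, assuming a space named ``test`` exists:

.. code-block:: lua

    local fiber = require('fiber')

    -- a channel with a buffer for one message
    local channel = fiber.channel(1)

    -- a fiber that waits for a message and writes it to a space
    fiber.create(function()
        local value = channel:get()
        box.space.test:insert{4, value}
    end)

    -- another fiber (here, the main one) sends the message;
    -- the scheduler switches between fibers cooperatively
    channel:put('hello')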

To interact with external users, there is a separate :ref:`network thread <atomic-threads_fibers_yields>` also called the **iproto thread**.
The iproto thread receives a request from the network, parses and checks the statement,
and transforms it into a special structure—a message containing an executable statement and its options.
Then the iproto thread ships this message to the TX thread and runs the user's request in a separate fiber.

.. image:: memtx/iproto.svg
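From the user's side, the iproto thread is what serves remote connections, for example ones made with the built-in ``net.box`` module (the listen URI below is only an example):

.. code-block:: lua

    -- on the server: start listening for iproto connections
    box.cfg{listen = 3301}

    -- on a client: connect and send a request; the request is parsed
    -- in the iproto thread and then handed over to the TX thread
    local net_box = require('net.box')
    local conn = net_box.connect('127.0.0.1:3301')
    conn:ping()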

.. _memtx-persist:

Data persistence
----------------

To ensure :ref:`data persistence <index-box_persistence>`, Tarantool does two things.

* After executing data change requests in memory, Tarantool writes each such request to the :ref:`write-ahead log (WAL) <internals-wal>` files (``.xlog``)
  that are stored on disk. Tarantool does this via a separate thread called the **WAL thread**.

  .. image:: memtx/wal.svg

* Tarantool periodically takes the entire :doc:`database snapshot </reference/reference_lua/box_snapshot>` and saves it on disk.
  This is necessary to accelerate the instance's restart, because when there are too many WAL files, it can be difficult for Tarantool to restart quickly.

  To save a snapshot, there is a special fiber called the **snapshot daemon**.
  It reads the consistent content of the entire Arena and writes it to disk into a snapshot file (``.snap``).
  Due to cooperative multitasking, Tarantool cannot write directly to disk because it is a blocking operation.
  That is why Tarantool interacts with disk via a separate pool of threads from the :doc:`fio </reference/reference_lua/fio>` library.

  .. image:: memtx/snapshot03.svg
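A sketch of the related configuration and of taking a snapshot manually (the values are illustrative):

.. code-block:: lua

    box.cfg{
        wal_mode = 'write',          -- write requests to .xlog files
        checkpoint_interval = 3600,  -- take a snapshot every hour
        checkpoint_count = 2,        -- keep the two latest snapshots
    }

    -- a snapshot can also be taken manually
    box.snapshot()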

So, even in emergency situations such as an outage or a Tarantool instance failure,
when the in-memory database is lost, the data can be restored fully during Tarantool restart.

Here is what happens during the restart:

1. Tarantool finds the latest snapshot file and reads it.
2. Tarantool finds all the WAL files created after that snapshot and reads them as well.
3. When the snapshot and WAL files have been read, there is a fully recovered in-memory data set
   corresponding to the state when the Tarantool instance stopped.
4. While reading the snapshot and WAL files, Tarantool builds the primary indexes.
5. When all the data is in memory again, Tarantool builds the secondary indexes.
6. Tarantool runs the application.

.. _memtx-indexes:

Accessing data
--------------

To access and manipulate the data stored in memory, Tarantool builds indexes.
Indexes are also stored in memory within the Arena.

Tarantool supports a number of :ref:`index types <index-types>` intended for different usage scenarios.
The possible types are TREE, HASH, BITSET, and RTREE.

Select queries are possible against secondary index keys as well as primary keys.
Indexes can have multi-part keys.
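For example, a space with a primary TREE index and a multi-part secondary index might be defined as follows (the space and field names are illustrative):

.. code-block:: lua

    local bands = box.schema.space.create('bands', {engine = 'memtx'})

    -- primary index on the first field
    bands:create_index('primary', {type = 'tree', parts = {{1, 'unsigned'}}})

    -- multi-part secondary index on the band name and year
    bands:create_index('name_year', {
        type = 'tree',
        parts = {{2, 'string'}, {3, 'unsigned'}},
    })

    bands:insert{1, 'Roxette', 1986}

    -- select by the secondary index key
    bands.index.name_year:select{'Roxette'}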
118+
119+
For detailed information about indexes, refer to the :doc:`/book/box/indexes` page.
120+
121+
.. _memtx-replication:
122+
123+
Replicating data
124+
----------------
125+
126+
Although this topic is not directly related to the ``memtx`` engine, it completes the overall picture of how Tarantool works in case of a distributed application.
127+
128+
Replication allows multiple Tarantool instances to work on copies of the same database.
129+
The copies are kept in sync because each instance can communicate its changes to all the other instances.
130+
It is implemented via WAL replication.
131+
132+
To send data to a replica, Tarantool runs another thread called **relay**.
133+
Its purpose is to read the WAL files and send them to replicas.
134+
On a replica, the fiber called **applier** is run. It receives the changes from a remote node and applies them to the replica's Arena.
135+
All the changes are being written to WAL files via the replica's WAL thread as if they are done locally.
136+
137+
.. image:: memtx/replica-xlogs.svg
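A sketch of how a replica is pointed at its masters via the ``replication`` configuration option (the URIs are examples):

.. code-block:: lua

    -- on a replica: the applier fibers connect to these URIs,
    -- receive changes, and write them to the local WAL
    box.cfg{
        replication = {
            'replicator:password@192.168.0.101:3301',
            'replicator:password@192.168.0.102:3301',
        },
    }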

By default, :ref:`replication <replication-architecture>` in Tarantool is asynchronous: if a transaction
is committed locally on a master node, it does not mean it is replicated onto any
replicas.

:ref:`Synchronous replication <repl_sync>` exists to solve this problem. Synchronous transactions
are not considered committed, and no response is sent to the client, until they are
replicated onto some number of replicas.
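A sketch of enabling synchronous replication for a particular space in recent Tarantool versions (the space name and quorum value are examples):

.. code-block:: lua

    -- transactions touching this space wait for a quorum of replicas
    box.schema.space.create('bank_transfers', {is_sync = true})

    -- how many replicas must confirm a synchronous transaction
    box.cfg{replication_synchro_quorum = 2}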

For more information on replication, refer to the :doc:`corresponding chapter </book/replication/index>`.

.. _memtx-summary:

Summary
-------

The key points describing how the in-memory storage engine works can be summarized as follows:

* All data is in RAM.
* Access to data is from one thread.
* Tarantool writes all data change requests to the WAL.
* Data snapshots are taken periodically.
* Indexes are built to access the data.
* WAL can be replicated.

doc/book/box/engines/memtx/arena2.svg

Lines changed: 3 additions & 0 deletions
