.. _engines-memtx:

Storing data with memtx
=======================

The ``memtx`` storage engine is used in Tarantool by default. It keeps all data in random-access memory (RAM) and therefore has very low read latency.

The obvious question here is:
if all the data is stored in memory, how can you prevent data loss in case of an emergency such as a power outage or an instance failure?

First of all, Tarantool persists all data changes by writing requests to the write-ahead log (WAL) stored on disk.
Read more about that in the :ref:`memtx-persist` section.
For a distributed application, there is also an option of synchronous replication, which keeps the data consistent on a quorum of replicas.
Although replication is not directly a storage engine topic, it is part of the answer regarding data safety. Read more in the :ref:`memtx-replication` section.

This chapter briefly discusses the following topics, with references to other chapters that explain the subject matter in detail.

.. contents::
    :local:
    :depth: 1

.. _memtx-memory:

Memory model
------------

There is a fixed number of independent :ref:`execution threads <atomic-threads_fibers_yields>`.
The threads don't share state. Instead, they exchange data using low-overhead message queues.
While this approach limits the number of cores that the instance uses,
it removes competition for the memory bus and ensures peak scalability of memory access and network throughput.

Only one thread, the **transaction processor thread** (further, **TX thread**),
can access the database, and there is exactly one TX thread for each Tarantool instance.
In this thread, transactions are executed in a strictly consecutive order.
Multi-statement transactions exist to provide isolation:
each transaction sees a consistent database state and commits all its changes atomically.
At commit time, a yield happens and all transaction changes are written to the :ref:`WAL <internals-wal>` in a single batch.
In case of an error during transaction execution, the transaction is rolled back completely.
Read more in the following sections: :ref:`atomic-transactions`, :ref:`atomic-transactional-manager`.
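
For illustration, here is a minimal sketch of a multi-statement transaction.
The space name ``bands`` and its contents are hypothetical:

.. code-block:: lua

    -- All statements between box.begin() and box.commit()
    -- are applied atomically and in isolation.
    local ok = pcall(function()
        box.begin()
        box.space.bands:insert{1, 'Roxette', 1986}
        box.space.bands:insert{2, 'Scorpions', 1965}
        box.commit()
    end)
    if not ok then
        box.rollback()  -- undo every statement of the failed transaction
    end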

Within the TX thread, there is a memory area allocated for Tarantool to store data, called the **Arena**.

.. image:: memtx/arena2.svg
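
The size of the Arena is set with the ``memtx_memory`` parameter of ``box.cfg``.
A minimal sketch, with an example value rather than a recommendation:

.. code-block:: lua

    -- Allocate 512 MB for the memtx Arena.
    -- The value can be increased at runtime, but not decreased.
    box.cfg{
        memtx_memory = 512 * 1024 * 1024,
    }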

Data is stored in :term:`spaces <space>`. Spaces contain database records called :term:`tuples <tuple>`.
To access and manipulate the data stored in spaces and tuples, Tarantool builds :doc:`indexes </book/box/indexes>`.

Special `allocators <https://github.com/tarantool/small>`__ manage memory allocations for spaces, tuples, and indexes within the Arena.
The slab allocator is the main allocator used to store tuples.
Tarantool has a built-in module called ``box.slab`` that provides slab allocator statistics,
which can be used to monitor total memory usage and memory fragmentation.
For details, see the ``box.slab`` module :doc:`reference </reference/reference_lua/box_slab>`.

.. image:: memtx/spaces_indexes.svg
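
For example, a quick look at memory usage and fragmentation might be sketched as follows.
The field names come from the ``box.slab`` reference; the actual numbers depend on the instance:

.. code-block:: lua

    -- Aggregate allocator statistics: Arena size and usage.
    local info = box.slab.info()
    print(('arena used: %d of %d (%s)'):format(
        info.arena_used, info.arena_size, info.arena_used_ratio))

    -- Per-slab statistics help to spot fragmentation:
    -- many slabs with low mem_used indicate fragmented memory.
    for _, slab in ipairs(box.slab.stats()) do
        print(slab.item_size, slab.slab_count, slab.mem_used)
    end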

Also inside the TX thread, there is an event loop. Within the event loop, there are a number of :ref:`fibers <fiber-fibers>`.
Fibers are cooperative primitives that allow interaction with spaces, that is, reading and writing the data.
Fibers can interact with the event loop and with each other directly or by using special primitives called channels.
Due to the usage of fibers and :ref:`cooperative multitasking <atomic-cooperative_multitasking>`, the ``memtx`` engine is lock-free in typical situations.

.. image:: memtx/fibers-channels.svg
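
A minimal sketch of two fibers exchanging data through a channel (the message and names are illustrative):

.. code-block:: lua

    local fiber = require('fiber')

    -- A channel with a buffer of one message.
    local ch = fiber.channel(1)

    -- The producer fiber puts a message into the channel.
    fiber.create(function()
        ch:put('ping')
    end)

    -- The consumer fiber blocks on get() until a message arrives,
    -- yielding control to other fibers while it waits.
    fiber.create(function()
        print('received: ' .. ch:get())
    end)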

To interact with external users, there is a separate :ref:`network thread <atomic-threads_fibers_yields>`, also called the **iproto thread**.
The iproto thread receives a request from the network, parses and checks the statement,
and transforms it into a special structure: a message containing an executable statement and its options.
Then the iproto thread ships this message to the TX thread, which runs the user's request in a separate fiber.

.. image:: memtx/iproto.svg

.. _memtx-persist:

Data persistence
----------------

To ensure :ref:`data persistence <index-box_persistence>`, Tarantool does two things.

* After executing data change requests in memory, Tarantool writes each such request to the :ref:`write-ahead log (WAL) <internals-wal>` files (``.xlog``)
  that are stored on disk. Tarantool does this via a separate thread called the **WAL thread**.

.. image:: memtx/wal.svg

* Tarantool periodically takes the entire :doc:`database snapshot </reference/reference_lua/box_snapshot>` and saves it on disk.
  This accelerates the instance's restart: replaying too many WAL files makes it difficult for Tarantool to restart quickly.

  To save a snapshot, there is a special fiber called the **snapshot daemon**.
  It reads the consistent content of the entire Arena and writes it on disk into a snapshot file (``.snap``).
  Because of cooperative multitasking, Tarantool cannot write to disk directly: that would be a blocking operation.
  That is why Tarantool interacts with the disk via a separate pool of threads from the :doc:`fio </reference/reference_lua/fio>` library.
  A configuration sketch for both mechanisms follows the figure below.

.. image:: memtx/snapshot03.svg
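
Both mechanisms are configurable through ``box.cfg``. A minimal sketch with example values:

.. code-block:: lua

    box.cfg{
        wal_mode = 'write',          -- write each data change request to the WAL
        checkpoint_interval = 3600,  -- take a snapshot every hour
        checkpoint_count = 2,        -- keep only the two latest snapshots
    }

    -- A snapshot can also be taken manually at any time:
    box.snapshot()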

So, even in emergency situations such as an outage or a Tarantool instance failure,
when the in-memory database is lost, the data can be fully restored during Tarantool restart.

What happens during the restart:

1. Tarantool finds the latest snapshot file and reads it.
2. Tarantool finds all the WAL files created after that snapshot and reads them as well.
3. When the snapshot and WAL files have been read, there is a fully recovered in-memory data set
   corresponding to the state when the Tarantool instance stopped.
4. While reading the snapshot and WAL files, Tarantool builds the primary indexes.
5. When all the data is in memory again, Tarantool builds the secondary indexes.
6. Tarantool runs the application.

.. _memtx-indexes:

Accessing data
--------------

To access and manipulate the data stored in memory, Tarantool builds indexes.
Indexes are also stored in memory within the Arena.

Tarantool supports a number of :ref:`index types <index-types>` intended for different usage scenarios.
The possible types are TREE, HASH, BITSET, and RTREE.

Select queries are possible against secondary index keys as well as primary keys.
Indexes can have multi-part keys.
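
For example, a sketch of a space with a TREE primary index and a multi-part secondary index.
The space and field names are hypothetical:

.. code-block:: lua

    local writers = box.schema.space.create('writers')

    -- Primary key: a TREE index on the numeric first field.
    writers:create_index('primary', {type = 'TREE', parts = {{1, 'unsigned'}}})

    -- Secondary index with a multi-part key: last name, then first name.
    writers:create_index('name', {type = 'TREE', parts = {{2, 'string'}, {3, 'string'}}})

    writers:insert{1, 'Tolstoy', 'Leo'}

    -- Select by the secondary index, using a partial key.
    writers.index.name:select{'Tolstoy'}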

For detailed information about indexes, refer to the :doc:`/book/box/indexes` page.

.. _memtx-replication:

Replicating data
----------------

Although this topic is not directly related to the ``memtx`` engine, it completes the overall picture of how Tarantool works in a distributed application.

Replication allows multiple Tarantool instances to work on copies of the same database.
The copies are kept in sync because each instance can communicate its changes to all the other instances.
It is implemented via WAL replication.

To send data to a replica, Tarantool runs another thread called the **relay**.
Its purpose is to read the WAL files and send them to replicas.
On a replica, a fiber called the **applier** receives the changes from a remote node and applies them to the replica's Arena.
All the changes are written to the replica's own WAL files via its WAL thread, as if they were made locally.

.. image:: memtx/replica-xlogs.svg

By default, :ref:`replication <replication-architecture>` in Tarantool is asynchronous: if a transaction
is committed locally on a master node, it does not mean it is replicated onto any
replicas.

:ref:`Synchronous replication <repl_sync>` exists to solve this problem. Synchronous transactions
are not considered committed and are not acknowledged to the client until they are
replicated onto some number of replicas.
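
A sketch of how these modes are set up; the URIs, credentials, and quorum value are examples:

.. code-block:: lua

    -- Asynchronous replication: list the peers to fetch changes from.
    box.cfg{
        replication = {
            'replicator:password@10.0.0.1:3301',
            'replicator:password@10.0.0.2:3301',
        },
    }

    -- Synchronous replication: require acknowledgements from a quorum
    -- of replicas, and mark individual spaces as synchronous.
    box.cfg{replication_synchro_quorum = 2}
    box.schema.space.create('bank_accounts', {is_sync = true})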

For more information on replication, refer to the :doc:`corresponding chapter </book/replication/index>`.

.. _memtx-summary:

Summary
-------

The key points of how the in-memory storage engine works can be summarized as follows:

* All data is in RAM.
* Data is accessed from a single (TX) thread.
* Tarantool writes all data change requests to the WAL.
* Data snapshots are taken periodically.
* Indexes are built to access the data.
* The WAL can be replicated.