This matches exactly our diagram of the primary index content for our example table:
-<img src={sparsePrimaryIndexes03b} class="image"/>
+
+
</p>
</details>
The primary key entries are called index marks because each index entry is marking the start of a specific data range. Specifically for the example table:
-- UserID index marks:<br/>
+- UserID index marks:
+
The stored `UserID` values in the primary index are sorted in ascending order.<br/>
‘mark 1’ in the diagram above thus indicates that the `UserID` values of all table rows in granule 1, and in all following granules, are guaranteed to be greater than or equal to 4.073.710.
[As we will see later](#the-primary-index-is-used-for-selecting-granules), this global order enables ClickHouse to <a href="https://github.com/ClickHouse/ClickHouse/blob/22.3/src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp#L1452" target="_blank">use a binary search algorithm</a> over the index marks for the first key column when a query is filtering on the first column of the primary key.
-- URL index marks:<br/>
+- URL index marks:
+
The quite similar cardinality of the primary key columns `UserID` and `URL`
means that the index marks for all key columns after the first column in general only indicate a data range as long as the predecessor key column value stays the same for all table rows within at least the current granule.<br/>
For example, because the UserID values of mark 0 and mark 1 are different in the diagram above, ClickHouse can't assume that all URL values of all table rows in granule 0 are greater than or equal to `'http://showtopics.html%3...'`. However, if the UserID values of mark 0 and mark 1 were the same in the diagram above (meaning that the UserID value stays the same for all table rows within granule 0), ClickHouse could assume that all URL values of all table rows in granule 0 are greater than or equal to `'http://showtopics.html%3...'`.
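
To make the binary search over the first key column's index marks more concrete, here is a minimal Python sketch. It is not ClickHouse's implementation: the function name `candidate_granules` and the mark values (other than 4.073.710 for mark 1) are assumptions made up for this illustration.

```python
from bisect import bisect_left, bisect_right

def candidate_granules(marks, user_id):
    """Return the granule numbers that may contain rows with the given UserID.

    `marks` holds the UserID of the first row of each granule, in ascending order.
    """
    # Granule of the last mark strictly smaller than user_id:
    # it may still end with rows that have this user_id.
    first = max(bisect_left(marks, user_id) - 1, 0)
    # Granule of the last mark smaller than or equal to user_id.
    last = bisect_right(marks, user_id) - 1
    return list(range(first, last + 1))

# Hypothetical UserID index marks; only the value for mark 1 comes from the text.
marks = [1_000, 4_073_710, 9_000_000]
print(candidate_granules(marks, 5_000_000))
```

Because the marks are globally sorted, each lookup needs only a logarithmic number of comparisons. The same trick does not work for the URL marks, since they are only sorted within runs of identical UserID values.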
@@ -625,7 +615,7 @@ We discuss that second stage in more detail in the following section.
The following diagram illustrates a part of the primary index file for our table.
-<img src={sparsePrimaryIndexes04} class="image"/>
+<Image img={sparsePrimaryIndexes04} size="lg" alt="Sparse Primary Indices 04" background="white"/>
As discussed above, via a binary search over the index’s 1083 UserID marks, mark 176 was identified. Its corresponding granule 176 can therefore possibly contain rows with a UserID column value of 749.927.693.
@@ -646,7 +636,8 @@ To achieve this, ClickHouse needs to know the physical location of granule 176.
In ClickHouse the physical locations of all granules for our table are stored in mark files. Similar to data files, there is one mark file per table column.
The following diagram shows the three mark files `UserID.mrk`, `URL.mrk`, and `EventTime.mrk` that store the physical locations of the granules for the table’s `UserID`, `URL`, and `EventTime` columns.
-<img src={sparsePrimaryIndexes05} class="image"/>
+
+<Image img={sparsePrimaryIndexes05} size="lg" alt="Sparse Primary Indices 05" background="white"/>
We have discussed how the primary index is a flat uncompressed array file (primary.idx), containing index marks that are numbered starting at 0.
@@ -697,7 +688,7 @@ The indirection provided by mark files avoids storing, directly within the prima
The following diagram and the text below illustrate how for our example query ClickHouse locates granule 176 in the UserID.bin data file.
-<img src={sparsePrimaryIndexes06} class="image"/>
+<Image img={sparsePrimaryIndexes06} size="lg" alt="Sparse Primary Indices 06" background="white"/>
We discussed earlier in this guide that ClickHouse selected the primary index mark 176 and therefore granule 176 as possibly containing matching rows for our query.
@@ -810,7 +801,8 @@ We have marked the key column values for the first table rows for each granule i
**Predecessor key column has low(er) cardinality**<a name="generic-exclusion-search-fast"></a>
Suppose UserID had low cardinality. In this case it would be likely that the same UserID value is spread over multiple table rows and granules and therefore index marks. For index marks with the same UserID, the URL values for the index marks are sorted in ascending order (because the table rows are ordered first by UserID and then by URL). This allows efficient filtering as described below:
-<img src={sparsePrimaryIndexes07} class="image"/>
+
+<Image img={sparsePrimaryIndexes07} size="lg" alt="Sparse Primary Indices 06" background="white"/>
There are three different scenarios for the granule selection process for our abstract sample data in the diagram above:
@@ -824,7 +816,7 @@ There are three different scenarios for the granule selection process for our ab
When the UserID has high cardinality then it is unlikely that the same UserID value is spread over multiple table rows and granules. This means the URL values for the index marks are not monotonically increasing:
-<img src={sparsePrimaryIndexes08} class="image"/>
+<Image img={sparsePrimaryIndexes08} size="lg" alt="Sparse Primary Indices 06" background="white"/>
As we can see in the diagram above, all shown marks whose URL values are smaller than W3 are selected for streaming their associated granules' rows into the ClickHouse engine.
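
The effect of the predecessor column's cardinality on granule selection can be sketched in a few lines of Python. This is a deliberately simplified model of the exclusion precondition, not ClickHouse's actual generic exclusion search. The function name `select_granules` and the sample marks are hypothetical, and the last mark is always kept because it has no successor to compare against.

```python
def select_granules(marks, target_url):
    """Return the granule numbers that cannot be excluded for `URL = target_url`.

    `marks` is a list of (user_id, url) key values of the first row of each granule.
    """
    selected = []
    for i, (user_id, url) in enumerate(marks):
        if i + 1 < len(marks):
            next_user_id, next_url = marks[i + 1]
            if user_id == next_user_id:
                # Every row in granule i has the same UserID, so the URL values
                # in this granule are sorted and lie between url and next_url.
                if target_url < url or target_url > next_url:
                    continue  # safe to exclude
        # Different UserID in the next mark (or no next mark): cannot exclude.
        selected.append(i)
    return selected

# Low cardinality: the same UserID spans all marks, so most granules are excluded.
low_cardinality = [("U1", "W1"), ("U1", "W2"), ("U1", "W4"), ("U1", "W5")]
# High cardinality: every mark has a different UserID, so nothing can be excluded.
high_cardinality = [("U1", "W1"), ("U2", "W2"), ("U3", "W1"), ("U4", "W2")]

print(select_granules(low_cardinality, "W3"))
print(select_granules(high_cardinality, "W3"))
```

In the high-cardinality case every mark is selected, which is why filtering on the second key column degrades towards a scan of almost all granules.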
@@ -858,7 +850,7 @@ ALTER TABLE hits_UserID_URL MATERIALIZE INDEX url_skipping_index;
```
ClickHouse has now created an additional index that stores, per group of 4 consecutive [granules](#data-is-organized-into-granules-for-parallel-data-processing) (note the `GRANULARITY 4` clause in the `ALTER TABLE` statement above), the minimum and maximum URL value:
-<img src={sparsePrimaryIndexes13a} class="image"/>
+<Image img={sparsePrimaryIndexes13a} size="lg" alt="Sparse Primary Indices 13a" background="white"/>
The first index entry (‘mark 0’ in the diagram above) is storing the minimum and maximum URL values for the [rows belonging to the first 4 granules of our table](#data-is-organized-into-granules-for-parallel-data-processing).
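
The idea behind such a minmax index can be sketched in Python as follows. This only models the concept and is not how ClickHouse stores or evaluates skipping indexes. The helper names `build_minmax_index` and `blocks_to_read` and the single-letter "URLs" are assumptions chosen for brevity.

```python
def build_minmax_index(granules, granularity=4):
    """Store (min, max) of the URL values for each group of `granularity` granules.

    `granules` is a list of non-empty lists of URL values, one list per granule.
    """
    index = []
    for start in range(0, len(granules), granularity):
        values = [url for granule in granules[start:start + granularity] for url in granule]
        index.append((min(values), max(values)))
    return index

def blocks_to_read(index, target_url):
    """Return the groups of granules whose [min, max] range may contain target_url."""
    return [i for i, (lo, hi) in enumerate(index) if lo <= target_url <= hi]

# Eight hypothetical granules, grouped 4 by 4 as with GRANULARITY 4.
granules = [["a", "c"], ["b", "d"], ["a", "b"], ["c", "d"],
            ["x", "z"], ["y", "z"], ["x", "y"], ["w", "z"]]
index = build_minmax_index(granules)
print(index)
print(blocks_to_read(index, "b"))
```

A group can only be skipped when the searched value falls outside its range, so the index helps most when similar URL values are stored close together on disk.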
@@ -897,15 +889,16 @@ All three options will effectively duplicate our sample data into a additional t
However, the three options differ in how transparent that additional table is to the user with respect to the routing of queries and insert statements.
When creating a **second table** with a different primary key, queries must be explicitly sent to the table version best suited for the query, and new data must be inserted explicitly into both tables in order to keep the tables in sync:
-<img src={sparsePrimaryIndexes09a} class="image"/>

+<Image img={sparsePrimaryIndexes09a} size="md" alt="Sparse Primary Indices 09a" background="white"/>
With a **materialized view** the additional table is implicitly created and data is automatically kept in sync between both tables:
-<img src={sparsePrimaryIndexes09b} class="image"/>

+<Image img={sparsePrimaryIndexes09b} size="md" alt="Sparse Primary Indices 09b" background="white"/>
And the **projection** is the most transparent option because, in addition to automatically keeping the implicitly created (and hidden) additional table in sync with data changes, ClickHouse will automatically choose the most effective table version for queries:
-<img src={sparsePrimaryIndexes09c} class="image"/>
+
+<Image img={sparsePrimaryIndexes09c} size="md" alt="Sparse Primary Indices 09c" background="white"/>
In the following we discuss these three options for creating and using multiple primary indexes in more detail and with real examples.
Because we switched the order of the columns in the primary key, the inserted rows are now stored on disk in a different lexicographical order (compared to our [original table](#a-table-with-a-primary-key)) and therefore the 1083 granules of that table also contain different values than before:
-<img src={sparsePrimaryIndexes10} class="image"/>
+<Image img={sparsePrimaryIndexes10} size="lg" alt="Sparse Primary Indices 10" background="white"/>
This is the resulting primary key:
-<img src={sparsePrimaryIndexes11} class="image"/>
+<Image img={sparsePrimaryIndexes11} size="lg" alt="Sparse Primary Indices 11" background="white"/>
That can now be used to significantly speed up the execution of our example query filtering on the URL column in order to calculate the top 10 users that most frequently clicked on the URL "http://public_search":
```sql
@@ -1074,7 +1067,6 @@ Server Log:
We now have two tables, optimized for speeding up queries filtering on `UserIDs` and for speeding up queries filtering on URLs, respectively:
- if new rows are inserted into the source table hits_UserID_URL, then those rows are automatically also inserted into the implicitly created table
- Effectively the implicitly created table has the same row order and primary index as the [secondary table that we created explicitly](/guides/best-practices/sparse-primary-indexes#option-1-secondary-tables):
-<img src={sparsePrimaryIndexes12b1} class="image"/>
+<Image img={sparsePrimaryIndexes12b1} size="lg" alt="Sparse Primary Indices 12b1" background="white"/>
ClickHouse is storing the [column data files](#data-is-stored-on-disk-ordered-by-primary-key-columns) (*.bin), the [mark files](#mark-files-are-used-for-locating-granules) (*.mrk2) and the [primary index](#the-primary-index-has-one-entry-per-granule) (primary.idx) of the implicitly created table in a special folder within the ClickHouse server's data directory:
-<img src={sparsePrimaryIndexes12b2} class="image"/>
+<Image img={sparsePrimaryIndexes12b2} size="md" alt="Sparse Primary Indices 12b2" background="white"/>
:::
@@ -1189,11 +1181,12 @@ ALTER TABLE hits_UserID_URL
- please note that projections do not make queries that use ORDER BY more efficient, even if the ORDER BY matches the projection's ORDER BY statement (see https://github.com/ClickHouse/ClickHouse/issues/47333)
- Effectively the implicitly created hidden table has the same row order and primary index as the [secondary table that we created explicitly](/guides/best-practices/sparse-primary-indexes#option-1-secondary-tables):
-<img src={sparsePrimaryIndexes12c1} class="image"/>
+<Image img={sparsePrimaryIndexes12c1} size="lg" alt="Sparse Primary Indices 12c1" background="white"/>
ClickHouse is storing the [column data files](#data-is-stored-on-disk-ordered-by-primary-key-columns) (*.bin), the [mark files](#mark-files-are-used-for-locating-granules) (*.mrk2) and the [primary index](#the-primary-index-has-one-entry-per-granule) (primary.idx) of the hidden table in a special folder (marked in orange in the screenshot below) next to the source table's data files, mark files, and primary index files:
-<img src={sparsePrimaryIndexes12c2} class="image"/>
+<Image img={sparsePrimaryIndexes12c2} size="sm" alt="Sparse Primary Indices 12c2" background="white"/>
+
:::
@@ -1455,7 +1448,8 @@ Having a good compression ratio for the data of a table's column on disk not onl
In the following we illustrate why it's beneficial for the compression ratio of a table's columns to order the primary key columns by cardinality in ascending order.
The diagram below sketches the on-disk order of rows for a primary key where the key columns are ordered by cardinality in ascending order:
-<img src={sparsePrimaryIndexes14a} class="image"/>
+
+<Image img={sparsePrimaryIndexes14a} size="lg" alt="Sparse Primary Indices 14a" background="white"/>
We discussed that [the table's row data is stored on disk ordered by primary key columns](#data-is-stored-on-disk-ordered-by-primary-key-columns).
@@ -1466,7 +1460,8 @@ In general, a compression algorithm benefits from the run length of data (the mo
and locality (the more similar the data is, the better the compression ratio is).
In contrast to the diagram above, the diagram below sketches the on-disk order of rows for a primary key where the key columns are ordered by cardinality in descending order:
-<img src={sparsePrimaryIndexes14b} class="image"/>
+
+<Image img={sparsePrimaryIndexes14b} size="lg" alt="Sparse Primary Indices 14b" background="white"/>
Now the table's rows are first ordered by their `ch` value, and rows that have the same `ch` value are ordered by their `cl` value.
But because the first key column `ch` has high cardinality, it is unlikely that there are rows with the same `ch` value. And because of that, it is also unlikely that `cl` values are ordered (locally - for rows with the same `ch` value).
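
A small experiment can make this effect tangible. The Python sketch below compares how well the `cl` column compresses under the two key orders. It uses `zlib` from the standard library purely as a stand-in for a general-purpose codec, and the row count, the cardinalities, and the helper name `compressed_size_of_cl` are all assumptions for this illustration, so the absolute numbers say nothing about ClickHouse itself.

```python
import random
import zlib

def compressed_size_of_cl(rows):
    """Compress only the `cl` column (one byte per row) and return its size in bytes."""
    cl_column = bytes(cl for cl, _ch in rows)
    return len(zlib.compress(cl_column))

# Synthetic rows (cl, ch): cl has 4 distinct values, ch has up to a million.
rng = random.Random(0)
rows = [(rng.randrange(4), rng.randrange(1_000_000)) for _ in range(10_000)]

ascending = sorted(rows, key=lambda r: (r[0], r[1]))   # like ORDER BY (cl, ch)
descending = sorted(rows, key=lambda r: (r[1], r[0]))  # like ORDER BY (ch, cl)

size_ascending = compressed_size_of_cl(ascending)
size_descending = compressed_size_of_cl(descending)
print(size_ascending, size_descending)
```

With `cl` first, the column consists of a few long runs of identical values. With `ch` first, the `cl` values appear in effectively random order and compress far worse.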
@@ -1508,7 +1503,8 @@ And one way to identify and retrieve (a specific version of) the pasted content
The following diagram shows
- the insert order of rows when the content changes (for example because of keystrokes typing the text into the text-area) and
- the on-disk order of the data from the inserted rows when the `PRIMARY KEY (hash)` is used:
-<img src={sparsePrimaryIndexes15a} class="image"/>
+
+<Image img={sparsePrimaryIndexes15a} size="lg" alt="Sparse Primary Indices 15a" background="white"/>
Because the `hash` column is used as the primary key column
- specific rows can be retrieved [very quickly](#the-primary-index-is-used-for-selecting-granules), but
@@ -1523,7 +1519,7 @@ The following diagram shows
- the insert order of rows when the content changes (for example because of keystrokes typing the text into the text-area) and
- the on-disk order of the data from the inserted rows when the compound `PRIMARY KEY (fingerprint, hash)` is used:
-<img src={sparsePrimaryIndexes15b} class="image"/>
+<Image img={sparsePrimaryIndexes15b} size="lg" alt="Sparse Primary Indices 15b" background="white"/>
Now the rows on disk are first ordered by `fingerprint`, and for rows with the same fingerprint value, their `hash` value determines the final order.