
Commit 1167dba

yizhang82 authored and Herman Lee committed
Improve count(*) performance and fix index merge bug
Summary:

Today, `SELECT count(*)` in MyRocks still decodes every single column because of this check, even though the read set is empty:

```
// bitmap is cleared on index merge, but it still needs to decode columns
bool field_requested =
    decode_all_fields || m_verify_row_debug_checksums ||
    bitmap_is_clear_all(field_map) ||
    bitmap_is_set(field_map, m_table->field[i]->field_index);
```

As a result, MyRocks is significantly slower than InnoDB in this particular scenario.

It turns out that during index merge, when the optimizer resets the handler, it calls ha_index_init with an empty column_bitmap, so our field decoders did not know they needed to decode anything and the entire query returned nothing. This is discussed in [this commit](facebook@70f2bcd), [issue 624](facebook#624), and [PR 626](facebook#626). The workaround we used at the time was to simply treat an empty map as implicitly meaning "decode everything", and the side effect is a massively slowed-down count(*).

We have a few options to address this:

1. Fix the index merge optimizer. Looking at the code in QUICK_RANGE_SELECT::init_ror_merged_scan, it actually fixes up the column_bitmap properly, but only after init/reset, so the fix would simply be moving the bitmap setup code up. For secondary keys, prepare_for_position automatically calls `mark_columns_used_by_index_no_reset(s->primary_key, read_set)` if HA_PRIMARY_KEY_REQUIRED_FOR_POSITION is set (true for both InnoDB and MyRocks), so we would correctly know that we need to unpack the PK when walking an SK during index merge.
2. Override `column_bitmaps_signal` and set up the decoders whenever the bitmap changes. This does not work by itself: because no storage engine today actually uses handler::column_bitmaps_signal, this path has not been tested properly in index merge. In this case, QUICK_RANGE_SELECT::init_ror_merged_scan would also need to call set_column_bitmaps_no_signal to avoid resetting the correct read/write set of `head`, since `head` is used both as the first handler (reuse_handler=true) and as the placeholder for subsequent read/write set updates (reuse_handler=false).
3. Follow InnoDB's solution. InnoDB delays rebuilding its row template until index_read is called the second time (relying on `prebuilt->sql_stat_start`); by that point `QUICK_RANGE_SELECT::column_bitmap` has already been fixed up and the table read/write set switched to it, so the new template is built correctly.

To keep the change easy to maintain and port, after discussing with Manuel, I'm going with a simplified version of option 3 that delays decoder creation until the first read operation (index_*, rnd_*, range_read_*, multi_range_read_*), setting the delay flag in index_init / rnd_init / multi_range_read_init (a condensed sketch follows below).

I also ran into a bug with truncate_partition where Rdb_converter's tbl_def is stale (we only update ha_rocksdb::m_tbl_def). Previously this was harmless because the stale copy was not used after table open, but my change moves the lookup_bitmap initialization into Rdb_converter, which depends on Rdb_converter::m_tbl_def, so it now needs to be reset properly.

Reference Patch: facebook@44d6a8d

---------

Porting Note: Due to 8.0's new counting infra (handler::record & handler::record_with_index), this only helps PK counting. A better fix that integrates with the new 8.0 counting infra will follow.

Reviewed By: Pushapgl

Differential Revision: D26265470
1 parent: e4e5295 · commit: 1167dba
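For orientation before the diff, here is a minimal, self-contained sketch of the delayed decoder-build pattern described in the summary. The class and method names (`LazyDecoderHandler`, `on_init`, `on_read`) are illustrative only; the actual implementation lives in the `ha_rocksdb.cc` diff below and uses `m_need_build_decoder`, `check_build_decoder()`, and `build_decoder()`.

```c++
#include <iostream>

// Illustrative-only sketch of the delayed decoder-build pattern; the real
// change (m_need_build_decoder / check_build_decoder / build_decoder) is in
// the ha_rocksdb.cc diff below.
class LazyDecoderHandler {
 public:
  // index_init / rnd_init / multi_range_read_init only set a flag. By the
  // time the first read runs, the optimizer (e.g. index merge) has already
  // fixed up the table read set, so the decoders see the final bitmap.
  void on_init() { need_build_decoder_ = true; }

  // Every read entry point (index_*, rnd_*, multi_range_read_*) calls this
  // before touching a row; the decoder is built at most once per init.
  void on_read() {
    if (need_build_decoder_) {
      build_decoder();
      need_build_decoder_ = false;
    }
  }

 private:
  void build_decoder() {
    // Stand-in for setup_field_decoders(read_set, active_index, ...).
    std::cout << "building field decoders from the current read set\n";
  }

  bool need_build_decoder_ = false;
};

int main() {
  LazyDecoderHandler h;
  h.on_init();  // e.g. index_init(): just mark the decoders as stale
  h.on_read();  // first index_next(): decoders are built here, once
  h.on_read();  // subsequent reads: no rebuild
  return 0;
}
```

The design point, per the summary above, is that building decoders eagerly in init would capture the temporary empty bitmap installed by index merge, while deferring to the first read captures the corrected bitmap without touching the optimizer.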

File tree: 5 files changed (+84, -38 lines)


storage/rocksdb/ha_rocksdb.cc

Lines changed: 50 additions & 22 deletions
```diff
@@ -7330,7 +7330,8 @@ ha_rocksdb::ha_rocksdb(my_core::handlerton *const hton,
       mrr_enabled_keyread(false),
       mrr_used_cpk(false),
       m_in_rpl_delete_rows(false),
-      m_in_rpl_update_rows(false) {}
+      m_in_rpl_update_rows(false),
+      m_need_build_decoder(false) {}
 
 ha_rocksdb::~ha_rocksdb() {
   int err MY_ATTRIBUTE((__unused__));
@@ -8803,8 +8804,7 @@ int ha_rocksdb::create(const char *const name, TABLE *const table_arg,
 */
 int ha_rocksdb::truncate_table(Rdb_tbl_def *tbl_def_arg,
                                const std::string &actual_user_table_name,
-                               const TABLE *table_arg,
-                               ulonglong auto_increment_value,
+                               TABLE *table_arg, ulonglong auto_increment_value,
                                dd::Table *table_def) {
   DBUG_ENTER_FUNC();
 
@@ -8892,6 +8892,7 @@ int ha_rocksdb::truncate_table(Rdb_tbl_def *tbl_def_arg,
 
   /* Update the local m_tbl_def reference */
   m_tbl_def = ddl_manager.find(orig_tablename);
+  m_converter.reset(new Rdb_converter(ha_thd(), m_tbl_def, table_arg));
   DBUG_RETURN(err);
 }
 
@@ -9187,8 +9188,9 @@ int ha_rocksdb::read_row_from_secondary_key(uchar *const buf,
 #endif
   DBUG_EXECUTE_IF("dbug.rocksdb.HA_EXTRA_KEYREAD", { m_keyread_only = true; });
 
-  bool covered_lookup = (m_keyread_only && kd.can_cover_lookup()) ||
-                        kd.covers_lookup(&value, &m_lookup_bitmap);
+  bool covered_lookup =
+      (m_keyread_only && kd.can_cover_lookup()) ||
+      kd.covers_lookup(&value, m_converter->get_lookup_bitmap());
 
 #ifndef NDEBUG
   m_keyread_only = save_keyread_only;
@@ -9318,7 +9320,8 @@ int ha_rocksdb::secondary_index_read(const int keyno, uchar *const buf) {
       rocksdb::Slice value = m_scan_it->value();
       bool covered_lookup =
           (m_keyread_only && m_key_descr_arr[keyno]->can_cover_lookup()) ||
-          m_key_descr_arr[keyno]->covers_lookup(&value, &m_lookup_bitmap);
+          m_key_descr_arr[keyno]->covers_lookup(
+              &value, m_converter->get_lookup_bitmap());
       if (covered_lookup && m_lock_rows == RDB_LOCK_NONE) {
         rc = m_key_descr_arr[keyno]->unpack_record(
             table, buf, &key, &value,
@@ -9363,6 +9366,8 @@ int ha_rocksdb::index_read_map(uchar *const buf, const uchar *const key,
                                enum ha_rkey_function find_flag) {
   DBUG_ENTER_FUNC();
 
+  check_build_decoder();
+
   DBUG_EXECUTE_IF(
       "myrocks_busy_loop_on_row_read", volatile int debug_i = 0;
       while (1) { debug_i++; });
@@ -10005,6 +10010,8 @@ int ha_rocksdb::records_from_index(ha_rows *num_rows, uint index) {
 int ha_rocksdb::index_next(uchar *const buf) {
   DBUG_ENTER_FUNC();
 
+  check_build_decoder();
+
   bool moves_forward = true;
   ha_statistic_increment(&System_status_var::ha_read_next_count);
   if (m_key_descr_arr[active_index]->m_is_reverse_cf) {
@@ -10025,6 +10032,8 @@ int ha_rocksdb::index_next(uchar *const buf) {
 int ha_rocksdb::index_prev(uchar *const buf) {
   DBUG_ENTER_FUNC();
 
+  check_build_decoder();
+
   bool moves_forward = false;
   ha_statistic_increment(&System_status_var::ha_read_prev_count);
   if (m_key_descr_arr[active_index]->m_is_reverse_cf) {
@@ -10085,6 +10094,8 @@ int ha_rocksdb::index_next_with_direction(uchar *const buf, bool move_forward) {
 int ha_rocksdb::index_first(uchar *const buf) {
   DBUG_ENTER_FUNC();
 
+  check_build_decoder();
+
   m_sk_match_prefix = nullptr;
   ha_statistic_increment(&System_status_var::ha_read_first_count);
   int rc = index_read_intern(buf, true /* first */);
@@ -10101,6 +10112,8 @@ int ha_rocksdb::index_first(uchar *const buf) {
 int ha_rocksdb::index_last(uchar *const buf) {
   DBUG_ENTER_FUNC();
 
+  check_build_decoder();
+
   m_sk_match_prefix = nullptr;
   ha_statistic_increment(&System_status_var::ha_read_last_count);
   int rc = index_read_intern(buf, false /* first */);
@@ -11472,17 +11485,16 @@ void ha_rocksdb::setup_iterator_for_rnd_scan() {
 int ha_rocksdb::rnd_init(bool scan) {
   DBUG_ENTER_FUNC();
 
+  m_need_build_decoder = true;
+  active_index = table->s->primary_key;
+
   THD *thd = ha_thd();
   if (thd && thd->killed) {
     DBUG_RETURN(HA_ERR_QUERY_INTERRUPTED);
   }
 
   Rdb_transaction *const tx = get_or_create_tx(table->in_use);
 
-  // when this table is being updated, decode all fields
-  m_converter->setup_field_decoders(table->read_set,
-                                    m_lock_rows == RDB_LOCK_WRITE);
-
   if (scan) {
     m_rnd_scan_is_new_snapshot = !tx->has_snapshot();
     setup_iterator_for_rnd_scan();
@@ -11506,6 +11518,8 @@ int ha_rocksdb::rnd_init(bool scan) {
 int ha_rocksdb::rnd_next(uchar *const buf) {
   DBUG_ENTER_FUNC();
 
+  check_build_decoder();
+
   int rc;
   ha_statistic_increment(&System_status_var::ha_read_rnd_next_count);
   for (;;) {
@@ -11651,11 +11665,26 @@ int ha_rocksdb::rnd_next_with_direction(uchar *const buf, bool move_forward) {
 int ha_rocksdb::rnd_end() {
   DBUG_ENTER_FUNC();
 
+  m_need_build_decoder = false;
+
   release_scan_iterator();
 
   DBUG_RETURN(HA_EXIT_SUCCESS);
 }
 
+void ha_rocksdb::build_decoder() {
+  m_converter->setup_field_decoders(table->read_set, active_index,
+                                    m_keyread_only,
+                                    m_lock_rows == RDB_LOCK_WRITE);
+}
+
+void ha_rocksdb::check_build_decoder() {
+  if (m_need_build_decoder) {
+    build_decoder();
+    m_need_build_decoder = false;
+  }
+}
+
 /**
   @return
     HA_EXIT_SUCCESS  OK
@@ -11664,6 +11693,9 @@ int ha_rocksdb::rnd_end() {
 int ha_rocksdb::index_init(uint idx, bool sorted MY_ATTRIBUTE((__unused__))) {
   DBUG_ENTER_FUNC();
 
+  m_need_build_decoder = true;
+  active_index = idx;
+
   THD *thd = ha_thd();
   if (thd && thd->killed) {
     DBUG_RETURN(HA_ERR_QUERY_INTERRUPTED);
@@ -11672,21 +11704,11 @@ int ha_rocksdb::index_init(uint idx, bool sorted MY_ATTRIBUTE((__unused__))) {
   Rdb_transaction *const tx = get_or_create_tx(table->in_use);
   assert(tx != nullptr);
 
-  // when this table is being updated, decode all fields
-  m_converter->setup_field_decoders(table->read_set,
-                                    m_lock_rows == RDB_LOCK_WRITE);
-
-  if (!m_keyread_only) {
-    m_key_descr_arr[idx]->get_lookup_bitmap(table, &m_lookup_bitmap);
-  }
-
   // If m_lock_rows is not RDB_LOCK_NONE then we will be doing a get_for_update
   // when accessing the index, so don't acquire the snapshot right away.
   // Otherwise acquire the snapshot immediately.
   tx->acquire_snapshot(m_lock_rows == RDB_LOCK_NONE);
 
-  active_index = idx;
-
   DBUG_RETURN(HA_EXIT_SUCCESS);
 }
 
@@ -11697,9 +11719,9 @@ int ha_rocksdb::index_init(uint idx, bool sorted MY_ATTRIBUTE((__unused__))) {
 int ha_rocksdb::index_end() {
   DBUG_ENTER_FUNC();
 
-  release_scan_iterator();
+  m_need_build_decoder = false;
 
-  bitmap_free(&m_lookup_bitmap);
+  release_scan_iterator();
 
   active_index = MAX_KEY;
   in_range_check_pushed_down = false;
@@ -12052,6 +12074,8 @@ void ha_rocksdb::position(const uchar *const record) {
 int ha_rocksdb::rnd_pos(uchar *const buf, uchar *const pos) {
   DBUG_ENTER_FUNC();
 
+  check_build_decoder();
+
   int rc;
   size_t len;
 
@@ -16868,6 +16892,8 @@ class Mrr_sec_key_rowid_source : public Mrr_rowid_source {
 int ha_rocksdb::multi_range_read_init(RANGE_SEQ_IF *seq, void *seq_init_param,
                                       uint n_ranges, uint mode,
                                       HANDLER_BUFFER *buf) {
+  m_need_build_decoder = true;
+
   int res;
 
   if (!current_thd->optimizer_switch_flag(OPTIMIZER_SWITCH_MRR) ||
@@ -17114,6 +17140,8 @@ void ha_rocksdb::mrr_free_rows() {
 }
 
 int ha_rocksdb::multi_range_read_next(char **range_info) {
+  check_build_decoder();
+
   if (mrr_uses_default_impl) {
     return handler::multi_range_read_next(range_info);
   }
```

storage/rocksdb/ha_rocksdb.h

Lines changed: 7 additions & 8 deletions
```diff
@@ -383,13 +383,6 @@ class ha_rocksdb : public my_core::handler {
 
   void set_last_rowkey(const uchar *const old_data);
 
-  /*
-    For the active index, indicates which columns must be covered for the
-    current lookup to be covered. If the bitmap field is null, that means this
-    index does not cover the current lookup for any record.
-  */
-  MY_BITMAP m_lookup_bitmap;
-
   int alloc_key_buffers(const TABLE *const table_arg,
                         const Rdb_tbl_def *const tbl_def_arg,
                         bool alloc_alter_buffers = false)
@@ -956,7 +949,7 @@ class ha_rocksdb : public my_core::handler {
                      dd::Table *table_def);
   int truncate_table(Rdb_tbl_def *tbl_def,
                      const std::string &actual_user_table_name,
-                     const TABLE *table_arg, ulonglong auto_increment_value,
+                     TABLE *table_arg, ulonglong auto_increment_value,
                      dd::Table *table_def);
   bool check_if_incompatible_data(HA_CREATE_INFO *const info,
                                   uint table_changes) override
@@ -1012,6 +1005,9 @@ class ha_rocksdb : public my_core::handler {
   void update_row_read(ulonglong count);
   static void inc_covered_sk_lookup();
 
+  void build_decoder();
+  void check_build_decoder();
+
  protected:
   int records(ha_rows *num_rows) override;
   int records_from_index(ha_rows *num_rows, uint index) override;
@@ -1048,6 +1044,9 @@ class ha_rocksdb : public my_core::handler {
   /* Flags tracking if we are inside different replication operation */
   bool m_in_rpl_delete_rows;
   bool m_in_rpl_update_rows;
+
+  /* Need to build decoder on next read operation */
+  bool m_need_build_decoder;
 };
 
 /*
```

storage/rocksdb/nosql_access.cc

Lines changed: 2 additions & 1 deletion
```diff
@@ -1306,7 +1306,8 @@ void INLINE_ATTR select_exec::scan_value() {
     m_key_def->get_lookup_bitmap(m_table, &m_lookup_bitmap);
   }
 
-  m_converter->setup_field_decoders(m_table->read_set, m_index);
+  m_converter->setup_field_decoders(m_table->read_set, m_index,
+                                    false /* keyread_only */);
 }
 
 bool INLINE_ATTR select_exec::run() {
```

storage/rocksdb/rdb_converter.cc

Lines changed: 15 additions & 5 deletions
```diff
@@ -313,6 +313,7 @@ Rdb_converter::~Rdb_converter() {
   m_encoder_arr = nullptr;
   // These are needed to suppress valgrind errors in rocksdb.partition
   m_storage_record.mem_free();
+  bitmap_free(&m_lookup_bitmap);
 }
 
 /*
@@ -337,31 +338,30 @@ void Rdb_converter::get_storage_type(Rdb_field_encoder *const encoder,
   Setup which fields will be unpacked when reading rows
 
   @detail
-    Three special cases when we still unpack all fields:
+    Two special cases when we still unpack all fields:
     - When client requires decode_all_fields, such as this table is being
       updated (m_lock_rows==RDB_LOCK_WRITE).
     - When @@rocksdb_verify_row_debug_checksums is ON (In this mode, we need
       to read all fields to find whether there is a row checksum at the end. We
       could skip the fields instead of decoding them, but currently we do
       decoding.)
-    - On index merge as bitmap is cleared during that operation
 
   @seealso
     Rdb_converter::setup_field_encoders()
     Rdb_converter::convert_record_from_storage_format()
 */
 void Rdb_converter::setup_field_decoders(const MY_BITMAP *field_map,
+                                         uint active_index, bool keyread_only,
                                          bool decode_all_fields) {
   m_key_requested = false;
   m_decoders_vect.clear();
+  bitmap_free(&m_lookup_bitmap);
   int last_useful = 0;
   int skip_size = 0;
 
   for (uint i = 0; i < m_table->s->fields; i++) {
-    // bitmap is cleared on index merge, but it still needs to decode columns
     bool field_requested =
         decode_all_fields || m_verify_row_debug_checksums ||
-        bitmap_is_clear_all(field_map) ||
         bitmap_is_set(field_map, m_table->field[i]->field_index());
 
     // We only need the decoder if the whole record is stored.
@@ -397,6 +397,11 @@ void Rdb_converter::setup_field_decoders(const MY_BITMAP *field_map,
   // skipping. Remove them.
   m_decoders_vect.erase(m_decoders_vect.begin() + last_useful,
                         m_decoders_vect.end());
+
+  if (!keyread_only && active_index != m_table->s->primary_key) {
+    m_tbl_def->m_key_descr_arr[active_index]->get_lookup_bitmap(
+        m_table, &m_lookup_bitmap);
+  }
 }
 
 void Rdb_converter::setup_field_encoders() {
@@ -584,6 +589,11 @@ int Rdb_converter::convert_record_from_storage_format(
     const rocksdb::Slice *const key_slice,
     const rocksdb::Slice *const value_slice, uchar *const dst,
     bool decode_value = true) {
+  bool skip_value = !decode_value || get_decode_fields()->size() == 0;
+  if (!m_key_requested && skip_value) {
+    return HA_EXIT_SUCCESS;
+  }
+
   int err = HA_EXIT_SUCCESS;
 
   Rdb_string_reader value_slice_reader(value_slice);
@@ -605,7 +615,7 @@ int Rdb_converter::convert_record_from_storage_format(
     }
   }
 
-  if (!decode_value || get_decode_fields()->size() == 0) {
+  if (skip_value) {
     // We are done
     return HA_EXIT_SUCCESS;
   }
```

storage/rocksdb/rdb_converter.h

Lines changed: 10 additions & 2 deletions
```diff
@@ -134,8 +134,8 @@ class Rdb_converter {
   Rdb_converter &operator=(const Rdb_converter &decoder) = delete;
   ~Rdb_converter();
 
-  void setup_field_decoders(const MY_BITMAP *field_map,
-                            bool decode_all_fields = false);
+  void setup_field_decoders(const MY_BITMAP *field_map, uint active_index,
+                            bool keyread_only, bool decode_all_fields = false);
 
   int decode(const std::shared_ptr<Rdb_key_def> &key_def, uchar *dst,
              const rocksdb::Slice *key_slice, const rocksdb::Slice *value_slice,
@@ -172,6 +172,8 @@ class Rdb_converter {
     return &m_decoders_vect;
   }
 
+  const MY_BITMAP *get_lookup_bitmap() { return &m_lookup_bitmap; }
+
   int decode_value_header_for_pk(Rdb_string_reader *reader,
                                  const std::shared_ptr<Rdb_key_def> &pk_def,
                                  rocksdb::Slice *unpack_slice);
@@ -241,5 +243,11 @@ class Rdb_converter {
   my_core::ha_rows m_row_checksums_checked;
   // buffer to hold data during encode_value_slice
   String m_storage_record;
+  /*
+    For the active index, indicates which columns must be covered for the
+    current lookup to be covered. If the bitmap field is null, that means this
+    index does not cover the current lookup for any record.
+  */
+  MY_BITMAP m_lookup_bitmap;
 };
 }  // namespace myrocks
```
