Commit 00b262d

Implement pre-packed blobs serialization on disk and their memory mapping on load (microsoft#23069)
### Description

Pre-packing is a feature that allows kernels to rearrange weight data to gain performance at inference time.

Currently, pre-packed blobs are shared only when cross-session weight sharing is enabled, and only for those weights that the user marks as shared. Otherwise, the data resides on the heap and is owned by the kernels, so it may be duplicated.

This change enables pre-packed data to be stored on disk alongside the external initializers. The pre-packed blobs are memory-mapped and loaded into either the cross-session (X-session) shared container or a new container that shares pre-packed blobs within the session.

With the new approach, pre-packed blobs are always owned by a shared container, using the existing pre-pack mechanism for sharing. When X-session sharing is enabled, the external container owns the data. When X-session sharing is not enabled, a separate container owned by the root `SessionState` owns and shares the data.

To facilitate this new approach, we introduce a new container that works in two modes. In the first mode, when an optimized model is being saved and saving of pre-packed weights is enabled, the new container records pre-packed blobs and serializes them to disk using the existing `ToGraphProtoWithExternalInitializers` function.

To externalize the pre-packed weights, we introduce a new session option, `kOrtSessionOptionsSavePrePackedConstantInitializers`. Note that pre-packing must be enabled (the default) for this to work.

The `ToGraphProtoWithExternalInitializers` function is modified to recurse into subgraphs so that local initializer names are properly accounted for.

In the second mode, the container simply holds the pre-packed weights memory-mapped from disk and shares them with the kernels.

### Motivation and Context

Reduce the memory used by pre-packed initializers and externalize them.
1 parent 29bccad commit 00b262d

28 files changed, +1308 -266 lines changed

include/onnxruntime/core/framework/op_kernel.h

Lines changed: 1 addition & 0 deletions
@@ -7,6 +7,7 @@

 // It is safe to include the below header even if SHARED_PROVIDER macro is enabled
 // as it doesn't include any pb headers.
+#include "core/framework/buffer_deleter.h"
 #include "core/framework/prepacked_weights_container.h"

 #ifndef SHARED_PROVIDER

include/onnxruntime/core/graph/graph.h

Lines changed: 58 additions & 34 deletions
@@ -3,14 +3,15 @@

 #pragma once

+#include <filesystem>
 #include <functional>
 #include <limits>
 #include <memory>
+#include <optional>
 #include <string>
 #include <type_traits>
 #include <unordered_map>
 #include <unordered_set>
-#include <filesystem>

 #include "core/common/flatbuffers.h"

@@ -19,13 +20,14 @@
 #include "core/common/common.h"
 #include "core/common/path_string.h"
 #include "core/common/const_pointer_container.h"
+#include "core/common/inlined_containers_fwd.h"
 #if !defined(ORT_MINIMAL_BUILD)
 #include "core/common/inlined_containers.h"
 #endif
-#include "core/common/inlined_containers_fwd.h"
 #include "core/common/span_utils.h"
 #include "core/common/status.h"
 #include "core/common/logging/logging.h"
+#include "core/framework/prepacked_weights_container.h"
 #include "core/graph/onnx_protobuf.h"
 #include "core/graph/basic_types.h"
 #include "core/graph/constants.h"
@@ -41,6 +43,7 @@ namespace onnxruntime {
 class Graph;
 struct IndexedSubGraph;
 class Model;
+struct ModelSavingOptions;
 class OpSignature;

 #if !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)
@@ -1153,29 +1156,6 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi
   const ONNX_NAMESPACE::GraphProto& ToGraphProto();
   ONNX_NAMESPACE::GraphProto ToGraphProto() const;

-  // Options to align external initializer offset.
-  // For models running on CPU, ORT will try to use mmap to load external initializers.
-  // To use mmap, external initializer need to be offset aligned.
-  // ORT saves external initializers into signle data file, each initializer is accessed with
-  // offset(start position of initializer) and length(byte length of initializer) of the data file.
-  // To use mmap, each offset need to be aligned which means offset need to divisible by
-  // allocation granularity(64KB for windows and 4K for other OSes).
-  // With align_offset to true, ORT will align offset for large initializer when
-  // save ONNX model with external data file.
-  struct OffsetAlignmentInfo {
-    // Offset will always be page aligned and allocation granularity aligned for mmap support.
-    // This is done by padding previous tensor data with zeros keeping same length.
-    bool align_offset = false;
-    // Alignment threshold for size of data.
-    // Having a low threshold will waste file space for small initializers.
-    // Only when tensor's data size is > the page_align_threshold it will be force aligned.
-    // Default to 1MB.
-    int64_t align_threshold = 1048576;
-    // The allocation Granularity for mmap() support.
-    // Typically 64KB for Windows & 4KB for other OSes. Default to 64KB.
-    int64_t allocation_granularity = 65536;
-  };
-
   /** Gets the GraphProto representation of this Graph
   @param external_file_path File path of the binary file to use for initializers.
   @param model_file_path path of the model file.
@@ -1186,15 +1166,7 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi
   */
   ONNX_NAMESPACE::GraphProto ToGraphProtoWithExternalInitializers(const std::filesystem::path& external_file_path,
                                                                   const std::filesystem::path& model_file_path,
-                                                                  size_t initializer_size_threshold,
-                                                                  const OffsetAlignmentInfo& align_info) const;
-
-  ONNX_NAMESPACE::GraphProto ToGraphProtoWithExternalInitializers(const std::filesystem::path& external_file_path,
-                                                                  const std::filesystem::path& model_file_path,
-                                                                  size_t initializer_size_threshold) const {
-    OffsetAlignmentInfo default_options;
-    return ToGraphProtoWithExternalInitializers(external_file_path, model_file_path, initializer_size_threshold, default_options);
-  }
+                                                                  const ModelSavingOptions& model_saving_options) const;

   /** Gets the ISchemaRegistry instances being used with this Graph. */
   IOnnxRuntimeOpSchemaCollectionPtr GetSchemaRegistry() const;
@@ -1400,6 +1372,18 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi

 #endif  // !defined(ORT_MINIMAL_BUILD)

+  // This function constructs PrepackedSharedContainer in the root graph only
+  // and initializes a reference to it in all (sub)graphs
+  void ConstructPrepackedSharedContainerAndSetMode(bool saving_mode_on);
+
+  const PrepackedWeightsForGraph& GetPrepacked() const noexcept {
+    return *prepacked_weights_for_graph_;
+  }
+
+  PrepackedWeightsForGraph& GetPrepacked() noexcept {
+    return *prepacked_weights_for_graph_;
+  }
+
   /** Returns the Node containing the GraphProto for this Graph instance if IsSubgraph is true */
   const Node* ParentNode() const { return parent_node_; }

@@ -1519,6 +1503,31 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi
   Status AddConstantProtoAsInitializer(const ONNX_NAMESPACE::NodeProto& constant_node_proto,
                                        std::optional<std::string_view> new_name);

+  /// <summary>
+  /// This function traverses the graph bottom up and externalizes
+  /// constant initializers along with their pre-packed blobs from different
+  /// kernels. Writes constant initializers to the external file with any pre-packed
+  /// blobs (if enabled and produced for this initializer) and then modifies TensorProto
+  /// entry with external data references.
+  /// </summary>
+  /// <param name="model_path">model file path from Model</param>
+  /// <param name="external_file_path">a binary file path for relative to the model file path
+  /// where the initializers data is written</param>
+  /// <param name="model_external_file_path">model file folder path with external file path appended</param>
+  /// <param name="model_saving_options">model saving options including alignment and pre-packs</param>
+  /// <param name="output_graph_proto">The graph proto to be modified</param>
+  /// <param name="external_stream">external file stream</param>
+  /// <param name="external_offset">current external file offset updated with each write</param>
+  /// <returns>Status instance</returns>
+  Status AddExternalInitializersToGraphProtoImpl(
+      const std::filesystem::path& model_path,
+      const std::filesystem::path& external_file_path,
+      const std::filesystem::path& model_external_file_path,
+      const ModelSavingOptions& model_saving_options,
+      ONNX_NAMESPACE::GraphProto& output_graph_proto,
+      std::ostream& external_stream,
+      int64_t& external_offset) const;
+
 #endif

   Version IrVersion() const noexcept {
@@ -1703,6 +1712,21 @@ class Graph { // NOLINT(clang-analyzer-optin.performance.Padding): preserve exi
                      std::hash<std::string>, std::equal_to<std::string>>
       sparse_tensor_names_;

+  // Prepacked blobs container that stored pre-packed initializers
+  // data that is:
+  // - mem-mapped from disk
+  // - shared within the session
+  // - shared across sessions by transferring the ownership of loaded data entries to
+  // SessionState::PrepackedWeightsContainer* if one is present.
+  // This container is optional because it is present only in the root graph.
+  std::optional<PrepackedKeyToBlobMap> prepacked_key_to_blobs_;
+
+  // This container contains a reference to the root prepacked_key_to_blobs_
+  // and also (in the save mode) records association between the initializer
+  // names and their pre-packed blobs (via keys).
+  // This is optional due to delayed construction.
+  std::optional<PrepackedWeightsForGraph> prepacked_weights_for_graph_;
+
 #if !defined(ORT_MINIMAL_BUILD) || defined(ORT_EXTENDED_MINIMAL_BUILD)
   // Runtime optimization storage.
   // Note: runtime_optimizations_ == *runtime_optimizations_ptr_ and must be initialized
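
The comments on `ConstructPrepackedSharedContainerAndSetMode` and on the two `std::optional` members describe a layout in which only the root graph owns the blob map and every (sub)graph holds a reference to it. A minimal sketch of that layout follows. `GraphSketch`, `BlobMap`, and `WeightsView` are hypothetical stand-ins invented for illustration, not ONNX Runtime types, and the traversal order is an assumption.

```cpp
#include <cassert>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for PrepackedKeyToBlobMap: key -> packed bytes.
using BlobMap = std::unordered_map<std::string, std::vector<char>>;

// Hypothetical stand-in for PrepackedWeightsForGraph: a view over the root map.
class WeightsView {
 public:
  WeightsView(BlobMap& blobs, bool save_mode_on)
      : blobs_(blobs), save_mode_on_(save_mode_on) {}
  BlobMap& Blobs() noexcept { return blobs_; }
  bool SaveModeOn() const noexcept { return save_mode_on_; }

 private:
  BlobMap& blobs_;
  bool save_mode_on_;
};

struct GraphSketch {
  GraphSketch* parent = nullptr;
  std::vector<std::unique_ptr<GraphSketch>> subgraphs;
  std::optional<BlobMap> owned_blobs;  // engaged in the root graph only
  std::optional<WeightsView> view;     // engaged in every graph after construction

  GraphSketch& AddSubgraph() {
    subgraphs.push_back(std::make_unique<GraphSketch>());
    subgraphs.back()->parent = this;
    return *subgraphs.back();
  }

  // The root creates the map; each subgraph receives a view over the same map.
  void ConstructSharedContainer(bool save_mode_on) {
    if (parent == nullptr) {
      owned_blobs.emplace();
      view.emplace(*owned_blobs, save_mode_on);
    } else {
      assert(parent->view.has_value());  // parents are constructed first
      view.emplace(parent->view->Blobs(), save_mode_on);
    }
    for (auto& sub : subgraphs) {
      sub->ConstructSharedContainer(save_mode_on);
    }
  }
};
```

In this sketch a blob written through any subgraph's view lands in the root's map, which is why the same weight pre-packed at several nesting levels can be stored once.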
Lines changed: 44 additions & 0 deletions
@@ -0,0 +1,44 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#pragma once
+
+namespace onnxruntime {
+
+class PrepackedWeightsForGraph;
+
+// These options affect how the model initializers are written to the external file.
+// This includes options to align external initializer offset.
+// For models running on CPU, ORT will try to use mmap to load external
+// initializers. To use mmap, external initializer need to be offset aligned.
+// ORT saves external initializers into single data file, each initializer is
+// accessed with offset(start position of initializer) and length(byte length of
+// initializer) of the data file. To use mmap, each offset need to be aligned
+// which means offset need to divisible by allocation granularity(64KB for
+// windows and 4K for other OSes). With align_offset to true, ORT will align
+// offset for large initializer when save ONNX model with external data file.
+struct ModelSavingOptions {
+  explicit ModelSavingOptions(size_t size_threshold)
+      : initializer_size_threshold(size_threshold) {}
+
+  // Mimimal initializer size in bytes to be externalized on disk
+  size_t initializer_size_threshold;
+  // Offset will always be page aligned and allocation granularity aligned for
+  // mmap support. This is done by padding previous tensor data with zeros
+  // keeping same length.
+  bool align_offset = false;
+  // Alignment threshold for size of data.
+  // Having a low threshold will waste file space for small initializers.
+  // Only when tensor's data size is > the page_align_threshold it will be force
+  // aligned. Default to 1MB.
+  int64_t align_threshold = 1048576;
+  // The allocation Granularity for mmap() support.
+  // Typically 64KB for Windows & 4KB for other OSes. Default to 64KB.
+#ifdef _WIN32
+  int64_t allocation_granularity = 65536;
+#else
+  int64_t allocation_granularity = 4096;
+#endif
+};
+
+}  // namespace onnxruntime
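
The comments in `ModelSavingOptions` say that offsets of large initializers are rounded up to the allocation granularity so the data can be memory-mapped. The arithmetic can be sketched as below. `AlignmentSketch` and `NextTensorOffset` are hypothetical names, and taking the larger of a 4 KB page and the granularity is an assumption, not something this diff shows.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the alignment fields of ModelSavingOptions.
struct AlignmentSketch {
  bool align_offset = false;
  int64_t align_threshold = 1048576;      // 1 MB
  int64_t allocation_granularity = 4096;  // non-Windows default in the header above
};

// Returns the file offset at which a tensor of `tensor_bytes` should start,
// given that the previous write ended at `current_offset`.
inline int64_t NextTensorOffset(int64_t current_offset, int64_t tensor_bytes,
                                const AlignmentSketch& opts) {
  assert(current_offset >= 0 && opts.allocation_granularity > 0);
  if (!opts.align_offset || tensor_bytes <= opts.align_threshold) {
    return current_offset;  // small tensors are packed back to back
  }
  // Assumption: align to the larger of a 4 KB page and the allocation granularity.
  const int64_t alignment = std::max<int64_t>(4096, opts.allocation_granularity);
  // Round up to the next multiple; the gap would be zero-padded in the file.
  return (current_offset + alignment - 1) / alignment * alignment;
}
```

The threshold exists because padding every small initializer to a 64 KB boundary would waste file space, as the header comment notes.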

include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h

Lines changed: 11 additions & 0 deletions
@@ -250,6 +250,17 @@ static const char* const kOrtSessionOptionsOptimizedModelExternalInitializersFil
 static const char* const kOrtSessionOptionsOptimizedModelExternalInitializersMinSizeInBytes =
     "session.optimized_model_external_initializers_min_size_in_bytes";

+// Use this config when saving pre-packed constant initializers to an external data file.
+// This allows you to memory map pre-packed initializers on model load and leave it to
+// to the OS the amount of memory consumed by the pre-packed initializers. Otherwise,
+// pre-packed data resides on the heap.
+//
+// - "0": Default is not save pre-packed initializers to a data file.
+// - "1": Save pre-packed constant initializers to an external data file.
+// Sample usage: sess_options.add_session_config_entry(kOrtSessionOptionsSavePrePackedConstantInitializers, "1")
+static const char* const kOrtSessionOptionsSavePrePackedConstantInitializers =
+    "session.save_external_prepacked_constant_initializers";
+
 // Enable EP context feature to dump the partitioned graph which includes the EP context into Onnx file.
 // The dumped Onnx model with EP context can be used for future inference to avoid the EP graph partitioning/compile overhead.
 // "0": disable. (default)
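
The description notes that the new option only takes effect while pre-packing itself is enabled. A minimal sketch of that interaction follows, using a plain string map in place of the real session options. `ConfigEntries`, `GetOrDefault`, and `ShouldSavePrepacked` are hypothetical helpers, and the `session.disable_prepacking` key is assumed to be the existing switch that turns pre-packing off.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for the session's string key/value config entries.
using ConfigEntries = std::unordered_map<std::string, std::string>;

// Key added by this commit.
constexpr const char* kSavePrepacked = "session.save_external_prepacked_constant_initializers";
// Assumption: the existing key that turns pre-packing off when set to "1".
constexpr const char* kDisablePrepacking = "session.disable_prepacking";

// Returns the value stored for `key`, or `fallback` when the key is absent.
inline std::string GetOrDefault(const ConfigEntries& config, const std::string& key,
                                const std::string& fallback) {
  assert(!key.empty());
  auto it = config.find(key);
  return it == config.end() ? fallback : it->second;
}

// Pre-packed blobs are saved only if requested and pre-packing itself is still on.
inline bool ShouldSavePrepacked(const ConfigEntries& config) {
  const bool requested = GetOrDefault(config, kSavePrepacked, "0") == "1";
  const bool prepacking_on = GetOrDefault(config, kDisablePrepacking, "0") != "1";
  return requested && prepacking_on;
}
```

Both keys default to "0" in this sketch, so saving pre-packed blobs is opt-in while pre-packing stays on unless explicitly disabled.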

onnxruntime/core/framework/prepacked_weights.h

Lines changed: 7 additions & 3 deletions
@@ -6,7 +6,8 @@
 #include <vector>

 #include "core/common/basic_types.h"
-#include "core/framework/buffer_deleter.h"
+#include "core/common/inlined_containers_fwd.h"
+#include "core/framework/allocator.h"
 #include "core/framework/tensor_shape.h"

 namespace onnxruntime {
@@ -16,11 +17,14 @@ struct PrePackedWeights final {
   // Hence we hold them in container. It is upto the developer implementing each PrePack()
   // method to define what gets stored in which position of the container.

-  std::vector<IAllocatorUniquePtr<void>> buffers_;  // cache pre-packed buffers associated with the kernel
-  std::vector<size_t> buffer_sizes_;                // cache sizes of pre-packed buffers (in bytes)
+  InlinedVector<IAllocatorUniquePtr<void>> buffers_;  // cache pre-packed buffers associated with the kernel
+  InlinedVector<size_t> buffer_sizes_;                // cache sizes of pre-packed buffers (in bytes)

   // Produces a hash of the buffers stored in the given instance of this class
   HashValue GetHash() const;
+
+  // The function creates a copy with non-owning BufferUniquePtrs.
+  PrePackedWeights CreateReferringCopy() const;
 };

 }  // namespace onnxruntime
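
`CreateReferringCopy` is declared as producing a copy whose pointers do not own the buffers. One way to get that effect is a `std::unique_ptr` with a type-erased deleter that does nothing, sketched below with standard-library types only. `BufferPtr` is assumed to resemble `IAllocatorUniquePtr<void>`, and `PackedSketch` and `AddOwned` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>
#include <vector>

// Assumption: stand-in for IAllocatorUniquePtr<void>, a unique_ptr with a type-erased deleter.
using BufferPtr = std::unique_ptr<void, std::function<void(void*)>>;

struct PackedSketch {
  std::vector<BufferPtr> buffers;
  std::vector<std::size_t> buffer_sizes;

  // Allocates an owning buffer that is released with std::free.
  void AddOwned(std::size_t bytes) {
    void* p = std::malloc(bytes);
    assert(p != nullptr);
    buffers.emplace_back(p, [](void* q) { std::free(q); });
    buffer_sizes.push_back(bytes);
  }

  // Same addresses and sizes, but the deleters do nothing, so the copy never frees.
  PackedSketch CreateReferringCopy() const {
    PackedSketch copy;
    for (const auto& buffer : buffers) {
      copy.buffers.emplace_back(buffer.get(), [](void*) {});
    }
    copy.buffer_sizes = buffer_sizes;
    return copy;
  }
};
```

A referring copy is only valid while the owning instance is alive, so in this sketch the caller has to keep the owner around for at least as long as any copy is in use.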

onnxruntime/core/framework/prepacked_weights_container.cc

Lines changed: 58 additions & 0 deletions
@@ -3,9 +3,21 @@

 #include "core/framework/prepacked_weights_container.h"
 #include "core/framework/allocator_utils.h"
+#include "core/graph/graph.h"

 namespace onnxruntime {

+PrePackedWeights PrePackedWeights::CreateReferringCopy() const {
+  PrePackedWeights copy;
+  for (const auto& prepacked_buffer : buffers_) {
+    // No deleter is needed as the buffer is not owned by the unique_ptr
+    copy.buffers_.emplace_back(prepacked_buffer.get(), [](void*) {});
+  }
+
+  copy.buffer_sizes_ = buffer_sizes_;
+  return copy;
+}
+
 AllocatorPtr PrepackedWeightsContainer::GetOrCreateAllocator(const std::string& device_name) {
   auto iter = allocators_.find(device_name);

@@ -49,4 +61,50 @@ size_t PrepackedWeightsContainer::GetNumberOfElements() const {
   return prepacked_weights_map_.size();
 }

+void PrepackedWeightsForGraph::InsertPrepackedWeights(const std::string& key, PrePackedWeights&& packed_weight) {
+  // We may have duplicate entries mapped from disk if the same weight is pre-packed from subgraphs and
+  // up the tree by the same kernel with the same result. The map prevents this from happening.
+  key_to_blobs_.emplace(key, std::move(packed_weight));
+}
+
+void PrepackedWeightsForGraph::WritePackedMaybeForSave(const std::string& weight_name, const std::string& key,
+                                                       PrePackedWeights&& packed_weight) {
+  key_to_blobs_.insert_or_assign(key, std::move(packed_weight));
+
+  if (save_mode_on_) {
+    weight_prepacks_for_saving_[weight_name].insert(key);
+  }
+}
+
+const PrePackedWeights* PrepackedWeightsForGraph::GetPrepackedWeights(const std::string& key) const {
+  auto it = key_to_blobs_.find(key);
+  if (it == key_to_blobs_.end()) {
+    return nullptr;
+  }
+  return &it->second;
+}
+
+std::optional<PrePackedWeights> PrepackedWeightsForGraph::ReplaceWithReferenceIfSaving(
+    const std::string& weight_name,
+    const std::string& key,
+    const PrePackedWeights& refer_to_if_absent) {
+  auto it = key_to_blobs_.find(key);
+  if (it == key_to_blobs_.end()) {
+    if (save_mode_on_) {
+      key_to_blobs_.emplace(key, refer_to_if_absent.CreateReferringCopy());
+      weight_prepacks_for_saving_[weight_name].insert(key);
+    }
+    return std::nullopt;
+  }
+
+  PrePackedWeights result = std::move(it->second);
+  if (save_mode_on_) {
+    it->second = result.CreateReferringCopy();
+    weight_prepacks_for_saving_[weight_name].insert(key);
+  } else {
+    key_to_blobs_.erase(it);
+  }
+  return result;
+}
+
 }  // namespace onnxruntime
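
The behavior of `ReplaceWithReferenceIfSaving` differs between the two container modes. A simplified model can make this easier to follow: when loading, an entry found in the map is handed to the caller and erased. When saving, the map keeps a non-owning reference and records which keys belong to which weight. `BlobSketch` and `ContainerSketch` below are hypothetical stand-ins that track ownership with a flag instead of real buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for PrePackedWeights: an address plus an ownership flag.
struct BlobSketch {
  const void* data = nullptr;
  bool owning = false;
  BlobSketch Referring() const { return BlobSketch{data, false}; }
};

class ContainerSketch {
 public:
  explicit ContainerSketch(bool save_mode_on) : save_mode_on_(save_mode_on) {}

  // emplace keeps the first entry for a key, so duplicates from disk are ignored.
  void Insert(const std::string& key, BlobSketch blob) {
    assert(!key.empty());
    key_to_blobs_.emplace(key, std::move(blob));
  }

  std::optional<BlobSketch> ReplaceWithReferenceIfSaving(const std::string& weight_name,
                                                         const std::string& key,
                                                         const BlobSketch& refer_to_if_absent) {
    auto it = key_to_blobs_.find(key);
    if (it == key_to_blobs_.end()) {
      if (save_mode_on_) {  // remember the caller's blob so it can be written to disk
        key_to_blobs_.emplace(key, refer_to_if_absent.Referring());
        keys_for_saving_[weight_name].insert(key);
      }
      return std::nullopt;
    }
    BlobSketch result = std::move(it->second);  // ownership moves to the caller
    if (save_mode_on_) {
      it->second = result.Referring();
      keys_for_saving_[weight_name].insert(key);
    } else {
      key_to_blobs_.erase(it);
    }
    return result;
  }

  bool Contains(const std::string& key) const { return key_to_blobs_.count(key) != 0; }

  std::size_t KeysRecordedFor(const std::string& weight_name) const {
    auto it = keys_for_saving_.find(weight_name);
    return it == keys_for_saving_.end() ? 0 : it->second.size();
  }

 private:
  bool save_mode_on_;
  std::unordered_map<std::string, BlobSketch> key_to_blobs_;
  std::unordered_map<std::string, std::set<std::string>> keys_for_saving_;
};
```

In both modes the caller ends up as the owner. The map retains an entry only when it is needed later for serialization.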
