ElasticDL Memory Leak Debug
The Master, Worker, and ParameterServer (PS) instances in ElasticDL can be killed by OOM caused by memory leaks. In this document, we summarize the memory-leak problems we encountered when training a model using ElasticDL.
- `tf.keras.metrics` causes a memory leak in the master server.
In ElasticDL, each worker executes evaluation tasks and reports the model outputs and corresponding labels to the master via gRPC. Then, the master calls `update_state` of `tf.keras.metrics` to calculate the metrics. By default, `grpc.server` uses multiple threads to process the gRPC requests from workers, and each thread calls `update_state`. The memory leak occurs if we use multiple threads in `grpc.server`.
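To make the threading pattern concrete, here is a minimal sketch (not ElasticDL's actual code) of many gRPC handler threads funneling metric updates into one shared object. The `Metric` class is a hypothetical stand-in for `tf.keras.metrics`; the lock shows one way to serialize `update_state` calls, whereas the experiments below instead shrink the thread pool to a single worker.

```python
import threading
from concurrent import futures


class Metric:
    """Hypothetical stand-in for a tf.keras.metrics object."""

    def __init__(self):
        self.total = 0

    def update_state(self, value):
        # Not thread-safe on its own: read-modify-write.
        self.total += value


metric = Metric()
lock = threading.Lock()


def handle_report(value):
    # Each gRPC handler thread calls this when a worker reports results.
    with lock:
        metric.update_state(value)


# Simulate 1000 evaluation reports arriving on 8 handler threads.
with futures.ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(handle_report, [1] * 1000))

print(metric.total)  # → 1000
```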
The resource configuration:

```
--master_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--worker_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--ps_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
```

- Use multiple threads with `max_workers=64` in the master `grpc.server` and train a deepFM model in the model zoo.
```python
def _create_master_service(self, args):
    self.logger.info("Creating master service")
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=64),
```

Then, we view the used memory in the master.
- Use a single thread with `max_workers=1` in the master `grpc.server` and train a deepFM model in the model zoo.

```python
def _create_master_service(self, args):
    self.logger.info("Creating master service")
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=1),
```

Then, we view the used memory in the master.
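A simple way to view the used memory from inside the master process itself (rather than through Kubernetes) is the stdlib `resource` module. This helper is illustrative and not part of ElasticDL; note that on Linux `ru_maxrss` is reported in kilobytes.

```python
import resource


def max_rss_mib():
    # Peak resident set size of the current process.
    # On Linux ru_maxrss is in kilobytes (bytes on macOS).
    kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return kb / 1024.0


print("peak RSS: %.1f MiB" % max_rss_mib())
```

Logging this value periodically during training makes the growth curve under `max_workers=64` versus `max_workers=1` easy to compare.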
Using multiple threads in the PS `grpc.server` also causes memory leaks.
The resource configuration:

```
--master_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--worker_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
--ps_resource_request="cpu=1,memory=1024Mi,ephemeral-storage=1024Mi" \
```

- Use multiple threads with `max_workers=64` in the PS `grpc.server` and train a deepFM model in the model zoo.
```python
def prepare(self):
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=64),
        options=[
            ("grpc.max_send_message_length", GRPC.MAX_SEND_MESSAGE_LENGTH),
            (
                "grpc.max_receive_message_length",
                GRPC.MAX_RECEIVE_MESSAGE_LENGTH,
            ),
        ],
    )
```

Then, we view the used memory in the PS instances.
- Use a single thread with `max_workers=1` in the PS `grpc.server` and train a deepFM model in the model zoo.

```python
def prepare(self):
    server = grpc.server(
        futures.ThreadPoolExecutor(max_workers=1),
        options=[
            ("grpc.max_send_message_length", GRPC.MAX_SEND_MESSAGE_LENGTH),
            (
                "grpc.max_receive_message_length",
                GRPC.MAX_RECEIVE_MESSAGE_LENGTH,
            ),
        ],
    )
```

Then, we view the used memory in the PS instances.
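Beyond watching total memory, the stdlib `tracemalloc` module can help narrow down which source lines account for the growth when comparing the `max_workers=64` and `max_workers=1` runs. This is an illustrative sketch, not ElasticDL code; the list comprehension merely simulates allocations between two points of a training loop.

```python
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

# Simulate memory growth between two points of the training loop.
grown = [bytes(1024) for _ in range(1000)]

after = tracemalloc.take_snapshot()
# Rank source lines by how much their allocations grew.
stats = after.compare_to(before, "lineno")
for stat in stats[:3]:
    print(stat)
```

In a real run, taking snapshots every N training steps and diffing them points at the allocation sites that keep growing.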
- The memory leak also occurs if we use `tf.py_function` to wrap `lookup_embedding` in the ElasticDL Embedding layer. Details are in TensorFlow issue 35010.