Skip to content

Conversation

@nilsfriess
Copy link

Probably those ifdefs were necessary when compiling with an older AdaptiveCpp version but with a version built from the current develop branch they are no longer necessary (I don't know what the oldest AdaptiveCpp release is that would still work here).

Not sure how to properly test this, at least oneMath still compiles using this CMake command:

cmake .. -DCMAKE_CXX_COMPILER=acpp \
         -DONEMATH_SYCL_IMPLEMENTATION=hipsycl \
         -DENABLE_MKLCPU_BACKEND=False \
         -DBUILD_FUNCTIONAL_TESTS=True \
         -DBUILD_EXAMPLES=True \
         -DCMAKE_INSTALL_PREFIX=./install \
         -DTARGET_DOMAINS=blas \
         -DHIPSYCL_TARGETS=generic \
         -DENABLE_MKLGPU_BACKEND=False \
         -DENABLE_GENERIC_BLAS_BACKEND=True \
         -DUSE_ADD_SYCL_TO_TARGET_INTEGRATION=True \
         -DhipSYCL_DIR=/home/friessn/Projects/AdaptiveCpp/build/install/lib/cmake/hipSYCL/ \
         -DONEMATH_SYCL_BLAS_DIR=/home/friessn/Projects/generic-sycl-components/onemath/sycl/blas/build/install/

I can't seem to run the tests (running ctest just runs 2 test cases), I have to look into that again an report back.

This addresses the issue opened by @illuhad here: codeplaysoftware/portBLAS#540 (which was already closed in the portBLAS repo but not transferred to here).

Copy link

@illuhad illuhad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great effort, thanks :)

#ifndef __ADAPTIVECPP__
cgh.host_task([=]() { sycl::free(mem, context); });
#else
cgh.AdaptiveCpp_enqueue_custom_operation([=](sycl::interop_handle &) { sycl::free(mem, context); });
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it correct to do a free here inside a custom operation? Custom operations will be evaluated at submission time, which is a fundamental difference from host task (See https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/enqueue-custom-operation.md). So this looks to me like there might be a hazard of deleting memory while the device might still be using it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you're correct. I am not sure how to fix this though, do you have a suggestion?

Copy link

@illuhad illuhad Jun 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In pseudo code, maybe something like this could work (and, if the C++ standard library uses a thread pool, should not be slower than host_task):

std::async(..., [=](){
  if(q.is_in_order()) {
    q.wait();
  } else {
    for(auto d : dependencies){
      d.wait();
    }
  }
  free(...);
});

Not sure if we need the in-order branch. If we reliably get dependency lists even for in-order queues we might not need it.

The real issue here is that SYCL does not have the asynchronous free mechanisms that e.g. CUDA has.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is a workaround for an asynchronous free. The suggested solutions makes sense to me. I suggest to make a helper function for it since the pattern is repeated a few times.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be some complexity here since it's possible we need to wait on the returned std::future that is returned by std::async at some point to get things working.

@Rbiessy I would recommend to also reconsider the current workaround for the DPC++ side. According to the current SYCL 2020 specification,

Capturing accessors in a host task is allowed, however, capturing or using any other SYCL class that has reference semantics (see Section 4.5.2) is undefined behavior.

(https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:interfaces.hosttasks.overview)

The sycl::context class, similarly to sycl::queue or sycl::device, follows common reference semantics, and thus cannot be used inside the host_task. So, the current code is undefined behavior and might break at any time.

The background here is that many backends impose the restriction that enqueued callbacks may not make calls into the backend runtime API or risk deadlocks. For example, in CUDA, cudaStreamAddCallback requires that the registered callback may not perform any calls into the CUDA runtime, otherwise, deadlocks or other forms of UB might appear.

The SYCL restriction that common-reference-semantics objects are not allowed in host tasks enforces this limitation.

Unfortunately, because of limitations like this, the current host_task is far less useful than it might appear at first glance, which is why AdaptiveCpp never implemented it. It's full of potential performance or correctness issues unfortunately :(

I'm not sure what the best approach here is. I suspect that designing a new approach might require a better understanding of what the goal here is with the async free emulation than I currently have.

One solution that I've considered would be to just add asynchronous free to SYCL implementations as an extension, but that is also not trivial, since the OpenCL USM extensions don't support this functionality. Also, it would mean that the code would rely on specific extensions, and even if both AdaptiveCpp and DPC++ implemented them, the code would no longer be based on standard SYCL and it wouldn't work anymore with other compilers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I am aware it is technically UB according to SYCL spec. In practice we agreed that it should work fine with DPC++ and likely other SYCL implementations that would support host_task.
I was just reminded that there is already an extension for async alloc and free, see https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_async_memory_alloc.asciidoc.
DPC++ already supports that so if we want to change the DPC++ path I would strongly suggest to use this. We don't have enough bandwidth to work on this at Codeplay but we could help with the review. It may be a good opportunity to update it if AdaptiveCpp were to support that extension?

Copy link

@illuhad illuhad Jun 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice we agreed that it should work fine with DPC++ and likely other SYCL implementations that would support host_task

It's not clear to me why that should be the case. I can see that for specific implementation choices of host_task (e.g. using a SYCL-managed worker thread instead of a backend callback) it might be fine. But that is an implementation detail and not guaranteed.

It may be a good opportunity to update it if AdaptiveCpp were to support that extension?

hmm... We already have our own memory pool interface which is different from the one in that extension. The real new feature I suppose would be a mechanism to extend lifetime of an object until a specific kernel has finished executing. Then we could use RAII to return a memory object to the pool.

I wonder how DPC++ implements the async free on OpenCL, given that the Intel OpenCL USM extension doesn't have it.

#ifndef __ADAPTIVECPP__
cgh.host_task([&, mem]() { release_buff_mem_(mem); });
#else
cgh.AdaptiveCpp_enqueue_custom_operation([&, mem](sycl::interop_handle &) { release_buff_mem_(mem); });
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps same issue here?

#ifndef __ADAPTIVECPP__
cgh.host_task([&, mem]() { release_usm_mem_(mem); });
#else
cgh.AdaptiveCpp_enqueue_custom_operation([&, mem](sycl::interop_handle &) { release_usm_mem_(mem); });
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And here?

Copy link
Contributor

@Rbiessy Rbiessy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the patch, this looks good. Using oneMath is the best way to test this change indeed. If you're able to find why it's not running all the tests that would be great.

#ifndef __ADAPTIVECPP__
cgh.host_task([=]() { sycl::free(mem, context); });
#else
cgh.AdaptiveCpp_enqueue_custom_operation([=](sycl::interop_handle &) { sycl::free(mem, context); });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes this is a workaround for an asynchronous free. The suggested solutions makes sense to me. I suggest to make a helper function for it since the pattern is repeated a few times.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants