-
Notifications
You must be signed in to change notification settings - Fork 5
Remove (most) ifdefs that disable features for AdaptiveCpp #13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…mpiling with AdaptiveCpp
illuhad
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great effort, thanks :)
| #ifndef __ADAPTIVECPP__ | ||
| cgh.host_task([=]() { sycl::free(mem, context); }); | ||
| #else | ||
| cgh.AdaptiveCpp_enqueue_custom_operation([=](sycl::interop_handle &) { sycl::free(mem, context); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it correct to do a free here inside a custom operation? Custom operations will be evaluated at submission time, which is a fundamental difference from host task (See https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/enqueue-custom-operation.md). So this looks to me like there might be a hazard of deleting memory while the device might still be using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you're correct. I am not sure how to fix this though, do you have a suggestion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In pseudo code, maybe something like this could work (and, if the C++ standard library uses a thread pool, should not be slower than host_task):
std::async(..., [=](){
if(q.is_in_order()) {
q.wait();
} else {
for(auto d : dependencies){
d.wait();
}
}
free(...);
});Not sure if we need the in-order branch. If we reliably get dependency lists even for in-order queues we might not need it.
The real issue here is that SYCL does not have the asynchronous free mechanisms that e.g. CUDA has.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is a workaround for an asynchronous free. The suggested solutions makes sense to me. I suggest to make a helper function for it since the pattern is repeated a few times.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be some complexity here since it's possible we need to wait on the returned std::future that is returned by std::async at some point to get things working.
@Rbiessy I would recommend to also reconsider the current workaround for the DPC++ side. According to the current SYCL 2020 specification,
Capturing accessors in a host task is allowed, however, capturing or using any other SYCL class that has reference semantics (see Section 4.5.2) is undefined behavior.
The sycl::context class, similarly to sycl::queue or sycl::device, follows common reference semantics, and thus cannot be used inside the host_task. So, the current code is undefined behavior and might break at any time.
The background here is that many backends impose the restriction that enqueued callbacks may not make calls into the backend runtime API or risk deadlocks. For example, in CUDA, cudaStreamAddCallback requires that the registered callback may not perform any calls into the CUDA runtime, otherwise, deadlocks or other forms of UB might appear.
The SYCL restriction that common-reference-semantics objects are not allowed in host tasks enforces this limitation.
Unfortunately, because of limitations like this, the current host_task is far less useful than it might appear at first glance, which is why AdaptiveCpp never implemented it. It's full of potential performance or correctness issues unfortunately :(
I'm not sure what the best approach here is. I suspect that designing a new approach might require a better understanding of what the goal here is with the async free emulation than I currently have.
One solution that I've considered would be to just add asynchronous free to SYCL implementations as an extension, but that is also not trivial, since the OpenCL USM extensions don't support this functionality. Also, it would mean that the code would rely on specific extensions, and even if both AdaptiveCpp and DPC++ implemented them, the code would no longer be based on standard SYCL and it wouldn't work anymore with other compilers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I am aware it is technically UB according to SYCL spec. In practice we agreed that it should work fine with DPC++ and likely other SYCL implementations that would support host_task.
I was just reminded that there is already an extension for async alloc and free, see https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/proposed/sycl_ext_oneapi_async_memory_alloc.asciidoc.
DPC++ already supports that so if we want to change the DPC++ path I would strongly suggest to use this. We don't have enough bandwidth to work on this at Codeplay but we could help with the review. It may be a good opportunity to update it if AdaptiveCpp were to support that extension?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In practice we agreed that it should work fine with DPC++ and likely other SYCL implementations that would support host_task
It's not clear to me why that should be the case. I can see that for specific implementation choices of host_task (e.g. using a SYCL-managed worker thread instead of a backend callback) it might be fine. But that is an implementation detail and not guaranteed.
It may be a good opportunity to update it if AdaptiveCpp were to support that extension?
hmm... We already have our own memory pool interface which is different from the one in that extension. The real new feature I suppose would be a mechanism to extend lifetime of an object until a specific kernel has finished executing. Then we could use RAII to return a memory object to the pool.
I wonder how DPC++ implements the async free on OpenCL, given that the Intel OpenCL USM extension doesn't have it.
| #ifndef __ADAPTIVECPP__ | ||
| cgh.host_task([&, mem]() { release_buff_mem_(mem); }); | ||
| #else | ||
| cgh.AdaptiveCpp_enqueue_custom_operation([&, mem](sycl::interop_handle &) { release_buff_mem_(mem); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps same issue here?
| #ifndef __ADAPTIVECPP__ | ||
| cgh.host_task([&, mem]() { release_usm_mem_(mem); }); | ||
| #else | ||
| cgh.AdaptiveCpp_enqueue_custom_operation([&, mem](sycl::interop_handle &) { release_usm_mem_(mem); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here?
Rbiessy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch, this looks good. Using oneMath is the best way to test this change indeed. If you're able to find why it's not running all the tests that would be great.
| #ifndef __ADAPTIVECPP__ | ||
| cgh.host_task([=]() { sycl::free(mem, context); }); | ||
| #else | ||
| cgh.AdaptiveCpp_enqueue_custom_operation([=](sycl::interop_handle &) { sycl::free(mem, context); }); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes this is a workaround for an asynchronous free. The suggested solutions makes sense to me. I suggest to make a helper function for it since the pattern is repeated a few times.
Probably those ifdefs were necessary when compiling with an older AdaptiveCpp version but with a version built from the current develop branch they are no longer necessary (I don't know what the oldest AdaptiveCpp release is that would still work here).
Not sure how to properly test this, at least oneMath still compiles using this CMake command:
cmake .. -DCMAKE_CXX_COMPILER=acpp \ -DONEMATH_SYCL_IMPLEMENTATION=hipsycl \ -DENABLE_MKLCPU_BACKEND=False \ -DBUILD_FUNCTIONAL_TESTS=True \ -DBUILD_EXAMPLES=True \ -DCMAKE_INSTALL_PREFIX=./install \ -DTARGET_DOMAINS=blas \ -DHIPSYCL_TARGETS=generic \ -DENABLE_MKLGPU_BACKEND=False \ -DENABLE_GENERIC_BLAS_BACKEND=True \ -DUSE_ADD_SYCL_TO_TARGET_INTEGRATION=True \ -DhipSYCL_DIR=/home/friessn/Projects/AdaptiveCpp/build/install/lib/cmake/hipSYCL/ \ -DONEMATH_SYCL_BLAS_DIR=/home/friessn/Projects/generic-sycl-components/onemath/sycl/blas/build/install/I can't seem to run the tests (running ctest just runs 2 test cases), I have to look into that again an report back.
This addresses the issue opened by @illuhad here: codeplaysoftware/portBLAS#540 (which was already closed in the portBLAS repo but not transferred to here).