
Conversation

@xgqdut2016
Collaborator

@xgqdut2016 xgqdut2016 commented Apr 21, 2025


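The tests also pass with a = torch.randn and b = 1e-3 * torch.randn. However, computing scale and zero on the CPU involves a large amount of matrix computation, so the tests are very slow: testing the 7B matrix with group_size=-1 currently takes 3 h. The CUDA platform is adapted to marlin and currently passes the tests as well. Domestic chips use the arm architecture, which does not support immintrin.h, so the CPU-platform functions and headers that depend on immintrin.h must be commented out before the code will compile.
CUDA performance is shown in the figure below:
[Image: 1748250667123]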
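
One possible way to avoid hand-commenting the `immintrin.h` code on arm builds is to guard the include and the SIMD path with compiler-defined architecture macros. The sketch below is only an illustration under that assumption: `dot_f32` and `HYPOTHETICAL_HAS_X86_SIMD` are made-up names that do not come from this PR, and the AVX path is only compiled when the compiler itself defines `__AVX__` (for example with `-mavx`).

```cpp
#include <cstddef>

// Include the x86 SIMD header only on x86 targets; arm toolchains do not provide it.
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
#define HYPOTHETICAL_HAS_X86_SIMD 1
#else
#define HYPOTHETICAL_HAS_X86_SIMD 0
#endif

// Hypothetical helper (not from the PR): dot product of two float arrays.
// Uses AVX when the target supports it, otherwise a portable scalar loop.
float dot_f32(const float *a, const float *b, std::size_t n) {
    float sum = 0.0f;
    std::size_t i = 0;
#if HYPOTHETICAL_HAS_X86_SIMD && defined(__AVX__)
    // Process 8 floats per iteration with 256-bit registers.
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    // Reduce the 8 lanes of the accumulator into the scalar sum.
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (float v : lanes) {
        sum += v;
    }
#endif
    // Scalar tail on x86, or the whole loop on arm and other targets.
    for (; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

With a guard of this kind the same source file compiles on both x86 and arm; the arm build simply takes the scalar path unless a NEON variant is added separately.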

@xgqdut2016 xgqdut2016 changed the base branch from marlin to main April 30, 2025 08:04
@xgqdut2016 xgqdut2016 force-pushed the issue/170 branch 3 times, most recently from 38f7ad4 to c7f8aa6 Compare April 30, 2025 08:22
@xgqdut2016 xgqdut2016 linked an issue Apr 30, 2025 that may be closed by this pull request
@xgqdut2016 xgqdut2016 force-pushed the issue/170 branch 2 times, most recently from 318d48e to 2c512d5 Compare May 15, 2025 06:30
@xgqdut2016 xgqdut2016 requested a review from PanZezhong1725 May 16, 2025 01:36
@PanZezhong1725 PanZezhong1725 requested a review from YdrMaster May 19, 2025 09:32
@xgqdut2016 xgqdut2016 force-pushed the issue/170 branch 2 times, most recently from 4a4fad1 to d605520 Compare May 29, 2025 02:58

Development

Successfully merging this pull request may close these issues.

[DEV] GPTQ operator - CPU platform

2 participants