diff --git a/docs/docs/04-benchmarks/inference-time.md b/docs/docs/04-benchmarks/inference-time.md index dd0f1275a..89f1f9de1 100644 --- a/docs/docs/04-benchmarks/inference-time.md +++ b/docs/docs/04-benchmarks/inference-time.md @@ -8,46 +8,48 @@ Times presented in the tables are measured as consecutive runs of the model. Ini ## Classification -| Model | iPhone 16 Pro (Core ML) [ms] | iPhone 13 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ----------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| EFFICIENTNET_V2_S | 100 | 120 | 130 | 180 | 170 | +| EFFICIENTNET_V2_S | 105 | 110 | 149 | 299 | 227 | ## Object Detection -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 13 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ------------------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| SSDLITE_320_MOBILENET_V3_LARGE | 190 | 260 | 280 | 100 | 90 | +| SSDLITE_320_MOBILENET_V3_LARGE | 116 | 120 | 164 | 257 | 129 | ## Style Transfer -| Model | iPhone 16 Pro (Core ML) [ms] | iPhone 13 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | 
---------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| STYLE_TRANSFER_CANDY | 450 | 600 | 750 | 1650 | 1800 | -| STYLE_TRANSFER_MOSAIC | 450 | 600 | 750 | 1650 | 1800 | -| STYLE_TRANSFER_UDNIE | 450 | 600 | 750 | 1650 | 1800 | -| STYLE_TRANSFER_RAIN_PRINCESS | 450 | 600 | 750 | 1650 | 1800 | +| STYLE_TRANSFER_CANDY | 1356 | 1550 | 2003 | 2578 | 2328 | +| STYLE_TRANSFER_MOSAIC | 1376 | 1456 | 1971 | 2657 | 2394 | +| STYLE_TRANSFER_UDNIE | 1389 | 1499 | 1858 | 2380 | 2124 | +| STYLE_TRANSFER_RAIN_PRINCESS | 1339 | 1514 | 2004 | 2608 | 2371 | ## OCR -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | Samsung Galaxy S21 (XNNPACK) [ms] | -| --------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-------------------------------: | -| Detector (CRAFT_800) | 2099 | 2227 | ❌ | 2245 | 7108 | -| Recognizer (CRNN_512) | 70 | 252 | ❌ | 54 | 151 | -| Recognizer (CRNN_256) | 39 | 123 | ❌ | 24 | 78 | -| Recognizer (CRNN_128) | 17 | 83 | ❌ | 14 | 39 | +Notice that the recognizer models were executed between 3 and 7 times during a single recognition. +The values below represent the averages across all runs for the benchmark image. -❌ - Insufficient RAM. 
+| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| ------------------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| Detector (CRAFT_800_QUANTIZED) | 669 | 649 | 825 | 541 | 474 | +| Recognizer (CRNN_512) | 48 | 47 | 60 | 91 | 72 | +| Recognizer (CRNN_256) | 22 | 22 | 29 | 51 | 30 | +| Recognizer (CRNN_128) | 11 | 11 | 14 | 28 | 17 | ## Vertical OCR -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | Samsung Galaxy S21 (XNNPACK) [ms] | -| --------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-------------------------------: | -| Detector (CRAFT_1280) | 5457 | 5833 | ❌ | 6296 | 14053 | -| Detector (CRAFT_320) | 1351 | 1460 | ❌ | 1485 | 3101 | -| Recognizer (CRNN_512) | 39 | 123 | ❌ | 24 | 78 | -| Recognizer (CRNN_64) | 10 | 33 | ❌ | 7 | 18 | +Notice that the recognizer models, as well as the CRAFT_320 detector model, were executed between 4 and 21 times during a single recognition. +The values below represent the averages across all runs for the benchmark image. -❌ - Insufficient RAM. 
+| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| ------------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| Detector (CRAFT_1280_QUANTIZED) | 1749 | 1804 | 2105 | 1216 | 1171 | +| Detector (CRAFT_320_QUANTIZED) | 458 | 474 | 561 | 360 | 332 | +| Recognizer (CRNN_512) | 54 | 52 | 68 | 144 | 72 | +| Recognizer (CRNN_64) | 5 | 6 | 7 | 28 | 11 | ## LLMs @@ -62,41 +64,31 @@ Times presented in the tables are measured as consecutive runs of the model. Ini ❌ - Insufficient RAM. -### Streaming mode - -Notice than for `Whisper` model which has to take as an input 30 seconds audio chunks (for shorter audio it is automatically padded with silence to 30 seconds) `fast` mode has the lowest latency (time from starting transcription to first token returned, caused by streaming algorithm), but the slowest speed. If you believe that this might be a problem for you, prefer `balanced` mode instead. 
- -| Model (mode) | iPhone 16 Pro (XNNPACK) [latency \| tokens/s] | iPhone 14 Pro (XNNPACK) [latency \| tokens/s] | iPhone SE 3 (XNNPACK) [latency \| tokens/s] | Samsung Galaxy S24 (XNNPACK) [latency \| tokens/s] | OnePlus 12 (XNNPACK) [latency \| tokens/s] | -| ----------------------- | :-------------------------------------------: | :-------------------------------------------: | :-----------------------------------------: | :------------------------------------------------: | :----------------------------------------: | -| Whisper-tiny (fast) | 2.8s \| 5.5t/s | 3.7s \| 4.4t/s | 4.4s \| 3.4t/s | 5.5s \| 3.1t/s | 5.3s \| 3.8t/s | -| Whisper-tiny (balanced) | 5.6s \| 7.9t/s | 7.0s \| 6.3t/s | 8.3s \| 5.0t/s | 8.4s \| 6.7t/s | 7.7s \| 7.2t/s | -| Whisper-tiny (quality) | 10.3s \| 8.3t/s | 12.6s \| 6.8t/s | 7.8s \| 8.9t/s | 13.5s \| 7.1t/s | 12.9s \| 7.5t/s | - ### Encoding Average time for encoding audio of given length over 10 runs. For `Whisper` model we only list 30 sec audio chunks since `Whisper` does not accept other lengths (for shorter audio the audio needs to be padded to 30sec with silence). -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| Whisper-tiny (30s) | 1034 | 1344 | 1269 | 2916 | 2143 | +| Whisper-tiny (30s) | 1391 | 1372 | 1894 | 1303 | 1214 | ### Decoding -Average time for decoding one token in sequence of 100 tokens, with encoding context is obtained from audio of noted length. 
+Average time for decoding one token in a sequence of approximately 100 tokens, with the encoding context obtained from audio of the noted length. -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| Whisper-tiny (30s) | 128.03 | 113.65 | 141.63 | 89.08 | 84.49 | +| Whisper-tiny (30s) | 53 | 53 | 74 | 100 | 84 | ## Text Embeddings -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) | OnePlus 12 (XNNPACK) [ms] | -| -------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :--------------------------: | :-----------------------: | -| ALL_MINILM_L6_V2 | 15 | 22 | 23 | 36 | 31 | -| ALL_MPNET_BASE_V2 | 71 | 96 | 101 | 112 | 105 | -| MULTI_QA_MINILM_L6_COS_V1 | 15 | 22 | 23 | 36 | 31 | -| MULTI_QA_MPNET_BASE_DOT_V1 | 71 | 95 | 100 | 112 | 105 | -| CLIP_VIT_BASE_PATCH32_TEXT | 31 | 47 | 48 | 55 | 49 | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| -------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| ALL_MINILM_L6_V2 | 16 | 16 | 19 | 54 | 28 | +| ALL_MPNET_BASE_V2 | 115 | 116 | 144 | 145 | 95 | +| MULTI_QA_MINILM_L6_COS_V1 | 16 | 16 | 20 | 47 | 28 | +| MULTI_QA_MPNET_BASE_DOT_V1 | 112 | 119 | 144 | 146 | 96 | +| 
CLIP_VIT_BASE_PATCH32_TEXT | 47 | 45 | 57 | 65 | 48 | :::info Benchmark times for text embeddings are highly dependent on the sentence length. The numbers above are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly. @@ -104,9 +96,9 @@ Benchmark times for text embeddings are highly dependent on the sentence length. ## Image Embeddings -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | -| --------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| CLIP_VIT_BASE_PATCH32_IMAGE | 48 | 64 | 69 | 65 | 63 | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| --------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| CLIP_VIT_BASE_PATCH32_IMAGE | 70 | 70 | 90 | 66 | 58 | :::info Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time. @@ -114,8 +106,6 @@ Image embedding benchmark times are measured using 224×224 pixel images, as req ## Text to Image -Average time for generating one image of size 256×256 in 10 inference steps. 
- -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | -| --------------------- | :--------------------------: | :------------------------------: | :-------------------: | :-------------------------------: | :-----------------------: | -| BK_SDM_TINY_VPRED_256 | 19100 | 25000 | ❌ | ❌ | 23100 | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| --------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| BK_SDM_TINY_VPRED_256 | 21184 | 21021 | ❌ | 18834 | 16617 | diff --git a/docs/docs/04-benchmarks/memory-usage.md b/docs/docs/04-benchmarks/memory-usage.md index e34c8a7ca..a0c5a7b6d 100644 --- a/docs/docs/04-benchmarks/memory-usage.md +++ b/docs/docs/04-benchmarks/memory-usage.md @@ -2,76 +2,80 @@ title: Memory Usage --- +:::info +All the below benchmarks were performed on iPhone 17 Pro (iOS) and OnePlus 12 (Android). 
+::: + ## Classification | Model | Android (XNNPACK) [MB] | iOS (Core ML) [MB] | | ----------------- | :--------------------: | :----------------: | -| EFFICIENTNET_V2_S | 130 | 85 | +| EFFICIENTNET_V2_S | 230 | 87 | ## Object Detection | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | | ------------------------------ | :--------------------: | :----------------: | -| SSDLITE_320_MOBILENET_V3_LARGE | 90 | 90 | +| SSDLITE_320_MOBILENET_V3_LARGE | 164 | 132 | ## Style Transfer | Model | Android (XNNPACK) [MB] | iOS (Core ML) [MB] | | ---------------------------- | :--------------------: | :----------------: | -| STYLE_TRANSFER_CANDY | 950 | 350 | -| STYLE_TRANSFER_MOSAIC | 950 | 350 | -| STYLE_TRANSFER_UDNIE | 950 | 350 | -| STYLE_TRANSFER_RAIN_PRINCESS | 950 | 350 | +| STYLE_TRANSFER_CANDY | 1200 | 380 | +| STYLE_TRANSFER_MOSAIC | 1200 | 380 | +| STYLE_TRANSFER_UDNIE | 1200 | 380 | +| STYLE_TRANSFER_RAIN_PRINCESS | 1200 | 380 | ## OCR -| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | -| -------------------------------------------------------------------------------------------- | :--------------------: | :----------------: | -| Detector (CRAFT_800) + Recognizer (CRNN_512) + Recognizer (CRNN_256) + Recognizer (CRNN_128) | 2100 | 1782 | +| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | +| ------------------------------------------------------------------------------------------------------ | :--------------------: | :----------------: | +| Detector (CRAFT_800_QUANTIZED) + Recognizer (CRNN_512) + Recognizer (CRNN_256) + Recognizer (CRNN_128) | 1400 | 1320 | ## Vertical OCR -| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | -| -------------------------------------------------------------------- | :--------------------: | :----------------: | -| Detector (CRAFT_1280) + Detector (CRAFT_320) + Recognizer (CRNN_512) | 2770 | 3720 | -| Detector(CRAFT_1280) + Detector(CRAFT_320) + Recognizer (CRNN_64) | 1770 | 2740 | +| Model | Android (XNNPACK) [MB] 
| iOS (XNNPACK) [MB] | +| ---------------------------------------------------------------------------------------- | :--------------------: | :----------------: | +| Detector (CRAFT_1280_QUANTIZED) + Detector (CRAFT_320_QUANTIZED) + Recognizer (CRNN_512) | 1540 | 1470 | +| Detector(CRAFT_1280_QUANTIZED) + Detector(CRAFT_320_QUANTIZED) + Recognizer (CRNN_64) | 1070 | 1000 | ## LLMs | Model | Android (XNNPACK) [GB] | iOS (XNNPACK) [GB] | | --------------------- | :--------------------: | :----------------: | -| LLAMA3_2_1B | 3.2 | 3.1 | -| LLAMA3_2_1B_SPINQUANT | 1.9 | 2 | -| LLAMA3_2_1B_QLORA | 2.2 | 2.5 | +| LLAMA3_2_1B | 3.3 | 3.1 | +| LLAMA3_2_1B_SPINQUANT | 1.9 | 2.4 | +| LLAMA3_2_1B_QLORA | 2.7 | 2.8 | | LLAMA3_2_3B | 7.1 | 7.3 | | LLAMA3_2_3B_SPINQUANT | 3.7 | 3.8 | -| LLAMA3_2_3B_QLORA | 4 | 4.1 | +| LLAMA3_2_3B_QLORA | 3.9 | 4.0 | ## Speech to text | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | | ------------ | :--------------------: | :----------------: | -| WHISPER_TINY | 900 | 600 | +| WHISPER_TINY | 410 | 375 | ## Text Embeddings | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | | -------------------------- | :--------------------: | :----------------: | -| ALL_MINILM_L6_V2 | 85 | 100 | -| ALL_MPNET_BASE_V2 | 390 | 465 | -| MULTI_QA_MINILM_L6_COS_V1 | 115 | 130 | -| MULTI_QA_MPNET_BASE_DOT_V1 | 415 | 490 | -| CLIP_VIT_BASE_PATCH32_TEXT | 195 | 250 | +| ALL_MINILM_L6_V2 | 95 | 110 | +| ALL_MPNET_BASE_V2 | 405 | 455 | +| MULTI_QA_MINILM_L6_COS_V1 | 120 | 140 | +| MULTI_QA_MPNET_BASE_DOT_V1 | 435 | 455 | +| CLIP_VIT_BASE_PATCH32_TEXT | 200 | 280 | ## Image Embeddings | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | | --------------------------- | :--------------------: | :----------------: | -| CLIP_VIT_BASE_PATCH32_IMAGE | 350 | 340 | +| CLIP_VIT_BASE_PATCH32_IMAGE | 345 | 340 | ## Text to Image | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | | --------------------- | ---------------------- | ------------------ | -| 
BK_SDM_TINY_VPRED_256 | 2900 | 2800 | -| BK_SDM_TINY_VPRED | 6700 | 6560 | +| BK_SDM_TINY_VPRED_256 | 2400 | 2400 | +| BK_SDM_TINY_VPRED | 6210 | 6050 | diff --git a/docs/docs/04-benchmarks/model-size.md b/docs/docs/04-benchmarks/model-size.md index 5cf87f6fa..128cbd7fb 100644 --- a/docs/docs/04-benchmarks/model-size.md +++ b/docs/docs/04-benchmarks/model-size.md @@ -25,23 +25,23 @@ title: Model Size ## OCR -| Model | XNNPACK [MB] | -| --------------------- | :----------: | -| Detector (CRAFT_800) | 83.1 | -| Recognizer (CRNN_512) | 15 - 18\* | -| Recognizer (CRNN_256) | 16 - 18\* | -| Recognizer (CRNN_128) | 17 - 19\* | +| Model | XNNPACK [MB] | +| ------------------------------ | :----------: | +| Detector (CRAFT_800_QUANTIZED) | 19.8 | +| Recognizer (CRNN_512) | 15 - 18\* | +| Recognizer (CRNN_256) | 16 - 18\* | +| Recognizer (CRNN_128) | 17 - 19\* | \* - The model weights vary depending on the language. ## Vertical OCR -| Model | XNNPACK [MB] | -| ------------------------ | :----------: | -| Detector (CRAFT_1280) | 83.1 | -| Detector (CRAFT_320) | 83.1 | -| Recognizer (CRNN_EN_512) | 15 - 18\* | -| Recognizer (CRNN_EN_64) | 15 - 16\* | +| Model | XNNPACK [MB] | +| ------------------------------- | :----------: | +| Detector (CRAFT_1280_QUANTIZED) | 19.8 | +| Detector (CRAFT_320_QUANTIZED) | 19.8 | +| Recognizer (CRNN_EN_512) | 15 - 18\* | +| Recognizer (CRNN_EN_64) | 15 - 16\* | \* - The model weights vary depending on the language. diff --git a/docs/versioned_docs/version-0.4.x/benchmarks/inference-time.md b/docs/versioned_docs/version-0.4.x/benchmarks/inference-time.md index da35e7b6e..f5d6d0113 100644 --- a/docs/versioned_docs/version-0.4.x/benchmarks/inference-time.md +++ b/docs/versioned_docs/version-0.4.x/benchmarks/inference-time.md @@ -8,50 +8,52 @@ Times presented in the tables are measured as consecutive runs of the model. 
Ini ## Classification -| Model | iPhone 16 Pro (Core ML) [ms] | iPhone 13 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ----------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| EFFICIENTNET_V2_S | 100 | 120 | 130 | 180 | 170 | +| EFFICIENTNET_V2_S | 150 | 161 | 227 | 196 | 214 | ## Object Detection -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 13 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ------------------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| SSDLITE_320_MOBILENET_V3_LARGE | 190 | 260 | 280 | 100 | 90 | +| SSDLITE_320_MOBILENET_V3_LARGE | 261 | 279 | 414 | 125 | 115 | ## Style Transfer -| Model | iPhone 16 Pro (Core ML) [ms] | iPhone 13 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ---------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| STYLE_TRANSFER_CANDY | 450 | 600 | 750 | 1650 | 1800 | -| STYLE_TRANSFER_MOSAIC | 450 | 600 | 750 | 1650 | 1800 | -| STYLE_TRANSFER_UDNIE | 450 | 
600 | 750 | 1650 | 1800 | -| STYLE_TRANSFER_RAIN_PRINCESS | 450 | 600 | 750 | 1650 | 1800 | +| STYLE_TRANSFER_CANDY | 1565 | 1675 | 2325 | 1750 | 1620 | +| STYLE_TRANSFER_MOSAIC | 1565 | 1675 | 2325 | 1750 | 1620 | +| STYLE_TRANSFER_UDNIE | 1565 | 1675 | 2325 | 1750 | 1620 | +| STYLE_TRANSFER_RAIN_PRINCESS | 1565 | 1675 | 2325 | 1750 | 1620 | ## OCR -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | Samsung Galaxy S21 (XNNPACK) [ms] | -| --------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-------------------------------: | -| Detector (CRAFT_800) | 2099 | 2227 | ❌ | 2245 | 7108 | -| Recognizer (CRNN_512) | 70 | 252 | ❌ | 54 | 151 | -| Recognizer (CRNN_256) | 39 | 123 | ❌ | 24 | 78 | -| Recognizer (CRNN_128) | 17 | 83 | ❌ | 14 | 39 | +Notice that the recognizer models were executed between 3 and 7 times during a single recognition. +The values below represent the averages across all runs for the benchmark image. -❌ - Insufficient RAM. 
+| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| ------------------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| Detector (CRAFT_800_QUANTIZED) | 779 | 897 | 1276 | 553 | 586 | +| Recognizer (CRNN_512) | 77 | 74 | 244 | 56 | 57 | +| Recognizer (CRNN_256) | 35 | 37 | 120 | 28 | 30 | +| Recognizer (CRNN_128) | 18 | 19 | 60 | 14 | 16 | ## Vertical OCR -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | Samsung Galaxy S21 (XNNPACK) [ms] | -| --------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-------------------------------: | -| Detector (CRAFT_1280) | 5457 | 5833 | ❌ | 6296 | 14053 | -| Detector (CRAFT_320) | 1351 | 1460 | ❌ | 1485 | 3101 | -| Recognizer (CRNN_512) | 39 | 123 | ❌ | 24 | 78 | -| Recognizer (CRNN_64) | 10 | 33 | ❌ | 7 | 18 | +Notice that the recognizer models, as well as the CRAFT_320 detector model, were executed between 4 and 21 times during a single recognition. +The values below represent the averages across all runs for the benchmark image. -❌ - Insufficient RAM. 
+| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| ------------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| Detector (CRAFT_1280_QUANTIZED) | 1918 | 2304 | 3371 | 1391 | 1445 | +| Detector (CRAFT_320_QUANTIZED) | 473 | 563 | 813 | 361 | 382 | +| Recognizer (CRNN_512) | 78 | 83 | 310 | 59 | 57 | +| Recognizer (CRNN_64) | 9 | 9 | 38 | 8 | 7 | ## LLMs -| Model | iPhone 16 Pro (XNNPACK) [tokens/s] | iPhone 13 Pro (XNNPACK) [tokens/s] | iPhone SE 3 (XNNPACK) [tokens/s] | Samsung Galaxy S24 (XNNPACK) [tokens/s] | OnePlus 12 (XNNPACK) [tokens/s] | +| Model | iPhone 17 Pro (XNNPACK) [tokens/s] | iPhone 16 Pro (XNNPACK) [tokens/s] | iPhone SE 3 (XNNPACK) [tokens/s] | Samsung Galaxy S24 (XNNPACK) [tokens/s] | OnePlus 12 (XNNPACK) [tokens/s] | | --------------------- | :--------------------------------: | :--------------------------------: | :------------------------------: | :-------------------------------------: | :-----------------------------: | | LLAMA3_2_1B | 16.1 | 11.4 | ❌ | 15.6 | 19.3 | | LLAMA3_2_1B_SPINQUANT | 40.6 | 16.7 | 16.5 | 40.3 | 48.2 | @@ -68,7 +70,7 @@ Times presented in the tables are measured as consecutive runs of the model. Ini Notice than for `Whisper` model which has to take as an input 30 seconds audio chunks (for shorter audio it is automatically padded with silence to 30 seconds) `fast` mode has the lowest latency (time from starting transcription to first token returned, caused by streaming algorithm), but the slowest speed. That's why for the lowest latency and the fastest transcription we suggest using `Moonshine` model, if you still want to proceed with `Whisper` use preferably the `balanced` mode. 
-| Model (mode) | iPhone 16 Pro (XNNPACK) [latency \| tokens/s] | iPhone 14 Pro (XNNPACK) [latency \| tokens/s] | iPhone SE 3 (XNNPACK) [latency \| tokens/s] | Samsung Galaxy S24 (XNNPACK) [latency \| tokens/s] | OnePlus 12 (XNNPACK) [latency \| tokens/s] | +| Model (mode) | iPhone 17 Pro (XNNPACK) [latency \| tokens/s] | iPhone 16 Pro (XNNPACK) [latency \| tokens/s] | iPhone SE 3 (XNNPACK) [latency \| tokens/s] | Samsung Galaxy S24 (XNNPACK) [latency \| tokens/s] | OnePlus 12 (XNNPACK) [latency \| tokens/s] | | ------------------------- | :-------------------------------------------: | :-------------------------------------------: | :-----------------------------------------: | :------------------------------------------------: | :----------------------------------------: | | Moonshine-tiny (fast) | 0.8s \| 19.0t/s | 1.5s \| 11.3t/s | 1.5s \| 10.4t/s | 2.0s \| 8.8t/s | 1.6s \| 12.5t/s | | Moonshine-tiny (balanced) | 2.0s \| 20.0t/s | 3.2s \| 12.4t/s | 3.7s \| 10.4t/s | 4.6s \| 11.2t/s | 3.4s \| 14.6t/s | @@ -81,7 +83,7 @@ Notice than for `Whisper` model which has to take as an input 30 seconds audio c Average time for encoding audio of given length over 10 runs. For `Whisper` model we only list 30 sec audio chunks since `Whisper` does not accept other lengths (for shorter audio the audio needs to be padded to 30sec with silence). 
-| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | -------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | | Moonshine-tiny (5s) | 99 | 95 | 115 | 284 | 277 | | Moonshine-tiny (10s) | 178 | 177 | 204 | 555 | 528 | @@ -92,7 +94,7 @@ Average time for encoding audio of given length over 10 runs. For `Whisper` mode Average time for decoding one token in sequence of 100 tokens, with encoding context is obtained from audio of noted length. -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | -------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | | Moonshine-tiny (5s) | 48.98 | 47.98 | 46.86 | 36.70 | 29.03 | | Moonshine-tiny (10s) | 54.24 | 51.74 | 55.07 | 46.31 | 32.41 | @@ -101,9 +103,9 @@ Average time for decoding one token in sequence of 100 tokens, with encoding con ## Text Embeddings -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) | OnePlus 12 (XNNPACK) [ms] | -| -------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :--------------------------: | :-----------------------: | -| ALL_MINILM_L6_V2 | 53 | 69 | 78 | 60 | 65 
| -| ALL_MPNET_BASE_V2 | 352 | 423 | 478 | 521 | 527 | -| MULTI_QA_MINILM_L6_COS_V1 | 135 | 166 | 180 | 158 | 165 | -| MULTI_QA_MPNET_BASE_DOT_V1 | 503 | 598 | 680 | 694 | 743 | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| -------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| ALL_MINILM_L6_V2 | 50 | 58 | 84 | 58 | 58 | +| ALL_MPNET_BASE_V2 | 352 | 428 | 879 | 483 | 517 | +| MULTI_QA_MINILM_L6_COS_V1 | 133 | 161 | 269 | 151 | 155 | +| MULTI_QA_MPNET_BASE_DOT_V1 | 502 | 796 | 1216 | 915 | 713 | diff --git a/docs/versioned_docs/version-0.4.x/benchmarks/memory-usage.md b/docs/versioned_docs/version-0.4.x/benchmarks/memory-usage.md index 862ffd574..25298f630 100644 --- a/docs/versioned_docs/version-0.4.x/benchmarks/memory-usage.md +++ b/docs/versioned_docs/version-0.4.x/benchmarks/memory-usage.md @@ -25,16 +25,16 @@ title: Memory Usage ## OCR -| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | -| -------------------------------------------------------------------------------------------- | :--------------------: | :----------------: | -| Detector (CRAFT_800) + Recognizer (CRNN_512) + Recognizer (CRNN_256) + Recognizer (CRNN_128) | 2100 | 1782 | +| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | +| ------------------------------------------------------------------------------------------------------ | :--------------------: | :----------------: | +| Detector (CRAFT_800_QUANTIZED) + Recognizer (CRNN_512) + Recognizer (CRNN_256) + Recognizer (CRNN_128) | 1400 | 1320 | ## Vertical OCR -| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | -| -------------------------------------------------------------------- | :--------------------: | :----------------: | -| Detector (CRAFT_1280) + Detector (CRAFT_320) + 
Recognizer (CRNN_512) | 2770 | 3720 | -| Detector(CRAFT_1280) + Detector(CRAFT_320) + Recognizer (CRNN_64) | 1770 | 2740 | +| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | +| ---------------------------------------------------------------------------------------- | :--------------------: | :----------------: | +| Detector (CRAFT_1280_QUANTIZED) + Detector (CRAFT_320_QUANTIZED) + Recognizer (CRNN_512) | 1540 | 1470 | +| Detector (CRAFT_1280_QUANTIZED) + Detector (CRAFT_320_QUANTIZED) + Recognizer (CRNN_64) | 1070 | 1000 | ## LLMs diff --git a/docs/versioned_docs/version-0.4.x/benchmarks/model-size.md b/docs/versioned_docs/version-0.4.x/benchmarks/model-size.md index f39fa2f14..d5e890120 100644 --- a/docs/versioned_docs/version-0.4.x/benchmarks/model-size.md +++ b/docs/versioned_docs/version-0.4.x/benchmarks/model-size.md @@ -25,23 +25,23 @@ title: Model Size ## OCR -| Model | XNNPACK [MB] | -| --------------------- | :----------: | -| Detector (CRAFT_800) | 83.1 | -| Recognizer (CRNN_512) | 15 - 18\* | -| Recognizer (CRNN_256) | 16 - 18\* | -| Recognizer (CRNN_128) | 17 - 19\* | +| Model | XNNPACK [MB] | +| ------------------------------ | :----------: | +| Detector (CRAFT_800_QUANTIZED) | 19.8 | +| Recognizer (CRNN_512) | 15 - 18\* | +| Recognizer (CRNN_256) | 16 - 18\* | +| Recognizer (CRNN_128) | 17 - 19\* | \* - The model weights vary depending on the language. ## Vertical OCR -| Model | XNNPACK [MB] | -| ------------------------ | :----------: | -| Detector (CRAFT_1280) | 83.1 | -| Detector (CRAFT_320) | 83.1 | -| Recognizer (CRNN_EN_512) | 15 - 18\* | -| Recognizer (CRNN_EN_64) | 15 - 16\* | +| Model | XNNPACK [MB] | +| ------------------------------- | :----------: | +| Detector (CRAFT_1280_QUANTIZED) | 19.8 | +| Detector (CRAFT_320_QUANTIZED) | 19.8 | +| Recognizer (CRNN_EN_512) | 15 - 18\* | +| Recognizer (CRNN_EN_64) | 15 - 16\* | \* - The model weights vary depending on the language. 
diff --git a/docs/versioned_docs/version-0.5.x/04-benchmarks/inference-time.md b/docs/versioned_docs/version-0.5.x/04-benchmarks/inference-time.md index 504c0f6e9..89f1f9de1 100644 --- a/docs/versioned_docs/version-0.5.x/04-benchmarks/inference-time.md +++ b/docs/versioned_docs/version-0.5.x/04-benchmarks/inference-time.md @@ -8,46 +8,48 @@ Times presented in the tables are measured as consecutive runs of the model. Ini ## Classification -| Model | iPhone 16 Pro (Core ML) [ms] | iPhone 13 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ----------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| EFFICIENTNET_V2_S | 100 | 120 | 130 | 180 | 170 | +| EFFICIENTNET_V2_S | 105 | 110 | 149 | 299 | 227 | ## Object Detection -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 13 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ------------------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| SSDLITE_320_MOBILENET_V3_LARGE | 190 | 260 | 280 | 100 | 90 | +| SSDLITE_320_MOBILENET_V3_LARGE | 116 | 120 | 164 | 257 | 129 | ## Style Transfer -| Model | iPhone 16 Pro (Core ML) [ms] | iPhone 13 Pro (Core ML) [ms] | iPhone SE 3 (Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| Model | iPhone 17 Pro (Core ML) [ms] | iPhone 16 Pro (Core ML) [ms] | iPhone SE 3 
(Core ML) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | | ---------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| STYLE_TRANSFER_CANDY | 450 | 600 | 750 | 1650 | 1800 | -| STYLE_TRANSFER_MOSAIC | 450 | 600 | 750 | 1650 | 1800 | -| STYLE_TRANSFER_UDNIE | 450 | 600 | 750 | 1650 | 1800 | -| STYLE_TRANSFER_RAIN_PRINCESS | 450 | 600 | 750 | 1650 | 1800 | +| STYLE_TRANSFER_CANDY | 1356 | 1550 | 2003 | 2578 | 2328 | +| STYLE_TRANSFER_MOSAIC | 1376 | 1456 | 1971 | 2657 | 2394 | +| STYLE_TRANSFER_UDNIE | 1389 | 1499 | 1858 | 2380 | 2124 | +| STYLE_TRANSFER_RAIN_PRINCESS | 1339 | 1514 | 2004 | 2608 | 2371 | ## OCR -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | Samsung Galaxy S21 (XNNPACK) [ms] | -| --------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-------------------------------: | -| Detector (CRAFT_800) | 2099 | 2227 | ❌ | 2245 | 7108 | -| Recognizer (CRNN_512) | 70 | 252 | ❌ | 54 | 151 | -| Recognizer (CRNN_256) | 39 | 123 | ❌ | 24 | 78 | -| Recognizer (CRNN_128) | 17 | 83 | ❌ | 14 | 39 | +Notice that the recognizer models were executed between 3 and 7 times during a single recognition. +The values below represent the averages across all runs for the benchmark image. -❌ - Insufficient RAM. 
+| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| ------------------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| Detector (CRAFT_800_QUANTIZED) | 669 | 649 | 825 | 541 | 474 | +| Recognizer (CRNN_512) | 48 | 47 | 60 | 91 | 72 | +| Recognizer (CRNN_256) | 22 | 22 | 29 | 51 | 30 | +| Recognizer (CRNN_128) | 11 | 11 | 14 | 28 | 17 | ## Vertical OCR -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | Samsung Galaxy S21 (XNNPACK) [ms] | -| --------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-------------------------------: | -| Detector (CRAFT_1280) | 5457 | 5833 | ❌ | 6296 | 14053 | -| Detector (CRAFT_320) | 1351 | 1460 | ❌ | 1485 | 3101 | -| Recognizer (CRNN_512) | 39 | 123 | ❌ | 24 | 78 | -| Recognizer (CRNN_64) | 10 | 33 | ❌ | 7 | 18 | +Notice that the recognizer models, as well as the CRAFT_320 detector model, were executed between 4 and 21 times during a single recognition. +The values below represent the averages across all runs for the benchmark image. -❌ - Insufficient RAM.
+| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| ------------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| Detector (CRAFT_1280_QUANTIZED) | 1749 | 1804 | 2105 | 1216 | 1171 | +| Detector (CRAFT_320_QUANTIZED) | 458 | 474 | 561 | 360 | 332 | +| Recognizer (CRNN_512) | 54 | 52 | 68 | 144 | 72 | +| Recognizer (CRNN_64) | 5 | 6 | 7 | 28 | 11 | ## LLMs @@ -62,41 +64,31 @@ Times presented in the tables are measured as consecutive runs of the model. Ini ❌ - Insufficient RAM. -### Streaming mode - -Notice than for `Whisper` model which has to take as an input 30 seconds audio chunks (for shorter audio it is automatically padded with silence to 30 seconds) `fast` mode has the lowest latency (time from starting transcription to first token returned, caused by streaming algorithm), but the slowest speed. If you believe that this might be a problem for you, prefer `balanced` mode instead. 
- -| Model (mode) | iPhone 16 Pro (XNNPACK) [latency \| tokens/s] | iPhone 14 Pro (XNNPACK) [latency \| tokens/s] | iPhone SE 3 (XNNPACK) [latency \| tokens/s] | Samsung Galaxy S24 (XNNPACK) [latency \| tokens/s] | OnePlus 12 (XNNPACK) [latency \| tokens/s] | -| ------------------------- | :-------------------------------------------: | :-------------------------------------------: | :-----------------------------------------: | :------------------------------------------------: | :----------------------------------------: | -| Whisper-tiny (fast) | 2.8s \| 5.5t/s | 3.7s \| 4.4t/s | 4.4s \| 3.4t/s | 5.5s \| 3.1t/s | 5.3s \| 3.8t/s | -| Whisper-tiny (balanced) | 5.6s \| 7.9t/s | 7.0s \| 6.3t/s | 8.3s \| 5.0t/s | 8.4s \| 6.7t/s | 7.7s \| 7.2t/s | -| Whisper-tiny (quality) | 10.3s \| 8.3t/s | 12.6s \| 6.8t/s | 7.8s \| 8.9t/s | 13.5s \| 7.1t/s | 12.9s \| 7.5t/s | - ### Encoding Average time for encoding audio of given length over 10 runs. For `Whisper` model we only list 30 sec audio chunks since `Whisper` does not accept other lengths (for shorter audio the audio needs to be padded to 30sec with silence). 
-| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | -| -------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| Whisper-tiny (30s) | 1034 | 1344 | 1269 | 2916 | 2143 | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| ------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| Whisper-tiny (30s) | 1391 | 1372 | 1894 | 1303 | 1214 | ### Decoding -Average time for decoding one token in sequence of 100 tokens, with encoding context is obtained from audio of noted length. +Average time for decoding one token in a sequence of approximately 100 tokens, where the encoding context is obtained from audio of the noted length.
-| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | -| -------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| Whisper-tiny (30s) | 128.03 | 113.65 | 141.63 | 89.08 | 84.49 | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| ------------------ | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| Whisper-tiny (30s) | 53 | 53 | 74 | 100 | 84 | ## Text Embeddings -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) | OnePlus 12 (XNNPACK) [ms] | -| -------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :--------------------------: | :-----------------------: | -| ALL_MINILM_L6_V2 | 15 | 22 | 23 | 36 | 31 | -| ALL_MPNET_BASE_V2 | 71 | 96 | 101 | 112 | 105 | -| MULTI_QA_MINILM_L6_COS_V1 | 15 | 22 | 23 | 36 | 31 | -| MULTI_QA_MPNET_BASE_DOT_V1 | 71 | 95 | 100 | 112 | 105 | -| CLIP_VIT_BASE_PATCH32_TEXT | 31 | 47 | 48 | 55 | 49 | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| -------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| ALL_MINILM_L6_V2 | 16 | 16 | 19 | 54 | 28 | +| ALL_MPNET_BASE_V2 | 115 | 116 | 144 | 145 | 95 | +| MULTI_QA_MINILM_L6_COS_V1 | 16 | 16 | 20 | 47 | 28 | +| MULTI_QA_MPNET_BASE_DOT_V1 | 112 | 119 
| 144 | 146 | 96 | +| CLIP_VIT_BASE_PATCH32_TEXT | 47 | 45 | 57 | 65 | 48 | :::info Benchmark times for text embeddings are highly dependent on the sentence length. The numbers above are based on a sentence of around 80 tokens. For shorter or longer sentences, inference time may vary accordingly. @@ -104,10 +96,16 @@ Benchmark times for text embeddings are highly dependent on the sentence length. ## Image Embeddings -| Model | iPhone 16 Pro (XNNPACK) [ms] | iPhone 14 Pro Max (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | -| --------------------------- | :--------------------------: | :------------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | -| CLIP_VIT_BASE_PATCH32_IMAGE | 48 | 64 | 69 | 65 | 63 | +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| --------------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| CLIP_VIT_BASE_PATCH32_IMAGE | 70 | 70 | 90 | 66 | 58 | :::info Image embedding benchmark times are measured using 224×224 pixel images, as required by the model. All input images, whether larger or smaller, are resized to 224×224 before processing. Resizing is typically fast for small images but may be noticeably slower for very large images, which can increase total inference time. 
::: + +## Text to Image + +| Model | iPhone 17 Pro (XNNPACK) [ms] | iPhone 16 Pro (XNNPACK) [ms] | iPhone SE 3 (XNNPACK) [ms] | Samsung Galaxy S24 (XNNPACK) [ms] | OnePlus 12 (XNNPACK) [ms] | +| --------------------- | :--------------------------: | :--------------------------: | :------------------------: | :-------------------------------: | :-----------------------: | +| BK_SDM_TINY_VPRED_256 | 21184 | 21021 | ❌ | 18834 | 16617 | + +❌ - Insufficient RAM. diff --git a/docs/versioned_docs/version-0.5.x/04-benchmarks/memory-usage.md b/docs/versioned_docs/version-0.5.x/04-benchmarks/memory-usage.md index 684020e2a..a0c5a7b6d 100644 --- a/docs/versioned_docs/version-0.5.x/04-benchmarks/memory-usage.md +++ b/docs/versioned_docs/version-0.5.x/04-benchmarks/memory-usage.md @@ -2,69 +2,80 @@ title: Memory Usage --- +:::info +All of the benchmarks below were performed on iPhone 17 Pro (iOS) and OnePlus 12 (Android). +::: + ## Classification | Model | Android (XNNPACK) [MB] | iOS (Core ML) [MB] | | ----------------- | :--------------------: | :----------------: | -| EFFICIENTNET_V2_S | 130 | 85 | +| EFFICIENTNET_V2_S | 230 | 87 | ## Object Detection | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | | ------------------------------ | :--------------------: | :----------------: | -| SSDLITE_320_MOBILENET_V3_LARGE | 90 | 90 | +| SSDLITE_320_MOBILENET_V3_LARGE | 164 | 132 | ## Style Transfer | Model | Android (XNNPACK) [MB] | iOS (Core ML) [MB] | | ---------------------------- | :--------------------: | :----------------: | -| STYLE_TRANSFER_CANDY | 950 | 350 | -| STYLE_TRANSFER_MOSAIC | 950 | 350 | -| STYLE_TRANSFER_UDNIE | 950 | 350 | -| STYLE_TRANSFER_RAIN_PRINCESS | 950 | 350 | +| STYLE_TRANSFER_CANDY | 1200 | 380 | +| STYLE_TRANSFER_MOSAIC | 1200 | 380 | +| STYLE_TRANSFER_UDNIE | 1200 | 380 | +| STYLE_TRANSFER_RAIN_PRINCESS | 1200 | 380 | ## OCR -| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | -|
-------------------------------------------------------------------------------------------- | :--------------------: | :----------------: | -| Detector (CRAFT_800) + Recognizer (CRNN_512) + Recognizer (CRNN_256) + Recognizer (CRNN_128) | 2100 | 1782 | +| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | +| ------------------------------------------------------------------------------------------------------ | :--------------------: | :----------------: | +| Detector (CRAFT_800_QUANTIZED) + Recognizer (CRNN_512) + Recognizer (CRNN_256) + Recognizer (CRNN_128) | 1400 | 1320 | ## Vertical OCR -| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | -| -------------------------------------------------------------------- | :--------------------: | :----------------: | -| Detector (CRAFT_1280) + Detector (CRAFT_320) + Recognizer (CRNN_512) | 2770 | 3720 | -| Detector(CRAFT_1280) + Detector(CRAFT_320) + Recognizer (CRNN_64) | 1770 | 2740 | +| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | +| ---------------------------------------------------------------------------------------- | :--------------------: | :----------------: | +| Detector (CRAFT_1280_QUANTIZED) + Detector (CRAFT_320_QUANTIZED) + Recognizer (CRNN_512) | 1540 | 1470 | +| Detector (CRAFT_1280_QUANTIZED) + Detector (CRAFT_320_QUANTIZED) + Recognizer (CRNN_64) | 1070 | 1000 | ## LLMs | Model | Android (XNNPACK) [GB] | iOS (XNNPACK) [GB] | | --------------------- | :--------------------: | :----------------: | -| LLAMA3_2_1B | 3.2 | 3.1 | -| LLAMA3_2_1B_SPINQUANT | 1.9 | 2 | -| LLAMA3_2_1B_QLORA | 2.2 | 2.5 | +| LLAMA3_2_1B | 3.3 | 3.1 | +| LLAMA3_2_1B_SPINQUANT | 1.9 | 2.4 | +| LLAMA3_2_1B_QLORA | 2.7 | 2.8 | | LLAMA3_2_3B | 7.1 | 7.3 | | LLAMA3_2_3B_SPINQUANT | 3.7 | 3.8 | -| LLAMA3_2_3B_QLORA | 4 | 4.1 | +| LLAMA3_2_3B_QLORA | 3.9 | 4.0 | ## Speech to text | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | | ------------ | :--------------------: | :----------------: | -| WHISPER_TINY | 900 |
600 | +| WHISPER_TINY | 410 | 375 | ## Text Embeddings | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | | -------------------------- | :--------------------: | :----------------: | -| ALL_MINILM_L6_V2 | 85 | 100 | -| ALL_MPNET_BASE_V2 | 390 | 465 | -| MULTI_QA_MINILM_L6_COS_V1 | 115 | 130 | -| MULTI_QA_MPNET_BASE_DOT_V1 | 415 | 490 | -| CLIP_VIT_BASE_PATCH32_TEXT | 195 | 250 | +| ALL_MINILM_L6_V2 | 95 | 110 | +| ALL_MPNET_BASE_V2 | 405 | 455 | +| MULTI_QA_MINILM_L6_COS_V1 | 120 | 140 | +| MULTI_QA_MPNET_BASE_DOT_V1 | 435 | 455 | +| CLIP_VIT_BASE_PATCH32_TEXT | 200 | 280 | ## Image Embeddings | Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | | --------------------------- | :--------------------: | :----------------: | -| CLIP_VIT_BASE_PATCH32_IMAGE | 350 | 340 | +| CLIP_VIT_BASE_PATCH32_IMAGE | 345 | 340 | + +## Text to Image + +| Model | Android (XNNPACK) [MB] | iOS (XNNPACK) [MB] | +| --------------------- | :--------------------: | :----------------: | +| BK_SDM_TINY_VPRED_256 | 2400 | 2400 | +| BK_SDM_TINY_VPRED | 6210 | 6050 | diff --git a/docs/versioned_docs/version-0.5.x/04-benchmarks/model-size.md b/docs/versioned_docs/version-0.5.x/04-benchmarks/model-size.md index 9d20c95d5..128cbd7fb 100644 --- a/docs/versioned_docs/version-0.5.x/04-benchmarks/model-size.md +++ b/docs/versioned_docs/version-0.5.x/04-benchmarks/model-size.md @@ -25,23 +25,23 @@ title: Model Size ## OCR -| Model | XNNPACK [MB] | -| --------------------- | :----------: | -| Detector (CRAFT_800) | 83.1 | -| Recognizer (CRNN_512) | 15 - 18\* | -| Recognizer (CRNN_256) | 16 - 18\* | -| Recognizer (CRNN_128) | 17 - 19\* | +| Model | XNNPACK [MB] | +| ------------------------------ | :----------: | +| Detector (CRAFT_800_QUANTIZED) | 19.8 | +| Recognizer (CRNN_512) | 15 - 18\* | +| Recognizer (CRNN_256) | 16 - 18\* | +| Recognizer (CRNN_128) | 17 - 19\* | \* - The model weights vary depending on the language.
## Vertical OCR -| Model | XNNPACK [MB] | -| ------------------------ | :----------: | -| Detector (CRAFT_1280) | 83.1 | -| Detector (CRAFT_320) | 83.1 | -| Recognizer (CRNN_EN_512) | 15 - 18\* | -| Recognizer (CRNN_EN_64) | 15 - 16\* | +| Model | XNNPACK [MB] | +| ------------------------------- | :----------: | +| Detector (CRAFT_1280_QUANTIZED) | 19.8 | +| Detector (CRAFT_320_QUANTIZED) | 19.8 | +| Recognizer (CRNN_EN_512) | 15 - 18\* | +| Recognizer (CRNN_EN_64) | 15 - 16\* | \* - The model weights vary depending on the language. @@ -82,3 +82,9 @@ title: Model Size | Model | XNNPACK [MB] | | --------------------------- | :----------: | | CLIP_VIT_BASE_PATCH32_IMAGE | 352 | + +## Text to Image + +| Model | Text encoder (XNNPACK) [MB] | UNet (XNNPACK) [MB] | VAE decoder (XNNPACK) [MB] | +| ----------------- | :-------------------------: | :-----------------: | :------------------------: | +| BK_SDM_TINY_VPRED | 492 | 1290 | 198 |