
ggml:metal Add POOL2D op and fix IM2COL in Metal backend for running MobileVLM_V2. #9943


Merged (9 commits, Oct 23, 2024)

Conversation

@junhee-yoo (Contributor) commented on Oct 18, 2024

Summary

  • F32 POOL2D operations (AVG, MAX) are added.

  • Because MobileVLM_V2 needs IM2COL with N = 2048, which exceeds maxTotalThreadsPerThreadgroup on the M3 Max, the IM2COL op in the Metal backend is fixed to support large N, and tests are added for it (a minimal sketch of the size check follows this list).

  • I ran GG_BUILD_METAL=1 bash ci/run.sh ./tmp/results ./tmp/mnt with Python 3.10.14 on macOS Sonoma 14.5 (Apple M3 Max), and it completed without failure.

  • I have read the contributing guidelines

  • Self-reported review complexity:

    • Low
    • Medium
    • High
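
For context, here is a minimal, self-contained sketch (plain C++, illustration only) of the size check behind the IM2COL fix. The 3x3 kernel shape matches the tests below, while the 1024-thread threadgroup limit is an assumed value standing in for maxTotalThreadsPerThreadgroup:

#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative values: MobileVLM_V2's projector needs IM2COL with N = 2048
    // and a 3x3 kernel; 1024 stands in for maxTotalThreadsPerThreadgroup.
    const int64_t N = 2048, KH = 3, KW = 3;
    const int64_t max_threads_per_threadgroup = 1024;

    // A single-threadgroup IM2COL launch needs N * KH * KW threads, so the
    // product must stay within the device limit; when it doesn't, the work has
    // to be split differently (the _ext kernel added in this PR handles that case).
    const int64_t needed  = N * KH * KW;   // 18432 for these values
    const bool    exceeds = needed > max_threads_per_threadgroup;

    printf("threads needed = %lld, limit = %lld, exceeds = %d\n",
           (long long) needed, (long long) max_threads_per_threadgroup, (int) exceeds);
    return 0;
}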

Running MobileVLM_V2

encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in   109.58 ms by CLIP (    0.76 ms per image patch)

 A man wearing a blue shirt and looking up.
llama_perf_context_print:        load time =    2390.57 ms
llama_perf_context_print: prompt eval time =    1173.29 ms /   188 tokens (    6.24 ms per token,   160.23 tokens per second)
llama_perf_context_print:        eval time =      90.64 ms /    12 runs   (    7.55 ms per token,   132.39 tokens per second)
llama_perf_context_print:       total time =    2492.64 ms /   200 tokens
full test log
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Max)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Max)
build: 3936 (1467a7a0) with Apple clang version 16.0.0 (clang-1600.0.26.3) for arm64-apple-darwin23.5.0 (debug)
llama_load_model_from_file: using device Metal (Apple M3 Max) - 36863 MiB free
llama_model_loader: loaded meta data with 23 key-value pairs and 219 tensors from /Users/user/workspace/projects/<censored>/llama.cpp/models/MobileVLM_V2-1.7B-GGUF/ggml-model-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Work
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   5:                          llama.block_count u32              = 24
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 16
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 14
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   49 tensors
llama_model_loader: - type q4_K:  162 tensors
llama_model_loader: - type q5_K:    7 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: control token:      1 '<s>' is not marked as EOG
llm_load_vocab: control token:      2 '</s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 1.36 B
llm_load_print_meta: model size       = 754.43 MiB (4.64 BPW) 
llm_load_print_meta: general.name     = Work
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.20 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:      Metal buffer size =   719.29 MiB
llm_load_tensors:        CPU buffer size =    35.16 MiB
...........................................................................................
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    379
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 379 tensors from /Users/user/workspace/projects/<censored>/llama.cpp/models/MobileVLM_V2-1.7B-GGUF/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = ldpv2
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  18:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  236 tensors
clip_model_load: - type  f16:  143 tensors
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
Warning: Compilation succeeded with: 

program_source:2632:19: warning: unused variable 'ncs' [-Wunused-variable]
    const int64_t ncs = ne00;
                  ^
program_source:2633:19: warning: unused variable 'nr' [-Wunused-variable]
    const int64_t nr  = ne01;
                  ^
program_source:2634:19: warning: unused variable 'n_t' [-Wunused-variable]
    const int64_t n_t = ne1;
                  ^
program_source:2635:19: warning: unused variable 'n_s' [-Wunused-variable]
    const int64_t n_s = ne2;
                  ^
program_source:2689:19: warning: unused variable 'nr' [-Wunused-variable]
    const int64_t nr  = d_inner;
                  ^
program_source:2691:19: warning: unused variable 'n_s' [-Wunused-variable]
    const int64_t n_s = n_seqs;
                  ^
program_source:479:28: warning: unused variable 'ksigns64' [-Wunused-const-variable]
GGML_TABLE_BEGIN(uint64_t, ksigns64, 128)
                           ^
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     567.51 MB
clip_model_load: metadata size:  0.13 MB
clip_model_load: params backend buffer size =  567.51 MB (379 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
Warning: Compilation succeeded with: 

program_source:2632:19: warning: unused variable 'ncs' [-Wunused-variable]
    const int64_t ncs = ne00;
                  ^
program_source:2633:19: warning: unused variable 'nr' [-Wunused-variable]
    const int64_t nr  = ne01;
                  ^
program_source:2634:19: warning: unused variable 'n_t' [-Wunused-variable]
    const int64_t n_t = ne1;
                  ^
program_source:2635:19: warning: unused variable 'n_s' [-Wunused-variable]
    const int64_t n_s = ne2;
                  ^
program_source:2689:19: warning: unused variable 'nr' [-Wunused-variable]
    const int64_t nr  = d_inner;
                  ^
program_source:2691:19: warning: unused variable 'n_s' [-Wunused-variable]
    const int64_t n_s = n_seqs;
                  ^
program_source:479:28: warning: unused variable 'ksigns64' [-Wunused-const-variable]
GGML_TABLE_BEGIN(uint64_t, ksigns64, 128)
                           ^
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
llama_kv_cache_init:      Metal KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:      Metal compute buffer size =    84.00 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in   109.58 ms by CLIP (    0.76 ms per image patch)

 A man wearing a blue shirt and looking up.
llama_perf_context_print:        load time =    2390.57 ms
llama_perf_context_print: prompt eval time =    1173.29 ms /   188 tokens (    6.24 ms per token,   160.23 tokens per second)
llama_perf_context_print:        eval time =      90.64 ms /    12 runs   (    7.55 ms per token,   132.39 tokens per second)
llama_perf_context_print:       total time =    2492.64 ms /   200 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
Program ended with exit code: 0

test-backend-ops

I added test cases for large N.

full test-backend-ops log

$ bin/test-backend-ops -o POOL_2D -b Metal
Testing 3 devices

Backend 1/3: Metal
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
  Device description: Apple M3 Max
  Device memory: 36864 MB (36858 MB free)

  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  1615/1615 tests passed
  Backend Metal: OK

ggml_metal_free: deallocating
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK
$ bin/test-backend-ops -o IM2COL -b Metal
Testing 3 devices

Backend 1/3: Metal
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
  Device description: Apple M3 Max
  Device memory: 36864 MB (36858 MB free)

  IM2COL(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): not supported [Metal]
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): not supported [Metal]
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  1615/1615 tests passed
  Backend Metal: OK

ggml_metal_free: deallocating
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK

performance test

I measured performance by adding perf tests with the patch below, which is not included in this PR:

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index d5ebd5cf..83f4d846 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -3718,6 +3718,15 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
         }
     }

+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 1, 32}, {3, 3, 1, 32}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 2, 32}, {3, 3, 2, 32}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 1, 1024}, {3, 3, 1, 1024}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 2, 1024}, {3, 3, 2, 1024}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 1, 2048}, {3, 3, 1, 2048}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 2, 2048}, {3, 3, 2, 2048}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 1, 2560}, {3, 3, 1, 2560}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 2, 2560}, {3, 3, 2, 2560}, 1, 1, 1, 1, 1, 1, true));
+
     return test_cases;
 }
full performance test log
bin/test-backend-ops perf -o IM2COL -b Metal
Testing 3 devices

Backend 1/3: Metal
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
  Device description: Apple M3 Max
  Device memory: 36864 MB (36858 MB free)

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               278494 runs -     3.59 us/run -       99 kB/run -    0.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               204775 runs -     5.06 us/run -      199 kB/run -    1.50 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.89 us/run -     3186 kB/run -   20.29 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.29 us/run -     6372 kB/run -   21.06 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   142.90 us/run -     6372 kB/run -   21.26 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   283.61 us/run -    12744 kB/run -   21.43 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   175.71 us/run -     7965 kB/run -   21.62 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   350.47 us/run -    15930 kB/run -   21.68 GB/s
  Backend Metal: OK

ggml_metal_free: deallocating
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK

I hope this helps you. :)

@github-actions bot added the testing (Everything test related) label on Oct 18, 2024
@ggerganov self-requested a review on October 20, 2024, 16:03
@junhee-yoo changed the title from "ggml: Add POOL2D op and fix IM2COL in Metal backend for running MobileVLM_V2." to "ggml:metal Add POOL2D op and fix IM2COL in Metal backend for running MobileVLM_V2." on Oct 21, 2024
@ggerganov (Member) left a comment:

Well done!

Comment on lines 277 to 278
GGML_METAL_KERNEL_TYPE_AVG_POOL_2D_F32,
GGML_METAL_KERNEL_TYPE_MAX_POOL_2D_F32,
@ggerganov (Member):

Suggested change:
- GGML_METAL_KERNEL_TYPE_AVG_POOL_2D_F32,
- GGML_METAL_KERNEL_TYPE_MAX_POOL_2D_F32,
+ GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32,
+ GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32,

Comment on lines 725 to 726
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_AVG_POOL_2D_F32, avg_pool_2d_f32, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MAX_POOL_2D_F32, max_pool_2d_f32, true);
@ggerganov (Member):

Suggested change:
- GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_AVG_POOL_2D_F32, avg_pool_2d_f32, true);
- GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MAX_POOL_2D_F32, max_pool_2d_f32, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, pool_2d_avg_f32, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, pool_2d_max_f32, true);


id<MTLComputePipelineState> pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_IM2COL_F32].pipeline;
const uint64_t M = pipeline.maxTotalThreadsPerThreadgroup;

const bool is_gt_mttpt = ((size_t)(N * KH * KW)) > pipeline.maxTotalThreadsPerThreadgroup;
@ggerganov (Member):

Would it make sense to keep just the _ext variant of the kernel? Does the old kernel have a significant advantage?

@junhee-yoo (Contributor Author):

Summary: to keep backward compatibility and preserve performance for smaller computations, I suggest keeping the old kernel.

details

I did some investigation after your comment.
The new kernel (a.k.a. _ext) shows a performance degradation for small N:

$ test-backend-ops perf -o IM2COL -b Metal
NOTE: with the 3x3 kernel used here, N = 113 gives 9 x 113 = 1017 threads, which stays within the 1024-thread limit, while N = 114 gives 9 x 114 = 1026, which exceeds it.

original + _ext, M
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               229348 runs -     4.38 us/run -       99 kB/run -    0.77 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               196584 runs -     5.24 us/run -      199 kB/run -    1.51 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,113],ne_kernel=[3,3,1,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             147438 runs -     6.82 us/run -      351 kB/run -    2.73 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,113],ne_kernel=[3,3,2,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             106483 runs -     9.67 us/run -      703 kB/run -    5.34 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    18.79 us/run -      354 kB/run -    2.57 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    18.74 us/run -      354 kB/run -    2.58 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,200],ne_kernel=[3,3,1,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              49146 runs -    21.43 us/run -      622 kB/run -    4.61 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,200],ne_kernel=[3,3,2,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              32764 runs -    35.12 us/run -     1244 kB/run -    8.45 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.60 us/run -     3186 kB/run -   20.37 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.71 us/run -     6372 kB/run -   21.00 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   143.44 us/run -     6372 kB/run -   21.18 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   283.55 us/run -    12744 kB/run -   21.43 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   175.97 us/run -     7965 kB/run -   21.58 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   352.93 us/run -    15930 kB/run -   21.52 GB/s

only _ext, M - worst case.
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                73719 runs -    14.48 us/run -       99 kB/run -    0.73 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                49146 runs -    24.31 us/run -      199 kB/run -    1.30 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,113],ne_kernel=[3,3,1,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    17.90 us/run -      351 kB/run -    2.68 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,113],ne_kernel=[3,3,2,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              40955 runs -    28.89 us/run -      703 kB/run -    4.64 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    18.54 us/run -      354 kB/run -    2.61 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    18.06 us/run -      354 kB/run -    2.68 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,200],ne_kernel=[3,3,1,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    20.35 us/run -      622 kB/run -    4.17 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,200],ne_kernel=[3,3,2,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              32764 runs -    34.79 us/run -     1244 kB/run -    8.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.19 us/run -     3186 kB/run -   20.48 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.72 us/run -     6372 kB/run -   21.00 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   142.86 us/run -     6372 kB/run -   21.27 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   283.70 us/run -    12744 kB/run -   21.42 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   175.84 us/run -     7965 kB/run -   21.60 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   350.67 us/run -    15930 kB/run -   21.66 GB/s

only _ext, MIN(N, M)
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               131056 runs -     7.78 us/run -       99 kB/run -    0.76 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                90101 runs -    11.24 us/run -      199 kB/run -    1.54 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,113],ne_kernel=[3,3,1,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              90101 runs -    11.58 us/run -      351 kB/run -    2.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,113],ne_kernel=[3,3,2,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    17.85 us/run -      703 kB/run -    5.37 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              90101 runs -    11.80 us/run -      354 kB/run -    2.61 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              90101 runs -    11.72 us/run -      354 kB/run -    2.62 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,200],ne_kernel=[3,3,1,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              65528 runs -    16.90 us/run -      622 kB/run -    4.39 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,200],ne_kernel=[3,3,2,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              40955 runs -    28.84 us/run -     1244 kB/run -    8.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.75 us/run -     3186 kB/run -   20.32 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.31 us/run -     6372 kB/run -   21.06 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   143.02 us/run -     6372 kB/run -   21.25 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   286.72 us/run -    12744 kB/run -   21.20 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   176.62 us/run -     7965 kB/run -   21.51 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   351.21 us/run -    15930 kB/run -   21.63 GB/s

original + _ext, MIN(N, M) - best case IMO.
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               221157 runs -     4.53 us/run -       99 kB/run -    0.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               196584 runs -     5.14 us/run -      199 kB/run -    1.54 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,113],ne_kernel=[3,3,1,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             147438 runs -     6.80 us/run -      351 kB/run -    2.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,113],ne_kernel=[3,3,2,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             114674 runs -     9.23 us/run -      703 kB/run -    5.19 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              81910 runs -    12.44 us/run -      354 kB/run -    2.72 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              90101 runs -    11.55 us/run -      354 kB/run -    2.66 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,200],ne_kernel=[3,3,1,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    17.50 us/run -      622 kB/run -    4.85 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,200],ne_kernel=[3,3,2,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              40955 runs -    28.67 us/run -     1244 kB/run -    8.28 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.53 us/run -     3186 kB/run -   20.38 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.56 us/run -     6372 kB/run -   21.02 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   143.35 us/run -     6372 kB/run -   21.20 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   283.69 us/run -    12744 kB/run -   21.42 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   176.14 us/run -     7965 kB/run -   21.56 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   351.41 us/run -    15930 kB/run -   21.62 GB/s

I guess this comes from wasted threads in the _ext kernel (threads that hit the tpitg_0 >= N case) when N is smaller than M (maxTotalThreadsPerThreadgroup), and it gets worse as the M - N gap grows.
Apple's official documentation describes a newer API that spreads threads to exactly fit the given grid (non-uniform threadgroup sizes). However, it is only supported from Metal3/Apple4 (see the Metal feature set tables under the keyword "nonuniform"), so I didn't pursue it because it would narrow the set of devices llama.cpp can run on.

I applied the "original + _ext, MIN(N, M)" variant to this PR, which was the best result from those investigations (sketched below). Thanks for your review; it helped me squeeze out more optimization. :-^)
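
For illustration, here is a rough, self-contained C++ sketch of that dispatch decision. The launch functions are hypothetical stand-ins (the real code encodes Metal kernels in ggml-metal.m), and the 1024-thread limit is an assumed value for maxTotalThreadsPerThreadgroup:

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for encoding the two Metal kernels; they only print.
static void launch_im2col_original(int64_t threads_per_threadgroup) {
    printf("original kernel, %lld threads per threadgroup\n", (long long) threads_per_threadgroup);
}

static void launch_im2col_ext(int64_t threads_per_threadgroup) {
    printf("_ext kernel, %lld threads per threadgroup\n", (long long) threads_per_threadgroup);
}

// "original + _ext, MIN(N, M)": keep the old kernel when the whole problem fits
// in one threadgroup, otherwise use the _ext kernel with a clamped group size.
static void dispatch_im2col(int64_t N, int64_t KH, int64_t KW, int64_t M /* maxTotalThreadsPerThreadgroup */) {
    if (N * KH * KW <= M) {
        // Small N: the original kernel fits in one threadgroup and avoids the
        // idle-thread overhead measured in the tables above.
        launch_im2col_original(N * KH * KW);
    } else {
        // Large N: the _ext kernel splits the work; clamping to MIN(N, M)
        // reduces the number of threads that only hit the tpitg_0 >= N early exit.
        launch_im2col_ext(std::min(N, M));
    }
}

int main() {
    dispatch_im2col( 113, 3, 3, 1024); // 9 * 113 = 1017  -> original kernel
    dispatch_im2col(2048, 3, 3, 1024); // 9 * 2048 = 18432 -> _ext kernel
    return 0;
}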

Comment on lines 6456 to 6469
device const float* src0,
device float* dst,
constant int32_t& k0,
constant int32_t& k1,
constant int32_t& s0,
constant int32_t& s1,
constant int32_t& p0,
constant int32_t& p1,
constant int64_t& IH,
constant int64_t& IW,
constant int64_t& OH,
constant int64_t& OW,
constant int64_t& parallel_elements,
uint gid[[thread_position_in_grid]]) {
@ggerganov (Member):

Suggested change:
- device const float* src0,
- device float* dst,
- constant int32_t& k0,
- constant int32_t& k1,
- constant int32_t& s0,
- constant int32_t& s1,
- constant int32_t& p0,
- constant int32_t& p1,
- constant int64_t& IH,
- constant int64_t& IW,
- constant int64_t& OH,
- constant int64_t& OW,
- constant int64_t& parallel_elements,
- uint gid[[thread_position_in_grid]]) {
+ device const float * src0,
+ device float * dst,
+ constant int32_t & k0,
+ constant int32_t & k1,
+ constant int32_t & s0,
+ constant int32_t & s1,
+ constant int32_t & p0,
+ constant int32_t & p1,
+ constant int64_t & IH,
+ constant int64_t & IW,
+ constant int64_t & OH,
+ constant int64_t & OW,
+ constant int64_t & parallel_elements,
+ uint gid[[thread_position_in_grid]]) {

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
- original IM2COL kernel + _ext with MIN()

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
Comment on lines 725 to 726
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, avg_pool_2d_f32, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, max_pool_2d_f32, true);
@ggerganov (Member):

The kernel names also need to be updated to follow the max-prefix naming convention:

Suggested change:
- GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, avg_pool_2d_f32, true);
- GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, max_pool_2d_f32, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, pool_2d_avg_f32, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, pool_2d_max_f32, true);

@junhee-yoo (Contributor Author):

My bad. Applied in bb9949b.

@ggerganov (Member) left a comment:

Here are a few more minor suggestions to apply:

diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
index ed4d8326..0267b002 100644
--- a/ggml/src/ggml-metal.m
+++ b/ggml/src/ggml-metal.m
@@ -854,7 +854,6 @@ static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_contex
         case GGML_OP_POOL_1D:
             return false;
         case GGML_OP_POOL_2D:
-            return true;
         case GGML_OP_UPSCALE:
         case GGML_OP_PAD:
         case GGML_OP_ARANGE:
@@ -2554,6 +2553,8 @@ static void ggml_metal_encode_node(
             } break;
         case GGML_OP_IM2COL:
             {
+                GGML_ASSERT(ggml_is_contiguous(src0));
+                GGML_ASSERT(ggml_is_contiguous(src1));
                 GGML_ASSERT(src0->type == GGML_TYPE_F16);
                 GGML_ASSERT(src1->type == GGML_TYPE_F32);
                 GGML_ASSERT( dst->type == GGML_TYPE_F16 || dst->type == GGML_TYPE_F32);
@@ -2620,7 +2621,7 @@ static void ggml_metal_encode_node(
                 [encoder setBytes:&d1      length:sizeof( int32_t) atIndex:12];
 
                 if (is_gt_mttpt) {
-                    [encoder setBytes:&N       length:sizeof(int32_t) atIndex:13];
+                    [encoder setBytes:&N        length:sizeof(int32_t) atIndex:13];
                     [encoder setBytes:&KH       length:sizeof(int32_t) atIndex:14];
                     [encoder setBytes:&KW       length:sizeof(int32_t) atIndex:15];
 
@@ -3034,9 +3035,10 @@ static void ggml_metal_encode_node(
             } break;
         case GGML_OP_POOL_2D:
             {
+                GGML_ASSERT(ggml_is_contiguous(src0));
                 GGML_ASSERT(src0t == GGML_TYPE_F32 && src0t == dstt);
 
-                const int32_t* opts = dst->op_params;
+                const int32_t * opts = dst->op_params;
                 enum ggml_op_pool op = opts[0];
 
                 id<MTLComputePipelineState> pipeline = nil;
@@ -3063,7 +3065,7 @@ static void ggml_metal_encode_node(
                 const int64_t IH = src0->ne[1];
                 const int64_t IW = src0->ne[0];
 
-                const int64_t N = dst->ne[3];
+                const int64_t N  = dst->ne[3];
                 const int64_t OC = dst->ne[2];
                 const int64_t OH = dst->ne[1];
                 const int64_t OW = dst->ne[0];
diff --git a/ggml/src/ggml-metal.metal b/ggml/src/ggml-metal.metal
index 1fb05cce..71b58be1 100644
--- a/ggml/src/ggml-metal.metal
+++ b/ggml/src/ggml-metal.metal
@@ -6479,15 +6479,16 @@ kernel void kernel_pool_2d_max_f32(
     const int cur_oh = idx % O_HW / OW;
     const int cur_ow = idx % O_HW % OW;
 
-    device const float* i_ptr = src0 + nc * I_HW;
-    device float* o_ptr = dst + nc * O_HW;
+    device const float * i_ptr = src0 + nc * I_HW;
+    device       float * o_ptr = dst  + nc * O_HW;
 
     const int start_h = cur_oh * s1 - p1;
-    const int bh = MAX(0, start_h);
+    const int bh = MAX(0,  start_h);
     const int eh = MIN(IH, start_h + k1);
     const int start_w = cur_ow * s0 - p0;
-    const int bw = MAX(0, start_w);
+    const int bw = MAX(0,  start_w);
     const int ew = MIN(IW, start_w + k0);
+
     float res = -INFINITY;
 
     for (int i = bh; i < eh; i += 1) {
@@ -6495,23 +6496,24 @@ kernel void kernel_pool_2d_max_f32(
             res = MAX(res, i_ptr[i * IW + j]);
         }
     }
+
     o_ptr[cur_oh * OW + cur_ow] = res;
 }
 
 kernel void kernel_pool_2d_avg_f32(
-        device const float* src0,
-        device       float* dst,
-        constant    int32_t& k0,
-        constant    int32_t& k1,
-        constant    int32_t& s0,
-        constant    int32_t& s1,
-        constant    int32_t& p0,
-        constant    int32_t& p1,
-        constant    int64_t& IH,
-        constant    int64_t& IW,
-        constant    int64_t& OH,
-        constant    int64_t& OW,
-        constant    int64_t& parallel_elements,
+        device  const float * src0,
+        device        float * dst,
+        constant    int32_t & k0,
+        constant    int32_t & k1,
+        constant    int32_t & s0,
+        constant    int32_t & s1,
+        constant    int32_t & p0,
+        constant    int32_t & p1,
+        constant    int64_t & IH,
+        constant    int64_t & IW,
+        constant    int64_t & OH,
+        constant    int64_t & OW,
+        constant    int64_t & parallel_elements,
         uint        gid[[thread_position_in_grid]]) {
 
     if (gid >= parallel_elements) {
@@ -6525,17 +6527,18 @@ kernel void kernel_pool_2d_avg_f32(
     const int cur_oh = idx % O_HW / OW;
     const int cur_ow = idx % O_HW % OW;
 
-    device const float* i_ptr = src0 + nc * I_HW;
-    device float* o_ptr = dst + nc * O_HW;
+    device const float * i_ptr = src0 + nc * I_HW;
+    device       float * o_ptr = dst  + nc * O_HW;
 
     const int start_h = cur_oh * s1 - p1;
-    const int bh = MAX(0, start_h);
+    const int bh = MAX(0,  start_h);
     const int eh = MIN(IH, start_h + k1);
     const int start_w = cur_ow * s0 - p0;
-    const int bw = MAX(0, start_w);
+    const int bw = MAX(0,  start_w);
     const int ew = MIN(IW, start_w + k0);
     // const float scale = 1. / ((eh - bh) * (ew - bw));
     const float scale = 1. / (k0 * k1);
+
     float res = 0;
 
     for (int i = bh; i < eh; i += 1) {
@@ -6544,5 +6547,6 @@ kernel void kernel_pool_2d_avg_f32(
             res += cur * scale;
         }
     }
+
     o_ptr[cur_oh * OW + cur_ow] = res;
 }

@junhee-yoo (Contributor Author):

Here are a few more minor suggestions to apply:
(the rest omitted)

applied in 746e79e. 👍

@ggerganov merged commit 4c9388f into ggml-org:master on Oct 23, 2024
52 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* add pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix im2col and add unittest for N>=1024

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* add tests for N % 1024 != 0

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* remove trailing whitespaces

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply suggestions

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply more optimization

- original IM2COL kernel + _ext with MIN()

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review: change kernel name of pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix more formatting and enhance readability

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

---------

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels: testing (Everything test related)
Projects: none yet
2 participants