
ggml:metal Add POOL2D op and fix IM2COL in Metal backend for running MobileVLM_V2. #9943


Merged (9 commits, Oct 23, 2024)

Conversation

@junhee-yoo (Contributor) commented on Oct 18, 2024

Summary

  • F32 POOL2D operations (AVG, MAX) are added.

  • Because MobileVLM_V2 needs IM2COL with N = 2048, which exceeds maxTotalThreadsPerThreadgroup on the M3 Max, the IM2COL op in the Metal backend is fixed to support large N, and tests are added for it (a minimal sketch of the size check follows this list).

  • I ran GG_BUILD_METAL=1 bash ci/run.sh ./tmp/results ./tmp/mnt with Python 3.10.14 on macOS Sonoma 14.5 (Apple M3 Max), and it completed without failure.

  • I have read the contributing guidelines

  • Self-reported review complexity:

    • Low
    • Medium
    • High
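
For context, here is a minimal, self-contained sketch (plain C++, illustration only) of the size check behind the IM2COL fix. The 3x3 kernel shape matches the tests below, while the 1024-thread threadgroup limit is an assumed value standing in for maxTotalThreadsPerThreadgroup:

#include <cstdint>
#include <cstdio>

int main() {
    // Illustrative values: MobileVLM_V2's projector needs IM2COL with N = 2048
    // and a 3x3 kernel; 1024 stands in for maxTotalThreadsPerThreadgroup.
    const int64_t N = 2048, KH = 3, KW = 3;
    const int64_t max_threads_per_threadgroup = 1024;

    // A single-threadgroup IM2COL launch needs N * KH * KW threads, so the
    // product must stay within the device limit; when it doesn't, the work has
    // to be split differently (the _ext kernel added in this PR handles that case).
    const int64_t needed  = N * KH * KW;   // 18432 for these values
    const bool    exceeds = needed > max_threads_per_threadgroup;

    printf("threads needed = %lld, limit = %lld, exceeds = %d\n",
           (long long) needed, (long long) max_threads_per_threadgroup, (int) exceeds);
    return 0;
}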

Running MobileVLM_V2

encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in   109.58 ms by CLIP (    0.76 ms per image patch)

 A man wearing a blue shirt and looking up.
llama_perf_context_print:        load time =    2390.57 ms
llama_perf_context_print: prompt eval time =    1173.29 ms /   188 tokens (    6.24 ms per token,   160.23 tokens per second)
llama_perf_context_print:        eval time =      90.64 ms /    12 runs   (    7.55 ms per token,   132.39 tokens per second)
llama_perf_context_print:       total time =    2492.64 ms /   200 tokens
full test log
register_backend: registered backend Metal (1 devices)
register_device: registered device Metal (Apple M3 Max)
register_backend: registered backend BLAS (1 devices)
register_device: registered device BLAS (Accelerate)
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Apple M3 Max)
build: 3936 (1467a7a0) with Apple clang version 16.0.0 (clang-1600.0.26.3) for arm64-apple-darwin23.5.0 (debug)
llama_load_model_from_file: using device Metal (Apple M3 Max) - 36863 MiB free
llama_model_loader: loaded meta data with 23 key-value pairs and 219 tensors from /Users/user/workspace/projects/<censored>/llama.cpp/models/MobileVLM_V2-1.7B-GGUF/ggml-model-q4_k.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = Work
llama_model_loader: - kv   2:                           llama.vocab_size u32              = 32000
llama_model_loader: - kv   3:                       llama.context_length u32              = 2048
llama_model_loader: - kv   4:                     llama.embedding_length u32              = 2048
llama_model_loader: - kv   5:                          llama.block_count u32              = 24
llama_model_loader: - kv   6:                  llama.feed_forward_length u32              = 5632
llama_model_loader: - kv   7:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   8:                 llama.attention.head_count u32              = 16
llama_model_loader: - kv   9:              llama.attention.head_count_kv u32              = 16
llama_model_loader: - kv  10:     llama.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  11:                       llama.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  12:                          general.file_type u32              = 14
llama_model_loader: - kv  13:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  14:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  15:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  17:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  18:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  22:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   49 tensors
llama_model_loader: - type q4_K:  162 tensors
llama_model_loader: - type q5_K:    7 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: control token:      1 '<s>' is not marked as EOG
llm_load_vocab: control token:      2 '</s>' is not marked as EOG
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 3
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 2048
llm_load_print_meta: n_layer          = 24
llm_load_print_meta: n_head           = 16
llm_load_print_meta: n_head_kv        = 16
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 2048
llm_load_print_meta: n_embd_v_gqa     = 2048
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-06
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 5632
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = ?B
llm_load_print_meta: model ftype      = Q4_K - Small
llm_load_print_meta: model params     = 1.36 B
llm_load_print_meta: model size       = 754.43 MiB (4.64 BPW) 
llm_load_print_meta: general.name     = Work
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: PAD token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_print_meta: EOG token        = 2 '</s>'
llm_load_print_meta: max token length = 48
llm_load_tensors: ggml ctx size =    0.20 MiB
llm_load_tensors: offloading 24 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 25/25 layers to GPU
llm_load_tensors:      Metal buffer size =   719.29 MiB
llm_load_tensors:        CPU buffer size =    35.16 MiB
...........................................................................................
clip_model_load: model name:   openai/clip-vit-large-patch14-336
clip_model_load: description:  image encoder for LLaVA
clip_model_load: GGUF version: 3
clip_model_load: alignment:    32
clip_model_load: n_tensors:    379
clip_model_load: n_kv:         19
clip_model_load: ftype:        f16

clip_model_load: loaded meta data with 19 key-value pairs and 379 tensors from /Users/user/workspace/projects/<censored>/llama.cpp/models/MobileVLM_V2-1.7B-GGUF/mmproj-model-f16.gguf
clip_model_load: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_model_load: - kv   0:                       general.architecture str              = clip
clip_model_load: - kv   1:                      clip.has_text_encoder bool             = false
clip_model_load: - kv   2:                    clip.has_vision_encoder bool             = true
clip_model_load: - kv   3:                   clip.has_llava_projector bool             = true
clip_model_load: - kv   4:                          general.file_type u32              = 1
clip_model_load: - kv   5:                               general.name str              = openai/clip-vit-large-patch14-336
clip_model_load: - kv   6:                        general.description str              = image encoder for LLaVA
clip_model_load: - kv   7:                        clip.projector_type str              = ldpv2
clip_model_load: - kv   8:                     clip.vision.image_size u32              = 336
clip_model_load: - kv   9:                     clip.vision.patch_size u32              = 14
clip_model_load: - kv  10:               clip.vision.embedding_length u32              = 1024
clip_model_load: - kv  11:            clip.vision.feed_forward_length u32              = 4096
clip_model_load: - kv  12:                 clip.vision.projection_dim u32              = 768
clip_model_load: - kv  13:           clip.vision.attention.head_count u32              = 16
clip_model_load: - kv  14:   clip.vision.attention.layer_norm_epsilon f32              = 0.000010
clip_model_load: - kv  15:                    clip.vision.block_count u32              = 23
clip_model_load: - kv  16:                     clip.vision.image_mean arr[f32,3]       = [0.481455, 0.457828, 0.408211]
clip_model_load: - kv  17:                      clip.vision.image_std arr[f32,3]       = [0.268630, 0.261303, 0.275777]
clip_model_load: - kv  18:                              clip.use_gelu bool             = false
clip_model_load: - type  f32:  236 tensors
clip_model_load: - type  f16:  143 tensors
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
Warning: Compilation succeeded with: 

program_source:2632:19: warning: unused variable 'ncs' [-Wunused-variable]
    const int64_t ncs = ne00;
                  ^
program_source:2633:19: warning: unused variable 'nr' [-Wunused-variable]
    const int64_t nr  = ne01;
                  ^
program_source:2634:19: warning: unused variable 'n_t' [-Wunused-variable]
    const int64_t n_t = ne1;
                  ^
program_source:2635:19: warning: unused variable 'n_s' [-Wunused-variable]
    const int64_t n_s = ne2;
                  ^
program_source:2689:19: warning: unused variable 'nr' [-Wunused-variable]
    const int64_t nr  = d_inner;
                  ^
program_source:2691:19: warning: unused variable 'n_s' [-Wunused-variable]
    const int64_t n_s = n_seqs;
                  ^
program_source:479:28: warning: unused variable 'ksigns64' [-Wunused-const-variable]
GGML_TABLE_BEGIN(uint64_t, ksigns64, 128)
                           ^
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
clip_model_load: CLIP using Metal backend
clip_model_load: text_encoder:   0
clip_model_load: vision_encoder: 1
clip_model_load: llava_projector:  1
clip_model_load: minicpmv_projector:  0
clip_model_load: model size:     567.51 MB
clip_model_load: metadata size:  0.13 MB
clip_model_load: params backend buffer size =  567.51 MB (379 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_model_load: compute allocated memory: 32.89 MB
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
Warning: Compilation succeeded with: 

program_source:2632:19: warning: unused variable 'ncs' [-Wunused-variable]
    const int64_t ncs = ne00;
                  ^
program_source:2633:19: warning: unused variable 'nr' [-Wunused-variable]
    const int64_t nr  = ne01;
                  ^
program_source:2634:19: warning: unused variable 'n_t' [-Wunused-variable]
    const int64_t n_t = ne1;
                  ^
program_source:2635:19: warning: unused variable 'n_s' [-Wunused-variable]
    const int64_t n_s = ne2;
                  ^
program_source:2689:19: warning: unused variable 'nr' [-Wunused-variable]
    const int64_t nr  = d_inner;
                  ^
program_source:2691:19: warning: unused variable 'n_s' [-Wunused-variable]
    const int64_t n_s = n_seqs;
                  ^
program_source:479:28: warning: unused variable 'ksigns64' [-Wunused-const-variable]
GGML_TABLE_BEGIN(uint64_t, ksigns64, 128)
                           ^
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
llama_kv_cache_init:      Metal KV buffer size =   384.00 MiB
llama_new_context_with_model: KV self size  =  384.00 MiB, K (f16):  192.00 MiB, V (f16):  192.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:      Metal compute buffer size =    84.00 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.01 MiB
llama_new_context_with_model: graph nodes  = 774
llama_new_context_with_model: graph splits = 2
encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in   109.58 ms by CLIP (    0.76 ms per image patch)

 A man wearing a blue shirt and looking up.
llama_perf_context_print:        load time =    2390.57 ms
llama_perf_context_print: prompt eval time =    1173.29 ms /   188 tokens (    6.24 ms per token,   160.23 tokens per second)
llama_perf_context_print:        eval time =      90.64 ms /    12 runs   (    7.55 ms per token,   132.39 tokens per second)
llama_perf_context_print:       total time =    2492.64 ms /   200 tokens
ggml_metal_free: deallocating
ggml_metal_free: deallocating
Program ended with exit code: 0

test-backend-ops

I added test cases for large N.

full test-backend-ops log

$ bin/test-backend-ops -o POOL_2D -b Metal
Testing 3 devices

Backend 1/3: Metal
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
  Device description: Apple M3 Max
  Device memory: 36864 MB (36858 MB free)

  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=avg,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=1,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=1,s0=2,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=1,s1=2,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=1,p0=1,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=0,p1=1): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=0): OK
  POOL_2D(pool_type=max,type_input=f32,ne_input=[10,10,3,1],k0=3,k1=3,s0=2,s1=2,p0=1,p1=1): OK
  1615/1615 tests passed
  Backend Metal: OK

ggml_metal_free: deallocating
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK
$ bin/test-backend-ops -o IM2COL -b Metal
Testing 3 devices

Backend 1/3: Metal
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
  Device description: Apple M3 Max
  Device memory: 36864 MB (36858 MB free)

  IM2COL(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): not supported [Metal]
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[10,10,3,1],ne_kernel=[3,3,3,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f32,dst_type=f32,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): not supported [Metal]
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[3000,128,1,1],ne_kernel=[3,128,1280,1],s0=1,s1=0,p0=1,p1=0,d0=1,d1=0,is_2D=0): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): OK
  1615/1615 tests passed
  Backend Metal: OK

ggml_metal_free: deallocating
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK

performance test

I measured performance by adding perf tests with the patch below, which is not included in this PR:

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index d5ebd5cf..83f4d846 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -3718,6 +3718,15 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_perf() {
         }
     }

+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 1, 32}, {3, 3, 1, 32}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 2, 32}, {3, 3, 2, 32}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 1, 1024}, {3, 3, 1, 1024}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 2, 1024}, {3, 3, 2, 1024}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 1, 2048}, {3, 3, 1, 2048}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 2, 2048}, {3, 3, 2, 2048}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 1, 2560}, {3, 3, 1, 2560}, 1, 1, 1, 1, 1, 1, true));
+    test_cases.emplace_back(new test_im2col(GGML_TYPE_F32, GGML_TYPE_F16, GGML_TYPE_F16, {12, 12, 2, 2560}, {3, 3, 2, 2560}, 1, 1, 1, 1, 1, 1, true));
+
     return test_cases;
 }
full performance test log
bin/test-backend-ops perf -o IM2COL -b Metal
Testing 3 devices

Backend 1/3: Metal
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M3 Max
ggml_metal_init: picking default device: Apple M3 Max
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M3 Max
ggml_metal_init: GPU family: MTLGPUFamilyApple9  (1009)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction support   = true
ggml_metal_init: simdgroup matrix mul. support = true
ggml_metal_init: hasUnifiedMemory              = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 38654.71 MB
  Device description: Apple M3 Max
  Device memory: 36864 MB (36858 MB free)

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               278494 runs -     3.59 us/run -       99 kB/run -    0.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               204775 runs -     5.06 us/run -      199 kB/run -    1.50 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.89 us/run -     3186 kB/run -   20.29 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.29 us/run -     6372 kB/run -   21.06 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   142.90 us/run -     6372 kB/run -   21.26 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   283.61 us/run -    12744 kB/run -   21.43 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   175.71 us/run -     7965 kB/run -   21.62 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   350.47 us/run -    15930 kB/run -   21.68 GB/s
  Backend Metal: OK

ggml_metal_free: deallocating
Backend 2/3: BLAS
  Skipping
Backend 3/3: CPU
  Skipping
3/3 backends passed
OK

I hope this helps you. :)

@github-actions bot added the testing (Everything test related) label on Oct 18, 2024
@ggerganov self-requested a review on October 20, 2024, 16:03
@junhee-yoo changed the title from "ggml: Add POOL2D op and fix IM2COL in Metal backend for running MobileVLM_V2." to "ggml:metal Add POOL2D op and fix IM2COL in Metal backend for running MobileVLM_V2." on Oct 21, 2024
@ggerganov (Member) left a comment:

Well done!

Comment on lines 277 to 278
GGML_METAL_KERNEL_TYPE_AVG_POOL_2D_F32,
GGML_METAL_KERNEL_TYPE_MAX_POOL_2D_F32,
@ggerganov (Member):

Suggested change:
- GGML_METAL_KERNEL_TYPE_AVG_POOL_2D_F32,
- GGML_METAL_KERNEL_TYPE_MAX_POOL_2D_F32,
+ GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32,
+ GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32,

Comment on lines 725 to 726
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_AVG_POOL_2D_F32, avg_pool_2d_f32, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MAX_POOL_2D_F32, max_pool_2d_f32, true);
@ggerganov (Member):

Suggested change:
- GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_AVG_POOL_2D_F32, avg_pool_2d_f32, true);
- GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_MAX_POOL_2D_F32, max_pool_2d_f32, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, pool_2d_avg_f32, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, pool_2d_max_f32, true);


id<MTLComputePipelineState> pipeline = ctx->kernels[GGML_METAL_KERNEL_TYPE_IM2COL_F32].pipeline;
const uint64_t M = pipeline.maxTotalThreadsPerThreadgroup;

const bool is_gt_mttpt = ((size_t)(N * KH * KW)) > pipeline.maxTotalThreadsPerThreadgroup;
@ggerganov (Member):

Would it make sense to keep just the _ext variant of the kernel? Does the old kernel have a significant advantage?

@junhee-yoo (Contributor Author):

Summary: to keep backward compatibility and preserve performance for smaller computations, I suggest keeping the old kernel.

details

I did some investigation after your comment.
The new kernel (a.k.a. _ext) shows a performance degradation for small N:

$ test-backend-ops perf -o IM2COL -b Metal
NOTE: with the 3x3 kernel used here, N = 113 gives 9 x 113 = 1017 threads, which stays within the 1024-thread limit, while N = 114 gives 9 x 114 = 1026, which exceeds it.

original + _ext, M
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               229348 runs -     4.38 us/run -       99 kB/run -    0.77 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               196584 runs -     5.24 us/run -      199 kB/run -    1.51 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,113],ne_kernel=[3,3,1,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             147438 runs -     6.82 us/run -      351 kB/run -    2.73 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,113],ne_kernel=[3,3,2,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             106483 runs -     9.67 us/run -      703 kB/run -    5.34 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    18.79 us/run -      354 kB/run -    2.57 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    18.74 us/run -      354 kB/run -    2.58 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,200],ne_kernel=[3,3,1,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              49146 runs -    21.43 us/run -      622 kB/run -    4.61 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,200],ne_kernel=[3,3,2,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              32764 runs -    35.12 us/run -     1244 kB/run -    8.45 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.60 us/run -     3186 kB/run -   20.37 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.71 us/run -     6372 kB/run -   21.00 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   143.44 us/run -     6372 kB/run -   21.18 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   283.55 us/run -    12744 kB/run -   21.43 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   175.97 us/run -     7965 kB/run -   21.58 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   352.93 us/run -    15930 kB/run -   21.52 GB/s

only _ext, M - worst case.
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                73719 runs -    14.48 us/run -       99 kB/run -    0.73 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                49146 runs -    24.31 us/run -      199 kB/run -    1.30 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,113],ne_kernel=[3,3,1,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    17.90 us/run -      351 kB/run -    2.68 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,113],ne_kernel=[3,3,2,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              40955 runs -    28.89 us/run -      703 kB/run -    4.64 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    18.54 us/run -      354 kB/run -    2.61 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    18.06 us/run -      354 kB/run -    2.68 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,200],ne_kernel=[3,3,1,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    20.35 us/run -      622 kB/run -    4.17 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,200],ne_kernel=[3,3,2,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              32764 runs -    34.79 us/run -     1244 kB/run -    8.53 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.19 us/run -     3186 kB/run -   20.48 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.72 us/run -     6372 kB/run -   21.00 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   142.86 us/run -     6372 kB/run -   21.27 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   283.70 us/run -    12744 kB/run -   21.42 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   175.84 us/run -     7965 kB/run -   21.60 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   350.67 us/run -    15930 kB/run -   21.66 GB/s

only _ext, MIN(N, M)
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               131056 runs -     7.78 us/run -       99 kB/run -    0.76 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                90101 runs -    11.24 us/run -      199 kB/run -    1.54 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,113],ne_kernel=[3,3,1,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              90101 runs -    11.58 us/run -      351 kB/run -    2.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,113],ne_kernel=[3,3,2,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    17.85 us/run -      703 kB/run -    5.37 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              90101 runs -    11.80 us/run -      354 kB/run -    2.61 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              90101 runs -    11.72 us/run -      354 kB/run -    2.62 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,200],ne_kernel=[3,3,1,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              65528 runs -    16.90 us/run -      622 kB/run -    4.39 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,200],ne_kernel=[3,3,2,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              40955 runs -    28.84 us/run -     1244 kB/run -    8.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.75 us/run -     3186 kB/run -   20.32 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.31 us/run -     6372 kB/run -   21.06 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   143.02 us/run -     6372 kB/run -   21.25 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   286.72 us/run -    12744 kB/run -   21.20 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   176.62 us/run -     7965 kB/run -   21.51 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   351.21 us/run -    15930 kB/run -   21.63 GB/s

original + _ext, MIN(N, M) - best case IMO.
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,32],ne_kernel=[3,3,1,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               221157 runs -     4.53 us/run -       99 kB/run -    0.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,32],ne_kernel=[3,3,2,32],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               196584 runs -     5.14 us/run -      199 kB/run -    1.54 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,113],ne_kernel=[3,3,1,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             147438 runs -     6.80 us/run -      351 kB/run -    2.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,113],ne_kernel=[3,3,2,113],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             114674 runs -     9.23 us/run -      703 kB/run -    5.19 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              81910 runs -    12.44 us/run -      354 kB/run -    2.72 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,114],ne_kernel=[3,3,1,114],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              90101 runs -    11.55 us/run -      354 kB/run -    2.66 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,200],ne_kernel=[3,3,1,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              57337 runs -    17.50 us/run -      622 kB/run -    4.85 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,200],ne_kernel=[3,3,2,200],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              40955 runs -    28.67 us/run -     1244 kB/run -    8.28 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,1024],ne_kernel=[3,3,1,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    16382 runs -    74.53 us/run -     3186 kB/run -   20.38 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,1024],ne_kernel=[3,3,2,1024],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   144.56 us/run -     6372 kB/run -   21.02 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2048],ne_kernel=[3,3,1,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                    10532 runs -   143.35 us/run -     6372 kB/run -   21.20 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2048],ne_kernel=[3,3,2,2048],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     5266 runs -   283.69 us/run -    12744 kB/run -   21.42 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,1,2560],ne_kernel=[3,3,1,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8426 runs -   176.14 us/run -     7965 kB/run -   21.56 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f16,ne_input=[12,12,2,2560],ne_kernel=[3,3,2,2560],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     4214 runs -   351.41 us/run -    15930 kB/run -   21.62 GB/s

I guess this comes from wasted threads in the _ext kernel (threads that hit the tpitg_0 >= N case) when N is smaller than M (maxTotalThreadsPerThreadgroup), and it gets worse as the M - N gap grows.
Apple's official documentation describes a newer API that spreads threads to exactly fit the given grid (non-uniform threadgroup sizes). However, it is only supported from Metal3/Apple4 (see the Metal feature set tables under the keyword "nonuniform"), so I didn't pursue it because it would narrow the set of devices llama.cpp can run on.

I applied the "original + _ext, MIN(N, M)" variant to this PR, which was the best result from those investigations (sketched below). Thanks for your review; it helped me squeeze out more optimization. :-^)
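
For illustration, here is a rough, self-contained C++ sketch of that dispatch decision. The launch functions are hypothetical stand-ins (the real code encodes Metal kernels in ggml-metal.m), and the 1024-thread limit is an assumed value for maxTotalThreadsPerThreadgroup:

#include <algorithm>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for encoding the two Metal kernels; they only print.
static void launch_im2col_original(int64_t threads_per_threadgroup) {
    printf("original kernel, %lld threads per threadgroup\n", (long long) threads_per_threadgroup);
}

static void launch_im2col_ext(int64_t threads_per_threadgroup) {
    printf("_ext kernel, %lld threads per threadgroup\n", (long long) threads_per_threadgroup);
}

// "original + _ext, MIN(N, M)": keep the old kernel when the whole problem fits
// in one threadgroup, otherwise use the _ext kernel with a clamped group size.
static void dispatch_im2col(int64_t N, int64_t KH, int64_t KW, int64_t M /* maxTotalThreadsPerThreadgroup */) {
    if (N * KH * KW <= M) {
        // Small N: the original kernel fits in one threadgroup and avoids the
        // idle-thread overhead measured in the tables above.
        launch_im2col_original(N * KH * KW);
    } else {
        // Large N: the _ext kernel splits the work; clamping to MIN(N, M)
        // reduces the number of threads that only hit the tpitg_0 >= N early exit.
        launch_im2col_ext(std::min(N, M));
    }
}

int main() {
    dispatch_im2col( 113, 3, 3, 1024); // 9 * 113 = 1017  -> original kernel
    dispatch_im2col(2048, 3, 3, 1024); // 9 * 2048 = 18432 -> _ext kernel
    return 0;
}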

Comment on lines 6456 to 6469
device const float* src0,
device float* dst,
constant int32_t& k0,
constant int32_t& k1,
constant int32_t& s0,
constant int32_t& s1,
constant int32_t& p0,
constant int32_t& p1,
constant int64_t& IH,
constant int64_t& IW,
constant int64_t& OH,
constant int64_t& OW,
constant int64_t& parallel_elements,
uint gid[[thread_position_in_grid]]) {
@ggerganov (Member):

Suggested change:
- device const float* src0,
- device float* dst,
- constant int32_t& k0,
- constant int32_t& k1,
- constant int32_t& s0,
- constant int32_t& s1,
- constant int32_t& p0,
- constant int32_t& p1,
- constant int64_t& IH,
- constant int64_t& IW,
- constant int64_t& OH,
- constant int64_t& OW,
- constant int64_t& parallel_elements,
- uint gid[[thread_position_in_grid]]) {
+ device const float * src0,
+ device float * dst,
+ constant int32_t & k0,
+ constant int32_t & k1,
+ constant int32_t & s0,
+ constant int32_t & s1,
+ constant int32_t & p0,
+ constant int32_t & p1,
+ constant int64_t & IH,
+ constant int64_t & IW,
+ constant int64_t & OH,
+ constant int64_t & OW,
+ constant int64_t & parallel_elements,
+ uint gid[[thread_position_in_grid]]) {

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
- original IM2COL kernel + _ext with MIN()

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
Comment on lines 725 to 726
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, avg_pool_2d_f32, true);
GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, max_pool_2d_f32, true);
@ggerganov (Member):

The kernel names also need to be updated to follow the max-prefix naming convention:

Suggested change:
- GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, avg_pool_2d_f32, true);
- GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, max_pool_2d_f32, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_AVG_F32, pool_2d_avg_f32, true);
+ GGML_METAL_ADD_KERNEL(GGML_METAL_KERNEL_TYPE_POOL_2D_MAX_F32, pool_2d_max_f32, true);

@junhee-yoo (Contributor Author):

My bad. Applied in bb9949b.

@ggerganov (Member) left a comment:

Here are a few more minor suggestions to apply:

diff --git a/ggml/src/ggml-metal.m b/ggml/src/ggml-metal.m
index ed4d8326..0267b002 100644
--- a/ggml/src/ggml-metal.m
+++ b/ggml/src/ggml-metal.m
@@ -854,7 +854,6 @@ static bool ggml_metal_supports_op(const struct ggml_backend_metal_device_contex
         case GGML_OP_POOL_1D:
             return false;
         case GGML_OP_POOL_2D:
-            return true;
         case GGML_OP_UPSCALE:
         case GGML_OP_PAD:
         case GGML_OP_ARANGE:
@@ -2554,6 +2553,8 @@ static void ggml_metal_encode_node(
             } break;
         case GGML_OP_IM2COL:
             {
+                GGML_ASSERT(ggml_is_contiguous(src0));
+                GGML_ASSERT(ggml_is_contiguous(src1));
                 GGML_ASSERT(src0->type == GGML_TYPE_F16);
                 GGML_ASSERT(src1->type == GGML_TYPE_F32);
                 GGML_ASSERT( dst->type == GGML_TYPE_F16 || dst->type == GGML_TYPE_F32);
@@ -2620,7 +2621,7 @@ static void ggml_metal_encode_node(
                 [encoder setBytes:&d1      length:sizeof( int32_t) atIndex:12];
 
                 if (is_gt_mttpt) {
-                    [encoder setBytes:&N       length:sizeof(int32_t) atIndex:13];
+                    [encoder setBytes:&N        length:sizeof(int32_t) atIndex:13];
                     [encoder setBytes:&KH       length:sizeof(int32_t) atIndex:14];
                     [encoder setBytes:&KW       length:sizeof(int32_t) atIndex:15];
 
@@ -3034,9 +3035,10 @@ static void ggml_metal_encode_node(
             } break;
         case GGML_OP_POOL_2D:
             {
+                GGML_ASSERT(ggml_is_contiguous(src0));
                 GGML_ASSERT(src0t == GGML_TYPE_F32 && src0t == dstt);
 
-                const int32_t* opts = dst->op_params;
+                const int32_t * opts = dst->op_params;
                 enum ggml_op_pool op = opts[0];
 
                 id<MTLComputePipelineState> pipeline = nil;
@@ -3063,7 +3065,7 @@ static void ggml_metal_encode_node(
                 const int64_t IH = src0->ne[1];
                 const int64_t IW = src0->ne[0];
 
-                const int64_t N = dst->ne[3];
+                const int64_t N  = dst->ne[3];
                 const int64_t OC = dst->ne[2];
                 const int64_t OH = dst->ne[1];
                 const int64_t OW = dst->ne[0];
diff --git a/ggml/src/ggml-metal.metal b/ggml/src/ggml-metal.metal
index 1fb05cce..71b58be1 100644
--- a/ggml/src/ggml-metal.metal
+++ b/ggml/src/ggml-metal.metal
@@ -6479,15 +6479,16 @@ kernel void kernel_pool_2d_max_f32(
     const int cur_oh = idx % O_HW / OW;
     const int cur_ow = idx % O_HW % OW;
 
-    device const float* i_ptr = src0 + nc * I_HW;
-    device float* o_ptr = dst + nc * O_HW;
+    device const float * i_ptr = src0 + nc * I_HW;
+    device       float * o_ptr = dst  + nc * O_HW;
 
     const int start_h = cur_oh * s1 - p1;
-    const int bh = MAX(0, start_h);
+    const int bh = MAX(0,  start_h);
     const int eh = MIN(IH, start_h + k1);
     const int start_w = cur_ow * s0 - p0;
-    const int bw = MAX(0, start_w);
+    const int bw = MAX(0,  start_w);
     const int ew = MIN(IW, start_w + k0);
+
     float res = -INFINITY;
 
     for (int i = bh; i < eh; i += 1) {
@@ -6495,23 +6496,24 @@ kernel void kernel_pool_2d_max_f32(
             res = MAX(res, i_ptr[i * IW + j]);
         }
     }
+
     o_ptr[cur_oh * OW + cur_ow] = res;
 }
 
 kernel void kernel_pool_2d_avg_f32(
-        device const float* src0,
-        device       float* dst,
-        constant    int32_t& k0,
-        constant    int32_t& k1,
-        constant    int32_t& s0,
-        constant    int32_t& s1,
-        constant    int32_t& p0,
-        constant    int32_t& p1,
-        constant    int64_t& IH,
-        constant    int64_t& IW,
-        constant    int64_t& OH,
-        constant    int64_t& OW,
-        constant    int64_t& parallel_elements,
+        device  const float * src0,
+        device        float * dst,
+        constant    int32_t & k0,
+        constant    int32_t & k1,
+        constant    int32_t & s0,
+        constant    int32_t & s1,
+        constant    int32_t & p0,
+        constant    int32_t & p1,
+        constant    int64_t & IH,
+        constant    int64_t & IW,
+        constant    int64_t & OH,
+        constant    int64_t & OW,
+        constant    int64_t & parallel_elements,
         uint        gid[[thread_position_in_grid]]) {
 
     if (gid >= parallel_elements) {
@@ -6525,17 +6527,18 @@ kernel void kernel_pool_2d_avg_f32(
     const int cur_oh = idx % O_HW / OW;
     const int cur_ow = idx % O_HW % OW;
 
-    device const float* i_ptr = src0 + nc * I_HW;
-    device float* o_ptr = dst + nc * O_HW;
+    device const float * i_ptr = src0 + nc * I_HW;
+    device       float * o_ptr = dst  + nc * O_HW;
 
     const int start_h = cur_oh * s1 - p1;
-    const int bh = MAX(0, start_h);
+    const int bh = MAX(0,  start_h);
     const int eh = MIN(IH, start_h + k1);
     const int start_w = cur_ow * s0 - p0;
-    const int bw = MAX(0, start_w);
+    const int bw = MAX(0,  start_w);
     const int ew = MIN(IW, start_w + k0);
     // const float scale = 1. / ((eh - bh) * (ew - bw));
     const float scale = 1. / (k0 * k1);
+
     float res = 0;
 
     for (int i = bh; i < eh; i += 1) {
@@ -6544,5 +6547,6 @@ kernel void kernel_pool_2d_avg_f32(
             res += cur * scale;
         }
     }
+
     o_ptr[cur_oh * OW + cur_ow] = res;
 }

@junhee-yoo (Contributor Author):

Here are a few more minor suggestions to apply:
(the rest omitted)

applied in 746e79e. 👍

@ggerganov merged commit 4c9388f into ggml-org:master on Oct 23, 2024
52 checks passed
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* add pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix im2col and add unittest for N>=1024

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* add tests for N % 1024 != 0

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* remove trailing whitespaces

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply suggestions

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply more optimization

- original IM2COL kernel + _ext with MIN()

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review: change kernel name of pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix more formatting and enhance readability

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

---------

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
Labels: testing (Everything test related)
Projects: none yet
2 participants