# MERT-v1-95M for Apple Silicon
MERT-v1-95M converted to Core ML and MLX-ready safetensors for native inference on Apple Silicon.
The original model by m-a-p is a music audio transformer (HuBERT architecture, 95M parameters). It produces 768-dimensional embeddings from audio.
No Apple Silicon-ready version of this model existed. This repo provides two formats:
- Core ML (`.mlpackage`, float16, 180 MB) -- for Core ML deployment on macOS/iOS
- MLX safetensors (float16, 180 MB) -- for MLX Swift or MLX Python on macOS
Both formats produce identical output to PyTorch (cosine similarity > 0.999 across all 13 layers).
## Model details

| Property | Value |
|---|---|
| Architecture | HuBERT base (7 CNN layers + 12 transformer encoder layers) |
| Parameters | 95M |
| Input | Audio samples, float32, 24 kHz, zero-mean unit-variance normalized |
| Output | 13 hidden states, each mean-pooled to (768,) |
| Production layer | Layer 6 (best balance of low-level and high-level features for similarity) |
| Original repo | m-a-p/MERT-v1-95M |
| License | CC-BY-NC-4.0 (inherited from the original) |
## Benchmarks (M1 MacBook Pro, 30 s audio window)
| Runtime | Time per track | Notes |
|---|---|---|
| MLX Swift (GPU) | 0.336s | Release build |
| Core ML (GPU) | 0.40s | CPU_AND_GPU compute units |
| PyTorch MPS | 0.78s | Reference baseline |
MLX is 16% faster than Core ML and 57% faster than PyTorch MPS on the same hardware.
## Files

```
coreml/
  MERT-v1-95M.mlpackage/    -- Core ML model (float16)
mlx/
  MERT-v1-95M.safetensors   -- MLX-ready weights (float16, conv transposed, weight_norm fused)
scripts/
  convert_coreml.py         -- PyTorch to Core ML conversion script
  convert_mlx.py            -- PyTorch to MLX safetensors conversion script
  benchmark_coreml.py       -- Benchmark script for Core ML vs PyTorch
```
## Usage: Core ML (Python)

```python
import coremltools as ct
import numpy as np

model = ct.models.MLModel("coreml/MERT-v1-95M.mlpackage")

# Preprocess audio: 24 kHz, zero-mean, unit-variance (see Preprocessing below)
audio = load_and_preprocess_audio("track.m4a")  # shape (1, N)

prediction = model.predict({"input_values": audio})
hidden_states = prediction["hidden_states"]  # shape (13, 768)

# Use layer 6 for similarity
embedding = hidden_states[6]  # (768,)
embedding = embedding / np.linalg.norm(embedding)
```
## Usage: MLX (Python)

```python
import mlx.core as mx
from safetensors import safe_open

with safe_open("mlx/MERT-v1-95M.safetensors", framework="numpy") as f:
    weights = {k: mx.array(f.get_tensor(k)) for k in f.keys()}

# Build the model and run inference (see scripts/ for the full implementation)
```
## Preprocessing

The audio preprocessing is NOT included in either model. It is the standard Wav2Vec2FeatureExtractor pipeline:

- Load audio at 24 kHz (resample if needed)
- Take a window (e.g., the middle 30 seconds)
- Normalize to zero mean, unit variance over the whole window

This is trivial to implement in any framework (numpy, Accelerate, vDSP), and keeping it separate makes the model a clean, reusable artifact.
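For reference, here is a minimal numpy sketch of that pipeline, matching the `load_and_preprocess_audio` placeholder used in the Core ML example above. It assumes librosa for decoding and resampling; any loader that yields float32 mono at 24 kHz works the same way:

```python
import librosa
import numpy as np

def load_and_preprocess_audio(path: str, window_s: float = 30.0) -> np.ndarray:
    sr = 24_000
    audio, _ = librosa.load(path, sr=sr, mono=True)  # decode and resample to 24 kHz
    # Take the middle window of the track
    n = int(window_s * sr)
    if len(audio) > n:
        start = (len(audio) - n) // 2
        audio = audio[start:start + n]
    # Zero-mean, unit-variance normalization (Wav2Vec2FeatureExtractor style)
    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)
    return audio[np.newaxis, :].astype(np.float32)  # shape (1, N)
```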
## Conversion details
Core ML: Converted from the PyTorch checkpoint using coremltools. The model uses ML Program format with float16 precision. All 13 hidden states are returned as a stacked tensor, mean-pooled over the time dimension. The conversion script handles the CNN feature extractor's group normalization and weight normalization layers.
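The full script is `scripts/convert_coreml.py`; the sketch below shows the rough shape of the conversion call. The `Wrapper` module is illustrative (it stacks the 13 hidden states and mean-pools over time), not the exact code:

```python
import coremltools as ct
import torch
from transformers import AutoModel

class Wrapper(torch.nn.Module):
    """Illustrative: stack all 13 hidden states and mean-pool over time."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_values):
        out = self.model(input_values, output_hidden_states=True)
        return torch.stack(out.hidden_states).mean(dim=2).squeeze(1)  # (13, 768)

base = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True)
example = torch.randn(1, 24_000 * 30)  # 30 s of 24 kHz audio
traced = torch.jit.trace(Wrapper(base).eval(), example)

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_values", shape=example.shape)],
    convert_to="mlprogram",                  # ML Program format
    compute_precision=ct.precision.FLOAT16,  # float16 weights and ops
)
mlmodel.save("coreml/MERT-v1-95M.mlpackage")
```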
MLX safetensors: Converted from the PyTorch checkpoint with two transformations: (1) Conv1d weight transposition (PyTorch uses out/in/kernel, MLX uses out/kernel/in), (2) weight_norm fusion (multiplied weight_v by weight_g/norm and dropped the separate vectors). The result is 209 tensors in float16, loadable directly by MLX.
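Both transformations are a few lines of numpy. A sketch (function names are mine; the real key handling lives in `scripts/convert_mlx.py`):

```python
import numpy as np

def fuse_weight_norm(weight_g: np.ndarray, weight_v: np.ndarray, dim: int = 0) -> np.ndarray:
    # PyTorch weight_norm: weight = g * v / ||v||, where the norm runs over
    # every axis except `dim` (the axis along which g varies).
    axes = tuple(i for i in range(weight_v.ndim) if i != dim)
    norm = np.sqrt((weight_v ** 2).sum(axis=axes, keepdims=True))
    return weight_g * weight_v / norm

def to_mlx_conv1d(weight: np.ndarray) -> np.ndarray:
    # PyTorch Conv1d weight: (out_channels, in_channels, kernel_size)
    # MLX Conv1d weight:     (out_channels, kernel_size, in_channels)
    return np.ascontiguousarray(weight.transpose(0, 2, 1))
```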
## Validation
Both formats were validated against the PyTorch reference implementation:
- Core ML: cosine similarity > 0.999 for all 13 hidden state layers
- MLX Swift: cosine similarity 1.000000 for layer 6 (the production layer)
To validate, run the same audio through both the PyTorch original and the converted model, then compare cosine similarity of the layer 6 embeddings.
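A minimal version of that check (variable names are illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# ref and converted: (13, 768) hidden-state stacks from the PyTorch original
# and the converted model, computed on the same preprocessed audio window.
# cosine_similarity(ref[6], converted[6])  # expect > 0.999
```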
## Attribution
This is a format conversion of MERT-v1-95M by the m-a-p team. The original model is licensed under CC-BY-NC-4.0. This conversion inherits the same license.
If you use this model, please cite the original work:
```bibtex
@article{li2024mert,
  title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Chen, Xingran and Yin, Hanzhi and Lin, Chenghua and Ragni, Anton and Benetos, Emmanouil and Gyenge, Norbert and others},
  journal={arXiv preprint arXiv:2306.00107},
  year={2024}
}
```