MERT-v1-95M for Apple Silicon

MERT-v1-95M converted to Core ML and MLX-ready safetensors for native inference on Apple Silicon.

The original model by m-a-p is a music audio transformer (HuBERT architecture, 95M parameters). It produces 768-dimensional embeddings from audio.

No Apple Silicon-ready version of this model existed. This repo provides two formats:

  • Core ML (.mlpackage, float16, 180MB) -- for Core ML deployment on macOS/iOS
  • MLX safetensors (float16, 180MB) -- for MLX Swift or MLX Python on macOS

Both formats produce identical output to PyTorch (cosine similarity > 0.999 across all 13 layers).

Model details

Architecture HuBERT base (7 CNN layers + 12 transformer encoder layers)
Parameters 95M
Input Audio samples, float32, 24kHz, zero-mean unit-variance normalized
Output 13 hidden states, each mean-pooled to (768,)
Production layer Layer 6 (best balance of low-level and high-level features for similarity)
Original repo m-a-p/MERT-v1-95M
License CC-BY-NC-4.0 (inherited from original)

Benchmarks (M1 MacBook Pro, 30s audio window)

Runtime            Time per track   Notes
MLX Swift (GPU)    0.336s           Release build
Core ML (GPU)      0.40s            CPU_AND_GPU compute units
PyTorch MPS        0.78s            Reference baseline

MLX is 16% faster than Core ML and 57% faster than PyTorch MPS on the same hardware.

Files

coreml/
  MERT-v1-95M.mlpackage/     -- Core ML model (float16)

mlx/
  MERT-v1-95M.safetensors    -- MLX-ready weights (float16, conv transposed, weight_norm fused)

scripts/
  convert_coreml.py           -- PyTorch to Core ML conversion script
  convert_mlx.py              -- PyTorch to MLX safetensors conversion script
  benchmark_coreml.py         -- Benchmark script for Core ML vs PyTorch

Usage: Core ML (Python)

import coremltools as ct
import numpy as np

model = ct.models.MLModel("coreml/MERT-v1-95M.mlpackage")

# Preprocess audio: 24kHz, zero-mean, unit-variance
audio = load_and_preprocess_audio("track.m4a")  # your own helper (see Preprocessing below); shape (1, N), float32

prediction = model.predict({"input_values": audio})
hidden_states = prediction["hidden_states"]  # shape (13, 768)

# Use layer 6 for similarity
embedding = hidden_states[6]  # (768,)
embedding = embedding / np.linalg.norm(embedding)

Usage: MLX (Python)

import mlx.core as mx
from safetensors import safe_open

with safe_open("mlx/MERT-v1-95M.safetensors", framework="numpy") as f:
    weights = {k: mx.array(f.get_tensor(k)) for k in f.keys()}

# Build model and run inference (see scripts/ for full implementation)
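
A quick sanity check on the load above, against the tensor count and dtype described under Conversion details:

assert len(weights) == 209                                   # fused/transposed tensors
assert all(v.dtype == mx.float16 for v in weights.values())  # stored in float16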

Preprocessing

The audio preprocessing is NOT included in either model. It is the standard Wav2Vec2FeatureExtractor pipeline:

  1. Load audio at 24kHz (resample if needed)
  2. Take a window (e.g., middle 30 seconds)
  3. Normalize to zero mean, unit variance per sample

This is trivial to implement in any framework (numpy, Accelerate/vDSP), and keeping it separate makes the model a clean, reusable artifact.
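
A minimal numpy sketch of this pipeline, matching the load_and_preprocess_audio helper name used in the Core ML example above. It assumes librosa for decoding and resampling; any audio loader works.

import numpy as np
import librosa  # assumed here only for decoding + resampling

def load_and_preprocess_audio(path, sr=24000, window_s=30):
    # 1. Load mono audio at 24 kHz (librosa resamples on load)
    audio, _ = librosa.load(path, sr=sr, mono=True)

    # 2. Take the middle window (truncate only; shorter tracks are used as-is)
    n = window_s * sr
    if len(audio) > n:
        start = (len(audio) - n) // 2
        audio = audio[start:start + n]

    # 3. Zero mean, unit variance (Wav2Vec2FeatureExtractor-style, with its small epsilon)
    audio = (audio - audio.mean()) / np.sqrt(audio.var() + 1e-7)

    return audio[np.newaxis, :].astype(np.float32)  # shape (1, N)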

Conversion details

Core ML: Converted from the PyTorch checkpoint using coremltools. The model uses ML Program format with float16 precision. All 13 hidden states are returned as a stacked tensor, mean-pooled over the time dimension. The conversion script handles the CNN feature extractor's group normalization and weight normalization layers.
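A condensed sketch of that conversion, assuming the Hugging Face checkpoint and a small wrapper module (MERTWrapper is a name used only here) that stacks the 13 hidden states and mean-pools over time; scripts/convert_coreml.py is the authoritative version and also handles the group-norm/weight-norm details and tracing quirks.

import torch
import coremltools as ct
from transformers import AutoModel

class MERTWrapper(torch.nn.Module):
    """Return all 13 hidden states, mean-pooled over time -> (13, 768)."""
    def __init__(self, model):
        super().__init__()
        self.model = model
    def forward(self, input_values):
        out = self.model(input_values, output_hidden_states=True)
        return torch.stack(out.hidden_states).mean(dim=2).squeeze(1)

base = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True).eval()
traced = torch.jit.trace(MERTWrapper(base), torch.randn(1, 24000 * 30))

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_values", shape=(1, 24000 * 30))],
    outputs=[ct.TensorType(name="hidden_states")],
    convert_to="mlprogram",                  # ML Program format
    compute_precision=ct.precision.FLOAT16,  # float16 weights/activations
)
mlmodel.save("coreml/MERT-v1-95M.mlpackage")
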

MLX safetensors: Converted from the PyTorch checkpoint with two transformations: (1) Conv1d weight transposition (PyTorch uses out/in/kernel, MLX uses out/kernel/in), (2) weight_norm fusion (multiplied weight_v by weight_g/norm and dropped the separate vectors). The result is 209 tensors in float16, loadable directly by MLX.
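
A condensed sketch of those two transformations, assuming the Hugging Face checkpoint layout and weight_norm applied with dim=2 on the positional conv (as in HF HuBERT); scripts/convert_mlx.py is the authoritative version.

import mlx.core as mx
import torch
from transformers import AutoModel

state = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True).state_dict()

converted = {}
for name, t in state.items():
    if name.endswith("weight_g") or name.endswith("weight_v"):
        continue  # handled below as a fused pair
    w = t.detach()
    # (1) Conv1d weights: PyTorch (out, in, kernel) -> MLX (out, kernel, in)
    if "conv" in name and w.ndim == 3:
        w = w.permute(0, 2, 1)
    converted[name] = w

# (2) weight_norm fusion: w = weight_v * weight_g / ||weight_v||,
#     with the norm taken over every dim except the weight_norm dim (assumed dim=2)
for name in state:
    if name.endswith("weight_g"):
        base = name[: -len("weight_g")]
        g = state[name].detach()
        v = state[base + "weight_v"].detach()
        norm = v.norm(p=2, dim=(0, 1), keepdim=True)
        converted[base + "weight"] = (v * (g / norm)).permute(0, 2, 1)  # fuse, then transpose for MLX

mx.save_safetensors(
    "mlx/MERT-v1-95M.safetensors",
    {k: mx.array(v.numpy()).astype(mx.float16) for k, v in converted.items()},
)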

Validation

Both formats were validated against the PyTorch reference implementation:

  • Core ML: cosine similarity > 0.999 for all 13 hidden state layers
  • MLX Swift: cosine similarity 1.000000 for layer 6 (the production layer)

To validate, run the same audio through both the PyTorch original and the converted model, then compare cosine similarity of the layer 6 embeddings.
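
A minimal sketch of that check for the Core ML model, reusing the load_and_preprocess_audio helper from the Preprocessing section; the PyTorch side assumes the Hugging Face checkpoint with trust_remote_code.

import numpy as np
import torch
import coremltools as ct
from transformers import AutoModel

audio = load_and_preprocess_audio("track.m4a")  # (1, N) float32, see Preprocessing

# PyTorch reference: layer 6 hidden state, mean-pooled over time
ref_model = AutoModel.from_pretrained("m-a-p/MERT-v1-95M", trust_remote_code=True).eval()
with torch.no_grad():
    out = ref_model(torch.from_numpy(audio), output_hidden_states=True)
ref = out.hidden_states[6].mean(dim=1).squeeze(0).numpy()  # (768,)

# Converted model: same layer from the stacked (13, 768) output
mlmodel = ct.models.MLModel("coreml/MERT-v1-95M.mlpackage")
conv = mlmodel.predict({"input_values": audio})["hidden_states"][6]  # (768,)

cos = np.dot(ref, conv) / (np.linalg.norm(ref) * np.linalg.norm(conv))
print(f"layer 6 cosine similarity: {cos:.6f}")  # expect > 0.999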

Attribution

This is a format conversion of MERT-v1-95M by the m-a-p team. The original model is licensed under CC-BY-NC-4.0. This conversion inherits the same license.

If you use this model, please cite the original work:

@article{li2024mert,
  title={MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training},
  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Chen, Xingran and Yin, Hanzhi and Lin, Chenghua and Ragni, Anton and Benetos, Emmanouil and Gyenge, Norbert and others},
  journal={arXiv preprint arXiv:2306.00107},
  year={2024}
}