---
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- fbroy/talk2ref
language: en
library_name: transformers
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- talk2ref
- speech-to-text
- sentence-embedding
- SBERT
---

# 🗣️ Talk2Ref Query Talk Encoder

This model encodes **scientific talks** (transcripts, titles, and years) into dense vector representations, designed for **Reference Prediction from Talks (RPT)** — the task of retrieving relevant cited papers for a given talk.  
It was trained as part of the [Talk2Ref dataset](https://huggingface.co/datasets/s8frbroy/talk2ref) project.

The model forms the **query-side encoder** in a **dual-encoder (DPR-style)** setup, paired with the [Talk2Ref Cited Paper Encoder](https://huggingface.co/s8frbroy/talk2ref_ref_key_cited_paper_encoder).
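
To illustrate how the two encoders could be combined at retrieval time, here is a minimal sketch that ranks candidate papers by similarity to a talk embedding. The toy 3-dimensional vectors and the `rank_papers` helper are hypothetical stand-ins for real encoder outputs, and scoring by dot product is an assumption based on the DPR-style setup rather than a documented detail of this model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def rank_papers(talk_embedding, paper_embeddings, top_k=2):
    """Return the ids of the top_k papers, highest similarity first."""
    scored = [(dot(talk_embedding, emb), pid) for pid, emb in paper_embeddings.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:top_k]]


# Toy embeddings (hypothetical values, not produced by the real encoders)
talk_vec = [0.9, 0.1, 0.0]
paper_vecs = {
    "paper_a": [0.8, 0.2, 0.1],
    "paper_b": [0.0, 0.1, 0.9],
    "paper_c": [0.5, 0.5, 0.0],
}

top = rank_papers(talk_vec, paper_vecs, top_k=2)
print(top)
```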

---

## 🎯 Usage

Example with `transformers`:

```python
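import torch
from transformers import AutoModel, AutoTokenizer

# Load model and tokenizer (assumes the repo ships tokenizer files; otherwise
# use the base model "sentence-transformers/all-MiniLM-L6-v2")
model_id = "s8frbroy/talk2ref_query_talk_encoder"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

# Example input
title = "Attention Is All You Need"
year = 2017
query_text = (
    f"The following presentation is about the paper of the title: '{title}'. "
    f"Published in {year}. "
    "In this talk, we introduce the Transformer architecture and discuss its impact on sequence modeling."
)

# Tokenize, truncating to the 512-token maximum sequence length
inputs = tokenizer([query_text], padding=True, truncation=True, max_length=512, return_tensors="pt")

# Compute embedding: mean-pool token embeddings, ignoring padding
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

print(embedding.shape)  # (1, hidden_dim)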
```
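
The query string in the example follows a fixed prefix pattern built from the title and year. A small helper like the one below keeps that formatting consistent across talks. The name `build_query` is hypothetical, and the sketch assumes the prefix shown in the example is the format expected at inference time.

```python
def build_query(title: str, year: int, transcript: str) -> str:
    """Prepend the title/year prefix used in the example to a talk transcript."""
    prefix = (
        f"The following presentation is about the paper of the title: '{title}'. "
        f"Published in {year}. "
    )
    return prefix + transcript.strip()


query_text = build_query(
    "Attention Is All You Need",
    2017,
    "In this talk, we introduce the Transformer architecture.",
)
print(query_text)
```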

---

## 🧩 Model Overview

| Property | Description |
|-----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Mean pooling |
| **Max sequence length** | 512 tokens |
| **Training data** | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| **Objective** | Contrastive binary (DPR-style) loss |
| **Task** | Encode talks (transcripts, titles, and years) into a semantic space shared with cited papers |

---
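The mean pooling listed in the table above can be sketched in plain Python for a single sequence: token vectors are averaged, and padded positions (mask value 0) are skipped. The function name `mean_pool` and the toy values are illustrative assumptions; a real pipeline would apply the same idea to batched tensors.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            total = [t + x for t, x in zip(total, vec)]
            count += 1
    # Guard against an all-padding sequence to avoid division by zero
    return [t / max(count, 1) for t in total]


# Three token vectors; the last one is padding and must not affect the result
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
print(pooled)
```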


## Citation

If you use this model or the Talk2Ref dataset, please cite the following paper:

```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title        = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author       = {Frederik Broy and Maike Züfle and Jan Niehues},
  year         = {2025},
  eprint       = {2510.24478},
  archivePrefix= {arXiv},
  primaryClass = {cs.CL},
  url          = {https://arxiv.org/abs/2510.24478}
}
```