---
base_model: sentence-transformers/all-MiniLM-L6-v2
datasets:
- fbroy/talk2ref
language: en
library_name: transformers
license: cc-by-4.0
pipeline_tag: feature-extraction
tags:
- scientific-retrieval
- dense-passage-retrieval
- dual-encoder
- talk2ref
- speech-to-text
- sentence-embedding
- SBERT
---

# 🗣️ Talk2Ref Query Talk Encoder

This model encodes **scientific talks** (transcripts, titles, and years) into dense vector representations. It is designed for **Reference Prediction from Talks (RPT)**: the task of retrieving the papers a given talk cites.

It was trained as part of the [Talk2Ref dataset](https://huggingface.co/datasets/s8frbroy/talk2ref) project.
The model is the **query-side encoder** in a **dual-encoder (DPR-style)** setup and is paired with the [Talk2Ref Cited Paper Encoder](https://huggingface.co/s8frbroy/talk2ref_ref_key_cited_paper_encoder).

---

## 🎯 Usage

Example with `transformers`:

```python
from transformers import AutoModel
import torch

# Load model
model = AutoModel.from_pretrained("s8frbroy/talk2ref_query_talk_encoder")

# Example input
title = "Attention Is All You Need"
year = 2017
query_text = (
    f"The following presentation is about the paper of the title: '{title}'. Published in {year}. "
    "In this talk, we introduce the Transformer architecture and discuss its impact on sequence modeling."
)

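# Optional sketch: `build_query_text` is a hypothetical helper, not part of the
# released code. It assumes the title/year prefix used in the example input
# above is the expected query format and applies it to any transcript.
def build_query_text(title: str, year: int, transcript: str) -> str:
    """Prefix a talk transcript with the title/year template shown above."""
    return (
        f"The following presentation is about the paper of the title: '{title}'. "
        f"Published in {year}. {transcript}"
    )

# Usage (illustrative): query_text = build_query_text(title, year, transcript)
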
# Compute embedding
with torch.no_grad():
    embedding = model([query_text])

print(embedding.shape)  # (1, hidden_dim)
```

---

## 🧩 Model Overview

| Property | Description |
|-----------|-------------|
| **Architecture** | Sentence-BERT (all-MiniLM-L6-v2 backbone) |
| **Pooling** | Mean pooling |
| **Max sequence length** | 512 tokens |
| **Training data** | Talk2Ref dataset (≈ 43 k cited papers linked to 6 k talks) |
| **Objective** | Contrastive binary (DPR-style) loss |
| **Task** | Encode talk transcripts into a semantic space shared with cited papers |

---

## Citation

If you use this model or the Talk2Ref dataset, please cite the following paper:

```bibtex
@misc{broy2025talk2refdatasetreferenceprediction,
  title         = {Talk2Ref: A Dataset for Reference Prediction from Scientific Talks},
  author        = {Frederik Broy and Maike Züfle and Jan Niehues},
  year          = {2025},
  eprint        = {2510.24478},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2510.24478}
}
```