AI & ML interests

Repo for all open-source affinity datasets from A-Alpha Bio

A-Alpha Datasets

Protein-protein interactions (PPIs) are fundamental to countless biological processes. One of the most informative biophysical properties of a PPI is its binding affinity: how strongly the two proteins interact. Yet, despite its importance, publicly available affinity data remains limited, constraining the development and benchmarking of protein modeling methods.

Our high-throughput yeast mating assay, AlphaSeq, enables the quantitative measurement of PPIs at scale (often generating libraries of up to 1M interactions per experiment!).

To help bridge the gap between experimental affinity measurements and computational protein models, we’re open-sourcing a selection of our datasets.

Each dataset captures the results of a yeast mating experiment between two protein libraries—one of binders and one of targets. Detailed experimental context and metadata are provided in the accompanying data cards.

Dataset schema

Every dataset will contain the following columns:

  • mata_description: Description of the A-library protein; usually a VHH or scFv
  • mata_sequence: Sequence of the A-library protein
  • matalpha_description: Description of the Alpha-library protein; usually an antigen
  • matalpha_sequence: Sequence of the Alpha-library protein
  • alphaseq_affinity: Log10 Kd affinity score for the pair of sequences. Lower values indicate stronger binding
  • alphaseq_affinity_lower_bound: Lower bound of the expected affinity
  • alphaseq_affinity_upper_bound: Upper bound of the expected affinity
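
As a quick sanity check before analysis, the column list above can be compared against a file's header. This is a minimal sketch that assumes the data has been exported to CSV (the actual file format may differ); `missing_columns` is a hypothetical helper and the example row is made up, not a real measurement.

```python
import csv
import io

# Column names as listed in the schema above.
EXPECTED_COLUMNS = [
    "mata_description",
    "mata_sequence",
    "matalpha_description",
    "matalpha_sequence",
    "alphaseq_affinity",
    "alphaseq_affinity_lower_bound",
    "alphaseq_affinity_upper_bound",
]


def missing_columns(header):
    """Return the expected columns that are absent from a header row."""
    present = set(header)
    return [name for name in EXPECTED_COLUMNS if name not in present]


# Made-up example content standing in for a real file (assumption: CSV export).
toy_csv = io.StringIO(
    ",".join(EXPECTED_COLUMNS) + "\n"
    "binder_1,QVQLVES,antigen_1,MKTAYIA,2.1,1.9,2.3\n"
)
reader = csv.DictReader(toy_csv)
rows = list(reader)
print(missing_columns(reader.fieldnames))
```

An empty list means every expected column is present; any names returned point to columns that should be checked against the dataset card.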

Datasets Available

Please see the subsets in opendata

FAQ

Please see below for some clarifying details:

What kind of sequences are in the library?

While not a strict rule, the A-libraries typically contain designed sequences, while the Alpha-libraries contain the corresponding targets of interest. Historically, we’ve used VHHs or scFvs in the A-library and antigen targets in the Alpha-library. Each dataset has a card that gives the details of the individual assay run.

When building or training models, note that PPIs can generally be treated as symmetric. However, members of the same library may share sequence, functional, or structural similarities. Some models are also sensitive to input order, so make sure that (A, Alpha) pairs are ordered the same way in training and testing.
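
One way to keep input order consistent is to route every row through a single encoding function. The sketch below is illustrative only: `encode_pair`, `both_orders`, and the `|` separator are hypothetical choices, not part of any released tooling, and whether to add the swapped ordering depends on the model.

```python
def encode_pair(row, separator="|"):
    """Join a pair in a fixed (A, Alpha) order so train and test inputs match."""
    return f"{row['mata_sequence']}{separator}{row['matalpha_sequence']}"


def both_orders(row, separator="|"):
    """Return the (A, Alpha) and (Alpha, A) encodings, treating the PPI as symmetric."""
    forward = encode_pair(row, separator)
    backward = f"{row['matalpha_sequence']}{separator}{row['mata_sequence']}"
    return [forward, backward]


# Made-up sequences for illustration.
example_row = {"mata_sequence": "QVQLVES", "matalpha_sequence": "MKTAYIA"}
print(encode_pair(example_row))
print(both_orders(example_row))
```

Using `encode_pair` everywhere fixes the order; `both_orders` is an optional augmentation for order-sensitive models when the symmetric treatment is appropriate.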

Why are there duplicate PPIs in the dataset?

Some datasets include technical replicates, often for the wild-type (“WT”) or parent sequence in mutation studies. Replicates help capture the experimental and biological variation in measured affinities. This is useful for analyses that assess the statistical significance of observed affinity differences, such as determining how much a variant changes binding strength relative to its parent protein.
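
Replicates can be summarized by grouping rows on the sequence pair. The following is a rough sketch, assuming replicates appear as rows with identical `mata_sequence` and `matalpha_sequence` values and that affinities parse as floats; `summarize_replicates` is a hypothetical helper and the values are made up.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def summarize_replicates(rows):
    """Group rows by (A, Alpha) sequence pair and summarize non-NaN affinities."""
    groups = defaultdict(list)
    for row in rows:
        value = float(row["alphaseq_affinity"])
        if not math.isnan(value):  # skip pairs without a point estimate
            key = (row["mata_sequence"], row["matalpha_sequence"])
            groups[key].append(value)
    return {
        pair: {
            "n": len(values),
            "mean": mean(values),
            # A standard deviation needs at least two replicates.
            "sd": stdev(values) if len(values) > 1 else None,
        }
        for pair, values in groups.items()
    }


# Made-up rows: three replicates of a parent and one measured variant.
rows = [
    {"mata_sequence": "WT", "matalpha_sequence": "AG", "alphaseq_affinity": "2.0"},
    {"mata_sequence": "WT", "matalpha_sequence": "AG", "alphaseq_affinity": "2.2"},
    {"mata_sequence": "WT", "matalpha_sequence": "AG", "alphaseq_affinity": "2.1"},
    {"mata_sequence": "V1", "matalpha_sequence": "AG", "alphaseq_affinity": "3.0"},
    {"mata_sequence": "V1", "matalpha_sequence": "AG", "alphaseq_affinity": "nan"},
]
summary = summarize_replicates(rows)
print(summary)
```

The spread across parent replicates gives a rough sense of how large a variant's shift must be before it stands out from assay noise.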

What is considered a strong or good binder?

Affinity measurements are reported as log10 Kd. Lower values indicate stronger binding. In practice, we often compare relative affinities: for example, assessing how binding strength changes as a target interface is mutated, or comparing variant binders to their parent.
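
Because the scores are on a log10 scale, a difference between two scores corresponds to a fold change in Kd, independent of the Kd units. A small sketch with made-up values (the function names are hypothetical):

```python
def delta_log10_kd(variant_log10_kd, parent_log10_kd):
    """Difference in log10 Kd; positive means the variant binds more weakly."""
    return variant_log10_kd - parent_log10_kd


def kd_fold_change(variant_log10_kd, parent_log10_kd):
    """Ratio Kd(variant) / Kd(parent) implied by the two log10 Kd scores."""
    return 10 ** delta_log10_kd(variant_log10_kd, parent_log10_kd)


# Made-up scores: parent at 2.0, variant at 3.0.
print(delta_log10_kd(3.0, 2.0))
print(kd_fold_change(3.0, 2.0))
```

With these example numbers the variant's Kd is ten times the parent's, which means ten-fold weaker binding.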

Why does the dataset have NaN values in the affinity column?

Not all PPIs form detectable interactions; weak or non-binding pairs may yield NaN values. In these cases, the lower and upper bound affinities may be more useful, since they help interpret the range of affinities that is possible within the assay.
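
One possible way to handle this in code is to separate rows with a point estimate from rows where only the bounds are informative. This is a sketch under assumptions: cells are parsed from text, empty cells are treated as NaN, and the labels `measured` and `no_point_estimate` are made up for illustration. How to interpret the bounds for a given assay is described in the dataset card, not here.

```python
import math


def to_float(value):
    """Parse a cell as float; empty or missing cells become NaN (an assumption)."""
    if value is None or str(value).strip() == "":
        return math.nan
    return float(value)


def none_if_nan(value):
    """Replace NaN with None so missing bounds are easy to spot."""
    return None if math.isnan(value) else value


def describe_affinity(row):
    """Return the point estimate when present, otherwise the reported bounds."""
    affinity = to_float(row.get("alphaseq_affinity"))
    if not math.isnan(affinity):
        return {"status": "measured", "log10_kd": affinity}
    return {
        "status": "no_point_estimate",
        "lower": none_if_nan(to_float(row.get("alphaseq_affinity_lower_bound"))),
        "upper": none_if_nan(to_float(row.get("alphaseq_affinity_upper_bound"))),
    }


# Made-up rows for illustration.
measured_row = {"alphaseq_affinity": "2.5",
                "alphaseq_affinity_lower_bound": "2.3",
                "alphaseq_affinity_upper_bound": "2.7"}
undetected_row = {"alphaseq_affinity": "nan",
                  "alphaseq_affinity_lower_bound": "4.0",
                  "alphaseq_affinity_upper_bound": ""}
print(describe_affinity(measured_row))
print(describe_affinity(undetected_row))
```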

How should I cite this dataset?

Please cite: A-Alpha Bio (2025). Open Protein–Protein Interaction Affinity Datasets. https://huggingface.co/aalphabio

Can I use this dataset for model training or benchmarking?

Yes — the dataset is released fully open source, and is suitable for both academic and commercial use.

Who can I contact with questions or feedback?

Feel free to leave issues on the individual dataset cards.
