Data Distribution Adjustment
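import datasets
import pandas as pd

# Load the PubMedQA "pqa_artificial" configuration and convert the train split to a DataFrame.
data_source = "qiaojin/PubMedQA"
dataset = datasets.load_dataset(data_source, "pqa_artificial", streaming=False)
train_data = dataset["train"].to_pandas()

# Keep only rows labeled "yes" or "no", then split them by class.
binary_data = train_data[train_data["final_decision"].isin(["yes", "no"])]
yes_data = binary_data[binary_data["final_decision"] == "yes"]
no_data = binary_data[binary_data["final_decision"] == "no"]

# Downsample both classes to the size of the smaller one.
min_size = min(len(yes_data), len(no_data))
yes_sampled = yes_data.sample(n=min_size, random_state=42)
no_sampled = no_data.sample(n=min_size, random_state=42)

# Concatenate, then shuffle and reset the index so the two classes are interleaved.
balanced_data = pd.concat([yes_sampled, no_sampled])
balanced_data = balanced_data.sample(frac=1, random_state=42).reset_index(drop=True)

# Check the resulting label counts.
print(balanced_data["final_decision"].value_counts())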
New Label Distribution
final_decision
no     15125
yes    15125
Name: count, dtype: int64
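The balancing steps above can also be wrapped in a small reusable helper. The sketch below is one possible way to do it, not part of the original notebook: the function name `balance_by_downsampling` and the toy DataFrame are hypothetical, and it uses pandas' `groupby(...).sample(...)` instead of slicing each class by hand, so it works for any number of classes.

```python
import pandas as pd


def balance_by_downsampling(df: pd.DataFrame, label_col: str, seed: int = 42) -> pd.DataFrame:
    """Downsample every class in `label_col` to the size of the rarest class."""
    # Size of the smallest class; cast to a plain int for the sampler.
    min_size = int(df[label_col].value_counts().min())
    # Draw `min_size` rows from each class without replacement.
    sampled = df.groupby(label_col).sample(n=min_size, random_state=seed)
    # Shuffle so the classes are interleaved, then rebuild a clean index.
    return sampled.sample(frac=1, random_state=seed).reset_index(drop=True)


# Hypothetical toy data standing in for the PubMedQA frame: 7 "yes" rows, 3 "no" rows.
toy = pd.DataFrame(
    {
        "question": [f"q{i}" for i in range(10)],
        "final_decision": ["yes"] * 7 + ["no"] * 3,
    }
)

balanced_toy = balance_by_downsampling(toy, "final_decision")
counts = balanced_toy["final_decision"].value_counts().to_dict()
print(counts)
```

On the toy data each class ends up with three rows. Applied to `binary_data` with `label_col="final_decision"`, the helper would follow the same downsample-then-shuffle pattern as the cell above, although the exact rows drawn may differ from the per-class `sample` calls.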