URL: https://huggingface.co/datasets/ACSci/v3-eval-judge-gpt-oss-20b

ACSci/v3-eval-judge-gpt-oss-20b · Datasets at Hugging Face

Dataset Viewer


paper_id (string, 221 classes)	title (string, 221 classes)	model (string, 1 class)	strategy (string, 6 classes)	pair_id (string, 37 classes)	variant_id (int64, 1 to 3)	style (string, 5 classes)	query (string, length 176 to 1.08k)	response (string, length 714 to 33.6k)	finished (bool, 1 class)	response_chars (int64, 714 to 33.6k)	score (float64, 0 to 1)	n_criteria (int64, 0 to 11)	n_yes (int64, 0 to 10)	criterion_scores (string, length 2 to 3.14k)
0FJYicpOj0	Gaussian certified unlearning in high dimensions: A hypothesis testing approach	azure_openai_gpt-5_4	direction_to_idea	p001	3	conversational	I'm trying to improve the efficiency of approximate unlearning for models where $p \sim n$. Right now, the state-of-the-art by Zou et al. (2025) needs at least two Newton steps to remove a data point because they use standard $\epsilon$-certifiability, which forces them to add massive amounts of Laplace noise. Since hi...	Yes. A promising way to do this is to replace the worst-case, coordinatewise notion behind standard \(\epsilon\)-certifiability with a distributional notion tailored to the fact that, in the \(p \sim n\) regime, the one-step parameter perturbation from removing a point is often well-approximated by a high-dimensional G...	true	20,831	0.7778	9	7	[{"criterion_id": 2, "judgment": "yes", "evidence": "Section 3.2 introduces a \"Gaussian certifiability\" framework, replacing standard \u03b5-certifiability with a Gaussian\u2011based definition."}, {"criterion_id": 3, "judgment": "yes", "evidence": "The response states: \"standard \u03b5\u2011certifiability ... force...
0FJYicpOj0	Gaussian certified unlearning in high dimensions: A hypothesis testing approach	azure_openai_gpt-5_4	literature_to_gap	p002	1	concise	Recent theoretical work on machine unlearning, such as Newton-based methods by Guo et al. and Sekhari et al., and gradient-based methods by Allouah et al., provides privacy guarantees for data removal. However, what is the critical gap or implicit assumption in these theoretical frameworks when applied to modern AI mod...	The critical gap is that these theoretical unlearning guarantees typically rely on a classical asymptotic regime that is badly mismatched to modern overparameterized models. More concretely, the implicit assumption is: ## Core hidden assumption They assume learning occurs in a regime where the **number of model p...	true	4,740	0.6	10	6	[{"criterion_id": 1, "judgment": "yes", "evidence": "They assume learning occurs in a regime where the number of model parameters is fixed or grows much more slowly than the number of training samples."}, {"criterion_id": 2, "judgment": "yes", "evidence": "In modern models, especially deep networks and large foundation...
0FJYicpOj0	Gaussian certified unlearning in high dimensions: A hypothesis testing approach	azure_openai_gpt-5_4	problem_gap_to_method	p003	3	conversational	I'm working on machine unlearning for high-dimensional models. The standard approach of exact retraining is too expensive, and existing approximate methods rely on low-dimensional assumptions that don't hold when parameters scale with the dataset size. If we want to build a randomized update procedure that relies on th...	A good way to think about this is: - You have a trained model parameter vector \(\hat\theta\) obtained on dataset \(S\). - You want to “delete” a subset \(U \subset S\), producing a new parameter vector that should behave like the retrained solution on \(S \setminus U\), but without full retraining. - In high dimensio...	true	19,569	1	10	10	[{"criterion_id": 1, "judgment": "yes", "evidence": "The goal is to construct a randomized map ... that is close in utility to retrained model ..."}, {"criterion_id": 2, "judgment": "yes", "evidence": "You asked specifically for formal metrics for both privacy protection and accuracy. In high dimensions, deterministic ...
0FJYicpOj0	Gaussian certified unlearning in high dimensions: A hypothesis testing approach	azure_openai_gpt-5_4	method_to_experiments	p004	2	detailed	I am working on a machine unlearning algorithm that avoids full retraining by updating the original estimator $\hat{\beta}$ using a randomized procedure with Gaussian noise $\mathbf{b}$. The goal is to satisfy $(\phi, \epsilon)$-Gaussian certifiability for privacy while maintaining accuracy, measured by Generalization ...	Below is a concrete experimental blueprint to show that a Gaussian-perturbation unlearning update yields better utility than a Laplace-perturbation baseline under matched certifiability guarantees, while avoiding full retraining. --- ## 1. Core claim to test You want to demonstrate: 1. Privacy/certifiability: y...	true	24,450	0.5	10	5	[{"criterion_id": 1, "judgment": "yes", "evidence": "The response explicitly defines GED in Section 4 and states that experiments will be run on both synthetic logistic regression tasks and real\u2011world datasets to evaluate GED."}, {"criterion_id": 2, "judgment": "yes", "evidence": "Section 2 lists both a Gaussian\u...
0FJYicpOj0	Gaussian certified unlearning in high dimensions: A hypothesis testing approach	azure_openai_gpt-5_4	results_to_analysis	p006	1	concise	In our unlearning experiments with large dimensions ($p=1000$ to $10000$), Gaussian noise consistently achieves lower Generalization Error Difference (GED) and Unlearning Error Difference (UED) than eps-delta and Laplace noise. To formally prove our theoretical results in this high-dimensional setting, what standard as...	For high-dimensional theoretical guarantees of unlearning with \(p \in [10^3,10^4]\), the standard route is to place assumptions on: 1. the feature distribution \(x \in \mathbb{R}^p\), 2. the linear predictor / generalized linear model, 3. the loss curvature and regularization, 4. the **scaling of norms an...	true	15,121	0.8	10	8	[{"criterion_id": 1, "judgment": "no", "evidence": "The response lists many assumptions (feature distribution, model, loss, scaling, noise) rather than stating two main assumptions."}, {"criterion_id": 2, "judgment": "yes", "evidence": "The response explicitly lists \"feature distribution\" as one of the key assumption...
0FJYicpOj0	Gaussian certified unlearning in high dimensions: A hypothesis testing approach	azure_openai_gpt-5_4	results_to_analysis	p007	1	concise	Our experiments show that Gaussian perturbation outperforms other unlearning estimators in the high-dimensional proportional regime where $p \sim n$. How can we mathematically demonstrate that standard optimization assumptions, like simultaneous $O(1)$-smoothness and $\Omega(1)$-strong convexity of the per-example loss...	Yes. In the proportional high-dimensional regime \(p \asymp n\), one can make this failure mathematically explicit by looking at the Hessian geometry of standard empirical risks. The key point is that the usual assumptions - per-example loss is \(L\)-smooth with \(L=O(1)\), and - empirical or population objective is \...	true	11,442	0.1111	9	1	[{"criterion_id": 1, "judgment": "yes", "evidence": "The response states: \"cannot both hold uniformly, once the feature dimension grows proportionally with sample size\" and that simultaneous O(1)-smoothness and \u03a9(1)-strong convexity fail in this regime."}, {"criterion_id": 3, "judgment": "no", "evidence": "not f...
0FJYicpOj0	Gaussian certified unlearning in high dimensions: A hypothesis testing approach	azure_openai_gpt-5_4	mixed_context	p008	1	concise	We have demonstrated that a Gaussian-perturbed one-step Newton estimator achieves vanishing Generalization Error Divergence in high dimensions for generalized linear models, with empirical unlearning batch sizes scaling as $m^3 = o(n)$. What are the most promising theoretical and practical directions to extend this unl...	Below are directions that seem most promising if you want to extend a Gaussian-perturbed one-step Newton unlearning framework beyond the current “high-dimensional GLM + vanishing GED + empirical \(m^3=o(n)\)” regime. I’ll organize them by what is likely to matter theoretically, algorithmically, and empirically. --- #...	true	22,740	0.6	10	6	[{"criterion_id": 1, "judgment": "yes", "evidence": "Section 3: \"Extend from GLMs to broader model classes\" lists general smooth M\u2011estimators, penalized estimators, kernel methods, and deep linear networks."}, {"criterion_id": 2, "judgment": "no", "evidence": "not found"}, {"criterion_id": 3, "judgment": "no", "...
0FJYicpOj0	Gaussian certified unlearning in high dimensions: A hypothesis testing approach	azure_openai_gpt-5_4	mixed_context	p009	2	detailed	I'm researching approximate machine unlearning in the high-dimensional proportional regime where the number of parameters $p$ is comparable to the sample size $n$. Existing theoretical guarantees for Newton-based unlearning rely on standard $\Omega(1)$ strong convexity and $O(1)$ smoothness assumptions, which break dow...	A useful way to attack this is to change the notion of certifiability so it matches what Newton-style unlearning actually produces in the proportional regime. The core issue is: - In the high-dimensional regime \(p \asymp n\), the Hessian is not uniformly well-conditioned in the classical \(\mu I \preceq \nabla^2 ...	true	21,757	0.4	5	2	[{"criterion_id": 1, "judgment": "no", "evidence": "The response discusses Gaussian certifiability but does not mention an epsilon\u2011Gaussian framework."}, {"criterion_id": 3, "judgment": "no", "evidence": "The response does not reference the hypothesis\u2011testing interpretation of differential privacy or the conv...
0jHyEKHDyx	Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention	azure_openai_gpt-5_4	direction_to_idea	p001	3	conversational	I've been looking into the persistent issue where BF16 flash attention causes training to suddenly diverge and explode after a few thousand steps. It seems like the community relies heavily on patches like QK-clip or higher precision, but I want to understand exactly why this happens at the arithmetic level. Can you br...	Yes. The short version is: BF16 flash attention can blow up because attention is an exponentially sensitive computation sitting on top of low-precision dot products and low-precision reductions. A small multiplicative or additive error in the logits can become a large relative error after softmax, and if that er...	true	17,234	0.3	10	3	[{"criterion_id": 1, "judgment": "no", "evidence": "not found"}, {"criterion_id": 2, "judgment": "yes", "evidence": "Fix B: Keep the softmax state in FP32, not BF16"}, {"criterion_id": 3, "judgment": "yes", "evidence": "softmax exponentiates these noisy logits ... This biases attention toward accidental winners"}, {"cr...
0jHyEKHDyx	Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention	azure_openai_gpt-5_4	problem_gap_to_method	p002	3	conversational	"We are trying to push our transformer training to lower precisions for efficiency, specifically usi(...TRUNCATED)	"A principled way to think about this is:\n\n**the failure is not fundamentally “large QK values(...TRUNCATED)	true	15,110	0	10	0	"[{\"criterion_id\": 1, \"judgment\": \"no\", \"evidence\": \"not found\"}, {\"criterion_id\": 2, \"(...TRUNCATED)
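
In every preview row above, `score` equals `n_yes / n_criteria` rounded to four decimals (for example, 7 of 9 gives 0.7778 and 1 of 9 gives 0.1111). The following is a minimal Python sketch for recomputing the three summary columns from `criterion_scores`. It assumes that this relationship holds for all rows and that `criterion_scores` is always a JSON list of objects with a `judgment` field, as in the preview. The function name `recompute_score` and the 0.0 fallback for rows without criteria are illustrative assumptions, not part of the dataset.

```python
import json


def recompute_score(criterion_scores: str) -> tuple[int, int, float]:
    """Return (n_criteria, n_yes, score) from a criterion_scores JSON string."""
    criteria = json.loads(criterion_scores)
    n_criteria = len(criteria)
    n_yes = sum(1 for c in criteria if c.get("judgment") == "yes")
    # Assumption: a row with no criteria scores 0.0 (no such row is visible in the preview).
    score = round(n_yes / n_criteria, 4) if n_criteria else 0.0
    return n_criteria, n_yes, score


# Hypothetical, shortened value in the shape shown in the preview rows.
example = json.dumps([
    {"criterion_id": 1, "judgment": "yes", "evidence": "..."},
    {"criterion_id": 2, "judgment": "no", "evidence": "not found"},
    {"criterion_id": 3, "judgment": "yes", "evidence": "..."},
])
print(recompute_score(example))
```

The `criterion_scores` cells in the preview are truncated, so this check is only meaningful on the full dataset values. A mismatch between the recomputed and stored columns would indicate that one of the assumptions above does not hold.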

End of preview.

README.md exists but content is empty.

Downloads last month: 292