Applications
Showing new listings for Friday, 3 April 2026
- arXiv:2604.01491 [pdf, html, other]
Title: Opponent-Adjusted Evaluation of NFL Pass Blocking and Pass Rushing Performance
Comments: 14 pages, 3 figures, 5 tables. Code available at this https URL
Subjects: Applications (stat.AP)
Evaluating offensive linemen and pass rushers at the player level is difficult because observable outcomes are sparse, opponent-dependent, and strongly shaped by surrounding context. Using 2021 regular-season Hudl tracking data, we construct a blocker-rusher interaction dataset and estimate two ridge-regularized Bradley-Terry paired-comparison models: a binary win/loss model aligned with the 2.5-second pass block win-rate definition and a four-class severity model over loss, win, hit, and sack, with both models incorporating a double-team indicator. The final dataset contains 153,138 interactions across 33,283 pass plays in 266 games. On an ordered 80/20 holdout split (test n = 30,628), both models improve on global baselines and modestly outperform stronger matchup baselines under log-loss evaluation, corresponding to relative log-loss reductions of about 0.24% to 1.21%. Game-level bootstrap resampling indicates that these gains are most stable for the win model and for the severity model relative to the global baseline, while the severity-versus-matchup comparison remains directionally positive but less certain. External comparison to 2021 AP All-Pro selections provides additional face validation on the learned rankings, with the severity model showing the strongest alignment to expert recognition. Overall, ridge-regularized Bradley-Terry models provide an interpretable opponent-adjusted framework for evaluating NFL pass protection and pass rush at the interaction level.
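The binary win/loss model described in this abstract can be illustrated with a minimal sketch. It assumes a logistic Bradley-Terry form in which the probability that the blocker wins is a sigmoid of blocker strength minus rusher strength plus a double-team term, fitted by plain gradient descent with a ridge penalty. The function name, toy data, penalty scaling, and optimizer are illustrative assumptions, not the authors' code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ridge_bradley_terry(rows, n_blockers, n_rushers, ridge=1.0, lr=0.5, steps=2000):
    """Fit P(blocker wins) = sigmoid(b[i] - r[j] + gamma * double_team).

    rows: list of (blocker_id, rusher_id, double_team, blocker_won) tuples.
    Minimizes mean log-loss plus (ridge / (2n)) * (sum b^2 + sum r^2)
    by gradient descent; gamma is left unpenalized (an assumption).
    """
    b = [0.0] * n_blockers
    r = [0.0] * n_rushers
    gamma = 0.0
    n = len(rows)
    for _ in range(steps):
        # Start each gradient from the ridge penalty term.
        grad_b = [ridge * v / n for v in b]
        grad_r = [ridge * v / n for v in r]
        grad_gamma = 0.0
        for i, j, d, y in rows:
            err = (sigmoid(b[i] - r[j] + gamma * d) - y) / n
            grad_b[i] += err
            grad_r[j] -= err
            grad_gamma += err * d
        b = [v - lr * g for v, g in zip(b, grad_b)]
        r = [v - lr * g for v, g in zip(r, grad_r)]
        gamma -= lr * grad_gamma
    return b, r, gamma

# Hypothetical toy data: blocker 0 wins 9 of 10 reps against each rusher,
# blocker 1 wins 3 of 10, plus 5 double-team reps that blocker 1 wins.
toy = []
for rusher in (0, 1):
    toy += [(0, rusher, 0, 1)] * 9 + [(0, rusher, 0, 0)] * 1
    toy += [(1, rusher, 0, 1)] * 3 + [(1, rusher, 0, 0)] * 7
toy += [(1, 0, 1, 1)] * 5

b, r, gamma = fit_ridge_bradley_terry(toy, n_blockers=2, n_rushers=2)
print("blocker strengths:", b)
print("rusher strengths:", r)
print("double-team effect:", gamma)
```

The ridge penalty shrinks all strengths toward zero, which also resolves the usual Bradley-Terry identifiability issue (adding a constant to every strength leaves the win probabilities unchanged).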
- arXiv:2604.01735 [pdf, html, other]
Title: Correlation analysis of the dispersion of SARS-CoV-2 in Mexico
Comments: 8 pages, 6 figures
Subjects: Applications (stat.AP)
In this paper, we propose a method to analyze correlations in pandemic-related data across different geographical regions, relying on the analysis of correlations for non-stationary time series, which are typical of pandemic data. Unlike traditional epidemiological approaches focused on medical and modeling perspectives during a pandemic, our method emphasizes post-pandemic analysis to assess how societal responses, such as lockdowns, travel restrictions, mobility patterns, and vaccination campaigns, manifest in the collective behavior of regions. These insights can inform future public health strategies and enhance understanding of the complex dynamics underlying pandemic spread and control.
- arXiv:2604.02074 [pdf, html, other]
Title: Country-wide, high-resolution monitoring of forest browning with Sentinel-2
Authors: Samantha Biegel, David Brüggemann, Francesco Grossi, Michele Volpi, Konrad Schindler, Benjamin D. Stocker
Comments: 9 pages, 7 figures, to be published in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences (ISPRS Congress)
Subjects: Applications (stat.AP); Computer Vision and Pattern Recognition (cs.CV)
Natural and anthropogenic disturbances are impacting the health of forests worldwide. Monitoring forest disturbances at scale is important to inform conservation efforts. Here, we present a scalable approach for country-wide mapping of forest greenness anomalies at the 10 m resolution of Sentinel-2. Using relevant ecological and topographical context and an established representation of the vegetation cycle, we learn a predictive quantile model of the normalised difference vegetation index (NDVI) derived from Sentinel-2 data. The resulting expected seasonal cycles are used to detect NDVI anomalies across Switzerland between April 2017 and August 2025. Goodness-of-fit evaluations show that the conditional model explains 65% of the observed variations in the median seasonal cycle. The model consistently benefits from the local context information, particularly during the green-up period. The approach produces coherent spatial anomaly patterns and enables country-wide quantification of forest browning. Case studies with independent reference data from known events illustrate that the model reliably detects different types of disturbances.
- arXiv:2604.02187 [pdf, html, other]
Title: Possible, Yes; Ignorant, Perhaps: A Scorecard for Possibilistic Forecasts
Comments: 11 figures; 7 sections; 19 pages on PDF as-is
Subjects: Applications (stat.AP); Atmospheric and Oceanic Physics (physics.ao-ph)
Probabilistic forecasts must sum to unity and cannot express "I don't know." Possibility theory relaxes this constraint: a subnormal distribution explicitly measures how much of the plausibility budget remains unassigned, an ignorance signal that probability cannot represent. This paper develops a verification framework for such forecasts, centred on a five-number scorecard that separately diagnoses whether the forecast pointed at the right outcome (depth-of-truth), how sharply (diffuseness, support margin), how confidently (ignorance), and how dominantly (conditional necessity). A possibility-to-probability conversion preserves ignorance for familiar frequency-based scoring; categorical threshold scores (POD, FAR, CSI, etc.) connect to operational practice. Together, these three complementary facets -- possibilistic, probabilistic, and categorical -- expose failure modes invisible to any single metric. Storm Prediction Center convective outlook categories serve as the running example throughout; a synthetic reforecast demonstrates diagnostic visualisations and scorecard interpretation. Ignorance is better expressed than repressed.
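To make the idea of a subnormal possibility distribution concrete, here is a small sketch. It uses textbook possibility-theory definitions as assumptions: ignorance is taken as one minus the largest possibility value, and the necessity of an event is computed after rescaling by that maximum. The paper's own scorecard definitions may differ, and the forecast values below are invented for illustration.

```python
def ignorance(pi):
    """Unassigned plausibility: 1 - max possibility (0 for a normal distribution)."""
    return 1.0 - max(pi.values())

def conditional_necessity(pi, event):
    """Necessity of `event` after rescaling the distribution by its maximum.

    N(A) = 1 - max_{x not in A} pi(x) / max_x pi(x); an assumed textbook form.
    """
    top = max(pi.values())
    if top == 0.0:
        return 0.0  # a fully ignorant forecast supports nothing
    outside = [v for k, v in pi.items() if k not in event]
    return 1.0 - (max(outside) / top if outside else 0.0)

# Hypothetical possibilities over three convective outlook categories.
forecast = {"MRGL": 0.2, "SLGT": 0.6, "ENH": 0.3}

print("ignorance:", ignorance(forecast))
print("necessity of SLGT:", conditional_necessity(forecast, {"SLGT"}))
```

Here the largest possibility is 0.6, so 0.4 of the plausibility budget is left unassigned, something a probability vector that sums to one cannot encode.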
New submissions (showing 4 of 4 entries)
- arXiv:2604.01501 (cross-list from stat.ME) [pdf, html, other]
Title: Identifying and Estimating Causal Direct Effects Under Unmeasured Confounding
Authors: Philippe Boileau, Nima S. Hejazi, Ivana Malenica, Peter B. Gilbert, Sandrine Dudoit, Mark J. van der Laan
Subjects: Methodology (stat.ME); Applications (stat.AP); Machine Learning (stat.ML); Other Statistics (stat.OT)
Causal mediation analysis provides techniques for defining and estimating effects that may be endowed with mechanistic interpretations. With many scientific investigations seeking to address mechanistic questions, causal direct and indirect effects have garnered much attention. The natural direct and indirect effects, the most widely used among such causal mediation estimands, are limited in their practical utility due to stringent identification requirements. Accordingly, considerable effort has been invested in developing alternative direct and indirect effect decompositions with relaxed identification requirements. Such efforts often yield effect definitions with nuanced and challenging interpretations. By contrast, relatively limited attention has been paid to relaxing the identification assumptions of the natural direct and indirect effects. Motivated by a secondary aim of a recent non-randomized vaccine prospective cohort study (NCT05168813), we present a set of relaxed conditions under which the natural direct effect is identifiable in spite of unobserved baseline confounding of the exposure-mediator pathway; we use this result to investigate the effect mediated by putative immune correlates of protection. Relaxing the commonly used but restrictive cross-world counterfactual independence assumption, we discuss strategies for evaluating the natural direct effect in non-randomized settings that arise in the analysis of vaccine studies. We revisit prior studies of semi-parametric efficiency theory to demonstrate the construction of flexible, multiply robust estimators of the natural direct effect and discuss efficient estimation strategies that do not place restrictive modeling assumptions on nuisance functions.
- arXiv:2604.02238 (cross-list from cs.CY) [pdf, other]
Title: Generative AI Spotlights the Human Core of Data Science: Implications for Education
Subjects: Computers and Society (cs.CY); Artificial Intelligence (cs.AI); Applications (stat.AP)
Generative AI (GAI) reveals an irreducible human core at the center of data science: advances in GAI should sharpen, rather than diminish, the focus on human reasoning in data science education. GAI can now execute many routine data science workflows, including cleaning, summarizing, visualizing, modeling, and drafting reports. Yet the competencies that matter most remain irreducibly human: problem formulation, measurement and design, causal identification, statistical and computational reasoning, ethics and accountability, and sensemaking. Drawing on Donoho's Greater Data Science framework, Nolan and Temple Lang's vision of computational literacy, and the McLuhan-Culkin insight that we shape our tools and thereafter our tools shape us, this paper traces the emergence of data science through three converging lineages: Tukey's intellectual vision of data analysis as a science, the commercial logic of surveillance capitalism that created industrial demand for data scientists, and the academic programs that followed. Mapping GAI's impact onto Donoho's six divisions of Greater Data Science shows that computing with data (GDS3) has been substantially automated, while data gathering, preparation, and exploration (GDS1) and science about data science (GDS6) still require essential human input. The educational implication is that data science curricula should focus on this human core while teaching students how to contribute effectively within iterative prompt-output-prompt cycles using retrieval-augmented generation, and that learning outcomes and assessments should explicitly evaluate reasoning and judgment.
- arXiv:2604.02286 (cross-list from stat.ME) [pdf, html, other]
Title: Bayesian covariance regression for differential network analysis of zero-inflated microbiome data
Subjects: Methodology (stat.ME); Applications (stat.AP)
Microbial interaction networks can rewire in response to host and environmental factors, yet most existing methods for network estimation treat the covariance structure as static across samples. We propose TRECOR, a Bayesian covariance regression framework for inferring covariate-dependent microbial covariation networks from zero-inflated compositional count data. The method models microbiome counts through a latent multivariate normal distribution defined on the internal nodes of a phylogenetic tree, where both the mean and covariance of the latent variables depend on covariates. The covariance is decomposed into a sparse baseline component, representing a stable microbial covariation network, and a low-rank covariate-dependent perturbation that captures network rewiring. By exploiting the binomial factorization of the multinomial distribution under the logistic-tree-normal representation, the model achieves full conjugacy and posterior inference proceeds via an efficient Gibbs sampler. In simulations, TRECOR substantially outperforms covariance regression applied to transformed counts, demonstrating the importance of explicitly modeling the compositional sampling layer. Applied to gut microbiome data from 531 individuals across three countries, we find that age has the largest effect on microbial covariation, which is a pattern not revealed by mean-based analysis alone. The age-associated differential network is enriched for Enterobacteriaceae and related families, consistent with known developmental shifts in the gut microbiota, while country-associated differential networks implicate diet-related taxa.
Cross submissions (showing 3 of 3 entries)
- arXiv:2509.12533 (replaced) [pdf, html, other]
Title: Transporting Predictions via Double Machine Learning: Predicting Partially Unobserved Students' Outcomes
Authors: Falco J. Bargagli-Stoffi, Emma Landry, Kevin P. Josey, Kenneth De Beckker, Joana E. Maldonado, Kristof De Witte
Comments: arXiv admin note: substantial text overlap with arXiv:2102.04382
Subjects: Applications (stat.AP); Methodology (stat.ME)
Educational policymakers often lack data on student outcomes where standardized tests were not administered. Machine learning can predict unobserved outcomes in target populations using source population data. However, covariate distribution differences between populations reduce model transportability, potentially decreasing predictive accuracy and introducing bias. We propose using double machine learning for covariate-shift weighted models. First, we estimate overlap scores -- the probability an observation belongs to the source dataset given covariates. Second, balancing weights, defined as density ratios of target-to-source membership probabilities, reweight individual observations' contributions to the loss function in target outcome prediction models. This downweights source observations less similar to the target population, allowing predictions to rely more on observations with greater overlap. Consequently, predictions become more transportable under covariate shift. We illustrate this framework using student standardized financial literacy scores (FLS) data. Using Bayesian Additive Regression Trees (BART), we predict missing FLS. We find minimal predictive performance differences between weighted and unweighted models, suggesting limited covariate shift in our setting. Nonetheless, our approach provides a principled framework for addressing covariate shift and is broadly applicable to predictive modeling in social and health sciences, where source-target population differences are common.
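The two-step weighting described in this abstract can be sketched as follows. The sketch assumes the overlap score e(x) is the estimated probability of source membership and that the balancing weight is the ratio (1 - e(x)) / e(x). The clipping constant, function names, and toy numbers are illustrative assumptions, and a weighted squared-error loss stands in for the BART models used in the paper.

```python
def balancing_weights(overlap_scores, eps=1e-6):
    """Turn P(source | x) into target-to-source weights (1 - e) / e.

    Scores are clipped to [eps, 1 - eps] to avoid division by zero
    (the clipping is an assumption of this sketch).
    """
    weights = []
    for e in overlap_scores:
        e = min(max(e, eps), 1.0 - eps)
        weights.append((1.0 - e) / e)
    return weights

def weighted_mse(y_true, y_pred, weights):
    """Squared-error loss in which each source observation counts by its weight."""
    total = sum(weights)
    return sum(w * (t - p) ** 2 for w, t, p in zip(weights, y_true, y_pred)) / total

# Hypothetical overlap scores for three source observations.
scores = [0.5, 0.8, 0.2]
weights = balancing_weights(scores)
loss = weighted_mse([1.0, 2.0, 3.0], [1.0, 1.0, 3.0], weights)

print("weights:", weights)
print("weighted MSE:", loss)
```

An observation with overlap score 0.8 looks much more like the source than the target, so it is downweighted to about 0.25, while one with score 0.2 is upweighted to about 4.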
- arXiv:2509.15379 (replaced) [pdf, html, other]
Title: A Single Index Approach to Integrated Species Distribution Modeling for Fisheries Abundance Data
Subjects: Applications (stat.AP)
In fisheries ecology, species abundance data are often collected by multiple surveys, each with unique characteristics. This article is motivated by a dataset of Atlantic sea scallop abundance records along the northeast coast of the United States, collected from two bottom trawl surveys which cover a larger spatial domain but have low catch efficiency, and a dredge survey which is more efficient but more bounded in domain. Over the past decade, integrated species distribution models (ISDMs) that include common environmental effects along with correlated survey-specific spatial fields have been used to incorporate information from multiple surveys. While flexible, ISDMs can be susceptible to overfitting, which can complicate interpretability of the shared environmental effects, and potentially lead to poor predictive performance. To overcome these drawbacks, we introduce a novel single index ISDM, built from a single index (with spatial random effects) that represents a latent measure of the true species distribution, and survey-specific catch efficiency functions which map the single index to the survey-specific expected catch. In this article, these functions are constructed via logistic functions or semiparametric spline-based functions. Simulations and application to the motivating sea scallop abundance data demonstrate that the proposed single index ISDM offers more meaningful interpretations of the environmental effects and survey catch efficiency differences, while achieving similar to or better predictive performance than existing ISDMs.
- arXiv:2509.22714 (replaced) [pdf, html, other]
Title: Vaccinating Now or Vaccinating Later: Separating Pull-Forward and Net Effects Using a Dynamic Regression Discontinuity Design
Authors: Fabio I. Martinenghi, Mesfin Genie, Katie Attwell, Huong Le, Hannah Moore, Aregawi G. Gebremariam, Bette Liu, Francesco Paolucci, Christopher C. Blyth
Subjects: Applications (stat.AP)
We study the impact of a novel COVID-19 vaccine mandate, targeting graduating high-school students, on first vaccine uptake. In 2021, the State Government of Western Australia (WA) required attendees at "Leavers" -- a large-scale state-supported graduation party held annually in November in a WA regional town -- to be vaccinated. Using administrative data that link date-of-birth (at the month level), school attendance, and first-dose vaccination records, we exploit the strict school-age laws in WA to run regression discontinuity designs (RDDs). In other words, we use the date-of-birth cutoff for starting compulsory schooling in WA to build the counterfactual vaccination outcomes for Year-12 (i.e. graduating) students. We run both static and dynamic RDDs, the latter consisting of daily RDD estimations in a one-year window centred around the policy deadline in November 2021. We find that the "Leavers mandate" -- which excluded unvaccinated Year-12 students from popular post-graduation events -- raised vaccination rates by 9.3 percentage points at the mandate deadline. The dynamic RDD estimates show that this effect is entirely due to pulling forward future vaccinations by 46-80 days, with no net increase in ultimate uptake. Our paper is the first to disentangle "pull-forward" (intensive margin) versus "net" (extensive margin) effects of a vaccine mandate in a pandemic context -- meaning that we identify how much the mandate made eventually-vaccinated people anticipate their vaccination, and how much it induced vaccinations that would not have happened absent the mandate. We also bring new evidence on the efficacy of time-limited non-monetary incentives for accelerating vaccination campaigns. Keywords: mandate; vaccination; incentives; uptake; adolescents; timing; coverage. JEL: I12; I18.
- arXiv:2602.08083 (replaced) [pdf, html, other]
Title: A Unified Server Quality Metric for Tennis
Comments: 21 pages, published in Journal of Sports Analytics. Code available at this https URL
Subjects: Applications (stat.AP)
Traditional tennis rating systems (e.g., Elo) summarize overall player strength but do not isolate the independent value of serving. Using point-by-point data from Wimbledon and the U.S. Open, we develop serve-specific player metrics that separate serving quality from return ability and other latent factors. For each tournament and gender, we fit logistic mixed-effects models of point outcomes using serve speed, speed variability, and placement features, with crossed server and returner random intercepts to capture unobserved player strengths. From these models we derive Server Quality Scores (SQS): partially pooled, opponent-adjusted estimates of players' serving impact. In out-of-sample evaluation, SQS aligns more strongly with serve efficiency (the probability of winning points within three shots) than weighted Elo. We further benchmark SQS against task-aligned serve-stat baselines and model ablations, quantifying the incremental value of serve features and partial pooling. Associations with overall serve win percentage are smaller and dataset-dependent, and neither SQS nor weighted Elo consistently dominates that outcome. Overall, SQS is best interpreted as a measure of serve-induced short-point advantage (serve quality plus early-point conversion), complementing holistic ratings with actionable insight for coaching, forecasting, and player evaluation.
- arXiv:2603.23675 (replaced) [pdf, other]
Title: Dynamical behaviors of a stochastic SIS epidemic model with mean-reverting inhomogeneous geometric brownian motion
Comments: It contains significant errors that require substantial revision
Subjects: Applications (stat.AP); Probability (math.PR)
The main purpose of this paper is to study the dynamical behaviors of a stochastic SIS epidemic model using a mean-reverting inhomogeneous geometric Brownian motion process. First, we demonstrate the existence of a global-in-time solution and establish that it is unique and remains positive. Then we derive a sufficient condition for exponential extinction of infectious diseases and show that our extinction threshold in the stochastic case coincides with that of the deterministic case. Finally, we define an appropriate theoretical framework to guarantee the existence of an ergodic stationary distribution.
- arXiv:2405.20957 (replaced) [pdf, html, other]
Title: Causal-ICM: A Data Fusion Framework For Heterogeneous Treatment Effect Estimation With Multi-Task Gaussian Processes
Comments: Accepted at the 5th Conference on Causal Learning and Reasoning (CLeaR 2026)
Subjects: Methodology (stat.ME); Applications (stat.AP)
Bridging the gap between internal and external validity is crucial for heterogeneous treatment effect estimation. Randomised controlled trials (RCTs), favoured for their internal validity due to randomisation, often encounter challenges in generalising findings due to strict eligibility criteria. Observational studies, on the other hand, may provide stronger external validity through larger and more representative samples but can suffer from compromised internal validity due to unmeasured confounding. Motivated by these complementary characteristics, we propose a novel Bayesian nonparametric approach, Causal-ICM, leveraging multi-task Gaussian processes to integrate data from both RCTs and observational studies. In particular, we introduce a parameter that controls the degree of borrowing between the datasets and prevents the observational dataset from dominating the estimation. We propose a data-adaptive procedure for choosing the optimal value of the parameter. Causal-ICM outperforms other data fusion methods in point estimation across the covariate support of the observational study and provides principled uncertainty quantification for the estimated treatment effects. We demonstrate the robust performance of Causal-ICM in diverse scenarios through multiple simulation studies and a real-world study.
- arXiv:2506.23849 (replaced) [pdf, html, other]
Title: Developing a Synthetic Socio-Economic Index through Autoencoders: Evidence from Florence's Suburban Areas
Subjects: Methodology (stat.ME); Applications (stat.AP)
The interest in summarizing complex and multidimensional phenomena often related to one or more specific sectors (social, economic, environmental, political, etc.) to make them easily understandable even to non-experts is far from waning. A widely adopted approach for this purpose is the use of composite indices, statistical measures that aggregate multiple indicators into a single comprehensive measure. In this paper, we present a novel methodology called AutoSynth, designed to condense potentially extensive datasets into a single synthetic index or a hierarchy of such indices. AutoSynth leverages an Autoencoder, a neural network technique, to represent a matrix of features in a lower-dimensional space. Although this approach is not limited to the creation of a particular composite index and can be applied broadly across various sectors, the motivation behind this work arises from a real-world need. Specifically, we aim to assess the vulnerability of the Italian city of Florence at the suburban level across three dimensions: economic, demographic, and social. To demonstrate the methodology's effectiveness, it is also applied to estimate a vulnerability index using a rich, publicly available dataset on U.S. counties and validated through a simulation study.
- arXiv:2507.20598 (replaced) [pdf, html, other]
Title: Nullstrap-DE: A General Framework for Calibrating FDR and Preserving Power in DE Methods, with Applications to DESeq2 and edgeR
Subjects: Methodology (stat.ME); Genomics (q-bio.GN); Applications (stat.AP)
Differential expression (DE) analysis is a key task in RNA-seq studies, aiming to identify genes with expression differences across conditions. A central challenge is balancing false discovery rate (FDR) control with statistical power. Parametric methods such as DESeq2 and edgeR achieve high power by modeling gene-level counts using negative binomial distributions and applying empirical Bayes shrinkage. However, these methods may suffer from FDR inflation when model assumptions are mildly violated, especially in large-sample settings. In contrast, non-parametric tests like Wilcoxon offer more robust FDR control but often lack power and do not support covariate adjustment. We propose Nullstrap-DE, a general add-on framework that combines the strengths of both approaches. Designed to augment tools like DESeq2 and edgeR, Nullstrap-DE calibrates FDR while preserving power, without modifying the original method's implementation. It generates synthetic null data from a model fitted under the gene-specific null (no DE), applies the same test statistic to both observed and synthetic data, and derives a threshold that satisfies the target FDR level. We show theoretically that Nullstrap-DE asymptotically controls FDR while maintaining power consistency. Simulations confirm that it achieves reliable FDR control and high power across diverse settings, where DESeq2, edgeR, or Wilcoxon often show inflated FDR or low power. Applications to real datasets show that Nullstrap-DE enhances statistical rigor and identifies biologically meaningful genes.
- arXiv:2511.04106 (replaced) [pdf, html, other]
Title: Sub-exponential Growth Dynamics in Complex Systems: A Piecewise Power-Law Model for the Diffusion of New Words and Names
Subjects: Physics and Society (physics.soc-ph); Computation and Language (cs.CL); Computers and Society (cs.CY); Applications (stat.AP)
The diffusion of ideas and language in society has conventionally been described by S-shaped models, such as the logistic curve. However, the role of sub-exponential growth -- a slower-than-exponential pattern known in epidemiology -- has been largely overlooked in broader social phenomena. Here, we present a piecewise power-law model to characterize complex growth curves with a few parameters. We systematically analyzed a large-scale dataset of approximately one billion Japanese blog articles linked to Wikipedia vocabulary, and observed consistent patterns in web search trend data (English, Spanish, and Japanese). Our analysis of 2,963 items, selected for reliable estimation (e.g., sufficient duration/peak, monotonic growth), reveals that 1,625 (55%) diffusion patterns without abrupt level shifts were adequately described by one or two segments. For single-segment curves, we found that (i) the mode of the shape parameter $\alpha$ was near 0.5, indicating prevalent sub-exponential growth; (ii) the peak diffusion scale is primarily determined by the growth rate $R$, with minor contributions from $\alpha$ or the duration $T$; and (iii) $\alpha$ showed a tendency to vary with the nature of the topic, being smaller for niche/local topics and larger for widely shared ones. Furthermore, a micro-behavioral model of outward (stranger) vs. inward (community) contact suggests that $\alpha$ can be interpreted as an index of the preference for outward-oriented communication. These findings suggest that sub-exponential growth is a common pattern of social diffusion, and our model provides a practical framework for consistently describing, comparing, and interpreting complex and diverse growth curves.
- arXiv:2603.17717 (replaced) [pdf, html, other]
Title: Machine Learning for Network Attacks Classification and Statistical Evaluation of Adversarial Learning Methodologies for Synthetic Data Generation
Subjects: Cryptography and Security (cs.CR); Artificial Intelligence (cs.AI); Applications (stat.AP); Machine Learning (stat.ML)
Supervised detection of network attacks has always been a critical part of network intrusion detection systems (NIDS). Nowadays, in a pivotal time for artificial intelligence (AI), with even more sophisticated attacks that utilize advanced techniques, such as generative artificial intelligence (GenAI) and reinforcement learning, it has become a vital component if we wish to protect our personal data, which are scattered across the web. In this paper, we address two tasks on the first unified multi-modal NIDS dataset, which incorporates flow-level data, packet payload information and temporal contextual features from the reprocessed CIC-IDS-2017, CIC-IoT-2023, UNSW-NB15 and CIC-DDoS-2019 datasets in a shared feature space. In the first task, we use machine learning (ML) algorithms with stratified cross-validation to prevent network attacks with stability and reliability. In the second task, we use adversarial learning algorithms to generate synthetic data, compare them with the real data, and evaluate their fidelity, utility and privacy using the SDV framework, f-divergences, distinguishability and non-parametric statistical tests. The findings provide stable ML models for intrusion detection and generative models with high fidelity and utility, by combining the Synthetic Data Vault framework, the TRTS and TSTR tests, with non-parametric statistical tests and f-divergence measures.
