The Problem
You have environmental data (pH, organic matter, nutrients) and biodiversity data (Shannon index). You want to know: which environmental factors significantly affect diversity?
The quickest way to answer this is scatter plots + linear regression.
Here's a complete, runnable R workflow using real soil microbiome data — from a simple lm() call to a multi-panel publication-ready figure. Everything uses base R graphics, no ggplot2 required.
Data
This dataset has 24 soil samples with 6 environmental factors and 2 diversity indices:
| Variable | Description | Unit |
|---|---|---|
| pH | Soil acidity | — |
| AK | Available potassium | mg/kg |
| AP | Available phosphorus | mg/kg |
| AN | Available nitrogen | mg/kg |
| OM | Organic matter | g/kg |
| W | Water content | % |
| Shannon | Species diversity | — |
| PD | Phylogenetic diversity | — |
# Read and inspectdf<-read.csv("sandian.csv",header=TRUE)str(df)head(df)
Single Factor: pH → Shannon
x<-df$pHy<-df$Shannonfit<-lm(y~x)summary(fit)
Multiple R-squared: 0.0004, p-value: 0.927
R² = 0.0004, p = 0.927 — pH has no significant linear relationship with Shannon diversity in this dataset. That's OK. Non-significant results are normal in real research.
Plot it:
par(mar=c(5,5,3,1))plot(x,y,xlab="pH",ylab="Shannon Index",pch=16,cex=1.5,col="#00AFBB",cex.lab=1.8,cex.axis=1.3,las=1)abline(fit,lwd=3,col="#00AFBB")
Multi-Panel: All 6 Factors at Once
Doing one plot at a time is inefficient. Use layout() + for loop:
x_vars<-c("pH","AK","AP","AN","OM","W")x_labels<-c("pH","AK (mg/kg)","AP (mg/kg)","AN (mg/kg)","SOM (g/kg)","W (%)")layout(matrix(1:6,2,3,byrow=TRUE))par(mar=c(4.5,4.5,3,1))for(iinseq_along(x_vars)){fit<-lm(df$Shannon~df[[x_vars[i]]])s<-summary(fit)plot(df[[x_vars[i]]],df$Shannon,xlab=x_labels[i],ylab="Shannon",pch=16,cex=1.3,col="#00AFBB",cex.lab=1.5,cex.axis=1.2,las=1)abline(fit,lwd=3,col="#00AFBB")# Annotate R² and p-valuer2<-round(s$r.squared,3)p<-s$coefficients[2,4]p_label<-ifelse(p<0.001,"p<0.001",paste0("p=",round(p,3)))x_pos<-par("usr")[2]-diff(par("usr")[1:2])*0.25y_pos<-par("usr")[4]-diff(par("usr")[3:4])*0.1text(x_pos,y_pos,labels=bquote(R^2==.(r2)~" "~.(p_label)),cex=1.1,adj=c(1,1))}
Save at 300 dpi for publication:
png("scatter_6panel.png",width=3000,height=2000,res=300)# ... plotting code ...dev.off()
Results at a Glance
| Factor | R² | p-value | Significant? |
|---|---|---|---|
| pH | 0.000 | 0.927 | ❌ |
| AK | 0.088 | 0.160 | ❌ |
| AP | 0.012 | 0.615 | ❌ |
| AN | 0.220 | 0.021 | ✅ |
| OM | 0.540 | <0.001 | ✅✅ |
| W | 0.000 | 0.956 | ❌ |
Organic matter (OM) is the strongest predictor — it explains 54% of the variance in Shannon diversity.
Complete Script (Copy & Run)
# ================================================================# Scatter Plots + Linear Regression: Environmental Factors vs. Diversity# ================================================================plot_one<-function(x,y,xlab,ylab){fit<-lm(y~x)s<-summary(fit)r2<-round(s$r.squared,3)p<-s$coefficients[2,4]p_label<-ifelse(p<0.001,"p<0.001",paste0("p=",round(p,3)))plot(x,y,xlab=xlab,ylab=ylab,pch=16,cex=1.3,col="#00AFBB",cex.lab=1.5,cex.axis=1.2,las=1)abline(fit,lwd=3,col="#00AFBB")x_pos<-par("usr")[2]-diff(par("usr")[1:2])*0.25y_pos<-par("usr")[4]-diff(par("usr")[3:4])*0.1text(x_pos,y_pos,labels=bquote(R^2==.(r2)~" "~.(p_label)),cex=1.1,adj=c(1,1))}df<-read.csv("sandian.csv",header=TRUE)x_vars<-c("pH","AK","AP","AN","OM","W")x_labels<-c("pH","AK (mg/kg)","AP (mg/kg)","AN (mg/kg)","SOM (g/kg)","W (%)")png("scatter_6panel.png",width=3000,height=2000,res=300)layout(matrix(1:6,2,3,byrow=TRUE))par(mar=c(4.5,4.5,3,1))for(iinseq_along(x_vars)){plot_one(df[[x_vars[i]]],df$Shannon,x_labels[i],"Shannon")}dev.off()
To adapt for your data: change x_vars to your column names, x_labels to your axis labels, and df$Shannon to your diversity index.
Parameter Cheat Sheet
| param | what it does | common values |
|---|---|---|
pch |
point shape | 16 (filled), 1 (empty), 17 (triangle) |
cex |
point/text size | 1.3–1.5 (moderate) |
col |
color |
#00AFBB (teal), #E69F00 (orange) |
lty |
line type | 1 (solid), 2 (dashed) |
lwd |
line width | 2–3 (publication) |
res |
export dpi | 300 (journal) |
When R² Is Not Significant
- Normal. Don't only show significant results.
-
Try non-linear:
poly(x, 2)for quadratic terms. - Small n (< 30) : may lack statistical power, not ecological meaning.
*Follow me on Dev.to for more R tutorials.
Tags: r tutorial datascience ecology
For further actions, you may consider blocking this person and/or reporting abuse
