How to Create a Custom Synthetic Dataset in R

Last Updated : 23 Jul, 2025

A synthetic dataset is artificially generated data that mimics the structure and statistical properties of real data. In the R Programming Language, such datasets are useful for testing, experimenting and analyzing without needing actual records.

What is Synthetic Data?

Synthetic data is generated information that behaves like real data. Instead of gathering it from the real world, we create it using computer programs or mathematical rules, such as sampling from a chosen probability distribution. This lets us study and test things without using real people or sensitive records, so researchers and companies can explore ideas and build algorithms with fewer privacy concerns and data limitations.

Features of Synthetic Data

Privacy Protection: Fully synthetic records are generated rather than collected from real people, so they contain no actual personal information.
Data Augmentation: It adds more data to existing sets, which is handy when there's not enough real data for training models.
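Diverse Scenarios: Synthetic data can be generated for many different situations, including rare ones, which helps test models under various conditions.
Cost-Effective: Generating data is usually cheaper than collecting real data, which can be expensive.
Risk Reduction: Because the records are not real, the risk of data breaches and legal issues is much lower, although data synthesized from real records should still be reviewed before sharing.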
Testing Algorithms: It's great for trying out and improving algorithms without using real data.

Creating a Synthetic Dataset in R

We will generate a synthetic dataset in R by defining variables, generating values, adding noise and assembling them into a data frame.

1. Defining Variables

We define the size of the dataset and characteristics for each variable such as mean, range and standard deviation.

n: Defines the number of observations.
mean_age: Sets the mean value for the age variable.
sd_age: Determines the standard deviation for age.
min_salary: Minimum value for the salary variable.
max_salary: Maximum value for the salary variable.
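
The definitions above might look like the following minimal sketch. The article does not state the actual values, so every number here is an assumption chosen only for illustration.

```r
# All values below are illustrative assumptions, not values from the article.
n <- 100              # number of observations
mean_age <- 35        # mean of the age variable
sd_age <- 10          # standard deviation of age
min_salary <- 30000   # minimum value for salary
max_salary <- 100000  # maximum value for salary
```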

2. Generating Values

We generate synthetic values for age and salary using R’s statistical functions.

set.seed: Fixes the seed of the random number generator so the same values are generated on every run.
rnorm: Generates random values from a normal distribution.
round: Rounds numeric values; with the default digits = 0, to the nearest whole number.
runif: Generates random values from a uniform distribution.

We draw age from a normal distribution with rnorm(), using mean_age and sd_age, and round the result to whole numbers with round(). We draw salary from a uniform distribution between min_salary and max_salary with runif().
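
A sketch of this step is shown below. The parameter values and the seed (123) are assumptions made for illustration; the parameters are repeated so the snippet runs on its own.

```r
# Assumed parameters (the article does not give concrete values)
n <- 100
mean_age <- 35
sd_age <- 10
min_salary <- 30000
max_salary <- 100000

set.seed(123)  # arbitrary seed so reruns produce the same draws

# Age: normal draws, rounded to whole years
age <- round(rnorm(n, mean = mean_age, sd = sd_age))

# Salary: uniform draws between the two bounds
salary <- runif(n, min = min_salary, max = max_salary)

summary(age)
summary(salary)
```

Because a normal distribution is unbounded, unusual parameter choices can produce implausible ages; clipping with pmin() and pmax() is one possible way to handle that.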

3. Combining Data

We combine the generated vectors into a single data frame for analysis.

data.frame: Creates a structured dataset from multiple variables.

The resulting data frame, named 'synthetic_data', has one row per observation and one column each for age and salary.
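
As a minimal sketch, the combination could be written as follows. The vectors are regenerated here with assumed parameters so the snippet is self-contained.

```r
set.seed(123)  # arbitrary seed (assumption)
n <- 100       # assumed number of observations

# Vectors generated as in the previous step, with assumed parameters
age <- round(rnorm(n, mean = 35, sd = 10))
salary <- runif(n, min = 30000, max = 100000)

# One row per observation, one column per variable
synthetic_data <- data.frame(age = age, salary = salary)

str(synthetic_data)  # inspect column names and types
```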

4. Adding Noise

We introduce some randomness into the salary variable to make the data more realistic.

noise_sd: Defines the standard deviation for added noise.
rnorm: Used again to generate random noise.
$: Accesses and modifies specific columns in a data frame.

We set the standard deviation of the noise (noise_sd), generate one random value per row with rnorm() using that standard deviation and add the values to the 'salary' column. Because normally distributed noise is unbounded, some salaries may end up slightly outside the range between min_salary and max_salary.
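
The noise step might be sketched like this. The value of noise_sd (5000) and the other parameters are assumptions; the copy in salary_before is a hypothetical helper kept only to compare the column before and after.

```r
set.seed(123)  # arbitrary seed (assumption)
n <- 100

# Data frame built as in the earlier steps, with assumed parameters
synthetic_data <- data.frame(
  age = round(rnorm(n, mean = 35, sd = 10)),
  salary = runif(n, min = 30000, max = 100000)
)

noise_sd <- 5000                        # assumed standard deviation of the noise
salary_before <- synthetic_data$salary  # copy kept only for comparison

# Add zero-mean normal noise to every salary
synthetic_data$salary <- synthetic_data$salary +
  rnorm(n, mean = 0, sd = noise_sd)

# Spread of the change introduced by the noise
summary(synthetic_data$salary - salary_before)
```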

5. Viewing the Dataset

We display the first few rows of the dataset to preview the generated values.

head: Displays the first rows of a data frame (6 by default).
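
A short sketch of previewing the data is given below. The data frame is rebuilt with assumed parameters so the snippet runs on its own; the exact numbers printed depend on the seed and parameters chosen.

```r
set.seed(123)  # arbitrary seed (assumption)
n <- 100

# Dataset rebuilt with assumed parameters, including the noise step
synthetic_data <- data.frame(
  age = round(rnorm(n, mean = 35, sd = 10)),
  salary = runif(n, min = 30000, max = 100000)
)
synthetic_data$salary <- synthetic_data$salary + rnorm(n, sd = 5000)

first_rows <- head(synthetic_data)  # first 6 rows by default
print(first_rows)

first_ten <- head(synthetic_data, n = 10)  # ask for a different number of rows
```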


Creating a Custom Synthetic Dataset of Study Hours vs. Exam Scores

We create another synthetic dataset to explore the relationship between study hours and exam performance.

1. Generating Synthetic Data

We simulate study hours and exam scores assuming a linear relationship with added noise.

runif: Generates random study hour values.
rnorm: Adds variability to the exam scores.
round: Rounds both values to simulate real-world observations.
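
One way to simulate such data is sketched below. The range of study hours (1 to 10), the intercept (40), the slope (4 points per hour) and the noise level (sd = 5) are all assumptions picked for illustration, not values from the article.

```r
set.seed(42)  # arbitrary seed (assumption)
n <- 100      # assumed number of students

# Study hours: uniform between 1 and 10, rounded to one decimal place
study_hours <- round(runif(n, min = 1, max = 10), 1)

# Exam scores: assumed linear trend plus normal noise, rounded to whole points
exam_scores <- round(40 + 4 * study_hours + rnorm(n, mean = 0, sd = 5))

study_data <- data.frame(study_hours = study_hours, exam_scores = exam_scores)
head(study_data)
```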

2. Analyzing the Data

We analyze the relationship using correlation and linear regression.

cor: Calculates the correlation coefficient between variables.
lm: Fits a linear model to predict exam scores from study hours.
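
The analysis could be sketched as follows. The data are regenerated with the same assumed parameters (intercept 40, slope 4, noise sd 5) so the snippet is self-contained.

```r
set.seed(42)  # arbitrary seed (assumption)
n <- 100

# Synthetic data with an assumed linear relationship
study_hours <- round(runif(n, min = 1, max = 10), 1)
exam_scores <- round(40 + 4 * study_hours + rnorm(n, mean = 0, sd = 5))
study_data <- data.frame(study_hours = study_hours, exam_scores = exam_scores)

# Pearson correlation between the two variables
correlation <- cor(study_data$study_hours, study_data$exam_scores)
print(correlation)

# Linear regression: exam score as a function of study hours
model <- lm(exam_scores ~ study_hours, data = study_data)
print(coef(model))  # estimated intercept and slope
summary(model)      # full regression summary
```

Because the data were generated from a known line, the estimated slope should land close to the assumed value of 4, which makes synthetic data a convenient check that an analysis pipeline behaves as expected.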

3. Visualizing the Results

We plot the data and regression line to observe the relationship between study time and exam performance.

plot: Creates a scatter plot.
abline: Adds a regression line to the plot.
text: Displays the correlation value on the chart.
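
A possible version of the plot is sketched below using base R graphics. The data parameters, colors, labels and the position of the correlation text are illustrative assumptions.

```r
set.seed(42)  # arbitrary seed (assumption)
n <- 100

# Synthetic data and model, rebuilt with assumed parameters
study_hours <- round(runif(n, min = 1, max = 10), 1)
exam_scores <- round(40 + 4 * study_hours + rnorm(n, mean = 0, sd = 5))
correlation <- cor(study_hours, exam_scores)
model <- lm(exam_scores ~ study_hours)

# Scatter plot of the raw observations
plot(study_hours, exam_scores,
     xlab = "Study hours", ylab = "Exam score",
     main = "Study Hours vs. Exam Scores",
     pch = 19, col = "steelblue")

# Fitted regression line from the linear model
abline(model, col = "red", lwd = 2)

# Correlation value in the top-left corner of the plot
text(x = min(study_hours), y = max(exam_scores),
     labels = paste("Correlation:", round(correlation, 2)),
     adj = c(0, 1))
```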

Output:

[Figure: Create a Custom Synthetic Dataset in R (scatter plot of study hours vs. exam scores with the regression line)]

To recap, this example generates synthetic study hours and exam scores under an assumed linear relationship, measures their association with the correlation coefficient, fits a linear regression model to see how study hours predict exam scores and visualizes the result as a scatter plot with a regression line.

Limitations of Synthetic Datasets

Limited Real-World Representation: Synthetic datasets may not capture the full complexity and variability of real-world data.
Potential Bias: The generation process can introduce biases if it doesn't accurately reflect the true characteristics of the population.
Lack of Context: Synthetic datasets often lack contextual information present in real-world data, impacting their usefulness for analysis.
Limited Generalizability: Models trained on synthetic data may not perform well on real-world data due to differences in distribution or underlying patterns.
Validation Challenges: It can be difficult to validate models trained on synthetic data without real-world testing opportunities.

Synthetic datasets are helpful for exploring different scenarios and relationships in data analysis. However, they're not perfect copies of real-world data. They might miss some details, have biases or be challenging to validate. It's essential to use them carefully, alongside real data when possible.


URL: https://www.geeksforgeeks.org/r-language/how-to-create-a-custom-synthetic-dataset-in-r/