URL: https://dev.to/harihara_suthans_70ddf46/i-built-a-python-library-for-synthetic-dataset-generation-and-missing-value-simulation-1e0i

I Built a Python Library for Synthetic Dataset Generation and Missing Value Simulation


As a student interested in Data Science and Machine Learning, I often faced the same problem:

I needed datasets to test ideas, algorithms, and projects, but finding the right dataset wasn't always easy.

Sometimes I needed:

  • A dataset with specific correlations
  • A dataset generated from a formula
  • Missing values following MCAR, MAR, or MNAR patterns
  • Time-series data for experimentation
  • Multiple datasets that could be merged and compared

Most of the existing libraries I found covered only one of these needs.

So I decided to build my own Python package:

Introducing GOSEIDATASET

GOSEIDATASET is a Python library designed for:

✅ Synthetic Dataset Generation

✅ Missing Value Simulation

✅ Time Series Generation

✅ Dataset Merging

✅ Supervised Learning Utilities


Installation

pip install goseidataset

Random Dataset Generation

Generating a dataset is straightforward:

from goseidataset import DatasetGenerator

dg = DatasetGenerator()

df = dg.generate_random(
    n_rows=100,
    constraints={
        "sleep": [4, 10],
        "revision": [0, 8],
        "session": ["Morning", "Evening"]
    }
)

print(df.head())
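
To give a feel for what a constraints dictionary like this can describe, here is a minimal sketch in plain NumPy and pandas. It is not goseidataset's implementation: the helper `sketch_random_dataset` is hypothetical, and it assumes that a two-number list means a numeric [low, high] range and that a list of strings means a set of categories.

```python
import numpy as np
import pandas as pd


def sketch_random_dataset(n_rows, constraints, seed=0):
    """Hypothetical helper, not part of goseidataset.

    Assumption: a list of strings is a set of categories, and any other
    two-item list is a numeric [low, high] range.
    """
    rng = np.random.default_rng(seed)
    columns = {}
    for name, spec in constraints.items():
        if all(isinstance(v, str) for v in spec):
            # Categorical column: pick uniformly among the listed labels.
            columns[name] = rng.choice(spec, size=n_rows)
        else:
            # Numeric column: draw uniformly between low and high.
            low, high = spec
            columns[name] = rng.uniform(low, high, size=n_rows)
    return pd.DataFrame(columns)


sketch_df = sketch_random_dataset(
    n_rows=100,
    constraints={
        "sleep": [4, 10],
        "revision": [0, 8],
        "session": ["Morning", "Evening"],
    },
)
print(sketch_df.head())
```

Whether the library draws integers or floats for numeric ranges is not something this sketch tries to reproduce; it uses floats for simplicity.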

Generate Correlated Data

Need a dataset where features have predefined relationships? Give each feature its desired correlation with the target:

df = dg.generate_correlated(
    n_rows=1000,
    target="marks",
    correlations={
        "hours": 0.8,
        "stress": -0.5
    },
    constraints={
        "marks": [0, 100],
        "hours": [0, 12],
        "stress": [0, 100]
    }
)
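
For readers curious how a target correlation can be produced at all, one common technique is to mix the target with independent noise. The sketch below uses plain NumPy and pandas, and it may differ from what goseidataset does internally. The helpers `correlated_with` and `rescale` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 1000


def correlated_with(z, r):
    """Return a variable whose expected Pearson correlation with z is r.

    Assumes z is (approximately) standard normal.
    """
    noise = rng.standard_normal(len(z))
    return r * z + np.sqrt(1 - r**2) * noise


def rescale(values, low, high):
    """Linearly map values onto [low, high].

    A positive linear map leaves Pearson correlation unchanged.
    """
    span = values.max() - values.min()
    return low + (values - values.min()) / span * (high - low)


# Standard-normal base for the target column.
z_target = rng.standard_normal(n_rows)

corr_df = pd.DataFrame({
    "marks": rescale(z_target, 0, 100),
    "hours": rescale(correlated_with(z_target, 0.8), 0, 12),
    "stress": rescale(correlated_with(z_target, -0.5), 0, 100),
})

# Sample correlations land near the requested values, not exactly on them.
print(corr_df.corr()["marks"].round(2))
```

This only controls each feature's correlation with the target. The correlation between `hours` and `stress` is a side effect of the construction and is not set independently.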

Formula-Based Dataset Generation

Generate data from mathematical relationships:

df = dg.generate_formula(
    n_rows=500,
    formula="hours*10 + revision*5",
    constraints={
        "hours": [1, 10],
        "revision": [0, 5],
        "marks": [0, 125]
    },
    target="marks"
)
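
The same idea can be sketched with pandas alone, which helps when checking what the formula should produce. This is an illustration under assumptions (uniform inputs, clipping the target to its range), not the library's code. With `hours` in [1, 10] and `revision` in [0, 5], the formula yields values between 10 and 125, so the declared `marks` range is never exceeded.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n_rows = 500

# Assumption: input columns are drawn uniformly from their ranges.
formula_df = pd.DataFrame({
    "hours": rng.uniform(1, 10, size=n_rows),
    "revision": rng.uniform(0, 5, size=n_rows),
})

# DataFrame.eval evaluates the expression with column names as variables.
formula_df["marks"] = formula_df.eval("hours*10 + revision*5")

# Keep the target inside its declared range (a no-op for these inputs).
formula_df["marks"] = formula_df["marks"].clip(0, 125)

print(formula_df.describe().loc[["min", "max"]])
```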

Missing Value Simulation

One of the main goals of the package was to help test imputation techniques.

Supported methods include:

  • MCAR
  • MAR
  • MNAR
  • Random Missing
  • Consecutive Missing
  • Block Missing
  • Correlation-Based Missing

Example:

from goseidataset import MissingValueGenerator

mv = MissingValueGenerator(df)

result = mv.mcar(
    column="marks",
    percentage=20
)
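
If the three mechanisms in the list above are new to you, the difference is what the probability of a value going missing depends on: nothing (MCAR), another observed column (MAR), or the hidden value itself (MNAR). The sketch below shows this with plain NumPy and pandas. It does not call goseidataset, and the helper `mask_values` and the thresholds and probabilities are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows = 1000
base = pd.DataFrame({
    "hours": rng.uniform(0, 12, size=n_rows),
    "marks": rng.uniform(0, 100, size=n_rows),
})


def mask_values(frame, column, drop_prob):
    """Return a copy with `column` set to NaN where a random draw falls below drop_prob.

    drop_prob may be one number or one probability per row.
    """
    out = frame.copy()
    hide = rng.random(len(frame)) < drop_prob
    out.loc[hide, column] = np.nan
    return out


# MCAR: every row has the same 20% chance of losing its value.
mcar_df = mask_values(base, "marks", 0.2)

# MAR: the chance depends on another observed column (low study hours here).
mar_prob = np.where(base["hours"] < 4, 0.5, 0.05)
mar_df = mask_values(base, "marks", mar_prob)

# MNAR: the chance depends on the value that goes missing (low marks here).
mnar_prob = np.where(base["marks"] < 40, 0.5, 0.05)
mnar_df = mask_values(base, "marks", mnar_prob)

print(mcar_df["marks"].isna().mean())
print(mnar_df["marks"].mean(), base["marks"].mean())
```

Under MNAR the mean of the remaining marks drifts upward because low marks are removed more often. That bias is what makes MNAR a harder test for imputation methods than MCAR.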

Time Series Generation

Generate timestamp-based features easily:

from goseidataset import TimeSeriesGenerator

ts = TimeSeriesGenerator(df)

result = ts.timestamp_series()
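
As a rough picture of what attaching timestamps to a dataset involves, here is a sketch using pandas' `date_range`. The start date, the daily frequency, and the `day_of_week` column are assumptions made for the example. The output of `timestamp_series()` may look different.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_rows = 100

values = pd.DataFrame({"marks": rng.uniform(0, 100, size=n_rows)})

# One timestamp per row, one day apart (start date and frequency are arbitrary).
values["timestamp"] = pd.date_range(start="2024-01-01", periods=n_rows, freq="D")

# Index by time and derive a simple calendar feature (Monday is 0).
ts_df = values.set_index("timestamp")
ts_df["day_of_week"] = ts_df.index.dayofweek

print(ts_df.head())
```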

Supervised Learning Utilities

The package also includes utilities for:

  • Dataset comparison
  • Weighted ensemble learning
  • Dataset merging
  • Missing value imputation

Example:

from goseidataset import Supervised_learning

# dataset_a, dataset_b, and model must already be defined before this point
sl = Supervised_learning(
    dataset_a,
    dataset_b,
    target="Retention"
)

result = sl.compare_models(model)
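
The general idea behind comparing two datasets with one model can be sketched without the library: fit the same model on each dataset and compare a held-out score. Everything below is hypothetical (the generated data, the helpers `make_dataset` and `holdout_r2`, and the choice of ordinary least squares and R^2) and says nothing about how `compare_models` works internally.

```python
import numpy as np
import pandas as pd


def make_dataset(noise_scale, seed, n_rows=400):
    """Hypothetical data: Retention depends linearly on hours, plus noise."""
    rng = np.random.default_rng(seed)
    hours = rng.uniform(0, 12, size=n_rows)
    retention = 5 * hours + rng.normal(0, noise_scale, size=n_rows)
    return pd.DataFrame({"hours": hours, "Retention": retention})


def holdout_r2(frame, target, test_fraction=0.25):
    """Fit least squares on the first rows; return R^2 on the held-out last rows."""
    split = int(len(frame) * (1 - test_fraction))
    features = frame.drop(columns=target).to_numpy()
    # Prepend a column of ones so the model has an intercept.
    X = np.column_stack([np.ones(len(frame)), features])
    y = frame[target].to_numpy()
    coef, *_ = np.linalg.lstsq(X[:split], y[:split], rcond=None)
    residual = y[split:] - X[split:] @ coef
    total = y[split:] - y[split:].mean()
    return 1 - (residual @ residual) / (total @ total)


dataset_a = make_dataset(noise_scale=2.0, seed=0)   # clean signal
dataset_b = make_dataset(noise_scale=20.0, seed=1)  # noisy signal

scores = {
    "dataset_a": holdout_r2(dataset_a, "Retention"),
    "dataset_b": holdout_r2(dataset_b, "Retention"),
}
print(scores)
```

The noisier dataset scores lower because more of the variance in `Retention` is noise that no model can explain.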

What I Learned Building This Project

Building this package taught me much more than writing Python code.

I learned about:

  • Package structure
  • API design
  • Documentation
  • Testing
  • PyPI packaging
  • Dependency management
  • Versioning
  • Real-world debugging

One of the biggest lessons was that writing the code is only part of the work. Making a package easy for others to install, understand, and use is equally important.


Future Improvements

Planned features include:

  • Classification dataset generators
  • Advanced time-series simulation
  • More missing-value mechanisms
  • Better visualization utilities
  • Additional machine learning helpers

Feedback Welcome

This is my first published Python package, and I'd love to hear feedback from the community.

PyPI:
https://pypi.org/project/goseidataset/

GitHub:
https://github.com/GITTY5678/Gosei-dataset

Thanks for reading!