Synthetic Data Artist | honardoust.codes

EXP-004 / question

Technical question

Which synthetic data method, Copula or VAE, better preserves the behavior of real tabular data?

EXP-004 / method

Load or generate a tabular demo dataset and infer schema information.
Validate configuration and input data before generation.
Generate synthetic rows with Copula and VAE approaches.
Compare real and synthetic distributions, correlations, categorical rates, boundaries, and ML utility.
Produce plots and summaries for visual inspection.
Write reproducible outputs through a CLI-driven workflow.

real data → schema detection → generator → synthetic data → quality metrics → plots → report
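The Copula generation step can be illustrated with a minimal Gaussian copula sketch for numeric columns. This is not the repository's implementation: the function name `gaussian_copula_sample` and its parameters are hypothetical, categorical columns are ignored, and the input is assumed to have at least two numeric columns with no missing values.

```python
import numpy as np
import pandas as pd
from scipy import stats


def gaussian_copula_sample(real: pd.DataFrame, n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Sample synthetic numeric rows with a Gaussian copula (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(real)

    # 1. Map each column to normal scores through its empirical CDF (ranks).
    ranks = real.rank(method="average").to_numpy()
    uniform = ranks / (n + 1)  # strictly inside (0, 1), so ppf stays finite
    normal_scores = stats.norm.ppf(uniform)

    # 2. Estimate the dependence structure in the latent normal space.
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 3. Draw correlated normals and map them back to uniforms.
    latent = rng.multivariate_normal(np.zeros(real.shape[1]), corr, size=n_rows)
    u = stats.norm.cdf(latent)

    # 4. Invert each marginal with the empirical quantile function.
    synthetic = {
        col: np.quantile(real[col].to_numpy(), u[:, j])
        for j, col in enumerate(real.columns)
    }
    return pd.DataFrame(synthetic)
```

Because the marginals are inverted through empirical quantiles, every synthetic value stays inside the observed range of its column, which is one reason boundary checks appear in the comparison step.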

EXP-004 / evidence

Metrics

Quality summaries cover distribution overlap, correlation drift, privacy proxy, duplicate rate, and ML utility.
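Three of these metrics can be sketched in a few lines. The definitions below are assumptions chosen for illustration (overlap as one minus the Kolmogorov-Smirnov statistic, drift as the mean absolute difference between Pearson correlation matrices, duplicates as exact row matches); the project may define them differently. The function names are hypothetical, and numeric columns are assumed.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp


def distribution_overlap(real: pd.Series, synthetic: pd.Series) -> float:
    """1 - KS statistic: 1.0 for identical empirical CDFs, 0.0 for disjoint samples."""
    return 1.0 - ks_2samp(real, synthetic).statistic


def correlation_drift(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Mean absolute difference between the two Pearson correlation matrices.

    The diagonal is included, so it contributes zeros to the mean.
    """
    diff = real.corr() - synthetic[real.columns].corr()
    return float(np.abs(diff.to_numpy()).mean())


def duplicate_rate(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Fraction of synthetic rows that exactly match some real row."""
    real_rows = set(map(tuple, real.to_numpy()))
    hits = sum(tuple(row) in real_rows for row in synthetic[real.columns].to_numpy())
    return hits / len(synthetic)
```

A high duplicate rate suggests the generator is copying training rows, which is worth checking before reading anything into the overlap and drift scores.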

Visuals

PCA projections, pairplots, heatmaps, and overlap charts make differences inspectable.
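As a rough illustration of the PCA view, one option is to fit the scaler and the projection on the real rows only and then project both sets into the same two-dimensional space. The helper name `project_pair` is hypothetical, and numeric arrays with at least two columns are assumed.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_pair(real: np.ndarray, synthetic: np.ndarray):
    """Fit scaling and a 2-component PCA on real rows, then project both sets."""
    scaler = StandardScaler().fit(real)
    pca = PCA(n_components=2).fit(scaler.transform(real))
    real_2d = pca.transform(scaler.transform(real))
    synthetic_2d = pca.transform(scaler.transform(synthetic))
    return real_2d, synthetic_2d
```

Plotting the two returned arrays as overlaid scatter layers shows where synthetic rows fall outside, or fail to cover, the regions occupied by real rows.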

Workflow

Config validation and CLI options make the experiment easier to repeat and review.
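A validation step of this kind might look like the sketch below. The config keys (`method`, `n_rows`, `seed`) are hypothetical and are not taken from the repository; the dictionary could come from a YAML file loaded with `yaml.safe_load` or from parsed CLI arguments.

```python
ALLOWED_METHODS = {"copula", "vae"}  # the two generators named in the method section


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []

    method = config.get("method")
    if not isinstance(method, str) or method not in ALLOWED_METHODS:
        errors.append(f"method must be one of {sorted(ALLOWED_METHODS)}, got {method!r}")

    # bool is a subclass of int in Python, so it is rejected explicitly.
    n_rows = config.get("n_rows")
    if not isinstance(n_rows, int) or isinstance(n_rows, bool) or n_rows <= 0:
        errors.append(f"n_rows must be a positive integer, got {n_rows!r}")

    seed = config.get("seed", 0)
    if not isinstance(seed, int) or isinstance(seed, bool):
        errors.append(f"seed must be an integer, got {seed!r}")

    return errors
```

Collecting every problem before failing, instead of raising on the first one, lets a reviewer fix a config in a single pass.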

EXP-004 / stack

Python, pandas, NumPy, SciPy, scikit-learn, PyTorch, matplotlib, seaborn, PyYAML, unittest, GitHub Actions


EXP-004 / limitations

Synthetic quality depends heavily on dataset shape, feature types, and privacy requirements.
Nearest-neighbor privacy metrics are proxies, not formal privacy guarantees.
The project is an evaluation demo, not a certified anonymization system.
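To make the nearest-neighbor caveat concrete, a common proxy is the distance from each synthetic row to its closest real row. The sketch below is an assumption about how such a proxy could be computed, not the project's code; the name `distance_to_closest_record` is hypothetical and numeric features are assumed.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def distance_to_closest_record(real: np.ndarray, synthetic: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic row to its nearest real row."""
    # Scale with statistics from the real data so no single feature dominates.
    scaler = StandardScaler().fit(real)
    nn = NearestNeighbors(n_neighbors=1).fit(scaler.transform(real))
    distances, _ = nn.kneighbors(scaler.transform(synthetic))
    return distances[:, 0]
```

Distances near zero flag rows that are close to copies of real records. Large distances do not prove safety, though: the score depends on the scaling and the features chosen, and it says nothing about what an attacker with side information could infer.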

EXP-004 / next