EXP-004 / Evaluation Lab / Published

Synthetic Data Artist

A synthetic tabular-data lab comparing Copula and VAE generators with distribution overlap, correlation drift, PCA projections, privacy proxies, ML utility checks, CLI validation, and report artifacts.

EXP-004 / question

Technical question

Which synthetic data method, Copula or VAE, better preserves the statistical behavior of real tabular data: its distributions, correlations, and downstream ML utility?

EXP-004 / method

Method and workflow

  1. Load or generate a tabular demo dataset and infer schema information.
  2. Validate configuration and input data before generation.
  3. Generate synthetic rows with Copula and VAE approaches.
  4. Compare real and synthetic distributions, correlations, categorical rates, boundaries, and ML utility.
  5. Produce plots and summaries for visual inspection.
  6. Write reproducible outputs through a CLI-driven workflow.
real data → schema detection → generator → synthetic data → quality metrics → plots → report
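
To make the generation step concrete, here is a minimal sketch of a Gaussian copula sampler for numeric columns. It is an illustration under stated assumptions, not the repository's implementation: the function name `fit_sample_gaussian_copula` is hypothetical, marginals are modeled with empirical quantiles, categorical columns are ignored, and at least two numeric columns are assumed.

```python
import numpy as np
from scipy import stats


def fit_sample_gaussian_copula(real: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Sample synthetic rows from a Gaussian copula fitted to numeric columns.

    Hypothetical helper for illustration; assumes `real` has shape
    (n_rows, n_cols) with n_cols >= 2 and no missing values.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = real.shape

    # 1. Map each column to normal scores through its empirical ranks.
    ranks = stats.rankdata(real, axis=0)      # ranks in 1..n_rows
    uniforms = ranks / (n_rows + 1)           # strictly inside (0, 1)
    normal_scores = stats.norm.ppf(uniforms)

    # 2. Estimate the dependence structure in normal-score space.
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 3. Draw correlated normals and map them back to uniforms.
    z = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u = stats.norm.cdf(z)

    # 4. Invert each empirical marginal with a quantile lookup.
    return np.column_stack(
        [np.quantile(real[:, j], u[:, j]) for j in range(n_cols)]
    )


# Toy "real" data: two positively correlated numeric columns.
demo_rng = np.random.default_rng(42)
x = demo_rng.normal(size=500)
real = np.column_stack([x, 0.8 * x + 0.6 * demo_rng.normal(size=500)])

synthetic = fit_sample_gaussian_copula(real, n_samples=500)
```

Because the marginals are inverted with empirical quantiles, every synthetic value stays inside the observed range of its column, which is one reason boundary checks appear in the comparison step.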

EXP-004 / evidence

Evidence of work

Metrics

Quality summaries cover distribution overlap, correlation drift, privacy proxy, duplicate rate, and ML utility.
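
As an illustration of how three of these summaries could be computed, the sketch below defines a histogram overlap score, a correlation drift score, and a duplicate rate with NumPy. The function names, the bin count, and the rounding tolerance are assumptions made for this example; the project's own definitions may differ.

```python
import numpy as np


def histogram_overlap(real_col, synth_col, bins=20):
    """Shared histogram mass between two samples: 1.0 = identical, 0.0 = disjoint."""
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    edges = np.linspace(lo, hi, bins + 1)     # common bin edges for both samples
    p, _ = np.histogram(real_col, bins=edges)
    q, _ = np.histogram(synth_col, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return float(np.minimum(p, q).sum())


def correlation_drift(real, synth):
    """Mean absolute difference between off-diagonal Pearson correlations."""
    diff = np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    off_diag = ~np.eye(diff.shape[0], dtype=bool)
    return float(np.abs(diff[off_diag]).mean())


def duplicate_rate(real, synth, decimals=6):
    """Fraction of synthetic rows that match a real row after rounding."""
    real_rows = {tuple(row) for row in np.round(real, decimals)}
    hits = sum(tuple(row) in real_rows for row in np.round(synth, decimals))
    return hits / len(synth)


# Toy comparison: the "synthetic" sample is the real one shifted by one unit.
rng = np.random.default_rng(0)
real = rng.normal(size=(400, 3))
shifted = real + 1.0

overlap_same = histogram_overlap(real[:, 0], real[:, 0])
overlap_shifted = histogram_overlap(real[:, 0], shifted[:, 0])
drift_same = correlation_drift(real, real)
dup_same = duplicate_rate(real, real)
```

A shift leaves the correlation matrix unchanged while lowering the overlap score, which shows why the two metrics are reported side by side rather than merged into one number.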

Visuals

PCA projections, pairplots, heatmaps, and overlap charts make differences inspectable.
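
One plausible way to build such a PCA view is to fit the scaler and the projection on the real data only, then push both datasets through the same transform so they share axes. The sketch below assumes numeric inputs and uses scikit-learn; `project_real_and_synthetic` is a hypothetical name, and the plotting call is left out so the example runs headless.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def project_real_and_synthetic(real, synth, n_components=2):
    """Project both datasets onto principal components fitted on real data."""
    # Fit scaling and PCA on real rows only, so the axes describe real data.
    scaler = StandardScaler().fit(real)
    pca = PCA(n_components=n_components).fit(scaler.transform(real))

    real_proj = pca.transform(scaler.transform(real))
    synth_proj = pca.transform(scaler.transform(synth))
    return real_proj, synth_proj, pca.explained_variance_ratio_


# Toy inputs standing in for real and synthetic tables.
rng = np.random.default_rng(1)
real = rng.normal(size=(300, 5))
synth = rng.normal(size=(300, 5))

real_2d, synth_2d, explained = project_real_and_synthetic(real, synth)
```

The two returned arrays can be passed to a matplotlib scatter plot in contrasting colors; regions covered by only one color indicate where the generator misses or invents structure.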

Workflow

Config validation and CLI options make the experiment easier to repeat and review.
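
A small sketch of what validating a run configuration before generation might look like, using only the standard library. The option names (`--generator`, `--n-samples`, `--seed`) and the helper functions are hypothetical and chosen for illustration; they are not taken from the project's actual CLI.

```python
import argparse

# The two methods named in this write-up; keys below are illustrative only.
ALLOWED_GENERATORS = {"copula", "vae"}


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if config.get("generator") not in ALLOWED_GENERATORS:
        errors.append(f"generator must be one of {sorted(ALLOWED_GENERATORS)}")

    n = config.get("n_samples")
    if not isinstance(n, int) or isinstance(n, bool) or n <= 0:
        errors.append("n_samples must be a positive integer")

    seed = config.get("seed")
    if seed is not None and (not isinstance(seed, int) or isinstance(seed, bool)):
        errors.append("seed must be an integer when provided")
    return errors


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI exposing the same keys that validate_config checks."""
    parser = argparse.ArgumentParser(description="Synthetic data experiment (sketch)")
    parser.add_argument("--generator", default="copula")
    parser.add_argument("--n-samples", type=int, default=1000)
    parser.add_argument("--seed", type=int, default=None)
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--generator", "vae", "--n-samples", "500"])
config = vars(args)
errors = validate_config(config)
```

Collecting every problem into a list, instead of raising on the first one, lets a reviewer see all configuration issues in a single run.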

EXP-004 / stack

Technical stack

Python, pandas, NumPy, SciPy, scikit-learn, PyTorch, matplotlib, seaborn, PyYAML, unittest, GitHub Actions

EXP-004 / limitations

Limitations and honesty check

  • Synthetic quality depends heavily on dataset shape, feature types, and privacy requirements.
  • Nearest-neighbor privacy metrics are proxies, not formal privacy guarantees.
  • The project is an evaluation demo, not a certified anonymization system.
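
To illustrate why a nearest-neighbor metric is only a proxy, the sketch below computes the distance from each synthetic row to its closest real row. It is a brute-force NumPy version suited to small demos, with a hypothetical function name, and it assumes numeric features already on comparable scales.

```python
import numpy as np


def distance_to_closest_record(synth, real):
    """Euclidean distance from each synthetic row to its nearest real row."""
    # Brute-force pairwise differences: O(len(synth) * len(real)) memory.
    diffs = synth[:, None, :] - real[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)


rng = np.random.default_rng(0)
real = rng.normal(size=(200, 3))
copied = real[:50].copy()           # a "generator" that memorizes real rows
fresh = rng.normal(size=(50, 3))    # independent draws from the same distribution

dcr_copied = distance_to_closest_record(copied, real)
dcr_fresh = distance_to_closest_record(fresh, real)
```

Zero distances flag verbatim copies, but large distances do not prove safety: a row can sit far from every real record and still leak information about it, which is why this check cannot stand in for a formal privacy guarantee.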

EXP-004 / next

Next improvements

  • Add more generators and benchmark datasets.
  • Add formal privacy metrics and attack simulations.
  • Add richer HTML reporting and model cards.
  • Add optional Docker support.