EXP-004 / question
Technical question
Which synthetic data method, Copula or VAE, better preserves the statistical behavior of real tabular data?
EXP-004 / method
Method and workflow
- Load or generate a tabular demo dataset and infer schema information.
- Validate configuration and input data before generation.
- Generate synthetic rows with Copula and VAE approaches.
- Compare real and synthetic data on distributions, correlations, categorical rates, value boundaries, and ML utility.
- Produce plots and summaries for visual inspection.
- Write reproducible outputs through a CLI-driven workflow.
real data → schema detection → generator → synthetic data → quality metrics → plots → report
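The generator step can be illustrated with a minimal sketch, assuming a Gaussian copula over numeric columns only. The project's actual Copula generator may work differently; the function name `fit_sample_gaussian_copula` and its parameters are hypothetical and chosen for illustration.

```python
import numpy as np
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for CDF and inverse CDF


def fit_sample_gaussian_copula(real: np.ndarray, n_samples: int, seed: int = 0) -> np.ndarray:
    """Fit a Gaussian copula to numeric columns and draw synthetic rows.

    Assumption: `real` is a 2-D float array with at least two columns.
    """
    rng = np.random.default_rng(seed)
    n_rows, n_cols = real.shape

    # 1. Rank-transform each column to (0, 1); dividing by n_rows + 1 avoids 0 and 1.
    ranks = real.argsort(axis=0).argsort(axis=0) + 1
    uniforms = ranks / (n_rows + 1)

    # 2. Map uniforms to standard-normal scores and estimate their correlation.
    scores = np.vectorize(_STD_NORMAL.inv_cdf)(uniforms)
    corr = np.corrcoef(scores, rowvar=False)

    # 3. Sample correlated normals, then map them back to uniforms.
    draws = rng.multivariate_normal(np.zeros(n_cols), corr, size=n_samples)
    u_new = np.vectorize(_STD_NORMAL.cdf)(draws)

    # 4. Push the uniforms through each column's empirical quantile function.
    return np.column_stack(
        [np.quantile(real[:, j], u_new[:, j]) for j in range(n_cols)]
    )
```

Because step 4 interpolates between observed values, synthetic values never leave the observed range of each real column. That keeps boundaries intact but also means this sketch cannot extrapolate, and it does not handle categorical columns.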
EXP-004 / evidence
Evidence of work
Metrics
Quality summaries cover distribution overlap, correlation drift, a nearest-neighbor privacy proxy, duplicate rate, and ML utility.
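Three of these metrics can be sketched as follows. The definitions are assumptions made for illustration (histogram intersection for overlap, mean absolute difference of Pearson correlations for drift, exact row matches for duplicates) and may not match the project's own formulas; all function names are hypothetical.

```python
import numpy as np


def histogram_overlap(real_col: np.ndarray, synth_col: np.ndarray, bins: int = 20) -> float:
    """Shared area of two normalized histograms: 1.0 identical, 0.0 disjoint."""
    # Use common bin edges so the two histograms are directly comparable.
    lo = min(real_col.min(), synth_col.min())
    hi = max(real_col.max(), synth_col.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real_col, bins=edges)
    q, _ = np.histogram(synth_col, bins=edges)
    return float(np.minimum(p / p.sum(), q / q.sum()).sum())


def correlation_drift(real: np.ndarray, synth: np.ndarray) -> float:
    """Mean absolute difference between off-diagonal Pearson correlations."""
    diff = np.corrcoef(real, rowvar=False) - np.corrcoef(synth, rowvar=False)
    off_diag = ~np.eye(diff.shape[0], dtype=bool)
    return float(np.abs(diff[off_diag]).mean())


def duplicate_rate(real: np.ndarray, synth: np.ndarray) -> float:
    """Fraction of synthetic rows that exactly match some real row."""
    real_rows = {tuple(row) for row in real.tolist()}
    return sum(tuple(row) in real_rows for row in synth.tolist()) / len(synth)
```

Higher overlap and lower drift indicate closer agreement with the real data. A high duplicate rate suggests the generator is copying rows rather than modeling them.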
Visuals
PCA projections, pairplots, heatmaps, and overlap charts make differences inspectable.
Workflow
Config validation and CLI options make the experiment easier to repeat and review.
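A minimal sketch of how config validation and CLI options might fit together, using only the standard library. The option names (`--generator`, `--n-samples`, `--seed`) and the validation rules are hypothetical and are not taken from the project's actual CLI.

```python
import argparse

# The two approaches named in this write-up; the key names are assumptions.
ALLOWED_GENERATORS = {"copula", "vae"}


def validate_config(config: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    if config.get("generator") not in ALLOWED_GENERATORS:
        errors.append(f"generator must be one of {sorted(ALLOWED_GENERATORS)}")
    n_samples = config.get("n_samples")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(n_samples, int) or isinstance(n_samples, bool) or n_samples <= 0:
        errors.append("n_samples must be a positive integer")
    seed = config.get("seed", 0)
    if not isinstance(seed, int) or isinstance(seed, bool):
        errors.append("seed must be an integer")
    return errors


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Synthetic data evaluation (sketch)")
    parser.add_argument("--generator", default="copula")
    parser.add_argument("--n-samples", type=int, default=1000)
    parser.add_argument("--seed", type=int, default=0)
    return parser


def config_from_args(argv: list[str]) -> dict:
    """Parse CLI arguments and fail fast before any generation starts."""
    config = vars(build_parser().parse_args(argv))
    errors = validate_config(config)
    if errors:
        raise ValueError("; ".join(errors))
    return config
```

Collecting every problem into one list, rather than stopping at the first, lets a reviewer fix a bad configuration in a single pass. Recording the seed in the validated config is what makes a run repeatable.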
EXP-004 / stack
Technical stack
Python, pandas, NumPy, SciPy, scikit-learn, PyTorch, matplotlib, seaborn, PyYAML, unittest, GitHub Actions
EXP-004 / limitations
Limitations and honesty check
- Synthetic quality depends heavily on dataset shape, feature types, and privacy requirements.
- Nearest-neighbor privacy metrics are proxies, not formal privacy guarantees.
- The project is an evaluation demo, not a certified anonymization system.
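To make the nearest-neighbor caveat concrete, here is a minimal sketch of one common proxy: the distance from each synthetic row to its closest real row. It assumes numeric columns and Euclidean distance on standardized features; the function name is hypothetical and the project's own metric may be defined differently.

```python
import numpy as np


def nearest_real_distance(real: np.ndarray, synth: np.ndarray) -> np.ndarray:
    """Euclidean distance from each synthetic row to its closest real row."""
    # Standardize with real-data statistics so no single column dominates.
    mean = real.mean(axis=0)
    std = real.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    r = (real - mean) / std
    s = (synth - mean) / std
    # Pairwise distances via broadcasting: shape (n_synth, n_real).
    # Memory grows with n_synth * n_real, so this suits demo-sized data only.
    dists = np.linalg.norm(s[:, None, :] - r[None, :, :], axis=2)
    return dists.min(axis=1)
```

Distances near zero flag synthetic rows that are close copies of real records. Large distances do not prove privacy: the measure says nothing about membership inference or attribute disclosure, which is why it is only a proxy.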
EXP-004 / next
Next improvements
- Add more generators and benchmark datasets.
- Add formal privacy metrics and attack simulations.
- Add richer HTML reporting and model cards.
- Add optional Docker support.