EXP-006 / question
Technical question
How far can a clean classical NLP pipeline go on fake-vs-real news classification, and how do you keep its results honest about dataset leakage?
EXP-006 / method
Method and workflow
- Clean text with a pipeline-embedded cleaner so training and inference normalize inputs identically (no train/serve skew).
- Build sparse TF-IDF word n-gram features and train a Logistic Regression baseline.
- Run a dataset-leakage report and a source-confounding diagnostic, then optionally strip source artifacts for leakage-controlled training.
- Evaluate with accuracy, macro F1, ROC-AUC, and PR-AUC, including out-of-source evaluation when the data permits.
- Return REAL / FAKE / UNCERTAIN with a configurable uncertainty band, and load models with checksum verification.
- Expose predictions through a Streamlit dashboard and a CLI.
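The first two steps and the metric list can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual code: `clean_text`, `build_pipeline`, `evaluate`, the toy headlines, and every hyperparameter are assumptions. The cleaner is passed to the vectorizer as its `preprocessor`, so it travels with the fitted pipeline and is applied identically at training and inference time. PR-AUC is approximated here by average precision.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)
from sklearn.pipeline import Pipeline

_URL_RE = re.compile(r"https?://\S+")
_WS_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Hypothetical cleaner: lowercase, drop URLs, collapse whitespace."""
    text = _URL_RE.sub(" ", text.lower())
    return _WS_RE.sub(" ", text).strip()


def build_pipeline() -> Pipeline:
    # A custom preprocessor replaces the vectorizer's built-in lowercasing,
    # which is why clean_text lowercases explicitly.
    return Pipeline([
        ("tfidf", TfidfVectorizer(preprocessor=clean_text, ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


def evaluate(model: Pipeline, texts, y_true) -> dict:
    """Compute the four metrics; label 1 means FAKE, label 0 means REAL."""
    p_fake = model.predict_proba(texts)[:, 1]
    y_pred = model.predict(texts)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "roc_auc": roc_auc_score(y_true, p_fake),
        "pr_auc": average_precision_score(y_true, p_fake),
    }


# Toy data, invented purely to show the calls (0 = REAL, 1 = FAKE).
train_texts = [
    "Senate passes budget bill after long debate",
    "Central bank holds interest rates steady",
    "City council approves new transit plan",
    "SHOCKING miracle cure doctors hate http://example.com",
    "You won't believe this secret celebrity scandal",
    "Aliens secretly control the weather, insider claims",
]
train_y = [0, 0, 0, 1, 1, 1]
test_texts = ["Parliament debates new budget plan",
              "Shocking secret cure insider claims"]
test_y = [0, 1]

model = build_pipeline().fit(train_texts, train_y)
metrics = evaluate(model, test_texts, test_y)
```

Because the cleaner lives inside the pipeline object, serializing `model` also serializes the normalization step, so the dashboard and the CLI cannot drift from the training-time cleaning.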
EXP-006 / evidence
Evidence of work
A leakage report flags source artifacts (such as a "contains Reuters" heuristic), and a confounding score quantifies how strongly the source predicts the label.
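One way to picture these two diagnostics is shown below. The definitions are assumptions made for illustration, not the project's actual formulas: the confounding score is taken to be the accuracy of a classifier that sees only the source (predicting each source's majority label), compared against the majority-class baseline, and the artifact check is the per-label rate at which a token such as "Reuters" appears. The function names and toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def source_confounding(sources, labels):
    """Return (source_only_accuracy, majority_baseline).

    In-sample and therefore optimistic: a large gap between the two numbers
    suggests the source alone gives the label away.
    """
    per_source = defaultdict(Counter)
    for src, y in zip(sources, labels):
        per_source[src][y] += 1
    n = len(labels)
    correct = sum(max(counts.values()) for counts in per_source.values())
    baseline = max(Counter(labels).values()) / n
    return correct / n, baseline


def token_rate_by_label(texts, labels, token):
    """Fraction of documents per label whose text contains `token`."""
    hits, totals = Counter(), Counter()
    needle = token.lower()
    for text, y in zip(texts, labels):
        totals[y] += 1
        if needle in text.lower():
            hits[y] += 1
    return {y: hits[y] / totals[y] for y in totals}


# Toy rows, invented for illustration only.
texts = [
    "WASHINGTON (Reuters) - Senate passes bill",
    "LONDON (Reuters) - Markets steady",
    "You won't believe this!",
    "SHOCKING secret exposed",
    "Miracle cure they hide",
    "Reuters reports council vote",
]
sources = ["reuters", "reuters", "blog_a", "blog_a", "blog_b", "wire"]
labels = ["REAL", "REAL", "FAKE", "FAKE", "FAKE", "REAL"]

source_acc, baseline = source_confounding(sources, labels)
rates = token_rate_by_label(texts, labels, "Reuters")
```

In this toy set the source determines the label perfectly and the token appears only in REAL rows, which is the pattern that would justify stripping the artifact before leakage-controlled training.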
Outputs are REAL / FAKE / UNCERTAIN with a configurable band, and out-of-source evaluation reports infeasibility when the data cannot support it.
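A minimal sketch of both behaviours, under stated assumptions: the band defaults of 0.4 and 0.6, the handling of values exactly on a threshold, and the feasibility rule (both splits must contain both classes) are illustrative choices, and `decide` and `out_of_source_split` are hypothetical names.

```python
def decide(p_fake: float, low: float = 0.4, high: float = 0.6) -> str:
    """Map P(FAKE) to a label, abstaining strictly inside (low, high)."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("band must satisfy 0 <= low <= high <= 1")
    if p_fake >= high:
        return "FAKE"
    if p_fake <= low:
        return "REAL"
    return "UNCERTAIN"


def out_of_source_split(sources, labels, held_out):
    """Hold out whole sources for testing.

    Returns ((train_idx, test_idx), "ok") when both splits contain at least
    two classes, otherwise (None, reason) so the caller can report that the
    evaluation is infeasible instead of printing a meaningless score.
    """
    held = set(held_out)
    train_idx = [i for i, s in enumerate(sources) if s not in held]
    test_idx = [i for i, s in enumerate(sources) if s in held]
    for name, idx in (("train", train_idx), ("test", test_idx)):
        n_classes = len({labels[i] for i in idx})
        if n_classes < 2:
            return None, f"infeasible: {name} split has {n_classes} class(es)"
    return (train_idx, test_idx), "ok"


# Toy rows, invented for illustration only.
sources = ["a", "a", "b", "b", "c", "c"]
labels = ["REAL", "FAKE", "REAL", "FAKE", "REAL", "REAL"]

split_b, status_b = out_of_source_split(sources, labels, ["b"])
split_c, status_c = out_of_source_split(sources, labels, ["c"])
```

Holding out source `b` works because it carries both labels. Holding out source `c` is reported as infeasible, since a test split with only REAL rows cannot support ROC-AUC or macro F1.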
Pipeline-embedded cleaning avoids train/serve skew, models load with checksum verification, and a pytest suite run by GitHub Actions guards the workflow.
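Checksum-verified loading can be sketched with the standard library alone. The serialization format (`pickle`), the hash (SHA-256), and the names `sha256_of` and `load_verified` are assumptions for illustration; the project may store and verify its models differently. A checksum only confirms that the file is the one that was recorded, so unpickling remains appropriate only for files from a trusted origin.

```python
import hashlib
import hmac
import pickle
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large models need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(path: Path, expected_sha256: str):
    """Unpickle `path` only if its SHA-256 matches the recorded value."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_sha256.strip().lower()):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Demo with a stand-in "model" object written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    model_path.write_bytes(pickle.dumps({"band": [0.4, 0.6]}))
    recorded = sha256_of(model_path)  # would normally be stored beside the model

    loaded = load_verified(model_path, recorded)

    # Simulate corruption or tampering: one extra byte changes the digest.
    model_path.write_bytes(model_path.read_bytes() + b"\x00")
    try:
        load_verified(model_path, recorded)
        tamper_detected = False
    except ValueError:
        tamper_detected = True
```

The digest is checked before any bytes reach `pickle.load`, so a modified file is rejected without being deserialized.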
EXP-006 / stack
Technical stack
EXP-006 / limitations
Limitations and honesty check
- Fake-news detection is context-dependent and vulnerable to domain shift.
- The project is a learning and portfolio demo, not a moderation or fact-checking authority.
- Real-world use would require source analysis, adversarial evaluation, human review, and continuous monitoring.
EXP-006 / next
Next improvements
- Add stronger validation across sources and time periods.
- Add calibration and uncertainty reporting.
- Add model cards and dataset documentation.
- Compare with transformer-based baselines.