Fake News Detector | honardoust.codes

EXP-006 / question

Technical question

How far can a clean classical NLP pipeline go on fake-vs-real news classification, and how do you keep it honest about dataset leakage?

EXP-006 / method

Method and workflow

Clean text with a pipeline-embedded cleaner so training and inference normalize inputs identically (no train/serve skew).
Build sparse TF-IDF word n-gram features and train a Logistic Regression baseline.
Run a dataset-leakage report and a source-confounding diagnostic, then optionally strip source artifacts for leakage-controlled training.
Evaluate with accuracy, macro F1, ROC-AUC, and PR-AUC, including out-of-source evaluation when the data permits.
Return REAL / FAKE / UNCERTAIN with a configurable uncertainty band, and load models with checksum verification.
Expose predictions through a Streamlit dashboard and a CLI.

news text → cleaning → TF-IDF → classifier → metrics → artifacts → app
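
The first two steps of the workflow can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: `clean_text`, `build_pipeline`, and the hyperparameters (`ngram_range=(1, 2)`, `max_iter=1000`) are assumptions chosen for the example. The cleaner is attached through scikit-learn's `preprocessor` hook on `TfidfVectorizer`, which is one way to keep cleaning inside the fitted pipeline so that training and inference share the same normalization.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def clean_text(text: str) -> str:
    """Hypothetical cleaner: lowercase, drop URLs, collapse whitespace."""
    text = re.sub(r"https?://\S+", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def build_pipeline() -> Pipeline:
    """TF-IDF word n-grams followed by a Logistic Regression baseline.

    Passing clean_text as the vectorizer's preprocessor means the cleaner is
    saved and loaded together with the model, so there is no separate
    cleaning step to forget at inference time.
    """
    return Pipeline(
        [
            # Assumed settings: unigrams and bigrams over word tokens.
            ("tfidf", TfidfVectorizer(preprocessor=clean_text, ngram_range=(1, 2))),
            ("clf", LogisticRegression(max_iter=1000)),
        ]
    )
```

Because `clean_text` is a module-level function, the fitted pipeline can be serialized with joblib as a single artifact, provided the same module is importable when the model is loaded.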

EXP-006 / evidence

Evidence of work

Leakage & confounding

A leakage report flags source artifacts (for example, through a "contains Reuters" heuristic), and a confounding score quantifies how strongly the source predicts the label.
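
One possible way to compute such diagnostics is sketched below. The page does not say how the project defines its confounding score, so the definition here is an assumption: predict each label from the majority label of its source, then rescale that accuracy against the majority-class baseline. All function names are hypothetical.

```python
from collections import Counter, defaultdict


def token_rate_by_label(texts, labels, token):
    """Share of documents per label that contain `token` (case-insensitive)."""
    hits, totals = Counter(), Counter()
    for text, label in zip(texts, labels):
        totals[label] += 1
        hits[label] += token.lower() in text.lower()
    return {label: hits[label] / totals[label] for label in totals}


def source_only_accuracy(sources, labels):
    """In-sample accuracy of guessing each label from its source alone."""
    per_source = defaultdict(Counter)
    for source, label in zip(sources, labels):
        per_source[source][label] += 1
    correct = sum(c.most_common(1)[0][1] for c in per_source.values())
    return correct / len(labels)


def confounding_score(sources, labels):
    """Assumed definition: 0 means the source adds nothing over the
    majority-class baseline, 1 means the source fully determines the label."""
    baseline = Counter(labels).most_common(1)[0][1] / len(labels)
    if baseline == 1.0:
        return 0.0  # only one class present, nothing to confound
    return (source_only_accuracy(sources, labels) - baseline) / (1.0 - baseline)
```

A token rate near 1.0 for one label and near 0.0 for the other suggests the token acts as a shortcut feature. The score is measured in-sample, so it is an optimistic upper bound when many sources have only a handful of articles.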

Calibrated uncertainty

Outputs are REAL / FAKE / UNCERTAIN with a configurable band, and out-of-source evaluation reports infeasibility when the data cannot support it.
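
The three-way output and the infeasibility check might look something like this. It is a sketch under stated assumptions: the band is taken to be symmetric around a probability of 0.5, the default width of 0.1 is invented for the example, and `decide` and `out_of_source_split` are hypothetical names rather than the project's API.

```python
def decide(p_fake: float, band: float = 0.1) -> str:
    """Map P(fake) to REAL / FAKE / UNCERTAIN.

    Probabilities strictly within `band` of 0.5 are reported as UNCERTAIN.
    """
    if not 0.0 <= p_fake <= 1.0:
        raise ValueError("p_fake must be a probability in [0, 1]")
    if not 0.0 <= band <= 0.5:
        raise ValueError("band must be in [0, 0.5]")
    if abs(p_fake - 0.5) < band:
        return "UNCERTAIN"
    return "FAKE" if p_fake > 0.5 else "REAL"


def out_of_source_split(sources, labels, held_out):
    """Split indices so that held-out sources appear only in the test set.

    Returns (train_idx, test_idx), or None when either side lacks both
    classes, in which case out-of-source metrics cannot be computed.
    """
    held_out = set(held_out)
    train = [i for i, s in enumerate(sources) if s not in held_out]
    test = [i for i, s in enumerate(sources) if s in held_out]
    for idx in (train, test):
        if len({labels[i] for i in idx}) < 2:
            return None
    return train, test
```

Returning `None` rather than raising lets a report state that out-of-source evaluation is infeasible for the dataset, which is common when each source publishes only one class.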

Reproducible & verified

Pipeline-embedded cleaning avoids train/serve skew, models load with checksum verification, and pytest with GitHub Actions guards the workflow.
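
Checksum-verified loading could be implemented along these lines. The choice of SHA-256 and the names `sha256_of_file` and `load_verified` are assumptions for illustration; the page only states that models load with checksum verification and that joblib is part of the stack.

```python
import hashlib
import hmac

import joblib


def sha256_of_file(path, chunk_size=1 << 20) -> str:
    """Hex SHA-256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(model_path, expected_sha256: str):
    """Load a joblib artifact only if its digest matches the expected value.

    The check runs before deserialization, because unpickling an untrusted
    or corrupted file can execute arbitrary code.
    """
    actual = sha256_of_file(model_path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"checksum mismatch for {model_path}")
    return joblib.load(model_path)
```

The expected digest would typically be written next to the model at training time and read back at load time.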

EXP-006 / stack

Technical stack

Python, pandas, NumPy, scikit-learn, Streamlit, matplotlib, joblib, pytest, Ruff, GitHub Actions

Open repository ↗

EXP-006 / limitations

Limitations and honesty check

Fake-news detection is context-dependent and vulnerable to domain shift.
The project is a learning and portfolio demo, not a moderation or fact-checking authority.
Real-world use would require source analysis, adversarial evaluation, human review, and continuous monitoring.

EXP-006 / next

Next improvements

Add stronger validation across sources and time periods.
Add probability calibration and fuller uncertainty reporting beyond the configurable UNCERTAIN band.
Add model cards and dataset documentation.
Compare with transformer-based baselines.