EXP-006 / NLP Pipeline / Published

Fake News Detector

A responsible NLP pipeline that turns news text into a style-risk signal. It pairs TF-IDF features and Logistic Regression with honest evaluation, dataset-leakage analysis, leakage-controlled training, a REAL/FAKE/UNCERTAIN decision band, checksum-verified model loading, a Streamlit dashboard, and CLI inference.

EXP-006 / question

Technical question

How far can a clean classical NLP pipeline go for fake-vs-real news classification, and how do you keep it honest about dataset leakage?

EXP-006 / method

Method and workflow

  1. Clean text with a pipeline-embedded cleaner so training and inference normalize inputs identically (no train/serve skew).
  2. Build sparse TF-IDF word n-gram features and train a Logistic Regression baseline.
  3. Run a dataset-leakage report and a source-confounding diagnostic, then optionally strip source artifacts for leakage-controlled training.
  4. Evaluate with accuracy, macro F1, ROC-AUC, and PR-AUC, including out-of-source evaluation when the data permits.
  5. Return REAL / FAKE / UNCERTAIN with a configurable uncertainty band, and load models with checksum verification.
  6. Expose predictions through a Streamlit dashboard and a CLI.
news text → cleaning → TF-IDF → classifier → metrics → artifacts → app
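Steps 1, 2, and 4 above can be sketched with scikit-learn as follows. This is a minimal illustration, not the repository's code: the cleaning rules, the hyperparameters, the names `clean_texts`, `build_pipeline`, and `evaluate`, and the label encoding (1 = FAKE, 0 = REAL) are all assumptions, and the toy rows are invented. The point it shows is that the cleaner lives inside the pipeline, so fitting and inference cannot normalize text differently.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, average_precision_score,
                             f1_score, roc_auc_score)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def clean_texts(texts):
    """Lowercase, drop URLs, and collapse whitespace (assumed cleaning rules)."""
    cleaned = []
    for text in texts:
        text = re.sub(r"https?://\S+", " ", str(text).lower())
        cleaned.append(re.sub(r"\s+", " ", text).strip())
    return cleaned


def build_pipeline():
    # The cleaner is a pipeline step, so fit() and predict_proba() both run it.
    return Pipeline([
        ("clean", FunctionTransformer(clean_texts)),
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])


def evaluate(model, texts, labels):
    """Report the four metrics named in step 4 (labels assumed 0 = REAL, 1 = FAKE)."""
    proba_fake = model.predict_proba(texts)[:, 1]
    preds = model.predict(texts)
    return {
        "accuracy": accuracy_score(labels, preds),
        "macro_f1": f1_score(labels, preds, average="macro"),
        "roc_auc": roc_auc_score(labels, proba_fake),
        "pr_auc": average_precision_score(labels, proba_fake),
    }


# Invented toy rows, for illustration only.
toy_texts = [
    "The central bank held interest rates steady on Tuesday.",
    "Parliament passed the budget bill after a second reading.",
    "The city council approved funding for a new bus route.",
    "SHOCKING secret cure doctors do not want you to know!!!",
    "You will not BELIEVE what this politician is hiding!!!",
    "Miracle pill melts fat overnight, insiders reveal all!!!",
]
toy_labels = [0, 0, 0, 1, 1, 1]

model = build_pipeline().fit(toy_texts, toy_labels)
# Scoring on the training rows only demonstrates the call shape;
# a real evaluation needs a held-out split.
scores = evaluate(model, toy_texts, toy_labels)
```

Because the whole `Pipeline` object is what gets serialized, the dashboard and the CLI can pass raw text straight to `predict_proba` without re-implementing the cleaning step.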

EXP-006 / evidence

Evidence of work

Leakage & confounding

A leakage report flags source artifacts (for example, through a "contains Reuters" heuristic), and a confounding score quantifies how strongly the source predicts the label.
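One way such diagnostics could look is sketched below using only the standard library. The page does not say how the project computes its confounding score, so the definition here is an assumption: the accuracy gained by predicting each row's label from its source's majority label, rescaled to the range 0 to 1 (essentially Goodman and Kruskal's lambda). The function names and the dateline regex are likewise hypothetical.

```python
import re
from collections import Counter, defaultdict


def marker_leakage(texts, labels, marker="reuters"):
    """Label distribution among texts that contain `marker` (case-insensitive)."""
    hits = Counter(label for text, label in zip(texts, labels)
                   if marker in text.lower())
    total = sum(hits.values())
    return {label: count / total for label, count in hits.items()} if total else {}


def confounding_score(sources, labels):
    """0.0 = knowing the source adds nothing; 1.0 = the source determines the label."""
    by_source = defaultdict(Counter)
    for source, label in zip(sources, labels):
        by_source[source][label] += 1
    n = len(labels)
    baseline = Counter(labels).most_common(1)[0][1] / n
    source_acc = sum(c.most_common(1)[0][1] for c in by_source.values()) / n
    if baseline == 1.0:  # only one label present, nothing left to explain
        return 0.0
    return (source_acc - baseline) / (1.0 - baseline)


def strip_source_artifacts(text):
    """Remove a leading 'CITY (Reuters) - ' dateline (assumed artifact pattern)."""
    return re.sub(r"^\s*[A-Z][A-Za-z .,/]*\(Reuters\)\s*-\s*", "", text)
```

A marker whose matches fall almost entirely under one label, or a confounding score near 1.0, suggests the classifier may be learning the publisher rather than the writing style, which is the case leakage-controlled training is meant to address.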

Explicit uncertainty

Outputs are REAL / FAKE / UNCERTAIN with a configurable band, and out-of-source evaluation reports infeasibility when the data cannot support it.
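The decision band and the feasibility check can be illustrated with a short sketch. The thresholds (0.4 and 0.6), the function names, and the feasibility rule are assumptions made for illustration; the project's actual defaults and checks are not stated here.

```python
def decide(p_fake, lower=0.4, upper=0.6):
    """Map P(fake) to a label, abstaining inside the (lower, upper) band."""
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("expected 0 <= lower <= upper <= 1")
    if p_fake >= upper:
        return "FAKE"
    if p_fake <= lower:
        return "REAL"
    return "UNCERTAIN"


def out_of_source_feasible(sources, labels, held_out):
    """Assumed rule: holding out whole sources is only meaningful when both the
    held-out rows and the remaining training rows contain every label."""
    all_labels = set(labels)
    held = {label for source, label in zip(sources, labels) if source in held_out}
    kept = {label for source, label in zip(sources, labels) if source not in held_out}
    return len(all_labels) > 1 and held == all_labels and kept == all_labels
```

Widening the band trades coverage for caution: more inputs come back as UNCERTAIN, and fewer borderline scores are forced into REAL or FAKE. Setting `lower` equal to `upper` reduces the output to a plain two-way threshold.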

Reproducible & verified

Pipeline-embedded cleaning avoids train/serve skew, models load with checksum verification, and pytest with GitHub Actions guards the workflow.
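Checksum-verified loading can be sketched as follows, assuming a SHA-256 digest recorded at training time. The function names are hypothetical, and `pickle` stands in as the default loader so the example needs only the standard library; since the stack lists joblib, the project would presumably pass something like `joblib.load` as `loader`.

```python
import hashlib
import hmac
import pickle


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in chunks so large model artifacts are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_verified(path, expected_sha256, loader=None):
    """Deserialize the artifact only if its digest matches the recorded one."""
    actual = sha256_of(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    if loader is not None:
        return loader(path)
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

The digest is checked before any deserialization happens, which matters because unpickling an untrusted or corrupted file can execute arbitrary code. The check detects corruption and accidental file swaps; it offers no protection if the same party can replace both the model file and the recorded checksum.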

EXP-006 / stack

Technical stack

Python, pandas, NumPy, scikit-learn, Streamlit, matplotlib, joblib, pytest, Ruff, GitHub Actions

EXP-006 / limitations

Limitations and honesty check

  • Fake-news detection is context-dependent and vulnerable to domain shift.
  • The project is a learning and portfolio demo, not a moderation or fact-checking authority.
  • Real-world use would require source analysis, adversarial evaluation, human review, and continuous monitoring.

EXP-006 / next

Next improvements

  • Add stronger validation across sources and time periods.
  • Add calibration and uncertainty reporting.
  • Add model cards and dataset documentation.
  • Compare with transformer-based baselines.