EXP-009 / Deep Learning / Published

Image Captioning CNN-LSTM

An end-to-end image captioning project using a ResNet-50 CNN encoder and LSTM decoder in PyTorch, with vocabulary building, preprocessing, training, BLEU evaluation, inference, checkpoints, and visual outputs.

EXP-009 / question

Technical question

How can visual features be translated into natural-language descriptions through a reproducible multimodal pipeline?

EXP-009 / method

Method and workflow

  1. Use a pretrained ResNet-50 CNN encoder for visual feature extraction.
  2. Use an LSTM decoder with embeddings, dropout, and teacher forcing.
  3. Build a vocabulary and tokenize captions.
  4. Train with cross-entropy and Adam.
  5. Evaluate with BLEU-1 to BLEU-4 and save artifacts.
images → CNN encoder → visual features → LSTM decoder → captions → BLEU evaluation → inference
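
As an illustration of step 3, here is a minimal sketch of vocabulary building and caption encoding using only the standard library. The special-token names, the regex tokenizer, and the `min_freq` and `max_len` parameters are assumptions made for this example and are not taken from the repository.

```python
import re
from collections import Counter

# Special tokens; these names and their ids are assumptions for illustration.
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


def tokenize(caption):
    """Lowercase a caption and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", caption.lower())


def build_vocab(captions, min_freq=1):
    """Map every word seen at least min_freq times to an integer id."""
    counts = Counter(tok for cap in captions for tok in tokenize(cap))
    itos = [PAD, START, END, UNK]
    itos += sorted(w for w, c in counts.items() if c >= min_freq)
    return {w: i for i, w in enumerate(itos)}


def encode(caption, stoi, max_len=20):
    """Convert a caption to a fixed-length id list with start/end markers."""
    body = [stoi.get(tok, stoi[UNK]) for tok in tokenize(caption)]
    ids = [stoi[START]] + body[: max_len - 2] + [stoi[END]]
    return ids + [stoi[PAD]] * (max_len - len(ids))


captions = ["A dog runs on the grass.", "A cat sits on the mat."]
stoi = build_vocab(captions)
encoded = encode("A dog sits on the sofa", stoi, max_len=10)
print(encoded)
```

Words that never appeared in the training captions ("sofa" here) fall back to the unknown token, and padding keeps every sequence the same length so captions can be batched.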

EXP-009 / evidence

Evidence of work

Model design

The README lists a pretrained ResNet-50 encoder and an LSTM decoder with embeddings, dropout, and teacher forcing.

Evaluation

Validation computes BLEU-1 through BLEU-4 on generated captions and saves the resulting metrics.
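
To make the metric concrete, the sketch below computes corpus-level BLEU-1 through BLEU-4 in plain Python: clipped n-gram precision, uniform weights, and a brevity penalty. It assumes one reference per caption and applies no smoothing. Since the stack lists NLTK, the project may compute BLEU with `nltk.translate.bleu_score` instead; this version is a stand-in for illustration, and the example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Corpus BLEU with uniform weights over 1..max_n grams."""
    matched = [0] * max_n
    total = [0] * max_n
    ref_len = hyp_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_len += len(ref)
        hyp_len += len(hyp)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clip each hypothesis n-gram by its count in the reference.
            matched[n - 1] += sum(
                min(c, ref_counts[g]) for g, c in hyp_counts.items()
            )
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # no smoothing: any zero precision gives a zero score
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_precision)


references = ["a dog runs on the grass".split()]
hypotheses = ["a dog runs on grass".split()]
scores = {
    f"BLEU-{n}": corpus_bleu(references, hypotheses, max_n=n)
    for n in range(1, 5)
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this example every hypothesis word appears in the reference, so the unigram precision is 1 and BLEU-1 is reduced only by the brevity penalty; the higher-order scores drop further because the missing word "the" breaks several longer n-grams.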

Reproducibility

The repository includes preprocessing, training, and inference scripts, along with saved checkpoints, the vocabulary, curves, and metrics.
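
A minimal sketch of how run artifacts might be persisted so a result can be traced back to its settings is shown below. The file names `vocab.json` and `metrics.json`, the function name, and the recorded seed are hypothetical, and the metric value is a placeholder rather than a reported result. Model checkpoints would be saved separately, for example with `torch.save`.

```python
import json
import random
import tempfile
from pathlib import Path


def save_artifacts(out_dir, vocab, metrics, seed):
    """Write the vocabulary and metrics as JSON; return the file names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "vocab.json").write_text(json.dumps(vocab, indent=2, sort_keys=True))
    # Store the seed next to the metrics so the run can be repeated.
    record = {"seed": seed, **metrics}
    (out / "metrics.json").write_text(json.dumps(record, indent=2))
    return sorted(p.name for p in out.iterdir())


seed = 42
random.seed(seed)  # a PyTorch run would also call torch.manual_seed(seed)

with tempfile.TemporaryDirectory() as tmp:
    placeholder_metrics = {"BLEU-1": 0.0}  # placeholder, not a real score
    written = save_artifacts(tmp, {"<pad>": 0, "a": 1}, placeholder_metrics, seed)
    reloaded = json.loads((Path(tmp) / "metrics.json").read_text())

print(written)
```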

EXP-009 / stack

Technical stack

Python, PyTorch, Torchvision, ResNet-50, LSTM, Pillow, NLTK, tqdm, Matplotlib, pytest

EXP-009 / limitations

Limitations and honesty check

  • Toy datasets are useful for testing but not enough to prove real-world caption quality.
  • Captioning systems should be evaluated for hallucination, bias, and failure cases, not only BLEU.
  • Transformer-based encoder-decoder models would be a stronger modern baseline.

EXP-009 / next

Next improvements

  • Add qualitative failure-case analysis.
  • Compare against transformer captioning baselines.
  • Add attention visualization.
  • Train on a larger benchmark dataset.