Image Captioning CNN-LSTM | honardoust.codes

EXP-009 / question

Technical question

How can visual features be translated into natural-language descriptions through a reproducible multimodal pipeline?

EXP-009 / method

images → CNN encoder → visual features → LSTM decoder → captions → BLEU evaluation → inference

EXP-009 / evidence

Model design

The README lists a pretrained ResNet-50 encoder and an LSTM decoder that uses word embeddings, dropout, and teacher forcing.
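Teacher forcing means the decoder is fed the ground-truth previous token at each training step instead of its own prediction. A minimal sketch of how the input and target sequences could be built for one caption is below. The special tokens, the whitespace tokenizer, and the helper names `build_vocab` and `teacher_forcing_pair` are assumptions for illustration, not taken from the repository; some CNN-LSTM designs feed the image feature as the first decoder input instead of a start token.

```python
# Hypothetical special tokens; the repository's vocabulary may differ.
START, END, PAD, UNK = "<start>", "<end>", "<pad>", "<unk>"


def build_vocab(captions):
    """Map each token to an integer id, reserving ids 0-3 for special tokens."""
    vocab = {PAD: 0, START: 1, END: 2, UNK: 3}
    for caption in captions:
        for token in caption.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def teacher_forcing_pair(caption, vocab):
    """Return (decoder_inputs, targets) for one caption.

    The inputs are the caption shifted right by one position, so at
    step t the decoder sees the true token t-1 and is trained to
    predict the true token t.
    """
    ids = [vocab.get(token, vocab[UNK]) for token in caption.lower().split()]
    decoder_inputs = [vocab[START]] + ids
    targets = ids + [vocab[END]]
    return decoder_inputs, targets


captions = ["a dog runs", "a cat sleeps"]
vocab = build_vocab(captions)
inputs, targets = teacher_forcing_pair("a dog runs", vocab)
print(inputs)   # ids fed to the decoder
print(targets)  # ids the decoder should predict
```

In a PyTorch training loop, `inputs` would typically pass through the embedding layer and the LSTM, and the cross-entropy loss would be computed against `targets`.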

Evaluation

Validation computes BLEU-1 through BLEU-4 on generated captions and saves the resulting metrics.
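To make the metric concrete, here is a small pure-Python sketch of BLEU-n for a single reference caption: clipped n-gram precision, a geometric mean with uniform weights, and a brevity penalty. It is an illustration only; the stack lists NLTK, so the repository presumably relies on `nltk.translate.bleu_score` instead, and the function names and example captions below are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(reference, candidate, max_n):
    """BLEU with uniform weights over 1..max_n grams and one reference.

    No smoothing is applied, so any n-gram order with zero overlap
    makes the whole score 0.0.
    """
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or overlap == 0:
            return 0.0
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty discourages captions shorter than the reference.
    if len(candidate) >= len(reference):
        brevity = 1.0
    else:
        brevity = math.exp(1 - len(reference) / len(candidate))
    return brevity * math.exp(sum(log_precisions) / max_n)


reference = "a dog runs on the grass".split()
candidate = "a dog runs on grass".split()
scores = {f"BLEU-{n}": bleu(reference, candidate, n) for n in range(1, 5)}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Scores drop as n grows because longer n-grams are harder to match, which is why BLEU-4 is usually the strictest of the four numbers reported.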

Reproducibility

The repository includes preprocessing, training, and inference scripts, along with checkpoints, the vocabulary, curves, and metrics.

EXP-009 / stack

Python, PyTorch, Torchvision, ResNet-50, LSTM, Pillow, NLTK, tqdm, Matplotlib, pytest


EXP-009 / limitations

Toy datasets are useful for testing but not enough to prove real-world caption quality.
Captioning systems should be evaluated for hallucination, bias, and failure cases, not only BLEU.
Transformer-based encoder-decoder models would be a stronger modern baseline.

EXP-009 / next