EXP-009 / question
Technical question
How can visual features be translated into natural-language descriptions through a reproducible multimodal pipeline?
EXP-009 / method
Method and workflow
- Use a pretrained ResNet-50 CNN encoder for visual feature extraction.
- Use an LSTM decoder with embeddings, dropout, and teacher forcing.
- Build a vocabulary and tokenize captions.
- Train with cross-entropy loss and the Adam optimizer.
- Evaluate with BLEU-1 to BLEU-4 and save artifacts.
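The vocabulary and tokenization step can be sketched in plain Python as follows. This is a minimal illustration, not the repository's code: the special tokens (`<pad>`, `<start>`, `<end>`, `<unk>`), the function names, and the `min_freq` parameter are assumptions made for the example.

```python
from collections import Counter

# Special tokens are assumptions for this sketch, not taken from the repository.
PAD, START, END, UNK = "<pad>", "<start>", "<end>", "<unk>"


def tokenize(caption):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in caption.lower())
    return cleaned.split()


def build_vocab(captions, min_freq=1):
    """Map each sufficiently frequent token to an integer id (special tokens first)."""
    counts = Counter(tok for caption in captions for tok in tokenize(caption))
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: i for i, tok in enumerate([PAD, START, END, UNK] + kept)}


def encode(caption, stoi):
    """Convert a caption to ids, wrapping it in start/end markers."""
    ids = [stoi.get(tok, stoi[UNK]) for tok in tokenize(caption)]
    return [stoi[START]] + ids + [stoi[END]]


captions = ["A dog runs.", "A cat sits."]
stoi = build_vocab(captions)
encoded = encode("A dog sits quietly.", stoi)  # "quietly" is unseen, so it maps to <unk>
```

Sorting the kept tokens makes the id assignment deterministic across runs, which helps when a saved vocabulary has to match a saved checkpoint.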
images → CNN encoder → visual features → LSTM decoder → captions → BLEU evaluation → inference
EXP-009 / evidence
Evidence of work
Model design
The README lists a pretrained ResNet-50 encoder and LSTM decoder with embeddings, dropout, and teacher forcing.
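To make teacher forcing concrete: during training the decoder receives the ground-truth caption shifted by one position, and the loss at each step is the cross-entropy against the next true token. The sketch below shows that shift and a single-step cross-entropy in plain Python, without PyTorch. The function names and the token ids are hypothetical and chosen only for illustration.

```python
import math


def teacher_forcing_pairs(token_ids):
    """Decoder inputs drop the last token; targets are the same sequence shifted by one."""
    return token_ids[:-1], token_ids[1:]


def cross_entropy(logits, target):
    """Negative log-probability of `target` under a softmax over `logits`."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


# Hypothetical encoded caption: <start>=1, three word ids, <end>=2.
caption_ids = [1, 4, 6, 8, 2]
inputs, targets = teacher_forcing_pairs(caption_ids)

# An untrained model with uniform logits over 9 tokens has loss log(9) at every step.
uniform_loss = cross_entropy([0.0] * 9, targets[0])
```

At inference time there is no ground-truth caption, so the decoder instead feeds back its own previous prediction.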
Evaluation
Validation computes BLEU-1 through BLEU-4 and saves the resulting metrics.
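For reference, BLEU-N combines clipped n-gram precisions for n = 1..N with a brevity penalty. The following is a hand-rolled, single-reference, unsmoothed sketch written to show the arithmetic. It is not the repository's evaluation code, which may rely on NLTK and different smoothing; the example sentences are invented.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(reference, candidate, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def bleu(reference, candidate, max_n):
    """Unsmoothed sentence-level BLEU with uniform weights over 1..max_n."""
    precisions = [modified_precision(reference, candidate, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "a dog runs on the grass".split()
candidate = "a dog runs on grass".split()
scores = {f"BLEU-{n}": bleu(reference, candidate, n) for n in range(1, 5)}
```

Because higher-order n-grams match less often, the scores in this example decrease from BLEU-1 to BLEU-4. On short captions the unsmoothed variant often returns zero for BLEU-4, which is one reason smoothing is commonly applied.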
Reproducibility
The repo structure includes preprocessing, training, inference scripts, checkpoints, vocabulary, curves, and metrics.
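One way to keep runs comparable is to write the vocabulary and metrics to a per-run directory as sorted JSON. The sketch below assumes hypothetical file names (`vocab.json`, `metrics.json`) and a hypothetical `run_001` directory; the repository's actual layout may differ. The metric values are placeholders, not measured results.

```python
import json
import tempfile
from pathlib import Path


def save_artifacts(out_dir, vocab, metrics):
    """Write vocabulary and metrics as sorted, indented JSON so diffs stay stable."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "vocab.json").write_text(json.dumps(vocab, indent=2, sort_keys=True))
    (out / "metrics.json").write_text(json.dumps(metrics, indent=2, sort_keys=True))
    return out


def load_json(path):
    """Read a JSON artifact back into a Python object."""
    return json.loads(Path(path).read_text())


# Placeholder values only; these are not results from the project.
placeholder_vocab = {"<pad>": 0, "<unk>": 1}
placeholder_metrics = {"BLEU-1": 0.0, "BLEU-4": 0.0}

with tempfile.TemporaryDirectory() as tmp:
    out = save_artifacts(Path(tmp) / "run_001", placeholder_vocab, placeholder_metrics)
    restored = load_json(out / "metrics.json")
    saved_files = sorted(p.name for p in out.iterdir())
```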
EXP-009 / stack
Technical stack
Python, PyTorch, Torchvision, ResNet-50, LSTM, Pillow, NLTK, tqdm, Matplotlib, pytest
EXP-009 / limitations
Limitations and honesty check
- Toy datasets are useful for testing the pipeline but are not enough to demonstrate real-world caption quality.
- Captioning systems should be evaluated for hallucination, bias, and failure cases, not only BLEU.
- Transformer-based encoder-decoder models would be a stronger modern baseline.
EXP-009 / next
Next improvements
- Add qualitative failure-case analysis.
- Compare against transformer captioning baselines.
- Add attention visualization.
- Train on a larger benchmark dataset.