LSTM Text Embedding: Neural Sentence Representations
Keras LSTM model for learning word and text representations

Project Details
- Project Type: Research Implementation
- Domain: Natural Language Processing
- Technologies: Python, Keras, LSTM, Neural Networks
- GitHub: https://github.com/dkaterenchuk/lstm_text_embedding
- Stars: 1
- License: MIT
A Keras implementation of LSTM-based text embedding models that learn sentence representations through a neural autoencoder architecture.
Overview
This project implements sentence embeddings using LSTM (Long Short-Term Memory) autoencoders. The model learns to encode sentences into fixed-length vectors that capture semantic meaning, which can be used for various downstream NLP tasks or as part of larger machine learning pipelines.
Key Features
- LSTM autoencoder architecture - Neural network approach to learning sentence representations
- Custom word embedding support - Train on your own corpus or use pretrained embeddings
- Flexible integration - Use as standalone embedding system or integrate into larger pipelines
- Wikipedia training data - Includes tools for processing Wikipedia dumps
- Resumable training - Continue training from saved model checkpoints
Architecture
The model uses LSTM layers to:
- Encode input sentences into fixed-length vector representations
- Decode these representations back to the original sentences
- Learn meaningful semantic features through this reconstruction process
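A minimal sketch of this encode/decode setup in Keras is shown below; the vocabulary size, sequence length, and layer dimensions are illustrative assumptions, not the repository's actual values.

```python
# Minimal LSTM autoencoder sketch for sentence embedding.
# All hyperparameters here are illustrative assumptions.
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, RepeatVector, TimeDistributed, Dense

VOCAB_SIZE = 10000   # size of the word index (assumption)
MAX_LEN = 30         # maximum sentence length in tokens (assumption)
EMBED_DIM = 100      # word embedding dimension (assumption)
LATENT_DIM = 256     # size of the fixed-length sentence vector (assumption)

# Encoder: token ids -> word embeddings -> one fixed-length vector
inputs = Input(shape=(MAX_LEN,))
embedded = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(inputs)
sentence_vector = LSTM(LATENT_DIM)(embedded)

# Decoder: repeat the sentence vector at every timestep and
# predict the original token at each position
repeated = RepeatVector(MAX_LEN)(sentence_vector)
decoded = LSTM(LATENT_DIM, return_sequences=True)(repeated)
outputs = TimeDistributed(Dense(VOCAB_SIZE, activation="softmax"))(decoded)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# After training, the encoder half alone maps a sentence to its embedding
encoder = Model(inputs, sentence_vector)
```

Training the full model on the reconstruction objective and then keeping only the encoder is what turns an autoencoder into an embedding system.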
Installation & Setup
Dependencies:
- Keras
- Pretrained word embeddings or tools to train your own
- wikiextractor (for processing Wikipedia XML data)
Configuration:
Configure paths in definitions.py to specify locations for embeddings and training data.
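For illustration, a path-configuration module of this kind usually just pins down file locations; the variable names below are hypothetical, not the repository's actual constants.

```python
# definitions.py (hypothetical sketch; variable names are assumptions)
import os

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))

WORD_EMBEDDING_PATH = os.path.join(ROOT_DIR, "trained_models", "word2vec.model")
WIKI_DATA_DIR = os.path.join(ROOT_DIR, "data", "wiki")
LSTM_MODEL_PATH = os.path.join(ROOT_DIR, "trained_models", "lstm_embedding.h5")
```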
Data preparation:
The project includes sample Wikipedia data in the data/wiki/ directory. Use wikiextractor to process XML dumps into JSON format.
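With wikiextractor installed, a typical invocation looks like the following (exact flags vary between wikiextractor versions, so check its --help output):
python -m wikiextractor.WikiExtractor --json -o data/wiki/ <wiki_dump.xml.bz2>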
Usage
Train word embeddings:
python train_word_embedding.py <data_path> <output_model>
Train LSTM embedding model:
python train_lstm_embedding.py <data_dir> <output_file>
Continue training from checkpoint:
python continue_train_lstm.py <checkpoint_file>
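Under the hood, resuming from a checkpoint follows the standard Keras pattern of reloading a saved model; a minimal sketch, assuming an HDF5 checkpoint file, is:

```python
# Resume-from-checkpoint sketch using standard Keras tooling.
# The checkpoint path is a placeholder, not the repository's actual file.
from keras.models import load_model

model = load_model("trained_models/lstm_checkpoint.h5")
model.summary()  # inspect the restored architecture before calling fit() again
```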
Project Structure
- /code - Implementation files and training scripts
- /data - Training datasets, including Wikipedia samples
- /trained_models - Saved model outputs and checkpoints
Applications
- Semantic similarity - Measure similarity between sentences
- Text classification - Use embeddings as features for classification tasks
- Information retrieval - Find semantically similar documents
- Sentence clustering - Group similar sentences together
- Transfer learning - Use pretrained embeddings for downstream tasks
Technical Approach
The LSTM autoencoder learns to compress sentences into dense vector representations while preserving semantic information. This unsupervised approach creates embeddings that capture meaning without requiring labeled data.
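As a sketch of the downstream use mentioned under Applications, semantic similarity between two embedded sentences is typically scored with cosine similarity; random vectors stand in for real encoder outputs so the snippet runs on its own.

```python
# Cosine similarity between two sentence embeddings.
# With a trained model, the vectors would come from encoder.predict(...);
# random vectors are used here only to keep the example self-contained.
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = np.random.rand(256)
emb_b = np.random.rand(256)
print(cosine_similarity(emb_a, emb_b))
```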
Resources
- GitHub Repository: lstm_text_embedding
- License: MIT
- Framework: Keras with LSTM layers
This implementation gives researchers and practitioners a straightforward toolkit for creating neural text embeddings with an LSTM autoencoder architecture for a range of NLP applications.