LSTM Text Embedding: Neural Sentence Representations
Keras LSTM model for learning word and text representations

Project Details
- Project Type: Research Implementation
- Domain: Natural Language Processing
- Technologies: Python, Keras, LSTM, Neural Networks
- GitHub: https://github.com/dkaterenchuk/lstm_text_embedding
- Stars: 1
- License: MIT
A Keras implementation of LSTM-based text embedding models that learn sentence representations through a neural autoencoder architecture.
Overview
This project implements sentence embeddings using LSTM (Long Short-Term Memory) autoencoders. The model learns to encode sentences into fixed-length vectors that capture semantic meaning, which can be used for various downstream NLP tasks or as part of larger machine learning pipelines.
Key Features
- LSTM autoencoder architecture - Neural network approach to learning sentence representations
- Custom word embedding support - Train on your own corpus or use pretrained embeddings
- Flexible integration - Use as standalone embedding system or integrate into larger pipelines
- Wikipedia training data - Includes tools for processing Wikipedia dumps
- Resumable training - Continue training from saved model checkpoints
Architecture
The model uses LSTM layers to:
- Encode input sentences into fixed-length vector representations
- Decode these representations back to the original sentences
- Learn meaningful semantic features through this reconstruction process
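A minimal sketch of this encode/decode setup in Keras is shown below; the vocabulary size, sequence length, and layer dimensions are illustrative assumptions, not the repository's actual values.

```python
# Minimal LSTM autoencoder sketch for sentence embedding.
# All hyperparameters here are illustrative assumptions.
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, RepeatVector, TimeDistributed, Dense

VOCAB_SIZE = 10000   # size of the word index (assumption)
MAX_LEN = 30         # maximum sentence length in tokens (assumption)
EMBED_DIM = 100      # word embedding dimension (assumption)
LATENT_DIM = 256     # size of the fixed-length sentence vector (assumption)

# Encoder: token ids -> word embeddings -> one fixed-length vector
inputs = Input(shape=(MAX_LEN,))
embedded = Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(inputs)
sentence_vector = LSTM(LATENT_DIM)(embedded)

# Decoder: repeat the sentence vector at every timestep and
# predict the original token at each position
repeated = RepeatVector(MAX_LEN)(sentence_vector)
decoded = LSTM(LATENT_DIM, return_sequences=True)(repeated)
outputs = TimeDistributed(Dense(VOCAB_SIZE, activation="softmax"))(decoded)

autoencoder = Model(inputs, outputs)
autoencoder.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# After training, the encoder half alone maps a sentence to its embedding
encoder = Model(inputs, sentence_vector)
```

Training the full model on the reconstruction objective and then keeping only the encoder is what turns an autoencoder into an embedding system.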
Installation & Setup
Dependencies:
- Keras
- Pretrained word embeddings or tools to train your own
- wikiextractor (for processing Wikipedia XML data)
Configuration:
Configure paths in definitions.py to specify locations for embeddings and training data.
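For illustration, a path-configuration module of this kind usually just pins down file locations; the variable names below are hypothetical, not the repository's actual constants.

```python
# definitions.py (hypothetical sketch; variable names are assumptions)
import os

ROOT_DIR = os.path.dirname(os.path.abspath(__file__))

WORD_EMBEDDING_PATH = os.path.join(ROOT_DIR, "trained_models", "word2vec.model")
WIKI_DATA_DIR = os.path.join(ROOT_DIR, "data", "wiki")
LSTM_MODEL_PATH = os.path.join(ROOT_DIR, "trained_models", "lstm_embedding.h5")
```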
Data preparation:
The project includes sample Wikipedia data in the data/wiki/ directory. Use wikiextractor to process XML dumps into JSON format.
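With wikiextractor installed, a typical invocation looks like the following (exact flags vary between wikiextractor versions, so check its --help output):
python -m wikiextractor.WikiExtractor --json -o data/wiki/ <wiki_dump.xml.bz2>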
Usage
Train word embeddings:
python train_word_embedding.py <data_path> <output_model>
Train LSTM embedding model:
python train_lstm_embedding.py <data_dir> <output_file>
Continue training from checkpoint:
python continue_train_lstm.py <checkpoint_file>
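Under the hood, resuming from a checkpoint follows the standard Keras pattern of reloading a saved model; a minimal sketch, assuming an HDF5 checkpoint file, is:

```python
# Resume-from-checkpoint sketch using standard Keras tooling.
# The checkpoint path is a placeholder, not the repository's actual file.
from keras.models import load_model

model = load_model("trained_models/lstm_checkpoint.h5")
model.summary()  # inspect the restored architecture before calling fit() again
```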
Project Structure
- /code - Implementation files and training scripts
- /data - Training datasets, including Wikipedia samples
- /trained_models - Saved model outputs and checkpoints
Applications
- Semantic similarity - Measure similarity between sentences
- Text classification - Use embeddings as features for classification tasks
- Information retrieval - Find semantically similar documents
- Sentence clustering - Group similar sentences together
- Transfer learning - Use pretrained embeddings for downstream tasks
Technical Approach
The LSTM autoencoder learns to compress sentences into dense vector representations while preserving semantic information. This unsupervised approach creates embeddings that capture meaning without requiring labeled data.
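As a sketch of the downstream use mentioned under Applications, semantic similarity between two embedded sentences is typically scored with cosine similarity; random vectors stand in for real encoder outputs so the snippet runs on its own.

```python
# Cosine similarity between two sentence embeddings.
# With a trained model, the vectors would come from encoder.predict(...);
# random vectors are used here only to keep the example self-contained.
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a = np.random.rand(256)
emb_b = np.random.rand(256)
print(cosine_similarity(emb_a, emb_b))
```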
Resources
- GitHub Repository: lstm_text_embedding
- License: MIT
- Framework: Keras with LSTM layers
This implementation gives researchers and practitioners a straightforward toolkit for creating neural text embeddings with an LSTM autoencoder architecture for a range of NLP applications.