py_anything2vec: Custom Embedding Training
A library to train your own word2vec, doc2vec or any other 'data'2vec models

Project Details
- Project Type : Open Source Library
- Domain : Natural Language Processing
- Technologies : Python, Gensim, Word2Vec
- GitHub : https://github.com/dkaterenchuk/py_anything2vec
- Stars : 8
- License : MIT
A Python library designed to train custom word2vec, doc2vec, and other embedding models on user-provided data using Gensim as the backend.
Overview
While pre-trained word and document embedding models are widely available, many applications require custom models trained on domain-specific data. This library provides simple scripts to train custom vector representations tailored to your specific corpus and use case.
Core Features
- Word2Vec model training - Train custom word embeddings on your text data
- Doc2Vec model training - Generate document-level vector representations
- Customizable dimensionality - Control the size of your embedding vectors
- Gensim backend - Built on the robust Gensim library
- Simple command-line interface - Easy to use with minimal setup
Installation
Install the required dependencies:
pip install gensim
pip install scipy
pip install numpy
Download or clone the repository and use the training scripts directly.
Usage
The command-line interface is straightforward:
python train_word2vec.py <doc_path.txt> <output_path.model> [n_dimensions]
Input format requirements:
- One sentence per line
- Lowercase words separated by single spaces
- No punctuation
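Raw text rarely arrives in this shape, so a small preprocessing step can help. The following is a minimal sketch, not part of the library: the function names (normalize_sentence, to_training_lines, write_corpus) are hypothetical, the sentence splitter is a deliberately naive regular expression, and only ASCII punctuation is stripped.

```python
import re
import string

# Naive sentence boundary: whitespace that follows ".", "!" or "?".
# Assumption: good enough for illustration; abbreviations will be mis-split.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_sentence(sentence):
    """Lowercase, remove ASCII punctuation, and collapse runs of whitespace."""
    cleaned = sentence.lower().translate(_PUNCT_TABLE)
    return " ".join(cleaned.split())


def to_training_lines(raw_text):
    """Return one normalized sentence per list item, skipping empty results."""
    lines = []
    for sentence in _SENTENCE_END.split(raw_text.strip()):
        normalized = normalize_sentence(sentence)
        if normalized:
            lines.append(normalized)
    return lines


def write_corpus(raw_text, out_path):
    """Write the normalized sentences to out_path, one per line."""
    with open(out_path, "w", encoding="utf-8") as handle:
        for line in to_training_lines(raw_text):
            handle.write(line + "\n")
```

Note that stripping punctuation this way also removes apostrophes and hyphens (for example, "don't" becomes "dont"), which may or may not suit your corpus.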
Example commands:
# Train with 2 dimensions
python train_word2vec.py test_data/test.txt test_data/test.model 2
# Train with default dimensions
python train_word2vec.py test_data/test.txt test_data/test.model
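For readers who want to see roughly what such a command involves, here is a hedged sketch of an equivalent training routine written directly against Gensim. It is not the repository's actual train_word2vec.py. The default of 100 dimensions is an assumption (it is Gensim's own default, not a value confirmed for this script), min_count=1 is chosen here so that tiny test corpora keep all their words, and the vector_size parameter name applies to Gensim 4.x (older releases call it size).

```python
DEFAULT_DIMENSIONS = 100  # Assumption: Gensim's default, not confirmed for the real script.

USAGE = "usage: <doc_path.txt> <output_path.model> [n_dimensions]"


def parse_args(argv):
    """Turn a list of CLI arguments into (doc_path, output_path, n_dimensions)."""
    if len(argv) not in (2, 3):
        raise SystemExit(USAGE)
    doc_path, output_path = argv[0], argv[1]
    n_dimensions = int(argv[2]) if len(argv) == 3 else DEFAULT_DIMENSIONS
    return doc_path, output_path, n_dimensions


def train(doc_path, output_path, n_dimensions):
    """Train a Word2Vec model on a one-sentence-per-line file and save it."""
    # Imported lazily so that argument parsing works without Gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # LineSentence streams the file, splitting each line on whitespace.
    sentences = LineSentence(doc_path)
    # min_count=1 keeps every word; Gensim's default of 5 would drop rare ones.
    model = Word2Vec(sentences=sentences, vector_size=n_dimensions, min_count=1)
    model.save(output_path)
    return model


def main(argv):
    """Entry point: parse the arguments, then train and save the model."""
    return train(*parse_args(argv))
```

Calling main(sys.argv[1:]) from a script would mirror the command-line form shown above. A model saved this way can be reloaded with Word2Vec.load(output_path), and individual vectors read from model.wv.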
Use Cases
- Domain-specific word embeddings (medical, legal, technical domains)
- Custom document similarity models
- Specialized recommendation systems
- Text classification with custom features
- Information retrieval on proprietary corpora
Resources
- GitHub Repository: py_anything2vec (8+ stars)
- License: MIT
- Backend: Gensim library
The library gives researchers and practitioners a set of scripts for training embedding models on their own data rather than relying on pre-trained vectors.