py_anything2vec: Custom Embedding Training

A library to train your own word2vec, doc2vec or any other 'data'2vec models

Project Details

Project Type : Open Source Library
Domain : Natural Language Processing
Technologies : Python, Gensim, Word2Vec
GitHub : https://github.com/dkaterenchuk/py_anything2vec
Stars : 8
License : MIT

py_anything2vec: Custom Embedding Training

A Python library designed to train custom word2vec, doc2vec, and other embedding models on user-provided data using Gensim as the backend.

Overview

While pre-trained word and document embedding models are widely available, many applications require custom models trained on domain-specific data. This library provides simple scripts to train custom vector representations tailored to your specific corpus and use case.

Core Features

Word2Vec model training - Train custom word embeddings on your text data
Doc2Vec model training - Generate document-level vector representations
Customizable dimensionality - Control the size of your embedding vectors
Gensim backend - Built on the robust Gensim library
Simple command-line interface - Easy to use with minimal setup

Installation

Install the required dependencies:

pip install gensim
pip install scipy
pip install numpy

Download or clone the repository and use the training scripts directly.

Usage

The command-line interface is straightforward:

python train_word2vec.py <doc_path.txt> <output_path.model> [n_dimensions]

Input format requirements:

One sentence per line
Lower case words separated by spaces
No punctuation

Example commands:

# Train with 2 dimensions
python train_word2vec.py test_data/test.txt test_data/test.model 2

# Train with default dimensions
python train_word2vec.py test_data/test.txt test_data/test.model

Use Cases

Domain-specific word embeddings (medical, legal, technical domains)
Custom document similarity models
Specialized recommendation systems
Text classification with custom features
Information retrieval on proprietary corpora

Resources

GitHub Repository: py_anything2vec (8+ stars)
License: MIT
Backend: Gensim library

This library provides researchers and practitioners with flexible tools to create custom embedding models tailored to their specific data and applications.