py_anything2vec: Custom Embedding Training

A library to train your own word2vec, doc2vec or any other 'data'2vec models

Project Details

py_anything2vec: Custom Embedding Training

A Python library designed to train custom word2vec, doc2vec, and other embedding models on user-provided data using Gensim as the backend.

Overview

While pre-trained word and document embedding models are widely available, many applications require custom models trained on domain-specific data. This library provides simple scripts to train custom vector representations tailored to your specific corpus and use case.

Core Features

  • Word2Vec model training - Train custom word embeddings on your text data
  • Doc2Vec model training - Generate document-level vector representations
  • Customizable dimensionality - Control the size of your embedding vectors
  • Gensim backend - Built on the robust Gensim library
  • Simple command-line interface - Easy to use with minimal setup

Installation

Install the required dependencies:

pip install gensim
pip install scipy
pip install numpy

Download or clone the repository and use the training scripts directly.

Usage

The command-line interface is straightforward:

python train_word2vec.py <doc_path.txt> <output_path.model> [n_dimensions]

Input format requirements:

  • One sentence per line
  • Lower case words separated by spaces
  • No punctuation

Example commands:

# Train with 2 dimensions
python train_word2vec.py test_data/test.txt test_data/test.model 2

# Train with default dimensions
python train_word2vec.py test_data/test.txt test_data/test.model

Use Cases

  • Domain-specific word embeddings (medical, legal, technical domains)
  • Custom document similarity models
  • Specialized recommendation systems
  • Text classification with custom features
  • Information retrieval on proprietary corpora

Resources

  • GitHub Repository: py_anything2vec (8+ stars)
  • License: MIT
  • Backend: Gensim library

This library provides researchers and practitioners with flexible tools to create custom embedding models tailored to their specific data and applications.