GitHub - YatinChaudhary/Multi-view-Multi-source-Topic-Modeling: Implementation of NAACL2021 accepted paper: "Multi-source Neural Topic Modeling in Multi-view Embedding Spaces"

About

This repository contains implementations of the models proposed in the paper "Multi-source Multi-view Transfer Learning in Neural Topic Modeling with Pretrained Topic and Word Embeddings", accepted at NAACL 2021.

@article{gupta2021multi,
title={Multi-source Neural Topic Modeling in Multi-view Embedding Spaces},
author={Gupta, Pankaj and Chaudhary, Yatin and Sch{\"u}tze, Hinrich},
journal={arXiv preprint arXiv:2104.08551},
year={2021}
}

NOTE: This code has been built upon the DocNADEe code.

Requirements

Requires Python 3 (tested with 3.6.5). The remaining dependencies can then be installed via:

$ pip install -r requirements.txt
$ python -c "import nltk; nltk.download('all')"

NOTE: Install the exact dependency versions listed in requirements.txt; the code may not work correctly with other versions.

Data format

"datasets" directory contains different sub-directories for different datasets. Each sub-directory contains CSV format files for training, validation and test sets. The CSV files in the directory must be named accordingly: "training_docnade.csv", "validation_docnade.csv", "test_docnade.csv". For this task, each CSV file (prior to preprocessing) consists of 2 string fields with a comma delimiter - the first is the label and the second is the document body (in bag-of-words representation). Each sub-directory also contains vocabulary file named "vocab_docnade.vocab", with 1 vocabulary token per line.

How to use

The script train_DocNADE_MVT_MST.py trains the DocNADE-MVT model and saves it in a model directory based on perplexity per word (PPL) or information retrieval (IR) performance. It also logs all training information in the same model folder. Here's how to use the script:

$ python train_DocNADE_MVT_MST.py --dataset --docnadeVocab --model --num-cores --use-glove-prior --use-fasttext-prior --lambda-glove --activation --use-embeddings-prior --lambda-embeddings --lambda-embeddings-list --learning-rate --batch-size --num-steps --log-every --validation-bs --test-bs --validation-ppl-freq --validation-ir-freq --test-ir-freq --test-ppl-freq --num-classes --multi-label --patience --hidden-size --vocab-size --reload --reload-model-dir --W-old-path-list --W-old-vocab-path-list --gvt-loss --gvt-lambda --gvt-lambda-init --projection --concat-projection --concat-projection-lambda

Option dataset is the path to the input dataset.

Option docnadeVocab is the path to vocabulary file used by DocNADE.

Option model is the path to model output directory.

Option use-glove-prior is whether to include the GloVe embedding prior or not.

Option use-fasttext-prior is whether to include the fastText embedding prior or not.

Option lambda-glove is the lambda value for GloVe embeddings.

Option learning-rate is the learning rate.

Option batch-size is the batch size for training data.

Option num-steps is the number of steps to train for.

Option log-every is to print training loss after this many steps.

Option validation-bs is the batch size for validation evaluation.

Option test-bs is the batch size for test evaluation.

Option validation-ppl-freq is to evaluate validation PPL and NLL after this many steps.

Option validation-ir-freq is to evaluate validation IR after this many steps.

Option test-ir-freq is to evaluate test IR after this many steps.

Option test-ppl-freq is to evaluate test PPL and NLL after this many steps.

Option num-classes is the number of classes.

Option patience is the patience for the early stopping criterion.

Option hidden-size is the size of the hidden layer.

Option activation is which activation to use: sigmoid|tanh|relu.

Option vocab-size is the vocabulary size.

Option projection is whether to use projection matrix A or not.

Option reload is whether to reload model or not.

Option reload-model-dir is the path of the directory from which the model is reloaded.

Option use-embeddings-prior is whether to use embeddings prior E from source dataset or not.

Option lambda-embeddings is whether lambda for LVT is static or trainable: manual|automatic.

Option lambda-embeddings-list is a list of lambda parameters for E.

Option W-old-path-list is the list of paths of source topic matrices Z.

Option W-old-vocab-path-list is the list of paths of the source dataset vocabularies.

Option gvt-loss is whether to include the topic matrix Z or not.

Option gvt-lambda is whether gamma for Z is static or trainable: manual|automatic.

Option gvt-lambda-init is the value of the gamma parameter for topic transfer using Z matrices.

Option concat-projection is whether to concatenate prior embeddings or not.

Option concat-projection-lambda is the value of lambda (weight) before adding projected prior embeddings into DocNADE.

Option use-bert-prior is whether to use BERT contextualized embeddings as a prior or not.

Option bert-reps-path is the path to the BERT contextualized embeddings.

Local View Transfer (LVT):

set parameter ``use-embeddings-prior`` to True
set parameter ``lambda-embeddings-list`` for lambda parameter of LVT accordingly

Global View Transfer (GVT):

set parameter ``gvt-loss`` to True
set parameter ``gvt-lambda-init`` for gamma parameter of GVT accordingly

Multi View Transfer (MVT):

set parameters for LVT and GVT as mentioned above
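The three settings above can be sketched as a small helper that assembles the corresponding command-line flags. This is an illustration only: the helper `transfer_flags` is hypothetical, the numeric values are placeholders rather than recommended settings, and passing list values as separate space-delimited tokens is an assumption about how the training script parses its arguments. The sample script shipped with the code shows the real invocation.

```python
def transfer_flags(mode, lvt_lambdas=None, gvt_gammas=None):
    """Build an argument list for one transfer mode: 'LVT', 'GVT' or 'MVT'."""
    if mode not in {"LVT", "GVT", "MVT"}:
        raise ValueError(f"unknown mode: {mode}")
    args = []
    if mode in {"LVT", "MVT"}:
        # Local view: word-embedding prior E plus its lambda weights.
        args += ["--use-embeddings-prior", "True"]
        # Assumption: list values are passed as space-delimited tokens.
        args += ["--lambda-embeddings-list", *map(str, lvt_lambdas or [])]
    if mode in {"GVT", "MVT"}:
        # Global view: topic matrices Z plus their gamma weights.
        args += ["--gvt-loss", "True"]
        args += ["--gvt-lambda-init", *map(str, gvt_gammas or [])]
    return args


# Placeholder values; MVT simply combines the LVT and GVT flags.
command = ["python", "train_DocNADE_MVT_MST.py", *transfer_flags("MVT", [1.0], [0.1])]
print(" ".join(command))
```

Because MVT is the union of the LVT and GVT settings, the helper needs no flags specific to MVT.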

NOTE: Remove a parameter from the list in the configuration file if it is not required when running an experiment.

NOTE: Sample scripts for three different cases (with best parameter settings) have been provided with the code.

Dataset and Saved models directories

datasets directory     ->  ./datasets/
saved model directory  ->  ./model/

Model Files

train_DocNADE_MVT_MST.py  ->  Main training file
model_MVT_MST.py          ->  Model file

Script Files

train_20NSshort_ALL_docnade_tanh_LL.sh -> Script file to run MST + MVT for 20NSshort dataset (IR)

Directory structure containing results of training

Example model_dir: "20NSshort_ALL_BERT_emb_glove_1.0_emb_lambda_manual_1.0_0.5_0.1_1.0_ftt__bert__act_tanh_hid_200_vocab_1448_lr_0.0001_gvt_loss_True_manual0.1_0.01_0.001_0.1_projection_cp_1.0__2_6_2020"

**Results directory**           ->  ./model/<model_dir>/

**Saved PPL model directory**   ->  ./model/<model_dir>/model_ppl/

**Saved IR model directory**    ->  ./model/<model_dir>/model_ir/

**Saved logs model directory**  ->  ./model/<model_dir>/logs/

**Training information**        ->  ./model/<model_dir>/logs/training_info.txt

**Reload IR results**           ->  ./model/<model_dir>/logs/reload_info_ir.txt

**Reload PPL results**          ->  ./model/<model_dir>/logs/reload_info_ppl.txt
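The layout listed above can be sketched as a small path helper, which is handy when post-processing several runs. The function `model_layout` and the run name `my_run` are hypothetical; only the sub-directory and file names are taken from the list above.

```python
from pathlib import PurePosixPath


def model_layout(model_dir, root="./model"):
    """Map each output of a training run to its expected path."""
    base = PurePosixPath(root) / model_dir
    logs = base / "logs"
    return {
        "model_ppl": base / "model_ppl",
        "model_ir": base / "model_ir",
        "logs": logs,
        "training_info": logs / "training_info.txt",
        "reload_info_ir": logs / "reload_info_ir.txt",
        "reload_info_ppl": logs / "reload_info_ppl.txt",
    }


layout = model_layout("my_run")  # "my_run" is a hypothetical directory name
for name, path in layout.items():
    print(f"{name:16s} {path}")
```

Note that `pathlib` normalizes away the leading `./`, so the printed paths are relative to the repository root.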

In case of reload use the following command line arguments

--reload              True
--reload-model-dir    <model_dir>/

