tnkeeh (تنقيح) is an Arabic preprocessing library for Python. It was designed using the re module to create quick replacement expressions for common cleaning cases.
pip install tnkeeh
- Quick cleaning
- Segmentation
- Normalization
- Data splitting
import tnkeeh as tn
tn.clean_data(file_path = 'data.txt', save_path = 'cleaned_data.txt',)
Arguments
- segment: uses farasa for segmentation.
- remove_diacritics: removes all diacritics.
- remove_special_chars: removes all special characters.
- remove_english: removes English letters and digits.
- normalize: normalizes digits that are written the same but encoded differently.
- remove_tatweel: removes the tatweel character ـ, which is used a lot in Arabic writing.
- remove_repeated_chars: removes characters that appear three times in sequence.
- remove_html_elements: removes HTML elements together with their attributes.
- remove_links: removes links.
- remove_twitter_meta: removes Twitter mentions, links and hashtags.
- remove_long_words: removes words longer than 15 characters.
- by_chunk: reads files in chunks of size chunk_size.
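To make a few of the options above concrete, here is a minimal sketch of how such replacements can be written with re. It is not tnkeeh's implementation: the function name sketch_clean, the diacritics range U+064B to U+0652, and the choice to collapse a run of three or more identical characters into one are all assumptions made for illustration.

```python
import re

# Assumption: diacritics are taken to be the tashkeel marks U+064B..U+0652.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
# The tatweel (kashida) character.
TATWEEL = re.compile(r"\u0640")
# Any character repeated three or more times in a row.
REPEATED = re.compile(r"(.)\1{2,}")
# A simple pattern for http/https links (an assumption, not exhaustive).
LINKS = re.compile(r"https?://\S+")


def sketch_clean(text, max_word_len=15):
    """Hypothetical helper mirroring a few of the cleaning options."""
    text = LINKS.sub("", text)        # like remove_links
    text = DIACRITICS.sub("", text)   # like remove_diacritics
    text = TATWEEL.sub("", text)      # like remove_tatweel
    text = REPEATED.sub(r"\1", text)  # like remove_repeated_chars
    # like remove_long_words: drop words longer than max_word_len
    words = [w for w in text.split() if len(w) <= max_word_len]
    return " ".join(words)
```

Calling sketch_clean on a line of text returns the cleaned line with whitespace collapsed to single spaces. The tatweel step runs before the repeated-character step so that a stretched word is shortened by removing the tatweel rather than by collapsing it.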
import tnkeeh as tn
from datasets import load_dataset
dataset = load_dataset('metrec')
cleaner = tn.Tnkeeh(remove_diacritics = True)
cleaned_dataset = cleaner.clean_hf_dataset(dataset, 'text')
Splits raw data into training and testing sets using split_ratio.
import tnkeeh as tn
tn.split_raw_data(data_path, split_ratio = 0.8)
Splits data and labels into training and testing sets using split_ratio.
import tnkeeh as tn
tn.split_classification_data(data_path, lbls_path, split_ratio = 0.8)
Splits input and target data using the ratio split_ratio. Commonly used for translation.
import tnkeeh as tn
tn.split_parallel_data('ar_data.txt','en_data.txt')
Reads the split data, depending on whether it was raw, labeled, or parallel.
import tnkeeh as tn
train_data, test_data = tn.read_data(mode = 0)
Arguments
- mode = 0: reads raw data.
- mode = 1: reads labeled data.
- mode = 2: reads parallel data.
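The splitting functions above all take a split_ratio. The sketch below shows one way a ratio-based split can work, including keeping inputs aligned with their labels or targets. It is an illustration under stated assumptions, not tnkeeh's code: the helper names sketch_split and sketch_split_aligned, the seed parameter, and the decision to shuffle before splitting are all hypothetical.

```python
import random


def sketch_split(items, split_ratio=0.8, seed=0):
    """Shuffle items and split them into (train, test) by split_ratio.

    Hypothetical helper; whether tnkeeh shuffles is not assumed here.
    """
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    cut = int(len(indices) * split_ratio)
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test


def sketch_split_aligned(inputs, targets, split_ratio=0.8, seed=0):
    """Split two parallel lists while keeping each input with its target.

    Returns two lists of (input, target) pairs: train and test.
    """
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same length")
    pairs = list(zip(inputs, targets))
    return sketch_split(pairs, split_ratio, seed)
```

Pairing the two lists before shuffling is what keeps a sentence with its label or translation. Shuffling each list on its own would break that alignment.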
This is an open source project where we encourage contributions from the community.
MIT license.
@misc{tnkeeh2020,
author = {Zaid Alyafeai and Maged Saeed},
title = {tnkeeh: A Preprocessing Library for Arabic.},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/ARBML/tnkeeh}}
}
