Textpresso Documents Classifiers

Train and apply document classifiers for Textpresso literature

class textpresso_classifiers.classifiers.TextpressoDocumentClassifier
add_classified_docs_to_dataset(dir_path: str = None, recursive: bool = True, file_type: str = 'pdf', category: int = 1)

load the text from the files in the specified directory and add it to the dataset, assigning the documents to the specified category (class)

Note that for cas file types, only files with the .tpcas.gz extension will be loaded

Parameters:
  • dir_path (str) – the path to the directory containing the text files to be added to the dataset
  • recursive (bool) – scan directory recursively
  • file_type (str) – the type of cas files from which to extract the fulltext
  • category (int) – the category value to be associated with the documents
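The bookkeeping this method performs can be sketched in plain Python, ignoring file parsing. The `dataset` dictionary and `add_docs` helper below are illustrative stand-ins, not the library's internals; the field names mirror the DatasetStruct fields documented further down.

```python
from typing import Dict, List, Tuple

def add_docs(dataset: Dict[str, list], docs: List[Tuple[str, str]], category: int) -> None:
    """Append (filename, text) pairs to the dataset under one category label."""
    for filename, text in docs:
        dataset["filenames"].append(filename)
        dataset["data"].append(text)
        dataset["target"].append(category)

dataset = {"data": [], "filenames": [], "target": []}
add_docs(dataset, [("a.pdf", "first paper"), ("b.pdf", "second paper")], category=1)
add_docs(dataset, [("c.pdf", "off-topic paper")], category=0)
```

Calling the method once per category, as above, is how a labeled multi-class dataset is built up before splitting and training.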
add_features(features: typing.List[str], delete_old_vocabulary: bool = False)

add a list of features to the current vocabulary. The classifier must be re-trained to apply the new vocabulary

Parameters:
  • features (List[str]) – the list of features to be added to the current vocabulary
  • delete_old_vocabulary (bool) – whether to delete the old vocabulary before adding the new features
extract_features(tokenizer_type: textpresso_classifiers.classifiers.TokenizerType = <TokenizerType.BOW: 1>, ngram_range: typing.Tuple[int, int] = (1, 1), lemmatization: bool = False, top_n_feat: int = None, stop_words='english', max_df: float = 1.0, max_features: int = None, fit_vocabulary: bool = True, transform_features: bool = True)

perform feature extraction on training and test sets and store the transformed features. By default, the method uses the vocabulary stored in the vocabulary field. If the vocabulary is None, a new vocabulary is built from the corpus.

Parameters:
  • tokenizer_type (TokenizerType) – the type of tokenizer to use for feature extraction
  • ngram_range (Tuple[int, int]) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
  • lemmatization (bool) – whether to apply lemmatization to the text
  • top_n_feat (int) – select the best n features through feature selection
  • stop_words – stop words to use
  • max_df (float) – max_df to use
  • max_features (int) – consider only the best n features sorted by tfidf
  • fit_vocabulary (bool) – whether to fit the vocabulary of the vectorizer
  • transform_features (bool) – whether to transform the text in the documents into feature vectors
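The ngram_range parameter behaves as in standard bag-of-words extraction: every n-gram with min_n <= n <= max_n becomes a feature. The following stdlib-only sketch illustrates that counting step; `extract_ngrams` is a hypothetical helper, not part of the library, which internally uses a vectorizer with tf-idf weighting rather than raw counts.

```python
from collections import Counter
from typing import Tuple

def extract_ngrams(text: str, ngram_range: Tuple[int, int] = (1, 1)) -> Counter:
    """Count word n-grams for every n in the inclusive range [min_n, max_n]."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    counts: Counter = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# with ngram_range=(1, 2), both unigrams and bigrams become features
features = extract_ngrams("the cell divides", ngram_range=(1, 2))
```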
generate_training_and_test_sets(percentage_training: float = 0.8)

split the dataset into training and test sets, storing the results in the separate training_set and test_set fields and clearing the original dataset variable. If the training and test sets have already been populated, the method automatically reconstructs the dataset by merging the two sets before re-splitting it into the new training and test sets.

Parameters:percentage_training (float) – the percentage of observations to be placed in the training set
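A shuffled percentage split can be sketched as follows; this is a simplified stand-in (the `split_dataset` name and the `seed` parameter are illustrative, not part of the library's API).

```python
import random

def split_dataset(items, percentage_training=0.8, seed=None):
    """Shuffle a copy of items and split it into (training, test) lists."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * percentage_training)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(10)), percentage_training=0.8, seed=0)
```

Note that every observation ends up in exactly one of the two sets, which is why the method can later merge them back into the original dataset before re-splitting.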
get_features_with_importance()

retrieve the list of features of the classifier together with their chi-squared scores. The score is set to 0 if the importance of the features has not been calculated

Returns:the list of features of the classifier with their importance score
Return type:List[Tuple[str, float]]
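For a single binary feature and a single class, the chi-squared score reduces to the statistic of a 2x2 contingency table. The helper below is an illustration of one common closed form for that table, not the library's implementation, which delegates scoring to its feature-selection step.

```python
def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic of a 2x2 contingency table:
    a = in-class docs containing the feature, b = in-class docs without it,
    c = out-of-class docs containing it,   d = out-of-class docs without it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# a feature that appears only in the class scores high; one that is
# evenly distributed across classes scores zero
perfect = chi_squared(10, 0, 0, 10)
uninformative = chi_squared(5, 5, 5, 5)
```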
static load_from_file(file_path: str)

load a classifier from file

Parameters:file_path (str) – the path to the pickle file containing the classifier
Returns:the classifier object
Return type:TextpressoDocumentClassifier
predict_file(file_path: str, file_type: str = 'pdf', dense: bool = False)

predict the class of a single file

Parameters:
  • file_path (str) – the path to the file
  • file_type (str) – the type of file
  • dense (bool) – whether to transform the sparse matrix of features to a dense structure (required by some models)
Returns:

the class predicted by the classifier or None if the class cannot be predicted (e.g., the input file cannot be converted)

Return type:

int

predict_files(dir_path: str, file_type: str = 'pdf', dense: bool = False)

predict the class of a set of files in a directory

Parameters:
  • dir_path (str) – the path to the directory containing the files to be classified
  • file_type (str) – the type of files
  • dense (bool) – whether to transform the sparse matrix of features to a dense structure (required by some models)
Returns:

the file names of the classified documents along with the classes predicted by the classifier or None if the class cannot be predicted (e.g., the input file cannot be converted)

Return type:

Tuple[List[str], List[int]]

remove_features(features: typing.List[str])

remove a list of features from the current vocabulary of the classifier, if the vocabulary is not empty. The classifier must be re-trained to apply the new vocabulary.

Parameters:features (List[str]) – the list of features to be removed
save_to_file(file_path: str, compact: bool = True)

save the classifier to file

Parameters:
  • file_path (str) – path to the location where to store the classifier
  • compact (bool) – whether to save the classifier in compact mode. If True, the raw data used to train the classifier is deleted and the classifier cannot be further modified by adding or removing features and cannot be re-trained.
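The compact mode trades re-trainability for size: dropping the raw data shrinks the pickle, but the restored object can no longer be modified or re-trained. The in-memory sketch below illustrates that trade-off with a toy class; `ToyClassifier` and `save_bytes` are illustrative names, and the library serializes to a file path rather than returning bytes.

```python
import pickle

class ToyClassifier:
    """Illustrative stand-in: holds a trained model plus the raw training data."""
    def __init__(self, model, dataset):
        self.model = model
        self.dataset = dataset

    def save_bytes(self, compact: bool = True) -> bytes:
        if compact:
            # drop the raw data: the pickle shrinks, but the classifier can no
            # longer be re-trained or have features added or removed
            self.dataset = None
        return pickle.dumps(self)

clf = ToyClassifier(model={"weights": [0.1, 0.9]}, dataset=["doc one", "doc two"])
blob = clf.save_bytes(compact=True)
restored = pickle.loads(blob)
```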
test_classifier(test_on_training: bool = False, dense: bool = False)

test the classifier on the test set and return the results

Parameters:
  • test_on_training (bool) – whether to test the classifier on the training set instead of the test set
  • dense (bool) – whether to transform the sparse matrix of features to a dense structure (required by some models)
Returns:

the test results of the classifier

Return type:

TestResults
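The precision, recall, and accuracy fields of TestResults follow the standard definitions, which can be computed from predicted and true labels as below. The `evaluate` helper is a hypothetical illustration, not the library's testing routine.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute (precision, recall, accuracy) for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(y_true)
    return precision, recall, accuracy

p, r, a = evaluate([1, 1, 0, 0], [1, 0, 1, 0])
```

The test_on_training flag simply swaps in the training labels and predictions, which typically inflates all three numbers relative to a held-out test set.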

train_classifier(model, dense: bool = False)

train a classifier using the sample documents in the training set and save the trained model

Parameters:
  • model – the model to train
  • dense (bool) – whether to transform the sparse matrix of features to a dense structure (required by some models)
Raises:

Exception in case the features of the training set have not been extracted yet

class textpresso_classifiers.classifiers.DatasetStruct(data, filenames, target, tr_features)

structure that defines fields of a dataset

This data structure is used to store the properties of training sets and test sets within the models, so that the textual content and the file names of the documents used to create the classifiers are stored alongside them and can be easily retrieved.
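A named tuple gives a minimal sketch of such a structure. The field names below mirror the signature above, but this definition is an illustration, not the library's own.

```python
from collections import namedtuple

# illustrative equivalent of the library's dataset structure
DatasetStruct = namedtuple("DatasetStruct", ["data", "filenames", "target", "tr_features"])

ds = DatasetStruct(data=["full text of a paper"], filenames=["a.pdf"],
                   target=[1], tr_features=None)
```

Keeping data, filenames, and target as parallel lists is what lets the classifier map a prediction back to the file it came from.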

class textpresso_classifiers.classifiers.TestResults(precision, recall, accuracy)

structure that contains the values obtained while testing a classifier: precision, recall, and accuracy