Textpresso Documents Classifiers

Train and apply document classifiers for Textpresso literature

class textpresso_classifiers.classifiers.TextpressoDocumentClassifier
add_classified_docs_to_dataset(dir_path: str = None, recursive: bool = True, file_type: str = 'pdf', category: int = 1)

load the text from the files in the specified directory and add it to the dataset, assigning the documents to the specified category (class)

Note that for cas file types, only files with the .tpcas.gz extension will be loaded

Parameters:
  • dir_path (str) – the path to the directory containing the text files to be added to the dataset
  • recursive (bool) – scan directory recursively
  • file_type (str) – the type of cas files from which to extract the fulltext
  • category (int) – the category value to be associated with the documents
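The bookkeeping this method performs can be sketched in plain Python, ignoring file parsing. The `dataset` dictionary and `add_docs` helper below are illustrative stand-ins, not the library's internals; the field names mirror the DatasetStruct fields documented further down.

```python
from typing import Dict, List, Tuple

def add_docs(dataset: Dict[str, list], docs: List[Tuple[str, str]], category: int) -> None:
    """Append (filename, text) pairs to the dataset under one category label."""
    for filename, text in docs:
        dataset["filenames"].append(filename)
        dataset["data"].append(text)
        dataset["target"].append(category)

dataset = {"data": [], "filenames": [], "target": []}
add_docs(dataset, [("a.pdf", "first paper"), ("b.pdf", "second paper")], category=1)
add_docs(dataset, [("c.pdf", "off-topic paper")], category=0)
```

Calling the method once per category, as above, is how a labeled multi-class dataset is built up before splitting and training.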
add_features(features: typing.List[str], delete_old_vocabulary: bool = False)

add a list of features to the current vocabulary. The classifier must be re-trained to apply the new vocabulary

Parameters:
  • features (List[str]) – the list of features to be added to the current vocabulary
  • delete_old_vocabulary (bool) – whether to delete the old vocabulary before adding the new features
extract_features(tokenizer_type: textpresso_classifiers.classifiers.TokenizerType = <TokenizerType.BOW: 1>, ngram_range: typing.Tuple[int, int] = (1, 1), lemmatization: bool = False, top_n_feat: int = None, stop_words='english', max_df: float = 1.0, max_features: int = None, fit_vocabulary: bool = True, transform_features: bool = True)

perform feature extraction on training and test sets and store the transformed features. By default, the method uses the vocabulary stored in the vocabulary field. If the vocabulary is None, a new vocabulary is built from the corpus.

Parameters:
  • tokenizer_type (TokenizerType) – the type of tokenizer to use for feature extraction
  • ngram_range (Tuple[int, int]) – The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
  • lemmatization (bool) – whether to apply lemmatization to the text
  • top_n_feat (int) – select the best n features through feature selection
  • stop_words – stop words to use
  • max_df (float) – max_df to use
  • max_features (int) – consider only the best n features sorted by tfidf
  • fit_vocabulary (bool) – whether to fit the vocabulary of the vectorizer
  • transform_features (bool) – whether to transform the text in the documents into feature vectors
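The ngram_range parameter behaves as in standard bag-of-words extraction: every n-gram with min_n <= n <= max_n becomes a feature. The following stdlib-only sketch illustrates that counting step; `extract_ngrams` is a hypothetical helper, not part of the library, which internally uses a vectorizer with tf-idf weighting rather than raw counts.

```python
from collections import Counter
from typing import Tuple

def extract_ngrams(text: str, ngram_range: Tuple[int, int] = (1, 1)) -> Counter:
    """Count word n-grams for every n in the inclusive range [min_n, max_n]."""
    tokens = text.lower().split()
    min_n, max_n = ngram_range
    counts: Counter = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

# with ngram_range=(1, 2), both unigrams and bigrams become features
features = extract_ngrams("the cell divides", ngram_range=(1, 2))
```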
generate_training_and_test_sets(percentage_training: float = 0.8)

split the dataset into training and test sets, storing the results in the separate training_set and test_set fields and clearing the original dataset variable. If the training and test sets have already been populated, the method automatically reconstructs the dataset by merging the two sets before re-splitting it into the new training and test sets.

Parameters:percentage_training (float) – the percentage of observations to be placed in the training set
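A shuffled percentage split can be sketched as follows; this is a simplified stand-in (the `split_dataset` name and the `seed` parameter are illustrative, not part of the library's API).

```python
import random

def split_dataset(items, percentage_training=0.8, seed=None):
    """Shuffle a copy of items and split it into (training, test) lists."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * percentage_training)
    return shuffled[:cut], shuffled[cut:]

train, test = split_dataset(list(range(10)), percentage_training=0.8, seed=0)
```

Note that every observation ends up in exactly one of the two sets, which is why the method can later merge them back into the original dataset before re-splitting.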
get_features_with_importance()

retrieve the list of features of the classifier together with their chi-squared scores. The score is set to 0 if the importance of the features has not been calculated

Returns:the list of features of the classifier with their importance score
Return type:List[Tuple[str, float]]
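For a single binary feature and a single class, the chi-squared score reduces to the statistic of a 2x2 contingency table. The helper below is an illustration of one common closed form for that table, not the library's implementation, which delegates scoring to its feature-selection step.

```python
def chi_squared(a: int, b: int, c: int, d: int) -> float:
    """Chi-squared statistic of a 2x2 contingency table:
    a = in-class docs containing the feature, b = in-class docs without it,
    c = out-of-class docs containing it,   d = out-of-class docs without it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

# a feature that appears only in the class scores high; one that is
# evenly distributed across classes scores zero
perfect = chi_squared(10, 0, 0, 10)
uninformative = chi_squared(5, 5, 5, 5)
```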
static load_from_file(file_path: str)

load a classifier from file

Parameters:file_path (str) – the path to the pickle file containing the classifier
Returns:the classifier object
Return type:TextpressoDocumentClassifier
predict_file(file_path: str, file_type: str = 'pdf', dense: bool = False)

predict the class of a single file

Parameters:
  • file_path (str) – the path to the file
  • file_type (str) – the type of file
  • dense (bool) – whether to transform the sparse matrix of features to a dense structure (required by some models)
Returns:

the class predicted by the classifier or None if the class cannot be predicted (e.g., the input file cannot be converted)

Return type:

int

predict_files(dir_path: str, file_type: str = 'pdf', dense: bool = False)

predict the class of a set of files in a directory

Parameters:
  • dir_path (str) – the path to the directory containing the files to be classified
  • file_type (str) – the type of files
  • dense (bool) – whether to transform the sparse matrix of features to a dense structure (required by some models)
Returns:

the file names of the classified documents along with the classes predicted by the classifier or None if the class cannot be predicted (e.g., the input file cannot be converted)

Return type:

Tuple[List[str], List[int]]

remove_features(features: typing.List[str])

remove a list of features from the current vocabulary of the classifier, if the vocabulary is not empty. The classifier must be re-trained to apply the new vocabulary.

Parameters:features (List[str]) – the list of features to be removed
save_to_file(file_path: str, compact: bool = True)

save the classifier to file

Parameters:
  • file_path (str) – path to the location where to store the classifier
  • compact (bool) – whether to save the classifier in compact mode. If True, the raw data used to train the classifier is deleted and the classifier cannot be further modified by adding or removing features and cannot be re-trained.
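The compact mode trades re-trainability for size: dropping the raw data shrinks the pickle, but the restored object can no longer be modified or re-trained. The in-memory sketch below illustrates that trade-off with a toy class; `ToyClassifier` and `save_bytes` are illustrative names, and the library serializes to a file path rather than returning bytes.

```python
import pickle

class ToyClassifier:
    """Illustrative stand-in: holds a trained model plus the raw training data."""
    def __init__(self, model, dataset):
        self.model = model
        self.dataset = dataset

    def save_bytes(self, compact: bool = True) -> bytes:
        if compact:
            # drop the raw data: the pickle shrinks, but the classifier can no
            # longer be re-trained or have features added or removed
            self.dataset = None
        return pickle.dumps(self)

clf = ToyClassifier(model={"weights": [0.1, 0.9]}, dataset=["doc one", "doc two"])
blob = clf.save_bytes(compact=True)
restored = pickle.loads(blob)
```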
test_classifier(test_on_training: bool = False, dense: bool = False)

test the classifier on the test set and return the results

Parameters:
  • test_on_training (bool) – whether to test the classifier on the training set instead of the test set
  • dense (bool) – whether to transform the sparse matrix of features to a dense structure (required by some models)
Returns:

the test results of the classifier

Return type:

TestResults
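The precision, recall, and accuracy fields of TestResults follow the standard definitions, which can be computed from predicted and true labels as below. The `evaluate` helper is a hypothetical illustration, not the library's testing routine.

```python
def evaluate(y_true, y_pred, positive=1):
    """Compute (precision, recall, accuracy) for a binary classifier."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(y_true)
    return precision, recall, accuracy

p, r, a = evaluate([1, 1, 0, 0], [1, 0, 1, 0])
```

The test_on_training flag simply swaps in the training labels and predictions, which typically inflates all three numbers relative to a held-out test set.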

train_classifier(model, dense: bool = False)

train a classifier using the sample documents in the training set and save the trained model

Parameters:
  • model – the model to train
  • dense (bool) – whether to transform the sparse matrix of features to a dense structure (required by some models)
Raises:

Exception in case the features of the training set have not been extracted yet

class textpresso_classifiers.classifiers.DatasetStruct(data, filenames, target, tr_features)

structure that defines fields of a dataset

This data structure is used to store the properties of training sets and test sets within the models, so that the textual content and the file names of the documents used to create the classifiers are stored alongside them and can be easily retrieved.
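A named tuple gives a minimal sketch of such a structure. The field names below mirror the signature above, but this definition is an illustration, not the library's own.

```python
from collections import namedtuple

# illustrative equivalent of the library's dataset structure
DatasetStruct = namedtuple("DatasetStruct", ["data", "filenames", "target", "tr_features"])

ds = DatasetStruct(data=["full text of a paper"], filenames=["a.pdf"],
                   target=[1], tr_features=None)
```

Keeping data, filenames, and target as parallel lists is what lets the classifier map a prediction back to the file it came from.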

class textpresso_classifiers.classifiers.TestResults(precision, recall, accuracy)

structure that contains the values obtained while testing a classifier: precision, recall, and accuracy