File Management Utilities

Utilities to transform pdf and CAS files into feature vectors for the classifiers

class textpresso_classifiers.fileutils.CasType

type of cas file

textpresso_classifiers.fileutils.extract_text_from_article_xml(text: str)

extract the text of an article from its xml representation (in pubmed format)

Parameters:text (str) – the xml text of the article in pubmed format
Returns:the fulltext of the article
Return type:str
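
The kind of extraction described above can be sketched with the standard library alone. This is not the library's implementation: the function name sketch_extract_text_from_article_xml is hypothetical, and the sketch assumes that the fulltext sits under a <body> element, as it does in JATS-style PubMed Central XML.

```python
import xml.etree.ElementTree as ET


def sketch_extract_text_from_article_xml(text: str) -> str:
    """Hypothetical stand-in: join all text nodes found under <body>."""
    root = ET.fromstring(text)
    # Assumption: the fulltext lives under a <body> element; when there is
    # no such descendant, fall back to the whole document.
    body = root.find(".//body")
    if body is None:
        body = root
    # itertext() yields every nested text node in document order.
    pieces = (piece.strip() for piece in body.itertext())
    return " ".join(piece for piece in pieces if piece)
```

Joining the stripped text nodes with single spaces discards the markup and normalizes whitespace, which is usually what a downstream bag-of-words vectorizer needs.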

textpresso_classifiers.fileutils.extract_text_from_cas_content(cas_content: str, cas_type: textpresso_classifiers.fileutils.CasType = 1)

extract the fulltext of an article from a Textpresso cas file

Parameters:
  • cas_content (str) – the content of the cas file
  • cas_type (CasType) – the type of cas file
Returns:

the fulltext of the article represented by the cas file

Return type:

str
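
As a rough illustration of where the fulltext sits in a CAS file: UIMA's XMI serialization stores the document text in the sofaString attribute of the Sofa element. The sketch below reads that attribute with the standard library. It is an assumption that Textpresso cas files can be parsed this way, the function name sketch_extract_sofa_string is hypothetical, and the cas_type handling of the real function is not modeled.

```python
import xml.etree.ElementTree as ET


def sketch_extract_sofa_string(cas_content: str) -> str:
    """Hypothetical stand-in: return the first sofaString attribute found."""
    root = ET.fromstring(cas_content)
    # Assumption: the content is well-formed XMI, so the document text is
    # the value of an un-namespaced "sofaString" attribute.
    for element in root.iter():
        sofa_string = element.get("sofaString")
        if sofa_string is not None:
            return sofa_string
    # No Sofa element carrying text was found.
    return ""
```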

textpresso_classifiers.fileutils.extract_text_from_pdf(file_path: str)

extract the fulltext of an article from a pdf file

Parameters:file_path (str) – the path to the pdf file
Returns:the fulltext of the article represented by the pdf file
Return type:str

textpresso_classifiers.fileutils.read_compressed_cas_content(file_path: str)

read a compressed cas file and return its content as a string

Parameters:file_path (str) – the path to the compressed cas file
Returns:the content of the compressed cas file
Return type:str
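
A minimal sketch of reading a compressed file into a string, assuming gzip compression and UTF-8 encoding (neither is stated in this reference). The name sketch_read_compressed_cas_content is hypothetical and is not the library function.

```python
import gzip


def sketch_read_compressed_cas_content(file_path: str) -> str:
    """Hypothetical stand-in: read a gzip-compressed file as text."""
    # Assumption: the file is gzip-compressed and UTF-8 encoded.
    # mode="rt" makes gzip decode bytes to str while reading.
    with gzip.open(file_path, mode="rt", encoding="utf-8") as cas_file:
        return cas_file.read()
```

The returned string could then be handed to a text-extraction step such as extract_text_from_cas_content.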

textpresso_classifiers.fileutils.remove_pdf_tags_from_text(text: str)

remove pdf tags from text

Parameters:text (str) – the text of an article possibly containing pdf tags
Returns:the text of the article without pdf tags
Return type:str
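
This reference does not specify the tag syntax. Purely for illustration, the sketch below assumes self-closing markers of the form <_pdf .../> and removes them with a regular expression. Both the pattern and the name sketch_remove_pdf_tags are hypothetical.

```python
import re

# Assumption: pdf tags look like <_pdf key="value"/>; the pattern should be
# adjusted to whatever markers the extraction step really emits.
_PDF_TAG = re.compile(r"<_pdf\b[^>]*/>")


def sketch_remove_pdf_tags(text: str) -> str:
    """Hypothetical stand-in: delete every assumed pdf tag from the text."""
    return _PDF_TAG.sub("", text)
```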