File Management Utilities
Utilities to transform PDF and CAS files into feature vectors for the classifiers.
-
class textpresso_classifiers.fileutils.CasType
Type of CAS file.
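The signature of extract_text_from_cas_content on this page shows `cas_type` defaulting to `1`, which suggests an enumeration with integer values. The following is only a sketch of what such a type could look like: the member names `PDF` and `XML` are assumptions made for illustration and are not confirmed by this page.

```python
from enum import Enum


class CasType(Enum):
    """Type of CAS file (member names are assumptions, not taken from this page)."""
    PDF = 1  # hypothetical: CAS file derived from a PDF article
    XML = 2  # hypothetical: CAS file derived from an XML article


# An Enum member can be looked up by value, which is how an integer
# default such as `= 1` in a signature maps back to a member.
default_type = CasType(1)
```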
-
textpresso_classifiers.fileutils.extract_text_from_article_xml(text: str)
Extract the text of an article from its XML representation (in PubMed format).
Parameters: text (str) – the XML text of the article in PubMed format
Returns: the fulltext of the article
Return type: str
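As a rough illustration of this kind of extraction, the sketch below pulls paragraph text out of an article XML string using only the standard library. It is not the library's implementation: the helper name `extract_body_text` is hypothetical, and the assumption that the text lives in `<p>` elements under a `<body>` element is made here for the example.

```python
import xml.etree.ElementTree as ET


def extract_body_text(xml_text: str) -> str:
    """Hypothetical helper: join the text of every <p> element under <body>."""
    root = ET.fromstring(xml_text)
    body = root.find(".//body")  # assumed location of the article text
    if body is None:
        return ""
    paragraphs = []
    for p in body.iter("p"):
        # itertext() also walks nested inline tags such as <italic>
        text = "".join(p.itertext())
        # collapse runs of whitespace left over from the markup
        paragraphs.append(" ".join(text.split()))
    return "\n".join(paragraphs)


sample = (
    "<article><front><article-title>Title</article-title></front>"
    "<body><sec><title>Intro</title>"
    "<p>First <italic>C. elegans</italic> paragraph.</p>"
    "<p>Second paragraph.</p></sec></body></article>"
)
print(extract_body_text(sample))
```

Section titles and front matter are skipped in this sketch; whether the real function keeps them is not stated on this page.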
-
textpresso_classifiers.fileutils.extract_text_from_cas_content(cas_content: str, cas_type: textpresso_classifiers.fileutils.CasType = 1)
Extract the fulltext of an article from a Textpresso CAS file.
Parameters: - cas_content (str) – the content of the CAS file
- cas_type (CasType) – the type of CAS file
Returns: the fulltext of the article represented by the CAS file
Return type: str
-
textpresso_classifiers.fileutils.extract_text_from_pdf(file_path: str)
Extract the fulltext of an article from a PDF file.
Parameters: file_path (str) – the path to the PDF file
Returns: the fulltext of the article represented by the PDF file
Return type: str
-
textpresso_classifiers.fileutils.read_compressed_cas_content(file_path: str)
Read a compressed CAS file and return its content as a string.
Parameters: file_path (str) – the path to the compressed CAS file
Returns: the content of the compressed CAS file
Return type: str
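Reading a compressed file into a string can be sketched as follows. This page does not say which compression format is used, so gzip is an assumption here, and both the helper name `read_gzip_text` and the file name `example.tpcas.gz` are hypothetical.

```python
import gzip
import os
import tempfile


def read_gzip_text(file_path: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: return the decompressed content of a gzip file."""
    # "rt" opens the stream in text mode, so bytes are decoded while reading
    with gzip.open(file_path, "rt", encoding=encoding) as handle:
        return handle.read()


# Round trip on a temporary file to show the helper in use.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.tpcas.gz")  # hypothetical file name
    with gzip.open(path, "wt", encoding="utf-8") as out:
        out.write("<CAS>example</CAS>")
    content = read_gzip_text(path)
```

The returned string could then be passed on as `cas_content` to extract_text_from_cas_content.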
Remove PDF tags from text.
Parameters: text (str) – the text of an article possibly containing PDF tags
Returns: the text of the article without PDF tags
Return type: str