ChromaPPTXLoader
- Purpose:
This module provides the entry point for loading PPTX files into a Chroma database.
- Platform:
Linux/Windows | Python 3.10+
- Developer:
J Berendt
- Email:
- Comments:
n/a
- Examples:
Parse and load a single PPTX file into a Chroma database collection:
>>> from docp_loaders import ChromaPPTXLoader
>>> l = ChromaPPTXLoader(path='/path/to/chroma', collection='spam', split_text=False)
>>> l.load(path='/path/to/directory/myfile.pptx')
Parse and load a directory of PPTX files into a Chroma database collection:
>>> from docp_loaders import ChromaPPTXLoader
>>> l = ChromaPPTXLoader(path='/path/to/chroma', collection='spam', split_text=False)
>>> l.load(path='/path/to/directory', ext='pptx')
For further code examples, please refer to the ChromaPPTXLoader class docstring.
- class ChromaPPTXLoader(path: str | ChromaDB, collection: str = None, *, split_text: bool = True, allow_duplication: bool = False, chunk_size: int = 512, chunk_overlap: int = 128, separators=['\n\n\n', '\n\n', '\n', ' '], separators_md=['###', '##', '#', '\n'], embedding_model_path: str = None, repo_id: str = None, offline: bool = False)
Bases:
_ChromaBasePPTXLoader
Chroma database PPTX-specific document loader.
- Parameters:
path (str | ChromaDB) – Either the full path to the Chroma database directory, or an instance of the ChromaDB class. If an instance is passed, the collection argument is ignored.
collection (str, optional) – Name of the Chroma database collection. Only required if the path parameter is given as a directory path. Defaults to None.
split_text (bool, optional) – Split the document into chunks before loading it into the database. Defaults to True.
offline (bool, optional) – Remain offline and use the locally cached embedding function model. Defaults to False.
Tip
It is recommended to pass split_text=False into the ChromaPPTXLoader constructor.
Often, PowerPoint presentations are structured such that related text is found in the same ‘shape’ (textbox) on a slide. Splitting the text in these shapes may have undesired results.
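To illustrate why splitting can fragment the text of a single shape, here is a minimal sketch of a naive fixed-width splitter that uses the default chunk_size and chunk_overlap values from the constructor signature. The chunk_text function is hypothetical and is not the splitter used by ChromaPPTXLoader (which also accepts separators arguments and may split differently); it only shows how text longer than chunk_size ends up spread across several overlapping pieces.

```python
def chunk_text(text: str, chunk_size: int = 512, chunk_overlap: int = 128) -> list[str]:
    """Hypothetical fixed-width splitter, for illustration only."""
    if chunk_overlap >= chunk_size:
        raise ValueError('chunk_overlap must be smaller than chunk_size')
    step = chunk_size - chunk_overlap  # distance between chunk start positions
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
    return chunks


# A 1000-character textbox is spread over three overlapping chunks.
shape_text = 'a' * 1000
pieces = chunk_text(shape_text)
print(len(pieces))               # 3
print([len(p) for p in pieces])  # [512, 512, 232]
```

With split_text=False no such splitting takes place, so the related text of a shape is not divided in this way.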
- Examples:
Parse and load a single PPTX file into a Chroma database collection:
>>> from docp_loaders import ChromaPPTXLoader
>>> l = ChromaPPTXLoader(path='/path/to/chroma', collection='spam', split_text=False)  # <-- Note this
>>> l.load(path='/path/to/directory/myfile.pptx')
Parse and load a directory of PPTX files into a Chroma database collection:
>>> from docp_loaders import ChromaPPTXLoader
>>> l = ChromaPPTXLoader(path='/path/to/chroma', collection='spam', split_text=False)  # <-- Note this
>>> l.load(path='/path/to/directory', ext='pptx')
- load(path: str, *, ext: str = '**', recursive: bool = True, remove_newlines: bool = True, convert_to_ascii: bool = True, **kwargs) → None
Load a PPTX file (or files) into a Chroma database.
- Parameters:
path (str) – Full path to the file (or directory) to be parsed and loaded. Note: If this is a directory, a specific file extension can be passed into the load() method using the ext argument.
ext (str, optional) – If the path argument refers to a directory, a specific file extension can be specified here. For example: ext = 'pptx'.
If anything other than '**' is provided, all alpha-characters are parsed from the string and prefixed with '*.'. This means that if '.pptx' is passed, the characters 'pptx' are parsed and prefixed with '*.' to create '*.pptx'. However, if 'things.foo' is passed, the derived extension will be '*.thingsfoo'. Defaults to '**', for a recursive search.
recursive (bool, optional) – If True, subdirectories are searched. Defaults to True.
remove_newlines (bool, optional) – Replace newline characters with a space. Defaults to True, as this helps with document chunk splitting.
convert_to_ascii (bool, optional) – Convert all characters to ASCII. Defaults to True.
- Keyword Args:
kwargs (dict): Additional keywords to be passed into the document parser(s).
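The extension rule described for the ext argument can be sketched as follows. This is an illustration based only on the description above, not the library's implementation: the derive_pattern function name is hypothetical, and treating 'alpha-characters' as ASCII letters is an assumption.

```python
import re


def derive_pattern(ext: str = '**') -> str:
    """Hypothetical helper mirroring the documented ext behaviour."""
    if ext == '**':
        # The default is passed through unchanged (recursive search).
        return ext
    # Assumption: 'alpha-characters' means the ASCII letters a-z and A-Z.
    letters = ''.join(re.findall(r'[A-Za-z]', ext))
    return f'*.{letters}'


print(derive_pattern('.pptx'))       # *.pptx
print(derive_pattern('things.foo'))  # *.thingsfoo
print(derive_pattern('**'))          # **
```

Under this reading, any dots or other non-letter characters in ext are dropped, so passing a plain extension such as 'pptx' is the least surprising choice.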
- property chroma
Accessor to the database client object.
- property parser
Accessor to the document parser object.