ChromaPPTXLoader

Purpose:

This module provides the entry point for loading PPTX files into a Chroma database.

Platform:

Linux/Windows | Python 3.10+

Developer:

J Berendt

Email:

development@s3dev.uk

Comments:

n/a

Examples:

Parse and load a single PPTX file into a Chroma database collection:

>>> from docp_loaders import ChromaPPTXLoader

>>> l = ChromaPPTXLoader(path='/path/to/chroma',
                         collection='spam',
                         split_text=False)
>>> l.load(path='/path/to/directory/myfile.pptx')

Parse and load a directory of PPTX files into a Chroma database collection:

>>> from docp_loaders import ChromaPPTXLoader

>>> l = ChromaPPTXLoader(path='/path/to/chroma',
                         collection='spam',
                         split_text=False)
>>> l.load(path='/path/to/directory', ext='pptx')

For further example code use, please refer to the ChromaPPTXLoader class docstring.

class ChromaPPTXLoader(path: str | ChromaDB, collection: str = None, *, split_text: bool = True, allow_duplication: bool = False, chunk_size: int = 512, chunk_overlap: int = 128, separators=['\n\n\n', '\n\n', '\n', ' '], separators_md=['###', '##', '#', '\n'], embedding_model_path: str = None, repo_id: str = None, offline: bool = False)[source]

Bases: _ChromaBasePPTXLoader

Chroma database PPTX-specific document loader.

Parameters:

path (str | ChromaDB) – Either the full path to the Chroma database directory, or an instance of a ChromaDB class. If the instance is passed, the collection argument is ignored.
collection (str, optional) – Name of the Chroma database collection. Only required if the db parameter is a path. Defaults to None.
split_text (bool, optional) – Split the document into chunks, before loading it into the database. Defaults to True.
offline (bool, optional) – Remain offline and use the locally cached embedding function model. Defaults to False.

Tip

It is recommended to pass split_text=False into the ChromaPPTXLoader constructor.

Often, PowerPoint presentations are structured such that related text is found in the same ‘shape’ (textbox) on a slide. Splitting the text in these shapes may have undesired results.

Examples:

Parse and load a single PPTX file into a Chroma database collection:

>>> from docp_loaders import ChromaPPTXLoader

>>> l = ChromaPPTXLoader(path='/path/to/chroma',
                         collection='spam',
                         split_text=False)  # <-- Note this
>>> l.load(path='/path/to/directory/myfile.pptx')

Parse and load a directory of PPTX files into a Chroma database collection:

>>> from docp_loaders import ChromaPPTXLoader

>>> l = ChromaPPTXLoader(path='/path/to/chroma',
                         collection='spam',
                         split_text=False)  # <-- Note this
>>> l.load(path='/path/to/directory', ext='pptx')

load(path: str, *, ext: str = '**', recursive: bool = True, remove_newlines: bool = True, convert_to_ascii: bool = True, **kwargs) → None[source]

Load a PPTX file (or files) into a Chroma database.

Parameters:

path (str) – Full path to the file (or directory) to be parsed and loaded. Note: If this is a directory, a specific file extension can be passed into the load() method using the ext argument.
ext (str, optional) –
If the path argument refers to a directory, a specific file extension can be specified here. For example: ext = 'pptx'.

If anything other than '**' is provided, all alpha-characters are parsed from the string, and prefixed with *.. Meaning, if '.pptx' is passed, the characters 'pptx' are parsed and prefixed with *. to create '*.pptx'. However, if 'things.foo' is passed, the derived extension will be '*.thingsfoo'. Defaults to ‘**’, for a recursive search.
recursive (bool, optional) – If True, subdirectories are searched. Defaults to True.
remove_newlines (bool, optional) – Replace newline characters with a space. Defaults to True, as this helps with document chunk splitting.
convert_to_ascii (bool, optional) – Convert all characters to ASCII. Defaults to True.

Keyword Args:

kwargs (dict): Additional keywords to be passed into the document parser(s).

property chroma: Accessor to the database client object.

property parser: Accessor to the document parser object.