Data loader
To build a structured AI application you need the ability to convert all the information you have into text, so you can generate embeddings, save them into a vector store, and then feed your Agent to answer the user's questions.
Neuron gives you several tools (data loaders) to simplify this process.
Using the Neuron toolkit you can create data loading pipelines that benefit from unified interfaces between components such as embedding providers, vector stores, and file readers.
If you need to extract text from files, use the FileDataLoader. By default it processes any simple text document. To transform other document formats into text, you can attach additional readers, such as the PdfReader.
Notice that each file reader is associated with a file extension, so the data loader automatically picks the appropriate reader based on the extension of the input file.
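The extension-based selection described above can be pictured with a small, self-contained sketch. It is not Neuron's implementation or API: the `$readers` map, the `readFileAsText` helper, and the placeholder PDF closure are hypothetical names used only to illustrate the idea, and the fallback to the plain-text reader is an assumption.

```php
<?php
// Hypothetical sketch: NOT Neuron's implementation or API.
// Each "reader" is a closure that turns a file path into plain text.
$readers = [
    'txt' => fn (string $path): string => (string) file_get_contents($path),
    // Stand-in for a real PDF reader: it only tags the file name.
    'pdf' => fn (string $path): string => '[pdf text of ' . basename($path) . ']',
];

function readFileAsText(string $path, array $readers): string
{
    // Normalize the extension so "Report.PDF" and "report.pdf" match.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    // Assumption: fall back to the plain-text reader when no
    // specific reader is registered for this extension.
    $reader = $readers[$extension] ?? $readers['txt'];

    return $reader($path);
}
```

Registering a new format in this sketch only means adding one more entry to the map, which mirrors the idea of attaching a reader for an extension.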
If you are already getting text from your database or other sources, you can use the StringDataLoader to convert this text into documents, ready to be embedded and stored by the other Neuron components in the chain.
A Neuron data loader takes text as input and generates an array of \NeuronAI\RAG\Document objects. These documents are the embeddable units: they contain chunks of the original text, ready to be converted into vector embeddings.
The process to split a long text into chunks can be customized using some configuration parameters.
Each of these parameters has an impact on the performance and accuracy of your RAG agent.
The text is first split into chunks based on a separator. By default the component uses the period character, but you can set any delimiter that suits your text.
Each chunk is capped at a maximum length; a chunk that exceeds it is divided further into smaller documents. The length affects the accuracy of the embedding representation: the longer your units of text are, the less accurate their embeddings will be.
Sometimes it is useful to bring words from the previous and next chunks into a document to strengthen the semantic connection between adjacent sections of the text. By default no overlap is applied.
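To make the interplay of the three parameters concrete, here is a minimal sketch in plain PHP. It is an illustration under stated assumptions, not Neuron's splitting logic: the function name `splitIntoChunks` and its parameter names are hypothetical, lengths are counted in bytes, and the overlap words are added on top of the maximum length.

```php
<?php
// Illustrative sketch only: names are hypothetical and Neuron's own
// splitting logic may differ. Assumes a non-empty separator and $maxLength >= 1.

/** @return string[] */
function splitIntoChunks(string $text, string $separator = '.', int $maxLength = 1000, int $wordOverlap = 0): array
{
    // 1. Split on the separator and drop empty fragments.
    $parts = array_filter(
        array_map('trim', explode($separator, $text)),
        fn (string $part): bool => $part !== ''
    );

    // 2. Pack fragments into chunks of at most $maxLength bytes.
    $chunks = [];
    $current = '';
    foreach ($parts as $part) {
        // A fragment that is too long on its own is cut into fixed-size pieces.
        foreach (str_split($part, $maxLength) as $piece) {
            $candidate = $current === '' ? $piece : $current . $separator . ' ' . $piece;
            if (strlen($candidate) <= $maxLength) {
                $current = $candidate;
            } else {
                $chunks[] = $current;
                $current = $piece;
            }
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }

    // 3. Optionally borrow words from the neighbouring chunks.
    if ($wordOverlap <= 0 || count($chunks) < 2) {
        return $chunks;
    }
    $words = array_map(
        fn (string $chunk): array => preg_split('/\s+/', $chunk, -1, PREG_SPLIT_NO_EMPTY),
        $chunks
    );
    $result = [];
    foreach ($chunks as $i => $chunk) {
        $before = $i > 0 ? array_slice($words[$i - 1], -$wordOverlap) : [];
        $after = $i < count($chunks) - 1 ? array_slice($words[$i + 1], 0, $wordOverlap) : [];
        $result[] = trim(implode(' ', $before) . ' ' . $chunk . ' ' . implode(' ', $after));
    }

    return $result;
}
```

In this sketch a smaller maximum length yields more, shorter chunks, while a non-zero overlap repeats a few boundary words in adjacent chunks, which is the trade-off the paragraphs above describe.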
With this toolkit you can iterate over a list of files and convert them, ensuring that each file type has a reader that transforms the document into text correctly.
Here is a complete example of a full-featured embedding process:
With this simple process you can ingest gigabytes of data into your vector store to feed your RAG agent.
To use PdfReader you need to install the PHP extension.