Data loader
To build a structured AI application you need the ability to convert all the information you have into text, so you can generate embeddings, save them into a vector store, and then feed your Agent to answer the user's questions.

Neuron gives you several tools (data loaders) to simplify this process.
use App\Neuron\MyRAG;
use NeuronAI\RAG\DataLoader\FileDataLoader;
MyRAG::make()->addDocuments(
// Use the file data loader component to process a text file
FileDataLoader::for(__DIR__.'/my-article.md')->getDocuments()
);
Using the Neuron toolkit you can create data loading pipelines with the benefits of unified interfaces to facilitate interactions between components, like embedding providers, vector store, and file readers.
FileDataLoader
If you need to extract text from files, the FileDataLoader allows you to process any simple text document.
use NeuronAI\RAG\DataLoader\FileDataLoader;
// Read a file and get "documents"
$documents = FileDataLoader::for(__DIR__.'/my-article.md')->getDocuments();
// Pass a directory to process all files
$documents = FileDataLoader::for(__DIR__)->getDocuments();
Process PDFs
To use PdfReader you need to install the pdftotext php extension.
By default the FileDataLoader processes any simple text document. If you need to process other document formats and transform them into text, you can attach additional readers, like the PdfReader:
use NeuronAI\RAG\DataLoader\FileDataLoader;
// Register the PDF reader
$documents = FileDataLoader::for(__DIR__)
->addReader('pdf', new \NeuronAI\RAG\DataLoader\PdfReader())
->getDocuments();
Notice that each file reader is associated with a file extension, so the data loader automatically picks the appropriate reader based on the extension of the input file.
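The extension-to-reader association can be pictured as a simple lookup. The sketch below is plain PHP and does not use Neuron's classes: `pickReader`, the closures standing in for readers, and the lowercase normalization of the extension are all assumptions made for illustration, not the library's actual behavior.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper (not part of Neuron): choose a reader callable
 * from a map keyed by file extension, falling back to a default reader.
 *
 * @param array<string, callable> $readers
 */
function pickReader(string $path, array $readers, callable $fallback): callable
{
    // Assumption of this sketch: extensions are compared case-insensitively.
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    return $readers[$extension] ?? $fallback;
}

// Stand-in readers that only label the file they would have read.
$plainText = fn (string $path): string => 'text:' . basename($path);
$readers = [
    'pdf' => fn (string $path): string => 'pdf:' . basename($path),
];

echo pickReader('/docs/report.PDF', $readers, $plainText)('/docs/report.PDF'), PHP_EOL;
echo pickReader('/docs/notes.md', $readers, $plainText)('/docs/notes.md'), PHP_EOL;
```

Here the first file goes to the stand-in PDF reader, and the second falls back to the plain text reader because no reader is registered for "md".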
StringDataLoader
If you are already getting text from your database or other sources, you can use the StringDataLoader to convert this text into documents, ready to be embedded and stored by the other Neuron components in the chain:
use App\Neuron\MyRAG;
use NeuronAI\RAG\DataLoader\StringDataLoader;
$contents = [
// list of strings (text you want to embed)
];
foreach ($contents as $text) {
$documents = StringDataLoader::for($text)->getDocuments();
MyRAG::make()->addDocuments($documents);
}
Document meta-data
After getting the array of documents from a data loader, you can optionally attach custom meta-data to each document. It will be saved in the vector store along with the document's default fields:
$documents = FileDataLoader::for($directory)->getDocuments();
foreach($documents as $document) {
$document->addMetadata('user_id', 1234);
}
MyRAG::make()->addDocuments($documents);
Once you have these custom fields in the vector store you can use hybrid search for databases that support this feature.
Text Splitter
Neuron data loaders take files or text as input and generate an array of \NeuronAI\RAG\Document objects. These documents are embeddable units: the original text is split into smaller pieces that are converted into embeddings and saved in the vector store.
The logic data loaders use to split a long text into chunks can be customized using different strategies. Neuron has a dedicated component for this purpose called "Splitter", and it can be attached to the data loader based on the strategy you prefer or need:
$documents = FileDataLoader::for($directory)
->withSplitter(
new DelimiterTextSplitter()
)
->getDocuments();
DelimiterTextSplitter (default)
This is the default splitter for all data loaders.
$documents = FileDataLoader::for($directory)
->withSplitter(
new DelimiterTextSplitter(
maxLength: 1000,
separator: '.',
wordOverlap: 0
)
)
->getDocuments();
Each of these parameters has an impact on the performance and accuracy of your RAG agent.
Max Length
No chunk will be longer than this value; longer text is divided into smaller documents. The length can affect the accuracy of the embeddings: the longer your units of text are, the less accurately a single embedding represents them.
Separator
The text is first split into chunks based on a separator. By default the component uses the period character. You can customize this separator with any delimiter that suits your text.
Overlap
Sometimes it could be useful to bring words from the previous and next chunk into a document to increase the semantic connection between adjacent sections of the text. By default no overlap is applied.
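To see how the three parameters interact, here is a minimal plain-PHP sketch of delimiter-based chunking. It is not Neuron's implementation: the function `splitByDelimiter` is hypothetical, it only borrows words from the previous chunk (not the next one), and it assumes that length is measured in bytes with `strlen`.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch of delimiter-based chunking; not Neuron's implementation.
 *
 * @return string[]
 */
function splitByDelimiter(
    string $text,
    int $maxLength = 1000,
    string $separator = '.',
    int $wordOverlap = 0
): array {
    // 1. Cut the text on the separator and drop empty pieces.
    $pieces = array_values(array_filter(
        array_map('trim', explode($separator, $text)),
        fn (string $piece): bool => $piece !== ''
    ));

    // 2. Greedily merge pieces while the chunk stays within $maxLength.
    //    A single piece longer than $maxLength is kept as-is in this sketch.
    $chunks = [];
    $current = '';
    foreach ($pieces as $piece) {
        $candidate = $current === '' ? $piece : $current . $separator . ' ' . $piece;
        if ($current !== '' && strlen($candidate) > $maxLength) {
            $chunks[] = $current;
            $current = $piece;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }

    if ($wordOverlap <= 0) {
        return $chunks;
    }

    // 3. Prepend the last $wordOverlap words of the previous chunk.
    //    The overlap is added on top of $maxLength here.
    $withOverlap = [];
    foreach ($chunks as $i => $chunk) {
        if ($i > 0) {
            $previousWords = preg_split('/\s+/', $chunks[$i - 1], -1, PREG_SPLIT_NO_EMPTY);
            $chunk = implode(' ', array_slice($previousWords, -$wordOverlap)) . ' ' . $chunk;
        }
        $withOverlap[] = $chunk;
    }

    return $withOverlap;
}

$text = 'Neuron is a PHP framework. It builds AI agents. Data loaders create documents.';
print_r(splitByDelimiter($text, maxLength: 30, wordOverlap: 2));
```

With a max length of 30 each sentence ends up in its own chunk, and the second and third chunks start with the last two words of the chunk before them. Raising the max length merges sentences into fewer, larger chunks.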
SentenceTextSplitter
Splits text into sentences, groups them into word-based chunks, and optionally overlaps adjacent chunks by a given number of words.
$documents = FileDataLoader::for($directory)
->withSplitter(
new SentenceTextSplitter(
maxWords: 200,
overlapWords: 0
)
)
->getDocuments();
maxWords: maximum number of words per chunk
overlapWords: number of overlapping words between chunks
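The effect of these two parameters can be sketched in plain PHP. The function `splitBySentence` below is hypothetical and is not Neuron's implementation: it assumes a deliberately naive sentence rule (split after ".", "!" or "?" followed by whitespace) and counts words by splitting on whitespace.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical sketch of sentence-based chunking; not Neuron's implementation.
 *
 * @return string[]
 */
function splitBySentence(string $text, int $maxWords = 200, int $overlapWords = 0): array
{
    // Naive sentence boundary: ., ! or ? followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $chunks = [];
    $current = []; // words of the chunk being built
    foreach ($sentences as $sentence) {
        $words = preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

        // Close the chunk when adding this sentence would exceed $maxWords.
        if ($current !== [] && count($current) + count($words) > $maxWords) {
            $chunks[] = implode(' ', $current);
            // Seed the next chunk with the trailing words of the one just closed.
            $current = $overlapWords > 0 ? array_slice($current, -$overlapWords) : [];
        }

        // Sentences are never cut in half, so a chunk can exceed $maxWords
        // when a single sentence (plus overlap) is longer than the limit.
        $current = array_merge($current, $words);
    }
    if ($current !== []) {
        $chunks[] = implode(' ', $current);
    }

    return $chunks;
}

$text = 'Neuron is a PHP framework. It builds AI agents! Do data loaders create documents? Yes.';
print_r(splitBySentence($text, maxWords: 9, overlapWords: 2));
```

In this run the first two sentences fill the 9-word budget, so the third sentence opens a new chunk that begins with the two overlapping words "AI agents!".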
Use standalone components
In the examples above we used the RAG agent instance to handle the final part of the ingestion pipeline: generating embeddings for the document chunks and storing them in the vector database.
As an alternative to the RAG agent instance, you can use the embedding provider and the vector store as standalone components. Remember that the vector store used here must be the same one connected to the RAG agent.
use App\Neuron\MyRAG;
use NeuronAI\RAG\DataLoader\FileDataLoader;
use NeuronAI\RAG\DataLoader\StringDataLoader;
use NeuronAI\RAG\EmbeddingProvider\OpenAIEmbeddingProvider;
use NeuronAI\RAG\VectorStore\FileVectorStore;
$embedder = new OpenAIEmbeddingProvider(
key: 'OPENAI_API_KEY',
model: 'OPENAI_MODEL'
);
$store = new FileVectorStore(
directory: __DIR__,
key: 'demo'
);
// Process files and contents
$documents = FileDataLoader::for(__DIR__.'/documents')
->addReader('pdf', new \NeuronAI\RAG\DataLoader\PdfReader())
->getDocuments();
// Generate embeddings and store documents in the vector database
$store->addDocuments(
$embedder->embedDocuments($documents)
);
With this simple process you can ingest gigabytes of data into your vector store to feed your RAG agent.