Data loader

PREREQUISITES

This guide assumes you are already familiar with RAG. Check out the dedicated documentation: https://docs.neuron-ai.dev/rag

To build a structured AI application you need to convert the information you have into text, so that you can generate embeddings, save them in a vector store, and give your agent the context it needs to answer the user's questions.

Neuron gives you several tools (data loaders) to simplify this process.

use App\Neuron\MyRAG;
use NeuronAI\RAG\DataLoader\FileDataLoader;

MyRAG::make()->addDocuments(
    // Use the file data loader component to process a text file
    FileDataLoader::for(__DIR__.'/my-article.md')->getDocuments()
);

With the Neuron toolkit you can build data loading pipelines on top of unified interfaces that simplify the interaction between components such as embedding providers, vector stores, and file readers.

FileDataLoader

If you need to extract text from files, the FileDataLoader lets you process any simple text document.

use NeuronAI\RAG\DataLoader\FileDataLoader;

// Read a file and get "documents"
$documents = FileDataLoader::for(__DIR__.'/my-article.md')->getDocuments();

// Pass a directory to process all files
$documents = FileDataLoader::for(__DIR__)->getDocuments();

Process PDFs

By default the FileDataLoader processes any simple text document. If you need to process other document formats and transform them into text, you can attach additional readers, like the PdfReader:

use NeuronAI\RAG\DataLoader\FileDataLoader;

// Register the PDF reader
$documents = FileDataLoader::for(__DIR__)
    ->addReader('pdf', new \NeuronAI\RAG\DataLoader\PdfReader())
    ->getDocuments();

Notice that each file reader is associated with a file extension. Based on the extension of the input file, the data loader automatically uses the appropriate reader.
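The extension-to-reader mapping can be pictured with a small plain-PHP sketch. This is not Neuron's implementation: the `ReaderRegistry` class, its methods, and the closures are hypothetical names used only to illustrate how a loader might dispatch on the file extension and fall back to a plain text reader.

```php
<?php

// Hypothetical registry illustrating extension-based dispatch.
// NOT Neuron's code: every name here is invented for the example.
final class ReaderRegistry
{
    /** @var array<string, \Closure> readers indexed by lowercase extension */
    private array $readers = [];

    public function __construct(private \Closure $fallback)
    {
    }

    public function addReader(string $extension, \Closure $reader): self
    {
        $this->readers[strtolower($extension)] = $reader;

        return $this;
    }

    public function read(string $path): string
    {
        // Pick the reader registered for this extension, or the fallback.
        $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
        $reader = $this->readers[$extension] ?? $this->fallback;

        return $reader($path);
    }
}

// The readers return labels instead of touching the filesystem,
// so the sketch runs without any real files.
$registry = (new ReaderRegistry(fn (string $path): string => 'plain:' . basename($path)))
    ->addReader('pdf', fn (string $path): string => 'pdf:' . basename($path));

echo $registry->read('/docs/report.PDF'), PHP_EOL; // handled by the "pdf" reader
echo $registry->read('/docs/notes.md'), PHP_EOL;   // handled by the fallback reader
```

Normalizing the extension to lowercase is a design choice of this sketch, so that `report.PDF` and `report.pdf` reach the same reader.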

StringDataLoader

If you are already getting text from your database or other sources, you can use the StringDataLoader to convert this text into documents, ready to be embedded and stored by the other Neuron components in the chain:

use App\Neuron\MyRAG;
use NeuronAI\RAG\DataLoader\StringDataLoader;

$contents = [
    // list of strings (text you want to embed)
];

foreach ($contents as $text) {
    $documents = StringDataLoader::for($text)->getDocuments(); 
    
    MyRAG::make()->addDocuments($documents);
}

Document meta-data

After getting the array of documents from a data loader, you can optionally attach custom metadata to each document. It is saved in the vector store along with the document's default fields:

$documents = FileDataLoader::for($directory)->getDocuments(); 

foreach($documents as $document) {
    $document->addMetadata('user_id', 1234);
}

MyRAG::make()->addDocuments($documents);

Once you have these custom fields in the vector store, you can use hybrid search with databases that support this feature.

Hybrid search lets you narrow the scope of a semantic search query to records that match certain criteria on other document fields, rather than comparing only the vector embeddings. See the Vector Store section to find out which databases support hybrid search.
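To make the idea concrete, here is a toy, in-memory illustration of hybrid search in plain PHP: records are first filtered on a metadata field and only the survivors are ranked by cosine similarity. It does not reflect how Neuron or any specific vector database implements the feature. The `cosineSimilarity` and `hybridSearch` functions, the record layout, and the two-dimensional embeddings are all assumptions made for the example.

```php
<?php

// Toy illustration only: not Neuron's or any vector database's implementation.
// Assumes non-zero vectors of equal length.
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $value) {
        $dot += $value * $b[$i];
        $normA += $value ** 2;
        $normB += $b[$i] ** 2;
    }

    return $dot / (sqrt($normA) * sqrt($normB));
}

function hybridSearch(array $records, array $queryEmbedding, array $filters, int $topK = 2): array
{
    // 1. Keep only the records whose metadata matches every filter.
    $candidates = array_filter($records, function (array $record) use ($filters): bool {
        foreach ($filters as $field => $expected) {
            if (($record['metadata'][$field] ?? null) !== $expected) {
                return false;
            }
        }

        return true;
    });

    // 2. Rank the remaining records by similarity, highest first.
    usort(
        $candidates,
        fn (array $x, array $y): int =>
            cosineSimilarity($y['embedding'], $queryEmbedding)
            <=> cosineSimilarity($x['embedding'], $queryEmbedding)
    );

    return array_slice($candidates, 0, $topK);
}

// Made-up records; the second one is the closest vector but belongs to another user.
$records = [
    ['content' => 'Invoice for user 1234', 'embedding' => [0.9, 0.1], 'metadata' => ['user_id' => 1234]],
    ['content' => 'Invoice for user 5678', 'embedding' => [1.0, 0.0], 'metadata' => ['user_id' => 5678]],
    ['content' => 'Holiday notes of user 1234', 'embedding' => [0.1, 0.9], 'metadata' => ['user_id' => 1234]],
];

$results = hybridSearch($records, [1.0, 0.0], ['user_id' => 1234], topK: 1);

echo $results[0]['content'], PHP_EOL;
```

The record for user 5678 has the embedding closest to the query, yet it never reaches the ranking step because the metadata filter excludes it first.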

Text Splitter

Neuron data loaders take files or text as input and generate an array of \NeuronAI\RAG\Document objects. These documents are embeddable units: the original text is split into smaller pieces that are converted into embeddings and saved in the vector store.

The logic data loaders use to split a long text into chunks can be customized using different strategies. Neuron has a dedicated component for this purpose called "Splitter", and it can be attached to the data loader based on the strategy you prefer or need:

$documents = FileDataLoader::for($directory)
    ->withSplitter(
        new DelimiterTextSplitter()
    )
    ->getDocuments();

DelimiterTextSplitter (default)

This is the default splitter for all data loaders.

$documents = FileDataLoader::for($directory)
    ->withSplitter(
        new DelimiterTextSplitter(
            maxLength: 1000,
            separator: '.',
            wordOverlap: 0
        )
    )
    ->getDocuments();

Each of these parameters has an impact on the performance and accuracy of your RAG agent.

Max Length

No chunk will be longer than this value; longer pieces of text are divided into smaller documents. Chunk length affects the accuracy of the embedding representation: the longer your units of text are, the less accurately the embedding represents them.

Separator

The text is first split into chunks based on a separator. By default the component uses the period character. You can customize the separator with any delimiter that suits your text.

Overlap

Sometimes it is useful to bring words from the previous and next chunks into a document, to strengthen the semantic connection between adjacent sections of the text. By default no overlap is applied.
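The interplay of the three parameters can be explored with a simplified plain-PHP sketch. It is not the code of Neuron's DelimiterTextSplitter: the `splitByDelimiter` function and its exact behavior are assumptions. In particular, this sketch keeps a single piece longer than `$maxLength` whole, and it only carries words over from the previous chunk.

```php
<?php

// Simplified sketch of delimiter-based chunking.
// NOT Neuron's DelimiterTextSplitter: name and behavior are assumptions.
function splitByDelimiter(string $text, int $maxLength, string $separator = '.', int $wordOverlap = 0): array
{
    // 1. Split on the separator and drop empty pieces.
    $pieces = array_values(array_filter(
        array_map('trim', explode($separator, $text)),
        fn (string $piece): bool => $piece !== ''
    ));

    // 2. Greedily pack pieces into chunks no longer than $maxLength.
    //    (A single piece longer than $maxLength is kept whole in this sketch.)
    $chunks = [];
    $current = '';
    foreach ($pieces as $piece) {
        $piece .= $separator;
        $candidate = $current === '' ? $piece : $current . ' ' . $piece;
        if ($current !== '' && strlen($candidate) > $maxLength) {
            $chunks[] = $current;
            $current = $piece;
        } else {
            $current = $candidate;
        }
    }
    if ($current !== '') {
        $chunks[] = $current;
    }

    // 3. Optionally prepend the last words of the previous chunk.
    //    Walking backwards keeps the previous chunk unmodified when it is read.
    if ($wordOverlap > 0) {
        for ($i = count($chunks) - 1; $i > 0; $i--) {
            $previousWords = explode(' ', $chunks[$i - 1]);
            $overlap = implode(' ', array_slice($previousWords, -$wordOverlap));
            $chunks[$i] = $overlap . ' ' . $chunks[$i];
        }
    }

    return $chunks;
}

$text = 'Neuron is a PHP framework. It ships data loaders. Loaders produce documents. Documents get embedded.';

print_r(splitByDelimiter($text, maxLength: 50));
print_r(splitByDelimiter($text, maxLength: 50, wordOverlap: 2));
```

In this sketch the overlap is added after packing, so an overlapped chunk can exceed `$maxLength`. Whether that trade-off is acceptable depends on the limits of your embedding model.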

SentenceTextSplitter

Splits text into sentences, groups them into word-based chunks, and optionally applies an overlap measured in words.

$documents = FileDataLoader::for($directory)
    ->withSplitter(
        new SentenceTextSplitter(
            maxWords: 200,
            overlapWords: 0
        )
    )
    ->getDocuments();

maxWords: maximum number of words per chunk

overlapWords: number of overlapping words between chunks
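The sentence-based strategy can be approximated in a few lines of plain PHP. This is a sketch under stated assumptions, not Neuron's SentenceTextSplitter: the `splitBySentences` function is a hypothetical name, sentences are detected with a simple regular expression on `.`, `!` and `?`, and a sentence longer than `$maxWords` is kept whole.

```php
<?php

// Simplified sketch of sentence-based chunking with a word budget.
// NOT Neuron's SentenceTextSplitter: name and details are assumptions.
function splitBySentences(string $text, int $maxWords, int $overlapWords = 0): array
{
    // Split after ".", "!" or "?" when followed by whitespace.
    $sentences = preg_split('/(?<=[.!?])\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);

    $chunks = [];
    $current = []; // words of the chunk being built
    foreach ($sentences as $sentence) {
        $words = preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY);

        // Close the current chunk when the next sentence would exceed the budget.
        if ($current !== [] && count($current) + count($words) > $maxWords) {
            $chunks[] = implode(' ', $current);
            // Seed the next chunk with the tail of the one just closed.
            $current = $overlapWords > 0 ? array_slice($current, -$overlapWords) : [];
        }

        $current = array_merge($current, $words);
    }
    if ($current !== []) {
        $chunks[] = implode(' ', $current);
    }

    return $chunks;
}

$text = 'Data loaders read files. They produce documents. Splitters cut long text! '
    . 'Is overlap useful? Often it is.';

print_r(splitBySentences($text, maxWords: 8, overlapWords: 2));
```

Because sentences are never cut in the middle, a chunk in this sketch can exceed `$maxWords` when the overlap plus a single sentence is larger than the budget.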

Use standalone components

In the examples above we used the RAG agent instance to handle the final part of the ingestion pipeline: generating embeddings for the document chunks and storing them in the vector database.

As an alternative to the RAG agent instance, you can use the embedding provider and the vector store as standalone components. Remember that the vector store used here must be the same one connected to the RAG agent.

use App\Neuron\MyRAG;
use NeuronAI\RAG\DataLoader\FileDataLoader;
use NeuronAI\RAG\DataLoader\StringDataLoader;
use NeuronAI\RAG\EmbeddingProvider\OpenAIEmbeddingProvider;
use NeuronAI\RAG\VectorStore\FileVectorStore;

$embedder = new OpenAIEmbeddingProvider(
    key: 'OPENAI_API_KEY',
    model: 'OPENAI_MODEL'
);

$store = new FileVectorStore(
    directory: __DIR__,
    key: 'demo'
);

// Process files and contents
$documents = FileDataLoader::for(__DIR__.'/documents')
    ->addReader('pdf', new \NeuronAI\RAG\DataLoader\PdfReader())
    ->getDocuments(); 

// Generate embeddings and store documents in the vector database
$store->addDocuments(
    $embedder->embedDocuments($documents)
);

With this simple process you can ingest gigabytes of data into your vector store to feed your RAG agent.
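When the volume of data grows, it may help to embed and store documents in batches rather than in one call. The helper below is a hypothetical plain-PHP sketch, not part of Neuron: `ingestInBatches` and the stand-in closures are invented for illustration. With the standalone components shown earlier, `$embed` could wrap `$embedder->embedDocuments(...)` and `$store` could wrap `$store->addDocuments(...)`.

```php
<?php

// Hypothetical helper: process documents in fixed-size batches so that
// no single embedding request or store write grows unbounded.
function ingestInBatches(array $documents, int $batchSize, callable $embed, callable $store): int
{
    $batches = 0;
    foreach (array_chunk($documents, $batchSize) as $batch) {
        $store($embed($batch));
        $batches++;
    }

    return $batches;
}

// Stand-ins for the real components, used only to make the sketch runnable.
$stored = [];
$fakeEmbed = fn (array $batch): array => array_map(
    fn (string $text): array => ['content' => $text, 'embedding' => [strlen($text)]],
    $batch
);
$fakeStore = function (array $embedded) use (&$stored): void {
    array_push($stored, ...$embedded);
};

$batches = ingestInBatches(['a', 'bb', 'ccc', 'dddd', 'eeeee'], 2, $fakeEmbed, $fakeStore);

echo "{$batches} batches, ", count($stored), ' documents stored', PHP_EOL;
```

A suitable batch size depends on the request limits of your embedding provider, so treat the value used here as a placeholder.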
