Data loader


Last updated 5 days ago

To build a structured AI application you need to convert all the information you have into text, so you can generate embeddings, save them into a vector store, and then feed them to your Agent so it can answer the user's questions.

Neuron gives you several tools (data loaders) to simplify this process.

use App\Neuron\MyRAG;
use NeuronAI\RAG\DataLoader\FileDataLoader;

MyRAG::make()->addDocuments(
    // Use the file data loader component to process a text file
    FileDataLoader::for(__DIR__.'/my-article.md')->getDocuments()
);

With the Neuron toolkit you can create data loading pipelines that share unified interfaces, which simplifies the interaction between components such as embedding providers, vector stores, and file readers.

FileDataLoader

If you need to extract text from files, the FileDataLoader lets you process any simple text document.

use NeuronAI\RAG\DataLoader\FileDataLoader;

// Read a file and get "documents"
$documents = FileDataLoader::for(__DIR__.'/my-article.md')->getDocuments();

Process PDFs

By default the FileDataLoader processes any simple text document. If you need to process other document formats and transform them into text, you can attach additional readers, like the PdfReader:

use NeuronAI\RAG\DataLoader\FileDataLoader;

// Register the PDF reader
$documents = FileDataLoader::for(__DIR__.'/readme.pdf')
    ->addReader('pdf', new \NeuronAI\RAG\DataLoader\PdfReader())
    ->getDocuments();

Notice that each file reader is associated with a file extension. Based on the extension of the input file, the data loader automatically uses the appropriate reader.
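The extension-based lookup described above can be sketched in plain PHP. This is only an illustration of the idea, not Neuron's internal implementation: the `$readers` registry, the `readFileAsText` function, and the fallback to a plain-text reader are all assumptions made for this example.

```php
<?php

// Hypothetical registry: file extension => callable that turns a path into text.
// It mirrors the idea behind addReader('pdf', ...), not Neuron's internals.
$readers = [
    'txt' => fn (string $path): string => (string) file_get_contents($path),
];

// Register an extra reader for the "pdf" extension (a stand-in for a real PDF reader).
$readers['pdf'] = fn (string $path): string => 'text extracted from ' . basename($path);

/**
 * Pick the reader by file extension, falling back to the plain-text reader
 * when no reader is registered for that extension (assumed behaviour).
 */
function readFileAsText(string $path, array $readers): string
{
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    $reader = $readers[$extension] ?? $readers['txt'];

    return $reader($path);
}

echo readFileAsText('/tmp/readme.pdf', $readers), PHP_EOL; // prints "text extracted from readme.pdf"
```

Keeping the mapping keyed by extension means new formats can be supported by registering one more callable, without touching the lookup code.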

StringDataLoader

If you are already getting text from your database or other sources, you can use the StringDataLoader to convert this text into documents, ready to be embedded and stored by the other Neuron components in the chain:

use App\Neuron\MyRAG;
use NeuronAI\RAG\DataLoader\StringDataLoader;

$contents = [
    // list of strings (text you want to embed)
];

foreach ($contents as $text) {
    $documents = StringDataLoader::for($text)->getDocuments(); 
    
    MyRAG::make()->addDocuments($documents);
}

Configure text splitter

A Neuron data loader takes text as input and generates an array of \NeuronAI\RAG\Document objects. These documents are the embeddable units: each one contains a chunk of the original text, ready to be converted to a vector embedding.

The way a long text is split into chunks can be customized with a few configuration parameters.

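use NeuronAI\RAG\DataLoader\FileDataLoader;

$documents = FileDataLoader::for(__DIR__.'/readme.pdf')
    // PDF files need the PDF reader; plain text files work without it
    ->addReader('pdf', new \NeuronAI\RAG\DataLoader\PdfReader())
    ->withSeparator('.')   // split the text on periods
    ->withMaxLength(1000)  // maximum length of each chunk
    ->withOverlap(0)       // no overlap with adjacent chunks
    ->getDocuments();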

Each of these parameters has an impact on the performance and accuracy of your RAG agent.

Separator

The text is first split into chunks based on a separator. By default the component uses the period character. You can replace it with any delimiter that suits your text.
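As a rough illustration of separator-based splitting, here is a minimal sketch in plain PHP. The `splitBySeparator` helper is hypothetical and is not part of Neuron; trimming whitespace and dropping empty pieces are choices made for this example only.

```php
<?php

/**
 * Split text on a separator, trim each piece, and drop empty pieces.
 * Hypothetical helper for illustration; not part of Neuron.
 */
function splitBySeparator(string $text, string $separator = '.'): array
{
    $pieces = array_map('trim', explode($separator, $text));

    // Remove the empty strings left by a trailing separator, then reindex.
    return array_values(array_filter($pieces, fn (string $piece): bool => $piece !== ''));
}

// Default separator: the period character.
print_r(splitBySeparator('Neuron is a PHP framework. It builds AI agents.'));

// A custom delimiter, here a blank line between paragraphs.
print_r(splitBySeparator("First paragraph\n\nSecond paragraph", "\n\n"));
```

A period works for ordinary prose, while a blank line or a heading marker may fit structured text such as Markdown better.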

Max Length

No chunk will be longer than this value: longer chunks are divided into smaller documents. The length affects the accuracy of the embedding representation. The longer your units of text are, the less accurate their embeddings will be.
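One possible way to enforce a maximum length is sketched below, assuming the text has already been split into pieces. The `chunkByMaxLength` function and its greedy packing strategy are assumptions for illustration and may differ from what Neuron does internally. Lengths are measured in bytes with `strlen`, so multibyte text would need `mb_strlen` instead.

```php
<?php

/**
 * Greedily pack pieces into chunks of at most $maxLength bytes.
 * Pieces longer than $maxLength are cut into fixed-size parts.
 * Hypothetical helper for illustration; $maxLength must be at least 1.
 */
function chunkByMaxLength(array $pieces, int $maxLength, string $glue = ' '): array
{
    $chunks = [];
    $current = '';

    foreach ($pieces as $piece) {
        // An oversized piece cannot fit in any chunk: flush, then hard-split it.
        if (strlen($piece) > $maxLength) {
            if ($current !== '') {
                $chunks[] = $current;
                $current = '';
            }
            array_push($chunks, ...str_split($piece, $maxLength));
            continue;
        }

        $candidate = $current === '' ? $piece : $current . $glue . $piece;

        if (strlen($candidate) <= $maxLength) {
            $current = $candidate; // still fits, keep packing
        } else {
            $chunks[] = $current;  // close the current chunk and start a new one
            $current = $piece;
        }
    }

    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}

print_r(chunkByMaxLength(['aaa', 'bbb', 'cc'], 7));
```

Packing several short pieces into one chunk keeps the number of embeddings down, while the hard split guarantees the limit is never exceeded.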

Overlap

Sometimes it is useful to bring words from the previous and next chunks into a document, to strengthen the semantic connection between adjacent sections of the text. By default no overlap is applied.
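The idea of borrowing words from neighbouring chunks can be sketched as follows. The `applyOverlap` function is hypothetical, and counting the overlap in whitespace-separated words is an assumption based on the description above rather than on Neuron's code.

```php
<?php

/**
 * Add up to $overlap words from the previous and next chunk to each chunk.
 * Hypothetical helper for illustration; not part of Neuron.
 */
function applyOverlap(array $chunks, int $overlap): array
{
    $chunks = array_values($chunks); // make sure the keys are 0, 1, 2, ...

    if ($overlap <= 0) {
        return $chunks; // no overlap requested
    }

    // Pre-split every chunk into words once.
    $words = array_map(
        fn (string $chunk): array => preg_split('/\s+/', trim($chunk), -1, PREG_SPLIT_NO_EMPTY),
        $chunks
    );

    $result = [];
    foreach ($chunks as $i => $chunk) {
        // Last words of the previous chunk, first words of the next one.
        $before = isset($words[$i - 1]) ? array_slice($words[$i - 1], -$overlap) : [];
        $after = isset($words[$i + 1]) ? array_slice($words[$i + 1], 0, $overlap) : [];

        $result[] = implode(' ', array_merge($before, [$chunk], $after));
    }

    return $result;
}

print_r(applyOverlap(['one two three', 'four five six', 'seven eight nine'], 1));
```

In this sketch the borrowed words are added on top of the chunk, so the resulting documents can be slightly longer than the configured maximum length.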

Full Featured Example (from files to vector store)

With this toolkit you can iterate over a list of files and convert them, making sure every file type has a reader that transforms the document into text in the right way.

Here is a complete example of a full-featured embedding process:

use App\Neuron\MyRAG;
use NeuronAI\RAG\DataLoader\FileDataLoader;
use NeuronAI\RAG\DataLoader\StringDataLoader;

/*
 * Process files
 */

$files = [
    // list of file paths...
];

foreach ($files as $file) {
    // Register the PDF reader in case the list contains PDF files
    $documents = FileDataLoader::for($file)
        ->addReader('pdf', new \NeuronAI\RAG\DataLoader\PdfReader())
        ->getDocuments();

    MyRAG::make()->addDocuments($documents);
}

/*
 * Process raw strings
 */

$contents = [
    // list of strings (text you want to embed)
];

foreach ($contents as $text) {
    $documents = StringDataLoader::for($text)->getDocuments(); 

    MyRAG::make()->addDocuments($documents);
}

With this simple process you can ingest gigabytes of data into your vector store to feed your RAG agent.

To use PdfReader you need to install the pdftotext PHP extension.