Large Language Models (LLMs) are powerful, but they have a critical limitation: they only know what they were trained on. Ask them about your company’s internal documentation, recent product updates, or proprietary processes, and they’ll either hallucinate or admit ignorance. Retrieval-Augmented Generation (RAG) solves this by combining the reasoning power of LLMs with real-time access to your own data.
In this deep-dive, we’ll explore how to implement production-ready RAG systems using Spring AI, covering everything from document ingestion to advanced retrieval strategies.
What is RAG and Why Does It Matter?
RAG is an architectural pattern that enhances LLM responses by retrieving relevant information from a knowledge base before generating answers. Instead of relying solely on the model’s training data, RAG systems:
- Retrieve relevant documents from a vector database
- Augment the prompt with retrieved context
- Generate a response grounded in actual data
┌─────────────────────────────────────────────────────────────────────┐
│ RAG Pipeline │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────┐ ┌──────────────┐ ┌───────────────────────┐ │
│ │ User │ │ Embedding │ │ Vector Database │ │
│ │ Query │────►│ Model │────►│ (Similarity Search) │ │
│ └──────────┘ └──────────────┘ └───────────┬───────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Retrieved Context Documents │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Augmented Prompt = User Query + Retrieved Context │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ LLM (Generate) │ │
│ └──────────────────────────────┬───────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Grounded Response with Citations │ │
│ └──────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Benefits of RAG
- Current Information: Access data that didn’t exist when the model was trained
- Reduced Hallucination: Responses are grounded in actual documents
- Source Attribution: Cite specific documents for transparency
- Cost Efficiency: Smaller models can perform well with good retrieval
- Data Privacy: Keep sensitive data in your own infrastructure
Spring AI RAG Architecture
Spring AI provides comprehensive support for building RAG systems with familiar Spring patterns. Let’s explore the key components.
Project Dependencies
<dependencies>
<!-- Spring AI Core -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-openai-spring-boot-starter</artifactId>
<version>1.0.0</version>
</dependency>
<!-- Vector Store - Choose one -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-pgvector-store-spring-boot-starter</artifactId>
<version>1.0.0</version>
</dependency>
<!-- Document Readers -->
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-pdf-document-reader</artifactId>
<version>1.0.0</version>
</dependency>
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-tika-document-reader</artifactId>
<version>1.0.0</version>
</dependency>
</dependencies>
Configuration
spring:
ai:
openai:
api-key: ${OPENAI_API_KEY}
embedding:
options:
model: text-embedding-3-small
chat:
options:
model: gpt-4-turbo-preview
temperature: 0.2
vectorstore:
pgvector:
index-type: HNSW
distance-type: COSINE_DISTANCE
dimensions: 1536
datasource:
url: jdbc:postgresql://localhost:5432/ragdb
username: ${DB_USERNAME}
password: ${DB_PASSWORD}
Document Ingestion Pipeline
The first step in RAG is getting your documents into the vector database. This involves reading, chunking, embedding, and storing documents.
Document Reader Service
@Service
public class DocumentIngestionService {
private final VectorStore vectorStore;
private final EmbeddingModel embeddingModel;
private final DocumentTransformer documentTransformer;
public DocumentIngestionService(VectorStore vectorStore,
EmbeddingModel embeddingModel) {
this.vectorStore = vectorStore;
this.embeddingModel = embeddingModel;
this.documentTransformer = new TokenTextSplitter();
}
/**
* Ingest a PDF document into the vector store
*/
public IngestionResult ingestPdf(Resource pdfResource,
Map<String, Object> metadata) {
// Read PDF
PagePdfDocumentReader reader = new PagePdfDocumentReader(
pdfResource,
PdfDocumentReaderConfig.builder()
.withPageExtractedTextFormatter(
new ExtractedTextFormatter.Builder()
.withNumberOfBottomTextLinesToDelete(3)
.withNumberOfTopPagesToSkipBeforeDelete(1)
.build()
)
.withPagesPerDocument(1)
.build()
);
List<Document> documents = reader.get();
// Add metadata
documents.forEach(doc -> doc.getMetadata().putAll(metadata));
// Process and store
return processAndStore(documents);
}
/**
* Ingest various document types using Apache Tika
*/
public IngestionResult ingestDocument(Resource resource,
String contentType,
Map<String, Object> metadata) {
TikaDocumentReader reader = new TikaDocumentReader(resource);
List<Document> documents = reader.get();
// Add metadata
documents.forEach(doc -> {
doc.getMetadata().putAll(metadata);
doc.getMetadata().put("contentType", contentType);
doc.getMetadata().put("ingestedAt", Instant.now().toString());
});
return processAndStore(documents);
}
private IngestionResult processAndStore(List<Document> documents) {
// Split into chunks
List<Document> chunks = documentTransformer.apply(documents);
// Store in vector database (embedding happens automatically)
vectorStore.add(chunks);
return new IngestionResult(
documents.size(),
chunks.size(),
Instant.now()
);
}
}
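The IngestionResult returned by processAndStore isn't shown above. A minimal sketch, assuming it only needs to report what was ingested (the field names are my choice, not from the original):

```java
import java.time.Instant;

/**
 * Summary of one ingestion run: how many source documents were read,
 * how many chunks were produced, and when the run finished.
 */
public record IngestionResult(int documentCount, int chunkCount, Instant ingestedAt) {

    /** Average number of chunks produced per source document. */
    public double chunksPerDocument() {
        return documentCount == 0 ? 0.0 : (double) chunkCount / documentCount;
    }
}
```

A record keeps this immutable and gives you equals/hashCode for free, which is handy if you log or cache ingestion results.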
Advanced Text Splitting
Proper chunking is crucial for retrieval quality. Here’s a custom splitter that respects document structure:
@Component
public class SemanticTextSplitter implements DocumentTransformer {
private static final int DEFAULT_CHUNK_SIZE = 1000;
private static final int DEFAULT_OVERLAP = 200;
private final int chunkSize;
private final int overlap;
public SemanticTextSplitter() {
this(DEFAULT_CHUNK_SIZE, DEFAULT_OVERLAP);
}
public SemanticTextSplitter(int chunkSize, int overlap) {
this.chunkSize = chunkSize;
this.overlap = overlap;
}
@Override
public List<Document> apply(List<Document> documents) {
return documents.stream()
.flatMap(doc -> splitDocument(doc).stream())
.collect(Collectors.toList());
}
private List<Document> splitDocument(Document document) {
String content = document.getContent();
List<Document> chunks = new ArrayList<>();
// First, try to split by semantic boundaries
List<String> sections = splitBySections(content);
for (int sectionIndex = 0; sectionIndex < sections.size(); sectionIndex++) {
String section = sections.get(sectionIndex);
// If section is small enough, keep as single chunk
if (section.length() <= chunkSize) {
chunks.add(createChunk(document, section, sectionIndex, 0));
continue;
}
// Split large sections by paragraphs with overlap
List<String> paragraphChunks = splitWithOverlap(section);
for (int chunkIndex = 0; chunkIndex < paragraphChunks.size(); chunkIndex++) {
chunks.add(createChunk(document, paragraphChunks.get(chunkIndex),
sectionIndex, chunkIndex));
}
}
return chunks;
}
private List<String> splitBySections(String content) {
// Split by markdown headers or double newlines.
// (?m) enables MULTILINE mode so ^ matches at each line start;
// without it, the header lookahead only matches the very start of the input.
String[] sections = content.split("(?m)(?=^#{1,3}\\s)|\\n\\n(?=[A-Z])");
return Arrays.stream(sections)
.filter(s -> !s.isBlank())
.collect(Collectors.toList());
}
private List<String> splitWithOverlap(String text) {
List<String> chunks = new ArrayList<>();
String[] sentences = text.split("(?<=[.!?])\\s+");
StringBuilder currentChunk = new StringBuilder();
StringBuilder overlapBuffer = new StringBuilder();
for (String sentence : sentences) {
if (currentChunk.length() + sentence.length() > chunkSize) {
if (currentChunk.length() > 0) {
chunks.add(currentChunk.toString().trim());
// Start new chunk with overlap
currentChunk = new StringBuilder(overlapBuffer.toString());
overlapBuffer = new StringBuilder();
}
}
currentChunk.append(sentence).append(" ");
overlapBuffer.append(sentence).append(" ");
// Keep overlap buffer trimmed
while (overlapBuffer.length() > overlap) {
int spaceIndex = overlapBuffer.indexOf(" ", 1);
if (spaceIndex > 0) {
overlapBuffer.delete(0, spaceIndex + 1);
} else {
break;
}
}
}
if (currentChunk.length() > 0) {
chunks.add(currentChunk.toString().trim());
}
return chunks;
}
private Document createChunk(Document source, String content,
int sectionIndex, int chunkIndex) {
Map<String, Object> metadata = new HashMap<>(source.getMetadata());
metadata.put("sectionIndex", sectionIndex);
metadata.put("chunkIndex", chunkIndex);
metadata.put("sourceId", source.getId());
return new Document(content, metadata);
}
}
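The section-splitting regex is easy to get subtly wrong: in Java, `^` only matches the very start of the input unless MULTILINE mode (`(?m)`) is on, so the header lookahead silently never fires mid-document. A standalone check of the intended behavior on a toy markdown string:

```java
import java.util.Arrays;
import java.util.List;

public class SectionSplitDemo {

    /** Same splitting rule as splitBySections: markdown headers or blank lines. */
    static List<String> splitBySections(String content) {
        // (?m) makes ^ match at every line start, so each header opens a new section
        String[] sections = content.split("(?m)(?=^#{1,3}\\s)|\\n\\n(?=[A-Z])");
        return Arrays.stream(sections).filter(s -> !s.isBlank()).toList();
    }

    public static void main(String[] args) {
        String doc = "# Intro\nRAG combines retrieval and generation.\n"
                + "# Setup\nAdd the starter dependency.";
        // Two sections, one per header
        splitBySections(doc).forEach(s -> System.out.println("---\n" + s));
    }
}
```

Since Java 8, a zero-width match at the start of the input no longer produces a leading empty string, so a document that begins with a header splits cleanly.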
Building the RAG Service
Now let’s implement the core RAG functionality:
@Service
public class RagService {
private final VectorStore vectorStore;
private final ChatClient chatClient;
private final EmbeddingModel embeddingModel;
public RagService(VectorStore vectorStore,
ChatClient.Builder chatClientBuilder,
EmbeddingModel embeddingModel) {
this.vectorStore = vectorStore;
this.chatClient = chatClientBuilder.build();
this.embeddingModel = embeddingModel;
}
public RagResponse query(String question) {
return query(question, RagOptions.defaults());
}
public RagResponse query(String question, RagOptions options) {
// 1. Retrieve relevant documents
SearchRequest searchRequest = SearchRequest.query(question)
.withTopK(options.getTopK())
.withSimilarityThreshold(options.getSimilarityThreshold())
.withFilterExpression(options.getFilterExpression());
List<Document> relevantDocs = vectorStore.similaritySearch(searchRequest);
if (relevantDocs.isEmpty()) {
return RagResponse.noContext(
"I couldn't find any relevant information to answer your question."
);
}
// 2. Build augmented prompt
String context = buildContext(relevantDocs);
String augmentedPrompt = buildPrompt(question, context, options);
// 3. Generate response
ChatResponse response = chatClient.prompt()
.user(augmentedPrompt)
.call()
.chatResponse();
// 4. Build response with sources
return RagResponse.builder()
.answer(response.getResult().getOutput().getContent())
.sources(extractSources(relevantDocs))
.tokensUsed(response.getMetadata().getUsage().getTotalTokens())
.build();
}
private String buildContext(List<Document> documents) {
StringBuilder context = new StringBuilder();
for (int i = 0; i < documents.size(); i++) {
Document doc = documents.get(i);
context.append(String.format("[Document %d]%n", i + 1));
context.append(String.format("Source: %s%n",
doc.getMetadata().getOrDefault("source", "Unknown")));
context.append(String.format("Content: %s%n%n", doc.getContent()));
}
return context.toString();
}
private String buildPrompt(String question, String context, RagOptions options) {
return String.format("""
You are a helpful assistant that answers questions based on the provided context.
## Instructions
- Answer the question based ONLY on the provided context
- If the context doesn't contain enough information, say so
- Cite your sources using [Document N] notation
- Be concise but thorough
%s
## Context
%s
## Question
%s
## Answer
""",
options.getAdditionalInstructions(),
context,
question
);
}
private List<SourceReference> extractSources(List<Document> documents) {
return documents.stream()
.map(doc -> new SourceReference(
(String) doc.getMetadata().getOrDefault("source", "Unknown"),
(String) doc.getMetadata().getOrDefault("title", "Untitled"),
doc.getMetadata().containsKey("page") ?
((Number) doc.getMetadata().get("page")).intValue() : null
))
.distinct()
.collect(Collectors.toList());
}
}
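RagService relies on a RagOptions holder that isn't shown. A minimal sketch, assuming defaults that match the controller later in this post (topK 5, threshold 0.7) and using a plain String for the filter expression (your vector store may want its own filter type instead):

```java
public class RagOptions {
    private final int topK;
    private final double similarityThreshold;
    private final String filterExpression;        // optional metadata filter
    private final String additionalInstructions;  // extra prompt instructions

    private RagOptions(Builder b) {
        this.topK = b.topK;
        this.similarityThreshold = b.similarityThreshold;
        this.filterExpression = b.filterExpression;
        this.additionalInstructions = b.additionalInstructions;
    }

    public static RagOptions defaults() { return builder().build(); }
    public static Builder builder() { return new Builder(); }

    public int getTopK() { return topK; }
    public double getSimilarityThreshold() { return similarityThreshold; }
    public String getFilterExpression() { return filterExpression; }
    public String getAdditionalInstructions() { return additionalInstructions; }

    public static class Builder {
        private int topK = 5;
        private double similarityThreshold = 0.7;
        private String filterExpression = null;
        private String additionalInstructions = "";

        public Builder topK(int topK) { this.topK = topK; return this; }
        public Builder similarityThreshold(double t) { this.similarityThreshold = t; return this; }
        public Builder filterExpression(String f) { this.filterExpression = f; return this; }
        public Builder additionalInstructions(String a) { this.additionalInstructions = a; return this; }
        public RagOptions build() { return new RagOptions(this); }
    }
}
```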
Advanced Retrieval Strategies
Basic similarity search works well, but production systems often need more sophisticated retrieval.
Hybrid Search
Combine semantic and keyword search for better results:
@Service
public class HybridSearchService {
private final VectorStore vectorStore;
private final ElasticsearchClient elasticsearchClient;
public HybridSearchService(VectorStore vectorStore,
ElasticsearchClient elasticsearchClient) {
this.vectorStore = vectorStore;
this.elasticsearchClient = elasticsearchClient;
}
public List<Document> hybridSearch(String query, int topK) {
// Semantic search
List<Document> semanticResults = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(topK * 2)
);
// Keyword search (BM25)
List<Document> keywordResults = keywordSearch(query, topK * 2);
// Reciprocal Rank Fusion
return reciprocalRankFusion(semanticResults, keywordResults, topK);
}
private List<Document> keywordSearch(String query, int topK) {
try {
SearchResponse<Document> response = elasticsearchClient.search(s -> s
.index("documents")
.query(q -> q
.multiMatch(m -> m
.query(query)
.fields("content^2", "title^3", "metadata.*")
.fuzziness("AUTO")
)
)
.size(topK),
Document.class
);
return response.hits().hits().stream()
.map(Hit::source)
.collect(Collectors.toList());
} catch (IOException e) {
// The Elasticsearch client throws a checked IOException;
// degrade gracefully to semantic-only results rather than failing the query
return List.of();
}
}
private List<Document> reciprocalRankFusion(List<Document> list1,
List<Document> list2,
int topK) {
int k = 60; // Constant for RRF
Map<String, Double> scores = new HashMap<>();
Map<String, Document> documents = new HashMap<>();
// Score from first list
for (int i = 0; i < list1.size(); i++) {
Document doc = list1.get(i);
String id = doc.getId();
scores.merge(id, 1.0 / (k + i + 1), Double::sum);
documents.put(id, doc);
}
// Score from second list
for (int i = 0; i < list2.size(); i++) {
Document doc = list2.get(i);
String id = doc.getId();
scores.merge(id, 1.0 / (k + i + 1), Double::sum);
documents.put(id, doc);
}
// Sort by combined score and return top K
return scores.entrySet().stream()
.sorted(Map.Entry.<String, Double>comparingByValue().reversed())
.limit(topK)
.map(e -> documents.get(e.getKey()))
.collect(Collectors.toList());
}
}
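To build intuition for reciprocal rank fusion, here is the same scoring worked through on a toy example, standalone with plain string IDs instead of Documents:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RrfDemo {

    /** Fuse two ranked ID lists with RRF: score(id) = sum over lists of 1 / (k + rank + 1). */
    static List<String> fuse(List<String> list1, List<String> list2, int topK) {
        int k = 60; // standard RRF constant
        Map<String, Double> scores = new HashMap<>();
        for (int i = 0; i < list1.size(); i++) scores.merge(list1.get(i), 1.0 / (k + i + 1), Double::sum);
        for (int i = 0; i < list2.size(); i++) scores.merge(list2.get(i), 1.0 / (k + i + 1), Double::sum);
        List<String> fused = new ArrayList<>(scores.keySet());
        fused.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return fused.subList(0, Math.min(topK, fused.size()));
    }

    public static void main(String[] args) {
        // "B" appears in both lists, so it beats "A" even though "A" tops the first list:
        // B = 1/62 + 1/61, C = 1/63 + 1/62, A = 1/61, D = 1/63
        System.out.println(fuse(List.of("A", "B", "C"), List.of("B", "C", "D"), 4));
        // prints [B, C, A, D]
    }
}
```

This is the core property of RRF: documents that both retrievers agree on rise to the top, without having to calibrate semantic and BM25 scores against each other.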
Query Expansion with HyDE
Hypothetical Document Embeddings (HyDE) generates a hypothetical answer to improve retrieval:
@Service
public class HydeQueryExpander {
private final ChatClient chatClient;
private final VectorStore vectorStore;
public HydeQueryExpander(ChatClient chatClient, VectorStore vectorStore) {
this.chatClient = chatClient;
this.vectorStore = vectorStore;
}
public List<Document> searchWithHyde(String query, int topK) {
// Generate hypothetical answer
String hypotheticalAnswer = generateHypotheticalAnswer(query);
// Combine query and hypothetical answer for embedding
String expandedQuery = query + "\n\n" + hypotheticalAnswer;
// Search with expanded query
return vectorStore.similaritySearch(
SearchRequest.query(expandedQuery).withTopK(topK)
);
}
private String generateHypotheticalAnswer(String query) {
String prompt = String.format("""
Write a short, factual paragraph that would answer the following question.
Do not include phrases like "According to..." or "Based on...".
Just write the content as if it were from an authoritative source.
Question: %s
Answer:
""", query);
return chatClient.prompt()
.user(prompt)
.call()
.content();
}
}
Multi-Query Retrieval
Generate multiple query variations to improve recall:
@Service
public class MultiQueryRetriever {
private final ChatClient chatClient;
private final VectorStore vectorStore;
private final EmbeddingModel embeddingModel; // used by rerank() below
public MultiQueryRetriever(ChatClient chatClient,
VectorStore vectorStore,
EmbeddingModel embeddingModel) {
this.chatClient = chatClient;
this.vectorStore = vectorStore;
this.embeddingModel = embeddingModel;
}
public List<Document> retrieve(String originalQuery, int topK) {
// Generate query variations
List<String> queryVariations = generateQueryVariations(originalQuery);
// Search with each variation
Set<Document> allResults = new LinkedHashSet<>();
for (String query : queryVariations) {
List<Document> results = vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(topK)
);
allResults.addAll(results);
}
// Re-rank and return top K
return rerank(new ArrayList<>(allResults), originalQuery, topK);
}
private List<String> generateQueryVariations(String query) {
String prompt = String.format("""
Generate 3 different versions of the following query for a search system.
Each version should capture the same intent but use different words or phrasings.
Return only the queries, one per line.
Original query: %s
Variations:
""", query);
String response = chatClient.prompt()
.user(prompt)
.call()
.content();
List<String> variations = new ArrayList<>();
variations.add(query); // Include original
Arrays.stream(response.split("\n"))
.map(String::trim)
.map(s -> s.replaceFirst("^\\d+[.)]\\s*", "")) // strip "1." / "2)" numbering LLMs often add
.filter(s -> !s.isEmpty())
.forEach(variations::add);
return variations.stream()
.distinct()
.limit(4)
.collect(Collectors.toList());
}
private List<Document> rerank(List<Document> documents,
String query, int topK) {
// Simple reranking using embedding similarity
float[] queryEmbedding = embeddingModel.embed(query);
return documents.stream()
.sorted((d1, d2) -> {
float sim1 = cosineSimilarity(queryEmbedding,
embeddingModel.embed(d1.getContent()));
float sim2 = cosineSimilarity(queryEmbedding,
embeddingModel.embed(d2.getContent()));
return Float.compare(sim2, sim1);
})
.limit(topK)
.collect(Collectors.toList());
}
}
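The rerank method above (and the evaluation service later in this post) calls a cosineSimilarity helper that isn't shown. A straightforward implementation, presented standalone here; in practice it would live as a private static method in the classes that call it:

```java
public final class VectorMath {

    private VectorMath() {}

    /** Cosine similarity between two equal-length vectors, in [-1, 1]. */
    public static float cosineSimilarity(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) {
            return 0f; // degenerate zero vector: define similarity as 0
        }
        return (float) (dot / (Math.sqrt(normA) * Math.sqrt(normB)));
    }
}
```

Note that reranking by re-embedding every candidate, as the code above does, issues one embedding call per document per comparison; caching document embeddings or using a dedicated cross-encoder reranker is usually cheaper.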
Conversation Memory with RAG
Maintain context across multiple turns:
@Service
public class ConversationalRagService {
private final RagService ragService;
private final ChatClient chatClient;
private final ConversationMemoryStore memoryStore;
public ConversationalRagService(RagService ragService,
ChatClient chatClient,
ConversationMemoryStore memoryStore) {
this.ragService = ragService;
this.chatClient = chatClient;
this.memoryStore = memoryStore;
}
public RagResponse chat(String sessionId, String userMessage) {
// Get conversation history
List<Message> history = memoryStore.getHistory(sessionId);
// Determine if this is a follow-up question
String standaloneQuestion = reformulateQuestion(history, userMessage);
// Perform RAG with reformulated question
RagResponse ragResponse = ragService.query(standaloneQuestion);
// Store in memory
memoryStore.addMessage(sessionId, MessageType.USER, userMessage);
memoryStore.addMessage(sessionId, MessageType.ASSISTANT,
ragResponse.getAnswer());
return ragResponse;
}
private String reformulateQuestion(List<Message> history, String question) {
if (history.isEmpty()) {
return question;
}
String conversationContext = history.stream()
.map(m -> m.getType() + ": " + m.getContent())
.collect(Collectors.joining("\n"));
String prompt = String.format("""
Given the conversation history and the follow-up question,
reformulate the question to be a standalone question that captures
all necessary context.
Conversation history:
%s
Follow-up question: %s
Standalone question:
""", conversationContext, question);
return chatClient.prompt()
.user(prompt)
.call()
.content()
.trim();
}
}
Evaluation and Monitoring
RAG Evaluation Metrics
@Service
public class RagEvaluationService {
private final ChatClient chatClient;
private final EmbeddingModel embeddingModel;
public RagEvaluationService(ChatClient chatClient, EmbeddingModel embeddingModel) {
this.chatClient = chatClient;
this.embeddingModel = embeddingModel;
}
public EvaluationResult evaluate(String question,
String generatedAnswer,
List<Document> retrievedDocs,
String groundTruth) {
return EvaluationResult.builder()
.faithfulness(evaluateFaithfulness(generatedAnswer, retrievedDocs))
.relevance(evaluateRelevance(question, generatedAnswer))
.contextRelevance(evaluateContextRelevance(question, retrievedDocs))
.answerSimilarity(computeAnswerSimilarity(generatedAnswer, groundTruth))
.build();
}
private double evaluateFaithfulness(String answer, List<Document> docs) {
String context = docs.stream()
.map(Document::getContent)
.collect(Collectors.joining("\n"));
String prompt = String.format("""
Evaluate if the following answer is faithful to the given context.
An answer is faithful if all claims in it can be verified from the context.
Context: %s
Answer: %s
Return a score from 0 to 1, where 1 means completely faithful.
Return ONLY the number.
""", context, answer);
String response = chatClient.prompt()
.user(prompt)
.call()
.content();
return Double.parseDouble(response.trim());
}
private double evaluateRelevance(String question, String answer) {
String prompt = String.format("""
Evaluate if the following answer is relevant to the question.
Question: %s
Answer: %s
Return a score from 0 to 1, where 1 means highly relevant.
Return ONLY the number.
""", question, answer);
String response = chatClient.prompt()
.user(prompt)
.call()
.content();
return Double.parseDouble(response.trim());
}
private double evaluateContextRelevance(String question, List<Document> docs) {
if (docs.isEmpty()) {
return 0.0;
}
// Embed the question once, not once per document
float[] questionEmbed = embeddingModel.embed(question);
double totalScore = 0;
for (Document doc : docs) {
float[] docEmbed = embeddingModel.embed(doc.getContent());
totalScore += cosineSimilarity(questionEmbed, docEmbed);
}
return totalScore / docs.size();
}
private double computeAnswerSimilarity(String generated, String groundTruth) {
float[] genEmbed = embeddingModel.embed(generated);
float[] truthEmbed = embeddingModel.embed(groundTruth);
return cosineSimilarity(genEmbed, truthEmbed);
}
}
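LLM judges don't always return a bare number, so the Double.parseDouble calls above can throw on replies like "Score: 0.8". A defensive parser (a hypothetical helper, not part of the service above) that extracts the first number and clamps it to [0, 1]:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ScoreParser {

    private static final Pattern NUMBER = Pattern.compile("\\d*\\.?\\d+");

    /** Extract the first number from an LLM reply, clamped to [0, 1]; fallback if none found. */
    static double parseScore(String response, double fallback) {
        Matcher m = NUMBER.matcher(response);
        if (!m.find()) {
            return fallback;
        }
        double score = Double.parseDouble(m.group());
        return Math.max(0.0, Math.min(1.0, score));
    }
}
```

Choosing a sensible fallback matters: for faithfulness a conservative default (e.g. 0) avoids silently passing answers the judge failed to score.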
Observability
@Configuration
public class RagObservabilityConfig {
@Bean
public MeterRegistryCustomizer<MeterRegistry> ragMetrics() {
return registry -> {
// Tag every metric from this service for easy filtering in dashboards
registry.config().commonTags("component", "rag");
// Note: the VectorStore interface exposes no portable count() method.
// If your store does (e.g. a SQL COUNT on the pgvector table),
// register a Gauge for the document count here.
};
}
}
@Aspect
@Component
public class RagMetricsAspect {
private final MeterRegistry meterRegistry;
public RagMetricsAspect(MeterRegistry meterRegistry) {
this.meterRegistry = meterRegistry;
}
@Around("execution(* com.example.rag.RagService.query(..))")
public Object measureQuery(ProceedingJoinPoint joinPoint) throws Throwable {
Timer.Sample sample = Timer.start(meterRegistry);
try {
Object result = joinPoint.proceed();
RagResponse response = (RagResponse) result;
// Record metrics
meterRegistry.counter("rag.queries.total").increment();
meterRegistry.counter("rag.sources.retrieved")
.increment(response.getSources().size());
meterRegistry.counter("rag.tokens.used")
.increment(response.getTokensUsed());
sample.stop(Timer.builder("rag.query.duration")
.tag("status", "success")
.register(meterRegistry));
return result;
} catch (Exception e) {
sample.stop(Timer.builder("rag.query.duration")
.tag("status", "error")
.register(meterRegistry));
meterRegistry.counter("rag.queries.errors",
"exception", e.getClass().getSimpleName()).increment();
throw e;
}
}
}
REST API for RAG
@RestController
@RequestMapping("/api/rag")
public class RagController {
private final RagService ragService;
private final ConversationalRagService conversationalRagService;
private final DocumentIngestionService ingestionService;
public RagController(RagService ragService,
ConversationalRagService conversationalRagService,
DocumentIngestionService ingestionService) {
this.ragService = ragService;
this.conversationalRagService = conversationalRagService;
this.ingestionService = ingestionService;
}
@PostMapping("/query")
public ResponseEntity<RagResponse> query(@RequestBody QueryRequest request) {
RagOptions options = RagOptions.builder()
.topK(request.getTopK() != null ? request.getTopK() : 5)
.similarityThreshold(request.getThreshold() != null ?
request.getThreshold() : 0.7)
.filterExpression(request.getFilter())
.build();
RagResponse response = ragService.query(request.getQuestion(), options);
return ResponseEntity.ok(response);
}
@PostMapping("/chat/{sessionId}")
public ResponseEntity<RagResponse> chat(
@PathVariable String sessionId,
@RequestBody ChatRequest request) {
RagResponse response = conversationalRagService.chat(
sessionId,
request.getMessage()
);
return ResponseEntity.ok(response);
}
@PostMapping("/ingest")
public ResponseEntity<IngestionResult> ingestDocument(
@RequestParam("file") MultipartFile file,
@RequestParam(required = false) String source,
@RequestParam(required = false) String category) {
try {
Resource resource = new ByteArrayResource(file.getBytes());
Map<String, Object> metadata = new HashMap<>();
metadata.put("source", source != null ? source : file.getOriginalFilename());
metadata.put("category", category);
metadata.put("originalFilename", file.getOriginalFilename());
IngestionResult result = ingestionService.ingestDocument(
resource,
file.getContentType(),
metadata
);
return ResponseEntity.ok(result);
} catch (IOException e) {
return ResponseEntity.badRequest().build();
}
}
}
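The request DTOs bound by the controller aren't shown. Minimal sketches with the getter names the controller expects; the exact field set is an assumption:

```java
public class QueryRequest {
    private String question;
    private Integer topK;      // optional; controller defaults to 5
    private Double threshold;  // optional; controller defaults to 0.7
    private String filter;     // optional metadata filter expression

    public String getQuestion() { return question; }
    public void setQuestion(String question) { this.question = question; }
    public Integer getTopK() { return topK; }
    public void setTopK(Integer topK) { this.topK = topK; }
    public Double getThreshold() { return threshold; }
    public void setThreshold(Double threshold) { this.threshold = threshold; }
    public String getFilter() { return filter; }
    public void setFilter(String filter) { this.filter = filter; }
}

class ChatRequest {
    private String message;

    public String getMessage() { return message; }
    public void setMessage(String message) { this.message = message; }
}
```

The optional fields are boxed types (Integer, Double) rather than primitives so that Jackson leaves them null when the client omits them, which is what the controller's null checks rely on.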
Best Practices
1. Chunk Size Optimization
// Experiment with different chunk sizes for your use case
// General guidelines:
// - 256-512 tokens: Better for precise factual retrieval
// - 512-1024 tokens: Good balance for most use cases
// - 1024-2048 tokens: Better for complex topics requiring more context
@Configuration
public class ChunkingConfig {
@Bean
public DocumentTransformer textSplitter(
@Value("${rag.chunk.size:800}") int chunkSize,
@Value("${rag.chunk.min-chars:350}") int minChunkSizeChars) {
// TokenTextSplitter splits on token counts and does not support overlap
// or custom separators; use a custom splitter (like the SemanticTextSplitter
// above) if you need overlapping chunks.
// Args: chunk size, min chunk chars, min chunk length to embed,
// max chunks, keep separator
return new TokenTextSplitter(chunkSize, minChunkSizeChars, 5, 10000, true);
}
}
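The chunk-size guidelines above are given in tokens, while most Java string handling works in characters. A crude heuristic for translating between the two (roughly 4 characters per token for English prose; an approximation, not a real tokenizer):

```java
public class TokenEstimate {

    private static final double CHARS_PER_TOKEN = 4.0; // rough average for English text

    /** Approximate token count for a piece of English text. */
    static int estimateTokens(String text) {
        return (int) Math.ceil(text.length() / CHARS_PER_TOKEN);
    }

    /** Approximate character budget for a target chunk size in tokens. */
    static int charBudget(int tokens) {
        return (int) (tokens * CHARS_PER_TOKEN);
    }

    public static void main(String[] args) {
        // An 800-token chunk is roughly 3200 characters of English prose
        System.out.println(charBudget(800));
    }
}
```

For anything where the budget actually matters (staying under an embedding model's input limit, for instance), use the model's real tokenizer rather than this estimate.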
2. Metadata Enrichment
@Component
public class MetadataEnricher implements DocumentTransformer {
private final ChatClient chatClient;
@Override
public List<Document> apply(List<Document> documents) {
return documents.stream()
.map(this::enrichWithMetadata)
.toList();
}
private Document enrichWithMetadata(Document doc) {
// Generate summary
String summary = generateSummary(doc.getContent());
// Extract keywords
List<String> keywords = extractKeywords(doc.getContent());
// Add to metadata
Map<String, Object> metadata = new HashMap<>(doc.getMetadata());
metadata.put("summary", summary);
metadata.put("keywords", keywords);
metadata.put("wordCount", doc.getContent().split("\\s+").length);
return new Document(doc.getContent(), metadata);
}
}
3. Filter by Metadata
// Use filter expressions to scope searches
SearchRequest request = SearchRequest.query(question)
.withTopK(5)
.withFilterExpression(
new Filter.Expression(
Filter.ExpressionType.AND,
new Filter.Expression(
Filter.ExpressionType.EQ,
new Filter.Key("category"),
new Filter.Value("technical-docs")
),
new Filter.Expression(
Filter.ExpressionType.GTE,
new Filter.Key("date"),
new Filter.Value("2024-01-01")
)
)
);
Conclusion
RAG with Spring AI provides a powerful foundation for building context-aware AI applications. By combining retrieval with generation, you can create systems that provide accurate, grounded responses based on your own data.
Key takeaways:
- Document quality matters: Good chunking and metadata lead to better retrieval
- Retrieval is crucial: Invest in advanced retrieval strategies like hybrid search and query expansion
- Evaluate continuously: Use metrics to understand and improve your RAG pipeline
- Monitor in production: Track latency, token usage, and retrieval quality
References and Further Reading
- Spring AI Documentation - RAG
- LangChain RAG Conceptual Guide
- InfoQ - Building Production-Ready RAG Applications
- DZone - Vector Databases for AI
- Pinecone - RAG Best Practices
- OpenAI Cookbook - Techniques to Improve Reliability
The code examples in this post are simplified for clarity. Always follow security best practices and thoroughly test RAG implementations before deploying to production.