Why your LLM bill is exploding — and how semantic caching can cut it by 73%
Our LLM API bill was growing 30% month-over-month. Traffic was increasing, but not that fast. When I analyzed our query logs, I found the real problem: users ask the same questions in different ways.

"What's your return policy?", "How do I return something?" and "Can I get a refund?" were all hitting our LLM separately, generating nearly identical responses, each incurring full API costs.

Exact-match caching, the obvious first solution, captured only 18% of these redundant calls. The same semantic question, phrased differently, bypassed the cache entirely.

So I implemented semantic caching, which keys on what queries mean, not how they're worded. After implementing it, our cache hit rate increased to 67%, reducing LLM API costs by 73%. But getting there requires solving problems that naive implementations miss.

Why exact-match caching falls short

Traditional caching uses query text as the cache key. This works when queries are identical:

    # Exact-match caching
    def get_cached(cache, query_text):
        # Note: Python salts str hashes per process, so a cache shared
        # across processes would need a stable digest (e.g. hashlib.sha256).
        cache_key = hash(query_text)
        if cache_key in cache:
            return cache[cache_key]
        return None

But users don't phrase questions identically. My analysis of 100,000 production queries found:

- Only 18% were exact duplicates of previous queries
- 47% were semantically similar to previous queries (same intent, different wording)
- 35% were genuinely novel queries

That 47% represented massive cost savings we were missing. Each semantically similar query triggered a full LLM call, generating a response nearly identical to one we'd already computed.

Semantic caching architecture

Semantic caching replaces text-based keys with embedding-based similarity lookup:

    from datetime import datetime, timezone
    from typing import Optional


    class SemanticCache:
        def __init__(self, embedding_model, similarity_threshold=0.92):
            self.embedding_model = embedding_model
            self.threshold = similarity_threshold
            self.vector_store = VectorStore()      # FAISS, Pinecone, etc.
            self.response_store = ResponseStore()  # Redis, DynamoDB, etc.
        def get(self, query: str) -> Optional[str]:
            """Return cached response if semantically similar query exists."""
            query_embedding = self.embedding_model.encode(query)

            # Find most similar cached query
            matches = self.vector_store.search(query_embedding, top_k=1)

            if matches and matches[0].similarity >= self.threshold:
                cache_id = matches[0].id
                return self.response_store.get(cache_id)

            return None

        def set(self, query: str, response: str):
            """Cache query-response pair."""
            query_embedding = self.embedding_model.encode(query)
            cache_id = generate_id()  # placeholder for any unique-ID helper

            self.vector_store.add(cache_id, query_embedding)
            self.response_store.set(cache_id, {
                'query': query,
                'response': response,
                'timestamp': datetime.now(timezone.utc)
            })

The key insight: instead of hashing query text, I embed queries into vector space and find cached queries within a similarity threshold.

The threshold problem

The similarity threshold is the critical parameter. Set it too high, and you miss valid cache hits. Set it too low, and you return wrong responses.

Our initial threshold of 0.85 seemed reasonable; 85% similar should be "the same question," right?

Wrong. At 0.85, we got cache hits like:

Query: "How do I cancel my subscription?"
Cached: "How do I cancel my order?"
Similarity: 0.87

These are different questions with different answers. Returning the cached response would be incorrect.
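To make the threshold decision concrete, here is a minimal sketch. It assumes the score reported as `similarity` is cosine similarity, and it uses toy two-dimensional vectors in place of real embeddings. The helpers `cosine_similarity` and `is_cache_hit` are illustrative names, not part of the cache class above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_cache_hit(similarity: float, threshold: float) -> bool:
    """A lookup counts as a hit only when similarity reaches the threshold."""
    return similarity >= threshold


# Toy vectors standing in for embeddings (real ones have hundreds of dimensions).
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # 0.96

# The subscription/order pair above scored 0.87.
subscription_vs_order = 0.87
print(is_cache_hit(subscription_vs_order, 0.85))  # True: the wrong answer is served
print(is_cache_hit(subscription_vs_order, 0.92))  # False: falls through to the LLM
```

Raising the threshold from 0.85 to 0.92 turns this particular false positive into a cache miss, at the price of also rejecting genuine paraphrases that score between the two values.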
I discovered that optimal thresholds vary by query type:

Query type              Optimal threshold   Rationale
FAQ-style questions     0.94                High precision needed; wrong answers damage trust
Product searches        0.88                More tolerance for near-matches
Support queries         0.92                Balance between coverage and accuracy
Transactional queries   0.97                Very low tolerance for errors

I implemented query-type-specific thresholds:

    class AdaptiveSemanticCache(SemanticCache):
        def __init__(self, embedding_model):
            # Inherit the embedding model and stores from SemanticCache;
            # get() below relies on them.
            super().__init__(embedding_model)
            self.thresholds = {
                'faq': 0.94,
                'search': 0.88,
                'support': 0.92,
                'transactional': 0.97,
                'default': 0.92
            }
            self.query_classifier = QueryClassifier()

        def get_threshold(self, query: str) -> float:
            query_type = self.query_classifier.classify(query)
            return self.thresholds.get(query_type, self.thresholds['default'])

        def get(self, query: str) -> Optional[str]:
            threshold = self.get_threshold(query)
            query_embedding = self.embedding_model.encode(query)

            matches = self.vector_store.search(query_embedding, top_k=1)

            if matches and matches[0].similarity >= threshold:
                return self.response_store.get(matches[0].id)

            return None

Threshold tuning methodology

I couldn't tune thresholds blindly. I needed ground truth on which query pairs were actually "the same." Our methodology:

Step 1: Sample query pairs. I sampled 5,000 query pairs at various similarity levels (0.80-0.99).

Step 2: Human labeling. Annotators labeled each pair as "same intent" or "different intent." I used three annotators per pair and took a majority vote.

Step 3: Compute precision/recall curves. For each threshold, we computed:

- Precision: Of cache hits, what fraction had the same intent?
- Recall: Of same-intent pairs, what fraction did we cache-hit?
    def compute_precision_recall(pairs, labels, threshold):
        """Compute precision and recall at given similarity threshold."""
        # Predict "same intent" (1) whenever similarity reaches the threshold
        predictions = [1 if pair.similarity >= threshold else 0 for pair in pairs]

        true_positives = sum(1 for p, label in zip(predictions, labels) if p == 1 and label == 1)
        false_positives = sum(1 for p, label in zip(predictions, labels) if p == 1 and label == 0)
        false_negatives = sum(1 for p, label in zip(predictions, labels) if p == 0 and label == 1)

        # Guard against division by zero when a class is empty
        predicted_positives = true_positives + false_positives
        actual_positives = true_positives + false_negatives
        precision = true_positives / predicted_positives if predicted_positives > 0 else 0.0
        recall = true_positives / actual_positives if actual_positives > 0 else 0.0

        return precision, recall

Step 4: Select threshold based on cost of errors. For FAQ queries where wrong answers damage trust, I optimized for precision (0.94 threshold gave 98% precision). For search queries where missing a cache hit just costs money, I optimized for recall (0.88 threshold).

Latency overhead

Semantic caching adds latency: you must embed the query and search the vector store before knowing whether to call the LLM.

Our measurements:

Operation            Latency (p50)   Latency (p99)
Query embedding      12ms            28ms
Vector search        8ms             19ms
Total cache lookup   20ms            47ms
LLM API call         850ms           2400ms

The 20ms overhead is negligible compared to the 850ms LLM call we avoid on cache hits. Even at p99, the 47ms overhead is acceptable.

However, cache misses now take 20ms longer than before (embedding + search + LLM call). At our 67% hit rate, the math works out favorably:

Before: 100% of queries × 850ms = 850ms average
After: (33% × 870ms) + (67% × 20ms) = 287ms + 13ms = 300ms average

That is a net latency improvement of 65% alongside the cost reduction.

Cache invalidation

Cached responses go stale. Product information changes, policies update, and yesterday's correct answer becomes today's wrong answer.
I implemented three invalidation strategies:

Time-based TTL

Simple expiration based on content type:

    from datetime import timedelta

    TTL_BY_CONTENT_TYPE = {
        'pricing': timedelta(hours=4),       # Changes frequently
        'policy': timedelta(days=7),         # Changes rarely
        'product_info': timedelta(days=1),   # Daily refresh
        'general_faq': timedelta(days=14),   # Very stable
    }

Event-based invalidation

When underlying data changes, invalidate related cache entries:

    class CacheInvalidator:
        def on_content_update(self, content_id: str, content_type: str):
            """Invalidate cache entries related to updated content."""
            # Find cached queries that referenced this content
            affected_queries = self.find_queries_referencing(content_id)

            for query_id in affected_queries:
                self.cache.invalidate(query_id)

            self.log_invalidation(content_id, len(affected_queries))

Staleness detection

For responses that might become stale without explicit events, I implemented periodic freshness checks:

    def check_freshness(self, cached_response: dict) -> bool:
        """Verify cached response is still valid."""
        # Re-run the query against current data
        fresh_response = self.generate_response(cached_response['query'])

        # Compare semantic similarity of responses
        cached_embedding = self.embed(cached_response['response'])
        fresh_embedding = self.embed(fresh_response)
        similarity = cosine_similarity(cached_embedding, fresh_embedding)

        # If responses diverged significantly, invalidate
        if similarity < 0.90:
            self.cache.invalidate(cached_response['id'])
            return False

        return True

We run freshness checks on a sample of cached entries daily, catching staleness that TTL and event-based invalidation miss.
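As one possible way to apply the time-based TTL table from this section at read time, the sketch below compares an entry's age with the TTL for its content type. It assumes each cached entry carries a timezone-aware `timestamp` and a `content_type` field. The `content_type` field, the `DEFAULT_TTL` fallback and the `is_expired` helper are hypothetical additions for illustration, not part of the code above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Same values as the TTL table in this section.
TTL_BY_CONTENT_TYPE = {
    'pricing': timedelta(hours=4),
    'policy': timedelta(days=7),
    'product_info': timedelta(days=1),
    'general_faq': timedelta(days=14),
}

# Hypothetical fallback for entries whose content type is unknown.
DEFAULT_TTL = timedelta(days=1)


def is_expired(entry: dict, now: Optional[datetime] = None) -> bool:
    """Return True when the entry is older than the TTL for its content type."""
    now = now or datetime.now(timezone.utc)
    ttl = TTL_BY_CONTENT_TYPE.get(entry.get('content_type'), DEFAULT_TTL)
    return now - entry['timestamp'] > ttl


# Fixed reference time so the example is reproducible.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
pricing_entry = {'content_type': 'pricing', 'timestamp': now - timedelta(hours=5)}
faq_entry = {'content_type': 'general_faq', 'timestamp': now - timedelta(days=5)}

print(is_expired(pricing_entry, now))  # True: older than the 4-hour pricing TTL
print(is_expired(faq_entry, now))      # False: within the 14-day FAQ TTL
```

Checking expiry on read keeps the logic in one place. A store with native key expiry could enforce the same TTLs at write time instead.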
Production results

After three months in production:

Metric                                Before       After          Change
Cache hit rate                        18%          67%            +272%
LLM API costs                         $47K/month   $12.7K/month   -73%
Average latency                       850ms        300ms          -65%
False-positive rate                   N/A          0.8%           N/A
Customer complaints (wrong answers)   Baseline     +0.3%          Minimal increase

The 0.8% false-positive rate (queries where we returned a cached response that was semantically incorrect) was within acceptable bounds. These cases occurred primarily at the boundaries of our threshold, where similarity was just above the cutoff but intent differed slightly.

Pitfalls to avoid

Don't use a single global threshold. Different query types have different tolerance for errors. Tune thresholds per category.

Don't skip the embedding step on cache hits. You might be tempted to skip embedding overhead when returning cached responses, but the embedding is the lookup key: without it, you cannot tell whether a query is a cache hit at all. The overhead is unavoidable.

Don't forget invalidation. Semantic caching without an invalidation strategy leads to stale responses that erode user trust. Build invalidation from day one.

Don't cache everything. Some queries shouldn't be cached: personalized responses, time-sensitive information, transactional confirmations. Build exclusion rules.

    def should_cache(self, query: str, response: str) -> bool:
        """Determine if response should be cached."""
        # Don't cache personalized responses
        if self.contains_personal_info(response):
            return False

        # Don't cache time-sensitive information
        if self.is_time_sensitive(query):
            return False

        # Don't cache transactional confirmations
        if self.is_transactional(query):
            return False

        return True

Key takeaways

Semantic caching is a practical pattern for LLM cost control that captures redundancy exact-match caching misses. The key challenges are threshold tuning (use query-type-specific thresholds based on precision/recall analysis) and cache invalidation (combine TTL, event-based invalidation and staleness detection).
At 73% cost reduction, this was our highest-ROI optimization for production LLM systems. The implementation complexity is moderate, but the threshold tuning requires careful attention to avoid quality degradation.

Sreenivasa Reddy Hulebeedu Reddy is a lead software engineer.