Treatmybrand


a Kainjoo SA Venture
Ch. du Vernay 14a
1196 Gland
+41.21.561.34.96
info@treatmybrand.com

Cutting Your LLM Costs by 73% with Smart Semantic Caching

Venturebeat

Our LLM API bill was growing 30% month over month, far outpacing our traffic growth. Digging into the query logs, we found that users were asking the same questions in different words, such as “What’s your return policy?” and “Can I get a refund?”, and each phrasing triggered a separate, costly API call. Exact-match caching barely helped: because it relied on identical wording, it caught only 18% of these duplicates.
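To see why exact matching falls short, here is a minimal sketch of such a cache (the class name and answer text are illustrative, not our production code): the key is a hash of the normalized query, so trivial variations in casing and spacing are caught, but paraphrases are not.

```python
import hashlib


class ExactMatchCache:
    """Exact-match cache: the key is a hash of the normalized query text."""

    def __init__(self):
        self._store = {}

    def _key(self, query: str) -> str:
        # Normalization handles casing and whitespace, but nothing semantic.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str):
        return self._store.get(self._key(query))

    def put(self, query: str, answer: str):
        self._store[self._key(query)] = answer


cache = ExactMatchCache()
cache.put("What's your return policy?", "Items can be returned within 30 days.")

print(cache.get("what's  your return policy?"))  # hit: same words, different casing/spacing
print(cache.get("Can I get a refund?"))          # miss (None): same intent, different words
```

The second lookup is exactly the failure mode we saw in our logs: semantically identical queries that share almost no surface wording.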

We switched to semantic caching, which looks at the meaning behind queries rather than their exact text. This change boosted our cache hit rate to 67%, slashing API costs by 73%. Our approach embedded queries into vectors and used similarity thresholds to find closely related cached queries.
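The core mechanism can be sketched as follows. This is a simplified illustration, not our production stack: the bag-of-words `embed` function stands in for a real embedding model (in practice you would call a sentence-embedding model or an embeddings API), and the 0.85 threshold is an example value.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' used as a stand-in for a real
    sentence-embedding model (assumption for illustration only)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Serve a cached answer when the query's embedding is close enough
    to a previously cached query's embedding."""

    def __init__(self, threshold: float = 0.85):
        self.threshold = threshold
        self.entries = []  # list of (embedding, answer) pairs

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))

    def get(self, query: str):
        qv = embed(query)
        best_sim, best_answer = 0.0, None
        for ev, answer in self.entries:
            sim = cosine(qv, ev)
            if sim > best_sim:
                best_sim, best_answer = sim, answer
        return best_answer if best_sim >= self.threshold else None


cache = SemanticCache(threshold=0.85)
cache.put("what is your return policy", "Items can be returned within 30 days.")

print(cache.get("what is your return policy please"))  # near-duplicate: cache hit
print(cache.get("how do i reset my password"))         # unrelated: None, falls through to the LLM
```

With a real embedding model, the lookup would also match true paraphrases like “Can I get a refund?”; at scale, the linear scan would be replaced by an approximate-nearest-neighbor index.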

An adaptive threshold approach tailored to each query type (FAQs, searches, support requests) prevented us from serving incorrect cached answers. To keep answers accurate over time, we combined time-based TTLs, event-driven invalidation, and periodic freshness checks. Semantic caching adds roughly 20 ms of lookup latency, but that is negligible next to the roughly 850 ms saved on each cached LLM call.

After three months, the system was delivering a 65% reduction in latency, a 67% cache hit rate, and a false-positive rate of only 0.8%, with almost no customer complaints. Our key takeaways: don't use a one-size-fits-all similarity threshold, invest in robust invalidation, and exclude sensitive queries from caching entirely.
