Meet kvcached (KV cache daemon): a KV cache open-source library for LLM serving on shared GPUs
6 months ago
linkedin.com
Unlock 90% KV Cache Hit Rates with llm-d Intelligent Routing | Tushar Katarki
6.3K views
4 months ago
linkedin.com
New KV cache compaction technique cuts LLM memory 50x without accuracy loss
2 months ago
venturebeat.com
KV Cache Speeds Up Large Language Model Inference | Tushar Kumar posted on the topic | LinkedIn
2K views
1 month ago
linkedin.com
Tensormesh CEO Junchen Jiang on KV Cache for Large-Scale LLM Inference | University of Chicago Department of Computer Science posted on the topic | LinkedIn
2.9K views
4 months ago
linkedin.com
8:08
Making AI Faster | The KV Cache
7 views
3 weeks ago
YouTube
Like Engineer
0:16
Kv cache algorithms HBM #ai #travel #nvidia #nvidia #viral #gpu #viral #gpu #motivation #aiinfra
1 month ago
YouTube
Amit_Chopra_assruc
0:44
Quantization: What Everyone Gets Wrong (Accuracy Myths)
65 views
2 weeks ago
YouTube
Code & Capital
4:35
The KV Cache Hack That Saved My GPU (TurboQuant Explained)
63 views
1 month ago
YouTube
OEvortex
10:23
Lightning Talk: KV-Cache Centric Inference: Building a State-Aware... Maroon Ayoub & Martin Hickey
1 view
3 weeks ago
YouTube
PyTorch
1:58
KV Cache Aware Routing in vLLM using Production Stack
11 views
6 months ago
YouTube
Suraj Deshmukh
1:54
Tensormesh: KV Cache hit rate
2 views
1 month ago
YouTube
Tensormesh
4:11
Silent Bit-Flips in Shared LLM KV-Cache Blocks
18 views
2 weeks ago
YouTube
AI Research Roundup
3:02
Tensormesh: KV Cache Persistence for Faster, Cheaper, Smarter Inference
184 views
2 months ago
YouTube
Bryan Bamford
13:39
Rethinking KV Cache Compression Techniques for LLM Serving
148 views
1 month ago
YouTube
DSAI by Dr. Osbert Tay
7:54
TurboQuant Explained: Google's 3-Bit KV Cache Compression Algorithm
191 views
1 month ago
YouTube
Aisci
20:52
I Used Karpathy's Autoresearch to Write a Custom GPU Kernel
152 views
3 weeks ago
YouTube
Onchain AI Garage
1:31
Scalable LLM Memory — Engram & Memory Banks Explained | Beyond KV Cache
1 month ago
YouTube
Zariga Tongy
10:09
TurboQuant Explained: 3-Bit KV Cache Quantization
866 views
2 weeks ago
YouTube
Tales Of Tensors
4:49
standard vs kv cache performance
1 view
2 months ago
YouTube
doi song thuong ngay canada
0:36
【Whitepaper】KV Cache Offload to Improve AI Inferencing Cost and Performance
42 views
2 months ago
YouTube
Wiwynn
11:17
RotorQuant: KV Cache Compression for LLMs 31x Faster than TurboQuant.
240 views
4 weeks ago
YouTube
En la mente de la máquina, Inteligencia Artificial
6:04
How Tool-Calling Changes Everything: KV Cache & Prefill Explained 🧠
25 views
2 months ago
YouTube
SAIL Media
9:46
Beginner-friendly, step-by-step KV Cache tutorial! From the underlying principles to VRAM calculations, even newcomers can understand it in one pass
204 views
2 months ago
YouTube
算法魔法師
1:01
after turboquant and qwen3.5-35b-a3b, i got curious: how realistic is it to use kv cache as a document store today? to have vectorless, RAG-less search. so i prefilled 258K out of 262K context window on L4 (a budget GPU popular in prod). ~99% of the slot is pre-computed and stored, users load it on the fly in ~1s. system prompt + query append to the end, generation takes ~3K tokens, enough for search. at 99% fill rate, decoding runs ~20 tps on L4. i prepared some ego datasets (jina papers, which
42.1K views
1 month ago
x.com
Han Xiao
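The post above describes reusing a precomputed KV cache as a kind of document store: prefill the long corpus once, persist the cache, and let each query pay only for a short suffix. A minimal sketch of that pattern with Hugging Face transformers follows; the model name, corpus path, query text, and generation length are illustrative assumptions, not details from the post.

```python
# Minimal sketch of prefix-KV-cache reuse as a "document store"
# (model name, corpus path, and query are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-1.5B-Instruct"          # placeholder model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
).eval()

# 1) One-time prefill: run the long corpus once and keep its KV cache.
doc_ids = tok(open("corpus.txt").read(), return_tensors="pt").input_ids.to("cuda")
with torch.no_grad():
    doc_cache = model(doc_ids, use_cache=True).past_key_values   # the stored "slot"

# 2) Per query: only the short suffix (system prompt + question) is prefilled;
#    decoding then proceeds against the cached prefix.
suffix = "\n\nQuestion: which paper introduces late interaction?\nAnswer:"
cur = tok(suffix, return_tensors="pt").input_ids.to("cuda")
past, out_ids = doc_cache, []
with torch.no_grad():
    for _ in range(64):                           # short, search-style generation
        out = model(cur, past_key_values=past, use_cache=True)
        past = out.past_key_values
        cur = out.logits[:, -1:].argmax(dim=-1)   # greedy next token
        out_ids.append(cur)
print(tok.decode(torch.cat(out_ids, dim=-1)[0]))
```

In a real serving setup the prefix cache would be copied or reloaded per request, since decoding appends to it; the post stores the precomputed cache and loads it on the fly in about a second.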
2:36
I added KV caching and INT8 KV quantization to our transformer inference, improving throughput by 35x. All of this was done from scratch in Rust + CUDA, on top of a homemade ML framework. On a 4-token prompt with 252 generated tokens: Original: 0.76 tok/s; KV cache fp32: 27.21 tok/s; KV cache int8 (quantized): 27.29 tok/s. Try it out yourself here: https://t.co/kFS9Z0fs4h In practice: KV caching gave us about a 35x end-to-end speedup; INT8 KV cache kept roughly the same speed as fp32 but cut KV cac
48.8K views
3 weeks ago
x.com
Reese Chong
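The speedup quoted above comes down to what a KV cache removes from the decode loop: without it, every generated token re-projects the whole prefix through the key/value matrices; with it, each step projects only the newest token and appends. A toy single-head sketch in PyTorch (shapes and names are illustrative; the post's actual implementation is from-scratch Rust + CUDA):

```python
# Toy single-head attention decode step, illustrating why a KV cache helps.
import torch

d = 64
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def decode_step(x_new, k_cache, v_cache):
    """x_new: (1, d) hidden state of the newest token only."""
    q = x_new @ Wq                                        # query for the new token
    k_cache = torch.cat([k_cache, x_new @ Wk], dim=0)     # append, don't recompute
    v_cache = torch.cat([v_cache, x_new @ Wv], dim=0)
    attn = torch.softmax(q @ k_cache.T / d ** 0.5, dim=-1)
    return attn @ v_cache, k_cache, v_cache

# Without the cache, step t would re-project all t prefix tokens (O(t) extra
# matmuls per layer per step); with it, each step does O(1) projections, which
# is where an end-to-end speedup like the post's ~35x comes from. Storing the
# cache in INT8 instead of fp32 then shrinks its memory without changing the loop.
k_cache, v_cache = torch.empty(0, d), torch.empty(0, d)
for t in range(252):                                      # 252 generated tokens in the post
    x_new = torch.randn(1, d)
    out, k_cache, v_cache = decode_step(x_new, k_cache, v_cache)
```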
4:46
I implemented @GoogleResearch's TurboQuant as a CUDA-native compression engine on Blackwell B200. 5x KV cache compression on Qwen 2.5-1.5B, near-lossless attention scores, generating live from compressed memory. 5 custom cuTile CUDA kernels ft: - fused attention (with QJL corrections) - online softmax - on-chip cache decompression - pipelined TMA loads. Try it out: https://t.co/m5vkJxWIY6 s/o @blelbach and the cuTile team at @nvidia for lending me Blackwell GPU access :) cc @sundeep @GavinSherry
787.9K views
1 month ago
x.com
anirudh bv
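For context on the compression ratio quoted above: storing keys and values at roughly 3 bits instead of fp16 is about a 5x payload reduction. The sketch below shows generic per-token round-to-nearest KV quantization and dequantization; it is not Google's TurboQuant algorithm or the post's cuTile kernels, just the basic pattern they build on.

```python
# Generic per-token round-to-nearest KV quantization sketch
# (NOT the TurboQuant algorithm from the post; names and shapes are illustrative).
import torch

def quantize_kv(x: torch.Tensor, bits: int = 3):
    """x: (seq, head_dim) keys or values. Returns int codes plus per-token scales."""
    qmax = 2 ** (bits - 1) - 1                              # e.g. 3 for signed 3-bit
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize_kv(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

k = torch.randn(4096, 128)                                  # 4K cached keys, one head
kq, ks = quantize_kv(k, bits=3)
print("reconstruction MSE:", (dequantize_kv(kq, ks) - k).pow(2).mean().item())
# fp16 -> ~3-bit payload (plus a small per-token scale) is roughly the 5x
# compression the post reports; attending over the compressed cache then
# requires decompressing (or correcting) inside the attention kernel itself.
```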
51:15
【GTC23】Writing Performant CUDA Kernels using the Source Page in Nsight Compute
8 views
2 weeks ago
bilibili
扣儿
13:51
$NVDA $MU $SNDK $LITE EXECUTIVE OVERVIEW: The Reiner Pope interview should be read as a 1st-principles economic model of frontier AI systems rather than as a generic technical lecture. Its central claim is that the binding constraint for frontier inference is not raw tensor-core FLOPs in isolation, but the joint system of HBM bandwidth, KV-cache movement, scale-up interconnect, batching policy, and memory hierarchy. The result is a coherent framework for explaining why token prices differ across i
9.2K views
1 week ago
x.com
TheValueist
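The post's framing, that decode throughput is bounded by HBM traffic (weights plus KV cache per step) rather than by FLOPs, can be made concrete with a back-of-the-envelope calculation. All numbers below are illustrative assumptions (roughly H100-class bandwidth and a 70B-parameter, GQA-style model), not figures from the interview.

```python
# Back-of-the-envelope decode-throughput model for a single request
# (all numbers are illustrative assumptions, not figures from the interview).
hbm_bw = 3.35e12            # bytes/s, e.g. ~3.35 TB/s HBM3
weight_bytes = 70e9 * 2     # 70B-parameter model stored in fp16
kv_bytes_per_token = 2 * 80 * 8 * 128 * 2   # (K,V) * layers * kv_heads * head_dim * fp16
context = 32_000
kv_bytes = kv_bytes_per_token * context

bytes_per_step = weight_bytes + kv_bytes     # streamed from HBM every decode step
print(f"KV cache: {kv_bytes / 1e9:.1f} GB, weights: {weight_bytes / 1e9:.0f} GB")
print(f"bandwidth-bound ceiling: {hbm_bw / bytes_per_step:.1f} tokens/s per request")
# Batching amortizes the weight traffic across requests, but every request still
# pays for its own KV-cache movement, which is why KV-cache size, placement, and
# offload policy end up driving token economics in this framing.
```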
Optimize KV Caches for LLM Inference: Dynamo KVBM, FlexKV, LMCache S82033 | GTC San Jose 2026 | NVIDIA On-Demand
1 month ago
nvidia.com