Tag: inference

11 verified claims carry this tag. Each cites at least two primary sources and carries an HMAC-SHA256 signature.
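As a rough illustration of what an HMAC-SHA256 signature over a claim involves, the sketch below signs a claim string and checks it again using Python's standard `hmac` and `hashlib` modules. The key, the choice to sign the raw claim text, and the helper names `sign_claim` and `verify_claim` are all assumptions made for illustration; the page does not say how its signatures are computed or how they relate to the hexadecimal IDs shown under each claim.

```python
import hashlib
import hmac

# Hypothetical key: the real signing key and canonical claim format
# are not described in the source.
SECRET_KEY = b"example-secret-key"


def sign_claim(claim_text: str, key: bytes = SECRET_KEY) -> str:
    """Return the hex HMAC-SHA256 of the UTF-8 encoded claim text."""
    return hmac.new(key, claim_text.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_claim(claim_text: str, signature: str, key: bytes = SECRET_KEY) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_claim(claim_text, key)
    return hmac.compare_digest(expected, signature)


claim = "llama.cpp publicly released on: 2023-03-10 by Georgi Gerganov."
signature = sign_claim(claim)

print(verify_claim(claim, signature))        # unchanged claim text
print(verify_claim(claim + " ", signature))  # altered claim text
```

Any change to the claim text, even a single trailing space, yields a different digest, so a verifier holding the same key can detect tampering.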

vLLM introduced in: Kwon et al. 2023 — high-throughput LLM serving via PagedAttention.
468a9e2c047d8f2f · 2 sources · 100% confidence
llama.cpp publicly released on: 2023-03-10 by Georgi Gerganov.
2c6ddc094019890c · 2 sources · 100% confidence
GPTQ introduced in: Frantar et al. 2022 — accurate post-training quantization for GPT models.
a9ab1ec12062f7ae · 2 sources · 100% confidence
Triton Inference Server publicly released on: 2018-11 by NVIDIA — formerly TensorRT Inference Server.
78ec1ceed08a221c · 2 sources · 100% confidence
SGLang introduced in: Zheng et al. 2024 — efficient LLM serving with structured outputs.
4244c11611a72550 · 2 sources · 100% confidence
Groq LPU publicly released on: 2024-02-19 by Groq — language processing unit inference.
6e19ed543cadbcdd · 2 sources · 95% confidence
Speculative decoding introduced in: Leviathan, Kalman, Matias 2023 — Google Research.
6cdc7730bf41bb3d · 2 sources · 100% confidence
Fireworks AI founded in: 2022 — fast inference for open-source models.
c47d097cc3bceaaa · 2 sources · 95% confidence
NVIDIA NIM publicly released on: 2024-03-18 by NVIDIA — inference microservices for foundation models.
2e98d9bb149590fc · 2 sources · 100% confidence
Grouped-Query Attention (GQA) introduced in paper: GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (Ainslie et al., 2023).
3e9122ba60a3fe99 · 3 sources · 92% confidence
FlashAttention-2 introduced in paper: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning (Dao, 2023).
786f534a9f79a3be · 3 sources · 92% confidence

Related tags

released_on (4) · 2023 (4) · introduced_in (4) · open-source (4) · 2024 (3) · 2022 (3) · nvidia (2) · attention (2) · uc-berkeley (2) · serving (2)