|
|
ea4647b7d7
|
[Doc] Add documentation for GGUF quantization (#8618)
|
2024-09-19 13:15:55 -06:00 |
|
|
|
e42c634acb
|
[Core] simplify logits resort in _apply_top_k_top_p (#8619)
|
2024-09-19 18:28:25 +00:00 |
|
|
|
9cc373f390
|
[Kernel][Amd] Add fp8 kv cache support for rocm custom paged attention (#8577)
|
2024-09-19 17:37:57 +00:00 |
|
|
|
76515f303b
|
[Frontend] Use MQLLMEngine for embeddings models too (#8584)
|
2024-09-19 12:51:06 -04:00 |
|
|
|
855c8ae2c9
|
[MISC] remove engine_use_ray in benchmark_throughput.py (#8615)
|
2024-09-18 22:33:20 -07:00 |
|
|
|
c52ec5f034
|
[Bugfix] fixing sonnet benchmark bug in benchmark_serving.py (#8616)
|
2024-09-19 05:24:24 +00:00 |
|
|
|
02c9afa2d0
|
Revert "[Misc][Bugfix] Disable guided decoding for mistral tokenizer" (#8593)
|
2024-09-19 04:14:28 +00:00 |
|
|
|
3118f63385
|
[Bugfix] [Encoder-Decoder] Bugfix for encoder specific metadata construction during decode of encoder-decoder models. (#8545)
|
2024-09-19 02:24:15 +00:00 |
|
|
|
4c34ce8916
|
[Kernel] Remove marlin moe templating on thread_m_blocks (#8573)
Co-authored-by: lwilkinson@neuralmagic.com
|
2024-09-19 01:42:49 +00:00 |
|
|
|
0d47bf3bf4
|
[Bugfix] add dead_error property to engine client (#8574)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2024-09-18 22:10:01 +00:00 |
|
|
|
d9cd78eb71
|
[BugFix] Nonzero exit code if MQLLMEngine startup fails (#8572)
|
2024-09-18 20:17:55 +00:00 |
|
|
|
db9120cded
|
[Kernel] Change interface to Mamba selective_state_update for continuous batching (#8039)
|
2024-09-18 20:05:06 +00:00 |
|
|
|
b3195bc9e4
|
[AMD][ROCm]Quantization methods on ROCm; Fix _scaled_mm call (#8380)
Co-authored-by: Alexei-V-Ivanov-AMD <156011006+Alexei-V-Ivanov-AMD@users.noreply.github.com>
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-09-18 10:41:08 -07:00 |
|
|
|
e18749ff09
|
[Model] Support Solar Model (#8386)
Co-authored-by: Michael Goin <michael@neuralmagic.com>
|
2024-09-18 11:04:00 -06:00 |
|
|
|
d65798f78c
|
[Core] zmq: bind only to 127.0.0.1 for local-only usage (#8543)
Signed-off-by: Russell Bryant <rbryant@redhat.com>
|
2024-09-18 16:10:27 +00:00 |
|
|
|
a8c1d161a7
|
[Core] *Prompt* logprobs support in Multi-step (#8199)
|
2024-09-18 08:38:43 -07:00 |
|
|
|
7c7714d856
|
[Core][Bugfix][Perf] Introduce MQLLMEngine to avoid asyncio OH (#8157)
Co-authored-by: Nick Hill <nickhill@us.ibm.com>
Co-authored-by: rshaw@neuralmagic.com <rshaw@neuralmagic.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Simon Mo <simon.mo@hey.com>
|
2024-09-18 13:56:58 +00:00 |
|
|
|
9d104b5beb
|
[CI/Build] Update Ruff version (#8469)
Signed-off-by: Aaron Pham <contact@aarnphm.xyz>
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-09-18 11:00:56 +00:00 |
|
|
|
6ffa3f314c
|
[CI/Build] Avoid CUDA initialization (#8534)
|
2024-09-18 10:38:11 +00:00 |
|
|
|
e351572900
|
[Misc] Add argument to disable FastAPI docs (#8554)
|
2024-09-18 09:51:59 +00:00 |
|
|
|
95965d31b6
|
[CI/Build] fix Dockerfile.cpu on podman (#8540)
|
2024-09-18 10:49:53 +08:00 |
|
|
|
8110e44529
|
[Kernel] Change interface to Mamba causal_conv1d_update for continuous batching (#8012)
|
2024-09-17 23:44:27 +00:00 |
|
|
|
09deb4721f
|
[CI/Build] Excluding kernels/test_gguf.py from ROCm (#8520)
|
2024-09-17 16:40:29 -07:00 |
|
|
|
fa0c114fad
|
[doc] improve installation doc (#8550)
Co-authored-by: Andy Dai <76841985+Imss27@users.noreply.github.com>
|
2024-09-17 16:24:06 -07:00 |
|
|
|
98f9713399
|
[Bugfix] Fix TP > 1 for new granite (#8544)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2024-09-17 23:17:08 +00:00 |
|
|
|
56c3de018c
|
[Misc] Don't dump contents of kvcache tensors on errors (#8527)
|
2024-09-17 12:24:29 -07:00 |
|
|
|
a54ed80249
|
[Model] Add mistral function calling format to all models loaded with "mistral" format (#8515)
Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com>
|
2024-09-17 17:50:37 +00:00 |
|
|
|
9855b99502
|
[Feature][kernel] tensor parallelism with bitsandbytes quantization (#8434)
|
2024-09-17 08:09:12 -07:00 |
|
|
|
1009e93c5d
|
[Encoder decoder] Add cuda graph support during decoding for encoder-decoder models (#7631)
|
2024-09-17 07:35:01 -07:00 |
|
|
|
1b6de8352b
|
[Benchmark] Support sample from HF datasets and image input for benchmark_serving (#8495)
|
2024-09-17 07:34:27 +00:00 |
|
|
|
cbdb252259
|
[Misc] Limit to ray[adag] 2.35 to avoid backward incompatible change (#8509)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
|
2024-09-17 00:06:26 -07:00 |
|
|
|
99aa4eddaf
|
[torch.compile] register allreduce operations as custom ops (#8526)
|
2024-09-16 22:57:57 -07:00 |
|
|
|
ee2bceaaa6
|
[Misc][Bugfix] Disable guided decoding for mistral tokenizer (#8521)
|
2024-09-16 22:22:45 -07:00 |
|
|
|
1c1bb388e0
|
[Frontend] Improve Nullable kv Arg Parsing (#8525)
Signed-off-by: Alex-Brooks <Alex.Brooks@ibm.com>
|
2024-09-17 04:17:32 +00:00 |
|
|
|
546034b466
|
[refactor] remove triton based sampler (#8524)
|
2024-09-16 20:04:48 -07:00 |
|
|
|
cca61642e0
|
[Bugfix] Fix 3.12 builds on main (#8510)
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
|
2024-09-17 00:01:45 +00:00 |
|
|
|
5ce45eb54d
|
[misc] small qol fixes for release process (#8517)
|
2024-09-16 15:11:27 -07:00 |
|
|
|
5478c4b41f
|
[perf bench] set timeout to debug hanging (#8516)
|
2024-09-16 14:30:02 -07:00 |
|
|
|
47f5e03b5b
|
[Bugfix] Bind api server port before starting engine (#8491)
|
2024-09-16 13:56:28 -07:00 |
|
|
|
2759a43a26
|
[doc] update doc on testing and debugging (#8514)
|
2024-09-16 12:10:23 -07:00 |
|
|
|
5d73ae49d6
|
[Kernel] AQ AZP 3/4: Asymmetric quantization kernels (#7270)
|
2024-09-16 11:52:40 -07:00 |
|
|
|
781e3b9a42
|
[Bugfix][Kernel] Fix build for sm_60 in GGUF kernel (#8506)
|
2024-09-16 12:15:57 -06:00 |
|
|
|
acd5511b6d
|
[BugFix] Fix clean shutdown issues (#8492)
|
2024-09-16 09:33:46 -07:00 |
|
|
|
837c1968f9
|
[Frontend] Expose revision arg in OpenAI server (#8501)
|
2024-09-16 15:55:26 +00:00 |
|
|
|
a091e2da3e
|
[Kernel] Enable 8-bit weights in Fused Marlin MoE (#8032)
Co-authored-by: Dipika <dipikasikka1@gmail.com>
|
2024-09-16 09:47:19 -06:00 |
|
|
|
fc990f9795
|
[Bugfix][Kernel] Add IQ1_M quantization implementation to GGUF kernel (#8357)
|
2024-09-15 16:51:44 -06:00 |
|
|
|
3724d5f6b5
|
[Bugfix][Model] Fix Python 3.8 compatibility in Pixtral model by updating type annotations (#8490)
|
2024-09-15 04:20:05 +00:00 |
|
|
|
50e9ec41fc
|
[TPU] Implement multi-step scheduling (#8489)
|
2024-09-14 16:58:31 -07:00 |
|
|
|
47790f3e32
|
[torch.compile] add a flag to disable custom op (#8488)
|
2024-09-14 13:07:16 -07:00 |
|
|
|
a36e070dad
|
[torch.compile] fix functionalization (#8480)
|
2024-09-14 09:46:04 -07:00 |
|