* Type hints and small fixes
* Remove unusued params
* Made slice inputs the default
* ruffed
* Updated some var name and moved index slicing
* Logging arg in example
* Added some padding debug var and reformat out cg
* First working CG, fixe size
* Working flexible CG
* CG are compatible with all implementations
* Fixed CG API
* Update example
* Documentation
* Fix padding tokens in FA
* Review compliance
* Better doc around weird bug
* Style
* Fix for sliding with CG
* setup
* start the purge
* continue the purge
* more and more
* more
* continue the quest: remove loading tf/jax checkpoints
* style
* fix configs
* oups forgot conflict
* continue
* still grinding
* always more
* in tje zone
* never stop
* should fix doc
* fic
* fix
* fix
* fix tests
* still tests
* fix non-deterministic
* style
* remove last rebase issues
* onnx configs
* still on the grind
* always more references
* nearly the end
* could it really be the end?
* small fix
* add converters back
* post rebase
* latest qwen
* add back all converters
* explicitly add functions in converters
* re-add
* Fix for CB attn mask and refactor
* Tests for CB (not all passing)
* Passing tests and a logger fix
* Fixed the KV metrics that were broken when we moved to hybrid alloc
* Fix circular import and style
* Added tests for FA
* Unfolded test to have device expectations
* Fixes for H100
* more fixes for h100
* H100 are good
* Style
* Adding some comments from #40831
* Rename test
* Avoid 1 letter variables
* Dictonnary is only removed during kwargs
* Test for supported sample
* Fix a unvoluntary slice
* Fixes for non-sliced inputs and small example improvments
* Slice inputs is more understandabe
* Style
* CB example: better compare feature
* Cache managers, still issue w/ effective length
* WIP -- fix for effective length
* Renames
* Wroking, need better parity checks, we mind be missing 1 token
* Small fixes
* Fixed wrong attn mask and broke cache into pieces
* Warmup is slowing down things, disabling it
* Cache was too big, fixed
* Simplified index objects
* Added a profile option to the example
* Avoid calls to memory reporing tools
* Restore full attention read indices for better latency
* Adressed some TODOS and style
* Docstrings for cache managers
* Docstrings for Schedulers
* Refactor scheudlers
* [Important] Cache fix for sliding window, check with small sw size
* Updated doc for cache memory compute and cache as a whole
* Moved a todo
* Nits and style
* Fix for when sliding window is smaller than max batch per token
* Paged interface update
* Support for FLash in new API
* Fix example CB
* Fix bug in CB for paged
* Revert example
* Style
* Review compliance
* Style
* Styleeeee
* Removed NO_SLIDING_WINDOW
* Review #2 compliance
* Better art
* Turn cum_seqlens_k in a dict
* Attn mask is now a dict
* Update examples/pytorch/continuous_batching.py
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Adressed McPatate pro review
* Style and fix
---------
Co-authored-by: Luc Georges <McPatate@users.noreply.github.com>
* Rework of the CB example
* Further rework of CB example
* Refactor PA cache, slice on tokens, add debug prints -- WIP
* Slice cache -- WIP
* Added a mechanism to check batched outputs in CB script
* Less logging, debug flag for slice, !better reset! -- WIP
* QOL and safety margins
* Refactor and style
* Better saving of cb example
* Fix
* Fixes and QOL
* Mor einformations about metrics
* Further logging
* Style
* Licenses
* Removed some comments
* Add a slice input flag
* Fix in example
* Added back some open-telemetry deps
* Removed some aux function
* Added FA2 option to example script
* Fixed math (all of it)
* Added a simple example
* Renamed core to classes
* Made allocation of attention mask optionnal
* Style
* update everywhere
* style
* pipelines
* switch it everywhere in tests
* switch it everywhere in docs
* switch in converters everywhere
* update in examples
* update in model docstrings
* style
* warnings
* style
* Update configuration_utils.py
* fix
* Update configuration_utils.py
* fixes and add first test
* add pipeline tests
* Update test_pipelines_common.py
* add config test
* Update test_modeling_common.py
* add new ones
* post rebase
* add new
* post rebase adds
* remove trust_remote_code
* again
* Revert "Skip some tests for now (#38931)"
This reverts commit 31d30b72245aacfdf70249165964b53790d9c4d8.
* again
* style
* again
* again
* style
* fix integration test
* fix tests
* style
* fix
* fix
* fix the last ones
* style
* last one
* fix last
* fix
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* No more Tuple, List, Dict
* make fixup
* More style fixes
* Docstring fixes with regex replacement
* Trigger tests
* Redo fixes after rebase
* Fix copies
* [test all]
* update
* [test all]
* update
* [test all]
* make style after rebase
* Patch the hf_argparser test
* Patch the hf_argparser test
* style fixes
* style fixes
* style fixes
* Fix docstrings in Cohere test
* [test all]
---------
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* remove it from all py files
* remove it from the doc
* remove it from examples
* style
* remove traces of _fast_init
* Update test_peft_integration.py
* CIs