* Adapt and test huggingface_hub v1.0.0.rc0
* forgot to bump hfh
* bump
* code quality
* code quality
* relax dependency table
* fix has_file
* install hfh 1.0.0.rc0 in circle ci jobs
* repository
* push to hub now returns a commit url
* catch HfHubHTTPError
* check commit on branch
* add it back
* fix ?
* remove deprecated test
* uncomment another test
* trigger
* no proxies
* many more small changes
* fix load PIL Image from httpx
* require 1.0.0.rc0
* fix mocked tests
* fix others
* unchange
* unchange
* args
* Update .circleci/config.yml
* Bump to 1.0.0.rc1
* bump kernels version
* fix deps
* setup
* start the purge
* continue the purge
* more and more
* more
* continue the quest: remove loading tf/jax checkpoints
* style
* fix configs
* oops, forgot conflict
* continue
* still grinding
* always more
* in the zone
* never stop
* should fix doc
* fix
* fix
* fix
* fix tests
* still tests
* fix non-deterministic
* style
* remove last rebase issues
* onnx configs
* still on the grind
* always more references
* nearly the end
* could it really be the end?
* small fix
* add converters back
* post rebase
* latest qwen
* add back all converters
* explicitly add functions in converters
* re-add
* porting not maintained jieba to rjieba
* Fix format
* replaced the line with rjieba instead of removing it
* cut_all is not accepted as a parameter; in rjieba, cut_all is a separate function
* rev
* jieba remove installation
* Trigger tests
* Update tokenization_cpm.py
* Update tokenization_cpm_fast.py
---------
Co-authored-by: Yih-Dar <2521628+ydshieh@users.noreply.github.com>
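The jieba to rjieba port above notes that `cut_all` is a separate function in rjieba rather than a keyword argument. A minimal sketch of that call-site change follows. It assumes a hypothetical helper `segment` and a backend object exposing `cut(text)` and `cut_all(text)`. The stand-in backend is illustrative only, so the snippet runs without rjieba installed, and it is not the actual tokenizer code.

```python
from types import SimpleNamespace


def segment(text, backend, cut_all=False):
    """Hypothetical helper: map a jieba-style `cut_all` flag onto two functions."""
    # jieba style was: backend.cut(text, cut_all=cut_all)
    # rjieba style (per the commits above): full mode is its own function.
    if cut_all:
        return list(backend.cut_all(text))
    return list(backend.cut(text))


# Stand-in backend for illustration only. It is not a real segmenter:
# `cut` splits on whitespace and `cut_all` enumerates every substring.
fake_backend = SimpleNamespace(
    cut=lambda text: text.split(),
    cut_all=lambda text: [
        text[i:j] for i in range(len(text)) for j in range(i + 1, len(text) + 1)
    ],
)

tokens = segment("a b", fake_backend)
all_tokens = segment("ab", fake_backend, cut_all=True)
```

With rjieba installed, the `rjieba` module itself could presumably be passed as `backend`.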
* Rework of the CB example
* Further rework of CB example
* Refactor PA cache, slice on tokens, add debug prints -- WIP
* Slice cache -- WIP
* Added a mechanism to check batched outputs in CB script
* Less logging, debug flag for slice, !better reset! -- WIP
* QOL and safety margins
* Refactor and style
* Better saving of cb example
* Fix
* Fixes and QOL
* More information about metrics
* Further logging
* Style
* Licenses
* Removed some comments
* Add a slice input flag
* Fix in example
* Added back some open-telemetry deps
* Removed some aux function
* Added FA2 option to example script
* Fixed math (all of it)
* Added a simple example
* Renamed core to classes
* Made allocation of attention mask optional
* Style
* add jinja2 as a dependency
* Make jinja2 a core dependency in install_requires
- Add jinja2 to install_requires list in setup.py for automatic installation
- Add jinja2 to runtime version checks in dependency_versions_check.py
- Resolves issue where pip install transformers doesn't install jinja2
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Make jinja2 a core dependency in install_requires
* Make jinja2 an extra dependency instead of adding a core dep
---------
Co-authored-by: Claude <noreply@anthropic.com>
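Since jinja2 ended up as an extra rather than a core dependency, code that needs it has to cope with it being absent. A minimal sketch of such a guard, using only the standard library. The helper names `is_available` and `require_optional` and the `extra` argument are hypothetical and do not come from the repository.

```python
import importlib.util


def is_available(module_name):
    """Return True if `module_name` can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def require_optional(module_name, extra):
    """Raise a helpful error when an optional dependency is missing.

    `extra` is a hypothetical extras name, used only in the error message.
    """
    if not is_available(module_name):
        raise ImportError(
            f"{module_name} is required for this feature. "
            f"Install it with `pip install {module_name}` or via the `{extra}` extra."
        )


# jinja2 may or may not be installed here; the check itself never raises.
has_jinja2 = is_available("jinja2")
```

Calling `require_optional` at the point of use, rather than at import time, keeps the base install working for users who never need the optional feature.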
* fix
* nice
* where i am at
* Bro this works
* Update src/transformers/integrations/tensor_parallel.py
* cleanups
* yups that was breaking
* Update src/transformers/models/openai_moe/modeling_openai_moe.py
* gather on experts and not mlp
* add changes for latest convert branch
* adds options to get output_router_logits from config
* bring chat template + special tokens back into the script.
* initial commit
* update
* working with shards
* add model.safetensors.index.json
* fix
* fix
* mxfp4 flag
* rm print
* Fix PAD/EOS/BOS (#18)
* fix pad/eos/bos
* base model maybe one day
* add some doc
* special tokens based on harmony.
* add in tokenizer config as well.
* prepare for rebase with main
* Fix for initialize_tensor_parallelism now returning 4-tuple
```
[rank0]: File "/fsx/edward/work/openai-tsm-examples/examples/generate.py", line 17, in <module>
[rank0]: model = AutoModelForCausalLM.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/fsx/edward/work/new-model-addition-openai/src/transformers/models/auto/auto_factory.py", line 600, in from_pretrained
[rank0]: return model_class.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/fsx/edward/work/new-model-addition-openai/src/transformers/modeling_utils.py", line 316, in _wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/fsx/edward/work/new-model-addition-openai/src/transformers/modeling_utils.py", line 4748, in from_pretrained
[rank0]: tp_plan, device_map, device_mesh = initialize_tensor_parallelism(tp_plan, tp_size=None)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: ValueError: too many values to unpack (expected 3)
```
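The traceback above is the standard failure when a call site unpacks three names from a function that now returns four values. A minimal reproduction using a stand-in function. The stub and the name `extra` are hypothetical, since the commit does not say what the fourth element is.

```python
def initialize_tensor_parallelism_stub(tp_plan, tp_size=None):
    """Stand-in for illustration; only the return arity matters here."""
    # The fourth value is a placeholder: the commit says the function now
    # returns a 4-tuple but does not name the new element.
    return tp_plan, None, None, None


message = None
try:
    # Old call site: unpacks three values from a four-value return.
    tp_plan, device_map, device_mesh = initialize_tensor_parallelism_stub("auto")
except ValueError as err:
    message = str(err)

# Fixed call site: unpack all four values (the last name is hypothetical).
tp_plan, device_map, device_mesh, extra = initialize_tensor_parallelism_stub("auto")
```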
* mxfp4
* mxfp4 draft
* fix
* fix import
* draft
* draft impl
* finally working !
* simplify
* add import
* working version
* consider blocks and scales
* device mesh fix
* initial commit
* add working dequant + quant logic
* update
* non nan, gibberish output
* working EP + quantization finally !
* start cleaning
* remove reversing process
* style
* some cleaning
* initial commit
* more cleaning
* more cleaning
* simplify
* more cleaning
* rm duplicated function
* changing tp_plan
* update tp plan check
* add loading attribute
* dequantizing logic
* use subfunctions
* import cleaning
* update_param_name
* adds clamped swiglu
* add clamping to training path
* simplify dequant logic
* update
* Bad merge
* more simplifications & tests
* fix !
* fix registering custom attention
* fix order
* fixes
* some test nits
* nits
* nit
* fix
* Clamp sink logits
* Clean
* Soft-max trick
* Clean up
* p
* fix deepspeed
* update both modeling and modular for cleanup
* contiguous
* update tests
* fix top_k router call
* revert renaming
* test nits
* small fixes for EP
* fix path for our local tests
* update as I should not have broken that!
* fix the loss of mixtral
* revert part of the changes related to router_scores; the kernel is probably not ready for that!
* deleting a small nit
* update arch
* fix post processing
* update
* running version but not expected output
* moving to cuda
* initial commit
* revert
* erroring when loading on cpu
* updates
* del blocks, scales
* fix
* style
* rm comm
* comment
* add comment
* style
* remove duplicated lines
* Fix minor issue with weight_map conversion script
* fix sampling params
* rename to final name
* update pre-final version of template
* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py
* fix batched inference
* serve fixes
* swizzle !
* update final chat template by Matt.
* fix responses; pin oai
* simplify
* Thanks Matt for his tireless efforts!
Co-authored-by: Rocketknight1 <Rocketknight1@users.noreply.github.com>
* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
* fix
* Use ROCm kernels from HUB
* Make kernel modes explicit
* update final chat template by Matt. x2
* Thanks Matt for his tireless efforts!
Co-authored-by: Rocketknight1 <Rocketknight1@users.noreply.github.com>
* Fix installation
* Update setup.py
Co-authored-by: Ákos Hadnagy <akos.hadnagy@gmail.com>
* allow no content
* fix: update message handling in write_tokenizer function
* Fix template logic for user message role
* last nits for CB and flash_paged!
* there was one bad merge
* fix CB (hardcode for now, it's just using kv groups instead)
* fix
* better fix for device_map
* minor device fix
* Fix flash paged
* updates
* Revert "remove dtensors, not explicit (#39840)"
This reverts commit 6dfd561d9cd722dfc09f702355518c6d09b9b4e3.
* update
* Revert "remove dtensors, not explicit (#39840)"
This reverts commit 6dfd561d9cd722dfc09f702355518c6d09b9b4e3.
* fix merge
* fix
* Fix line break when custom model identity
* nits testing
* to locals first and pass sliding window to flash paged
* register modes for MegaBlocksMoeMlp
* add integration test in fixtures -> now update the tests to use it!
* update integration tests
* initial fix
* style and update tests
* fix
* chore(gpt oss): remove mlp_bias from configuration
It was just a leftover.
* stats
* Integration tests
* whoops
* Shouldn't move model
* Ensure assistant messages without thinking always go to "final" channel
* More checks to ensure expected format
* Add pad_token_id to model configuration in write_model function (#51)
* Add oai fix fast tests (#59)
* Fix some fast tests
* Force some updates
* Remove unnecessary fixes
* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
* Update src/transformers/models/gpt_oss/convert_gpt_oss_weights_to_hf.py
* reasoning -> Reasoning
* Add additional integration tests
* fixup
* Slight fixes
* align chat template with harmony
* simplify
* Add comment
* torch testing assert close
* torch testing assert close
* torch testing assert close
* torch testing assert close
* torch testing assert close
* torch testing assert close
* Revert fixup
* skip 2 test remove todo
* merge
* padding side should be left for integration tests
* fix modular w.r.t. changes made to modeling
* style
* isort
* fix copies for the loss
* mmmm
---------
Co-authored-by: Quentin Gallouédec <gallouedec.quentin@gmail.com>
Co-authored-by: Quentin Gallouédec <45557362+qgallouedec@users.noreply.github.com>
Co-authored-by: Marc Sun <marc@huggingface.co>
Co-authored-by: edbeeching <edbeeching@gmail.com>
Co-authored-by: Vaibhavs10 <vaibhavs10@gmail.com>
Co-authored-by: MekkCyber <mekk.cyber@gmail.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Edward Beeching <edbeeching@users.noreply.github.com>
Co-authored-by: Mohamed Mekkouri <93391238+MekkCyber@users.noreply.github.com>
Co-authored-by: Lewis Tunstall <lewis.c.tunstall@gmail.com>
Co-authored-by: Zhuohan Li <zhuohan@openai.com>
Co-authored-by: Pedro Cuenca <pedro@huggingface.co>
Co-authored-by: joao@huggingface.co <joao@ip-10-53-88-32.ec2.internal>
Co-authored-by: Rocketknight1 <Rocketknight1@users.noreply.github.com>
Co-authored-by: Joao Gante <joaofranciscocardosogante@gmail.com>
Co-authored-by: Akos Hadnagy <akos@ahadnagy.com>
Co-authored-by: Ákos Hadnagy <akos.hadnagy@gmail.com>
Co-authored-by: Alvaro Moran <alvaro.moran@huggingface.co>
Co-authored-by: Lysandre <hi@lysand.re>
Co-authored-by: Matt <rocketknight1@gmail.com>
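The "Soft-max trick" commit above does not say which trick is meant. Assuming it refers to the usual max-subtraction used to keep softmax numerically stable, a minimal plain-Python sketch looks like this. The function name `stable_softmax` is hypothetical and this is not the model's attention code.

```python
import math


def stable_softmax(logits):
    """Softmax with the maximum subtracted before exponentiation."""
    # Assumption: this is the generic stability trick, not necessarily
    # what the commit implemented.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# math.exp(1000.0) alone would overflow; the shift keeps every exponent <= 0.
probs = stable_softmax([1000.0, 1000.0, 999.0])
```

Subtracting the maximum leaves the result unchanged mathematically, because the common factor cancels between numerator and denominator.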
* stash for now
* initial commit
* small update
* up
* up
* works!
* nits and fixes
* don't loop too much
* finish working example
* update
* fix the small freeblocks issue
* feat: stream inputs to continuous batch
* fix: update attn from `eager` to `sdpa`
* refactor: fmt
* refactor: cleanup unnecessary code
* feat: add `update` fn to `PagedAttentionCache`
* feat: broken optimal block size computation
* fix: debugging invalid cache logic
* fix: attention mask
* refactor: use custom prompts for example
* feat: add streaming output
* fix: prefill split
refactor: add doc strings and unsound/redundant logic
fix: compute optimal blocks logic
* fix: send decoded tokens when `prefilling_split` -> `decoding`
* refactor: move logic to appropriate parent class
* fix: remove truncation as we split prefilling anyways
refactor: early return when we have enough selected requests
* feat: add paged attention forward
* push graph
* add paged sdpa
* update
* better mps defaults
* feat: add progress bar for `generate_batch`
* feat: add opentelemetry metrics (ttft + batch fill %age)
* feat: add tracing
* Add cuda graphs (#38059)
* draft cudagraphs addition
* nits
* styling
* update
* fix
* kinda draft of what it should look like
* fixes
* lol
* not sure why inf everywhere
* can generate but output is shit
* some fixes
* we should have a single device synch
* broken outputs but it does run
* refactor
* updates
* updates with some fixes
* fix mask causality
* another commit that casts after
* add error
* simplify example
* update
* updates
* revert llama changes
* fix merge conflicts
* fix: tracing and metrics
* my updates
* update script default values
* fix block allocation issue
* fix prefill split attention mask
* no bugs
* add paged eager
* fix
* update
* style
* feat: add pytorch traces
* fix
* fix
* refactor: remove pytorch profiler data
* style
* nits
* cleanup
* draft test file
* fix
* fix
* fix paged and graphs
* small renamings
* cleanups and push
* refactor: move tracing and metrics logic to utils
* refactor: trace more blocks of code
* nits
* nits
* update
* to profile or not to profile
* refactor: create new output object
* causal by default
* cleanup but generations are still off for IDK what reason
* simplifications but not running still
* this does work.
* small quality of life updates
* nits
* update
* fix the scheduler
* fix warning
* ol
* fully fixed
* nits
* different generation parameters
* nice
* just style
* feat: add cache memory usage
* feat: add kv cache free memory
* feat: add active/waiting count & req latency
* do the sampling
* fix: synchronize CUDA only if available and improve error handling in ContinuousBatchingManager
* fix on mps
* feat: add dashboard & histogram buckets
* perf: improve waiting reqs data structures
* attempt to compile, but we should only do it on mps AFAIK
* feat: decouple scheduling logic
* just a draft
* cleanup and fixup
* optional
* style
* update
* update
* remove the draft documentation
* fix import as well
* update
* fix the test
* style doomed
---------
Co-authored-by: Luc Georges <luc.sydney.georges@gmail.com>
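Several commits above concern block allocation and free blocks for the paged cache. As a rough illustration of the general idea only, here is a toy fixed-size block pool. `ToyBlockAllocator` and all of its methods are hypothetical and do not reflect the actual `PagedAttentionCache` interface.

```python
import math


class ToyBlockAllocator:
    """Hypothetical fixed-size block pool; not the PagedAttentionCache API."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.owned = {}  # request id -> list of block ids

    def blocks_needed(self, request_id, num_tokens):
        """Extra blocks required for the request to hold `num_tokens` tokens."""
        have = len(self.owned.get(request_id, []))
        return max(0, math.ceil(num_tokens / self.block_size) - have)

    def allocate(self, request_id, num_tokens):
        """Grow a request to `num_tokens` tokens; return False if the pool is too small."""
        need = self.blocks_needed(request_id, num_tokens)
        if need > len(self.free_blocks):
            return False  # caller can keep the request waiting
        taken = [self.free_blocks.pop() for _ in range(need)]
        self.owned.setdefault(request_id, []).extend(taken)
        return True

    def free(self, request_id):
        """Return all blocks held by a finished request to the pool."""
        self.free_blocks.extend(self.owned.pop(request_id, []))


pool = ToyBlockAllocator(num_blocks=4, block_size=16)
first = pool.allocate("req-0", 40)   # needs 3 blocks
second = pool.allocate("req-1", 20)  # needs 2 blocks, only 1 is free
pool.free("req-0")
third = pool.allocate("req-1", 20)   # succeeds once blocks are returned
```

A scheduler built on a pool like this would leave a request in the waiting queue when `allocate` returns False and retry after another request finishes.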