With FSDP2, parameters may still be on a meta device, so the `weight.device` attribute may
not reflect the device on which the forward pass actually runs, which can fail as follows:
```log
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 776, in forward
pos_embeds = self.fast_pos_embed_interpolate(grid_thw)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 745, in fast_pos_embed_interpolate
pos_embeds = self.pos_embed(idx_tensor) * weight_tensor[:, :, None]
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/nn/modules/module.py", line 1879, in _call_impl
return inner()
^^^^^^^
File "torch/nn/modules/module.py", line 1827, in inner
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "torch/nn/modules/sparse.py", line 192, in forward
return F.embedding(
^^^^^^^^^^^^
File "torch/nn/functional.py", line 2546, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected all tensors to be on the same device, but got index is on cpu, different from other tensors on cuda:0 (when checking argument in method wrapper_CUDA__index_select)
```
https://github.com/volcengine/verl/pull/3686#issuecomment-3380981817
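A minimal sketch of the idea, not the actual patch: take the device for index tensors from an input tensor instead of from a parameter. The helper `index_on_input_device` and the tensor shapes are hypothetical and only illustrate why `weight.device` can mislead.

```python
import torch
import torch.nn as nn


def index_on_input_device(grid_thw: torch.Tensor, num_positions: int) -> torch.Tensor:
    """Hypothetical helper: build embedding indices on the input's device."""
    # Assumption: the input tensor already sits where the forward pass runs,
    # whereas a parameter's .device may report "meta" before materialization.
    return torch.arange(num_positions, dtype=torch.long, device=grid_thw.device)


# A meta-device embedding reports "meta", which says nothing about the compute device.
pos_embed = nn.Embedding(16, 8, device="meta")
grid_thw = torch.tensor([[1, 4, 4]])  # on CPU here; a CUDA tensor in real training

idx = index_on_input_device(grid_thw, num_positions=16)
print(pos_embed.weight.device, idx.device)
```

Building `idx` from `pos_embed.weight.device` instead would place it on the meta device in this setup, which is the kind of mismatch the traceback above reports.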
Signed-off-by: Hollow Man <hollowman@opensuse.org>
* Add video processor for VideoMAE
* Document VideoMAE video processor
* Add regression tests for VideoMAE video processor
* refactor: Use direct batch key access for pixel_values_videos
* test: add parity test for VideoMAEVideoProcessor vs VideoMAEImageProcessor
* docs(videomae): update model docstring example to demonstrate VideoMAEVideoProcessor (TorchCodec-based decoding and sampling)
* Type hints and small fixes
* Remove unused params
* Made slice inputs the default
* ruffed
* Updated some var name and moved index slicing
* Logging arg in example
* Added some padding debug var and reformat out cg
* First working CG, fixed size
* Working flexible CG
* CG are compatible with all implementations
* Fixed CG API
* Update example
* Documentation
* Fix padding tokens in FA
* Review compliance
* Better doc around weird bug
* Style
* Fix for sliding with CG
* Merge conflict
* add fast processor
* make style
* add new convert rgb
* use nested group by shape in mllama fast, add support for multiple inputs in group by shape
* refactor after review
---------
Co-authored-by: Vincent <phamvinh257@gmail.com>
```log
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 941, in forward
hidden_states = self._deepstack_process(
^^^^^^^^^^^^^^^^^^^^^^^^
File "transformers/models/qwen3_vl_moe/modeling_qwen3_vl_moe.py", line 960, in _deepstack_process
hidden_states[visual_pos_masks, :] = local_this
~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Output 0 of SliceBackward0 is a view and is being modified inplace. This view was created inside a custom Function (or because an input was returned as-is) and the autograd logic to handle view+inplace would override the custom backward associated with the custom Function, leading to incorrect gradients. This behavior is forbidden. You can fix this by cloning the output of the custom Function.
```
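The error message itself suggests cloning before the in-place write. A minimal sketch under that assumption follows; the function name `deepstack_add` and the toy shapes are hypothetical, and the real `_deepstack_process` may differ in detail.

```python
import torch


def deepstack_add(
    hidden_states: torch.Tensor,
    visual_pos_masks: torch.Tensor,
    visual_embeds: torch.Tensor,
) -> torch.Tensor:
    """Hypothetical sketch: add visual embeddings at masked positions."""
    # Clone first so the masked assignment below writes into a fresh tensor
    # rather than into a view produced by a custom autograd Function.
    hidden_states = hidden_states.clone()
    local_this = hidden_states[visual_pos_masks, :] + visual_embeds
    hidden_states[visual_pos_masks, :] = local_this
    return hidden_states


hidden = torch.ones(4, 2)
mask = torch.tensor([True, False, True, False])
embeds = torch.full((2, 2), 5.0)

out = deepstack_add(hidden, mask, embeds)
```

The clone costs one extra copy of `hidden_states`, but it leaves the caller's tensor untouched and keeps the custom backward intact.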
Signed-off-by: Hollow Man <hollowman@opensuse.org>
* Set `truncation` to `False` in Qwen3Omni to avoid default truncation
* move `padding` and `truncation` to audio default args
---------
Co-authored-by: lvyuanjun.lyj <lvyuanjun.lyj@alibaba-inc.com>