7 Commits

SHA1 Message Date
2f9f759587 Add num_trainable_params column to gradio app (#2819)
While memory usage correlates with the number of trainable params, having this number directly
makes it easier to see that methods use similar numbers of trainable params and to inspect
outliers easily.
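
For context, such a count is typically taken straight from the model; a minimal sketch of the idea (assuming a PyTorch model, not necessarily the app's exact helper):

```
def count_trainable_params(model) -> int:
    # Sum the element counts of all parameters that still receive gradients.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
```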
2025-10-13 14:36:58 +02:00
43845f9b14 Method Comparison: Improve formatting/layout of table (#2670)
* Method Comparison: Improve formatting/layout of table

Quick improvement to reduce the dominance of columns like `{peft,train}_config` and to make
numbers a bit more readable through proper decimal/thousands formatting (see the sketch below).

* Bump gradio version to accommodate required fixes
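
As a small illustration of the decimal/thousands formatting meant above (the values are made up), Python's format specifiers cover both cases:

```
accuracy = 0.4321987
trainable_params = 9437184

# Fixed number of decimals for metrics, thousands separators for counts.
print(f"{accuracy:.4f}")        # 0.4322
print(f"{trainable_params:,}")  # 9,437,184
```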
2025-07-24 19:02:09 +02:00
f650b08abb make method comparison device agnostic, so it can expand to more accelerators like XPU (#2610)
make method comparison device agnostic, so it can expand to more
accelerators like XPU

---------

Signed-off-by: YAO Matrix <matrix.yao@intel.com>
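
A minimal sketch of what device-agnostic accelerator selection can look like (illustrative only; `torch.xpu` support depends on the installed PyTorch build):

```
import torch

def get_device() -> torch.device:
    # Prefer CUDA, fall back to Intel XPU if available, then CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return torch.device("xpu")
    return torch.device("cpu")
```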
2025-07-22 15:25:56 +02:00
5fe7f8f8ab ENH: Method comparison allow full finetuning (#2597)
- Allow full fine-tuning
- Add an experiment for full fine-tuning
- Rename some columns that had incorrect names
- Remove redundant metric
- Factor out file size calculation (estimate for FT; see the sketch below)
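
A full fine-tuning run has no small adapter checkpoint to measure, so its file size has to be estimated. A rough sketch of such an estimate (assuming parameter count and dtype are known; not necessarily the suite's exact formula):

```
import torch

def estimate_checkpoint_size_bytes(num_params: int, dtype: torch.dtype = torch.bfloat16) -> int:
    # A full fine-tune checkpoint stores every parameter, so the size is
    # roughly parameter count times bytes per element (ignoring metadata).
    bytes_per_param = torch.finfo(dtype).bits // 8
    return num_params * bytes_per_param

# e.g. a 3B-parameter model in bf16 -> ~6 GB
print(estimate_checkpoint_size_bytes(3_000_000_000) / 1e9)
```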
2025-06-19 18:10:20 +02:00
4721213828 Add Makefile + results for MetaMathQA task (#2593)
These are the first results for the MetaMathQA task and also the first
test of the Makefile used to run these tests.

The Makefile offers the functionality to run individual experiments by
specifying the result you want, e.g.
`make results/adalora--llama-3.2-3B-rank32[...].json`. Alternatively,
you can simply run `make` (i.e. `make all`), which runs all experiments
that don't have a result yet or whose configs are outdated (comparing
the result timestamp against the config timestamp).
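
The freshness check is make's usual target-vs-prerequisite timestamp comparison; a rough Python equivalent (the paths are made up for illustration):

```
from pathlib import Path

def is_outdated(result: Path, config: Path) -> bool:
    # Rebuild if the result is missing or older than its config file,
    # mirroring make's target-vs-prerequisite timestamp comparison.
    return not result.exists() or result.stat().st_mtime < config.stat().st_mtime

# e.g. is_outdated(Path("results/some-experiment.json"),
#                  Path("experiments/some-experiment/config.json"))
```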

The results are from the main branch and the run itself completed without errors.
There were OOM errors on a compute instance with an A10G 24GB GPU; an L40S
with 48GB of GPU memory was fine.


* Make sure to use original batch size for OFT

This was not done previously because of runner memory constraints.

* Remove timestamp from result files

We're tracking the results in git for now, which makes
looking back easy enough (`git restore -s <rev> results`).
This makes it easier for `make` to determine which results
are already computed and which need to be regenerated.
2025-06-19 17:41:51 +02:00
6bcefb02c6 Input sanitizer for benchmark result renderer (#2594)
Since `DataFrame.query` is potentially vulnerable, we limit the possible
filter input to a fixed grammar that is roughly like this:

```
expr = left op right
left = ( expr ) | literal
right = ( expr ) | literal
op = in | >= | < | <= | == | and | or
```

This gives us boolean operations and basic comparisons. Note that
`literal` can be an arbitrary Python literal (strings, tuples, ...).
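
A minimal sketch of how such a whitelist can be enforced with Python's `ast` module (illustrative only, not the sanitizer's actual implementation):

```
import ast

ALLOWED_NODES = (
    ast.Expression, ast.BoolOp, ast.Compare, ast.Name, ast.Load,
    ast.Constant, ast.Tuple, ast.And, ast.Or,
    ast.In, ast.GtE, ast.Lt, ast.LtE, ast.Eq,
)

def validate_filter(expr: str) -> str:
    # Reject any expression containing a node outside the whitelist,
    # e.g. function calls, attribute access or subscripts.
    tree = ast.parse(expr, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    return expr

validate_filter("(task == 'metamathqa') and (peft_method in ('lora', 'adalora'))")
```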
2025-06-19 11:45:43 +02:00
41921013f5 Method comparison evaluation suite (#2395)
Introduction of a method evaluation suite.

We generally face the problem that there is little knowledge about which PEFT methods perform best. To address this, we decided to build an evaluation suite that has defined tasks and shared hyper-parameters and that can be extended with new tasks and new method configurations over time.

For the sake of comparability, we decided not to incorporate user-submitted results, but we encourage users to inspect the results, suggest new experiments, and improve the configuration of methods they deem unfavorable.

As of now, there is only one task, based on the MetaMathQA dataset, which has the benefit of being complex while still fitting on a consumer GPU.

Notable changes in this squash:

* Add default training params

The experiment-specific training params build on the default training params
but can override any parameter from them if needed. This way it's
easier to make a change across all experiments (say, if I want to change the
base model, I don't need to edit each individual
training_parameters.json).
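
A minimal sketch of that override pattern (the defaults file name is hypothetical):

```
import json
from pathlib import Path

def load_training_params(experiment_dir: str) -> dict:
    # Start from the shared defaults, then let the experiment override
    # individual keys in its own training_parameters.json (if present).
    params = json.loads(Path("default_training_params.json").read_text())
    override_file = Path(experiment_dir) / "training_parameters.json"
    if override_file.exists():
        params.update(json.loads(override_file.read_text()))
    return params
```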

* Add possibility to change attn implementation

However, both flash attention 2 and flex attention are slower on my
system, so we stay with the default None (-> SDPA).
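
For reference, the attention backend in `transformers` is usually selected via the `attn_implementation` argument when loading the model; a sketch (the model name is a placeholder):

```
from transformers import AutoModelForCausalLM

# attn_implementation=None falls back to the default (SDPA on recent versions);
# "flash_attention_2" or "flex_attention" can be passed instead if supported.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",  # placeholder base model
    attn_implementation=None,
)
```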

* Refactor to use GenerationConfig

This makes it easier to use, say, the static cache, which is the new default,
as it's faster (apart from the first pass).
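
For reference, a static KV cache can be requested through `GenerationConfig` in `transformers`; a sketch, not the suite's exact configuration:

```
from transformers import GenerationConfig

# cache_implementation="static" pre-allocates the KV cache, which avoids
# reallocations after the first pass at the cost of a slower warm-up.
generation_config = GenerationConfig(
    max_new_tokens=256,
    do_sample=False,
    cache_implementation="static",
)
# model.generate(**inputs, generation_config=generation_config)
```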

* Better parsing of answers

E.g. 1/2 == 0.5
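
One way to handle such equivalences is to normalize both sides to a number before comparing; a minimal sketch (covering only plain fractions and decimals, not the full parser):

```
from fractions import Fraction

def parse_number(text: str) -> float:
    # Accept both fractions like "1/2" and plain decimals like "0.5".
    return float(Fraction(text.strip()))

assert parse_number("1/2") == parse_number("0.5")
```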

* Keep adapter file by default after train run

But add --clean to delete it.

Keeping the adapter can be useful if the user wants to run further tests
with the trained model.
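
A sketch of what such a flag typically looks like with `argparse` (illustrative, not the script's actual CLI):

```
import argparse

parser = argparse.ArgumentParser(description="Run a method-comparison experiment")
parser.add_argument(
    "--clean",
    action="store_true",
    help="delete the trained adapter after the run instead of keeping it",
)
args = parser.parse_args()
```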

---------

Co-authored-by: Benjamin Bossan <benjamin.bossan@gmail.com>
2025-03-27 17:00:38 +01:00