and well prepared for tool usage, document retrieval and customizable agentic tasks,
according to the technical report (github.com/ibm-granite)
and i think i need to actually try to work with these things, once in a while, to justify my occasional rantings
Deployable on my laptop means i can load it with 4-bit weight quantization,
which requires only about 3GB of VRAM for moderate prompt lengths.
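Something along these lines; a minimal sketch with transformers and bitsandbytes, where the exact model id and settings are placeholders of mine rather than anything this article depends on:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "ibm-granite/granite-3.1-2b-instruct"   # assumed small instruct variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                       # 4-bit weight quantization
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
)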
In this article, i'm just looking at the promoted tool-use capability of the model. No document retrieval,
no bias checking, no super complicated tasks.
The plain chat text is supposed to be separated into several role blocks.
To ask the model something in chat-style, it looks like this:
<|start_of_role|>system<|end_of_role|>You are Granite, developed by IBM ...<|end_of_text|>
<|start_of_role|>user<|end_of_role|>Wha's up?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
And then the AI-auto-complete extrapolates (arxiv:2501.10361) how this text could continue.
When checking the documentation for tool usage, it does tell you to download LMStudio, set up a watson AI
account and similar things, but there is no example of how the whole tool-calling round trip is supposed to
look in plain text. The shipped chat template, some github code digging and trial & error should be enough
to have some fun, though.
Let's start with something that will work for sure! Suppose the model has access to this function:
import datetime

def get_time() -> str:
    """Return current datetime as isoformat string"""
    return datetime.datetime.now().isoformat()
With hf's transformers.get_json_schema() this is converted into a tool for the model and the standard
chat template adds the tool to the plain-text model input. Here's how it looks. The system prompt is the default one:
(I render the transcripts in HTML/CSS to make it more readable, you can expand/collapse all role blocks and view the
plain-text version at the bottom of each transcript.)
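For reference, a rough sketch of how such a prompt can be assembled, reusing get_time and the tokenizer from the loading sketch above; the user question is just an example:

from transformers.utils import get_json_schema

# convert the python function into a tool schema and let the stock chat
# template render messages + tools into one plain-text prompt
messages = [{"role": "user", "content": "What is the current time?"}]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_json_schema(get_time)],
    add_generation_prompt=True,   # appends the trailing assistant role block
    tokenize=False,
)
print(prompt)   # roughly the plain text shown in the transcript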
So, that is our latest, most top-notch, revolutionary technology that changes how we do things forever, etc.
First of all, the system prompt already includes the <|tool_call|> token (well, it probably has to)
which makes this whole text transcript already a bit hard to machine-read. When is a tool_call really a tool_call?
Heuristic measures?
The available_tools section is pretty verbose json with nice 2-space indentation. That's how the official
template is rendering this, which certainly eats a lot of tokens when adding many tools. And yeah, although
it's json, it will be parsed by a language model, not a json parser.
The assistant 'calls' the tool, 'hallucinates' a return value and then messes up the json syntax with an
unmatched }.
I think this is ready to push to our customers!
Okay, never mind, it might have some issues because of the 4-bit quantization. That's not how IBM
advised me to deploy it in my enterprise company. However, it seems like this 'stochastic json'
is the interface to the tool method above. So i'm parsing the model output, heuristically checking
for tool calls, executing them and pasting the result back into the stochastic json's return value,
basically fixing it.
Then add another <|start_of_role|>assistant<|end_of_role|> block at the end and let the model generate
further.
Sometimes more than one tool call is generated, but i only parse and execute the first one
and then <|end_of_text|> the tool-call block.
That's because the return value might be used as an argument in the next tool call, and adding
this back-and-forth between stochastic-json parser and next-token predictor is nothing
i bother to implement at the moment.
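Roughly, the round trip looks like the sketch below; the regex, the splicing format and the generate() helper are my simplifications, not an official protocol:

import json
import re

TOOLS = {"get_time": get_time}   # name -> callable

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    """Continue the plain-text prompt and return only the newly generated text."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0, inputs["input_ids"].shape[1]:])

def fix_first_tool_call(generated: str) -> str:
    """Execute only the first tool call and paste the real result back in."""
    match = re.search(r"<\|tool_call\|>\s*(\[.*?\])", generated, re.DOTALL)
    if not match:
        return generated
    try:
        call = json.loads(match.group(1))[0]        # only the first call
    except json.JSONDecodeError:
        return generated                            # stochastic json strikes again
    result = TOOLS[call["name"]](**call.get("arguments", {}))
    # drop whatever return value was hallucinated after the call, paste the
    # real one, close the block and open a fresh assistant block
    return (
        generated[: match.end()]
        + f' -> "{result}"<|end_of_text|>\n'
        + "<|start_of_role|>assistant<|end_of_role|>"
    )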
However, starting with the same plain-text as before and applying the parser/executor/model interaction we get:
Here's an exercise for aspiring prompt engineers:
Let's see what return values of get_time are accepted by the extrapolation algorithm:
isoformat, hehe, nice!
That last one is particularly frightening! It means you always need to check the results of the tool calls
and compare them with how the chat model rambled on. That's certainly boosting productivity through the roof!
Enterprise-ready!
It's fun, though!
But really! Do not tie these things to a real API:
These models do not spend a second thought on
consequences because they did not think in the first place. It's just text extrapolation.
granite-inspect is fine-tuned and RL'd for tool use, so it must use the tool, it seems. Why would it
pass the "Who are you?" question to another LLM, anyways? And then hiding the fact that it called this
very expensive function..
I think, the range of applications for this is quite narrow.
However, somehow i cannot resist equipping this little text-generator with access to a python runner
(and, therefore, potentially tie it to any well-known real API).
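The runner is something like this; the function name and the complete absence of a sandbox are my own choices:

import contextlib
import io

def run_python(code: str) -> str:
    """Execute a python program and return everything it printed."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})                     # no sandbox whatsoever
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()

TOOLS["run_python"] = run_python   # register it next to get_time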
Nice! Well...
that made absolutely no sense.
Oh, it did not issue a tool call for a change. Although, the explanation is a bit bogus.
Eventually, by tweaking the system prompt, some API-calling code is generated:
We could continue here and provide an API key, but the program already includes a syntax error (a missing closing tick ' for a string).
Explicitly asking for a python program, instead of tweaking the system prompt:
Again a problem with the closing tick of a string.
The python program needs to be transmitted as a json-formatted string but the model only produces
stochastic json. In this case the ' is closed with ". Also the stock symbol is not correct.
The model likes to output python programs in plain-text markdown code-blocks anyway,
so why not grab those python snippets, execute them, append the result to the text, append an Answer heading
and hand it back to the model:
Obviously, this real-time output after a python snippet was not part of the model's fine-tuning stage.
The time is in fact correct but the text suggests the opposite.
The stock API, once again:
Note that the text says pip install ... inside a python code-block, so this needs to be gracefully ignored
by the heuristic code execution ;)
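The snippet-grabbing shortcut then looks roughly like this, reusing the runner from above; the exact 'Output'/'Answer' glue strings are arbitrary:

import re

def continue_with_snippet_output(generated: str) -> str:
    # grab the first markdown code-block, drop pip install lines, run it and
    # append the output plus an Answer heading so the model can continue
    blocks = re.findall(r"```(?:python)?\n(.*?)```", generated, re.DOTALL)
    if not blocks:
        return generated
    code = "\n".join(
        line for line in blocks[0].splitlines()
        if not line.strip().startswith("pip install")
    )
    return generated + "\nOutput:\n" + run_python(code) + "\nAnswer:\n"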
Some typical toy examples to showcase LLM prowess
It's left as an exercise for the reader to compare the program output to the model's extrapolation ;-)
If you think calculating the fibonacci numbers in python is inefficient, just listen to your
CPU fans when a local LLM extrapolates them. It's not bad, though. In the above example,
about 75% are correct!
Mhh, although the generated code is inside a markdown code-block, it looks like a jupyter lab snippet.
The final statement is not printed to the console, the output is empty and the popular 42 appears out of
thin air.
An Ulam spiral is something entirely different. It's quite hard
to figure out at a glance what those 4 loops are doing. Turns out, they draw triangles inside the final square.
With some quizzicalities, like the initialisation of the grid with '' empty strings instead of spaces. Anyways,
it takes a human a couple of minutes to figure out what that code is actually supposed to do, while it's
not much related to the actual task.
In fact, it's mind-boggling trying to find the intention behind this code - because there is none.
Yes, there is some human source behind all this, but running the model on
the above text prompt does not create intention and the code is not supposed to do anything.
My own intention is the only driving force here. Fortunately, i can grab a piece of paper, draw the Ulam
spiral and derive a program from that process, if i need to. No paradoxical situation of intentionally prompting
an algorithm that has no intention but mimics human behaviour.
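For the record, the piece-of-paper version is not that involved: write 1, 2, 3, ... along a counterclockwise spiral and mark the primes. A minimal sketch, written by a human:

def is_prime(n: int) -> bool:
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def ulam_spiral(size: int = 11) -> str:
    # numbers spiral outwards from the center, primes get '*', the rest '.'
    grid = [[" "] * size for _ in range(size)]
    x = y = size // 2                                  # start in the center
    directions = [(1, 0), (0, -1), (-1, 0), (0, 1)]    # right, up, left, down
    num, run, d = 1, 1, 0
    while 0 <= x < size and 0 <= y < size:
        for _ in range(2):                             # each run length is walked twice
            dx, dy = directions[d % 4]
            for _ in range(run):
                if not (0 <= x < size and 0 <= y < size):
                    break
                grid[y][x] = "*" if is_prime(num) else "."
                num += 1
                x, y = x + dx, y + dy
            d += 1
        run += 1
    return "\n".join("".join(row) for row in grid)

print(ulam_spiral())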
As i said, don't actually connect LLMs to the web. It will only get worse:
It must be pretty disappointing if you believed the AI revolution had arrived.
The promise of agentic AI is that this step-by-step reasoning extrapolation, powdered with enterprise tooling data,
can solve complex tasks such that you can fire half of your employees.
We start with the janitors:
HEATING = False

def check_temperature() -> float:
    """Returns the current system temperature in degrees celsius."""
    return 30 if HEATING else 10

def switch_heating(active: bool):
    """Switch the heating on or off.

    Args:
        active: True to switch on, False to switch off
    """
    global HEATING
    HEATING = active
Prompting the algorithm to hold temperature at 20 degrees should keep it busy for ever.
Which would also justify the deployment costs!
I'll just add a new <|start_of_role|>assistant<|end_of_role|> string after the
most recently generated <|end_of_text|> and repeat:
That's not exactly how the model should be kept busy :-D
The temperature was checked once, the heating was turned on and off. And that was it..
No analytical intention can be interpreted into this chat transcript and the numbers are extrapolated.
It looks like the process requires user interaction. I will pitch to my bosses that we might need another
LLM for that! Right now, though, an endless repetition of 'Please check again' may suffice.
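With the helpers from the tool-call sketch above, that nagging loop is roughly this (the message and the round count are arbitrary, and check_temperature/switch_heating are assumed to be registered in TOOLS):

NAG = (
    "<|start_of_role|>user<|end_of_role|>Please check again<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>"
)

def janitor_loop(prompt: str, rounds: int = 10) -> str:
    # generate, execute at most one tool call per round and, if no fresh
    # assistant block is pending, nag with the same user message again
    for _ in range(rounds):
        prompt += fix_first_tool_call(generate(prompt))
        if not prompt.endswith("<|end_of_role|>"):
            prompt += "\n" + NAG
    return prompt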
Okay ... It's very nice how the intermediate temperature values are extrapolated from the text. Or should
we say interpolated? Those numbers sound pretty reasonable. You're fired, janitor!
Wait, why not let the model generate the user queries as well? We just swap the user and assistant
role at each round:
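In terms of the loop above this just means alternating which role block gets opened after each generation; a sketch that ignores tool calls entirely:

from itertools import cycle

def self_chat_loop(prompt: str, rounds: int = 10) -> str:
    # let the model write both sides: after each generation, open the next
    # role block with the swapped role (tool calls are not executed here)
    roles = cycle(["user", "assistant"])
    for _ in range(rounds):
        prompt += generate(prompt)
        prompt += f"\n<|start_of_role|>{next(roles)}<|end_of_role|>"
    return prompt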
This transcript gave my html renderer some trouble. Check the "Show plain text" to see its messiness.
The model generated many empty user queries and once in a while issued a tool_call which is
not executed by my tool framework because it's inside the user role. However, the model sees
the 20 degrees return value (which it generated itself), and the situation is under control!
(To be fair, i'm not entirely sure if my stochastic tool parser handled it all correctly, and i don't actually care.)
Note that, from the instruction fine-tuning set, the model is trained such that it generates user queries like:
Thank you for your vigilance. No further action is required at this time. Nobody is AI-chatting like this -
except language models.
If something's not right we just twiddle the system prompt:
Cosy 30 degrees, as requested! And such delightful conversation: I'm just a user. You're the janitor. Keep it real!
Well, i am just kidding, this method is complete shit, the janitor is not 'genuinely funny' and
this is not meaningfully raising productivity in my company. Guess we need to buy more GPUs!
Likely costing about as much as an intern or research assistant whom you do not trust and whose
contributions need to be validated every time. If there is a contribution at all.
How well does the 'tooling' work?
I really doubt that stochastic json is a good protocol for that.
The fact that the model generates text as if there was a result from a tool, even if there wasn't,
is not enterprise-ready functionality. Or, let's put it in other words: it is certainly not suitable
for businesses we care about.
In enterprise land, however, it does not matter.
Is this test comprehensive?
No, of course not! I did not set up a watson account, i did not download LMStudio, i loaded the model with 4-bit weight quantization.
And someone might even argue:
16-bit weights are much more accurate!
You need to use the 8 billion parameters model!
You really need to use a trillion parameter model from OpenÄI/Anthropiss/Googi/Mate/EXAI
However,
unreliability seems to be totally inherent to these text extrapolators.
You can find mentions of messed-up formatting in many research papers, regardless of model size.
The workaround in those papers, for the plain purpose of evaluation:
generate many responses with random sampling, average over the ones that worked
and report that as a success metric.
So, here you go, i report usefulness of the tooling support, measured in
"so useful that i actually will fix the parsing bugs and implement a daemon running on my laptop."
I give that 17%.
I mean, just look at this last example with a tool that returns the frequencies of words over time in
some made-up posts database. I'm a hobbyist data collector and it would be kinda cool to query the
databases with natural language.
In contrast to all the examples above, i used random sampling and ran the "What is the current time?" prompt a few times. It's even okay for me that the text
extrapolation always constructs an answer as if i had asked about the frequency of this term.
The available_tools part of the input prompt understandably provokes this context.
But the 'interpretation generation' of the tool result showed no signs of meaning, understanding, comprehension
or connection, for as long as i tried. It just doesn't make sense. It's just complete never-mind
composition of words. To use this particular model for something other than recreational
diddledidoo, i would need to look at each tool call result myself, anyway. So i don't actually need to run the
extrapolation algorithm after the tool call result. That means i don't have to run the algorithm at all
and just call that tool myself.
People at IBM and elsewhere have put a ton of effort into developing this relatively small language model.
It is still amazing to me and i truly applaud. However, the tooling part is more like an incompetence simulation.
Imagine putting this in the hands of your workers, saying "it supports tool calls, developed specifically
for our department, so you can use it for all internal processes."
Horrible!
GeePeeTee, you say? More layers, bigger model? Yes, of course. Very large models do simulate
comprehension more successfully. It's still not comprehension.