
Tool usage with the granite language model

(Dear deep-learning community, please spare our planet from scraping this page and stuffing it into your next training set. It will only get worse!)

I am testing the granite-3.3-2b-instruct model because it is deployable on my laptop.

Deployable on my laptop means i can load it with 4-bit weight quantization, which requires only about 3 GB of VRAM for moderate prompt lengths.
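
For reference, a minimal sketch of such a 4-bit load with the usual transformers + bitsandbytes stack (model id as listed on the Hugging Face hub):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "ibm-granite/granite-3.3-2b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,  # weights in 4 bit, compute in fp16
    ),
    device_map="auto",
)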

In this article, i'm just looking at the promoted tool-use capability of the model. No document retrieval, no bias checking, no super complicated tasks.


The plain chat text is separated into several role blocks. Asking the model something in chat style looks like this:

<|start_of_role|>system<|end_of_role|>You are Granite, developed by IBM ...<|end_of_text|>
<|start_of_role|>user<|end_of_role|>What's up?<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>

And then the AI-auto-complete extrapolates (arxiv:2501.10361) how this text could continue.

When checking the documentation for tool usage, it tells you to download LM Studio, set up a watson AI account and similar things, but there is no example of how the whole tool-calling round-trip is supposed to look in plain text. The shipped chat template, some GitHub code digging and trial & error should be enough to have some fun, though.

Let's start with something that will work for sure! Suppose the model has access to this function:

import datetime

def get_time() -> str:
    """Return the current datetime as an isoformat string"""
    return datetime.datetime.now().isoformat()

With hf's transformers.get_json_schema() this is converted into a tool definition for the model, and the standard chat template adds the tool to the plain-text model input.
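
In code, the conversion and prompt-building look roughly like this (a sketch following the current transformers API):

from transformers.utils import get_json_schema

tools = [get_json_schema(get_time)]
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What time is it?"}],
    tools=tools,
    add_generation_prompt=True,  # appends the trailing assistant role block
    tokenize=False,
)

Here's how the resulting plain text looks. The system prompt is the default one: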

(I render the transcripts in HTML/CSS to make it more readable, you can expand/collapse all role blocks and view the plain-text version at the bottom of each transcript.)

So, that is our latest, most top-notch, revolutionary technology that changes how we do things forever, etc.

I think this is ready to push to our customers!

Okay, never mind, it might have some issues because of the 4-bit quantization. That's not how IBM advised me to deploy it in my enterprise company. However, it seems like this 'stochastic json' is the interface to the tool method above. So i'm parsing the model output, heuristically checking for tool calls, executing them and pasting the result back into the stochastic json's return value, basically fixing it. Then i add another <|start_of_role|>assistant<|end_of_role|> block at the end and let the model generate further.
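
The parser is nothing fancy, roughly this (a sketch; find_tool_call and TOOLS are my own names, and the regex simply assumes the tool call shows up as a json list somewhere in the assistant block):

import json
import re

TOOLS = {"get_time": get_time}

def find_tool_call(text: str):
    """Heuristically extract the first {"name": ..., "arguments": ...} blob."""
    match = re.search(r"\[\s*\{.*?\}\s*\]", text, re.DOTALL)
    if match is None:
        return None
    try:
        calls = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None  # stochastic json strikes again
    return calls[0] if calls and "name" in calls[0] else None

call = find_tool_call(completion)  # completion: the freshly generated assistant block
if call is not None:
    result = TOOLS[call["name"]](**call.get("arguments", {}))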

Sometimes more than one tool call is generated, but i only parse and execute the first one and then <|end_of_text|> the tool-call block. That's because the return value might be used as an argument in the next tool call, and adding this back-and-forth between stochastic-json parser and next-token predictor is nothing i bother to implement at the moment.

However, starting with the same plain text as before and applying the parser/executor/model interaction, we get:

Here's an exercise for aspiring prompt engineers:

Let's see what return values of get_time are accepted by the extrapolation algorithm:

isoformat, hehe, nice!

That last one is particularly frightening! It means you always need to check the results of the tool calls and compare them with how the chat model rambled on. That's certainly boosting productivity through the roof! Enterprise-ready!

It's fun, though!

But really! Do not tie these things to a real API:

These models do not spend a second thought on consequences because they did not think in the first place. It's just text extrapolation.

granite-instruct is fine-tuned and RL'd for tool use, so it must use the tool, it seems. Why would it pass the "Who are you?" question to another LLM, anyway? And then hide the fact that it called this very expensive function...

I think, the range of applications for this is quite narrow.

Unrestricted access to python

However, somehow i cannot resist equipping this little text generator with access to a python runner (and, therefore, potentially tying it to any well-known real API).
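
The 'python runner' is just this (a sketch, deliberately without any sandboxing, which is the whole problem):

import subprocess
import sys

def run_python(code: str) -> str:
    """Execute a python program in a subprocess and return whatever it printed."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=10,
    )
    return result.stdout + result.stderr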

Nice! Well...

that made absolutely no sense.

Oh, it did not issue a tool call, for a change. The explanation is a bit bogus, though.

Eventually, by tweaking the system prompt, some API-calling code is generated:

We could continue here and provide an API key, but the program already includes a syntax error (a missing closing tick ' for a string).

Explicitly asking for a python program, instead of tweaking the system prompt:

Ok, install yfinance and run again:

This very similar prompt does not work:

Again a problem with the closing tick of a string. The python program needs to be transmitted as a json-formatted string, but the model only achieves stochastic json. In this case the ' is closed with ". Also, the stock symbol is not correct.

Running code directly without the tool detour

The model likes to output python programs in plain-text markdown code-blocks anyway, so why not grab those python snippets, execute them, append the result to the text, append an Answer heading and hand it back to the model.
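
The heuristic is again just a regex (a sketch; it reuses the run_python helper from above, and model_output stands for the generated text):

import re

def extract_python_blocks(text: str) -> list[str]:
    """Collect the contents of ```python ... ``` markdown blocks."""
    return re.findall(r"```python\s*\n(.*?)```", text, re.DOTALL)

for snippet in extract_python_blocks(model_output):
    if snippet.lstrip().startswith("pip install"):
        continue  # shell commands disguised as python
    model_output += "\nOutput:\n" + run_python(snippet) + "\nAnswer:\n"

Applied to the time question: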

Obviously, this real-time output after a python snippet was not part of the model's fine-tuning stage. The time is in fact correct but the text suggests the opposite.

The stock API, once again:

Note that the text says pip install ... inside a python code-block, so this needs to be gracefully ignored by the heuristic code execution ;)

Some typical toy examples, to showcase LLM prowess

It's left as an exercise for the reader to compare the program output to the model's extrapolation ;-)

If you think calculating fibonacci numbers in python is inefficient, just listen to your CPU fans when a local LLM extrapolates them. It's not bad, though: in the above example, about 75% are correct!

Mhh, although the generated code is inside a markdown code-block, it looks like a jupyter lab snippet that relies on the final expression being displayed automatically. Run as a plain script, the final statement is not printed to the console, the output is empty, and the popular 42 appears out of thin air.

A subliminal quine:

A creative way to print a block of asterisks:

An Ulam spiral is something entirely different. It's quite hard to figure out at a glance what those 4 loops are doing. Turns out, they draw triangles inside the final square, with some quirks, like the initialisation of the grid with empty strings '' instead of spaces. Anyway, it takes a human a couple of minutes to figure out what that code is actually supposed to do, even though it's not much related to the actual task.

In fact, it's mind-boggling trying to find the intention behind this code, because there is none. Yes, there is some human source behind all this, but running the model on the above text prompt does not create intention, and the code is not supposed to do anything. My own intention is the only driving force here. Fortunately, i can grab a piece of paper, draw the Ulam spiral and derive a program from that process if i need to. No paradoxical situation of intentionally prompting an algorithm that has no intention but mimics human behaviour.


As i said, don't actually connect LLMs to the web. It will only get worse:

It must be pretty disappointing if you believed the AI revolution had arrived.

Multi-turn tool calls

The promise of agentic AI is that this step-by-step reasoning extrapolation, powdered with enterprise tooling data, can solve complex tasks so well that you can fire half of your employees.

We start with the janitors:

HEATING = False

def check_temperature() -> float:
    """
    Returns the current system temperature in degrees celsius.
    """
    return 30 if HEATING else 10

def switch_heating(active: bool):
    """
    Switch the heating on or off.
    Args:
        active: True to switch on, False to switch off
    """
    global HEATING
    HEATING = active
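
Both functions are registered just like get_time before (sketch, reusing the names from the earlier snippets):

from transformers.utils import get_json_schema

tools = [get_json_schema(check_temperature), get_json_schema(switch_heating)]
TOOLS = {
    "check_temperature": check_temperature,
    "switch_heating": switch_heating,
}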

Prompting the algorithm to hold the temperature at 20 degrees should keep it busy forever. Which would also justify the deployment costs!

I'll just add a new <|start_of_role|>assistant<|end_of_role|> string after the most recently generated <|end_of_text|> and repeat.
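
As a sketch, the loop looks like this (generate() stands for model.generate plus decoding, patch_return_value() for the json fix-up described earlier; both are stand-in names):

while True:
    prompt += "<|start_of_role|>assistant<|end_of_role|>"
    completion = generate(prompt)  # samples until <|end_of_text|>
    call = find_tool_call(completion)
    if call is not None:
        result = TOOLS[call["name"]](**call.get("arguments", {}))
        # paste the real return value into the stochastic json, 'fixing' it
        completion = patch_return_value(completion, result)
    prompt += completion

Here's what comes out: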

That's not exactly how the model should be kept busy :-D

The temperature was checked once, the heating was turned on and then off, and that was it... No analytical intention can be interpreted into this chat transcript, and the numbers are extrapolated. It looks like the process requires user interaction. I will pitch to my bosses that we might need another LLM for that! Right now, though, an endless repetition of 'Please check again' may suffice.

Okay ... It's very nice how the intermediate temperature values are extrapolated from the text. Or should we say interpolated? Those numbers sound pretty reasonable. You're fired, janitor!

Wait, why not let the model generate the user queries as well? We just swap the user and assistant role at each round.
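
That's a one-line change in the loop (sketch, same stand-in generate() as above):

import itertools

for turn in itertools.count():
    role = "assistant" if turn % 2 == 0 else "user"
    prompt += f"<|start_of_role|>{role}<|end_of_role|>"
    prompt += generate(prompt)

The result: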

This transcript gave my html renderer some trouble. Check the "Show plain text" view to see its messiness. The model generated many empty user queries and once in a while issued a tool_call, which is not executed by my tool framework because it's inside the user role. However, the model sees the 20 degrees return value (which it generated itself), and the situation is under control!

(To be fair, i'm not entirely sure if my stochastic tool parser handled it all correctly, and i don't actually care.)

Note that, judging from the instruction fine-tuning set, the model is trained to generate user queries like: Thank you for your vigilance. No further action is required at this time. Nobody chats with an AI like this, except language models.

If something's not right we just twiddle the system prompt:

Cosy 30 degrees, as requested! And such delightful conversation: I'm just a user. You're the janitor. Keep it real!

Well, i am just kidding, this method is complete shit, the janitor is not 'genuinely funny' and this is not meaningfully raising productivity in my company. Guess we need to buy more GPUs!

Conclusion

So, here you go, i report usefulness, measured in "so useful that i actually will fix the tool-parsing bugs and implement a daemon running on my laptop". I give it 17%.