Comparing model performance: GPT-3.5 Turbo vs GPT-4o mini

My Telegram calendar bot converts natural language inputs into calendar invites. It currently uses GPT-3.5 Turbo and I wanted to see how the recently launched GPT-4o mini compares. Not only is the new model 60% cheaper, but it should also be more intelligent. Here’s a breakdown of the journey and the notable differences between these models.

The Evolution

Version 1: Basic hardcoded text parser with limited functionality. It worked quite well, and I’m still using it as a fallback if the API returns errors.
Version 2: Upgraded to GPT-3.5 Turbo. It can now extract event titles from pretty much anything. However, the date parsing required a lot of hacking to make it accurate with relative dates.
Version 3: The new GPT-4o mini.
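The fallback idea from Version 1 can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: `parse_with_rules`, `parse_event`, and the `llm_parser` callable are hypothetical names, and the rule-based parser here only understands inputs of the form "title HH:MM today".

```python
import re
from datetime import date


def parse_with_rules(text, today):
    """Hypothetical stand-in for the hardcoded parser: handles only 'HH:MM today'."""
    match = re.search(r"\b(\d{1,2}):(\d{2})\b", text)
    if match is None or "today" not in text.lower():
        return None
    # Everything before the time is treated as the event title.
    title = text[: match.start()].strip()
    return {
        "date": today.isoformat(),
        "time": f"{int(match.group(1)):02d}:{match.group(2)}",
        "event_title": title,
    }


def parse_event(text, today, llm_parser):
    """Try the LLM parser first; fall back to the rule-based parser on any error."""
    try:
        result = llm_parser(text, today)
        if result is not None:
            return result
    except Exception:
        # Network failures, rate limits, malformed JSON, and so on.
        pass
    return parse_with_rules(text, today)
```

Passing the LLM call in as a function keeps the fallback logic testable without network access.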

Parsing dates

Let’s examine how these models handle relative time without the additional tweaks I used for GPT-3.5 Turbo.

Example JSON Output:
{
"date": "2022-01-01",
"time": "14:00",
"event_title": "Lunch with friends",
"event_description": "",
"event_location": "Sushi Töölö",
"duration_in_minutes": 60
}

- This JSON output is used to generate a calendar event that's shared with people.
- The date key can only be of the following format: "2022-01-01". 
- Estimate event duration
- Only add description if it provides extra context

Current date: 2024-07-19
Input: EMMA museum 12:00 today
JSON Output:

GPT-3.5 Turbo

{
"date": "2024-07-21",
"time": "12:00",
"event_title": "EMMA museum visit",
"event_description": "",
"event_location": "EMMA museum",
"duration_in_minutes": 120
}

GPT-4o mini

{
  "date": "2024-07-20",
  "time": "12:00",
  "event_title": "Visit to EMMA Museum",
  "event_description": "A day out at the Espoo Museum of Modern Art",
  "event_location": "EMMA Museum",
  "duration_in_minutes": 120
}

The difference is drastic. GPT-3.5 Turbo picked the wrong date (2024-07-21, a Sunday, rather than Saturday) and left the event description empty. The new model handled the relative day easily and even added a helpful event description.
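The weekdays behind those dates can be confirmed with the standard library. The snippet below is only a sanity check on the three dates that appear in the example; the `DAY_NAMES` list is introduced here so the output does not depend on the system locale.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

dates = {
    "current date": date(2024, 7, 19),
    "GPT-3.5 Turbo": date(2024, 7, 21),
    "GPT-4o mini": date(2024, 7, 20),
}

for label, day in dates.items():
    # date.weekday() counts from Monday = 0
    print(f"{label}: {day.isoformat()} is a {DAY_NAMES[day.weekday()]}")
```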

Verifying results with random dates

import json
import random
from datetime import datetime, timedelta

from openai import OpenAI

client = OpenAI()


def get_random_date():
    """Generate a random date in 2024"""
    return datetime(2024, 1, 1) + timedelta(days=random.randint(0, 365))


def format_date(date):
    """Format the date as YYYY-MM-DD."""
    return date.strftime("%Y-%m-%d")


def call_gpt(model, prompt):
    """Call the OpenAI API with the given model and prompt."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant that creates calendar events from natural language input.",
                },
                {"role": "user", "content": prompt},
            ],
            temperature=0,
        )
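        # message.content may be None, in which case json.loads raises TypeError
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        # Treat unparseable output as a failed answer
        return None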


def check_response(response, target_date):
    """Check whether the response date falls on a Friday (target_date is not used)."""
    if response and "date" in response:
        try:
            response_date = datetime.strptime(response["date"], "%Y-%m-%d")
            return response_date.weekday() == 4  # 4 represents Friday (0 is Monday)
        except ValueError:
            return False
    return False


def main():
    models = {"gpt-3.5-turbo": "GPT-3.5-Turbo", "gpt-4o-mini": "GPT-4o-Mini"}
    stats = {
        "total": 0,
        "gpt-3.5-turbo": 0,
        "gpt-4o-mini": 0,
    }

    for i in range(100):
        random_date = get_random_date()
        prompt = f"""
Example JSON Output:
{{
"date": "2022-01-01",
"time": "14:00",
"event_title": "Lunch with friends",
"event_description": "",
"event_location": "Sushi Töölö",
"duration_in_minutes": 60
}}

- This JSON output is used to generate a calendar event that's shared with people.
- The date key can only be of the following format: "2022-01-01". 
- Estimate event duration
- Only add description if it provides extra context

Current date: {format_date(random_date)}
Input: EMMA museum 12:00 Friday
JSON Output:"""

        stats["total"] += 1
        for model_id, model_name in models.items():
            response = call_gpt(model_id, prompt)
            is_correct = check_response(response, random_date)
            try:
                response_date = datetime.strptime(response["date"], "%Y-%m-%d")
                parsed_weekday_humanized = response_date.strftime("%A")
                # print(f"Week date: {response_date.weekday()}")
            except Exception:
                response_date = None
                parsed_weekday_humanized = None
            if is_correct:
                stats[model_id] += 1

            print(
                f"- {model_name}: {is_correct} - {parsed_weekday_humanized} - {response_date}"
            )

        # Print running stats after each iteration
        print(f"@ Stats ({stats['total']} total):")
        for model_id, model_name in models.items():
            score_percentage = stats[model_id] / stats["total"]
            score_percentage = round(score_percentage * 100, 2)
            print(f"@ {model_name}: {score_percentage}%")


if __name__ == "__main__":
    main()
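Note that the check in the script accepts any Friday, including one from the wrong week. A stricter variant could compare the answer against the expected date. The sketch below assumes "Friday" means the first Friday on or after the current date; `next_weekday` and `check_response_strict` are hypothetical helpers that are not part of the script, so the accuracy figures in this post were not produced with them.

```python
from datetime import date, datetime, timedelta

FRIDAY = 4  # date.weekday(): Monday is 0


def next_weekday(current, weekday, include_today=True):
    """Return the first date on or after `current` that falls on `weekday`."""
    days_ahead = (weekday - current.weekday()) % 7
    if days_ahead == 0 and not include_today:
        days_ahead = 7
    return current + timedelta(days=days_ahead)


def check_response_strict(response, current):
    """Hypothetical stricter check: the date must be the upcoming Friday."""
    if not response or "date" not in response:
        return False
    try:
        parsed = datetime.strptime(response["date"], "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return False
    return parsed == next_weekday(current, FRIDAY)
```

Whether an input of "Friday" on a Friday should mean today or next week is a judgment call, which is why `include_today` is a parameter.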

I might have celebrated too soon. I ran the script above with random dates, and while the results are better with the new model, they are not as accurate as I hoped:

Model	Accuracy
GPT-3.5-Turbo	54.05%
GPT-4o-Mini	87.84%

So we do have to give the model a little hint. If we add “Current weekdate: {random_date_weekday_humanized}” to the prompt, where the value is the name of the current date's weekday, the results get much better. These results are after 100 calls.

Model	Accuracy
GPT-3.5-Turbo	90.0%
GPT-4o-Mini	100%
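The weekday hint could be generated along these lines. This is a minimal sketch: `build_date_context` is a hypothetical helper, the "Current weekdate" label is copied from the prompt line quoted above, and the explicit `DAY_NAMES` list is used so the weekday is always in English regardless of locale.

```python
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def build_date_context(current):
    """Return the prompt lines that tell the model the date and its weekday."""
    random_date_weekday_humanized = DAY_NAMES[current.weekday()]
    return (
        f"Current date: {current.strftime('%Y-%m-%d')}\n"
        f"Current weekdate: {random_date_weekday_humanized}"
    )


print(build_date_context(datetime(2024, 7, 19)))
```

In the test script, these two lines would take the place of the single "Current date" line in the prompt.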

Previous Workaround for GPT-3.5 Turbo

For reference, here’s the extra prompt line previously required for GPT-3.5 Turbo:

- The date key can only be of the following format: "TODAY", "TOMORROW", "MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY", "2022-01-01". Prefer exact date if it's provided in the input. Use the current year (%s).

Additional Python code was necessary to convert relative dates to actual dates.
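That conversion step might look something like the following. This is a minimal sketch rather than the bot's actual code: `resolve_date` is a hypothetical name, and it assumes a weekday name refers to its next occurrence, with today counting if the weekday matches.

```python
from datetime import date, datetime, timedelta

WEEKDAYS = ["MONDAY", "TUESDAY", "WEDNESDAY", "THURSDAY", "FRIDAY", "SATURDAY", "SUNDAY"]


def resolve_date(value, today):
    """Convert a date token from the model into a concrete date."""
    token = value.strip().upper()
    if token == "TODAY":
        return today
    if token == "TOMORROW":
        return today + timedelta(days=1)
    if token in WEEKDAYS:
        # Assumption: a weekday name means its next occurrence, today included.
        days_ahead = (WEEKDAYS.index(token) - today.weekday()) % 7
        return today + timedelta(days=days_ahead)
    # Otherwise expect an exact date such as "2022-01-01"; raises ValueError if not.
    return datetime.strptime(token, "%Y-%m-%d").date()
```

With this split, the model only has to recognize which day the user means, and the date arithmetic stays in deterministic code.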