Data Science & Advanced Analytics


Can AI Be Trusted With Schedule Analysis? I Benchmarked 15 Models to Start Answering That

    Posted 4 hours ago

    I benchmarked 15 leading AI models on controlled project-controls tasks, and the spread between them is large enough that it should change how we talk about "using AI" in scheduling.

    Each model answered the same unaided schedule-analysis problems: critical path, float, DCMA-14 logic, statusing, and earned value. No calculators. No external tools. No schedule engine access during the answer.

    The important part: every answer was checked against objective, arithmetic results from a deterministic scheduling engine.

    These were calculable outputs (critical path, float, dates, variances, and earned-value numbers), not subjective opinions about whether an answer "sounded right" and not what we hoped the model understood. Model output was compared directly against computed ground truth.
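
    To make "computed ground truth" concrete, here is a minimal sketch of a critical-path forward and backward pass, followed by an exact-match comparison against a model's float answers. It is an illustration only and not the engine or grading scheme used in this benchmark: the four-activity network, the finish-to-start zero-lag links, the `cpm` and `grade` helpers, and the exact-match scoring rule are all assumptions.

```python
from collections import defaultdict

def cpm(durations, predecessors):
    """Forward/backward pass, assuming finish-to-start links with zero lag."""
    successors = defaultdict(list)
    indegree = {a: 0 for a in durations}
    for act, preds in predecessors.items():
        for p in preds:
            successors[p].append(act)
            indegree[act] += 1

    # Topological order (Kahn's algorithm) so predecessors are processed first.
    order = []
    ready = sorted(a for a, d in indegree.items() if d == 0)
    while ready:
        a = ready.pop(0)
        order.append(a)
        for s in successors[a]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    if len(order) != len(durations):
        raise ValueError("logic loop detected")

    # Forward pass: early start / early finish.
    es, ef = {}, {}
    for a in order:
        es[a] = max((ef[p] for p in predecessors.get(a, [])), default=0)
        ef[a] = es[a] + durations[a]
    finish = max(ef.values())

    # Backward pass: late finish / late start.
    ls, lf = {}, {}
    for a in reversed(order):
        lf[a] = min((ls[s] for s in successors[a]), default=finish)
        ls[a] = lf[a] - durations[a]

    total_float = {a: ls[a] - es[a] for a in durations}
    critical = [a for a in order if total_float[a] == 0]
    return finish, total_float, critical

def grade(model_floats, true_floats):
    """Fraction of activities whose float the model got exactly right (assumed rule)."""
    hits = sum(1 for a, tf in true_floats.items() if model_floats.get(a) == tf)
    return hits / len(true_floats)

# Hypothetical network: A precedes B and C; B and C both precede D.
durations = {"A": 3, "B": 2, "C": 4, "D": 1}
predecessors = {"B": ["A"], "C": ["A"], "D": ["B", "C"]}
finish, total_float, critical = cpm(durations, predecessors)

# Hypothetical model answer that wrongly treats B as critical.
model_answer = {"A": 0, "B": 0, "C": 0, "D": 0}
score = grade(model_answer, total_float)
```

    In this toy case the computed path is A, C, D with 2 days of float on B, so an answer that calls every activity critical gets one of four floats wrong. The point is that the check is arithmetic, not a judgment call.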


    Takeaway 1: The model you choose materially changes reliability.


    Chart 1 shows the main schedule-analysis leaderboard.

    • The top score was Fable 5 at 0.95. I confirmed that result across three internal runs of the same objectively graded battery, and it was stable each time.
    • Among the established frontier models, the spread is also meaningful. ChatGPT-5.5 scored 0.83, while Claude Opus 4.8 scored 0.59 on the same benchmark.
    • Several lighter models landed below 0.25.

    That matters because "we use AI" tells you almost nothing in project controls.

    Which model you use is the real decision.


    Takeaway 2: The best schedule reasoner is not always the best optimizer.

    Chart 2 compares unaided schedule reasoning against tool-aided, single-shot performance on resource-constrained optimization and leveling (RCPSP). The tool-aided setup is similar to using the ChatGPT or Claude chat interface, where models usually write code to handle complex tasks.

    This is where the result gets more interesting.

    • Claude Opus 4.8 was mid-pack on the schedule-reasoning leaderboard, but ranked first on the resource-leveling test, hitting the proven-optimal answer.
    • ChatGPT-5.5 showed the opposite pattern: very strong on reasoning, but much weaker on resource leveling.
    • Fable 5 was the only model that stayed near the top across both views.

    That is the practical lesson: "best AI model" is not a universal label. It depends on the job.

    The model I would use to draft or challenge a delay narrative is not automatically the model I would trust to reason through an optimization problem.

    And even then, resource leveling is exactly the kind of work where I would still want engine-backed verification before treating the answer as reliable.
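
    As a rough picture of what engine-backed verification can mean for leveling, the sketch below checks a proposed set of start dates against precedence logic and a single resource limit. It is an assumption-laden illustration, not the benchmark's checker: the `check_leveled_schedule` helper, the three-activity example, the single renewable resource, and the finish-to-start zero-lag links are all hypothetical. It also only tests feasibility; confirming that a schedule is optimal needs a solver.

```python
def check_leveled_schedule(starts, durations, predecessors, demand, capacity):
    """Return (violations, horizon) for a proposed single-resource schedule."""
    violations = []

    # Precedence check: an activity may not start before each predecessor finishes.
    for act, preds in predecessors.items():
        for p in preds:
            if starts[act] < starts[p] + durations[p]:
                violations.append(f"{act} starts before {p} finishes")

    # Resource check: total demand in each period must stay within capacity.
    horizon = max(starts[a] + durations[a] for a in durations)
    for t in range(horizon):
        used = sum(
            demand[a]
            for a in durations
            if starts[a] <= t < starts[a] + durations[a]
        )
        if used > capacity:
            violations.append(
                f"period {t}: demand {used} exceeds capacity {capacity}"
            )
    return violations, horizon

# Hypothetical data: C follows A; three crew units are available per period.
durations = {"A": 2, "B": 3, "C": 2}
predecessors = {"C": ["A"]}
demand = {"A": 2, "B": 2, "C": 1}
capacity = 3

# A polished-looking but infeasible answer: A and B overlap and need 4 units.
bad_violations, bad_horizon = check_leveled_schedule(
    {"A": 0, "B": 0, "C": 2}, durations, predecessors, demand, capacity
)

# A feasible alternative that delays B until A is done.
good_violations, good_horizon = check_leveled_schedule(
    {"A": 0, "B": 2, "C": 2}, durations, predecessors, demand, capacity
)
```

    The infeasible answer finishes a period earlier, which is exactly why it can look better in a report until the resource profile is checked.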


    My conclusion

    My takeaway is neither "AI can do schedules" nor "AI cannot do schedules."

    It is this:

    AI is useful for schedule-analysis support, but model choice and verification are not optional, and that almost always comes at an additional cost.

    • I would use top-tier AI for first-pass reasoning, narrative drafting, explaining a DCMA flag, and initial schedule logic review (on small, clean schedules).
    • I would not treat a general chatbot as the system of record for schedule, earned-value, or resource-leveling numbers without trusted, engine-backed verification.
    • A float calculation, SPI(t), or leveling result can look polished in a report and still be wrong.
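
    For example, a reported SPI(t) can be recomputed in a few lines before it reaches a report. The sketch below uses the standard earned-schedule definition, ES = C + (EV - PV_C) / (PV_{C+1} - PV_C) with SPI(t) = ES / AT, and linear interpolation within a period. The planned-value curve, the reported figure, the `earned_schedule` helper, and the 0.005 tolerance are hypothetical values chosen for illustration.

```python
def earned_schedule(pv_cumulative, ev):
    """Earned Schedule in periods, interpolating linearly within a period.

    pv_cumulative[i] is cumulative planned value at the end of period i + 1
    and is assumed to be non-decreasing.
    """
    pv = [0.0] + list(pv_cumulative)  # pv[0] is planned value at time zero
    if ev >= pv[-1]:
        return float(len(pv) - 1)     # all planned work has been earned
    # C = last period end where cumulative planned value is still <= EV.
    c = max(i for i, v in enumerate(pv) if v <= ev)
    return c + (ev - pv[c]) / (pv[c + 1] - pv[c])

# Hypothetical project status at the end of period 3.
pv_cumulative = [100.0, 250.0, 450.0, 600.0]
ev = 300.0
actual_time = 3

es = earned_schedule(pv_cumulative, ev)
spi_t = es / actual_time                      # time-based schedule index
spi = ev / pv_cumulative[actual_time - 1]     # classic value-based SPI

# Hypothetical figure taken from a model-written status report.
reported_spi_t = 0.80
tolerance = 0.005                             # assumed acceptance band
flagged = abs(reported_spi_t - spi_t) > tolerance
```

    Here the recomputed SPI(t) is 0.75, so a reported 0.80 would be flagged even though both numbers look equally plausible on the page.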

    Simply "using AI" is not a strategy.

    Choosing the right model for the right task, knowing where it fails, and verifying the arithmetic is the strategy.

    Independent research; views my own.



    ------------------------------
    Zine Eddine Zouaghi
    zine.zouaghi@gmail.com
    ------------------------------