Then, in 2023, we saw AI on the rise

Recently, I came across some test results on huggingface.co.

Now, large language models (LLMs) are nearly reaching the benchmark scores of average humans.

On huggingface.co/../open_llm_leaderboard, large language models are tested against well-defined key benchmarks using a unified framework that applies a set of distinct evaluation tasks.
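Under the hood the leaderboard builds on EleutherAI's Language Model Evaluation Harness. To make the idea concrete, here is a minimal sketch of such a unified evaluation loop: one model interface, scored the same way on several distinct tasks. All names (`Example`, `evaluate`, the toy task data) are invented for illustration and are not the harness's actual API.

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str   # input shown to the model
    answer: str   # expected completion

def evaluate(generate, tasks):
    """Score one model (a prompt -> completion callable) on each task."""
    scores = {}
    for name, examples in tasks.items():
        correct = sum(generate(ex.prompt).strip() == ex.answer
                      for ex in examples)
        scores[name] = correct / len(examples)
    return scores

# Toy usage with a dummy "model" that always answers "4":
tasks = {"toy_math": [Example("2 + 2 =", "4"), Example("3 * 3 =", "9")]}
print(evaluate(lambda prompt: "4", tasks))  # {'toy_math': 0.5}
```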

The details of each part of the test are shown below:

The top performers, i.e. the tests showing the biggest improvements, are the maths-related tests:

  • GSM8k (5-shot) – diverse grade-school math word problems that measure a model’s ability to solve multi-step mathematical reasoning problems (what “5-shot” means is sketched right after this list).
  • MMLU (5-shot) – a test to measure a text model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
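Both tests are run “5-shot”: five worked examples (the “shots”) are prepended to the actual question, so the model can imitate the answer format and the multi-step reasoning. Here is a minimal sketch of how such a prompt is built; the demonstration problems are invented toys, not real GSM8k items.

```python
# Five invented demonstrations in GSM8k style: question plus worked answer.
demos = [
    ("Anna has 3 apples and buys 2 more. How many does she have?",
     "3 + 2 = 5. The answer is 5."),
    ("A book costs $4. How much do 3 books cost?",
     "3 * 4 = 12. The answer is 12."),
    ("Ben reads 10 pages a day. How many pages does he read in 7 days?",
     "10 * 7 = 70. The answer is 70."),
    ("A train travels 60 km per hour. How far does it go in 2 hours?",
     "60 * 2 = 120. The answer is 120."),
    ("24 students are split into 4 equal groups. How big is each group?",
     "24 / 4 = 6. The answer is 6."),
]

def build_5_shot_prompt(question: str) -> str:
    """Join the five demonstrations and append the unsolved question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in demos)
    return f"{shots}\n\nQ: {question}\nA:"

print(build_5_shot_prompt("Mia has 8 pencils and gives away 3. How many are left?"))
```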

Let’s see which parts of the test have reached the highest score:

  • Winogrande (5-shot) – an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning (a small scoring sketch follows below).
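The classic Winograd example is a sentence like “The trophy didn’t fit in the suitcase because it was too big”, where resolving “it” takes commonsense rather than surface cues. A common way to score such items is to fill the blank with each candidate and keep the completion the model finds more likely. The sketch below assumes a hypothetical `loglikelihood` scorer and an invented item; it is not Winogrande’s actual data or code.

```python
# One Winogrande-style item (invented for illustration): the blank "_"
# must be resolved with commonsense knowledge.
item = {
    "sentence": "The trophy didn't fit in the suitcase because the _ was too big.",
    "option1": "trophy",
    "option2": "suitcase",
    "answer": "option1",
}

def pick(item, loglikelihood):
    """Fill the blank with each option and return the one whose full
    sentence the model scores as more likely. `loglikelihood` is an
    assumed callable (sentence -> float), standing in for a real model."""
    s1 = item["sentence"].replace("_", item["option1"])
    s2 = item["sentence"].replace("_", item["option2"])
    return "option1" if loglikelihood(s1) >= loglikelihood(s2) else "option2"

# Toy usage with a fake scorer that prefers the correct filling:
print(pick(item, lambda s: 1.0 if "trophy was too big" in s else 0.0))  # option1
```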

The commonsense reasoning test is now at the leading edge, scoring 90 out of a maximum of 94 points!
Congratulations!