โญ Big Code Models Leaderboard


Inspired by the 🤗 Open LLM Leaderboard and 🤗 Open LLM-Perf Leaderboard 🏋️, we compare the performance of base multilingual code generation models on the HumanEval benchmark and MultiPL-E. We also measure throughput and provide information about the models. We only compare open pre-trained multilingual code models that people can start from as base models for their own training.
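
HumanEval and MultiPL-E results are conventionally reported as pass@k scores (typically pass@1): a problem counts as solved if a sampled completion passes its unit tests. As a reference, below is a minimal sketch of the standard unbiased pass@k estimator from the HumanEval paper; the function name and sample counts are illustrative, not values taken from the leaderboard.

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021, HumanEval).

    n: number of completions sampled per problem
    c: number of those completions that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Illustrative example: 200 samples for one problem, 37 of which pass the tests.
print(pass_at_k(n=200, c=37, k=1))  # 0.185 — for k = 1 this reduces to c / n
```

Per-problem estimates are then averaged over all problems in the benchmark to obtain the reported score.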

Warning: This leaderboard was last updated at the release of DeepSeek-Coder-33b-instruct in November 2023. Stronger models may have been released since; see the Submit Results section to submit new evaluation results for the leaderboard. You can also check other code leaderboards such as Can-AI-Code.

โš Filter model types
๐Ÿ”ด
39.58
80.02
52.03
65.13
25.2

Notes:

  • Win Rate represents how often a model outperforms other models in each language, averaged across all languages (one possible computation is sketched after these notes).
  • The scores of instruction-tuned models might be significantly higher on humaneval-python than on other languages: for humaneval-python we use the HumanEval instruction format, while for the other languages we use the base MultiPL-E prompts.
  • For more details, check the 📝 About section.
  • Models with a 🔴 symbol represent external evaluation submissions. This means that we did not verify the results; you can find the author's submission under the Submission PR field in the See All Columns tab.
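
Under one plausible reading of the Win Rate definition above, the metric can be computed from a per-language score table as sketched below. The model names, languages, scores, and the strict-comparison tie handling are illustrative assumptions, not the leaderboard's exact implementation.

```python
# Sketch of a per-language win-rate computation, assuming a table of
# {model: {language: score}}. Models, languages, and scores are made up.
scores = {
    "model_a": {"python": 40.0, "java": 30.0, "cpp": 35.0},
    "model_b": {"python": 45.0, "java": 25.0, "cpp": 38.0},
    "model_c": {"python": 20.0, "java": 28.0, "cpp": 22.0},
}

models = list(scores)
languages = sorted(next(iter(scores.values())))

win_rate = {}
for m in models:
    per_language = []
    for lang in languages:
        others = [o for o in models if o != m]
        # Fraction of the other models this model beats on this language
        # (ties counted as losses here; an assumption, not the official rule).
        wins = sum(scores[m][lang] > scores[o][lang] for o in others)
        per_language.append(wins / len(others))
    # Average the per-language fractions over all languages.
    win_rate[m] = sum(per_language) / len(per_language)

print(win_rate)
# {'model_a': 0.67, 'model_b': 0.67, 'model_c': 0.17} (rounded)
```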