TheVault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Accepted at EMNLP 2023

Key Contributions

  • A dataset of 34 million high-quality code-text (comment) pairs across 10 languages.
  • Nearly 290 million stand-alone functions in 10 languages.
  • A data cleaning pipeline that combines syntactic rule-based filters with CodeBERT as a semantic classifier.
  • Evaluation of the dataset on common coding tasks of code generation, code summarization, and code search.
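
To make the two-stage cleaning idea concrete, here is a minimal Python sketch of how a rule-based filter and a semantic classifier could be chained. It is not the authors' pipeline: the thresholds, the regular expressions, and the names `passes_syntactic_rules`, `clean_pairs`, and `toy_classifier` are all hypothetical, and the classifier is a simple word-overlap placeholder standing in for a fine-tuned CodeBERT model.

```python
import re
from typing import Callable, Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (code, docstring)

# Hypothetical thresholds, chosen only for illustration.
MIN_DOC_TOKENS = 3
MAX_DOC_TOKENS = 256

URL_RE = re.compile(r"https?://\S+")
HTML_TAG_RE = re.compile(r"</?[a-zA-Z][^>]*>")


def passes_syntactic_rules(docstring: str) -> bool:
    """Cheap rule-based checks on the comment text (illustrative rules only)."""
    tokens = docstring.split()
    if not MIN_DOC_TOKENS <= len(tokens) <= MAX_DOC_TOKENS:
        return False
    if URL_RE.search(docstring) or HTML_TAG_RE.search(docstring):
        return False
    # Require mostly alphabetic text so separators and ASCII art are dropped.
    letters = sum(ch.isalpha() for ch in docstring)
    return letters / max(len(docstring), 1) >= 0.5


def clean_pairs(
    pairs: Iterable[Pair],
    is_consistent: Callable[[str, str], bool],
) -> Iterator[Pair]:
    """Keep pairs that pass the rules and that the classifier accepts."""
    for code, docstring in pairs:
        # The cheap rules run first, so the costlier classifier sees fewer pairs.
        if passes_syntactic_rules(docstring) and is_consistent(code, docstring):
            yield code, docstring


def toy_classifier(code: str, docstring: str) -> bool:
    """Placeholder for a learned model: accept if code and text share a word."""
    code_words = set(re.findall(r"[a-z]+", code.lower()))
    doc_words = set(re.findall(r"[a-z]+", docstring.lower()))
    return bool(code_words & doc_words)


samples = [
    ("def add(a, b): return a + b", "Add two numbers and return the sum."),
    ("def f(x): return x", "TODO"),
    ("def g(): pass", "See https://example.com for details on this"),
]
kept = list(clean_pairs(samples, toy_classifier))
print(len(kept))  # 1
```

In this sketch, the second sample is rejected for being too short and the third for containing a URL. Passing the classifier as a plain callable keeps the rule stage testable on its own and lets a real model be swapped in later without changing `clean_pairs`.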

Details

A more detailed report can be found in this blog post: TheVault (proudly created by Dung Nguyen, one of the first authors).