TheVault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

Accepted at EMNLP 2023

Key Contributions

A dataset of 34 million high-quality code-text (comment) pairs across 10 languages.
Nearly 290 million stand-alone functions in 10 languages.
A data cleaning pipeline using both syntactic rule-based filters and CodeBERT as a semantic classifier.
Evaluation of the dataset on common coding tasks: code generation, code summarization, and code search.
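
To give a rough idea of how a two-stage cleaning pipeline like the one listed above might be wired together, here is a minimal sketch. The specific rules (minimum word count, ASCII ratio, URL stripping), their thresholds, and every function name are assumptions made for illustration; they are not the filters used for the dataset. The semantic stage is represented by a pluggable `semantic_ok` callback, which in practice could wrap a CodeBERT-based classifier but here defaults to accepting everything.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Hypothetical rule: matches http(s) URLs so they can be removed from comments.
URL_RE = re.compile(r"https?://\S+")


def strip_urls(comment: str) -> str:
    """Remove URLs from a comment and trim surrounding whitespace."""
    return URL_RE.sub("", comment).strip()


def has_enough_words(comment: str, min_words: int = 3) -> bool:
    """Hypothetical rule: reject very short comments such as 'TODO'."""
    return len(comment.split()) >= min_words


def is_mostly_ascii(comment: str, threshold: float = 0.9) -> bool:
    """Hypothetical rule: reject comments dominated by non-ASCII characters."""
    if not comment:
        return False
    ascii_chars = sum(1 for ch in comment if ord(ch) < 128)
    return ascii_chars / len(comment) >= threshold


def passes_rules(comment: str) -> bool:
    """Apply the illustrative rule-based filters to a single comment."""
    cleaned = strip_urls(comment)
    return has_enough_words(cleaned) and is_mostly_ascii(cleaned)


def clean_pairs(
    pairs: Iterable[Tuple[str, str]],
    semantic_ok: Callable[[str, str], bool] = lambda code, comment: True,
) -> List[Tuple[str, str]]:
    """Keep (code, comment) pairs that pass the rules and the semantic check.

    `semantic_ok` stands in for a learned classifier; the default accepts all.
    """
    return [
        (code, comment)
        for code, comment in pairs
        if passes_rules(comment) and semantic_ok(code, comment)
    ]


if __name__ == "__main__":
    sample = [
        ("def add(a, b): return a + b", "Return the sum of a and b."),
        ("def f(): pass", "TODO"),
        ("def g(): pass", "see http://example.com"),
    ]
    for code, comment in clean_pairs(sample):
        print(comment)
```

Running the cheap rule-based filters before the classifier means the more expensive semantic check is only invoked on pairs that have already survived the syntactic rules.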

Details

A more detailed report can be found in this blog post: TheVault (proudly created by Dung Nguyen, one of the first authors).