TheVault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation
Accepted at EMNLP 2023
Key Contributions
- A dataset of 34 million high-quality code-text (comment) pairs across 10 languages.
- Nearly 290 million stand-alone functions in 10 languages.
- A data cleaning pipeline that combines syntactic rule-based filters with CodeBERT as a semantic classifier.
- Evaluation of the dataset on three common coding tasks: code generation, code summarization, and code search.
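
The two-stage cleaning idea in the contributions above (rule-based filters first, then a semantic check) can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the specific rules in `passes_rules`, the `clean_pairs` helper, and `toy_classifier` are all hypothetical. In particular, `toy_classifier` is a simple word-overlap stand-in for the semantic stage, where the real pipeline would use a CodeBERT-based classifier.

```python
import re
from typing import Callable, Iterable, List, Tuple

Pair = Tuple[str, str]  # (code, docstring)


def passes_rules(docstring: str, min_words: int = 3) -> bool:
    """Hypothetical syntactic rules; the dataset's real filters may differ."""
    text = docstring.strip()
    if len(text.split()) < min_words:          # too short to be descriptive
        return False
    if re.search(r"https?://", text):          # contains a URL
        return False
    if re.match(r"(?i)^(todo|fixme)\b", text):  # developer note, not a description
        return False
    return True


def clean_pairs(
    pairs: Iterable[Pair],
    is_consistent: Callable[[str, str], bool],
) -> List[Pair]:
    """Keep pairs that pass the rule-based stage and then the semantic stage."""
    kept = []
    for code, doc in pairs:
        if not passes_rules(doc):
            continue  # stage 1: cheap rule-based filtering
        if not is_consistent(code, doc):
            continue  # stage 2: semantic check (placeholder for a learned model)
        kept.append((code, doc))
    return kept


def toy_classifier(code: str, docstring: str) -> bool:
    """Stand-in for a learned classifier: true if code and docstring share a word."""
    code_words = set(re.findall(r"[a-z]+", code.lower()))
    doc_words = set(re.findall(r"[a-z]+", docstring.lower()))
    return bool(code_words & doc_words)


if __name__ == "__main__":
    samples = [
        ("def add(a, b): return a + b", "Add two numbers together."),
        ("def f(x): return x", "TODO"),
        ("def mul(a, b): return a * b", "Completely unrelated sentence here."),
    ]
    for code, doc in clean_pairs(samples, toy_classifier):
        print(doc)
```

Running the cheap rules before the semantic check means the more expensive model only sees pairs that already look plausible, which is one reasonable motivation for ordering the stages this way.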
Details
A more detailed report can be found in this blog post: TheVault (proudly created by Dung Nguyen, one of the first authors).