Evaluating Large Language Models Trained on Code
human evaluators. To accurately benchmark our model, we create a dataset of 164 original programming problems with unit tests. These problems assess language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. We release this data along with an evaluation framework at