Commit 43f9efa

Update tools/listing.yaml
1 parent 4bb4658 commit 43f9efa

3 files changed: +24 −3 lines changed

README.md (9 additions, 1 deletion)

@@ -300,6 +300,14 @@ The questions were generated by GPT-4 based on the "Computer Systems Security: P
 inspect eval inspect_evals/boolq
 ```

+- ### [DocVQA: A Dataset for VQA on Document Images](src/inspect_evals/docvqa)
+DocVQA is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.
+<sub><sup>Contributed by: [@evanmiller-anthropic](https://github.com/evanmiller-anthropic)</sub></sup>
+
+```bash
+inspect eval inspect_evals/docvqa
+```
+
 - ### [DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs](src/inspect_evals/drop)
 Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
 <sub><sup>Contributed by: [@xeon27](https://github.com/xeon27)</sub></sup>

@@ -443,4 +451,4 @@ The questions were generated by GPT-4 based on the "Computer Systems Security: P
 inspect eval inspect_evals/agie_lsat_lr
 ```

-<!-- /Eval Listing: Automatically Generated -->
+<!-- /Eval Listing: Automatically Generated -->

src/inspect_evals/docvqa/README.md (5 additions, 2 deletions)

@@ -3,20 +3,22 @@
 [DocVQA](https://arxiv.org/abs/2007.00398) is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.

 <!-- Contributors: Automatically Generated -->
-Contributed by [@xeon27](https://github.com/xeon27)
+Contributed by [@evanmiller-anthropic](https://github.com/evanmiller-anthropic)
 <!-- /Contributors: Automatically Generated -->


 <!-- Usage: Automatically Generated -->
 ## Usage

 First, install the `inspect_ai` and `inspect_evals` Python packages with:
+
 ```bash
 pip install inspect_ai
 pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
 ```

 Then, evaluate against one or more models with:
+
 ```bash
 inspect eval inspect_evals/docvqa --model openai/gpt-4o
 ```

@@ -39,6 +41,7 @@ ANTHROPIC_API_KEY=<anthropic-api-key>
 ## Options

 You can control a variety of options from the command line. For example:
+
 ```bash
 inspect eval inspect_evals/docvqa --limit 10
 inspect eval inspect_evals/docvqa --max-connections 10

@@ -67,4 +70,4 @@ The model is tasked to answer each question by referring to the image. The promp

 DocVQA computes the Average Normalized Levenstein Similarity:

-[Average Normalized Levenstein Similarity definiion](https://user-images.githubusercontent.com/48327001/195277520-b1ef2be2-c4d7-417b-91ec-5fda8aa6db06.png)
+[Average Normalized Levenstein Similarity definition](https://user-images.githubusercontent.com/48327001/195277520-b1ef2be2-c4d7-417b-91ec-5fda8aa6db06.png)
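The README above names the Average Normalized Levenshtein Similarity (ANLS) metric but only links an image of its definition. As a rough illustration only — not part of this commit, and not the code inspect_evals actually uses — the standard ANLS formulation with threshold τ = 0.5 can be sketched in Python; details such as case handling may differ from the real implementation:

```python
# Hypothetical ANLS sketch; NOT the inspect_evals implementation.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    """Score each prediction against every reference answer with
    1 - NL (normalized Levenshtein distance), keep the best score per
    question, zero out scores whose distance reaches tau, then average."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            nl = levenshtein(pred.lower(), ref.lower()) / max(len(pred), len(ref), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

The thresholding is the distinctive part of ANLS: near-miss answers earn partial credit in proportion to edit distance, but once the normalized distance reaches 0.5 the answer scores zero, so garbage output is not rewarded.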

tools/listing.yaml (10 additions, 0 deletions)

@@ -278,6 +278,16 @@
   contributors: ["seddy-aisi"]
   tasks: ["boolq"]

+- title: "DocVQA: A Dataset for VQA on Document Images"
+  description: |
+    DocVQA is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.
+  path: src/inspect_evals/docvqa
+  arxiv: https://arxiv.org/abs/2007.00398
+  group: Reasoning
+  tags: ["Multimodal"]
+  contributors: ["evanmiller-anthropic"]
+  tasks: ["docvqa"]
+
 - title: "DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs"
   description: |
     Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
