Commit 43f9efa

Update tools/listing.yaml
1 parent 4bb4658 commit 43f9efa

3 files changed: +24 −3 lines changed

README.md (9 additions, 1 deletion)

@@ -300,6 +300,14 @@ The questions were generated by GPT-4 based on the "Computer Systems Security: P
 inspect eval inspect_evals/boolq
 ```

+- ### [DocVQA: A Dataset for VQA on Document Images](src/inspect_evals/docvqa)
+DocVQA is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.
+<sub><sup>Contributed by: [@evanmiller-anthropic](https://github.com/evanmiller-anthropic)</sub></sup>
+
+```bash
+inspect eval inspect_evals/docvqa
+```
+
 - ### [DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs](src/inspect_evals/drop)
 Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
 <sub><sup>Contributed by: [@xeon27](https://github.com/xeon27)</sub></sup>

@@ -443,4 +451,4 @@ The questions were generated by GPT-4 based on the "Computer Systems Security: P
 inspect eval inspect_evals/agie_lsat_lr
 ```

-<!-- /Eval Listing: Automatically Generated -->
+<!-- /Eval Listing: Automatically Generated -->

src/inspect_evals/docvqa/README.md (5 additions, 2 deletions)

@@ -3,20 +3,22 @@
 [DocVQA](https://arxiv.org/abs/2007.00398) is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.

 <!-- Contributors: Automatically Generated -->
-Contributed by [@xeon27](https://github.com/xeon27)
+Contributed by [@evanmiller-anthropic](https://github.com/evanmiller-anthropic)
 <!-- /Contributors: Automatically Generated -->


 <!-- Usage: Automatically Generated -->
 ## Usage

 First, install the `inspect_ai` and `inspect_evals` Python packages with:
+
 ```bash
 pip install inspect_ai
 pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
 ```

 Then, evaluate against one or more models with:
+
 ```bash
 inspect eval inspect_evals/docvqa --model openai/gpt-4o
 ```

@@ -39,6 +41,7 @@ ANTHROPIC_API_KEY=<anthropic-api-key>
 ## Options

 You can control a variety of options from the command line. For example:
+
 ```bash
 inspect eval inspect_evals/docvqa --limit 10
 inspect eval inspect_evals/docvqa --max-connections 10

@@ -67,4 +70,4 @@ The model is tasked to answer each question by referring to the image. The promp

 DocVQA computes the Average Normalized Levenstein Similarity:

-[Average Normalized Levenstein Similarity definiion](https://user-images.githubusercontent.com/48327001/195277520-b1ef2be2-c4d7-417b-91ec-5fda8aa6db06.png)
+[Average Normalized Levenstein Similarity definition](https://user-images.githubusercontent.com/48327001/195277520-b1ef2be2-c4d7-417b-91ec-5fda8aa6db06.png)
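The README above names the Average Normalized Levenshtein Similarity (ANLS) metric but only links an image of its definition. As a rough illustration only — not part of this commit, and not the code inspect_evals actually uses — the standard ANLS formulation with threshold τ = 0.5 can be sketched in Python; details such as case handling may differ from the real implementation:

```python
# Hypothetical ANLS sketch; NOT the inspect_evals implementation.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(predictions: list[str], references: list[list[str]], tau: float = 0.5) -> float:
    """Score each prediction against every reference answer with
    1 - NL (normalized Levenshtein distance), keep the best score per
    question, zero out scores whose distance reaches tau, then average."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            nl = levenshtein(pred.lower(), ref.lower()) / max(len(pred), len(ref), 1)
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)
```

The thresholding is the distinctive part of ANLS: near-miss answers earn partial credit in proportion to edit distance, but once the normalized distance reaches 0.5 the answer scores zero, so garbage output is not rewarded.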

tools/listing.yaml (10 additions, 0 deletions)

@@ -278,6 +278,16 @@
   contributors: ["seddy-aisi"]
   tasks: ["boolq"]

+- title: "DocVQA: A Dataset for VQA on Document Images"
+  description: |
+    DocVQA is a Visual Question Answering benchmark that consists of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.
+  path: src/inspect_evals/docvqa
+  arxiv: https://arxiv.org/abs/2007.00398
+  group: Reasoning
+  tags: ["Multimodal"]
+  contributors: ["evanmiller-anthropic"]
+  tasks: ["docvqa"]
+
 - title: "DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs"
   description: |
     Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
