# Synthetic dataset created with Claude 3.5 Haiku

Synthetic dataset of text2cypher over 7 different graph schemas.

Questions were generated using GPT-4-turbo, and the corresponding Cypher statements were generated with `claude-3-5-haiku-20241022` via AWS Bedrock.
The demo database is available at:

```
URI: neo4j+s://demo.neo4jlabs.com
username: name of the database, for example 'movies'
password: name of the database, for example 'movies'
database: name of the database, for example 'movies'
```
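A minimal connection sketch using the official `neo4j` Python driver (5.x, for `execute_query`). The `movies` database and the count query are illustrative choices, not part of the dataset pipeline:

```python
def demo_credentials(database: str) -> tuple[str, str, str]:
    """On the demo server, username, password, and database
    name are all the same string (e.g. 'movies')."""
    return database, database, database


if __name__ == "__main__":
    # Requires `pip install neo4j` and network access to the demo server.
    from neo4j import GraphDatabase

    user, password, database = demo_credentials("movies")
    with GraphDatabase.driver("neo4j+s://demo.neo4jlabs.com",
                              auth=(user, password)) as driver:
        records, _, _ = driver.execute_query(
            "MATCH (n) RETURN count(n) AS nodes", database_=database)
        print(records[0]["nodes"])
```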

Notebooks:

* `anthropic_text2cypher_haiku35.ipynb`: Generate Cypher statements and validate them by checking whether they return any values, contain syntax errors, or time out.

The dataset is available in `text2cypher_haiku35.csv`, with the following columns:

* `question`: Natural language question
* `cypher`: Corresponding Cypher statement based on the provided question
* `type`: Type of question, see `synthetic_gpt4turbo_demodbs/generate_text2cypher_questions.ipynb` for more information
* `database`: Database that the question is aimed at
* `syntax_error`: Does the Cypher statement produce a Cypher syntax error
* `timeout`: Does the Cypher statement take more than 10 seconds to complete
* `returns_results`: Does the Cypher statement return non-null results
* `false_schema`: Does the Cypher statement use parts of the graph schema (node labels or properties) that aren't present in the graph
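Given the flag columns above, usable question-Cypher pairs can be filtered with the standard library alone. A sketch, assuming the flags are serialized as the strings `True`/`False` in the CSV:

```python
import csv

FLAG_COLUMNS = ("syntax_error", "timeout", "returns_results", "false_schema")


def is_usable(row: dict) -> bool:
    """A pair is usable when the Cypher is syntactically valid,
    completes in time, returns data, and sticks to the real schema."""
    # Assumption: boolean flags are stored as the strings "True"/"False".
    flags = {col: row[col] == "True" for col in FLAG_COLUMNS}
    return (not flags["syntax_error"]
            and not flags["timeout"]
            and flags["returns_results"]
            and not flags["false_schema"])


if __name__ == "__main__":
    with open("text2cypher_haiku35.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    usable = [r for r in rows if is_usable(r)]
    print(f"{len(usable)}/{len(rows)} usable question-Cypher pairs")
```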

## Potential Tasks and Uses of the Dataset

This synthetic dataset can be utilized for various research and development tasks, including:

* Evaluating Syntax Errors: Analyze and categorize the types of syntax errors generated by the LLM to improve error handling and debugging capabilities in Cypher statement generation.
* Detecting Schema Hallucination: Evaluate instances when the LLM hallucinates graph schema elements that do not exist in the database, aiding in the improvement of schema-aware model training.
* Benchmarking LLM Performance: Use the dataset to evaluate the performance of different LLMs in generating valid Cypher queries, providing insights into model capabilities and limitations.
* Finetuning LLMs: Leverage the dataset for finetuning LLMs on domain-specific languages like Cypher to enhance their accuracy and efficiency in generating database queries.
* Prompt engineering: Determine which prompt produces the most accurate Cypher statements.
* Comparing models: Compare accuracy and limitations between Haiku 3.5 and other models (Opus, GPT-4 Turbo, etc.).
* Cost-effective text2cypher generation: Haiku 3.5 provides a budget-friendly option for generating large-scale synthetic datasets.
* Performance analysis: Compare how smaller and larger models perform on text2cypher tasks.
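As a toy illustration of the schema-hallucination task above, a naive extractor can flag node labels that are absent from a known schema. This is a hypothetical helper for exploration, not how the `false_schema` column was produced:

```python
import re


def unknown_labels(cypher: str, schema_labels: set[str]) -> set[str]:
    """Extract node labels from patterns like (n:Label) and report any
    not present in the known schema. Naive: ignores relationship types,
    multi-label nodes beyond the first label, and label expressions."""
    found = set(re.findall(r"\(\s*\w*\s*:\s*`?(\w+)`?", cypher))
    return found - schema_labels
```

For example, against a schema containing only `Movie` and `Person`, a query matching `(m:Film)` would be flagged with `{"Film"}`.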