# Synthetic dataset created with Claude 3.5 Haiku

Synthetic dataset of text2cypher over 7 different graph schemas.

Questions were generated using GPT-4-turbo, and the corresponding Cypher statements were generated with `claude-3-5-haiku-20241022` via AWS Bedrock.
The demo database is available at:

```
URI: neo4j+s://demo.neo4jlabs.com
username: name of the database, for example 'movies'
password: name of the database, for example 'movies'
database: name of the database, for example 'movies'
```
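A minimal connection sketch using the official `neo4j` Python driver (5.x, for `execute_query`). The `movies` database and the count query are illustrative choices, not part of the dataset pipeline:

```python
def demo_credentials(database: str) -> tuple[str, str, str]:
    """On the demo server, username, password, and database
    name are all the same string (e.g. 'movies')."""
    return database, database, database


if __name__ == "__main__":
    # Requires `pip install neo4j` and network access to the demo server.
    from neo4j import GraphDatabase

    user, password, database = demo_credentials("movies")
    with GraphDatabase.driver("neo4j+s://demo.neo4jlabs.com",
                              auth=(user, password)) as driver:
        records, _, _ = driver.execute_query(
            "MATCH (n) RETURN count(n) AS nodes", database_=database)
        print(records[0]["nodes"])
```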

Notebooks:

* `anthropic_text2cypher_haiku35.ipynb`: Generate Cypher statements and validate them by checking whether they return any values, contain syntax errors, or time out.

The dataset is available in `text2cypher_haiku35.csv`, with the following columns:

* `question`: Natural language question
* `cypher`: Corresponding Cypher statement based on the provided question
* `type`: Type of question, see `synthetic_gpt4turbo_demodbs/generate_text2cypher_questions.ipynb` for more information
* `database`: Database that the question is aimed at
* `syntax_error`: Does the Cypher statement produce a Cypher syntax error
* `timeout`: Does the Cypher statement take more than 10 seconds to complete
* `returns_results`: Does the Cypher statement return non-null results
* `false_schema`: Does the Cypher statement use parts of the graph schema (node labels or properties) that aren't present in the graph
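Given the flag columns above, usable question-Cypher pairs can be filtered with the standard library alone. A sketch, assuming the flags are serialized as the strings `True`/`False` in the CSV:

```python
import csv

FLAG_COLUMNS = ("syntax_error", "timeout", "returns_results", "false_schema")


def is_usable(row: dict) -> bool:
    """A pair is usable when the Cypher is syntactically valid,
    completes in time, returns data, and sticks to the real schema."""
    # Assumption: boolean flags are stored as the strings "True"/"False".
    flags = {col: row[col] == "True" for col in FLAG_COLUMNS}
    return (not flags["syntax_error"]
            and not flags["timeout"]
            and flags["returns_results"]
            and not flags["false_schema"])


if __name__ == "__main__":
    with open("text2cypher_haiku35.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    usable = [r for r in rows if is_usable(r)]
    print(f"{len(usable)}/{len(rows)} usable question-Cypher pairs")
```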

## Potential Tasks and Uses of the Dataset

This synthetic dataset can be utilized for various research and development tasks, including:

* Evaluating Syntax Errors: Analyze and categorize the types of syntax errors generated by the LLM to improve error handling and debugging capabilities in Cypher statement generation.
* Detecting Schema Hallucination: Evaluate instances when the LLM hallucinates graph schema elements that do not exist in the database, aiding in the improvement of schema-aware model training.
* Benchmarking LLM Performance: Use the dataset to evaluate the performance of different LLMs in generating valid Cypher queries, providing insights into model capabilities and limitations.
* Finetuning LLMs: Leverage the dataset for finetuning LLMs on domain-specific languages like Cypher to enhance their accuracy and efficiency in generating database queries.
* Prompt engineering: Determine which prompt produces the most accurate Cypher statements.
* Comparing models: Compare accuracy and limitations between Haiku 3.5 and other models (Opus, GPT-4 Turbo, etc.).
* Cost-effective text2cypher generation: Haiku 3.5 provides a budget-friendly option for generating large-scale synthetic datasets.
* Performance analysis: Compare how smaller and larger models perform on text2cypher tasks.
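As a toy illustration of the schema-hallucination task above, a naive extractor can flag node labels that are absent from a known schema. This is a hypothetical helper for exploration, not how the `false_schema` column was produced:

```python
import re


def unknown_labels(cypher: str, schema_labels: set[str]) -> set[str]:
    """Extract node labels from patterns like (n:Label) and report any
    not present in the known schema. Naive: ignores relationship types,
    multi-label nodes beyond the first label, and label expressions."""
    found = set(re.findall(r"\(\s*\w*\s*:\s*`?(\w+)`?", cypher))
    return found - schema_labels
```

For example, against a schema containing only `Movie` and `Person`, a query matching `(m:Film)` would be flagged with `{"Film"}`.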