| 1 | +# How to Create Metadata Curation Workflows |
| 2 | + |
| 3 | +This guide shows you how to set up metadata curation workflows in Synapse using the curator extension. You'll learn to find appropriate schemas and create curation tasks for your research data. |
| 4 | + |
| 5 | +## What you'll accomplish |
| 6 | + |
| 7 | +By following this guide, you will: |
| 8 | + |
| 9 | +- Find and select the right JSON schema for your data type |
| 10 | +- Create a metadata curation workflow with automatic validation |
| 11 | +- Set up either file-based or record-based metadata collection |
| 12 | +- Configure curation tasks that guide collaborators through metadata entry |
| 13 | + |
| 14 | +## Prerequisites |
| 15 | + |
| 16 | +- A Synapse account with project creation permissions |
| 17 | +- A Python environment with synapseclient and the `curator` extension installed (i.e., `pip install --upgrade "synapseclient[curator]"`) |
| 18 | +- An existing Synapse project and folder where you want to manage metadata |
| 19 | +- A JSON Schema registered in Synapse (many schemas are already available for Sage-affiliated projects, or you can register your own by following the [JSON Schema tutorial](../../../tutorials/python/json_schema.md)) |
| 20 | + |
| 21 | +## Step 1: Authenticate and import required functions |
| 22 | + |
| 23 | +```python |
| 24 | +from synapseclient.extensions.curator import ( |
| 25 | + create_record_based_metadata_task, |
| 26 | + create_file_based_metadata_task, |
| 27 | + query_schema_registry |
| 28 | +) |
| 29 | +from synapseclient import Synapse |
| 30 | + |
| 31 | +syn = Synapse() |
| 32 | +syn.login() |
| 33 | +``` |
| 34 | + |
| 35 | +## Step 2: Find the right schema for your data |
| 36 | + |
| 37 | +Before creating a curation task, identify which JSON schema matches your data type. Many schemas are already registered in Synapse for Sage-affiliated projects. The schema registry contains validated schemas organized by data coordination center (DCC) and data type. |
| 38 | + |
| 39 | +**If you need to register your own schema**, follow the [JSON Schema tutorial](../../../tutorials/python/json_schema.md) to understand the registration process. |
| 40 | + |
| 41 | +```python |
| 42 | +# Find the latest schema for your specific data type |
| 43 | +schema_uri = query_schema_registry( |
| 44 | + synapse_client=syn, |
| 45 | + dcc="ad", # Your data coordination center (DCC) code; see the `syn69735275` table if you do not know it |
| 46 | + datatype="IndividualAnimalMetadataTemplate" # Your specific data type |
| 47 | +) |
| 48 | + |
| 49 | +print("Latest schema URI:", schema_uri) |
| 50 | +``` |
| 51 | + |
| 52 | +**When to use this approach:** You know your DCC and data type, you want the most current schema version, and the schema has already been registered in the schema registry table at <https://www.synapse.org/Synapse:syn69735275/tables/>. |
| 53 | + |
| 54 | +**Alternative - browse available schemas:** |
| 55 | +```python |
| 56 | +# Get all versions to see what's available |
| 57 | +all_schemas = query_schema_registry( |
| 58 | + synapse_client=syn, |
| 59 | + dcc="ad", |
| 60 | + datatype="IndividualAnimalMetadataTemplate", |
| 61 | + return_latest_only=False |
| 62 | +) |
| 63 | +``` |
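The two calls above can hand back differently shaped results, so downstream code may want to normalize them. The sketch below assumes, without confirmation from this guide, that `query_schema_registry` returns a single URI string when only the latest version is requested, a list of URI strings otherwise, and `None` when nothing matches. `as_uri_list` is a hypothetical helper, not part of synapseclient, and the URIs are made-up placeholders.

```python
from typing import List, Union


def as_uri_list(result: Union[None, str, List[str]]) -> List[str]:
    """Normalize a registry query result to a list of URI strings.

    Hypothetical helper: assumes the result is None, one URI string,
    or a list of URI strings.
    """
    if result is None:
        return []
    if isinstance(result, str):
        return [result]
    return list(result)


# Made-up placeholder URIs, used only to exercise the helper.
example_single = "org.example-Template-1.0.0"
example_many = ["org.example-Template-1.1.0", "org.example-Template-1.0.0"]

for uri in as_uri_list(example_many):
    print("Available schema version:", uri)
```

With a helper like this, the same loop can consume `schema_uri` or `all_schemas` without checking which mode produced it.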
| 64 | + |
| 65 | +## Step 3: Choose your metadata workflow type |
| 66 | + |
| 67 | +### Option A: Record-based metadata |
| 68 | + |
| 69 | +Use this when metadata is normalized into structured records, stored in a RecordSet rather than on individual files, to eliminate duplication and ensure consistency. |
| 70 | + |
| 71 | +```python |
| 72 | +record_set, curation_task, data_grid = create_record_based_metadata_task( |
| 73 | + synapse_client=syn, |
| 74 | + project_id="syn123456789", # Your project ID |
| 75 | + folder_id="syn987654321", # Folder where files are stored |
| 76 | + record_set_name="AnimalMetadata_Records", |
| 77 | + record_set_description="Centralized metadata for animal study data", |
| 78 | + curation_task_name="AnimalMetadata_Curation", # Must be unique within the project |
| 79 | + upsert_keys=["StudyKey"], # Fields that uniquely identify records |
| 80 | + instructions="Complete all required fields according to the schema. Use StudyKey to link records to your data files.", |
| 81 | + schema_uri=schema_uri, # Schema found in Step 2 |
| 82 | + bind_schema_to_record_set=True |
| 83 | +) |
| 84 | + |
| 85 | +print(f"Created RecordSet: {record_set.id}") |
| 86 | +print(f"Created CurationTask: {curation_task.task_id}") |
| 87 | +``` |
| 88 | + |
| 89 | +**What this creates:** |
| 90 | + |
| 91 | +- A RecordSet where metadata is stored as structured records (like a spreadsheet) |
| 92 | +- A CurationTask that guides users through completing the metadata |
| 93 | +- Automatic schema binding for validation |
| 94 | +- A data grid interface for easy metadata entry |
| 95 | + |
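To make the role of `upsert_keys` concrete, here is a minimal pure-Python sketch of upsert semantics: a record whose key fields match an existing record updates it, and any other record is appended. This only illustrates the concept; how Synapse applies upsert keys internally is not described in this guide. The `upsert` function and every field name other than `StudyKey` are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, str]


def upsert(records: List[Record], incoming: Record, keys: List[str]) -> List[Record]:
    """Return a new list in which `incoming` updates the record that shares
    its key values, or is appended when no record matches."""

    def key_of(rec: Record) -> Tuple[Optional[str], ...]:
        return tuple(rec.get(k) for k in keys)

    target = key_of(incoming)
    result: List[Record] = []
    matched = False
    for rec in records:
        if key_of(rec) == target:
            # Merge: fields in `incoming` override fields in the stored record.
            result.append({**rec, **incoming})
            matched = True
        else:
            result.append(rec)
    if not matched:
        result.append(dict(incoming))
    return result


records = [{"StudyKey": "S1", "species": "mouse"}]
records = upsert(records, {"StudyKey": "S1", "sex": "female"}, ["StudyKey"])
records = upsert(records, {"StudyKey": "S2", "species": "rat"}, ["StudyKey"])
print(records)
```

Choosing a key that is stable and unique per record (here `StudyKey`) is what keeps repeated submissions from creating duplicates.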
| 96 | +### Option B: File-based metadata (for unique per-file metadata) |
| 97 | + |
| 98 | +Use this when metadata describes individual data files and is stored as annotations directly on each file. |
| 99 | + |
| 100 | +```python |
| 101 | +entity_view_id, task_id = create_file_based_metadata_task( |
| 102 | + synapse_client=syn, |
| 103 | + folder_id="syn987654321", # Folder containing your data files |
| 104 | + curation_task_name="FileMetadata_Curation", # Must be unique within the project |
| 105 | + instructions="Annotate each file with metadata according to the schema requirements.", |
| 106 | + attach_wiki=True, # Creates a wiki in the folder with the entity view |
| 107 | + entity_view_name="Animal Study Files View", |
| 108 | + schema_uri=schema_uri # Schema found in Step 2 |
| 109 | +) |
| 110 | + |
| 111 | +print(f"Created EntityView: {entity_view_id}") |
| 112 | +print(f"Created CurationTask: {task_id}") |
| 113 | +``` |
| 114 | + |
| 115 | +**What this creates:** |
| 116 | + |
| 117 | +- An EntityView that displays all files in the folder |
| 118 | +- A CurationTask for guided metadata entry |
| 119 | +- Automatic schema binding to the folder for validation |
| 120 | +- Optional wiki attached to the folder |
| 121 | + |
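As a rough illustration of what schema validation means for per-file annotations, the sketch below checks only the `required` keyword of a JSON Schema against each file's annotations. It is a local stand-in for intuition, not the validation Synapse performs, which covers the full schema. The schema, file names, annotation values, and the `missing_required` helper are all hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical, heavily simplified schema; real registered schemas are richer.
EXAMPLE_SCHEMA: Dict[str, Any] = {
    "type": "object",
    "required": ["StudyKey", "species"],
    "properties": {
        "StudyKey": {"type": "string"},
        "species": {"type": "string"},
        "sex": {"type": "string"},
    },
}


def missing_required(annotations: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """List required fields that are absent or empty in `annotations`."""
    return [
        name
        for name in schema.get("required", [])
        if annotations.get(name) in (None, "", [])
    ]


# Made-up annotations keyed by made-up file names.
file_annotations = {
    "sample_a.csv": {"StudyKey": "S1", "species": "mouse"},
    "sample_b.csv": {"StudyKey": "S2", "species": ""},
}

report = {
    name: missing_required(annotations, EXAMPLE_SCHEMA)
    for name, annotations in file_annotations.items()
}
print(report)
```

A file with an empty list in such a report has every required field filled in; any other entry names the fields a curator still needs to complete.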
| 122 | +## Complete example script |
| 123 | + |
| 124 | +Here's the full script that demonstrates both workflow types: |
| 125 | + |
| 126 | +```python |
| 127 | +from pprint import pprint |
| 128 | +from synapseclient.extensions.curator import ( |
| 129 | + create_record_based_metadata_task, |
| 130 | + create_file_based_metadata_task, |
| 131 | + query_schema_registry |
| 132 | +) |
| 133 | +from synapseclient import Synapse |
| 134 | + |
| 135 | +# Step 1: Authenticate |
| 136 | +syn = Synapse() |
| 137 | +syn.login() |
| 138 | + |
| 139 | +# Step 2: Find schema |
| 140 | +schema_uri = query_schema_registry( |
| 141 | + synapse_client=syn, |
| 142 | + dcc="ad", |
| 143 | + datatype="IndividualAnimalMetadataTemplate" |
| 144 | +) |
| 145 | +print("Using schema:", schema_uri) |
| 146 | + |
| 147 | +# Step 3A: Create record-based workflow |
| 148 | +record_set, curation_task, data_grid = create_record_based_metadata_task( |
| 149 | + synapse_client=syn, |
| 150 | + project_id="syn123456789", |
| 151 | + folder_id="syn987654321", |
| 152 | + record_set_name="AnimalMetadata_Records", |
| 153 | + record_set_description="Centralized animal study metadata", |
| 154 | + curation_task_name="AnimalMetadata_Curation", |
| 155 | + upsert_keys=["StudyKey"], |
| 156 | + instructions="Complete metadata for all study animals using StudyKey to link records to data files.", |
| 157 | + schema_uri=schema_uri, |
| 158 | + bind_schema_to_record_set=True |
| 159 | +) |
| 160 | + |
| 161 | +print("Record-based workflow created:") |
| 162 | +print(f" RecordSet: {record_set.id}") |
| 163 | +print(f" CurationTask: {curation_task.task_id}") |
| 164 | + |
| 165 | +# Step 3B: Create file-based workflow |
| 166 | +entity_view_id, task_id = create_file_based_metadata_task( |
| 167 | + synapse_client=syn, |
| 168 | + folder_id="syn987654321", |
| 169 | + curation_task_name="FileMetadata_Curation", |
| 170 | + instructions="Annotate each file with complete metadata according to schema.", |
| 171 | + attach_wiki=True, |
| 172 | + entity_view_name="Animal Study Files View", |
| 173 | + schema_uri=schema_uri |
| 174 | +) |
| 175 | + |
| 176 | +print("File-based workflow created:") |
| 177 | +print(f" EntityView: {entity_view_id}") |
| 178 | +print(f" CurationTask: {task_id}") |
| 179 | +``` |
| 180 | + |
| 181 | +## Additional utilities |
| 182 | + |
| 183 | +### Validate schema binding on folders |
| 184 | + |
| 185 | +Use this script to validate the items contained in a folder against the schema bound to that folder: |
| 186 | + |
| 187 | +```python |
| 188 | +from synapseclient import Synapse |
| 189 | +from synapseclient.models import Folder |
| 190 | + |
| 191 | +# The Synapse ID of the Folder whose contents you want to validate. A JSON Schema should already be bound to this Folder. |
| 192 | +FOLDER_ID = "" |
| 193 | + |
| 194 | +syn = Synapse() |
| 195 | +syn.login() |
| 196 | + |
| 197 | +folder = Folder(id=FOLDER_ID).get() |
| 198 | +schema_validation = folder.validate_schema() |
| 199 | + |
| 200 | +print(f"Schema validation result for folder {FOLDER_ID}: {schema_validation}") |
| 201 | +``` |
| 202 | + |
| 203 | +### List existing curation tasks |
| 204 | + |
| 205 | +Use this script to see all curation tasks in a project: |
| 206 | + |
| 207 | +```python |
| 208 | +from pprint import pprint |
| 209 | +from synapseclient import Synapse |
| 210 | +from synapseclient.models.curation import CurationTask |
| 211 | + |
| 212 | +PROJECT_ID = "" # The Synapse ID of the project to list tasks from |
| 213 | + |
| 214 | +syn = Synapse() |
| 215 | +syn.login() |
| 216 | + |
| 217 | +for curation_task in CurationTask.list( |
| 218 | + project_id=PROJECT_ID |
| 219 | +): |
| 220 | + pprint(curation_task) |
| 221 | +``` |
| 222 | + |
| 223 | +## References |
| 224 | + |
| 225 | +### API Documentation |
| 226 | + |
| 227 | +- [query_schema_registry][synapseclient.extensions.curator.query_schema_registry] - Search for schemas in the registry |
| 228 | +- [create_record_based_metadata_task][synapseclient.extensions.curator.create_record_based_metadata_task] - Create RecordSet-based curation workflows |
| 229 | +- [create_file_based_metadata_task][synapseclient.extensions.curator.create_file_based_metadata_task] - Create EntityView-based curation workflows |
| 230 | +- [Folder.bind_schema][synapseclient.models.Folder.bind_schema] - Bind schemas to folders |
| 231 | +- [Folder.validate_schema][synapseclient.models.Folder.validate_schema] - Validate folder schema compliance |
| 232 | +- [CurationTask.list][synapseclient.models.CurationTask.list] - List curation tasks in a project |
| 233 | + |
| 234 | +### Related Documentation |
| 235 | + |
| 236 | +- [JSON Schema Tutorial](../../../tutorials/python/json_schema.md) - Learn how to register schemas |
| 237 | +- [Schema Registry](https://www.synapse.org/Synapse:syn69735275/tables/) - Browse available schemas |