Commit cbf384b

[SYNPY-1668] Creating curator extensions directory (#1263)
* Creating an extensions directory in SYNPY to share out the curator extension, allowing binding of folders, creation of CuratorTasks, EntityViews, and RecordSets --------- Co-authored-by: Andrew Lamb <andrewelamb@gmail.com>
1 parent 781aa36 commit cbf384b

File tree

12 files changed

+3404
-1
lines changed

Lines changed: 237 additions & 0 deletions
@@ -0,0 +1,237 @@
1+
# How to Create Metadata Curation Workflows
2+
3+
This guide shows you how to set up a metadata curation workflow in Synapse using the curator extension. You'll learn to find an appropriate schema and create curation tasks for your research data.
4+
5+
## What you'll accomplish
6+
7+
By following this guide, you will:
8+
9+
- Find and select the right JSON schema for your data type
10+
- Create a metadata curation workflow with automatic validation
11+
- Set up either file-based or record-based metadata collection
12+
- Configure curation tasks that guide collaborators through metadata entry
13+
14+
## Prerequisites
15+
16+
- A Synapse account with project creation permissions
17+
- A Python environment with `synapseclient` and the `curator` extension installed (i.e., `pip install --upgrade "synapseclient[curator]"`)
18+
- An existing Synapse project and folder where you want to manage metadata
19+
- A JSON Schema registered in Synapse (many schemas are already available for Sage-affiliated projects, or you can register your own by following the [JSON Schema tutorial](../../../tutorials/python/json_schema.md))
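
Before running the steps below, it can help to confirm that the environment actually has what the prerequisites list. The following is a minimal sketch using only the standard library. The helper name `module_available` is hypothetical, and checking for `pandas` assumes that the `curator` extra only pulls in pandas, as the `setup.cfg` change in this commit suggests.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if the module `name` can be located in this environment."""
    try:
        spec = importlib.util.find_spec(name)
    except ModuleNotFoundError:
        # find_spec imports parent packages of a dotted name; a missing
        # parent (for example, synapseclient itself) raises this error.
        return False
    return spec is not None


# Modules assumed to be needed by this guide.
for required in ("synapseclient.extensions.curator", "pandas"):
    status = "found" if module_available(required) else "MISSING"
    print(f"{required}: {status}")
```

If either module is reported as missing, re-running the `pip install` command from the prerequisites is the likely fix.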
20+
21+
## Step 1: Authenticate and import required functions
22+
23+
```python
from synapseclient.extensions.curator import (
    create_record_based_metadata_task,
    create_file_based_metadata_task,
    query_schema_registry,
)
from synapseclient import Synapse

syn = Synapse()
syn.login()
```
34+
35+
## Step 2: Find the right schema for your data
36+
37+
Before creating a curation task, identify which JSON schema matches your data type. Many schemas are already registered in Synapse for Sage-affiliated projects. The schema registry contains validated schemas organized by data coordination center (DCC) and data type.
38+
39+
**If you need to register your own schema**, follow the [JSON Schema tutorial](../../../tutorials/python/json_schema.md) to understand the registration process.
40+
41+
```python
# Find the latest schema for your specific data type
schema_uri = query_schema_registry(
    synapse_client=syn,
    dcc="ad",  # Your data coordination center code; see the `syn69735275` table if you do not know it
    datatype="IndividualAnimalMetadataTemplate",  # Your specific data type
)

print("Latest schema URI:", schema_uri)
```
51+
52+
**When to use this approach:** You know your DCC and data type, you want the most current schema version, and the schema has already been registered in the registry table at <https://www.synapse.org/Synapse:syn69735275/tables/>.
53+
54+
**Alternative - browse available schemas:**
55+
```python
# Get all versions to see what's available
all_schemas = query_schema_registry(
    synapse_client=syn,
    dcc="ad",
    datatype="IndividualAnimalMetadataTemplate",
    return_latest_only=False,
)
```
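
When browsing all versions, you may want to pick one yourself rather than rely on the latest. The sketch below is illustrative only. It assumes that the query returns a list of URI strings and that each URI ends in a `-<major>.<minor>.<patch>` suffix; neither is confirmed by this guide, and the URIs and helper names (`version_key`, `newest_uri`) are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumed URI shape: "<anything>-<major>.<minor>.<patch>"
_VERSION_SUFFIX = re.compile(r"-(\d+)\.(\d+)\.(\d+)$")


def version_key(uri: str) -> Tuple[int, int, int]:
    """Extract a numeric (major, minor, patch) sort key from a schema URI."""
    match = _VERSION_SUFFIX.search(uri)
    if match is None:
        # URIs without a recognizable version sort before all versioned ones.
        return (-1, -1, -1)
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def newest_uri(uris: List[str]) -> Optional[str]:
    """Return the URI with the highest version, or None for an empty list."""
    return max(uris, key=version_key) if uris else None


# Hypothetical URIs, used only to show numeric ordering.
example_uris = [
    "org-Template-1.2.0",
    "org-Template-1.10.0",
    "org-Template-1.9.3",
]
print(newest_uri(example_uris))
```

Comparing the version parts as integers matters here: a plain string sort would rank `1.9.3` above `1.10.0`.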
64+
65+
## Step 3: Choose your metadata workflow type
66+
67+
### Option A: Record-based metadata
68+
69+
Use this when metadata is normalized in structured records to eliminate duplication and ensure consistency.
70+
71+
```python
record_set, curation_task, data_grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn123456789",  # Your project ID
    folder_id="syn987654321",  # Folder where files are stored
    record_set_name="AnimalMetadata_Records",
    record_set_description="Centralized metadata for animal study data",
    curation_task_name="AnimalMetadata_Curation",  # Must be unique within the project
    upsert_keys=["StudyKey"],  # Fields that uniquely identify records
    instructions="Complete all required fields according to the schema. Use StudyKey to link records to your data files.",
    schema_uri=schema_uri,  # Schema found in Step 2
    bind_schema_to_record_set=True,
)

print(f"Created RecordSet: {record_set.id}")
print(f"Created CurationTask: {curation_task.task_id}")
```
88+
89+
**What this creates:**
90+
91+
- A RecordSet where metadata is stored as structured records (like a spreadsheet)
92+
- A CurationTask that guides users through completing the metadata
93+
- Automatic schema binding for validation
94+
- A data grid interface for easy metadata entry
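
The `upsert_keys` argument names the fields that uniquely identify a record. As a conceptual illustration only (this is not how Synapse implements it, which this guide does not describe), an upsert keyed on `StudyKey` updates the record with a matching key and inserts a new record otherwise. The `upsert` helper and the sample records are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def upsert(existing: List[Record], incoming: Iterable[Record], keys: List[str]) -> List[Record]:
    """Merge `incoming` into `existing`, matching records on the `keys` fields."""
    # Index copies of the existing records by their key tuple.
    index: Dict[Tuple[str, ...], Record] = {
        tuple(record[k] for k in keys): dict(record) for record in existing
    }
    for record in incoming:
        key = tuple(record[k] for k in keys)
        merged = index.get(key, {})
        merged.update(record)  # update a matching record, or fill a new one
        index[key] = merged
    return list(index.values())


current = [{"StudyKey": "S1", "species": "mouse"}]
changes = [
    {"StudyKey": "S1", "sex": "F"},        # same key: updates the S1 record
    {"StudyKey": "S2", "species": "rat"},  # new key: inserted as a new record
]
for row in upsert(current, changes, keys=["StudyKey"]):
    print(row)
```

Choosing keys that are truly unique per record keeps repeated submissions from creating duplicate rows.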
95+
96+
### Option B: File-based metadata (for unique per-file metadata)
97+
98+
Use this when metadata describes individual data files and is stored as annotations directly on each file.
99+
100+
```python
entity_view_id, task_id = create_file_based_metadata_task(
    synapse_client=syn,
    folder_id="syn987654321",  # Folder containing your data files
    curation_task_name="FileMetadata_Curation",  # Must be unique within the project
    instructions="Annotate each file with metadata according to the schema requirements.",
    attach_wiki=True,  # Creates a wiki in the folder with the entity view
    entity_view_name="Animal Study Files View",
    schema_uri=schema_uri,  # Schema found in Step 2
)

print(f"Created EntityView: {entity_view_id}")
print(f"Created CurationTask: {task_id}")
```
114+
115+
**What this creates:**
116+
117+
- An EntityView that displays all files in the folder
118+
- A CurationTask for guided metadata entry
119+
- Automatic schema binding to the folder for validation
120+
- Optional wiki attached to the folder
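
To give a feel for what schema validation of file annotations checks, here is a deliberately small sketch that is not the Synapse validator. It handles only two JSON Schema keywords, `required` and `enum`, and both the toy schema and the `check_annotations` helper are hypothetical.

```python
from typing import Any, Dict, List


def check_annotations(annotations: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of problems found; an empty list means no problems."""
    problems: List[str] = []
    # Every field listed under "required" must be present.
    for field in schema.get("required", []):
        if field not in annotations:
            problems.append(f"missing required field: {field}")
    # A field with an "enum" rule must hold one of the allowed values.
    for field, rules in schema.get("properties", {}).items():
        allowed = rules.get("enum")
        if allowed is not None and field in annotations:
            if annotations[field] not in allowed:
                problems.append(f"{field}: {annotations[field]!r} not in {allowed}")
    return problems


toy_schema = {
    "required": ["StudyKey", "species"],
    "properties": {"species": {"enum": ["mouse", "rat"]}},
}
print(check_annotations({"StudyKey": "S1", "species": "mouse"}, toy_schema))
print(check_annotations({"species": "cat"}, toy_schema))
```

A registered schema usually carries many more constraints than this, so treat the sketch as an aid to understanding rather than a substitute for the bound schema.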
121+
122+
## Complete example script
123+
124+
Here's the full script that demonstrates both workflow types:
125+
126+
```python
from synapseclient.extensions.curator import (
    create_record_based_metadata_task,
    create_file_based_metadata_task,
    query_schema_registry,
)
from synapseclient import Synapse

# Step 1: Authenticate
syn = Synapse()
syn.login()

# Step 2: Find schema
schema_uri = query_schema_registry(
    synapse_client=syn,
    dcc="ad",
    datatype="IndividualAnimalMetadataTemplate",
)
print("Using schema:", schema_uri)

# Step 3A: Create record-based workflow
record_set, curation_task, data_grid = create_record_based_metadata_task(
    synapse_client=syn,
    project_id="syn123456789",
    folder_id="syn987654321",
    record_set_name="AnimalMetadata_Records",
    record_set_description="Centralized animal study metadata",
    curation_task_name="AnimalMetadata_Curation",
    upsert_keys=["StudyKey"],
    instructions="Complete metadata for all study animals using StudyKey to link records to data files.",
    schema_uri=schema_uri,
    bind_schema_to_record_set=True,
)

print("Record-based workflow created:")
print(f"  RecordSet: {record_set.id}")
print(f"  CurationTask: {curation_task.task_id}")

# Step 3B: Create file-based workflow
entity_view_id, task_id = create_file_based_metadata_task(
    synapse_client=syn,
    folder_id="syn987654321",
    curation_task_name="FileMetadata_Curation",
    instructions="Annotate each file with complete metadata according to schema.",
    attach_wiki=True,
    entity_view_name="Animal Study Files View",
    schema_uri=schema_uri,
)

print("File-based workflow created:")
print(f"  EntityView: {entity_view_id}")
print(f"  CurationTask: {task_id}")
```
180+
181+
## Additional utilities
182+
183+
### Validate schema binding on folders
184+
185+
Use this script to verify the schema on a folder against the items contained within that folder:
186+
187+
```python
from synapseclient import Synapse
from synapseclient.models import Folder

# The Synapse ID of the Folder whose bound JSON Schema you want to validate.
FOLDER_ID = ""

syn = Synapse()
syn.login()

folder = Folder(id=FOLDER_ID).get()
schema_validation = folder.validate_schema()

print(f"Schema validation result for folder {FOLDER_ID}: {schema_validation}")
```
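
The utility scripts in this section use empty strings as ID placeholders, so a small guard can catch an unfilled placeholder before any request is made. This is a sketch under the assumption that a folder or project ID has the form `syn` followed by digits. The helper `require_synapse_id` is hypothetical and not part of `synapseclient`.

```python
import re

# Assumed ID shape: the literal prefix "syn" followed by one or more digits.
_SYNAPSE_ID = re.compile(r"^syn\d+$")


def require_synapse_id(value: str, label: str) -> str:
    """Return `value` unchanged if it looks like a Synapse ID, else raise."""
    if not _SYNAPSE_ID.match(value):
        raise ValueError(
            f"{label} must look like 'syn123456789', got {value!r}"
        )
    return value


# A filled-in ID passes through unchanged.
print(require_synapse_id("syn987654321", "FOLDER_ID"))

# An empty placeholder is rejected with a clear message.
try:
    require_synapse_id("", "FOLDER_ID")
except ValueError as error:
    print(error)
```

Calling the guard right after the constants are defined, for example `FOLDER_ID = require_synapse_id(FOLDER_ID, "FOLDER_ID")`, makes the failure appear before login.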
202+
203+
### List existing curation tasks
204+
205+
Use this script to see all curation tasks in a project:
206+
207+
```python
from pprint import pprint
from synapseclient import Synapse
from synapseclient.models.curation import CurationTask

PROJECT_ID = ""  # The Synapse ID of the project to list tasks from

syn = Synapse()
syn.login()

# Print every curation task found in the project
for curation_task in CurationTask.list(
    project_id=PROJECT_ID
):
    pprint(curation_task)
```
222+
223+
## References
224+
225+
### API Documentation
226+
227+
- [query_schema_registry][synapseclient.extensions.curator.query_schema_registry] - Search for schemas in the registry
228+
- [create_record_based_metadata_task][synapseclient.extensions.curator.create_record_based_metadata_task] - Create RecordSet-based curation workflows
229+
- [create_file_based_metadata_task][synapseclient.extensions.curator.create_file_based_metadata_task] - Create EntityView-based curation workflows
230+
- [Folder.bind_schema][synapseclient.models.Folder.bind_schema] - Bind schemas to folders
231+
- [Folder.validate_schema][synapseclient.models.Folder.validate_schema] - Validate folder schema compliance
232+
- [CurationTask.list][synapseclient.models.CurationTask.list] - List curation tasks in a project
233+
234+
### Related Documentation
235+
236+
- [JSON Schema Tutorial](../../../tutorials/python/json_schema.md) - Learn how to register schemas
237+
- [Schema Registry](https://synapse.org/Synapse:syn69735275/tables/) - Browse available schemas
Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
::: synapseclient.extensions.curator

mkdocs.yml

Lines changed: 4 additions & 0 deletions
@@ -56,6 +56,8 @@ nav:
5656
# - Using Entity Views: guides/views.md
5757
- Data Storage: guides/data_storage.md
5858
- Access the REST API: guides/accessing_the_rest_api.md
59+
- Extensions:
60+
- Curator: guides/extensions/curator/metadata_curation.md
5961
# - Expermental Features:
6062
# - Validating Annotations: guides/validate_annotations.md
6163
- API Reference:
@@ -100,6 +102,8 @@ nav:
100102
- Curator: reference/experimental/sync/curator.md
101103
- Link: reference/experimental/sync/link_entity.md
102104
- Functional Interfaces: reference/experimental/functional_interfaces.md
105+
- Extensions:
106+
- Curator: reference/extensions/curator.md
103107
- Asynchronous:
104108
- Factory Operations: reference/experimental/async/factory_operations.md
105109
- Agent: reference/experimental/async/agent.md

setup.cfg

Lines changed: 3 additions & 1 deletion
@@ -1,4 +1,3 @@
1-
21
[metadata]
32
name = synapseclient
43
description = A client for Synapse, a collaborative, open-source research platform that allows teams to share data, track analyses, and collaborate.
@@ -106,6 +105,9 @@ tests =
106105
pandas =
107106
pandas>=1.5,<3.0
108107

108+
curator =
109+
%(pandas)s
110+
109111
pysftp =
110112
pysftp>=0.2.8,<0.3
111113
paramiko<4.0.0

synapseclient/extensions/__init__.py

Whitespace-only changes.
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
"""
Synapse Curator Extensions

This module provides library functions for metadata curation tasks in Synapse.
"""

from .file_based_metadata_task import create_file_based_metadata_task
from .record_based_metadata_task import create_record_based_metadata_task
from .schema_registry import query_schema_registry

__all__ = [
    "create_file_based_metadata_task",
    "create_record_based_metadata_task",
    "query_schema_registry",
]
