Commit de1fe18

committed
Various updates throughout docs
1 parent 27b0985 commit de1fe18

6 files changed

+185
-10
lines changed

docs/src/concepts/data-model.md

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -120,10 +120,10 @@ storing large contiguous data objects.
120120

121121
DataJoint comprises:
122122

123-
- a schema [definition](../design/tables/declare.md) language
124-
- a data [manipulation](../manipulation/index.md) language
125-
- a data [query](../query/principles.md) language
126-
- a [diagramming](../design/diagrams.md) notation for visualizing relationships between
123+
+ a schema [definition](../design/tables/declare.md) language
124+
+ a data [manipulation](../manipulation/index.md) language
125+
+ a data [query](../query/principles.md) language
126+
+ a [diagramming](../design/diagrams.md) notation for visualizing relationships between
127127
modeled entities
128128

129129
The key refinement of DataJoint over other relational data models and their

docs/src/design/integrity.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,6 @@
11
# Data Integrity
22

3-
The term **data integrity** describes guarantees made by the data management process
3+
The term **data integrity** describes guarantees made by the data management process
44
that prevent errors and corruption in data due to technical failures and human errors
55
arising in the course of continuous use by multiple agents.
66
DataJoint pipelines respect the following forms of data integrity: **entity

docs/src/design/tables/blobs.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# Overview
1+
# Blobs
22

33
DataJoint provides functionality for serializing and deserializing complex data types
44
into binary blobs for efficient storage and compatibility with MATLAB's mYm
Lines changed: 80 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1,80 @@
1-
# Work in progress
1+
# Custom Types
2+
3+
In modern scientific research, data pipelines often involve complex workflows that
4+
generate diverse data types. From high-dimensional imaging data to machine learning
5+
models, these data types frequently exceed the basic representations supported by
6+
traditional relational databases. For example:
7+
8+
+ A lab working on neural connectivity might use graph objects to represent brain
9+
networks.
10+
+ Researchers processing raw imaging data might store custom objects for pre-processing
11+
configurations.
12+
+ Computational biologists might store fitted machine learning models or parameter
13+
objects for downstream predictions.
14+
15+
To handle these diverse needs, DataJoint provides the `dj.AttributeAdapter` class. It
16+
enables researchers to store and retrieve complex, non-standard data types—like Python
17+
objects or data structures—in a relational database while maintaining the
18+
reproducibility, modularity, and query capabilities required for scientific workflows.
19+
20+
## Uses in Scientific Research
21+
22+
Imagine a neuroscience lab studying neural connectivity. Researchers might generate
23+
graphs (e.g., networkx.Graph) to represent connections between brain regions, where:
24+
25+
+ Nodes are brain regions.
26+
+ Edges represent connections weighted by signal strength or another metric.
27+
28+
Storing these graph objects in a database alongside other experimental data (e.g.,
29+
subject metadata, imaging parameters) ensures:
30+
31+
1. Centralized Data Management: All experimental data and analysis results are stored
32+
together for easy access and querying.
33+
2. Reproducibility: The exact graph objects used in analysis can be retrieved later for
34+
validation or further exploration.
35+
3. Scalability: Graph data can be integrated into workflows for larger datasets or
36+
across experiments.
37+
38+
However, graphs are not natively supported by relational databases. This is where
39+
`dj.AttributeAdapter` becomes essential. It allows researchers to define custom logic for
40+
serializing graphs (e.g., as edge lists) and deserializing them back into Python
41+
objects, bridging the gap between advanced data types and the database.
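
The round trip described above can be sketched without any third-party dependency. This is only an illustration: the `put` and `get` functions below are hypothetical stand-ins for the adapter methods, and the graph is assumed to be a plain adjacency dict rather than a `networkx.Graph`.

```python
# Hypothetical stand-ins for an adapter's put/get pair, using plain dicts
# instead of networkx so the sketch has no third-party dependencies.

def put(graph):
    """Serialize an undirected weighted graph {node: {neighbor: weight}} to a sorted edge list."""
    edges = set()
    for node, neighbors in graph.items():
        for neighbor, weight in neighbors.items():
            # store each undirected edge once, with its endpoints in sorted order
            u, v = sorted((node, neighbor))
            edges.add((u, v, weight))
    return sorted(edges)


def get(edges):
    """Rebuild the adjacency dict from the stored edge list."""
    graph = {}
    for u, v, weight in edges:
        graph.setdefault(u, {})[v] = weight
        graph.setdefault(v, {})[u] = weight
    return graph


# nodes are brain regions, edge weights are connection strengths (made-up values)
brain = {
    "V1": {"V2": 0.8, "MT": 0.3},
    "V2": {"V1": 0.8},
    "MT": {"V1": 0.3},
}
stored = put(brain)      # what would be written to the database
restored = get(stored)   # what would be handed back on fetch
```

An edge list keeps only what is attached to edges, so a node with no connections would not survive this round trip. A real adapter would need to store the node list as well if isolated nodes matter.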
42+
43+
### Example: Storing Graphs in DataJoint
44+
45+
To store a networkx.Graph object in a DataJoint table, researchers can define a custom
46+
attribute type and use it in a DataJoint table definition:
47+
48+
```python
import datajoint as dj
import networkx as nx  # needed for nx.Graph below


class GraphAdapter(dj.AttributeAdapter):

    attribute_type = 'longblob'  # this is how the attribute will be declared

    def put(self, obj):
        # convert the nx.Graph object into an edge list
        # (only the edges are stored; isolated nodes and edge attributes are not)
        assert isinstance(obj, nx.Graph)
        return list(obj.edges)

    def get(self, value):
        # convert edge list back into an nx.Graph
        return nx.Graph(value)


# instantiate for use as a datajoint type
graph = GraphAdapter()


# define a table with a graph attribute
schema = dj.schema('test_graphs')


@schema
class Connectivity(dj.Manual):
    definition = """
    conn_id : int
    ---
    conn_graph = null : <graph>  # a networkx.Graph object
    """
```

docs/src/design/tables/indexes.md

Lines changed: 97 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1 +1,97 @@
1-
# Work in progress
1+
# Indexes
2+
3+
Table indexes are data structures that allow fast lookups by an indexed attribute or
4+
combination of attributes.
5+
6+
In DataJoint, indexes are created by one of three mechanisms:
7+
8+
1. Primary key
9+
2. Foreign key
10+
3. Explicitly defined indexes
11+
12+
The first two mechanisms are obligatory. Every table has a primary key, which serves as
13+
a unique index. Therefore, restrictions by a primary key are very fast. Foreign keys
14+
create additional indexes unless a suitable index already exists.
15+
16+
## Indexes for tables with a single-attribute primary key
17+
18+
Suppose a mouse in the lab has a lab-specific ID and also a separate ID issued
19+
by the animal facility.
20+
21+
```python
@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id : int  # lab-specific ID
    ---
    tag_id : int  # animal facility ID
    """
```
30+
31+
In this case, searching for a mouse by `mouse_id` is much faster than by `tag_id`
32+
because `mouse_id` is the primary key and is therefore indexed.
33+
34+
To make searches faster on fields other than the primary key or a foreign key, you can
35+
add a secondary index explicitly.
36+
37+
Regular indexes are declared as `index(attr1, ..., attrN)` on a separate line anywhere in
38+
the table declaration (below the primary key divider).
39+
40+
Indexes can be declared with a unique constraint as `unique index (attr1, ..., attrN)`.
41+
42+
Let’s redeclare the table with a unique index on `tag_id`.
43+
44+
```python
@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id : int  # lab-specific ID
    ---
    tag_id : int  # animal facility ID
    unique index (tag_id)
    """
```
54+
Now, searches with `mouse_id` and `tag_id` are similarly fast.
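
The effect of the unique index can be observed directly in a query planner. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in for the database backend, so the planner output differs from what DataJoint's backend would report. The table layout mirrors `Mouse`, while the index name `idx_tag` and the inserted values are made up for illustration.

```python
import sqlite3

# in-memory SQLite database used only as a stand-in for the real backend
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mouse (mouse_id INTEGER PRIMARY KEY, tag_id INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO mouse VALUES (?, ?)", [(i, 1000 + i) for i in range(100)]
)


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(row[-1] for row in rows)


query = "SELECT * FROM mouse WHERE tag_id = 1042"
plan_before = plan(query)  # no index on tag_id yet, so the table is scanned

conn.execute("CREATE UNIQUE INDEX idx_tag ON mouse (tag_id)")
plan_after = plan(query)   # the planner can now look the row up through idx_tag

# the unique index also rejects a second mouse with the same tag_id
try:
    conn.execute("INSERT INTO mouse VALUES (500, 1042)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Besides speeding up lookups, the `unique` qualifier acts as an integrity constraint: two mice cannot share a facility tag.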
55+
56+
## Indexes for tables with composite primary keys
57+
58+
Let’s now imagine that rats in a lab are identified by the combination of `lab_name` and
59+
`rat_id` in a table `Rat`.
60+
61+
```python
@schema
class Rat(dj.Manual):
    definition = """
    lab_name : char(16)
    rat_id : int unsigned  # lab-specific ID
    ---
    date_of_birth = null : date
    """
```
71+
Note that although `rat_id` is part of the index, searches by `rat_id` alone are not
72+
helped by the index because `rat_id` is not its first attribute. This is similar to searching for
73+
a word in a dictionary that orders words alphabetically. Searching by the first letters
74+
of a word is easy but searching by the last few letters of a word requires scanning the
75+
whole dictionary.
76+
77+
In this table, the primary key is a unique index on the combination `(lab_name, rat_id)`.
78+
Therefore searches on these attributes or on `lab_name` alone are fast. But this index
79+
cannot help searches on `rat_id` alone. Similarly, searching by `date_of_birth` requires a
80+
full-table scan and is inefficient.
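
The dictionary analogy can be made concrete with a small sketch. It assumes the composite index behaves like a list of keys kept sorted by `(lab_name, rat_id)`, which is a simplification of a real database index. The helper names and sample keys are hypothetical.

```python
from bisect import bisect_left

# A composite index modeled as a list of keys sorted by (lab_name, rat_id).
index = sorted([
    ("alpha", 3), ("alpha", 7), ("beta", 1), ("beta", 7), ("gamma", 2),
])


def by_lab(lab_name):
    """Leading attribute: binary search jumps to the contiguous run of matches."""
    start = bisect_left(index, (lab_name,))
    matches = []
    for key in index[start:]:
        if key[0] != lab_name:
            break  # past the run, nothing further can match
        matches.append(key)
    return matches


def by_rat_id(rat_id):
    """Non-leading attribute: matches are scattered, so every key is inspected."""
    return [key for key in index if key[1] == rat_id]


lab_matches = by_lab("beta")
rat_matches = by_rat_id(7)
```

Both calls return correct results, but `by_lab` touches only the matching run while `by_rat_id` walks the whole list. A separate index on `rat_id` removes that full scan.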
81+
82+
To speed up searches by `rat_id` and `date_of_birth`, we can add explicit indexes to
83+
`Rat`:
84+
85+
```python
@schema
class Rat2(dj.Manual):
    definition = """
    lab_name : char(16)
    rat_id : int unsigned  # lab-specific ID
    ---
    date_of_birth = null : date

    index(rat_id)
    index(date_of_birth)
    """
```

docs/src/publish-data.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -23,12 +23,12 @@ populated DataJoint pipeline.
2323
One example of publishing a DataJoint pipeline as a docker container is
2424
> Sinz, F., Ecker, A.S., Fahey, P., Walker, E., Cobos, E., Froudarakis, E., Yatsenko, D., Pitkow, Z., Reimer, J. and Tolias, A., 2018. Stimulus domain transfer in recurrent models for large scale cortical population prediction on video. In Advances in Neural Information Processing Systems (pp. 7198-7209). https://www.biorxiv.org/content/early/2018/10/25/452672
2525
26-
The code and the data can be found at https://github.com/sinzlab/Sinz2018_NIPS
26+
The code and the data can be found at [https://github.com/sinzlab/Sinz2018_NIPS](https://github.com/sinzlab/Sinz2018_NIPS).
2727

2828
## Exporting into a collection of files
2929

3030
Another option for publishing and archiving data is to export the data from the
3131
DataJoint pipeline into a collection of files.
3232
DataJoint provides features for exporting and importing sections of the pipeline.
3333
Several ongoing projects are implementing the capability to export from DataJoint
34-
pipelines into [Neurodata Without Borders](https://www.nwb.org/) files.
34+
pipelines into [Neurodata Without Borders](https://www.nwb.org/) files.
