Various updates throughout docs

kushalbakshi · kushalbakshi · commit de1fe189f1a1 · 2025-01-23T10:42:47.000-05:00
diff --git a/docs/src/concepts/data-model.md b/docs/src/concepts/data-model.md
@@ -120,10 +120,10 @@ storing large contiguous data objects.
 
 DataJoint comprises:
 
-- a schema [definition](../design/tables/declare.md) language
-- a data [manipulation](../manipulation/index.md) language
-- a data [query](../query/principles.md) language
-- a [diagramming](../design/diagrams.md) notation for visualizing relationships between 
++ a schema [definition](../design/tables/declare.md) language
++ a data [manipulation](../manipulation/index.md) language
++ a data [query](../query/principles.md) language
++ a [diagramming](../design/diagrams.md) notation for visualizing relationships between 
 modeled entities
 
 The key refinement of DataJoint over other relational data models and their 
diff --git a/docs/src/design/integrity.md b/docs/src/design/integrity.md
@@ -1,6 +1,6 @@
 # Data Integrity
 
-The term **data integrity** describes  guarantees made by the data management process 
+The term **data integrity** describes guarantees made by the data management process 
 that prevent errors and corruption in data due to technical failures and human errors 
 arising in the course of continuous use by multiple agents.
 DataJoint pipelines respect the following forms of data integrity: **entity 
diff --git a/docs/src/design/tables/blobs.md b/docs/src/design/tables/blobs.md
@@ -1,4 +1,4 @@
-# Overview
+# Blobs
 
 DataJoint provides functionality for serializing and deserializing complex data types
 into binary blobs for efficient storage and compatibility with MATLAB's mYm
diff --git a/docs/src/design/tables/customtype.md b/docs/src/design/tables/customtype.md
@@ -1 +1,80 @@
-# Work in progress
+# Custom Types
+
+In modern scientific research, data pipelines often involve complex workflows that
+generate diverse data types. From high-dimensional imaging data to machine learning
+models, these data types frequently exceed the basic representations supported by
+traditional relational databases. For example:
+
++ A lab working on neural connectivity might use graph objects to represent brain
+  networks.
++ Researchers processing raw imaging data might store custom objects for pre-processing
+  configurations.
++ Computational biologists might store fitted machine learning models or parameter
+  objects for downstream predictions.
+
+To handle these diverse needs, DataJoint provides the `dj.AttributeAdapter` class. It
+enables researchers to store and retrieve complex, non-standard data types—like Python
+objects or data structures—in a relational database while maintaining the
+reproducibility, modularity, and query capabilities required for scientific workflows.
+
+## Uses in Scientific Research
+
+Imagine a neuroscience lab studying neural connectivity. Researchers might generate
+graphs (e.g., `networkx.Graph`) to represent connections between brain regions, where:
+
++ Nodes are brain regions.
++ Edges represent connections weighted by signal strength or another metric.
+
+Storing these graph objects in a database alongside other experimental data (e.g.,
+subject metadata, imaging parameters) ensures:
+
+1. Centralized Data Management: All experimental data and analysis results are stored
+   together for easy access and querying.
+2. Reproducibility: The exact graph objects used in analysis can be retrieved later for
+   validation or further exploration.
+3. Scalability: Graph data can be integrated into workflows for larger datasets or
+   across experiments.
+
+However, relational databases do not support graphs natively. This is where
+`dj.AttributeAdapter` becomes essential: it lets researchers define custom logic for
+serializing graphs (e.g., as edge lists) and deserializing them back into Python
+objects, bridging the gap between advanced data types and the database.
+
+### Example: Storing Graphs in DataJoint
+
+To store a `networkx.Graph` object in a DataJoint table, researchers can define a custom
+attribute type as a `dj.AttributeAdapter` subclass and reference it in a table definition:
+
+```python
+import datajoint as dj
+import networkx as nx
+
+class GraphAdapter(dj.AttributeAdapter):
+    attribute_type = 'longblob'   # this is how the attribute will be declared
+    
+    def put(self, obj):
+        # convert the nx.Graph into an edge list (drops edge attributes and isolated nodes)
+        assert isinstance(obj, nx.Graph)
+        return list(obj.edges)
+
+    def get(self, value):
+        # convert edge list back into an nx.Graph
+        return nx.Graph(value)
+    
+
+# instantiate for use as a datajoint type
+graph = GraphAdapter()
+
+
+# define a table with a graph attribute
+schema = dj.schema('test_graphs')
+
+
+@schema
+class Connectivity(dj.Manual):
+    definition = """
+    conn_id : int
+    ---
+    conn_graph = null : <graph>  # a networkx.Graph object 
+    """
+```
diff --git a/docs/src/design/tables/indexes.md b/docs/src/design/tables/indexes.md
@@ -1 +1,97 @@
-# Work in progress
+# Indexes
+
+Table indexes are data structures that allow fast lookups by an indexed attribute or
+combination of attributes.
+
+In DataJoint, indexes are created by one of three mechanisms:
+
+1. Primary key
+2. Foreign key
+3. Explicitly defined indexes
+
+The first two mechanisms are obligatory. Every table has a primary key, which serves as
+a unique index. Therefore, restrictions by a primary key are very fast. Foreign keys
+create additional indexes unless a suitable index already exists.
+
+## Indexes for single primary key tables
+
+Let’s say a mouse in the lab has a lab-specific ID but also a separate ID issued
+by the animal facility.
+
+```python
+@schema
+class Mouse(dj.Manual):
+    definition = """
+    mouse_id : int  # lab-specific ID
+    ---
+    tag_id : int  # animal facility ID
+    """
+```
+
+In this case, searching for a mouse by `mouse_id` is much faster than by `tag_id`
+because `mouse_id` is a primary key, and is therefore indexed.
+
+To make searches faster on fields other than the primary key or a foreign key, you can
+add a secondary index explicitly.
+
+Regular indexes are declared as `index(attr1, ..., attrN)` on a separate line anywhere in
+the table declaration (below the primary key divider).
+
+Indexes can be declared with a unique constraint as `unique index (attr1, ..., attrN)`.
+
+Let’s redeclare the table with a unique index on `tag_id`.
+
+```python
+@schema
+class Mouse(dj.Manual):
+    definition = """
+    mouse_id : int  # lab-specific ID
+    ---
+    tag_id : int  # animal facility ID
+    unique index (tag_id)
+    """
+```
+Now, searches with `mouse_id` and `tag_id` are similarly fast.
+
+## Indexes for tables with composite primary keys
+
+Let’s now imagine that rats in a lab are identified by the combination of `lab_name` and
+`rat_id` in a table `Rat`.
+
+```python
+@schema
+class Rat(dj.Manual):
+    definition = """
+    lab_name : char(16) 
+    rat_id : int unsigned # lab-specific ID
+    ---
+    date_of_birth = null : date
+    """
+```
+Note that despite the fact that `rat_id` is in the index, searches by `rat_id` alone are not
+helped by the index because it is not first in the index. This is similar to searching for
+a word in a dictionary that orders words alphabetically. Searching by the first letters
+of a word is easy but searching by the last few letters of a word requires scanning the
+whole dictionary.
+
+In this table, the primary key is a unique index on the combination `(lab_name, rat_id)`.
+Therefore searches on these attributes or on `lab_name` alone are fast. But this index
+cannot help searches on `rat_id` alone. Similarly, searching by `date_of_birth` requires a
+full-table scan and is inefficient.
+
+To speed up searches by `rat_id` and `date_of_birth`, we can add explicit indexes to
+`Rat`:
+
+```python
+@schema
+class Rat2(dj.Manual):
+    definition = """
+    lab_name : char(16) 
+    rat_id : int unsigned # lab-specific ID
+    ---
+    date_of_birth = null : date
+
+    index(rat_id)
+    index(date_of_birth)
+    """
+```
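
The leftmost-prefix behavior described in `indexes.md` can be checked with a minimal sketch. It uses Python's built-in `sqlite3` module as a stand-in, which is an assumption made only so the example runs without a server: DataJoint itself runs on MySQL, and its query planner may report plans differently. The table name `rat`, the index name `rat_by_id`, and the helper `plan` are hypothetical and not part of the commit.

```python
import sqlite3


def plan(conn, sql):
    """Return the 'detail' column of SQLite's query plan for `sql`."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


conn = sqlite3.connect(":memory:")

# Stand-in for the Rat table: composite primary key (lab_name, rat_id).
conn.execute(
    """
    CREATE TABLE rat (
        lab_name TEXT NOT NULL,
        rat_id INTEGER NOT NULL,
        date_of_birth TEXT,
        PRIMARY KEY (lab_name, rat_id)
    )
    """
)

# Restricting by the leading key attribute can use the primary key index.
by_lab = plan(conn, "SELECT * FROM rat WHERE lab_name = 'lab_a'")

# Restricting by the second attribute alone cannot, so the table is scanned.
by_rat = plan(conn, "SELECT * FROM rat WHERE rat_id = 7")

# An explicit secondary index on rat_id, analogous to `index(rat_id)` in Rat2.
conn.execute("CREATE INDEX rat_by_id ON rat (rat_id)")
by_rat_indexed = plan(conn, "SELECT * FROM rat WHERE rat_id = 7")

for label, detail in [
    ("lab_name only", by_lab),
    ("rat_id only", by_rat),
    ("rat_id with secondary index", by_rat_indexed),
]:
    print(label, "->", detail)
```

In SQLite's plan output, a line beginning with `SEARCH` indicates an index lookup and one beginning with `SCAN` indicates a full pass over the table; the exact wording of the rest of each line varies between SQLite versions.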
diff --git a/docs/src/publish-data.md b/docs/src/publish-data.md
@@ -23,12 +23,12 @@ populated DataJoint pipeline.
 One example of publishing a DataJoint pipeline as a docker container is 
 > Sinz, F., Ecker, A.S., Fahey, P., Walker, E., Cobos, E., Froudarakis, E., Yatsenko, D., Pitkow, Z., Reimer, J. and Tolias, A., 2018. Stimulus domain transfer in recurrent models for large scale cortical population prediction on video. In Advances in Neural Information Processing Systems (pp. 7198-7209).  https://www.biorxiv.org/content/early/2018/10/25/452672
 
-The code and the data can be found at https://github.com/sinzlab/Sinz2018_NIPS
+The code and the data can be found at [https://github.com/sinzlab/Sinz2018_NIPS](https://github.com/sinzlab/Sinz2018_NIPS).
 
 ## Exporting into a collection of files
 
 Another option for publishing and archiving data is to export the data from the 
 DataJoint pipeline into a collection of files.
 DataJoint provides features for exporting and importing sections of the pipeline. 
 Several ongoing projects are implementing the capability to export from DataJoint 
-pipelines into [Neurodata Without Borders](https://www.nwb.org/) files.  
+pipelines into [Neurodata Without Borders](https://www.nwb.org/) files.
