-
Notifications
You must be signed in to change notification settings - Fork 34
Implement Validation for Non-Queryable Attributes in Filtering #532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
c488a39
7434493
bb939bd
15ad1b9
e3ff023
5de02fc
3c3b5cb
7eef065
6cdb67c
a46665e
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -367,8 +367,10 @@ You can customize additional settings in your `.env` file: | |
| | `STAC_INDEX_ASSETS` | Controls if Assets are indexed when added to Elasticsearch/Opensearch. This allows asset fields to be included in search queries. | `false` | Optional | | ||
| | `USE_DATETIME` | Configures the datetime search behavior in SFEOS. When enabled, searches both datetime field and falls back to start_datetime/end_datetime range for items with null datetime. When disabled, searches only by start_datetime/end_datetime range. | `true` | Optional | | ||
| | `USE_DATETIME_NANOS` | Enables nanosecond precision handling for `datetime` field searches as per the `date_nanos` type. When `False`, it uses 3 millisecond precision as per the type `date`. | `true` | Optional | | ||
| | `EXCLUDED_FROM_QUERYABLES` | Comma-separated list of fully qualified field names to exclude from the queryables endpoint and filtering. Use full paths like `properties.auth:schemes,properties.storage:schemes`. Excluded fields and their nested children will not be exposed in queryables. | None | Optional | | ||
| | `EXCLUDED_FROM_QUERYABLES` | Comma-separated list of fully qualified field names to exclude from the queryables endpoint and filtering. Use full paths like `properties.auth:schemes,properties.storage:schemes`. Excluded fields and their nested children will not be exposed in queryables. If `VALIDATE_QUERYABLES` is enabled, these fields will also be considered invalid for filtering. | None | Optional | | ||
| | `EXCLUDED_FROM_ITEMS` | Specifies fields to exclude from STAC item responses. Supports comma-separated field names and dot notation for nested fields (e.g., `private_data,properties.confidential,assets.internal`). | `None` | Optional | | ||
| | `VALIDATE_QUERYABLES` | Enable validation of query parameters against the collection's queryables. If set to `true`, the API will reject queries containing fields that are not defined in the collection's queryables. | `false` | Optional | | ||
| | `QUERYABLES_CACHE_TTL` | Time-to-live (in seconds) for the queryables cache. Used when `VALIDATE_QUERYABLES` is enabled. | `3600` | Optional | | ||
|
|
||
|
|
||
| > [!NOTE] | ||
|
|
@@ -424,6 +426,29 @@ EXCLUDED_FROM_QUERYABLES="properties.auth:schemes,properties.storage:schemes,pro | |
| - Excluded fields and their nested children will be skipped during field traversal | ||
| - Both the field itself and any nested properties will be excluded | ||
|
|
||
| ## Queryables Validation | ||
|
|
||
| SFEOS supports validating query parameters against the collection's defined queryables. This ensures that users only query fields that are explicitly exposed and indexed. | ||
|
|
||
| **Configuration:** | ||
|
|
||
| To enable queryables validation, set the following environment variables: | ||
|
|
||
| ```bash | ||
| VALIDATE_QUERYABLES=true | ||
| QUERYABLES_CACHE_TTL=3600 # Optional, defaults to 3600 seconds (1 hour) | ||
| ``` | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does this seem like a long default?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is what my team suggested at first as the cache query for queryable parameters doesn't seem that costly. I'll rediscuss that with them though in detail. |
||
|
|
||
| **Behavior:** | ||
|
|
||
| - When enabled, the API maintains a cache of all queryable fields across all collections. | ||
| - Search requests (both GET and POST) are checked against this cache. | ||
| - If a request contains a query parameter or filter field that is not in the list of allowed queryables, the API returns a `400 Bad Request` error with a message indicating the invalid field(s). | ||
| - The cache is automatically refreshed based on the `QUERYABLES_CACHE_TTL` setting. | ||
| - **Interaction with `EXCLUDED_FROM_QUERYABLES`**: If `VALIDATE_QUERYABLES` is enabled, fields listed in `EXCLUDED_FROM_QUERYABLES` will also be considered invalid for filtering. This effectively enforces the exclusion of these fields from search queries. | ||
|
|
||
| This feature helps prevent queries on non-queryable fields which could lead to unnecessary load on the database. | ||
|
|
||
| ## Datetime-Based Index Management | ||
|
|
||
| ### Overview | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,110 @@ | ||
| """A module for managing queryable attributes.""" | ||
|
|
||
| import asyncio | ||
| import os | ||
| import time | ||
| from typing import Any, Dict, List, Set | ||
|
|
||
| from fastapi import HTTPException | ||
|
|
||
|
|
||
| class QueryablesCache: | ||
| """A thread-safe, time-based cache for queryable properties.""" | ||
|
|
||
| def __init__(self, database_logic: Any): | ||
| """ | ||
| Initialize the QueryablesCache. | ||
|
|
||
| Args: | ||
| database_logic: An instance of a class with a `get_queryables_mapping` method. | ||
| """ | ||
| self._db_logic = database_logic | ||
| self._cache: Dict[str, List[str]] = {} | ||
| self._all_queryables: Set[str] = set() | ||
| self._last_updated: float = 0 | ||
| self._lock = asyncio.Lock() | ||
| self.validation_enabled: bool = False | ||
| self.cache_ttl: int = 3600 # How often to refresh cache (in seconds) | ||
| self.excluded_queryables: Set[str] = set() | ||
| self.reload_settings() | ||
|
|
||
| def reload_settings(self): | ||
| """Reload settings from environment variables.""" | ||
| self.validation_enabled = ( | ||
| os.getenv("VALIDATE_QUERYABLES", "false").lower() == "true" | ||
| ) | ||
| self.cache_ttl = int(os.getenv("QUERYABLES_CACHE_TTL", "3600")) | ||
|
|
||
| excluded = os.getenv("EXCLUDED_FROM_QUERYABLES", "") | ||
| self.excluded_queryables = set() | ||
| if excluded: | ||
| for field in excluded.split(","): | ||
| field = field.strip() | ||
| if field: | ||
| # Remove 'properties.' prefix if present | ||
| if field.startswith("properties."): | ||
| field = field[11:] | ||
| self.excluded_queryables.add(field) | ||
|
|
||
| async def _update_cache(self): | ||
| """Update the cache with the latest queryables from the database.""" | ||
| if not self.validation_enabled: | ||
| return | ||
|
|
||
| async with self._lock: | ||
| if (time.time() - self._last_updated < self.cache_ttl) and self._cache: | ||
| return | ||
|
|
||
| queryables_mapping = await self._db_logic.get_queryables_mapping() | ||
| all_queryables_set = set(queryables_mapping.keys()) | ||
|
|
||
| if self.excluded_queryables: | ||
| all_queryables_set = all_queryables_set - self.excluded_queryables | ||
|
|
||
| self._all_queryables = all_queryables_set | ||
|
|
||
| self._cache = {"*": list(all_queryables_set)} | ||
| self._last_updated = time.time() | ||
|
|
||
| async def get_all_queryables(self) -> Set[str]: | ||
| """ | ||
| Return a set of all queryable attributes across all collections. | ||
|
|
||
| This method will update the cache if it's stale or has been cleared. | ||
| """ | ||
| if not self.validation_enabled: | ||
| return set() | ||
|
|
||
| if (time.time() - self._last_updated >= self.cache_ttl) or not self._cache: | ||
| await self._update_cache() | ||
| return self._all_queryables | ||
|
|
||
| async def validate(self, fields: Set[str]) -> None: | ||
| """ | ||
| Validate if the provided fields are queryable. | ||
|
|
||
| Raises HTTPException if invalid fields are found. | ||
| """ | ||
| if not self.validation_enabled: | ||
| return | ||
|
|
||
| allowed_fields = await self.get_all_queryables() | ||
| invalid_fields = fields - allowed_fields | ||
| if invalid_fields: | ||
| raise HTTPException( | ||
| status_code=400, | ||
| detail=f"Invalid query fields: {', '.join(invalid_fields)}.", | ||
| ) | ||
|
|
||
|
|
||
| def get_properties_from_cql2_filter(cql2_filter: Dict[str, Any]) -> Set[str]: | ||
| """Recursively extract property names from a CQL2 filter.""" | ||
| props: Set[str] = set() | ||
| if "op" in cql2_filter and "args" in cql2_filter: | ||
| for arg in cql2_filter["args"]: | ||
| if isinstance(arg, dict): | ||
| if "op" in arg: | ||
| props.update(get_properties_from_cql2_filter(arg)) | ||
| elif "property" in arg: | ||
| props.add(arg["property"]) | ||
| return props |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,116 @@ | ||
| import json | ||
| import os | ||
| from unittest import mock | ||
|
|
||
| import pytest | ||
|
|
||
| if os.getenv("BACKEND", "elasticsearch").lower() == "opensearch": | ||
| from stac_fastapi.opensearch.app import app_config | ||
| else: | ||
| from stac_fastapi.elasticsearch.app import app_config | ||
|
|
||
|
|
||
| def get_core_client(): | ||
| if os.getenv("BACKEND", "elasticsearch").lower() == "opensearch": | ||
| from stac_fastapi.opensearch.app import app_config | ||
| else: | ||
| from stac_fastapi.elasticsearch.app import app_config | ||
| return app_config["client"] | ||
|
|
||
|
|
||
| def reload_queryables_settings(): | ||
| client = get_core_client() | ||
| if hasattr(client, "queryables_cache"): | ||
| client.queryables_cache.reload_settings() | ||
|
|
||
|
|
||
| @pytest.fixture(autouse=True) | ||
| def enable_validation(): | ||
|
|
||
| client = app_config["client"] | ||
| with mock.patch.dict(os.environ, {"VALIDATE_QUERYABLES": "true"}): | ||
| client.queryables_cache.reload_settings() | ||
| yield | ||
| client.queryables_cache.reload_settings() | ||
|
|
||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_search_post_query_valid_param(app_client, ctx): | ||
| """Test POST /search with a valid query parameter""" | ||
| query = {"query": {"eo:cloud_cover": {"lt": 10}}} | ||
| resp = await app_client.post("/search", json=query) | ||
| assert resp.status_code == 200 | ||
|
|
||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_search_post_query_invalid_param(app_client, ctx): | ||
| """Test POST /search with an invalid query parameter""" | ||
| query = {"query": {"invalid_param": {"eq": "test"}}} | ||
| resp = await app_client.post("/search", json=query) | ||
| assert resp.status_code == 400 | ||
| resp_json = resp.json() | ||
| assert "Invalid query fields: invalid_param" in resp_json["detail"] | ||
|
|
||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_item_collection_get_filter_valid_param(app_client, ctx): | ||
| """Test GET /collections/{collection_id}/items with a valid filter parameter""" | ||
| collection_id = ctx.item["collection"] | ||
| filter_body = { | ||
| "op": "<", | ||
| "args": [{"property": "eo:cloud_cover"}, 10], | ||
| } | ||
| params = { | ||
| "filter-lang": "cql2-json", | ||
| "filter": json.dumps(filter_body), | ||
| } | ||
| resp = await app_client.get(f"/collections/{collection_id}/items", params=params) | ||
| assert resp.status_code == 200 | ||
|
|
||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_item_collection_get_filter_invalid_param(app_client, ctx): | ||
| """Test GET /collections/{collection_id}/items with an invalid filter parameter""" | ||
| collection_id = ctx.item["collection"] | ||
| filter_body = { | ||
| "op": "=", | ||
| "args": [{"property": "invalid_param"}, "test"], | ||
| } | ||
| params = { | ||
| "filter-lang": "cql2-json", | ||
| "filter": json.dumps(filter_body), | ||
| } | ||
| resp = await app_client.get(f"/collections/{collection_id}/items", params=params) | ||
| assert resp.status_code == 400 | ||
| resp_json = resp.json() | ||
| assert "Invalid query fields: invalid_param" in resp_json["detail"] | ||
|
|
||
|
|
||
| @pytest.mark.asyncio | ||
| async def test_validate_queryables_excluded(app_client, ctx): | ||
| """Test that excluded queryables are rejected when validation is enabled.""" | ||
|
|
||
| excluded_field = "eo:cloud_cover" | ||
| client = app_config["client"] | ||
|
|
||
| with mock.patch.dict( | ||
| os.environ, | ||
| { | ||
| "VALIDATE_QUERYABLES": "true", | ||
| "EXCLUDED_FROM_QUERYABLES": excluded_field, | ||
| "QUERYABLES_CACHE_TTL": "0", | ||
| }, | ||
| ): | ||
| client.queryables_cache.reload_settings() | ||
|
|
||
| query = {"query": {excluded_field: {"lt": 10}}} | ||
| resp = await app_client.post("/search", json=query) | ||
| assert resp.status_code == 400 | ||
| assert "Invalid query fields" in resp.json()["detail"] | ||
| assert excluded_field in resp.json()["detail"] | ||
|
|
||
| query = {"query": {"id": {"eq": "test-item"}}} | ||
| resp = await app_client.post("/search", json=query) | ||
| assert resp.status_code == 200 | ||
|
|
||
| client.queryables_cache.reload_settings() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bountx How does this interact with the EXCLUDED_FROM_QUERYABLES environment variable? Should both these env vars be combined into one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did miss this in my implementation so I will fix it, but these two variables serve a different purpose:
VALIDATE_QUERYABLESshould validate on set of all queryables andEXCLUDED_FROM_QUERYABLEScan be used to make this set of all queryables smaller.Their interaction when both used should look something along:
Cache of all queryables -> Removes from this set all excluded queryables -> Validates on that set
If you had something other in mind let me know
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonhealy1 Fixed this interaction to work the way I mentioned beforehand.