Skip to content

Conversation

@kumar-shivam-ranjan
Copy link
Contributor

@kumar-shivam-ranjan kumar-shivam-ranjan commented Feb 3, 2025

This PR intends to add additional model API to fetch tokenizer config of the model. It reads the tokenizer_config.json file from the model artifact location and returns the whole json as response body. The tokenizer config of the model contains chat template of the model which is useful when we deploy the model with chat completions endpoint.

Chat templates specify how to convert conversations, represented as lists of messages, into a single tokenizable string in the format that the model expects.

For models which don't have chat_templates, it will return empty json.

Request

curl --location 'http://localhost:8888/aqua/model/ocid1.datasciencemodel.oc1.iad.xxxx/tokenizer'

Response

{
    "bos_token": "<s>",
    "clean_up_tokenization_spaces": true,
    "cls_token": "<s>",
    "eos_token": "</s>",
    "mask_token": {
        "__type": "AddedToken",
        "content": "<mask>",
        "lstrip": true,
        "normalized": true,
        "rstrip": false,
        "single_word": false
    },
    "model_max_length": 8192,
    "pad_token": "<pad>",
    "sep_token": "</s>",
    "sp_model_kwargs": {},
    "tokenizer_class": "XLMRobertaTokenizer",
    "unk_token": "<unk>"
}

Unit tests


Screenshot 2025-02-05 at 12 47 47 AM

@oracle-contributor-agreement oracle-contributor-agreement bot added the OCA Verified All contributors have signed the Oracle Contributor Agreement. label Feb 3, 2025
@github-actions
Copy link

github-actions bot commented Feb 3, 2025

📌 Cov diff with main:

Coverage-11%

📌 Overall coverage:

Coverage-56.71%

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some comments. Also, could you please add some unit tests as well?

ads/aqua/app.py Outdated

return config

def get_chat_template(self, model_id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can't we do similar to deployment and fine-tuning?

AQUA_MODEL_TOKENIZER_CONFIG = "tokenizer_config.json"

def get_hf_tokenizer_config(self, model_id: str) -> Dict:
        config = self.get_config(model_id, AQUA_MODEL_TOKENIZER_CONFIG)
        if not config:
            logger.debug(
                f"Tokenizer config for model: {model_id} is not available. Use defaults."
            )
        return config

Relevant to this, for verified models, we want to load the config from the service model bucket instead of user bucket. PR- #1053

If we use get_config directly, the above PR should take care of it.

ads/aqua/app.py Outdated

return config

def get_chat_template(self, model_id):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add @telemetry decorator.

)


class AquaModelChatTemplateHandler(AquaAPIhandler):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it might be good to return the entire config instead of only the chat template field since there might be use cases down the line where other values might be needed from the tokenizer. UI can parse the json and get the chat_template.
We can name this AquaModelTokenizerConfigHandler accordingly.

@github-actions
Copy link

github-actions bot commented Feb 4, 2025

📌 Cov diff with main:

Coverage-76%

📌 Overall coverage:

Coverage-56.75%

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes look good. requesting one update to the endpoint name to match what we're returning.

__handlers__ = [
("model/?([^/]*)", AquaModelHandler),
("model/?([^/]*)/license", AquaModelLicenseHandler),
("model/?([^/]*)/chat_template", AquaModelTokenizerConfigHandler),
Copy link
Member

@VipulMascarenhas VipulMascarenhas Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's change the endpoint name to /tokenizer instead of /chat_template since we'll be returning the entire json file contents.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@github-actions
Copy link

github-actions bot commented Feb 4, 2025

📌 Cov diff with main:

Coverage-77%

📌 Overall coverage:

Coverage-56.76%

@github-actions
Copy link

github-actions bot commented Feb 4, 2025

📌 Cov diff with main:

Coverage-77%

📌 Overall coverage:

Coverage-56.76%

@github-actions
Copy link

github-actions bot commented Feb 4, 2025

📌 Cov diff with main:

Coverage-76%

📌 Overall coverage:

Coverage-56.77%

ads/aqua/app.py Outdated
)
base_model = self.ds_client.get_model(base_model_ocid).data
artifact_path = get_artifact_path(base_model.custom_metadata_list)
if config_folder == "artifact":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to move strings to constants.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

ads/aqua/app.py Outdated
self,
model_id: str,
config_file_name: str,
config_folder: Optional[str] = "config",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if to pass None as a config_folder? I guess in the code we should add something like

config_folder = config_folder or DEFAULT_MODEL_CONFIG_FOLDER

Let's also move the strings to the constants.

DEFAULT_MODEL_CONFIG_FOLDER = "config"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

ads/aqua/app.py Outdated
The OCID of the Aqua model.
config_file_name: str
name of the config file
config_folder: Optional[str]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a default value to the docstring as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

ads/aqua/app.py Outdated
)
base_model = self.ds_client.get_model(base_model_ocid).data
artifact_path = get_artifact_path(base_model.custom_metadata_list)
if config_folder == "artifact":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need to convert the config_folder to the lowercase?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config_folder values comes from constant enums which is always lowercase. so probably no need to convert explicitly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partially true, however we allow to pass a custom config_folder to the function?

ads/aqua/app.py Outdated
return config

config_path = f"{os.path.dirname(artifact_path)}/config/"
config_path = f"{os.path.dirname(artifact_path)}/{config_folder}/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the config_folder will have / at the end? It might be better to use os.path.join?

Copy link
Contributor Author

@kumar-shivam-ranjan kumar-shivam-ranjan Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config_folder is parameter that we pass to get_config function.
This determines whether we need to search for a particular file in artifact folder (like tokenizer_config.json) or config folder (like ft_config.json).
The value of the config folder will always be either "config" (default) or "artifact".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with DC, makes it easier to handle paths - something like
os.path.join(os.path.dirname(artifact_path), config_folder)?

paths = url_parse.path.strip("/")
path_list = paths.split("/")
if (
len(path_list) == 4
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is 4 and 3 in this code? :) Let's move them either to constants or add some meaningful comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The request path here is: /aqua/models/ocid1.iad.ahdxxx/tokenizer
path_list=['aqua','models','ocid1.iad.ahdxxx','tokenizer']

Added comments

Copy link
Member

@mrDzurb mrDzurb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @kumar-shivam-ranjan, could you also change the title and description for the PR, since the scope of the PR has been adjusted.

@kumar-shivam-ranjan kumar-shivam-ranjan changed the title Adding API to fetch default chat template for models Adding API to fetch tokenizer config for model Feb 5, 2025
@kumar-shivam-ranjan
Copy link
Contributor Author

Hi @kumar-shivam-ranjan, could you also change the title and description for the PR, since the scope of the PR has been adjusted.

Updated

@github-actions
Copy link

github-actions bot commented Feb 5, 2025

📌 Cov diff with main:

Coverage-0%

📌 Overall coverage:

Coverage-19.50%

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments, rest looks good.

ads/aqua/app.py Outdated
return config

config_path = f"{os.path.dirname(artifact_path)}/config/"
config_path = f"{os.path.dirname(artifact_path)}/{config_folder}/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree with DC, makes it easier to handle paths - something like
os.path.join(os.path.dirname(artifact_path), config_folder)?

@github-actions
Copy link

github-actions bot commented Feb 5, 2025

📌 Cov diff with main:

Coverage-0%

📌 Overall coverage:

Coverage-19.50%

@github-actions
Copy link

github-actions bot commented Feb 5, 2025

📌 Cov diff with main:

Coverage-82%

📌 Overall coverage:

Coverage-56.67%

Copy link
Member

@VipulMascarenhas VipulMascarenhas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm 👍

@kumar-shivam-ranjan kumar-shivam-ranjan self-assigned this Feb 5, 2025
mrDzurb
mrDzurb previously approved these changes Feb 5, 2025
def get(self, model_id):
url_parse = urlparse(self.request.path)
paths = url_parse.path.strip("/")
path_list = paths.split("/")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Looks like we only use path_list further in this method, then probably can do

path_list = urlparse(self.request.path).path.strip("/").split("/")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

)


class AquaModelTokenizerConfigHandler(AquaAPIhandler):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add the pydocs to the class?

Handles requests for retrieving the Hugging Face tokenizer configuration 
    of a specified model.
    
    Expected request format:
        GET /aqua/models/<model-ocid>/tokenizer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

and path_list[3] == "tokenizer"
):
return self.finish(AquaModelApp().get_hf_tokenizer_config(model_id))
else:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: looks like else is not required here. I think it would be more clear to do something like this:

 is_valid_path = (
    len(path_list) == 4
    and path_list[3] == "tokenizer"
    and is_valid_ocid(path_list[2])
)

if not is_valid_path:
       raise HTTPError(400, f"The request {self.request.path} is invalid.")

return self.finish(AquaModelApp().get_hf_tokenizer_config(model_id))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@github-actions
Copy link

github-actions bot commented Feb 6, 2025

📌 Cov diff with main:

Coverage-80%

📌 Overall coverage:

Coverage-56.67%

@github-actions
Copy link

github-actions bot commented Feb 6, 2025

📌 Cov diff with main:

Coverage-82%

📌 Overall coverage:

Coverage-56.68%

@kumar-shivam-ranjan kumar-shivam-ranjan merged commit 0db36f2 into main Feb 6, 2025
21 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

OCA Verified All contributors have signed the Oracle Contributor Agreement.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants