implement match for generate.yaml #483

mzuenni · 2025-11-17T16:59:31Z

solves #312
@thorehusfeldt do you mind adjusting the schemas?

RagnarGrootKoerkamp · 2025-12-02T22:03:06Z

Should we also allow this on the answer files, to ensure a testcase is (im)possible as intended?

thorehusfeldt · 2025-12-03T08:15:49Z

Consider bumping the generator framework version.

The sample generators.yaml script linked from doc might want to include

version: 2025-12

and the default in the CUE schema generators key (and presumably the JSON file) updated accordingly.

thorehusfeldt · 2025-12-03T08:25:45Z

@RagnarGrootKoerkamp :

Should we also allow this on the answer files?

But ans: possible and ans: impossible can already be specified. I guess what you are proposing is to support “make sure the answer is not impossible”, for instance by

ans-match: \d+

Hm. I find this useful, but now it’s getting ugly.

An idea for syntax that is consistent with the current proposal:

match:
  in: foo
  ans: bar
---
match:
  in: [42, forty-two]
  ans: bar
---
match: \w+\s\w+ # same as match: { in: \w+\s\w+ }
---
# same as match: { in: [42, forty-two] }:
match:
  - 42
  - forty-two

I guess the schema is

match: string | [...string] | {
  in: string | [...string]
  ans: string | [...string]
}
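
To make the three accepted shapes concrete, here is a minimal Python sketch of how a tool could normalise them into one form. The function name `normalize_match` and the returned layout are hypothetical and not taken from BAPCtools; whether a list of patterns means "all must match" or "any must match" is left open here, as it is in the proposal.

```python
def normalize_match(spec):
    """Return {"in": [...], "ans": [...]} for any accepted form of `match`.

    Hypothetical helper: accepts a single pattern, a list of patterns,
    or a mapping with the optional keys "in" and "ans".
    """

    def as_list(value):
        # YAML parses an unquoted 42 as an int, so coerce every entry to str.
        if isinstance(value, list):
            return [str(v) for v in value]
        return [str(value)]

    if isinstance(spec, dict):
        unknown = set(spec) - {"in", "ans"}
        if unknown:
            raise ValueError(f"unknown match keys: {sorted(unknown)}")
        return {
            ext: (as_list(spec[ext]) if ext in spec else [])
            for ext in ("in", "ans")
        }
    # A bare string or list is shorthand for { in: ... }.
    return {"in": as_list(spec), "ans": []}
```

With this, `match: \d+` and `match: { in: \d+ }` normalise to the same value, which is the equivalence the proposal relies on.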

mzuenni · 2025-12-03T17:27:37Z

Even though this might be less YAML-like, I would prefer not to nest these and instead go for something like in.match or match.in?

thorehusfeldt · 2025-12-04T07:58:32Z

Just to make sure I was clear: I propose to retain

match: \d+

as a valid expression, and expect it to be the widest-used form. I propose that the above is the same as

match:
  in: \d+

(which, thanks to standard YAML syntax, can also be written as a one-liner, match: { in: \d+ }, but which is not the same as match.in: \d+. )

The situation in which the “mapping” form would mainly arise is when you want to specify something about ans, like “the answer is not impossible”. Not sure what kinds of conventions will arise among authors, but here are some suggestions:

match: { ans: ^[^i] }
---
match: { ans: \d+ }
---
match: { ans: ^(?!impossible$).* }

I would advise against introducing more keys in the top-level mapping (such as match, match.in, and match.ans); tool support for YAML is just better when we stick to YAML conventions.

The main alternatives I can see to my proposal would be to add pattern to in and ans. (I use pattern here instead of match just to keep the proposals syntactically separate.)

[in|ans]: string | {
  value: string
  pattern: string | [...string]
}

so you’d have expressions like this:

generate: make_random_tree -n 100 --balanced {seed:0}
in:
  pattern: \d+
ans: impossible

This doesn’t smell right to me, but it’s just a hunch.

thorehusfeldt · 2025-12-04T09:33:51Z

I notice that we already have a plethora of stuff, namely

["in" | "in.statement" | "in.download" |
    "ans" | "ans.statement" | "ans.download" |
    "out"]: string

The current semantics is that the key: value pair means "<testcasename>.<key> must equal value". What we’re looking for in the current proposal is a semantics that says "<testcasename>.<key> should obey constraint".

This is a case against introducing keys like in.match, by the way. You’d need ans.statement.match etc.

My hunch is that the cleanest way is to enrich the right-hand side, instead of introducing more left-hand sides of such expressions.

I think what I’m saying is

let extension = "in" | "in.statement" | "in.download" |  "ans" | "ans.statement" | "ans.download" | "out"
[extension]:  string # as we have now
match: string | { [extension]: string } # default string same as { in: string }

allowing

in: foo
match:
  ans.statement: \d\w+

Alternatively,

let extension = "in" | "in.statement" | "in.download" |  "ans" | "ans.statement" | "ans.download" | "out"
[extension]:  string  | { match: string }

in: foo
ans.statement:
  match: \d\w+

Dream state

The dream state would be what CUE already supports out of the box:

ans: "impossible"  # ans must equal impossible
---
in: number & >0 # in must be a number, and strictly larger than 0
---
ans: "yes" | "no" # ans  must be either "yes" or "no"
in.statement: =~"^\\w\\w$" # in.statement has two letters
in: !~"impossible" # in does not contain impossible
in: in.statement # in and in.statement are identical

In other words, there’s a whole grammar on the right hand side supporting |, &, literal match, and =~ and !~ for regex match and unmatch.

Note that explicit creation and constraint checking are the same: CUE just unifies everything it knows about, say .in (including whatever copy or generate may have produced) and expects the result to be a singleton. Otherwise it complains. Specifying a constraint is the same as specifying a value (the latter is just a constraint with a singleton valid instantiation.)

This would be sah-weet!

RagnarGrootKoerkamp · 2025-12-04T10:19:20Z

Interesting idea to do in: {match: ...}, sounds reasonable as well to me, but no strong opinion either way.

Should it be matches instead of match maybe? As in the .ans matches X Y Z.

mzuenni · 2025-12-04T11:07:23Z

The main alternatives I can see to my proposal would be to add pattern to in and ans. (I use pattern here instead of match just to keep the proposals syntactically separate.)

I don't like that, also feels weird in combination with generated testcases...

["in" | "in.statement" | "in.download" |
   "ans" | "ans.statement" | "ans.download" |
   "out"]: string

I don't think we need this for anything other than .in and .ans, since this is only intended to additionally check generated files. The others are typically hardcoded already?

and =~ and !~ for regex match and unmatch.

Unmatch would certainly be nice...

match: string | [...string] | {
  in: string | [...string]
  ans: string | [...string]
}

I am fine with that, even though I like to not nest things... :D

The question is if/how we want to support unmatch then?

thorehusfeldt · 2025-12-04T15:00:53Z

The question is if/how we want to support unmatch then?

The current proposal already supports “unmatching”, since regexen support that. Here are the three examples from upthread again, for a problem with output impossible or some numbers:

match: { ans: ^[^i] }
---
match: { ans: \d+ }
---
match: { ans: ^(?!impossible$).* }

CUE of course would make this nicer to look at:

ans: !~"impossible"

mzuenni · 2025-12-04T15:35:11Z

match: { ans: ^(?!impossible$).* }

I don't think that one is right? (\A(?!.*^impossible$).*\Z would work but is not very nice...) We could say that if the string starts with ! we do an unmatch, and if it starts with = we do a match.

thorehusfeldt · 2025-12-04T17:53:59Z

The only thing I’m unsure about for my negative lookahead regex is what to do with a possibly trailing newline. (I don’t understand the specification well enough.) So maybe it should be ^(?!impossible).* instead. Otherwise I’m pretty sure it’s fine.

mzuenni · 2025-12-05T10:47:18Z

I think the issue is that we don't do a full match but a search, and your pattern still matches a suffix not containing impossible? Anyway: yes, it can be expressed... the question is: do we want simpler syntax for this? ^^'

thorehusfeldt · 2025-12-05T15:22:50Z

we don't do a full match but a search

Now I understand. That’s what ^ is for in my expression. (You prefer \A.)

mzuenni · 2025-12-05T15:23:56Z

But ^ matches at the start of any line, while \A matches only at the start of the first line.
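
The difference can be checked with Python's `re` module. This is a small sketch under one assumption that the thread implies but does not state outright: that the patterns are applied with `re.search` and the `re.MULTILINE` flag. The helper name `ans_is_not_impossible` is made up for illustration.

```python
import re

# Assumption: patterns are applied with re.search and re.MULTILINE.
text = "1 2\nimpossible\n"

# With MULTILINE, ^ matches after every newline, so the second line is found.
caret = re.search(r"^impossible$", text, re.MULTILINE)

# \A only ever matches at the very start of the string, so nothing is found.
start_only = re.search(r"\Aimpossible$", text, re.MULTILINE)

# The lookahead pattern from upthread still finds an (empty) match after the
# trailing newline, so a search does not reject an answer of "impossible".
lookahead = re.search(r"^(?!impossible$).*", "impossible\n", re.MULTILINE)


def ans_is_not_impossible(ans: str) -> bool:
    """Anchor with \\A and \\Z so the search has to cover the whole file.

    DOTALL is needed so that .* can cross line breaks.
    """
    pattern = r"\A(?!.*^impossible$).*\Z"
    return re.search(pattern, ans, re.MULTILINE | re.DOTALL) is not None
```

Here `caret` is a match object, `start_only` is `None`, and `lookahead` is a match object even though the answer is exactly `impossible`. The anchored variant returns `False` for `"impossible\n"` and `True` for `"42\n"`.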

thorehusfeldt · 2025-12-05T17:11:35Z

Hear me out. This actually works:

Syntax

Do allow certain CUE-expressions as the right hand sides of in:, ans:, etc. To be precise, allow string expressions.

For instance, we can do

in: "impossible" # just like we always have
---
in: =~"\\d\\d" # two digits
---
in: "foo"  | "bar"
---
in: =~"^[a-z]+$" & !~"^impossible$" # alphabetic word, but not impossible

and a thousand other things. CUE is quite expressive. The main use cases are disjunctions and regex match and unmatch.

What is new is that the right hand side is now a constraint. If no in-key is present, it defaults to in: string.

Semantics

For a generator rule, various files can be created: by generate, by copy, or by the default submission producing ans.

Whatever has been produced (maybe nothing) is now unified using CUE and must produce a concrete value (i.e., a concrete string). In the simplest case, the expression

ans: "impossible"

means that the output of the default submission will be unified with "impossible". In this special case, this means that the two strings need to be the same. This is exactly the behaviour that we already have.

But if we had

ans: "yes" | "maybe"

the output of the default submission could be yes or maybe, since both of those strings unify with the ans-expression.

Implementation

The CUE CLI already does this. You can set up a very small CUE snippet:

input: string   // will be filled from CLI with concrete value
expr: "foo" | "bar"  // the value of an ans-key in generators.yaml
ok:   input & expr

cue cmd --inject input=foobarbaz replaces input with "foobarbaz", and then CUE does its magic by trying to unify ok. The result of the command is either an error (in this case because "foobarbaz" does not satisfy the rule), or the unified string.

The only reason to not do this is that it increases the dependencies of BAPCtools. (Which is a good enough reason, I think.)

Still, cool AF. Backwards compatible.

mzuenni · 2025-12-05T20:42:09Z

I actually don't understand what you want to suggest? ^^'

suppose one of your examples is the actual generators.yaml:

data:
  secret:
    - testcase:
        generate: gen.py
        in: "foo"  | "bar"

what is supposed to happen (in our implementation)? do we first run cue on the yaml? do we parse the yaml?

thorehusfeldt · 2025-12-05T21:21:26Z

what is supposed to happen (in our implementation)? do we first run cue on the yaml? do we parse the yaml?

Maybe this is too much of a rabbit hole, but: yes.

gen.py generates testcase.in. (Of type string.) Say its value is "foo". Now, from CUE’s perspective, all the different values of in are unified. There are only two such values:

in: "foo"         # generated by gen.py
in: "foo" | "bar" # the rule

This unifies nicely (to "foo", a concrete string) and CUE is happy.

Had gen.py generated the string "baz" then CUE would have tried to unify this:

in: "baz"         # from gen.py
in: "foo" | "bar" # the rule

And err.

In the special case where the rule is a concrete string, the two strings must be the same. (This is the current behaviour of BAPCtools.)

In the special case where nothing else produces .in (because there is no generate and no copy in the rule), then the in rule itself must be a concrete string, like in: "foo" but not in: "foo" | "bar" (This is the current behaviour of BAPCtools.)

CUE does not distinguish between “a constraint” (like "foo" | "bar") and “a constraint that is so tight that only a singleton value satisfies it”. Types are values. It’s really cute.
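
For readers who do not know CUE, the semantics described above can be imitated in a few lines of Python. This is a hand-rolled stand-in that does not call CUE and covers only literals, `=~`, `!~`, `|` and `&` on strings; every class and function name in it is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:  # "foo"
    value: str

    def accepts(self, s: str) -> bool:
        return s == self.value


@dataclass(frozen=True)
class Matches:  # =~"regex" (unanchored search)
    pattern: str

    def accepts(self, s: str) -> bool:
        return re.search(self.pattern, s) is not None


@dataclass(frozen=True)
class NotMatches:  # !~"regex"
    pattern: str

    def accepts(self, s: str) -> bool:
        return re.search(self.pattern, s) is None


@dataclass(frozen=True)
class AnyOf:  # a | b
    options: tuple

    def accepts(self, s: str) -> bool:
        return any(o.accepts(s) for o in self.options)


@dataclass(frozen=True)
class AllOf:  # a & b
    parts: tuple

    def accepts(self, s: str) -> bool:
        return all(p.accepts(s) for p in self.parts)


def check(produced, rule):
    """Combine what generate/copy produced (or None) with the rule."""
    if produced is None:
        # Nothing else produced the file: the rule must itself be concrete.
        if isinstance(rule, Literal):
            return rule.value
        raise ValueError("rule is not a concrete string")
    if not rule.accepts(produced):
        raise ValueError(f"{produced!r} does not satisfy the rule")
    return produced
```

With `rule = AnyOf((Literal("foo"), Literal("bar")))`, `check("foo", rule)` returns `"foo"` and `check("baz", rule)` raises, mirroring the two gen.py cases. `check(None, Literal("foo"))` returns `"foo"`, which corresponds to the existing behaviour of a hardcoded in: value.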

mzuenni · 2025-12-05T22:43:06Z

Maybe this is too much of a rabbit hole, but: yes.

but what exactly is yes supposed to mean? How would I implement this?

data:
  secret:
    - testcase:
        generate: gen.py
        in: "foo"  | "bar"

right now this is neither a valid .yaml file nor a valid .cue file? So I am unsure what I would need to implement to get what you want ^^'
in: "foo" | "bar" is not valid YAML and raises an error when parsed as YAML, but running the whole thing as CUE is meaningless because CUE does not know what running a generator should mean?

Is our goal to write a CUE file from what's specified within the YAML? It would look like this:

in: <output of generator>
in: "foo"  | "bar"

Or would we want a separate generators.cue which only contains the part related to in: "foo" | "bar", and then that part is not present in the generators.yaml?

thorehusfeldt · 2025-12-06T07:58:24Z

right now this is neither a valid .yaml file nor a valid .cue file?

Exactly. That’s why I’m not pushing hard for this (and call it dream-mode or rabbit hole). We’d need to start having annoying conversations about writing

in: '"foo" | "bar"'

and understand double-escapes in \\w so that YAML’s opinions about what must be quoted (in various '"-orthodoxies) are reliably transformed into CUE’s. I think it’s doable, and in the long run CUE would be a much better “generator configuration language” than YAML. But that’s Zukunftsmusik.

mzuenni · 2025-12-07T22:24:11Z

in: '"foo" | "bar"'

Yeah, I am not really a fan of those double-quote thingies... especially since the type of quotes already seems to have meaning in YAML ^^' So for now I would prefer to just go with standard regex notation.

But there is still the open question on how to handle .in/.ans files. You proposed to enrich the match entry to possibly be a map, but we could also go the other way around by enriching the regex like this (not sure if I got the cue syntax right):

#Matcher: string | {
    pattern: string
    extension?: "ans" | "in"  //defaults to "in"
    unmatch?: bool            //defaults to false
}

which would then allow us to write:

match:
  - '^1 1$'  #shorthand for "pattern: '^1 1$'", matches a selfloop at vertex 1
  - pattern: '^possible$'
    unmatch: True
    extension: 'ans'  #the answer must not contain possible
  - pattern: '^impossible$'
    extension: 'ans'  #the answer must contain impossible
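
A rough Python sketch of how such a list could be evaluated, to make the defaults and the unmatch flag concrete. The names `Matcher`, `parse_matcher` and `failed_matchers` are hypothetical, and the use of `re.search` with `re.MULTILINE` is an assumption based on the discussion above rather than the actual implementation in this PR.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Matcher:
    pattern: str
    extension: str = "in"  # defaults to "in"
    unmatch: bool = False  # defaults to False


def parse_matcher(entry) -> Matcher:
    """Accept either the shorthand string or the mapping form."""
    if isinstance(entry, str):
        return Matcher(pattern=entry)
    return Matcher(
        pattern=entry["pattern"],
        extension=entry.get("extension", "in"),
        unmatch=bool(entry.get("unmatch", False)),
    )


def failed_matchers(entries, files):
    """Return the matchers that the testcase violates.

    `files` maps an extension ("in" or "ans") to that file's content.
    """
    failures = []
    for matcher in map(parse_matcher, entries):
        content = files[matcher.extension]
        # Assumption: a search in MULTILINE mode, so ^ and $ work per line.
        found = re.search(matcher.pattern, content, re.MULTILINE) is not None
        # A plain matcher fails when nothing is found; an unmatch fails when
        # something is found.
        if found == matcher.unmatch:
            failures.append(matcher)
    return failures
```

For the three entries above, a testcase with `.in` content `"3 2\n1 1\n2 3\n"` and `.ans` content `"impossible\n"` passes all matchers, so `failed_matchers` returns an empty list.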

mzuenni added 2 commits November 17, 2025 17:57

implement match for generate.yaml

8387796

retries are not unique to testcases?

a4940d1

mzuenni requested a review from mpsijm November 17, 2025 21:58

mzuenni added 2 commits November 17, 2025 23:18

update doc

fc90190

update schema?

5011ab1

mzuenni marked this pull request as ready for review November 19, 2025 16:22

mzuenni requested a review from RagnarGrootKoerkamp November 19, 2025 16:23

RagnarGrootKoerkamp approved these changes Dec 2, 2025
