@@ -336,4 +336,93 @@ def caii_check(endpoint: str, timeout: int = 3) -> requests.Response:
336336
337337 return r
338338
339+ LENDING_DATA_PROMPT = """
340+ Create profile data for LendingClub, a company that specialises in offering various types of loans to urban customers.
341+
342+ Background:
343+ LendingClub is a peer-to-peer lending platform connecting borrowers with investors. The dataset captures loan applications,
344+ borrower profiles, and outcomes to assess credit risk, predict defaults, and determine interest rates.
345+
346+
347+ Loan Record fields:
348+
349+ Each generated record must include the following fields in the exact order provided, with values generated as specified:
350+
351+ - loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department
352+ reduces the loan amount, then it will be reflected in this value.
353+ - term: The number of payments on the loan. Values are in months and can be either " 36 months" or " 60 months".
354+ - int_rate: Interest Rate on the loan
355+ - installment: The monthly payment owed by the borrower if the loan originates.
356+ - grade: LC assigned loan grade (Possible values: A, B, C, D, E, F, G)
357+ - sub_grade: LC assigned loan subgrade (Possible sub-values: 1-5, e.g. A5)
358+ - emp_title: The job title supplied by the Borrower when applying for the loan.
359+ - emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10
360+ means ten or more years.
361+ - home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report.
362+ Possible values are: RENT, OWN, MORTGAGE, ANY, OTHER
363+ - annual_inc: The self-reported annual income provided by the borrower during registration.
364+ - verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified
365+ - issue_d: The month in which the loan was funded
366+ - loan_status: Current status of the loan (Possible values: "Fully Paid", "Charged Off")
367+ - purpose: A category provided by the borrower for the loan request.
368+ - title: The loan title provided by the borrower
369+ - dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage
370+ and the requested LC loan, divided by the borrower’s self-reported monthly income.
371+ - earliest_cr_line: The month the borrower's earliest reported credit line was opened
372+ - open_acc: The number of open credit lines in the borrower's credit file.
373+ - pub_rec: Number of derogatory public records
374+ - revol_bal: Total credit revolving balance
375+ - revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available
376+ revolving credit.
377+ - total_acc: The total number of credit lines currently in the borrower's credit file
378+ - initial_list_status: The initial listing status of the loan. Possible values are: w, f
379+ - application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers
380+ - mort_acc: Number of mortgage accounts.
381+ - pub_rec_bankruptcies: Number of public record bankruptcies
382+ - address: The physical address of the person
383+
384+ In addition to the definitions above, when generating samples, adhere to the following guidelines:
385+
386+ Privacy Compliance guidelines:
387+ 1) Ensure PII from the examples, such as addresses, is not used in the generated data, to minimize privacy concerns.
388+ 2) Avoid real PII in addresses. Use generic street names and cities.
389+
390+ Formatting guidelines:
391+ 1) Use consistent decimal precision (e.g., "10000.00" for loan_amnt).
392+ 2) Dates (e.g. issue_d, earliest_cr_line) should follow the "Jan-YYYY" format.
393+ 3) term has a leading space before the number of months (e.g. " 36 months")
394+ 4) The address field is a special case where the State zipcode needs to be exactly as specified in the seed instructions.
395+ The person's address must follow the format as specified in the examples with the State zipcode coming last.
396+ 5) Any other formatting guidelines that can be inferred from the examples or field definitions but are not listed above.
397+
398+ Cross-row guidelines:
399+ 1) Generated data should maintain consistency with all statistical parameters and distributions defined in the seed instruction
400+ across records (e.g., 60% of `term` as " 36 months").
401+
402+ Cross-column guidelines:
403+ 1) Ensure logical and realistic consistency and correlations between variables. Examples include, but are not limited to:
404+ a) Grade/Sub-grade consistency: Sub-grade must match the grade (e.g., "B" grade → "B1" to "B5").
405+ b) Interest Rate vs Grade/Subgrade relationship: Higher subgrades (e.g., A5) could have higher `int_rate` than lower subgrades (e.g., A3).
406+ c) Mortgage Consistency: `mort_acc` should be 1 or more if `home_ownership` is `MORTGAGE`.
407+ d) Open Accounts: `open_acc` ≤ `total_acc`.
408+
409+ Data distribution guidelines:
410+ 1) Continuous Variables (e.g., `loan_amnt`, `annual_inc`): Adhere to the mean and standard deviation given in the seed
411+ instructions for each variable.
412+ 2) Categorical variables (e.g., `term`, `home_ownership`): Use probability distributions given in the seed instructions
413+ (e.g. 60% for " 36 months", 40% for " 60 months").
414+ 3) Discrete Variables (e.g., `pub_rec`, `mort_acc`): Adhere to value ranges and statistical parameters
415+ provided in the seed instructions.
416+ 4) Any other logical data distribution guidelines that can be inferred from the seed instructions or field definitions
417+ and are not specified above.
418+
419+ Background knowledge and realism guidelines:
420+ 1) Ensure fields such as interest rates reflect real-world interest rates at the time the loan is issued.
421+ 2) Generate values that are plausible (e.g., `annual_inc` ≤ $500,000 for most `emp_length` ranges).
422+ 3) Avoid unrealistic values (e.g., `revol_util` as "200%" is unrealistic).
423+ 4) Ensure that the generated data is realistic and plausible, avoiding extreme or impossible values.
424+ 5) Ensure that the generated data is diverse and not repetitive, avoiding identical or very similar records.
425+ 6) Ensure that the generated data is coherent and consistent, avoiding contradictions or inconsistencies between fields.
426+ 7) Ensure that the generated data is relevant to the LendingClub use case and adheres to the guidelines provided."""
427+
339428
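
Review note: several of the guidelines in LENDING_DATA_PROMPT (the term values, grade/sub-grade matching, `mort_acc` vs `home_ownership`, `open_acc` ≤ `total_acc`, and the "Jan-YYYY" date format) are mechanical enough to check on generated output. The following is a minimal sketch of such a check, not part of this commit. The helper names `validate_record` and `term_share` are hypothetical, and it assumes each generated record has already been parsed into a dict keyed by the prompt's field names with the count fields as numbers. It also assumes "Jan-YYYY" means a three-letter English month abbreviation followed by a four-digit year.

```python
import re

# Assumption: records are dicts keyed by the field names in the prompt, with
# count fields (mort_acc, open_acc, total_acc) already parsed to numbers.
GRADES = set("ABCDEFG")
TERMS = {" 36 months", " 60 months"}  # the leading space is part of the value
# Assumption: "Jan-YYYY" = three-letter English month, hyphen, 4-digit year.
DATE_RE = re.compile(r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{4}")


def validate_record(rec):
    """Return the names of fields that violate a prompt guideline (hypothetical helper)."""
    problems = []
    # Formatting guideline 3: term keeps its leading space.
    if rec["term"] not in TERMS:
        problems.append("term")
    # Cross-column guideline a: sub_grade is the grade letter plus a digit 1-5.
    grade = rec["grade"]
    if grade not in GRADES or not re.fullmatch(grade + "[1-5]", rec["sub_grade"]):
        problems.append("sub_grade")
    # Cross-column guideline c: a MORTGAGE owner has at least one mortgage account.
    if rec["home_ownership"] == "MORTGAGE" and rec["mort_acc"] < 1:
        problems.append("mort_acc")
    # Cross-column guideline d: open accounts cannot exceed total accounts.
    if rec["open_acc"] > rec["total_acc"]:
        problems.append("open_acc")
    # Formatting guideline 2: date fields follow the "Jan-YYYY" pattern.
    for field in ("issue_d", "earliest_cr_line"):
        if not DATE_RE.fullmatch(rec[field]):
            problems.append(field)
    return problems


def term_share(records, term=" 36 months"):
    """Fraction of records with the given term, for the cross-row guideline."""
    return sum(r["term"] == term for r in records) / len(records)
```

A batch could be screened by calling `validate_record` on each row and comparing `term_share(batch)` with the share requested in the seed instructions. The tolerance for that comparison is left to the caller, since the prompt does not define one.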