Commit 915f553

Khauneesh-AI authored and Keivan Vosoughi committed
analyzer for prompt assist
1 parent 847cf88 commit 915f553

7 files changed (+692, -70 lines changed)

app/core/config.py

Lines changed: 89 additions & 0 deletions
@@ -336,4 +336,93 @@ def caii_check(endpoint: str, timeout: int = 3) -> requests.Response:
 
     return r
 
+LENDING_DATA_PROMPT = """
+Create profile data for the LendingClub company which specialises in lending various types of loans to urban customers.
+
+Background:
+LendingClub is a peer-to-peer lending platform connecting borrowers with investors. The dataset captures loan applications,
+borrower profiles, and outcomes to assess credit risk, predict defaults, and determine interest rates.
+
+
+Loan Record fields:
+
+Each generated record must include the following fields in the exact order provided, with values generated as specified:
+
+- loan_amnt: The listed amount of the loan applied for by the borrower. If at some point in time, the credit department
+reduces the loan amount, then it will be reflected in this value.
+- term: The number of payments on the loan. Values are in months and can be either " 36 months" or " 60 months".
+- int_rate: Interest Rate on the loan
+- installment: The monthly payment owed by the borrower if the loan originates.
+- grade: LC assigned loan grade (Possible values: A, B, C, D, E, F, G)
+- sub_grade: LC assigned loan subgrade (Possible sub-values: 1-5, e.g. A5)
+- emp_title: The job title supplied by the Borrower when applying for the loan.
+- emp_length: Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10
+means ten or more years.
+- home_ownership: The home ownership status provided by the borrower during registration or obtained from the credit report.
+Possible values are: RENT, OWN, MORTGAGE, ANY, OTHER
+- annual_inc: The self-reported annual income provided by the borrower during registration.
+- verification_status: Indicates if income was verified by LC, not verified, or if the income source was verified
+- issue_d: The month in which the loan was funded
+- loan_status: Current status of the loan (Possible values: "Fully Paid", "Charged Off")
+- purpose: A category provided by the borrower for the loan request.
+- title: The loan title provided by the borrower
+- dti: A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage
+and the requested LC loan, divided by the borrower’s self-reported monthly income.
+- earliest_cr_line: The month the borrower's earliest reported credit line was opened
+- open_acc: The number of open credit lines in the borrower's credit file.
+- pub_rec: Number of derogatory public records
+- revol_bal: Total credit revolving balance
+- revol_util: Revolving line utilization rate, or the amount of credit the borrower is using relative to all available
+revolving credit.
+- total_acc: The total number of credit lines currently in the borrower's credit file
+- initial_list_status: The initial listing status of the loan. Possible values are: w, f
+- application_type: Indicates whether the loan is an individual application or a joint application with two co-borrowers
+- mort_acc: Number of mortgage accounts.
+- pub_rec_bankruptcies: Number of public record bankruptcies
+- address: The physical address of the person
+
+In addition to the definitions above, when generating samples, adhere to the following guidelines:
+
+Privacy Compliance guidelines:
+1) Ensure PII from examples such as addresses are not used in the generated data to minimize any privacy concerns.
+2) Avoid real PII in addresses. Use generic street names and cities.
+
+Formatting guidelines:
+1) Use consistent decimal precision (e.g., "10000.00" for loan_amnt).
+2) Dates (e.g. issue_d, earliest_cr_line) should follow the "Jan-YYYY" format.
+3) term has a leading space before the number of months (i.e. " 36 months")
+4) The address field is a special case where the State zipcode needs to be exactly as specified in the seed instructions.
+The person's address must follow the format as specified in the examples with the State zipcode coming last.
+5) Any other formatting guidelines that can be inferred from the examples or field definitions but are not listed above.
+
+Cross-row guidelines:
+1) Generated data should maintain consistency with all statistical parameters and distributions defined in the seed instruction
+across records (e.g., 60% of `term` as " 36 months").
+
+Cross-column guidelines:
+1) Ensure logical and realistic consistency and correlations between variables. Examples include but are not limited to:
+a) Grade/Sub-grade consistency: Sub-grade must match the grade (e.g., "B" grade → "B1" to "B5").
+b) Interest Rate vs Grade/Subgrade relationship: Higher subgrades (e.g., A5) could have higher `int_rate` than lower subgrades (e.g., A3).
+c) Mortgage Consistency: `mort_acc` should be 1 or more if `home_ownership` is `MORTGAGE`.
+d) Open Accounts: `open_acc` ≤ `total_acc`.
+
+Data distribution guidelines:
+1) Continuous Variables (e.g., `loan_amnt`, `annual_inc`): Adhere to the mean and standard deviation given in the seed
+instructions for each variable.
+2) Categorical variables (e.g., `term`, `home_ownership`): Use probability distributions given in the seed instructions
+(e.g. 60% for " 36 months", 40% for " 60 months").
+3) Discrete Variables (e.g., `pub_rec`, `mort_acc`): Adhere to value ranges and statistical parameters
+provided in the seed instructions.
+4) Any other logical data distribution guidelines that can be inferred from the seed instructions or field definitions
+and are not specified above.
+
+Background knowledge and realism guidelines:
+1) Ensure fields such as interest rates reflect real-world interest rates at the time the loan is issued.
+2) Generate values that are plausible (e.g., `annual_inc` ≤ $500,000 for most `emp_length` ranges).
+3) Avoid unrealistic values (e.g., `revol_util` as "200%" is unrealistic).
+4) Ensure that the generated data is realistic and plausible, avoiding extreme or impossible values.
+5) Ensure that the generated data is diverse and not repetitive, avoiding identical or very similar records.
+6) Ensure that the generated data is coherent and consistent, avoiding contradictions or inconsistencies between fields.
+7) Ensure that the generated data is relevant to the LendingClub use case and adheres to the guidelines provided."""
+
 
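
Several of the formatting, cross-column, and cross-row guidelines in `LENDING_DATA_PROMPT` are mechanical enough to check once records have been generated. The following is a minimal sketch of such a check, not part of this commit. It assumes each generated record is a dict keyed by the field names in the prompt, and that "Jan-YYYY" means a three-letter English month abbreviation followed by a four-digit year. The names `validate_record` and `category_shares` are hypothetical.

```python
import re
from collections import Counter
from typing import Dict, List

# Hypothetical post-generation checks; none of these names exist in the commit.
# Assumption: "Jan-YYYY" means a 3-letter English month plus a 4-digit year.
DATE_RE = re.compile(
    r"^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)-\d{4}$"
)
VALID_TERMS = {" 36 months", " 60 months"}  # leading space is intentional
VALID_GRADES = set("ABCDEFG")


def validate_record(rec: dict) -> List[str]:
    """Return the guideline violations found in one generated record."""
    errors = []

    # Formatting guideline 3: term keeps its leading space.
    if rec.get("term") not in VALID_TERMS:
        errors.append("term must be ' 36 months' or ' 60 months'")

    # Formatting guideline 2: dates follow the "Jan-YYYY" pattern.
    for field in ("issue_d", "earliest_cr_line"):
        if not DATE_RE.match(str(rec.get(field, ""))):
            errors.append(f"{field} is not in Mon-YYYY format")

    # Cross-column a): sub_grade is the grade letter plus a digit 1-5.
    grade = str(rec.get("grade", ""))
    sub_grade = str(rec.get("sub_grade", ""))
    if grade not in VALID_GRADES:
        errors.append("grade must be one of A-G")
    elif not re.fullmatch(re.escape(grade) + r"[1-5]", sub_grade):
        errors.append("sub_grade does not match grade")

    # Cross-column c) and d): numeric consistency between account counts.
    try:
        mort_acc = float(rec["mort_acc"])
        open_acc = float(rec["open_acc"])
        total_acc = float(rec["total_acc"])
    except (KeyError, TypeError, ValueError):
        errors.append("mort_acc, open_acc and total_acc must be numeric")
        return errors
    if rec.get("home_ownership") == "MORTGAGE" and mort_acc < 1:
        errors.append("mort_acc must be >= 1 when home_ownership is MORTGAGE")
    if open_acc > total_acc:
        errors.append("open_acc must not exceed total_acc")

    return errors


def category_shares(records: List[dict], field: str) -> Dict[str, float]:
    """Observed share of each value of `field`, for cross-row comparison
    against the distribution requested in the seed instructions."""
    counts = Counter(rec.get(field) for rec in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}
```

A caller could run `validate_record` on every generated row and compare `category_shares(rows, "term")` with the target split (for example 60% / 40%) within a tolerance of its choosing. The interest-rate and realism guidelines are judgment calls and are deliberately left out of this sketch.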
