Graduate Job Classification Case Study

Graduate Job Classification Case Study

Clavié et al., 2023 (opens in a new tab) provide a case-study on prompt-engineering applied to a medium-scale text classification use-case in a production system. Using the task of classifying whether a job is a true "entry-level job", suitable for a recent graduate, or not, they evaluated a series of prompt engineering techniques and report their results using GPT-3.5 (gpt-3.5-turbo).

The work shows that LLMs outperforms all other models tested, including an extremely strong baseline in DeBERTa-V3. gpt-3.5-turbo also noticeably outperforms older GPT3 variants in all key metrics, but requires additional output parsing as its ability to stick to a template appears to be worse than the other variants.

The key findings of their prompt engineering approach are:

  • For tasks such as this one, where no expert knowledge is required, Few-shot CoT prompting performed worse than Zero-shot prompting in all experiments.
  • The impact of the prompt on eliciting the correct reasoning is massive. Simply asking the model to classify a given job results in an F1 score of 65.6, whereas the post-prompt engineering model achieves an F1 score of 91.7.
  • Attempting to force the model to stick to a template lowers performance in all cases (this behaviour disappears in early testing with GPT-4, which are posterior to the paper).
  • Many small modifications have an outsized impact on performance.
    • The tables below show the full modifications tested.
    • Properly giving instructions and repeating the key points appears to be the biggest performance driver.
    • Something as simple as giving the model a (human) name and referring to it as such increased F1 score by 0.6pts.

Prompt Modifications Tested

Short nameDescription
BaselineProvide a a job posting and asking if it is fit for a graduate.
CoTGive a few examples of accurate classification before querying.
Zero-CoTAsk the model to reason step-by-step before providing its answer.
rawinstGive instructions about its role and the task by adding to the user msg.
sysinstGive instructions about its role and the task as a system msg.
bothinstSplit instructions with role as a system msg and task as a user msg.
mockGive task instructions by mocking a discussion where it acknowledges them.
reitReinforce key elements in the instructions by repeating them.
strictAsk the model to answer by strictly following a given template.
looseAsk for just the final answer to be given following a given template.
rightAsking the model to reach the right conclusion.
infoProvide additional information to address common reasoning failures.
nameGive the model a name by which we refer to it in conversation.
posProvide the model with positive feedback before querying it.

Performance Impact of All Prompt Modifications

PrecisionRecallF1Template Stickiness

Template stickiness refers to how frequently the model answers in the desired format.