Graduate Job Classification Case Study

Clavié et al., 2023 provide a case study on prompt engineering applied to a medium-scale text classification use case in a production system. Using the task of classifying whether a job is a true "entry-level job", suitable for a recent graduate, or not, they evaluate a series of prompt engineering techniques and report their results using GPT-3.5 (gpt-3.5-turbo).

The work shows that LLMs outperform all other models tested, including an extremely strong baseline in DeBERTa-V3. gpt-3.5-turbo also noticeably outperforms older GPT-3 variants on all key metrics, but requires additional output parsing, as its ability to stick to a template appears to be worse than that of the other variants.
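
Because gpt-3.5-turbo drifts from the requested template more often, its raw answers have to be mapped back to binary labels before scoring. Below is a minimal sketch of such a post-processing step, assuming simple yes/no keyword rules (the patterns are illustrative assumptions, not the authors' actual parsing code):

```python
import re
from typing import Optional

def parse_label(raw_answer: str) -> Optional[bool]:
    """Map a free-form model answer to True (entry-level) / False (not).

    The keyword patterns are illustrative assumptions; the paper does not
    publish its exact parsing rules.
    """
    text = raw_answer.strip().lower()
    # Check negative phrasings first so "not suitable" is not caught by "suitable".
    if re.search(r"\b(no|not suitable|not an? entry[- ]level)\b", text):
        return False
    if re.search(r"\b(yes|suitable|entry[- ]level)\b", text):
        return True
    return None  # unparseable answer; handle separately (e.g. re-query)
```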

The key findings of their prompt engineering approach are:

  • For tasks such as this one, where no expert knowledge is required, Few-shot CoT prompting performed worse than Zero-shot prompting in all experiments.
  • The impact of the prompt on eliciting the correct reasoning is massive. Simply asking the model to classify a given job results in an F1 score of 65.6, whereas the post-prompt engineering model achieves an F1 score of 91.7.
  • Attempting to force the model to stick to a template lowers performance in all cases (this behaviour disappeared in early testing with GPT-4, which postdates the paper).
  • Many small modifications have an outsized impact on performance.
    • The tables below show the full modifications tested.
    • Properly giving instructions and repeating the key points appears to be the biggest performance driver.
    • Something as simple as giving the model a (human) name and referring to it as such increased the F1 score by 0.6 points.

Prompt Modifications Tested

| Short name | Description |
| --- | --- |
| Baseline | Provide a job posting and ask if it is fit for a graduate. |
| CoT | Give a few examples of accurate classification before querying. |
| Zero-CoT | Ask the model to reason step-by-step before providing its answer. |
| rawinst | Give instructions about its role and the task by adding them to the user msg. |
| sysinst | Give instructions about its role and the task as a system msg. |
| bothinst | Split instructions, with the role as a system msg and the task as a user msg. |
| mock | Give task instructions by mocking a discussion in which the model acknowledges them. |
| reit | Reinforce key elements in the instructions by repeating them. |
| strict | Ask the model to answer by strictly following a given template. |
| loose | Ask for just the final answer to be given following a given template. |
| right | Ask the model to reach the right conclusion. |
| info | Provide additional information to address common reasoning failures. |
| name | Give the model a name by which we refer to it in conversation. |
| pos | Provide the model with positive feedback before querying it. |
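
To make the message-level modifications more concrete, the sketch below combines bothinst, mock, reit, name, and pos into a single chat request. The wording, the name "Frederick", and the job-posting placeholder are illustrative paraphrases, not the authors' actual prompts.

```python
from openai import OpenAI

client = OpenAI()

job_posting = "..."  # the job posting text to classify

messages = [
    # bothinst + name: role instructions go in the system msg, and the model
    # is given a human name by which it is addressed.
    {"role": "system",
     "content": "You are Frederick, an AI assistant that screens job postings "
                "for recent graduates."},
    # bothinst + reit: task instructions go in a user msg, with the key
    # constraint repeated.
    {"role": "user",
     "content": "Frederick, your task is to decide whether a job posting is a "
                "true entry-level position suitable for a recent graduate. "
                "Remember: entry-level means no prior professional experience "
                "is required, none at all."},
    # mock: a fabricated assistant turn acknowledging the instructions.
    {"role": "assistant",
     "content": "Got it. I will only label a job as entry-level if it requires "
                "no prior professional experience."},
    # pos: positive feedback before the actual query.
    {"role": "user",
     "content": "Great, that's exactly right! Now, is the following job "
                f"suitable for a recent graduate?\n\n{job_posting}"},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```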

Performance Impact of All Prompt Modifications

| Modification | Precision | Recall | F1 | Template Stickiness |
| --- | --- | --- | --- | --- |
| Baseline | 61.2 | 70.6 | 65.6 | 79% |
| CoT | 72.6 | 85.1 | 78.4 | 87% |
| Zero-CoT | 75.5 | 88.3 | 81.4 | 65% |
| +rawinst | 80 | 92.4 | 85.8 | 68% |
| +sysinst | 77.7 | 90.9 | 83.8 | 69% |
| +bothinst | 81.9 | 93.9 | 87.5 | 71% |
| +bothinst+mock | 83.3 | 95.1 | 88.8 | 74% |
| +bothinst+mock+reit | 83.8 | 95.5 | 89.3 | 75% |
| +bothinst+mock+reit+strict | 79.9 | 93.7 | 86.3 | 98% |
| +bothinst+mock+reit+loose | 80.5 | 94.8 | 87.1 | 95% |
| +bothinst+mock+reit+right | 84 | 95.9 | 89.6 | 77% |
| +bothinst+mock+reit+right+info | 84.9 | 96.5 | 90.3 | 77% |
| +bothinst+mock+reit+right+info+name | 85.7 | 96.8 | 90.9 | 79% |
| +bothinst+mock+reit+right+info+name+pos | 86.9 | 97 | 91.7 | 81% |

Template stickiness refers to how frequently the model answers in the desired format.
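
For completeness, here is one way these metrics could be computed for a single run. The expected template ("Final answer: Yes/No") and the scoring helpers are assumptions for illustration, as the paper does not publish its evaluation code.

```python
import re
from sklearn.metrics import precision_recall_fscore_support

def follows_template(raw_answer: str) -> bool:
    """Check whether an answer ends with a 'Final answer: Yes/No' line.

    The expected format is an illustrative assumption, not the paper's template.
    """
    return re.search(r"final answer:\s*(yes|no)\s*$", raw_answer.strip(),
                     flags=re.IGNORECASE) is not None

def evaluate(raw_answers, parsed_preds, gold_labels):
    """Return precision/recall/F1 plus template stickiness for one run."""
    precision, recall, f1, _ = precision_recall_fscore_support(
        gold_labels, parsed_preds, average="binary")
    stickiness = sum(follows_template(a) for a in raw_answers) / len(raw_answers)
    return {"precision": precision, "recall": recall,
            "f1": f1, "template_stickiness": stickiness}
```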