Retrieval Augmented Generation and Requirement Matching

Generating text based on a skill list to match an advert
Published

February 4, 2024

Retrieval Augmented Generation (RAG) is where retrieved documents are supplied alongside a prompt to help a large language model perform a task. If you have a question, you can first ask the language model to generate a search query, run that query against a document store, and then use the results to generate the answer to the question. This grounds the answer in facts from the document store you searched over.
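To make that flow concrete, here is a minimal sketch of the three steps. Everything in it is an assumption for illustration: `retrieve`, `answer_with_rag` and the `generate` callable are hypothetical names, and simple word overlap stands in for a real search index or vector store.

```python
import re
from typing import Callable


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query: str, documents: list[str], top_k: int = 1) -> list[str]:
    """Rank documents by how many words they share with the query (toy search)."""
    query_words = tokenize(query)
    ranked = sorted(
        documents,
        key=lambda doc: len(query_words & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]


def answer_with_rag(
    question: str,
    documents: list[str],
    generate: Callable[[str], str],
) -> str:
    """Run the query -> retrieve -> answer loop with any text-in, text-out model."""
    # Step 1: ask the model to turn the question into a search query.
    query = generate(f"Write a short search query for this question: {question}")
    # Step 2: run the query against the document store.
    context = "\n".join(retrieve(query, documents))
    # Step 3: ask the model to answer using only the retrieved text.
    return generate(
        "Answer the question using only this context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Any function that maps a prompt string to a reply string can be passed as `generate`, so the same loop works with a local model or a hosted API.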

More broadly, it shows that we can use documents as supplementary information to generate output. If we have the CV of an executive and a description of a company, can we generate a press release about the executive joining the company?

This has been done before. I’m playing with it now as a chance to test the generative power of Mistral 7B.

Dataset

The CEO that I am going to use is Giles Palmer. He has recently joined Cint as the CEO. I can take a copy of his LinkedIn profile as one document and some details about Cint as the other, and see what we get.

Generation

I’m going to use the Mistral 7B Instruct model as that works with a chat-like interface and will readily follow instructions. It’s also a chance to try out the chat templating that Hugging Face provides.
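As a rough picture of what the chat template does, the sketch below approximates the Mistral Instruct format in plain Python. This is my reading of the format and not the tokenizer's actual template, which remains the authoritative source; `format_mistral_chat` is a hypothetical helper and the exact whitespace may differ between model versions.

```python
BOS, EOS = "<s>", "</s>"


def format_mistral_chat(messages: list[dict[str, str]]) -> str:
    """Approximate the Mistral Instruct chat format as a plain string."""
    text = BOS
    for index, message in enumerate(messages):
        # Turns are expected to alternate user, assistant, user, ...
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("Conversation roles must alternate user/assistant")
        if message["role"] == "user":
            # User turns are wrapped in instruction markers.
            text += f"[INST] {message['content']} [/INST]"
        else:
            # Assistant turns are closed with the end-of-sequence token.
            text += message["content"] + EOS
    return text


print(format_mistral_chat([{"role": "user", "content": "Hello, how are you?"}]))
# prints: <s>[INST] Hello, how are you? [/INST]
```

Calling `tokenizer.apply_chat_template` does this formatting and the tokenization in one step, so the prompt string never has to be assembled by hand.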

Code
from transformers import AutoTokenizer, AutoModelForCausalLM

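tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")
# The tokenizer does not define a pad token, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token

# load_in_4bit quantizes the weights with bitsandbytes to cut memory use.
# Newer transformers releases prefer quantization_config=BitsAndBytesConfig(load_in_4bit=True).
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",
    load_in_4bit=True,
)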
Code
from transformers import AutoTokenizer, AutoModelForCausalLM
import warnings
import textwrap

warnings.filterwarnings("ignore")

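def run_model(
    model: AutoModelForCausalLM,
    tokenizer: AutoTokenizer,
    prompt: str,
    max_new_tokens: int = 100,
) -> str:
    """Send a single user message to the model and return its wrapped reply."""
    # Format the prompt with the model's chat template and tokenize it.
    chat_input = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        return_tensors="pt",
        padding="longest",
    )
    chat_input = chat_input.to(model.device)
    # Sample a continuation of up to max_new_tokens tokens.
    generated_ids = model.generate(
        chat_input,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )
    # Drop the prompt tokens so only the newly generated reply is decoded.
    output = tokenizer.decode(generated_ids[0, chat_input.shape[1]:], skip_special_tokens=True)
    # Wrap each line for display, adding a blank line after every source line.
    wrapped = [
        wrapped_line
        for line in output.splitlines()
        for wrapped_line in textwrap.wrap(line) + [""]
    ]
    return "\n".join(wrapped).strip()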
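The wrapping step at the end of `run_model` is easier to follow in isolation. The sketch below pulls it into a hypothetical `wrap_output` helper (the name and the explicit `width` parameter are mine) and shows that every source line, including an empty one, gets a blank line appended. That is why paragraphs in the generated text end up separated by two blank lines.

```python
import textwrap


def wrap_output(output: str, width: int = 70) -> str:
    """Wrap each line separately, adding a blank line after every source line."""
    wrapped = [
        wrapped_line
        for line in output.splitlines()
        # textwrap.wrap returns [] for an empty line, so it contributes only "".
        for wrapped_line in textwrap.wrap(line, width=width) + [""]
    ]
    return "\n".join(wrapped).strip()


sample = "First paragraph.\n\nSecond paragraph."
print(repr(wrap_output(sample)))
# prints: 'First paragraph.\n\n\nSecond paragraph.'
```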

To test this out we can run it on a generic chat prompt:

Code
print(
    run_model(
        model=model,
        tokenizer=tokenizer,
        prompt="Hello, how are you?",
    )
)
I'm just a computer program, so I don't have feelings or physical
sensations. I'm here to help answer any questions you have to the best
of my ability. Is there something specific you'd like to know?

Seems fine. Let’s go with the full task. To make this post readable I’ve put the markdown version of the CV in a separate file, alongside the description of the company.

Code
from pathlib import Path

GILES_CV = Path("giles.txt").read_text().strip()
CINT_DESCRIPTION = Path("cint.txt").read_text().strip()
PROMPT = f"""
I am going to provide you with two documents, one about a new CEO for a company and the other for the company.
I want you to write an exciting press release about this appointment.

{GILES_CV}

{CINT_DESCRIPTION}
""".strip()
Code
print(
    run_model(
        model=model,
        tokenizer=tokenizer,
        prompt=PROMPT,
        max_new_tokens=1_000,
    )
)
FOR IMMEDIATE RELEASE


Cint, a Global Leader in Digital Insights and Research Technology,
Announces the Appointment of New CEO: Giles Palmer


LONDON, ENGLAND, UNITED KINGDOM—April 20, 2023—Cint, a trailblazer in
digital insights and research technology, is proud to announce the
appointment of Giles Palmer as its new Chief Executive Officer (CEO).
Palmer, a distinguished entrepreneur and business leader, brings a
wealth of experience in scaling and transforming ambitious
organizations, having previously founded and led Brandwatch, a company
that grew to over $100m in revenue and 500 employees before its
acquisition by Cision.


Cint, with one of the world's largest consumer networks for digital
survey-based research, boasts millions of engaged respondents across
130 countries. Following significant growth, the company recently
acquired Lucid, a survey-based programmatic marketplace, and GapFish,
providing access to over 500,000 profiled Panellists in Germany,
Austria, and Switzerland. Palmer's proven leadership and insightful
business strategy are certain to propel Cint to new heights.


Cint, established in Stockholm in 1998, has a rapidly growing
workforce of over 1,000 employees spread across 18 global offices.
Palmler, known for his expertise in growth management and business
strategy, will lead Cint's dedicated team in delivering powerful
insights solutions to clients worldwide.


In addition to his role at Cint, Palmer is a Non-Executive Director
for several tech startups and a school governor. He is also an
investor in numerous startups and funds. Furthermore, Palmer's passion
for creativity and innovation is a testament to his success in driving
growth within the business landscape.


Palmer shared his sentiments about this exciting opportunity, "I am
thrilled to join Cint, a genuinely innovative company that has proven
its ability to adapt and grow in the digital landscape. I am humbled
by the opportunity to lead such an accomplished team and look forward
to exploring new opportunities and taking Cint to the next level."


Throughout his impressive career, Palmer has demonstrated a remarkable
ability to foster growth and drive success, culminating in the sale of
Brandwatch to Cision in 2021. As the new CEO at Cint, Palmer is poised
to revolutionize digital market research and insights solutions for
businesses worldwide.


For media inquiries, contact:

[Your Name]

[Your Email Address]

[Your Phone Number]

This seems good, although I would have referred to him as Giles rather than Palmer, and it makes a spelling mistake (“Palmler”). However, the CV that I have for Giles already lists his role at Cint. How well does it work if I remove that?
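Since the two runs differ only in which CV file is loaded, the prompt construction could be factored into a small helper. This is just a sketch: `build_prompt` and `INSTRUCTIONS` are hypothetical names, and the instruction text is copied from the prompt used in this post.

```python
from pathlib import Path

INSTRUCTIONS = """
I am going to provide you with two documents, one about a new CEO for a company and the other for the company.
I want you to write an exciting press release about this appointment.
""".strip()


def build_prompt(cv_path: str, company_path: str) -> str:
    """Combine the fixed instructions with the two documents read from disk."""
    cv = Path(cv_path).read_text().strip()
    company = Path(company_path).read_text().strip()
    # Blank lines separate the instructions, the CV and the company description.
    return f"{INSTRUCTIONS}\n\n{cv}\n\n{company}"


# Example call, assuming both files sit next to the notebook:
# PROMPT = build_prompt("giles-no-cint.txt", "cint.txt")
```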

Code
from pathlib import Path

GILES_CV = Path("giles-no-cint.txt").read_text().strip()
CINT_DESCRIPTION = Path("cint.txt").read_text().strip()
PROMPT = f"""
I am going to provide you with two documents, one about a new CEO for a company and the other for the company.
I want you to write an exciting press release about this appointment.

{GILES_CV}

{CINT_DESCRIPTION}
""".strip()
Code
print(
    run_model(
        model=model,
        tokenizer=tokenizer,
        prompt=PROMPT,
        max_new_tokens=1_000,
    )
)
FOR IMMEDIATE RELEASE


Cint, the leading global software company in digital insights and
research technology, announces the appointment of Giles Palmer as its
new Chief Executive Officer.


Giles Palmer, a seasoned entrepreneur and executive with a proven
track record of growing innovative businesses, brings his extensive
experience to Cint as it continues to redefine the market research
industry. Palmer, who most notably founded and grew Brandwatch to over
$100m in revenue and 500 employees before its acquisition by Cision,
is no stranger to building successful companies.


Beyond his impressive resume at Brandwatch, Palmer has served numerous
non-executive roles, including with HappySignals, Whatagraph, Leaf
Grow, and Cision. He also co-founded Runtime Collective and has held
positions at renowned companies like Sky Interactive, SmithKline
Beecham Corporation, Equitas, and Smith and Williamson.


Palmer shared, "I am thrilled to join Cint during an incredibly
exciting period in its growth. Their acquisition of Lucid and GapFish,
coupled with their vast consumer network, positions Cint as a market
leader in digital survey-based research. I am looking forward to
leading this talented team and further expanding Cint's global
impact."


Cint, founded in 1998 in Stockholm, has achieved significant growth
through strategic acquisitions such as Lucid and GapFish in 2021 and
2022, respectively. The combination of these acquisitions and Cint's
one million-plus engaged respondents across 130 countries has placed
Cint at the forefront of insights solutions.


As Cint's new CEO, Palmer is expected to foster continued growth and
collaboration within a workforce spanning 18 global offices. His
leadership will guide Cint in delivering powerful insights to their
ever-growing global customer base.


About Cint:

Cint is a global software leader in digital insights and research
technology. With the world's largest consumer network for digital
survey-based research, Cint is dedicated to providing valuable
insights to customers worldwide. Cint's continuous growth and
acquisitions have transformed it into a powerful platform driving
impactful market research solutions.

Yep, this seems fine. I might use this for other generation tasks that I have in mind…