Title Generator with Simple Transformers

I want to see how easy simple transformers is to use for context-sensitive text generation
Published

October 15, 2021

At work I’ve been speaking with someone who wants to generate titles. This sort of content generation is interesting because the titles have to relate to a context. I thought it was an interesting challenge and a good chance to try out a generative model.

The model I want to investigate looks like this:

[Figure: the transformer encoder-decoder architecture diagram]

This is a pretty notable picture that first turned up in the Attention Is All You Need paper (Vaswani et al. 2017). The idea here is that the encoder is the left side of the picture and the decoder is on the right. Encoding takes the input and uses it to provide context to the decoder. The decoder then takes all of the outputs that have been generated so far and uses them and the context to generate the next token.

Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. “Attention Is All You Need.” https://arxiv.org/abs/1706.03762.

I wonder how accurate my description is?
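
Regardless, the loop I have in mind can be sketched in plain Python. This is a toy stand-in, not the transformers API; the encoder and decoder here are fake functions that just echo the input, purely to show how the context and the already-generated tokens feed each step:

def encode(tokens):
    # toy "encoder": the context is just the input tokens themselves
    return tokens

def decode_next(generated, context):
    # toy "decoder": looks at everything generated so far plus the context
    # and picks the next token (here it simply echoes the context back)
    position = len(generated) - 1  # skip the start token
    return context[position] if position < len(context) else "<eos>"

context = encode(["a", "title", "about", "transformers"])
generated = ["<bos>"]
while generated[-1] != "<eos>":
    generated.append(decode_next(generated, context))

print(generated)  # ['<bos>', 'a', 'title', 'about', 'transformers', '<eos>']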

Anyway, the corresponding structure in huggingface is the Encoder Decoder Model. This is formed of two transformer models, as in the picture above. If you pay attention you can see that the two sides are nearly identical: the decoder adds the masked multi-head attention block at the start (and its second attention block attends to the encoder output), plus the classifier at the end. This means that the same pretrained transformer body can be duplicated to serve as both the encoder and the decoder.
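
As a minimal sketch of that duplication in huggingface (bert-base-uncased is used purely as a stand-in checkpoint here):

from transformers import EncoderDecoderModel

# copy the same pretrained body into both halves; the decoder copy is
# configured with causal (masked) self-attention, cross-attention to the
# encoder output, and a language modelling head on top
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)
print(model.config.is_encoder_decoder)  # True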


Pretrained Models

Given an encoder-decoder model I now need to find a pretrained model to use with it. Looking through the huggingface models I can see a big list of them.

Since the generation of titles can be viewed as a summarization task (e.g. if I were generating the title from the body of the text), I’m going to try generating with the bert-small2bert-small-finetuned-cnn_daily_mail-summarization model.

Generation Example

For now I’ll just copy the example code directly.

Code
#hide_output
from transformers import BertTokenizerFast, EncoderDecoderModel
import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizerFast.from_pretrained(
    "mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization"
)
model = EncoderDecoderModel.from_pretrained(
    "mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization"
).to(device)

def generate_summary(text):
    # cut off at BERT max length 512
    inputs = tokenizer(
        [text],
        padding="max_length",
        truncation=True,
        max_length=512, 
        return_tensors="pt"
    )
    input_ids = inputs.input_ids.to(device)
    attention_mask = inputs.attention_mask.to(device)

    output = model.generate(input_ids, attention_mask=attention_mask)

    return tokenizer.decode(output[0], skip_special_tokens=True)
Code
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained(
    "mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization"
)
Code
text = "your text to be summarized here..."
generate_summary(text)
"in the u. s., you can't wait until you're in the middle of the night. you'll be able to use the weekly newsquiz to test your knowledge of stories you saw on cnn ireport on cnn. com / heroes."

It would be nice to use this by passing the title as the input and having it regenerate the title. Then I can try cutting the title down to see what it does.
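
A sketch of that experiment, reusing the generate_summary helper above (not run here, so no output is shown):

title = "Title Generator with Simple Transformers"
words = title.split()

# regenerate from the full title, then from progressively shorter prefixes
for length in range(len(words), 0, -1):
    prefix = " ".join(words[:length])
    print(prefix, "->", generate_summary(prefix))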

Generate Method Details

The generate method ultimately calls the model repeatedly, using various forms of search like beam search. To confirm that the input_ids provided are only used for the context, we can explore it a little.
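
Those strategies are selected through arguments to generate. For example, something like the following (the parameter names are taken from the signature shown below) would switch from greedy decoding to beam search with a length cap:

inputs = tokenizer(["some text to summarize..."], return_tensors="pt").to(device)

output = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    num_beams=5,          # beam search instead of greedy decoding
    max_length=32,        # cap the length of the generated sequence
    early_stopping=True,  # stop once enough beams have finished
)
print(tokenizer.decode(output[0], skip_special_tokens=True))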

If you run model.generate?? you can see the source, which is extremely long. As with all huggingface code it is quite readable. Since it’s so long it’s folded here, and I’ll pull out the most important bits afterwards.

Signature:
model.generate(
    input_ids: Optional[torch.LongTensor] = None,
    max_length: Optional[int] = None,
    min_length: Optional[int] = None,
    do_sample: Optional[bool] = None,
    early_stopping: Optional[bool] = None,
    num_beams: Optional[int] = None,
    temperature: Optional[float] = None,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repetition_penalty: Optional[float] = None,
    bad_words_ids: Optional[Iterable[int]] = None,
    bos_token_id: Optional[int] = None,
    pad_token_id: Optional[int] = None,
    eos_token_id: Optional[int] = None,
    length_penalty: Optional[float] = None,
    no_repeat_ngram_size: Optional[int] = None,
    encoder_no_repeat_ngram_size: Optional[int] = None,
    num_return_sequences: Optional[int] = None,
    max_time: Optional[float] = None,
    max_new_tokens: Optional[int] = None,
    decoder_start_token_id: Optional[int] = None,
    use_cache: Optional[bool] = None,
    num_beam_groups: Optional[int] = None,
    diversity_penalty: Optional[float] = None,
    prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    output_scores: Optional[bool] = None,
    return_dict_in_generate: Optional[bool] = None,
    forced_bos_token_id: Optional[int] = None,
    forced_eos_token_id: Optional[int] = None,
    remove_invalid_values: Optional[bool] = None,
    synced_gpus: Optional[bool] = None,
    **model_kwargs,
) -> Union[transformers.generation_utils.GreedySearchEncoderDecoderOutput, transformers.generation_utils.GreedySearchDecoderOnlyOutput, transformers.generation_utils.SampleEncoderDecoderOutput, transformers.generation_utils.SampleDecoderOnlyOutput, transformers.generation_utils.BeamSearchEncoderDecoderOutput, transformers.generation_utils.BeamSearchDecoderOnlyOutput, transformers.generation_utils.BeamSampleEncoderDecoderOutput, transformers.generation_utils.BeamSampleDecoderOnlyOutput, torch.LongTensor]
Source:   
    @torch.no_grad()
    def generate(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        max_length: Optional[int] = None,
        min_length: Optional[int] = None,
        do_sample: Optional[bool] = None,
        early_stopping: Optional[bool] = None,
        num_beams: Optional[int] = None,
        temperature: Optional[float] = None,
        top_k: Optional[int] = None,
        top_p: Optional[float] = None,
        repetition_penalty: Optional[float] = None,
        bad_words_ids: Optional[Iterable[int]] = None,
        bos_token_id: Optional[int] = None,
        pad_token_id: Optional[int] = None,
        eos_token_id: Optional[int] = None,
        length_penalty: Optional[float] = None,
        no_repeat_ngram_size: Optional[int] = None,
        encoder_no_repeat_ngram_size: Optional[int] = None,
        num_return_sequences: Optional[int] = None,
        max_time: Optional[float] = None,
        max_new_tokens: Optional[int] = None,
        decoder_start_token_id: Optional[int] = None,
        use_cache: Optional[bool] = None,
        num_beam_groups: Optional[int] = None,
        diversity_penalty: Optional[float] = None,
        prefix_allowed_tokens_fn: Optional[Callable[[int, torch.Tensor], List[int]]] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        output_scores: Optional[bool] = None,
        return_dict_in_generate: Optional[bool] = None,
        forced_bos_token_id: Optional[int] = None,
        forced_eos_token_id: Optional[int] = None,
        remove_invalid_values: Optional[bool] = None,
        synced_gpus: Optional[bool] = None,
        **model_kwargs,
    ) -> Union[GreedySearchOutput, SampleOutput, BeamSearchOutput, BeamSampleOutput, torch.LongTensor]:
        r"""
        Generates sequences for models with a language modeling head. The method currently supports greedy decoding,
        multinomial sampling, beam-search decoding, and beam-search multinomial sampling.
        Apart from :obj:`input_ids` and :obj:`attention_mask`, all the arguments below will default to the value of the
        attribute of the same name inside the :class:`~transformers.PretrainedConfig` of the model. The default values
        indicated are the default values of those config.
        Most of these parameters are explained in more detail in `this blog post
        <https://huggingface.co/blog/how-to-generate>`__.
        Parameters:
            input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
                The sequence used as a prompt for the generation. If :obj:`None` the method initializes it as an empty
                :obj:`torch.LongTensor` of shape :obj:`(1,)`.
            max_length (:obj:`int`, `optional`, defaults to :obj:`model.config.max_length`):
                The maximum length of the sequence to be generated.
            max_new_tokens (:obj:`int`, `optional`, defaults to None):
                The maximum numbers of tokens to generate, ignore the current number of tokens. Use either
                :obj:`max_new_tokens` or :obj:`max_length` but not both, they serve the same purpose.
            min_length (:obj:`int`, `optional`, defaults to 10):
                The minimum length of the sequence to be generated.
            do_sample (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether or not to use sampling ; use greedy decoding otherwise.
            early_stopping (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether to stop the beam search when at least ``num_beams`` sentences are finished per batch or not.
            num_beams (:obj:`int`, `optional`, defaults to 1):
                Number of beams for beam search. 1 means no beam search.
            temperature (:obj:`float`, `optional`, defaults to 1.0):
                The value used to module the next token probabilities.
            top_k (:obj:`int`, `optional`, defaults to 50):
                The number of highest probability vocabulary tokens to keep for top-k-filtering.
            top_p (:obj:`float`, `optional`, defaults to 1.0):
                If set to float < 1, only the most probable tokens with probabilities that add up to :obj:`top_p` or
                higher are kept for generation.
            repetition_penalty (:obj:`float`, `optional`, defaults to 1.0):
                The parameter for repetition penalty. 1.0 means no penalty. See `this paper
                <https://arxiv.org/pdf/1909.05858.pdf>`__ for more details.
            pad_token_id (:obj:`int`, `optional`):
                The id of the `padding` token.
            bos_token_id (:obj:`int`, `optional`):
                The id of the `beginning-of-sequence` token.
            eos_token_id (:obj:`int`, `optional`):
                The id of the `end-of-sequence` token.
            length_penalty (:obj:`float`, `optional`, defaults to 1.0):
                Exponential penalty to the length. 1.0 means no penalty. Set to values < 1.0 in order to encourage the
                model to generate shorter sequences, to a value > 1.0 in order to encourage the model to produce longer
                sequences.
            no_repeat_ngram_size (:obj:`int`, `optional`, defaults to 0):
                If set to int > 0, all ngrams of that size can only occur once.
            encoder_no_repeat_ngram_size (:obj:`int`, `optional`, defaults to 0):
                If set to int > 0, all ngrams of that size that occur in the ``encoder_input_ids`` cannot occur in the
                ``decoder_input_ids``.
            bad_words_ids(:obj:`List[List[int]]`, `optional`):
                List of token ids that are not allowed to be generated. In order to get the tokens of the words that
                should not appear in the generated text, use :obj:`tokenizer(bad_word,
                add_prefix_space=True).input_ids`.
            num_return_sequences(:obj:`int`, `optional`, defaults to 1):
                The number of independently computed returned sequences for each element in the batch.
            max_time(:obj:`float`, `optional`, defaults to None):
                The maximum amount of time you allow the computation to run for in seconds. generation will still
                finish the current pass after allocated time has been passed.
            attention_mask (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
                Mask to avoid performing attention on padding token indices. Mask values are in ``[0, 1]``, 1 for
                tokens that are not masked, and 0 for masked tokens. If not provided, will default to a tensor the same
                shape as :obj:`input_ids` that masks the pad token. `What are attention masks?
                <../glossary.html#attention-mask>`__
            decoder_start_token_id (:obj:`int`, `optional`):
                If an encoder-decoder model starts decoding with a different token than `bos`, the id of that token.
            use_cache: (:obj:`bool`, `optional`, defaults to :obj:`True`):
                Whether or not the model should use the past last key/values attentions (if applicable to the model) to
                speed up decoding.
            num_beam_groups (:obj:`int`, `optional`, defaults to 1):
                Number of groups to divide :obj:`num_beams` into in order to ensure diversity among different groups of
                beams. `this paper <https://arxiv.org/pdf/1610.02424.pdf>`__ for more details.
            diversity_penalty (:obj:`float`, `optional`, defaults to 0.0):
                This value is subtracted from a beam's score if it generates a token same as any beam from other group
                at a particular time. Note that :obj:`diversity_penalty` is only effective if ``group beam search`` is
                enabled.
            prefix_allowed_tokens_fn: (:obj:`Callable[[int, torch.Tensor], List[int]]`, `optional`):
                If provided, this function constraints the beam search to allowed tokens only at each step. If not
                provided no constraint is applied. This function takes 2 arguments: the batch ID :obj:`batch_id` and
                :obj:`input_ids`. It has to return a list with the allowed tokens for the next generation step
                conditioned on the batch ID :obj:`batch_id` and the previously generated tokens :obj:`inputs_ids`. This
                argument is useful for constrained generation conditioned on the prefix, as described in
                `Autoregressive Entity Retrieval <https://arxiv.org/abs/2010.00904>`__.
            output_attentions (:obj:`bool`, `optional`, defaults to `False`):
                Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
                returned tensors for more details.
            output_hidden_states (:obj:`bool`, `optional`, defaults to `False`):
                Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors
                for more details.
            output_scores (:obj:`bool`, `optional`, defaults to `False`):
                Whether or not to return the prediction scores. See ``scores`` under returned tensors for more details.
            return_dict_in_generate (:obj:`bool`, `optional`, defaults to `False`):
                Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
            forced_bos_token_id (:obj:`int`, `optional`):
                The id of the token to force as the first generated token after the :obj:`decoder_start_token_id`.
                Useful for multilingual models like :doc:`mBART <../model_doc/mbart>` where the first generated token
                needs to be the target language token.
            forced_eos_token_id (:obj:`int`, `optional`):
                The id of the token to force as the last generated token when :obj:`max_length` is reached.
            remove_invalid_values (:obj:`bool`, `optional`):
                Whether to remove possible `nan` and `inf` outputs of the model to prevent the generation method to
                crash. Note that using ``remove_invalid_values`` can slow down generation.
            synced_gpus (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
            model_kwargs:
                Additional model specific kwargs will be forwarded to the :obj:`forward` function of the model. If the
                model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific
                kwargs should be prefixed with `decoder_`.
        Return:
            :class:`~transformers.file_utils.ModelOutput` or :obj:`torch.LongTensor`: A
            :class:`~transformers.file_utils.ModelOutput` (if ``return_dict_in_generate=True`` or when
            ``config.return_dict_in_generate=True``) or a :obj:`torch.FloatTensor`.
                If the model is `not` an encoder-decoder model (``model.config.is_encoder_decoder=False``), the
                possible :class:`~transformers.file_utils.ModelOutput` types are:
                    - :class:`~transformers.generation_utils.GreedySearchDecoderOnlyOutput`,
                    - :class:`~transformers.generation_utils.SampleDecoderOnlyOutput`,
                    - :class:`~transformers.generation_utils.BeamSearchDecoderOnlyOutput`,
                    - :class:`~transformers.generation_utils.BeamSampleDecoderOnlyOutput`
                If the model is an encoder-decoder model (``model.config.is_encoder_decoder=True``), the possible
                :class:`~transformers.file_utils.ModelOutput` types are:
                    - :class:`~transformers.generation_utils.GreedySearchEncoderDecoderOutput`,
                    - :class:`~transformers.generation_utils.SampleEncoderDecoderOutput`,
                    - :class:`~transformers.generation_utils.BeamSearchEncoderDecoderOutput`,
                    - :class:`~transformers.generation_utils.BeamSampleEncoderDecoderOutput`
        Examples::
            >>> from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForSeq2SeqLM
            >>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
            >>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
            >>> # do greedy decoding without providing a prompt
            >>> outputs = model.generate(max_length=40)
            >>> print("Generated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
            >>> tokenizer = AutoTokenizer.from_pretrained("t5-base")
            >>> model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")
            >>> document = (
            ... "at least two people were killed in a suspected bomb attack on a passenger bus "
            ... "in the strife-torn southern philippines on monday , the military said."
            ... )
            >>> # encode input context
            >>> input_ids = tokenizer(document, return_tensors="pt").input_ids
            >>> # generate 3 independent sequences using beam search decoding (5 beams)
            >>> # with T5 encoder-decoder model conditioned on short news article.
            >>> outputs = model.generate(input_ids=input_ids, num_beams=5, num_return_sequences=3)
            >>> print("Generated:", tokenizer.batch_decode(outputs, skip_special_tokens=True))
            >>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
            >>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
            >>> input_context = "The dog"
            >>> # encode input context
            >>> input_ids = tokenizer(input_context, return_tensors="pt").input_ids
            >>> # generate 3 candidates using sampling
            >>> outputs = model.generate(input_ids=input_ids, max_length=20, num_return_sequences=3, do_sample=True)
            >>> print("Generated:", tokenizer.batch_decode(outputs, skip_special_tokens=True))
            >>> tokenizer = AutoTokenizer.from_pretrained("ctrl")
            >>> model = AutoModelForCausalLM.from_pretrained("ctrl")
            >>> # "Legal" is one of the control codes for ctrl
            >>> input_context = "Legal My neighbor is"
            >>> # encode input context
            >>> input_ids = tokenizer(input_context, return_tensors="pt").input_ids
            >>> outputs = model.generate(input_ids=input_ids, max_length=20, repetition_penalty=1.2)
            >>> print("Generated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
            >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
            >>> model = AutoModelForCausalLM.from_pretrained("gpt2")
            >>> input_context = "My cute dog"
            >>> # get tokens of words that should not be generated
            >>> bad_words_ids = [tokenizer(bad_word, add_prefix_space=True).input_ids for bad_word in ["idiot", "stupid", "shut up"]]
            >>> # encode input context
            >>> input_ids = tokenizer(input_context, return_tensors="pt").input_ids
            >>> # generate sequences without allowing bad_words to be generated
            >>> outputs = model.generate(input_ids=input_ids, max_length=20, do_sample=True, bad_words_ids=bad_words_ids)
            >>> print("Generated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
        """
        # set init values
        if max_length is None and max_new_tokens is None:
            # Both are None, default
            max_length = self.config.max_length
        elif max_length is not None and max_new_tokens is not None:
            # Both are set, this is odd, raise a warning
            warnings.warn(
                "Both `max_length` and `max_new_tokens` have been set but they serve the same purpose.", UserWarning
            )
        max_length = max_length if max_length is not None else self.config.max_length
        num_beams = num_beams if num_beams is not None else self.config.num_beams
        num_beam_groups = num_beam_groups if num_beam_groups is not None else self.config.num_beam_groups
        do_sample = do_sample if do_sample is not None else self.config.do_sample
        num_return_sequences = (
            num_return_sequences if num_return_sequences is not None else self.config.num_return_sequences
        )
        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
        bos_token_id = bos_token_id if bos_token_id is not None else self.config.bos_token_id
        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
        output_scores = output_scores if output_scores is not None else self.config.output_scores
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict_in_generate = (
            return_dict_in_generate if return_dict_in_generate is not None else self.config.return_dict_in_generate
        )
        model_kwargs["output_attentions"] = output_attentions
        model_kwargs["output_hidden_states"] = output_hidden_states
        if input_ids is None and "inputs_embeds" not in model_kwargs:
            # init `input_ids` with bos_token_id
            input_ids = self._prepare_input_ids_for_generation(bos_token_id, model_kwargs.get("encoder_outputs"))
        if model_kwargs.get("attention_mask", None) is None:
            # init `attention_mask` depending on `pad_token_id`
            model_kwargs["attention_mask"] = self._prepare_attention_mask_for_generation(
                input_ids, pad_token_id, eos_token_id
            )
        # special case if pad_token_id is not defined
        if pad_token_id is None and eos_token_id is not None:
            logger.warning(f"Setting `pad_token_id` to `eos_token_id`:{eos_token_id} for open-end generation.")
            pad_token_id = eos_token_id
        # Storing encoder_input_ids for logits_processor that could use them
        encoder_input_ids = input_ids if self.config.is_encoder_decoder else None
        if self.config.is_encoder_decoder:
            # add encoder_outputs to model_kwargs
            model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)
            # set input_ids as decoder_input_ids
            if "decoder_input_ids" in model_kwargs:
                input_ids = model_kwargs.pop("decoder_input_ids")
            else:
                input_ids = self._prepare_decoder_input_ids_for_generation(
                    input_ids, decoder_start_token_id=decoder_start_token_id, bos_token_id=bos_token_id
                )
            if "encoder_outputs" not in model_kwargs or not isinstance(model_kwargs["encoder_outputs"], ModelOutput):
                raise ValueError("Make sure that `model_kwargs` include `encoder_outputs` of type `ModelOutput`.")
        if input_ids.shape[-1] >= max_length:
            input_ids_string = "decoder_input_ids" if self.config.is_encoder_decoder else "input_ids"
            logger.warning(
                f"Input length of {input_ids_string} is {input_ids.shape[-1]}, but ``max_length`` is set to {max_length}."
                "This can lead to unexpected behavior. You should consider increasing ``config.max_length`` or ``max_length``."
            )
        # determine generation mode
        is_greedy_gen_mode = (num_beams == 1) and (num_beam_groups == 1) and do_sample is False
        is_sample_gen_mode = (num_beams == 1) and (num_beam_groups == 1) and do_sample is True
        is_beam_gen_mode = (num_beams > 1) and (num_beam_groups == 1) and do_sample is False
        is_beam_sample_gen_mode = (num_beams > 1) and (num_beam_groups == 1) and do_sample is True
        is_group_beam_gen_mode = (num_beams > 1) and (num_beam_groups > 1)
        if num_beam_groups > num_beams:
            raise ValueError("`num_beam_groups` has to be smaller or equal to `num_beams`")
        if is_group_beam_gen_mode and do_sample is True:
            raise ValueError(
                "Diverse beam search cannot be used in sampling mode. Make sure that `do_sample` is set to `False`."
            )
        # set model_kwargs
        model_kwargs["use_cache"] = use_cache
        # get distribution pre_processing samplers
        logits_processor = self._get_logits_processor(
            repetition_penalty=repetition_penalty,
            no_repeat_ngram_size=no_repeat_ngram_size,
            encoder_no_repeat_ngram_size=encoder_no_repeat_ngram_size,
            encoder_input_ids=encoder_input_ids,
            bad_words_ids=bad_words_ids,
            min_length=min_length,
            max_length=max_length,
            eos_token_id=eos_token_id,
            forced_bos_token_id=forced_bos_token_id,
            forced_eos_token_id=forced_eos_token_id,
            prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
            num_beams=num_beams,
            num_beam_groups=num_beam_groups,
            diversity_penalty=diversity_penalty,
            remove_invalid_values=remove_invalid_values,
        )
        cur_len = input_ids.shape[-1]
        stopping_criteria = self._get_stopping_criteria(
            max_length=max_length, max_time=max_time, max_new_tokens=max_new_tokens, start_length=cur_len
        )
        if is_greedy_gen_mode:
            if num_return_sequences > 1:
                raise ValueError(
                    f"num_return_sequences has to be 1, but is {num_return_sequences} when doing greedy search."
                )
            # greedy search
            return self.greedy_search(
                input_ids,
                logits_processor=logits_processor,
                stopping_criteria=stopping_criteria,
                pad_token_id=pad_token_id,
                eos_token_id=eos_token_id,
                output_scores=output_scores,
                return_dict_in_generate=return_dict_in_generate,
                synced_gpus=synced_gpus,
                **model_kwargs,
            )
        elif is_sample_gen_mode:
            # get probability distribution warper
            logits_warper = self._get_logits_warper(
                top_k=top_k, top_p=top_p, temperature=temperature, num_beams=num_beams
            )
            # expand input_ids with `num_return_sequences` additional sequences per batch
            input_ids, model_kwargs = self._expand_inputs_for_generation(
                input_ids,
                expand_size=num_return_sequences,
                is_encoder_decoder=self.config.is_encoder_decoder,
                **model_kwargs,
            )
            # sample
            return self.sample(
                input_ids,
                logits_processor=logits_processor,
                logits_warper=logits_warper,
                stopping_criteria=stopping_criteria,
                pad_token_id=pad_token_id,
                eos_token_id=eos_token_id,
                output_scores=output_scores,
                return_dict_in_generate=return_dict_in_generate,
                synced_gpus=synced_gpus,
                **model_kwargs,
            )
        elif is_beam_gen_mode:
            batch_size = input_ids.shape[0]
            length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
            early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping
            if num_return_sequences > num_beams:
                raise ValueError("`num_return_sequences` has to be smaller or equal to `num_beams`.")
            if stopping_criteria.max_length is None:
                raise ValueError("`max_length` needs to be a stopping_criteria for now.")
            beam_scorer = BeamSearchScorer(
                batch_size=batch_size,
                num_beams=num_beams,
                device=self.device,
                length_penalty=length_penalty,
                do_early_stopping=early_stopping,
                num_beam_hyps_to_keep=num_return_sequences,
            )
            # interleave with `num_beams`
            input_ids, model_kwargs = self._expand_inputs_for_generation(
                input_ids, expand_size=num_beams, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs
            )
            return self.beam_search(
                input_ids,
                beam_scorer,
                logits_processor=logits_processor,
                stopping_criteria=stopping_criteria,
                pad_token_id=pad_token_id,
                eos_token_id=eos_token_id,
                output_scores=output_scores,
                return_dict_in_generate=return_dict_in_generate,
                synced_gpus=synced_gpus,
                **model_kwargs,
            )
        elif is_beam_sample_gen_mode:
            logits_warper = self._get_logits_warper(
                top_k=top_k, top_p=top_p, temperature=temperature, num_beams=num_beams
            )
            batch_size = input_ids.shape[0] * num_return_sequences
            length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
            if stopping_criteria.max_length is None:
                raise ValueError("`max_length` needs to be a stopping_criteria for now.")
            beam_scorer = BeamSearchScorer(
                batch_size=batch_size,
                num_beams=num_beams,
                device=self.device,
                length_penalty=length_penalty,
                do_early_stopping=early_stopping,
            )
            # interleave with `num_beams * num_return_sequences`
            input_ids, model_kwargs = self._expand_inputs_for_generation(
                input_ids,
                expand_size=num_beams * num_return_sequences,
                is_encoder_decoder=self.config.is_encoder_decoder,
                **model_kwargs,
            )
            return self.beam_sample(
                input_ids,
                beam_scorer,
                logits_processor=logits_processor,
                logits_warper=logits_warper,
                stopping_criteria=stopping_criteria,
                pad_token_id=pad_token_id,
                eos_token_id=eos_token_id,
                output_scores=output_scores,
                return_dict_in_generate=return_dict_in_generate,
                synced_gpus=synced_gpus,
                **model_kwargs,
            )
        elif is_group_beam_gen_mode:
            batch_size = input_ids.shape[0]
            length_penalty = length_penalty if length_penalty is not None else self.config.length_penalty
            early_stopping = early_stopping if early_stopping is not None else self.config.early_stopping
            if num_return_sequences > num_beams:
                raise ValueError("`num_return_sequences` has to be smaller or equal to `num_beams`.")
            if num_beams % num_beam_groups != 0:
                raise ValueError("`num_beams` should be divisible by `num_beam_groups` for group beam search.")
            if stopping_criteria.max_length is None:
                raise ValueError("`max_length` needs to be a stopping_criteria for now.")
            diverse_beam_scorer = BeamSearchScorer(
                batch_size=batch_size,
                num_beams=num_beams,
                max_length=stopping_criteria.max_length,
                device=self.device,
                length_penalty=length_penalty,
                do_early_stopping=early_stopping,
                num_beam_hyps_to_keep=num_return_sequences,
                num_beam_groups=num_beam_groups,
            )
            # interleave with `num_beams`
            input_ids, model_kwargs = self._expand_inputs_for_generation(
                input_ids, expand_size=num_beams, is_encoder_decoder=self.config.is_encoder_decoder, **model_kwargs
            )
            return self.group_beam_search(
                input_ids,
                diverse_beam_scorer,
                logits_processor=logits_processor,
                stopping_criteria=stopping_criteria,
                pad_token_id=pad_token_id,
                eos_token_id=eos_token_id,
                output_scores=output_scores,
                return_dict_in_generate=return_dict_in_generate,
                synced_gpus=synced_gpus,
                **model_kwargs,
            )
File:      ~/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.9/lib/python3.9/site-packages/transformers/generation_utils.py
Type:      method

The input_ids parameter is passed to the encoder here:

# Storing encoder_input_ids for logits_processor that could use them
encoder_input_ids = input_ids if self.config.is_encoder_decoder else None

if self.config.is_encoder_decoder:
    # add encoder_outputs to model_kwargs
    model_kwargs = self._prepare_encoder_decoder_kwargs_for_generation(input_ids, model_kwargs)

    # set input_ids as decoder_input_ids
    if "decoder_input_ids" in model_kwargs:
        input_ids = model_kwargs.pop("decoder_input_ids")
    else:
        input_ids = self._prepare_decoder_input_ids_for_generation(
            input_ids, decoder_start_token_id=decoder_start_token_id, bos_token_id=bos_token_id
        )

As the comment suggests, the _prepare_encoder_decoder_kwargs_for_generation method passes the input_ids to the encoder and then saves the output in the kwargs that are later used in generation:

Signature:
model._prepare_encoder_decoder_kwargs_for_generation(
    input_ids: torch.LongTensor,
    model_kwargs,
) -> Dict[str, Any]
Docstring: <no docstring>
Source:   
    def _prepare_encoder_decoder_kwargs_for_generation(
        self, input_ids: torch.LongTensor, model_kwargs
    ) -> Dict[str, Any]:
        if "encoder_outputs" not in model_kwargs:
            # retrieve encoder hidden states
            encoder = self.get_encoder()
            encoder_kwargs = {
                argument: value
                for argument, value in model_kwargs.items()
                if not (argument.startswith("decoder_") or argument.startswith("cross_attn"))
            }
            model_kwargs["encoder_outputs"]: ModelOutput = encoder(input_ids, return_dict=True, **encoder_kwargs)
        return model_kwargs
File:      ~/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.9/lib/python3.9/site-packages/transformers/generation_utils.py
Type:      method

The _prepare_decoder_input_ids_for_generation method is quite simple and just returns a single token tensor that is the start token:

Signature:
model._prepare_decoder_input_ids_for_generation(
    input_ids: torch.LongTensor,
    decoder_start_token_id: int = None,
    bos_token_id: int = None,
) -> torch.LongTensor
Docstring: <no docstring>
Source:   
    def _prepare_decoder_input_ids_for_generation(
        self, input_ids: torch.LongTensor, decoder_start_token_id: int = None, bos_token_id: int = None
    ) -> torch.LongTensor:
        decoder_start_token_id = self._get_decoder_start_token_id(decoder_start_token_id, bos_token_id)
        decoder_input_ids = (
            torch.ones((input_ids.shape[0], 1), dtype=torch.long, device=input_ids.device) * decoder_start_token_id
        )
        return decoder_input_ids
File:      ~/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.9/lib/python3.9/site-packages/transformers/generation_utils.py
Type:      method

I’m happy that this means the input provided is only used to generate the context. The next thing to review is the sequence generation. We can review the greedy_search approach as it is the simplest generation method and it shouldn’t fundamentally behave differently from the others. The greedy search is triggered like this:

# greedy search
return self.greedy_search(
    input_ids,
    logits_processor=logits_processor,
    stopping_criteria=stopping_criteria,
    pad_token_id=pad_token_id,
    eos_token_id=eos_token_id,
    output_scores=output_scores,
    return_dict_in_generate=return_dict_in_generate,
    synced_gpus=synced_gpus,
    **model_kwargs,
)

so you can see that the input_ids from earlier (by this point replaced with the decoder start token) are the first parameter, and model_kwargs is where the context from the encoder is held.

Signature:
model.greedy_search(
    input_ids: torch.LongTensor,
    logits_processor: Optional[transformers.generation_logits_process.LogitsProcessorList] = None,
    stopping_criteria: Optional[transformers.generation_stopping_criteria.StoppingCriteriaList] = None,
    max_length: Optional[int] = None,
    pad_token_id: Optional[int] = None,
    eos_token_id: Optional[int] = None,
    output_attentions: Optional[bool] = None,
    output_hidden_states: Optional[bool] = None,
    output_scores: Optional[bool] = None,
    return_dict_in_generate: Optional[bool] = None,
    synced_gpus: Optional[bool] = None,
    **model_kwargs,
) -> Union[transformers.generation_utils.GreedySearchEncoderDecoderOutput, transformers.generation_utils.GreedySearchDecoderOnlyOutput, torch.LongTensor]
Source:   
    def greedy_search(
        self,
        input_ids: torch.LongTensor,
        logits_processor: Optional[LogitsProcessorList] = None,
        stopping_criteria: Optional[StoppingCriteriaList] = None,
        max_length: Optional[int] = None,
        pad_token_id: Optional[int] = None,
        eos_token_id: Optional[int] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        output_scores: Optional[bool] = None,
        return_dict_in_generate: Optional[bool] = None,
        synced_gpus: Optional[bool] = None,
        **model_kwargs,
    ) -> Union[GreedySearchOutput, torch.LongTensor]:
        r"""
        Generates sequences for models with a language modeling head using greedy decoding.
        Parameters:
            input_ids (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
                The sequence used as a prompt for the generation. If :obj:`None` the method initializes it as an empty
                :obj:`torch.LongTensor` of shape :obj:`(1,)`.
            logits_processor (:obj:`LogitsProcessorList`, `optional`):
                An instance of :class:`~transformers.LogitsProcessorList`. List of instances of class derived from
                :class:`~transformers.LogitsProcessor` used to modify the prediction scores of the language modeling
                head applied at each generation step.
            stopping_criteria (:obj:`StoppingCriteriaList`, `optional`):
                An instance of :class:`~transformers.StoppingCriteriaList`. List of instances of class derived from
                :class:`~transformers.StoppingCriteria` used to tell if the generation loop should stop.
            max_length (:obj:`int`, `optional`, defaults to 20):
                **DEPRECATED**. Use :obj:`logits_processor` or :obj:`stopping_criteria` directly to cap the number of
                generated tokens. The maximum length of the sequence to be generated.
            pad_token_id (:obj:`int`, `optional`):
                The id of the `padding` token.
            eos_token_id (:obj:`int`, `optional`):
                The id of the `end-of-sequence` token.
            output_attentions (:obj:`bool`, `optional`, defaults to `False`):
                Whether or not to return the attentions tensors of all attention layers. See ``attentions`` under
                returned tensors for more details.
            output_hidden_states (:obj:`bool`, `optional`, defaults to `False`):
                Whether or not to return the hidden states of all layers. See ``hidden_states`` under returned tensors
                for more details.
            output_scores (:obj:`bool`, `optional`, defaults to `False`):
                Whether or not to return the prediction scores. See ``scores`` under returned tensors for more details.
            return_dict_in_generate (:obj:`bool`, `optional`, defaults to `False`):
                Whether or not to return a :class:`~transformers.file_utils.ModelOutput` instead of a plain tuple.
            synced_gpus (:obj:`bool`, `optional`, defaults to :obj:`False`):
                Whether to continue running the while loop until max_length (needed for ZeRO stage 3)
            model_kwargs:
                Additional model specific keyword arguments will be forwarded to the :obj:`forward` function of the
                model. If model is an encoder-decoder model the kwargs should include :obj:`encoder_outputs`.
        Return:
            :class:`~transformers.generation_utils.GreedySearchDecoderOnlyOutput`,
            :class:`~transformers.generation_utils.GreedySearchEncoderDecoderOutput` or obj:`torch.LongTensor`: A
            :obj:`torch.LongTensor` containing the generated tokens (default behaviour) or a
            :class:`~transformers.generation_utils.GreedySearchDecoderOnlyOutput` if
            ``model.config.is_encoder_decoder=False`` and ``return_dict_in_generate=True`` or a
            :class:`~transformers.generation_utils.GreedySearchEncoderDecoderOutput` if
            ``model.config.is_encoder_decoder=True``.
        Examples::
            >>> from transformers import (
            ... AutoTokenizer,
            ... AutoModelForCausalLM,
            ... LogitsProcessorList,
            ... MinLengthLogitsProcessor,
            ... )
            >>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
            >>> model = AutoModelForCausalLM.from_pretrained("gpt2")
            >>> # set pad_token_id to eos_token_id because GPT2 does not have a EOS token
            >>> model.config.pad_token_id = model.config.eos_token_id
            >>> input_prompt = "Today is a beautiful day, and"
            >>> input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids
            >>> # instantiate logits processors
            >>> logits_processor = LogitsProcessorList([
            ...     MinLengthLogitsProcessor(15, eos_token_id=model.config.eos_token_id),
            ... ])
            >>> outputs = model.greedy_search(input_ids, logits_processor=logits_processor)
            >>> print("Generated:", tokenizer.batch_decode(outputs, skip_special_tokens=True))
        """
        # init values
        logits_processor = logits_processor if logits_processor is not None else LogitsProcessorList()
        stopping_criteria = stopping_criteria if stopping_criteria is not None else StoppingCriteriaList()
        if max_length is not None:
            warnings.warn(
                "`max_length` is deprecated in this function, use `stopping_criteria=StoppingCriteriaList(MaxLengthCriteria(max_length=max_length))` instead.",
                UserWarning,
            )
            stopping_criteria = validate_stopping_criteria(stopping_criteria, max_length)
        pad_token_id = pad_token_id if pad_token_id is not None else self.config.pad_token_id
        eos_token_id = eos_token_id if eos_token_id is not None else self.config.eos_token_id
        output_scores = output_scores if output_scores is not None else self.config.output_scores
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict_in_generate = (
            return_dict_in_generate if return_dict_in_generate is not None else self.config.return_dict_in_generate
        )
        # init attention / hidden states / scores tuples
        scores = () if (return_dict_in_generate and output_scores) else None
        decoder_attentions = () if (return_dict_in_generate and output_attentions) else None
        cross_attentions = () if (return_dict_in_generate and output_attentions) else None
        decoder_hidden_states = () if (return_dict_in_generate and output_hidden_states) else None
        # if model is an encoder-decoder, retrieve encoder attention weights and hidden states
        if return_dict_in_generate and self.config.is_encoder_decoder:
            encoder_attentions = model_kwargs["encoder_outputs"].get("attentions") if output_attentions else None
            encoder_hidden_states = (
                model_kwargs["encoder_outputs"].get("hidden_states") if output_hidden_states else None
            )
        # keep track of which sequences are already finished
        unfinished_sequences = input_ids.new(input_ids.shape[0]).fill_(1)
        cur_len = input_ids.shape[-1]
        this_peer_finished = False  # used by synced_gpus only
        while True:
            if synced_gpus:
                # Under synced_gpus the `forward` call must continue until all gpus complete their sequence.
                # The following logic allows an early break if all peers finished generating their sequence
                this_peer_finished_flag = torch.tensor(0.0 if this_peer_finished else 1.0).to(input_ids.device)
                # send 0.0 if we finished, 1.0 otherwise
                dist.all_reduce(this_peer_finished_flag, op=dist.ReduceOp.SUM)
                # did all peers finish? the reduced sum will be 0.0 then
                if this_peer_finished_flag.item() == 0.0:
                    break
            # prepare model inputs
            model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
            # forward pass to get next token
            outputs = self(
                **model_inputs,
                return_dict=True,
                output_attentions=output_attentions,
                output_hidden_states=output_hidden_states,
            )
            if synced_gpus and this_peer_finished:
                cur_len = cur_len + 1
                continue  # don't waste resources running the code we don't need
            next_token_logits = outputs.logits[:, -1, :]
            # Store scores, attentions and hidden_states when required
            if return_dict_in_generate:
                if output_scores:
                    scores += (next_token_logits,)
                if output_attentions:
                    decoder_attentions += (
                        (outputs.decoder_attentions,) if self.config.is_encoder_decoder else (outputs.attentions,)
                    )
                    if self.config.is_encoder_decoder:
                        cross_attentions += (outputs.cross_attentions,)
                if output_hidden_states:
                    decoder_hidden_states += (
                        (outputs.decoder_hidden_states,)
                        if self.config.is_encoder_decoder
                        else (outputs.hidden_states,)
                    )
            # pre-process distribution
            next_tokens_scores = logits_processor(input_ids, next_token_logits)
            # argmax
            next_tokens = torch.argmax(next_tokens_scores, dim=-1)
            # finished sentences should have their next token be a padding token
            if eos_token_id is not None:
                assert pad_token_id is not None, "If eos_token_id is defined, make sure that pad_token_id is defined."
                next_tokens = next_tokens * unfinished_sequences + pad_token_id * (1 - unfinished_sequences)
            # update generated ids, model inputs, and length for next step
            input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
            model_kwargs = self._update_model_kwargs_for_generation(
                outputs, model_kwargs, is_encoder_decoder=self.config.is_encoder_decoder
            )
            cur_len = cur_len + 1
            # if eos_token was found in one sentence, set sentence to finished
            if eos_token_id is not None:
                unfinished_sequences = unfinished_sequences.mul((next_tokens != eos_token_id).long())
            # stop when each sentence is finished, or if we exceed the maximum length
            if unfinished_sequences.max() == 0 or stopping_criteria(input_ids, scores):
                if not synced_gpus:
                    break
                else:
                    this_peer_finished = True
        if return_dict_in_generate:
            if self.config.is_encoder_decoder:
                return GreedySearchEncoderDecoderOutput(
                    sequences=input_ids,
                    scores=scores,
                    encoder_attentions=encoder_attentions,
                    encoder_hidden_states=encoder_hidden_states,
                    decoder_attentions=decoder_attentions,
                    cross_attentions=cross_attentions,
                    decoder_hidden_states=decoder_hidden_states,
                )
            else:
                return GreedySearchDecoderOnlyOutput(
                    sequences=input_ids,
                    scores=scores,
                    attentions=decoder_attentions,
                    hidden_states=decoder_hidden_states,
                )
        else:
            return input_ids
File:      ~/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.9/lib/python3.9/site-packages/transformers/generation_utils.py
Type:      method

The interesting bits here are within the while loop that generates the tokens:

# prepare model inputs
model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)

# forward pass to get next token
outputs = self(
    **model_inputs,
    return_dict=True,
    output_attentions=output_attentions,
    output_hidden_states=output_hidden_states,
)

# later...

input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)

The prepare_inputs_for_generation just copies over the appropriate fields:

Signature:
model.prepare_inputs_for_generation(
    input_ids,
    past=None,
    attention_mask=None,
    use_cache=None,
    encoder_outputs=None,
    **kwargs,
)
Docstring:
Implement in subclasses of :class:`~transformers.PreTrainedModel` for custom behavior to prepare inputs in the
generate method.
Source:   
    def prepare_inputs_for_generation(
        self, input_ids, past=None, attention_mask=None, use_cache=None, encoder_outputs=None, **kwargs
    ):
        decoder_inputs = self.decoder.prepare_inputs_for_generation(input_ids, past=past)
        decoder_attention_mask = decoder_inputs["attention_mask"] if "attention_mask" in decoder_inputs else None
        input_dict = {
            "attention_mask": attention_mask,
            "decoder_attention_mask": decoder_attention_mask,
            "decoder_input_ids": decoder_inputs["input_ids"],
            "encoder_outputs": encoder_outputs,
            "past_key_values": decoder_inputs["past_key_values"],
            "use_cache": use_cache,
        }
        return input_dict
File:      ~/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.9/lib/python3.9/site-packages/transformers/models/encoder_decoder/modeling_encoder_decoder.py
Type:      method

Finally the torch.cat([input_ids, next_tokens[:, None]], dim=-1) builds up the input_ids by appending the last token chosen.
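
In isolation that concatenation looks like this (toy token ids, just to show the shapes):

import torch

input_ids = torch.tensor([[101, 7592]])  # shape (batch=1, seq=2)
next_tokens = torch.tensor([2088])       # shape (batch=1,)

input_ids = torch.cat([input_ids, next_tokens[:, None]], dim=-1)
print(input_ids)  # tensor([[ 101, 7592, 2088]]) - one token longer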

To me this confirms that the encoder is used to provide additional context to the generation, and that the generation only runs over that context and the accumulated model output. Knowing this, pretraining the model with the title as the input and that same title as the target is reasonable: we are teaching the model to encode the title in the context.

When we move to a less direct encoding of the title, that pretrained model will know to look for title-generation hints in the context, so we can have context-sensitive titles.
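
That first pretraining step would therefore need nothing more than a dataframe with the title repeated in both columns. A minimal sketch (the column names follow the simpletransformers example used below, and the titles are just examples taken from this post):

import pandas as pd

titles = [
    "Title Generator with Simple Transformers",
    "Attention Is All You Need",
]

# input_text == target_text: the model learns to carry the title through the context
pretrain_df = pd.DataFrame({"input_text": titles, "target_text": titles})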


Simple Transformers Introduction

I want to practice using simpletransformers for this task. The best way to start is with their existing example code. Then we can work up to using it to train on the titles.

Using a Pretrained Model

For the training it’s best to start with a pretrained model. Let’s try altering the example code to load the pretrained model we used earlier.

Code
from pathlib import Path

DATA_FOLDER = Path("/data/blog/2021-10-15-title-generator")

MODEL_NAME = "mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization"
TRAIN_EPOCHS = 2 # intentionally short

The huggingface transformers library that simpletransformers uses produces a lot of logging output. Simpletransformers also appears to invoke the tokenizer in a way that endlessly produces warnings about parallelism. To manage the volume of output these statements turn down the huggingface logging substantially and disable parallel use of tokenizers.

Code
#collapse_input
import os
import transformers

# suppress warning from huggingface
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# turn down huggingface logging dramatically
transformers.logging.set_verbosity_error()

The example data uses a two-row training dataset along with a two-row evaluation dataset. They are reproduced below.

Code
#collapse_input
import pandas as pd

train_data = [
    [
        "Perseus “Percy” Jackson is the main protagonist and the narrator of the Percy Jackson and the Olympians series.",
        "Percy is the protagonist of Percy Jackson and the Olympians",
    ],
    [
        "Annabeth Chase is one of the main protagonists in Percy Jackson and the Olympians.",
        "Annabeth is a protagonist in Percy Jackson and the Olympians.",
    ],
]

train_df = pd.DataFrame(train_data, columns=["input_text", "target_text"])

eval_data = [
    [
        "Grover Underwood is a satyr and the Lord of the Wild. He is the satyr who found the demigods Thalia Grace, Nico and Bianca di Angelo, Percy Jackson, Annabeth Chase, and Luke Castellan.",
        "Grover is a satyr who found many important demigods.",
    ],
    [
        "Thalia Grace is the daughter of Zeus, sister of Jason Grace. After several years as a pine tree on Half-Blood Hill, she got a new job leading the Hunters of Artemis.",
        "Thalia is the daughter of Zeus and leader of the Hunters of Artemis.",
    ],
]

eval_df = pd.DataFrame(eval_data, columns=["input_text", "target_text"])

Now that we have the dataset we can define and train the model. I’m trying to load the same model that we used earlier, as it seems to be a reasonable starting point for this task.

Code
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

# Configure the model
model_args = Seq2SeqArgs(
    num_train_epochs=TRAIN_EPOCHS,

    evaluate_during_training=True,
    evaluate_during_training_silent=True,
    
    # have to str these paths otherwise you get the error
    #  "Object of type PosixPath is not JSON serializable"
    best_model_dir=str(DATA_FOLDER / "outputs" / "best_model"),
    cache_dir=str(DATA_FOLDER / "cache"),
    dataset_cache_dir=str(DATA_FOLDER / "dataset"),
    output_dir=str(DATA_FOLDER / "outputs"),
    overwrite_output_dir=True,
)

model = Seq2SeqModel(
    encoder_type="bert",
    encoder_name=MODEL_NAME,
    decoder_name=MODEL_NAME,
    args=model_args,
)

# Train the model
model.train_model(train_df, eval_data=eval_df)
ValueError: Unrecognized configuration class <class 'transformers.models.encoder_decoder.configuration_encoder_decoder.EncoderDecoderConfig'> for this kind of AutoModel: AutoModel.
Model type should be one of FNetConfig, GPTJConfig, LayoutLMv2Config, BeitConfig, RemBertConfig, VisualBertConfig, CanineConfig, RoFormerConfig, CLIPConfig, BigBirdPegasusConfig, DeiTConfig, LukeConfig, DetrConfig, GPTNeoConfig, BigBirdConfig, Speech2TextConfig, ViTConfig, Wav2Vec2Config, M2M100Config, ConvBertConfig, LEDConfig, BlenderbotSmallConfig, RetriBertConfig, IBertConfig, MT5Config, T5Config, MobileBertConfig, DistilBertConfig, AlbertConfig, BertGenerationConfig, CamembertConfig, XLMRobertaConfig, PegasusConfig, MarianConfig, MBartConfig, MegatronBertConfig, MPNetConfig, BartConfig, BlenderbotConfig, ReformerConfig, LongformerConfig, RobertaConfig, DebertaV2Config, DebertaConfig, FlaubertConfig, FSMTConfig, SqueezeBertConfig, HubertConfig, BertConfig, OpenAIGPTConfig, GPT2Config, TransfoXLConfig, XLNetConfig, XLMProphetNetConfig, ProphetNetConfig, XLMConfig, CTRLConfig, ElectraConfig, FunnelConfig, LxmertConfig, DPRConfig, LayoutLMConfig, TapasConfig, SplinterConfig.

The problem here is that the encoder-decoder model is being loaded in a slightly different way. You can see the problem line here:

self.model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    encoder_name, decoder_name, config=config
)

It is loading the model by loading the two halves separately. There are two ways that we can fix this. We could write the pretrained model to a local folder, split between encoder and decoder, as there is code that allows us to load from a folder:

if encoder_decoder_name:
    # self.model = EncoderDecoderModel.from_pretrained(encoder_decoder_name)
    self.model = EncoderDecoderModel.from_encoder_decoder_pretrained(
        os.path.join(encoder_decoder_name, "encoder"),
        os.path.join(encoder_decoder_name, "decoder"),
    )
    self.encoder_tokenizer = tokenizer_class.from_pretrained(
        os.path.join(encoder_decoder_name, "encoder")
    )
    self.decoder_tokenizer = AutoTokenizer.from_pretrained(
        os.path.join(encoder_decoder_name, "decoder")
    )

source.

It’s infuriating that the commented-out line is the one that we want! Oh well, when you use a library that makes things simple, sometimes you disagree with the choices it makes.

The other choice is to load the model using settings that are as close to the pretrained model as possible and then just replace the underlying model with the pretrained one. I am less keen on this approach: the simpletransformers library may well augment the model in some way, and directly replacing it could lose those augmentations.
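
For completeness, a rough, untested sketch of that second approach. It assumes the Seq2SeqModel keeps the combined transformer on a .model attribute (as the source snippet above shows) and that a plain BERT checkpoint is close enough to construct with; neither of these is something I have verified end to end:

Code
from transformers import EncoderDecoderModel
from simpletransformers.seq2seq import Seq2SeqModel

# build a throwaway encoder-decoder from a plain BERT checkpoint...
stand_in = Seq2SeqModel(
    encoder_type="bert",
    encoder_name="bert-base-uncased",
    decoder_name="bert-base-uncased",
    args=model_args,
)

# ...then swap in the fully pretrained summarization model.
# Any setup simpletransformers did around its own model (tokenizers, device placement)
# could end up inconsistent after this, which is why I avoid the approach.
stand_in.model = EncoderDecoderModel.from_pretrained(MODEL_NAME)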

With this in mind, let’s try saving the model. We can do this by loading it and then saving the submodules used to create it (the encoder and decoder). Once these have been saved to a folder the tokenizer can be saved alongside them, as the code above loads the tokenizer from the same folder.

Code
#collapse_input
#hide_output
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
model = EncoderDecoderModel.from_pretrained(MODEL_NAME)

PRETRAINED_MODEL_FOLDER = DATA_FOLDER / "pretrained-model"

model.encoder.save_pretrained(PRETRAINED_MODEL_FOLDER / "encoder")
model.decoder.save_pretrained(PRETRAINED_MODEL_FOLDER / "decoder")

tokenizer.save_pretrained(PRETRAINED_MODEL_FOLDER / "encoder")
tokenizer.save_pretrained(PRETRAINED_MODEL_FOLDER / "decoder")

Now we can try loading this model from the folder we saved to.

The configuration for these models is saved alongside the model. This means that all of the Path objects need to be converted to strings to ensure that they serialize to json correctly. It’s a little annoying.
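
A quick standalone illustration of the serialization issue, just to show the error:

Code
import json
from pathlib import Path

json.dumps({"output_dir": str(Path("/data/blog") / "outputs")})  # fine
# json.dumps({"output_dir": Path("/data/blog") / "outputs"})
# -> TypeError: Object of type PosixPath is not JSON serializable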

Code
#collapse_input
#hide_output
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

# Configure the model
model_args = Seq2SeqArgs(
    num_train_epochs=TRAIN_EPOCHS,

    evaluate_during_training=True,
    evaluate_during_training_silent=True,
    
    # have to str these paths otherwise you get the error
    #  "Object of type PosixPath is not JSON serializable"
    best_model_dir=str(DATA_FOLDER / "outputs" / "best_model"),
    cache_dir=str(DATA_FOLDER / "cache"),
    dataset_cache_dir=str(DATA_FOLDER / "dataset"),
    output_dir=str(DATA_FOLDER / "outputs"),
    overwrite_output_dir=True,
)

model = Seq2SeqModel(
    encoder_type="bert",
    encoder_decoder_name=str(PRETRAINED_MODEL_FOLDER),
    args=model_args,
)

# Train the model
model.train_model(train_df, eval_data=eval_df)
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.9/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "

(2,
 {'global_step': [1, 2],
  'eval_loss': [3.0969717502593994, 3.0969717502593994],
  'train_loss': [3.6886348724365234, 3.6297719478607178]})

We can see how well the model does in an evaluation against the two evaluation rows.

Code
# Evaluate the model
result = model.eval_model(eval_df)
result
{'eval_loss': 3.0969717502593994}

And we can see what we would get when run over this text.

Code
# Use the model for prediction
print(
    model.predict(
        [
            "Tyson is a Cyclops, a son of Poseidon, and Percy Jackson’s half brother. He is the current general of the Cyclopes army."
        ]
    )
)

['the current general of the cyclopes army is the current general of the cyclope']

If we had not used a pretrained encoder-decoder model then this would’ve produced absolute hot trash as output. I’ve seen output that is just a series of : for example. Having a pretrained model that expects to receive the contextual data makes such a difference.


Title Generator

I want this model to generate titles, and the easiest way to start with that is to give it the title to generate. This will encourage it to encode the title, in some way, in the context and then pay attention to that as it produces output. After that it can be refined to work off of a more abstract input.

Training can be complex, especially for something like this, and the simpletransformers library is meant to make it easier, so we can try that out. I might do an equivalent train with the huggingface Seq2SeqTrainer in the future as a comparison; a rough sketch of what that would involve is below.
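
This sketch is untested: it assumes the datasets library is available, the hf-outputs folder name is made up, and it relies on the pretrained checkpoint already having its decoder_start_token_id and pad_token_id configured. It uses the same input_text/target_text dataframes as the simpletransformers run above.

Code
from datasets import Dataset
from transformers import (
    BertTokenizerFast,
    DataCollatorForSeq2Seq,
    EncoderDecoderModel,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

hf_tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
hf_model = EncoderDecoderModel.from_pretrained(MODEL_NAME)

def tokenize(batch):
    # the same tokenizer serves both halves of this bert2bert model
    encoded = hf_tokenizer(batch["input_text"], truncation=True, max_length=64)
    encoded["labels"] = hf_tokenizer(
        batch["target_text"], truncation=True, max_length=64
    )["input_ids"]
    return encoded

train_dataset = Dataset.from_pandas(train_df).map(tokenize, batched=True)
eval_dataset = Dataset.from_pandas(eval_df).map(tokenize, batched=True)

trainer = Seq2SeqTrainer(
    model=hf_model,
    args=Seq2SeqTrainingArguments(
        output_dir=str(DATA_FOLDER / "hf-outputs"),  # made-up folder name
        num_train_epochs=TRAIN_EPOCHS,
        evaluation_strategy="epoch",
    ),
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(hf_tokenizer, model=hf_model),
    tokenizer=hf_tokenizer,
)
# trainer.train()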

Dataset

Now we can try training with a real dataset. I’ve got a collection of blog post titles that can be used.

There are a few stages of this training and it’s quite important that we don’t taint the evaluation data with the training data. Let’s split it first.

Code
import pandas as pd

df = pd.read_excel("/data/blog/2021-10-15-title-generator/titles.xlsx")[["title"]]
df
title
0 How Books Can Become Your Best Content Marketi...
1 Content Marketing Trends in the post-COVID World
2 5 Simple Rules to Boost Your Visual Marketing ROI
3 7 Tips for Developing Your Blog Keyword Strategy
4 Content Marketing | DemandJump
... ...
9989 Content Marketing Newsletter #14
9990 B2B Content Marketing Solutions | Rep Cap
9991 How Content Marketing Helps Your HVAC Company ...
9992 Best Content Marketing Tools 2021 #Shorts
9993 Online Content Marketing Classes | Skillshare

9994 rows × 1 columns

Code
df = df.rename(columns={"title": "input_text"})
df["target_text"] = df["input_text"]

train_df = df.sample(frac=0.8, random_state=42)
test_df = df[~df.index.isin(train_df.index)]

len(train_df), len(test_df)
(7995, 1999)

Training the Model

Let’s try training with this. Since we know we have a working simpletransformers setup we can use that. It is quite annoying how noisy the simpletransformers training run is.

Code
#collapse_input
from pathlib import Path

DATA_FOLDER = Path("/data/blog/2021-10-15-title-generator")
PRETRAINED_MODEL_FOLDER = DATA_FOLDER / "pretrained-model"

MODEL_NAME = "mrm8488/bert-small2bert-small-finetuned-cnn_daily_mail-summarization"
TRAIN_EPOCHS = 5
Code
#collapse_input
import os
import transformers

# suppress warning from huggingface
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# turn down huggingface logging dramatically
transformers.logging.set_verbosity_error()
Code
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

# Configure the model
model_args = Seq2SeqArgs(
    num_train_epochs=TRAIN_EPOCHS,

    evaluate_during_training=True,
    evaluate_during_training_silent=True,
    
    # have to str these paths otherwise you get the error
    #  "Object of type PosixPath is not JSON serializable"
    best_model_dir=str(DATA_FOLDER / "outputs" / "best_model"),
    cache_dir=str(DATA_FOLDER / "cache"),
    dataset_cache_dir=str(DATA_FOLDER / "dataset"),
    output_dir=str(DATA_FOLDER / "outputs"),
    overwrite_output_dir=True,

    use_multiprocessing=False, # RuntimeError: received 0 items of ancdata
)

model = Seq2SeqModel(
    encoder_type="bert",
    encoder_decoder_name=str(PRETRAINED_MODEL_FOLDER),
    args=model_args,
)

# Train the model
model.train_model(train_df, eval_data=test_df)

RuntimeError: received 0 items of ancdata

I’m getting this error when running the training. It’s very disappointing, as I’ve never had this problem when training with the underlying huggingface trainer, and the entire point of this library is to make training simple.

There are two associated issues and they have both been closed without a single comment. So this is a known issue with the framework that does not have a good fix.

I’ve found this issue in fastai for the same problem and they suggest the following fix:

Code
import resource
rlimit = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (2048, rlimit[1]))
resource.getrlimit(resource.RLIMIT_NOFILE)
(2048, 1048576)

I applied this and also disabled all of the multiprocessing options in the arguments, because even with the resource fix the training then failed with

OSError(24, ‘Too many open files’)

Finally it appears to be training. I must admit that I am not amazed by all this.

Code
#collapse_output
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

# Configure the model
model_args = Seq2SeqArgs(
    num_train_epochs=TRAIN_EPOCHS,

    evaluate_during_training=True,
    evaluate_during_training_silent=True,
    
    # have to str these paths otherwise you get the error
    #  "Object of type PosixPath is not JSON serializable"
    best_model_dir=str(DATA_FOLDER / "outputs" / "best_model"),
    cache_dir=str(DATA_FOLDER / "cache"),
    dataset_cache_dir=str(DATA_FOLDER / "dataset"),
    output_dir=str(DATA_FOLDER / "outputs"),
    overwrite_output_dir=True,

    # RuntimeError: received 0 items of ancdata
    # OSError(24, 'Too many open files')
    use_multiprocessed_decoding=False,
    use_multiprocessing_for_evaluation=False,
    use_multiprocessing=False, 
)

model = Seq2SeqModel(
    encoder_type="bert",
    encoder_decoder_name=str(PRETRAINED_MODEL_FOLDER),
    args=model_args,
)

# Train the model
model.train_model(train_df, eval_data=test_df)



(5000,
 {'global_step': [1000, 2000, 2000, 3000, 4000, 4000, 5000],
  'eval_loss': [0.011338614523105207,
   0.005651695015156292,
   0.005651695015156292,
   0.0036847138374760105,
   0.0029968298273361144,
   0.0029968298273361144,
   0.002707378440167304],
  'train_loss': [0.0003398778208065778,
   0.00018986582290381193,
   0.00018986582290381193,
   0.000479072768939659,
   0.00012406017049215734,
   0.00012406017049215734,
   7.835013093426824e-05]})

This took about 5 minutes to train and has produced some loss metrics. Unfortunately it does not seem to be reporting any other metrics to judge how good the model is.

Evaluation

Let’s try evaluating it.

Code
model.eval_model(test_df)
{'eval_loss': 0.002707378440167304}
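
The loss on its own is hard to interpret. As a rough extra check of my own (not something simpletransformers reports), we can score a sample of predictions against their targets with a simple token-overlap measure:

Code
def token_overlap(predicted: str, target: str) -> float:
    # Jaccard overlap between the two sets of lowercased tokens
    predicted_tokens = set(predicted.casefold().split())
    target_tokens = set(target.casefold().split())
    if not predicted_tokens and not target_tokens:
        return 1.0
    return len(predicted_tokens & target_tokens) / len(predicted_tokens | target_tokens)

sample = test_df.iloc[:100]
predictions = model.predict(sample.input_text.tolist())
sum(
    token_overlap(predicted, target)
    for predicted, target in zip(predictions, sample.target_text.tolist())
) / len(sample)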

None of this really tells us anything that wasn’t in the training output. Instead we can try running the model on some of the test titles.

Code
titles = test_df.input_text.iloc[:5].tolist()

for original, predicted in zip(titles, model.predict(titles)):
    print(f"original: {original}\npredicted: {predicted}\n")

original: Content Marketing Trends in the post-COVID World
predicted: content marketing trends in the post - covid world content marketing trends in the post - co

original: Content Marketing | DemandJump
predicted: content marketing | demandjump content marketing | demandjump content marketing | demandju

original: The Most Googled Artist in Every Country in the World | Ken Bromley Art Supplies
predicted: the most googled artist in every country in the world | ken bromley art supplies the most

original: 5 of the Most Important Stages in the Content Marketing Process
predicted: 5 of the most important stages in the content marketing process 5 of the most important stages in

original: How to plan SEO content that actually ranks
predicted: how to plan seo content that actually ranks how to plan seo content that actually ranks how

The model is lowercasing everything, which is a pity, and there appears to be a problem with stopping the output at the right point. Either way, this model has learned to replicate these titles reasonably well in about 5 minutes.
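
The stopping problem looks more like a generation-settings issue than a training one. As a sketch (not something I have run here, and assuming the model object exposes the torch device as model.device), the underlying huggingface model and tokenizers are still reachable on the Seq2SeqModel, so the usual generate arguments can cap the length and cut down the repetition:

Code
# tokenize with the encoder tokenizer that simpletransformers keeps on the model
inputs = model.encoder_tokenizer(
    titles,
    padding=True,
    truncation=True,
    return_tensors="pt",
).to(model.device)

output = model.model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=20,           # titles are short, so cap the output
    no_repeat_ngram_size=3,  # stop it looping over the same phrase
    num_beams=4,
    early_stopping=True,
)
model.decoder_tokenizer.batch_decode(output, skip_special_tokens=True)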


Changing Context

Now that we have a model that can copy the input, let’s reduce the input to a context. To do this we can drop every stopword and then order the remaining words in a fixed but arbitrary way.

Context Dataset

The first thing is to strip the input down to just the significant words. I really just want to see if this works, so I’m going to hack around with spacy to filter out insignificant words and reduce the rest to their lemma forms.

Code
import spacy
import re

nlp = spacy.load("en_core_web_sm")
alpha_pattern = re.compile(r"[a-zA-Z]+")

def to_context(title: str) -> str:
    tokens = nlp(title)
    ascii_tokens = [
        token
        for token in tokens
        if alpha_pattern.match(token.lemma_)
    ]

    # words is a set
    words = {
        token.lemma_.casefold()
        for token in ascii_tokens
        if not any([
            token.is_bracket,
            token.is_currency,
            token.is_digit,
            token.is_punct,
            token.is_quote,
            token.is_space,
            token.is_stop,
        ])
    }

    words = words - {"content", "marketing"}
    
    return " ".join(sorted(words))
Code
%%time

context_train_df = train_df.copy()
context_train_df["input_text"] = context_train_df.input_text.apply(to_context)
context_train_df
CPU times: user 24.6 s, sys: 0 ns, total: 24.6 s
Wall time: 24.6 s
input_text target_text
3125 brand competitor edge gain global grow help ho... How content marketing is helping home-grown sa...
1441 infographic visual Content Marketing Visual Infographic
4510 lot media rock social spend time Rock Social Media  (without spending a lot of...
39 ash borland creators secrets Content Marketing Secrets for Creators with As...
4509 spill tea SPILLING THE TEA: Content marketing (HOW/WHY/W...
... ... ...
1179 achieve march netccentric platform quarter rec... Netccentric's content marketing platform achie...
5448 today wrong What's Wrong With Content Marketing Today
4244 major trends watch 7 Major Content Marketing Trends to Watch in 2...
5218 apricot rocket seo Apricot Rocket: "SEO and Content Marketing"
3976 hat relationship seo ux white White Hat SEO: The Relationship Between Conten...

7995 rows × 2 columns

Code
%%time

context_test_df = test_df.copy()
context_test_df["input_text"] = context_test_df.input_text.apply(to_context)
context_test_df
CPU times: user 6.29 s, sys: 0 ns, total: 6.29 s
Wall time: 6.29 s
input_text target_text
1 covid post trends world Content Marketing Trends in the post-COVID World
4 demandjump Content Marketing | DemandJump
5 art artist bromley country googled ken supply ... The Most Googled Artist in Every Country in th...
9 important process stage 5 of the Most Important Stages in the Content ...
11 actually plan rank seo How to plan SEO content that actually ranks
... ... ...
9978 book challenge day learn write writing I Can Write A Book (10-Day Book Writing Challe...
9980 actionable strategy today Content Marketing Strategy (2021 and 2022) - A...
9981 article category late service Latest article on category "Content Marketing ...
9982 contractors lead local seo How To Get Leads For Contractors | Local SEO A...
9991 company convert help hvac lead sale visitors How Content Marketing Helps Your HVAC Company ...

1999 rows × 2 columns

Code
context_train_df.input_text.value_counts()
                                 88
world                            17
strategy                         15
newsletter                       13
important                        11
                                 ..
blueprint leadspanda page         1
director invision                 1
blog jarvee price real spend      1
opportunity sponsorship world     1
hat relationship seo ux white     1
Name: input_text, Length: 7193, dtype: int64

After all of this filtering there are 88 rows that lack any kind of context. Rather than try to work with those, I am just going to drop them.

Code
context_train_df = context_train_df[context_train_df.input_text.str.len() > 0]
context_test_df = context_test_df[context_test_df.input_text.str.len() > 0]
context_train_df.input_text.value_counts()
world                            17
strategy                         15
newsletter                       13
important                        11
trends                           10
                                 ..
blueprint leadspanda page         1
director invision                 1
blog jarvee price real spend      1
opportunity sponsorship world     1
hat relationship seo ux white     1
Name: input_text, Length: 7192, dtype: int64

Context Train

I’ve copied the last checkpoint of the fine-tuned model to a separate path (DATA_FOLDER / "model" / "full-context" below); this just makes it easier to reuse the existing code to train it again.
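
The copy itself is just a folder copy, something along these lines; the checkpoint folder name below is a guess at the naming convention rather than something I checked, so use whichever checkpoint the previous run left under outputs/:

Code
import shutil

# hypothetical checkpoint name - substitute whichever checkpoint the last run produced
shutil.copytree(
    DATA_FOLDER / "outputs" / "checkpoint-5000-epoch-5",
    DATA_FOLDER / "model" / "full-context",
)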

Code
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

# Configure the model
model_args = Seq2SeqArgs(
    num_train_epochs=TRAIN_EPOCHS,

    evaluate_during_training=True,
    evaluate_during_training_silent=True,
    
    # have to str these paths otherwise you get the error
    #  "Object of type PosixPath is not JSON serializable"
    best_model_dir=str(DATA_FOLDER / "outputs" / "best_model"),
    cache_dir=str(DATA_FOLDER / "cache"),
    dataset_cache_dir=str(DATA_FOLDER / "dataset"),
    output_dir=str(DATA_FOLDER / "outputs"),
    overwrite_output_dir=True,

    # RuntimeError: received 0 items of ancdata
    # OSError(24, 'Too many open files')
    use_multiprocessed_decoding=False,
    use_multiprocessing_for_evaluation=False,
    use_multiprocessing=False, 
)

model = Seq2SeqModel(
    encoder_type="bert",
    encoder_decoder_name=str(DATA_FOLDER / "model" / "full-context"),
    args=model_args,
)
Code
#hide_output
model.train_model(context_train_df, eval_data=context_test_df)



(4945,
 {'global_step': [989, 1978, 2000, 2967, 3956, 4000, 4945],
  'eval_loss': [14.994169389670677,
   14.994169389670677,
   14.994169389670677,
   14.994169389670677,
   14.994169389670677,
   14.994169389670677,
   14.994169389670677],
  'train_loss': [13.772074699401855,
   13.0637845993042,
   12.200809478759766,
   11.790735244750977,
   12.536486625671387,
   15.395352363586426,
   11.742057800292969]})

The evaluation loss is stuck at around 15 for the entire run, compared to 0.0027 when training on the first task, so the model is likely terrible right now.

Code
context = context_test_df.input_text.iloc[:5].tolist()

for original, predicted in zip(context, model.predict(context)):
    print(f"original: {original}\npredicted: {predicted}\n")

original: covid post trends world
predicted: covid post trends world covid post trends world covid post trends world co

original: demandjump
predicted: demandjump demandjump demandjump demandjump demandjump

original: art artist bromley country googled ken supply world
predicted: art artist bromley country googled ken supply world art artist bromley country googled ken supply world

original: important process stage
predicted: important process stage important process stage important process stage important process stage important process stage

original: actually plan rank seo
predicted: actually plan rank seo actually plan rank seo actually plan rank seo actually plan rank seo

It hasn’t broken out of the habit of directly copying its input. Let’s see if loading the original pretrained model and training it on this data works better.

Code
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

# Configure the model
model_args = Seq2SeqArgs(
    num_train_epochs=TRAIN_EPOCHS,

    evaluate_during_training=True,
    evaluate_during_training_silent=True,
    
    # have to str these paths otherwise you get the error
    #  "Object of type PosixPath is not JSON serializable"
    best_model_dir=str(DATA_FOLDER / "outputs" / "best_model"),
    cache_dir=str(DATA_FOLDER / "cache"),
    dataset_cache_dir=str(DATA_FOLDER / "dataset"),
    output_dir=str(DATA_FOLDER / "outputs"),
    overwrite_output_dir=True,

    # RuntimeError: received 0 items of ancdata
    # OSError(24, 'Too many open files')
    use_multiprocessed_decoding=False,
    use_multiprocessing_for_evaluation=False,
    use_multiprocessing=False, 
)

model = Seq2SeqModel(
    encoder_type="bert",
    encoder_decoder_name=str(PRETRAINED_MODEL_FOLDER),
    args=model_args,
)
Code
#hide_output
model.train_model(context_train_df, eval_data=context_test_df)
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.9/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "



(4945,
 {'global_step': [989, 1978, 2000, 2967, 3956, 4000, 4945],
  'eval_loss': [2.1103564370498966,
   1.904784091571082,
   1.8770285617967366,
   1.7982574819553236,
   1.7708213616479263,
   1.7832981082591934,
   1.7558919220318197],
  'train_loss': [2.705289602279663,
   2.183368444442749,
   1.8347290754318237,
   2.4768571853637695,
   0.6300644874572754,
   1.1236987113952637,
   1.0715324878692627]})

Now this has improved quite substantially. The evaluation loss has ended up around 1.75 which is a vast improvement on the last attempt.

Context Evaluation

Hopefully this will produce more reasonable titles.

Code
rows = context_test_df.iloc[:5]

for row, predicted in zip(
    rows.to_dict("records"),
    model.predict(rows.input_text.tolist())
):
    print(f"context: {row['input_text']}\ntarget: {row['target_text']}\npredicted: {predicted}\n")

context: covid post trends world
target: Content Marketing Trends in the post-COVID World
predicted: post - covid post - covid content marketing world content marketing trends post

context: demandjump
target: Content Marketing | DemandJump
predicted: content marketing | demandjump demandjump content marketing

context: art artist bromley country googled ken supply world
target: The Most Googled Artist in Every Country in the World | Ken Bromley Art Supplies
predicted: art works from art world - ken ken - content marketing world - art art world art works

context: important process stage
target: 5 of the Most Important Stages in the Content Marketing Process
predicted: why is content marketing important? the 5 stages of the process the next stage of content

context: actually plan rank seo
target: How to plan SEO content that actually ranks
predicted: how to rank your seo and content marketing plan how to rank how to rank your seo

Remember that this model is still inclined to generate more text than it should. Looking at these, some of the output is awful and some of it isn’t that bad; the last two titles are approaching acceptable.

This is an interesting outcome: training directly on the context-to-title task worked better than first teaching the model to copy the titles. It was still fun trying things out.


Separate Models

The simpletransformers library is set up to use a separate pretrained model for the encoder and decoder. In this post I have been using a complete pretrained model; however, if I wanted to change to something larger, that might be quite tricky. How well would these same settings work for an encoder-decoder model made out of two separate pretrained models?

I’m going to use roberta as the base for both parts, as it generally outperforms bert-base-uncased.

Code
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs

# Configure the model
model_args = Seq2SeqArgs(
    num_train_epochs=TRAIN_EPOCHS,

    evaluate_during_training=True,
    evaluate_during_training_silent=True,
    
    # have to str these paths otherwise you get the error
    #  "Object of type PosixPath is not JSON serializable"
    best_model_dir=str(DATA_FOLDER / "outputs" / "best_model"),
    cache_dir=str(DATA_FOLDER / "cache"),
    dataset_cache_dir=str(DATA_FOLDER / "dataset"),
    output_dir=str(DATA_FOLDER / "outputs"),
    overwrite_output_dir=True,

    # RuntimeError: received 0 items of ancdata
    # OSError(24, 'Too many open files')
    use_multiprocessed_decoding=False,
    use_multiprocessing_for_evaluation=False,
    use_multiprocessing=False, 
)

model = Seq2SeqModel(
    encoder_type="roberta",
    encoder_name="roberta-base",
    decoder_name="roberta-base",
    args=model_args,
)
Code
#hide_output
model.train_model(context_train_df, eval_data=context_test_df)
/home/matthew/.cache/pypoetry/virtualenvs/blog-HrtMnrOS-py3.9/lib/python3.9/site-packages/torch/optim/lr_scheduler.py:129: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "



(4945,
 {'global_step': [989, 1978, 2000, 2967, 3956, 4000, 4945],
  'eval_loss': [6.194431372499659,
   5.941921693593384,
   5.912081602613935,
   5.539853030370797,
   5.231190737442449,
   5.220090480951162,
   5.126498561156423],
  'train_loss': [5.742949485778809,
   4.965937614440918,
   5.736418724060059,
   5.247804164886475,
   5.681723117828369,
   5.09929084777832,
   6.1811418533325195]})

It’s worth showing the train and eval loss for this as it’s quite interesting. You can see that the training loss is all over the place while the evaluation loss consistently moves down. These values are still significantly higher than when using the pretrained model.
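
A quick plot makes that comparison easier to see. This assumes the tuple returned by train_model was captured (for example as steps, history = model.train_model(...)), which I didn’t actually do above, so treat it as a sketch:

Code
import matplotlib.pyplot as plt

# history is the dict of step/loss lists returned as the second element of train_model
plt.plot(history["global_step"], history["train_loss"], label="train loss")
plt.plot(history["global_step"], history["eval_loss"], label="eval loss")
plt.xlabel("global step")
plt.ylabel("loss")
plt.legend()
plt.show()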

Separate Model Evaluation

Now it’s time to evaluate it. Remember that the loss is very high so I do not expect great things.

Code
rows = context_test_df.iloc[:5]

for row, predicted in zip(
    rows.to_dict("records"),
    model.predict(rows.input_text.tolist())
):
    print(f"context: {row['input_text']}\ntarget: {row['target_text']}\npredicted: {predicted}\n")

context: covid post trends world
target: Content Marketing Trends in the post-COVID World
predicted: 

context: demandjump
target: Content Marketing | DemandJump
predicted: 

context: art artist bromley country googled ken supply world
target: The Most Googled Artist in Every Country in the World | Ken Bromley Art Supplies
predicted: Content:::::: Marketing World - World - World - World

context: important process stage
target: 5 of the Most Important Stages in the Content Marketing Process
predicted: 

context: actually plan rank seo
target: How to plan SEO content that actually ranks
predicted: How to to to Content Content Content Content Content Content Content Marketing

Most of these rows don’t produce a prediction at all. The ones that do are extremely repetitive and consistently produce “content” and “marketing” as output. This would need a lot more training to be comparable with the fully pretrained model.

Separate Model Improvements

To get the two models working together effectively you need to prepare them first. I suggest that training them together on this data using masked language modelling would do that; once that has been done, the pair of models should produce much more coherent output.
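
One way to act on that suggestion would be to further pretrain the shared roberta-base checkpoint on the titles with masked language modelling before using it for both halves. A rough, untested sketch of that idea follows; the folder names are made up and it assumes the datasets library is available:

Code
from datasets import Dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

mlm_tokenizer = AutoTokenizer.from_pretrained("roberta-base")
mlm_model = AutoModelForMaskedLM.from_pretrained("roberta-base")

def tokenize(batch):
    return mlm_tokenizer(batch["target_text"], truncation=True, max_length=64)

titles_dataset = Dataset.from_pandas(
    train_df[["target_text"]].reset_index(drop=True)
).map(tokenize, batched=True, remove_columns=["target_text"])

trainer = Trainer(
    model=mlm_model,
    args=TrainingArguments(
        output_dir=str(DATA_FOLDER / "mlm-outputs"),  # made-up folder
        num_train_epochs=TRAIN_EPOCHS,
    ),
    train_dataset=titles_dataset,
    data_collator=DataCollatorForLanguageModeling(mlm_tokenizer, mlm_probability=0.15),
)
# trainer.train()
# mlm_model.save_pretrained(DATA_FOLDER / "roberta-titles")
# mlm_tokenizer.save_pretrained(DATA_FOLDER / "roberta-titles")
# then build the Seq2SeqModel with encoder_name and decoder_name pointing at that folder

Whether that would close the gap to the fully pretrained bert2bert checkpoint is something to test in a future post.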