Recommendations for LLM fine-tuning

01/13/2025

In some cases, LLMs may not perform well on specific domains, tasks, or datasets, or may produce inaccurate or misleading outputs. In such cases, fine-tuning the model can be a useful technique to adapt it to the desired goal and improve its quality and reliability.

For the latest documentation on fine-tuning within Azure, see Fine-tuning your model in the official docs.

When to consider fine-tuning

Below are some scenarios where fine-tuning can be considered.

Hallucinations: Hallucinations are untrue statements output by the model. They can harm the credibility and trustworthiness of your application. One possible mitigation is fine-tuning the model with data that contains accurate and consistent information.
Accuracy and quality problems: Pre-trained models may not achieve the desired level of accuracy or quality for a specific task or domain. This shortfall can be due to a mismatch between the pre-training data and the target data, the diversity and complexity of the target data, or incorrect evaluation metrics and criteria.

How fine-tuning can help

Fine-tuning the model with data that is representative and relevant to the target task or domain may help improve the model's performance. Examples include:

Adding domain-specific knowledge: Teaching the model new (uncommon) tasks, or constraining it to a smaller space, may require the model to learn skills, concepts, or vocabulary that are not well represented in its original training data. This is especially true of complex specialized tasks. Some examples are legal, medical, and technical texts. These tasks or domains may also have specific constraints or requirements, such as length, format, or style, that limit the model's generative space. Fine-tuning the model with domain-specific data may help the model acquire the necessary knowledge and skills and generate more appropriate and coherent texts.
Adding data that doesn't fit in a prompt: The LLM prompt is the input text that is given to the model to generate an output. It usually contains some keywords, instructions, or examples that guide the model's behavior. However, the prompt has a limited size, and the data needed to complete a task may exceed the prompt's capacity. This happens in applications that require the LLM to process long documents or large tables. In such cases, fine-tuning can help the model handle more data and use smaller prompts at inference time to generate more relevant and complete outputs.
Simplifying prompts: Long or complex prompts can affect the model's efficiency and scalability. Fine-tuning the model with data that is tailored to the target task or domain can help the model provide quality responses from simpler prompts, and potentially use fewer tokens and improve latency.

Best practices for fine-tuning

Here are some best practices that can help improve the efficiency and effectiveness of fine-tuning LLMs for various applications:

Try different data formats: Depending on the task, different data formats can have different impacts on the model’s performance. For example, for a classification task, you can use a format that separates the prompt and the completion with a special token, such as {"prompt": "Paris##\n", "completion": " city\n###\n"}. Be sure to use formats suitable for your application.
Collect a large, high-quality dataset: LLMs are data-hungry and can benefit from more diverse and representative fine-tuning data. However, collecting and annotating large datasets can be costly and time-consuming, so you can also use synthetic data generation techniques to increase the size and variety of your dataset. Make sure that the synthetic data is relevant to and consistent with your task and domain, and that it does not introduce noise or bias into the model.
Try fine-tuning subsets first: To assess the value of getting more data, you can fine-tune models on subsets of your current dataset to see how performance scales with dataset size. This fine-tuning can help you estimate the learning curve of your model and decide whether adding more data is worth the effort and cost. You can also compare the performance of your model with the pre-trained model or a baseline. This comparison shows how much improvement you can achieve with fine-tuning.
Experiment with hyperparameters: Iteratively adjust hyperparameters to optimize the model's performance. Hyperparameters such as the learning rate, the batch size, and the number of epochs can have a significant effect on the model's performance. Therefore, you should experiment with different values and combinations of hyperparameters to find the best ones for your task and dataset.
Start with a smaller model: A common mistake is assuming that your application needs the newest, biggest, most expensive model. Especially for simpler tasks, start with smaller models and only try larger models if needed.
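
As a rough illustration of the data-format and subset practices above, the following Python sketch builds prompt/completion records in the shape of the example shown earlier and then cuts nested subsets for a learning-curve experiment. The separator and stop strings are copied from that example and are assumptions about your format. The helper names, the toy examples, and the subset fractions are hypothetical. The sketch does not call any fine-tuning service.

```python
import json
import random

# Assumed markers, copied from the article's example record; adjust to your format.
PROMPT_SEPARATOR = "##\n"
COMPLETION_STOP = "\n###\n"


def to_record(text: str, label: str) -> dict:
    """Build one prompt/completion pair in the article's example format."""
    return {
        "prompt": f"{text}{PROMPT_SEPARATOR}",
        "completion": f" {label}{COMPLETION_STOP}",
    }


def to_jsonl(records: list) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def nested_subsets(records, fractions=(0.25, 0.5, 1.0), seed=0):
    """Return growing subsets cut from one shuffled copy of the data.

    Every larger subset contains the smaller ones, so differences in
    results reflect dataset size rather than a different sample.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    return {f: shuffled[: max(1, round(len(shuffled) * f))] for f in fractions}


# Hypothetical toy examples, for illustration only.
examples = [
    ("Paris", "city"),
    ("France", "country"),
    ("Seine", "river"),
    ("Alps", "mountain range"),
]
records = [to_record(text, label) for text, label in examples]
subsets = nested_subsets(records)

for fraction, subset in subsets.items():
    print(f"{fraction:.0%} of data: {len(subset)} records")
print(to_jsonl(subsets[0.25]))
```

Fine-tuning one model per subset and scoring each on the same held-out set gives a coarse learning curve. If the score is still rising at the largest subset, collecting more data is more likely to pay off.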

Challenges and limitations of fine-tuning

Fine-tuning large language models can be a powerful technique to adapt them to specific domains and tasks. However, fine-tuning also comes with some challenges and disadvantages that need to be considered before applying it to a real-world problem. Below are a few of these challenges and disadvantages.

Fine-tuning requires high-quality, sufficiently large, and representative training data matching the target domain and task. Quality data is relevant, accurate, consistent, and diverse enough to cover the possible scenarios and variations the model will encounter in the real world. Poor-quality or unrepresentative data leads to over-fitting, under-fitting, or bias in the fine-tuned model, which harms its generalization and robustness.
Fine-tuning large language models means extra costs associated with training and hosting the custom model.
Formatting input/output pairs used to fine-tune a large language model can be crucial to its performance and usability.
Fine-tuning may need to be repeated whenever the data is updated, or when an updated base model is released. This means monitoring the fine-tuned model and refreshing it regularly.
Fine-tuning is an iterative, trial-and-error process, so the hyperparameters need to be set carefully. It takes considerable experimentation and testing to find the combination of hyperparameters and settings that achieves the desired performance and quality.
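
To make the trial-and-error aspect concrete, here is a minimal sketch of enumerating a small hyperparameter grid in Python. The search space values are placeholders rather than recommendations, and `evaluate` is a hypothetical caller-supplied function. In practice it would fine-tune with the given configuration and return a score on a held-out validation set, which is the expensive part this sketch leaves out.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not recommendations.
search_space = {
    "learning_rate": [1e-5, 5e-5],
    "batch_size": [8, 16],
    "n_epochs": [2, 4],
}


def grid(space: dict) -> list:
    """Expand a dict of candidate values into a list of configurations."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


def pick_best(configs: list, evaluate) -> dict:
    """Return the configuration with the highest score.

    `evaluate` is assumed to map a configuration to a validation score
    where higher is better.
    """
    return max(configs, key=evaluate)


configs = grid(search_space)
print(f"{len(configs)} configurations to try")


def dummy_score(config: dict) -> float:
    """Stand-in scorer so the sketch runs; NOT a real quality measure."""
    return -abs(config["learning_rate"] - 5e-5) - 0.001 * config["n_epochs"]


print(pick_best(configs, dummy_score))
```

Because every configuration implies a separate training run, the grid grows quickly. Three hyperparameters with two values each already mean eight runs, which is one reason to try subsets and smaller models first.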
