VOOZH about

URL: https://www.analyticsvidhya.com/blog/2024/02/knowledge-fusion-of-large-language-models-llms/

⇱ All About Knowledge Fusion of Large Language Models (LLMs)


India's Most Futuristic AI Conference Is Back – Bigger, Sharper, Bolder

  • d
  • :
  • h
  • :
  • m
  • :
  • s

Knowledge Fusion of Large Language Models (LLMs)

Pankaj Singh Last Updated : 14 Mar, 2024
6 min read

Introduction

In Natural Language Processing (NLP), developing Large Language Models (LLMs) has proven to be a transformative and revolutionary endeavor. These models, equipped with massive parameters and trained on extensive datasets, have demonstrated unprecedented proficiency across many NLP tasks. However, the exorbitant costs of training these models from scratch have prompted researchers to explore alternative strategies. A pioneering strategy that has emerged to enhance the capabilities of Large Language Models (LLMs) is knowledge fusion, a concept explored in-depth in the research paper titled Knowledge “Fusion of Large Language Models” by Wan, Huang, Cai, Quan, and others. 

Recognizing the need to address redundancy in the functionalities of newly developed LLMs, this innovative approach offers a compelling solution. The paper delves into the intricate process of merging the knowledge of various LLMs, presenting a promising avenue to refine and amplify the performance of these language models.

The fundamental idea is to combine the strengths and capabilities of existing LLMs, transcending the limitations of individual models. By merging existing pre-trained LLMs, we can create a more powerful model that surpasses the individual strengths of each source model. 

👁 Knowledge Fusion of Large Language Models

Understanding the Knowledge Fusion of LLMs

The paper begins by highlighting the challenges and costs of training LLMs from scratch. The authors propose knowledge fusion as an efficient and cost-effective alternative. Rather than merging weights directly, the approach focuses on externalizing the collective knowledge of source LLMs and transferring it to a target model. The research introduces FUSELLM, a method that leverages the generative distributions of source LLMs, aiming to enhance the target model’s capabilities beyond any individual source LLM.

The primary objective of LLMs fusion is to externalize the inherent knowledge embedded within multiple source LLMs and integrate their capabilities into a target LLM. The paper emphasizes stimulating LLMs to manifest knowledge by predicting the next token in a given text. The probabilistic distributions generated by different source LLMs for the same text are then fused into a single representation, creating a unified probabilistic understanding over the text.

Implementation Details: Token Alignment and Fusion Strategies

The paper introduces two crucial implementation details to ensure effective knowledge fusion: token alignment and fusion strategies. 

Token alignment is achieved through a Minimum Edit Distance (MinED) strategy, enhancing the success rate of aligning tokens from different LLMs. 

Fusion strategies, namely MinCE and AvgCE, evaluate the quality of different LLMs and assign varying levels of importance to their distribution matrices based on cross-entropy scores.

Experiments and Evaluation

The research conducts experiments on a challenging scenario of LLMs fusion, where the source models exhibit minimal commonalities. Three representative open-source models – Llama-2, OpenLLaMA, and MPT – are selected as source LLMs for fusion, with another Llama-2 serving as the target LLM. The experiments span benchmarks assessing reasoning, commonsense, and code generation capabilities.

Performance Across Different Benchmarks

The comprehensive evaluation of FUSELLM’s performance across various benchmarks provides valuable insights into its efficacy. Table 1 showcases the overall results of FUSELLM in comparison to baseline methods on the Big-Bench Hard (BBH). Notably, FUSELLM demonstrates an average relative performance gain of 5.16% over the original Llama-2 across all 27 tasks. The specific tasks, such as Hyperbaton, show substantial enhancements, underscoring FUSELLM’s ability to leverage collective knowledge for improved performance.

Moving on to the Common Sense (CS) benchmark in Table 2, FUSELLM consistently outperforms baselines across all tasks, achieving a relative performance improvement of 1.25% over Llama-2. This trend holds true even in challenging tasks like ARC-challenge and OpenBookQA, where FUSELLM exhibits significant improvements, highlighting its effectiveness in addressing intricate problems.

In the context of code generation, Table 3 illustrates the zero-shot performance of FUSELLM on the MultiPL-E (ME) benchmark. Outperforming Llama-2 in 9 out of 10 tasks, FUSELLM showcases a notable enhancement in the pass@1 score, particularly for specific programming languages like R. Despite a performance gap compared to OpenLLaMA or MPT, FUSELLM still achieves a remarkable average performance gain of 6.36%, surpassing the 1.37% improvement observed in Llama-2 CLM.

The Fused Probabilistic Distributions: Accelerating Optimization

👁 Image
👁 Image

A crucial aspect of FUSELLM’s success lies in its ability to utilize fused probabilistic distributions from multiple LLMs. Figure 2 compares the few-shot Chain-of-Thought (CoT) performance between Llama-2 CLM and FUSELLM with varying scales of training data on BBH. FUSELLM significantly enhances the exact match (EM) accuracy by 2.5%, achieving the best performance of Llama-2 CLM within 0.52 billion tokens. This represents a 3.9× reduction in token requirements compared to Llama-2 CLM, indicating that the probabilistic distributions derived from LLMs contain more readily learnable knowledge than the original text sequences, thereby accelerating the optimization process.

Analysis of the Implementation Process

Delving into the implementation details of FUSELLM reveals critical considerations for its success. The number of source LLMs, token alignment criteria, and the choice of fusion function play pivotal roles in shaping FUSELLM’s performance.

  • Number of Source LLMs: Table 4 demonstrates the performance improvement of FUSELLM with varying numbers of models. The results show an apparent enhancement as the number of models increases from 1 to 3, with consistent improvements observed in BBH.
  • Criteria for Token Alignment: Proper token alignment is crucial during the fusion of LLMs. The proposed MinED method consistently outperforms the EM method, showcasing the effectiveness of MinED in aligning tokens from multiple models.
  • Fusion Function: The choice of the fusion function is critical, and FUSELLM with MinCE consistently outperforms AvgCE across all benchmarks. This emphasizes the importance of the fusion function in preserving the distinct advantages of individual LLMs.

FUSELLM vs. Knowledge Distillation and Ensemble/Merging

Comparative analyses with traditional techniques like knowledge distillation and ensemble/merging shed light on FUSELLM’s unique strengths.

  • FUSELLM vs. Knowledge Distillation: FUSELLM outperforms knowledge distillation, especially in BBH, where the improvement achieved by FUSELLM (5.16%) surpasses the modest gain of knowledge distillation (2.97%). This highlights FUSELLM’s ability to harness collective knowledge from multiple LLMs more effectively.
  • FUSELLM vs. Ensemble/Merging: In scenarios where multiple LLMs originated from the same base model but were trained on distinct corpora, FUSELLM consistently achieves the lowest average perplexity across three domains compared to ensemble and weight merging methods. This reinforces FUSELLM’s potential to leverage collective knowledge more effectively than traditional fusion methods.

Also read: Knowledge Distillation: Theory and End to End Case Study

You can find the code, model weights, and data public here: GitHub FUSELLM

Conclusion

The paper concludes with compelling results, showcasing the effectiveness of FUSELLM over individual source LLMs and established baselines. The study opens up a promising avenue for future exploration in LLMs fusion. The findings emphasize the potential of combining the diverse capabilities and strengths of structurally different LLMs, shedding light on a cost-effective and powerful approach to developing large language models.

The knowledge fusion of large language models is an innovative solution in a world where the demand for advanced natural language processing capabilities continues to rise. This research paves the way for future endeavors in creating unified models that harness the collective intelligence of diverse LLMs, pushing the boundaries of what is achievable in the realm of natural language understanding and generation.

I’m eager to know your opinions regarding the Knowledge Fusion of Large Language Models (LLMs). Feel free to share your insights on any other noteworthy and informative papers you may have encountered in the comments section.

Also read: A Comprehensive Guide to Fine-Tuning Large Language Models

Hi, I am Pankaj Singh Negi - Senior Content Editor | Passionate about storytelling and crafting compelling narratives that transform ideas into impactful content. I love reading about technology revolutionizing our lifestyle.

Login to continue reading and enjoy expert-curated content.

Free Courses

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

Responses From Readers

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Generative AI| DeepSeek| OpenAI Agent SDK| LLM Applications using Prompt Engineering| DeepSeek from Scratch| Stability.AI| SSM & MAMBA| RAG Systems using LlamaIndex| Building LLMs for Code| Python| Microsoft Excel| Machine Learning| Deep Learning| Mastering Multimodal RAG| Introduction to Transformer Model| Bagging & Boosting| Loan Prediction| Time Series Forecasting| Tableau| Business Analytics| Vibe Coding in Windsurf| Model Deployment using FastAPI| Building Data Analyst AI Agent| Getting started with OpenAI o3-mini| Introduction to Transformers and Attention Mechanisms

Popular Categories

AI Agents| Generative AI| Prompt Engineering| Generative AI Application| News| Technical Guides| AI Tools| Interview Preparation| Research Papers| Success Stories| Quiz| Use Cases| Listicles

Generative AI Tools and Techniques

GANs| VAEs| Transformers| StyleGAN| Pix2Pix| Autoencoders| GPT| BERT| Word2Vec| LSTM| Attention Mechanisms| Diffusion Models| LLMs| SLMs| Encoder Decoder Models| Prompt Engineering| LangChain| LlamaIndex| RAG| Fine-tuning| LangChain AI Agent| Multimodal Models| RNNs| DCGAN| ProGAN| Text-to-Image Models| DDPM| Document Question Answering| Imagen| T5 (Text-to-Text Transfer Transformer)| Seq2seq Models| WaveNet| Attention Is All You Need (Transformer Architecture) | WindSurf| Cursor

Popular GenAI Models

Llama 4| Llama 3.1| GPT 4.5| GPT 4.1| GPT 4o| o3-mini| Sora| DeepSeek R1| DeepSeek V3| Janus Pro| Veo 2| Gemini 2.5 Pro| Gemini 2.0| Gemma 3| Claude Sonnet 3.7| Claude 3.5 Sonnet| Phi 4| Phi 3.5| Mistral Small 3.1| Mistral NeMo| Mistral-7b| Bedrock| Vertex AI| Qwen QwQ 32B| Qwen 2| Qwen 2.5 VL| Qwen Chat| Grok 3

AI Development Frameworks

n8n| LangChain| Agent SDK| A2A by Google| SmolAgents| LangGraph| CrewAI| Agno| LangFlow| AutoGen| LlamaIndex| Swarm| AutoGPT

Data Science Tools and Techniques

Python| R| SQL| Jupyter Notebooks| TensorFlow| Scikit-learn| PyTorch| Tableau| Apache Spark| Matplotlib| Seaborn| Pandas| Hadoop| Docker| Git| Keras| Apache Kafka| AWS| NLP| Random Forest| Computer Vision| Data Visualization| Data Exploration| Big Data| Common Machine Learning Algorithms| Machine Learning| Google Data Science Agent
👁 Av Logo White

Continue your learning for FREE

Forgot your password?
👁 Av Logo White

Enter OTP sent to

Edit

Wrong OTP.

Enter the OTP

Resend OTP

Resend OTP in 45s

👁 Popup Banner
👁 AI Popup Banner