Weaviate Blog

Distillation Experiments

2024-09-01T00:00:00.000Z

Is it better to distill or finetune language models?

Arcee did a bunch of experiments comparing model distillation to finetuning, base vs instruct model distillation and more.

Main Takeaways:

♦️ Both logit-based and hidden states-based distillation methods consistently outperform standard SFT across various benchmarks.

♦️ General-Purpose Performance Gains: Significant improvements across datasets like OpenHermes, WebInstruct-Sub, and FineTome, particularly in MMLU and MMLU-Pro benchmarks, indicating enhanced knowledge absorption.

♦️ Domain-Specific Performance Gains: Distilling models for domain-specific tasks, especially when using the same training data as the teacher model leads to performance improvements.

Experiments

♦️Experiment 1: What's better Supervised-Finetune(SFT) or Distill+SFT?

Three models—Hermes-Distilled (logit-based), Hermes-Hidden-States, and Hermes Vanilla (SFT-only)—were evaluated, all distilled from Arcee Spark using a subset of the Teknium's OpenHermes-2.5 dataset (200k examples). Both distillation methods were better than SFT-only model across major benchmarks such as BBH, MUSR, and MMLU-PRO. The logit-based approach was better then the hidden-states-based distillation.

♦️Experiment 2: Effectiveness of Logit-based Distillation in a Generic Domain

The 1.5B Distilled model, trained on a 200k subset of WebInstruct-Sub, demonstrated performance improvements over the baseline Qwen2-1.5B-Instruct model across all metrics. Its performance was also comparable to the teacher model, Arcee Spark, particularly on MUSR and GPQA benchmarks.

♦️Experiment 3: Distillation on Instruct vs. Base Student Models

The 1.5B-Instruct-Distilled model (logit-based), trained on WebInstruct-Sub, showed performance improvements over the vanilla Qwen2-1.5B-Instruct model on the MMLU benchmark, showing benefits of distillation for enhancing knowledge retrieval.

♦️Experiment 4: Effectiveness of Domain-specific Distillation

Distilling Arcee Agent, a 7B parameter model specialized in function calling, into Qwen2-1.5B-Instruct using the same dataset that trained the teacher model resulted in performance gains. This approach underscores the potential of using the same training data for both teacher and student models to achieve even greater improvements, particularly in domain-specific tasks.

More resources: DistillKit by Arcee AI

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Language Model Distillation

2024-08-08T00:00:00.000Z

Distillation has become popular recently due to its ability to efficiently compress the knowledge of larger LLMs into smaller ones. Here’s how it works, why it’s useful, and examples of how you can perform distillation

What is distillation?

Distillation is a model compression technique where a smaller "student" model is trained to mimic the behavior of a larger "teacher" model. This is achieved by transferring knowledge from the teacher to the student, usually through methods like logit-based or hidden states-based distillation. These methods are designed to help the student model replicate the teacher's output distribution or internal representations, often leading to a more efficient model with comparable performance.

When would we use this?

Distillation is commonly used when deploying large models is impractical due to resource constraints, such as in real-time applications or edge devices. For instance, a smaller student model can be distilled from a powerful teacher model like Llama3.1 405B, retaining much of the original model’s capability but with significantly lower computational demands. Distillation is also useful when adapting models to specific tasks or domains, as seen in domain-specific distillation cases like "function calling," where specialized knowledge from a teacher model is transferred to a smaller model for specific use cases.

What’s the benefit?

Distillation offers a significant reduction in model size and computational requirements while maintaining a high level of performance. This is especially valuable in scenarios where memory and processing power are limited. Moreover, distillation allows for flexibility in model architecture choices; for example, distilling knowledge from a Llama-3.1-70B model into a much smaller StableLM-2-1.6B model. Distillation methods like those provided in Arcee-AI's DistillKit, including logit-based and hidden states-based distillation, can lead to substantial performance gains over traditional training routines without requiring additional data.

Examples of Distillation Techniques:

(1) Logit-based Distillation:

This method involves transferring knowledge by using both the hard targets (actual labels) and soft targets (teacher logits) to guide the student model. The student is trained to minimize the difference between its output distribution and the teacher’s output, typically using Kullback-Leibler (KL) divergence. This method is particularly effective for maintaining performance close to the teacher model while improving the student’s generalization abilities.

(2) Hidden States-based Distillation:

Here, the focus is on aligning the intermediate layer representations of the student with those of the teacher. This layer-wise guidance helps the student model capture similar features and improves its performance and generalization. This method also allows for cross-architecture distillation, enabling knowledge transfer between different model architectures, such as distilling from a Llama-3.1-70B model into a StableLM-2-1.6B model.

More resources: DistillKit by Arcee AI

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

LoRA: Low-Rank Adaptation of Large Language Models

2024-07-28T00:00:00.000Z

Detailed explanation of Low-Rank Adaptation (LoRA), a method for efficiently fine-tuning pre-trained neural networks.

The Problem LoRA Solves:

In early 2021, Microsoft partnered with OpenAI to explore the commercial viability of GPT-3.
They found that prompting was insufficient for production tasks like natural language to code generation.
Fine-tuning was necessary but prohibitively expensive due to the large size of model checkpoints.

How It Works:

LoRA generalizes full fine-tuning(updating every single parameter) by asking two questions:
1. Do we need to fine-tune all parameters?
2. For the weight matrices we fine-tune, how expressive should the updates be in terms of matrix rank?
These questions define a 2D plane where full fine-tuning is the top-right corner(full rank and full parameter updates) and the origin represents the original model.
Any point in this plane is a valid LoRA configuration.

The chosen rank of the update matrix controls the expressivity of the finetuning process.

A d x d matrix can represent any linear transformation in a d-dimensional vector space.
By first transforming the input to a lower-dimensional space and then back to the original space, we can restrict the kind of linear transformations that can be represented.
This reduces the number of parameters that need to be stored from (dxd) to (dxr + dxr) where r << d.
A point near the origin often performs as well as full fine-tuning. - because often Neural Networks are over-parametrized and thus the weight matrices are full of linearly dependent
This suggests that we can start with a low-rank configuration and gradually increase the rank if needed.

Common practices when using LoRA:

How to choose the rank R of the update matrix: Start with a low rank and increase it if needed.
When to use full fine-tuning?: When finetuning on data that is completely new and absent from the pretraining of the base model (for example if you are tuning an English model on Martian then full fine-tuning may be necessary).
Can I use LoRA for any model architecture?: As long as the model uses matrix multiplication, LoRA can be applied. So basically pretty much every model architecture can use LoRA!

Benefits of LoRA:

Reduced checkpoint sizes: On GPT-3, checkpoint size was reduced from 1TB to 25MB.
No additional inference latency: LoRA updates can be merged with the original parameters during inference. W_new = W_old + AxB
Ability to quickly switch between tasks: LoRA modules can be loaded and unloaded efficiently.(A_frenchxB_french),(A_germanxB_german),(A_spanishxB_spanish)

Some interesting engineering ideas enabled by LoRA:

Caching LoRA modules in RAM for faster model switching and routing between different finetunes.
Training multiple LoRA modules in parallel on different batches of the training set.
Creating a tree of adaptive models where each node is a LoRA module.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Prover-verifier Games Improve Legibility of LLM Outputs

2024-07-22T00:00:00.000Z

Training LLMs to write solutions such that smaller models can better check them. This makes them easier to check for humans, too.

Key Findings

Performance vs. Legibility Trade-Off:
- A model with only supervised fine-tuning on human-written derivations achieved high legibility but poor accuracy.
- Optimizing solely for correctness led to high accuracy but poor legibility.
- They propose method balanced both, retaining high legibility with moderate accuracy.
They Train Three Models:
- Robust Verifier: Accurately distinguishes correct from incorrect solutions, even when misleading.
- Helpful Prover: Generates human-legible solutions, reducing human error.
- Sneaky Prover: Produces subtly incorrect solutions, initially confusing human evaluators and highlighting improvement areas.

How It Works

To train the prover models to produce human-legible text, they use a scheme where the prover's solutions must be easily verifiable by a weaker model.

They use models from the GPT-4 family with significant differences in pretraining compute, they train using a GAN like(generator and discriminator) alternating training process as follows:

Generate multiple solutions to math problems from the prover, some of which are naturally incorrect.
Alternate between two training phases:
- Verifier Training: Train the weak verifier to predict the correctness of the prover’s solutions.
- Prover Training: Instruct the prover to be either "helpful" or "sneaky," rewarding correct solutions in the former role and incorrect solutions that fool the verifier in the latter.

Initially, the sneaky prover easily deceives the verifier. However, as incorrect solutions are added to verifier training, the prover must continuously develop new strategies. Training concludes when the sneaky prover can no longer deceive the verifier.

Training Framework: Multi-agent setup with honest prover (correct solutions) and sneaky prover (deceptive solutions), both checked by a smaller model(robust verifier).

Details:

Unique, effective loss function.
Honest prover's solutions become detailed and reasoned over time.
Sneaky prover's solutions develop subtler flaws.
Legibility transfer: Honest prover’s solutions become easier for humans to verify with training.
Humans perform better (speed and accuracy) with more trained models in time-constrained tasks.
Traditional reinforcement learning for correctness leads to less legible solutions.
Approach helps in training models to explain actions comprehensibly, reducing deception.
"Legibility tax" affects large model performance. - as legibility increases performance decreases!

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

RouteLLM: Learning to Route LLMs with Preference Data

2024-07-14T00:00:00.000Z

You don't need a 2 trillion parameter model to tell you the capitol of France is Paris.

Be smart and route between a panel of models according to query difficulty and model specialty!

New paper proposes a framework to train a router that routes queries to the appropriate LLM to optimize the trade-off b/w cost vs. performance.

Model inference cost varies significantly: Per one million output tokens: Llama-3-70b ($1) vs. GPT-4-0613 ($60), Haiku ($1.25) vs. Opus ($75)

Overview:

The RouteLLM paper propose a router training framework based on human preference data and augmentation techniques, demonstrating over 2x cost saving on widely used benchmarks.

They define the problem as having to choose between two classes of models: (1) strong models - produce high quality responses but at a high cost (GPT-4o, Claude3.5)

(2) weak models - relatively lower quality and lower cost (Mixtral8x7B, Llama3-8b)

A good router requires a deep understanding of the question’s complexity as well as the strengths and weaknesses of the available LLMs.

Explore different routing approaches:

Similarity-weighted (SW) ranking
Matrix factorization
BERT query classifier
Causal LLM query classifier

Neat Ideas to Build From:

Users can collect a small amount of in-domain data to improve performance for their specific use cases via dataset augmentation.
Can expand this problem from routing between a strong and weak LLM to a multiclass model routing approach where we have specialist models(language vision model, function calling model etc.)
Larger framework controlled by a router - imagine a system of 15-20 tuned small models and the router as the n+1'th model responsible for picking the LLM that will handle a particular query at inference time.
MoA architectures: Routing to different architectures of a Mixture of Agents would be a cool idea as well. Depending on the query you decide how many proposers there should be, how many layers in the mixture, what the aggregate models should be etc.
Route based caching: If you get redundant queries that are slightly different then route the query+previous answer to a small model to light rewriting instead of regenerating the answer

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders

2024-07-07T00:00:00.000Z

How do you get the retrieval quality of a cross-encoder/re-ranker and the efficiency of a bi-encoder?

Typically people choose to do this with the trusty old retrieve-and-re-rank approach.

This new paper from DeepMind proposes Adaptive Cross-Encoder Nearest Neighbor Search, an alternative which approximates the re-ranker query-document similarities while still using a bi-encoder setup.

High-level Overview:

You can think of this as an efficient way to train an adaptor for the query vector that transforms the query vector in such a way that makes the similarity scores b/w query-documents more like the cross-encoder similarity scores.

Once you pass the query vector through the trained adopter then you can simply use cosine similarity with the document embeddings
Can use existing bi-encoder models to initialize the item and query embeddings
In an offline indexing step -> compute query/item embeddings to index a given set of items from a target domain making sure the similarity scores are similar to cross encoder scores
At test time -> compute the test query embedding to approximate cross-encoder scores of the given test query for a small set of adaptively-chosen items
Perform retrieval with the approximate cross-encoder scores

Details:

At test time, they keep item embeddings fixed and perform retrieval over multiple rounds, alternating between:

estimating the test query embedding by minimizing error in approximating CE scores of items retrieved thus far, and

using the updated test query embedding for retrieving more items in the next round.

In the last round once the test query embedding is fully refined, this test query embedding can now be used to retrieve items using cosine similarity.

Proposed k-NN search method can achieve up to 5% and 54% improvement in k-NN recall for k = 1 and 100 respectively over the widely-used DE-based retrieve-and-rerank approach.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Many-Shot In-Context Learning

2024-07-05T00:00:00.000Z

Should you finetune your LLM or just give relevant examples in the prompt? How many examples should you give for best performance?? If you give more will it hurt perf?? Does order of the examples matter!??

New paper from Deepmind answers all these questions and more, so much to take away from this one, lets dig in!

Main Takeaways:

Large performance jumps when going from providing very few(1-5) examples(few-shot incontext learning(ICL) to providing many(100s-1000s) examples(many-shot ICL) - The harder the task the more it benefits from more examples in the prompt!
Propose using synthetically generated examples(as opposed to human labelled ones) and find that works quite well
Propose providing only questions and no answers, in the examples, and find this also works quite well!!
Show that many-shot ICL can overcome pre-training biases, perform comparably to supervised fine-tuning, and learn non-NLP prediction tasks.

Juicy Details:

Full supervised/instruction fine-tuning only slightly outperforms many-shot ICL for many tasks
They mainly test Gemini 1.5 but also try GPT4 and Claude 3.5 and show that different LLMs have varying degrees of success when using many-shot ICL - not a model agnostic trick
They show that if you provide encough examples in the prompt it can adapt to unseen non-lingual tasks and even in domains that might be misaligned with an LLM’s training data
Surprisingly, the order of examples in the prompt also influences many-shot performance - would be interesting to see how optimization systems like DSPy can help with this
Adding more examples, then optimal, in the prompt can also sometimes degrade performance for certain tasks - weird finding - opportunity for DSPy to do its thing here aswell
Many-shot ICL achieves comparable or superior performance when using only problems compared to using problems with solutions - signals that providing solutions with many-shot ICL might just be redundant
Many shot ICL also shows an improvement in out-of-distribution general problem-solving abilities from in-context learning - Math tasks and etc.
Biases instilled in the model during pre-training can also be overcome with many shot ICL - a small number of shots leads to the model giving into the bias but with enough examples this eventually diminishes as task learning takes effect in the many shot regime.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Token Pooling to Scale Multi-Vector Retrieval Systems

2024-06-30T00:00:00.000Z

🏹Multi-vector retrieval approaches, like ColBERT, have great retrieval quality but vector count can balloon, AnswerAI propose a solution!

Below is an explanation of how ColBERT works and AnswerAI's proposed modification!

Breakdown of different types of encoders:

Cross-encoders:

Document text & query text strings concatenated and passed into a cross-encoder which then outputs a rank/score.

Bi-encoders:

Document text passed into an encoder and generates a document embedding
Query text separately passed into an encoder and generates a query embedding
Similarity of query and doc embedding calculated
Retrieval performance can suffer especially on Out-Of-Domain data

Multi-vector bi-encoder - (eg. ColBERT):

Functions as a bi-encoder: all documents representations are pre-computed in isolation
Similarity computation occurs between individual query and document token vectors, as opposed to the full document.

Main weakness of multi-vector approaches:

Storage and memory usage balloons up, each token in a document requires a vector(as opposed to one document = one vector)
Complicated to efficiently search through multiple vectors

AnswerAI's Proposed Token Pooling Solution:

Main Idea: a lot of the tokens are likely to carry somewhat redundant semantic information, we can semantically cluster them!
Requires no model modification whatsoever, nor any complex processing, while greatly improving the scalability of easily updatable indexing methods - like HNSW, which are typically harder to use with ColBERT.

Approach:

Token pooling by clustering similar tokens within a given document and averaging (mean pooling) their representation.
After being pooled, the vectors are then quantised to 2-bits using the ColBERTv2 quantisation approach
Each cluster is represented by a single token by averaging the values of contained tokens

Results:

All results compared to non-pooled ColBERT vector approach
Pooling by a factor 2 achieves a 50% vector count reduction and 100.6% retrieval performance on average.
Pool factor = 3 achieves 66% reduction while reaching 99% performance.
Pool factor = 4 achieves 75% reduction while reaching 97% performance.

Code

Blog

🔗 Blog Link

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Mixture-of-Agents Enhances Large Language Model Capabilities

2024-06-29T00:00:00.000Z

🤖Can multiple smaller open-source LLMs be combined to outperform larger monolithic LLMs?

New paper shows that LLMs tend to generate better responses when presented with outputs from other models, even if less capable.

They use this to build a Mixture of Agents(MoA) architecture where multiple LLMs are used to iteratively enhance the generation quality.

LLMs in deeper layers are presented responses from LLMs in earlier layers and iteratively improve the response; mitigates individual model deficiencies and enhance overall response.

MoA Architecture

The complete architecture consists of LLM agents playing one of two roles:

Proposers: These models generate initial reference responses.
Aggregators: These models synthesize the different responses from the proposers into one high-quality response.

Models used: Qwen1.5-110B-Chat, Qwen1.572B-Chat, WizardLM-8x22B, LLaMA-3-70B-Instruct, Mixtral-8x22B-v0.1, dbrx-instruct
3 MoA layers and use the same set of models in each MoA layer
Qwen1.5-110BChat as the aggregator in the last layer
Some models work better as proposers and others as both proposers and aggregators.

How do you choose which models to include in the MoA??

Two metrics are used to assess which models are included in the mixture:

Performance: The average win rate of models in layer i decides if they are included in layer i + 1.
Diversity: The diversity of model outputs is important - using heterogeneous models across layers is better then using the same model

Details:

MoA achieves a new SOTA win rate of 65.8% on AlpacaEval 2.0 compared to the previous best of 57.5% achieved by GPT-4 Omni.
Overall performance comparable to GPT-4 Turbo while being 2× more cost-effective.
No finetuning required only utilizes the interface of prompting and generation of LLMs.
Extends the MoE concept to the model level by operating at the model level rather than at the activation level.
You can swap the final aggregator to any LLM of your choice (Gemini, GPT-4o, Claude3.5) and this improves performance!

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer

2024-06-22T00:00:00.000Z

Using Metadata Filters to Improve Recall in RAG

Filtered search using metadata filtering is a simple technique that can significantly improve retrieval quality in a RAG pipeline, but how do you extract metadata from chunks if your data doesn't already come with it??

GLiNER is a powerful model that allows you to extract arbitrary entities such as names, times, places, etc. from any text chunk. It outperforms decoder models like ChatGPT and others at zero-shot identification of names entities in text chunks.

How it Works:

GLiNER operates by taking in text chunks and entity labels, that you want to identify in the chunks. Both inputs are concatenated, encoded, and projected into the same latent space and fed into a classifier that predicts the entity labels per word in the text input.

This method allows the model to generalize across different NER tasks and labels passed in at query time.

The fact that entity vectors and the text chunk is concatenated allows the entity labels to attend to the text chunks and vice versa in the encoder step which allows GLiNER to work very well OOD.

Architechture:

GLiNER consists of three joined components:

An encoder backbone(DeBERTa) that generates token-level representations of the entity labels and text tokens.
A simple feedforward network that takes in entity token representations from the encoder and embeds them into vectors.
A Span layer that embeds groups of words (ie. "McGill University" -> vector) from the text into vectors

The entity vectors and span vectors are then combined and used to train a classifier that identifies which text spans correctly paired with entity labels.

Effectiveness:

Generalization: The model demonstrates SoTA performance across various NER benchmarks, outperforming traditional task-specific models.
Adaptability: GLiNER is super easy to use, pass in any text and any labels you want to extract and it simply works making it a flexible solution to add to your RAG pipeline.
Scalability: The unified approach simplifies the deployment process, as a single model can handle multiple NER tasks.

Demo Code

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

2024-06-18T00:00:00.000Z

How do you train a Large Language Model without it memorizing training data?

This paper proposes a technique called Goldfish Loss that is now used to mitigate the risk of LLMs memorizing copyrighted or private training data.

In Short:

The paper introduces Goldfish Loss, a method where the model does not compute the loss on every token but excludes (e.g.) 1 in 4 tokens from loss computation. This makes it difficult for the model to memorize the training data.

How it Works:

Goldfish Loss works by omitting a portion of tokens from loss computation during training. When the model encounters these excluded tokens at test time, it has to guess, reducing its ability to reproduce training samples exactly.

Effectiveness:

In standard training on Wikipedia articles, about 85% of them get perfectly memorized after 100 updates. With Goldfish Loss, the model usually diverges from the training data within the first 5 tokens it generates.

Trade-off:

The model learns slower because it does not get credit for the dropped tokens. Training on N tokens with Goldfish Loss is equivalent to standard training on 0.75N tokens.

Benefits:

Goldfish training is scalable and helps avoid the need for unlearning methods, which are often not scalable. This makes it possible to prevent the memorization of copyrighted text/code during training.

Results:

The paper validates Goldfish Loss by pre-training a model for 200B tokens, showing that it effectively prevents memorization without significantly compromising the learning rate.

Details in the Paper:

Explanation of the Goldfish Loss technique
Comparison of memorization rates with standard training
Analysis of the trade-offs between learning rate and memorization prevention
Validation experiments and results

Code

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Visual Instruction Tuning

2024-04-28T00:00:00.000Z

How do you teach a Large Language Model to see? Here's a breakdown!

This paper proposes a technique called Visual Instruction Tuning that is now used by many of the language vision models we see in the field such as LLaVA, GPT4-Vision and Gemini etc.

In Short:

The paper introduces a method to generate multimodal language-image instruction-following data using a language-only GPT-4 model. This data is then used to train LLaVA, a model that combines a vision encoder and a large language model (LLM) for general-purpose visual and language understanding.

How it Works:

VIT works by using GPT-4 to generate instructions for corresponding images and captions. This dataset is used to train LLaVA to learn to follow instructions and understand images. A vision encoder (CLIP ViT 40) is combined with an LLM (Vicuna) to process text and images and generate text.

Architecture:

LLaVA consists of two main components:

Vision Encoder (VE): A pre-trained vision encoder (e.g. CLIP) that takes an image as input and generates a visual embedding.
Large Language Model (LLM): A pre-trained LLM (Vicuna) that takes a text prompt as input and generates a language embedding.

The VE and LLM are combined through a series of layers and mechanisms to enable multimodal understanding and generation:

Benefits:

The combination of the VE and LLM enables LLaVA to understand and generate text and images in a unified framework, leverage the strengths of both visual and language models, generalize to unseen images and text prompts.

Results:

LLaVA achieves a 85.1% relative score compared to GPT-4 on a synthetic multimodal instruction-following dataset

Details in the Paper:

The VIT generation process, including prompt engineering and filtering
The LLaVA architecture, including the vision encoder and LLM components
Experimental results, including comparisons to GPT-4 and other baselines
Ablation studies and analysis of the effectiveness of different components and training strategies

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Retrieval-Augmented Dual Instruction Tuning (RA-DIT)

2024-04-25T00:00:00.000Z

Can we finetune our LLM and retriever together to improve RAG performance? This paper proposes a technique to do exactly that!

RAG Basics:

When you prompt an LLM, RAG supplies relevant documents. A separate retrieval model computes the probability of each text chunk being relevant and provides the top chunks to the LLM. The LLM generates tokens based on the chunks, prompt, and previous tokens.

In Short:

Fine-tuning LLMs and retrieval models together improves performance without extensive data processing, enabling better retrieval-augmented generation. LLMs aren't exposed to retrieval-augmented inputs during pretraining, limiting their ability to use retrieved text effectively. Fine-tuning the LLM and retrieval model together can improve performance without requiring extensive data processing.

How it Works:

Authors from Meta fine-tuned Llama 2 (65B parameters) and DRAGON+, a retriever, to create RA-DIT 65B. They fine-tuned Llama 2 on prompts with retrieved text and questions, and fine-tuned DRAGON+ to retrieve more relevant chunks. Fine-tuning was supervised for tasks like question-answering and self-supervised for text chunk completion.

Results:

RA-DIT 65B achieved 49.1% accuracy on average across four question datasets, outperforming LLaMA 2 65B with DRAGON+ (45.1%) and LLaMA 2 65B alone (32.9%). With five example inputs, RA-DIT 65B reached 51.8% accuracy. RA-DIT offers an efficient way to enhance LLM performance with RAG, making it a valuable technique for developers.

Details:

RA-DIT fine-tunes Llama 2 and DRAGON+ to work together effectively, leveraging the strengths of both models to generate better output. By fine-tuning the LLM to better use retrieved knowledge and the retrieval model to select more relevant text, RA-DIT achieves improved performance without requiring extensive data processing.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Spotting LLMs With Binoculars: Zero-Shot Detection Of Machine-Generated Text

2024-02-19T00:00:00.000Z

Can you reliably tell apart fake, LLM-generated, text from human-written text?🤖⚖️👱

Binoculars is a technique that requires no training and can 0-shot detect 90% of LLM-generated content at a 0.01% false positive rate.

In Short⏩:

Human tokens are, on average more surprising to LLMs than other LLM tokens. They use this insight to identify a classification threshold.

Given two LLMs, M1 and M2. Their main insight is that human text should diverge from M1 more than M2 diverges from M1, provided the LLMs M1 and M2 are more similar to each other than they are to a human.

Details🔎:

They look at the text in question through the lenses of two different LMs and calculate two perplexity scores:

Perplexity of the text using an "observer" LLM(M1).
Compute all the next-token predictions that a "performer" LLM(M2) would make at each position in the string, and compute their perplexity according to the "observer" LLM(M1).

Then, they take the ratio of the two PPL scores: PPL1/PPL2.

If the string is written by a machine, we should expect these two perplexities to be similar. If it is written by a human they should be different.

They find that if PPL1/PPL2 > 0.9 then text is human generated; otherwise it's LLM generated.

Works on detecting fake multilingual text as well.

They think of PPL as how surprising the next token is - human tokens are, on average more surprising to LLMs than LLM tokens.

They use PPL2, what they call cross perplexity, to account for the increase in perplexity due to the prompt; normalizing the observed perplexity by the expected perplexity of a machine acting on the same text, we can arrive at a detection metric that is fairly invariant to the prompt

They use Falcon-7b model (M1) and the Falcon-7b-instruct (M2)

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Retrieval-Augmented Generation for Large Language Models: A Survey

2024-02-13T00:00:00.000Z

A recent survey on Retrieval-Augmented Generation (RAG) mentions an evolving paradigm: Modular RAG.

Modular RAG is comprised of various functional modules. Thus, modular RAG is not standalone. Instead, different RAG patterns are composed of different modules.

For example, the following animation shows: 🥚 The original naive RAG paradigm consists of the “Retrieval”, "Augmentation," and "Generation" modules.

🐣 After naive RAG has shown some limitations, advanced RAG has emerged as a new paradigm. A typical pattern of Advanced RAG builds upon the foundation of Naive RAG by adding “Rewrite” and “Rerank” modules.

🐓 Different RAG patterns, such as DSP, can be composed of entirely different modules.

The modular RAG paradigm is slowly becoming the norm in the RAG domain due to its versatility and flexibility, allowing:

the adaption of modules within the RAG process to suit your specific problem,
for a serialized pipeline or an end-to-end training approach across multiple modules.

I definitely recommend checking out the full survey if you want to catch up on recent advancements in the RAG domain.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Matryoshka Representation Learning

2024-01-29T00:00:00.000Z

An Overview of OpenAI's New Truncatable - Matryoshka Embeddings🪆

OpenAI recently announced embeddings that you can simply use chunks of (say the first 8, 16, 32, 64, 128 or 256 ... dimensions of the total 2048d vector) they use Matryoshka representation learning(MRL).

This is how they work, In Short⏩:

MLR allows you to use a subset of the dimensions of the embedding vector - earlier dimensions store more information than dimensions later on in the vector, which simply add more details
You can understand how this works by the analogy of trying to classify an image at multiple resolutions - the lower res give high-level info and the higher res add details - Human perception of the natural world also has a naturally coarse-to-fine granularity
This is done by modifying the loss function which is optimized. If previously the loss function was L, for MRL we break down the Loss function into the sum of the losses on individual vector dimension ranges: Loss_Total = L(upto 8d) + L(upto 16d) + L(upto 32d) + ... + L(upto 2048d) - Now there is incentive for the model to capture information in each sub-section of the vec.
After modifying the loss you get these truncatable vectors for free/no additional costs - this works on almost all loss functions and pre-existing models can be finetuned to output MRL vectors! - super easy-to-adopt technique
You can actually use any slice of dimensions, not just 8, 16,32 ... - b/c information is diffused in an interpolative fashion; so you can choose an arbitrary-sized chunk dimension that falls between the chosen granularity of the representations

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

A Simple Overview of the LLM Training Steps 🔡

2024-01-24T00:00:00.000Z

A Simple Overview of the LLM Training Steps:🔡

Unsupervised Pretraining:
High quantity, low quality data The model is trained to predict the next token for trillions of tokens. Produces what is called the foundation or base model.
Supervised Finetuning:
Low quantity, high quality {prompt, response} Enables the model to be finetuned for dialogue - turning the base model into a chatbot Often referred to as instruction tuning
Reinforcement Learning from Human Feedback (RLHF): - lots of innovation going on here (will cover DPO, PTO, and KTO soon)

This is a two-step process:

a. Train a reward model to act as a scoring function:

This model will take in a prompt + response and provide a score of how good it is. Human labelers are asked to pick good vs. bad responses and this data is used to train a model.

b. Optimize LLM to generate responses for which the reward model will give high scores.

Use an iterative procedure to update a part of the model such that:
Produces outputs with higher score
Outputs that are not too far away from the SFT model from Step 2
Outputs that aren't getting worse a text completion

Specifically for this phase it is better to think of this as learning an optimal strategy/policy for predicting a probability distribution over tokens and we want to tweak this distribution to produce higher quality text, here the:

The policy is a language model that takes in a prompt and returns a probability distribution over text. The action space of this policy is all the tokens corresponding to the vocabulary of the language model (~50k tokens) The observation space: distribution of possible input token sequences The reward Model is a combination of the preference model(score higher) and a constraint on policy shift(don't change too much, get worse at text completion).

RLHF Learning Resources:

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Using a 7B Model + RAG to Identify and Edit Word-level Hallucinations

2024-01-20T00:00:00.000Z

Using a 7B Model + RAG to Identify and Edit Word-level Hallucinations in LLMs better then GPT-4:

In Short⏩:

Train a model that consists of a Retreiver and a Language Model:

>> The retriever, Mret, takes the original output you want to check to hallucination (y) and optionally input prompt (x) and retrieves top relevant documents (C). So C = Mret(x, y). This can be a vector database like Weaviate for example.

>> The detector and editor, Medit, takes in the context - (C), input - (x) and output - (y) and detects (and if possible also edits/corrects) factual errors in (y) given the retrieved context (C): y* = Medit(x, y, C).

🏋️Training:

Create a synthetic hallucination dataset of 35k C = context, y=incorrect output, y=annotated fixed output -> (C, y, y)

Magic Synthetic Dataset Creation:

>> GPT-4 is few-shot prompted to add different types of errors to a passage

>> It is also instructed to mark phrases or sentences for deletion along with their error type and insert phrases and sentences along with insertion tags

Start off with Llama2-Chat 7B to initialize Medit and then train on (C, y, y∗)

Medit takes in (C, y) as input and learns to predict the edited outputs with tags to represent error type y∗ using standard language modeling objective.

The model, once trained, can identify different types of hallucination and mark which words they come from - it also suggests edits to improve factuality.

Result 📈:

The model has a fine-grained hallucination detection accuracy 46.5% while it's binary acc.{hallucination, no hallucination} is 79%.

For comparison ChatGPT has a fine-grained hallucination detection acc. of 21.5% (59% binary acc) w/o RAG and 26%(68.5% binary hall detect) w/ RAG

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Fine-grained Hallucination Detection and Editing for Language Models

2024-01-19T00:00:00.000Z

A breakdown of the different types of hallucinations from AI2:🍄

Verifiably Factually Wrong ❌

Entity: an entity in a statement is incorrect (eg. Christmas falls on Nov. 25th)
Relation: semantic relationship in a statement is incorrect (eg. The mouse ate the cat.)
Contradictory: statements that entirely contradict relevant evidence from the web (eg. Raptors are yet to win an NBA final.)

Unverifiable Types of Hallucinations ⁉️

Invented: statements of concepts that do not exist in world knowledge (eg. MJ created the sideways somersault)
Subjective: Statement that lacks universal validity - basically an opinion (eg. The Raptors are the best NBA team)
Unverifiable: potentially factual statement but cannot be grounded in world evidence(eg. Jensen sleeps in a leather jacket.)

🔍Word vs. Sentence Level:

Entity and Relation are usually word level, and so can be fixed with small edits if you know where they occur.

Contradictory, Invented, Subjective, and Unverifiable are often sentence level and thus need to be removed completely to fix the issue.

💻Code

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Long-Context Retrieval Models with Monarch Mixer

2024-01-15T00:00:00.000Z

A breakdown of the Long Context Retrieval Embedding Models from Stanford!💥

In Short⏩:

They release 3 long context(2k/8k/32k) BERT-like encoder embedding models on HuggingFace
The models are only 80M params and outperform MUCH larger models (4-85x larger)
Accessible via @togethercompute endpoints and integrated into @llama_index and @LangChainAI
They also release LoCo a long context retrieval benchmark.

🏗️Architechtural Details:

They replace the Attention and MLP blocks in the transformer architecture with diagonal block matrix (Monarch Matrices -M2) operations which are hardware optimized and subquadratic in the sequence length - O(N^(1.5))
This enables scaling sequence length and model parameters better.

🪃Training Details:

These M2 models are trained for long context retrieval on a mixture of long and short context tasks data - surprisingly only training on long context doesn't work.
Use a cosine similarity loss instead of the trusty supervised contrastive training loss.
This loss function. can be computed independently per datapoint in a batch instead of needing to sum over all negative examples in a batch.
Thus training can be scaled for large batch sizes of long context inputs without OOM'ing

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

2024-01-09T00:00:00.000Z

🗣️Persuasive Adversarial Prompting to Jailbreak LLMs with 92% Success Rate

🔒Fascinating new paper breaks down jailbreak prompting to a science!

⏩In Short:

Provide a taxonomy of 40 persuasion prompting techniques
Use this list of 40 techniques they can jailbreak LLMs including GPT4 with a 92% success rate!!
Pretty interestingly Anthropic models are not susceptible at all to PAP attacks!! More advanced models like GPT-4 are more vulnerable to persuasive adversarial prompts (PAPs).
If you can defend against these PAPs this also provides effective protection against other attacks
Test these PAPs to perform attacks covering 14 different risk categories (such as economic harm, etc.)

Blog+Demo: https://chats-lab.github.io/persuasive_jailbreaker/

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Improving Text Embeddings with Large Language Models

2024-01-02T00:00:00.000Z

❓Your RAG workflow is only as good as the retrieved context. Can you use LLMs to improve recall and search relevance for dense retrievers?🤔

📜Work from Microsoft uses synthetic data + LLMs as embedding models to achieve SOTA on the MTEB benchmark.

⏩In Short:

They generate a multilingual synthetic retrieval dataset using GPT-4 which includes {queries, positive matches, hard negatives}.
They use this synthetic dataset along with 13 other public datasets and embed the queries & positives/negatives using the last layer vectors of Mistral-7B.
They tune the Mistral-7B embedding model using a contrastive loss along with embeddings from step 2.
Using this fine-tuned Mistral-7B as an embedding model then achieves SOTA(+2.4%) on MTEB.

❌ Limitations/Short-coming: Potential data contamination? Didn't Mistral-7B have access to all MTEB benchmark datasets? - The MTEB was released(Oct 2022) before the training cutoff(~2023) of the model. So there might be some contamination since we are using the LLM to embed this same data.

maybe I just don't understand!🤷

📑The details:

The synthetic data generation consists of prompting GPT-4 to brainstorm a list of potential retrieval tasks followed by getting GPT-4 to generate (query, positive, hard negative) triplets for each task.

This synthetic data captures text embedding tasks in 93 languages, covering 1000's of embedding tasks. - see prompt templates for how this diversity is obtained and the different tasks that are generated

Fine-tune Mistral-7B using standard contrastive loss with a temperature-scaled cosine similarity with LoRA with rank 16.
For Mistral-7B based models, contrastive pre-training has negligible impact on the model quality. This is surprising since it's one of the key factors behind the success of existing text embedding models.

🚀Pretty cool work overall, anytime an LLM augments a well studied field and beats SOTA it's exciting times! They show that language modeling and text embeddings are two sides of the same coin.🪙

Given an embedding task prompt template, a robust LLM should be able to generate training data and then be transformed into an embedding model through light-weight fine-tuning.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Discovering the Hidden Vocabulary of DALLE-2

2023-12-25T00:00:00.000Z

TIL that text2image diffusion models learn and use a secret language.

Tested this with the new DALL-E-3 and it works!🤯

Read a couple of papers and they mentioned that diffusion models when forced to output text generate images of gibberish words.

If you take those words and pass them back in as prompts, the model can draw for you what the word means to it.

For example: "cagama gur gerano" = "a fantasy creature"

I tested this for the newly released DALL-E-3 model and, interestingly, even when told to generate English it still uses this secret learned language instead.

Below is a conversation about fantasy creatures between two farmers in this secret language.

Initial prompt: "Two farmers talking about vegetables, with english subtitles."

After this just prompt the model with individual and word pairs to get images with secret words. I share examples below.

Prompt: "cagama gur gerano"

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Fine-Tuning or Retrieval? Comparing Knowledge Injection in LLMs

2023-12-19T00:00:00.000Z

❓When using LLMs is unsupervised fine-tuning better than RAG for knowledge-intensive tasks? Should you do both?

If you want to augment an LLM with knowledge of your enterprise data you can do so by augmenting the parametric (finetune) or non-parametric(w/ a vector db like @weaviate_io ) memory.

📜Researchers from Microsoft(https://arxiv.org/abs/2312.05934) asked if unsupervised next token prediction finetuning is better than RAG to improve LLM perf. on both seen and unseen QnA tasks?

⏩In Short: RAG is a better way to inject knowledge into LLMs than unsupervised fine-tuning(USFT) and more surprisingly they found that RAG alone is even better than RAG + finetuning. Probably because USFT is not efficiently persisting new knowledge into params.

Would be cool to see a study comparing RAG vs. SFT/Instruction tuning or RLHF.

This improvement in QnA tasks with RAG occurred for both questions in the MMLU dataset as well as on a new dataset of "current events" that the model was not trained on.

📑The details:

Used Mistral, Llama2, Orca2 7B for all assessments.
Only unsupervised finetuning was done - a direct continuation of the pre-training phase - by predicting the next token on the dataset
Used bge-large-en as the embedding model for the RAG component
Finetuning with multiple paraphrases of the same fact provides a significant improvement over the baseline. - To teach pre-trained LLMs new knowledge, the knowledge must be repeated in numerous ways

❌ Limitations/Short-comings:

Only a continuation of the pre-training was assessed - no instruction tuning or RLHF - SFT and RLHF will boost performance further.
Accuracy performance variance is quite high across the experiments - so it's quite hard to determine the statistical significance of results.
Why is the performance of baseline models on future data not 25% for MCQs with 4 choices? - Not truly "unseen" knowledge.
Only straightforward knowledge/fact tasks were assessed - reasoning capabilities were not assessed..

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Dense X Retrieval: What Retrieval Granularity Should We Use?

2023-12-17T00:00:00.000Z

❓What text chunk size should we use in our RAG workflows? How does chunk size impact retrieval recall? Are bigger chunks better? Smaller chunks but keep more top-k?

📜The new paper from Tencent and Carnegie Mellon(https://arxiv.org/abs/2312.06648) asked:

What chunk size is best to segment and index a vector database like @weaviate_io ?
How does chunk size impact generalization for passage retrieval and accuracy for QA RAG tasks?

⏩In Short: They found that instead of using 100-word passage or sentence-level chunking it's best create Propositions - concise, distinct and self-contained expressions of factoids.

Propositions are generated by a finetuned LLM - which takes in paragraphs as input and is instructed to generate propositions.(blue in the image)

Going to try this out with the current @weaviate_io workflows.

📑The details:

QnA RAG Improvements: +5.9, +7.8,+5.8, +4.9, +5.9, and +6.9 EM@100(exact match using 100 words) for SimCSE, Contriever, DPR, ANCE, TAS-B, and GTR.
Passage Retrieval Perf. : Improvement of Recall@20 is +10.1% and +2.2% for unsupervised and supervised retrievers resp.
Propositions have the following properties: a. unique: a distinct piece of meaning in text b. atomic: cannot be further split into separate propositions c. self-contained: includes all the necessary context
The paragraph-to-proposition generating LLM (a FlanT5-large model) is finetuned using a 42k passage dataset that has been atomized into propositions using GPT-4 - ie. the process is automatable.
Supervised retrievers show less improvements with Propositions b/c these retrievers are trained on query-passage pairs.
Unsupervised retrieval by proposition demonstrates a clear advantage - 17-25% Recall@5 relative improvement on EntityQuestions with DPR and ANCE.
Works better for rare concepts: Retrieving by proposition much more advantageous for questions targeting less common entities.
The RAG (retrieve-then-read) task uses a T5-large size UnifiedQA-v2 as the reader model.
Proposition chunks outperform passage chunks for QnA most in the range of 100-200 words = ~10 propositions = ~5 sentences = ~2 passages.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

2023-12-08T00:00:00.000Z

❓Can you really get a LLM to self-check its own responses for hallucinations?

📜Researchers from Cambridge released a paper developing a method called SelfCheckGPT - a framework that uses only black-box access to a LLM through an API to assess if it's hallucinating.

⏩TLDR: They pass in the same prompt to the model multiple times and generate N more sample responses in addition to the original response and get the LLM to check for inconsistencies.

📑They compare how often each sentence in the original response contradicts these samples using the following prompt for every sentence:

Context: {} 
Sentence: {} 
Is the sentence supported by the context above? 
Answer Yes or No: 

The intuition is that if an LLM has knowledge of a given concept, sampled responses are likely to be similar and contain consistent facts. Hallucinated statements, on the other hand, are more likely to diverge from the extra sampled responses and will contradict one another.

They report higher AUC-PR scores in sentence-level hallucination detection and higher correlation scores in passage-level factuality assessment compared to grey-box methods - which use model weights and output token probs.

❌Some potential problems with this approach:

What if the model hallucinates in a confident way where even the sampled responses contain the hallucinations?
Very high-cost method(requires (N + number of sentences) queries per prompt) - if you are generating N samples every time and then verifying for every sentence in a query you will rack up a bill really quickly

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Who’s Harry Potter? Approximate Unlearning in LLMs

2023-12-06T00:00:00.000Z

Sure, you can train a LLM, perhaps you can even finetune one! But can you brainwash one into forgetting specific concepts?🧠

How would you erase a concept from a LLM's parametric memory?

This question was addressed by researchers at MicrosoftAI in their new paper(https://arxiv.org/abs/2310.02238) where they "propose a novel technique for unlearning a subset of the training data from a LLM" without adversely impacting performance on other benchmarks.

They propose "unlearning" or "un-training" as a three-step process:

First they finetune a model to always respond with some reference to the information they want to later erase. This "reinforced model" becomes a specialist in the information we eventually want to unlearn. This step is used to identify which tokens should be targeted in the unlearning step!
For each of these unlearning targets identified in step 1 they generate synthetic generic alternatives using GPT4. So for example a sentence that originally says "Harry went to the Gryffindor common room" should be turned into "Harry went to the gym".
Each block of text from the unlearn target is then replaced with the generic counterparts and this dataset is now used to finetune the base model, which effectively erases the original text from the model’s memory whenever it is prompted with its context.

For these experiments, they finetuned a Llama2-7b and it took about 1 GPU hour to implement this unlearning for a dataset that consisted of about 3.1M Harry Potter related tokens.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Long context prompting for Claude 2.1

2023-12-06T00:00:00.000Z

Anthropic was able to solve the "lost in the middle" problem "by adding the sentence “Here is the most relevant sentence in the context:” to the start of Claude’s response. This was enough to raise Claude 2.1’s score from 27% to 98% on the original evaluation."

Does it just take a little bit of prompt engineering to solve low accuracy when needing to retrieve from the middle of a context window??

🔗 Article Link

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Retrieval-Augmented Multimodal Language Modeling

2023-12-04T00:00:00.000Z

What's better than retrieval augmented generation(RAG)? 🥁🥁🥁

Multimodal RAG! 😎👌🔥

RAG allows you to pack retrieved context into a prompt so that a language model can read relevant information before generating a response - this function is critical and allows us to integrate knowledge in a more scalable and modular way into LLMs.

But isn't a picture worth a thousand words? So why just stop at retrieving textual context??

This is where multimodal RAG(MM-RAG) comes into the picture!

If you have an external knowledge database that can represent and store multimedia like images, audio, and video, just as well as it can text, then you can retrieve these objects and provide them a richer context for Large Multimodal Models to generate with.

Vector databases provide an ideal store from which multimedia knowledge can be retrieved and can capture the meaning of all of these modalities.

A paper(https://arxiv.org/abs/2211.12561) earlier this year from @michiyasunaga at Stanford presented the first multimodal model that can retrieve and generate both text and images and discusses the advantages of MM-RAG.

They found that MM-RAG:

Significantly outperforms baseline multimodal models such as DALL-E and CM3 on both image and caption generation tasks
Require much less compute for training (<30% of DALLE)
MM-RAG capable models also generate images much more faithful to the retrieved context
Are capable of multimodal in-context learning (e.g., image generation from demonstrations)

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

A Watermark for Large Language Model

2023-11-25T00:00:00.000Z

Can you tell the difference between human-written language and AI-generated text?🤔

To solve this problem we need watermarks!📃

Researchers at the University of Maryland(https://arxiv.org/abs/2301.10226) created a way for us to modify LLMs such that a watermark would automatically be applied to any content that LLM generates. This allows us to run a test for this watermark to identify synthetic content in the wild.

A watermark is a hidden pattern in text that is imperceptible to humans, but when the text is statistically analyzed it allows us to identify synthetic content. The watermark they created can be identified in as little as 25 tokens and has negligible impact on text quality.

The watermark works by selecting a randomized set of secret “green” tokens before a word is generated, and then softly incentivizing the LLM to use those green tokens by slightly nudging the output word probabilities during sampling. The more "green" tokens found in a chunk of text the higher the probability it was generated by an LLM.

The challenge here is that in order to apply this watermark the company owning the LLM (OpenAI, Cohere, Anthropic etc.) needs to promote the use of these random secret "green" tokens by slightly increasing their probability of being generated.

Yet, another problem is that the higher this "green" token probability the easier the watermark will be to detect however, this also lowers the quality of the text overall.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

The Curse of Recursion: Training on Generated Data Makes Models Forget

2023-11-23T00:00:00.000Z

Can we use synthetic, LLM generated, data to train the next generations of bigger and better LLMs? How far will synthetic data take us in the pursuit of AGI?🤔

A paper(https://arxiv.org/abs/2305.17493) from researchers at Oxford and Cambridge addressed these questions earlier this year.

Synthetic data is quite promising, but to see how far we can push the limit with it, this paper investigates what happens when text produced by a version of GPT forms most of the training dataset of following models. What happens to GPT versions GPT-{n} as generation n increases?

In short, it was found that "model-collapse" occurs – a degenerative process whereby, over time, models forget the true underlying data distribution. Over time, you start losing information about the true distribution, which first starts with tails disappearing, and over the generations learned behaviors start converging to a point estimate with very small variance.

In other words - using synthetic data works at first but relying on it to successively train better and better models generation after generation seems like a loosing bet. Access to the original data distribution is crucial: in learning where the tails of the underlying distribution matter you need access to real human-produced data.

Here are some more details:

Two sets of experiments were done: one in which all data is replaced with synthetic data produced by the last generation of LLM and another where only 90% is replaced(10% original human-produced data)
They found that preservation of the original data allows for better model fine-tuning and leads to only minor degradation of performance.
In early model collapse the model begins losing information about the tails of the distribution
In late model collapse model entangles different modes of the original distributions and converges to a distribution that carries little resemblance to the original one, often with very small variance.
This collapse occurs due to statistical approximation error(due to the number of samples being finite) and functional approximation error(comes from models being insufficiently expressive). A third source of error can also be computational error coming from floating point arithmetic.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Can Large Language Models Infer Causation from Correlation?

2023-11-22T00:00:00.000Z

Can large language models infer causation from correlation?

And if they can't automatically bridge the gap from correlation to causation, then can we at least fine-tune them to improve at this task?

These two questions were addressed by researchers at the Max Planck Institute(https://arxiv.org/abs/2306.05836).

We know that the success of LLM's arises from capturing a vast set of statistical correlations among words, but how well can these correlations be used to infer deeper causal relationships behind the words?

They showed that this is a very big shortcoming of all of the LLMs they tested as of June 2023.

Some details:

Using a dataset, they created called Corr2Cause, they assessed an LLM's ability to take multiple correlational statements as input and accurately determine the causal relationships between these same variables.
Every model, out of 17 tested, provided a near-random performance on the task and was unable to perform pure causal reasoning.
Furthermore, they tested that even if you finetune the model for this task, it still doesn't generalize and only performs well for the variable names and textual expressions found in the training set. When the variable names and text were paraphrased model accuracy in inferring causality dropped.

An important point to make is that It's quite difficult to differentiate and assess actual reasoning from rote training set memorization - even more so now that we have ever-growing training sets and don't know what data the model was trained on and only have API access to the models.

It'd be interesting to see if newer models since the date of publication can do better on paraphrased and renamed versions of this Corr2Cause dataset of ~400k datapoints.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Lost in the Middle: How Language Models Use Long Contexts

2023-11-19T00:00:00.000Z

Why do large language models pay more attention to and reason better over the beginning and end of what you tell them in prompts?🤔

Nelson Liu and Percy Liang's group at Stanford recently published a paper that discovered this "lost in the middle" effect.

Greg Kamradt also ran great experiments and posted about how this very same pattern of underperformance exists in the new GPT-4 128K models from OpenAI.

The point of the paper was to establish "how well LLMs use longer context" and ran experiments conducting QnA and key-value retrieval tasks on models from Mosaic, Anthropic and OpenAI and varied input context size and the position of the relevant information in the context.

The main discovery was that attention followed a U-shaped pattern where more importance was given to the beginning and end of the context window as opposed to the middle portion.

This is such a great paper with a wealth of knowledge gems💎- here are some details and reasons why this happens:

Due to Model Architecture: LLMs are transformers that scale poorly to long sequences (O(d^2)). As a result, language models are typically trained with relatively small context windows and thus perform better on these.
Tasks during supervised instruction-tuning are commonly placed at the beginning of the input context, which might lead these LLMs to place more weight on the start of the input context.
Encoder-decoder models perform better than decoder-only models, by making better use of their context windows because their bidirectional encoder allows processing each document in the context of future documents.
You can improve the performance of decoder-only models(can only attend to prior tokens at each timestep) by placing the query before and after the data, enabling query-aware contextualization of documents.
Based on the key-value retrieval experiments, let alone attending less to the middle, many models struggle to simply retrieve matching tokens that occur in the middle of their input context.
Even base language models (i.e., without instruction fine-tuning) show a U-shaped performance curve.
For open-domain QnA tasks, where none or many of the top k documents may contain the answer, models fail to effectively use more than 20 retrieved documents.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Retrieval meets Long Context Large Language Models

2023-11-18T00:00:00.000Z

Fine-tuned larger language models and longer context lengths eliminate the need for retrieval from external knowledge/vector databases, right? ... Not quite!!

NVIDIA asked the same question last month!

They published a new paper examining how well very large finetuned LLMs with longer context lengths compare to shorter context length RAG supported LLMs. They explore two main questions:

Retrieval-augmentation versus long context window, which one is better for downstream tasks?
Can both methods be combined to get the best of both worlds?

In short, they found:

RAG outperforms long context alone.
Yes they perform better together. RAG works better with longer context than with shorter context.

The main finding presented in the paper was that "retrieval can significantly improve the performance of LLMs regardless of their extended context window sizes".

Some more details:

RAG is more important than context windows: a LLM with 4K context window using simple retrieval-augmentation at generation can achieve comparable performance to finetuned LLM with 16K context window.
RAG is also faster: Augmenting generation with retrieval not only performs better by requiring significantly less computation and is much faster at generation.
RAG works even better as parameter count increases because smaller 6-7B LLMs have relatively worse zero-shot capability to incorporate the retrieved chunked context: Perhaps counter intuitively the benefits of RAG on performance are more pronounced the larger the language model gets, experiments were done for LLMs with 43B and 70B params.
RAG works even better as context length increases: Retrieval-augmented long context LLM (e.g., 16K and 32K) can obtain better results than retrieval-augmented 4K context LLM, even when fed with the same top 5 chunks of evidence.
Retrieval-augmented LLaMA2-70B with 32K context window, outperforms GPT-3.5-turbo-16k and Davinci003 and non-retrieval LLaMA2-70B-32k baseline for question answering and query-based summarization.

🔗 arXiv Link

📜 Download paper

Ready to start building?

Check out the Quickstart tutorial, or build amazing apps with a free trial of Weaviate Cloud (WCD).

GitHub

Forum

X (Twitter)

Don't want to miss another blog post?

By submitting, I agree to the Terms of Service and Privacy Policy.

Weaviate Blog

Distillation Experiments

Is it better to distill or finetune language models?​

Main Takeaways:​

Experiments​

Ready to start building?​

Don't want to miss another blog post?

Language Model Distillation

What is distillation?​

When would we use this?​

What’s the benefit?​

Examples of Distillation Techniques:​

(1) Logit-based Distillation:​

(2) Hidden States-based Distillation:​

Ready to start building?​

Don't want to miss another blog post?

LoRA: Low-Rank Adaptation of Large Language Models

Detailed explanation of Low-Rank Adaptation (LoRA), a method for efficiently fine-tuning pre-trained neural networks.

The Problem LoRA Solves:​

How It Works:​

Common practices when using LoRA:​

Benefits of LoRA:​

Some interesting engineering ideas enabled by LoRA:​

Ready to start building?​

Don't want to miss another blog post?

Prover-verifier Games Improve Legibility of LLM Outputs

Training LLMs to write solutions such that smaller models can better check them. This makes them easier to check for humans, too.​

Key Findings​

How It Works​

Details:​

Ready to start building?​

Don't want to miss another blog post?

RouteLLM: Learning to Route LLMs with Preference Data

You don't need a 2 trillion parameter model to tell you the capitol of France is Paris.​

Overview:​

Neat Ideas to Build From:​

Ready to start building?​

Don't want to miss another blog post?

Adaptive Retrieval and Scalable Indexing for k-NN Search with Cross-Encoders

How do you get the retrieval quality of a cross-encoder/re-ranker and the efficiency of a bi-encoder?​

High-level Overview:​

Details:​

Ready to start building?​

Don't want to miss another blog post?

Many-Shot In-Context Learning

Should you finetune your LLM or just give relevant examples in the prompt? How many examples should you give for best performance?? If you give more will it hurt perf?? Does order of the examples matter!??​

Main Takeaways:​

Juicy Details:​

Ready to start building?​

Don't want to miss another blog post?

Token Pooling to Scale Multi-Vector Retrieval Systems

🏹Multi-vector retrieval approaches, like ColBERT, have great retrieval quality but vector count can balloon, AnswerAI propose a solution!​

Breakdown of different types of encoders:​

Main weakness of multi-vector approaches:​

AnswerAI's Proposed Token Pooling Solution:​

Results:​

Ready to start building?​

Don't want to miss another blog post?

Mixture-of-Agents Enhances Large Language Model Capabilities

🤖Can multiple smaller open-source LLMs be combined to outperform larger monolithic LLMs?​

MoA Architecture​

How do you choose which models to include in the MoA??​

Details:​

Ready to start building?​

Don't want to miss another blog post?

GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer

Using Metadata Filters to Improve Recall in RAG​

How it Works:​

Architechture:​

Effectiveness:​

Ready to start building?​

Don't want to miss another blog post?

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

How do you train a Large Language Model without it memorizing training data?​

In Short:​

How it Works:​

Effectiveness:​

Trade-off:​

Benefits:​

Results:​

Is it better to distill or finetune language models?

Main Takeaways:

Experiments

Ready to start building?

What is distillation?

When would we use this?

What’s the benefit?

Examples of Distillation Techniques:

(1) Logit-based Distillation:

(2) Hidden States-based Distillation:

Ready to start building?

The Problem LoRA Solves:

How It Works:

Common practices when using LoRA:

Benefits of LoRA:

Some interesting engineering ideas enabled by LoRA:

Ready to start building?

Training LLMs to write solutions such that smaller models can better check them. This makes them easier to check for humans, too.

Key Findings

How It Works

Details:

Ready to start building?

You don't need a 2 trillion parameter model to tell you the capitol of France is Paris.

Overview:

Neat Ideas to Build From:

Ready to start building?

How do you get the retrieval quality of a cross-encoder/re-ranker and the efficiency of a bi-encoder?

High-level Overview:

Details:

Ready to start building?

Should you finetune your LLM or just give relevant examples in the prompt? How many examples should you give for best performance?? If you give more will it hurt perf?? Does order of the examples matter!??

Main Takeaways:

Juicy Details:

Ready to start building?

🏹Multi-vector retrieval approaches, like ColBERT, have great retrieval quality but vector count can balloon, AnswerAI propose a solution!

Breakdown of different types of encoders:

Main weakness of multi-vector approaches:

AnswerAI's Proposed Token Pooling Solution:

Results:

Ready to start building?

🤖Can multiple smaller open-source LLMs be combined to outperform larger monolithic LLMs?

MoA Architecture

How do you choose which models to include in the MoA??

Details:

Ready to start building?

Using Metadata Filters to Improve Recall in RAG

How it Works:

Architechture:

Effectiveness:

Ready to start building?

How do you train a Large Language Model without it memorizing training data?

In Short:

How it Works:

Effectiveness:

Trade-off:

Benefits:

Results:

Details in the Paper:

Ready to start building?

In Short:

How it Works:

Architecture:

Benefits:

Results:

Details in the Paper:

Ready to start building?

RAG Basics:

In Short:

How it Works:

Results:

Details:

Ready to start building?

In Short⏩:

Details🔎:

Ready to start building?

Ready to start building?

An Overview of OpenAI's New Truncatable - Matryoshka Embeddings🪆

Ready to start building?

Ready to start building?

Using a 7B Model + RAG to Identify and Edit Word-level Hallucinations in LLMs better then GPT-4: