VOOZH about

URL: https://www.analyticsvidhya.com/blog/2022/11/an-introduction-to-bigbird/

โ‡ฑ An Introduction to BigBird - Analytics Vidhya


India's Most Futuristic AI Conference Is Back โ€“ Bigger, Sharper, Bolder

  • d
  • :
  • h
  • :
  • m
  • :
  • s

Reading list

An Introduction to BigBird

Drishti Last Updated : 02 Nov, 2022
7 min read

 This article was published as a part of the Data Science Blogathon.

Source: Canva|Arxiv

Introduction

In 2018 GoogleAI researchers developed Bidirectional Encoder Representations from Transformers (BERT) for various NLP tasks. However, one of the key limitations of this technique was the quadratic dependency, due to which the BERT-like model can handle sequences of 512 tokens at a time because of their full attention mechanism. To overcome this, Manzil Zaheer, Guru Guruganesh, Avinava Dubey, et al proposed BigBird having a sparse attention mechanism that can handle sequences of length up to 8x of what was previously possible by similar hardware.

In this article, we will take a look at the proposed work in greater detail.

Now, letโ€™s dive in!

Highlights

  • BigBird employs a sparse attention mechanism that turns the quadratic dependency of the transformer-based model into a linear one. It is a universal approximator of sequence functions that retain the properties of the quadratic, full-attention model.

  • With the help of BigBird and its Sparse attention mechanism, the complexity of BERT (O(n2)) is reduced to O(n). As a result, the input sequence limited to  512 tokens is now increased to 4096 tokens (8 * 512). Hence, BigBird can handle longer sequences of length ie. up to 8x of what was previously possible by similar hardware.

  • The capability to accommodate longer context allows BigBird to perform dramatically better on a variety of NLP tasks, including question answering and summarising.

What is the Impact of the Self-Attention Mechanism in Transformers?

The key advancement in Transformers includes a self-attention mechanism, which can be estimated in parallel for each token of the input sequence, eliminating the sequential dependency in recurrent neural networks (like LSTM). This parallelism enables Transformers to leverage the full potential of contemporary SIMD hardware accelerators like GPUs and TPUs, hence facilitating the training of NLP models on datasets of unprecedented size. Pre-training transformers on a large-scale dataset have led to significant improvement in low data regime downstream tasks and tasks with sufficient data and thus has been a major force behind the widespread use of transformers in contemporary NLP.

The self-attention mechanism solves constraints related to the sequential nature of RNNs by enabling each token in the input sequence to attend independently to every other token in the sequence. However,  the full self-attention have high computational and memory requirement that is quadratic in the sequence length. Moreover, it was observed that while the corpus size can be huge, the sequence length, which provides the context is minimal. Using currently available hardware and model sizes, input sequences of length 512 tokens can be handled simultaneously. This limits its direct applicability to tasks that require a larger context, like question-answering (QA), document classification, etc.

Figure 1: Diagram illustrating Full all-pair attention, which is obtained by direct matrix multiplication between the query
and key matrix.

Why Did We Need a BigBird-like Model?

As we briefly discussed in the prior sections, transformer-based models like BERT have a core limitation: the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. Consequently, quadratic dependency on the sequence length limits the context size of the model.

These limitations lead us to two questions: 1) Can we obtain the empirical advantages of a fully quadratic self-attention scheme with fewer inner products? 2) Do the sparse attention mechanisms sustain the expressivity and adaptability of the original network?

BigBird addresses the aforementioned problems by using a sparse attention mechanism that scales linearly. As a result, contexts can be drastically scaled up from 512 tokens (present in most BERT models) to 4,096 in BigBird. This is especially helpful in many tasks where long dependencies need to be preserved eg. text summarization.

BigBird Architecture

The authors drew inspiration from graph sparsification methods and studied where the proof for the expressiveness of Transformers fails when full attention is relaxed to form the proposed attention pattern. This knowledge helped in developing BIGBIRD.

BigBird is a sparse-attention-based transformer that extends transformer-based models like BERT to 8 times longer sequences so that empirical advantages of a fully quadratic self-attention scheme are retained with fewer computations.

The building blocks of the sparse-attention mechanism used in BIGBIRD are as follows:

  1. Random Attention: All tokens attending to a set of random tokens (r).
  2. Window Local Attention: All tokens attending to a set of local neighboring tokens (w).
  3. Global Attention: A set of global tokens (g) attending all parts of the sequence.

Figure 2: Diagram illustrating different types of attention mechanisms. The last one is BigBirdโ€™s sparse attention mechanism.

Letโ€™s take a look at each type of attention mechanism in more detail.

1. Random Attention: Figure 2a illustrates the random attention mechanism, where r=1 with block size 2. In this, every query block randomly attends to random key (r) blocks, meaning in Figure 2a, each query block of size 2 attends to a key block of size 2 (randomly).

2. Window local attention: While creating the block, it is ensured that the number of query blocks
and the number of key blocks are the same. This aids in defining the block window
attention. Each query block with index j attends to the key block with index j โˆ’ (w โˆ’ 1)/2 to
j + (w โˆ’ 1)/2, including key block j. Figure 2b shows sliding window attention with w = 3 and block size 2, meaning each query block j attends to key block j โˆ’ 1, j, j + 1. This ensures that every query attends to at least one block of keys of size b on each side and a maximum of two blocks.

Figure 3 further illustrates the idea behind the window attention mechanism in detail for different parameters.

Figure 3: Diagram illustrating how window local attention is obtained (in general) by โ€œblockingโ€ the query and key matrix, copying the key matrix, and rolling the resulting key tensor.

3. Global attention: Global attention is computed in terms of blocks. Figure 2c illustrates the global attention mechanism with g = 1 and block size 2. For BIGBIRD-ITC, this suggests that
one query and key block attend to everyone.

Figure 2d illustrates the resulting overall attention mechanism used in BigBird. To sump up, we can say that the final attention mechanism for BigBird has the following three properties:

โ€“ queries attend
to random keys (r)

โ€“ each query attends to w/2 tokens to the right of its location and w/2 to the left of
its location

โ€“ contains global tokens (g) that can be from already existing tokens or extra added tokens

Unfortunately, when it comes to the computation of this attention score by simply multiplying arbitrary pairs of key and query vectors, it usually requires the use of the gather operation, which turns out to be inefficient. Upon examination of the global attention and window attention, it was found that these attention scores can be calculated without using a gather operation.

Figure 4: Overview of BigBird attention computation.

BigBird Attention Computation: Structured block sparsity aids in compactly packing the operations of sparse attention, thereby making the method efficient on GPU or TPU. Figure 4 shows the transformed dense query and key tensors on the left. The query tensor is obtained by blocking and reshaping, whereas the final key tensor is obtained by concatenating three transformations: The first green columns (which corresponds to global attention) is fixed. The center blue columns (corresponding to
window local attention) are obtained by aptly rolling. A computationally inefficient gather operation is supposed to be used for the last orange columns (which correspond to random attention).

Dense multiplication between the query and key tensors effectively computes the sparse attention pattern (except for the first-row block, which is calculated using direct multiplication). The resulting matrix on the right (in Figure 4) is identical to that shown in Figure 2d.

Potential Applications of BigBird

Some of the applications of BigBird are as follows:

1. Genomics Processing: Genomics sequence is provided as input to the encoder for tasks like methylation analysis, predicting functional impacts of non-coding variants, etc.

2. Question Answering and Long Document Summarization: BigBird can now handle up to 8 times larger sequence lengths than BERT, making it suitable for NLP tasks like answering and summarizing long documents.

3. Search Engine: Since BigBird can handle long context better than BERT, it can be used in search engines.

Limitations of BigBird

The sparse attention mechanisms canโ€™t universally substitute dense attention
mechanisms. Moreover, switching to a sparse attention mechanism does incur a cost.

BigBird for Language Modeling Task

For this, we will first install and import all the required packages. Following that, we will load the model (โ€œgoogle/bigbird-roberta-baseโ€) and the corresponding tokenizer with the help of BigBirdMaskedLM and AutoTokenizer classes. In addition, we will also load the โ€œsquad_v2โ€ dataset, and then we will decode the masked token at the end.

!pip install -q transformers datasets sentencepiece
import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from datasets import load_dataset
model_name = "google/bigbird-roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BigBirdForMaskedLM.from_pretrained(model_name)
squad_ds = load_dataset("squad_v2", split="train")
#Randomly selecting a long article
random_long_article = squad_ds[81515]["context"]
#Adding mask token
add_mask_token = random_long_article.replace("maximum", "[MASK]")
inputs = tokenizer(add_mask_token, return_tensors="pt")
with torch.inference_mode():
       logits = model(**inputs).logits
# Retrieving index of the [MASK]
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(axis=-1)
tokenizer.decode(predicted_token_id)

>> Output: โ€œmaximumโ€

Link to Colab Notebook: https://bit.ly/3fgOYXN

Conclusion

To sum it up, in this article, we learned the following:

  1. BigBird is a sparse-attention-based transformer that extends transformer-based models like BERT to 8 times longer sequences (4096 tokens) in such a manner that empirical advantages of a fully quadratic self-attention scheme are retained with fewer computations.
  2. BigBird satisfies all the known theoretical properties of the full transformer. In particular, it was demonstrated that adding extra global tokens preserves the expressiveness of the model by allowing the expression of continuous sequence-to-sequence functions with only O(n)-inner products.
  3. Extended context modeled by BigBird benefits various NLP tasks like question answering, summarization, long document classification, etc.
  4. The sparse attention mechanisms canโ€™t universally substitute dense attention mechanisms. Moreover, switching to a sparse attention mechanism does incur a cost.

That concludes this article. Thanks for reading. If you have any questions or concerns, please post them in the comments section below. Happy learning!

Link to Research Paper: https://arxiv.org/pdf/2007.14062.pdf

Link to Colab Notebook: https://bit.ly/3fgOYXN

The media shown in this article is not owned by Analytics Vidhya and is used at the Authorโ€™s discretion.

I'm a Researcher who works primarily on various Acoustic DL, NLP, and RL tasks. Here, my writing predominantly revolves around topics related to Acoustic DL, NLP, and RL, as well as new emerging technologies. In addition to all of this, I also contribute to open-source projects @Hugging Face.
For work-related queries please contact: [email protected]

Login to continue reading and enjoy expert-curated content.

Free Courses

Building a Deep Research AI Agent

Build a Research & Report Agent with LangGraph & OpenAI for under $1!

Introduction to Transformers and Attention Mechanisms

Learn attention mechanisms, RNNs, Seq2Seq, BERT & NLP applications.

Getting Started with Large Language Models

Embark on an LLM journey: Master NLP and model training

Nano Course: Building Large Language Models for Code

Train Code LLMs from scratch: curate data, evaluate & build Starcoder (15B)

DeepSeek from Scratch; Architectural Components

DeepSeek from Scratch: Learn input, self-attention, RoPE & more.

Responses From Readers

Flagship Programs

GenAI Pinnacle Program| GenAI Pinnacle Plus Program| AI/ML BlackBelt Program| Agentic AI Pioneer Program

Free Courses

Generative AI| DeepSeek| OpenAI Agent SDK| LLM Applications using Prompt Engineering| DeepSeek from Scratch| Stability.AI| SSM & MAMBA| RAG Systems using LlamaIndex| Building LLMs for Code| Python| Microsoft Excel| Machine Learning| Deep Learning| Mastering Multimodal RAG| Introduction to Transformer Model| Bagging & Boosting| Loan Prediction| Time Series Forecasting| Tableau| Business Analytics| Vibe Coding in Windsurf| Model Deployment using FastAPI| Building Data Analyst AI Agent| Getting started with OpenAI o3-mini| Introduction to Transformers and Attention Mechanisms

Popular Categories

AI Agents| Generative AI| Prompt Engineering| Generative AI Application| News| Technical Guides| AI Tools| Interview Preparation| Research Papers| Success Stories| Quiz| Use Cases| Listicles

Generative AI Tools and Techniques

GANs| VAEs| Transformers| StyleGAN| Pix2Pix| Autoencoders| GPT| BERT| Word2Vec| LSTM| Attention Mechanisms| Diffusion Models| LLMs| SLMs| Encoder Decoder Models| Prompt Engineering| LangChain| LlamaIndex| RAG| Fine-tuning| LangChain AI Agent| Multimodal Models| RNNs| DCGAN| ProGAN| Text-to-Image Models| DDPM| Document Question Answering| Imagen| T5 (Text-to-Text Transfer Transformer)| Seq2seq Models| WaveNet| Attention Is All You Need (Transformer Architecture) | WindSurf| Cursor

Popular GenAI Models

Llama 4| Llama 3.1| GPT 4.5| GPT 4.1| GPT 4o| o3-mini| Sora| DeepSeek R1| DeepSeek V3| Janus Pro| Veo 2| Gemini 2.5 Pro| Gemini 2.0| Gemma 3| Claude Sonnet 3.7| Claude 3.5 Sonnet| Phi 4| Phi 3.5| Mistral Small 3.1| Mistral NeMo| Mistral-7b| Bedrock| Vertex AI| Qwen QwQ 32B| Qwen 2| Qwen 2.5 VL| Qwen Chat| Grok 3

AI Development Frameworks

n8n| LangChain| Agent SDK| A2A by Google| SmolAgents| LangGraph| CrewAI| Agno| LangFlow| AutoGen| LlamaIndex| Swarm| AutoGPT

Data Science Tools and Techniques

Python| R| SQL| Jupyter Notebooks| TensorFlow| Scikit-learn| PyTorch| Tableau| Apache Spark| Matplotlib| Seaborn| Pandas| Hadoop| Docker| Git| Keras| Apache Kafka| AWS| NLP| Random Forest| Computer Vision| Data Visualization| Data Exploration| Big Data| Common Machine Learning Algorithms| Machine Learning| Google Data Science Agent
๐Ÿ‘ Av Logo White

Continue your learning for FREE

Forgot your password?
๐Ÿ‘ Av Logo White

Enter OTP sent to

Edit

Wrong OTP.

Enter the OTP

Resend OTP

Resend OTP in 45s

๐Ÿ‘ Popup Banner
๐Ÿ‘ AI Popup Banner