Bag-of-Words Representations in TensorFlow

Last Updated : 23 Jul, 2025

Bag-of-Words (BoW) converts text into numerical vectors based on word occurrences, ignoring grammar and word order. The model represents text as a collection (bag) of words, where each word's frequency or presence is recorded. It follows these steps:

Tokenization – Splitting text into words.
Vocabulary Creation – Listing all unique words from the dataset.
Vectorization – Converting text into a numerical vector where each dimension represents a word's occurrence.

For example, consider two sentences:

"The cat sat on the mat."
"The dog lay on the rug."

The vocabulary: ["the", "cat", "sat", "on", "mat", "dog", "lay", "rug"]

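Their binary BoW representation (1 if the word is present, 0 otherwise; with counts instead, the entry for "the" would be 2 in both sentences because it appears twice):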

Sentence 1: [1, 1, 1, 1, 1, 0, 0, 0]
Sentence 2: [1, 0, 0, 1, 0, 1, 1, 1]
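The three steps can be sketched in plain Python before moving to TensorFlow. This is a minimal illustration, not a production tokenizer: the helper names tokenize, build_vocabulary and vectorize are chosen for this example, and the tokenizer simply lowercases the text and keeps runs of the letters a to z.

```python
import re
from collections import Counter


def tokenize(text):
    """Step 1: lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(tokenized_docs):
    """Step 2: collect unique words in order of first appearance."""
    vocabulary = []
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def vectorize(tokens, vocabulary, binary=False):
    """Step 3: one dimension per vocabulary word (count, or 0/1 if binary)."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]


sentences = ["The cat sat on the mat.", "The dog lay on the rug."]
tokenized = [tokenize(s) for s in sentences]
vocabulary = build_vocabulary(tokenized)

binary_vectors = [vectorize(t, vocabulary, binary=True) for t in tokenized]
count_vectors = [vectorize(t, vocabulary) for t in tokenized]

print(vocabulary)
# ['the', 'cat', 'sat', 'on', 'mat', 'dog', 'lay', 'rug']
print(binary_vectors)
# [[1, 1, 1, 1, 1, 0, 0, 0], [1, 0, 0, 1, 0, 1, 1, 1]]
print(count_vectors)
# [[2, 1, 1, 1, 1, 0, 0, 0], [2, 0, 0, 1, 0, 1, 1, 1]]
```

The binary vectors record only presence, while the count vectors record how often each word occurs, which is why "the" becomes 2 in the second form.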

The Bag-of-Words method is widely used in text classification, sentiment analysis, and information retrieval.

Implementing Bag-of-Words in TensorFlow

We will implement BoW using TensorFlow's tf.keras.layers.TextVectorization layer.

1. Install Dependencies

Ensure you have TensorFlow installed:

pip install tensorflow

2. Import Libraries

3. Define Sample Text Data

4. Create and Configure the TextVectorization Layer

5. Convert Text into BoW Representation
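Steps 2 through 5 can be sketched together as follows. This is a minimal sketch rather than a definitive listing: it assumes the two example sentences from earlier as input, the list name sentences is chosen for illustration, and output_mode="count" is assumed because the output reports word counts. The default standardization of TextVectorization lowercases the text and strips punctuation.

```python
# Step 2: import TensorFlow.
import tensorflow as tf

# Step 3: sample text data (assumed to be the two example sentences).
sentences = tf.constant([
    "The cat sat on the mat.",
    "The dog lay on the rug.",
])

# Step 4: create the layer in count mode and learn the vocabulary.
vectorizer = tf.keras.layers.TextVectorization(output_mode="count")
vectorizer.adapt(sentences)

# Step 5: convert each sentence into a vector of word counts.
bow_representation = vectorizer(sentences)

print("Vocabulary:", vectorizer.get_vocabulary())
print("Bag-of-Words Representation:")
print(bow_representation.numpy())
```

Depending on the TensorFlow version, the counts may print as floats (for example 2.) rather than integers.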

Output:

Vocabulary: ['[UNK]', 'the', 'on', 'sat', 'rug', 'mat', 'lay', 'dog', 'cat']
Bag-of-Words Representation:
[[0 2 1 1 0 1 0 0 1]
 [0 2 1 0 1 0 1 1 0]]

The vectorizer.get_vocabulary() method returns the learned vocabulary, and bow_representation.numpy() provides the BoW vectors for the input sentences.

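Each column corresponds to the vocabulary entry at the same index, so the 2 in column 1 is the count of "the", which appears twice in each sentence.
The 0 in column 0 means neither sentence contains an out-of-vocabulary ([UNK]) token.
The other values are the counts of the remaining words in the respective sentence.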
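To make the column-to-word mapping concrete, the printed vocabulary and matrix can be paired up in plain Python. This is a small sketch that copies the values from the output above; the helper name label_counts is hypothetical.

```python
# Values copied from the output shown above.
vocabulary = ['[UNK]', 'the', 'on', 'sat', 'rug', 'mat', 'lay', 'dog', 'cat']
bow_rows = [
    [0, 2, 1, 1, 0, 1, 0, 0, 1],
    [0, 2, 1, 0, 1, 0, 1, 1, 0],
]


def label_counts(vocab, row):
    """Pair each non-zero count with the vocabulary word at the same index."""
    return {word: count for word, count in zip(vocab, row) if count > 0}


labelled = [label_counts(vocabulary, row) for row in bow_rows]
for i, counts in enumerate(labelled, start=1):
    print(f"Sentence {i}:", counts)
# Sentence 1: {'the': 2, 'on': 1, 'sat': 1, 'mat': 1, 'cat': 1}
# Sentence 2: {'the': 2, 'on': 1, 'rug': 1, 'lay': 1, 'dog': 1}
```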

Applications of BoW

Text Classification – Used in spam detection and sentiment analysis.
Information Retrieval – Search engines can match queries with documents by comparing their BoW vectors.
Topic Modeling – Helps in clustering similar documents.
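As an illustration of the information retrieval use case, the sketch below ranks the two example sentences against a query. Cosine similarity is an assumption here, one common way to compare BoW vectors rather than the only one, and the helper names bow and cosine_similarity as well as the query string are made up for this example.

```python
import math
import re
from collections import Counter


def bow(text):
    """Count-based bag of words over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


documents = ["The cat sat on the mat.", "The dog lay on the rug."]
query = "cat on a mat"  # hypothetical query

scores = [cosine_similarity(bow(query), bow(doc)) for doc in documents]
best = max(range(len(documents)), key=scores.__getitem__)
print(documents[best])
# The cat sat on the mat.
```

The first sentence scores higher because it shares three words with the query ("cat", "on", "mat"), while the second shares only "on".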

Limitations of Bag-of-Words

Ignores Word Order – Cannot differentiate between "dog bites man" and "man bites dog."
Sparse Representation – Large vocabularies lead to high-dimensional vectors in which most entries are zero.
Lack of Semantic Understanding – Words with similar meanings (for example "good" and "great") are treated as unrelated dimensions.
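The word-order limitation is easy to verify. In this minimal sketch, with a hypothetical helper bow_counts that splits on whitespace, both phrases produce identical counts:

```python
from collections import Counter


def bow_counts(text):
    """Count each whitespace-separated, lowercased word."""
    return Counter(text.lower().split())


first = bow_counts("dog bites man")
second = bow_counts("man bites dog")

# Counter equality ignores insertion order, just as BoW ignores word order.
print(first == second)
# True
```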

Alternatives: TF-IDF, Word Embeddings (Word2Vec, GloVe), Transformer-based models (BERT).

Bag-of-Words is a simple yet effective method for text representation. With TensorFlow's TextVectorization layer, BoW vectors can be computed in a few lines and used directly in a Keras workflow. However, for complex NLP tasks, embeddings and deep learning-based representations are often preferred.
