In an age where information is abundantly available on the internet, the need for efficient content consumption has never been greater. This article walks through building a web-based application that summarizes website content using a Hugging Face Transformer model.
When you like an article but it's super long, it can be hard to find the time to read the whole thing. That's where a summarizer can be a real lifesaver. It's a tool that gives you a short and sweet version of the article, so you can quickly get the main points without spending too much time.
A Website Summarizer is a tool for condensing web page content. By offering succinct and cohesive summaries, it enables visitors to easily grasp the major concepts and vital information in lengthy articles, blog entries, or news reports. Website Summarizers examine the source material, identify essential sentences or phrases, and use natural language processing algorithms to produce summaries that capture the core of the information. This not only saves time but also helps users make informed decisions, whether they are keeping up with current events or doing research. Website Summarizers are especially useful in today's information age, making internet content more accessible and manageable.
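There are two types of summarizers: extractive summarizers, which select and stitch together existing sentences from the source, and abstractive summarizers, which generate new sentences that convey the main ideas.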
This article focuses on abstractive summarizers. We will be developing a real-time summarizer that extracts meaningful and human-like summaries from the given website URL.
BART, short for Bidirectional and Auto-Regressive Transformers, is a language model created by Facebook AI. It is a sequence-to-sequence (seq2seq) model, which means it can be trained for tasks that convert one sequence into another. These tasks include machine translation, text summarization, and question answering.
What sets BART apart is its training process. It is pre-trained on a large text corpus using a denoising objective. In simple terms, BART is trained to reconstruct text that has been altered in some way, such as through sentence shuffling or replacing sections with a mask token. This pre-training approach gives BART an understanding of language structure and meaning, which helps it perform well in real-world applications.
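To make the two kinds of alteration mentioned above concrete, here is a minimal sketch of sentence shuffling and span masking on plain strings. It is only an illustration, not BART's actual pre-training code: the function names are hypothetical, sentences are split naively on periods, and "tokens" are simply whitespace-separated words.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder that stands in for the removed span


def permute_sentences(text, rng):
    """Shuffle the order of the sentences in `text` (naive split on '.')."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng.shuffle(sentences)
    return ". ".join(sentences) + "."


def infill_span(tokens, start, length):
    """Replace `length` tokens beginning at `start` with one mask token."""
    return tokens[:start] + [MASK_TOKEN] + tokens[start + length:]


original = "The cat sat on the mat. It was warm. The dog barked."
rng = random.Random(0)  # fixed seed so the shuffle is repeatable

print(permute_sentences(original, rng))                           # sentences in a new order
print(" ".join(infill_span(original.split(), start=1, length=3)))  # a three-word span masked
```

During pre-training the model sees the corrupted text as input and is asked to produce the original text as output.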
BART operates in a sequence-to-sequence manner. In simple terms, it takes a series of input tokens and generates a corresponding series of output tokens. The model consists of two components: an encoder, which reads the input sequence, and a decoder, which generates the output sequence.
Both the encoder and the decoder consist of multiple layers of attention and feed-forward neural networks. The attention layers enable the model to capture connections between tokens in both the input and output sequences. The feed-forward neural networks enable the model to learn relationships among these tokens.
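The way an attention layer relates tokens to one another can be sketched in a few lines of plain Python. This is a simplified, single-head version of scaled dot-product attention with hand-written vectors, intended only to show the idea. It is not BART's implementation, and the function names are chosen for this example.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """For each query, return a weighted average of `values`.

    The weights come from how strongly the query matches each key.
    """
    dim = len(keys[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Two tokens with identical keys receive equal weight,
# so the output is the average of their value vectors.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]]))
```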
The summarization task offered by Hugging Face Transformers involves creating a brief and logical summary of a longer piece of text or document. This task falls under the wider scope of natural language processing (NLP) and is especially valuable for condensing information, improving readability for readers, or extracting important insights from long articles, documents, or web pages.
Hugging Face Transformers, commonly known as "Transformers," is a freely available library and ecosystem for deep learning and natural language processing (NLP). It offers a diverse array of pre-trained transformer-based models, such as BERT, GPT, RoBERTa, and others. These models cater to various NLP tasks, including text classification, text generation, named entity recognition, and machine translation.
transformers: This package uses Natural Language Processing under the hood and summarizes the input text using Transformer architecture.
pip install transformers
tensorflow: This package provides the deep learning backend that transformers uses to load and run the model.
pip install tensorflow
requests: This package is used to make a "GET" request on the given website URL for extracting text.
pip install requests
bs4: This package is used for scraping the content in a given website for summarizing.
pip install bs4
lxml: This is used for processing XML and HTML documents.
pip install lxml
streamlit: This package is used for designing a GUI (Graphical User Interface) thus making an interactive fully functional application.
pip install streamlit
Install all of the above packages using pip, in the order listed. If you run into installation issues, use a virtual environment.
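One quick way to confirm that the installation succeeded is to check whether each package can be found by Python. The sketch below uses only the standard library; the helper name `missing_packages` is made up for this example, and the list simply mirrors the packages named above.

```python
import importlib.util

# Import names of the packages installed above.
REQUIRED = ["transformers", "tensorflow", "requests", "bs4", "lxml", "streamlit"]


def missing_packages(names):
    """Return the names that Python cannot locate (without importing them)."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```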
Code Analysis:
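The text-extraction step can be sketched roughly as follows, using requests to download the page and bs4 (with the lxml parser) to pull out the text. This is a sketch under assumptions, not necessarily the exact code of this application: the function names are hypothetical, and collecting only the paragraph (`<p>`) tags is one reasonable choice among several.

```python
import re


def fetch_html(url, timeout=10):
    """Download the raw HTML of a page with a GET request."""
    import requests  # third-party, imported here so the helpers below stay importable

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def clean_text(text):
    """Collapse runs of whitespace (newlines, tabs) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def html_to_text(html):
    """Assumption: the article body lives in <p> tags, so only those are kept."""
    from bs4 import BeautifulSoup  # third-party

    soup = BeautifulSoup(html, "lxml")
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return clean_text(" ".join(paragraphs))
```

Calling `html_to_text(fetch_html(url))` then yields one long string that can be passed to the next step.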
Code Analysis:
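Because the model accepts a limited amount of input, the extracted text has to be split into pieces before summarizing. Below is a minimal sketch of one way to do this, splitting on word boundaries. The function name is hypothetical, and counting words is an assumption made for simplicity: words and model tokens are not the same thing, so a chunk of 1024 words can still exceed 1024 tokens, and a smaller value may be needed in practice.

```python
def chunk_words(text, chunk_size=1024):
    """Split `text` into pieces of at most `chunk_size` whitespace-separated words."""
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Small demonstration with a tiny chunk size.
for piece in chunk_words("one two three four five", chunk_size=2):
    print(piece)
```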
Note: The model we are using is "facebook/bart-large-cnn" which takes a max of 1024 tokens. So 1024 is specified as chunk_size. Adjust the chunk_size parameter according to your model needs.
Code Analysis:
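The summarization step can be sketched with the Transformers `pipeline` API, which returns a list of dictionaries with a `summary_text` key. The function names and the `max_length` / `min_length` values below are illustrative assumptions rather than the settings used in this application. Passing `truncation=True` is a precaution so that an over-long chunk is cut to the model's limit instead of raising an error.

```python
def load_summarizer(model_name="facebook/bart-large-cnn"):
    """Create a summarization pipeline (downloads the model on first use)."""
    from transformers import pipeline  # third-party

    return pipeline("summarization", model=model_name)


def summarize_chunks(chunks, summarizer, max_length=130, min_length=30):
    """Summarize each chunk separately and join the partial summaries."""
    summaries = []
    for chunk in chunks:
        result = summarizer(
            chunk,
            max_length=max_length,
            min_length=min_length,
            do_sample=False,   # deterministic output
            truncation=True,   # cut inputs that exceed the model limit
        )
        summaries.append(result[0]["summary_text"])
    return " ".join(summaries)
```

A typical call would be `summarize_chunks(chunks, load_summarizer())`, where `chunks` is the list produced by the splitting step.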
Note: The first time this function runs, a TensorFlow model of about 1.63 GB is downloaded to your machine.
Output:
'Natural Language Processing is a subset of artificial intelligence. It enables machines to comprehend and analyze human languages.
In NLP we need to perform some extra processing steps. NLP software mainly works at the sentence level and it also expects words to be separate.
We will see some of the ways of collecting data if it is not available in our local machine or database. In NLP this process of feature engineering is
known as Text Representation or Text Vectorization. In the traditional approach, we create a vocabulary of unique words assign a unique id
(integer value) for each word. Bag of n-gram tries to solve this problem by breaking text into chunks of n continuous words.
N-gram representations are in the form of a sparse matrix, where each row represents a sentence and each column represents an n-gram in the vocabulary.
TF-IDF tries to quantify the importance of a given word relative to the other word in the corpus.
The value in the vector represents the measurements of some features or quality of the word. This is not interpretable for humans but Just for
representation purposes. We can understand this with the help of the below table. Heuristic-based approach is also used for the data-gathering
tasks for ML/DL model. Regular expressions are largely used in this type of model. Recurrent neural networks are a class of artificial neural networks.
The basic concept of RNNs is that they analyze input sequences one element at a time while maintaining track in a hidden state that contains a summary
of the sequence’s previous elements. This enables the RNN to process data from sources like natural languages, where context is crucial.
Long Short-Term Memory (LSTM) is an advanced form of RNN model. LSTMs function by selectively passing or retaining information from one-time
step to the next. Gated Recurrent Unit (GRU) is also the advanced form of RNN. GRUs also have gating mechanisms that allow them to selectively
update or forget information from the previous time steps. '
Code Analysis:
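A minimal sketch of how the Streamlit interface could be wired up is shown below. It is an assumed layout, not necessarily this application's exact code: the widget labels are made up, and `summarize_url` is a hypothetical function of your own that takes a URL and returns the summary string. The Streamlit module is passed in as the `st` argument so that the layout logic stays separate from the summarization code.

```python
def render_app(st, summarize_url):
    """Draw the page: a title, a URL box, a button, and the resulting summary.

    `st` is the streamlit module; `summarize_url` maps a URL to a summary string.
    """
    st.title("Website Summarizer")
    url = st.text_input("Enter a website URL")

    if st.button("Summarize") and url:
        try:
            with st.spinner("Summarizing..."):
                summary = summarize_url(url)
        except Exception as exc:  # e.g. network errors or an unreachable page
            st.error(f"Could not summarize this page: {exc}")
        else:
            st.write(summary)


# In app.py, after defining your own summarize_url(url) function:
#     import streamlit as st
#     render_app(st, summarize_url)
```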
The final application can be run using the command below in the terminal.
streamlit run app.py
After running the command, Streamlit prints a Local URL, where the application can be accessed on the same machine, and a Network URL, where it can be accessed from other devices on the same network. Copy and paste either URL into your browser to open the application.
Here is the website that is displayed after running the above command.
Output:
[Screenshot of the summarizer application running in the browser]