Natural Language Processing (NLP) serves as a bridge between human language and computers. It is a subfield of Artificial Intelligence that helps machines process, understand, and generate natural language. Common NLP tasks include text and speech processing, language translation, and sentiment analysis. Typical use cases include spam detection, chatbots, and text summarization.
There are three types of NLP approaches:
Rule-based Approach - Based on linguistic rules and patterns
Machine Learning Approach - Based on statistical analysis
Neural Network Approach - Based on neural network architectures such as feed-forward, recurrent, and convolutional networks
Rule-based approach in NLP
The rule-based approach is one of the oldest NLP methods. It uses predefined linguistic rules to analyze and process textual data: a set of rules or patterns is applied to capture specific structures, extract information, or perform tasks such as text classification. Common rule-based techniques include regular expressions and pattern matching.
Steps in Rule-based approach in NLP:
Rule Creation: Based on the desired tasks, domain-specific linguistic rules are created such as grammar rules, syntax patterns, semantic rules or regular expressions.
Rule Application: The predefined rules are applied to the input data to capture matching patterns.
Rule Processing: The text is processed according to the matched rules to extract information, make decisions, or perform other tasks.
Rule Refinement: The rules are refined iteratively to improve accuracy and performance. Based on feedback from earlier runs, they are modified and updated as needed.
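The four steps above can be sketched with Python's built-in re module. This is only an illustration under stated assumptions: the task (extracting dollar amounts), the sample sentence, and the names rules and extract_amounts are hypothetical and do not come from any library.

```python
import re

# Rule creation: a pattern for dollar amounts such as $12.50.
rules = [re.compile(r"\$(\d+(?:\.\d{2})?)")]


def extract_amounts(text, rules):
    """Apply every rule to the text and return the amounts as floats."""
    amounts = []
    for rule in rules:
        # Rule application: find every span that matches the pattern.
        for match in rule.finditer(text):
            # Rule processing: turn the matched text into structured data.
            amounts.append(float(match.group(1)))
    return amounts


text = "The ticket costs $12.50, or USD 40 for a family pass."
first_pass = extract_amounts(text, rules)

# Rule refinement: the first pass misses the "USD 40" form, so a rule is added.
rules.append(re.compile(r"\bUSD\s?(\d+(?:\.\d{2})?)"))
second_pass = extract_amounts(text, rules)

print(first_pass)
print(second_pass)
```

The refinement step is the costly part in practice: each new surface form needs another rule, which is the maintenance burden listed among the disadvantages later in this article.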
Libraries that can be used for a rule-based approach include spaCy (well suited for production), fast.ai, and NLTK (less favored nowadays). In this article, we'll work with the spaCy library to demonstrate the rule-based approach.
spaCy is an open-source library designed for advanced Natural Language Processing tasks. It is written in Python and Cython and provides a wide range of functionality for processing and analyzing large volumes of text data.
spaCy includes a rule-matching engine called the Matcher, which works over tokens, entities, and phrases in a manner similar to regular expressions.
spaCy Installation:
# spaCy installation
!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm  # English model
Example 1: Matching Tokens with the Rule-based Approach
Step 1: The necessary modules are imported.
Step 2: The English spaCy model is loaded.
Step 3: The input text is processed and split into tokens.
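The code for Steps 1 to 3 is not shown here, so the following is a minimal sketch of what it might look like, assuming spaCy v3. The input text is taken from the printed output in this article. The fallback to spacy.blank("en") is an addition for convenience: it keeps the sketch runnable when en_core_web_sm is not installed, since only the tokenizer is needed at this point.

```python
import spacy

# Step 2: load the English model. If en_core_web_sm is not installed,
# fall back to a blank English pipeline, which still provides the tokenizer.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")

# Step 3: process the input text and collect its tokens.
text = (
    "Natural Language Processing serves as an interrelationship between "
    "human language and computers. Natural Language Processing is a subfield "
    "of Artificial Intelligence that helps machines process, understand and "
    "generate natural language intuitively."
)
doc = nlp(text)
tokens = [token for token in doc]

print("Tokens:", tokens)
print("Number of token :", len(tokens))
```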
Output:
Tokens: [Natural, Language, Processing, serves, as, an, interrelationship, between, human,
language, and, computers, ., Natural, Language, Processing, is, a, subfield, of, Artificial,
Intelligence, that, helps, machines, process, ,, understand, and, generate, natural,
language, intuitively, .]
Number of token : 34
Step 4: The rule-based matching engine 'Matcher' is imported and instantiated with the model's vocabulary.
Step 5: The rule, or pattern, to search for in the text is defined. Here the words 'language' and 'human' are set as patterns.
Step 6: The patterns are added to the matcher object using the 'add' method, with the ID as the first parameter and the list of patterns as the second.
Step 7: The matcher object is called on the 'doc' object to find the patterns. The result is stored in the 'matches' variable.
Step 8: The matched results are extracted and printed.
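Steps 4 to 8 might be written as follows. This is a sketch under assumptions, not the article's original code: it targets the spaCy v3 Matcher API, the rule ID "TokenMatch" is a hypothetical name, and the LOWER attribute is a choice made here so that both "Language" and "language" are matched.

```python
import spacy
from spacy.matcher import Matcher

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")  # the tokenizer alone is enough for LOWER patterns

doc = nlp(
    "Natural Language Processing serves as an interrelationship between "
    "human language and computers. Natural Language Processing is a subfield "
    "of Artificial Intelligence that helps machines process, understand and "
    "generate natural language intuitively."
)

# Step 4: create the matcher with the pipeline's shared vocabulary.
matcher = Matcher(nlp.vocab)

# Step 5: one single-token pattern per word, compared on the lowercase form.
patterns = [[{"LOWER": "language"}], [{"LOWER": "human"}]]

# Step 6: register the patterns under an ID ("TokenMatch" is a made-up name).
matcher.add("TokenMatch", patterns)

# Step 7: run the matcher; each match is a (match_id, start, end) tuple.
matches = matcher(doc)

# Step 8: turn token offsets back into text and print them.
matched_words = []
for match_id, start, end in matches:
    span = doc[start:end]
    matched_words.append(span.text)
    print(nlp.vocab.strings[match_id], start, end, span.text)
```

Replacing LOWER with TEXT would make the comparison case-sensitive, so the capitalized "Language" in "Natural Language Processing" would no longer match.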
Example 2: Matching Phrases with the Rule-based Approach
Step 1: The PhraseMatcher class is imported from spaCy.
Step 2: The English spaCy model is loaded.
Step 3: The input text is processed into a 'doc' object.
Output:
Natural Language Processing serves as an interrelationship between human language and computers.
Natural Language Processing is a subfield of Artificial Intelligence that helps machines process,
understand and generate natural language intuitively.
Step 4: The PhraseMatcher object is instantiated.
Step 5: The phrases to search for are listed in term_list. Each phrase is converted to a pattern with the 'make_doc' method, which runs only the tokenizer and is therefore faster than calling the full pipeline.
Step 6: The created patterns are added to the matcher object.
Step 7: The matcher object is called on the input text 'doc' with the parameter 'as_spans=True', which returns span objects directly. The extracted results are printed.
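A sketch of the full example, assuming the spaCy v3 PhraseMatcher API, is given below. The contents of term_list are inferred from the printed matches in this article rather than taken from the original code, and the fallback to a blank pipeline is an addition that keeps the sketch runnable without the downloaded model.

```python
import spacy
from spacy.matcher import PhraseMatcher

try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    nlp = spacy.blank("en")  # tokenizer only, which is all PhraseMatcher needs here

doc = nlp(
    "Natural Language Processing serves as an interrelationship between "
    "human language and computers. Natural Language Processing is a subfield "
    "of Artificial Intelligence that helps machines process, understand and "
    "generate natural language intuitively."
)

# Step 4: create the phrase matcher (it compares exact token text by default).
matcher = PhraseMatcher(nlp.vocab)

# Step 5: tokenize each phrase with make_doc, skipping the rest of the pipeline.
term_list = ["Language Processing", "human language"]  # inferred from the output
patterns = [nlp.make_doc(term) for term in term_list]

# Step 6: register the patterns under a single rule name.
matcher.add("Phrase Match", patterns)

# Step 7: as_spans=True returns Span objects instead of (id, start, end) tuples.
spans = matcher(doc, as_spans=True)
for span in spans:
    print(span.text, ":-", span.label_)
```

Because the default comparison is case-sensitive, "natural language" at the end of the text is not reported. Passing attr="LOWER" to PhraseMatcher would make the comparison case-insensitive.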
Output:
Language Processing :- Phrase Match
human language :- Phrase Match
Language Processing :- Phrase Match
Example 3: Named Entity Recognition with spaCy
Step 1: spaCy is imported and the English model is loaded.
Step 2: The text is processed, and each recognized entity is printed with its label.
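The two steps might look like the sketch below. The input sentence is a hypothetical stand-in, because the text used for the output in this article is not shown. Entity recognition needs the trained en_core_web_sm model; the fallback to a blank pipeline is an addition that only keeps the code from failing when the model is missing, in which case no entities are found.

```python
import spacy

# Step 1: load the English model, which includes a trained NER component.
try:
    nlp = spacy.load("en_core_web_sm")
except OSError:
    # A blank pipeline has no NER component, so doc.ents will be empty.
    nlp = spacy.blank("en")

# Step 2: process the text and read the entities from doc.ents.
# Hypothetical input; it is not the text used to produce the article's output.
text = "India shares land borders with Nepal and Bhutan."
doc = nlp(text)

entities = [(ent.text, ent.label_) for ent in doc.ents]
for ent_text, label in entities:
    print(f"Text:{ent_text}, Label:{label}")
```

Note that this component is a statistical model rather than a set of hand-written rules, so the labels it assigns can vary between model versions.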
Output:
Text:Pawan Kumar Gunjan, Label:PERSON
Text:India, Label:GPE
Text:India, Label:GPE
Text:the Republic of India, Label:GPE
Text:South Asia, Label:LOC
Text:seventh, Label:ORDINAL
Text:second, Label:ORDINAL
Text:the Indian Ocean, Label:LOC
Text:the Arabian Sea, Label:LOC
Text:the Bay of Bengal, Label:LOC
Text:Pakistan, Label:GPE
Text:China, Label:GPE
Text:Nepal, Label:GPE
Text:Bhutan, Label:GPE
Text:Bangladesh, Label:GPE
Text:Myanmar, Label:GPE
Advantages of the Rule-based approach:
Easily interpretable as rules are explicitly defined
Rule-based techniques can help semi-automatically annotate data in domains where no annotated data exists (for example, Named Entity Recognition (NER) tasks in a particular domain).
Functions even with scant or poor training data
Computation is fast, and precision is typically high
Rules can often give deterministic solutions to problems such as tokenization, sentence breaking, or morphology (at least in some languages).
Disadvantages of the Rule-based approach:
Labor-intensive as more rules are needed to generalize
Generating rules for complex tasks is time-consuming
Needs regular maintenance
May not perform well in handling variations and exceptions in language usage
May have low recall, since text that no rule covers is missed
Why Combine the Rule-based Approach with Machine Learning and Neural Network Approaches?
When combined with other approaches, rule-based NLP usually handles the edge cases.
It helps to speed up data annotation. For instance, a rule-based technique can be used for URL formats, date formats, etc., while a machine learning approach can be used to determine the position of text in a PDF file (including numerical data).
In languages other than English, annotated data is scarce even for common tasks, so these tasks are often carried out with rule-based NLP.
Using a rule-based approach for part of the work can also improve the computational performance of the pipeline.
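As an illustration of the annotation point above, rules can pre-label well-structured items such as URLs and dates so that annotators or a downstream model only deal with the rest. The sketch below uses Python's re module. The patterns are deliberately simple, and the names RULES and pre_annotate are hypothetical.

```python
import re

# Hypothetical rule set: label -> compiled pattern. Real URL and date
# formats vary far more than these two patterns cover.
RULES = {
    "URL": re.compile(r"https?://[^\s,]+"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def pre_annotate(text):
    """Return (start, end, label) spans found by the rules, in text order."""
    spans = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)


text = "Report published 2024-01-15 at https://example.com/report for review."
spans = pre_annotate(text)
for start, end, label in spans:
    print(label, text[start:end])
```

The character offsets returned here are the kind of span data that annotation tools and training pipelines typically consume, so the output can seed a labeled dataset before any model exists.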