In this article, we are going to discuss Item Loaders in Scrapy.
Scrapy extracts data using spiders that crawl through a website. The obtained data can be processed in the form of Scrapy Items. Item Loaders play a significant role in parsing the data before the Item fields are populated.
Scrapy requires Python 3.6 or above (recent Scrapy releases need a newer Python version, so check the installation guide for the release you use). Install it using the pip command at the terminal:
pip install Scrapy
This command installs the Scrapy library in your environment. Now we can create a Scrapy project to write the Python spider code.
Scrapy comes with an efficient command-line tool, called the Scrapy tool. Its commands take different sets of arguments, based on their purpose. To write the spider code, we begin by creating a Scrapy project. Use the following 'startproject' command at the terminal:
scrapy startproject gfg_itemloaders
This command creates a folder called 'gfg_itemloaders'. Now change the directory to that folder:
cd gfg_itemloaders
The folder structure of the Scrapy project is as follows. It has a scrapy.cfg file, which is the project configuration file. The folder containing this file is called the root directory. Inside it, the project's Python package holds items.py, middlewares.py, pipelines.py, settings.py and the 'spiders' folder.
The spider file for crawling will be created inside the 'spiders' folder. We will declare our Scrapy Items and the related loader logic in the items.py file. Keep the contents of that file as they are for now. Use the 'genspider' command to create a spider code file:
scrapy genspider gfg_loadbookdata "books.toscrape.com/catalogue/category/books/womens-fiction_9"
Run the command at the terminal, from inside the project folder.
We will scrape the Book Title and Book Price from the Women's Fiction webpage. Scrapy allows the use of selectors to write the extraction code. They can be written using CSS or XPath expressions, which traverse the HTML page to reach the desired data.
The main objective of scraping is to get structured data from unstructured sources. Usually, Scrapy spiders yield data as Python dictionary objects. This approach works well with a small amount of data, but as the data grows, so does the complexity. We may also want to process the data before storing it in any file format. This is where Scrapy Items come in handy: they allow the data to be processed using Item Loaders. Let us write a Scrapy Item for the Book Title and Price, and the XPath expressions for them.
In the 'items.py' file, declare the attributes we need to scrape.
We define them as follows:
We will create an object of the above Item class in the spider and yield it. The spider code file will look as follows:
Run the spider and export the scraped data with the 'crawl' command:
scrapy crawl gfg_loadbookdata -o not_parsed_data.json
The data is exported to the "not_parsed_data.json" file, with the values stored exactly as they appear on the page.
Now suppose we want to process the scraped data before yielding it and storing it in any file format. For that, we can use Item Loaders.
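Item Loaders offer a smoother way to manage scraped data. Often the scraped data needs processing, such as cleaning or reformatting values, before it is stored.
In this article, we will do the following processing: remove the pound sign from the book price, and replace the '&' sign in the book title with 'AND'.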
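So far we know that Item Loaders are used to parse the data before the Item fields are populated. Let us understand how Item Loaders work. Each Item field has an input processor and an output processor. The input processor processes the data as soon as it is received through add_xpath(), add_css() or add_value(), and the results are collected inside the loader. When load_item() is called, the output processor is applied to the collected data, and its result is assigned to the Item field.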
Now let us understand the built-in processors and the methods that we will use in the Item Loaders implementation. Scrapy has six built-in processors:
Identity(): This is the default and simplest processor. It never changes any value. It can be used as an input as well as an output processor. When no other processor is specified, this one is applied and returns the values unchanged.
Output:
['star', 'moon', 'galaxy']
TakeFirst(): This returns the first non-null and non-empty value from the data received. It is usually used as an output processor.
Output:
'star'
Compose(): This takes the data and passes it to the function given as the argument. If more than one function is given, the result of each function is passed to the next one. This continues until the last function is executed, and its result is the output.
Output:
HI
MapCompose(): This processor works similarly to Compose and can take more than one function as arguments. Here, the input values are iterated and the first function is applied to each of them, resulting in a new iterable. This new iterable is then passed, value by value, to the second function, and so on. A value for which a function returns None is dropped. This is mainly used as an input processor.
Output:
['TWINKLE', 'LITTLE', 'WONDER', 'THEY']
Join(): This processor returns the values joined together. To put a string between the items, pass it as the separator; the default separator is a single space ' '. In the example below, we have used <a> as a separator:
Output:
'Sky<a>Moon<a>Stars'
SelectJmes(): This processor queries the value using the given JSON path (a JMESPath expression) and returns the output. It requires the jmespath library to run.
Output:
scrapy
In this example, we have used the TakeFirst() and MapCompose() processors. The processors act on the scraped data when Item Loader methods, such as add_xpath(), are executed. In this article, we scrape data with XPath expressions, hence the add_xpath() method of the loader is used. The processors can be imported from the itemloaders.processors module (older Scrapy versions exposed the same processors in scrapy.loader.processors).
We get an item loader object by instantiating the ItemLoader class, which is available in the Scrapy library as scrapy.loader.ItemLoader. The parameters for ItemLoader object creation are item (the Item instance to populate), selector (the selector to extract data from), response (the response used to build a selector when none is given) and any further keyword arguments, which are stored in the loader context.
The most commonly used loader methods are listed below. One can make use of any of them:
| Sr. No | Method | Description |
|---|---|---|
| 1 | get_value(value, *processors, **kwargs) | The value is processed by the given processors and keyword arguments, and the result is returned. The keyword argument can be 're', a regular expression used to extract data from the given value, applied before the processors. |
| 2 | add_value(field_name, value, *processors, **kwargs) | Processes and then adds the given value for the given field. The value is first passed through get_value() with the given processors and kwargs, and then through the field input processor. The result is appended to the data already collected for that field. The field name can also be None, in which case values for multiple fields can be added as a dictionary that maps field names to values. |
| 3 | replace_value(field_name, value, *processors, **kwargs) | Replaces the collected data with the new value, instead of adding to it. |
| 4 | get_xpath(xpath, *processors, **kwargs) | Similar to get_value(), but receives an XPath expression instead of a value. The expression is used to get a list of strings from the selector associated with the ItemLoader. Parameters: xpath - the XPath expression to extract data from the webpage; re - a regular expression string or pattern to get data from the selected XPath region. |
| 5 | add_xpath(field_name, xpath, *processors, **kwargs) | Similar to add_value(), but receives an XPath expression instead of a value. The expression is used to select a list of strings from the selector associated with the ItemLoader. Parameter: xpath - the XPath expression to extract data from. |
| 6 | replace_xpath(field_name, xpath, *processors, **kwargs) | Replaces the collected data with the extracted data, instead of adding to it. |
| 7 | get_css(css, *processors, **kwargs) | Similar to get_value(), but receives a CSS selector instead of a value. The selector is used to get a list of strings from the selector associated with the ItemLoader. Parameters: css - the CSS selector string to get data from; re - a regular expression string or pattern to get data from the selected CSS region. |
| 8 | add_css(field_name, css, *processors, **kwargs) | Similar to add_value(), but receives a CSS selector instead of a value, and adds the extracted data to the field. Parameter: css - the CSS selector string to extract data from. |
| 9 | replace_css(field_name, css, *processors, **kwargs) | Replaces the collected data with the data extracted by the CSS selector, instead of adding to it. |
| 10 | load_item() | Populates the item with the data collected so far and returns it. The collected data is first passed through the output processors, so that the final value is assigned to each field. |
| 11 | nested_css(css, **context) | Creates a nested loader using a CSS selector. The supplied CSS is applied relative to the selector associated with the ItemLoader. |
| 12 | nested_xpath(xpath, **context) | Creates a nested loader using an XPath selector. The supplied XPath is applied relative to the selector associated with the ItemLoader. |
Nested loaders are useful when we are parsing related values from a subsection of a document. Without them, we need to specify the full XPath or CSS path of every value we want to extract. Consider, for example, extracting the links from the footer of a page.
Using nested loaders, we can avoid repeating the footer selector, as follows:
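Please note the following points about nested loaders: they work with both XPath and CSS selectors, they can be nested to any depth, and they should be used only where they make the code simpler, because overusing them makes the parser hard to read.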
Maintenance becomes difficult as the project grows and the number of spiders written for data scraping increases. The parsing rules may also differ from one spider to another. To simplify the maintenance of parsing rules, Item Loaders support regular Python inheritance to deal with the differences within a group of spiders. Let us look at an example where extending a loader is beneficial.
Suppose an eCommerce book website has its book author names starting with an "*" (asterisk). To remove the "*" from the final scraped author names, we can reuse and extend the default loader class 'BookLoader' as follows:
In the above code, BookLoader is the parent class of the SiteSpecificLoader class. By reusing the existing loader, we have added only the functionality to strip the "*" in the new loader class.
Just like Items, Item Loaders can be declared using the class syntax. The declaration can be done as follows:
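The code can be understood as follows: an input processor is declared with the field name followed by the _in suffix, an output processor with the field name followed by the _out suffix, and the default_input_processor and default_output_processor attributes set the processors used for fields that declare none of their own.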
Now we have a general understanding of Item Loaders. Let us implement the above concepts in our example.
The final code for our 'items.py' file will look as shown below:
The final spider file code will look as follows:
We can run the spider and save the data in a JSON file with the Scrapy 'crawl' command, using the syntax scrapy crawl spider_name, as follows:
scrapy crawl gfg_loadbookdata -o parsed_bookdata.json
The above command scrapes the data and parses it, which means the pound sign is removed from the prices and the '&' sign is replaced with 'AND' in the titles. The result is saved in the parsed_bookdata.json file.