Need A Web Scraper? Here’s How To Build One Using SCRAPY AND XPATH!
WHAT IS WEB SCRAPING?
Web scraping is the process of extracting data from websites. Large amounts of data can be pulled from a site and then saved to a computer in a spreadsheet (tabular) format.
IS WEB SCRAPING ILLEGAL?
It’s legal as long as:
1. You’re not exceeding authorised use of the site as you scrape.
2. You’re not scraping copyrighted material.
WHAT IS SCRAPY?
Scrapy is an open-source Python framework built specifically for web scraping and crawling.
WHAT IS XPATH?
XPath is a major element of the XSLT standard. It is used to navigate through the elements and attributes of an XML (or HTML) document.
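To make this concrete, here is a minimal sketch of XPath in action using Scrapy’s Selector class; the HTML fragment is made up purely for illustration:

from scrapy import Selector

# a tiny, made-up HTML fragment to demonstrate XPath navigation
html = '<html><body><a href="https://www.example.com">Example</a></body></html>'
sel = Selector(text=html)

print(sel.xpath('//a/@href').extract_first())   # https://www.example.com
print(sel.xpath('//a/text()').extract_first())  # Example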
HOW DO YOU CREATE A SIMPLE CRAWLER?
Installation instructions for Scrapy can be found at the bottom of the page. A few code snippets are explained below:
import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = ['https://www.example.com']
Here we have created a spider class and listed the URLs we want to crawl.
def parse(self, response):
    # do stuff here
    pass
The response object passed to the parse method holds the contents of the page that was downloaded from one of the URLs in start_urls.
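For example, here is a minimal sketch of a parse method that pulls the page title out of the response (the XPath expression is chosen just for illustration):

def parse(self, response):
    # extract the page title from the downloaded HTML
    title = response.xpath('//title/text()').extract_first()
    self.log('Title of %s: %s' % (response.url, title))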
There we have it: our crawler is ready. All we need to do now is execute the following command:
scrapy crawl myspider
And let the crawling begin.
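Note that scrapy crawl expects to be run from inside a Scrapy project. If all you have is the single file above, one alternative is to run the spider as a plain Python script; a minimal sketch using Scrapy’s CrawlerProcess:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(MySpider)  # the spider class we defined above
process.start()          # the script blocks here until crawling finishes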
HOW TO EXTRACT DATA?
Once you’ve finished scraping, you have to think about how you want to use the data you’ve gathered. Consider the scenario where we want to extract all the links contained in the page we crawled (defined in start_urls, of course). Scrapy has a class called LinkExtractor for exactly this purpose. The following code snippet walks us through the process. Let us revisit the parse method.
from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    link_extractor = LinkExtractor()
    links = link_extractor.extract_links(response)
The links object is a list of Link objects, one per link found on the page. Suppose we want to build a Python dictionary mapping each link’s text to its URL. The following snippet achieves this:
dict_links = {}
for link in links:
    dict_links.update({link.text: link.url})
As simple as that. All the work is done by accessing the text and url attributes of each Link object in our links list.
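To actually save these results, one option is to yield the pairs from parse and let Scrapy’s feed exports write them to disk; a minimal sketch (the output filename is arbitrary):

from scrapy.linkextractors import LinkExtractor

def parse(self, response):
    link_extractor = LinkExtractor()
    for link in link_extractor.extract_links(response):
        # each yielded dict becomes one record in the output feed
        yield {'text': link.text, 'url': link.url}

Running scrapy crawl myspider -o links.json would then write every link to links.json.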
SO WHAT DOES XPATH HAVE TO DO WITH ALL OF THIS?
This is where XPath comes into play. Scrapy works best with XPath when you need to scrape more specific content. Let’s extract all the links and the text associated with them manually using XPath. The following code snippet demonstrates how:
def parse(self, response):
    selector = scrapy.Selector(response)
    # extract all link texts and URLs from the page
    link_texts = selector.xpath('//a//text()').extract()
    urls = selector.xpath('//a/@href').extract()
According to Scrapy’s documentation, “Scrapy comes with its own mechanism for extracting data. They’re called selectors because they ‘select’ certain parts of the HTML document specified either by XPath or CSS expressions.”
Hence we have created a Selector instance for our response object, then applied XPath expressions to it in order to fetch all the links and the text associated with them.
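Continuing from that snippet, we can rebuild the same text-to-URL dictionary by zipping the two lists together. Treat this as a sketch only: //a//text() and //a/@href can drift out of alignment on pages where some anchors have no text.

# pair each link text with its URL, as we did with LinkExtractor
dict_links = dict(zip(link_texts, urls))

The documentation’s mention of CSS expressions applies here too; the CSS equivalents of the two XPath queries are selector.css('a::text').extract() and selector.css('a::attr(href)').extract().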
FOR MORE INFORMATION AND HELP:
In the growing era of automation, Scrapy is incredibly useful when it comes to content marketing and similar scenarios. It can be used to filter through a large amount of data to get the precise information you want, quickly. It can also be integrated with frameworks such as Django. If you’d like to read more about how Scrapy works, please see the official Scrapy documentation.