Automated scraping with Scrapely
To scrape the annotated fields, Portia uses a library called Scrapely (https://github.com/scrapy/scrapely), a useful open-source tool developed independently of Portia. Scrapely uses training data to build a model of what to scrape from a web page. The trained model can then be applied to scrape other web pages with the same structure.
You can install it using pip:
pip install scrapely
Here is an example to show how it works:
>>> from scrapely import Scraper
>>> s = Scraper()
>>> train_url = 'http://example.webscraping.com/view/Afghanistan-1'
>>> s.train(train_url, {'name': 'Afghanistan', 'population': '29,121,286'})
>>> test_url = 'http://example.webscraping.com/view/United-Kingdom-239'
>>> s.scrape(test_url)
[{u'name': [u'United Kingdom'], u'population': [u'62,348,447']}]
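To build intuition for what training does here, the core template idea can be sketched in miniature: record the markup surrounding each annotated value on the training page, then look for the same surrounding markup on a new page. The sketch below is illustrative only, with made-up HTML snippets and function names; Scrapely's actual algorithm is considerably more robust than this.

```python
def train(html, fields):
    """Map each field name to the (prefix, suffix) markup around its value."""
    template = {}
    for name, value in fields.items():
        start = html.index(value)
        template[name] = (html[:start], html[start + len(value):])
    return template

def scrape(html, template):
    """Extract the text found between each field's surrounding markup."""
    result = {}
    for name, (prefix, suffix) in template.items():
        # Match on the tag immediately before/after the value, not the
        # whole page, so pages with different content can still match.
        pre = prefix[prefix.rindex('<'):]      # last tag before the value
        post = suffix[:suffix.index('>') + 1]  # first tag after the value
        start = html.index(pre) + len(pre)
        end = html.index(post, start)
        result[name] = html[start:end]
    return result

# Hypothetical training page fragment with a known value
train_html = '<tr><td>Name:</td><td class="w2p_fw">Afghanistan</td></tr>'
template = train(train_html, {'name': 'Afghanistan'})

# A structurally identical page with different content
test_html = '<tr><td>Name:</td><td class="w2p_fw">United Kingdom</td></tr>'
print(scrape(test_html, template))  # → {'name': 'United Kingdom'}
```

Note that nothing here depends on the particular value being scraped, only on the markup around it, which is why a model trained on one country page transfers to the others.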
First, Scrapely is given the data we want to scrape from the Afghanistan
web page to train the model (here, the country name...