Using regular expressions to get information from downloaded web pages
The regular expression (re) module helps to find specific patterns of text in a downloaded web page. Regular expressions can therefore be used to parse data from web pages.
For instance, we can try to download all images in a web page with the help of the regular expression module.
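To make the idea concrete, here is a minimal sketch of pulling JPG image URLs out of HTML text with the re module. The HTML snippet, the example.com URLs, and the pattern itself are assumptions made for illustration. They are not taken from the recipe, and a regular expression like this only copes with simple, well-formed img tags.

```python
import re

# Hypothetical HTML snippet standing in for a downloaded page.
html = '''
<img src="https://example.com/images/logo.jpg" alt="logo">
<img src="/static/banner.png">
<img class="hero" src='https://example.com/photos/cover.JPG'>
'''

# Capture the src value of <img> tags whose URL ends in .jpg (case-insensitive).
# This pattern is an illustrative assumption, not the one used in the recipe.
jpg_pattern = re.compile(r'<img[^>]+src=["\']([^"\']+\.jpg)["\']', re.IGNORECASE)

# findall returns the captured group (the URL) for every matching tag.
jpg_urls = jpg_pattern.findall(html)
print(jpg_urls)
```

The PNG image is skipped because its URL does not end in .jpg, while the re.IGNORECASE flag lets the upper-case .JPG extension match.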
How to do it...
For this, we can write a Python script that can download all JPG images in a web page:
- Create a file named download_image.py in your working directory.
- Open this file in a text editor. You could use Sublime Text 3.
- As usual, import the required modules:
import urllib2
import re
from os.path import basename
from urlparse import urlsplit
- Download the web page as we did in the previous recipe:
url = 'https://www.packtpub.com/'
response = urllib2.urlopen(url)
source = response.read()
file = open("packtpub.txt", "w")
file.write(source)
file.close()
- Now, iterate over each line in the downloaded web page, search for image URLs, and download them:
patten...
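The basename and urlsplit imports in the steps above suggest one way an image URL could be turned into a local file name. The following is a minimal sketch of that idea only. The helper local_name and the example.com URL are hypothetical, and the try/except import is an assumption added so that the snippet runs on both Python 2 (as used in the recipe) and Python 3.

```python
from os.path import basename

try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit      # Python 2, as in the recipe


def local_name(image_url):
    """Return the file-name part of an image URL, ignoring query and fragment.

    Hypothetical helper for illustration; not part of the recipe.
    """
    # urlsplit separates the URL into components; .path drops the
    # scheme, host, query string, and fragment.
    return basename(urlsplit(image_url).path)


print(local_name('https://example.com/images/logo.jpg?size=large'))
```

Splitting the URL first keeps any query string out of the file name, which calling basename on the raw URL would not do.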