Downloading a web page as plain text
Web pages are simply text marked up with HTML tags, along with JavaScript and CSS. The HTML tags define the content of the page, which Bash scripts can parse to extract specific content. An HTML file can be viewed in a web browser to see it properly formatted, or processed with the tools described in the previous chapter.
Parsing a text document is simpler than parsing HTML data because we aren't required to strip off the HTML tags. Lynx is a command-line web browser that downloads a web page as plain text.
Getting ready
Lynx is not installed by default in all distributions, but it is available via the package manager. On Red Hat-based distributions, run the following command as root:
# yum install lynx
Alternatively, on Debian-based distributions, you can execute the following command:
# apt-get install lynx
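The choice between the two install commands can be sketched as a small script. This is a minimal sketch, assuming only yum and apt-get need to be considered (the two package managers named above); the function name suggest_install is hypothetical, and the script only prints the command rather than running it.

```shell
#!/bin/bash
# Print the install command for the first package manager found.
# Assumption: only yum and apt-get are checked, as in the text above.
suggest_install() {
    local pm
    for pm in yum apt-get; do
        if command -v "$pm" >/dev/null 2>&1; then
            echo "Run as root: $pm install lynx"
            return 0
        fi
    done
    echo "No supported package manager found" >&2
    return 1
}

# Suggest an install command only when lynx is missing.
if command -v lynx >/dev/null 2>&1; then
    echo "lynx is already installed"
else
    suggest_install || echo "Install lynx manually" >&2
fi
```

Printing the command instead of executing it leaves the decision to run something as root with the user.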
How to do it...
The -dump option makes lynx print a web page as plain text instead of displaying it interactively. The following command redirects that text version of the page to a file:
$ lynx URL -dump > webpage_as_text.txt
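The same redirection can be wrapped in a reusable function. This is a minimal sketch, not part of the original recipe: the function name save_page_as_text and the optional output-file argument are assumptions for illustration, and the default file name is taken from the command above.

```shell
#!/bin/bash
# save_page_as_text URL [OUTPUT_FILE]
# Hypothetical helper: dumps URL as plain text into OUTPUT_FILE
# (default: webpage_as_text.txt, the name used in the text).
save_page_as_text() {
    local url=$1
    local out=${2:-webpage_as_text.txt}

    if [ -z "$url" ]; then
        echo "usage: save_page_as_text URL [OUTPUT_FILE]" >&2
        return 2
    fi
    if ! command -v lynx >/dev/null 2>&1; then
        echo "lynx is not installed" >&2
        return 1
    fi

    # -dump prints the rendered page to stdout; redirect it to the file.
    # The function's exit status is that of lynx.
    lynx -dump "$url" > "$out"
}
```

After sourcing the script, a call such as save_page_as_text http://example.com page.txt (a placeholder URL) would write the text version of that page to page.txt. Because the shell creates the output file before lynx runs, a failed download can leave an empty file behind.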
The lynx -dump command lists all the hyperlinks (<a href="link">) separately under a...