Subscription

All Products

Best Sellers

New Releases

Books

Videos

Audiobooks

Learning Hub

Newsletter Hub

Free Learning

Getting Started with Python 2.6 Text Processing

Packt

840 min read
2011-01-10 00:00:00

0 Likes
0 Comments

Python 2.6 Text Processing: Beginners Guide

getting-started-python-26-text-processing-img-0

The easiest way to learn how to manipulate text with Python

The easiest way to learn text processing with Python

Deals with the most important textual data formats you will encounter

Learn to use the most popular text processing libraries available for Python

Packed with examples to guide you through

Read more about this book

(For more resources on this subject, see here.)

Categorizing types of text data

Textual data comes in a variety of formats. For our purposes, we'll categorize text into three very broad groups. Isolating down into segments helps us to understand the problem a bit better, and subsequently choose a parsing approach. Each one of these sweeping groups can be further broken down into more detailed chunks.

One thing to remember when working your way through the book is that text content isn't limited to the Latin alphabet. This is especially true when dealing with data acquired via the Internet.

Providing information through markup

Structured text includes formats such as XML and HTML. These formats generally consist of text content surrounded by special symbols or markers that give extra meaning to a file's contents. These additional tags are usually meant to convey information to the processing application and to arrange information in a tree-like structure. Markup allows a developer to define his or her own data structure, yet rely on standardized parsers to extract elements.

For example, consider the following contrived HTML document.

<html>
   <head>
      <title>Hello, World!</title>
   </head>
   <body>
       <p>
          Hi there, all of you earthlings.
       </p>
       <p>
        Take us to your leader.
       </p>
    </body>
</html>

In this example, our document's title is clearly identified because it is surrounded by opening and closing &lttitle> and </title> elements.

Note that although the document's tags give each element a meaning, it's still up to the application developer to understand what to do with a title object or a p element.

Notice that while it still has meaning to us humans, it is also laid out in such a way as to make it computer friendly.

One interesting aspect to these formats is that it's possible to embed references to validation rules as well as the actual document structure. This is a nice benefit in that we're able to rely on the parser to perform markup validation for us. This makes our job much easier as it's possible to trust that the input structure is valid.

Meaning through structured formats

Text data that falls into this category includes things such as configuration files, marker delimited data, e-mail message text, and JavaScript Object Notation web data. Content within this second category does not contain explicit markup much like XML and HTML does, but the structure and formatting is required as it conveys meaning and information about the text to the parsing application. For example, consider the format of a Windows INI file or a Linux system's /etc/hosts file. There are no tags, but the column on the left clearly means something other than the column on the right.

Python provides a collection of modules and libraries intended to help us handle popular formats from this category.

Understanding freeform content

This category contains data that does not fall into the previous two groupings. This describes e-mail message content, letters, article copy, and other unstructured character-based content. However, this is where we'll largely have to look at building our own processing components. There are external packages available to us if we wish to perform common functions. Some examples include full text searching and more advanced natural language processing.

Ensuring you have Python installed

Our first order of business is to ensure that you have Python installed. We'll be working with Python 2.6 and we assume that you're using that same version. If there are any drastic differences in earlier releases, we'll make a note of them as we go along. All of the examples should still function properly with Python 2.4 and later versions.

If you don't have Python installed, you can download the latest 2.X version from http://www.python.org. Most Linux distributions, as well as Mac OS, usually have a version of Python preinstalled.

At the time of this writing, Python 2.6 was the latest version available, while 2.7 was in an alpha state.

Providing support for Python 3

The examples in this book are written for Python 2. However, wherever possible, we will provide code that has already been ported to Python 3. You can find the Python 3 code in the Python3 directories in the code bundle available on the Packt Publishing FTP site.

Unfortunately, we can't promise that all of the third-party libraries that we'll use will support Python 3. The Python community is working hard to port popular modules to version 3.0. However, as the versions are incompatible, there is a lot of work remaining. In situations where we cannot provide example code, we'll note this.

Implementing a simple cipher

Let's get going early here and implement our first script to get a feel for what's in store.

A Caesar Cipher is a simple form of cryptography in which each letter of the alphabet is shifted down by a number of letters. They're generally of no cryptographic use when applied alone, but they do have some valid applications when paired with more advanced techniques.

getting-started-python-26-text-processing-img-1

This preceding diagram depicts a cipher with an offset of three. Any X found in the source data would simply become an A in the output data. Likewise, any A found in the input data would become a D.

Time for action – implementing a ROT13 encoder

The most popular implementation of this system is ROT13. As its name suggests, ROT13 shifts – or rotates – each letter by 13 spaces to produce an encrypted result. As the English alphabet has 26 letters, we simply run it a second time on the encrypted text in order to get back to our original result.

Let's implement a simple version of that algorithm.

Start your favorite text editor and create a new Python source file. Save it as rot13.py.

Enter the following code exactly as you see it below and save the file.

import sys
import string
CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
    )
)
def rotate13_letter(letter):
    """
    Return the 13-char rotation of a letter.
    """
    do_upper = False
    if letter.isupper():
       do_upper = True
    letter = letter.lower()
    if letter not in CHAR_MAP:
        return letter
    else:
        letter = CHAR_MAP[letter]
        if do_upper:
            letter = letter.upper()
    return letter
if __name__ == '__main__':
        for char in sys.argv[1]:
            sys.stdout.write(rotate13_letter(char))
        sys.stdout.write('n')

Now, from a command line, execute the script as follows. If you've entered all of the code correctly, you should see the same output.
```
$ python rot13.py 'We are the knights who say, nee!'
```
Unlock access to the largest independent learning library in Tech for FREE!

Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.

Renews at €14.99/month. Cancel anytime

getting-started-python-26-text-processing-img-2

Run the script a second time, using the output of the first run as the new input string. If everything was entered correctly, the original text should be printed to the console.
```
$ python rot13.py 'Dv ziv gsv pmrtsgh dsl hzb, mvv!'
```

getting-started-python-26-text-processing-img-3

What just happened?

We implemented a simple text-oriented cipher using a collection of Python's string handling features. We were able to see it put to use for both encoding and decoding source text. We saw a lot of stuff in this little example, so you should have a good feel for what can be accomplished using the standard Python string object.

Following our initial module imports, we defined a dictionary named CHAR_MAP, which gives us a nice and simple way to shift our letters by the required 13 places. The value of a dictionary key is the target letter! We also took advantage of string slicing here.

In our translation function rotate13_letter, we checked whether our input character was uppercase or lowercase and then saved that as a Boolean attribute. We then forced our input to lowercase for the translation work. As ROT13 operates on letters alone, we only performed a rotation if our input character was a letter of the Latin alphabet. We allowed other values to simply pass through. We could have just as easily forced our string to a pure uppercased value.

The last thing we do in our function is restore the letter to its proper case, if necessary. This should familiarize you with upper- and lowercasing of Python ASCII strings.

We're able to change the case of an entire string using this same method; it's not limited to single characters.

>>> name = 'Ryan Miller'
>>> name.upper()
'RYAN MILLER'
>>> "PLEASE DO NOT SHOUT".lower()
'please do not shout'
>>>

It's worth pointing out here that a single character string is still a string. There is not a char type, which you may be familiar with if you're coming from a different language such as C or C++. However, it is possible to translate between character ASCII codes and back using the ord and chr built-in methods and a string with a length of one.

Notice how we were able to loop through a string directly using the Python for syntax. A string object is a standard Python iterable, and we can walk through them detailed as follows. In practice, however, this isn't something you'll normally do. In most cases, it makes sense to rely on existing libraries.

$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> for char in "Foo":
...     print char
...
F
o
o
>>>

Finally, you should note that we ended our script with an if statement such as the following:

Python modules all contain an internal __name__ variable that corresponds to the name of the module. If a module is executed directly from the command line, as is this script, whose name value is set to __main__, this code only runs if we've executed this script directly. It will not run if we import this code from a different script. You can import the code directly from the command line and see for yourself.

>>> if__name__ == '__main__'

Notice how we were able to import our module and see all of the methods and attributes inside of it, but the driver code did not execute.

Have a go hero – more translation work

Each Python string instance contains a collection of methods that operate on one or more characters. You can easily display all of the available methods and attributes by using the dir method. For example, enter the following command into a Python window. Python responds by printing a list of all methods on a string object.

$ python
Python 2.6.1 (r261:67515, Jul 7 2009, 23:51:51)
[GCC 4.2.1 (Apple Inc. build 5646)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import rot13
>>> dir(rot13)
['CHAR_MAP', '__builtins__', '__doc__', '__file__', '__name__', '__
package__', 'rotate13_letter', 'string', 'sys']
>>>

Much like the isupper and islower methods discussed previously, we also have an isspace method. Using this method, in combination with your newfound knowledge of Python strings, update the method we defined previously to translate spaces to underscores and underscores to spaces.

Processing structured markup with a filter

Our ROT13 application works great for simple one-line strings that we can fit on the command line. However, it wouldn't work very well if we wanted to encode an entire file, such as the HTML document we took a look at earlier. In order to support larger text documents, we'll need to change the way we accept input. We'll redesign our application to work as a filter.

A filter is an application that reads data from its standard input file descriptor and writes to its standard output file descriptor. This allows users to create command pipelines that allow multiple utilities to be strung together. If you've ever typed a command such as cat /etc/hosts | grep mydomain.com, you've set up a pipeline

getting-started-python-26-text-processing-img-4

In many circumstances, data is fed into the pipeline via the keyboard and completes its journey when a processed result is displayed on the screen.

Time for action – processing as a filter

Let's make the changes required to allow our simple ROT13 processor to work as a command-line filter. This will allow us to process larger files.

Create a new source file and enter the following code. When complete, save the file as rot13-b.py.

>>> dir("content")
['__add__', '__class__', '__contains__', '__delattr__', '__doc__',
'__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__',
'__getnewargs__', '__getslice__', '__gt__', '__hash__', '__init__', '__
le__', '__len__', '__lt__', '__mod__', '__mul__', '__ne__', '__new__',
'__reduce__', '__reduce_ex__', '__repr__', '__rmod__', '__rmul__', '__
setattr__', '__sizeof__', '__str__', '__subclasshook__', '_formatter_
field_name_split', '_formatter_parser', 'capitalize', 'center', 'count',
'decode', 'encode', 'endswith', 'expandtabs', 'find', 'format', 'index',
'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace', 'istitle',
'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition', 'replace',
'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit', 'rstrip', 'split',
'splitlines', 'startswith', 'strip', 'swapcase', 'title', 'translate',
'upper', 'zfill']
>>>

Enter the following HTML data into a new text file and save it as sample_page.html. We'll use this as example input to our updated rot13.py.

import sys
import string
CHAR_MAP = dict(zip(
    string.ascii_lowercase,
    string.ascii_lowercase[13:26] + string.ascii_lowercase[0:13]
    )
)
def rotate13_letter(letter):
    """
    Return the 13-char rotation of a letter.
    """
    do_upper = False
    if letter.isupper():
        do_upper = True
    letter = letter.lower()
    if letter not in CHAR_MAP:
        return letter
    else:
        letter = CHAR_MAP[letter]
        if do_upper:
            letter = letter.upper()
    return letter
if __name__ == '__main__':
    for line in sys.stdin:
        for char in line:
            sys.stdout.write(rotate13_letter(char))

Now, run our rot13.py example and provide our HTML document as standard input data. The exact method used will vary with your operating system. If you've entered the code successfully, you should simply see a new prompt.

<html>
    <head>
       <title>Hello, World!</title>
    </head>
    <body> 
       <p>
          Hi there, all of you earthlings.
       </p>
       <p>
          Take us to your leader.
       </p>
    </body>
</html>

The contents of rot13.html should be as follows. If that's not the case, double back and make sure everything is correct.
```
$ cat sample_page.html | python rot13-b.py > rot13.html
$
```

Open the translated HTML file using your web browser.

getting-started-python-26-text-processing-img-5

What just happened?

We updated our rot13.py script to read standard input data rather than rely on a command-line option. Doing this provides optimal configurability going forward and lets us feed input of varying length from a collection of different sources. We did this by looping on each line available on the sys.stdin file stream and calling our translation function. We wrote each character returned by that function to the sys.stdout stream.

Next, we ran our updated script via the command line, using sample_page.html as input. As expected, the encoded version was printed on our terminal.

As you can see, there is a major problem with our output. We should have a proper page title and our content should be broken down into different paragraphs.

Remember, structured markup text is sprinkled with tag elements that define its structure and organization.

In this example, we not only translated the text content, we also translated the markup tags, rendering them meaningless. A web browser would not be able to display this data properly. We'll need to update our processor code to ignore the tags. We'll do just that in the next section.