Rewriting an immutable string
How can we rewrite an immutable string? We can't change individual characters inside a string:
>>> title = "Recipe 5: Rewriting, and the Immutable String"
>>> title[8] = ''
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
Since this doesn't work, how do we make a change to a string?
Getting ready
Let's assume we have a string like this:
>>> title = "Recipe 5: Rewriting, and the Immutable String"
We'd like to do two transformations:
Remove the part before the :
Replace the punctuation with _, and make all the characters lowercase
Since we can't replace characters in a string object, we have to work out some alternatives. There are several common things we can do, shown as follows:
A combination of slicing and concatenating a string to create a new string.
When shortening, we often use the partition() method.
We can replace a character or a substring with the replace() method.
We can expand the string into a list of characters, then join the list back into a single string again. This is the subject for a separate recipe, Building complex strings with a list of characters.
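The first alternative, slicing and concatenating, can be sketched as follows. This is only an illustration: the position variable and the replacement character "6" are made up for the example and are not part of the recipe.

```python
title = "Recipe 5: Rewriting, and the Immutable String"

# Strings are immutable, so we build a new string from the pieces we keep.
# Index 7 holds the "5"; we put a "6" between the text before and after it.
position = 7  # hypothetical position chosen for this illustration
new_title = title[:position] + "6" + title[position + 1:]

print(new_title)  # Recipe 6: Rewriting, and the Immutable String
print(title)      # the original string object is unchanged
```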
How to do it...
Since we can't update a string in place, we have to replace the string variable's object with each modified result. We'll use a statement that looks like this:
some_string = some_string.method()
Or we could even use:
some_string = some_string[:chop_here]
We'll look at a number of specific variations on this general theme. We'll slice a piece of a string, we'll replace individual characters within a string, and we'll apply blanket transformations such as making the string lowercase. We'll also look at ways to remove extra _ characters that show up in our final string.
Slicing a piece of a string
Here's how we can shorten a string via slicing:
Find the boundary:
>>> colon_position = title.index(':')
The index() method locates a particular substring and returns the position where that substring can be found. If the substring doesn't exist, it raises a ValueError exception. This is always true of the result: title[colon_position] == ':'.
Pick the substring:
>>> discard_text, post_colon_text = title[:colon_position], title[colon_position+1:]
>>> discard_text
'Recipe 5'
>>> post_colon_text
' Rewriting, and the Immutable String'
We've used the slicing notation to show the start:end of the characters to pick. We also used multiple assignment to assign two variables, discard_text and post_colon_text, from two expressions.
We can use partition() as an alternative to manual slicing. Find the boundary and partition:
>>> pre_colon_text, _, post_colon_text = title.partition(':')
>>> pre_colon_text
'Recipe 5'
>>> post_colon_text
' Rewriting, and the Immutable String'
The partition() method returns three things: the part before the target, the target, and the part after the target. We used multiple assignment to assign each object to a different variable. We assigned the target to a variable named _ because we're going to ignore that part of the result. This is a common idiom for places where we must provide a variable, but we don't care about using the object.
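The two approaches differ when the target is missing: index() raises ValueError, while partition() returns the whole string followed by two empty strings. Here is a small sketch of that difference. The text_after() helper and the no_colon string are hypothetical names introduced only for this illustration.

```python
def text_after(text, separator):
    """Return the text following the first separator, or "" if it is absent."""
    _before, _found, after = text.partition(separator)
    return after


title = "Recipe 5: Rewriting, and the Immutable String"
no_colon = "No colon here"  # hypothetical input without the target

print(repr(text_after(title, ":")))
print(no_colon.partition(":"))  # ('No colon here', '', '')

# index() does not return a sentinel value; it raises an exception instead.
try:
    no_colon.index(":")
except ValueError as error:
    print("index() raised ValueError:", error)
```

Which behavior is preferable depends on whether a missing target is an error in our application or simply a case to handle quietly.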
Updating a string with a replacement
We can use replace() to swap out punctuation marks. When using replace() this way, save the results back into the original variable, in this case post_colon_text:
>>> post_colon_text = post_colon_text.replace(' ', '_')
>>> post_colon_text = post_colon_text.replace(',', '_')
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'
This has replaced the two kinds of punctuation with the desired _ characters. We can generalize this to work with all punctuation. This leverages the for statement, which we'll look at in Chapter 2, Statements and Syntax.
We can iterate through all whitespace and punctuation characters:
>>> from string import whitespace, punctuation
>>> for character in whitespace + punctuation:
...     post_colon_text = post_colon_text.replace(character, '_')
...
>>> post_colon_text
'_Rewriting__and_the_Immutable_String'
As each kind of punctuation character is replaced, we assign the latest and greatest version of the string to the post_colon_text variable.
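As an alternative to the loop, not used in this recipe, the same replacement can be done in a single pass with str.maketrans() and translate(). The following is a sketch of that swapped-in technique. The targets and to_underscore names are ours.

```python
from string import punctuation, whitespace

# Build a translation table that maps every whitespace and punctuation
# character to "_". Both arguments to maketrans() must have equal length.
targets = whitespace + punctuation
to_underscore = str.maketrans(targets, "_" * len(targets))

post_colon_text = " Rewriting, and the Immutable String"
post_colon_text = post_colon_text.translate(to_underscore)

print(post_colon_text)  # _Rewriting__and_the_Immutable_String
```

The result is again a new string assigned back to the same variable. The table is built once and can be reused for many strings.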
Making a string all lowercase
Another transformational step is changing a string to all lowercase. As with the previous examples, we use a method, here lower(), and assign the result back to the original variable:
>>> post_colon_text = post_colon_text.lower()
Removing extra punctuation marks
In many cases, there are some additional steps we might follow. We often want to remove leading and trailing _ characters. We can use strip() for this:
>>> post_colon_text = post_colon_text.strip('_')
In some cases, we'll have multiple _ characters because we had multiple punctuation marks. The final step would be something like this to clean up multiple _ characters:
>>> while '__' in post_colon_text:
...     post_colon_text = post_colon_text.replace('__', '_')
...
This is yet another example of the same pattern we've been using: we create a modified string and assign it back to the original variable. This depends on the while statement, which we'll look at in Chapter 2, Statements and Syntax.
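Putting the steps together, the whole transformation might be sketched as one function. The clean_title() name is hypothetical. The sketch also assumes that a title without a : should produce an empty string, which is simply what partition() gives us in that case.

```python
from string import punctuation, whitespace


def clean_title(title):
    """Drop the text up to the first ':', then normalize what is left."""
    # Keep only the part after the colon (empty if there is no colon).
    _, _, text = title.partition(":")
    # Replace every whitespace and punctuation character with "_".
    for character in whitespace + punctuation:
        text = text.replace(character, "_")
    # Lowercase, then trim leading and trailing "_" characters.
    text = text.lower().strip("_")
    # Collapse runs of "_" down to a single "_".
    while "__" in text:
        text = text.replace("__", "_")
    return text


print(clean_title("Recipe 5: Rewriting, and the Immutable String"))
# rewriting_and_the_immutable_string
```

Each step creates a new string and rebinds the local variable text. No string object is ever changed.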
How it works...
We can't—technically—modify a string in place. The data structure for a string is immutable. However, we can assign a new string back to the original variable. This technique behaves the same as modifying a string in place.
When a variable's value is replaced, the previous value loses that reference. Once no references to it remain, the old object is garbage collected. We can see that a new object is involved by using the id() function to track each individual string object:
>>> id(post_colon_text)
4346207968
>>> post_colon_text = post_colon_text.replace('_','-')
>>> id(post_colon_text)
4346205488
Your actual id numbers may be different. What's important is that the original string object assigned to post_colon_text had one id. The new string object assigned to post_colon_text has a different id. It's a new string object.
When the old string has no more references, it is removed from memory automatically.
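One consequence is worth a small sketch: if a second variable still refers to the old string, that object stays alive and unchanged after we rebind the first variable. The variable named original is introduced here only for illustration.

```python
post_colon_text = "rewriting_and_the_immutable_string"
original = post_colon_text  # a second name bound to the same string object

# replace() builds a new string; only post_colon_text is rebound to it.
post_colon_text = post_colon_text.replace("_", "-")

print(original)                     # rewriting_and_the_immutable_string
print(post_colon_text)              # rewriting-and-the-immutable-string
print(original is post_colon_text)  # False: two distinct objects
```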
We made use of slice notation to decompose a string. A slice has two parts: [start:end]. String indices always start with zero as the first item. A slice always includes the starting index, and it never includes the ending index.
Tip
The items in a slice have an index from start to end-1. This is sometimes called a half-open interval.
Think of a slice like this: all characters where the index, i, is in the range start ≤ i < end.
We noted briefly that we can omit the start or end indices. We can actually omit both. Here are the various options available:
title[colon_position]: A single item, the : we found using title.index(':').
title[:colon_position]: A slice with the start omitted. It begins at the first position, index of zero.
title[colon_position+1:]: A slice with the end omitted. It ends at the end of the string, as if we said len(title).
title[:]: Since both start and end are omitted, this is the entire string. With other sequences, such as lists, this produces a copy. Because a string is immutable, a copy is never needed, and CPython returns the original string object.
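The slice options above can be checked with a short sketch. The start and end values of 10 and 19 are picked for this example only; they happen to bracket the word Rewriting.

```python
title = "Recipe 5: Rewriting, and the Immutable String"
colon_position = title.index(":")       # 8

single = title[colon_position]          # ':'
before = title[:colon_position]         # 'Recipe 5'
after = title[colon_position + 1:]      # ' Rewriting, and the Immutable String'
everything = title[:]                   # compares equal to title

# The three pieces cover the string exactly once, with no overlap.
print(before + single + after == title)  # True

# Half-open interval: start is included and end is excluded, so a slice
# from start to end holds end - start characters.
start, end = 10, 19
word = title[start:end]
print(word, len(word) == end - start)    # Rewriting True
```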
There's more...
There are more features to indexing in Python collections like a string. The normal indices start with 0 at the left end. We have an alternate set of indices using negative numbers that work from the right end of a string.
title[-1] is the last character in the title, g
title[-2] is the next-to-last character, n
title[-6:] is the last six characters, String
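A negative index -n refers to the same position as len(title) - n. The following sketch confirms the equivalence for the examples above and adds one more slice, title[:-7], as our own illustration.

```python
title = "Recipe 5: Rewriting, and the Immutable String"
last = len(title) - 1

# Each negative index has a non-negative twin counted from the left.
print(title[-1] == title[last])              # True
print(title[-2] == title[last - 1])          # True
print(title[-6:] == title[len(title) - 6:])  # True

# Negative values work as slice ends, too: drop the last seven characters.
print(title[:-7])  # Recipe 5: Rewriting, and the Immutable
```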
We have a lot of ways to pick pieces and parts out of a string.
Python offers dozens of methods for working with a string. Section 4.7 of the Python Standard Library describes the different kinds of transformations that are available to us. There are three broad categories of string methods. We can ask about a string, we can parse a string, and we can transform a string. Methods such as isnumeric() tell us whether a string consists entirely of numeric characters.
Here's an example:
>>> 'some word'.isnumeric()
False
>>> '1298'.isnumeric()
True
We've looked at parsing with the partition() method. And we've looked at transforming with the lower() method.
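To make the three categories concrete, here is a brief sketch with a couple of methods from each. The choice of methods and the sample text are ours; the standard library documentation lists many more.

```python
text = "Recipe 5: Rewriting"

# Ask: these methods return True or False about the string.
is_recipe = text.startswith("Recipe")   # True
all_numeric = "1298".isnumeric()        # True
signed = "-5".isnumeric()               # False, because "-" is not numeric

# Parse: these methods break the string into pieces.
words = text.split()                    # ['Recipe', '5:', 'Rewriting']
head, sep, tail = text.partition(": ")  # 'Recipe 5', ': ', 'Rewriting'

# Transform: these methods return a new, modified string.
shouted = text.upper()                  # 'RECIPE 5: REWRITING'

print(is_recipe, all_numeric, signed)
print(words, head, tail, shouted)
```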
See also
We'll look at the string-as-list technique for modifying a string in the Building complex strings from lists of characters recipe.
Sometimes we have data that's only a stream of bytes. In order to make sense of it, we need to convert it into characters. That's the subject for the Decoding bytes – how to get proper characters from some bytes recipe.