Removing punctuation marks
Depending on the tokenizer used and the input given to it, it may be desirable to remove punctuation from the resulting list of tokens. The regexp_tokenize function with '\w+' as the expression removes punctuation well, but word_tokenize does not: it returns many punctuation marks as their own tokens.
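To see the difference, here is a quick comparison of the two tokenizers (a minimal sketch; the sample sentence is our own, not from the recipe):

>>> from nltk.tokenize import regexp_tokenize, word_tokenize
>>> text = "Punctuation, it seems, is everywhere!"
>>> regexp_tokenize(text, r'\w+')
['Punctuation', 'it', 'seems', 'is', 'everywhere']
>>> word_tokenize(text)
['Punctuation', ',', 'it', 'seems', ',', 'is', 'everywhere', '!']

Note that regexp_tokenize matches only runs of word characters, so punctuation never appears in its output, while word_tokenize deliberately splits punctuation into separate tokens.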
How to do it
Removing punctuation marks from our tokens is done similarly to how we removed other words: a list comprehension selects only those items that are not punctuation marks. The 07/09_remove_punctuation.py script demonstrates this. Let's walk through the process:
- We'll start with the following, which will word_tokenize a string from a job listing:
>>> from nltk.tokenize import word_tokenize
>>> from nltk.corpus import stopwords
>>> content = "Strong programming experience in C#, ASP.NET/MVC, JavaScript/jQuery and SQL Server"
>>> tokenized = word_tokenize(content)
>>> stop_list = stopwords.words('english')
>>> cleaned = [word for word in tokenized if word not in stop_list]
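From here, the punctuation tokens can be dropped with the same list-comprehension pattern. The following is a minimal sketch that filters against Python's string.punctuation; the recipe's own script may define its own list of marks, and word_tokenize can also emit multi-character punctuation tokens (such as ``), which a single-character set will not catch:

>>> from string import punctuation
>>> # keep only tokens that are not single punctuation characters
>>> cleaned = [word for word in cleaned if word not in punctuation]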