Reading and cleaning the description in the job listing
The description of the job listing is still in HTML. To extract the valuable content from it, we need to parse this HTML and then clean the resulting text: tokenization, stop word removal, common word removal, and tech 2-gram processing. Let's look at how to do this.
Getting ready
I have collapsed the code for determining tech-based 2-grams into the 07/tech2grams.py file. We will use the tech_2grams function within the file.
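To make the idea of tech 2-gram processing concrete, here is a minimal sketch of how two adjacent tokens might be joined into a single term. It is not the contents of 07/tech2grams.py: the function name merge_tech_bigrams and the TECH_PHRASES sample list are hypothetical and exist only for illustration.

```python
# Hypothetical illustration only; this is NOT the code in 07/tech2grams.py.
# TECH_PHRASES is an assumed sample list of two-word technology terms.
TECH_PHRASES = {("machine", "learning"), ("big", "data"), ("data", "science")}


def merge_tech_bigrams(tokens, phrases=TECH_PHRASES):
    """Join adjacent tokens that form a known two-word tech phrase."""
    merged = []
    i = 0
    while i < len(tokens):
        # Compare the current token and its successor, case-insensitively.
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            merged.append("_".join(pair))
            i += 2  # both tokens were consumed
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = "We use machine learning on big data".split()
print(merge_tech_bigrams(tokens))
```

Joining the pair before stop word and common word removal keeps a phrase such as "big data" from being split into two less meaningful words.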
How to do it...
The code for this example is in the 07/13_clean_jd.py file. It continues from where the 07/12_scrape_job_stackoverflow.py file ends:
- We start by creating a BeautifulSoup object from the description key of the job listing we loaded. We will also print this to see what it looks like:
desc_bs = BeautifulSoup(job_listing_contents["description"], "lxml")
print(desc_bs)

<p><span>Location options: <strong>Paid relocation</strong...
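As a small, self-contained illustration of what parsing the description gives us, the following sketch pulls the plain text out of an HTML fragment. The sample_html string is an assumed stand-in for the real description, and the built-in "html.parser" is used instead of "lxml" only so that the sketch runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Assumed sample fragment; the real value comes from job_listing_contents["description"].
sample_html = "<p><span>Location options: <strong>Paid relocation</strong></span></p>"

# "html.parser" ships with Python; the recipe itself uses "lxml".
soup = BeautifulSoup(sample_html, "html.parser")

# get_text() concatenates every text node; strip=True trims each piece and
# separator=" " keeps a space between pieces from neighbouring tags.
text = soup.get_text(separator=" ", strip=True)
print(text)
```

Once the markup is gone, the remaining string is what the tokenization and word-removal steps would operate on.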