Preparing, training, and testing data
As always, we will start by setting up our data. In this case, the data consists of the messages received by our fantasy company, The Cake Factory. These are in the client_messages.RDS file that we created in Chapter 4, Simulating Sales Data and Working with Databases. The data contains 300 observations for 8 variables: SALE_ID, DATE, STARS, SUMMARY, MESSAGE, LAT, LNG, and MULT_PURCHASES. During this chapter, we will work with the MESSAGE and MULT_PURCHASES variables.
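As a rough sketch of the loading step, the snippet below writes a tiny stand-in data frame to a temporary file and reads it back with readRDS(). The stand-in rows, the column types, and the temporary path are assumptions made only so the example runs on its own; with the real data you would call readRDS() on wherever your client_messages.RDS file lives.

```r
# Hypothetical stand-in with the eight columns named in the text.
# The values and column types are made up for illustration only.
stand_in <- data.frame(
  SALE_ID = c("A001", "A002"),
  DATE = as.Date(c("2017-01-01", "2017-01-02")),
  STARS = c(5L, 2L),
  SUMMARY = c("Great", "Disappointing"),
  MESSAGE = c("The cake was delicious", "It arrived late"),
  LAT = c(0, 0),
  LNG = c(0, 0),
  MULT_PURCHASES = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Save to a temporary .RDS file and read it back, mirroring how the
# real file would be loaded.
path <- tempfile(fileext = ".RDS")
saveRDS(stand_in, path)
client_messages <- readRDS(path)

# Inspect the dimensions and keep only the two variables used in this chapter.
dim(client_messages)
relevant <- client_messages[, c("MESSAGE", "MULT_PURCHASES")]
str(relevant)
```

With the real file, dim() should report the 300 observations and 8 variables described above.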
We will set the seed to get reproducible results. Keep in mind that set.seed() should be called before every function call that involves randomization. We show it just once here to save space and avoid repeating ourselves, but remember to do so whenever you need reproducible results:
set.seed(12345)
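To illustrate why the seed matters, here is a minimal sketch that uses sample() as a stand-in for any randomized call. Resetting the seed to the same value before each call reproduces the same draw; the variable names are illustrative only.

```r
# Reset the seed before each randomized call to get identical draws.
set.seed(12345)
first_draw <- sample(1:100, 5)

set.seed(12345)
second_draw <- sample(1:100, 5)

# Both draws started from the same generator state, so they match.
identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the generator continues from its current
# state, so this draw is not tied to the two above.
third_draw <- sample(1:100, 5)
```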
Next, we need to make sure that we don't have any missing data in the relevant variables. To do so, we use the complete.cases() function together with the negation (!) and the sum() function...
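A minimal sketch of this check on a small, made-up data frame follows. The toy rows and the assumption that MULT_PURCHASES is a logical column are ours, not taken from the real data. complete.cases() returns TRUE for rows with no missing values, so negating it and summing counts the rows that have at least one NA.

```r
# Hypothetical toy data with two deliberately missing values.
toy <- data.frame(
  MESSAGE = c("Great cake", NA, "Too sweet", "Loved it"),
  MULT_PURCHASES = c(TRUE, FALSE, NA, TRUE),
  stringsAsFactors = FALSE
)

relevant <- toy[, c("MESSAGE", "MULT_PURCHASES")]

# Count rows with at least one missing value in the relevant variables.
n_incomplete <- sum(!complete.cases(relevant))
n_incomplete  # 2

# If the count is not zero, one option is to keep only the complete rows.
clean <- toy[complete.cases(relevant), ]
nrow(clean)  # 2
```

A count of zero means there is nothing to remove and the data can be used as is.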