In this article by C. Y. Kan, author of the book Cassandra Data Modeling and Analysis, we will see the techniques of how to evolve an existing Cassandra data model in detail. Meanwhile, the techniques of modeling by query will be demonstrated as well.

(For more resources related to this topic, see here.)

The Stock Screener Application is good enough to retrieve and analyze a single stock at one time. However, scanning just a single stock looks very limited in practical use. A slight improvement can be made here; it can handle a bunch of stocks instead of one. This bunch of stocks will be stored as Watch List in the Cassandra database.

Accordingly, the Stock Screener Application will be modified to analyze the stocks in the Watch List, and therefore it will produce alerts for each of the stocks being watched based on the same screening rule.

For the produced alerts, saving them in Cassandra will be beneficial for backtesting trading strategies and continuous improvement of the Stock Screener Application. They can be reviewed from time to time without having to review them on the fly.

Backtesting is a jargon used to refer to testing a trading strategy, investment strategy, or a predictive model using existing historical data. It is also a special type of cross-validation applied to time series data.

In addition, when the number of the stocks in the Watch List grows to a few hundred, it will be difficult for a user of the Stock Screener Application to recall what the stocks are by simply referring to their stock codes. Hence, it would be nice to have the name of the stocks added to the produced alerts to make them more descriptive and user-friendly.

Finally, we might have an interest in finding out how many alerts were generated on a particular stock over a specified period of time and how many alerts were generated on a particular date. We will use CQL to write queries to answer these two questions. By doing so, the modeling by query technique can be demonstrated.

The enhancement approach

The enhancement approach consists of four change requests in total. First, we will conduct changes in the data model and then the code will be enhanced to provide the new features. Afterwards, we will test run the enhanced Stock Screener Application again. The parts of the Stock Screener Application that require modifications are highlighted in the following figure.

It is remarkable that two new components are added to the Stock Screener Application. The first component, Watch List, governs Data Mapper and Archiver to collect stock quote data of those stocks in the Watch List from Yahoo! Finance. The second component is Query. It provides two Queries on Alert List for backtesting purposes:

evolving-data-model-img-0

Watch List

Watch List is a very simple table that merely stores the stock code of its constituents. It is rather intuitive for a relational database developer to define the stock code as the primary key, isn't it? Nevertheless, remember that in Cassandra, the primary key is used to determine the node that stores the row. As Watch List is expected to not be a very long list, it would be more appropriate to put all of its rows on the same node for faster retrieval. But how can we do that?

We can create an additional column, say watch_list_code, for this particular purpose. The new table is called watchlist and will be created in the packtcdma keyspace. The CQL statement is shown in chapter06_001.py:

# -*- coding: utf-8 -*-
# program: chapter06_001.py
## import Cassandra driver library
from cassandra.cluster import Cluster
## function to create watchlist
def create_watchlist(ss):
   ## create watchlist table if not exists
   ss.execute('CREATE TABLE IF NOT EXISTS watchlist (' + 
               'watch_list_code varchar,' + 
               'symbol varchar,' + 
               'PRIMARY KEY (watch_list_code, symbol))')
  
   ## insert AAPL, AMZN, and GS into watchlist
   ss.execute("INSERT INTO watchlist (watch_list_code, " + 
               "symbol) VALUES ('WS01', 'AAPL')")
   ss.execute("INSERT INTO watchlist (watch_list_code, " + 
               "symbol) VALUES ('WS01', 'AMZN')")
   ss.execute("INSERT INTO watchlist (watch_list_code, " + 
               "symbol) VALUES ('WS01', 'GS')")
## create Cassandra instance
cluster = Cluster()
## establish Cassandra connection, using local default
session = cluster.connect()
## use packtcdma keyspace
session.set_keyspace('packtcdma')
## create watchlist table
create_watchlist(session)
## close Cassandra connection
cluster.shutdown()

The create_watchlist function creates the table. Note that the watchlist table has a compound primary key made of watch_list_code and symbol. A Watch List called WS01 is also created, which contains three stocks, AAPL, AMZN, and GS.

Alert List

It is produced by a Python program and enumerates the date when the close price was above its 10-day SMA, that is, the signal and the close price at that time. Note that there were no stock code and stock name.

We will create a table called alertlist to store the alerts with the code and name of the stock. The inclusion of the stock name is to meet the requirement of making the Stock Screener Application more user-friendly. Also, remember that joins are not allowed and denormalization is really the best practice in Cassandra. This means that we do not mind repeatedly storing (duplicating) the stock name in the tables that will be queried. A rule of thumb is one table for one query; as simple as that.

The alertlist table is created by the CQL statement, as shown in chapter06_002.py:

# -*- coding: utf-8 -*-
# program: chapter06_002.py
## import Cassandra driver library
from cassandra.cluster import Cluster
## function to create alertlist
def create_alertlist(ss):
   ## execute CQL statement to create alertlist table if not exists
   ss.execute('CREATE TABLE IF NOT EXISTS alertlist (' + 
               'symbol varchar,' + 
               'price_time timestamp,' + 
               'stock_name varchar,' + 
               'signal_price float,' + 
               'PRIMARY KEY (symbol, price_time))')
## create Cassandra instance
cluster = Cluster()
## establish Cassandra connection, using local default
session = cluster.connect()
## use packtcdma keyspace
session.set_keyspace('packtcdma')
## create alertlist table
create_alertlist(session)
## close Cassandra connection
cluster.shutdown()

The primary key is also a compound primary key that consists of symbol and price_time.

Adding the descriptive stock name

Until now, the packtcdma keyspace has three tables, which are alertlist, quote, and watchlist. To add the descriptive stock name, one can think of only adding a column of stock name to alertlist only. As seen in the previous section, this has been done. So, do we need to add a column for quote and watchlist?

It is, in fact, a design decision that depends on whether these two tables will be serving user queries. What a user query means is that the table will be used to retrieve rows for a query raised by a user. If a user wants to know the close price of Apple Inc. on June 30, 2014, it is a user query. On the other hand, if the Stock Screener Application uses a query to retrieve rows for its internal processing, it is not a user query. Therefore, if we want quote and watchlist to return rows for user queries, they need the stock name column; otherwise, they do not need it.

The watchlist table is only for internal use by the current design, and so it need not have the stock name column. Of course, if in future, the Stock Screener Application allows a user to maintain Watch List, the stock name should also be added to the watchlist table.

However, for quote, it is a bit tricky. As the stock name should be retrieved from the Data Feed Provider, which is Yahoo! Finance in our case, the most suitable time to get it is when the corresponding stock quote data is retrieved. Hence, a new column called stock_name is added to quote, as shown in chapter06_003.py:

# -*- coding: utf-8 -*-
# program: chapter06_003.py
## import Cassandra driver library
from cassandra.cluster import Cluster
## function to add stock_name column
def add_stockname_to_quote(ss):
   ## add stock_name to quote
   ss.execute('ALTER TABLE quote ' + 
               'ADD stock_name varchar')
## create Cassandra instance
cluster = Cluster()
## establish Cassandra connection, using local default
session = cluster.connect()
## use packtcdma keyspace
session.set_keyspace('packtcdma')
## add stock_name column
add_stockname_to_quote(session)
## close Cassandra connection
cluster.shutdown()

It is quite self-explanatory. Here, we use the ALTER TABLE statement to add the stock_name column of the varchar data type to quote.

Queries on alerts

As mentioned previously, we are interested in two questions:

How many alerts were generated on a stock over a specified period of time?
How many alerts were generated on a particular date?

For the first question, alertlist is sufficient to provide an answer. However, alertlist cannot answer the second question because its primary key is composed of symbol and price_time. We need to create another table specifically for that question. This is an example of modeling by query.

Basically, the structure of the new table for the second question should resemble the structure of alertlist. We give that table a name, alert_by_date, and create it as shown in chapter06_004.py:

# -*- coding: utf-8 -*-
# program: chapter06_004.py
## import Cassandra driver library
from cassandra.cluster import Cluster
## function to create alert_by_date table
def create_alertbydate(ss):
   ## create alert_by_date table if not exists
   ss.execute('CREATE TABLE IF NOT EXISTS alert_by_date (' + 
              'symbol varchar,' + 
               'price_time timestamp,' + 
               'stock_name varchar,' + 
               'signal_price float,' + 
               'PRIMARY KEY (price_time, symbol))')
## create Cassandra instance
cluster = Cluster()
## establish Cassandra connection, using local default
session = cluster.connect()
## use packtcdma keyspace
session.set_keyspace('packtcdma')
## create alert_by_date table
create_alertbydate(session)
## close Cassandra connection
cluster.shutdown()

When compared to alertlist in chapter06_002.py, alert_by_date only swaps the order of the columns in the compound primary key. One might think that a secondary index can be created on alertlist to achieve the same effect. Nonetheless, in Cassandra, a secondary index cannot be created on columns that are already engaged in the primary key. Always be aware of this constraint.

We now finish the modifications on the data model. It is time for us to enhance the application logic in the next section.

Summary

This article extends the Stock Screener Application by a number of enhancements. We made changes to the data model to demonstrate the modeling by query techniques and how denormalization can help us achieve a high-performance application.