Working with large data sources

Packt
08 Jul 2015
20 min read
In this article by Duncan M. McGreggor, author of the book Mastering matplotlib, we look at the use of NumPy in the world of matplotlib and big data, the problems with large data sources, and possible solutions to these problems. (For more resources related to this topic, see here.) Most of the data that users feed into matplotlib when generating plots is from NumPy. NumPy is one of the fastest ways of processing numerical and array-based data in Python (if not the fastest), so this makes sense. However, by default NumPy works with in-memory data. If the dataset that you want to plot is larger than the total RAM available on your system, performance is going to plummet. In the following section, we're going to take a look at an example that illustrates this limitation. But first, let's get our notebook set up, as follows: In [1]: import matplotlib        matplotlib.use('nbagg')        %matplotlib inline Here are the modules that we are going to use: In [2]: import glob, io, math, os         import psutil        import numpy as np        import pandas as pd        import tables as tb        from scipy import interpolate        from scipy.stats import burr, norm        import matplotlib as mpl        import matplotlib.pyplot as plt        from IPython.display import Image We'll use the custom style sheet that we created earlier, as follows: In [3]: plt.style.use("../styles/superheroine-2.mplstyle") An example problem To keep things manageable for an in-memory example, we're going to limit our generated dataset to 100 million points by using one of SciPy's many statistical distributions, as follows: In [4]: (c, d) = (10.8, 4.2)        (mean, var, skew, kurt) = burr.stats(c, d, moments='mvsk') The Burr distribution, also known as the Singh–Maddala distribution, is commonly used to model household income. Next, we'll use the burr object's method to generate a random population with our desired count, as follows: In [5]: r = burr.rvs(c, d, size=100000000) Creating 100 million data points in the last call took about 10 seconds on a moderately recent workstation, with the RAM usage peaking at about 2.25 GB (before the garbage collection kicked in). Let's make sure that it's the size we expect, as follows: In [6]: len(r) Out[6]: 100000000 If we save this to a file, it weighs in at about three-fourths of a gigabyte: In [7]: r.tofile("../data/points.bin") In [8]: ls -alh ../data/points.bin        -rw-r--r-- 1 oubiwann staff 763M Mar 20 11:35 points.bin This actually does fit in memory on a machine with 8 GB of RAM, but generating much larger files tends to be problematic. We can reuse it multiple times, though, to reach a size that is larger than what can fit in the system RAM. Before we do this, let's take a look at what we've got by generating a smooth curve for the probability distribution, as follows: In [9]: x = np.linspace(burr.ppf(0.0001, c, d),                          burr.ppf(0.9999, c, d), 100)          y = burr.pdf(x, c, d) In [10]: (figure, axes) = plt.subplots(figsize=(20, 10))          axes.plot(x, y, linewidth=5, alpha=0.7)          axes.hist(r, bins=100, normed=True)          plt.show() The following plot is the result of the preceding code: Our plot of the Burr probability distribution function, along with the 100-bin histogram with a sample size of 100 million points, took about 7 seconds to render. This is due to the fact that NumPy handles most of the work, and we only displayed a limited number of visual elements.
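As an aside, before committing to generating a very large array, it can help to estimate its memory footprint and compare it against the RAM that is actually free. The following is a minimal sketch of that check, not part of the original notebook; the safety factor is an arbitrary choice, and psutil (imported above) is assumed to be available.

```python
import numpy as np
import psutil

def fits_in_ram(n_points, dtype=np.float64, safety_factor=0.5):
    """Return True if an array of n_points of the given dtype would use
    less than safety_factor of the RAM that is currently available."""
    needed = n_points * np.dtype(dtype).itemsize   # bytes required
    available = psutil.virtual_memory().available  # bytes free right now
    return needed < safety_factor * available

# 100 million float64 values need about 800 MB (8 bytes each).
print(fits_in_ram(100000000))     # probably True on an 8 GB machine
print(fits_in_ram(1000000000))    # about 8 GB; almost certainly False
```

A check like this makes the warning above concrete: once the required size approaches the available RAM, the operating system starts swapping, and both computation and plotting slow down dramatically.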
What would happen if we did try to plot all the 100 million points? This can be checked by the following code: In [11]: (figure, axes) = plt.subplots()          axes.plot(r)          plt.show() formatters.py:239: FormatterWarning: Exception in image/png formatter: Allocated too many blocks After about 30 seconds of crunching, the preceding error was thrown—the Agg backend (a shared library) simply couldn't handle the number of artists required to render all the points. But for now, this case clarifies the point that we stated a while back—our first plot rendered relatively quickly because we were selective about the data we chose to present, given the large number of points with which we are working. However, let's say we have data from the files that are too large to fit into the memory. What do we do about this? Possible ways to address this include the following: Moving the data out of the memory and into the filesystem Moving the data off the filesystem and into the databases We will explore examples of these in the following section. Big data on the filesystem The first of the two proposed solutions for large datasets involves not burdening the system memory with data, but rather leaving it on the filesystem. There are several ways to accomplish this, but the following two methods in particular are the most common in the world of NumPy and matplotlib: NumPy's memmap function: This function creates memory-mapped files that are useful if you wish to access small segments of large files on the disk without having to read the whole file into the memory. PyTables: This is a package that is used to manage hierarchical datasets. It is built on the top of the HDF5 and NumPy libraries and is designed to efficiently and easily cope with extremely large amounts of data. We will examine each in turn. NumPy's memmap function Let's restart the IPython kernel by going to the IPython menu at the top of notebook page, selecting Kernel, and then clicking on Restart. When the dialog box pops up, click on Restart. Then, re-execute the first few lines of the notebook by importing the required libraries and getting our style sheet set up. Once the kernel is restarted, take a look at the RAM utilization on your system for a fresh Python process for the notebook: In [4]: Image("memory-before.png") Out[4]: The following screenshot shows the RAM utilization for a fresh Python process: Now, let's load the array data that we previously saved to disk and recheck the memory utilization, as follows: In [5]: data = np.fromfile("../data/points.bin")        data_shape = data.shape        data_len = len(data)        data_len Out[5]: 100000000 In [6]: Image("memory-after.png") Out[6]: The following screenshot shows the memory utilization after loading the array data: This took about five seconds to load, with the memory consumption equivalent to the file size of the data. This means that if we wanted to build some sample data that was too large to fit in the memory, we'd need about 11 of those files concatenated, as follows: In [7]: 8 * 1024 Out[7]: 8192 In [8]: filesize = 763        8192 / filesize Out[8]: 10.73656618610747 However, this is only if the entire memory was available. Let's see how much memory is available right now, as follows: In [9]: del data In [10]: psutil.virtual_memory().available / 1024**2 Out[10]: 2449.1796875 That's 2.5 GB. So, to overrun our RAM, we'll just need a fraction of the total. 
This is done in the following way: In [11]: 2449 / filesize Out[11]: 3.2096985583224114 The preceding output means that we only need four of our original files to create a file that won't fit in memory. However, in the following section, we will still use 11 files to ensure that the data, if loaded into memory, would be much larger than the memory. How do we create this large file for demonstration purposes (knowing that in a real-life situation, the data would already be created and potentially quite large)? We can try to use numpy.tile to create a file of the desired size (larger than memory), but this can make our system unusable for a significant period of time. Instead, let's use numpy.memmap, which will treat a file on the disk as an array, thus letting us work with data that is too large to fit into the memory. Let's load the data file again, but this time as a memory-mapped array, as follows: In [12]: data = np.memmap(            "../data/points.bin", mode="r", shape=data_shape) The loading of the array to a memmap object was very quick (compared to the process of bringing the contents of the file into the memory), taking less than a second to complete. Now, let's create a new file to write the data to. This file must be larger than our total system memory if held in memory (on disk it will be smaller): In [13]: big_data_shape = (data_len * 11,)          big_data = np.memmap(              "../data/many-points.bin", dtype="uint8",              mode="w+", shape=big_data_shape) The preceding code creates a 1 GB file, which is mapped to an array that has the shape we requested and just contains zeros: In [14]: ls -alh ../data/many-points.bin          -rw-r--r-- 1 oubiwann staff 1.0G Apr 2 11:35 many-points.bin In [15]: big_data.shape Out[15]: (1100000000,) In [16]: big_data Out[16]: memmap([0, 0, 0, ..., 0, 0, 0], dtype=uint8) Now, let's fill the empty data structure with copies of the data we saved to the 763 MB file, as follows: In [17]: for x in range(11):              start = x * data_len              end = (x * data_len) + data_len              big_data[start:end] = data          big_data Out[17]: memmap([ 90, 71, 15, ..., 33, 244, 63], dtype=uint8) If you check your system memory before and after, you will only see minimal changes, which confirms that we are not creating an 8 GB data structure in memory. Furthermore, this check only takes a few seconds. Now, we can do some sanity checks on the resulting data and ensure that we have what we were trying to get, as follows: In [18]: big_data_len = len(big_data)          big_data_len Out[18]: 1100000000 In [19]: data[100000000 - 1] Out[19]: 63 In [20]: big_data[100000000 - 1] Out[20]: 63 Attempting to get the next index from our original dataset will throw an error (as shown in the following code), since it didn't have that index: In [21]: data[100000000] ----------------------------------------------------------- IndexError               Traceback (most recent call last) ... IndexError: index 100000000 is out of bounds ... But our new data does have an index, as shown in the following code: In [22]: big_data[100000000] Out[22]: 90 And then some: In [23]: big_data[1100000000 - 1] Out[23]: 63 We can also plot data from a memmapped array without a significant lag time; before doing that, the short sketch below shows what the memory-mapped file actually costs in resident memory.
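This is a minimal sketch, not part of the original notebook. It assumes psutil for reading the resident set size of the current process and, because np.memmap stores no metadata in the file, the dtype and shape must be supplied again when the file is reopened.

```python
import numpy as np
import psutil

process = psutil.Process()
rss_before = process.memory_info().rss  # resident memory in bytes

# Reopen the on-disk array; dtype and shape are not stored in the file,
# so they must match what was used when the file was written.
big_data = np.memmap("../data/many-points.bin", dtype="uint8",
                     mode="r", shape=(1100000000,))

# Touch a few widely spaced elements; only the pages we access get loaded.
print(big_data[::100000000][:5])

rss_after = process.memory_info().rss
print("RSS growth: {:.1f} MB".format((rss_after - rss_before) / 1024**2))
```

The reported growth should be modest, since only the pages that are actually touched are mapped into physical memory.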
That said, note that in the following code, we will create a histogram from 1.1 billion points of data, so the plotting won't be instantaneous: In [24]: (figure, axes) = plt.subplots(figsize=(20, 10))          axes.hist(big_data, bins=100)          plt.show() The following plot is the result of the preceding code: The plotting took about 40 seconds to generate. The odd shape of the histogram is due to the fact that, with our data file-hacking, we have radically changed the nature of our data since we've increased the sample size linearly without regard for the distribution. The purpose of this demonstration wasn't to preserve a sample distribution, but rather to show how one can work with large datasets. What we have seen is not too shabby. Thanks to NumPy, matplotlib can work with data that is too large for memory, even if it is a bit slow iterating over hundreds of millions of data points from the disk. Can matplotlib do better? HDF5 and PyTables A commonly used file format in the scientific computing community is Hierarchical Data Format (HDF). HDF is a set of file formats (namely HDF4 and HDF5) that were originally developed at the National Center for Supercomputing Applications (NCSA), a unit of the University of Illinois at Urbana-Champaign, to store and organize large amounts of numerical data. The NCSA is a great source of technical innovation in the computing industry—a Telnet client, the first graphical web browser, a web server that evolved into the Apache HTTP server, and HDF, which is of particular interest to us, were all developed here. It is a little-known fact that NCSA's web browser code was the ancestor of both the Netscape web browser as well as a prototype of Internet Explorer that was provided to Microsoft by a third party. HDF is supported by Python, R, Julia, Java, Octave, IDL, and MATLAB, to name a few. HDF5 offers significant improvements and useful simplifications over HDF4. It uses B-trees to index table objects and, as such, works well for write-once/read-many time series data. Common use cases span fields such as meteorological studies, biosciences, finance, and aviation. HDF5 files of multi-terabyte size are common in these applications. Such a file is typically constructed from the analyses of multiple HDF5 source files, thus providing a single (and often extensive) source of grouped data for a particular application. The PyTables library is built on top of the Python HDF5 library and NumPy. As such, it not only provides access to one of the most widely used large data file formats in the scientific computing community, but also links data extracted from these files with the data types and objects provided by the fast Python numerical processing library. PyTables is also used in other projects. Pandas wraps PyTables, thus extending its convenient in-memory data structures, functions, and objects to large on-disk files. To use HDF data with Pandas, you'll want to create a pandas.HDFStore, read from HDF data sources with pandas.read_hdf, or write to one with DataFrame.to_hdf. Files that are too large to fit in memory may be read and written by utilizing chunking techniques. Pandas does support disk-based DataFrame operations, but these are not very efficient, due to the need to assemble columns of data when reading back into memory.
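To make the chunking point concrete, here is a minimal sketch of streaming a large HDF5 store with pandas instead of loading it whole. The file name and key are hypothetical, and the store is assumed to have been written by pandas in table format (for example, with DataFrame.to_hdf(..., format="table", append=True)), since only table-format stores can be read in chunks.

```python
import pandas as pd

# Hypothetical store written earlier by pandas in table format.
store = pd.HDFStore("../data/weather_store.h5", mode="r")

total, count = 0.0, 0
# select() with a chunksize streams the table one piece at a time.
for chunk in store.select("weather", chunksize=1000000):
    total += chunk["temp"].sum()
    count += len(chunk)

store.close()
print("mean temp:", total / count)
```

Each chunk is an ordinary DataFrame, so the usual pandas operations apply, while only one chunk at a time is resident in memory.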
One project to keep an eye on under the PyData umbrella of projects is Blaze. It's an open wrapper and a utility framework that can be used when you wish to work with large datasets and generalize actions such as creation, access, updates, and migration. Blaze supports not only HDF, but also SQL, CSV, and JSON. The API usage between Pandas and Blaze is very similar, and it offers a nice tool for developers who need to support multiple backends. In the following example, we will use PyTables directly to create an HDF5 file that is too large to fit in the memory (for an 8 GB RAM machine). We will follow these steps: Create a series of CSV source data files that take up approximately 14 GB of disk space Create an empty HDF5 file Create a table in the HDF5 file and provide the schema metadata and compression options Load the CSV source data into the HDF5 table Query the new data source once the data has been migrated Remember the temperature and precipitation data for St. Francis, Kansas, USA, from a previous notebook? We are going to generate random data with similar columns for the purposes of the HDF5 example. This data will be generated from a normal distribution, which will be used in the guise of the temperature and precipitation data for hundreds of thousands of fictitious towns across the globe for the last century, as follows: In [25]: head = "country,town,year,month,precip,temp\n"          row = "{},{},{},{},{},{}\n"          filename = "../data/{}.csv"          town_count = 1000          (start_year, end_year) = (1894, 2014)          (start_month, end_month) = (1, 13)          sample_size = (1 + 2                        * town_count * (end_year - start_year)                        * (end_month - start_month))          countries = range(200)          towns = range(town_count)          years = range(start_year, end_year)          months = range(start_month, end_month)          for country in countries:             with open(filename.format(country), "w") as csvfile:                  csvfile.write(head)                  csvdata = ""                  weather_data = norm.rvs(size=sample_size)                  weather_index = 0                  for town in towns:                    for year in years:                          for month in months:                              csvdata += row.format(                                  country, town, year, month,                                  weather_data[weather_index],                                  weather_data[weather_index + 1])                              weather_index += 2                  csvfile.write(csvdata) Note that we generated a sample data population that was twice as large as the expected size in order to pull both the simulated temperature and precipitation data at the same time (from the same set). This will take about 30 minutes to run. When complete, you will see the following files: In [26]: ls -rtm ../data/*.csv          ../data/0.csv, ../data/1.csv, ../data/2.csv,          ../data/3.csv, ../data/4.csv, ../data/5.csv,          ...          ../data/194.csv, ../data/195.csv, ../data/196.csv,          ../data/197.csv, ../data/198.csv, ../data/199.csv A quick look at just one of the files reveals the size of each, as follows: In [27]: ls -lh ../data/0.csv          -rw-r--r-- 1 oubiwann staff 72M Mar 21 19:02 ../data/0.csv With each file being 72 MB in size, we have data that takes up 14 GB of disk space, which exceeds the size of the RAM of the system in question. Furthermore, running queries against so much data in the .csv files isn't going to be very efficient.
It's going to take a long time. So what are our options? Well, to read this data, HDF5 is a very good fit. In fact, it is designed for jobs like this. We will use PyTables to convert the .csv files to a single HDF5. We'll start by creating an empty table file, as follows: In [28]: tb_name = "../data/weather.h5t"          h5 = tb.open_file(tb_name, "w")          h5 Out[28]: File(filename=../data/weather.h5t, title='', mode='w',              root_uep='/', filters=Filters(                  complevel=0, shuffle=False, fletcher32=False,                  least_significant_digit=None))          / (RootGroup) '' Next, we'll provide some assistance to PyTables by indicating the data types of each column in our table, as follows: In [29]: data_types = np.dtype(              [("country", "<i8"),              ("town", "<i8"),              ("year", "<i8"),              ("month", "<i8"),               ("precip", "<f8"),              ("temp", "<f8")]) Also, let's define a compression filter that can be used by PyTables when saving our data, as follows: In [30]: filters = tb.Filters(complevel=5, complib='blosc') Now, we can create a table inside our new HDF5 file, as follows: In [31]: tab = h5.create_table(              "/", "weather",              description=data_types,              filters=filters) With that done, let's load each CSV file, read it in chunks so that we don't overload the memory, and then append it to our new HDF5 table, as follows: In [32]: for filename in glob.glob("../data/*.csv"):          it = pd.read_csv(filename, iterator=True, chunksize=10000)          for chunk in it:              tab.append(chunk.to_records(index=False))            tab.flush() Depending on your machine, the entire process of loading the CSV file, reading it in chunks, and appending to a new HDF5 table can take anywhere from 5 to 10 minutes. However, what started out as a collection of the .csv files that weigh in at 14 GB is now a single compressed 4.8 GB HDF5 file, as shown in the following code: In [33]: h5.get_filesize() Out[33]: 4758762819 Here's the metadata for the PyTables-wrapped HDF5 table after the data insertion: In [34]: tab Out[34]: /weather (Table(288000000,), shuffle, blosc(5)) '' description := { "country": Int64Col(shape=(), dflt=0, pos=0), "town": Int64Col(shape=(), dflt=0, pos=1), "year": Int64Col(shape=(), dflt=0, pos=2), "month": Int64Col(shape=(), dflt=0, pos=3), "precip": Float64Col(shape=(), dflt=0.0, pos=4), "temp": Float64Col(shape=(), dflt=0.0, pos=5)} byteorder := 'little' chunkshape := (1365,) Now that we've created our file, let's use it. 
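Before slicing into the table, it is worth noting that PyTables can also evaluate a query condition inside the HDF5 file itself, so only the matching rows are brought into memory. This is a minimal sketch, assuming the table is still open as tab; the condition values are arbitrary.

```python
# In-kernel query: the condition string is evaluated by PyTables (via
# numexpr) against the on-disk table, returning a structured array of
# just the matching rows.
rows = tab.read_where("(country == 5) & (year == 1950) & (month == 1)")
print(len(rows))
print(rows["temp"][:5])
```

If queries like this are run repeatedly, creating an index on the queried columns (for example, tab.cols.year.create_index()) can speed them up further.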
Let's excerpt a few lines with an array slice, as follows: In [35]: tab[100000:100010] Out[35]: array([(0, 69, 1947, 5, -0.2328834718674, 0.06810312195695),          (0, 69, 1947, 6, 0.4724989007889, 1.9529216219569),          (0, 69, 1947, 7, -1.0757216683235, 1.0415374480545),          (0, 69, 1947, 8, -1.3700249968748, 3.0971874991576),          (0, 69, 1947, 9, 0.27279758311253, 0.8263207523831),          (0, 69, 1947, 10, -0.0475253104621, 1.4530808932953),          (0, 69, 1947, 11, -0.7555493935762, -1.2665440609117),          (0, 69, 1947, 12, 1.540049376928, 1.2338186532516),          (0, 69, 1948, 1, 0.829743501445, -0.1562732708511),          (0, 69, 1948, 2, 0.06924900463163, 1.187193711598)],          dtype=[('country', '<i8'), ('town', '<i8'),                ('year', '<i8'), ('month', '<i8'),                ('precip', '<f8'), ('temp', '<f8')]) In [36]: tab[100000:100010]["precip"] Out[36]: array([-0.23288347, 0.4724989 , -1.07572167,                -1.370025 , 0.27279758, -0.04752531,                -0.75554939, 1.54004938, 0.8297435 ,                0.069249 ]) When we're done with the file, we do the same thing that we would do with any other file-like object: In [37]: h5.close() If we want to work with it again, simply load it, as follows: In [38]: h5 = tb.open_file(tb_name, "r")          tab = h5.root.weather Let's try plotting the data from our HDF5 file: In [39]: (figure, axes) = plt.subplots(figsize=(20, 10))          axes.hist(tab[:1000000]["temp"], bins=100)          plt.show() Here's a plot for the first million data points: This histogram was rendered quickly, with a much better response time than what we've seen before. Hence, the process of accessing the HDF5 data is very fast. The next question might be "What about executing calculations against this data?" Unfortunately, running the following will consume an enormous amount of RAM: tab[:]["temp"].mean() We've just asked for all of the data—all of its 288 million rows. We are going to end up loading everything into RAM, grinding the average workstation to a halt. Ideally though, when you iterate through the source data and create the HDF5 file, you also crunch the numbers that you will need, adding supplemental columns or groups to the HDF5 file that can be used later by you and your peers. If we have data that we will mostly be selecting (extracting portions) and which has already been crunched and grouped as needed, HDF5 is a very good fit. This is why one of the most common use cases that you see for HDF5 is the sharing and distribution of processed data. However, if we have data that we need to process repeatedly, then we will either need to use another method besides the one that will cause all the data to be loaded into memory, or find a better match for our data processing needs. We saw in the previous section that the selection of data is very fast in HDF5. What about calculating the mean for a small section of data? If we've got a total of 288 million rows, let's select a divisor of the number that gives us several hundred thousand rows at a time—281,250 rows, to be precise. Let's get the mean for the first slice, as follows: In [40]: tab[0:281250]["temp"].mean() Out[40]: 0.0030696632864265312 This took about 1 second to calculate. What about iterating through the records in a similar fashion? Let's break up the 288 million records into chunks of the same size; this will result in 1024 chunks.
We'll start by getting the ranges needed for an increment of 281,250 and then, we'll examine the first and the last row as a sanity check, as follows: In [41]: limit = 281250          ranges = [(x * limit, x * limit + limit)              for x in range(2 ** 10)]          (ranges[0], ranges[-1]) Out[41]: ((0, 281250), (287718750, 288000000)) Now, we can use these ranges to generate the mean for each chunk of 281,250 rows of temperature data and print the total number of means that we generated to make sure that we're getting our counts right, as follows: In [42]: means = [tab[x * limit:x * limit + limit]["temp"].mean()              for x in range(2 ** 10)]          len(means) Out[42]: 1024 Depending on your machine, that should take between 30 and 60 seconds. With this work done, it's now easy to calculate the mean for all of the 288 million points of temperature data: In [43]: sum(means) / len(means) Out[43]: -5.3051780413782918e-05 Through HDF5's efficient file format and implementation, combined with the splitting of our operations into tasks that would not copy the HDF5 data into memory, we were able to perform calculations across a significant fraction of a billion records in less than a minute. HDF5 even supports parallelization. So, this can be improved upon with a little more time and effort. However, there are many cases where HDF5 is not a practical choice. You may have some free-form data, and preprocessing it will be too expensive. Alternatively, the datasets may be actually too large to fit on a single machine. This is when you may consider using matplotlib with distributed data. Summary In this article, we covered the role of NumPy in the world of big data and matplotlib as well as the process and problems in working with large data sources. Also, we discussed the possible solutions to these problems using NumPy's memmap function and HDF5 and PyTables. Resources for Article: Further resources on this subject: First Steps [article] Introducing Interactive Plotting [article] The plot function [article]
Transactions in Redis

Packt
07 Jul 2015
9 min read
In this article by Vinoo Das, author of the book Learning Redis, we will see how Redis, as a NoSQL data store, provides a loose sense of transactions. In a traditional RDBMS, the transaction starts with a BEGIN and ends with either COMMIT or ROLLBACK. All these RDBMS servers are multithreaded, so when a thread locks a resource, it cannot be manipulated by another thread unless and until the lock is released. Redis by default has MULTI to start and EXEC to execute the commands. In the case of a transaction, the first command is always MULTI, and after that all the commands are stored, and when the EXEC command is received, all the stored commands are executed in sequence. So, under the hood, once Redis receives the EXEC command, all the commands are executed as a single isolated operation. Following are the commands that can be used in Redis for transactions: MULTI: This marks the start of a transaction block EXEC: This executes all the commands in the pipeline after MULTI WATCH: This watches the keys for conditional execution of a transaction UNWATCH: This removes the WATCH keys of a transaction DISCARD: This flushes all the previously queued commands in the pipeline (For more resources related to this topic, see here.) The following figure represents how a transaction in Redis works: Transaction in Redis Pipeline versus transaction As we have seen, in a pipeline the commands are grouped and executed, and the responses are queued in a block and sent. But in a transaction, all the commands received after MULTI are queued until the EXEC command is received, and only then executed. To understand this, it is important to take a case where we have a multithreaded environment and see the outcome. In the first case, we take two threads firing pipelined commands at Redis. In this sample, the first thread fires a pipelined command, which is going to change the value of a key multiple times, and the second thread will try to read the value of that key.
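Before we move on to the Java test classes used in this article, here is a minimal Python sketch of the same MULTI/EXEC flow using the redis-py client. It is only an illustration of the commands listed above and assumes a Redis server on localhost with the redis package installed.

```python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

# Transaction: the commands are queued client-side and sent wrapped in
# MULTI ... EXEC, so they run on the server as one isolated block.
tx = r.pipeline(transaction=True)
for nv in range(1000):
    tx.sadd("keys-1", "name%d" % nv)
tx.execute()  # issues MULTI, the queued SADDs, then EXEC

# Plain pipeline: the same batching of round trips, but without
# MULTI/EXEC, so other clients' commands can be interleaved with ours.
pipe = r.pipeline(transaction=False)
for nv in range(1000):
    pipe.sadd("keys-2", "name%d" % nv)
pipe.execute()

print(r.scard("keys-1"), r.scard("keys-2"))
```

The WATCH and UNWATCH commands listed above map to pipe.watch(key) and pipe.unwatch() on the same pipeline object, which is how redis-py exposes optimistic check-and-set.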
Following is the class which is going to fire the two threads at Redis: MultiThreadedPipelineCommandTest.java: package org.learningRedis.chapter.four.pipelineandtx; public class MultiThreadedPipelineCommandTest { public static void main(String[] args) throws InterruptedException {    Thread pipelineClient = new Thread(new PipelineCommand());    Thread singleCommandClient = new Thread(new SingleCommand());    pipelineClient.start();    Thread.currentThread().sleep(50);    singleCommandClient.start(); } } The code for the client which is going to fire the pipeline commands is as follows: package org.learningRedis.chapter.four.pipelineandtx; import java.util.Set; import Redis.clients.jedis.Jedis; import Redis.clients.jedis.Pipeline; public class PipelineCommand implements Runnable{ Jedis jedis = ConnectionManager.get(); @Override public void run() {      long start = System.currentTimeMillis();      Pipeline commandpipe = jedis.pipelined();      for(int nv=0;nv<300000;nv++){        commandpipe.sadd("keys-1", "name"+nv);      }      commandpipe.sync();      Set<String> data= jedis.smembers("keys-1");      System.out.println("The return value of nv1 after pipeline [ " + data.size() + " ]");    System.out.println("The time taken for executing client(Thread-1) "+ (System.currentTimeMillis()-start));    ConnectionManager.set(jedis); } } The code for the client which is going to read the value of the key when pipeline is executed is as follows: package org.learningRedis.chapter.four.pipelineandtx; import java.util.Set; import Redis.clients.jedis.Jedis; public class SingleCommand implements Runnable { Jedis jedis = ConnectionManager.get(); @Override public void run() {    Set<String> data= jedis.smembers("keys-1");    System.out.println("The return value of nv1 is [ " + data.size() + " ]");    ConnectionManager.set(jedis); } } The result will vary as per machine configuration but by changing the thread sleep time and running the program couple of times, the result will be similar to the one shown as follows: The return value of nv1 is [ 3508 ] The return value of nv1 after pipeline [ 300000 ] The time taken for executing client(Thread-1) 3718 Please fire FLUSHDB command every time you run the test, otherwise you end up seeing the value of the previous test run, that is 300,000 Now we will run the sample in a transaction mode, where the command pipeline will be preceded by MULTI keyword and succeeded by EXEC command. This client is similar to the previous sample where two clients in separate threads will fire commands to a single key on Redis. 
The following program is a test client that gives two threads one with commands in transaction mode and the second thread will try to read and modify the same resource: package org.learningRedis.chapter.four.pipelineandtx; public class MultiThreadedTransactionCommandTest { public static void main(String[] args) throws InterruptedException {    Thread transactionClient = new Thread(new TransactionCommand());    Thread singleCommandClient = new Thread(new SingleCommand());    transactionClient.start();    Thread.currentThread().sleep(30);    singleCommandClient.start(); } } This program will try to modify the resource and read the resource while the transaction is going on: package org.learningRedis.chapter.four.pipelineandtx; import java.util.Set; import Redis.clients.jedis.Jedis; public class SingleCommand implements Runnable { Jedis jedis = ConnectionManager.get(); @Override public void run() {    Set<String> data= jedis.smembers("keys-1");    System.out.println("The return value of nv1 is [ " + data.size() + " ]");    ConnectionManager.set(jedis); } } This program will start with MULTI command, try to modify the resource, end it with EXEC command, and later read the value of the resource: package org.learningRedis.chapter.four.pipelineandtx; import java.util.Set; import Redis.clients.jedis.Jedis; import Redis.clients.jedis.Transaction; import chapter.four.pubsub.ConnectionManager; public class TransactionCommand implements Runnable { Jedis jedis = ConnectionManager.get(); @Override public void run() {      long start = System.currentTimeMillis();      Transaction transactionableCommands = jedis.multi();      for(int nv=0;nv<300000;nv++){        transactionableCommands.sadd("keys-1", "name"+nv);      }      transactionableCommands.exec();      Set<String> data= jedis.smembers("keys-1");      System.out.println("The return value nv1 after tx [ " + data.size() + " ]");    System.out.println("The time taken for executing client(Thread-1) "+ (System.currentTimeMillis()-start));    ConnectionManager.set(jedis); } } The result of the preceding program will vary as per machine configuration but by changing the thread sleep time and running the program couple of times, the result will be similar to the one shown as follows: The return code is [ 1 ] The return value of nv1 is [ null ] The return value nv1 after tx [ 300000 ] The time taken for executing client(Thread-1) 7078 Fire the FLUSHDB command every time you run the test. The idea is that the program should not pick up a value obtained because of a previous run of the program. The proof that the single command program is able to write to the key is if we see the following line: The return code is [1]. Let's analyze the result. In case of pipeline, a single command reads the value and the pipeline command sets a new value to that key as evident in the following result: The return value of nv1 is [ 3508 ] Now compare this with what happened in case of transaction when a single command tried to read the value but it was blocked because of the transaction. Hence the value will be NULL or 300,000. The return value of nv1 after tx [0] or The return value of nv1 after tx [300000] So the difference in output can be attributed to the fact that in a transaction, if we have started a MULTI command, and are still in the process of queueing commands (that is, we haven't given the server the EXEC request yet), then any other client can still come in and make a request, and the response would be sent to the other client. 
Once the client gives the EXEC command, all other clients are blocked while all of the queued transaction commands are executed. Pipeline and transaction To have a better understanding, let's analyze what happened in the case of the pipeline. When two different connections made requests to Redis for the same resource, we saw a result where client-2 picked up the value while client-1 was still executing: Pipeline in Redis in a multi connection environment What this tells us is that the requests from the first connection, which is the pipelined command, are stacked as one command in its execution stack, and the command from the other connection is kept in its own stack specific to that connection. The Redis execution thread time-slices between these two execution stacks, and that is why client-2 was able to print a value while client-1 was still executing. Let's analyze what happened in the case of the transaction. Again, the two commands (the transaction commands and the GET command) were kept in their own execution stacks, but when the Redis execution thread gave time to the GET command and it went to read the value, seeing the lock it was not allowed to read the value and was blocked. The Redis execution thread again went back to executing the transaction commands, and again it came back to the GET command, where it was again blocked. This process kept happening until the transaction command released the lock on the resource, and then the GET command was able to get the value. If, by any chance, the GET command was able to reach the resource before the transaction lock, it got a null value. Please bear in mind that Redis does not block other clients while queuing transaction commands, but only while executing them. Transaction in Redis multi connection environment This exercise gave us an insight into what happens in the case of a pipeline and a transaction. Summary In this article, we saw in brief how to use Redis not simply as a datastore, but also how to pipeline commands, which is much more like bulk processing. Apart from that, we covered areas such as transactions, messaging, and scripting. We also saw how to combine messaging and scripting, and create reliable messaging in Redis. This capability of Redis makes it different from some of the other datastore solutions. Resources for Article: Further resources on this subject: Implementing persistence in Redis (Intermediate) [article] Using Socket.IO and Express together [article] Exploring streams [article]
CoreOS – Overview and Installation

Packt
06 Jul 2015
8 min read
In this article by Rimantas Mocevicius, author of the book CoreOS Essentials, we describe CoreOS, which is often billed as Linux for massive server deployments but can also run easily as a single host on bare-metal servers, on cloud servers, and as a virtual machine on your computer. It is designed to run application containers such as docker and rkt, and you will learn about its main features later in this article. This article is a practical, example-driven guide to help you learn about the essentials of the CoreOS Linux operating system. We assume that you have experience with VirtualBox, Vagrant, Git, Bash shell scripting and the command line (terminal on UNIX-like computers), and that you have already installed VirtualBox, Vagrant, and git on your Mac OS X or Linux computer. As for a cloud installation, we will use Google Cloud's Compute Engine instances. By the end of this article, you will hopefully be familiar with setting up CoreOS on your laptop or desktop, and on the cloud. You will learn how to set up a local computer development machine and a cluster on a local computer and in the cloud. Also, we will cover etcd, systemd, fleet, cluster management, deployment setup, and production clusters. In this article, you will learn how CoreOS works and how to carry out a basic CoreOS installation on your laptop or desktop with the help of VirtualBox and Vagrant. We will basically cover two topics in this article: An overview of CoreOS Installing the CoreOS virtual machine (For more resources related to this topic, see here.) An overview of CoreOS CoreOS is a minimal Linux operating system built to run docker and rkt containers (application containers). By default, it is designed to build powerful and easily manageable server clusters. It provides automatic, very reliable, and stable updates to all machines, which takes away a big maintenance headache from sysadmins. And, by running everything in application containers, such a setup allows you to very easily scale servers and applications, replace faulty servers in a fraction of a second, and so on. How CoreOS works CoreOS has no package manager, so everything needs to be installed and used via docker containers. Moreover, it is 40 percent more efficient in RAM usage than an average Linux installation, as shown in this diagram: CoreOS utilizes an active/passive dual-partition scheme to update itself as a single unit, instead of using a package-by-package method. Its root partition is read-only and changes only when an update is applied. If the update is unsuccessful during reboot time, then it rolls back to the previous boot partition. The following image shows an OS update being applied to partition B (the passive partition); after a reboot, it becomes the active partition to boot from. The docker and rkt containers run as applications on CoreOS. Containers can provide very good flexibility for application packaging and can start very quickly—in a matter of milliseconds. The following image shows the simplicity of CoreOS: the bottom layer is the Linux OS, the second level is etcd/fleet with the docker daemon, and the top level is the containers running on the server. By default, CoreOS is designed to work in a clustered form, but it also works very well as a single host. It is very easy to control and run application containers across cluster machines with fleet, and to use the etcd service discovery to connect them, as shown in the following image. CoreOS can be deployed easily on all major cloud providers, for example, Google Cloud, Amazon Web Services, DigitalOcean, and so on.
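As a brief aside, the etcd service discovery just mentioned is exposed as a simple HTTP key-value API (port 2379 for etcd2). The following is a minimal Python sketch of registering and looking up a service endpoint; the key names and values are hypothetical, and it assumes the requests package and a reachable etcd2 member using the v2 API.

```python
import requests

ETCD = "http://127.0.0.1:2379"  # etcd2 client URL on a CoreOS host

# Register a (hypothetical) service endpoint with a 60-second TTL,
# so the entry disappears if it is not refreshed.
requests.put(ETCD + "/v2/keys/services/web/host-1",
             data={"value": "10.0.0.11:8080", "ttl": 60})

# Any other machine in the cluster can now discover it.
resp = requests.get(ETCD + "/v2/keys/services/web",
                    params={"recursive": "true"})
for node in resp.json()["node"].get("nodes", []):
    print(node["key"], "->", node["value"])
```

fleet builds on exactly this kind of shared etcd state to schedule units across the cluster. Returning to the overview of CoreOS itself: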
It runs very well on bare-metal servers as well. Moreover, it can be easily installed on a laptop or desktop with Linux, Mac OS X, or Windows via Vagrant, with VirtualBox or VMware virtual machine support. This short overview should throw some light on what CoreOS is about and what it can do. Let's now move on to the real stuff and install CoreOS on to our laptop or desktop machine. Installing the CoreOS virtual machine To use the CoreOS virtual machine, you need to have VirtualBox, Vagrant, and git installed on your computer. In the following examples, we will install CoreOS on our local computer, which will serve as a virtual machine on VirtualBox. Okay, let's get started! Cloning the coreos-vagrant GitHub project Let‘s clone this project and get it running. In your terminal (from now on, we will use just the terminal phrase and use $ to label the terminal prompt), type the following command: $ git clone https://github.com/coreos/coreos-vagrant/ This will clone from the GitHub repository to the coreos-vagrant folder on your computer. Working with cloud-config To start even a single host, we need to provide some config parameters in the cloud-config format via the user data file. In your terminal, type this: $ cd coreos-vagrant$ mv user-data.sample user-data The user data should have content like this (the coreos-vagrant Github repository is constantly changing, so you might see a bit of different content when you clone the repository): #cloud-config coreos: etcd2:    #generate a new token for each unique cluster from “     https://discovery.etcd.io/new    #discovery: https://discovery.etcd.io/<token>    # multi-region and multi-cloud deployments need to use “     $public_ipv4    advertise-client-urls: http://$public_ipv4:2379    initial-advertise-peer-urls: http://$private_ipv4:2380    # listen on both the official ports and the legacy ports    # legacy ports can be omitted if your application doesn‘t “     depend on them    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001    listen-peer-urls: “     http://$private_ipv4:2380,http://$private_ipv4:7001 fleet:    public-ip: $public_ipv4 flannel:    interface: $public_ipv4 units:    - name: etcd2.service      command: start    - name: fleet.service      command: start    - name: docker-tcp.socket      command: start      enable: true      content: |        [Unit]        Description=Docker Socket for the API          [Socket]        ListenStream=2375        Service=docker.service        BindIPv6Only=both        [Install]        WantedBy=sockets.target Replace the text between the etcd2: and fleet: lines to look this: etcd2:    name: core-01    initial-advertise-peer-urls: http://$private_ipv4:2380    listen-peer-urls: “     http://$private_ipv4:2380,http://$private_ipv4:7001    initial-cluster-token: core-01_etcd    initial-cluster: core-01=http://$private_ipv4:2380    initial-cluster-state: new    advertise-client-urls: “     http://$public_ipv4:2379,http://$public_ipv4:4001    listen-client-urls: http://0.0.0.0:2379,http://0.0.0.0:4001 fleet: You can also download the latest user-data file from https://github.com/rimusz/coreos-essentials-book/blob/master/Chapter1/user-data. This should be enough to bootstrap a single-host CoreOS VM with etcd, fleet, and docker running there. Startup and SSH It's now time to boot our CoreOS VM and log in to its console using ssh. Let's boot our first CoreOS VM host. 
To do so, using the terminal, type the following command: $ vagrant up This will trigger vagrant to download the latest CoreOS alpha channel image (this is the default channel set in the config.rb file, and it can easily be changed to beta or stable) and launch the VM instance. You should see something like this as the output in your terminal: The CoreOS VM has booted up, so let's open the ssh connection to our new VM using the following command: $ vagrant ssh It should show something like this: CoreOS alpha (some version) core@core-01 ~ $ Perfect! Let's verify that etcd, fleet, and docker are running there. Here are the commands required and the corresponding screenshots of the output: $ systemctl status etcd2 To check the status of fleet, type this: $ systemctl status fleet To check the status of docker, type the following command: $ docker version Lovely! Everything looks fine. Thus, we've got our first CoreOS VM up and running in VirtualBox. Summary In this article, we saw what CoreOS is and how it is installed. We covered a simple CoreOS installation on a local computer with the help of Vagrant and VirtualBox, and checked whether etcd, fleet, and docker are running there. Resources for Article: Further resources on this subject: Core Data iOS: Designing a Data Model and Building Data Objects [article] Clustering [article] Deploying a Play application on CoreOS and Docker [article]
Introduction to ggplot2 and the plotting environments in R

Packt
25 Jun 2015
15 min read
In this article by Donato Teutonico, author of the book ggplot2 Essentials, we are going to explore different plotting environments in R and subsequently learn about the package ggplot2. R provides a complete series of options for producing graphics, which makes it quite an advanced environment for data visualization. The core of graphics visualization in R is within the package grDevices, which provides the basic structure of data plotting, such as the colors and fonts used in the plots. This graphics engine was then used as the starting point in the development of more advanced and sophisticated packages for data visualization, the most commonly used being graphics and grid. (For more resources related to this topic, see here.) The graphics package is often referred to as the base or traditional graphics environment since, historically, it was already available among the default packages delivered with the base installation of R, and it provides functions that allow the generation of complete plots. The grid package, developed by Paul Murrell, on the other hand, provides an alternative set of graphics tools. This package does not directly provide functions that generate complete plots, so it is not frequently used directly for generating graphics, but it was used in the development of advanced data visualization packages. Among the grid-based packages, the most widely used are lattice and ggplot2, although they are built implementing different visualization approaches: lattice was built implementing Trellis plots, while ggplot2 was built implementing the grammar of graphics. A diagram representing the connections between the tools just mentioned is shown in Figure 1. Figure 1: Overview of the most widely used R packages for graphics Just keep in mind that this is not a complete overview of the packages available, but simply a small snapshot of the main packages used for data visualization in R, since many other packages are built on top of the tools just mentioned. If you would like to get a more complete overview of the graphics tools available in R, you may have a look at the web page of the R project summarizing such tools, http://cran.r-project.org/web/views/Graphics.html. ggplot2 and the Grammar of Graphics The package ggplot2 was developed by Hadley Wickham by implementing a completely different approach to statistical plots. As in the case of lattice, this package is also based on grid, providing a series of high-level functions that allow the creation of complete plots. The ggplot2 package provides an interpretation and extension of the principles of the book The Grammar of Graphics by Leland Wilkinson. Briefly, the Grammar of Graphics assumes that a statistical graphic is a mapping of data to aesthetic attributes and geometric objects used to represent the data, like points, lines, bars, and so on. Together with the aesthetic attributes, the plot can also contain statistical transformations or groupings of the data. As in lattice, in ggplot2 we also have the possibility of splitting data by a certain variable, obtaining a representation of each subset of data in an independent sub-plot; such a representation in ggplot2 is called faceting. In a more formal way, the main components of the grammar of graphics are: the data and their mapping, the aesthetics, the geometric objects, the statistical transformations, scales, coordinates, and faceting.
A more detailed description of these elements is provided along the book ggplot2 Essentials, but this is a summary of the general principles The data that must be visualized are mapped to aesthetic attributes which define how the data should be perceived The geometric objects describe what is actually represented on the plot like lines, points, or bars; the geometric objects basically define which kind of plot you are going to draw The statistical transformations are transformations which are applied to the data to group them; an example of statistical transformations would be, for instance, the smooth line or the regression lines of the previous examples or the binning of the histograms. Scales represent the connection between the aesthetic spaces with the actual values which should be represented. Scales maybe also be used to draw legends The coordinates represent the coordinate system in which the data are drawn The faceting, which we have already mentioned, is a grouping of data in subsets defined by a value of one variable In ggplot2 there are two main high-level functions, capable of creating directly creating a plot, qplot() and ggplot(); qplot() stands for quick plot and it is a simple function with serve a similar purpose to the plot() function in graphics. The function ggplot() on the other side is a much more advanced function which allow the user to have a deep control of the plot layout and details. In this article we will see some examples of qplot() in order to provide you with a taste of the typical plots which can be realized with ggplot2, but for more advanced data visualization the function ggplot(), is much more flexible. If you have a look on the different forums of R programming, there is quite some discussion about which of these two functions would be more convenient to use. My general recommendation would be that it depends on the type of graph you are drawing more frequently. For simple and standard plot, where basically only the data should be represented and some minor modification of standard layout, the qplot() function will do the job. On the other side, if you would need to apply particular transformations to the data or simply if you would like to keep the freedom of controlling and defining the different details of the plot layout, I would recommend to focus in learning the code of ggplot(). In the code below you will see an example of plot realized with ggplot2 where you can identify some of the components of the grammar of graphics. The example is realized with the function ggplot() which allow a more direct comparison with the grammar, but just below you may also find the corresponding code for the use of qplot(). Both codes generate the graph depicted on Figure 2. 
require(ggplot2) ## Load ggplot2 data(Orange) # Load the data   ggplot(data=Orange,    ## Data used aes(x=circumference,y=age, color=Tree))+  ##mapping to aesthetic geom_point()+      ##Add geometry (plot with data points) stat_smooth(method="lm",se=FALSE) ##Add statistics(linear regression)   ### Corresponding code with qplot() qplot(circumference,age,data=Orange, ## Data used color=Tree, ## Aesthetic mapping geom=c("point","smooth"),method="lm",se=FALSE) This simple example can give you an idea of the role of each portion of code in a ggplot2 graph; you have seen how the main function body creates the connection between the data and the aesthetics we are interested in representing and how, on top of this, you add the components of the plot, in this case the geometry element of points and the statistical element of regression. You can also notice how the components that need to be added to the main function call are included using the + sign. One more thing worth mentioning at this point is that if you run just the main function body of the ggplot() call, you will get an error message. This is because this call is not able to generate an actual plot. The step during which the plot is actually created is when you include the geometric attributes, in this case geom_point(). This is perfectly in line with the grammar of graphics, since, as we have seen, the geometry represents the actual connection between the data and what is represented on the plot. It is in fact at this stage that we specify that we are interested in having points represent the data; before that, nothing was specified about which plot we were interested in drawing. Figure 2: Example of plot of Orange dataset with ggplot2 The qplot() function The qplot (quick plot) function is a basic high-level function of ggplot2. The general syntax that you should use with this function is the following qplot(x, y, data, colour, shape, size, facets, geom, stat) where x and y represent the variables to plot (y is optional, with a default value of NULL) data defines the dataset containing the variables colour, shape and size are the aesthetic arguments that can be mapped on additional variables facets defines the optional faceting of the plot based on one variable contained in the dataset geom allows you to select the actual visualization of the data, which basically determines the plot that will be generated. Possible values are point, line or boxplot, but we will see several different examples in the next pages stat defines the statistics to be used on the data These are the most important options available in qplot(). You may find a description of the other function arguments in the help page of the function, accessible with ?qplot, or on the ggplot2 website under the following link http://docs.ggplot2.org/0.9.3/qplot.html. Most of the options just discussed can be applied to different types of plots, since most of the concepts of the grammar of graphics, embedded in the code, may be translated from one plot to the other. For instance, you may use the argument colour to do an aesthetic mapping to one variable; these same concepts can, for example, be applied to scatterplots as well as histograms. Exactly the same principle would be applied to facets, which can be used for splitting plots independently of the type of plot considered. Histograms and density plots Histograms are plots used to explore how one (or more) quantitative variables are distributed. To show some examples of histograms we will use the iris data.
This dataset contains measurements in centimetres of the variables sepal length and width and petal length and width for 50 flowers from each of three species of the flower iris: iris setosa, versicolor, and virginica. You may find more details running ?iris. The geometric attribute used to produce histograms is simply by specifying geom=”histogram” in the qplot() function. This default histogram will represent the variable specified on the x axis while the y axis will represent the number of elements in each bin. One other very useful way of representing distributions is to look at the kernel density function, which will basically produce a sort of continuous histogram instead of different bins by estimating the probability density function. For example let’s plot the petal length of all the three species of iris as histogram and density plot. data(iris)   ## Load data qplot(Petal.Length, data=iris, geom="histogram") ## Histogram qplot(Petal.Length, data=iris, geom="density")   ## Density plot The output of this code is showed in Figure 3. Figure 3: Histogram (left) and density plot (right) As you can see in both plots of Figure 3, it appears that the data are not distributed homogenously, but there are at least two distinct distribution clearly separated. This is very reasonably due to a different distribution for one of the iris species. To try to verify if the two distributions are indeed related to specie differences, we could generate the same plot using aesthetic attributes and have a different colour for each subtype of iris. To do this, we can simply map the fill to the Species column in the dataset; also in this case we can do that for the histogram and the density plot too. Below you may see the code we built, and in Figure 4 the resulting output. qplot(Petal.Length, data=iris, geom="histogram", colour=Species, fill=Species) qplot(Petal.Length, data=iris, geom="density", colour=Species, fill=Species) Figure 4: Histogram (left) and density plot (right) with aesthetic attribute for colour and fill In the distribution we can see that the lower data are coming from the Setosa species, while the two other distributions are partly overlapping. Scatterplots Scatterplots are probably the most common plot, since they are usually used to display the relationship between two quantitative variables. When two variables are provided, ggplot2 will make a scatterplot by default. For our example on how to build a scatterplot, we will use a dataset called ToothGrowth, which is available in the base R installation. In this dataset are reported measurements of teeth length of 10 guinea pig for three different doses of vitamin C (0.5, 1, and 2 mg) delivered in two different ways, as orange juice or as ascorbic acid (a compound having vitamin C activity). You can find, as usual, details on these data on the dataset help page at ?ToothGrowth. We are interested in seeing how the length of the teeth changed for the different doses. We are not able to distinguish among the different guinea pigs, since this information is not contained in the data, so for the moment we will plot just all the data we have. So let’s load the dataset and do a basic plot of the dose vs. length. require(ggplot2) data(ToothGrowth) qplot(dose, len, data=ToothGrowth, geom="point") ##Alternative coding qplot(dose, len, data=ToothGrowth) The resulting plot is reproduced in Figure 5. As you have seen, the default plot generated, also without a geom argument, is the scatter plot, which is the default bivariate plot type. 
This plot gives us an idea of the tendency in the data; for instance, we see that tooth length increases as vitamin C intake increases. On the other hand, we know that there are two different subgroups in our data, since the vitamin C was provided in two different ways, as orange juice or as ascorbic acid, so it could be interesting to check whether these two groups behave differently.

Figure 5: Scatterplot of length vs. dose of ToothGrowth data

A first approach could be to show the two groups in different colours. To do that, we simply assign the colour attribute to the supp column in the data, which defines the route of vitamin intake. The resulting plot is shown in Figure 6.

qplot(dose, len, data=ToothGrowth, geom="point", col=supp)

We can now tell which intake route each data point comes from, and the orange juice values appear slightly higher than the ascorbic acid ones, but the two groups are still not easy to tell apart. We could then try facets, so that the data are completely separated into two different sub-plots. Let's see what happens.

Figure 6: Scatterplot of length vs. dose of ToothGrowth with data in different colours depending on vitamin intake

qplot(dose, len, data=ToothGrowth, geom="point", facets=.~supp)

In this new plot, shown in Figure 7, we definitely have a better picture of the data, since we can see how tooth growth differs between the two intakes. As this simple example shows, the best visualization depends on the data you have: in some cases grouping a variable with colours or dividing the data with faceting may give you a different view of the data and their tendency. For instance, in the plot in Figure 7 we can see that the way tooth growth increases with dose seems to differ between the two intake routes.

Figure 7: Scatterplot of length vs. dose of ToothGrowth with faceting

One way to see the general tendency of the data is to add a smooth line to the graph. In this case, the growth for orange juice does not look linear, so a smooth line is a nice way to capture this. To do that, we simply add a smooth curve to the vector of geometry components in the qplot() function.

qplot(dose, len, data=ToothGrowth, geom=c("point","smooth"), facets=.~supp)

As you can see from the resulting plot (Figure 8), we not only see the two groups clearly thanks to the faceting, but we can also see the tendency of the data with respect to the dose administered. Note that requesting the smooth line in ggplot2 also adds a confidence interval to the plot. If you prefer not to have the confidence interval, simply add the argument se=FALSE.

Figure 8: Scatterplot of length vs. dose of ToothGrowth with faceting and smooth line

Summary
In this short article we have seen some basic concepts of ggplot2, ranging from its basic principles in comparison with the other R graphics packages up to some basic plots such as histograms, density plots, and scatterplots. We have limited our examples to the use of qplot(), which lets you obtain plots with a few easy commands; on the other hand, if you want full control over plot appearance as well as data representation, the ggplot() function provides much more advanced functionality.
You can find a more detailed description of these functions, as well as of the other features of ggplot2, illustrated with various examples in the book ggplot2 Essentials.

Resources for Article:
Further resources on this subject:
Data Analysis Using R [article]
Data visualization [article]
Using R for Statistics, Research, and Graphics [article]
Querying and Filtering Data

Packt
25 Jun 2015
28 min read
In this article by Edwood Ng and Vineeth Mohan, authors of the book Lucene 4 Cookbook, we will cover the following recipes: Performing advanced filtering Creating a custom filter Searching with QueryParser TermQuery and TermRangeQuery BooleanQuery PrefixQuery and WildcardQuery PhraseQuery and MultiPhraseQuery FuzzyQuery (For more resources related to this topic, see here.) When it comes to search application, usability is always a key element that either makes or breaks user impression. Lucene does an excellent job of giving you the essential tools to build and search an index. In this article, we will look into some more advanced techniques to query and filter data. We will arm you with more knowledge to put into your toolbox so that you can leverage your Lucene knowledge to build a user-friendly search application. Performing advanced filtering Before we start, let us try to revisit these questions: what is a filter and what is it for? In simple terms, a filter is used to narrow the search space or, in another words, search within a search. Filter and Query may seem to provide the same functionality, but there is a significant difference between the two. Scores are calculated in querying to rank results, based on their relevancy to the search terms, while a filter has no effect on scores. It's not uncommon that users may prefer to navigate through a hierarchy of filters in order to land on the relevant results. You may often find yourselves in a situation where it is necessary to refine a result set so that users can continue to search or navigate within a subset. With the ability to apply filters, we can easily provide such search refinements. Another situation is data security where some parts of the data in the index are protected. You may need to include an additional filter behind the scene that's based on user access level so that users are restricted to only seeing items that they are permitted to access. In both of these contexts, Lucene's filtering features will provide the capability to achieve the objectives. Lucene has a few built-in filters that are designed to fit most of the real-world applications. If you do find yourself in a position where none of the built-in filters are suitable for the job, you can rest assured that Lucene's expansibility will allow you to build your own custom filters. Let us take a look at Lucene's built-in filters: TermRangeFilter: This is a filter that restricts results to a range of terms that are defined by lower bound and upper bound of a submitted range. This filter is best used on a single-valued field because on a tokenized field, any tokens within a range will return by this filter. This is for textual data only. NumericRangeFilter: Similar to TermRangeFilter, this filter restricts results to a range of numeric values. FieldCacheRangeFilter: This filter runs on top of the number of range filters, including TermRangeFilter and NumericRangeFilter. It caches filtered results using FieldCache for improved performance. FieldCache is stored in the memory, so performance boost can be upward of 100x faster than the normal range filter. Because it uses FieldCache, it's best to use this on a single-valued field only. This filter will not be applicable for multivalued field and when the available memory is limited, since it maintains FieldCache (in memory) on filtered results. QueryWrapperFilter: This filter acts as a wrapper around a Query object. 
This filter is useful when you have complex business rules that are already defined in a Query and would like to reuse for other business purposes. It constructs a Query to act like a filter so that it can be applied to other Queries. Because this is a filter, scoring results from the Query within is irrelevant. PrefixFilter: This filter restricts results that match what's defined in the prefix. This is similar to a substring match, but limited to matching results with a leading substring only. FieldCacheTermsFilter: This is a term filter that uses FieldCache to store the calculated results in memory. This filter works on a single-valued field only. One use of it is when you have a category field where results are usually shown by categories in different pages. The filter can be used as a demarcation by categories. FieldValueFilter: This filter returns a document containing one or more values on the specified field. This is useful as a preliminary filter to ensure that certain fields exist before querying. CachingWrapperFilter: This is a wrapper that adds a caching layer to a filter to boost performance. Note that this filter provides a general caching layer; it should be applied on a filter that produces a reasonably small result set, such as an exact match. Otherwise, larger results may unnecessarily drain the system's resources and can actually introduce performance issues. If none of the above filters fulfill your business requirements, you can build your own, extending the Filter class and implementing its abstract method getDocIdSet (AtomicReaderContext, Bits). How to do it... Let's set up our test case with the following code: Analyzer analyzer = new StandardAnalyzer(); Directory directory = new RAMDirectory(); IndexWriterConfig config = new   IndexWriterConfig(Version.LATEST, analyzer); IndexWriter indexWriter = new IndexWriter(directory, config); Document doc = new Document(); StringField stringField = new StringField("name", "",   Field.Store.YES); TextField textField = new TextField("content", "",   Field.Store.YES); IntField intField = new IntField("num", 0, Field.Store.YES); doc.removeField("name"); doc.removeField("content"); doc.removeField("num"); stringField.setStringValue("First"); textField.setStringValue("Humpty Dumpty sat on a wall,"); intField.setIntValue(100); doc.add(stringField); doc.add(textField); doc.add(intField); indexWriter.addDocument(doc); doc.removeField("name"); doc.removeField("content"); doc.removeField("num"); stringField.setStringValue("Second"); textField.setStringValue("Humpty Dumpty had a great fall."); intField.setIntValue(200); doc.add(stringField); doc.add(textField); doc.add(intField); indexWriter.addDocument(doc); doc.removeField("name"); doc.removeField("content"); doc.removeField("num"); stringField.setStringValue("Third"); textField.setStringValue("All the king's horses and all the king's men"); intField.setIntValue(300); doc.add(stringField); doc.add(textField); doc.add(intField); indexWriter.addDocument(doc); doc.removeField("name"); doc.removeField("content"); doc.removeField("num"); stringField.setStringValue("Fourth"); textField.setStringValue("Couldn't put Humpty together   again."); intField.setIntValue(400); doc.add(stringField); doc.add(textField); doc.add(intField); indexWriter.addDocument(doc); indexWriter.commit(); indexWriter.close(); IndexReader indexReader = DirectoryReader.open(directory); IndexSearcher indexSearcher = new IndexSearcher(indexReader); How it works… The preceding code adds four documents into an index. 
The four documents are: Document 1 Name: First Content: Humpty Dumpty sat on a wall, Num: 100 Document 2 Name: Second Content: Humpty Dumpty had a great fall. Num: 200 Document 3 Name: Third Content: All the king's horses and all the king's men Num: 300 Document 4 Name: Fourth Content: Couldn't put Humpty together again. Num: 400 Here is our standard test case: IndexReader indexReader = DirectoryReader.open(directory); IndexSearcher indexSearcher = new IndexSearcher(indexReader); Query query = new TermQuery(new Term("content", "humpty")); TopDocs topDocs = indexSearcher.search(query, FILTER, 100); System.out.println("Searching 'humpty'"); for (ScoreDoc scoreDoc : topDocs.scoreDocs) {    doc = indexReader.document(scoreDoc.doc);    System.out.println("name: " + doc.getField("name").stringValue() +        " - content: " + doc.getField("content").stringValue() + " - num: " + doc.getField("num").stringValue()); } indexReader.close(); Running the code as it is will produce the following output, assuming the FILTER variable is declared: Searching 'humpty' name: First - content: Humpty Dumpty sat on a wall, - num: 100 name: Second - content: Humpty Dumpty had a great fall. - num: 200 name: Fourth - content: Couldn't put Humpty together again. - num: 400 This is a simple search on the word humpty. The search would return the first, second, and fourth sentences. Now, let's take a look at a TermRangeFilter example: TermRangeFilter termRangeFilter = TermRangeFilter.newStringRange("name", "A", "G", true, true); Applying this filter to preceding search (by setting FILTER as termRangeFilter) will produce the following output: Searching 'humpty' name: First - content: Humpty Dumpty sat on a wall, - num: 100 name: Fourth - content: Couldn't put Humpty together again. - num: 400 Note that the second sentence is missing from the results due to this filter. This filter removes documents with name outside of A through G. Both first and fourth sentences start with F that's within the range so their results are included. The second sentence's name value Second is outside the range, so the document is not considered by the query. Let's move on to NumericRangeFilter: NumericRangeFilter numericRangeFilter = NumericRangeFilter.newIntRange("num", 200, 400, true, true); This filter will produce the following results: Searching 'humpty' name: Second - content: Humpty Dumpty had a great fall. - num: 200 name: Fourth - content: Couldn't put Humpty together again. - num: 400 Note that the first sentence is missing from results. It's because its num 100 is outside the specified numeric range 200 to 400 in NumericRangeFilter. Next one is FieldCacheRangeFilter: FieldCacheRangeFilter fieldCacheTermRangeFilter = FieldCacheRangeFilter.newStringRange("name", "A", "G", true, true); The output of this filter is similar to the TermRangeFilter example: Searching 'humpty' name: First - content: Humpty Dumpty sat on a wall, - num: 100 name: Fourth - content: Couldn't put Humpty together again. - num: 400 This filter provides a caching layer on top of TermRangeFilter. Results are similar, but performance is a lot better because the calculated results are cached in memory for the next retrieval. Next is QueryWrapperFiler: QueryWrapperFilter queryWrapperFilter = new QueryWrapperFilter(new TermQuery(new Term("content", "together"))); This example will produce this result: Searching 'humpty' name: Fourth - content: Couldn't put Humpty together again. - num: 400 This filter wraps around TermQuery on term together on the content field. 
Since the fourth sentence is the only one that contains the word "together" search results is limited to this sentence only. Next one is PrefixFilter: PrefixFilter prefixFilter = new PrefixFilter(new Term("name", "F")); This filter produces the following: Searching 'humpty' name: First - content: Humpty Dumpty sat on a wall, - num: 100 name: Fourth - content: Couldn't put Humpty together again. - num: 400 This filter limits results where the name field begins with letter F. In this case, the first and fourth sentences both have the name field that begins with F (First and Fourth); hence, the results. Next is FieldCacheTermsFilter: FieldCacheTermsFilter fieldCacheTermsFilter = new FieldCacheTermsFilter("name", "First"); This filter produces the following: Searching 'humpty' name: First - content: Humpty Dumpty sat on a wall, - num: 100 This filter limits results with the name containing the word first. Since the first sentence is the only one that contains first, only one sentence is returned in search results. Next is FieldValueFilter: FieldValueFilter fieldValueFilter = new FieldValueFilter("name1"); This would produce the following: Searching 'humpty' Note that there are no results because this filter limits results in which there is at least one value on the filed name1. Since the name1 field doesn't exist in our current example, no documents are returned by this filter; hence, zero results. Next is CachingWrapperFilter: TermRangeFilter termRangeFilter = TermRangeFilter.newStringRange("name", "A", "G", true, true); CachingWrapperFilter cachingWrapperFilter = new CachingWrapperFilter(termRangeFilter); This wrapper wraps around the same TermRangeFilter from above, so the result produced is similar: Searching 'humpty' name: First - content: Humpty Dumpty sat on a wall, - num: 100 name: Fourth - content: Couldn't put Humpty together again. - num: 400 Filters work in conjunction with Queries to refine the search results. As you may have already noticed, the benefit of Filter is its ability to cache results, while Query calculates in real time. When choosing between Filter and Query, you will want to ask yourself whether the search (or filtering) will be repeated. Provided you have enough memory allocation, a cached Filter will always provide a positive impact to search experiences. Creating a custom filter Now that we've seen numerous examples on Lucene's built-in Filters, we are ready for a more advanced topic, custom filters. There are a few important components we need to go over before we start: FieldCache, SortedDocValues, and DocIdSet. We will be using these items in our example to help you gain practical knowledge on the subject. In the FieldCache, as you already learned, is a cache that stores field values in memory in an array structure. It's a very simple data structure as the slots in the array basically correspond to DocIds. This is also the reason why FieldCache only works for a single-valued field. A slot in an array can only hold a single value. Since this is just an array, the lookup time is constant and very fast. The SortedDocValues has two internal data mappings for values' lookup: a dictionary mapping an ordinal value to a field value and a DocId to an ordinal value (for the field value) mapping. In the dictionary data structure, the values are deduplicated, dereferenced, and sorted. There are two methods of interest in this class: getOrd(int) and lookupTerm(BytesRef). 
The getOrd(int) returns an ordinal for a DocId (int) and lookupTerm(BytesRef) returns an ordinal for a field value. This data structure is the opposite of the inverted index structure, as this provides a DocId to value lookup (similar to FieldCache), instead of value to a DocId lookup. DocIdSet, as the name implies, is a set of DocId. A FieldCacheDocIdSet subclass we will be using is a combination of this set and FieldCache. It iterates through the set and calls matchDoc(int) to find all the matching documents to be returned. In our example, we will be building a simple user security Filter to determine which documents are eligible to be viewed by a user based on the user ID and group ID. The group ID is assumed to be hereditary, where as a smaller ID inherits rights from a larger ID. For example, the following will be our group ID model in our implementation: 10 – admin 20 – manager 30 – user 40 – guest A user with group ID 10 will be able to access documents where its group ID is 10 or above. How to do it... Here is our custom Filter, UserSecurityFilter: public class UserSecurityFilter extends Filter {   private String userIdField; private String groupIdField; private String userId; private String groupId;   public UserSecurityFilter(String userIdField, String groupIdField, String userId, String groupId) {    this.userIdField = userIdField;    this.groupIdField = groupIdField;    this.userId = userId;    this.groupId = groupId; }   public DocIdSet getDocIdSet(AtomicReaderContext context, Bits acceptDocs) throws IOException {    final SortedDocValues userIdDocValues = FieldCache.DEFAULT.getTermsIndex(context.reader(), userIdField);    final SortedDocValues groupIdDocValues = FieldCache.DEFAULT.getTermsIndex(context.reader(), groupIdField);      final int userIdOrd = userIdDocValues.lookupTerm(new BytesRef(userId));    final int groupIdOrd = groupIdDocValues.lookupTerm(new BytesRef(groupId));      return new FieldCacheDocIdSet(context.reader().maxDoc(), acceptDocs) {      @Override      protected final boolean matchDoc(int doc) {        final int userIdDocOrd = userIdDocValues.getOrd(doc);        final int groupIdDocOrd = groupIdDocValues.getOrd(doc);        return userIdDocOrd == userIdOrd || groupIdDocOrd >= groupIdOrd;      }    }; } } This Filter accepts four arguments in its constructor: userIdField: This is the field name for user ID groupIdField: This is the field name for group ID userId: This is the current session's user ID groupId: This is the current session's group ID of the user Then, we implement getDocIdSet(AtomicReaderContext, Bits) to perform our filtering by userId and groupId. We first acquire two SortedDocValues, one for the user ID and one for the group ID, based on the Field names we obtained from the constructor. Then, we look up the ordinal values for the current session's user ID and group ID. The return value is a new FieldCacheDocIdSet object implementing its matchDoc(int) method. This is where we compare both the user ID and group ID to determine whether a document is viewable by the user. A match is true when the user ID matches and the document's group ID is greater than or equal to the user's group ID. 
To test this Filter, we will set up our index as follows:    Analyzer analyzer = new StandardAnalyzer();    Directory directory = new RAMDirectory();    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);    IndexWriter indexWriter = new IndexWriter(directory, config);    Document doc = new Document();    StringField stringFieldFile = new StringField("file", "", Field.Store.YES);    StringField stringFieldUserId = new StringField("userId", "", Field.Store.YES);    StringField stringFieldGroupId = new StringField("groupId", "", Field.Store.YES);      doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");    stringFieldFile.setStringValue("Z:\shared\finance\2014- sales.xls");    stringFieldUserId.setStringValue("1001");    stringFieldGroupId.setStringValue("20");    doc.add(stringFieldFile); doc.add(stringFieldUserId); doc.add(stringFieldGroupId);    indexWriter.addDocument(doc);      doc.removeField("file"); doc.removeField("userId"); doc.removeField("groupId");    stringFieldFile.setStringValue("Z:\shared\company\2014- policy.doc");    stringFieldUserId.setStringValue("1101");    stringFieldGroupId.setStringValue("30");    doc.add(stringFieldFile); doc.add(stringFieldUserId);    doc.add(stringFieldGroupId);    indexWriter.addDocument(doc);    doc.removeField("file"); doc.removeField("userId");    doc.removeField("groupId");    stringFieldFile.setStringValue("Z:\shared\company\2014- terms-and-conditions.doc");    stringFieldUserId.setStringValue("1205");    stringFieldGroupId.setStringValue("40");    doc.add(stringFieldFile); doc.add(stringFieldUserId);    doc.add(stringFieldGroupId);    indexWriter.addDocument(doc);    indexWriter.commit();    indexWriter.close(); The setup adds three documents to our index with different user IDs and group ID settings in each document, as follows: UserSecurityFilter userSecurityFilter = new UserSecurityFilter("userId", "groupId", "1001", "40"); IndexReader indexReader = DirectoryReader.open(directory); IndexSearcher indexSearcher = new IndexSearcher(indexReader); Query query = new MatchAllDocsQuery(); TopDocs topDocs = indexSearcher.search(query, userSecurityFilter,   100); for (ScoreDoc scoreDoc : topDocs.scoreDocs) { doc = indexReader.document(scoreDoc.doc); System.out.println("file: " + doc.getField("file").stringValue() +" - userId: " + doc.getField("userId").stringValue() + " - groupId: " +       doc.getField("groupId").stringValue());} indexReader.close(); We initialize UserSecurityFilter with the matching names for user ID and group ID fields, and set it up with user ID 1001 and group ID 40. For our test and search, we use MatchAllDocsQuery to basically search without any queries (as it will return all the documents). Here is the output from the code: file: Z:sharedfinance2014-sales.xls - userId: 1001 - groupId: 20 file: Z:sharedcompany2014-terms-and-conditions.doc - userId: 1205 - groupId: 40 The search specifically filters by user ID 1001, so the first document is returned because its user ID is also 1001. The third document is returned because its group ID, 40, is greater than or equal to the user's group ID, which is also 40. Searching with QueryParser QueryParser is an interpreter tool that transforms a search string into a series of Query clauses. It's not absolutely necessary to use QueryParser to perform a search, but it's a great feature that empowers users by allowing the use of search modifiers. A user can specify a phrase match by putting quotes (") around a phrase. 
A user can also control whether a certain term or phrase is required by putting a plus ("+") sign in front of the term or phrase, or use a minus ("-") sign to indicate that the term or phrase must not exist in results. For Boolean searches, the user can use AND and OR to control whether all terms or phrases are required. To do a field-specific search, you can use a colon (":") to specify a field for a search (for example, content:humpty would search for the term "humpty" in the field "content"). For wildcard searches, you can use the standard wildcard character asterisk ("*") to match 0 or more characters, or a question mark ("?") for matching a single character. As you can see, the general syntax for a search query is not complicated, though the more advanced modifiers can seem daunting to new users. In this article, we will cover more advanced QueryParser features to show you what you can do to customize a search. How to do it.. Let's look at the options that we can set in QueryParser. The following is a piece of code snippet for our setup: Analyzer analyzer = new StandardAnalyzer(); Directory directory = new RAMDirectory(); IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer); IndexWriter indexWriter = new IndexWriter(directory, config); Document doc = new Document(); StringField stringField = new StringField("name", "", Field.Store.YES); TextField textField = new TextField("content", "", Field.Store.YES); IntField intField = new IntField("num", 0, Field.Store.YES);   doc.removeField("name"); doc.removeField("content"); doc.removeField("num"); stringField.setStringValue("First"); textField.setStringValue("Humpty Dumpty sat on a wall,"); intField.setIntValue(100); doc.add(stringField); doc.add(textField); doc.add(intField); indexWriter.addDocument(doc);   doc.removeField("name"); doc.removeField("content"); doc.removeField("num"); stringField.setStringValue("Second"); textField.setStringValue("Humpty Dumpty had a great fall."); intField.setIntValue(200); doc.add(stringField); doc.add(textField); doc.add(intField); indexWriter.addDocument(doc);   doc.removeField("name"); doc.removeField("content"); doc.removeField("num"); stringField.setStringValue("Third"); textField.setStringValue("All the king's horses and all the king's men"); intField.setIntValue(300); doc.add(stringField); doc.add(textField); doc.add(intField); indexWriter.addDocument(doc);   doc.removeField("name"); doc.removeField("content"); doc.removeField("num"); stringField.setStringValue("Fourth"); textField.setStringValue("Couldn't put Humpty together again."); intField.setIntValue(400); doc.add(stringField); doc.add(textField); doc.add(intField); indexWriter.addDocument(doc);   indexWriter.commit(); indexWriter.close();   IndexReader indexReader = DirectoryReader.open(directory); IndexSearcher indexSearcher = new IndexSearcher(indexReader); QueryParser queryParser = new QueryParser("content", analyzer); // configure queryParser here Query query = queryParser.parse("humpty"); TopDocs topDocs = indexSearcher.search(query, 100); We add four documents and instantiate a QueryParser object with a default field and an analyzer. We will be using the same analyzer that was used in indexing to ensure that we apply the same text treatment to maximize matching capability. Wildcard search The query syntax for a wildcard search is the asterisk ("*") or question mark ("?") character. Here is a sample query: Query query = queryParser.parse("humpty*"); This query will return the first, second, and fourth sentences. 
By default, QueryParser does not allow a leading wildcard character because it has a significant performance impact. A leading wildcard would trigger a full scan on the index since any term can be a potential match. In essence, even an inverted index would become rather useless for a leading wildcard character search. However, it's possible to override this default setting to allow a leading wildcard character by calling setAllowLeadingWildcard(true). You can go ahead and run this example with different search strings to see how this feature works. Depending on where the wildcard character(s) is placed, QueryParser will produce either a PrefixQuery or WildcardQuery. In this specific example in which there is only one wildcard character and it's not the leading character, a PrefixQuery will be produced. Term range search We can produce a TermRangeQuery by using TO in a search string. The range has the following syntax: [start TO end] – inclusive {start TO end} – exclusive As indicated, the angle brackets ( [ and ] ) is inclusive of start and end terms, and curly brackets ( { and } ) is exclusive of start and end terms. It's also possible to mix these brackets to inclusive on one side and exclusive on the other side. Here is a code snippet: Query query = queryParser.parse("[aa TO c]"); This search will return the third and fourth sentences, as their beginning words are All and Couldn't, which are within the range. You can optionally analyze the range terms with the same analyzer by setting setAnalyzeRangeTerms(true). Autogenerated phrase query QueryParser can automatically generate a PhraseQuery when there is more than one term in a search string. Here is a code snippet: queryParser.setAutoGeneratePhraseQueries(true); Query query = queryParser.parse("humpty+dumpty+sat"); This search will generate a PhraseQuery on the phrase humpty dumpty sat and will return the first sentence. Date resolution If you have a date field (by using DateTools to convert date to a string format) and would like to do a range search on date, it may be necessary to match the date resolution on a specific field. Here is a code snippet on setting the Date resolution: queryParser.setDateResolution("date", DateTools.Resolution.DAY); queryParser.setLocale(Locale.US); queryParser.setTimeZone(TimeZone.getTimeZone("Am erica/New_York")); This example sets the resolution to day granularity, locale to US, and time zone to New York. The locale and time zone settings are specific to the date format only. Default operator The default operator on a multiterm search string is OR. You can change the default to AND so all the terms are required. Here is a code snippet that will require all the terms in a search string: queryParser.setDefaultOperator(QueryParser.Operator.AND); Query query = queryParser.parse("humpty dumpty"); This example will return first and second sentences as these are the only two sentences with both humpty and dumpty. Enable position increments This setting is enabled by default. Its purpose is to maintain a position increment of the token that follows an omitted token, such as a token filtered by a StopFilter. This is useful in phrase queries when position increments may be important for scoring. Here is an example on how to enable this setting: queryParser.setEnablePositionIncrements(true); Query query = queryParser.parse(""humpty dumpty""); In our scenario, it won't change our search results. This attribute only enables position increments information to be available in the resulting PhraseQuery. 
Fuzzy query Lucene's fuzzy search implementation is based on Levenshtein distance. It compares two strings and finds out the number of single character changes that are needed to transform one string to another. The resulting number indicates the closeness of the two strings. In a fuzzy search, a threshold number of edits is used to determine if the two strings are matched. To trigger a fuzzy match in QueryParser, you can use the tilde ~ character. There are a couple configurations in QueryParser to tune this type of query. Here is a code snippet: queryParser.setFuzzyMinSim(2f); queryParser.setFuzzyPrefixLength(3); Query query = queryParser.parse("hump~"); This example will return first, second, and fourth sentences as the fuzzy match matches hump to humpty because these two words are missed by two characters. We tuned the fuzzy query to a minimum similarity to two in this example. Lowercase expanded term This configuration determines whether to automatically lowercase multiterm queries. An analyzer can do this already, so this is more like an overriding configuration that forces multiterm queries to be lowercased. Here is a code snippet: queryParser.setLowercaseExpandedTerms(true); Query query = queryParser.parse(""Humpty Dumpty""); This code will lowercase our search string before search execution. Phrase slop Phrase search can be tuned to allow some flexibility in phrase matching. By default, phrase match is exact. Setting a slop value will give it some tolerance on terms that may not always be matched consecutively. Here is a code snippet that will demonstrate this feature: queryParser.setPhraseSlop(3); Query query = queryParser.parse(""Humpty Dumpty wall""); Without setting a phrase slop, this phrase Humpty Dumpty wall will not have any matches. By setting phrase slop to three, it allows some tolerance so that this search will now return the first sentence. Go ahead and play around with this setting in order to get more familiarized with its behavior. TermQuery and TermRangeQuery A TermQuery is a very simple query that matches documents containing a specific term. The TermRangeQuery is, as its name implies, a term range with a lower and upper boundary for matching. How to do it.. Here are a couple of examples on TermQuery and TermRangeQuery: query = new TermQuery(new Term("content", "humpty")); query = new TermRangeQuery("content", new BytesRef("a"), new BytesRef("c"), true, true); The first line is a simple query that matches the term humpty in the content field. The second line is a range query matching documents with the content that's sorted within a and c. BooleanQuery A BooleanQuery is a combination of other queries in which you can specify whether each subquery must, must not, or should match. These options provide the foundation to build up to logical operators of AND, OR, and NOT, which you can use in QueryParser. Here is a quick review on QueryParser syntax for BooleanQuery: "+" means required; for example, a search string +humpty dumpty equates to must match humpty and should match "dumpty" "-" means must not match; for example, a search string -humpty dumpty equates to must not match humpty and should match dumpty AND, OR, and NOT are pseudo Boolean operators. Under the hood, Lucene uses BooleanClause.Occur to model these operators. The options for occur are MUST, MUST_NOT, and SHOULD. In an AND query, both terms must match. In an OR query, both terms should match. Lastly, in a NOT query, the term MUST_NOT exists. 
For example, humpty AND dumpty means must match both humpty and dumpty, humpty OR dumpty means should match either or both humpty or dumpty, and NOT humpty means the term humpty must not exist in matching. As mentioned, rudimentary clauses of BooleanQuery have three option: must match, must not match, and should match. These options allow us to programmatically create Boolean operations through an API. How to do it.. Here is a code snippet that demonstrates BooleanQuery: BooleanQuery query = new BooleanQuery(); query.add(new BooleanClause( new TermQuery(new Term("content", "humpty")), BooleanClause.Occur.MUST)); query.add(new BooleanClause(new TermQuery( new Term("content", "dumpty")), BooleanClause.Occur.MUST)); query.add(new BooleanClause(new TermQuery( new Term("content", "wall")), BooleanClause.Occur.SHOULD)); query.add(new BooleanClause(new TermQuery( new Term("content", "sat")), BooleanClause.Occur.MUST_NOT)); How it works… In this demonstration, we will use TermQuery to illustrate the building of BooleanClauses. It's equivalent to this logic: (humpty AND dumpty) OR wall NOT sat. This code will return the second sentence from our setup. Because of the last MUST_NOT BooleanClause on the word "sat", the first sentence is filtered from the results. Note that BooleanClause accepts two arguments: a Query and a BooleanClause.Occur. BooleanClause.Occur is where you specify the matching options: MUST, MUST_NOT, and SHOULD. PrefixQuery and WildcardQuery PrefixQuery, as the name implies, matches documents with terms starting with a specified prefix. WildcardQuery allows you to use wildcard characters for wildcard matching. A PrefixQuery is somewhat similar to a WildcardQuery in which there is only one wildcard character at the end of a search string. When doing a wildcard search in QueryParser, it would return either a PrefixQuery or WildcardQuery, depending on the wildcard character's location. PrefixQuery is simpler and more efficient than WildcardQuery, so it's preferable to use PrefixQuery whenever possible. That's exactly what QueryParser does. How to do it... Here is a code snippet to demonstrate both Query types: PrefixQuery query = new PrefixQuery(new Term("content", "hum")); WildcardQuery query2 = new WildcardQuery(new Term("content", "*um*")); How it works… Both queries would return the same results from our setup. The PrefixQuery will match anything that starts with hum and the WildcardQuery would match anything that contains um. PhraseQuery and MultiPhraseQuery A PhraseQuery matches a particular sequence of terms, while a MultiPhraseQuery gives you an option to match multiple terms in the same position. For example, MultiPhrasQuery supports a phrase such as humpty (dumpty OR together) in which it matches humpty in position 0 and dumpty or together in position 1. How to do it... Here is a code snippet to demonstrate both Query types: PhraseQuery query = new PhraseQuery(); query.add(new Term("content", "humpty")); query.add(new Term("content", "together")); MultiPhraseQuery query2 = new MultiPhraseQuery(); Term[] terms1 = new Term[1];terms1[0] = new Term("content", "humpty"); Term[] terms2 = new Term[2];terms2[0] = new Term("content", "dumpty"); terms2[1] = new Term("content", "together"); query2.add(terms1); query2.add(terms2); How it works… The first Query, PhraseQuery, searches for the phrase humpty together. The second Query, MultiPhraseQuery, searches for the phrase humpty (dumpty OR together). 
The first Query would return sentence four from our setup, while the second Query would return sentences one, two, and four. Note that in MultiPhraseQuery, multiple terms in the same position are added as an array.

FuzzyQuery
A FuzzyQuery matches terms based on similarity, using the Damerau-Levenshtein algorithm. We are not going into the details of the algorithm, as it is outside our topic. What we need to know is that a fuzzy match is measured by the number of edits between terms, and FuzzyQuery allows a maximum of 2 edits. For example, between humptX and humpty there is one edit, and between humpXX and humpty there are two edits. There is also a requirement that the number of edits must be less than the minimum term length (of either the input term or the candidate term). As another example, ab and abcd would not match because the number of edits between the two terms is 2, and that is not less than the length of ab, which is also 2.

How to do it...
Here is a code snippet to demonstrate FuzzyQuery:

FuzzyQuery query = new FuzzyQuery(new Term("content", "humpXX"));

How it works…
This Query will return sentences one, two, and four from our setup, as humpXX matches humpty within two edits. In QueryParser, FuzzyQuery can be triggered by the tilde ( ~ ) sign. An equivalent search string would be humpXX~.

Summary
This gives you a glimpse of the various querying and filtering features that have been proven to build successful search engines.

Resources for Article:
Further resources on this subject:
Extending ElasticSearch with Scripting [article]
Downloading and Setting Up ElasticSearch [article]
Lucene.NET: Optimizing and merging index segments [article]
Moving Further with NumPy Modules

Packt
23 Jun 2015
23 min read
NumPy has a number of modules inherited from its predecessor, Numeric. Some of these packages have a SciPy counterpart, which may have fuller functionality. In this article by Ivan Idris author of the book NumPy: Beginner's Guide - Third Edition we will cover the following topics: The linalg package The fft package Random numbers Continuous and discrete distributions (For more resources related to this topic, see here.) Linear algebra Linear algebra is an important branch of mathematics. The numpy.linalg package contains linear algebra functions. With this module, you can invert matrices, calculate eigenvalues, solve linear equations, and determine determinants, among other things (see http://docs.scipy.org/doc/numpy/reference/routines.linalg.html). Time for action – inverting matrices The inverse of a matrix A in linear algebra is the matrix A-1, which, when multiplied with the original matrix, is equal to the identity matrix I. This can be written as follows: A A-1 = I The inv() function in the numpy.linalg package can invert an example matrix with the following steps: Create the example matrix with the mat() function: A = np.mat("0 1 2;1 0 3;4 -3 8") print("An", A) The A matrix appears as follows: A [[ 0 1 2] [ 1 0 3] [ 4 -3 8]] Invert the matrix with the inv() function: inverse = np.linalg.inv(A) print("inverse of An", inverse) The inverse matrix appears as follows: inverse of A [[-4.5 7. -1.5] [-2.   4. -1. ] [ 1.5 -2.   0.5]] If the matrix is singular, or not square, a LinAlgError is raised. If you want, you can check the result manually with a pen and paper. This is left as an exercise for the reader. Check the result by multiplying the original matrix with the result of the inv() function: print("Checkn", A * inverse) The result is the identity matrix, as expected: Check [[ 1. 0. 0.] [ 0. 1. 0.] [ 0. 0. 1.]] What just happened? We calculated the inverse of a matrix with the inv() function of the numpy.linalg package. We checked, with matrix multiplication, whether this is indeed the inverse matrix (see inversion.py): from __future__ import print_function import numpy as np   A = np.mat("0 1 2;1 0 3;4 -3 8") print("An", A)   inverse = np.linalg.inv(A) print("inverse of An", inverse)   print("Checkn", A * inverse) Pop quiz – creating a matrix Q1. Which function can create matrices? array create_matrix mat vector Have a go hero – inverting your own matrix Create your own matrix and invert it. The inverse is only defined for square matrices. The matrix must be square and invertible; otherwise, a LinAlgError exception is raised. Solving linear systems A matrix transforms a vector into another vector in a linear way. This transformation mathematically corresponds to a system of linear equations. The numpy.linalg function solve() solves systems of linear equations of the form Ax = b, where A is a matrix, b can be a one-dimensional or two-dimensional array, and x is an unknown variable. We will see the dot() function in action. This function returns the dot product of two floating-point arrays. The dot() function calculates the dot product (see https://www.khanacademy.org/math/linear-algebra/vectors_and_spaces/dot_cross_products/v/vector-dot-product-and-vector-length). 
For a matrix A and a vector b, the i-th component of the dot product is equal to the following sum: the sum over j of A[i, j] * b[j].

Time for action – solving a linear system
Solve an example of a linear system with the following steps:

Create A and b:

A = np.mat("1 -2 1;0 2 -8;-4 5 9")
print("A\n", A)
b = np.array([0, 8, -9])
print("b\n", b)

A and b appear as follows:

A
[[ 1 -2  1]
 [ 0  2 -8]
 [-4  5  9]]
b
[ 0  8 -9]

Solve this linear system with the solve() function:

x = np.linalg.solve(A, b)
print("Solution", x)

The solution of the linear system is as follows:

Solution [ 29. 16.   3.]

Check whether the solution is correct with the dot() function:

print("Check\n", np.dot(A , x))

The result is as expected:

Check [[ 0. 8. -9.]]

What just happened?
We solved a linear system using the solve() function from the NumPy linalg module and checked the solution with the dot() function:

from __future__ import print_function
import numpy as np

A = np.mat("1 -2 1;0 2 -8;-4 5 9")
print("A\n", A)

b = np.array([0, 8, -9])
print("b\n", b)

x = np.linalg.solve(A, b)
print("Solution", x)

print("Check\n", np.dot(A , x))

Finding eigenvalues and eigenvectors
Eigenvalues are scalar solutions to the equation Ax = ax, where A is a two-dimensional matrix and x is a one-dimensional vector. Eigenvectors are vectors corresponding to eigenvalues (see https://www.khanacademy.org/math/linear-algebra/alternate_bases/eigen_everything/v/linear-algebra-introduction-to-eigenvalues-and-eigenvectors). The eigvals() function in the numpy.linalg package calculates eigenvalues. The eig() function returns a tuple containing eigenvalues and eigenvectors.

Time for action – determining eigenvalues and eigenvectors
Let's calculate the eigenvalues of a matrix:

Create a matrix as shown in the following:

A = np.mat("3 -2;1 0")
print("A\n", A)

The matrix we created looks like the following:

A
[[ 3 -2]
 [ 1  0]]

Call the eigvals() function:

print("Eigenvalues", np.linalg.eigvals(A))

The eigenvalues of the matrix are as follows:

Eigenvalues [ 2. 1.]

Determine eigenvalues and eigenvectors with the eig() function. This function returns a tuple, where the first element contains eigenvalues and the second element contains corresponding eigenvectors, arranged column-wise:

eigenvalues, eigenvectors = np.linalg.eig(A)
print("First tuple of eig", eigenvalues)
print("Second tuple of eig\n", eigenvectors)

The eigenvalues and eigenvectors appear as follows:

First tuple of eig [ 2. 1.]
Second tuple of eig
[[ 0.89442719 0.70710678]
 [ 0.4472136  0.70710678]]

Check the result with the dot() function by calculating the left and right sides of the eigenvalue equation Ax = ax:

for i, eigenvalue in enumerate(eigenvalues):
    print("Left", np.dot(A, eigenvectors[:,i]))
    print("Right", eigenvalue * eigenvectors[:,i])
    print()

The output is as follows:

Left [[ 1.78885438]
 [ 0.89442719]]
Right [[ 1.78885438]
 [ 0.89442719]]

What just happened?
We found the eigenvalues and eigenvectors of a matrix with the eigvals() and eig() functions of the numpy.linalg module. We checked the result using the dot() function (see eigenvalues.py):

from __future__ import print_function
import numpy as np

A = np.mat("3 -2;1 0")
print("A\n", A)

print("Eigenvalues", np.linalg.eigvals(A))

eigenvalues, eigenvectors = np.linalg.eig(A)
print("First tuple of eig", eigenvalues)
print("Second tuple of eig\n", eigenvectors)

for i, eigenvalue in enumerate(eigenvalues):
    print("Left", np.dot(A, eigenvectors[:,i]))
    print("Right", eigenvalue * eigenvectors[:,i])
    print()
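If you prefer a vectorized check over the loop, the same verification can be written as a single call to np.allclose(). This is a minimal sketch and not part of the original listing; it uses np.array instead of np.mat so that the * operator broadcasts element-wise:

from __future__ import print_function
import numpy as np

A = np.array([[3, -2], [1, 0]])
eigenvalues, eigenvectors = np.linalg.eig(A)

# np.dot(A, eigenvectors) applies A to all eigenvectors (stored as columns) at once;
# eigenvectors * eigenvalues scales each column by its corresponding eigenvalue.
print(np.allclose(np.dot(A, eigenvectors), eigenvectors * eigenvalues))   # prints True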
Singular value decomposition
Singular value decomposition (SVD) is a type of factorization that decomposes a matrix into a product of three matrices. The SVD is a generalization of the previously discussed eigenvalue decomposition. SVD is very useful for algorithms such as the pseudo inverse, which we will discuss in the next section. The svd() function in the numpy.linalg package can perform this decomposition. This function returns three matrices U, Σ, and V such that U and V are unitary and Σ contains the singular values of the input matrix:

A = U Σ V*

The asterisk denotes the Hermitian conjugate or the conjugate transpose. The complex conjugate changes the sign of the imaginary part of a complex number and is therefore not relevant for real numbers. A complex square matrix A is unitary if A*A = AA* = I (the identity matrix). We can interpret SVD as a sequence of three operations: rotation, scaling, and another rotation. We already transposed matrices in this article. The transpose flips matrices, turning rows into columns and columns into rows.

Time for action – decomposing a matrix
It's time to decompose a matrix with the SVD using the following steps:

First, create a matrix as shown in the following:

A = np.mat("4 11 14;8 7 -2")
print("A\n", A)

The matrix we created looks like the following:

A
[[ 4 11 14]
 [ 8  7 -2]]

Decompose the matrix with the svd() function:

U, Sigma, V = np.linalg.svd(A, full_matrices=False)
print("U")
print(U)
print("Sigma")
print(Sigma)
print("V")
print(V)

Because of the full_matrices=False specification, NumPy performs a reduced SVD decomposition, which is faster to compute. The result is a tuple containing the two unitary matrices U and V on the left and right, respectively, and the singular values of the middle matrix:

U
[[-0.9486833  -0.31622777]
 [-0.31622777  0.9486833 ]]
Sigma
[ 18.97366596   9.48683298]
V
[[-0.33333333 -0.66666667 -0.66666667]
 [ 0.66666667  0.33333333 -0.66666667]]

We do not actually have the middle matrix; we only have its diagonal values, and the other values are all 0. Form the middle matrix with the diag() function. Multiply the three matrices as follows:

print("Product\n", U * np.diag(Sigma) * V)

The product of the three matrices is equal to the matrix we created in the first step:

Product
[[ 4. 11. 14.]
 [ 8.  7. -2.]]

What just happened?
We decomposed a matrix and checked the result by matrix multiplication. We used the svd() function from the NumPy linalg module (see decomposition.py):

from __future__ import print_function
import numpy as np

A = np.mat("4 11 14;8 7 -2")
print("A\n", A)

U, Sigma, V = np.linalg.svd(A, full_matrices=False)

print("U")
print(U)

print("Sigma")
print(Sigma)

print("V")
print(V)

print("Product\n", U * np.diag(Sigma) * V)

Pseudo inverse
The Moore-Penrose pseudo inverse of a matrix can be computed with the pinv() function of the numpy.linalg module (see http://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_pseudoinverse). The pseudo inverse is calculated using the SVD (see the previous example).
The inv() function only accepts square matrices; the pinv() function does not have this restriction and is therefore considered a generalization of the inverse. Time for action – computing the pseudo inverse of a matrix Let's compute the pseudo inverse of a matrix: First, create a matrix: A = np.mat("4 11 14;8 7 -2") print("An", A) The matrix we created looks like the following: A [[ 4 11 14] [ 8 7 -2]] Calculate the pseudo inverse matrix with the pinv() function: pseudoinv = np.linalg.pinv(A) print("Pseudo inversen", pseudoinv) The pseudo inverse result is as follows: Pseudo inverse [[-0.00555556 0.07222222] [ 0.02222222 0.04444444] [ 0.05555556 -0.05555556]] Multiply the original and pseudo inverse matrices: print("Check", A * pseudoinv) What we get is not an identity matrix, but it comes close to it: Check [[ 1.00000000e+00   0.00000000e+00] [ 8.32667268e-17   1.00000000e+00]] What just happened? We computed the pseudo inverse of a matrix with the pinv() function of the numpy.linalg module. The check by matrix multiplication resulted in a matrix that is approximately an identity matrix (see pseudoinversion.py): from __future__ import print_function import numpy as np   A = np.mat("4 11 14;8 7 -2") print("An", A)   pseudoinv = np.linalg.pinv(A) print("Pseudo inversen", pseudoinv)   print("Check", A * pseudoinv) Determinants The determinant is a value associated with a square matrix. It is used throughout mathematics; for more details, please refer to http://en.wikipedia.org/wiki/Determinant. For a n x n real value matrix, the determinant corresponds to the scaling a n-dimensional volume undergoes when transformed by the matrix. The positive sign of the determinant means the volume preserves its orientation (clockwise or anticlockwise), while a negative sign means reversed orientation. The numpy.linalg module has a det() function that returns the determinant of a matrix. Time for action – calculating the determinant of a matrix To calculate the determinant of a matrix, follow these steps: Create the matrix: A = np.mat("3 4;5 6") print("An", A) The matrix we created appears as follows: A [[ 3. 4.] [ 5. 6.]] Compute the determinant with the det() function: print("Determinant", np.linalg.det(A)) The determinant appears as follows: Determinant -2.0 What just happened? We calculated the determinant of a matrix with the det() function from the numpy.linalg module (see determinant.py): from __future__ import print_function import numpy as np   A = np.mat("3 4;5 6") print("An", A)   print("Determinant", np.linalg.det(A)) Fast Fourier transform The Fast Fourier transform (FFT) is an efficient algorithm to calculate the discrete Fourier transform (DFT). The Fourier series represents a signal as a sum of sine and cosine terms. FFT improves on more naïve algorithms and is of order O(N log N). DFT has applications in signal processing, image processing, solving partial differential equations, and more. NumPy has a module called fft that offers FFT functionality. Many functions in this module are paired; for those functions, another function does the inverse operation. For instance, the fft() and ifft() function form such a pair. Time for action – calculating the Fourier transform First, we will create a signal to transform. 
Calculate the Fourier transform with the following steps:

Create a cosine wave with 30 points as follows:

x = np.linspace(0, 2 * np.pi, 30)
wave = np.cos(x)

Transform the cosine wave with the fft() function:

transformed = np.fft.fft(wave)

Apply the inverse transform with the ifft() function. It should approximately return the original signal. Check with the following line:

print(np.all(np.abs(np.fft.ifft(transformed) - wave) < 10 ** -9))

The result appears as follows:

True

Plot the transformed signal with matplotlib:

plt.plot(transformed)
plt.title('Transformed cosine')
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.grid()
plt.show()

The resulting diagram shows the FFT result:

What just happened?
We applied the fft() function to a cosine wave. After applying the ifft() function, we got our signal back (see fourier.py):

from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 30)
wave = np.cos(x)
transformed = np.fft.fft(wave)
print(np.all(np.abs(np.fft.ifft(transformed) - wave) < 10 ** -9))

plt.plot(transformed)
plt.title('Transformed cosine')
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.grid()
plt.show()

Shifting
The fftshift() function of the numpy.fft module shifts zero-frequency components to the center of a spectrum. The zero-frequency component corresponds to the mean of the signal. The ifftshift() function reverses this operation.

Time for action – shifting frequencies
We will create a signal, transform it, and then shift the signal. Shift the frequencies with the following steps:

Create a cosine wave with 30 points:

x = np.linspace(0, 2 * np.pi, 30)
wave = np.cos(x)

Transform the cosine wave with the fft() function:

transformed = np.fft.fft(wave)

Shift the signal with the fftshift() function:

shifted = np.fft.fftshift(transformed)

Reverse the shift with the ifftshift() function. This should undo the shift. Check with the following code snippet:

print(np.all(np.abs(np.fft.ifftshift(shifted) - transformed) < 10 ** -9))

The result appears as follows:

True

Plot the transformed and shifted signals with matplotlib:

plt.plot(transformed, lw=2, label="Transformed")
plt.plot(shifted, '--', lw=3, label="Shifted")
plt.title('Shifted and transformed cosine wave')
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.grid()
plt.legend(loc='best')
plt.show()

The following diagram shows the effect of the shift and the FFT:

What just happened?
We applied the fftshift() function to a cosine wave. After applying the ifftshift() function, we got our signal back (see fouriershift.py):

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 30)
wave = np.cos(x)
transformed = np.fft.fft(wave)
shifted = np.fft.fftshift(transformed)
print(np.all(np.abs(np.fft.ifftshift(shifted) - transformed) < 10 ** -9))

plt.plot(transformed, lw=2, label="Transformed")
plt.plot(shifted, '--', lw=3, label="Shifted")
plt.title('Shifted and transformed cosine wave')
plt.xlabel('Frequency')
plt.ylabel('Amplitude')
plt.grid()
plt.legend(loc='best')
plt.show()

Random numbers
Random numbers are used in Monte Carlo methods, stochastic calculus, and more. Real random numbers are hard to generate, so, in practice, we use pseudo random numbers, which are random enough for most intents and purposes, except for some very special cases. These numbers appear random, but if you analyze them more closely, you will realize that they follow a certain pattern. The random numbers-related functions are in the NumPy random module. The core random number generator is based on the Mersenne Twister algorithm, a standard and well-known algorithm (see https://en.wikipedia.org/wiki/Mersenne_Twister). We can generate random numbers from discrete or continuous distributions. The distribution functions have an optional size parameter, which tells NumPy how many numbers to generate. You can specify either an integer or a tuple as size. This will result in an array filled with random numbers of the appropriate shape.
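Two details from this paragraph, that a seeded generator produces a repeatable sequence and that size can be an integer or a tuple, are easy to verify with a minimal sketch (the seed value 42 below is arbitrary and not taken from the original examples):

import numpy as np

np.random.seed(42)                            # fix the generator state
first = np.random.binomial(9, 0.5, size=5)    # integer size: a one-dimensional array of 5 draws

np.random.seed(42)                            # resetting the seed replays the same sequence
second = np.random.binomial(9, 0.5, size=5)

print(np.array_equal(first, second))          # True: pseudo random, hence repeatable
print(np.random.normal(size=(2, 3)).shape)    # tuple size: an array of shape (2, 3)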
The core random number generator is based on the Mersenne Twister algorithm—a standard and well-known algorithm (see https://en.wikipedia.org/wiki/Mersenne_Twister). We can generate random numbers from discrete or continuous distributions. The distribution functions have an optional size parameter, which tells NumPy how many numbers to generate. You can specify either an integer or a tuple as size. This will result in an array filled with random numbers of appropriate shape. Discrete distributions include the geometric, hypergeometric, and binomial distributions. Time for action – gambling with the binomial The binomial distribution models the number of successes in an integer number of independent trials of an experiment, where the probability of success in each experiment is a fixed number (see https://www.khanacademy.org/math/probability/random-variables-topic/binomial_distribution). Imagine a 17th century gambling house where you can bet on flipping pieces of eight. Nine coins are flipped. If less than five are heads, then you lose one piece of eight, otherwise you win one. Let's simulate this, starting with 1,000 coins in our possession. Use the binomial() function from the random module for that purpose. To understand the binomial() function, look at the following section: Initialize an array, which represents the cash balance, to zeros. Call the binomial() function with a size of 10000. This represents 10,000 coin flips in our casino: cash = np.zeros(10000) cash[0] = 1000 outcome = np.random.binomial(9, 0.5, size=len(cash)) Go through the outcomes of the coin flips and update the cash array. Print the minimum and maximum of the outcome, just to make sure we don't have any strange outliers: for i in range(1, len(cash)):    if outcome[i] < 5:      cash[i] = cash[i - 1] - 1    elif outcome[i] < 10:      cash[i] = cash[i - 1] + 1    else:      raise AssertionError("Unexpected outcome " + outcome)   print(outcome.min(), outcome.max()) As expected, the values are between 0 and 9. In the following diagram, you can see the cash balance performing a random walk: What just happened? We did a random walk experiment using the binomial() function from the NumPy random module (see headortail.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     cash = np.zeros(10000) cash[0] = 1000 np.random.seed(73) outcome = np.random.binomial(9, 0.5, size=len(cash))   for i in range(1, len(cash)):    if outcome[i] < 5:      cash[i] = cash[i - 1] - 1    elif outcome[i] < 10:      cash[i] = cash[i - 1] + 1    else:      raise AssertionError("Unexpected outcome " + outcome)   print(outcome.min(), outcome.max())   plt.plot(np.arange(len(cash)), cash) plt.title('Binomial simulation') plt.xlabel('# Bets') plt.ylabel('Cash') plt.grid() plt.show() Hypergeometric distribution The hypergeometricdistribution models a jar with two types of objects in it. The model tells us how many objects of one type we can get if we take a specified number of items out of the jar without replacing them (see https://en.wikipedia.org/wiki/Hypergeometric_distribution). The NumPy random module has a hypergeometric() function that simulates this situation. Time for action – simulating a game show Imagine a game show where every time the contestants answer a question correctly, they get to pull three balls from a jar and then put them back. Now, there is a catch, one ball in the jar is bad. Every time it is pulled out, the contestants lose six points. 
If, however, they manage to get out 3 of the 25 normal balls, they get one point. So, what is going to happen if we have 100 questions in total? Look at the following section for the solution: Initialize the outcome of the game with the hypergeometric() function. The first parameter of this function is the number of ways to make a good selection, the second parameter is the number of ways to make a bad selection, and the third parameter is the number of items sampled: points = np.zeros(100) outcomes = np.random.hypergeometric(25, 1, 3, size=len(points)) Set the scores based on the outcomes from the previous step: for i in range(len(points)):    if outcomes[i] == 3:      points[i] = points[i - 1] + 1    elif outcomes[i] == 2:      points[i] = points[i - 1] - 6    else:     print(outcomes[i]) The following diagram shows how the scoring evolved: What just happened? We simulated a game show using the hypergeometric() function from the NumPy random module. The game scoring depends on how many good and how many bad balls the contestants pulled out of a jar in each session (see urn.py): from __future__ import print_function import numpy as np import matplotlib.pyplot as plt     points = np.zeros(100) np.random.seed(16) outcomes = np.random.hypergeometric(25, 1, 3, size=len(points))   for i in range(len(points)):    if outcomes[i] == 3:      points[i] = points[i - 1] + 1    elif outcomes[i] == 2:      points[i] = points[i - 1] - 6    else:      print(outcomes[i])   plt.plot(np.arange(len(points)), points) plt.title('Game show simulation') plt.xlabel('# Rounds') plt.ylabel('Score') plt.grid() plt.show() Continuous distributions We usually model continuous distributions with probability density functions (PDF). The probability that a value is in a certain interval is determined by integration of the PDF (see https://www.khanacademy.org/math/probability/random-variables-topic/random_variables_prob_dist/v/probability-density-functions). The NumPy random module has functions that represent continuous distributions—beta(), chisquare(), exponential(), f(), gamma(), gumbel(), laplace(), lognormal(), logistic(), multivariate_normal(), noncentral_chisquare(), noncentral_f(), normal(), and others. Time for action – drawing a normal distribution We can generate random numbers from a normal distribution and visualize their distribution with a histogram (see https://www.khanacademy.org/math/probability/statistics-inferential/normal_distribution/v/introduction-to-the-normal-distribution). Draw a normal distribution with the following steps: Generate random numbers for a given sample size using the normal() function from the random NumPy module: N=10000 normal_values = np.random.normal(size=N) Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1. Use matplotlib for this purpose: _, bins, _ = plt.hist(normal_values,   np.sqrt(N), normed=True, lw=1) sigma = 1 mu = 0 plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi))   * np.exp( - (bins - mu)**2 / (2 * sigma**2) ),lw=2) plt.show() In the following diagram, we see the familiar bell curve: What just happened? We visualized the normal distribution using the normal() function from the random NumPy module. 
We did this by drawing the bell curve and a histogram of randomly generated values (see normaldist.py): import numpy as np import matplotlib.pyplot as plt   N=10000   np.random.seed(27) normal_values = np.random.normal(size=N) _, bins, _ = plt.hist(normal_values, np.sqrt(N), normed=True, lw=1, label="Histogram") sigma = 1 mu = 0 plt.plot(bins, 1/(sigma * np.sqrt(2 * np.pi)) * np.exp( - (bins - mu)**2 / (2 * sigma**2) ), '--', lw=3, label="PDF") plt.title('Normal distribution') plt.xlabel('Value') plt.ylabel('Normalized Frequency') plt.grid() plt.legend(loc='best') plt.show() Lognormal distribution A lognormal distribution is a distribution of a random variable whose natural logarithm is normally distributed. The lognormal() function of the random NumPy module models this distribution. Time for action – drawing the lognormal distribution Let's visualize the lognormal distribution and its PDF with a histogram: Generate random numbers using the normal() function from the random NumPy module: N=10000 lognormal_values = np.random.lognormal(size=N) Draw the histogram and theoretical PDF with a center value of 0 and standard deviation of 1: _, bins, _ = plt.hist(lognormal_values,   np.sqrt(N), normed=True, lw=1) sigma = 1 mu = 0 x = np.linspace(min(bins), max(bins), len(bins)) pdf = np.exp(-(numpy.log(x) - mu)**2 / (2 * sigma**2))/ (x *   sigma * np.sqrt(2 * np.pi)) plt.plot(x, pdf,lw=3) plt.show() The fit of the histogram and theoretical PDF is excellent, as you can see in the following diagram: What just happened? We visualized the lognormal distribution using the lognormal() function from the random NumPy module. We did this by drawing the curve of the theoretical PDF and a histogram of randomly generated values (see lognormaldist.py): import numpy as np import matplotlib.pyplot as plt   N=10000 np.random.seed(34) lognormal_values = np.random.lognormal(size=N) _, bins, _ = plt.hist(lognormal_values,   np.sqrt(N), normed=True, lw=1, label="Histogram") sigma = 1 mu = 0 x = np.linspace(min(bins), max(bins), len(bins)) pdf = np.exp(-(np.log(x) - mu)**2 / (2 * sigma**2))/ (x * sigma * np.sqrt(2 * np.pi)) plt.xlim([0, 15]) plt.plot(x, pdf,'--', lw=3, label="PDF") plt.title('Lognormal distribution') plt.xlabel('Value') plt.ylabel('Normalized frequency') plt.grid() plt.legend(loc='best') plt.show() Bootstrapping in statistics Bootstrapping is a method used to estimate variance, accuracy, and other metrics of sample estimates, such as the arithmetic mean. The simplest bootstrapping procedure consists of the following steps: Generate a large number of samples from the original data sample having the same size N. You can think of the original data as a jar containing numbers. We create the new samples by N times randomly picking a number from the jar. Each time we return the number into the jar, so a number can occur multiple times in a generated sample. With the new samples, we calculate the statistical estimate under investigation for each sample (for example, the arithmetic mean). This gives us a sample of possible values for the estimator. Time for action – sampling with numpy.random.choice() We will use the numpy.random.choice() function to perform bootstrapping. 
Start the IPython or Python shell and import NumPy: $ ipython In [1]: import numpy as np Generate a data sample following the normal distribution: In [2]: N = 500   In [3]: np.random.seed(52)   In [4]: data = np.random.normal(size=N)   Calculate the mean of the data: In [5]: data.mean() Out[5]: 0.07253250605445645 Generate 100 samples from the original data and calculate their means (of course, more samples may lead to a more accurate result): In [6]: bootstrapped = np.random.choice(data, size=(N, 100))   In [7]: means = bootstrapped.mean(axis=0)   In [8]: means.shape Out[8]: (100,) Calculate the mean, variance, and standard deviation of the arithmetic means we obtained: In [9]: means.mean() Out[9]: 0.067866373318115278   In [10]: means.var() Out[10]: 0.001762807104774598   In [11]: means.std() Out[11]: 0.041985796464692651 If we are assuming a normal distribution for the means, it may be relevant to know the z-score, which is defined as follows: In [12]: (data.mean() - means.mean())/means.std() Out[12]: 0.11113598238549766 From the z-score value, we get an idea of how probable the actual mean is. What just happened? We bootstrapped a data sample by generating samples and calculating the means of each sample. Then we computed the mean, standard deviation, variance, and z-score of the means. We used the numpy.random.choice() function for bootstrapping. Summary You learned a lot in this article about NumPy modules. We covered linear algebra, the Fast Fourier transform, continuous and discrete distributions, and random numbers. Resources for Article: Further resources on this subject: SciPy for Signal Processing [article] Visualization [article] The plot function [article]
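Before moving on, here is a short, optional sketch that extends the bootstrapping example above; it is not part of the original listing, and the exact interval bounds you obtain will depend on the random seed. It uses np.percentile() on the bootstrapped means to form an approximate 95 percent confidence interval for the mean:

import numpy as np

np.random.seed(52)
data = np.random.normal(size=500)

# Resample the data 100 times with replacement and compute each sample mean
bootstrapped = np.random.choice(data, size=(500, 100))
means = bootstrapped.mean(axis=0)

# The 2.5th and 97.5th percentiles of the bootstrapped means give an
# approximate 95 percent confidence interval for the mean
lower, upper = np.percentile(means, [2.5, 97.5])
print("Sample mean:", data.mean())
print("Approximate 95% interval:", lower, upper)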
Documents and Collections in Data Modeling with MongoDB
Packt
22 Jun 2015
12 min read
In this article by Wilson da Rocha França, author of the book, MongoDB Data Modeling, we will cover documents and collections used in data modeling with MongoDB. (For more resources related to this topic, see here.) Data modeling is a very important process during the conception of an application since this step will help you to define the necessary requirements for the database's construction. This definition is precisely the result of the data understanding acquired during the data modeling process. As previously described, this process, regardless of the chosen data model, is commonly divided into two phases: one that is very close to the user's view and the other that is a translation of this view to a conceptual schema. In the scenario of relational database modeling, the main challenge is to build a robust database from these two phases, with the aim of guaranteeing updates to it with any impact during the application's lifecycle. A big advantage of NoSQL compared to relational databases is that NoSQL databases are more flexible at this point, due to the possibility of a schema-less model that, in theory, can cause less impact on the user's view if a modification in the data model is needed. Despite the flexibility NoSQL offers, it is important to previously know how we will use the data in order to model a NoSQL database. It is a good idea not to plan the data format to be persisted, even in a NoSQL database. Moreover, at first sight, this is the point where database administrators, quite used to the relational world, become more uncomfortable. Relational database standards, such as SQL, brought us a sense of security and stability by setting up rules, norms, and criteria. On the other hand, we will dare to state that this security turned database designers distant of the domain from which the data to be stored is drawn. The same thing happened with application developers. There is a notable divergence of interests among them and database administrators, especially regarding data models. The NoSQL databases practically bring the need for an approximation between database professionals and the applications, and also the need for an approximation between developers and databases. For that reason, even though you may be a data modeler/designer or a database administrator, don't be scared if from now on we address subjects that are out of your comfort zone. Be prepared to start using words common from the application developer's point of view, and add them to your vocabulary. This article will cover the following: Introducing your documents and collections The document's characteristics and structure Introducing documents and collections MongoDB has the document as a basic unity of data. The documents in MongoDB are represented in JavaScript Object Notation (JSON). Collections are groups of documents. Making an analogy, a collection is similar to a table in a relational model and a document is a record in this table. And finally, collections belong to a database in MongoDB. The documents are serialized on disk in a format known as Binary JSON (BSON), a binary representation of a JSON document. An example of a document is: {    "_id": 123456,    "firstName": "John",    "lastName": "Clay",    "age": 25,    "address": {      "streetAddress": "131 GEN. 
Almério de Moura Street",      "city": "Rio de Janeiro",      "state": "RJ",      "postalCode": "20921060"    },    "phoneNumber":[      {          "type": "home",          "number": "+5521 2222-3333"      },      {          "type": "mobile",          "number": "+5521 9888-7777"      }    ] } Unlike the relational model, where you must declare a table structure, a collection doesn't enforce a certain structure for a document. It is possible that a collection contains documents with completely different structures. We can have, for instance, on the same users collection: {    "_id": "123456",    "username": "johnclay",    "age": 25,    "friends":[      {"username": "joelsant"},      {"username": "adilsonbat"}    ],    "active": true,    "gender": "male" } We can also have: {    "_id": "654321",    "username": "santymonty",    "age": 25,    "active": true,    "gender": "male",    "eyeColor": "brown" } In addition to this, another interesting feature of MongoDB is that not just data is represented by documents. Basically, all user interactions with MongoDB are made through documents. Besides data recording, documents are a means to: Define what data can be read, written, and/or updated in queries Define which fields will be updated Create indexes Configure replication Query the information from the database Before we go deep into the technical details of documents, let's explore their structure. JSON JSON is a text format for the open-standard representation of data and that is ideal for data traffic. To explore the JSON format deeper, you can check ECMA-404 The JSON Data Interchange Standard where the JSON format is fully described. JSON is described by two standards: ECMA-404 and RFC 7159. The first one puts more focus on the JSON grammar and syntax, while the second provides semantic and security considerations. As the name suggests, JSON arises from the JavaScript language. It came about as a solution for object state transfers between the web server and the browser. Despite being part of JavaScript, it is possible to find generators and readers for JSON in almost all the most popular programming languages such as C, Java, and Python. The JSON format is also considered highly friendly and human-readable. JSON does not depend on the platform chosen, and its specification are based on two data structures: A set or group of key/value pairs A value ordered list So, in order to clarify any doubts, let's talk about objects. Objects are a non-ordered collection of key/value pairs that are represented by the following pattern: {    "key" : "value" } In relation to the value ordered list, a collection is represented as follows: ["value1", "value2", "value3"] In the JSON specification, a value can be: A string delimited with " " A number, with or without a sign, on a decimal base (base 10). This number can have a fractional part, delimited by a period (.), or an exponential part followed by e or E Boolean values (true or false) A null value Another object Another value ordered array The following diagram shows us the JSON value structure: Here is an example of JSON code that describes a person: {    "name" : "Han",    "lastname" : "Solo",    "position" : "Captain of the Millenium Falcon",    "species" : "human",    "gender":"male",    "height" : 1.8 } BSON BSON means Binary JSON, which, in other words, means binary-encoded serialization for JSON documents. If you are seeking more knowledge on BSON, I suggest you take a look at the BSON specification on http://bsonspec.org/. 
If we compare BSON to the other binary formats, BSON has the advantage of being a model that allows you more flexibility. Also, one of its characteristics is that it's lightweight—a feature that is very important for data transport on the Web. The BSON format was designed to be easily navigable and both encoded and decoded in a very efficient way for most of the programming languages that are based on C. This is the reason why BSON was chosen as the data format for MongoDB disk persistence. The types of data representation in BSON are: String UTF-8 (string) Integer 32-bit (int32) Integer 64-bit (int64) Floating point (double) Document (document) Array (document) Binary data (binary) Boolean false (x00 or byte 0000 0000) Boolean true (x01 or byte 0000 0001) UTC datetime (int64)—the int64 is UTC milliseconds since the Unix epoch Timestamp (int64)—this is the special internal type used by MongoDB replication and sharding; the first 4 bytes are an increment, and the last 4 are a timestamp Null value () Regular expression (cstring) JavaScript code (string) JavaScript code w/scope (code_w_s) Min key()—the special type that compares a lower value than all other possible BSON element values Max key()—the special type that compares a higher value than all other possible BSON element values ObjectId (byte*12) Characteristics of documents Before we go into detail about how we must model documents, we need a better understanding of some of its characteristics. These characteristics can determine your decision about how the document must be modeled. The document size We must keep in mind that the maximum length for a BSON document is 16 MB. According to BSON specifications, this length is ideal for data transfers through the Web and to avoid the excessive use of RAM. But this is only a recommendation. Nowadays, a document can exceed the 16 MB length by using GridFS. GridFS allows us to store documents in MongoDB that are larger than the BSON maximum size, by dividing it into parts, or chunks. Each chunk is a new document with 255 K of size. Names and values for a field in a document There are a few things that you must know about names and values for fields in a document. First of all, any field's name in a document is a string. As usual, we have some restrictions on field names. They are: The _id field is reserved for a primary key You cannot start the name using the character $ The name cannot have a null character, or (.) Additionally, documents that have indexed fields must respect the size limit for an indexed field. The values cannot exceed the maximum size of 1,024 bytes. The document primary key As seen in the preceding section, the _id field is reserved for the primary key. By default, this field must be the first one in the document, even when, during an insertion, it is not the first field to be inserted. In these cases, MongoDB moves it to the first position. Also, by definition, it is in this field that a unique index will be created. The _id field can have any value that is a BSON type, except the array. Moreover, if a document is created without an indication of the _id field, MongoDB will automatically create an _id field of the ObjectId type. However, this is not the only option. You can use any value you want to identify your document as long as it is unique. There is another option, that is, generating an auto-incremental value based on a support collection or on an optimistic loop. 
Support collections In this method, we use a separate collection that will keep the last used value in the sequence. To increment the sequence, first we should query the last used value. After this, we can use the operator $inc to increment the value. There is a collection called system.js that can keep the JavaScript code in order to reuse it. Be careful not to include application logic in this collection. Let's see an example for this method: db.counters.insert(    {      _id: "userid",      seq: 0    } )   function getNextSequence(name) {    var ret = db.counters.findAndModify(          {            query: { _id: name },            update: { $inc: { seq: 1 } },            new: true          }    );    return ret.seq; }   db.users.insert(    {      _id: getNextSequence("userid"),      name: "Sarah C."    } ) The optimistic loop The generation of the _id field by an optimistic loop is done by incrementing each iteration and, after that, attempting to insert it in a new document: function insertDocument(doc, targetCollection) {    while (1) {        var cursor = targetCollection.find( {},         { _id: 1 } ).sort( { _id: -1 } ).limit(1);        var seq = cursor.hasNext() ? cursor.next()._id + 1 : 1;        doc._id = seq;        var results = targetCollection.insert(doc);        if( results.hasWriteError() ) {            if( results.writeError.code == 11000 /* dup key */ )                continue;            else                print( "unexpected error inserting data: " +                 tojson( results ) );        }        break;    } } In this function, the iteration does the following: Searches in targetCollection for the maximum value for _id. Settles the next value for _id. Sets the value on the document to be inserted. Inserts the document. In the case of errors due to duplicated _id fields, the loop repeats itself, or else the iteration ends. The points demonstrated here are the basics to understanding all the possibilities and approaches that this tool can offer. But, although we can use auto-incrementing fields for MongoDB, we must avoid using them because this tool does not scale for a huge data mass. Summary In this article, you saw how to build documents in MongoDB, examined their characteristics, and saw how they are organized into collections. Resources for Article: Further resources on this subject: Apache Solr and Big Data – integration with MongoDB [article] About MongoDB [article] Creating a RESTful API [article]
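The article above works in the mongo shell; as an optional aside, the same counter pattern can be sketched from Python. This is only an illustration under the assumption that a local mongod is running and the PyMongo driver is installed; the database and collection names (test, counters, users) are arbitrary:

from pymongo import MongoClient, ReturnDocument

client = MongoClient()   # assumes mongod is listening on localhost:27017
db = client.test

def get_next_sequence(name):
    # Atomically increment the counter document and return the new value
    doc = db.counters.find_one_and_update(
        {"_id": name},
        {"$inc": {"seq": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER)
    return doc["seq"]

db.users.insert_one({"_id": get_next_sequence("userid"), "name": "Sarah C."})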
The pandas Data Structures
Packt
22 Jun 2015
25 min read
In this article by Femi Anthony, author of the book, Mastering pandas, starts by taking a tour of NumPy ndarrays, a data structure not in pandas but NumPy. Knowledge of NumPy ndarrays is useful as it forms the foundation for the pandas data structures. Another key benefit of NumPy arrays is that they execute what is known as vectorized operations, which are operations that require traversing/looping on a Python array, much faster. In this article, I will present the material via numerous examples using IPython, a browser-based interface that allows the user to type in commands interactively to the Python interpreter. (For more resources related to this topic, see here.) NumPy ndarrays The NumPy library is a very important package used for numerical computing with Python. Its primary features include the following: The type numpy.ndarray, a homogenous multidimensional array Access to numerous mathematical functions – linear algebra, statistics, and so on Ability to integrate C, C++, and Fortran code For more information about NumPy, see http://www.numpy.org. The primary data structure in NumPy is the array class ndarray. It is a homogeneous multi-dimensional (n-dimensional) table of elements, which are indexed by integers just as a normal array. However, numpy.ndarray (also known as numpy.array) is different from the standard Python array.array class, which offers much less functionality. More information on the various operations is provided at http://scipy-lectures.github.io/intro/numpy/array_object.html. NumPy array creation NumPy arrays can be created in a number of ways via calls to various NumPy methods. NumPy arrays via numpy.array NumPy arrays can be created via the numpy.array constructor directly: In [1]: import numpy as np In [2]: ar1=np.array([0,1,2,3])# 1 dimensional array In [3]: ar2=np.array ([[0,3,5],[2,8,7]]) # 2D array In [4]: ar1 Out[4]: array([0, 1, 2, 3]) In [5]: ar2 Out[5]: array([[0, 3, 5],                [2, 8, 7]]) The shape of the array is given via ndarray.shape: In [5]: ar2.shape Out[5]: (2, 3) The number of dimensions is obtained using ndarray.ndim: In [7]: ar2.ndim Out[7]: 2 NumPy array via numpy.arange ndarray.arange is the NumPy version of Python's range function:In [10]: # produces the integers from 0 to 11, not inclusive of 12            ar3=np.arange(12); ar3 Out[10]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]) In [11]: # start, end (exclusive), step size        ar4=np.arange(3,10,3); ar4 Out[11]: array([3, 6, 9]) NumPy array via numpy.linspace ndarray.linspace generates linear evenly spaced elements between the start and the end: In [13]:# args - start element,end element, number of elements        ar5=np.linspace(0,2.0/3,4); ar5 Out[13]:array([ 0., 0.22222222, 0.44444444, 0.66666667]) NumPy array via various other functions These functions include numpy.zeros, numpy.ones, numpy.eye, nrandom.rand, numpy.random.randn, and numpy.empty. The argument must be a tuple in each case. For the 1D array, you can just specify the number of elements, no need for a tuple. numpy.ones The following command line explains the function: In [14]:# Produces 2x3x2 array of 1's.        ar7=np.ones((2,3,2)); ar7 Out[14]: array([[[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]],                [[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]]]) numpy.zeros The following command line explains the function: In [15]:# Produce 4x2 array of zeros.            
ar8=np.zeros((4,2));ar8 Out[15]: array([[ 0., 0.],          [ 0., 0.],            [ 0., 0.],            [ 0., 0.]]) numpy.eye The following command line explains the function: In [17]:# Produces identity matrix            ar9 = np.eye(3);ar9 Out[17]: array([[ 1., 0., 0.],            [ 0., 1., 0.],            [ 0., 0., 1.]]) numpy.diag The following command line explains the function: In [18]: # Create diagonal array        ar10=np.diag((2,1,4,6));ar10 Out[18]: array([[2, 0, 0, 0],            [0, 1, 0, 0],            [0, 0, 4, 0],            [0, 0, 0, 6]]) numpy.random.rand The following command line explains the function: In [19]: # Using the rand, randn functions          # rand(m) produces uniformly distributed random numbers with range 0 to m          np.random.seed(100)   # Set seed          ar11=np.random.rand(3); ar11 Out[19]: array([ 0.54340494, 0.27836939, 0.42451759]) In [20]: # randn(m) produces m normally distributed (Gaussian) random numbers            ar12=np.random.rand(5); ar12 Out[20]: array([ 0.35467445, -0.78606433, -0.2318722 ,   0.20797568, 0.93580797]) numpy.empty Using np.empty to create an uninitialized array is a cheaper and faster way to allocate an array, rather than using np.ones or np.zeros (malloc versus. cmalloc). However, you should only use it if you're sure that all the elements will be initialized later: In [21]: ar13=np.empty((3,2)); ar13 Out[21]: array([[ -2.68156159e+154,   1.28822983e-231],                [ 4.22764845e-307,   2.78310358e-309],                [ 2.68156175e+154,   4.17201483e-309]]) numpy.tile The np.tile function allows one to construct an array from a smaller array by repeating it several times on the basis of a parameter: In [334]: np.array([[1,2],[6,7]]) Out[334]: array([[1, 2],                  [6, 7]]) In [335]: np.tile(np.array([[1,2],[6,7]]),3) Out[335]: array([[1, 2, 1, 2, 1, 2],                 [6, 7, 6, 7, 6, 7]]) In [336]: np.tile(np.array([[1,2],[6,7]]),(2,2)) Out[336]: array([[1, 2, 1, 2],                  [6, 7, 6, 7],                  [1, 2, 1, 2],                  [6, 7, 6, 7]]) NumPy datatypes We can specify the type of contents of a numeric array by using the dtype parameter: In [50]: ar=np.array([2,-1,6,3],dtype='float'); ar Out[50]: array([ 2., -1., 6., 3.]) In [51]: ar.dtype Out[51]: dtype('float64') In [52]: ar=np.array([2,4,6,8]); ar.dtype Out[52]: dtype('int64') In [53]: ar=np.array([2.,4,6,8]); ar.dtype Out[53]: dtype('float64') The default dtype in NumPy is float. In the case of strings, dtype is the length of the longest string in the array: In [56]: sar=np.array(['Goodbye','Welcome','Tata','Goodnight']); sar.dtype Out[56]: dtype('S9') You cannot create variable-length strings in NumPy, since NumPy needs to know how much space to allocate for the string. dtypes can also be Boolean values, complex numbers, and so on: In [57]: bar=np.array([True, False, True]); bar.dtype Out[57]: dtype('bool') The datatype of ndarray can be changed in much the same way as we cast in other languages such as Java or C/C++. For example, float to int and so on. The mechanism to do this is to use the numpy.ndarray.astype() function. Here is an example: In [3]: f_ar = np.array([3,-2,8.18])        f_ar Out[3]: array([ 3. , -2. , 8.18]) In [4]: f_ar.astype(int) Out[4]: array([ 3, -2, 8]) More information on casting can be found in the official documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.astype.html. 
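As a small, hedged addition to the dtype discussion (not from the original text), astype() can also parse an array of numeric strings, and the itemsize and nbytes attributes report how much storage a given dtype costs:

import numpy as np

# astype() parses numeric strings into numbers
s_ar = np.array(['1.5', '-2', '8'])
f_ar = s_ar.astype(float)                        # array([ 1.5, -2. ,  8. ])

# itemsize is the size of one element in bytes; nbytes is the whole buffer
print(f_ar.dtype, f_ar.itemsize, f_ar.nbytes)    # float64 8 24

# A smaller floating-point dtype halves the memory footprint
print(f_ar.astype(np.float32).nbytes)            # 12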
NumPy indexing and slicing Array indices in NumPy start at 0, as in languages such as Python, Java, and C++ and unlike in Fortran, Matlab, and Octave, which start at 1. Arrays can be indexed in the standard way as we would index into any other Python sequences: # print entire array, element 0, element 1, last element. In [36]: ar = np.arange(5); print ar; ar[0], ar[1], ar[-1] [0 1 2 3 4] Out[36]: (0, 1, 4) # 2nd, last and 1st elements In [65]: ar=np.arange(5); ar[1], ar[-1], ar[0] Out[65]: (1, 4, 0) Arrays can be reversed using the ::-1 idiom as follows: In [24]: ar=np.arange(5); ar[::-1] Out[24]: array([4, 3, 2, 1, 0]) Multi-dimensional arrays are indexed using tuples of integers: In [71]: ar = np.array([[2,3,4],[9,8,7],[11,12,13]]); ar Out[71]: array([[ 2, 3, 4],                [ 9, 8, 7],                [11, 12, 13]]) In [72]: ar[1,1] Out[72]: 8 Here, we set the entry at row1 and column1 to 5: In [75]: ar[1,1]=5; ar Out[75]: array([[ 2, 3, 4],                [ 9, 5, 7],                [11, 12, 13]]) Retrieve row 2: In [76]: ar[2] Out[76]: array([11, 12, 13]) In [77]: ar[2,:] Out[77]: array([11, 12, 13]) Retrieve column 1: In [78]: ar[:,1] Out[78]: array([ 3, 5, 12]) If an index is specified that is out of bounds of the range of an array, IndexError will be raised: In [6]: ar = np.array([0,1,2]) In [7]: ar[5]    ---------------------------------------------------------------------------    IndexError                 Traceback (most recent call last) <ipython-input-7-8ef7e0800b7a> in <module>()    ----> 1 ar[5]      IndexError: index 5 is out of bounds for axis 0 with size 3 Thus, for 2D arrays, the first dimension denotes rows and the second dimension, the columns. The colon (:) denotes selection across all elements of the dimension. Array slicing Arrays can be sliced using the following syntax: ar[startIndex: endIndex: stepValue]. In [82]: ar=2*np.arange(6); ar Out[82]: array([ 0, 2, 4, 6, 8, 10]) In [85]: ar[1:5:2] Out[85]: array([2, 6]) Note that if we wish to include the endIndex value, we need to go above it, as follows: In [86]: ar[1:6:2] Out[86]: array([ 2, 6, 10]) Obtain the first n-elements using ar[:n]: In [91]: ar[:4] Out[91]: array([0, 2, 4, 6]) The implicit assumption here is that startIndex=0, step=1. Start at element 4 until the end: In [92]: ar[4:] Out[92]: array([ 8, 10]) Slice array with stepValue=3: In [94]: ar[::3] Out[94]: array([0, 6]) To illustrate the scope of indexing in NumPy, let us refer to this illustration, which is taken from a NumPy lecture given at SciPy 2013 and can be found at http://bit.ly/1GxCDpC: Let us now examine the meanings of the expressions in the preceding image: The expression a[0,3:5] indicates the start at row 0, and columns 3-5, where column 5 is not included. In the expression a[4:,4:], the first 4 indicates the start at row 4 and will give all columns, that is, the array [[40, 41,42,43,44,45] [50,51,52,53,54,55]]. The second 4 shows the cutoff at the start of column 4 to produce the array [[44, 45], [54, 55]]. The expression a[:,2] gives all rows from column 2. Now, in the last expression a[2::2,::2], 2::2 indicates that the start is at row 2 and the step value here is also 2. This would give us the array [[20, 21, 22, 23, 24, 25], [40, 41, 42, 43, 44, 45]]. Further, ::2 specifies that we retrieve columns in steps of 2, producing the end result array ([[20, 22, 24], [40, 42, 44]]). 
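The illustration referred to above may not be visible in this version of the article, so the following small sketch (an addition, not part of the original) builds an equivalent 6 x 6 array, in which the element at row i and column j is 10*i + j, and verifies the slicing expressions just discussed:

import numpy as np

# Element at row i, column j is 10*i + j, matching the lecture illustration
a = np.arange(6).reshape(6, 1) * 10 + np.arange(6)

print(a[0, 3:5])     # [3 4]
print(a[4:, 4:])     # [[44 45]
                     #  [54 55]]
print(a[:, 2])       # [ 2 12 22 32 42 52]
print(a[2::2, ::2])  # [[20 22 24]
                     #  [40 42 44]]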
Assignment and slicing can be combined as shown in the following code snippet: In [96]: ar Out[96]: array([ 0, 2, 4, 6, 8, 10]) In [100]: ar[:3]=1; ar Out[100]: array([ 1, 1, 1, 6, 8, 10]) In [110]: ar[2:]=np.ones(4);ar Out[110]: array([1, 1, 1, 1, 1, 1]) Array masking Here, NumPy arrays can be used as masks to select or filter out elements of the original array. For example, see the following snippet: In [146]: np.random.seed(10)          ar=np.random.random_integers(0,25,10); ar Out[146]: array([ 9, 4, 15, 0, 17, 25, 16, 17, 8, 9]) In [147]: evenMask=(ar % 2==0); evenMask Out[147]: array([False, True, False, True, False, False, True, False, True, False], dtype=bool) In [148]: evenNums=ar[evenMask]; evenNums Out[148]: array([ 4, 0, 16, 8]) In the following example, we randomly generate an array of 10 integers between 0 and 25. Then, we create a Boolean mask array that is used to filter out only the even numbers. This masking feature can be very useful, say for example, if we wished to eliminate missing values, by replacing them with a default value. Here, the missing value '' is replaced by 'USA' as the default country. Note that '' is also an empty string: In [149]: ar=np.array(['Hungary','Nigeria',                        'Guatemala','','Poland',                        '','Japan']); ar Out[149]: array(['Hungary', 'Nigeria', 'Guatemala',                  '', 'Poland', '', 'Japan'],                  dtype='|S9') In [150]: ar[ar=='']='USA'; ar Out[150]: array(['Hungary', 'Nigeria', 'Guatemala', 'USA', 'Poland', 'USA', 'Japan'], dtype='|S9') Arrays of integers can also be used to index an array to produce another array. Note that this produces multiple values; hence, the output must be an array of type ndarray. This is illustrated in the following snippet: In [173]: ar=11*np.arange(0,10); ar Out[173]: array([ 0, 11, 22, 33, 44, 55, 66, 77, 88, 99]) In [174]: ar[[1,3,4,2,7]] Out[174]: array([11, 33, 44, 22, 77]) In the preceding code, the selection object is a list and elements at indices 1, 3, 4, 2, and 7 are selected. Now, assume that we change it to the following: In [175]: ar[1,3,4,2,7] We get an IndexError error since the array is 1D and we're specifying too many indices to access it. IndexError         Traceback (most recent call last) <ipython-input-175-adbcbe3b3cdc> in <module>() ----> 1 ar[1,3,4,2,7]   IndexError: too many indices This assignment is also possible with array indexing, as follows: In [176]: ar[[1,3]]=50; ar Out[176]: array([ 0, 50, 22, 50, 44, 55, 66, 77, 88, 99]) When a new array is created from another array by using a list of array indices, the new array has the same shape. Complex indexing Here, we illustrate the use of complex indexing to assign values from a smaller array into a larger one: In [188]: ar=np.arange(15); ar Out[188]: array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])   In [193]: ar2=np.arange(0,-10,-1)[::-1]; ar2 Out[193]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0]) Slice out the first 10 elements of ar, and replace them with elements from ar2, as follows: In [194]: ar[:10]=ar2; ar Out[194]: array([-9, -8, -7, -6, -5, -4, -3, -2, -1, 0, 10, 11, 12, 13, 14]) Copies and views A view on a NumPy array is just a particular way of portraying the data it contains. Creating a view does not result in a new copy of the array, rather the data it contains may be arranged in a specific order, or only certain data rows may be shown. 
Thus, if data is replaced on the underlying array's data, this will be reflected in the view whenever the data is accessed via indexing. The initial array is not copied into the memory during slicing and is thus more efficient. The np.may_share_memory method can be used to see if two arrays share the same memory block. However, it should be used with caution as it may produce false positives. Modifying a view modifies the original array: In [118]:ar1=np.arange(12); ar1 Out[118]:array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])   In [119]:ar2=ar1[::2]; ar2 Out[119]: array([ 0, 2, 4, 6, 8, 10])   In [120]: ar2[1]=-1; ar1 Out[120]: array([ 0, 1, -1, 3, 4, 5, 6, 7, 8, 9, 10, 11]) To force NumPy to copy an array, we use the np.copy function. As we can see in the following array, the original array remains unaffected when the copied array is modified: In [124]: ar=np.arange(8);ar Out[124]: array([0, 1, 2, 3, 4, 5, 6, 7])   In [126]: arc=ar[:3].copy(); arc Out[126]: array([0, 1, 2])   In [127]: arc[0]=-1; arc Out[127]: array([-1, 1, 2])   In [128]: ar Out[128]: array([0, 1, 2, 3, 4, 5, 6, 7]) Operations Here, we present various operations in NumPy. Basic operations Basic arithmetic operations work element-wise with scalar operands. They are - +, -, *, /, and **. In [196]: ar=np.arange(0,7)*5; ar Out[196]: array([ 0, 5, 10, 15, 20, 25, 30])   In [198]: ar=np.arange(5) ** 4 ; ar Out[198]: array([ 0,   1, 16, 81, 256])   In [199]: ar ** 0.5 Out[199]: array([ 0.,   1.,   4.,   9., 16.]) Operations also work element-wise when another array is the second operand as follows: In [209]: ar=3+np.arange(0, 30,3); ar Out[209]: array([ 3, 6, 9, 12, 15, 18, 21, 24, 27, 30])   In [210]: ar2=np.arange(1,11); ar2 Out[210]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) Here, in the following snippet, we see element-wise subtraction, division, and multiplication: In [211]: ar-ar2 Out[211]: array([ 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])   In [212]: ar/ar2 Out[212]: array([3, 3, 3, 3, 3, 3, 3, 3, 3, 3])   In [213]: ar*ar2 Out[213]: array([ 3, 12, 27, 48, 75, 108, 147, 192, 243, 300]) It is much faster to do this using NumPy rather than pure Python. The %timeit function in IPython is known as a magic function and uses the Python timeit module to time the execution of a Python statement or expression, explained as follows: In [214]: ar=np.arange(1000)          %timeit a**3          100000 loops, best of 3: 5.4 µs per loop   In [215]:ar=range(1000)          %timeit [ar[i]**3 for i in ar]          1000 loops, best of 3: 199 µs per loop Array multiplication is not the same as matrix multiplication; it is element-wise, meaning that the corresponding elements are multiplied together. For matrix multiplication, use the dot operator. For more information refer to http://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html. 
In [228]: ar=np.array([[1,1],[1,1]]); ar Out[228]: array([[1, 1],                  [1, 1]])   In [230]: ar2=np.array([[2,2],[2,2]]); ar2 Out[230]: array([[2, 2],                  [2, 2]])   In [232]: ar.dot(ar2) Out[232]: array([[4, 4],                  [4, 4]]) Comparisons and logical operations are also element-wise: In [235]: ar=np.arange(1,5); ar Out[235]: array([1, 2, 3, 4])   In [238]: ar2=np.arange(5,1,-1);ar2 Out[238]: array([5, 4, 3, 2])   In [241]: ar < ar2 Out[241]: array([ True, True, False, False], dtype=bool)   In [242]: l1 = np.array([True,False,True,False])          l2 = np.array([False,False,True, False])          np.logical_and(l1,l2) Out[242]: array([False, False, True, False], dtype=bool) Other NumPy operations such as log, sin, cos, and exp are also element-wise: In [244]: ar=np.array([np.pi, np.pi/2]); np.sin(ar) Out[244]: array([ 1.22464680e-16,   1.00000000e+00]) Note that for element-wise operations on two NumPy arrays, the two arrays must have the same shape, else an error will result since the arguments of the operation must be the corresponding elements in the two arrays: In [245]: ar=np.arange(0,6); ar Out[245]: array([0, 1, 2, 3, 4, 5])   In [246]: ar2=np.arange(0,8); ar2 Out[246]: array([0, 1, 2, 3, 4, 5, 6, 7])   In [247]: ar*ar2          ---------------------------------------------------------------------------          ValueError                              Traceback (most recent call last)          <ipython-input-247-2c3240f67b63> in <module>()          ----> 1 ar*ar2          ValueError: operands could not be broadcast together with shapes (6) (8) Further, NumPy arrays can be transposed as follows: In [249]: ar=np.array([[1,2,3],[4,5,6]]); ar Out[249]: array([[1, 2, 3],                  [4, 5, 6]])   In [250]:ar.T Out[250]:array([[1, 4],                [2, 5],                [3, 6]])   In [251]: np.transpose(ar) Out[251]: array([[1, 4],                 [2, 5],                  [3, 6]]) Suppose we wish to compare arrays not element-wise, but array-wise. We could achieve this as follows by using the np.array_equal operator: In [254]: ar=np.arange(0,6)          ar2=np.array([0,1,2,3,4,5])          np.array_equal(ar, ar2) Out[254]: True Here, we see that a single Boolean value is returned instead of a Boolean array. The value is True only if all the corresponding elements in the two arrays match. The preceding expression is equivalent to the following: In [24]: np.all(ar==ar2) Out[24]: True Reduction operations Operators such as np.sum and np.prod perform reduces on arrays; that is, they combine several elements into a single value: In [257]: ar=np.arange(1,5)          ar.prod() Out[257]: 24 In the case of multi-dimensional arrays, we can specify whether we want the reduction operator to be applied row-wise or column-wise by using the axis parameter: In [259]: ar=np.array([np.arange(1,6),np.arange(1,6)]);ar Out[259]: array([[1, 2, 3, 4, 5],                 [1, 2, 3, 4, 5]]) # Columns In [261]: np.prod(ar,axis=0) Out[261]: array([ 1, 4, 9, 16, 25]) # Rows In [262]: np.prod(ar,axis=1) Out[262]: array([120, 120]) In the case of multi-dimensional arrays, not specifying an axis results in the operation being applied to all elements of the array as explained in the following example: In [268]: ar=np.array([[2,3,4],[5,6,7],[8,9,10]]); ar.sum() Out[268]: 54   In [269]: ar.mean() Out[269]: 6.0 In [271]: np.median(ar) Out[271]: 6.0 Statistical operators These operators are used to apply standard statistical operations to a NumPy array. 
The names are self-explanatory: np.std(), np.mean(), np.median(), and np.cumsum(). In [309]: np.random.seed(10)          ar=np.random.randint(0,10, size=(4,5));ar Out[309]: array([[9, 4, 0, 1, 9],                  [0, 1, 8, 9, 0],                  [8, 6, 4, 3, 0],                  [4, 6, 8, 1, 8]]) In [310]: ar.mean() Out[310]: 4.4500000000000002   In [311]: ar.std() Out[311]: 3.4274626183227732   In [312]: ar.var(axis=0) # across rows Out[312]: array([ 12.6875,   4.1875, 11.   , 10.75 , 18.1875])   In [313]: ar.cumsum() Out[313]: array([ 9, 13, 13, 14, 23, 23, 24, 32, 41, 41, 49, 55,                  59, 62, 62, 66, 72, 80, 81, 89]) Logical operators Logical operators can be used for array comparison/checking. They are as follows: np.all(): This is used for element-wise and all of the elements np.any(): This is used for element-wise or all of the elements Generate a random 4 × 4 array of ints and check if any element is divisible by 7 and if all elements are less than 11: In [320]: np.random.seed(100)          ar=np.random.randint(1,10, size=(4,4));ar Out[320]: array([[9, 9, 4, 8],                  [8, 1, 5, 3],                  [6, 3, 3, 3],                  [2, 1, 9, 5]])   In [318]: np.any((ar%7)==0) Out[318]: False   In [319]: np.all(ar<11) Out[319]: True Broadcasting In broadcasting, we make use of NumPy's ability to combine arrays that don't have the same exact shape. Here is an example: In [357]: ar=np.ones([3,2]); ar Out[357]: array([[ 1., 1.],                  [ 1., 1.],                  [ 1., 1.]])   In [358]: ar2=np.array([2,3]); ar2 Out[358]: array([2, 3])   In [359]: ar+ar2 Out[359]: array([[ 3., 4.],                  [ 3., 4.],                  [ 3., 4.]]) Thus, we can see that ar2 is broadcasted across the rows of ar by adding it to each row of ar producing the preceding result. Here is another example, showing that broadcasting works across dimensions: In [369]: ar=np.array([[23,24,25]]); ar Out[369]: array([[23, 24, 25]]) In [368]: ar.T Out[368]: array([[23],                  [24],                  [25]]) In [370]: ar.T+ar Out[370]: array([[46, 47, 48],                  [47, 48, 49],                  [48, 49, 50]]) Here, both row and column arrays were broadcasted and we ended up with a 3 × 3 array. Array shape manipulation There are a number of steps for the shape manipulation of arrays. Flattening a multi-dimensional array The np.ravel() function allows you to flatten a multi-dimensional array as follows: In [385]: ar=np.array([np.arange(1,6), np.arange(10,15)]); ar Out[385]: array([[ 1, 2, 3, 4, 5],                  [10, 11, 12, 13, 14]])   In [386]: ar.ravel() Out[386]: array([ 1, 2, 3, 4, 5, 10, 11, 12, 13, 14])   In [387]: ar.T.ravel() Out[387]: array([ 1, 10, 2, 11, 3, 12, 4, 13, 5, 14]) You can also use np.flatten, which does the same thing, except that it returns a copy while np.ravel returns a view. Reshaping The reshape function can be used to change the shape of or unflatten an array: In [389]: ar=np.arange(1,16);ar Out[389]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]) In [390]: ar.reshape(3,5) Out[390]: array([[ 1, 2, 3, 4, 5],                  [ 6, 7, 8, 9, 10],                 [11, 12, 13, 14, 15]]) The np.reshape function returns a view of the data, meaning that the underlying array remains unchanged. In special cases, however, the shape cannot be changed without the data being copied. For more details on this, see the documentation at http://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html. 
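One further, hedged note on reshaping (an addition, not part of the original text): one of the dimensions passed to reshape() can be given as -1, in which case NumPy infers it from the total number of elements:

import numpy as np

ar = np.arange(1, 16)

# -1 asks NumPy to work out the missing dimension: 15 elements / 3 rows = 5 columns
print(ar.reshape(3, -1).shape)    # (3, 5)

# ravel() flattens the reshaped result back to one dimension
print(ar.reshape(3, -1).ravel())  # [ 1  2  3 ... 14 15]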
Resizing There are two resize operators, numpy.ndarray.resize, which is an ndarray operator that resizes in place, and numpy.resize, which returns a new array with the specified shape. Here, we illustrate the numpy.ndarray.resize function: In [408]: ar=np.arange(5); ar.resize((8,));ar Out[408]: array([0, 1, 2, 3, 4, 0, 0, 0]) Note that this function only works if there are no other references to this array; else, ValueError results: In [34]: ar=np.arange(5);          ar Out[34]: array([0, 1, 2, 3, 4]) In [35]: ar2=ar In [36]: ar.resize((8,)); --------------------------------------------------------------------------- ValueError                                Traceback (most recent call last) <ipython-input-36-394f7795e2d1> in <module>() ----> 1 ar.resize((8,));   ValueError: cannot resize an array that references or is referenced by another array in this way. Use the resize function The way around this is to use the numpy.resize function instead: In [38]: np.resize(ar,(8,)) Out[38]: array([0, 1, 2, 3, 4, 0, 1, 2]) Adding a dimension The np.newaxis function adds an additional dimension to an array: In [377]: ar=np.array([14,15,16]); ar.shape Out[377]: (3,) In [378]: ar Out[378]: array([14, 15, 16]) In [379]: ar=ar[:, np.newaxis]; ar.shape Out[379]: (3, 1) In [380]: ar Out[380]: array([[14],                  [15],                  [16]]) Array sorting Arrays can be sorted in various ways. Sort the array along an axis; first, let's discuss this along the y-axis: In [43]: ar=np.array([[3,2],[10,-1]])          ar Out[43]: array([[ 3, 2],                [10, -1]]) In [44]: ar.sort(axis=1)          ar Out[44]: array([[ 2, 3],                [-1, 10]]) Here, we will explain the sorting along the x-axis: In [45]: ar=np.array([[3,2],[10,-1]])          ar Out[45]: array([[ 3, 2],                [10, -1]]) In [46]: ar.sort(axis=0)          ar Out[46]: array([[ 3, -1],                [10, 2]]) Sorting by in-place (np.array.sort) and out-of-place (np.sort) functions. Other operations that are available for array sorting include the following: np.min(): It returns the minimum element in the array np.max(): It returns the maximum element in the array np.std(): It returns the standard deviation of the elements in the array np.var(): It returns the variance of elements in the array np.argmin(): It indices of minimum np.argmax(): It indices of maximum np.all(): It returns element-wise and all of the elements np.any(): It returns element-wise or all of the elements Summary In this article we discussed how numpy.ndarray is the bedrock data structure on which the pandas data structures are based. The pandas data structures at their heart consist of NumPy ndarray of data and an array or arrays of labels. There are three main data structures in pandas: Series, DataFrame, and Panel. The pandas data structures are much easier to use and more user-friendly than Numpy ndarrays, since they provide row indexes and column indexes in the case of DataFrame and Panel. The DataFrame object is the most popular and widely used object in pandas. Resources for Article: Further resources on this subject: Machine Learning [article] Financial Derivative – Options [article] Introducing Interactive Plotting [article]
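As a closing, optional sketch (not from the original article), the summary's point that the pandas structures wrap NumPy ndarrays can be seen directly by building a DataFrame from an ndarray and inspecting its underlying values; the column labels used here are arbitrary:

import numpy as np
import pandas as pd

arr = np.arange(6).reshape(3, 2)

# A DataFrame is, at its heart, an ndarray plus row and column labels
df = pd.DataFrame(arr, columns=['a', 'b'])
print(df)

# .values hands back the underlying NumPy data
print(type(df.values), df.values.dtype)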

Set Up MariaDB
Packt
16 Jun 2015
8 min read
In this article, by Daniel Bartholomew, author of Getting Started with MariaDB - Second Edition, you will learn to set up MariaDB with a generic configuration suitable for general use. This is perfect for giving MariaDB a try but might not be suitable for a production database application under heavy load. There are thousands of ways to tweak the settings to get MariaDB to perform just the way we need it to. Many books have been written on this subject. In this article, we'll cover enough of the basics so that we can comfortably edit the MariaDB configuration files and know our way around. The MariaDB filesystem layout A MariaDB installation is not a single file or even a single directory, so the first stop on our tour is a high-level overview of the filesystem layout. We'll start with Windows and then move on to Linux. The MariaDB filesystem layout on Windows On Windows, MariaDB is installed under a directory named with the following pattern: C:Program FilesMariaDB <major>.<minor> In the preceding command, <major> and <minor> refer to the first and second number in the MariaDB version string. So for MariaDB 10.1, the location would be: C:Program FilesMariaDB 10.1 The only alteration to this location, unless we change it during the installation, is when the 32-bit version of MariaDB is installed on a 64-bit version of Windows. In that case, the default MariaDB directory is at the following location: C:Program Files x86MariaDB <major>.<minor> Under the MariaDB directory on Windows, there are four primary directories: bin, data, lib, and include. There are also several configuration examples and other files under the MariaDB directory and a couple of additional directories (docs and Share), but we won't go into their details here. The bin directory is where the executable files of MariaDB are located. The data directory is where databases are stored; it is also where the primary MariaDB configuration file, my.ini, is stored. The lib directory contains various library and plugin files. Lastly, the include directory contains files that are useful for application developers. We don't generally need to worry about the bin, lib, and include directories; it's enough for us to be aware that they exist and know what they contain. The data directory is where we'll spend most of our time in this article and when using MariaDB. On Linux distributions, MariaDB follows the default filesystem layout. For example, the MariaDB binaries are placed under /usr/bin/, libraries are placed under /usr/lib/, manual pages are placed under /usr/share/man/, and so on. However, there are some key MariaDB-specific directories and file locations that we should know about. Two of them are locations that are the same across most Linux distributions. These locations are the /usr/share/mysql/ and /var/lib/mysql/ directories. The /usr/share/mysql/ directory contains helper scripts that are used during the initial installation of MariaDB, translations (so we can have error and system messages in different languages), and character set information. We don't need to worry about these files and scripts; it's enough to know that this directory exists and contains important files. The /var/lib/mysql/ directory is the default location for our actual database data and the related files such as logs. There is not much need to worry about this directory as MariaDB will handle its contents automatically; for now it's enough to know that it exists. The next directory we should know about is where the MariaDB plugins are stored. 
Unlike the previous two, the location of this directory varies. On Debian and Ubuntu systems, the directory is at the following location: /usr/lib/mysql/plugin/ In distributions such as Fedora, Red Hat, and CentOS, the location of the plugin directory varies depending on whether our system is 32 bit or 64 bit. If unsure, we can just look in both. The possible locations are: /lib64/mysql/plugin//lib/mysql/plugin/ The basic rule of thumb is that if we don't have a /lib64/ directory, we have the 32-bit version of Fedora, Red Hat, or CentOS installed. As with /usr/share/mysql/, we don't need to worry about the contents of the MariaDB plugin directory. It's enough to know that it exists and contains important files. Also, if in the future we install a new MariaDB plugin, this directory is where it will go. The last directory that we should know about is only found on Debian and the distributions based on Debian such as Ubuntu. Its location is as follows: /etc/mysql/ The /etc/mysql/ directory is where the configuration information for MariaDB is stored; specifically, in the following two locations: /etc/mysql/my.cnf/etc/mysql/conf.d/ Fedora, Red Hat, CentOS, and related systems don't have an /etc/mysql/ directory by default, but they do have a my.cnf file and a directory that serves the same purpose that the /etc/mysql/conf.d/ directory does on Debian and Ubuntu. They are at the following two locations: /etc/my.cnf/etc/my.cnf.d/ The my.cnf files, regardless of location, function the same on all Linux versions and on Windows, where it is often named my.ini. The /etc/my.cnf.d/ and /etc/mysql/conf.d/ directories, as mentioned, serve the same purpose. We'll spend the next section going over these two directories. Modular configuration on Linux The /etc/my.cnf.d/ and /etc/mysql/conf.d/ directories are special locations for the MariaDB configuration files. They are found on the MariaDB releases for Linux such as Debian, Ubuntu, Fedora, Red Hat, and CentOS. We will only have one or the other of them, never both, and regardless of which one we have, their function is the same. The basic idea behind these directories is to allow the package manager (APT or YUM) to be able to install packages for MariaDB, which include additions to MariaDB's configuration without needing to edit or change the main my.cnf configuration file. It's easy to imagine the harm that would be caused if we installed a new plugin package and it overwrote a carefully crafted and tuned configuration file. With these special directories, the package manager can simply add a file to the appropriate directory and be done. When the MariaDB server and the clients and utilities included with MariaDB start up, they first read the main my.cnf file and then any files that they find under the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directories that have the extension .cnf because of a line at the end of the default configuration files. For example, MariaDB includes a plugin called feedback whose sole purpose is to send back anonymous statistical information to the MariaDB developers. They use this information to help guide future development efforts. It is disabled by default but can easily be enabled by adding feedback=on to a [mysqld] group of the MariaDB configuration file (we'll talk about configuration groups in the following section). 
For example, MariaDB includes a plugin called feedback whose sole purpose is to send back anonymous statistical information to the MariaDB developers. They use this information to help guide future development efforts. It is disabled by default but can easily be enabled by adding feedback=on to a [mysqld] group of the MariaDB configuration file (we'll talk about configuration groups in the following section). We could add the required lines to our main my.cnf file or, better yet, we can create a file called feedback.cnf (MariaDB doesn't care what the actual filename is, apart from the .cnf extension) with the following content:

[mysqld]
feedback=on

All we have to do is put our feedback.cnf file in the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directory and when we start or restart the server, the feedback.cnf file will be read and the plugin will be turned on. Doing this for a single plugin on a solitary MariaDB server may seem like too much work, but suppose we have 100 servers, and further assume that since the servers are doing different things, each of them has a slightly different my.cnf configuration file. Without using our small feedback.cnf file to turn on the feedback plugin on all of them, we would have to connect to each server in turn and manually add feedback=on to the [mysqld] group of the file. This would get tiresome and there is also a chance that we might make a mistake with one, or several of the files that we edit, even if we try to automate the editing in some way. Copying a single file to each server that only does one thing (turning on the feedback plugin in our example) is much faster, and much safer. And, if we have an automated deployment system in place, copying the file to every server can be almost instant.

Caution! Because the configuration settings in the /etc/my.cnf.d/ or /etc/mysql/conf.d/ directory are read after the settings in the my.cnf file, they can override or change the settings in our main my.cnf file. This can be a good thing if that is what we want and expect. Conversely, it can be a bad thing if we are not expecting that behavior.

Summary

That's it for our configuration highlights tour! In this article, we've learned where the various bits and pieces of MariaDB are installed and about the different parts that make up a typical MariaDB configuration file.

Resources for Article:

Building a Web Application with PHP and MariaDB – Introduction to caching
Installing MariaDB on Windows and Mac OS X
Questions & Answers with MariaDB's Michael "Monty" Widenius - Founder of MySQL AB

Clustering

Packt
16 Jun 2015
8 min read
 In this article by Jayani Withanawasam, author of the book Apache Mahout Essentials, we will see the clustering technique in machine learning and its implementation using Apache Mahout. The K-Means clustering algorithm is explained in detail with both Java and command-line examples (sequential and parallel executions), and other important clustering algorithms, such as Fuzzy K-Means, canopy clustering, and spectral K-Means are also explored. In this article, we will cover the following topics: Unsupervised learning and clustering Applications of clustering Types of clustering K-Means clustering K-Means clustering with MapReduce (For more resources related to this topic, see here.) Unsupervised learning and clustering Information is a key driver for any type of organization. However, with the rapid growth in the volume of data, valuable information may be hidden and go unnoticed due to the lack of effective data processing and analyzing mechanisms. Clustering is an unsupervised learning mechanism that can find the hidden patterns and structures in data by finding data points that are similar to each other. No prelabeling is required. So, you can organize data using clustering with little or no human intervention. For example, let's say you are given a collection of balls of different sizes without any category labels, such as big and small, attached to them; you should be able to categorize them using clustering by considering their attributes, such as radius and weight, for similarity. We will learn how to use Apache Mahout to perform clustering using different algorithms. Applications of clustering Clustering has many applications in different domains, such as biology, business, and information retrieval. Computer vision and image processing Clustering techniques are widely used in the computer vision and image processing domain. Clustering is used for image segmentation in medical image processing for computer aided disease (CAD) diagnosis. One specific area is breast cancer detection. In breast cancer detection, a mammogram is clustered into several parts for further analysis, as shown in the following image. The regions of interest for signs of breast cancer in the mammogram can be identified using the K-Means algorithm. Image features such as pixels, colors, intensity, and texture are used during clustering: Types of clustering Clustering can be divided into different categories based on different criteria. Hard clustering versus soft clustering Clustering techniques can be divided into hard clustering and soft clustering based on the cluster's membership. In hard clustering, a given data point in n-dimensional space only belongs to one cluster. This is also known as exclusive clustering. The K-Means clustering mechanism is an example of hard clustering. A given data point can belong to more than one cluster in soft clustering. This is also known as overlapping clustering. The Fuzzy K-Means algorithm is a good example of soft clustering. A visual representation of the difference between hard clustering and soft clustering is given in the following figure: Flat clustering versus hierarchical clustering In hierarchical clustering, a hierarchy of clusters is built using the top-down (divisive) or bottom-up (agglomerative) approach. This is more informative and accurate than flat clustering, which is a simple technique where no hierarchy is present. However, this comes at the cost of performance, as flat clustering is faster and more efficient than hierarchical clustering. 
For example, let's assume that you need to figure out T-shirt sizes for people of different sizes. Using hierarchical clustering, you can come up with sizes for small (s), medium (m), and large (l) first by analyzing a sample of the people in the population. Then, we can further categorize this as extra small (xs), small (s), medium, large (l), and extra large (xl) sizes.

Model-based clustering

In model-based clustering, data is modeled using a standard statistical model to work with different distributions. The idea is to find a model that best fits the data. The best-fit model is achieved by tuning up parameters to minimize loss on errors. Once the parameter values are set, probability membership can be calculated for new data points using the model. Model-based clustering gives a probability distribution over clusters.

K-Means clustering

K-Means clustering is a simple and fast clustering algorithm that has been widely adopted in many problem domains. We will give a detailed explanation of the K-Means algorithm, as it will provide the base for other algorithms. K-Means clustering assigns data points to k number of clusters (cluster centroids) by minimizing the distance from the data points to the cluster centroids. Let's consider a simple scenario where we need to cluster people based on their size (height and weight are the selected attributes) and different colors (clusters): We can plot this problem in two-dimensional space, as shown in the following figure, and solve it using the K-Means algorithm:

Getting your hands dirty!

Let's move on to a real implementation of the K-Means algorithm using Apache Mahout. The following are the different ways in which you can run algorithms in Apache Mahout:

Sequential
MapReduce

You can execute the algorithms using a command line (by calling the correct bin/mahout subcommand) or using Java programming (calling the correct driver's run method).

Running K-Means using Java programming

This example continues with the people-clustering scenario mentioned earlier. The size (weight and height) distribution for this example has been plotted in two-dimensional space, as shown in the following image:

Data preparation

First, we need to represent the problem domain as numerical vectors. The following table shows the size distribution of people mentioned in the previous scenario:

Weight (kg)   Height (cm)
22            80
25            75
28            85
55            150
50            145
53            153

Save the following content in a file named KmeansTest.data:

22 80
25 75
28 85
55 150
50 145
53 153

Understanding important parameters

Let's take a look at the significance of some important parameters:

org.apache.hadoop.fs.Path: This denotes the path to a file or directory in the filesystem.
org.apache.hadoop.conf.Configuration: This provides access to Hadoop-related configuration parameters.
org.apache.mahout.common.distance.DistanceMeasure: This determines the distance between two points.
K: This denotes the number of clusters.
convergenceDelta: This is a double value that is used to determine whether the algorithm has converged.
maxIterations: This denotes the maximum number of iterations to run.
runClustering: If this is true, the clustering step is to be executed after the clusters have been determined.
runSequential: If this is true, the K-Means sequential implementation is to be used in order to process the input data.
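For reference, the command-line route maps these same parameters onto flags of the bin/mahout kmeans subcommand. The following is only a sketch: the flag names should be checked against your Mahout release (bin/mahout kmeans --help), the directory names are illustrative rather than taken from this article, and it assumes the input directory already contains vectors in Mahout's sequence file format:

# Hypothetical command-line run of K-Means; exact flags may differ between Mahout releases
bin/mahout kmeans \
  -i Kmeansinput \
  -c Kmeansoutput/random-seeds \
  -o Kmeansoutput \
  -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure \
  -k 2 \
  -x 10 \
  -cd 0.5 \
  -cl \
  -ow

Here -k, -x, -cd, and -cl correspond to K, maxIterations, convergenceDelta, and runClustering in the list above, while -ow simply overwrites any previous output. The Java driver shown next performs the same steps programmatically, including converting raw input into vectors first.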
The following code snippet shows the source code:

private static final String DIRECTORY_CONTAINING_CONVERTED_INPUT = "Kmeansdata";

public static void main(String[] args) throws Exception {
  // Path to output folder
  Path output = new Path("Kmeansoutput");
  // Hadoop configuration details
  Configuration conf = new Configuration();
  HadoopUtil.delete(conf, output);
  run(conf, new Path("KmeansTest"), output, new EuclideanDistanceMeasure(), 2, 0.5, 10);
}

public static void run(Configuration conf, Path input, Path output, DistanceMeasure measure,
    int k, double convergenceDelta, int maxIterations) throws Exception {
  // Input should be given as sequence file format
  Path directoryContainingConvertedInput = new Path(output, DIRECTORY_CONTAINING_CONVERTED_INPUT);
  InputDriver.runJob(input, directoryContainingConvertedInput,
      "org.apache.mahout.math.RandomAccessSparseVector");
  // Get initial clusters randomly
  Path clusters = new Path(output, "random-seeds");
  clusters = RandomSeedGenerator.buildRandom(conf, directoryContainingConvertedInput, clusters, k, measure);
  // Run K-Means with a given K
  KMeansDriver.run(conf, directoryContainingConvertedInput, clusters, output,
      convergenceDelta, maxIterations, true, 0.0, false);
  // Run ClusterDumper to display the result
  Path outGlob = new Path(output, "clusters-*-final");
  Path clusteredPoints = new Path(output, "clusteredPoints");
  ClusterDumper clusterDumper = new ClusterDumper(outGlob, clusteredPoints);
  clusterDumper.printClusters(null);
}

Use the following code example in order to get a better (readable) outcome to analyze the data points and the centroids they are assigned to:

Reader reader = new SequenceFile.Reader(fs,
    new Path(output, Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
IntWritable key = new IntWritable();
WeightedPropertyVectorWritable value = new WeightedPropertyVectorWritable();
while (reader.next(key, value)) {
  System.out.println("key: " + key.toString() + " value: " + value.toString());
}
reader.close();

After you run the algorithm, you will see the clustering output generated for each iteration and the final result in the filesystem (in the output directory you have specified; in this case, Kmeansoutput).

Summary

Clustering is an unsupervised learning mechanism that requires minimal human effort. Clustering has many applications in different areas, such as medical image processing, market segmentation, and information retrieval. Clustering mechanisms can be divided into different types, such as hard, soft, flat, hierarchical, and model-based clustering based on different criteria. Apache Mahout implements different clustering algorithms, which can be accessed sequentially or in parallel (using MapReduce). The K-Means algorithm is a simple and fast algorithm that is widely applied. However, there are situations that the K-Means algorithm will not be able to cater to. For such scenarios, Apache Mahout has implemented other algorithms, such as canopy, Fuzzy K-Means, streaming, and spectral clustering.

Resources for Article:

Further resources on this subject:

Apache Solr and Big Data – integration with MongoDB [Article]
Introduction to Apache ZooKeeper [Article]
Creating an Apache JMeter™ test workbench [Article]

Neural Network in Azure ML

Packt
15 Jun 2015
3 min read
In this article written by Sumit Mund, author of the book Microsoft Azure Machine Learning, we will learn about neural network, which is a kind of machine learning algorithm inspired by the computational models of a human brain. It builds a network of computation units, neurons, or nodes. In a typical network, there are three layers of nodes. First, the input layer, followed by the middle layer or hidden layer, and in the end, the output layer. Neural network algorithms can be used for both classification and regression problems. (For more resources related to this topic, see here.) The number of nodes in a layer depends on the problem and how you construct the network to get the best result. Usually, the number of nodes in an input layer is equal to the number of features in the dataset. For a regression problem, the number of nodes in the output layer is one while for a classification problem, it is equal to the number of class or label. Each node in a layer gets connected to all the nodes in the next layer. Each edge that connects between nodes is assigned a weight. So, a neural network can well be imagined as a weighted directed acyclic graph. In a typical neural network, as shown in the preceding figure, the middle layer or hidden layer contains the number nodes, which are chosen to make the computation right. While there is no formula or agreed convention for this, it is often optimized after trying out different options. Azure Machine Learning supports neural network for regression, two-class classification, and multiclass classification. It provides a separate module for each kind of problem and lets the users tune it with different parameters, such as the number of hidden nodes, number of iterations to train the model, and so on. A special kind of neural network algorithms where there are more than one hidden layers is known as deep networks or deep learning algorithms. Azure Machine Learning allows us to choose the number of hidden nodes as a property value of the neural network module. These kind of neural networks are getting increasingly popular these days because of their remarkable results and because they allow us to model complex and nonlinear scenarios. There are many kinds of deep networks, but recently, a special kind of deep network known as the convolutional neural network got very popular because of its significant performance in image recognition or classification problems. Azure Machine Learning supports the convolutional neural network. For simple networks with three layers, this can be done through a UI just by choosing parameters. However, to build a deep network like a convolutional deep network, it’s not easy to do so through a UI. So, Azure Machine Learning supports a new kind of language called Net#, which allows you to script different kinds of neural network inside ML Studio by defining different node, the connections (edges), and kind of connections. While deep networks are complex to build and train, Net# makes things relatively easy and simple. Though complex, neural networks are very powerful and Azure Machine Learning makes it fun to work with these be it three-layered shallow networks or multilayer deep networks. Resources for Article: Further resources on this subject: Security in Microsoft Azure [article] High Availability, Protection, and Recovery using Microsoft Azure [article] Managing Microsoft Cloud [article]

Linear Regression

Packt
12 Jun 2015
19 min read
In this article by Rui Miguel Forte, the author of the book, Mastering Predictive Analytics with R, we'll learn about linear regression. Regression problems involve predicting a numerical output. The simplest but most common type of regression is linear regression. In this article, we'll explore why linear regression is so commonly used, its limitations, and extensions. (For more resources related to this topic, see here.) Introduction to linear regression In linear regression, the output variable is predicted by a linearly weighted combination of input features. Here is an example of a simple linear model:   The preceding model essentially says that we are estimating one output, denoted by ˆy, and this is a linear function of a single predictor variable (that is, a feature) denoted by the letter x. The terms involving the Greek letter β are the parameters of the model and are known as regression coefficients. Once we train the model and settle on values for these parameters, we can make a prediction on the output variable for any value of x by a simple substitution in our equation. Another example of a linear model, this time with three features and with values assigned to the regression coefficients, is given by the following equation:   In this equation, just as with the previous one, we can observe that we have one more coefficient than the number of features. This additional coefficient, β0, is known as the intercept and is the expected value of the model when the value of all input features is zero. The other β coefficients can be interpreted as the expected change in the value of the output per unit increase of a feature. For example, in the preceding equation, if the value of the feature x1 rises by one unit, the expected value of the output will rise by 1.91 units. Similarly, a unit increase in the feature x3 results in a decrease of the output by 7.56 units. In a simple one-dimensional regression problem, we can plot the output on the y axis of a graph and the input feature on the x axis. In this case, the model predicts a straight-line relationship between these two, where β0 represents the point at which the straight line crosses or intercepts the y axis and β1 represents the slope of the line. We often refer to the case of a single feature (hence, two regression coefficients) as simple linear regression and the case of two or more features as multiple linear regression. Assumptions of linear regression Before we delve into the details of how to train a linear regression model and how it performs, we'll look at the model assumptions. The model assumptions essentially describe what the model believes about the output variable y that we are trying to predict. Specifically, linear regression models assume that the output variable is a weighted linear function of a set of feature variables. Additionally, the model assumes that for fixed values of the feature variables, the output is normally distributed with a constant variance. This is the same as saying that the model assumes that the true output variable y can be represented by an equation such as the following one, shown for two input features: Here, ε represents an error term, which is normally distributed with zero mean and constant variance σ2: We might hear the term homoscedasticity as a more formal way of describing the notion of constant variance. By homoscedasticity or constant variance, we are referring to the fact that the variance in the error component does not vary with the values or levels of the input features. 
In the following plot, we are visualizing a hypothetical example of a linear relationship with heteroskedastic errors, which are errors that do not have a constant variance. The data points lie close to the line at low values of the input feature, because the variance is low in this region of the plot, but lie farther away from the line at higher values of the input feature because of the higher variance. The ε term is an irreducible error component of the true function y and can be used to represent random errors, such as measurement errors in the feature values. When training a linear regression model, we always expect to observe some amount of error in our estimate of the output, even if we have all the right features, enough data, and the system being modeled really is linear. Put differently, even with a true function that is linear, we still expect that once we find a line of best fit through our training examples, our line will not go through all, or even any of our data points because of this inherent variance exhibited by the error component. The critical thing to remember, though, is that in this ideal scenario, because our error component has zero mean and constant variance, our training criterion will allow us to come close to the true values of the regression coefficients given a sufficiently large sample, as the errors will cancel out. Another important assumption relates to the independence of the error terms. This means that we do not expect the residual or error term associated with one particular observation to be somehow correlated with that of another observation. This assumption can be violated if observations are functions of each other, which is typically the result of an error in the measurement. If we were to take a portion of our training data, double all the values of the features and outputs, and add these new data points to our training data, we could create the illusion of having a larger data set; however, there will be pairs of observations whose error terms will depend on each other as a result, and hence our model assumption would be violated. Incidentally, artificially growing our data set in such a manner is never acceptable for any model. Similarly, correlated error terms may occur if observations are related in some way by an unmeasured variable. For example, if we are measuring the malfunction rate of parts from an assembly line, then parts from the same factory might have a correlation in the error, for example, due to different standards and protocols used in the assembly process. Therefore, if we don't use the factory as a feature, we may see correlated errors in our sample among observations that correspond to parts from the same factory. The study of experimental design is concerned with identifying and reducing correlations in error terms, but this is beyond the scope of this book. Finally, another important assumption concerns the notion that the features themselves are statistically independent of each other. It is worth clarifying here that in linear models, although the input features must be linearly weighted, they themselves may be the output of another function. To illustrate this, one may be surprised to see that the following is a linear model of three features, sin(z1), ln(z2), and exp(z3): We can see that this is a linear model by making a few transformations on the input features and then making the replacements in our model: Now, we have an equation that is more recognizable as a linear regression model. 
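To see what this transformation trick looks like in practice, the same idea can be expressed directly in an R model formula, because R lets us apply functions to features inside the formula itself. The snippet below is only an illustrative sketch and is not part of the book's own listings; the data frame mydata and the column names z1, z2, and z3 are assumed for the example (R's log() is the natural logarithm, matching ln in the text):

> # The features enter through fixed transformations, but the model is still
> # linear because the coefficients remain simple linear weights
> myTransformedFit <- lm(y ~ sin(z1) + log(z2) + exp(z3), data = mydata)
> summary(myTransformedFit)

The coefficients reported by summary() are the β values that multiply sin(z1), log(z2), and exp(z3), which is exactly what makes this a linear model despite the nonlinear-looking features.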
If the previous example made us believe that nearly everything could be transformed into a linear model, then the following two examples will emphatically convince us that this is not in fact the case: Both models are not linear models because of the first regression coefficient (β1). The first model is not a linear model because β1 is acting as the exponent of the first input feature. In the second model, β1 is inside a sine function. The important lesson to take away from these examples is that there are cases where we can apply transformations on our input features in order to fit our data to a linear model; however, we need to be careful that our regression coefficients are always the linear weights of the resulting new features.

Simple linear regression

Before looking at some real-world data sets, it is very helpful to try to train a model on artificially generated data. In an artificial scenario such as this, we know what the true output function is beforehand, something that as a rule is not the case when it comes to real-world data. The advantage of performing this exercise is that it gives us a good idea of how our model works under the ideal scenario when all of our assumptions are fully satisfied, and it helps visualize what happens when we have a good linear fit. We'll begin by simulating a simple linear regression model. The following R snippet is used to create a data frame with 100 simulated observations of the following linear model with a single input feature, y = 1.67x1 − 2.93 + ε. Here is the code for the simple linear regression model:

> set.seed(5427395)
> nObs = 100
> x1minrange = 5
> x1maxrange = 25
> x1 = runif(nObs, x1minrange, x1maxrange)
> e = rnorm(nObs, mean = 0, sd = 2.0)
> y = 1.67 * x1 - 2.93 + e
> df = data.frame(y, x1)

For our input feature, we randomly sample points from a uniform distribution. We used a uniform distribution to get a good spread of data points. Note that our final df data frame is meant to simulate a data frame that we would obtain in practice, and as a result, we do not include the error terms, as these would be unavailable to us in a real-world setting. When we train a linear model using some data such as those in our data frame, we are essentially hoping to produce a linear model with the same coefficients as the ones from the underlying model of the data. Put differently, the original coefficients define a population regression line. In this case, the population regression line represents the true underlying model of the data. In general, we will find ourselves attempting to model a function that is not necessarily linear. In this case, we can still define the population regression line as the best possible linear regression line, but a linear regression model will obviously not perform equally well.

Estimating the regression coefficients

For our simple linear regression model, the process of training the model amounts to an estimation of our two regression coefficients from our data set. As we can see from our previously constructed data frame, our data is effectively a series of observations, each of which is a pair of values (xi, yi) where the first element of the pair is the input feature value and the second element of the pair is its output label. It turns out that for the case of simple linear regression, it is possible to write down two equations that can be used to compute our two regression coefficients.
Instead of merely presenting these equations, we'll first take a brief moment to review some very basic statistical quantities that the reader has most likely encountered previously, as they will be featured very shortly. The mean of a set of values is just the average of these values and is often described as a measure of location, giving a sense of where the values are centered on the scale in which they are measured. In statistical literature, the average value of a random variable is often known as the expectation, so we often find that the mean of a random variable X is denoted as E(X). Another notation that is commonly used is bar notation, where we can represent the notion of taking the average of a variable by placing a bar over that variable. To illustrate this, the following two equations show the mean of the output variable y and input feature x: A second very common quantity, which should also be familiar, is the variance of a variable. The variance measures the average square distance that individual values have from the mean. In this way, it is a measure of dispersion, so that a low variance implies that most of the values are bunched up close to the mean, whereas a higher variance results in values that are spread out. Note that the definition of variance involves the definition of the mean, and for this reason, we'll see the use of the x variable with a bar on it in the following equation that shows the variance of our input feature x: Finally, we'll define the covariance between two random variables x and y using the following equation: From the previous equation, it should be clear that the variance, which we just defined previously, is actually a special case of the covariance where the two variables are the same. The covariance measures how strongly two variables are correlated with each other and can be positive or negative. A positive covariance implies a positive correlation; that is, when one variable increases, the other will increase as well. A negative covariance suggests the opposite; when one variable increases, the other will tend to decrease. When two variables are statistically independent of each other and hence uncorrelated, their covariance will be zero (although it should be noted that a zero covariance does not necessarily imply statistical independence). Armed with these basic concepts, we can now present equations for the estimates of the two regression coefficients for the case of simple linear regression: The first regression coefficient can be computed as the ratio of the covariance between the output and the input feature, and the variance of the input feature. Note that if the output feature were to be independent of the input feature, the covariance would be zero and therefore, our linear model would consist of a horizontal line with no slope. In practice, it should be noted that even when two variables are statistically independent, we will still typically see a small degree of covariance due to the random nature of the errors, so if we were to train a linear regression model to describe their relationship, our first regression coefficient would be nonzero in general. Later, we'll see how significance tests can be used to detect features we should not include in our models. To implement linear regression in R, it is not necessary to perform these calculations as R provides us with the lm() function, which builds a linear regression model for us. 
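Before relying on lm(), it is worth seeing that these two estimates translate directly into a couple of lines of R. The following is just a sketch for checking the formulas by hand against the simulated df data frame; it is not part of the book's own listing, and the intercept formula (mean of the output minus the slope times the mean of the feature) is the standard least-squares result:

> # Slope estimate: covariance of feature and output divided by the feature variance
> beta1Hat <- cov(df$x1, df$y) / var(df$x1)
> # Intercept estimate: mean(y) - slope * mean(x)
> beta0Hat <- mean(df$y) - beta1Hat * mean(df$x1)
> c(beta0Hat, beta1Hat)

These hand-computed values should agree with the coefficients that lm() reports in the next code sample, up to rounding.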
The following code sample uses the df data frame we created previously and calculates the regression coefficients:

> myfit <- lm(y~x1, df)
> myfit

Call:
lm(formula = y ~ x1, data = df)

Coefficients:
(Intercept)           x1
     -2.380        1.641

In the first line, we see that the usage of the lm() function involves first specifying a formula and then following up with the data parameter, which in our case is our data frame. For the case of simple linear regression, the syntax of the formula that we specify for the lm() function is the name of the output variable, followed by a tilde (~) and then by the name of the single input feature. Finally, the output shows us the values for the two regression coefficients. Note that the β0 coefficient is labeled as the intercept, and the β1 coefficient is labeled by the name of the corresponding feature (in this case, x1) in the equation of the linear model. The following graph shows the population line and the estimated line on the same plot: As we can see, the two lines are so close to each other that they are barely distinguishable, showing that the model has estimated the true population line very closely.

Multiple linear regression

Whenever we have more than one input feature and want to build a linear regression model, we are in the realm of multiple linear regression. The general equation for a multiple linear regression model with k input features is ŷ = β0 + β1x1 + β2x2 + … + βkxk. Our assumptions about the model and about the error component ε remain the same as with simple linear regression, remembering that as we now have more than one input feature, we assume that these are independent of each other. Instead of using simulated data to demonstrate multiple linear regression, we will analyze two real-world data sets.

Predicting CPU performance

Our first real-world data set was presented by the researchers Dennis F. Kibler, David W. Aha, and Marc K. Albert in a 1989 paper titled Instance-based prediction of real-valued attributes and published in Journal of Computational Intelligence. The data contain the characteristics of different CPU models, such as the cycle time and the amount of cache memory. When deciding between processors, we would like to take all of these things into account, but ideally, we'd like to compare processors on a single numerical scale. For this reason, we often develop programs to benchmark the relative performance of a CPU. Our data set also comes with the published relative performance of our CPUs and our objective will be to use the available CPU characteristics as features to predict this. The data set can be obtained online from the UCI Machine Learning Repository via this link: http://archive.ics.uci.edu/ml/datasets/Computer+Hardware.

The UCI Machine Learning Repository is a wonderful online resource that hosts a large number of data sets, many of which are often cited by authors of books and tutorials. It is well worth the effort to familiarize yourself with this website and its data sets. A very good way to learn predictive analytics is to practice using the techniques you learn in this book on different data sets, and the UCI repository provides many of these for exactly this purpose.

The machine.data file contains all our data in a comma-separated format, with one line per CPU model. We'll import this in R and label all the columns. Note that there are 10 columns in total, but we don't need the first two for our analysis, as these are just the brand and model name of the CPU.
Similarly, the final column is a predicted estimate of the relative performance that was produced by the researchers themselves; our actual output variable, PRP, is in column 9. We'll store the data that we need in a data frame called machine:

> machine <- read.csv("machine.data", header = F)
> names(machine) <- c("VENDOR", "MODEL", "MYCT", "MMIN", "MMAX", "CACH", "CHMIN", "CHMAX", "PRP", "ERP")
> machine <- machine[, 3:9]
> head(machine, n = 3)
  MYCT MMIN  MMAX CACH CHMIN CHMAX PRP
1  125  256  6000  256    16   128 198
2   29 8000 32000   32     8    32 269
3   29 8000 32000   32     8    32 220

The data set also comes with the definition of the data columns:

Column name   Definition
MYCT          The machine cycle time in nanoseconds
MMIN          The minimum main memory in kilobytes
MMAX          The maximum main memory in kilobytes
CACH          The cache memory in kilobytes
CHMIN         The minimum channels in units
CHMAX         The maximum channels in units
PRP           The published relative performance (our output variable)

The data set contains no missing values, so no observations need to be removed or modified. One thing that we'll notice is that we only have roughly 200 data points, which is generally considered a very small sample. Nonetheless, we will proceed with splitting our data into a training set and a test set, with an 85-15 split, as follows:

> library(caret)
> set.seed(4352345)
> machine_sampling_vector <- createDataPartition(machine$PRP, p = 0.85, list = FALSE)
> machine_train <- machine[machine_sampling_vector,]
> machine_train_features <- machine[, 1:6]
> machine_train_labels <- machine$PRP[machine_sampling_vector]
> machine_test <- machine[-machine_sampling_vector,]
> machine_test_labels <- machine$PRP[-machine_sampling_vector]

Now that we have our data set up and running, we'd usually want to investigate further and check whether some of our assumptions for linear regression are valid. For example, we would like to know whether we have any highly correlated features. To do this, we can construct a correlation matrix with the cor() function and use the findCorrelation() function from the caret package to get suggestions for which features to remove:

> machine_correlations <- cor(machine_train_features)
> findCorrelation(machine_correlations)
integer(0)
> findCorrelation(machine_correlations, cutoff = 0.75)
[1] 3
> cor(machine_train$MMIN, machine_train$MMAX)
[1] 0.7679307

Using the default cutoff of 0.9 for a high degree of correlation, we found that none of our features should be removed. When we reduce this cutoff to 0.75, we see that caret recommends that we remove the third feature (MMAX). As the final line of the preceding code shows, the degree of correlation between this feature and MMIN is 0.768. While the value is not very high, it is still high enough to cause us a certain degree of concern that this will affect our model. Intuitively, of course, if we look at the definitions of our input features, we will certainly tend to expect that a model with a relatively high value for the minimum main memory will also be likely to have a relatively high value for the maximum main memory. Linear regression can sometimes still give us a good model with correlated variables, but we would expect to get better results if our variables were uncorrelated. For now, we've decided to keep all our features for this data set.

Summary

In this article, we studied linear regression, a method that allows us to fit a linear model in a supervised learning setting where we have a number of input features and a single numeric output.
Simple linear regression is the name given to the scenario where we have only one input feature, and multiple linear regression describes the case where we have multiple input features. Linear regression is very commonly used as a first approach to solving a regression problem. It assumes that the output is a linear weighted combination of the input features in the presence of an irreducible error component that is normally distributed and has zero mean and constant variance. The model also assumes that the features are independent.

The Splunk Web Framework

Packt
04 Jun 2015
10 min read
In this article by the author, Kyle Smith, of the book, Splunk Developer's Guide, we learn about search-related and view-related modules. We will be covering the following topics: Search-related modules View-related modules (For more resources related to this topic, see here.) Search-related modules Let's talk JavaScript modules. For each module, we will review their primary purpose, their module path, the default variable used in an HTML dashboard, and the JavaScript instantiation of the module. We will also cover which attributes are required and which are optional. SearchManager The SearchManager is a primary driver of any dashboard. This module contains an entire search job, including the query, properties, and the actual dispatch of the job. Let's instantiate an object, and dissect the options from this sample code: Module Path: splunkjs/mvc/searchmanager Default Variable: SearchManager JavaScript Object instantiation    Var mySearchManager = new SearchManager({        id: "search1",        earliest_time: "-24h@h",        latest_time: "now",        preview: true,        cache: false,        search: "index=_internal | stats count by sourcetype"    }, {tokens: true, tokenNamespace: "submitted"}); The only required property is the id property. This is a reference ID that will be used to access this object from other instantiated objects later in the development of the page. It is best to name it something concise, yet descriptive with no spaces. The search property is optional, and contains the SPL query that will be dispatched from the module. Make sure to escape any quotes properly, if not, you may cause a JavaScript exception. earliest_time and latest_time are time modifiers that restrict the range of the events. At the end of the options object, notice the second object with token references. This is what automatically executes the search. Without these options, you would have to trigger the search manually. There are a few other properties shown, but you can refer to the actual documentation at the main documentation page http://docs.splunk.com/DocumentationStatic/WebFramework/1.1/compref_searchmanager.html. SearchManagers are set to autostart on page load. To prevent this, set autostart to false in the options. SavedSearchManager The SavedSearchManager is very similar in operation to the SearchManager, but works with a saved report, instead of an ad hoc query. The advantage to using a SavedSearchManager is in performance. If the report is scheduled, you can configure the SavedSearchManager to use the previously run jobs to load the data. If any other user runs the report within Splunk, the SavedSearchManager can reuse that user's results in the manager to boost performance. Let's take a look at a few sections of our code: Module Path: splunkjs/mvc/savedsearchmanager Default Variable: SavedSearchManager JavaScript Object instantiation        Var mySavedSearchManager = new SavedSearchManager({            id: "savedsearch1",        searchname: "Saved Report 1"            "dispatch.earliest_time": "-24h@h",            "dispatch.latest_time": "now",            preview: true,            cache: true        }); The only two required properties are id and searchname. Both of those must be present in order for this manager to run correctly. The other options are very similar to the SearchManager, except for the dispatch options. The SearchManager has the option "earliest_time", whereas the SavedSearchManager uses the option "dispatch.earliest_time". 
They both have the same restriction but are named differently. The additional options are listed in the main documentation page available at http://docs.splunk.com/DocumentationStatic/WebFramework/1.1/compref_savedsearchmanager.html. PostProcessManager The PostProcessManager does just that, post processes the results of a main search. This works in the same way as the post processing done in SimpleXML; a main search to load the event set, and a secondary search to perform an additional analysis and transformation. Using this manager has its own performance considerations as well. By loading a single job first, and then performing additional commands on those results, you avoid having concurrent searches for the same information. Your usage of CPU and RAM will be less, as you only store one copy of the results, instead of multiple. Module Path: splunkjs/mvc/postprocessmanager Default Variable: PostProcessManager JavaScript Object instantiation        Var mysecondarySearch = new PostProcessManager({            id: "after_search1",        search: "stats count by sourcetype",    managerid: "search1"        }); The property id is the only required property. The module won't do anything when instantiated with only an id property, but you can set it up to populate later. The other options are similar to the SearchManager, the major difference being that the search property in this case is appended to the search property of the manager listed in the managerid property. For example, if the manager search is search index=_internal source=*splunkd.log, and the post process manager search is stats count by host, then the entire search for the post process manager is search index=_internal source=*splunkd.log | stats count by host. The additional options are listed at the main documentation page http://docs.splunk.com/DocumentationStatic/WebFramework/1.1/compref_postprocessmanager.html. View-related modules These modules are related to the views and data visualizations that are native to Splunk. They range in use from charts that display data, to control groups, such as radio groups or dropdowns. These are also included with Splunk and are included by default in the RequireJS declaration. ChartView The ChartView displays a series of data in the formats in the list as follows. Item number one shows an example of how each different chart is described and presented. Each ChartView is instantiated in the same way, the only difference is in what searches are used with which chart. Module Path: splunkjs/mvc/chartview Default Variable: ChartView JavaScript Object instantiation        Var myBarChart = new ChartView({            id: "myBarChart",             managerid: "searchManagerId",            type: "bar",            el: $("#mybarchart")        }); The only required property is the id property. This assigns the object an id that can be later referenced as needed. The el option refers to the HTML element in the page that this view will be assigned and created within. The managerid relates to an existing search, saved search, or post process manager object. The results are passed from the manager into the chart view and displayed as indicated. Each chart view can be customized extensively using the charting.* properties. For example, charting.chart.overlayFields, when set to a comma separated list of field names, will overlay those fields over the chart of other data, making it possible to display SLA times over the top of Customer Service Metrics. 
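To show how one of these charting.* properties is actually supplied, it can be passed alongside the other options when the view is created. The snippet below is only a sketch for illustration: the element, manager ID, and the SLA field name are assumed values rather than ones taken from this article, so verify the exact option keys against the Splunk documentation for your version.

        var myOverlayChart = new ChartView({
            id: "myOverlayChart",
            managerid: "search1",
            type: "line",
            // charting.* properties are passed as ordinary option keys
            "charting.chart.overlayFields": "SLA",
            el: $("#myoverlaychart")
        });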
The full list of configurable options can be found at the following link: http://docs.splunk.com/Documentation/Splunk/latest/Viz/ChartConfigurationReference. The different types of ChartView Now that we've introduced the ChartView module, let's look at the different types of charts that are built-in. This section has been presented in the following format: Name of Chart Short description of the chart type Type property for use in the JavaScript configuration Example chart command that can be displayed with this chart type Example image of the chart The different ChartView types we will cover in this section include the following: Area The area chart is similar to the line chart, and compares quantitative data. The graph is filled with color to show volume. This is commonly used to show statistics of data over time. An example of an area chart is as follows: timechart span=1h max(results.collection1{}.meh_clicks) as MehClicks max(results.collection1{}.visitors) as Visits Bar The bar chart is similar to the column chart, except that the x and y axes have been switched, and the bars run horizontally and not vertically. The bar chart is used to compare different categories. An example of a bar chart is as follows: stats max(results.collection1{}.visitors) as Visits max(results.collection1{}.meh_clicks) as MehClicks by results.collection1{}.title.text Column The column chart is similar to the bar chart, but the bars are displayed vertically. An example of a column chart is as follows: timechart span=1h avg(DPS) as "Difference in Products Sold" Filler gauge The filler gauge is a Splunk-provided visualization. It is intended for single values, normally as a percentage, but can be adjusted to use discrete values as well. The gauge uses different colors for different ranges of values, by default using green, yellow, and red, in that order. These colors can also be changed using the charting.* properties. One of the differences between this gauge and the other single value gauges is that it shows both the color and value close together, whereas the others do not. An example of a filler gauge chart is as follows: eval diff = results.collection1{}.meh_clicks / results.collection1{}.visitors * 100 | stats latest(diff) as D Line The line chart is similar to the area chart but does not fill the area under the line. This chart can be used to display discrete measurements over time. An example of a line chart is as follows: timechart span=1h max(results.collection1{}.meh_clicks) as MehClicks max(results.collection1{}.visitors) as Visits Marker gauge The marker gauge is a Splunk native visualization intended for use with a single value. Normally this will be a percentage of a value, but can be adjusted as needed. The gauge uses different colors for different ranges of values, by default using green, yellow, and red, in that order. These colors can also be changed using the charting.* properties. An example of a marker gauge chart is as follows: eval diff = results.collection1{}.meh_clicks / results.collection1{}.visitors * 100 | stats latest(diff) as D Pie Chart A pie chart is useful for displaying percentages. It gives you the ability to quickly see which part of the "pie" is disproportionate to the others. Actual measurements may not be relevant. An example of a pie chart is as follows: top op_action Radial gauge The radial gauge is another single value chart provided by Splunk. It is normally used to show percentages, but can be adjusted to show discrete values. 
The gauge uses different colors for different ranges of values, by default using green, yellow, and red, in that order. These colors can also be changed using the charting.* properties. An example of a radial gauge is as follows: eval diff = MC / V * 100 | stats latest(diff) as D Scatter The scatter plot can plot two sets of data on an x and y axis chart (Cartesian coordinates). This chart is primarily time independent, and is useful for finding correlations (but not necessarily causation) in data. An example of a scatter plot is as follows: table MehClicks Visitors Summary We covered some deeper elements of Splunk applications and visualizations. We reviewed each of the SplunkJS modules, how to instantiate them, and gave an example of each search-related modules and view-related modules. Resources for Article: Further resources on this subject: Introducing Splunk [article] Lookups [article] Loading data, creating an app, and adding dashboards and reports in Splunk [article]

Plotting in Haskell

Packt
04 Jun 2015
10 min read
In this article by James Church, author of the book Learning Haskell Data Analysis, we will see the different methods of data analysis by plotting data using Haskell. The other topics that this article covers is using GHCi, scaling data, and comparing stock prices. (For more resources related to this topic, see here.) Can you perform data analysis in Haskell? Yes, and you might even find that you enjoy it. We are going to take a few snippets of Haskell and put some plots of the stock market data together. To get started with, the following software needs to be installed: The Haskell platform (http://www.haskell.org/platform) Gnuplot (http://www.gnuplot.info/) The cabal command-line tool is the tool used to install packages in Haskell. There are three packages that we may need in order to analyze the stock market data. To use cabal, you will use the cabal install [package names] command. Run the following command to install the CSV parsing package, the EasyPlot package, and the Either package: $ cabal install csv easyplot either Once you have the necessary software and packages installed, we are all set for some introductory analysis in Haskell. We need data It is difficult to perform an analysis of data without data. The Internet is rich with sources of data. Since this tutorial looks at the stock market data, we need a source. Visit the Yahoo! Finance website to find the history of every publicly traded stock on the New York Stock Exchange that has been adjusted to reflect splits over time. The good folks at Yahoo! provide this resource in the csv file format. We begin with downloading the entire history of the Apple company from Yahoo! Finance (http://finance.yahoo.com). You can find the content for Apple by performing a quote look up from the Yahoo! Finance home page for the AAPL symbol (that is, 2 As, not 2 Ps). On this page, you can find the link for Historical Prices. On the Historical Prices page, identify the link that says Download to Spreadsheet. The complete link to Apple's historical prices can be found at the following link: http://real-chart.finance.yahoo.com/table.csv?s=AAPL. We should take a moment to explore our dataset. Here are the column headers in the csv file: Date: This is a string that represents the date of a particular date in Apple's history Open: This is the opening value of one share High: This is the high trade value over the course of this day Low: This is the low trade value of the course of this day Close: This is the final price of the share at the end of this trading day Volume: This is the total number of shares traded on this day Adj Close: This is a variation on the closing price that adjusts the dividend payouts and company splits Another feature of this dataset is that each of the rows are written in a table in a chronological reverse order. The most recent date in the table is the first. The oldest is the last. Yahoo! Finance provides this table (Apple's historical prices) under the unhelpful name table.csv. I renamed my csv file aapl.csv, which is provided by Yahoo! Finance. Start GHCi The interactive prompt for Haskell is GHCi. On the command line, type GHCi. We begin with importing our newly installed libraries from the prompt: > import Data.List< > import Text.CSV< > import Data.Either.Combinators< > import Graphics.EasyPlot Parse the csv file that you just downloaded using the parseCSVFromFile command. 
This command will return an Either type, which represents one of the two things that happened: your file was parsed (Right) or something went wrong (Left). We can inspect the type of our result with the :t command:

> eitherErrorOrCells <- parseCSVFromFile "aapl.csv"
> :t eitherErrorOrCells
eitherErrorOrCells :: Either Text.Parsec.Error.ParseError CSV

Did we get an error for our result? For this, we are going to use the fromRight and fromLeft commands. Remember, Right is right and Left is wrong. When we run the fromLeft command, we should see this message saying that our content is in the Right:

> fromLeft' eitherErrorOrCells
*** Exception: Data.Either.Combinators.fromLeft: Argument takes form 'Right _'

Pull the cells of our csv file into cells. We can see the first four rows of our content using take 5 (which will pull our header line and the first four cells):

> let cells = fromRight' eitherErrorOrCells
> take 5 cells
[["Date","Open","High","Low","Close","Volume","Adj Close"],["2014-11-10","552.40","560.63","551.62","558.23","1298900","558.23"],["2014-11-07","555.60","555.60","549.35","551.82","1589100","551.82"],["2014-11-06","555.50","556.80","550.58","551.69","1649900","551.69"],["2014-11-05","566.79","566.90","554.15","555.95","1645200","555.95"]]

The last column in our csv file is the Adj Close, which is the column we would like to plot. Count the columns (starting with 0), and you will find that Adj Close is number 6. Everything else can be dropped. (Here, we are also using the init function to drop the last row of the data, which is an empty list. Grabbing the 6th element of an empty list will not work in Haskell.):

> map (\x -> x !! 6) (take 5 (init cells))
["Adj Close","558.23","551.82","551.69","555.95"]

We know that this column represents the adjusted close prices. We should drop our header row. Since we use tail to drop the header row, take 5 returns the first five adjusted close prices:

> map (\x -> x !! 6) (take 5 (tail (init cells)))
["558.23","551.82","551.69","555.95","564.19"]

We should store all of our adjusted close prices in a value called adjCloseAAPLOriginal:

> let adjCloseAAPLOriginal = map (\x -> x !! 6) (tail (init cells))

These are still raw strings. We need to convert these to a Double type with the read function:

> let adjCloseAAPL = map read adjCloseAAPLOriginal :: [Double]

We are almost done massaging our data. We need to make sure that every value in adjCloseAAPL is paired with an index position for the purpose of plotting. Remember that our adjusted closes are in a chronological reverse order. This will create a tuple, which can be passed to the plot function:

> let aapl = zip (reverse [1.0..genericLength adjCloseAAPL]) adjCloseAAPL
> take 5 aapl
[(2577,558.23),(2576,551.82),(2575,551.69),(2574,555.95),(2573,564.19)]

Plotting

> plot (PNG "aapl.png") $ Data2D [Title "AAPL"] [] aapl
True

The following chart is the result of the preceding command:

Open aapl.png, which should be newly created in your current working directory. This is a typical default chart created by EasyPlot. We can see the entire history of the Apple stock price. For most of this history, the adjusted share price was less than $10 per share. At about the 6,000 trading day, we see the quick ascension of the share price to over $100 per share. Most of the time, when we take a look at a share price, we are only interested in the tail portion (say, the last year of changes). Our data is already reversed, so the newest close prices are at the front.
There are 252 trading days in a year, so we can take the first 252 elements in our value and plot them. While we are at it, we are going to change the style of the plot to a line plot:

> let aapl252 = take 252 aapl
> plot (PNG "aapl_oneyear.png") $ Data2D [Title "AAPL", Style Lines] [] aapl252
True

The following chart is the result of the preceding command:

Scaling data

Looking at the share price of a single company over the course of a year will tell you whether the price is trending upward or downward. While this is good, we can get better information about the growth by scaling the data. To scale a dataset to reflect the percent change, we subtract each value by the first element in the list, divide that by the first element, and then multiply by 100. Here, we create a simple function called percentChange. We then scale the values 100 to 105, using this new function. (Using the :t command is not necessary, but I like to use it to make sure that I have at least the desired type signature correct.):

> let percentChange first value = 100.0 * (value - first) / first
> :t percentChange
percentChange :: Fractional a => a -> a -> a
> map (percentChange 100) [100..105]
[0.0,1.0,2.0,3.0,4.0,5.0]

We will use this new function to scale our Apple dataset. Our tuple of values can be split using the fst (for the first value containing the index) and snd (for the second value containing the adjusted close) functions:

> let firstValue = snd (last aapl252)
> let aapl252scaled = map (\pair -> (fst pair, percentChange firstValue (snd pair))) aapl252
> plot (PNG "aapl_oneyear_pc.png") $ Data2D [Title "AAPL PC", Style Lines] [] aapl252scaled
True

The following chart is the result of the preceding command:

Let's take a look at the preceding chart. Notice that it looks identical to the one we just made, except that the y axis is now changed. The values on the left-hand side of the chart are now the fluctuating percent changes of the stock from a year ago. To the investor, this information is more meaningful.

Comparing stock prices

Every publicly traded company has a different stock price. When you hear that Company A has a share price of $10 and Company B has a price of $100, there is almost no meaningful content to this statement. We can arrive at a meaningful analysis by plotting the scaled history of the two companies on the same plot. Our Apple dataset uses an index position of the trading day for the x axis. This is fine for a single plot, but in order to combine plots, we need to make sure that all plots start at the same index. In order to prepare our existing data of Apple stock prices, we will adjust our index variable to begin at 0:

> let firstIndex = fst (last aapl252scaled)
> let aapl252scaled = map (\pair -> (fst pair - firstIndex, percentChange firstValue (snd pair))) aapl252

We will compare Apple to Google. Google uses the symbol GOOGL (spelled Google without the e). I downloaded the history of Google from Yahoo! Finance and performed the same steps that I previously wrote with our Apple dataset:

> -- Prep Google for analysis
> eitherErrorOrCells <- parseCSVFromFile "googl.csv"
> let cells = fromRight' eitherErrorOrCells
> let adjCloseGOOGLOriginal = map (\x -> x !! 6) (tail (init cells))
Comparing stock prices

Every publicly traded company has a different stock price. When you hear that Company A has a share price of $10 and Company B has a price of $100, there is almost no meaningful content to this statement. We can arrive at a meaningful analysis by plotting the scaled histories of the two companies on the same plot. Our Apple dataset uses the index position of the trading day for the x axis. This is fine for a single plot, but in order to combine plots, we need to make sure that all plots start at the same index. To prepare our existing Apple stock prices, we will adjust our index values to begin at 0:

> let firstIndex = fst (last aapl252scaled)
> let aapl252scaled = map (\pair -> (fst pair - firstIndex, percentChange firstValue (snd pair))) aapl252

We will compare Apple to Google. Google trades under the symbol GOOGL (spelled Google without the e). I downloaded the history of Google from Yahoo! Finance and performed the same steps that we just applied to our Apple dataset:

> -- Prep Google for analysis
> eitherErrorOrCells <- parseCSVFromFile "googl.csv"
> let cells = fromRight' eitherErrorOrCells
> let adjCloseGOOGLOriginal = map (\x -> x !! 6) (tail (init cells))
> let adjCloseGOOGL = map read adjCloseGOOGLOriginal :: [Double]
> let googl = zip (reverse [1.0..genericLength adjCloseGOOGL]) adjCloseGOOGL
> let googl252 = take 252 googl
> let firstValue = snd (last googl252)
> let firstIndex = fst (last googl252)
> let googl252scaled = map (\pair -> (fst pair - firstIndex, percentChange firstValue (snd pair))) googl252

Now, we can plot the share prices of Apple and Google on the same chart, with Apple plotted in red and Google plotted in blue:

> plot (PNG "aapl_googl.png") [Data2D [Title "AAPL PC", Style Lines, Color Red] [] aapl252scaled, Data2D [Title "GOOGL PC", Style Lines, Color Blue] [] googl252scaled]
True

The following chart is the result of the preceding command:

You can compare the growth rates of the stock prices of these two competing companies for yourself; I believe the contrast is enough to let the image speak for itself. This type of analysis is useful in the investment strategy known as growth investing. I am not recommending this as a strategy, nor am I recommending either of these two companies as an investment. I am recommending Haskell as your language of choice for performing data analysis.

Summary

In this article, we parsed data from a csv file and plotted it. The other topics covered in this article were using GHCi and EasyPlot for plotting, scaling data, and comparing stock prices.

Resources for Article:

Further resources on this subject:
- The Hunt for Data [article]
- Getting started with Haskell [article]
- Driving Visual Analyses with Automobile Data (Python) [article]
Predicting Hospital Readmission Expense Using Cascading

Packt
04 Jun 2015
10 min read
In this article by Michael Covert, author of the book Learning Cascading, we will look at a system that allows health care providers to create complex predictive models that can assess who is most at risk of readmission, using Cascading.

Overview

Hospital readmission is an event that health care providers are attempting to reduce, and it is the primary target of new regulations of the Affordable Care Act, passed by the US government. A readmission is defined as any reentry to a hospital within 30 days of a prior discharge. The financial impact of this is that US Medicare and Medicaid will either not pay or will reduce the payment made to hospitals for expenses incurred. By the end of 2014, over 2,600 hospitals will incur these losses from a Medicare and Medicaid tab that is thought to exceed $24 billion annually. Hospitals are seeking ways to predict when a patient is susceptible to readmission so that actions can be taken to fully treat the patient before discharge. Many of them are using big data and machine learning-based predictive analytics.

One such predictive engine is MedPredict from Analytics Inside, a company based in Westerville, Ohio. MedPredict is the predictive modeling component of the MedMiner suite of health care products. These products use Concurrent's Cascading products to perform nightly rescoring of inpatients using a highly customizable calculation known as LACE, which stands for the following:

- Length of stay: This refers to the number of days a patient has been in the hospital.
- Acute admissions through the emergency department: This refers to whether a patient has arrived through the ER.
- Comorbidities: A comorbidity refers to the presence of two or more individual conditions in a patient. Each condition is designated by a diagnosis code, and diagnosis codes can also indicate complications and the severity of a condition. In LACE, certain conditions are associated with the probability of readmission through statistical analysis. For instance, a diagnosis of AIDS, COPD, diabetes, and so on will each increase the probability of readmission. So, each diagnosis code is assigned points, with other points indicating the "seriousness" of the condition.
  - Diagnosis codes refer to the International Classification of Disease codes; version 9 (ICD-9) and now version 10 (ICD-10) standards are available.
- Emergency visits: This refers to the number of emergency room visits the patient has made in a particular window of time.

The LACE engine looks at a patient's history and computes a score that is a predictor of readmission. In order to compute the comorbidity score, the Charlson Comorbidity Index (CCI) calculation is used. It is a statistical calculation that factors in the patient's age and the complexity of the patient's condition.
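To make the scoring concrete, here is a minimal sketch of a LACE-style calculation, written in Haskell for consistency with the other code in this collection. The data type and the point tables are illustrative assumptions only (they follow commonly published LACE assignments); they are not MedPredict's customizable weights, and the real engine derives the comorbidity component from the patient's full diagnosis history rather than from a precomputed CCI value:

-- Illustrative sketch only: the point values below are assumptions,
-- not the hospital-customizable weights used by MedPredict.
data PatientSummary = PatientSummary
  { lengthOfStayDays :: Int   -- L: days in hospital for this admission
  , acuteAdmission   :: Bool  -- A: admitted through the emergency department
  , charlsonScore    :: Int   -- C: Charlson Comorbidity Index
  , emergencyVisits  :: Int   -- E: ER visits in the chosen lookback window
  }

laceScore :: PatientSummary -> Int
laceScore p = lPoints + aPoints + cPoints + ePoints
  where
    lPoints = case lengthOfStayDays p of
                d | d < 1     -> 0
                  | d <= 3    -> d          -- 1, 2, or 3 points
                  | d <= 6    -> 4
                  | d <= 13   -> 5
                  | otherwise -> 7
    aPoints = if acuteAdmission p then 3 else 0
    cPoints = let c = charlsonScore p in if c >= 4 then 5 else c
    ePoints = min 4 (emergencyVisits p)     -- capped at 4 points

A higher laceScore indicates a higher estimated risk of readmission; as described below, the Cascading LACE engine makes these point assignments adjustable at the diagnosis code level and per hospital.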
Using Cascading to control predictive modeling

The full data workflow to compute the probability of readmission is as follows:

1. Read all hospital records and reformat them into patient records, diagnosis records, and discharge records.
2. Read all data related to patient diagnoses and diagnosis records, that is, ICD-9/10 codes, date of diagnosis, complications, and so on.
3. Read all tracked diagnosis records and join them with patient data to produce a diagnosis (comorbidity) score by summing up comorbidity "points".
4. Read all data related to patient admissions, that is, records associated with admission and discharge, length of stay, hospital, admittance location, stay type, and so on.
5. Read the patient profile record, that is, age, race, gender, ethnicity, eye color, body mass index, and so on.
6. Compute all intermediate scores for age, emergency visits, and comorbidities.
7. Calculate the LACE score (refer to Figure 2) and assign a date and time to it.
8. Take all the patient information mentioned in the preceding points and run it through MedPredict to produce a variety of metrics:
   - Expected length of stay
   - Expected expense
   - Expected outcome
   - Probability of readmission

Figure 1 – The data workflow

The Cascading LACE engine

The calculational aspects of computing LACE scores make it ideal for Cascading as a series of reusable subassemblies. Firstly, the extraction, transformation, and loading (ETL) of patient data is complex and costly. Secondly, the calculations are data-intensive. The CCI alone has to examine a patient's medical history and must find all matching diagnosis codes (such as ICD-9 or ICD-10) to assign a score. This score must be augmented by the patient's age, and lastly, the patient's inpatient discharge records must be examined for admittance through the ER as well as for emergency room visits. Also, many hospitals want to customize these calculations. The LACE engine supports and facilitates this, since scores are adjustable at the diagnosis code level, and MedPredict automatically produces metrics about how significant an individual feature is to the resulting score.

Medical data is quite complex too. For instance, the particular diagnosis codes that represent cancer are many, and their meanings are quite nuanced. In some cases, metastasis (the spreading of cancer to other locations in the body) may have occurred, and this is treated as a more severe situation. In other situations, measured values may be "bucketed"; for example, we track the number of emergency room visits over 1 year, 6 months, 90 days, and 30 days.

The Cascading LACE engine performs these calculations easily. It is customized through a set of hospital-supplied parameters, and it can perform full calculations nightly thanks to its use of Hadoop. Using this capability, a patient's record can track the full history of the LACE index over time. Additionally, different sets of LACE indices can be computed simultaneously, perhaps one for diabetes, another for Chronic Obstructive Pulmonary Disease (COPD), and so on.

Figure 2 – The LACE subassembly

MedPredict tracking

The LACE engine metrics feed into MedPredict along with the many other variables cited previously. These records are rescored nightly and the patient history is updated. This patient history is then used to analyze trends and generate alerts when a patient shows an increased likelihood of deviating from the desired metric values.

What Cascading does for us

We chose Cascading to help reduce the complexity of our development efforts. MapReduce provided us with the scalability that we desired, but we found that we were developing massive amounts of code to get it. Reusability was difficult, and the Java code library was becoming large. By shifting to Cascading, we found that we could encapsulate our code better and achieve significantly greater reusability. We reduced complexity as well: the Cascading API provides simplification and understandability, which improves our development velocity and also reduces bugs and maintenance cycles.

We allow Cascading to control the end-to-end workflow of these nightly calculations. It handles the preprocessing and formatting of data.
Then, it handles running these calculations in parallel, allowing high-speed hash joins to be performed and each leg of the calculation to be split into a parallel pipe. Next, all of these calculations are merged and the final score is produced. The last step is to analyze the patient trends and generate alerts where potential problems are likely to occur.

Cascading has allowed us to produce a reusable assembly that is highly parameterized, thereby allowing hospitals to customize their usage. Not only can thresholds, scores, and bucket sizes be varied, but, if desired, additional information can be included, such as the medical procedures performed on the patient. The local mode of Cascading allows for easy testing, and it also provides a scaled-down version that can be run against a small number of patients. However, by using Cascading in the Hadoop mode, massive scalability can be achieved against very large patient populations and ICD-9/10 code sets.

Concurrent also provides an excellent framework for predictive modeling using machine learning through its Pattern component. MedPredict uses this to integrate its predictive engine, which is written using Cascading, MapReduce, and Mahout. Pattern provides an interface for the integration of other external analysis products through the exchange of Predictive Model Markup Language (PMML), an XML dialect that allows many of the MedPredict proprietary machine learning algorithms to be directly incorporated into the full Cascading LACE workflow. MedPredict then produces a variety of predictive metrics in a single pass of the data. The LACE scores (current values and historical trends) are used as features for these predictions. Additionally, Concurrent provides a product called Driven that greatly reduces the development cycle time for such large, complex applications, and their Lingual product provides seamless integration with relational databases, which is also key to enterprise integration.

Results

Numerous studies have now been performed using LACE risk estimates. Many hospitals have shown the ability to reduce readmission rates by 5-10 percent through early intervention and specific guidance given to a patient as a result of an elevated LACE score. Other studies are examining the efficacy of additional metrics, and of segmenting patients into more specific groups, such as heart failure, cancer, diabetes, and so on. Additional effort is being put into studying the effect of modifying the comorbidity score values, taking combinations and complications into account. In some cases, even more dramatic improvements have taken place using these techniques. For up-to-date information, search for "LACE readmissions", which will provide current information about implementations and results.

Analytics Inside LLC

Analytics Inside is based in Westerville, Ohio. It was founded in 2005 and specializes in advanced analytical solutions and services. Analytics Inside produces the RelMiner family of relationship mining systems. These systems are based on machine learning, big data, graph theory, data visualization, and Natural Language Processing (NLP). For further information, visit our website at http://www.AnalyticsInside.us, or e-mail us at [email protected].
MedMiner Advanced Analytics for Health Care is an integrated software system designed to help an organization or patient care team in the following ways:

- Predicting the outcomes of patient cases and tracking these predictions over time
- Generating alerts based on patient case trends that will help direct remediation
- Complying better with ARRA value-based purchasing and meaningful use guidelines
- Providing management dashboards that can be used to set guidelines and track performance
- Tracking the performance of drug usage, interactions, potential for drug diversion, and pharmaceutical fraud
- Extracting medical information contained within text documents
- Designating data security as a key design point:
  - PHI can be hidden through external linkages, so data exchange is not required
  - If PHI is required, it is kept safe through heavy encryption, virus scanning, and data isolation
- Using both cloud-based and on-premise capabilities to meet client needs

Concurrent Inc.

Concurrent Inc. is the leader in big data application infrastructure, delivering products that help enterprises create, deploy, run, and manage data applications at scale. The company's flagship enterprise solution, Driven, was designed to accelerate the development and management of enterprise data applications. Concurrent is the team behind Cascading, the most widely deployed technology for data applications, with more than 175,000 user downloads a month. Used by thousands of businesses, including eBay, Etsy, The Climate Corporation, and Twitter, Cascading is the de facto standard in open source application infrastructure technology. Concurrent is headquartered in San Francisco and can be found online at http://concurrentinc.com.

Summary

Hospital readmission is an event that health care providers are attempting to reduce, and it is a primary target of new regulation from the Affordable Care Act, passed by the US government. This article described a system that allows health care providers to create complex predictive models that can assess who is most at risk of readmission, using Cascading.

Resources for Article:

Further resources on this subject:
- Hadoop Monitoring and its aspects [article]
- Introduction to Hadoop [article]
- YARN and Hadoop [article]