Data | 0 articles | Tech News, Tutorials & Expert Insights

28 Oct 2013

12 min read

Low-Level Index Control

28 Oct 2013

(For more resources related to this topic, see here.) Altering Apache Lucene scoring With the release of Apache Lucene 4.0 in 2012, all the users of this great, full text search library, were given the opportunity to alter the default TF/IDF based algorithm. Lucene API was changed to allow easier modification and extension of the scoring formula. However, that was not the only change that was made to Lucene when it comes to documents score calculation. Lucene 4.0 was shipped with additional similarity models, which basically allows us to use different scoring formula for our documents. In this section we will take a deeper look at what Lucene 4.0 brings and how those features were incorporated into ElasticSearch. Setting per-field similarity Since ElasticSearch 0.90, we are allowed to set a different similarity for each of the fields we have in our mappings. For example, let's assume that we have the following simple mapping that we use, in order to index blog posts (stored in the posts_no_similarity.json file): { "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } } What we would like to do is, use the BM25 similarity model for the name field and the contents field. In order to do that, we need to extend our field definitions and add the similarity property with the value of the chosen similarity name. Our changed mappings (stored in the posts_similarity.json file) would appear as shown in the following code: { "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "BM25" } } } } } And that's all, nothing more is needed. After the preceding change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and contents fields. In case of the Divergence from randomness and Information based similarity model, we need to configure some additional properties to specify the behavior of those similarities. How to do that is covered in the next part of the current section. Default codec properties When using the default codec we are allowed to configure the following properties: min_block_size: It specifies the minimum block size Lucene term dictionary uses to encode blocks. It defaults to 25. max_block_size: It specifies the maximum block size Lucene term dictionary uses to encode blocks. It defaults to 48. Direct codec properties The direct codec allows us to configure the following properties: min_skip_count: It specifies the minimum number of terms with a shared prefix to allow writing of a skip pointer. It defaults to 8. low_freq_cutoff: The codec will use a single array object to hold postings and positions that have document frequency lower than this value. It defaults to 32. Memory codec properties By using the memory codec we are allowed to alter the following properties: pack_fst: It is a Boolean option that defaults to false and specifies if the memory structure that holds the postings should be packed into the FST. Packing into FST will reduce the memory needed to hold the data. acceptable_overhead_ratio: It is a compression ratio of the internal structure specified as a float value which defaults to 0.2. When using the 0 value, there will be no additional memory overhead but the returned implementation may be slow. When using the 0.5 value, there can be a 50 percent memory overhead, but the implementation will be fast. Values higher than 1 are also possible, but may result in high memory overhead. Pulsing codec properties When using the pulsing codec we are allowed to use the same properties as with the default codec and in addition to them one more property, which is described as follows: freq_cut_off: It defaults to 1. The document frequency at which the postings list will be written into the term dictionary. The documents with the frequency equal to or less than the value of freq_cut_off will be processed. Bloom filter-based codec properties If we want to configure a bloom filter based codec, we can use the bloom_filter type and set the following properties: delegate: It specifies the name of the codec we want to wrap, with the bloom filter. ffp: It is a value between 0 and 1.0 which specifies the desired false positive probability. We are allowed to set multiple probabilities depending on the amount of documents per Lucene segment. For example, the default value of 10k=0.01, 1m=0.03 specifies that the fpp value of 0.01 will be used when the number of documents per segment is larger than 10.000 and the value of 0.03 will be used when the number of documents per segment is larger than one million. For example, we could configure our custom bloom filter based codec to wrap a direct posting format as shown in the following code (stored in posts_bloom_custom.json file): { "settings" : { "index" : { "codec" : { "postings_format" : { "custom_bloom" : { "type" : "bloom_filter", "delegate" : "direct", "ffp" : "10k=0.03, 1m=0.05" } } } } }, "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "postings_format" : "custom_bloom" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } } NRT, flush, refresh, and transaction log In an ideal search solution, when new data is indexed it is instantly available for searching. At the first glance it is exactly how ElasticSearch works even in multiserver environments. But this is not the truth (or at least not all the truth) and we will show you why it is like this. Let's index an example document to the newly created index by using the following command: curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test" }' Now, we will replace this document and immediately we will try to find it. In order to do this, we'll use the following command chain: curl –XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }' ; curl localhost:9200/test/test/_search?pretty The preceding command will probably result in the response, which is very similar to the following response: {"ok":true,"_index":"test","_type":"test","_id":"1","_version":2}{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "test", "_id" : "1", "_score" : 1.0, "_source" : { "title": "test" } } ] } } The first line starts with a response to the indexing command—the first command. As you can see everything is correct, so the second, search query should return the document with the title field test2, however, as you can see it returned the first document. What happened? But before we give you the answer to the previous question, we should take a step backward and discuss about how underlying Apache Lucene library makes the newly indexed documents available for searching. Updating index and committing changes The segments are independent indices, which means that queries that are run in parallel to indexing, from time to time should add newly created segments to the set of those segments that are used for searching. Apache Lucene does that by creating subsequent (because of write-once nature of the index) segments_N files, which list segments in the index. This process is called committing. Lucene can do this in a secure way—we are sure that all changes or none of them hits the index. If a failure happens, we can be sure that the index will be in consistent state. Let's return to our example. The first operation adds the document to the index, but doesn't run the commit command to Lucene. This is exactly how it works. However, a commit is not enough for the data to be available for searching. Lucene library use an abstraction class called Searcher to access index. After a commit operation, the Searcher object should be reopened in order to be able to see the newly created segments. This whole process is called refresh. For performance reasons ElasticSearch tries to postpone costly refreshes and by default refresh is not performed after indexing a single document (or a batch of them), but the Searcher is refreshed every second. This happens quite often, but sometimes applications require the refresh operation to be performed more often than once every second. When this happens you can consider using another technology or requirements should be verified. If required, there is possibility to force refresh by using ElasticSearch API. For example, in our example we can add the following command: curl –XGET localhost:9200/test/_refresh If we add the preceding command before the search, ElasticSearch would respond as we had expected. Changing the default refresh time The time between automatic Searcher refresh can be changed by using the index.refresh_interval parameter either in the ElasticSearch configuration file or by using the update settings API. For example: curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "5m" } }' The preceding command will change the automatic refresh to be done every 5 minutes. Please remember that the data that are indexed between refreshes won't be visible by queries. As we said, the refresh operation is costly when it comes to resources. The longer the period of refresh is, the faster your indexing will be. If you are planning for very high indexing procedure when you don't need your data to be visible until the indexing ends, you can consider disabling the refresh operation by setting the index.refresh_interval parameter to -1 and setting it back to its original value after the indexing is done. The transaction log Apache Lucene can guarantee index consistency and all or nothing indexing, which is great. But this fact cannot ensure us that there will be no data loss when failure happens while writing data to the index (for example, when there isn't enough space on the device, the device is faulty or there aren't enough file handlers available to create new index files). Another problem is that frequent commit is costly in terms of performance (as you recall, a single commit will trigger a new segment creation and this can trigger the segments to merge). ElasticSearch solves those issues by implementing transaction log. Transaction log holds all uncommitted transactions and from time to time, ElasticSearch creates a new log for subsequent changes. When something goes wrong, transaction log can be replayed to make sure that none of the changes were lost. All of these tasks are happening automatically, so, the user may not be aware of the fact that commit was triggered at a particular moment. In ElasticSearch, the moment when the information from transaction log is synchronized with the storage (which is Apache Lucene index) and transaction log is cleared is called flushing. Please note the difference between flush and refresh operations. In most of the cases refresh is exactly what you want. It is all about making new data available for searching. From the opposite side, the flush operation is used to make sure that all the data is correctly stored in the index and transaction log can be cleared. In addition to automatic flushing, it can be forced manually using the flush API. For example, we can run a command to flush all the data stored in the transaction log for all indices, by running the following command: curl –XGET localhost:9200/_flush Or we can run the flush command for the particular index, which in our case is the one called library: curl –XGET localhost:9200/library/_flush curl –XGET localhost:9200/library/_refresh In the second example we used it together with the refresh, which after flushing the data opens a new searcher. The transaction log configuration If the default behavior of the transaction log is not enough ElasticSearch allows us to configure its behavior when it comes to the transaction log handling. The following parameters can be set in the elasticsearch.yml file as well as using index settings update API to control transaction log behavior: index.translog.flush_threshold_period: It defaults to 30 minutes (30m). It controls the time, after which flush will be forced automatically even if no new data was being written to it. In some cases this can cause a lot of I/O operation, so sometimes it's better to do flush more often with less data being stored in it. index.translog.flush_threshold_ops: It specifies the maximum number of operations after which the flush operation will be performed. It defaults to 5000. index.translog.flush_threshold_size: It specifies the maximum size of the transaction log. If the size of the transaction log is equal to or greater than the parameter, the flush operation will be performed. It defaults to 200 MB. index.translog.disable_flush: This option disables automatic flush. By default flushing is enabled, but sometimes it is handy to disable it temporarily, for example, during import of large amount of documents. All of the mentioned parameters are specified for an index of our choice, but they are defining the behavior of the transaction log for each of the index shards. Of course, in addition to setting the preceding parameters in the elasticsearch.yml file, they can also be set by using Settings Update API. For example: curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "translog.disable_flush" : true } }' The preceding command was run before the import of a large amount of data, which gave us a performance boost for indexing. However, one should remember to turn on flushing when the import is done. Near Real Time GET Transaction log gives us one more feature for free that is, real-time GET operation, which provides the possibility of returning the previous version of the document including non-committed versions. The real-time GET operation fetches data from the index, but first it checks if a newer version of that document is available in the transaction log. If there is no flushed document, data from the index is ignored and a newer version of the document is returned—the one from the transaction log. In order to see how it works, you can replace the search operation in our example with the following command: curl -XGET localhost:9200/test/test/1?pretty ElasticSearch should return the result similar to the following: { "_index" : "test", "_type" : "test", "_id" : "1", "_version" : 2, "exists" : true, "_source" : { "title": "test2" } } If you look at the result, you would see that again, the result was just as we expected and no trick with refresh was required to obtain the newest version of the document.

0
0
1341

article-image-designing-sample-applications-and-creating-reports

Packt

25 Oct 2013

6 min read

Designing Sample Applications and Creating Reports

Packt

25 Oct 2013

6 min read

0
0
1074

Packt

25 Oct 2013

28 min read

About Cassandra

Packt

25 Oct 2013

28 min read

(For more resources related to this topic, see here.) So, if Cassandra is so good at everything, why not everyone drop whatever database they are using and jump start with Cassandra? This is a natural question. Some applications require strong ACID compliance, such as a booking system. If you are a person who goes by statistics, you'd ask how Cassandra fares with other existing data stores. TilmannRabl et al in their paper, Solving Big Data Challenges for Enterprise Application Performance Management (http://vldb.org/pvldb/vol5/p1724_tilmannrabl_vldb2012.pdf), told that, "In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput from one to 12 nodes. This comes at the price of a high write and read latency. Cassandra's performance is best for high insertion rates." If you go through the paper, Cassandra wins in almost all the criteria. Equipped with proven concepts of distributed computing, made to reliably serve from commodity servers, and simple and easy maintenance, Cassandra is one of the most scalable, fastest, and very robust NoSQL database. So, the next natural question is what makes Cassandra so blazing fast? Let us dive deeper into the Cassandra architecture. Cassandra architecture Cassandra is a relative latecomer in the distributed data-store war. It takes advantage of two proven and closely similar data-store mechanisms, namely Google BigTable, a distributed storage system for structured data, 2006 (http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf [2006]), and Amazon Dynamo, Amazon's highly available key-value store, 2007 (http://www.read.seas.harvard.edu/~kohler/class/cs239-w08/decandia07dynamo.pdf [2007]). Figure 2.3: Read throughputs shows linear scaling of Cassandra Like BigTable, it has tabular data presentation. It is not tabular in the strictest sense. It is rather a dictionary-like structure where each entry holds another sorted dictionary/map. This model is more powerful than the usual key-value store and it is named as column family. The properties such as Eventual Consistency and decentralization are taken from Dynamo. For now, assume a column family as a giant spreadsheet, such as MS Excel. But unlike spreadsheets, each row is identified by a row key with a number (token), and unlike spreadsheets, each cell can have its own unique name within the row. The columns in the rows are sorted by this unique column name. Also, since the number of rows is allowed to be very large (1.7*(10)^38), we distribute the rows uniformly across all the available machines by dividing the rows in equal token groups. These rows create a Keyspace. Keyspace is a set of all the tokens (row IDs). Ring representation Cassandra cluster is denoted as a ring. The idea behind this representation is to show token distribution. Let's take an example. Assume that there is a partitioner that generates tokens from zero to 127 and you have four Cassandra machines to create a cluster. To allocate equal load, we need to assign each of the four nodes to bear an equal number of tokens. So, the first machine will be responsible for tokens one to 32, the second will hold 33 to 64, the third, 65 to 96, and the fourth, 97 to 127 and 0. If you mark each node with the maximum token number that it can hold, the cluster looks like a ring. (Figure 2.22) Partitioner is the hash function that determines the range of possible row keys. Cassandra uses a partitioner to calculate the token equivalent to a row key (row ID). Figure 2.4: Token ownership and distribution in a balanced Cassandra ring When you start to configure Cassandra, one thing that you may want to set is the maximum token number that a particular machine could hold. This property can be set in the Cassandra.yaml file as initial_token. One thing that may confuse a beginner is that the value of the initial token is what the last token owns. Be aware that nodes can be rebalanced and these tokens can be changed as the new nodes join or old nodes get discarded. This is the initial token because this is just the initial value, and it may be changed later. How Cassandra works Diving into various components of Cassandra without having a context is really a frustrating experience. It does not makes sense why you are studying SSTable, MemTable, and Log Structured Merge (LSM) tree without being able to see how they fit into functionality and performance guarantees that Cassandra gives. So, first, we will see Cassandra's write and read mechanism. It is possible that some of the terms that we encounter during this discussion may not be immediately understandable. A rough overview of the Cassandra components is as shown in the following figure: Figure 2.5: Main components of the Cassandra service The main class of Storage Layer is StorageProxy. It handles all the requests. Messaging Layer is responsible for internode communications like gossip. Apart from this, process-level structures keep a rough idea about the actual data containers and where they live. There are four data buckets that you need to know. MemTable is a hash table-like structure that stays in memory. It contains actual column data. SSTable is the disk version of MemTables. When MemTables are full, SSTables are persisted to the hard disk. Bloom filters is a probabilistic data structure that lives in memory. It helps Cassandra to quickly detect which SSTable does not have the requested data. CommitLog is the usual commit log that contains all the mutations that are to be applied. It lives on the disk and helps to replay uncommitted changes. With this primer, we can start looking into how write and read works in Cassandra. We will see more explanation later. Write in action To write, clients need to connect to any of the Cassandra nodes and send a write request. This node is called as the coordinator node. When a node in Cassandra cluster receives a write request, it delegates it to a service called StorageProxy. This node may or may not be the right place to write the data to. The task of StorageProxy is to get the nodes (all the replicas) that are responsible to hold the data that is going to be written. It utilizes a replication strategy to do that. Once the replica nodes are identified, it sends the RowMutation message to them, the node waits for replies from these nodes, but it does not wait for all the replies to come. It only waits for as many responses as are enough to satisfy the client's minimum number of successful writes defined by ConsistencyLevel. So, the following figure and steps after that show all that can happen during a write mechanism: Figure 2.6: A simplistic representation of the write mechanism. The figure on the left represents the node-local activities on receipt of the write request If FailureDetector detects that there aren't enough live nodes to satisfy ConsistencyLevel, the request fails. If FailureDetector gives a green signal, but writes time-out after the request is sent due to infrastructure problems or due to extreme load, StorageProxy writes a local hint to replay when the failed nodes come back to life. This is called hinted handoff. One might think that hinted handoff may be responsible for Cassandra's eventual consistency. But it's not entirely true. If the coordinator node gets shut down or dies due to hardware failure and hints on this machine cannot be forwarded, eventual consistency will not occur. The Anti-entropy mechanism is responsible for consistency rather than hinted handoff. Anti-entropy makes sure that all replicas are in sync. If the replica nodes are distributed across datacenters, it will be a bad idea to send individual messages to all the replicas in other datacenters. It rather sends the message to one replica in each datacenter with a header instructing it to forward the request to other replica nodes in that datacenter. Now, the data is received by the node that should actually store that data. The data first gets appended to CommitLog, and pushed to a MemTable for the appropriate column family in the memory. When MemTable gets full, it gets flushed to the disk in a sorted structure named SSTable. With lots of flushes, the disk gets plenty of SSTables. To manage SSTables, a compaction process runs. This process merges data from smaller SSTables to one big sorted file. Read in action Similar to a write case, when StorageProxy of the node that a client is connected to gets the request, it gets a list of nodes containing this key based on Replication Strategy. StorageProxy then sorts the nodes based on their proximity to itself. The proximity is determined by the Snitch function that is set up for this cluster. Basically, there are the following types of Snitch: SimpleSnitch: A closer node is the one that comes first when moving clockwise in the ring. (A ring is when all the machines in the cluster are placed in a circular fashion with each having a token number. When you walk clockwise, the token value increases. At the end, it snaps back to the first node.) AbstractNetworkTopologySnitch: Implementation of the Snitch function works like this: nodes on the same rack are closest. The nodes in the same datacenter but in different rack are closer than those in other datacenters, but farther than the nodes in the same rack. Nodes in different datacenters are the farthest. To a node, the nearest node will be the one on the same rack. If there is no node on the same rack, the nearest node will be the one that lives in the same datacenter, but on a different rack. If there is no node in the datacenter, any nearest neighbor will be the one in the other datacenter. DynamicSnitch: This Snitch determines closeness based on recent performance delivered by a node. So, a quick-responding node is perceived closer than a slower one, irrespective of their location closeness or closeness in the ring. This is done to avoid overloading a slow-performing node. Now that we have the list of nodes that have desired row keys, it's time to pull data from them. The coordinator node (the one that the client is connected to) sends a command to the closest node to perform read (we'll discuss local read in a minute) and return the data. Now, based on ConsistencyLevel, other nodes will send a command to perform a read operation and send just the digest of the result. If we have Read Repair (discussed later) enabled, the remaining replica nodes will be sent a message to compute the digest of the command response. Let's take an example: say you have five nodes containing a row key K (that is, replication factor (RF) equals 5). Your read ConsistencyLevel is three. Then the closest of the five nodes will be asked for the data. And the second and third closest nodes will be asked to return the digest. We still have two left to be queried. If read Repair is not enabled, they will not be touched for this request. Otherwise, these two will be asked to compute digest. The request to the last two nodes is done in the background, after returning the result. This updates all the nodes with the most recent value, making all replicas consistent. So, basically, in all scenarios, you will have a maximum one wrong response. But with correct read and write consistency levels, we can guarantee an up-to-date response all the time. Let's see what goes within a node. Take a simple case of a read request looking for a single column within a single row. First, the attempt is made to read from MemTable, which is rapid-fast since there exists only one copy of data. This is the fastest retrieval. If the data is not found there, Cassandra looks into SSTable. Now, remember from our earlier discussion that we flush MemTables to disk as SSTables and later when compaction mechanism wakes up, it merges those SSTables. So, our data can be in multiple SSTables. Figure 2.7: A simplified representation of the read mechanism. The bottom image shows processing on the read node. Numbers in circles shows the order of the event. BF stands for Bloom Filter Each SSTable is associated with its Bloom Filter built on the row keys in the SSTable. Bloom Filters are kept in memory, and used to detect if an SSTable may contain (false positive) the row data. Now, we have the SSTables that may contain the row key. The SSTables get sorted in reverse chronological order (latest first). Apart from Bloom Filter for row keys, there exists one Bloom Filter for each row in the SSTable. This secondary Bloom Filter is created to detect whether the requested column names exist in the SSTable. Now, Cassandra will take SSTables one by one from younger to older. And use the index file to locate the offset for each column value for that row key and the Bloom filter associated with the row (built on the column name). On Bloom filter being positive for the requested column, it looks into the SSTable file to read the column value. Note that we may have a column value in other yet-to-be-read SSTables, but that does not matter, because we are reading the most recent SSTables first, and any value that was written earlier to it does not matter. So, the value gets returned as soon as the first column in the most recent SSTable is allocated. Components of Cassandra We have gone through how read and write takes place in highly distributed Cassandra clusters. It's time to look into individual components of it a little deeper. Messaging service Messaging service is the mechanism that manages internode socket communication in a ring. Communications, for example, gossip, read, read digest, write, and so on, processed via a messaging service, can be assumed as a gateway messaging server running at each node. To communicate, each node creates two socket connections per node. This implies that if you have 101 nodes, there will be 200 open sockets on each node to handle communication with other nodes. The messages contain a verb handler within them that basically tells the receiving node a couple of things: how to deserialize the payload message and what handler to execute for this particular message. The execution is done by the verb handlers (sort of an event handler). The singleton that orchestrates the messaging service mechanism is org.apache.cassandra.net.MessagingService. Gossip Cassandra uses the gossip protocol for internode communication. As the name suggests, the protocol spreads information in the same way an office rumor does. It can also be compared to a virus spread. There is no central broadcaster, but the information (virus) gets transferred to the whole population. It's a way for nodes to build the global map of the system with a small number of local interactions. Cassandra uses gossip to find out the state and location of other nodes in the ring (cluster). The gossip process runs every second and exchanges information with at the most three other nodes in the cluster. Nodes exchange information about themselves and other nodes that they come to know about via some other gossip session. This causes all the nodes to eventually know about all the other nodes. Like everything else in Cassandra, gossip messages have a version number associated with it. So, whenever two nodes gossip, the older information about a node gets overwritten with a newer one. Cassandra uses an Anti-entropy version of gossip protocol that utilizes Merkle trees (discussed later) to repair unread data. Implementation-wise the gossip task is handled by the org.apache.cassandra.gms.Gossiper class. Gossiper maintains a list of live and dead endpoints (the unreachable endpoints). At every one-second interval, this module starts a gossip round with a randomly chosen node. A full round of gossip consists of three messages. A node X sends a syn message to a node Y to initiate gossip. Y, on receipt of this syn message, sends an ack message back to X. To reply to this ack message, X sends an ack2 message to Y completing a full message round. Figure 2.8: Two nodes gossiping Failure detection Failure detection is one of the fundamental features of any robust and distributed system. A good failure detection mechanism implementation makes a fault-tolerant system such as Cassandra. The failure detector that Cassandra uses is a variation of The ϕ accrual failure detector (2004) by Xavier Défago et al. (The phi accrual detector research paper is available at http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.3350.) The idea behind a FailureDetector is to detect a communication failure and take appropriate actions based on the state of the remote node. Unlike traditional failure detectors, phi accrual failure detector does not emit a Boolean alive or dead (true or false, trust or suspect) value. Instead, it gives a continuous value to the application and the application is left to decide the level of severity and act accordingly. This continuous suspect value is called phi (ϕ). Partitioner Cassandra is a distributed database management system. This means it takes a single logical database and distributes it over one or more machines in the database cluster. So, when you insert some data in Cassandra, it assigns each row to a row key; and based on that row key, Cassandra assigns that row to one of the nodes that's responsible for managing it. Let's try to understand this. Cassandra inherits the data model from Google's BigTable (BigTable research paper can be found at http://research.google.com/archive/bigtable.html.). This means we can roughly assume that the data is stored in some sort of a table that has an unlimited number of columns (not unlimited, Cassandra limits the maximum number of columns to be two billion) with rows binded with a unique key, namely, row key. Now, your terabytes of data on one machine will be restrictive from multiple points of views. One is disk space, another being limited parallel processing, and if not duplicated, a source of single point of failure. What Cassandra does is, it defines some rules to slice data across rows and assigns which node in the cluster is responsible for holding which slice. This task is done by a partitioner. There are several types of partitioners to choose from. In short, Cassandra (as of Version 1.2) offers three partitioners as follows: RandomPartitioner: It uses MD5 hashing to distribute data across the cluster. Cassandra 1.1.x and precursors have this as the default partitioner. Murmur3Partitioner: It uses Murmur hash to distribute the data. It performs better than RandomPartitioner. It is the default partitioner from Cassandra Version 1.2 onwards. ByteOrderPartitioner: Keeps keys distributed across the cluster by key bytes. This is an ordered distribution, so the rows are stored in lexical order. This distribution is commonly discouraged because it may cause a hotspot. Replication Cassandra runs on commodity hardware, and works reliably in network partitions. However, this comes with a cost: replication. To avoid data inaccessibility in case a node goes down or becomes unavailable, one must replicate data to more than one node. Replication brings features such as fault tolerance and no single point of failure to the system. Cassandra provides more than one strategy to replicate the data, and one can configure the replication factor while creating keyspace. Log Structured Merge tree Cassandra (also, HBase) is heavily influenced by Log Structured Merge (LSM). It uses an LSM tree-like mechanism to store data on a disk. The writes are sequential (in append fashion) and the data storage is contiguous. This makes writes in Cassandra superfast, because there is no seek involved. Contrast this with an RBDMS system that is based on the B+ Tree (http://en.wikipedia.org/wiki/B%2B_tree) implementation. LSM tree advocates the following mechanism to store data: note down the arriving modification into a log file (CommitLog), push the modification/new data into memory (MemTable) for faster lookup, and when the system has gathered enough updates in memory or after a certain threshold time, flush this data to a disk in a structured store file (SSTable). The logs corresponding to the updates that are flushed can now be discarded. Figure 2.12: Log Structured Merge (LSM) Trees CommitLog One of the promises that Cassandra makes to the end users is durability. In conventional terms (or in ACID terminology), durability guarantees that a successful transaction (write, update) will survive permanently. This means once Cassandra says write successful that means the data is persisted and will survive system failures. It is done the same way as in any DBMS that guarantees durability: by writing the replayable information to a file before responding to a successful write. This log is called the CommitLog in the Cassandra realm. This is what happens. Any write to a node gets tracked by org.apache.cassandra.db.commitlog.CommitLog, which writes the data with certain metadata into the CommitLog file in such a manner that replaying this will recreate the data. The purpose of this exercise is to ensure there is no data loss. If due to some reason the data could not make it into MemTable or SSTable, the system can replay the CommitLog to recreate the data. MemTable MemTable is an in-memory representation of column family. It can be thought of as a cached data. MemTable is sorted by key. Data in MemTable is sorted by row key. Unlike CommitLog, which is append-only, MemTable does not contain duplicates. A new write with a key that already exists in the MemTable overwrites the older record. This being in memory is both fast and efficient. The following is an example: Write 1: {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} In CommitLog (new entry, append): {k1: [{c1, v1},{c2, v2}, {c3, v3}]} In MemTable (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} Write 2: {k2: [{c4, v4}]} In CommitLog (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} In MemTable (new entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} Write 3: {k1: [{c1, v5}, {c6, v6}]} In CommitLog (old entry, append): {k1: [{c1, v1}, {c2, v2}, {c3, v3}]} {k2: [{c4, v4}]} {k1: [{c1, v5}, {c6, v6}]} In MemTable (old entry, update): {k1: [{c1, v5}, {c2, v2}, {c3, v3}, {c6, v6}]} {k2: [{c4, v4}]} Cassandra Version 1.1.1 uses SnapTree (https://github.com/nbronson/snaptree) for MemTable representation, which claims it to be "... a drop-in replacement for ConcurrentSkipListMap, with the additional guarantee that clone() is atomic and iteration has snapshot isolation." See also copy-on-write and compare-and-swap (http://en.wikipedia.org/wiki/Copy-on-write, http://en.wikipedia.org/wiki/Compare-and-swap). Any write gets written first to CommitLog and then to MemTable. SSTable SSTable is a disk representation of the data. MemTables gets flushed to disk to immutable SSTables. All the writes are sequential, which makes this process fast. So, the faster the disk speed, the quicker the flush operation. The SSTables eventually get merged in the compaction process and the data gets organized properly into one file. This extra work in compaction pays off during reads. SSTables have three components: Bloom filter, index files, and datafiles. Bloom filter Bloom filter is a litmus test for the availability of certain data in storage (collection). But unlike a litmus test, a Bloom filter may result in false positives: that is, it says that a data exists in the collection associated with the Bloom filter, when it actually does not. A Bloom filter never results in a false negative. That is, it never states that a data is not there while it is. The reason to use Bloom filter, even with its false-positive defect, is because it is superfast and its implementation is really simple. Cassandra uses Bloom filters to determine whether an SSTable has the data for a particular row key. Bloom filters are unused for range scans, but they are good candidates for index scans. This saves a lot of disk I/O that might take in a full SSTable scan, which is a slow process. That's why it is used in Cassandra, to avoid reading many, many SSTables, which can become a bottleneck. Index files Index files are companion files of SSTables. The same as Bloom filter, there exists one index file per SSTable. It contains all the row keys in the SSTable and its offset at which the row starts in the datafile. At startup, Cassandra reads every 128th key (configurable) into the memory (sampled index). When the index is looked for a row key (after Bloom filter hinted that the row key might be in this SSTable), Cassandra performs a binary search on the sampled index in memory. Followed by a positive result from the binary search, Cassandra will have to read a block in the index file from the disk starting from the nearest value lower than the value that we are looking for. Datafiles Datafiles are the actual data. They contain row keys, metadata, and columns (partial or full). Reading data from datafiles is just one disk seek followed by a sequential read, as offset to a row key is already obtained from the associated index file. Compaction As mentioned earlier, a read require may require Cassandra to read across multiple SSTables to get a result. This is wasteful, costs multiple (disk) seeks, may require a conflict resolution, and if there are too many, SSTables were created. To handle this problem, Cassandra has a process in place, namely, compaction. Compaction merges multiple SSTable files into one. Off the shelf, Cassandra offers two types of compaction mechanism: Size Tiered Compaction Strategy and Level Compaction Strategy. The compaction process starts when the number of SSTables on disk reaches a certain threshold (N: configurable). Although the merge process is a little I/O intensive, it benefits in the long term with a lower number of disk seeks during reads. Apart from this, there are a few other benefits of compaction as follows: Removal of expired tombstones (Cassandra v0.8+) Merging row fragments Rebuilds primary and secondary indexes Tombstones Cassandra is a complex system with its data distributed among CommitLogs, MemTables, and SSTables on a node. The same data is then replicated over replica nodes. So, like everything else in Cassandra, deletion is going to be eventful. Deletion, to an extent, follows an update pattern except Cassandra tags the deleted data with a special value, and marks it as a tombstone. This marker helps future queries, compaction, and conflict resolution. Let's step further down and see what happens when a column from a column family is deleted. A client connected to a node (coordinator node, but it may not be the one holding the data that we are going to mutate), issues a delete command for a column C, in a column family CF. If the consistency level is satisfied, the delete command gets processed. When a node, containing the row key, receives a delete request, it updates or inserts the column in MemTable with a special value, namely, tombstone. The tombstone basically has the same column name as the previous one; the value is set to UNIX epoch. The timestamp is set to what the client has passed. When a MemTable is flushed to SSTable, all tombstones go into it as any regular column. On the read side, when the data is read locally on the node and it happens to have multiple versions of it in different SSTables, they are compared and the latest value is taken as the result of reconciliation. If a tombstone turns out to be a result of reconciliation, it is made a part of the result that this node returns. So, at this level, if a query has a deleted column, this exists in the result. But the tombstones will eventually be filtered out of the result before returning it back to the client. So, a client can never see a value that is a tombstone. Hinted handoff When we last talked about durability, we observed Cassandra provides CommitLogs to provide write durability. This is good. But what if the node, where the writes are going to be, is itself dead? No communication will keep anything new to be written to the node. Cassandra, inspired by Dynamo, has a feature called hinted handoff. In short, it's the same as taking a quick note locally that X cannot be contacted. Here is the mutation (operation that requires modification of the data such as insert, delete, and update) M that will be required to be replayed when it comes back. The coordinator node (the node which the client is connected to) on receipt of a mutation/write request, forwards it to appropriate replicas that are alive. If this fulfills the expected consistency level, write is assumed successful. The write requests to a node that does not respond to a write request or is known to be dead (via gossip) and is stored locally in the system.hints table. This hint contains the mutation. When a node comes to know via gossip that a node is recovered, it replays all the hints it has in store for that node. Also, every 10 minutes, it keeps checking any pending hinted handoffs to be written. Why worry about hinted handoff when you have written to satisfy consistency level? Wouldn't it eventually get repaired? Yes, that's right. Also, hinted handoff may not be the most reliable way to repair a missed write. What if the node that has hinted handoff dies? This is a reason why we do not count on hinted handoff as a mechanism to provide consistency (except for the case of the consistency level, ANY) guarantee; it's a single point of failure. The purpose of hinted handoff is, one, to make restored nodes quickly consistent with the other live ones; and two, to provide extreme write availability when consistency is not required. Read repair and Anti-entropy Cassandra promises eventual consistency and read repair is the process which does that part. Read repair, as the name suggests, is the process of fixing inconsistencies among the replicas at the time of read. What does that mean? Let's say we have three replica nodes A, B, and C that contain a data X. During an update, X is updated to X1 in replicas A and B. But it is failed in replica C for some reason. On a read request for data X, the coordinator node asks for a full read from the nearest node (based on the configured Snitch) and digest of data X from other nodes to satisfy consistency level. The coordinator node compares these values (something like digest(full_X) == digest_from_node_C). If it turns out that the digests are the same as the digests of full read, the system is consistent and the value is returned to the client. On the other hand, if there is a mismatch, full data is retrieved and reconciliation is done and the client is sent the reconciled value. After this, in background, all the replicas are updated with the reconciled value to have a consistent view of data on each node. See Figure 2.1 as shown: Figure 2.16: Image showing read repair dynamics. 1. Client queries for data x, from a node C (coordinator). 2. C gets data from replicas R1, R2, and R3; reconciles. 3. Sends reconciled data to client. 4. If there is a mismatch across replicas, repair is invoked. Merkle tree Merkle tree (A digital signature Based On A Conventional Encryption Function by Merkle, R. (1988), available at http://www.cse.msstate.edu/~ramkumar/merkle2.pdf.) is a hash tree where leaves of the tree hashes hold actual data in a column family and non-leaf nodes hold hashes of their children. The unique advantage of Merkle tree is a whole subtree can be validated just by looking at the value of the parent node. So, if nodes on two replica servers have the same hash values, then the underlying data is consistent and there is no need to synchronize. If one node passes the whole Merkle tree of a column family to another node, it can determine all the inconsistencies. Figure 2.17: Merkle tree to determine mismatch in hash values at parent nodes due to the difference in underlying data Summary By now, you are familiar with all the nuts and bolts of Cassandra. It is understandable that it may be a lot to take in for someone new to NoSQL systems. It is okay if you do not have complete clarity at this point. As you start working with Cassandra, tweaking it, experimenting with it, and going through the Cassandra mailing list discussions or talks, you will start to come across stuff that you have read in this article and it will start to make sense, and perhaps you may want to come back and refer to this article to improve clarity. It is not required to understand this article fully to be able to write queries, set up clusters, maintain clusters, or do anything else related to Cassandra. A general sense of this article will take you far enough to work extremely well with Cassandra-based projects. Resources for Article: Further resources on this subject: Apache Cassandra: Libraries and Applications [Article] Apache Felix Gogo [Article] Migration from Apache to Lighttpd [Article]

0
0
2816

article-image-visualizing-my-social-graph-d3js

Packt

24 Oct 2013

7 min read

Visualizing my Social Graph with d3.js

Packt

24 Oct 2013

7 min read

(For more resources related to this topic, see here.) The Social Networks Analysis Social Networks Analysis (SNA) is not new, sociologists have been using it for a long time to study human relationships (sociometry), to find communities and to simulate how information or a disease is spread in a population. With the rise of social networking sites such as Facebook, Twitter, LinkedIn, and so on. The acquisition of large amounts of social network data is easier. We can use SNA to get insight about customer behavior or unknown communities. It is important to say that this is not a trivial task and we will come across sparse data and a lot of noise (meaningless data). We need to understand how to distinguish between false correlation and causation. A good start is by knowing our graph through visualization and statistical analysis. Social networking sites bring us the opportunities to ask questions that otherwise are too hard to approach, because polling enough people is time-consuming and expensive. In this article, we will obtain our social network's graph from Facebook (FB) website in order to visualize the relationships between our friends. Finally we will create an interactive visualization of our graph using D3.js. Getting ready The easiest method to get our friends list is by using a third-party application. Netvizz is a Facebook app developed by Bernhard Rieder, which allows exporting social graph data to gdf and tab formats. Netvizz may export information about our friends such as gender, age, locale, posts, and likes. In order to get our social graph from Netvizz we need to access the link below and giving access to your Facebook profile. https://apps.facebook.com/netvizz/ As is shown in the following screenshot, we will create a gdf file from our personal friend network by clicking on the link named here in the Step 2. Then we will download the GDF (Graph Modeling Language) file. Netvizz will give us the number of nodes and edges (links); finally we will click on the gdf file link, as we can see in the following screenshot: The output file myFacebookNet.gdf will look like this: nodedef>name VARCHAR,label VARCHAR,gender VARCHAR,locale VARCHAR,agerankINT23917067,Jorge,male,en_US,10623931909,Haruna,female,en_US,10535702006,Joseph,male,en_US,104503839109,Damian,male,en_US,103532735006,Isaac,male,es_LA,102. . .edgedef>node1 VARCHAR,node2 VARCHAR23917067,3570200623917067,62939583723917067,74734348223917067,75560507523917067,1186286815. . . In the following screenshot we may see the visualization of the graph (106 nodes and 279 links). The nodes represent my friends and the links represent how my friends are connected between them. Transforming GDF to JSON In order to work with the graph in the web with d3.js, we need to transform our gdf file to json format. Firstly, we need to import the libraries numpy and json. import numpy as npimport json The numpy function, genfromtxt, will obtain only the ID and name from the nodes.csv file using the usecols attribute in the 'object' format. nodes = np.genfromtxt("nodes.csv",dtype='object',delimiter=',',skip_header=1,usecols=(0,1)) Then, the numpy function, genfromtxt, will obtain links with the source node and target node from the links.csv file using the usecols attribute in the 'object' format. links = np.genfromtxt("links.csv",dtype='object',delimiter=',',skip_header=1,usecols=(0,1)) The JSON format used in the D3.js Force Layout graph implemented in this article requires transforming the ID (for example, 100001448673085) into a numerical position in the list of nodes. Then, we need to look for each appearance of the ID in the links and replace them by their position in the list of nodes. for n in range(len(nodes)):for ls in range(len(links)):if nodes[n][0] == links[ls][0]:links[ls][0] = nif nodes[n][0] == links[ls][1]:links[ls][1] = n Now, we need to create a dictionary "data" to store the JSON file. data ={} Next, we need to create a list of nodes with the names of the friends in the format as follows: "nodes": [{"name": "X"},{"name": "Y"},. . .] and add it to thedata dictionary.lst = []for x in nodes:d = {}d["name"] = str(x[1]).replace("b'","").replace("'","")lst.append(d)data["nodes"] = lst Now, we need to create a list of links with the source and target in the format as follows: "links": [{"source": 0, "target": 2},{"source": 1, "target":2},. . .] and add it to the data dictionary.lnks = []for ls in links:d = {}d["source"] = ls[0]d["target"] = ls[1]lnks.append(d)data["links"] = lnks Finally, we need to create the file, newJson.json, and write the data dictionary in the file with the function dumps of the json library. with open("newJson.json","w") as f:f.write(json.dumps(data)) The file newJson.json will look as follows: {"nodes": [{"name": "Jorge"},{"name": "Haruna"},{"name": "Joseph"},{"name": "Damian"},{"name": "Isaac"},. . .],"links": [{"source": 0, "target": 2},{"source": 0, "target": 12},{"source": 0, "target": 20},{"source": 0, "target": 23},{"source": 0, "target": 31},. . .]} Graph visualization with D3.js D3.js provides us with the d3.layout.force() function that use the Force Atlas layout algorithm and help us to visualize our graph. First, we need to define the CSS style for the nodes, links, and node labels. <style>.link {fill: none;stroke: #666;stroke-width: 1.5px;}.node circle{fill: steelblue;stroke: #fff;stroke-width: 1.5px;}.node text{pointer-events: none;font: 10px sans-serif;}</style> Then, we need to refer the d3js library. <script src = "http://d3js.org/d3.v3.min.js"></script> Then, we need to define the width and height parameters for the svg container and include into the body tag. var width = 1100,height = 800var svg = d3.select("body").append("svg").attr("width", width).attr("height", height); Now, we define the properties of the force layout such as gravity, distance, and size. var force = d3.layout.force().gravity(.05).distance(150).charge(-100).size([width, height]); Then, we need to acquire the data of the graph using the JSON format. We will configure the parameters for nodes and links. d3.json("newJson.json", function(error, json) {force.nodes(json.nodes).links(json.links).start(); For a complete reference about the d3js Force Layout implementation, visit the link https://github.com/mbostock/d3/wiki/Force-Layout. Then, we define the links as lines from the json data. var link = svg.selectAll(".link").data(json.links).enter().append("line").attr("class", "link");var node = svg.selectAll(".node").data(json.nodes).enter().append("g").attr("class", "node").call(force.drag); Now, we define the node as circles of size 6 and include the labels of each node. node.append("circle").attr("r", 6);node.append("text").attr("dx", 12).attr("dy", ".35em").text(function(d) { return d.name }); Finally, with the function, tick, run step-by-step the force layout simulation. force.on("tick", function(){link.attr("x1", function(d) { return d.source.x; }).attr("y1", function(d) { return d.source.y; }).attr("x2", function(d) { return d.target.x; }).attr("y2", function(d) { return d.target.y; });node.attr("transform", function(d){return "translate(" + d.x + "," + d.y + ")";})});});</script> In the image below we can see the result of the visualization. In order to run the visualization we just need to open a Command Terminal and run the following Python command or any other web server. >>python –m http.server 8000 Then you just need to open a web browser and type the direction http://localhost:8000/ForceGraph.html. In the HTML page we can see our Facebook graph with a gravity effect and we can interactively drag-and-drop the nodes. All the codes and datasets of this article may be found in the author github repository in the link below.https://github.com/hmcuesta/PDA_Book/tree/master/Chapter10 Summary In this article we developed our own social graph visualization tool with D3js, transforming the data obtained from Netvizz with GDF format into JSON. Resources for Article: Further resources on this subject: GNU Octave: Data Analysis Examples [Article] Securing data at the cell level (Intermediate) [Article] Analyzing network forensic data (Become an expert) [Article]

0
0
7273

article-image-interacting-your-visualization

Packt

22 Oct 2013

9 min read

Interacting with your Visualization

Packt

22 Oct 2013

9 min read

(For more resources related to this topic, see here.) The ultimate goal of visualization design is to optimize applications so that they help us perform cognitive work more efficiently. Ware C. (2012) The goal of data visualization is to help the audience gain information from a large quantity of raw data quickly and efficiently through metaphor, mental model alignment, and cognitive magnification. So far in this article we have introduced various techniques to leverage D3 library implementing many types of visualization. However, we haven't touched a crucial aspect of visualization: human interaction. Various researches have concluded the unique value of human interaction in information visualization. Visualization combined with computational steering allows faster analyses of more sophisticated scenarios...This case study adequately demonstrate that the interaction of a complex model with steering and interactive visualization can extend the applicability of the modelling beyond research Barrass I. & Leng J (2011) In this article we will focus on D3 human visualization interaction support, or as mentioned earlier learn how to add computational steering capability to your visualization. Interacting with mouse events The mouse is the most common and popular human-computer interaction control found on most desktop and laptop computers. Even today, with multi-touch devices rising to dominance, touch events are typically still emulated into mouse events; therefore making application designed to interact via mouse usable through touches. In this recipe we will learn how to handle standard mouse events in D3. Getting ready Open your local copy of the following file in your web browser: https://github.com/NickQiZhu/d3-cookbook/blob/master/src/chapter10/mouse.html How to do it... In the following code example we will explore techniques of registering and handling mouse events in D3. Although, in this particular example we are only handling click and mousemove, the techniques utilized here can be applied easily to all other standard mouse events supported by modern browsers: <script type="text/javascript"> var r = 400; var svg = d3.select("body") .append("svg"); var positionLabel = svg.append("text") .attr("x", 10) .attr("y", 30); svg.on("mousemove", function () { //<-A printPosition(); }); function printPosition() { //<-B var position = d3.mouse(svg.node()); //<-C positionLabel.text(position); } svg.on("click", function () { //<-D for (var i = 1; i < 5; ++i) { var position = d3.mouse(svg.node()); var circle = svg.append("circle") .attr("cx", position[0]) .attr("cy", position[1]) .attr("r", 0) .style("stroke-width", 5 / (i)) .transition() .delay(Math.pow(i, 2.5) * 50) .duration(2000) .ease('quad-in') .attr("r", r) .style("stroke-opacity", 0) .each("end", function () { d3.select(this).remove(); }); } }); </script> This recipe generates the following interactive visualization: Mouse Interaction How it works... In D3, to register an event listener, we need to invoke the on function on a particular selection. The given event listener will be attached to all selected elements for the specified event (line A). The following code in this recipe attaches a mousemove event listener which displays the current mouse position (line B): svg.on("mousemove", function () { //<-A printPosition(); }); function printPosition() { //<-B var position = d3.mouse(svg.node()); //<-C positionLabel.text(position); } On line C we used d3.mouse function to obtain the current mouse position relative to the given container element. This function returns a two-element array [x, y]. After this we also registered an event listener for mouse click event on line D using the same on function: svg.on("click", function () { //<-D for (var i = 1; i < 5; ++i) { var position = d3.mouse(svg.node()); var circle = svg.append("circle") .attr("cx", position[0]) .attr("cy", position[1]) .attr("r", 0) .style("stroke-width", 5 / (i)) // <-E .transition() .delay(Math.pow(i, 2.5) * 50) // <-F .duration(2000) .ease('quad-in') .attr("r", r) .style("stroke-opacity", 0) .each("end", function () { d3.select(this).remove(); // <-G }); } }); Once again, we retrieved the current mouse position using d3.mouse function and then generated five concentric expanding circles to simulate the ripple effect. The ripple effect was simulated using geometrically increasing delay (line F) with decreasing stroke-width (line E). Finally when the transition effect is over, the circles were removed using transition end listener (line G). There's more... Although, we have only demonstrated listening on the click and mousemove events in this recipe, you can listen on any event that your browser supports through the on function. The following is a list of mouse events that are useful to know when building your interactive visualization: click: Dispatched when user clicks a mouse button dbclick: Dispatched when a mouse button is clicked twice mousedown: Dispatched when a mouse button is pressed mouseenter: Dispatched when mouse is moved onto the boundaries of an element or one of its descendent elements mouseleave: Dispatched when mouse is moved off of the boundaries of an element and all of its descendent elements mousemove: Dispatched when mouse is moved over an element mouseout: Dispatched when mouse is moved off of the boundaries of an element mouseover: Dispatched when mouse is moved onto the boundaries of an element mouseup: Dispatched when a mouse button is released over an element Interacting with a multi-touch device Today, with the proliferation of multi-touch devices, any visualization targeting mass consumption needs to worry about its interactability not only through the traditional pointing device, but through multi-touches and gestures as well. In this recipe we will explore touch support offered by D3 to see how it can be leveraged to generate some pretty interesting interaction with multi-touch capable devices. Getting ready Open your local copy of the following file in your web browser: https://github.com/NickQiZhu/d3-cookbook/blob/master/src/chapter10/touch.html. How to do it... In this recipe we will generate a progress-circle around the user's touch and once the progress is completed then a subsequent ripple effect will be triggered around the circle. However, if the user prematurely ends his/her touch, then we shall stop the progress-circle without generating the ripples: <script type="text/javascript"> var initR = 100, r = 400, thickness = 20; var svg = d3.select("body") .append("svg"); d3.select("body") .on("touchstart", touch) .on("touchend", touch); function touch() { d3.event.preventDefault(); var arc = d3.svg.arc() .outerRadius(initR) .innerRadius(initR - thickness); var g = svg.selectAll("g.touch") .data(d3.touches(svg.node()), function (d) { return d.identifier; }); g.enter() .append("g") .attr("class", "touch") .attr("transform", function (d) { return "translate(" + d[0] + "," + d[1] + ")"; }) .append("path") .attr("class", "arc") .transition().duration(2000) .attrTween("d", function (d) { var interpolate = d3.interpolate( {startAngle: 0, endAngle: 0}, {startAngle: 0, endAngle: 2 * Math.PI} ); return function (t) { return arc(interpolate(t)); }; }) .each("end", function (d) { if (complete(g)) ripples(d); g.remove(); }); g.exit().remove().each(function () { this.__stopped__ = true; }); } function complete(g) { return g.node().__stopped__ != true; } function ripples(position) { for (var i = 1; i < 5; ++i) { var circle = svg.append("circle") .attr("cx", position[0]) .attr("cy", position[1]) .attr("r", initR - (thickness / 2)) .style("stroke-width", thickness / (i)) .transition().delay(Math.pow(i, 2.5) * 50) .duration(2000).ease('quad-in') .attr("r", r) .style("stroke-opacity", 0) .each("end", function () { d3.select(this).remove(); }); } } </script> This recipe generates the following interactive visualization on a touch enabled device: Touch Interaction How it works... Event listener for touch events are registered through D3 selection's on function similar to what we have done with mouse events in the previous recipe: d3.select("body") .on("touchstart", touch) .on("touchend", touch); One crucial difference here is that we have registered our touch event listener on the body element instead of the svg element since with many OS and browsers there are default touch behaviors defined and we would like to override it with our custom implementation. This is done through the following function call: d3.event.preventDefault(); Once the touch event is triggered we retrieve multiple touch point data using the d3.touches function as illustrated by the following code snippet: var g = svg.selectAll("g.touch") .data(d3.touches(svg.node()), function (d) { return d.identifier; }); Instead of returning a two-element array as what d3.mouse function does, d3.touches returns an array of two-element arrays since there could be multiple touch points for each touch event. Each touch position array has data structure that looks like the following: Touch Position Array Other than the [x, y] position of the touch point each position array also carries an identifier to help you differentiate each touch point. We used this identifier here in this recipe to establish object constancy. Once the touch data is bound to the selection the progress circle was generated for each touch around the user's finger: g.enter() .append("g") .attr("class", "touch") .attr("transform", function (d) { return "translate(" + d[0] + "," + d[1] + ")"; }) .append("path") .attr("class", "arc") .transition().duration(2000).ease('linear') .attrTween("d", function (d) { // <-A var interpolate = d3.interpolate( {startAngle: 0, endAngle: 0}, {startAngle: 0, endAngle: 2 * Math.PI} ); return function (t) { return arc(interpolate(t)); }; }) .each("end", function (d) { // <-B if (complete(g)) ripples(d); g.remove(); }); This is done through a standard arc transition with attribute tweening (line A). Once the transition is over if the progress-circle has not yet been canceled by the user then a ripple effect similar to what we have done in the previous recipe was generated on line B. Since we have registered the same event listener touch function on both touchstart and touchend events, we can use the following lines to remove progress-circle and also set a flag to indicate that this progress circle has been stopped prematurely: g.exit().remove().each(function () { this.__stopped__ = true; }); We need to set this stateful flag since there is no way to cancel a transition once it is started; hence, even after removing the progress-circle element from the DOM tree the transition will still complete and trigger line B. There's more... We have demonstrated touch interaction through the touchstart and touchend events; however, you can use the same pattern to handle any other touch events supported by your browser. The following list contains the proposed touch event types recommended by W3C: touchstart: Dispatched when the user places a touch point on the touch surface touchend: Dispatched when the user removes a touch point from the touch surface touchmove: Dispatched when the user moves a touch point along the touch surface touchcancel: Dispatched when a touch point has been disrupted in an implementation-specific manner

0
0
1091

article-image-installing-mariadb-windows-and-mac-os-x

Packt

22 Oct 2013

5 min read

Installing MariaDB on Windows and Mac OS X

Packt

22 Oct 2013

5 min read

(For more resources related to this topic, see here.) Installing MariaDB on Windows There are two types of MariaDB downloads for Windows: ZIP files and MSI packages. As mentioned previously, the ZIP files are similar to the Linux binary .tar.gz files and they are only recommended for experts who know they want it. If we are starting out with MariaDB on Windows, it is recommended to use the MSI packages. Here are the steps to do just that: Download the MSI package from https://downloads.mariadb.org/. First click on the series we want (stable, most likely), then locate the Windows 64-bit or Windows 32-bit MSI package. For most computers, the 64-bit MSI package is probably the one that we want, especially if we have more than 4 Gigabytes of RAM. If you're unsure, the 32-bit package will work on both 32-bit and 64-bit computers. Once the download has finished, launch the MSI installer by double-clicking on it. Depending on our settings we may be prompted to launch it automatically. The installer will walk us through installing MariaDB. If we are installing MariaDB for the first time, we must be sure to set the root user password when prompted. Unless we need to, don't enable access from remote machines for the root user or create an anonymous account. The Install as service box is checked by default, and it's recommended to keep it that way so that MariaDB starts up when the computer is booted. The Service Name textbox has the default value MySQL for compatibility reasons, but we can rename it if we like. Check the Enable networking option, if you need to access the databases from a different computer. If we don't it's best to uncheck this box. As with the service name, there is a default TCP port number (3306) which you can change if you want to, but it is usually best to stick with the default unless there is a specific reason not to. The Optimize for transactions checkbox is checked by default. This setting can be left as is. There are other settings that we can make through the installer. All of them can be changed later by editing the my.ini file, so we don't have to worry about setting them right away. If our version of Windows has User Account Control enabled, there will be a pop-up during the installation asking if we want to allow the installer to install MariaDB. For obvious reasons, click on Yes. After the installation completes, there will be a MariaDB folder added to the start menu. Under this will be various links, including one to the MySQL Client. If we already have an older version of MariaDB or MySQL running on our machine, we will be prompted to upgrade the data files for the version we are installing, it is highly recommended that we do so. Eventually we will be presented with a dialog box with an installation complete message and a Finish button. If you got this far, congratulations! MariaDB is now installed and running on your Windows-based computer. Click on Finish to quit the installer. Installing MariaDB on Mac OS X One of the easiest ways to install MariaDB on Mac OS X is to use Homebrew, which is an Open Source package manager for that platform. Before you can install it, however, you need to prepare your system. The first thing you need to do is install Xcode; Apple's integrated development environment. It's available for free in the Mac App Store. Once Xcode is installed you can install brew. Full instructions are available on the Brew Project website at http://mxcl.github.io/homebrew/ but the basic procedure is to open a terminal and run the following command: ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)" This command downloads the installer and runs it. Once the initial installation is completed, we run the following command to make sure everything is set up properly: brew doctor The output of the doctor command will tell us of any potential issues along with suggestions for how to fix them. Once brew is working properly, you can install MariaDB with the following commands: brew update brew install mariadb Unlike on Linux and Windows, brew does not automatically set up or offer to set up MariaDB to start automatically when your system boots or start MariaDB after installation. To do so, we perform the following command: ln -sfv /usr/local/opt/mariadb/*.plist ~/Library/LaunchAgents launchctl load ~/Library/LaunchAgents/homebrew.mxcl.mariadb.plist To stop MariaDB, we use the unload command as follows: launchctl unload ~/Library/LaunchAgents/homebrew.mxcl.mariadb.plist Summary In this article, we learned how to install MariaDB on Windows and Mac OS X. Resources for Article: Further resources on this subject: Ruby with MongoDB for Web Development [Article] So, what is MongoDB? [Article] Schemas and Models [Article]

0
0
3996

Packt

21 Oct 2013

4 min read

Drag

Packt

21 Oct 2013

4 min read

(For more resources related to this topic, see here.) I can't think of a better dragging demonstration than animating with the parallax illusion. The illusion works by having several keyframes rendered in vertical slices and dragging a screen over them to create an animated thingamabob. Drawing the lines by hand would be tedious, so we're using an image Marco Kuiper created in Photoshop. I asked on Twitter and he said we can use the image, if we check out his other work at marcofolio.net. You can also get the image in the examples repository at https://raw.github.com/Swizec/d3.js-book-examples/master/ch4/parallax_base.png. We need somewhere to put the parallax: var width = 1200, height = 450, svg = d3.select('#graph') .append('svg') .attr({width: width, height: height}); We'll use SVG's native support for embedding bitmaps to insert parallax_base.png into the page: svg.append('image') .attr({'xlink:href': 'parallax_base.png', width: width, height: height}); The image element's magic stems from its xlink:href attribute. It understands links and even lets us embed images to create self-contained SVGs. To use that, you would prepend an image MIME type to a base64 encoded representation of the image. For instance, the following line is the smallest embedded version of a spacer GIF. Don't worry if you don't know what a spacer GIF is; they were useful up to about 2005. data:image/gif;base64,R0lGODlhAQABAID/ AMDAwAAAACH5BAEAAAAALAAAAAABAAEAAAICRAEAOw== Anyway, now that we have the animation base, we need a screen that can be dragged. It's going to be a bunch of carefully calibrated vertical lines: var screen_width = 900, lines = d3.range(screen_width/6), x = d3.scale.ordinal().domain(lines).rangeBands([0, screen_width]); We'll base the screen off an array of numbers (lines). Since line thickness and density are very important, we divide screen_width by 6—five pixels for a line and one for spacing. Make sure the value of screen_width is a multiple of 6; otherwise anti-aliasing ruins the effect. The x scale will help us place the lines evenly: svg.append('g') .selectAll('line') .data(lines) .enter() .append('line') .style('shape-rendering', 'crispEdges') .attr({stroke: 'black', 'stroke-width': x.rangeBand()-1, x1: function (d) { return x(d); }, y1: 0, x2: function (d) { return x(d); }, y2: height}); There's nothing particularly interesting here, just stuff you already know. The code goes through the array and draws a new vertical line for each entry. We made absolutely certain there won't be any anti-aliasing by setting shape-rendering to crispEdges. Time to define and activate a dragging behavior for our group of lines: var drag = d3.behavior.drag().origin(Object).on('drag', function () {d3.select(this).attr('transform', 'translate('+d3.event.x+', 0)').datum({x: d3.event.x, y: 0});}); We created the behavior with d3.behavior.drag(), deﬁned a .origin() accessor, and speciﬁed what happens on drag. The behavior automatically translates touch and mouse events to the higher-level drag event. How cool is that! We need to give the behavior an origin so it knows how to calculate positions relatively; otherwise, the current position is always set to the mouse cursor and objects jump around. It's terrible. Object is the identity function for elements and assumes a datum with x and y coordinates. The heavy lifting happens inside the drag listener. We get the screen's new position from d3.event.x, move the screen there, and update the attached .datum() method. All that's left to do is to call drag and make sure to set the attached datum to the current position: svg.select('g') .datum({x: 0, y: 0}) .call(drag); The item looks solid now! Try dragging the screen at different speeds. The parallax effect doesn't work very well on a retina display because the base image gets resized and our screen loses calibration. Summary In this article, we looked into the drag behavioud of d3. All this can be done by with just click events, but I heartily recommend d3's behaviors module. It makes complex behaviors is that they automatically create relevant event listeners and let you work at a higher level of abstraction. Resources for Article: Further resources on this subject: Visualizing Productions Ahead of Time with Celtx [Article] Custom Data Readers in Ext JS [Article] The Login Page using Ext JS [Article]

0
0
1272

Packt

18 Oct 2013

5 min read

Using the Spark Shell

Packt

18 Oct 2013

5 min read

0
0
3350

Packt

17 Oct 2013

12 min read

KNIME terminologies

Packt

17 Oct 2013

12 min read

(For more resources related to this topic, see here.) Organizing your work In KNIME, you store your files in a workspace. When KNIME starts, you can specify which workspace you want to use. The workspaces are not just for fi les; they also contain settings and logs. It might be a good idea to set up an empty workspace, and instead of customizing a new one each time, you start a new project; you just copy (extract) it to the place you want to use, and open it with KNIME (or switch to it). The workspace can contain workflow groups (sometimes referred to as workflow set) or workflows. The groups are like folders in a filesystem that can help organize your work flows. Workflows might be your programs and processes that describe the steps which should be applied to load, analyze, visualize, or transform the data you have, something like an execution plan. Work flows contain the executable parts, which can be edited using the workflow editor, which in turn is similar to a canvas. Both the groups and the workflows might have metadata associated with them, such as the creation date, author, or comments (even the workspace can contain such information). Workflows might contain nodes, meta nodes, connections, work flow variables (or just flow variables), work flow credentials, and annotations besides the previously introduced metadata. Workflow credentials is the place where you can store your login name and password for different connections. These are kept safe, but you can access them easily. It is safe to share a work flow if you use only the work flow credentials for sensitive information (although the user name will be saved). Nodes Each node has a type, which identifies the algorithm associated with the node. You can think of the type as a template; it specifies how to execute for different inputs and parameters, and what should be the result. The nodes are similar to functions (or operators) in programs. The node types are organized according to the following general types, which specify the color and the shape of the node for easier understanding of work flows. The general types are shown in the following image: Example representation of different general types of nodes The nodes are organized in categories; this way, it is easier to find them. Each node has a node documentation that describes what can be achieved using that type of node, possibly use cases or tips. It also contains information about parameters and possible input ports and output ports. (Sometimes the last two are called inports and outports, or even in-ports and out-ports.) Parameters are usually single values (for example, filename, column name, text, number, date, and so on) associated with an identifier; although, having an array of texts is also possible. These are the settings that influence the execution of a node. There are other things that can modify the results, such as work flow variables or any other state observable from KNIME. Node lifecycle Nodes can have any of the following states: Misconfigured (also called IDLE) Configured Queued for execution Running Executed There are possible warnings in most of the states, which might be important; you can read them by moving the mouse pointer over the triangle sign. Meta nodes Meta nodes look like normal nodes at first sight, although they contain other nodes (or meta nodes) inside them. The associated context of the node might give options for special execution. Usually they help to keep your work flow organized and less scary at first sight. A user-defined meta node Ports The ports are where data in some form flows through from one node to another. The most common port type is the data table. These are represented by white triangles. The input ports (where data is expected to get into) are on the left-hand side of the nodes, but the output ports (where the created data comes out) are on the right-hand side of the nodes. You cannot mix and match the different kinds of ports. It is also not allowed to connect a node's output to its input or create circles in the graph of nodes; you have to create a loop if you want to achieve something similar to that. Currently, all ports in the standard KNIME distribution are presenting the results only when they are ready; although the infrastructure already allows other strategies, such as streaming, where you can view partial results too. The ports might contain information about the data even if their nodes are not yet executed. Data tables These are the most common form of port types. It is similar to an Excel sheet or a data table in the database. Sometimes these are named example set or data frame. Each data table has a name, a structure (or schema, a table specification), and possibly properties. The structure describes the data present in the table by storing some properties about the columns. In other contexts, columns may be called attributes, variables, or features. A column can only contain data of a single type (but the types form a hierarchy from the top and can be of any type). Each column has a type , a name, and a position within the table. Besides these, they might also contain further information, for example, statistics about the contained values or color/shape information for visual representation. There is always something in the data tables that looks like a column, even if it is not really a column. This is where the identifiers for the rows are held, that is, the row keys. There can be multiple rows in the table, just like in most of the other data handling software (similar to observations or records). The row keys are unique (textual) identifiers within the table. They have multiple roles besides that; for example, usually row keys are the labels when showing the data, so always try to find user-friendly identifiers for the rows. At the intersection of rows and columns are the (data) cells , similar to the data found in Excel sheets or in database tables (whereas in other contexts, it might refer to the data similar to values or fi elds). There is a special cell that represents the missing values. The missing value is usually represented as a question mark (?). If you have to represent more information about the missing data, you should consider adding a new column for each column, where this requirement is present, and add that information; however, in the original column, you just declare it as missing. There are multiple cell types in KNIME, and the following table contains the most important ones: Cell type Symbol Remarks Int cell I This represents integral numbers in the range from -231 to 231-1 (approximately 2E9). Long cell L This represents larger integral numbers, and their range is from -263 to 263-1 (approximately 9E18). Double cell D This represents real numbers with double (64 bit) floating point precision. String cell S This represents unstructured textual information. Date and time cell calendar & clock With these cells, you can store either date or time. Boolean cell B This represents logical values from the Boolean algebra (true or false); note that you cannot exclude the missing value. Xml cell XML This cell is ideal for structured data. Set cell {...} This cell can contain multiple cells (so a collection cell type) of the same type (no duplication or order of values are preserved). List cell {...} This is also a collection cell type, but this keeps the order and does not filter out the duplicates. Unknown type cell ? When you have different type of cells in a column (or in a collection cell), this is the generic cell type used. There are other cell types, for example, the ones for chemical data structures (SMILES, CDK, and so on), for images (SVG cell, PNG cell, and so on), or for documents. This is extensible, so the other extension can defi ne custom data cell types. Note that any data cell type can contain the missing value. Port view The port view allows you to get information about the content of the port. Complete content is available only after the node is executed, but usually some information is available even before that. This is very handy when you are constructing the workflow. You can check the structure of the data even if you will usually use node view in the later stages of data exploration during work flow construction. Flow variables Workflows can contain flow variables, which can act as a loop counter, a column name, or even an expression for a node parameter. These are not constants, but you can introduce them to the workspace level as well. This is a powerful feature; once you master it, you can create workflows you thought were impossible to create using KNIME. A typical use case for them is to assign roles to different columns (by assigning the column names to the role name as a flow variable) and use this information for node configurations. If your work flow has some important parameters that should be adjusted or set before each execution (for example a filename), this is an ideal option to provide these to the user; use the flow variables instead of a preset value that is hard to find. As the automatic generation of figures gets more support, the flow variables will find use there too. Iterating a range of values or files in a folder should also be done using flow variables. Node views Nodes can also have node views associated with them. These help to visualize your data or a model, show the node's internal state, or select a subset of the data using the HiLite feature. An important feature exists that a node's views can be opened multiple times. This allows us to compare different options of visualization without taking screenshots or having to remember what was it like, and how you reached that state. You can export these views to image fi les. HiLite The HiLite feature of KNIME is quite unique. Its purpose is to help identify a group of data that is important or interesting for some reason. This is related to the node views, as this selection is only visible in nodes with node views (for example, it is not available in port views). Support for data high lighting is optional, because not all views support this feature. The HiLite selection data is based on row keys, and this information can be lost when the row keys change. For this reason, some of the nonview nodes also have an option to keep this information propagated to the adjacent nodes. On the other hand, when the row keys remain the same, the marks in different views point to the same data rows. It is very important that the HiLite selection is only visible in a well-connected subgraph of work flow. It can also be available for non-executed nodes (for example, the HiLite Collector node). The HiLite information is not saved in the workflow, so you should use the HiLite filter node once you are satisfied with your selection to save that state, and you can reset that HiLite later. Eclipse concepts Because KNIME is based on the Eclipse platform (http://eclipse.org), it inherits some of its features too. One of them is the workspace model with projects (work flows in case of KNIME), and another important one is modularity. You can extend KNIME's functionality using plugins and features; sometimes these are named KNIME extensions. The extensions are distributed through update sites , which allow you to install updates or install new software from a local folder, a zip fi le, or an Internet location. The help system, the update mechanism (with proxy settings), or the fi le search feature are also provided by Eclipse. Eclipse's main window is the workbench. The most typical features are the perspectives and the views. Perspectives are about how the parts of the UI are arranged, while these independently configurable parts are the views. These views have nothing to do with node views or port views. The Eclipse/KNIME views can be detached, closed, moved around, minimized, or maximized within the window. Usually each view can have at most one instance visible (the Console view is an exception). KNIME does not support alternative perspectives (arrangements of views), so it is not important for you; however, you can still reset it to its original state. It might be important to know that Eclipse keeps the contents of fi les and folders in a special form. If you generate files, you should refresh the content to load it from the filesystem. You can do this from the context menu, but it can also be automated if you prefer that option. Preferences The preferences are associated with the workspace you use. This is where most of the Eclipse and KNIME settings are to be specified. The node parameters are stored in the workflows (which are also within the workspace), and these parameters are not considered to be preferences. Logging KNIME has something to tell you about almost every action. Usually, you do not care to read these logs, you do not need to do so. For this reason, KNIME dispatches these messages using different channels. There is a file in the workplace that collects all the messages by default with considerable details. There is even a KNIME/Eclipse view named Console, which contains only the most important details initially. Summary In this article we went through the most important terminologies and concepts you will use when using KNIME. Resources for Article : Further resources on this subject: Visualizing Productions Ahead of Time with Celtx [Article] N-Way Replication in Oracle 11g Streams: Part 1 [Article] Sage: 3D Data Plotting [Article]

0
0
1629

article-image-working-basic-components-make-threejs-scene

Packt

17 Oct 2013

22 min read

Working with the Basic Components That Make Up a Three.js Scene

Packt

17 Oct 2013

22 min read

0
0
7188

Packt

11 Oct 2013

10 min read

Administrating Solr

Packt

11 Oct 2013

10 min read

0
0
1515

article-image-creating-network-graphs-gephi

Packt

08 Oct 2013

10 min read

Creating Network Graphs with Gephi

Packt

08 Oct 2013

10 min read

(For more resources related to this topic, see here.) Basic network graph terminology Network graphs are essentially based on the construct of nodes and edges. Nodes represent points or entities within the data, while edges refer to the connections or lines between nodes. Individual nodes might be students in a school, or schools within an educational system, or perhaps agencies within a government structure. Individual nodes may be represented through equal sizes, but can also be depicted as smaller or larger based on the magnitude of a selected measure. For example, a node with many connections may be portrayed as far larger and thus more influential than a sparsely connected node. This approach will provide viewers with a visual cue that shows them where the highest (and lowest) levels of activity occur within a graph. Nodes will generally be positioned based on the similarity of their individual connections, leading to clusters of individual nodes within a larger network. In most network algorithms, nodes with higher levels of connections will also tend to be positioned near the center of the graph, while those with few connections will move toward the perimeter of the display. Edges are the connections between nodes, and may be displayed as undirected or directed paths. Undirected relationships indicate a connection that flows in both directions, while a directed relationship moves in a single direction that always originates in one node and moves toward another node. Undirected connections will tend to predominate in most cases, such as in social networks where participant activity flows in both directions. On occasion, we will see directed connections, as in the case of some transportation or technology networks where there are connections that flow in a single direction. Edges may also be weighted, to show the strength of the connection between nodes. In the case where University A has performed research with both University B and University C, the strength (width) of the edge will show viewers where the stronger relationship exists. If A and B have combined for three projects, while A and C have collaborated on 9 projects, we should weight the A to C connection three times that of the A to B connection. Another commonly used term is neighbors, which is nothing more than a node that is directly connected to a second node. Neighbors can be stated to be one degree apart. Degrees is the term used to refer to the number of connections flowing into (or away from) a node (also known as Degree Centrality), as well as to the number of connections required to connect to another node via the shortest possible path. In complex graphs, you may find nodes that are four, five, or even more degrees away from a distant node, and in some cases two nodes may not be connected at all. Now that you have a very basic understanding of network graph theory, let's learn about some of the common network graph algorithms. Common network graph algorithms Before we introduce you to some specific graph algorithms, we'll briefly discuss some of the theory behind network graphs and introduce you to a few of the terms you will frequently encounter. Network graphs are drawn through positioning nodes and their respective connections relative to one another. In the case of a graph with 8 or 10 nodes, this is a rather simple exercise, and could probably be drawn rather accurately without the help of complex methodologies. However, in the typical case where we have hundreds of nodes with thousands of edges, the task becomes far more complex. Some of the more prominent graph classes in Gephi include the following: Force-directed algorithms refer to a category of graphs that position elements based on the principles of attraction, repulsion, and gravity Circular algorithms position graph elements around the perimeter of one or more circles, and may allow the user to dictate the order of the elements based on data properties Dual circle layouts position a subset of nodes in the interior of the graph with the remaining nodes around the diameter, similar to a single circular graph Concentric layouts arrange the graph nodes using an approximately circular graph design, with less connected nodes at the perimeter of the graph and highly connected nodes typically near the center Radial axis layouts provide the user with the ability to determine some portion of the graph layout by defining graph attributes The type of graph you select may well be dictated by the sort of results you seek. If you wish to feature certain groups within your dataset, one of the layouts that allows you to segment the graph by groups will provide a potentially quick solution to your needs. In this instance, one of the circular or radial axis graphs may prove ideal. On the other hand, if you are attempting to discover relationships in a complex new dataset, one of the several available Force-directed layouts is likely a better choice. These algorithms will rely on the values in your dataset to determine the positioning within the graph. When choosing one of these approaches, please note that there will often be an extensive runtime to calculate the graph layout, especially as the data becomes more complex. Even on a powerful computer, examples may run for minutes or hours in an attempt to fully optimize the graph. Fortunately, you will have the ability in Gephi to stop these algorithms at any given point, and you will still have a very reasonable, albeit imperfect graph. In the next section, we'll look at a few of the standard layouts that are part of the Gephi base package. Standard network graph layouts Now that you are somewhat familiar with the types of layout algorithms, we'll take a look at what Gephi offers within the Layout tab. We'll begin with some common Force-directed approaches, and then examine some of the other choices. One of the best known force algorithms is Force Atlas, which in Gephi provides users with multiple options for drawing the graph. Foremost among these settings are Repulsion, Attraction, and Gravity settings. Repulsion strength adjustments will make individual nodes either more or less sensitive to other nodes they differ from. A higher repulsion level, for example, will push these nodes further apart. Conversely, setting the Attraction strength higher will force related nodes closer together. Finally, the Gravity setting will draw nodes closer to the center of the graph if it is set to a high level, and disperse them toward the edges if a very low value is set. Force Atlas 2 is another layout option that employs slightly different parameters than the original Force Atlas method. You may wish to compare these methods and determine which one gives you better results. Fruchterman Reingold is one more Force method; albeit one that provides you with just three parameters – Area, Gravity, and Speed. While the general approach is not unlike the Force Atlas algorithms, your results will appear different in a visual sense. Finally, Gephi provides three Yifan Hu methods. Each of these models—Yifan Hu, Yifan Hu Proportional, and Yifan Hu Multilevel, are likely to run much more rapidly than the methods discussed earlier, while providing generally similar results. Gephi also provides a variety of methods that do not employ the force approach. Some of the models, as we noted earlier in this article, provide you with more control over the final result. This may be the result of selecting how to order the nodes, or of which attributes to use in grouping nodes together, either through color or location. In the section above, I referenced several layout options, but in the interest of space we'll take a closer look at two of them—the Circular and Radial Axis layouts. Circular layouts are well suited to relatively small networks, given the limited flexibility of their fixed layout. We can adjust this to some degree by specifying the diameter of the graph, but anything more than a few dozen well-connected nodes often becomes difficult to manage. However, with smaller networks, these layouts can be intriguing, providing us with the ability to see patterns within and between specific groups more easily than we might see them in some other layouts. While this article will not cover any filtering options, those too can be used to help us better utilize the circular layouts, by providing us with the ability to highlight specific groups and their resulting connections. Think of the circle resembling a giant spider web filled with connections, and the filters as tools that help us see specific threads within the web. Our final notes are on Radial Axis layouts, which can provide us with fascinating looks at our data, especially if there are natural groups within the network. Think of a school with several classrooms full of students, for example. Each of these classrooms can be easily identified and grouped, perhaps by color. In a complex force directed graph we may be able to spot each of these groups, but it may become difficult due to the interactions with other classes. In a Radial Axis layout we can dictate the group definitions, forcing each group to be bound together, apart from any other groups. There are pros and cons to this approach, of course, as there are with any of the other methods. If we wish to understand how a specific group interacts with another group, this method can prove beneficial, as it isolates these groups visually, making it easier to see connections between them. On the negative side, it is often quite difficult to see connections between members within the group, due to the nature of the layout. As with any layouts, it is critical to look at the results and see how they apply to our original need. Always test your data using multiple layout algorithms, so that you wind up with the best possible approach. Summary Gephi is an ideal tool for users new to network graph analysis and visualization, as it provides a rich set of tools to create and customize network graphs. The user interface makes it easy to understand basic concepts such as nodes and edges, as well as descriptive terminology such as neighbors, degrees, repulsion, and attraction. New users can move as slowly or as rapidly as they wish, given Gephi's gentle learning curve. Gephi can also help you see and understand patterns within your data through a variety of sophisticated graph methods that will appeal to both the novice as well as seasoned users. The variety of sophisticated layout algorithms will provide you the opportunity to experiment with multiple layouts as you search for the best approach to display your data. In short, Gephi provides everything needed to produce first-rate network visualizations. Resources for Article: Further resources on this subject: OpenSceneGraph: Advanced Scene Graph Components [Article] Cacti: Using Graphs to Monitor Networks and Devices [Article] OpenSceneGraph: Managing Scene GraphOpenSceneGraph: Managing Scene Graph [Article]

0
0
4533

Packt

04 Oct 2013

7 min read

CQL for client applications

Packt

04 Oct 2013

7 min read

(For more resources related to this topic, see here.) Using the Thrift API The Thrift library is based on the Thrift RPC protocol. High-level clients built over it have been a standard way of building an application for a long time. In this section, we'll explain how to write a client application using CQL as the query language and thrift as the Java API. When we start Cassandra, by default it listens to Thrift clients (start_rpc: true property in the CASSANDRA_HOME/conf/cassandra.yaml file enables this). Let's build a small program that connects to Cassandra using the Thrift API, and runs CQL 3 queries for reading/writing data in the UserProfiles table we created for the facebook application. The program can be built by performing the following steps: For downloading the Thrift Library, you need to enter apache-assandra-thrift-1.2.x.jar (which is to be found in the CASSANDRA_HOME/lib folder) into your classpath. If your Java project is mavenized, you need to insert the following entry in pom.xml under the dependency section (version will vary depending upon your Cassandra server installation): <dependency> <groupId>org.apache.cassandra</groupId> <artifactId>cassandra-thrift</artifactId> <version>1.2.5</version> </dependency> For connecting to the Cassandra server on a given host and port, you need to open org.apache.thrift.transport.TTransport to the Cassandra node and create an instance of org.apache.cassandra.thrift.Cassandra.Client as follows: TTransport transport = new TFramedTransport(new TSocket("localhost", 9160)); TProtocol protocol = new TBinaryProtocol(transport); Cassandra.Client client = new Cassandra.Client(protocol); transport.open(); client.set_cql_version("3.0.0"); The default CQL version for Thrift is 2.0.0. You must set it to 3.0.0 if you are writing CQL 3 queries and don't want to see any version related errors. After you are done with transport, close it gracefully (usually at the end of read/write operations) as follows: transport.close(); Creating a schema : The executeQuery() utility method accepts String CQL 3 query and runs it: CqlResult executeQuery(String query) throws Exception { return client.execute_cql3_query(ByteBuffer.wrap(query.getBytes("UTF-8")), Compression.NONE, ConsistencyLevel.ONE); } Now, create keyspace and the table by directly executing CQL 3 query: //Create keyspace executeQuery("CREATE KEYSPACE facebook WITH replication = "{'class':'SimpleStrategy','replication_factor':3};"); executeQuery("USE facebook;"); //Create table executeQuery("CREATE TABLE UserProfiles(" +"email_id text," + "password text,"+ "name text," + "age int," + "profile_picture blob," + "PRIMARY KEY(email_id)" + ");" ); Reading/writing data: A couple of records can be inserted as follows: executeQuery("USE facebook;"); executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('[email protected]','p4ssw0rd',' John Smith',32,0x8e37);"); executeQuery("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('[email protected]','guess1t',' David Bergin',42,0xc9f1);"); Executing the SELECT query returns CQLResult, on which we can iterate easily to fetch records: CqlResult result = executeQuery("SELECT * FROM facebook.UserProfiles " + "WHERE email_id = '[email protected]';"); for (CqlRow row : result.getRows()) { System.out.println(row.getKey(); } Using the Datastax Java driver The Datastax Java driver is based on the Cassandra binary protocol that was introduced in Cassandra 1.2, and works only with CQL 3. The Cassandra binary protocol is specifically made for Cassandra in contrast to Thrift, which is a generic framework and has many limitations. Now, we are going to write a Java program that uses the Datastax Java driver for reading/writing data into Cassandra, by performing the following steps: Downloading the driver library : This driver library JAR file must be in your classpath in order to build an application using it. If you have a maven-based Java project, you need to insert the following entry into the pom.xml file under the dependeny section: <dependency> <groupId>com.datastax.cassandra</groupId> <artifactId>cassandra-driver-core</artifactId> <version>1.0.1</version> </dependency> This driver project is hosted on Github: (https://github.com/datastax/java-driver). It makes sense to check and download the latest version. Configuring Cassandra to listen to native clients : In the newer version of Cassandra, this would be enabled by default and Cassandra will listen to clients using binary protocol. But the earlier Cassandra installations may require enabling this. All you have to do is to check and enable the start_native_transport property into the CASSANDRA_HOME/conf/Cassandra.yaml file by inserting/uncommenting the following line: start_native_transport: true The port that Cassandra will use for listening to native clients is determined by the native_transport_port property. It is possible for Cassandra to listen to both Thrift and native clients simultaneously. If you want to disable Thrift, just set the start_rpc property to false in CASSANDRA_HOME/conf/Cassandra.yaml. Connecting to Cassandra : The com.datastax.driver.core.Cluster class is the entry point for clients to connect to the Cassandra cluster: Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build(); After you are done with using it (usually when application shuts down), close it gracefully: cluster.shutdown(); Creating a session : An object of com.datastax.driver.core.Session allows you to execute a CQL 3 statement. The following line creates a Session instance: Session session = cluster.connect(); Creating a schema : Before reading/writing data, let's create a keyspace and a table similar to UserProfiles in the facebook application we built earlier: // Create Keyspace session.execute("CREATE KEYSPACE facebook WITH replication = " + "{'class':'SimpleStrategy','replication_factor':1};"); session.execute("USE facebook"); // Create table session.execute("CREATE TABLE UserProfiles(" + "email_id text," + "password text,"+ "name text," + "age int," + "profile_picture blob," + "PRIMARY KEY(email_id)" + ");" ); Reading/writing data : We can insert a couple of records as follows: session.execute("USE facebook"); session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('[email protected]','p4ssw0rd','John Smith',32,0x8e37);"); session.execute("INSERT INTO UserProfiles(email_id, password, name, age, profile_picture) VALUES('[email protected]','guess1t','David Bergin',42,0xc9f1);"); Finding and printing records : A SELECT query returns an instance of com.datastax.driver.core.ResultSet. You can fetch individual rows by iterating over it using the com.datastax.driver.core.Row object: ResultSet results = session.execute ("SELECT * FROM facebook.UserProfiles " + "WHERE email_id = '[email protected]';"); for (Row row : results) { System.out.println ("Email: " + row.getString("email_id") + "tName: " + row.getString("name")+ "t Age : " + row.getInt("age")); } Deleting records : We can delete a record as follows: session.execute("DELETE FROM facebook.UserProfiles WHERE email_id='[email protected]';"); Using high-level clients In addition to the libraries based on Thrift and binary protocols, some high-level clients are built with the purpose to ease development and provide additional services, such as connection pooling, load balancing, failover, secondary indexing, and so on. Some of them are listed here: Astyanax (https://github.com/Netflix/astyanax): Astyanax is a high-level Java client for Cassandra. It allows you to run both simple and prepared CQL queries. Hector (https://github.com/hector-client/hector): Hector is a high-level client for Cassandra. At the time of writing this book, it supported CQL 2 only (not CQL 3). Kundera (https://github.com/impetus-opensource/Kundera): Kundera is a JPA 2.0-based object datastore mapping library for Cassandra and many other NoSQL datastores. CQL 3 queries are run with Kundera using the native queries as described in JPA specification. Summary From this article, we basically learn about using CQL in queries using three different preceding methods. Resources for Article : Further resources on this subject: Quick start – Creating your first Java application [Article] Apache Cassandra: Libraries and Applications [Article] Getting Started with Apache Cassandra [Article]

0
0
1286

Packt

03 Oct 2013

8 min read

Simple graphs with d3.js

Packt

03 Oct 2013

8 min read

(For more resources related to this topic, see here.) There are many component vendors selling graphing controls. Frequently, these graph libraries are complicated to work with and expensive. When using them, you need to consider what would happen if the vendor went out of business or refused to fix an issue: For simple graphs, such as the one shown in the preceding screenshot, d3.js (http://d3js.org/) brings a number of functions and a coding style that makes creating graphs a snap. Let's create a using d3. The first thing to do is introduce an SVG element to the page. In d3, we'll append an SVG element explicitly: var graph = d3.select("#visualization") .append("svg") .attr("width", 500) .attr("height", 500); d3 relies heavily on the use of method chaining. If you're new to this concept, it is quick to pick up. Each call to a method performs some action and then returns an object, and the next method call operates on this object. So, in this case, the select method returns the div with the id of visualization. Calling append on the selected div adds an SVG element and then returns it. Finally, the attr methods set a property inside the object and then return the object. At first, method chaining may seem odd, but as we move on you'll see that it is a very powerful technique and cleans up the code considerably. Without method chaining, we end up with a lot of temporary variables. Next, we need to find the maximum element in the data array. Previously we might have used a jQuery each loop to find that element. d3 has built-in array functions that make this much cleaner: var maximumValue = d3.max(data, function(item){ return item.value;}); There are similar functions for finding minimums and means. None of the functions are anything you couldn't get by using a JavaScript utility library such as underscore.js (http://underscorejs.org/) or lodash (http://lodash.com/), however, it is convenient to make use of the built-in versions. In the next piece, we make us of are d3's scaling functions: var yScale = d3.scale.linear() .domain([maximumValue, 0]) .range([0, 300]); Scaling functions serve to map from one dataset to another. In our case, we're mapping from the values in our data array to the coordinates in our SVG. We use two different scales here: linear and ordinal. The linear scale is used to map a continuous domain to a continuous range. The mapping will be done linearly, so if our domain contained values between 0 and 10, and our range had values between 0 and 100, a value of 6 would map to 60, 3 to 30, and so forth. It seems trivial, but with more complicated domains and ranges, scales are very helpful. Apart from linear scales, there are power and logarithmic scales that may fit your data better. In our example data, our y values are not continuous, they're not even numeric. For this case we can make use of an ordinal scale: var xScale = d3.scale.ordinal() .domain(data.map(function(item){return item.month;})) .rangeBands([0,500], .1); ordinal scales map a discrete domain into a continuous range. Here, the domain is the list of months and the range the width of our SVG. You'll note that instead of using range we use rangeBands. rangebands splits the range up into chunks each to which each range item is assigned. So, if our domain was {May, June} and the range 0 to 100, from May we would receive a band from 0 to 49 and 50 to 100 from June. You'll also note that rangeBands takes an additional parameter, in our case 0.1. This is a padding value that generates a sort of no man's land between each band. This is ideal for creating a bar or column graph, as we may not want the columns touching each other. The padding parameter can take a value between 0 and 1 as a decimal representation of how much of the range should be reserved for padding. 0.25 would reserve 25% of the range for padding. There are also a family of built-in scales that deal with providing colors. Selecting colors for your visualization can be challenging, as the colors have to be far enough apart to be discernible. If you're color-challenged like me, the scales category10, category20, category20b, and category20c may be for you. You can declare a color scale in the the following manner: var colorScale = d3.scale.category10() .domain(data.map(function(item){return item.month;})); The preceding code will assign a different color to each month out of a set of 10 pre-calculated possible colors. Finally, we need to actually draw our graph: var graphData = graph.selectAll(".bar") .data(data); We select all the .bar elements inside the graph using selectAll. Hang on! There aren't any elements inside the graph, let alone elements that match the .bar selector. Typically, selectAll will return a collection of elements matching the selector just as the $ function does in jQuery. In this case, we're using selectAll as a short-hand method of creating an empty d3 collection that has a data method and can be chained. We next specify a set of data to union with the data from the existing selection of elements. d3 operates on collections objects without using looping techniques. This allows for a more declarative style of programming, but can be difficult to grasp immediately. In effect, we're unioning two sets of data: the currently existing data (found using selectAll) and the new data (provided by the data function). This method of dealing with data allows for easy updates to the data elements, should further elements be added or removed later. When new data elements are added, you can select just those elements by using enter(). This prevents repeating actions on existing data. You don't need to redraw the entire image, just the portions that have been updated with new data. Equally, if there are elements in the new dataset that didn't appear in the old one, they can be selected with exit(). Typically, you want to just remove those elements that can be done by running the following: graphData.exit().remove() When we create elements using the newly generated dataset, the data elements will actually be attached to the newly created DOM elements. Creating the elements involves calling append: graphData.enter() .append("rect") .attr("x", function(item){ return xScale(item.month);}) .attr("y", function(item){ return yScale(item.value);}) .attr("height", function(item){ return 300 - yScale(item.value);}) .attr("width", xScale.rangeBand()) .attr("fill", function(item, index){return colorScale(item.month);}); The following diagram shows how data() works with new and existing dataset: You can see in this chunk of code how useful method chaining has become. It makes the code much shorter and more readable than assigning a series of temporary variables or passing the results into standalone methods. The scales also come on their own here. The x coordinate is found simply by scaling the month we have using the ordinal scale. Because that scale takes into account the number of elements as well as the padding, nothing more complicated is needed. The y coordinate is similarly found using previously defined yScale. Because the origin in an SVG is in the top-left, we have to take the inverse of the scale to calculate the height. Again, this is a place where we wouldn't generally be using a constant except for the brevity of our example. The width of the column is found by asking the xScale for the width of the bands. Finally, we set the color based on the color scale so it appears as follows: Transitions Being able to animate your visualization is a powerful technique for adding that "wow" factor. People are far more likely to stop and pay attention to your visualization if it looks cool. Spending time to make your visualization look nifty can actually have a payback other than getting mad cred from other developers. d3 makes transitions simple by doing most of the heavy lifting for you. The transitions work by creating gradual changes in values of properties: graph.selectAll(".bar") .attr("height", 0) .transition() .attr("height", 50) This code will gradually change the height on a .bar from 0 to 50 px. By default, the transition will take 250ms, but this can be changed by chaining a call to duration and delayed with a call to delay: graph.selectAll(".bar") .attr("height", 0) .transition() .duration(400) .delay(100) .attr("height", 50) This transition will wait 100ms then grow the bar height over a course of 400ms. Transitions can also be used to change non-numeric attributes such as color. I like to use a few transitions during loading to get attention, but they can also be useful when changing the state of the visualization. Even such things as adding new elements using .data can have a transition attached to them. If there is a problem with transitions, it is that they are too easy to use. Be careful that you don't overuse them and overwhelm the user. You should also pay attention to the duration of the transition: if it takes too long to execute, users will lose interest. Summary This article shed light on the features of d3.js that can be used to create simple graphs. It also covered how to add that "wow" factor to your visualizations using transitions. Resources for Article: Further resources on this subject: Introducing QlikView elements [Article] Data sources for the Charts [Article] HTML5 Presentations - creating our initial presentation [Article]

0
0
2253

Packt

30 Sep 2013

5 min read

Testing a Camel application

Packt

30 Sep 2013

5 min read

(For more resources related to this topic, see here.) Let's start testing. Apache Camel comes with the Camel Test Kit: some classes leverage testing framework capabilities and extend with Camel specifics. To test our application, let's add this Camel Test Kit to our list of dependencies in the POM file, as shown in the following code: <dependency> <groupId>org.apache.camel</groupId> <artifactId>camel-test</artifactId> <version>${camel-version}</version> </dependency> At the same time, if you have any JUnit dependency, the best solution would be to delete it for now so that Maven will resolve the dependency and we will get a JUnit version required by Camel. Let's rewrite our main program a little bit. Change the class App as shown in the following code: public class App { public static void main(String[] args) throws Exception { Main m = new Main(); m.addRouteBuilder( new AppRoute() ); m.run(); } static class AppRoute extends RouteBuilder { @Override public void configure() throws Exception { from("stream:in") .to("file:test"); } } } Instead of having an anonymous class extending RouteBuilder, we made it an inner class. That is, we are not going to test the main program. Instead, we are going to test if our routing works as expected, that is, messages from the system input are routed into the files in the test directory. At the beginning of the test, we will delete the test directory and our assertion will be that we have the directory test after we send the message and that it has exactly one file. To simplify the deleting of the directory test at the beginning of the unit test, we will use FileUtils.deleteDirectory from Apache Commons IO. Let's add it to our list of dependencies: <dependency> <groupId>org.apache.commons</groupId> <artifactId>commons-io</artifactId> <version>1.3.2</version> </dependency> In our project layout, we have a file src/test/java/com/company/AppTest.java. This is a unit test that has been created from the Maven artifact that we used to create our application. Now, let's replace the code inside that file with the following code: package com.company; import org.apache.camel.builder.RouteBuilder; import org.apache.camel.test.junit4.CamelTestSupport; import org.apache.commons.io.FileUtils; import org.junit.After; import org.junit.BeforeClass; import org.junit.Test; import java.io.*; public class AppTest extends CamelTestSupport { static PipedInputStream in; static PipedOutputStream out; static InputStream originalIn; @Test() public void testAppRoute() throws Exception { out.write("This is a test message!n".getBytes()); Thread.sleep(2000); assertTrue(new File("test").listFiles().length == 1); } @BeforeClass() public static void setup() throws IOException { originalIn = System.in; out = new PipedOutputStream(); in = new PipedInputStream(out); System.setIn(in); FileUtils.deleteDirectory(new File("test")); } @After() public void teardown() throws IOException { out.close(); System.setIn(originalIn); } @Override public boolean isCreateCamelContextPerClass() { return false; } @Override protected RouteBuilder createRouteBuilder() throws Exception { return new App.AppRoute(); } } Now we can run mvn compile test from the console and see that the test was run and that it is successful: Some important things to take note of in the code of our unit test are as follows: We have extended the CamelTestSupport class for Junit4 (see the package it is imported from). There are also classes that support TestNG and Junit3. We have overridden the method createRouteBuilder() to return RouteBuilder with our routes. We made our test class create CamelContext for each test method (annotated by @Test) by making isCreateCamelContextPerClass return false. System.in has been substituted with a piped stream in the startup() method and has been set back to the original value in the teardown() method. The trick is in doing it before CamelContext is created and started (now you see why we create CamelContext for each test). Also, you may see that after we send the message into the output stream piped to System.in, we made the test thread stop for couple of seconds to ensure that the message passes through the routes into the file. In short, our test running suite overrides System.in with a pipe stream so we can write into System.in from the code and deletes the directory test before the Test class is loaded. After the class is loaded and right before the testAppRoute() method, it creates CamelContext, using routes created by the overridden method createRouteBuilder(). Then it runs the test method which sends bytes of the message into the piped stream so that it gets into System.in where it is read by the Camel (note the n limiting the message). Camel then does what is written in the routes, that is, creates a file in the test directory. To be sure it's done before we do assertions, we make the thread executing the test sleep for 2 seconds. Then, we assert that we do have a file in the test directory at the end of the test. Our test works, but you see that it already gets quite hairy with piping streams and making calls to Thread.sleep()—and that's just the beginning. We haven't yet started using external systems, such as FTP servers, web services, and JMS queues. Another concern is the integration of our application with other systems. Some of them may not have a test environment. In this case, we can't easily control the side effects of our application, messages that it sends and receives from those systems; or how the systems interact with our application. To solve this problem, software developers use mocking. Summary Thus we learned about testing a Camel application in this article. Resources for Article: Further resources on this subject: Migration from Apache to Lighttpd [Article] Apache Felix Gogo [Article] Using the OSGi Bundle Repository in OSGi and Apache Felix 3.0 [Article]

0
0
1908

How-To Tutorials - Data

Low-Level Index Control

Designing Sample Applications and Creating Reports

About Cassandra

Visualizing my Social Graph with d3.js

Interacting with your Visualization

Installing MariaDB on Windows and Mac OS X

Drag

Using the Spark Shell

KNIME terminologies

Working with the Basic Components That Make Up a Three.js Scene

Trending Topics

Administrating Solr

Creating Network Graphs with Gephi

CQL for client applications

Simple graphs with d3.js

Testing a Camel application