
How-To Tutorials - Data


About MongoDB

Packt
27 Nov 2014
17 min read
In this article by Amol Nayak, the author of MongoDB Cookbook, we describe various features of MongoDB. (For more resources related to this topic, see here.)

MongoDB is a document-oriented database and the most popular NoSQL database. The rankings at http://db-engines.com/en/ranking show that MongoDB sits in fifth place overall as of August 2014 and is the first NoSQL product in the list. It is used in production by a long list of companies across various domains, handling terabytes of data efficiently. MongoDB is designed to scale horizontally and cope with increasing data volumes. It is very simple to get started with, is backed by good support from MongoDB, the company behind it, and has a vast array of open source and proprietary tools built around it to improve developer and administrator productivity.

In this article, we will cover the following recipes:

- Single node installation of MongoDB with options from the config file
- Viewing database stats
- Creating an index and viewing plans of queries

Single node installation of MongoDB with options from the config file

Providing options from the command line does the job, but it gets awkward as soon as the number of options grows. We have a clean alternative: provide the startup options from a configuration file rather than as command-line arguments.

Getting ready

We assume that the MongoDB binaries have been downloaded from http://www.mongodb.org/downloads for your host operating system, extracted, and that the bin directory of MongoDB is on the operating system's path variable (this is not mandatory, but it really is convenient).

How to do it…

The /data/mongo/db directory for the database and /logs/ for the logs should be created and present on your filesystem, with the appropriate permissions to write to them. Let's take a look at the steps in detail:

1. Create a config file, which can have any arbitrary name. In our case, let's say we create the file at /conf/mongo.conf. We will then edit the file and add the following lines to it:

    port = 27000
    dbpath = /data/mongo/db
    logpath = /logs/mongo.log
    smallfiles = true

2. Start the Mongo server using the following command:

    > mongod --config /conf/mongo.conf

How it works…

The properties are specified as <property name> = <value>. For all those properties that don't take values, for example the smallfiles option, the value given is the Boolean value true. If you need verbose output, add v=true (or multiple v's to make it more verbose) to the config file. If you already know what the command-line option is, it is pretty easy to guess the name of the property in the file: it is the same as the command-line option, with just the hyphen removed.

Viewing database stats

In this recipe, we will see how to get the statistics of a database.

Getting ready

To find the stats of the database, we need a server up and running; a single node is sufficient. The data on which we will be operating needs to be imported into the database. Once these steps are completed, we are all set to go ahead with this recipe.

How to do it…

We will be using the test database for the purpose of this recipe. It already has the postalCodes collection in it.
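If you need to load the postal code data yourself, a minimal sketch using mongoimport might look like the following (the file name and the CSV format are assumptions, not part of the original recipe):

    $ mongoimport --db test --collection postalCodes --type csv --headerline --file postalCodes.csv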
Let's take a look at the steps in detail:

1. Connect to the server using the Mongo shell by typing the following command from the operating system terminal (it is assumed that the server is listening on port 27017):

    $ mongo

2. On the shell, execute the following command and observe the output:

    > db.stats()

3. Now, execute the following command, but this time with the scale parameter, and observe the output:

    > db.stats(1024)
    {
      "db" : "test",
      "collections" : 3,
      "objects" : 39738,
      "avgObjSize" : 143.32699179626553,
      "dataSize" : 5562,
      "storageSize" : 16388,
      "numExtents" : 8,
      "indexes" : 2,
      "indexSize" : 2243,
      "fileSize" : 196608,
      "nsSizeMB" : 16,
      "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
      },
      "ok" : 1
    }

How it works…

Let us start by looking at the collections field. If you look carefully at the number and also execute the show collections command on the Mongo shell, you will find one extra collection in the stats as compared to the ones listed by the command. The difference is one hidden collection whose name is system.namespaces. You may execute db.system.namespaces.find() to view its contents.

Getting back to the output of the stats operation on the database, the objects field in the result has an interesting value too. If we find the count of documents in the postalCodes collection, we see that it is 39732. The count shown here is 39738, which means there are six more documents. These six documents come from the system.namespaces and system.indexes collections; executing a count query on these two collections will confirm it. Note that the test database doesn't contain any other collection apart from postalCodes. The figures will change if the database contains more collections with documents in them.

The scale parameter, which is a parameter to the stats function, divides the number of bytes by the given scale value. In this case, it is 1024, and hence all the size values are in KB. Let's analyze the output shown earlier. The following table shows the meaning of the important fields:

db: The name of the database whose stats are being viewed.

collections: The total number of collections in the database.

objects: The count of documents across all collections in the database. If we find the stats of a collection by executing db.<collection>.stats(), we get the count of documents in that collection. This attribute is the sum of the counts of all the collections in the database.

avgObjSize: The size (in bytes) of all the objects in all the collections in the database, divided by the count of documents across all the collections. This value is not affected by the scale provided, even though it is a size field.

dataSize: The total size of the data held across all the collections in the database. This value is affected by the scale provided.

storageSize: The total amount of storage allocated to collections in this database for storing documents. This value is affected by the scale provided.

numExtents: The count of all extents in the database across all collections. This is basically the sum of numExtents in the collection stats for the collections in this database.

indexes: The sum of the number of indexes across all collections in the database.

indexSize: The size (in bytes) of all the indexes of all the collections in the database. This value is affected by the scale provided.

fileSize: The sum of the sizes of all the database files you will find on the filesystem for this database. The files are named test.0, test.1, and so on for the test database. This value is affected by the scale provided.

nsSizeMB: The size, in MB, of the .ns file of the database.

One more thing to note is the value of avgObjSize; there is something odd about it. Unlike the same field in a collection's stats, which is affected by the scale provided, in the database stats this value is always in bytes. This is pretty confusing, and one cannot really be sure why it is not scaled according to the provided scale.

Creating an index and viewing plans of queries

In this recipe, we will look at querying data, analyzing its performance by explaining the query plan, and then optimizing it by creating indexes.

Getting ready

For the creation of indexes, we need a server up and running. A simple single node is all we will need. The data with which we will be operating needs to be imported into the database. Once we have this prerequisite, we are good to go.

How to do it…

We will try to write a query that finds all the zip codes in a given state. To do this, perform the following steps:

1. Execute the following query to view the plan of a query:

    > db.postalCodes.find({state:'Maharashtra'}).explain()

   Take a note of the cursor, n, nscannedObjects, and millis fields in the result of the explain plan operation.

2. Let's execute the same query again, but this time limit the results to only 100 documents:

    > db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()

   Again, take a note of the cursor, n, nscannedObjects, and millis fields in the result.

3. We will now create an index on the state and pincode fields as follows:

    > db.postalCodes.ensureIndex({state:1, pincode:1})

4. Execute the following query:

    > db.postalCodes.find({state:'Maharashtra'}).explain()

   Again, take a note of the cursor, n, nscannedObjects, millis, and indexOnly fields in the result.

5. Since we want only the pin codes, we will modify the query as follows and view its plan:

    > db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()

   Take a note of the cursor, n, nscannedObjects, nscanned, millis, and indexOnly fields in the result.

How it works…

There is a lot to explain here. We will first discuss what we just did and how to analyze the stats. Next, we will discuss some points to keep in mind for index creation and some gotchas.

Analysis of the plan

Let's look at the first step and analyze the output of the query we executed:

    > db.postalCodes.find({state:'Maharashtra'}).explain()

The output on my machine is as follows (I am skipping the nonrelevant fields for now):

    {
      "cursor" : "BasicCursor",
      "n" : 6446,
      "nscannedObjects" : 39732,
      "nscanned" : 39732,
      …
      "millis" : 55,
      …
    }

The value of the cursor field in the result is BasicCursor, which means a full collection scan (all the documents are scanned one after another) has happened to search for the matching documents in the entire collection. The value of n is 6446, which is the number of results that matched the query.
The nscanned and nscannedObjects fields have a value of 39,732, which is the number of documents in the collection that were scanned to retrieve the results. This is also the total number of documents present in the collection, and all of them were scanned for the result. Finally, millis is the number of milliseconds taken to retrieve the result.

Improving the query execution time

So far, the query doesn't look too good in terms of performance, and there is great scope for improvement. To demonstrate how the limit applied to the query affects the query plan, we can find the query plan again, without the index but with the limit clause:

    > db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()
    {
      "cursor" : "BasicCursor",
      …
      "n" : 100,
      "nscannedObjects" : 19951,
      "nscanned" : 19951,
      …
      "millis" : 30,
      …
    }

The query plan this time around is interesting. Though we still haven't created an index, we see an improvement in the time the query took to execute and in the number of objects scanned to retrieve the results. This is due to the fact that Mongo does not scan the remaining documents once the number of documents specified in the limit function is reached. We can thus conclude that it is recommended that you use the limit function to restrict the number of results when the maximum number of documents to be accessed is known upfront. This might give better query performance. The word "might" is important, as in the absence of an index, the collection might still be scanned completely if the number of matches is not met.

Improvement using indexes

Moving on, we create a compound index on state and pincode. The order of the index is ascending in this case (as the value is 1); the order is not significant unless we plan to execute a multikey sort. It is, however, a deciding factor as to whether the result can be sorted using only the index or whether Mongo needs to sort it in memory later on, before returning the results. As far as the plan of the query is concerned, we can see that there is a significant improvement:

    {
      "cursor" : "BtreeCursor state_1_pincode_1",
      …
      "n" : 6446,
      "nscannedObjects" : 6446,
      "nscanned" : 6446,
      …
      "indexOnly" : false,
      …
      "millis" : 16,
      …
    }

The cursor field now has the value BtreeCursor state_1_pincode_1, which shows that the index is indeed being used. As expected, the number of results stays the same at 6446. The number of entries scanned in the index and the number of documents scanned in the collection have now come down to the same number of documents as in the result. This is because we now used an index that gave us the starting document from which to scan, and then only the required number of documents were scanned. This is similar to using a book's index to find a word rather than scanning the entire book to search for it. The time, millis, has come down too, as expected.

Improvement using covered indexes

This leaves us with one field, indexOnly, and we will see what it means. To understand this value, we need to look briefly at how indexes operate. Indexes store a subset of the fields of the original documents in the collection. The fields present in the index are the same as those on which the index is created. The fields, however, are kept sorted in the index in the order specified during the creation of the index. Apart from the fields, there is an additional value stored in the index; it acts as a pointer to the original document in the collection.
Thus, whenever the user executes a query, if the query contains fields on which an index is present, the index is consulted to get a set of matches. The pointer stored with the index entries that match the query is then used to make another IO operation to fetch the complete document from the collection; this document is then returned to the user.

The value of indexOnly being false indicates that the data requested by the user in the query is not entirely present in the index, and an additional IO operation is needed to retrieve the entire document from the collection by following the pointer from the index. Had the values been present in the index itself, an additional operation to retrieve the document from the collection would not be necessary, and the data from the index would be returned. This is called a covered index, and the value of indexOnly, in this case, will be true.

In our case, we just need the pin codes, so why not use projection in our queries to retrieve just what we need? This will also make the query covered by the index, as the index entry has just the state's name and pin code, and the required data can be served completely without retrieving the original document from the collection. The plan of the query in this case is interesting too. Executing the following query results in the following plan:

    > db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()
    {
      "cursor" : "BtreeCursor state_1_pincode_1",
      …
      "n" : 6446,
      "nscannedObjects" : 0,
      "nscanned" : 6446,
      …
      "indexOnly" : true,
      …
      "millis" : 15,
      …
    }

The values of the nscannedObjects and indexOnly fields are the ones to observe. As expected, since the data we requested in the projection of the find query is the pin code only, which can be served from the index alone, the value of indexOnly is true. In this case, we scanned 6,446 entries in the index, and thus the nscanned value is 6446. We, however, didn't reach out to any document in the collection on disk, as this query was covered by the index alone and no additional IO was needed to retrieve the entire document. Hence, the value of nscannedObjects is 0.

As this collection in our case is small, we do not see a significant difference in the execution time of the query. This will be more evident on larger collections. Making use of indexes is great and gives good performance. Making use of covered indexes gives even better performance. Another thing to remember is that, wherever possible, try to use projection to retrieve only the fields we need. The _id field is retrieved every time by default; unless we plan to use it, set _id:0 so that it is not retrieved, since it is not part of the index. Executing a covered query is the most efficient way to query a collection.

Some gotchas of index creation

We will now see some pitfalls in index creation and some facts about using array fields in indexes. Some of the operators that do not use the index efficiently are the $where, $nin, and $exists operators. Whenever these operators are used in a query, one should bear in mind a possible performance bottleneck as the data size increases. Similarly, the $in operator should be preferred over the $or operator, as both can be used to achieve more or less the same result. As an exercise, try to find the pin codes in the states of Maharashtra and Gujarat from the postalCodes collection. Write two queries: one using the $or operator and the other using the $in operator, and explain the plan for both of these queries.
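For reference, one way the two exercise queries might be written, as a sketch using the operators just mentioned:

    > db.postalCodes.find({$or: [{state:'Maharashtra'}, {state:'Gujarat'}]}).explain()
    > db.postalCodes.find({state: {$in: ['Maharashtra', 'Gujarat']}}).explain()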
What happens when an array field is used in the index? Mongo creates an index entry for each element present in the array field of a document. So, if there are 10 elements in an array in a document, there will be 10 index entries, one for each element in the array. However, there is a constraint when creating indexes that contain array fields: when creating an index using multiple fields, no more than one field can be of the array type. This is done to prevent a possible explosion in the number of index entries on adding even a single element to an array used in the index. If we think about it carefully, an index entry is created for each element in the array. If multiple fields of the array type were allowed to be part of an index, we would have a large number of entries in the index, which would be the product of the lengths of these array fields. For example, a document with two array fields, each of length 10, would add 100 entries to the index, had it been allowed to create one index using these two array fields.

This should be good enough for now to scratch the surface of plain vanilla indexes.

Summary

This article provided detailed recipes that describe how to use different features of MongoDB. MongoDB is a leading document-oriented NoSQL database that offers linear scalability, making it a good contender for high-volume, high-performance systems across all business domains. It has an edge over the majority of NoSQL solutions for its ease of use, high performance, and rich features. In this article, we learned how to perform a single node installation of MongoDB with options from the config file. We also learned how to create an index from the shell and view the plans of queries.

Resources for Article:

Further resources on this subject:

- Ruby with MongoDB for Web Development [Article]
- MongoDB data modeling [Article]
- Using Mongoid [Article]

Creating reusable actions for agent behaviors with Lua

Packt
27 Nov 2014
18 min read
In this article by David Young, author of Learning Game AI Programming with Lua, we will create reusable actions for agent behaviors. (For more resources related to this topic, see here.)

Creating userdata

So far we've been using global data to store information about our agents. As we're going to create decision structures that require information about our agents, we'll create a local userData table variable that contains our specific agent data as well as the agent controller in order to manage animation handling:

local userData = {
    alive,      -- terminal flag
    agent,      -- Sandbox agent
    ammo,       -- current ammo
    controller, -- Agent animation controller
    enemy,      -- current enemy, can be nil
    health,     -- current health
    maxHealth   -- max Health
};

Moving forward, we will encapsulate more and more data as a means of isolating our systems from global variables. A userData table is perfect for storing any arbitrary piece of agent data that the agent doesn't already possess, and it provides a common storage area for data structures to manipulate agent data. The listed data members are some common pieces of information we'll be storing; when we start creating individual behaviors, we'll access and modify this data.

Agent actions

Ultimately, any decision logic or structure we create for our agents comes down to deciding what action our agent should perform. Actions themselves are isolated structures built around three distinct states:

- Uninitialized
- Running
- Terminated

The typical lifespan of an action begins in the uninitialized state; the action then goes through a one-time initialization, after which it is considered to be running. After an action completes the running phase, it moves to a terminated state where cleanup is performed. Once the cleanup of an action has been completed, the action is once again set to uninitialized until it is reactivated.

We'll start defining an action by declaring the three different states an action can be in, as well as a type specifier, so our data structures will know that a specific Lua table should be treated as an action. Remember, even though we use Lua in an object-oriented manner, Lua itself merely creates each instance of an object as a primitive table. It is up to the code we write to correctly interpret different tables as different objects. The Type variable will be used, moving forward, to distinguish one class type from another.

Action.lua:

Action = {};

Action.Status = {
    RUNNING = "RUNNING",
    TERMINATED = "TERMINATED",
    UNINITIALIZED = "UNINITIALIZED"
};

Action.Type = "Action";

Adding data members

To create an action, we'll pass in three functions that the action will use for initialization, updating, and cleanup. Additional information, such as the name of the action and a userData variable used for passing information to each callback function, is passed in at construction time.

Moving our systems away from global data and into instanced, object-oriented patterns requires each instance of an object to store its own data. As our Action class is generic, we use a custom data member, userData, to store action-specific information. Whenever a callback function for the action is executed, the same userData table passed in at construction time will be passed into each function. The update callback will receive an additional deltaTimeInMillis parameter in order to perform any time-specific update logic.
To flesh out the Action class's constructor function, we'll store each of the callback functions as well as initialize some common data members:

Action.lua:

function Action.new(name, initializeFunction, updateFunction,
        cleanUpFunction, userData)

    local action = {};

    -- The Action's data members.
    action.cleanUpFunction_ = cleanUpFunction;
    action.initializeFunction_ = initializeFunction;
    action.updateFunction_ = updateFunction;
    action.name_ = name or "";
    action.status_ = Action.Status.UNINITIALIZED;
    action.type_ = Action.Type;
    action.userData_ = userData;

    return action;
end

Initializing an action

Initializing an action begins by calling the action's initialize callback and then immediately sets the action into a running state. This transitions the action into the standard update loop from then on:

Action.lua:

function Action.Initialize(self)
    -- Run the initialize function if one is specified.
    if (self.status_ == Action.Status.UNINITIALIZED) then
        if (self.initializeFunction_) then
            self.initializeFunction_(self.userData_);
        end
    end

    -- Set the action to running after initializing.
    self.status_ = Action.Status.RUNNING;
end

Updating an action

Once an action has transitioned to a running state, it will receive callbacks to the update function every time the agent itself is updated, until the action decides to terminate. To avoid an infinite loop, the update function must return a terminated status when a condition is met; otherwise, our agents will never be able to finish the running action. An update function isn't a hard requirement for our actions, as actions terminate immediately by default if no callback function is present.

Action.lua:

function Action.Update(self, deltaTimeInMillis)
    if (self.status_ == Action.Status.TERMINATED) then
        -- Immediately return if the Action has already
        -- terminated.
        return Action.Status.TERMINATED;
    elseif (self.status_ == Action.Status.RUNNING) then
        if (self.updateFunction_) then
            -- Run the update function if one is specified.
            self.status_ = self.updateFunction_(
                deltaTimeInMillis, self.userData_);

            -- Ensure that a status was returned by the update
            -- function.
            assert(self.status_);
        else
            -- If no update function is present move the action
            -- into a terminated state.
            self.status_ = Action.Status.TERMINATED;
        end
    end

    return self.status_;
end

Action cleanup

Terminating an action is very similar to initializing one; it sets the status of the action to uninitialized once the cleanup callback has had an opportunity to finish any processing of the action. If a cleanup callback function isn't defined, the action immediately moves to an uninitialized state upon cleanup. During action cleanup, we check to make sure the action has fully terminated, and then run a cleanup function if one is specified.
Action.lua:

function Action.CleanUp(self)
    if (self.status_ == Action.Status.TERMINATED) then
        if (self.cleanUpFunction_) then
            self.cleanUpFunction_(self.userData_);
        end
    end

    self.status_ = Action.Status.UNINITIALIZED;
end

Action member functions

Now that we've created the basic initialize, update, and terminate functionalities, we can update our action constructor with CleanUp, Initialize, and Update member functions:

Action.lua:

function Action.new(name, initializeFunction, updateFunction,
        cleanUpFunction, userData)

    ...

    -- The Action's accessor functions.
    action.CleanUp = Action.CleanUp;
    action.Initialize = Action.Initialize;
    action.Update = Action.Update;

    return action;
end

Creating actions

With a basic Action class out of the way, we can start implementing the specific action logic our agents will use. Each action will consist of three callback functions—initialization, update, and cleanup—that we'll use when we instantiate our action instances.

The idle action

The first action we'll create is the basic and default choice for our agents going forward. The idle action wraps the IDLE animation request to our soldier's animation controller. As the animation controller will continue looping our IDLE command until a new command is queued, we'll time our idle action to run for 2 seconds and then terminate it to allow another action to run:

SoldierActions.lua:

function SoldierActions_IdleCleanUp(userData)
    -- No cleanup is required for idling.
end

function SoldierActions_IdleInitialize(userData)
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.IDLE);

    -- Since idle is a looping animation, cut off the idle
    -- Action after 2 seconds.
    local sandboxTimeInMillis = Sandbox.GetTimeInMillis(
        userData.agent:GetSandbox());
    userData.idleEndTime = sandboxTimeInMillis + 2000;
end

Updating our action requires that we check how much time has passed; if 2 seconds have gone by, we terminate the action by returning the terminated state; otherwise, we return that the action is still running:

SoldierActions.lua:

function SoldierActions_IdleUpdate(deltaTimeInMillis, userData)
    local sandboxTimeInMillis = Sandbox.GetTimeInMillis(
        userData.agent:GetSandbox());
    if (sandboxTimeInMillis >= userData.idleEndTime) then
        userData.idleEndTime = nil;
        return Action.Status.TERMINATED;
    end
    return Action.Status.RUNNING;
end

As we'll be using our idle action numerous times, we'll create a wrapper around initializing the action based on our three functions:

SoldierLogic.lua:

local function IdleAction(userData)
    return Action.new(
        "idle",
        SoldierActions_IdleInitialize,
        SoldierActions_IdleUpdate,
        SoldierActions_IdleCleanUp,
        userData);
end

The die action

Creating a basic death action is very similar to our idle action. In this case, as death in our animation controller is a terminating state, all we need to do is request that the DIE command be immediately executed. From this point, our die action is complete, and it's the responsibility of a higher-level system to stop any additional processing of logic behavior. Typically, our agents will request this state when their health drops to zero.
In the special case that our agent dies due to falling, the soldier's animation controller will manage the correct animation playback and set the soldier's health to zero:

SoldierActions.lua:

function SoldierActions_DieCleanUp(userData)
    -- No cleanup is required for death.
end

function SoldierActions_DieInitialize(userData)
    -- Issue a die command and immediately terminate.
    userData.controller:ImmediateCommand(
        userData.agent,
        SoldierController.Commands.DIE);

    return Action.Status.TERMINATED;
end

function SoldierActions_DieUpdate(deltaTimeInMillis, userData)
    return Action.Status.TERMINATED;
end

Creating a wrapper function to instantiate a death action is identical to our idle action:

SoldierLogic.lua:

local function DieAction(userData)
    return Action.new(
        "die",
        SoldierActions_DieInitialize,
        SoldierActions_DieUpdate,
        SoldierActions_DieCleanUp,
        userData);
end

The reload action

Reloading is the first action that requires an animation to complete before we can consider the action done, as the behavior will refill our agent's current ammunition count. As our animation controller is queue-based, the action itself never knows how many commands must be processed before the reload command has finished executing. To account for this during the update loop of our action, we wait till the command queue is empty, as the reload action will be the last command added to the queue. Once the queue is empty, we can terminate the action and allow the cleanup function to award the ammo:

SoldierActions.lua:

function SoldierActions_ReloadCleanUp(userData)
    userData.ammo = userData.maxAmmo;
end

function SoldierActions_ReloadInitialize(userData)
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.RELOAD);
    return Action.Status.RUNNING;
end

function SoldierActions_ReloadUpdate(deltaTimeInMillis, userData)
    if (userData.controller:QueueLength() > 0) then
        return Action.Status.RUNNING;
    end

    return Action.Status.TERMINATED;
end

SoldierLogic.lua:

local function ReloadAction(userData)
    return Action.new(
        "reload",
        SoldierActions_ReloadInitialize,
        SoldierActions_ReloadUpdate,
        SoldierActions_ReloadCleanUp,
        userData);
end

The shoot action

Shooting is the first action that directly interacts with another agent. In order to apply damage to another agent, we need to modify how the soldier's shots deal with impacts. When the soldier shot bullets out of his rifle, we added a callback function to handle the cleanup of particles; now, we'll add additional functionality to decrement an agent's health if a particle impacts an agent:

Soldier.lua:

local function ParticleImpact(sandbox, collision)
    Sandbox.RemoveObject(sandbox, collision.objectA);

    local particleImpact = Core.CreateParticle(
        sandbox, "BulletImpact");
    Core.SetPosition(particleImpact, collision.pointA);
    Core.SetParticleDirection(
        particleImpact, collision.normalOnB);

    table.insert(
        impactParticles,
        { particle = particleImpact, ttl = 2.0 } );

    if (Agent.IsAgent(collision.objectB)) then
        -- Deal 5 damage per shot.
        Agent.SetHealth(
            collision.objectB,
            Agent.GetHealth(collision.objectB) - 5);
    end
end

Creating the shooting action requires more than just queuing up a shoot command to the animation controller.
As the SHOOT command loops, we'll queue an IDLE command immediately afterward so that the shoot action will terminate after a single bullet is fired. To have a chance at actually hitting an enemy agent, though, we first need to orient our agent to face toward its enemy. During the normal update loop of the action, we forcefully set the agent to point in the enemy's direction.

Forcefully setting the agent's forward direction during an action will allow our soldier to shoot, but it creates a visual artifact where the agent pops to the correct forward direction. See whether you can modify the shoot action's update to interpolate to the correct forward direction for better visual results.

SoldierActions.lua:

function SoldierActions_ShootCleanUp(userData)
    -- No cleanup is required for shooting.
end

function SoldierActions_ShootInitialize(userData)
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.SHOOT);
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.IDLE);

    return Action.Status.RUNNING;
end

function SoldierActions_ShootUpdate(deltaTimeInMillis, userData)
    -- Point toward the enemy so the Agent's rifle will shoot
    -- correctly.
    local forwardToEnemy = userData.enemy:GetPosition() -
        userData.agent:GetPosition();
    Agent.SetForward(userData.agent, forwardToEnemy);

    if (userData.controller:QueueLength() > 0) then
        return Action.Status.RUNNING;
    end

    -- Subtract a single bullet per shot.
    userData.ammo = userData.ammo - 1;
    return Action.Status.TERMINATED;
end

SoldierLogic.lua:

local function ShootAction(userData)
    return Action.new(
        "shoot",
        SoldierActions_ShootInitialize,
        SoldierActions_ShootUpdate,
        SoldierActions_ShootCleanUp,
        userData);
end

The random move action

Random movement is an action that chooses a random point on the navmesh to move to. This action is very similar to other actions that move, except that it doesn't perform the movement itself. Instead, the random move action only chooses a valid point to move to and requires the move action to perform the movement:

SoldierActions.lua:

function SoldierActions_RandomMoveCleanUp(userData)

end

function SoldierActions_RandomMoveInitialize(userData)
    local sandbox = userData.agent:GetSandbox();

    local endPoint = Sandbox.RandomPoint(sandbox, "default");
    local path = Sandbox.FindPath(
        sandbox,
        "default",
        userData.agent:GetPosition(),
        endPoint);

    while #path == 0 do
        endPoint = Sandbox.RandomPoint(sandbox, "default");
        path = Sandbox.FindPath(
            sandbox,
            "default",
            userData.agent:GetPosition(),
            endPoint);
    end

    userData.agent:SetPath(path);
    userData.agent:SetTarget(endPoint);
    userData.movePosition = endPoint;

    return Action.Status.TERMINATED;
end

function SoldierActions_RandomMoveUpdate(deltaTimeInMillis, userData)
    return Action.Status.TERMINATED;
end

SoldierLogic.lua:

local function RandomMoveAction(userData)
    return Action.new(
        "randomMove",
        SoldierActions_RandomMoveInitialize,
        SoldierActions_RandomMoveUpdate,
        SoldierActions_RandomMoveCleanUp,
        userData);
end

The move action

Our movement action is similar to the idle action, as the agent's walk animation will loop infinitely. In order for the agent to complete a move action, though, the agent must reach within a certain distance of its target position or time out.
In this case, we can use 1.5 meters, as that's close enough to the target position to terminate the move action, and half a second to indicate how long the move action can run for:

SoldierActions.lua:

function SoldierActions_MoveToCleanUp(userData)
    userData.moveEndTime = nil;
end

function SoldierActions_MoveToInitialize(userData)
    userData.controller:QueueCommand(
        userData.agent,
        SoldierController.Commands.MOVE);

    -- Since movement is a looping animation, cut off the move
    -- Action after 0.5 seconds.
    local sandboxTimeInMillis =
        Sandbox.GetTimeInMillis(userData.agent:GetSandbox());
    userData.moveEndTime = sandboxTimeInMillis + 500;

    return Action.Status.RUNNING;
end

When applying the move action to our agents, the indirect soldier controller will manage all animation playback and steer our agents along their paths.

(Figure: the agent moving to a random position)

Setting a time limit for the move action will still allow our agents to move to their final target position, but it gives other actions a chance to execute in case the situation has changed. Movement paths can be long, and it is undesirable to not handle situations such as death until the move action has terminated:

SoldierActions.lua:

function SoldierActions_MoveToUpdate(deltaTimeInMillis, userData)
    -- Terminate the action after the allotted 0.5 seconds. The
    -- decision structure will simply repath if the Agent needs
    -- to move again.
    local sandboxTimeInMillis =
        Sandbox.GetTimeInMillis(userData.agent:GetSandbox());
    if (sandboxTimeInMillis >= userData.moveEndTime) then
        userData.moveEndTime = nil;
        return Action.Status.TERMINATED;
    end

    path = userData.agent:GetPath();
    if (#path ~= 0) then
        offset = Vector.new(0, 0.05, 0);
        DebugUtilities_DrawPath(
            path, false, offset, DebugUtilities.Orange);
        Core.DrawCircle(
            path[#path] + offset, 1.5, DebugUtilities.Orange);
    end

    -- Terminate movement if the Agent is close enough to the
    -- target.
    if (Vector.Distance(userData.agent:GetPosition(),
        userData.agent:GetTarget()) < 1.5) then

        Agent.RemovePath(userData.agent);
        return Action.Status.TERMINATED;
    end

    return Action.Status.RUNNING;
end

SoldierLogic.lua:

local function MoveAction(userData)
    return Action.new(
        "move",
        SoldierActions_MoveToInitialize,
        SoldierActions_MoveToUpdate,
        SoldierActions_MoveToCleanUp,
        userData);
end

Summary

In this article, we took a look at creating userdata and reusable actions.

Resources for Article:

Further resources on this subject:

- Using Sprites for Animation [Article]
- Installing Gideros [Article]
- CryENGINE 3: Breaking Ground with Sandbox [Article]

Logistic regression

Packt
27 Nov 2014
9 min read
This article is written by Breck Baldwin and Krishna Dayanidhi, the authors of Natural Language Processing with Java and LingPipe Cookbook. In this article, we will cover logistic regression. (For more resources related to this topic, see here.)

Logistic regression is probably responsible for the majority of industrial classifiers, with the possible exception of naïve Bayes classifiers. It is almost certainly one of the best performing classifiers available, albeit at the cost of slow training and considerable complexity in configuration and tuning. Logistic regression is also known as maximum entropy, neural network classification with a single neuron, and by other names. The classifiers covered so far have been based on the underlying characters or tokens, but logistic regression uses unrestricted feature extraction, which allows arbitrary observations of the situation to be encoded in the classifier. This article closely follows a more complete tutorial at http://alias-i.com/lingpipe/demos/tutorial/logistic-regression/read-me.html.

How logistic regression works

All that logistic regression does is take a vector of feature weights over the data, apply a vector of coefficients, and do some simple math, which results in a probability for each class encountered in training. The complicated bit is in determining what the coefficients should be.

The following are some of the features produced by our training example for 21 tweets annotated for English (e) and non-English (n). There are relatively few features because feature weights are being pushed to 0.0 by our prior, and once a weight is 0.0, the feature is removed. Note that one category, n, is set to 0.0 for all features—this is a property of the logistic regression process, which fixes one category's features to 0.0 and adjusts all the other categories' features with respect to it:

FEATURE        e       n
I          :   0.37    0.0
!          :   0.30    0.0
Disney     :   0.15    0.0
"          :   0.08    0.0
to         :   0.07    0.0
anymore    :   0.06    0.0
isn        :   0.06    0.0
'          :   0.06    0.0
t          :   0.04    0.0
for        :   0.03    0.0
que        :  -0.01    0.0
moi        :  -0.01    0.0
_          :  -0.02    0.0
,          :  -0.08    0.0
pra        :  -0.09    0.0
?          :  -0.09    0.0

Take the string "I luv Disney", which will have only two non-zero features: I=0.37 and Disney=0.15 for e, and zeros for n. Since there is no feature that matches luv, it is ignored. The probability that the tweet is English breaks down to:

vectorMultiply(e, [I,Disney]) = exp(0.37*1 + 0.15*1) = 1.68
vectorMultiply(n, [I,Disney]) = exp(0*1 + 0*1) = 1

We rescale to a probability by summing the outcomes and dividing by the sum:

p(e|[I,Disney]) = 1.68/(1.68 + 1) = 0.62
p(n|[I,Disney]) = 1/(1.68 + 1) = 0.38

This is how the math works when running a logistic regression model. Training is another issue entirely.

Getting ready

This example assumes the same framework that we have been using all along to get training data from .csv files, train the classifier, and run it from the command line. Setting up to train the classifier is a bit complex because of the number of parameters and objects used in training. The main() method starts with what should be familiar classes and methods:

public static void main(String[] args) throws IOException {
    String trainingFile = args.length > 0 ? args[0]
        : "data/disney_e_n.csv";
    List<String[]> training
        = Util.readAnnotatedCsvRemoveHeader(new File(trainingFile));
    int numFolds = 0;
    XValidatingObjectCorpus<Classified<CharSequence>> corpus
        = Util.loadXValCorpus(training, numFolds);
    TokenizerFactory tokenizerFactory
        = IndoEuropeanTokenizerFactory.INSTANCE;

Note that we are using XValidatingObjectCorpus when a simpler implementation such as ListCorpus would do. We will not take advantage of any of its cross-validation features, because a numFolds param of 0 will have training visit the entire corpus. We are trying to keep the number of novel classes to a minimum, and we tend to always use this implementation in real-world gigs anyway.

Now, we will start to build the configuration for our classifier. The FeatureExtractor<E> interface provides a mapping from data to features; this will be used to train and run the classifier. In this case, we are using a TokenFeatureExtractor() method, which creates features based on the tokens found by the tokenizer supplied during construction. This is similar to what naïve Bayes reasons over:

FeatureExtractor<CharSequence> featureExtractor
    = new TokenFeatureExtractor(tokenizerFactory);

The minFeatureCount item is usually set to a number higher than 1, but with small training sets, this is needed to get any performance. The thought behind filtering feature counts is that logistic regression tends to overfit low-count features that, just by chance, exist in only one category of training data. As training data grows, the minFeatureCount value is usually adjusted by paying attention to cross-validation performance:

int minFeatureCount = 1;

The addInterceptFeature Boolean controls whether a category feature exists that models the prevalence of the category in training. The default name of the intercept feature is *&^INTERCEPT%$^&**, and you will see it in the weight vector output if it is being used. By convention, the intercept feature is set to 1.0 for all inputs. The idea is that if a category is just very common or very rare, there should be a feature that captures just this fact, independent of other features that might not be as cleanly distributed. This models the category probability in naïve Bayes in some way, but the logistic regression algorithm will decide how useful it is, as it does with all other features:

boolean addInterceptFeature = true;
boolean noninformativeIntercept = true;

These Booleans control what happens to the intercept feature if it is used. Priors, in the following code, are typically not applied to the intercept feature; this is the result if this parameter is true. Set the Boolean to false, and the prior will be applied to the intercept.

Next is the RegressionPrior instance, which controls how the model is fit. What you need to know is that priors help prevent logistic regression from overfitting the data by pushing coefficients towards 0. There is a non-informative prior that does not do this, with the consequence that if there is a feature that applies to just one category, it will be scaled to infinity, because the model keeps fitting better as the coefficient is increased in the numeric estimation. Priors, in this context, function as a way to not be overconfident in observations about the world.

Another dimension of the RegressionPrior instance is the expected variance of the features. Low variance will push coefficients to zero more aggressively. The prior returned by the static laplace() method tends to work well for NLP problems.
There is a lot going on, but it can be managed without a deep theoretical understanding:

double priorVariance = 2;
RegressionPrior prior
    = RegressionPrior.laplace(priorVariance,
        noninformativeIntercept);

Next, we will control how the algorithm searches for an answer:

AnnealingSchedule annealingSchedule
    = AnnealingSchedule.exponential(0.00025, 0.999);
double minImprovement = 0.000000001;
int minEpochs = 100;
int maxEpochs = 2000;

AnnealingSchedule is best understood by consulting the Javadoc, but what it does is change how much the coefficients are allowed to vary when fitting the model. The minImprovement parameter sets the amount by which the model fit has to improve to not terminate the search because the algorithm has converged. The minEpochs parameter sets a minimal number of iterations, and maxEpochs sets an upper limit if the search does not converge as determined by minImprovement.

Next is some code that allows for basic reporting/logging. LogLevel.INFO will report a great deal of information about the progress of the classifier as it tries to converge:

PrintWriter progressWriter = new PrintWriter(System.out, true);
progressWriter.println("Reading data.");
Reporter reporter = Reporters.writer(progressWriter);
reporter.setLevel(LogLevel.INFO);

Here ends the Getting ready section of one of our most complex classes—next, we will train and run the classifier.

How to do it...

It has been a bit of work setting up to train and run this class. We will just go through the steps to get it up and running:

1. Note that there is a more complex 14-argument train method as well, one that extends configurability. This is the 10-argument version:

    LogisticRegressionClassifier<CharSequence> classifier
        = LogisticRegressionClassifier.<CharSequence>train(
            corpus,
            featureExtractor,
            minFeatureCount,
            addInterceptFeature,
            prior,
            annealingSchedule,
            minImprovement,
            minEpochs,
            maxEpochs,
            reporter);

2. The train() method, depending on the LogLevel constant, will produce anything from nothing with LogLevel.NONE to prodigious output with LogLevel.ALL.

3. While we are not going to use it, we show how to serialize the trained model to disk:

    AbstractExternalizable.compileTo(classifier,
        new File("models/myModel.LogisticRegression"));

4. Once trained, we will apply the standard classification loop with:

    Util.consoleInputPrintClassification(classifier);

5. Run the preceding code in the IDE of your choice or use the command-line command:

    java -cp lingpipe-cookbook.1.0.jar:lib/lingpipe-4.1.0.jar:lib/opencsv-2.4.jar com.lingpipe.cookbook.chapter3.TrainAndRunLogReg

The result is a big dump of information about the training:

Reading data.
:00 Feature Extractor class=class com.aliasi.tokenizer.TokenFeatureExtractor
:00 min feature count=1
:00 Extracting Training Data
:00 Cold start
:00 Regression callback handler=null
:00 Logistic Regression Estimation
:00 Monitoring convergence=true
:00 Number of dimensions=233
:00 Number of Outcomes=2
:00 Number of Parameters=233
:00 Number of Training Instances=21
:00 Prior=LaplaceRegressionPrior(Variance=2.0,noninformativeIntercept=true)
:00 Annealing Schedule=Exponential(initialLearningRate=2.5E-4,base=0.999)
:00 Minimum Epochs=100
:00 Maximum Epochs=2000
:00 Minimum Improvement Per Period=1.0E-9
:00 Has Informative Prior=true
:00 epoch=    0 lr=0.000250000 ll=   -20.9648 lp=  -232.0139 llp=  -252.9787 llp*=  -252.9787
:00 epoch=    1 lr=0.000249750 ll=   -20.9406 lp=  -232.0195 llp=  -252.9602 llp*=  -252.9602

The epoch reporting goes on until either the number of epochs is met or the search converges.
In the following case, the number of epochs was met:

:00 epoch= 1998 lr=0.000033868 ll=   -15.4568 lp=  -233.8125 llp=  -249.2693 llp*=  -249.2693
:00 epoch= 1999 lr=0.000033834 ll=   -15.4565 lp=  -233.8127 llp=  -249.2692 llp*=  -249.2692

Now, we can play with the classifier a bit:

Type a string to be classified. Empty string to quit.
I luv Disney
Rank  Category  Score               P(Category|Input)
0=e             0.626898085027528   0.626898085027528
1=n             0.373101914972472   0.373101914972472

This should look familiar; it is exactly the same result as the worked example at the start. That's it! You have trained up and used the world's most relevant industrial classifier. However, there's a lot more to harnessing the power of this beast.

Summary

In this article, we learned how to do logistic regression.

Resources for Article:

Further resources on this subject:

- Installing NumPy, SciPy, matplotlib, and IPython [Article]
- Introspecting Maya, Python, and PyMEL [Article]
- Understanding the Python regex engine [Article]

Machine Learning Examples Applicable to Businesses

Packt
25 Nov 2014
7 min read
The purpose of this article by Michele Usuelli, author of the book R Machine Learning Essentials, is to show how machine learning helps in solving a business problem. (For more resources related to this topic, see here.)

Predicting the output

The past marketing campaign targeted part of the customer base. Among another 1,000 clients, how do we identify the 100 that are keenest to subscribe? We can build a model that learns from the data and estimates which clients are more similar to the ones that subscribed in the previous campaign. For each client, the model estimates a score that is higher if the client is more likely to subscribe. There are different machine learning models for determining the scores, and we use two well-performing techniques, as follows:

- Logistic regression: This is a variation of linear regression used to predict a binary output
- Random forest: This is an ensemble based on decision trees that works well in the presence of many features

In the end, we need to choose one of the two techniques. There are cross-validation methods that allow us to estimate model accuracy. Starting from that, we can measure the accuracy of both options and pick the one performing better. After choosing the most appropriate machine learning algorithm, we could optimize it using cross-validation. However, in order to avoid overcomplicating the model building, we don't perform any feature selection or parameter optimization.

These are the steps to build and evaluate the models:

1. Load the randomForest package containing the random forest algorithm:

    library('randomForest')

2. Define the formula specifying the output and the variable names. The formula is in the format output ~ feature1 + feature2 + ...:

    arrayFeatures <- names(dtBank)
    arrayFeatures <- arrayFeatures[arrayFeatures != 'output']
    formulaAll <- paste('output', '~')
    formulaAll <- paste(formulaAll, arrayFeatures[1])
    for(nameFeature in arrayFeatures[-1]){
      formulaAll <- paste(formulaAll, '+', nameFeature)
    }
    formulaAll <- formula(formulaAll)

3. Initialize the table containing all the testing sets:

    dtTestBinded <- data.table()

4. Define the number of iterations:

    nIter <- 10

5. Start a for loop:

    for(iIter in 1:nIter){

6. Define the training and the test datasets:

    indexTrain <- sample(x = c(TRUE, FALSE),
                         size = nrow(dtBank),
                         replace = T,
                         prob = c(0.8, 0.2))
    dtTrain <- dtBank[indexTrain]
    dtTest <- dtBank[!indexTrain]

7. Select a subset from the test set in such a way that we have the same number of rows with output == 0 and output == 1. First, we split dtTest into two parts (dtTest0 and dtTest1) on the basis of the output and count the number of rows of each part (n0 and n1). Then, as dtTest0 has more rows, we randomly select n1 of its rows. In the end, we redefine dtTest by binding dtTest0 and dtTest1, as follows:

    dtTest1 <- dtTest[output == 1]
    dtTest0 <- dtTest[output == 0]
    n0 <- nrow(dtTest0)
    n1 <- nrow(dtTest1)
    dtTest0 <- dtTest0[sample(x = 1:n0, size = n1)]
    dtTest <- rbind(dtTest0, dtTest1)

8. Build the random forest model using randomForest. The formula argument defines the relationship between the variables and the data argument defines the training dataset. In order to avoid overcomplicating the model, all the other parameters are left at their defaults:

    modelRf <- randomForest(formula = formulaAll,
                            data = dtTrain)

9. Build the logistic regression model using glm, which is a function used to build Generalized Linear Models (GLM). GLMs are a generalization of linear regression and allow us to define a link function that connects the linear predictor with the outputs.
The input is the same as for the random forest, with the addition of family = binomial(logit), defining that the regression is logistic:

    modelLr <- glm(formula = formulaAll,
                   data = dtTrain,
                   family = binomial(logit))

10. Predict the output of the random forest. The function is predict and its main arguments are object, defining the model, and newdata, defining the test set, as follows:

    dtTest[, outputRf := predict(object = modelRf, newdata = dtTest, type = 'response')]

11. Predict the output of the logistic regression, using predict as for the random forest. The other argument is type = 'response', and it is necessary in the case of logistic regression:

    dtTest[, outputLr := predict(object = modelLr, newdata = dtTest, type = 'response')]

12. Add the new test set to dtTestBinded:

    dtTestBinded <- rbind(dtTestBinded, dtTest)

13. End the for loop:

    }

We built dtTestBinded, which contains the output column defining which clients subscribed and the scores estimated by the models. Comparing the scores with the real output, we can validate the model performances.

In order to explore dtTestBinded, we can build a chart showing how the scores of the non-subscribing clients are distributed. Then, we add the distribution of the subscribing clients to the chart and compare them. In this way, we can see the difference between the scores of the two groups. Since we use the same chart for the random forest and for the logistic regression, we define a function that builds the chart, following the given steps:

1. Define the function and its input, which includes the data table and the name of the score column:

    plotDistributions <- function(dtTestBinded, colPred){

2. Compute the distribution density for the clients that didn't subscribe. With output == 0, we extract the clients not subscribing, and using density, we define a density object. The adjust parameter defines the smoothing bandwidth, which is a parameter of the way we build the curve starting from the data. The bandwidth can be interpreted as the level of detail:

    densityLr0 <- dtTestBinded[
      output == 0,
      density(get(colPred), adjust = 0.5)
      ]

3. Compute the distribution density for the clients that subscribed:

    densityLr1 <- dtTestBinded[
      output == 1,
      density(get(colPred), adjust = 0.5)
      ]

4. Define the colors in the chart using rgb. The colors are transparent red and transparent blue:

    col0 <- rgb(1, 0, 0, 0.3)
    col1 <- rgb(0, 0, 1, 0.3)

5. Build the plot with the density of the clients not subscribing. Here, polygon is a function that adds the area to the chart:

    plot(densityLr0, xlim = c(0, 1), main = 'density')
    polygon(densityLr0, col = col0, border = 'black')

6. Add the clients that subscribed to the chart:

    polygon(densityLr1, col = col1, border = 'black')

7. Add the legend:

    legend(
      'top',
      c('0', '1'),
      pch = 16,
      col = c(col0, col1))

8. End the function:

    return()
    }

Now, we can use plotDistributions on the random forest output:

    par(mfrow = c(1, 1))
    plotDistributions(dtTestBinded, 'outputRf')

(Figure: density of the random forest scores for the non-subscribing and subscribing clients)

The x-axis represents the score and the y-axis represents the density, which is proportional to the number of clients with similar scores. Since we don't have a client for each possible score, assuming a level of detail of 0.01, the density curve is smoothed in the sense that the density at each score is the average over the data with similar scores. The red and blue areas represent the non-subscribing and subscribing clients respectively. As can easily be noticed, the violet area comes from the overlapping of the two curves.
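Before interpreting the curves in detail, the overlap can also be quantified without a chart. The following is a minimal sketch using the columns already defined in dtTestBinded; the 0.3 cut-off is an arbitrary assumption, not a value prescribed by the article:

    # count test clients of each real output above/below a chosen score cut-off
    dtTestBinded[, .N, by = .(output, aboveCutOff = outputRf > 0.3)]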
For each score, we can identify which density is higher. If the highest curve is red, the client will be more likely to subscribe, and vice versa. For the random forest, most of the non-subscribing client scores are between 0 and 0.2 and the density peak is around 0.05. The subscribing clients have a more spread score, although higher, and their peak is around 0.1. The two distributions overlap a lot, so it's not easy to identify which clients will subscribe starting from their scores. However, if the marketing campaign targets all customers with a score higher than 0.3, they will likely belong to the blue cluster. In conclusion, using random forest, we are able to identify a small set of customers that will subscribe very likely. Summary In this article, you learned how to predict your output using proper machine learning techniques. Resources for Article: Further resources on this subject: Using R for Statistics, Research, and Graphics [article] Machine Learning in Bioinformatics [article] Learning Data Analytics with R and Hadoop [article]

article-image-no-nodistinct
Packt
25 Nov 2014
4 min read

No to nodistinct

This article is written by Stephen Redmond, the author of Mastering QlikView. There is a great skill in creating the right expression to calculate the right answer. Being able to do this in all circumstances relies on having a good knowledge of creating advanced expressions. Of course, the best path to mastery in this subject is actually getting out and doing it, but there is a great argument here for regularly practicing with dummy or test datasets. (For more resources related to this topic, see here.)

When presented with a problem that needs to be solved, all the QlikView masters will not necessarily know immediately how to answer it. What they will have though is a very good idea of where to start, that is, what to try and what not to try. This is what I hope to impart to you here. Knowing how to create many advanced expressions will arm you to know where to apply them, and where not to apply them. This is one area of QlikView that is alien to many people. For some reason, they fear the whole idea of concepts such as Aggr. However, the reality is that these concepts are actually very simple and supremely logical. Once you get your head around them, you will wonder what all the fuss was about.

No to nodistinct

The Aggr function has an optional clause: the possibility of stating that the aggregation will be either distinct or nodistinct. The default option is distinct, and as such, it is rarely ever stated. In this default operation, the aggregation will only produce distinct results for every combination of dimensions, just as you would expect from a normal chart or straight table.

The nodistinct option only makes sense within a chart, one that has more dimensions than are in the Aggr statement. In this case, the granularity of the chart is lower than the granularity of Aggr, and therefore, QlikView will only calculate that Aggr for the first occurrence of the lower granularity dimensions and will return null for the other rows. If we specify nodistinct, the same result will be calculated across all of the lower granularity dimensions.

This can be difficult to understand without seeing an example, so let's look at a common use case for this option. We will start with a dataset:

    ProductSales:
    Load * Inline [
    Product, Territory, Year, Sales
    Product A, Territory A, 2013, 100
    Product B, Territory A, 2013, 110
    Product A, Territory B, 2013, 120
    Product B, Territory B, 2013, 130
    Product A, Territory A, 2014, 140
    Product B, Territory A, 2014, 150
    Product A, Territory B, 2014, 160
    Product B, Territory B, 2014, 170
    ];

We will build a report from this data using a pivot table: Now, we want to bring the value in the Total column into a new column under each year, perhaps to calculate a percentage for each year.
We might think that, because the total is the sum for each Product and Territory, we might use an Aggr in the following manner: Sum(Aggr(Sum(Sales), Product, Territory)) However, as stated previously, because the chart includes an additional dimension (Year) than Aggr, the expression will only be calculated for the first occurrence of each of the lower granularity dimensions (in this case, for Year = 2013): The commonly suggested fix for this is to use Aggr without Sum and with nodistinct as shown: Aggr(NoDistinct Sum(Sales), Product, Territory) This will allow the Aggr expression to be calculated across all the Year dimension values, and at first, it will appear to solve the problem: The problem occurs when we decide to have a total row on this chart: As there is no aggregation function surrounding Aggr, it does not total correctly at the Product or Territory dimensions. We can't add an aggregation function, such as Sum, because it will break one of the other totals. However, there is something different that we can do; something that doesn't involve Aggr at all! We can use our old friend Total: Sum(Total<Product, Territory> Sales) This will calculate correctly at all the levels: There might be other use cases for using a nodistinct clause in Aggr, but they should be reviewed to see whether a simpler Total will work instead. Summary We discussed an important function, the Aggr function. We now know that the Aggr function is extremely useful, but we don't need to apply it in all circumstances where we have vertical calculations. Resources for Article: Further resources on this subject: Common QlikView script errors [article] Introducing QlikView elements [article] Creating sheet objects and starting new list using Qlikview 11 [article]

article-image-understanding-hbase-ecosystem
Packt
24 Nov 2014
11 min read

Understanding the HBase Ecosystem

This article by Shashwat Shriparv, author of the book, Learning HBase, will introduce you to the world of HBase. (For more resources related to this topic, see here.)

HBase is a horizontally scalable, distributed, open source, sorted map database. It runs on top of the Hadoop filesystem, that is, the Hadoop Distributed File System (HDFS). HBase is a NoSQL nonrelational database that doesn't always require a predefined schema. It can be seen as a flexible, scalable, multidimensional spreadsheet into which any structure of data fits, with on-the-fly addition of new column fields and no need for a fixed column structure to be defined before data can be inserted or queried. In other words, HBase is a column-based database that runs on top of the Hadoop distributed filesystem and supports features such as linear scalability (scale out), automatic failover, automatic sharding, and a more flexible schema.

HBase is modeled on Google BigTable, a compressed, high-performance, proprietary data store built on the Google filesystem. HBase was developed as a Hadoop subproject to support storage of structured data, which can take advantage of most distributed filesystems (typically, the Hadoop Distributed File System known as HDFS).

The following is key information about HBase and its features:

Developed by: Apache
Written in: Java
Type: Column oriented
License: Apache License
Lacking features of relational databases: SQL support; relations; primary, foreign, and unique key constraints; normalization
Website: http://hbase.apache.org
Distributions: Apache, Cloudera
Download link: http://mirrors.advancedhosters.com/apache/hbase/
Mailing lists: the user list: [email protected]; the developer list: [email protected]
Blog: http://blogs.apache.org/hbase/

HBase layout on top of Hadoop

The following figure represents the layout of HBase on top of Hadoop: There is more than one ZooKeeper in the setup, which provides high availability of the master status; a RegionServer may contain multiple regions. The RegionServers run on the machines where DataNodes run. There can be as many RegionServers as DataNodes. RegionServers can have multiple HRegions; one HRegion can have one HLog and multiple HFiles with their associated MemStore.

HBase can be seen as a master-slave database where the master is called HMaster, which is responsible for coordination between client applications and HRegionServers. It is also responsible for monitoring and recording metadata changes and management. Slaves are called HRegionServers, which serve the actual tables in the form of regions. These regions are the basic building blocks of HBase tables, over which the table data is distributed. So, HMaster and the RegionServers work in coordination to serve the HBase tables and the HBase cluster. Usually, HMaster is co-hosted with the Hadoop NameNode daemon process on a server and communicates with the DataNode daemons for reading and writing data on HDFS. The RegionServers run on, or are co-hosted with, the Hadoop DataNodes.
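To make the client-side picture concrete, the following is a minimal sketch of how an application might write and read a row through this architecture, using the third-party Python happybase client over HBase's Thrift gateway (the Thrift gateway is mentioned later in this article). The table name, column family, host, and port here are illustrative assumptions rather than anything prescribed by the book; the point is that the client library, not your code, resolves which RegionServer hosts the row's region.

    import happybase

    # connect to the Thrift gateway; host and port are deployment-specific assumptions
    connection = happybase.Connection('localhost', port=9090)

    table = connection.table('customers')          # hypothetical table with a 'cf' column family
    table.put(b'row-001', {b'cf:name': b'Alice'})  # the write is routed to the RegionServer hosting this row's region
    row = table.row(b'row-001')                    # the read is routed the same way
    print(row)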
Comparing architectural differences between RDBMS and HBase

Let's list the major differences between relational databases and HBase:

Storage unit: relational databases use tables as databases; HBase uses regions as databases.
Filesystems: relational databases support FAT, NTFS, and EXT; HBase uses HDFS.
Logging: relational databases use commit logs; HBase uses Write-Ahead Logs (WAL).
Reference system: relational databases use a coordinate system; HBase uses ZooKeeper.
Keys: relational databases use the primary key; HBase uses the row key.
Data distribution: relational databases support partitioning; HBase supports sharding.
Structure: relational databases use rows, columns, and cells; HBase uses rows, column families, columns, and cells.

HBase features

Let's see the major features of HBase that make it one of the most useful databases for the current and future industry:

Automatic failover and load balancing: HBase runs on top of HDFS, which is internally distributed and automatically recovered using multiple block allocations and replication. It works with multiple HMasters and region servers. This failover is also facilitated using HBase and RegionServer replication.

Automatic sharding: An HBase table is made up of regions that are hosted by RegionServers, and these regions are distributed throughout the RegionServers on different DataNodes. HBase provides automatic and manual splitting of these regions into smaller subregions once they reach a threshold size, to reduce I/O time and overhead.

Hadoop/HDFS integration: It's important to note that HBase can run on top of other filesystems as well. However, HDFS is the most common choice because it provides data distribution and high availability out of the box; we just need to set some configuration parameters so that HBase can communicate with Hadoop.

Real-time, random big data access: HBase internally uses a log-structured merge-tree (LSM-tree) as its data storage architecture, which merges smaller files into larger files periodically to reduce disk seeks.

MapReduce: HBase has built-in support for the Hadoop MapReduce framework for fast and parallel processing of data stored in HBase. You can search for the package org.apache.hadoop.hbase.mapreduce for more details.

Java API for client access: HBase has solid Java API support (client/server) for easy development and programming.

Thrift and a RESTful web service: HBase provides not only Thrift and RESTful gateways but also web service gateways for integrating with and accessing HBase, besides Java code (the HBase Java APIs) for accessing and working with HBase.

Support for exporting metrics via the Hadoop metrics subsystem: HBase provides Java Management Extensions (JMX) and exports metrics for monitoring purposes with tools such as Ganglia and Nagios.

Distributed: HBase works when used with HDFS. It provides coordination with Hadoop so that distribution of tables, high availability, and consistency are supported.

Linear scalability (scale out): Scaling HBase is not scale up but scale out, which means that we don't need to make servers more powerful; we add more machines to its cluster. We can add more nodes to the cluster on the fly. As soon as a new RegionServer node is up, the cluster can begin rebalancing, the RegionServer starts serving regions on the new node, and the cluster is scaled out; it is as simple as that.

Column oriented: HBase stores each column separately, in contrast with most relational databases, which use row-based storage. So in HBase, columns are stored contiguously and not the rows.
More about row- and column-oriented databases will follow. HBase shell support: HBase provides a command-line tool to interact with HBase and perform simple operations such as creating tables, adding data, and scanning data. This also provides full-fledged command-line tool using which we can interact with HBase and perform operations such as creating table, adding data, removing data, and a few other administrative commands. Sparse, multidimensional, sorted map database: HBase is a sparse, multidimensional, sorted map-based database, which supports multiple versions of the same record. Snapshot support: HBase supports taking snapshots of metadata for getting the previous or correct state form of data. HBase in the Hadoop ecosystem Let's see where HBase sits in the Hadoop ecosystem. In the Hadoop ecosystem, HBase provides a persistent, structured, schema-based data store. The following figure illustrates the Hadoop ecosystem: HBase can work as a separate entity on the local filesystem (which is not really effective as no distribution is provided) as well as in coordination with Hadoop as a separate but connected entity. As we know, Hadoop provides two services, a distributed files system (HDFS) for storage and a MapReduce framework for processing in a parallel mode. When there was a need to store structured data (data in the form of tables, rows and columns), which most of the programmers are already familiar with, the programmers were finding it difficult to process the data that was stored on HDFS as an unstructured flat file format. This led to the evolution of HBase, which provided a way to store data in a structural way. Consider that we have got a CSV file stored on HDFS and we need to query from it. We would need to write a Java code for this, which wouldn't be a good option. It would be better if we could specify the data key and fetch the data from that file. So, what we can do here is create a schema or table with the same structure of CSV file to store the data of the CSV file in the HBase table and query using HBase APIs, or HBase shell using key. Data representation in HBase Let's look into the representation of rows and columns in HBase table: An HBase table is divided into rows, column families, columns, and cells. Row keys are unique keys to identify a row, column families are groups of columns, columns are fields of the table, and the cell contains the actual value or the data. So, we have been through the introduction of HBase; now, let's see what Hadoop and its components are in brief. It is assumed here that you are already familiar with Hadoop; if not, following a brief introduction about Hadoop will help you to understand it. Hadoop Hadoop is an underlying technology of HBase, providing high availability, fault tolerance, and distribution. It is an Apache-sponsored, free, open source, Java-based programming framework which supports large dataset storage. It provides distributed file system and MapReduce, which is a distributed programming framework. It provides a scalable, reliable, distributed storage and development environment. Hadoop makes it possible to run applications on a system with tens to tens of thousands of nodes. The underlying distributed file system provides large-scale storage, rapid data access. It has the following submodules: Hadoop Common: This is the core component that supports the other Hadoop modules. It is like the master components facilitating communication and coordination between different Hadoop modules. 
Hadoop distributed file system: This is the underlying distributed filesystem, which is abstracted on top of the local filesystem and provides high throughput for read and write operations on Hadoop data.

Hadoop YARN: This is the new framework that is shipped with newer releases of Hadoop. It provides job scheduling and job and resource management.

Hadoop MapReduce: This is the Hadoop-based processing system that provides parallel processing of large data and datasets.

Other Hadoop subprojects are HBase, Hive, Ambari, Avro, Cassandra (Cassandra isn't a Hadoop subproject, it's a related project; they solve similar problems in different ways), Mahout, Pig, Spark, ZooKeeper (ZooKeeper isn't a Hadoop subproject. It's a dependency shared by many distributed systems), and so on. All of these serve different purposes, and the combination of all these subprojects forms the Hadoop ecosystem.

Core daemons of Hadoop

The following are the core daemons of Hadoop:

NameNode: This stores and manages all metadata about the data present on the cluster, so it is the single point of contact to Hadoop. In the new release of Hadoop, we have the option of more than one NameNode for high availability.
JobTracker: This runs on the NameNode and manages the MapReduce jobs submitted to the cluster.
SecondaryNameNode: This maintains the backup of the metadata present on the NameNode, and also records the filesystem changes.
DataNode: This contains the actual data.
TaskTracker: This performs tasks on the local data, as assigned by the JobTracker.

The preceding are the daemons in the case of Hadoop v1 or earlier. In newer versions of Hadoop, we have the ResourceManager instead of the JobTracker, the NodeManager instead of the TaskTrackers, and the YARN framework instead of the simple MapReduce framework. The following is the comparison between daemons in Hadoop 1 and Hadoop 2:

HDFS daemons: Hadoop 1 has the NameNode, Secondary NameNode, and DataNode; Hadoop 2 has the NameNode (more than one, in active/standby mode), the checkpoint node, and the DataNode.
Processing daemons: Hadoop 1 uses MapReduce v1 with the JobTracker and TaskTracker; Hadoop 2 uses YARN (MRv2) with the ResourceManager, NodeManager, and Application Master.

Comparing HBase with Hadoop

As we now know what HBase and Hadoop are, let's compare HDFS and HBase for better understanding:

Hadoop/HDFS provides a filesystem for distributed storage; HBase provides tabular, column-oriented data storage.
HDFS is optimized for storing huge files with no random read/write of these files; HBase is optimized for tabular data with random read/write access.
HDFS uses flat files; HBase uses key-value pairs of data.
The HDFS data model is not flexible; HBase provides a flexible data model.
HDFS provides a filesystem and a processing framework; HBase provides tabular storage with built-in Hadoop MapReduce support.
HDFS is mostly optimized for write-once, read-many access; HBase is optimized for data that is both read and written many times.

Summary

So in this article, we discussed the introductory aspects of HBase and its features. We have also discussed HBase's components and their place in the HBase ecosystem.

Resources for Article: Further resources on this subject: The HBase's Data Storage [Article] HBase Administration, Performance Tuning [Article] Comparative Study of NoSQL Products [Article]
article-image-plot-function
Packt
18 Nov 2014
17 min read

The plot function

In this article, L. Felipe Martins, the author of the book IPython Notebook Essentials, discusses the plot() function, which is an important part of matplotlib, a Python library for the production of publication-quality graphs. (For more resources related to this topic, see here.)

The plot() function is the workhorse of the matplotlib library. In this section, we will explore the line-plotting and formatting capabilities included in this function. To make things a bit more concrete, let's consider the formula for logistic growth, as follows:

N(t) = a / (b + c e^(-rt))

This model is frequently used to represent growth that shows an initial exponential phase, and then is eventually limited by some factor. Examples are a population in an environment with limited resources, and new products and/or technological innovations, which initially attract a small and quickly growing market but eventually reach a saturation point.

A common strategy to understand a mathematical model is to investigate how it changes as the parameters defining it are modified. Let's say we want to see what happens to the shape of the curve when the parameter b changes. To be able to do what we want more efficiently, we are going to use a function factory. This way, we can quickly create logistic models with arbitrary values for r, a, b, and c. Run the following code in a cell:

    def make_logistic(r, a, b, c):
        def f_logistic(t):
            return a / (b + c * exp(-r * t))
        return f_logistic

The function factory pattern takes advantage of the fact that functions are first-class objects in Python. This means that functions can be treated as regular objects: they can be assigned to variables, stored in lists or dictionaries, and play the role of arguments and/or return values in other functions. In our example, we define the make_logistic() function, whose output is itself a Python function. Notice how the f_logistic() function is defined inside the body of make_logistic() and then returned in the last line.

Let's now use the function factory to create three functions representing logistic curves, as follows:

    r = 0.15
    a = 20.0
    c = 15.0
    b1, b2, b3 = 2.0, 3.0, 4.0
    logistic1 = make_logistic(r, a, b1, c)
    logistic2 = make_logistic(r, a, b2, c)
    logistic3 = make_logistic(r, a, b3, c)

In the preceding code, we first fix the values of r, a, and c, and define three logistic curves for different values of b. The important point to notice is that logistic1, logistic2, and logistic3 are functions. So, for example, we can use logistic1(2.5) to compute the value of the first logistic curve at time 2.5. We can now plot the functions using the following code:

    tmax = 40
    tvalues = linspace(0, tmax, 300)
    plot(tvalues, logistic1(tvalues))
    plot(tvalues, logistic2(tvalues))
    plot(tvalues, logistic3(tvalues))

The first line in the preceding code sets the maximum time value, tmax, to 40. Then, we define the set of times at which we want the functions evaluated with the assignment, as follows:

    tvalues = linspace(0, tmax, 300)

The linspace() function is very convenient for generating points for plotting. The preceding code creates an array of 300 equally spaced points in the interval from 0 to tmax. Note that, contrary to other functions, such as range() and arange(), the right endpoint of the interval is included by default. (To exclude the right endpoint, use the endpoint=False option.) After defining the array of time values, the plot() function is called to graph the curves.
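As a quick, illustrative check of the endpoint behaviour just mentioned (this snippet is ours, not from the book), you can compare linspace() and arange() directly in the same pylab environment:

    linspace(0, 1, 5)                   # array([ 0.  , 0.25, 0.5 , 0.75, 1.  ]) -- right endpoint included
    linspace(0, 1, 5, endpoint=False)   # array([ 0. , 0.2, 0.4, 0.6, 0.8])      -- right endpoint excluded
    arange(0, 1, 0.25)                  # array([ 0.  , 0.25, 0.5 , 0.75])       -- right endpoint excluded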
In its most basic form, it plots a single curve in a default color and line style. In this usage, the two arguments are two arrays. The first array gives the horizontal coordinates of the points being plotted, and the second array gives the vertical coordinates. A typical example will be the following function call: plot(x,y) The variables x and y must refer to NumPy arrays (or any Python iterable values that can be converted into an array) and must have the same dimensions. The points plotted have coordinates as follows: x[0], y[0] x[1], y[1] x[2], y[2] … The preceding command will produce the following plot, displaying the three logistic curves: You may have noticed that before the graph is displayed, there is a line of text output that looks like the following: [<matplotlib.lines.Line2D at 0x7b57c50>] This is the return value of the last call to the plot() function, which is a list (or with a single element) of objects of the Line2D type. One way to prevent the output from being shown is to enter None as the last row in the cell. Alternatively, we can assign the return value of the last call in the cell to a dummy variable: _dummy_ = plot(tvalues, logistic3(tvalues)) The plot() function supports plotting several curves in the same function call. We need to change the contents of the cell that are shown in the following code and run it again: tmax = 40 tvalues = linspace(0, tmax, 300) plot(tvalues, logistic1(tvalues),      tvalues, logistic2(tvalues),      tvalues, logistic3(tvalues)) This form saves some typing but turns out to be a little less flexible when it comes to customizing line options. Notice that the text output produced now is a list with three elements: [<matplotlib.lines.Line2D at 0x9bb6cc0>, <matplotlib.lines.Line2D at 0x9bb6ef0>, <matplotlib.lines.Line2D at 0x9bb9518>] This output can be useful in some instances. For now, we will stick with using one call to plot() for each curve, since it produces code that is clearer and more flexible. Let's now change the line options in the plot and set the plot bounds. Change the contents of the cell to read as follows: plot(tvalues, logistic1(tvalues),      linewidth=1.5, color='DarkGreen', linestyle='-') plot(tvalues, logistic2(tvalues),      linewidth=2.0, color='#8B0000', linestyle=':') plot(tvalues, logistic3(tvalues),      linewidth=3.5, color=(0.0, 0.0, 0.5), linestyle='--') axis([0, tmax, 0, 11.]) None Running the preceding command lines will produce the following plots: The options set in the preceding code are as follows: The first curve is plotted with a line width of 1.5, with the HTML color of DarkGreen, and a filled-line style The second curve is plotted with a line width of 2.0, colored with the RGB value given by the hexadecimal string '#8B0000', and a dotted-line style The third curve is plotted with a line width of 3.0, colored with the RGB components, (0.0, 0.0, 0.5), and a dashed-line style Notice that there are different ways of specifying a fixed color: a HTML color name, a hexadecimal string, or a tuple of floating-point values. In the last case, the entries in the tuple represent the intensity of the red, green, and blue colors, respectively, and must be floating-point values between 0.0 and 1.0. A complete list of HTML name colors can be found at http://www.w3schools.com/html/html_colornames.asp. Editor's Tip: For more insights on colors, check out https://dgtl.link/colors Line styles are specified by a symbolic string. 
The allowed values are shown in the following list:

'-': Solid (the default)
'--': Dashed
':': Dotted
'-.': Dash-dot
'None', ' ', or '': Not displayed

After the calls to plot(), we set the graph bounds with the function call:

    axis([0, tmax, 0, 11.])

The argument to axis() is a four-element list that specifies, in this order, the minimum and maximum values of the horizontal coordinates, and the minimum and maximum values of the vertical coordinates. It may seem non-intuitive that the bounds for the variables are set after the plots are drawn. In the interactive mode, matplotlib remembers the state of the graph being constructed, and graphics objects are updated in the background after each command is issued. The graph is only rendered when all computations in the cell are done, so that all previously specified options take effect. Note that starting a new cell clears all the graph data. This interactive behavior is part of the matplotlib.pyplot module, which is one of the components imported by pylab.

Besides drawing a line connecting the data points, it is also possible to draw markers at specified points. Change the graphing commands as indicated in the following code snippet, and then run the cell again:

    plot(tvalues, logistic1(tvalues),
         linewidth=1.5, color='DarkGreen', linestyle='-',
         marker='o', markevery=50, markerfacecolor='GreenYellow',
         markersize=10.0)
    plot(tvalues, logistic2(tvalues),
         linewidth=2.0, color='#8B0000', linestyle=':',
         marker='s', markevery=50, markerfacecolor='Salmon',
         markersize=10.0)
    plot(tvalues, logistic3(tvalues),
         linewidth=2.0, color=(0.0, 0.0, 0.5), linestyle='--',
         marker = '*', markevery=50, markerfacecolor='SkyBlue',
         markersize=12.0)
    axis([0, tmax, 0, 11.])
    None

Now, the graph will look as shown in the following figure: The only difference from the previous code is that now we added options to draw markers. The following are the options we use:

The marker option specifies the shape of the marker. Shapes are given as symbolic strings. In the preceding examples, we use 'o' for a circular marker, 's' for a square, and '*' for a star. A complete list of available markers can be found at http://matplotlib.org/api/markers_api.html#module-matplotlib.markers.
The markevery option specifies a stride within the data points for the placement of markers. In our example, we place a marker after every 50 data points.
The markerfacecolor option specifies the fill color of the marker.
The markersize option specifies the size of the marker. The size is given in points.

There are a large number of other options that can be applied to lines in matplotlib. A complete list is available at http://matplotlib.org/api/artist_api.html#module-matplotlib.lines.
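Returning briefly to the color specifications discussed earlier, here is a small illustrative snippet of our own (not from the book) showing three equivalent ways of requesting the same color in the pylab environment used throughout this article; the HTML name DarkRed, the hexadecimal string '#8B0000', and the RGB tuple built from (139, 0, 0) all describe the same color:

    plot(tvalues, logistic1(tvalues), color='DarkRed')              # HTML color name
    plot(tvalues, logistic2(tvalues), color='#8B0000')              # hexadecimal string
    plot(tvalues, logistic3(tvalues), color=(139/255., 0.0, 0.0))   # RGB tuple of floats in [0, 1]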
Each of the format specifiers is then associated sequentially with one of the data arguments of the method. A full documentation covering the details of string formatting is available at https://docs.python.org/2/library/string.html. The axis labels are set in the calls: xlabel('$t$') ylabel('$N(t)=a/(b+ce^{-rt})$') As in the title() functions, the xlabel() and ylabel() functions accept any Python string. Note that in the '$t$' and '$N(t)=a/(b+ce^{-rt}$' strings, we use LaTeX to format the mathematical formulas. This is indicated by the dollar signs, $...$, in the string. After the addition of a title and labels, our graph looks like the following: Next, we need a way to identify each of the curves in the picture. One way to do that is to use a legend, which is indicated as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)]) The legend() function accepts a list of strings. Each string is associated with a curve in the order they are added to the plot. Notice that we are again using formatted strings. Unfortunately, the preceding code does not produce great results. The legend, by default, is placed in the top-right corner of the plot, which, in this case, hides part of the graph. This is easily fixed using the loc option in the legend function, as shown in the following code: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], loc='upper left') Running this code, we obtain the final version of our logistic growth plot, as follows: The legend location can be any of the strings: 'best', 'upper right', 'upper left', 'lower left', 'lower right', 'right', 'center left', 'center right', 'lower center', 'upper center', and 'center'. It is also possible to specify the location of the legend precisely with the bbox_to_anchor option. To see how this works, modify the code for the legend as follows: legend(['b={:5.2f}'.format(b1),        'b={:5.2f}'.format(b2),        'b={:5.2f}'.format(b3)], bbox_to_anchor=(0.9,0.35)) Notice that the bbox_to_anchor option, by default, uses a coordinate system that is not the same as the one we specified for the plot. The x and y coordinates of the box in the preceding example are interpreted as a fraction of the width and height, respectively, of the whole figure. A little trial-and-error is necessary to place the legend box precisely where we want it. Note that the legend box can be placed outside the plot area. For example, try the coordinates (1.32,1.02). The legend() function is quite flexible and has quite a few other options that are documented at http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.legend. Text and annotations In this subsection, we will show how to add annotations to plots in matplotlib. We will build a plot demonstrating the fact that the tangent to a curve must be horizontal at the highest and lowest points. We start by defining the function associated with the curve and the set of values at which we want the curve to be plotted, which is shown in the following code: f = lambda x: (x**3 - 6*x**2 + 9*x + 3) / (1 + 0.25*x**2) xvalues = linspace(0, 5, 200) The first line in the preceding code uses a lambda expression to define the f() function. We use this approach here because the formula for the function is a simple, one-line expression. 
The general form of a lambda expression is as follows: lambda <arguments> : <return expression> This expression by itself creates an anonymous function that can be used in any place that a function object is expected. Note that the return value must be a single expression and cannot contain any statements. The formula for the function may seem unusual, but it was chosen by trial-and-error and a little bit of calculus so that it produces a nice graph in the interval from 0 to 5. The xvalues array is defined to contain 200 equally spaced points on this interval. Let's create an initial plot of our curve, as shown in the following code: plot(xvalues, f(xvalues), lw=2, color='FireBrick') axis([0, 5, -1, 8]) grid() xlabel('$x$') ylabel('$f(x)$') title('Extreme values of a function') None # Prevent text output Most of the code in this segment is explained in the previous section. The only new bit is that we use the grid() function to draw a grid. Used with no arguments, the grid coincides with the tick marks on the plot. As everything else in matplotlib, grids are highly customizable. Check the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.grid. When the preceding code is executed, the following plot is produced: Note that the curve has a highest point (maximum) and a lowest point (minimum). These are collectively called the extreme values of the function (on the displayed interval, this function actually grows without bounds as x becomes large). We would like to locate these on the plot with annotations. We will first store the relevant points as follows: x_min = 3.213 f_min = f(x_min) x_max = 0.698 f_max = f(x_max) p_min = array([x_min, f_min]) p_max = array([x_max, f_max]) print p_min print p_max The variables, x_min and f_min, are defined to be (approximately) the coordinates of the lowest point in the graph. Analogously, x_max and f_max represent the highest point. Don't be concerned with how these points were found. For the purposes of graphing, even a rough approximation by trial-and-error would suffice. Now, add the following code to the cell that draws the plot, right below the title() command, as shown in the following code: arrow_props = dict(facecolor='DimGray', width=3, shrink=0.05,              headwidth=7) delta = array([0.1, 0.1]) offset = array([1.0, .85]) annotate('Maximum', xy=p_max+delta, xytext=p_max+offset,          arrowprops=arrow_props, verticalalignment='bottom',          horizontalalignment='left', fontsize=13) annotate('Minimum', xy=p_min-delta, xytext=p_min-offset,          arrowprops=arrow_props, verticalalignment='top',          horizontalalignment='right', fontsize=13) Run the cell to produce the plot shown in the following diagram: In the code, start by assigning the variables arrow_props, delta, and offset, which will be used to set the arguments in the calls to annotate(). The annotate() function adds a textual annotation to the graph with an optional arrow indicating the point being annotated. The first argument of the function is the text of the annotation. The next two arguments give the locations of the arrow and the text: xy: This is the point being annotated and will correspond to the tip of the arrow. We want this to be the maximum/minimum points, p_min and p_max, but we add/subtract the delta vector so that the tip is a bit removed from the actual point. xytext: This is the point where the text will be placed as well as the base of the arrow. We specify this as offsets from p_min and p_max using the offset vector. 
All other arguments of annotate() are formatting options:

arrowprops: This is a Python dictionary containing the arrow properties. We predefine the dictionary, arrow_props, and use it here. Arrows can be quite sophisticated in matplotlib, and you are directed to the documentation for details.
verticalalignment and horizontalalignment: These specify how the arrow should be aligned with the text.
fontsize: This signifies the size of the text. Text is also highly configurable, and the reader is directed to the documentation for details.

The annotate() function has a huge number of options; for complete details of what is available, users should consult the documentation at http://matplotlib.org/1.3.1/api/pyplot_api.html#matplotlib.pyplot.annotate.

We now want to add a comment on what is being demonstrated by the plot by adding an explanatory textbox. Add the following code to the cell right after the calls to annotate():

    bbox_props = dict(boxstyle='round', lw=2, fc='Beige')
    text(2, 6, 'Maximum and minimum points\nhave horizontal tangents',
         bbox=bbox_props, fontsize=12, verticalalignment='top')

The text() function is used to place text at an arbitrary position of the plot. The first two arguments are the position of the textbox, and the third argument is a string containing the text to be displayed. Notice the use of '\n' to indicate a line break. The other arguments are configuration options. The bbox argument is a dictionary with the options for the box. If omitted, the text will be displayed without any surrounding box. In the example code, the box is a rectangle with rounded corners, with a border width of 2 and the face color beige.

As a final detail, let's add the tangent lines at the extreme points. Add the following code:

    plot([x_min-0.75, x_min+0.75], [f_min, f_min],
         color='RoyalBlue', lw=3)
    plot([x_max-0.75, x_max+0.75], [f_max, f_max],
         color='RoyalBlue', lw=3)

Since the tangents are segments of straight lines, we simply give the coordinates of the endpoints. The reason for adding the code for the tangents at the top of the cell is that this causes them to be plotted first, so that the graph of the function is drawn on top of the tangents. This is the final result: The examples we have seen so far only scratch the surface of what is possible with matplotlib. The reader should read the matplotlib documentation for more examples.

Summary

In this article, we learned how to use matplotlib to produce presentation-quality plots. We covered two-dimensional plots and how to set plot options, and annotate and configure plots. You also learned how to add labels, titles, and legends.

Edited on July 27, 2018 to replace a broken reference link.

Resources for Article: Further resources on this subject: Installing NumPy, SciPy, matplotlib, and IPython [Article] SciPy for Computational Geometry [Article] Fast Array Operations with NumPy [Article]

article-image-hbases-data-storage
Packt
13 Nov 2014
9 min read

The HBase's Data Storage

In this article by Nishant Garg, author of HBase Essentials, we will look at HBase's data storage from its architectural viewpoint. (For more resources related to this topic, see here.) For most developers or users, the preceding topics are not of much interest, but for an administrator, it really makes sense to understand how the underlying data is stored or replicated within HBase. Administrators are the people who deal with HBase, starting from its installation through cluster management (performance tuning, monitoring, failure, recovery, data security, and so on). Let's start with data storage in HBase first.

Data storage

In HBase, tables are split into smaller chunks that are distributed across multiple servers. These smaller chunks are called regions, and the servers that host regions are called RegionServers. The master process handles the distribution of regions among RegionServers, and each RegionServer typically hosts multiple regions. In the HBase implementation, the HRegionServer and HRegion classes represent the region server and the region, respectively. HRegionServer contains the set of HRegion instances available to the client and handles two types of files for data storage:

HLog (the write-ahead log file, also known as WAL)
HFile (the real data storage file)

In HBase, there is a system-defined catalog table called hbase:meta that keeps the list of all the regions for user-defined tables. In older versions prior to 0.96.0, HBase had two catalog tables called -ROOT- and .META. The -ROOT- table was used to keep track of the location of the .META table. From version 0.96.0 onwards, the -ROOT- table is removed. The .META table is renamed as hbase:meta, and the location of hbase:meta is now stored in ZooKeeper. The following is the structure of the hbase:meta table.

Key: the region key of the format ([table],[region start key],[region id]). A region with an empty start key is the first region in a table.

The values are as follows:

info:regioninfo (the serialized HRegionInfo instance for this region)
info:server (server:port of the RegionServer containing this region)
info:serverstartcode (start time of the RegionServer process that contains this region)

When the table is split, two new columns will be created, info:splitA and info:splitB. These columns represent the two newly created regions. The values for these columns are also serialized HRegionInfo instances. Once the split process is complete, the row that contains the old region information is deleted.

In the case of data reading, the client application first connects to ZooKeeper and looks up the location of the hbase:meta table. Next, the client's HTable instance queries the hbase:meta table, finds out the region that contains the rows of interest, and also locates the region server that is serving the identified region. The information about the region and region server is then cached by the client application for future interactions, which avoids repeating the lookup process. If the region is reassigned by the load balancer process or if the region server has expired, a fresh lookup is done on the hbase:meta catalog table to get the new location of the user table region, and the cache is updated accordingly.

At the object level, the HRegionServer class is responsible for creating a connection with the region by creating HRegion objects. This HRegion instance sets up a store instance that has one or more StoreFile instances (wrapped around HFile) and a MemStore. MemStore accumulates the data edits as they happen and buffers them in memory.
This is also important for accessing the recent edits of table data. As shown in the preceding diagram, the HRegionServer instance (the region server) contains the map of HRegion instances (regions) and also has an HLog instance that represents the WAL. There is a single block cache instance at the region-server level, which holds data from all the regions hosted on that region server. A block cache instance is created at region server startup, and it can have an implementation of LruBlockCache, SlabCache, or BucketCache. The block cache also supports multilevel caching; that is, a block cache might have a first-level cache, L1, as LruBlockCache and a second-level cache, L2, as SlabCache or BucketCache. All these cache implementations have their own way of managing memory; for example, LruBlockCache is an on-heap data structure and resides on the JVM heap, whereas the other two implementations also use memory outside of the JVM heap.

HLog (the write-ahead log – WAL)

In the case of writing data, when the client calls HTable.put(Put), the data is first written to the write-ahead log file (which contains the actual data and a sequence number, together represented by the HLogKey class) and also written to the MemStore. Writing data directly into the MemStore can be dangerous, as it is a volatile in-memory buffer and always open to the risk of losing data in the case of a server failure. Once the MemStore is full, its contents are flushed to disk by creating a new HFile on HDFS. While inserting data from the HBase shell, the flush command can be used to write the in-memory (memstore) data to the store files.

If there is a server failure, the WAL can be replayed to recover everything up to the point where the server was prior to the crash. Hence, the WAL guarantees that the data is never lost. Also, as another level of assurance, the actual write-ahead log resides on HDFS, which is a replicated filesystem. Any other server having a replicated copy can open the log.

The HLog class represents the WAL. When an HRegion object is instantiated, the single HLog instance is passed as a parameter to the constructor of HRegion. In the case of an update operation, it saves the data directly to the shared WAL and also keeps track of the changes by incrementing the sequence number for each edit.

The WAL uses a Hadoop SequenceFile, which stores records as sets of key-value pairs. Here, the HLogKey instance represents the key, and the key-value represents the rowkey, column family, column qualifier, timestamp, type, and value, along with the region and table name where the data needs to be stored. Also, the structure starts with two fixed-length numbers that indicate the size of the key and the size of the value. The following diagram shows the structure of a key-value pair:

The WALEdit class instance takes care of atomicity at the log level by wrapping each update. For example, in the case of a multicolumn update for a row, each column is represented as a separate KeyValue instance. If the server fails after writing only a few of the columns to the WAL, it ends up with only a half-persisted row, and the remaining updates are not persisted. Atomicity is guaranteed by wrapping all updates that comprise multiple columns into a single WALEdit instance and writing it in a single operation. For durability, a log writer's sync() method is called, which gets the acknowledgement from the low-level filesystem on each update.
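To make the length-prefixed layout described above more concrete, here is a small illustrative Python sketch of a simplified key-value record: two fixed-length integers first (the key length and the value length), followed by the key bytes and the value bytes. This is our own simplified model for illustration only, not HBase's actual on-disk code; the real KeyValue carries the row key, column family, qualifier, timestamp, and type in a specific binary layout, both in the WAL and in HFiles.

    import struct

    def pack_keyvalue(key: bytes, value: bytes) -> bytes:
        # two fixed-length big-endian integers, then the key and value payloads
        return struct.pack('>II', len(key), len(value)) + key + value

    def unpack_keyvalue(buf: bytes):
        key_len, value_len = struct.unpack_from('>II', buf, 0)
        start = struct.calcsize('>II')
        key = buf[start:start + key_len]
        value = buf[start + key_len:start + key_len + value_len]
        return key, value

    record = pack_keyvalue(b'row1/cf:qualifier/ts', b'some cell value')
    print(unpack_keyvalue(record))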
This method also takes care of writing the WAL to the replication servers (from one datanode to another). The log flush time can be set to as low as you want, or even be kept in sync for every edit to ensure high durability but at the cost of performance. To take care of the size of the write ahead log file, the LogRoller instance runs as a background thread and takes care of rolling log files at certain intervals (the default is 60 minutes). Rolling of the log file can also be controlled based on the size and hbase.regionserver.logroll.multiplier. It rotates the log file when it becomes 90 percent of the block size, if set to 0.9. HFile (the real data storage file) HFile represents the real data storage file. The files contain a variable number of data blocks and fixed number of file info blocks and trailer blocks. The index blocks records the offsets of the data and meta blocks. Each data block contains a magic header and a number of serialized KeyValue instances. The default size of the block is 64 KB and can be as large as the block size. Hence, the default block size for files in HDFS is 64 MB, which is 1,024 times the HFile default block size but there is no correlation between these two blocks. Each key-value in the HFile is represented as a low-level byte array. Within the HBase root directory, we have different files available at different levels. Write-ahead log files represented by the HLog instances are created in a directory called WALs under the root directory defined by the hbase.rootdir property in hbase-site.xml. This WALs directory also contains a subdirectory for each HRegionServer. In each subdirectory, there are several write-ahead log files (because of log rotation). All regions from that region server share the same HLog files. In HBase, every table also has its own directory created under the data/default directory. This data/default directory is located under the root directory defined by the hbase.rootdir property in hbase-site.xml. Each table directory contains a file called .tableinfo within the .tabledesc folder. This .tableinfo file stores the metadata information about the table, such as table and column family schemas, and is represented as the serialized HTableDescriptor class. Each table directory also has a separate directory for every region comprising the table, and the name of this directory is created using the MD5 hash portion of a region name. The region directory also has a .regioninfo file that contains the serialized information of the HRegionInfo instance for the given region. Once the region exceeds the maximum configured region size, it splits and a matching split directory is created within the region directory. This size is configured using the hbase.hregion.max.filesize property or the configuration done at the column-family level using the HColumnDescriptor instance. In the case of multiple flushes by the MemStore, the number of files might get increased on this disk. The compaction process running in the background combines the files to the largest configured file size and also triggers region split. Summary In this article, we have learned about the internals of HBase and how it stores the data. Resources for Article: Further resources on this subject: Big Data Analysis [Article] Advanced Hadoop MapReduce Administration [Article] HBase Administration, Performance Tuning [Article]

article-image-postmodel-workflow
Packt
04 Nov 2014
23 min read

Postmodel Workflow

 This article written by Trent Hauck, the author of scikit-learn Cookbook, Packt Publishing, will cover the following recipes: K-fold cross validation Automatic cross validation Cross validation with ShuffleSplit Stratified k-fold Poor man's grid search Brute force grid search Using dummy estimators to compare results (For more resources related to this topic, see here.) Even though by design the articles are unordered, you could argue by virtue of the art of data science, we've saved the best for last. For the most part, each recipe within this article is applicable to the various models we've worked with. In some ways, you can think about this article as tuning the parameters and features. Ultimately, we need to choose some criteria to determine the "best" model. We'll use various measures to define best. Then in the Cross validation with ShuffleSplit recipe, we will randomize the evaluation across subsets of the data to help avoid overfitting. K-fold cross validation In this recipe, we'll create, quite possibly, the most important post-model validation exercise—cross validation. We'll talk about k-fold cross validation in this recipe. There are several varieties of cross validation, each with slightly different randomization schemes. K-fold is perhaps one of the most well-known randomization schemes. Getting ready We'll create some data and then fit a classifier on the different folds. It's probably worth mentioning that if you can keep a holdout set, then that would be best. For example, we have a dataset where N = 1000. If we hold out 200 data points, then use cross validation between the other 800 points to determine the best parameters. How to do it... First, we'll create some fake data, then we'll examine the parameters, and finally, we'll look at the size of the resulting dataset: >>> N = 1000>>> holdout = 200>>> from sklearn.datasets import make_regression>>> X, y = make_regression(1000, shuffle=True) Now that we have the data, let's hold out 200 points, and then go through the fold scheme like we normally would: >>> X_h, y_h = X[:holdout], y[:holdout]>>> X_t, y_t = X[holdout:], y[holdout:]>>> from sklearn.cross_validation import KFold K-fold gives us the option of choosing how many folds we want, if we want the values to be indices or Booleans, if want to shuffle the dataset, and finally, the random state (this is mainly for reproducibility). Indices will actually be removed in later versions. It's assumed to be True. Let's create the cross validation object: >>> kfold = KFold(len(y_t), n_folds=4) Now, we can iterate through the k-fold object: >>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(kfold):       print output_string.format(i, len(y_t[train]),       len(y_t[test]))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Each iteration should return the same split size. How it works... It's probably clear, but k-fold works by iterating through the folds and holds out 1/n_folds * N, where N for us was len(y_t). From a Python perspective, the cross validation objects have an iterator that can be accessed by using the in operator. Often times, it's useful to write a wrapper around a cross validation object that will iterate a subset of the data. For example, we may have a dataset that has repeated measures for data points or we may have a dataset with patients and each patient having measures. 
We're going to mix it up and use pandas for this part: >>> import numpy as np>>> import pandas as pd>>> patients = np.repeat(np.arange(0, 100, dtype=np.int8), 8)>>> measurements = pd.DataFrame({'patient_id': patients,                   'ys': np.random.normal(0, 1, 800)}) Now that we have the data, we only want to hold out certain customers instead of data points: >>> custids = np.unique(measurements.patient_id)>>> customer_kfold = KFold(custids.size, n_folds=4)>>> output_string = "Fold: {}, N_train: {}, N_test: {}">>> for i, (train, test) in enumerate(customer_kfold):       train_cust_ids = custids[train]       training = measurements[measurements.patient_id.isin(                 train_cust_ids)]       testing = measurements[~measurements.patient_id.isin(                 train_cust_ids)]       print output_string.format(i, len(training), len(testing))Fold: 0, N_train: 600, N_test: 200Fold: 1, N_train: 600, N_test: 200Fold: 2, N_train: 600, N_test: 200Fold: 3, N_train: 600, N_test: 200 Automatic cross validation We've looked at the using cross validation iterators that scikit-learn comes with, but we can also use a helper function to perform cross validation for use automatically. This is similar to how other objects in scikit-learn are wrapped by helper functions, pipeline for instance. Getting ready First, we'll need to create a sample classifier; this can really be anything, a decision tree, a random forest, whatever. For us, it'll be a random forest. We'll then create a dataset and use the cross validation functions. How to do it... First import the ensemble module and we'll get started: >>> from sklearn import ensemble>>> rf = ensemble.RandomForestRegressor(max_features='auto') Okay, so now, let's create some regression data: >>> from sklearn import datasets>>> X, y = datasets.make_regression(10000, 10) Now that we have the data, we can import the cross_validation module and get access to the functions we'll use: >>> from sklearn import cross_validation>>> scores = cross_validation.cross_val_score(rf, X, y)>>> print scores[ 0.86823874 0.86763225 0.86986129] How it works... For the most part, this will delegate to the cross validation objects. One nice thing is that, the function will handle performing the cross validation in parallel. We can activate verbose mode play by play: >>> scores = cross_validation.cross_val_score(rf, X, y, verbose=3, cv=4)[CV] no parameters to be set[CV] no parameters to be set, score=0.872866 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.873679 - 0.6s[CV] no parameters to be set[CV] no parameters to be set, score=0.878018 - 0.7s[CV] no parameters to be set[CV] no parameters to be set, score=0.871598 - 0.6s[Parallel(n_jobs=1)]: Done 1 jobs | elapsed: 0.7s[Parallel(n_jobs=1)]: Done 4 out of 4 | elapsed: 2.6s finished As we can see, during each iteration, we scored the function. We also get an idea of how long the model runs. It's also worth knowing that we can score our function predicated on which kind of model we're trying to fit. Cross validation with ShuffleSplit ShuffleSplit is one of the simplest cross validation techniques. This cross validation technique will simply take a sample of the data for the number of iterations specified. Getting ready ShuffleSplit is another cross validation technique that is very simple. We'll specify the total elements in the dataset, and it will take care of the rest. We'll walk through an example of estimating the mean of a univariate dataset. 
This is somewhat similar to resampling, but it'll illustrate one reason why we want to use cross validation while showing cross validation. How to do it... First, we need to create the dataset. We'll use NumPy to create a dataset, where we know the underlying mean. We'll sample half of the dataset to estimate the mean and see how close it is to the underlying mean: >>> import numpy as np>>> true_loc = 1000>>> true_scale = 10>>> N = 1000>>> dataset = np.random.normal(true_loc, true_scale, N)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.hist(dataset, color='k', alpha=.65, histtype='stepfilled');>>> ax.set_title("Histogram of dataset");>>> f.savefig("978-1-78398-948-5_06_06.png") NumPy will give the following output: Now, let's take the first half of the data and guess the mean: >>> from sklearn import cross_validation>>> holdout_set = dataset[:500]>>> fitting_set = dataset[500:]>>> estimate = fitting_set[:N/2].mean()>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.set_title("True Mean vs Regular Estimate")>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.set_xlim(999, 1001)>>> ax.legend()>>> f.savefig("978-1-78398-948-5_06_07.png") We'll get the following output: Now, we can use ShuffleSplit to fit the estimator on several smaller datasets: >>> from sklearn.cross_validation import ShuffleSplit>>> shuffle_split = ShuffleSplit(len(fitting_set))>>> mean_p = []>>> for train, _ in shuffle_split:       mean_p.append(fitting_set[train].mean())       shuf_estimate = np.mean(mean_p)>>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.vlines(true_loc, 0, 1, color='r', linestyles='-', lw=5,             alpha=.65, label='true mean')>>> ax.vlines(estimate, 0, 1, color='g', linestyles='-', lw=5,             alpha=.65, label='regular estimate')>>> ax.vlines(shuf_estimate, 0, 1, color='b', linestyles='-', lw=5,             alpha=.65, label='shufflesplit estimate')>>> ax.set_title("All Estimates")>>> ax.set_xlim(999, 1001)>>> ax.legend(loc=3) The output will be as follows: As we can see, we got an estimate that was similar to what we expected, but we were able to take many samples to get that estimate. Stratified k-fold In this recipe, we'll quickly look at stratified k-fold valuation. We've walked through different recipes where the class representation was unbalanced in some manner. Stratified k-fold is nice because its scheme is specifically designed to maintain the class proportions. Getting ready We're going to create a small dataset. In this dataset, we will then use stratified k-fold validation. We want it small so that we can see the variation. For larger samples. it probably won't be as big of a deal. We'll then plot the class proportions at each step to illustrate how the class proportions are maintained: >>> from sklearn import datasets>>> X, y = datasets.make_classification(n_samples=int(1e3), weights=[1./11]) Let's check the overall class weight distribution: >>> y.mean()0.90300000000000002 Roughly, 90.5 percent of the samples are 1, with the balance 0. How to do it... Let's create a stratified k-fold object and iterate it through each fold. We'll measure the proportion of verse that are 1. After that we'll plot the proportion of classes by the split number to see how and if it changes. 
This code will hopefully illustrate how this is beneficial. We'll also plot this code against a basic ShuffleSplit: >>> from sklearn import cross_validation>>> n_folds = 50>>> strat_kfold = cross_validation.StratifiedKFold(y,                 n_folds=n_folds)>>> shuff_split = cross_validation.ShuffleSplit(n=len(y),                 n_iter=n_folds)>>> kfold_y_props = []>>> shuff_y_props = []>>> for (k_train, k_test), (s_train, s_test) in zip(strat_kfold,         shuff_split):        kfold_y_props.append(y[k_train].mean())       shuff_y_props.append(y[s_train].mean()) Now, let's plot the proportions over each fold: >>> import matplotlib.pyplot as plt>>> f, ax = plt.subplots(figsize=(7, 5))>>> ax.plot(range(n_folds), shuff_y_props, label="ShuffleSplit",           color='k')>>> ax.plot(range(n_folds), kfold_y_props, label="Stratified",           color='k', ls='--')>>> ax.set_title("Comparing class proportions.")>>> ax.legend(loc='best') The output will be as follows: We can see that the proportion of each fold for stratified k-fold is stable across folds. How it works... Stratified k-fold works by taking the y value. First, getting the overall proportion of the classes, then intelligently splitting the training and test set into the proportions. This will generalize to multiple labels: >>> import numpy as np>>> three_classes = np.random.choice([1,2,3], p=[.1, .4, .5],                   size=1000)>>> import itertools as it>>> for train, test in cross_validation.StratifiedKFold(three_classes, 5):       print np.bincount(three_classes[train])[ 0 90 314 395][ 0 90 314 395][ 0 90 314 395][ 0 91 315 395][ 0 91 315 396] As we can see, we got roughly the sample sizes of each class for our training and testing proportions. Poor man's grid search In this recipe, we're going to introduce grid search with basic Python, though we will use sklearn for the models and matplotlib for the visualization. Getting ready In this recipe, we will perform the following tasks: Design a basic search grid in the parameter space Iterate through the grid and check the loss/score function at each point in the parameter space for the dataset Choose the point in the parameter space that minimizes/maximizes the evaluation function Also, the model we'll fit is a basic decision tree classifier. Our parameter space will be 2 dimensional to help us with the visualization: The parameter space will then be the Cartesian product of the those two sets: We'll see in a bit how we can iterate through this space with itertools. Let's create the dataset and then get started: >>> from sklearn import datasets>>> X, y = datasets.make_classification(n_samples=2000, n_features=10) How to do it... Earlier we said that we'd use grid search to tune two parameters—criteria and max_features. We need to represent those as Python sets, and then use itertools product to iterate through them: >>> criteria = {'gini', 'entropy'}>>> max_features = {'auto', 'log2', None}>>> import itertools as it>>> parameter_space = it.product(criteria, max_features) Great! So now that we have the parameter space, let's iterate through it and check the accuracy of each model as specified by the parameters. Then, we'll store that accuracy so that we can compare different parameter spaces. 
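One practical note before we write the loop (this aside is ours, not the recipe's): it.product returns a one-shot iterator, so if you materialize parameter_space to inspect the grid, you have to rebuild it before the loop can consume it:
>>> grid = list(parameter_space)    # this exhausts the iterator
>>> len(grid)                       # 2 criteria x 3 max_features = 6 combinations
6
>>> parameter_space = it.product(criteria, max_features)   # rebuild it for the loop below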
We'll also use a test and train split of 50, 50: import numpy as nptrain_set = np.random.choice([True, False], size=len(y))from sklearn.tree import DecisionTreeClassifieraccuracies = {}for criterion, max_feature in parameter_space:   dt = DecisionTreeClassifier(criterion=criterion,         max_features=max_feature)   dt.fit(X[train_set], y[train_set])   accuracies[(criterion, max_feature)] = (dt.predict(X[~train_set])                                         == y[~train_set]).mean()>>> accuracies{('entropy', None): 0.974609375, ('entropy', 'auto'): 0.9736328125,('entropy', 'log2'): 0.962890625, ('gini', None): 0.9677734375, ('gini','auto'): 0.9638671875, ('gini', 'log2'): 0.96875} So we now have the accuracies and its performance. Let's visualize the performance: >>> from matplotlib import pyplot as plt>>> from matplotlib import cm>>> cmap = cm.RdBu_r>>> f, ax = plt.subplots(figsize=(7, 4))>>> ax.set_xticklabels([''] + list(criteria))>>> ax.set_yticklabels([''] + list(max_features))>>> plot_array = []>>> for max_feature in max_features:m = []>>> for criterion in criteria:       m.append(accuracies[(criterion, max_feature)])       plot_array.append(m)>>> colors = ax.matshow(plot_array, vmin=np.min(accuracies.values())             - 0.001, vmax=np.max(accuracies.values()) + 0.001,             cmap=cmap)>>> f.colorbar(colors) The following is the output: It's fairly easy to see which one performed best here. Hopefully, you can see how this process can be taken to the further stage with a brute force method. How it works... This works fairly simply, we just have to perform the following steps: Choose a set of parameters. Iterate through them and find the accuracy of each step. Find the best performer by visual inspection. Brute force grid search In this recipe, we'll do an exhaustive grid search through scikit-learn. This is basically the same thing we did in the previous recipe, but we'll utilize built-in methods. We'll also walk through an example of performing randomized optimization. This is an alternative to brute force search. Essentially, we're trading computer cycles to make sure that we search the entire space. We were fairly calm in the last recipe. However, you could imagine a model that has several steps, first imputation for fix missing data, then PCA reduce the dimensionality to classification. Your parameter space could get very large, very fast; therefore, it can be advantageous to only search a part of that space. Getting ready To get started, we'll need to perform the following steps: Create some classification data. We'll then create a LogisticRegression object that will be the model we're fitting. After that, we'll create the search objects, GridSearch and RandomizedSearchCV. How to do it... Run the following code to create some classification data: >>> from sklearn.datasets import make_classification>>> X, y = make_classification(1000, n_features=5) Now, we'll create our logistic regression object: >>> from sklearn.linear_model import LogisticRegression>>> lr = LogisticRegression(class_weight='auto') We need to specify the parameters we want to search. 
For GridSearch, we can just specify the ranges that we care about, but for RandomizedSearchCV, we'll need to actually specify the distribution over the same space from which to sample: >>> lr.fit(X, y)LogisticRegression(C=1.0, class_weight={0: 0.25, 1: 0.75},                   dual=False,fit_intercept=True,                  intercept_scaling=1, penalty='l2',                   random_state=None, tol=0.0001)>>> grid_search_params = {'penalty': ['l1', 'l2'],'C': [1, 2, 3, 4]} The only change we'll need to make is to describe the C parameter as a probability distribution. We'll keep it simple right now, though we will use scipy to describe the distribution: >>> import scipy.stats as st>>> import numpy as np>>> random_search_params = {'penalty': ['l1', 'l2'],'C': st.randint(1, 4)} How it works... Now, we'll fit the classifier. This works by passing lr to the parameter search objects: >>> from sklearn.grid_search import GridSearchCV, RandomizedSearchCV>>> gs = GridSearchCV(lr, grid_search_params) GridSearchCV implements the same API as the other models: >>> gs.fit(X, y)GridSearchCV(cv=None, estimator=LogisticRegression(C=1.0,             class_weight='auto', dual=False, fit_intercept=True,             intercept_scaling=1, penalty='l2', random_state=None,             tol=0.0001), fit_params={}, iid=True, loss_func=None,             n_jobs=1, param_grid={'penalty': ['l1', 'l2'], 'C':             [1, 2, 3, 4]}, pre_dispatch='2*n_jobs', refit=True,             score_func=None, scoring=None, verbose=0) As we can see with the param_grid parameter, our penalty and C are both arrays. To access the scores, we can use the grid_scores_ attribute of the grid search. We also want to find the optimal set of parameters. We can also look at the marginal performance of the grid search: >>> gs.grid_scores_[mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 1},mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 2},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 2},mean: 0.90200, std: 0.01117, params: {'penalty': 'l1', 'C': 3},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 3},mean: 0.90100, std: 0.01258, params: {'penalty': 'l1', 'C': 4},mean: 0.90100, std: 0.01258, params: {'penalty': 'l2', 'C': 4}] We might want to get the max score: >>> gs.grid_scores_[1][1]0.90100000000000002>>> max(gs.grid_scores_, key=lambda x: x[1])mean: 0.90300, std: 0.01192, params: {'penalty': 'l1', 'C': 1} The parameters obtained are the best choices for our logistic regression. Using dummy estimators to compare results This recipe is about creating fake estimators; this isn't the pretty or exciting stuff, but it is worthwhile to have a reference point for the model you'll eventually build. Getting ready In this recipe, we'll perform the following tasks: Create some data random data. Fit the various dummy estimators. We'll perform these two steps for regression data and classification data. How to do it... 
First, we'll create the random data: >>> from sklearn.datasets import make_regression, make_classification# classification if for later>>> X, y = make_regression()>>> from sklearn import dummy>>> dumdum = dummy.DummyRegressor()>>> dumdum.fit(X, y)DummyRegressor(constant=None, strategy='mean') By default, the estimator will predict by just taking the mean of the values and predicting the mean values: >>> dumdum.predict(X)[:5]array([ 2.23297907, 2.23297907, 2.23297907, 2.23297907, 2.23297907]) There are other two other strategies we can try. We can predict a supplied constant (refer to constant=None from the preceding command). We can also predict the median value. Supplying a constant will only be considered if strategy is "constant". Let's have a look: >>> predictors = [("mean", None),                 ("median", None),                 ("constant", 10)]>>> for strategy, constant in predictors:       dumdum = dummy.DummyRegressor(strategy=strategy,                 constant=constant)>>> dumdum.fit(X, y)>>> print "strategy: {}".format(strategy), ",".join(map(str,         dumdum.predict(X)[:5]))strategy: mean 2.23297906733,2.23297906733,2.23297906733,2.23297906733,2.23297906733strategy: median 20.38535248,20.38535248,20.38535248,20.38535248,20.38535248strategy: constant 10.0,10.0,10.0,10.0,10.0 We actually have four options for classifiers. These strategies are similar to the continuous case, it's just slanted toward classification problems: >>> predictors = [("constant", 0),                 ("stratified", None),                 ("uniform", None),                 ("most_frequent", None)] We'll also need to create some classification data: >>> X, y = make_classification()>>> for strategy, constant in predictors:       dumdum = dummy.DummyClassifier(strategy=strategy,                 constant=constant)       dumdum.fit(X, y)       print "strategy: {}".format(strategy), ",".join(map(str,             dumdum.predict(X)[:5]))strategy: constant 0,0,0,0,0strategy: stratified 1,0,0,1,0strategy: uniform 0,0,0,1,1strategy: most_frequent 1,1,1,1,1 How it works... It's always good to test your models against the simplest models and that's exactly what the dummy estimators give you. For example, imagine a fraud model. In this model, only 5 percent of the data set is fraud. Therefore, we can probably fit a pretty good model just by never guessing any fraud. We can create this model by using the stratified strategy, using the following command. We can also get a good example of why class imbalance causes problems: >>> X, y = make_classification(20000, weights=[.95, .05])>>> dumdum = dummy.DummyClassifier(strategy='most_frequent')>>> dumdum.fit(X, y)DummyClassifier(constant=None, random_state=None, strategy='most_frequent')>>> from sklearn.metrics import accuracy_score>>> print accuracy_score(y, dumdum.predict(X))0.94575 We were actually correct very often, but that's not the point. The point is that this is our baseline. If we cannot create a model for fraud that is more accurate than this, then it isn't worth our time. Summary This article taught us how we can take a basic model produced from one of the recipes and tune it so that we can achieve better results than we could with the basic model. Resources for Article: Further resources on this subject: Specialized Machine Learning Topics [article] Machine Learning in IPython with scikit-learn [article] Our First Machine Learning Method – Linear Classification [article]

Loading data, creating an app, and adding dashboards and reports in Splunk

Packt
31 Oct 2014
13 min read
In this article by Josh Diakun, Paul R Johnson, and Derek Mock, authors of Splunk Operational Intelligence Cookbook, we will take a look at how to load sample data into Splunk, how to create an application, and how to add dashboards and reports in Splunk. (For more resources related to this topic, see here.) Loading the sample data While most of the data you will index with Splunk will be collected in real time, there might be instances where you have a set of data that you would like to put into Splunk, either to backfill some missing or incomplete data, or just to take advantage of its searching and reporting tools. This recipe will show you how to perform one-time bulk loads of data from files located on the Splunk server. We will also use this recipe to load the data samples that will be used as we build our Operational Intelligence app in Splunk. There are two files that make up our sample data. The first is access_log, which represents data from our web layer and is modeled on an Apache web server. The second file is app_log, which represents data from our application layer and is modeled on the log4j application log data. Getting ready To step through this recipe, you will need a running Splunk server and should have a copy of the sample data generation app (OpsDataGen.spl). (This file is part of the downloadable code bundle, which is available on the book's website.) How to do it... Follow the given steps to load the sample data generator on your system: Log in to your Splunk server using your credentials. From the home launcher, select the Apps menu in the top-left corner and click on Manage Apps. Select Install App from file. Select the location of the OpsDataGen.spl file on your computer, and then click on the Upload button to install the application. After installation, a message should appear in a blue bar at the top of the screen, letting you know that the app has installed successfully. You should also now see the OpsDataGen app in the list of apps. By default, the app installs with the data-generation scripts disabled. In order to generate data, you will need to enable either a Windows or Linux script, depending on your Splunk operating system. To enable the script, select the Settings menu from the top-right corner of the screen, and then select Data inputs. From the Data inputs screen that follows, select Scripts. On the Scripts screen, locate the OpsDataGen script for your operating system and click on Enable. For Linux, it will be $SPLUNK_HOME/etc/apps/OpsDataGen/bin/AppGen.path For Windows, it will be $SPLUNK_HOMEetcappsOpsDataGenbinAppGen-win.path The following screenshot displays both the Windows and Linux inputs that are available after installing the OpsDataGen app. It also displays where to click to enable the correct one based on the operating system Splunk is installed on. Select the Settings menu from the top-right corner of the screen, select Data inputs, and then select Files & directories. On the Files & directories screen, locate the two OpsDataGen inputs for your operating system and for each click on Enable. For Linux, it will be: $SPLUNK_HOME/etc/apps/OpsDataGen/data/access_log $SPLUNK_HOME/etc/apps/OpsDataGen/data/app_log For Windows, it will be: $SPLUNK_HOMEetcappsOpsDataGendataaccess_log $SPLUNK_HOMEetcappsOpsDataGendataapp_log The following screenshot displays both the Windows and Linux inputs that are available after installing the OpsDataGen app. 
It also displays where to click to enable the correct one based on the operating system Splunk is installed on. The data will now be generated in real time. You can test this by navigating to the Splunk search screen and running the following search over an All time (real-time) time range: index=main sourcetype=log4j OR sourcetype=access_combined After a short while, you should see data from both source types flowing into Splunk, and the data generation is now working as displayed in the following screenshot: How it works... In this case, you installed a Splunk application that leverages a scripted input. The script we wrote generates data for two source types. The access_combined source type contains sample web access logs, and the log4j source type contains application logs. Creating an Operational Intelligence application This recipe will show you how to create an empty Splunk app that we will use as the starting point in building our Operational Intelligence application. Getting ready To step through this recipe, you will need a running Splunk Enterprise server, with the sample data loaded from the previous recipe. You should be familiar with navigating the Splunk user interface. How to do it... Follow the given steps to create the Operational Intelligence application: Log in to your Splunk server. From the top menu, select Apps and then select Manage Apps. Click on the Create app button. Complete the fields in the box that follows. Name the app Operational Intelligence and give it a folder name of operational_intelligence. Add in a version number and provide an author name. Ensure that Visible is set to Yes, and the barebones template is selected. When the form is completed, click on Save. This should be followed by a blue bar with the message, Successfully saved operational_intelligence. Congratulations, you just created a Splunk application! How it works... When an app is created through the Splunk GUI, as in this recipe, Splunk essentially creates a new folder (or directory) named operational_intelligence within the $SPLUNK_HOME/etc/apps directory. Within the $SPLUNK_HOME/etc/apps/operational_intelligence directory, you will find four new subdirectories that contain all the configuration files needed for our barebones Operational Intelligence app that we just created. The eagle-eyed among you would have noticed that there were two templates, barebones and sample_app, out of which any one could have been selected when creating the app. The barebones template creates an application with nothing much inside of it, and the sample_app template creates an application populated with sample dashboards, searches, views, menus, and reports. If you wish to, you can also develop your own custom template if you create lots of apps, which might enforce certain color schemes for example. There's more... As Splunk apps are just a collection of directories and files, there are other methods to add apps to your Splunk Enterprise deployment. Creating an application from another application It is relatively simple to create a new app from an existing app without going through the Splunk GUI, should you wish to do so. This approach can be very useful when we are creating multiple apps with different inputs.conf files for deployment to Splunk Universal Forwarders. Taking the app we just created as an example, copy the entire directory structure of the operational_intelligence app and name it copied_app. 
cp -r $SPLUNK_HOME$/etc/apps/operational_intelligence/* $SPLUNK_HOME$/etc/apps/copied_app Within the directory structure of copied_app, we must now edit the app.conf file in the default directory. Open $SPLUNK_HOME$/etc/apps/copied_app/default/app.conf and change the label field to My Copied App, provide a new description, and then save the conf file. ## Splunk app configuration file#[install]is_configured = 0[ui]is_visible = 1label = My Copied App[launcher]author = John Smithdescription = My Copied applicationversion = 1.0 Now, restart Splunk, and the new My Copied App application should now be seen in the application menu. $SPLUNK_HOME$/bin/splunk restart Downloading and installing a Splunk app Splunk has an entire application website with hundreds of applications, created by Splunk, other vendors, and even users of Splunk. These are great ways to get started with a base application, which you can then modify to meet your needs. If the Splunk server that you are logged in to has access to the Internet, you can click on the Apps menu as you did earlier and then select the Find More Apps button. From here, you can search for apps and install them directly. An alternative way to install a Splunk app is to visit http://apps.splunk.com and search for the app. You will then need to download the application locally. From your Splunk server, click on the Apps menu and then on the Manage Apps button. After that, click on the Install App from File button and upload the app you just downloaded, in order to install it. Once the app has been installed, go and look at the directory structure that the installed application just created. Familiarize yourself with some of the key files and where they are located. When downloading applications from the Splunk apps site, it is best practice to test and verify them in a nonproduction environment first. The Splunk apps site is community driven and, as a result, quality checks and/or technical support for some of the apps might be limited. Adding dashboards and reports Dashboards are a great way to present many different pieces of information. Rather than having lots of disparate dashboards across your Splunk environment, it makes a lot of sense to group related dashboards into a common Splunk application, for example, putting operational intelligence dashboards into a common Operational Intelligence application. In this recipe, you will learn how to move the dashboards and associated reports into our new Operational Intelligence application. Getting ready To step through this recipe, you will need a running Splunk Enterprise server, with the sample data loaded from the Loading the sample data recipe. You should be familiar with navigating the Splunk user interface. How to do it... Follow these steps to move your dashboards into the new application: Log in to your Splunk server. Select the newly created Operational Intelligence application. From the top menu, select Settings and then select the User interface menu item. Click on the Views section. In the App Context dropdown, select Searching & Reporting (search) or whatever application you were in when creating the dashboards: Locate the website_monitoring dashboard row in the list of views and click on the Move link to the right of the row. In the Move Object pop up, select the Operational Intelligence (operational_intelligence) application that was created earlier and then click on the Move button. 
A message bar will then be displayed at the top of the screen to confirm that the dashboard was moved successfully. Repeat from step 5 to move the product_monitoring dashboard as well. After the Website Monitoring and Product Monitoring dashboards have been moved, we now want to move all the reports that were created, as these power the dashboards and provide operational intelligence insight. From the top menu, select Settings and this time select Searches, reports, and alerts. Select the Search & Reporting (search) context and filter by cp0* to view the searches (reports) that are created. Click on the Move link of the first cp0* search in the list. Select to move the object to the Operational Intelligence (operational_intelligence) application and click on the Move button. A message bar will then be displayed at the top of the screen to confirm that the dashboard was moved successfully. Select the Search & Reporting (search) context and repeat from step 11 to move all the other searches over to the new Operational Intelligence application—this seems like a lot but will not take you long! All of the dashboards and reports are now moved over to your new Operational Intelligence application. How it works... In the previous recipe, we revealed how Splunk apps are essentially just collections of directories and files. Dashboards are XML files found within the $SPLUNK_HOME/etc/apps directory structure. When moving a dashboard from one app to another, Splunk is essentially just moving the underlying file from a directory inside one app to a directory in the other app. In this recipe, you moved the dashboards from the Search & Reporting app to the Operational Intelligence app, as represented in the following screenshot: As visualizations on the dashboards leverage the underlying saved searches (or reports), you also moved these reports to the new app so that the dashboards maintain permissions to access them. Rather than moving the saved searches, you could have changed the permissions of each search to Global such that they could be seen from all the other apps in Splunk. However, the other reason you moved the reports was to keep everything contained within a single Operational Intelligence application, which you will continue to build on going forward. It is best practice to avoid setting permissions to Global for reports and dashboards, as this makes them available to all the other applications when they most likely do not need to be. Additionally, setting global permissions can make things a little messy from a housekeeping perspective and crowd the lists of reports and views that belong to specific applications. The exception to this rule might be for knowledge objects such as tags, event types, macros, and lookups, which often have advantages to being available across all applications. There's more… As you went through this recipe, you likely noticed that the dashboards had application-level permissions, but the reports had private-level permissions. The reports are private as this is the default setting in Splunk when they are created. This private-level permission restricts access to only your user account and admin users. In order to make the reports available to other users of your application, you will need to change the permissions of the reports to Shared in App as we did when adjusting the permissions of reports. 
Changing the permissions of saved reports Changing the sharing permission levels of your reports from the default Private to App is relatively straightforward: Ensure that you are in your newly created Operational Intelligence application. Select the Reports menu item to see the list of reports. Click on Edit next to the report you wish to change the permissions for. Then, click on Edit Permissions from the drop-down list. An Edit Permissions pop-up box will appear. In the Display for section, change from Owner to App, and then click on Save. The box will close, and you will see that the Sharing permissions in the table will now display App for the specific report. This report will now be available to all the users of your application. Summary In this article, we loaded the sample data into Splunk. We also saw how to organize dashboards and knowledge into a custom Splunk app. Resources for Article: Further resources on this subject: Working with Pentaho Mobile BI [Article] Visualization of Big Data [Article] Highlights of Greenplum [Article]

Theming with Highcharts

Packt
30 Oct 2014
10 min read
Besides the charting capabilities offered by Highcharts, theming is yet another strong feature of Highcharts. With its extensive theming API, charts can be customized completely to match the branding of a website or an app. Almost all of the chart elements are customizable through this API. In this article by Bilal Shahid, author of Highcharts Essentials, we will do the following things: (For more resources related to this topic, see here.) Use different fill types and fonts Create a global theme for our charts Use jQuery easing for animations Using Google Fonts with Highcharts Google provides an easy way to include hundreds of high quality web fonts to web pages. These fonts work in all major browsers and are served by Google CDN for lightning fast delivery. These fonts can also be used with Highcharts to further polish the appearance of our charts. This section assumes that you know the basics of using Google Web Fonts. If you are not familiar with them, visit https://developers.google.com/fonts/docs/getting_started. We will style the following example with Google Fonts. We will use the Merriweather family from Google Fonts and link to its style sheet from our web page inside the <head> tag: <link href='http://fonts.googleapis.com/css?family=Merriweather:400italic,700italic' rel='stylesheet' type='text/css'> Having included the style sheet, we can actually use the font family in our code for the labels in yAxis: yAxis: [{ ... labels: {    style: {      fontFamily: 'Merriweather, sans-serif',      fontWeight: 400,      fontStyle: 'italic',      fontSize: '14px',      color: '#ffffff'    } } }, { ... labels: {    style: {      fontFamily: 'Merriweather, sans-serif',      fontWeight: 700,      fontStyle: 'italic',      fontSize: '21px',      color: '#ffffff'    },    ... } }] For the outer axis, we used a font size of 21px with font weight of 700. For the inner axis, we lowered the font size to 14px and used font weight of 400 to compensate for the smaller font size. The following is the modified speedometer: In the next section, we will continue with the same example to include jQuery UI easing in chart animations. Using jQuery UI easing for series animation Animations occurring at the point of initialization of charts can be disabled or customized. The customization requires modifying two properties: animation.duration and animation.easing. The duration property accepts the number of milliseconds for the duration of the animation. The easing property can have various values depending on the framework currently being used. For a standalone jQuery framework, the values can be either linear or swing. Using the jQuery UI framework adds a couple of more options for the easing property to choose from. In order to follow this example, you must include the jQuery UI framework to the page. You can also grab the standalone easing plugin from http://gsgd.co.uk/sandbox/jquery/easing/ and include it inside your <head> tag. We can now modify the series to have a modified animation: plotOptions: { ... series: {    animation: {      duration: 1000,      easing: 'easeOutBounce'    } } } The preceding code will modify the animation property for all the series in the chart to have duration set to 1000 milliseconds and easing to easeOutBounce. Each series can have its own different animation by defining the animation property separately for each series as follows: series: [{ ... animation: {    duration: 500,    easing: 'easeOutBounce' } }, { ... 
animation: {    duration: 1500,    easing: 'easeOutBounce' } }, { ... animation: {      duration: 2500,    easing: 'easeOutBounce' } }] Different animation properties for different series can pair nicely with column and bar charts to produce visually appealing effects. Creating a global theme for our charts A Highcharts theme is a collection of predefined styles that are applied before a chart is instantiated. A theme will be applied to all the charts on the page after the point of its inclusion, given that the styling options have not been modified within the chart instantiation. This provides us with an easy way to apply custom branding to charts without the need to define styles over and over again. In the following example, we will create a basic global theme for our charts. This way, we will get familiar with the fundamentals of Highcharts theming and some API methods. We will define our theme inside a separate JavaScript file to make the code reusable and keep things clean. Our theme will be contained in an options object that will, in turn, contain styling for different Highcharts components. Consider the following code placed in a file named custom-theme.js. This is a basic implementation of a Highcharts custom theme that includes colors and basic font styles along with some other modifications for axes: Highcharts.customTheme = {      colors: ['#1BA6A6', '#12734F', '#F2E85C', '#F27329', '#D95D30', '#2C3949', '#3E7C9B', '#9578BE'],      chart: {        backgroundColor: {            radialGradient: {cx: 0, cy: 1, r: 1},            stops: [                [0, '#ffffff'],                [1, '#f2f2ff']            ]        },        style: {            fontFamily: 'arial, sans-serif',            color: '#333'        }    },    title: {        style: {            color: '#222',            fontSize: '21px',            fontWeight: 'bold'        }    },    subtitle: {        style: {            fontSize: '16px',            fontWeight: 'bold'        }    },    xAxis: {        lineWidth: 1,        lineColor: '#cccccc',        tickWidth: 1,        tickColor: '#cccccc',        labels: {            style: {                fontSize: '12px'            }        }    },    yAxis: {        gridLineWidth: 1,        gridLineColor: '#d9d9d9',        labels: {           style: {                fontSize: '12px'            }        }    },    legend: {        itemStyle: {            color: '#666',            fontSize: '9px'        },        itemHoverStyle:{            color: '#222'        }      } }; Highcharts.setOptions( Highcharts.customTheme ); We start off by modifying the Highcharts object to include an object literal named customTheme that contains styles for our charts. Inside customTheme, the first option we defined is for series colors. We passed an array containing eight colors to be applied to series. In the next part, we defined a radial gradient as a background for our charts and also defined the default font family and text color. The next two object literals contain basic font styles for the title and subtitle components. Then comes the styles for the x and y axes. For the xAxis, we define lineColor and tickColor to be #cccccc with the lineWidth value of 1. The xAxis component also contains the font style for its labels. The y axis gridlines appear parallel to the x axis that we have modified to have the width and color at 1 and #d9d9d9 respectively. Inside the legend component, we defined styles for the normal and mouse hover states. 
These two states are stated by itemStyle and itemHoverStyle respectively. In normal state, the legend will have a color of #666 and font size of 9px. When hovered over, the color will change to #222. In the final part, we set our theme as the default Highcharts theme by using an API method Highcharts.setOptions(), which takes a settings object to be applied to Highcharts; in our case, it is customTheme. The styles that have not been defined in our custom theme will remain the same as the default theme. This allows us to partially customize a predefined theme by introducing another theme containing different styles. In order to make this theme work, include the file custom-theme.js after the highcharts.js file: <script src="js/highcharts.js"></script> <script src="js/custom-theme.js"></script> The output of our custom theme is as follows: We can also tell our theme to include a web font from Google without having the need to include the style sheet manually in the header, as we did in a previous section. For that purpose, Highcharts provides a utility method named Highcharts.createElement(). We can use it as follows by placing the code inside the custom-theme.js file: Highcharts.createElement( 'link', {    href: 'http://fonts.googleapis.com/css?family=Open+Sans:300italic,400italic,700italic,400,300,700',    rel: 'stylesheet',    type: 'text/css' }, null, document.getElementsByTagName( 'head' )[0], null ); The first argument is the name of the tag to be created. The second argument takes an object as tag attributes. The third argument is for CSS styles to be applied to this element. Since, there is no need for CSS styles on a link element, we passed null as its value. The final two arguments are for the parent node and padding, respectively. We can now change the default font family for our charts to 'Open Sans': chart: {    ...    style: {        fontFamily: "'Open Sans', sans-serif",        ...    } } The specified Google web font will now be loaded every time a chart with our custom theme is initialized, hence eliminating the need to manually insert the required font style sheet inside the <head> tag. This screenshot shows a chart with 'Open Sans' Google web font. Summary In this article, you learned about incorporating Google fonts and jQuery UI easing into our chart for enhanced styling. Resources for Article: Further resources on this subject: Integrating with other Frameworks [Article] Highcharts [Article] More Line Charts, Area Charts, and Scatter Plots [Article]

Hosting the service in IIS using the TCP protocol

Packt
30 Oct 2014
8 min read
In this article by Mike Liu, the author of WCF Multi-layer Services Development with Entity Framework, Fourth Edtion, we will learn how to create and host a service in IIS using the TCP protocol. (For more resources related to this topic, see here.) Hosting WCF services in IIS using the HTTP protocol gives the best interoperability to the service, because the HTTP protocol is supported everywhere today. However, sometimes interoperability might not be an issue. For example, the service may be invoked only within your network with all Microsoft clients only. In this case, hosting the service by using the TCP protocol might be a better solution. Benefits of hosting a WCF service using the TCP protocol Compared to HTTP, there are a few benefits in hosting a WCF service using the TCP protocol: It supports connection-based, stream-oriented delivery services with end-to-end error detection and correction It is the fastest WCF binding for scenarios that involve communication between different machines It supports duplex communication, so it can be used to implement duplex contracts It has a reliable data delivery capability (this is applied between two TCP/IP nodes and is not the same thing as WS-ReliableMessaging, which applies between endpoints) Preparing the folders and files First, we need to prepare the folders and files for the host application, just as we did for hosting the service using the HTTP protocol. We will use the previous HTTP hosting application as the base to create the new TCP hosting application: Create the folders: In Windows Explorer, create a new folder called HostIISTcp under C:SOAwithWCFandEFProjectsHelloWorld and a new subfolder called bin under the HostIISTcp folder. You should now have the following new folders: C:SOAwithWCFandEFProjectsHelloWorld HostIISTcp and a bin folder inside the HostIISTcp folder. Copy the files: Now, copy all the files from the HostIIS hosting application folder at C:SOAwithWCFandEFProjectsHelloWorldHostIIS to the new folder that we created at C:SOAwithWCFandEFProjectsHelloWorldHostIISTcp. Create the Visual Studio solution folder: To make it easier to be viewed and managed from the Visual Studio Solution Explorer, you can add a new solution folder, HostIISTcp, to the solution and add the Web.config file to this folder. Add another new solution folder, bin, under HostIISTcp and add the HelloWorldService.dll and HelloWorldService.pdb files under this bin folder. Add the following post-build events to the HelloWorldService project, so next time, all the files will be copied automatically when the service project is built: xcopy "$(AssemblyName).dll" "C:SOAwithWCFandEFProjectsHelloWorldHostIISTcpbin" /Y xcopy "$(AssemblyName).pdb" "C:SOAwithWCFandEFProjectsHelloWorldHostIISTcpbin" /Y Modify the Web.config file: The Web.config file that we have copied from HostIIS is using the default basicHttpBinding as the service binding. To make our service use the TCP binding, we need to change the binding to TCP and add a TCP base address. Open the Web.config file and add the following node to it under the <system.serviceModel> node: <services> <service name="HelloWorldService.HelloWorldService">    <endpoint address="" binding="netTcpBinding"    contract="HelloWorldService.IHelloWorldService"/>    <host>      <baseAddresses>        <add baseAddress=        "net.tcp://localhost/HelloWorldServiceTcp/"/>      </baseAddresses>    </host> </service> </services> In this new services node, we have defined one service called HelloWorldService.HelloWorldService. 
The base address of this service is net.tcp://localhost/HelloWorldServiceTcp/. Remember, we have defined the host activation relative address as ./HelloWorldService.svc, so we can invoke this service from the client application with the following URL: http://localhost/HelloWorldServiceTcp/HelloWorldService.svc. For the file-less WCF activation, if no endpoint is defined explicitly, HTTP and HTTPS endpoints will be defined by default. In this example, we would like to expose only one TCP endpoint, so we have added an endpoint explicitly (as soon as this endpoint is added explicitly, the default endpoints will not be added). If you don't add this TCP endpoint explicitly here, the TCP client that we will create in the next section will still work, but on the client config file you will see three endpoints instead of one and you will have to specify which endpoint you are using in the client program. The following is the full content of the Web.config file: <?xml version="1.0"?> <!-- For more information on how to configure your ASP.NET application, please visit http://go.microsoft.com/fwlink/?LinkId=169433 --> <configuration> <system.web>    <compilation debug="true" targetFramework="4.5"/>    <httpRuntime targetFramework="4.5" /> </system.web>   <system.serviceModel>    <serviceHostingEnvironment >      <serviceActivations>        <add factory="System.ServiceModel.Activation.ServiceHostFactory"          relativeAddress="./HelloWorldService.svc"          service="HelloWorldService.HelloWorldService"/>      </serviceActivations>    </serviceHostingEnvironment>      <behaviors>      <serviceBehaviors>        <behavior>          <serviceMetadata httpGetEnabled="true"/>        </behavior>      </serviceBehaviors>    </behaviors>    <services>      <service name="HelloWorldService.HelloWorldService">        <endpoint address="" binding="netTcpBinding"         contract="HelloWorldService.IHelloWorldService"/>        <host>          <baseAddresses>            <add baseAddress=            "net.tcp://localhost/HelloWorldServiceTcp/"/>          </baseAddresses>        </host>      </service>    </services> </system.serviceModel>   </configuration> Enabling the TCP WCF activation for the host machine By default, the TCP WCF activation service is not enabled on your machine. This means your IIS server won't be able to host a WCF service with the TCP protocol. You can follow these steps to enable the TCP activation for WCF services: Go to Control Panel | Programs | Turn Windows features on or off. Expand the Microsoft .Net Framework 3.5.1 node on Windows 7 or .Net Framework 4.5 Advanced Services on Windows 8. Check the checkbox for Windows Communication Foundation Non-HTTP Activation on Windows 7 or TCP Activation on Windows 8. The following screenshot depicts the options required to enable WCF activation on Windows 7: The following screenshot depicts the options required to enable TCP WCF activation on Windows 8: Repair the .NET Framework: After you have turned on the TCP WCF activation, you have to repair .NET. Just go to Control Panel, click on Uninstall a Program, select Microsoft .NET Framework 4.5.1, and then click on Repair. Creating the IIS application Next, we need to create an IIS application named HelloWorldServiceTcp to host the WCF service, using the TCP protocol. Follow these steps to create this application in IIS: Open IIS Manager. Add a new IIS application, HelloWorldServiceTcp, pointing to the HostIISTcp physical folder under your project's folder. 
Choose DefaultAppPool as the application pool for the new application. Again, make sure your default app pool is a .NET 4.0.30319 application pool. Enable the TCP protocol for the application. Right-click on HelloWorldServiceTcp, select Manage Application | Advanced Settings, and then add net.tcp to Enabled Protocols. Make sure you use all lowercase letters and separate it from the existing HTTP protocol with a comma. Now the service is hosted in IIS using the TCP protocol. To view the WSDL of the service, browse to http://localhost/HelloWorldServiceTcp/HelloWorldService.svc and you should see the service description and a link to the WSDL of the service. Testing the WCF service hosted in IIS using the TCP protocol Now, we have the service hosted in IIS using the TCP protocol; let's create a new test client to test it: Add a new console application project to the solution, named HelloWorldClientTcp. Add a reference to System.ServiceModel in the new project. Add a service reference to the WCF service in the new project, naming the reference HelloWorldServiceRef and use the URL http://localhost/HelloWorldServiceTcp/HelloWorldService.svc?wsdl. You can still use the SvcUtil.exe command-line tool to generate the proxy and config files for the service hosted with TCP, just as we did in previous sections. Actually, behind the scenes Visual Studio is also calling SvcUtil.exe to generate the proxy and config files. Add the following code to the Main method of the new project: var client = new HelloWorldServiceRef.HelloWorldServiceClient (); Console.WriteLine(client.GetMessage("Mike Liu")); Finally, set the new project as the startup project. Now, if you run the program, you will get the same result as before; however, this time the service is hosted in IIS using the TCP protocol. Summary In this article, we created and tested an IIS application to host the service with the TCP protocol. Resources for Article: Further resources on this subject: Microsoft WCF Hosting and Configuration [Article] Testing and Debugging Windows Workflow Foundation 4.0 (WF) Program [Article] Applying LINQ to Entities to a WCF Service [Article]

Data visualization

Packt
27 Oct 2014
8 min read
Data visualization is one of the most important tasks in data science track. Through effective visualization we can easily uncover underlying pattern among variables with doing any sophisticated statistical analysis. In this cookbook we have focused on graphical analysis using R in a very simple way with each independent example. We have covered default R functionality along with more advance visualization techniques such as lattice, ggplot2, and three-dimensional plots. Readers will not only learn the code to produce the graph but also learn why certain code has been written with specific examples. R Graphs Cookbook Second Edition written by Jaynal Abedin and Hrishi V. Mittal is such a book where the user will learn how to produce various graphs using R and how to customize them and finally how to make ready for publication. This practical recipe book starts with very brief description about R graphics system and then gradually goes through basic to advance plots with examples. Beside the R default graphics this recipe book introduces advance graphic system such as lattice and ggplot2; the grammar of graphics. We have also provided examples on how to inspect large dataset using advanced visualization such as tableplot and three dimensional visualizations. We also cover the following topics: How to create various types of bar charts using default R functions, lattice and ggplot2 How to produce density plots along with histograms using lattice and ggplot2 and customized them for publication How to produce graphs of frequency tabulated data How to inspect large dataset by simultaneously visualizing numeric and categorical variables in a single plot How to annotate graphs using ggplot2 (For more resources related to this topic, see here.) This recipe book is targeted to those reader groups who already exposed to R programming and want to learn effective graphics with the power of R and its various libraries. This hands-on guide starts with very short introduction to R graphics system and then gets straight to the point – actually creating graphs, instead of just theoretical learning. Each recipe is specifically tailored to full fill reader’s appetite for visually representing the data in the best way possible. Now, we will present few examples so that you can have an idea about the content of this recipe book: The ggplot2 R package is based on The Grammar of Graphics by Leland Wilkinson, Springer). Using this package, we can produce a variety of traditional graphics, and the user can produce their customized graphs as well. The beauty of this package is in its layered graphics facilities; through the use of layered graphics utilities, we can produce almost any kind of data visualization. Recently, ggplot2 is the most searched keyword in the R community, including the most popular R blog (www.r-bloggers.com). The comprehensive theme system allows the user to produce publication quality graphs with a variety of themes of choice. If we want to explain this package in a single sentence, then we can say that if whatever we can think about data visualization can be structured in a data frame, the visualization is a matter of few seconds. In the specific chapter on ggplot2 , we will see different examples and use themes to produce publication quality graphs. However, in this introductory chapter, we will show you one of the important features of the ggplot2 package that produces various types of graphs. 
The main function is ggplot(), but with the help of a different geom function, we can easily produce different types of graphs, such as the following: geom_point(): This will create scatter plot geom_line(): This will create a line chart geom_bar(): This will create a bar chart geom_boxplot(): This will create a box plot geom_text(): This will write certain text inside the plot area Now, we will see a simple example of the use of different geom functions with the default R mtcars dataset: # loading ggplot2 library library(ggplot2) # creating a basic ggplot object p <- ggplot(data=mtcars) # Creating scatter plot of mpg and disp variable p1 <- p+geom_point(aes(x=disp,y=mpg)) # creating line chart from the same ggplot object but different # geom function p2 <- p+geom_line(aes(x=disp,y=mpg)) # creating bar chart of mpg variable p3 <- p+geom_bar(aes(x=mpg)) # creating boxplot of mpg over gear p4 <- p+geom_boxplot(aes(x=factor(gear),y=mpg)) # writing certain text into the scatter plot p5 <- p1+geom_text(x=200,y=25,label="Scatter plot") The visualization of the preceding five plot will look like the following figure: Visualizing an empirical Cumulative Distribution function The empirical Cumulative Distribution function (CDF) is the non-parametric maximum-likelihood estimation of the CDF. In this recipe, we will see how the empirical CDF can be produced. Getting ready To produce this plot, we need to use the latticeExtra library. We will use the simulated dataset as shown in the following code: # Set a seed value to make the data reproducible set.seed(12345) qqdata <-data.frame(disA=rnorm(n=100,mean=20,sd=3),                disB=rnorm(n=100,mean=25,sd=4),                disC=rnorm(n=100,mean=15,sd=1.5),                age=sample((c(1,2,3,4)),size=100,replace=T),                sex=sample(c("Male","Female"),size=100,replace=T),                 econ_status=sample(c("Poor","Middle","Rich"),                size=100,replace=T)) How to do it… To plot an empirical CDF, we first need to call the latticeExtra library (note that this library has a dependency on RColorBrewer). Now, to plot the empirical CDF, we can use the following simple code: library(latticeExtra) ecdfplot(~disA|sex,data=qqdata) Graph annotation with ggplot To produce publication-quality data visualization, we often need to annotate the graph with various texts, symbols, or even shapes. In this recipe, we will see how we can easily annotate an existing graph. Getting ready In this recipe, we will use the disA and disD variables from ggplotdata. Let's call ggplotdata for this recipe. We also need to call the grid and gridExtra libraries for this recipe. How to do it... In this recipe, we will execute the following annotation on an existing scatter plot. 
So, the whole procedure will be as follows:
Create a scatter plot
Add customized text within the plot
Highlight a certain region to indicate extreme values
Draw a line segment with an arrow within the scatter plot to indicate a single extreme observation
Now, we will implement each of the steps one by one:
library(grid)
library(gridExtra)
# creating the scatter plot and printing it
annotation_obj <- ggplot(data=ggplotdata, aes(x=disA, y=disD)) + geom_point()
annotation_obj
# adding custom text at the (18,29) position
annotation_obj1 <- annotation_obj + annotate(geom="text", x=18, y=29, label="Extreme value", size=3)
annotation_obj1
# highlighting a certain region with a box
annotation_obj2 <- annotation_obj1 + annotate("rect", xmin=24, xmax=27, ymin=17, ymax=22, alpha=.2)
annotation_obj2
# drawing a line segment with an arrow
annotation_obj3 <- annotation_obj2 + annotate("segment", x=16, xend=17.5, y=25, yend=27.5, colour="red",
    arrow=arrow(length=unit(0.5, "cm")), size=2)
annotation_obj3
The preceding four steps are displayed in the following single graph:
How it works...
The annotate() function takes a geom as input, such as "text", "rect", or "segment", and then takes further inputs that specify where that geom should be drawn or placed. In this particular recipe, we used three geom instances: text to write customized text within the plot, rect to highlight a certain region of the plot, and segment to draw an arrow. The alpha argument represents the transparency of the highlighted region, and the size argument represents the size of the text and the line width of the line segment.
Summary
This article gives just a sample of the kind of recipes included in the book and shows how each recipe is structured.
Resources for Article:
Further resources on this subject:
Using R for Statistics, Research, and Graphics [Article]
First steps with R [Article]
Aspects of Data Manipulation in R [Article]

The EMR Architecture

Packt
27 Oct 2014
6 min read
This article is written by Amarkant Singh and Vijay Rayapati, the authors of Learning Big Data with Amazon Elastic MapReduce. The goal of this article is to introduce you to the EMR architecture and EMR use cases. (For more resources related to this topic, see here.)
Traditionally, very few companies had access to large-scale infrastructure to build Big Data applications. However, cloud computing has democratized access to infrastructure, allowing developers and companies to quickly run new experiments without worrying about setting up or scaling infrastructure. A cloud provides an infrastructure-as-a-service platform that allows businesses to build applications and host them reliably on scalable infrastructure. It includes a variety of application-level services to help developers accelerate their development and deployment times. Amazon EMR is one of the hosted services provided by AWS and is built on top of scalable AWS infrastructure for building Big Data applications.
The EMR architecture
Let's get familiar with EMR. This section outlines its key concepts.
Hadoop offers distributed processing by using the MapReduce framework for the execution of tasks on a set of servers or compute nodes (also known as a cluster). One of the nodes in the Hadoop cluster controls the distribution of tasks to the other nodes and is called the Master Node. The nodes executing the tasks using MapReduce are called Slave Nodes:
Amazon EMR is designed to work with many other AWS services, such as S3 for input/output data storage, and DynamoDB and Redshift for output data. EMR uses AWS CloudWatch metrics to monitor cluster performance and raise notifications for user-specified alarms. We can create on-demand Hadoop clusters using EMR while storing the input and output data in S3, without worrying about managing a 24*7 cluster or HDFS for data storage. The Amazon EMR job flow is shown in the following diagram:
Types of nodes
Amazon EMR provides three different roles for the servers or nodes in the cluster, and they map to the Hadoop roles of master and slave nodes. When you create an EMR cluster, it is called a job flow, and it executes a set of jobs or job steps one after the other:
Master node: This node controls and manages the cluster. It distributes the MapReduce tasks to the nodes in the cluster and monitors the status of task execution. Every EMR cluster has exactly one master node, in a master instance group.
Core nodes: These nodes execute MapReduce tasks and provide HDFS for storing the data related to task execution. The EMR cluster has core nodes as part of a core instance group. The core node corresponds to the slave node in Hadoop. So, basically, these nodes have a two-fold responsibility: the first is to execute the map and reduce tasks allocated by the master, and the second is to hold the data blocks.
Task nodes: These nodes are used only for MapReduce task execution, and they are optional when launching the EMR cluster. The task node corresponds to the slave node in Hadoop and is part of a task instance group in EMR.
When you scale down your clusters, you cannot remove any core nodes. This is because EMR does not want to let you lose your data blocks. You can remove nodes from a task group while scaling down your cluster. You should also use only task instance groups for spot instances, as spot instances can be taken away as per your bid price, and you would not want to lose your data blocks.
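To see how the three instance groups fit together in practice, the following is a minimal sketch using boto3, the current AWS SDK for Python (which postdates this article). The bucket, key pair, release label, instance types, and bid price are placeholders, not values from the article:
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-job-flow",
    LogUri="s3://my-emr-logs/",                 # hypothetical log bucket
    ReleaseLabel="emr-5.36.0",                  # any available EMR release
    Applications=[{"Name": "Hadoop"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Instances={
        "Ec2KeyName": "my-key-pair",            # hypothetical key pair
        "KeepJobFlowAliveWhenNoSteps": False,
        "InstanceGroups": [
            # Exactly one master node per cluster
            {"Name": "master", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m4.large",
             "InstanceCount": 1},
            # Core nodes run tasks and hold HDFS blocks, so keep them on-demand
            {"Name": "core", "InstanceRole": "CORE",
             "Market": "ON_DEMAND", "InstanceType": "m4.large",
             "InstanceCount": 2},
            # Task nodes only run tasks, so spot instances are safe here
            {"Name": "task", "InstanceRole": "TASK",
             "Market": "SPOT", "BidPrice": "0.10",
             "InstanceType": "m4.large", "InstanceCount": 2},
        ],
    },
)
print("Started cluster:", response["JobFlowId"])
Scaling the cluster later is a matter of calling modify_instance_groups on the task instance group's ID; as noted above, the core group is the one you should avoid shrinking once it holds data.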
You can launch a cluster having just one node, that is, with just one master node and no other nodes. In that case, the same node acts as both the master and core node. For simplicity, you can think of a node as an EC2 server in EMR.
EMR use cases
Amazon EMR can be used to build a variety of applications such as recommendation engines, data analysis, log processing, event/click stream analysis, data transformations (ETL), fraud detection, scientific simulations, genomics, financial analysis, or data correlation in various industries. The following section outlines some of the use cases in detail.
Web log processing
We can use EMR to process logs to understand the usage of content such as videos and file downloads, the top web URLs accessed by end users, user consumption from different parts of the world, and more. We can process any web or mobile application logs using EMR to extract insights that are relevant for your business. We can move all our web access, application, or mobile logs to Amazon S3 for analysis using EMR, even if we are not using AWS to run our production applications.
Clickstream analysis
By using clickstream analysis, we can segment users into different groups and understand their behavior with respect to advertisements or application usage. Ad networks or advertisers can perform clickstream analysis on ad-impression logs to deliver more effective campaigns or advertisements to end users. Reports generated from this analysis can include various metrics such as source traffic distribution, purchase funnel, lead source ROI, and abandoned carts, among others.
Product recommendation engine
Recommendation engines can be built using EMR for e-commerce, retail, or web businesses. Many e-commerce businesses have a large inventory of products across different categories and regularly add new products or categories. It can be very difficult for end users to search for and identify the right products quickly. With recommendation engines, we can help end users quickly find relevant products, or suggest products based on what they are viewing, and so on. We may also want to notify users via e-mail based on their past purchase behavior.
Scientific simulations
When you need distributed processing with large-scale infrastructure for scientific or research simulations, EMR can be of great help. We can quickly launch large clusters in a matter of minutes and install specific MapReduce programs for analysis using EMR. AWS also offers genomics datasets for free on S3.
Data transformations
We can perform complex extract, transform, and load (ETL) processes using EMR for either data analysis or data warehousing needs. It can be as simple as transforming XML data into JSON data for further usage, or moving all the financial transaction records of a bank into a common date-time format for archiving purposes. You can also use EMR to move data between different systems in AWS, such as DynamoDB, Redshift, and S3.
Summary
In this article, we learned about the EMR architecture. We looked at the various EMR node types and their roles in detail.
Resources for Article:
Further resources on this subject:
Introduction to MapReduce [Article]
Understanding MapReduce [Article]
HDFS and MapReduce [Article]

Clustering with K-Means

Packt
27 Oct 2014
9 min read
In this article by Gavin Hackeling, the author of Mastering Machine Learning with scikit-learn, we will discuss an unsupervised learning task called clustering. Clustering is used to find groups of similar observations within a set of unlabeled data. We will discuss the K-Means clustering algorithm, apply it to an image compression problem, and learn to measure its performance. Finally, we will work through a semi-supervised learning problem that combines clustering with classification.
Clustering, or cluster analysis, is the task of grouping observations such that members of the same group, or cluster, are more similar to each other by some metric than they are to the members of the other clusters. As with supervised learning, we will represent an observation as an n-dimensional vector. For example, assume that your training data consists of the samples plotted in the following figure:
Clustering might reveal the following two groups, indicated by squares and circles.
Clustering could also reveal the following four groups:
Clustering is commonly used to explore a data set. Social networks can be clustered to identify communities, and to suggest missing connections between people. In biology, clustering is used to find groups of genes with similar expression patterns. Recommendation systems sometimes employ clustering to identify products or media that might appeal to a user. In marketing, clustering is used to find segments of similar consumers. In the following sections, we will work through an example of using the K-Means algorithm to cluster a data set.
Clustering with the K-Means Algorithm
The K-Means algorithm is a clustering method that is popular because of its speed and scalability. K-Means is an iterative process of moving the centers of the clusters, or the centroids, to the mean position of their constituent points, and re-assigning instances to their closest clusters. The titular K is a hyperparameter that specifies the number of clusters that should be created; K-Means automatically assigns observations to clusters but cannot determine the appropriate number of clusters. K must be a positive integer that is less than the number of instances in the training set. Sometimes the number of clusters is specified by the clustering problem's context. For example, a company that manufactures shoes might know that it is able to support manufacturing three new models. To understand what groups of customers to target with each model, it surveys customers and creates three clusters from the results. That is, the value of K was specified by the problem's context. Other problems may not require a specific number of clusters, and the optimal number of clusters may be ambiguous. We will discuss a heuristic for estimating the optimal number of clusters called the elbow method later in this article.
The parameters of K-Means are the positions of the clusters' centroids and the observations that are assigned to each cluster. Like generalized linear models and decision trees, the optimal values of K-Means' parameters are found by minimizing a cost function. The cost function for K-Means is given by the following equation:
J = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2
where \mu_k is the centroid for cluster k and C_k is the set of instances assigned to that cluster. The cost function sums the distortions of the clusters. Each cluster's distortion is equal to the sum of the squared distances between its centroid and its constituent instances. The distortion is small for compact clusters, and large for clusters that contain scattered instances.
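To make the cost function concrete, the following is a small sketch (not from the book) that computes the total distortion for a given set of centroids and cluster assignments using NumPy; the toy points and labels are made up for illustration:
import numpy as np

def kmeans_cost(X, centroids, labels):
    # Sum of squared distances from each point to its assigned centroid
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

# Two compact toy clusters around (0, 0) and (5, 5)
X = np.array([[0.0, 0.0], [0.5, 0.2], [5.0, 5.0], [5.2, 4.8]])
centroids = np.array([[0.25, 0.1], [5.1, 4.9]])
labels = np.array([0, 0, 1, 1])

print(kmeans_cost(X, centroids, labels))  # small value, because the clusters are compact
Moving either centroid away from its points, or assigning a point to the wrong cluster, increases this value, which is exactly what the iterative procedure described next tries to avoid.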
The parameters that minimize the cost function are learned through an iterative process of assigning observations to clusters and then moving the clusters. First, the clusters' centroids are initialized to random positions. In practice, setting the centroids' positions equal to the positions of randomly selected observations yields the best results. During each iteration, K-Means assigns observations to the cluster that they are closest to, and then moves the centroids to their assigned observations' mean location. Let's work through an example by hand using the training data shown in the following table.

Instance  X0  X1
1         7   5
2         5   7
3         7   7
4         3   3
5         4   6
6         1   4
7         0   0
8         2   2
9         8   7
10        6   8
11        5   5
12        3   7

There are two explanatory variables; each instance has two features. The instances are plotted in the following figure. Assume that K-Means initializes the centroid for the first cluster to the fifth instance and the centroid for the second cluster to the eleventh instance. For each instance, we will calculate its distance to both centroids, and assign it to the cluster with the closest centroid. The initial assignments are shown in the "Cluster" column of the following table.

Instance     X0  X1  C1 distance  C2 distance  Last Cluster  Cluster  Changed?
1            7   5   3.16228      2            None          C2       Yes
2            5   7   1.41421      2            None          C1       Yes
3            7   7   3.16228      2.82843      None          C2       Yes
4            3   3   3.16228      2.82843      None          C2       Yes
5            4   6   0            1.41421      None          C1       Yes
6            1   4   3.60555      4.12311      None          C1       Yes
7            0   0   7.21110      7.07107      None          C2       Yes
8            2   2   4.47214      4.24264      None          C2       Yes
9            8   7   4.12311      3.60555      None          C2       Yes
10           6   8   2.82843      3.16228      None          C1       Yes
11           5   5   1.41421      0            None          C2       Yes
12           3   7   1.41421      2.82843      None          C1       Yes
C1 centroid  4   6
C2 centroid  5   5

The plotted centroids and the initial cluster assignments are shown in the following graph. Instances assigned to the first cluster are marked with Xs, and instances assigned to the second cluster are marked with dots. The markers for the centroids are larger than the markers for the instances. Now we will move both centroids to the means of their constituent instances, re-calculate the distances of the training instances to the centroids, and re-assign the instances to the closest centroids.

Instance     X0        X1        C1 distance  C2 distance  Last Cluster  New Cluster  Changed?
1            7         5         3.492850     2.575394     C2            C2           No
2            5         7         1.341641     2.889107     C1            C1           No
3            7         7         3.255764     3.749830     C2            C1           Yes
4            3         3         3.492850     1.943067     C2            C2           No
5            4         6         0.447214     1.943067     C1            C1           No
6            1         4         3.687818     3.574285     C1            C2           Yes
7            0         0         7.443118     6.169378     C2            C2           No
8            2         2         4.753946     3.347250     C2            C2           No
9            8         7         4.242641     4.463000     C2            C1           Yes
10           6         8         2.720294     4.113194     C1            C1           No
11           5         5         1.843909     0.958315     C2            C2           No
12           3         7         1            3.260775     C1            C1           No
C1 centroid  3.8       6.4
C2 centroid  4.571429  4.142857

The new clusters are plotted in the following graph. Note that the centroids are diverging, and several instances have changed their assignments. Now we will move the centroids to the means of their constituents' locations again, and re-assign the instances to their nearest centroids. The centroids continue to diverge, as shown in the following figure. None of the instances' centroid assignments will change in the next iteration; K-Means will continue iterating until some stopping criterion is satisfied. Usually, this criterion is either a threshold for the difference between the values of the cost function for subsequent iterations, or a threshold for the change in the positions of the centroids between subsequent iterations. If these thresholds are small enough, K-Means will converge on an optimum.
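If you want to check the hand-worked iterations programmatically, here is a short sketch (again, not from the book) that repeats the assignment and update steps on the same twelve instances, starting from the same initial centroids (the fifth and eleventh instances):
import numpy as np

X = np.array([[7, 5], [5, 7], [7, 7], [3, 3], [4, 6], [1, 4],
              [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7]], dtype=float)

# Initial centroids: instance 5 for C1 and instance 11 for C2
centroids = np.array([X[4], X[10]])

for iteration in range(3):
    # Assignment step: distance from every instance to every centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Update step: move each centroid to the mean of its assigned instances
    centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    print("Iteration", iteration + 1)
    print("  assignments:", ["C1" if lab == 0 else "C2" for lab in labels])
    print("  centroids:\n", centroids)
After the first update, the centroids land at (3.8, 6.4) and (4.571429, 4.142857), matching the table above, and the assignments stop changing after a couple of iterations.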
This optimum will not necessarily be the global optimum.
Local Optima
Recall that K-Means initially sets the positions of the clusters' centroids to the positions of randomly selected observations. Sometimes the random initialization is unlucky, and the centroids are set to positions that cause K-Means to converge to a local optimum. For example, assume that K-Means randomly initializes two cluster centroids to the following positions:
K-Means will eventually converge on a local optimum like that shown in the following figure. These clusters may be informative, but it is more likely that the top and bottom groups of observations are more informative clusters. To avoid local optima, K-Means is often repeated dozens or hundreds of times. In each repetition, it is randomly initialized to different starting cluster positions. The initialization that minimizes the cost function best is selected.
The Elbow Method
If K is not specified by the problem's context, the optimal number of clusters can be estimated using a technique called the elbow method. The elbow method plots the value of the cost function produced by different values of K. As K increases, the average distortion will decrease; each cluster will have fewer constituent instances, and the instances will be closer to their respective centroids. However, the improvements to the average distortion will decline as K increases. The value of K at which the improvement to the distortion declines the most is called the elbow.
Let's use the elbow method to choose the number of clusters for a data set. The following scatter plot visualizes a data set with two obvious clusters. We will calculate and plot the mean distortion of the clusters for each value of K from one to ten with the following:
>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from scipy.spatial.distance import cdist
>>> import matplotlib.pyplot as plt
>>> # two clusters of ten points each, drawn from uniform distributions
>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
>>> X = np.hstack((cluster1, cluster2)).T
>>> K = range(1, 11)
>>> meandistortions = []
>>> for k in K:
...     kmeans = KMeans(n_clusters=k)
...     kmeans.fit(X)
...     meandistortions.append(sum(np.min(cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
>>> plt.plot(K, meandistortions, 'bx-')
>>> plt.xlabel('k')
>>> plt.ylabel('Average distortion')
>>> plt.title('Selecting k with the Elbow Method')
>>> plt.show()
The average distortion improves rapidly as we increase K from one to two. There is little improvement for values of K greater than two. Now let's use the elbow method on the following data set with three clusters:
The following is the elbow plot for the data set. From this we can see that the rate of improvement to the average distortion declines the most when adding a fourth cluster. That is, the elbow method confirms that K should be set to three for this data set.
Summary
In this article, we explained what clustering is, worked through the K-Means algorithm by hand, and saw how the elbow method can be used to choose K.
Resources for Article:
Further resources on this subject:
Machine Learning in IPython with scikit-learn [Article]
Machine Learning in Bioinformatics [Article]
Specialized Machine Learning Topics [Article]