
How-To Tutorials - Data

SAP HANA Architecture

Packt
20 Dec 2013
12 min read
Understanding the SAP HANA architecture

Architecture is the key to SAP HANA being a game-changing, innovative technology. SAP HANA is architected so well that it stands apart from the other traditional databases available today. This section explains the various components of SAP HANA and their functions.

Getting ready

Enterprise application requirements have become more demanding: complex reports with heavy computation over huge volumes of transaction data, as well as business data in other formats (both structured and semi-structured). Data is written or updated and read from the database in parallel. This calls for the integration of transactional and analytical data into a single database, which is where SAP HANA has evolved. Columnar storage exploits modern hardware and technology (multiple CPU cores, large main memory, and caches) to meet the requirements of enterprise applications. Beyond this, the database must also support procedural logic, because certain tasks cannot be completed with simple SQL.

How it works...

The SAP HANA database consists of several services (servers). The index server is the most important of them; the others are the name server, preprocessor server, statistics server, and XS Engine:

- Index server: This server holds the actual data and the engines for processing it. When SQL or MDX is fired against the SAP HANA system in an authenticated session and transaction, the index server takes care of these commands and processes them.
- Name server: This server holds complete information about the system landscape and is responsible for the topology of the SAP HANA system. In a distributed system, SAP HANA instances run on multiple hosts; in this kind of setup, the name server knows where the components are running and how data is spread across the different servers.
- Preprocessor server: This server comes into the picture during text data analysis. The index server uses the capabilities of the preprocessor server for text analysis and searching, which helps extract the information on which text search capabilities are based.
- Statistics server: This server collects the data for the system monitor and helps you know the health of the SAP HANA system. It is responsible for collecting data related to the status, resource allocation/consumption, and performance of the SAP HANA system. Monitoring clients and the various alert monitors use the data collected by the statistics server, which also provides a history of measurement data for further analysis.
- XS Engine: The XS Engine allows external applications and application developers to access the SAP HANA system via XS Engine clients; for example, a web browser accesses SAP HANA apps built by application developers via HTTP. Application developers build applications using the XS Engine, and users access them over HTTP with a web browser. The persistent model in the SAP HANA database is converted into a consumption model for clients to access via HTTP. This allows an organization to host system services that are part of the SAP HANA database (for example, the Search service, a built-in web server that provides access to static content in the repository).

The following diagram shows the architecture of SAP HANA.

There's more...
Let us continue with the remaining components:

- SAP Host Agent: According to the new approach from SAP, the SAP Host Agent should be installed on all machines that are part of the SAP landscape. It is used by the Adaptive Computing Controller (ACC) to manage the system and by the Software Update Manager (SUM) for automatic updates.
- LM-structure: The LM-structure for SAP HANA contains information about the current installation details. This information is used by SUM during automatic updates.
- SAP Solution Manager diagnostic agent: This agent provides the data that SAP Solution Manager (SAP SOLMAN) needs to monitor the SAP HANA system. Once SAP SOLMAN is integrated with the SAP HANA system, the agent provides information about the database at a glance, including the database state and general system information such as alerts, CPU, memory, and disk usage.
- SAP HANA Studio repository: This helps end users update SAP HANA Studio to higher versions; the SAP HANA Studio repository is the code that performs this process.
- Software Update Manager for SAP HANA: This enables automatic updates of SAP HANA from the SAP Marketplace and patching of the SAP Host Agent. It also distributes the Studio repository to end users.

See also

http://help.sap.com/hana/SAP_HANA_Installation_Guide_en.pdf
SAP Notes 1793303 and 1514967

Explaining IMCE and its components

We have seen the architecture of SAP HANA and its components. In this section, we will learn about the in-memory computing engine (IMCE), its components, and their functions.

Getting ready

The SAP in-memory computing engine (formerly the Business Analytic Engine, BAE) is the core engine for SAP's next-generation, high-performance, in-memory solutions. It leverages technologies such as in-memory computing, columnar databases, massively parallel processing (MPP), and data compression to allow organizations to instantly explore and analyze large volumes of transactional and analytical data from across the enterprise in real time.

How it works...

In-memory computing allows the processing of massive quantities of real-time data in the main memory of the server, providing immediate results from analyses and transactions. The SAP in-memory computing database delivers the following capabilities:

- In-memory computing functionality with native support for row and columnar data stores, providing full ACID (atomicity, consistency, isolation, and durability) transactional capabilities
- Integrated lifecycle management and data integration capabilities to access SAP and non-SAP data sources
- SAP IMCE Studio, which includes tools for data modeling, data and lifecycle management, and data security

The SAP IMCE that resides at the heart of SAP HANA is an integrated database and calculation layer that allows the processing of massive quantities of real-time data in main memory to provide immediate results from analyses and transactions. Like any standard database, the SAP IMCE supports industry standards such as SQL and MDX, but it also incorporates a high-performance calculation engine that embeds procedural language support directly into the database kernel. This approach is designed to remove the need to read data from the database, process it in the application, and then write it back; instead, the data is processed near the database and only the results are returned. The IMCE is an in-memory, column-oriented database technology and a powerful calculation engine at the heart of SAP HANA.
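To make the idea of processing data near the database concrete, the following is a minimal SQLScript sketch. The procedure, table, and column names (GET_REGION_TOTALS, SALES_ORDERS, REGION, AMOUNT) are invented for illustration, and the exact parameter and type syntax varies between HANA revisions, so treat this as a sketch rather than the product's canonical example:

    -- Hedged sketch: aggregate inside the database and return only the result set.
    -- SALES_ORDERS, REGION, and AMOUNT are assumed example objects.
    CREATE PROCEDURE GET_REGION_TOTALS ()
        LANGUAGE SQLSCRIPT
        READS SQL DATA
    AS
    BEGIN
        -- lt_totals is a SQLScript table variable; the aggregation runs in the engine
        lt_totals = SELECT REGION, SUM(AMOUNT) AS TOTAL_AMOUNT
                    FROM SALES_ORDERS
                    GROUP BY REGION;
        -- selecting from the table variable exposes it as the procedure's result set
        SELECT * FROM :lt_totals;
    END;

Calling the procedure (for example, CALL GET_REGION_TOTALS()) returns the aggregated rows without the application ever pulling the raw order lines out of the database, which is the "process near the data" idea described above.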
As data resides in Random Access Memory (RAM), highly accelerated performance can be achieved compared to systems that read data from disk. The heart of the system is the IMCE, which allows us to create and perform calculations on data. SAP IMCE Studio includes tools for data modeling, data and lifecycle management, and data security. The following diagram shows the components of the IMCE.

There's more...

The SAP HANA database has the following two engines:

- Column-based store: This engine stores huge amounts of relational data in column-optimized tables, which are aggregated and used in analytical operations.
- Row-based store: This engine stores relational data in rows, similar to the storage mechanism of traditional database systems. The row store is better optimized for write operations, but it has a lower compression rate and its query performance is lower than that of the column-based store.

The engine used to store data can be selected on a per-table basis when the table is created. Tables in the row-based store are loaded at startup time; tables in the column-based store can be loaded either at startup or on demand, that is, during normal operation of the SAP HANA database.

Both engines share a common persistence layer, which provides data persistency that is consistent across both engines. As in a traditional database, SAP HANA has page management and logging. Changes made to the in-memory database pages are persisted through savepoints, which are written to the data volumes on persistent storage, where the storage medium is typically hard drives. All transactions committed in the SAP HANA database are recorded by the logger of the persistence layer in a log entry written to the log volumes on persistent storage. To achieve high I/O performance and low latency, the log volumes use flash storage.

The relational engines can be accessed through a variety of interfaces: the SAP HANA database supports SQL (JDBC/ODBC), MDX (ODBO), and BICS (SQLDBC). The calculation engine performs all calculations in the database; no data moves into the application layer until the calculations are completed. It also contains the business functions library, which applications call to perform calculations based on business rules and logic. The SAP HANA-specific SQLScript language is an extension of SQL that can be used to push data-intensive application logic down into the SAP HANA database for specific requirements.

Session management

This component creates and manages sessions and connections for the database clients. When a session is created, a set of parameters is maintained, such as the auto-commit setting and the current transaction isolation level. After establishing a session, database clients communicate with the SAP HANA database using SQL statements. The SAP HANA database treats all statements as transactions while processing them, and each new session is assigned to a new transaction.

Transaction manager

The transaction manager coordinates database transactions, controls transaction isolation, and keeps track of running and closed transactions. When a transaction is committed or rolled back, the transaction manager informs the involved storage engines so that they can execute the necessary actions.
The transaction manager cooperates with the persistence layer to achieve atomic and durable transactions.

Client requests are analyzed and executed by a set of components summarized as request processing and execution control. A request parser analyzes the client request and dispatches it to the responsible component: transaction control statements are forwarded to the transaction manager, data definition statements to the metadata manager, object invocations to the object store, and data manipulation statements to the optimizer, which creates an optimized execution plan that is handed to the execution layer.

The SAP HANA database also has built-in support for domain-specific models (for example, for the financial planning domain) and offers scripting capabilities that allow application-specific calculations to run inside the database. Its own scripting language, SQLScript, is designed to enable optimization and parallelization. SQLScript is based on side-effect-free functions that operate on tables using SQL queries for set processing. The SAP HANA database also contains a component called the planning engine that allows financial planning applications to execute basic planning operations in the database layer. For example, when applying filters or transformations, a new version of a dataset is created as a copy of an existing one. Another example of a planning operation is disaggregation, in which target values are distributed from higher to lower aggregation levels based on a distribution function.

Metadata manager

The metadata manager provides access to metadata. The SAP HANA database's metadata consists of a variety of objects, such as definitions of tables, views, and indexes, SQLScript function definitions, and object store metadata. All of these types of metadata are stored in one common catalog for all SAP HANA database stores. Metadata is stored in tables in the row store, and SAP HANA features such as transaction support and multi-version concurrency control (MVCC) are also used for metadata management. In a distributed database system, central metadata is shared across the servers; the mechanism for storing and sharing metadata is hidden from the components that use the metadata manager.

Because row-based and columnar tables can be combined in one SQL statement, both the row and column engines must be able to consume each other's intermediate results. The main difference between the two engines is the way they process data: row store operators process data one row at a time, whereas column store operations (such as scan and aggregate) require the entire column to be available in contiguous memory locations. To exchange intermediate results, the row store materializes its results as complete rows in memory for the column store, while the column store can expose its results using the iterator interface needed by the row store.

Persistence layer

The persistence layer is responsible for the durability and atomicity of transactions. It ensures that the database is restored to the most recent committed state after a restart and that transactions are either completely executed or completely rolled back. To achieve this efficiently, the persistence layer uses a combination of write-ahead logs, shadow paging, and savepoints. It also offers interfaces for writing and reading data.
It also contains SAP HANA's logger, which manages the transaction log.

Authorization manager

The authorization manager is invoked by other SAP HANA database components to check whether a user has the privileges required to execute the requested operations. Privileges can be granted to other users or roles. A privilege grants the right to perform a specified operation (such as create, update, select, and execute) on a specified object, such as a table, view, or SQLScript function. Analytic privileges represent filters or hierarchy drill-down limitations for analytic queries; for example, SAP HANA supports analytic privileges that grant access to values with a certain combination of dimension attributes. Users are authenticated either by the SAP HANA database itself (login with username and password), or authentication can be delegated to an external third-party provider such as an LDAP directory.

See also

SAP HANA in-memory analytics and in-memory computing, available at http://scn.sap.com/people/vitaliy.rudnytskiy/blog/2011/03/22/time-to-update-your-sap-hana-vocabulary

Summary

This article explained the SAP HANA architecture and the IMCE in brief.

Downloading and Setting Up ElasticSearch

Packt
19 Dec 2013
8 min read
Downloading and installing ElasticSearch

ElasticSearch has an active community and the release cycles are very fast. Because ElasticSearch depends on many common Java libraries (Lucene, Guice, and Jackson are the most famous ones), the ElasticSearch community tries to keep them updated and fix the bugs that are discovered in them and in the ElasticSearch core. If possible, the best practice is to use the latest available release (usually the most stable one).

Getting ready

A supported ElasticSearch operating system (Linux/Mac OS X/Windows) with Java JVM 1.6 or above installed is required. A web browser is required to download the ElasticSearch binary release.

How to do it...

To download and install an ElasticSearch server, we will perform the following steps:

1. Download ElasticSearch from the Web. The latest version is always downloadable from http://www.elasticsearch.org/download/. There are versions available for different operating systems:
   - elasticsearch-{version-number}.zip: This is for both Linux/Mac OS X and Windows operating systems
   - elasticsearch-{version-number}.tar.gz: This is for Linux/Mac
   - elasticsearch-{version-number}.deb: This is for Debian-based Linux distributions (this also covers the Ubuntu family)
   These packages contain everything needed to start ElasticSearch. At the time of writing this book, the latest and most stable version of ElasticSearch was 0.90.7. To check whether this is still the latest available, please visit http://www.elasticsearch.org/download/.
2. Extract the binary content. After downloading the correct release for your platform, the installation consists of expanding the archive into a working directory. Choose a working directory that is safe from charset problems and does not have a long path, to prevent problems when ElasticSearch creates its directories to store the index data. For the Windows platform, a good directory could be c:\es; on Unix and Mac OS X, /opt/es.
3. To run ElasticSearch, you need a Java Virtual Machine 1.6 or above installed. For better performance, I suggest you use the Sun/Oracle 1.7 version.
4. Start ElasticSearch to check that everything is working. To start your ElasticSearch server, just go into the install directory and type:
   # bin/elasticsearch -f (for Linux and Mac OS X)
   or
   # bin\elasticsearch.bat -f (for Windows)
   Now your server should start, as shown in the following screenshot.

How it works...

The ElasticSearch package contains three directories:

- bin: This contains the scripts to start and manage ElasticSearch. The most important ones are:
  - elasticsearch(.bat): This is the main script to start ElasticSearch
  - plugin(.bat): This is a script to manage plugins
- config: This contains the ElasticSearch configuration files. The most important ones are:
  - elasticsearch.yml: This is the main config file for ElasticSearch
  - logging.yml: This is the logging config file
- lib: This contains all the libraries required to run ElasticSearch

There's more...

During ElasticSearch startup, a lot of events happen:

- A node name is chosen automatically (Akenaten in the example) if not provided in elasticsearch.yml.
- A node name hash is generated for this node (that is, whqVp_4zQGCgMvJ1CXhcWQ).
- If there are plugins (internal or site plugins), they are loaded. In the previous example there are no plugins.
- If not configured otherwise, ElasticSearch automatically binds two ports on all available addresses:
  - 9300: internal, intra-node communication, used for discovering other nodes
  - 9200: the HTTP REST API port
- After starting, if indices are available, they are checked and put online so they can be used.

There are more events fired during ElasticSearch startup. We'll see them in detail in other recipes.

Networking setup

Correctly setting up networking is very important for your node and cluster. As there are a lot of different install scenarios and networking issues, in this recipe we will cover two kinds of networking setups:

- Standard installation with a working autodiscovery configuration
- Forced IP configuration, used if it is not possible to use autodiscovery

Getting ready

You need a working ElasticSearch installation, and you need to know your current networking configuration (that is, your IP).

How to do it...

To configure networking, we will perform the following steps:

1. Open the ElasticSearch configuration file with your favorite text editor. Using the standard ElasticSearch configuration file (config/elasticsearch.yml), your node is configured to bind to all your machine's interfaces and performs autodiscovery by broadcasting events; that is, it sends "signals" to every machine on the current LAN and waits for a response. If a node responds, it can join the cluster. If another node is available on the same LAN, it joins the cluster. Only nodes with the same ElasticSearch version and the same cluster name (the cluster.name option in elasticsearch.yml) can join each other.
2. To customize the network preferences, you need to change some parameters in the elasticsearch.yml file, such as:
   cluster.name: elasticsearch
   node.name: "My wonderful server"
   network.host: 192.168.0.1
   discovery.zen.ping.unicast.hosts: ["192.168.0.2","192.168.0.3[9300-9400]"]
   This configuration sets the cluster name to elasticsearch, sets the node name and the network address, and tries to bind the node to the address given in the discovery section. We can check the configuration while the node is loading.
3. We can now start the server and check whether the network is configured:
   [INFO ][node ] [Aparo] version[0.90.3], pid[16792], build[5c38d60/2013-08-06T13:18:31Z]
   [INFO ][node ] [Aparo] initializing ...
   [INFO ][plugins ] [Aparo] loaded [transport-thrift, river-twitter, mapper-attachments, lang-python, jdbc-river, lang-javascript], sites [bigdesk, head]
   [INFO ][node ] [Aparo] initialized
   [INFO ][node ] [Aparo] starting ...
   [INFO ][transport ] [Aparo] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.5:9300]}
   [INFO ][cluster.service] [Aparo] new_master [Angela Cairn] [yJcbdaPTSgS7ATQszgpSow][inet[/192.168.1.5:9300]], reason: zen-disco-join (elected_as_master)
   [INFO ][discovery ] [Aparo] elasticsearch/yJcbdaPTSgS7ATQszgpSow
   [INFO ][http ] [Aparo] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.5:9200]}
   [INFO ][node ] [Aparo] started
   In this case, we have:
   - The transport bound to 0:0:0:0:0:0:0:0:9300 and 192.168.1.5:9300
   - The REST HTTP interface bound to 0:0:0:0:0:0:0:0:9200 and 192.168.1.5:9200

How it works...

It works as follows:

- cluster.name: This sets up the name of the cluster (only nodes with the same cluster name can join).
- node.name: If this is not defined, it is automatically generated by ElasticSearch. It allows you to define a name for the node. If you have a lot of nodes on different machines, it is useful to set this name to something meaningful so you can easily locate a node.
  Using a valid name is easier to remember than a generated name such as whqVp_4zQGCgMvJ1CXhcWQ.
- network.host: This defines the IP address of your machine to be used when binding the node. If your server is on different LANs, or you want to limit the bind to only one LAN, you must set this value to your server's IP.
- discovery.zen.ping.unicast.hosts: This allows you to define a list of hosts (with ports or a port range) to be used to discover other nodes to join the cluster. This setting allows using the node on a LAN where broadcasting is not allowed or autodiscovery is not working (for example, with packet-filtering routers). The referenced port is the transport one, usually 9300. The addresses in the hosts list can be a mix of:
  - host name, that is, myhost1
  - IP address, that is, 192.168.1.2
  - IP address or host name with the port, that is, myhost1:9300 and 192.168.1.2:9300
  - IP address or host name with a range of ports, that is, myhost1:[9300-9400] and 192.168.1.2:[9300-9400]

Setting up a node

ElasticSearch allows you to customize several parameters in an installation. In this recipe, we'll see the most used ones, which define where to store our data and improve general performance.

Getting ready

You need a working ElasticSearch installation.

How to do it...

The steps required for setting up a simple node are as follows:

1. Open the config/elasticsearch.yml file with an editor of your choice.
2. Set up the directories that store your server data:
   path.conf: /opt/data/es/conf
   path.data: /opt/data/es/data1,/opt2/data/data2
   path.work: /opt/data/work
   path.logs: /opt/data/logs
   path.plugins: /opt/data/plugins
3. Set up the parameters that control standard index creation. These parameters are:
   index.number_of_shards: 5
   index.number_of_replicas: 1

How it works...

The path.conf setting defines the directory that contains your configuration files, mainly elasticsearch.yml and logging.yml. The default location is $ES_HOME/config, where ES_HOME is your install directory. It is useful to set up the config directory outside your application directory so you don't need to copy the configuration files every time you update the version or change the ElasticSearch installation directory.

The path.data setting is the most important one: it allows you to define one or more directories where index data is stored. When you define more than one directory, they are managed similarly to a RAID 0 configuration (the total space is the sum of all the data directory entry points), favoring locations with the most free space.

The path.work setting is a location in which ElasticSearch puts temporary files, and path.logs is where the log files are put; how logging is performed is controlled in logging.yml. The path.plugins setting allows overriding the plugins path (the default is $ES_HOME/plugins), which is useful for "system-wide" plugins.

The main parameters used to control indexes and shards are index.number_of_shards, which controls the standard number of shards for a newly created index, and index.number_of_replicas, which controls the initial number of replicas.

There's more...

There are a lot of other parameters that can be used to customize your ElasticSearch installation, and new ones are added with new releases. The most important ones are described in this recipe and in the next one.
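To pull the settings from this recipe and the networking one together, the following is a hedged sketch of a combined config/elasticsearch.yml. All paths, names, and addresses are the example values used above and should be adapted to your own environment:

    # cluster and node identity
    cluster.name: elasticsearch
    node.name: "My wonderful server"

    # networking (see the Networking setup recipe)
    network.host: 192.168.0.1
    discovery.zen.ping.unicast.hosts: ["192.168.0.2", "192.168.0.3[9300-9400]"]

    # storage locations
    path.conf: /opt/data/es/conf
    path.data: /opt/data/es/data1,/opt2/data/data2
    path.work: /opt/data/work
    path.logs: /opt/data/logs
    path.plugins: /opt/data/plugins

    # defaults for newly created indices
    index.number_of_shards: 5
    index.number_of_replicas: 1

After restarting the node, a plain HTTP request to the REST port (for example, curl http://192.168.0.1:9200/) should return basic node information such as its name and version, confirming that the settings were picked up.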

Applied Modeling

Packt
19 Dec 2013
15 min read
This article examines the foundations of tabular modeling (at least from the modeler's point of view). In order to provide a familiar model that can be used later, this article progressively builds the model. Each recipe is intended to demonstrate a particular technique; however, they need to be followed in order so that the final model is completed.

Grouping by binning and sorting with ranks

Often, we want to provide descriptive information in our data based on values that are derived from downstream activities. For example, this could arise if we wish to include a field in the Customer table that shows the value of that customer's previous sales, or a bin grouping that places the customer into banded sales. This value can then be used to rank each customer according to their relative importance in relation to other customers.

This type of value-adding activity is usually troublesome and time intensive in a traditional data mart design, because the customer data can be fully updated only once the sales data has been loaded. While this process may seem relatively straightforward, it is a recursive problem: the customer data must be loaded before the sales data (since customers must exist before sales can be assigned to them), but the full view of the customer relies on loading the sales data. In a standard dimensional (star schema) modeling approach, including this type of information for dimensions requires a three-step process:

1. The dimension (reseller customer data) is updated for known fields and attributes. This load excludes information that is derived (such as sales).
2. The sales data (referred to as fact data) is loaded into the data warehouse. This ensures that the data mart is in a current state and all the sales transaction data is up to date. Information relating to any new and changed stores can be loaded correctly.
3. The dimension data that relies on other fact data is updated based on the current state of the data mart.

Since the tabular model is less confined by the traditional star schema requirements of fact and dimension tables (in fact, the tabular model does not explicitly identify facts or dimensions), the inclusion and processing of these descriptive attributes can be built directly into the model. The calculation of a simple measure such as a historic sales value may be included in OLAP modeling through calculated columns in the data source view. However, this is restrictive and limited to simple calculations (such as total sales or n-period sales). Other manipulations (such as ranking and binning) are a lot more flexible in tabular modeling (as we will see).

This recipe examines how to manipulate a dimension table in order to provide a richer end user experience. Specifically, we will do the following:

- Introduce a calculated field to calculate the historical sales for the customer
- Determine the rank of the customer based on that field
- Create a discretization bin for the customer based on their sales
- Create an ordered hierarchy based on their discretization bins

Getting ready

Continuing with the scenario that was discussed in the Introduction section of the article, the purpose here is to identify each reseller's (customer's) historic sales and then rank them accordingly. We then discretize the Resellers table (customer) based on this.
This problem is further complicated by the consideration that a sale occurs in the country of origin (the sales data in the Reseller Sales table can appear in any currency). In order to keep the recipe concise, we break the entire process into two distinct steps:

- Conversion of sales (which manages the ranking of resellers based on a unified sales value)
- Classification of sales (which manages the manipulation of sales values based on discretized bins and the formatting of those bins)

How to do it...

Firstly, we need a common measure to compare the sales value of resellers. Convert sales to a uniform currency using the following steps:

1. Open a new workbook and launch the PowerPivot window.
2. Import the text files Reseller Sales.txt, Currency Conversion.txt, and Resellers.txt. The source data folder for this article includes the base schema.ini file that correctly transforms all data upon import. When importing the data, you should be prompted that the schema.ini file exists and will override the import settings. If this does not occur, ensure that the schema.ini file exists in the same directory as your source data. The prompt should look like the following screenshot.
   Although it is not mandatory, it is recommended that connection managers are labeled according to a standard. In this model, I have used the convention type_table_name, where type refers to the connection type (.txt) and table_name refers to the name of the table. Connections can be edited using the Existing Connections button in the Design tab.
3. Create a relationship between the Customer ID field in the Resellers table and the Customer ID field in the Reseller Sales table.
4. Add a new field (also called a calculated column) in the Reseller Sales table to show the gross value of the sale amount in USD. Add usd_gross_sales and hide it from client tools using the following code:
   = [Quantity Ordered] * [Price Per Unit]
     * LOOKUPVALUE (
         'Currency Conversion'[AVG Rate]
         , 'Currency Conversion'[Date], [Order dt]
         , 'Currency Conversion'[Currency ID], [Currency ID]
       )
5. Add a new measure in the Reseller Sales table to show sales (in USD). Add USD Gross Sales as:
   USD Gross Sales := SUM ( [usd_gross_sales] )
6. Add a new calculated column to the Resellers table to show the USD Sales Total value. The formula for the field should be:
   = 'Reseller Sales'[USD Gross Sales]
7. Add a Sales Rank field in the Resellers table to show the order of each reseller's USD Sales Total. The formula for Sales Rank is:
   =RANKX(Resellers, [USD Sales Total])
8. Hide the USD Sales Total and Sales Rank fields from client tools.

Now that all the entries in the Resellers table show their sales value in a uniform currency, we can determine a grouping for the Resellers table. In this case, we are going to group them into 100,000-dollar bands.

9. Add a new field to show each value in the USD Sales Total column of Resellers rounded down to the nearest 100,000 dollars, and hide it from client tools. Add Round Down Amount as:
   =ROUNDDOWN([USD Sales Total],-5)
10. Add a new field to show the rank of Round Down Amount in descending order, and hide it from client tools. Add Round Down Order as:
    =RANKX(Resellers,[Round Down Amount],,FALSE(), DENSE)
11. Add a new field in the Resellers table to show the 100,000-dollar group that the reseller belongs to. Since we know the lower bound of the sales bin, we can also infer that the upper bound is the sales total rounded up to the nearest 100,000 dollars.
    Add Sales Group as follows:
    =IF([Round Down Amount]=0 || ISBLANK([Round Down Amount])
        , "Sales Under 100K"
        , FORMAT([Round Down Amount], "$#,K") & " - " & FORMAT(ROUNDUP([USD Sales Total],-5),"$#,K")
    )
12. Set the Sort By Column of the Sales Group field to the Round Down Order column. Note that the Round Down Order column should sort in descending order (that is, entries in the Resellers table with high sales values should appear first).
13. Create a hierarchy on the Resellers table that shows the Sales Group field and the Customer Name column as levels. Title the hierarchy Customer By Sales Group.
14. Add a new measure titled Number of Resellers to the Resellers table:
    Number of Resellers:=COUNTROWS(Resellers)
15. Create a pivot table that shows the Customer By Sales Group hierarchy on the rows and Number of Resellers as values. If you created the usd_gross_sales field in the Reseller Sales table, it can also be added as an implicit measure to verify values.
16. Expand the first bin of Sales Group. The pivot should look like the following screenshot.

How it works...

This recipe has included various steps that add descriptive information to the Resellers table. This includes obtaining data from a separate table (the USD sales) and then manipulating that field within the Resellers table. In order to explain how this process works more clearly, we will break the explanation into several subsections: the sales data retrieval, the use of rank functions, and finally the discretization of sales.

The next section deals with the process of converting sales data to a single currency. The starting point for this recipe is to determine a common-currency sales value for each reseller (or customer). While the inclusion of the calculated column USD Sales Total in the Resellers table should be relatively straightforward, it is complicated by the consideration that the sales data is stored in multiple currencies. Therefore, the first step needs to include a currency conversion to determine the USD sales value for each line. This is simply the local value multiplied by the daily exchange rate; the LOOKUPVALUE function is used to return the row-by-row exchange rate.

Now that we have the usd_gross_sales value for each sales line, we define a measure that calculates its sum in whatever filter context it is applied in. Including it in the Reseller Sales table makes sense (since it relates to sales data), but what is interesting is how the filter context is applied when it is used as a field in the Resellers table. Here, the row filter context that exists in the Resellers table (after all, each row refers to a reseller) applies a restriction to the sales data, which shows the sales value for each reseller.

For this recipe to work correctly, it is not necessary to include the calculated field usd_gross_sales in Reseller Sales. We simply need to define a calculation that shows the gross sales value in USD and then use the row filter context in the Resellers table to restrict sales to the reseller in question (that is, the reseller in the row). It is obvious that the exchange rate should be applied on a daily basis, because the rate can change every day. We could use an X function in the USD Gross Sales measure to achieve exactly the same outcome.
Our formula would be:

SUMX (
    'Reseller Sales'
    , 'Reseller Sales'[Quantity Ordered] * 'Reseller Sales'[Price Per Unit]
      * LOOKUPVALUE (
          'Currency Conversion'[AVG Rate]
          , 'Currency Conversion'[Date], 'Reseller Sales'[Order dt]
          , 'Currency Conversion'[Currency ID], 'Reseller Sales'[Currency ID]
        )
)

Furthermore, if we wanted to, we could completely eliminate the USD Gross Sales measure from the model. To do this, we could wrap the entire formula (the previous definition of USD Gross Sales) in a CALCULATE statement in the Resellers table's field definition of USD Gross Sales. This forces the calculation to occur in the current row context. Why, then, have we included the additional fields and measures in Reseller Sales? This is a modeling choice; it makes the model easier to understand because it is more modular. Otherwise, this would require two calculations (one into a default currency and a second into a selected currency), and the field usd_gross_sales is used in that calculation.

Now that sales are converted to a uniform currency, we can determine their importance by rank. RANKX is used to rank the rows in the Resellers table based on the USD Gross Sales field. The simplest implementation of RANKX is demonstrated in the Sales Rank field: the function simply returns a rank based on the value of the supplied measure (which is, of course, USD Gross Sales). However, the RANKX function provides a lot of versatility and follows this syntax:

RANKX(<table>, <expression> [, <value> [, <order> [, <ties>]]])

Beyond the initial implementation of RANKX in its simplest form, the arguments of particular interest are <order> and <ties>. These can be used to specify the sort order (whether the rank is applied from highest to lowest or lowest to highest) and the function's behavior when duplicate values are encountered. This is best demonstrated with an example, so we will examine the operation of rank in relation to Round Down Amount.

When a simple RANKX function is applied, the function sorts the column in descending order and returns the position of a row based on the sorted order of the value and the number of prior rows in the table, including rows attributable to duplicate values. This is shown in the following screenshot, where the Simple column is defined as RANKX(Resellers,[Round Down Amount]). Note that the data is sorted by Round Down Amount and the first four tied values have a RANKX value of 1. This is the behavior we expect, since all of those rows have the same value. For the next value (700000), RANKX returns 5 because this is the fifth element in the sequence.

When the DENSE argument is specified, the value returned after a tie is the next sequential number in the list. In the preceding screenshot, this is shown in the DENSE column, whose formula is:

RANKX(Resellers, [Round Down Amount],,,DENSE)

Finally, we can specify the sort order used by the function (the default is descending) with the help of the <order> argument. If we wish to sort (and rank) from lowest to highest, we can use the formula shown in the INVERSE DENSE column, which uses the following calculation:

RANKX(Resellers, [Round Down Amount],,TRUE,DENSE)

Having specified the Sales Group field's sort by column as Round Down Order, we may ask why we did not also sort the Customer Name column by its respective values in the Sales Rank column.
Trying to define a sort by column in this way would cause an error, because the relationship between these two fields is not one-to-one; that is, each customer does not have a unique value for the sales rank. Let's look at this in more detail. If we filter the Resellers table to show the blank USD Sales Total rows (click on the drop-down arrow in the USD Sales Total column and check the BLANKS checkbox), we see that the value of the Sales Rank column is the same for all of these rows. In the following screenshot, we can see the value 636 repeated for all the rows.

Allowing the client tool visibility of the USD Sales Total and Sales Rank fields will not provide an intuitive browsing attribute for most client tools. For this reason, it is not recommended to expose these attributes to users. Hiding them will still allow the data to be queried directly.

Discretizing sales

By discretizing the Resellers table, we first make a decision to group each reseller into bands of 100,000 intervals. Since we have already calculated the USD Gross Sales value for each customer, our problem is reduced to determining which bin each customer belongs to. This is easily achieved, as we can derive the lower and upper bounds for each reseller: the lower bound is the sales amount rounded down, and the upper bound is the rounded-up value (that is, rounded to the nearest 100,000 interval). Finally, we must ensure that the ordering of the bins is correct so that the bins appear from the highest-value resellers to the lowest.

For convenience, these steps are broken down through the creation of additional columns, but they need not be; we could incorporate the steps into a single formula (mind you, it would be hard to read). Additionally, we have provided a unique name for the first bin by testing for zero sales; this may not be required.

The rounding is done with the ROUNDDOWN and ROUNDUP functions. These functions simply return the number moved by the number of digits offset. The following is the syntax for ROUNDDOWN:

ROUNDDOWN(<number>, <num_digits>)

Since we are interested only in the integer values (that is, values to the left of the decimal place), we must specify <num_digits> as -5. The display value of the bin is controlled through the FORMAT function, which returns the text equivalent of a value according to the provided format string. The syntax for FORMAT is:

FORMAT(<value>, <format_string>)

There's more...

In presenting a USD Gross Sales value for the Resellers table, we may not be interested in all the historic data. A typical variation on this theme is to determine current worth by showing recent history (or summing recent sales). This requirement can easily be implemented in the preceding method by swapping USD Gross Sales with recent sales. To determine this amount, we need to filter the data used in the SUM function. For example, to determine the last 30 days' sales for a reseller, we would use the following code:

SUMX(
    FILTER('Reseller Sales'
        , 'Reseller Sales'[Order dt] > (MAX('Reseller Sales'[Order dt]) - 30)
    )
    , USD SALES EXPRESSION
)
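As noted above, the helper columns could be collapsed into a single (admittedly harder to read) calculated column. A hedged sketch of such a combined Sales Group definition, produced simply by inlining the Round Down Amount expression from this recipe, might look like this:

    =IF (
        ROUNDDOWN ( [USD Sales Total], -5 ) = 0
            || ISBLANK ( ROUNDDOWN ( [USD Sales Total], -5 ) )
        , "Sales Under 100K"
        , FORMAT ( ROUNDDOWN ( [USD Sales Total], -5 ), "$#,K" )
            & " - "
            & FORMAT ( ROUNDUP ( [USD Sales Total], -5 ), "$#,K" )
    )

Note that a separate Round Down Order column would still be needed to order the groups, because a Sort By Column must reference a physical column in the table.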

Sharing Your BI Reports and Dashboards

Packt
19 Dec 2013
4 min read
The final objective of the information in BI reports and dashboards is to detect cause-and-effect business behavior and trends, and to trigger actions in response. These actions are supported by visual information, via scorecards and dashboards, and the process requires interaction with several people. MicroStrategy includes functionality to share our reports, scorecards, and dashboards, regardless of where those people are located.

Reaching your audience

MicroStrategy offers the option to share our reports via different channels that leverage the latest social technologies already present in the marketplace; that is, MicroStrategy integrates with Twitter and Facebook. Sharing avoids any related costs and maintains the design premise of the do-it-yourself approach, without any help from specialized IT personnel.

Main menu

The main menu of MicroStrategy shows a column named Status. When we click on that column, as shown in the following screenshot, the Share option appears.

The Share button

The other option is the Share button within our reports, that is, the view that we want to share. Select the Share button located at the bottom of the screen, as shown in the following screenshot. The share options are the same regardless of where you activate them; the various alternate menus are shown in the following screenshot.

E-mail sharing

When selecting the e-mail option from the Scorecards-Dashboards model, the system asks which e-mail program you want to use to send the e-mail; in our case, we select Outlook. MicroStrategy automatically prepares an e-mail with a link to share. You can modify the text and select the recipients of the e-mail, as shown in the following screenshot.

The recipients of the e-mail click on the URL included in the e-mail, and through this scheme the user is able to analyze the report in read-only mode with only the Filters panel enabled. The following screenshot shows how the user will review the report; the user is not allowed to make any modifications. This option does not require a MicroStrategy platform user account. When a user clicks on the link, he is able to edit the filters, perform analyses, and switch to any available layout, in our case, scorecards and dashboards. As a result, any visualization object can be maximized and minimized for better analysis, as shown in the following screenshot.

In this option, the report can be viewed in fullscreen mode by clicking on the fullscreen button located at the top-right corner of the screen. In this sharing mode, the user is able to download the information in Excel and PDF formats for each visualization object. For instance, suppose you need all the data included in the grid for the stores in region 1 opened in the year 2012. Perform the following steps:

1. In the browser, open the URL that is generated when you select the e-mail share option.
2. Select the ScoreCard tab.
3. In the Open Year filter, type 2012, and in the Region filter, type 1.
4. Now, maximize the grid.
Two icons will appear in the top-left corner of the screen: one for exporting the data to Excel and the other for exporting it to PDF, for each visualization object, as shown in the following screenshot. Please keep in mind that these two export options apply only to a specific visualization object; it is not possible to export the complete report from the functionality offered to the consumer.

Summary

In this article, we learned how to share our scorecards and dashboards via several channels, such as e-mail, social networks (Twitter and Facebook), and blogs or corporate intranet sites.

Getting Started with Apache Nutch

Packt
18 Dec 2013
13 min read
Introduction to Apache Nutch

Apache Nutch is a very robust and scalable tool for web crawling, and it can be integrated with scripting languages such as Python. You can use it whenever your application contains huge amounts of data and you want to apply crawling to it. Apache Nutch is open source web crawler software used for crawling websites. If you understand Apache Nutch well, you can create your own search engine in the style of Google: it gives you your own search engine with which you can improve how your application's pages rank in searches and customize searching according to your needs.

Nutch is extensible and scalable. It provides facilities for parsing, indexing, building your own search engine, customizing search to your needs, scalability, robustness, and a ScoringFilter for custom implementations. ScoringFilter is a Java class used when creating an Apache Nutch plugin; it is used for manipulating scoring variables. We can run Apache Nutch on a single machine as well as in a distributed environment such as Apache Hadoop. It is written in Java. With Apache Nutch we can find broken links, keep a copy of all the visited pages for searching over (for example, to build indexes), and discover web page hyperlinks in an automated manner.

Apache Nutch can be integrated with Apache Solr easily, and we can index all the web pages crawled by Apache Nutch into Apache Solr. We can then use Apache Solr to search the web pages indexed by Apache Nutch. Apache Solr is a search platform built on top of Apache Lucene that can be used for searching any type of data, for example, web pages.

Crawling your first website

Crawling is driven by the Apache Nutch crawling tool and certain related tools for building and maintaining several data structures, including the web database, the index, and a set of segments. Once Apache Nutch has indexed the web pages to Apache Solr, you can search for the required web pages in Apache Solr.

Apache Solr installation

Apache Solr is a search platform built on top of Apache Lucene. It is a very powerful search mechanism and provides full-text search, dynamic clustering, database integration, rich document handling, and much more. Apache Solr is used for indexing the URLs crawled by Apache Nutch, so that the details crawled by Apache Nutch can then be searched in Apache Solr.

Crawling your website using the crawl script

Apache Nutch 2.2.1 comes with a crawl script that performs crawling by executing a single script. In earlier versions, we had to perform each step manually (generating data, fetching data, parsing data, and so on) to perform crawling.

Crawling the web, the CrawlDb, and URL filters

When the user invokes the crawling command in Apache Nutch 1.x, a crawldb is generated by Apache Nutch, which is nothing but a directory that contains details about the crawl. In Apache Nutch 2.x, the crawldb is not present; instead, Apache Nutch keeps all the crawling data directly in the database.

InjectorJob

The injector adds the necessary URLs to the crawldb, the directory created by Apache Nutch for storing data related to crawling. You need to provide URLs to the InjectorJob, either by downloading URLs from the Internet or by writing your own file that contains the URLs. Let's say you have created one directory called urls that contains all the URLs that need to be injected into the crawldb.
The following command performs the InjectorJob:

# bin/nutch inject crawl/crawldb urls

urls is the directory that contains all the URLs to be injected into the crawldb, and crawl/crawldb is the directory in which the injected URLs will be placed. After performing this job, you will have a number of unfetched URLs inside your database, that is, the crawldb.

GeneratorJob

Once we are done with the InjectorJob, it is time to fetch the injected URLs from the crawldb. Before fetching the URLs, you need to perform the GeneratorJob. The following command is used for the GeneratorJob:

# bin/nutch generate crawl/crawldb crawl/segments

crawl/crawldb is the directory from which URLs are generated, and crawl/segments is the directory used by the GeneratorJob to hold the information required for crawling.

FetcherJob

The job of the fetcher is to fetch the URLs that were generated by the GeneratorJob; it uses the input provided by the GeneratorJob. The following command is used for the FetcherJob:

# bin/nutch fetch -all

Here the input parameter -all means that this job will fetch all the URLs generated by the GeneratorJob. You can use different input parameters according to your needs.

ParserJob

After the FetcherJob, the ParserJob parses the URLs that were fetched by the FetcherJob. The following command is used for the ParserJob:

# bin/nutch parse -all

The input parameter -all parses all the URLs fetched by the FetcherJob. You can use different input parameters according to your needs.

DbUpdaterJob

Once the ParserJob has completed, we need to update the database with the results of the FetcherJob. This updates the database with the last fetched URLs. The following command is used for the DbUpdaterJob:

# bin/nutch updatedb crawl/crawldb -all

After performing this job, the database will contain updated entries for all the initial pages as well as new entries corresponding to the newly discovered pages linked from the initial set.

Invertlinks

Before indexing, we need to first invert all the links, so that we can index the incoming anchor text with the pages. The following command is used for Invertlinks:

# bin/nutch invertlinks crawl/linkdb -dir crawl/segments

(An end-to-end sketch that chains these jobs together is shown below, after the Hadoop overview.)

Apache Hadoop

Apache Hadoop is designed for running your application on a cluster of servers, where one machine is the master and the rest are slaves. It is a huge data warehouse: the master machines direct the slave machines, and the actual data processing is done by the slaves. This is why Apache Hadoop is used for processing huge amounts of data: the work is divided across the slave machines, which is how Apache Hadoop achieves the highest throughput for any processing. As the data grows, you add more slave machines. That is how Apache Hadoop works.

Integration of Apache Nutch with Apache Hadoop

Apache Nutch can be integrated with Apache Hadoop easily, and we can make the process much faster than running Apache Nutch on a single machine. After integrating Apache Nutch with Apache Hadoop, we can perform crawling in an Apache Hadoop cluster environment, so the process is much faster and we get the highest throughput.

Apache Hadoop setup with a cluster

This setup does not require purchasing huge hardware for running Apache Nutch and Apache Hadoop; it is designed to make maximum use of the hardware available.
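Before moving into the distributed setup, it may help to see the single-machine workflow in one place. The following is a hedged sketch that chains the jobs described above and then indexes the results into Apache Solr; the Solr URL and the directory layout are assumptions, and the exact arguments of the indexing command (and of the crawl script) differ between Nutch releases, so check the usage output of bin/nutch and bin/crawl for your version:

    # inject, generate, fetch, parse, update, and invert links, as described above
    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb crawl/crawldb -all
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments

    # push the crawled pages into Solr for searching
    # (1.x-style directory layout; the Solr URL below is an example)
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments

The crawl script mentioned earlier wraps a similar sequence into a single command; a typical invocation looks like bin/crawl urls myCrawl http://localhost:8983/solr/ 2, where the last argument is the number of crawl rounds (treat the argument order as an assumption and confirm it against your installation's script).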
Formatting the HDFS filesystem using the NameNode

HDFS (Hadoop Distributed File System) is the directory structure used by Apache Hadoop for storage; it stores all the data related to Apache Hadoop. It has two components, the NameNode and the DataNodes: the NameNode manages the filesystem metadata, while the DataNodes actually store the data. It is highly configurable and well suited to many installations; for very large clusters, the configuration needs to be tuned. The first step in getting Apache Hadoop started is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster (which will include only your local machine if you have followed along).

Setting up the deployment architecture of Apache Nutch

We have to set up Apache Nutch on each of the machines we are using. In this case, we are using a six-machine cluster, so we have to set up Apache Nutch on each machine. With a small number of machines in the cluster, we can do the setup manually on each one. But when there are more machines, say 100 machines in the cluster environment, we cannot set up each machine manually. For that we require a deployment tool such as Chef, or at least distributed ssh. You can refer to http://www.opscode.com/chef/ to get familiar with Chef, and to http://www.ibm.com/developerworks/aix/library/au-satdistadmin/ to get familiar with distributed ssh. Here I will only demonstrate running Apache Hadoop on Ubuntu as a single-node cluster; if you want to run Apache Hadoop on Ubuntu as a multi-node cluster, follow the reference link provided above and configure it in the same way.

Once we are done with the deployment of Apache Nutch to a single machine, we run the start-all.sh script, which starts the services on the master node and the data nodes. That is, the script starts the Hadoop daemons on the master node, logs into all the slave nodes using the ssh command as explained above, and starts the daemons on the slave nodes. The start-all.sh script expects Apache Nutch to be placed at the same location on each machine, and it also expects Apache Hadoop to store its data at the same file path on each machine. The script starts the daemons on the master and slave nodes using password-less ssh login.

Introduction to Apache Nutch configuration with Eclipse

Apache Nutch can be easily configured with Eclipse, after which we can perform crawling from Eclipse instead of the command line; all the crawling operations that we run from the command line can be performed from Eclipse. Instructions are provided for setting up a development environment for Apache Nutch with the Eclipse IDE; they are intended as a comprehensive starting resource for configuring, building, crawling, and debugging Apache Nutch in this context. The following are the prerequisites for integrating Apache Nutch with Eclipse:

- Get the latest version of Eclipse from http://www.eclipse.org/downloads/packages/release/juno/r
- All the required plugins listed below are available from the Eclipse Marketplace. If they are not, you can download the Eclipse Marketplace client as described at http://marketplace.eclipse.org/marketplace-client-intro
- Once you have configured Eclipse, download Subclipse as described at http://subclipse.tigris.org/. If you face a problem with the 1.8.x release, try 1.6.x; this may resolve compatibility issues.
Download the IvyDE plugin for Eclipse from http://ant.apache.org/ivy/ivyde/download.cgi
Download the m2e plugin for Eclipse from http://marketplace.eclipse.org/content/maven-integration-eclipse

Introduction of Apache Accumulo

Accumulo is used as a datastore for storing data, in the same way that we use databases such as MySQL or Oracle. The key point of Apache Accumulo is that it runs on an Apache Hadoop cluster environment, which is a very useful feature. The Accumulo sorted, distributed key/value store is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Thrift, and Zookeeper. Apache Accumulo features some novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Introduction of Apache Gora

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Apache Gora supports persisting to column stores, key/value stores, document stores, and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Supported datastores

Apache Gora presently supports the following datastores:

Apache Accumulo
Apache Cassandra
Apache HBase
Amazon DynamoDB

Use of Apache Gora

Although there are many excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from their relational cousins. Data-model-agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the complete power of the data models in column stores. Gora fills this gap, giving the user an easy-to-use in-memory data model plus persistence for big data, with data-store-specific mappings and built-in Apache Hadoop support.

Integration of Apache Nutch with Apache Accumulo

In this section, we are going to cover the process of integrating Apache Nutch with Apache Accumulo. Apache Accumulo is used for huge data storage. It is built on top of Apache Hadoop, Zookeeper, and Thrift. A potential use of integrating Apache Nutch with Apache Accumulo is when our application has huge amounts of data to process and we want to run it in a cluster environment; in that case we can use Apache Accumulo for data storage. As Apache Accumulo only runs with Apache Hadoop, it is mostly used in cluster-based environments. We will start with the configuration of Apache Gora with Apache Nutch. Then we will set up Apache Hadoop and Zookeeper, install and configure Apache Accumulo, test Apache Accumulo, and finally crawl with Apache Nutch on Apache Accumulo.

Setup Apache Hadoop and Apache Zookeeper for Apache Nutch

Apache Zookeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. All of these services are used by distributed applications in one way or another. Because Zookeeper provides them, you don't have to write these services from scratch; you can use them to implement consensus, management, group membership, leader election, and presence protocols, and you can also build on them for your own requirements. A quick way to confirm that Zookeeper is reachable is sketched below.
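Before configuring Accumulo on top of it, it is worth confirming that the Zookeeper ensemble actually answers requests. The following is a minimal sketch, not taken from the original setup: it assumes Zookeeper is listening on localhost:2181 (adjust the connect string to match your zoo.cfg), and the class name ZookeeperPing is just an illustrative choice.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperPing {
    public static void main(String[] args) throws Exception {
        // Wait until the client reports SyncConnected before issuing requests.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();
        // Listing the root znodes proves the ensemble is reachable and serving.
        System.out.println("Root znodes: " + zk.getChildren("/", false));
        zk.close();
    }
}

If this prints the list of root znodes, Zookeeper is up and you can proceed with the Accumulo configuration described next.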
Apache Accumulo is built on top of Apache Hadoop and Zookeeper, so we must configure Apache Accumulo to work with both of them. You can refer to http://www.covert.io/post/18414889381/accumulo-nutch-and-gora for any queries related to the setup.

Integration of Apache Nutch with MySQL

In this section, we are going to integrate Apache Nutch with MySQL, so that the web pages crawled by Apache Nutch are stored in MySQL. You can then go to MySQL, check your crawled web pages, and perform any necessary operations on them. We will start with an introduction to MySQL, then cover why MySQL is worth integrating with Apache Nutch. After that we will look at the configuration of MySQL with Apache Nutch, and at the end we will crawl with Apache Nutch on MySQL. So let's start with the introduction to MySQL.

Summary

We covered the following:

Downloading Apache Hadoop and Apache Nutch
Performing crawling on an Apache Hadoop cluster in Apache Nutch
Apache Nutch configuration with Eclipse
Installation steps for building Apache Nutch with Eclipse
Crawling in Eclipse
Configuration of Apache Gora with Apache Nutch
Installation and configuration of Apache Accumulo
Crawling with Apache Nutch on Apache Accumulo
The need for integrating MySQL with Apache Nutch

Resources for Article: Further resources on this subject: Getting Started with the Alfresco Records Management Module [Article] Making Big Data Work for Hadoop and Solr [Article] Apache Solr PHP Integration [Article]


Setting goals

Packt
18 Dec 2013
8 min read
(For more resources related to this topic, see here.) You can use the following code to track the event: [Flurry logEvent:@"EVENT_NAME"]; The logEvent: method logs your event every time it's triggered during the application session. This method helps you to track how often that event is triggered. You can track up to 300 different event IDs. However, the length of each event ID should be less than 255 characters. After the event is triggered, you can track that event from your Flurry dashboard. As is explained in the following screenshot, your events will be listed in the Events section. After clicking on Event Summary, you can see a list of the events you have created along with the statistics of the generated data as shown in the following screenshot: You can fetch the detailed data by clicking on the event name (for example, USER_VIEWED). This section will provide you with a chart-based analysis of the data as shown in the following screenshot: The Events Per Session chart will provide you with details about how frequently a particular event is triggered in a session. Other than this, you are provided with the following data charts as well: Unique Users Performing Event: This chart will explain the frequency of unique users triggering the event. Total Events: This chart holds the event-generation frequency over a time period. You can access the frequency of the event being triggered over any particular time slot. Average Events Per Session: This chart holds the average frequency of the events that happen per session. There is another variation of this method, as shown in the following code, which allows you to track the events along with the specific user data provided: [Flurry logEvent:@"EVENT_NAME" withParameters:YOUR_NSDictionary]; This version of the logEvent: method counts the frequency of the event and records dynamic parameters in the form of dictionaries. External parameters should be in the NSDictionary format, whereas both the key and the value should be in the NSString object format. Let's say you want to track how frequently your comment section is used and see the comments; then you can use this method to track such events along with the parameters. You can track up to 300 different events with an event ID length less than 255 characters. You can provide a maximum of 10 event parameters per event. The following example illustrates the use of the logEvent: method along with optional parameters in the dictionary format: NSDictionary *dictionary = [NSDictionary dictionaryWithObjectsAndKeys:@"your dynamic parameter value", @"your dynamic parameter name", nil]; [Flurry logEvent:@"EVENT_NAME" withParameters:dictionary]; In case you want Flurry to log all your application sections/screens automatically, you should pass your navigationController or tabBarController as a parameter to count all your pages automatically, using one of the following lines of code: [Flurry logAllPageViews:navigationController]; [Flurry logAllPageViews:tabBarController]; The Flurry SDK will create a delegate on your navigationController or tabBarController object, whichever is provided, to detect the page's navigation. Each navigation detected will be tracked by the Flurry SDK automatically as a page view. You only need to pass each object to the Flurry SDK once. However, you can pass multiple instances of different navigation and tab bar controllers. There can be some cases where you have a view controller that is not associated with any navigation or tab bar controller.
Then you can use the following code: [Flurry logPageView]; The preceding code will track the event independently of navigation and tab bar controllers. For each user interaction you can manually log events. Tracking time spent Flurry allows you to track events based on the duration factor as well. You can use the [Flurry logEvent: timed:] method to log your event in time as shown in the following code: [Flurry logEvent:@"EVENT_NAME" timed:YES]; In case you want to pass additional parameters along with the event name, you can use the following variation of the logEvent: method to start a timed event with event parameters as shown in the following code: [Flurry logEvent:@"EVENT_NAME" withParameters:YOUR_NSDictionary timed:YES]; The aforementioned method can help you to track your timed event along with the dynamic data provided in the dictionary format. You can end all your timed events before the application exits. This can even be accomplished by updating the event with event parameters. If you want to end your events without updating the parameters, you can pass nil as the parameters. If you do not end your events, they will automatically end when the application exits, as shown in the following code: [Flurry endTimedEvent:@"EVENT_NAME" withParameters:YOUR_NSDictionary]; Let's take the following example in which you want to log an event whenever a user comments on any article in your application: NSDictionary *commentParams = [NSDictionary dictionaryWithObjectsAndKeys: @"User_Comment", @"Comment", // Capture comment info @"Registered", @"User_Status", // Capture user status nil]; [Flurry logEvent:@"User_Comment" withParameters:commentParams timed:YES]; // In a function that captures when a user posts the comment [Flurry endTimedEvent:@"Article_Read" withParameters:nil]; //You can pass in additional //params or update existing ones here as well The aforementioned piece of code will help you to log a timed event every time a user comments on an article in your application. While tracking the event, you are also tracking the comment and the user's registration status by specifying them in the dictionary. Tracking errors Flurry provides you with a method to track errors as well. You can use the following method to track errors on Flurry: [Flurry logError:@"ERROR_NAME" message:@"ERROR_MESSAGE" exception:e]; You can track exceptions and errors that occurred in the application by providing the name of the error (ERROR_NAME) along with the message, such as ERROR_MESSAGE, with an exception object. Flurry reports the first ten errors in each session. You can fetch all the application exceptions and specifically uncaught exceptions on Flurry. You can use the logError:message:exception: class method to catch all the uncaught exceptions. These exceptions will be logged in Flurry in the Error section, which is accessible on the Flurry dashboard: // Uncaught Exception Handler - sent through Flurry. void uncaughtExceptionHandler(NSException *exception) { [Flurry logError:@"Uncaught" message:@"Crash" exception:exception]; } - (void)applicationDidFinishLaunching:(UIApplication *)application { NSSetUncaughtExceptionHandler(&uncaughtExceptionHandler); [Flurry startSession:@"YOUR_API_KEY"]; // .... } Flurry also helps you to catch all the uncaught exceptions generated by the application. All the exceptions will be caught by using the NSSetUncaughtExceptionHandler method, to which you pass a function that will catch all the exceptions raised during the application session.
All the errors reported can also be tracked using the logError:message:error: method. You can pass the error name, message, and object to log the NSError error on Flurry as shown in the following code: - (void) webView:(UIWebView *)webView didFailLoadWithError:(NSError *)error { [Flurry logError:@"WebView No Load" message:[error localizedDescription] error:error]; } Tracking versions When you develop applications for mobile devices, it's obvious that you will evolve your application at every stage, pushing the latest updates for the application, which creates a new version of the application on the application store. To track the application based on these versions, you need to set up Flurry to track your application versions as well. This can be done using the following code: [Flurry setAppVersion:App_Version_Number]; By using the aforementioned method, you can track your application based on its version. For example, if you have released an application and unfortunately it has a critical bug, you can track your application based on the current version and the errors that are tracked by Flurry from the application. You can access data generated from Flurry's Dashboards by navigating to Flurry Classic. This will, by default, load a time-based graph of the application sessions for all versions. However, you can access the user session graph for a single version by selecting a version from the drop-down list as shown in the following screenshot: This is how the drop-down list will appear. Select a version and click on Update as shown in the following screenshot: The previous action will generate a version-based graph of user sessions, showing the number of times users have opened the app in the given time frame, as shown in the following screenshot: Along with that, Flurry also provides user retention graphs to gauge the number of users and the usage of the application over a period of time. Summary In this article we explored the ways to track the application on Flurry and to gather meaningful data on Flurry by setting goals to track the application. Then we learned how to track the time spent by users on the application, along with user data tracking and retention graphs to gauge the number of users. Resources for Article: Further resources on this subject: Data Analytics [Article] Learning Data Analytics with R and Hadoop [Article] Limits of Game Data Analysis [Article]

Managing IBM Cognos BI Server Components

Packt
12 Dec 2013
6 min read
(for more resources related to this topic, see here.) Cognos BI architecture The IBM Cognos 10.2 BI architecture is separated into the following three tiers: Web server (gateways) Applications (dispatcher and Content Manager) Data (reporting/querying the database, content store, metric store) Web server (gateways) The user starts a web session with Cognos to connect to the IBM Cognos Connection's web-based interface/application using the web browser (Internet Explorer and Mozilla Firefox are the currently supported browsers). This web request is sent to the web server where the Cognos gateway resides. The gateway is a server-software program that works as an intermediate party between the web server and other servers, such as an application server. The following diagram shows the basic view of the three tiers of the Cognos BI architecture: The Cognos gateway is the starting point from where a request is received and transferred to the BI Server. On receiving a request from the web server, the Cognos gateway applies encryption to the information received, adds necessary environment variables and authentication namespace, and transfers the information to the application server (or dispatcher). Similarly, when the data has been processed and the presentation is ready, it is rendered towards the user's browser via the gateway and web server. The following diagram shows the Tier 1 layer in detail: The gateways must be configured to communicate with the application component (dispatcher) in a distributed environment. To make a failover cluster, more than one BI Server may be configured. The following types of web gateways are supported: CGI: This is also the default gateway. This is a basic gateway. ISAPI: This is for the Windows environment. It is the best for Windows IIS (Internet Information Services). Servlet: This gateway is the best for application servers that are supporting servlets. Apache_mod: This gateway type may be used for the Apache server. The following diagram shows an environment in which the web server is load balanced by two server machines: To improve performance, gateways (if more than one) must be installed and configured on separate machines. The application tier (Cognos BI Server) The application tier comprises one or multiple BI Servers. A server's job is to run user requests, for example, queries, reports, and analysis that are received from a gateway. The GUI environment (IBM Cognos Connection) that appears after logging in is also rendered and presented by Cognos BI Server. Another such example is the Metric Studio interface. The BI Server must include the dispatcher and Content Manager (the Content Manager component may be separated from the dispatcher). The following diagram shows BI Server's Tier 2 in detail: Dispatcher The dispatcher has static handlers to many services. Each request that is received is routed to the corresponding service for further processing. The dispatcher is also responsible for starting all the Cognos services at startup. These services include the system service, report service, report data service, presentation service, Metric Studio service, log service, job service, event management service, Content Manager service, batch report service, delivery service, and many others. When there are multiple dispatchers in a multitier architecture, a dispatcher may also send and route requests to another dispatcher. The URIs for all dispatchers must be known to the Cognos gateway(s). 
All dispatchers are registered in Content Manager (CM), making it possible for all dispatchers to know each other. A dispatcher grid is formed in this way. To improve the system performance, multiple dispatchers must be installed but on separate computers, and the Content Manager component must also be on a separate server. The following diagram shows how multiple dispatcher servers can be added. Services for the BI Server (dispatcher) Each dispatcher has a set of services, which are listed alphabetically in the following table. When the Cognos service is started from Cognos Configuration, all services are started one by one. The following table shows the dispatcher services and their short descriptions: Service Description Agent service Runs the agent. Annotation service Adds comments to reports. Batch report service Handles background report requests. Content manager cache service Handles cache for frequent queries to enhance performance of Content Manager. Content manager service Performs DML in content store db. Cognos deployment is another task for this service. Delivery service For sending e-mails. Event management service Manages the Event Objects (creation, scheduling, and so on) Graphics service Renders graphics for other services such as report service. Human task service Manages human tasks. Index data service For basic full-text functions for storage and retrieval of terms and indexed summary documents. Index search service For search and drill-through functions, including lists of aliases and examples. Index update service For write, update, delete, and administration-related functions. Job service Runs jobs in coordination with the monitor service. Log service For extensive logging of the Cognos environment (file, database, remote-log server, event viewer, and system log). Metadata service For displaying data lineage information (data source, calculation expressions) for the Cognos studios and viewer. Metric studio service This service is used for providing a user interface to metric studio for monitoring and manipulating system KPIs. Migration service Used for migration from old versions to new versions, especially series 7. Monitor service Works as a timer service-it manages the monitoring and running of tasks that were scheduled or marked as background tasks. Helps in failover and recovery for running tasks. Presentation service This service prepares and displays the presentation layer by converting the XML data to HTML or any other format view. IBM Cognos Connection is also prepared by this service. Query service For managing dynamic query requests. Report data service This service prepares data for other applications; for example mobile, Microsoft Office, and so on. Report service Manages report requests. The output is displayed in IBM Cognos Connection. System service This service defines the BI-Bus API compliant service. It gives more data about the BI configuration parameters. Summary This article covered the IBM Cognos BI architecture. Now you must be familiar with the single tier and multitier architectures and a variety of features and options that Cognos provides. resources for article: further resources on this subject: IBM Cognos Insight [article] Integrating IBM Cognos TM1 with IBM Cognos 8 BI [article] IBM Cognos 10 BI dashboarding components [article]

Apache Solr PHP Integration

Packt
25 Nov 2013
7 min read
(For more resources related to this topic, see here.) We will be looking at installation on both Windows and Linux environments. We will be using the Solarium library for communication between Solr and PHP. This article will give a brief overview of the Solarium library and showcase some of the concepts and configuration options on Solr end for implementing certain features. Calling Solr using PHP code A ping query is used in Solr to check the status of the Solr server. The Solr URL for executing the ping query is http://localhost:8080/solr/collection1/admin/ping/?wt=json. Response of Solr ping query in browser We can use Curl to get the ping response from Solr via PHP code; a sample code for executing the previous ping query is as below $curl = curl_init("http://localhost:8080/solr/collection1/admin/ping/?wt=json"); curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); $output = curl_exec($curl); $data = json_decode($output, true); echo "Ping Status : ".$data["status"].PHP_EOL; Though Curl can be used to execute almost any query on Solr, but it is preferable to use a library which does the work for us. In our case we will be using Solarium. To execute the same query on Solr using the Solarium library the code is as follows. include_once("vendor/autoload.php"); $config = array("endpoint" => array("localhost" => array("host"=>"127.0.0.1", "port"=>"8080", "path"=>"/solr", "core"=>"collection1",) ) ); We have included the Solarium library in our code. And defined the connection parameters for our Solr server. Next we will need to create a Solarium client with the previous Solr configuration. And call the createPing() function to create the ping query. $client = new SolariumClient($config); $ping = $client->createPing(); Finally execute the ping query and get the result. $result = $client->ping($ping); $result->getStatus(); The output should be similar to the one shown below. Output of ping query using PHP Adding documents to Solr index To create a Solr index, we need to add documents to the Solr index using the command line, Solr web interface or our PHP program. But before we create a Solr index, we need to define the structure or the schema of the Solr index. Schema consists of fields and field types. It defines how each field will be treated and handled during indexing or during search. Let us see a small piece of code for adding documents to the Solr index using PHP and Solarium library. Create a solarium client. Create an instance of the update query. Create the document in PHP and finally add fields to the document. $client = new SolariumClient($config); $updateQuery = $client->createUpdate(); $doc1 = $updateQuery->createDocument(); $doc1->id = 112233445; $doc1->cat = 'book'; $doc1->name = 'A Feast For Crows'; $doc1->price = 8.99; $doc1->inStock = 'true'; $doc1->author = 'George R.R. Martin'; $doc1->series_t = '"A Song of Ice and Fire"'; Id field has been marked as unique in our schema. So we will have to keep different values for Id field for different documents that we add to Solr. Add documents to the update query followed by commit command. Finally execute the query. $updateQuery->addDocuments(array($doc1)); $updateQuery->addCommit(); $result = $client->update($updateQuery); Let us execute the code. php insertSolr.php After executing the code, a search for martin will give these documents in the result. 
http://localhost:8080/solr/collection1/select/?q=martin Document added to Solr index Executing search on Solr Index Documents added to the Solr index can be searched using the following piece of PHP code. $selectConfig = array( 'query' => 'cat:book AND author:Martin', 'start' => 3, 'rows' => 3,'fields' => array('id','name','price','author'), 'sort' => array('price' => 'asc') ); $query = $client->createSelect($selectConfig); $resultSet = $client->select($query); The above code creates a simple Solr query and searches for book in cat field and Martin in author field. The results are sorted in ascending order or price and fields returned are id, name of book, price and author of book. Pagination has been implemented as 3 results per page, so this query returns results for 2nd page starting from 3rd result. In addition to this simple select query, Solr also supports some advanced query modes known as dismax and edismax. With the help of these query modes, we can boost certain fields to give more importance to certain fields in our query. We can also use function queries to do some type of dynamic boosting based on values in fields. If no sorting is provided, the Solr results are sorted by the score of documents which are calculated based on the terms in the query and the matching terms in the documents in the index. Score is calculated for each document in the result set using two main factors - term frequency known as tf and inverse document frequency known as idf. In addition to these, Solr provides a way of narrowing down the results using filter queries. Also facets can be created based on fields in the index and it can be used by the end users to narrow down the results. Highlighting search results using PHP and Solr Solr can be used to highlight the fields returned in a search result based on the query. Here is a sample code for highlighting the results for search keyword harry. $query->setQuery('harry'); $query->setFields(array('id','name','author','series_t','score','last_modified')); Get the highlighting component from the query, set the fields to be highlighted and also set the html tags to be used for highlighting. $hl = $query->getHighlighting(); $hl->setFields('name,series_t'); $hl->setSimplePrefix('<strong>')->setSimplePostfix('</strong>'); Once the query is run and result set is received, we will need to retrieve the highlighted results from the result set. Here is the output for the highlighting code. Highlighted search results In addition to highlighting, Solr can be used to create a spelling suggester and a spell checker. Spelling suggester can be used to prompt input query to the end user as the user keeps on typing. Spell check can be used to prompt spelling corrections similar to 'did you mean' to the user. Solr can also be used for finding documents which are similar to a certain document based on words in certain fields. This functionality of Solr is known as more like this and is exposed via Solarium by the MoreLikeThis component. Solr also provides grouping of the result based on a particular query or a certain field. Scaling Solr Solr can be scaled to handle large number of search requests by using master slave architecture. Also if the index is huge, it can be sharded across multiple Solr instances and we can run a distributed search to get results for our query from all the sharded instances. Solarium provides a load balancing plug-in which can be used to load balance queries across master-slave architecture. 
Summary Solr provides an extensive list of features for implementing search. These features can be easily accessed in PHP using the Solarium library to build a full features search application which can be used to power search on any website. Resources for Article: Further resources on this subject: Apache Solr Configuration [Article] Getting Started with Apache Solr [Article] Making Big Data Work for Hadoop and Solr [Article]

Plotting Charts with Images and Maps

Packt
22 Nov 2013
9 min read
(For more resources related to this topic, see here.) This article explores how to work with images and maps. Python has some well-known image libraries that allow us to process images in both aesthetic and scientific ways. We will touch on PIL's capabilities by demonstrating how to process images by applying filters and by resizing them. Furthermore, we will show how to use image files as annotation for our matplotlibs' charts. To deal with data visualization of geospatial datasets, we will cover the functionality of Python's available libraries and public APIs that we can use with map-based visual representations. The final recipe shows how Python can create CAPTCHA test images. Processing images with PIL Why use Python for image processing, if we could use WIMP (http://en.wikipedia.org/wiki/WIMP_(computing)) or WYSIWYG (http://en.wikipedia.org/wiki/WYSIWYG) to achieve the same goal? This is used because we want to create an automated system to process images in real time without human support, thus, optimizing the image pipeline. Getting ready Note that the PIL coordinate system assumes that the (0,0) coordinate is in the upper-left corner. The Image module has a useful class and instance methods to perform basic operations over a loaded image object (im): im = Image.open(filename): This opens a file and loads the image into im object. im.crop(box): This crops the image inside the coordinates defined by box. box defines left, upper, right, lower pixels coordinates (for example: box = (0, 100, 100,100)). im.filter(filter): This applies a filter on the image and returns a filtered image. im.histogram(): This returns a histogram list for this image, where each item represents the number of pixels. Number of items in the list is 256 for single channel images, but if the image is not a single channel image, there can be more items in the list. For an RGB image the list contains 768 items (one set of 256 values for each channel). im.resize(size, filter): This resizes the image and uses a filter for resampling. The possible filters are NEAREST, BILINEAR, BICUBIC, and ANTIALIAS. The default is NEAREST. im.rotate(angle, filter): This rotates an image in the counter clockwise direction. im.split(): This splits bands of image and returns a tuple of individual bands. Useful for splitting an RGB image into three single band images. im.transform(size, method, data, filter): This applies transformation on a given image using data and a filter. Transformation can be AFFINE, EXTENT, QUAD, and MESH. You can read more about transformation in the official documentation. Data defines the box in the original image where the transformation will be applied. The ImageDraw module allows us to draw over the image, where we can use functions such as arc, ellipse, line, pieslice, point, and polygon to modify the pixels of the loaded image. The ImageChops module contains a number of image channel operations (hence the name Chops) that can be used for image composition, painting, special effects, and other processing operations. Channel operations are allowed only for 8-bit images. Here are some interesting channel operations: ImageChops.duplicate(image): This copies current image into a new image object ImageChops.invert(image): This inverts an image and returns a copy ImageChops.difference(image1, image2): This is useful for verification that images are the same without visual inspection The ImageFilter module contains the implementation of the kernel class that allows the creation of custom convolution kernels. 
This module also contains a set of healthy common filters that allows the application of well-known filters (BLUR and MedianFilter) to our image. There are two types of filters provided by the ImageFilter module: fixed image enhancement filters and image filters that require certain arguments to be defined; for example, size of the kernel to be used. We can easily get the list of all fixed filter names in IPython: In [1]: import ImageFilter In [2]: [ f for f in dir(ImageFilter) if f.isupper()] Out[2]: ['BLUR', 'CONTOUR', 'DETAIL', 'EDGE_ENHANCE', 'EDGE_ENHANCE_MORE', 'EMBOSS', 'FIND_EDGES', 'SHARPEN', 'SMOOTH', 'SMOOTH_MORE'] The next example shows how we can apply all currently supported fixed filters on any supported image: import os import sys from PIL import Image, ImageChops, ImageFilter class DemoPIL(object): def __init__(self, image_file=None): self.fixed_filters = [ff for ff in dir(ImageFilter) if ff.isupper()] assert image_file is not None assert os.path.isfile(image_file) is True self.image_file = image_file self.image = Image.open(self.image_file) def _make_temp_dir(self): from tempfile import mkdtemp self.ff_tempdir = mkdtemp(prefix="ff_demo") def _get_temp_name(self, filter_name): name, ext = os.path.splitext(os.path.basename(self.image_file)) newimage_file = name + "-" + filter_name + ext path = os.path.join(self.ff_tempdir, newimage_file) return path def _get_filter(self, filter_name): # note the use Python's eval() builtin here to return function object real_filter = eval("ImageFilter." + filter_name) return real_filter def apply_filter(self, filter_name): print "Applying filter: " + filter_name filter_callable = self._get_filter(filter_name) # prevent calling non-fixed filters for now if filter_name in self.fixed_filters: temp_img = self.image.filter(filter_callable) else: print "Can't apply non-fixed filter now." return temp_img def run_fixed_filters_demo(self): self._make_temp_dir() for ffilter in self.fixed_filters: temp_img = self.apply_filter(ffilter) temp_img.save(self._get_temp_name(ffilter)) print "Images are in: {0}".format((self.ff_tempdir),) if __name__ == "__main__": assert len(sys.argv) == 2 demo_image = sys.argv[1] demo = DemoPIL(demo_image) # will create set of images in temporary folder demo.run_fixed_filters_demo() We can run this easily from the command prompt: $ pythonch06_rec01_01_pil_demo.py image.jpeg We packed our little demo in the DemoPIL class, so we can extend it easily while sharing the common code around the demo function run_fixed_filters_demo. Common code here includes opening the image file, testing if the file is really a file, creating temporary directory to hold our filtered images, building the filtered image filename, and printing useful information to user. This way the code is organized in a better manner and we can easily focus on our demo function, without touching other parts of the code. This demo will open our image file and apply every fixed filter available in ImageFilter to it and save that new filtered image in a unique temporary directory. The location of this temporary directory is retrieved, so we can open it with our OS's file explorer and view the created images. As an optional exercise, try extending this demo class to perform other filters available in ImageFilter on the given image. How to do it... The example in this section shows how we can process all the images in a certain folder. 
We specify a target path, and the program reads all the image files in that target path (images folder) and resizes them to a specified ratio (0.1 in this example), and saves each one in a target folder called thumbnail_folder: import os import sys from PIL import Image class Thumbnailer(object): def __init__(self, src_folder=None): self.src_folder = src_folder self.ratio = .3 self.thumbnail_folder = "thumbnails" def _create_thumbnails_folder(self): thumb_path = os.path.join(self.src_folder, self.thumbnail_folder) if not os.path.isdir(thumb_path): os.makedirs(thumb_path) def _build_thumb_path(self, image_path): root = os.path.dirname(image_path) name, ext = os.path.splitext(os.path.basename(image_path)) suffix = ".thumbnail" return os.path.join(root, self.thumbnail_folder, name + suffix + ext) def _load_files(self): files = set() for each in os.listdir(self.src_folder): each = os.path.abspath(self.src_folder + '/' + each) if os.path.isfile(each): files.add(each) return files def _thumb_size(self, size): return (int(size[0] * self.ratio), int(size[1] * self.ratio)) def create_thumbnails(self): self._create_thumbnails_folder() files = self._load_files() for each in files: print "Processing: " + each try: img = Image.open(each) thumb_size = self._thumb_size(img.size) resized = img.resize(thumb_size, Image.ANTIALIAS) savepath = self._build_thumb_path(each) resized.save(savepath) except IOError as ex: print "Error: " + str(ex) if __name__ == "__main__": # Usage: # ch06_rec01_02_pil_thumbnails.py my_images assert len(sys.argv) == 2 src_folder = sys.argv[1] if not os.path.isdir(src_folder): print "Error: Path '{0}' does not exits.".format((src_folder)) sys.exit(-1) thumbs = Thumbnailer(src_folder) # optionally set the name of theachumbnail folder relative to *src_folder*. thumbs.thumbnail_folder = "THUMBS" # define ratio to resize image to # 0.1 means the original image will be resized to 10% of its size thumbs.ratio = 0.1 # will create set of images in temporary folder thumbs.create_thumbnails() How it works... For the given src_folder folder, we load all the files in this folder and try to load each file using Image.open(); this is the logic of the create_thumbnails() function. If the file we try to load is not an image, IOError will be thrown, and it will print this error and skip to next file in the sequence. If we want to have more control over what files we load, we should change the _load_files() function to only include files with certain extension (file type): for each in os.listdir(self.src_folder): if os.path.isfile(each) and os.path.splitext(each) is in ('.jpg','.png'): self._files.add(each) This is not foolproof, as file extension does not define file type, it just helps the operating system to attach a default program to the file, but it works in majority of the cases and is simpler than reading a file header to determine the file content (which still does not guarantee that the file really is the first couple of bytes, say it is). There's more... With PIL, although not used very often, we can easily convert images from one format to the other. This is achievable with two simple operations: first open an image in a source format using open(), and then save that image in the other format using save(). Format is defined either implicitly via filename extension (.png or .jpeg), or explicitly via the format of the argument passed to the save() function.

Preparation Analysis of Data Source

Packt
22 Nov 2013
6 min read
(For more resources related to this topic, see here.) List of data sources Here, the wide varieties of data sources that are supported in the PowerPivot interface are given in brief. The vital part is to install providers such as OLE DB and ODBC that support the existing data source, because when installing the PowerPivot add-in, it will not install the provider too and some providers might already be installed with other applications. For instance, if there is a SQL Server installed in the system, the OLE DB provider will be installed with the SQL Server, so that later on it wouldn't be necessary to install OLE DB while adding data sources by using the SQL Server as a data source. Hence, make sure to verify the provider before it is added as a data source. Perhaps the provider is required only for relational database data sources. Relational database By using RDBMS, you can import tables and views into the PowerPivot workbook. The following is a list of various data sources: Microsoft SQL Server Microsoft SQL Azure Microsoft SQL Server Parallel Data Warehouse Microsoft Access Oracle Teradata Sybase Informix IBM DB2 Other providers (OLE DB / ODBC) Multidimensional sources Multidimensional data sources can only be added from Microsoft Analysis Services (SSAS). Data feeds The three types of data feeds are as follows: SQL Server Reporting Service (SSRS) Azure DataMarket dataset Other feeds such as atom service documents and single data feed Text files The two types of text files are given as follows: Excel file Text file Purpose of importing data from a variety of sources In order to make a decision about a particular subject area, you should analyze all the required data that is relevant to the subject area. If the data is stored in a variety of data sources, importing the data from different data sources has to be done. If all the data is only in one data source, only the data needs to be imported from the required tables for the subject and then various types of analysis can be done. The reason why users need to import data from different data sources is that they would then have an ample amount of data when they need to make any analysis. Another generic reason would be to cross-analyze data from different business systems such as Customer Relationship Management (CRM) and Campaign Management System (CMS). Data sourcing from only one source wouldn't be as sophisticated as an analysis done from different data sources as the amount of data from which the analysis was done for multisourced data is more detailed than the data from only a single source. It also might reveal conflicts between multiple data sources. In real time, usually in the e-commerce industry, blogs, and forum websites wouldn't ask for more details about customers at the time of registration, because the time consumed for long registrations would discourage the user, leading to cancellation the of the order. For instance, the customer table that would be stored in the database of an e-commerce industry would contain the following attributes: Customer FirstName LastName E-mail BirthDate Zip Code Gender However, this kind of industry needs to know their customers more in order to increase their sales. Since the industry only saves a few attributes about the customer during registration, it is difficult to track down the customers and it is even more difficult to advertise according to individual customers. 
Therefore, in order to find some relevant data about the customers, the e-commerce industries try to make another data source using the Internet or other types of sources. For instance, by using the Postcode given by the customer during registration, the user can get Country | State | City from various websites and then use the information obtained to make a table format either in Excel or CSV, as follows: Location Postcode City State Country So, finally the user would have two sources—one source is from the user's RDBMS database and the other source is from the Excel file created later—both of these can be used in order to make a customer analysis based on their location. General overview of ETL The main goal is to facilitate the development of data migration applications by applying data transformations. Extraction Transform Load (ETL) comprises of the first step in a data warehouse process. It is the most important and the hardest part, because it determines the quality of the data warehouse and the scope of the analyses the users will be able to build upon it. Let us discuss in detail what ETL is. The first substep is extraction. As the users would want to regroup all the information a company has, they will have to collect the data from various heterogeneous sources (operational data such as databases and/or external data such as CSV files) and various models (relational, geographical, web pages, and so on). The difficulty starts here. As data sources are heterogeneous, formats are usually different, and the users would have different types of data models for similar information (different names and/or formats) and different keys for the same objects. These are some of the main problems. The aim of the transformation task is to correct such problems, as much as possible. Let us now see what a transformation task is. The users would need to transform heterogeneous data of different sources into homogeneous data. Here are some examples of what they can do: Extraction of the data sources Identification of relevant data sources Filtering of non-admissible data Modification of formats or values Summary This article shows the users how to prepare data for analysis. We also covered different types of data sources that can be imported into the PowerPivot interface. Some information about the provider and a general overview of the ETL process has also been given. Users now know how the ETL process works in the PowerPivot interface; also, users were shown an introduction and the advantages of DAX. Resources for Article: Further resources on this subject: Creating a pivot table [Article] What are SSAS 2012 dimensions and cube? [Article] An overview of Oracle Hyperion Interactive Reporting [Article]

Common QlikView script errors

Packt
22 Nov 2013
3 min read
(For more resources related to this topic, see here.) QlikView error messages displayed during the running of the script, during reload, or just after the script is run are key to understanding what errors are contained in your code. After an error is detected and the error dialog appears, review the error, and click on OK or Cancel on the Script Error dialog box. If you have the debugger open, click on Close, then click on Cancel on the Sheet Properties dialog. Re-enter the Script Editor and examine your script to fix the error. Errors can come up as a result of syntax, formula or expression errors, join errors, circular logic, or any number of issues in your script. The following are a few common error messages you will encounter when developing your QlikView script. The first one, illustrated in the following screenshot, is the syntax error we received when running the code that missed a comma after Sales. This is a common syntax error. It's a little bit cryptic, but the error is contained in the code snippet that is displayed. The error dialog does not exactly tell you that it expected a comma in a certain place, but with practice, you will realize the error quickly. The next error is a circular reference error. This error will be handled automatically by QlikView. You can choose to accept QlikView's fix of loosening one of the tables in the circular reference (view the data model in Table Viewer for more information on which table is loosened, or view the Document Properties dialog, Tables tab to find out which table is marked Loosely Coupled). Alternatively, you can choose another table to be loosely coupled in the Document Properties, Tables tab, or you can go back into the script and fix the circular reference with one of the methods. The following screenshot is a warning/error dialog displayed when you have a circular reference in a script: Another common issue is an unknown statement error that can be caused by an error in writing your script—missed commas, colons, semicolons, brackets, quotation marks, or an improperly written formula. In the case illustrated in the following screenshot, the error has encountered an unknown statement—namely, the Customers line that QlikView is attempting to interpret as Customers Load *…. The fix for this error is to add a colon after Customers in the following way: Customers: There are instances when a load script will fail silently. Attempting to store a QVD or CSV to a file that is locked by another user viewing it is one such error. Another example is when you have two fields with the same name in your load statement. The debugger can help you find the script lines in which the silent error is present. Summary In this article we learned about QlikView error messages displayed during the script execution. Resources for Article: Further resources on this subject: Meet QlikView [Article] Introducing QlikView elements [Article] Linking Section Access to multiple dimensions [Article]

Portlet

Packt
22 Nov 2013
14 min read
(For more resources related to this topic, see here.) The Spring MVC portlet The Spring MVC portlet follows the Model-View-Controller design pattern. The model refers to objects that imply business rules. Usually, each object has a corresponding table in the database. The view refers to JSP files that will be rendered into the HTML markup. The controller is a Java class that distributes user requests to different JSP files. A Spring MVC portlet usually has the following folder structure: In the previous screenshot, there are two Spring MVC portlets: leek-portlet and lettuce-portlet. You can see that the controller classes are clearly named as LeekController.java and LettuceController.java. The JSP files for the leek portlet are view/leek/leek.jsp, view/leek/edit.jsp, and view/leek/help.jsp. The definition of the leek portlet in the portlet.xml file is as follows: <portlet> <portlet-name>leek</portlet-name> <display-name>Leek</display-name> <portlet-class>org.springframework.web.portlet.DispatcherPortlet</portlet-class> <init-param> <name>contextConfigLocation</name> <value>/WEB-INF/context/leek-portlet.xml</value> </init-param> <supports> <mime-type>text/html</mime-type> <portlet-mode>view</portlet-mode> <portlet-mode>edit</portlet-mode> <portlet-mode>help</portlet-mode> </supports> ... <supported-publishing-event> <qname >x:ipc.share</qname> </supported-publishing-event> </portlet> You can see from the previous code that the portlet class for a Spring MVC portlet is the org.springframework.web.portlet.DispatcherPortlet.java class. When a Spring MVC portlet is called, this class runs. It also calls the WEB-INF/context/leek/portlet.xml file and initializes the singletons defined in that file when the leek portlet is deployed. The leek portlet supports the view, edit, and help mode. It can also fire a portlet event with ipc.share as its qualified name. Yo can use the method to import the leek and lettuce portlets (whose source code can be downloaded from the Packt site) to your Liferay IDE. Then, carry out the following steps: Deploy the leek-portlet package and wait until the leek portlet and lettuce portlet are registered by the Liferay Portal. Log in as the portal administrator and add the two Spring MVC portlets onto a portal page. Your portal page should look similar to the following screenshot: The default view of the leek portlet comes from the view/leek/leek.jsp file whose logic is defined through the following method in the com.uibook.leek.portlet.LeekController.java class: @RequestMapping public String render(Model model, SessionStatus status, RenderRequest req) { return "leek"; } This method calls the view/leek/leek.jsp file. In the default view of the leek portlet, when you click on the radio button for Snow water from last winter and then on the Get Water button, the following form will be submitted: <form action="http://localhost:8080/web/uibook/home?p_auth=wwMoBV4C&p_p_id=leek_WAR_leekportlet&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_leek_WAR_leekportlet_action=sprayWater" id="_leek_WAR_leekportlet_leekFm" method="post" name="_leek_WAR_leekportlet_leekFm"> This form will fire an action URL because p_p_lifecycle is equal to 1. 
As the action name is sprayWater in the URL, the DispatcherPortlet.java class (as specified in the portlet.xml file) calls the following method: @ActionMapping(params="action=sprayWater") public void sprayWater(ActionRequest request, ActionResponse response, SessionStatus sessionStatus) { String waterType = request.getParameter("waterSupply"); if(waterType != null){ request.setAttribute("theWaterIs", waterType); sessionStatus.setComplete(); } } This method simply gets the value for the waterSupply parameter as specified in the following code, which comes from the view/leek/leek.jsp file: <input type="radio" name="<portlet:namespace />waterSupply" value="snow water from last winter">Snow water from last winter The value is snow water from last winter, which is set as a request attribute. As the previous sprayWater(…) method does not specify a request parameter for a JSP file to be rendered, the logic goes to the default view of the leek portlet. So, the view/leek/leek.jsp file will be rendered. Here, as you can see, the two-phase logic is retained in the Spring MVC portlet, as has been explained in the Understanding a simple JSR-286 portlet section of this article. Now the theWaterIs request attribute has a value, which is snow water from last winter. So, the following code in the leek.jsp file runs and displays the Please enjoy some snow water from last winter. message, as shown in the previous screenshot: <c:if test="${not empty theWaterIs}"> <p>Please enjoy some ${theWaterIs}.</p> </c:if> In the previous screenshot, the Passing you a gift... link is rendered with the following code in the leek.jsp file: <a href="<portlet:actionURL name="shareGarden"></portlet:actionURL>">Passing you a gift ...</a> When this link is clicked, an action URL named shareGarden is fired. So, the DispatcherPortlet.java class will call the following method: @ActionMapping("shareGarden") public void pitchBallAction(SessionStatus status, ActionResponse response) { String elementType = null; Random random = new Random(System.currentTimeMillis()); int elementIndex = random.nextInt(3) + 1; switch(elementIndex) { case 1 : elementType = "sunshine"; break; ... } QName qname = new QName("http://uibook.com/events","ipc.share"); response.setEvent(qname, elementType); status.setComplete(); } This method gets a value for elementType (the type of water in our case) and sends out this elementType value to another portlet based on the ipc.share qualified name. The lettuce portlet has been defined in the portlet.xml file as follows to receive such a portlet event: <portlet> <portlet-name>lettuce</portlet-name> ... <supported-processing-event> <qname >x:ipc.share</qname> </supported-processing-event> </portlet> When the ipc.share portlet event is sent, the portal page refreshes. Because the lettuce portlet is on the same page as the leek portlet, the portlet event is received by the following method in the com.uibook.lettuce.portlet.LettuceController.java class: @EventMapping(value ="{http://uibook.com/events}ipc.share") public void receiveEvent(EventRequest request, EventResponse response, ModelMap map) { Event event = request.getEvent(); String element = (String)event.getValue(); map.put("element", element); response.setRenderParameter("element", element); } This receiveEvent(…) method receives the ipc.share portlet event, gets the value in the event (which can be sunshine, rain drops, wind, or space), and puts it in the ModelMap object with element as the key. 
Now, the following code in the view/lettuce/lettuce.jsp file runs: <c:choose> <c:when test="${empty element}"> <p> Please share the garden with me! </p> </c:when> <c:otherwise> <p>Thank you for the ${element}!</p> </c:otherwise> </c:choose> As the element parameter now has a value, a message similar to Thank you for the wind will show in the lettuce portlet. The wind is a gift from the leek to the lettuce portlet. In the default view of the leek portlet, there is a Some shade, please! button. This button is implemented with the following code in the view/leek/leek.jsp file: <button type="button" onclick="<portlet:namespace />loadContentThruAjax();">Some shade, please!</button> When this button is clicked, a _leek_WAR_leekportlet_loadContentThruAjax() JavaScript function will run: function <portlet:namespace />loadContentThruAjax() { ... document.getElementById("<portlet:namespace />content").innerHTML=xmlhttp.responseText; ... xmlhttp.open('GET','<portlet:resourceURL escapeXml="false" id="provideShade"/>',true); xmlhttp.send(); } This loadContentThruAjax() function is an Ajax call. It fires a resource URL whose ID is provideShade. It maps the following method in the com.uibook.leek.portlet.LeekController.java class: @ResourceMapping(value = "provideShade") public void provideShade(ResourceRequest resourceRequest, ResourceResponse resourceResponse) throws PortletException, IOException { resourceResponse.setContentType("text/html"); PrintWriter out = resourceResponse.getWriter(); StringBuilder strB = new StringBuilder(); strB.append("The banana tree will sway its leaf to cover you from the sun."); out.println(strB.toString()); out.close(); } This method simply sends the The banana tree will sway its leaf to cover you from the sun message back to the browser. The previous loadContentThruAjax() method receives this message, inserts it in the <div id="_leek_WAR_leekportlet_content"></div> element, and shows it. About the Vaadin portlet Vaadin is an open source web application development framework. It consists of a server-side API and a client-side API. Each API has a set of UI components and widgets. Vaadin has themes for controlling the appearance of a web page. Using Vaadin, you can write a web application purely in Java. A Vaadin application is like a servlet. However, unlike the servlet code, Vaadin has a large set of UI components, controls, and widgets. For example, in correspondence to the <table> HTML element, the Vaadin API has a com.vaadin.ui.Table.java class. The following is a comparison between servlet table implementation and Vaadin table implementation: Servlet Code Vaadin Code PrintWriter out = response.getWriter(); out.println("<table>n" +    "<tr>n" +    "<td>row 2, cell 1</td>n" +    "<td>row 2, cell 2</td>" +    "</tr>n" +    "</table>"); sample = new Table(); sample.setSizeFull(); sample.setSelectable(true); ... sample.setColumnHeaders(new String[] { "Country", "Code" });   Basically, if there is a label element in HTML, there is a corresponding Label.java class in Vaadin. In the sample Vaadin code, you will find the use of the com.vaadin.ui.Button.java and com.vaadin.ui.TextField.java classes. Vaadin supports portlet development based on JSR-286. 
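One piece of this inter-portlet flow is not shown above: the render method on the lettuce side that moves the element render parameter into the model for lettuce.jsp. The following is a minimal sketch only, assuming LettuceController follows the same Spring Portlet MVC conventions as LeekController; it is not the article's actual source code.

// Hypothetical render method for LettuceController.java (illustrative sketch).
// It reads the render parameter set by receiveEvent(...) and exposes it to
// view/lettuce/lettuce.jsp under the "element" key.
// Requires org.springframework.web.bind.annotation.RequestParam and
// org.springframework.ui.ModelMap imports.
@RequestMapping
public String render(
        @RequestParam(value = "element", required = false) String element,
        ModelMap map) {
    if (element != null) {
        map.put("element", element);
    }
    return "lettuce";
}

With a method along these lines, the JSP's ${element} expression resolves against the render-phase model, which is why the thank-you message only appears after the event has been received.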
Vaadin support in Liferay Portal Starting with Version 6.0, the Liferay Portal was bundled with the Vaadin Java API, themes, and a widget set described as follows: ${APP_SERVER_PORTAL_DIR}/html/VAADIN/themes/ ${APP_SERVER_PORTAL_DIR}/html/VAADIN/widgetsets/ ${APP_SERVER_PORTAL_DIR}/WEB-INF/lib/vaadin.jar A Vaadin control panel for the Liferay Portal is also available for download. It can be used to rebuild the widget set when you install new add-ons in the Liferay Portal. In the ${LPORTAL_SRC_DIR}/portal-impl/src/portal.properties file, we have the following Vaadin-related setting: vaadin.resources.path=/html vaadin.theme=liferay vaadin.widgetset=com.vaadin.portal.gwt.PortalDefaultWidgetSet In this section, we will discuss two Vaadin portlets. These two Vaadin portlets are run and tested in Liferay Portal 6.1.20 because, at the time of writing, the support for Vaadin is not available in the new Liferay Portal 6.2. It is expected that when the Generally Available (GA) version of Liferay Portal 6.2 is available, the support for Vaadin portlets in the new Liferay Portal 6.2 will be ready. Vaadin portlet for CRUD operations CRUD stands for create, read, update, and delete. We will use a peanut portlet to illustrate the organization of a Vaadin portlet. In this portlet, a user can create, read, update, and delete data. This portlet is adapted from a SimpleAddressBook portlet from a Vaadin demo. Its structure is as shown in the following screenshot: You can see that it does not have JSP files. The view, model, and controller are all incorporated in the PeanutApplication.java class. Its portlet.xml file has the following content: <portlet-class>com.vaadin.terminal.gwt.server.ApplicationPortlet2</portlet-class> <init-param> <name>application</name> <value>peanut.PeanutApplication</value> </init-param> This means that when the Liferay Portal calls the peanut portlet, the com.vaadin.terminal.gwt.server.ApplicationPortlet2.java class will run. This ApplicationPortlet2.java class will in turn call the peanut.PeanutApplication.java class, which will retrieve data from the database and generate the HTML markup. The default UI of the peanut portlet is as follows: This default UI is implemented with the following code: HorizontalSplitPanel splitPanel = new HorizontalSplitPanel(); setMainWindow(new Window("Address Book", splitPanel)); VerticalLayout left = new VerticalLayout(); left.setSizeFull(); left.addComponent(contactList); contactList.setSizeFull(); left.setExpandRatio(contactList, 1); splitPanel.addComponent(left); splitPanel.addComponent(contactEditor); splitPanel.setHeight("450"); contactEditor.setCaption("Contact details editor"); contactEditor.setSizeFull(); contactEditor.getLayout().setMargin(true); contactEditor.setImmediate(true); bottomLeftCorner.setWidth("100%"); left.addComponent(bottomLeftCorner); The previous code comes from the initLayout() method of the PeanutApplication.java class. This method is run when the portal page is first loaded. The new Window("Address Book", splitPanel) statement instantiates a window area, which is the whole portlet UI. This window is set as the main window of the portlet; every portlet has a main window. The splitPanel attribute splits the main window into two equal parts vertically; it is like the 2 Columns (50/50) page layout of Liferay. 
The splitPanel.addComponent(left) statement adds the contact information table to the left pane of the main window, while the splitPanel.addComponent(contactEditor) statement adds the contact details of the editor to the right pane of the main window. The left variable is a com.vaadin.ui.VerticalLayout.java object. In the left.addComponent(bottomLeftCorner) statement, the left object adds a bottomLeftCorner object to itself. The bottomLeftCorner object is a com.vaadin.ui.HorizontalLayout.java object. It takes the space across the left vertical layout under the contact information table. This bottomLeftCorner horizontal layout will house the contact-add button and the contact-remove button. The following screenshot gives you an idea of how the screen will look: When the + icon is clicked, a button click event will be fired which runs the following code: Object id = ((IndexedContainer) contactList.getContainerDataSource()).addItemAt(0); contactList.getItem(id).getItemProperty("First Name").setValue("John"); contactList.getItem(id).getItemProperty("Last Name").setValue("Doe"); This code adds an entry in the contactList object (contact information table) initializing the contact's first name to John and the last name to Doe. At the same time, the ValueChangeListener property of the contactList object is triggered and runs the following code: contactList.addListener(new Property.ValueChangeListener() { public void valueChange(ValueChangeEvent event) { Object id = contactList.getValue(); contactEditor.setItemDataSource(id == null ? null : contactList .getItem(id)); contactRemovalButton.setVisible(id != null); } }); This code populates the contactEditor variable, a com.vaadin.ui.Form.Form.java object, with John Doe's contact information and displays the Contact details editor section in the right pane of the main window. After that, an end user can enter John Doe's other contact details. The end user can also update John Doe's first and last names. If you have noticed, the last statement of the previous code snippet mentions contactRemovalButton. At this time, the John Doe entry in the contact information table is highlighted. If the end user clicks on the contact removal button, this information will be removed from both the contact information table and the contact details editor. Actually, the end user can highlight any entry in the contact information table and edit or delete it. You may have seen that during the whole process of creating, reading, updating, and deleting the contact, the portal page URL did not change and the portal page did not refresh. All the operations were performed through Ajax calls to the application server. This means that only a few database accesses happened during the whole process. This improves the site performance and reduces load on the application server. It also implies that if you develop Vaadin portlets in the Liferay Portal, you do not have to know the friendly URL configuration skill on a Liferay Portal project. In the peanut portlet, a developer cannot retrieve the logged-in user in the code, which is a weak point. In the following section, a potato portlet is implemented in such a way that a developer can retrieve the Liferay Portal information, including the logged-in user information. Summary In this article, we learned about portlets and their development. We learned ways todevelop simple JSR 286 portlets, SpringMVC portlets, and Vaadin portlets. We also learned to implement the view, edit, and help modes of a portlet. 
Resources for Article: Further resources on this subject: Setting up and Configuring a Liferay Portal [Article] Liferay, its Installation and setup [Article] Building your First Liferay Site [Article]
Basic Concepts and Architecture of Cassandra

Packt
21 Nov 2013
7 min read
(For more resources related to this topic, see here.) CAP theorem If you want to understand Cassandra, you first need to understand the CAP theorem. The CAP theorem (published by Eric Brewer at the University of California, Berkeley) basically states that it is impossible for a distributed system to provide you with all of the following three guarantees: Consistency: Updates to the state of the system are seen by all the clients simultaneously Availability: Guarantee of the system to be available for every valid request Partition tolerance: The system continues to operate despite arbitrary message loss or network partition Cassandra provides users with stronger availability and partition tolerance with tunable consistency tradeoff; the client, while writing to and/or reading from Cassandra, can pass a consistency level that drives the consistency requirements for the requested operations. BigTable / Log-structured data model In a BigTable data model, the primary key and column names are mapped with their respective bytes of value to form a multidimensional map. Each table has multiple dimensions. Timestamp is one such dimension that allows the table to version the data and is also used for internal garbage collection (of deleted data). The next figure shows the data structure in a visual context; the row key serves as the identifier of the column that follows it, and the column name and value are stored in contiguous blocks: It is important to note that every row has the column names stored along with the values, allowing the schema to be dynamic. Column families Columns are grouped into sets called column families, which can be addressed through a row key (primary key). All the data stored in a column family is of the same type. A column family must be created before any data can be stored; any column key within the family can be used. It is our intent that the number of distinct column families in a keyspace should be small, and that the families should rarely change during an operation. In contrast, a column family may have an unbounded number of columns. Both disk and memory accounting are performed at the column family level. Keyspace A keyspace is a group of column families; replication strategies and ACLs are performed at the keyspace level. If you are familiar with traditional RDBMS, you can consider the keyspace as an alternative name for the schema and the column family as an alternative name for tables. Sorted String Table (SSTable) An SSTable provides a persistent file format for Cassandra; it is an ordered immutable storage structure from rows of columns (name/value pairs). Operations are provided to look up the value associated with a specific key and to iterate over all the column names and value pairs within a specified key range. Internally, each SSTable contains a sequence of row keys and a set of column key/value pairs. There is an index and the start location of the row key in the index file, which is stored separately. The index summary is loaded into the memory when the SSTable is opened in order to optimize the amount of memory needed for the index. A lookup for actual rows can be performed with a single disk seek and by scanning sequentially for the data. Memtable A memtable is a memory location where data is written to during update or delete operations. A memtable is a temporary location and will be flushed to the disk once it is full to form an SSTable. 
Basically, an update or a write operation to Cassandra is a sequential write to the commit log on disk and an in-memory update; hence, writes are as fast as writing to memory. Once the memtables are full, they are flushed to the disk, forming new SSTables. Reads in Cassandra merge the data from the different SSTables with the data in the memtables. Reads should always be requested with a row key (primary key), with the exception of a key range scan. When multiple updates are applied to the same column, Cassandra uses client-provided timestamps to resolve conflicts. Delete operations on a column work a little differently; because SSTables are immutable, Cassandra writes a tombstone to avoid random writes. A tombstone is a special value written to Cassandra instead of removing the data immediately. The tombstone can then be sent to nodes that did not get the initial remove request, and can be removed during GC. Compaction To bound the number of SSTable files that must be consulted on reads and to reclaim the space taken by unused data, Cassandra performs compactions. In a nutshell, compaction merges n SSTables (a configurable number) into one big SSTable. Newly flushed SSTables start out at roughly the same size as the memtables; as they are repeatedly compacted together, older SSTables grow exponentially bigger. Partitioning and replication Dynamo style As mentioned previously, the partitioning and replication scheme is motivated by the Dynamo paper; let's talk about it in detail. Gossip protocol Cassandra is a peer-to-peer system with no single point of failure; the cluster topology information is communicated via the Gossip protocol. The Gossip protocol is similar to real-world gossip, where a node (say B) tells a few of its peers in the cluster what it knows about the state of a node (say A). Those nodes tell a few other nodes about A, and over a period of time, all the nodes know about A. Distributed hash table The key feature of Cassandra is its ability to scale incrementally. This includes the ability to dynamically partition the data over a set of nodes in the cluster. Cassandra partitions data across the cluster using consistent hashing and randomly distributes the rows over the network using the hash of the row key. When a node joins the ring, it is assigned a token that determines where the node is placed in the ring. Now consider a case where the replication factor is 3; clients write to or read from any coordinator in the cluster (every node in the system can act as both a coordinator and a data node). The coordinator calculates a hash of the row key, which gives it enough information to route the write to the right node in the ring. The coordinator also looks at the replication factor and writes to the neighboring nodes in ring order.
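To make the token and replica placement just described more concrete, the following is a small, simplified Java sketch of a token ring. It is an illustration only, not Cassandra's actual implementation: the node layout is invented, and MD5 is used here because Cassandra's RandomPartitioner hashes row keys with MD5 into a token space, but the real code handles token ranges and replication strategies far more carefully.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified illustration of token-based placement with a replication factor.
public class TokenRingSketch {

    // token -> node; each node is responsible for the range ending at its token
    private final TreeMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();

    void addNode(String node, BigInteger token) {
        ring.put(token, node);
    }

    // Hash the row key into the token space (MD5, as in RandomPartitioner).
    static BigInteger tokenFor(String rowKey) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(rowKey.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);
    }

    // First replica: the next node clockwise from the key's token;
    // remaining replicas: the following nodes in ring order, wrapping around.
    List<String> replicasFor(String rowKey, int replicationFactor) throws Exception {
        BigInteger token = tokenFor(rowKey);
        List<String> replicas = new ArrayList<String>();
        SortedMap<BigInteger, String> tail = ring.tailMap(token);
        for (String node : tail.values()) {
            if (replicas.size() == replicationFactor) break;
            replicas.add(node);
        }
        for (String node : ring.values()) {        // wrap around the ring
            if (replicas.size() == replicationFactor) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }
}

With a replication factor of 3, replicasFor("some-row-key", 3) returns the node that owns the key's token range plus its two neighbors in ring order, which mirrors how the coordinator picks the nodes to write to.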
If W + R > Replication Factor, where W is the number of nodes to block on write and R the number to block on reads, the clients will see a strong consistency behavior: ONE: R/W at least one node TWO: R/W at least two nodes QUORUM: R/W from at least floor (N/2) + 1, where N is the replication factor When nodes are down for maintenance, Cassandra will store hints for updates performed on that node, which can be delivered back when the node is available in the future. To make data consistent, Cassandra relies on hinted handoffs, read repairs, and anti-entropy repairs. Summary In this article, we have discussed basic concepts and basic building blocks, including the motivation in building a new datastore solution. Resources for Article: Further resources on this subject: Apache Cassandra: Libraries and Applications [Article] About Cassandra [Article] Quick start – Creating your first Java application [Article]
Security considerations

Packt
21 Nov 2013
9 min read
(For more resources related to this topic, see here.) Security considerations One general piece of advice that applies to every type of application development is to develop the software with security in mind, meaning it is more expensive for an error-prone application to first implement the needed features and after that to make modifications in them to enforce security. Instead, this should be done simultaneously. In this article we are raising security awareness, and next we will learn about which measures we can apply and what we can do in order to have more secure applications. Use TLS TLS (the cryptographic protocol named Transport Layer Security) is the result of the standardization of the SSL protocol (Version 3.0), which was developed by Netscape and was proprietary. Thus, in various documents and specifications, we can find the use of TLS and SSL interchangeably, even though there are actually differences in the protocol. From a security standpoint, it is recommended that all requests sent from the client during the execution of a grant flow are done over TLS. In fact, it is recommended TLS be used on both sides of the connection. OAuth 2.0 relies heavily on TLS; this is done in order to maintain confidentiality of the exchanged data over the network by providing encryption and integrity on top of the connection between the client and server. In retrospect, in OAuth 1.0 the use of TLS was not mandatory, and parts of the authorization flow (on both server side and client side) had to deal with cryptography, which resulted in various implementations, some good and some sloppy. When we make an HTTP request (for example, in order to execute some OAuth 2.0 grant flow), in order to make the connection secure the HTTP client library that is used to execute the request has to be configured to use TLS. TLS is to be used by the client application when sending requests to both authorization and resource servers, and is to be used by the servers themselves as well. The result is an end-to-end TLS protected connection. If end-to-end protection cannot be established, it is advised to reduce the scope and lifetime of the access tokens that are issued by the authorization server. The OAuth2.0 specification states that the use of TLS is mandatory when sending requests to the authorization and token endpoints and when sending requests using password authentication. Access tokens, refresh tokens, username and password combinations, and client credentials must be transmitted with the use of TLS. By using TLS, the attackers that are trying to intercept/eavesdrop the exchanged information during the execution of the grant flow will not be able to do so. If TLS is not used, attackers can eavesdrop on an access token, an authorization code, a username and password combination, or other critical information. This means that the use of TLS prevents man-in-the-middle attacks and replaying of already fulfilled requests (also called replay attacks). By performing replay attempts, the attackers can issue themselves new access tokens or can perform replays on a request towards resource servers and modify or delete data belonging to the resource owner. Last but not least, the authorization server can enforce the use of TLS on every endpoint in order to reduce the risk of phishing attacks. 
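As a concrete illustration, the following is a minimal Java sketch of a client calling a token endpoint strictly over HTTPS, using only the standard java.net API. The endpoint URL, authorization code, redirect URI, and client identifier are made-up placeholders (the parameter names follow the OAuth 2.0 specification); a real client would also authenticate itself, validate certificates against a proper trust store, and parse the JSON response.

import java.io.OutputStream;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.HttpsURLConnection;

// Hedged sketch: the endpoint and request values are placeholders; the point is that
// the request is refused unless it is HTTPS, so tokens never travel in clear text.
public class TokenRequestOverTls {

    public static void main(String[] args) throws Exception {
        URL tokenEndpoint = new URL("https://auth.example.com/oauth2/token");
        if (!"https".equals(tokenEndpoint.getProtocol())) {
            throw new IllegalArgumentException("Token endpoint must use TLS");
        }

        HttpsURLConnection conn = (HttpsURLConnection) tokenEndpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        String body = "grant_type=authorization_code"
                + "&code=" + URLEncoder.encode("SplxlOBeZQQYbYS6WxSbIA", "UTF-8")
                + "&redirect_uri=" + URLEncoder.encode("https://client.example.com/cb", "UTF-8")
                + "&client_id=s6BhdRkqt3";

        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Token endpoint responded with HTTP " + conn.getResponseCode());
    }
}

The explicit protocol check is a cheap guard against a misconfigured http:// endpoint slipping in; with it in place, the authorization code and the returned access token are only ever transmitted inside the TLS-protected connection.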
Ensure web server application protection For client applications that are actually web applications deployed on a server, there are numerous protection measures that can be taken into account so that the server, the database, and the configuration files are kept safe. The list is not limited and can vary between scenarios and environments; some of the key measures are as follows: Install recommended security additions and tools for the given web and database servers that are in use. Restrict remote administrator access only to the people that require it (for example, for server maintenance and application monitoring). Regulate which server user can have which roles, and regulate permissions for the resources available to them. Disable or remove unnecessary services on the server. Regulate the database connections so that they are only available to the client application. Close unnecessary open ports on the server; leaving them open can give an advantage to the attacker. Configure protection against SQL injection. Configure database and file encryption for vital information stored (credentials and so on). Avoid storing credentials in plain text format. Keep the software components that are in use updated in order to avoid security exploitation. Avoid security misconfiguration. It is important to have in mind what kind of web server it is, which database is used, which modules the client application uses, and on which services the client application depends, so that we can research how to apply the security measures appropriately. OWASP (Open Web Application Security Project) provides additional documentation on security measures and describes the industry's best practices regarding software security. It is an additional resource recommended for reference and research on this topic, and can be found at https://www.owasp.org. Ensure mobile and desktop application protection Mobile and desktop applications can be installed on devices and machines that can be part of internal/enterprise or external environments. They are more vulnerable compared to the applications deployed on regulated server environments. Attackers have a better chance to try to extract the source code from the applications and other data that comes with them. In order to provide the best possible security, some of the key measures are as follows: Use secure storage mechanisms provided by additional programming libraries and by features offered by the operating system for which the application is developed. In multiuser operating systems, store user specific data such as credentials or access and refresh tokens in locations that are not available to other users on the same system. As mentioned previously, credentials should not be stored in plain text format and should be encrypted. If using an embedded database (such as SQLite in most cases), try to enforce security measures against SQL injection and encrypt the vital information (or encrypt the whole embedded database). For mobile devices, advise the end user to utilize device lock (usually with a PIN, password, or face unlock). Implement an optional PIN or password lock on the application level that the end user can activate if desired (which can also serve as an alternative to the previous locking measure). Sanitize and validate the value from any input fields that are used in the applications, in order to avoid code injection, which can lead to changing the behavior or exposing data stored by the client application. 
When the application is ready to be packaged for production use (to be used by end users), perform code analysis for obfuscating code and removing the unused code. This will produce a smaller client application in file size, which will perform the same but it will be harder to reverse engineer. As usual, for additional reference and research we can refer to the OAuth2.0 threat model RFC document, to OWASP, and to security documentation specific to the programming language, tools, libraries, and operating system that the client application is built for. Utilize the state parameter As mentioned, with this parameter the state between the request and the callback is maintained. Even if it is an optional parameter it is highly advisable to use, and the value from the callback response will be validated if it is equal to the one that was sent. When setting the value for the state parameter in the request Don't use predictable values that can be guessed by attackers. Don't repeat the same value often between requests. Don't use values that can contain and expose some internal business logic of the system and can be used maliciously if discovered. Use session values: If the user agent—with which the user has authenticated and approved the authorization request—has its session cookie available, calculate a hash from it and use that one as the state value. Or use some string generator: If a session variable is not available as an alternative, we can use some generated programmable value. Some real world implementations do this by generating unique identifiers and using them as state values, commonly achieved by generating a random UUID (universally unique identifier) and converting it to a hexadecimal value. Keep track of which state value was set for which request (user session in most cases) and redirect URI, in order to validate that the returned one contains an equal value. Use refresh tokens when available For client applications that have obtained an access token and a refresh token along with it, upon access token expiry it is a good practice to request a new one by using the refresh token instead of going through the whole grant flow again. With this measure we are transmitting less data over the network and are providing less exposure that the attacker can monitor. Request the needed scope only As briefly mentioned previously in this article, it is highly advisable to specify only the required scope when requesting an access token instead of specifying the maximum one that is available. With this measure, if an attacker gets hold of the access token, he can take damaging actions only to the level specified by the scope, and not more. This is done for damage minimization until the token is revoked and invalidated. Summary In this article we learned what data is to be protected, what features OAuth 2.0 contains regarding information security, and which precautions we should take into consideration. Resources for Article: Further resources on this subject: Deploying a Vert.x application [Article] Building tiny Web-applications in Ruby using Sinatra [Article] Fine Tune the View layer of your Fusion Web Application [Article]
Securing the Hadoop Ecosystem

Packt
20 Nov 2013
6 min read
(For more resources related to this topic, see here.) Each ecosystem component has its own security challenges and needs to be configured uniquely based on its architecture to secure them. Each of these ecosystem components has end users directly accessing the component or a backend service accessing the Hadoop core components (HDFS and MapReduce). The following are the topics that we'll be covering in this article: Configuring authentication and authorization for the following Hadoop ecosystem components: Hive Oozie Flume HBase Sqoop Pig Best practices in configuring secured Hadoop components Configuring Kerberos for Hadoop ecosystem components The Hadoop ecosystem is growing continuously and maturing with increasing enterprise adoption. In this section, we look at some of the most important Hadoop ecosystem components, their architecture, and how they can be secured. Securing Hive Hive provides the ability to run SQL queries over the data stored in the HDFS. Hive provides the Hive query engine that converts Hive queries provided by the user to a pipeline of MapReduce jobs that are submitted to Hadoop (JobTracker or ResourceManager) for execution. The results of the MapReduce executions are then presented back to the user or stored in HDFS. The following figure shows a high-level interaction of a business user working with Hive to run Hive queries on Hadoop: There are multiple ways a Hadoop user can interact with Hive and run Hive queries; these are as follows: The user can directly run the Hive queries using Command Line Interface (CLI). The CLI connects to the Hive metastore using the metastore server and invokes Hive query engine directly to execute Hive query on the cluster. Custom applications written in Java and other languages interacts with Hive using the HiveServer. HiveServer, internally, uses the metastore server and the Hive Query Engine to execute the Hive query on the cluster. To secure Hive in the Hadoop ecosystem, the following interactions should be secured: User interaction with Hive CLI or HiveServer User roles and privileges needs to be enforced to ensure users have access to only authorized data The interaction between Hive and Hadoop (JobTracker or ResourceManager) has to be secured and the user roles and privileges should be propagated to Hadoop jobs To ensure secure Hive user interaction, there is a need to ensure that the user is authenticated by HiveServer or CLI before running any jobs on the cluster. The user has to first use the kinit command to fetch the Kerberos ticket. This ticket is stored in the credential cache and used to authenticate with Kerberos-enabled systems. Once the user is authenticated, Hive submits the job to Hadoop (JobTracker or ResourceManager). Hive needs to impersonate the user to execute MapReduce on the cluster. From Hive Version 0.11 onwards, HiveServer2 was introduced. The earlier HiveServer had serious security limitations related to user authentication. HiveServer2 supports Kerberos and LDAP authentication for the user authentication. When HiveServer2 is configured to have LDAP authentication, Hive users are managed using the LDAP store. Hive asks the users to submit the MapReduce jobs to Hadoop. Thus, if we configure HiveServer2 to use LDAP, only the user authentication between the client and HiveServer2 is addressed. The interaction of Hive with Hadoop is insecure, and Hive MapReduce will be able to access other users' data in the Hadoop cluster. 
On the other hand, when we use Kerberos authentication for Hive users with HiveServer2, the same user is impersonated to execute MapReduce on the Hadoop cluster. So it is recommended that, in a production environment, we configure HiveServer2 with Kerberos to have seamless authentication and access control for the users submitting Hive queries. The credential store for the Kerberos KDC can be configured to be LDAP so that we can centrally manage the credentials of the end users. To set up secured Hive interactions, we need to perform the following steps: One of the key steps in securing Hive interaction is to ensure that the Hive user is impersonated in Hadoop, as Hive executes a MapReduce job on the Hadoop cluster. To achieve this goal, we need to add the hive.server2.enable.impersonation configuration in hive-site.xml, and hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups in core-site.xml.
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/[email protected]</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>
Securing Hive using Sentry In the previous section, we saw how Hive authentication can be enforced using Kerberos, and how user privileges are enforced through user impersonation in Hadoop by the superuser. Sentry is one of the latest entrants in the Hadoop ecosystem and provides fine-grained user authorization for the data that is stored in Hive. Sentry provides fine-grained, role-based authorization to Hive and Impala. Sentry uses HiveServer2 and the metastore server to execute the queries on the Hadoop platform. However, user impersonation is turned off in HiveServer2 when Sentry is used. Sentry enforces user privileges on the Hadoop data using the Hive metastore. Sentry supports authorization policies per database/schema, which can be leveraged to enforce user management policies. More details on Sentry are available at the following URL: http://www.cloudera.com/content/cloudera/en/products/cdh/sentry.html Summary In this article, we learned how to configure Kerberos for Hadoop ecosystem components. We also looked at how to secure Hive using Sentry. Resources for Article: Further resources on this subject: Advanced Hadoop MapReduce Administration [Article] Managing a Hadoop Cluster [Article] Making Big Data Work for Hadoop and Solr [Article]