
How-To Tutorials - Data

SAP HANA Architecture

Packt
20 Dec 2013
12 min read
Understanding the SAP HANA architecture

Architecture is the key to SAP HANA being a game-changing, innovative technology. SAP HANA is architected so well that it stands apart from the other traditional databases available today. This section explains the various components of SAP HANA and their functions.

Getting ready

Enterprise application requirements have become more demanding: complex reports with heavy computation over huge volumes of transaction data, as well as business data in other formats (both structured and semi-structured). Data is written or updated and read from the database in parallel. This calls for the integration of transactional and analytical data into a single database, which is where SAP HANA has evolved. Columnar storage exploits modern hardware and technology (multiple CPU cores, large main memory, and caches) to meet the requirements of enterprise applications. Beyond this, the database must also support procedural logic, because certain tasks cannot be completed with simple SQL.

How it works...

The SAP HANA database consists of several services (servers). The index server is the most important of them; the others are the name server, preprocessor server, statistics server, and XS Engine:

- Index server: This server holds the actual data and the engines for processing it. When SQL or MDX is fired against the SAP HANA system in an authenticated session and transaction, the index server takes care of these commands and processes them.
- Name server: This server holds complete information about the system landscape and is responsible for the topology of the SAP HANA system. In a distributed system, SAP HANA instances run on multiple hosts; in this kind of setup, the name server knows where the components are running and how data is spread across the different servers.
- Preprocessor server: This server comes into the picture during text data analysis. The index server uses the capabilities of the preprocessor server for text analysis and searching, which helps extract the information on which text search capabilities are based.
- Statistics server: This server collects the data for the system monitor and helps you know the health of the SAP HANA system. It is responsible for collecting data related to the status, resource allocation/consumption, and performance of the SAP HANA system. Monitoring clients and the various alert monitors use the data collected by the statistics server, which also provides a history of measurement data for further analysis.
- XS Engine: The XS Engine allows external applications and application developers to access the SAP HANA system via XS Engine clients; for example, a web browser accesses SAP HANA apps built by application developers via HTTP. Application developers build applications using the XS Engine, and users access them over HTTP with a web browser. The persistent model in the SAP HANA database is converted into a consumption model for clients to access via HTTP. This allows an organization to host system services that are part of the SAP HANA database (for example, the Search service, a built-in web server that provides access to static content in the repository).

The following diagram shows the architecture of SAP HANA.

There's more...
Let us continue with the remaining components:

- SAP Host Agent: According to the new approach from SAP, the SAP Host Agent should be installed on all machines that are part of the SAP landscape. It is used by the Adaptive Computing Controller (ACC) to manage the system and by the Software Update Manager (SUM) for automatic updates.
- LM-structure: The LM-structure for SAP HANA contains information about the current installation details. This information is used by SUM during automatic updates.
- SAP Solution Manager diagnostic agent: This agent provides the data that SAP Solution Manager (SAP SOLMAN) needs to monitor the SAP HANA system. Once SAP SOLMAN is integrated with the SAP HANA system, the agent provides information about the database at a glance, including the database state and general system information such as alerts, CPU, memory, and disk usage.
- SAP HANA Studio repository: This helps end users update SAP HANA Studio to higher versions; the SAP HANA Studio repository is the code that performs this process.
- Software Update Manager for SAP HANA: This enables automatic updates of SAP HANA from the SAP Marketplace and patching of the SAP Host Agent. It also distributes the Studio repository to end users.

See also

http://help.sap.com/hana/SAP_HANA_Installation_Guide_en.pdf
SAP Notes 1793303 and 1514967

Explaining IMCE and its components

We have seen the architecture of SAP HANA and its components. In this section, we will learn about the in-memory computing engine (IMCE), its components, and their functions.

Getting ready

The SAP in-memory computing engine (formerly the Business Analytic Engine, BAE) is the core engine for SAP's next-generation, high-performance, in-memory solutions. It leverages technologies such as in-memory computing, columnar databases, massively parallel processing (MPP), and data compression to allow organizations to instantly explore and analyze large volumes of transactional and analytical data from across the enterprise in real time.

How it works...

In-memory computing allows the processing of massive quantities of real-time data in the main memory of the server, providing immediate results from analyses and transactions. The SAP in-memory computing database delivers the following capabilities:

- In-memory computing functionality with native support for row and columnar data stores, providing full ACID (atomicity, consistency, isolation, and durability) transactional capabilities
- Integrated lifecycle management and data integration capabilities to access SAP and non-SAP data sources
- SAP IMCE Studio, which includes tools for data modeling, data and lifecycle management, and data security

The SAP IMCE that resides at the heart of SAP HANA is an integrated database and calculation layer that allows the processing of massive quantities of real-time data in main memory to provide immediate results from analyses and transactions. Like any standard database, the SAP IMCE supports industry standards such as SQL and MDX, but it also incorporates a high-performance calculation engine that embeds procedural language support directly into the database kernel. This approach is designed to remove the need to read data from the database, process it in the application, and then write it back; instead, the data is processed near the database and only the results are returned. The IMCE is an in-memory, column-oriented database technology and a powerful calculation engine at the heart of SAP HANA.
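To make the idea of processing data near the database concrete, the following is a minimal SQLScript sketch. The procedure, table, and column names (GET_REGION_TOTALS, SALES_ORDERS, REGION, AMOUNT) are invented for illustration, and the exact parameter and type syntax varies between HANA revisions, so treat this as a sketch rather than the product's canonical example:

    -- Hedged sketch: aggregate inside the database and return only the result set.
    -- SALES_ORDERS, REGION, and AMOUNT are assumed example objects.
    CREATE PROCEDURE GET_REGION_TOTALS ()
        LANGUAGE SQLSCRIPT
        READS SQL DATA
    AS
    BEGIN
        -- lt_totals is a SQLScript table variable; the aggregation runs in the engine
        lt_totals = SELECT REGION, SUM(AMOUNT) AS TOTAL_AMOUNT
                    FROM SALES_ORDERS
                    GROUP BY REGION;
        -- selecting from the table variable exposes it as the procedure's result set
        SELECT * FROM :lt_totals;
    END;

Calling the procedure (for example, CALL GET_REGION_TOTALS()) returns the aggregated rows without the application ever pulling the raw order lines out of the database, which is the "process near the data" idea described above.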
As data resides in Random Access Memory (RAM), highly accelerated performance can be achieved compared to systems that read data from disk. The heart of the system is the IMCE, which allows us to create and perform calculations on data. SAP IMCE Studio includes tools for data modeling, data and lifecycle management, and data security. The following diagram shows the components of the IMCE.

There's more...

The SAP HANA database has the following two engines:

- Column-based store: This engine stores huge amounts of relational data in column-optimized tables, which are aggregated and used in analytical operations.
- Row-based store: This engine stores relational data in rows, similar to the storage mechanism of traditional database systems. The row store is better optimized for write operations, but it has a lower compression rate and its query performance is lower than that of the column-based store.

The engine used to store data can be selected on a per-table basis when the table is created. Tables in the row-based store are loaded at startup time; tables in the column-based store can be loaded either at startup or on demand, that is, during normal operation of the SAP HANA database.

Both engines share a common persistence layer, which provides data persistency that is consistent across both engines. As in a traditional database, SAP HANA has page management and logging. Changes made to the in-memory database pages are persisted through savepoints, which are written to the data volumes on persistent storage, where the storage medium is typically hard drives. All transactions committed in the SAP HANA database are recorded by the logger of the persistence layer in a log entry written to the log volumes on persistent storage. To achieve high I/O performance and low latency, the log volumes use flash storage.

The relational engines can be accessed through a variety of interfaces: the SAP HANA database supports SQL (JDBC/ODBC), MDX (ODBO), and BICS (SQLDBC). The calculation engine performs all calculations in the database; no data moves into the application layer until the calculations are completed. It also contains the business functions library, which applications call to perform calculations based on business rules and logic. The SAP HANA-specific SQLScript language is an extension of SQL that can be used to push data-intensive application logic down into the SAP HANA database for specific requirements.

Session management

This component creates and manages sessions and connections for the database clients. When a session is created, a set of parameters is maintained, such as the auto-commit setting and the current transaction isolation level. After establishing a session, database clients communicate with the SAP HANA database using SQL statements. The SAP HANA database treats all statements as transactions while processing them, and each new session is assigned to a new transaction.

Transaction manager

The transaction manager coordinates database transactions, controls transaction isolation, and keeps track of running and closed transactions. When a transaction is committed or rolled back, the transaction manager informs the involved storage engines so that they can execute the necessary actions.
The transaction manager cooperates with the persistence layer to achieve atomic and durable transactions.

Client requests are analyzed and executed by a set of components summarized as request processing and execution control. A request parser analyzes the client request and dispatches it to the responsible component: transaction control statements are forwarded to the transaction manager, data definition statements to the metadata manager, object invocations to the object store, and data manipulation statements to the optimizer, which creates an optimized execution plan that is handed to the execution layer.

The SAP HANA database also has built-in support for domain-specific models (for example, for the financial planning domain) and offers scripting capabilities that allow application-specific calculations to run inside the database. Its own scripting language, SQLScript, is designed to enable optimization and parallelization. SQLScript is based on side-effect-free functions that operate on tables using SQL queries for set processing. The SAP HANA database also contains a component called the planning engine that allows financial planning applications to execute basic planning operations in the database layer. For example, when applying filters or transformations, a new version of a dataset is created as a copy of an existing one. Another example of a planning operation is disaggregation, in which target values are distributed from higher to lower aggregation levels based on a distribution function.

Metadata manager

The metadata manager provides access to metadata. The SAP HANA database's metadata consists of a variety of objects, such as definitions of tables, views, and indexes, SQLScript function definitions, and object store metadata. All of these types of metadata are stored in one common catalog for all SAP HANA database stores. Metadata is stored in tables in the row store, and SAP HANA features such as transaction support and multi-version concurrency control (MVCC) are also used for metadata management. In a distributed database system, central metadata is shared across the servers; the mechanism for storing and sharing metadata is hidden from the components that use the metadata manager.

Because row-based and columnar tables can be combined in one SQL statement, both the row and column engines must be able to consume each other's intermediate results. The main difference between the two engines is the way they process data: row store operators process data one row at a time, whereas column store operations (such as scan and aggregate) require the entire column to be available in contiguous memory locations. To exchange intermediate results, the row store materializes its results as complete rows in memory for the column store, while the column store can expose its results using the iterator interface needed by the row store.

Persistence layer

The persistence layer is responsible for the durability and atomicity of transactions. It ensures that the database is restored to the most recent committed state after a restart and that transactions are either completely executed or completely rolled back. To achieve this efficiently, the persistence layer uses a combination of write-ahead logs, shadow paging, and savepoints. It also offers interfaces for writing and reading data.
It also contains SAP HANA's logger, which manages the transaction log.

Authorization manager

The authorization manager is invoked by other SAP HANA database components to check whether a user has the privileges required to execute the requested operations. Privileges can be granted to other users or roles. A privilege grants the right to perform a specified operation (such as create, update, select, and execute) on a specified object, such as a table, view, or SQLScript function. Analytic privileges represent filters or hierarchy drill-down limitations for analytic queries; for example, SAP HANA supports analytic privileges that grant access to values with a certain combination of dimension attributes. Users are authenticated either by the SAP HANA database itself (login with username and password), or authentication can be delegated to an external third-party provider such as an LDAP directory.

See also

SAP HANA in-memory analytics and in-memory computing, available at http://scn.sap.com/people/vitaliy.rudnytskiy/blog/2011/03/22/time-to-update-your-sap-hana-vocabulary

Summary

This article explained the SAP HANA architecture and the IMCE in brief.

Downloading and Setting Up ElasticSearch

Packt
19 Dec 2013
8 min read
Downloading and installing ElasticSearch

ElasticSearch has an active community and the release cycles are very fast. Because ElasticSearch depends on many common Java libraries (Lucene, Guice, and Jackson are the most famous ones), the ElasticSearch community tries to keep them updated and fix the bugs that are discovered in them and in the ElasticSearch core. If possible, the best practice is to use the latest available release (usually the most stable one).

Getting ready

A supported ElasticSearch operating system (Linux/Mac OS X/Windows) with Java JVM 1.6 or above installed is required. A web browser is required to download the ElasticSearch binary release.

How to do it...

To download and install an ElasticSearch server, we will perform the following steps:

1. Download ElasticSearch from the Web. The latest version is always downloadable from http://www.elasticsearch.org/download/. There are versions available for different operating systems:
   - elasticsearch-{version-number}.zip: This is for both Linux/Mac OS X and Windows operating systems
   - elasticsearch-{version-number}.tar.gz: This is for Linux/Mac
   - elasticsearch-{version-number}.deb: This is for Debian-based Linux distributions (this also covers the Ubuntu family)
   These packages contain everything needed to start ElasticSearch. At the time of writing this book, the latest and most stable version of ElasticSearch was 0.90.7. To check whether this is still the latest available, please visit http://www.elasticsearch.org/download/.
2. Extract the binary content. After downloading the correct release for your platform, the installation consists of expanding the archive into a working directory. Choose a working directory that is safe from charset problems and does not have a long path, to prevent problems when ElasticSearch creates its directories to store the index data. For the Windows platform, a good directory could be c:\es; on Unix and Mac OS X, /opt/es.
3. To run ElasticSearch, you need a Java Virtual Machine 1.6 or above installed. For better performance, I suggest you use the Sun/Oracle 1.7 version.
4. Start ElasticSearch to check that everything is working. To start your ElasticSearch server, just go into the install directory and type:
   # bin/elasticsearch -f (for Linux and Mac OS X)
   or
   # bin\elasticsearch.bat -f (for Windows)
   Now your server should start, as shown in the following screenshot.

How it works...

The ElasticSearch package contains three directories:

- bin: This contains the scripts to start and manage ElasticSearch. The most important ones are:
  - elasticsearch(.bat): This is the main script to start ElasticSearch
  - plugin(.bat): This is a script to manage plugins
- config: This contains the ElasticSearch configuration files. The most important ones are:
  - elasticsearch.yml: This is the main config file for ElasticSearch
  - logging.yml: This is the logging config file
- lib: This contains all the libraries required to run ElasticSearch

There's more...

During ElasticSearch startup, a lot of events happen:

- A node name is chosen automatically (Akenaten in the example) if not provided in elasticsearch.yml.
- A node name hash is generated for this node (that is, whqVp_4zQGCgMvJ1CXhcWQ).
- If there are plugins (internal or site plugins), they are loaded. In the previous example there are no plugins.
- If not configured otherwise, ElasticSearch automatically binds two ports on all available addresses:
  - 9300: internal, intra-node communication, used for discovering other nodes
  - 9200: the HTTP REST API port
- After starting, if indices are available, they are checked and put online so they can be used.

There are more events fired during ElasticSearch startup. We'll see them in detail in other recipes.

Networking setup

Correctly setting up networking is very important for your node and cluster. As there are a lot of different install scenarios and networking issues, in this recipe we will cover two kinds of networking setups:

- Standard installation with a working autodiscovery configuration
- Forced IP configuration, used if it is not possible to use autodiscovery

Getting ready

You need a working ElasticSearch installation, and you need to know your current networking configuration (that is, your IP).

How to do it...

To configure networking, we will perform the following steps:

1. Open the ElasticSearch configuration file with your favorite text editor. Using the standard ElasticSearch configuration file (config/elasticsearch.yml), your node is configured to bind to all your machine's interfaces and performs autodiscovery by broadcasting events; that is, it sends "signals" to every machine on the current LAN and waits for a response. If a node responds, it can join the cluster. If another node is available on the same LAN, it joins the cluster. Only nodes with the same ElasticSearch version and the same cluster name (the cluster.name option in elasticsearch.yml) can join each other.
2. To customize the network preferences, you need to change some parameters in the elasticsearch.yml file, such as:
   cluster.name: elasticsearch
   node.name: "My wonderful server"
   network.host: 192.168.0.1
   discovery.zen.ping.unicast.hosts: ["192.168.0.2","192.168.0.3[9300-9400]"]
   This configuration sets the cluster name to elasticsearch, sets the node name and the network address, and tries to bind the node to the address given in the discovery section. We can check the configuration while the node is loading.
3. We can now start the server and check whether the network is configured:
   [INFO ][node ] [Aparo] version[0.90.3], pid[16792], build[5c38d60/2013-08-06T13:18:31Z]
   [INFO ][node ] [Aparo] initializing ...
   [INFO ][plugins ] [Aparo] loaded [transport-thrift, river-twitter, mapper-attachments, lang-python, jdbc-river, lang-javascript], sites [bigdesk, head]
   [INFO ][node ] [Aparo] initialized
   [INFO ][node ] [Aparo] starting ...
   [INFO ][transport ] [Aparo] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.5:9300]}
   [INFO ][cluster.service] [Aparo] new_master [Angela Cairn] [yJcbdaPTSgS7ATQszgpSow][inet[/192.168.1.5:9300]], reason: zen-disco-join (elected_as_master)
   [INFO ][discovery ] [Aparo] elasticsearch/yJcbdaPTSgS7ATQszgpSow
   [INFO ][http ] [Aparo] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.5:9200]}
   [INFO ][node ] [Aparo] started
   In this case, we have:
   - The transport bound to 0:0:0:0:0:0:0:0:9300 and 192.168.1.5:9300
   - The REST HTTP interface bound to 0:0:0:0:0:0:0:0:9200 and 192.168.1.5:9200

How it works...

It works as follows:

- cluster.name: This sets up the name of the cluster (only nodes with the same cluster name can join).
- node.name: If this is not defined, it is automatically generated by ElasticSearch. It allows you to define a name for the node. If you have a lot of nodes on different machines, it is useful to set this name to something meaningful so you can easily locate a node.
  Using a valid name is easier to remember than a generated name such as whqVp_4zQGCgMvJ1CXhcWQ.
- network.host: This defines the IP address of your machine to be used when binding the node. If your server is on different LANs, or you want to limit the bind to only one LAN, you must set this value to your server's IP.
- discovery.zen.ping.unicast.hosts: This allows you to define a list of hosts (with ports or a port range) to be used to discover other nodes to join the cluster. This setting allows using the node on a LAN where broadcasting is not allowed or autodiscovery is not working (for example, with packet-filtering routers). The referenced port is the transport one, usually 9300. The addresses in the hosts list can be a mix of:
  - host name, that is, myhost1
  - IP address, that is, 192.168.1.2
  - IP address or host name with the port, that is, myhost1:9300 and 192.168.1.2:9300
  - IP address or host name with a range of ports, that is, myhost1:[9300-9400] and 192.168.1.2:[9300-9400]

Setting up a node

ElasticSearch allows you to customize several parameters in an installation. In this recipe, we'll see the most used ones, which define where to store our data and improve general performance.

Getting ready

You need a working ElasticSearch installation.

How to do it...

The steps required for setting up a simple node are as follows:

1. Open the config/elasticsearch.yml file with an editor of your choice.
2. Set up the directories that store your server data:
   path.conf: /opt/data/es/conf
   path.data: /opt/data/es/data1,/opt2/data/data2
   path.work: /opt/data/work
   path.logs: /opt/data/logs
   path.plugins: /opt/data/plugins
3. Set up the parameters that control standard index creation. These parameters are:
   index.number_of_shards: 5
   index.number_of_replicas: 1

How it works...

The path.conf setting defines the directory that contains your configuration files, mainly elasticsearch.yml and logging.yml. The default location is $ES_HOME/config, where ES_HOME is your install directory. It is useful to set up the config directory outside your application directory so you don't need to copy the configuration files every time you update the version or change the ElasticSearch installation directory.

The path.data setting is the most important one: it allows you to define one or more directories where index data is stored. When you define more than one directory, they are managed similarly to a RAID 0 configuration (the total space is the sum of all the data directory entry points), favoring locations with the most free space.

The path.work setting is a location in which ElasticSearch puts temporary files, and path.logs is where the log files are put; how logging is performed is controlled in logging.yml. The path.plugins setting allows overriding the plugins path (the default is $ES_HOME/plugins), which is useful for "system-wide" plugins.

The main parameters used to control indexes and shards are index.number_of_shards, which controls the standard number of shards for a newly created index, and index.number_of_replicas, which controls the initial number of replicas.

There's more...

There are a lot of other parameters that can be used to customize your ElasticSearch installation, and new ones are added with new releases. The most important ones are described in this recipe and in the next one.
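To pull the settings from this recipe and the networking one together, the following is a hedged sketch of a combined config/elasticsearch.yml. All paths, names, and addresses are the example values used above and should be adapted to your own environment:

    # cluster and node identity
    cluster.name: elasticsearch
    node.name: "My wonderful server"

    # networking (see the Networking setup recipe)
    network.host: 192.168.0.1
    discovery.zen.ping.unicast.hosts: ["192.168.0.2", "192.168.0.3[9300-9400]"]

    # storage locations
    path.conf: /opt/data/es/conf
    path.data: /opt/data/es/data1,/opt2/data/data2
    path.work: /opt/data/work
    path.logs: /opt/data/logs
    path.plugins: /opt/data/plugins

    # defaults for newly created indices
    index.number_of_shards: 5
    index.number_of_replicas: 1

After restarting the node, a plain HTTP request to the REST port (for example, curl http://192.168.0.1:9200/) should return basic node information such as its name and version, confirming that the settings were picked up.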

Applied Modeling

Packt
19 Dec 2013
15 min read
This article examines the foundations of tabular modeling (at least from the modeler's point of view). In order to provide a familiar model that can be used later, this article progressively builds the model. Each recipe is intended to demonstrate a particular technique; however, they need to be followed in order so that the final model is completed.

Grouping by binning and sorting with ranks

Often, we want to provide descriptive information in our data based on values that are derived from downstream activities. For example, this could arise if we wish to include a field in the Customer table that shows the value of that customer's previous sales, or a bin grouping that places the customer into banded sales. This value can then be used to rank each customer according to their relative importance in relation to other customers.

This type of value-adding activity is usually troublesome and time intensive in a traditional data mart design, because the customer data can be fully updated only once the sales data has been loaded. While this process may seem relatively straightforward, it is a recursive problem: the customer data must be loaded before the sales data (since customers must exist before sales can be assigned to them), but the full view of the customer relies on loading the sales data. In a standard dimensional (star schema) modeling approach, including this type of information for dimensions requires a three-step process:

1. The dimension (reseller customer data) is updated for known fields and attributes. This load excludes information that is derived (such as sales).
2. The sales data (referred to as fact data) is loaded into the data warehouse. This ensures that the data mart is in a current state and all the sales transaction data is up to date. Information relating to any new and changed stores can be loaded correctly.
3. The dimension data that relies on other fact data is updated based on the current state of the data mart.

Since the tabular model is less confined by the traditional star schema requirements of fact and dimension tables (in fact, the tabular model does not explicitly identify facts or dimensions), the inclusion and processing of these descriptive attributes can be built directly into the model. The calculation of a simple measure such as a historic sales value may be included in OLAP modeling through calculated columns in the data source view. However, this is restrictive and limited to simple calculations (such as total sales or n-period sales). Other manipulations (such as ranking and binning) are a lot more flexible in tabular modeling (as we will see).

This recipe examines how to manipulate a dimension table in order to provide a richer end user experience. Specifically, we will do the following:

- Introduce a calculated field to calculate the historical sales for the customer
- Determine the rank of the customer based on that field
- Create a discretization bin for the customer based on their sales
- Create an ordered hierarchy based on their discretization bins

Getting ready

Continuing with the scenario that was discussed in the Introduction section of the article, the purpose here is to identify each reseller's (customer's) historic sales and then rank them accordingly. We then discretize the Resellers table (customer) based on this.
This problem is further complicated by the consideration that a sale occurs in the country of origin (the sales data in the Reseller Sales table can appear in any currency). In order to keep the recipe concise, we break the entire process into two distinct steps:

- Conversion of sales (which manages the ranking of resellers based on a unified sales value)
- Classification of sales (which manages the manipulation of sales values based on discretized bins and the formatting of those bins)

How to do it...

Firstly, we need a common measure to compare the sales value of resellers. Convert sales to a uniform currency using the following steps:

1. Open a new workbook and launch the PowerPivot window.
2. Import the text files Reseller Sales.txt, Currency Conversion.txt, and Resellers.txt. The source data folder for this article includes the base schema.ini file that correctly transforms all data upon import. When importing the data, you should be prompted that the schema.ini file exists and will override the import settings. If this does not occur, ensure that the schema.ini file exists in the same directory as your source data. The prompt should look like the following screenshot.
   Although it is not mandatory, it is recommended that connection managers are labeled according to a standard. In this model, I have used the convention type_table_name, where type refers to the connection type (.txt) and table_name refers to the name of the table. Connections can be edited using the Existing Connections button in the Design tab.
3. Create a relationship between the Customer ID field in the Resellers table and the Customer ID field in the Reseller Sales table.
4. Add a new field (also called a calculated column) in the Reseller Sales table to show the gross value of the sale amount in USD. Add usd_gross_sales and hide it from client tools using the following code:
   = [Quantity Ordered] * [Price Per Unit]
     * LOOKUPVALUE (
         'Currency Conversion'[AVG Rate]
         , 'Currency Conversion'[Date], [Order dt]
         , 'Currency Conversion'[Currency ID], [Currency ID]
       )
5. Add a new measure in the Reseller Sales table to show sales (in USD). Add USD Gross Sales as:
   USD Gross Sales := SUM ( [usd_gross_sales] )
6. Add a new calculated column to the Resellers table to show the USD Sales Total value. The formula for the field should be:
   = 'Reseller Sales'[USD Gross Sales]
7. Add a Sales Rank field in the Resellers table to show the order of each reseller's USD Sales Total. The formula for Sales Rank is:
   =RANKX(Resellers, [USD Sales Total])
8. Hide the USD Sales Total and Sales Rank fields from client tools.

Now that all the entries in the Resellers table show their sales value in a uniform currency, we can determine a grouping for the Resellers table. In this case, we are going to group them into 100,000-dollar bands.

9. Add a new field to show each value in the USD Sales Total column of Resellers rounded down to the nearest 100,000 dollars, and hide it from client tools. Add Round Down Amount as:
   =ROUNDDOWN([USD Sales Total],-5)
10. Add a new field to show the rank of Round Down Amount in descending order, and hide it from client tools. Add Round Down Order as:
    =RANKX(Resellers,[Round Down Amount],,FALSE(), DENSE)
11. Add a new field in the Resellers table to show the 100,000-dollar group that the reseller belongs to. Since we know the lower bound of the sales bin, we can also infer that the upper bound is the sales total rounded up to the nearest 100,000 dollars.
    Add Sales Group as follows:
    =IF([Round Down Amount]=0 || ISBLANK([Round Down Amount])
        , "Sales Under 100K"
        , FORMAT([Round Down Amount], "$#,K") & " - " & FORMAT(ROUNDUP([USD Sales Total],-5),"$#,K")
    )
12. Set the Sort By Column of the Sales Group field to the Round Down Order column. Note that the Round Down Order column should sort in descending order (that is, entries in the Resellers table with high sales values should appear first).
13. Create a hierarchy on the Resellers table that shows the Sales Group field and the Customer Name column as levels. Title the hierarchy Customer By Sales Group.
14. Add a new measure titled Number of Resellers to the Resellers table:
    Number of Resellers:=COUNTROWS(Resellers)
15. Create a pivot table that shows the Customer By Sales Group hierarchy on the rows and Number of Resellers as values. If you created the usd_gross_sales field in the Reseller Sales table, it can also be added as an implicit measure to verify values.
16. Expand the first bin of Sales Group. The pivot should look like the following screenshot.

How it works...

This recipe has included various steps that add descriptive information to the Resellers table. This includes obtaining data from a separate table (the USD sales) and then manipulating that field within the Resellers table. In order to explain how this process works more clearly, we will break the explanation into several subsections: the sales data retrieval, the use of rank functions, and finally the discretization of sales.

The next section deals with the process of converting sales data to a single currency. The starting point for this recipe is to determine a common-currency sales value for each reseller (or customer). While the inclusion of the calculated column USD Sales Total in the Resellers table should be relatively straightforward, it is complicated by the consideration that the sales data is stored in multiple currencies. Therefore, the first step needs to include a currency conversion to determine the USD sales value for each line. This is simply the local value multiplied by the daily exchange rate; the LOOKUPVALUE function is used to return the row-by-row exchange rate.

Now that we have the usd_gross_sales value for each sales line, we define a measure that calculates its sum in whatever filter context it is applied in. Including it in the Reseller Sales table makes sense (since it relates to sales data), but what is interesting is how the filter context is applied when it is used as a field in the Resellers table. Here, the row filter context that exists in the Resellers table (after all, each row refers to a reseller) applies a restriction to the sales data, which shows the sales value for each reseller.

For this recipe to work correctly, it is not necessary to include the calculated field usd_gross_sales in Reseller Sales. We simply need to define a calculation that shows the gross sales value in USD and then use the row filter context in the Resellers table to restrict sales to the reseller in question (that is, the reseller in the row). It is obvious that the exchange rate should be applied on a daily basis, because the rate can change every day. We could use an X function in the USD Gross Sales measure to achieve exactly the same outcome.
Our formula would be:

SUMX (
    'Reseller Sales'
    , 'Reseller Sales'[Quantity Ordered] * 'Reseller Sales'[Price Per Unit]
      * LOOKUPVALUE (
          'Currency Conversion'[AVG Rate]
          , 'Currency Conversion'[Date], 'Reseller Sales'[Order dt]
          , 'Currency Conversion'[Currency ID], 'Reseller Sales'[Currency ID]
        )
)

Furthermore, if we wanted to, we could completely eliminate the USD Gross Sales measure from the model. To do this, we could wrap the entire formula (the previous definition of USD Gross Sales) in a CALCULATE statement in the Resellers table's field definition of USD Gross Sales. This forces the calculation to occur in the current row context. Why, then, have we included the additional fields and measures in Reseller Sales? This is a modeling choice; it makes the model easier to understand because it is more modular. Otherwise, this would require two calculations (one into a default currency and a second into a selected currency), and the field usd_gross_sales is used in that calculation.

Now that sales are converted to a uniform currency, we can determine their importance by rank. RANKX is used to rank the rows in the Resellers table based on the USD Gross Sales field. The simplest implementation of RANKX is demonstrated in the Sales Rank field: the function simply returns a rank based on the value of the supplied measure (which is, of course, USD Gross Sales). However, the RANKX function provides a lot of versatility and follows this syntax:

RANKX(<table>, <expression> [, <value> [, <order> [, <ties>]]])

Beyond the initial implementation of RANKX in its simplest form, the arguments of particular interest are <order> and <ties>. These can be used to specify the sort order (whether the rank is applied from highest to lowest or lowest to highest) and the function's behavior when duplicate values are encountered. This is best demonstrated with an example, so we will examine the operation of rank in relation to Round Down Amount.

When a simple RANKX function is applied, the function sorts the column in descending order and returns the position of a row based on the sorted order of the value and the number of prior rows in the table, including rows attributable to duplicate values. This is shown in the following screenshot, where the Simple column is defined as RANKX(Resellers,[Round Down Amount]). Note that the data is sorted by Round Down Amount and the first four tied values have a RANKX value of 1. This is the behavior we expect, since all of those rows have the same value. For the next value (700000), RANKX returns 5 because this is the fifth element in the sequence.

When the DENSE argument is specified, the value returned after a tie is the next sequential number in the list. In the preceding screenshot, this is shown in the DENSE column, whose formula is:

RANKX(Resellers, [Round Down Amount],,,DENSE)

Finally, we can specify the sort order used by the function (the default is descending) with the help of the <order> argument. If we wish to sort (and rank) from lowest to highest, we can use the formula shown in the INVERSE DENSE column, which uses the following calculation:

RANKX(Resellers, [Round Down Amount],,TRUE,DENSE)

Having specified the Sales Group field's sort by column as Round Down Order, we may ask why we did not also sort the Customer Name column by its respective values in the Sales Rank column.
Trying to define a sort by column in this way would cause an error, because the relationship between these two fields is not one-to-one; that is, each customer does not have a unique value for the sales rank. Let's look at this in more detail. If we filter the Resellers table to show the blank USD Sales Total rows (click on the drop-down arrow in the USD Sales Total column and check the BLANKS checkbox), we see that the value of the Sales Rank column is the same for all of these rows. In the following screenshot, we can see the value 636 repeated for all the rows.

Allowing the client tool visibility of the USD Sales Total and Sales Rank fields will not provide an intuitive browsing attribute for most client tools. For this reason, it is not recommended to expose these attributes to users. Hiding them will still allow the data to be queried directly.

Discretizing sales

By discretizing the Resellers table, we first make a decision to group each reseller into bands of 100,000 intervals. Since we have already calculated the USD Gross Sales value for each customer, our problem is reduced to determining which bin each customer belongs to. This is easily achieved, as we can derive the lower and upper bounds for each reseller: the lower bound is the sales amount rounded down, and the upper bound is the rounded-up value (that is, rounded to the nearest 100,000 interval). Finally, we must ensure that the ordering of the bins is correct so that the bins appear from the highest-value resellers to the lowest.

For convenience, these steps are broken down through the creation of additional columns, but they need not be; we could incorporate the steps into a single formula (mind you, it would be hard to read). Additionally, we have provided a unique name for the first bin by testing for zero sales; this may not be required.

The rounding is done with the ROUNDDOWN and ROUNDUP functions. These functions simply return the number moved by the number of digits offset. The following is the syntax for ROUNDDOWN:

ROUNDDOWN(<number>, <num_digits>)

Since we are interested only in the integer values (that is, values to the left of the decimal place), we must specify <num_digits> as -5. The display value of the bin is controlled through the FORMAT function, which returns the text equivalent of a value according to the provided format string. The syntax for FORMAT is:

FORMAT(<value>, <format_string>)

There's more...

In presenting a USD Gross Sales value for the Resellers table, we may not be interested in all the historic data. A typical variation on this theme is to determine current worth by showing recent history (or summing recent sales). This requirement can easily be implemented in the preceding method by swapping USD Gross Sales with recent sales. To determine this amount, we need to filter the data used in the SUM function. For example, to determine the last 30 days' sales for a reseller, we would use the following code:

SUMX(
    FILTER('Reseller Sales'
        , 'Reseller Sales'[Order dt] > (MAX('Reseller Sales'[Order dt]) - 30)
    )
    , USD SALES EXPRESSION
)
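As noted above, the helper columns could be collapsed into a single (admittedly harder to read) calculated column. A hedged sketch of such a combined Sales Group definition, produced simply by inlining the Round Down Amount expression from this recipe, might look like this:

    =IF (
        ROUNDDOWN ( [USD Sales Total], -5 ) = 0
            || ISBLANK ( ROUNDDOWN ( [USD Sales Total], -5 ) )
        , "Sales Under 100K"
        , FORMAT ( ROUNDDOWN ( [USD Sales Total], -5 ), "$#,K" )
            & " - "
            & FORMAT ( ROUNDUP ( [USD Sales Total], -5 ), "$#,K" )
    )

Note that a separate Round Down Order column would still be needed to order the groups, because a Sort By Column must reference a physical column in the table.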

Sharing Your BI Reports and Dashboards

Packt
19 Dec 2013
4 min read
The final objective of the information in BI reports and dashboards is to detect cause-and-effect business behavior and trends, and to trigger actions in response. These actions are supported by visual information, via scorecards and dashboards, and the process requires interaction with several people. MicroStrategy includes functionality to share our reports, scorecards, and dashboards, regardless of where those people are located.

Reaching your audience

MicroStrategy offers the option to share our reports via different channels that leverage the latest social technologies already present in the marketplace; that is, MicroStrategy integrates with Twitter and Facebook. Sharing avoids any related costs and maintains the design premise of the do-it-yourself approach, without any help from specialized IT personnel.

Main menu

The main menu of MicroStrategy shows a column named Status. When we click on that column, as shown in the following screenshot, the Share option appears.

The Share button

The other option is the Share button within our reports, that is, the view that we want to share. Select the Share button located at the bottom of the screen, as shown in the following screenshot. The share options are the same regardless of where you activate them; the various alternate menus are shown in the following screenshot.

E-mail sharing

When selecting the e-mail option from the Scorecards-Dashboards model, the system asks which e-mail program you want to use to send the e-mail; in our case, we select Outlook. MicroStrategy automatically prepares an e-mail with a link to share. You can modify the text and select the recipients of the e-mail, as shown in the following screenshot.

The recipients of the e-mail click on the URL included in the e-mail, and through this scheme the user is able to analyze the report in read-only mode with only the Filters panel enabled. The following screenshot shows how the user will review the report; the user is not allowed to make any modifications. This option does not require a MicroStrategy platform user account. When a user clicks on the link, he is able to edit the filters, perform analyses, and switch to any available layout, in our case, scorecards and dashboards. As a result, any visualization object can be maximized and minimized for better analysis, as shown in the following screenshot.

In this option, the report can be viewed in fullscreen mode by clicking on the fullscreen button located at the top-right corner of the screen. In this sharing mode, the user is able to download the information in Excel and PDF formats for each visualization object. For instance, suppose you need all the data included in the grid for the stores in region 1 opened in the year 2012. Perform the following steps:

1. In the browser, open the URL that is generated when you select the e-mail share option.
2. Select the ScoreCard tab.
3. In the Open Year filter, type 2012, and in the Region filter, type 1.
4. Now, maximize the grid.
Two icons will appear in the top-left corner of the screen: one for exporting the data to Excel and the other for exporting it to PDF, for each visualization object, as shown in the following screenshot. Please keep in mind that these two export options apply only to a specific visualization object; it is not possible to export the complete report from the functionality offered to the consumer.

Summary

In this article, we learned how to share our scorecards and dashboards via several channels, such as e-mail, social networks (Twitter and Facebook), and blogs or corporate intranet sites.

Getting Started with Apache Nutch

Packt
18 Dec 2013
13 min read
Introduction to Apache Nutch

Apache Nutch is a very robust and scalable tool for web crawling, and it can be integrated with scripting languages such as Python. You can use it whenever your application contains huge amounts of data and you want to apply crawling to it. Apache Nutch is open source web crawler software used for crawling websites. If you understand Apache Nutch well, you can create your own search engine in the style of Google: it gives you your own search engine with which you can improve how your application's pages rank in searches and customize searching according to your needs.

Nutch is extensible and scalable. It provides facilities for parsing, indexing, building your own search engine, customizing search to your needs, scalability, robustness, and a ScoringFilter for custom implementations. ScoringFilter is a Java class used when creating an Apache Nutch plugin; it is used for manipulating scoring variables. We can run Apache Nutch on a single machine as well as in a distributed environment such as Apache Hadoop. It is written in Java. With Apache Nutch we can find broken links, keep a copy of all the visited pages for searching over (for example, to build indexes), and discover web page hyperlinks in an automated manner.

Apache Nutch can be integrated with Apache Solr easily, and we can index all the web pages crawled by Apache Nutch into Apache Solr. We can then use Apache Solr to search the web pages indexed by Apache Nutch. Apache Solr is a search platform built on top of Apache Lucene that can be used for searching any type of data, for example, web pages.

Crawling your first website

Crawling is driven by the Apache Nutch crawling tool and certain related tools for building and maintaining several data structures, including the web database, the index, and a set of segments. Once Apache Nutch has indexed the web pages to Apache Solr, you can search for the required web pages in Apache Solr.

Apache Solr installation

Apache Solr is a search platform built on top of Apache Lucene. It is a very powerful search mechanism and provides full-text search, dynamic clustering, database integration, rich document handling, and much more. Apache Solr is used for indexing the URLs crawled by Apache Nutch, so that the details crawled by Apache Nutch can then be searched in Apache Solr.

Crawling your website using the crawl script

Apache Nutch 2.2.1 comes with a crawl script that performs crawling by executing a single script. In earlier versions, we had to perform each step manually (generating data, fetching data, parsing data, and so on) to perform crawling.

Crawling the web, the CrawlDb, and URL filters

When the user invokes the crawling command in Apache Nutch 1.x, a crawldb is generated by Apache Nutch, which is nothing but a directory that contains details about the crawl. In Apache Nutch 2.x, the crawldb is not present; instead, Apache Nutch keeps all the crawling data directly in the database.

InjectorJob

The injector adds the necessary URLs to the crawldb, the directory created by Apache Nutch for storing data related to crawling. You need to provide URLs to the InjectorJob, either by downloading URLs from the Internet or by writing your own file that contains the URLs. Let's say you have created one directory called urls that contains all the URLs that need to be injected into the crawldb.
The following command performs the InjectorJob:

# bin/nutch inject crawl/crawldb urls

urls is the directory that contains all the URLs to be injected into the crawldb, and crawl/crawldb is the directory in which the injected URLs will be placed. After performing this job, you will have a number of unfetched URLs inside your database, that is, the crawldb.

GeneratorJob

Once we are done with the InjectorJob, it is time to fetch the injected URLs from the crawldb. Before fetching the URLs, you need to perform the GeneratorJob. The following command is used for the GeneratorJob:

# bin/nutch generate crawl/crawldb crawl/segments

crawl/crawldb is the directory from which URLs are generated, and crawl/segments is the directory used by the GeneratorJob to hold the information required for crawling.

FetcherJob

The job of the fetcher is to fetch the URLs that were generated by the GeneratorJob; it uses the input provided by the GeneratorJob. The following command is used for the FetcherJob:

# bin/nutch fetch -all

Here the input parameter -all means that this job will fetch all the URLs generated by the GeneratorJob. You can use different input parameters according to your needs.

ParserJob

After the FetcherJob, the ParserJob parses the URLs that were fetched by the FetcherJob. The following command is used for the ParserJob:

# bin/nutch parse -all

The input parameter -all parses all the URLs fetched by the FetcherJob. You can use different input parameters according to your needs.

DbUpdaterJob

Once the ParserJob has completed, we need to update the database with the results of the FetcherJob. This updates the database with the last fetched URLs. The following command is used for the DbUpdaterJob:

# bin/nutch updatedb crawl/crawldb -all

After performing this job, the database will contain updated entries for all the initial pages as well as new entries corresponding to the newly discovered pages linked from the initial set.

Invertlinks

Before indexing, we need to first invert all the links, so that we can index the incoming anchor text with the pages. The following command is used for Invertlinks:

# bin/nutch invertlinks crawl/linkdb -dir crawl/segments

(An end-to-end sketch that chains these jobs together is shown below, after the Hadoop overview.)

Apache Hadoop

Apache Hadoop is designed for running your application on a cluster of servers, where one machine is the master and the rest are slaves. It is a huge data warehouse: the master machines direct the slave machines, and the actual data processing is done by the slaves. This is why Apache Hadoop is used for processing huge amounts of data: the work is divided across the slave machines, which is how Apache Hadoop achieves the highest throughput for any processing. As the data grows, you add more slave machines. That is how Apache Hadoop works.

Integration of Apache Nutch with Apache Hadoop

Apache Nutch can be integrated with Apache Hadoop easily, and we can make the process much faster than running Apache Nutch on a single machine. After integrating Apache Nutch with Apache Hadoop, we can perform crawling in an Apache Hadoop cluster environment, so the process is much faster and we get the highest throughput.

Apache Hadoop setup with a cluster

This setup does not require purchasing huge hardware for running Apache Nutch and Apache Hadoop; it is designed to make maximum use of the hardware available.
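Before moving into the distributed setup, it may help to see the single-machine workflow in one place. The following is a hedged sketch that chains the jobs described above and then indexes the results into Apache Solr; the Solr URL and the directory layout are assumptions, and the exact arguments of the indexing command (and of the crawl script) differ between Nutch releases, so check the usage output of bin/nutch and bin/crawl for your version:

    # inject, generate, fetch, parse, update, and invert links, as described above
    bin/nutch inject crawl/crawldb urls
    bin/nutch generate crawl/crawldb crawl/segments
    bin/nutch fetch -all
    bin/nutch parse -all
    bin/nutch updatedb crawl/crawldb -all
    bin/nutch invertlinks crawl/linkdb -dir crawl/segments

    # push the crawled pages into Solr for searching
    # (1.x-style directory layout; the Solr URL below is an example)
    bin/nutch solrindex http://localhost:8983/solr/ crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments

The crawl script mentioned earlier wraps a similar sequence into a single command; a typical invocation looks like bin/crawl urls myCrawl http://localhost:8983/solr/ 2, where the last argument is the number of crawl rounds (treat the argument order as an assumption and confirm it against your installation's script).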
Formatting the HDFS filesystem using the NameNode

HDFS (Hadoop Distributed File System) is the directory structure used by Apache Hadoop for storage; it stores all the data related to Apache Hadoop. It has two components, the NameNode and the DataNodes: the NameNode manages the filesystem metadata, while the DataNodes actually store the data. It is highly configurable and well suited to many installations; for very large clusters, the configuration needs to be tuned. The first step in getting Apache Hadoop started is formatting the Hadoop filesystem, which is implemented on top of the local filesystem of your cluster (which will include only your local machine if you have followed along).

Setting up the deployment architecture of Apache Nutch

We have to set up Apache Nutch on each of the machines we are using. In this case, we are using a six-machine cluster, so we have to set up Apache Nutch on each machine. With a small number of machines in the cluster, we can do the setup manually on each one. But when there are more machines, say 100 machines in the cluster environment, we cannot set up each machine manually. For that we require a deployment tool such as Chef, or at least distributed ssh. You can refer to http://www.opscode.com/chef/ to get familiar with Chef, and to http://www.ibm.com/developerworks/aix/library/au-satdistadmin/ to get familiar with distributed ssh. Here I will only demonstrate running Apache Hadoop on Ubuntu as a single-node cluster; if you want to run Apache Hadoop on Ubuntu as a multi-node cluster, follow the reference link provided above and configure it in the same way.

Once we are done with the deployment of Apache Nutch to a single machine, we run the start-all.sh script, which starts the services on the master node and the data nodes. That is, the script starts the Hadoop daemons on the master node, logs into all the slave nodes using the ssh command as explained above, and starts the daemons on the slave nodes. The start-all.sh script expects Apache Nutch to be placed at the same location on each machine, and it also expects Apache Hadoop to store its data at the same file path on each machine. The script starts the daemons on the master and slave nodes using password-less ssh login.

Introduction to Apache Nutch configuration with Eclipse

Apache Nutch can be easily configured with Eclipse, after which we can perform crawling from Eclipse instead of the command line; all the crawling operations that we run from the command line can be performed from Eclipse. Instructions are provided for setting up a development environment for Apache Nutch with the Eclipse IDE; they are intended as a comprehensive starting resource for configuring, building, crawling, and debugging Apache Nutch in this context. The following are the prerequisites for integrating Apache Nutch with Eclipse:

- Get the latest version of Eclipse from http://www.eclipse.org/downloads/packages/release/juno/r
- All the required plugins listed below are available from the Eclipse Marketplace. If they are not, you can download the Eclipse Marketplace client as described at http://marketplace.eclipse.org/marketplace-client-intro
- Once you have configured Eclipse, download Subclipse as described at http://subclipse.tigris.org/. If you face a problem with the 1.8.x release, try 1.6.x; this may resolve compatibility issues.
Download the IvyDE plugin for Eclipse from http://ant.apache.org/ivy/ivyde/download.cgi
Download the m2e plugin for Eclipse from http://marketplace.eclipse.org/content/maven-integration-eclipse

Introduction of Apache Accumulo

Accumulo is used as a datastore for storing data, in the same way that we use databases such as MySQL or Oracle. The key point of Apache Accumulo is that it runs on an Apache Hadoop cluster environment, which is a very useful feature. The Accumulo sorted, distributed key/value store is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Thrift, and Zookeeper. Apache Accumulo features some novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.

Introduction of Apache Gora

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Apache Gora supports persisting to column stores, key/value stores, document stores, and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Supported datastores

Apache Gora presently supports the following datastores:

Apache Accumulo
Apache Cassandra
Apache HBase
Amazon DynamoDB

Use of Apache Gora

Although there are many excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from their relational cousins. Data-model-agnostic frameworks such as JDO are not sufficient for use cases where one needs to use the complete power of the data models in column stores. Gora fills this gap, giving the user an easy-to-use in-memory data model plus persistence for big data, with data-store-specific mappings and built-in Apache Hadoop support.

Integration of Apache Nutch with Apache Accumulo

In this section, we are going to cover the process of integrating Apache Nutch with Apache Accumulo. Apache Accumulo is used for huge data storage. It is built on top of Apache Hadoop, Zookeeper, and Thrift. A potential use of integrating Apache Nutch with Apache Accumulo is when our application has huge amounts of data to process and we want to run it in a cluster environment; in that case we can use Apache Accumulo for data storage. As Apache Accumulo only runs with Apache Hadoop, it is mostly used in cluster-based environments. We will start with the configuration of Apache Gora with Apache Nutch. Then we will set up Apache Hadoop and Zookeeper, install and configure Apache Accumulo, test Apache Accumulo, and finally crawl with Apache Nutch on Apache Accumulo.

Setup Apache Hadoop and Apache Zookeeper for Apache Nutch

Apache Zookeeper is a centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services. All of these services are used by distributed applications in one way or another. Because Zookeeper provides them, you don't have to write these services from scratch; you can use them to implement consensus, management, group membership, leader election, and presence protocols, and you can also build on them for your own requirements. A quick way to confirm that Zookeeper is reachable is sketched below.
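Before configuring Accumulo on top of it, it is worth confirming that the Zookeeper ensemble actually answers requests. The following is a minimal sketch, not taken from the original setup: it assumes Zookeeper is listening on localhost:2181 (adjust the connect string to match your zoo.cfg), and the class name ZookeeperPing is just an illustrative choice.

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZookeeperPing {
    public static void main(String[] args) throws Exception {
        // Wait until the client reports SyncConnected before issuing requests.
        final CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
            public void process(WatchedEvent event) {
                if (event.getState() == Event.KeeperState.SyncConnected) {
                    connected.countDown();
                }
            }
        });
        connected.await();
        // Listing the root znodes proves the ensemble is reachable and serving.
        System.out.println("Root znodes: " + zk.getChildren("/", false));
        zk.close();
    }
}

If this prints the list of root znodes, Zookeeper is up and you can proceed with the Accumulo configuration described next.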
Apache Accumulo is built on top of Apache Hadoop and Zookeeper, so we must configure Apache Accumulo to work with both of them. You can refer to http://www.covert.io/post/18414889381/accumulo-nutch-and-gora for any queries related to the setup.

Integration of Apache Nutch with MySQL

In this section, we are going to integrate Apache Nutch with MySQL, so that the web pages crawled by Apache Nutch are stored in MySQL. You can then go to MySQL, check your crawled web pages, and perform any necessary operations on them. We will start with an introduction to MySQL, then cover why MySQL is worth integrating with Apache Nutch. After that we will look at the configuration of MySQL with Apache Nutch, and at the end we will crawl with Apache Nutch on MySQL. So let's start with the introduction to MySQL.

Summary

We covered the following:

Downloading Apache Hadoop and Apache Nutch
Performing crawling on an Apache Hadoop cluster in Apache Nutch
Apache Nutch configuration with Eclipse
Installation steps for building Apache Nutch with Eclipse
Crawling in Eclipse
Configuration of Apache Gora with Apache Nutch
Installation and configuration of Apache Accumulo
Crawling with Apache Nutch on Apache Accumulo
The need for integrating MySQL with Apache Nutch

Resources for Article: Further resources on this subject: Getting Started with the Alfresco Records Management Module [Article] Making Big Data Work for Hadoop and Solr [Article] Apache Solr PHP Integration [Article]


Setting goals

Packt
18 Dec 2013
8 min read
(For more resources related to this topic, see here.) You can use the following code to track the event: [Flurry logEvent:@"EVENT_NAME"]; The logEvent: method logs your event every time it's triggered during the application session. This method helps you to track how often that event is triggered. You can track up to 300 different event IDs. However, the length of each event ID should be less than 255 characters. After the event is triggered, you can track that event from your Flurry dashboard. As is explained in the following screenshot, your events will be listed in the Events section. After clicking on Event Summary, you can see a list of the events you have created along with the statistics of the generated data as shown in the following screenshot: You can fetch the detailed data by clicking on the event name (for example, USER_VIEWED). This section will provide you with a chart-based analysis of the data as shown in the following screenshot: The Events Per Session chart will provide you with details about how frequently a particular event is triggered in a session. Other than this, you are provided with the following data charts as well: Unique Users Performing Event: This chart will explain the frequency of unique users triggering the event. Total Events: This chart holds the event-generation frequency over a time period. You can access the frequency of the event being triggered over any particular time slot. Average Events Per Session: This chart holds the average frequency of the events that happen per session. There is another variation of this method, as shown in the following code, which allows you to track the events along with the specific user data provided: [Flurry logEvent:@"EVENT_NAME" withParameters:YOUR_NSDictionary]; This version of the logEvent: method counts the frequency of the event and records dynamic parameters in the form of dictionaries. External parameters should be in the NSDictionary format, whereas both the key and the value should be in the NSString object format. Let's say you want to track how frequently your comment section is used and see the comments; then you can use this method to track such events along with the parameters. You can track up to 300 different events with an event ID length less than 255 characters. You can provide a maximum of 10 event parameters per event. The following example illustrates the use of the logEvent: method along with optional parameters in the dictionary format: NSDictionary *dictionary = [NSDictionary dictionaryWithObjectsAndKeys:@"your dynamic parameter value", @"your dynamic parameter name", nil]; [Flurry logEvent:@"EVENT_NAME" withParameters:dictionary]; In case you want Flurry to log all your application sections/screens automatically, you should pass your navigationController or tabBarController as a parameter to count all your pages automatically, using one of the following lines of code: [Flurry logAllPageViews:navigationController]; [Flurry logAllPageViews:tabBarController]; The Flurry SDK will create a delegate on your navigationController or tabBarController object, whichever is provided, to detect the page's navigation. Each navigation detected will be tracked by the Flurry SDK automatically as a page view. You only need to pass each object to the Flurry SDK once. However, you can pass multiple instances of different navigation and tab bar controllers. There can be some cases where you have a view controller that is not associated with any navigation or tab bar controller.
Then you can use the following code: [Flurry logPageView]; The preceding code will track the event independently of navigation and tab bar controllers. For each user interaction you can manually log events. Tracking time spent Flurry allows you to track events based on the duration factor as well. You can use the [Flurry logEvent: timed:] method to log your event in time as shown in the following code: [Flurry logEvent:@"EVENT_NAME" timed:YES]; In case you want to pass additional parameters along with the event name, you can use the following variation of the logEvent: method to start a timed event with event parameters as shown in the following code: [Flurry logEvent:@"EVENT_NAME" withParameters:YOUR_NSDictionary timed:YES]; The aforementioned method can help you to track your timed event along with the dynamic data provided in the dictionary format. You can end all your timed events before the application exits. This can even be accomplished by updating the event with event parameters. If you want to end your events without updating the parameters, you can pass nil as the parameters. If you do not end your events, they will automatically end when the application exits, as shown in the following code: [Flurry endTimedEvent:@"EVENT_NAME" withParameters:YOUR_NSDictionary]; Let's take the following example in which you want to log an event whenever a user comments on any article in your application: NSDictionary *commentParams = [NSDictionary dictionaryWithObjectsAndKeys: @"User_Comment", @"Comment", // Capture comment info @"Registered", @"User_Status", // Capture user status nil]; [Flurry logEvent:@"User_Comment" withParameters:commentParams timed:YES]; // In a function that captures when a user posts the comment [Flurry endTimedEvent:@"Article_Read" withParameters:nil]; //You can pass in additional //params or update existing ones here as well The aforementioned piece of code will help you to log a timed event every time a user comments on an article in your application. While tracking the event, you are also tracking the comment and the user's registration status by specifying them in the dictionary. Tracking errors Flurry provides you with a method to track errors as well. You can use the following method to track errors on Flurry: [Flurry logError:@"ERROR_NAME" message:@"ERROR_MESSAGE" exception:e]; You can track exceptions and errors that occurred in the application by providing the name of the error (ERROR_NAME) along with the message, such as ERROR_MESSAGE, with an exception object. Flurry reports the first ten errors in each session. You can fetch all the application exceptions and specifically uncaught exceptions on Flurry. You can use the logError:message:exception: class method to catch all the uncaught exceptions. These exceptions will be logged in Flurry in the Error section, which is accessible on the Flurry dashboard: // Uncaught Exception Handler - sent through Flurry. void uncaughtExceptionHandler(NSException *exception) { [Flurry logError:@"Uncaught" message:@"Crash" exception:exception]; } - (void)applicationDidFinishLaunching:(UIApplication *)application { NSSetUncaughtExceptionHandler(&uncaughtExceptionHandler); [Flurry startSession:@"YOUR_API_KEY"]; // .... } Flurry also helps you to catch all the uncaught exceptions generated by the application. All the exceptions will be caught by using the NSSetUncaughtExceptionHandler method, to which you pass a function that will catch all the exceptions raised during the application session.
All the errors reported can also be tracked using the logError:message:error: method. You can pass the error name, message, and object to log the NSError error on Flurry as shown in the following code: - (void) webView:(UIWebView *)webView didFailLoadWithError:(NSError *)error { [Flurry logError:@"WebView No Load" message:[error localizedDescription] error:error]; } Tracking versions When you develop applications for mobile devices, it's obvious that you will evolve your application at every stage, pushing the latest updates for the application, which creates a new version of the application on the application store. To track the application based on these versions, you need to set up Flurry to track your application versions as well. This can be done using the following code: [Flurry setAppVersion:App_Version_Number]; By using the aforementioned method, you can track your application based on its version. For example, if you have released an application and unfortunately it has a critical bug, you can track your application based on the current version and the errors that are tracked by Flurry from the application. You can access data generated from Flurry's Dashboards by navigating to Flurry Classic. This will, by default, load a time-based graph of the application sessions for all versions. However, you can access the user session graph for a single version by selecting a version from the drop-down list as shown in the following screenshot: This is how the drop-down list will appear. Select a version and click on Update as shown in the following screenshot: The previous action will generate a version-based graph of user sessions, showing the number of times users have opened the app in the given time frame, as shown in the following screenshot: Along with that, Flurry also provides user retention graphs to gauge the number of users and the usage of the application over a period of time. Summary In this article we explored the ways to track the application on Flurry and to gather meaningful data on Flurry by setting goals to track the application. Then we learned how to track the time spent by users on the application, along with user data tracking and retention graphs to gauge the number of users. Resources for Article: Further resources on this subject: Data Analytics [Article] Learning Data Analytics with R and Hadoop [Article] Limits of Game Data Analysis [Article]

Managing IBM Cognos BI Server Components

Packt
12 Dec 2013
6 min read
(for more resources related to this topic, see here.) Cognos BI architecture The IBM Cognos 10.2 BI architecture is separated into the following three tiers: Web server (gateways) Applications (dispatcher and Content Manager) Data (reporting/querying the database, content store, metric store) Web server (gateways) The user starts a web session with Cognos to connect to the IBM Cognos Connection's web-based interface/application using the web browser (Internet Explorer and Mozilla Firefox are the currently supported browsers). This web request is sent to the web server where the Cognos gateway resides. The gateway is a server-software program that works as an intermediate party between the web server and other servers, such as an application server. The following diagram shows the basic view of the three tiers of the Cognos BI architecture: The Cognos gateway is the starting point from where a request is received and transferred to the BI Server. On receiving a request from the web server, the Cognos gateway applies encryption to the information received, adds necessary environment variables and authentication namespace, and transfers the information to the application server (or dispatcher). Similarly, when the data has been processed and the presentation is ready, it is rendered towards the user's browser via the gateway and web server. The following diagram shows the Tier 1 layer in detail: The gateways must be configured to communicate with the application component (dispatcher) in a distributed environment. To make a failover cluster, more than one BI Server may be configured. The following types of web gateways are supported: CGI: This is also the default gateway. This is a basic gateway. ISAPI: This is for the Windows environment. It is the best for Windows IIS (Internet Information Services). Servlet: This gateway is the best for application servers that are supporting servlets. Apache_mod: This gateway type may be used for the Apache server. The following diagram shows an environment in which the web server is load balanced by two server machines: To improve performance, gateways (if more than one) must be installed and configured on separate machines. The application tier (Cognos BI Server) The application tier comprises one or multiple BI Servers. A server's job is to run user requests, for example, queries, reports, and analysis that are received from a gateway. The GUI environment (IBM Cognos Connection) that appears after logging in is also rendered and presented by Cognos BI Server. Another such example is the Metric Studio interface. The BI Server must include the dispatcher and Content Manager (the Content Manager component may be separated from the dispatcher). The following diagram shows BI Server's Tier 2 in detail: Dispatcher The dispatcher has static handlers to many services. Each request that is received is routed to the corresponding service for further processing. The dispatcher is also responsible for starting all the Cognos services at startup. These services include the system service, report service, report data service, presentation service, Metric Studio service, log service, job service, event management service, Content Manager service, batch report service, delivery service, and many others. When there are multiple dispatchers in a multitier architecture, a dispatcher may also send and route requests to another dispatcher. The URIs for all dispatchers must be known to the Cognos gateway(s). 
All dispatchers are registered in Content Manager (CM), making it possible for all dispatchers to know each other. A dispatcher grid is formed in this way. To improve the system performance, multiple dispatchers must be installed but on separate computers, and the Content Manager component must also be on a separate server. The following diagram shows how multiple dispatcher servers can be added. Services for the BI Server (dispatcher) Each dispatcher has a set of services, which are listed alphabetically in the following table. When the Cognos service is started from Cognos Configuration, all services are started one by one. The following table shows the dispatcher services and their short descriptions: Service Description Agent service Runs the agent. Annotation service Adds comments to reports. Batch report service Handles background report requests. Content manager cache service Handles cache for frequent queries to enhance performance of Content Manager. Content manager service Performs DML in content store db. Cognos deployment is another task for this service. Delivery service For sending e-mails. Event management service Manages the Event Objects (creation, scheduling, and so on) Graphics service Renders graphics for other services such as report service. Human task service Manages human tasks. Index data service For basic full-text functions for storage and retrieval of terms and indexed summary documents. Index search service For search and drill-through functions, including lists of aliases and examples. Index update service For write, update, delete, and administration-related functions. Job service Runs jobs in coordination with the monitor service. Log service For extensive logging of the Cognos environment (file, database, remote-log server, event viewer, and system log). Metadata service For displaying data lineage information (data source, calculation expressions) for the Cognos studios and viewer. Metric studio service This service is used for providing a user interface to metric studio for monitoring and manipulating system KPIs. Migration service Used for migration from old versions to new versions, especially series 7. Monitor service Works as a timer service-it manages the monitoring and running of tasks that were scheduled or marked as background tasks. Helps in failover and recovery for running tasks. Presentation service This service prepares and displays the presentation layer by converting the XML data to HTML or any other format view. IBM Cognos Connection is also prepared by this service. Query service For managing dynamic query requests. Report data service This service prepares data for other applications; for example mobile, Microsoft Office, and so on. Report service Manages report requests. The output is displayed in IBM Cognos Connection. System service This service defines the BI-Bus API compliant service. It gives more data about the BI configuration parameters. Summary This article covered the IBM Cognos BI architecture. Now you must be familiar with the single tier and multitier architectures and a variety of features and options that Cognos provides. resources for article: further resources on this subject: IBM Cognos Insight [article] Integrating IBM Cognos TM1 with IBM Cognos 8 BI [article] IBM Cognos 10 BI dashboarding components [article]

Apache Solr PHP Integration

Packt
25 Nov 2013
7 min read
(For more resources related to this topic, see here.) We will be looking at installation on both Windows and Linux environments. We will be using the Solarium library for communication between Solr and PHP. This article will give a brief overview of the Solarium library and showcase some of the concepts and configuration options on Solr end for implementing certain features. Calling Solr using PHP code A ping query is used in Solr to check the status of the Solr server. The Solr URL for executing the ping query is http://localhost:8080/solr/collection1/admin/ping/?wt=json. Response of Solr ping query in browser We can use Curl to get the ping response from Solr via PHP code; a sample code for executing the previous ping query is as below $curl = curl_init("http://localhost:8080/solr/collection1/admin/ping/?wt=json"); curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1); $output = curl_exec($curl); $data = json_decode($output, true); echo "Ping Status : ".$data["status"].PHP_EOL; Though Curl can be used to execute almost any query on Solr, but it is preferable to use a library which does the work for us. In our case we will be using Solarium. To execute the same query on Solr using the Solarium library the code is as follows. include_once("vendor/autoload.php"); $config = array("endpoint" => array("localhost" => array("host"=>"127.0.0.1", "port"=>"8080", "path"=>"/solr", "core"=>"collection1",) ) ); We have included the Solarium library in our code. And defined the connection parameters for our Solr server. Next we will need to create a Solarium client with the previous Solr configuration. And call the createPing() function to create the ping query. $client = new SolariumClient($config); $ping = $client->createPing(); Finally execute the ping query and get the result. $result = $client->ping($ping); $result->getStatus(); The output should be similar to the one shown below. Output of ping query using PHP Adding documents to Solr index To create a Solr index, we need to add documents to the Solr index using the command line, Solr web interface or our PHP program. But before we create a Solr index, we need to define the structure or the schema of the Solr index. Schema consists of fields and field types. It defines how each field will be treated and handled during indexing or during search. Let us see a small piece of code for adding documents to the Solr index using PHP and Solarium library. Create a solarium client. Create an instance of the update query. Create the document in PHP and finally add fields to the document. $client = new SolariumClient($config); $updateQuery = $client->createUpdate(); $doc1 = $updateQuery->createDocument(); $doc1->id = 112233445; $doc1->cat = 'book'; $doc1->name = 'A Feast For Crows'; $doc1->price = 8.99; $doc1->inStock = 'true'; $doc1->author = 'George R.R. Martin'; $doc1->series_t = '"A Song of Ice and Fire"'; Id field has been marked as unique in our schema. So we will have to keep different values for Id field for different documents that we add to Solr. Add documents to the update query followed by commit command. Finally execute the query. $updateQuery->addDocuments(array($doc1)); $updateQuery->addCommit(); $result = $client->update($updateQuery); Let us execute the code. php insertSolr.php After executing the code, a search for martin will give these documents in the result. 
http://localhost:8080/solr/collection1/select/?q=martin Document added to Solr index Executing search on Solr Index Documents added to the Solr index can be searched using the following piece of PHP code. $selectConfig = array( 'query' => 'cat:book AND author:Martin', 'start' => 3, 'rows' => 3,'fields' => array('id','name','price','author'), 'sort' => array('price' => 'asc') ); $query = $client->createSelect($selectConfig); $resultSet = $client->select($query); The above code creates a simple Solr query and searches for book in cat field and Martin in author field. The results are sorted in ascending order or price and fields returned are id, name of book, price and author of book. Pagination has been implemented as 3 results per page, so this query returns results for 2nd page starting from 3rd result. In addition to this simple select query, Solr also supports some advanced query modes known as dismax and edismax. With the help of these query modes, we can boost certain fields to give more importance to certain fields in our query. We can also use function queries to do some type of dynamic boosting based on values in fields. If no sorting is provided, the Solr results are sorted by the score of documents which are calculated based on the terms in the query and the matching terms in the documents in the index. Score is calculated for each document in the result set using two main factors - term frequency known as tf and inverse document frequency known as idf. In addition to these, Solr provides a way of narrowing down the results using filter queries. Also facets can be created based on fields in the index and it can be used by the end users to narrow down the results. Highlighting search results using PHP and Solr Solr can be used to highlight the fields returned in a search result based on the query. Here is a sample code for highlighting the results for search keyword harry. $query->setQuery('harry'); $query->setFields(array('id','name','author','series_t','score','last_modified')); Get the highlighting component from the query, set the fields to be highlighted and also set the html tags to be used for highlighting. $hl = $query->getHighlighting(); $hl->setFields('name,series_t'); $hl->setSimplePrefix('<strong>')->setSimplePostfix('</strong>'); Once the query is run and result set is received, we will need to retrieve the highlighted results from the result set. Here is the output for the highlighting code. Highlighted search results In addition to highlighting, Solr can be used to create a spelling suggester and a spell checker. Spelling suggester can be used to prompt input query to the end user as the user keeps on typing. Spell check can be used to prompt spelling corrections similar to 'did you mean' to the user. Solr can also be used for finding documents which are similar to a certain document based on words in certain fields. This functionality of Solr is known as more like this and is exposed via Solarium by the MoreLikeThis component. Solr also provides grouping of the result based on a particular query or a certain field. Scaling Solr Solr can be scaled to handle large number of search requests by using master slave architecture. Also if the index is huge, it can be sharded across multiple Solr instances and we can run a distributed search to get results for our query from all the sharded instances. Solarium provides a load balancing plug-in which can be used to load balance queries across master-slave architecture. 
Summary Solr provides an extensive list of features for implementing search. These features can be easily accessed in PHP using the Solarium library to build a full features search application which can be used to power search on any website. Resources for Article: Further resources on this subject: Apache Solr Configuration [Article] Getting Started with Apache Solr [Article] Making Big Data Work for Hadoop and Solr [Article]

Plotting Charts with Images and Maps

Packt
22 Nov 2013
9 min read
(For more resources related to this topic, see here.) This article explores how to work with images and maps. Python has some well-known image libraries that allow us to process images in both aesthetic and scientific ways. We will touch on PIL's capabilities by demonstrating how to process images by applying filters and by resizing them. Furthermore, we will show how to use image files as annotation for our matplotlibs' charts. To deal with data visualization of geospatial datasets, we will cover the functionality of Python's available libraries and public APIs that we can use with map-based visual representations. The final recipe shows how Python can create CAPTCHA test images. Processing images with PIL Why use Python for image processing, if we could use WIMP (http://en.wikipedia.org/wiki/WIMP_(computing)) or WYSIWYG (http://en.wikipedia.org/wiki/WYSIWYG) to achieve the same goal? This is used because we want to create an automated system to process images in real time without human support, thus, optimizing the image pipeline. Getting ready Note that the PIL coordinate system assumes that the (0,0) coordinate is in the upper-left corner. The Image module has a useful class and instance methods to perform basic operations over a loaded image object (im): im = Image.open(filename): This opens a file and loads the image into im object. im.crop(box): This crops the image inside the coordinates defined by box. box defines left, upper, right, lower pixels coordinates (for example: box = (0, 100, 100,100)). im.filter(filter): This applies a filter on the image and returns a filtered image. im.histogram(): This returns a histogram list for this image, where each item represents the number of pixels. Number of items in the list is 256 for single channel images, but if the image is not a single channel image, there can be more items in the list. For an RGB image the list contains 768 items (one set of 256 values for each channel). im.resize(size, filter): This resizes the image and uses a filter for resampling. The possible filters are NEAREST, BILINEAR, BICUBIC, and ANTIALIAS. The default is NEAREST. im.rotate(angle, filter): This rotates an image in the counter clockwise direction. im.split(): This splits bands of image and returns a tuple of individual bands. Useful for splitting an RGB image into three single band images. im.transform(size, method, data, filter): This applies transformation on a given image using data and a filter. Transformation can be AFFINE, EXTENT, QUAD, and MESH. You can read more about transformation in the official documentation. Data defines the box in the original image where the transformation will be applied. The ImageDraw module allows us to draw over the image, where we can use functions such as arc, ellipse, line, pieslice, point, and polygon to modify the pixels of the loaded image. The ImageChops module contains a number of image channel operations (hence the name Chops) that can be used for image composition, painting, special effects, and other processing operations. Channel operations are allowed only for 8-bit images. Here are some interesting channel operations: ImageChops.duplicate(image): This copies current image into a new image object ImageChops.invert(image): This inverts an image and returns a copy ImageChops.difference(image1, image2): This is useful for verification that images are the same without visual inspection The ImageFilter module contains the implementation of the kernel class that allows the creation of custom convolution kernels. 
This module also contains a set of healthy common filters that allows the application of well-known filters (BLUR and MedianFilter) to our image. There are two types of filters provided by the ImageFilter module: fixed image enhancement filters and image filters that require certain arguments to be defined; for example, size of the kernel to be used. We can easily get the list of all fixed filter names in IPython: In [1]: import ImageFilter In [2]: [ f for f in dir(ImageFilter) if f.isupper()] Out[2]: ['BLUR', 'CONTOUR', 'DETAIL', 'EDGE_ENHANCE', 'EDGE_ENHANCE_MORE', 'EMBOSS', 'FIND_EDGES', 'SHARPEN', 'SMOOTH', 'SMOOTH_MORE'] The next example shows how we can apply all currently supported fixed filters on any supported image: import os import sys from PIL import Image, ImageChops, ImageFilter class DemoPIL(object): def __init__(self, image_file=None): self.fixed_filters = [ff for ff in dir(ImageFilter) if ff.isupper()] assert image_file is not None assert os.path.isfile(image_file) is True self.image_file = image_file self.image = Image.open(self.image_file) def _make_temp_dir(self): from tempfile import mkdtemp self.ff_tempdir = mkdtemp(prefix="ff_demo") def _get_temp_name(self, filter_name): name, ext = os.path.splitext(os.path.basename(self.image_file)) newimage_file = name + "-" + filter_name + ext path = os.path.join(self.ff_tempdir, newimage_file) return path def _get_filter(self, filter_name): # note the use Python's eval() builtin here to return function object real_filter = eval("ImageFilter." + filter_name) return real_filter def apply_filter(self, filter_name): print "Applying filter: " + filter_name filter_callable = self._get_filter(filter_name) # prevent calling non-fixed filters for now if filter_name in self.fixed_filters: temp_img = self.image.filter(filter_callable) else: print "Can't apply non-fixed filter now." return temp_img def run_fixed_filters_demo(self): self._make_temp_dir() for ffilter in self.fixed_filters: temp_img = self.apply_filter(ffilter) temp_img.save(self._get_temp_name(ffilter)) print "Images are in: {0}".format((self.ff_tempdir),) if __name__ == "__main__": assert len(sys.argv) == 2 demo_image = sys.argv[1] demo = DemoPIL(demo_image) # will create set of images in temporary folder demo.run_fixed_filters_demo() We can run this easily from the command prompt: $ pythonch06_rec01_01_pil_demo.py image.jpeg We packed our little demo in the DemoPIL class, so we can extend it easily while sharing the common code around the demo function run_fixed_filters_demo. Common code here includes opening the image file, testing if the file is really a file, creating temporary directory to hold our filtered images, building the filtered image filename, and printing useful information to user. This way the code is organized in a better manner and we can easily focus on our demo function, without touching other parts of the code. This demo will open our image file and apply every fixed filter available in ImageFilter to it and save that new filtered image in a unique temporary directory. The location of this temporary directory is retrieved, so we can open it with our OS's file explorer and view the created images. As an optional exercise, try extending this demo class to perform other filters available in ImageFilter on the given image. How to do it... The example in this section shows how we can process all the images in a certain folder. 
We specify a target path, and the program reads all the image files in that target path (images folder) and resizes them to a specified ratio (0.1 in this example), and saves each one in a target folder called thumbnail_folder: import os import sys from PIL import Image class Thumbnailer(object): def __init__(self, src_folder=None): self.src_folder = src_folder self.ratio = .3 self.thumbnail_folder = "thumbnails" def _create_thumbnails_folder(self): thumb_path = os.path.join(self.src_folder, self.thumbnail_folder) if not os.path.isdir(thumb_path): os.makedirs(thumb_path) def _build_thumb_path(self, image_path): root = os.path.dirname(image_path) name, ext = os.path.splitext(os.path.basename(image_path)) suffix = ".thumbnail" return os.path.join(root, self.thumbnail_folder, name + suffix + ext) def _load_files(self): files = set() for each in os.listdir(self.src_folder): each = os.path.abspath(self.src_folder + '/' + each) if os.path.isfile(each): files.add(each) return files def _thumb_size(self, size): return (int(size[0] * self.ratio), int(size[1] * self.ratio)) def create_thumbnails(self): self._create_thumbnails_folder() files = self._load_files() for each in files: print "Processing: " + each try: img = Image.open(each) thumb_size = self._thumb_size(img.size) resized = img.resize(thumb_size, Image.ANTIALIAS) savepath = self._build_thumb_path(each) resized.save(savepath) except IOError as ex: print "Error: " + str(ex) if __name__ == "__main__": # Usage: # ch06_rec01_02_pil_thumbnails.py my_images assert len(sys.argv) == 2 src_folder = sys.argv[1] if not os.path.isdir(src_folder): print "Error: Path '{0}' does not exits.".format((src_folder)) sys.exit(-1) thumbs = Thumbnailer(src_folder) # optionally set the name of theachumbnail folder relative to *src_folder*. thumbs.thumbnail_folder = "THUMBS" # define ratio to resize image to # 0.1 means the original image will be resized to 10% of its size thumbs.ratio = 0.1 # will create set of images in temporary folder thumbs.create_thumbnails() How it works... For the given src_folder folder, we load all the files in this folder and try to load each file using Image.open(); this is the logic of the create_thumbnails() function. If the file we try to load is not an image, IOError will be thrown, and it will print this error and skip to next file in the sequence. If we want to have more control over what files we load, we should change the _load_files() function to only include files with certain extension (file type): for each in os.listdir(self.src_folder): if os.path.isfile(each) and os.path.splitext(each) is in ('.jpg','.png'): self._files.add(each) This is not foolproof, as file extension does not define file type, it just helps the operating system to attach a default program to the file, but it works in majority of the cases and is simpler than reading a file header to determine the file content (which still does not guarantee that the file really is the first couple of bytes, say it is). There's more... With PIL, although not used very often, we can easily convert images from one format to the other. This is achievable with two simple operations: first open an image in a source format using open(), and then save that image in the other format using save(). Format is defined either implicitly via filename extension (.png or .jpeg), or explicitly via the format of the argument passed to the save() function.

Preparation Analysis of Data Source

Packt
22 Nov 2013
6 min read
(For more resources related to this topic, see here.) List of data sources Here, the wide varieties of data sources that are supported in the PowerPivot interface are given in brief. The vital part is to install providers such as OLE DB and ODBC that support the existing data source, because when installing the PowerPivot add-in, it will not install the provider too and some providers might already be installed with other applications. For instance, if there is a SQL Server installed in the system, the OLE DB provider will be installed with the SQL Server, so that later on it wouldn't be necessary to install OLE DB while adding data sources by using the SQL Server as a data source. Hence, make sure to verify the provider before it is added as a data source. Perhaps the provider is required only for relational database data sources. Relational database By using RDBMS, you can import tables and views into the PowerPivot workbook. The following is a list of various data sources: Microsoft SQL Server Microsoft SQL Azure Microsoft SQL Server Parallel Data Warehouse Microsoft Access Oracle Teradata Sybase Informix IBM DB2 Other providers (OLE DB / ODBC) Multidimensional sources Multidimensional data sources can only be added from Microsoft Analysis Services (SSAS). Data feeds The three types of data feeds are as follows: SQL Server Reporting Service (SSRS) Azure DataMarket dataset Other feeds such as atom service documents and single data feed Text files The two types of text files are given as follows: Excel file Text file Purpose of importing data from a variety of sources In order to make a decision about a particular subject area, you should analyze all the required data that is relevant to the subject area. If the data is stored in a variety of data sources, importing the data from different data sources has to be done. If all the data is only in one data source, only the data needs to be imported from the required tables for the subject and then various types of analysis can be done. The reason why users need to import data from different data sources is that they would then have an ample amount of data when they need to make any analysis. Another generic reason would be to cross-analyze data from different business systems such as Customer Relationship Management (CRM) and Campaign Management System (CMS). Data sourcing from only one source wouldn't be as sophisticated as an analysis done from different data sources as the amount of data from which the analysis was done for multisourced data is more detailed than the data from only a single source. It also might reveal conflicts between multiple data sources. In real time, usually in the e-commerce industry, blogs, and forum websites wouldn't ask for more details about customers at the time of registration, because the time consumed for long registrations would discourage the user, leading to cancellation the of the order. For instance, the customer table that would be stored in the database of an e-commerce industry would contain the following attributes: Customer FirstName LastName E-mail BirthDate Zip Code Gender However, this kind of industry needs to know their customers more in order to increase their sales. Since the industry only saves a few attributes about the customer during registration, it is difficult to track down the customers and it is even more difficult to advertise according to individual customers. 
Therefore, in order to find some relevant data about the customers, the e-commerce industries try to make another data source using the Internet or other types of sources. For instance, by using the Postcode given by the customer during registration, the user can get Country | State | City from various websites and then use the information obtained to make a table format either in Excel or CSV, as follows: Location Postcode City State Country So, finally the user would have two sources—one source is from the user's RDBMS database and the other source is from the Excel file created later—both of these can be used in order to make a customer analysis based on their location. General overview of ETL The main goal is to facilitate the development of data migration applications by applying data transformations. Extraction Transform Load (ETL) comprises of the first step in a data warehouse process. It is the most important and the hardest part, because it determines the quality of the data warehouse and the scope of the analyses the users will be able to build upon it. Let us discuss in detail what ETL is. The first substep is extraction. As the users would want to regroup all the information a company has, they will have to collect the data from various heterogeneous sources (operational data such as databases and/or external data such as CSV files) and various models (relational, geographical, web pages, and so on). The difficulty starts here. As data sources are heterogeneous, formats are usually different, and the users would have different types of data models for similar information (different names and/or formats) and different keys for the same objects. These are some of the main problems. The aim of the transformation task is to correct such problems, as much as possible. Let us now see what a transformation task is. The users would need to transform heterogeneous data of different sources into homogeneous data. Here are some examples of what they can do: Extraction of the data sources Identification of relevant data sources Filtering of non-admissible data Modification of formats or values Summary This article shows the users how to prepare data for analysis. We also covered different types of data sources that can be imported into the PowerPivot interface. Some information about the provider and a general overview of the ETL process has also been given. Users now know how the ETL process works in the PowerPivot interface; also, users were shown an introduction and the advantages of DAX. Resources for Article: Further resources on this subject: Creating a pivot table [Article] What are SSAS 2012 dimensions and cube? [Article] An overview of Oracle Hyperion Interactive Reporting [Article]

Common QlikView script errors

Packt
22 Nov 2013
3 min read
(For more resources related to this topic, see here.) QlikView error messages displayed during the running of the script, during reload, or just after the script is run are key to understanding what errors are contained in your code. After an error is detected and the error dialog appears, review the error, and click on OK or Cancel on the Script Error dialog box. If you have the debugger open, click on Close, then click on Cancel on the Sheet Properties dialog. Re-enter the Script Editor and examine your script to fix the error. Errors can come up as a result of syntax, formula or expression errors, join errors, circular logic, or any number of issues in your script. The following are a few common error messages you will encounter when developing your QlikView script. The first one, illustrated in the following screenshot, is the syntax error we received when running the code that missed a comma after Sales. This is a common syntax error. It's a little bit cryptic, but the error is contained in the code snippet that is displayed. The error dialog does not exactly tell you that it expected a comma in a certain place, but with practice, you will realize the error quickly. The next error is a circular reference error. This error will be handled automatically by QlikView. You can choose to accept QlikView's fix of loosening one of the tables in the circular reference (view the data model in Table Viewer for more information on which table is loosened, or view the Document Properties dialog, Tables tab to find out which table is marked Loosely Coupled). Alternatively, you can choose another table to be loosely coupled in the Document Properties, Tables tab, or you can go back into the script and fix the circular reference with one of the methods. The following screenshot is a warning/error dialog displayed when you have a circular reference in a script: Another common issue is an unknown statement error that can be caused by an error in writing your script—missed commas, colons, semicolons, brackets, quotation marks, or an improperly written formula. In the case illustrated in the following screenshot, the error has encountered an unknown statement—namely, the Customers line that QlikView is attempting to interpret as Customers Load *…. The fix for this error is to add a colon after Customers in the following way: Customers: There are instances when a load script will fail silently. Attempting to store a QVD or CSV to a file that is locked by another user viewing it is one such error. Another example is when you have two fields with the same name in your load statement. The debugger can help you find the script lines in which the silent error is present. Summary In this article we learned about QlikView error messages displayed during the script execution. Resources for Article: Further resources on this subject: Meet QlikView [Article] Introducing QlikView elements [Article] Linking Section Access to multiple dimensions [Article]

Portlet

Packt
22 Nov 2013
14 min read
(For more resources related to this topic, see here.) The Spring MVC portlet The Spring MVC portlet follows the Model-View-Controller design pattern. The model refers to objects that imply business rules. Usually, each object has a corresponding table in the database. The view refers to JSP files that will be rendered into the HTML markup. The controller is a Java class that distributes user requests to different JSP files. A Spring MVC portlet usually has the following folder structure: In the previous screenshot, there are two Spring MVC portlets: leek-portlet and lettuce-portlet. You can see that the controller classes are clearly named as LeekController.java and LettuceController.java. The JSP files for the leek portlet are view/leek/leek.jsp, view/leek/edit.jsp, and view/leek/help.jsp. The definition of the leek portlet in the portlet.xml file is as follows: <portlet> <portlet-name>leek</portlet-name> <display-name>Leek</display-name> <portlet-class>org.springframework.web.portlet.DispatcherPortlet</portlet-class> <init-param> <name>contextConfigLocation</name> <value>/WEB-INF/context/leek-portlet.xml</value> </init-param> <supports> <mime-type>text/html</mime-type> <portlet-mode>view</portlet-mode> <portlet-mode>edit</portlet-mode> <portlet-mode>help</portlet-mode> </supports> ... <supported-publishing-event> <qname >x:ipc.share</qname> </supported-publishing-event> </portlet> You can see from the previous code that the portlet class for a Spring MVC portlet is the org.springframework.web.portlet.DispatcherPortlet.java class. When a Spring MVC portlet is called, this class runs. It also calls the WEB-INF/context/leek/portlet.xml file and initializes the singletons defined in that file when the leek portlet is deployed. The leek portlet supports the view, edit, and help mode. It can also fire a portlet event with ipc.share as its qualified name. Yo can use the method to import the leek and lettuce portlets (whose source code can be downloaded from the Packt site) to your Liferay IDE. Then, carry out the following steps: Deploy the leek-portlet package and wait until the leek portlet and lettuce portlet are registered by the Liferay Portal. Log in as the portal administrator and add the two Spring MVC portlets onto a portal page. Your portal page should look similar to the following screenshot: The default view of the leek portlet comes from the view/leek/leek.jsp file whose logic is defined through the following method in the com.uibook.leek.portlet.LeekController.java class: @RequestMapping public String render(Model model, SessionStatus status, RenderRequest req) { return "leek"; } This method calls the view/leek/leek.jsp file. In the default view of the leek portlet, when you click on the radio button for Snow water from last winter and then on the Get Water button, the following form will be submitted: <form action="http://localhost:8080/web/uibook/home?p_auth=wwMoBV4C&p_p_id=leek_WAR_leekportlet&p_p_lifecycle=1&p_p_state=normal&p_p_mode=view&p_p_col_id=column-2&p_p_col_pos=1&p_p_col_count=2&_leek_WAR_leekportlet_action=sprayWater" id="_leek_WAR_leekportlet_leekFm" method="post" name="_leek_WAR_leekportlet_leekFm"> This form will fire an action URL because p_p_lifecycle is equal to 1. 
As the action name is sprayWater in the URL, the DispatcherPortlet.java class (as specified in the portlet.xml file) calls the following method: @ActionMapping(params="action=sprayWater") public void sprayWater(ActionRequest request, ActionResponse response, SessionStatus sessionStatus) { String waterType = request.getParameter("waterSupply"); if(waterType != null){ request.setAttribute("theWaterIs", waterType); sessionStatus.setComplete(); } } This method simply gets the value for the waterSupply parameter as specified in the following code, which comes from the view/leek/leek.jsp file: <input type="radio" name="<portlet:namespace />waterSupply" value="snow water from last winter">Snow water from last winter The value is snow water from last winter, which is set as a request attribute. As the previous sprayWater(…) method does not specify a request parameter for a JSP file to be rendered, the logic goes to the default view of the leek portlet. So, the view/leek/leek.jsp file will be rendered. Here, as you can see, the two-phase logic is retained in the Spring MVC portlet, as has been explained in the Understanding a simple JSR-286 portlet section of this article. Now the theWaterIs request attribute has a value, which is snow water from last winter. So, the following code in the leek.jsp file runs and displays the Please enjoy some snow water from last winter. message, as shown in the previous screenshot: <c:if test="${not empty theWaterIs}"> <p>Please enjoy some ${theWaterIs}.</p> </c:if> In the previous screenshot, the Passing you a gift... link is rendered with the following code in the leek.jsp file: <a href="<portlet:actionURL name="shareGarden"></portlet:actionURL>">Passing you a gift ...</a> When this link is clicked, an action URL named shareGarden is fired. So, the DispatcherPortlet.java class will call the following method: @ActionMapping("shareGarden") public void pitchBallAction(SessionStatus status, ActionResponse response) { String elementType = null; Random random = new Random(System.currentTimeMillis()); int elementIndex = random.nextInt(3) + 1; switch(elementIndex) { case 1 : elementType = "sunshine"; break; ... } QName qname = new QName("http://uibook.com/events","ipc.share"); response.setEvent(qname, elementType); status.setComplete(); } This method gets a value for elementType (the type of water in our case) and sends out this elementType value to another portlet based on the ipc.share qualified name. The lettuce portlet has been defined in the portlet.xml file as follows to receive such a portlet event: <portlet> <portlet-name>lettuce</portlet-name> ... <supported-processing-event> <qname >x:ipc.share</qname> </supported-processing-event> </portlet> When the ipc.share portlet event is sent, the portal page refreshes. Because the lettuce portlet is on the same page as the leek portlet, the portlet event is received by the following method in the com.uibook.lettuce.portlet.LettuceController.java class: @EventMapping(value ="{http://uibook.com/events}ipc.share") public void receiveEvent(EventRequest request, EventResponse response, ModelMap map) { Event event = request.getEvent(); String element = (String)event.getValue(); map.put("element", element); response.setRenderParameter("element", element); } This receiveEvent(…) method receives the ipc.share portlet event, gets the value in the event (which can be sunshine, rain drops, wind, or space), and puts it in the ModelMap object with element as the key. 
Now, the following code in the view/lettuce/lettuce.jsp file runs: <c:choose> <c:when test="${empty element}"> <p> Please share the garden with me! </p> </c:when> <c:otherwise> <p>Thank you for the ${element}!</p> </c:otherwise> </c:choose> As the element parameter now has a value, a message similar to Thank you for the wind will show in the lettuce portlet. The wind is a gift from the leek to the lettuce portlet. In the default view of the leek portlet, there is a Some shade, please! button. This button is implemented with the following code in the view/leek/leek.jsp file: <button type="button" onclick="<portlet:namespace />loadContentThruAjax();">Some shade, please!</button> When this button is clicked, a _leek_WAR_leekportlet_loadContentThruAjax() JavaScript function will run: function <portlet:namespace />loadContentThruAjax() { ... document.getElementById("<portlet:namespace />content").innerHTML=xmlhttp.responseText; ... xmlhttp.open('GET','<portlet:resourceURL escapeXml="false" id="provideShade"/>',true); xmlhttp.send(); } This loadContentThruAjax() function is an Ajax call. It fires a resource URL whose ID is provideShade. It maps the following method in the com.uibook.leek.portlet.LeekController.java class: @ResourceMapping(value = "provideShade") public void provideShade(ResourceRequest resourceRequest, ResourceResponse resourceResponse) throws PortletException, IOException { resourceResponse.setContentType("text/html"); PrintWriter out = resourceResponse.getWriter(); StringBuilder strB = new StringBuilder(); strB.append("The banana tree will sway its leaf to cover you from the sun."); out.println(strB.toString()); out.close(); } This method simply sends the The banana tree will sway its leaf to cover you from the sun message back to the browser. The previous loadContentThruAjax() method receives this message, inserts it in the <div id="_leek_WAR_leekportlet_content"></div> element, and shows it. About the Vaadin portlet Vaadin is an open source web application development framework. It consists of a server-side API and a client-side API. Each API has a set of UI components and widgets. Vaadin has themes for controlling the appearance of a web page. Using Vaadin, you can write a web application purely in Java. A Vaadin application is like a servlet. However, unlike the servlet code, Vaadin has a large set of UI components, controls, and widgets. For example, in correspondence to the <table> HTML element, the Vaadin API has a com.vaadin.ui.Table.java class. The following is a comparison between servlet table implementation and Vaadin table implementation: Servlet Code Vaadin Code PrintWriter out = response.getWriter(); out.println("<table>n" +    "<tr>n" +    "<td>row 2, cell 1</td>n" +    "<td>row 2, cell 2</td>" +    "</tr>n" +    "</table>"); sample = new Table(); sample.setSizeFull(); sample.setSelectable(true); ... sample.setColumnHeaders(new String[] { "Country", "Code" });   Basically, if there is a label element in HTML, there is a corresponding Label.java class in Vaadin. In the sample Vaadin code, you will find the use of the com.vaadin.ui.Button.java and com.vaadin.ui.TextField.java classes. Vaadin supports portlet development based on JSR-286. 
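One piece of this inter-portlet flow is not shown above: the render method on the lettuce side that moves the element render parameter into the model for lettuce.jsp. The following is a minimal sketch only, assuming LettuceController follows the same Spring Portlet MVC conventions as LeekController; it is not the article's actual source code.

// Hypothetical render method for LettuceController.java (illustrative sketch).
// It reads the render parameter set by receiveEvent(...) and exposes it to
// view/lettuce/lettuce.jsp under the "element" key.
// Requires org.springframework.web.bind.annotation.RequestParam and
// org.springframework.ui.ModelMap imports.
@RequestMapping
public String render(
        @RequestParam(value = "element", required = false) String element,
        ModelMap map) {
    if (element != null) {
        map.put("element", element);
    }
    return "lettuce";
}

With a method along these lines, the JSP's ${element} expression resolves against the render-phase model, which is why the thank-you message only appears after the event has been received.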
Vaadin support in Liferay Portal Starting with Version 6.0, the Liferay Portal was bundled with the Vaadin Java API, themes, and a widget set described as follows: ${APP_SERVER_PORTAL_DIR}/html/VAADIN/themes/ ${APP_SERVER_PORTAL_DIR}/html/VAADIN/widgetsets/ ${APP_SERVER_PORTAL_DIR}/WEB-INF/lib/vaadin.jar A Vaadin control panel for the Liferay Portal is also available for download. It can be used to rebuild the widget set when you install new add-ons in the Liferay Portal. In the ${LPORTAL_SRC_DIR}/portal-impl/src/portal.properties file, we have the following Vaadin-related setting: vaadin.resources.path=/html vaadin.theme=liferay vaadin.widgetset=com.vaadin.portal.gwt.PortalDefaultWidgetSet In this section, we will discuss two Vaadin portlets. These two Vaadin portlets are run and tested in Liferay Portal 6.1.20 because, at the time of writing, the support for Vaadin is not available in the new Liferay Portal 6.2. It is expected that when the Generally Available (GA) version of Liferay Portal 6.2 is available, the support for Vaadin portlets in the new Liferay Portal 6.2 will be ready. Vaadin portlet for CRUD operations CRUD stands for create, read, update, and delete. We will use a peanut portlet to illustrate the organization of a Vaadin portlet. In this portlet, a user can create, read, update, and delete data. This portlet is adapted from a SimpleAddressBook portlet from a Vaadin demo. Its structure is as shown in the following screenshot: You can see that it does not have JSP files. The view, model, and controller are all incorporated in the PeanutApplication.java class. Its portlet.xml file has the following content: <portlet-class>com.vaadin.terminal.gwt.server.ApplicationPortlet2</portlet-class> <init-param> <name>application</name> <value>peanut.PeanutApplication</value> </init-param> This means that when the Liferay Portal calls the peanut portlet, the com.vaadin.terminal.gwt.server.ApplicationPortlet2.java class will run. This ApplicationPortlet2.java class will in turn call the peanut.PeanutApplication.java class, which will retrieve data from the database and generate the HTML markup. The default UI of the peanut portlet is as follows: This default UI is implemented with the following code: HorizontalSplitPanel splitPanel = new HorizontalSplitPanel(); setMainWindow(new Window("Address Book", splitPanel)); VerticalLayout left = new VerticalLayout(); left.setSizeFull(); left.addComponent(contactList); contactList.setSizeFull(); left.setExpandRatio(contactList, 1); splitPanel.addComponent(left); splitPanel.addComponent(contactEditor); splitPanel.setHeight("450"); contactEditor.setCaption("Contact details editor"); contactEditor.setSizeFull(); contactEditor.getLayout().setMargin(true); contactEditor.setImmediate(true); bottomLeftCorner.setWidth("100%"); left.addComponent(bottomLeftCorner); The previous code comes from the initLayout() method of the PeanutApplication.java class. This method is run when the portal page is first loaded. The new Window("Address Book", splitPanel) statement instantiates a window area, which is the whole portlet UI. This window is set as the main window of the portlet; every portlet has a main window. The splitPanel attribute splits the main window into two equal parts vertically; it is like the 2 Columns (50/50) page layout of Liferay. 
The splitPanel.addComponent(left) statement adds the contact information table to the left pane of the main window, while the splitPanel.addComponent(contactEditor) statement adds the contact details of the editor to the right pane of the main window. The left variable is a com.vaadin.ui.VerticalLayout.java object. In the left.addComponent(bottomLeftCorner) statement, the left object adds a bottomLeftCorner object to itself. The bottomLeftCorner object is a com.vaadin.ui.HorizontalLayout.java object. It takes the space across the left vertical layout under the contact information table. This bottomLeftCorner horizontal layout will house the contact-add button and the contact-remove button. The following screenshot gives you an idea of how the screen will look: When the + icon is clicked, a button click event will be fired which runs the following code: Object id = ((IndexedContainer) contactList.getContainerDataSource()).addItemAt(0); contactList.getItem(id).getItemProperty("First Name").setValue("John"); contactList.getItem(id).getItemProperty("Last Name").setValue("Doe"); This code adds an entry in the contactList object (contact information table) initializing the contact's first name to John and the last name to Doe. At the same time, the ValueChangeListener property of the contactList object is triggered and runs the following code: contactList.addListener(new Property.ValueChangeListener() { public void valueChange(ValueChangeEvent event) { Object id = contactList.getValue(); contactEditor.setItemDataSource(id == null ? null : contactList .getItem(id)); contactRemovalButton.setVisible(id != null); } }); This code populates the contactEditor variable, a com.vaadin.ui.Form.Form.java object, with John Doe's contact information and displays the Contact details editor section in the right pane of the main window. After that, an end user can enter John Doe's other contact details. The end user can also update John Doe's first and last names. If you have noticed, the last statement of the previous code snippet mentions contactRemovalButton. At this time, the John Doe entry in the contact information table is highlighted. If the end user clicks on the contact removal button, this information will be removed from both the contact information table and the contact details editor. Actually, the end user can highlight any entry in the contact information table and edit or delete it. You may have seen that during the whole process of creating, reading, updating, and deleting the contact, the portal page URL did not change and the portal page did not refresh. All the operations were performed through Ajax calls to the application server. This means that only a few database accesses happened during the whole process. This improves the site performance and reduces load on the application server. It also implies that if you develop Vaadin portlets in the Liferay Portal, you do not have to know the friendly URL configuration skill on a Liferay Portal project. In the peanut portlet, a developer cannot retrieve the logged-in user in the code, which is a weak point. In the following section, a potato portlet is implemented in such a way that a developer can retrieve the Liferay Portal information, including the logged-in user information. Summary In this article, we learned about portlets and their development. We learned ways todevelop simple JSR 286 portlets, SpringMVC portlets, and Vaadin portlets. We also learned to implement the view, edit, and help modes of a portlet. 
Resources for Article: Further resources on this subject: Setting up and Configuring a Liferay Portal [Article] Liferay, its Installation and setup [Article] Building your First Liferay Site [Article]
Basic Concepts and Architecture of Cassandra

Packt
21 Nov 2013
7 min read
(For more resources related to this topic, see here.) CAP theorem If you want to understand Cassandra, you first need to understand the CAP theorem. The CAP theorem (published by Eric Brewer at the University of California, Berkeley) basically states that it is impossible for a distributed system to provide you with all of the following three guarantees: Consistency: Updates to the state of the system are seen by all the clients simultaneously Availability: Guarantee of the system to be available for every valid request Partition tolerance: The system continues to operate despite arbitrary message loss or network partition Cassandra provides users with stronger availability and partition tolerance with tunable consistency tradeoff; the client, while writing to and/or reading from Cassandra, can pass a consistency level that drives the consistency requirements for the requested operations. BigTable / Log-structured data model In a BigTable data model, the primary key and column names are mapped with their respective bytes of value to form a multidimensional map. Each table has multiple dimensions. Timestamp is one such dimension that allows the table to version the data and is also used for internal garbage collection (of deleted data). The next figure shows the data structure in a visual context; the row key serves as the identifier of the column that follows it, and the column name and value are stored in contiguous blocks: It is important to note that every row has the column names stored along with the values, allowing the schema to be dynamic. Column families Columns are grouped into sets called column families, which can be addressed through a row key (primary key). All the data stored in a column family is of the same type. A column family must be created before any data can be stored; any column key within the family can be used. It is our intent that the number of distinct column families in a keyspace should be small, and that the families should rarely change during an operation. In contrast, a column family may have an unbounded number of columns. Both disk and memory accounting are performed at the column family level. Keyspace A keyspace is a group of column families; replication strategies and ACLs are performed at the keyspace level. If you are familiar with traditional RDBMS, you can consider the keyspace as an alternative name for the schema and the column family as an alternative name for tables. Sorted String Table (SSTable) An SSTable provides a persistent file format for Cassandra; it is an ordered immutable storage structure from rows of columns (name/value pairs). Operations are provided to look up the value associated with a specific key and to iterate over all the column names and value pairs within a specified key range. Internally, each SSTable contains a sequence of row keys and a set of column key/value pairs. There is an index and the start location of the row key in the index file, which is stored separately. The index summary is loaded into the memory when the SSTable is opened in order to optimize the amount of memory needed for the index. A lookup for actual rows can be performed with a single disk seek and by scanning sequentially for the data. Memtable A memtable is a memory location where data is written to during update or delete operations. A memtable is a temporary location and will be flushed to the disk once it is full to form an SSTable. 
Basically, an update or a write operation to Cassandra is a sequential write to the commit log on disk and an in-memory update; hence, writes are as fast as writing to memory. Once the memtables are full, they are flushed to the disk, forming new SSTables. Reads in Cassandra merge the data from the different SSTables with the data in the memtables. Reads should always be requested with a row key (primary key), with the exception of a key range scan. When multiple updates are applied to the same column, Cassandra uses client-provided timestamps to resolve conflicts. Delete operations on a column work a little differently; because SSTables are immutable, Cassandra writes a tombstone to avoid random writes. A tombstone is a special value written to Cassandra instead of removing the data immediately. The tombstone can then be sent to nodes that did not get the initial remove request, and can be removed during GC. Compaction To bound the number of SSTable files that must be consulted on reads and to reclaim the space taken by unused data, Cassandra performs compactions. In a nutshell, compaction merges n SSTables (a configurable number) into one big SSTable. Newly flushed SSTables start out at roughly the same size as the memtables; as they are repeatedly compacted together, older SSTables grow exponentially bigger. Partitioning and replication Dynamo style As mentioned previously, the partitioning and replication scheme is motivated by the Dynamo paper; let's talk about it in detail. Gossip protocol Cassandra is a peer-to-peer system with no single point of failure; the cluster topology information is communicated via the Gossip protocol. The Gossip protocol is similar to real-world gossip, where a node (say B) tells a few of its peers in the cluster what it knows about the state of a node (say A). Those nodes tell a few other nodes about A, and over a period of time, all the nodes know about A. Distributed hash table The key feature of Cassandra is its ability to scale incrementally. This includes the ability to dynamically partition the data over a set of nodes in the cluster. Cassandra partitions data across the cluster using consistent hashing and randomly distributes the rows over the network using the hash of the row key. When a node joins the ring, it is assigned a token that determines where the node is placed in the ring. Now consider a case where the replication factor is 3; clients write to or read from any coordinator in the cluster (every node in the system can act as both a coordinator and a data node). The coordinator calculates a hash of the row key, which gives it enough information to route the write to the right node in the ring. The coordinator also looks at the replication factor and writes to the neighboring nodes in ring order.
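To make the token and replica placement just described more concrete, the following is a small, simplified Java sketch of a token ring. It is an illustration only, not Cassandra's actual implementation: the node layout is invented, and MD5 is used here because Cassandra's RandomPartitioner hashes row keys with MD5 into a token space, but the real code handles token ranges and replication strategies far more carefully.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Simplified illustration of token-based placement with a replication factor.
public class TokenRingSketch {

    // token -> node; each node is responsible for the range ending at its token
    private final TreeMap<BigInteger, String> ring = new TreeMap<BigInteger, String>();

    void addNode(String node, BigInteger token) {
        ring.put(token, node);
    }

    // Hash the row key into the token space (MD5, as in RandomPartitioner).
    static BigInteger tokenFor(String rowKey) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(rowKey.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);
    }

    // First replica: the next node clockwise from the key's token;
    // remaining replicas: the following nodes in ring order, wrapping around.
    List<String> replicasFor(String rowKey, int replicationFactor) throws Exception {
        BigInteger token = tokenFor(rowKey);
        List<String> replicas = new ArrayList<String>();
        SortedMap<BigInteger, String> tail = ring.tailMap(token);
        for (String node : tail.values()) {
            if (replicas.size() == replicationFactor) break;
            replicas.add(node);
        }
        for (String node : ring.values()) {        // wrap around the ring
            if (replicas.size() == replicationFactor) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }
}

With a replication factor of 3, replicasFor("some-row-key", 3) returns the node that owns the key's token range plus its two neighbors in ring order, which mirrors how the coordinator picks the nodes to write to.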
If W + R > Replication Factor, where W is the number of nodes to block on write and R the number to block on reads, the clients will see a strong consistency behavior: ONE: R/W at least one node TWO: R/W at least two nodes QUORUM: R/W from at least floor (N/2) + 1, where N is the replication factor When nodes are down for maintenance, Cassandra will store hints for updates performed on that node, which can be delivered back when the node is available in the future. To make data consistent, Cassandra relies on hinted handoffs, read repairs, and anti-entropy repairs. Summary In this article, we have discussed basic concepts and basic building blocks, including the motivation in building a new datastore solution. Resources for Article: Further resources on this subject: Apache Cassandra: Libraries and Applications [Article] About Cassandra [Article] Quick start – Creating your first Java application [Article]
Security considerations

Packt
21 Nov 2013
9 min read
(For more resources related to this topic, see here.) Security considerations One general piece of advice that applies to every type of application development is to develop the software with security in mind, meaning it is more expensive for an error-prone application to first implement the needed features and after that to make modifications in them to enforce security. Instead, this should be done simultaneously. In this article we are raising security awareness, and next we will learn about which measures we can apply and what we can do in order to have more secure applications. Use TLS TLS (the cryptographic protocol named Transport Layer Security) is the result of the standardization of the SSL protocol (Version 3.0), which was developed by Netscape and was proprietary. Thus, in various documents and specifications, we can find the use of TLS and SSL interchangeably, even though there are actually differences in the protocol. From a security standpoint, it is recommended that all requests sent from the client during the execution of a grant flow are done over TLS. In fact, it is recommended TLS be used on both sides of the connection. OAuth 2.0 relies heavily on TLS; this is done in order to maintain confidentiality of the exchanged data over the network by providing encryption and integrity on top of the connection between the client and server. In retrospect, in OAuth 1.0 the use of TLS was not mandatory, and parts of the authorization flow (on both server side and client side) had to deal with cryptography, which resulted in various implementations, some good and some sloppy. When we make an HTTP request (for example, in order to execute some OAuth 2.0 grant flow), in order to make the connection secure the HTTP client library that is used to execute the request has to be configured to use TLS. TLS is to be used by the client application when sending requests to both authorization and resource servers, and is to be used by the servers themselves as well. The result is an end-to-end TLS protected connection. If end-to-end protection cannot be established, it is advised to reduce the scope and lifetime of the access tokens that are issued by the authorization server. The OAuth2.0 specification states that the use of TLS is mandatory when sending requests to the authorization and token endpoints and when sending requests using password authentication. Access tokens, refresh tokens, username and password combinations, and client credentials must be transmitted with the use of TLS. By using TLS, the attackers that are trying to intercept/eavesdrop the exchanged information during the execution of the grant flow will not be able to do so. If TLS is not used, attackers can eavesdrop on an access token, an authorization code, a username and password combination, or other critical information. This means that the use of TLS prevents man-in-the-middle attacks and replaying of already fulfilled requests (also called replay attacks). By performing replay attempts, the attackers can issue themselves new access tokens or can perform replays on a request towards resource servers and modify or delete data belonging to the resource owner. Last but not least, the authorization server can enforce the use of TLS on every endpoint in order to reduce the risk of phishing attacks. 
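As a concrete illustration, the following is a minimal Java sketch of a client calling a token endpoint strictly over HTTPS, using only the standard java.net API. The endpoint URL, authorization code, redirect URI, and client identifier are made-up placeholders (the parameter names follow the OAuth 2.0 specification); a real client would also authenticate itself, validate certificates against a proper trust store, and parse the JSON response.

import java.io.OutputStream;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import javax.net.ssl.HttpsURLConnection;

// Hedged sketch: the endpoint and request values are placeholders; the point is that
// the request is refused unless it is HTTPS, so tokens never travel in clear text.
public class TokenRequestOverTls {

    public static void main(String[] args) throws Exception {
        URL tokenEndpoint = new URL("https://auth.example.com/oauth2/token");
        if (!"https".equals(tokenEndpoint.getProtocol())) {
            throw new IllegalArgumentException("Token endpoint must use TLS");
        }

        HttpsURLConnection conn = (HttpsURLConnection) tokenEndpoint.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

        String body = "grant_type=authorization_code"
                + "&code=" + URLEncoder.encode("SplxlOBeZQQYbYS6WxSbIA", "UTF-8")
                + "&redirect_uri=" + URLEncoder.encode("https://client.example.com/cb", "UTF-8")
                + "&client_id=s6BhdRkqt3";

        try (OutputStream out = conn.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println("Token endpoint responded with HTTP " + conn.getResponseCode());
    }
}

The explicit protocol check is a cheap guard against a misconfigured http:// endpoint slipping in; with it in place, the authorization code and the returned access token are only ever transmitted inside the TLS-protected connection.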
Ensure web server application protection For client applications that are actually web applications deployed on a server, there are numerous protection measures that can be taken into account so that the server, the database, and the configuration files are kept safe. The list is not limited and can vary between scenarios and environments; some of the key measures are as follows: Install recommended security additions and tools for the given web and database servers that are in use. Restrict remote administrator access only to the people that require it (for example, for server maintenance and application monitoring). Regulate which server user can have which roles, and regulate permissions for the resources available to them. Disable or remove unnecessary services on the server. Regulate the database connections so that they are only available to the client application. Close unnecessary open ports on the server; leaving them open can give an advantage to the attacker. Configure protection against SQL injection. Configure database and file encryption for vital information stored (credentials and so on). Avoid storing credentials in plain text format. Keep the software components that are in use updated in order to avoid security exploitation. Avoid security misconfiguration. It is important to have in mind what kind of web server it is, which database is used, which modules the client application uses, and on which services the client application depends, so that we can research how to apply the security measures appropriately. OWASP (Open Web Application Security Project) provides additional documentation on security measures and describes the industry's best practices regarding software security. It is an additional resource recommended for reference and research on this topic, and can be found at https://www.owasp.org. Ensure mobile and desktop application protection Mobile and desktop applications can be installed on devices and machines that can be part of internal/enterprise or external environments. They are more vulnerable compared to the applications deployed on regulated server environments. Attackers have a better chance to try to extract the source code from the applications and other data that comes with them. In order to provide the best possible security, some of the key measures are as follows: Use secure storage mechanisms provided by additional programming libraries and by features offered by the operating system for which the application is developed. In multiuser operating systems, store user specific data such as credentials or access and refresh tokens in locations that are not available to other users on the same system. As mentioned previously, credentials should not be stored in plain text format and should be encrypted. If using an embedded database (such as SQLite in most cases), try to enforce security measures against SQL injection and encrypt the vital information (or encrypt the whole embedded database). For mobile devices, advise the end user to utilize device lock (usually with a PIN, password, or face unlock). Implement an optional PIN or password lock on the application level that the end user can activate if desired (which can also serve as an alternative to the previous locking measure). Sanitize and validate the value from any input fields that are used in the applications, in order to avoid code injection, which can lead to changing the behavior or exposing data stored by the client application. 
When the application is ready to be packaged for production use (to be used by end users), perform code analysis for obfuscating code and removing the unused code. This will produce a smaller client application in file size, which will perform the same but it will be harder to reverse engineer. As usual, for additional reference and research we can refer to the OAuth2.0 threat model RFC document, to OWASP, and to security documentation specific to the programming language, tools, libraries, and operating system that the client application is built for. Utilize the state parameter As mentioned, with this parameter the state between the request and the callback is maintained. Even if it is an optional parameter it is highly advisable to use, and the value from the callback response will be validated if it is equal to the one that was sent. When setting the value for the state parameter in the request Don't use predictable values that can be guessed by attackers. Don't repeat the same value often between requests. Don't use values that can contain and expose some internal business logic of the system and can be used maliciously if discovered. Use session values: If the user agent—with which the user has authenticated and approved the authorization request—has its session cookie available, calculate a hash from it and use that one as the state value. Or use some string generator: If a session variable is not available as an alternative, we can use some generated programmable value. Some real world implementations do this by generating unique identifiers and using them as state values, commonly achieved by generating a random UUID (universally unique identifier) and converting it to a hexadecimal value. Keep track of which state value was set for which request (user session in most cases) and redirect URI, in order to validate that the returned one contains an equal value. Use refresh tokens when available For client applications that have obtained an access token and a refresh token along with it, upon access token expiry it is a good practice to request a new one by using the refresh token instead of going through the whole grant flow again. With this measure we are transmitting less data over the network and are providing less exposure that the attacker can monitor. Request the needed scope only As briefly mentioned previously in this article, it is highly advisable to specify only the required scope when requesting an access token instead of specifying the maximum one that is available. With this measure, if an attacker gets hold of the access token, he can take damaging actions only to the level specified by the scope, and not more. This is done for damage minimization until the token is revoked and invalidated. Summary In this article we learned what data is to be protected, what features OAuth 2.0 contains regarding information security, and which precautions we should take into consideration. Resources for Article: Further resources on this subject: Deploying a Vert.x application [Article] Building tiny Web-applications in Ruby using Sinatra [Article] Fine Tune the View layer of your Fusion Web Application [Article]
Securing the Hadoop Ecosystem

Packt
20 Nov 2013
6 min read
(For more resources related to this topic, see here.) Each ecosystem component has its own security challenges and needs to be configured uniquely based on its architecture to secure them. Each of these ecosystem components has end users directly accessing the component or a backend service accessing the Hadoop core components (HDFS and MapReduce). The following are the topics that we'll be covering in this article: Configuring authentication and authorization for the following Hadoop ecosystem components: Hive Oozie Flume HBase Sqoop Pig Best practices in configuring secured Hadoop components Configuring Kerberos for Hadoop ecosystem components The Hadoop ecosystem is growing continuously and maturing with increasing enterprise adoption. In this section, we look at some of the most important Hadoop ecosystem components, their architecture, and how they can be secured. Securing Hive Hive provides the ability to run SQL queries over the data stored in the HDFS. Hive provides the Hive query engine that converts Hive queries provided by the user to a pipeline of MapReduce jobs that are submitted to Hadoop (JobTracker or ResourceManager) for execution. The results of the MapReduce executions are then presented back to the user or stored in HDFS. The following figure shows a high-level interaction of a business user working with Hive to run Hive queries on Hadoop: There are multiple ways a Hadoop user can interact with Hive and run Hive queries; these are as follows: The user can directly run the Hive queries using Command Line Interface (CLI). The CLI connects to the Hive metastore using the metastore server and invokes Hive query engine directly to execute Hive query on the cluster. Custom applications written in Java and other languages interacts with Hive using the HiveServer. HiveServer, internally, uses the metastore server and the Hive Query Engine to execute the Hive query on the cluster. To secure Hive in the Hadoop ecosystem, the following interactions should be secured: User interaction with Hive CLI or HiveServer User roles and privileges needs to be enforced to ensure users have access to only authorized data The interaction between Hive and Hadoop (JobTracker or ResourceManager) has to be secured and the user roles and privileges should be propagated to Hadoop jobs To ensure secure Hive user interaction, there is a need to ensure that the user is authenticated by HiveServer or CLI before running any jobs on the cluster. The user has to first use the kinit command to fetch the Kerberos ticket. This ticket is stored in the credential cache and used to authenticate with Kerberos-enabled systems. Once the user is authenticated, Hive submits the job to Hadoop (JobTracker or ResourceManager). Hive needs to impersonate the user to execute MapReduce on the cluster. From Hive Version 0.11 onwards, HiveServer2 was introduced. The earlier HiveServer had serious security limitations related to user authentication. HiveServer2 supports Kerberos and LDAP authentication for the user authentication. When HiveServer2 is configured to have LDAP authentication, Hive users are managed using the LDAP store. Hive asks the users to submit the MapReduce jobs to Hadoop. Thus, if we configure HiveServer2 to use LDAP, only the user authentication between the client and HiveServer2 is addressed. The interaction of Hive with Hadoop is insecure, and Hive MapReduce will be able to access other users' data in the Hadoop cluster. 
On the other hand, when we use Kerberos authentication for Hive users with HiveServer2, the same user is impersonated to execute MapReduce on the Hadoop cluster. So it is recommended that, in a production environment, we configure HiveServer2 with Kerberos to have seamless authentication and access control for the users submitting Hive queries. The credential store for the Kerberos KDC can be configured to be LDAP so that we can centrally manage the credentials of the end users. To set up secured Hive interactions, we need to perform the following steps: One of the key steps in securing Hive interaction is to ensure that the Hive user is impersonated in Hadoop, as Hive executes a MapReduce job on the Hadoop cluster. To achieve this goal, we need to add the hive.server2.enable.impersonation configuration in hive-site.xml, and hadoop.proxyuser.hive.hosts and hadoop.proxyuser.hive.groups in core-site.xml.
<property>
  <name>hive.server2.authentication</name>
  <value>KERBEROS</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.principal</name>
  <value>hive/[email protected]</value>
</property>
<property>
  <name>hive.server2.authentication.kerberos.keytab</name>
  <value>/etc/hive/conf/hive.keytab</value>
</property>
<property>
  <name>hive.server2.enable.impersonation</name>
  <description>Enable user impersonation for HiveServer2</description>
  <value>true</value>
</property>
Securing Hive using Sentry In the previous section, we saw how Hive authentication can be enforced using Kerberos, and how user privileges are enforced through user impersonation in Hadoop by the superuser. Sentry is one of the latest entrants in the Hadoop ecosystem and provides fine-grained user authorization for the data that is stored in Hive. Sentry provides fine-grained, role-based authorization to Hive and Impala. Sentry uses HiveServer2 and the metastore server to execute the queries on the Hadoop platform. However, user impersonation is turned off in HiveServer2 when Sentry is used. Sentry enforces user privileges on the Hadoop data using the Hive metastore. Sentry supports authorization policies per database/schema, which can be leveraged to enforce user management policies. More details on Sentry are available at the following URL: http://www.cloudera.com/content/cloudera/en/products/cdh/sentry.html Summary In this article, we learned how to configure Kerberos for Hadoop ecosystem components. We also looked at how to secure Hive using Sentry. Resources for Article: Further resources on this subject: Advanced Hadoop MapReduce Administration [Article] Managing a Hadoop Cluster [Article] Making Big Data Work for Hadoop and Solr [Article]