
How-To Tutorials


Social-Engineer Toolkit

Packt
25 Oct 2013
11 min read
(For more resources related to this topic, see here.)

Social engineering is the act of manipulating people into performing actions that they don't intend to do. A cyber-based, socially engineered scenario is designed to trap a user into performing activities that can lead to the theft of confidential information or some other malicious activity. The reason for the rapid growth of social engineering amongst hackers is that it is difficult to break the security of a platform, but far easier to trick the user of that platform into performing unintended malicious activity. For example, it is difficult to break the security of Gmail in order to steal someone's password, but it is easy to create a socially engineered scenario in which the victim is tricked into revealing his/her login information through a fake login/phishing page.

The Social-Engineer Toolkit (SET) is designed to perform such activities. Just as we have exploits and vulnerabilities for existing software and operating systems, SET is a generic exploit of humans, designed to get them to break their own security precautions. It is an official toolkit available at https://www.trustedsec.com/, and it comes as a default installation with BackTrack 5. In this article, we will analyze the aspects of this tool and how it adds more power to the Metasploit framework. We will mainly focus on creating attack vectors and managing the configuration file, which is considered the heart of SET. So, let's dive deeper into the world of social engineering.

Getting started with the Social-Engineer Toolkit (SET)

Let's start our introductory recipe about SET, where we will be discussing SET on different platforms.

Getting ready

SET can be downloaded for different platforms from its official website: https://www.trustedsec.com/. It has both a GUI version, which runs through the browser, and a command-line version, which can be executed from the terminal. It comes pre-installed in BackTrack, which will be our platform for discussion in this article.

How to do it...

To launch SET on BackTrack, start the terminal window and pass the following path:

root@bt:~# cd /pentest/exploits/set
root@bt:/pentest/exploits/set# ./set

Copyright 2012, The Social-Engineer Toolkit (SET)
All rights reserved.

Select from the menu:

If you are using SET for the first time, you can update the toolkit to get the latest modules and fix known bugs. To start the updating process, we will pass the svn update command. Once the toolkit is updated, it is ready for use. The GUI version of SET can be accessed by navigating to Applications | BackTrack | Exploitation tools | Social-Engineer Toolkit.

How it works...

SET is a Python-based automation tool that creates a menu-driven application for us. Faster execution and the versatility of Python make it the preferred language for developing modular tools such as SET. Python also makes it easy to integrate the toolkit with web servers. Any open source HTTP server can be used to access the browser version of SET; Apache is typically considered the preferable server while working with SET.

There's more...

Sometimes you may have an issue upgrading to a new release of SET in BackTrack 5 R3. Try the following steps. First, remove the old SET using the following command:

dpkg -r set

We can remove SET in two ways. First, we can trace the path to /pentest/exploits/set, make sure we are in that directory, and then use the rm command to remove all the files present there. Alternatively, we can use the dpkg method shown previously.
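A condensed, hedged sketch of the manual rm route described above (the path assumes the default BackTrack 5 layout):

cd /pentest/exploits/set
pwd        # confirm we really are in the SET directory before deleting anything
rm -rf *   # remove all files of the old installation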
Then, for reinstallation, we can download a fresh clone using the following command:

git clone https://github.com/trustedsec/social-engineer-toolkit set/

Working with the SET config file

In this recipe, we will take a close look at the SET config file, which contains the default values for the different parameters used by the toolkit. The default configuration works fine for most attacks, but there can be situations where you have to modify the settings according to the scenario and requirements. So, let's see what configuration settings are available in the config file.

Getting ready

To launch the config file, move to the config directory and open the set_config file:

root@bt:/pentest/exploits/set# nano config/set_config

The configuration file will be launched with some introductory statements, as shown in the following screenshot:

How to do it...

Let's go through it step by step. First, we will see what configuration settings are available to us:

# DEFINE THE PATH TO METASPLOIT HERE, FOR EXAMPLE /pentest/exploits/framework3
METASPLOIT_PATH=/pentest/exploits/framework3

The first configuration setting is related to the Metasploit installation directory. Metasploit is required by SET for proper functioning, as SET picks up payloads and exploits from the framework:

# SPECIFY WHAT INTERFACE YOU WANT ETTERCAP TO LISTEN ON, IF NOTHING WILL DEFAULT
# EXAMPLE: ETTERCAP_INTERFACE=wlan0
ETTERCAP_INTERFACE=eth0
#
# ETTERCAP HOME DIRECTORY (NEEDED FOR DNS_SPOOF)
ETTERCAP_PATH=/usr/share/ettercap

Ettercap is a multipurpose sniffer for switched LANs. The Ettercap section can be used to perform LAN attacks such as DNS poisoning and spoofing. The preceding SET settings can be used to turn Ettercap ON or OFF depending on how you intend to use it.

# SENDMAIL ON OR OFF FOR SPOOFING EMAIL ADDRESSES
SENDMAIL=OFF

The sendmail e-mail server is primarily used for e-mail spoofing. This attack will work only if the target's e-mail server does not implement reverse lookup. By default, its value is set to OFF.

The following setting shows one of the most widely used attack vectors of SET. This configuration will allow you to sign a malicious Java applet with your name or with any fake name, and then it can be used to perform a browser-based Java applet infection attack:

# CREATE SELF-SIGNED JAVA APPLETS AND SPOOF PUBLISHER NOTE THIS REQUIRES YOU TO
# INSTALL ---> JAVA 6 JDK, BT4 OR UBUNTU USERS: apt-get install openjdk-6-jdk
# IF THIS IS NOT INSTALLED IT WILL NOT WORK. CAN ALSO DO apt-get install sun-java6-jdk
SELF_SIGNED_APPLET=OFF

We will discuss this attack vector in detail in a later recipe, namely the spear-phishing attack vector. This attack vector also requires the JDK to be installed on your system. Let's set its value to ON, as we will be discussing this attack in detail:

SELF_SIGNED_APPLET=ON

# AUTODETECTION OF IP ADDRESS INTERFACE UTILIZING GOOGLE, SET THIS ON IF YOU WANT
# SET TO AUTODETECT YOUR INTERFACE
AUTO_DETECT=ON

The AUTO_DETECT flag is used by SET to auto-discover the network settings. It will enable SET to detect your IP address if you are using NAT/port forwarding, and it allows you to connect to the external Internet.

The following setting is used to set up the Apache web server for web-based attack vectors.
It is always preferable to set it to ON for better attack performance:

# USE APACHE INSTEAD OF STANDARD PYTHON WEB SERVERS, THIS WILL INCREASE SPEED OF
# THE ATTACK VECTOR
APACHE_SERVER=OFF
#
# PATH TO THE APACHE WEBROOT
APACHE_DIRECTORY=/var/www

The following setting is used to set up the SSL certificate while performing web attacks. Several bugs and issues have been reported for the WEBATTACK_SSL setting of SET, so it is recommended to keep this flag OFF:

# TURN ON SSL CERTIFICATES FOR SET SECURE COMMUNICATIONS THROUGH WEB_ATTACK VECTOR
WEBATTACK_SSL=OFF

The following setting can be used to build a self-signed certificate for web attacks, but the browser will show a warning message saying Untrusted certificate. Hence, it is recommended to use this option wisely to avoid alerting the target user:

# PATH TO THE PEM FILE TO UTILIZE CERTIFICATES WITH THE WEB ATTACK VECTOR (REQUIRED)
# YOU CAN CREATE YOUR OWN UTILIZING SET, JUST TURN ON SELF_SIGNED_CERT
# IF YOUR USING THIS FLAG, ENSURE OPENSSL IS INSTALLED!
#
SELF_SIGNED_CERT=OFF

The following setting is used to enable or disable the Metasploit listener once the attack is executed:

# DISABLES AUTOMATIC LISTENER - TURN THIS OFF IF YOU DON'T WANT A METASPLOIT LISTENER IN THE BACKGROUND.
AUTOMATIC_LISTENER=ON

The following configuration will allow you to use SET as a standalone toolkit without the Metasploit functionality, but it is always recommended to use Metasploit along with SET in order to increase penetration testing performance:

# THIS WILL DISABLE THE FUNCTIONALITY IF METASPLOIT IS NOT INSTALLED AND YOU JUST WANT TO USE SETOOLKIT OR RATTE FOR PAYLOADS
# OR THE OTHER ATTACK VECTORS.
METASPLOIT_MODE=ON

These are a few of the important configuration settings available for SET. Proper knowledge of the config file is essential to gain full control over SET.

How it works...

The SET config file is the heart of the toolkit, as it contains the default values that SET picks up while performing various attack vectors. A misconfigured SET file can lead to errors during operation, so it is essential to understand the details defined in the config file in order to get the best results. The How to do it... section clearly reflects the ease with which we can understand and manage the config file.

Working with the spear-phishing attack vector

A spear-phishing attack vector is an e-mail attack scenario that is used to send malicious mails to specific target user(s). In order to spoof your own e-mail address, you will require a sendmail server; change the config setting to SENDMAIL=ON. If you do not have sendmail installed on your machine, it can be installed with the following command:

root@bt:~# apt-get install sendmail
Reading package lists... Done

Getting ready

Before we move ahead with a phishing attack, it is imperative for us to know how the e-mail system works. In order to mitigate these types of attacks, recipient e-mail servers deploy gray-listing, SPF record validation, RBL verification, and content verification. These verification processes ensure that a particular e-mail arrived from the same e-mail server as its domain. For example, if a spoofed e-mail address, <[email protected]>, arrives from the IP 202.145.34.23, it will be marked as malicious, as this IP address does not belong to Gmail. Hence, in order to bypass these security measures, the attacker should ensure that the server IP is not present in an RBL/SURBL list.
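As a hedged aside (not part of the original recipe), a domain's published SPF policy is easy to inspect from the BackTrack terminal with dig:

root@bt:~# dig +short TXT gmail.com
# look for a v=spf1 record; mail arriving from an IP outside the
# listed ranges fails SPF validation on a server that checks it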
As the spear-phishing attack relies heavily on user perception, the attacker should perform reconnaissance on the content being sent and ensure that it looks as legitimate as possible. Spear-phishing attacks are of two types: web-based content and payload-based content.

How to do it...

The spear-phishing module has three different attack vectors at our disposal. Let's analyze each of them.

Passing the first option will start our mass-mailing attack. The attack vector starts with selecting a payload. You can select any vulnerability from the list of available Metasploit exploit modules. Then, we will be prompted to select a handler that can connect back to the attacker. The options include setting up a VNC server, executing the payload and starting the command line, and so on. The next few steps are starting the sendmail server, setting a template for a malicious file format, and selecting a single or mass-mail attack.

Finally, you will be prompted to either choose a known mail service, such as Gmail or Yahoo, or use your own server:

1. Use a gmail Account for your email attack.
2. Use your own server or open relay

set:phishing>1
set:phishing> From address (ex: [email protected]):[email protected]
set:phishing> Flag this message/s as high priority? [yes|no]:y

Setting up your own server may not be very reliable, as most mail services use a reverse lookup to make sure that the e-mail was generated from the same domain name as the address claims.

Let's analyze another attack vector of spear-phishing. Creating a file format payload is another attack vector, in which we generate a file format with a known vulnerability and send it via e-mail to attack our target. It is preferable to use MS Word-based vulnerabilities, as it is difficult to detect whether such documents are malicious, so they can be sent as attachments via e-mail:

set:phishing> Setup a listener [yes|no]:y
[-] ***
[-] * WARNING: Database support has been disabled
[-] ***

At last, we will be prompted on whether we want to set up a listener. The Metasploit listener will begin, and we will wait for the user to open the malicious file and connect back to the attacking system. The success of e-mail attacks depends on the e-mail client that we are targeting, so a proper analysis of this attack vector is essential.

How it works...

As discussed earlier, the spear-phishing attack vector is a social engineering attack vector that targets specific users. An e-mail is sent from the attacking machine to the target user(s). The e-mail contains a malicious attachment, which exploits a known vulnerability on the target machine and provides shell connectivity back to the attacker. SET automates the entire process. The major role that social engineering plays here is setting up a scenario that looks completely legitimate to the target, fooling the target into downloading the malicious file and executing it.


Unity assets to create interactive 2D games [Tutorial]

Amarabha Banerjee
25 Jun 2018
20 min read
Unity assets are part of the Unity ecosystem and help you to create in-game environments and gameplay options effectively. In this article, we are going to show you how to work with Unity assets, which will eventually help you create fun and interactive 2D games with Unity 2017. This article is a part of the book titled "Unity 2017 2D Game Development Projects" written by Lauren S. Ferro & Francesco Sapio. This book helps you to create exciting 2D games from scratch easily.

Textures and Sprites

Before you start anything within Unity, it is useful to know that Textures and Sprites within Unity are two separate things, although they are used in similar contexts. To begin, a Sprite is an image that can be used as a 2D object. It has only two coordinates: the x-axis and the y-axis. Therefore, all the graphical components of 2D game development are called Sprites. Sprites can be repositioned, scaled, and rotated like any other game object in Unity. You can move, destroy, or create them during the game. Sprites, by default, are rendered directly against the camera; however, you can easily change this if you are using the Sprite Renderer in a 3D scene. They work with the Sprite Renderer, unlike a 3D object, which works with the Mesh Renderer.

Besides Sprites, there are other graphical components called Textures. These are also images, but they are used to change the appearance of an object in both 2D (for example, Sprites and backgrounds) and 3D (for example, an object or character's appearance). But Textures are not objects. This means that you cannot get them to move during gameplay. That said, you can create images with Textures that animate, using Sprite Sheets/Atlases. What this means is that each frame of an animation is placed on a Sprite Sheet, which is a Texture that will eventually be cut up so that each frame of the animation is played sequentially.

Throughout, we will use the terms Sprite Sheets and Atlases. While they are pretty much the same thing, the subtle difference between the two is that a Sprite Sheet generally holds Sprite (frame-by-frame) animations, whereas an Atlas will contain images such as tileable Textures for walls and other environmental components (for example, objects). Their purpose is to maximize space by combining multiple images into one Texture, whether for characters (and their animations) or environmental Textures.

More generally speaking, when it comes to handling Sprites and Textures, Unity has various tools that deal with them in different ways and are used for different purposes. A brief overview of each of them follows; we will discuss them in more detail later:

Sprite Editor: This is used to edit Sprites, by selecting them individually from a larger image (known as a Sprite Atlas), by changing their Pivot point, and so on.
Sprite Creator: This is used to create a Sprite placeholder. This is useful if you do not have any Sprites to use but want to continue implementing the functionality of a game. Sprite placeholders can be replaced later with actual Sprites.
Sprite Packer: This is used to increase the efficiency of your project's usage of main memory. It achieves this by packing various Sprites into a single place using the Packing Tag attribute, which appears in the Inspector window when you select a Sprite in the Project window.

Sprite Renderer

The Sprite Renderer displays images that have been imported as the type Sprite. There are a number of different parameters within the Sprite Renderer that allow you to modify a Sprite.
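As a hedged aside (not from the book), the same parameters discussed next can also be driven from script via Unity's SpriteRenderer API; a minimal C# sketch, where the sorting layer name is a placeholder:

using UnityEngine;

// Minimal sketch: tweaking Sprite Renderer parameters from code.
public class SpriteTweaks : MonoBehaviour
{
    void Start()
    {
        SpriteRenderer sr = GetComponent<SpriteRenderer>();
        sr.color = new Color(1f, 1f, 1f, 0.5f); // Color, including the Alpha channel
        sr.flipX = true;                        // Flip the Sprite on the x-axis
        sr.sortingLayerName = "Background";     // Sorting Layer (placeholder name)
        sr.sortingOrder = 2;                    // Order in Layer
    }
}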
We will discuss them here:

Color: Allows you to change the color value and the value of the Alpha channel (transparency) of a Sprite.
Flip: Defines the axis on which the Sprite needs to be flipped.
Material: Refers to the material that Unity will use to render the Sprite.
Sorting Layer: Defines which layer the Sprite should be rendered on (it basically indicates the order in which the different Sprites are drawn, for example, which one is on top of the others).
Order in Layer: The order within the Sorting Layer.

Sprite Editor

In some cases, you may have a Texture that contains just one graphic element; in other cases, you may have multiple ones. The latter is more effective for many reasons, such as saving computational resources and keeping things organized. A case in which you are likely to combine many Sprites into one Texture may be frame-by-frame animations of a character, where other Sprites may be parts of a character (such as clothing and items) that need to be customizable, such as different items (and their effects). In Unity, you can easily extract elements from a single Texture by using the Sprite Editor. The Sprite Editor is used to take multiple elements from an Atlas or Sprite Sheet and slice them into individual Sprites.

How to use the Sprite Editor

To open the Sprite Editor, perform the following steps:

Drag and drop some images (anything you have on your computer, so you can have them as test images) into the Project panel.
Select the 2D image you want to edit from the Project view.
In the Inspector, change the Texture Type to Sprite (2D and UI), so you will be able to use it within the Sprite Editor.
Click on the Sprite Editor button in the Texture Import Inspector and the Sprite Editor displays.

When you open the Sprite Editor window, you can move it around like any other window within Unity; you can also dock it next to others, such as the Hierarchy or Project windows. To select the Sprites, simply click and drag on the Sprite that you wish to select. As a result, you will have bounding boxes around each Sprite element that you have selected, as in the following screenshot:

If you happen to click and drag too much around a Sprite, don't panic! You can easily resize the bounding box by clicking on any of the four corners or edges of the bounding box, like in the upcoming screenshot. Alternatively, you can also reposition the bounding box by clicking and dragging in the middle of the box itself.

While you're creating these selections, it is important to make sure that you name them appropriately. To do this, click on the box surrounding the Sprite that you wish to name. You will notice that a box appears. Next to where it says Name is where you enter the name that you wish to call your Sprite. Another thing to keep in mind here is the Pivot of the Sprite. Think of this as the Sprite's center: for example, if you rotate a Sprite, it will rotate around wherever its Pivot is.

A few more elements that you will also find useful while you are slicing up your Sprites are the options located at the top of the Sprite Editor window. We will discuss them now.

You can only see the Sprite Editor button if the Texture Type of the image you have selected is set to Sprite (2D and UI). In addition, you cannot edit a Sprite which is in the Scene view.

Slice Menu: One great feature of Unity is the option to automatically slice Sprites.
What this means is that if you have a large Sprite Sheet with various animations, images, and so on, you can automatically cut each image out. You have two options to do this:

Automatic: Automatic is better for when you have unevenly distributed Sprites, such as with an Atlas. When choosing the location of the Pivot, it will, by default, set it to the center.
Method: Method tells Unity how to deal with existing Sprites within the Sprite Editor window. For example, if you select Delete Existing, it replaces any Sprites that exist (with the same name) with new Sprites; Smart will try to create new Sprites while at the same time adjusting existing ones; and Safe will add new Sprites without changing any that currently exist.
Grid: The Grid is better for when you have Sprites that are evenly distributed, such as frame-by-frame animations. In these cases, it is not recommended to use Automatic, because the size differences between each Sprite may cause unintended effects in terms of how they appear within the game, such as the Pivot being in the wrong location, resulting in an inaccurate animation. An example of the Grid menu is shown in the following screenshot. Pixel Size sets the size of the Grid in pixels; this number will be determined by the size of your Sprite Sheet and the distribution of Sprites:

Sprite Packer

Using the Sprite Packer, you can combine multiple elements, such as large sets of Sprites, into a single Texture known as an Atlas. However, before using it, we must first make sure that it is enabled within Unity. To do this, go to Edit | Project Settings | Editor. Once you have done this, look at the Inspector; you can change the Sprite Packer from Disabled to Always Enabled or vice versa. You can see an example of this in the following screenshot. By selecting Always Enabled, the Sprite Packer will always be enabled whenever you start a new project; that way, you will not need to worry about enabling it again:

One of the benefits of using this is that it can boost the performance of your game by reducing the number of Draw Calls each frame. This is because a significant portion of a Sprite Texture is often taken up by the empty space between the graphic elements, which wastes video memory at runtime. With this in mind, when you are creating your own Sprites, try to pack graphics from several Sprite Textures together, as close as possible to one another, within an Atlas. Lastly, keep in mind that, depending on the sizes of your Sprites, an Atlas should not be larger than 2048 x 2048 (2^11) pixels; at least, this guarantees compatibility with many devices.

Unity handles the generation and use of Sprite Atlas Textures behind the scenes, so the user does not need to do any manual assignment. The Atlas can optionally be packed on entering Play mode or during a build, and the graphics for a Sprite object will be obtained from the Atlas once it is generated. Users are required to specify a Packing Tag in the Texture Importer to enable packing for Sprites of that Texture. To use the Sprite Packer, simply go to the top navigation menu and select Window | Sprite Packer; this will open the Sprite Packer.

Sprite Creator is your friend when you have no assets

While we have Sprites in this case, you might not always have them. If you don't have Sprites, you can always add placeholders in the places where they are likely to be.
This is useful when you're prototyping an idea and you need to get functionality working before your images are ready to go. Using the Sprite Creator is quite simple. We can create a placeholder Sprite by doing the following:

First, select Assets | Create | Sprites.
Next, select the placeholder Sprite you want to make, as in the following screenshot. Unity offers only six different placeholder Sprites: Square, Circle, Triangle, Diamond, Hexagon, and Polygon.
Before creating the Sprite, it is important to make sure that you select the folder that you want the Sprite to be created in. This saves time later, as you won't have to move it to the correct folder. This is because, when creating a Sprite with the Sprite Creator, Unity will automatically place it in the Asset folder that you currently have open in the Project window.
Lastly, from the list, select the placeholder Sprite that you wish to use:

Once you have chosen your Sprite, it will appear as a white shape. The Texture created by Unity will use the .png file format and contain default image data of 4x4 white pixels. At this stage, the new Sprite will have a default name based on its shape, such as Square or Circle. You have the option to rename it, but if you don't change it, don't worry, as each additional Sprite with the same shape will simply have a number following its name. You can, of course, always change the name of the Sprite later by clicking on it in the Asset folder where it is located:

Once your new Sprite has been created, simply drag and drop your placeholder Sprite into the Scene view or Hierarchy to start using it in your project. An example of this can be seen in the following screenshot:

Once you're done, whether it is a mock-up, prototype, or something else, you may want to change the placeholder Sprite to the actual image. Once you have imported the new image(s), simply do the following:

Click on the Sprite within the Scene view so that it is selected.
Now, in the Inspector, locate the Sprite Renderer component. An example of this is shown in the following screenshot:
Now, where it says Sprite, click on the small circle located next to the Sprite name, in this case, Hexagon. This is highlighted in the following screenshot:
Now, a small window will be displayed, like in the following screenshot:

The Sprite Creator makes 4x4 white PNG outline Textures, a power-of-two-sized Texture that is actually generated by an algorithm.

Setting up the Angel Cakes project

Now we're going to discuss how to set up our first project! For the rest of this article, we're going to discuss how to import the assets for the Angel Cakes project into Unity and get the project ready to go. We'll cover the process of importing and setting up while getting you familiar with 2D assets. To begin, let's get the Angel Cakes asset pack, which is featured in the following screenshot:

To download the assets, simply visit www.player26.com/product/Angelcakes and download the .zip file. Once you have finished downloading it, simply unzip the file with a program such as WinRAR.

Folder setup

You need to make sure that you have some foundational folders created to use with your project. To briefly recap, have a look at the following screenshot.
Remember that the Assets folder is always the root or parent folder for the project files:

Importing assets into the engine

With your folders set up, we can now import some images for our project: the background, the player, an enemy, player collision (walls, objects), and collectables (Angel Cakes, health, and bonuses). Importing the assets into Unity is easy:

First, click on the folder that you want the Sprites to be imported into, inside the Project window; for this project, we will use the folder titled Sprites.
Next, in the top menu, click Assets | Import New Assets and navigate to the folder where they are located.
Once you have found them, select them and then click Import.
Once they are imported, they will appear in the folder, like in the following screenshot:

Configuring assets for the game

The assets used in this game do not need much configuring, in comparison to the ones that we will use later. Once you have imported the two Sprites into Unity, do the following:

Select each one within the Project window.
Now, in the Inspector, change the Sprite Mode to Multiple. This is because we have multiple images in each Texture: one is an Atlas (the environmental objects) and one is a Sprite Sheet (character animations). Once you have done this, click Apply:
Once you have changed the Sprite Mode to Multiple, click Sprite Editor. Now you should see something like the following screenshot:
First, click on Slice and select Grid By Cell Size.
Next, in Pixel Size, change the values of X and Y to 50, like in the following screenshot, then click Slice:
Now, if you hold down Ctrl (or command on a Mac), you will see all the freshly cut slices, like in the following screenshot:

If you click on each slice, you will notice that a Sprite information box appears, like in the following screenshot:

In this information box, you can rename the Sprite to whatever you would like. Each Sprite has been given a number so that you can understand the corresponding naming conventions, which are described in the following screenshot. For this project, we will name each Sprite set as follows:

Numbers 1-6: ACSpriteChar1...2...3...4...
Numbers 7-12: ACSpriteCharEvo1...2...3...4...
Numbers 13-18: ACSpriteCharEnemie1...2...3...4...
Number 19: Delete

Once you have done this, you can see all your Sprites within the Project window. To do this, simply click on the triangle that is highlighted in the following screenshot. Once you have clicked it, it will expand, revealing all of your Sprites and their names, like in the following screenshot:

There are many things that we will now be able to do with these images, such as animations. The next thing that we need to do is slice up the environment Atlas. Locate the Sprite file within the Project window and open it up in the Sprite Editor. Remember that you need to change the Sprite Mode to Multiple in the Inspector, otherwise you will not be able to edit the Sprite. Once you have it in the Sprite Editor, it should look something like the following:

This time, instead of selecting the Slice type Grid By Cell Size, we will do it manually. This is because, if we choose to do it via the Automatic type, we will find that there are extra slices, like those on the clouds on the right of the following screenshot. This can be tedious when there are lots of little parts of a single Sprite, such as the clouds:

So, for now, manually drag and click around each of the Sprites, making sure that you get as close to the edges as possible.
You may find that you need to zoom in on some parts (by using the mouse scroll wheel), like the Angel Cakes. Also, the options in the top-right corner might help you by filtering the image (for example, black and white). As you begin refining the bounding box, you will feel the outline pull or snap toward the edges of the Sprite; this helps you to get as close as possible to the edges, therefore creating more efficient Sprites. Don't forget to name the Sprites either! For this project, we will name each Sprite set as follows:

ACSpriteEnviroBlock
ACSpriteMenuBlock
ACSpriteBonus
ACSpriteHealth
ACSpriteCake
ACSpriteBackground
ACSpriteCloud1...2...3...and so on

To give you a better idea of where each Sprite is located, have a look at the following screenshot. The Sprites are numbered so that you can easily locate them. Once you have done this, click on Apply in the top-right corner of the Sprite Editor. As a result, you should be able to see all the Sprites in the Project window by clicking on the triangle. It should look like the following screenshot:

9-slicing Sprites

A nice little feature of Unity that allows you to scale elements such as Sprites without distortion is 9-slicing. Essentially, 9-slicing allows you to reuse an image at various sizes without needing to prepare multiple assets. As the name suggests, it involves splitting the image into nine segments. An example of this splitting is shown in the following screenshot.

The following four points describe what will happen if you change the dimensions of the preceding image:

If you change the four corners (A, C, G, and I), they will not change in size
If you move sections B and H, they will stretch or tile horizontally
If you move sections D and F, they will stretch or tile vertically
If you move section E, the image will stretch or tile both horizontally and vertically

You can see these four points illustrated in the following screenshot:

By using 9-slicing, you can resize the Sprite in different ways while keeping the proportions. This is particularly useful for creating the walls within our environment that will create obstacles for our little Angel and enemies to navigate around. We will need to do this for our ACSpriteEnviroBlock so that we can place it within our level for the player to navigate around. To do this, we need to make sure that the Sprite we have created has been set up properly.

First, you need to make sure the Mesh Type is set to Full Rect. To do this, select the Angel_Cake_Sprite_Atlas (contained in Project window | Asset | Sprites), then head to the Inspector and change Mesh Type from Tight to Full Rect, like in the following screenshot.

Now we need to define the borders of the Sprite. To do this, perform the following steps:

First, select the Sprite (Angel_Cake_Sprite_Atlas).
Next, in the Inspector, click the Sprite Editor button.
Now, click on the Sprite that you want to apply the 9-slicing to. In our case, this will be the ACSpriteEnviroBlock, like in the following screenshot:
Looking at the Sprite information box in the bottom-right corner, we need to adjust the values for the Borders of the Sprite. For this Sprite, we will use the value of 20 for L, R, T, and B (left, right, top, and bottom, respectively).

In some cases, you might need to tweak the position of the borders; you can do this by clicking and dragging the green dots located at the intersections of each border (top, bottom, and sides). You can see this in the following screenshot:
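The next step switches the Sprite Renderer's Draw Mode from Simple to Sliced in the Inspector; as a hedged aside (not from the book), the same switch can be made from script with Unity's SpriteRenderer API:

using UnityEngine;

// Minimal sketch: enabling 9-sliced rendering from code.
public class NineSliceSetup : MonoBehaviour
{
    void Start()
    {
        SpriteRenderer sr = GetComponent<SpriteRenderer>();
        sr.drawMode = SpriteDrawMode.Sliced; // same as Draw Mode | Sliced in the Inspector
        sr.size = new Vector2(4f, 2f);       // resize freely; the corners keep their proportions
    }
}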
To test your 9-sliced Sprite, drag it from the Project window into the Scene, like in the following screenshot. Next, in the Inspector, change the Draw Mode from Simple to Sliced, like in the following screenshot. Now you can resize the ACSpriteEnviroBlock without it deforming. Give it a go! You should have something like the variations in the following screenshot.

You may notice that it isn't quite like the original Sprite. This is okay; we can adjust a setting in the Inspector. Simply click on the Atlas Texture in the Project window and, in the Inspector, change the value of Pixels Per Unit to 250. Click Apply, then click and drag another ACSpriteEnviroBlock onto the Scene and try to resize it. You will end up with something like the following screenshot:

As you can see, there is a little distortion. This just means that you will need to edit the Borders inside the Sprite Editor until you get their locations correct. For now, tinker with the locations of the borders.

To summarize, we have shown how to work with Unity 2017 assets, and how to configure Sprites for your 2D game projects effectively. If you have liked this article, then don't forget to check out the complete book Unity 2017 2D Game Development Projects by Lauren S. Ferro & Francesco Sapio on the Packt store.

Working with Unity Variables to script powerful Unity 2017 games
Build an ARCore app with Unity from scratch
Unity announces a new automotive division and two-day Unity AutoTech Summit


The Multi-Table Query Generator using phpMyAdmin and MySQL

Packt
09 Oct 2009
4 min read
The Search pages in the Database or Table view are intended for single-table lookups. This article by Marc Delisle covers the multi-table Query by example (QBE) feature available in the Database view. Many phpMyAdmin users work in the Table view, table by table, and thus tend to overlook the multi-table query generator, which is a wonderful feature for fine-tuning queries. The query generator is useful not only in multi-table situations but also for a single table. It enables us to specify multiple criteria for a column, a feature that the Search page in the Table view does not possess.

The examples in this article assume that a multi-user installation of the linked-tables infrastructure has been made and that the book-copy table created during an exercise in the article on Table and Database Operations in PHP is still there in the marc_book database.

To open the page for this feature, we go to the Database view for a specific database (the query generator supports working on only one database at a time) and click on Query. The following screenshot shows the initial QBE page. It contains the following elements:

Criteria columns
An interface to add criteria rows
An interface to add criteria columns
A table selector
The query area
Buttons to update or to execute the query

Choosing Tables

The initial selection includes all the tables. In this example, we assume that the linked-table infrastructure has been installed into the marc_book database. Consequently, the Field selector contains a great number of fields. For our example, we will work only with the author and book tables. We then click Update Query. This refreshes the screen and reduces the number of fields available in the Field selector. We can always change the table choice later, using our browser's mechanism for multiple choices in drop-down menus (usually, control-click).

Column Criteria

Three criteria columns are provided by default. This section discusses the options we have for editing their criteria, including options for selecting fields, sorting individual columns, entering conditions for individual columns, and so on.

Field Selector: Single-Column or All Columns

The Field selector contains all the individual columns for the selected tables, plus a special choice ending with an asterisk (*) for each table, which means all the fields are selected. To display all the fields in the author table, we would choose `author`.* and check the Show checkbox, without entering anything in the Sort and Criteria boxes. In our case, we select `author`.`name`, because we want to enter some criteria for the author's name.

Sorts

For each selected individual column, we can specify a sort (in Ascending or Descending order) or leave this line intact (meaning no sort). If we choose more than one sorted column, the sort will be done with priority from left to right. When we ask for a column to be sorted, we normally check the Show checkbox, but this is not necessary, because we might want to do just the sorting operation without displaying the column.

Showing a Column

We check the Show checkbox so that we can see the column in the results. Sometimes, we may just want to apply a criterion on a column and not include it in the resulting page. Here, we add the phone column, ask for a sort on it, and choose to show both the name and phone number. We also ask for a sort on the name in ascending order. The sort will be done first by name, and then by phone number if the names are identical.
This is because the name is in a column criterion to the left of the phone column, and thus has a higher priority:
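For this example, the SQL that QBE builds in the query area would look something like the following (a hedged reconstruction; the exact formatting phpMyAdmin generates may differ):

SELECT `author`.`name`, `author`.`phone`
FROM `author`
ORDER BY `author`.`name` ASC, `author`.`phone` ASC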


Understanding Spark RDD

Packt
01 Mar 2017
17 min read
In this article by Asif Abbasi, author of the book Learning Apache Spark 2.0, we will understand the Spark RDD, along with how to construct RDDs, operations on RDDs, passing functions to Spark in Scala, Java, and Python, and transformations such as map, filter, flatMap, and sample. (For more resources related to this topic, see here.)

What is an RDD?

What's in a name might be true for a rose, but perhaps not for a Resilient Distributed Dataset (RDD), whose name in essence describes what it is. RDDs are basically datasets that are distributed across a cluster (remember that the Spark framework is inherently based on an MPP architecture), and that provide resilience (automatic failover) by nature.

Before we go into any further detail, let's try to understand this a little, staying as abstract as possible. Let us assume that you have sensor data from aircraft sensors and you want to analyze the data irrespective of its size and locality. For example, an Airbus A350 has roughly 6,000 sensors across the entire plane and generates 2.5 TB of data per day, while the newer model expected to launch in 2020 will generate roughly 7.5 TB per day. From a data engineering point of view, it might be important to understand the data pipeline, but from an analyst's or a data scientist's point of view, the major concern is to analyze the data irrespective of its size and the number of nodes across which it is stored. This is where the neatness of the RDD concept comes into play: the sensor data can be encapsulated as an RDD, and any transformation/action that you perform on the RDD applies across the entire dataset. Six months' worth of data for an A350 would be approximately 450 TB, and would need to sit across multiple machines.

For the sake of discussion, we assume that you are working on a cluster of four worker machines. Your data would be partitioned across the workers as follows:

Figure 2-1: RDD split across a cluster

The figure basically explains that an RDD is a distributed collection of the data, and that the framework distributes the data across the cluster. Distributing data across a set of machines brings its own set of challenges, including recovering from node failures. RDDs are resilient, as they can be recomputed from the RDD lineage graph, which is basically a graph of the entire chain of parent RDDs of the RDD. In addition to resilience, distribution, and representing a dataset, an RDD has various other distinguishing qualities:

In Memory: An RDD is a memory-resident collection of objects. We'll look at options where an RDD can be stored in memory, on disk, or both. However, the execution speed of Spark stems from the fact that the data is in memory, and is not fetched from disk for each operation.

Partitioned: A partition is a division of a logical dataset or its constituent elements into independent parts. Partitioning is a de facto performance optimization technique in distributed systems, used to achieve minimal network traffic, a killer for high-performance workloads. The objective of partitioning key-value-oriented data is to collocate similar ranges of keys and, in effect, minimize shuffling. Data inside an RDD is split into partitions across the various nodes of the cluster.

Typed: Data in an RDD is strongly typed. When you create an RDD, all the elements are typed depending on the data type.

Lazy evaluation: The transformations in Spark are lazy, which means data inside an RDD is not available until you perform an action.
You can, however, make the data available at any time using a count() action on the RDD. We'll discuss this later, along with the benefits associated with it.

Immutable: An RDD, once created, cannot be changed. It can, however, be transformed into a new RDD by performing a set of transformations on it.

Parallel: An RDD is operated on in parallel. Since the data is spread across a cluster in various partitions, each partition is operated on in parallel.

Cacheable: Since RDDs are lazily evaluated, any action on an RDD will cause the RDD to re-evaluate all the transformations that led to its creation. This is generally not desirable behavior on large datasets, and hence Spark allows the option to persist the data in memory or on disk.

A typical Spark program flow with an RDD includes:

Creation of an RDD from a data source.
A set of transformations, for example, filter, map, join, and so on.
Persisting the RDD to avoid re-execution.
Calling actions on the RDD to start performing parallel operations across the cluster.

This is depicted in the following figure:

Figure 2-2: Typical Spark RDD flow

Operations on RDD

Two major operation types can be performed on an RDD. They are called:

Transformations
Actions

Transformations

Transformations are operations that create a new dataset, as RDDs are immutable. They are used to transform data from one form to another, which could result in amplification of the data, reduction of the data, or a totally different shape altogether. These operations do not return any value back to the driver program, and hence are lazily evaluated, which is one of the main benefits of Spark. An example of a transformation would be a map function, which will pass through each element of the RDD and return a totally new RDD representing the results of applying the function to the original dataset.

Actions

Actions are operations that return a value to the driver program. As previously discussed, all transformations in Spark are lazy, which essentially means that Spark remembers all the transformations carried out on an RDD and applies them in the most optimal fashion when an action is called. For example, you might have a 1 TB dataset, which you pass through a set of map functions by applying various transformations, and finally you apply the reduce action on the dataset. Apache Spark will return only the final dataset, which might be a few MBs, rather than the entire 1 TB of mapped intermediate results. You should, however, remember to persist intermediate results; otherwise Spark will recompute the entire RDD graph each time an action is called. The persist() method on an RDD should help you avoid recomputation and save intermediate results. We'll look at this in more detail later.

Let's illustrate the work of transformations and actions with a simple example, using a flatMap() transformation and a count action. We'll use the README.md file from the local filesystem as an example. We'll give a line-by-line explanation of the Scala example, and then provide code for Python and Java. As always, you should try this example with your own piece of text and investigate the results:

//Loading the README.md file
val dataFile = sc.textFile("README.md")

Now that the data has been loaded, we'll need to run a transformation.
Since we know that each line of the text is loaded as a separate element, we'll need to run a flatMap transformation and separate out individual words as separate elements, for which we'll use the split function with a space as the delimiter:

//Separate out a list of words from individual RDD elements
val words = dataFile.flatMap(line => line.split(" "))

Remember that up to this point, while you seem to have applied a transformation function, nothing has been executed, and all the transformations have been added to the logical plan. Also note that the transformation function returns a new RDD. We can then call the count() action on the words RDD to perform the computation, which results in fetching the data from the file to create an RDD before applying the transformation function specified. You might note that we have actually passed a function to Spark:

//Count the number of words in the RDD
words.count()

Upon calling the count() action, the RDD is evaluated and the results are sent back to the driver program. This is very neat and especially useful in big data applications.

If you are Python savvy, you may want to run the following code in PySpark. Note that lambda functions are passed to the Spark framework:

# Loading data file, applying transformations and action
dataFile = sc.textFile("README.md")
words = dataFile.flatMap(lambda line: line.split(" "))
words.count()

Programming the same functionality in Java is also quite straightforward and looks pretty similar to the program in Scala (note that flatMap is needed here too, since map would produce an RDD of string arrays rather than words):

JavaRDD<String> lines = sc.textFile("README.md");
JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());
long wordCount = words.count();

This might look like a simple program, but behind the scenes it is taking the line.split(" ") function and applying it to all the partitions in the cluster in parallel. The framework provides this simplicity and does all the background work of coordination, scheduling the job across the cluster and getting the results back.

Passing functions to Spark (Scala)

As you have seen in the previous example, passing functions is a critical piece of functionality provided by Spark. From a user's point of view, you pass the function in your driver program, and Spark figures out the location of the data partitions across the cluster, running it in parallel. The exact syntax of passing functions differs by programming language. Since Spark has been written in Scala, we'll discuss Scala first. In Scala, the recommended ways to pass functions to the Spark framework are as follows:

Anonymous functions
Static singleton methods

Anonymous functions

Anonymous functions are used for short pieces of code. They are also referred to as lambda expressions, and are a cool and elegant feature of the language. The reason they are called anonymous functions is that you can give any name to the input argument and the result will be the same. For example, the following code examples would produce the same output:

val words = dataFile.map(line => line.split(" "))
val words = dataFile.map(anyline => anyline.split(" "))
val words = dataFile.map(_.split(" "))

Figure 2-11: Passing anonymous functions to Spark in Scala

Static singleton functions

While anonymous functions are really helpful for short snippets of code, they are not very helpful when you want to ask the framework for complex data manipulation. Static singleton functions come to the rescue, with their own nuances, which we will discuss in this section.
In software engineering, the Singleton pattern is a design pattern that restricts the instantiation of a class to one object. This is useful when exactly one object is needed to coordinate actions across the system. Static methods belong to the class, not to an instance of it. They usually take input from the parameters, perform actions on it, and return a result.

Figure 2-12: Passing static singleton functions to Spark in Scala

Static singletons are the preferred way to pass functions, as technically you could also create a class and call a method on an instance of the class. For example:

class UtilFunctions {
  def split(inputParam: String): Array[String] = { inputParam.split(" ") }
  def operate(rdd: RDD[String]): RDD[String] = { rdd.map(split) }
}

You can send a method in a class, but that has performance implications, as the entire object would be sent along with the method.

Passing functions to Spark (Java)

In Java, to create a function you have to implement the interfaces available in the org.apache.spark.api.java.function package. There are two popular ways to create such functions:

Implement the interface in your own class, and pass an instance to Spark.
Starting with Java 8, you can use lambda expressions to pass the functions to the Spark framework.

Let's reimplement the preceding word count examples in Java:

Figure 2-13: Code example of Java implementation of word count (inline functions)

If you belong to the group of programmers who feel that writing inline functions makes the code complex and unreadable (a lot of people do agree with that assertion), you may want to create separate functions and call them as follows:

Figure 2-14: Code example of Java implementation of word count

Passing functions to Spark (Python)

Python provides a simple way to pass functions to Spark. The Spark programming guide available at spark.apache.org suggests there are three recommended ways to do this:

Lambda expressions: The ideal way for short functions that can be written inside a single expression
Local defs inside the function calling into Spark, for longer code
Top-level functions in a module

We have already looked at lambda functions in some of the previous examples, so let's look at local definitions of functions. Our example stays the same: we are trying to count the total number of words in a text file in Spark:

def splitter(lineOfText):
    words = lineOfText.split(" ")
    return len(words)

def aggregate(numWordsLine1, numWordsLineNext):
    totalWords = numWordsLine1 + numWordsLineNext
    return totalWords

Let's see the working code example:

Figure 2-15: Code example of Python word count (local definition of functions)

Here's another way to implement this, by defining the functions as part of a UtilFunctions class and referencing them within your map and reduce functions:

Figure 2-16: Code example of Python word count (Utility class)

You may want to be a bit cheeky here and try to add a countWords() method to UtilFunctions, so that it takes an RDD as input and returns the total number of words. This method has potential performance implications, as the whole object will need to be sent to the cluster. Let's see how this can be implemented, and the results, in the following screenshot:

Figure 2-17: Code example of Python word count (Utility class - 2)

This can be avoided by making a copy of the referenced data field in a local object, rather than accessing it externally.
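The working example in Figure 2-15 is shown only as a screenshot; as a hedged reconstruction (assuming the splitter and aggregate functions defined above), the driver code would look something like this:

# Map each line to its word count, then sum the counts across lines
dataFile = sc.textFile("README.md")
totalWords = dataFile.map(splitter).reduce(aggregate)
print(totalWords)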
Now that we have had a look at how to pass functions to Spark, and have already looked at some of the transformations and actions in the previous examples, including map, flatMap, and reduce, let's look at the most common transformations and actions used in Spark. The list is not exhaustive, and you can find more examples in the Apache Spark documentation in the programming guide. If you would like a comprehensive list of all the available functions, you may want to check the following API docs:

Scala: RDD http://bit.ly/2bfyoTo, PairRDD http://bit.ly/2bfzgah
Python: RDD http://bit.ly/2bfyURl, PairRDD N/A
Java: RDD http://bit.ly/2bfyRov, PairRDD http://bit.ly/2bfyOsH
R: RDD http://bit.ly/2bfyrOZ, PairRDD N/A

Table 2.1 - RDD and PairRDD API references

Transformations

The following are among the most common transformations:

map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
sample(withReplacement, fraction, seed)
coalesce(numPartitions)
repartition(numPartitions)
repartitionAndSortWithinPartitions(partitioner)
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
cartesian(otherDataset)

map(func)

The map transformation is the most commonly used and the simplest of the transformations on an RDD. The map transformation applies the function passed in its argument to each of the elements of the source RDD. In the previous examples, we saw the usage of the map() transformation, where we passed the split() function to the input RDD.

Figure 2-18: Operation of a map() function

We'll not give further examples of map() functions, as we have already seen plenty of them previously.

filter(func)

Filter, as the name implies, filters the input RDD and creates a new dataset that satisfies the predicate passed as an argument.

Example 2-1: Scala filtering example:

val dataFile = sc.textFile("README.md")
val linesWithApache = dataFile.filter(line => line.contains("Apache"))

Example 2-2: Python filtering example:

dataFile = sc.textFile("README.md")
linesWithApache = dataFile.filter(lambda line: "Apache" in line)

Example 2-3: Java filtering example:

JavaRDD<String> dataFile = sc.textFile("README.md");
JavaRDD<String> linesWithApache = dataFile.filter(line -> line.contains("Apache"));

flatMap(func)

The flatMap transformation is similar to map, but it offers a bit more flexibility. From the perspective of its similarity to a map function, it operates on all the elements of the RDD, but the flexibility stems from its ability to handle functions that return a sequence rather than a single item. As you saw in the preceding examples, we used flatMap to flatten the result of the split(" ") function, which returns a flattened structure rather than an RDD of string arrays.

Figure 2-19: Operational details of the flatMap() transformation

Let's look at the flatMap example in Scala.

Example 2-4: The flatMap() example in Scala:

val movies = sc.parallelize(List("Pulp Fiction","Requiem for a dream","A clockwork Orange"))
movies.flatMap(movieTitle => movieTitle.split(" ")).collect()

A flatMap in the Python API produces similar results.

Example 2-5: The flatMap() example in Python:

movies = sc.parallelize(["Pulp Fiction","Requiem for a dream","A clockwork Orange"])
movies.flatMap(lambda movieTitle: movieTitle.split(" ")).collect()

The flatMap example in Java is a bit long-winded, but it essentially produces the same results.
The flatMap example in Java is a bit long-winded, but it essentially produces the same results.

Example 2-6: The flatMap() example in Java:

JavaRDD<String> movies = sc.parallelize(
  Arrays.asList("Pulp Fiction","Requiem for a dream","A clockwork Orange"));

JavaRDD<String> movieName = movies.flatMap(
  new FlatMapFunction<String,String>(){
    public Iterator<String> call(String movie){
      return Arrays.asList(movie.split(" ")).iterator();
    }
  }
);

sample(withReplacement, fraction, seed)
Sampling is an important component of any data analysis, and it can have a significant impact on the quality of your results/findings. Spark provides an easy way to sample RDDs for your calculations, if you would prefer to quickly test your hypothesis on a subset of data before running it on the full dataset. Here is a quick overview of the parameters that are passed to the method:

withReplacement: A Boolean (true/false) that indicates whether elements can be sampled multiple times (replaced when sampled out). Sampling with replacement means that the two sample values are independent. In practical terms, this means that if we draw two samples with replacement, what we get on the first draw doesn't affect what we get on the second draw, and hence the covariance between the two samples is zero. If we are sampling without replacement, the two samples aren't independent. Practically, this means what we got on the first draw affects what we get on the second one, and hence the covariance between the two isn't zero.
fraction: Indicates the expected size of the sample as a fraction of the RDD's size. The fraction must be between 0 and 1. For example, if you want to draw a 5% sample, you can choose 0.05 as the fraction.
seed: The seed used for the random number generator.

Let's look at the sampling example in Scala.

Example 2-7: The sample() example in Scala:

val data = sc.parallelize(
  List(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
data.sample(true,0.1,12345).collect()

The sampling example in Python looks similar to the one in Scala.

Example 2-8: The sample() example in Python:

data = sc.parallelize(
  [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
data.sample(True,0.1,12345).collect()

In Java, our sampling example returns an RDD of integers.

Example 2-9: The sample() example in Java:

JavaRDD<Integer> nums = sc.parallelize(Arrays.asList(
  1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20));
nums.sample(true,0.1,12345).collect();

References
https://spark.apache.org/docs/latest/programming-guide.html
http://www.purplemath.com/modules/numbprop.htm

Summary
We have gone through the concept of creating an RDD and manipulating data within the RDD. We've looked at the transformations and actions available to an RDD, and walked through various code examples to explain the differences between transformations and actions.

Resources for Article:

Further resources on this subject:

Getting Started with Apache Spark [article]
Getting Started with Apache Spark DataFrames [article]
Sabermetrics with Apache Spark [article]

Walkthrough of Storm UI

Packt
04 Mar 2018
5 min read
In this article by Ankit Jain, the author of the book Mastering Apache Storm, we will see how to start the Storm UI daemon. Before starting the Storm UI daemon, we assume that you have a running Storm cluster; the Storm cluster deployment steps were covered earlier. Now, go to the Storm home directory (cd $STORM_HOME) on the leader Nimbus machine and run the following command to start the Storm UI daemon:

$> cd $STORM_HOME
$> bin/storm ui &

(For more resources related to this topic, see here.)

By default, the Storm UI starts on port 8080 of the machine where it is started. Now, we will browse to the http://nimbus-node:8080 page to view the Storm UI, where nimbus-node is the IP address or hostname of the Nimbus machine. The following is a screenshot of the Storm home page:

Cluster Summary section
This portion of the Storm UI shows the version of Storm deployed in the cluster, the uptime of the Nimbus nodes, the number of free worker slots, the number of used worker slots, and so on. While submitting a topology to the cluster, the user first needs to make sure that the value of the Free slots column is not zero; otherwise, the topology won't get any workers for processing and will wait in the queue until a worker becomes free.

Nimbus Summary section
This portion of the Storm UI shows the number of Nimbus processes running in the Storm cluster. The section also shows the status of the Nimbus nodes. A node with the status Leader is the active master, while a node with the status Not a Leader is a passive master.

Supervisor Summary section
This portion of the Storm UI shows the list of supervisor nodes running in the cluster, along with their Id, Host, Uptime, Slots, and Used slots columns.

Nimbus Configuration section
This portion of the Storm UI shows the configuration of the Nimbus node. Some of the important properties are:

supervisor.slots.ports
storm.zookeeper.port
storm.zookeeper.servers
storm.zookeeper.retry.interval
worker.childopts
supervisor.childopts

The definition of each of these properties is covered in Chapter 3. The following is a screenshot of the Nimbus Configuration section:

Topology Summary section
This portion of the Storm UI shows the list of topologies running in the Storm cluster, along with their ID, the number of workers assigned to the topology, the number of executors, the number of tasks, uptime, and so on.

Let's deploy the sample topology (if it is not running already) on a remote Storm cluster by running the following command:

$> cd $STORM_HOME
$> bin/storm jar ~/storm_example-0.0.1-SNAPSHOT-jar-with-dependencies.jar com.stormadvance.storm_example.SampleStormClusterTopology storm_example

We have created the SampleStormClusterTopology topology by defining three worker processes, two executors for SampleSpout, and four executors for SampleBolt. Workers, executors, and tasks are covered in detail in the next chapter; as a rough sketch, the topology wiring might look like the code shown next.
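The book's source jar is the authoritative version; the following is only an illustrative sketch of such a topology definition. The SampleSpout and SampleBolt names come from the storm_example package mentioned above, while the stream IDs and the shuffle grouping are assumptions, not confirmed by the original text:

package com.stormadvance.storm_example;

import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

public class SampleStormClusterTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // Two executors for the spout and four for the bolt, as described above
        builder.setSpout("sampleSpout", new SampleSpout(), 2);
        builder.setBolt("sampleBolt", new SampleBolt(), 4)
               .shuffleGrouping("sampleSpout");

        Config conf = new Config();
        conf.setNumWorkers(3); // three worker processes

        // args[0] is the topology name, e.g. "storm_example"
        StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
    }
}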
After submitting SampleStormClusterTopology on the Storm cluster, the user has to refresh the Storm home page. The following screenshot shows that a row has been added for SampleStormClusterTopology in the Topology Summary section. The topology section contains the name of the topology, the unique ID of the topology, the status of the topology, uptime, the number of workers assigned to the topology, and so on. The possible values of the Status field are ACTIVE, KILLED, and INACTIVE. Let's click on SampleStormClusterTopology to view its detailed statistics; there are two screenshots for this. The first one contains information about the number of workers, executors, and tasks assigned to the SampleStormClusterTopology topology. The next screenshot contains information about the spouts and bolts, including the number of executors and tasks assigned to each spout and bolt:

The information shown in the previous screenshots is:

Topology stats: This section gives information about the number of tuples emitted, transferred, and acknowledged, the capacity, latency, and so on, within windows of 10 minutes, 3 hours, 1 day, and since the start of the topology.
Spouts (All time): This section shows the statistics of all the spouts running inside the topology. Detailed information about spout stats is covered in Chapter 3.
Bolts (All time): This section shows the statistics of all the bolts running inside the topology. Detailed information about bolt stats is covered in Chapter 3.
Topology actions: This section allows us to perform activate, deactivate, rebalance, kill, and other operations on topologies directly through the Storm UI:
Deactivate: Click on Deactivate to deactivate the topology. Once the topology is deactivated, the spout stops emitting tuples, and the status of the topology changes to INACTIVE on the Storm UI. Deactivating a topology does not free the Storm resources.
Activate: Click on the Activate button to activate the topology. Once the topology is activated, the spout again starts emitting tuples.
Kill: Click on the Kill button to destroy/kill the topology. Once the topology is killed, it frees all the Storm resources allotted to this topology. While killing the topology, Storm will first deactivate the spouts and wait for the kill time mentioned in the alert box, so that the bolts have a chance to finish processing the tuples emitted by the spouts before the kill command completes. The following screenshot shows how we can kill the topology through the Storm UI:

Let's go to the Storm UI's home page to check the status of SampleStormClusterTopology, as shown in the following screenshot:

Summary
We have seen how to start the Storm UI daemon, and we have walked through the Storm home page and its sections.

Modeling Furniture in Blender

Packt
22 Oct 2009
10 min read
Create Models or Use a Library?
There are two possibilities when working with furniture. We can create new furniture, or use pre-made models from a library. The question is: when must we use each type? Some people say that using a pre-made model is not a very professional thing, but what they forget to say is that most projects have a tight deadline, and we need a quick modeling process to be ready on time. So, what's most important for professionals? Getting things done, or telling the client that all the models were created just for his project? Of course, the deadline is the most important, and your clients normally won't mind if you use pre-made models. Probably they won't even notice. So don't be ashamed to use pre-made models; they won't make your projects any less professional. It's even recommended to use these models to speed up the process, and allow you to spend more time on lighting or texturing.

Is there any situation that demands the creation of a furniture model from scratch? Well, there are some. First, if you can't find the model in any library that you know, then it's going to be necessary to create it from scratch. If you are working with an architect who designs the spaces and furniture as well, you will probably have to model the furniture too, since it won't be available in any public library. Any project that deals with customized furniture will require that we work on the modeling for the furniture.

Create your own library
A good practice for anyone doing architectural visualization is to collect a lot of 3D models from public libraries for use in future projects. Keep these models for later, but don't forget to check whether the author has released the models with no restrictions for commercial use. Otherwise, you must get their permission to use them. If you want to create your library with no restrictions, why not create your own models? This could be a good exercise: take a few examples, and start creating some furniture. With time, you will have a good number of models.

How to Get Started?
In most cases, we will have to start from scratch, with no blueprints available. The only references that we will have are photos, either provided by our clients or found in some web resources. If you have the time, visit a real store, and take some pictures and measurements on your own. Sometimes, these stores will give you fliers and brochures, especially if you work with architecture. With time, you will gather a lot of good reference material, and some of it comes with measurements. But, if you don't know where to get started, let me point out some great web resources:

http://www.e-interiors.net
http://resources.blogoscopia.com
http://blender-archi.tuxfamily.org/Models
http://www.katorlegaz.com/
http://sketchup.google.com/3dwarehouse

The first link has a lot of reference images classified by furniture type and designer. And sometimes, they even provide free 3D models. Most models there are saved in the DXF or 3DS file formats.

Appending Models
Before we start to model, let's see how we can import a model from an external library into Blender. The process is very simple: we use the File menu and access the Append or Link option. There is a shortcut for that too; just press SHIFT+F1 to call the same function. With this option, we have to select a file that is already in the Blender file format. This option won't import files in other formats.

When we select a file, a list of elements available in that particular file will be displayed, for us to select what we want to import. In most cases, the models will be stored under Object. When we click the Object option, all of the objects available in the file will be listed. If you know the name of the object that you want to import, just select the name, and click Load Library. The object will be loaded into our scene. Here, we have two options to handle this object, Append or Link:

Append: If we choose this option, the object will be merged into our current scene.
Link: With this option, an external link to the object file will be created. Any modifications to the original file will be reflected in our current scene.

What is the best method to use? It will depend on whether we are willing to track all modifications applied to our furniture models. Using the Link method is a great way of keeping the furniture updated, because every modification in the original file is reflected immediately in the scene in which this model is placed. However, we will have to take the original file with the scene file every time we need to put our scene on another computer. They always have to go together. But if you choose to use the Append option, things will be a bit simpler, because the object will be incorporated into the scene file. We won't have to worry about moving the furniture file along with the scene. Always use the Append option when you want to use furniture, or any other model, saved in another Blender file. For those who like to automate this step, a small scripted example follows.
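This article predates Blender's current Python API, but in recent Blender versions the same append operation can be driven from a script. The following is a rough sketch only; the file path and object name are made-up placeholders:

import bpy

# Append the object named "Chair" from the Object section of furniture.blend
# (both names are hypothetical placeholders)
bpy.ops.wm.append(
    directory="/path/to/furniture.blend/Object/",
    filename="Chair",
)

# bpy.ops.wm.link() with the same arguments gives the Link behavior instead.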
To use a furniture model saved in another file with a type other than ".blend", we have to use the Import option.

Importing Models
To import a model, the process is very simple. We must use the File menu and select Import. Then we have to select the proper file type from the list. The best file type, and the most common for furniture blocks, is the 3DS file format, which belongs to the old 3D Studio application. There are some other good formats that work well with Blender, such as OBJ and LWO. The 3DS file format can store lights, and it works well with Blender. The only thing we have to take extra care about is that most imported models come with triangular faces, which are a bit harder to edit. But, if you don't need to make any modifications to the model, this won't be a problem.

Append or Import?
Just to make things clearer: if you download a furniture model from a web site, and it's saved in the Blender native file format (.blend), you should append the model. If you download or get a furniture model in any file format other than ".blend", you will have to import it. Since most models aren't saved in the Blender native file format, we can safely say that almost all furniture models that you find will require an import action to be placed in your scenes.

Modeling a Chair
Let's start with something simple, such as a chair. Even for a simple model, it will help us deal with smaller dimensions and details. Here is an image of the model:

What's the main objective of this modeling? We have to create this chair with the minimum use of faces and vertices. A good amount of detail can be left for textures, and it's always a good choice to use a lower number of vertices and faces in a model. If you consider one model, it won't matter much. But with a large number of chairs, such as in a theater room, it can make a difference in render time. Let's get started with a simple cube. Select this cube, and change the work mode to Edit. Select all vertices and press the W key.
This will open the Specials menu. Choose Subdivide, just once, from this menu. This will create new vertices and edges. Once these new vertices have been created, as shown in the image to the left, below, press the A key to remove all vertices from the selection. Now, select the vertices to the right, using the B key. Remember to change the view mode to Wireframe before using the B key; otherwise, we won't be able to select the vertices behind the visible faces. When these vertices are selected, press the X key and choose Vertices to erase only the selected vertices. Using the CTRL+R key, add a new edge loop to the model, as shown in the following image:

The next step is to change the scale of our model. Rotate the view to see the model in perspective view. Select all vertices and press the S key, followed immediately by the Z key. This will make the scale work only on the Z axis. Now, select the vertices identified in the following image and erase them using the X key. Change the selection mode to Edges, and select the edges identified in the following image. With the edges selected, press the E key to extrude them. With the new faces created, we can now add some detail to the model. Select only the top edge of the previously created faces. Move this edge down just a bit. This will add a small declivity to the seat. Now, we can move on to the next extrude, which must be from the selected edges identified in the following image. I'm not using any kind of measurement for this example, but if you like to work only with real measurements, remember to hold the CTRL key every time a new extrude or edge is moved. This way, all transformations will use the grid lines. For this model, I'm not using vertex snap. With the new faces created, select just the two edges identified in the following image. Extrude these edges until they reach the other side of the base model. Hold the CTRL key while you extrude them, to help with precision. If you already want to remove duplicated vertices, select all objects, and press the W key. Choose Remove doubles to erase any duplicated vertices. Select the edges identified in the image to keep adding more parts to the chair. Extrude the edges three times until you have the same structure shown here. Now, we have to close the top with a face. To do that, we must select all four vertices on the top. When the vertices are selected, press the F key to create a new face. The next step is to select the small side edges to create some detail. Select just one edge, beginning from bottom to top, and move it just a bit. Repeat this operation with the other edges until we get the edges positioned as in the following image. The basic shape of our chair has now been created. Now, we can make some adjustments to improve the overall proportions. Select all edges or vertices on the left side, and move them a bit to the left. This will make the model wider. Did you notice that we have modeled only half a chair? Now we can make the other half using the Mirror modifier. Add the modifier, and choose the right axis to make a perfect copy. If the center point of the model has been moved, you might need to edit the model to create a perfect mirrored match. Don't worry if you have moved the model by accident; this can happen sometimes. Along with the Mirror modifier, add a Subsurf modifier too. With the Subsurf modifier, we realize that this model needs a new edge loop on the left side. Just press CTRL+R, and add a new loop, as in the following image.
Getting Started with Mockito

Packt
19 Jun 2014
14 min read
(For more resources related to this topic, see here.)

Mockito is an open source framework for Java that allows you to easily create test doubles (mocks). What makes Mockito so special is that it eliminates the common expect-run-verify pattern (present, for example, in EasyMock; please refer to http://monkeyisland.pl/2008/02/24/can-i-test-what-i-want-please for more details), which in effect leads to lower coupling of the test code to the production code. In other words, one does not have to define expectations of how the mock should behave in order to verify its behavior. That way, the code is clearer and more readable for the user. On one hand, Mockito has a very active group of contributors and is actively maintained. On the other hand, at the time this article was written, the latest Mockito release (Version 1.9.5) dated back to October 2012. You may ask yourself the question, "Why should I even bother to use Mockito in the first place?" Among many others, Mockito offers the following key features:

There is no expectation phase for Mockito: you can either stub or verify the mock's behavior
You are able to mock both interfaces and classes
You can produce little boilerplate code while working with Mockito by means of annotations
You can easily verify or stub with intuitive argument matchers

Before diving into Mockito as such, one has to understand the concepts of System Under Test (SUT) and test doubles. We will base our terminology on what Gerard Meszaros has defined in xUnit Patterns (http://xunitpatterns.com/Mocks,%20Fakes,%20Stubs%20and%20Dummies.html). SUT (http://xunitpatterns.com/SUT.html) describes the system that we are testing. It doesn't necessarily signify a class; it can be any part of the application that we are testing, or even the whole application as such. As for a test double (http://www.martinfowler.com/bliki/TestDouble.html), it's an object that is used only for testing purposes, instead of a real object. Let's take a look at the different types of test doubles:

Dummy: This is an object that is used only for the code to compile; it doesn't have any business logic (for example, an object passed as a parameter to a method)
Fake: This is an object that has an implementation, but it's not production ready (for example, using an in-memory database instead of communicating with a standalone one)
Stub: This is an object that has predefined answers to method executions made during the test
Mock: This is an object that has predefined answers to method executions made during the test and has recorded expectations of these executions
Spy: This is an object that is similar to a stub, but it additionally records how it was executed (for example, a service that holds a record of the number of sent messages)

An additional remark relates to testing the output of our application: the more decoupled your test code is from your production code, the better, since you will have to spend less time (or even none) on modifying your tests after you change the implementation of the code. Coming back to the article's content: this article is all about getting started with Mockito. We will begin with how to add Mockito to your classpath. Then, we'll see a simple setup of tests for both the JUnit and TestNG test frameworks. Next, we will check why it is crucial to assert the behavior of the system under test instead of verifying its implementation details. Finally, we will check out some of Mockito's experimental features, such as adding hints and warnings to the exception messages. To see what this no-expectation style looks like in practice, a small illustrative sketch follows.
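Here is a tiny, self-contained sketch of stubbing and verifying without an expectation phase. The List example is illustrative only and is not part of the recipes:

import static org.mockito.BDDMockito.given;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.verify;

import java.util.List;

public class NoExpectationPhaseExample {
    @SuppressWarnings("unchecked")
    public static void main(String[] args) {
        List<String> mockedList = mock(List.class);

        // Stubbing: no expectations are recorded up front
        given(mockedList.get(0)).willReturn("first");

        System.out.println(mockedList.get(0)); // prints "first"

        // Verification happens after the fact, and only for what we care about
        verify(mockedList).get(0);
    }
}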
The very idea of the following recipes is to prepare your test classes to work with Mockito and to show you how to do this with as little boilerplate code as possible. Due to my fondness for behavior-driven development (http://dannorth.net/introducing-bdd/, first introduced by Dan North), I'm using Mockito's BDDMockito and AssertJ's BDDAssertions static methods to make the code even more readable and intuitive in all the test cases. Also, please read Szczepan Faber's blog (he is the author of Mockito) about the given, when, then separation in your test methods, http://monkeyisland.pl/2009/12/07/given-when-then-forever/, since these are omnipresent throughout the article. I don't want the article to become a duplication of the Mockito documentation, which is of high quality; I would like you to take a look at good tests and get acquainted with the Mockito syntax from the beginning. What's more, I've used static imports in the code to make it even more readable, so if you get confused with any of the pieces of code, it would be best to consult the repository and the code as such.

Adding Mockito to a project's classpath
Adding Mockito to a project's classpath is as simple as adding one of the two jars to your project's classpath:

mockito-all: This is a single jar with all dependencies (with the hamcrest and objenesis libraries, as of June 2011).
mockito-core: This is only the Mockito core (without hamcrest or objenesis). Use this if you want to control which version of hamcrest or objenesis is used.

How to do it...
If you are using a dependency manager that connects to the Maven Central Repository, then you can get your dependencies as follows (examples of how to add mockito-all to your classpath for Maven and Gradle):

For Maven, use the following code:

<dependency>
  <groupId>org.mockito</groupId>
  <artifactId>mockito-all</artifactId>
  <version>1.9.5</version>
  <scope>test</scope>
</dependency>

For Gradle, use the following code:

testCompile "org.mockito:mockito-all:1.9.5"

If you are not using any of the dependency managers, you have to either download mockito-all.jar or mockito-core.jar and add it to your classpath manually (you can download the jars from https://code.google.com/p/mockito/downloads/list).

Getting started with Mockito for JUnit
Before going into details regarding Mockito and JUnit integration, it is worth mentioning a few words about JUnit. JUnit is a testing framework (an implementation of the xUnit framework) that allows you to create repeatable tests in a very readable manner. In fact, JUnit is a port of Smalltalk's SUnit (both frameworks were originally implemented by Kent Beck). What is important in terms of JUnit and Mockito integration is that, under the hood, JUnit uses a test runner to run its tests (in xUnit terms, a test runner is a program that executes the test logic and reports the test results). Mockito has its own test runner implementation that allows you to reduce boilerplate in order to create test doubles (mocks and spies) and to inject them (either via constructors, setters, or reflection) into the defined object. What's more, you can easily create argument captors.
All of this is feasible by means of proper annotations, as follows:

@Mock: This is used for mock creation
@Spy: This is used to create a spy instance
@InjectMocks: This is used to instantiate the @InjectMocks annotated field and inject all the @Mock or @Spy annotated fields into it (if applicable)
@Captor: This is used to create an argument captor

By default, you should profit from Mockito's annotations to make your code look neat and to reduce the boilerplate code in your application.

Getting ready
In order to add JUnit to your classpath, if you are using a dependency manager that connects to the Maven Central Repository, then you can get your dependencies as follows (examples for Maven and Gradle):

To add JUnit in Maven, use the following code:

<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
  <version>4.11</version>
  <scope>test</scope>
</dependency>

To add JUnit in Gradle, use the following code:

testCompile('junit:junit:4.11')

If you are not using any of the dependency managers, you have to download the following jars:

junit.jar
hamcrest-core.jar

Add the downloaded files to your classpath manually (you can download the jars from https://github.com/junit-team/junit/wiki/Download-and-Install).

For this recipe, our system under test will be a MeanTaxFactorCalculator class that will call an external service, TaxService, to get the current tax factor for the current user. It's a tax factor and not a tax as such since, for simplicity, we will not be using BigDecimals but doubles, and I'd never suggest using doubles for anything related to money:

public class MeanTaxFactorCalculator {

    private final TaxService taxService;

    public MeanTaxFactorCalculator(TaxService taxService) {
        this.taxService = taxService;
    }

    public double calculateMeanTaxFactorFor(Person person) {
        double currentTaxFactor = taxService.getCurrentTaxFactorFor(person);
        double anotherTaxFactor = taxService.getCurrentTaxFactorFor(person);
        return (currentTaxFactor + anotherTaxFactor) / 2;
    }

}

How to do it...
To use Mockito's annotations, you have to perform the following steps:

Annotate your test class with @RunWith(MockitoJUnitRunner.class).
Annotate the test fields with the @Mock or @Spy annotation to have either a mock or a spy object instantiated.
Annotate the test fields with the @InjectMocks annotation to first instantiate the @InjectMocks annotated field and then inject all the @Mock or @Spy annotated fields into it (if applicable).

The following snippet shows the JUnit and Mockito integration in a test class that verifies the SUT's behavior (remember that I'm using the BDDMockito.given(...) and AssertJ's BDDAssertions.then(...) static methods):

@RunWith(MockitoJUnitRunner.class)
public class MeanTaxFactorCalculatorTest {

    static final double TAX_FACTOR = 10;

    @Mock TaxService taxService;

    @InjectMocks MeanTaxFactorCalculator systemUnderTest;

    @Test
    public void should_calculate_mean_tax_factor() {
        // given
        given(taxService.getCurrentTaxFactorFor(any(Person.class))).willReturn(TAX_FACTOR);

        // when
        double meanTaxFactor = systemUnderTest.calculateMeanTaxFactorFor(new Person());

        // then
        then(meanTaxFactor).isEqualTo(TAX_FACTOR);
    }

}

To profit from Mockito's annotations using JUnit, you just have to annotate your test class with @RunWith(MockitoJUnitRunner.class).

How it works...
The Mockito test runner will adapt its strategy depending on the version of JUnit.
If there exists an org.junit.runners.BlockJUnit4ClassRunner class, it means that the codebase is using at least JUnit in Version 4.5. What eventually happens is that the MockitoAnnotations.initMocks(...) method is executed for the given test, which initializes all the Mockito annotations (for more information, check the subsequent There's more... section).

There's more...
You may have a situation where your test class has already been annotated with a @RunWith annotation and, seemingly, you cannot profit from Mockito's annotations. In order to achieve this, you have to call the MockitoAnnotations.initMocks method manually in the @Before annotated method of your test, as shown in the following code:

public class MeanTaxFactorCalculatorTest {

    static final double TAX_FACTOR = 10;

    @Mock TaxService taxService;

    @InjectMocks MeanTaxFactorCalculator systemUnderTest;

    @Before
    public void setup() {
        MockitoAnnotations.initMocks(this);
    }

    @Test
    public void should_calculate_mean_tax_factor() {
        // given
        given(taxService.getCurrentTaxFactorFor(Mockito.any(Person.class))).willReturn(TAX_FACTOR);

        // when
        double meanTaxFactor = systemUnderTest.calculateMeanTaxFactorFor(new Person());

        // then
        then(meanTaxFactor).isEqualTo(TAX_FACTOR);
    }

}

To use Mockito's annotations without a JUnit test runner, you have to call the MockitoAnnotations.initMocks method and pass the test class as its parameter. Mockito checks whether the user has overridden the global configuration of AnnotationEngine, and if this is not the case, the InjectingAnnotationEngine implementation is used to process the annotations in tests. What is done internally is that the test class fields are scanned for annotations, and the proper test doubles are initialized and injected into the @InjectMocks annotated object (either by constructor, property setter, or field injection, in that precise order). You have to remember several factors related to the automatic injection of test doubles, as follows:

If Mockito is not able to inject test doubles into the @InjectMocks annotated fields through any of the strategies, it won't report a failure; the test will continue as if nothing happened (and most likely, you will get a NullPointerException).
For constructor injection, if arguments cannot be found, then null is passed.
For constructor injection, if nonmockable types are required in the constructor, then the constructor injection won't take place.
For other injection strategies, if you have properties with the same type (or the same erasure) and if Mockito matches mock names with a field/property name, it will inject that mock properly. Otherwise, the injection won't take place.
For other injection strategies, if the @InjectMocks annotated object wasn't previously initialized, then Mockito will instantiate the aforementioned object using a no-arg constructor, if applicable.

See also
JUnit documentation at https://github.com/junit-team/junit/wiki
Martin Fowler's article on xUnit at http://www.martinfowler.com/bliki/Xunit.html
Gerard Meszaros's xUnit Test Patterns at http://xunitpatterns.com/
@InjectMocks Mockito documentation (with a description of the injection strategies) at http://docs.mockito.googlecode.com/hg/1.9.5/org/mockito/InjectMocks.html

Getting started with Mockito for TestNG
Before going into details regarding Mockito and TestNG integration, it is worth mentioning a few words about TestNG.
TestNG is a unit testing framework for Java that was created, as the author states on the tool's website (refer to the See also section for the link), out of frustration with some JUnit deficiencies. TestNG was inspired by both JUnit and NUnit, and it aims at covering the whole scope of testing: from unit, through functional and integration, to end-to-end tests, and so on. The JUnit library, however, was initially created for unit testing only. The main differences between JUnit and TestNG are as follows:

The TestNG author disliked JUnit's approach of having to define some methods as static in order to be executed before the test class logic gets executed (for example, the @BeforeClass annotated methods); that's why in TestNG you don't have to define these methods as static
TestNG has more annotations related to method execution before single tests, suites, and test groups
TestNG annotations are more descriptive in terms of what they do; for example, JUnit's @Before versus TestNG's @BeforeMethod

Mockito in Version 1.9.5 doesn't provide any out-of-the-box solution to integrate with TestNG in a simple way, but there is a special Mockito subproject for TestNG (refer to the See also section for the URL) that should become part of one of the subsequent Mockito releases. In the following recipe, we will take a look at how to profit from that code, which is a very elegant solution.

Getting ready
When you take a look at Mockito's TestNG subproject on the Mockito GitHub repository, you will find that there are three classes in the org.mockito.testng package, as follows:

MockitoAfterTestNGMethod
MockitoBeforeTestNGMethod
MockitoTestNGListener

Unfortunately, until this project eventually gets released, you have to copy and paste those classes to your codebase.

How to do it...
To integrate TestNG and Mockito, perform the following steps:

Copy the MockitoAfterTestNGMethod, MockitoBeforeTestNGMethod, and MockitoTestNGListener classes to your codebase from Mockito's TestNG subproject.
Annotate your test class with @Listeners(MockitoTestNGListener.class).
Annotate the test fields with the @Mock or @Spy annotation to have either a mock or a spy object instantiated.
Annotate the test fields with the @InjectMocks annotation to first instantiate the @InjectMocks annotated field and inject all the @Mock or @Spy annotated fields into it (if applicable).
Annotate the test fields with the @Captor annotation to make Mockito instantiate an argument captor.

Now let's take a look at this snippet that, using TestNG, checks whether the mean tax factor value has been calculated properly (remember that I'm using the BDDMockito.given(...) and AssertJ's BDDAssertions.then(...) static methods):

@Listeners(MockitoTestNGListener.class)
public class MeanTaxFactorCalculatorTestNgTest {

    static final double TAX_FACTOR = 10;

    @Mock TaxService taxService;

    @InjectMocks MeanTaxFactorCalculator systemUnderTest;

    @Test
    public void should_calculate_mean_tax_factor() {
        // given
        given(taxService.getCurrentTaxFactorFor(any(Person.class))).willReturn(TAX_FACTOR);

        // when
        double meanTaxFactor = systemUnderTest.calculateMeanTaxFactorFor(new Person());

        // then
        then(meanTaxFactor).isEqualTo(TAX_FACTOR);
    }

}

How it works...
TestNG allows you to register custom listeners (your listener class has to implement the IInvokedMethodListener interface). Once you do this, the logic inside the implemented methods will be executed before and after every configuration and test method gets called.
Mockito provides you with a listener whose responsibilities are as follows:

Initialize mocks annotated with the @Mock annotation (this is done only once)
Validate the usage of Mockito after each test method

Remember that with TestNG, all mocks are reset (or initialized, if that hasn't been done already) before any TestNG method!

See also
The TestNG homepage at http://testng.org/doc/index.html
The Mockito TestNG subproject at https://github.com/mockito/mockito/tree/master/subprojects/testng
The Getting started with Mockito for JUnit recipe for the @InjectMocks analysis

Implementing Unity 2017 Game Audio [Tutorial]

Amarabha Banerjee
11 Jul 2018
11 min read
Background music and audio effects play a big role in determining any game's success or failure. Creating engaging game audio, importing audio from other sources, and customizing audio FX clips to fit the game flow are vital tasks for any game developer. In this article, we are going to discuss how to create, customize, and use third-party audio in Unity games. This article is a part of the book titled "Unity 2017 2D Game Development Projects" written by Lauren S. Ferro & Francesco Sapio.

Basics of audio and sound FX in Unity
Adding sound in Unity is simple enough, but you can implement it better if you understand how sound travels. While this is extremely important in 3D games because of the added third dimension, it is quite important in 2D games too, just in a slightly different way. Before we discuss the differences, let's first learn how sound works with a quick physics lesson.

Listening to the physics behind sound
What we hear is not just music, sound effects (FX), and ambient background noise. Sound is a longitudinal, mechanical (vibrating) wave. These "waves" can pass through different mediums (for example, air, water, your desk), but not through a vacuum. Therefore, no one will hear your screams in space. Sound is a variation in pressure. A region of increased pressure on a sound wave is called a compression (or condensation). A region of decreased pressure on a sound wave is called a rarefaction (or dilation). You can see this concept illustrated in the following image:

The density of certain materials, such as glass and plastic, allows a certain amount of light to pass through them, and this influences how the light behaves when it passes through, such as bending/refracting (that is, the index of refraction). Various materials (for example, liquids, solids, and gases) have the same kind of effect when it comes to allowing sound waves to pass. Some materials allow sound to pass easily, while others dampen it. Therefore, sound studios/booths are made of certain materials to remove things such as echoes. It has a similar effect to when you scream underwater that there is a shark: it won't be as loud as if you scream from your kitchen to tell everyone dinner is ready.

Another thing to consider is what is known as the Doppler Effect. The Doppler Effect results from an increase (or decrease) in the frequency of sound (and other things, such as light or ripples in water) as the source of the sound and the person/player move toward (or away from) each other. A simple example of this is when an emergency vehicle passes by you. You will notice that the sound of the siren is different before it reaches you, when it is near you, and once it passes you. Considering this example, it is because there is a sudden change of pitch in the passing siren. This is visualized in the following image:

So, what is the point of knowing this when it comes to developing games? Well, this is particularly important when creating games, more so in 3D, in relation to how sounds are heard by players. For example, imagine that you're nearing a creek, but there are dense bushes, large pine trees, and rugged terrain. The sound that the creek makes from where the player is in the game world is going to be very different from how it would sound on a completely flat plane free from any vegetation.
When it comes to 2D games, this is not necessarily as important, because we are working without depth (the z axis), but similar principles apply when players are navigating around a top-down environment and they are near a point of interest. You don't want that sound to be as loud when the player is far away as it would be if they were up close. Within the context of 2D and 3D sounds, Unity has a parameter for this exact thing called Spatial Blend. We will discuss this more in the Audio Source section. There are several ways that you can create audio within Unity, from importing your own/downloaded sounds to recording them live. Like images, Unity can import most standard audio file formats: AIFF, WAV, MP3, and Ogg, as well as tracker modules (for example, short instrument samples): .xm, .mod, .it, and .s3m.

Importing audio
Importing audio into Unity follows the same process as importing any other type of asset. We will cover the basics of what you need to know in the following sections.

Audio Listener
Have you heard the saying, "If a tree falls in a forest and no one is there to hear it, does it still make a sound?" Well, in Unity, if there is nothing to hear your audio, then the answer is no. This is because Unity has a component called an Audio Listener, which works like a microphone. To locate the Audio Listener, click the Main Camera, and then look over at the Inspector; it should be located near the bottom, as in the following image:

If, for some reason, it isn't there, you can always add it by clicking the Add Component button, typing Audio Listener, and selecting it from the list, as in the following image:

The important thing to remember is that the Audio Listener is the location at which sound is heard, which is why it is typically placed on the Main Camera, but it can also be placed on a Player. A single scene can only have one Audio Listener; therefore, it's best to experiment to see which placement works best for your game. It is important to remember that an Audio Listener works with an Audio Source, and must have one to work.

Audio Source
The Audio Source is where the sound comes from. This can be many different objects within a Scene, as well as background music and sound FX. The Audio Source has several parameters; later, we will briefly discuss the main ones. To see more information about all the parameters, you can check out the official Unity documentation at the following link:

https://docs.unity3d.com/2017.2/Documentation/Manual/class-AudioSource.html

You may be wondering why we have a slider for Spatial Blend instead of a checkbox. This is because we need to fade between 2D and 3D, and there is a good reason for this. Imagine that you're in a game and you're looking at a screen on a computer. In this case, your camera is going to be fixated on whatever is on the screen. This could be checking an inventory or even entering nuclear codes. In any case, you will want the sound that is being emitted from the screen to be the focal audio. Therefore, the slider in the Spatial Blend parameter is going to be closer to 2D. This is because you may still want ambient noises that are in the background incorporated into the experience. So, if you are closer to 2D, the sound will be the same in both speakers (or headphones). The closer you slide toward 3D, the more the volume will depend on the proximity of the Audio Listener to the Audio Source. It will also allow things such as the Doppler Effect to be more noticeable, as it takes place in 3D space. There are also specific settings for these things. If you prefer to control this blend from a script, see the short sketch that follows.
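AudioSource exposes the same Spatial Blend slider as a float in code. The following is a small illustrative sketch; the component setup and the chosen value are assumptions, not taken from the original article:

using UnityEngine;

// Attach this to a GameObject that has an AudioSource component
public class SpatialBlendExample : MonoBehaviour
{
    void Start()
    {
        AudioSource source = GetComponent<AudioSource>();
        // 0f = fully 2D (same volume in both speakers),
        // 1f = fully 3D (volume depends on distance to the Audio Listener)
        source.spatialBlend = 0.8f;
    }
}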
Choosing sounds for background and FX
When it comes to picking the right kind of music for your game, just like with the aesthetics, you need to think about what kind of "mood" you're trying to create. Is it a somber or uplifting kind of mood? Are you ironically contrasting the graphics (for example, happy) with gloomy music? There is really no right or wrong when it comes to your musical selection, as long as you can communicate to the player what they are supposed to feel, at least in general. For this game, I have provided you with some example "moods" that you can apply to this game. Of course, you're welcome to choose sounds other than these that are more to your liking! All the sounds that we will use will be from the Free Sound website: https://freesound.org. You will need to create an account to download them, but it's free, and there are many great sounds that you can use when creating games. In saying this, if you're intending to create your games for commercial purposes, please make sure that you check the Terms and Conditions on Free Sound so that you're not violating any of them. Each track will have its own attribution license, including those for commercial use, so always check! For this project, we're going to stick with the "Happy" version. But I encourage you to experiment!

Happy
Collecting Angel Cakes: Chime sound (https://freesound.org/people/jgreer/sounds/333629/)
Being attacked by the enemy: Cat Purr/Twit4.wav (https://freesound.org/people/steffcaffrey/sounds/262309/)
Collecting health: correct (https://freesound.org/people/ertfelda/sounds/243701/)
Collecting bonuses: Signal-Ring 1 (https://freesound.org/people/Vendarro/sounds/399315/)
Background: Kirmes_Orgel_004_2_Rosamunde.mp3 (https://freesound.org/people/bilwiss/sounds/24720/)

Sad
Collecting Angel Cakes: Glass Tap (https://freesound.org/people/Unicornaphobist/sounds/262958/)
Being attacked by the enemy: musicbox1.wav (https://freesound.org/people/sandocho/sounds/17700/)
Collecting health: chime.wav (https://freesound.org/people/Psykoosiossi/sounds/398661/)
Collecting bonuses: short metallic hit (https://freesound.org/people/waveplay/sounds/366400/)
Background: improvised chill 8 (https://freesound.org/people/waveplay/sounds/238529/)

Retro
Collecting Angel Cakes: TF_Buzz.flac (https://freesound.org/people/copyc4t/sounds/235652/)
Being attacked by the enemy: Game Die (https://freesound.org/people/josepharaoh99/sounds/364929/)
Collecting health: galanghee.wav (https://freesound.org/people/metamorphmuses/sounds/91387/)
Collecting bonuses: SW05.WAV (https://freesound.org/people/mad-monkey/sounds/66684/)
Background: Angel-techno pop music loop (https://freesound.org/people/frankum/sounds/387410/)

Not everyone can hear well, or at all, so it pays to keep this in mind when you're developing games that may rely on audio to provide feedback to players. While subtitles can make dialogue more accessible, sound FX can be a little trickier. Therefore, when it comes to implementing audio, think about how you could complement it, even if the effect that you're trying to achieve with sound is subtle. For example, if you play a "bleep" for every item collected, perhaps you could associate it with a slight glow or flash of color. The choice is up to you, but it's something to keep in mind.
On the other end of the spectrum, those who can hear might also want to turn the sounds off. We've all played that game (or several) that really begins to become irritating, so make sure that you also check this while you're playtesting. You don't want an awesome game to suck because your audio is intolerable and there is no option to TURN THE SOUND OFF! You've been warned.

Integrating background music in our game
Once you choose which music best suits the kind of feel you want to create for your game, import both the sounds and the music into the project. If you want, you can create two folders for them, SoundFX and Music, respectively. Now, in our scene, we need to do the following:

Create an empty game object (by clicking GameObject | Create Empty) and rename it Background Music.
Attach an Audio Source component (in the Inspector, click Add Component | Audio | Audio Source).
Next, we need to drag and drop the music we decided on/downloaded into the AudioClip variable and check the Loop option, so the background music will never stop. Also, check that Play on Awake is ticked as well (it should be by default), so the music will start playing as soon as the game starts.
Hit Play to start the game.
Lastly, adjust the volume, depending on the music you chose. This may require a bit of playtesting (remember to set the value after play mode, because the settings you adjust during play mode are not kept).

In the end, this is how the component should look (in the image, I chose the happy theme music and set a Volume of 0.1):

Here in this article, we have shown you how to incorporate game audio effects and background music in Unity games. If you liked this article, then check out the complete book Unity 2017 2D Game Development Projects.

AI for Unity game developers: How to emulate real-world senses in your NPC agent
Working with Unity Variables to script powerful Unity 2017 games
How to use arrays, lists, and dictionaries in Unity for 3D game development

Google open sources an on-device, real-time hand gesture recognition algorithm built with MediaPipe

Sugandha Lahoti
21 Aug 2019
3 min read
Google researchers have unveiled a new real-time hand tracking algorithm that could be a breakthrough for people communicating via sign language. Their algorithm uses machine learning to compute 3D keypoints of a hand from a video frame. This research is implemented in MediaPipe, an open-source, cross-platform framework for building multimodal (for example, video, audio, or any time series data) applied ML pipelines. What is interesting is that the 3D hand perception can be viewed in real time on a mobile phone.

How real-time hand perception and gesture recognition works with MediaPipe
The algorithm is built using the MediaPipe framework. Within this framework, the pipeline is built as a directed graph of modular components. The pipeline employs three different models: a palm detector model, a hand landmark detector model, and a gesture recognizer.

The palm detector operates on full images and outputs an oriented bounding box. The researchers employ a single-shot detector model called BlazePalm, which achieves an average precision of 95.7% in palm detection.

Next, the hand landmark detector takes the cropped image defined by the palm detector and returns 3D hand keypoints. For detecting keypoints on the palm images, the researchers manually annotated around 30K real-world images with 21 coordinates. They also generated a synthetic dataset to improve the robustness of the hand landmark detection model.

The gesture recognizer then classifies the previously computed keypoint configuration into a discrete set of gestures. The algorithm determines the state of each finger, for example bent or straight, by the accumulated angles of the joints. The existing pipeline supports counting gestures from multiple cultures, for example American, European, and Chinese, and various hand signs including "Thumb up", closed fist, "OK", "Rock", and "Spiderman". They also trained their models to work in a wide variety of lighting situations and with a diverse range of skin tones.

Gesture recognition - Source: Google blog

With MediaPipe, the researchers built their pipeline as a directed graph of modular components, called Calculators. Individual calculators, such as those for cropping, rendering, and neural network computation, can be run exclusively on the GPU. They employed TFLite GPU inference on most modern phones. The researchers are open sourcing the hand tracking and gesture recognition pipeline in the MediaPipe framework, along with the source code. The researchers Valentin Bazarevsky and Fan Zhang write in a blog post, "Whereas current state-of-the-art approaches rely primarily on powerful desktop environments for inference, our method, achieves real-time performance on a mobile phone, and even scales to multiple hands. We hope that providing this hand perception functionality to the wider research and development community will result in an emergence of creative use cases, stimulating new applications and new research avenues."

People commended the fact that this algorithm can run on mobile devices and is useful for people who communicate via sign language.

https://twitter.com/SOdaibo/status/1163577788764495872
https://twitter.com/anshelsag/status/1163597036442148866
https://twitter.com/JonCorey1/status/1163997895835693056

Microsoft Azure VP demonstrates Holoportation, a reconstructed transmittable 3D technology
Terrifyingly realistic Deepfake video of Bill Hader transforming into Tom Cruise is going viral on YouTube.
Google News Initiative partners with Google AI to help 'deep fake' audio detection research

Predicting Bitcoin price from historical and live data

Sunith Shetty
06 Apr 2018
17 min read
Today, you will learn how to collect Bitcoin historical and live price data. You will also learn to transform the data into time series and train your model to make insightful predictions.

Historical and live price data collection
We will be using the Bitcoin historical price data from Kaggle. For the real-time data, the Cryptocompare API will be used.

Historical data collection
For training the ML algorithm, there is a Bitcoin Historical Price Data dataset available to the public on Kaggle (version 10). The dataset can be downloaded here. It has 1-minute OHLC data for BTC-USD pairs from several exchanges. At the beginning of the project, for most of them, data was available from January 1, 2012 to May 31, 2017; but for the Bitstamp exchange, it's available until October 20, 2017 (as well as for Coinbase, but that dataset became available later):

Figure 1: The Bitcoin historical dataset on Kaggle

Note that you need to be a registered user and be logged in in order to download the file. The file that we are using is bitstampUSD_1-min_data_2012-01-01_to_2017-10-20.csv. Now, let us look at the data we have. It has eight columns:

Timestamp: The time elapsed in seconds since January 1, 1970. It is 1,325,317,920 for the first row and 1,325,317,980 for the second one. (Sanity check! The difference is 60 seconds.)
Open: The price at the opening of the time interval. It is 4.39 dollars. Therefore, it is the price of the first trade that happened after Timestamp (1,325,317,920 in the first row's case).
Close: The price at the closing of the time interval.
High: The highest price from all orders executed during the interval.
Low: The same as High, but it is the lowest price.
Volume_(BTC): The sum of all Bitcoins that were transferred during the time interval. So, take all transactions that happened during the selected interval and sum up the BTC values of each of them.
Volume_(Currency): The sum of all dollars transferred.
Weighted_Price: This is derived from the volumes of BTC and USD. By dividing all dollars traded by all bitcoins, we can get the weighted average price of BTC during this minute. So Weighted_Price = Volume_(Currency) / Volume_(BTC).

One of the most important parts of the data-science pipeline after data collection (which is, in a sense, outsourced; we use data collected by others) is data preprocessing: cleaning a dataset and transforming it to suit our needs.

Transformation of historical data into a time series
Stemming from our goal (predicting the direction of the price change), we might ask ourselves: does having the actual price in dollars help to achieve this? Historically, the price of Bitcoin was usually rising, so if we try to fit a linear regression, it will show further exponential growth (whether this will hold in the long run is yet to be seen).

Assumptions and design choices
One of the assumptions of this project is as follows: whether we are thinking about Bitcoin trading in November 2016 with a price of about $700, or trading in November 2017 with a price in the $6,500-7,000 range, the patterns in how people trade are similar. Now, we have several other assumptions, as described in the following points:

Assumption one: From what has been said previously, we can ignore the actual price and rather look at its change. As a measure of this, we can take the delta between the opening and closing prices. If it is positive, it means the price grew during that minute; the price went down if it is negative, and it stayed the same if delta = 0.
In the following figure, we can see that Delta was -1.25 for the first minute observed, -12.83 for the second one, and -0.23 for the third one. Sometimes, the open price can differ significantly from the close price of the previous minute (although Delta is negative during all three of the observed minutes, for the third minute the open price shown was actually higher than the close of the second one). But such things are not very common, and usually the open price doesn't change significantly compared to the close price of the previous minute.

Assumption two: The next thing we need to consider is that we are predicting the price change in a black-box environment. We do not use other sources of knowledge, such as news, Twitter feeds, and others, to predict how the market would react to them; that is a more advanced topic. The only data we use is price and volume. For simplicity of the prototype, we can focus on price only and construct time series data. Time series prediction is a prediction of a parameter based on the values of this parameter in the past. One of the most common examples is temperature prediction. Although there are many supercomputers using satellite and sensor data to predict the weather, a simple time series analysis can lead to some valuable results. We predict the price at T+60 seconds, for instance, based on the price at T, T-60s, T-120s, and so on.

Assumption three: Not all data in the dataset is valuable. The first 600,000 records are not informative, as price changes are rare and trading volumes are small; only data where meaningful trading occurs is kept. This can affect the model we are training and thus make the end results worse. That is why the first 600,000 rows are eliminated from the dataset.

Assumption four: We need to label our data so that we can use a supervised ML algorithm: a minute is labeled 1 if its Delta is positive and 0 otherwise. This is the easiest labeling scheme, without concerns about transaction fees.

Data preprocessing

Taking into account the goals of data preparation, Scala was chosen as an easy and interactive way to manipulate data:

import org.apache.spark.sql.SparkSession

val priceDataFileName: String = "bitstampUSD_1-min_data_2012-01-01_to_2017-10-20.csv"

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.sql.warehouse.dir", "E:/Exp/")
  .appName("Bitcoin Preprocessing")
  .getOrCreate()

val data = spark.read.format("com.databricks.spark.csv").option("header", "true").load(priceDataFileName)
data.show(10)
>>>
println((data.count(), data.columns.size))
>>>
(3045857, 8)

In the preceding code, we load data from the file downloaded from Kaggle and look at what is inside. There are 3,045,857 rows in the dataset and 8 columns, described before. Then we create the Delta column, containing the difference between closing and opening prices:

val dataWithDelta = data.withColumn("Delta", data("Close") - data("Open"))

The following code labels our data by assigning 1 to the rows whose Delta value is positive; it assigns 0 otherwise:

import org.apache.spark.sql.functions._
import spark.sqlContext.implicits._

val dataWithLabels = dataWithDelta.withColumn("label", when($"Close" - $"Open" > 0, 1).otherwise(0))
rollingWindow(dataWithLabels, 22, outputDataFilePath, outputLabelFilePath)

The rollingWindow call transforms the original dataset into time series data. It takes the Delta values of WINDOW_SIZE rows (22 in this experiment) and makes a new row out of them. In this way, the first row has Delta values from t0 to t21, and the second one has values from t1 to t22. Then we create the corresponding array with labels (1 or 0).
Finally, we save X and Y into files. In the following code, 612,000 rows are cut off from the original dataset, 22 is the rolling window size, and the labels form two binary classes, 0 and 1:

import java.io.{BufferedWriter, File, FileWriter}
import org.apache.spark.sql.DataFrame

val dropFirstCount: Int = 612000

def rollingWindow(data: DataFrame, window: Int, xFilename: String, yFilename: String): Unit = {
  var i = 0
  val xWriter = new BufferedWriter(new FileWriter(new File(xFilename)))
  val yWriter = new BufferedWriter(new FileWriter(new File(yFilename)))
  val zippedData = data.rdd.zipWithIndex().collect()
  System.gc()
  val dataStratified = zippedData.drop(dropFirstCount) // slice off the first 612K rows
  while (i < (dataStratified.length - window)) {
    val x = dataStratified
      .slice(i, i + window)
      .map(r => r._1.getAs[Double]("Delta")).toList
    val y = dataStratified.apply(i + window)._1.getAs[Integer]("label")
    val stringToWrite = x.mkString(",")
    xWriter.write(stringToWrite + "\n")
    yWriter.write(y + "\n")
    i += 1
    if (i % 10 == 0) {
      xWriter.flush()
      yWriter.flush()
    }
  }
  xWriter.close()
  yWriter.close()
}

In the preceding code segment, the output paths are defined as follows:

val outputDataFilePath: String = "output/scala_test_x.csv"
val outputLabelFilePath: String = "output/scala_test_y.csv"

Real-time data through the Cryptocompare API

For real-time data, the Cryptocompare API is used, more specifically HistoMinute, which gives us access to OHLC data for the past seven days at most. The details of the API will be discussed in a section devoted to implementation, but the API response is very similar to our historical dataset, and this data is retrieved using a regular HTTP request. For example, a simple JSON response has the following structure:

{ "Response":"Success", "Type":100, "Aggregated":false,
  "Data":
  [{"time":1510774800,"close":7205,"high":7205,"low":7192.67,"open":7198,"volumefrom":81.73,"volumeto":588726.94},
   {"time":1510774860,"close":7209.05,"high":7219.91,"low":7205,"open":7205,"volumefrom":16.39,"volumeto":118136.61},
   ... (other price data)
  ],
  "TimeTo":1510776180,
  "TimeFrom":1510774800,
  "FirstValueInArray":true,
  "ConversionType":{"type":"force_direct","conversionSymbol":""}
}

Through Cryptocompare HistoMinute, we can get open, high, low, close, volumefrom, and volumeto values for each minute of historical data. This data is stored for 7 days only; if you need more, use the hourly or daily path. The API uses BTC conversion if data is not available because the coin is not being traded in the specified currency.

Now, the following method fetches data from the Cryptocompare API, given a fully formed URL with all parameters, such as currency, limit, and aggregation, specified.
It finally returns a future that will have the response body parsed into the data model, with the price list to be processed at an upper level:

import javax.inject.Inject
import play.api.libs.json.{JsResult, Json}
import scala.concurrent.Future
import play.api.mvc._
import play.api.libs.ws._
import processing.model.CryptoCompareResponse

class RestClient @Inject() (ws: WSClient) {
  def getPayload(url: String): Future[JsResult[CryptoCompareResponse]] = {
    val request: WSRequest = ws.url(url)
    val future = request.get()
    implicit val context = play.api.libs.concurrent.Execution.Implicits.defaultContext
    future.map { response =>
      response.json.validate[CryptoCompareResponse]
    }
  }
}

In the preceding code segment, the CryptoCompareResponse class is the model of the API response, which takes the following parameters:

Response
Type
Aggregated
Data
FirstValueInArray
TimeTo
TimeFrom

It has the following signature:

case class CryptoCompareResponse(Response: String, Type: Int, Aggregated: Boolean, Data: List[OHLC], FirstValueInArray: Boolean, TimeTo: Long, TimeFrom: Long)

object CryptoCompareResponse {
  implicit val cryptoCompareResponseReads = Json.reads[CryptoCompareResponse]
}

The OHLC (open-high-low-close) case class, shown next, is the model class for mapping the internals of the Data array in the CryptoCompare API response. It takes these parameters:

Time: Timestamp in seconds, 1508818680, for instance.
Open: Open price at a given minute interval.
High: Highest price.
Low: Lowest price.
Close: Price at the closing of the interval.
Volumefrom: Trading volume in the from currency; it's BTC in our case.
Volumeto: The trading volume in the to currency, USD in our case. Dividing Volumeto by Volumefrom gives us the weighted price of BTC.

It has the following signature:

case class OHLC(time: Long, open: Double, high: Double, low: Double, close: Double, volumefrom: Double, volumeto: Double)

object OHLC {
  implicit val implicitOHLCReads = Json.reads[OHLC]
}

Model training for prediction

Inside the project, in the package folder prediction.training, there is a Scala object called TrainGBT.scala. Before launching, you have to specify/change four things:

In the code, you need to set up spark.sql.warehouse.dir in some actual place on your computer that has several gigabytes of free space: set("spark.sql.warehouse.dir", "/home/user/spark")
The rootDir is the main folder, where all files and trained models will be stored: rootDir = "/home/user/projects/btc-prediction/"
Make sure that the x filename matches the one produced by the Scala script in the preceding step: x = spark.read.format("com.databricks.spark.csv").schema(xSchema).load(rootDir + "scala_test_x.csv")
Make sure that the y filename matches the one produced by the Scala script: y_tmp = spark.read.format("com.databricks.spark.csv").schema(ySchema).load(rootDir + "scala_test_y.csv")

The code for training uses the Apache Spark ML library (and the libraries required for it) to train the classifier, which means they have to be present in your classpath to be able to run it. The easiest way to do that (since the whole project uses SBT) is to run it from the project root folder by typing sbt "run-main prediction.training.TrainGBT", which will resolve all dependencies and launch training. Depending on the number of iterations and depth, it can take several hours to train the model. Now let us see how training is performed on the example of the gradient-boosted trees model.
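First, we need to create a SparkSession object. The following is a minimal sketch mirroring the preprocessing step shown earlier; the master URL, warehouse directory, and application name are placeholders that you should adapt to your own setup:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]") // local mode; point this at your cluster master in production
  .config("spark.sql.warehouse.dir", "/home/user/spark") // needs several GB of free space
  .appName("Bitcoin Training")
  .getOrCreate()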
Then we define the schema for the 22 feature columns, t0 through t21:

val xSchema = StructType(Array(
  StructField("t0", DoubleType, true),
  StructField("t1", DoubleType, true),
  StructField("t2", DoubleType, true),
  StructField("t3", DoubleType, true),
  StructField("t4", DoubleType, true),
  StructField("t5", DoubleType, true),
  StructField("t6", DoubleType, true),
  StructField("t7", DoubleType, true),
  StructField("t8", DoubleType, true),
  StructField("t9", DoubleType, true),
  StructField("t10", DoubleType, true),
  StructField("t11", DoubleType, true),
  StructField("t12", DoubleType, true),
  StructField("t13", DoubleType, true),
  StructField("t14", DoubleType, true),
  StructField("t15", DoubleType, true),
  StructField("t16", DoubleType, true),
  StructField("t17", DoubleType, true),
  StructField("t18", DoubleType, true),
  StructField("t19", DoubleType, true),
  StructField("t20", DoubleType, true),
  StructField("t21", DoubleType, true))
)

Then we read the files for which we defined the schema. It was more convenient to generate two separate files in Scala for data and labels, so here we have to join them into a single DataFrame:

import spark.implicits._
val y = y_tmp.withColumn("y", 'y.cast(IntegerType))

import org.apache.spark.sql.functions._
val x_id = x.withColumn("id", monotonically_increasing_id())
val y_id = y.withColumn("id", monotonically_increasing_id())
val data = x_id.join(y_id, "id")

The next step is required by Spark: we need to vectorize the features:

val featureAssembler = new VectorAssembler()
  .setInputCols(Array("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10", "t11", "t12", "t13", "t14", "t15", "t16", "t17", "t18", "t19", "t20", "t21"))
  .setOutputCol("features")

We split the data into train and test sets randomly, in the proportion of 75% to 25%. We set the seed so that the splits are identical across all training runs (dataWithLabels is the joined DataFrame with an encoded label column, as shown in the full listing later):

val Array(trainingData, testData) = dataWithLabels.randomSplit(Array(0.75, 0.25), 123)

We then define the model. It tells which columns are features and which are labels. It also sets parameters:

val gbt = new GBTClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setMaxIter(10)
  .setSeed(123)

We create a pipeline of steps: vector assembling of the features and running the GBT:

val pipeline = new Pipeline()
  .setStages(Array(featureAssembler, gbt))

Next, we define the evaluator function, that is, how the model knows whether it is doing well or not.
As we have only two classes, and they are imbalanced, accuracy is a bad measurement; the area under the ROC curve is better:

val rocEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("rawPrediction")
  .setMetricName("areaUnderROC")

K-fold cross-validation is used to avoid overfitting; it takes out one-fifth of the data at each iteration, trains the model on the rest, and then tests on this one-fifth:

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(rocEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)
  .setSeed(123)

val cvModel = cv.fit(trainingData)

After we get the trained model (which can take an hour or more, depending on the number of iterations and the parameters we want to iterate over, specified in paramGrid), we then compute the predictions on the test data:

val predictions = cvModel.transform(testData)

In addition, we evaluate the quality of the predictions:

val roc = rocEvaluator.evaluate(predictions)

The trained model is saved for later usage by the prediction service:

val gbtModel = cvModel.bestModel.asInstanceOf[PipelineModel]
gbtModel.save(rootDir + "__cv__gbt_22_binary_classes_" + System.nanoTime() / 1000000 + ".model")

In summary, the code for model training is given as follows:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier, RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.{BinaryClassificationEvaluator, MulticlassClassificationEvaluator}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorAssembler, VectorIndexer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StructField, StructType}
import org.apache.spark.sql.SparkSession

object TrainGradientBoostedTree {
  def main(args: Array[String]): Unit = {
    val maxBins = Seq(5, 7, 9)
    val numFolds = 10
    val maxIter: Seq[Int] = Seq(10)
    val maxDepth: Seq[Int] = Seq(20)
    val rootDir = "output/"
    val spark = SparkSession
      .builder()
      .master("local[*]")
      .config("spark.sql.warehouse.dir", "/home/user/spark/")
      .appName("Bitcoin Preprocessing")
      .getOrCreate()
    val xSchema = StructType(Array(
      StructField("t0", DoubleType, true),
      StructField("t1", DoubleType, true),
      StructField("t2", DoubleType, true),
      StructField("t3", DoubleType, true),
      StructField("t4", DoubleType, true),
      StructField("t5", DoubleType, true),
      StructField("t6", DoubleType, true),
      StructField("t7", DoubleType, true),
      StructField("t8", DoubleType, true),
      StructField("t9", DoubleType, true),
      StructField("t10", DoubleType, true),
      StructField("t11", DoubleType, true),
      StructField("t12", DoubleType, true),
      StructField("t13", DoubleType, true),
      StructField("t14", DoubleType, true),
      StructField("t15", DoubleType, true),
      StructField("t16", DoubleType, true),
      StructField("t17", DoubleType, true),
      StructField("t18", DoubleType, true),
      StructField("t19", DoubleType, true),
      StructField("t20", DoubleType, true),
      StructField("t21", DoubleType, true)))
    val ySchema = StructType(Array(StructField("y", DoubleType, true)))
    val x = spark.read.format("csv").schema(xSchema).load(rootDir + "scala_test_x.csv")
    val y_tmp = spark.read.format("csv").schema(ySchema).load(rootDir + "scala_test_y.csv")
    import spark.implicits._
    val y = y_tmp.withColumn("y", 'y.cast(IntegerType))
    import org.apache.spark.sql.functions._
    // joining the two separate datasets into a single Spark DataFrame
    val x_id = x.withColumn("id", monotonically_increasing_id())
    val y_id = y.withColumn("id", monotonically_increasing_id())
    val data = x_id.join(y_id, "id")
    val featureAssembler = new VectorAssembler()
      .setInputCols(Array("t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9", "t10", "t11", "t12", "t13", "t14", "t15", "t16", "t17", "t18", "t19", "t20", "t21"))
      .setOutputCol("features")
    val encodeLabel = udf[Double, String] {
      case "1" => 1.0
      case "0" => 0.0
    }
    val dataWithLabels = data.withColumn("label", encodeLabel(data("y")))
    // 123 is the seed number, used to get the same data split so we can tune params
    val Array(trainingData, testData) = dataWithLabels.randomSplit(Array(0.75, 0.25), 123)
    val gbt = new GBTClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(10)
      .setSeed(123)
    val pipeline = new Pipeline()
      .setStages(Array(featureAssembler, gbt))
    // ***********************************************************
    println("Preparing K-fold Cross Validation and Grid Search")
    // ***********************************************************
    val paramGrid = new ParamGridBuilder()
      .addGrid(gbt.maxIter, maxIter)
      .addGrid(gbt.maxDepth, maxDepth)
      .addGrid(gbt.maxBins, maxBins)
      .build()
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(numFolds)
      .setSeed(123)
    // ************************************************************
    println("Training model with GradientBoostedTrees algorithm")
    // ************************************************************
    // Train model. This also runs the indexers.
    val cvModel = cv.fit(trainingData)
    cvModel.save(rootDir + "cvGBT_22_binary_classes_" + System.nanoTime() / 1000000 + ".model")
    println("Evaluating model on test data and calculating the areas under the ROC and PR curves")
    // **********************************************************************
    // Make a sample prediction
    val predictions = cvModel.transform(testData)
    // Select (prediction, true label) and compute test error.
    val rocEvaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderROC")
    val roc = rocEvaluator.evaluate(predictions)
    val prEvaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")
      .setRawPredictionCol("rawPrediction")
      .setMetricName("areaUnderPR")
    val pr = prEvaluator.evaluate(predictions)
    val gbtModel = cvModel.bestModel.asInstanceOf[PipelineModel]
    gbtModel.save(rootDir + "__cv__gbt_22_binary_classes_" + System.nanoTime() / 1000000 + ".model")
    println("Area under ROC curve = " + roc)
    println("Area under PR curve = " + pr)
    predictions.show(1)
    spark.stop()
  }
}

Now let us see how the training went:

>>>
Area under ROC curve = 0.6045355104779828
Area under PR curve = 0.3823834607704922

Therefore, we have not achieved very high accuracy: the area under the ROC curve is only about 60.45% for the best GBT model. Nevertheless, if we tune the hyperparameters, we will get better accuracy. We learned how a complete ML pipeline can be implemented, from collecting historical data to transforming it into a format suitable for testing hypotheses. We also performed machine learning model training to carry out predictions.

You read an excerpt from a book written by Md. Rezaul Karim, titled Scala Machine Learning Projects. In this book, you will learn to build powerful machine learning applications for performing advanced numerical computing and functional programming.

How to create and configure an Azure Virtual Machine

Gebin George
25 May 2018
13 min read
Creating virtual machines on Azure gives you on-demand, high-scale, secure, virtualized infrastructure using Windows Server. Virtual machines help you deploy and scale applications easily. In this article, we will learn how to create and run an Azure Virtual Machine. This tutorial is an excerpt from the book Hands-On Networking with Azure, written by Mohamed Waly. This book will help you efficiently monitor, diagnose, and troubleshoot Azure Networking.

Creating an Azure VM is a very straightforward process – all you have to do is follow the given steps:

Navigate to the Azure portal and search for Virtual Machines, as shown in the following screenshot:
Figure 3.1: Searching for Virtual Machines
Once the VM blade is opened, you can click on +Add to create a new VM, as shown in the following screenshot:
Figure 3.2: Virtual Machines blade
Once you have clicked on +Add, a new blade will pop up where you have to search for and select the desired OS for the VM, as shown in the following screenshot:
Figure 3.3: Searching for Windows Server 2016 OS for the VM
Once the OS is selected, you need to select the deployment model, whether that be Resource Manager or Classic, as shown in the following screenshot:
Figure 3.4: Selecting the deployment model
Once the deployment model is selected, a new blade will pop up where you have to specify the following:

Name: Specify the name of the VM.
VM disk type: Specify whether the disk type will be SSD or HDD. Consider that SSD will offer consistent, low-latency performance, but will incur more charges. Note that this option is not available for the Classic model in this blade, but is available in the Configure optional features blade.
User name: Specify the username that will be used to log on to the VM.
Password: Specify the password, which must be between 12 and 123 characters long and must contain three of the following: one lowercase character, one uppercase character, one number, and one special character that is not \ or -.
Subscription: This specifies the subscription that will be charged for the VM usage.
Resource group: This specifies the resource group within which the VM will exist.
Location: Specify the location in which the VM will be created. It is recommended that you select the location nearest to you.
Save money: Here, you specify whether you own Windows Server licenses with active Software Assurance (SA). If you do, Azure Hybrid Benefit is recommended to save compute costs. For more information, you can check the Azure Hybrid Benefit page.

Figure 3.5: Configure the VM basic settings

Once you have clicked on OK, a new blade will pop up where you have to specify the VM size; browse the VM series to select the one that will fulfil your needs, as shown in the following screenshot:

Figure 3.6: Select the VM size

Once the VM size has been specified, you need to specify the following settings:

Availability set: This option provides high availability for the VM by placing the VMs of the same application within an availability set. Here, the VMs will be located in different fault and update domains, granting the VMs high availability (up to 99.95% of the Azure SLA).
Use managed disks: Enable this feature to have Azure automatically manage the availability of disks to provide data redundancy and fault tolerance, without creating and managing storage accounts on your own. This setting is not available in the Classic model.
Virtual network: Specify the virtual network to which you want to assign the VM.
Subnet: Select the subnet within the virtual network that you specified earlier to assign the VM to.
Public IP address: Either select an existing public IP address or create a new one.
Network security group (firewall): Select the NSG you want to assign to the VM NIC. This is called endpoints in the Classic model.
Extensions: You can add more features to the VM using extensions, such as configuration management, antivirus protection, and so on.
Auto-shutdown: Specify whether you want to shut down your VM daily or not; if you do, you can set a schedule. Consider that this option will help you save compute costs, especially in dev and test scenarios. This is not available in the Classic model.
Notification before shutdown: Check this if you enabled Auto-shutdown and want to subscribe to notifications before the VM shuts down. This is not available in the Classic model.
Boot diagnostics: This captures serial console output and screenshots of the VM running on a host to help diagnose startup issues.
Guest OS diagnostics: This obtains metrics for the VM every minute; you can use these metrics to create alerts and stay informed about your applications.
Diagnostics storage account: This is where metrics are written, so you can analyze them with your own tools.

Figure 3.7: Specify more settings for the VM

Enabling Boot diagnostics and Guest OS diagnostics will incur more charges, since the diagnostics need a dedicated storage account to store their data.

Finally, once you are done with its settings, Azure will validate those you have specified and summarize them, as shown in the following screenshot:

Figure 3.8: VM Settings Summary

Once you click on Create, the VM creation process will start, and within minutes the VM will be created. Once the VM is created, you can navigate to the Virtual Machines blade to open the VM that has been created, as shown in the following screenshot:

Figure 3.9: The created VM overview

To connect to the VM, click on Connect; a pre-configured RDP file with the required VM information will be downloaded. Open the RDP file. You will be asked to enter the username and password you specified for the VM during its configuration, as shown in the following screenshot:

Figure 3.10: Entering the VM credentials

Voila! You should now be connected to the VM.

Azure VMs networking

There are many network configurations that can be done for the VM. You can add additional NICs, change the private IP address, set a private or public IP address to be either static or dynamic, and you can change the inbound and outbound security rules.

Adding inbound and outbound rules

Adding inbound and outbound security rules to the VM NIC is a very simple process; all you need to do is follow these steps:

Navigate to the desired VM.
Scroll down to Networking, under SETTINGS, as shown in the following screenshot:
Figure 3.11: VM networking settings
To add inbound and outbound security rules, you have to click on either Add inbound or Add outbound. Once clicked on, a new blade will pop up where you have to specify settings using the following fields:

Service: The service specifies the destination protocol and port range for this rule. Here, you can choose a predefined service, such as RDP or SSH, or provide a custom port range.
Port ranges: Here, you need to specify a single port, a port range, or a comma-separated list of single ports or port ranges.
Priority: Here, you enter the desired priority value.
Name: Specify a name for the rule here.
Description: Write a description for the rule here.

Figure 3.12: Adding an inbound rule

Once you have clicked OK, the rule will be applied. Note that the same process applies when adding an outbound rule.

Adding an additional NIC to the VM

Adding an additional NIC starts from the same blade as adding inbound and outbound rules. To add an additional NIC, you have to follow the given steps:

Before adding an additional NIC to the VM, you need to make sure that the VM is in a Stopped (Deallocated) status.
Navigate to Networking on the desired VM.
Click on Attach network interface, and a new blade will pop up. Here, you have to either create a network interface or select an existing one. If you are selecting an existing interface, simply click on OK and you are done. If you are creating a new interface, click on Create network interface, as shown in the following screenshot:
Figure 3.13: Attaching network interface
A new blade will pop up where you have to specify the following:

Name: The name of the new NIC.
Virtual network: This field will be grayed out, because you cannot attach a VM's NICs to different virtual networks.
Subnet: Select the desired subnet within the virtual network.
Private IP address assignment: Specify whether you want to allocate this IP dynamically or statically.
Network security group: Specify an NSG to be assigned to this NIC.
Private IP address (IPv6): If you want to assign an IPv6 address to this NIC, check this setting.
Subscription: This field will be grayed out, because you cannot have a VM's NIC in a different subscription.
Resource group: Specify the resource group in which the NIC will exist.
Location: This field will be grayed out, because you cannot have VM NICs in different locations.

Figure 3.14: Specify the NIC settings

Once you are done, click Create. Once the network interface is created, you will return to the previous blade. Here, you need to specify the NIC you just created and click on OK, as shown in the following screenshot:

Figure 3.15: Attaching the NIC

Configuring the NICs

The Network Interface Cards (NICs) include some configuration options that you might be interested in. They are as follows:

To navigate to the desired NIC, you can search for the network interfaces blade, as shown in the following screenshot:
Figure 3.16: Searching for network interfaces blade
Then, the blade will pop up, from which you can select the desired NIC, as shown in the following screenshot:
Figure 3.17: Select the desired NIC
You can also navigate back to the VM, open Networking, and then click on the desired NIC, as shown in the following screenshot:
Figure 3.18: The VM NIC

To configure the NIC, you need to follow the given steps:

Once the NIC blade is opened, navigate to IP configurations, as shown in the following screenshot:
Figure 3.19: NIC blade overview
To enable IP forwarding, click on Enabled and then click Save. Enabling this feature will cause the NIC to receive traffic that is not destined for its own IP address; traffic will be sent with a different source IP.
To add another IP to the NIC, click on Add, and a new blade will pop up, for which you have to specify the following:

Name: The name of the IP.
Type: This field will be grayed out because a primary IP already exists. Therefore, this one will be secondary.
Allocation: Specify whether the allocation method is static or dynamic.
IP address: Enter the static IP address that belongs to the same subnet that the NIC belongs to.
If you have selected dynamic allocation, you cannot enter the IP address statically.

Public IP address: Specify whether or not you need a public IP address for this IP configuration. If you do, you will be asked to configure the required settings.

Figure 3.20: Configure the IP configuration settings

Click on Configure required settings for the public IP address, and a new blade will pop up from which you can select an existing public IP address or create a new one, as shown in the following screenshot:

Figure 3.21: Create a new public IP address

Click on OK and you will return to the blade, as shown in Figure 3.20, with the following warning:

Figure 3.22: Warning for adding a new IP address

In this case, you need to plan the addition of the new IP address to ensure that the VM is not restarted during working hours.

Azure VNets considerations for Azure VMs

Building VMs in Azure is a common task, but to do this task well, and to make the VMs operate properly, you need to understand the considerations of Azure VNets for Azure VMs. These considerations are as follows:

Azure VNets enable you to bring your own IPv4/IPv6 addresses and assign them to Azure VMs, statically or dynamically.
You do not have access to the role that acts as DHCP or provides IP addresses; you can only control the ranges you want to use, in the form of address ranges and subnets.
Installing a DHCP role on one of the Azure VMs is currently unsupported; this is because Azure does not use traditional Layer-2 or Layer-3 topology, and instead uses Layer-3 traffic with tunneling to emulate a Layer-2 LAN.
Private IP addresses can be used for internal communication; external communication can be done via public IP addresses.
You can assign multiple private and public IP addresses to a single VM.
You can assign multiple NICs to a single VM.
By default, all the VMs within the same virtual network can communicate with each other, unless otherwise specified by an NSG on a subnet within this virtual network.
The network security group (NSG) can sometimes cause an overhead; without this overhead, all VMs within the same subnet would communicate with each other.
By default, an inbound security rule is created for remote desktop for Windows-based VMs, and SSH for Linux-based VMs.
The inbound security rules are first applied on the NSG of the subnet and then on the VM NIC NSG – for example, if the subnet's NSG allows HTTP traffic, it will pass through it; however, it may not reach its destination if the VM NIC NSG does not allow it.
The outbound security rules are applied on the VM NIC NSG first, and then on the subnet NSG.
Multiple NICs assigned to a VM can exist in different subnets.
Azure VMs with multiple NICs in the same availability set do not have to have the same number of NICs, but the VMs must have at least two NICs.
When you attach an NIC to a VM, you need to ensure that they exist in the same location and subscription.
The NIC and the VNet must exist in the same subscription and location.
The NIC's MAC address cannot be changed until the VM to which the NIC is assigned is deleted.
Once the VM is created, you cannot change the VNet to which it is assigned; however, you can change the subnet to which the VM is assigned.
You cannot attach an existing NIC to a VM during its creation, but you can add an existing NIC as an additional NIC.
By default, a dynamic public IP address is assigned to the VM during creation, but this address will change if the VM is stopped or deleted; to ensure it will not change, you need to make its IP address static.
In a multi-NIC VM, the NSG that is applied to one NIC does not affect the others.

If you found this post useful, do check out the book Hands-On Networking with Azure, to design and implement Azure Networking for Azure VMs.

Read More
Introducing Azure Sphere – A secure way of running your Internet of Things devices
Learn Azure Serverless computing for free: Download e-book

React Native VS Xamarin: Which is the better cross-platform mobile development framework?

Guest Contributor
25 May 2019
10 min read
One of the most debated topics in the current mobile industry is the battle of the two giant app development platforms, Xamarin and React Native. Central to the buzzing hype of this battle and the increasing popularity of these two platforms are the communities of app developers built around them. Both of these open-source app development platforms are preferred by the app development community to create highly efficient applications while saving the time and effort of app developers. Both React Native and Xamarin come with their own merits and demerits, which makes selecting the best between the two a bit difficult.

When it comes to selecting the appropriate mobile application platform, it boils down to the nature, needs, and overall objectives of the business and company. It also comes down to the features and characteristics of that technology, which either make it the best fit for one project or the worst approach for another. With that being said, let's start comparing the two to find out the major differences, explore the key considerations, and determine the winner of this unending platform battle.

An overview of Xamarin

An open-source cross-platform framework for mobile application development, Xamarin can be used to build applications for Android, iOS, and wearable devices. Offered as a high-tech enterprise app development tool within the Microsoft Visual Studio IDE, Xamarin has now become one of the top mobile app development platforms used by various businesses and enterprises. Apart from being a free app development platform, it facilitates the development of mobile applications while using a single programming language, namely C#, for both the Android and iOS versions.

Key features

Since the day of its introduction, Xamarin has been using C#. C# is a popular programming language in the Microsoft community, and with great features like metaprogramming, functional programming, and portability, C# is widely preferred by many web developers. Xamarin makes it easy for C# developers to shift from web development to cross-platform mobile app development. Features like portable class libraries, code-sharing features, testing clouds and insights, and compatibility with the Mac IDE and Visual Studio IDE make Xamarin a great development tool with no additional costs.

Development environment

Xamarin provides app developers with a comprehensive app development toolkit and software package. The package includes highly compatible IDEs (for both Mac and VS), distribution and analytics tools such as HockeyApp, and testing tools such as Xamarin Test Cloud. With Xamarin, developers no longer have to invest their time and money in incorporating third-party tools. It uses the Mono execution environment for both platforms, that is, Android and iOS.

Framework

C# has matured from its infancy, and the Xamarin framework now provides strong type safety, which prevents unexpected code behavior. Since C# supports the .NET framework, the language can be used with numerous .NET features like async, LINQ, and lambdas.

Compilation

Xamarin.iOS and Xamarin.Android are the two major products offered by this platform. In the case of iOS, code compilation follows the Ahead-of-Time (AOT) approach, whereas on Android, a Just-in-Time (JIT) compilation approach is followed. However, the compilation process is fully automated and is equipped with features to tackle and resolve issues like memory allocation and garbage collection.
App working principles

Xamarin has an MVVM architecture coupled with two-way data binding, which provides great support for collaborative work among different departments. If your development approach doesn't follow a strict performance-oriented approach, then go for Xamarin, as it provides high process flexibility.

How exactly does it work? Not only does C# form the basis of this platform, but it also provides developers with access to native APIs. This feature of Xamarin enables it to create universal backend code that can be used with any UI based on the native SDKs.

An overview of React Native

With Facebook being the creator of this platform, React Native is one of the most widely used programming platforms. From enabling mobile developers to build highly efficient apps to ensuring great quality and increased sustainability, the demand for React Native apps is sure to increase over time.

Key features

React Native apps are written in JavaScript for both Android and iOS; under the hood, platform-specific native modules use Java on Android and Objective-C on iOS. The platform provides numerous built-in tools, libraries, and frameworks. Its standout feature of hot reloading enables developers to make amendments to the code without spending much time on the code compilation process.

Development environment

The React Native app development platform requires developers to follow a wide array of actions and processes to build a UI. The platform supports easy and faster iterations while enabling execution of different code even when the application is running. Since React Native doesn't provide support for 64-bit, it can impact the runtime and speed of code on iOS.

Architecture

The React Native app development platform supports a modular architecture. This means that developers can categorize the code into different functional and independent blocks of code. This characteristic of the React Native platform, therefore, provides process flexibility and ease of upgrades and application updates.

Compilation

The React Native app development platform follows and supports Just-in-Time compilation for Android applications. In the case of iOS applications, however, Just-in-Time compilation is not available, as it might slow down the code execution procedure.

App working principles

This platform follows a one-way data binding approach, which helps in boosting the overall performance of the application. However, through manual implementation, a two-way data binding approach can be implemented, which is useful for introducing code coherence and reducing complex errors.

How does it actually work? React Native enables developers to build applications using React and JavaScript. The working of a React Native application can be described as thread-based interaction. One thread handles the UI and user gestures, while the other is React Native-specific and deals with the application's business logic. It also determines the structure and functionality of the overall user interface. The interaction could be asynchronous, batched, or serializable.

Learning curves of Xamarin and React Native

To master Xamarin, one has to be skilled in .NET. Xamarin provides you with easy and complete access to platform SDK capabilities through the Xamarin.iOS and Xamarin.Android libraries. Xamarin provides a complete package, which reduces the need for integrating third-party tools, so to become a professional in Xamarin app development, all you need is skills and expertise in C#, .NET, and some basic working knowledge of the native classes.
On the other hand, mastering React Native requires thorough knowledge and expertise of JavaScript. Since the platform doesn't offer well-integrated libraries and tools, knowledge and expertise of third-party sources and tools are of core importance.

Key differences between Xamarin and React Native

While Trello, Slack, and GitHub use Xamarin, other successful companies like Facebook, Walmart, and Instagram have React Native-based mobile applications. While a React Native application offers better performance, not every company can afford to develop an app for each platform. Cross-platform frameworks like Xamarin are the best alternative to React Native apps, as they offer higher development flexibility. Where Xamarin offers multiple-platform support, cost-effectiveness, and time-saving, React Native allows faster development and increased efficiency.

Since Xamarin provides complete hardware support, issues of hardware compatibility are reduced. React Native, on the other hand, provides you with ready-made components, which reduce the need for writing the entire code from scratch. In React Native, with integration of and investment in third-party libraries and plugins, the need for WebView functions is eliminated, which in turn reduces the memory requirements. Xamarin, on the other hand, provides you with a comprehensive toolkit with zero investment in additional plugins and third-party sources. However, this cross-platform framework offers restricted access to open-source technologies.

A good-quality React Native application requires more than a few weeks to develop, which increases not only the development time but also the app complexity. If time consumption is one of the drawbacks of React Native apps, then the additional optimization needed to support larger applications counts as a limitation for Xamarin. While frequent updates contribute to shrinkage of the customer base of React Native apps, stability complaints and app crashes are some common issues with Xamarin applications.

When to go for Xamarin?

Case #1: The foremost advantage of Xamarin is that all you need is command over C# and .NET.

Case #2: One of the most exciting trends currently in the mobile development industry is the Internet of Things. Considering the rapid increase in the need for and demand of IoT, if you are developing a product that involves multiple hardware capacities and user devices, then make developing with Xamarin your number one priority. Xamarin is fully compatible with numerous IoT devices, which eliminates the need for a third-party source for functionality implementation.

Case #3: If you are budget-constrained and time-bound, then Xamarin is the solution to all your app development worries. Since the backend code for both Android and iOS is similar, it reduces the development time and effort and is budget-friendly.

Case #4: The revolutionary and integral Test Cloud is probably the best part about Xamarin. Even though Test Cloud might take up a fraction of your budget, this expense is worth investing in. The Test Cloud not only recreates the activity of actual users, but it also ensures that your application works well on various devices and is accessible to maximum users.

When to go for React Native?

Case #1: When it comes to game app development, Xamarin is not a wise choice. Since it supports the C# framework and AOT compilation, getting speedy results and rendering is difficult with Xamarin.
A gaming application is updated dynamically, is highly interactive, and has high-performance graphics; the drawback of zero compatibility with heavy graphics makes Xamarin a poor choice for game app development. For these very reasons, many developers go for React Native when it comes to developing high-performing gaming applications.

Case #2: The size of the application is an indirect indicator of the success of an application among targeted users. Since many smartphone users have their own photos and videos stuffed in their phone's memory, there is barely any memory and storage left for an additional application. Xamarin-based apps are relatively heavier and occupy more space than their React Native counterparts.

Wondering which framework to choose?

Xamarin and React Native are the two major players in the mobile app development industry. So, it's entirely up to you whether you want to proceed with React Native or Xamarin. However, your decision should be based on the type of application, its requirements, and the development cost. If you want a faster development process, go for Xamarin, and if you are developing a game, e-commerce, or social site, go for React Native.

Author Bio

Khalid Durrani is an Inbound Marketing Expert and a content strategist. He likes to cover topics related to design, the latest tech, startups, IoT, artificial intelligence, big data, AR/VR, UI/UX, and much more. Currently, he is the global marketing manager of LogoVerge, an AI-based design agency.

The Ionic team announces the release of Ionic React Beta
React Native 0.59 RC0 is now out with React Hooks, and more
Changes made to React Native Community's GitHub organization in 2018 for driving better collaboration

Installing Arch Linux using the official ISO

Packt
19 Feb 2013
7 min read
(For more resources related to this topic, see here.)

Getting ready

You can get the official ISO image file from https://www.archlinux.org/download/. On this page you will find a download link to the latest release. Depending on your preference, download the torrent file or the ISO image file immediately. The following list describes the main tasks that we will perform in this recipe:

Preparing, booting, and setting keyboard layout: We are going to get the ISO file from the download page of the Arch Linux website and store it on the preferred media of our choice. At the time of writing this article, there is a dual ISO image file that contains both the i686 and x86-64 architectures on one disk. Start your PC with your preferred installation media (CD or USB stick). On most PC systems, you can access the boot menu by pressing one of the function keys, usually between F8 and F12, depending on the motherboard manufacturer. On older machines that do not yet have a boot menu, you might need to change the boot order in the BIOS, where the CD-ROM (or DVD/Blu-ray) has to be chosen as the first device to try booting from. We'll also explain how to use a different keyboard layout than the default one in this recipe.

Creating, formatting, and mounting partitions: You can partition the disks the way you want using cfdisk (for MBR disk partitioning) or cgdisk (for GUID disk partitioning). After creating the partitions, we can choose to format our created partitions with specific filesystems. When all partitions are formatted, we need to mount the partitions. First we will mount the root partition to /mnt. The other partitions will be mounted later on, after you have created the specific folders. We'll designate our device with /dev/sdX; in your case this can be /dev/sda, and so on.

Connecting to the Internet: To be able to continue installing the ISO, you need to connect to the Internet, because there are no packages available for installation on the ISO. For a wireless network you will need to use netcfg. When connected to a wired network, just use dhcpcd or dhclient.

Installing the base system and boot loader: These days the base system gets installed by running a simple script, pacstrap. Pacstrap takes multiple parameters: the target location and the packages or groups you want to install. For people who want to develop on their machines, the best base install is adding base-devel to the default installation. For normal end users, just base will be sufficient to start.

Configuring the system: In this recipe, we'll describe the flow of what to do during the configuration.

How to do it...

The following steps will guide you in preparing, booting, and setting the keyboard layout:

Once you have downloaded the ISO image file, you should also verify its integrity by downloading the sha1sums.txt file from the download page. These days you can also check if the ISO is completely valid by verifying the signature of the ISO. Verify the integrity by issuing the sha1sum -c sha1sums.txt command and you'll see whether your download was successful or not. Also check if the signature of the ISO is correct by running gpg -v archlinux-...iso.sig:

sha1sum -c sha1sums.txt
gpg -v archlinux-2012-08-04-dual.iso.sig

The following screenshot shows the execution of this step:

As you can see in the previous screenshot, the ISO's checksum is OK and the signature is valid. Now that we are sure our ISO is OK, we can burn this to a CD with our favorite burning program.
Insert the CD into the drive, or insert the USB stick into the USB port of your PC. Enter the boot menu, or let your computer automatically boot from the inserted installation media. If the previous steps are performed correctly, you will see the following screen:

Select the architecture you want and press Enter, and we'll be on our way. Find the desired keyboard layout for your region. The available keyboard layouts can be found at /usr/share/kbd/keymaps/. Set the desired keyboard layout with loadkeys keyboardlayout.

Now let's perform the following steps to create, format, and mount partitions:

Start cfdisk or cgdisk, passing the device you want to partition as the first parameter:

cfdisk /dev/sdX
cgdisk /dev/sdX

Create your partition scheme.
Store the partition scheme.
Use the mkfs command to create a filesystem on a specific partition:

mkfs -t vfat /dev/sdX
mkfs.ext4 -L root /dev/sdX

Mount your root partition to /mnt:

mount /dev/sdX3 /mnt

Make directories under the mount point for your other partitions:

mkdir -p /mnt/boot

Mount the other partitions:

mount /dev/sdX1 /mnt/boot

The following steps are needed to connect to the Internet:

When we need a wireless network, create a netcfg profile and run netcfg mywireless.
Use dhclient or dhcpcd to get an IP address.

The following steps should be performed for installing the base system and boot loader:

Run pacstrap with the desired parameters:

pacstrap /mnt base base-devel

Install the desired boot loader: the best choice at this moment is Syslinux. The final installation of the boot loader will be done in a chroot during the initial configuration.

We'll now list the steps to perform during the configuration:

Generate fstab with genfstab:

genfstab -p /mnt >> /mnt/etc/fstab

Change the root into the system location:

arch-chroot /mnt

Set your hostname in /etc/hostname.
Create the /etc/localtime symlink.
Set your locale in /etc/locale.conf.
Uncomment the configured locale in /etc/locale.gen.
Run locale-gen.
Configure /etc/mkinitcpio.conf.
Generate your initial ramdisk:

mkinitcpio -p linux

Finish the installation of your boot loader.
Set the root password with passwd.
Leave the chroot environment (exit).

How it works...

We downloaded the ISO image file via torrent, or via HTTP from the mirror sites listed on the download page. The sha1sum command lets us verify the integrity of the downloaded ISO. On top of the checksum, we can also check the integrity by verifying the signature available for the ISO. So now, we can rest assured that the downloaded file is the real one. The ISO contains a fully working operating system. It also contains all the necessary tools to perform system recovery and installation.

The keyboard configuration set with loadkeys will make sure that the key you press on your keyboard will be translated to the correct letter on your screen. Using a different keyboard layout from the one on your physical keyboard might be confusing.

We then created a partition scheme on the selected disk with the appropriate tool (cfdisk or cgdisk). Make Filesystem (mkfs) is a unified frontend for creating a filesystem. Using it, we created our filesystem layout manually under /mnt by creating our default partition layout in our root, and mounting the specific partitions accordingly.

You can make a connection with your wireless network (if needed), and then use dhcpcd or dhclient to obtain an IP address that enables you to access the Internet.

Pacstrap will run pacman with a modified root location to install the desired packages into the newly created system.
For example, installing Syslinux:

pacstrap /mnt syslinux

The specific configuration files will ensure we don't have to do all those steps over and over again on every boot.

Summary

This article explained the procedure to get Arch Linux installed on your system using the official installation media.

Resources for Article:

Further resources on this subject:
Compression Formats in Linux Shell Script [Article]
Making a Complete yet Small Linux Distribution [Article]
Linux Shell Script: Tips and Tricks [Article]

Cleaning Data in PDF Files

Packt
25 May 2015
15 min read
In this article by Megan Squire, author of the book Clean Data, we will experiment with several data decanters to extract all the good stuff hidden inside inscrutable PDF files. We will explore the following topics:

What PDF files are for and why it is difficult to extract data from them
How to copy and paste from PDF files, and what to do when this does not work
How to shrink a PDF file by saving only the pages that we need
How to extract text and numbers from a PDF file using the tools inside a Python package called pdfMiner
How to extract tabular data from within a PDF file using a browser-based Java application called Tabula
How to use the full, paid version of Adobe Acrobat to extract a table of data

(For more resources related to this topic, see here.)

Why is cleaning PDF files difficult?

Files saved in Portable Document Format (PDF) are a little more complicated than some of the text files. PDF is a binary format that was invented by Adobe Systems, which later evolved into an open standard so that multiple applications could create PDF versions of their documents. The purpose of a PDF file is to provide a way of viewing the text and graphics in a document independent of the software that did the original layout.

In the early 1990s, the heyday of desktop publishing, each graphic design software package had a different proprietary format for its files, and the packages were quite expensive. In those days, in order to view a document created in Word, Pagemaker, or Quark, you would have to open the document using the same software that had created it. This was especially problematic in the early days of the Web, since there were not many available techniques in HTML to create sophisticated layouts, but people still wanted to share files with each other. PDF was meant to be a vendor-neutral layout format. Adobe made its Acrobat Reader software free for anyone to download, and subsequently the PDF format became widely used.

Here is a fun fact about the early days of Acrobat Reader. The words click here, when entered into the Google search engine, still bring up Adobe's Acrobat PDF Reader download website as the first result, and have done so for years. This is because so many websites distribute PDF files along with a message saying something like, "To view this file you must have Acrobat Reader installed. Click here to download it." Since Google's search algorithm uses the link text to learn what sites go with what keywords, the keyword click here is now associated with Adobe Acrobat's download site.

PDF is still used to make vendor- and application-neutral versions of files that have layouts that are more complicated than what could be achieved with plain text. For example, viewing the same document in the various versions of Microsoft Word still sometimes causes documents with lots of embedded tables, styles, images, forms, and fonts to look different from one another. This can be due to a number of factors, such as differences in operating systems or versions of the installed Word software itself. Even with applications that are intended to be compatible between software packages or versions, subtle differences can result in incompatibilities. PDF was created to solve some of this.

Right away we can tell that PDF is going to be more difficult to deal with than a text file, because it is a binary format, and because it has embedded fonts, images, and so on.
So most of the tools in our trusty data-cleaning toolbox, such as text editors and command-line tools (less), are largely useless with PDF files. Fortunately, there are still a few tricks we can use to get the data out of a PDF file.

Try simple solutions first – copying

Suppose that on your way to decant your bottle of fine red wine, you spill the bottle on the floor. Your first thought might be that this is a complete disaster and you will have to replace the whole carpet. But before you start ripping out the entire floor, it is probably worth trying to clean the mess with an old bartender's trick: club soda and a damp cloth. In this section, we outline a few things to try first, before getting involved in an expensive file renovation project. They might not work, but they are worth a try.

Our experimental file

Let's practice cleaning PDF data by using a real PDF file. We do not want this experiment to be too easy, so let's choose a very complicated file. Suppose we are interested in pulling the data out of a file we found on the Pew Research Center's website called "Is College Worth It?". Published in 2011, this PDF file is 159 pages long and contains numerous data tables showing various ways of measuring whether attaining a college education in the United States is worth the investment. We would like to find a way to quickly extract the data within these numerous tables so that we can run some additional statistics on it. For example, here is what one of the tables in the report looks like:

This table is fairly complicated. It only has six columns and eight rows, but several of the rows take up two lines, and the header row text is only shown on five of the columns. The complete report can be found on the Pew Research website at http://www.pewsocialtrends.org/2011/05/15/is-college-worth-it/, and the particular file we are using is labeled Complete Report: http://www.pewsocialtrends.org/files/2011/05/higher-ed-report.pdf.

Step one – try copying out the data we want

The data we will experiment on in this example is found on page 149 of the PDF file (labeled page 143 in their document). If we open the file in a PDF viewer, such as Preview on Mac OSX, and attempt to select just the data in the table, we already see that some strange things are happening. For example, even though we did not mean to select the page number (143), it got selected anyway. This does not bode well for our experiment, but let's continue. Copy the data out by using Command-C or by selecting Edit | Copy.

How text looks when selected in this PDF from within Preview

Step two – try pasting the copied data into a text editor

The following screenshot shows how the copied text looks when it is pasted into Text Wrangler, our text editor:

Clearly, this data is not in any sensible order after copying and pasting it. The page number is included, the numbers are horizontal instead of vertical, and the column headers are out of order. Even some of the numbers have been combined; for example, the final row contains the numbers 4, 4, 3, 2, but in the pasted version this becomes the single number 4432. It would probably take longer to clean up this data manually at this point than it would have taken just to retype the original table. We can conclude that with this particular PDF file, we are going to have to take stronger measures to clean it.
Step three – make a smaller version of the file

Our copying and pasting procedures have not worked, so we have resigned ourselves to the fact that we are going to need to prepare for more invasive measures. Perhaps, if we are not interested in extracting data from all 159 pages of this PDF file, we can identify just the area of the PDF that we want to operate on and save that section to a separate file.

To do this in Preview on Mac OSX, launch the File | Print… dialog box. In the Pages area, enter the range of pages we actually want to copy. For the purpose of this experiment, we are only interested in page 149, so enter 149 in both the From: and to: boxes as shown in the following screenshot. Then, from the PDF dropdown box at the bottom, select Open PDF in Preview. You will see your single-page PDF in a new window. From here, we can save this as a new file and give it a new name, such as report149.pdf or the like.

Another technique to try – pdfMiner

Now that we have a smaller file to experiment with, let's try some programmatic solutions to extract the text and see if we fare any better. pdfMiner is a Python package with two embedded tools to operate on PDF files. We are particularly interested in experimenting with one of these tools, a command-line program called pdf2txt that is designed to extract text from within a PDF document. Maybe this will be able to help us get those tables of numbers out of the file correctly.

Step one – install pdfMiner

Launch the Canopy Python environment. From the Canopy Terminal Window, run the following command:

pip install pdfminer

This will install the entire pdfMiner package and all its associated command-line tools. The documentation for pdfMiner and the two tools that come with it, pdf2txt and dumpPDF, is located at http://www.unixuser.org/~euske/python/pdfminer/.

Step two – pull text from the PDF file

We can extract all text from a PDF file using the command-line tool called pdf2txt.py. To do this, use the Canopy Terminal and navigate to the directory where the file is located. The basic format of the command is pdf2txt.py <filename>. If you have a larger file that has multiple pages (or you did not already break the PDF into smaller ones), you can also run pdf2txt.py -p149 <filename> to specify that you only want page 149.

Just as with the preceding copy-and-paste experiment, we will try this technique not only on the tables located on page 149, but also on the Preface on page 3. To extract just the text from page 3, we run the following command:

pdf2txt.py -p3 pewReport.pdf

After running this command, the extracted preface of the Pew Research report appears in our command-line window:

To save this text to a file called pewPreface.txt, we can simply add a redirect to our command line as follows:

pdf2txt.py -p3 pewReport.pdf > pewPreface.txt

But what about those troublesome data tables located on page 149? What happens when we use pdf2txt on those? We can run the following command:

pdf2txt.py pewReport149.pdf

The results are slightly better than copy and paste, but not by much. The actual data output section is shown in the following screenshot. The column headers and data are mixed together, and the data from different columns is shown out of order. We will have to declare the tabular-data extraction portion of this experiment a failure, though pdfMiner worked reasonably well on line-by-line, text-only extraction. Remember that your success with each of these tools may vary.
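Incidentally, if you would rather call pdfMiner from your own Python code than shell out to pdf2txt.py, the same machinery is available programmatically. The following is a minimal sketch rather than anything from the original article: it assumes a reasonably recent pdfminer (or its Python 3 fork, pdfminer.six), and the helper function name is our own invention.

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage

# Hypothetical helper: return the text of one (1-indexed) page of a PDF.
def extract_page_text(path, page_number):
    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)
    with open(path, 'rb') as fh:
        # pagenos is zero-indexed, so page 3 of the report is index 2
        for page in PDFPage.get_pages(fh, pagenos=[page_number - 1]):
            interpreter.process_page(page)
    converter.close()
    return output.getvalue()

# Roughly equivalent to: pdf2txt.py -p3 pewReport.pdf
print(extract_page_text('pewReport.pdf', 3))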
Much of it depends on the particular characteristics of the original PDF file. It looks like we chose a very tricky PDF for this example, but let's not get disheartened. Instead, we will move on to another tool and see how we fare with it.

Third choice – Tabula

Tabula is a Java-based program for extracting data within tables in PDF files. We will download the Tabula software and put it to work on the tricky tables in our page 149 file.

Step one – download Tabula

Tabula is available to be downloaded from its website at http://tabula.technology/. The site includes some simple download instructions. On Mac OSX version 10.10.1, I had to download the legacy Java 6 application before I was able to run Tabula. The process was straightforward and required only following the on-screen instructions.

Step two – run Tabula

Launch Tabula from inside the downloaded .zip archive. On the Mac, the Tabula application file is called simply Tabula.app. You can copy this to your Applications folder if you like. When Tabula starts, it launches a tab or window within your default web browser at the address http://127.0.0.1:8080/. The initial action portion of the screen looks like this:

The warning that auto-detecting tables takes a long time is true. For the single-page pewReport149.pdf file, with three tables in it, table auto-detection took two full minutes and resulted in an error message about an incorrectly formatted PDF file.

Step three – direct Tabula to extract the data

Once Tabula reads in the file, it is time to show it where the tables are. Using your mouse cursor, select the table you are interested in. I drew a box around the entire first table. Tabula took about 30 seconds to read in the table, and the results are shown as follows:

Compared to the way the data was read with copy and paste and pdf2txt, this data looks great. But if you are not happy with the way Tabula read in the table, you can repeat this process by clearing your selection and redrawing the rectangle.

Step four – copy the data out

We can use the Download Data button within Tabula to save the data to a friendlier file format, such as CSV or TSV.

Step five – more cleaning

Open the CSV file in Excel or a text editor and take a look at it. At this stage, we have had a lot of failures in getting this PDF data extracted, so it is very tempting to just quit now, but only two simple data cleaning tasks remain (a scripted version of this cleanup appears at the end of this section):

First, combine all the two-line text cells into a single cell. For example, in column B, many of the phrases take up more than one row. Prepare students to be productive and members of the workforce should be in one cell as a single phrase. The same is true for the headers in rows 1 and 2 (4-year and Private should be in a single cell). To clean this in Excel, create a new column between columns B and C. Use the concatenate() function to join B3:B4, B5:B6, and so on. Use Paste Special to add the new concatenated values into a new column. Then remove the two columns you no longer need. Do the same for rows 1 and 2.

Second, remove the blank lines between rows.

When these procedures are finished, the data looks like this:

Tabula might seem like a lot of work compared to cutting and pasting data or running a simple command-line tool. That is true, unless your PDF file turns out to be finicky like this one was. Remember that specialty tools are there for a reason, but do not use them unless you really need them. Start with a simple solution first and only move on to a more difficult tool when you really need it.
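As promised, here is a scripted version of the step five cleanup for readers who prefer Python to Excel. This is a rough sketch under loud assumptions: the input filename is hypothetical, and it assumes that, once blank lines are removed, every label and header in the exported CSV is split across exactly two consecutive rows, which matches this particular table but may well not match yours.

import csv

# Hypothetical name for the file exported from Tabula.
with open('tabula-pewReport149.csv', newline='') as fh:
    rows = list(csv.reader(fh))

# Cleanup task two first: remove the blank lines between rows.
rows = [row for row in rows if any(cell.strip() for cell in row)]

# Cleanup task one: merge each pair of rows, joining the two halves of a
# split cell with a space. Note that zip() drops a trailing unpaired row.
merged = []
for upper, lower in zip(rows[0::2], rows[1::2]):
    merged.append([' '.join(part for part in (a.strip(), b.strip()) if part)
                   for a, b in zip(upper, lower)])

with open('pewReport149-clean.csv', 'w', newline='') as fh:
    csv.writer(fh).writerows(merged)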
When all else fails – fourth technique

Adobe Systems sells a paid, commercial version of its Acrobat software that has some additional features above and beyond just allowing you to read PDF files. With the full version of Acrobat, you can create complex PDF files and manipulate existing files in various ways. The feature that is relevant here is the Export Selection As… option found within Acrobat.

To get started using this feature, launch Acrobat and use the File | Open dialog to open the PDF file. Within the file, navigate to the table holding the data you want to export. The following screenshot shows how to select the data from the page 149 PDF we have been operating on. Use your mouse to select the data, then right-click and choose Export Selection As…

At this point, Acrobat will ask you how you want the data exported. CSV is one of the choices. Excel Workbook (.xlsx) would also be a fine choice if you are sure you will not also want to edit the file in a text editor. Since I know that Excel can also open CSV files, I decided to save my file in that format so I would have the most flexibility between editing in Excel and my text editor. After choosing the format for the file, we will be prompted for a filename and a location to save the file to. When we launch the resulting file, either in a text editor or in Excel, we can see that it looks a lot like the Tabula version we saw in the previous section. Here is how our CSV file looks when opened in Excel:

At this point, we can use the exact same cleaning routine we used with the Tabula data, where we concatenated the two-line cells into single cells and then removed the empty rows.

Summary

The goal of this article was to learn how to export data out of a PDF file. Like sediment in a fine wine, the data in PDF files can appear at first to be very difficult to separate. Unlike decanting wine, however, which is a very passive process, separating PDF data took a lot of trial and error. We learned four ways of working with PDF files to clean data: copying and pasting, pdfMiner, Tabula, and Acrobat export. Each of these tools has certain strengths and weaknesses:

Copying and pasting costs nothing and takes very little work, but it is not as effective with complicated tables.
pdfMiner/pdf2txt is also free, and as a command-line tool it can be automated. It also works on large amounts of data. But, like copying and pasting, it is easily confused by certain types of tables.
Tabula takes some work to set up, and since it is a product still under development, it does occasionally give strange warnings. It is also a little slower than the other options. However, its output is very clean, even with complicated tables.
Acrobat gives output similar to Tabula's, but with almost no setup and very little effort. It is a paid product.

By the end, we had a clean dataset that was ready for analysis or long-term storage.

Resources for Article:

Further resources on this subject:
Machine Learning Using Spark MLlib [article]
Data visualization [article]
First steps with R [article]


Coding with Minecraft

Packt
17 Sep 2013
7 min read
(For more resources related to this topic, see here.)

Getting ready

Before you begin, you will need a running copy of Minecraft: Pi Edition. Start a game in a new or existing world, and wait for the game world to load.

How to do it...

Follow these steps to connect to the running Minecraft game:

1. Open a fresh terminal by double-clicking on the LXTerminal icon on the desktop.
2. Type cd ~/mcpi/api/python/mcpi into the terminal.
3. Type python to begin the Python interpreter.
4. Enter the following Python code:

import minecraft
mc = minecraft.Minecraft.create()
mc.postToChat("Hello, world!")

In the Minecraft window, you should see a message appear!

How it works...

First, we used the cd command, which we have seen previously, to move to the location where the Python API is located. The application programming interface (API) consists of code provided by the Minecraft developers that handles some of the more basic functionality you might need. We used the ~ character as a shortcut for your home directory (/home/pi). Typing cd /home/pi/mcpi/api/python/mcpi would have exactly the same effect, but requires more typing.

We then start the Python interpreter. An interpreter is a program that executes code line by line as it is being typed. This allows us to get instant feedback on the code we are writing. You may like to explore the IDLE interpreter by typing idle into the terminal instead of python. IDLE is more advanced: it is able to color your code based on its meaning (so you can spot errors more easily), and it can graphically suggest functions available for use.

Then we started writing real Python code. The first line, import minecraft, gives us access to the necessary parts of the API by loading the minecraft module. There are several Python code files inside the directory we moved to, each containing a different code module; one of these files is called minecraft.py. The main module we want access to is called minecraft.

We then create a connection to the game using mc = minecraft.Minecraft.create(). mc is the name we have given to the connection, which allows us to use the same connection in any future code. minecraft. tells Python to look in the minecraft module. Minecraft is the name of a class in the minecraft module that groups together related data and functions. create() is a function of the Minecraft class that creates a connection to the game.

Finally, we use the connection we have created, and its postToChat method, to display a message in Minecraft.

The way that our code interacts with the game is completely hidden from us to increase flexibility: we can use almost exactly the same code to interact with any game of Minecraft: Pi Edition, and it is possible to use many different programming languages. If the developers want to change the way the communication works, they can do so, and it won't affect any of the code we have written. Behind the scenes, some text describing our command is sent across a network connection to the game, where the command is interpreted and performed. By default, the connection is to the very Raspberry Pi that we are running the code on, but it is also possible to send these commands over the Internet from any computer to any network-connected Raspberry Pi running Minecraft. A description of all of these text-based messages can be found in ~/mcpi/api/spec: the message sent to the game when we wrote mc.postToChat("Hello, world!") was chat.post("Hello, world!").
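As a small illustration of that last point, the create() function also takes the address of the machine that is running the game, so the same three lines can drive a game elsewhere on the network. Here is a quick sketch (the IP address is made up; 4711 is the default port used by Minecraft: Pi Edition):

import minecraft

# Connect to a game running on another Raspberry Pi on the local
# network; replace the hypothetical address with your Pi's own IP.
mc = minecraft.Minecraft.create(address="192.168.1.20", port=4711)
mc.postToChat("Hello from across the network!")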
As well as Python, a Java API is included that is capable of all the same tasks, and the community has created versions of the API in several other languages.

There's more...

There are many more functions provided in the Python API; some of the main ones are described here. You can explore the available commands using Python's help function: after importing the minecraft module, help(minecraft) will list the contents of that module, along with any text descriptions provided by the writer of the module. You can also use help to get information on classes and functions.

It is also possible to create your own API by building on top of the existing functions. For example, if you find yourself wanting to create a lot of spheres, you could write your own function that makes use of those provided, and import your module wherever you need it.

The minecraft module

The following code assumes that you have imported the minecraft module and created a connection to the running Minecraft game using mc = minecraft.Minecraft.create(). Whenever x, y, and z coordinates are used, x and z are both directions that follow the ground, and y is the height, with 0 at sea level.

mc.getBlock(x,y,z)
  Gets the type of the block at a particular location, as a number. These numbers are all provided in the block module.
mc.setBlock(x,y,z, type)
  Sets the block at a particular position to a particular type. There is also a setBlocks function that allows a cuboid to be filled; this is faster than setting blocks individually.
mc.getHeight(x,z)
  Gets the height of the world at the given location.
mc.getPlayerEntityIds()
  Gets a list of the IDs of all connected players.
mc.saveCheckpoint()
  Saves the current state of the world.
mc.restoreCheckpoint()
  Restores the state of the world from the saved checkpoint.
mc.postToChat(message)
  Posts a message to the game chat.
mc.setting(setting, status)
  Changes a game setting (such as "world_immutable" or "nametags_visible") to True or False.
mc.camera.setPos(x,y,z)
  Moves the game camera to a particular location. Other options are setNormal(player_id), setFixed(), and setFollow(player_id).
mc.player.getPos()
  Gets the position of the host player.
mc.player.setPos(x,y,z)
  Moves the host player.
mc.events.pollBlockHits()
  Gets a list of all blocks that have been hit since the last time the events were requested. Each event describes the position of the block that was hit.
mc.events.clearAll()
  Clears the list of events. At the time of writing, only block hits are recorded, but more event types may be included in the future.

The block module

Another useful module is the block module: use import block to gain access to its contents. The block module has a list of all available blocks and the numbers used to represent them. For example, a block of dirt is represented by the number 3. You can use 3 directly in your code if you like, or you can use the helpful name block.DIRT, which will make your code more readable.

Some blocks, such as wool, have additional information to describe their color. This data can be provided after the block's ID in all functions. For example, to create a block of red wool, where 14 is the data value representing red:

mc.setBlock(x, y, z, block.WOOL, 14)

Full information on the additional data values can be found online at http://www.minecraftwiki.net/wiki/Data_values(Pocket_Edition).
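To tie a few of these functions together, here is a short worked example (our own sketch, not part of the original recipe) that reads the host player's position and stacks three blocks of dirt beside it. As before, it assumes you run it from the ~/mcpi/api/python/mcpi directory so that the minecraft and block modules can be imported:

import minecraft
import block

mc = minecraft.Minecraft.create()

# Find the host player and pick a spot one block away along x.
pos = mc.player.getPos()
x, y, z = int(pos.x) + 1, int(pos.y), int(pos.z)

# Stack three blocks of dirt on top of one another.
for height in range(3):
    mc.setBlock(x, y + height, z, block.DIRT)

mc.postToChat("Built a dirt pillar next to you!")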
Summary

This article gave us some simple code to interact with the game. It also explained how Python communicates with the game, gave an overview of the other API functions, and showed how you can build useful functions of your own on top of the existing ones.

Resources for Article:

Further resources on this subject:
Creating a file server (Samba) [Article]
Webcam and Video Wizardry [Article]
Instant Minecraft Designs – Building a Tudor-style house [Article]