How-To Tutorials

16 Nov 2017

7 min read

Visualizing univariate distribution in Seaborn

Note: This article is an excerpt from Matplotlib 2.x By Example, a book by Allen Chi Shing Yu, Claire Yik Lok Chung, and Aldrin Kay Yuen Yim.

Seaborn, by Michael Waskom, is a statistical visualization library built on top of Matplotlib. It comes with handy functions for visualizing categorical variables, univariate distributions, and bivariate distributions. In this article, we will visualize univariate distributions in Seaborn.

Visualizing univariate distribution

Seaborn makes the task of visualizing the distribution of a dataset much easier. In this example, we are going to use the annual population summary published by the Department of Economic and Social Affairs, United Nations, in 2015. Projected population figures up to 2100 are also included in the dataset. Let's see how the population of the United States is distributed across age groups in 2017 by plotting a bar plot:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # population_df is assumed to be a pandas DataFrame loaded earlier, with
    # Location, Time, Sex, AgeGrp, and Value columns

    # Extract USA population data in 2017
    current_population = population_df[(population_df.Location == 'United States of America') &
                                       (population_df.Time == 2017) &
                                       (population_df.Sex != 'Both')]

    # Population bar chart
    sns.barplot(x="AgeGrp", y="Value", hue="Sex", data=current_population)

    # Use Matplotlib functions to label the axes and rotate the tick labels
    ax = plt.gca()
    ax.set(xlabel="Age Group", ylabel="Population (thousands)")
    ax.set_xticklabels(ax.xaxis.get_majorticklabels(), rotation=45)
    plt.title("Population Barchart (USA)")

    # Show the figure
    plt.show()

Bar chart in Seaborn

The seaborn.barplot() function shows a series of data points as rectangular bars. If multiple points per group are available, confidence intervals will be shown on top of the bars to indicate the uncertainty of the point estimates. Like most other Seaborn functions, it supports various input data formats, such as Python lists, NumPy arrays, pandas Series, and pandas DataFrames.
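To make the confidence intervals on the bars less of a black box, here is a minimal sketch of a bootstrap interval for a mean, which is the kind of interval seaborn.barplot() draws by default. It uses only the standard library; the function name bootstrap_ci, the sample values, and the fixed seed are assumptions made for illustration and are not Seaborn's actual code.

```python
import random
import statistics


def bootstrap_ci(values, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for the mean (illustrative sketch)."""
    rng = random.Random(seed)
    # Resample with replacement and record the mean of every resample
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    tail = (100 - level) / 2 / 100
    lower = means[int(tail * n_boot)]
    upper = means[int((1 - tail) * n_boot) - 1]
    return lower, upper


# Hypothetical repeated measurements for one bar (one group)
sample = [4.1, 3.8, 5.0, 4.4, 4.7, 3.9]
bar_height = statistics.fmean(sample)   # the point estimate drawn as the bar
ci_low, ci_high = bootstrap_ci(sample)  # the error bar drawn on top of it
print(bar_height, ci_low, ci_high)
```

In the population dataset each age group and sex has a single value, so there is nothing to resample and the bars appear without visible error bars.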
A more traditional way to show the population structure is through the use of a population pyramid. So what is a population pyramid? As its name suggests, it is a pyramid-shaped plot that shows the age distribution of a population. Pyramids can be roughly classified into three classes, namely constrictive, stationary, and expansive, for populations that are undergoing negative, stable, and rapid growth respectively. For instance, constrictive populations have a lower proportion of young people, so the pyramid base appears to be constricted. Stationary populations have a more or less similar number of young and middle-aged people. Expansive populations, on the other hand, have a large proportion of youngsters, resulting in pyramids with enlarged bases.

We can build a population pyramid by plotting two bar charts on two subplots with a shared y axis:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Extract USA population data in 2017
    current_population = population_df[(population_df.Location == 'United States of America') &
                                       (population_df.Time == 2017) &
                                       (population_df.Sex != 'Both')]

    # Change the age group to descending order
    current_population = current_population.iloc[::-1]

    # Create two subplots with shared y-axis
    fig, axes = plt.subplots(ncols=2, sharey=True)

    # Bar chart for male
    sns.barplot(x="Value", y="AgeGrp", color="darkblue", ax=axes[0],
                data=current_population[(current_population.Sex == 'Male')])

    # Bar chart for female
    sns.barplot(x="Value", y="AgeGrp", color="darkred", ax=axes[1],
                data=current_population[(current_population.Sex == 'Female')])

    # Use Matplotlib function to invert the first chart
    axes[0].invert_xaxis()

    # Use Matplotlib function to show tick labels in the middle
    axes[0].yaxis.tick_right()

    # Use Matplotlib functions to label the axes and titles
    axes[0].set_title("Male")
    axes[1].set_title("Female")
    axes[0].set(xlabel="Population (thousands)", ylabel="Age Group")
    axes[1].set(xlabel="Population (thousands)", ylabel="")
    fig.suptitle("Population Pyramid (USA)")

    # Show the figure
    plt.show()

Since Seaborn is built on top of the solid foundations of Matplotlib, we can customize the plot easily using built-in functions of Matplotlib. In the preceding example, we used matplotlib.axes.Axes.invert_xaxis() to flip the male population plot horizontally, and then moved the tick labels to the right-hand side using matplotlib.axis.YAxis.tick_right(). We further customized the titles and axis labels for the plot using a combination of matplotlib.axes.Axes.set_title(), matplotlib.axes.Axes.set(), and matplotlib.figure.Figure.suptitle().

Let's try to plot the population pyramids for Cambodia and Japan as well by changing the line population_df.Location == 'United States of America' to population_df.Location == 'Cambodia' or population_df.Location == 'Japan'. Can you classify the pyramids into one of the three population pyramid classes?

To see how Seaborn simplifies the code for relatively complex plots, let's see how a similar plot can be achieved using vanilla Matplotlib.
First, like the previous Seaborn-based example, we create two subplots with a shared y axis:

    fig, axes = plt.subplots(ncols=2, sharey=True)

Next, we plot horizontal bar charts using matplotlib.axes.Axes.barh() and set the location and labels of ticks, followed by adjusting the subplot spacing:

    # Get a list of tick positions according to the data bins
    y_pos = range(len(current_population.AgeGrp.unique()))

    # Horizontal bar chart for male
    axes[0].barh(y_pos, current_population[(current_population.Sex == 'Male')].Value,
                 color="darkblue")

    # Horizontal bar chart for female
    axes[1].barh(y_pos, current_population[(current_population.Sex == 'Female')].Value,
                 color="darkred")

    # Show a tick for each data point, and label it with the age group
    axes[0].set_yticks(y_pos)
    axes[0].set_yticklabels(current_population.AgeGrp.unique())

    # Increase spacing between subplots to avoid clipping of ytick labels
    plt.subplots_adjust(wspace=0.3)

Finally, we use the same code to further customize the look and feel of the figure:

    # Invert the first chart
    axes[0].invert_xaxis()

    # Show tick labels in the middle
    axes[0].yaxis.tick_right()

    # Label the axes and titles
    axes[0].set_title("Male")
    axes[1].set_title("Female")
    axes[0].set(xlabel="Population (thousands)", ylabel="Age Group")
    axes[1].set(xlabel="Population (thousands)", ylabel="")
    fig.suptitle("Population Pyramid (USA)")

    # Show the figure
    plt.show()

When compared to the Seaborn-based code, the pure Matplotlib implementation requires extra lines to define the tick positions, tick labels, and subplot spacing. For some other Seaborn plot types that include extra statistical calculations, such as linear regression and Pearson correlation, the code reduction is even more dramatic. In this sense, Seaborn is a "batteries-included" statistical visualization package that allows users to write less verbose code.

Histogram and distribution fitting in Seaborn

In the population example, the raw data was already binned into different age groups.
What if the data is not binned (for example, the BigMac Index data)? It turns out that seaborn.distplot can bin the data for us and show a histogram as a result. (In Seaborn 0.11 and later, distplot is deprecated in favor of histplot and displot.) Let's look at this example:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # bigmac_df is assumed to be a pandas DataFrame loaded earlier, with
    # Date and dollar_price columns

    # Get the BigMac index in 2017
    current_bigmac = bigmac_df[(bigmac_df.Date == "2017-01-31")]

    # Plot the histogram
    ax = sns.distplot(current_bigmac.dollar_price)
    plt.show()

The seaborn.distplot function expects a pandas Series, a one-dimensional numpy.array, or a Python list as input. It then determines the size of the bins according to the Freedman-Diaconis rule, and finally fits a kernel density estimate (KDE) over the histogram. KDE is a non-parametric method used to estimate the distribution of a variable.

We can also supply a parametric distribution, such as a beta, gamma, or normal distribution, to the fit argument. In this example, we are going to fit the normal distribution from the scipy.stats package over the Big Mac Index dataset:

    from scipy import stats

    ax = sns.distplot(current_bigmac.dollar_price, kde=False, fit=stats.norm)
    plt.show()

You have now equipped yourself with the knowledge to visualize univariate data in Seaborn as bar charts, histograms, and fitted distributions. To have more fun visualizing data with Seaborn and Matplotlib, check out the book from which this excerpt is taken.
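For readers curious about the Freedman-Diaconis rule mentioned above, the following standard-library sketch works through the arithmetic: the bin width is 2 * IQR / n^(1/3), and the number of bins is the data range divided by that width, rounded up. The function names and the sample data are assumptions for illustration, and the quartile method is chosen to resemble NumPy's default linear interpolation; Seaborn's internal implementation may differ in detail.

```python
import math
import statistics


def freedman_diaconis_width(values):
    """Bin width suggested by the Freedman-Diaconis rule: 2 * IQR / n^(1/3)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2 * (q3 - q1) / len(values) ** (1 / 3)


def freedman_diaconis_bins(values):
    """Number of bins needed to cover the data range at that width."""
    width = freedman_diaconis_width(values)
    return math.ceil((max(values) - min(values)) / width)


# Hypothetical unbinned data: the integers 1 to 100
data = list(range(1, 101))
width = freedman_diaconis_width(data)
n_bins = freedman_diaconis_bins(data)
print(width, n_bins)
```

Because the rule depends on the interquartile range rather than the standard deviation, a few extreme prices have little influence on the bin width.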

Packt Editorial Staff

15 Jun 2018

9 min read

Implement an API Design-first approach for building APIs [Tutorial]

Packt

06 Feb 2015

8 min read

Android Virtual Device Manager

This article, written by Belén Cruz Zapata, the author of the book Android Studio Essentials, teaches us how to use the AVD Manager tool. It also introduces us to the Google Play services.

The Android Virtual Device Manager (AVD Manager) is an Android tool, accessible from Android Studio, for managing the Android virtual devices that will be executed in the Android emulator.

To open the AVD Manager from Android Studio, navigate to the Tools | Android | AVD Manager menu option. You can also click on the shortcut from the toolbar. The AVD Manager displays the list of the existing virtual devices. Since we have not created any virtual device yet, the list will initially be empty. To create our first virtual device, click on the Create Virtual Device button to open the configuration dialog.

The first step is to select the hardware configuration of the virtual device. The hardware definitions are listed on the left side of the window. Select one of them, like the Nexus 5, to examine its details on the right side. Hardware definitions can be classified into one of these categories: Phone, Tablet, Wear, or TV.

We can also configure our own hardware device definitions from the AVD Manager. We can create a new definition using the New Hardware Profile button. The Clone Device button creates a duplicate of an existing device.

Click on the New Hardware Profile button to examine the existing configuration parameters. The most important parameters that define a device are:

Device Name: Name of the device.
Screen size: Screen size in inches. This value determines the size category of the device. Type a value of 4.0 and notice how the Size value (on the right side) is normal. Now type a value of 7.0 and the Size field changes its value to large. This parameter, along with the screen resolution, also determines the density category.
Resolution: Screen resolution in pixels. This value determines the density category of the device. With a screen size of 4.0 inches, type a value of 768 x 1280 and notice how the density value is 400 dpi. Change the screen size to 6.0 inches and the density value changes to hdpi. Now change the resolution to 480 x 800 and the density value is mdpi.
RAM: Memory size of the device.
Input: Indicate if the home, back, or menu buttons of the device are available via software or hardware.
Supported device states: Check the allowed states.
Cameras: Select if the device has a front camera or a back camera.
Sensors: Sensors available in the device: accelerometer, gyroscope, GPS, and proximity sensor.
Default Skin: Select additional hardware controls.

Create a new device with a screen size of 4.7 inches, a resolution of 800 x 1280, a RAM value of 500 MiB, software buttons, and both portrait and landscape states enabled. Name it My Device. Click on the Finish button. The hardware definition has been added to the list of configurations.

Click on the Next button to continue the creation of a new virtual device. The next step is to select the virtual device system image and the target Android platform. Each platform has its architecture, so the system images that are installed on your system will be listed along with the rest of the images that can be downloaded (with the Show downloadable system images box checked). Download and select one of the images of the Lollipop release and click on the Next button.

Finally, the last step is to verify the configuration of the virtual device. Enter the name of the Android Virtual Device in the AVD Name field. Give the virtual device a meaningful name so that it is easy to recognize, such as AVD_nexus5_api21. Click on the Show Advanced Settings button. The settings that we can configure for the virtual device are the following:

Emulation Options: The Store a snapshot for faster startup option saves the state of the emulator in order to load faster the next time. The Use Host GPU option uses the host computer's GPU to render graphics so that the emulator runs faster.
Custom skin definition: Select if additional hardware controls are displayed in the emulator.
Memory and Storage: Select the memory parameters of the virtual device. Keep the default values unless a warning message is shown; in that case, follow the instructions in the message. For example, select 1536M for the RAM and 64 for the VM Heap. The Internal Storage can also be configured; select, for example, 200 MiB. Select the size of the SD Card or select a file to behave as the SD card.
Device: Select one of the available device configurations. These configurations are the ones we tested in the layout editor preview. Select the Nexus 5 device to load its parameters in the dialog.
Target: Select the device Android platform. We have to create one virtual device with the minimum platform supported by our application and another virtual device with the target platform of our application. For this first virtual device, select the target platform, Android 4.4.2 - API Level 19.
CPU/ABI: Select the device architecture. The value of this field is set when we select the target platform. Each platform has its architecture, so if we do not have it installed, the following message will be shown: No system images installed for this target. To solve this, open the SDK Manager and search for one of the architectures of the target platform, ARM EABI v7a System Image or Intel x86 Atom System Image.
Keyboard: Select if a hardware keyboard is displayed in the emulator. Check it.
Skin: Select if additional hardware controls are displayed in the emulator. You can select the Skin with dynamic hardware controls option.
Front Camera: Select if the emulator has a front camera or a back camera. The camera can be emulated or can be real, using a webcam from the computer. Select None for both cameras.
Network: Select the speed of the simulated network and the delay in processing data across the network.

The new virtual device is now listed in the AVD Manager. Select the recently created virtual device to enable the remaining actions:

Start: Run the virtual device.
Edit: Edit the virtual device configuration.
Duplicate: Create a new device configuration, displaying the last step of the creation process. You can change its configuration parameters and then verify the new device.
Wipe Data: Remove the user files from the virtual device.
Show on Disk: Open the virtual device directory on your system.
View Details: Open a dialog detailing the virtual device characteristics.
Delete: Delete the virtual device.

Click on the Start button to open the emulator. Wait until it is completely loaded, and then you will be able to try it.

In Android Studio, open the main layout with the graphical editor and click on the list of the devices. Our custom device definition appears in the list, and we can select it to preview the layout.

Navigation Editor

The Navigation Editor is a tool to create and structure the layouts of the application using a graphical viewer. To open this tool, navigate to the Tools | Android | Navigation Editor menu. The tool opens a file in XML format named main.nvg.xml. This file is stored in your project at /.navigation/app/raw/.

Since there is only one layout and one activity in our project, the navigation editor only shows this main layout. If you select the layout, detailed information about it is displayed on the right panel of the editor. If you double-click on the layout, the XML layout file will be opened in a new tab.

We can create a new activity by right-clicking on the editor and selecting the New Activity option. We can also add transitions from the controls of a layout by shift-clicking on a control and then dragging to the target activity.
Open the main layout and create a new button with the label Open Activity:

    <Button
        android:id="@+id/button_open"
        android:layout_width="wrap_content"
        android:layout_height="wrap_content"
        android:layout_below="@+id/button_accept"
        android:layout_centerHorizontal="true"
        android:text="Open Activity" />

Open the Navigation Editor and add a second activity. The navigation editor now displays both activities, and we can add the navigation between them. Shift-drag from the new button of the main activity to the second activity. A blue line and a pink circle are added to represent the new navigation. Select the navigation relationship to see its details on the right panel. The right panel shows the source activity, the destination activity, and the gesture that triggers the navigation.

Now open our main activity class and notice the new code that has been added to implement the recently created navigation. The onCreate method now contains the following code:

    findViewById(R.id.button_open).setOnClickListener(
        new View.OnClickListener() {
            @Override
            public void onClick(View v) {
                MainActivity.this.startActivity(
                    new Intent(MainActivity.this, Activity2.class));
            }
        });

This code sets the onClick method of the new button, from which the second activity is launched.

Summary

This article taught us about the Navigation Editor tool. It also showed how to integrate the Google Play services with a project in Android Studio. In this article, we also got acquainted with the AVD Manager tool.

Packt

29 Nov 2010

8 min read

Web Services in Microsoft Azure

A web service is not a single entity; it consists of three distinct parts:

An endpoint, which is the URL (and related information) where client applications will find our service
A host environment, which in our case will be Azure
A service class, which is the code that implements the methods called by the client application

A web service endpoint is more than just a URL. An endpoint also includes:

The bindings, or communication and security protocols
The contract (or promise) that certain methods exist, how these methods should be called, and what the data will look like when returned

A simple way to remember the components of an endpoint is A/B/C, that is, address/bindings/contract.

Web services can fill many roles in our Azure applications, from serving as a simple way to place messages into a queue to being a complete replacement for a data access layer in a web application (also known as a Service Oriented Architecture or SOA). In Azure, web services serve as HTTP/HTTPS endpoints, which can be accessed by any application that supports REST, regardless of language or operating system.

The intrinsic web services libraries in .NET are called Windows Communication Foundation (WCF). As WCF is designed specifically for programming web services, it's referred to as a service-oriented programming model. We are not limited to using WCF libraries in Azure development, but because WCF is part of the .NET Framework, we expect it to be a popular choice for constructing web services. A complete introduction to WCF can be found at http://msdn.microsoft.com/en-us/netframework/aa663324.aspx.

When adding WCF services to an Azure web role, we can either create a separate web role instance or add the web services to an existing web role. Using separate instances allows us to scale the web services independently of the web forms, but multiple instances increase our operating costs.
Separate instances also allow us to use different technologies for each Azure instance; for example, the web form may be written in PHP and hosted on Apache, while the web services may be written in Java and hosted using Tomcat. Using the same instance helps keep our costs much lower, but in that case we have to scale both the web forms and the web services together. Depending on our application's architecture, this may not be desirable.

Securing WCF

Stored data is only as secure as the application used for accessing it. HTTP is stateless, and REST has no built-in notion of security, so security information must be passed as part of the data in each request. If the credentials are not encrypted, then all requests should be forced to use HTTPS. If we control the consuming client applications, we can also control the encryption of the user credentials. Otherwise, our only choice may be to use clear text credentials via HTTPS.

For an application with a wide or uncontrolled distribution (as most commercial applications aim to have), or if we are to support a number of home-brewed applications, the authorization information must be unique to the user. Part of the behind-the-services code should check whether the user making the request can be authenticated, and whether the user is authorized to perform the action. This adds coding overhead, but it's easier to plan for this up front.

There are a number of ways to secure web services, from using HTTPS and passing credentials with each request to using authentication tokens in each request. As it happens, using authentication tokens is part of the AppFabric Access Control, and we'll look more into the security for WCF when we dive deeper into Access Control.

Jupiter Motors web service

In our corporate portal for Jupiter Motors, we included a design for a client application, which our delivery personnel will use to update the status of an order and to decide which customers will accept delivery of their vehicle.
For accounting and insurance reasons, the order status needs to be updated immediately after a customer accepts their vehicle. To do so, the client application will call a web service to update the order status as soon as the Accepted button is clicked.

Our WCF service is interconnected with other parts of our Jupiter Motors application, so we won't see it completely in action until it all comes together. In the meantime, it will seem like we're developing blind. In reality, all the components would probably be developed and tested simultaneously.

Creating a new WCF service web role

When creating a web service, we can either add the web service to an existing web role or create a new web role. Creating a new web role helps us deploy and maintain our website application separately from our web services. In order to scale the web services independently of the web application, we'll create our web service in a role separate from our web application.

Creating a new WCF service web role is very simple; Visual Studio will do the "hard work" for us and allow us to start coding our services. First, open the JupiterMotors project. Create the new web role by right-clicking on the Roles folder in our project, choosing Add, and then selecting the New Web Role Project… option. When we do this, we will be asked what type of web role we want to create. We will choose a WCF Service Web Role, call it JupiterMotorsWCFRole, and click on the Add button.

Because different services must have unique names in our project, a good naming convention to use is the project name concatenated with the type of role. This makes the different roles and instances easily discernible and complies with the unique naming requirement.

This is where Visual Studio does its magic. It creates the new role in the cloud project, creates a new web role for our WCF web services, and creates some template code for us. The template service created is called "Service1".
You will see both a Service1.svc file and an IService1.vb file. Also, a web.config file (as we would expect to see in any web role) is created in the web role and is already wired up for our Service1 web service. All of the generated code is very helpful if you are learning WCF web services.

We are going to start afresh with our own services, so we can delete Service1.svc and IService1.vb. Also, in the web.config file, the following boilerplate code can be deleted (we'll add our own code as needed):

    <system.serviceModel>
      <services>
        <service name="JupiterMotorsWCFRole.Service1"
                 behaviorConfiguration="JupiterMotorsWCFRole.Service1Behavior">
          <endpoint address="" binding="basicHttpBinding"
                    contract="JupiterMotorsWCFRole.IService1">
            <identity>
              <dns value="localhost"/>
            </identity>
          </endpoint>
          <endpoint address="mex" binding="mexHttpBinding" contract="IMetadataExchange"/>
        </service>
      </services>
      <behaviors>
        <serviceBehaviors>
          <behavior name="JupiterMotorsWCFRole.Service1Behavior">
            <serviceMetadata httpGetEnabled="true"/>
            <serviceDebug includeExceptionDetailInFaults="false"/>
          </behavior>
        </serviceBehaviors>
      </behaviors>
    </system.serviceModel>

Let's now add a WCF service to the JupiterMotorsWCFRole project. To do so, right-click on the project, choose Add, and select the New Item... option. We now choose a WCF service and name it ERPService.svc.

Just as when we created the web role, the ERPService.svc and IERPService.vb files are generated for us, and these are now wired into the web.config file. There is some generated code in the ERPService.svc and IERPService.vb files, but we will replace this with our code in the next section.

When we create a web service, the actual service class is created with the name we specify. Additionally, an interface class is automatically created.
We can specify the name for the service class; the interface class takes the same name prefixed with the letter I, following the usual naming convention for interfaces. This is a special type of interface class, called a service contract. The service contract provides a description of what methods and return types are available in our web service.
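The relationship between a service contract and the service class that fulfils it can be sketched in a language-neutral way. The example below is a rough Python analogue, not WCF or VB.NET code: the names IERPService and ERPService come from the text above, while the update_order_status method, its parameters, and the in-memory dictionary are hypothetical and exist only for illustration.

```python
from abc import ABC, abstractmethod


class IERPService(ABC):
    """Service contract: declares which operations exist, with no implementation."""

    @abstractmethod
    def update_order_status(self, order_id: int, status: str) -> bool:
        """Hypothetical operation: record a new status for an order."""


class ERPService(IERPService):
    """Service class: implements every operation promised by the contract."""

    def __init__(self) -> None:
        # Stand-in for a real data store (assumption for this sketch)
        self._orders = {}

    def update_order_status(self, order_id: int, status: str) -> bool:
        self._orders[order_id] = status
        return True


service = ERPService()
updated = service.update_order_status(1001, "Accepted")

# The contract itself cannot be instantiated; only an implementation can
try:
    IERPService()
    contract_is_abstract = False
except TypeError:
    contract_is_abstract = True
```

Clients depend only on the contract, so the implementation behind it can change without affecting them as long as the declared methods and return types stay the same.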

Bhagyashree R

18 Jul 2019

5 min read

Python 3.8 new features: the walrus operator, positional-only parameters, and much more

Earlier this month, the team behind Python announced the release of Python 3.8b2, the second of four planned beta releases. Ahead of the third beta release, which is scheduled for 29th July, we look at some of the key features coming to Python 3.8.

The "incredibly controversial" walrus operator

The walrus operator was proposed in PEP 572 (Assignment Expressions) by Chris Angelico, Tim Peters, and Guido van Rossum last year. Since then it has been heavily discussed in the Python community, with many questioning whether it is a needed improvement. Others were excited, as the operator does make the code a tiny bit more readable.

At the end of the PEP discussion, Guido van Rossum stepped down as BDFL (benevolent dictator for life), which led to the creation of a new governance model. In an interview with InfoWorld, Guido shared, “The straw that broke the camel’s back was a very contentious Python enhancement proposal, where after I had accepted it, people went to social media like Twitter and said things that really hurt me personally. And some of the people who said hurtful things were actually core Python developers, so I felt that I didn’t quite have the trust of the Python core developer team anymore.”

According to PEP 572, an assignment expression uses the := operator to assign a value to a variable as part of an expression. Its aim is to simplify things like multiple-pattern matches and the so-called loop and a half.
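As a small illustration of the two use cases just mentioned, the sketch below uses := for a multiple-pattern match and for a "loop and a half". It requires Python 3.8 or later; the helper names first_match and read_chunks and the sample inputs are made up for this example.

```python
import io
import re


def first_match(patterns, text):
    """Try several patterns in order and return the first (pattern, matched text)."""
    for pattern in patterns:
        # Assign and test the match object in a single expression
        if (match := re.search(pattern, text)) is not None:
            return pattern, match.group(0)
    return None


def read_chunks(stream, size):
    """Loop and a half: read until the stream returns an empty chunk."""
    chunks = []
    while chunk := stream.read(size):
        chunks.append(chunk)
    return chunks


found = first_match([r"\d{4}-\d{2}-\d{2}", r"\d+"], "build 42 passed")
chunks = read_chunks(io.StringIO("abcdefg"), 3)
print(found, chunks)
```

Without the operator, first_match would need a separate assignment line before each test, and read_chunks would need either a duplicated read call or a while True loop with a break.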
At PyCon 2019, Dustin Ingram, a PyPI maintainer, gave a few examples where you can use this syntax:

Balancing lines of code and complexity
Avoiding inefficient comprehensions
Avoiding unnecessary variables in scope

You can watch the full talk on YouTube: https://www.youtube.com/watch?v=6uAvHOKofws

The feature was implemented by Emily Morehouse, Python core developer and Founder, Director of Engineering at Cuttlesoft, and was merged earlier this year: https://twitter.com/emilyemorehouse/status/1088593522142339072

Explaining other improvements this feature brings, Jake Edge, a contributor on LWN.net, wrote, “These and other uses (e.g. in list and dict comprehensions) help make the intent of the programmer clearer. It is a feature that many other languages have, but Python has, of course, gone without it for nearly 30 years at this point. In the end, it is actually a fairly small change for all of the uproars it caused.”

Positional-only parameters

Proposed in PEP 570, this feature introduces a new syntax (/) to specify positional-only parameters in Python function definitions. This is similar to how * indicates that the parameters to its right are keyword-only. The / notation is already used in the documentation of many CPython built-in and standard library functions, for instance, the pow() function:

    pow(x, y, z=None, /)

This syntax gives library authors more control over expressing the intended usage of an API and allows the API to “evolve in a safe, backward-compatible way.” It gives library authors the flexibility to change the names of positional-only parameters without breaking callers. Additionally, it ensures consistency of the Python language with existing documentation and the behavior of various "builtin" and standard library functions.

As with PEP 572, this proposal also got mixed reactions from Python developers. In support, one developer said, “Position-only parameters already exist in cpython builtins like range and min. Making their support at the language level would make their existence less confusing and documented.”

Others think that this will allow authors to “dictate” how their methods can be used. “Not the biggest fan of this one because it allows library authors to overly dictate how their functions can be used, as in, mark an argument as positional merely because they want to. But cool all the same,” a Redditor commented.

Debug support for f-strings

Formatted strings (f-strings) were introduced in Python 3.6 with PEP 498. They enable you to evaluate an expression, such as a function call, as part of the string and insert its result. In Python 3.8, an = specifier has been added for ease of debugging (it replaced the !d conversion tried in an earlier pre-release). You can use this feature like this:

    print(f'{foo=} {bar=}')

This gives developers a better way of doing “print-style debugging”, especially those with a background in languages that already have such a feature, such as Perl, Ruby, and JavaScript. One developer expressed his delight on Hacker News, “F strings are pretty awesome. I’m coming from JavaScript and partially java background. JavaScript’s String concatenation can become too complex and I have difficulty with large strings.”

Python Initialization Configuration

Though Python is highly configurable, its configuration is scattered all around the code. PEP 587 introduces a new C API to configure Python initialization, giving developers finer control over the configuration and better error reporting. Among the improvements this API brings are the ability to read and modify the configuration before it is applied and the ability to override how Python computes the module search paths (sys.path).
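The positional-only syntax and the f-string = specifier described in the sections above can be tried together in a few lines. This is a minimal sketch that assumes Python 3.8 or later; the power function and its parameter names are invented for the example.

```python
def power(base, exponent, /, modulus=None):
    """Parameters before the / can only be passed by position."""
    return pow(base, exponent, modulus)


# Positional arguments work as usual
value = power(2, 10)

# Passing a positional-only parameter by keyword raises TypeError
try:
    power(base=2, exponent=10)
    error = None
except TypeError as exc:
    error = str(exc)

# The = specifier prints the expression text followed by its value
debug_line = f"{value=}"
print(debug_line)
print(error)
```

Because callers cannot refer to base or exponent by name, the author of power is free to rename those parameters later without breaking existing calls, which is the backward-compatibility benefit PEP 570 describes.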
Along with these, there are many other exciting features coming to Python 3.8, which is currently scheduled for October, including a fast calling protocol for CPython (Vectorcall), support for out-of-band buffers in pickle protocol 5, and more. You can find the full list on Python’s official website.

Python serious about diversity, dumps offensive ‘master’, ‘slave’ terms in its documentation
Introducing PyOxidizer, an open source utility for producing standalone Python applications, written in Rust
Python 3.8 beta 1 is now ready for you to test


Packt

05 Aug 2010

10 min read

FreeSWITCH: Utilizing the Built-in IVR Engine


Pavan Ramchandani

17 May 2018

9 min read

That '70s language: AWK programming


AWK is an interpreted programming language designed for text processing and report generation. It is typically used for data manipulation, such as searching for items within data, performing arithmetic operations, and restructuring raw data for generating reports, and it is available on most Unix-like operating systems. Today, we will explore the AWK philosophy and the different types of AWK that exist, starting from its original implementation in 1977 at AT&T Bell Laboratories. We will also look at the various implementation areas of AWK in data science today.

Using AWK programs, one can handle repetitive text-editing problems with very simple and short programs. It is a pattern-action language; it searches for patterns in a given input and, when a match is found, it performs the corresponding action. The pattern can be made of strings, regular expressions, comparison operations on numbers, fields, variables, and so on. It reads the input files and splits each input line of the file into fields automatically.

AWK has most of the well-designed features that every programming language should contain. Its syntax particularly resembles that of the C programming language. It is named after its original three authors:

Alfred V. Aho
Peter J. Weinberger
Brian W. Kernighan

AWK is a very powerful, elegant, and simple language that every person dealing with text processing should be familiar with.

This article is an excerpt from a book written by Shiwang Kalkhanda, titled Learning AWK Programming. This book will introduce you to the AWK programming language and get you hands-on with practical implementations of AWK.

Types of AWK

The AWK language was originally implemented as an AWK utility on Unix. Today, most Linux distributions provide the GNU implementation of AWK (GAWK), and awk is a symlink that points to the GAWK binary. 
The AWK utility can be categorized into the following three types, depending upon the type of interpreter it uses for executing AWK programs:

AWK: This is the original AWK interpreter available from AT&T Laboratories. However, it is not used much nowadays and hence it might not be well-maintained. Its limitation is that it splits a line into a maximum of 99 fields. It was updated and replaced in the mid-1980s with an enhanced version called New AWK (NAWK).
NAWK: This is AT&T's latest development on the AWK interpreter. It is well-maintained by one of the original authors of AWK, Dr. Brian W. Kernighan.
GAWK: This is the GNU project's implementation of the AWK programming language. Most GNU/Linux distributions ship with GAWK by default and hence it is the most popular version of AWK. The GAWK interpreter is fully compatible with AWK and NAWK.

Beyond these, there are other, less popular, AWK interpreters and translators, listed as follows. The translators are useful when you want to translate your AWK program to C, C++, or Perl:

MAWK: Michael Brennan's interpreter for AWK.
TAWK: Thompson Automation's interpreter/compiler/Microsoft Windows DLL for AWK.
MKSAWK: Mortice Kern Systems' interpreter/compiler for AWK.
AWKCC: An AWK translator to C (might not be well-maintained).
AWKC++: Brian Kernighan's AWK translator to C++ (experimental). It can be downloaded from: https://9p.io/cm/cs/who/bwk/awkc++.ps.
AWK2C: An AWK translator to C. It uses GNU AWK libraries extensively.
A2P: An AWK translator to Perl. It comes with Perl.
AWKA: Yet another AWK translator to C (comes with the library), based on MAWK. It can be downloaded from: http://awka.sourceforge.net/download.html.

When and where to use AWK

AWK is simpler than any other utility for text processing and is available as the default on Unix-like operating systems. 
However, some people might say Perl is a superior choice for text processing, as AWK is functionally a subset of Perl. The learning curve for Perl is steeper than that of AWK, though; AWK is simpler than Perl. AWK programs are smaller and hence quicker to execute. Anybody who knows the Linux command line can start writing AWK programs in no time. Here are a few use cases of AWK:

Text processing
Producing formatted text reports/labels
Performing arithmetic operations on fields of a file
Performing string operations on different fields of a file

Programs written in AWK are smaller than they would be in other higher-level languages for similar text processing operations. AWK programs are interpreted directly from a GNU/Linux Terminal, so they avoid the separate compile step that software development in compiled languages requires.

Getting started with installation

This section describes how to set up the AWK environment on your GNU/Linux system, and we'll also discuss the workflow of AWK. Then, we'll look at different methods for executing AWK programs.

Installation on Linux

Generally, AWK is installed by default on most GNU/Linux distributions. Using the which command, you can check whether it is installed on your system or not. In case AWK is not installed on your system, you can install it in one of two ways:

Using the package manager of the corresponding GNU/Linux system
Compiling from the source code

Let's take a look at each method in detail in the following sections.

Using the package manager

Different flavors of GNU/Linux distribution have different package-management utilities. 
If you are using a Debian-based GNU/Linux distribution, such as Ubuntu, Mint, or Debian, then you can install it using the Advanced Package Tool (APT) package manager, as follows:

    [ shiwang@linux ~ ] $ sudo apt-get update -y
    [ shiwang@linux ~ ] $ sudo apt-get install gawk -y

Similarly, to install AWK on an RPM-based GNU/Linux distribution, such as Fedora, CentOS, or RHEL, you can use the Yellowdog Updater, Modified (YUM) package manager, as follows:

    [ root@linux ~ ] # yum update -y
    [ root@linux ~ ] # yum install gawk -y

For installation of AWK on openSUSE, you can use the zypper command-line package-management utility, as follows:

    [ root@linux ~ ] # zypper update -y
    [ root@linux ~ ] # zypper install gawk -y

Once the installation is finished, make sure AWK is accessible through the command line. We can check that using the which command, which will return the absolute path of AWK on our system:

    [ root@linux ~ ] # which awk
    /usr/bin/awk

You can also use awk --version to find the AWK version on your system:

    [ root@linux ~ ] # awk --version

Compiling from the source code

Like every other open source utility, the GNU AWK source code is freely available for download as part of the GNU project. Previously, you saw how to install AWK using the package manager; now, you will see how to install AWK by compiling from its source code on the GNU/Linux distribution. The following steps are applicable to most GNU/Linux software installed from source:

Download the source code from a GNU project ftp site. 
Here, we will use the wget command-line utility to download it; however, you are free to choose any other program you feel comfortable with, such as curl:

    [ shiwang@linux ~ ] $ wget http://ftp.gnu.org/gnu/gawk/gawk-4.1.3.tar.xz

Extract the downloaded source code:

    [ shiwang@linux ~ ] $ tar xvf gawk-4.1.3.tar.xz

Change your working directory and execute the configure script to configure GAWK for the working environment of your system:

    [ shiwang@linux ~ ] $ cd gawk-4.1.3 && ./configure

Once the configure command completes its execution successfully, it will generate the Makefile. Now, compile the source code by executing the make command:

    [ shiwang@linux ~ ] $ make

Type make install to install the programs and any data files and documentation. When installing into a prefix owned by root, it is recommended that the package be configured and built as a regular user, and only the make install phase be executed with root privileges:

    [ shiwang@linux ~ ] $ sudo make install

Upon successful execution of these five steps, you have compiled and installed AWK on your GNU/Linux distribution. You can verify this by executing the which awk command in the Terminal or awk --version:

    [ root@linux ~ ] # which awk
    /usr/bin/awk

Now you have a working AWK/GAWK installation and we are ready to begin AWK programming, but before that, our next section describes the workflow of the AWK interpreter.

If you are running macOS, AWK, and not GAWK, is installed by default. For GAWK installation on macOS, please refer to MacPorts for GAWK.

Workflow of AWK

Having a basic knowledge of the AWK interpreter workflow will help you to better understand AWK and will result in more efficient AWK program development. Hence, before getting your hands dirty with AWK programming, you need to understand its internals. 
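Whichever installation route you take, the which awk and awk --version checks described above can also be scripted. The following Python sketch is only an illustration: the helper names find_awk and awk_version are made up for this example, and not every AWK implementation accepts the --version flag, so the version line may be missing or may be an error message.

```python
import shutil
import subprocess


def find_awk(name="awk"):
    """Return the absolute path of the program, like `which awk`, or None."""
    return shutil.which(name)


def awk_version(path):
    """Return the first line printed by `<path> --version`, or None on failure."""
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    # Some implementations report their version (or an error) on stderr.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else None


path = find_awk()
if path is None:
    print("awk not found on PATH")
else:
    print(path, "->", awk_version(path))
```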
The AWK workflow can be summarized as shown in the following figure:

Let's take a look at each operation:

READ OPERATION: AWK reads a line from the input stream (file, pipe, or stdin) and stores it in memory. It works on text input, which it further splits into records and fields:

Records: An AWK record is a single, continuous data input that AWK works on. Records are bounded by a record separator, whose value is stored in the RS variable. The default value of RS is a newline character, so the lines of input are considered records by the AWK interpreter. Records are read continuously until the end of the input is reached. Figure 1.2 shows how input data is broken into records and then goes further into how it is split into fields:

Fields: Each record can further be broken down into individual chunks called fields. Like records, fields are bounded by a separator, whose value is stored in the FS variable. The default field separator is any amount of whitespace, including tab and space characters. So by default, lines of input are further broken down into individual words separated by whitespace. You can refer to the fields of a record by a field number, beginning with 1. The last field in each record can be accessed by its number or through the NF special variable, which contains the number of fields in the current record, as shown in Figure 1.3:

EXECUTE OPERATION: All AWK commands are applied sequentially on the input (records and fields). By default, AWK executes commands on each record/line. This behavior of AWK can be restricted by the use of patterns.

REPEAT OPERATION: The process of read and execute is repeated until the end of the file is reached.

The following flowchart depicts the workflow:

We introduced you to the AWK programming language and got ourselves a quick primer to get started with application development. 
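The read, execute, and repeat workflow described above can be imitated in a few lines of Python. This is a rough sketch of the idea, not how any AWK interpreter is implemented: the function names split_records, split_fields, and run are made up for illustration, and only the default separators (newline for records, runs of whitespace for fields) are modeled.

```python
def split_records(text, rs="\n"):
    """Split input into records on the record separator (AWK's RS)."""
    records = text.split(rs)
    # A trailing separator does not produce an extra empty record.
    if records and records[-1] == "":
        records.pop()
    return records


def split_fields(record):
    """Split a record on runs of whitespace, like AWK's default field splitting."""
    return record.split()


def run(text, pattern, action):
    """Read each record and execute the action on records the pattern selects."""
    results = []
    for record in split_records(text):      # READ, repeated until input ends
        fields = split_fields(record)
        nf = len(fields)                     # AWK's NF: number of fields
        if pattern(record, fields, nf):      # the pattern restricts execution
            results.append(action(record, fields, nf))  # EXECUTE
    return results


sample = "alice 30 london\nbob 25\ncarol 41 paris france\n"

# Rough equivalent of: awk 'NF >= 3 { print $1, $NF }'
# AWK numbers fields from 1, so $1 is fields[0] and $NF is fields[nf - 1].
selected = run(
    sample,
    lambda rec, fields, nf: nf >= 3,
    lambda rec, fields, nf: f"{fields[0]} {fields[nf - 1]}",
)
print(selected)  # ['alice london', 'carol france']
```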
If you found this post useful, do check out the book Learning AWK Programming to learn more about the intricacies of the AWK programming language for text processing.

The oldest programming languages in use today
What is the difference between functional and object oriented programming?
Systems programming with Go in UNIX and Linux


Sunith Shetty

24 Jul 2018

10 min read

Build Hadoop clusters using Google Cloud Platform [Tutorial]


Cloud computing has transformed the way individuals and organizations access and manage their servers and applications on the internet. Before Cloud computing, everyone used to manage their servers and applications on their own premises or in dedicated data centers. The increase in raw computing power (CPU and GPU), with multiple cores on a single chip, and the increase in storage space (HDD and SSD) present challenges in efficiently utilizing the available computing resources.

In today's tutorial, we will learn different ways of building a Hadoop cluster on the Cloud and ways to store and access data on the Cloud.

This article is an excerpt from a book written by Naresh Kumar and Prashant Shindgikar titled Modern Big Data Processing with Hadoop.

Building Hadoop cluster in the Cloud

The Cloud offers a flexible and easy way to rent resources such as servers, storage, networking, and so on. The Cloud has made it very easy for consumers with the pay-as-you-go model, but much of the complexity of the Cloud is hidden from us by the providers. In order to better understand whether Hadoop is well suited to being on the Cloud, let's try to dig further and see how the Cloud is organized internally. 
At the core of the Cloud are the following mechanisms:

A very large number of servers with a variety of hardware configurations
Servers connected and made available over IP networks
Large data centers to host these devices
Data centers spanning geographies with evolved network and data center designs

If we pay close attention, we are talking about the following:

A very large number of different CPU architectures
A large number of storage devices with a variety of speeds and performance
Networks with varying speed and interconnectivity

Let's look at a simple design of such a data center on the Cloud:

We have the following devices in the preceding diagram:

S1, S2: Rack switches
U1-U6: Rack servers
R1: Router
Storage area network
Network attached storage

As we can see, Cloud providers have a very large number of such architectures to make them scalable and flexible. As you may have guessed, when the number of such servers increases and we request a new server, the provider can allocate the server anywhere in the region. This makes it a bit challenging to keep compute and storage together, but it also provides elasticity.

In order to address this co-location problem, some Cloud providers give the option of creating a virtual network and taking dedicated servers, and then allocating all their virtual nodes on these servers. This is somewhat closer to a data center design, but flexible enough to return resources when not needed.

Let's get back to Hadoop and remind ourselves that in order to get the best from the Hadoop system, we should have the CPU power closer to the storage. This means that the physical distance between the CPU and the storage should be small, so that the bus speeds can match the processing requirements. 
The slower the I/O speed between the CPU and the storage (for example, iSCSI, storage area network, network attached storage, and so on), the poorer the performance we get from the Hadoop system, as the data is fetched over the network, kept in memory, and then fed to the CPU for further processing. This is one of the important things to keep in mind when designing Hadoop systems on the Cloud.

Apart from performance reasons, there are other things to consider:

Scaling Hadoop
Managing Hadoop
Securing Hadoop

Now, let's try to understand how we can take care of these in the Cloud environment. Hadoop can be installed in the following modes:

Standalone
Pseudo-distributed
Fully-distributed

When we want to deploy Hadoop on the Cloud, we can deploy it in the following ways:

Custom shell scripts
Cloud automation tools (Chef, Ansible, and so on)
Apache Ambari
Cloud vendor provided methods
    Google Cloud Dataproc
    Amazon EMR
    Microsoft HDInsight
Third-party managed Hadoop
    Cloudera
Cloud agnostic deployment
    Apache Whirr

Google Cloud Dataproc

In this section, we will learn how to use Google Cloud Dataproc to set up a single node Hadoop cluster. The steps can be broken down into the following:

Getting a Google Cloud account.
Activating Google Cloud Dataproc service.
Creating a new Hadoop cluster.
Logging in to the Hadoop cluster.
Deleting the Hadoop cluster.

Getting a Google Cloud account

This section assumes that you already have a Google Cloud account.

Activating the Google Cloud Dataproc service

Once you log in to the Google Cloud console, you need to visit the Cloud Dataproc service. The activation screen looks something like this:

Creating a new Hadoop cluster

Once Dataproc is enabled in the project, we can click on Create to create a new Hadoop cluster. After this, we see another screen where we need to configure the cluster parameters:

I have left most of the things to their default values. 
Then, we can click on the Create button, which creates a new cluster for us.

Logging in to the cluster

After the cluster has successfully been created, we will automatically be taken to the cluster lists page. From there, we can launch an SSH window to log in to the single node cluster we have created. The SSH window looks something like this:

As you can see, the Hadoop command is readily available for us and we can run any of the standard Hadoop commands to interact with the system.

Deleting the cluster

In order to delete the cluster, click on the DELETE button and it will display a confirmation window, as shown in the following screenshot. After this, the cluster will be deleted:

Looks so simple, right? Yes. Cloud providers have made it very simple for users to use the Cloud and pay only for the usage.

Data access in the Cloud

The Cloud has become an important destination for storing both personal data and business data. Depending upon the importance and the secrecy requirements of the data, organizations have started using the Cloud to store their vital datasets. The following diagram tries to summarize the various access patterns of typical enterprises and how they leverage the Cloud to store their data:

Cloud providers offer different varieties of storage. Let's take a look at what these types are:

Block storage
File-based storage
Encrypted storage
Offline storage

Block storage

This type of storage is primarily useful when we want to use it along with our compute servers, and want to manage the storage via the host operating system. To understand this better, this type of storage is equivalent to the hard disk/SSD that comes with our laptops/MacBooks when we purchase them. In the case of laptop storage, if we decide to increase the capacity, we need to replace the existing disk with another one. When it comes to the Cloud, if we want to add more capacity, we can just purchase another, larger-capacity storage volume and attach it to our server. 
This is one of the reasons why the Cloud has become popular: it has made it very easy to grow or shrink the storage that we need. It's good to remember that, since there are many different types of access patterns for our applications, Cloud vendors also offer block storage with varying capacity and speed characteristics, measured in their own capacity/IOPS terms, and so on.

Let's take an example of this capacity upgrade requirement and see what we do to utilize this block storage on the Cloud. In order to understand this, let's look at the example in this diagram:

Imagine a server created by the administrator called DB1 with an original capacity of 100 GB. Later, due to unexpected demand from customers, an application started consuming all 100 GB of storage, so the administrator has decided to increase the capacity to 1 TB (1,024 GB). This is what the workflow looks like in this scenario:

Create a new 1 TB disk on the Cloud
Attach the disk to the server and mount it
Take a backup of the database
Copy the data from the existing disk to the new disk
Start the database
Verify the database
Destroy the data on the old disk and return the disk

This process is simplified; in production it might take some time, depending upon the type of maintenance that is being performed by the administrator. But, from the Cloud perspective, acquiring new block storage is very quick.

File storage

Files are the basics of computing. If you are familiar with UNIX/Linux environments, you already know that everything is a file in the Unix world. But don't get confused by that, as every operating system has its own way of dealing with hardware resources. In this case we are not worried about how the operating system deals with hardware resources; we are talking about the important documents that the users store as part of their day-to-day business. 
These files can be:

Movie/conference recordings
Pictures
Excel sheets
Word documents

Even though they are simple-looking files on our computer, they can have significant business importance and should be dealt with carefully when we think of storing them on the Cloud. Most Cloud providers offer an easy way to store these simple files on the Cloud and also offer flexibility in terms of security.

A typical workflow for acquiring storage of this form is like this:

Create a new storage bucket that's uniquely identified
Add private/public visibility to this bucket
Add multi-geography replication requirement to the data that is stored in this bucket

Some Cloud providers bill their customers based on the number of features they select as part of their bucket creation. Please choose a hard-to-discover name for buckets that contain confidential data, and also make them private.

Encrypted storage

This is a very important requirement for business critical data, as we do not want the information to be leaked outside the scope of the organization. Cloud providers offer an encryption at rest facility for us. Some vendors choose to do this automatically, and some vendors also provide flexibility in letting us choose the encryption keys and methodology for encrypting/decrypting the data that we own. Depending upon the organization policy, we should follow best practices in dealing with this on the Cloud.

With the increase in the performance of storage devices, encryption does not add significant overhead while decrypting/encrypting files. This is depicted in the following image:

Continuing the same example as before, when we choose to encrypt the underlying block storage of 1 TB, we can leverage the Cloud-offered encryption where they automatically encrypt and decrypt the data for us. So, we do not have to employ special software on the host operating system to do the encryption and decryption. 
Remember that encryption can be a feature that's available in both the block storage and the file-based storage offered by the vendor.

Cold storage

This storage is very useful for storing important backups in the Cloud that are rarely accessed. Since we are dealing with a special type of data here, we should also be aware that the Cloud vendor might charge significantly higher amounts for data access from this storage, as it's meant to be written once and forgotten (until it's needed). The advantage of this mechanism is that we pay less to store even petabytes of data.

We looked at the different steps involved in building our own Hadoop cluster on the Cloud. And we saw different ways of storing and accessing our data on the Cloud. To know more about how to build expert Big Data systems, do check out the book Modern Big Data Processing with Hadoop.

Read More:

What makes Hadoop so revolutionary?
Machine learning APIs for Google Cloud Platform
Getting to know different Big data Characteristics


Packt

23 Jun 2010

6 min read

Debugging Java Programs using JDB


In this article by Nataraju Neeluru, we will learn how to debug a Java program using a simple command-line debugging tool called JDB. JDB is one of the several debuggers available for debugging Java programs. It comes as part of Sun's JDK. JDB is used by a lot of people for debugging purposes, mainly because it is very simple to use, lightweight, and, being a command-line tool, very fast. Those who are familiar with debugging C programs with gdb will be more inclined to use JDB for debugging Java programs. We will cover most of the commonly used and needed JDB commands for debugging Java programs. Not much is assumed of the reader, other than some familiarity with Java programming and general concepts of debugging like breakpoints, stepping through the code, examining variables, etc. Beginners may learn quite a few things here, and experts may revise their knowledge.

Introduction

JDB is a debugging tool that comes along with Sun's JDK. The executable exists in JAVA_HOME/bin as 'jdb' on Linux and 'jdb.exe' on Windows (where JAVA_HOME is the root directory of the JDK installation).

A few notes about the tools and notation used in this article:

We will use 'jdb' on Linux for illustration throughout this article, though the JDB command set is more or less the same on all platforms.
All the tools (like jdb, java) used in this article are from JDK 5, though most of the material presented here holds true and works in other versions also.
'$' is the command prompt on the Linux machine on which the illustration is carried out.
We will use 'JDB' to denote the tool in general, and 'jdb' to denote the particular executable in the JDK on Linux.
JDB commands are explained in a particular sequence. If that sequence is changed, then the output obtained may be different from what is shown in this article. 
Throughout this article, we will use the following simple Java program for debugging:

    public class A {
        private int x;
        private int y;

        public A(int a, int b) {
            x = a;
            y = b;
        }

        public static void main(String[] args) {
            System.out.println("Hi, I'm main.. and I'm going to call f1");
            f1();
            f2(3, 4);
            f3(4, 5);
            f4();
            f5();
        }

        public static void f1() {
            System.out.println("I'm f1...");
            System.out.println("I'm still f1...");
            System.out.println("I'm still f1...");
        }

        public static int f2(int a, int b) {
            return a + b;
        }

        public static A f3(int a, int b) {
            A obj = new A(a, b);
            obj.reset();
            return obj;
        }

        public static void f4() {
            System.out.println("I'm f4 ");
        }

        public static void f5() {
            A a = new A(5, 6);
            synchronized(a) {
                System.out.println("I'm f5, accessing a's fields " + a.x + " " + a.y);
            }
        }

        private void reset() {
            x = 0;
            y = 0;
        }
    }

Let us put this code in a file called A.java in the current working directory, compile it using 'javac -g A.java' (note the '-g' option that makes the Java compiler generate some extra debugging information in the class file), and even run it once using 'java A' to see what the output is. There is no apparent bug in this program to debug, but we will see, using JDB, how the control flows through this program.

Recall that this program, being a Java program, runs on a Java Virtual Machine (JVM). Before we actually debug the Java program, we need to see that a connection is established between JDB and the JVM on which the Java program is running. Depending on the way JDB connects to the JVM, there are a few ways in which we can use JDB. No matter how the connection is established, once JDB is connected to the JVM, we can use the same set of commands for debugging. The JVM on which the Java program to be debugged is running is called the 'debuggee' here.

Establishing the connection between JDB and the JVM

In this section, we will see a few ways of establishing the connection between JDB and the JVM. 
JDB launching the JVM: In this option, we do not see two separate things as the debugger (JDB) and the debuggee (JVM), but rather we just invoke JDB by giving the initial class (i.e., the class that has the main() method) as an argument, and internally JDB 'launches' the JVM.

    $jdb A
    Initializing jdb ...

At this point, the JVM is not yet started. We need to give the 'run' command at the JDB prompt for the JVM to be started.

JDB connecting to a running JVM: In this option, first start the JVM by using a command of the form:

    $java -Xdebug -Xrunjdwp:transport=dt_socket,server=y,address=6000 A
    Listening for transport dt_socket at address: 6000

It says that the JVM is listening at port 6000 for a connection. Now, start JDB (in another terminal) as:

    $jdb -attach 6000
    Set uncaught java.lang.Throwable
    Set deferred uncaught java.lang.Throwable
    Initializing jdb ...
    >
    VM Started: No frames on the current call stack
    main[1]

At this point, JDB is connected to the JVM. It is possible to do remote debugging with JDB. If the JVM is running on machine M1, and we want to run JDB on M2, then we can start JDB on M2 as:

    $jdb -attach M1:6000

JDB listening for a JVM to connect: In this option, JDB is started first, with a command of the form:

    $jdb -listen 6000
    Listening at address: adc2180852:6000

This makes JDB listen at port 6000 for a connection from the JVM. Now, start the JVM (from another terminal) as:

    $java -Xdebug -Xrunjdwp:transport=dt_socket,server=n,address=6000 A

Once the above command is run, we see the following in the JDB terminal:

    Set uncaught java.lang.Throwable
    Set deferred uncaught java.lang.Throwable
    Initializing jdb ...
    >
    VM Started: No frames on the current call stack
    main[1]

At this point, JDB has accepted the connection from the JVM. Here also, we can make the JVM running on machine M1 connect to a remote JDB running on machine M2, by starting the JVM as:

    $java -Xdebug -Xrunjdwp:transport=dt_socket,server=n,address=M2:6000 A
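The attach and listen modes above differ only in which side acts as the server and which address each side is given. As a small illustration, the following Python sketch assembles the same command lines shown in this section; the helper names jdwp_agent, jvm_command, and jdb_command are made up for this example, and the sketch only builds the argument lists without running anything.

```python
def jdwp_agent(server, address, transport="dt_socket"):
    """Build the -Xrunjdwp option; server=True makes the JVM listen (server=y)."""
    flag = "y" if server else "n"
    return f"-Xrunjdwp:transport={transport},server={flag},address={address}"


def jvm_command(main_class, server, address):
    """Command line that starts the debuggee JVM for the given main class."""
    return ["java", "-Xdebug", jdwp_agent(server, address), main_class]


def jdb_command(mode, address):
    """Command line that starts JDB in 'attach' or 'listen' mode."""
    if mode not in ("attach", "listen"):
        raise ValueError("mode must be 'attach' or 'listen'")
    return ["jdb", f"-{mode}", str(address)]


# JDB attaches to a JVM that is listening on port 6000.
print(" ".join(jvm_command("A", server=True, address=6000)))
print(" ".join(jdb_command("attach", 6000)))

# JDB on machine M2 listens; the JVM on M1 connects to it.
print(" ".join(jdb_command("listen", 6000)))
print(" ".join(jvm_command("A", server=False, address="M2:6000")))
```

The resulting lists could be passed to subprocess.run if you wanted to launch both sides from a script, though that is outside what this article covers.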


Pravin Dhandre

01 Aug 2018

5 min read

Writing PostGIS functions in Python language [Tutorial]


In this tutorial, you will learn to write a Python function for PostGIS and PostgreSQL using the PL/Python language and effective libraries like urllib2 and simplejson. You will use Python to query the http://openweathermap.org/ web services to get the weather for a PostGIS geometry from within a PostgreSQL function.

This tutorial is an excerpt from a book written by Mayra Zurbaran, Pedro Wightman, Paolo Corti, Stephen Mather, Thomas Kraft and Bborie Park titled PostGIS Cookbook - Second Edition.

Adding Python support to database

Verify your PostgreSQL server installation has PL/Python support. In Windows, this should be already included, but this is not the default if you are using, for example, Ubuntu 16.04 LTS, so you will most likely need to install it:

    $ sudo apt-get install postgresql-plpython-9.1

Install PL/Python on the database (you could consider installing it in your template1 database; in this way, every newly created database will have PL/Python support by default):

    $ psql -U me postgis_cookbook
    postgis_cookbook=# CREATE EXTENSION plpythonu;

You could alternatively add PL/Python support to your database using the createlang shell command (this is the only way if you are using PostgreSQL version 9.1 or lower):

    $ createlang plpythonu postgis_cookbook

How to do it...

Carry out the following steps:

In this tutorial, as with the previous one, you will use a http://openweathermap.org/ web service to get the temperature for a point from the closest weather station. The request you need to run (test it in a browser) is http://api.openweathermap.org/data/2.5/find?lat=55&lon=37&cnt=10&appid=YOURKEY. 
You should get the following JSON output (the data from the weather station closest to the point with the given longitude and latitude, from which you will read the temperature):

    {
      message: "",
      cod: "200",
      calctime: "",
      cnt: 1,
      list: [
        {
          id: 9191,
          dt: 1369343192,
          name: "100704-1",
          type: 2,
          coord: { lat: 13.7408, lon: 100.5478 },
          distance: 6.244,
          main: { temp: 300.37 },
          wind: { speed: 0, deg: 141 },
          rang: 30,
          rain: { 1h: 0, 24h: 3.302, today: 0 }
        }
      ]
    }

Create the following PostgreSQL function in Python, using the PL/Python language:

    CREATE OR REPLACE FUNCTION chp08.GetWeather(lon float, lat float)
      RETURNS float AS $$
      import urllib2
      import simplejson as json
      data = urllib2.urlopen(
        'http://api.openweathermap.org/data/2.1/find/station?lat=%s&lon=%s&cnt=1'
        % (lat, lon))
      js_data = json.load(data)
      # default result, returned when no usable temperature is found
      temperature = None
      if js_data['cod'] == '200':
        # only if cod is 200 we got some effective results
        if int(js_data['cnt'])>0:
          # check if we have at least a weather station
          station = js_data['list'][0]
          print 'Data from weather station %s' % station['name']
          if 'main' in station:
            if 'temp' in station['main']:
              # we want the temperature in Celsius
              temperature = station['main']['temp'] - 273.15
      return temperature
    $$ LANGUAGE plpythonu;

Now, test your function; for example, get the temperature from the weather station closest to Wat Pho Templum in Bangkok:

    postgis_cookbook=# SELECT chp08.GetWeather(100.49, 13.74);
     getweather
    ------------
          27.22
    (1 row)

If you want to get the temperature for the point features in a PostGIS table, you can use the coordinates of each feature's geometry:

    postgis_cookbook=# SELECT name, temperature,
      chp08.GetWeather(ST_X(the_geom), ST_Y(the_geom)) AS temperature2
      FROM chp08.cities LIMIT 5;
        name     | temperature | temperature2
    -------------+-------------+--------------
     Minneapolis |      275.15 |           15
     Saint Paul  |      274.15 |           16
     Buffalo     |      274.15 |        19.44
     New York    |      280.93 |        19.44
     Jersey City |      282.15 |        21.67
    (5 rows)

Now it would 
be nice if our function could accept not only the coordinates of a point, but also a true PostGIS geometry as an input parameter. For the temperature of a feature, you could return the temperature of the weather station closest to the centroid of the feature geometry. You can easily get this behavior using function overloading. Add a new function, with the same name, that accepts a PostGIS geometry directly as an input parameter. In the body of the function, call the previous function, passing the coordinates of the centroid of the geometry. Note that in this case, you can write the function without using Python, with the PL/pgSQL language:

CREATE OR REPLACE FUNCTION chp08.GetWeather(geom geometry)
  RETURNS float AS $$
BEGIN
  RETURN chp08.GetWeather(ST_X(ST_Centroid(geom)), ST_Y(ST_Centroid(geom)));
END;
$$ LANGUAGE plpgsql;

Now, test the function, passing a PostGIS geometry to it:

postgis_cookbook=# SELECT chp08.GetWeather(
  ST_GeomFromText('POINT(-71.064544 42.28787)'));
 getweather
------------
      23.89
(1 row)

If you use the function on a PostGIS layer, you can pass the features' geometries to the function directly, using the overloaded function written in the PL/pgSQL language:

postgis_cookbook=# SELECT name, temperature,
  chp08.GetWeather(the_geom) AS temperature2
  FROM chp08.cities LIMIT 5;
    name     | temperature | temperature2
-------------+-------------+--------------
 Minneapolis |      275.15 |        17.22
 Saint Paul  |      274.15 |           16
 Buffalo     |      274.15 |        18.89
 New York    |      280.93 |        19.44
 Jersey City |      282.15 |        21.67
(5 rows)

In this tutorial, you wrote a Python function in PostGIS, using the PL/Python language. Using Python inside PostgreSQL and PostGIS functions gives you the great advantage of being able to use any Python library you wish. Therefore, you will be able to write much more powerful functions compared to those written using the standard PL/pgSQL language.
In fact, in this case, you used the urllib2 and simplejson Python libraries to query a web service from within a PostgreSQL function, which would be impossible using plain PL/pgSQL. You have also seen how to overload a function in order to give its users a second way to call it, with a different set of input parameters. To get armed with all the tools and instructions you need for managing entire spatial database systems, read PostGIS Cookbook - Second Edition.
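The field checks inside GetWeather are easier to experiment with outside the database. The sketch below is a plain Python 3 restatement of that parsing logic, assuming the response has already been decoded into a dictionary shaped like the sample JSON shown earlier. The helper name extract_celsius and the constant KELVIN_OFFSET are illustrative and not part of the original recipe.

```python
KELVIN_OFFSET = 273.15  # the sample response reports temperatures in kelvin


def extract_celsius(js_data):
    """Return the first station's temperature in Celsius, or None.

    Hypothetical helper: it mirrors the checks in chp08.GetWeather but works
    on an already-decoded response, so no network access is needed.
    """
    if js_data.get('cod') != '200':
        return None  # the service did not report a successful lookup
    if int(js_data.get('cnt', 0)) <= 0:
        return None  # no weather station was found near the point
    station = js_data['list'][0]
    temp_kelvin = station.get('main', {}).get('temp')
    if temp_kelvin is None:
        return None  # the station exists but reports no temperature
    return temp_kelvin - KELVIN_OFFSET


# Trimmed copy of the sample response from the recipe.
sample = {
    'cod': '200',
    'cnt': 1,
    'list': [{'name': '100704-1', 'main': {'temp': 300.37}}],
}

print(round(extract_celsius(sample), 2))
```

Returning None on every failure path plays the same role as returning NULL from the database function, so a caller can tell "no reading" apart from a real temperature.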



Packt

22 Sep 2015

13 min read

Internet Connected Smart Water Meter



Alvin Ourrad

23 Mar 2015

6 min read

Making Games with Pixi.js


In this post I will introduce you to pixi.js, a super-fast rendering engine that is also a Swiss-army-knife tool with a friendly API.

What?

Pixi.js is a rendering engine that allows you to use the power of WebGL and canvas to render your content on your screen in a completely seamless way. In fact, pixi.js features both a WebGL and a canvas renderer, and can fall back to the latter for lower-end devices. You can then harness the power of WebGL and hardware-accelerated graphics on devices that are powerful enough to use it. If one of your users is on an older device, the engine falls back to the canvas renderer automatically and there is no difference for the person browsing your website, so you don't have to worry about those users any more.

WebGL for 2D?

If you have heard of or browsed a web product that was showcased as using WebGL, you probably have memories of a 3D game, a 3D earth visualization, or something similar. WebGL was originally highlighted and brought to the public for its capability to render 3D graphics in the browser, because it was the only way that was fast enough to do it. But the underlying technology is neither 3D-only nor 2D-only; you make it do what you want. The idea behind pixi.js was to bring this speed and quality of rendering to 2D graphics and games, and of course to the general public. You might argue that you do not need this level of accuracy and fine-grained control for 2D, and that the WebGL API might be a bit of an overhead for a 2D application. But with browsers becoming more powerful, users' expectations are getting higher and higher, and the speed of this technology allows you to compete with the applications that used to be Flash-only.

Tour/Overview

Pixi.js was created by a former Flash developer, so its syntax is very similar to ActionScript 3. Here is a little tour of the core components that you need to create when using pixi.
The renderer

I already gave you a description of its features and capabilities, so the only thing to bear in mind is that there are two ways of creating a renderer. You can specify the renderer that you want, or let the engine decide according to the current device.

// When you let the engine decide:
var renderer = PIXI.autoDetectRenderer(800, 600);

// When you specifically want one or the other renderer:
var renderer = new PIXI.WebGLRenderer(800, 600);
// and for canvas you'd write:
// var renderer = new PIXI.CanvasRenderer(800, 600);

The stage

Pixi mimics the Flash API in how it deals with object positioning. Basically, an object's coordinates are always relative to its parent container. Flash and pixi allow you to create special objects that are called containers. They are not images or graphics; they are abstract ways to group objects together. Say you have a landscape made of various things such as trees, rocks, and so on. If you add them to a container, you can move all of these objects together by moving the container.

Don't run away just yet, this is where the stage comes in. The Stage is the root container that everything is added to. The stage isn't meant to move, so when a sprite is added directly to the stage, you can be sure its position will be the same as its position on-screen (well, within your canvas).

// here is how you create a stage
var stage = new PIXI.Stage();

Let's make a thing

OK, enough of the scene-graph theory, it's time to make something. As I wrote before, pixi is a rendering engine, so you will need to tell the renderer to render its stage, otherwise nothing will happen.
So this is the bare-bones template you'll use for anything pixi:

// create a new instance of a pixi stage
var stage = new PIXI.Stage(0x0212223);

// create a renderer instance
var renderer = PIXI.autoDetectRenderer(window.innerWidth, window.innerHeight);

// add the renderer view element to the DOM
document.body.appendChild(renderer.view);

// create a new Sprite from an image
var bunny = PIXI.Sprite.fromImage("assets/bunny.png");
bunny.position.set(200, 230);
stage.addChild(bunny);

animate();

function animate() {
    // render the stage
    renderer.render(stage);
    requestAnimFrame(animate);
}

First, you create a renderer and a stage, just like I showed you before. Then you create the most important pixi object, a Sprite, which is basically an image rendered on your screen.

var sprite = PIXI.Sprite.fromImage("assets/image.png");

Sprites are the core of your game, and the thing you will use the most in pixi and in any major game framework. However, pixi is not really a game framework but sits a level lower, so you need to add your sprites to the stage manually. Whenever something is not visible, double-check that you have added it to the stage like this:

stage.addChild(sprite);

Then, you can create a function that creates a bunch of sprites.
function createParticles() {
    for (var i = 0; i < 40; i++) {
        // create a new Sprite from an image
        var bunny = PIXI.Sprite.fromImage("assets/bunny.png");
        bunny.xSpeed = (Math.random() * 20) - 10;
        bunny.ySpeed = (Math.random() * 20) - 10;
        bunny.tint = Math.random() * 0xffffff;
        bunny.rotation = Math.random() * 6;
        stage.addChild(bunny);
    }
}

And then, you can leverage the update loop to move these sprites around randomly:

// count is assumed to be a frame counter that is declared outside
// the loop and incremented once per frame
if (count > 10) {
    createParticles();
    count = 0;
}
// cap the number of sprites by dropping the oldest one
if (stage.children.length > 20000) {
    stage.children.shift();
}
for (var i = 0; i < stage.children.length; i++) {
    var sprite = stage.children[i];
    sprite.position.x += sprite.xSpeed;
    sprite.position.y += sprite.ySpeed;
    // wrap around once a sprite leaves the right or bottom edge
    if (sprite.position.x > renderer.width) {
        sprite.position.x = 0;
    }
    if (sprite.position.y > renderer.height) {
        sprite.position.y = 0;
    }
}

That's it for this blog post. Feel free to have a play with pixi and browse the dedicated website. Games development, web development, native apps... Visit our JavaScript page for more tutorials and content on the frameworks and tools essential for any software developer's toolkit.

About the author

Alvin is a web developer fond of the web and the power of open standards. A lover of open source, he likes experimenting with interactivity in the browser. He currently works as an HTML5 game developer.



Savia Lobo

03 Oct 2018

4 min read

“Intel ME has a Manufacturing Mode vulnerability, and even giant manufacturers like Apple are not immune,” say researchers


Yesterday, a group of European information security researchers announced that they have discovered a vulnerability in Intel's Management Engine (Intel ME). They say that the root of this problem is an undocumented Intel ME mode known as Manufacturing Mode. Undocumented commands enable overwriting SPI flash memory and implementing the "doomsday scenario": local exploitation of an ME vulnerability (INTEL-SA-00086).

What is Manufacturing Mode?

Intel ME Manufacturing Mode is intended for configuration and testing of the end platform during manufacturing. However, this mode and its potential risks are not described anywhere in Intel's public documentation. Ordinary users do not have the ability to disable this mode since the relevant utility (part of Intel ME System Tools) is not officially available. As a result, there is no software that can protect, or even notify, the user if this mode is enabled. This mode allows configuring critical platform settings stored in one-time-programmable memory (FUSEs). These settings include those for BootGuard (the mode, policy, and hash for the digital signing key for the ACM and UEFI modules). Some of them are referred to as FPFs (Field Programmable Fuses).

[Figure: An output of the -FPFs option in FPT]

In addition to FPFs, in Manufacturing Mode the hardware manufacturer can specify settings for Intel ME, which are stored in the Intel ME internal file system (MFS) on SPI flash memory. These parameters can be changed by reprogramming the SPI flash. The parameters are known as CVARs (Configurable NVARs, Named Variables). CVARs, just like FPFs, can be set and read via FPT.

Manufacturing Mode vulnerability in Intel chips within Apple laptops

The researchers analyzed several platforms from a number of manufacturers, including Lenovo and Apple MacBook Pro laptops. The Lenovo models did not have any issues related to Manufacturing Mode.
However, they found that the Intel chipsets in the Apple laptops were running in Manufacturing Mode and were therefore affected by the vulnerability CVE-2018-4251. This information was reported to Apple and the vulnerability was patched in June, in the macOS High Sierra update 10.13.5. By exploiting CVE-2018-4251, an attacker could write old versions of Intel ME (such as versions containing vulnerability INTEL-SA-00086) to memory without needing an SPI programmer and without physical access to the computer. Thus, a local vector is possible for exploitation of INTEL-SA-00086, which enables running arbitrary code in ME. The researchers have also stated that, in the notes for the INTEL-SA-00086 security bulletin, Intel does not mention an enabled Manufacturing Mode as a method for local exploitation in the absence of physical access. Instead, the company incorrectly claims that local exploitation is possible only if access settings for SPI regions have been misconfigured.

How can users save themselves from this vulnerability?

To keep users safe, the researchers decided to describe how to check the status of Manufacturing Mode and how to disable it. Intel System Tools includes MEInfo, which provides thorough diagnostic information about the current state of ME and the platform overall. They demonstrated this utility in their previous research about the undocumented HAP (High Assurance Platform) mode and showed how to disable ME. The utility, when called with the -FWSTS flag, displays a detailed description of the HECI status registers and the current status of Manufacturing Mode (when the fourth bit of the FWSTS status register is set, Manufacturing Mode is active).

[Figure: Example of MEInfo output]

They also created a program for checking the status of Manufacturing Mode if the user for whatever reason does not have access to Intel ME System Tools.
Here is what the script shows on affected systems:

[Figure: mmdetect script]

To disable Manufacturing Mode, FPT has a special option (-CLOSEMNF) that also sets the recommended access rights for SPI flash regions in the descriptor. Here is what happens when -CLOSEMNF is entered:

[Figure: Process of closing Manufacturing Mode with FPT]

Thus, the researchers demonstrated that Intel ME has a Manufacturing Mode problem. Even major commercial manufacturers such as Apple are not immune to configuration mistakes on Intel platforms. Also, there is no public information on the topic, leaving end users in the dark about weaknesses that could result in data theft, persistent irremovable rootkits, and even "bricking" of hardware. To learn about this vulnerability in detail, visit the Positive Research blog.


Packt

11 Aug 2015

17 min read

Divide and Conquer – Classification Using Decision Trees and Rules


In this article by Brett Lantz, author of the book Machine Learning with R, Second Edition, we will get a basic understanding of decision trees and rule learners, including the C5.0 decision tree algorithm. We will cover mechanisms such as choosing the best split and pruning the decision tree.

While deciding between several job offers with various levels of pay and benefits, many people begin by making lists of pros and cons, and eliminate options based on simple rules. For instance, "if I have to commute for more than an hour, I will be unhappy." Or, "if I make less than $50k, I won't be able to support my family." In this way, the complex and difficult decision of predicting one's future happiness can be reduced to a series of simple decisions.

This article covers decision trees and rule learners, two machine learning methods that also make complex decisions from sets of simple choices. These methods then present their knowledge in the form of logical structures that can be understood with no statistical knowledge. This aspect makes these models particularly useful for business strategy and process improvement.

By the end of this article, you will learn:
- How trees and rules "greedily" partition data into interesting segments
- The most common decision tree and classification rule learners, including the C5.0, 1R, and RIPPER algorithms

We will begin by examining decision trees, followed by a look at classification rules.

Understanding decision trees

Decision tree learners are powerful classifiers, which utilize a tree structure to model the relationships among the features and the potential outcomes. As illustrated in the following figure, this structure earned its name due to the fact that it mirrors how a literal tree begins at a wide trunk, which if followed upward, splits into narrower and narrower branches.
In much the same way, a decision tree classifier uses a structure of branching decisions, which channel examples into a final predicted class value. To better understand how this works in practice, let's consider the following tree, which predicts whether a job offer should be accepted. A job offer to be considered begins at the root node, where it is then passed through decision nodes that require choices to be made based on the attributes of the job. These choices split the data across branches that indicate potential outcomes of a decision, depicted here as yes or no outcomes, though in some cases there may be more than two possibilities. In the case a final decision can be made, the tree is terminated by leaf nodes (also known as terminal nodes) that denote the action to be taken as the result of the series of decisions. In the case of a predictive model, the leaf nodes provide the expected result given the series of events in the tree. A great benefit of decision tree algorithms is that the flowchart-like tree structure is not necessarily exclusively for the learner's internal use. After the model is created, many decision tree algorithms output the resulting structure in a human-readable format. This provides tremendous insight into how and why the model works or doesn't work well for a particular task. This also makes decision trees particularly appropriate for applications in which the classification mechanism needs to be transparent for legal reasons, or in case the results need to be shared with others in order to inform future business practices. 
With this in mind, some potential uses include:
- Credit scoring models in which the criteria that cause an applicant to be rejected need to be clearly documented and free from bias
- Marketing studies of customer behavior such as satisfaction or churn, which will be shared with management or advertising agencies
- Diagnosis of medical conditions based on laboratory measurements, symptoms, or the rate of disease progression

Although the previous applications illustrate the value of trees in informing decision processes, this is not to suggest that their utility ends here. In fact, decision trees are perhaps the single most widely used machine learning technique, and can be applied to model almost any type of data, often with excellent out-of-the-box applications. This said, in spite of their wide applicability, it is worth noting some scenarios where trees may not be an ideal fit. One such case might be a task where the data has a large number of nominal features with many levels or a large number of numeric features. These cases may result in a very large number of decisions and an overly complex tree. They may also contribute to the tendency of decision trees to overfit data, though as we will soon see, even this weakness can be overcome by adjusting some simple parameters.

Divide and conquer

Decision trees are built using a heuristic called recursive partitioning. This approach is also commonly known as divide and conquer because it splits the data into subsets, which are then split repeatedly into even smaller subsets, and so on until the algorithm determines that the data within the subsets are sufficiently homogeneous, or another stopping criterion has been met. To see how splitting a dataset can create a decision tree, imagine a bare root node that will grow into a mature tree. At first, the root node represents the entire dataset, since no splitting has transpired.
Next, the decision tree algorithm must choose a feature to split upon; ideally, it chooses the feature most predictive of the target class. The examples are then partitioned into groups according to the distinct values of this feature, and the first set of tree branches are formed. Working down each branch, the algorithm continues to divide and conquer the data, choosing the best candidate feature each time to create another decision node, until a stopping criterion is reached. Divide and conquer might stop at a node in a case that: All (or nearly all) of the examples at the node have the same class There are no remaining features to distinguish among the examples The tree has grown to a predefined size limit To illustrate the tree building process, let's consider a simple example. Imagine that you work for a Hollywood studio, where your role is to decide whether the studio should move forward with producing the screenplays pitched by promising new authors. After returning from a vacation, your desk is piled high with proposals. Without the time to read each proposal cover-to-cover, you decide to develop a decision tree algorithm to predict whether a potential movie would fall into one of three categories: Critical Success, Mainstream Hit, or Box Office Bust. To build the decision tree, you turn to the studio archives to examine the factors leading to the success and failure of the company's 30 most recent releases. You quickly notice a relationship between the film's estimated shooting budget, the number of A-list celebrities lined up for starring roles, and the level of success. Excited about this finding, you produce a scatterplot to illustrate the pattern: Using the divide and conquer strategy, we can build a simple decision tree from this data. 
First, to create the tree's root node, we split the feature indicating the number of celebrities, partitioning the movies into groups with and without a significant number of A-list stars.

Next, among the group of movies with a larger number of celebrities, we can make another split between movies with and without a high budget.

At this point, we have partitioned the data into three groups. The group at the top-left corner of the diagram is composed entirely of critically acclaimed films. This group is distinguished by a high number of celebrities and a relatively low budget. At the top-right corner, the majority of the movies are box office hits with high budgets and a large number of celebrities. The final group, which has little star power but budgets ranging from small to large, contains the flops.

If we wanted, we could continue to divide and conquer the data by splitting it based on increasingly specific ranges of budget and celebrity count, until each of the currently misclassified values resides in its own tiny partition, and is correctly classified. However, it is not advisable to overfit a decision tree in this way. Though there is nothing to stop us from splitting the data indefinitely, overly specific decisions do not always generalize more broadly. We'll avoid the problem of overfitting by stopping the algorithm here, since more than 80 percent of the examples in each group are from a single class. This forms the basis of our stopping criterion.

You might have noticed that diagonal lines might have split the data even more cleanly. This is one limitation of the decision tree's knowledge representation, which uses axis-parallel splits. The fact that each split considers one feature at a time prevents the decision tree from forming more complex decision boundaries. For example, a diagonal line could be created by a decision that asks, "is the number of celebrities greater than the estimated budget?" If so, then "it will be a critical success."
Our model for predicting the future success of movies can be represented in a simple tree, as shown in the following diagram. To evaluate a script, follow the branches through each decision until the script's success or failure has been predicted. In no time, you will be able to identify the most promising options among the backlog of scripts and get back to more important work, such as writing an Academy Awards acceptance speech.

Since real-world data contains more than two features, decision trees quickly become far more complex than this, with many more nodes, branches, and leaves. In the next section, you will learn about a popular algorithm to build decision tree models automatically.

The C5.0 decision tree algorithm

There are numerous implementations of decision trees, but one of the most well-known implementations is the C5.0 algorithm. This algorithm was developed by computer scientist J. Ross Quinlan as an improved version of his prior algorithm, C4.5, which itself is an improvement over his Iterative Dichotomiser 3 (ID3) algorithm. Although Quinlan markets C5.0 to commercial clients (see http://www.rulequest.com/ for details), the source code for a single-threaded version of the algorithm was made publicly available, and it has therefore been incorporated into programs such as R.

To further confuse matters, a popular Java-based open source alternative to C4.5, titled J48, is included in R's RWeka package. Because the differences among C5.0, C4.5, and J48 are minor, the principles in this article will apply to any of these three methods, and the algorithms should be considered synonymous.

The C5.0 algorithm has become the industry standard to produce decision trees, because it does well for most types of problems directly out of the box. Compared to other advanced machine learning models, the decision trees built by C5.0 generally perform nearly as well, but are much easier to understand and deploy.
Additionally, as shown in the following table, the algorithm's weaknesses are relatively minor and can be largely avoided:

Strengths:
- An all-purpose classifier that does well on most problems
- Highly automatic learning process, which can handle numeric or nominal features, as well as missing data
- Excludes unimportant features
- Can be used on both small and large datasets
- Results in a model that can be interpreted without a mathematical background (for relatively small trees)
- More efficient than other complex models

Weaknesses:
- Decision tree models are often biased toward splits on features having a large number of levels
- It is easy to overfit or underfit the model
- Can have trouble modeling some relationships due to reliance on axis-parallel splits
- Small changes in the training data can result in large changes to decision logic
- Large trees can be difficult to interpret and the decisions they make may seem counterintuitive

To keep things simple, our earlier decision tree example ignored the mathematics involved in how a machine would employ a divide and conquer strategy. Let's explore this in more detail to examine how this heuristic works in practice.

Choosing the best split

The first challenge that a decision tree will face is to identify which feature to split upon. In the previous example, we looked for a way to split the data such that the resulting partitions contained examples primarily of a single class. The degree to which a subset of examples contains only a single class is known as purity, and any subset composed of only a single class is called pure. There are various measurements of purity that can be used to identify the best decision tree splitting candidate. C5.0 uses entropy, a concept borrowed from information theory that quantifies the randomness, or disorder, within a set of class values. Sets with high entropy are very diverse and provide little information about other items that may also belong in the set, as there is no apparent commonality.
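As a rough illustration of how this measure behaves, the sketch below computes the Shannon entropy (in bits) of a list of class labels in Python, together with the information gain of a split, which the text goes on to define as the entropy before the split minus the weighted entropy after it. The article's own snippets use R; the function names and the toy label lists here are made up for this example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    # Sum -p * log2(p) over the classes that actually occur.
    return sum(-(n / total) * log2(n / total) for n in counts.values())


def information_gain(parent, partitions):
    """Entropy of the parent minus the weighted entropy of its partitions."""
    total = len(parent)
    after = sum(len(part) / total * entropy(part) for part in partitions)
    return entropy(parent) - after


pure = ['red'] * 10                   # a single class
mixed = ['red'] * 6 + ['white'] * 4   # 60 percent red, 40 percent white
even = ['red'] * 5 + ['white'] * 5    # a 50-50 split

print(entropy(pure), entropy(mixed), entropy(even))

# A split that separates the two classes perfectly removes all entropy.
print(information_gain(even, [['red'] * 5, ['white'] * 5]))
```

The pure set scores 0, the 60/40 set about 0.971, and the even split reaches the two-class maximum of 1. The perfect split at the end recovers all of that, so its information gain equals the parent's entropy.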
The decision tree hopes to find splits that reduce entropy, ultimately increasing homogeneity within the groups. Typically, entropy is measured in bits. If there are only two possible classes, entropy values can range from 0 to 1. For n classes, entropy ranges from 0 to log2(n). In each case, the minimum value indicates that the sample is completely homogeneous, while the maximum value indicates that the data are as diverse as possible, and no group has even a small plurality. In mathematical notation, entropy is specified as follows:

$$\text{Entropy}(S) = \sum_{i=1}^{c} -p_i \log_2(p_i)$$

In this formula, for a given segment of data (S), the term c refers to the number of class levels and p_i refers to the proportion of values falling into class level i. For example, suppose we have a partition of data with two classes: red (60 percent) and white (40 percent). We can calculate the entropy as follows:

> -0.60 * log2(0.60) - 0.40 * log2(0.40)
[1] 0.9709506

We can examine the entropy for all the possible two-class arrangements. If we know that the proportion of examples in one class is x, then the proportion in the other class is (1 - x). Using the curve() function, we can then plot the entropy for all the possible values of x:

> curve(-x * log2(x) - (1 - x) * log2(1 - x),
        col = "red", xlab = "x", ylab = "Entropy", lwd = 4)

This results in the following figure:

As illustrated by the peak in entropy at x = 0.50, a 50-50 split results in maximum entropy. As one class increasingly dominates the other, the entropy reduces to zero.

To use entropy to determine the optimal feature to split upon, the algorithm calculates the change in homogeneity that would result from a split on each possible feature, which is a measure known as information gain. The information gain for a feature F is calculated as the difference between the entropy in the segment before the split (S1) and the partitions resulting from the split (S2):

$$\text{InfoGain}(F) = \text{Entropy}(S_1) - \text{Entropy}(S_2)$$

One complication is that after a split, the data is divided into more than one partition.
Therefore, the function to calculate Entropy(S2) needs to consider the total entropy across all of the partitions. It does this by weighting each partition's entropy by the proportion of records falling into the partition. This can be stated in a formula as:

$$\text{Entropy}(S) = \sum_{i=1}^{n} w_i \, \text{Entropy}(P_i)$$

In simple terms, the total entropy resulting from a split is the sum of the entropy of each of the n partitions (P_i) weighted by the proportion of examples falling in the partition (w_i).

The higher the information gain, the better a feature is at creating homogeneous groups after a split on this feature. If the information gain is zero, there is no reduction in entropy for splitting on this feature. On the other hand, the maximum information gain is equal to the entropy prior to the split. This would imply that the entropy after the split is zero, which means that the split results in completely homogeneous groups.

The previous formulae assume nominal features, but decision trees use information gain for splitting on numeric features as well. To do so, a common practice is to test various splits that divide the values into groups greater than or less than a numeric threshold. This reduces the numeric feature into a two-level categorical feature that allows information gain to be calculated as usual. The numeric cut point yielding the largest information gain is chosen for the split.

Though it is used by C5.0, information gain is not the only splitting criterion that can be used to build decision trees. Other commonly used criteria are Gini index, Chi-Squared statistic, and gain ratio. For a review of these (and many more) criteria, refer to Mingers J. An Empirical Comparison of Selection Measures for Decision-Tree Induction. Machine Learning. 1989; 3:319-342.

Pruning the decision tree

A decision tree can continue to grow indefinitely, choosing splitting features and dividing the data into smaller and smaller partitions until each example is perfectly classified or the algorithm runs out of features to split on.
However, if the tree grows overly large, many of the decisions it makes will be overly specific and the model will be overfitted to the training data. The process of pruning a decision tree involves reducing its size such that it generalizes better to unseen data. One solution to this problem is to stop the tree from growing once it reaches a certain number of decisions or when the decision nodes contain only a small number of examples. This is called early stopping or pre-pruning the decision tree. As the tree avoids doing needless work, this is an appealing strategy. However, one downside to this approach is that there is no way to know whether the tree will miss subtle, but important patterns that it would have learned had it grown to a larger size. An alternative, called post-pruning, involves growing a tree that is intentionally too large and pruning leaf nodes to reduce the size of the tree to a more appropriate level. This is often a more effective approach than pre-pruning, because it is quite difficult to determine the optimal depth of a decision tree without growing it first. Pruning the tree later on allows the algorithm to be certain that all the important data structures were discovered. The implementation details of pruning operations are very technical and beyond the scope of this article. For a comparison of some of the available methods, see Esposito F, Malerba D, Semeraro G. A Comparative Analysis of Methods for Pruning Decision Trees. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1997;19: 476-491. One of the benefits of the C5.0 algorithm is that it is opinionated about pruning—it takes care of many decisions automatically using fairly reasonable defaults. Its overall strategy is to post-prune the tree. It first grows a large tree that overfits the training data. Later, the nodes and branches that have little effect on the classification errors are removed. 
In some cases, entire branches are moved further up the tree or replaced by simpler decisions. These processes of grafting branches are known as subtree raising and subtree replacement, respectively.

Balancing overfitting and underfitting a decision tree is a bit of an art, but if model accuracy is vital, it may be worth investing some time with various pruning options to see if it improves the performance on test data. As you will soon see, one of the strengths of the C5.0 algorithm is that it is very easy to adjust the training options.

Summary

This article covered two classification methods that use so-called "greedy" algorithms to partition the data according to feature values. Decision trees use a divide-and-conquer strategy to create flowchart-like structures, while rule learners separate and conquer data to identify logical if-else rules. Both methods produce models that can be interpreted without a statistical background.

One popular and highly configurable decision tree algorithm is C5.0. We used the C5.0 algorithm to create a tree to predict whether a loan applicant will default. This article merely scratched the surface of how trees and rules can be used.

Resources for Article

Further resources on this subject:

- Introduction to S4 Classes [article]
- First steps with R [article]
- Supervised learning [article]
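As a recap of the splitting criterion discussed in this article, the weighted entropy after a split, the resulting information gain, and the threshold search for a numeric feature can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how C5.0 itself is implemented: all function names are hypothetical, entropy is measured in bits (log base 2), and candidate thresholds are taken midway between adjacent distinct values.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy, in bits, of a sequence of class labels (log base 2 is an assumption)."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def split_entropy(partitions):
    """Entropy after a split: each partition's entropy weighted by its share of examples."""
    total = sum(len(p) for p in partitions)
    return sum((len(p) / total) * entropy(p) for p in partitions)


def information_gain(labels, partitions):
    """Reduction in entropy achieved by dividing `labels` into `partitions`."""
    return entropy(labels) - split_entropy(partitions)


def best_numeric_split(values, labels):
    """Return the (threshold, gain) pair with the largest information gain.

    Assumption: thresholds are tried midway between adjacent distinct values.
    Returns (None, 0.0) when no threshold reduces the entropy.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best_threshold, best_gain = None, 0.0
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [lab for val, lab in pairs if val <= threshold]
        right = [lab for val, lab in pairs if val > threshold]
        gain = information_gain(labels, [left, right])
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Toy data invented for illustration only.
toy_values = [1, 2, 3, 4]
toy_labels = ["no", "no", "yes", "yes"]
print(best_numeric_split(toy_values, toy_labels))
```

On this toy data, the cut point between 2 and 3 separates the two classes completely, so the entropy after the split is zero and the information gain equals the entropy before the split, which is the maximum described earlier.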


Packt

30 Dec 2010

7 min read

Working with Geo-Spatial Data in Python


Python Geospatial Development

If you want to follow through the examples in this article, make sure you have the following Python libraries installed on your computer:

- GDAL/OGR version 1.7 or later (http://gdal.org)
- pyproj version 1.8.6 or later (http://code.google.com/p/pyproj)
- Shapely version 1.2 or later (http://trac.gispython.org/lab/wiki/Shapely)

Reading and writing geo-spatial data

In this section, we will look at some examples of tasks you might want to perform that involve reading and writing geo-spatial data in both vector and raster format.

Task: Calculate the bounding box for each country in the world

In this slightly contrived example, we will make use of a Shapefile to calculate the minimum and maximum latitude/longitude values for each country in the world. This "bounding box" can be used, among other things, to generate a map of a particular country. For example, the bounding box for Turkey would look like this:

Start by downloading the World Borders Dataset from: http://thematicmapping.org/downloads/world_borders.php

Decompress the .zip archive and place the various files that make up the Shapefile (the .dbf, .prj, .shp, and .shx files) together in a suitable directory.

We next need to create a Python program that can read the borders of each country. Fortunately, using OGR to read through the contents of a Shapefile is trivial:

    import osgeo.ogr

    shapefile = osgeo.ogr.Open("TM_WORLD_BORDERS-0.3.shp")
    layer = shapefile.GetLayer(0)

    # Visit every feature (one per country) in the layer.
    for i in range(layer.GetFeatureCount()):
        feature = layer.GetFeature(i)

The feature consists of a geometry and a set of fields. For this data, the geometry is a polygon that defines the outline of the country, while the fields contain various pieces of information about the country. According to the Readme.txt file, the fields in this Shapefile include the ISO-3166 three-letter code for the country (in a field named ISO3) as well as the name for the country (in a field named NAME).
This allows us to obtain the country code and name like this:

    countryCode = feature.GetField("ISO3")
    countryName = feature.GetField("NAME")

We can also obtain the country's border polygon using:

    geometry = feature.GetGeometryRef()

There are all sorts of things we can do with this geometry, but in this case we want to obtain the bounding box or envelope for the polygon:

    minLong,maxLong,minLat,maxLat = geometry.GetEnvelope()

Let's put all this together into a complete working program:

    # calcBoundingBoxes.py

    import osgeo.ogr

    shapefile = osgeo.ogr.Open("TM_WORLD_BORDERS-0.3.shp")
    layer = shapefile.GetLayer(0)

    countries = [] # List of (name,code,minLat,maxLat,
                   # minLong,maxLong) tuples.

    for i in range(layer.GetFeatureCount()):
        feature = layer.GetFeature(i)
        countryCode = feature.GetField("ISO3")
        countryName = feature.GetField("NAME")
        geometry = feature.GetGeometryRef()
        minLong,maxLong,minLat,maxLat = geometry.GetEnvelope()

        countries.append((countryName, countryCode,
                          minLat, maxLat, minLong, maxLong))

    # Sort by country name, the first element of each tuple.
    countries.sort()

    for name,code,minLat,maxLat,minLong,maxLong in countries:
        print("%s (%s) lat=%0.4f..%0.4f, long=%0.4f..%0.4f"
              % (name, code, minLat, maxLat, minLong, maxLong))

Running this program produces the following output:

    % python calcBoundingBoxes.py
    Afghanistan (AFG) lat=29.4061..38.4721, long=60.5042..74.9157
    Albania (ALB) lat=39.6447..42.6619, long=19.2825..21.0542
    Algeria (DZA) lat=18.9764..37.0914, long=-8.6672..11.9865
    ...

Task: Save the country bounding boxes into a Shapefile

While the previous example simply printed out the latitude and longitude values, it might be more useful to draw the bounding boxes onto a map. To do this, we have to convert the bounding boxes into polygons, and save these polygons into a Shapefile.

Creating a Shapefile involves the following steps:

Define the spatial reference used by the Shapefile's data. In this case, we'll use the WGS84 datum and unprojected geographic coordinates (that is, latitude and longitude values).
This is how you would define this spatial reference using OGR:

    import osgeo.osr

    spatialReference = osgeo.osr.SpatialReference()
    spatialReference.SetWellKnownGeogCS('WGS84')

We can now create the Shapefile itself using this spatial reference:

    import osgeo.ogr

    driver = osgeo.ogr.GetDriverByName("ESRI Shapefile")
    dstFile = driver.CreateDataSource("boundingBoxes.shp")
    dstLayer = dstFile.CreateLayer("layer", spatialReference)

After creating the Shapefile, you next define the various fields that will hold the metadata for each feature. In this case, let's add two fields to store the country name and its ISO-3166 code:

    fieldDef = osgeo.ogr.FieldDefn("COUNTRY", osgeo.ogr.OFTString)
    fieldDef.SetWidth(50)
    dstLayer.CreateField(fieldDef)

    fieldDef = osgeo.ogr.FieldDefn("CODE", osgeo.ogr.OFTString)
    fieldDef.SetWidth(3)
    dstLayer.CreateField(fieldDef)

We now need to create the geometry for each feature: in this case, a polygon defining the country's bounding box. A polygon consists of one or more linear rings; the first linear ring defines the exterior of the polygon, while additional rings define "holes" inside the polygon. In this case, we want a simple polygon with a rectangular exterior and no holes:

    linearRing = osgeo.ogr.Geometry(osgeo.ogr.wkbLinearRing)
    linearRing.AddPoint(minLong, minLat)
    linearRing.AddPoint(maxLong, minLat)
    linearRing.AddPoint(maxLong, maxLat)
    linearRing.AddPoint(minLong, maxLat)
    linearRing.AddPoint(minLong, minLat)

    polygon = osgeo.ogr.Geometry(osgeo.ogr.wkbPolygon)
    polygon.AddGeometry(linearRing)

You may have noticed that the coordinate (minLong, minLat) was added to the linear ring twice. This is because we are defining line segments rather than just points: the first call to AddPoint() defines the starting point, and each subsequent call to AddPoint() adds a new line segment to the linear ring.
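To make the closing of the ring concrete, here is a minimal sketch in plain Python that does not use OGR at all. The helper names `bounding_box_ring` and `ring_to_wkt` are hypothetical, the coordinates are made up rather than taken from the World Borders Dataset, and the output uses the Well-Known Text (WKT) notation for polygons purely as a readable way to inspect the result.

```python
def bounding_box_ring(min_long, max_long, min_lat, max_lat):
    """Return the corners of a bounding box as a closed ring of (long, lat) points.

    The arguments follow the same order as the envelope values used above.
    The first point is repeated at the end so that the last segment closes the ring.
    """
    return [
        (min_long, min_lat),  # lower-left corner (starting point)
        (max_long, min_lat),  # lower-right corner
        (max_long, max_lat),  # upper-right corner
        (min_long, max_lat),  # upper-left corner
        (min_long, min_lat),  # back to the starting point
    ]


def ring_to_wkt(ring):
    """Format a closed ring as a WKT polygon with a single exterior ring and no holes."""
    coords = ", ".join("%s %s" % (x, y) for x, y in ring)
    return "POLYGON ((%s))" % coords


# Made-up envelope values, for illustration only.
ring = bounding_box_ring(19.0, 21.0, 39.0, 42.0)
print(ring[0] == ring[-1])   # the ring is closed
print(len(ring) - 1)         # number of line segments
print(ring_to_wkt(ring))
```

Five points are needed to describe the four sides of the box, because each point after the first contributes one line segment.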
For our bounding box, we start in the lower-left corner and move counter-clockwise around the bounding box until we reach the lower-left corner again:

Once we have the polygon, we can use it to create a feature:

    feature = osgeo.ogr.Feature(dstLayer.GetLayerDefn())
    feature.SetGeometry(polygon)
    feature.SetField("COUNTRY", countryName)
    feature.SetField("CODE", countryCode)
    dstLayer.CreateFeature(feature)
    feature.Destroy()

Notice how we use the SetField() method to store the feature's metadata. We also have to call the Destroy() method to close the feature once we have finished with it; this ensures that the feature is saved into the Shapefile.

Finally, we call the Destroy() method to close the output Shapefile:

    dstFile.Destroy()

Putting all this together, and combining it with the code from the previous task to calculate the bounding boxes for each country in the World Borders Dataset Shapefile, we end up with the following complete program:

    # boundingBoxesToShapefile.py

    import os, os.path, shutil

    import osgeo.ogr
    import osgeo.osr

    # Open the source shapefile.

    srcFile = osgeo.ogr.Open("TM_WORLD_BORDERS-0.3.shp")
    srcLayer = srcFile.GetLayer(0)

    # Open the output shapefile.

    if os.path.exists("bounding-boxes"):
        shutil.rmtree("bounding-boxes")
    os.mkdir("bounding-boxes")

    spatialReference = osgeo.osr.SpatialReference()
    spatialReference.SetWellKnownGeogCS('WGS84')

    driver = osgeo.ogr.GetDriverByName("ESRI Shapefile")
    dstPath = os.path.join("bounding-boxes", "boundingBoxes.shp")
    dstFile = driver.CreateDataSource(dstPath)
    dstLayer = dstFile.CreateLayer("layer", spatialReference)

    fieldDef = osgeo.ogr.FieldDefn("COUNTRY", osgeo.ogr.OFTString)
    fieldDef.SetWidth(50)
    dstLayer.CreateField(fieldDef)

    fieldDef = osgeo.ogr.FieldDefn("CODE", osgeo.ogr.OFTString)
    fieldDef.SetWidth(3)
    dstLayer.CreateField(fieldDef)

    # Read the country features from the source shapefile.

    for i in range(srcLayer.GetFeatureCount()):
        feature = srcLayer.GetFeature(i)
        countryCode = feature.GetField("ISO3")
        countryName = feature.GetField("NAME")
        geometry = feature.GetGeometryRef()
        minLong,maxLong,minLat,maxLat = geometry.GetEnvelope()

        # Save the bounding box as a feature in the output
        # shapefile.

        linearRing = osgeo.ogr.Geometry(osgeo.ogr.wkbLinearRing)
        linearRing.AddPoint(minLong, minLat)
        linearRing.AddPoint(maxLong, minLat)
        linearRing.AddPoint(maxLong, maxLat)
        linearRing.AddPoint(minLong, maxLat)
        linearRing.AddPoint(minLong, minLat)

        polygon = osgeo.ogr.Geometry(osgeo.ogr.wkbPolygon)
        polygon.AddGeometry(linearRing)

        feature = osgeo.ogr.Feature(dstLayer.GetLayerDefn())
        feature.SetGeometry(polygon)
        feature.SetField("COUNTRY", countryName)
        feature.SetField("CODE", countryCode)
        dstLayer.CreateFeature(feature)
        feature.Destroy()

    # All done.

    srcFile.Destroy()
    dstFile.Destroy()

The only unexpected twist in this program is the use of a sub-directory called bounding-boxes to store the output Shapefile. Because a Shapefile is actually made up of multiple files on disk (a .dbf file, a .prj file, a .shp file, and a .shx file), it is easier to place these together in a sub-directory. We use the Python Standard Library module shutil to delete the previous contents of this directory, and then os.mkdir() to create it again.

If you aren't storing the TM_WORLD_BORDERS-0.3.shp Shapefile in the same directory as the script itself, you will need to add the directory where the Shapefile is stored to your osgeo.ogr.Open() call. You can also store the boundingBoxes.shp Shapefile in a different directory if you prefer, by changing the path where this Shapefile is created.

Running this program creates the bounding box Shapefile, which we can then draw onto a map. For example, here is the outline of Thailand along with a bounding box taken from the boundingBoxes.shp Shapefile:
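Conceptually, the envelope used throughout this article is just the smallest and largest coordinate found along an outline. The following plain-Python sketch illustrates that idea without OGR; the function names `envelope` and `box_contains` are hypothetical, the outline is made up, and the return order (min_long, max_long, min_lat, max_lat) is chosen to mirror the way the envelope values are unpacked in the programs above.

```python
def envelope(points):
    """Return (min_long, max_long, min_lat, max_lat) for a list of (long, lat) points."""
    longs = [point[0] for point in points]
    lats = [point[1] for point in points]
    return min(longs), max(longs), min(lats), max(lats)


def box_contains(box, lon, lat):
    """Check whether a (lon, lat) position falls inside a bounding box."""
    min_long, max_long, min_lat, max_lat = box
    return min_long <= lon <= max_long and min_lat <= lat <= max_lat


# A made-up four-cornered outline, not taken from the World Borders Dataset.
outline = [(2.0, 1.0), (5.0, 3.0), (4.0, 6.0), (1.0, 4.0)]

box = envelope(outline)
print(box)
print(box_contains(box, 1.5, 5.5))
```

A bounding box generally covers more area than the outline it encloses, so a position can fall inside the box without being inside the country itself. Note also that this naive minimum/maximum sketch makes no attempt to handle outlines that cross the 180-degree meridian.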
