Listening Out

Packt
17 Aug 2015
14 min read
In this article by Mat Johns, author of the book Getting Started with Hazelcast - Second Edition, we will learn about the following topics:

- Creating and using collection listeners
- Instance, lifecycle, and cluster membership listeners
- The partition migration listener
- Quorum functions and listeners

Listening to the goings-on

One great feature of Hazelcast is its ability to notify us of the goings-on of our persisted data and the cluster as a whole, allowing us to register an interest in events. The listener concept is borrowed from Java, so you should feel pretty familiar with it. To provide this, there are a number of listener interfaces that we can implement to receive, process, and handle different types of events, one of which we have previously encountered. The following are the listener interfaces:

Collection listeners:

- EntryListener is used for map-based (IMap and MultiMap) events
- ItemListener is used for flat collection-based (IList, ISet, and IQueue) events
- MessageListener is used to receive topic events, but as we've seen before, it is used as part of the standard operation of topics
- QuorumListener is used for quorum state change events

Cluster listeners:

- DistributedObjectListener is used for collection creation and destruction events
- MembershipListener is used for cluster membership events
- LifecycleListener is used for local node state events
- MigrationListener is used for partition migration state events

The sound of our own data

Being notified about data changes can be rather useful, as we can make an application-level decision about whether the change is important or not and react accordingly. The first interface that we are going to look at is EntryListener. This class will notify us when changes are made to the entries stored in a map collection. If we take a look at the interface, we can see four entry event types and two map-wide events that we will be notified about. EntryListener has also been broken up into a number of individual super MapListener interfaces, so should we be interested in only a subset of event types, we can implement the appropriate super interfaces as required. Let's take a look at the following code:

void entryAdded(EntryEvent<K, V> event);
void entryRemoved(EntryEvent<K, V> event);
void entryUpdated(EntryEvent<K, V> event);
void entryEvicted(EntryEvent<K, V> event);
void mapCleared(MapEvent event);
void mapEvicted(MapEvent event);

Hopefully, the first three are pretty self-explanatory. However, the fourth is a little less clear and, in fact, one of the most useful. The entryEvicted method is invoked when an entry is removed from a map non-programmatically (that is, Hazelcast has done it all by itself). This will occur in one of the following two scenarios:

- An entry's TTL has been reached and the entry has expired
- The map size, according to the configured policy, has been reached, and the appropriate eviction policy has kicked in to clear out space in the map

The first scenario gives us a capability that is very rarely found in data sources: to have our application be told when a time-bound record has expired and to be able to trigger some behavior based on it. For example, we can use it to automatically trigger a teardown operation if an entry is not correctly maintained by a user's interactions. This allows us to generate an event based on the absence of activity, which is rather useful!
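The Berlin entry that expires after 10 seconds in the example that follows is presumably written with a per-entry TTL; a minimal sketch of that, assuming the capitals map used throughout the example, could look like the following (the TTL could equally be configured map-wide with time-to-live-seconds in hazelcast.xml):

import java.util.concurrent.TimeUnit;

IMap<String, String> capitals = hz.getMap("capitals");

// Expire this entry automatically after 10 seconds; once it is evicted,
// any registered EntryListener will receive the entryEvicted callback.
capitals.put("DE", "Berlin", 10, TimeUnit.SECONDS);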
Let's create an example MapEntryListener class to illustrate the various events firing off:

public class MapEntryListener implements EntryListener<String, String> {

  @Override
  public void entryAdded(EntryEvent<String, String> event) {
    System.err.println("Added: " + event);
  }

  @Override
  public void entryRemoved(EntryEvent<String, String> event) {
    System.err.println("Removed: " + event);
  }

  @Override
  public void entryUpdated(EntryEvent<String, String> event) {
    System.err.println("Updated: " + event);
  }

  @Override
  public void entryEvicted(EntryEvent<String, String> event) {
    System.err.println("Evicted: " + event);
  }

  @Override
  public void mapCleared(MapEvent event) {
    System.err.println("Map Cleared: " + event);
  }

  @Override
  public void mapEvicted(MapEvent event) {
    System.err.println("Map Evicted: " + event);
  }
}

We shall see the various events firing off as expected, with a short 10-second wait for the Berlin entry to expire, which will trigger the eviction event, as follows:

Added: EntryEvent {c:capitals} key=GB, oldValue=null, value=Winchester, event=ADDED, by Member [127.0.0.1]:5701 this
Updated: EntryEvent {c:capitals} key=GB, oldValue=Winchester, value=London, event=UPDATED, by Member [127.0.0.1]:5701 this
Added: EntryEvent {c:capitals} key=DE, oldValue=null, value=Berlin, event=ADDED, by Member [127.0.0.1]:5701 this
Removed: EntryEvent {c:capitals} key=GB, oldValue=null, value=London, event=REMOVED, by Member [127.0.0.1]:5701 this
Evicted: EntryEvent {c:capitals} key=DE, oldValue=null, value=Berlin, event=EVICTED, by Member [127.0.0.1]:5701 this

We can obviously implement the interface as extensively as needed to service our application, potentially creating no-op stubs should we wish not to handle a particular type of event.

Continuously querying

The previous example focuses on notifying us of all the entry events. However, what if we were only interested in some particular data? We could obviously filter events in our listener so that it only handles the entries we are interested in. However, it is potentially expensive to have all the events flying around the cluster, especially if our interest lies only in a minority of the potential data. To address this, we can combine capabilities from the map-searching features that we looked at a while back. When we register the entry listener with a collection, we can optionally provide a search Predicate that is used as a filter within Hazelcast itself. We can whittle events down to relevant data before they even reach our listener, as follows:

IMap<String, String> capitals = hz.getMap("capitals");
capitals.addEntryListener(new MapEntryListener(),
    new SqlPredicate("name = 'London'"), true);

Listeners racing into action

One issue with the previous example is that we retrospectively reconfigured the map to feature the listener after it was already in service. To avoid this race condition, we should wire up the listener before the node enters service. We can do this by registering the listener within the map configuration, as follows:

<hazelcast>
  <map name="default">
    <entry-listeners>
      <entry-listener include-value="true">
        com.packtpub.hazelcast.listeners.MapEntryListener
      </entry-listener>
    </entry-listeners>
  </map>
</hazelcast>

However, in both methods of configuration, we have provided a Boolean flag when registering the listener with the map. This include-value flag allows us to configure whether, when the listener is invoked, we are interested in just the key of the event entry or in the entry's value as well.
The default behavior (true) is to include the value, but if the use case does not require it, there is a performance benefit in not having to provide it to the listener. So, if the use case does not require this extra data, it is beneficial to set this flag to false.

Keyless collections

Though the keyless collections (set, list, and queue) are very similar to map collections, they feature their own interface to define the available events, in this case, ItemListener. It is not as extensive as its map counterpart, featuring just the itemAdded and itemRemoved events, and it can be used in the same way, though it only offers visibility of these two event types.

Programmatic configuration ahead of time

So far, most of the extra configuration that we have applied has been done either by customizing the hazelcast.xml file or by retrospectively modifying a collection in code. However, what if we want to programmatically configure Hazelcast without the race condition that we discovered earlier? Fortunately, there is a way. By creating an instance of the Config class, we can configure the appropriate behavior on it by using a hierarchy that is similar to the XML configuration, but in code. The previous example can be reconfigured to pass this configuration object over to the instance creation method, as follows:

public static void main(String[] args) {
  Config conf = new Config();
  conf.addListenerConfig(
      new EntryListenerConfig(new MapEntryListener(), false, true));
  HazelcastInstance hz = Hazelcast.newHazelcastInstance(conf);

Events unfolding in the wider world

Now that we can determine what is going on with our data within the cluster, we may wish to have a higher degree of visibility of the state of the cluster itself. We can use this either to trigger application-level responses to cluster instability, or to provide mechanisms that enable graceful scaling. We are provided with a number of interfaces for different types of cluster activity. All of these listeners can be configured retrospectively, as we have seen in the previous examples. However, in production, it is better to configure them in advance, for the same race condition reasons as with the collection listeners. We can do this either by using the hazelcast.xml configuration or by using the Config class, as follows:

<hazelcast>
  <listeners>
    <listener>com.packtpub.hazelcast.MyClusterListener</listener>
  </listeners>
</hazelcast>

The first of these, DistributedObjectListener, simply notifies all the nodes in the cluster about the collection objects that are being created or destroyed. Again, let's create a new example listener, ClusterObjectListener, to receive events, as follows:

public class ClusterObjectListener implements DistributedObjectListener {

  @Override
  public void distributedObjectCreated(DistributedObjectEvent event) {
    System.err.println("Created: " + event);
  }

  @Override
  public void distributedObjectDestroyed(DistributedObjectEvent event) {
    System.err.println("Destroyed: " + event);
  }
}

As these listeners are for cluster-wide events, the example usage of this listener is rather simple.
It mainly creates an instance with the appropriate listener registered, as follows:

public class ClusterListeningExample {
  public static void main(String[] args) {
    Config config = new Config();
    config.addListenerConfig(new ListenerConfig(new ClusterObjectListener()));
    HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
  }
}

When using the TestApp console, we can create and destroy some collections, as follows:

hazelcast[default] > ns test
namespace: test
hazelcast[test] > m.put foo bar
null
hazelcast[test] > m.destroy
Destroyed!

The preceding commands will produce the following logging on ALL the nodes that feature the listener:

Created: DistributedObjectEvent{eventType=CREATED, serviceName='hz:impl:mapService', distributedObject=IMap{name='test'}}
Destroyed: DistributedObjectEvent{eventType=DESTROYED, serviceName='hz:impl:mapService', distributedObject=IMap{name='test'}}

The next type of cluster listener is MembershipListener, which notifies all the nodes about a node joining or leaving the cluster. Let's create another example class, this time ClusterMembershipListener, as follows:

public class ClusterMembershipListener implements MembershipListener {

  @Override
  public void memberAdded(MembershipEvent membershipEvent) {
    System.err.println("Added: " + membershipEvent);
  }

  @Override
  public void memberRemoved(MembershipEvent membershipEvent) {
    System.err.println("Removed: " + membershipEvent);
  }

  @Override
  public void memberAttributeChanged(MemberAttributeEvent memberAttributeEvent) {
    System.err.println("Changed: " + memberAttributeEvent);
  }
}

Let's add the following code to the previous example application:

conf.addListenerConfig(new ListenerConfig(new ClusterMembershipListener()));

Lastly, we have LifecycleListener, which is local to an individual node and allows the application built on top of Hazelcast to understand its particular node's state by being notified as it changes when starting, pausing, resuming, or even shutting down, as follows:

public class NodeLifecycleListener implements LifecycleListener {

  @Override
  public void stateChanged(LifecycleEvent event) {
    System.err.println(event);
  }
}

Moving data around the place

The final listener is very useful as it lets an application know when Hazelcast is rebalancing the data within the cluster. This gives us an opportunity to prevent or even block the shutdown of a node, as we might be in a period of increased risk to data resilience because we may be actively moving data around at the time. The interface used in this case is MigrationListener. It will notify the application when partitions migrate from one node to another and when the migrations complete:

public class ClusterMigrationListener implements MigrationListener {

  @Override
  public void migrationStarted(MigrationEvent migrationEvent) {
    System.err.println("Started: " + migrationEvent);
  }

  @Override
  public void migrationCompleted(MigrationEvent migrationEvent) {
    System.err.println("Completed: " + migrationEvent);
  }

  @Override
  public void migrationFailed(MigrationEvent migrationEvent) {
    System.err.println("Failed: " + migrationEvent);
  }
}

When you register this cluster listener in your example application and create and destroy various nodes, you will see a deluge of events that show the ongoing migrations. The more astute among you may have previously spotted a repartitioning task being logged when spinning up multiple nodes:

INFO: [127.0.0.1]:5701 [dev] [3.5] Re-partitioning cluster data...
Migration queue size: 271

The preceding log indicates that 271 tasks (one migration task for each partition being migrated) have been scheduled to rebalance the cluster. The new listener will now give us significantly more visibility of these events as they occur and, hopefully, complete successfully:

Started: MigrationEvent{partitionId=98, oldOwner=Member [127.0.0.1]:5701, newOwner=Member [127.0.0.1]:5702 this}
Completed: MigrationEvent{partitionId=98, oldOwner=Member [127.0.0.1]:5701, newOwner=Member [127.0.0.1]:5702 this}
Started: MigrationEvent{partitionId=99, oldOwner=Member [127.0.0.1]:5701, newOwner=Member [127.0.0.1]:5702 this}
Completed: MigrationEvent{partitionId=99, oldOwner=Member [127.0.0.1]:5701, newOwner=Member [127.0.0.1]:5702 this}

However, this logging information is overwhelming and actually not all that useful to us. So, let's expand on the listener to try to provide the application with the ability to check whether the cluster is currently migrating data partitions or has recently done so. Let's create a new static class, MigrationStatus, to hold information about cluster migration and help us interrogate its current status:

public abstract class MigrationStatus {

  private static final Map<Integer, Boolean> MIGRATION_STATE =
      new ConcurrentHashMap<Integer, Boolean>();

  private static final AtomicLong LAST_MIGRATION_TIME =
      new AtomicLong(System.currentTimeMillis());

  public static void migrationEvent(int partitionId, boolean state) {
    MIGRATION_STATE.put(partitionId, state);
    if (!state) {
      LAST_MIGRATION_TIME.set(System.currentTimeMillis());
    }
  }

  public static boolean isMigrating() {
    Collection<Boolean> migrationStates = MIGRATION_STATE.values();
    Long lastMigrationTime = LAST_MIGRATION_TIME.get();

    // did we recently (< 10 seconds ago) complete a migration
    if (System.currentTimeMillis() < lastMigrationTime + 10000) {
      return true;
    }

    // are any partitions currently migrating
    for (Boolean partition : migrationStates) {
      if (partition) return true;
    }

    // otherwise we're not migrating
    return false;
  }
}

Then, we will update the listener to pass the appropriate calls through in response to the events coming into it, as follows:

@Override
public void migrationStarted(MigrationEvent migrationEvent) {
  MigrationStatus.migrationEvent(migrationEvent.getPartitionId(), true);
}

@Override
public void migrationCompleted(MigrationEvent migrationEvent) {
  MigrationStatus.migrationEvent(migrationEvent.getPartitionId(), false);
}

@Override
public void migrationFailed(MigrationEvent migrationEvent) {
  System.err.println("Failed: " + migrationEvent);
  MigrationStatus.migrationEvent(migrationEvent.getPartitionId(), false);
}

Finally, let's add a loop to the example application to print out the migration state over time, as follows:

public static void main(String[] args) throws Exception {
  Config conf = new Config();
  conf.addListenerConfig(new ListenerConfig(new ClusterMembershipListener()));
  conf.addListenerConfig(new ListenerConfig(new MigrationStatusListener()));
  HazelcastInstance hz = Hazelcast.newHazelcastInstance(conf);

  while (true) {
    System.err.println("Is Migrating?: " + MigrationStatus.isMigrating());
    Thread.sleep(5000);
  }
}

When starting and stopping various nodes, we should see each node detect the rebalance occurring, but it passes by quite quickly. It is in these small, critical periods of time, when data is being moved around, that resilience is most at risk, although depending on the configured number of backups, the risk could potentially be quite small.
Added: MembershipEvent {member=Member [127.0.0.1]:5703, type=added}
Is Migrating?: true
Is Migrating?: true
Is Migrating?: false

Extending quorum

We previously saw how we can configure a simple cluster health check to ensure that a given number of nodes are present to support the application. However, should we need more detailed control over the quorum definition, beyond a simple node count check, we can create our own quorum function that allows us to programmatically define what it means to be healthy. This can be as simple or as complex as the application requires. In the following example, we source an expected cluster size (in practice, probably from somewhere more suitable than a hard-coded value) and dynamically check whether a majority of the nodes are present:

public class QuorumExample {
  public static void main(String[] args) throws Exception {
    QuorumConfig quorumConf = new QuorumConfig();
    quorumConf.setName("atLeastTwoNodesWithMajority");
    quorumConf.setEnabled(true);
    quorumConf.setType(QuorumType.WRITE);

    final int expectedClusterSize = 5;

    quorumConf.setQuorumFunctionImplementation(new QuorumFunction() {
      @Override
      public boolean apply(Collection<Member> members) {
        return members.size() >= 2
            && members.size() > expectedClusterSize / 2;
      }
    });

    MapConfig mapConf = new MapConfig();
    mapConf.setName("default");
    mapConf.setQuorumName("atLeastTwoNodesWithMajority");

    Config conf = new Config();
    conf.addQuorumConfig(quorumConf);
    conf.addMapConfig(mapConf);

    HazelcastInstance hz = Hazelcast.newHazelcastInstance(conf);
    new ConsoleApp(hz).start(args);
  }
}

We can also create a listener for the quorum health check so that we are notified when the state of the quorum changes, as follows:

public class ClusterQuorumListener implements QuorumListener {

  @Override
  public void onChange(QuorumEvent quorumEvent) {
    System.err.println("Changed: " + quorumEvent);
  }
}

Let's attach the new listener to the appropriate configuration, as follows:

quorumConf.addListenerConfig(new QuorumListenerConfig(new ClusterQuorumListener()));

Summary

Hazelcast allows us to be a first-hand witness to a lot of internal state information. By registering listeners so that they are notified as events occur, we can further enhance an application not only in terms of its functionality, but also with respect to its resilience. By allowing the application to know when and what events are unfolding underneath it, we can add defensiveness to it, embracing the dynamic and destroyable nature of ephemeral approaches towards applications and infrastructure.

Further resources on this subject:

- What is Hazelcast?
- Apache Solr and Big Data – integration with MongoDB
- Introduction to Apache ZooKeeper
Elegant RESTful Client in Python for Exposing Remote Resources

Xavier Bruhiere
12 Aug 2015
6 min read
Product Hunt addicts like me might have noticed how often a "developer" tab is available on landing pages. More and more modern products offer a special entry point tailored for coders who want deeper interaction, beyond the standard end-user experience. Twitter, Myo, and Estimote are great examples of technologies an engineer could leverage for their own tool or product. And Application Programming Interfaces (APIs) make it possible. Companies design them as a communication contract between the developer and their product.

We can distinguish Representational State Transfer (RESTful) APIs from programmatic ones. The latter usually offer deeper technical integration, while the former try to abstract most of the product's complexity behind intuitive remote resources (more on that later). The resulting simplicity owes a lot to the HTTP protocol and turns out to be trickier than one thinks. Both RESTful servers and clients often underestimate the value of HTTP's historical rules or the challenges behind network failures.

In this article, I will share my recent experience of building an HTTP+JSON API client. We are going to build a small framework in Python to interact with well-designed third-party services. You should get out of it a consistent starting point for new projects, like remotely controlling your car!

Stack and Context

Before diving in, let's state an important assumption: the APIs our client will call are well designed. They enforce RFC standards, conventions, and consistent resources. Sometimes, however, the real world throws ugly interfaces at us. Always read the documentation (if any) and deal with it.

The choice of Python should be seen as a minor implementation consideration. Nevertheless, it will bring us the powerful requests package and a nice REPL to manually explore remote services. Its popularity also suggests we are likely to be able to integrate our package into a future project.

To keep things practical, requests will hit Consul HTTP endpoints, providing us with a handy interface to our infrastructure. Consul, as a whole, is a tool for discovering and configuring services in your infrastructure. Just download the appropriate binary, move it into your $PATH, and start a new server:

consul agent -server -bootstrap-expect 1 -data-dir /tmp/consul -node consul-server

We also need Python 3.4 or 2.7, pip installed, and then to download the single dependency we mentioned earlier with pip install requests==2.7.0. Now let's have a conversation with an API!

Sending requests

APIs expose resources for manipulation through HTTP verbs. Say we need to retrieve the nodes in our cluster; the Consul documentation requires us to perform a GET /v1/catalog/nodes.

import requests

def http_get(resource, payload=None):
    """ Perform an HTTP GET request against the given endpoint. """
    # Avoid dangerous default function argument `{}`
    payload = payload or {}

    # versioning an API guarantees compatibility
    endpoint = '{}/{}/{}'.format('http://localhost:8500', 'v1', resource)

    return requests.get(
        endpoint,
        # attach parameters to the url, like `&foo=bar`
        params=payload,
        # tell the API we expect to parse JSON responses
        headers={'Accept': 'application/vnd.consul+json; version=1'})

Providing Consul is running on the same host, we get the following result:

In [4]: res = http_get('catalog/nodes')

In [5]: res.json()
Out[5]: [{'Address': '172.17.0.1', 'Node': 'consul-server'}]

Awesome: a few lines of code gave us really convenient access to Consul information. Let's leverage OOP to abstract the nodes resource further.
Mapping resources

The idea is to consider a Catalog class whose attributes are Consul API resources. A little bit of Python magic offers an elegant way to achieve that.

class Catalog(object):

    # url specific path
    _path = 'catalog'

    def __getattr__(self, name):
        """ Extend built-in method to add support for attributes
        related to endpoints.

        Example: agent.members runs GET /v1/agent/members
        """
        # Default behavior
        if name in self.__dict__:
            return self.__dict__[name]
        # Dynamic attribute based on the property name
        else:
            return http_get('/'.join([self._path, name]))

It might seem a little cryptic if you are not familiar with Python's built-in object methods, but the usage is crystal clear:

In [47]: catalog_ = Catalog()

In [48]: catalog_.nodes.json()
Out[48]: [{'Address': '172.17.0.1', 'Node': 'consul-server'}]

The really nice benefit of this approach is that we become very productive in supporting new resources. Just rename the previous class to ClientFactory and profit.

class Status(ClientFactory):
    _path = 'status'

In [58]: status_ = Status()

In [59]: status_.peers.json()
Out[59]: ['172.17.0.1:8300']

But what if the resource we call does not exist? And, although we provide a header with Accept: application/json, what if we actually don't get back a JSON object, or we reach our rate limit?

Reading responses

Let's challenge our current implementation against those questions.

In [61]: status_.not_there
Out[61]: <Response [404]>

In [68]: # ok, that's a consistent response
In [69]: # 404 HTTP code means the resource wasn't found on server-side

In [69]: status_.not_there.json()
---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
...
ValueError: Expecting value: line 1 column 1 (char 0)

Well, that's not safe at all. We're going to wrap our HTTP calls with a decorator in charge of inspecting the API response.

def safe_request(fct):
    """ Return Go-like data (i.e. actual response and possible error)
    instead of raising errors.
    """
    def inner(*args, **kwargs):
        data, error = {}, None

        try:
            res = fct(*args, **kwargs)
        except requests.exceptions.ConnectionError as error:
            return None, {'message': str(error), 'id': -1}

        if res.status_code == 200 and res.headers['content-type'] == 'application/json':
            # expected behavior
            data = res.json()
        elif res.status_code == 206 and res.headers['content-type'] == 'application/json':
            # partial response, return as-is
            data = res.json()
        else:
            # something went wrong
            error = {'id': res.status_code, 'message': res.reason}

        return res, error
    return inner

# update our old code
@safe_request
def http_get(resource):
    # ...

This implementation still requires us to check for errors instead of disposing of the data right away. But we are dealing with networks, and unexpected failures will happen. Being aware of them without crashing or wrapping every resource with try/catch is a working compromise.

In [71]: res, err = status_.not_there

In [72]: print(err)
{'id': 404, 'message': 'Not Found'}

Conclusion

We just covered an opinionated Python abstraction for programmatically exposing remote resources. Subclassing the objects above allows one to quickly interact with new services, through command-line tools or an interactive prompt. Yet, we have only worked with the GET method. Most APIs also allow resource deletion (DELETE), update (PUT), or creation (POST), to name a few HTTP verbs.
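Extending the client beyond GET follows the same shape. The following is a rough sketch of my own (the http_put and http_delete helpers and the Consul key/value paths used here are illustrative assumptions, not part of the original article) showing how write operations could reuse the safe_request decorator:

import requests

BASE_ENDPOINT = 'http://localhost:8500/v1'

@safe_request
def http_put(resource, payload=None):
    """ Perform an HTTP PUT request against the given endpoint. """
    return requests.put('{}/{}'.format(BASE_ENDPOINT, resource), data=payload)

@safe_request
def http_delete(resource):
    """ Perform an HTTP DELETE request against the given endpoint. """
    return requests.delete('{}/{}'.format(BASE_ENDPOINT, resource))

# usage sketch against Consul's key/value store
res, err = http_put('kv/foo', 'bar')    # store a value under the key "foo"
res, err = http_delete('kv/foo')        # remove it again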
Other future work could involve:

- authentication
- smarter HTTP code handling when dealing with forbidden, rate-limiting, or internal server error responses

Given the incredible services that have emerged lately (IBM Watson, Docker, ...), building API clients is an increasingly productive option for developing innovative projects.

About the Author

Xavier Bruhiere is a Lead Developer at AppTurbo in Paris, where he develops innovative prototypes to support company growth. He is addicted to learning, hacking on intriguing hot techs (both soft and hard), and practicing high-intensity sports.
Working with XAML

Packt
11 Aug 2015
14 min read
XAML is also known as Extensible Application Markup Language. XAML is a generic language like XML and can be used for multiple purposes and in the WPF and Silverlight applications. XAML is majorly used to declaratively design user interfaces. In this article by Abhishek Shukla, author of the book Blend for Visual Studio 2012 by Example Beginner's Guide, we had a look at the various layout controls and at how to use them in our application. In this article, we will have a look at the following topics: The fundamentals of XAML The use of XAML for applications in Blend (For more resources related to this topic, see here.) An important point to note here is that almost everything that we can do using XAML can also be done using C#. The following is the XAML and C# code to accomplish the same task of adding a rectangle within a canvas. In the XAML code, a canvas is created and an instance of a rectangle is created, which is placed inside the canvas: XAML code: In the following code, we declare the Rectangle element within Canvas, and that makes the Rectangle element the child of Canvas. The hierarchy of the elements defines the parent-child relationship of the elements. When we declare an element in XAML, it is the same as initializing it with a default constructor, and when we set an attribute in XAML, it is equivalent to setting the same property or event handler in code. In the following code, we set the various properties of Rectangle, such as Height, Width, and so on: <Canvas>    <Rectangle Height="100" Width="250" Fill="AliceBlue"               StrokeThickness="5" Stroke="Black" /> </Canvas> C# code: We created Canvas and Rectangle. Then, we set a few properties of the Rectangle element and then placed Rectangle inside Canvas: Canvas layoutRoot = new Canvas(); Rectangle rectangle = new Rectangle();   rectangle.Height = 100; rectangle.Width = 250; rectangle.Fill = Brushes.AliceBlue; rectangle.StrokeThickness = 5; rectangle.Stroke = Brushes.Black;   layoutRoot.Children.Add(rectangle);   AddChild(layoutRoot); We can clearly see why XAML is the preferred choice to define UI elements—XAML code is shorter, more readable, and has all the advantages that a declarative language provides. Another major advantage of working with XAML rather than C# is instant design feedback. As we type XAML code, we see the changes on the art board instantly. Whereas in the case of C#, we have to run the application to see the changes. Thanks to XAML, the process of creating a UI is now more like visual design than code development. When we drag and drop an element or draw an element on the art board, Blend generates XAML in the background. This is helpful as we do not need to hand-code XAML as we would have to when working with text-editing tools. Generally, the order of attaching properties and event handlers is performed based on the order in which they are defined in the object element. However, this should not matter, ideally, because, as per the design guidelines, the classes should allow properties and event handlers to be specified in any order. Wherever we create a UI, we should use XAML, and whatever relates to data should be processed in code. XAML is great for UIs that may have logic, but XAML is not intended to process data, which should be prepared in code and posted to the UI for displaying purposes. It is data processing that XAML is not designed for. The basics of XAML Each element in XAML maps to an instance of a .NET class, and the name of the element is exactly the same as the class. 
For example, the <Button> element in XAML is an instruction to create an instance of the Button class. XAML specifications define the rules to map the namespaces, types, events, and properties of object-oriented languages to XML namespaces. Time for action – taking a look at XAML code Perform the following steps and take a look at the XAML namespaces after creating a WPF application: Let's create a new WPF project in Expression Blend and name it Chapter03. In Blend, open MainWindow.xaml, and click on the split-view option so that we can see both the design view as well as the XAML view. The following screenshot shows this: You will also notice that a grid is present under the window and there is no element above the window, which means that Window is the root element for the current document. We see this in XAML, and we will also see multiple attributes set on Window: <Window x_Class="Chapter03.MainWindow" Title="MainWindow" Height="350" Width="525">    We can see in the preceding code that the XAML file has a root element. For a XAML document to be valid, there should be one and only one root element. Generally, we have a window, page, or user control as the root element. Other root elements that are used are ResourceDictionary for dictionaries and Application for application definition.    The Window element also contains a few attributes, including a class name and two XML namespaces. The three properties (Title, Height, and Width) define the caption of the window and default size of the window. The class name, as you might have guessed, defines the class name of the window. The http://schemas.microsoft.com/winfx/2006/xaml/presentation is is the default namespace. The default namespace allows us to add elements to the page without specifying any prefix. So, all other namespaces defined must have a unique prefix for reference. For example, the namespace http://schemas.microsoft.com/winfx/2006/xaml is defined with the prefix :x. The prefix that is mapped to the schema allows us to reference this namespace just using the prefix instead of the full-schema namespace.    XML namespaces don't have a one-to-one mapping with .NET namespaces. WPF types are spread across multiple namespaces, and one-to-one mapping would be rather cumbersome. So, when we refer to the presentation, various namespaces, such as System.Windows and System.Windows.Controls, are included.    All the elements and their attributes defined in MainPage.xaml should be defined in at least one of the schemas mentioned in the root element of XAML; otherwise, the XAML document will be invalid, and there is no guarantee that the compiler will understand and continue. When we add Rectangle in XAML, we expect Rectangle and its attributes to be part of the default schema: <Rectangle x_Name="someRectangle" Fill="AliceBlue"/> What just happened? We had a look at the XAML namespaces that we use when we create a WPF application. Time for action – adding other namespaces in XAML In this section, we will add another namespace apart from the default namespace: We can use any other namespace in XAML as well. To do that, we need to declare the XML namespaces the schemas of which we want to adhere to. The syntax to do that is as follows: Prefix is the XML prefix we want to use in our XAML to represent that namespace. For example, the XAML namespace uses the :x prefix. Namespace is the fully qualified .NET namespace. 
AssemblyName is the assembly where the type is declared, and this assembly could be the current project assembly or a referenced assembly. Open the XAML view of MainWindow.xaml if it is not already open, and add the following line of code after the reference to To create an instance of an object we would have to use this namespace prefix as <system:Double></system:Double> We can access the types defined in the current assembly by referencing the namespace of the current project:   To create an instance of an object we would have to use this namespace prefix as   <local:MyObj></local:MyObj> What just happened? We saw how we can add more namespaces in XAML apart from the ones present by default and how we can use them to create objects. Naming elements In XAML, it is not mandatory to add a name to every element, but we might want to name some of the XAML elements that we want to access in the code or XAML. We can change properties or attach the event handler to, or detach it from, elements on the fly. We can set the name property by hand-coding in XAML or setting it in the properties window. This is shown in the following screenshot: The code-behind class The x:Class attribute in XAML references the code-behind class for XAML. You would notice the x:, which means that the Class attribute comes from the XAML namespace, as discussed earlier. The value of the attribute references the MainWindow class in the Chapter03 namespace. If we go to the MainWindow.xaml.cs file, we can see the partial class defined there. The code-behind class is where we can put C# (or VB) code for the implementation of event handlers and other application logic. As we discussed earlier, it is technically possible to create all of the XAML elements in code, but that bypasses the advantages of having XAML. So, as these two are partial classes, they are compiled into one class. So, as C# and XAML are equivalent and both these are partial classes, they are compiled into the same IL. Time for action – using a named element in a code-behind class Go to MainWindow.xaml.cs and change the background of the LayoutRoot grid to Green: public MainWindow() {    InitializeComponent();    LayoutRoot.Background = Brushes.Green; } Run the application; you will see that the background color of the grid is green. What just happened? We accessed the element defined in XAML by its name. Default properties The content of a XAML element is the value that we can simply put between the tags without any specific references. For example, we can set the content of a button as follows: <Button Content="Some text" /> or <Button > Some text </Button> The default properties are specified in the help file. So, all we need to do is press F1 on any of our controls to see what the value is. Expressing properties as attributes We can express the properties of an element as an XML attribute. Time for action – adding elements in XAML by hand-coding In this section, instead of dragging and dropping controls, we will add XAML code: Move MainWindow.xaml to XAML and add the code shown here to add TextBlock and three buttons in Grid: <Grid x_Name="LayoutRoot"> <TextBlock Text="00:00" Height="170" Margin="49,32,38,116" Width="429" FontSize="48"/>    <Button Content="Start" Height="50" Margin="49,220,342,49"     Width="125"/>    <Button Content="Stop" Height="50" Margin="203,220,188,49"     Width="125"/>    <Button Content="Reset" Height="50" Margin="353,220,38,49"    Width="125"/> </Grid> We set a few properties for each of the elements. 
The property types of the various properties are also mentioned. These are the .NET types to which the type converter would convert them:    Content: This is the content displayed onscreen. This property is of the Object type.    Height: This is the height of the button. This property is of the Double type.    Width: This is the width of the button. This property is of the Double type.    Margin: This is the amount of space outside the control, that is, between the edge of the control and its container. This property is of the Thickness type.    Text: This is the text displayed onscreen. This property is of the String type.    FontSize: This is the size of the font of the text. This property is of the Double type. What just happened? We added elements in XAML with properties as XML attributes. Non-attribute syntax We will define a gradient background for the grid, which is a complex property. Notice that the code in the next section sets the background property of the grid using a different type converter this time. Instead of a string-to-brush converter, the LinearGradientBrush to brush type converter would be used. Time for action – defining the gradient for the grid Add the following code inside the grid. We have specified two gradient stops. The first one is black and the second one is a shade of green. We have specified the starting point of LinearGradientBrush as the top-left corner and the ending point as the bottom-right corner, so we will see a diagonal gradient: <Grid x_Name="LayoutRoot"> <Grid.Background>    <LinearGradientBrush EndPoint="1,1" StartPoint="0,0">      <GradientStop Color="Black"/>      <GradientStop Color="#FF27EC07" Offset="1"/>    </LinearGradientBrush> </Grid.Background> The following is the output of the preceding code: Comments in XAML Using <!-- --> tags, we can add comments in XAML just as we add comments in XML. Comments are really useful when we have complex XAML with lots of elements and complex layouts: <!-- TextBlock to show the timer --> Styles in XAML We would like all the buttons in our application to look the same, and we can achieve this using styles. Here, we will just see how styles can be defined and used in XAML. Defining a style We will define a style as a static resource in the window. We can move all the properties we define for each button to that style. Time for action – defining style in XAML Add a new <Window.Resources> tag in XAML, and then add the code for the style, as shown here: <Window.Resources> <Style TargetType="Button" x_Key="MyButtonStyle">    <Setter Property="Height" Value="50" />    <Setter Property="Width" Value="125" />    <Setter Property="Margin" Value="0,10" />    <Setter Property="FontSize" Value="18"/>    <Setter Property="FontWeight" Value="Bold" />    <Setter Property="Background" Value="Black" />    <Setter Property="Foreground" Value="Green" /> </Style> </Window.Resources> We defined a few properties on style that are worth noting: TargetType: This property specifies the type of element to which we will apply this style. In our case, it is Button. x:Key: This is the unique key to reference the style within the window. Setter: The setter elements contain the property name and value. What just happened We defined a style for a button in the window. Using a style Let's use the style that we defined in the previous section. All UI controls have a style property (from the FrameworkElement base class). 
Time for action – defining style in XAML To use a style, we have to set the style property of the element as shown in the following code. Add the same style code to all the buttons:    We set the style property using curly braces ({}) because we are using another element.    StaticResource denotes that we are using another resource.    MyButtonStyle is the key to refer to the style. The following code encapsulates the style properties: <Button x_Name="BtnStart" Content="Start" Grid.Row="1" Grid.Column="0" Style="{StaticResource MyButtonStyle}"/> The following is the output of the preceding code: What just happened? We used a style defined for a button. Where to go from here This article gave a brief overview of XAML, which helps you to start designing your applications in Expression Blend. However, if you wish know more about XAML, you could visit the following MSDN links and go through the various XAML specifications and details: XAML in Silverlight: http://msdn.microsoft.com/en-us/library/cc189054(v=vs.95).aspx XAML in WPF: http://msdn.microsoft.com/en-us/library/ms747122(v=vs.110).aspx Pop quiz Q1. How can we find out the default property of a control? F1. F2. F3. F4. Q2. Is it possible to create a custom type converter? Yes. No. Summary We had a look at the basics of XAML, including namespaces, elements, and properties. With this introduction, you can now hand-edit XAML, where necessary, and this will allow you to tweak the output of Expression Blend or even hand-code an entire UI if you are more comfortable with that. Resources for Article: Further resources on this subject: Windows Phone 8 Applications [article] Building UI with XAML for Windows 8 Using C [article] Introduction to Modern OpenGL [article]
Advanced Data Access Patterns

Packt
11 Aug 2015
25 min read
In this article by Suhas Chatekar, author of the book Learning NHibernate 4, we will dig deeper into that statement and try to understand what those downsides are and what can be done about them. In our attempt to address the downsides of repository, we will present two data access patterns, namely the specification pattern and the query object pattern. The specification pattern is a pattern adopted into the data access layer from a general-purpose pattern used for effectively filtering in-memory data.

Before we begin, let me reiterate: repository pattern is not a bad or wrong choice in every situation. If you are building a small and simple application involving a handful of entities, then repository pattern can serve you well. But if you are building complex domain logic with intricate database interaction, then repository may not do justice to your code. The patterns presented can be used in both simple and complex applications, and if you feel that repository is doing the job perfectly, then there is no need to move away from it.

Problems with repository pattern

A lot has been written all over the Internet about what is wrong with repository pattern. A simple Google search will give you a lot of interesting articles to read and ponder. We will spend some time trying to understand the problems introduced by repository pattern.

Generalization

FindAll takes the name of the employee as input, along with some other parameters required for performing the search. When we started putting together a repository, we said that Repository<T> is a common repository class that can be used for any entity. But now FindAll takes a parameter that is only available on Employee, thus locking the implementation of FindAll to the Employee entity only. In order to keep the repository reusable by other entities, we would need to part ways with the common Repository<T> class and implement a more specific EmployeeRepository class with Employee-specific querying methods. This fixes the immediate problem but introduces another one. The new EmployeeRepository breaks the contract offered by IRepository<T>, as the FindAll method cannot be pushed onto the IRepository<T> interface. We would need to add a new interface, IEmployeeRepository. Do you notice where this is going? You would end up implementing a lot of repository classes with complex inheritance relationships between them. While this may seem to work, I have experienced that there are better ways of solving this problem.

Unclear and confusing contract

What happens if there is a need to query employees by a different criterion for a different business requirement? Say we now need to fetch a single Employee instance by its employee number. Even if we ignore the above issue and are ready to add a repository class per entity, we would need to add a method that is specific to fetching the Employee instance matching the employee number. This adds another dimension to the code maintenance problem. Imagine how many such methods we would end up adding for a complex domain every time someone needs to query an entity using a new criterion. Several methods on the repository contract that query the same entity using different criteria make the contract less clear and confusing for new developers. Such a pattern also makes it difficult to reuse code, even if two methods are only slightly different from each other.
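To make the problem concrete, the following is a rough sketch of my own (the interface and method names here are illustrative assumptions, not code from the book) showing how an entity-specific repository contract tends to accumulate one method per query criterion:

using System;
using System.Collections.Generic;

// Hypothetical contract that has grown one method per query requirement.
// Each new way of searching Employee forces yet another method onto it,
// and very little query logic can be shared between them.
public interface IEmployeeRepository : IRepository<Employee>
{
    IEnumerable<Employee> FindAll(string name);
    Employee FindByEmployeeNumber(string employeeNumber);
    IEnumerable<Employee> FindByDepartment(string department);
    IEnumerable<Employee> FindHiredAfter(DateTime hiredAfter);
}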
Leaky abstraction

In order to make methods on repositories reusable in different situations, a lot of developers tend to add a single method on the repository that does not take any input and returns an IQueryable<T> by calling ISession.Query<T> inside it, as shown next:

public IQueryable<T> FindAll()
{
    return session.Query<T>();
}

The IQueryable<T> returned by this method can then be used to construct any query you want outside of the repository. This is a classic case of leaky abstraction. Repository is supposed to abstract away any concerns around querying the database, but what we are doing here is returning an IQueryable<T> to the consuming code and asking it to build the queries, thus leaking the abstraction that is supposed to be hidden inside the repository. The IQueryable<T> returned by the preceding method holds an instance of ISession that would be used to ultimately interact with the database. Since the repository has no control over how and when this IQueryable would invoke the database interaction, you might get into trouble. If you are using a "session per request" kind of pattern then you are safeguarded against it, but if you are not using that pattern for any reason, then you need to watch out for errors due to closed or disposed session objects.

God object anti-pattern

A god object is an object that does too many things. Sometimes, there is a single class in an application that does everything. Such an implementation is almost always bad, as it majorly breaks the famous single responsibility principle (SRP) and reduces the testability and maintainability of the code. A lot can be written about SRP and the god object anti-pattern, but since it is not the primary topic, I will leave it after underscoring the importance of staying away from the god object anti-pattern. Avid readers can Google the topic if they are interested. Repositories by nature tend to become a single point of database interaction. Any new database interaction goes through the repository. Over time, repositories grow organically, with a large number of methods doing too many things. You may spot the anti-pattern and decide to break the repository into multiple small repositories, but the original single repository would be tightly integrated with your code in so many places that splitting it would be a difficult job.

For a contained and trivial domain model, repository pattern can be a good choice, so do not abandon repositories entirely. It is around a complex and changing domain that repositories start exhibiting the problems just discussed. You might still argue that repository is an unneeded abstraction and we can very well use NHibernate directly for a trivial domain model. But I would caution against any design that uses NHibernate directly from the domain or domain services layer. No matter what design I use for data access, I would always adhere to the "explicitly declare capabilities required" principle. The abstraction that offers the required capability can be a repository interface or one of the other abstractions that we will learn about.

Specification pattern

Specification pattern is a reusable and object-oriented way of applying business rules on domain entities. The primary use of specification pattern is to select a subset of entities from a larger collection of entities based on some rules. An important characteristic of specification pattern is the ability to combine multiple rules by chaining them together. Specification pattern was in existence before ORMs and other data access patterns had set their feet in the development community.
The original form of specification pattern dealt with in-memory collections of entities. The pattern was then adapted to work with ORMs such as NHibernate, as people started seeing the benefits that specification pattern could bring. We will first discuss specification pattern in its original form, which will give us a good understanding of the pattern. We will then modify the implementation to make it fit with NHibernate.

Specification pattern in its original form

Let's look at an example of specification pattern in its original form. A specification defines a rule that must be satisfied by domain objects. This can be generalized using an interface definition, as follows:

public interface ISpecification<T>
{
    bool IsSatisfiedBy(T entity);
}

ISpecification<T> defines a single method, IsSatisfiedBy. This method takes an entity instance of type T as input and returns a Boolean value depending on whether the entity passed satisfies the rule or not. If we were to write a rule for employees living in London, then we could implement a specification as follows:

public class EmployeesLivingIn : ISpecification<Employee>
{
    public bool IsSatisfiedBy(Employee entity)
    {
        return entity.ResidentialAddress.City == "London";
    }
}

The EmployeesLivingIn class implements ISpecification<Employee>, telling us that this is a specification for the Employee entity. This specification compares the city from the employee's ResidentialAddress property with the literal string "London" and returns true if it matches. You may be wondering why I have named this class EmployeesLivingIn. Well, I had some refactoring in mind and I wanted to make my final code read nicely. Let's see what I mean.

We have hardcoded the literal string "London" in the preceding specification. This effectively stops this class from being reusable. What if we need a specification for all employees living in Paris? The ideal thing to do would be to accept "London" as a parameter during instantiation of this class and then use that parameter value in the implementation of the IsSatisfiedBy method. The following code listing shows the modified code:

public class EmployeesLivingIn : ISpecification<Employee>
{
    private readonly string city;

    public EmployeesLivingIn(string city)
    {
        this.city = city;
    }

    public bool IsSatisfiedBy(Employee entity)
    {
        return entity.ResidentialAddress.City == city;
    }
}

This looks good, without any hardcoded string literals. Now if I wanted my original specification for employees living in London, then the following is how I could build it:

var specification = new EmployeesLivingIn("London");

Did you notice how the preceding code reads in plain English because of the way the class is named? Now, let's see how to use this specification class. The usual scenario where specifications are used is when you have a list of entities that you are working with and you want to run a rule and find out which of the entities in the list satisfy that rule. The following code listing shows a very simple use of the specification we just implemented:

List<Employee> employees = //Loaded from somewhere
List<Employee> employeesLivingInLondon = new List<Employee>();
var specification = new EmployeesLivingIn("London");

foreach(var employee in employees)
{
    if(specification.IsSatisfiedBy(employee))
    {
        employeesLivingInLondon.Add(employee);
    }
}

We have a list of employees loaded from somewhere, and we want to filter this list and get another list comprising the employees living in London.
Till this point, the only benefit we have had from specification pattern is that we have managed to encapsulate the rule in a specification class which can now be reused anywhere. For complex rules, this can be very useful. But for simple rules, specification pattern may look like a lot of plumbing code, unless we consider the composability of specifications. Most of the power of specification pattern comes from the ability to chain multiple rules together to form a complex rule. Let's write another specification for employees who have opted for any benefit:

public class EmployeesHavingOptedForBenefits : ISpecification<Employee>
{
    public bool IsSatisfiedBy(Employee entity)
    {
        return entity.Benefits.Count > 0;
    }
}

In this rule, there is no need to supply any literal value from outside, so the implementation is quite simple. We just check whether the Benefits collection on the passed employee instance has a count greater than zero. You can use this specification in exactly the same way as the earlier specification was used. Now if there is a need to apply both of these specifications to an employee collection, then very little modification to our code is needed. Let's start by adding an And method to the ISpecification<T> interface, as shown next:

public interface ISpecification<T>
{
    bool IsSatisfiedBy(T entity);
    ISpecification<T> And(ISpecification<T> specification);
}

The And method accepts an instance of ISpecification<T> and returns another instance of the same type. As you would have guessed, the specification that is returned from the And method effectively performs a logical AND operation between the specification on which the And method is invoked and the specification that is passed into the And method. The actual implementation of the And method comes down to calling the IsSatisfiedBy method on both specification objects and logically ANDing their results. Since this logic does not change from specification to specification, we can introduce a base class that implements it. All specification implementations can then derive from this new base class. Following is the code for the base class:

public abstract class Specification<T> : ISpecification<T>
{
    public abstract bool IsSatisfiedBy(T entity);

    public ISpecification<T> And(ISpecification<T> specification)
    {
        return new AndSpecification<T>(this, specification);
    }
}

We have marked Specification<T> as abstract, as this class does not represent any meaningful business specification and hence we do not want anyone to inadvertently use it directly. Accordingly, the IsSatisfiedBy method is marked abstract as well. In the implementation of the And method, we instantiate a new class, AndSpecification. This class takes two specification objects as inputs: we pass the current instance and the one that is passed to the And method. The definition of AndSpecification is very simple:

public class AndSpecification<T> : Specification<T>
{
    private readonly Specification<T> specification1;
    private readonly ISpecification<T> specification2;

    public AndSpecification(Specification<T> specification1, ISpecification<T> specification2)
    {
        this.specification1 = specification1;
        this.specification2 = specification2;
    }

    public override bool IsSatisfiedBy(T entity)
    {
        return specification1.IsSatisfiedBy(entity) &&
               specification2.IsSatisfiedBy(entity);
    }
}

AndSpecification<T> inherits from the abstract class Specification<T>, which is obvious.
Its IsSatisfiedBy simply performs a logical AND on the outputs of the IsSatisfiedBy method of each of the specification objects passed into AndSpecification<T>. After we change our previous two business specification implementations to inherit from the abstract class Specification<T> instead of implementing the interface ISpecification<T> directly, the following is how we can chain two specifications using the And method that we just introduced:

List<Employee> employees = null; //= Load from somewhere
List<Employee> employeesLivingInLondon = new List<Employee>();
var specification = new EmployeesLivingIn("London")
                        .And(new EmployeesHavingOptedForBenefits());

foreach (var employee in employees)
{
    if (specification.IsSatisfiedBy(employee))
    {
        employeesLivingInLondon.Add(employee);
    }
}

Literally nothing has changed in how the specification is used in the business logic. The only thing that has changed is the construction and chaining together of the two specifications, as shown in the preceding listing. We could go on and implement other chaining methods, but the point to take home here is the composability that the specification pattern offers. Now let's look into how the specification pattern sits beside NHibernate and helps in fixing some of the pain points of the repository pattern.

Specification pattern for NHibernate

The fundamental difference between the original specification pattern and the pattern applied to NHibernate is that in the former case we had an in-memory list of objects to work with. With NHibernate, we do not have the list of objects in memory. We have the list in the database, and we want to be able to specify rules that can be used to generate the appropriate SQL to fetch the records from the database that satisfy the rule. Owing to this difference, we cannot use the original specification pattern as is when we are working with NHibernate. Let me show you what this means when it comes to writing code that makes use of the specification pattern. A query, in its most basic form, to retrieve all employees living in London would look something like the following:

var employees = session.Query<Employee>()
                       .Where(e => e.ResidentialAddress.City == "London");

The lambda expression passed to the Where method is our rule. We want all the Employee instances from the database that satisfy this rule, and we want to be able to push this rule behind some kind of abstraction, such as ISpecification<T>, so that it can be reused. We would need a method on ISpecification<T> that does not take any input (there are no in-memory entities to pass) and returns a lambda expression that can be passed into the Where method. The following is how that method could look:

public interface ISpecification<T> where T : EntityBase<T>
{
    Expression<Func<T, bool>> IsSatisfied();
}

Note the differences from the previous version. We have changed the method name from IsSatisfiedBy to IsSatisfied, as there is no entity being passed into this method that would warrant the use of the word By at the end. This method returns an Expression<Func<T, bool>>. If you have dealt with situations where you pass lambda expressions around, then you know what this type means. If you are new to expression trees, here is a brief explanation. Func<T, bool> is an ordinary delegate type: it points to a function that takes an instance of type T as input and returns a Boolean output. Expression<Func<T, bool>> represents such a function as an expression tree, that is, as data describing the lambda rather than compiled code, which is what allows NHibernate to translate it into SQL.
An implementation of this new interface will make things clearer. The next code listing shows the specification for employees living in London written against the new contract:

public class EmployeesLivingIn : ISpecification<Employee>
{
    private readonly string city;

    public EmployeesLivingIn(string city)
    {
        this.city = city;
    }

    public Expression<Func<Employee, bool>> IsSatisfied()
    {
        return e => e.ResidentialAddress.City == city;
    }
}

Not much has changed here compared to the previous implementation. The definition of IsSatisfied now returns a lambda expression instead of a bool. This lambda is exactly the same as the one we used in the ISession example. If I had to rewrite that example using the preceding specification, the following is how it would look:

var specification = new EmployeesLivingIn("London");
var employees = session.Query<Employee>()
                       .Where(specification.IsSatisfied());

We now have a specification wrapped in a reusable object that we can send straight to NHibernate's ISession interface. Now let's think about how we can use this from within domain services, where we used repositories before. We do not want to reference ISession or any other NHibernate type from domain services, as that would break the onion architecture. We have two options. We can declare a new capability that takes a specification and executes it against the ISession interface, and make the domain service classes take a dependency on this new capability. Or we can use the existing IRepository capability and add a method to it that takes the specification and executes it.

We started this article with the statement that repositories have a downside, specifically when it comes to querying entities using different criteria, and now we are considering an option to enrich the repositories with specifications. Is that contradictory? Remember that one of the problems with the repository was that every time there was a new criterion to query an entity by, we needed a new method on the repository. The specification pattern fixes that problem. It takes the criterion out of the repository and moves it into its own class, so we only ever need a single method on the repository that takes an ISpecification<T> and executes it. So using the repository is not as bad as it sounds. The following is how the new method on the repository interface would look:

public interface IRepository<T> where T : EntityBase<T>
{
    void Save(T entity);
    void Update(int id, T entity);
    T GetById(int id);
    IEnumerable<T> Apply(ISpecification<T> specification);
}

The Apply method is the new method that works with specifications. Note that we have removed all the other methods that ran various different queries and replaced them with this single method. The methods to save and update entities are still there. Even the GetById method is there, as the mechanism used to get an entity by ID is not the same as the one used by specifications, so we retain that method.

One thing I have experimented with in some projects is to split read operations from write operations. The IRepository interface represents something that is capable of both reading from the database and writing to the database. Sometimes, we only need the capability to read from the database, in which case IRepository looks like an unnecessarily heavy object with capabilities we do not need. In such a situation, declaring a new capability to execute specifications makes more sense. I would leave the actual code for this as a self-exercise for our readers.
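As a starting point for that exercise, the following is one possible shape such a read-only capability could take. This is only a sketch under my own assumptions: the ISpecificationExecutor name, its Apply method, and the idea of injecting ISession are illustrative choices, not code from this book. It uses session.Query<T>() from NHibernate.Linq, so the implementation belongs in the infrastructure layer; domain services would depend only on the interface.

// Illustrative sketch: type and member names are assumptions, not the book's code.
// Requires NHibernate, NHibernate.Linq, System.Collections.Generic, and System.Linq.
public interface ISpecificationExecutor<T> where T : EntityBase<T>
{
    IEnumerable<T> Apply(ISpecification<T> specification);
}

public class SpecificationExecutor<T> : ISpecificationExecutor<T>
    where T : EntityBase<T>
{
    private readonly ISession session;

    public SpecificationExecutor(ISession session)
    {
        this.session = session;
    }

    public IEnumerable<T> Apply(ISpecification<T> specification)
    {
        // Same mechanism a repository's Apply method would use: hand the
        // expression produced by the specification to IQueryable.Where.
        return session.Query<T>()
                      .Where(specification.IsSatisfied())
                      .ToList();
    }
}

Because this type only reads, it can be injected wherever queries are needed without dragging the write-side members of IRepository along.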
Specification chaining

In the original implementation of the specification pattern, chaining was simply a matter of carrying out a logical AND between the outputs of the IsSatisfiedBy method on the specification objects involved in the chaining. In the NHibernate-adapted version of the specification pattern, the end result boils down to the same thing, but the actual implementation is slightly more complex than just ANDing the results. As with the original specification pattern, we need an abstract base class Specification<T> and a specialized AndSpecification<T> class. I will skip those details and go straight into the implementation of the IsSatisfied method on AndSpecification, where the actual logical ANDing happens:

public override Expression<Func<T, bool>> IsSatisfied()
{
    var parameterName = Expression.Parameter(typeof(T), "arg1");
    return Expression.Lambda<Func<T, bool>>(Expression.AndAlso(
               Expression.Invoke(specification1.IsSatisfied(), parameterName),
               Expression.Invoke(specification2.IsSatisfied(), parameterName)),
           parameterName);
}

Logically ANDing two lambda expressions is not a straightforward operation. We need to make use of the static methods available on the helper class System.Linq.Expressions.Expression. Let's work from the inside out; that way, it is easier to understand what is happening here. The following is a reproduction of the innermost call to the Expression class:

Expression.Invoke(specification1.IsSatisfied(), parameterName)

In the preceding code, we call the Invoke method on the Expression class, passing the output of the IsSatisfied method on the first specification. The second parameter passed to this method is a temporary parameter of type T that we created to satisfy the method signature of Invoke. The Invoke method returns an InvocationExpression, which represents the invocation of the lambda expression that was used to construct it. Note that the actual lambda expression is not invoked yet. We do the same with the second specification in question. The outputs of both these operations are then passed into another method on the Expression class, as follows:

Expression.AndAlso(
    Expression.Invoke(specification1.IsSatisfied(), parameterName),
    Expression.Invoke(specification2.IsSatisfied(), parameterName)
)

Expression.AndAlso takes the output from both specification objects, in the form of the InvocationExpression type, and builds a special type called BinaryExpression, which represents a logical AND between the two expressions that were passed to it. Next, we convert this BinaryExpression into an Expression<Func<T, bool>> by passing it to the Expression.Lambda<Func<T, bool>> method. This explanation is not very easy to follow, and if you have never built or modified lambda expressions programmatically like this before, you may find it hard going; in that case, I would recommend not bothering yourself too much with it. The following code snippet shows how logical ORing of two specifications can be implemented. Note that the snippet only shows the implementation of the IsSatisfied method:

public override Expression<Func<T, bool>> IsSatisfied()
{
    var parameterName = Expression.Parameter(typeof(T), "arg1");
    return Expression.Lambda<Func<T, bool>>(Expression.OrElse(
               Expression.Invoke(specification1.IsSatisfied(), parameterName),
               Expression.Invoke(specification2.IsSatisfied(), parameterName)),
           parameterName);
}

The rest of the infrastructure around chaining is exactly the same as the one presented during the discussion of the original specification pattern.
I have avoided giving the full class definitions here to save space, but you can download the code to look at the complete implementation. That brings us to the end of the specification pattern. Though the specification pattern is a great leap forward from where the repository left us, it does have some limitations of its own. Next, we will look into what these limitations are.

Limitations

The specification pattern is great and, unlike the repository pattern, I am not going to tell you that it has downsides and that you should try to avoid it. You should not. You should absolutely use it wherever it fits. I would only like to highlight two limitations of the specification pattern. The specification pattern only works with lambda expressions; you cannot use LINQ syntax. There may be times when you would prefer LINQ syntax over lambda expressions. One such situation is when you want to go for theta joins, which are not possible with lambda expressions. Another is when lambda expressions do not generate optimal SQL. I will show you a quick example to make this clearer. Suppose we want to write a specification for employees who have opted for the season ticket loan benefit. The following code listing shows how that specification could be written:

public class EmployeeHavingTakenSeasonTicketLoanSpecification : Specification<Employee>
{
    public override Expression<Func<Employee, bool>> IsSatisfied()
    {
        return e => e.Benefits.Any(b => b is SeasonTicketLoan);
    }
}

It is a very simple specification. Note the use of Any to iterate over the Benefits collection and check whether any of the Benefit instances in that collection is of type SeasonTicketLoan. The following SQL is generated when the preceding specification is run:

SELECT employee0_.Id            AS Id0_,
       employee0_.Firstname     AS Firstname0_,
       employee0_.Lastname      AS Lastname0_,
       employee0_.EmailAddress  AS EmailAdd5_0_,
       employee0_.DateOfBirth   AS DateOfBi6_0_,
       employee0_.DateOfJoining AS DateOfJo7_0_,
       employee0_.IsAdmin       AS IsAdmin0_,
       employee0_.Password      AS Password0_
FROM   Employee employee0_
WHERE  EXISTS
       (SELECT benefits1_.Id
        FROM   Benefit benefits1_
               LEFT OUTER JOIN Leave benefits1_1_
                 ON benefits1_.Id = benefits1_1_.Id
               LEFT OUTER JOIN SkillsEnhancementAllowance benefits1_2_
                 ON benefits1_.Id = benefits1_2_.Id
               LEFT OUTER JOIN SeasonTicketLoan benefits1_3_
                 ON benefits1_.Id = benefits1_3_.Id
        WHERE  employee0_.Id = benefits1_.Employee_Id
          AND  CASE WHEN benefits1_1_.Id IS NOT NULL THEN 1
                    WHEN benefits1_2_.Id IS NOT NULL THEN 2
                    WHEN benefits1_3_.Id IS NOT NULL THEN 3
                    WHEN benefits1_.Id IS NOT NULL THEN 0
               END = 3)

Isn't that SQL too complex? It is not only hard on the eyes; it is also not how I would have written the required SQL in the absence of NHibernate. I would have simply inner-joined the Employee, Benefit, and SeasonTicketLoan tables to get the records I need. On large databases, the preceding query may be too slow. There are other such situations where queries written using lambda expressions tend to generate complex or less than optimal SQL. If we use LINQ syntax instead of lambda expressions, we can get NHibernate to generate exactly the SQL we want. Unfortunately, there is no way of fixing this with the specification pattern.

Summary

The repository pattern has been around for a long time, but it suffers from some issues. The general nature of its implementation gets in the way of extending the repository pattern to use it with complex domain models involving a large number of entities.
The repository contract can be limiting and confusing when there is a need to write complex and very specific queries. Trying to fix these issues within repositories may result in a leaky abstraction, which can bite us later. Moreover, repositories maintained with less care have a tendency to grow into god objects, and maintaining them beyond that point becomes a challenge. The specification pattern and the query object pattern solve these issues on the read side of things.

Different applications have different data access requirements. Some applications are write-heavy while others are read-heavy, but only a small number of applications fall into the former category; a large number of the applications developed these days are read-heavy. I have worked on applications in which more than 90 percent of the database operations queried data and less than 10 percent actually inserted or updated data in the database. Having this knowledge about the application you are developing can be very useful in determining how you are going to design your data access layer.

That brings us to the end of our NHibernate journey. Not quite, but yes, in a way.

Resources for Article:

Further resources on this subject:
NHibernate 3: Creating a Sample Application [article]
NHibernate 3.0: Using LINQ Specifications in the data access layer [article]
NHibernate 2: Mapping relationships and Fluent Mapping [article]

Task Execution with Asio

Packt
11 Aug 2015
20 min read
In this article by Arindam Mukherjee, the author of Learning Boost C++ Libraries, we learn how to execute a task using Boost Asio (pronounced ay-see-oh), a portable library for performing efficient network I/O using a consistent programming model. At its core, Boost Asio provides a task execution framework that you can use to perform operations of any kind. You create your tasks as function objects and post them to a task queue maintained by Boost Asio. You enlist one or more threads to pick up these tasks (function objects) and invoke them. The threads keep picking up tasks, one after the other, till the task queues are empty, at which point the threads do not block but exit.

(For more resources related to this topic, see here.)

IO Service, queues, and handlers

At the heart of Asio is the type boost::asio::io_service. A program uses the io_service interface to perform network I/O and manage tasks. Any program that wants to use the Asio library creates at least one instance of io_service and sometimes more than one. In this section, we will explore the task management capabilities of io_service. Here is the IO Service in action using the obligatory "hello world" example:

Listing 11.1: Asio Hello World

 1 #include <boost/asio.hpp>
 2 #include <iostream>
 3 namespace asio = boost::asio;
 4
 5 int main() {
 6   asio::io_service service;
 7
 8   service.post(
 9     [] {
10       std::cout << "Hello, world!" << '\n';
11     });
12
13   std::cout << "Greetings: \n";
14   service.run();
15 }

We include the convenience header boost/asio.hpp, which includes most of the Asio library that we need for the examples in this article (line 1). All parts of the Asio library are under the namespace boost::asio, so we use a shorter alias for this (line 3). The program itself just prints Hello, world! on the console, but it does so through a task. The program first creates an instance of io_service (line 6) and posts a function object to it, using the post member function of io_service. The function object, in this case defined using a lambda expression, is referred to as a handler. The call to post adds the handler to a queue inside io_service; some thread (including the one that posted the handler) must dispatch it, that is, remove it from the queue and call it. The call to the run member function of io_service (line 14) does precisely this. It loops through the handlers in the queue inside io_service, removing and calling each handler. In fact, we can post more handlers to the io_service before calling run, and it would call all the posted handlers. If we did not call run, none of the handlers would be dispatched. The run function blocks until all the handlers in the queue have been dispatched and returns only when the queue is empty. By itself, a handler may be thought of as an independent, packaged task, and Boost Asio provides a great mechanism for dispatching arbitrary tasks as handlers. Note that handlers must be nullary function objects, that is, they should take no arguments.

Asio is a header-only library by default, but programs using Asio need to link at least with boost_system. On Linux, we can use the following command line to build this example:

$ g++ -g listing11_1.cpp -o listing11_1 -lboost_system -std=c++11

Running this program prints the following:

Greetings:
Hello, world!

Note that Greetings: is printed from the main function (line 13) before the call to run (line 14). The call to run ends up dispatching the sole handler in the queue, which prints Hello, world!.
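Because handlers must be nullary, any arguments a task needs have to be bound in at the time the task is posted, typically with a lambda capture or std::bind. The following is a small sketch of both approaches; it is not one of the book's listings, but it should build with the same command line as listing 11.1:

#include <boost/asio.hpp>
#include <functional>
#include <iostream>
#include <string>
namespace asio = boost::asio;

void greet(const std::string& name) {
  std::cout << "Hello, " << name << "!\n";
}

int main() {
  asio::io_service service;
  std::string name = "Asio";

  // Capture the argument by value so the handler itself takes no arguments.
  service.post([name] { greet(name); });

  // std::bind wraps an existing function and its argument into a nullary callable.
  service.post(std::bind(greet, std::string("world")));

  service.run();   // dispatches both handlers
}

Both handlers end up in the same queue and are dispatched by run exactly as in listing 11.1.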
It is also possible for multiple threads to call run on the same I/O object and dispatch handlers concurrently. We will see how this can be useful in the next section. Handler states – run_one, poll, and poll_one While the run function blocks until there are no more handlers in the queue, there are other member functions of io_service that let you process handlers with greater flexibility. But before we look at this function, we need to distinguish between pending and ready handlers. The handlers we posted to the io_service were all ready to run immediately and were invoked as soon as their turn came on the queue. In general, handlers are associated with background tasks that run in the underlying OS, for example, network I/O tasks. Such handlers are meant to be invoked only once the associated task is completed, which is why in such contexts, they are called completion handlers. These handlers are said to be pending until the associated task is awaiting completion, and once the associated task completes, they are said to be ready. The poll member function, unlike run, dispatches all the ready handlers but does not wait for any pending handler to become ready. Thus, it returns immediately if there are no ready handlers, even if there are pending handlers. The poll_one member function dispatches exactly one ready handler if there be one, but does not block waiting for pending handlers to get ready. The run_one member function blocks on a nonempty queue waiting for a handler to become ready. It returns when called on an empty queue, and otherwise, as soon as it finds and dispatches a ready handler. post versus dispatch A call to the post member function adds a handler to the task queue and returns immediately. A later call to run is responsible for dispatching the handler. There is another member function called dispatch that can be used to request the io_service to invoke a handler immediately if possible. If dispatch is invoked in a thread that has already called one of run, poll, run_one, or poll_one, then the handler will be invoked immediately. If no such thread is available, dispatch adds the handler to the queue and returns just like post would. In the following example, we invoke dispatch from the main function and from within another handler: Listing 11.2: post versus dispatch 1 #include <boost/asio.hpp> 2 #include <iostream> 3 namespace asio = boost::asio; 4 5 int main() { 6   asio::io_service service; 7   // Hello Handler – dispatch behaves like post 8   service.dispatch([]() { std::cout << "Hellon"; }); 9 10   service.post( 11     [&service] { // English Handler 12       std::cout << "Hello, world!n"; 13       service.dispatch([] { // Spanish Handler, immediate 14                         std::cout << "Hola, mundo!n"; 15                       }); 16     }); 17   // German Handler 18   service.post([&service] {std::cout << "Hallo, Welt!n"; }); 19   service.run(); 20 } Running this code produces the following output: Hello Hello, world! Hola, mundo! Hallo, Welt! The first call to dispatch (line 8) adds a handler to the queue without invoking it because run was yet to be called on io_service. We call this the Hello Handler, as it prints Hello. This is followed by the two calls to post (lines 10, 18), which add two more handlers. The first of these two handlers prints Hello, world! (line 12), and in turn, calls dispatch (line 13) to add another handler that prints the Spanish greeting, Hola, mundo! (line 14). 
The second of these handlers prints the German greeting, Hallo, Welt (line 18). For our convenience, let's just call them the English, Spanish, and German handlers. This creates the following entries in the queue: Hello Handler English Handler German Handler Now, when we call run on the io_service (line 19), the Hello Handler is dispatched first and prints Hello. This is followed by the English Handler, which prints Hello, World! and calls dispatch on the io_service, passing the Spanish Handler. Since this executes in the context of a thread that has already called run, the call to dispatch invokes the Spanish Handler, which prints Hola, mundo!. Following this, the German Handler is dispatched printing Hallo, Welt! before run returns. What if the English Handler called post instead of dispatch (line 13)? In that case, the Spanish Handler would not be invoked immediately but would queue up after the German Handler. The German greeting Hallo, Welt! would precede the Spanish greeting Hola, mundo!. The output would look like this: Hello Hello, world! Hallo, Welt! Hola, mundo! Concurrent execution via thread pools The io_service object is thread-safe and multiple threads can call run on it concurrently. If there are multiple handlers in the queue, they can be processed concurrently by such threads. In effect, the set of threads that call run on a given io_service form a thread pool. Successive handlers can be processed by different threads in the pool. Which thread dispatches a given handler is indeterminate, so the handler code should not make any such assumptions. In the following example, we post a bunch of handlers to the io_service and then start four threads, which all call run on it: Listing 11.3: Simple thread pools 1 #include <boost/asio.hpp> 2 #include <boost/thread.hpp> 3 #include <boost/date_time.hpp> 4 #include <iostream> 5 namespace asio = boost::asio; 6 7 #define PRINT_ARGS(msg) do { 8   boost::lock_guard<boost::mutex> lg(mtx); 9   std::cout << '[' << boost::this_thread::get_id() 10             << "] " << msg << std::endl; 11 } while (0) 12 13 int main() { 14   asio::io_service service; 15   boost::mutex mtx; 16 17   for (int i = 0; i < 20; ++i) { 18     service.post([i, &mtx]() { 19                         PRINT_ARGS("Handler[" << i << "]"); 20                         boost::this_thread::sleep( 21                               boost::posix_time::seconds(1)); 22                       }); 23   } 24 25   boost::thread_group pool; 26   for (int i = 0; i < 4; ++i) { 27     pool.create_thread([&service]() { service.run(); }); 28   } 29 30   pool.join_all(); 31 } We post twenty handlers in a loop (line 18). Each handler prints its identifier (line 19), and then sleeps for a second (lines 19-20). To run the handlers, we create a group of four threads, each of which calls run on the io_service (line 21) and wait for all the threads to finish (line 24). We define the macro PRINT_ARGS which writes output to the console in a thread-safe way, tagged with the current thread ID (line 7-10). We will use this macro in other examples too. 
To build this example, you must also link against libboost_thread, libboost_date_time, and in Posix environments, with libpthread too: $ g++ -g listing9_3.cpp -o listing9_3 -lboost_system -lboost_thread -lboost_date_time -pthread -std=c++11 One particular run of this program on my laptop produced the following output (with some lines snipped): [b5c15b40] Handler[0] [b6416b40] Handler[1] [b6c17b40] Handler[2] [b7418b40] Handler[3] [b5c15b40] Handler[4] [b6416b40] Handler[5] … [b6c17b40] Handler[13] [b7418b40] Handler[14] [b6416b40] Handler[15] [b5c15b40] Handler[16] [b6c17b40] Handler[17] [b7418b40] Handler[18] [b6416b40] Handler[19] You can see that the different handlers are executed by different threads (each thread ID marked differently). If any of the handlers threw an exception, it would be propagated across the call to the run function on the thread that was executing the handler. io_service::work Sometimes, it is useful to keep the thread pool started, even when there are no handlers to dispatch. Neither run nor run_one blocks on an empty queue. So in order for them to block waiting for a task, we have to indicate, in some way, that there is outstanding work to be performed. We do this by creating an instance of io_service::work, as shown in the following example: Listing 11.4: Using io_service::work to keep threads engaged 1 #include <boost/asio.hpp> 2 #include <memory> 3 #include <boost/thread.hpp> 4 #include <iostream> 5 namespace asio = boost::asio; 6 7 typedef std::unique_ptr<asio::io_service::work> work_ptr; 8 9 #define PRINT_ARGS(msg) do { … ... 14 15 int main() { 16   asio::io_service service; 17   // keep the workers occupied 18   work_ptr work(new asio::io_service::work(service)); 19   boost::mutex mtx; 20 21   // set up the worker threads in a thread group 22   boost::thread_group workers; 23   for (int i = 0; i < 3; ++i) { 24     workers.create_thread([&service, &mtx]() { 25                         PRINT_ARGS("Starting worker thread "); 26                         service.run(); 27                         PRINT_ARGS("Worker thread done"); 28                       }); 29   } 30 31   // Post work 32   for (int i = 0; i < 20; ++i) { 33     service.post( 34       [&service, &mtx]() { 35         PRINT_ARGS("Hello, world!"); 36         service.post([&mtx]() { 37                           PRINT_ARGS("Hola, mundo!"); 38                         }); 39       }); 40   } 41 42 work.reset(); // destroy work object: signals end of work 43   workers.join_all(); // wait for all worker threads to finish 44 } In this example, we create an object of io_service::work wrapped in a unique_ptr (line 18). We associate it with an io_service object by passing to the work constructor a reference to the io_service object. Note that, unlike listing 11.3, we create the worker threads first (lines 24-27) and then post the handlers (lines 33-39). However, the worker threads stay put waiting for the handlers because of the calls to run block (line 26). This happens because of the io_service::work object we created, which indicates that there is outstanding work in the io_service queue. As a result, even after all handlers are dispatched, the threads do not exit. By calling reset on the unique_ptr, wrapping the work object, its destructor is called, which notifies the io_service that all outstanding work is complete (line 42). The calls to run in the threads return and the program exits once all the threads are joined (line 43). 
We wrapped the work object in a unique_ptr to destroy it in an exception-safe way at a suitable point in the program. We omitted the definition of PRINT_ARGS here, refer to listing 11.3. Serialized and ordered execution via strands Thread pools allow handlers to be run concurrently. This means that handlers that access shared resources need to synchronize access to these resources. We already saw examples of this in listings 11.3 and 11.4, when we synchronized access to std::cout, which is a global object. As an alternative to writing synchronization code in handlers, which can make the handler code more complex, we can use strands. Think of a strand as a subsequence of the task queue with the constraint that no two handlers from the same strand ever run concurrently. The scheduling of other handlers in the queue, which are not in the strand, is not affected by the strand in any way. Let us look at an example of using strands: Listing 11.5: Using strands 1 #include <boost/asio.hpp> 2 #include <boost/thread.hpp> 3 #include <boost/date_time.hpp> 4 #include <cstdlib> 5 #include <iostream> 6 #include <ctime> 7 namespace asio = boost::asio; 8 #define PRINT_ARGS(msg) do { ... 13 14 int main() { 15   std::srand(std::time(0)); 16 asio::io_service service; 17   asio::io_service::strand strand(service); 18   boost::mutex mtx; 19   size_t regular = 0, on_strand = 0; 20 21 auto workFuncStrand = [&mtx, &on_strand] { 22           ++on_strand; 23           PRINT_ARGS(on_strand << ". Hello, from strand!"); 24           boost::this_thread::sleep( 25                       boost::posix_time::seconds(2)); 26         }; 27 28   auto workFunc = [&mtx, &regular] { 29                   PRINT_ARGS(++regular << ". Hello, world!"); 30                  boost::this_thread::sleep( 31                         boost::posix_time::seconds(2)); 32                 }; 33   // Post work 34   for (int i = 0; i < 15; ++i) { 35     if (rand() % 2 == 0) { 36       service.post(strand.wrap(workFuncStrand)); 37     } else { 38       service.post(workFunc); 39     } 40   } 41 42   // set up the worker threads in a thread group 43   boost::thread_group workers; 44   for (int i = 0; i < 3; ++i) { 45     workers.create_thread([&service, &mtx]() { 46                      PRINT_ARGS("Starting worker thread "); 47                       service.run(); 48                       PRINT_ARGS("Worker thread done"); 49                     }); 50   } 51 52   workers.join_all(); // wait for all worker threads to finish 53 } In this example, we create two handler functions: workFuncStrand (line 21) and workFunc (line 28). The lambda workFuncStrand captures a counter on_strand, increments it, and prints a message Hello, from strand!, prefixed with the value of the counter. The function workFunc captures another counter regular, increments it, and prints Hello, World!, prefixed with the counter. Both pause for 2 seconds before returning. To define and use a strand, we first create an object of io_service::strand associated with the io_service instance (line 17). Thereafter, we post all handlers that we want to be part of that strand by wrapping them using the wrap member function of the strand (line 36). Alternatively, we can post the handlers to the strand directly by using either the post or the dispatch member function of the strand, as shown in the following snippet: 33   for (int i = 0; i < 15; ++i) { 34     if (rand() % 2 == 0) { 35       strand.post(workFuncStrand); 37     } else { ... 
The wrap member function of strand returns a function object, which in turn calls dispatch on the strand to invoke the original handler. Initially, it is this function object rather than our original handler that is added to the queue. When duly dispatched, this invokes the original handler. There are no constraints on the order in which these wrapper handlers are dispatched, and therefore, the actual order in which the original handlers are invoked can be different from the order in which they were wrapped and posted. On the other hand, calling post or dispatch directly on the strand avoids an intermediate handler. Directly posting to a strand also guarantees that the handlers will be dispatched in the same order that they were posted, achieving a deterministic ordering of the handlers in the strand. The dispatch member of strand blocks until the handler is dispatched. The post member simply adds it to the strand and returns. Note that workFuncStrand increments on_strand without synchronization (line 22), while workFunc increments the counter regular within the PRINT_ARGS macro (line 29), which ensures that the increment happens in a critical section. The workFuncStrand handlers are posted to a strand and therefore are guaranteed to be serialized; hence no need for explicit synchronization. On the flip side, entire functions are serialized via strands and synchronizing smaller blocks of code is not possible. There is no serialization between the handlers running on the strand and other handlers; therefore, the access to global objects, like std::cout, must still be synchronized. The following is a sample output of running the preceding code: [b73b6b40] Starting worker thread [b73b6b40] 0. Hello, world from strand! [b6bb5b40] Starting worker thread [b6bb5b40] 1. Hello, world! [b63b4b40] Starting worker thread [b63b4b40] 2. Hello, world! [b73b6b40] 3. Hello, world from strand! [b6bb5b40] 5. Hello, world! [b63b4b40] 6. Hello, world! … [b6bb5b40] 14. Hello, world! [b63b4b40] 4. Hello, world from strand! [b63b4b40] 8. Hello, world from strand! [b63b4b40] 10. Hello, world from strand! [b63b4b40] 13. Hello, world from strand! [b6bb5b40] Worker thread done [b73b6b40] Worker thread done [b63b4b40] Worker thread done There were three distinct threads in the pool and the handlers from the strand were picked up by two of these three threads: initially, by thread ID b73b6b40, and later on, by thread ID b63b4b40. This also dispels a frequent misunderstanding that all handlers in a strand are dispatched by the same thread, which is clearly not the case. Different handlers in the same strand may be dispatched by different threads but will never run concurrently. Summary Asio is a well-designed library that can be used to write fast, nimble network servers that utilize the most optimal mechanisms for asynchronous I/O available on a system. It is an evolving library and is the basis for a Technical Specification that proposes to add a networking library to a future revision of the C++ Standard. In this article, we learned how to use the Boost Asio library as a task queue manager and leverage Asio's TCP and UDP interfaces to write programs that communicate over the network. Resources for Article: Further resources on this subject: Animation features in Unity 5 [article] Exploring and Interacting with Materials using Blueprints [article] A Simple Pathfinding Algorithm for a Maze [article]

Using the ArcPy Data Access Module with Feature Classes and Tables

Packt
10 Aug 2015
28 min read
In this article by Eric Pimpler, author of the book Programming ArcGIS with Python Cookbook - Second Edition, we will cover the following topics: Retrieving features from a feature class with SearchCursor Filtering records with a where clause Improving cursor performance with geometry tokens Inserting rows with InsertCursor Updating rows with UpdateCursor (For more resources related to this topic, see here.) We'll start this article with a basic question. What are cursors? Cursors are in-memory objects containing one or more rows of data from a table or feature class. Each row contains attributes from each field in a data source along with the geometry for each feature. Cursors allow you to search, add, insert, update, and delete data from tables and feature classes. The arcpy data access module or arcpy.da was introduced in ArcGIS 10.1 and contains methods that allow you to iterate through each row in a cursor. Various types of cursors can be created depending on the needs of developers. For example, search cursors can be created to read values from rows. Update cursors can be created to update values in rows or delete rows, and insert cursors can be created to insert new rows. There are a number of cursor improvements that have been introduced with the arcpy data access module. Prior to the development of ArcGIS 10.1, cursor performance was notoriously slow. Now, cursors are significantly faster. Esri has estimated that SearchCursors are up to 30 times faster, while InsertCursors are up to 12 times faster. In addition to these general performance improvements, the data access module also provides a number of new options that allow programmers to speed up processing. Rather than returning all the fields in a cursor, you can now specify that a subset of fields be returned. This increases the performance as less data needs to be returned. The same applies to geometry. Traditionally, when accessing the geometry of a feature, the entire geometric definition would be returned. You can now use geometry tokens to return a portion of the geometry rather than the full geometry of the feature. You can also use lists and tuples rather than using rows. There are also other new features, such as edit sessions and the ability to work with versions, domains, and subtypes. There are three cursor functions in arcpy.da. Each returns a cursor object with the same name as the function. SearchCursor() creates a read-only SearchCursor object containing rows from a table or feature class. InsertCursor() creates an InsertCursor object that can be used to insert new records into a table or feature class. UpdateCursor() returns a cursor object that can be used to edit or delete records from a table or feature class. Each of these cursor objects has methods to access rows in the cursor. You can see the relationship between the cursor functions, the objects they create, and how they are used, as follows: Function Object created Usage SearchCursor() SearchCursor This is a read-only view of data from a table or feature class InsertCursor() InsertCursor This adds rows to a table or feature class UpdateCursor() UpdateCursor This edits or deletes rows in a table or feature class The SearchCursor() function is used to return a SearchCursor object. This object can only be used to iterate through a set of rows returned for read-only purposes. No insertions, deletions, or updates can occur through this object. An optional where clause can be set to limit the rows returned. 
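Before we look at each of these in detail, the following is a minimal sketch of what creating each cursor type looks like. The geodatabase path and field names here are illustrative only and are not the exercise data used later in this article:

import arcpy

fc = r"C:\data\city.gdb\Schools"   # illustrative path, not the book's data

# SearchCursor: read-only iteration over the requested fields
with arcpy.da.SearchCursor(fc, ["Facility", "Name"]) as cursor:
    for row in cursor:
        print(row[1])

# InsertCursor: add a row whose values match the field order given above
with arcpy.da.InsertCursor(fc, ["Facility", "Name"]) as cursor:
    cursor.insertRow(("HIGH SCHOOL", "NEW CAMPUS"))

# UpdateCursor: edit or delete existing rows
with arcpy.da.UpdateCursor(fc, ["Facility", "Name"]) as cursor:
    for row in cursor:
        if row[1] == "NEW CAMPUS":
            row[0] = "MIDDLE SCHOOL"
            cursor.updateRow(row)

Each with block releases the lock on the data source when it exits, a point we will return to shortly.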
Once you've obtained a cursor instance, it is common to iterate the records, particularly with SearchCursor or UpdateCursor. There are some peculiarities that you need to understand when navigating the records in a cursor. Cursor navigation is forward-moving only. When a cursor is created, the pointer of the cursor sits just above the first row in the cursor. The first call to next() will move the pointer to the first row. Rather than calling the next() method, you can also use a for loop to process each of the records without the need to call next(). After performing whatever processing you need to do with a row, a subsequent call to next() will move the pointer to row 2. This process continues as long as you need to access additional rows. However, after a row has been visited, you can't go back a single record at a time. For instance, if the current row is row 3, you can't programmatically back up to row 2; you can only go forward. To revisit rows 1 and 2, you would need to either call the reset() method or recreate the cursor and move back through the object. As mentioned earlier, cursors are often navigated with for loops as well. In fact, this is a more common way to iterate a cursor and a more efficient way to code your scripts. (A diagram illustrating cursor navigation appears at this point in the original book.)

The InsertCursor() function is used to create an InsertCursor object that allows you to programmatically add new records to feature classes and tables. To insert rows, call the insertRow() method on this object. You can also retrieve a read-only tuple containing the field names in use by the cursor through the fields property. A lock is placed on the table or feature class being accessed through the cursor. Therefore, it is important to always design your script in a way that releases the cursor when you are done.

The UpdateCursor() function can be used to create an UpdateCursor object that can update and delete rows in a table or feature class. As is the case with InsertCursor, this function places a lock on the data while it's being edited or deleted. If the cursor is used inside a Python with statement, the lock will automatically be freed after the data has been processed. This hasn't always been the case: prior to ArcGIS 10.1, cursors were required to be manually released using the Python del statement. Once an instance of UpdateCursor has been obtained, you can call the updateRow() method to update records in tables or feature classes and the deleteRow() method to delete a row.

The subject of data locks requires a little more explanation. The insert and update cursors must obtain a lock on the data source they reference. This means that no other application can concurrently access this data source. Locks are a way of preventing multiple users from changing data at the same time and thus corrupting the data. When the InsertCursor() and UpdateCursor() functions are called in your code, Python attempts to acquire a lock on the data. This lock must be specifically released after the cursor has finished processing so that the running applications of other users, such as ArcMap or ArcCatalog, can access the data sources. If this isn't done, no other application will be able to access the data. Prior to ArcGIS 10.1 and the with statement, cursors had to be specifically unlocked through Python's del statement. Similarly, ArcMap and ArcCatalog also acquire data locks when updating or deleting data.
If a data source has been locked by either of these applications, your Python code will not be able to access the data. Therefore, the best practice is to close ArcMap and ArcCatalog before running any standalone Python scripts that use insert or update cursors. In this article, we're going to cover the use of cursors to access and edit tables and feature classes. However, many of the cursor concepts that existed before ArcGIS 10.1 still apply. Retrieving features from a feature class with SearchCursor There are many occasions when you need to retrieve rows from a table or feature class for read-only purposes. For example, you might want to generate a list of all land parcels in a city with a value greater than $100,000. In this case, you don't have any need to edit the data. Your needs are met simply by generating a list of rows that meet some sort of criteria. A SearchCursor object contains a read-only copy of rows from a table or feature class. These objects can also be filtered through the use of a where clause so that only a subset of the dataset is returned. Getting ready The SearchCursor() function is used to return a SearchCursor object. This object can only be used to iterate a set of rows returned for read-only purposes. No insertions, deletions, or updates can occur through this object. An optional where clause can be set to limit the rows returned. In this article, you will learn how to create a basic SearchCursor object on a feature class through the use of the SearchCursor() function. The SearchCursor object contains a fields property along with the next() and reset() methods. The fields property is a read-only structure in the form of a Python tuple, containing the fields requested from the feature class or table. You are going to hear the term tuple a lot in conjunction with cursors. If you haven't covered this topic before, tuples are a Python structure to store a sequence of data similar to Python lists. However, there are some important differences between Python tuples and lists. Tuples are defined as a sequence of values inside parentheses, while lists are defined as a sequence of values inside brackets. Unlike lists, tuples can't grow and shrink, which can be a very good thing in some cases when you want data values to occupy a specific position each time. This is the case with cursor objects that use tuples to store data from fields in tables and feature classes. How to do it… Follow these steps to learn how to retrieve rows from a table or feature class inside a SearchCursor object: Open IDLE and create a new script window. Save the script as C:ArcpyBookCh8SearchCursor.py. Import the arcpy.da module: import arcpy.da Set the workspace: arcpy.env.workspace = "c:/ArcpyBook/Ch8" Use a Python with statement to create a cursor: with arcpy.da.SearchCursor("Schools.shp",("Facility","Name")) as cursor: Loop through each row in SearchCursor and print the name of the school. Make sure you indent the for loop inside the with block:for row in sorted(cursor): print("School name: " + row[1]) The entire script should appear as follows: import arcpy.da arcpy.env.workspace = "c:/ArcpyBook/Ch8" with arcpy.da.SearchCursor("Schools.shp",("Facility","Name")) as cursor:    for row in sorted(cursor):        print("School name: " + row[1]) Save the script. You can check your work by examining the C:ArcpyBookcodeCh8SearchCursor_Step1.py solution file. Run the script. 
You should see the following output: School name: ALLAN School name: ALLISON School name: ANDREWS School name: BARANOFF School name: BARRINGTON School name: BARTON CREEK School name: BARTON HILLS School name: BATY School name: BECKER School name: BEE CAVE How it works… The with statement used with the SearchCursor() function will create, open, and close the cursor. So, you no longer have to be concerned with explicitly releasing the lock on the cursor as you did prior to ArcGIS 10.1. The first parameter passed into the SearchCursor() function is a feature class, represented by the Schools.shp file. The second parameter is a Python tuple containing a list of fields that we want returned in the cursor. For performance reasons, it is a best practice to limit the fields returned in the cursor to only those that you need to complete the task. Here, we've specified that only the Facility and Name fields should be returned. The SearchCursor object is stored in a variable called cursor. Inside the with block, we use a Python for loop to loop through each school returned. We also use the Python sorted() function to sort the contents of the cursor. To access the values from a field on the row, simply use the index number of the field you want to return. In this case, we want to return the contents of the Name column, which will be the 1 index number, since it is the second item in the tuple of field names that are returned. Filtering records with a where clause By default, SearchCursor will contain all rows in a table or feature class. However, in many cases, you will want to restrict the number of rows returned by some sort of criteria. Applying a filter through the use of a where clause limits the records returned. Getting ready By default, all rows from a table or feature class will be returned when you create a SearchCursor object. However, in many cases, you will want to restrict the records returned. You can do this by creating a query and passing it as a where clause parameter when calling the SearchCursor() function. In this article, you'll build on the script you created in the previous article by adding a where clause that restricts the records returned. How to do it… Follow these steps to apply a filter to a SearchCursor object that restricts the rows returned from a table or feature class: Open IDLE and load the SearchCursor.py script that you created in the previous recipe. Update the SearchCursor() function by adding a where clause that queries the facility field for records that have the HIGH SCHOOL text: with arcpy.da.SearchCursor("Schools.shp",("Facility","Name"), '"FACILITY" = 'HIGH SCHOOL'') as cursor: You can check your work by examining the C:ArcpyBookcodeCh8SearchCursor_Step2.py solution file. Save and run the script. The output will now be much smaller and restricted to high schools only: High school name: AKINS High school name: ALTERNATIVE LEARNING CENTER High school name: ANDERSON High school name: AUSTIN High school name: BOWIE High school name: CROCKETT High school name: DEL VALLE High school name: ELGIN High school name: GARZA High school name: HENDRICKSON High school name: JOHN B CONNALLY High school name: JOHNSTON High school name: LAGO VISTA How it works… The where clause parameter accepts any valid SQL query, and is used in this case to restrict the number of records that are returned. Improving cursor performance with geometry tokens Geometry tokens were introduced in ArcGIS 10.1 as a performance improvement for cursors. 
Rather than returning the entire geometry of a feature inside the cursor, only a portion of the geometry is returned. Returning the entire geometry of a feature can result in decreased cursor performance due to the amount of data that has to be returned. It's significantly faster to return only the specific portion of the geometry that is needed. Getting ready A token is provided as one of the fields in the field list passed into the constructor for a cursor and is in the SHAPE@<Part of Feature to be Returned> format. The only exception to this format is the OID@ token, which returns the object ID of the feature. The following code example retrieves only the X and Y coordinates of a feature: with arcpy.da.SearchCursor(fc, ("SHAPE@XY","Facility","Name")) as cursor: The following table lists the available geometry tokens. Not all cursors support the full list of tokens. Check the ArcGIS help files for information about the tokens supported by each cursor type. The SHAPE@ token returns the entire geometry of the feature. Use this carefully though, because it is an expensive operation to return the entire geometry of a feature and can dramatically affect performance. If you don't need the entire geometry, then do not include this token! In this article, you will use a geometry token to increase the performance of a cursor. You'll retrieve the X and Y coordinates of each land parcel from the parcels feature class along with some attribute information about the parcel. How to do it… Follow these steps to add a geometry token to a cursor, which should improve the performance of this object: Open IDLE and create a new script window. Save the script as C:ArcpyBookCh8GeometryToken.py. Import the arcpy.da module and the time module: import arcpy.da import time Set the workspace: arcpy.env.workspace = "c:/ArcpyBook/Ch8" We're going to measure how long it takes to execute the code using a geometry token. Add the start time for the script: start = time.clock() Use a Python with statement to create a cursor that includes the centroid of each feature as well as the ownership information stored in the PY_FULL_OW field: with arcpy.da.SearchCursor("coa_parcels.shp",("PY_FULL_OW","SHAPE@XY")) as cursor: Loop through each row in SearchCursor and print the name of the parcel and location. Make sure you indent the for loop inside the with block: for row in cursor: print("Parcel owner: {0} has a location of: {1}".format(row[0], row[1])) Measure the elapsed time: elapsed = (time.clock() - start) Print the execution time: print("Execution time: " + str(elapsed)) The entire script should appear as follows: import arcpy.da import time arcpy.env.workspace = "c:/ArcpyBook/Ch9" start = time.clock() with arcpy.da.SearchCursor("coa_parcels.shp",("PY_FULL_OW", "SHAPE@XY")) as cursor:    for row in cursor:        print("Parcel owner: {0} has a location of: {1}".format(row[0], row[1])) elapsed = (time.clock() - start) print("Execution time: " + str(elapsed)) You can check your work by examining the C:ArcpyBookcodeCh8GeometryToken.py solution file. Save the script. Run the script. You should see something similar to the following output. 
Note the execution time; your time will vary: Parcel owner: CITY OF AUSTIN ATTN REAL ESTATE DIVISION has a location of: (3110480.5197341456, 10070911.174956793) Parcel owner: CITY OF AUSTIN ATTN REAL ESTATE DIVISION has a location of: (3110670.413783513, 10070800.960865) Parcel owner: CITY OF AUSTIN has a location of: (3143925.0013213265, 10029388.97419636) Parcel owner: CITY OF AUSTIN % DOROTHY NELL ANDERSON ATTN BARRY LEE ANDERSON has a location of: (3134432.983822767, 10072192.047894118) Execution time: 9.08046185109 Now, we're going to measure the execution time if the entire geometry is returned instead of just the portion of the geometry that we need: Save a new copy of the script as C:ArcpyBookCh8GeometryTokenEntireGeometry.py. Change the SearchCursor() function to return the entire geometry using SHAPE@ instead of SHAPE@XY: with arcpy.da.SearchCursor("coa_parcels.shp",("PY_FULL_OW", "SHAPE@")) as cursor: You can check your work by examining the C:ArcpyBookcodeCh8GeometryTokenEntireGeometry.py solution file. Save and run the script. You should see the following output. Your time will vary from mine, but notice that the execution time is slower. In this case, it's only a little over a second slower, but we're only returning 2600 features. If the feature class were significantly larger, as many are, this would be amplified: Parcel owner: CITY OF AUSTIN ATTN REAL ESTATE DIVISION has a location of: <geoprocessing describe geometry object object at 0x06B9BE00> Parcel owner: CITY OF AUSTIN ATTN REAL ESTATE DIVISION has a location of: <geoprocessing describe geometry object object at 0x2400A700> Parcel owner: CITY OF AUSTIN has a location of: <geoprocessing describe geometry object object at 0x06B9BE00> Parcel owner: CITY OF AUSTIN % DOROTHY NELL ANDERSON ATTN BARRY LEE ANDERSON has a location of: <geoprocessing describe geometry object object at 0x2400A700> Execution time: 10.1211390896 How it works… A geometry token can be supplied as one of the field names supplied in the constructor for the cursor. These tokens are used to increase the performance of a cursor by returning only a portion of the geometry instead of the entire geometry. This can dramatically increase the performance of a cursor, particularly when you are working with large polyline or polygon datasets. If you only need specific properties of the geometry in your cursor, you should use these tokens. Inserting rows with InsertCursor You can insert a row into a table or feature class using an InsertCursor object. If you want to insert attribute values along with the new row, you'll need to supply the values in the order found in the attribute table. Getting ready The InsertCursor() function is used to create an InsertCursor object that allows you to programmatically add new records to feature classes and tables. The insertRow() method on the InsertCursor object adds the row. A row in the form of a list or tuple is passed into the insertRow() method. The values in the list must correspond to the field values defined when the InsertCursor object was created. Similar to instances that include other types of cursors, you can also limit the field names returned using the second parameter of the method. This function supports geometry tokens as well. The following code example illustrates how you can use InsertCursor to insert new rows into a feature class. Here, we insert two new wildfire points into the California feature class. The row values to be inserted are defined in a list variable. 
Then, an InsertCursor object is created, passing in the feature class and fields. Finally, the new rows are inserted into the feature class by using the insertRow() method: rowValues = [('Bastrop','N',3000,(-105.345,32.234)), ('Ft Davis','N', 456, (-109.456,33.468))] fc = "c:/data/wildfires.gdb/California" fields = ["FIRE_NAME", "FIRE_CONTAINED", "ACRES", "SHAPE@XY"] with arcpy.da.InsertCursor(fc, fields) as cursor: for row in rowValues:    cursor.insertRow(row) In this article, you will use InsertCursor to add wildfires retrieved from a .txt file into a point feature class. When inserting rows into a feature class, you will need to know how to add the geometric representation of a feature into the feature class. This can be accomplished by using InsertCursor along with two miscellaneous objects: Array and Point. In this exercise, we will add point features in the form of wildfire incidents to an empty point feature class. In addition to this, you will use Python file manipulation techniques to read the coordinate data from a text file. How to do it… We will be importing the North American wildland fire incident data from a single day in October, 2007. This data is contained in a comma-delimited text file containing one line for each fire incident on this particular day. Each fire incident has a latitude, longitude coordinate pair separated by commas, along with a confidence value. This data was derived by automated methods that use remote sensing data to detect the presence or absence of a wildfire. Confidence values can range from 0 to 100; higher numbers represent a greater confidence that this is indeed a wildfire: Open the file at C:\ArcpyBook\Ch8\WildfireData\NorthAmericaWildfires_2007275.txt and examine the contents. You will notice that this is a simple comma-delimited text file containing the longitude and latitude values for each fire along with a confidence value. We will use Python to read the contents of this file line by line and insert new point features into the FireIncidents feature class located in the C:\ArcpyBook\Ch8\WildfireData\WildlandFires.mdb personal geodatabase. Close the file. Open ArcCatalog. Navigate to C:\ArcpyBook\Ch8\WildfireData. You should see a personal geodatabase called WildlandFires. Open this geodatabase and you will see a point feature class called FireIncidents. Right now, this is an empty feature class. We will add features by reading the text file you examined earlier and inserting points. Right-click on FireIncidents and select Properties. Click on the Fields tab. The latitude/longitude values found in the file we examined earlier will be imported into the SHAPE field and the confidence values will be written to the CONFIDENCEVALUE field. Open IDLE and create a new script. Save the script to C:\ArcpyBook\Ch8\InsertWildfires.py. Import the arcpy module: import arcpy Set the workspace: arcpy.env.workspace = "C:/ArcpyBook/Ch8/WildfireData/WildlandFires.mdb" Open the text file and read all the lines into a list: f = open("C:/ArcpyBook/Ch8/WildfireData/NorthAmericaWildfires_2007275.txt","r") lstFires = f.readlines() Start a try block: try: Create an InsertCursor object using a with block. Make sure you indent inside the try statement. The cursor will be created in the FireIncidents feature class: with arcpy.da.InsertCursor("FireIncidents",("SHAPE@XY","CONFIDENCEVALUE")) as cur: Create a counter variable that will be used to print the progress of the script: cntr = 1 Loop through the text file line by line using a for loop.
Since the text file is comma-delimited, we'll use the Python split() function to separate each value into a list variable called vals. We'll then pull out the individual latitude, longitude, and confidence value items and assign them to variables. Finally, we'll place these values into a list variable called rowValue, which is then passed into the insertRow() function for the InsertCursor object, and we then print a message: for fire in lstFires:      if 'Latitude' in fire:        continue      vals = fire.split(",")      latitude = float(vals[0])      longitude = float(vals[1])      confid = int(vals[2])      rowValue = [(longitude,latitude),confid]      cur.insertRow(rowValue)      print("Record number " + str(cntr) + " written to feature class")      #arcpy.AddMessage("Record number" + str(cntr) + " written to feature class")      cntr = cntr + 1 Add the except block to print any errors that may occur: except Exception as e: print(e.message) Add a finally block to close the text file: finally: f.close() The entire script should appear as follows: import arcpy   arcpy.env.workspace = "C:/ArcpyBook/Ch8/WildfireData/WildlandFires.mdb" f = open("C:/ArcpyBook/Ch8/WildfireData/NorthAmericaWildfires_2007275.txt","r") lstFires = f.readlines() try: with arcpy.da.InsertCursor("FireIncidents", ("SHAPE@XY","CONFIDENCEVALUE")) as cur:    cntr = 1    for fire in lstFires:      if 'Latitude' in fire:        continue      vals = fire.split(",")      latitude = float(vals[0])      longitude = float(vals[1])      confid = int(vals[2])      rowValue = [(longitude,latitude),confid]      cur.insertRow(rowValue)      print("Record number " + str(cntr) + " written to feature class")      #arcpy.AddMessage("Record number" + str(cntr) + "       written to feature class")      cntr = cntr + 1 except Exception as e: print(e.message) finally: f.close() You can check your work by examining the C:ArcpyBookcodeCh8InsertWildfires.py solution file. Save and run the script. You should see messages being written to the output window as the script runs: Record number: 406 written to feature class Record number: 407 written to feature class Record number: 408 written to feature class Record number: 409 written to feature class Record number: 410 written to feature class Record number: 411 written to feature class Open ArcMap and add the FireIncidents feature class to the table of contents. The points should be visible, as shown in the following screenshot: You may want to add a basemap to provide some reference for the data. In ArcMap, click on the Add Basemap button and select a basemap from the gallery. How it works… Some additional explanation may be needed here. The lstFires variable contains a list of all the wildfires that were contained in the comma-delimited text file. The for loop will loop through each of these records one by one, inserting each individual record into the fire variable. We also include an if statement that is used to skip the first record in the file, which serves as the header. As I explained earlier, we then pull out the individual latitude, longitude, and confidence value items from the vals variable, which is just a Python list object and assign them to variables called latitude, longitude, and confid. We then place these values into a new list variable called rowValue in the order that we defined when we created InsertCursor. Thus, the latitude and longitude pair should be placed first followed by the confidence value. 
Finally, we call the insertRow() function on the InsertCursor object assigned to the cur variable, passing in the new rowValue variable. We close by printing a message that indicates the progress of the script and also create the except and finally blocks to handle errors and close the text file. Placing the f.close() call in the finally block ensures that it will execute and close the file even if there is an error in the previous try statement. Updating rows with UpdateCursor If you need to edit or delete rows from a table or feature class, you can use UpdateCursor. As is the case with SearchCursor, the contents of UpdateCursor can be limited through the use of a where clause. Getting ready The UpdateCursor() function can be used to either update or delete rows in a table or feature class. An UpdateCursor object is returned from a call to this function. The cursor places a lock on the data while it's being edited or deleted; if the cursor is used inside a Python with statement, the lock will automatically be released after the data has been processed. This hasn't always been the case. Previous versions of cursors were required to be manually released using the Python del statement. Once an instance of UpdateCursor has been obtained, you can then call the updateRow() method to update records in tables or feature classes, and the deleteRow() method can be used to delete a row. In this article, you're going to write a script that updates each feature in the FireIncidents feature class by assigning a value of poor, fair, good, or excellent to a new field that is more descriptive of the confidence values, using an UpdateCursor. Prior to updating the records, your script will add the new field to the FireIncidents feature class. How to do it… Follow these steps to create an UpdateCursor object that will be used to edit rows in a feature class: Open IDLE and create a new script. Save the script to C:\ArcpyBook\Ch8\UpdateWildfires.py. Import the arcpy module: import arcpy Set the workspace: arcpy.env.workspace = "C:/ArcpyBook/Ch8/WildfireData/WildlandFires.mdb" Start a try block: try: Add a new field called CONFID_RATING to the FireIncidents feature class. Make sure to indent inside the try statement: arcpy.AddField_management("FireIncidents","CONFID_RATING", "TEXT","10") print("CONFID_RATING field added to FireIncidents") Create a new instance of UpdateCursor inside a with block: with arcpy.da.UpdateCursor("FireIncidents", ("CONFIDENCEVALUE","CONFID_RATING")) as cursor: Create a counter variable that will be used to print the progress of the script. Make sure you indent this line of code and all the lines of code that follow inside the with block: cntr = 1 Loop through each of the rows in the FireIncidents feature class.
Update the CONFID_RATING field according to the following guidelines:     Confidence value 0 to 40 = POOR     Confidence value 41 to 60 = FAIR     Confidence value 61 to 85 = GOOD     Confidence value 86 to 100 = EXCELLENT This can be translated in the following block of code:    for row in cursor:      # update the confid_rating field      if row[0] <= 40:        row[1] = 'POOR'      elif row[0] > 40 and row[0] <= 60:        row[1] = 'FAIR'      elif row[0] > 60 and row[0] <= 85:        row[1] = 'GOOD'      else:        row[1] = 'EXCELLENT'      cursor.updateRow(row)                       print("Record number " + str(cntr) + " updated")      cntr = cntr + 1 Add the except block to print any errors that may occur: except Exception as e: print(e.message) The entire script should appear as follows: import arcpy   arcpy.env.workspace = "C:/ArcpyBook/Ch8/WildfireData/WildlandFires.mdb" try: #create a new field to hold the values arcpy.AddField_management("FireIncidents", "CONFID_RATING","TEXT","10") print("CONFID_RATING field added to FireIncidents") with arcpy.da.UpdateCursor("FireIncidents",("CONFIDENCEVALUE", "CONFID_RATING")) as cursor:    cntr = 1    for row in cursor:      # update the confid_rating field      if row[0] <= 40:        row[1] = 'POOR'      elif row[0] > 40 and row[0] <= 60:        row[1] = 'FAIR'      elif row[0] > 60 and row[0] <= 85:        row[1] = 'GOOD'      else:        row[1] = 'EXCELLENT'      cursor.updateRow(row)                       print("Record number " + str(cntr) + " updated")      cntr = cntr + 1 except Exception as e: print(e.message) You can check your work by examining the C:ArcpyBookcodeCh8UpdateWildfires.py solution file. Save and run the script. You should see messages being written to the output window as the script runs: Record number 406 updated Record number 407 updated Record number 408 updated Record number 409 updated Record number 410 updated Open ArcMap and add the FireIncidents feature class. Open the attribute table and you should see that a new CONFID_RATING field has been added and populated by UpdateCursor: When you insert, update, or delete data in cursors, the changes are permanent and can't be undone if you're working outside an edit session. However, with the new edit session functionality provided by ArcGIS 10.1, you can now make these changes inside an edit session to avoid these problems. We'll cover edit sessions soon. How it works… In this case, we've used UpdateCursor to update each of the features in a feature class. We first used the Add Field tool to add a new field called CONFID_RATING, which will hold new values that we assign based on values found in another field. The groups are poor, fair, good, and excellent and are based on numeric values found in the CONFIDENCEVALUE field. We then created a new instance of UpdateCursor based on the FireIncidents feature class, and returned the two fields mentioned previously. The script then loops through each of the features and assigns a value of poor, fair, good, or excellent to the CONFID_RATING field (row[1]), based on the numeric value found in CONFIDENCEVALUE. A Python if/elif/else structure is used to control the flow of the script based on the numeric value. The value for CONFID_RATING is then committed to the feature class by passing the row variable into the updateRow() method. 
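The deleteRow() method mentioned in the Getting ready section works the same way, and the optional where clause can keep the cursor away from rows that should be left untouched. The following is a minimal sketch, not part of the original exercise; it assumes the same FireIncidents feature class and CONFIDENCEVALUE field used previously, and the threshold of 10 is an arbitrary choice made purely for illustration:

import arcpy

arcpy.env.workspace = "C:/ArcpyBook/Ch8/WildfireData/WildlandFires.mdb"

# Hypothetical clean-up pass: delete low-confidence incidents.
# The where clause limits the rows the cursor returns.
with arcpy.da.UpdateCursor("FireIncidents", ("CONFIDENCEVALUE",),
                           where_clause="CONFIDENCEVALUE < 10") as cursor:
    for row in cursor:
        cursor.deleteRow()   # removes the current row from the feature class

print("Low-confidence records deleted")

As noted earlier, changes made by a cursor are permanent outside an edit session, so wrap a deletion like this in an edit session if you need the ability to roll it back.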
Summary In this article, we studied how to retrieve features from a feature class with SearchCursor, filter records with a where clause, improve cursor performance with geometry tokens, insert rows with InsertCursor, and update rows with UpdateCursor. Resources for Article: Further resources on this subject: Adding Graphics to the Map [article] Introduction to Mobile Web ArcGIS Development [article] Python functions – Avoid repeating code [article]
More about Julia

Packt
21 Jul 2015
28 min read
In this article by Malcolm Sherrington, author of the book Mastering Julia, we ask: why write a book on Julia when the language has not yet reached the v1.0 stage? This was the first question which needed to be addressed when deciding on the contents and philosophy behind the book. (For more resources related to this topic, see here.) Julia at the time was at v0.2; it is now soon to achieve a stable v0.4, and already the blueprint for v0.5 is being touted. There were some common misconceptions which I wished to address: It is a language designed for geeks. Its main attribute, possibly its only one, is its speed. It is a scientific language, primarily a MATLAB clone. It is not as easy to use as alternatives such as Python and R. There is not enough library support to tackle enterprise solutions. In fact, none of these apply to Julia. True, it is a relatively young programming language. The initial design work on the Julia project began at MIT in August 2009, and by February 2012 it became open source. It is largely the work of three developers: Stefan Karpinski, Jeff Bezanson, and Viral Shah. Initially, Julia was envisaged by the designers as a scientific language sufficiently rapid to remove the necessity of modeling in an interactive language and subsequently having to redevelop in a compiled language, such as C or Fortran. To achieve this, Julia code would need to be transformed to the underlying machine code of the computer using the low level virtual machine (LLVM) compilation system, at the time itself a new project. This was a masterstroke. LLVM is now the basis of a variety of systems: the Apple C compiler (clang) uses it, Google's V8 JavaScript engine and Mozilla's Rust language use it, and Python is attempting to achieve significant increases in speed with its numba module. In Julia, LLVM always works; there are no exceptions because it has to. When launched, possibly the community itself saw Julia as a replacement for MATLAB, but that proved not to be the case. The syntax of Julia is similar to MATLAB, so much so that anyone competent in the latter can easily learn Julia, but it is a much richer language with many significant differences. The task of the book was to focus on these. In particular, my target audience is the data scientist and programmer analyst, but there is sufficient here for the "jobbing" C++ and Java programmer. Julia's features The Julia programming language is free and open source (MIT licensed) and the source is available on GitHub. To the veteran programmer, it has a look and feel similar to MATLAB. Blocks created by for, while, and if statements are all terminated by end rather than by endfor, endwhile, and endif or using the familiar {} style syntax. However, it is not a MATLAB clone, and sources written for MATLAB will not run on Julia.
The following are some of the Julia's features: Designed for parallelism and distributed computation (multicore and cluster) C functions called directly (no wrappers or special APIs needed) Powerful shell-like capabilities for managing other processes Lisp-like macros and other metaprogramming facilities User-defined types are as fast and compact as built-ins LLVM-based, just-in-time (JIT) compiler that allows Julia to approach and often match the performance of C/C++ An extensive mathematical function library (written in Julia) Integrated mature, best-of-breed C and Fortran libraries for linear algebra, random number generation, FFTs, and string processing Julia's core is implemented in C and C++, its parser in Scheme, and the LLVM compiler framework is used for JIT generation of machine code. The standard library is written in Julia itself using the Node.js's libuv library for efficient, cross-platform I/O. Julia has a rich language of types for constructing and describing objects that can also optionally be used to make type declarations. The ability to define function behavior across many combinations of argument types via multiple dispatch which is a key cornerstone of the language design. Julia can utilize code in other programming languages by a directly calling routines written in C or Fortran and stored in shared libraries or DLLs. This is a feature of the language syntax. In addition it is possible to interact with Python via the PyCall and this is used in the implementation of the IJulia programming environment. A quick look at some Julia To get feel for programming in Julia let's look at an example which uses random numbers to price an Asian derivative on the options market. A share option is the right to purchase a specific stock at a nominated price sometime in the future. The person granting the option is called the grantor and the person who has the benefit of the option is the beneficiary. At the time the option matures the beneficiary may choose to exercise the option if it is in his/her interest the grantor is then obliged to complete the contract. The following snippet is part of the calculation and computes a single trial and uses the Winston package to display the trajectory: using Winston; S0 = 100;     # Spot price K   = 102;     # Strike price r   = 0.05;     # Risk free rate q   = 0.0;      # Dividend yield v   = 0.2;     # Volatility tma = 0.25;     # Time to maturity T = 100;       # Number of time steps dt = tma/T;     # Time increment S = zeros(Float64,T) S[1] = S0; dW = randn(T)*sqrt(dt); [ S[t] = S[t-1] * (1 + (r - q - 0.5*v*v)*dt + v*dW[t] + 0.5*v*v*dW[t]*dW[t]) for t=2:T ]   x = linspace(1, T, length(T)); p = FramedPlot(title = "Random Walk, drift 5%, volatility 2%") add(p, Curve(x,S,color="red")) display(p) Plot one track so only compute a vector S of T elements. The stochastic variance dW is computed in a single vectorized statement. The track S is computed using a "list comprehension". The array x is created using linspace to define a linear absicca for the plot. Using the Winston package to produce the display, which only requires 3 statements: to define the plot space, add a curve to it and display the plot as shown in the following figure: Generating Julia sets Both the Mandelbrot set and Julia set (for a given constant z0) are the sets of all z (complex number) for which the iteration z → z*z + z0 does not diverge to infinity. The Mandelbrot set is those z0 for which the Julia set is connected. 
We create a file jset.jl and its contents defines the function to generate a Julia set. functionjuliaset(z, z0, nmax::Int64) for n = 1:nmax if abs(z) > 2 (return n-1) end z = z^2 + z0 end returnnmax end Here z and z0 are complex values and nmax is the number of trials to make before returning. If the modulus of the complex number z gets above 2 then it can be shown that it will increase without limit. The function returns the number of iterations until the modulus test succeeds or else nmax. Also we will write a second file pgmfile.jl to handling displaying the Julia set. functioncreate_pgmfile(img, outf::String) s = open(outf, "w") write(s, "P5n") n, m = size(img) write(s, "$m $n 255n") for i=1:n, j=1:m    p = img[i,j]    write(s, uint8(p)) end close(s) end It is quite easy to create a simple disk file using the portable bitmap (netpbm) format. This consists of a "magic" number P1 - P6, followed on the next line the image height, width and a maximum color value which must be greater than 0 and less than 65536; all of these are ASCII values not binary. Then follows the image values (height x width) which make be ASCII for P1, P2, P3 or binary for P4, P5, P6. There are three different types of portable bitmap; B/W (P1/P4), Grayscale (P2/P5), and Colour (P3/P6). The function create_pgmfile() creates a binary grayscale file (magic number = P5) from an image matrix where the values are written as Uint8. Notice that the for loop defines the indices i, j in a single statement with correspondingly only one end statement. The image matrix is output in column order which matches the way it is stored in Julia. So the main program looks like: include("jset.jl") include("pgmfile.jl") h = 400; w = 800; M = Array(Int64, h, w); c0 = -0.8+0.16im; pgm_name = "juliaset.pgm";   t0 = time(); for y=1:h, x=1:w c = complex((x-w/2)/(w/2), (y-h/2)/(w/2)) M[y,x] = juliaset(c, c0, 256) end t1 = time(); create_pgmfile(M, pgm_name); print("Written $pgm_namenFinished in $(t1-t0) seconds.n"); This is how the previous code works: We define an array N of type Int64 to hold the return values from the juliaset function. The constant c0 is arbitrary, different values of c0 will produce different Julia sets. The starting complex number is constructed from the (x,y) coordinates and scaled to the half width. We 'cheat' a little by defining the maximum number of iterations as 256. Because we are writing byte values (Uint8) and value which remains bounded will be 256 and since overflow values wrap around will be output as 0 (black). The script defines a main program in a function jmain(): julia>jmain Written juliaset.pgm Finished in 0.458 seconds # => (on my laptop) Julia type system and multiple dispatch Julia is not an object-oriented language so when we speak of object they are a different sort of data structure to those in traditional O-O languages. Julia does not allow types to have methods or so it is not possible to create subtypes which inherit methods. While this might seem restrictive it does permit methods to use a multiple dispatch call structure rather than the single dispatch system employed in object orientated ones. Coupled with Julia's system of types, multiple dispatch is extremely powerful. Moreover it is a more logical approach for data scientists and scientific programmers and if for no other reason exposing this to you the analyst/programmer is a reason to use Julia. A function is an object that maps a tuple of arguments to a return value. 
In the case where the arguments are not valid the function should handle the situation cleanly by catching the error and handling it or throw an exception. When a function is applied to its argument tuple it selects the appropriate method and this process is called dispatch. In traditional object-oriented languages the method chosen is based only on the object type and this paradigm is termed single-dispatch. With Julia the combination of all a functions argument determine which method is chosen, this is the basis of multiple dispatch. To the scientific programmer this all seems very natural. It makes little sense in most circumstances for one argument to be more important than the others. In Julia all functions and operators, which are also functions, use multiple dispatch. The methods chosen for any combination of operators. For example looking at the methods of the power operator (^): julia> methods(^) # 43 methods for generic function "^": ^(x::Bool,y::Bool) at bool.jl:41 ^(x::BigInt,y::Bool) at gmp.jl:314 ^(x::Integer,y::Bool) at bool.jl:42     ^(A::Array{T,2},p::Number) at linalg/dense.jl:180 ^(::MathConst{:e},x::AbstractArray{T,2}) at constants.jl:87 We can see that there are 43 methods for ^ and the file and line where the methods is defined are given too. Because any untyped argument is designed as type Any, it is possible to define a set of function methods such that there is no unique most specific method applicable to some combinations of arguments. julia> pow(a,b::Int64) = a^b; julia> pow(a::Float64,b) = a^b; Warning: New definition pow(Float64,Any) at /Applications/JuliaStudio.app/Contents/Resources/Console/Console.jl:1 is ambiguous with: pow(Any,Int64) at /Applications/JuliaStudio.app/Contents/Resources/Console/Console.jl:1. To fix, define pow(Float64,Int64) before the new definition. A call of pow(3.5, 2) can be handled by either function. In this case they will give the same result by only because of the function bodies and Julia can't know that. Working with Python The ability for Julia with call code written in other languages is one of its main strengths. From its inception Julia had to play "catchup" and a key feature was it makes calling code written in C, and by implication Fortran, very easy. The code to be called must be available as a shared library rather than just a standalone object file. There is a zero-overhead in the call, meaning that it is reduced to a single machine instruction in the LLVM compilation. In addition Python models can be accessed via the PyCall package which provides a @pyimport macro that mimics a Python import statement. This imports a Python module and provides Julia wrappers for all of the functions and constants including automatic conversion of types between Julia and Python. This work has led to the creation of an IJulia kernel to the IPython IDE which now is a principle component of the Jupyter project. In Pycall, type conversions are automatically performed for numeric, boolean, string, IO streams plus all tuples, arrays and dictionaries of these types. julia> using PyCall julia> @pyimport scipy.optimize as so julia> @pyimport scipy.integrate as si julia> so.ridder(x -> x*cos(x), 1, pi); # => 1.570796326795 julia> si.quad(x -> x*sin(x), 1, pi)[1]; # => 2.840423974650 In the preceding commands, the Python optimize and integrate modules are imported and functions in those modules called from the Julia REPL. 
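For comparison, here is roughly what those two calls look like in plain Python; this is a minimal sketch that assumes SciPy is installed, and PyCall is essentially wrapping calls like these and converting the results back into Julia values:

import math
from scipy import optimize, integrate

# Root of x*cos(x) between 1 and pi -- approximately pi/2
root = optimize.ridder(lambda x: x * math.cos(x), 1, math.pi)
# Integral of x*sin(x) from 1 to pi -- approximately 2.8404
area, err = integrate.quad(lambda x: x * math.sin(x), 1, math.pi)
print(root, area)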
One difference imposed on the package is that calls using the Python object notation are not possible from Julia, so these are referenced using an array-style notation po[:atr] rather po.atr, where po is a PyObject and atr is an attribute. It is also easy to use the Python matplotlib module to display simple (and complex) graphs. @pyimport matplotlib.pyplot as plt x = linspace(0,2*pi,1000); y = sin(3*x + 4*cos(2*x)); plt.plot(x, y, color="red", linewidth=2.0, linestyle="--") 1-element Array{Any,1}: PyObject<matplotlib.lines.Line2D object at 0x0000000027652358> plt.show() Notice that keywords can also be passed such as the color, line width and the preceding style. Simple statistics using dataframes Julia implements various approaches for handling data held on disk. These may be 'normal' files such as text files, CSV and other delimited files, or in SQL and NoSQL databases. Also there is an implementation of dataframe support similar to that provided in R and via the pandas module in Python. The following looks at the Apple share prices from 2000 to 200, using a CSV file with provides opening, closing, high and low prices together with trading volumes over the day. using DataFrames, StatsBase aapl = readtable("AAPL-Short.csv");   naapl = size(aapl)[1] m1 = int(mean((aapl[:Volume]))); # => 6306547 The data is read into a DataFrame and we can estimate the mean (m1). For the volume, it is possible to cast it as an integer as this makes more sense. We can do this by creating a weighting vector. using Match wts = zeros(naapl); for i in 1:naapl    dt = aapl[:Date][i]    wts[i] = @match dt begin          r"^2000" => 1.0          r"^2001" => 2.0          r"^2002" => 4.0          _       => 0.0    end end;   wv = WeightVec(wts); m2 = int(mean(aapl[:Volume], wv)); # => 6012863 Computing weighted statistical metrics it is possible to 'trim' off the outliers from each end of the data. Returning to the closing prices: mean(aapl[:Close]);         # => 37.1255 mean(aapl[:Close], wv);     # => 26.9944 trimmean(aapl[:Close], 0.1); # => 34.3951 trimean() is the trimmed mean where 5 percent is taken from each end. std(aapl[:Close]);           # => 34.1186 skewness(aapl[:Close])       # => 1.6911 kurtosis(aapl[:Close])       # => 1.3820 As well as second moments such as standard deviation, StatsBase provides a generic moments() function and specific instances based on these such as for skewness (third) and kurtosis (fourth). It is also possible to provide some summary statistics: summarystats(aapl[:Close])   Summary Stats: Mean:         37.125505 Minimum:     13.590000 1st Quartile: 17.735000 Median:       21.490000 3rd Quartile: 31.615000 Maximum:     144.190000 The first and third quartiles related to the 25 percent and 75 percent percentiles for a finer granularity we can use the percentile() function. percentile(aapl[:Close],5); # => 14.4855 percentile(aapl[:Close],95); # => 118.934 MySQL access using PyCall We have seen previously that Python can be used for plotting via the PyPlot package which interfaces with matplotlib. In fact the ability to easily call Python modules is a very powerful feature in Julia and we can use this as an alternative method for connecting to databases. Any database which can be manipulated by Python is also available to Julia. In particular since the DBD driver for MySQL is not fully DBT compliant, let's look this approach to running some queries. Our current MySQL setup already has the Chinook dataset loaded some we will execute a query to list the Genre table. 
In Python we will first need to download the MySQL Connector module. For Anaconda this needs to be using the source (independent) distribution, rather than a binary package and the installation performed using the setup.py file. The query (in Python) to list the Genre table would be: import mysql.connector as mc cnx = mc.connect(user="malcolm", password="mypasswd") csr = cnx.cursor() qry = "SELECT * FROM Chinook.Genre" csr.execute(qry) for vals in csr:    print(vals)   (1, u'Rock') (2, u'Jazz') (3, u'Metal') (4, u'Alternative & Punk') (5, u'Rock And Roll') ... ... csr.close() cnx.close() We can execute the same in Julia by using the PyCall to the mysql.connector module and the form of the coding is remarkably similar: using PyCall @pyimport mysql.connector as mc   cnx = mc.connect (user="malcolm", password="mypasswd"); csr = cnx[:cursor]() query = "SELECT * FROM Chinook.Genre" csr[:execute](query)   for vals in csr    id   = vals[1]    genre = vals[2]    @printf “ID: %2d, %sn” id genre end ID: 1, Rock ID: 2, Jazz ID: 3, Metal ID: 4, Alternative & Punk ID: 5, Rock And Roll ... ... csr[:close]() cnx[:close]() Note that the form of the call is a little different from the corresponding Python method, since Julia is not object-oriented the methods for a Python object are constructed as an array of symbols. For example the Python csr.execute(qry) routine is called in Julia as csr[:execute](qry). Also be aware that although Python arrays are zero-based this is translated to one-based by PyCall, so the first values is referenced as vals[1]. Scientific programming with Julia Julia was originally developed as a replacement for MATLAB with a focus on scientific programming. There are modules which are concerned with linear algebra, signal processing, mathematical calculus, optimization problems, and stochastic simulation. The following is a subject dear to my heart: the solution of differential equations. Differential equations are those which involve terms which involve the rates of change of variates as well as the variates themselves. They arise naturally in a number of fields, notably dynamics and when the changes are with respect to one dependent variable, often time, the systems are called ordinary differential equations. If more than a single dependent variable is involved, then they are termed partial differential equations. Julia supports the solution of ordinary differential equations thorough a couple of packages ODE and Sundials. The former (ODE) consists of routines written solely in Julia whereas Sundials is a wrapper package around a shared library. ODE exports a set of adaptive solvers; adaptive meaning that the 'step' size of the algorithm changes algorithmically to reduce the error estimate to be below a certain threshold. The calls take the form odeXY, where X is the order of the solver and Y the error control. ode23: Third order adaptive solver with third order error control ode45: Fourth order adaptive solver with fifth order error control ode78: Seventh order adaptive solver with eighth order error control To solve the explicit ODE defined as a vectorize set of equations dy/dt = F(t,y), all routines of which have the same basic form: (tout, yout) = odeXY(F, y0, tspan). As an example, I will look at it as a linear three-species food chain model where the lowest-level prey x is preyed upon by a mid-level species y, which, in turn, is preyed upon by a top level predator z. This is an extension of the Lotka-Volterra system from to three species. 
Examples might be three-species ecosystems such as mouse-snake-owl, vegetation-rabbits-foxes, and worm-sparrow-falcon. x' = a*x − b*x*y y' = −c*y + d*x*y − e*y*z z' = −f*z + g*y*z #for a,b,c,d,e,f g > 0 Where a, b, c, d are in the two-species Lotka-Volterra equations: e represents the effect of predation on species y by species z f represents the natural death rate of the predator z in the absence of prey g represents the efficiency and propagation rate of the predator z in the presence of prey This translates to the following set of linear equations: x[1] = p[1]*x[1] - p[2]*x[1]*x[2] x[2] = -p[3]*x[2] + p[4]*x[1]*x[2] - p[5]*x[2]*x[3] x[3] = -p[6]*x[3] + p[7]*x[2]*x[3] It is slightly over specified since one of the parameters can be removed by rescaling the timescale. We define the function F as follows: function F(t,x,p) d1 = p[1]*x[1] - p[2]*x[1]*x[2] d2 = -p[3]*x[2] + p[4]*x[1]*x[2] - p[5]*x[2]*x[3] d3 = -p[6]*x[3] + p[7]*x[2]*x[3] [d1, d2, d3] end This takes the time range, vectors of the independent variables and of the coefficients and returns a vector of the derivative estimates: p = ones(7); # Choose all parameters as 1.0 x0 = [0.5, 1.0, 2.0]; # Setup the initial conditions tspan = [0.0:0.1:10.0]; # and the time range Solve the equations by calling the ode23 routine. This returns a matrix of the solutions in a columnar order which we extract and display using Winston: (t,x) = ODE.ode23((t,x) -> F(t,x,pp), x0, tspan);   n = length(t); y1 = zeros(n); [y1[i] = x[i][1] for i = 1:n]; y2 = zeros(n); [y2[i] = x[i][2] for i = 1:n]; y3 = zeros(n); [y3[i] = x[i][3] for i = 1:n];   using Winston plot(t,y1,"b.",t,y2,"g-.",t,y3,"r--") This is shown in the following figure: Graphics with Gadlfy Julia now provides a vast array of graphics packages. The "popular" ones may be thought of as Winston, PyPlot and Gadfly and there is also an interface to the increasingly more popular online system Plot.ly. Gadfly is a large and complex package and provides great flexibility in the range and breadth of the visualizations possible in Julia. It is equivalent to the R module ggplot2 and similarly is based on the seminal work The Grammar of Graphics by Leland Wilkinson. The package was written by Daniel Jones and the source on GitHub contains numerous examples of visual displays together with the accompanying code. An entire text could be devoted just to Gadfly so I can only point out some of the main features here and encourage the reader interested in print standard graphics in Julia to refer to the online website at http://gadflyjl.org. The standard call is to the plot() function with creates a graph on the display device via a browser either directly or under the control of IJulia if that is being used as an IDE. It is possible to assign the result of plot() to a variable and invoke this using display(), In this way output can be written to files including: SVG, SVGJS/D3 PDF, and PNG. dd = plot(x = rand(10), y = rand(10)); draw(SVG(“random-pts.svg”, 15cm, 12cm) , dd); Notice that if writing to a backend, the display size is provided, this can be specified in units of cm and inch. Gadfly works well with C libraries of cairo, pango and, fontconfig installed. It will produce SVG and SVGJS graphics but for PNG, PostScript (PS) and PDF cairo is required. Also complex text layouts are more accurate when pango and fontconfig are available. 
The plot() call can operate on three different data sources: Dataframes Functions and expressions Arrays and collections Unless otherwise specified the type of graph produced is a scatter diagram. The ability to work directly with data frames is especially useful. To illustrate this let's look at the GCSE result set. Recall that this is available as part of the RDatasets suite of source data. using Gadfly, RDatasets, DataFrames; set_default_plot_size(20cm, 12cm); mlmf = dataset("mlmRev","Gcsemv") df = mlmf[complete_cases(mlmf), :] After extracting the data we need to operate with values with do not have any NA values, so we use the complete_cases() function to create a subset of the original data. names(df) 5-element Array{Symbol,1}: ; # => [ :School, :Student, :Gender, :Written, :Course ] If we wish to view the data values for the exam and course work results and at the same time differentiate between boys and girls, this can be displayed by: plot(df, x="Course", y="Written", color="Gender") The JuliaGPU community group Many Julia modules build on the work of other authors working within the same field of study and these have classified as community groups (http://julialang.org/community). Probably the most prolific is the statistics group: JuliaStats (http://juliastats.github.io). One of the main themes in my professional career has been working with hardware to speed up the computing process. In my work on satellite data I worked with the STAR-100 array processor, and once back in the UK, used Silicon Graphics for 3D rendering of medical data . Currently I am interested in using NVIDIA GPUs in financial scenarios and risk calculations. Much of this work has been coded in C, with domain specific languages to program the ancillary hardware. It is now possible to do much of this in Julia with packages contained in the JuliaGPU group. This has routines for both CUDA and OpenCL, at present covering: Basic runtime: CUDA.jl, CUDArt.jl, OpenCL.jl BLAS integration: CUBLAS.jl, CLBLAS FFT operations: CUFFT.jl, CLFFT.jl The CU*-style routines only applies to NVIDIA cards and requires the CUDA SDK to be installed, whereas CL*-functions can be used with variety of GPU s. The CLFFT and CLBLAS do require some additional libraries to be present but we can use OpenCL as is. The following is output from a Lenovo Z50 laptop with an i7 processor and both Intel and NVIDIA graphics chips. julia> using OpenCL julia> OpenCL.devices() OpenCL.Platform(Intel(R) HDGraphics 4400) OpenCL.Platform(Intel(R) Core(TM) i7-4510U CPU) OpenCL.Platform(GeForce 840M on NVIDIA CUDA) To do some calculations we need to define a kernel to be loaded on the GPU. 
The following multiplies two 1024x1024 matrices of Gaussian random numbers: import OpenCL const cl = OpenCL const kernel_source = """ __kernel void mmul( const int Mdim, const int Ndim, const int Pdim, __global float* A, __global float* B, __global float* C) {    int k;    int i = get_global_id(0);    int j = get_global_id(1);    float tmp;    if ((i < Ndim) && (j < Mdim)) {      tmp = 0.0f;      for (k = 0; k < Pdim; k++)        tmp += A[i*Ndim + k] * B[k*Pdim + j];        C[i*Ndim+j] = tmp;    } } """ The kernel is expressed as a string and the OpenCL DSL has a C-like syntax: const ORDER = 1024; # Order of the square matrices A, B and C const TOL   = 0.001; # Tolerance used in floating point comps const COUNT = 3;     # Number of runs   sizeN = ORDER * ORDER; h_A = float32(randn(ORDER)); # Fill array with random numbers h_B = float32(randn(ORDER)); # --- ditto -- h_C = Array(Float32, ORDER); # Array to hold the results   ctx   = cl.Context(cl.devices()[3]); queue = cl.CmdQueue(ctx, :profile);   d_a = cl.Buffer(Float32, ctx, (:r,:copy), hostbuf = h_A); d_b = cl.Buffer(Float32, ctx, (:r,:copy), hostbuf = h_B); d_c = cl.Buffer(Float32, ctx, :w, length(h_C)); We now create the Open CL context and some data space on the GPU for the three arrays d_A, d_B, and D_C. Then we copy the data in the host arrays h_A and h_B to the device and then load the kernel onto the GPU. prg = cl.Program(ctx, source=kernel_source) |> cl.build! mmul = cl.Kernel(prg, "mmul"); The following loop runs the kernel COUNT times to give an accurate estimate of the elapsed time for the operation. This includes the cl-copy!() operation which copies the results back from the device to the host (Julia) program. for i in 1:COUNT fill!(h_C, 0.0); global_range = (ORDER. ORDER); mmul_ocl = mmul[queue, global_range]; evt = mmul_ocl(int32(ORDER), int32(ORDER), int32(ORDER), d_a, d_b, d_c); run_time = evt[:profile_duration] / 1e9; cl.copy!(queue, h_C, d_c); mflops = 2.0 * Ndims^3 / (1000000.0 * run_time); @printf “%10.8f seconds at %9.5f MFLOPSn” run_time mflops end 0.59426405 seconds at 3613.686 MFLOPS 0.59078856 seconds at 3634.957 MFLOPS 0.57401651 seconds at 3741.153 MFLOPS This compares with the figures for running this natively, without the GPU processor: 7.060888678 seconds at 304.133 MFLOPS That is using the GPU gives a 12-fold increase in the performance of matrix calculation. Summary This article has introduced some of the main features which sets Julia apart from other similar programming languages. I began with a quick look some Julia code by developing a trajectory used in estimating the price of a financial option which was displayed graphically. Continuing with the graphics theme we presented some code to generating a Julia set and to write this to disk as a PGM formatted file. The type system and use of multiple dispatch is discussed next. This a major difference for the user between Julia and object-orientated languages such as R and Python and is central to what gives Julia the power to generate fast machine-level code via LLVM compilation. We then turned to a series of topics from the Julia armory: Working with Python: The ability to call C and Fortran, seamlessly, has been a central feature of Julia since its initial development by the addition of interoperability with Python has opened up a new series of possibilities, leading to the development of the IJulia interface and its integration in the Jupyter project. 
Simple statistics using DataFrames: As an example of working with data, we highlighted the Julia implementation of data frames by looking at Apple share prices and applying some simple statistics. MySQL access using PyCall: Returns to another use of Python interoperability to illustrate an unconventional method of interfacing with a MySQL database. Scientific programming with Julia: The solution of ordinary differential equations is presented here by looking at the Lotka-Volterra equations, but unusually we develop a solution for the three-species model. Graphics with Gadfly: Julia has a wide range of options when developing data visualizations. Gadfly is one of the 'heavyweights'; a dataset containing UK GCSE results is extracted from the RDatasets.jl package, and the comparison between written and coursework results is displayed using Gadfly, categorized by gender. Finally, we showcased the work of the Julia community groups by looking at an example from the JuliaGPU group, utilizing the OpenCL package to check on the set of supported devices. We then selected an NVIDIA GeForce chip in order to execute a simple kernel which multiplied a pair of matrices on the GPU. This was timed against native Julia coding in order to illustrate the speedup gained from parallelizing matrix operations. Resources for Article: Further resources on this subject: Pricing the Double-no-touch option [article] Basics of Programming in Julia [article] SQL Server Analysis Services Administering and Monitoring Analysis Services [article]


Fine-tune the NGINX Configuration

Packt
14 Jul 2015
20 min read
In this article by Rahul Sharma, author of the book NGINX High Performance, we will cover the following topics: NGINX configuration syntax Configuring NGINX workers Configuring NGINX I/O Configuring TCP Setting up the server (For more resources related to this topic, see here.) NGINX configuration syntax This section aims to cover it in good detail. The complete configuration file has a logical structure that is composed of directives grouped into a number of sections. A section defines the configuration for a particular NGINX module, for example, the http section defines the configuration for the ngx_http_core module. An NGINX configuration has the following syntax: Valid directives begin with a variable name and then state an argument or series of arguments separated by spaces. All valid directives end with a semicolon (;). Sections are defined with curly braces ({}). Sections can be nested in one another. The nested section defines a module valid under the particular section, for example, the gzip section under the http section. Configuration outside any section is part of the NGINX global configuration. The lines starting with the hash (#) sign are comments. Configurations can be split into multiple files, which can be grouped using the include directive. This helps in organizing code into logical components. Inclusions are processed recursively, that is, an include file can further have include statements. Spaces, tabs, and new line characters are not part of the NGINX configuration. They are not interpreted by the NGINX engine, but they help to make the configuration more readable. Thus, the complete file looks like the following code: #The configuration begins here global1 value1; #This defines a new section section { sectionvar1 value1; include file1;    subsection {    subsectionvar1 value1; } } #The section ends here global2 value2; # The configuration ends here NGINX provides the -t option, which can be used to test and verify the configuration written in the file. If the file or any of the included files contains any errors, it prints the line numbers causing the issue: $ sudo nginx -t This checks the validity of the default configuration file. If the configuration is written in a file other than the default one, use the -c option to test it. You cannot test half-baked configurations, for example, you defined a server section for your domain in a separate file. Any attempt to test such a file will throw errors. The file has to be complete in all respects. Now that we have a clear idea of the NGINX configuration syntax, we will try to play around with the default configuration. This article only aims to discuss the parts of the configuration that have an impact on performance. The NGINX catalog has large number of modules that can be configured for some purposes. This article does not try to cover all of them as the details are beyond the scope of the book. Please refer to the NGINX documentation at http://nginx.org/en/docs/ to know more about the modules. Configuring NGINX workers NGINX runs a fixed number of worker processes as per the specified configuration. In the following sections, we will work with NGINX worker parameters. These parameters are mostly part of the NGINX global context. worker_processes The worker_processes directive controls the number of workers: worker_processes 1; The default value for this is 1, that is, NGINX runs only one worker. 
The value should be changed to an optimal value depending on the number of cores available, disks, network subsystem, server load, and so on. As a starting point, set the value to the number of cores available. Determine the number of cores available using lscpu: $ lscpu Architecture:     x86_64 CPU op-mode(s):   32-bit, 64-bit Byte Order:     Little Endian CPU(s):       4 The same can be accomplished by greping out cpuinfo: $ cat /proc/cpuinfo | grep 'processor' | wc -l Now, set this value to the parameter: # One worker per CPU-core. worker_processes 4; Alternatively, the directive can have auto as its value. This determines the number of cores and spawns an equal number of workers. When NGINX is running with SSL, it is a good idea to have multiple workers. SSL handshake is blocking in nature and involves disk I/O. Thus, using multiple workers leads to improved performance. accept_mutex Since we have configured multiple workers in NGINX, we should also configure the flags that impact worker selection. The accept_mutex parameter available under the events section will enable each of the available workers to accept new connections one by one. By default, the flag is set to on. The following code shows this: events { accept_mutex on; } If the flag is turned to off, all of the available workers will wake up from the waiting state, but only one worker will process the connection. This results in the Thundering Herd phenomenon, which is repeated a number of times per second. The phenomenon causes reduced server performance as all the woken-up workers take up CPU time before going back to the wait state. This results in unproductive CPU cycles and nonutilized context switches. accept_mutex_delay When accept_mutex is enabled, only one worker, which has the mutex lock, accepts connections, while others wait for their turn. The accept_mutex_delay corresponds to the timeframe for which the worker would wait, and after which it tries to acquire the mutex lock and starts accepting new connections. The directive is available under the events section with a default value of 500 milliseconds. The following code shows this: events{ accept_mutex_delay 500ms; } worker_connections The next configuration to look at is worker_connections, with a default value of 512. The directive is present under the events section. The directive sets the maximum number of simultaneous connections that can be opened by a worker process. The following code shows this: events{    worker_connections 512; } Increase worker_connections to something like 1,024 to accept more simultaneous connections. The value of worker_connections does not directly translate into the number of clients that can be served simultaneously. Each browser opens a number of parallel connections to download various components that compose a web page, for example, images, scripts, and so on. Different browsers have different values for this, for example, IE works with two parallel connections while Chrome opens six connections. The number of connections also includes sockets opened with the upstream server, if any. worker_rlimit_nofile The number of simultaneous connections is limited by the number of file descriptors available on the system as each socket will open a file descriptor. If NGINX tries to open more sockets than the available file descriptors, it will lead to the Too many opened files message in the error.log. 
Check the number of file descriptors using ulimit: $ ulimit -n Now, increase this to a value more than worker_process * worker_connections. The value should be increased for the user that runs the worker process. Check the user directive to get the username. NGINX provides the worker_rlimit_nofile directive, which can be an alternative way of setting the available file descriptor rather modifying ulimit. Setting the directive will have a similar impact as updating ulimit for the worker user. The value of this directive overrides the ulimit value set for the user. The directive is not present by default. Set a large value to handle large simultaneous connections. The following code shows this: worker_rlimit_nofile 20960; To determine the OS limits imposed on a process, read the file /proc/$pid/limits. $pid corresponds to the PID of the process. multi_accept The multi_accept flag enables an NGINX worker to accept as many connections as possible when it gets the notification of a new connection. The purpose of this flag is to accept all connections in the listen queue at once. If the directive is disabled, a worker process will accept connections one by one. The following code shows this: events{    multi_accept on; } The directive is available under the events section with the default value off. If the server has a constant stream of incoming connections, enabling multi_accept may result in a worker accepting more connections than the number specified in worker_connections. The overflow will lead to performance loss as the previously accepted connections, part of the overflow, will not get processed. use NGINX provides several methods for connection processing. Each of the available methods allows NGINX workers to monitor multiple socket file descriptors, that is, when there is data available for reading/writing. These calls allow NGINX to process multiple socket streams without getting stuck in any one of them. The methods are platform-dependent, and the configure command, used to build NGINX, selects the most efficient method available on the platform. If we want to use other methods, they must be enabled first in NGINX. The use directive allows us to override the default method with the method specified. The directive is part of the events section: events { use select; } NGINX supports the following methods of processing connections: select: This is the standard method of processing connections. It is built automatically on platforms that lack more efficient methods. The module can be enabled or disabled using the --with-select_module or --without-select_module configuration parameter. poll: This is the standard method of processing connections. It is built automatically on platforms that lack more efficient methods. The module can be enabled or disabled using the --with-poll_module or --without-poll_module configuration parameter. kqueue: This is an efficient method of processing connections available on FreeBSD 4.1, OpenBSD 2.9+, NetBSD 2.0, and OS X. There are the additional directives kqueue_changes and kqueue_events. These directives specify the number of changes and events that NGINX will pass to the kernel. The default value for both of these is 512. The kqueue method will ignore the multi_accept directive if it has been enabled. epoll: This is an efficient method of processing connections available on Linux 2.6+. The method is similar to the FreeBSD kqueue. There is also the additional directive epoll_events. This specifies the number of events that NGINX will pass to the kernel. 
The default value for this is 512. /dev/poll: This is an efficient method of processing connections available on Solaris 7 11/99+, HP/UX 11.22+, IRIX 6.5.15+, and Tru64 UNIX 5.1A+. This has the additional directives, devpoll_events and devpoll_changes. The directives specify the number of changes and events that NGINX will pass to the kernel. The default value for both of these is 32. eventport: This is an efficient method of processing connections available on Solaris 10. The method requires necessary security patches to avoid kernel crash issues. rtsig: Real-time signals is a connection processing method available on Linux 2.2+. The method has some limitations. On older kernels, there is a system-wide limit of 1,024 signals. For high loads, the limit needs to be increased by setting the rtsig-max parameter. For kernel 2.6+, instead of the system-wide limit, there is a limit on the number of outstanding signals for each process. NGINX provides the worker_rlimit_sigpending parameter to modify the limit for each of the worker processes: worker_rlimit_sigpending 512; The parameter is part of the NGINX global configuration. If the queue overflows, NGINX drains the queue and uses the poll method to process the unhandled events. When the condition is back to normal, NGINX switches back to the rtsig method of connection processing. NGINX provides the rtsig_overflow_events, rtsig_overflow_test, and rtsig_overflow_threshold parameters to control how a signal queue is handled on overflows. The rtsig_overflow_events parameter defines the number of events passed to poll. The rtsig_overflow_test parameter defines the number of events handled by poll, after which NGINX will drain the queue. Before draining the signal queue, NGINX will look up how much it is filled. If the factor is larger than the specified rtsig_overflow_threshold, it will drain the queue. The rtsig method requires accept_mutex to be set. The method also enables the multi_accept parameter. Configuring NGINX I/O NGINX can also take advantage of the Sendfile and direct I/O options available in the kernel. In the following sections, we will try to configure parameters available for disk I/O. Sendfile When a file is transferred by an application, the kernel first buffers the data and then sends the data to the application buffers. The application, in turn, sends the data to the destination. The Sendfile method is an improved method of data transfer, in which data is copied between file descriptors within the OS kernel space, that is, without transferring data to the application buffers. This results in improved utilization of the operating system's resources. The method can be enabled using the sendfile directive. The directive is available for the http, server, and location sections. http{ sendfile on; } The flag is set to off by default. Direct I/O The OS kernel usually tries to optimize and cache any read/write requests. Since the data is cached within the kernel, any subsequent read request to the same place will be much faster because there's no need to read the information from slow disks. Direct I/O is a feature of the filesystem where reads and writes go directly from the applications to the disk, thus bypassing all OS caches. This results in better utilization of CPU cycles and improved cache effectiveness. The method is used in places where the data has a poor hit ratio. Such data does not need to be in any cache and can be loaded when required. It can be used to serve large files. The directio directive enables the feature. 
The directive is available for the http, server, and location sections: location /video/ { directio 4m; } Any file with size more than that specified in the directive will be loaded by direct I/O. The parameter is disabled by default. The use of direct I/O to serve a request will automatically disable Sendfile for the particular request. Direct I/O depends on the block size while doing a data transfer. NGINX has the directio_alignment directive to set the block size. The directive is present under the http, server, and location sections: location /video/ { directio 4m; directio_alignment 512; } The default value of 512 bytes works well for all boxes unless it is running a Linux implementation of XFS. In such a case, the size should be increased to 4 KB. Asynchronous I/O Asynchronous I/O allows a process to initiate I/O operations without having to block or wait for it to complete. The aio directive is available under the http, server, and location sections of an NGINX configuration. Depending on the section, the parameter will perform asynchronous I/O for the matching requests. The parameter works on Linux kernel 2.6.22+ and FreeBSD 4.3. The following code shows this: location /data { aio on; } By default, the parameter is set to off. On Linux, aio needs to be enabled with directio, while on FreeBSD, sendfile needs to be disabled for aio to take effect. If NGINX has not been configured with the --with-file-aio module, any use of the aio directive will cause the unknown directive aio error. The directive has a special value of threads, which enables multithreading for send and read operations. The multithreading support is only available on the Linux platform and can only be used with the epoll, kqueue, or eventport methods of processing requests. In order to use the threads value, configure multithreading in the NGINX binary using the --with-threads option. Post this, add a thread pool in the NGINX global context using the thread_pool directive. Use the same pool in the aio configuration: thread_pool io_pool threads=16; http{ ….....    location /data{      sendfile   on;      aio       threads=io_pool;    } } Mixing them up The three directives can be mixed together to achieve different objectives on different platforms. The following configuration will use sendfile for files with size smaller than what is specified in directio. Files served by directio will be read using asynchronous I/O: location /archived-data/{ sendfile on; aio on; directio 4m; } The aio directive has a sendfile value, which is available only on the FreeBSD platform. The value can be used to perform Sendfile in an asynchronous manner: location /archived-data/{ sendfile on; aio sendfile; } NGINX invokes the sendfile() system call, which returns with no data in the memory. Post this, NGINX initiates data transfer in an asynchronous manner. Configuring TCP HTTP is an application-based protocol, which uses TCP as the transport layer. In TCP, data is transferred in the form of blocks known as TCP packets. NGINX provides directives to alter the behavior of the underlying TCP stack. These parameters alter flags for an individual socket connection. TCP_NODELAY TCP/IP networks have the "small packet" problem, where single-character messages can cause network congestion on a highly loaded network. Such packets are 41 bytes in size, where 40 bytes are for the TCP header and 1 byte has useful information. These small packets have huge overhead, around 4000 percent and can saturate a network. 
John Nagle solved the problem (Nagle's algorithm) by not sending the small packets immediately. All such packets are collected for some amount of time and then sent in one go as a single packet. This results in improved efficiency of the underlying network. Thus, a typical TCP/IP stack waits for up to 200 milliseconds before sending the data packets to the client.

It is important to note that the problem exists with applications such as Telnet, where each keystroke is sent over the wire. The problem is not relevant to a web server, which serves static files. The files will mostly form full TCP packets, which can be sent immediately instead of waiting for 200 milliseconds.

The TCP_NODELAY option can be used while opening a socket to disable Nagle's buffering algorithm and send the data as soon as it is available. NGINX provides the tcp_nodelay directive to enable this option. The directive is available under the http, server, and location sections of an NGINX configuration:

http {
   tcp_nodelay on;
}

The directive is enabled by default. NGINX uses tcp_nodelay for connections in the keep-alive mode.

TCP_CORK

As an alternative to Nagle's algorithm, Linux provides the TCP_CORK option. The option tells the TCP stack to append packets and send them when they are full or when the application instructs it to send the packet by explicitly removing TCP_CORK. This results in an optimal amount of data packets being sent and, thus, improves the efficiency of the network. The TCP_CORK option is available as the TCP_NOPUSH flag on FreeBSD and Mac OS.

NGINX provides the tcp_nopush directive to enable TCP_CORK over the connection socket. The directive is available under the http, server, and location sections of an NGINX configuration:

http {
   tcp_nopush on;
}

The directive is disabled by default. NGINX uses tcp_nopush for requests served with sendfile.

Setting them up

The two directives address different aspects of TCP behavior: the former makes sure that the network latency is reduced, while the latter tries to optimize the data packets sent. An application should set both of these options to get efficient data transfer.

Enabling tcp_nopush along with sendfile makes sure that while transferring a file, the kernel creates the maximum amount of full TCP packets before sending them over the wire. The last packet(s) can be partial TCP packets, which could end up waiting with TCP_CORK being enabled. NGINX makes sure it removes TCP_CORK to send these packets. Since tcp_nodelay is also set, these packets are immediately sent over the network, that is, without any delay.

Setting up the server

The following configuration sums up all the changes proposed in the preceding sections:

worker_processes 3;
worker_rlimit_nofile 8000;

events {
   multi_accept on;
   use epoll;
   worker_connections 1024;
}

http {
   sendfile on;
   aio on;
   directio 4m;
   tcp_nopush on;
   tcp_nodelay on;
   # Rest of the NGINX configuration removed for brevity
}

It is assumed that NGINX runs on a quad core server. Thus, three worker processes have been spawned to take advantage of three out of four available cores, leaving one core for other processes. Each of the workers has been configured to work with 1,024 connections. Correspondingly, the nofile limit has been increased to 8,000. By default, all worker processes operate with the accept mutex; thus, the flag has not been set. Each worker processes multiple connections in one go using the epoll method.
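The file-descriptor arithmetic behind these numbers is easy to get wrong, so here is a small sanity check in Python (an editorial addition, not part of the original article; the worker and connection counts are taken from the sample configuration above and should be adjusted to match your own nginx.conf):

import resource

# Values taken from the sample configuration above (adjust as needed).
worker_processes = 3
worker_connections = 1024
worker_rlimit_nofile = 8000

# Descriptors the workers may need, as discussed in the ulimit section.
required = worker_processes * worker_connections

# Soft and hard limits on open file descriptors for the current user/session.
soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)

print("approximate descriptors required:", required)
print("worker_rlimit_nofile:", worker_rlimit_nofile)
print("ulimit -n (soft/hard):", soft_limit, "/", hard_limit)

if worker_rlimit_nofile < required:
    print("warning: worker_rlimit_nofile is below worker_processes * worker_connections")
if soft_limit < required:
    print("note: the shell ulimit is below the required count; worker_rlimit_nofile overrides it for the workers")

The script only restates the rule given earlier: the descriptors available to a worker, whether set through ulimit or worker_rlimit_nofile, should exceed worker_processes * worker_connections.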
In the http section, NGINX has been configured to serve files larger than 4 MB using direct I/O, while efficiently buffering smaller files using Sendfile. TCP options have also been set up to efficiently utilize the available network. Measuring gains It is time to test the changes and make sure that they have given performance gain. Run a series of tests using Siege/JMeter to get new performance numbers. The tests should be performed with the same configuration to get a comparable output: $ siege -b -c 790 -r 50 -q http://192.168.2.100/hello   Transactions:               79000 hits Availability:               100.00 % Elapsed time:               24.25 secs Data transferred:           12.54 MB Response time:             0.20 secs Transaction rate:           3257.73 trans/sec Throughput:                 0.52 MB/sec Concurrency:               660.70 Successful transactions:   39500 Failed transactions:       0 Longest transaction:       3.45 Shortest transaction:       0.00 The results from Siege should be evaluated and compared to the baseline. Throughput: The transaction rate defines this as 3250 requests/second Error rate: Availability is reported as 100 percent; thus; the error rate is 0 percent Response time: The results shows a response time of 0.20 seconds Thus, these new numbers demonstrate performance improvement in various respects. After the server configuration is updated with all the changes, reperform all tests with increased numbers. The aim should be to determine the new baseline numbers for the updated configuration. Summary The article started with an overview of the NGINX configuration syntax. Going further, we discussed worker_connections and the related parameters. These allow you to take advantage of the available hardware. The article also talked about the different event processing mechanisms available on different platforms. The configuration discussed helped in processing more requests, thus improving the overall throughput. NGINX is primarily a web server; thus, it has to serve all kinds static content. Large files can take advantage of direct I/O, while smaller content can take advantage of Sendfile. The different disk modes make sure that we have an optimal configuration to serve the content. In the TCP stack, we discussed the flags available to alter the default behavior of TCP sockets. The tcp_nodelay directive helps in improving latency. The tcp_nopush directive can help in efficiently delivering the content. Both these flags lead to improved response time. In the last part of the article, we applied all the changes to our server and then did performance tests to determine the effectiveness of the changes done. In the next article, we will try to configure buffers, timeouts, and compression to improve the utilization of the available network. Resources for Article: Further resources on this subject: Using Nginx as a Reverse Proxy [article] Nginx proxy module [article] Introduction to nginx [article]
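As an editorial aside, the tcp_nodelay and tcp_nopush directives discussed above map directly onto per-socket TCP options, and toggling them by hand can make it clearer what NGINX does on our behalf. The following Python sketch is illustrative only and is not taken from the article; TCP_CORK is Linux-specific (FreeBSD and Mac OS expose TCP_NOPUSH instead, as noted earlier):

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Disable Nagle's algorithm so small writes are sent immediately (tcp_nodelay on).
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
print("TCP_NODELAY:", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))

# "Cork" the socket so the kernel coalesces writes into full packets (tcp_nopush on),
# then remove the cork to flush any remaining partial packet.
if hasattr(socket, "TCP_CORK"):
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 1)
    # ... the response body would be written here ...
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CORK, 0)

sock.close()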
Understanding mutability and immutability in Python, C#, and JavaScript

In this article by Gastón C. Hillar, author of the book, Learning Object-Oriented Programming, you will learn the concept of mutability and immutability in the programming languages such as, Python, C#, and JavaScript. What’s the difference? By default, any instance field or attribute works like a variable; therefore we can change their values. When we create an instance of a class that defines many public instance fields, we are creating a mutable object, that is, an object that can change its state. For example, let's think about a class named MutableVector3D that represents a mutable 3D vector with three public instance fields: X, Y, and Z. We can create a new MutableVector3D instance and initialize the X, Y, and Z attributes. Then, we can call the Sum method with their delta values for X, Y, and Z as arguments. The delta values specify the difference between the existing value and the new or desired value. So, for example, if we specify a positive value of 20 in the deltaX parameter, it means that we want to add 20 to the X value. The following lines show pseudocode in a neutral programming language that create a new MutableVector3D instance called myMutableVector, initialized with values for the X, Y, and Z fields. Then, the code calls the Sum method with the delta values for X, Y, and Z as arguments, as shown in the following code: myMutableVector = new MutableVector3D instance with X = 30, Y = 50 and Z = 70 myMutableVector.Sum(deltaX: 20, deltaY: 30, deltaZ: 15) The initial values for the myMutableVector field are 30 for X, 50 for Y, and 70 for Z. The Sum method changes the values of all the three fields; therefore, the object state mutates as follows: myMutableVector.X mutates from 30 to 30 + 20 = 50 myMutableVector.Y mutates from 50 to 50 + 30 = 80 myMutableVector.Z mutates from 70 to 70 + 15 = 85 The values for the myMutableVector field after the call to the Sum method are 50 for X, 80 for Y, and 85 for Z. We can say this method mutated the object's state; therefore, myMutableVector is a mutable object: an instance of a mutable class. Mutability is very important in object-oriented programming. In fact, whenever we expose fields and/or properties, we will create a class that will generate mutable instances. However, sometimes a mutable object can become a problem. In certain situations, we want to avoid objects to change their state. For example, when we work with a concurrent code, an object that cannot change its state solves many concurrency problems and avoids potential bugs. For example, we can create an immutable version of the previous MutableVector3D class to represent an immutable 3D vector. The new ImmutableVector3D class has three read-only properties: X, Y, and Z. Thus, there are only three getter methods without setter methods, and we cannot change the values of the underlying internal fields: m_X, m_Y, and m_Z. We can create a new ImmutableVector3D instance and initialize the underlying internal fields: m_X, m_Y, and m_Z. X, Y, and Z attributes. Then, we can call the Sum method with the delta values for X, Y, and Z as arguments. The following lines show the pseudocode in a neutral programming language that create a new ImmutableVector3D instance called myImmutableVector, which is initialized with values for X, Y, and Z as arguments. 
Then, the pseudocode calls the Sum method with the delta values for X, Y, and Z as arguments: myImmutableVector = new ImmutableVector3D instance with X = 30, Y = 50 and Z = 70 myImmutableSumVector = myImmutableVector.Sum(deltaX: 20, deltaY: 30, deltaZ: 15) However, this time the Sum method returns a new instance of the ImmutableVector3D class with the X, Y, and Z values initialized to the sum of X, Y, and Z and the delta values for X, Y, and Z. So, myImmutableSumVector is a new ImmutableVector3D instance initialized with X = 50, Y = 80, and Z = 85. The call to the Sum method generated a new instance and didn't mutate the existing object. The immutable version adds an overhead as compared with the mutable version because it's necessary to create a new instance of a class as a result of calling the Sum method. The mutable version just changed the values for the attributes and it wasn't necessary to generate a new instance. Obviously, the immutable version has a memory and a performance overhead. However, when we work with the concurrent code, it makes sense to pay for the extra overhead to avoid potential issues caused by mutable objects. Using methods to add behaviors to classes in Python So far, we have added instance methods to classes and used getter and setter methods combined with decorators to define properties. Now, we want to generate a class to represent the mutable version of a 3D vector. We will use properties with simple getter and setter methods for x, y, and z. The sum public instance method receives the delta values for x, y, and z and mutates an object, that is, the setter method changes the values of x, y, and z. Here is the initial code of the MutableVector3D class: class MutableVector3D: def __init__(self, x, y, z): self.__x = x self.__y = y self.__z = z def sum(self, delta_x, delta_y, delta_z): self.__x += delta_x self.__y += delta_y self.__z += delta_z @property def x(self): return self.__x @x.setter def x(self, x): self.__x = x @property def y(self): return self.__y @y.setter def y(self, y): self.__y = y @property def z(self): return self.__z @z.setter def z(self, z): self.__z = z It's a very common requirement to generate a 3D vector with all the values initialized to 0, that is, x = 0, y = 0, and z = 0. A 3D vector with these values is known as an origin vector. We can add a class method to the MutableVector3D class named origin_vector to generate a new instance of the class initialized with all the values initialized to 0. It's necessary to add the @classmethod decorator before the class method header. Instead of receiving self as the first argument, a class method receives the current class; the parameter name is usually named cls. The following code defines the origin_vector class method: @classmethod def origin_vector(cls): return cls(0, 0, 0) The preceding method returns a new instance of the current class (cls) with 0 as the initial value for the three elements. The class method receives cls as the only argument; therefore, it will be a parameterless method when we call it because Python includes a class as a parameter under the hood. The following command calls the origin_vector class method to generate a 3D vector, calls the sum method for the generated instance, and prints the values for the three elements: mutableVector3D = MutableVector3D.origin_vector() mutableVector3D.sum(5, 10, 15) print(mutableVector3D.x, mutableVector3D.y, mutableVector3D.z) Now, we want to generate a class to represent the immutable version of a 3D vector. 
In this case, we will use read-only properties for x, y, and z. The sum public instance method receives the delta values for x, y, and z and returns a new instance of the same class with the values of x, y, and z initialized with the results of the sum. Here is the code of the ImmutableVector3D class: class ImmutableVector3D: def __init__(self, x, y, z): self.__x = x self.__y = y self.__z = z def sum(self, delta_x, delta_y, delta_z): return type(self)(self.__x + delta_x, self.__y + delta_y, self.__z + delta_z) @property def x(self): return self.__x @property def y(self): return self.__y @property def z(self): return self.__z @classmethod def equal_elements_vector(cls, initial_value): return cls(initial_value, initial_value, initial_value) @classmethod def origin_vector(cls): return cls.equal_elements_vector(0) Note that the sum method uses type(self) to generate and return a new instance of the current class. In this case, the origin_vector class method returns the results of calling the equal_elements_vector class method with 0 as an argument. Remember that the cls argument refers to the actual class. The equal_elements_vector class method receives an initial_value argument for all the elements of the 3D vector, creates an instance of the actual class, and initializes all the elements with the received unique value. The origin_vector class method demonstrates how we can call another class method in a class method. The following command calls the origin_vector class method to generate a 3D vector, calls the sum method for the generated instance, and prints the values for the three elements of the new instance returned by the sum method: vector0 = ImmutableVector3D.origin_vector() vector1 = vector0.sum(5, 10, 15) print(vector1.x, vector1.y, vector1.z) As explained previously, we can change the values of the private attributes; therefore, the ImmutableVector3D class isn't 100 percent immutable. However, we are all adults and don't expect the users of a class with read-only properties to change the values of private attributes hidden under difficult to access names. Using methods to add behaviors to classes in C# So far, we have added instance methods to a class in C#, and used getter and setter methods to define properties. Now, we want to generate a class to represent the mutable version of a 3D vector in C#. We will use auto-implemented properties for X, Y, and Z. The public Sum instance method receives the delta values for X, Y, and Z (deltaX, deltaY, and deltaZ) and mutates the object, that is, the method changes the values of X, Y, and Z. The following shows the initial code of the MutableVector3D class: class MutableVector3D { public double X { get; set; } public double Y { get; set; } public double Z { get; set; } public void Sum(double deltaX, double deltaY, double deltaZ) { this.X += deltaX; this.Y += deltaY; this.Z += deltaZ; } public MutableVector3D(double x, double y, double z) { this.X = x; this.Y = y; this.Z = z; } } It's a very common requirement to generate a 3D vector with all the values initialized to 0, that is, X = 0, Y = 0, and Z = 0. A 3D vector with these values is known as an origin vector. We can add a class method to the MutableVector3D class named OriginVector to generate a new instance of the class initialized with all the values initialized to 0. Class methods are also known as static methods in C#. It's necessary to add the static keyword after the public access modifier before the class method name. 
The following commands define the OriginVector static method: public static MutableVector3D OriginVector() { return new MutableVector3D(0, 0, 0); } The preceding method returns a new instance of the MutableVector3D class with 0 as the initial value for all the three elements. The following code calls the OriginVector static method to generate a 3D vector, calls the Sum method for the generated instance, and prints the values for all the three elements on the console output: var mutableVector3D = MutableVector3D.OriginVector(); mutableVector3D.Sum(5, 10, 15); Console.WriteLine(mutableVector3D.X, mutableVector3D.Y, mutableVector3D.Z) Now, we want to generate a class to represent the immutable version of a 3D vector. In this case, we will use read-only properties for X, Y, and Z. We will use auto-generated properties with private set. The Sum public instance method receives the delta values for X, Y, and Z (deltaX, deltaY, and deltaZ) and returns a new instance of the same class with the values of X, Y, and Z initialized with the results of the sum. The code for the ImmutableVector3D class is as follows: class ImmutableVector3D { public double X { get; private set; } public double Y { get; private set; } public double Z { get; private set; } public ImmutableVector3D Sum(double deltaX, double deltaY, double deltaZ) { return new ImmutableVector3D ( this.X + deltaX, this.Y + deltaY, this.Z + deltaZ); } public ImmutableVector3D(double x, double y, double z) { this.X = x; this.Y = y; this.Z = z; } public static ImmutableVector3D EqualElementsVector(double initialValue) { return new ImmutableVector3D(initialValue, initialValue, initialValue); } public static ImmutableVector3D OriginVector() { return ImmutableVector3D.EqualElementsVector(0); } } In the new class, the Sum method returns a new instance of the ImmutableVector3D class, that is, the current class. In this case, the OriginVector static method returns the results of calling the EqualElementsVector static method with 0 as an argument. The EqualElementsVector class method receives an initialValue argument for all the elements of the 3D vector, creates an instance of the actual class, and initializes all the elements with the received unique value. The OriginVector static method demonstrates how we can call another static method in a static method. The following code calls the OriginVector static method to generate a 3D vector, calls the Sum method for the generated instance, and prints all the values for the three elements of the new instance returned by the Sum method on the console output: var vector0 = ImmutableVector3D.OriginVector(); var vector1 = vector0.Sum(5, 10, 15); Console.WriteLine(vector1.X, vector1.Y, vector1.Z); C# doesn't allow users of the ImmutableVector3D class to change the values of X, Y, and Z properties. The code doesn't compile if you try to assign a new value to any of these properties. Thus, we can say that the ImmutableVector3D class is 100 percent immutable. Using methods to add behaviors to constructor functions in JavaScript So far, we have added methods to a constructor function that produced instance methods in a generated object. In addition, we used getter and setter methods combined with local variables to define properties. Now, we want to generate a constructor function to represent the mutable version of a 3D vector. We will use properties with simple getter and setter methods for x, y, and z. 
The sum public instance method receives the delta values for x, y, and z and mutates an object, that is, the method changes the values of x, y, and z. The following code shows the initial code of the MutableVector3D constructor function: function MutableVector3D(x, y, z) { var _x = x; var _y = y; var _z = z; Object.defineProperty(this, 'x', { get: function(){ return _x; }, set: function(val){ _x = val; } }); Object.defineProperty(this, 'y', { get: function(){ return _y; }, set: function(val){ _y = val; } }); Object.defineProperty(this, 'z', { get: function(){ return _z; }, set: function(val){ _z = val; } }); this.sum = function(deltaX, deltaY, deltaZ) { _x += deltaX; _y += deltaY; _z += deltaZ; } } It's a very common requirement to generate a 3D vector with all the values initialized to 0, that is, x = 0, y = 0, and, z = 0. A 3D vector with these values is known as an origin vector. We can add a function to the MutableVector3D constructor function named originVector to generate a new instance of a class with all the values initialized to 0. The following code defines the originVector function: MutableVector3D.originVector = function() { return new MutableVector3D(0, 0, 0); }; The method returns a new instance built in the MutableVector3D constructor function with 0 as the initial value for all the three elements. The following code calls the originVector function to generate a 3D vector, calls the sum method for the generated instance, and prints all the values for all the three elements: var mutableVector3D = MutableVector3D.originVector(); mutableVector3D.sum(5, 10, 15); console.log(mutableVector3D.x, mutableVector3D.y, mutableVector3D.z); Now, we want to generate a constructor function to represent the immutable version of a 3D vector. In this case, we will use read-only properties for x, y, and z. In this case, we will use the ImmutableVector3D.prototype property to define the sum method. The method receives the values of delta for x, y, and z, and returns a new instance with the values of x, y, and z initialized with the results of the sum. The following code shows the ImmutableVector3D constructor function and the additional code that defines all the other methods: function ImmutableVector3D(x, y, z) { var _x = x; var _y = y; var _z = z; Object.defineProperty(this, 'x', { get: function(){ return _x; } }); Object.defineProperty(this, 'y', { get: function(){ return _y; } }); Object.defineProperty(this, 'z', { get: function(){ return _z; } }); } ImmutableVector3D.prototype.sum = function(deltaX, deltaY, deltaZ) { return new ImmutableVector3D( this.x + deltaX, this.y + deltaY, this.z + deltaZ); }; ImmutableVector3D.equalElementsVector = function(initialValue) { return new ImmutableVector3D(initialValue, initialValue, initialValue); }; ImmutableVector3D.originVector = function() { return ImmutableVector3D.equalElementsVector(0); }; Again, note that the preceding code defines the sum method in the ImmutableVector3D.prototype method. This method will be available to all the instances generated in the ImmutableVector3D constructor function. The sum method generates and returns a new instance of ImmutableVector3D. In this case, the originVector method returns the results of calling the equalElementsVector method with 0 as an argument. The equalElementsVector method receives an initialValue argument for all the elements of the 3D vector, creates an instance of the actual class, and initializes all the elements with the received unique value. 
The originVector method demonstrates how we can call another function defined in the constructor function. The following code calls the originVector method to generate a 3D vector, calls the sum method for the generated instance, and prints the values for all the three elements of the new instance returned by the sum method: var vector0 = ImmutableVector3D.originVector(); var vector1 = vector0.sum(5, 10, 15); console.log(vector1.x, vector1.y, vector1.z); Summary In this article, you learned the concept of mutability and immutability in the programming languages such as, Python, C#, and JavaScript and used methods to add behaviors in each of the programming language. So, what’s next? Continue Learning Object-Oriented Programming with Gastón’s book – find out more here.
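As a closing illustration (an editorial addition that assumes the Python MutableVector3D and ImmutableVector3D classes defined earlier in this article are in scope), the following sketch shows the practical consequence of the two designs: a mutable instance shared through two references changes for both of them, while the immutable variant leaves existing references untouched:

# Mutable: both names refer to the same object, so the mutation is visible through both.
shared = MutableVector3D(30, 50, 70)
alias = shared
shared.sum(20, 30, 15)
print(alias.x, alias.y, alias.z)           # 50 80 85 - the alias sees the change

# Immutable: sum() returns a new instance; the original keeps its state.
original = ImmutableVector3D(30, 50, 70)
moved = original.sum(20, 30, 15)
print(original.x, original.y, original.z)  # 30 50 70 - unchanged
print(moved.x, moved.y, moved.z)           # 50 80 85 - a brand new object

This aliasing behavior is exactly why the immutable variant is attractive in concurrent code, at the cost of allocating a new instance per operation.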
Architecting and coding high performance .NET applications

In this article by Antonio Esposito, author of Learning .NET High Performance Programming, we will learn about low-pass audio filtering implemented using .NET, and also learn about MVVM and XAML. Model-View-ViewModel and XAML The MVVM pattern is another descendant of the MVC pattern. Born from an extensive update to the MVP pattern, it is at the base of all eXtensible Application Markup Language (XAML) language-based frameworks, such as Windows presentation foundation (WPF), Silverlight, Windows Phone applications, and Store Apps (formerly known as Metro-style apps). MVVM is different from MVC, which is used by Microsoft in its main web development framework in that it is used for desktop or device class applications. The first and (still) the most powerful application framework using MVVM in Microsoft is WPF, a desktop class framework that can use the full .NET 4.5.3 environment. Future versions within Visual Studio 2015 will support built-in .NET 4.6. On the other hand, all other frameworks by Microsoft that use the XAML language supporting MVVM patterns are based on a smaller edition of .NET. This happens with Silverlight, Windows Store Apps, Universal Apps, or Windows Phone Apps. This is why Microsoft made the Portable Library project within Visual Studio, which allows us to create shared code bases compatible with all frameworks. While a Controller in MVC pattern is sort of a router for requests to catch any request and parsing input/output Models, the MVVM lies behind any View with a full two-way data binding that is always linked to a View's controls and together at Model's properties. Actually, multiple ViewModels may run the same View and many Views can use the same single/multiple instance of a given ViewModel. A simple MVC/MVVM design comparative We could assert that the experience offered by MVVM is like a film, while the experience offered by MVC is like photography, because while a Controller always makes one-shot elaborations regarding the application user requests in MVC, in MVVM, the ViewModel is definitely the view! Not only does a ViewModel lie behind a View, but we could also say that if a VM is a body, then a View is its dress. While the concrete View is the graphical representation, the ViewModel is the virtual view, the un-concrete view, but still the View. In MVC, the View contains the user state (the value of all items showed in the UI) until a GET/POST invocation is sent to the web server. Once sent, in the MVC framework, the View simply binds one-way reading data from a Model. In MVVM, behaviors, interaction logic, and user state actually live within the ViewModel. Moreover, it is again in the ViewModel that any access to the underlying Model, domain, and any persistence provider actually flows. Between a ViewModel and View, a data connection called data binding is established. This is a declarative association between a source and target property, such as Person.Name with TextBox.Text. Although it is possible to configure data binding by imperative coding (while declarative means decorating or setting the property association in XAML), in frameworks such as WPF and other XAML-based frameworks, this is usually avoided because of the more decoupled result made by the declarative choice. The most powerful technology feature provided by any XAML-based language is actually the data binding, other than the simpler one that was available in Windows Forms. XAML allows one-way binding (also reverted to the source) and two-way binding. 
Such data binding supports any source or target as a property from a Model or ViewModel or any other control's dependency property. This binding subsystem is so powerful in XAML-based languages that events are handled in specific objects named Command, and this can be data-bound to specific controls, such as buttons. In the .NET framework, an event is an implementation of the Observer pattern that lies within a delegate object, allowing a 1-N association between the only source of the event (the owner of the event) and more observers that can handle the event with some specific code. The only object that can raise the event is the owner itself. In XAML-based languages, a Command is an object that targets a specific event (in the meaning of something that can happen) that can be bound to different controls/classes, and all of those can register handlers or raise the signaling of all handlers. An MVVM performance map analysis Performance concerns Regarding performance, MVVM behaves very well in several scenarios in terms of data retrieval (latency-driven) and data entry (throughput- and scalability-driven). The ability to have an impressive abstraction of the view in the VM without having to rely on the pipelines of MVC (the actions) makes the programming very pleasurable and give the developer the choice to use different designs and optimization techniques. Data binding itself is done by implementing specific .NET interfaces that can be easily centralized. Talking about latency, it is slightly different from previous examples based on web request-response time, unavailable in MVVM. Theoretically speaking, in the design pattern of MVVM, there is no latency at all. In a concrete implementation within XAML-based languages, latency can refer to two different kinds of timings. During data binding, latency is the time between when a VM makes new data available and a View actually renders it. Instead, during a command execution, latency is the time between when a command is invoked and all relative handlers complete their execution. We usually use the first definition until differently specified. Although the nominal latency is near zero (some milliseconds because of the dictionary-based configuration of data binding), specific implementation concerns about latency actually exist. In any Model or ViewModel, an updated data notification is made by triggering the View with the INotifyPropertyChanged interface. The .NET interface causes the View to read the underlying data again. Because all notifications are made by a single .NET event, this can easily become a bottleneck because of the serialized approach used by any delegate or event handlers in the .NET world. On the contrary, when dealing with data that flows from the View to the Model, such an inverse binding is usually configured declaratively within the {Binding …} keyword, which supports specifying binding directions and trigger timing (to choose from the control's lost focus CLR event or anytime the property value changes). The XAML data binding does not add any measurable time during its execution. Although this, as said, such binding may link multiple properties or the control's dependency properties together. Linking this interaction logic could increase latency time heavily, adding some annoying delay at the View level. One fact upon all, is the added latency by any validation logic. It is even worse if such validation is other than formal, such as validating some ID or CODE against a database value. 
Talking about scalability, MVVM patterns does some work here, while we can make some concrete analysis concerning the XAML implementation. It is easy to say that scaling out is impossible because MVVM is a desktop class layered architecture that cannot scale. Instead, we can say that in a multiuser scenario with multiple client systems connected in a 2-tier or 3-tier system architecture, simple MVVM and XAML-based frameworks will never act as bottlenecks. The ability to use the full .NET stack in WPF gives us the chance to use all synchronization techniques available, in order to use a directly connected DBMS or middleware tier. Instead of scaling up by moving the application to an increased CPU clock system, the XAML-based application would benefit more from an increased CPU core count system. Obviously, to profit from many CPU cores, mastering parallel techniques is mandatory. About the resource usage, MVVM-powered architectures require only a simple POCO class as a Model and ViewModel. The only additional requirement is the implementation of the INotifyPropertyChanged interface that costs next to nothing. Talking about the pattern, unlike MVC, which has a specific elaboration workflow, MVVM does not offer this functionality. Multiple commands with multiple logic can process their respective logic (together with asynchronous invocation) with the local VM data or by going down to the persistence layer to grab missing information. We have all the choices here. Although MVVM does not cost anything in terms of graphical rendering, XAML-based frameworks make massive use of hardware-accelerated user controls. Talking about an extreme choice, Windows Forms with Graphics Device Interface (GDI)-based rendering require a lot less resources and can give a higher frame rate on highly updatable data. Thus, if a very high FPS is needed, the choice of still rendering a WPF area in GDI is available. For other XAML languages, the choice is not so easy to obtain. Obviously, this does not mean that XAML is slow in rendering with its DirectX based engine. Simply consider that WPF animations need a good Graphics Processing Unit (GPU), while a basic GDI animation will execute on any system, although it is obsolete. Talking about availability, MVVM-based architectures usually lead programmers to good programming. As MVC allows it, MVVM designs can be tested because of the great modularity. While a Controller uses a pipelined workflow to process any requests, a ViewModel is more flexible and can be tested with multiple initialization conditions. This makes it more powerful but also less predictable than a Controller, and hence is tricky to use. In terms of design, the Controller acts as a transaction script, while the ViewModel acts in a more realistic, object-oriented approach. Finally, yet importantly, throughput and efficiency are simply unaffected by MVVM-based architectures. However, because of the flexibility the solution gives to the developer, any interaction and business logic design may be used inside a ViewModel and their underlying Models. Therefore, any success or failure regarding those performance aspects are usually related to programmer work. In XAML frameworks, throughput is achieved by an intensive use of asynchronous and parallel programming assisted by a built-in thread synchronization subsystem, based on the Dispatcher class that deals with UI updates. Low-pass filtering for Audio Low-pass filtering has been available since 2008 in the native .NET code. 
NAudio is a powerful library helping any CLR programmer to create, manipulate, or analyze audio data in any format. Available through NuGet Package Manager, NAudio offers a simple and .NET-like programming framework, with specific classes and stream-reader for audio data files. Let's see how to apply the low-pass digital filter in a real audio uncompressed file in WAVE format. For this test, we will use the Windows start-up default sound file. The chart is still made in a legacy Windows Forms application with an empty Form1 file, as shown in the previous example: private async void Form1_Load(object sender, EventArgs e) {    //stereo wave file channels    var channels = await Task.Factory.StartNew(() =>        {            //the wave stream-like reader            using (var reader = new WaveFileReader("startup.wav"))            {                var leftChannel = new List<float>();              var rightChannel = new List<float>();                  //let's read all frames as normalized floats                while (reader.Position < reader.Length)                {                    var frame = reader.ReadNextSampleFrame();                   leftChannel.Add(frame[0]);                    rightChannel.Add(frame[1]);                }                  return new                {                    Left = leftChannel.ToArray(),                    Right = rightChannel.ToArray(),                };            }        });      //make a low-pass digital filter on floating point data    //at 200hz    var leftLowpassTask = Task.Factory.StartNew(() => LowPass(channels.Left, 200).ToArray());    var rightLowpassTask = Task.Factory.StartNew(() => LowPass(channels.Right, 200).ToArray());      //this let the two tasks work together in task-parallelism    var leftChannelLP = await leftLowpassTask;    var rightChannelLP = await rightLowpassTask;      //create and databind a chart    var chart1 = CreateChart();      chart1.DataSource = Enumerable.Range(0, channels.Left.Length).Select(i => new        {            Index = i,            Left = channels.Left[i],            Right = channels.Right[i],            LeftLP = leftChannelLP[i],            RightLP = rightChannelLP[i],        }).ToArray();      chart1.DataBind();      //add the chart to the form    this.Controls.Add(chart1); }   private static Chart CreateChart() {    //creates a chart    //namespace System.Windows.Forms.DataVisualization.Charting      var chart1 = new Chart();      //shows chart in fullscreen    chart1.Dock = DockStyle.Fill;      //create a default area    chart1.ChartAreas.Add(new ChartArea());      //left and right channel series    chart1.Series.Add(new Series    {        XValueMember = "Index",        XValueType = ChartValueType.Auto,        YValueMembers = "Left",        ChartType = SeriesChartType.Line,    });    chart1.Series.Add(new Series    {        XValueMember = "Index",        XValueType = ChartValueType.Auto,        YValueMembers = "Right",        ChartType = SeriesChartType.Line,    });      //left and right channel low-pass (bass) series    chart1.Series.Add(new Series    {        XValueMember = "Index",        XValueType = ChartValueType.Auto,        YValueMembers = "LeftLP",        ChartType = SeriesChartType.Line,        BorderWidth = 2,    });    chart1.Series.Add(new Series    {        XValueMember = "Index",        XValueType = ChartValueType.Auto,        YValueMembers = "RightLP",        ChartType = SeriesChartType.Line,        BorderWidth = 2,    });      return chart1; } Let's see the graphical result: The Windows 
start-up sound waveform. In bold, the bass waveform with a low-pass filter at 200 Hz.

The usage of parallelism in elaborations such as this is mandatory. Audio elaboration is a canonical example of engineering data computation because it works on a huge dataset of floating-point values. A simple file, such as the preceding one that contains less than 2 seconds of audio sampled at (only) 22,050 Hz, produces an array of more than 40,000 floating-point values per channel (stereo = 2 channels). Just to get an idea of how hard processing audio files is, note that an uncompressed CD-quality song of 4 minutes, sampled at 44,100 samples per second * 60 (seconds) * 4 (minutes), will create an array of more than 10 million floating-point items per channel.

Because of the intrinsic logic of the FFT, any low-pass filtering run must run in a single thread. This means that the only optimization we can apply when running FFT-based low-pass filtering is parallelizing on a per-channel basis. For most cases, this choice can only bring a 2X throughput improvement, regardless of the processor count of the underlying system.

Summary

In this article, we were introduced to the applications of high-performance .NET programming. We learned how MVVM and XAML play their roles in .NET to create applications for various platforms, and we also learned about their performance characteristics. Next, we learned how high-performance .NET applies to engineering problems through a practical example of low-pass audio filtering, which showed how versatile it is to apply high-performance programming to specific engineering applications.

Resources for Article:

Further resources on this subject: Windows Phone 8 Applications [article] Core .NET Recipes [article] Parallel Programming Patterns [article]
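The NAudio example above is C#-specific, but the per-channel task parallelism it demonstrates is language-neutral. The short Python sketch below is an editorial illustration of the same structure and is not part of the original article; it uses a simple moving-average filter as a stand-in for the FFT-based low-pass, and the synthetic sample data, window size, and helper names are assumptions:

import math
from concurrent.futures import ThreadPoolExecutor

def moving_average(samples, window=32):
    # A crude low-pass stand-in: average each sample with its recent predecessors.
    out, acc = [], 0.0
    for i, s in enumerate(samples):
        acc += s
        if i >= window:
            acc -= samples[i - window]
        out.append(acc / min(i + 1, window))
    return out

# Two synthetic channels: a low-frequency tone plus some high-frequency content.
n = 22050
left = [math.sin(2 * math.pi * 100 * t / n) + 0.3 * math.sin(2 * math.pi * 5000 * t / n) for t in range(n)]
right = [math.sin(2 * math.pi * 120 * t / n) + 0.3 * math.sin(2 * math.pi * 4000 * t / n) for t in range(n)]

# One task per channel, mirroring the two Task.Factory.StartNew calls in the C# sample.
with ThreadPoolExecutor(max_workers=2) as pool:
    left_lp, right_lp = pool.map(moving_average, [left, right])

print(len(left_lp), len(right_lp))

As in the C# version, the filter itself stays single-threaded and the only parallelism exploited is one task per channel, which caps the theoretical gain at roughly 2X for stereo material.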
The Essentials of Working with Python Collections

In this article by Steven F. Lott, the author of the book Python Essentials, we'll look at the break and continue statements; these modify a for or while loop to allow skipping items or exiting before the loop has processed all items. This is a fundamental change in the semantics of a collection-processing statement.

(For more resources related to this topic, see here.)

Processing collections with the for statement

The for statement is an extremely versatile way to process every item in a collection. We do this by defining a target variable, a source of items, and a suite of statements. The for statement will iterate through the source of items, assigning each item to the target variable, and also execute the suite of statements. All of the collections in Python provide the necessary methods, which means that we can use anything as the source of items in a for statement.

Here's some sample data that we'll work with. This is part of Mike Keith's poem, Near a Raven. We'll remove the punctuation to make the text easier to work with:

>>> text = '''Poe, E.
...     Near a Raven
...
... Midnights so dreary, tired and weary.'''
>>> text = text.replace(",","").replace(".","").lower()

This will put the original text, with uppercase and lowercase and punctuation, into the text variable. When we use text.split(), we get a sequence of individual words. The for loop can iterate through this sequence of words so that we can process each one. The syntax looks like this:

>>> cadaeic= {}
>>> for word in text.split():
...     cadaeic[word]= len(word)

We've created an empty dictionary, and assigned it to the cadaeic variable. The expression in the for loop, text.split(), will create a sequence of substrings. Each of these substrings will be assigned to the word variable. The for loop body, a single assignment statement, will be executed once for each value assigned to word. The resulting dictionary might look like this (irrespective of ordering):

{'raven': 5, 'midnights': 9, 'dreary': 6, 'e': 1, 'weary': 5, 'near': 4, 'a': 1, 'poe': 3, 'and': 3, 'so': 2, 'tired': 5}

There's no guaranteed order for mappings or sets. Your results may differ slightly.

In addition to iterating over a sequence, we can also iterate over the keys in a dictionary:

>>> for word in sorted(cadaeic):
...   print(word, cadaeic[word])

When we use sorted() on a tuple or a list, an interim list is created with sorted items. When we apply sorted() to a mapping, the sorting applies to the keys of the mapping, creating a sequence of sorted keys. This loop will print a list in alphabetical order of the various pilish words used in this poem. Pilish is a subset of English where the word lengths are important: they're used as mnemonic aids.

A for statement corresponds to the "for all" logical quantifier, ∀. At the end of a simple for loop we can assert that all items in the source collection have been processed. In order to build the "there exists" quantifier, ∃, we can either use the while statement, or the break statement inside the body of a for statement.

Using literal lists in a for statement

We can apply the for statement to a sequence of literal values. One of the most common ways to present literals is as a tuple. It might look like this:

for scheme in 'http', 'https', 'ftp':
   do_something(scheme)

This will assign three different values to the scheme variable. For each of those values, it will evaluate the do_something() function. From this, we can see that, strictly speaking, the () are not required to delimit a tuple object.
If the sequence of values grows, however, and we need to span more than one physical line, we'll want to add the (), making the tuple literal more explicit.

Using the range() and enumerate() functions

The range() object will provide a sequence of numbers, and it is often used in a for loop. The range() object is iterable, but it's not itself a sequence object. It's a generator, which will produce items when required. If we use range() outside a for statement, we need to use a function like list(range(x)) or tuple(range(a,b)) to consume all of the generated values and create a new sequence object. The range() object has three commonly used forms:

range(n) produces ascending numbers including 0 but not including n itself. This is a half-open interval. We could say that range(n) produces numbers, x, such that 0 ≤ x < n. The expression list(range(5)) returns [0, 1, 2, 3, 4]. This produces n values including 0 and n - 1.

range(a,b) produces ascending numbers starting from a but not including b. The expression tuple(range(-1,3)) will return (-1, 0, 1, 2). This produces b - a values including a and b - 1.

range(x,y,z) produces ascending numbers in the sequence x, x+z, x+2z, ..., stopping before y. This produces (y-x)//z values.

We can use the range() object like this:

for n in range(1, 21):
   status= str(n)
   if n % 5 == 0: status += " fizz"
   if n % 7 == 0: status += " buzz"
   print(status)

In this example, we've used a range() object to produce values, n, such that 1 ≤ n < 21.

We use the range() object to generate the index values for all items in a list:

for n in range(len(some_list)):
   print(n, some_list[n])

We've used the range() function to generate values between 0 and the length of the sequence object named some_list.

The for statement allows multiple target variables. The rules for multiple target variables are the same as for a multiple variable assignment statement: a sequence object will be decomposed and items assigned to each variable. Because of that, we can leverage the enumerate() function to iterate through a sequence and assign the index values at the same time. It looks like this:

for n, v in enumerate(some_list):
     print(n, v)

The enumerate() function is a generator function which iterates through the items in the source sequence and yields a sequence of two-tuple pairs with the index and the item. Since we've provided two variables, the two-tuple is decomposed and assigned to each variable. There are numerous use cases for this multiple-assignment for loop. We often have list-of-tuples data structures that can be handled very neatly with this multiple-assignment feature.

Iterating with the while statement

The while statement is a more general iteration than the for statement. We'll use a while loop in two situations. We'll use it in cases where we don't have a finite collection to impose an upper bound on the loop's iteration; we may suggest an upper bound in the while clause itself. We'll also use it when writing a "search" or "there exists" kind of loop; we aren't processing all items in a collection.

A desktop application that accepts input from a user, for example, will often have a while loop. The application runs until the user decides to quit; there's no upper bound on the number of user interactions. For this, we generally use a while True: loop; an explicitly infinite iteration is the recommended idiom here.
If we want to write a character-mode user interface, we could do it like this: quit_received= False while not quit_received:    command= input("prompt> ")    quit_received= process(command) This will iterate until the quit_received variable is set to True. This will process indefinitely; there's no upper boundary on the number of iterations. This process() function might use some kind of command processing. This should include a statement like this: if command.lower().startswith("quit"): return True When the user enters "quit", the process() function will return True. This will be assigned to the quit_received variable. The while expression, not quit_received, will become False, and the loop ends. A "there exists" loop will iterate through a collection, stopping at the first item that meets certain criteria. This can look complex because we're forced to make two details of loop processing explicit. Here's an example of searching for the first value that meets a condition. This example assumes that we have a function, condition(), which will eventually be True for some number. Here's how we can use a while statement to locate the minimum for which this function is True: >>> n = 1 >>> while n != 101 and not condition(n): ...     n += 1 >>> assert n == 101 or condition(n) The while statement will terminate when n == 101 or the condition(n) is True. If this expression is False, we can advance the n variable to the next value in the sequence of values. Since we're iterating through the values in order from the smallest to the largest, we know that n will be the smallest value for which the condition() function is true. At the end of the while statement we have included a formal assertion that either n is 101 or the condition() function is True for the given value of n. Writing an assertion like this can help in design as well as debugging because it will often summarize the loop invariant condition. We can also write this kind of loop using the break statement in a for loop, something we'll look at in the next section. The continue and break statements The continue statement is helpful for skipping items without writing deeply-nested if statements. The effect of executing a continue statement is to skip the rest of the loop's suite. In a for loop, this means that the next item will be taken from the source iterable. In a while loop, this must be used carefully to avoid an otherwise infinite iteration. We might see file processing that looks like this: for line in some_file:    clean = line.strip()    if len(clean) == 0:        continue    data, _, _ = clean.partition("#")    data = data.rstrip()    if len(data) == 0:        continue    process(data) In this loop, we're relying on the way files act like sequences of individual lines. For each line in the file, we've stripped whitespace from the input line, and assigned the resulting string to the clean variable. If the length of this string is zero, the line was entirely whitespace, and we'll continue the loop with the next line. The continue statement skips the remaining statements in the body of the loop. We'll partition the line into three pieces: a portion in front of any "#", the "#" (if present), and the portion after any "#". We've assigned the "#" character and any text after the "#" character to the same easily-ignored variable, _, because we don't have any use for these two results of the partition() method. We can then strip any trailing whitespace from the string assigned to the data variable. 
If the resulting string has a length of zero, then the line is entirely filled with "#" and any trailing comment text. Since there's no useful data, we can continue the loop, ignoring this line of input. If the line passes the two if conditions, we can process the resulting data.

By using the continue statement, we have avoided complex-looking, deeply-nested if statements. It's important to note that a continue statement must always be part of the suite inside an if statement, inside a for or while loop. The condition on that if statement becomes a filter condition that applies to the collection of data being processed. continue always applies to the innermost loop.

Breaking early from a loop

The break statement is a profound change in the semantics of the loop. An ordinary for statement can be summarized by "for all." We can comfortably say that "for all items in a collection, the suite of statements was processed." When we use a break statement, a loop is no longer summarized by "for all." We need to change our perspective to "there exists." A break statement asserts that at least one item in the collection matches the condition that leads to the execution of the break statement. Here's a simple example of a break statement:

for n in range(1, 100):
   factors = []
   for x in range(1,n):
       if n % x == 0: factors.append(x)
   if sum(factors) == n:
       break

We've written a loop that is bound by 1 ≤ n < 100. This loop includes a break statement, so it will not process all values of n. Instead, it will determine the smallest value of n for which n is equal to the sum of its factors. Since the loop doesn't examine all values, it shows that at least one such number exists within the given range.

We've used a nested loop to determine the factors of the number n. This nested loop creates a sequence, factors, of all values of x in the range 1 ≤ x < n such that x is a factor of the number n. This inner loop doesn't have a break statement, so we are sure it examines all values in the given range. The least value for which this is true is the number six.

It's important to note that a break statement must always be part of the suite inside an if statement inside a for or while loop. If the break isn't in an if suite, the loop will always terminate while processing the first item. The condition on that if statement becomes the "there exists" condition that summarizes the loop as a whole. Clearly, multiple if statements with multiple break statements mean that the overall loop can have a potentially confusing and difficult-to-summarize post-condition.

Using the else clause on a loop

Python's else clause can be used on a for or while statement as well as on an if statement. The else clause executes after the loop body if there was no break statement executed. To see this, here's a contrived example:

>>> for item in 1,2,3:
...     print(item)
...     if item == 2:
...         print("Found",item)
...         break
... else:
...     print("Found Nothing")

The for statement here will iterate over a short list of literal values. When a specific target value has been found, a message is printed. Then, the break statement will end the loop, avoiding the else clause. When we run this, we'll see three lines of output, like this:

1
2
Found 2

The value of three isn't shown, nor is the "Found Nothing" message in the else clause. If we change the target value in the if statement from two to a value that won't be seen (for example, zero or four), then the output will change.
Breaking early from a loop

The break statement is a profound change in the semantics of the loop. An ordinary for statement can be summarized by "for all": we can comfortably say that "for all items in a collection, the suite of statements was processed." When we use a break statement, a loop is no longer summarized by "for all." We need to change our perspective to "there exists": a break statement asserts that at least one item in the collection matches the condition that leads to the execution of the break statement. Here's a simple example of a break statement:

for n in range(1, 100):
    factors = []
    for x in range(1, n):
        if n % x == 0:
            factors.append(x)
    if sum(factors) == n:
        break

We've written a loop that is bounded by 1 ≤ n < 100. This loop includes a break statement, so it will not process all values of n. Instead, it will determine the smallest value of n for which n is equal to the sum of its factors. Since the loop doesn't examine all values, it shows that at least one such number exists within the given range. We've used a nested loop to determine the factors of the number n. This nested loop creates a sequence, factors, of all values of x in the range 1 ≤ x < n such that x is a factor of the number n. This inner loop doesn't have a break statement, so we are sure it examines all values in the given range. The least value for which this is true is the number six.

It's important to note that a break statement must always be part of the suite inside an if statement inside a for or while loop. If the break isn't in an if suite, the loop will always terminate while processing the first item. The condition on that if statement becomes the "there exists" condition that summarizes the loop as a whole. Clearly, multiple if statements with multiple break statements mean that the overall loop can have a potentially confusing and difficult-to-summarize post-condition.

Using the else clause on a loop

Python's else clause can be used on a for or while statement as well as on an if statement. The else clause executes after the loop body if no break statement was executed. To see this, here's a contrived example:

>>> for item in 1, 2, 3:
...     print(item)
...     if item == 2:
...         print("Found", item)
...         break
... else:
...     print("Found Nothing")

The for statement here will iterate over a short list of literal values. When the specific target value has been found, a message is printed. Then, the break statement ends the loop, avoiding the else clause. When we run this, we'll see three lines of output, like this:

1
2
Found 2

The value of three isn't shown, nor is the "Found Nothing" message in the else clause. If we change the target value in the if statement from two to a value that won't be seen (for example, zero or four), then the output will change. If the break statement is not executed, then the else clause will be executed. The idea here is to allow us to write contrasting break and non-break suites of statements. An if statement suite that includes a break statement can do some processing in the suite before the break statement ends the loop. An else clause allows some processing at the end of the loop when none of the break-related suites was executed.

Summary

In this article, we've looked at the for statement, which is the primary way we'll process the individual items in a collection. A simple for statement assures us that our processing has been done for all items in the collection. We've also looked at the general-purpose while loop.

Resources for Article: Further resources on this subject: Introspecting Maya, Python, and PyMEL [article] Analyzing a Complex Dataset [article] Geo-Spatial Data in Python: Working with Geometry [article]


Creating an F# Project

Packt
08 Jul 2015
5 min read
In this article by Adnan Masood, author of the book, Learning F# Functional Data Structures and Algorithms, we see how to set up the IDE and create our first F# project. "Ah yes, Haskell. Where all the types are strong, all the men carry arrows, and all the children are above average." – marked trees (on the city of Haskell) The perceived adversity of functional programming is overly exaggerated; the essence of this paradigm is to explicitly recognize and enforce the referential transparency. We will see how to set up the tooling for Visual Studio 2013 and for F# 3.1, the currently available version of F# at the time of writing. We will review the F# 4.0 preview features by the end of this project. After we get the tooling sorted out, we will review some simple algorithms; starting with recursion with typical a Fibonacci sequence and Tower of Hanoi, we will perform lazy evaluation on a quick sort example. In this article, we will cover the following topics: Setting up Visual Studio and F# compiler to work together Setting up the environment and running your F# programs (For more resources related to this topic, see here.) Setting up the IDE As developers, we love our IDEs (Integrated Development Environments) and Visual Studio.NET is probably the best IDE for .NET development; no offense to Eclipse bloatware Luna. From the open source perspective, there has been a recent major development in making the .NET framework available as open source and on Mac and Linux platforms. Microsoft announced a pre-release of F# 4.0 in Visual Studio 2015 Preview and it will be available as part of the full release. To install and run F#, there are various options available for all platforms, sizes, and budgets. For those with a fear of commitments, there is the online interactive version of TryFsharp at http://www.tryfsharp.org/ where you can code in the browser. For Windows users, you have a few options. Until VS.NET 2015 comes out, you can try out the freely available Visual Studio Community 2013 or a Visual Studio 2013 trial edition, with trial being the keyword. The trial editions include Ultimate, Premium, and Professional versions. The free community edition IDE can be downloaded from https://www.visualstudio.com/en-us/news/vs2013-community-vs.aspx and the trial editions can be downloaded from http://www.visualstudio.com/downloads/download-visual-studio-vs. Alternatively, there are express editions available at no cost. Visual Studio Express 2013 for Windows Desktop Web editions can be downloaded from http://www.visualstudio.com/downloads/download-visual-studio-vs#d-express-windows-desktop. F# support is built into Visual Studio; the Visual F# tools package the latest updates to the F# compiler: interactive, runtime, and Visual Studio integration. F# support comes with Visual Studio. However, the F# team releases regular updates in the form of F# tools. The tools can be downloaded from www.microsoft.com/en-us/download/details.aspx?id=44011. The F# tools contain the F# command-line compiler (fsc.exe) and F# Interactive (fsi.exe), which are the easiest way to get started with F#. The fsi.exe compiler can be found in C:Program Files (x86)Microsoft SDKsF#<version>Framework<version>. The <version> placeholder in the preceding path is substituted by your .NET version installed. If you just want to use the F# compiler and tools from the command line, you can download the .NET framework 4.5 or above from https://www.microsoft.com/en-in/download/details.aspx?id=30653. 
You will also need the Windows SDK for associated dependencies, which can be downloaded from http://msdn.microsoft.com/windows/desktop/bg162891. Alternatively, Tsunami is a free IDE for F# that you can download from http://tsunami.io/download.html and use to build applications. CloudSharper by IntelliFactory is in beta but shows promise as a web-based IDE. For more information regarding CloudSharper, refer to http://cloudsharper.com/. In this article, we will be using Visual Studio 2013 Professional Edition and FSI (F# Interactive), but you can use either the trial or Express edition, or the FSI command line, to run the examples and exercises.

Your first F# project

Going through installation screens and showing how to click Next would be discourteous to our reader's intelligence, so we will leave step-by-step installation to other, more verbose texts. Let's start with our first F# project in Visual Studio. In the preceding screenshot, you can see the F# Interactive window at the bottom. Here we have selected FILE | New | Project because we are attempting to open a new project of the F# type. There are a few project types available, including console applications and F# library. For ease of explanation, let's begin with a Console Application as shown in the next screenshot.

Alternatively, from within Visual Studio, we can use F# Interactive (FSI), an effective tool for testing out your code quickly. You can open the FSI window by selecting View | Other Windows | F# Interactive from the Visual Studio IDE as shown in the next screenshot. FSI lets you run code from a console, which is similar to a shell. You can access the FSI executable directly from C:\Program Files (x86)\Microsoft SDKs\F#\<version>\Framework\<version>. FSI maintains session context, which means that the constructs created earlier in the FSI session are still available in later parts of the code. The FsiAnyCPU.exe executable file is the 64-bit counterpart of F# Interactive; Visual Studio determines which executable to use based on whether the machine's architecture is 32-bit or 64-bit. You can also change the F# Interactive parameters and settings from the Options dialog as shown in the following screenshot:


What is Apache Camel?

Packt
08 Jul 2015
9 min read
In this article by Jean-Baptiste Onofré, author of the book Mastering Apache Camel, we will see how Apache Camel originated in Apache ServiceMix. Apache ServiceMix 3 was powered by the Spring framework and implemented the JBI specification. The Java Business Integration (JBI) specification proposed a plug-and-play approach to integration problems. JBI was based on WebService concepts and standards; for instance, it directly reuses the Message Exchange Patterns (MEP) concept that comes from the WebService Description Language (WSDL). Camel reuses some of these concepts; for instance, you will see that we have the concept of MEP in Camel. However, JBI suffered mostly from two issues:

In JBI, all messages between endpoints are transported in the Normalized Messages Router (NMR). In the NMR, a message has a standard XML format. As all messages in the NMR have the same format, it's easy to audit messages and the format is predictable. However, the JBI XML format has an important drawback for performance: it needs to marshall and unmarshall the messages. Some protocols (such as REST or RMI) are not easy to describe in XML. For instance, REST can work in stream mode; it doesn't make sense to marshall streams in XML. Camel is payload-agnostic. This means that you can transport any kind of message with Camel (not necessarily XML-formatted).

JBI describes a packaging. We distinguish the binding components (responsible for the interaction with the systems outside of the NMR and the handling of the messages in the NMR) and the service engines (responsible for transforming the messages inside the NMR). However, it's not possible to directly deploy endpoints based on these components. JBI requires a service unit (a ZIP file) per endpoint, each packaged in a service assembly (another ZIP file). JBI also splits the description of the endpoint from its configuration. This does not result in a very flexible packaging: with definitions and configurations scattered across different files, it is not easy to maintain. In Camel, the configuration and definition of an endpoint are gathered in a simple URI, which is easier to read. Moreover, Camel doesn't force any packaging; the same definition can be packaged in a simple XML file, an OSGi bundle, or a regular JAR file.

In addition to JBI, another foundation of Camel is the book Enterprise Integration Patterns by Gregor Hohpe and Bobby Woolf. It describes design patterns answering classical problems in enterprise application integration and message-oriented middleware. The book describes the problems and the patterns to solve them. Camel strives to implement the patterns described in the book to make them easy to use and let the developer concentrate on the task at hand. This is what Camel is: an open source framework that allows you to integrate systems and that comes with a lot of connectors and Enterprise Integration Patterns (EIP) components out of the box. And if that is not enough, one can extend and implement custom components.

Components and bean support

Apache Camel ships with a wide variety of components out of the box; currently, there are more than 100 components available. We can see:

The connectivity components that allow exposure of endpoints for external systems or communication with external systems. For instance, the FTP, HTTP, JMX, WebServices, and JMS components, among many others, are connectivity components. Creating an endpoint and the associated configuration for these components is easy, by directly using a URI.
The internal components applying rules to the messages internally to Camel. These kinds of components apply validation or transformation rules to the inflight message. For instance, validation or XSLT are internal components. Camel brings a very powerful connectivity and mediation framework. Moreover, it's pretty easy to create new custom components, allowing you to extend Camel if the default components set doesn't match your requirements. It's also very easy to implement complex integration logic by creating your own processors and reusing your beans. Camel supports beans frameworks (IoC), such as Spring or Blueprint. Predicates and expressions As we will see later, most of the EIP need a rule definition to apply a routing logic to a message. The rule is described using an expression. It means that we have to define expressions or predicates in the Enterprise Integration Patterns. An expression returns any kind of value, whereas a predicate returns true or false only. Camel supports a lot of different languages to declare expressions or predicates. It doesn't force you to use one, it allows you to use the most appropriate one. For instance, Camel supports xpath, mvel, ognl, python, ruby, PHP, JavaScript, SpEL (Spring Expression Language), Groovy, and so on as expression languages. It also provides native Camel prebuilt functions and languages that are easy to use such as header, constant, or simple languages. Data format and type conversion Camel is payload-agnostic. This means that it can support any kind of message. Depending on the endpoints, it could be required to convert from one format to another. That's why Camel supports different data formats, in a pluggable way. This means that Camel can marshall or unmarshall a message in a given format. For instance, in addition to the standard JVM serialization, Camel natively supports Avro, JSON, protobuf, JAXB, XmlBeans, XStream, JiBX, SOAP, and so on. Depending on the endpoints and your need, you can explicitly define the data format during the processing of the message. On the other hand, Camel knows the expected format and type of endpoints. Thanks to this, Camel looks for a type converter, allowing to implicitly transform a message from one format to another. You can also explicitly define the type converter of your choice at some points during the processing of the message. Camel provides a set of ready-to-use type converters, but, as Camel supports a pluggable model, you can extend it by providing your own type converters. It's a simple POJO to implement. Easy configuration and URI Camel uses a different approach based on URI. The endpoint itself and its configuration are on the URI. The URI is human readable and provides the details of the endpoint, which is the endpoint component and the endpoint configuration. As this URI is part of the complete configuration (which defines what we name a route, as we will see later), it's possible to have a complete overview of the integration logic and connectivity in a row. Lightweight and different deployment topologies Camel itself is very light. The Camel core is only around 2 MB, and contains everythingrequired to run Camel. As it's based on a pluggable architecture, all Camel components are provided as external modules, allowing you to install only what you need, without installing superfluous and needlessly heavy modules. Camel is based on simple POJO, which means that the Camel core doesn't depend on other frameworks: it's an atomic framework and is ready to use. 
All other modules (components, DSL, and so on) are built on top of this Camel core. Moreover, Camel is not tied to one container for deployment. Camel supports a wide range of containers to run. They are as follows: A J2EE application server such as WebSphere, WebLogic, JBoss, and so on A Web container such as Apache Tomcat An OSGi container such as Apache Karaf A standalone application using frameworks such as Spring Camel gives a lot of flexibility, allowing you to embed it into your application or to use an enterprise-ready container. Quick prototyping and testing support In any integration project, it's typical that we have some part of the integration logic not yet available. For instance: The application to integrate with has not yet been purchased or not yet ready The remote system to integrate with has a heavy cost, not acceptable during the development phase Multiple teams work in parallel, so we may have some kinds of deadlocks between the teams As a complete integration framework, Camel provides a very easy way to prototype part of the integration logic. Even if you don't have the actual system to integrate, you can simulate this system (mock), as it allows you to implement your integration logic without waiting for dependencies. The mocking support is directly part of the Camel core and doesn't require any additional dependency. Along the same lines, testing is also crucial in an integration project. In such a kind of project, a lot of errors can happen and most are unforeseen. Moreover, a small change in an integration process might impact a lot of other processes. Camel provides the tools to easily test your design and integration logic, allowing you to integrate this in a continuous integration platform. Management and monitoring using JMX Apache Camel uses the Java Management Extension (JMX) standard and provides a lot of insights into the system using MBeans (Management Beans), providing a detailed view of the following current system: The different integration processes with the associated metrics The different components and endpoints with the associated metrics Moreover, these MBeans provide more insights than metrics. They also provide the operations to manage Camel. For instance, the operations allow you to stop an integration process, to suspend an endpoint, and so on. Using a combination of metrics and operations, you can configure a very agile integration solution. Active community The Apache Camel community is very active. This means that potential issues are identified very quickly and a fix is available soon after. However, it also means that a lot of ideas and contributions are proposed, giving more and more features to Camel. Another big advantage of an active community is that you will never be alone; a lot of people are active on the mailing lists who are ready to answer your question and provide advice. Summary Apache Camel is an enterprise integration solution used in many large organizations with enterprise support available through RedHat or Talend. Resources for Article: Further resources on this subject: Getting Started [article] A Quick Start Guide to Flume [article] Best Practices [article]

Processing Next-generation Sequencing Datasets Using Python

Packt
07 Jul 2015
25 min read
In this article by Tiago Antao, author of Bioinformatics with Python Cookbook, you will process next-generation sequencing datasets using Python. If you work in life sciences, you are probably aware of the increasing importance of computational methods to analyze increasingly larger datasets. There is a massive need for bioinformaticians to process this data, and one the main tools is, of course, Python. Python is probably the fastest growing language in the field of data sciences. It includes a rich ecology of software libraries to perform complex data analysis. Another major point in Python is its great community, which is always ready to help and produce great documentation and high-quality reliable software. In this article, we will use Python to process next-generation sequencing datasets. This is one of the many examples of Python usability in bioinformatics; chances are that if you have a biological dataset to analyze, Python can help you. This is surely the case with population genetics, genomics, phylogenetics, proteomics, and many other fields. Next-generation Sequencing (NGS) is one of the fundamental technological developments of the decade in the field of life sciences. Whole Genome Sequencing (WGS), RAD-Seq, RNA-Seq, Chip-Seq, and several other technologies are routinely used to investigate important biological problems. These are also called high-throughput sequencing technologies with good reason; they generate vast amounts of data that need to be processed. NGS is the main reason why computational biology is becoming a "big data" discipline. More than anything else, this is a field that requires strong bioinformatics techniques. There is very strong demand for professionals with these skillsets. Here, we will not discuss each individual NGS technique per se (this will be a massive undertaking). We will use two existing WGS datasets: the Human 1000 genomes project (http://www.1000genomes.org/) and the Anopheles 1000 genomes dataset (http://www.malariagen.net/projects/vector/ag1000g). The code presented will be easily applicable for other genomic sequencing approaches; some of them can also be used for transcriptomic analysis (for example, RNA-Seq). Most of the code is also species-independent, that is, you will be able to apply them to any species in which you have sequenced data. As this is not an introductory text, you are expected to at least know what FASTA, FASTQ, BAM, and VCF files are. We will also make use of basic genomic terminology without introducing it (things such as exomes, nonsynonym mutations, and so on). You are required to be familiar with basic Python, and we will leverage that knowledge to introduce the fundamental libraries in Python to perform NGS analysis. Here, we will concentrate on analyzing VCF files. Preparing the environment You will need Python 2.7 or 3.4. You can use many of the available distributions, including the standard one at http://www.python.org, but we recommend Anaconda Python from http://continuum.io/downloads. We also recommend the IPython Notebook (Project Jupyter) from http://ipython.org/. If you use Anaconda, this and many other packages are available with a simple conda install. There are some amazing libraries to perform data analysis in Python; here, we will use NumPy (http://www.numpy.org/) and matplotlib (http://matplotlib.org/), which you may already be using in your projects. We will also make use of the less widely used seaborn library (http://stanford.edu/~mwaskom/software/seaborn/). 
For bioinformatics, we will use Biopython (http://biopython.org) and PyVCF (https://pyvcf.readthedocs.org). The code used here is available on GitHub at https://github.com/tiagoantao/bioinf-python. In your realistic pipeline, you will probably be using other tools, such as bwa, samtools, or GATK to perform your alignment and SNP calling. In our case, tabix and bgzip (http://www.htslib.org/) is needed. Analyzing variant calls After running a genotype caller (for example, GATK or samtools mpileup), you will have a Variant Call Format (VCF) file reporting on genomic variations, such as SNPs (Single-Nucleotide Polymorphisms), InDels (Insertions/Deletions), CNVs (Copy Number Variation) among others. In this recipe, we will discuss VCF processing with the PyVCF module over the human 1000 genomes project to analyze SNP data. Getting ready I am to believe that 2 to 20 GB of data for a tutorial is asking too much. Although, the 1000 genomes' VCF files with realistic annotations are in that order of magnitude, we will want to work with much less data here. Fortunately, the bioinformatics community has developed tools that allow partial download of data. As part of the samtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. For example: tabix -fh ftp://ftp- trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_ with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_ integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 |bgzip -c > genotypes.vcf.gz tabix -p vcf genotypes.vcf.gz The first line will perform a partial download of the VCF file for chromosome 22 (up to 17 Mbp) of the 1000 genomes project. Then, bgzip will compress it. The second line will create an index, which we will need for direct access to a section of the genome. The preceding code is available at https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/01_NGS/Working_with_VCF.ipynb. How to do it… Take a look at the following steps: Let's start inspecting the information that we can get per record, as shown in the following code: import vcf v = vcf.Reader(filename='genotypes.vcf.gz')   print('Variant Level information') infos = v.infos for info in infos:    print(info)   print('Sample Level information') fmts = v.formats for fmt in fmts:    print(fmt)     We start by inspecting the annotations that are available for each record (remember that each record encodes variants, such as SNP, CNV, and InDel, and the state of that variant per sample). At the variant (record) level, we will find AC: the total number of ALT alleles in called genotypes, AF: the estimated allele frequency, NS: the number of samples with data, AN: the total number of alleles in called genotypes, and DP: the total read depth. There are others, but they are mostly specific to the 1000 genomes project (here, we are trying to be as general as possible). Your own dataset might have much more annotations or none of these.     At the sample level, there are only two annotations in this file: GT: genotype and DP: the per sample read depth. Yes, you have the per variant (total) read depth and the per sample read depth; be sure not to confuse both. 
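If you want to check which of these declared annotations are actually populated in your particular file, a quick tally over the first records is enough. The sketch below is only an illustration; the cap of 1,000 records is an arbitrary choice, and it assumes the genotypes.vcf.gz file prepared above:

import vcf
from collections import Counter

v = vcf.Reader(filename='genotypes.vcf.gz')
seen = Counter()
for i, rec in enumerate(v):
    if i == 1000:
        break
    # rec.INFO behaves like a dictionary of the variant-level annotations
    for key in rec.INFO:
        seen[key] += 1

for key, cnt in seen.most_common():
    print(key, cnt)

Annotations declared in the header but absent from this listing are never filled in for the region you downloaded, so there is little point in building filters around them.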
Now that we know which information is available, let's inspect a single VCF record with the following code: v = vcf.Reader(filename='genotypes.vcf.gz') rec = next(v) print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT, rec.QUAL, rec.FILTER) print(rec.INFO) print(rec.FORMAT) samples = rec.samples print(len(samples)) sample = samples[0] print(sample.called, sample.gt_alleles, sample.is_het, sample.phased) print(int(sample['DP']))     We start by retrieving standard information: the chromosome, position, identifier, reference base (typically, just one), alternative bases (can have more than one, but it is not uncommon as the first filtering approach to only accept a single ALT, for example, only accept bi-allelic SNPs), quality (PHRED scaled—as you may expect), and the FILTER status. Regarding the filter, remember that whatever the VCF file says, you may still want to apply extra filters (as in the next recipe).     Then, we will print the additional variant-level information (AC, AS, AF, AN, DP, and so on), followed by the sample format (in this case, DP and GT). Finally, we will count the number of samples and inspect a single sample checking if it was called for this variant. If available, the reported alleles, heterozygosity, and phasing status (this dataset happens to be phased, which is not that common). Let's check the type of variant and the number of nonbiallelic SNPs in a single pass with the following code: from collections import defaultdict f = vcf.Reader(filename='genotypes.vcf.gz')   my_type = defaultdict(int) num_alts = defaultdict(int)   for rec in f:    my_type[rec.var_type, rec.var_subtype] += 1    if rec.is_snp:        num_alts[len(rec.ALT)] += 1 print(num_alts) print(my_type)     We use the Python defaultdict collection type. We find that this dataset has InDels (both insertions and deletions), CNVs, and, of course, SNPs (roughly two-third being transitions with one-third transversions). There is a residual number (79) of triallelic SNPs. There's more… The purpose of this recipe is to get you up to speed on the PyVCF module. At this stage, you should be comfortable with the API. We do not delve much here on usage details because that will be the main purpose of the next recipe: using the VCF module to study the quality of your variant calls. It will probably not be a shocking revelation that PyVCF is not the fastest module on earth. This file format (highly text-based) makes processing a time-consuming task. There are two main strategies of dealing with this problem: parallel processing or converting to a more efficient format. Note that VCF developers will perform a binary (BCF) version to deal with part of these problems at http://www.1000genomes.org/wiki/analysis/variant-call-format/bcf-binary-vcf-version-2. See also The specification for VCF is available at http://samtools.github.io/hts-specs/VCFv4.2.pdf GATK is one of the most widely used variant callers; check https://www.broadinstitute.org/gatk/ samtools and htslib are both used for variant calling and SAM/BAM management; check http://htslib.org Studying genome accessibility and filtering SNP data If you are using NGS data, the quality of your VCF calls may need to be assessed and filtered. Here, we will put in place a framework to filter SNP data. More than giving filtering rules (an impossible task to be performed in a general way), we give you procedures to assess the quality of your data. With this, you can then devise your own filters. 
Getting ready In the best-case scenario, you have a VCF file with proper filters applied; if this is the case, you can just go ahead and use your file. Note that all VCF files will have a FILTER column, but this does not mean that all the proper filters were applied. You have to be sure that your data is properly filtered. In the second case, which is one of the most common, your file will have unfiltered data, but you have enough annotations. Also, you can apply hard filters (that is, no need for programmatic filtering). If you have a GATK annotated file, refer, for instance, to http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set. In the third case, you have a VCF file that has all the annotations that you need, but you may want to apply more flexible filters (for example, "if read depth > 20, then accept; if mapping quality > 30, accept if mapping quality > 40"). In the fourth case, your VCF file does not have all the necessary annotations, and you have to revisit your BAM files (or even other sources of information). In this case, the best solution is to find whatever extra information you have and create a new VCF file with the needed annotations. Some genotype callers like GATK allow you to specify with annotations you want; you may also want to use extra programs to provide more annotations, for example, SnpEff (http://snpeff.sourceforge.net/) will annotate your SNPs with predictions of their effect (for example, if they are in exons, are they coding on noncoding?). It is impossible to provide a clear-cut recipe; it will vary with the type of your sequencing data, your species of study, and your tolerance to errors, among other variables. What we can do is provide a set of typical analysis that is done for high-quality filtering. In this recipe, we will not use data from the Human 1000 genomes project; we want "dirty" unfiltered data that has a lot of common annotations that can be used to filter it. We will use data from the Anopheles 1000 genomes project (Anopheles is the mosquito vector involved in the transmission of the parasite causing malaria), which makes available filtered and unfiltered data. You can find more information about this project at http://www.malariagen.net/projects/vector/ag1000g. We will get a part of the centromere of chromosome 3L for around 100 mosquitoes, followed by a part somewhere in the middle of that chromosome (and index both), as shown in the following code: tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC. phase1.AR1.vcf.gz 3L:1-200000 |bgzip -c > centro.vcf.gz tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC. phase1.AR1.vcf.gz 3L:21000001-21200000 |bgzip -c > standard.vcf.gz tabix -p vcf centro.vcf.gz tabix -p vcf standard.vcf.gz As usual, the code to download this data is available at the https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/01_NGS/Filtering_SNPs.ipynb notebook. Finally, a word of warning about this recipe: the level of Python here will be slightly more complicated than before. The more general code that we will write may be easier to reuse in your specific case. We will perform extensive use of functional programming techniques (lambda functions) and the partial function application. 
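If the partial function application is new to you, the following toy sketch (unrelated to VCF processing, purely illustrative) shows the two idioms this recipe relies on: a lambda wraps an expression as a throwaway function, and functools.partial pre-fills some arguments of an existing function, returning a new function that waits for the rest:

import functools
import numpy as np

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# A lambda: a small anonymous function used where a named def would be noise
double = lambda x: 2 * x
print(double(21))           # 42

# partial: np.percentile with q fixed at 75, still waiting for the data argument
percentile_75 = functools.partial(np.percentile, q=75)
print(percentile_75(data))  # 7.75

The recipe uses exactly this trick to pass "configured" statistics, such as the 75th percentile, into generic windowing functions that only know how to call a one-argument function.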
How to do it…

Take a look at the following steps:

Let's start by plotting the distribution of variants across the genome in both files as follows:

%matplotlib inline
from collections import defaultdict

import seaborn as sns
import matplotlib.pyplot as plt

import vcf

def do_window(recs, size, fun):
    start = None
    win_res = []
    for rec in recs:
        if not rec.is_snp or len(rec.ALT) > 1:
            continue
        if start is None:
            start = rec.POS
        my_win = 1 + (rec.POS - start) // size
        while len(win_res) < my_win:
            win_res.append([])
        win_res[my_win - 1].extend(fun(rec))
    return win_res

wins = {}
size = 2000
vcf_names = ['centro.vcf.gz', 'standard.vcf.gz']
for vcf_name in vcf_names:
    recs = vcf.Reader(filename=vcf_name)
    wins[vcf_name] = do_window(recs, size, lambda x: [1])

    We start by performing the required imports (as usual, remember to remove the first line if you are not on the IPython Notebook). Before I explain the function, note what we will do. For both files, we will compute windowed statistics: we will divide our file that includes 200,000 bp of data into windows of size 2,000 (100 windows). Every time we find a bi-allelic SNP, we will add one to the list related to that window in the window function. The window function will take a VCF record (an SNP—rec.is_snp—that is bi-allelic—len(rec.ALT) == 1), determine the window where that record belongs (by performing an integer division of its offset from the starting position by size), and extend the list of results of that window by the function that is passed to it as the fun parameter (which in our case simply returns [1]).

    So, now we have a list of 100 elements (each representing 2,000 base pairs). Each element will be another list, which will have 1 for each bi-allelic SNP found. So, if you have 200 SNPs in the first 2,000 base pairs, the first element of the list will have 200 ones. Let's continue:

def apply_win_funs(wins, funs):
    fun_results = []
    for win in wins:
        my_funs = {}
        for name, fun in funs.items():
            try:
                my_funs[name] = fun(win)
            except:
                my_funs[name] = None
        fun_results.append(my_funs)
    return fun_results

stats = {}
fig, ax = plt.subplots(figsize=(16, 9))
for name, nwins in wins.items():
    stats[name] = apply_win_funs(nwins, {'sum': sum})
    x_lim = [i * size for i in range(len(stats[name]))]
    ax.plot(x_lim, [x['sum'] for x in stats[name]], label=name)
ax.legend()
ax.set_xlabel('Genomic location in the downloaded segment')
ax.set_ylabel('Number of variant sites (bi-allelic SNPs)')
fig.suptitle('Number of bi-allelic SNPs along the genome', fontsize='xx-large')

    Here, we will perform a plot that contains statistical information for each of our 100 windows. The apply_win_funs function will calculate a set of statistics for every window. In this case, it will sum all the numbers in the window. Remember that every time we find an SNP, we add one to the window list. This means that if we have 200 SNPs, we will have 200 1s; hence, summing them will return 200.

    So, we are able to compute the number of SNPs per window in an apparently convoluted way. Why we are doing things with this strategy will become apparent soon, but for now, let's check the result of this computation for both files (refer to the following figure):

Figure 1: The number of bi-allelic SNPs distributed over windows of 2,000 bp in size for an area of 200 Kbp near the centromere (blue) and in the middle of the chromosome (green).
Both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project     Note that the amount of SNPs in the centromere is smaller than the one in the middle of the chromosome. This is expected because calling variants in chromosomes is more difficult than calling variants in the middle and also because probably there is less genomic diversity in centromeres. If you are used to humans or other mammals, you may find the density of variants obnoxiously high, that is, mosquitoes for you! Let's take a look at the sample-level annotation. We will inspect Mapping Quality Zero (refer to https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityZeroBySample.php for details), which is a measure of how well all the sequences involved in calling this variant map clearly to this position. Note that there is also an MQ0 annotation at the variant-level: import functools   import numpy as np mq0_wins = {} vcf_names = ['centro.vcf.gz', 'standard.vcf.gz'] size = 5000 def get_sample(rec, annot, my_type):    res = []    samples = rec.samples    for sample in samples:        if sample[annot] is None: # ignoring nones            continue        res.append(my_type(sample[annot]))    return res   for vcf_name in vcf_names:    recs = vcf.Reader(filename=vcf_name)    mq0_wins[vcf_name] = do_window(recs, size, functools.partial(get_sample, annot='MQ0', my_type=int))     Start by inspecting this by looking at the last for; we will perform a windowed analysis by getting the MQ0 annotation from each record. We perform this by calling the get_sample function in which we return our preferred annotation (in this case, MQ0) cast with a certain type (my_type=int). We will use the partial application function here. Python allows you to specify some parameters of a function and wait for other parameters to be specified later. Note that the most complicated thing here is the functional programming style. Also, note that it makes it very easy to compute other sample-level annotations; just replace MQ0 with AB, AD, GQ, and so on. You will immediately have a computation for that annotation. If the annotation is not of type integer, no problem; just adapt my_type. This is a difficult programming style if you are not used to it, but you will reap the benefits very soon. Let's now print the median and top 75 percent percentile for each window (in this case, with a size of 5,000) as follows: stats = {} colors = ['b', 'g'] i = 0 fig, ax = plt.subplots(figsize=(16, 9)) for name, nwins in mq0_wins.items():    stats[name] = apply_win_funs(nwins, {'median': np.median, '75': functools.partial(np.percentile, q=75)})    x_lim = [j * size for j in range(len(stats[name]))]    ax.plot(x_lim, [x['median'] for x in stats[name]], label=name, color=colors[i])    ax.plot(x_lim, [x['75'] for x in stats[name]], '--', color=colors[i])    i += 1 ax.legend() ax.set_xlabel('Genomic location in the downloaded segment') ax.set_ylabel('MQ0') fig.suptitle('Distribution of MQ0 along the genome', fontsize='xx-large')     Note that we now have two different statistics on apply_win_funs: percentile and median. Again, we will pass function names as parameters (np.median) and perform the partial function application (np.percentile). 
The result can be seen in the following figure: Figure 2: Median (continuous line) and 75th percentile (dashed) of MQ0 of sample SNPs distributed on windows of 5,000 bp of size for an area of 200 Kbp near the centromere (blue) and in the middle of chromosome (green); both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project     For the "standard" file, the median MQ0 is 0 (it is plotted at the very bottom, which is almost unseen); this is good as it suggests that most sequences involved in the calling of variants map clearly to this area of the genome. For the centromere, MQ0 is of poor quality. Furthermore, there are areas where the genotype caller could not find any variants at all; hence, the incomplete chart. Let's compare heterozygosity with the DP sample-level annotation:     Here, we will plot the fraction of heterozygosity calls as a function of the sample read depth (DP) for every SNP. We will first explain the result and only then the code that generates it.     The following screenshot shows the fraction of calls that are heterozygous at a certain depth: Figure 3: The continuous line represents the fraction of heterozygosite calls computed at a certain depth; in blue is the centromeric area, in green is the "standard" area; the dashed lines represent the number of sample calls per depth; both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project In the preceding screenshot, there are two considerations to be taken into account:     At a very low depth, the fraction of heterozygote calls is biased low; this makes sense because the number of reads per position does not allow you to make a correct estimate of the presence of both alleles in a sample. So, you should not trust calls at a very low depth.     As expected, the number of calls in the centromere is way lower than calls outside it. The distribution of SNPs outside the centromere follows a common pattern that you can expect in many datasets. Here is the code: def get_sample_relation(recs, f1, f2):    rel = defaultdict(int)    for rec in recs:        if not rec.is_snp:              continue        for sample in rec.samples:            try:                 v1 = f1(sample)                v2 = f2(sample)                if v1 is None or v2 is None:                    continue # We ignore Nones                rel[(v1, v2)] += 1            except:                pass # This is outside the domain (typically None)    return rel   rels = {} for vcf_name in vcf_names:    recs = vcf.Reader(filename=vcf_name)    rels[vcf_name] = get_sample_relation(recs, lambda s: 1 if s.is_het else 0, lambda s: int(s['DP'])) Let's start by looking at the for loop. Again, we will use functional programming: the get_sample_relation function will traverse all the SNP records and apply the two functional parameters; the first determines heterozygosity, whereas the second gets the sample DP (remember that there is also a variant DP).     Now, as the code is complex as it is, I opted for a naive data structure to be returned by get_sample_relation: a dictionary where the key is the pair of results (in this case, heterozygosity and DP) and the sum of SNPs, which share both values. There are more elegant data structures with different trade-offs for this: scipy spare matrices, pandas' DataFrames, or maybe, you want to consider PyTables. 
The fundamental point here is to have a framework that is general enough to compute relationships among a couple of sample annotations.     Also, be careful with the dimension space of several annotations, for example, if your annotation is of float type, you might have to round it (if not, the size of your data structure might become too big). Now, let's take a look at all the plotting codes. Let's perform it in two parts; here is part 1: def plot_hz_rel(dps, ax, ax2, name, rel):    frac_hz = []    cnt_dp = []    for dp in dps:        hz = 0.0        cnt = 0          for khz, kdp in rel.keys():             if kdp != dp:                continue            cnt += rel[(khz, dp)]            if khz == 1:                hz += rel[(khz, dp)]        frac_hz.append(hz / cnt)        cnt_dp.append(cnt)    ax.plot(dps, frac_hz, label=name)    ax2.plot(dps, cnt_dp, '--', label=name)     This function will take a data structure (as generated by get_sample_relation) expecting that the first parameter of the key tuple is the heterozygosity state (0 = homozygote, 1 = heterozygote) and the second is the DP. With this, it will generate two lines: one with the fraction of samples (which are heterozygotes at a certain depth) and the other with the SNP count. Let's now call this function, as shown in the following code: fig, ax = plt.subplots(figsize=(16, 9)) ax2 = ax.twinx() for name, rel in rels.items():    dps = list(set([x[1] for x in rel.keys()]))    dps.sort()    plot_hz_rel(dps, ax, ax2, name, rel) ax.set_xlim(0, 75) ax.set_ylim(0, 0.2) ax2.set_ylabel('Quantity of calls') ax.set_ylabel('Fraction of Heterozygote calls') ax.set_xlabel('Sample Read Depth (DP)') ax.legend() fig.suptitle('Number of calls per depth and fraction of calls which are Hz',,              fontsize='xx-large')     Here, we will use two axes. On the left-hand side, we will have the fraction of heterozygozite SNPs, whereas on the right-hand side, we will have the number of SNPs. Then, we will call our plot_hz_rel for both data files. The rest is standard matplotlib code. Finally, let's compare variant DP with the categorical variant-level annotation: EFF. EFF is provided by SnpEFF and tells us (among many other things) the type of SNP (for example, intergenic, intronic, coding synonymous, and coding nonsynonymous). The Anopheles dataset provides this useful annotation. Let's start by extracting variant-level annotations and the functional programming style, as shown in the following code: def get_variant_relation(recs, f1, f2):    rel = defaultdict(int)    for rec in recs:        if not rec.is_snp:              continue        try:            v1 = f1(rec)            v2 = f2(rec)            if v1 is None or v2 is None:                continue # We ignore Nones            rel[(v1, v2)] += 1        except:            pass    return rel     The programming style here is similar to get_sample_relation, but we do not delve into the samples. Now, we will define the types of effects that we will work with and convert the effect to an integer as it would allow you to use it as in index, for example, matrices. 
Think about coding a categorical variable: accepted_eff = ['INTERGENIC', 'INTRON', 'NON_SYNONYMOUS_CODING', 'SYNONYMOUS_CODING']   def eff_to_int(rec):    try:        for annot in rec.INFO['EFF']:            #We use the first annotation            master_type = annot.split('(')[0]            return accepted_eff.index(master_type)    except ValueError:        return len(accepted_eff) We will now traverse the file; the style should be clear to you now: eff_mq0s = {} for vcf_name in vcf_names:    recs = vcf.Reader(filename=vcf_name)    eff_mq0s[vcf_name] = get_variant_relation(recs, lambda r: eff_to_int(r), lambda r: int(r.INFO['DP'])) Finally, we will plot the distribution of DP using the SNP effect, as shown in the following code: fig, ax = plt.subplots(figsize=(16,9)) vcf_name = 'standard.vcf.gz' bp_vals = [[] for x in range(len(accepted_eff) + 1)] for k, cnt in eff_mq0s[vcf_name].items():    my_eff, mq0 = k    bp_vals[my_eff].extend([mq0] * cnt) sns.boxplot(bp_vals, sym='', ax=ax) ax.set_xticklabels(accepted_eff + ['OTHER']) ax.set_ylabel('DP (variant)') fig.suptitle('Distribution of variant DP per SNP type',              fontsize='xx-large') Here, we will just print a box plot for the noncentromeric file (refer to the following screenshot). The results are as expected: SNPs in code areas will probably have more depth if they are in more complex regions (that is easier to call) than intergenic SNPs: Figure 4: Boxplot for the distribution of variant read depth across different SNP effects There's more… The approach would depend on the type of sequencing data that you have, the number of samples, and potential extra information (for example, pedigree among samples). This recipe is very complex as it is, but parts of it are profoundly naive (there is a limit of complexity that I could force on you on a simple recipe). For example, the window code does not support overlapping windows; also, data structures are simplistic. However, I hope that they give you an idea of the general strategy to process genomic high-throughput sequencing data. See also There are many filtering rules, but I would like to draw your attention to the need of reasonably good coverage (clearly more than 10 x), for example, refer to. Meynet et al "Variant detection sensitivity and biases in whole genome and exome sequencing" at http://www.biomedcentral.com/1471-2105/15/247/ Brad Chapman is one of the best known specialist in sequencing analysis and data quality with Python and the main author of Blue Collar Bioinformatics, a blog that you may want to check at https://bcbio.wordpress.com/ Brad is also the main author of bcbio-nextgen, a Python-based pipeline for high-throughput sequencing analysis. Refer to https://bcbio-nextgen.readthedocs.org Peter Cock is the main author of Biopython and is heavily involved in NGS analysis; be sure to check his blog, "Blasted Bionformatics!?" at http://blastedbio.blogspot.co.uk/ Summary In this article, we prepared the environment, analyzed variant calls and learned about genome accessibility and filtering SNP data.


Methodology for Modeling Business Processes in SOA

Packt
07 Jul 2015
27 min read
This article by Matjaz B. Juric, Sven Bernhardt, Hajo Normann, Danilo Schmiedel, Guido Schmutz, Mark Simpson, and Torsten Winterberg, authors of the book Design Principles for Process-driven Architectures Using Oracle BPM and SOA Suite 12c, describes the strategies and a methodology that can help us realize the benefits of BPM as a successful enterprise modernization strategy. In this article, we will do the following: Provide the reader with a set of actions in the course of a complete methodology that they can incorporate in order to create the desired attractiveness towards broader application throughout the enterprise Describe organizational and cultural barriers to applying enterprise BPM and discuss ways to overcome them (For more resources related to this topic, see here.) The postmature birth of enterprise BPM When enterprise architects discuss the future of the software landscape of their organization, they map the functional capabilities, such as customer relationship management and order management, to existing or new applications—some packaged and some custom. Then, they connect these applications by means of middleware. They typically use the notion of an integration middleware, such as an enterprise service bus (ESB), to depict the technical integration between these applications, exposing functionality as services, APIs, or, more trendy, "micro services". These services are used by modern, more and more mobile frontends and B2B partners. For several years now, it has been hard to find a PowerPoint slide that discusses future enterprise middleware without the notion of a BPM layer that sits on top of the frontend and the SOA service layer. So, in most organizations, we find a slide deck that contains this visual box named BPM, signifying the aim to improve process excellence by automating business processes along the management discipline known as business process management (BPM). Over the years, we have seen that the frontend layer often does materialize as a portal or through a modern mobile application development platform. The envisioned SOA services can be found living on an ESB or API gateway. Yet, the component called BPM and the related practice of modeling executable processes has failed to finally incubate until now—at least in most organizations. BPM still waits for morphing from an abstract item on a PowerPoint slide and in a shelved analyst report to some automated business processes that are actually deployed to a business-critical production machine. When we look closer—yes—there is a license for a BPM tool, and yes, some processes have even been automated, but those tend to be found rather in the less visible corners of the enterprise, seldom being of the concern for higher management and the hot project teams that work day and night on the next visible release. In short, BPM remains the hobby of some enterprise architects and the expensive consultants they pay. Will BPM ever take the often proposed lead role in the middleware architect's tool box? Will it lead to a better, more efficient, and productive organization? To be very honest, at the moment the answer to that question is rather no than yes. There is a good chance that BPM remains just one more of the silver bullets that fill some books and motivate some great presentations at conferences, yet do not have an impact on the average organization. But there is still hope for enterprise BPM as opposed to a departmental approach to process optimization. 
There is a good chance that BPM, next to other enabling technologies, will indeed be the driver for successful enterprise modernization. Large organizations all over the globe reengage with smaller and larger system integrators to tackle the process challenge. Maybe BPM as a practice needs more time than other items found in Gardner hype curves to mature before it's widely applied. This necessary level of higher maturity encompasses both the tools and the people using them. Ultimately, this question of large-scale BPM adoption will be answered individually in each organization. Only when a substantive set of enterprises experience tangible benefits from BPM will they talk about it, thus growing a momentum that leads to the success of enterprise BPM as a whole. This positive marketing based on actual project and program success will be the primary way to establish a force of attraction towards BPM that will raise curiosity and interest in the minds of the bulk of the organizations that are still rather hesitant or ignorant about using BPM. Oracle BPM Suite 12c – new business architecture features New tools in Oracle BPM Suite 12c put BPM in the mold of business architecture (BA). This new version contains new BA model types and features that help companies to move out of the IT-based, rather technical view of business processes automation and into strategic process improvement. Thus, these new model types help us to embark on the journey towards enterprise BPM. This is an interesting step in evolution of enterprise middleware—Oracle is the first vendor of a business process automation engine that moved up from concrete automated processes to strategic views on end-to-end processes, thus crossing the automation/strategic barrier. BPM Suite 12c introduces cross-departmental business process views. Thereby, it allows us to approach an enterprise modeling exercise through top-down modeling. It has become an end-to-end value chain model that sits on top of processes. It chains separated business processes together into one coherent end-to-end view. The value chain describes a bird's-eye view of the steps needed to achieve the most critical business goals of an organization. These steps comprise of business processes, of which some are automated in a BPMN engine and others actually run in packaged applications or are not automated at all. Also, BPM Suite 12c allows the capturing of the respective goals and provides the tools to measure them as KPIs and service-level agreements. In order to understand the path towards these yet untackled areas, it is important to understand where we stand today with BPM and what this new field of end-to-end process transparency is all about. Yet, before we get there, we will leave enterprise IT for a moment and take a look at the nature of a game (any game) in order to prepare for a deeper understanding of the mechanisms and cultural impact that underpin the move from a departmental to an enterprise approach to BPM. Football games – same basic rules, different methodology Any game, be it a physical sport, such as football, or a mental sport, such as chess, is defined through a set of common rules. How the game is played will look very different depending on the level of the league it is played in. A Champions League football game is so much more refined than your local team playing at the nearby stadium, not to mention the neighboring kids kicking the ball in the dirt ground. 
These kids will show creativity and pleasure in the game, yet the level of sophistication is a completely different ball game in the big stadium. You can marvel at the effort made to ensure that everybody plays their role in a well-trained symbiosis with their peers, all sharing a common set of collaboration rules and patterns. The team spent so many hours training the right combinations, establishing a deep trust. There is no time to discuss the meaning of an order shouted out by the trainer. They have worked on this common understanding of how to do things in various situations. They have established one language that they share. As an observer of a great match, you appreciate an elaborate art, not of one artist but of a coherent team. No one would argue the prevalence of this continuum in the refinement and rising sophistication in rough physical sports. It is puzzling to know to which extent we, as players in the games of enterprise IT, often tend to neglect the needs and forces that are implied by such a continuum of rising sophistication. The next sections will take a closer look at these BPM playgrounds and motivate you to take the necessary steps toward team excellence when moving from a small, departmental level BPM/SOA project to a program approach that is centered on a BPM and SOA paradigm. Which BPM game do we play? Game Silo BPM is the workflow or business process automation in organizational departments. It resembles the kids playing soccer on the neighborhood playground. After a few years of experience with automated processes, the maturity rises to resemble your local football team—yes, they play in a stadium, and it is often not elegant. Game Silo BPM is a tactical game in which work gets done while management deals with reaching departmental goals. New feature requests lead to changed or new applications and the people involved know each other very well over many years under established leadership. Workflows are automated to optimize performance. Game Enterprise BPM thrives for process excellence at Champions League. It is a strategic game in which higher management and business departments outline the future capability maps and cross-departmental business process models. In this game, players tend to regard the overall organization as a set of more or less efficient functional capabilities. One or a group of functional capabilities make up a department. Game Silo BPM – departmental workflows Today, most organizations use BPM-based process automation based on tools such as Oracle BPM Suite to improve the efficiency of the processes within departments. These processes often support the functional capability that this particular department owns by increasing its efficiency. Increasing efficiency is to do more with less. It is about automating manual steps, removing bottlenecks, and making sure that the resources are optimally allocated across your process. The driving force is typically the team manager, who is measured by the productivity of his team. The key factor to reach this goal is the automation of redundant or unnecessary human/IT interactions. Through process automation, we gain insights into the performance of the process. This insight can be called process transparency, allowing for the constant identification of areas of improvement. A typical example of the processes found in Game Silo BPM is an approval process that can be expressed as a clear path among process participants. 
Often, these processes work on documents, thus having a tight relationship with content management. We also find complex backend integration processes that involve human interaction only in the case of an exception. The following figure depicts this siloed approach to process automation. Referring to a given business unit as a silo indicates its closed and self-contained world. A word of caution is needed: the term "silo" is often used with a negative connotation. It is important, though, to recognize that a closed and coherent team with no dependencies on other teams is a preferred working environment for many employees and managers. In more general terms, it reflects an archaic type of organization that we lean towards: it allows everybody in the team to indulge in what can be called a siege mentality, as we humans gravitate to the coziness of a well-defined and manageable island of order.
Figure 1: Workflow automation in departmental silos
As discussed in the introduction, there is a chance that BPM never leaves this departmental context in many organizations. If you are curious to find out where you stand today, it is easy to determine whether a business process is stuck in the silo or whether it is part of an enterprise BPM strategy. Just ask whether the process is aligned with corporate goals or whether it is solely associated with departmental goals and KPIs. Another sign is the lack of a methodology and of a link to cross-departmental governance that defines the mode of operation and the means to create consistent models that speak one common language. To impose the enterprise architecture tools of Game Enterprise BPM on Game Silo BPM would be an overhead; you don't need them if your organization does not strive for cross-departmental process transparency.
Oracle BPM Suite 11g is made for playing Game Silo BPM
Oracle BPM Suite 11g provides all the tools and functionality necessary to automate a departmental workflow. It is not, however, sufficient for modeling business processes that span departmental boundaries and require top-down process hierarchies. The key components are the following:
The BPMN process modeling tool and the respective execution engine
The means to organize logical roles and depict them as swimlanes in BPMN process models
The human task component that involves human input in decision making
The business rules component that supports automated decision making
The technical means to call backend SOA services
The wizards to create data mappings
Business activity monitoring (BAM) to measure process performance
Oracle BPM Suite models processes in BPMN
Workflows are modeled at a fairly fine level of granularity using the BPMN 2.0 standard (and later versions). BPMN is both business-ready and technically detailed enough to allow modeled processes to be executed in a process engine. Oracle fittingly expresses this mechanism as what you see is what you execute. These workflows typically orchestrate human interaction through human tasks and functionality through SOA services. In the next sections of this article, we will move our heads out of the cocoon of the silo, looking higher and higher along the hierarchy of the enterprise until we reach the world of workshops and polished PowerPoint slides in which the strategy of the organization is defined.
Game Enterprise BPM
Enterprise BPM is a management discipline with a methodology typically found in business architecture (BA) and the respective enterprise architecture (EA) teams.
Representatives of higher management, business departments, and EA teams meet with management consultants in order to understand the current situation and the desired state of the overall organization. To this end, enterprise architects define process maps—a high-level view of the organization's business processes—for both the AS-IS state and various TO-BE states. In the next step, they define the desired future state and depict strategies and means to reach it. Business processes that generate value for the organization typically span several departments. The steps in these end-to-end processes can be mapped to the functional building blocks—the departments.
Figure 2: A cross-departmental business process needs an owner
The goal of Game Enterprise BPM is to manage enterprise business processes, making sure they realize the corporate strategy and meet the respective goals, which are ideally measured by key performance indicators, such as customer satisfaction, failure reduction, and cost reduction. It is a good practice to support these cross-departmental efforts through the notion of a common or shared language that is spoken across departmental boundaries. A governance board is a great means to establish this capability in the overall organization.
Still wide open – the business/IT divide
Organizational change in the management structure is a prerequisite for the success of Game Enterprise BPM but is not a sufficient condition. Several business process management books describe the main challenge in enterprise BPM as the still-wide-open business/IT divide. There is still a gap between how processes are understood and owned in Game Enterprise BPM and how automated processes are modeled and perceived in the departmental workflows of Game Silo BPM. Principles, goals, standards, and best practices defined in Game Enterprise BPM do not trickle down into everyday work in Game Silo BPM. One of the biggest reasons for this divide is that there is no direct link between the models and tools used by top management to depict the business strategy, IT strategy, business architecture, high-level value chains, and process models on the one hand, and the models and artifacts used in enterprise architecture—and, from there, software architecture—on the other.
Figure 3: Gap between business architecture and IT enterprise architecture in strategic BPM
So, traditionally, organizations use specific business architecture or enterprise architecture tools in order to depict AS-IS and TO-BE high-level reflections of value chains, hierarchical business processes, and capability maps alongside application heat maps. These models tend to hang in the air; they are not deeply grounded in real life. Business process models expressed as event-driven process chains (EPCs), Visio process models, and other model types often don't really reflect the flows and collaboration structures of actual office procedures. This leads to the perception of business architecture departments as ivory towers with no, or weak, links to the realities of the organization. On the other side, the tools and models of IT enterprise architecture and software architecture speak a language not understood by the members of business departments. Unified Modeling Language (UML) is the most prominent set of model types that has remained stuck in IT.
However, while UML class and activity diagrams promised to be a valid way to depict the nouns and stories of business processes, their potential to provide a shared language and a joint vocabulary and view on requirements rarely materialized. Until now, there has been no workflow tool vendor approaching these strategic, enterprise-level models and bridging the business/IT gap.
Oracle BPM Suite 12c tackles Game Enterprise BPM
With BPM Suite 12c, Oracle is starting to engage in this domain. The approach Oracle took can be summarized as applying the Pareto principle: 80 percent of the features needed for strategic enterprise modeling can be found in just 20 percent of the functionality of those high-end, enterprise-level modeling tools. So, Oracle implemented this 20 percent as business architecture models:
Enterprise maps to define the organizational and application context
Value chains to establish a root for process hierarchies
The strategy model to depict focus areas and assign technical capabilities and optimization strategies
The following figure represents the new features in Oracle BPM Suite 12c in the context of the Game Enterprise BPM methodology:
Figure 4: Oracle BPM Suite 12c new features in the context of the Game Enterprise BPM methodology
The preceding figure is based on the BPTrends Process Change Methodology introduced in the book Business Process Change by Paul Harmon. These new features promise to create a link from higher-level process models and other strategic depictions to executable processes. This link could not be established until now, since there were too many interface mismatches between enterprise tooling and workflow automation engines. The new model types in Oracle BPM Suite 12c are, as discussed, a subset of all the features and model types in the EA tools. For this subset, these interface mismatches have been made obsolete: there is a clear trace, with no tool disruption, from the enterprise map to the value chain and associated KPIs down to the BPMN processes that are automated. These features have existed in EA and BPM tools before; what is new is this undisrupted trace.
Figure 5: Undisrupted trace from the business architecture to executable processes
The preceding figure is based on the BPTrends Process Change Methodology introduced in the book Business Process Change by Paul Harmon. This holistic set of tools, which brings together aspects of modeling time, design time, and runtime, makes it more likely that the business/IT gap will finally be bridged.
Figure 6: Tighter links from business and strategy to executable software and processes
Today, we do not live in a perfect world. To understand to what extent this gap has been closed, it helps to look at how people work. If there is a tool that depicts enterprise strategy, end-to-end business processes, business capabilities, and KPIs, that is used in daily work, and that offers the means to navigate to lower-level models, then we have come quite far. The features of Oracle BPM Suite 12c, which are discussed below, are a step in this direction but are not the end of the journey.
Using business architect features
The Process Composer is a business-user-friendly web application. From the login page, it guides the user in the spirit of a business architecture methodology. All the models that we create are part of a "space". On its entry page, the main structure is divided into business architecture models in BA projects, which are described in this article.
It is a good practice to start with an enterprise map that depicts the business capabilities of our organization. The rationale is that the functions the organization is made up of tend to be more stable and less a matter of interpretation and perspective than any business process view. Enterprise maps can be used to put those value chains and process models into the organizational and application landscape context. Oracle suggests organizing the business capabilities into three segments; thus, the default enterprise map model is prepopulated with three default lanes: core, management, and support. This structure works for many organizations, but it is up to you to either use these lanes or create your own. Within each lane, we can then define the key business capabilities that make up the core business processes.
Figure 7: Enterprise map of RYLC depicting key business capabilities
Properties of BA models
Each element (goal, objective, strategy) within the model can be enriched with business properties, such as actual cost, actual time, proposed cost, and proposed time. These properties are part of the impact analysis report that can be generated to analyze the BA project.
Figure 8: Use properties to specify SLAs and other BA characteristics
Depicting organizational units
Within RYLC as an organization, we now depict its departments as organizational units. We can attach goals to each of the units, which depict its function and the role it plays in the concert of the overall ecosystem. This association of a unit with a goal is expressed via links to goals defined in the strategy model. These goals will be used in the impact analysis reports that show the effect of organizational or procedural changes. It is possible to create several organizational units, as shown in the following screenshot:
Figure 9: Define a set of organizational units
Value chains
A value chain model forms the root of a process hierarchy. A value chain consists of one direct line of steps, with no gateways and no exceptions. The modeler in Oracle BPM Suite allows each step in the chain to depict associated business goals and key performance indicators (KPIs) that can be used to measure the organization's performance rather than the details of departmental performance. The value chain model is a very simple one, depicting the flow of the most basic business process steps. Each step is a business process in its own right. At the level of the chain, no decision making is expressed. This resembles a business process expressed in BPMN that has only direct lines and no gateways.
Figure 10: Creation of a new value chain model called "Rental Request-to-Delivery"
The value chain model type allows you to structure your business processes into a hierarchy, with a value chain forming the topmost level.
Strategy models
Strategy models can be used to further motivate the KPIs that are depicted at the value chain level.
Figure 11: Building the strategy model for the strategy "Become a market leader"
These visual maps leverage existing process documentation and match it with current business trends and existing reference models. These models depict processes and associated organizational aspects, encompassing cross-departmental views on higher levels and concrete processes down to level 3. From there, they prioritize distinct processes and decide on one of several modernization strategies—process automation being just one of several!
So, in the Define strategy and performance measures (KPIs) phase of the proposed methodology shown in the following diagram, the team selects, for each business process or even subprocess, one or several means to improve process efficiency or transparency. Typically, these means are defined as "supporting capabilities". These are a technique, a tool, or an approach that helps to modernize a business process. A few typical supporting capabilities are mentioned here:
Explicit process automation
Implicit process handling and automation inside a packaged application, such as SAP ERP or Oracle Siebel
Refactoring of an existing COTS application
Business process outsourcing
Business process retirement
The way toward establishing these supporting capabilities is defined through a list of potential modernization strategies. Several of the modernization strategies relate to the existing applications that support a business process, such as refactoring, replacement, retirement, or re-interfacing of the respective supporting applications. The modernization strategy that we are most interested in within this article is establish explicit automated process. It is a best practice to create a high-level process map and define, for each of the processes, whether to leave it as it is or to apply one of several modernization strategies. When several modernization strategies are found for one process, we can drill down into the process hierarchy and stop at the level at which each subprocess has a single, distinct modernization strategy.
Figure 12: Business process automation as just one of several process optimization strategies
The preceding figure is based on the BPTrends Process Change Methodology introduced in the book Business Process Change by Paul Harmon. Again, just one of these technical capabilities in strategic BPM is the topic of this article: process automation. It is not feasible to suggest automating all the business processes of any organization.
Key performance indicators
Within a BA project (strategy and value chain models), there are three different types of KPIs that can be defined:
Manual KPI: This allows us to enter a known value
Rollup KPI: This evaluates an aggregate of its child KPIs
External KPI: This provides a way to include KPI data from applications other than BPM Suite, such as SAP, E-Business Suite, PeopleSoft, and so on
Additionally, KPIs can be defined at the BPMN process level, which is not covered in this article.
KPIs at the value chain step level
The following are the steps to configure KPIs at the value chain step level:
Open the value chain model Rental Request-to-Delivery.
Right-click on the Vehicle Reservation & Allocation chain step, and select KPI.
Click on the + (plus) sign to create a manual KPI, as illustrated in the next screenshot.
The following image shows the configuration of the KPIs:
Figure 13: Configuring a KPI
Why we need a new methodology for Game Enterprise BPM
Now, Game Enterprise BPM needs to be played everywhere. This implies that Game Silo BPM needs to diminish, meaning it needs to be replaced gradually, through managed evolution, league by league, aiming for excellence at Champions League level. We can't play Game Enterprise BPM with the same culture of ad hoc, joyful creativity that we find in Game Silo BPM. We can't just approach our colleague—let's call him Ingo Maier—who we know has drawn the model for a process we are interested in; we can't simply walk over to his desk and ask him about the meaning of an unclear section in the process.
That is because in Game Enterprise BPM, Ingo Maier, as a person whom we know as part of our silo team, does not exist anymore. We deal with process models, with SOA services, and with a language defined somewhere else, in another department. This is what makes it so hard to move up in the BPM leagues. Hiding behind the buzzword "agile" does not help. In order to raise BPM maturity when moving from Game Silo BPM to Game Enterprise BPM, the organization needs to establish a set of standards, guidelines, tools, and modes of operation that allow it to play and succeed at Champions League level. Additionally, we have to define the modes of operation and the broad steps that lead to a desired state. This formalization of collaboration in teams should be described, agreed on, and lived as our BPM methodology. The methodology strives for a team in which each player contributes to one coherent game along well-defined phases.
Political change through Game Enterprise BPM
The political problem with a cross-departmental process view becomes apparent if we look at the way organizations distribute political power in Game Silo BPM. The heads of departments form the most potent management layer, while the end-to-end business process has no potent stakeholder. Thus, it is critical for any Game Enterprise BPM effort to establish a good balance of de facto power, with process owners acting as stakeholders for the cross-departmental process view. This fills the void of end-to-end process owners left by Game Silo BPM. With Game Enterprise BPM, the focus shifts from departmental improvements to the KPIs and improvement of the core business processes.
Pair modeling the value chains and business processes
Value chains and process models down to a still fairly high level, such as level 3, can be modeled by process experts without involving technically skilled people. They should omit all technical details. To provide the foundation for automated processes, we need to add more domain knowledge and some technical detail. Therefore, these domain process experts meet with BPM tool experts to jointly define the next level of detail in BPMN. In analogy to the practice of pair programming in agile methodologies, you could call this kind of collaboration pair modeling. Ideally, the process expert(s) and the tool expert look at the same screen and discuss how to improve the flow of the process model, while the visual representation evolves to cover variants, exceptions, and a better understanding of the involved business objects. For many organizations that are used to a waterfall process, this is a fundamentally new way of requirements gathering that might be a challenge for some. The practice is analogous to the on-site customer practice in agile methodologies. This new way of close collaboration for process modeling is crucial for the success of BPM projects, since it allows us to establish a deep and shared understanding in a very pragmatic and productive way.
Figure 14: Roles and successful modes of collaboration
When the process is modeled in sufficient detail to clearly depict an algorithmic definition of its flow and all its variants, the model can be handed over to BPM developers. They add all the technical bells and whistles, such as data mappings, decision rules, service calls, and exception handling. Portal developers will work on their implementation of the use cases. SOA developers will use Oracle SOA Suite to integrate with backend systems, thereby implementing the SOA services; a minimal sketch of what such a service call looks like at the BPMN level follows below.
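As an illustration, the following is a minimal, hypothetical sketch of what a backend service call with basic exception handling can look like once these technical details have been added to the BPMN 2.0 model. The service, operation, and message names (VehicleReservationService, reserveVehicle, ReservationRequest) are assumptions for the RYLC example, not artifacts generated by Oracle's tooling; in a real project, they would be bound to concrete SOA Suite composites and WSDL definitions:

<definitions xmlns="http://www.omg.org/spec/BPMN/20100524/MODEL"
             id="ReservationDefinitions"
             targetNamespace="http://www.example.com/rylc/bpm">
  <!-- hypothetical backend service, assumed to be exposed via Oracle SOA Suite -->
  <message id="reservationRequestMessage" name="ReservationRequest"/>
  <interface id="reservationInterface" name="VehicleReservationService">
    <operation id="reserveVehicleOp" name="reserveVehicle">
      <inMessageRef>reservationRequestMessage</inMessageRef>
    </operation>
  </interface>
  <process id="VehicleReservation" name="Vehicle Reservation and Allocation" isExecutable="true">
    <startEvent id="reservationRequested" name="Reservation requested"/>
    <serviceTask id="reserveVehicle" name="Reserve vehicle"
                 implementation="##WebService" operationRef="reserveVehicleOp"/>
    <!-- boundary error event catches faults raised by the service call -->
    <boundaryEvent id="reservationFault" attachedToRef="reserveVehicle">
      <errorEventDefinition/>
    </boundaryEvent>
    <userTask id="handleFault" name="Handle reservation error"/>
    <endEvent id="vehicleReserved" name="Vehicle reserved"/>
    <endEvent id="reservationFailed" name="Reservation failed"/>
    <sequenceFlow id="s1" sourceRef="reservationRequested" targetRef="reserveVehicle"/>
    <sequenceFlow id="s2" sourceRef="reserveVehicle" targetRef="vehicleReserved"/>
    <sequenceFlow id="s3" sourceRef="reservationFault" targetRef="handleFault"/>
    <sequenceFlow id="s4" sourceRef="handleFault" targetRef="reservationFailed"/>
  </process>
</definitions>

This is the level at which the pair-modeled business view and the developers' technical additions meet: the flow itself stays recognizable to the process experts, while data mappings, fault policies, and the binding of the service task to the actual backend service are layered on top by the development team.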
The discussed notion of a handover from higher-level business process models to development teams can also be used to mark the line at which it might make sense to outsource parts of the overall development.
Summary
In this article, we saw how BPM, as an approach to model, automate, and optimize business processes, is typically applied at a departmental level. We saw how BPM Suite 12c introduces new features that allow us to cross the bridge towards top-down, cross-departmental, enterprise-level BPM. We depicted the key characteristics of the enterprise BPM methodology, which aligns corporate or strategic activities with actual process automation projects. We learned the importance of modeling standards and guidelines, which should be used to gain business process insight and understanding broadly throughout the enterprise. The goal is to establish a shared language to talk about the capabilities and processes of the overall organization and the services it provides to its customers. We also examined the role of data in the SOA that supports business processes: a critical success factor is the definition of the business data model that, along with services, forms the connection between the process layer, the user interface layer, and the services layer. Finally, we understood how important it is to separate application logic from service logic and process logic to ensure that the benefits of a process-driven architecture are realized.
Resources for Article:
Further resources on this subject:
Introduction to Oracle BPM [article]
Oracle B2B Overview [article]
Event-driven BPEL Process [article]