How-To Tutorials - Programming

1081 Articles
Java Multithreading: How to synchronize threads to implement critical sections and avoid race conditions

Fatema Patrawala
30 May 2018
13 min read
One of the most common situations in concurrent programming occurs when more than one execution thread shares a resource. In a concurrent application, it is normal for multiple threads to read or write the same data structure or have access to the same file or database connection. These shared resources can provoke error situations or data inconsistency, and we have to implement some mechanism to avoid these errors. These situations are called race conditions, and they occur when different threads have access to the same shared resource at the same time. Therefore, the final result depends on the order of execution of the threads, and most of the time it is incorrect. You can also have problems with change visibility: if a thread changes the value of a shared variable, the change may only be written to the local cache of that thread, so other threads will not see it (they will only see the old value).

This Java multithreading tutorial is taken from the book Java 9 Concurrency Cookbook - Second Edition, written by Javier Fernández González.

The solution to these problems lies in the concept of a critical section. A critical section is a block of code that accesses a shared resource and can't be executed by more than one thread at the same time. To help programmers implement critical sections, Java (like almost all programming languages) offers synchronization mechanisms. When a thread wants to access a critical section, it uses one of these mechanisms to find out whether any other thread is executing the critical section. If not, the thread enters the critical section. If yes, the thread is suspended by the synchronization mechanism until the thread currently executing the critical section finishes it. When more than one thread is waiting for a thread to finish the execution of a critical section, the JVM chooses one of them and the rest wait for their turn.

The Java language offers two basic synchronization mechanisms:

The synchronized keyword
The Lock interface and its implementations

In this article, we explore the use of the synchronized keyword to implement synchronization in Java. So let's get started.

Synchronizing a method

In this recipe, you will learn how to use one of the most basic methods of synchronization in Java, that is, the use of the synchronized keyword to control concurrent access to a method or a block of code. All synchronized statements (used on methods or blocks of code) use an object reference. Only one thread can execute a method or block of code protected by the same object reference.

When you use the synchronized keyword with a method, the object reference is implicit. When you use the synchronized keyword in one or more methods of an object, only one execution thread will have access to all these methods. If another thread tries to access any method declared with the synchronized keyword on the same object, it will be suspended until the first thread finishes the execution of the method. In other words, every method declared with the synchronized keyword is a critical section, and Java only allows the execution of one of the critical sections of an object at a time. In this case, the object reference used is the object itself, represented by the this keyword.
Static methods have a different behavior. Only one execution thread will have access to one of the static methods declared with the synchronized keyword, but a different thread can access other non-static methods of an object of that class. You have to be very careful with this point, because two threads can access two different synchronized methods if one is static and the other is not. If both methods change the same data, you can have data inconsistency errors. In this case, the object reference used is the class object.

When you use the synchronized keyword to protect a block of code, you must pass an object reference as a parameter. Normally, you will use the this keyword to reference the object that executes the method, but you can use other object references as well. Normally, these objects will be created exclusively for this purpose. You should keep the objects used for synchronization private. For example, if you have two independent attributes in a class shared by multiple threads, you must synchronize access to each variable; however, it wouldn't be a problem if one thread is accessing one of the attributes and another is accessing a different attribute at the same time. Take into account that if you use the object itself (represented by the this keyword), you might interfere with other synchronized code (as mentioned before, the this object is used to synchronize the methods marked with the synchronized keyword).

In this recipe, you will learn how to use the synchronized keyword to implement an application simulating a parking area, with sensors that detect when a car or a motorcycle enters or leaves the parking area, an object to store the statistics of the vehicles being parked, and a mechanism to control cash flow. We will implement two versions: one without any synchronization mechanism, where we will see how we obtain incorrect results, and one that works correctly because it uses the two variants of the synchronized keyword.

The example of this recipe has been implemented using the Eclipse IDE. If you use Eclipse or a different IDE, such as NetBeans, open it and create a new Java project.

How to do it...

Follow these steps to implement the example:

First, create the application without using any synchronization mechanism. Create a class named ParkingCash with an internal constant and an attribute to store the total amount of money earned by providing this parking service:

public class ParkingCash {
  private static final int cost = 2;
  private long cash;

  public ParkingCash() {
    cash = 0;
  }

Implement a method named vehiclePay() that will be called when a vehicle (a car or motorcycle) leaves the parking area. It will increase the cash attribute:

  public void vehiclePay() {
    cash += cost;
  }

Finally, implement a method named close() that will write the value of the cash attribute to the console and reinitialize it to zero:

  public void close() {
    System.out.printf("Closing accounting");
    long totalAmmount;
    totalAmmount = cash;
    cash = 0;
    System.out.printf("The total amount is : %d", totalAmmount);
  }
}

Create a class named ParkingStats with three private attributes and the constructor that will initialize them:

public class ParkingStats {
  private long numberCars;
  private long numberMotorcycles;
  private ParkingCash cash;

  public ParkingStats(ParkingCash cash) {
    numberCars = 0;
    numberMotorcycles = 0;
    this.cash = cash;
  }

Then, implement the methods that will be executed when a car or motorcycle enters or leaves the parking area.
When a vehicle leaves the parking area, cash should be incremented:

  public void carComeIn() {
    numberCars++;
  }

  public void carGoOut() {
    numberCars--;
    cash.vehiclePay();
  }

  public void motoComeIn() {
    numberMotorcycles++;
  }

  public void motoGoOut() {
    numberMotorcycles--;
    cash.vehiclePay();
  }

Finally, implement two methods to obtain the number of cars and motorcycles in the parking area, respectively.

Create a class named Sensor that will simulate the movement of vehicles in the parking area. It implements the Runnable interface and has a ParkingStats attribute, which will be initialized in the constructor:

public class Sensor implements Runnable {
  private ParkingStats stats;

  public Sensor(ParkingStats stats) {
    this.stats = stats;
  }

Implement the run() method. In this method, simulate that two cars and a motorcycle arrive at and then leave the parking area. Every sensor will perform this action 10 times:

  @Override
  public void run() {
    for (int i = 0; i < 10; i++) {
      stats.carComeIn();
      stats.carComeIn();
      try {
        TimeUnit.MILLISECONDS.sleep(50);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }
      stats.motoComeIn();
      try {
        TimeUnit.MILLISECONDS.sleep(50);
      } catch (InterruptedException e) {
        e.printStackTrace();
      }
      stats.motoGoOut();
      stats.carGoOut();
      stats.carGoOut();
    }
  }

Finally, implement the main method. Create a class named Main with the main() method. It needs ParkingCash and ParkingStats objects to manage the parking:

public class Main {
  public static void main(String[] args) {
    ParkingCash cash = new ParkingCash();
    ParkingStats stats = new ParkingStats(cash);
    System.out.printf("Parking Simulator\n");

Then, create the Sensor tasks. Use the availableProcessors() method (which returns the number of processors available to the JVM, normally equal to the number of cores in the processor) to calculate the number of sensors our parking area will have. Create the corresponding Thread objects and store them in an array:

    int numberSensors = 2 * Runtime.getRuntime().availableProcessors();
    Thread threads[] = new Thread[numberSensors];
    for (int i = 0; i < numberSensors; i++) {
      Sensor sensor = new Sensor(stats);
      Thread thread = new Thread(sensor);
      thread.start();
      threads[i] = thread;
    }

Then wait for the finalization of the threads using the join() method:

    for (int i = 0; i < numberSensors; i++) {
      try {
        threads[i].join();
      } catch (InterruptedException e) {
        e.printStackTrace();
      }
    }

Finally, write the statistics of the parking area:

    System.out.printf("Number of cars: %d\n", stats.getNumberCars());
    System.out.printf("Number of motorcycles: %d\n", stats.getNumberMotorcycles());
    cash.close();
  }
}

In our case, we executed the example on a four-core processor, so we will have eight Sensor tasks. Each task performs 10 iterations, and in each iteration three vehicles enter the parking area and the same three vehicles go out. Therefore, each Sensor task will simulate 30 vehicles. If everything goes well, the final stats will show the following:

There are no cars in the parking area, which means that all the vehicles that came into the parking area have left.
Eight Sensor tasks were executed, where each task simulated 30 vehicles and each vehicle was charged 2 dollars; therefore, the total amount of cash earned was 480 dollars.

When you execute this example, each time you will obtain different results, and most of them will be incorrect. We had race conditions, and the different shared variables accessed by all the threads gave incorrect results.
Let's modify the previous code using the synchronized keyword to solve these problems:

First, add the synchronized keyword to the vehiclePay() method of the ParkingCash class:

  public synchronized void vehiclePay() {
    cash += cost;
  }

Then, add a synchronized block of code using the this keyword to the close() method:

  public void close() {
    System.out.printf("Closing accounting");
    long totalAmmount;
    synchronized (this) {
      totalAmmount = cash;
      cash = 0;
    }
    System.out.printf("The total amount is : %d", totalAmmount);
  }

Now add two new attributes to the ParkingStats class and initialize them in the constructor of the class:

  private final Object controlCars, controlMotorcycles;

  public ParkingStats(ParkingCash cash) {
    numberCars = 0;
    numberMotorcycles = 0;
    controlCars = new Object();
    controlMotorcycles = new Object();
    this.cash = cash;
  }

Finally, modify the methods that increment and decrement the number of cars and motorcycles, adding the synchronized keyword. The numberCars attribute will be protected by the controlCars object, and the numberMotorcycles attribute will be protected by the controlMotorcycles object. You must also synchronize the getNumberCars() and getNumberMotorcycles() methods with the associated reference object:

  public void carComeIn() {
    synchronized (controlCars) {
      numberCars++;
    }
  }

  public void carGoOut() {
    synchronized (controlCars) {
      numberCars--;
    }
    cash.vehiclePay();
  }

  public void motoComeIn() {
    synchronized (controlMotorcycles) {
      numberMotorcycles++;
    }
  }

  public void motoGoOut() {
    synchronized (controlMotorcycles) {
      numberMotorcycles--;
    }
    cash.vehiclePay();
  }

Execute the example now and see the difference compared to the previous version.

How it works...

No matter how many times you execute the new version of the example, you will always obtain the correct result. Let's see the different uses of the synchronized keyword in the example:

First, we protected the vehiclePay() method. If two or more Sensor tasks call this method at the same time, only one will execute it and the rest will wait for their turn; therefore, the final amount will always be correct.

We used two different objects to control access to the car and motorcycle counters. This way, one Sensor task can modify the numberCars attribute and another Sensor task can modify the numberMotorcycles attribute at the same time; however, no two Sensor tasks will be able to modify the same attribute at the same time, so the final value of the counters will always be correct.

Finally, we also synchronized the getNumberCars() and getNumberMotorcycles() methods.

Using the synchronized keyword, we can guarantee correct access to shared data in concurrent applications. As mentioned in the introduction of this recipe, only one thread at a time can access the methods of an object that are declared with the synchronized keyword. If thread A is executing a synchronized method and thread B wants to execute another synchronized method of the same object, it will be blocked until thread A is finished. But if thread B has access to different objects of the same class, none of them will be blocked.

When you use the synchronized keyword to protect a block of code, you use an object as a parameter. The JVM guarantees that only one thread can have access to all the blocks of code protected with this object (note that we always talk about objects, not classes).

We used the TimeUnit class as well.
The TimeUnit class is an enumeration with the following constants: DAYS, HOURS, MICROSECONDS, MILLISECONDS, MINUTES, NANOSECONDS, and SECONDS. These indicate the units of time we pass to the sleep() method. In our case, we let the thread sleep for 50 milliseconds.

There's more...

The synchronized keyword penalizes the performance of the application, so you must only use it on methods that modify shared data in a concurrent environment. If you have multiple threads calling a synchronized method, only one will execute it at a time while the others remain waiting. If the operation doesn't use the synchronized keyword, all the threads can execute it at the same time, reducing the total execution time. If you know that a method will not be called by more than one thread, don't use the synchronized keyword. Anyway, if the class is designed for multithreaded access, it should always be correct; you must promote correctness over performance. Also, you should document methods and classes with respect to their thread safety.

You can use recursive calls with synchronized methods. Because the thread already has access to the synchronized methods of an object, it can call other synchronized methods of that object, including the method that is being executed, without having to acquire access to the synchronized methods again.

We can use the synchronized keyword to protect access to a block of code instead of an entire method. We should use the synchronized keyword in this way to protect access to shared data, leaving the rest of the operations out of the block and thereby obtaining better performance from the application. The objective is to keep the critical section (the block of code that can be accessed by only one thread at a time) as short as possible. Also, avoid calling blocking operations (for example, I/O operations) inside a critical section. In this recipe, we used the synchronized keyword to protect access only to the instructions that update the shared counters, leaving the operations that don't use shared data out of the block. When you use the synchronized keyword in this way, you must pass an object reference as a parameter. Only one thread can access the synchronized code (blocks or methods) of this object. Normally, we will use the this keyword to reference the object that is executing the method:

synchronized (this) {
  // Java code
}

To summarize, we learned how to use the synchronized keyword for multithreading in Java to implement synchronization. You read an excerpt from the book Java 9 Concurrency Cookbook - Second Edition. This book will help you master the art of fast, effective Java development with the power of concurrent and parallel programming.

Concurrency programming 101: Why do programmers hang by a thread?
How to create multithreaded applications in Qt
Getting Inside a C++ Multithreaded Application

Python Design Patterns in Depth: The Singleton Pattern

Packt
15 Feb 2016
14 min read
There are situations where you need to create only one instance of data throughout the lifetime of a program. This can be a class instance, a list, or a dictionary, for example. The creation of a second instance is undesirable, as it can result in logical errors or malfunctioning of the program. The design pattern that allows you to create only one instance of data is called singleton. In this article, you will learn about module-level, classic, and borg singletons; you'll also learn how they work and when to use them, and you'll build a two-threaded web crawler that uses a singleton to access the shared resource.

Singleton is the best candidate when the requirements are as follows:

Controlling concurrent access to a shared resource
If you need a global point of access for the resource from multiple or different parts of the system
When you need to have only one object

Some typical use cases of a singleton are:

The logging class and its subclasses (a global point of access for the logging class to send messages to the log)
Printer spooler (your application should only have a single instance of the spooler in order to avoid conflicting requests for the same resource)
Managing a connection to a database
File manager
Retrieving and storing information in external configuration files
Read-only singletons storing some global state (user language, time, time zone, application path, and so on)

There are several ways to implement singletons. We will look at the module-level singleton, classic singletons, and the borg singleton.

Module-level singleton

All modules are singletons by nature because of Python's module importing steps:

Check whether the module is already imported.
If yes, return it.
If not, find the module, initialize it, and return it.

Initializing a module means executing its code, including all module-level assignments. When you import the module for the first time, all of the initializations are done; however, if you try to import the module a second time, Python returns the already initialized module. Thus, the initialization is not done again, and you get the previously imported module with all of its data. So, if you want to quickly make a singleton, use the following steps and keep the shared data as module attributes.

singletone.py:

only_one_var = "I'm only one var"

module1.py:

import singletone

print singletone.only_one_var
singletone.only_one_var += " after modification"
import module2

module2.py:

import singletone

print singletone.only_one_var

Here, if you import a global variable from the singleton module and change its value in the module1 module, module2 will get the changed variable. This approach is quick and sometimes is all that you need; however, we need to consider the following points:

It's pretty error-prone. For example, if you happen to forget the global statements, variables local to the function will be created and the module's variables won't be changed, which is not what you want (see the sketch after this list).
It's ugly, especially if you have a lot of objects that should remain as singletons.
They pollute the module namespace with unnecessary variables.
They don't permit lazy allocation and initialization; all global variables will be loaded during the module import process.
It's not possible to reuse the code because you cannot use inheritance.
There are no special methods and no object-oriented programming benefits at all.
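As a minimal sketch of that first pitfall, consider the following; the module and function names are hypothetical and not part of the recipe:

# settings.py - a hypothetical module used as a module-level singleton
language = "en"

def set_language_wrong(value):
    language = value      # binds a new *local* name; settings.language stays "en"

def set_language(value):
    global language       # required to rebind the module-level variable
    language = value

# main.py
import settings

settings.set_language_wrong("de")
print(settings.language)  # still "en"
settings.set_language("de")
print(settings.language)  # now "de"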
Classic singleton

In the classic singleton in Python, we check whether an instance is already created. If it is created, we return it; otherwise, we create a new instance, assign it to a class attribute, and return it. Let's try to create a dedicated singleton class:

class Singleton(object):
    def __new__(cls):
        if not hasattr(cls, 'instance'):
            cls.instance = super(Singleton, cls).__new__(cls)
        return cls.instance

Here, we override the special __new__ method, which is called right before __init__, and check whether an instance has already been created. If not, we create a new instance; otherwise, we return the already created instance. Let's check how it works:

>>> singleton = Singleton()
>>> another_singleton = Singleton()
>>> singleton is another_singleton
True
>>> singleton.only_one_var = "I'm only one var"
>>> another_singleton.only_one_var
"I'm only one var"

Try to subclass the Singleton class with another one:

class Child(Singleton):
    pass

If it's a successor of Singleton, all of its instances should also be instances of Singleton, thus sharing its state. But this doesn't work, as illustrated in the following code:

>>> child = Child()
>>> child is singleton
False
>>> child.only_one_var
AttributeError: Child instance has no attribute 'only_one_var'

To avoid this situation, the borg singleton is used.

Borg singleton

Borg is also known as monostate. In the borg pattern, all of the instances are different, but they share the same state. In the following code, the shared state is maintained in the _shared_state attribute, and all new instances of the Borg class will get this state, as defined in its __new__ method:

class Borg(object):
    _shared_state = {}

    def __new__(cls, *args, **kwargs):
        obj = super(Borg, cls).__new__(cls, *args, **kwargs)
        obj.__dict__ = cls._shared_state
        return obj

Generally, Python stores the instance state in the __dict__ dictionary, and when instantiated normally, every instance gets its own __dict__. Here, however, we deliberately assign the class variable _shared_state to all of the created instances. Here is how it works with subclassing:

class Child(Borg):
    pass

>>> borg = Borg()
>>> another_borg = Borg()
>>> borg is another_borg
False
>>> child = Child()
>>> borg.only_one_var = "I'm the only one var"
>>> child.only_one_var
"I'm the only one var"

So, despite the fact that you can't compare objects by their identity using the is statement, all child objects share the parents' state. If you want a class that is a descendant of the Borg class but has a different state, you can reset _shared_state as follows:

class AnotherChild(Borg):
    _shared_state = {}

>>> another_child = AnotherChild()
>>> another_child.only_one_var
AttributeError: AnotherChild instance has no attribute 'only_one_var'

Which type of singleton should be used is up to you. If you expect that your singleton will not be inherited, you can choose the classic singleton; otherwise, it's better to stick with borg.
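One caveat before using this pattern from multiple threads: the classic __new__-based singleton above is not inherently thread-safe, since two threads constructing it at nearly the same time could both run the creation branch. A minimal sketch of guarding the check with a lock follows; it is not part of the original recipe, and the class name is illustrative:

import threading

class LockedSingleton(object):
    _lock = threading.Lock()

    def __new__(cls):
        # Serialize the existence check so two threads cannot both create an instance
        with cls._lock:
            if not hasattr(cls, 'instance'):
                cls.instance = super(LockedSingleton, cls).__new__(cls)
        return cls.instance

s1 = LockedSingleton()
s2 = LockedSingleton()
print(s1 is s2)  # True

In the crawler below, the singleton is created and its attributes are set up in the main thread before the worker threads start, so the simpler classic version from the recipe is sufficient there.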
Implementation in Python

As a practical example, we'll create a simple web crawler that scans a website you point it at, follows all the links that lead to other pages of the same website, and downloads all of the images it finds. To do this, we'll need two functions: a function that scans a website for links leading to other pages, in order to build a set of pages to visit, and a function that scans a page for images and downloads them. To make it quicker, we'll download images in two threads. These two threads should not interfere with each other, so a page should not be scanned if another thread has already scanned it, and images that are already downloaded should not be downloaded again. So, a set of downloaded images and scanned web pages will be a shared resource for our application, and we'll keep it in a singleton instance.

In this example, you will need a library for parsing and screen scraping websites named BeautifulSoup and an HTTP client library, httplib2. It should be sufficient to install both with either of the following commands:

$ sudo pip install BeautifulSoup httplib2
$ sudo easy_install BeautifulSoup httplib2

First of all, we'll create a Singleton class. Let's use the classic singleton in this example:

import httplib2
import os
import re
import threading
import urllib
from urlparse import urlparse, urljoin
from BeautifulSoup import BeautifulSoup

class Singleton(object):
    def __new__(cls):
        if not hasattr(cls, 'instance'):
            cls.instance = super(Singleton, cls).__new__(cls)
        return cls.instance

It will return the singleton object to all parts of the code that request it.

Next, we'll create a class for creating a thread. In this thread, we'll download images from the website:

class ImageDownloaderThread(threading.Thread):
    """A thread for downloading images in parallel."""
    def __init__(self, thread_id, name, counter):
        threading.Thread.__init__(self)
        self.name = name

    def run(self):
        print 'Starting thread ', self.name
        download_images(self.name)
        print 'Finished thread ', self.name

The following function traverses the website using the BFS algorithm, finds links, and adds them to a set for further downloading. We are able to specify the maximum number of links to follow if the website is too large:

def traverse_site(max_links=10):
    link_parser_singleton = Singleton()

    # While we have pages to parse in queue
    while link_parser_singleton.queue_to_parse:
        # If collected enough links to download images, return
        if len(link_parser_singleton.to_visit) == max_links:
            return

        url = link_parser_singleton.queue_to_parse.pop()
        http = httplib2.Http()
        try:
            status, response = http.request(url)
        except Exception:
            continue

        # Skip if not a web page
        if status.get('content-type') != 'text/html':
            continue

        # Add the link to queue for downloading images
        link_parser_singleton.to_visit.add(url)
        print 'Added', url, 'to queue'

        bs = BeautifulSoup(response)

        for link in BeautifulSoup.findAll(bs, 'a'):
            link_url = link.get('href')
            # <a> tag may not contain href attribute
            if not link_url:
                continue

            parsed = urlparse(link_url)
            # If link follows to external webpage, skip it
            if parsed.netloc and parsed.netloc != parsed_root.netloc:
                continue

            # Construct a full url from a link which can be relative
            link_url = (parsed.scheme or parsed_root.scheme) + '://' + (parsed.netloc or parsed_root.netloc) + parsed.path or ''

            # If link was added previously, skip it
            if link_url in link_parser_singleton.to_visit:
                continue

            # Add a link for further parsing
            link_parser_singleton.queue_to_parse = [link_url] + link_parser_singleton.queue_to_parse

The following function downloads images from the last web page in the singleton.to_visit queue and saves them to the images directory.
Here, we use a singleton for synchronizing shared data, which is a set of pages to visit, between the two threads:

def download_images(thread_name):
    singleton = Singleton()

    # While we have pages where we have not downloaded images
    while singleton.to_visit:
        url = singleton.to_visit.pop()
        http = httplib2.Http()
        print thread_name, 'Starting downloading images from', url

        try:
            status, response = http.request(url)
        except Exception:
            continue

        bs = BeautifulSoup(response)

        # Find all <img> tags
        images = BeautifulSoup.findAll(bs, 'img')

        for image in images:
            # Get image source url which can be absolute or relative
            src = image.get('src')

            # Construct a full url. If the image url is relative,
            # it will be prepended with webpage domain.
            # If the image url is absolute, it will remain as is
            src = urljoin(url, src)

            # Get a base name, for example 'image.png', to name the file locally
            basename = os.path.basename(src)

            if src not in singleton.downloaded:
                singleton.downloaded.add(src)
                print 'Downloading', src
                # Download image to local filesystem
                urllib.urlretrieve(src, os.path.join('images', basename))

        print thread_name, 'finished downloading images from', url

Our client code is as follows:

if __name__ == '__main__':
    root = 'http://python.org'
    parsed_root = urlparse(root)

    singleton = Singleton()
    singleton.queue_to_parse = [root]
    # A set of urls to download images from
    singleton.to_visit = set()
    # Downloaded images
    singleton.downloaded = set()

    traverse_site()

    # Create images directory if not exists
    if not os.path.exists('images'):
        os.makedirs('images')

    # Create new threads
    thread1 = ImageDownloaderThread(1, "Thread-1", 1)
    thread2 = ImageDownloaderThread(2, "Thread-2", 2)

    # Start new Threads
    thread1.start()
    thread2.start()

Run the crawler using the following command:

$ python crawler.py

Your output may vary because the order in which the threads access resources is not predictable. If you go to the images directory, you will find the downloaded images there.

Summary

To learn more about design patterns in depth, the following books published by Packt Publishing (https://www.packtpub.com/) are recommended:

Learning Python Design Patterns - Second Edition (https://www.packtpub.com/application-development/learning-python-design-patterns-second-edition)
Mastering Python Design Patterns (https://www.packtpub.com/application-development/mastering-python-design-patterns)

Resources for Article:

Further resources on this subject:

Python Design Patterns in Depth: The Factory Pattern [Article]
Recommending Movies at Scale (Python) [Article]
Customizing IPython [Article]

Task Execution with Asio

Packt
11 Aug 2015
20 min read
In this article by Arindam Mukherjee, the author of Learning Boost C++ Libraries, we learn how to execute a task using Boost Asio (pronounced ay-see-oh), a portable library for performing efficient network I/O using a consistent programming model.

At its core, Boost Asio provides a task execution framework that you can use to perform operations of any kind. You create your tasks as function objects and post them to a task queue maintained by Boost Asio. You enlist one or more threads to pick up these tasks (function objects) and invoke them. The threads keep picking up tasks, one after the other, until the task queues are empty, at which point the threads do not block but exit.

IO Service, queues, and handlers

At the heart of Asio is the type boost::asio::io_service. A program uses the io_service interface to perform network I/O and manage tasks. Any program that wants to use the Asio library creates at least one instance of io_service, and sometimes more than one. In this section, we will explore the task management capabilities of io_service. Here is the IO Service in action, using the obligatory "hello world" example:

Listing 11.1: Asio Hello World

 1 #include <boost/asio.hpp>
 2 #include <iostream>
 3 namespace asio = boost::asio;
 4
 5 int main() {
 6   asio::io_service service;
 7
 8   service.post(
 9     [] {
10       std::cout << "Hello, world!" << '\n';
11     });
12
13   std::cout << "Greetings: \n";
14   service.run();
15 }

We include the convenience header boost/asio.hpp, which includes most of the Asio library that we need for the examples in this article (line 1). All parts of the Asio library are under the namespace boost::asio, so we use a shorter alias for this (line 3). The program itself just prints Hello, world! on the console, but it does so through a task.

The program first creates an instance of io_service (line 6) and posts a function object to it, using the post member function of io_service. The function object, in this case defined using a lambda expression, is referred to as a handler. The call to post adds the handler to a queue inside io_service; some thread (including the one that posted the handler) must dispatch them, that is, remove them from the queue and call them. The call to the run member function of io_service (line 14) does precisely this. It loops through the handlers in the queue inside io_service, removing and calling each handler. In fact, we can post more handlers to the io_service before calling run, and it would call all the posted handlers. If we did not call run, none of the handlers would be dispatched. The run function blocks until all the handlers in the queue have been dispatched and returns only when the queue is empty. By itself, a handler may be thought of as an independent, packaged task, and Boost Asio provides a great mechanism for dispatching arbitrary tasks as handlers. Note that handlers must be nullary function objects, that is, they should take no arguments.

Asio is a header-only library by default, but programs using Asio need to link at least with boost_system. On Linux, we can use the following command line to build this example:

$ g++ -g listing11_1.cpp -o listing11_1 -lboost_system -std=c++11

Running this program prints the following:

Greetings:
Hello, world!

Note that Greetings: is printed from the main function (line 13) before the call to run (line 14). The call to run ends up dispatching the sole handler in the queue, which prints Hello, world!.
It is also possible for multiple threads to call run on the same io_service object and dispatch handlers concurrently. We will see how this can be useful in the next section.

Handler states – run_one, poll, and poll_one

While the run function blocks until there are no more handlers in the queue, other member functions of io_service let you process handlers with greater flexibility. But before we look at these functions, we need to distinguish between pending and ready handlers.

The handlers we posted to the io_service were all ready to run immediately and were invoked as soon as their turn came in the queue. In general, handlers are associated with background tasks that run in the underlying OS, for example, network I/O tasks. Such handlers are meant to be invoked only once the associated task is completed, which is why, in such contexts, they are called completion handlers. These handlers are said to be pending while the associated task is awaiting completion, and once the associated task completes, they are said to be ready.

The poll member function, unlike run, dispatches all the ready handlers but does not wait for any pending handler to become ready. Thus, it returns immediately if there are no ready handlers, even if there are pending handlers. The poll_one member function dispatches exactly one ready handler if there is one, but does not block waiting for pending handlers to get ready. The run_one member function blocks on a nonempty queue waiting for a handler to become ready. It returns when called on an empty queue, and otherwise as soon as it finds and dispatches a ready handler.

post versus dispatch

A call to the post member function adds a handler to the task queue and returns immediately. A later call to run is responsible for dispatching the handler. There is another member function called dispatch that can be used to request the io_service to invoke a handler immediately if possible. If dispatch is invoked in a thread that has already called one of run, poll, run_one, or poll_one, then the handler will be invoked immediately. If no such thread is available, dispatch adds the handler to the queue and returns, just like post would. In the following example, we invoke dispatch from the main function and from within another handler:

Listing 11.2: post versus dispatch

 1 #include <boost/asio.hpp>
 2 #include <iostream>
 3 namespace asio = boost::asio;
 4
 5 int main() {
 6   asio::io_service service;
 7   // Hello Handler – dispatch behaves like post
 8   service.dispatch([]() { std::cout << "Hello\n"; });
 9
10   service.post(
11     [&service] { // English Handler
12       std::cout << "Hello, world!\n";
13       service.dispatch([] { // Spanish Handler, immediate
14                         std::cout << "Hola, mundo!\n";
15                       });
16     });
17   // German Handler
18   service.post([&service] { std::cout << "Hallo, Welt!\n"; });
19   service.run();
20 }

Running this code produces the following output:

Hello
Hello, world!
Hola, mundo!
Hallo, Welt!

The first call to dispatch (line 8) adds a handler to the queue without invoking it, because run was yet to be called on the io_service. We call this the Hello Handler, as it prints Hello. This is followed by two calls to post (lines 10, 18), which add two more handlers. The first of these two handlers prints Hello, world! (line 12) and, in turn, calls dispatch (line 13) to add another handler that prints the Spanish greeting, Hola, mundo! (line 14).
The second of these handlers prints the German greeting, Hallo, Welt! (line 18). For our convenience, let's just call them the English, Spanish, and German handlers. This creates the following entries in the queue:

Hello Handler
English Handler
German Handler

Now, when we call run on the io_service (line 19), the Hello Handler is dispatched first and prints Hello. This is followed by the English Handler, which prints Hello, world! and calls dispatch on the io_service, passing the Spanish Handler. Since this executes in the context of a thread that has already called run, the call to dispatch invokes the Spanish Handler, which prints Hola, mundo!. Following this, the German Handler is dispatched, printing Hallo, Welt! before run returns.

What if the English Handler called post instead of dispatch (line 13)? In that case, the Spanish Handler would not be invoked immediately but would queue up after the German Handler. The German greeting Hallo, Welt! would precede the Spanish greeting Hola, mundo!. The output would look like this:

Hello
Hello, world!
Hallo, Welt!
Hola, mundo!

Concurrent execution via thread pools

The io_service object is thread-safe, and multiple threads can call run on it concurrently. If there are multiple handlers in the queue, they can be processed concurrently by such threads. In effect, the set of threads that call run on a given io_service forms a thread pool. Successive handlers can be processed by different threads in the pool. Which thread dispatches a given handler is indeterminate, so the handler code should not make any such assumptions. In the following example, we post a bunch of handlers to the io_service and then start four threads, which all call run on it:

Listing 11.3: Simple thread pools

 1 #include <boost/asio.hpp>
 2 #include <boost/thread.hpp>
 3 #include <boost/date_time.hpp>
 4 #include <iostream>
 5 namespace asio = boost::asio;
 6
 7 #define PRINT_ARGS(msg) do {\
 8   boost::lock_guard<boost::mutex> lg(mtx); \
 9   std::cout << '[' << boost::this_thread::get_id() \
10             << "] " << msg << std::endl; \
11 } while (0)
12
13 int main() {
14   asio::io_service service;
15   boost::mutex mtx;
16
17   for (int i = 0; i < 20; ++i) {
18     service.post([i, &mtx]() {
19                         PRINT_ARGS("Handler[" << i << "]");
20                         boost::this_thread::sleep(
21                               boost::posix_time::seconds(1));
22                       });
23   }
24
25   boost::thread_group pool;
26   for (int i = 0; i < 4; ++i) {
27     pool.create_thread([&service]() { service.run(); });
28   }
29
30   pool.join_all();
31 }

We post twenty handlers in a loop (lines 17-23). Each handler prints its identifier (line 19) and then sleeps for a second (lines 20-21). To run the handlers, we create a group of four threads, each of which calls run on the io_service (line 27), and wait for all the threads to finish (line 30). We define the macro PRINT_ARGS, which writes output to the console in a thread-safe way, tagged with the current thread ID (lines 7-11). We will use this macro in other examples too.
To build this example, you must also link against libboost_thread, libboost_date_time, and, in POSIX environments, libpthread too:

$ g++ -g listing11_3.cpp -o listing11_3 -lboost_system -lboost_thread -lboost_date_time -pthread -std=c++11

One particular run of this program on my laptop produced the following output (with some lines snipped):

[b5c15b40] Handler[0]
[b6416b40] Handler[1]
[b6c17b40] Handler[2]
[b7418b40] Handler[3]
[b5c15b40] Handler[4]
[b6416b40] Handler[5]
…
[b6c17b40] Handler[13]
[b7418b40] Handler[14]
[b6416b40] Handler[15]
[b5c15b40] Handler[16]
[b6c17b40] Handler[17]
[b7418b40] Handler[18]
[b6416b40] Handler[19]

You can see that the different handlers are executed by different threads (each thread ID is marked differently). If any of the handlers threw an exception, it would be propagated across the call to the run function on the thread that was executing the handler.

io_service::work

Sometimes, it is useful to keep the thread pool started even when there are no handlers to dispatch. Neither run nor run_one blocks on an empty queue. So, in order for them to block waiting for a task, we have to indicate in some way that there is outstanding work to be performed. We do this by creating an instance of io_service::work, as shown in the following example:

Listing 11.4: Using io_service::work to keep threads engaged

 1 #include <boost/asio.hpp>
 2 #include <memory>
 3 #include <boost/thread.hpp>
 4 #include <iostream>
 5 namespace asio = boost::asio;
 6
 7 typedef std::unique_ptr<asio::io_service::work> work_ptr;
 8
 9 #define PRINT_ARGS(msg) do { …
...
14
15 int main() {
16   asio::io_service service;
17   // keep the workers occupied
18   work_ptr work(new asio::io_service::work(service));
19   boost::mutex mtx;
20
21   // set up the worker threads in a thread group
22   boost::thread_group workers;
23   for (int i = 0; i < 3; ++i) {
24     workers.create_thread([&service, &mtx]() {
25                         PRINT_ARGS("Starting worker thread ");
26                         service.run();
27                         PRINT_ARGS("Worker thread done");
28                       });
29   }
30
31   // Post work
32   for (int i = 0; i < 20; ++i) {
33     service.post(
34       [&service, &mtx]() {
35         PRINT_ARGS("Hello, world!");
36         service.post([&mtx]() {
37                           PRINT_ARGS("Hola, mundo!");
38                         });
39       });
40   }
41
42   work.reset(); // destroy work object: signals end of work
43   workers.join_all(); // wait for all worker threads to finish
44 }

In this example, we create an object of io_service::work wrapped in a unique_ptr (line 18). We associate it with an io_service object by passing a reference to the io_service object to the work constructor. Note that, unlike listing 11.3, we create the worker threads first (lines 24-28) and then post the handlers (lines 33-39). However, the worker threads stay put waiting for the handlers because the calls to run (line 26) block. This happens because of the io_service::work object we created, which indicates that there is outstanding work in the io_service queue. As a result, even after all handlers are dispatched, the threads do not exit. By calling reset on the unique_ptr wrapping the work object, its destructor is called, which notifies the io_service that all outstanding work is complete (line 42). The calls to run in the threads return, and the program exits once all the threads are joined (line 43).
We wrapped the work object in a unique_ptr to destroy it in an exception-safe way at a suitable point in the program. We omitted the definition of PRINT_ARGS here; refer to listing 11.3.

Serialized and ordered execution via strands

Thread pools allow handlers to be run concurrently. This means that handlers that access shared resources need to synchronize access to those resources. We already saw examples of this in listings 11.3 and 11.4, when we synchronized access to std::cout, which is a global object. As an alternative to writing synchronization code in handlers, which can make the handler code more complex, we can use strands. Think of a strand as a subsequence of the task queue with the constraint that no two handlers from the same strand ever run concurrently. The scheduling of other handlers in the queue, which are not in the strand, is not affected by the strand in any way. Let us look at an example of using strands:

Listing 11.5: Using strands

 1 #include <boost/asio.hpp>
 2 #include <boost/thread.hpp>
 3 #include <boost/date_time.hpp>
 4 #include <cstdlib>
 5 #include <iostream>
 6 #include <ctime>
 7 namespace asio = boost::asio;
 8 #define PRINT_ARGS(msg) do { …
...
13
14 int main() {
15   std::srand(std::time(0));
16   asio::io_service service;
17   asio::io_service::strand strand(service);
18   boost::mutex mtx;
19   size_t regular = 0, on_strand = 0;
20
21   auto workFuncStrand = [&mtx, &on_strand] {
22           ++on_strand;
23           PRINT_ARGS(on_strand << ". Hello, from strand!");
24           boost::this_thread::sleep(
25                       boost::posix_time::seconds(2));
26         };
27
28   auto workFunc = [&mtx, &regular] {
29                   PRINT_ARGS(++regular << ". Hello, world!");
30                   boost::this_thread::sleep(
31                         boost::posix_time::seconds(2));
32                 };
33   // Post work
34   for (int i = 0; i < 15; ++i) {
35     if (rand() % 2 == 0) {
36       service.post(strand.wrap(workFuncStrand));
37     } else {
38       service.post(workFunc);
39     }
40   }
41
42   // set up the worker threads in a thread group
43   boost::thread_group workers;
44   for (int i = 0; i < 3; ++i) {
45     workers.create_thread([&service, &mtx]() {
46                      PRINT_ARGS("Starting worker thread ");
47                       service.run();
48                       PRINT_ARGS("Worker thread done");
49                     });
50   }
51
52   workers.join_all(); // wait for all worker threads to finish
53 }

In this example, we create two handler functions: workFuncStrand (line 21) and workFunc (line 28). The lambda workFuncStrand captures the counter on_strand, increments it, and prints the message Hello, from strand!, prefixed with the value of the counter. The function workFunc captures another counter, regular, increments it, and prints Hello, world!, prefixed with the counter. Both pause for 2 seconds before returning.

To define and use a strand, we first create an object of io_service::strand associated with the io_service instance (line 17). Thereafter, we post all handlers that we want to be part of that strand by wrapping them using the wrap member function of the strand (line 36). Alternatively, we can post the handlers to the strand directly by using either the post or the dispatch member function of the strand, as shown in the following snippet:

33   for (int i = 0; i < 15; ++i) {
34     if (rand() % 2 == 0) {
35       strand.post(workFuncStrand);
37     } else {
...
The wrap member function of the strand returns a function object, which in turn calls dispatch on the strand to invoke the original handler. Initially, it is this function object, rather than our original handler, that is added to the queue. When duly dispatched, this invokes the original handler. There are no constraints on the order in which these wrapper handlers are dispatched, and therefore the actual order in which the original handlers are invoked can be different from the order in which they were wrapped and posted.

On the other hand, calling post or dispatch directly on the strand avoids an intermediate handler. Directly posting to a strand also guarantees that the handlers will be dispatched in the same order that they were posted, achieving a deterministic ordering of the handlers in the strand. The dispatch member of the strand blocks until the handler is dispatched. The post member simply adds it to the strand and returns.

Note that workFuncStrand increments on_strand without synchronization (line 22), while workFunc increments the counter regular within the PRINT_ARGS macro (line 29), which ensures that the increment happens in a critical section. The workFuncStrand handlers are posted to a strand and are therefore guaranteed to be serialized; hence there is no need for explicit synchronization. On the flip side, entire functions are serialized via strands, and synchronizing smaller blocks of code is not possible. There is no serialization between the handlers running on the strand and other handlers; therefore, access to global objects, like std::cout, must still be synchronized.

The following is a sample output of running the preceding code:

[b73b6b40] Starting worker thread
[b73b6b40] 0. Hello, world from strand!
[b6bb5b40] Starting worker thread
[b6bb5b40] 1. Hello, world!
[b63b4b40] Starting worker thread
[b63b4b40] 2. Hello, world!
[b73b6b40] 3. Hello, world from strand!
[b6bb5b40] 5. Hello, world!
[b63b4b40] 6. Hello, world!
…
[b6bb5b40] 14. Hello, world!
[b63b4b40] 4. Hello, world from strand!
[b63b4b40] 8. Hello, world from strand!
[b63b4b40] 10. Hello, world from strand!
[b63b4b40] 13. Hello, world from strand!
[b6bb5b40] Worker thread done
[b73b6b40] Worker thread done
[b63b4b40] Worker thread done

There were three distinct threads in the pool, and the handlers from the strand were picked up by two of these three threads: initially by thread ID b73b6b40, and later by thread ID b63b4b40. This also dispels a frequent misunderstanding that all handlers in a strand are dispatched by the same thread, which is clearly not the case. Different handlers in the same strand may be dispatched by different threads but will never run concurrently.

Summary

Asio is a well-designed library that can be used to write fast, nimble network servers that use the most optimal mechanisms for asynchronous I/O available on a system. It is an evolving library and is the basis for a Technical Specification that proposes to add a networking library to a future revision of the C++ Standard. In this article, we learned how to use the Boost Asio library as a task queue manager and leverage Asio's TCP and UDP interfaces to write programs that communicate over the network.

Resources for Article:

Further resources on this subject:

Animation features in Unity 5 [article]
Exploring and Interacting with Materials using Blueprints [article]
A Simple Pathfinding Algorithm for a Maze [article]

Processing Next-generation Sequencing Datasets Using Python

Packt
07 Jul 2015
25 min read
In this article by Tiago Antao, author of Bioinformatics with Python Cookbook, you will process next-generation sequencing datasets using Python.

If you work in life sciences, you are probably aware of the increasing importance of computational methods to analyze increasingly larger datasets. There is a massive need for bioinformaticians to process this data, and one of the main tools is, of course, Python. Python is probably the fastest growing language in the field of data sciences. It includes a rich ecosystem of software libraries to perform complex data analysis. Another major point in favor of Python is its great community, which is always ready to help and produces great documentation and high-quality, reliable software.

In this article, we will use Python to process next-generation sequencing datasets. This is one of the many examples of Python's usability in bioinformatics; chances are that if you have a biological dataset to analyze, Python can help you. This is surely the case with population genetics, genomics, phylogenetics, proteomics, and many other fields.

Next-generation Sequencing (NGS) is one of the fundamental technological developments of the decade in the field of life sciences. Whole Genome Sequencing (WGS), RAD-Seq, RNA-Seq, ChIP-Seq, and several other technologies are routinely used to investigate important biological problems. These are also called high-throughput sequencing technologies, with good reason: they generate vast amounts of data that need to be processed. NGS is the main reason why computational biology is becoming a "big data" discipline. More than anything else, this is a field that requires strong bioinformatics techniques. There is very strong demand for professionals with these skillsets.

Here, we will not discuss each individual NGS technique per se (that would be a massive undertaking). We will use two existing WGS datasets: the Human 1000 Genomes Project (http://www.1000genomes.org/) and the Anopheles 1000 genomes dataset (http://www.malariagen.net/projects/vector/ag1000g). The code presented will be easily applicable to other genomic sequencing approaches; some of it can also be used for transcriptomic analysis (for example, RNA-Seq). Most of the code is also species-independent, that is, you will be able to apply it to any species for which you have sequenced data.

As this is not an introductory text, you are expected to at least know what FASTA, FASTQ, BAM, and VCF files are. We will also make use of basic genomic terminology without introducing it (things such as exomes, nonsynonymous mutations, and so on). You are required to be familiar with basic Python, and we will leverage that knowledge to introduce the fundamental libraries in Python to perform NGS analysis. Here, we will concentrate on analyzing VCF files.

Preparing the environment

You will need Python 2.7 or 3.4. You can use many of the available distributions, including the standard one at http://www.python.org, but we recommend Anaconda Python from http://continuum.io/downloads. We also recommend the IPython Notebook (Project Jupyter) from http://ipython.org/. If you use Anaconda, this and many other packages are available with a simple conda install.

There are some amazing libraries to perform data analysis in Python; here, we will use NumPy (http://www.numpy.org/) and matplotlib (http://matplotlib.org/), which you may already be using in your projects. We will also make use of the less widely used seaborn library (http://stanford.edu/~mwaskom/software/seaborn/).
For bioinformatics, we will use Biopython (http://biopython.org) and PyVCF (https://pyvcf.readthedocs.org). The code used here is available on GitHub at https://github.com/tiagoantao/bioinf-python. In a realistic pipeline, you will probably also be using other tools, such as bwa, samtools, or GATK, to perform your alignment and SNP calling. In our case, tabix and bgzip (http://www.htslib.org/) are needed.

Analyzing variant calls

After running a genotype caller (for example, GATK or samtools mpileup), you will have a Variant Call Format (VCF) file reporting on genomic variations, such as SNPs (Single-Nucleotide Polymorphisms), InDels (Insertions/Deletions), and CNVs (Copy Number Variations), among others. In this recipe, we will discuss VCF processing with the PyVCF module over the Human 1000 Genomes Project data to analyze SNP data.

Getting ready

I am inclined to believe that 2 to 20 GB of data for a tutorial is asking too much. Although the 1000 Genomes' VCF files with realistic annotations are in that order of magnitude, we will want to work with much less data here. Fortunately, the bioinformatics community has developed tools that allow partial download of data. As part of the samtools/htslib package (http://www.htslib.org/), you can download tabix and bgzip, which will take care of data management. For example:

tabix -fh ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20130502/supporting/vcf_with_sample_level_annotation/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5_extra_anno.20130502.genotypes.vcf.gz 22:1-17000000 | bgzip -c > genotypes.vcf.gz
tabix -p vcf genotypes.vcf.gz

The first line performs a partial download of the VCF file for chromosome 22 (up to 17 Mbp) of the 1000 Genomes Project; then, bgzip compresses it. The second line creates an index, which we will need for direct access to a section of the genome. The preceding code is available at https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/01_NGS/Working_with_VCF.ipynb.

How to do it…

Take a look at the following steps:

Let's start by inspecting the information that we can get per record, as shown in the following code:

import vcf
v = vcf.Reader(filename='genotypes.vcf.gz')

print('Variant Level information')
infos = v.infos
for info in infos:
    print(info)

print('Sample Level information')
fmts = v.formats
for fmt in fmts:
    print(fmt)

We start by inspecting the annotations that are available for each record (remember that each record encodes variants, such as SNPs, CNVs, and InDels, and the state of that variant per sample). At the variant (record) level, we will find AC (the total number of ALT alleles in called genotypes), AF (the estimated allele frequency), NS (the number of samples with data), AN (the total number of alleles in called genotypes), and DP (the total read depth). There are others, but they are mostly specific to the 1000 Genomes Project (here, we are trying to be as general as possible). Your own dataset might have many more annotations, or none of these.

At the sample level, there are only two annotations in this file: GT (genotype) and DP (the per-sample read depth). Yes, you have the per-variant (total) read depth and the per-sample read depth; be sure not to confuse the two.
Now that we know which information is available, let's inspect a single VCF record with the following code: v = vcf.Reader(filename='genotypes.vcf.gz') rec = next(v) print(rec.CHROM, rec.POS, rec.ID, rec.REF, rec.ALT, rec.QUAL, rec.FILTER) print(rec.INFO) print(rec.FORMAT) samples = rec.samples print(len(samples)) sample = samples[0] print(sample.called, sample.gt_alleles, sample.is_het, sample.phased) print(int(sample['DP']))     We start by retrieving standard information: the chromosome, position, identifier, reference base (typically, just one), alternative bases (can have more than one, but it is not uncommon as the first filtering approach to only accept a single ALT, for example, only accept bi-allelic SNPs), quality (PHRED scaled—as you may expect), and the FILTER status. Regarding the filter, remember that whatever the VCF file says, you may still want to apply extra filters (as in the next recipe).     Then, we will print the additional variant-level information (AC, AS, AF, AN, DP, and so on), followed by the sample format (in this case, DP and GT). Finally, we will count the number of samples and inspect a single sample checking if it was called for this variant. If available, the reported alleles, heterozygosity, and phasing status (this dataset happens to be phased, which is not that common). Let's check the type of variant and the number of nonbiallelic SNPs in a single pass with the following code: from collections import defaultdict f = vcf.Reader(filename='genotypes.vcf.gz')   my_type = defaultdict(int) num_alts = defaultdict(int)   for rec in f:    my_type[rec.var_type, rec.var_subtype] += 1    if rec.is_snp:        num_alts[len(rec.ALT)] += 1 print(num_alts) print(my_type)     We use the Python defaultdict collection type. We find that this dataset has InDels (both insertions and deletions), CNVs, and, of course, SNPs (roughly two-third being transitions with one-third transversions). There is a residual number (79) of triallelic SNPs. There's more… The purpose of this recipe is to get you up to speed on the PyVCF module. At this stage, you should be comfortable with the API. We do not delve much here on usage details because that will be the main purpose of the next recipe: using the VCF module to study the quality of your variant calls. It will probably not be a shocking revelation that PyVCF is not the fastest module on earth. This file format (highly text-based) makes processing a time-consuming task. There are two main strategies of dealing with this problem: parallel processing or converting to a more efficient format. Note that VCF developers will perform a binary (BCF) version to deal with part of these problems at http://www.1000genomes.org/wiki/analysis/variant-call-format/bcf-binary-vcf-version-2. See also The specification for VCF is available at http://samtools.github.io/hts-specs/VCFv4.2.pdf GATK is one of the most widely used variant callers; check https://www.broadinstitute.org/gatk/ samtools and htslib are both used for variant calling and SAM/BAM management; check http://htslib.org Studying genome accessibility and filtering SNP data If you are using NGS data, the quality of your VCF calls may need to be assessed and filtered. Here, we will put in place a framework to filter SNP data. More than giving filtering rules (an impossible task to be performed in a general way), we give you procedures to assess the quality of your data. With this, you can then devise your own filters. 
Getting ready In the best-case scenario, you have a VCF file with proper filters applied; if this is the case, you can just go ahead and use your file. Note that all VCF files will have a FILTER column, but this does not mean that all the proper filters were applied. You have to be sure that your data is properly filtered. In the second case, which is one of the most common, your file will have unfiltered data, but you have enough annotations. Also, you can apply hard filters (that is, no need for programmatic filtering). If you have a GATK annotated file, refer, for instance, to http://gatkforums.broadinstitute.org/discussion/2806/howto-apply-hard-filters-to-a-call-set. In the third case, you have a VCF file that has all the annotations that you need, but you may want to apply more flexible filters (for example, "if read depth > 20, then accept; if mapping quality > 30, accept if mapping quality > 40"). In the fourth case, your VCF file does not have all the necessary annotations, and you have to revisit your BAM files (or even other sources of information). In this case, the best solution is to find whatever extra information you have and create a new VCF file with the needed annotations. Some genotype callers like GATK allow you to specify with annotations you want; you may also want to use extra programs to provide more annotations, for example, SnpEff (http://snpeff.sourceforge.net/) will annotate your SNPs with predictions of their effect (for example, if they are in exons, are they coding on noncoding?). It is impossible to provide a clear-cut recipe; it will vary with the type of your sequencing data, your species of study, and your tolerance to errors, among other variables. What we can do is provide a set of typical analysis that is done for high-quality filtering. In this recipe, we will not use data from the Human 1000 genomes project; we want "dirty" unfiltered data that has a lot of common annotations that can be used to filter it. We will use data from the Anopheles 1000 genomes project (Anopheles is the mosquito vector involved in the transmission of the parasite causing malaria), which makes available filtered and unfiltered data. You can find more information about this project at http://www.malariagen.net/projects/vector/ag1000g. We will get a part of the centromere of chromosome 3L for around 100 mosquitoes, followed by a part somewhere in the middle of that chromosome (and index both), as shown in the following code: tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC. phase1.AR1.vcf.gz 3L:1-200000 |bgzip -c > centro.vcf.gz tabix -fh ftp://ngs.sanger.ac.uk/production/ag1000g/phase1/preview/ag1000g.AC. phase1.AR1.vcf.gz 3L:21000001-21200000 |bgzip -c > standard.vcf.gz tabix -p vcf centro.vcf.gz tabix -p vcf standard.vcf.gz As usual, the code to download this data is available at the https://github.com/tiagoantao/bioinf-python/blob/master/notebooks/01_NGS/Filtering_SNPs.ipynb notebook. Finally, a word of warning about this recipe: the level of Python here will be slightly more complicated than before. The more general code that we will write may be easier to reuse in your specific case. We will perform extensive use of functional programming techniques (lambda functions) and the partial function application. 
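Since partial function application is used repeatedly below, here is a two-line refresher (a sketch added for clarity, not part of the recipe itself): functools.partial freezes some arguments of a function and returns a new callable.

import functools
import numpy as np

data = [1, 2, 3, 4, 100]
# q75(data) is exactly np.percentile(data, q=75), with q pre-filled.
q75 = functools.partial(np.percentile, q=75)
print(np.median(data), q75(data))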
How to do it… Take a look at the following steps: Let's start by plotting the distribution of variants across the genome in both files as follows:

%matplotlib inline
from collections import defaultdict

import seaborn as sns
import matplotlib.pyplot as plt

import vcf

def do_window(recs, size, fun):
    start = None
    win_res = []
    for rec in recs:
        if not rec.is_snp or len(rec.ALT) > 1:
            continue
        if start is None:
            start = rec.POS
        my_win = 1 + (rec.POS - start) // size
        while len(win_res) < my_win:
            win_res.append([])
        win_res[my_win - 1].extend(fun(rec))
    return win_res

wins = {}
size = 2000
vcf_names = ['centro.vcf.gz', 'standard.vcf.gz']
for vcf_name in vcf_names:
    recs = vcf.Reader(filename=vcf_name)
    wins[vcf_name] = do_window(recs, size, lambda x: [1])

    We start by performing the required imports (as usual, remember to remove the first line if you are not on the IPython Notebook). Before I explain the function, note what we will do.
    For both files, we will compute windowed statistics: we will divide our file, which includes 200,000 bp of data, into windows of size 2,000 (100 windows). Every time we find a bi-allelic SNP, we will add one to the list related to that window in the window function. The window function will take a VCF record (a bi-allelic SNP, that is, rec.is_snp is true and len(rec.ALT) == 1), determine the window where that record belongs (by performing an integer division of its offset from the first position by size), and extend the list of results of that window with whatever is returned by the function passed as the fun parameter (which in our case simply returns [1]).
    So, now we have a list of 100 elements (each representing 2,000 base pairs). Each element will be another list, which will have 1 for each bi-allelic SNP found. So, if you have 200 SNPs in the first 2,000 base pairs, the first element of the list will have 200 ones. Let's continue:

def apply_win_funs(wins, funs):
    fun_results = []
    for win in wins:
        my_funs = {}
        for name, fun in funs.items():
            try:
                my_funs[name] = fun(win)
            except:
                my_funs[name] = None
        fun_results.append(my_funs)
    return fun_results

stats = {}
fig, ax = plt.subplots(figsize=(16, 9))
for name, nwins in wins.items():
    stats[name] = apply_win_funs(nwins, {'sum': sum})
    x_lim = [i * size for i in range(len(stats[name]))]
    ax.plot(x_lim, [x['sum'] for x in stats[name]], label=name)
ax.legend()
ax.set_xlabel('Genomic location in the downloaded segment')
ax.set_ylabel('Number of variant sites (bi-allelic SNPs)')
fig.suptitle('Distribution of bi-allelic SNPs along the genome', fontsize='xx-large')

    Here, we will produce a plot that contains statistical information for each of our 100 windows. The apply_win_funs function will calculate a set of statistics for every window. In this case, it will sum all the numbers in the window. Remember that every time we find an SNP, we add one to the window list. This means that if we have 200 SNPs, we will have 200 1s; hence, summing them will return 200.
    So, we are able to compute the number of SNPs per window in an apparently convoluted way. Why we are doing things with this strategy will become apparent soon, but for now, let's check the result of this computation for both files (refer to the following figure):

Figure 1: The number of bi-allelic SNPs distributed over windows of 2,000 bp for an area of 200 Kbp near the centromere (blue) and in the middle of the chromosome (green). 
Both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project     Note that the amount of SNPs in the centromere is smaller than the one in the middle of the chromosome. This is expected because calling variants in chromosomes is more difficult than calling variants in the middle and also because probably there is less genomic diversity in centromeres. If you are used to humans or other mammals, you may find the density of variants obnoxiously high, that is, mosquitoes for you! Let's take a look at the sample-level annotation. We will inspect Mapping Quality Zero (refer to https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_MappingQualityZeroBySample.php for details), which is a measure of how well all the sequences involved in calling this variant map clearly to this position. Note that there is also an MQ0 annotation at the variant-level: import functools   import numpy as np mq0_wins = {} vcf_names = ['centro.vcf.gz', 'standard.vcf.gz'] size = 5000 def get_sample(rec, annot, my_type):    res = []    samples = rec.samples    for sample in samples:        if sample[annot] is None: # ignoring nones            continue        res.append(my_type(sample[annot]))    return res   for vcf_name in vcf_names:    recs = vcf.Reader(filename=vcf_name)    mq0_wins[vcf_name] = do_window(recs, size, functools.partial(get_sample, annot='MQ0', my_type=int))     Start by inspecting this by looking at the last for; we will perform a windowed analysis by getting the MQ0 annotation from each record. We perform this by calling the get_sample function in which we return our preferred annotation (in this case, MQ0) cast with a certain type (my_type=int). We will use the partial application function here. Python allows you to specify some parameters of a function and wait for other parameters to be specified later. Note that the most complicated thing here is the functional programming style. Also, note that it makes it very easy to compute other sample-level annotations; just replace MQ0 with AB, AD, GQ, and so on. You will immediately have a computation for that annotation. If the annotation is not of type integer, no problem; just adapt my_type. This is a difficult programming style if you are not used to it, but you will reap the benefits very soon. Let's now print the median and top 75 percent percentile for each window (in this case, with a size of 5,000) as follows: stats = {} colors = ['b', 'g'] i = 0 fig, ax = plt.subplots(figsize=(16, 9)) for name, nwins in mq0_wins.items():    stats[name] = apply_win_funs(nwins, {'median': np.median, '75': functools.partial(np.percentile, q=75)})    x_lim = [j * size for j in range(len(stats[name]))]    ax.plot(x_lim, [x['median'] for x in stats[name]], label=name, color=colors[i])    ax.plot(x_lim, [x['75'] for x in stats[name]], '--', color=colors[i])    i += 1 ax.legend() ax.set_xlabel('Genomic location in the downloaded segment') ax.set_ylabel('MQ0') fig.suptitle('Distribution of MQ0 along the genome', fontsize='xx-large')     Note that we now have two different statistics on apply_win_funs: percentile and median. Again, we will pass function names as parameters (np.median) and perform the partial function application (np.percentile). 
The result can be seen in the following figure: Figure 2: Median (continuous line) and 75th percentile (dashed) of MQ0 of sample SNPs distributed on windows of 5,000 bp of size for an area of 200 Kbp near the centromere (blue) and in the middle of chromosome (green); both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project     For the "standard" file, the median MQ0 is 0 (it is plotted at the very bottom, which is almost unseen); this is good as it suggests that most sequences involved in the calling of variants map clearly to this area of the genome. For the centromere, MQ0 is of poor quality. Furthermore, there are areas where the genotype caller could not find any variants at all; hence, the incomplete chart. Let's compare heterozygosity with the DP sample-level annotation:     Here, we will plot the fraction of heterozygosity calls as a function of the sample read depth (DP) for every SNP. We will first explain the result and only then the code that generates it.     The following screenshot shows the fraction of calls that are heterozygous at a certain depth: Figure 3: The continuous line represents the fraction of heterozygosite calls computed at a certain depth; in blue is the centromeric area, in green is the "standard" area; the dashed lines represent the number of sample calls per depth; both areas come from chromosome 3L for circa 100 Ugandan mosquitoes from the Anopheles 1000 genomes project In the preceding screenshot, there are two considerations to be taken into account:     At a very low depth, the fraction of heterozygote calls is biased low; this makes sense because the number of reads per position does not allow you to make a correct estimate of the presence of both alleles in a sample. So, you should not trust calls at a very low depth.     As expected, the number of calls in the centromere is way lower than calls outside it. The distribution of SNPs outside the centromere follows a common pattern that you can expect in many datasets. Here is the code: def get_sample_relation(recs, f1, f2):    rel = defaultdict(int)    for rec in recs:        if not rec.is_snp:              continue        for sample in rec.samples:            try:                 v1 = f1(sample)                v2 = f2(sample)                if v1 is None or v2 is None:                    continue # We ignore Nones                rel[(v1, v2)] += 1            except:                pass # This is outside the domain (typically None)    return rel   rels = {} for vcf_name in vcf_names:    recs = vcf.Reader(filename=vcf_name)    rels[vcf_name] = get_sample_relation(recs, lambda s: 1 if s.is_het else 0, lambda s: int(s['DP'])) Let's start by looking at the for loop. Again, we will use functional programming: the get_sample_relation function will traverse all the SNP records and apply the two functional parameters; the first determines heterozygosity, whereas the second gets the sample DP (remember that there is also a variant DP).     Now, as the code is complex as it is, I opted for a naive data structure to be returned by get_sample_relation: a dictionary where the key is the pair of results (in this case, heterozygosity and DP) and the sum of SNPs, which share both values. There are more elegant data structures with different trade-offs for this: scipy spare matrices, pandas' DataFrames, or maybe, you want to consider PyTables. 
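If you do prefer the pandas route hinted at above, converting the naive dictionary into a tabular structure is short. This is only a sketch under the assumption that pandas is installed (it is not otherwise required by this recipe), and rel_to_frame is a helper name introduced here for illustration:

import pandas as pd

def rel_to_frame(rel):
    # One row per (heterozygosity state, DP) pair, with its SNP count.
    rows = [{'is_het': het, 'DP': dp, 'count': cnt}
            for (het, dp), cnt in rel.items()]
    return pd.DataFrame(rows).sort_values(['DP', 'is_het'])

# For example: frame = rel_to_frame(rels['standard.vcf.gz'])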
The fundamental point here is to have a framework that is general enough to compute relationships among a couple of sample annotations.
    Also, be careful with the dimension space of several annotations; for example, if your annotation is of float type, you might have to round it (if not, the size of your data structure might become too big). Now, let's take a look at the plotting code. We will do it in two parts; here is part 1:

def plot_hz_rel(dps, ax, ax2, name, rel):
    frac_hz = []
    cnt_dp = []
    for dp in dps:
        hz = 0.0
        cnt = 0
        for khz, kdp in rel.keys():
            if kdp != dp:
                continue
            cnt += rel[(khz, dp)]
            if khz == 1:
                hz += rel[(khz, dp)]
        frac_hz.append(hz / cnt)
        cnt_dp.append(cnt)
    ax.plot(dps, frac_hz, label=name)
    ax2.plot(dps, cnt_dp, '--', label=name)

    This function will take a data structure (as generated by get_sample_relation), expecting that the first element of the key tuple is the heterozygosity state (0 = homozygote, 1 = heterozygote) and the second is the DP. With this, it will generate two lines: one with the fraction of samples that are heterozygous at a certain depth and the other with the SNP count. Let's now call this function, as shown in the following code:

fig, ax = plt.subplots(figsize=(16, 9))
ax2 = ax.twinx()
for name, rel in rels.items():
    dps = list(set([x[1] for x in rel.keys()]))
    dps.sort()
    plot_hz_rel(dps, ax, ax2, name, rel)
ax.set_xlim(0, 75)
ax.set_ylim(0, 0.2)
ax2.set_ylabel('Quantity of calls')
ax.set_ylabel('Fraction of Heterozygote calls')
ax.set_xlabel('Sample Read Depth (DP)')
ax.legend()
fig.suptitle('Number of calls per depth and fraction of calls which are Hz',
             fontsize='xx-large')

    Here, we will use two axes. On the left-hand side, we will have the fraction of heterozygote SNPs, whereas on the right-hand side, we will have the number of SNPs. Then, we will call our plot_hz_rel for both data files. The rest is standard matplotlib code. Finally, let's compare variant DP with the categorical variant-level annotation: EFF. EFF is provided by SnpEff and tells us (among many other things) the type of SNP (for example, intergenic, intronic, coding synonymous, and coding nonsynonymous). The Anopheles dataset provides this useful annotation. Let's start by extracting variant-level annotations, again in the functional programming style, as shown in the following code:

def get_variant_relation(recs, f1, f2):
    rel = defaultdict(int)
    for rec in recs:
        if not rec.is_snp:
            continue
        try:
            v1 = f1(rec)
            v2 = f2(rec)
            if v1 is None or v2 is None:
                continue  # We ignore Nones
            rel[(v1, v2)] += 1
        except:
            pass
    return rel

    The programming style here is similar to get_sample_relation, but we do not delve into the samples. Now, we will define the types of effects that we will work with and convert the effect to an integer, as this allows you to use it as an index, for example, in matrices. 
Think about coding a categorical variable: accepted_eff = ['INTERGENIC', 'INTRON', 'NON_SYNONYMOUS_CODING', 'SYNONYMOUS_CODING']   def eff_to_int(rec):    try:        for annot in rec.INFO['EFF']:            #We use the first annotation            master_type = annot.split('(')[0]            return accepted_eff.index(master_type)    except ValueError:        return len(accepted_eff) We will now traverse the file; the style should be clear to you now: eff_mq0s = {} for vcf_name in vcf_names:    recs = vcf.Reader(filename=vcf_name)    eff_mq0s[vcf_name] = get_variant_relation(recs, lambda r: eff_to_int(r), lambda r: int(r.INFO['DP'])) Finally, we will plot the distribution of DP using the SNP effect, as shown in the following code: fig, ax = plt.subplots(figsize=(16,9)) vcf_name = 'standard.vcf.gz' bp_vals = [[] for x in range(len(accepted_eff) + 1)] for k, cnt in eff_mq0s[vcf_name].items():    my_eff, mq0 = k    bp_vals[my_eff].extend([mq0] * cnt) sns.boxplot(bp_vals, sym='', ax=ax) ax.set_xticklabels(accepted_eff + ['OTHER']) ax.set_ylabel('DP (variant)') fig.suptitle('Distribution of variant DP per SNP type',              fontsize='xx-large') Here, we will just print a box plot for the noncentromeric file (refer to the following screenshot). The results are as expected: SNPs in code areas will probably have more depth if they are in more complex regions (that is easier to call) than intergenic SNPs: Figure 4: Boxplot for the distribution of variant read depth across different SNP effects There's more… The approach would depend on the type of sequencing data that you have, the number of samples, and potential extra information (for example, pedigree among samples). This recipe is very complex as it is, but parts of it are profoundly naive (there is a limit of complexity that I could force on you on a simple recipe). For example, the window code does not support overlapping windows; also, data structures are simplistic. However, I hope that they give you an idea of the general strategy to process genomic high-throughput sequencing data. See also There are many filtering rules, but I would like to draw your attention to the need of reasonably good coverage (clearly more than 10 x), for example, refer to. Meynet et al "Variant detection sensitivity and biases in whole genome and exome sequencing" at http://www.biomedcentral.com/1471-2105/15/247/ Brad Chapman is one of the best known specialist in sequencing analysis and data quality with Python and the main author of Blue Collar Bioinformatics, a blog that you may want to check at https://bcbio.wordpress.com/ Brad is also the main author of bcbio-nextgen, a Python-based pipeline for high-throughput sequencing analysis. Refer to https://bcbio-nextgen.readthedocs.org Peter Cock is the main author of Biopython and is heavily involved in NGS analysis; be sure to check his blog, "Blasted Bionformatics!?" at http://blastedbio.blogspot.co.uk/ Summary In this article, we prepared the environment, analyzed variant calls and learned about genome accessibility and filtering SNP data.

Regular expressions in AWK programming: What, Why, and How

Pavan Ramchandani
18 May 2018
8 min read
AWK is a pattern-matching language. It searches for a pattern in a file and, upon finding the corresponding match, it performs the file's action on the input line. This pattern could consist of fixed strings or a pattern of text. This variable content or pattern is generally searched with the help of regular expressions. Hence, regular expressions form an important part of AWK programming language. Today we will introduce you to the regular expressions in AWK programming and will get started with string-matching patterns and basic constructs to use with AWK. This article is an excerpt from a book written by Shiwang Kalkhanda, titled Learning AWK Programming. What is a regular expression? A regular expression, or regexpr, is a set of characters used to describe a pattern. A regular expression is generally used to match lines in a file that contain a particular pattern. Many Unix utilities operate on plain text files line by line, such as grep, sed, and awk. Regular expressions search for a pattern on a single line in a file. A regular expression doesn't search for a pattern that begins on one line and ends on another. Other programming languages may support this, notably Perl. Why use regular expressions? Generally, all editors have the ability to perform search-and-replace operations. Some editors can only search for patterns, others can also replace them, and others can also print the line containing that pattern. A regular expression goes many steps beyond this simple search, replace, and printing functionality, and hence it is more powerful and flexible. We can search for a word of a certain size, such as a word that has four characters or numbers. We can search for a word that ends with a particular character, let's say e. You can search for phone numbers, email IDs, and so on, and can also perform validation using regular expressions. They simplify complex pattern-matching tasks and hence form an important part of AWK programming. Other regular expression variations also exist, notably those for Perl. Using regular expressions with AWK There are mainly two types of regular expressions in Linux: Basic regular expressions that are used by vi, sed, grep, and so on Extended regular expressions that are used by awk, nawk, gawk, and egrep Here, we will refer to extended regular expressions as regular expressions in the context of AWK. In AWK, regular expressions are enclosed in forward slashes, '/', (forming the AWK pattern) and match every input record whose text belongs to that set. The simplest regular expression is a string of letters, numbers, or both that matches itself. For example, here we use the ly regular expression string to print all lines that contain the ly pattern in them. We just need to enclose the regular expression in forward slashes in AWK: $ awk '/ly/' emp.dat The output on execution of this code is as follows: Billy Chabra 9911664321 [email protected] M lgs 1900 Emily Kaur 8826175812 [email protected] F Ops 2100 In this example, the /ly/ pattern matches when the current input line contains the ly sub-string, either as ly itself or as some part of a bigger word, such as Billy or Emily, and prints the corresponding line. Regular expressions as string-matching patterns with AWK Regular expressions are used as string-matching patterns with AWK in the following three ways. We use the '~' and '! ~' match operators to perform regular expression comparisons: /regexpr/: This matches when the current input line contains a sub-string matched by regexpr. 
It is the most basic regular expression, which matches itself as a string or sub-string. For example, /mail/ matches only when the current input line contains the mail string as a string, a sub-string, or both. So, we will get lines with Gmail as well as Hotmail in the email ID field of the employee database as follows: $ awk '/mail/' emp.dat The output on execution of this code is as follows: Jack Singh 9857532312 [email protected] M hr 2000 Jane Kaur 9837432312 [email protected] F hr 1800 Eva Chabra 8827232115 [email protected] F lgs 2100 Ana Khanna 9856422312 [email protected] F Ops 2700 Victor Sharma 8826567898 [email protected] M Ops 2500 John Kapur 9911556789 [email protected] M hr 2200 Sam khanna 8856345512 [email protected] F lgs 2300 Emily Kaur 8826175812 [email protected] F Ops 2100 Amy Sharma 9857536898 [email protected] F Ops 2500 In this example, we do not specify any expression, hence it automatically matches a whole line, as follows: $ awk '$0 ~ /mail/' emp.dat The output on execution of this code is as follows: Jack Singh 9857532312 [email protected] M hr 2000 Jane Kaur 9837432312 [email protected] F hr 1800 Eva Chabra 8827232115 [email protected] F lgs 2100 Ana Khanna 9856422312 [email protected] F Ops 2700 Victor Sharma 8826567898 [email protected] M Ops 2500 John Kapur 9911556789 [email protected] M hr 2200 Sam khanna 8856345512 [email protected] F lgs 2300 Emily Kaur 8826175812 [email protected] F Ops 2100 Amy Sharma 9857536898 [email protected] F Ops 2500 expression ~ /regexpr /: This matches if the string value of the expression contains a sub-string matched by regexpr. Generally, this left-hand operand of the matching operator is a field. For example, in the following command, we print all the lines in which the value in the second field contains a /Singh/ string: $ awk '$2 ~ /Singh/{ print }' emp.dat We can also use the expression as follows: $ awk '{ if($2 ~ /Singh/) print}' emp.dat The output on execution of the preceding code is as follows: Jack Singh 9857532312 [email protected] M hr 2000 Hari Singh 8827255666 [email protected] M Ops 2350 Ginny Singh 9857123466 [email protected] F hr 2250 Vina Singh 8811776612 [email protected] F lgs 2300 expression !~ /regexpr /: This matches if the string value of the expression does not contain a sub-string matched by regexpr. Generally, this expression is also a field variable. For example, in the following example, we print all the lines that don't contain the Singh sub-string in the second field, as follows: $ awk '$2 !~ /Singh/{ print }' emp.dat The output on execution of the preceding code is as follows: Jane Kaur 9837432312 [email protected] F hr 1800 Eva Chabra 8827232115 [email protected] F lgs 2100 Amit Sharma 9911887766 [email protected] M lgs 2350 Julie Kapur 8826234556 [email protected] F Ops 2500 Ana Khanna 9856422312 [email protected] F Ops 2700 Victor Sharma 8826567898 [email protected] M Ops 2500 John Kapur 9911556789 [email protected] M hr 2200 Billy Chabra 9911664321 [email protected] M lgs 1900 Sam khanna 8856345512 [email protected] F lgs 2300 Emily Kaur 8826175812 [email protected] F Ops 2100 Amy Sharma 9857536898 [email protected] F Ops 2500 Any expression may be used in place of /regexpr/ in the context of ~; and !~. The expression here could also be if, while, for, and do statements. Basic regular expression construct Regular expressions are made up of two types of characters: normal text characters, called literals, and special characters, such as the asterisk (*, +, ?, .), called metacharacters. 
There are times when you want to match a metacharacter as a literal character. In such cases, we prefix that metacharacter with a backslash (\), which is called an escape sequence. The basic regular expression construct can be summarized as follows. Here is the list of metacharacters, also known as special characters, that are used in building regular expressions:

    ^    $    .    [    ]    |    (    )    *    +    ?

The following list describes the remaining elements that are used in building a basic regular expression, apart from the metacharacters mentioned before:

Literal: A literal character (non-metacharacter), such as A, that matches itself.
Escape sequence: An escape sequence that matches a special symbol: for example, \t matches a tab.
Quoted metacharacter (\): A metacharacter prefixed with a backslash, such as \$, matches that metacharacter literally.
Anchor (^): Matches the beginning of a string.
Anchor ($): Matches the end of a string.
Dot (.): Matches any single character.
Character classes ([...]): A character class [ABC] matches any one of the A, B, or C characters. Character classes may include ranges, such as [A-Za-z], which matches any single letter.
Complemented character classes ([^...]): A complemented character class such as [^0-9] matches any character except a digit.

These operators combine regular expressions into larger ones:

Alternation (|): A|B matches A or B.
Concatenation: AB matches A immediately followed by B.
Closure (*): A* matches zero or more As.
Positive closure (+): A+ matches one or more As.
Zero or one (?): A? matches the null string or A.
Parentheses (): Used for grouping regular expressions and back-referencing; a grouped sub-expression (r) can be referred to later as \n, where n is the group number.

Do check out the book Learning AWK Programming to learn more about the intricacies of the AWK programming language for text processing. Read More What is the difference between functional and object-oriented programming? What makes a programming language simple or complex?

Fast Array Operations with NumPy

Packt
19 Dec 2013
10 min read
(For more resources related to this topic, see here.) Getting started with NumPy NumPy is founded around its multidimensional array object, numpy.ndarray. NumPy arrays are a collection of elements of the same data type; this fundamental restriction allows NumPy to pack the data in an efficient way. By storing the data in this way NumPy can handle arithmetic and mathematical operations at high speed. Creating arrays You can create NumPy arrays using the numpy.array function. It takes list-like object (or another array) as input and, optionally, a string expressing its data type. You can interactively test array creation using an IPython shell as follows: In [1]: import numpy as np In [2]: a = np.array([0, 1, 2]) Every NumPy array has a data type that can be accessed by the dtype attribute, as shown in the following code. In the following code example, dtype is a 64-bit integer. In [3]: a.dtype Out[3]: dtype('int64') If we want those numbers to be treated as a float type of variable, we can either pass the dtype argument in the np.array function or cast the array to another data type using the astype method as shown in the following code: In [4]: a = np.array([1, 2, 3], dtype='float32') In [5]: a.astype('float32') Out[5]: array([ 0.,  1.,  2.], dtype=float32) To create an array with two dimensions (an array of arrays) we can initialize the array using a nested sequence shown as follows: In [6]: a = np.array([[0, 1, 2], [3, 4, 5]]) In [7]: print(a) Out[7]: [[0 1 2]         [3 4 5]] The array created in this way has two dimensions—axes in NumPy's jargon. Such an array is like a table that contains two rows and three columns. We can access the axes structure using the ndarray.shape attribute: In [7]: a.shape Out[7]: (2, 3) Arrays can also be reshaped only as long as the product of the shape dimensions is equal to the total number of elements in the array. For example, we can reshape an array containing 16 elements in the following ways: (2, 8), (4, 4), or (2, 2, 4). To reshape an array we can either use the ndarray.reshape method or directly change the ndarray.shape attribute. The following code illustrates the use of the ndarray.reshape method: In [7]: a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8,                       9, 10, 11, 12, 13, 14, 15]) In [7]: a.shape Out[7]: (16,) In [8]: a.reshape(4, 4) # Equivalent: a.shape = (4, 4) Out[8]: array([[ 0,  1,  2,  3],        [ 4,  5,  6,  7],        [ 8,  9, 10, 11],        [12, 13, 14, 15]]) Thanks to this property you are also free to add dimensions of size one. You can reshape an array with 16 elements to (16, 1), (1, 16), (16, 1, 1), and so on. NumPy provides convenience functions, shown in the following code, to create arrays filled with zeros, filled with ones, or without an initialization value (empty—their actual value is meaningless and depends on the memory state). Those functions take the array shape as a tuple and optionally its dtype. In [8]: np.zeros((3, 3)) In [9]: np.empty((3, 3)) In [10]: np.ones((3, 3), dtype='float32') In our examples we will use the numpy.random module to generate random floating point numbers in the (0, 1) interval. The numpy.random module is shown as follows: In [11]: np.random.rand(3, 3) Sometimes it is convenient to initialize arrays that have a similar shape to other arrays. Again, NumPy provides some handy functions for that purpose such as zeros_like, empty_like, and ones_like. 
These functions are as follows: In [12]: np.zeros_like(a) In [13]: np.empty_like(a) In [14]: np.ones_like(a) Accessing arrays NumPy array interface is, on a shallow level, similar to Python lists. They can be indexed using integers, and can also be iterated using a for loop. In [15]: A = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8]) In [16]: A[0] Out[16]: 0 In [17]: [a for a in A] Out[17]: [0, 1, 2, 3, 4, 5, 6, 7, 8] It is also possible to index an array in multiple dimensions. If we take a (3,3) array (an array containing 3 triplets) and we index the first element, we obtain the first triplet shown as follows: In [18]: A = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]]) In [19]: A[0] Out[19]: array([0, 1, 2]) We can index the triplet again by adding the other index separated by a comma. To get the second element of the first triplet we can index using [0, 1] as shown in the following code: In [20]: A[0, 1] Out[20]: 1 NumPy allows you to slice arrays in single and multiple dimensions. If we index on the first dimension we will get a collection of triplets shown as follows: In [21]: A[0:2] Out[21]: array([[0, 1, 2],                [3, 4, 5]]) If we slice the array with [0:2]. for every selected triplet we extract the first two elements, resulting in a (2, 2) array shown in the following code: In [22]: A[0:2, 0:2] Out[22]: array([[0, 1],                 [3, 4]]) Intuitively, you can update values in the array by using both numerical indexes and slices. The syntax is as follows: In [23]: A[0, 1] = 8 In [24]: A[0:2, 0:2] = [[1, 1], [1, 1]] Indexing with the slicing syntax is fast because it doesn't make copies of the array. In NumPy terminology it returns a view over the same memory area. If we take a slice of the original array and then changes one of its value; the original array will be updated as well. The following code illustrates an example of the same: In [25]: a = np.array([1, 1, 1, 1]) In [26]: a_view = A[0:2] In [27]: a_view[0] = 2 In [28]: print(A) Out[28]: [2 1 1 1] We can take a look at another example that shows how the slicing syntax can be used in a real-world scenario. We define an array r_i, shown in the following line of code, which contains a set of 10 coordinates (x, y); its shape will be (10, 2): In [29]: r_i = np.random.rand(10, 2) A typical operation is extracting the x component of each coordinate. In other words you want to extract the items [0, 0], [1, 0], [2, 0], and so on. resulting in an array with shape (10,). It is helpful to think that the first index is moving while the second one is fixed (at 0). With this in mind, we will slice every index on the first axis (the moving one) and take the first element (the fixed one) on the second axis as shown in the following line of code: In [30]: x_i = r_i[:, 0] On the other hand, the following expression of code will keep the first index fixed and the second index moving, giving the first (x, y) coordinate: In [31]: r_0 = r_i[0, :] Slicing all the indexes over the last axis is optional; using r_i[0] has the same effect as r_i[0, :]. NumPy allows to index an array by using another NumPy array made of either integer or Boolean values—a feature called fancy indexing. If you index with an array of integers, NumPy will interpret the integers as indexes and will return an array containing their corresponding values. If we index an array containing 10 elements with [0, 2, 3], we obtain an array of size 3 containing the elements at positions 0, 2 and 3. 
The following code gives us an illustration of this concept:

In [32]: a = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
In [33]: idx = np.array([0, 2, 3])
In [34]: a[idx]
Out[34]: array([9, 7, 6])

You can use fancy indexing on multiple dimensions by passing an array for each dimension. If we want to extract the elements at positions [0, 2] and [1, 2], we have to pack all the indexes acting on the first axis in one array, and the ones acting on the second axis in another. This can be seen in the following code:

In [35]: a = np.array([[0, 1, 2], [3, 4, 5],
                       [6, 7, 8], [9, 10, 11]])
In [36]: idx1 = np.array([0, 1])
In [37]: idx2 = np.array([2, 2])
In [38]: a[idx1, idx2]

You can also use normal lists as index arrays, but not tuples. For example, the following two statements are equivalent:

>>> a[np.array([0, 1])] # is equivalent to
>>> a[[0, 1]]

However, if you use a tuple, NumPy will interpret the following statement as an index on multiple dimensions:

>>> a[(0, 1)] # is equivalent to
>>> a[0, 1]

The index arrays are not required to be one-dimensional; we can extract elements from the original array in any shape. For example, we can select elements from the original array to form a (2, 2) array, shown as follows:

In [39]: idx1 = [[0, 1], [3, 2]]
In [40]: idx2 = [[0, 2], [1, 1]]
In [41]: a[idx1, idx2]
Out[41]: array([[ 0,  5],
                [10,  7]])

The array slicing and fancy indexing features can be combined. For example, this is useful if we want to swap the x and y columns in a coordinate array. In the following code, the first index will be running over all the elements (a slice), and for each of those we extract the element in position 1 (the y) first and then the one in position 0 (the x):

In [42]: r_i = np.random.rand(10, 2)
In [43]: r_i[:, [0, 1]] = r_i[:, [1, 0]]

When the index array is Boolean, there are slightly different rules. The Boolean array will act like a mask; every element corresponding to True will be extracted and put in the output array. This procedure is shown as follows:

In [44]: a = np.array([0, 1, 2, 3, 4, 5])
In [45]: mask = np.array([True, False, True, False, False, False])
In [46]: a[mask]
Out[46]: array([0, 2])

The same rules apply when dealing with multiple dimensions. Furthermore, if the index array has the same shape as the original array, the elements corresponding to True will be selected and put in the resulting array. Indexing in NumPy is a reasonably fast operation. Anyway, when speed is critical, you can use the slightly faster numpy.take and numpy.compress functions to squeeze out a little more speed. The first argument of numpy.take is the array we want to operate on, and the second is the list of indexes we want to extract. The last argument is axis; if not provided, the indexes will act on the flattened array, otherwise they will act along the specified axis.

In [47]: r_i = np.random.rand(100, 2)
In [48]: idx = np.arange(50) # integers 0 to 49
In [49]: %timeit np.take(r_i, idx, axis=0)
1000000 loops, best of 3: 962 ns per loop
In [50]: %timeit r_i[idx]
100000 loops, best of 3: 3.09 us per loop

The similar, but faster, version for Boolean arrays is numpy.compress, which works in the same way. 
The use of numpy.compress is shown as follows: In [51]: idx = np.ones(100, dtype='bool') # all True values In [52]: %timeit np.compress(idx, r_i, axis=0) 1000000 loops, best of 3: 1.65 us per loop In [53]: %timeit r_i[idx] 100000 loops, best of 3: 5.47 us per loop Summary The article thus covers the basics of NumPy arrays, talking about the creating of arrays and how we can access them. Resources for Article: Further resources on this subject: Getting Started with Spring Python [Article] Python Testing: Installing the Robot Framework [Article] Python Multimedia: Fun with Animations using Pyglet [Article]

8 Reasons why architects love API driven architecture

Aaron Lazar
07 Jun 2018
6 min read
Everyday, we see a new architecture popping up, being labeled as a modern architecture for application development. That’s what happened with Microservices in the beginning, and then all went for a toss when they were termed as a design pattern rather than an architecture on a whole. APIs are growing in popularity and are even being used as a basis to draw out the architecture of applications. We’re going to try and understand what some of the top factors are, which make Architects (and Developers) appreciate API driven architectures over the other “modern” and upcoming architectures. Before we get to the reasons, let’s understand where I’m coming from in the first place. So, we recently published our findings from the Skill Up survey that we conducted for 8,000 odd IT pros. We asked them various questions ranging from what their favourite tools were, to whether they felt they knew more than what their managers did. Of the questions, one of them was directed to find out which of the modern architectures interested them the most. The choices were among Chaos Engineering, API Driven Architecture and Evolutionary Architecture. Source: Skill Up 2018 From the results, it's evident that they’re more inclined towards API driven Architecture. Or maybe, those who didn’t really find the architecture of their choice among the lot, simply chose API driven to be the best of the lot. But why do architects love API driven development? Anyway, I’ve been thinking about it a bit and thought I would come up with a few reasons as to why this might be so. So here goes… Reason #1: The big split between the backend and frontend Also known as Split Stack Development, API driven architecture allows for the backend and frontend of the application to be decoupled. This allows developers and architects to mitigate any dependencies that each end might have or rather impose on the other. Instead of having the dependencies, each end communicates with the other via APIs. This is extremely beneficial in the sense that each end can be built in completely different tools and technologies. For example, the backend could be in Python/Java, while the front end is built in JavaScript. Reason #2: Sensibility in scalability When APIs are the foundation of an architecture, it enables the organisation to scale the app by simply plugging in services as and when needed, instead of having to modify the app itself. This is a great way to plugin and plugout functionality as and when needed without disrupting the original architecture. Reason #3: Parallel Development aka Agile When different teams work on the front and back end of the application, there’s no reason for them to be working together. That doesn’t mean they don’t work together at all, rather, what I mean is that the only factor they have to agree upon is the API structure and nothing else. This is because of Reason #1, where both layers of the architecture are disconnected or decoupled. This enables teams to be more flexible and agile when developing the application. It is only at the testing and deployment stages that the teams will collaborate more. Reason #4: API as a product This is more of a business case, rather than developer centric, but I thought I should add it in anyway. So, there’s something new that popped up on the Thoughtworks Radar, a few months ago - API-as-a-product.  As a matter of fact, you could consider this similar to API-as-a-Service. Organisations like Salesforce have been offering their services in the form of APIs. 
For example, suppose you’re using Salesforce CRM and you want to extend the functionality, all you need to do is use the APIs for extending the system. Google is another good example of a company that offers APIs as products. This is a great way to provide extensibility instead of having a separate application altogether. Individual APIs or groups of them can be priced with subscription plans. These plans contain not only access to the APIs themselves, but also a defined number of calls or data that is allowed. Reason #5: Hiding underlying complexity In an API driven architecture, all components that are connected to the API are modular, exist on their own and communicate via the API. The modular nature of the application makes it easier to test and maintain. Moreover, if you’re using or consuming someone else’s API, you needn’t learn/decipher the entire code’s working, rather you can just plug in the API and use it. That reduces complexity to a great extent. Reason #6: Business Logic comes first API driven architecture allows developers to focus on the Business Logic, rather than having to worry about structuring the application. The initial API structure is all that needs to be planned out, after which each team goes forth and develops the individual APIs. This greatly reduces development time as well. Reason #7: IoT loves APIs API architecture makes for a great way to build IoT applications, as IoT needs a great deal of scalability. An application that is built on a foundation of APIs is a dream for IoT developers as devices can be easily connected to the mother app. I expect everything to be connected via APIs in the next 5 years. If it doesn’t happen, you can always get back at me in the comments section! ;) Reason #8: APIs and DevOps are a match made in Heaven APIs allow for a more streamlined deployment pipeline, while also eliminating the production of duplicate assets by development teams. Moreover, deployments can reach production a lot faster through these slick pipelines, thus increasing efficiency and reducing costs by a great deal. The merger of DevOps and API driven architecture, however, is not a walk in the park, as it requires a change in mindset. Teams need to change culturally, to become enablers of reusable, self-service consumption. The other side of the coin Well, there’s always two sides to the coin, and there are some drawbacks to API driven architecture. For starters, you’ll have APIs all over the place! While that was the point in the first place, it becomes really tedious to manage all those APIs. Secondly, when you have things running in parallel, you require a lot of processing power - more cores, more infrastructure. Another important issue is regarding security. With so many cyber attacks, and privacy breaches, an API driven architecture only invites trouble with more doors for hackers to open. So apart from the above flipside, those were some of the reasons I could think of, as to why Architects would be interested in an API driven architecture. APIs give customers, i.e both internal and external stakeholders, the freedom to leverage enterprise’s assets, while customizing as required. In a way, APIs aren’t just ways to offer integration and connectivity for large enterprise apps. Rather, they should be looked at as a way to drive faster and more modern software architecture and delivery. What are web developers favorite front-end tools? The best backend tools in web development The 10 most common types of DoS attacks you need to know

How to call an Azure function from an ASP.NET Core MVC application

Aaron Lazar
03 May 2018
10 min read
In this tutorial, we'll learn how to call an Azure Function from an ASP.NET Core MVC application. [box type="shadow" align="" class="" width=""]This article is an extract from the book C# 7 and .NET Core Blueprints, authored by Dirk Strauss and Jas Rademeyer. This book is a step-by-step guide that will teach you essential .NET Core and C# concepts with the help of real-world projects.[/box] We will get started with creating an ASP.NET Core MVC application that will call our Azure Function to validate an email address entered into a login screen of the application: This application does no authentication at all. All it is doing is validating the email address entered. ASP.NET Core MVC authentication is a totally different topic and not the focus of this post. In Visual Studio 2017, create a new project and select ASP.NET Core Web Application from the project templates. Click on the OK button to create the project. This is shown in the following screenshot: On the next screen, ensure that .NET Core and ASP.NET Core 2.0 is selected from the drop-down options on the form. Select Web Application (Model-View-Controller) as the type of application to create. Don't bother with any kind of authentication or enabling Docker support. Just click on the OK button to create your project: After your project is created, you will see the familiar project structure in the Solution Explorer of Visual Studio: Creating the login form For this next part, we can create a plain and simple vanilla login form. For a little bit of fun, let's spice things up a bit. Have a look on the internet for some free login form templates: I decided to use a site called colorlib that provided 50 free HTML5 and CSS3 login forms in one of their recent blog posts. The URL to the article is: https://colorlib.com/wp/html5-and-css3-login-forms/. I decided to use Login Form 1 by Colorlib from their site. Download the template to your computer and extract the ZIP file. Inside the extracted ZIP file, you will see that we have several folders. Copy all the folders in this extracted ZIP file (leave the index.html file as we will use this in a minute): Next, go to the solution for your Visual Studio application. In the wwwroot folder, move or delete the contents and paste the folders from the extracted ZIP file into the wwwroot folder of your ASP.NET Core MVC application. Your wwwroot folder should now look as follows: 4. Back in Visual Studio, you will see the folders when you expand the wwwroot node in the CoreMailValidation project. 5. I also want to focus your attention to the Index.cshtml and _Layout.cshtml files. We will be modifying these files next: Open the Index.cshtml file and remove all the markup (except the section in the curly brackets) from this file. Paste the HTML markup from the index.html file from the ZIP file we extracted earlier. Do not copy the all the markup from the index.html file. Only copy the markup inside the <body></body> tags. 
Your Index.cshtml file should now look as follows: @{ ViewData["Title"] = "Login Page"; } <div class="limiter"> <div class="container-login100"> <div class="wrap-login100"> <div class="login100-pic js-tilt" data-tilt> <img src="images/img-01.png" alt="IMG"> </div> <form class="login100-form validate-form"> <span class="login100-form-title"> Member Login </span> <div class="wrap-input100 validate-input" data-validate="Valid email is required: [email protected]"> <input class="input100" type="text" name="email" placeholder="Email"> <span class="focus-input100"></span> <span class="symbol-input100"> <i class="fa fa-envelope" aria-hidden="true"></i> </span> </div> <div class="wrap-input100 validate-input" data-validate="Password is required"> <input class="input100" type="password" name="pass" placeholder="Password"> <span class="focus-input100"></span> <span class="symbol-input100"> <i class="fa fa-lock" aria-hidden="true"></i> </span> </div> <div class="container-login100-form-btn"> <button class="login100-form-btn"> Login </button> </div> <div class="text-center p-t-12"> <span class="txt1"> Forgot </span> <a class="txt2" href="#"> Username / Password? </a> </div> <div class="text-center p-t-136"> <a class="txt2" href="#"> Create your Account <i class="fa fa-long-arrow-right m-l-5" aria-hidden="true"></i> </a> </div> </form> </div> </div> </div> The code for this chapter is available on GitHub here: Next, open the Layout.cshtml file and add all the links to the folders and files we copied into the wwwroot folder earlier. Use the index.html file for reference. You will notice that the _Layout.cshtml file contains the following piece of code—@RenderBody(). This is a placeholder that specifies where the Index.cshtml file content should be injected. If you are coming from ASP.NET Web Forms, think of the _Layout.cshtml page as a master page. Your Layout.cshtml markup should look as follows: <!DOCTYPE html> <html> <head> <meta charset="utf-8" /> <meta name="viewport" content="width=device-width, initial-scale=1.0" /> <title>@ViewData["Title"] - CoreMailValidation</title> <link rel="icon" type="image/png" href="~/images/icons/favicon.ico" /> <link rel="stylesheet" type="text/css" href="~/vendor/bootstrap/css/bootstrap.min.css"> <link rel="stylesheet" type="text/css" href="~/fonts/font-awesome-4.7.0/css/font-awesome.min.css"> <link rel="stylesheet" type="text/css" href="~/vendor/animate/animate.css"> <link rel="stylesheet" type="text/css" href="~/vendor/css-hamburgers/hamburgers.min.css"> <link rel="stylesheet" type="text/css" href="~/vendor/select2/select2.min.css"> <link rel="stylesheet" type="text/css" href="~/css/util.css"> <link rel="stylesheet" type="text/css" href="~/css/main.css"> </head> <body> <div class="container body-content"> @RenderBody() <hr /> <footer> <p>© 2018 - CoreMailValidation</p> </footer> </div> <script src="~/vendor/jquery/jquery-3.2.1.min.js"></script> <script src="~/vendor/bootstrap/js/popper.js"></script> <script src="~/vendor/bootstrap/js/bootstrap.min.js"></script> <script src="~/vendor/select2/select2.min.js"></script> <script src="~/vendor/tilt/tilt.jquery.min.js"></script> <script> $('.js-tilt').tilt({ scale: 1.1 }) </script> <script src="~/js/main.js"></script> @RenderSection("Scripts", required: false) </body> </html> If everything worked out right, you will see the following page when you run your ASP.NET Core MVC application. The login form is obviously totally non-functional: However, the login form is totally responsive. 
If you had to reduce the size of your browser window, you will see the form scale as your browser size reduces. This is what you want. If you want to explore the responsive design offered by Bootstrap, head on over to https://getbootstrap.com/ and go through the examples in the documentation:   The next thing we want to do is hook this login form up to our controller and call the Azure Function we created to validate the email address we entered. Let's look at doing that next. Hooking it all up To simplify things, we will be creating a model to pass to our controller: Create a new class in the Models folder of your application called LoginModel and click on the Add button:  2. Your project should now look as follows. You will see the model added to the Models folder: The next thing we want to do is add some code to our model to represent the fields on our login form. Add two properties called Email and Password: namespace CoreMailValidation.Models { public class LoginModel { public string Email { get; set; } public string Password { get; set; } } } Back in the Index.cshtml view, add the model declaration to the top of the page. This makes the model available for use in our view. Take care to specify the correct namespace where the model exists: @model CoreMailValidation.Models.LoginModel @{ ViewData["Title"] = "Login Page"; } The next portion of code needs to be written in the HomeController.cs file. Currently, it should only have an action called Index(): public IActionResult Index() { return View(); } Add a new async function called ValidateEmail that will use the base URL and parameter string of the Azure Function URL we copied earlier and call it using an HTTP request. I will not go into much detail here, as I believe the code to be pretty straightforward. All we are doing is calling the Azure Function using the URL we copied earlier and reading the return data: private async Task<string> ValidateEmail(string emailToValidate) { string azureBaseUrl = "https://core-mail- validation.azurewebsites.net/api/HttpTriggerCSharp1"; string urlQueryStringParams = $"? code=/IS4OJ3T46quiRzUJTxaGFenTeIVXyyOdtBFGasW9dUZ0snmoQfWoQ ==&email={emailToValidate}"; using (HttpClient client = new HttpClient()) { using (HttpResponseMessage res = await client.GetAsync( $"{azureBaseUrl}{urlQueryStringParams}")) { using (HttpContent content = res.Content) { string data = await content.ReadAsStringAsync(); if (data != null) { return data; } else return ""; } } } } Create another public async action called ValidateLogin. Inside the action, check to see if the ModelState is valid before continuing. For a nice explanation of what ModelState is, have a look at the following article—https://www.exceptionnotfound.net/asp-net-mvc-demystified-modelstate/. We then do an await on the ValidateEmail function, and if the return data contains the word false, we know that the email validation failed. A failure message is then passed to the TempData property on the controller. The TempData property is a place to store data until it is read. It is exposed on the controller by ASP.NET Core MVC. The TempData property uses a cookie-based provider by default in ASP.NET Core 2.0 to store the data. To examine data inside the TempData property without deleting it, you can use the Keep and Peek methods. To read more on TempData, see the Microsoft documentation here: https://docs.microsoft.com/en-us/aspnet/core/fundamentals/app-state?tabs=aspnetcore2x. 
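The book's full listing for this action is not reproduced in this extract, but based on the description above (the failure path) and below (the success path), a ValidateLogin action along the following lines would sit in HomeController.cs next to ValidateEmail. Treat it as a sketch rather than the exact code from the book; the TempData key and the messages are illustrative:

public async Task<IActionResult> ValidateLogin(LoginModel model)
{
    if (ModelState.IsValid)
    {
        // Call the Azure Function via the ValidateEmail helper shown above.
        var data = await ValidateEmail(model.Email);
        if (data.Contains("false"))
        {
            // Validation failed: stash a message in TempData and show the form again.
            TempData["Message"] = "The email address entered is invalid.";
            return RedirectToAction("Index");
        }
        TempData["Message"] = "The email address is valid.";
        return RedirectToAction("Index");
    }
    return View("Index", model);
}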
If the email validation passed, then we know that the email address is valid and we can do something else. Here, we are simply just saying that the user is logged in. In reality, we will perform some sort of authentication here and then route to the correct controller. So now you know how to call an Azure Function from an ASP.NET Core application. If you found this tutorial helpful and you'd like to learn more, go ahead and pick up the book C# 7 and .NET Core Blueprints. What is ASP.NET Core? Why ASP.NET makes building apps for mobile and web easy How to dockerize an ASP.NET Core application    

5 reasons you should learn Node.js

Richard Gall
22 Feb 2019
7 min read
Open source software in general, and JavaScript in particular, can seem like a place where boom and bust is the rule of law: rapid growth before everyone moves on to the next big thing. But Node.js is different. Although it certainly couldn't be described as new, and its growth hasn't been dramatic by any measure, over the last few years it has managed to push itself forward as one of the most widely used JavaScript tools on the planet.

Do you want to learn Node.js?

Popularity, however, can only tell you so much. The key question, if you're reading this, is whether you should learn Node.js. So, to help you decide if it's time to learn the JavaScript runtime, here's a list of the biggest reasons why you should start learning Node.js... Learn everything you need to know about Node.js with Packt's Node.js Complete Reference Guide Book Learning Path.

Node.js lets you write JavaScript on both client and server

Okay, let's get the obvious one out of the way first: Node.js is worth learning because it allows you to write JavaScript on the server. This has arguably transformed the way we think about JavaScript. Whereas in the past it was a language used specifically on the client, with the likes of PHP and Java handling the backend, it's now a language that you can use across your application.

Read next: The top 5 reasons Node.js could topple Java

This is important because it means teams can work much more efficiently together. Using different languages for backend and frontend is typically a major source of friction. Unless you have very good polyglot developers, a team is restricted to its core skills, while tooling is also more inflexible. If you're using JavaScript across the stack, it's easier to use a consistent toolchain. From a personal perspective, learning Node.js is a great starting point for full stack development. In essence, it's like an add-on that immediately expands what you can do with JavaScript. In terms of your career, then, it could well make you an invaluable asset to a development team.

Read next: How is Node.js changing web development?

Node.js allows you to build complex and powerful applications without writing complex code

Another strong argument for Node.js is that it is built for performance. This is because of two important things: Node.js' asynchronous, event-driven architecture, and the fact that it uses the V8 JavaScript engine. The significance of this is that V8 is one of the fastest implementations of JavaScript, used to power many of Google's immensely popular in-browser products (like Gmail).

Node.js is powerful because it employs an asynchronous paradigm for handling data between client and server. To clarify what this means, it's worth comparing it to the typical application server model that uses blocking I/O: in this instance, the application has to handle each request sequentially, suspending threads until they can be processed. This can add complexity to an application and, of course, slows an application down. In contrast, Node.js uses non-blocking I/O, in which a single thread can manage multiple requests. If one can't be processed immediately, it's effectively 'withheld' as a promise, which means it can be executed later without holding up other work. This means Node.js can help you build applications of considerable complexity without adding to the complexity of your code.
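To make that concrete, here is a tiny Node.js sketch (not from the article) showing non-blocking I/O in action: while the file read for one request is awaited, the single-threaded event loop carries on accepting and answering other requests. It assumes a recent Node.js version and a greeting.txt file next to the script:

const http = require('http');
const fs = require('fs').promises;

const server = http.createServer(async (req, res) => {
  // The await suspends only this request; the event loop keeps serving others.
  const greeting = await fs.readFile('./greeting.txt', 'utf8');
  res.end(greeting);
});

server.listen(3000, () => console.log('Listening on port 3000'));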
Node.js is well suited to building microservices

Microservices have become a rapidly growing architectural style that offers increased agility and flexibility over the traditional monolith. The advantages of microservices are well documented, and whether or not they're right for you now, it's likely that they're going to dominate the software landscape as the world moves away from monolithic architecture.

This fact only serves to strengthen the argument that you should learn Node.js, because it is so well suited to developing in this manner. This is because it encourages you to develop in a modular and focused manner, quite literally using specific modules to develop an application. This is distinct from, and almost at odds with, the monolithic approach to software architecture. At this point, it's probably worth highlighting that it's incredibly easy to package and publish the modules you build thanks to npm (node package manager). So, even if you haven't yet worked with microservices, learning Node.js is a good way to prepare yourself for a future where they are going to become even more prevalent.

Node.js can be used for more than just web development

We know by now that Node.js is flexible. But it's important to recognise that its flexibility means it can be used for a wide range of different purposes. Yes, the Node.js community is predominantly building applications for the web, but it's also a useful tool for those working in ops or infrastructure. This is because Node.js is a great tool for developing other development tools. If you're someone working to support a team of developers, or, indeed, to help manage an entire distributed software infrastructure, it could be vital in empowering you to get creative and build your own support tools. Even more surprisingly, Node.js can be used in some IoT projects. As this post from 2016 suggests, the two things might not be quite such strange bedfellows.

Node.js is a robust project that won't be going anywhere

As I've already said, in the JavaScript world frameworks and tools can appear and disappear quickly. That means deciding what to learn, and, indeed, what to integrate into your stack, can feel like a bit of a gamble. However, you can be sure that Node.js is here to stay. There are a number of reasons for this. For starters, there's no other tool that brings JavaScript to the server in quite the same way. But more than that, with Google betting heavily on V8 - which is, as we've seen, such an important part of the project - you can be sure it's only going to go from strength to strength.

It's also worth pointing out that Node.js went through a small crisis when io.js broke away from the main Node.js project. This feud was as much personal as it was technical, but the rift has since healed, and the Node.js Foundation now manages the whole project, helping to ensure that the software continues to evolve with other relevant technological changes and that the needs of the developers who use it continue to be met.

Conclusion: spend some time exploring Node.js before you begin using it at work

That's just 5 reasons why you should learn Node.js. You could find more, but broadly speaking these all underline its importance in today's development world. If you're still not convinced, there's a caveat. If Node.js isn't yet right for you, don't assume that it's going to fix any technological or cultural issues that have been causing you headaches. It probably won't. In fact, you should probably tackle those challenges before deciding to use it.
But that all being said, even if you don’t think it’s the right time to use Node.js professionally, that doesn’t mean it isn’t worth learning. As you can see, it’s well worth your time. Who knows where it might take you? Ready to begin learning? Purchase Node.js Complete Reference Guide or read it for free with a subscription free trial.

Building Your First Odoo Application

Packt
02 Jan 2017
22 min read
In this article by Daniel Reis, the author of the book Odoo 10 Development Essentials, we will create our first Odoo application and learn the steps needed to make it available to Odoo and install it. (For more resources related to this topic, see here.) Inspired by the notable http://todomvc.com/ project, we will build a simple To-Do application. It should allow us to add new tasks, mark them as completed, and finally clear the task list of all the already completed tasks.

Understanding applications and modules

It's common to hear about Odoo modules and applications. But what exactly is the difference between them? Module add-ons are building blocks for Odoo applications. A module can add new features to Odoo, or modify existing ones. It is a directory containing a manifest, or descriptor file, named __manifest__.py, plus the remaining files that implement its features. Applications are the way major features are added to Odoo. They provide the core elements for a functional area, such as Accounting or HR, based on which additional add-on modules modify or extend features. Because of this, they are highlighted in the Odoo Apps menu. If your module is complex, and adds new or major functionality to Odoo, you might consider creating it as an application. If your module just makes changes to existing functionality in Odoo, it is likely not an application. Whether a module is an application or not is defined in the manifest. Technically, it does not have any particular effect on how the add-on module behaves. It is only used for highlighting in the Apps list.

Creating the module basic skeleton

We should have the Odoo server at ~/odoo-dev/odoo/. To keep things tidy, we will create a new directory alongside it to host our custom modules, at ~/odoo-dev/custom-addons. Odoo includes a scaffold command to automatically create a new module directory, with a basic structure already in place. You can learn more about it with:

$ ~/odoo-dev/odoo/odoo-bin scaffold --help

You might want to keep this in mind when you start working on your next module, but we won't be using it right now, since we prefer to manually create all the structure for our module. An Odoo add-on module is a directory containing a __manifest__.py descriptor file. In previous versions, this descriptor file was named __openerp__.py. This name is still supported, but is deprecated. It also needs to be Python-importable, so it must also have an __init__.py file. The module's directory name is its technical name. We will use todo_app for it. The technical name must be a valid Python identifier: it should begin with a letter and can only contain letters, numbers, and the underscore character. The following commands create the module directory and create an empty __init__.py file in it, ~/odoo-dev/custom-addons/todo_app/__init__.py. In case you would like to do that directly from the command line, this is what you would use:

$ mkdir ~/odoo-dev/custom-addons/todo_app
$ touch ~/odoo-dev/custom-addons/todo_app/__init__.py

Next, we need to create the descriptor file. It should contain only a Python dictionary with about a dozen possible attributes; of these, only the name attribute is required. A longer description attribute and the author attribute also have some visibility and are advised.
We should now add a __manifest__.py file alongside the __init__.py file with the following content:

{
    'name': 'To-Do Application',
    'description': 'Manage your personal To-Do tasks.',
    'author': 'Daniel Reis',
    'depends': ['base'],
    'application': True,
}

The depends attribute can have a list of other modules that are required. Odoo will have them automatically installed when this module is installed. It's not a mandatory attribute, but it's advised to always have it. If no particular dependencies are needed, we should depend on the core base module. You should be careful to ensure all dependencies are explicitly set here; otherwise, the module may fail to install in a clean database (due to missing dependencies) or have loading errors, if by chance the other required modules are loaded afterwards. For our application, we don't need any specific dependencies, so we depend on the base module only.

To be concise, we chose to use very few descriptor keys, but in a real-world scenario, we recommend that you also use the additional keys, since they are relevant for the Odoo apps store:

summary: This is displayed as a subtitle for the module.
version: By default, this is 1.0. It should follow semantic versioning rules (see http://semver.org/ for details).
license: By default, this is LGPL-3.
website: This is a URL to find more information about the module. This can help people find more documentation or the issue tracker to file bugs and suggestions.
category: This is the functional category of the module, which defaults to Uncategorized. The list of existing categories can be found in the security groups form (Settings | User | Groups), in the Application field drop-down list.

These other descriptor keys are also available:

installable: It is by default True but can be set to False to disable a module.
auto_install: If this is set to True, the module will be automatically installed, provided all its dependencies are already installed. It is used for glue modules.

Since Odoo 8.0, instead of the description key, we can use a README.rst or README.md file in the module's top directory.

A word about licenses

Choosing a license for your work is very important, and you should consider carefully what is the best choice for you, and its implications. The most used licenses for Odoo modules are the GNU Lesser General Public License (LGPL) and the Affero General Public License (AGPL). The LGPL is more permissive and allows commercial derivative work, without the need to share the corresponding source code. The AGPL is a stronger open source license, and requires derivative work and service hosting to share their source code. Learn more about the GNU licenses at https://www.gnu.org/licenses/.

Adding to the add-ons path

Now that we have a minimalistic new module, we want to make it available to the Odoo instance. For that, we need to make sure the directory containing the module is in the add-ons path, and then update the Odoo module list. We will position ourselves in our work directory and start the server with the appropriate add-ons path configuration:

$ cd ~/odoo-dev
$ ./odoo/odoo-bin -d todo --addons-path="custom-addons,odoo/addons" --save

The --save option saves the options you used in a config file. This spares us from repeating them every time we restart the server: just run ./odoo-bin and the last saved options will be used. Look closely at the server log. It should have an INFO ? odoo: addons paths:[...] line. It should include our custom-addons directory.
Remember to also include any other add-ons directories you might be using. For instance, if you also have a ~/odoo-dev/extra directory containing additional modules to be used, you might want to include them also using the option: --addons-path="custom-addons,extra,odoo/addons" Now we need the Odoo instance to acknowledge the new module we just added. Installing the new module In the Apps top menu, select the Update Apps List option. This will update the module list, adding any modules that may have been added since the last update to the list. Remember that we need the developer mode enabled for this option to be visible. That is done in the Settings dashboard, in the link at the bottom right, below the Odoo version number information . Make sure your web client session is working with the right database. You can check that at the top right: the database name is shown in parenthesis, right after the user name. A way to enforce using the correct database is to start the server instance with the additional option --db-filter=^MYDB$. The Apps option shows us the list of available modules. By default it shows only application modules. Since we created an application module we don't need to remove that filter to see it. Type todo in the search and you should see our new module, ready to be installed. Now click on the module's Install button and we're ready! The Model layer Now that Odoo knows about our new module, let's start by adding a simple model to it. Models describe business objects, such as an opportunity, sales order, or partner (customer, supplier, and so on.). A model has a list of attributes and can also define its specific business. Models are implemented using a Python class derived from an Odoo template class. They translate directly to database objects, and Odoo automatically takes care of this when installing or upgrading the module. The mechanism responsible for this is Object Relational Model (ORM). Our module will be a very simple application to keep to-do tasks. These tasks will have a single text field for the description and a checkbox to mark them as complete. We should later add a button to clean the to-do list from the old completed tasks. Creating the data model The Odoo development guidelines state that the Python files for models should be placed inside a models subdirectory. For simplicity, we won't be following this here, so let's create a todo_model.py file in the main directory of the todo_app module. Add the following content to it: # -*- coding: utf-8 -*- from odoo import models, fields class TodoTask(models.Model): _name = 'todo.task' _description = 'To-do Task' name = fields.Char('Description', required=True) is_done = fields.Boolean('Done?') active = fields.Boolean('Active?', default=True) The first line is a special marker telling the Python interpreter that this file has UTF-8 so that it can expect and handle non-ASCII characters. We won't be using any, but it's a good practice to have it anyway. The second line is a Python import statement, making available the models and fields objects from the Odoo core. The third line declares our new model. It's a class derived from models.Model. The next line sets the _name attribute defining the identifier that will be used throughout Odoo to refer to this model. Note that the actual Python class name , TodoTask in this case, is meaningless to other Odoo modules. The _name value is what will be used as an identifier. Notice that this and the following lines are indented. 
If you're not familiar with Python, you should know that this is important: indentation defines a nested code block, so these four lines should all be equally indented. Then we have the _description model attribute. It is not mandatory, but it provides a user friendly name for the model records, that can be used for better user messages. The last three lines define the model's fields. It's worth noting that name and active are special field names. By default, Odoo will use the name field as the record's title when referencing it from other models. The active field is used to inactivate records, and by default, only active records will be shown. We will use it to clear away completed tasks without actually deleting them from the database. Right now, this file is not yet used by the module. We must tell Python to load it with the module in the __init__.py file. Let's edit it to add the following line: from . import todo_model That's it. For our Python code changes to take effect the server instance needs to be restarted (unless it was using the --dev mode). We won't see any menu option to access this new model, since we didn't add them yet. Still we can inspect the newly created model using the Technical menu. In the Settings top menu, go to Technical | Database Structure | Models, search for the todo.task model on the list and then click on it to see its definition: If everything goes right, it is confirmed that the model and fields were created. If you can't see them here, try a server restart with a module upgrade, as described before. We can also see some additional fields we didn't declare. These are reserved fields Odoo automatically adds to every new model. They are as follows: id: A unique, numeric identifier for each record in the model. create_date and create_uid: These specify when the record was created and who created it, respectively. write_date and write_uid: These confirm when the record was last modified and who modified it, respectively. __last_update: This is a helper that is not actually stored in the database. It is used for concurrency checks. The View layer The View layer describes the user interface. Views are defined using XML, which is used by the web client framework to generate data-aware HTML views. We have menu items that can activate the actions that can render views. For example, the Users menu item processes an action also called Users, that in turn renders a series of views. There are several view types available, such as the list and form views, and the filter options made available are also defined by particular type of view, the search view. The Odoo development guidelines state that the XML files defining the user interface should be placed inside a views/ subdirectory. Let's start creating the user interface for our To-Do application. Adding menu items Now that we have a model to store our data, we should make it available on the user interface. For that we should add a menu option to open the To-do Task model so that it can be used. Create the views/todo_menu.xml file to define a menu item and the action performed by it: <?xml version="1.0"?> <odoo> <!-- Action to open To-do Task list --> <act_window id="action_todo_task" name="To-do Task" res_model="todo.task" view_mode="tree,form" /> <!-- Menu item to open To-do Task list --> <menuitem id="menu_todo_task" name="Todos" action="action_todo_task" /> </odoo> The user interface, including menu options and actions, is stored in database tables. 
The XML file is a data file used to load those definitions into the database when the module is installed or upgraded. The preceding code is an Odoo data file, describing two records to add to Odoo: The <act_window> element defines a client-side window action that will open the todo.task model with the tree and form views enabled, in that order The <menuitem> defines a top menu item calling the action_todo_task action, which was defined before Both elements include an id attribute. This id , also called an XML ID, is very important: it is used to uniquely identify each data element inside the module, and can be used by other elements to reference it. In this case, the <menuitem> element needs to reference the action to process, and needs to make use of the <act_window> id for that. Our module does not know yet about the new XML data file. This is done by adding it to the data attribute in the __manifest__.py file. It holds the list of files to be loaded by the module. Add this attribute to the descriptor's dictionary: 'data': ['views/todo_menu.xml'], Now we need to upgrade the module again for these changes to take effect. Go to the Todos top menu and you should see our new menu option available: Even though we haven't defined our user interface view, clicking on the Todos menu will open an automatically generated form for our model, allowing us to add and edit records. Odoo is nice enough to automatically generate them so that we can start working with our model right away. Odoo supports several types of views, but the three most important ones are: tree (usually called list views), form, and search views. We'll add an example of each to our module. Creating the form view All views are stored in the database, in the ir.ui.view model. To add a view to a module, we declare a <record> element describing the view in an XML file, which is to be loaded into the database when the module is installed. Add this new views/todo_view.xml file to define our form view: <?xml version="1.0"?> <odoo> <record id="view_form_todo_task" model="ir.ui.view"> <field name="name">To-do Task Form</field> <field name="model">todo.task</field> <field name="arch" type="xml"> <form string="To-do Task"> <group> <field name="name"/> <field name="is_done"/> <field name="active" readonly="1"/> </group> </form> </field> </record> </odoo> Remember to add this new file to the data key in manifest file, otherwise our module won't know about it and it won't be loaded. This will add a record to the ir.ui.view model with the identifier view_form_todo_task. The view is for the todo.task model and is named To-do Task Form. The name is just for information; it does not have to be unique, but it should allow one to easily identify which record it refers to. In fact the name can be entirely omitted, in that case it will be automatically generated from the model name and the view type. The most important attribute is arch, and contains the view definition, highlighted in the XML code above. The <form> tag defines the view type, and in this case contains three fields. We also added an attribute to the active field to make it read-only. Adding action buttons Forms can have buttons to perform actions. These buttons are able to trigger workflow actions, run window actions—such as opening another form, or run Python functions defined in the model. They can be placed anywhere inside a form, but for document-style forms, the recommended place for them is the <header> section. 
For our application, we will add two buttons to run the methods of the todo.task model: <header> <button name="do_toggle_done" type="object" string="Toggle Done" class="oe_highlight" /> <button name="do_clear_done" type="object" string="Clear All Done" /> </header> The basic attributes of a button comprise the following: The string attribute that has the text to be displayed on the button The type attribute referring to the action it performs The name attribute referring to the identifier for that action The class attribute, which is an optional attribute to apply CSS styles, like in regular HTML The complete form view At this point, our todo.task form view should look like this: <form> <header> <button name="do_toggle_done" type="object" string="Toggle Done" class="oe_highlight" /> <button name="do_clear_done" type="object" string="Clear All Done" /> </header> <sheet> <group name="group_top"> <group name="group_left"> <field name="name"/> </group> <group name="group_right"> <field name="is_done"/> <field name="active" readonly="1" /> </group> </group> </sheet> </form> Remember that for the changes to be loaded to our Odoo database, a module upgrade is needed. To see the changes in the web client, the form needs to be reloaded: either click again on the menu option that opens it or reload the browser page (F5 in most browsers). The action buttons won't work yet, since we still need to add their business logic. The business logic layer Now we will add some logic to our buttons. This is done with Python code, using the methods in the model's Python class. Adding business logic We should edit the todo_model.py Python file to add to the class the methods called by the buttons. First we need to import the new API, so add it to the import statement at the top of the Python file: from odoo import models, fields, api The action of the Toggle Done button will be very simple: just toggle the Is Done? flag. For logic on records, use the @api.multi decorator. Here, self will represent a recordset, and we should then loop through each record. Inside the TodoTask class, add this: @api.multi def do_toggle_done(self): for task in self: task.is_done = not task.is_done return True The code loops through all the to-do task records, and for each one, modifies the is_done field, inverting its value. The method does not need to return anything, but we should have it to at least return a True value. The reason is that clients can use XML-RPC to call these methods, and this protocol does not support server functions returning just a None value. For the Clear All Done button, we want to go a little further. It should look for all active records that are done and make them inactive. Usually, form buttons are expected to act only on the selected record, but in this case, we will want it also act on records other than the current one: @api.model def do_clear_done(self): dones = self.search([('is_done', '=', True)]) dones.write({'active': False}) return True On methods decorated with @api.model, the self variable represents the model with no record in particular. We will build a dones recordset containing all the tasks that are marked as done. Then, we set on the active flag to False on them. The search method is an API method that returns the records that meet some conditions. These conditions are written in a domain, which is a list of triplets. The write method sets the values at once on all the elements of a recordset. The values to write are described using a dictionary. 
Using write here is more efficient than iterating through the recordset to assign the value to each of them one by one. Set up access security You might have noticed that upon loading, our module is getting a warning message in the server log: The model todo.task has no access rules, consider adding one. The message is pretty clear: our new model has no access rules, so it can't be used by anyone other than the admin super user. As a super user, the admin ignores data access rules, and that's why we were able to use the form without errors. But we must fix this before other users can use our model. Another issue yet to address is that we want the to-do tasks to be private to each user. Odoo supports row-level access rules, which we will use to implement that. Adding access control security To get a picture of what information is needed to add access rules to a model, use the web client and go to Settings | Technical | Security | Access Controls List: Here we can see the ACL for some models. It indicates, per security group, what actions are allowed on records. This information has to be provided by the module using a data file to load the lines into the ir.model.access model. We will add full access to the Employee group on the model. Employee is the basic access group nearly everyone belongs to. This is done using a CSV file named security/ir.model.access.csv. Let's add it with the following content: id,name,model_id:id,group_id:id,perm_read,perm_write,perm_create,perm_unlink acess_todo_task_group_user,todo.task.user,model_todo_task,base.group_user,1,1,1,1 The filename corresponds to the model to load the data into, and the first line of the file has the column names. These are the columns provided by the CSV file: id: It is the record external identifier (also known as XML ID). It should be unique in our module. name: This is a description title. It is only informative and it's best if it's kept unique. Official modules usually use a dot-separated string with the model name and the group. Following this convention, we used todo.task.user. model_id: This is the external identifier for the model we are giving access to. Models have XML IDs automatically generated by the ORM: for todo.task, the identifier is model_todo_task. group_id: This identifies the security group to give permissions to. The most important ones are provided by the base module. The Employee group is such a case and has the identifier base.group_user. The last four perm fields flag the access to grant read, write, create, or unlink (delete) access. We must not forget to add the reference to this new file in the __manifest__.py descriptor's data attribute It should look like this: 'data': [ 'security/ir.model.access.csv', 'views/todo_view.xml', 'views/todo_menu.xml', ], As before, upgrade the module for these additions to take effect. The warning message should be gone, and we can confirm that the permissions are OK by logging in with the user demo (password is also demo). If we run our tests now it they should only fail the test_record_rule test case. Summary We created a new module from the start, covering the most frequently used elements in a module: models, the three basic types of views (form, list, and search), business logic in model methods, and access security. Always remember, when adding model fields, an upgrade is needed. When changing Python code, including the manifest file, a restart is needed. 
When changing XML or CSV files, an upgrade is needed; also, when in doubt, do both: restart the server and upgrade the modules. Resources for Article: Further resources on this subject: Getting Started with Odoo Development [Article] Introduction to Odoo [Article] Web Server Development [Article]

Python 3: When to Use Object-oriented Programming

Packt
12 Aug 2010
11 min read
(For more resources on Python 3, see here.)

Treat objects as objects

This may seem obvious, but you should generally give separate objects in your problem domain a special class in your code. The process is generally to identify objects in the problem and then model their data and behaviors. Identifying objects is a very important task in object-oriented analysis and programming. But it isn't always as easy as counting the nouns in a short paragraph, as we've been doing. Remember, objects are things that have both data and behavior. If we are working with only data, we are often better off storing it in a list, set, dictionary, or some other Python data structure. On the other hand, if we are working with only behavior, with no stored data, a simple function is more suitable. An object, however, has both data and behavior.

Most Python programmers use built-in data structures unless (or until) there is an obvious need to define a class. This is a good thing; there is no reason to add an extra level of abstraction if it doesn't help organize our code. Sometimes, though, the "obvious" need is not so obvious. A Python programmer often starts by storing data in a few variables. As our program expands, we will later find that we are passing the same set of related variables to different functions. This is the time to think about grouping both variables and functions into a class. If we are designing a program to model polygons in two-dimensional space, we might start with each polygon being represented as a list of points. The points would be modeled as two-tuples (x,y) describing where that point is located. This is all data, stored in two nested data structures (specifically, a list of tuples):

square = [(1,1), (1,2), (2,2), (2,1)]

Now, if we want to calculate the distance around the perimeter of the polygon, we simply need to sum the distances between adjacent points, but to do that, we need a function to calculate the distance between two points. Here are two such functions:

import math

def distance(p1, p2):
    return math.sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)

def perimeter(polygon):
    perimeter = 0
    points = polygon + [polygon[0]]
    for i in range(len(polygon)):
        perimeter += distance(points[i], points[i+1])
    return perimeter

Now, as object-oriented programmers, we clearly recognize that a polygon class could encapsulate the list of points (data) and the perimeter function (behavior). Further, a point class might encapsulate the x and y coordinates and the distance method. But should we do this? For the previous code, maybe, maybe not. We've been studying object-oriented principles long enough that we can now write the object-oriented version in record time:

import math

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def distance(self, p2):
        return math.sqrt((self.x-p2.x)**2 + (self.y-p2.y)**2)

class Polygon:
    def __init__(self):
        self.vertices = []

    def add_point(self, point):
        self.vertices.append(point)

    def perimeter(self):
        perimeter = 0
        points = self.vertices + [self.vertices[0]]
        for i in range(len(self.vertices)):
            perimeter += points[i].distance(points[i+1])
        return perimeter

Now, to understand the difference a little better, let's compare the two APIs in use.
Here's how to calculate the perimeter of a square using the object-oriented code:

>>> square = Polygon()
>>> square.add_point(Point(1,1))
>>> square.add_point(Point(1,2))
>>> square.add_point(Point(2,2))
>>> square.add_point(Point(2,1))
>>> square.perimeter()
4.0

That's fairly succinct and easy to read, you might think, but let's compare it to the function-based code:

>>> square = [(1,1), (1,2), (2,2), (2,1)]
>>> perimeter(square)
4.0

Hmm, maybe the object-oriented API isn't so compact! On the other hand, I'd argue that it was easier to read than the function example: How do we know what the list of tuples is supposed to represent in the second version? How do we remember what kind of object (a list of two-tuples? That's not intuitive!) we're supposed to pass into the perimeter function? We would need a lot of external documentation to explain how these functions should be used. In contrast, the object-oriented code is relatively self-documenting; we just have to look at the list of methods and their parameters to know what the object does and how to use it. By the time we wrote all the documentation for the functional version, it would probably be longer than the object-oriented code. Besides, code length is a horrible indicator of code complexity. Some programmers (thankfully, not many of them are Python coders) get hung up on complicated "one-liners" that do incredible amounts of work in one line of code. One line of code that even the original author isn't able to read the next day, that is. Always focus on making your code easier to read and easier to use, not shorter.

As a quick exercise, can you think of any ways to make the object-oriented Polygon as easy to use as the functional implementation? Pause a moment and think about it. Really, all we have to do is alter our Polygon API so that it can be constructed with multiple points. Let's give it an initializer that accepts a list of Point objects. In fact, let's allow it to accept tuples too, and we can construct the Point objects ourselves, if needed:

def __init__(self, points = []):
    self.vertices = []
    for point in points:
        if isinstance(point, tuple):
            point = Point(*point)
        self.vertices.append(point)

This example simply goes through the list and ensures that any tuples are converted to points. If the object is not a tuple, we leave it as is, assuming that it is either a Point already, or an unknown duck typed object that can act like a Point. As we can see, it's not always easy to identify when an object should really be represented as a self-defined class. If we have new functions that accept a polygon argument, such as area(polygon) or point_in_polygon(polygon, x, y), the benefits of the object-oriented code become increasingly obvious. Likewise, if we add other attributes to the polygon, such as color or texture, it makes more and more sense to encapsulate that data into a class. The distinction is a design decision, but in general, the more complicated a set of data is, the more likely it is to have functions specific to that data, and the more useful it is to use a class with attributes and methods instead. When making this decision, it also pays to consider how the class will be used. If we're only trying to calculate the perimeter of one polygon in the context of a much greater problem, using a function will probably be quickest to code and easiest to use "one time only".
On the other hand, if our program needs to manipulate numerous polygons in a wide variety of ways (calculate perimeter, area, intersection with other polygons, and more), we have most certainly identified an object; one that needs to be extremely versatile. Pay additional attention to the interaction between objects. Look for inheritance relationships; inheritance is impossible to model elegantly without classes, so make sure to use them. Composition can, technically, be modeled using only data structures; for example, we can have a list of dictionaries holding tuple values, but it is often less complicated to create an object, especially if there is behavior associated with the data. Don't rush to use an object just because you can use an object, but never neglect to create a class when you need to use a class.

Using properties to add behavior to class data

Python is very good at blurring distinctions; it doesn't exactly help us to "think outside the box". Rather, it teaches us that the box is in our own head; "there is no box". Before we get into the details, let's discuss some bad object-oriented theory. Many object-oriented languages (Java is the most guilty) teach us to never access attributes directly. They teach us to write attribute access like this:

class Color:
    def __init__(self, rgb_value, name):
        self._rgb_value = rgb_value
        self._name = name

    def set_name(self, name):
        self._name = name

    def get_name(self):
        return self._name

The variables are prefixed with an underscore to suggest that they are private (in other languages it would actually force them to be private). Then the get and set methods provide access to each variable. This class would be used in practice as follows:

>>> c = Color("#ff0000", "bright red")
>>> c.get_name()
'bright red'
>>> c.set_name("red")
>>> c.get_name()
'red'

This is not nearly as readable as the direct access version that Python favors:

class Color:
    def __init__(self, rgb_value, name):
        self.rgb_value = rgb_value
        self.name = name

c = Color("#ff0000", "bright red")
print(c.name)
c.name = "red"

So why would anyone recommend the method-based syntax? Their reasoning is that someday we may want to add extra code when a value is set or retrieved. For example, we could decide to cache a value and return the cached value, or we might want to validate that the value is a suitable input. In code, we could decide to change the set_name() method as follows:

def set_name(self, name):
    if not name:
        raise Exception("Invalid Name")
    self._name = name

Now, in Java and similar languages, if we had written our original code to do direct attribute access, and then later changed it to a method like the above, we'd have a problem: Anyone who had written code that accessed the attribute directly would now have to access the method; if they don't change the access style, their code will be broken. The mantra in these languages is that we should never make public members private. This doesn't make much sense in Python since there isn't any concept of private members! Indeed, the situation in Python is much better. We can use the Python property keyword to make methods look like a class attribute. If we originally wrote our code to use direct member access, we can later add methods to get and set the name without changing the interface.
Let's see how it looks:

class Color:
    def __init__(self, rgb_value, name):
        self.rgb_value = rgb_value
        self._name = name

    def _set_name(self, name):
        if not name:
            raise Exception("Invalid Name")
        self._name = name

    def _get_name(self):
        return self._name

    name = property(_get_name, _set_name)

If we had started with the earlier non-method-based class, which set the name attribute directly, we could later change the code to look like the above. We first change the name attribute into a (semi-) private _name attribute. Then we add two more (semi-) private methods to get and set that variable, doing our validation when we set it. Finally, we have the property declaration at the bottom. This is the magic. It creates a new attribute on the Color class called name, which now replaces the previous name attribute. It sets this attribute to be a property, which calls the two methods we just created whenever the property is accessed or changed. This new version of the Color class can be used exactly the same way as the previous version, yet it now does validation when we set the name:

>>> c = Color("#0000ff", "bright red")
>>> print(c.name)
bright red
>>> c.name = "red"
>>> print(c.name)
red
>>> c.name = ""
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "setting_name_property.py", line 8, in _set_name
    raise Exception("Invalid Name")
Exception: Invalid Name

So if we'd previously written code to access the name attribute, and then changed it to use our property object, the previous code would still work, unless it was sending an empty property value, which is the behavior we wanted to forbid in the first place. Success! Bear in mind that even with the name property, the previous code is not 100% safe. People can still access the _name attribute directly and set it to an empty string if they wanted to. But if they access a variable we've explicitly marked with an underscore to suggest it is private, they're the ones that have to deal with the consequences, not us.
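As a closing aside that goes slightly beyond the extract above: the same behavior is more commonly written today with the @property decorator, which avoids the separate _get_name and _set_name names altogether. A minimal sketch equivalent to the class above:

class Color:
    def __init__(self, rgb_value, name):
        self.rgb_value = rgb_value
        self._name = name

    @property
    def name(self):
        return self._name

    @name.setter
    def name(self, value):
        # Same validation as before, just attached to the property setter.
        if not value:
            raise Exception("Invalid Name")
        self._name = value

Reading c.name and assigning to c.name behave exactly as in the property(_get_name, _set_name) version.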

Implementing C++ libraries in Delphi for HPC [Tutorial]

Pavan Ramchandani
24 Jul 2018
16 min read
Using C object files in Delphi is hard but possible. Linking to C++ object files is, however, nearly impossible. The problem does not lie within the object files themselves but in C++. While C is hardly more than an assembler with improved syntax, C++ represents a sophisticated high-level language with runtime support for strings, objects, exceptions, and more. All these features are part of almost any C++ program and are as such compiled into (almost) any object file produced by C++. In this tutorial, we will leverage various C++ libraries that enable high-performance with Delphi. It starts with memory management, which is an important program for any high performance applications. The article is an excerpt from a book written by Primož Gabrijelčič, titled Delphi High Performance. The problem here is that Delphi has no idea how to deal with any of that. C++ object is not equal to a Delphi object. Delphi has no idea how to call functions of a C++ object, how to deal with its inheritance chain, how to create and destroy such objects, and so on. The same holds for strings, exceptions, streams, and other C++ concepts. If you can compile the C++ source with C++Builder then you can create a package (.bpl) that can be used from a Delphi program. Most of the time, however, you will not be dealing with a source project. Instead, you'll want to use a commercial library that only gives you a bunch of C++ header files (.h) and one or more static libraries (.lib). Most of the time, the only Windows version of that library will be compiled with Microsoft's Visual Studio. A more general approach to this problem is to introduce a proxy DLL created in C++. You will have to create it in the same development environment as was used to create the library you are trying to link into the project. On Windows, that will in most cases be Visual Studio. That will enable us to include the library without any problems. To allow Delphi to use this DLL (and as such use the library), the DLL should expose a simple interface in the Windows API style. Instead of exposing C++ objects, the API must expose methods implemented by the objects as normal (non-object) functions and procedures. As the objects cannot cross the API boundary we must find some other way to represent them on the Delphi side. Instead of showing how to write a DLL wrapper for an existing (and probably quite complicated) C++ library, I have decided to write a very simple C++ library that exposes a single class, implementing only two methods. As compiling this library requires Microsoft's Visual Studio, which not all of you have installed, I have also included the compiled version (DllLib1.dll) in the code archive. The Visual Studio solution is stored in the StaticLib1 folder and contains two projects. StaticLib1 is the project used to create the library while the Dll1 project implements the proxy DLL. The static library implements the CppClass class, which is defined in the header file, CppClass.h. Whenever you are dealing with a C++ library, the distribution will also contain one or more header files. They are needed if you want to use a library in a C++ project—such as in the proxy DLL Dll1. The header file for the demo library StaticLib1 is shown in the following. We can see that the code implements a single CppClass class, which implements a constructor (CppClass()), destructor (~CppClass()), a method accepting an integer parameter (void setData(int)), and a function returning an integer (int getSquare()). 
The class also contains one integer private field, data: #pragma once class CppClass { int data; public: CppClass(); ~CppClass(); void setData(int); int getSquare(); }; The implementation of the CppClass class is stored in the CppClass.cpp file. You don't need this file when implementing the proxy DLL. When we are using a C++ library, we are strictly coding to the interface—and the interface is stored in the header file. In our case, we have the full source so we can look inside the implementation too. The constructor and destructor don't do anything and so I'm not showing them here. The other two methods are as follows. The setData method stores its parameter in the internal field and the getSquare function returns the squared value of the internal field: void CppClass::setData(int value) { data = value; } int CppClass::getSquare() { return data * data; } This code doesn't contain anything that we couldn't write in 60 seconds in Delphi. It does, however, serve as a perfect simple example for writing a proxy DLL. Creating such a DLL in Visual Studio is easy. You just have to select File | New | Project, and select the Dynamic-Link Library (DLL) project type from the Visual C++ | Windows Desktop branch. The Dll1 project from the code archive has only two source files. The file, dllmain.cpp was created automatically by Visual Studio and contains the standard DllMain method. You can change this file if you have to run project-specific code when a program and/or a thread attaches to, or detaches from, the DLL. In my example, this file was left just as the Visual Studio created it. The second file, StaticLibWrapper.cpp fully implements the proxy DLL. It starts with two include lines (shown in the following) which bring in the required RTL header stdafx.h and the header definition for our C++ class, CppClass.h: #include "stdafx.h" #include "CppClass.h" The proxy has to be able to find our header file. There are two ways to do that. We could simply copy it to the folder containing the source files for the DLL project, or we can add it to the project's search path. The second approach can be configured in Project | Properties | Configuration Properties | C/C++ | General | Additional Include Directories. This is also the approach used by the demonstration program. The DLL project must be able to find the static library that implements the CppClass object. The path to the library file should be set in project options, in the Configuration Properties | Linker | General | Additional Library Directories settings. You should put the name of the library (StaticLib1.lib) in the Linker | Input | Additional Dependencies settings. The next line in the source file defines a macro called EXPORT, which will be used later in the program to mark a function as exported. We have to do that for every DLL function that we want to use from the Delphi code. Later, we'll see how this macro is used: #define EXPORT comment(linker, "/EXPORT:" __FUNCTION__ "=" __FUNCDNAME__) The next part of the StaticLibWrapper.cpp file implements an IndexAllocator class, which is used internally to cache C++ objects. It associates C++ objects with simple integer identifiers, which are then used outside the DLL to represent the object. I will not show this class in the book as the implementation is not that important. You only have to know how to use it. This class is implemented as a simple static array of pointers and contains at most MAXOBJECTS objects. 
The constant MAXOBJECTS is set to 100 in the current code, which limits the number of C++ objects created by the Delphi code to 100. Feel free to modify the code if you need to create more objects. The following code fragment shows three public functions implemented by the IndexAllocator class. The Allocate function takes a pointer obj, stores it in the cache, and returns its index in the deviceIndex parameter. The result of the function is FALSE if the cache is full and TRUE otherwise. The Release function accepts an index (which was previously returned from Allocate) and marks the cache slot at that index as empty. This function returns FALSE if the index is invalid (does not represent a value returned from Allocate) or if the cache slot for that index is already empty. The last function, Get, also accepts an index and returns the pointer associated with that index. It returns NULL if the index is invalid or if the cache slot for that index is empty: bool Allocate(int& deviceIndex, void* obj) bool Release(int deviceIndex) void* Get(int deviceIndex) Let's move now to functions that are exported from the DLL. The first two—Initialize and Finalize—are used to initialize internal structures, namely the GAllocator of type IndexAllocator and to clean up before the DLL is unloaded. Instead of looking into them, I'd rather show you the more interesting stuff, namely functions that deal with CppClass. The CreateCppClass function creates an instance of CppClass, stores it in the cache, and returns its index. The important three parts of the declaration are: extern "C", WINAPI, and #pragma EXPORT. extern "C" is there to guarantee that CreateCppClass name will not be changed when it is stored in the library. The C++ compiler tends to mangle (change) function names to support method overloading (the same thing happens in Delphi) and this declaration prevents that. WINAPI changes the calling convention from cdecl, which is standard for C programs, to stdcall, which is commonly used in DLLs. Later, we'll see that we also have to specify the correct calling convention on the Delphi side. The last important part, #pragma EXPORT, uses the previously defined EXPORT macro to mark this function as exported. The CreateCppClass returns 0 if the operation was successful and -1 if it failed. The same approach is used in all functions exported from the demo DLL: extern "C" int WINAPI CreateCppClass (int& index) { #pragma EXPORT CppClass* instance = new CppClass; if (!GAllocator->Allocate(index, (void*)instance)) { delete instance; return -1; } else return 0; } Similarly, the DestroyCppClass function (not shown here) accepts an index parameter, fetches the object from the cache, and destroys it. The DLL also exports two functions that allow the DLL user to operate on an object. The first one, CppClass_setValue, accepts an index of the object and a value. It fetches the CppClass instance from the cache (given the index) and calls its setData method, passing it the value: extern "C" int WINAPI CppClass_setValue(int index, int value) { #pragma EXPORT CppClass* instance = (CppClass*)GAllocator->Get(index); if (instance == NULL) return -1; else { instance->setData(value); return 0; } } The second function, CppClass_getSquare also accepts an object index and uses it to access the CppClass object. 
After that, it calls the object's getSquare function and stores the result in the output parameter, value: extern "C" int WINAPI CppClass_getSquare(int index, int& value) { #pragma EXPORT CppClass* instance = (CppClass*)GAllocator->Get(index); if (instance == NULL) return -1; else { value = instance->getSquare(); return 0; } } A proxy DLL that uses a mapping table is a bit complicated and requires some work. We could also approach the problem in a much simpler manner—by treating an address of an object as its external identifier. In other words, the CreateCppClass function would create an object and then return its address as an untyped pointer type. A CppClass_getSquare, for example, would accept this pointer, cast it to a CppClass instance, and execute an operation on it. An alternative version of these two methods is shown in the following: extern "C" int WINAPI CreateCppClass2(void*& ptr) { #pragma EXPORT ptr = new CppClass; return 0; } extern "C" int WINAPI CppClass_getSquare2(void* index, int& value) { #pragma EXPORT value = ((CppClass*)index)->getSquare(); return 0; } This approach is simpler but offers far less security in the form of error checking. The table-based approach can check whether the index represents a valid value, while the latter version cannot know if the pointer parameter is valid or not. If we make a mistake on the Delphi side and pass in an invalid pointer, the code would treat it as an instance of a class, do some operations on it, possibly corrupt some memory, and maybe crash. Finding the source of such errors is very hard. That's why I prefer to write more verbose code that implements some safety checks on the code that returns pointers. Using a proxy DLL in Delphi To use any DLL from a Delphi program, we must firstly import functions from the DLL. There are different ways to do this—we could use static linking, dynamic linking, and static linking with delayed loading. There's plenty of information on the internet about the art of DLL writing in Delphi so I won't dig into this topic. I'll just stick with the most modern approach—delay loading. The code archive for this book includes two demo programs, which demonstrate how to use the DllLib1.dll library. The simpler one, CppClassImportDemo uses the DLL functions directly, while CppClassWrapperDemo wraps them in an easy-to-use class. Both projects use the CppClassImport unit to import the DLL functions into the Delphi program. The following code fragment shows the interface part of that unit which tells the Delphi compiler which functions from the DLL should be imported, and what parameters they have. As with the C++ part, there are three important parts to each declaration. Firstly, the stdcall specifies that the function call should use the stdcall (or what is known in C as  WINAPI) calling convention. Secondly, the name after the name specifier should match the exported function name from the C++ source. And thirdly, the delayed keyword specifies that the program should not try to find this function in the DLL when it is started but only when the code calls the function. 
This allows us to check whether the DLL is present at all before we call any of the functions: const CPP_CLASS_LIB = 'DllLib1.dll'; function Initialize: integer; stdcall; external CPP_CLASS_LIB name 'Initialize' delayed; function Finalize: integer; stdcall; external CPP_CLASS_LIB name 'Finalize' delayed; function CreateCppClass(var index: integer): integer; stdcall; external CPP_CLASS_LIB name 'CreateCppClass' delayed; function DestroyCppClass(index: integer): integer; stdcall; external CPP_CLASS_LIB name 'DestroyCppClass' delayed; function CppClass_setValue(index: integer; value: integer): integer; stdcall; external CPP_CLASS_LIB name 'CppClass_setValue' delayed; function CppClass_getSquare(index: integer; var value: integer): integer; stdcall; external CPP_CLASS_LIB name 'CppClass_getSquare' delayed; The implementation part of this unit (not shown here) shows how to catch errors that occur during delayed loading—that is, when the code that calls any of the imported functions tries to find that function in the DLL. If you get an External exception C06D007F  exception when you try to call a delay-loaded function, you have probably mistyped a name—either in C++ or in Delphi. You can use the tdump utility that comes with Delphi to check which names are exported from the DLL. The syntax is tdump -d <dll_name.dll>. If the code crashes when you call a DLL function, check whether both sides correctly define the calling convention. Also check if all the parameters have correct types on both sides and if the var parameters are marked as such on both sides. To use the DLL, the code in the CppClassMain unit firstly calls the exported Initialize function from the form's OnCreate handler to initialize the DLL. The cleanup function, Finalize is called from the OnDestroy handler to clean up the DLL. All parts of the code check whether the DLL functions return the OK status (value 0): procedure TfrmCppClassDemo.FormCreate(Sender: TObject); begin if Initialize <> 0 then ListBox1.Items.Add('Initialize failed') end; procedure TfrmCppClassDemo.FormDestroy(Sender: TObject); begin if Finalize <> 0 then ListBox1.Items.Add('Finalize failed'); end; When you click on the Use import library button, the following code executes. It uses the DLL to create a CppClass object by calling the CreateCppClass function. This function puts an integer value into the idxClass value. This value is used as an identifier that identifies a CppClass object when calling other functions. The code then calls CppClass_setValue to set the internal field of the CppClass object and CppClass_getSquare to call the getSquare method and to return the calculated value. At the end, DestroyCppClass destroys the CppClass object: procedure TfrmCppClassDemo.btnImportLibClick(Sender: TObject); var idxClass: Integer; value: Integer; begin if CreateCppClass(idxClass) <> 0 then ListBox1.Items.Add('CreateCppClass failed') else if CppClass_setValue(idxClass, SpinEdit1.Value) <> 0 then ListBox1.Items.Add('CppClass_setValue failed') else if CppClass_getSquare(idxClass, value) <> 0 then ListBox1.Items.Add('CppClass_getSquare failed') else begin ListBox1.Items.Add(Format('square(%d) = %d', [SpinEdit1.Value, value])); if DestroyCppClass(idxClass) <> 0 then ListBox1.Items.Add('DestroyCppClass failed') end; end; This approach is relatively simple but long-winded and error-prone. A better way is to write a wrapper Delphi class that implements the same public interface as the corresponding C++ class. 
The second demo, CppClassWrapperDemo contains a unit CppClassWrapper which does just that. This unit implements a TCppClass class, which maps to its C++ counterpart. It only has one internal field, which stores the index of the C++ object as returned from the CreateCppClass function: type TCppClass = class strict private FIndex: integer; public class procedure InitializeWrapper; class procedure FinalizeWrapper; constructor Create; destructor Destroy; override; procedure SetValue(value: integer); function GetSquare: integer; end; I won't show all of the functions here as they are all equally simple. One—or maybe two— will suffice. The constructor just calls the CreateCppClass function, checks the result, and stores the resulting index in the internal field: constructor TCppClass.Create; begin inherited Create; if CreateCppClass(FIndex) <> 0 then raise Exception.Create('CreateCppClass failed'); end; Similarly, GetSquare just forwards its job to the CppClass_getSquare function: function TCppClass.GetSquare: integer; begin if CppClass_getSquare(FIndex, Result) <> 0 then raise Exception.Create('CppClass_getSquare failed'); end; When we have this wrapper, the code in the main unit becomes very simple—and very Delphi-like. Once the initialization in the OnCreate event handler is done, we can just create an instance of the TCppClass and work with it: procedure TfrmCppClassDemo.FormCreate(Sender: TObject); begin TCppClass.InitializeWrapper; end; procedure TfrmCppClassDemo.FormDestroy(Sender: TObject); begin TCppClass.FinalizeWrapper; end; procedure TfrmCppClassDemo.btnWrapClick(Sender: TObject); var cpp: TCppClass; begin cpp := TCppClass.Create; try cpp.SetValue(SpinEdit1.Value); ListBox1.Items.Add(Format('square(%d) = %d', [SpinEdit1.Value, cpp.GetSquare])); finally FreeAndNil(cpp); end; end; To summarize, we learned about the C/C++ library that provides a solution for high-performance computing working with Delphi as the primary language. If you found this post useful, do check out the book Delphi High Performance to learn more about the intricacies of how to perform High-performance programming with Delphi. Exploring the Usages of Delphi Delphi: memory management techniques for parallel programming Delphi Cookbook


Fine-tune the NGINX Configuration

Packt
14 Jul 2015
20 min read
In this article by Rahul Sharma, author of the book NGINX High Performance, we will cover the following topics: NGINX configuration syntax Configuring NGINX workers Configuring NGINX I/O Configuring TCP Setting up the server (For more resources related to this topic, see here.) NGINX configuration syntax This section aims to cover it in good detail. The complete configuration file has a logical structure that is composed of directives grouped into a number of sections. A section defines the configuration for a particular NGINX module, for example, the http section defines the configuration for the ngx_http_core module. An NGINX configuration has the following syntax: Valid directives begin with a variable name and then state an argument or series of arguments separated by spaces. All valid directives end with a semicolon (;). Sections are defined with curly braces ({}). Sections can be nested in one another. The nested section defines a module valid under the particular section, for example, the gzip section under the http section. Configuration outside any section is part of the NGINX global configuration. The lines starting with the hash (#) sign are comments. Configurations can be split into multiple files, which can be grouped using the include directive. This helps in organizing code into logical components. Inclusions are processed recursively, that is, an include file can further have include statements. Spaces, tabs, and new line characters are not part of the NGINX configuration. They are not interpreted by the NGINX engine, but they help to make the configuration more readable. Thus, the complete file looks like the following code: #The configuration begins here global1 value1; #This defines a new section section { sectionvar1 value1; include file1;    subsection {    subsectionvar1 value1; } } #The section ends here global2 value2; # The configuration ends here NGINX provides the -t option, which can be used to test and verify the configuration written in the file. If the file or any of the included files contains any errors, it prints the line numbers causing the issue: $ sudo nginx -t This checks the validity of the default configuration file. If the configuration is written in a file other than the default one, use the -c option to test it. You cannot test half-baked configurations, for example, you defined a server section for your domain in a separate file. Any attempt to test such a file will throw errors. The file has to be complete in all respects. Now that we have a clear idea of the NGINX configuration syntax, we will try to play around with the default configuration. This article only aims to discuss the parts of the configuration that have an impact on performance. The NGINX catalog has large number of modules that can be configured for some purposes. This article does not try to cover all of them as the details are beyond the scope of the book. Please refer to the NGINX documentation at http://nginx.org/en/docs/ to know more about the modules. Configuring NGINX workers NGINX runs a fixed number of worker processes as per the specified configuration. In the following sections, we will work with NGINX worker parameters. These parameters are mostly part of the NGINX global context. worker_processes The worker_processes directive controls the number of workers: worker_processes 1; The default value for this is 1, that is, NGINX runs only one worker. 
The value should be changed to an optimal value depending on the number of cores available, disks, network subsystem, server load, and so on. As a starting point, set the value to the number of cores available. Determine the number of cores available using lscpu: $ lscpu Architecture:     x86_64 CPU op-mode(s):   32-bit, 64-bit Byte Order:     Little Endian CPU(s):       4 The same can be accomplished by greping out cpuinfo: $ cat /proc/cpuinfo | grep 'processor' | wc -l Now, set this value to the parameter: # One worker per CPU-core. worker_processes 4; Alternatively, the directive can have auto as its value. This determines the number of cores and spawns an equal number of workers. When NGINX is running with SSL, it is a good idea to have multiple workers. SSL handshake is blocking in nature and involves disk I/O. Thus, using multiple workers leads to improved performance. accept_mutex Since we have configured multiple workers in NGINX, we should also configure the flags that impact worker selection. The accept_mutex parameter available under the events section will enable each of the available workers to accept new connections one by one. By default, the flag is set to on. The following code shows this: events { accept_mutex on; } If the flag is turned to off, all of the available workers will wake up from the waiting state, but only one worker will process the connection. This results in the Thundering Herd phenomenon, which is repeated a number of times per second. The phenomenon causes reduced server performance as all the woken-up workers take up CPU time before going back to the wait state. This results in unproductive CPU cycles and nonutilized context switches. accept_mutex_delay When accept_mutex is enabled, only one worker, which has the mutex lock, accepts connections, while others wait for their turn. The accept_mutex_delay corresponds to the timeframe for which the worker would wait, and after which it tries to acquire the mutex lock and starts accepting new connections. The directive is available under the events section with a default value of 500 milliseconds. The following code shows this: events{ accept_mutex_delay 500ms; } worker_connections The next configuration to look at is worker_connections, with a default value of 512. The directive is present under the events section. The directive sets the maximum number of simultaneous connections that can be opened by a worker process. The following code shows this: events{    worker_connections 512; } Increase worker_connections to something like 1,024 to accept more simultaneous connections. The value of worker_connections does not directly translate into the number of clients that can be served simultaneously. Each browser opens a number of parallel connections to download various components that compose a web page, for example, images, scripts, and so on. Different browsers have different values for this, for example, IE works with two parallel connections while Chrome opens six connections. The number of connections also includes sockets opened with the upstream server, if any. worker_rlimit_nofile The number of simultaneous connections is limited by the number of file descriptors available on the system as each socket will open a file descriptor. If NGINX tries to open more sockets than the available file descriptors, it will lead to the Too many opened files message in the error.log. 
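The limit NGINX keeps bumping into here is the per-process RLIMIT_NOFILE resource limit. As a purely illustrative aside (this is a standalone sketch, not NGINX source code), the following small C-style program shows how any process can query that limit through the POSIX getrlimit call; it reports the same numbers that the shell command shown next exposes:

#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    /* RLIMIT_NOFILE is the maximum number of file descriptors
       (sockets included) that this process may have open at once. */
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
        perror("getrlimit");
        return 1;
    }
    printf("soft limit: %llu, hard limit: %llu\n",
           (unsigned long long)rl.rlim_cur,
           (unsigned long long)rl.rlim_max);
    return 0;
}

Exceeding the soft limit is exactly what produces the Too many opened files error mentioned above.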
Check the number of file descriptors using ulimit: $ ulimit -n Now, increase this to a value more than worker_processes * worker_connections. The value should be increased for the user that runs the worker process. Check the user directive to get the username. NGINX provides the worker_rlimit_nofile directive, which can be an alternative way of setting the available file descriptors rather than modifying ulimit. Setting the directive will have a similar impact as updating ulimit for the worker user. The value of this directive overrides the ulimit value set for the user. The directive is not present by default. Set a large value to handle a large number of simultaneous connections. The following code shows this: worker_rlimit_nofile 20960; To determine the OS limits imposed on a process, read the file /proc/$pid/limits. $pid corresponds to the PID of the process. multi_accept The multi_accept flag enables an NGINX worker to accept as many connections as possible when it gets the notification of a new connection. The purpose of this flag is to accept all connections in the listen queue at once. If the directive is disabled, a worker process will accept connections one by one. The following code shows this: events{ multi_accept on; } The directive is available under the events section with the default value off. If the server has a constant stream of incoming connections, enabling multi_accept may result in a worker accepting more connections than the number specified in worker_connections. The overflow will lead to performance loss as the previously accepted connections, part of the overflow, will not get processed. use NGINX provides several methods for connection processing. Each of the available methods allows NGINX workers to monitor multiple socket file descriptors, that is, to detect when there is data available for reading/writing. These calls allow NGINX to process multiple socket streams without getting stuck in any one of them. The methods are platform-dependent, and the configure command, used to build NGINX, selects the most efficient method available on the platform. If we want to use other methods, they must be enabled first in NGINX. The use directive allows us to override the default method with the method specified. The directive is part of the events section: events { use select; } NGINX supports the following methods of processing connections: select: This is the standard method of processing connections. It is built automatically on platforms that lack more efficient methods. The module can be enabled or disabled using the --with-select_module or --without-select_module configuration parameter. poll: This is the standard method of processing connections. It is built automatically on platforms that lack more efficient methods. The module can be enabled or disabled using the --with-poll_module or --without-poll_module configuration parameter. kqueue: This is an efficient method of processing connections available on FreeBSD 4.1, OpenBSD 2.9+, NetBSD 2.0, and OS X. There are the additional directives kqueue_changes and kqueue_events. These directives specify the number of changes and events that NGINX will pass to the kernel. The default value for both of these is 512. The kqueue method will ignore the multi_accept directive if it has been enabled. epoll: This is an efficient method of processing connections available on Linux 2.6+. The method is similar to the FreeBSD kqueue. There is also the additional directive epoll_events. This specifies the number of events that NGINX will pass to the kernel.
The default value for this is 512. /dev/poll: This is an efficient method of processing connections available on Solaris 7 11/99+, HP/UX 11.22+, IRIX 6.5.15+, and Tru64 UNIX 5.1A+. This has the additional directives, devpoll_events and devpoll_changes. The directives specify the number of changes and events that NGINX will pass to the kernel. The default value for both of these is 32. eventport: This is an efficient method of processing connections available on Solaris 10. The method requires necessary security patches to avoid kernel crash issues. rtsig: Real-time signals is a connection processing method available on Linux 2.2+. The method has some limitations. On older kernels, there is a system-wide limit of 1,024 signals. For high loads, the limit needs to be increased by setting the rtsig-max parameter. For kernel 2.6+, instead of the system-wide limit, there is a limit on the number of outstanding signals for each process. NGINX provides the worker_rlimit_sigpending parameter to modify the limit for each of the worker processes: worker_rlimit_sigpending 512; The parameter is part of the NGINX global configuration. If the queue overflows, NGINX drains the queue and uses the poll method to process the unhandled events. When the condition is back to normal, NGINX switches back to the rtsig method of connection processing. NGINX provides the rtsig_overflow_events, rtsig_overflow_test, and rtsig_overflow_threshold parameters to control how a signal queue is handled on overflows. The rtsig_overflow_events parameter defines the number of events passed to poll. The rtsig_overflow_test parameter defines the number of events handled by poll, after which NGINX will drain the queue. Before draining the signal queue, NGINX will look up how much it is filled. If the factor is larger than the specified rtsig_overflow_threshold, it will drain the queue. The rtsig method requires accept_mutex to be set. The method also enables the multi_accept parameter. Configuring NGINX I/O NGINX can also take advantage of the Sendfile and direct I/O options available in the kernel. In the following sections, we will try to configure parameters available for disk I/O. Sendfile When a file is transferred by an application, the kernel first buffers the data and then sends the data to the application buffers. The application, in turn, sends the data to the destination. The Sendfile method is an improved method of data transfer, in which data is copied between file descriptors within the OS kernel space, that is, without transferring data to the application buffers. This results in improved utilization of the operating system's resources. The method can be enabled using the sendfile directive. The directive is available for the http, server, and location sections. http{ sendfile on; } The flag is set to off by default. Direct I/O The OS kernel usually tries to optimize and cache any read/write requests. Since the data is cached within the kernel, any subsequent read request to the same place will be much faster because there's no need to read the information from slow disks. Direct I/O is a feature of the filesystem where reads and writes go directly from the applications to the disk, thus bypassing all OS caches. This results in better utilization of CPU cycles and improved cache effectiveness. The method is used in places where the data has a poor hit ratio. Such data does not need to be in any cache and can be loaded when required. It can be used to serve large files. The directio directive enables the feature. 
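To give a feel for what this means at the operating-system level, the following is a rough, Linux-specific sketch of a direct I/O read using the O_DIRECT flag. It is illustrative only (the file path is made up and error handling is minimal), and it is not how NGINX itself is implemented:

#define _GNU_SOURCE                      /* needed for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    int fd = open("/var/www/video/big.mp4", O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return 1; }

    void* buf = NULL;
    /* Direct I/O requires the buffer, offset, and length to be aligned to
       the device block size; this is what directio_alignment corresponds to. */
    if (posix_memalign(&buf, 512, 1 << 20) != 0) { close(fd); return 1; }

    ssize_t n = read(fd, buf, 1 << 20);  /* bypasses the kernel page cache */
    printf("read %zd bytes with O_DIRECT\n", n);

    free(buf);
    close(fd);
    return 0;
}

Because every read now has an alignment requirement and no cache to fall back on, direct I/O only pays off for large, rarely re-read files, which is exactly the use case described here.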
The directive is available for the http, server, and location sections: location /video/ { directio 4m; } Any file with size more than that specified in the directive will be loaded by direct I/O. The parameter is disabled by default. The use of direct I/O to serve a request will automatically disable Sendfile for the particular request. Direct I/O depends on the block size while doing a data transfer. NGINX has the directio_alignment directive to set the block size. The directive is present under the http, server, and location sections: location /video/ { directio 4m; directio_alignment 512; } The default value of 512 bytes works well for all boxes unless it is running a Linux implementation of XFS. In such a case, the size should be increased to 4 KB. Asynchronous I/O Asynchronous I/O allows a process to initiate I/O operations without having to block or wait for it to complete. The aio directive is available under the http, server, and location sections of an NGINX configuration. Depending on the section, the parameter will perform asynchronous I/O for the matching requests. The parameter works on Linux kernel 2.6.22+ and FreeBSD 4.3. The following code shows this: location /data { aio on; } By default, the parameter is set to off. On Linux, aio needs to be enabled with directio, while on FreeBSD, sendfile needs to be disabled for aio to take effect. If NGINX has not been configured with the --with-file-aio module, any use of the aio directive will cause the unknown directive aio error. The directive has a special value of threads, which enables multithreading for send and read operations. The multithreading support is only available on the Linux platform and can only be used with the epoll, kqueue, or eventport methods of processing requests. In order to use the threads value, configure multithreading in the NGINX binary using the --with-threads option. Post this, add a thread pool in the NGINX global context using the thread_pool directive. Use the same pool in the aio configuration: thread_pool io_pool threads=16; http{ ….....    location /data{      sendfile   on;      aio       threads=io_pool;    } } Mixing them up The three directives can be mixed together to achieve different objectives on different platforms. The following configuration will use sendfile for files with size smaller than what is specified in directio. Files served by directio will be read using asynchronous I/O: location /archived-data/{ sendfile on; aio on; directio 4m; } The aio directive has a sendfile value, which is available only on the FreeBSD platform. The value can be used to perform Sendfile in an asynchronous manner: location /archived-data/{ sendfile on; aio sendfile; } NGINX invokes the sendfile() system call, which returns with no data in the memory. Post this, NGINX initiates data transfer in an asynchronous manner. Configuring TCP HTTP is an application-based protocol, which uses TCP as the transport layer. In TCP, data is transferred in the form of blocks known as TCP packets. NGINX provides directives to alter the behavior of the underlying TCP stack. These parameters alter flags for an individual socket connection. TCP_NODELAY TCP/IP networks have the "small packet" problem, where single-character messages can cause network congestion on a highly loaded network. Such packets are 41 bytes in size, where 40 bytes are for the TCP header and 1 byte has useful information. These small packets have huge overhead, around 4000 percent and can saturate a network. 
John Nagle solved the problem (Nagle's algorithm) by not sending the small packets immediately. All such packets are collected for some amount of time and then sent in one go as a single packet. This results in improved efficiency of the underlying network. Thus, a typical TCP/IP stack waits for up to 200 milliseconds before sending the data packets to the client. It is important to note that the problem exists with applications such as Telnet, where each keystroke is sent over the wire. The problem is not relevant to a web server, which serves static files. The files will mostly form full TCP packets, which can be sent immediately instead of waiting for 200 milliseconds. The TCP_NODELAY option can be used while opening a socket to disable Nagle's buffering algorithm and send the data as soon as it is available. NGINX provides the tcp_nodelay directive to enable this option. The directive is available under the http, server, and location sections of an NGINX configuration: http{ tcp_nodelay on; } The directive is enabled by default. NGINX uses tcp_nodelay for connections in the keep-alive mode. TCP_CORK As an alternative to Nagle's algorithm, Linux provides the TCP_CORK option. The option tells the TCP stack to append packets and send them when they are full or when the application instructs it to send the packet by explicitly removing TCP_CORK. This results in an optimal amount of data packets being sent and, thus, improves the efficiency of the network. The TCP_CORK option is available as the TCP_NOPUSH flag on FreeBSD and Mac OS. NGINX provides the tcp_nopush directive to enable TCP_CORK over the connection socket. The directive is available under the http, server, and location sections of an NGINX configuration: http{ tcp_nopush on; } The directive is disabled by default. NGINX uses tcp_nopush for requests served with sendfile. Setting them up The two directives discussed previously seem to pull in opposite directions; the former makes sure that the network latency is reduced, while the latter tries to optimize the data packets sent. An application should set both of these options to get efficient data transfer. Enabling tcp_nopush along with sendfile makes sure that while transferring a file, the kernel creates the maximum number of full TCP packets before sending them over the wire. The last packet(s) can be partial TCP packets, which could end up waiting with TCP_CORK being enabled. NGINX makes sure it removes TCP_CORK to send these packets. Since tcp_nodelay is also set, these packets are then immediately sent over the network, that is, without any delay. Setting up the server The following configuration sums up all the changes proposed in the preceding sections: worker_processes 3; worker_rlimit_nofile 8000;   events { multi_accept on; use epoll; worker_connections 1024; }   http { sendfile on; aio on; directio 4m; tcp_nopush on; tcp_nodelay on; # Rest Nginx configuration removed for brevity } It is assumed that NGINX runs on a quad-core server. Thus, three worker processes have been spawned to take advantage of three out of the four available cores, leaving one core for other processes. Each of the workers has been configured to work with 1,024 connections. Correspondingly, the nofile limit has been increased to 8,000. By default, all worker processes operate with mutex; thus, the flag has not been set. Each worker processes multiple connections in one go using the epoll method.
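For readers who have not used epoll directly, here is a heavily simplified, hedged sketch of the kind of readiness loop an epoll-based worker runs; the function name and buffer sizes are invented for illustration, and this bears no relation to NGINX's actual source:

#include <sys/epoll.h>

/* listen_fd is assumed to be a non-blocking socket that is already listening. */
void event_loop(int listen_fd) {
    int ep = epoll_create1(0);
    struct epoll_event ev, events[64];

    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);    /* watch for new connections */

    for (;;) {
        /* One call returns every descriptor that is ready, which is how a
           single worker can juggle thousands of connections without blocking. */
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].data.fd == listen_fd) {
                /* accept() the new client(s) and add them with epoll_ctl() */
            } else {
                /* read() from or write() to the ready client socket */
            }
        }
    }
}

This readiness-based loop is what the use epoll; line in the configuration above selects.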
In the http section, NGINX has been configured to serve files larger than 4 MB using direct I/O, while efficiently buffering smaller files using Sendfile. TCP options have also been set up to efficiently utilize the available network. Measuring gains It is time to test the changes and make sure that they have given performance gain. Run a series of tests using Siege/JMeter to get new performance numbers. The tests should be performed with the same configuration to get a comparable output: $ siege -b -c 790 -r 50 -q http://192.168.2.100/hello   Transactions:               79000 hits Availability:               100.00 % Elapsed time:               24.25 secs Data transferred:           12.54 MB Response time:             0.20 secs Transaction rate:           3257.73 trans/sec Throughput:                 0.52 MB/sec Concurrency:               660.70 Successful transactions:   39500 Failed transactions:       0 Longest transaction:       3.45 Shortest transaction:       0.00 The results from Siege should be evaluated and compared to the baseline. Throughput: The transaction rate defines this as 3250 requests/second Error rate: Availability is reported as 100 percent; thus; the error rate is 0 percent Response time: The results shows a response time of 0.20 seconds Thus, these new numbers demonstrate performance improvement in various respects. After the server configuration is updated with all the changes, reperform all tests with increased numbers. The aim should be to determine the new baseline numbers for the updated configuration. Summary The article started with an overview of the NGINX configuration syntax. Going further, we discussed worker_connections and the related parameters. These allow you to take advantage of the available hardware. The article also talked about the different event processing mechanisms available on different platforms. The configuration discussed helped in processing more requests, thus improving the overall throughput. NGINX is primarily a web server; thus, it has to serve all kinds static content. Large files can take advantage of direct I/O, while smaller content can take advantage of Sendfile. The different disk modes make sure that we have an optimal configuration to serve the content. In the TCP stack, we discussed the flags available to alter the default behavior of TCP sockets. The tcp_nodelay directive helps in improving latency. The tcp_nopush directive can help in efficiently delivering the content. Both these flags lead to improved response time. In the last part of the article, we applied all the changes to our server and then did performance tests to determine the effectiveness of the changes done. In the next article, we will try to configure buffers, timeouts, and compression to improve the utilization of the available network. Resources for Article: Further resources on this subject: Using Nginx as a Reverse Proxy [article] Nginx proxy module [article] Introduction to nginx [article]

Why does the C programming language refuse to die?

Kunal Chaudhari
23 Oct 2018
8 min read
As a technology research analyst, I try to keep up the pace with the changing world of technology. It seems like every single day, there is a new programming language, framework, or tool emerging out of nowhere. In order to keep up, I regularly have a peek at the listicles on TIOBE, PyPL, and Stackoverflow along with some twitter handles and popular blogs, which keeps my FOMO (fear of missing out) in check. So here I was, strolling through the TIOBE index, to see if a new programming language is making the rounds or if any old timer language is facing its doomsday in the lower half of the table. The first thing that caught my attention was Python, which interestingly broke into the top 3 for the first time since it was ranked by TIOBE. I never cared to look at Java, since it has been claiming the throne ever since it became popular. But with my pupils dilated, I saw something which I would have never expected, especially with the likes of Python, C#, Swift, and JavaScript around. There it was, the language which everyone seemed to have forgotten about, C, sitting at the second position, like an old tower among the modern skyscrapers in New York. A quick scroll down shocked me even more: C was only recently named the language of 2017 by TIOBE. The reason it won was because of its impressive yearly growth of 1.69% and its consistency - C has been featured in the top 3 list for almost four decades now. This result was in stark contrast to many news sources (including Packt’s own research) that regularly place languages like Python and JavaScript on top of their polls. But surely this was an indicator of something. Why would a language which is almost 50 years old still hold its ground against the ranks of newer programming language? C has a design philosophy for the ages A solution to the challenges of UNIX and Assembly The 70s was a historic decade for computing. Many notable inventions and developments, particularly in the area of networking, programming, and file systems, took place. UNIX was one such revolutionary milestone, but the biggest problem with UNIX was that it was programmed in Assembly language. Assembly was fine for machines, but difficult for humans. Watch now: Learn and Master C Programming For Absolute Beginners So, the team working on UNIX, namely Dennis Ritchie, Ken Thompson, and Brian Kernighan decided to develop a language which could understand data types and supported data structures. They wanted C to be as fast as the Assembly but with the features of a high-level language. And that’s how C came into existence, almost out of necessity. But the principles on which the C programming language was built were not coincidental. It compelled the programmers to write better code and strive for efficiency rather than being productive by providing a lot of abstractions. Let’s discuss some features which makes C a language to behold. Portability leads to true ubiquity When you try to search for the biggest feature of C, almost instantly, you are bombarded with articles on portability. Which makes you wonder what is it about portability that makes C relevant in the modern world of computing. Well, portability can be defined as the measure of how easily software can be transferred from one computer environment or architecture to another. One can also argue that portability is directly proportional to how flexible your software is. 
Applications or software developed using C are considered to be extremely flexible because you can find a C compiler for almost every possible platform available today. So if you develop your application by simply exercising some discipline to write portable code, you have yourself an application which virtually runs on every major platform. Programmer-driven memory management It is universally accepted that C is a high-performance language. The primary reason for this is that it works very close to the machine, almost like an Assembly language. But very few people realize that versatile features like explicit memory management make C one of the better-performing languages out there. Memory management allows programmers to scale down a program to run with a small amount of memory. This feature was important in the early days because the computers, or terminals as they were called then, were not as powerful as they are today. But the advent of mobile devices and embedded systems has renewed the interest of programmers in the C language because these mobile devices demand that programmers keep memory requirements to a minimum. Many of the programming languages today provide functionalities like garbage collection that take care of memory allocation. But C calls programmers' bluff by asking them to be very specific. This makes their programs memory-efficient and inherently fast. Manual memory management makes C one of the most suitable languages for developing other programming languages. This is because even in a garbage collector someone has to take care of memory allocation - that infrastructure is provided by C. Structure is all I got As discussed before, Assembly was difficult to work with, particularly when dealing with large chunks of code. C has a structured approach in its design which allows programmers to break down the program into multiple blocks of code for execution, often called procedures or functions. There are, of course, multiple ways in which software development can be approached. Structured programming is one such approach that is effective when you need to break down a problem into its component pieces and then convert it into application code. Although it might not be quite as in vogue as object-oriented programming is today, this approach is well suited to tasks like database scripting or developing small programs with logical sequences to carry out a specific set of tasks. As one of the best languages for structured programming, it's easy to see how C has remained popular, especially in the context of embedded systems and kernel development. Applications that stand the test of time If Beyoncé had been a programmer, she might well have sung "Who runs the world? C developers". And she would have been right. If you're using a digital alarm clock, a microwave, or a car with anti-lock brakes, chances are that they have been programmed using C. Though it was never developed specifically for embedded systems, C has become the de facto programming language for embedded developers, systems programmers, and kernel development. C: the backbone of our operating systems We already know that the world-famous UNIX system was developed in C, but is it the only popular application that has been developed using C? You'll be astonished to see the list of applications that follows: The desktop operating system market is dominated by three major operating systems: Windows, macOS, and Linux. 
The kernel of all these OSes has been developed using the C programming language. Similarly, Android, iOS, and Windows are some of the popular mobile operating systems whose kernels were developed in C. Just like UNIX, the development of Oracle Database began on Assembly and then switched to C. It’s still widely regarded as one of the best database systems in the world. Not only Oracle but MySQL and PostgreSQL have also been developed using C - the list goes on and on. What does the future hold for C? So far we discussed the high points of C programming, it’s design principle and the applications that were developed using it. But the bigger question to ask is, what its future might hold. The answer to this question is tricky, but there are several indicators which show positive signs. IoT is one such domain where the C programming language shines. Whether or not beginner programmers should learn C has been a topic of debate everywhere. The general consensus says that learning C is always a good thing, as it builds up your fundamental knowledge of programming and it looks good on the resume. But IoT provides another reason to learn C, due to the rapid growth in the IoT industry. We already saw the massive number of applications built on C and their codebase is still maintained in it. Switching to a different language means increased cost for the company. Since it is used by numerous enterprises across the globe the demand for C programmers is unlikely to vanish anytime soon. Read Next Rust as a Game Programming Language: Is it any good? Google releases Oboe, a C++ library to build high-performance Android audio apps Will Rust Replace C++?

The V programming language is now open source - is it too good to be true?

Bhagyashree R
24 Jun 2019
5 min read
Yesterday, a new statically-typed programming language named V was open sourced. It is described as a simple, fast, and compiled language for creating maintainable software. Its creator, Alex Medvednikov, says that it is very similar to Go and is inspired by Oberon, Rust, and Swift. What to expect from the V programming language Fast compilation V can compile up to 1.2 million lines of code per second per CPU. It achieves this by direct machine code generation and strong modularity. If we decide to emit C code, the compilation speed drops to approximately 100k lines of code per second per CPU. Medvednikov mentions that direct machine code generation is still in its very early stages and right now only supports x64/Mach-O. He plans to make this feature stable by the end of this year. Safety On paper it looks like an ideal language: it has no null, no global variables, no undefined values, no undefined behavior, no variable shadowing, and it performs bounds checking. It supports immutable variables, pure functions, and immutable structs by default. Generics are currently a work in progress and are planned for next month. Performance According to the website, V is as fast as C, requires minimal allocations, and supports built-in serialization without runtime reflection. It compiles to native binaries without any dependencies. Just a 0.4 MB compiler Compared to Go, Rust, GCC, and Clang, V requires far less disk space and builds far more quickly. The entire language and standard library is just 400 KB and you can build it in 0.4s. By the end of this year, the author aims to bring this build time down to 0.15s. C/C++ translation V allows you to translate your V code to C or C++. However, this feature is at a very early stage, given that C and C++ are very complex languages. The creator aims to make this feature stable by the end of this year. What do developers think about this language? As much as developers would like to have a great language for building applications, many felt that V is too good to be true. Looking at the claims made on the site, some developers thought that the creator is either not being truthful about the capabilities of V or is scamming people. https://twitter.com/warnvod/status/1112571835558825986 A language that has the simplicity of Go and the memory management model of Rust is what everyone desires. However, the main reason that makes people skeptical about V is that there is not much proof behind the hard claims it makes. A user on Hacker News commented, "...V's author makes promises and claims which are then retracted, falsified, or untestable. Most notably, the source for V's toolchain has been teased repeatedly as coming soon but has never been released. Without an open toolchain, none of the claims made on V's front page [2] can be verified." Another thing that makes this case concerning is that the V programming language is currently in the alpha stage and is incomplete. Despite that, the creator is making $827 per month from his Patreon account. "However, advertising a product can do something and then releasing it stating it cannot do it yet, is one thing, but accepting money for a product that does not what is advertised, is a fraud," a user commented. Some developers are also speculating that the creator is maybe just embarrassed to open source his code because of bad coding pattern choices. A user speculates, "V is not Free Software, which is disappointing but not atypical; however, V is not even open source, which precludes a healthy community. 
Additionally, closed languages tend to have bad patterns like code dumps over the wall, poor community communication, untrustworthy binary behaviors, and delayed product/feature releases. Yes, it's certainly embarrassing to have years of history on display for everybody to see, but we all apparently have gotten over it. What's hiding in V's codebase? We don't know. As a best guess, I think that the author may be ashamed of the particular nature of their bootstrap.” The features listed on the official website are incredible. The only concern was that the creator was not being transparent about how he plans to achieve them. Also, as this was closed source earlier, there was no way for others to verify the performance guarantees it promises that’s why so much confusion happened. Alex Medvednikov on why you can trust V programming On an issue that was reported on GitHub, the creator commented, “So you either believe me or you don't, we'll see who is right in June. But please don't call me a liar, scammer and spread misinformation.” Medvednikov was maybe overwhelmed by the responses and speculations, he was seeing on different discussion forums. Developing a whole new language requires a lot of work and perhaps his deadlines are ambitious. Going by the release announcement Medvednikov made yesterday, he is aware that the language designing process hasn’t been the most elegant version of his vision. He wrote, “There are lots of hacks I'm really embarrassed about, like using os.system() instead of native API calls, especially on Windows. There's a lot of ugly C code with #, which I regret adding at all.” Here’s great advice shared by a developer on V’s GitHub repository: Take your time, good software takes time. It's easy to get overwhelmed building Free software: sometimes it's better to say "no" or "not for now" in order to build great things in the long run :) Visit the official website of the V programming language for more detail. Docker and Microsoft collaborate over WSL 2, future of Docker Desktop for Windows is near Pull Panda is now a part of GitHub; code review workflows now get better! Scala 2.13 is here with overhauled collections, improved compiler performance, and more!