
How-To Tutorials - Data


Prime numbers, modern encryption and their ancient Greek connection!

Richard Gall
16 Mar 2018
2 min read
Prime numbers are hugely important in a number of fields, from computer science to cybersecurity. But they are also deeply mysterious - quite literally enigmas. They have been the subject of thousands of years of research and exploration, yet they still have not been cracked - we have yet to find a formula that generates prime numbers easily. Prime numbers are particularly integral to modern encryption: when a file is encrypted, the number used to do so is built from two primes. The only way to decrypt it is to work out the prime factors of that gargantuan number, a task that is almost impossible even with the most extensive computing power currently at our disposal. Prime numbers are also used in error-correcting codes and in mass storage and data transmission. Did you know that the ancient Greeks laid some of the groundwork for our modern encryption systems? Find out how to generate prime numbers manually using a method devised by the Greek mathematician Eratosthenes in this fun video from the Packt video course Fundamental Algorithms in Scala: https://www.youtube.com/watch?v=cd8v-Jo8obs&t=56s
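For readers who would rather skim code than watch the video, here is a minimal Python sketch of the Sieve of Eratosthenes (the Packt course itself works in Scala, so this is only an illustrative rendering of the same idea):

def sieve_of_eratosthenes(limit):
    """Return all prime numbers up to and including limit."""
    if limit < 2:
        return []
    # Assume every number is prime until a smaller prime rules it out.
    is_prime = [True] * (limit + 1)
    is_prime[0] = is_prime[1] = False
    for candidate in range(2, int(limit ** 0.5) + 1):
        if is_prime[candidate]:
            # Cross out every multiple of the candidate, starting at its square.
            for multiple in range(candidate * candidate, limit + 1, candidate):
                is_prime[multiple] = False
    return [n for n, prime in enumerate(is_prime) if prime]

print(sieve_of_eratosthenes(30))   # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]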


Connecting your data to MongoDB using PyMongo and PHP

Amey Varangaonkar
16 Mar 2018
6 min read
[box type="note" align="" class="" width=""]The following book excerpt is taken from the title Mastering MongoDB 3.x written by Alex Giamas. This book covers the fundamental as well as advanced tasks related to working with MongoDB.[/box] There are two ways to connect your data to MongoDB. The first is using the driver for your programming language. The second is by using an ODM (Object Document Mapper) layer to map your model objects to MongoDB in a transparent way. In this post, we will cover both methods using two of the most popular languages for web application development: Python, using the official MongoDB low level driver, PyMongo, and PHP, using the official PHP driver for MongoDB Connect using Python Installing PyMongo can be done using pip or easy_install: python -m pip install pymongo python -m easy_install pymongo Then in our class we can connect to a database: >>> from pymongo import MongoClient >>> client = MongoClient() Connecting to a replica set needs a set of seed servers for the client to find out what the primary, secondary, or arbiter nodes in the set are: client = pymongo.MongoClient('mongodb://user:passwd@node1:p1,node2:p2/?replicaSet=rs name') Using the connection string URL we can pass a username/password and replicaSet name all in a single string. Some of the most interesting options for the connection string URL are presented later. Connecting to a shard requires the server host and IP for the mongo router, which is the mongos process. PyMODM ODM Similar to Ruby's Mongoid, PyMODM is an ODM for Python that follows closely on Django's built-in ORM. Installing it can be done via pip: pip install pymodm Then we need to edit settings.py and replace the database engine with a dummy database: DATABASES = { 'default': { 'ENGINE': 'django.db.backends.dummy' } } And add our connection string anywhere in settings.py: from pymodm import connect connect("mongodb://localhost:27017/myDatabase", alias="MyApplication") Here we have to use a connection string that has the following structure: mongodb://[username:password@]host1[:port1][,host2[:port2],...[,hostN[:port N]]][/[database][?options]] Options have to be pairs of name=value with an & between each pair. Some interesting pairs are: Model classes need to inherit from MongoModel. A sample class will look like this: from pymodm import MongoModel, fields class User(MongoModel): email = fields.EmailField(primary_key=True) first_name = fields.CharField() last_name = fields.CharField() This has a User class with first_name, last_name, and email fields where email is the primary field. Inheritance with PyMODM models Handling one-one and one-many relationships in MongoDB can be done using references or embedding. This example shows both ways: references for the model user and embedding for the comment model: from pymodm import EmbeddedMongoModel, MongoModel, fields class Comment(EmbeddedMongoModel): author = fields.ReferenceField(User) content = fields.CharField() class Post(MongoModel): title = fields.CharField() author = fields.ReferenceField(User) revised_on = fields.DateTimeField() content = fields.CharField() comments = fields.EmbeddedDocumentListField(Comment) Connecting with PHP The MongoDB PHP driver was rewritten from scratch two years ago to support the PHP 5, PHP 7, and HHVM architectures. The current architecture is shown in the following diagram: Currently we have official drivers for all three architectures with full support for the underlying functionality. Installation is a two-step process. 
First, we need to install the MongoDB extension. The extension depends on the version of PHP (or HHVM) that we have installed and can be installed using Homebrew on a Mac. For example, with PHP 7.0:

brew install php70-mongodb

Then, using Composer (a widely used dependency manager for PHP):

composer require mongodb/mongodb

Connecting to the database can then be done by using the connection string URI or by passing an array of options. Using the connection string URI we have:

$client = new MongoDB\Client($uri = 'mongodb://127.0.0.1/', array $uriOptions = [], array $driverOptions = [])

For example, to connect to a replica set using SSL authentication:

$client = new MongoDB\Client('mongodb://myUsername:myPassword@rs1.example.com,rs2.example.com/?ssl=true&replicaSet=myReplicaSet&authSource=admin');

Or we can use the $uriOptions parameter to pass in parameters without using the connection string URL, like this:

$client = new MongoDB\Client(
    'mongodb://rs1.example.com,rs2.example.com/',
    [
        'username' => 'myUsername',
        'password' => 'myPassword',
        'ssl' => true,
        'replicaSet' => 'myReplicaSet',
        'authSource' => 'admin',
    ]
);

The set of $uriOptions and the connection string URL options available are analogous to the ones used for Ruby and Python.

Doctrine ODM

Laravel is one of the most widely used MVC frameworks for PHP, similar in architecture to Django and Rails from the Python and Ruby worlds respectively. We will follow through configuring our models using a stack of Laravel, Doctrine, and MongoDB. This section assumes that Doctrine is installed and working with Laravel 5.x.

Doctrine entities are POPOs (Plain Old PHP Objects) that, unlike Eloquent (Laravel's default ORM), don't need to inherit from the Model class. Doctrine uses the Data Mapper pattern, whereas Eloquent uses Active Record. Skipping the get()/set() methods, a simple class would look like:

use Doctrine\ORM\Mapping AS ORM;
use Doctrine\Common\Collections\ArrayCollection;

/**
 * @ORM\Entity
 * @ORM\Table(name="scientist")
 */
class Scientist
{
    /**
     * @ORM\Id
     * @ORM\GeneratedValue
     * @ORM\Column(type="integer")
     */
    protected $id;

    /**
     * @ORM\Column(type="string")
     */
    protected $firstname;

    /**
     * @ORM\Column(type="string")
     */
    protected $lastname;

    /**
     * @ORM\OneToMany(targetEntity="Theory", mappedBy="scientist", cascade={"persist"})
     * @var ArrayCollection|Theory[]
     */
    protected $theories;

    /**
     * @param $firstname
     * @param $lastname
     */
    public function __construct($firstname, $lastname)
    {
        $this->firstname = $firstname;
        $this->lastname = $lastname;
        $this->theories = new ArrayCollection;
    }

    ...

    public function addTheory(Theory $theory)
    {
        if (!$this->theories->contains($theory)) {
            $theory->setScientist($this);
            $this->theories->add($theory);
        }
    }
}

This POPO-based model uses annotations to define the field types that need to be persisted in MongoDB. For example, @ORM\Column(type="string") defines a string field, with firstname and lastname as the attribute names on the respective lines. There is a whole set of annotations available at http://doctrine-orm.readthedocs.io/en/latest/reference/annotations-reference.html. If we want to separate the POPO structure from annotations, we can also define them using YAML or XML instead of inlining them with annotations in our POPO model classes.

Inheritance with Doctrine

Modeling one-one and one-many relationships can be done via annotations, YAML, or XML.
Using annotations, we can define multiple embedded subdocuments within our document:

/** @Document */
class User
{
    // ...

    /** @EmbedMany(targetDocument="Phonenumber") */
    private $phonenumbers = array();

    // ...
}

/** @EmbeddedDocument */
class Phonenumber
{
    // ...
}

Here, a User document embeds many Phonenumber documents. @EmbedOne() embeds a single subdocument and is used for modeling one-one relationships. Referencing is similar to embedding:

/** @Document */
class User
{
    // ...

    /**
     * @ReferenceMany(targetDocument="Account")
     */
    private $accounts = array();

    // ...
}

/** @Document */
class Account
{
    // ...
}

@ReferenceMany() and @ReferenceOne() are used to model one-many and one-one relationships via referencing into a separate collection.

We saw that the process of connecting data to MongoDB using Python and PHP is quite similar. We can accordingly define relationships as embedded or referenced, depending on our design decisions.

If you found this post useful, check out the book Mastering MongoDB 3.x for more tips and techniques on working with MongoDB efficiently.
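To round off the Python side of the excerpt, here is a minimal PyMongo sketch showing what basic reads and writes look like once the connection shown earlier is in place; the database, collection, and document values are made up purely for illustration:

from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017/')
db = client['myDatabase']   # hypothetical database name
users = db['users']         # hypothetical collection name

# Insert a document; MongoDB assigns an _id automatically.
result = users.insert_one({'email': 'alex@example.com', 'first_name': 'Alex', 'last_name': 'Giamas'})
print(result.inserted_id)

# Query it back by a field value.
print(users.find_one({'email': 'alex@example.com'}))

# Updates and deletes use the same filter-document syntax.
users.update_one({'email': 'alex@example.com'}, {'$set': {'first_name': 'Alexandros'}})
users.delete_one({'email': 'alex@example.com'})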


Troubleshooting in SQL Server

Sunith Shetty
15 Mar 2018
16 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book SQL Server 2017 Administrator's Guide written by Marek Chmel and Vladimír Mužný. This book will help you learn to implement and administer successful database solution with SQL Server 2017.[/box] Today, we will perform SQL Server analysis, and also learn ways for efficient performance monitoring and tuning. Performance monitoring and tuning Performance monitoring and tuning is a crucial part of your database administration skill set so as to keep the performance and stability of your server great, and to be able to find and fix the possible issues. The overall system performance can decrease over time; your system may work with more data or even become totally unresponsive. In such cases, you need the skills and tools to find the issue to bring the server back to normal as fast as possible. We can use several tools on the operating system layer and, then, inside the SQL Server to verify the performance and the possible root cause of the issue. The first tool that we can use is the performance monitor, which is available on your Windows Server: Performance monitor can be used to track important SQL Server counters, which can be very helpful in evaluating the SQL Server Performance. To add a counter, simply right-click on the monitoring screen in the Performance monitoring and tuning section and use the Add Counters item. If the SQL Server instance that you're monitoring is a default instance, you will find all the performance objects listed as SQL Server. If your instance is named, then the performance objects will be listed as MSSQL$InstanceName in the list of performance objects. We can split the important counters to watch between the system counters for the whole server and specific SQL Server counters. The list of system counters to watch include the following: Processor(_Total)% Processor Time: This is a counter to display the CPU load. Constant high values should be investigated to verify whether the current load does not exceed the performance limits of your HW or VM server, or if your workload is not running with proper indexes and statistics, and is generating bad query plans. MemoryAvailable MBytes: This counter displays the available memory on the operating system. There should always be enough memory for the operating system. If this counter drops below 64MB, it will send a notification of low memory and the SQL Server will reduce the memory usage. Physical Disk—Avg. Disk sec/Read: This disk counter provides the average latency information for your storage system; be careful if your storage is made of several different disks to monitor the proper storage system. Physical Disk: This indicates the average disk writes per second. Physical Disk: This indicates the average disk reads per second. Physical Disk: This indicates the number of disk writes per second. System—Processor Queue Length: This counter displays the number of threads waiting on a system CPU. If the counter is above 0, this means that there are more requests than the CPU can handle, and if the counter is constantly above 0, this may signal performance issues. Network interface: This indicates the total number of bytes per second. 
Once you have added all these system counters, you can see the values in real time, or you can configure a data collection, which will run for a specified time and periodically collect the information. With SQL Server-specific counters, we can dig deeper into the CPU, memory, and storage utilization to see what SQL Server is doing and how it is utilizing the subsystems.

SQL Server memory monitoring and troubleshooting

Important counters to watch for SQL Server memory utilization include counters from the SQLServer:Buffer Manager performance object and from SQLServer:Memory Manager:

SQLServer:Buffer Manager—Buffer cache hit ratio: This counter displays the ratio of how often SQL Server can find the proper data in the cache when a query requests such data. If the data is not found in the cache, it has to be read from the disk. The higher the counter, the better the overall performance, since memory access is usually faster than the disk subsystem.
SQLServer:Buffer Manager—Page life expectancy: This counter measures how long a page can stay in memory, in seconds. The longer a page can stay in memory, the less likely it is that SQL Server will need to access the disk in order to get the data into memory again.
SQLServer:Memory Manager—Total Server Memory (KB): This is the amount of memory the server has committed using the memory manager.
SQLServer:Memory Manager—Target Server Memory (KB): This is the ideal amount of memory the server can consume. On a stable system, the target and total should be equal unless you face memory pressure. Once the memory is utilized after the warm-up of your server, these two counters should not drop significantly, which would be another indication of system-level memory pressure, where the SQL Server memory manager has to deallocate memory.
SQLServer:Memory Manager—Memory Grants Pending: This counter displays the total number of SQL Server processes that are waiting to be granted memory from the memory manager.

To check the performance counters, you can also use a T-SQL query against the sys.dm_os_performance_counters DMV:

SELECT [counter_name] AS [Counter Name],
       [cntr_value]/1024 AS [Server Memory (MB)]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
  AND [counter_name] IN ('Total Server Memory (KB)', 'Target Server Memory (KB)')

This query will return two values—one for target memory and one for total memory. These two should be close to each other on a warmed-up system.
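If you want to collect these counters from a script rather than from SSMS or Performance Monitor, the same DMV can be queried from Python. The sketch below assumes the pyodbc package and an ODBC driver are installed, and the connection string is a placeholder you would adjust for your own environment:

import pyodbc

# Hypothetical connection string - adjust the driver, server name, and authentication.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=master;Trusted_Connection=yes"
)

query = """
SELECT [counter_name], [cntr_value]/1024 AS server_memory_mb
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
  AND [counter_name] IN ('Total Server Memory (KB)', 'Target Server Memory (KB)')
"""

cursor = conn.cursor()
for counter_name, memory_mb in cursor.execute(query):
    # On a warmed-up, healthy instance the two values should be close to each other.
    print(counter_name.strip(), memory_mb, "MB")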
Another query you can use gets the information from a DMV named sys.dm_os_sys_memory:

SELECT total_physical_memory_kb/1024/1024 AS [Physical Memory (GB)],
       available_physical_memory_kb/1024/1024 AS [Available Memory (GB)],
       system_memory_state_desc AS [System Memory State]
FROM sys.dm_os_sys_memory WITH (NOLOCK)
OPTION (RECOMPILE)

This query will display the available physical memory and the total physical memory of your server, with several possible memory states:

Available physical memory is high (this is the state you would like to see on your system, indicating there is no lack of memory)
Physical memory usage is steady
Available physical memory is getting low
Available physical memory is low

The memory grants can be verified with a T-SQL query:

SELECT [object_name] AS [Object name],
       cntr_value AS [Memory Grants Pending]
FROM sys.dm_os_performance_counters WITH (NOLOCK)
WHERE [object_name] LIKE N'%Memory Manager%'
  AND counter_name = N'Memory Grants Pending'
OPTION (RECOMPILE);

If you face memory issues, there are several steps you can take for improvement:

Check and configure your SQL Server max memory usage
Add more RAM to your server; the limit for Standard Edition is 128 GB and there is no limit for Enterprise Edition
Use Lock Pages in Memory
Optimize your queries

SQL Server storage monitoring and troubleshooting

The important counters to watch for SQL Server storage utilization include counters from the SQL Server:Access Methods performance object:

SQL Server:Access Methods—Full Scans/sec: This counter displays the number of full scans per second, which can be either table or full-index scans
SQL Server:Access Methods—Index Searches/sec: This counter displays the number of index searches per second
SQL Server:Access Methods—Forwarded Records/sec: This counter displays the number of forwarded records per second

Monitoring the disk system is crucial, since your disk is used as storage for the following:

Data files
Log files
The tempDB database
The page file
Backup files

To verify the disk latency and IOPS metrics of your drives, you can use Performance Monitor, or T-SQL commands that query the sys.dm_os_volume_stats and sys.dm_io_virtual_file_stats DMFs. A simple place to start is a T-SQL script utilizing the first DMF to check the space available within the database files:

SELECT f.database_id, f.file_id, volume_mount_point, total_bytes, available_bytes
FROM sys.master_files AS f
CROSS APPLY sys.dm_os_volume_stats(f.database_id, f.file_id);

To check the I/O file stats with the second DMF, you can use a T-SQL script that returns information about the tempDB data files:

SELECT *
FROM sys.dm_io_virtual_file_stats (NULL, NULL) vfs
JOIN sys.master_files mf ON mf.database_id = vfs.database_id AND mf.file_id = vfs.file_id
WHERE mf.database_id = 2 AND mf.type = 0

To measure disk performance, we can use a tool named DiskSpd, a replacement for the older SQLIO tool, which was used for a long time. DiskSpd is an external utility that is not available on the operating system by default. It can be downloaded from GitHub or the TechNet Gallery at https://github.com/microsoft/diskspd.
The following example runs a test for 15 seconds using a single thread to drive 100 percent random 64 KB reads at a depth of 15 overlapped (outstanding) I/Os to a regular file:

DiskSpd -d15 -F1 -w0 -r -b64k -o15 d:\datafile.dat

Troubleshooting wait statistics

We can use the Wait Statistics approach for a thorough understanding of the SQL Server workload and undertake performance troubleshooting based on the collected data. Wait Statistics are based on the fact that, any time a request has to wait for a resource, SQL Server tracks this information, and we can use it for further analysis.

When we consider any user process, it can include several threads. A thread is a single unit of execution on SQL Server, where SQLOS controls the thread scheduling instead of relying on the operating system layer. Each processor core has its own scheduler component responsible for executing such threads. To see the available schedulers in your SQL Server, you can use the following query:

SELECT * FROM sys.dm_os_schedulers

This code will return all the schedulers in your SQL Server; some will be displayed as visible online and some as hidden online. The hidden ones are for internal system tasks, while the visible ones are used by user tasks running on the SQL Server. There is one more scheduler, which is displayed as Visible Online (DAC). This one is used for the dedicated administrator connection, which comes in handy when the SQL Server stops responding. To use a dedicated admin connection, you can modify your SSMS connection to use the DAC, or you can use a switch with the sqlcmd.exe utility to connect to the DAC. To connect to the default instance with DAC on your server, you can use the following command:

sqlcmd.exe -E -A

Each thread can be in one of three possible states:

running: This indicates that the thread is running on the processor
suspended: This indicates that the thread is waiting for a resource on a waiter list
runnable: This indicates that the thread is waiting for execution on a runnable queue

Each running thread runs until it has to wait for a resource to become available or until it has exhausted the CPU time allowed for a running thread, which is set to 4 ms. This 4 ms time is visible in the output of the previous query to sys.dm_os_schedulers and is called a quantum. When a thread requires a resource, it is moved away from the processor to a waiter list, where it waits for the resource to become available. Once the resource is available, the thread is notified and moves to the bottom of the runnable queue. Any waiting thread can be found via the following code, which will display the waiting threads and the resources they are waiting for:

SELECT * FROM sys.dm_os_waiting_tasks

Threads then transition between execution on the CPU, the waiter list, and the runnable queue. There is a special case: when a thread does not need to wait for any resource and has already run for 4 ms on the CPU, it is moved directly to the runnable queue instead of the waiter list. When the thread is waiting on the waiter list, we talk about resource wait time. When the thread is waiting on the runnable queue to get on the CPU for execution, we talk about signal time. The total wait time is, then, the sum of the signal and resource wait times.
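As a quick illustration of the signal-versus-resource split described above, the following toy Python sketch, using made-up wait numbers rather than real DMV output, shows how the percentages in the next script are derived and how the 30 percent signal-wait rule of thumb would be applied:

# Hypothetical (wait_type, wait_time_ms, signal_wait_time_ms) rows.
waits = [
    ("PAGEIOLATCH_SH", 12000, 800),
    ("CXPACKET", 9000, 4000),
    ("LCK_M_IX", 5000, 600),
]

total_wait = sum(w[1] for w in waits)
total_signal = sum(w[2] for w in waits)
total_resource = total_wait - total_signal

signal_pct = 100.0 * total_signal / total_wait
print("signal waits:   %.1f%%" % signal_pct)
print("resource waits: %.1f%%" % (100.0 * total_resource / total_wait))

# Rule of thumb from the text: sustained signal waits above ~30% suggest CPU pressure.
if signal_pct > 30:
    print("Possible CPU pressure - threads are queuing for the scheduler.")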
You can find the ratio of the signal to resource wait times with the following script: Select signalWaitTimeMs=sum(signal_wait_time_ms)  ,'%signal waits' = cast(100.0 * sum(signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2))  ,resourceWaitTimeMs=sum(wait_time_ms - signal_wait_time_ms)  ,'%resource waits'= cast(100.0 * sum(wait_time_ms - signal_wait_time_ms) / sum (wait_time_ms) as numeric(20,2)) from sys.dm_os_wait_stats When the ratio goes over 30 percent for the signal waits, then there will be a serious CPU pressure and your processor(s) will have a hard time handling all the incoming requests from the threads. The following query can then grab the wait statistics and display the most frequent wait types, which were recorded through the thread executions, or actually during the time the threads were waiting on the waiter list for any particular resource: WITH [Waits] AS  (SELECT    [wait_type],   [wait_time_ms] / 1000.0 AS [WaitS], ([wait_time_ms] - [signal_wait_time_ms]) / 1000.0 AS [ResourceS], [signal_wait_time_ms] / 1000.0 AS [SignalS], [waiting_tasks_count] AS [WaitCount], 100.0 * [wait_time_ms] / SUM ([wait_time_ms]) OVER() AS [Percentage], ROW_NUMBER() OVER(ORDER BY [wait_time_ms] DESC) AS [RowNum] FROM sys.dm_os_wait_stats WHERE [wait_type] NOT IN ( N'BROKER_EVENTHANDLER', N'BROKER_RECEIVE_WAITFOR', N'BROKER_TASK_STOP', N'BROKER_TO_FLUSH', N'BROKER_TRANSMITTER', N'CHECKPOINT_QUEUE', N'CHKPT', N'CLR_AUTO_EVENT', N'CLR_MANUAL_EVENT', N'CLR_SEMAPHORE', N'DIRTY_PAGE_POLL', N'DISPATCHER_QUEUE_SEMAPHORE', N'EXECSYNC', N'FSAGENT', N'FT_IFTS_SCHEDULER_IDLE_WAIT', N'FT_IFTSHC_MUTEX', N'HADR_CLUSAPI_CALL', N'HADR_FILESTREAM_IOMGR_IOCOMPLETION', N'HADR_LOGCAPTURE_WAIT', N'HADR_NOTIFICATION_DEQUEUE', N'HADR_TIMER_TASK', N'HADR_WORK_QUEUE', N'KSOURCE_WAKEUP', N'LAZYWRITER_SLEEP', N'LOGMGR_QUEUE', N'MEMORY_ALLOCATION_EXT', N'ONDEMAND_TASK_QUEUE', N'PREEMPTIVE_XE_GETTARGETSTATE', N'PWAIT_ALL_COMPONENTS_INITIALIZED', N'PWAIT_DIRECTLOGCONSUMER_GETNEXT', N'QDS_PERSIST_TASK_MAIN_LOOP_SLEEP', N'QDS_ASYNC_QUEUE', N'QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP', N'QDS_SHUTDOWN_QUEUE', N'REDO_THREAD_PENDING_WORK', N'REQUEST_FOR_DEADLOCK_SEARCH', N'RESOURCE_QUEUE', N'SERVER_IDLE_CHECK', N'SLEEP_BPOOL_FLUSH', N'SLEEP_DBSTARTUP', N'SLEEP_DCOMSTARTUP', N'SLEEP_MASTERDBREADY', N'SLEEP_MASTERMDREADY', N'SLEEP_MASTERUPGRADED', N'SLEEP_MSDBSTARTUP', N'SLEEP_SYSTEMTASK', N'SLEEP_TASK', N'SLEEP_TEMPDBSTARTUP', N'SNI_HTTP_ACCEPT', N'SP_SERVER_DIAGNOSTICS_SLEEP', N'SQLTRACE_BUFFER_FLUSH', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP', N'SQLTRACE_WAIT_ENTRIES', N'WAIT_FOR_RESULTS', N'WAITFOR', N'WAITFOR_TASKSHUTDOWN', N'WAIT_XTP_RECOVERY', N'WAIT_XTP_HOST_WAIT', N'WAIT_XTP_OFFLINE_CKPT_NEW_LOG', N'WAIT_XTP_CKPT_CLOSE', N'XE_DISPATCHER_JOIN', N'XE_DISPATCHER_WAIT', N'XE_TIMER_EVENT' ) AND [waiting_tasks_count] > 0 ) SELECT MAX ([W1].[wait_type]) AS [WaitType], CAST (MAX ([W1].[WaitS]) AS DECIMAL (16,2)) AS [Wait_S], CAST (MAX ([W1].[ResourceS]) AS DECIMAL (16,2)) AS [Resource_S], CAST (MAX ([W1].[SignalS]) AS DECIMAL (16,2)) AS [Signal_S], MAX ([W1].[WaitCount]) AS [WaitCount], CAST (MAX ([W1].[Percentage]) AS DECIMAL (5,2)) AS [Percentage], CAST ((MAX ([W1].[WaitS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgWait_S], CAST ((MAX ([W1].[ResourceS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgRes_S], CAST ((MAX ([W1].[SignalS]) / MAX ([W1].[WaitCount])) AS DECIMAL (16,4)) AS [AvgSig_S] FROM [Waits] AS [W1] INNER JOIN [Waits] AS [W2] ON [W2].[RowNum] <= [W1].[RowNum] GROUP BY [W1].[RowNum] HAVING SUM 
([W2].[Percentage]) - MAX ([W1].[Percentage]) < 95
GO

This code comes from the whitepaper SQL Server Performance Tuning Using Wait Statistics by Erin Stellato and Jonathan Kehayias, published by SQLskills, which in turn uses the full query by Paul Randal, available at https://www.sqlskills.com/blogs/paul/wait-statistics-or-please-tell-me-where-it-hurts/.

Some of the typical wait stats you will see are:

PAGEIOLATCH

The PAGEIOLATCH wait type is used when a thread is waiting for a page to be read into the buffer pool from the disk. This wait type comes in two main forms:

PAGEIOLATCH_SH: The page will be read from the disk
PAGEIOLATCH_EX: The page will be modified

You may quickly assume that the storage has to be the problem, but that may not be the case. Like any other wait, it needs to be considered in correlation with other wait types and other available counters to correctly find the root cause of slow SQL Server operations. The page may be read into the buffer pool because it was previously removed due to memory pressure and is needed again. So, you may also investigate the following:

Buffer Manager: Page life expectancy
Buffer Manager: Buffer cache hit ratio

Also, you need to consider the following as possible factors contributing to PAGEIOLATCH wait types:

Large scans versus seeks on the indexes
Implicit conversions
Outdated statistics
Missing indexes

PAGELATCH

This wait type is quite frequently confused with PAGEIOLATCH, but PAGELATCH is used for pages already present in memory. The thread waits for access to such a page, again with possible PAGELATCH_SH and PAGELATCH_EX wait types. A pretty common situation with this wait type is tempDB contention, where you need to analyze which page is being waited for and what type of query is actually waiting for such a resource. As a solution to tempDB contention, you can do the following:

Add more tempDB data files
Use trace flags 1118 and 1117 for tempDB on systems older than SQL Server 2016

CXPACKET

This wait type is encountered when threads are running in parallel. The CXPACKET wait type itself does not mean that there is really any problem on the SQL Server. But if such a wait type accumulates very quickly, it may be a signal of skewed statistics, which require an update, or of a parallel scan on a table where proper indexes are missing. The option for parallelism is controlled via the MAXDOP setting, which can be configured at the following levels:

The server level
The database level
A query level, with a hint

We learned about SQL Server analysis with the Wait Statistics troubleshooting methodology and the DMVs that give more insight into problems occurring in SQL Server. To know more about how to successfully create, design, and deploy databases using SQL Server 2017, do check out the book SQL Server 2017 Administrator's Guide.


5 polarizing Quotes from Professor Stephen Hawking on artificial intelligence

Richard Gall
15 Mar 2018
3 min read
Professor Stephen Hawking died today (March 14, 2018), aged 76, at his home in Cambridge, UK. Best known for his theory of cosmology that unified quantum mechanics with Einstein's General Theory of Relativity, and for his book A Brief History of Time, which brought his concepts to a wider general audience, Professor Hawking was quite possibly one of the most important and well-known voices in the scientific world. Among many things, Professor Hawking had a lot to say about artificial intelligence - its dangers, its opportunities, and what we should be thinking about, not just as scientists and technologists, but as humans. Over the years, Hawking remained cautious and consistent in his views on the topic, constantly urging AI researchers and machine learning developers to consider the wider implications of their work on society and the human race itself. The machine learning community is quite divided on the issues Hawking raised and will probably continue to be so as the field grows faster than it can be fathomed. Here are five widely debated things Stephen Hawking said about AI, arranged in chronological order - and if you're going to listen to anyone, you've surely got to listen to him.

On artificial intelligence ending the human race

The development of full artificial intelligence could spell the end of the human race... It would take off on its own, and re-design itself at an ever-increasing rate. Humans, who are limited by slow biological evolution, couldn't compete and would be superseded.

From an interview with the BBC, December 2014

On the future of AI research

The establishment of shared theoretical frameworks, combined with the availability of data and processing power, has yielded remarkable successes in various component tasks such as speech recognition, image classification, autonomous vehicles, machine translation, legged locomotion, and question-answering systems. As capabilities in these areas and others cross the threshold from laboratory research to economically valuable technologies, a virtuous cycle takes hold whereby even small improvements in performance are worth large sums of money, prompting greater investments in research. There is now a broad consensus that AI research is progressing steadily, and that its impact on society is likely to increase.... Because of the great potential of AI, it is important to research how to reap its benefits while avoiding potential pitfalls.

From Research Priorities for Robust and Beneficial Artificial Intelligence, an open letter co-signed by Hawking, January 2015

On AI emulating human intelligence

I believe there is no deep difference between what can be achieved by a biological brain and what can be achieved by a computer. It, therefore, follows that computers can, in theory, emulate human intelligence — and exceed it.

From a speech given by Hawking at the opening of the Leverhulme Centre for the Future of Intelligence, Cambridge, UK, October 2016

On making artificial intelligence benefit humanity

Perhaps we should all stop for a moment and focus not only on making our AI better and more successful but also on the benefit of humanity.

From a speech given by Hawking at Web Summit in Lisbon, November 2017

On AI replacing humans

The genie is out of the bottle. We need to move forward on artificial intelligence development but we also need to be mindful of its very real dangers. I fear that AI may replace humans altogether. If people design computer viruses, someone will design AI that replicates itself.
This will be a new form of life that will outperform humans.

From an interview with Wired, November 2017


Selecting Statistical-based Features in Machine Learning application

Pravin Dhandre
14 Mar 2018
16 min read
In today's tutorial, we will work on one of the methods of executing feature selection, the statistical-based method, for interpreting both quantitative and qualitative datasets. Feature selection attempts to reduce the size of the original dataset by subsetting the original features and shortlisting the best ones with the highest predictive power. We may intelligently choose which feature selection method might work best for us, but in reality, a very valid way of working in this domain is to work through examples of each method and measure the performance of the resulting pipeline. To begin, let's take a look at the subclass of feature selection modules that rely on statistical tests to select viable features from a dataset.

Statistical-based feature selections

Statistics provides us with relatively quick and easy methods of interpreting both quantitative and qualitative data. We have used some statistical measures in previous chapters to obtain new knowledge and perspective around our data, specifically in that we recognized mean and standard deviation as metrics that enabled us to calculate z-scores and scale our data. In this tutorial, we will rely on two new concepts to help us with our feature selection:

Pearson correlations
Hypothesis testing

Both of these methods are known as univariate methods of feature selection, meaning that they are quick and handy when the problem is to select out single features at a time in order to create a better dataset for our machine learning pipeline.

Using Pearson correlation to select features

We have actually looked at correlations in this book already, but not in the context of feature selection. We already know that we can invoke a correlation calculation in pandas by calling the following method:

credit_card_default.corr()

The output of the preceding code is a correlation matrix of every column against every other column (shown as a table in the book). The Pearson correlation coefficient (which is the default for pandas) measures the linear relationship between columns. The value of the coefficient varies between -1 and +1, where 0 implies no correlation between them. Correlations closer to -1 or +1 imply an extremely strong linear relationship. It is worth noting that Pearson's correlation generally requires that each column be normally distributed (which we are not assuming). We can also largely ignore this requirement because our dataset is large (over 500 rows is the threshold).

The pandas .corr() method calculates a Pearson correlation coefficient for every column versus every other column. This 24 column by 24 row matrix is very unruly, and in the past, we used heatmaps to try and make the information more digestible:

# using seaborn to generate heatmaps
import seaborn as sns
import matplotlib.style as style

# use a clean stylization for our charts and graphs
style.use('fivethirtyeight')

sns.heatmap(credit_card_default.corr())

Note that the heatmap function automatically chose the most correlated features to show us. That being said, we are, for the moment, concerned with the features' correlations to the response variable. We will assume that the more correlated a feature is to the response, the more useful it will be. Any feature that is not as strongly correlated will not be as useful to us. Correlation coefficients are also used to determine feature interactions and redundancies. A key method of reducing overfitting in machine learning is spotting and removing these redundancies.
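Before moving on, here is a small illustrative sketch, on made-up data rather than the credit card dataset, showing that pandas' .corr() is simply the covariance of two columns divided by the product of their standard deviations:

import numpy as np
import pandas as pd

# Made-up example data: y is roughly 2*x plus noise, so the correlation should be strongly positive.
rng = np.random.RandomState(0)
x = rng.normal(size=100)
y = 2 * x + rng.normal(scale=0.5, size=100)

# Pearson correlation by hand: covariance divided by the product of the standard deviations.
manual_r = np.cov(x, y)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# The same number via pandas.
df = pd.DataFrame({'x': x, 'y': y})
pandas_r = df.corr().loc['x', 'y']

print(round(manual_r, 4), round(pandas_r, 4))   # the two values should match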
We will be tackling this problem in our model-based selection methods. Let's isolate the correlations between the features and the response variable, using the following code:

# just correlations between every feature and the response
credit_card_default.corr()['default payment next month']

LIMIT_BAL                    -0.153520
SEX                          -0.039961
EDUCATION                     0.028006
MARRIAGE                     -0.024339
AGE                           0.013890
PAY_0                         0.324794
PAY_2                         0.263551
PAY_3                         0.235253
PAY_4                         0.216614
PAY_5                         0.204149
PAY_6                         0.186866
BILL_AMT1                    -0.019644
BILL_AMT2                    -0.014193
BILL_AMT3                    -0.014076
BILL_AMT4                    -0.010156
BILL_AMT5                    -0.006760
BILL_AMT6                    -0.005372
PAY_AMT1                     -0.072929
PAY_AMT2                     -0.058579
PAY_AMT3                     -0.056250
PAY_AMT4                     -0.056827
PAY_AMT5                     -0.055124
PAY_AMT6                     -0.053183
default payment next month    1.000000

We can ignore the final row, as it is the response variable correlated perfectly to itself. We are looking for features that have correlation coefficient values close to -1 or +1. These are the features that we might assume are going to be useful. Let's use pandas filtering to isolate features that have at least .2 correlation (positive or negative). Let's do this by first defining a pandas mask, which will act as our filter, using the following code:

# filter only correlations stronger than .2 in either direction (positive or negative)
credit_card_default.corr()['default payment next month'].abs() > .2

LIMIT_BAL                     False
SEX                           False
EDUCATION                     False
MARRIAGE                      False
AGE                           False
PAY_0                          True
PAY_2                          True
PAY_3                          True
PAY_4                          True
PAY_5                          True
PAY_6                         False
BILL_AMT1                     False
BILL_AMT2                     False
BILL_AMT3                     False
BILL_AMT4                     False
BILL_AMT5                     False
BILL_AMT6                     False
PAY_AMT1                      False
PAY_AMT2                      False
PAY_AMT3                      False
PAY_AMT4                      False
PAY_AMT5                      False
PAY_AMT6                      False
default payment next month     True

Every False in the preceding pandas Series represents a feature with a correlation value between -.2 and .2 inclusive, while True values correspond to features with a correlation value above .2 or below -.2. Let's plug this mask into our pandas filtering, using the following code:

# store the features
highly_correlated_features = credit_card_default.columns[credit_card_default.corr()['default payment next month'].abs() > .2]
highly_correlated_features

Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5', u'default payment next month'], dtype='object')

The variable highly_correlated_features is supposed to hold the features of the dataframe that are highly correlated to the response; however, we do have to get rid of the name of the response column, as including that in our machine learning pipeline would be cheating:

# drop the response variable
highly_correlated_features = highly_correlated_features.drop('default payment next month')
highly_correlated_features

Index([u'PAY_0', u'PAY_2', u'PAY_3', u'PAY_4', u'PAY_5'], dtype='object')

So, now we have five features from our original dataset that are meant to be predictive of the response variable, so let's try them out with the help of the following code:

# only include the five highly correlated features
X_subsetted = X[highly_correlated_features]

get_best_model_and_accuracy(d_tree, tree_params, X_subsetted, y)
# barely worse, but about 20x faster to fit the model

Best Accuracy: 0.819666666667
Best Parameters: {'max_depth': 3}
Average Time to Fit (s): 0.01
Average Time to Score (s): 0.002

Our accuracy is definitely worse than the accuracy to beat, .8203, but also note that the fitting time dropped by a factor of about 20. Our model is able to learn almost as well as with the entire dataset with only five features.
Moreover, it is able to learn as much in a much shorter timeframe. Let's bring back our scikit-learn pipelines and include our correlation choosing methodology as a part of our preprocessing phase. To do this, we will have to create a custom transformer that invokes the logic we just went through, as a pipeline-ready class. We will call our class CustomCorrelationChooser, and it will have to implement both a fit and a transform logic:

The fit logic will select columns from the features matrix that are higher than a specified threshold
The transform logic will subset any future datasets to only include those columns that were deemed important

from sklearn.base import TransformerMixin, BaseEstimator

class CustomCorrelationChooser(TransformerMixin, BaseEstimator):
    def __init__(self, response, cols_to_keep=[], threshold=None):
        # store the response series
        self.response = response
        # store the threshold that we wish to keep
        self.threshold = threshold
        # initialize a variable that will eventually
        # hold the names of the features that we wish to keep
        self.cols_to_keep = cols_to_keep

    def transform(self, X):
        # the transform method simply selects the appropriate
        # columns from the original dataset
        return X[self.cols_to_keep]

    def fit(self, X, *_):
        # create a new dataframe that holds both features and response
        df = pd.concat([X, self.response], axis=1)
        # store names of columns that meet the correlation threshold
        self.cols_to_keep = df.columns[df.corr()[df.columns[-1]].abs() > self.threshold]
        # only keep columns in X; for example, this will remove the response variable
        self.cols_to_keep = [c for c in self.cols_to_keep if c in X.columns]
        return self

Let's take our new correlation feature selector for a spin, with the help of the following code:

# instantiate our new feature selector
ccc = CustomCorrelationChooser(threshold=.2, response=y)
ccc.fit(X)
ccc.cols_to_keep

['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5']

Our class has selected the same five columns as we found earlier. Let's test out the transform functionality by calling it on our X matrix, using the following code:

ccc.transform(X).head()

We see that the transform method has eliminated the other columns and kept only the features that met our .2 correlation threshold. Now, let's put it all together in our pipeline, with the help of the following code:

# instantiate our feature selector with the response variable set
ccc = CustomCorrelationChooser(response=y)

# make our new pipeline, including the selector
ccc_pipe = Pipeline([('correlation_select', ccc),
                     ('classifier', d_tree)])

# make a copy of the decision tree pipeline parameters
ccc_pipe_params = deepcopy(tree_pipe_params)

# update that dictionary with feature selector specific parameters
ccc_pipe_params.update({'correlation_select__threshold': [0, .1, .2, .3]})

print ccc_pipe_params
# {'correlation_select__threshold': [0, 0.1, 0.2, 0.3], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

# better than the original (by a little), and a bit faster on average overall
get_best_model_and_accuracy(ccc_pipe, ccc_pipe_params, X, y)

Best Accuracy: 0.8206
Best Parameters: {'correlation_select__threshold': 0.1, 'classifier__max_depth': 5}
Average Time to Fit (s): 0.105
Average Time to Score (s): 0.003

Wow! Our first attempt at feature selection and we have already beaten our goal (albeit by a little bit).
Our pipeline is showing us that if we threshold at 0.1, we have eliminated enough noise to improve accuracy and also cut down on the fitting time (from .158 seconds without the selector). Let's take a look at which columns our selector decided to keep:

# check the threshold of .1
ccc = CustomCorrelationChooser(threshold=0.1, response=y)
ccc.fit(X)

# check which columns were kept
ccc.cols_to_keep

['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

It appears that our selector has decided to keep the five columns that we found, as well as two more, the LIMIT_BAL and PAY_6 columns. Great! This is the beauty of automated pipeline gridsearching in scikit-learn. It allows our models to do what they do best and intuit things that we could not have on our own.

Feature selection using hypothesis testing

Hypothesis testing is a methodology in statistics that allows for a bit more complex statistical testing for individual features. Feature selection via hypothesis testing will attempt to select only the best features from a dataset, just as we were doing with our custom correlation chooser, but these tests rely more on formalized statistical methods and are interpreted through what are known as p-values.

A hypothesis test is a statistical test that is used to figure out whether we can apply a certain condition to an entire population, given a data sample. The result of a hypothesis test tells us whether we should believe the hypothesis or reject it in favor of an alternative one. Based on sample data from a population, a hypothesis test determines whether or not to reject the null hypothesis. We usually use a p-value (a non-negative decimal with an upper bound of 1, which is compared against our significance level) to make this conclusion.

In the case of feature selection, the hypothesis we wish to test is along the lines of: True or False: This feature has no relevance to the response variable. We want to test this hypothesis for every feature and decide whether the features hold some significance in the prediction of the response. In a way, this is how we dealt with the correlation logic. We basically said that, if a column's correlation with the response is too weak, then we say that the hypothesis that the feature has no relevance is true. If the correlation coefficient was strong enough, then we can reject the hypothesis that the feature has no relevance in favor of an alternative hypothesis, that the feature does have some relevance.

To begin to use this for our data, we will have to bring in two new modules, SelectKBest and f_classif, using the following code:

# SelectKBest selects features according to the k highest scores of a given scoring function
from sklearn.feature_selection import SelectKBest

# This models a statistical test known as ANOVA
from sklearn.feature_selection import f_classif

# f_classif allows for negative values, not all scoring functions do
# chi2 is a very common classification criterion but only allows for positive values
# regression has its own statistical tests

SelectKBest is basically just a wrapper that keeps a set amount of features that are the highest ranked according to some criterion. In this case, we will use the p-values of completed hypothesis tests as the ranking.

Interpreting the p-value

P-values are decimals between 0 and 1 that represent the probability that the data given to us occurred by chance under the hypothesis test. Simply put, the lower the p-value, the better the chance that we can reject the null hypothesis.
For our purposes, the smaller the p-value, the better the chances that the feature has some relevance to our response variable and we should keep it. The big take away from this is that the f_classif function will perform an ANOVA test (a type of hypothesis test) on each feature on its own (hence the name univariate testing) and assign that feature a p-value. The SelectKBest will rank the features by that p-value (the lower the better) and keep only the best k (a human input) features. Let's try this out in Python. Ranking the p-value Let's begin by instantiating a SelectKBest module. We will manually enter a k value, 5, meaning we wish to keep only the five best features according to the resulting p-values: # keep only the best five features according to p-values of ANOVA test k_best = SelectKBest(f_classif, k=5) We can then fit and transform our X matrix to select the features we want, as we did before with our custom selector: # matrix after selecting the top 5 features k_best.fit_transform(X, y) # 30,000 rows x 5 columns array([[ 2, 2, -1, -1, -2], [-1, 2, 0, 0, 0], [ 0, 0, 0, 0, 0], ..., [ 4, 3, 2, -1, 0], [ 1, -1, 0, 0, 0], [ 0, 0, 0, 0, 0]]) If we want to inspect the p-values directly and see which columns were chosen, we can dive deeper into the select k_best variables: # get the p values of columns k_best.pvalues_ # make a dataframe of features and p-values # sort that dataframe by p-value p_values = pd.DataFrame({'column': X.columns, 'p_value': k_best.pvalues_}).sort_values('p_value') # show the top 5 features p_values.head() The preceding code produces the following table as the output: We can see that, once again, our selector is choosing the PAY_X columns as the most important. If we take a look at our p-value column, we will notice that our values are extremely small and close to zero. A common threshold for p-values is 0.05, meaning that anything less than 0.05 may be considered significant, and these columns are extremely significant according to our tests. We can also directly see which columns meet a threshold of 0.05 using the pandas filtering methodology: # features with a low p value p_values[p_values['p_value'] < .05] The preceding code produces the following table as the output: The majority of the columns have a low p-value, but not all. Let's see the columns with a higher p_value, using the following code: # features with a high p value p_values[p_values['p_value'] >= .05] The preceding code produces the following table as the output: These three columns have quite a high p-value. 
Let's use our SelectKBest in a pipeline to see if we can grid search our way into a better machine learning pipeline, using the following code:

k_best = SelectKBest(f_classif)

# Make a new pipeline with SelectKBest
select_k_pipe = Pipeline([('k_best', k_best),
                          ('classifier', d_tree)])

select_k_best_pipe_params = deepcopy(tree_pipe_params)

# the 'all' literally does nothing to subset
select_k_best_pipe_params.update({'k_best__k': range(1,23) + ['all']})

print select_k_best_pipe_params
# {'k_best__k': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 'all'], 'classifier__max_depth': [None, 1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21]}

# comparable to our results with the CustomCorrelationChooser
get_best_model_and_accuracy(select_k_pipe, select_k_best_pipe_params, X, y)

Best Accuracy: 0.8206
Best Parameters: {'k_best__k': 7, 'classifier__max_depth': 5}
Average Time to Fit (s): 0.102
Average Time to Score (s): 0.002

It seems that our SelectKBest module is getting about the same accuracy as our custom transformer, but it's getting there a bit quicker! Let's see which columns our tests are selecting for us, with the help of the following code:

k_best = SelectKBest(f_classif, k=7)

# the lowest 7 p-values match what our CustomCorrelationChooser chose before
# ['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
p_values.head(7)

They appear to be the same columns that were chosen by our other statistical method. It's possible that our statistical method is limited to continually picking these seven columns for us. There are other tests available besides ANOVA, such as chi-squared, as well as tests for regression tasks. They are all included in scikit-learn's documentation. For more info on feature selection through univariate testing, check out the scikit-learn documentation here: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

Before we move on to model-based feature selection, it's helpful to do a quick sanity check to ensure that we are on the right track. So far, we have seen two statistical methods for feature selection that gave us the same seven columns for optimal accuracy. But what if we were to take every column except those seven? We should expect a much lower accuracy and a worse pipeline overall, right? Let's make sure. The following code helps us to implement the sanity check:

# sanity check
# if we keep only the worst columns
the_worst_of_X = X[X.columns.drop(['LIMIT_BAL', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'])]

# goes to show that selecting the wrong features will
# hurt us in predictive performance
get_best_model_and_accuracy(d_tree, tree_params, the_worst_of_X, y)

Best Accuracy: 0.783966666667
Best Parameters: {'max_depth': 5}
Average Time to Fit (s): 0.21
Average Time to Score (s): 0.002

Hence, by selecting all the columns except those seven, we see not only worse accuracy (almost as bad as the null accuracy), but also slower fitting times on average. With that, we have statistically selected features from the dataset for our machine learning pipeline.

This article is an excerpt from the book Feature Engineering Made Easy, co-authored by Sinan Ozdemir and Divya Susarla. Do check out the book for alternative techniques, such as model-based methods, to achieve optimum results from your machine learning application.
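As a quick, self-contained recap of the univariate approach described above, here is a small sketch on synthetic data (not the credit card dataset used in the excerpt), with column names fabricated purely for illustration, showing SelectKBest with f_classif end to end:

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic classification data: 20 features, only 5 of which are informative.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=2, random_state=42)
X = pd.DataFrame(X, columns=['feat_%d' % i for i in range(20)])

# Keep the 5 features with the lowest ANOVA p-values.
selector = SelectKBest(f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print(X_reduced.shape)                              # (1000, 5)
print(X.columns[selector.get_support()])            # names of the selected columns
print(selector.pvalues_[selector.get_support()])    # their p-values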


Getting started with Q-learning using TensorFlow

Savia Lobo
14 Mar 2018
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. This book will help you master advanced concepts of deep learning such as transfer learning, reinforcement learning, generative models and more, using TensorFlow and Keras.[/box] In this tutorial, we will learn about Q-learning and how to implement it using deep reinforcement learning. Q-Learning is a model-free method of finding the optimal policy that can maximize the reward of an agent. During initial gameplay, the agent learns a Q value for each pair of (state, action), also known as the exploration strategy. Once the Q values are learned, then the optimal policy will be to select an action with the largest Q-value in every state, also known as the exploitation strategy. The learning algorithm may end in locally optimal solutions, hence we keep using the exploration policy by setting an exploration_rate parameter. The Q-Learning algorithm is as follows: initialize  Q(shape=[#s,#a])  to  random  values  or  zeroes Repeat  (for  each  episode) observe  current  state  s Repeat select  an  action  a  (apply  explore  or  exploit  strategy) observe  state  s_next  as  a  result  of  action  a update  the  Q-Table  using  bellman's  equation set  current  state  s  =  s_next until  the  episode  ends  or  a  max  reward  /  max  steps  condition  is  reached Until  a  number  of  episodes  or  a  condition  is  reached (such  as  max  consecutive  wins) Q(s, a) in the preceding algorithm represents the Q function. The values of this function are used for selecting the action instead of the rewards, thus this function represents the reward or discounted rewards. The values for the Q-function are updated using the values of the Q function in the future state. The well- known bellman equation captures this update: This basically means that at time step t, in state s, for action a, the maximum future reward (Q) is equal to the reward from the current state plus the max future reward from the next state. Q(s,a) can be implemented as a Q-Table or as a neural network known as a Q-Network. In both cases, the task of the Q-Table or the Q-Network is to provide the best possible action based on the Q value of the given input. The Q-Table-based approach generally becomes intractable as the Q-Table becomes large, thus making neural networks the best candidate for approximating the Q-function through Q-Network. Let us look at both of these approaches in action. Initializing and discretizing for Q-Learning The observations returned by the pole-cart environment involves the state of the environment. The state of pole-cart is represented by continuous values that we need to discretize. If we discretize these values into small state-space, then the agent gets trained faster, but with the caveat of risking the convergence to the optimal policy. 
We use the following helper functions to discretize the state space of the cart-pole environment:

# discretize the value to a state space
def discretize(val, bounds, n_states):
    discrete_val = 0
    if val <= bounds[0]:
        discrete_val = 0
    elif val >= bounds[1]:
        discrete_val = n_states - 1
    else:
        discrete_val = int(round((n_states - 1) *
                                 ((val - bounds[0]) / (bounds[1] - bounds[0]))))
    return discrete_val

def discretize_state(vals, s_bounds, n_s):
    discrete_vals = []
    for i in range(len(n_s)):
        discrete_vals.append(discretize(vals[i], s_bounds[i], n_s[i]))
    return np.array(discrete_vals, dtype=np.int)

We discretize the space into 10 units for each of the observation dimensions. You may want to try out different discretization spaces. After the discretization, we find the upper and lower bounds of the observations, and change the bounds of velocity and angular velocity to be between -1 and +1, instead of -Inf and +Inf. The code is as follows:

env = gym.make('CartPole-v0')
n_a = env.action_space.n

# number of discrete states for each observation dimension
n_s = np.array([10, 10, 10, 10])   # position, velocity, angle, angular velocity

s_bounds = np.array(list(zip(env.observation_space.low, env.observation_space.high)))

# the velocity and angular velocity bounds are
# too high, so we bound them between -1 and +1
s_bounds[1] = (-1.0, 1.0)
s_bounds[3] = (-1.0, 1.0)

Q-Learning with Q-Table

Since our discretized space has the dimensions [10,10,10,10], our Q-Table has [10,10,10,10,2] dimensions:

# create a Q-Table of shape (10,10,10,10,2) representing S X A -> R
q_table = np.zeros(shape=np.append(n_s, n_a))

We define a Q-Table policy that exploits or explores based on the exploration_rate:

def policy_q_table(state, env):
    # Exploration strategy - select a random action
    if np.random.random() < explore_rate:
        action = env.action_space.sample()
    # Exploitation strategy - select the action with the highest Q value
    else:
        action = np.argmax(q_table[tuple(state)])
    return action

Define the episode() function that runs a single episode as follows:

1. Start with initializing the variables and the first state:

obs = env.reset()
state_prev = discretize_state(obs, s_bounds, n_s)
episode_reward = 0
done = False
t = 0

2. Select the action and observe the next state:

action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_new = discretize_state(obs, s_bounds, n_s)

3. Update the Q-Table:

best_q = np.amax(q_table[tuple(state_new)])
bellman_q = reward + discount_rate * best_q
indices = tuple(np.append(state_prev, action))
q_table[indices] += learning_rate * (bellman_q - q_table[indices])

4. Set the next state as the previous state and add the reward to the episode's rewards:

state_prev = state_new
episode_reward += reward

The experiment() function calls the episode() function and accumulates the rewards for reporting. You may want to modify the function to check for consecutive wins and other logic specific to your play or games:

# collect observations and rewards for each episode
def experiment(env, policy, n_episodes, r_max=0, t_max=0):
    rewards = np.empty(shape=[n_episodes])
    for i in range(n_episodes):
        val = episode(env, policy, r_max, t_max)
        rewards[i] = val
    print('Policy:{}, Min reward:{}, Max reward:{}, Average reward:{}'
          .format(policy.__name__,
                  np.min(rewards),
                  np.max(rewards),
                  np.mean(rewards)))
Now, all we have to do is define the parameters, such as learning_rate, discount_rate, and explore_rate, and run the experiment() function as follows:

learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 1000

experiment(env, policy_q_table, n_episodes)

For 1000 episodes, the Q-Table-based policy's maximum reward is 180 based on our simple implementation:

Policy:policy_q_table, Min reward:8.0, Max reward:180.0, Average reward:17.592

Our implementation of the algorithm is very simple to explain. However, you can modify the code to set the explore rate high initially and then decay it as the time-steps pass. Similarly, you can also implement decay logic for the learning and discount rates. Let us see if we can get a higher reward with fewer episodes as our Q function learns faster.

Q-Learning with Q-Network or Deep Q Network (DQN)

In the DQN, we replace the Q-Table with a neural network (Q-Network) that will learn to respond with the optimal action as we train it continuously with the explored states and their Q values. Thus, for training the network we need a place to store the game memory:

Implement the game memory using a deque of size 1000:

from collections import deque
memory = deque(maxlen=1000)

Next, build a simple hidden layer neural network model, q_nn:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(8, input_dim=4, activation='relu'))
model.add(Dense(2, activation='linear'))
model.compile(loss='mse', optimizer='adam')
model.summary()

q_nn = model

The Q-Network looks like this:

Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 8)                 40
dense_2 (Dense)              (None, 2)                 18
=================================================================
Total params: 58
Trainable params: 58
Non-trainable params: 0

The episode() function that executes one episode of the game incorporates the following changes for the Q-Network-based algorithm:

After generating the next state, add the states, action, and rewards to the game memory:

action = policy(state_prev, env)
obs, reward, done, info = env.step(action)
state_next = discretize_state(obs, s_bounds, n_s)

# add the state_prev, action, reward, state_new, done to memory
memory.append([state_prev, action, reward, state_next, done])

Generate and update the q_values with the maximum future rewards using the Bellman function:

states = np.array([x[0] for x in memory])
states_next = np.array([np.zeros(4) if x[4] else x[3] for x in memory])
q_values = q_nn.predict(states)
q_values_next = q_nn.predict(states_next)

for i in range(len(memory)):
    state_prev, action, reward, state_next, done = memory[i]
    if done:
        q_values[i, action] = reward
    else:
        best_q = np.amax(q_values_next[i])
        bellman_q = reward + discount_rate * best_q
        q_values[i, action] = bellman_q

Train the q_nn with the states and the q_values we received from memory:

q_nn.fit(states, q_values, epochs=1, batch_size=50, verbose=0)

The process of saving gameplay in memory and using it to train the model is also known as memory replay in deep reinforcement learning literature.
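Training on the entire memory at every step becomes slower as the deque fills up. A common variant is to sample a small random minibatch from the replay memory instead. The following is a hedged sketch of that idea, not code from the book; the batch_size value and the replay_train() helper name are illustrative assumptions.

import random
import numpy as np

def replay_train(q_nn, memory, discount_rate=0.9, batch_size=32):
    # sample at most batch_size stored transitions uniformly at random
    batch = random.sample(list(memory), min(len(memory), batch_size))
    states = np.array([x[0] for x in batch])
    states_next = np.array([np.zeros(4) if x[4] else x[3] for x in batch])
    q_values = q_nn.predict(states)
    q_values_next = q_nn.predict(states_next)
    for i, (s_prev, action, reward, s_next, done) in enumerate(batch):
        if done:
            q_values[i, action] = reward
        else:
            # Bellman update on the sampled minibatch only
            q_values[i, action] = reward + discount_rate * np.amax(q_values_next[i])
    q_nn.fit(states, q_values, epochs=1, batch_size=batch_size, verbose=0)

The rest of this example keeps the simpler full-memory training shown above.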
Let us run our DQN-based gameplay as follows:

learning_rate = 0.8
discount_rate = 0.9
explore_rate = 0.2
n_episodes = 100

experiment(env, policy_q_nn, n_episodes)

We get a max reward of 150, which you can improve upon with hyper-parameter tuning, network tuning, and by using rate decay for the discount rate and explore rate:

Policy:policy_q_nn, Min reward:8.0, Max reward:150.0, Average reward:41.27

To summarize, we calculated and trained the model at every step. One could change the code to discard the memory replay and retrain the model only for the episodes that return smaller rewards. However, implement this option with caution as it may slow down your learning, since initial gameplay generates smaller rewards more often. Do check out the book Mastering TensorFlow 1.x to explore advanced features of TensorFlow 1.x and gain insight into TensorFlow Core, Keras, TF Estimators, TFLearn, TF Slim, Pretty Tensor, and Sonnet.
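The rate decay mentioned above can be implemented in several ways. The following is a minimal sketch of exponential decay of the explore rate; the decay constant and the min_explore_rate floor are assumptions for illustration, not values from the book.

import math

def decayed_explore_rate(episode_i, min_explore_rate=0.01,
                         max_explore_rate=1.0, decay=0.01):
    # start near max_explore_rate and decay towards min_explore_rate
    return min_explore_rate + (max_explore_rate - min_explore_rate) * \
           math.exp(-decay * episode_i)

# usage: inside the episode loop, before selecting an action
# explore_rate = decayed_explore_rate(i)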

Stack Overflow Developer Survey 2018: A Quick Overview

Amey Varangaonkar
14 Mar 2018
4 min read
Stack Overflow recently published their annual developer survey in which over 100,000 developers and professionals participated. The survey shed light on some very interesting insights - from the developers’ preferred language for programming, to the development platform they hate the most. As the survey is quite detailed and comprehensive, we thought why not present the most important takeaways and findings for you to go through very quickly? If you are short of time and want to scan through the results of the survey quickly, read on.. Developer Profile Young developers form the majority: Half the developer population falls in the age group of 25-34 years while almost all respondents (90%) fall within the 18 - 44 year age group. Limited professional coding experience: Majority of the developers have been coding from the last 2 to 10 years. That said, almost half of the respondents have a professional coding experience of less than 5 years. Continuous learning is key to surviving as a developer: Almost 75% of the developers have a bachelor’s degree, or higher. In addition, almost 90% of the respondents say they have learnt a new language, framework or a tool without taking any formal course, but with the help of the official documentation and/or Stack Overflow Back-end developers form the majority: Among the top developer roles, more than half the developers identify themselves as back-end developers, while the percentage of data scientists and analysts is quite low. About 20% of the respondents identify themselves as mobile developers Working full-time: More than 75% of the developers responded that they work a full-time job. Close to 10% are freelancers, or self-employed. Popularly used languages and frameworks The Javascript family continue their reign: For the sixth year running, JavaScript has continued to be the most popular programming language, and is the choice of language for more than 70% of the respondents. In terms of frameworks, Node.js and Angular continue to be the most popular choice of the developers. Desktop development ain’t dead yet: When it comes to the platforms, developers prefer Linux and Windows Desktop or Server for their development work. Cloud platforms have not gained that much adoption, as yet, but there is a slow but steady rise. What about Data Science? Machine Learning and DevOps rule the roost: Machine Learning and DevOps are two trends which are trending highly due to the vast applications and research that is being done on these fronts. Tensorflow rises, Hadoop falls: About 75% of the respondents love the Tensorflow framework, and say they would love to continue using it for their machine learning/deep learning tasks. Hadoop’s popularity seems to be going down, on the other hand, as other Big Data frameworks like Apache Spark gain more traction and popularity. Python - the next big programming language: Popular data science languages like R and Python are on the rise in terms of popularity. Python, which surpassed PHP last year, has surpassed C# this year, indicating its continuing rise in popularity. Python based Frameworks like Tensorflow and pyTorch are gaining a lot of adoption. Learn F# for more moolah: Languages like F#, Clojure and Rust are associated with high global salaries, with median salaries above $70,000. The likes of R and Python are associated with median salaries of up to $57,000. 
PostgreSQL growing rapidly, Redis most loved database: MySQL and SQL Server are the two most widely used databases as per the survey, while the usage of PostgreSQL has surpassed that of the traditionally popular databases like MongoDB and Redis. In terms of popularity, Redis is the most loved database while the developers dread (read looking to switch from) databases like IBM DB2 and Oracle. Job-hunt for data scientists: Approximately 18% of the 76,000+ respondents who are actively looking for jobs are data scientists or work as academicians and researchers. AI more exciting than worrying: Close to 75% of the 69,000+ respondents are excited about the future possibilities with AI than worried about the dangers posed by AI. Some of the major concerns include AI making important business decisions. The big surprise was that most developers find automation of jobs as the most exciting part of a future enabled by AI. So that’s it then! What do you think about the Stack Overflow Developer survey results? Do you agree with the developers’ responses? We would love to know your thoughts. In the coming days, watch out for more fine grained analysis of the Stack Overflow survey data.

Feature Improvement: Identifying missing values using EDA (Exploratory Data Analysis) technique

Pravin Dhandre
13 Mar 2018
9 min read
Today, we will work towards developing a better sense of data through identifying missing values in a dataset using Exploratory Data Analysis (EDA) technique and python packages. Identifying missing values in data Our first method of identifying missing values is to give us a better understanding of how to work with real-world data. Often, data can have missing values due to a variety of reasons, for example with survey data, some observations may not have been recorded. It is important for us to analyze our data, and get a sense of what the missing values are so we can decide how we want to handle missing values for our machine learning. To start, let's dive into a dataset the Pima Indian Diabetes Prediction dataset. This dataset is available on the UCI Machine Learning Repository at: https:/​/​archive.​ics.​uci.​edu/​ml/​datasets/​pima+indians+diabetes From the main website, we can learn a few things about this publicly available dataset. We have nine columns and 768 instances (rows). The dataset is primarily used for predicting the onset of diabetes within five years in females of Pima Indian heritage over the age of 21 given medical details about their bodies. The dataset is meant to correspond with a binary (2-class) classification machine learning problem. Namely, the answer to the question, will this person develop diabetes within five years? The column names are provided as follows (in order): Number of times pregnant Plasma glucose concentration a 2 hours in an oral glucose tolerance test Diastolic blood pressure (mm Hg) Triceps skinfold thickness (mm) 2-Hour serum insulin measurement (mu U/ml) Body mass index (weight in kg/(height in m)2) Diabetes pedigree function Age (years) Class variable (zero or one) The goal of the dataset is to be able to predict the final column of class variable, which predicts if the patient has developed diabetes, using the other eight features as inputs to a machine learning function. There are two very important reasons we will be working with this dataset: We will have to work with missing values All of the features we will be working with will be quantitative The first point makes more sense for now as a reason, because the point of this chapter is to deal with missing values. As far as only choosing to work with quantitative data, this will only be the case for this chapter. We do not have enough tools to deal with missing values in categorical columns. In the next chapter, when we talk about feature construction, we will deal with this procedure. The exploratory data analysis (EDA) To identify our missing values we will begin with an EDA of our dataset. We will be using some useful python packages, pandas and numpy, to store our data and make some simple calculations as well as some popular visualization tools to see what the distribution of our data looks like. Let's begin and dive into some code. First, we will do some imports: # import packages we need for exploratory data analysis (EDA) import pandas as pd # to store tabular data import numpy as np # to do some math import matplotlib.pyplot as plt # a popular data visualization tool import seaborn as sns # another popular data visualization tool %matplotlib inline plt.style.use('fivethirtyeight') # a popular data visualization theme We will import our tabular data through a CSV, as follows: # load in our dataset using pandas pima = pd.read_csv('../data/pima.data') pima.head() The head method allows us to see the first few rows in our dataset. 
The output is as follows: Something's not right here, there's no column names. The CSV must not have the names for the columns built into the file. No matter, we can use the data source's website to fill this in, as shown in the following code: pima_column_names = ['times_pregnant', 'plasma_glucose_concentration', 'diastolic_blood_pressure', 'triceps_thickness', 'serum_insulin', 'bmi', 'pedigree_function', 'age', 'onset_diabetes'] pima = pd.read_csv('../data/pima.data', names=pima_column_names) pima.head() Now, using the head method again, we can see our columns with the appropriate headers. The output of the preceding code is as follows: Much better, now we can use the column names to do some basic stats, selecting, and visualizations. Let's first get our null accuracy as follows: pima['onset_diabetes'].value_counts(normalize=True) # get null accuracy, 65% did not develop diabetes 0 0.651042 1 0.348958 Name: onset_diabetes, dtype: float64 If our eventual goal is to exploit patterns in our data in order to predict the onset of diabetes, let us try to visualize some of the differences between those that developed diabetes and those that did not. Our hope is that the histogram will reveal some sort of pattern, or obvious difference in values between the classes of prediction: # get a histogram of the plasma_glucose_concentration column for # both classes col = 'plasma_glucose_concentration' plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='nondiabetes') plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes') plt.legend(loc='upper right') plt.xlabel(col) plt.ylabel('Frequency') plt.title('Histogram of {}'.format(col)) plt.show() The output of the preceding code is as follows: It seems that this histogram is showing us a pretty big difference between plasma_glucose_concentration between the two prediction classes. Let's show the same histogram style for multiple columns as follows: for col in ['bmi', 'diastolic_blood_pressure', 'plasma_glucose_concentration']: plt.hist(pima[pima['onset_diabetes']==0][col], 10, alpha=0.5, label='non-diabetes') plt.hist(pima[pima['onset_diabetes']==1][col], 10, alpha=0.5, label='diabetes') plt.legend(loc='upper right') plt.xlabel(col) plt.ylabel('Frequency') plt.title('Histogram of {}'.format(col)) plt.show() The output of the preceding code will give us the following three histograms. The first one is show us the distributions of bmi for the two class variables (non-diabetes and diabetes): The next histogram to appear will shows us again contrastingly different distributions between a feature across our two class variables. This time we are looking at diastolic_blood_pressure: The final graph will show plasma_glucose_concentration differences between our two class Variables: We can definitely see some major differences simply by looking at just a few histograms. For example, there seems to be a large jump in plasma_glucose_concentration for those who will eventually develop diabetes. To solidify this, perhaps we can visualize a linear correlation matrix in an attempt to quantify the relationship between these variables. We will use the visualization tool, seaborn, which we imported at the beginning of this chapter for our correlation matrix as follows: # look at the heatmap of the correlation matrix of our dataset sns.heatmap(pima.corr()) # plasma_glucose_concentration definitely seems to be an interesting feature here Following is the correlation matrix of our dataset. 
This is showing us the correlation amongst the different columns in our Pima dataset. The output is as follows: This correlation matrix is showing a strong correlation between plasma_glucose_concentration and onset_diabetes. Let's take a further look at the numerical correlations for the onset_diabetes column, with the following code: pima.corr()['onset_diabetes'] # numerical correlation matrix # plasma_glucose_concentration definitely seems to be an interesting feature here times_pregnant 0.221898 plasma_glucose_concentration 0.466581 diastolic_blood_pressure 0.065068 triceps_thickness 0.074752 serum_insulin 0.130548 bmi 0.292695 pedigree_function 0.173844 age 0.238356 onset_diabetes 1.000000 Name: onset_diabetes, dtype: float64 We will explore the powers of correlation in a later Chapter 4, Feature Construction, but for now we are using exploratory data analysis (EDA) to hint at the fact that the plasma_glucose_concentration column will be an important factor in our prediction of the onset of diabetes. Moving on to more important matters at hand, let's see if we are missing any values in our dataset by invoking the built-in isnull() method of the pandas DataFrame: pima.isnull().sum() >>>> times_pregnant 0 plasma_glucose_concentration 0 diastolic_blood_pressure 0 triceps_thickness 0 serum_insulin 0 bmi 0 pedigree_function 0 age 0 onset_diabetes 0 dtype: int64 Great! We don't have any missing values. Let's go on to do some more EDA, first using the shape method to see the number of rows and columns we are working with: pima.shape . # (# rows, # cols) (768, 9) Confirming we have 9 columns (including our response variable) and 768 data observations (rows). Now, let's take a peak at the percentage of patients who developed diabetes, using the following code: pima['onset_diabetes'].value_counts(normalize=True) # get null accuracy, 65% did not develop diabetes 0 0.651042 1 0.348958 Name: onset_diabetes, dtype: float64 This shows us that 65% of the patients did not develop diabetes, while about 35% did. We can use a nifty built-in method of a pandas DataFrame called describe to look at some basic descriptive statistics: pima.describe() # get some basic descriptive statistics We get the output as follows: This shows us quite quickly some basic stats such as mean, standard deviation, and some different percentile measurements of our data. But, notice that the minimum value of the BMI column is 0. That is medically impossible; there must be a reason for this to happen. Perhaps the number zero has been encoded as a missing value instead of the None value or a missing cell. Upon closer inspection, we see that the value 0 appears as a minimum value for the following columns: times_pregnant plasma_glucose_concentration diastolic_blood_pressure triceps_thickness serum_insulin bmi onset_diabetes Because zero is a class for onset_diabetes and 0 is actually a viable number for times_pregnant, we may conclude that the number 0 is encoding missing values for: plasma_glucose_concentration diastolic_blood_pressure triceps_thickness serum_insulin bmi So, we actually do having missing values! It was obviously not luck that we happened upon the zeros as missing values, we knew it beforehand. As a data scientist, you must be ever vigilant and make sure that you know as much about the dataset as possible in order to find missing values encoded as other symbols. Be sure to read any and all documentation that comes with open datasets in case they mention any missing values. 
If no documentation is available, some common values used instead of missing values are:

0 (for numerical values)
unknown or Unknown (for categorical variables)
? (for categorical variables)

To summarize, we have five columns where missing values have been encoded with other values and symbols. [box type="note" align="" class="" width=""]You just read an excerpt from a book Feature Engineering Made Easy co-authored by Sinan Ozdemir and Divya Susarla. To learn more about missing values and manipulating features, do check out Feature Engineering Made Easy and develop expert proficiency in Feature Selection, Learning, and Optimization.[/box]
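As a follow-up to the summary above, a natural next step is to re-encode those zero placeholders as NaN so that pandas recognizes them as missing. The snippet below is a small sketch of that idea, using the pima DataFrame and the column names established earlier; the detailed treatment of missing values is covered later in the book.

import numpy as np

# columns where 0 actually means "missing"
columns_with_hidden_missing = ['plasma_glucose_concentration',
                               'diastolic_blood_pressure',
                               'triceps_thickness',
                               'serum_insulin',
                               'bmi']

# replace the 0 placeholders with NaN so pandas can see them
pima[columns_with_hidden_missing] = \
    pima[columns_with_hidden_missing].replace(0, np.nan)

# now isnull() reports the true number of missing values per column
print(pima.isnull().sum())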

How to build a cartpole game using OpenAI Gym

Savia Lobo
10 Mar 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Mastering TensorFlow 1.x written by Armando Fandango. In this book, you will learn advanced features of TensorFlow1.x, such as distributed TensorFlow with TF Clusters, deploy production models with TensorFlow Serving, and more. [/box] Today, we will help you understand OpenAI Gym and how to apply the basics of OpenAI Gym onto a cartpole game. OpenAI Gym 101 OpenAI Gym is a Python-based toolkit for the research and development of reinforcement learning algorithms. OpenAI Gym provides more than 700 opensource contributed environments at the time of writing. With OpenAI, you can also create your own environment. The biggest advantage is that OpenAI provides a unified interface for working with these environments, and takes care of running the simulation while you focus on the reinforcement learning algorithms. Note : The research paper describing OpenAI Gym is available here: http://arxiv.org/abs/1606.01540 You can install OpenAI Gym using the following command: pip3  install  gym Note: If the above command does not work, then you can find further help with installation at the following link: https://github.com/openai/ gym#installation  Let us print the number of available environments in OpenAI Gym: all_env  =  list(gym.envs.registry.all()) print('Total  Environments  in  Gym  version  {}  :  {}' .format(gym.     version     ,len(all_env))) Total  Environments  in  Gym  version  0.9.4  :  777  Let us print the list of all environments: for  e  in  list(all_env): print(e) The partial list from the output is as follows: EnvSpec(Carnival-ramNoFrameskip-v0) EnvSpec(EnduroDeterministic-v0) EnvSpec(FrostbiteNoFrameskip-v4) EnvSpec(Taxi-v2) EnvSpec(Pooyan-ram-v0) EnvSpec(Solaris-ram-v4) EnvSpec(Breakout-ramDeterministic-v0) EnvSpec(Kangaroo-ram-v4) EnvSpec(StarGunner-ram-v4) EnvSpec(Enduro-ramNoFrameskip-v4) EnvSpec(DemonAttack-ramDeterministic-v0) EnvSpec(TimePilot-ramNoFrameskip-v0) EnvSpec(Amidar-v4) Each environment, represented by the env object, has a standardized interface, for example: An env object can be created with the env.make(<game-id-string>) function by passing the id string. Each env object contains the following main functions: The step() function takes an action object as an argument and returns four objects: observation: An object implemented by the environment, representing the observation of the environment. reward: A signed float value indicating the gain (or loss) from the previous action. done: A Boolean value representing if the scenario is finished. info: A Python dictionary object representing the diagnostic information. The render() function creates a visual representation of the environment. The reset() function resets the environment to the original state. Each env object comes with well-defined actions and observations, represented by action_space and observation_space. One of the most popular games in the gym to learn reinforcement learning is CartPole. In this game, a pole attached to a cart has to be balanced so that it doesn't fall. The game ends if either the pole tilts by more than 15 degrees or the cart moves by more than 2.4 units from the center. The home page of OpenAI.com emphasizes the game in these words: The small size and simplicity of this environment make it possible to run very quick experiments, which is essential when learning the basics. The game has only four observations and two actions. The actions are to move a cart by applying a force of +1 or -1. 
The observations are the position of the cart, the velocity of the cart, the angle of the pole, and the rotation rate of the pole. However, knowledge of the semantics of observation is not necessary to learn to maximize the rewards of the game. Now let us load a popular game environment, CartPole-v0, and play it with stochastic control:  Create the env object with the standard make function: env  =  gym.make('CartPole-v0')  The number of episodes is the number of game plays. We shall set it to one, for now, indicating that we just want to play the game once. Since every episode is stochastic, in actual production runs you will run over several episodes and calculate the average values of the rewards. Additionally, we can initialize an array to store the visualization of the environment at every timestep: n_episodes  =  1 env_vis  =  []  Run two nested loops—an external loop for the number of episodes and an internal loop for the number of timesteps you would like to simulate for. You can either keep running the internal loop until the scenario is done or set the number of steps to a higher value. At the beginning of every episode, reset the environment using env.reset(). At the beginning of every timestep, capture the visualization using env.render(). for  i_episode  in  range(n_episodes): observation  =  env.reset() for  t  in  range(100): env_vis.append(env.render(mode  =  'rgb_array')) print(observation) action  =  env.action_space.sample() observation,  reward,  done,  info  =  env.step(action) if  done: print("Episode  finished  at  t{}".format(t+1)) break  Render the environment using the helper function: env_render(env_vis)  The code for the helper function is as follows: def  env_render(env_vis): plt.figure() plot  =  plt.imshow(env_vis[0]) plt.axis('off') def  animate(i): plot.set_data(env_vis[i]) anim  =  anm.FuncAnimation(plt.gcf(), animate, frames=len(env_vis), interval=20, repeat=True, repeat_delay=20) display(display_animation(anim,  default_mode='loop')) We get the following output when we run this example: [-0.00666995  -0.03699492  -0.00972623    0.00287713] [-0.00740985    0.15826516  -0.00966868  -0.29285861] [-0.00424454  -0.03671761  -0.01552586  -0.00324067] [-0.0049789    -0.2316135    -0.01559067    0.28450351] [-0.00961117  -0.42650966  -0.0099006    0.57222875] [-0.01814136  -0.23125029    0.00154398    0.27644332] [-0.02276636  -0.0361504    0.00707284  -0.01575223] [-0.02348937    0.1588694       0.0067578    -0.30619523] [-0.02031198  -0.03634819    0.00063389  -0.01138875] [-0.02103895    0.15876466    0.00040612  -0.3038716  ] [-0.01786366    0.35388083  -0.00567131  -0.59642642] [-0.01078604    0.54908168  -0.01759984  -0.89089036] [    1.95594914e-04   7.44437934e-01    -3.54176495e-02    -1.18905344e+00] [ 0.01508435 0.54979251 -0.05919872 -0.90767902] [ 0.0260802 0.35551978 -0.0773523 -0.63417465] [ 0.0331906 0.55163065 -0.09003579 -0.95018025] [ 0.04422321 0.74784161 -0.1090394 -1.26973934] [ 0.05918004 0.55426764 -0.13443418 -1.01309691] [ 0.0702654 0.36117014 -0.15469612 -0.76546874] [ 0.0774888 0.16847818 -0.1700055 -0.52518186] [ 0.08085836 0.3655333 -0.18050913 -0.86624457] [ 0.08816903 0.56259197 -0.19783403 -1.20981195] Episode  finished  at  t22 It took 22 time-steps for the pole to become unbalanced. At every run, we get a different time-step value because we picked the action scholastically by using env.action_space.sample(). 
Since the game results in a loss so quickly, randomly picking an action and applying it is probably not the best strategy. There are many algorithms for finding solutions to keeping the pole straight for a longer number of time-steps that you can use, such as Hill Climbing, Random Search, and Policy Gradient. Note: Some of the algorithms for solving the Cartpole game are available at the following links: https://openai.com/requests-for-research/#cartpole http://kvfrans.com/simple-algoritms-for-solving-cartpole/ https://github.com/kvfrans/openai-cartpole Applying simple policies to a cartpole game So far, we have randomly picked an action and applied it. Now let us apply some logic to picking the action instead of random chance. The third observation refers to the angle. If the angle is greater than zero, that means the pole is tilting right, thus we move the cart to the right (1). Otherwise, we move the cart to the left (0). Let us look at an example: We define two policy functions as follows: def  policy_logic(env,obs): return  1  if  obs[2]  >  0  else  0 def  policy_random(env,obs): return  env.action_space.sample() Next, we define an experiment function that will run for a specific number of episodes; each episode runs until the game is lost, namely when done is True. We use rewards_max to indicate when to break out of the loop as we do not wish to run the experiment forever: def  experiment(policy,  n_episodes,  rewards_max): rewards=np.empty(shape=(n_episodes)) env  =  gym.make('CartPole-v0') for  i  in  range(n_episodes): obs  =  env.reset() done  =  False episode_reward  =  0 while  not  done: action  =  policy(env,obs) obs,  reward,  done,  info  =  env.step(action) episode_reward  +=  reward if  episode_reward  >  rewards_max: break rewards[i]=episode_reward print('Policy:{},  Min  reward:{},  Max  reward:{}' .format(policy.     name     , min(rewards), max(rewards))) We run the experiment 100 times, or until the rewards are less than or equal to rewards_max, that is set to 10,000: n_episodes  =  100 rewards_max  =  10000 experiment(policy_random,  n_episodes,  rewards_max) experiment(policy_logic,  n_episodes,  rewards_max) We can see that the logically selected actions do better than the randomly selected ones, but not that much better: Policy:policy_random,  Min  reward:9.0,  Max  reward:63.0,  Average  reward:20.26 Policy:policy_logic,  Min  reward:24.0,  Max  reward:66.0,  Average  reward:42.81 Now let us modify the process of selecting the action further—to be based on parameters. The parameters will be multiplied by the observations and the action will be chosen based on whether the multiplication result is zero or one. Let us modify the random search method in which we initialize the parameters randomly. The code looks as follows: def  policy_logic(theta,obs): #  just  ignore  theta return  1  if  obs[2]  >  0  else  0 def  policy_random(theta,obs): return  0  if  np.matmul(theta,obs)  <  0  else  1 def  episode(env,  policy,  rewards_max): obs  =  env.reset() done  =  False episode_reward  =  0 if  policy.   
name          in  ['policy_random']: theta  =  np.random.rand(4)  *  2  -  1 else: theta  =  None while  not  done: action  =  policy(theta,obs) obs,  reward,  done,  info  =  env.step(action) episode_reward  +=  reward if  episode_reward  >  rewards_max: break return  episode_reward def  experiment(policy,  n_episodes,  rewards_max): rewards=np.empty(shape=(n_episodes)) env  =  gym.make('CartPole-v0') for  i  in  range(n_episodes): rewards[i]=episode(env,policy,rewards_max) #print("Episode  finished  at  t{}".format(reward)) print('Policy:{},  Min  reward:{},  Max  reward:{},  Average  reward:{}' .format(policy.     name     , np.min(rewards), np.max(rewards), np.mean(rewards))) n_episodes  =  100 rewards_max  =  10000 experiment(policy_random,  n_episodes,  rewards_max) experiment(policy_logic,  n_episodes,  rewards_max) We can see that random search does improve the results: Policy:policy_random,  Min  reward:8.0,  Max  reward:200.0,  Average reward:40.04 Policy:policy_logic,  Min  reward:25.0,  Max  reward:62.0,  Average  reward:43.03 With the random search, we have improved our results to get the max rewards of 200. On average, the rewards for random search are lower because random search tries various bad parameters that bring the overall results down. However, we can select the best parameters from all the runs and then, in production, use the best parameters. Let us modify the code to train the parameters first: def  policy_logic(theta,obs): #  just  ignore  theta return  1  if  obs[2]  >  0  else  0 def  policy_random(theta,obs): return  0  if  np.matmul(theta,obs)  <  0  else  1 def  episode(env,policy,  rewards_max,theta): obs  =  env.reset() done  =  False episode_reward  =  0 while  not  done: action  =  policy(theta,obs) obs,  reward,  done,  info  =  env.step(action) episode_reward  +=  reward if  episode_reward  >  rewards_max: break return  episode_reward def  train(policy,  n_episodes,  rewards_max): env  =  gym.make('CartPole-v0') theta_best  =  np.empty(shape=[4]) reward_best  =  0 for  i  in  range(n_episodes): if  policy.   name          in  ['policy_random']: theta  =  np.random.rand(4)  *  2  -  1 else: theta  =  None reward_episode=episode(env,policy,rewards_max,  theta) if  reward_episode  >  reward_best: reward_best  =  reward_episode theta_best  =  theta.copy() return  reward_best,theta_best def  experiment(policy,  n_episodes,  rewards_max,  theta=None): rewards=np.empty(shape=[n_episodes]) env  =  gym.make('CartPole-v0') for  i  in  range(n_episodes): rewards[i]=episode(env,policy,rewards_max,theta) #print("Episode  finished  at  t{}".format(reward)) print('Policy:{},  Min  reward:{},  Max  reward:{},  Average  reward:{}' .format(policy.     
name     , np.min(rewards), np.max(rewards), np.mean(rewards))) n_episodes  =  100 rewards_max  =  10000 reward,theta  =  train(policy_random,  n_episodes,  rewards_max) print('trained  theta:  {},  rewards:  {}'.format(theta,reward)) experiment(policy_random,  n_episodes,  rewards_max,  theta) experiment(policy_logic,  n_episodes,  rewards_max) We train for 100 episodes and then use the best parameters to run the experiment for the random search policy: n_episodes  =  100 rewards_max  =  10000 reward,theta  =  train(policy_random,  n_episodes,  rewards_max) print('trained  theta:  {},  rewards:  {}'.format(theta,reward)) experiment(policy_random,  n_episodes,  rewards_max,  theta) experiment(policy_logic,  n_episodes,  rewards_max) We find the that the training parameters gives us the best results of 200: trained  theta:  [-0.14779543               0.93269603    0.70896423   0.84632461],  rewards: 200.0 Policy:policy_random,  Min  reward:200.0,  Max  reward:200.0,  Average reward:200.0 Policy:policy_logic,  Min  reward:24.0,  Max  reward:63.0,  Average  reward:41.94 We may optimize the training code to continue training until we reach a maximum reward. To summarize, we learnt the basics of OpenAI Gym and also applied it onto a cartpole game for relevant output.   If you found this post useful, do check out this book Mastering TensorFlow 1.x  to build, scale, and deploy deep neural network models using star libraries in Python.
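The text mentions Hill Climbing as another simple approach to this game. As a hedged illustration (not code from the book), the sketch below perturbs the best parameters found so far with small random noise and keeps the perturbation only if it improves the episode reward; noise_scale and the helper names are assumed for illustration.

import gym
import numpy as np

def run_episode(env, theta, t_max=200):
    obs = env.reset()
    total_reward = 0
    for _ in range(t_max):
        action = 0 if np.matmul(theta, obs) < 0 else 1
        obs, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward

def hill_climb(n_episodes=100, noise_scale=0.1):
    env = gym.make('CartPole-v0')
    theta_best = np.random.rand(4) * 2 - 1
    reward_best = run_episode(env, theta_best)
    for _ in range(n_episodes):
        # try a small random perturbation of the current best parameters
        theta_try = theta_best + noise_scale * (np.random.rand(4) * 2 - 1)
        reward_try = run_episode(env, theta_try)
        if reward_try > reward_best:
            reward_best, theta_best = reward_try, theta_try
    return reward_best, theta_best

print(hill_climb())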

Learning the Salesforce Analytics Query Language (SAQL)

Amey Varangaonkar
09 Mar 2018
6 min read
Salesforce Einstein offers its own query language to retrieve your data from various sources, called Salesforce Analytics Query Language (SAQL). The lenses and dashboards in Einstein use SAQL behind the scenes to manipulate data for meaningful visualizations. In this article, we see how to use Salesforce Analytics Query Language effectively. Using SAQL There are the following three ways to use SAQL in Einstein Analytics: Creating steps/lenses: We can use SAQL while creating a lens or step. It is the easiest way of using SAQL. While creating a step, Einstein Analytics provides the flexibility of switching between modes such as Chart Mode, Table Mode, and SAQL Mode. In this chapter, we will use this method for SAQL. Analytics REST API: Using this API, the user can access the datasets, lenses, dashboards, and so on. This is a programmatic approach and you can send the queries to the Einstein Analytics platform. Einstein Analytics uses the OAuth 2.0 protocol to securely access the platform data. The OAuth protocol is a way of securely authenticating the user without asking them for credentials. The first step to using the Analytics REST API to access Analytics is to authenticate the user using OAuth 2.0. Using Dashboard JSON: We can use SAQL while editing the Dashboard JSON. We have already seen the Dashboard JSON in previous chapters. To access Dashboard JSON, you can open the dashboard in the edit mode and press Ctrl + E. The simplest way of using SAQL is while creating a step or lens. A user can switch between the modes here. To use SAQL for lens, perform the following steps: Navigate to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens. Switch to SAQL Mode by clicking on the icon in the top-right corner, as shown in the following screenshot: In SAQL, the query is made up of multiple statements. In the first statement, the query loads the input data from the dataset, operates on it, and then finally gives the result. The user can use the Run Query button to see the results and errors after changing or adding statements. The user can see the errors at the bottom of the Query editor. SAQL is made up of statements that take the input dataset, and we build our logic on that. We can add filters, groups, orders, and so on, to this dataset to get the desired output. There are certain order rules that need to be followed while creating these statements and those rules are as follows: There can be only one offset in the foreach statement The limit statement must be after offset The offset statement must be after filter and order The order and filter statements can be swapped as there is no rule for them In SAQL, we can perform all the mathematical calculations and comparisons. SAQL also supports arithmetic operators, comparison operators, string operators, and logical operators. Using foreach in SAQL The foreach statement applies the set of expressions to every row, which is called projection. The foreach statement is mandatory to get the output of the query. The following is the syntax for the foreach statement: q = foreach q generate expression as 'expresion name'; Let's look at one example of using the foreach statement: Go to Analytics Studio | DATASETS and select any dataset. We are going to select Opportunity here. Click on it and it will open a window to create a lens. Switch to SAQL Mode by clicking on the icon in the top-right corner. 
In the Query editor you will see the following code: q = load "opportunity"; q = group q by all; q = foreach q generate count() as 'count'; q = limit q 2000; You can see the result of this query just below the Query editor: 4. Now replace the third statement with the following statement: q = foreach q generate sum('Amount') as 'Sum Amount'; 5. Click on the Run Query button and observe the result as shown in the following screenshot: Using grouping in SAQL The user can group records of the same value in one group by using the group statements. Use the following syntax: q = group rows by fieldName Let's see how to use grouping in SAQL by performing the following steps: Replace the second and third statement with the following statement: q = group q by 'StageName'; q = foreach q generate 'StageName' as 'StageName', sum('Amount') as 'Sum Amount'; 2. Click on the Run Query button and you should see the following result: Using filters in SAQL Filters in SAQL behave just like a where clause in SOQL and SQL, filtering the data as per the condition or clause. In Einstein Analytics, it selects the row from the dataset that satisfies the condition added. The syntax for the filter is as follows: q = filter q by fieldName 'Operator' value Click on Run Query and view the result as shown in the following screenshot: Using functions in SAQL The beauty of a function is in its reusability. Once the function is created it can be used multiple times. In SAQL, we can use different types of functions, such as string  functions, math functions, aggregate functions, windowing functions, and so on. These functions are predefined and saved quite a few times. Let's use a math function power. The syntax for the power is power(m, n). The function returns the value of m raised to the nth power. Replace the following statement with the fourth statement: q = foreach q generate 'StageName' as 'StageName', power(sum('Amount'), 1/2) as 'Amount Squareroot', sum('Amount') as 'Sum Amount'; Click on the Run Query button. We saw how to apply different kinds of case-specific functions in Salesforce Einstein to play with data in order to get the desired outcome. [box type="note" align="" class="" width=""]The above excerpt is taken from the book Learning Einstein Analytics, written by Santosh Chitalkar. It covers techniques to set-up and create apps, lenses, and dashboards using Salesforce Einstein Analytics for effective business insights. If you want to know more about these techniques, check out the book Learning Einstein Analytics.[/box]     
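As mentioned earlier, SAQL statements can also be submitted programmatically through the Analytics REST API once you have an OAuth 2.0 access token. The sketch below is an assumption-laden illustration using Python's requests library; the instance URL, API version, dataset version ID, and the /wave/query endpoint path are placeholders that you should verify against the Salesforce documentation for your org.

import requests

instance_url = 'https://yourInstance.salesforce.com'   # placeholder
access_token = '<OAuth 2.0 access token>'              # placeholder

# the same grouping query as above, with a placeholder dataset version id
saql = ("q = load \"opportunity/0Fb...\"; "
        "q = group q by 'StageName'; "
        "q = foreach q generate 'StageName' as 'StageName', "
        "sum('Amount') as 'Sum Amount';")

response = requests.post(
    instance_url + '/services/data/v41.0/wave/query',
    headers={'Authorization': 'Bearer ' + access_token,
             'Content-Type': 'application/json'},
    json={'query': saql})
print(response.json())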

How to perform data partitioning in PostgreSQL 10

Sugandha Lahoti
09 Mar 2018
11 min read
Partitioning refers to splitting; logically it means breaking one large table into smaller physical pieces. PostgreSQL supports basic table partitioning. It can store up to 32 TB of data inside a single table, which are by default 8k blocks. Infact, if we compile PostgreSQL with 32k blocks, we can even put up to 128 TB into a single table. However, large tables like these are not necessarily too convenient and it makes sense to partition tables to enable processing easier, and in some cases, a bit faster. With PostgreSQL 10.0, partitioning data has improved and offers significantly easier handling of partitioning data to the end users. In this article, we will talk about both, the classic way to partition data as well as the new features available on PostgreSQL 10.0 to perform data partitioning. Creating partitions First, we will learn the old method to partition data. Before digging deeper into the advantages of partitioning, I want to show how partitions can be created. The entire thing starts with a parent table: test=# CREATE TABLE t_data (id serial, t date, payload text); CREATE TABLE In this example, the parent table has three columns. The date column will be used for partitioning but more on that a bit later. Now that the parent table is in place, the child tables can be created. This is how it works: test=# CREATE TABLE t_data_2016 () INHERITS (t_data); CREATE TABLE test=# d t_data_2016 Table "public.t_data_2016" Column | Type  | Modifiers ---------+---------+----------------------------------------------------- id   | integer | not null default nextval('t_data_id_seq'::regclass) t    | date | payload | text   | Inherits: t_data The table is called t_data_2016 and inherits from t_data.  () means that no extra columns are added to the child table. As you can see, inheritance means that all columns from the parents are available in the child table. Also note that the id column will inherit the sequence from the parent so that all children can share the very same numbering. Let's create more tables: test=# CREATE TABLE t_data_2015 () INHERITS (t_data); CREATE TABLE test=# CREATE TABLE t_data_2014 () INHERITS (t_data); CREATE TABLE So far, all the tables are identical and just inherit from the parent. However, there is more: child tables can actually have more columns than parents. Adding more fields is simple: test=# CREATE TABLE t_data_2013 (special text) INHERITS (t_data); CREATE TABLE In this case, a special column has been added. It has no impact on the parent, but just enriches the children and makes them capable of holding more data. After creating a handful of tables, a row can be added: test=# INSERT INTO t_data_2015 (t, payload) VALUES ('2015-05-04', 'some data'); INSERT 0 1 The most important thing now is that the parent table can be used to find all the data in the child tables: test=# SELECT * FROM t_data; id |   t   | payload ----+------------+----------- 1 | 2015-05-04 | some data (1 row) Querying the parent allows you to gain access to everything below the parent in a simple and efficient manner. 
To understand how PostgreSQL does partitioning, it makes sense to take a look at the plan: test=# EXPLAIN SELECT * FROM t_data; QUERY PLAN ----------------------------------------------------------------- Append (cost=0.00..84.10 rows=4411 width=40) -> Seq Scan on t_data (cost=0.00..0.00 rows=1 width=40) -> Seq Scan on t_data_2016 (cost=0.00..22.00 rows=1200 width=40) -> Seq Scan on t_data_2015 (cost=0.00..22.00 rows=1200 width=40) -> Seq Scan on t_data_2014 (cost=0.00..22.00 rows=1200 width=40) -> Seq Scan on t_data_2013 (cost=0.00..18.10 rows=810 width=40) (6 rows) Actually, the process is quite simple. PostgreSQL will simply unify all tables and show us all the content from all the tables inside and below the partition we are looking at. Note that all tables are independent and are just connected logically through the system catalog. Applying table constraints What happens if filters are applied? test=# EXPLAIN SELECT * FROM t_data WHERE t = '2016-01-04'; QUERY PLAN ----------------------------------------------------------------- Append (cost=0.00..95.12 rows=23 width=40) -> Seq Scan on t_data (cost=0.00..0.00 rows=1 width=40) Filter: (t = '2016-01-04'::date) -> Seq Scan on t_data_2016 (cost=0.00..25.00 rows=6 width=40) Filter: (t = '2016-01-04'::date) -> Seq Scan on t_data_2015 (cost=0.00..25.00 rows=6 width=40) Filter: (t = '2016-01-04'::date) -> Seq Scan on t_data_2014 (cost=0.00..25.00 rows=6 width=40) Filter: (t = '2016-01-04'::date) -> Seq Scan on t_data_2013 (cost=0.00..20.12 rows=4 width=40) Filter: (t = '2016-01-04'::date) (11 rows) PostgreSQL will apply the filter to all the partitions in the structure. It does not know that the table name is somehow related to the content of the tables. To the database, names are just names and have nothing to do with what you are looking for. This makes sense, of course, as there is no mathematical justification for doing anything else. The point now is: how can we teach the database that the 2016 table only contains 2016 data, the 2015 table only contains 2015 data, and so on? Table constraints are here to do exactly that. They teach PostgreSQL about the content of those tables and therefore allow the planner to make smarter decisions than before. The feature is called constraint exclusion and helps dramatically to speed up queries in many cases. The following listing shows how table constraints can be created: test=# ALTER TABLE t_data_2013 ADD CHECK (t < '2014-01-01'); ALTER TABLE test=# ALTER TABLE t_data_2014 ADD CHECK (t >= '2014-01-01' AND t < '2015-01-01'); ALTER TABLE test=# ALTER TABLE t_data_2015 ADD CHECK (t >= '2015-01-01' AND t < '2016-01-01'); ALTER TABLE test=# ALTER TABLE t_data_2016 ADD CHECK (t >= '2016-01-01' AND t < '2017-01-01'); ALTER TABLE For each table, a CHECK constraint can be added. PostgreSQL will only create the constraint if all the data in those tables is perfectly correct and if every single row satisfies the constraint. In contrast to MySQL, constraints in PostgreSQL are taken seriously and honored under any circumstances. In PostgreSQL, those constraints can overlap--this is not forbidden and can make sense in some cases. However, it is usually better to have non-overlapping constraints because PostgreSQL has the option to prune more tables. 
Here is what happens after adding those table constraints: test=# EXPLAIN SELECT * FROM t_data WHERE t = '2016-01-04'; QUERY PLAN ----------------------------------------------------------------- Append (cost=0.00..25.00 rows=7 width=40) -> Seq Scan on t_data (cost=0.00..0.00 rows=1 width=40) Filter: (t = '2016-01-04'::date) -> Seq Scan on t_data_2016 (cost=0.00..25.00 rows=6 width=40) Filter: (t = '2016-01-04'::date) (5 rows) The planner will be able to remove many of the tables from the query and only keep those which potentially contain the data. The query can greatly benefit from a shorter and more efficient plan. In particular, if those tables are really large, removing them can boost speed considerably. Modifying inherited structures Once in a while, data structures have to be modified. The ALTER  TABLE clause is here to do exactly that. The question is: how can partitioned tables be modified? Basically, all you have to do is tackle the parent table and add or remove columns. PostgreSQL will automatically propagate those changes through to the child tables and ensure that changes are made to all the relations, as follows: test=# ALTER TABLE t_data ADD COLUMN x int; ALTER TABLE test=# d t_data_2016 Table "public.t_data_2016" Column |   Type   | Modifiers ---------+---------+----------------------------------------------------- id | integer | not null default t | date | payload |  text | x | integer | Check constraints: nextval('t_data_id_seq'::regclass) "t_data_2016_t_check" CHECK (t >= '2016-01-01'::date AND t < '2017-01-01'::date) Inherits: t_data As you can see, the column is added to the parent and automatically added to the child table here. Note that this works for columns, and so on. Indexes are a totally different story. In an inherited structure, every table has to be indexed separately. If you add an index to the parent table, it will only be present on the parent-it won't be deployed on those child tables. Indexing all those columns in all those tables is your task and PostgreSQL is not going to make those decisions for you. Of course, this can be seen as a feature or as a limitation. On the upside, you could say that PostgreSQL gives you all the flexibility to index things separately and therefore potentially more efficiently. However, people might also argue that deploying all those indexes one by one is a lot more work. Moving tables in and out of partitioned structures Suppose you have an inherited structure. Data is partitioned by date and you want to provide the most recent years to the end user. At some point, you might want to remove some data from the scope of the user without actually touching it. You might want to put data into some sort of archive or something. PostgreSQL provides a simple means to achieve exactly that. First, a new parent can be created: test=# CREATE TABLE t_history (LIKE t_data); CREATE TABLE The LIKE keyword allows you to create a table which has exactly the same layout as the t_data table. If you have forgotten which columns the t_data table actually has, this might come in handy as it saves you a lot of work. It is also possible to include indexes, constraints, and defaults. Then, the table can be moved away from the old parent table and put below the new one. Here is how it works: test=# ALTER TABLE t_data_2013 NO INHERIT t_data; ALTER TABLE test=# ALTER TABLE t_data_2013 INHERIT t_history; ALTER TABLE The entire process can of course be done in a single transaction to assure that the operation stays atomic. 
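Since the move can be wrapped in a single transaction, a hedged Python sketch of that idea using the psycopg2 driver might look as follows; the connection parameters are placeholders, and the SQL statements are the same NO INHERIT / INHERIT pair shown above.

import psycopg2

conn = psycopg2.connect(dbname='test', user='postgres',
                        host='localhost')   # placeholder credentials
cur = conn.cursor()
try:
    # both statements run inside one transaction
    cur.execute("ALTER TABLE t_data_2013 NO INHERIT t_data")
    cur.execute("ALTER TABLE t_data_2013 INHERIT t_history")
    conn.commit()    # atomic: either both changes apply or none
except Exception:
    conn.rollback()
    raise
finally:
    cur.close()
    conn.close()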
Cleaning up data One advantage of partitioned tables is the ability to clean data up quickly. Suppose that we want to delete an entire year. If data is partitioned accordingly, a simple DROP  TABLE clause can do the job: test=# DROP TABLE t_data_2014; DROP TABLE As you can see, dropping a child table is easy. But what about the parent table? There are depending objects and therefore PostgreSQL naturally errors out to make sure that nothing unexpected happens: test=# DROP TABLE t_data; ERROR: cannot drop table t_data because other objects depend on it DETAIL: default for table t_data_2013 column id depends on sequence t_data_id_seq table t_data_2016 depends on table t_data table t_data_2015 depends on table t_data HINT: Use DROP ... CASCADE to drop the dependent objects too. The DROP  TABLE clause will warn us that there are depending objects and refuses to drop those tables. The CASCADE clause is needed to force PostgreSQL to actually remove those objects, along with the parent table: test=# DROP TABLE t_data CASCADE; NOTICE:   drop   cascades to 3 other objects DETAIL:   drop   cascades to default for table    t_data_2013 column id drop cascades to table      t_data_2016 drop   cascades to table t_data_2015 DROP TABLE Understanding PostgreSQL 10.0 partitioning For many years, the PostgreSQL community has been working on built-in partitioning. Finally, PostgreSQL 10.0 offers the first implementation of in-core partitioning, which will be covered in this chapter. For now, the partitioning functionality is still pretty basic. However, a lot of infrastructure for future improvements is already in place. To show you how partitioning works, I have compiled a simple example featuring range partitioning: CREATE TABLE data ( payload   integer )  PARTITION BY RANGE (payload); CREATE TABLE negatives PARTITION OF data FOR VALUES FROM (MINVALUE) TO (0); CREATE TABLE positives PARTITION OF data FOR VALUES FROM (0) TO (MAXVALUE); In this example, one partition will hold all negative values while the other one will take care of positive values. While creating the parent table, you can simply specify which way you want to partition data. In PostgreSQL 10.0, there is range partitioning and list partitioning. Support for hash partitioning and the like might be available as soon as PostgreSQL 11.0. Once the parent table has been created, it is already time to create the partitions. To do that, the PARTITION  OF clause has been added. At this point, there are still some limitations. The most important one is that a tuple (= a row) cannot move from one partition to the other, for example: UPDATE data SET payload = -10 WHERE id = 5 If there were rows satisfying this condition, PostgreSQL would simply error out and refuse to change the value. However, in case of a good design, it is a bad idea to change the partitioning key anyway. Also, keep in mind that you have to think about indexing each partition. We learnt both, the old way of data partitioning and new data partitioning features introduced in PostgreSQL 10.0. [box type="note" align="" class="" width=""]You read an excerpt from the book Mastering PostgreSQL 10, written by Hans-Jürgen Schönig.  To know about, query optimization, stored procedures and other techniques in PostgreSQL 10.0, you may check out this book Mastering PostgreSQL 10..[/box]

Spam Filtering - Natural Language Processing Approach

Packt
08 Mar 2018
16 min read
In this article by Jalaj Thanaki, the author of the book Python Natural Language Processing, we discuss how to develop a natural language processing (NLP) application. In this article, we will be developing a spam filter. In order to develop the spam filter we will be using a supervised machine learning (ML) algorithm named logistic regression. You can also use decision trees, Naive Bayes, or support vector machines (SVM). To make this happen, the following steps will be covered:

Understanding the logistic regression ML algorithm
Data collection and exploration
Splitting the dataset into a training dataset and a testing dataset

(For more resources related to this topic, see here.)

Understanding logistic regression ML algorithm

Let's understand the logistic regression algorithm first. For this classification algorithm, I will give you an intuition of how logistic regression works and we will see some basic mathematics related to it. Then we will see the spam filtering application. First, we consider binary classes like spam or not-spam, good or bad, win or lose, 0 or 1, and so on, for understanding the algorithm and its application. Suppose I want to classify emails into spam and non-spam (ham) categories, so spam and non-spam are the discrete output labels or target concepts here. Our goal is to predict whether a new email is spam or not-spam. Not-spam is also known as ham. In order to build this NLP application we are going to use logistic regression. Let's step back a while and understand the technicalities of the algorithm first. Here I'm stating the facts related to the mathematics of this algorithm in a very simple manner so everyone can understand the logic. The general approach for understanding this algorithm is as follows. If you know some part of ML then you can connect the dots, and if you are new to ML then don't worry, because we are going to understand every part, which I will describe as follows:

We define our hypothesis function, which helps us generate our target output or target concept
We define the cost function or error function, and we choose the error function in such a way that we can derive the partial derivative of the error function easily, so we can calculate gradient descent easily
Over time, we try to minimize the error so we can generate more accurate labels and classify data accurately

In statistics, logistic regression is also called logit regression or the logit model. This algorithm is mostly used as a binary class classifier, which means there should be two different classes in which you want to classify the data. The binary logistic model is used to estimate the probability of a binary response, and it generates the response based on one or more predictor or independent variables, or features. By the way, the basic mathematical concepts of this ML algorithm are used in deep learning (DL) as well. First, I want to explain why this algorithm is called logistic regression. The reason is that the algorithm uses the logistic function, or sigmoid function, and that is the reason it is called logistic regression. Logistic function and sigmoid function are synonyms of each other. We use the sigmoid function as the hypothesis function, and this function belongs to the hypothesis class. Now, what do we mean by the hypothesis function? Well, as we have seen earlier, the machine has to learn the mapping between data attributes and the given label in such a way that it can predict the label for new data. This can be achieved by the machine if it learns this mapping using a mathematical function.
This mathematical function is called the hypothesis function, and the machine will use it to classify the data and predict the labels or target concept. Here, as I said, we want to build a binary classifier, so our label is either spam or ham. Mathematically, we can assign 0 to ham (not-spam) and 1 to spam, or vice versa, as per your choice. These numerically assigned labels are our dependent variables. We now need our output labels to be either zero or one; mathematically, the label is y and y ∈ {0, 1}. So we need to choose a hypothesis function that converts its output into a value between zero and one, and the logistic (sigmoid) function does exactly that. This is the main reason why logistic regression uses the sigmoid function as its hypothesis function. Logistic or sigmoid function Let me give you the mathematical equation of the logistic or sigmoid function: g(z) = 1 / (1 + e^(-z)). Refer to Figure 1: Figure 1: Logistic or sigmoid function You can see the plot showing g(z). Here, g(z) = Φ(z). Refer to Figure 2: Figure 2: Graph of sigmoid or logistic function From the preceding graph you can see the following facts: If the value of z is greater than or equal to zero, then the logistic function gives an output value greater than or equal to 0.5, which we interpret as class one. If the value of z is less than zero, then the logistic function generates an output less than 0.5, which we interpret as class zero. You can see this mathematical condition for the logistic function in Figure 3: Figure 3: Logistic function mathematical property Because of the preceding mathematical property, we can use this function to perform binary classification. Now it's time to show how this sigmoid function is represented as the hypothesis function. Refer to Figure 4: Figure 4: Hypothesis function for logistic regression If we take the preceding equation and substitute z with θTx, then the equation given in Figure 1 is converted as follows. Refer to Figure 5: Figure 5: Actual hypothesis function after mathematical manipulation Here hθ(x) is the hypothesis function, θT is the transpose of the parameter (weight) vector θ, and x stands for the vector of all independent variables, that is, the full feature set. In order to generate the hypothesis equation, we replace the z value of the logistic function with θTx. Using the hypothesis equation, the machine actually tries to learn the mapping between the input variables (input features) and the output labels. Let's talk a bit about the interpretation of this hypothesis function. For logistic regression, what is the best way to predict the class label? We can predict the target class label using probabilities: we generate a probability for each class and assign the class with the higher probability to that particular feature instance. In binary classification, the value of y, the target class, is either zero or one. If you are familiar with probability, you can represent the probability equation as given in Figure 6: Figure 6: Interpretation of hypothesis function using probabilistic representation For those not familiar with probability, P(y=1|x;θ) can be read as: the probability that y = 1, given x, and parameterized by θ. In simple language, the hypothesis function generates the probability of the target output being 1, given the feature vector x and some parameters θ. This is an intuitive concept, so keep it in mind for a while.
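To make the hypothesis function concrete, here is a minimal NumPy sketch of the sigmoid and the hypothesis hθ(x) described above. This is not code from the book; the feature values and parameters are invented purely for illustration.

import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^(-z)); the output always lies between 0 and 1
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    # h_theta(x) = g(theta^T x): the predicted probability that y = 1
    return sigmoid(np.dot(theta, x))

# Illustrative values only (not from the book's dataset)
theta = np.array([0.5, -1.2, 0.8])   # parameters learned during training
x = np.array([1.0, 0.3, 2.0])        # one feature vector (first entry is the bias term)

p_spam = hypothesis(theta, x)
print(p_spam)                        # probability that this email is spam (y = 1)
print(1 if p_spam >= 0.5 else 0)     # predicted class using the 0.5 threshold

If hθ(x) comes out as, say, 0.8, we read it as an 80% probability that the email is spam, and the 0.5 threshold turns that probability into a hard 0/1 label.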
Later on, I will give you the reason why we need to generate probabilities, and show how we can generate the probability value for each class. With this, we have completed the first step of the general approach to understanding logistic regression. Cost or error function for logistic regression First, what is a cost function or error function? Cost function, loss function, and error function are all names for the same thing. It is a very important concept in ML, so let's understand its definition and the purpose of defining it. The cost function is the function we use to check how accurately our ML classifier performs. Let me simplify this: in our training dataset we have data and we have labels. When we use the hypothesis function to generate an output, we need to check how near we are to the actual prediction. If we predict the actual output label, the difference between the hypothesis function's output and the actual label is zero, or minimal; if the hypothesis output and the actual label are not the same, there is a big difference between them. Suppose the actual label of an email is spam, which is 1, and our hypothesis function also generates the result 1; then the difference between the actual target value and the predicted output value is zero, and the prediction error is also zero. If our predicted output is 1 and the actual output is zero, then we have the maximum error between the actual target concept and our prediction. So it is important for us to have minimal error in our prediction. This is the very basic idea of an error function. We will get into the mathematics in a moment. There are several types of error functions available, such as the R² error, the sum of squared errors, and so on. The error function changes depending on the ML algorithm and the hypothesis function. Now, you want to know what the error function for logistic regression will be. I have also put θ in our hypothesis function, so you may want to know what θ is and how to choose its values. I will answer all of that here. Let me give you some background on what we do in linear regression, as it will help you understand logistic regression. In linear regression we generally use the sum of squared errors (residual errors) as the cost function. Just to give you that background: in linear regression we try to generate the line of best fit for our dataset. As in the example stated earlier, given height we want to predict weight; in this case we first draw a line and measure the distance from each data point to the line. We square these distances, sum them, and try to minimize this error function. Refer to Figure 7: Figure 7: Sum of squared error representation for reference You can see the distance of each data point from the line, denoted by a red line; we take these distances, square them, and sum them. This is the error function we use in linear regression. Using this error function, we generate partial derivatives with respect to the slope of the line, m, and with respect to the intercept, b. Each time, we calculate the error and update the values of m and b so we can generate the line of best fit. This process of updating m and b is called gradient descent. Using gradient descent, we update m and b in such a way that our error function reaches its minimum value and we can generate the line of best fit.
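As a concrete illustration of the linear-regression case described above, here is a minimal NumPy sketch of gradient descent updating m and b on the squared error. It is not code from the book; the toy height/weight numbers are invented for illustration only.

import numpy as np

# Toy data: heights (cm) and weights (kg), invented for illustration
x = np.array([150., 160., 170., 180., 190.])
y = np.array([55., 62., 70., 78., 85.])
x = (x - x.mean()) / x.std()   # scale the feature so a single learning rate works well

m, b = 0.0, 0.0                # slope and intercept, initialized to zero
alpha = 0.1                    # learning rate
n = len(x)

for _ in range(500):
    y_pred = m * x + b
    error = y_pred - y
    # Partial derivatives of the mean squared error with respect to m and b
    grad_m = (2.0 / n) * np.dot(error, x)
    grad_b = (2.0 / n) * error.sum()
    # Gradient descent update: step against the gradient
    m -= alpha * grad_m
    b -= alpha * grad_b

print(m, b)   # parameters of the line of best fit on the scaled data

Each iteration computes the error, takes its partial derivatives with respect to m and b, and nudges both parameters downhill; logistic regression will follow exactly the same recipe, only with a different hypothesis and a different cost function.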
Gradient descent gives us the direction in which we need to move the line so that we can generate the line of best fit. You can find a detailed example in Chapter 9, Deep Learning for NLU and NLG Problems. So, by defining an error function and generating its partial derivatives, we can apply the gradient descent algorithm, which helps us minimize our error or cost function. Now, back to the main question: which error function can we use for logistic regression? Do you think we can use the sum of squared errors for logistic regression as well? If you know functions and calculus well, your answer is probably no, and that is the correct answer. Let me explain for those who aren't familiar with functions and calculus; this is important, so pay attention. In linear regression our hypothesis function is linear, so it is very easy for us to work with the sum of squared errors. Here, however, we are using the sigmoid function, which is non-linear. If you apply the same error function we used in linear regression, it will not turn out well, because if you plug the sigmoid function into the sum of squared errors and visualize all possible values, you get a non-convex curve. Refer to Figure 8: Figure 8: Non-convex and convex functions (Image credit: http://www.yuthon.com/images/non-convex_and_convex_function.png) In machine learning we mostly use functions that give a convex curve, because then we can use the gradient descent algorithm to minimize the error function and be certain of reaching the global minimum. As you saw in Figure 8, a non-convex curve has many local minima, so reaching the global minimum is very challenging and time consuming, because you then need to apply second-order or nth-order optimization methods. With a convex curve you can reach the global minimum with certainty, and quickly as well. So, if we plug our sigmoid function into the sum of squared errors we get a non-convex function, and therefore we are not going to use the same error function we used in linear regression. Instead, we need to define a different cost function that is convex, so we can apply gradient descent and reach the global minimum. Here we use the statistical concept called likelihood. To derive the likelihood function we use the probability equation given in Figure 6, considering all the data points in the training set. This gives us the following equation, which is the likelihood function. Refer to Figure 9: Figure 9: Likelihood function for logistic regression (Image credit: http://cs229.stanford.edu/notes/cs229-notes1.pdf) Now, in order to simplify the derivation, we transform the likelihood function with a monotonically increasing function, namely the natural logarithm; the result is called the log likelihood. This log likelihood is our cost function for logistic regression. See the equation given in Figure 10: Figure 10: Cost function for logistic regression To gain some intuition about this cost function, we will plot it and understand what benefit it provides. On the x axis we have our hypothesis function; its range is 0 to 1, so we have these two points on the x axis. Start with the first case, where y = 1.
You can see the generated curve on the top right-hand side of Figure 11: Figure 11: Logistic regression cost function graphs If you take any log function plot and flip that curve (because here we have a negative sign), you get the same curve as the one plotted in Figure 11. You can see the log graph as well as the flipped graph in Figure 12: Figure 12: Comparing the log(x) and –log(x) graphs for a better understanding of the cost function (Image credit: http://www.sosmath.com/algebra/logs/log4/log42/log422/gl30.gif) Here we are only interested in values between 0 and 1, so we consider the part of the graph depicted in Figure 11. This cost function has some interesting and useful properties. If the predicted (candidate) label is the same as the actual target label, then the cost is zero: if y = 1 and the hypothesis function predicts hθ(x) = 1, the cost is 0; but as hθ(x) tends towards 0, the cost function blows up to ∞. Now, for y = 0, see the graph on the top left-hand side of Figure 11. This case has the same kind of properties we saw earlier: the cost goes to ∞ when the actual value is 0 and the hypothesis function predicts 1, and if the hypothesis function predicts 0 and the actual target is also 0, then the cost is 0. As I told you earlier, I would give you the reason why we choose this cost function: it makes our optimization easy, since we are maximizing the log likelihood, and the resulting cost function has a convex curve, which lets us run gradient descent. In order to apply gradient descent we need to generate the partial derivative with respect to θ, which gives us the equation shown in Figure 13: Figure 13: Partial derivative for performing gradient descent (Image credit: http://2.bp.blogspot.com) This equation is used for updating the parameter values of θ, and α here defines the learning rate. This parameter controls how fast or slow your algorithm learns (trains). If you set the learning rate too high, the algorithm cannot learn properly, and if you set it too low, training takes a lot of time, so you need to choose the learning rate wisely (a small code sketch of this cost function and update rule appears just below, before we continue with the data). Now let's start building the spam filtering application. Data loading and exploration To build the spam filtering application we need a dataset. Here we are using a small, straightforward dataset with two attributes: the first attribute is the label and the second attribute is the text content of the email. Let's discuss the first attribute a little more. The presence of a label makes this a tagged dataset. The label indicates whether the email content belongs to the spam category or the ham category. Let's jump into the practical part. Here we are using numpy, pandas, and scikit-learn as dependency libraries. Let's explore our dataset first. We read the dataset using the pandas library; I have also checked how many data records we have in total and the basic details of the dataset. Once we load the data, we check its first ten records and then replace the spam and ham categories with numbers.
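As promised, here is a minimal NumPy sketch of the logistic regression cost function (the negative log likelihood) and the gradient descent update on θ described above. This is not the book's code; the toy feature matrix and labels are invented purely for illustration.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Negative log likelihood, averaged over the training examples
    h = sigmoid(X.dot(theta))
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient_step(theta, X, y, alpha):
    # Partial derivative of the averaged cost with respect to theta,
    # followed by the update theta := theta - alpha * gradient
    h = sigmoid(X.dot(theta))
    grad = X.T.dot(h - y) / len(y)
    return theta - alpha * grad

# Toy data: 4 examples, a bias column plus 2 features (invented values)
X = np.array([[1., 0.5, 1.2],
              [1., 1.5, 0.3],
              [1., 3.0, 2.5],
              [1., 2.2, 3.1]])
y = np.array([0., 0., 1., 1.])

theta = np.zeros(X.shape[1])
for _ in range(1000):
    theta = gradient_step(theta, X, y, alpha=0.1)

print(cost(theta, X, y))         # should be close to its minimum for this toy set
print(sigmoid(X.dot(theta)))     # predicted probabilities of y = 1

Note how the learning rate alpha behaves exactly as described: too large and the cost oscillates or diverges, too small and the loop needs many more iterations to reach the minimum.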
As we have seen, a machine can understand only a numerical format, so here every ham label is converted into 0 and every spam label is converted into 1. Refer to Figure 14: Figure 14: Code snippet for converting labels into numerical format Split dataset into training dataset and testing dataset In this part we divide our dataset into two parts: one part is called the training set and the other part is called the testing set. Refer to Figure 15: Figure 15: Code snippet for dividing the dataset into a training dataset and a testing dataset We divide the dataset into two parts because we perform training using the training dataset; once our ML algorithm has been trained on that dataset it generates an ML model, and then we feed the testing data into that model so that it generates predictions. Based on those results we evaluate our ML model.
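To tie the whole pipeline together, here is a rough end-to-end sketch of the steps just described: loading the labelled emails with pandas, mapping ham/spam to 0/1, splitting into training and testing sets, and fitting a scikit-learn logistic regression model. It is not the book's exact code; the file name, column names, and feature representation are assumptions made for illustration.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Assumed file and column names; adjust to match the actual dataset
df = pd.read_csv('spam_dataset.tsv', sep='\t', names=['label', 'text'])
print(df.shape)          # total number of records
print(df.head(10))       # first ten records

# Convert the categorical labels into numbers: ham -> 0, spam -> 1
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['label_num'], test_size=0.2, random_state=42)

# Turn the raw email text into numerical features (simple bag of words)
vectorizer = CountVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Train the logistic regression classifier and evaluate it on the test set
clf = LogisticRegression()
clf.fit(X_train_vec, y_train)
predictions = clf.predict(X_test_vec)
print(accuracy_score(y_test, predictions))

The bag-of-words step is one reasonable choice for turning email text into the feature vectors x that the hypothesis function expects; the book may use a different feature representation.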


4 common challenges in Web Scraping and how to handle them

Amarabha Banerjee
08 Mar 2018
13 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] In this article, we will explore primary challenges of Web Scraping and how to get away with it easily. Developing a reliable scraper is never easy, there are so many what ifs that we need to take into account. What if the website goes down? What if the response returns unexpected data? What if your IP is throttled or blocked? What if authentication is required? While we can never predict and cover all what ifs, we will discuss some common traps, challenges, and workarounds. Note that several of the recipes require access to a website that I have provided as a Docker container. They require more logic than the simple, static site we used in earlier chapters. Therefore, you will need to pull and run a Docker container using the following Docker commands: docker pull mheydt/pywebscrapecookbook docker run -p 5001:5001 pywebscrapecookbook Retrying failed page downloads Failed page requests can be easily handled by Scrapy using retry middleware. When installed, Scrapy will attempt retries when receiving the following HTTP error codes: [500, 502, 503, 504, 408] The process can be further configured using the following parameters: RETRY_ENABLED (True/False - default is True) RETRY_TIMES (# of times to retry on any errors - default is 2) RETRY_HTTP_CODES (a list of HTTP error codes which should be retried - default is [500, 502, 503, 504, 408]) How to do it The 06/01_scrapy_retry.py script demonstrates how to configure Scrapy for retries. The script file contains the following configuration for Scrapy: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.retry.RetryMiddleware": 500 }, 'RETRY_ENABLED': True, 'RETRY_TIMES': 3 }) process.crawl(Spider) process.start() How it works Scrapy will pick up the configuration for retries as specified when the spider is run. When encountering errors, Scrapy will retry up to three times before giving up. Supporting page redirects Page redirects in Scrapy are handled using redirect middleware, which is enabled by default. The process can be further configured using the following parameters: REDIRECT_ENABLED: (True/False - default is True) REDIRECT_MAX_TIMES: (The maximum number of redirections to follow for any single request - default is 20) How to do it The script in 06/02_scrapy_redirects.py demonstrates how to configure Scrapy to handle redirects. This configures a maximum of two redirects for any page. Running the script reads the NASA sitemap and crawls that content. This contains a large number of redirects, many of which are redirects from HTTP to HTTPS versions of URLs. There will be a lot of output, but here are a few lines demonstrating the output: Parsing: <200 https://www.nasa.gov/content/earth-expeditions-above/> ['http://www.nasa.gov/content/earth-expeditions-above', 'https://www.nasa.gov/content/earth-expeditions-above'] This particular URL was processed after one redirection, from an HTTP to an HTTPS version of the URL. The list defines all of the URLs that were involved in the redirection. You will also be able to see where redirection exceeded the specified level (2) in the output pages. 
The following is one example: 2017-10-22 17:55:00 [scrapy.downloadermiddlewares.redirect] DEBUG: Discarding <GET http://www.nasa.gov/topics/journeytomars/news/index.html>: max redirections reached How it works The spider is defined as the following: class Spider(scrapy.spiders.SitemapSpider): name = 'spider' sitemap_urls = ['https://www.nasa.gov/sitemap.xml'] def parse(self, response): print("Parsing: ", response) print (response.request.meta.get('redirect_urls')) This is identical to our previous NASA sitemap based crawler, with the addition of one line printing the redirect_urls. In any call to parse, this metadata will contain all redirects that occurred to get to this page. The crawling process is configured with the following code: process = CrawlerProcess({ 'LOG_LEVEL': 'DEBUG', 'DOWNLOADER_MIDDLEWARES': { "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": 500 }, 'REDIRECT_ENABLED': True, 'REDIRECT_MAX_TIMES': 2 }) Redirect is enabled by default, but this sets the maximum number of redirects to 2 instead of the default of 20. Waiting for content to be available in Selenium A common problem with dynamic web pages is that even after the whole page has loaded, and hence the get() method in Selenium has returned, there still may be content that we need to access later as there are outstanding Ajax requests from the page that are still pending completion. An example of this is needing to click a button, but the button not being enabled until all data has been loaded asynchronously to the page after loading. Take the following page as an example: http://the-internet.herokuapp.com/dynamic_loading/2. This page finishes loading very quickly and presents us with a Start button: When pressing the button, we are presented with a progress bar for five seconds: And when this is completed, we are presented with Hello World! Now suppose we want to scrape this page to get the content that is exposed only after the button is pressed and after the wait? How do we do this? How to do it We can do this using Selenium. We will use two features of Selenium. The first is the ability to click on page elements. The second is the ability to wait until an element with a specific ID is available on the page. First, we get the button and click it. The button's HTML is the following: <div id='start'> <button>Start</button> </div> When the button is pressed and the load completes, the following HTML is added to the document: <div id='finish'> <h4>Hello World!"</h4> </div> We will use the Selenium driver to find the Start button, click it, and then wait until a div with an ID of 'finish' is available. Then we get that element and return the text in the enclosed <h4> tag. You can try this by running 06/03_press_and_wait.py. It's output will be the following: clicked Hello World! Now let's see how it worked. How it works Let us break down the explanation: We start by importing the required items from Selenium: from selenium import webdriver from selenium.webdriver.support import ui Now we load the driver and the page: driver = webdriver.PhantomJS() driver.get("http://the-internet.herokuapp.com/dynamic_loading/2") With the page loaded, we can retrieve the button: button = driver.find_element_by_xpath("//*/div[@id='start']/button") And then we can click the button: button.click() print("clicked") Next we create a WebDriverWait object: wait = ui.WebDriverWait(driver, 10) With this object, we can request Selenium's UI wait for certain events. This also sets a maximum wait of 10 seconds. 
Now using this, we can wait until we meet a criterion; that an element is identifiable using the following XPath: wait.until(lambda driver: driver.find_element_by_xpath("//*/div[@id='finish']")) When this completes, we can retrieve the h4 element and get its enclosing text: finish_element=driver.find_element_by_xpath("//*/div[@id='finish']/ h4") print(finish_element.text) Limiting crawling to a single domain We can inform Scrapy to limit the crawl to only pages within a specified set of domains. This is an important task, as links can point to anywhere on the web, and we often want to control where crawls end up going. Scrapy makes this very easy to do. All that needs to be done is setting the allowed_domains field of your scraper class. How to do it The code for this example is 06/04_allowed_domains.py. You can run the script with your Python interpreter. It will execute and generate a ton of output, but if you keep an eye on it, you will see that it only processes pages on nasa.gov. How it works The code is the same as previous NASA site crawlers except that we include allowed_domains=['nasa.gov']: class Spider(scrapy.spiders.SitemapSpider): name = 'spider' sitemap_urls = ['https://www.nasa.gov/sitemap.xml'] allowed_domains=['nasa.gov'] def parse(self, response): print("Parsing: ", response) The NASA site is fairly consistent with staying within its root domain, but there are occasional links to other sites such as content on boeing.com. This code will prevent moving to those external sites. Processing infinitely scrolling pages Many websites have replaced "previous/next" pagination buttons with an infinite scrolling mechanism. These websites use this technique to load more data when the user has reached the bottom of the page. Because of this, strategies for crawling by following the "next page" link fall apart. While this would seem to be a case for using browser automation to simulate the scrolling, it's actually quite easy to figure out the web pages' Ajax requests and use those for crawling instead of the actual page.  Let's look at spidyquotes.herokuapp.com/scroll as an example. Getting ready Open http://spidyquotes.herokuapp.com/scroll in your browser. This page will load additional content when you scroll to the bottom of the page: Screenshot of the quotes to scrape Once the page is open, go into your developer tools and select the network panel. Then, scroll to the bottom of the page. You will see new content in the network panel: When we click on one of the links, we can see the following JSON: { "has_next": true, "page": 2, "quotes": [{ "author": { "goodreads_link": "/author/show/82952.Marilyn_Monroe", "name": "Marilyn Monroe", "slug": "Marilyn-Monroe" }, "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"], "text": "u201cThis life is what you make it...." }, { "author": { "goodreads_link": "/author/show/1077326.J_K_Rowling", "name": "J.K. Rowling", "slug": "J-K-Rowling" }, "tags": ["courage", "friends"], "text": "u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.u201d" }, This is great because all we need to do is continually generate requests to /api/quotes?page=x, increasing x until the has_next tag exists in the reply document. If there are no more pages, then this tag will not be in the document. How to do it The 06/05_scrapy_continuous.py file contains a Scrapy agent, which crawls this set of pages. 
Run it with your Python interpreter and you will see output similar to the following (the following is multiple excerpts from the output): <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'Sisters']} 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']} 2017-10-29 16:17:37 [scrapy.core.scraper] DEBUG: Scraped from <200 http://spidyquotes.herokuapp.com/api/quotes?page=2> {'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'Understand']} When this gets to page 10 it will stop as it will see that there is no next page flag set in the Content. How it works Let's walk through the spider to see how this works. The spider starts with the following definition of the start URL: class Spider(scrapy.Spider): name = 'spidyquotes' quotes_base_url = 'http://spidyquotes.herokuapp.com/api/quotes' start_urls = [quotes_base_url] download_delay = 1.5 The parse method then prints the response and also parses the JSON into the data variable: def parse(self, response): print(response) data = json.loads(response.body) Then it loops through all the items in the quotes element of the JSON objects. For each item, it yields a new Scrapy item back to the Scrapy engine: for item in data.get('quotes', []): yield { 'text': item.get('text'), 'author': item.get('author', {}).get('name'), 'tags': item.get('tags'), } It then checks to see if the data JSON variable has a 'has_next' property, and if so it gets the next page and yields a new request back to Scrapy to parse the next page: if data['has_next']: next_page = data['page'] + 1 yield scrapy.Request(self.quotes_base_url + "?page=%s" % next_page) There's more... It is also possible to process infinite, scrolling pages using Selenium. 
The following code is in 06/06_scrape_continuous_twitter.py: from selenium import webdriver import time driver = webdriver.PhantomJS() print("Starting") driver.get("https://twitter.com") scroll_pause_time = 1.5 # Get scroll height last_height = driver.execute_script("return document.body.scrollHeight") while True: print(last_height) # Scroll down to bottom driver.execute_script("window.scrollTo(0, document.body.scrollHeight);") # Wait to load page time.sleep(scroll_pause_time) # Calculate new scroll height and compare with last scroll height new_height = driver.execute_script("return document.body.scrollHeight") print(new_height, last_height) if new_height == last_height: break last_height = new_height The output would be similar to the following: Starting 4882 8139 4882 8139 11630 8139 11630 15055 11630 15055 15055 15055 Process finished with exit code 0 This code starts by loading the page from Twitter. The call to .get() will return when the page is fully loaded. The scrollHeight is then retrieved, and the program scrolls to that height and waits for a moment for the new content to load. The scrollHeight of the browser is retrieved again, and if different than last_height, it will loop and continue processing. If the same as last_height, no new content has loaded and you can then continue on and retrieve the HTML for the completed page. We have discussed the common challenges faced in performing Web Scraping using Python and got to know their workaround. If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes to work with Python and perform efficient web scraping.

How to perform Audio-Video-Image Scraping with Python

Amarabha Banerjee
08 Mar 2018
9 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] A common practice in scraping is the download, storage, and further processing of media content (non-web pages or data files). This media can include images, audio, and video. To store the content locally (or in a service like S3) and to do it correctly, we need to know what is the type of media, and it isn’t enough to trust the file extension in the URL. Hence, we will learn how to download and correctly represent the media type based on information from the web server. Another common task is the generation of thumbnails of images, videos, or even a page of a website. We will examine several techniques of how to generate thumbnails and make website page screenshots. Many times these are used on a new website as thumbnail links to the scraped media which is stored locally. Finally, it is often the need to be able to transcode media, such as converting non-MP4 videos to MP4, or changing the bit-rate or resolution of a video. Another scenario is to extract only the audio from a video file. We won't look at video transcoding, but we will rip MP3 audio out of an MP4 file using ffmpeg. It's a simple step from there to also transcode video with ffmpeg. Downloading media content from the web Downloading media content from the web is a simple process: use Requests or another library and download it just like you would HTML content. Getting ready There is a class named URLUtility in the urls.py module in the util folder of the solution. This class handles several of the scenarios in this chapter with downloading and parsing URLs. We will be using this class in this recipe and a few others. Make sure the modules folder is in your Python path. Also, the example for this recipe is in the 04/01_download_image.py file. How to do it Here is how we proceed with the recipe: The URLUtility class can download content from a URL. The code in the recipe's file is the following: import const from util.urls import URLUtility util = URLUtility(const.ApodEclipseImage()) print(len(util.data)) When running this you will see the following output:  Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes 171014 The example reads 171014 bytes of data. How it works The URL is defined as a constant const.ApodEclipseImage() in the const module: def ApodEclipseImage(): return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg" The constructor of the URLUtility class has the following implementation: def __init__(self, url, readNow=True): """ Construct the object, parse the URL, and download now if specified""" self._url = url self._response = None self._parsed = urlparse(url) if readNow: self.read() The constructor stores the URL, parses it, and downloads the file with the read() method. The following is the code of the read() method: def read(self): self._response = urllib.request.urlopen(self._url) self._data = self._response.read() This function uses urlopen to get a response object, and then reads the stream and stores it as a property of the object. That data can then be retrieved using the data property: @property def data(self): self.ensure_response() return self._data The code then simply reports on the length of that data, with the value of 171014. 
There's more This class will be used for other tasks such as determining content types, filenames, and extensions for those files. We will examine parsing of URLs for filenames next. Parsing a URL with urllib to get the filename When downloading content from a URL, we often want to save it in a file. Often it is good enough to save the file with a name found in the URL. But the URL consists of a number of fragments, so how can we find the actual filename from the URL, especially where there are often many parameters after the file name? Getting ready We will again be using the URLUtility class for this task. The code file for the recipe is 04/02_parse_url.py. How to do it Execute the recipe's file with your Python interpreter. It will run the following code: util = URLUtility(const.ApodEclipseImage()) print(util.filename_without_ext) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The filename is: BT5643s How it works In the constructor for URLUtility, there is a call to urllib.parse.urlparse. The following demonstrates using the function interactively: >>> parsed = urlparse(const.ApodEclipseImage()) >>> parsed ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='') The ParseResult object contains the various components of the URL. The path element contains the path and the filename. The call to the .filename_without_ext property returns just the filename without the extension: @property def filename_without_ext(self): filename = os.path.splitext(os.path.basename(self._parsed.path))[0] return filename The call to os.path.basename returns only the filename portion of the path (including the extension). os.path.splitext() then separates the filename and the extension, and the function returns the first element of that tuple/list (the filename). There's more It may seem odd that this does not also return the extension as part of the filename. This is because we cannot assume that the content that we received actually matches the implied type from the extension. It is more accurate to determine this using headers returned by the web server. That's our next recipe. Determining the type of content for a URL When performing a GET request for content from a web server, the web server will return a number of headers, one of which identifies the type of the content from the perspective of the web server. In this recipe we learn to use that to determine what the web server considers the type of the content. Getting ready We again use the URLUtility class. The code for the recipe is in 04/03_determine_content_type_from_response.py. How to do it We proceed as follows: Execute the script for the recipe. It contains the following code: util = URLUtility(const.ApodEclipseImage()) print("The content type is: " + util.contenttype) With the following result: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes The content type is: image/jpeg How it works The .contenttype property is implemented as follows: @property def contenttype(self): self.ensure_response() return self._response.headers['content-type'] The .headers property of the _response object is a dictionary-like class of headers. The content-type key will retrieve the content-type specified by the server. The call to the ensure_response() method simply ensures that the .read() function has been executed. There's more The headers in a response contain a wealth of information.
If we look more closely at the headers property of the response, we can see the following headers are returned: >>> response = urllib.request.urlopen(const.ApodEclipseImage()) >>> for header in response.headers: print(header) Date Server Last-Modified ETag Accept-Ranges Content-Length Connection Content-Type Strict-Transport-Security And we can see the values for each of these headers: >>> for header in response.headers: print(header + " ==> " + response.headers[header]) Date ==> Tue, 26 Sep 2017 19:31:41 GMT Server ==> WebServer/1.0 Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT ETag ==> "547bb44-29c06-5581275ce2b86" Accept-Ranges ==> bytes Content-Length ==> 171014 Connection ==> close Content-Type ==> image/jpeg Strict-Transport-Security ==> max-age=31536000; includeSubDomains Many of these we will not examine in this book, but for the unfamiliar it is good to know that they exist. Determining the file extension from a content type It is good practice to use the content-type header to determine the type of content, and to determine the extension to use for storing the content as a file. Getting ready We again use the URLUtility object that we created. The recipe's script is 04/04_determine_file_extension_from_contenttype.py. How to do it Proceed by running the recipe's script. An extension for the media type can be found using the extension properties: util = URLUtility(const.ApodEclipseImage()) print("Filename from content-type: " + util.extension_from_contenttype) print("Filename from url: " + util.extension_from_url) This results in the following output: Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg Read 171014 bytes Filename from content-type: .jpg Filename from url: .jpg This reports both the extension determined from the content type and the extension determined from the URL. These can be different, but in this case they are the same. How it works The following is the implementation of the .extension_from_contenttype property: @property def extension_from_contenttype(self): self.ensure_response() map = const.ContentTypeToExtensions() if self.contenttype in map: return map[self.contenttype] return None The first line ensures that we have read the response from the URL. The function then uses a Python dictionary, defined in the const module, which maps content types to extensions: def ContentTypeToExtensions(): return { "image/jpeg": ".jpg", "image/jpg": ".jpg", "image/png": ".png" } If the content type is in the dictionary, then the corresponding value will be returned. Otherwise, None is returned. Note the corresponding property, .extension_from_url: @property def extension_from_url(self): ext = os.path.splitext(os.path.basename(self._parsed.path))[1] return ext This uses the same technique as the .filename_without_ext property to parse the URL, but instead returns the [1] element, which represents the extension instead of the base filename. To summarize, we discussed how effectively we can scrape audio, video, and image content from the web using Python. If you liked our post, be sure to check out Web Scraping with Python, which gives more information on performing web scraping efficiently with Python.


Data Exploration using Spark SQL

Packt
08 Mar 2018
9 min read
In this article by Aurobindo Sarkar, the author of the book Learning Spark SQL, we will be covering the following points to introduce you to using Spark SQL for exploratory data analysis: What is exploratory data analysis (EDA)? Why is EDA important? Using Spark SQL for basic data analysis Visualizing data with Apache Zeppelin Introducing Exploratory Data Analysis (EDA) Exploratory Data Analysis (EDA), or Initial Data Analysis (IDA), is an approach to data analysis that attempts to maximize insight into the data. This includes assessing the quality and structure of the data, calculating summary or descriptive statistics, and plotting appropriate graphs. It can uncover underlying structures and suggest how the data should be modeled. Furthermore, EDA helps us detect outliers, errors, and anomalies in our data, and deciding what to do about such data is often more important than other, more sophisticated analyses. EDA enables us to test our underlying assumptions, discover clusters and other patterns in our data, and identify possible relationships between various variables. A careful EDA process is vital to understanding the data and is sometimes sufficient to reveal data quality so poor that a more sophisticated model-based analysis is not justified. Typically, the graphical techniques used in EDA are simple, consisting of plotting the raw data and simple statistics. The focus is on the structures and models revealed by the data, or that best fit the data. EDA techniques include scatter plots, box plots, histograms, probability plots, and so on. In most EDA techniques, we use all of the data, without making any underlying assumptions. The analyst builds intuition, or gets a "feel" for the dataset, as a result of such exploration. More specifically, the graphical techniques allow us to efficiently select and validate appropriate models, test our assumptions, identify relationships, select estimators, and detect outliers. EDA involves a lot of trial and error, and several iterations. The best way is to start simple and then build in complexity as you go along. There is a major trade-off in modeling between simple models and more accurate ones. Simple models may be much easier to interpret and understand. These models can get you to 90% accuracy very quickly, versus a more complex model that might take weeks or months to get you an additional 2% improvement. For example, you should plot simple histograms and scatterplots to quickly start developing an intuition for your data. Using Spark SQL for basic data analysis Interactively processing and visualizing large data is challenging, as the queries can take a long time to execute and the visual interface cannot accommodate as many pixels as there are data points. Spark supports in-memory computations and a high degree of parallelism to achieve interactivity with large distributed data. In addition, Spark is capable of handling petabytes of data and provides a set of versatile programming interfaces and libraries. These include SQL, Scala, Python, Java, and R APIs, as well as libraries for distributed statistics and machine learning. For data that fits into a single computer there are many good tools available, such as R, MATLAB, and others. However, if the data does not fit into a single machine, or if it is very complicated to get the data onto that machine, or if a single computer cannot easily process the data, then this section will offer some good tools and techniques for data exploration. Here, we will do some basic data exploration exercises to understand a sample dataset.
We will use a dataset that contains data related to the direct marketing campaigns (phone calls) of a Portuguese banking institution. The marketing campaigns were based on phone calls to customers. We use the bank-additional-full.csv file, which contains 41188 records and 20 input fields, ordered by date (from May 2008 to November 2010). As a first step, let's define a schema and read in the CSV file to create a DataFrame. You can use :paste to paste in the initial set of statements in the Spark shell, as shown in the following figure: After the DataFrame is created, we first verify the number of records. We can also define a case class called Call for our input records, and then create a strongly-typed Dataset as follows: In the next section, we will begin our data exploration by identifying missing data in our dataset. Identifying missing data Missing data can occur in datasets for reasons ranging from negligence to a refusal on the part of respondents to provide a specific data point. In all cases, however, missing data is a common occurrence in real-world datasets. Missing data can create problems in data analysis and sometimes lead to wrong decisions or conclusions. Hence, it is very important to identify missing data and devise effective strategies for dealing with it. Here, we analyze the number of records with missing data fields in our sample dataset. In order to simulate missing data, we will edit our sample dataset by replacing fields containing "unknown" values with empty strings. First, we create a DataFrame / Dataset from our edited file, as shown in the following figure: The following two statements give us a count of rows with certain fields having missing data. Later, we will look at effective ways of dealing with missing data and compute some basic statistics for the sample dataset to improve our understanding of the data. Computing basic statistics Computing basic statistics is essential for a good preliminary understanding of our data. First, for convenience, we create a case class and a dataset containing a subset of fields from our original DataFrame. In the following example, we choose some of the numeric fields and the outcome field, that is, the "term deposit subscribed" field. Next, we use describe to quickly compute the count, mean, standard deviation, min, and max values for the numeric columns in our dataset. Further, we use the stat package to compute additional statistics such as covariance and correlation, to create crosstabs, to examine items that occur most frequently in data columns, and to compute quantiles. These computations are shown in the following figure: Next, we use the typed aggregation functions to summarize our data, to understand it better. In the following statement, we aggregate the results by whether a term deposit was subscribed, along with the total customers contacted, the average number of calls made per customer, the average duration of the calls, and the average number of previous calls made to such customers. The results are rounded to two decimal points. Similarly, executing the following statement gives similar results by customers' age. After getting a better understanding of our data by computing basic statistics, we shift our focus to identifying outliers in our data. Identifying data outliers An outlier, or an anomaly, is an observation of the data that deviates significantly from the other observations in the dataset. Erroneous outliers are observations that are distorted due to possible errors in the data-collection process.
These outliers may exert undue influence on the results of statistical analysis, so they should be identified using reliable detection methods prior to performing data analysis. Many techniques find outliers as a side-product of clustering algorithms: they define outliers as points that do not lie in clusters. The user has to model the data points using a statistical distribution, and the outliers are identified depending on how they appear in relation to the underlying model. The main problem with these approaches is that, during EDA, the user typically does not have enough knowledge about the underlying data distribution. EDA, using a modeling and visualizing approach, is a good way of achieving a deeper intuition of our data. Spark MLlib supports a large (and growing) set of distributed machine learning algorithms to make this task simpler. In the following example, we use the k-means clustering algorithm to compute two clusters in our data. Other distributed algorithms useful for EDA include classification, regression, dimensionality reduction, correlation, and hypothesis testing. Visualizing data with Apache Zeppelin Typically, we will generate many graphs to verify our hunches about the data. A lot of these quick and dirty graphs used during EDA are, ultimately, discarded. Exploratory data visualization is critical for data analysis and modeling. However, we often skip exploratory visualization with large data because it is hard. For instance, browsers typically cannot handle millions of data points. Hence, we have to summarize, sample, or model our data before we can effectively visualize it. Traditionally, BI tools provided extensive aggregation and pivoting features to visualize the data. However, these tools typically used nightly jobs to summarize large volumes of data. The summarized data was subsequently downloaded and visualized on the practitioner's workstation. Spark can eliminate many of these batch jobs to support interactive data visualization. Here, we will explore some basic data visualization techniques using Apache Zeppelin. Apache Zeppelin is a web-based tool that supports interactive data analysis and visualization. It supports several language interpreters and comes with built-in Spark integration. Hence, it is quick and easy to get started with exploratory data analysis using Apache Zeppelin. You can download Apache Zeppelin from https://zeppelin.apache.org/. Unzip the package on your hard drive and start Zeppelin using the following command: Aurobindos-MacBook-Pro-2:zeppelin-0.6.2-bin-allaurobindosarkar$ bin/zeppelin-daemon.sh start You should see the following message: Zeppelin start                                             [ OK] You should be able to see the Zeppelin home page at http://localhost:8080/. Click on the Create new note link, and specify a path and name for your notebook, as shown in the following figure: In the next step, we paste the same code as at the beginning of this article to create a DataFrame for our sample dataset. We can execute typical DataFrame operations, as shown in the following figure: Next, we create a table from our DataFrame and execute some SQL on it. The results of the SQL statement execution can be charted by clicking on the appropriate chart type required. Here, we create bar charts as an illustrative example of summarizing and visualizing data: We can also plot a scatter plot, and read the coordinate values of each of the plotted points, as shown in the following two figures.
Additionally, we can create a textbox that accepts input values to make the experience interactive. In the following figure we create a textbox that can accept different values for the age parameter, and the bar chart is updated accordingly. Similarly, we can also create dropdown lists where the user can select the appropriate option, and the table of values or the chart automatically gets updated. Summary In this article, we demonstrated using Spark SQL for exploring datasets, performing basic data quality checks, generating samples and pivot tables, and visualizing data with Apache Zeppelin. (Since the article's code appears only as figures, a rough code sketch of the main steps follows for reference.)
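The Spark statements used throughout this article appear only as Spark shell (Scala) figures, which are not reproduced here. As a rough reference sketch of the main steps described above (reading the CSV, computing basic statistics, aggregating by the outcome, registering a SQL view, and clustering with k-means), here is an equivalent in PySpark. The column names (age, duration, campaign, previous, y) and the semicolon separator follow the standard UCI bank marketing dataset and are assumptions, not the author's exact code.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("BankEDA").getOrCreate()

# Read the CSV into a DataFrame (the UCI file is semicolon-separated)
df = (spark.read
      .option("header", "true")
      .option("sep", ";")
      .option("inferSchema", "true")
      .csv("bank-additional-full.csv"))
print(df.count())                                   # verify the number of records

# Basic statistics for a few numeric columns
df.describe("age", "duration", "campaign", "previous").show()

# Aggregate by the outcome column (assumed to be named 'y')
(df.groupBy("y")
   .agg(F.count("*").alias("customers_contacted"),
        F.round(F.avg("campaign"), 2).alias("avg_calls"),
        F.round(F.avg("duration"), 2).alias("avg_duration"),
        F.round(F.avg("previous"), 2).alias("avg_previous_calls"))
   .show())

# Register a temporary view so the DataFrame can be queried with SQL, as done in Zeppelin
df.createOrReplaceTempView("bank")
spark.sql("SELECT y, count(*) AS cnt FROM bank GROUP BY y").show()

# k-means with two clusters on the numeric fields, as a crude outlier probe
assembler = VectorAssembler(inputCols=["age", "duration", "campaign", "previous"],
                            outputCol="features")
features_df = assembler.transform(df)
kmeans = KMeans(k=2, seed=1, featuresCol="features")
model = kmeans.fit(features_df)
clustered = model.transform(features_df)
clustered.groupBy("prediction").count().show()      # size of each cluster

The same operations map almost one-to-one onto the Scala Spark shell session shown in the article's figures, and each step can be run as a separate Zeppelin paragraph to produce the charts described above.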