





















































In this article we will use cURL to request and download a web page from a server.
<?php
// Function to make GET request using cURL
function curlGet($url) {
    $ch = curl_init(); // Initialising cURL session
    // Setting cURL options
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($ch, CURLOPT_URL, $url);
    $results = curl_exec($ch); // Executing cURL session
    curl_close($ch); // Closing cURL session
    return $results; // Return the results
}

$packtPage = curlGet('http://www.packtpub.com/oop-php-5/book');
echo $packtPage;
?>
Let's look at how we performed the previously defined steps:
$ch = curl_init();
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
// Tells cURL to return the results of the request (the source code
// of the target page) as a string.
curl_setopt($ch, CURLOPT_URL, $url);
// Here we tell cURL the URL we wish to request. Notice that it is
// the $url variable that we passed into the function as a parameter.
$results = curl_exec($ch);
curl_close($ch);
return $results;
$packtPage = curlGet('http://www.packtpub.com/oop-php-5/book');
echo $packtPage;
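The function above returns whatever curl_exec() gives back, which is FALSE when the request fails. As a minimal sketch, the variant below checks for that case and reports the message from curl_error(). The name curlGetChecked() and the choice to return null on failure are our own assumptions, not part of the original recipe.

```php
<?php
// Sketch only: curlGet() with basic failure reporting.
// curlGetChecked() and its null-on-failure convention are hypothetical.
function curlGetChecked($url) {
    $ch = curl_init();                               // Initialise the cURL session
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);  // Return the page as a string
    curl_setopt($ch, CURLOPT_URL, $url);             // The URL to request
    $results = curl_exec($ch);                       // false if the request failed
    // Read the error message before the session is closed
    $error = ($results === false) ? curl_error($ch) : '';
    curl_close($ch);

    if ($results === false) {
        error_log('cURL request failed: ' . $error);
        return null;
    }
    return $results;
}
```

A caller can then test the return value against null before trying to parse the page.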
There are a number of different HTTP request methods, which indicate to the server the desired response or the action to be performed. The request method used in this article is cURL's default, the GET request. This tells the server that we would like to retrieve a resource.
Depending on the resource we are requesting, a number of parameters may be passed in the URL. For example, when we perform a search on the Packt Publishing website for a query, say, php, we notice that the URL is http://www.packtpub.com/books?keys=php. This is requesting the resource books (the page that displays search results) and passing a value of php to the keys parameter, indicating that the dynamically generated page should show results for the search query php.
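Rather than concatenating parameters by hand, a URL such as the one above can be assembled with PHP's http_build_query(), which also takes care of URL encoding. This is a small sketch; the helper name buildSearchUrl() is hypothetical, while the base URL and the keys parameter come from the search example.

```php
<?php
// Sketch: build the search URL from the example above.
// buildSearchUrl() is an illustrative name, not part of the original recipe.
function buildSearchUrl($query) {
    $base = 'http://www.packtpub.com/books';
    // http_build_query() encodes the parameter name/value pairs for us
    return $base . '?' . http_build_query(array('keys' => $query));
}

echo buildSearchUrl('php'), PHP_EOL; // http://www.packtpub.com/books?keys=php
```

The resulting string can be passed straight to curlGet().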
Of the many cURL options available, only two have been used in our preceding code. They are CURLOPT_RETURNTRANSFER and CURLOPT_URL. Though we will cover many more throughout the course of this article, some other options to be aware of, that you may wish to try out, are listed in the following table:
Option Name | Value | Purpose |
CURLOPT_FAILONERROR | TRUE or FALSE | If a response code of 400 or greater is returned, cURL will fail instead of returning the page. |
CURLOPT_FOLLOWLOCATION | TRUE or FALSE | If Location: headers are sent by the server, follow the location. |
CURLOPT_USERAGENT | A user agent string, for example: 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:15.0) Gecko/20100101 Firefox/15.0.1' | Sending the user agent string in your request informs the target server which client is requesting the resource. Since many servers will only respond to 'legitimate' requests, it is advisable to include one. |
CURLOPT_HTTPHEADER | An array containing header information, for example: array('Cache-Control: max-age=0', 'Connection: keep-alive', 'Keep-Alive: 300', 'Accept-Language: en-us,en;q=0.5') | This option is used to send header information with the request, and we will come across use cases for this in later recipes. |
A full listing of cURL options can be found on the PHP website at http://php.net/manual/en/function.curl-setopt.php.
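To try the options from the table together, they can be set in one call with curl_setopt_array(). The following is a sketch rather than a recommended configuration: the function name curlGetWithOptions() is our own, and the user agent and header values are simply the examples from the table.

```php
<?php
// Sketch: the curlGet() function with the extra options from the table.
// curlGetWithOptions() is a hypothetical name used for illustration.
function curlGetWithOptions($url) {
    $ch = curl_init();
    curl_setopt_array($ch, array(
        CURLOPT_RETURNTRANSFER => true,   // Return the page as a string
        CURLOPT_URL            => $url,   // The URL to request
        CURLOPT_FAILONERROR    => true,   // Fail on response codes of 400 or greater
        CURLOPT_FOLLOWLOCATION => true,   // Follow Location: headers
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.5; rv:15.0) Gecko/20100101 Firefox/15.0.1',
        CURLOPT_HTTPHEADER     => array(
            'Cache-Control: max-age=0',
            'Accept-Language: en-us,en;q=0.5',
        ),
    ));
    $results = curl_exec($ch);            // false if the request failed
    curl_close($ch);
    return $results;
}
```

With CURLOPT_FAILONERROR enabled, a response code of 400 or greater makes curl_exec() return FALSE, so the caller should check for that before using the result.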
An HTTP response code is a number returned by the server that indicates the result of an HTTP request. Some common response code values are as follows:
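A handful of standard codes, with the reason phrases defined by the HTTP specification, can be kept in a lookup array for use in a scraper. The selection below and the helper names commonResponseCodes() and describeResponseCode() are our own, chosen for illustration.

```php
<?php
// Sketch: a lookup of a few standard HTTP response codes.
// Both function names are hypothetical; the selection of codes is ours.
function commonResponseCodes() {
    return array(
        200 => 'OK',
        301 => 'Moved Permanently',
        400 => 'Bad Request',
        401 => 'Unauthorized',
        403 => 'Forbidden',
        404 => 'Not Found',
        500 => 'Internal Server Error',
    );
}

// Returns the code with its reason phrase, or a note if it is not in our list
function describeResponseCode($code) {
    $codes = commonResponseCodes();
    return isset($codes[$code])
        ? $code . ' ' . $codes[$code]
        : $code . ' (not in this list)';
}

echo describeResponseCode(404), PHP_EOL; // 404 Not Found
```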
This article covers techniques for making a simple cURL request. It is often useful to have our scrapers react differently to different response codes, for example, letting us know if a web page has moved, is no longer accessible, or requires authorization that we do not have.
To do this, we can read the response code of a request by adding the following line to our function, after the call to curl_exec() and before curl_close(). It stores the response code in the $httpResponse variable:
$httpResponse = curl_getinfo($ch, CURLINFO_HTTP_CODE);
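Putting this line in context, the sketch below returns both the page and its response code, and shows one way a scraper might branch on the code. The names curlGetWithCode() and classifyResponse(), the array keys, and the category labels are all assumptions made for illustration.

```php
<?php
// Sketch: return the page body together with its HTTP response code.
// curlGetWithCode() and classifyResponse() are hypothetical helpers.
function curlGetWithCode($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $results = curl_exec($ch);
    // Must be read after curl_exec() and before the session is closed
    $httpResponse = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    return array('body' => $results, 'code' => $httpResponse);
}

// Map a response code to a coarse category the scraper can act on
function classifyResponse($code) {
    if ($code >= 200 && $code < 300) {
        return 'ok';
    }
    if ($code === 301 || $code === 302) {
        return 'moved';
    }
    if ($code === 401 || $code === 403) {
        return 'access denied';
    }
    if ($code === 404 || $code === 410) {
        return 'not accessible';
    }
    return 'other';
}
```

A scraper could call classifyResponse($response['code']) on the array returned by curlGetWithCode() and, for instance, log moved pages while skipping those that are not accessible.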