Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials

6719 Articles
article-image-introduction-raspberry-pi-zero-w-wireless
Packt
03 Mar 2018
14 min read
Save for later

Introduction to Raspberry Pi Zero W Wireless

Packt
03 Mar 2018
14 min read
In this article by Vasilis Tzivaras, the author of the book Raspberry Pi Zero W Wireless Projects, we will be covering the following topics:  An overview of the Raspberry Pi family  An introduction to the new Raspberry Pi Zero W Distributions  Common issues Raspberry Pi Zero W is the new product of the Raspberry Pi Zero family. In early 2017, Raspberry Pi community has announced a new board with wireless extension. It offers wireless functionality and now everyone can develop his own projects without cables and other components. Comparing the new board with Raspberry Pi 3 Model B we can easily see that it is quite smaller with many possibilities over the Internet of Things. But what is a Raspberry Pi Zero W and why do you need it? Let' s go though the rest of the family and introduce the new board. In the following article we will cover the following topics: (For more resources related to this topic, see here.) Raspberry Pi family As said earlier Raspberry Pi Zero W is the new member of Raspberry Pi family boards. All these years Raspberry Pi are evolving and become more user friendly with endless possibilities. Let's have a short look at the rest of the family so we can understand the difference of the Pi Zero board. Right now, the heavy board is named Raspberry Pi 3 Model B. It is the best solution for projects such as face recognition, video tracking, gaming or anything else that is demanding:                                      RASPBERRY PI 3 MODEL B It is the 3rd generation of Raspberry Pi boards after Raspberry Pi 2 and has the following specs:  A 1.2GHz 64-bit quad-core ARMv8 CPU 802.11n Wireless LAN Bluetooth 4.1 Bluetooth Low Energy (BLE)  Like the Pi 2, it also has 1GB RAM 4 USB ports 40 GPIO pins Full HDMI port Ethernet port Combined 3.5mm audio jack and composite video  Camera interface (CSI)  Display interface (DSI)  Micro SD card slot (now push-pull rather than push-push)  VideoCore IV 3D graphics core The next board is Raspberry Pi Zero, in which the Zero W was based. A small low cost and power board able to do many things:                                     Raspberry Pi Zero The specs of this board can be found as follows:  1GHz, Single-core CPU  512MB RAM  Mini-HDMI port Micro-USB OTG port  Micro-USB power  HAT-compatible 40-pin header  Composite video and reset headers  CSI camera connector (v1.3 only) At this point we should not forget to mention that apart from the boards mentioned earlier there are several other modules and components such as the Sense Hat or Raspberry Pi Touch Display available which will work great for advance projects. The 7″ Touchscreen Monitor for Raspberry Pi gives users the ability to create all-in-one, integrated projects such as tablets, infotainment systems and embedded projects:                                                        RASPBERRY PI Touch Display Where Sense HAT is an add-on board for Raspberry Pi, made especially for the Astro Pi mission. The Sense HAT has an 8×8 RGB LED matrix, a five-button joystick and includes the following sensors: Gyroscope Accelerometer  Magnetometer Temperature  Barometric pressure Humidity                                                                         sense HAT Stay tuned with more new boards and modules at the official website: https://www.raspberrypi.org/ Raspberry Pi Zero W Raspberry Pi Zero W is a small device that has the possibilities to be connected either on an external monitor or TV and of course it is connected to the internet. The operating system varies as there are many distros in the official page and almost everyone is baled on Linux systems.                                                        Raspberry Pi Zero W   With Raspberry Pi Zero W you have the ability to do almost everything, from automation to gaming! It is a small computer that allows you easily program with the help of the GPIO pins and some other components such as a camera. Its possibilities are endless! Specifications If you have bought Raspberry PI 3 Model B you would be familiar with Cypress CYW43438 wireless chip. It provides 802.11n wireless LAN and Bluetooth 4.0 connectivity. The new Raspberry Pi Zero W is equipped with that wireless chip as well. Following you can find the specifications of the new board: Dimensions: 65mm × 30mm × 5mm SoC:Broadcom BCM 2835 chip ARM11 at 1GHz, single core CPU 512ΜΒ RAM Storage: MicroSD card  Video and Audio:1080P HD video and stereo audio via mini-HDMI connector Power:5V, supplied via micro USB connector  Wireless:2.4GHz 802.11 n wireless LAN Bluetooth: Bluetooth classic 4.1 and Bluetooth Low Energy (BLE) Output: Micro USB  GPIO: 40-pin GPIO, unpopulated                                Raspberry Pi Zero W Notice that all the components are on the top side of the board so you can easily choose your case without any problems and keep it safe. As far as the antenna concern, it is formed by etching away copper on each layer of the PCB. It may not be visible as it is in other similar boards but it is working great and offers quite a lot functionalities:                  Raspberry Pi Zero W Capacitors Also, the product is limited to only one piece per buyer and costs 10$. You can buy a full kit with microsd card, a case and some more extra components for about 45$ or choose the camera full kit which contains a small camera component for 55$. Camera support Image processing projects such as video tracking or face recognition require a camera. Following you can see the official camera support of Raspberry Pi Zero W. The camera can easily be mounted at the side of the board using a cable like the Raspberry Pi 3 Model B board:The official Camera support of Raspberry Pi Zero W Depending on your distribution you many need to enable the camera though command line. More information about the usage of this module will be mentioned at the project. Accessories Well building projects with the new board there are some other gadgets that you might find useful working with. Following there is list of some crucial components. Notice that if you buy Raspberry Pi Zero W kit, it includes some of them. So, be careful and don't double buy them:  OTG cable  powerHUB GPIO header  microSD card and card adapter  HDMI to miniHDMI cable  HDMI to VGA cable Distributions The official site https://www.raspberrypi.org/downloads/ contains several distributions for downloading. The two basic operating systems that we will analyze after are RASPBIAN and NOOBS. Following you can see how the desktop environment looks like. Both RASPBIAN and NOOBS allows you to choose from two versions. There is the full version of the operating system and the lite one. Obviously the lite version does not contain everything that you might use so if you tend to use your Raspberry with a desktop environment choose and download the full version. On the other side if you tend to just ssh and do some basic stuff pick the lite one. It' s really up to you and of course you can easily download again anything you like and re-write your microSD card. NOOBS distribution Download NOOBS: https://www.raspberrypi.org/downloads/noobs/. NOOBS distribution is for the new users with not so much knowledge in linux systems and Raspberry PI boards. As the official page says it is really "New Out Of the Box Software". There is also pre-installed NOOBS SD cards that you can purchase from many retailers, such as Pimoroni, Adafruit, and The Pi Hut, and of course you can download NOOBS and write your own microSD card. If you are having trouble with the specific distribution take a look at the following links: Full guide at https://www.raspberrypi.org/learning/software-guide/. View the video at https://www.raspberrypi.org/help/videos/#noobs-setup. NOOBS operating system contains Raspbian and it provides various of other operating systems available to download. RASPBIAN distribution Download RASPBIAN: https://www.raspberrypi.org/downloads/raspbian/. Raspbian is the official supported operating system. It can be installed though NOOBS or be downloading the image file at the following link and going through the guide of the official website. Image file: https://www.raspberrypi.org/documentation/installation/installing-images/README.md. It has pre-installed plenty of software such as Python, Scratch, Sonic Pi, Java, Mathematica, and more! Furthermore, more distributions like Ubuntu MATE, Windows 10 IOT Core or Weather Station are meant to be installed for more specific projects like Internet of Things (IoT) or weather stations. To conclude with, the right distribution to install actually depends on your project and your expertise in Linux systems administration. Raspberry Pi Zero W needs an microSD card for hosting any operating system. You are able to write Raspbian, Noobs, Ubuntu MATE, or any other operating system you like. So, all that you need to do is simple write your operating system to that microSD card. First of all you have to download the image file from https://www.raspberrypi.org/downloads/ which, usually comes as a .zip file. Once downloaded, unzip the zip file, the full image is about 4.5 Gigabytes. Depending on your operating system you have to use different programs:  7-Zip for Windows  The Unarchiver for Mac  Unzip for Linux Now we are ready to write the image in the MicroSD card. You can easily write the .img file in the microSD card by following one of the next guides according to your system. For Linux users dd tool is recommended. Before connecting your microSD card with your adaptor in your computer run the following command:  df -h Now connect your card and run the same command again. You must see some new records. For example if the new device is called /dev/sdd1 keep in your mind that the card is at /dev/sdd (without the 1). The next step is to use the dd command and copy the image to the microSD card. We can do this by the following command:  dd if= of= Where if is the input file (image file or the distribution) and of is the output file (microSD card). Again be careful here and use only /dev/sdd or whatever is yours without any numbers. If you are having trouble with that please use the full manual at the following link https://www.raspberrypi.org/documentation/installation/installing-images/linux.md. A good tool that could help you out for that job is GParted. If it is not installed on your system you can easily install it with the following command:  sudo apt-get install gparted Then run sudogparted to start the tool. Its handles partitions very easily and you can format, delete or find information about all your mounted partitions. More information about ddcan be found here: https://www.raspberrypi.org/documentation/installation/installing-images/linux.md For Mac OS users dd tool is always recommended: https://www.raspberrypi.org/documentation/installation/installing-images/mac.md For Windows users Win32DiskImager utility is recommended: https://www.raspberrypi.org/documentation/installation/installing-images/windows.md There are several other ways to write an image file in a microSD card. So, if you are against any kind of problems when following the guides above feel free to use any other guide available on the Internet. Now, assuming that everything is ok and the image is ready. You can now gently plugin the microcard to your Raspberry PI Zero W board. Remember that you can always confirm that your download was successful with the sha1 code. In Linux systems you can use sha1sum followed by the file name (the image) and print the sha1 code that should and must be the same as it is at the end of the official page where you downloaded the image. Common issues Sometimes, working with Raspberry Pi boards can lead to issues. We all have faced some of them and hope to never face them again. The Pi Zero is so minimal and it can be tough to tell if it is working or not. Since, there is no LED on the board, sometimes a quick check if it is working properly or something went wrong is handy. Debugging steps With the following steps you will probably find its status: Take your board, with nothing in any slot or socket. Remove even the microSD card!  Take a normal micro-USB to USB-ADATA SYNC cable and connect the one side to your computer and the other side to the Pi's USB, (not the PWR_IN).  If the Zero is alive: • On Windows the PC will go ding for the presence of new hardware and you should see BCM2708 Boot in Device Manager. On Linux, with a ID 0a5c:2763 Broadcom Corp message from dmesg. Try to run dmesg in a Terminal before your plugin the USB and after that. You will find a new record there. Output example: [226314.048026] usb 4-2: new full-speed USB device number 82 using uhci_hcd [226314.213273] usb 4-2: New USB device found, idVendor=0a5c, idProduct=2763 [226314.213280] usb 4-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0 [226314.213284] usb 4-2: Product: BCM2708 Boot [226314.213] usb 4-2: Manufacturer: Broadcom If you see any of the preceding, so far so good, you know the Zero's not dead. microSD card issue Remember that if you boot your Raspberry and there is nothing working, you may have burned your microSD card wrong. This means that your card many not contain any boot partition as it should and it is not able to boot the first files. That problem occurs when the distribution is burned to /dev/sdd1 and not to /dev/sdd as we should. This is a quite common mistake and there will be no errors in your monitor. It will just not work! Case protection Raspberry Pi boards are electronics and we never place electronics in metallic surfaces or near magnetic objects. It will affect the booting operation of the Raspberry and it will probably not work. So a tip of advice, spend some extra money for the Raspberry PI Case and protect your board from anything like that. There are many problems and issues when hanging your raspberry pi using tacks. It may be silly, but there are many that do that. Summary Raspberry Pi Zero W is a new promising board allowing everyone to connect their devices to the Internet and use their skills to develop projects including software and hardware. This board is the new toy of any engineer interested in Internet of Things, security, automation and more! We have gone through an introduction in the new Raspberry Pi Zero board and the rest of its family and a brief analysis on some extra components that you should buy as well. Resources for Article:   Further resources on this subject: Raspberry Pi Zero W Wireless Projects Full Stack Web Development with Raspberry Pi 3
Read more
  • 0
  • 0
  • 7713

article-image-internationalization-and-localization
Packt
03 Mar 2018
16 min read
Save for later

Internationalization and localization

Packt
03 Mar 2018
16 min read
In this article by Dmitry Sheiko, the author of the book, Cross Platform Desktop Application Development: Electron, Node, NW.js and React, will cover the concept of Internationalization and localization and will be also covering context menu and system clipboard in detail. Internationalization, often abbreviated as i18n, implies a particular software design capable of adapting to the requirements of target local markets. In other words if we want to distribute our application to the markets other than USA we need to take care of translations, formatting of datetime, numbers, addresses, and such. (For more resources related to this topic, see here.) Date format by country Internationalization is a cross-cutting concern. When you are changing the locale it usually affects multiple modules. So I suggest going with the observer pattern that we already examined while working on DirService'. The ./js/Service/I18n.js file contains the following code: const EventEmitter = require( "events" ); class I18nService extends EventEmitter { constructor(){ super(); this.locale = "en-US"; } Internationalization and localization [ 2 ] notify(){ this.emit( "update" ); } } As you see, we can change the locale by setting a new value to locale property. As soon as we call notify method, then all the subscribed modules immediately respond. But locale is a public property and therefore we have no control on its access and mutation. We can fix it by using overloading. The ./js/Service/I18n.js file contains the following code: //... constructor(){ super(); this._locale = "en-US"; } get locale(){ return this._locale; } set locale( locale ){ // validate locale... this._locale = locale; } //... Now if we access locale property of I18n instance it gets delivered by the getter (get locale). When setting it a value, it goes through the setter (set locale). Thus we can add extra functionality such as validation and logging on property access and mutation. Remember we have in the HTML, a combobox for selecting language. Why not give it a view? The ./js/View/LangSelector.j file contains the following code: class LangSelectorView { constructor( boundingEl, i18n ){ boundingEl.addEventListener( "change", this.onChanged.bind( this ), false ); this.i18n = i18n; } onChanged( e ){ const selectEl = e.target; this.i18n.locale = selectEl.value; this.i18n.notify(); } } Internationalization and localization [ 3 ] exports.LangSelectorView = LangSelectorView; In the preceding code, we listen for change events on the combobox. When the event occurs we change locale property of the passed in I18n instance and call notify to inform the subscribers. The ./js/app.js file contains the following code: const i18nService = new I18nService(), { LangSelectorView } = require( "./js/View/LangSelector" ); new LangSelectorView( document.querySelector( "[data-bind=langSelector]" ), i18nService ); Well, we can change the locale and trigger the event. What about consuming modules? In FileList view we have static method formatTime that formats the passed in timeString for printing. We can make it formated in accordance with currently chosen locale. The ./js/View/FileList.js file contains the following code: constructor( boundingEl, dirService, i18nService ){ //... this.i18n = i18nService; // Subscribe on i18nService updates i18nService.on( "update", () => this.update( dirService.getFileList() ) ); } static formatTime( timeString, locale ){ const date = new Date( Date.parse( timeString ) ), options = { year: "numeric", month: "numeric", day: "numeric", hour: "numeric", minute: "numeric", second: "numeric", hour12: false }; return date.toLocaleString( locale, options ); } update( collection ) { //... this.el.insertAdjacentHTML( "beforeend", `<li class="file-list__li" data-file="${fInfo.fileName}"> <span class="file-list__li__name">${fInfo.fileName}</span> <span class="filelist__li__size">${filesize(fInfo.stats.size)}</span> <span class="file-list__li__time">${FileListView.formatTime( fInfo.stats.mtime, this.i18n.locale )}</span> </li>` ); //... } //... In the constructor, we subscribe for I18n update event and update the file list every time the locale changes. Static method formatTime converts passed in string into a Date object and uses Date.prototype.toLocaleString() method to format the datetime according to a given locale. This method belongs to so called The ECMAScript Internationalization API (http://norbertlindenberg.com/2012/12/ecmascript-internationalization-api/index .html). The API describes methods of built-in object String, Date and Number designed to format and compare localized data. But what it really does is formatting a Date instance with toLocaleString for the English (United States) locale ("en-US") and it returns the date as follows: 3/17/2017, 13:42:23 However if we feed to the method German locale ("de-DE") we get quite a different result: 17.3.2017, 13:42:23 To put it into action we set an identifier to the combobox. The ./index.html file contains the following code: .. <select class="footer__select" data-bind="langSelector"> .. And of course, we have to create an instance of I18n service and pass it in LangSelectorView and FileListView: ./js/app.js // ... const { I18nService } = require( "./js/Service/I18n" ), { LangSelectorView } = require( "./js/View/LangSelector" ), i18nService = new I18nService(); new LangSelectorView( document.querySelector( "[data-bind=langSelector]" ), i18nService ); // ... new FileListView( document.querySelector( "[data-bind=fileList]" ), dirService, i18nService ); Now we start the application. Yeah! As we change the language in the combobox the file modification dates adjust accordingly: Multilingual support Localization dates and number is a good thing, but it would be more exciting to provide translation to multiple languages. We have a number of terms across the application, namely the column titles of the file list and tooltips (via title attribute) on windowing action buttons. What we need is a dictionary. Normally it implies sets of token translation pairs mapped to language codes or locales. Thus when you request from the translation service a term, it can correlate to a matching translation according to currently used language/locale. Here I have suggested making the dictionary as a static module that can be loaded with the required function. The ./js/Data/dictionary.js file contains the following code: exports.dictionary = { "en-US": { NAME: "Name", SIZE: "Size", MODIFIED: "Modified", MINIMIZE_WIN: "Minimize window", Internationalization and localization [ 6 ] RESTORE_WIN: "Restore window", MAXIMIZE_WIN: "Maximize window", CLOSE_WIN: "Close window" }, "de-DE": { NAME: "Dateiname", SIZE: "Grösse", MODIFIED: "Geändert am", MINIMIZE_WIN: "Fenster minimieren", RESTORE_WIN: "Fenster wiederherstellen", MAXIMIZE_WIN: "Fenster maximieren", CLOSE_WIN: "Fenster schliessen" } }; So we have two locales with translations per term. We are going to inject the dictionary as a dependency into our I18n service. The ./js/Service/I18n.js file contains the following code: //... constructor( dictionary ){ super(); this.dictionary = dictionary; this._locale = "en-US"; } translate( token, defaultValue ) { const dictionary = this.dictionary[ this._locale ]; return dictionary[ token ] || defaultValue; } //... We also added a new method translate that accepts two parameters: token and default translation. The first parameter can be one of the keys from the dictionary like NAME. The second one is guarding value for the case when requested token does not yet exist in the dictionary. Thus we still get a meaningful text at least in English. Let's see how we can use this new method. The ./js/View/FileList.js file contains the following code: //... update( collection ) { this.el.innerHTML = `<li class="file-list__li file-list__head"> <span class="file-list__li__name">${this.i18n.translate( "NAME", "Name" )}</span> <span class="file-list__li__size">${this.i18n.translate( "SIZE", Internationalization and localization [ 7 ] "Size" )}</span> <span class="file-list__li__time">${this.i18n.translate( "MODIFIED", "Modified" )}</span> </li>`; //... We change in FileList view hardcoded column titles with calls for translate method of I18n instance, meaning that every time view updates it receives the actual translations. We shall not forget about TitleBarActions view where we have windowing action buttons. The ./js/View/TitleBarActions.js file contains the following code: constructor( boundingEl, i18nService ){ this.i18n = i18nService; //... // Subscribe on i18nService updates i18nService.on( "update", () => this.translate() ); } translate(){ this.unmaximizeEl.title = this.i18n.translate( "RESTORE_WIN", "Restore window" ); this.maximizeEl.title = this.i18n.translate( "MAXIMIZE_WIN", "Maximize window" ); this.minimizeEl.title = this.i18n.translate( "MINIMIZE_WIN", "Minimize window" ); this.closeEl.title = this.i18n.translate( "CLOSE_WIN", "Close window" ); } Here we add method translate, which updates button title attributes with actual translations. We subscribe for i18n update event to call the method every time user changes locale:   Context menu Well, with our application we can already navigate through the file system and open files. Yet, one might expect more of a File Explorer. We can add some file related actions like delete, copy/paste. Usually these tasks are available via the context menu, what gives us a good opportunity to examine how to make it with NW.js. With the environment integration API we can create an instance of system menu (http://docs.nwjs.io/en/latest/References/Menu/). Then we compose objects representing menu items and attach them to the menu instance (http://docs.nwjs.io/en/latest/References/MenuItem/). This menu can be shown in an arbitrary position: const menu = new nw.Menu(), menutItem = new nw.MenuItem({ label: "Say hello", click: () => console.log( "hello!" ) }); menu.append( menu ); menu.popup( 10, 10 ); Yet our task is more specific. We have to display the menu on the right mouse click in the position of the cursor. That is, we achieve by subscribing a handler to contextmenu DOM event: document.addEventListener( "contextmenu", ( e ) => { console.log( `Show menu in position ${e.x}, ${e.y}` ); }); Now whenever we right-click within the application window the menu shows up. It's not exactly what we want, isn't it? We need it only when the cursor resides within a particular region. For an instance, when it hovers a file name. That means we have to test if the target element matches our conditions: document.addEventListener( "contextmenu", ( e ) => { const el = e.target; if ( el instanceof HTMLElement && el.parentNode.dataset.file ) { console.log( `Show menu in position ${e.x}, ${e.y}` ); } }); Here we ignore the event until the cursor hovers any cell of file table row, given every row is a list item generated by FileList view and therefore provided with a value for data file attribute. This passage explains pretty much how to build a system menu and how to attach it to the file list. But before starting on a module capable of creating menu, we need a service to handle file operations. The ./js/Service/File.js file contains the following code: const fs = require( "fs" ), path = require( "path" ), // Copy file helper cp = ( from, toDir, done ) => { const basename = path.basename( from ), to = path.join( toDir, basename ), write = fs.createWriteStream( to ) ; fs.createReadStream( from ) .pipe( write ); write .on( "finish", done ); }; class FileService { Internationalization and localization [ 10 ] constructor( dirService ){ this.dir = dirService; this.copiedFile = null; } remove( file ){ fs.unlinkSync( this.dir.getFile( file ) ); this.dir.notify(); } paste(){ const file = this.copiedFile; if ( fs.lstatSync( file ).isFile() ){ cp( file, this.dir.getDir(), () => this.dir.notify() ); } } copy( file ){ this.copiedFile = this.dir.getFile( file ); } open( file ){ nw.Shell.openItem( this.dir.getFile( file ) ); } showInFolder( file ){ nw.Shell.showItemInFolder( this.dir.getFile( file ) ); } }; exports.FileService = FileService; What's going on here? FileService receives an instance of DirService as a constructor argument. It uses the instance to obtain the full path to a file by name ( this.dir.getFile( file ) ). It also exploits notify method of the instance to request all the views subscribed to DirService to update. Method showInFolder calls the corresponding method of nw.Shell to show the file in the parent folder with the system file manager. As you can recon method remove deletes the file. As for copy/paste we do the following trick. When user clicks copy we store the target file path in property copiedFile. So when user next time clicks paste we can use it to copy that file to the supposedly changed current location. Method open evidently opens file with the default associated program. That is what we do in FileList view directly. Actually this action belongs to FileService. So we rather refactor the view to use the service. The ./js/View/FileList.js file contains the following code: constructor( boundingEl, dirService, i18nService, fileService ){ this.file = fileService; //... } Internationalization and localization [ 11 ] bindUi(){ //... this.file.open( el.dataset.file ); //... } Now we have a module to handle context menu for a selected file. The module will subscribe for contextmenu DOM event and build a menu when user right clicks on a file. This menu will contain items Show Item in the Folder, Copy, Paste, and Delete. Whereas copy and paste are separated from other items with delimiters. Besides, Paste will be disabled until we store a file with copy. Further goes the source code. The ./js/View/ContextMenu.js file contains the following code: class ConextMenuView { constructor( fileService, i18nService ){ this.file = fileService; this.i18n = i18nService; this.attach(); } getItems( fileName ){ const file = this.file, isCopied = Boolean( file.copiedFile ); return [ { label: this.i18n.translate( "SHOW_FILE_IN_FOLDER", "Show Item in the Folder" ), enabled: Boolean( fileName ), click: () => file.showInFolder( fileName ) }, { type: "separator" }, { label: this.i18n.translate( "COPY", "Copy" ), enabled: Boolean( fileName ), click: () => file.copy( fileName ) }, { label: this.i18n.translate( "PASTE", "Paste" ), enabled: isCopied, click: () => file.paste() }, { type: "separator" }, { Internationalization and localization [ 12 ] label: this.i18n.translate( "DELETE", "Delete" ), enabled: Boolean( fileName ), click: () => file.remove( fileName ) } ]; } render( fileName ){ const menu = new nw.Menu(); this.getItems( fileName ).forEach(( item ) => menu.append( new nw.MenuItem( item ))); return menu; } attach(){ document.addEventListener( "contextmenu", ( e ) => { const el = e.target; if ( !( el instanceof HTMLElement ) ) { return; } if ( el.classList.contains( "file-list" ) ) { e.preventDefault(); this.render() .popup( e.x, e.y ); } // If a child of an element matching [data-file] if ( el.parentNode.dataset.file ) { e.preventDefault(); this.render( el.parentNode.dataset.file ) .popup( e.x, e.y ); } }); } } exports.ConextMenuView = ConextMenuView; So in ConextMenuView constructor, we receive instances of FileService and I18nService. During the construction we also call attach method that subscribes for contextmenu DOM event, creates the menu and shows it in the position of the mouse cursor. The event gets ignored unless the cursor hovers a file or resides in empty area of the file list component. When user right clicks the file list, the menu still appears, but with all items disable except paste (in case a file was copied before). Method render create an instance of menu and populates it with nw.MenuItems created by getItems method. The method creates an array representing menu items. Elements of the array are object literals. Internationalization and localization [ 13 ] Property label accepts translation for item caption. Property enabled defines the state of item depending on our cases (whether we have copied file or not, whether the cursor on a file or not). Finally property click expects the handler for click event. Now we need to enable our new components in the main module. The ./js/app.js file contains the following code: const { FileService } = require( "./js/Service/File" ), { ConextMenuView } = require( "./js/View/ConextMenu" ), fileService = new FileService( dirService ); new FileListView( document.querySelector( "[data-bind=fileList]" ), dirService, i18nService, fileService ); new ConextMenuView( fileService, i18nService ); Let's now run the application, right-click on a file and voilà! We have the context menu and new file actions. System clipboard Usually Copy/Paste functionality involves system clipboard. NW.js provides an API to control it (http://docs.nwjs.io/en/latest/References/Clipboard/). Unfortunately it's quite limited, we cannot transfer an arbitrary file between applications, what you may expect of a file manager. Yet some things we are still available to us. Transferring text In order to examine text transferring with the clipboard we modify the method copy of FileService: copy( file ){ this.copiedFile = this.dir.getFile( file ); const clipboard = nw.Clipboard.get(); clipboard.set( this.copiedFile, "text" ); } What does it do? As soon as we obtained file full path, we create an instance of nw.Clipboard and save the file path there as a text. So now, after copying a file within the File Explorer we can switch to an external program (for example, a text editor) and paste the copied path from the clipboard. Transferring graphics It doesn't look very handy, does it? It would be more interesting if we could copy/paste a file. Unfortunately NW.js doesn't give us many options when it comes to file exchange. Yet we can transfer between NW.js application and external programs PNG and JPEG images. The ./js/Service/File.js file contains the following code: //... copyImage( file, type ){ const clip = nw.Clipboard.get(), // load file content as Base64 data = fs.readFileSync( file ).toString( "base64" ), // image as HTML html = `<img src="file:///${encodeURI( data.replace( /^//, "" ) )}">`; // write both options (raw image and HTML) to the clipboard clip.set([ Internationalization and localization [ 16 ] { type, data: data, raw: true }, { type: "html", data: html } ]); } copy( file ){ this.copiedFile = this.dir.getFile( file ); const ext = path.parse( this.copiedFile ).ext.substr( 1 ); switch ( ext ){ case "jpg": case "jpeg": return this.copyImage( this.copiedFile, "jpeg" ); case "png": return this.copyImage( this.copiedFile, "png" ); } } //... We extended our FileService with private method copyImage. It reads a given file, converts its contents in Base64 and passes the resulting code in a clipboard instance. In addition, it creates HTML with image tag with Base64-encoded image in data Uniform Resource Identifier (URI). Now after copying an image (PNG or JPEG) in the File Explorer, we can paste it in an external program such as graphical editor or text processor. Receiving text and graphics We've learned how to pass a text and graphics from our NW.js application to external programs. But how can we receive data from outside? As you can guess it is accessible through get method of nw.Clipboard. Text can be retrieved that simple: const clip = nw.Clipboard.get(); console.log( clip.get( "text" ) ); When graphic is put in the clipboard we can get it with NW.js only as Base64-encoded content or as HTML. To see it in practice we add a few methods to FileService. The ./js/Service/File.js file contains the following code: //... hasImageInClipboard(){ const clip = nw.Clipboard.get(); return clip.readAvailableTypes().indexOf( "png" ) !== -1; } pasteFromClipboard(){ const clip = nw.Clipboard.get(); if ( this.hasImageInClipboard() ) { Internationalization and localization [ 17 ] const base64 = clip.get( "png", true ), binary = Buffer.from( base64, "base64" ), filename = Date.now() + "--img.png"; fs.writeFileSync( this.dir.getFile( filename ), binary ); this.dir.notify(); } } //... Method hasImageInClipboard checks if the clipboard keeps any graphics. Method pasteFromClipboard takes graphical content from the clipboard as Base64-encoded PNG. It converts the content into binary code, writes into a file and requests DirService subscribers to update. To make use of these methods we need to edit ContextMenu view. The ./js/View/ContextMenu.js file contains the following code: getItems( fileName ){ const file = this.file, isCopied = Boolean( file.copiedFile ); return [ //... { label: this.i18n.translate( "PASTE_FROM_CLIPBOARD", "Paste image from clipboard" ), enabled: file.hasImageInClipboard(), click: () => file.pasteFromClipboard() }, //... ]; } We add to the menu a new item Paste image from clipboard, which is enabled only when there is any graphic in the clipboard. Summary In this article, we have covered concept of internationalization and localization and also covered context menu and system clipboard in detail. Resources for Article:   Further resources on this subject: [article] [article] [article]
Read more
  • 0
  • 0
  • 2658

article-image-compute-discrete-fourier-transform-dft-using-scipy
Pravin Dhandre
02 Mar 2018
5 min read
Save for later

How to compute Discrete Fourier Transform (DFT) using SciPy

Pravin Dhandre
02 Mar 2018
5 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book co-authored by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides numerous recipes to tackle day-to-day challenges associated with scientific computing and data manipulation using SciPy stack.[/box] Today, we will compute Discrete Fourier Transform (DFT) and inverse DFT using SciPy stack. In this article, we will focus majorly on the syntax and the application of DFT in SciPy assuming you are well versed with the mathematics of this concept. Discrete Fourier Transforms   A discrete Fourier transform transforms any signal from its time/space domain into a related signal in frequency domain. This allows us to not only analyze the different frequencies of the data, but also enables faster filtering operations, when used properly. It is possible to turn a signal in a frequency domain back to its time/spatial domain, thanks to inverse Fourier transform (IFT). How to do it… To follow with the example, we need to continue with the following steps: The basic routines in the scipy.fftpack module compute the DFT and its inverse, for discrete signals in any dimension—fft, ifft (one dimension), fft2, ifft2 (two dimensions), and fftn, ifftn (any number of dimensions). Verify all these routines assume that the data is complex valued. If we know beforehand that a particular dataset is actually real-valued, and should offer realvalued frequencies, we use rfft and irfft instead, for a faster algorithm. In order to complete with this, these routines are designed so that composition with their inverses always yields the identity. The syntax is the same in all cases, as follows: fft(x[, n, axis, overwrite_x]) The first parameter, x, is always the signal in any array-like form. Note that fft performs one-dimensional transforms. This means that if x happens to be two-dimensional, for example, fft will output another two-dimensional array, where each row is the transform of each row of the original. We can use columns instead, with the optional axis parameter. The rest of the parameters are also optional; n indicates the length of the transform and overwrite_x gets rid of the original data to save memory and resources. We usually play with the n integer when we need to pad the signal with zeros or truncate it. For a higher dimension, n is substituted by shape (a tuple) and axis by axes (another tuple). To better understand the output, it is often useful to shift the zero frequencies to the center of the output arrays with ifftshift. The inverse of this operation, ifftshift, is also included in the module. How it works… The following code shows some of these routines in action when applied to a checkerboard: import numpy from scipy.fftpack import fft,fft2, fftshift import matplotlib.pyplot as plt B=numpy.ones((4,4)); W=numpy.zeros((4,4)) signal = numpy.bmat("B,W;W,B") onedimfft = fft(signal,n=16) twodimfft = fft2(signal,shape=(16,16)) plt.figure() plt.gray() plt.subplot(121,aspect='equal') plt.pcolormesh(onedimfft.real) plt.colorbar(orientation='horizontal') plt.subplot(122,aspect='equal') plt.pcolormesh(fftshift(twodimfft.real)) plt.colorbar(orientation='horizontal') plt.show() Note how the first four rows of the one-dimensional transform are equal (and so are the last four), while the two-dimensional transform (once shifted) presents a peak at the origin and nice symmetries in the frequency domain. In the following screenshot, which has been obtained from the previous code, the image on the left is the fft and the one on the right is the fft2 of a 2 x 2 checkerboard signal: Computing the discrete Fourier transform (DFT) of a data series using the FFT Algorithm In this section, we will see how to compute the discrete Fourier transform and some of its Applications. How to do it… In the following table, we will see the parameters to create a data series using the FFT algorithm: How it works… This code represents computing an FFT discrete Fourier in the main part: np.fft.fft(np.exp(2j * np.pi * np.arange(8) / 8)) array([ -3.44505240e-16 +1.14383329e-17j, 8.00000000e+00 -5.71092652e-15j, 2.33482938e-16 +1.22460635e-16j, 1.64863782e-15 +1.77635684e-15j, 9.95839695e-17 +2.33482938e-16j, 0.00000000e+00 +1.66837030e-15j, 1.14383329e-17 +1.22460635e-16j, -1.64863782e-15 +1.77635684e-15j]) In this example, real input has an FFT that is Hermitian, that is, symmetric in the real part and anti-symmetric in the imaginary part, as described in the numpy.fft documentation. import matplotlib.pyplot as plt t = np.arange(256) sp = np.fft.fft(np.sin(t)) freq = np.fft.fftfreq(t.shape[-1]) plt.plot(freq, sp.real, freq, sp.imag) [<matplotlib.lines.Line2D object at 0x...>, <matplotlib.lines.Line2D object at 0x...>] plt.show() The following screenshot shows how we represent the results: Computing the inverse DFT of a data series In this section, we will learn how to compute the inverse DFT of a data series. How to do it… In this section we will see how to compute the inverse Fourier transform. The returned complex array contains y(0), y(1),..., y(n-1) where: How it works… In this part, we represent the calculous of the DFT: np.fft.ifft([0, 4, 0, 0]) array([ 1.+0.j, 0.+1.j, -1.+0.j, 0.-1.j]) Create and plot a band-limited signal with random phases: import matplotlib.pyplot as plt t = np.arange(400) n = np.zeros((400,), dtype=complex) n[40:60] = np.exp(1j*np.random.uniform(0, 2*np.pi, (20,))) s = np.fft.ifft(n) plt.plot(t, s.real, 'b-', t, s.imag, 'r--') plt.legend(('real', 'imaginary')) plt.show() Then we represent it, as shown in the following screenshot:   We successfully explored how to transform signals from time or space domain into frequency domain and vice-versa, allowing you to analyze frequencies in detail. If you found this tutorial useful, do check out the book SciPy Recipes to get hands-on recipes to perform various data science tasks with ease.    
Read more
  • 0
  • 1
  • 15521

article-image-how-to-use-mapreduce-with-mongo-shell
Amey Varangaonkar
02 Mar 2018
8 min read
Save for later

How to use MapReduce with Mongo shell

Amey Varangaonkar
02 Mar 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x authored by Alex Giamas. This book demonstrates the power of MongoDB to build high performance database solutions with ease.[/box] MongoDB is one of the most popular NoSQL databases in the world and can be combined with various Big Data tools for efficient data processing. In this article we explore interesting features of MongoDB, which has been underappreciated and not widely supported throughout the industry as yet - the ability to write MapReduce natively using shell. MapReduce is a data processing method for getting aggregate results from a large set of data. The main advantage is that it is inherently parallelizable as evidenced by frameworks such as Hadoop. A simple example of MapReduce would be as follows, given that our input books collection is as follows: > db.books.find() { "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30 } { "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50 } { "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40 } And our map and reduce functions are defined as follows: > var mapper = function() { emit(this.id, 1); }; In this mapper, we simply output a key of the id of each document with a value of 1: > var reducer = function(id, count) { return Array.sum(count); }; In the reducer, we sum across all values (where each one has a value of 1): > db.books.mapReduce(mapper, reducer, { out:"books_count" }); { "result" : "books_count", "timeMillis" : 16613, "counts" : { "input" : 3, "emit" : 3, "reduce" : 1, "output" : 1 }, "ok" : 1 } > db.books_count.find() { "_id" : null, "value" : 3 } > Our final output is a document with no ID, since we didn't output any value for id, and a value of 6, since there are six documents in the input dataset. Using MapReduce, MongoDB will apply map to each input document, emitting key-value pairs at the end of the map phase. Then each reducer will get key-value pairs with the same key as input, processing all multiple values. The reducer's output will be a single key-value pair for each key. Optionally, we can use a finalize function to further process the results of the mapper and reducer. MapReduce functions use JavaScript and run within the mongod process. MapReduce can output inline as a single document, subject to the 16 MB document size limit, or as multiple documents in an output collection. Input and output collections can be sharded. MapReduce concurrency MapReduce operations will place several short-lived locks that should not affect operations. However, at the end of the reduce phase, if we are outputting data to an existing collection, then output actions such as merge, reduce, and replace will take an exclusive global write lock for the whole server, blocking all other writes in the db instance. If we want to avoid that we should invoke MapReduce in the following way: > db.collection.mapReduce( Mapper, Reducer, { out: { merge/reduce: bookOrders, nonAtomic: true } }) We can apply nonAtomic only to merge or reduce actions. replace will just replace the contents of documents in bookOrders, which would not take much time anyway. With the merge action, the new result is merged with the existing result if the output collection already exists. If an existing document has the same key as the new result, then it will overwrite that existing document. With the reduce action, the new result is processed together with the existing result if the output collection already exists. If an existing document has the same key as the new result, it will apply the reduce function to both the new and the existing documents and overwrite the existing document with the result. Although MapReduce has been present since the early versions of MongoDB, it hasn't evolved as much as the rest of the database, resulting in its usage being less than that of specialized MapReduce frameworks such as Hadoop. Incremental MapReduce Incremental MapReduce is a pattern where we use MapReduce to aggregate to previously calculated values. An example would be counting non-distinct users in a collection for different reporting periods (that is, hour, day, month) without the need to recalculate the result every hour. To set up our data for incremental MapReduce we need to do the following: Output our reduce data to a different collection At the end of every hour, query only for the data that got into the collection in the last hour With the output of our reduce data, merge our results with the calculated results from the previous hour Following up on the previous example, let's assume that we have a published field in each of the documents, with our input dataset being: > db.books.find() { "_id" : ObjectId("592149c4aabac953a3a1e31e"), "isbn" : "101", "name" : "Mastering MongoDB", "price" : 30, "published" : ISODate("2017-06-25T00:00:00Z") } { "_id" : ObjectId("59214bc1aabac954263b24e0"), "isbn" : "102", "name" : "MongoDB in 7 years", "price" : 50, "published" : ISODate("2017-06-26T00:00:00Z") } Using our previous example of counting books we would get the following: var mapper = function() { emit(this.id, 1); }; var reducer = function(id, count) { return Array.sum(count); }; > db.books.mapReduce(mapper, reducer, { out: "books_count" }) { "result" : "books_count", "timeMillis" : 16700, "counts" : { "input" : 2, "emit" : 2, "reduce" : 1, "output" : 1 }, "ok" : 1 } > db.books_count.find() { "_id" : null, "value" : 2 } Now we get a third book in our mongo_books collection with a document: { "_id" : ObjectId("59214bc1aabac954263b24e1"), "isbn" : "103", "name" : "MongoDB for experts", "price" : 40, "published" : ISODate("2017-07-01T00:00:00Z") } > db.books.mapReduce( mapper, reducer, { query: { published: { $gte: ISODate('2017-07-01 00:00:00') } }, out: { reduce: "books_count" } } ) > db.books_count.find() { "_id" : null, "value" : 3 } What happened here, is that by querying for documents in July 2017 we only got the new document out of the query and then used its value to reduce the value with the already calculated value of 2 in our books_count document, adding 1 to the final sum of three documents. This example, as contrived as it is, shows a powerful attribute of MapReduce: the ability to re-reduce results to incrementally calculate aggregations over time. Troubleshooting MapReduce Throughout the years, one of the major shortcomings of MapReduce frameworks has been the inherent difficulty in troubleshooting as opposed to simpler non-distributed patterns. Most of the time, the most effective tool is debugging using log statements to verify that output values match our expected values. In the mongo shell, this being a JavaScript shell, this is as simple as outputting using the console.log()function. Diving deeper into MapReduce in MongoDB we can debug both in the map and the reduce phase by overloading the output values. Debugging the mapper phase, we can overload the emit() function to test what the output key values are: > var emit = function(key, value) { print("debugging mapper's emit"); print("key: " + key + " value: " + tojson(value)); } We can then call it manually on a single document to verify that we get back the key-value pair that we would expect: > var myDoc = db.orders.findOne( { _id: ObjectId("50a8240b927d5d8b5891743c") } ); > mapper.apply(myDoc); The reducer function is somewhat more complicated. A MapReduce reducer function must meet the following criteria: It must be idempotent The order of values coming from the mapper function should not matter for the reducer's result The reduce function must return the same type of result as the mapper function We will dissect these following requirements to understand what they really mean: It must be idempotent: MapReduce by design may call the reducer multiple times for the same key with multiple values from the mapper phase. It also doesn't need to reduce single instances of a key as it's just added to the set. The final value should be the same no matter the order of execution. This can be verified by writing our own "verifier" function forcing the reducer to re-reduce or by executing the reducer many, many times: reduce( key, [ reduce(key, valuesArray) ] ) == reduce( key, valuesArray ) It must be commutative: Again, because multiple invocations of the reducer may happen for the same key, if it has multiple values, the following should hold: reduce(key, [ C, reduce(key, [ A, B ]) ] ) == reduce( key, [C, A, B ] ) The order of values coming from the mapper function should not matter for the reducer's result: We can test that the order of values from the mapper doesn't change the output for the reducer by passing in documents to the mapper in a different order and verifying that we get the same results out: reduce( key, [ A, B ] ) == reduce( key, [ B, A ] ) The reduce function must return the same type of result as the mapper function: Hand-in-hand with the first requirement, the type of object that the reduce function returns should be the same as the output of the mapper function. We saw how MapReduce is useful when implemented on a data pipeline. Multiple MapReduce commands can be chained to produce different results. An example would be aggregating data by different reporting periods (hour, day, week, month, year) where we use the output of each more granular reporting period to produce a less granular report. If you found this article useful, make sure to check our book Mastering MongoDB 3.x to get more insights and information about MongoDB’s vast data storage, management and administration capabilities.
Read more
  • 0
  • 0
  • 6205

article-image-preparing-spring-web-development-environment
Packt
02 Mar 2018
28 min read
Save for later

Preparing the Spring Web Development Environment

Packt
02 Mar 2018
28 min read
In this article by Ajitesh Kumar, the author of the book Building Web Apps with Spring 5 and Angular, we will see the key aspects of web request-response handling in relation with Spring Web MVC framework. In this article, we will go into the details of setting up development environment for working with Spring web applications. Following are going to be the key areas we are going to look into: Installing Java SDK Installing/configuring Maven Installing Eclipse IDE Installing/configuring Apache Tomcat Server Installing/configuring MySQL Database Introducing Docker containers Setting up development environment using Docker-compose (For more resources related to this topic, see here.) Installing Java SDK First and foremost, we will install Java SDK. We will work with Java 8 throughout this book. Go ahead and access this page (http://www.oracle.com/technetwork/java/javase /downloads/jdk8-downloads-2133151.html). Download the appropriate JDK kit. For Windows OS, there are two different versions, one for x86 and another for x64. One should select appropriate version and download “exe” file. Once downloaded, double-click on the executable file. This would start the installer. Once installed, following needs to be done:  Set the JAVA_HOME as the path where JDK is installed. Include the path in %JAVA_HOME%/bin in the environment variable. One could do that by adding the %JAVA_HOME%/bin directory to his/her user PATH environment variable by opening up the system properties (WinKey + Pause), selecting the “Advanced” tab, and the “Environment Variables” button, then adding or selecting the PATH variable in the user variables with the value. Once done with the preceding steps, open a shell and type the command, "java - version". It should print the version of Java you installed just now. Next, let us try and understand how to install and configure Maven, a tool for building and managing Java projects. Installing/Configuring Maven Maven is a tool which can be used for building and managing Java-based project. Following are some of the key benefits of using Maven as a build tool: It provides a simple project setup that follows best practices - Get a new project or module started in seconds. It allows a project to build using its Project Object Model (POM) and a set of plugins that are shared by all projects using Maven, providing a uniform build system. It allows usage of large and growing repository of libraries and metadata to use out of the box. Based on model based builds, it provides ability to work with multiple projects at the same time. Any number of projects can be built into predefined output types such as a JAR, WAR, or distribution based on metadata about the project, without the need to do any scripting in most cases. One can download Maven from https://maven.apache.org/download.cgi. Before installing Maven, make sure Java is installed and configured (JAVA_HOME) appropriately as mentioned in the previous section. On Windows, you could check the same by typing the command, “echo %JAVA_HOME%”:  Extract distribution archive in any directory. If it works on Windows, install unzip tool such as WinRAR. Right-click on the ZIP file and unzip it. A directory (with name as “apache-maven-3.3.9”, the version of maven at the time of writing) holding the files such as bin, conf and so on will be created. Add the bin directory of the created directory, “apache-maven-3.3.9” to the PATH environment variable. One could do that by adding the bin directory to his/her user PATH environment variable by opening up the system properties (WinKey + Pause), selecting the “Advanced” tab, and the “Environment Variables” button, then adding or selecting the PATH variable in the user variables with the value Open a new shell and type, “mvn -v”. The result should print the Maven version along with details including Java version, Java home, OS name, and so on. Now, let’s look at how can we create a Java project using Maven from command prompt before we get on to creating a Maven project in Eclipse IDE. Use following mvn command to create a Java project:  mvn archetype:generate -DgroupId=com.healthapp -DartifactId=HealthApp - DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false With archetype:generate and -DarchetypeArtifactId=maven-archetypequickstart template, following project directory structure is created: In the preceding diagram, healthapp folders within src/main and src/test folder consist of a hello world program named as "App.java" and a corresponding test program such as "AppTest.java". Also, the at the top most folder, a pom.xml file is created. In the next section, we will install Eclipse IDE and create a maven project using the functionality provided by the IDE. Installing Eclipse IDE In this section, we will get ourselves setup with Eclipse IDE, a tool used by Java developers to create Java EE and web applications. Go to Eclipse website, http://www.eclipse.org and download the latest version of Eclipse and install thereafter. As we shall be working with web applications, select the option such as "Eclipse IDE for Java EE Developers" while downloading the IDE. As you launch the IDE, it will ask to select a folder for workspace. Select appropriate path and start the IDE. Following are some of the different types of projects developers could work using IDE: A new Java EE Web project A new JavaScript project. This option will be very useful when you are working with standalone JavaScript project and planning to integrate with server components using APIs. Checkout existing Eclipse projects from Git and work on them Import one or more existing Eclipse projects from filesystem or archive Import existing Maven Project in Eclipse In previous section, we have created a maven project namely HealthApp. We will now see how we can import this project into Eclipse IDE:  Click File > import. Type Maven in the search box under Select an import source. Select Existing Maven Projects. Click Next. Click Browse and select the HealthApp folder which is the root of the Maven project. Note that it contains the pom.xml file. Click Finish. The project will be imported in Eclipse. Make sure this is how it looks like: Figure 2: Maven project imported into Eclipse Let's also see how one can create a new Maven project with Eclipse IDE. Create new Maven Project in Eclipse Follow the instructions given to create new Java Maven project with Eclipse IDE:  Click File > New > Project. Type Maven in the search box under Wizards. Select Maven project. A dialog box with title as "New Maven Project", having option "use default Workspace location" as checked, appears. Make sure that Group Id is selected as org.apache.maven.archetypes with Artifact Id selected as maven-archetype-quickstart. Give a name to Group Id, say, "com.orgname". Give a name to Artifact Id, say, "healthapp2". Click Finish. As a result of preceding steps, a new Maven project will be created in Eclipse. Make sure this is how it looks like: Figure 3: Maven project created within Eclipse In next section, we will see how to install and configure Tomcat Server. Installing/Configuring Apache Tomcat Server In this section, we will learn about some of the following: How to install and configure Apache Tomcat server Common deployment approaches with Tomcat server How to add Tomcat server in Eclipse The Apache Tomcat software is an open source implementation of the Java Servlet, JavaServer Pages (JSPs), Java Expression Language and Java WebSocket technologies. We will work with Apache Tomcat 8.x version in this book. We will look at both Windows and Unix version of Java. One can go to http://tomcat.apache.org/ and download the appropriate version from this page. At the time of installation, it requires you to choose the path to one of the JREs installed on your computer. Once installation is complete, Apache Tomcat server is started as a Windows service. With default installation options, one can then access the Tomcat server by accessing URL such as http://127.0.0.1:8080/. A page such as following will be displayed: Figure 4: Apache Tomcat Server Homepage Following is how Tomcat's folder structure looks like: Figure 5: Apache Tomcat Folder Structure In the preceding diagram, note the "webapps" folder which will contain our web apps. The following description uses the variable name such as following: $CATALINA_HOME, the directory into which Tomcat is installed. $CATALINA_BASE, the base directory against which most relative paths are resolved. If you have not configured Tomcat for multiple instances by setting a CATALINA_BASE directory, then $CATALINA_BASE will be set to the value of $CATALINA_HOME. Following are most commonly used approaches to deploy web apps in Tomcat: Copy unpacked directory hierarchy into a subdirectory in directory $CATALINA_BASE/webapps/. Tomcat will assign a context path to your application based on the subdirectory name you choose. Copy the web application archive (WAR) file into directory $CATALINA_BASE/webapps/. When Tomcat is started, it will automatically expand the web application archive file into its unpacked form, and execute the application that way. Let us learn how to configure Apache Tomcat from within Eclipse. This would be very useful as one could start and stop Tomcat from Eclipse while working with his/her web applications. Adding/Configuring Apache Tomcat in Eclipse In this section, we will learn how to add and configure Apache Tomcat in Eclipse. It would help to start and stop the server from within Eclipse IDE. Following steps need to be taken to achieve this objective: Make sure you are in Java EE perspective. Click on "Servers" tab in lower panel. You will find a link saying "No servers are available. Click this link to create a new server...". Click on this link. Type "Tomcat" under Select the server type. It would show a list of Tomcat server with different versions. Select "Tomcat v8.5 Server" and click Next. Select the Tomcat installation directory. Click on "Installed JREs..." button and make sure that appropriate JRE is checked. Click Next. Click Finish. This would create an entry for Tomcat server in "Servers" tab. Double-click on Tomcat server. This would open up a configuration window where multiple options such as Server Locations, Server Options, Ports can be configured. Under Server Locations, click on "Browse Path" button to select the path to "webapps" folder within your local Tomcat installation folder. Once done, save it using Ctrl-S. Right click on "Tomcat Server" link listed under "Servers" panel and click "Start". This should start the server. You should be able to access the Tomcat page on the URL, http://localhost:8080/. Installing/Configuring MySQL Database In this section, we will learn on how to install MySQL database. Go to MySQL Downloads site (https://www.mysql.com/downloads/) and click on "Community (GPL) Downloads" under MySQL community edition. On the next page, you will see listing of several MySQL software packages. Download following: MySQL Community Server MySQL Connector for Java development (Connector/J) Installing/Configuring MySQL Server In this section, we will see how to download, install and configure the MySQL database and related utility such as MySQL Workbench. Note that MySQL Workbench is a unified visual tool which can be used by database architects, developers and DBA for activities such as data modeling, SQL development, and comprehensive administration tools for server configuration, user administration etc. Follow the instructions given for installation & configuration of MySQL server and workbench:  Click on "Download" link under "MySQL Community Server (GPL)" found as first entry on "MySQL Community Downloads" page. We shall be working with Windows version of MySQL in the following instructions. Click the "Download" button against the entry "Windows (x86, 32-bit), MySQL Installer MSI". This would download the an exe file such as mysql-installercommunity-5.7.16.0.exe. Double-click on the installer to start the installation. As you progress ahead after accepting the license terms and condition, you would find the interactive UI such as following. Choose the appropriate version of MySQL server and also, MySQL Workbench and click on Next. Figure 6: Selecting and installing MySQL Server and MySQL Workbench Clicking on Execute would install the MySQL server and MySQL workbench as shown in the following diagram: Figure 7: MySQL Server and Workbench installation in progress Once installation is complete, next few steps would require you to configure the MySQL database including setting root password, adding one or more users, opting to start MySQL server as a Windows service and so on. The quickest way will be to use default instructions as much as possible and finish the installation. Once all is done, you would see UI such as following: Figure 8: Completion of MySQL Server and Workbench installation Clicking on "Finish" button will take on the next window where you could choose to start MySQL workbench. Following is how the MySQL Workbench would look like after you click on MySQL server instance on the Workbench homepage, enter the root password and execute "Show databases" command: Figure 9: MySQL Workbench Using MySQL Connector Before testing MySQL database connection from Java program, one would need to add the MySQL JDBC connector library to the classpath. In this section, we will learn how to configure/add MySQL JDBC connector library to classpath while working with Eclipse IDE or command console. The MySQL connector (Connector/J) comes in ZIP file (*.tar.gz). The MySQL connector is a concrete implementation of JDBC API. Once extracted, one can see a JAR file with name such as mysql-connector-java-xxx.jar. Following are different ways in which this JAR file is dealt with while working with or without IDEs such as Eclipse:  While working with Eclipse IDE, one can add the JAR file to the classpath by adding it as Library to the Build Path in project's properties. While working with command console, one needs to specify the path to the JAR file in the -cp or -classpath argument when executing the Java application. Following is the sample command representing the preceding: java -cp .;/path/to/mysql-connector-java-xxx.jar com.healthapp.JavaClassName Note the "." in classpath (-cp) option. This is there to add the current directory to the classpath as well such that com.healthapp.JavaClassName can be located. Connecting to MySQL Database from a Java Class In this section, we will learn how to test the MySQL database connection from a Java program. Before executing the code shown as follows in your Eclipse IDE, make sure to do the following:  Add the MySQL connector jar file by right-clicking on top-level project folder, clicking on "Properties", clicking on "Java Build Path" and, then, adding mysqlconnector-java-xxx.jar file by clicking on "Add External JARs...": Figure 10: Adding MySQL Java Connector to Java Build Path in Eclipse IDE Create a MySQL database namely "healthapp". You could do that by accessing MySQL Workbench and executing the MySQL command such as "create database healthapp". Following diagram represents the same: Figure 11: Creating new MySQL Database using MySQL Workbench Once done with the preceding steps, use the following code to test the connection to MySQL database from your Java class. On successful connection, you should be able to see "Database connected!" getting printed. import java.sql.Connection; import java.sql.DriverManager; import java.sql.SQLException; /** * Sample program to test MySQL database connection */ public class App { public static void main( String[] args ) { String url = "jdbc:mysql://localhost:3306/healthapp"; String username = "root"; String password = "r00t"; //Root password set during MySQL installation procedure as described above. System.out.println("Connecting database..."); try { Connection connection = DriverManager.getConnection(url, username, password); System.out.println("Database connected!"); } catch (SQLException e) { throw new IllegalStateException("Cannot connect the database!", e); } } } Introduction to Dockers Docker is a virtualization technology which helps IT organizations achieve some of the following:  Enable Dev/QA team develop and test applications in a quick and easy manner in any environment. Break the barriers between Dev/QA and Operations teams during software development life cycle (SDLC) processes. Optimize infrastructure usage in the most appropriate manner. In this section, we will emphasize on first point which would help us setup Spring web application development in quick and easy manner. So far, we have seen traditional manners in which we could set up the Java web application development environment by installing different tools in independent manner and later configuring them appropriately. In a traditional setup, one would be required to setup and configure Java, Maven, Tomcat, MySQL server and so on, one tool at a time, by following manual steps. On the same lines, you could see that all of the steps described in preceding sections have to be performed one-by-one in manual fashion. Following are some of the disadvantages of setting up development/test environments in this manner:  Conflicting Runtimes: If a need arises to use software packages (say, different versions of Java and Tomcat) of different versions to run and test the same web application, it can become very cumbersome to manually set up the environment having different versions of software. Environments getting corrupted: If more than one developers are working in a particular development environment, there are chances that the environment could get corrupted due to changes made by one developer while others are not aware about. And, that generally leads to developers'/team's productivity loss due to time spent in fixing the configuration issue or re-installing the development environment from scratch. "Works for me" syndrome: Have you come across another member of your team saying that the application works in their environment although the application seems to have broken? New Developers/Testers' On-boarding: If there is a need to quickly on-board the new developers, manually setting up development environment takes some significant amount of time depending upon the applications' complexity. All of the praceding disadvantages could be taken care by making use of Dockers technology. In this section, we will learn briefly about some of the following: What are Docker Containers? What are key building blocks of Docker containers? Installing Dockers Useful commands to work with Docker containers What are Docker Containers? In this section, we will try and understand what are Docker containers while comparing them with real-world containers. Simply speaking, Docker is an open platform for developing, shipping and running applications. It provides the ability to package and run an application in a loosely isolated environment called a container. Before going into details of Docker containers, let us try and understand the problems that are solved by real-world containers. What are real-world containers good for? Following picture represents real world containers which are used to package annything and everything and, then, transport the goods from one place to other in an easy and safe manner:Figure 12: Real-world containers The following diagram represents different form of goods which needs to be transported using different from of transport mechanisms from one place to another: Figure 13: Different forms of goods vis-a-vis different form of transport mechanisms The following diagram displays the matrix representing need to transport each of the goods via different transport mechanism. The challenge is to make sure that these goods get transported in easy and safe manner: Figure 14: Complexity associated with transporting goods of different types using different transport mechanisms In order to solve preceding problem of transporting the goods in safe and easy manner irrespective of transport medium, the containers are used. Look at the following diagram: Figure 15: Goods can be packed within containers, and containers can be transported. How does Docker containers relate to the real-world containers? Now imagine the act of moving a software application from one environment to another environment starting from development right up to production. Following diagram represents complexity associated with making different application components work in different environments: Figure 16: Complexity associated with making different application components work in different environments As per the preceding diagram, to make different application components work in different environments (different hardware platforms), one would require to make sure environment compatible software versions and related configurations are set appropriately. Doing this using manual steps can be real cumbersome and error prone task. This is where docker containers fit in. Following diagram represents containerizing different application components using Docker containers. As like real-world containers, it would become very easy to move the containerized application components from one environment to another with very less or no issues: Figure 17: Docker containers to move application components across different environments Docker containers In simple terms, Docker containers provide an isolated and secured environment for the application components to run. The isolation and security allows one or many containers to run simultaneously on a given host. Often, for simplicity sake, Docker containers are loosely termed as lightweight-VMs (Virtual Machine). However, they are very much different from the traditional VMs. Docker containers do not need hypervisors to run as like virtual machines and, thus, multiple containers can be run on a given hardware combination. Virtual machines include the application, the necessary binaries and libraries, and an entire guest operating system; all of which can amount to tens of GBs. On the other hand, Docker Containers include the application and all of its dependencies; but share the kernel with other containers, running as isolated processes in user space on the host operating system. Docker containers are not tied to any specific infrastructure: they run on any computer, on any infrastructure, and in any cloud. This very aspect make them look like a real-world container. Following diagram sums it all: Figure 18: Difference between traditional VMs and Docker containers Following are some of the key building blocks of Docker technology: Docker Containers: Isolated and secured environment for applications to run. Docker engine: A client-server application having following components: Daemon process used to create and manage Docker objects, such as images, containers, networks, and data volumes. A REST API interface and A command line interface (CLI) client Docker client: Client program that invokes Docker engine using APIs. Docker host: Underlying operating system sharing the kernel space with Docker containers. Until recently, Windows OS needed Linux virtualization to host Docker containers.  Docker hub: Public repository used to manage Docker images posted by various users. Images made public are available for all to download in order to create containers using those images. What are key building blocks of Dockers containers? For setting up our development environment, we will rely on Docker containers and assemble them together using the tool called as Docker compose which we shall learn about little later. Let us understand some of the following which can also be termed as key building blocks of Docker containers:  Docker image: In simple terms, Docker image can be thought of as a "class" in Java. Docker containers can be thought of as running instances of the image as like having one or more "instances" of a Java class. Technically speaking, Docker images consist of a list of layers that are stacked on top of each other to form a base for containers' root file system. Following diagram represents command which can be used to create a Docker container using an image named helloworld: Figure 19: Docker command representing creation of Docker container using a Docker image. In order to set up our development environment, we will require images of following to create the respective Docker containers: - Tomcat - MySQL. Dockerfile: Dockerfile is a text document that contains all the commands which could be called on the command line to assemble or build an image. docker build command is used to build an image from a Dockerfile and a context. In order to create custom images for Tomcat and MySQL, it may be required to create a Dockerfile and, then, build the image. Following is a sample command for building an image using a Dockerfile: docker build -f tomcat.df -t tomcat_debug The preceding command would look for the Dockerfile "tomcat.df" in the current directory specified by "." and build the image with tag, "tomcat_debug". Installing Dockers Now that we have got an understanding on What are Dockers, lets install Dockers. We shall look into steps that are required to install Dockers on Windows OS:  Download the Windows version of Docker Toolbox from the webpage, https https://www.docker.com/products/docker-toolbox. Docker toolbox comes as an installer which can be double-clicked for quick setup and launch of the docker environment. Following comes with Docker toolbox installation: Docker Machine for running docker-machine commands. Docker Engine for running the docker commands. Docker Compose for running the docker-compose commands. This is what we are looking for. Kitematic, the Docker GUI. A shell preconfigured for a Docker command-line environment. Oracle VirtualBox. Setting up Development Environment using Docker Compose In this section, we will learn how to setup on-demand, self-service development environment using Docker compose. Following are some of the points covered in this section: What is Docker compose? Docker compose script for setting up the development environment What is Docker Compose? Docker compose is a tool for defining and running multi-container Docker applications. One will require to create a Compose file to configure the application's services. Following steps are required to be taken in order to work with Docker compose: Define the application’s environment with a Dockerfile so it can be reproduced anywhere. Define the services that make up the application in docker-compose.yml so they can be run together in an isolated environment. Lastly, run docker-compose up and Compose will start and run the entire application. As we are going to setup a multi-container applications using Tomcat and MySQL as different containers, we will use Docker compose to configure both of them and, then, assemble the application. Docker Compose script for setting up the development environment In order to come up with a Docker compose script which can set up our Spring Web App development environment with one script execution, we will first set up images for following by creating independent Dockerfiles. Tomcat 8.x with Java and Maven installed as one container MySQL as another container Setting up Tomcat 8.x as a Container Service Following steps can be used to setup Tomcat 8.x along with Java 8 and Maven 3.x as one container: Create a folder and put following files within the folder. The source code for the files will be given as follows: tomcat.df create_tomcat_admin_user.sh run.sh Copy following source code for tomcat.df: FROM phusion/baseimage:0.9.17 RUN echo "deb http://archive.ubuntu.com/ubuntu trusty main universe" > /etc/apt/sources.list RUN apt-get -y update RUN DEBIAN_FRONTEND=noninteractive apt-get install -y -q pythonsoftware-properties software-properties-common ENV JAVA_VER 8 ENV JAVA_HOME /usr/lib/jvm/java-8-oracle RUN echo 'deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' >> /etc/apt/sources.list && echo 'deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main' >> /etc/apt/sources.list && apt-key adv --keyserver keyserver.ubuntu.com --recv-keys C2518248EEA14886 && apt-get update && echo oracle-java${JAVA_VER}-installer shared/accepted-oraclelicense-v1-1 select true | sudo /usr/bin/debconf-set-selections && apt-get install -y --force-yes --no-install-recommends oraclejava${JAVA_VER}-installer oracle-java${JAVA_VER}-set-default && apt-get clean && rm -rf /var/cache/oracle-jdk${JAVA_VER}-installer RUN update-java-alternatives -s java-8-oracle RUN echo "export JAVA_HOME=/usr/lib/jvm/java-8-oracle" >> ~/.bashrc RUN apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* ENV MAVEN_VERSION 3.3.9 RUN mkdir -p /usr/share/maven && curl -fsSL http://apache.osuosl.org/maven/maven-3/$MAVEN_VERSION/binaries/apac he-maven-$MAVEN_VERSION-bin.tar.gz | tar -xzC /usr/share/maven --strip-components=1 && ln -s /usr/share/maven/bin/mvn /usr/bin/mvn ENV MAVEN_HOME /usr/share/maven VOLUME /root/.m2 RUN apt-get update && apt-get install -yq --no-install-recommends wget pwgen cacertificates && apt-get clean && rm -rf /var/lib/apt/lists/* ENV TOMCAT_MAJOR_VERSION 8 ENV TOMCAT_MINOR_VERSION 8.5.8 ENV CATALINA_HOME /tomcat RUN wget -q https://archive.apache.org/dist/tomcat/tomcat-${TOMCAT_MAJOR_VERSIO N}/v${TOMCAT_MINOR_VERSION}/bin/apache-tomcat${TOMCAT_MINOR_VERSION}.tar.gz && wget -qOhttps://archive.apache.org/dist/tomcat/tomcat-${TOMCAT_MAJOR_VERSIO N}/v${TOMCAT_MINOR_VERSION}/bin/apache-tomcat${TOMCAT_MINOR_VERSION}.tar.gz.md5 | md5sum -c - && tar zxf apache-tomcat-*.tar.gz && rm apache-tomcat-*.tar.gz && mv apache-tomcat* tomcat ADD create_tomcat_admin_user.sh /create_tomcat_admin_user.sh RUN mkdir /etc/service/tomcat ADD run.sh /etc/service/tomcat/run RUN chmod +x /*.sh RUN chmod +x /etc/service/tomcat/run EXPOSE 8080 CMD ["/sbin/my_init"] Copy following code in a file named as create_tomcat_admin_user.sh. This file should be created in the same folder as preceding file, tomcat.df. While copying into notepad and later using with docker terminal, you may find Ctrl-M character inserted at the end of the line. Make sure that those lines are appropriately handled and removed: #!/bin/bash if [ -f /.tomcat_admin_created ]; then echo "Tomcat 'admin' user already created" exit 0 fi PASS=${TOMCAT_PASS:-$(pwgen -s 12 1)} _word=$( [ ${TOMCAT_PASS} ] && echo "preset" || echo "random" ) echo "=> Creating an admin user with a ${_word} password in Tomcat" sed -i -r 's/</tomcat-users>//' ${CATALINA_HOME}/conf/tomcatusers.xml echo '<role rolename="manager-gui"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '<role rolename="manager-script"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '<role rolename="manager-jmx"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '<role rolename="admin-gui"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '<role rolename="admin-script"/>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo "<user username="admin" password="${PASS}" roles="managergui,manager-script,manager-jmx,admin-gui, admin-script"/>" >> ${CATALINA_HOME}/conf/tomcat-users.xml echo '</tomcat-users>' >> ${CATALINA_HOME}/conf/tomcat-users.xml echo "=> Done!" touch /.tomcat_admin_created echo "================================================================== ======" echo "You can now configure to this Tomcat server using:" echo "" echo " admin:${PASS}" echo "" echo "================================================================== ======" Copy following code in a file named as run.sh in the same folder as preceding two files: #!/bin/bash if [ ! -f /.tomcat_admin_created ]; then /create_tomcat_admin_user.sh fi exec ${CATALINA_HOME}/bin/catalina.sh run Open up a Docker terminal and go to folder where these files are located. Execute following command to create the Tomcat image. In few minutes, the tomcat image will be created: docker build -f tomcat.df -t demo/tomcat:8 . Execute the command such as following and make sure that an image with name as demo/tomcat is found: docker images Next, run a container with name such as "tomcatdev" using following command: docker run -ti -d -p 8080:8080 --name tomcatdev -v "$PWD":/mnt/ demo/tomcat:8 Open a browser and type the URL as http://192.168.99.100:8080/. You should be able to see following page getting loaded. Note the URL and the Tomcat version, 8.5.8. This is the same version we earlier installed (check figure 1.4): Figure 20: Tomcat 8.5.8 installed as a Docker container You could access the container through the terminal using command with following command. Make sure to check the Tomcat installation inside folder "/tomcat". Also, execute command such as "java -version" and "mvn -v" to check the version of Java and Maven respectively: docker exec -ti tomcatdev /bin/bash In this section, we learnt to setup Tomcat 8.5.8 along with Java 8 and Maven 3.x as one container. Setting up MySQL as a Container Service In this section, we will learn how to setup MySQL as a container service. In the docker terminal, execute the following command: docker run -ti -d -p 3326:3306 --name mysqldev -e MYSQL_ROOT_PASSWORD=r00t -v "$PWD":/mnt/ mysql:5.7 The preceding command setup MySQL 5.7 version within the container and starts the mysqld service. Open MySQL Workbench and create a new connection by entering the details such as following and click "Test Connection". You should be able to establish the connection successfully: Figure 21: MySQL server running in the container and accessible from host machine at 3326 port using MySQL Workbench Docker Compose script to setup the Dev Environment Now, that we have setup both Tomcat and MySQL as individual containers, let us learn to create a Docker compose script using which both the containers can be started simultaneously thereby starting the Dev environment.  Save following source code as docker-compose.yml in the same folder as preceding mentioned files: version: '2' services: web: build: context: . dockerfile: tomcat.df ports: - "8080:8080" volumes: - .:/mnt/ links: - db db: image: mysql:5.7 ports: - "3326:3306" environment: - MYSQL_ROOT_PASSWORD=r00t Execute following command to start and stop the services: // For starting the services in the foreground docker-compose up // For starting the services in the background (detached mode) docker-compose up -d // For stopping the services docker-compose stop Test whether both the default Tomcat web app and MySQL server can be accessed. Access the URL, 192.168.99.100:8080 and make sure that the web page as shown in figure 1.20 is displayed. Also, open MySQL Workbench and access the MySQL server at IP, 192.168.99.100 and port 3326 (as specified in the preceding docker-compose.yml file). Summary In this article, we learnt how we could start and stop the Web app Dev environment on- demand. Note that with these scripts including Dockerfiles, shell scripts and Dockercompose file, you could setup the Dev environment on any machine where Docker Toolbox could be installed. Resources for Article:   Further resources on this subject: Building Web Apps with Spring 5 and Angular 4 Spring 5 Design Patterns
Read more
  • 0
  • 0
  • 3937

article-image-implementing-apache-spark-k-means-clustering-method-on-digital-breath-test-data-for-road-safety
Savia Lobo
01 Mar 2018
7 min read
Save for later

Implementing Apache Spark K-Means Clustering method on digital breath test data for road safety

Savia Lobo
01 Mar 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. In this book, you will learn to use Spark as a big data operating system, understand how to implement advanced analytics on the new APIs, and explore how easy it is to use Spark in day-to-day tasks.[/box] In today’s tutorial, we have used the Road Safety test data from our previous article, to show how one can attempt to find clusters in data using K-Means algorithm with Apache Spark MLlib. Theory on Clustering The K-Means algorithm iteratively attempts to determine clusters within the test data by minimizing the distance between the mean value of cluster center vectors, and the new candidate cluster member vectors. The following equation assumes dataset members that range from X1 to Xn; it also assumes K cluster sets that range from S1 to Sk, where K <= n. K-Means in practice The K-Means MLlib functionality uses the LabeledPoint structure to process its data and so it needs numeric input data. As the same data from the last section is being reused, we will not explain the data conversion again. The only change that has been made in data terms in this section, is that processing in HDFS will now take place under the /data/spark/kmeans/ directory. Additionally, the conversion Scala script for the K-Means example produces a record that is all comma-separated. The development and processing for the K-Means example has taken place under the /home/hadoop/spark/kmeans directory to separate the work from other development. The sbt configuration file is now called kmeans.sbt and is identical to the last example, except for the project name: name := "K-Means" The code for this section can be found in the software package under chapter7K-Means. So, looking at the code for kmeans1.scala, which is stored under kmeans/src/main/scala, some similar actions occur. The import statements refer to the Spark context and configuration. This time, however, the K-Means functionality is being imported from MLlib. Additionally, the application class name has been changed for this example to kmeans1: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.clustering.{KMeans,KMeansModel} object kmeans1 extends App { The same actions are being taken as in the last example to define the data file--to define the Spark configuration and create a Spark context: val hdfsServer = "hdfs://localhost:8020" val hdfsPath      = "/data/spark/kmeans/" val dataFile     = hdfsServer + hdfsPath + "DigitalBreathTestData2013- MALE2a.csv" val sparkMaster = "spark://localhost:7077" val appName = "K-Means 1" val conf = new SparkConf() conf.setMaster(sparkMaster) conf.setAppName(appName) val sparkCxt = new SparkContext(conf) Next, the CSV data is loaded from the data file and split by comma characters into the VectorData variable: val csvData = sparkCxt.textFile(dataFile) val VectorData = csvData.map { csvLine => Vectors.dense( csvLine.split(',').map(_.toDouble)) } A KMeans object is initialized, and the parameters are set to define the number of clusters and the maximum number of iterations to determine them: val kMeans = new KMeans val numClusters       = 3 val maxIterations     = 50 Some default values are defined for the initialization mode, number of runs, and Epsilon, which we needed for the K-Means call but did not vary for the processing. Finally, these parameters were set against the KMeans object: val initializationMode = KMeans.K_MEANS_PARALLEL val numRuns     = 1 val numEpsilon       = 1e-4 kMeans.setK( numClusters ) kMeans.setMaxIterations( maxIterations ) kMeans.setInitializationMode( initializationMode ) kMeans.setRuns( numRuns ) kMeans.setEpsilon( numEpsilon ) We cached the training vector data to improve the performance and trained the KMeans object using the vector data to create a trained K-Means model: VectorData.cache val kMeansModel = kMeans.run( VectorData ) We have computed the K-Means cost and number of input data rows, and have output the results via println statements. The cost value indicates how tightly the clusters are packed and how separate the clusters are: val kMeansCost = kMeansModel.computeCost( VectorData ) println( "Input data rows : " + VectorData.count() ) println( "K-Means Cost  : " + kMeansCost ) Next, we have used the K-Means Model to print the cluster centers as vectors for each of the three clusters that were computed: kMeansModel.clusterCenters.foreach{ println } Finally, we use the K-Means model predict function to create a list of cluster membership predictions. We then count these predictions by value to give a count of the data points in each cluster. This shows which clusters are bigger and whether there really are three clusters: val clusterRddInt = kMeansModel.predict( VectorData ) val clusterCount = clusterRddInt.countByValue clusterCount.toList.foreach{ println } } // end object kmeans1 So, in order to run this application, it must be compiled and packaged from the kmeans subdirectory as the Linux pwd command shows here: [hadoop@hc2nn kmeans]$ pwd /home/hadoop/spark/kmeans [hadoop@hc2nn kmeans]$ sbt package Loading /usr/share/sbt/bin/sbt-launch-lib.bash [info] Set current project to K-Means (in build file:/home/hadoop/spark/kmeans/) [info] Compiling 2 Scala sources to /home/hadoop/spark/kmeans/target/scala-2.10/classes... [info] Packaging /home/hadoop/spark/kmeans/target/scala-2.10/k- means_2.10-1.0.jar ... [info] Done packaging. [success] Total time: 20 s, completed Feb 19, 2015 5:02:07 PM Once this packaging is successful, we check HDFS to ensure that the test data is ready. As in the last example, we convert our data to numeric form using the convert.scala file, provided in the software package. We will process the DigitalBreathTestData2013- MALE2a.csv data file in the HDFS directory, /data/spark/kmeans, as follows: [hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/kmeans Found 3 items -rw-r--r--   3 hadoop supergroup 24645166 2015-02-05 21:11 /data/spark/kmeans/DigitalBreathTestData2013-MALE2.csv -rw-r--r--   3 hadoop supergroup 5694226 2015-02-05 21:48 /data/spark/kmeans/DigitalBreathTestData2013-MALE2a.csv drwxr-xr-x - hadoop supergroup   0 2015-02-05 21:46 /data/spark/kmeans/result The spark-submit tool is used to run the K-Means application. The only change in this command is that the class is now kmeans1: spark-submit --class kmeans1 --master spark://localhost:7077 --executor-memory 700M --total-executor-cores 100 /home/hadoop/spark/kmeans/target/scala-2.10/k-means_2.10-1.0.jar The output from the Spark cluster run is shown to be as follows: Input data rows : 467054 K-Means Cost  : 5.40312223450789E7 The previous output shows the input data volume, which looks correct; it also shows the K- Means cost value. The cost is based on the Within Set Sum of Squared Errors (WSSSE) which basically gives a measure how well the found cluster centroids are matching the distribution of the data points. The better they are matching, the lower the cost. The following link https://datasciencelab.wordpress.com/2013/12/27/finding-the-k-in-k-means-clustering/ explains WSSSE and how to find a good value for k in more detail. Next come the three vectors, which describe the data cluster centers with the correct number of dimensions. Remember that these cluster centroid vectors will have the same number of columns as the original vector data: [0.24698249738061878,1.3015883142472253,0.005830116872250263,2.917374778855 5207,1.156645130895448,3.4400290524342454] [0.3321793984152627,1.784137241326256,0.007615970459266097,2.58319870759289 17,119.58366028156011,3.8379106085083468] [0.25247226760684494,1.702510963969387,0.006384899819416975,2.2314042480006 88,52.202897927594805,3.551509158139135] Finally, cluster membership is given for clusters 1 to 3 with cluster 1 (index 0) having the largest membership at 407539 member vectors: (0,407539) (1,12999) (2,46516) To summarize, we saw a practical  example that shows how K-means algorithm is used to cluster data with the help of Apache Spark. If you found this post useful, do check out this book Mastering Apache Spark 2.x - Second Edition to learn about the latest enhancements in Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.
Read more
  • 0
  • 1
  • 6050
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at €14.99/month. Cancel anytime
article-image-introduction-raspberry-pi-zero-w-wirelessv
Packt
01 Mar 2018
13 min read
Save for later

Introduction to Raspberry Pi Zero W Wirelessv

Packt
01 Mar 2018
13 min read
In this article by Vasilis Tzivaras, the author of the book Raspberry Pi Zero W Wireless Projects, we will be covering the following topics: An overview of the Raspberry Pi family An overview of the Raspberry Pi family Common issues Raspberry Pi Zero W is the new product of the Raspberry Pi Zero family. In early 2017, Raspberry Pi community has announced a new board with wireless extension. It offers wireless functionality and now everyone can develop his own projects without cables and other components. Comparing the new board with Raspberry Pi 3 Model B we can easily see that it is quite smaller with many possibilities over the Internet of Things. But what is a Raspberry Pi Zero W and why do you need it? Let' s go though the rest of the family and introduce the new board. In the following article we will cover the following topics: Raspberry Pi family As said earlier Raspberry Pi Zero W is the new member of Raspberry Pi family boards. All these years Raspberry Pi are evolving and become more user friendly with endless possibilities. Let's have a short look at the rest of the family so we can understand the difference of the Pi Zero board. Right now, the heavy board is named Raspberry Pi 3 Model B. It is the best solution for projects such as face recognition, video tracking, gaming or anything else that is demanding: RASPBERRY PI 3 MODEL B Insert Image B07972_01_15.png It is the 3rd generation of Raspberry Pi boards after Raspberry Pi 2 and has the following specs: 1.2GHz 64-bit quad-core ARMv8 CPU 802.11n Wireless LAN Bluetooth 4.1 Bluetooth Low Energy (BLE) Like the Pi 2, it also has 1GB RAM 4 USB ports 40 GPIO pins Full HDMI port Ethernet port Combined 3.5 mm audio jack and composite video Camera interface (CSI) Display interface (DSI) Micro SD card slot (now push-pull rather than push-push) </li><li> VideoCore IV 3D graphics core</li></ol> The next board is Raspberry Pi Zero, in which the Zero W was based. A small low cost and power board able to do many things:Raspberry Pi Zero Insert Image B07972_01_18.png The specs of this board can be found as follows: <ol><li>1GHz, Single-core CPU </li><li> 512MB RAM </li><li> Mini-HDMI port </li><li>Micro-USB OTG port </li><li>Micro-USB power </li><li>HAT-compatible 40-pin header </li><li>Composite video and reset headers </li><li> CSI camera connector (v1.3 only) At this point we should not forget to mention that apart from the boards mentioned earlier there are several other modules and components such as the Sense Hat or Raspberry Pi Touch Display available which will work great for advance projects. The 7″ Touchscreen Monitor for Raspberry Pi gives users the ability to create all-in-one, integrated projects such as tablets, infotainment systems and embedded projects: RASPBERRY PI Touch Display Insert Image B07972_01_16.png Where Sense HAT is an add-on board for Raspberry Pi, made especially for the Astro Pi mission. The Sense HAT has an 8×8 RGB LED matrix, a five-button joystick and includes the following sensors: <ol><li> Gyroscope </li><li> Accelerometer </li><li> Magnetometer </li><li>Temperature</li><li> Barometric pressure</li><li> Humidity Sense HAT Insert Image B07972_01_17.png Stay tuned with more new boards and modules at the official website: https://www.raspberrypi.org/ Raspberry Pi Zero W Raspberry Pi Zero W is a small device that has the possibilities to be connected either on an external monitor or TV and of course it is connected to the internet. The operating system varies as there are many distros in the official page and almost everyone is baled on Linux systems. Raspberry Pi Zero W Insert Image B07972_01_01.png With Raspberry Pi Zero W you have the ability to do almost everything, from automation to gaming! It is a small computer that allows you easily program with the help of the GPIO pins and some other components such as a camera. Its possibilities are endless! Specifications If you have bought Raspberry PI 3 Model B you would be familiar with Cypress CYW43438 wireless chip. It provides 802.11 n wireless LAN and Bluetooth 4.0 connectivity. The new Raspberry Pi Zero W is equipped with that wireless chip as well. Following you can find the specifications of the new board Dimensions: 65mm × 30mm × 5mm SoC: Broadcom BCM 2835 chip ARM11 at 1 GHz, single core CPU 512ΜΒ RAM Storage: MicroSD card Video and Audio: 1080P HD video and stereo audio via mini-HDMI connector </li><li> Power: 5V, supplied via micro USB connector </li><li> Wireless: 2.4GHz 802.11 n wireless LAN Bluetooth: Bluetooth classic 4.1 and Bluetooth Low Energy (BLE) Output: Micro USB GPIO: 40-pin GPIO, unpopulated Raspberry Pi Zero W Insert Image B07972_01_03.png Notice that all the components are on the top side of the board so you can easily choose your case without any problems and keep it safe. As far as the antenna concern, it is formed by etching away copper on each layer of the PCB. It may not be visible as it is in other similar boards but it is working great and offers quite a lot functionalities: Raspberry Pi Zero W Capacitors Insert Image B07972_01_04.png Also, the product is limited to only one piece per buyer and costs 10$. You can buy a full kit with microsd card, a case and some more extra components for about 45$ or choose the camera full kit which contains a small camera component for 55$. Camera support Image processing projects such as video tracking or face recognition require a camera. Following you can see the official camera support of Raspberry Pi Zero W. The camera can easily be mounted at the side of the board using a cable like the Raspberry Pi 3 Model B board: The official Camera support of Raspberry Pi Zero W Insert Image B07972_01_05.png Depending on your distribution you many need to enable the camera though command line. More information about the usage of this module will be mentioned at the project. Accessories Well building projects with the new board there are some other gadgets that you might find useful working with. Following there is list of some crucial components. Notice that if you buy Raspberry Pi Zero W kit, it includes some of them. So, be careful and don't double buy them: OTG cable powerHUB GPIO header microSD card and card adapter HDMI to miniHDMI cable HDMI to VGA cable Distributions The official site https://www.raspberrypi.org/downloads/ contains several distributions for downloading. The two basic operating systems that we will analyze after are RASPBIAN and NOOBS. Following you can see how the desktop environment looks like. Both RASPBIAN and NOOBS allows you to choose from two versions. There is the full version of the operating system and the lite one. Obviously the lite version does not contain everything that you might use so if you tend to use your Raspberry with a desktop environment choose and download the full version. On the other side if you tend to just ssh and do some basic stuff pick the lite one. It' s really up to you and of course you can easily download again anything you like and re-write your microSD card: Insert Image B07972_01_14.png NOOBS distribution Download NOOBS: https://www.raspberrypi.org/downloads/noobs/. NOOBS distribution is for the new users with not so much knowledge in linux systems and Raspberry PI boards. As the official page says it is really "New Out Of the Box Software". There is also pre-installed NOOBS SD cards that you can purchase from many retailers, such as Pimoroni, Adafruit, and The Pi Hut, and of course you can download NOOBS and write your own microSD card. If you are having trouble with the specific distribution take a look at the following links: Full guide at https://www.raspberrypi.org/learning/software-guide/. View the video at https://www.raspberrypi.org/help/videos/#noobs-setup. NOOBS operating system contains Raspbian and it provides various of other operating systems available to download. RASPBIAN distribution Download RASPBIAN: https://www.raspberrypi.org/downloads/raspbian/. Raspbian is the official supported operating system. It can be installed though NOOBS or be downloading the image file at the following link and going through the guide of the official website. Image file: https://www.raspberrypi.org/documentation/installation/installing-images/README.md. It has pre-installed plenty of software such as Python, Scratch, Sonic Pi, Java, Mathematica, and more! Furthermore, more distributions like Ubuntu MATE, Windows 10 IOT Core or Weather Station are meant to be installed for more specific projects like Internet of Things (IoT) or weather stations. To conclude with, the right distribution to install actually depends on your project and your expertise in Linux systems administration. Raspberry Pi Zero W needs an microSD card for hosting any operating system. You are able to write Raspbian, Noobs, Ubuntu MATE, or any other operating system you like. So, all that you need to do is simple write your operating system to that microSD card. First of all you have to download the image file from https://www.raspberrypi.org/downloads/ which, usually comes as a .zip file. Once downloaded, unzip the zip file, the full image is about 4.5 Gigabytes. Depending on your operating system you have to use different programs 7-Zip for Windows The Unarchiver for Mac Unzip for Linux Now we are ready to write the image in the MicroSD card. You can easily write the .img file in the microSD card by following one of the next guides according to your system. For Linux users dd tool is recommended. Before connecting your microSD card with your adaptor in your computer run the following command: df -h Now connect your card and run the same command again. You must see some new records. For example if the new device is called /dev/sdd1 keep in your mind that the card is at /dev/sdd (without the 1). The next step is to use the dd command and copy the image to the microSD card. We can do this by the following command: dd if=<path to your image> of=</dev/***> Where if is the input file (image file or the distribution) and of is the output file (microSD card). Again be careful here and use only /dev/sdd or whatever is yours without any numbers. If you are having trouble with that please use the full manual at the following link https://www.raspberrypi.org/documentation/installation/installing-images/linux.md. A good tool that could help you out for that job is GParted. If it is not installed on your system you can easily install it with the following command: sudo apt-get install gparted Then run sudo gparted to start the tool. Its handles partitions very easily and you can format, delete or find information about all your mounted partitions. More information about dd can be found here: https://www.raspberrypi.org/documentation/installation/installing-images/linux.md For Mac OS users dd tool is always recommended: https://www.raspberrypi.org/documentation/installation/installing-images/mac.md For Windows users Win32DiskImager utility is recommended: https://www.raspberrypi.org/documentation/installation/installing-images/windows.md There are several other ways to write an image file in a microSD card. So, if you are against any kind of problems when following the guides above feel free to use any other guide available on the Internet. Now, assuming that everything is ok and the image is ready. You can now gently plugin the microcard to your Raspberry PI Zero W board. Remember that you can always confirm that your download was successful with the sha1 code. In Linux systems you can use sha1sum followed by the file name (the image) and print the sha1 code that should and must be the same as it is at the end of the official page where you downloaded the image. Common issues Sometimes, working with Raspberry Pi boards can lead to issues. We all have faced some of them and hope to never face them again. The Pi Zero is so minimal and it can be tough to tell if it is working or not. Since, there is no LED on the board, sometimes a quick check if it is working properly or something went wrong is handy. Debugging steps With the following steps you will probably find its status: 1. Take your board, with nothing in any slot or socket. Remove even the microSD card! 2. Take a normal micro-USB to USB-ADATA SYNC cable and connect the one side to your computer and the other side to the Pi's USB, (not the PWR_IN). 3. If the Zero is alive: • On Windows the PC will go ding for the presence of new hardware and you should see BCM2708 Boot in Device Manager. • On Linux, with a ID 0a5c:2763 Broadcom Corp message from dmesg. Try to run dmesg in a Terminal before your plugin the USB and after that. You will find a new record there. Output example: [226314.048026] usb 4-2: new full-speed USB device number 82 using uhci_hcd [226314.213273] usb 4-2: New USB device found, idVendor=0a5c, idProduct=2763 [226314.213280] usb 4-2: New USB device strings: Mfr=1, Product=2, SerialNumber=0 [226314.213284] usb 4-2: Product: BCM2708 Boot [226314.213] usb 4-2: Manufacturer: Broadcom If you see any of the preceding, so far so good, you know the Zero's not dead. microSD card issue Remember that if you boot your Raspberry and there is nothing working, you may have burned your microSD card wrong. This means that your card many not contain any boot partition as it should and it is not able to boot the first files. That problem occurs when the distribution is burned to /dev/sdd1 and not to /dev/sdd as we should. This is a quite common mistake and there will be no errors in your monitor. It will just not work! Case protection Raspberry Pi boards are electronics and we never place electronics in metallic surfaces or near magnetic objects. It will affect the booting operation of the Raspberry and it will probably not work. So a tip of advice, spend some extra money for the Raspberry PI Case and protect your board from anything like that. There are many problems and issues when hanging your raspberry pi using tacks. It may be silly, but there are many that do that. Summary Raspberry Pi Zero W is a new promising board allowing everyone to connect their devices to the Internet and use their skills to develop projects including software and hardware. This board is the new toy of any engineer interested in Internet of Things, security, automation and more! We have gone through an introduction in the new Raspberry Pi Zero board and the rest of its family and a brief analysis on some extra components that you should buy as well.
Read more
  • 0
  • 0
  • 3097

article-image-4-must-know-levels-in-mongodb-security
Amey Varangaonkar
01 Mar 2018
8 min read
Save for later

4 must-know levels in MongoDB security

Amey Varangaonkar
01 Mar 2018
8 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Mastering MongoDB 3.x written by Alex Giamas. It presents the techniques and essential concepts needed to tackle even the trickiest problems when it comes to working and administering your MongoDB instance.[/box] Security is a multifaceted goal in a MongoDB cluster. In this article, we will examine different attack vectors and how we can protect MongoDB against them. 1. Authentication in MongoDB Authentication refers to verifying the identity of a client. This prevents impersonating someone else in order to gain access to our data. The simplest way to authenticate is using a username/password pair. This can be done via the shell in two ways: > db.auth( <username>, <password> ) Passing in a comma separated username and password will assume default values for the rest of the fields: > db.auth( { user: <username>, pwd: <password>, mechanism: <authentication mechanism>, digestPassword: <boolean> } ) If we pass a document object we can define more parameters than username/password. The (authentication) mechanism parameter can take several different values with the default being SCRAM-SHA-1. The parameter value MONGODB-CR is used for backwards compatibility with versions earlier than 3.0 MONGODB-X509 is used for TLS/SSL authentication. Users and internal replica set servers can be authenticated using SSL certificates, which are self-generated and signed, or come from a trusted third-party authority. This for the configuration file: security.clusterAuthMode / net.ssl.clusterFile Or like this on the command line: --clusterAuthMode and --sslClusterFile > mongod --replSet <name> --sslMode requireSSL --clusterAuthMode x509 --sslClusterFile <path to membership certificate and key PEM file> --sslPEMKeyFile <path to SSL certificate and key PEM file> --sslCAFile <path to root CA PEM file> MongoDB Enterprise Edition, the paid offering from MongoDB Inc., adds two more options for authentication. The first added option is GSSAPI (Kerberos). Kerberos is a mature and robust authentication system that can be used, among others, for Windows based Active Directory Deployments. The second added option is PLAIN (LDAP SASL). LDAP is just like Kerberos; a mature and robust authentication mechanism. The main consideration when using PLAIN authentication mechanism is that credentials are transmitted in plaintext over the wire. This means that we should secure the path between client and server via VPN or a TSL/SSL connection to avoid a man in the middle stealing our credentials. 2. Authorization in MongoDB After we have configured authentication to verify that users are who they claim they are when connecting to our MongoDB server, we need to configure the rights that each one of them will have in our database. This is the authorization aspect of permissions. MongoDB uses role-based access control to control permissions for different user classes. Every role has permissions to perform some actions on a resource. A resource can be a collection or a database or any collections or any databases. The command's format is: { db: <database>, collection: <collection> } If we specify "" (empty string) for either db or collection it means any db or collection. For example: { db: "mongo_books", collection: "" } This would apply our action in every collection in database mongo_books. Similar to the preceding, we can define: { db: "", collection: "" } We define this to apply our rule to all collections across all databases, except system collections of course. We can also apply rules across an entire cluster as follows: { resource: { cluster : true }, actions: [ "addShard" ] } The preceding example grants privileges for the addShard action (adding a new shard to our system) across the entire cluster. The cluster resource can only be used for actions that affect the entire cluster rather than a collection or database, as for example shutdown, replSetReconfig, appendOplogNote, resync, closeAllDatabases, and addShard. What follows is an extensive list of cluster specific actions and some of the most widely used actions. The list of most widely used actions are: find insert remove update bypassDocumentValidation viewRole / viewUser createRole / dropRole createUser / dropUser inprog killop replSetGetConfig / replSetConfigure / replSetStateChange / resync getShardMap / getShardVersion / listShards / moveChunk / removeShard / addShard dropDatabase / dropIndex / fsync / repairDatabase / shutDown serverStatus / top / validate Cluster-specific actions are: unlock authSchemaUpgrade cleanupOrphaned cpuProfiler inprog invalidateUserCache killop appendOplogNote replSetConfigure replSetGetConfig replSetGetStatus replSetHeartbeat replSetStateChange resync addShard flushRouterConfig getShardMap listShards removeShard shardingState applicationMessage closeAllDatabases connPoolSync fsync getParameter hostInfo logRotate setParameter shutdown touch connPoolStats cursorInfo diagLogging getCmdLineOpts getLog listDatabases netstat serverStatus top If this sounds too complicated that is because it is. The flexibility that MongoDB allows in configuring different actions on resources means that we need to study and understand the extensive lists as described previously. Thankfully, some of the most common actions and resources are bundled in built-in roles. We can use the built-in roles to establish the baseline of permissions that we will give to our users and then fine grain these based on the extensive list. User roles in MongoDB There are two different generic user roles that we can specify: read: A read-only role across non-system collections and the following system collections: system.indexes, system.js, and system.namespaces collections readWrite: A read and modify role across non-system collections and the system.js collection Database administration roles in MongoDB There are three database specific administration roles shown as follows: dbAdmin: The basic admin user role which can perform schema-related tasks, indexing, gathering statistics. A dbAdmin cannot perform user and role management. userAdmin: Create and modify roles and users. This is complementary to the dbAdmin role. dbOwner: Combining readWrite, dbAdmin, and userAdmin roles, this is the most powerful admin user role. Cluster administration roles in MongoDB These are the cluster wide administration roles available: hostManager: Monitor and manage servers in a cluster. clusterManager: Provides management and monitoring actions on the cluster. A user with this role can access the config and local databases, which are used in sharding and replication, respectively. clusterMonitor: Read-only access for monitoring tools provided by MongoDB such as MongoDB Cloud Manager and Ops Manager agent. clusterAdmin: Provides the greatest cluster-management access. This role combines the privileges granted by the clusterManager, clusterMonitor, and hostManager roles. Additionally, the role provides the dropDatabase action. Backup restore roles Role-based authorization roles can be defined in the backup restore granularity level as Well: backup: Provides privileges needed to back-up data. This role provides sufficient privileges to use the MongoDB Cloud Manager backup agent, Ops Manager backup agent, or to use mongodump. restore: Provides privileges needed to restore data with mongorestore without the --oplogReplay option or without system.profile collection data. Roles across all databases Similarly, here are the set of available roles across all databases: readAnyDatabase: Provides the same read-only permissions as read, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. readWriteAnyDatabase: Provides the same read and write permissions as readWrite, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. userAdminAnyDatabase: Provides the same access to user administration operations as userAdmin, except it applies to all but the local and config databases in the cluster. Since the userAdminAnyDatabase role allows users to grant any privilege to any user, including themselves, the role also indirectly provides superuser access. dbAdminAnyDatabase: Provides the same access to database administration operations as dbAdmin, except it applies to all but the local and config databases in the cluster. The role also provides the listDatabases action on the cluster as a whole. Superuser Finally, these are the superuser roles available: root: Provides access to the operations and all the resources of the readWriteAnyDatabase, dbAdminAnyDatabase, userAdminAnyDatabase, clusterAdmin, restore, and backup combined. __internal: Similar to root user, any __internal user can perform any action against any object across the server. 3. Network level security Apart from MongoDB specific security measures, there are best practices established for network level security: Only allow communication between servers and only open the ports that are used for communicating between them. Always use TLS/SSL for communication between servers. This prevents man-inthe- middle attacks impersonating a client. Always use different sets of development, staging, and production environments and security credentials. Ideally, create different accounts for each environment and enable two-factor authentication in both staging and production environments. 4. Auditing security No matter how much we plan our security measures, a second or third pair of eyes from someone outside our organization can give a different view of our security measures and uncover problems that we may not have thought of or underestimated. Don't hesitate to involve security experts / white hat hackers to do penetration testing in your servers. Special cases Medical or financial applications require added levels of security for data privacy reasons. If we are building an application in the healthcare space, accessing users' personal identifiable information, we may need to get HIPAA certified. If we are building an application interacting with payments and managing cardholder information, we may need to become PCI/DSS compliant. The specifics of each certification are outside the scope of this book but it is important to know that MongoDB has use cases in these fields that fulfill the requirements and as such it can be the right tool with proper design beforehand. To sum up, in addition to the best practices listed above, developers and administrators must always use common sense so that security interferes only as much as needed with operational goals. If you found our article useful, make sure to check out this book Mastering MongoDB 3.x to master other MongoDB administration-related techniques and become a true MongoDB expert.  
Read more
  • 0
  • 0
  • 3202

article-image-getting-started-apache-mesos
Packt
28 Feb 2018
7 min read
Save for later

Getting Started with Apache Mesos

Packt
28 Feb 2018
7 min read
1 Getting Started with Apache Mesos In this article by David Blomquist author of the book Apache Mesos Cookbook, we will In this chapter, we will provide an overview of the Mesos architecture and recipes for installing Mesos on Linux and Mac. The following are the recipes coverbeed covering in this chapter following topics: Installing Mesos on Ubuntu 16.04 from packages Installing Mesos on Ubuntu 14.04 from packages Installing Mesos on CentOS 7 and RHEL 7 from packages Introduction Apache Mesos is a cluster management software that can distribute the combined resources of many individual servers to applications through frameworks. Mesos is an open source software that is free to download and use in accordance with the Apache License 2.0. This book article will provide the reader with recipes for deploying and developing Apache Mesos and the Mesos frameworks. Mesos can run on Linux, Mac, and Windows. However, we recommend running Mesos on Linux for production deployments and on Mac and Windows for development purposes only. Mesos can be installed from TAR files, Git, source code, or from packages downloaded from repositories. We have chosen to cover only a few installation methods on select operating system versions in this bookarticle. The reasons for covering these specific operating systems and installation methods are as follows: The operating systems natively include a kernel that supports full resource isolation The operating systems are current as of this writing with long term support The operating systems and installation methods do not require workarounds or an excessive number of external repositories The installation methods use the latest stable version of Mesos, whether it is from packages or source code Mesosphere, founded by one of the original developers of Mesos, is a company that provides free open source packages as well as commercial support for Mesos. Mesosphere packages are well maintained and provide an easy way to install and run Mesos. You can run a production Mesos cluster using these packages if you do not require any customization of the build or install process. However, installing from source will allow you to customize the build and install process and enable and disable features. If you want a completely open source and customizable production cluster, we recommend you to install Mesos from source on Ubuntu 16.04 or Ubuntu 14.04. If you want a development environment on a Mac, building from source on OS X is the only way to go. The installation methods that we cover will all provide you with a good base for building out a Mesos development or production environment. We will guide you through the installations in the following sections, but first, you will need to plan for your Mesos deployment. For a Mesos development environment, you only need one host or node. The node can be a physical computer, a virtual machine, or a cloud instance. For a production cluster, we recommend at least three master nodes and as many slave nodes as you will need to support your application frameworks. You can think of the slave nodes as a pool of CPU, RAM, and storage that can be increased by simply adding more slave nodes. Mesos makes it very easy to add slave nodes to an existing cluster as your application requirements increase. At this point, you should know whether you will be building a Mesos development environment or a production cluster and you should have an idea of how many master and slave nodes you will need. The next sections will provide recipes for installing Mesos in the environment of your choice Installing Mesos on Ubuntu 16.04 from Packages In this recipe, we will be installing Mesos .deb packages from the Mesosphere repositories using apt. Getting ready You must be running a 64-bit version of the Ubuntu 16.04 operating system and it should be patched to the most current patch level using apt-get prior to installing the Mesos packages. How to do it… First, download and install the OpenPGP key for the Mesosphere packages: $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF Now install the Mesosphere repository:  $ DISTRO=$(lsb_release -is | tr '‘[:upper:]'’ '‘[:lower:]'’) $ CODENAME=$(lsb_release -cs) $ echo "“deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main"” | sudo tee /etc/apt/sources.list.d/mesosphere.list Update the apt-get package indexes:  $ sudo apt-get update And finally, install Mesos and the included ZooKeeper binaries:  $ sudo apt-get -y install mesos At this point, you can start Mesos to do some basic testing. To start the Mesos master and agent (slave) daemons, execute the following:  $ sudo service mesos-master start $ sudo service mesos-slave start To validate the Mesos installation, open a browser and point it to http://:5050. Replace with the actual address of the host with the new Mesos installation How it works… The Mesosphere packages provide the software required to run Mesos.. Next, you will configure ZooKeeper, which is covered in Chapter 2. See also If you prefer to build and install Mesos on Ubuntu from source code, we will cover that in an upcoming section in this chapter. Installing Mesos on Ubuntu 14.04 from Packages In this recipe, we will be installing Mesos .deb packages from the Mesosphere repositories using apt. Getting ready You must be running a 64-bit version of the Ubuntu 14.04 operating system and it should be patched to the most current patch level using apt-get prior to installing the Mesos packages. How to do it…… First, download and install the OpenPGP key for the Mesosphere packages:  $ sudo apt-key adv --keyserver keyserver.ubuntu.com --recv E56151BF Now install the Mesosphere repository: $ DISTRO=$(lsb_release -is | tr '‘[:upper:]'’ '‘[:lower:]'’) $ CODENAME=$(lsb_release -cs) $ echo "“deb http://repos.mesosphere.io/${DISTRO} ${CODENAME} main"” | sudo tee /etc/apt/sources.list.d/mesosphere.list Update the apt-get package indexes:  $ sudo apt-get update And finally, install Mesos and the included ZooKeeper binaries: $ sudo apt-get -y install mesos At this point, you can either start Mesos to do some basic testing. To start the Mesos master and agent (slave) daemons, execute the following command:  $ sudo service mesos-master start $ sudo service mesos-slave start To validate the Mesos installation, open a browser and point it to http://:5050. Replace with the actual address of the host with the new Mesos installation How it works… The Mesosphere packages provide the software required to run Mesos. Next, you will configure ZooKeeper, which is covered in Chapter 2., See also If you prefer to build and install Mesos on Ubuntu from source code, we will cover that in an upcoming section in this chapter. Installing Mesos on CentOS 7 and RHEL 7 from Packages In this recipe, we will be installing Mesos .rpm packages from the Mesosphere repositories using yum. Getting ready Your CentOS 7 or RHEL 7 operating system should be patched to the most current patch level using yum prior to installing the Mesosphere packages. How to do it… First, add the Mesosphere repository: $ sudo rpm -Uvh http://repos.mesosphere.io/el/7/noarch/RPMS/mesosphere-el-repo-7-1.noarch.rpm And now, install Mesos and ZooKeeper: $ sudo yum -y install mesos mesosphere-zookeeper At this point, you can start Mesos to do some basic testing. To start the Mesos master and agent (slave) daemons, execute the following:  $ sudo service mesos-master start $ sudo service mesos-slave start To validate the Mesos installation, open a browser and point it to http://:5050. Replace with the actual address of the host with the new Mesos installation How it works… The Mesosphere packages provide the software required to run Mesos. Next, you will configure ZooKeeper, which is covered in Chapter 2. See also If you prefer to build and install Mesos from source code on RHEL 7 or CentOS 7, you can find installation instructions for CentOS 7 on the mesos.apache.org website. We do not cover installing Mesos source code on RHEL7 or CentOS 7 in this book article due to dependencies that require the installation of packages from multiple third-party repositories.  Summary In this article we have learned how to install Mesos on Ubuntu 16.04 from packages, how to install Mesos on Ubuntu 14.04 from packages and how to installing Mesos on CentOS 7 and RHEL 7 from packages. Resources for Article:   Further resources on this subject: Apache Mesos Cookbook Mastering Mesos
Read more
  • 0
  • 0
  • 2221

article-image-defining-business-context
Packt
28 Feb 2018
15 min read
Save for later

Defining the business context

Packt
28 Feb 2018
15 min read
Introduction In this article by Ezra Schwartz the author of the book Experience Design for Beginners, will cover topics on how can designers help organizations form an experience strategy. It is a common mistake to believe that companies are formed to create products or services. The primary objective of any company is to be as profitable as possible. Profits are any funds left after all obligations of the company are paid off. Profits help attract new investors and fuel investments in new products--new products that will help make more profits --and the sequence repeats. Without profits, companies just shut down. The question how will our product/s make us profitable? is the generic survival challenge shared by all companies, regardless of its size, industry, or product. The business context of each company, however, may be very unique, based on its circumstances. That business context will influence its product experience strategy, and here is why: Suppose that you are the CEO of a small unknown company, which is similar to&#160;Pure Digital, maker of the Flip and that you want to take advantage of cheap memory cards to create a tapeless camcorder. Here are a couple of options for a product approach: Create a camcorder that has all the features of a tape-based camcorder, and a similar look and feel, except that no cassette-tape is need. The product will have a competitive edge--fewer moving parts will make it lighter, cheaper, and more durable, and transfer to computer will also be much simplified. However, the product will have to compete with the established giants in the camcorder market, such as Sony, Panasonic, or Canon. If your product begins to show signs of success, these other companies will release products similar to yours, well before your company could recoup its initial investments Create a product that is a complete departure from typical products. Your product will address all the frustrations that people have with current products, and be a compelling, highly competitive alternative. If it becomes successful, its distinctive look and feel will be associated with your company. By the time the competition release their own products, your product will dominate the market for this segment. However, the product will require coming up with a new design, which will unify the various experience features into a product solution that is attractive and profitable Which option will you choose? Both have their risks and opportunities, but the context may call for option two, which depends on a successful experience design, and indeed, thanks to its experience design of its Flip line of products, Pure Digital was able to popularize and dominate the pocket camcorders market and win over many customers from the tape camcorder market. Another question all companies face is "What's next?". The question may emerge as a result of changes in leadership, opportunities to implement new technologies, new ways to implement the existing technologies, decreased sales of its current product/s, customer dissatisfaction with existing products, or numerous other reasons. Again, it is a matter of each company's business context. Whatever the motivation, although the organization and its products are currently very successful, company leadership may feel that it is necessary for them to invest profits in either developing the next generation of their product/s or creating new ones. Another common aspect of most companies is limited funds and resources for future products. Short and long term priorities must be aligned: Pressures to increase spending: Investing in the future requires spending in the present. Long-term vision requires companies to make significant investments in product research and development, with no guarantee that hopes for future profits in the form of a competitive edge, larger market share will materialize Pressures to reduce spending: Investing in future reduces the funds necessary to compete in the present. Immediate budgetary constraints and competitive pressures require companies to focus on maintaining quarterly profitability, by investing in advertising and other methods, which can improve the performance of the existing profitable products. How well future and present priorities are aligned is a measure of multiple factors. Some would argue that the most important factor is a mutual trust between employees and management. Trust is established with transparency, good communication, fairness in compensation and treatment, and a belief in a shared vision for the present and the future. Mutual trust plays a critical role in flattening the hierarchical structures inherent to companies. Whether any experience design strategy, even a good one, can be a success, also depends on internal company dynamics of trust, because designers, whether employees or consultants, operate within the hierarchical organizational structure. <div class="packt_figure CDPAlignCenter CDPAlign"><img class=" image-border" src="> Companies are hierarchical. The preceding diagram reflects a couple of common characteristics: This is a hierarchical, top-down structure. The larger the company, the more layers separate senior decision-makers from the product. Some executives may not be familiar with critical issues with the product or may disagree about priorities, which sometimes leads to internal fragmentation. There is an unavoidable compartmentalization and specialization, as each group within the company must focus on their responsibilities for a specific aspect of the organization. Regardless of the company's size, this natural division of roles and responsibilities is often the culprit of dispute over issues of product vision and priorities. Experience design is often hierarchically nested under marketing, product management, or engineering department. In highly compartmentalized organizations, trust is sometimes an issue, and designers, who need the cooperation of all units, have a hard time aligning competing visions for the product.  Some departments are highly influential, whereas the participation of other departments can be minimal. In such circumstances, it is very difficult to emerge with an experience strategy that satisfies everyone. Consequently, the end result is an unsatisfying shadow of the original vision. Moreover, it is not uncommon for projects to be canceled midstream, due to internal infighting. The following diagram shows the flatting effect a culture of mutual trust has on successful design. Experience strategists, with a mandate from leadership, can reach out to each of the groups within a company, synthesize the various inputs, identify internal gaps of vision and priority, and help all stakeholders embrace a unified vision forward. This unified vision is referred to as the voice of the business <p><img class="image-border" src="" /></p> Over the past couple of decades, a growing number of organizations recognized the value of integrated design by forming in-house experience design departments led by a senior designer who reports directly to the CEO. Celebrated examples in the automotive tech and manufacturing industries include BMW, Apple, and Herman-Miller. The results of an integrated design are expressed in the quality of their products, improved sales performance, and increased customer satisfaction. Although the trend points toward fully integrated design capabilities, there are still many organizations--that for reasons, such as size, budget, or lack of skilled resources--prefer to partner with design consultants, to guide their product experience strategy. Design consultants can be effective when given autonomy and active support from leadership. Business needs - research activities Experience strategists conduct a number of research activities during the first phase of their experience design project. The purpose of the research is two fold: Help designers understand the company's vision and objectives for the product, that is, what is at stake. Based on this context, they work with stakeholders to align product objectives and reach a shared understanding on the goals of design. Once an organizational alignment is achieved, use research insights to develop a product experience strategy, which is aligned with agreed company objectives. The included research activities are as follows: Stakeholder and subject-matter expert (SME) interview Documents review Competitive research Expert product reviews <div class="packt_figure"></div> Stakeholder and subject-matter expert interviewsStakeholders are typically senior executives who have a direct responsibility for, or influence on, the product. Stakeholders include product managers, who manage the planning and day-to-day activities associated with their product, and have a direct decision-making authority over its development. In projects that are important to the company, it is not uncommon for the executive leadership from the chief executive and down to be among the stakeholders due to their influence and authority to the direct overall product strategy. The purpose of stakeholder interviews is to gather and understand the perspective of each individual stakeholder and align the perspectives of all stakeholders around a unified vision around the scope, purpose, outcomes, opportunities and obstacles involved in undertaking a new product development project. Gaps among stakeholders on fundamental project objectives and priorities, will lead to serious trouble down the road. It is best to surfaces such deviations as early as possible, and help stakeholders reach a productive alignment. The purpose of subject-matter experts (SMEs) interviews is to balance the strategic high-level thinking provided by stakeholders, with detailed insights of experienced employees who are recognized for their deep domain expertise. Sales, customer service, and technical support employees have a wealth of operational knowledge of products and customers, which makes them invaluable when analyzing current processes and challenges. Prior to the interviews, the experience strategist prepares an interview guide. The purpose of the guide is to ensure the following<ol< All stakeholders can respond to the same questions All research topics are covered if interviews are conducted by different interviewers Interviews make the best use of stakeholders' valuable time Some of the questions in the guide are general and directed at all participants, others are more specific and focus on the stakeholders specific areas of responsibility. Similar guides are developed for SME interviews. In-person interviews are the best, because they take place at the onset of the project and provide a good opportunity to build rapport and trust between the designer and interviewee. After a formal introduction regarding the purpose of the interview and general questions regarding the person's role and professional experience, the person is asked for their personal assessment and opinions on various topics. Here is a sample of different topics: Objectives and obstacles Prioritized goals for the project What does success look like What kind of obstacles the project is facing, and suggestions to overcome them Competition Who are your top competitors Strength and weaknesses relative to the competition Product features and functionality Which features are missing Differentiating features Features to avoid The interviews are designed to last no more than an hour and are documented with notes and audio recordings, if possible. The answers are compiled and analyzed and the result is presented in a report. The report suggests a unified list of prioritized objectives, and highlights gaps and other risks that have been reported. The report is one of the inputs into the development of the overall product experience strategy. Product expert reviewsProduct expert reviews, sometimes referred to as heuristic evaluations, are professional assessments of a current product, which are performed by design experts for the purpose of identifying usability and user experience issues. The thinking behind the expert review technique is very practical. Experience designers have the expertise to assess the experience quality of a product in a systematic way, using a set of accepted <span class="KeyPACKT">heuristics</span>. A heuristic is a rule of thumb for assessing products. For example, the error prevention heuristic deals with how well the evaluated product prevents the user from making errors. The word heuristic often raises questions about its meaning, and the method has been criticized for its inherent weaknesses due to the following: Subjectivity of the evaluator Expertise and domain knowledge of the evaluator Cultural and demographic background of the evaluator These weaknesses increase the probability that the outcome of an expert evaluation will reflect the biases and preferences of the evaluator, resulting in potentially different conclusions about the same product. Still, expert evaluations, especially if conducted by two evaluators, and their aligned findings, have proven to be an effective tool for experience practitioners who need a fast and cost-effective assessment of a product, particularly digital interfaces. Jacob Nielsen developed the method in the early 1990s. Although there are other sets of heuristics, Nielsen's are probably the most known and commonly used. His initial set of heuristics was first published in his book Usability Engineering and is brought here verbatim, as there is no need for modification: Visibility of system status:The system should always keep users informed about what is going on, through appropriate feedback within reasonable time. Match between system and the real world:The system should speak the user's language, with words, phrases and concepts familiar to the user, rather than system-oriented terms. Follow real-world conventions, making information appear in a natural and logical order. User control and freedom:Users often choose system functions by mistake and will need a clearly marked "emergency exit" to leave the unwanted state without having to go through an extended dialogue. Support undo and redo. Consistency and standards: Users should not have to wonder whether different words, situations, or actions mean the same thing. Follow platform conventions. Error prevention: Even better than good error messages is a careful design which prevents a problem from occurring in the first place. Either eliminate error-prone conditions or check for them and present users with a confirmation option before they commit to the action. ecognition rather than recall: Minimize the user's memory load by making objects, actions, and options visible. The user should not have to remember information from one part of the dialogue to another. Instructions for use of the system should be visible or easily retrievable whenever appropriate. Flexibility and efficiency of use: Accelerators--unseen by the novice user--may often speed up the interaction for the expert user such that the system can cater to both inexperienced and experienced users. Allow users to tailor frequent actions. Aesthetic and minimalist design: Dialogues should not contain information which is irrelevant or rarely needed. Every extra unit of information in a dialogue competes with the relevant units of information and diminishes their relative visibility. Help users recognize, diagnose, and recover from errors: Error messages should be expressed in plain language (no codes), precisely indicate the problem, and constructively suggest a solution. Help and documentatio: Even though it is better if the system can be used without documentation, it may be necessary to provide help and documentation. Any such information should be easy to search, focused on the user's task, list concrete steps to be carried out, and not be too large. Competitive research and analysis: Most companies operate in a competitive marketplace, and having a deep understanding of the competition is critical to the success and survival. Here are few of the questions that a competitive research helps addresses: How does a product or service compare to the competition? What are the strength and weaknesses of competing offerings? What alternatives and choices does the target audience have? Experience strategists use several methods to collect and analyze competitive information. From interviews with stakeholder and SMEs, they know who the direct competition is. In some product categories, such as automobiles and consumer products, companies can reverse-engineer competitive products and try to match or surpass their capabilities. Additionally, designers can develop extensive experience analysis of such competitive products, because they can have a first-hand experience with it. With some hi-tech products, however, some capabilities are cocooned within proprietary software or secret production processes. In these cases, designers can glean the capabilities from an indirect evidence of use. The Internet is a main source of competitive information, from the ability to have a direct access to a product online, to reading help manuals, user guides, bulletin boards, reviews, and analysis in trade publications. Occasionally, unauthorized photos or documents are leaked to the public domain, and they provide clues, sometimes real and sometimes bogus, about a secret upcoming product. Social media too is an important source of competitive data in the form of customers reviews on Yelp, Amazon, or Facebook. With the wealth of this information, a practical strategy to surpass the competition and delivering a better experience can be developed. For example, Uber has been a favorite car hailing service for a while. This service has also generated public controversy and had dissatisfied riders and drivers who are not happy with its policies, including its resistance for tips. By design, a tipping function is not available in the app, which is the primary transaction method between the rider, company and, driver. Research indicates, however, that tipping for the service is a common social norm and that most people tip because it makes them feel better. Not being able to tip places riders in an uncomfortable social setting and stirs negative emotions against Uber. The evidence of dissatisfaction can be easily collected from numerous web sources and from interviewing actual riders and drivers. For Uber competitors, such as Lyft and Curb, by making tipping an integrated part of their apps, provides an immediate competitive edge that improves the experience of both riders, who have an option to reward the driver for their good service, and drivers, who benefit from an increased income. This, and additional improvements over the inferior Uber experience, become a part of an overall experience strategy that is focused on improving the likelihood that riders and drivers will dump Uber in their favor, Summary All the activities described so far were meant to enable designers to fuse business and audience perspectives, and emerge with a product strategy that is focused on user experience as means to product success.   Resources for Article:   Further resources on this subject: Exploring Experience Design
Read more
  • 0
  • 0
  • 4316
article-image-6-index-types-in-postgresql-10-you-should-know
Sugandha Lahoti
28 Feb 2018
13 min read
Save for later

6 index types in PostgreSQL 10 you should know

Sugandha Lahoti
28 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book Mastering  PostgreSQL 10 written by Hans-Jürgen Schönig. This book will help you master the capabilities of PostgreSQL 10 to efficiently manage and maintain your database.[/box] In today’s post, we will learn about the different index types available for sorting in PostgreSQL and also understand how they function. What are index types and why you need them Data types can be sorted in a useful way. Just imagine a polygon. How would you sort these objects in a useful way? Sure, you can sort by the area covered, its length or so, but doing this won't allow you to actually find them using a geometric search. The solution to the problem is to provide more than just one index type. Each index will serve a special purpose and do exactly what is needed. The following six index types are available (as of PostgreSQL 10.0): test=# SELECT * FROM pg_am; amname  | amhandler   | amtype ---------+-------------+-------- btree | bthandler | i hash     | hashhandler | i GiST     | GiSThandler | i Gin | ginhandler   | i spGiST   | spghandler   | i brin | brinhandler | i (6 rows) A closer look at the 6 index types in PostgreSQL 10 The following sections will outline the purpose of each index type available in PostgreSQL. Note that there are some extensions that can be used on top of what you can see here. Additional index types available on the web are rum, vodka, and in the future, cognac. Hash indexes Hash indexes have been around for many years. The idea is to hash the input value and store it for later lookups. Having hash indexes actually makes sense. However, before PostgreSQL 10.0, it was not advised to use hash indexes because PostgreSQL had no WAL support for them. In PostgreSQL 10.0, this has changed. Hash indexes are now fully logged and are therefore ready for replication and are considered to be a 100% crash safe. Hash indexes are generally a bit larger than b-tree indexes. Suppose you want to index 4 million integer values. A btree will need around 90 MB of storage to do this. A hash index will need around 125 MB on disk. The assumption made by many people that a hash is super small on the disk is therefore, in many cases, just wrong. GiST indexes Generalized Search Tree (GiST) indexes are highly important index types because they are used for a variety of different things. GiST indexes can be used to implement R-tree behavior and it is even possible to act as b-tree. However, abusing GiST for b-tree indexes is not recommended. Typical use cases for GiST are as follows: Range types Geometric indexes (for example, used by the highly popular PostGIS extension) Fuzzy searching Understanding how GiST works To many people, GiST is still a black box. We will now discuss how GiST works internally. Consider the following diagram: Source: http://leopard.in.ua/assets/images/postgresql/pg_indexes/pg_indexes2.jpg   Take a look at the tree. You will see that R1 and R2 are on top. R1 and R2 are the bounding boxes containing everything else. R3, R4, and R5 are contained by R1. R8, R9, and R10 are contained by R3, and so on. A GiST index is therefore hierarchically organized. What you can see in the diagram is that some operations, which are not available in b-trees are supported. Some of those operations are overlaps, left of, right of, and so on. The layout of a GiST tree is ideal for geometric indexing. Extending GiST Of course, it is also possible to come up with your own operator classes. The following strategies are supported: Operation Strategy number Strictly  left  of 1 Does  not  extend  to  right  of 2 Overlaps 3 Does  not  extend  to  left  of 4 Strictly  right  of 5 Same 6 Contains 7 Contained  by 8 Does  not  extend  above 9 Strictly  below 10 Strictly  above 11 Does  not  extend  below 12 If you want to write operator classes for GiST, a couple of support functions have to be provided. In the case of a b-tree, there is only the same function - GiST indexes provide a lot more: Function Description Support function number consistent The functions determine whether a key satisfies the query qualifier. Internally, strategies are looked up and checked. 1 union Calculate the union of a set of keys. In case of numeric values, simply the upper and lower values or a range are computed. It is especially important to geometries. 2 compress Compute a compressed representation of a key or value. 3 decompress This is the counterpart of the compress function. 4   penalty During insertion, the cost of inserting into the tree will be calculated. The cost determines where the new entry will go inside the tree. Therefore, a good penalty function is key to the good overall performance of the index. 5 picksplit Determines where to move entries in case of a page split. Some entries have to stay on the old page while others will go to the new page being created. Having a good picksplit function is essential to a good index performance. 6 equal The equal function is similar to the same function you have already seen in b-trees. 7 distance Calculates the distance (a number) between a key and the query value. The distance function is optional and is needed in case KNN search is supported. 8 fetch Determine the original representation of a compressed key. This function is needed to handle index only scans as supported by the recent version of PostgreSQL. 9 Implementing operator classes for GiST indexes is usually done in C. If you are interested in a good example, I advise you to check out the btree_GiST module in the contrib directory. It shows how to index standard data types using GiST and is a good source of information as well as inspiration. GIN indexes Generalized inverted (GIN) indexes are a good way to index text. Suppose you want to index a million text documents. A certain word may occur millions of times. In a normal b- tree, this would mean that the key is stored millions of times. Not so in a GIN. Each key (or word) is stored once and assigned to a document list. Keys are organized in a standard b- tree. Each entry will have a document list pointing to all entries in the table having the same key. A GIN index is very small and compact. However, it lacks an important feature found in the b-trees-sorted data. In a GIN, the list of item pointers associated with a certain key is sorted by the position of the row in the table and not by some arbitrary criteria. Extending GIN Just like any other index, GIN can be extended. The following strategies are available: Operation Strategy number Overlap 1 Contains 2 Is  contained  by 3 Equal 4 On top of this, the following support functions are available: Function Description Support function number compare The compare function is similar to the same function you have seen in b-trees. If two keys are compared, it returns -1 (lower), 0 (equal), or 1 (higher). 1 extractValue Extract keys from a value to be indexed. A value can have many keys. For example, a text value might consist of more than one word. 2 extractQuery Extract keys from a query condition. 3 consistent Check whether a value matches a query condition. 4 comparePartial Compare a partial key from a query and a key from the index. Returns -1, 0, or 1 (similar to the same function supported by b-trees). 5 triConsistent Determine whether a value matches a query condition (ternary variant). It is optional if the consistent function is present. 6 If you are looking for a good example of how to extend GIN, consider looking at the btree_gin module in the PostgreSQL contrib directory. It is a valuable source of information and a good way to start your own implementation. SP-GiST indexes Space partitioned GiST (SP-GiST) has mainly been designed for in-memory use. The reason for this is an SP-GiST stored on disk needs a fairly high number of disk hits to function. Disk hits are way more expensive than just following a couple of pointers in RAM. The beauty is that SP-GiST can be used to implement various types of trees such as quad- trees, k-d trees, and radix trees (tries). The following strategies are provided: Operation Strategy number Strictly  left  of 1 Strictly  right  of 5 Same 6 Contained  by 8 Strictly  below 10 Strictly  above 11 To write your own operator classes for SP-GiST, a couple of functions have to be provided: Function Description Support function number config Provides information about the operator class in use 1 choose Figures out how to insert a new value into an inner tuple 2 picksplit Figures out how to partition/split a set of values 3 inner_consistent Determine which subpartitions need to be searched for a query 4 leaf_consistent Determine whether key satisfies the query qualifier 5 BRIN indexes Block range indexes (BRIN) are of great practical use. All indexes discussed until now need quite a lot of disk space. Although a lot of work has gone into shrinking GIN indexes and the like, they still need quite a lot because an index pointer is needed for each entry. So, if there are 10 million entries, there will be 10 million index pointers. Space is the main concern addressed by the BRIN indexes. A BRIN index does not keep an index entry for each tuple but will store the minimum and the maximum value of 128 (default) blocks of data (1 MB). The index is therefore very small but lossy. Scanning the index will return more data than we asked for. PostgreSQL has to filter out these additional rows in a later step. The following example demonstrates how small a BRIN index really is: test=# CREATE INDEX idx_brin ON t_test USING brin(id); CREATE INDEX test=# di+ idx_brin List of relations Schema | Name    | Type   | Owner | Table | Size --------+----------+-------+-------+--------+-------+------------- public | idx_brin | index | hs | t_test | 48 KB (1 row) In my example, the BRIN index is 2,000 times smaller than a standard b-tree. The question naturally arising now is, why don't we always use BRIN indexes? To answer this kind of question, it is important to reflect on the layout of BRIN; the minimum and maximum value for 1 MB are stored. If the data is sorted (high correlation), BRIN is pretty efficient because we can fetch 1 MB of data, scan it, and we are done. However, what if the data is shuffled? In this case, BRIN won't be able to exclude chunks of data anymore because it is very likely that something close to the overall high and the overall low is within 1 MB of data. Therefore, BRIN is mostly made for highly correlated data. In reality, correlated data is quite likely in data warehousing applications. Often, data is loaded every day and therefore dates can be highly correlated. Extending BRIN indexes BRIN supports the same strategies as a b-tree and therefore needs the same set of operators. The code can be reused nicely: Operation Strategy number Less  than 1 Less  than  or  equal 2 Equal 3 Greater  than  or  equal 4 Greater  than 5 The support functions needed by BRIN are as follows: Function Description Support function number opcInfo Provide internal information about the indexed columns 1 add_value Add an entry to an existing summary tuple 2 consistent Check whether a value matches a condition 3 union Calculate the union of two summary entries (minimum/maximum values) 4 Adding additional indexes Since PostgreSQL 9.6, there has been an easy way to deploy entirely new index types as extensions. This is pretty cool because if those index types provided by PostgreSQL are not enough, it is possible to add additional ones serving precisely your purpose. The instruction to do this is CREATE ACCESS  METHOD: test=# h CREATE ACCESS METHOD Command: CREATE ACCESS METHOD Description: define a new access method Syntax: CREATE ACCESS METHOD name TYPE access_method_type HANDLER handler_function Don't worry too much about this command—just in case you ever deploy your own index type, it will come as a ready-to-use extension. One of these extensions implements bloom filters. Bloom filters are probabilistic data structures. They sometimes return too many rows but never too few. Therefore, a bloom filter is a good method to pre-filter data. How does it work? A bloom filter is defined on a couple of columns. A bitmask is calculated based on the input values, which is then compared to your query. The upside of a bloom filter is that you can index as many columns as you want. The downside is that the entire bloom filter has to be read. Of course, the bloom filter is smaller than the underlying data and so it is, in many cases, very beneficial. To use bloom filters, just activate the extension, which is a part of the PostgreSQL contrib package: test=# CREATE EXTENSION bloom; CREATE EXTENSION As stated previously, the idea behind a bloom filter is that it allows you to index as many columns as you want. In many real-world applications, the challenge is to index many columns without knowing which combinations the user will actually need at runtime. In the case of a large table, it is totally impossible to create standard b-tree indexes on, say, 80 fields or more. A bloom filter might be an alternative in this case: test=# CREATE TABLE t_bloom (x1 int, x2 int, x3 int, x4 int, x5 int, x6 int, x7 int); CREATE TABLE Creating the index is easy: test=# CREATE INDEX idx_bloom ON t_bloom USING bloom(x1, x2, x3, x4, x5, x6, x7); CREATE INDEX If sequential scans are turned off, the index can be seen in action: test=# SET enable_seqscan TO off; SET test=# explain SELECT * FROM t_bloom WHERE x5 = 9 AND x3 = 7; QUERY PLAN ------------------------------------------------------------------------- Bitmap Heap Scan on t_bloom (cost=18.50..22.52 rows=1 width=28) Recheck Cond: ((x3 = 7) AND (x5 = 9)) -> Bitmap Index Scan on idx_bloom (cost=0.00..18.50 rows=1 width=0) Index Cond: ((x3 = 7) AND (x5 = 9)) Note that I have queried a combination of random columns; they are not related to the actual order in the index. The bloom filter will still be beneficial. If you are interested in bloom filters, consider checking out the website: https://en.wikipedia.org/wiki/Bloom_filter. We learnt how to use the indexing features in PostgreSQL and fine-tune the performance of our queries. If you liked our article, check out the book Mastering  PostgreSQL 10 to implement advanced administrative tasks such as server maintenance and monitoring, replication, recovery, high availability, etc in PostgreSQL 10.  
Read more
  • 0
  • 0
  • 10552

article-image-perform-regression-analysis-using-sas
Gebin George
27 Feb 2018
7 min read
Save for later

How to perform regression analysis using SAS

Gebin George
27 Feb 2018
7 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Big Data Analysis with SAS written by David Pope. This book will help you leverage the power of SAS for data management, analysis and reporting. It contains practical use-cases and real-world examples on predictive modelling, forecasting, optimizing, and reporting your Big Data analysis using SAS.[/box] Today, we will perform regression analysis using SAS in a step-by-step manner with a practical use-case. Regression analysis is one of the earliest predictive techniques most people learn because it can be applied across a wide variety of problems dealing with data that is related in linear and non-linear ways. Linear data is one of the easier use cases, and as such PROC REG is a well-known and often-used procedure to help predict likely outcomes before they happen. The REG procedure provides extensive capabilities for fitting linear regression models that involve individual numeric independent variables. Many other procedures can also fit regression models, but they focus on more specialized forms of regression, such as robust regression, generalized linear regression, nonlinear regression, nonparametric regression, quantile regression, regression modeling of survey data, regression modeling of survival data, and regression modeling of transformed variables. The SAS/STAT procedures that can fit regression models include the ADAPTIVEREG, CATMOD, GAM, GENMOD, GLIMMIX, GLM, GLMSELECT, LIFEREG, LOESS, LOGISTIC, MIXED, NLIN, NLMIXED, ORTHOREG, PHREG, PLS, PROBIT, QUANTREG, QUANTSELECT, REG, ROBUSTREG, RSREG, SURVEYLOGISTIC, SURVEYPHREG, SURVEYREG, TPSPLINE, and TRANSREG procedures. Several procedures in SAS/ETS software also fit regression models. SAS/STAT14.2 / SAS/STAT User's Guide - Introduction to Regression Procedures - Overview: Regression Procedures (http://documentation.sas.com/?cdcId=statcdccdcVersion=14.2 docsetId=statugdocsetTarget=statug_introreg_sect001.htmlocale=enshowBanner=yes). Regression analysis attempts to model the relationship between a response or output variable and a set of input variables. The response is considered the target variable or the variable that one is trying to predict, while the rest of the input variables make up parameters used as input into the algorithm. They are used to derive the predicted value for the response variable. PROC REG One of the easiest ways to determine if regression analysis is applicable to helping you answer a question is if the type of question being asked has only two answers. For example, should a bank lend an applicant money? Yes or no? This is known as a binary response, and as such, regression analysis can be applied to help determine the answer. In the following example, the reader will use the SASHELP.BASEBALL dataset to create a regression model to predict the value of a baseball player's salary. The SASHELP.BASEBALL dataset contains salary and performance information for Major League. Baseball players who played at least one game in both the 1986 and 1987 seasons, excluding pitchers. The salaries (Sports Illustrated, April 20, 1987) are for the 1987 season and the performance measures are from 1986 (Collier Books, The 1987 Baseball Encyclopedia Update). SAS/STAT® 14.2 / SAS/STAT User's Guide - Example 99: Modeling Salaries of Major League Baseball Players (http://documentation.sas.com/ ?cdcId= statcdc cdcVersion= 14.2 docsetId=statugdocsetTarget= statug_ reg_ examples01.htmlocale= en showBanner= yes). Let's first use PROC UNIVARIATE to learn something about this baseball data by submitting the following code: proc univariate data=sashelp.baseball; quit; While reviewing the results of the output, the reader will notice that the variance associated with logSalary, 0.79066, is much less than the variance associated with the actual target variable Salary, 203508. In this case, it makes better sense to attempt to predict the logSalary value of a player instead of Salary. Write the following code in a SAS Studio program section and submit it: proc reg data=sashelp.baseball; id name team league; model logSalary = nAtBat nHits nHome nRuns nRBI YrMajor CrAtBat CrHits CrHome CrRuns CrRbi; Quit; Notice that there are 59 observations as specified in the first output table with at least one of the input variables with missing values; as such those are not used in the development of the regression model. The Root Mean Squared Error (RMSE) and R-square are statistics that typically inform the analyst how good the model is in predicting the target. These range from 0 to 1.0 with higher values typically indicating a better model. The higher the Rsquared values typically indicate a better performing model but sometimes conditions or the data used to train the model over-fit and don't represent the true value of the prediction power of that particular model. Over-fitting can happen when an analyst doesn't have enough real-life data and chooses data or a sample of data that over-presents the target event, and therefore it will produce a poor performing model when using real-world data as input. Since several of the input values appear to have little predictive power on the target, an analyst may decide to drop these variables, thereby reducing the need for that information to make a decent prediction. In this case, it appears we only need to use four input variables. YrMajor, nHits, nRuns, and nAtBat. Modify the code as follows and submit it again: proc reg data=sashelp.baseball; id name team league; model logSalary = YrMajor nHits nRuns nAtBat; Quit; The p-value associated with each of the input variables provides the analyst with an insight into which variables have the biggest impact on helping to predict the target variable. In this case, the smaller the value, the higher the predictive value of the input variable. Both the RMSE and R-square values for this second model are slightly lower than the original. However, the adjusted R-square value is slightly higher. In this case, an analyst may chose to use the second model since it requires much less data and provides basically the same predictive power. Prior to accepting any model, an analyst should determine whether there are a few observations that may be over-influencing the results by investigating the influence and fit diagnostics. The default output from PROC REG provides this type of visual insight: The top-right corner plot, showing the externally studentized residuals (RStudent) by leverage values, shows that there are a few observations with high leverage that may be overly influencing the fit produced. In order to investigate this further, we will add a plots statement to our PROC REG to produce a labeled version of this plot. Type the following code in a SAS Studio program section and submit: proc reg data=sashelp.baseball plots(only label)=(RStudentByLeverage); id name team league; model logSalary = YrMajor nHits nRuns nAtBat; Quit; Sure enough, there are three to five individuals whose input variables may have excessive influence on fitting this model. Let's remove those points and see if the model improves. Type this code in a SAS Studio program section and submit it: proc reg data=sashelp.baseball plots=(residuals(smooth)); where name NOT IN ("Mattingly, Don", "Henderson, Rickey", "Boggs, Wade", "Davis, Eric", "Rose, Pete"); id name team league; model logSalary = YrMajor nHits nRuns nAtBat; Quit; This change, in itself, has not improved the model but actually made the model worse as can be seen by the R-square, 0.5592. However, the plots residuals(smooth) option gives some insights as it pertains to YrMajor; players at the beginning and the end of their careers tend to be paid less compared to others, as can be seen in Figure 4.12: In order to address this lack of fit, an analyst can use polynomials of degree two for this variable, YrMajor. Type the following code in a SAS Studio program section and submit it: data work.baseball; set sashelp.baseball; where name NOT IN ("Mattingly, Don", "Henderson, Rickey", "Boggs, Wade", "Davis, Eric", "Rose, Pete"); YrMajor2 = YrMajor*YrMajor; run; proc reg data=work.baseball; id name team league; model logSalary = YrMajor YrMajor2 nHits nRuns nAtBat; Quit; After removing some outliers and adjusting for the YrMajor variable, the model's predictive power has improved significantly as can be seen in the much improved R-square value of 0.7149. We saw an effective way of performing regression analysis using SAS platform. If you found our post useful, do check out this book Big Data Analysis with SAS to understand other data analysis models and perform them practically using SAS.    
Read more
  • 0
  • 0
  • 6482

article-image-getting-started-with-the-confluent-platform-apache-kafka-for-enterprise
Amarabha Banerjee
27 Feb 2018
9 min read
Save for later

Getting started with the Confluent Platform: Apache Kafka for enterprise

Amarabha Banerjee
27 Feb 2018
9 min read
This article is a book excerpt from Apache Kafka 1.0 Cookbook written by Raúl Estrada. This book will show how to use Kafka efficiently with practical solutions to the common problems that developers and administrators usually face while working with it. In today’s tutorial, we will talk about the confluent platform and how to get started with organizing and managing data from several sources in one high-performance and reliable system. The Confluent Platform is a full stream data system. It enables you to organize and manage data from several sources in one high-performance and reliable system. As mentioned in the first few chapters, the goal of an enterprise service bus is not only to provide the system a means to transport messages and data but also to provide all the tools that are required to connect the data origins (data sources), applications, and data destinations (data sinks) to the platform. The Confluent Platform has these parts: Confluent Platform open source Confluent Platform enterprise Confluent Cloud The Confluent Platform open source has the following components: Apache Kafka core Kafka Streams Kafka Connect Kafka clients Kafka REST Proxy Kafka Schema Registry The Confluent Platform enterprise has the following components: Confluent Control Center Confluent support, professional services, and consulting All the components are open source except the Confluent Control Center, which is a proprietary of Confluent Inc. An explanation of each component is as follows: Kafka core: The Kafka brokers discussed at the moment in this book. Kafka Streams: The Kafka library used to build stream processing systems. Kafka Connect: The framework used to connect Kafka with databases, stores, and filesystems. Kafka clients: The libraries for writing/reading messages to/from Kafka. Note that there clients for these languages: Java, Scala, C/C++, Python, and Go. Kafka REST Proxy: If the application doesn't run in the Kafka clients' programming languages, this proxy allows connecting to Kafka through HTTP. Kafka Schema Registry: Recall that an enterprise service bus should have a message template repository. The Schema Registry is the repository of all the schemas and their historical versions, made to ensure that if an endpoint changes, then all the involved parts are acknowledged. Confluent Control Center: A powerful web graphic user interface for managing and monitoring Kafka systems. Confluent Cloud: Kafka as a service—a cloud service to reduce the burden of operations. Installing the Confluent Platform In order to use the REST proxy and the Schema Registry, we need to install the Confluent Platform. Also, the Confluent Platform has important administration, operation, and monitoring features fundamental for modern Kafka production systems. Getting ready At the time of writing this book, the Confluent Platform Version is 4.0.0. Currently, the supported operating systems are: Debian 8 Red Hat Enterprise Linux CentOS 6.8 or 7.2 Ubuntu 14.04 LTS and 16.04 LTS macOS currently is just supported for testing and development purposes, not for production environments. Windows is not yet supported. Oracle Java 1.7 or higher is required. The default ports for the components are: 2181: Apache ZooKeeper 8081: Schema Registry (REST API) 8082: Kafka REST Proxy 8083: Kafka Connect (REST API) 9021: Confluent Control Center 9092: Apache Kafka brokers It is important to have these ports, or the ports where the components are going to run, Open How to do it There are two ways to install: downloading the compressed files or with apt-get command. To install the compressed files: Download the Confluent open source v4.0 or Confluent Enterprise v4.0 TAR files from https://www.confluent.io/download/ Uncompress the archive file (the recommended path for installation is under /opt) To start the Confluent Platform, run this command: $ <confluent-path>/bin/confluent start The output should be as follows: Starting zookeeper zookeeper is [UP] Starting kafka kafka is [UP] Starting schema-registry schema-registry is [UP] Starting kafka-rest kafka-rest is [UP] Starting connect connect is [UP] To install with the apt-get command (in Debian and Ubuntu): Install the Confluent public key used to sign the packages in the APT repository: $ wget -qO - http://packages.confluent.io/deb/4.0/archive.key |sudo apt-key add - Add the repository to the sources list: $ sudo add-apt-repository "deb [arch=amd64] http://packages.confluent.io/deb/4.0 stable main" Finally, run the apt-get update to install the Confluent Platform To install Confluent open source: $ sudo apt-get update && sudo apt-get install confluent-platformoss- 2.11 To install Confluent Enterprise: $ sudo apt-get update && sudo apt-get install confluentplatform-2.11 The end of the package name specifies the Scala version. Currently, the supported versions are 2.11 (recommended) and 2.10. There's more The Confluent Platform provides the system and component packages. The commands in this recipe are for installing all components of the platform. To install individual components, follow the instructions on this page: https://docs.confluent.io/current/installation/available_packages.html#avaiIable-packages. Using Kafka operations With the Confluent Platform installed, the administration, operation, and monitoring of Kafka become very simple. Let's review how to operate Kafka with the Confluent Platform. Getting ready For this recipe, Confluent should be installed, up, and running. How to do it The commands in this section should be executed from the directory where the Confluent Platform is installed: To start ZooKeeper, Kafka, and the Schema Registry with one command, run: $ confluent start schema-registry The output of this command should be: Starting zookeeper zookeeper is [UP] Starting kafka kafka is [UP] Starting schema-registry schema-registry is [UP] To execute the commands outside the installation directory, add Confluent's bin directory to PATH: export PATH=<path_to_confluent>/bin:$PATH To manually start each service with its own command, run: $ ./bin/zookeeper-server-start ./etc/kafka/zookeeper.properties $ ./bin/kafka-server-start ./etc/kafka/server.properties $ ./bin/schema-registry-start ./etc/schema-registry/schemaregistry. properties Note that the syntax of all the commands is exactly the same as always but without the .sh extension. To create a topic called test_topic, run the following command: $ ./bin/kafka-topics --zookeeper localhost:2181 --create --topic test_topic --partitions 1 --replication-factor 1 To send an Avro message to test_topic in the broker without writing a single line of code, use the following command: $ ./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_topic --property value.schema='{"name":"person","type":"record", "fields":[{"name":"name","type":"string"},{"name":"age","type":"int "}]}' Send some messages and press Enter after each line: {"name": "Alice", "age": 27} {"name": "Bob", "age": 30} {"name": "Charles", "age":57} Enter with an empty line is interpreted as null. To shut down the process, press Ctrl + C. To consume the Avro messages from test_topic since the beginning, type: $ ./bin/kafka-avro-console-consumer --topic test_topic --zookeeper localhost:2181 --from-beginning The messages created in the previous step will be written to the console in the format they were introduced. To shut down the consumer, press Ctrl + C. To test the Avro schema validation, try to produce data on the same topic using an incompatible schema, for example, with this producer: $ ./bin/kafka-avro-console-producer --broker-list localhost:9092 --topic test_topic --property value.schema='{"type":"string"}' After you've hit Enter on the first message, the following exception is raised: org.apache.kafka.common.errors.SerializationException: Error registering Avro schema: "string" Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClient Exception: Schema being registered is incompatible with the latest schema; error code: 409 at io.confluent.kafka.schemaregistry.client.rest.utils.RestUtils.httpR equest(RestUtils.java:146) To shut down the services (Schema Registry, broker, and ZooKeeper) run: confluent stop To delete all the producer messages stored in the broker, run this: confluent destroy There's more With the Confluent Platform, it is possible to manage all of the Kafka system through the Kafka operations, which are classified as follows: Production deployment: Hardware configuration, file descriptors, and ZooKeeper configuration Post deployment: Admin operations, rolling restart, backup, and restoration Auto data balancing: Rebalancer execution and decommissioning brokers Monitoring: Metrics for each concept—broker, ZooKeeper, topics, producers, and consumers Metrics reporter: Message size, security, authentication, authorization, and verification Monitoring with the Confluent Control Center This recipe shows you how to use the metrics reporter of the Confluent Control Center. Getting ready The execution of the previous recipe is needed. Before starting the Control Center, configure the metrics reporter: Back up the server.properties file located at: <confluent_path>/etc/kafka/server.properties In the server.properties file, uncomment the following lines: metric.reporters=io.confluent.metrics.reporter.ConfluentMetricsRepo rter confluent.metrics.reporter.bootstrap.servers=localhost:9092 confluent.metrics.reporter.topic.replicas=1 Back up the Kafka Connect configuration located in: <confluent_path>/etc/schema-registry/connect-avrodistributed.properties Add the following lines at the end of the connect-avrodistributed.properties file: consumer.interceptor.classes=io.confluent.monitoring.clients.interc eptor.MonitoringConsumerInterceptor producer.interceptor.classes=io.confluent.monitoring.clients.interc eptor.MonitoringProducerInterceptor Start the Confluent Platform: $ <confluent_path>/bin/confluent start Before starting the Control Center, change its configuration: Back up the control-center.properties file located in: <confluent_path>/etc/confluent-control-center/controlcenter.properties Add the following lines at the end of the control-center.properties file: confluent.controlcenter.internal.topics.partitions=1 confluent.controlcenter.internal.topics.replication=1 confluent.controlcenter.command.topic.replication=1 confluent.monitoring.interceptor.topic.partitions=1 confluent.monitoring.interceptor.topic.replication=1 confluent.metrics.topic.partitions=1 confluent.metrics.topic.replication=1 Start the Control Center: <confluent_path>/bin/control-center-start How to do it Open the Control Center web graphic user interface at the following URL: http://localhost:9021/. The test_topic created in the previous recipe is needed: $ <confluent_path>/bin/kafka-topics --zookeeper localhost:2181 -- create --test_topic --partitions 1 --replication-factor 1 From the Control Center, click on the Kafka Connect button on the left. Click on the New source button: 4. From the connector class, drop down the menu and select SchemaSourceConnector. Specify Connection Name as Schema-Avro-Source. 5. In the topic name, specify test_topic. 6. Click on Continue, and then click on the Save & Finish button to apply the configuration. To create a new sink follow these steps: From Kafka Connect, click on the SINKS button and then on the New sink button: From the topics list, choose test_topic and click on the Continue button In the SINKS tab, set the connection class to SchemaSourceConnector; specify Connection Name as Schema-Avro-Source Click on the Continue button and then on Save & Finish to apply the new configuration How it works Click on the Data streams tab and a chart shows the total number of messages produced and consumed on the cluster: To summarize, we discussed how to get started with the Apache Kafka confluent platform. If you liked our post, please be sure to check out Apache Kafka 1.0 Cookbook  which consists of useful recipes to work with your Apache Kafka installation.
Read more
  • 0
  • 0
  • 8049
article-image-implementing-apache-spark-mllib-naive-bayes-to-classify-digital-breath-test-data-for-drunk-driving
Savia Lobo
27 Feb 2018
13 min read
Save for later

Implementing Apache Spark MLlib Naive Bayes to classify digital breath test data for drunk driving

Savia Lobo
27 Feb 2018
13 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from a book Mastering Apache Spark 2.x - Second Edition written by Romeo Kienzler. In this book, you will understand how memory management and binary processing, cache-aware computation, and code generation are used to speed up things  dramatically.[/box] This article provides a working example of the Apache Spark MLlib Naive Bayes algorithm on the Road Safety - Digital Breath Test Data 2013. It will describe the theory behind the algorithm and will provide a step-by-step example in Scala to show how the algorithm may be used. Theory on Classification In order to use the Naive Bayes algorithm to classify a dataset, data must be linearly divisible; that is, the classes within the data must be linearly divisible by class boundaries. The following figure visually explains this with three datasets and two class boundaries shown via the dotted lines: Naive Bayes assumes that the features (or dimensions) within a dataset are independent of one another; that is, they have no effect on each other. The following example considers the classification of e-mails as spam. If you have 100 e-mails, then perform the following: 60% of emails are spam 80% of spam emails contain the word buy 20% of spam emails don't contain the word buy 40% of emails are not spam 10% of non spam emails contain the word buy 90% of non spam emails don't contain the word buy Let's convert this example into conditional probabilities so that a Naive Bayes classifier can pick it up: P(Spam) = the probability that an email is spam = 0.6 P(Not Spam) = the probability that an email is not spam = 0.4 P(Buy|Spam) = the probability that an email that is spam has the word buy = 0.8 P(Buy|Not Spam) = the probability that an email that is not spam has the word buy = 0.1 What is the probability that an e-mail that contains the word buy is spam? Well, this would be written as P (Spam|Buy). Naive Bayes says that it is described by the equation in the following figure: So, using the previous percentage figures, we get the following: P(Spam|Buy) = ( 0.8 * 0.6 ) / (( 0.8 * 0.6 ) + ( 0.1 * 0.4 ) ) = ( .48 ) / ( .48 + .04 ) = .48 / .52 = .923 This means that it is 92 percent more likely that an e-mail that contains the word buy is spam. That was a look at the theory; now it's time to try a real-world example using the Apache Spark MLlib Naive Bayes algorithm. Naive Bayes in practice The first step is to choose some data that will be used for classification. We have chosen some data from the UK Government data website at http://data.gov.uk/dataset/road- accidents-safety-data. The dataset is called Road Safety - Digital Breath Test Data 2013, which downloads a zipped text file called DigitalBreathTestData2013.txt. This file contains around half a million rows. The data looks as follows: Reason,Month,Year,WeekType,TimeBand,BreathAlcohol,AgeBand,Gender Suspicion of Alcohol,Jan,2013,Weekday,12am-4am,75,30-39,Male Moving Traffic Violation,Jan,2013,Weekday,12am-4am,0,20-24,Male Road Traffic Collision,Jan,2013,Weekend,12pm-4pm,0,20-24,Female In order to classify the data, we have modified both the column layout and the number of columns. We have simply used Excel, given the data volume. However, if our data size had been in the big data range, we would have had to run some Scala code on top of Apache Spark for ETL (Extract Transform Load). As the following commands show, the data now resides in HDFS in the directory named /data/spark/nbayes. The file name is called DigitalBreathTestData2013- MALE2.csv. The line count from the Linux wc command shows that there are 467,000 rows. Finally, the following data sample shows that we have selected the columns, Gender, Reason, WeekType, TimeBand, BreathAlcohol, and AgeBand to classify. We will try to classify on the Gender column using the other columns as features: [hadoop@hc2nn ~]$ hdfs dfs -cat /data/spark/nbayes/DigitalBreathTestData2013-MALE2.csv | wc -l 467054 [hadoop@hc2nn ~]$ hdfs dfs -cat /data/spark/nbayes/DigitalBreathTestData2013-MALE2.csv | head -5 Male,Suspicion of Alcohol,Weekday,12am-4am,75,30-39 Male,Moving Traffic Violation,Weekday,12am-4am,0,20-24 Male,Suspicion of Alcohol,Weekend,4am-8am,12,40-49 Male,Suspicion of Alcohol,Weekday,12am-4am,0,50-59 Female,Road Traffic Collision,Weekend,12pm-4pm,0,20-24 The Apache Spark MLlib classification function uses a data structure called LabeledPoint, which is a general purpose data representation defined at http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint and https://spark.apache.org/docs/latest/mllib-data-types.html#labeled-point. This structure only accepts double values, which means that the text values in the previous data need to be classified numerically. Luckily, all of the columns in the data will convert to numeric categories, and we have provided a program in the software package with this book under the chapter2naive bayes directory to do just that. It is called convert.scala. It takes the contents of the DigitalBreathTestData2013- MALE2.csv file and converts each record into a double vector. The directory structure and files for an sbt Scala-based development environment have already been described earlier. We are developing our Scala code on the Linux server using the Linux account, Hadoop. Next, the Linux pwd and ls commands show our top-level nbayes development directory with the bayes.sbt configuration file, whose contents have already been examined: [hadoop@hc2nn nbayes]$ pwd /home/hadoop/spark/nbayes [hadoop@hc2nn nbayes]$ ls bayes.sbt target   project   src The Scala code to run the Naive Bayes example is in the src/main/scala subdirectory under the nbayes directory: [hadoop@hc2nn scala]$ pwd /home/hadoop/spark/nbayes/src/main/scala [hadoop@hc2nn scala]$ ls bayes1.scala convert.scala We will examine the bayes1.scala file later, but first, the text-based data on HDFS must be converted into numeric double values. This is where the convert.scala file is used. The code is as follows: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf These lines import classes for the Spark context, the connection to the Apache Spark cluster, and the Spark configuration. The object that is being created is called convert1. It is an application as it extends the App class: object convert1 extends App { The next line creates a function called enumerateCsvRecord. It has a parameter called colData, which is an array of Strings and returns String: def enumerateCsvRecord( colData:Array[String]): String = { The function then enumerates the text values in each column, so, for instance, Male becomes 0. These numeric values are stored in values such as colVal1: val colVal1 = colData(0) match { case "Male"                                                          => 0 case "Female"                                                     => 1 case "Unknown"                                                   => 2 case _                                                                     => 99 } val colVal2 = colData(1) match { case "Moving Traffic Violation"           => 0 case "Other"     => 1 case "Road Traffic Collision"                => 2 case "Suspicion of Alcohol"                                                                 => 3 case _                                                                     => 99 } val colVal3 = colData(2) match { case "Weekday"                                                     => 0 case "Weekend"                                                     => 0 case _                                                                        => 99 } val colVal4 = colData(3) match { case "12am-4am"                                                => 0 case "4am-8am"                                                   => 1 case "8am-12pm"                                                => 2 case "12pm-4pm"                                                => 3 case "4pm-8pm"                                                   => 4 case "8pm-12pm"                                                => 5 case _                                                                     => 99 } val colVal5 = colData(4) val colVal6 = colData(5) match { case "16-19"                                                          => 0 case "20-24"                                                          => 1 case "25-29"                                                          => 2 case "30-39"                                                          => 3 case "40-49"                                                          => 4 case "50-59"                                                          => 5 case "60-69"                                                          => 6 case "70-98"                                                          => 7 case "Other"                                                          => 8 case _                                                                        => 99 } Note: A comma-separated string called lineString is created from the numeric column        values and is then returned. The function closes with the final brace character. Note that the data line created next starts with a label value at column one and is followed by a vector, which represents the data. The vector is space-separated while the label is separated from the vector by a comma. Using these two separator types allows us to process both--the label and vector--in two simple steps: val lineString = colVal1+","+colVal2+" "+colVal3+" "+colVal4+" "+colVal5+" "+colVal6 return lineString } The main script defines the HDFS server name and path. It defines the input file and the output path in terms of these values. It uses the Spark URL and application name to create a new configuration. It then creates a new context or connection to Spark using these details: val hdfsServer = "hdfs://localhost:8020" val hdfsPath     = "/data/spark/nbayes/" val inDataFile = hdfsServer + hdfsPath + "DigitalBreathTestData2013- MALE2.csv" val outDataFile = hdfsServer + hdfsPath + "result" val sparkMaster = "spark://localhost:7077" val appName = "Convert 1" Val sparkConf = new SparkConf() sparkConf.setMaster(sparkMaster) sparkConf.setAppName(appName) val sparkCxt = new SparkContext(sparkConf) The CSV-based raw data file is loaded from HDFS using the Spark context textFile method. Then, a data row count is printed: val csvData = sparkCxt.textFile(inDataFile) println("Records in : "+ csvData.count() ) The CSV raw data is passed line by line to the enumerateCsvRecord function. The returned string-based numeric data is stored in the enumRddData variable: val enumRddData = csvData.map { csvLine => val colData = csvLine.split(',') enumerateCsvRecord(colData) } Finally, the number of records in the enumRddData variable is printed, and the enumerated data is saved to HDFS: println("Records out : "+ enumRddData.count() ) enumRddData.saveAsTextFile(outDataFile) } // end object In order to run this script as an application against Spark, it must be compiled. This is carried out with the sbt package command, which also compiles the code. The following command is run from the nbayes directory: [hadoop@hc2nn nbayes]$ sbt package Loading /usr/share/sbt/bin/sbt-launch-lib.bash .... [info] Done packaging. [success] Total time: 37 s, completed Feb 19, 2015 1:23:55 PM This causes the compiled classes that are created to be packaged into a JAR library, as shown here: [hadoop@hc2nn nbayes]$ pwd /home/hadoop/spark/nbayes [hadoop@hc2nn nbayes]$ ls -l target/scala-2.10 total 24 drwxrwxr-x 2 hadoop hadoop 4096 Feb 19 13:23 classes -rw-rw-r-- 1 hadoop hadoop 17609 Feb 19 13:23 naive-bayes_2.10-1.0.jar The convert1 application can now be run against Spark using the application name, Spark URL, and full path to the JAR file that was created. Some extra parameters specify memory and the maximum cores that are supposed to be used: spark-submit --class convert1 --master spark://localhost:7077 --executor-memory 700M --total-executor-cores 100 /home/hadoop/spark/nbayes/target/scala-2.10/naive-bayes_2.10-1.0.jar This creates a data directory on HDFS called /data/spark/nbayes/ followed by the result, which contains part files with the processed data: [hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/nbayes Found 2 items -rw-r--r--   3 hadoop supergroup 24645166 2015-01-29 21:27 /data/spark/nbayes/DigitalBreathTestData2013-MALE2.csv drwxr-xr-x   - hadoop supergroup       0 2015-02-19 13:36 /data/spark/nbayes/result [hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/nbayes/result Found 3 items -rw-r--r--   3 hadoop supergroup       0 2015-02-19 13:36 /data/spark/nbayes/result/_SUCCESS -rw-r--r--   3 hadoop supergroup 2828727 2015-02-19 13:36 /data/spark/nbayes/result/part-00000 -rw-r--r--   3 hadoop supergroup 2865499 2015-02-19 13:36 /data/spark/nbayes/result/part-00001 In the following HDFS cat command, we concatenated the part file data into a file called DigitalBreathTestData2013-MALE2a.csv. We then examined the top five lines of the file using the head command to show that it is numeric. Finally, we loaded it in HDFS with the put command: [hadoop@hc2nn nbayes]$ hdfs dfs -cat /data/spark/nbayes/result/part* > ./DigitalBreathTestData2013-MALE2a.csv 0,3 0 0 75 3 0,0 0 0 0 1 0,3 0 1 12 4 0,3 0 0 0 5 1,2 0 3 0 1 [hadoop@hc2nn nbayes]$ head -5 DigitalBreathTestData2013-MALE2a.csv [hadoop@hc2nn nbayes]$ hdfs dfs -put ./DigitalBreathTestData2013-MALE2a.csv /data/spark/nbayes The following HDFS ls command now shows the numeric data file stored on HDFS in the nbayes directory: [hadoop@hc2nn nbayes]$ hdfs dfs -ls /data/spark/nbayes Found 3 items -rw-r--r--   3 hadoop supergroup 24645166 2015-01-29 21:27 /data/spark/nbayes/DigitalBreathTestData2013-MALE2.csv -rw-r--r--   3 hadoop supergroup 5694226 2015-02-19 13:39 /data/spark/nbayes/DigitalBreathTestData2013-MALE2a.csv drwxr-xr-x - hadoop supergroup                           0 2015-02-19 13:36 /data/spark/nbayes/result Now that the data has been converted into a numeric form, it can be processed with the MLlib Naive Bayes algorithm; this is what the Scala file, bayes1.scala, does. This file imports the same configuration and context classes as before. It also imports MLlib classes for Naive Bayes, vectors, and the LabeledPoint structure. The application class that is created this time is called bayes1: import org.apache.spark.SparkContext import org.apache.spark.SparkContext._ import org.apache.spark.SparkConf import org.apache.spark.mllib.classification.NaiveBayes import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint object bayes1 extends App { The HDFS data file is again defined, and a Spark context is created as before: val hdfsServer = "hdfs://localhost:8020" val hdfsPath      = "/data/spark/nbayes/" val dataFile = hdfsServer+hdfsPath+"DigitalBreathTestData2013-MALE2a.csv" val sparkMaster = "spark://loclhost:7077" val appName = "Naive Bayes 1" val conf = new SparkConf() conf.setMaster(sparkMaster) conf.setAppName(appName) val sparkCxt = new SparkContext(conf) The raw CSV data is loaded and split by the separator characters. The first column becomes the label (Male/Female) that the data will be classified on. The final columns separated by spaces become the classification features: val csvData = sparkCxt.textFile(dataFile) val ArrayData = csvData.map { csvLine => val colData = csvLine.split(',') LabeledPoint(colData(0).toDouble, Vectors.dense(colData(1) .split('') .map(_.toDouble) ) ) } The data is then randomly divided into training (70%) and testing (30%) datasets: val divData = ArrayData.randomSplit(Array(0.7, 0.3), seed = 13L) val trainDataSet = divData(0) val testDataSet = divData(1) The Naive Bayes MLlib function can now be trained using the previous training set. The trained Naive Bayes model, held in the nbTrained variable, can then be used to predict the Male/Female result labels against the testing data: val nbTrained = NaiveBayes.train(trainDataSet) val nbPredict = nbTrained.predict(testDataSet.map(_.features)) Given that all of the data already contained labels, the original and predicted labels for the test data can be compared. An accuracy figure can then be computed to determine how accurate the predictions were, by comparing the original labels with the prediction values: val predictionAndLabel = nbPredict.zip(testDataSet.map(_.label)) val accuracy = 100.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / testDataSet.count() println( "Accuracy : " + accuracy ); } So, this explains the Scala Naive Bayes code example. It's now time to run the compiled bayes1 application using spark-submit and determine the classification accuracy. The parameters are the same. It's just the class name that has changed: spark-submit --class bayes1 --master spark://hc2nn.semtech-solutions.co.nz:7077 --executor-memory 700M --total-executor-cores 100 /home/hadoop/spark/nbayes/target/scala-2.10/naive-bayes_2.10-1.0.jar The resulting accuracy given by the Spark cluster is just 43 percent, which seems to imply that this data is not suitable for Naive Bayes: Accuracy: 43.30 We have seen how with the help of Apache Spark MLib, one can perform a successful classification on Naive Bayes algorithm. If you found this post useful, do check  out this book Mastering Apache Spark 2.x - Second Edition to learn about the latest enhancements to Apache Spark 2.x, such as interactive querying of live data and unifying DataFrames and Datasets.  
Read more
  • 0
  • 0
  • 3504

article-image-how-sql-server-handles-data-under-the-hood
Sunith Shetty
27 Feb 2018
11 min read
Save for later

How SQL Server handles data under the hood

Sunith Shetty
27 Feb 2018
11 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by Marek Chmel and Vladimír Mužný titled SQL Server 2017 Administrator's Guide. In this book, you will learn the required skills needed to successfully create, design, and deploy database using SQL Server 2017.[/box] Today, we will explore how SQL Server handles data as it is of utmost importance to get an understanding of what, when, and why data should be backed. Data structures and transaction logging We can think about a database as of physical database structure consisting of tables and indexes. However, this is just a human point of view. From the SQL Server's perspective, a database is a set of precisely structured files described in a form of metadata also saved in database structures. A conceptual imagination of how every database works is very helpful when the database has to be backed up correctly. How data is stored Every database on SQL Server must have at least two files: The primary data file with the usual suffix, mdf The transaction log file with the usual suffix, ldf For lots of databases, this minimal set of files is not enough. When the database contains big amounts of data such as historical tables, or the database has big data contention such as production tracking systems, it's good practise to design more data files. Another situation when a basic set of files is not sufficient can arise when documents or pictures would be saved along with relational data. However, SQL Server still is able to store all of our data in the basic file set, but it can lead to a performance bottlenecks and management issues. That's why we need to know all possible storage types useful for different scenarios of deployment. A complete structure of files is depicted in the following image: Database A relational database is defined as a complex data type consisting of tables with a given amount of columns, and each column has its domain that is actually a data type (such as an integer or a date) optionally complemented by some constraints. From SQL Server's perspective, the database is a record written in metadata and containing the name of the database, properties of the database, and names and locations of all files or folders representing storage for the database. This is the same for user databases as well as for system databases. System databases are created automatically during SQL Server installation and are crucial for correct running of SQL Server. We know five system databases. Database master Database master is crucial for the correct running of SQL Server service. In this database is stored data about logins, all databases and their files, instance configurations, linked servers, and so on. SQL Server finds this database at startup via two startup parameters, -d and -l, followed by paths to mdf and ldf files. These parameters are very important in situations when the administrator wants to move the master's files to a different location. Changing their values is possible in the SQL Server Configuration Manager in the SQL Server service Properties dialog on the tab called startup parameters. Database msdb The database msdb serves as the SQL Server Agent service, Database Mail, and Service Broker. In this database are stored job definitions, operators, and other objects needed for administration automation. This database also stores some logs such as backup and restore events of each database. If this database is corrupted or missing, SQL Server Agent cannot start. Database model Database model can be understood as a template for every new database while it is created. During a database creation (see the CREATE DATABASE statement on MSDN), files are created on defined paths and all objects, data and properties of database model are created, copied, and set into the new database during its creation. This database must always exist on the instance, because when it's corrupted, database tempdb can be created at instance start up! Database tempdb Even if database tempdb seems to be a regular database like many others, it plays a very special role in every SQL Server instance. This database is used by SQL Server itself as well as by developers to save temporary data such as table variables or static cursors. As this database is intended for a short lifespan (temporary data only, which can be stored during execution of stored procedure or until session is disconnected), SQL Server clears this database by truncating all data from it or by dropping and recreating this database every time when it's started. As the tempdb database will never contain durable data, it has some special internal behavior and it's the reason why accessing data in this database is several times faster than accessing durable data in other databases. If this database is corrupted, restart SQL Server. Database resourcedb The resourcedb is fifth in our enumeration and consists of definitions for all system objects of SQL Server, for example, sys.objects. This database is hidden and we don't need to care about it that much. It is not configurable and we don't use regular backup strategies for it. It is always placed in the installation path of SQL Server (to the binn directory) and it's backed up within the filesystem backup. In case of an accident, it is recovered as a part of the filesystem as well. Filegroup Filegroup is an organizational metadata object containing one or more data files. Filegroup does not have its own representation in the filesystem--it's just a group of files. When any database is created, a filegroup called primary is always created. This primary filegroup always contains the primary data file. Filegroups can be divided into the following: Row storage filegroups: These filegroup can contain data files (mdf or ndf). Filestream filegroups: This kind of filegroups can contain not files but folders to store binary data. In-memory filegroup: Only one instance of this kind of filegroup can be created in a database. Internally, it is a special case of filestream filegroup and it's used by SQL Server to persist data from in-memory tables. Every filegroup has three simple properties: Name: This is a descriptive name of the filegroup. The name must fulfill the naming convention criteria. Default: In a set of filegroups of the same type, one of these filegroups has this option set to on. This means that when a new table or index is created without explicitly specified to which filegroup it has to store data in, the default filegroup is used. By default, the primary filegroup is the default one. Read-only: Every filegroup, except the primary filegroup, could be set to read- only. Let's say that a filegroup is created for last year's history. When data is moved from the current period to tables created in this historical filegroup, the filegroup could be set as read-only, and later the filegroup cannot be backed up again and again. It is a very good approach to divide the database into smaller parts-- filegroups with more files. It helps in distributing data across more physical storage and also makes the database more manageable; backups can be done part by part in shorter times, which better fit into a service window. Data files Every database must have at least one data file called primary data file. This file is always bound to the primary filegroup. In this file is all the metadata of the database, such as structure descriptions (could be seen through views such as sys.objects, sys.columns, and others), users, and so on. If the database does not have other data files (in the same or other filegroups), all user data is also stored in this file, but this approach is good enough just for smaller databases. Considering how the volume of data in the database grows over time, it is a good practice to add more data files. These files are called secondary data files. Secondary data files are optional and contain user data only. Both types of data files have the same internal structure. Every file is divided into 8 KB small parts called data pages. SQL Server maintains several types of data pages such as data, data pages, index pages, index allocation maps (IAM) pages to locate data pages of tables or indexes, global allocation map (GAM) and shared global allocation maps (SGAM) pages to address objects in the database, and so on. Regardless of the type of a certain data page, SQL Server uses a data page as the smallest unit of I/O operations between hard disk and memory. Let's describe some common properties: A data page never contains data of several objects Data pages don't know each other (and that's why SQL Server uses IAMs to allocate all pages of an object) Data pages don't have any special physical ordering A data row must always fit in size to a data page These properties could seem to be useless but we have to keep in mind that when we know these properties, we can better optimize and manage our databases. Did you know that a data page is the smallest storage unit that can be restored from backup? As a data page is quite a small storage unit, SQL Server groups data pages into bigger logical units called extents. An extent is a logical allocation unit containing eight coherent data pages. When SQL Server requests data from disk, extents are read into memory. This is the reason why 64 KB NTFS clusters are recommended to format disk volumes for data files. Extents could be uniform or mixed. Uniform extent is a kind of extent containing data pages belonging to one object only; on the other hand, a mixed extent contains data pages of several objects. Transaction log When SQL Server processes any transaction, it works in a way called two-phase commit. When a client starts a transaction by sending a single DML request or by calling the BEGIN TRAN command, SQL Server requests data pages from disk to memory called buffer cache and makes the requested changes in these data pages in memory. When the DML request is fulfilled or the COMMIT command comes from the client, the first phase of the commit is finished, but data pages in memory differ from their original versions in a data file on disk. The data page in memory is in a state called dirty. When a transaction runs, a transaction log file is used by SQL Server for a very detailed chronological description of every single action done during the transaction. This description is called write-ahead-logging, shortly WAL, and is one of the oldest processes known on SQL Server. The second phase of the commit usually does not depend on the client's request and is an internal process called checkpoint. Checkpoint is a periodical action that: searches for dirty pages in buffer cache, saves dirty pages to their original data file location, marks these data pages as clean (or drops them out of memory to free memory space), marks the transaction as checkpoint or inactive in the transaction log. Write-ahead-logging is needed for SQL Server during recovery process. Recovery process is started on every database every time SQL Server service starts. When SQL Server service stops, some pages could remain in a dirty state and they are lost from memory. This can lead to two possible situations: The transaction is completely described in the transaction log, the new content of the data page is lost from memory, and data pages are not changed in the data file The transaction was not completed at the moment SQL Server stopped, so the transaction cannot be completely described in the transaction log as well, data pages in memory were not in a stable state (because the transaction was not finished and SQL Server cannot know if COMMIT or ROLLBACK will occur), and the original version of data pages in data files is intact SQL Server decides these two situations when it's starting. If a transaction is complete in the transaction log but was not marked as checkpoint, SQL Server executes this transaction again with both phases of COMMIT. If the transaction was not complete in the transaction log when SQL Server stopped, SQL Server will never know what was the user's intention with the transaction and the incomplete transaction is erased from the transaction log as if it had never started. The aforementioned described recovery process ensures that every database is in the last known consistent state after SQL Server's startup. It's crucial for DBAs to understand write-ahead-logging when planning a backup strategy because when restoring the database, the administrator has to recognize if it's time to run the recovery process or not. To summarize, we introduced internal data handling as it is important not only during performance backups and restores but also for optimizing a database. If you are interested to know more about how to backup, recover and secure SQL Server, do checkout this book SQL Server 2017 Administrator's Guide.  
Read more
  • 0
  • 0
  • 5137