
How-To Tutorials


ER Diagrams, Domain Model, and N-Layer Architecture with ASP.NET 3.5 (part1)

Packt
20 Oct 2009
11 min read
Let us start with a 1-tier ASP.NET application configuration. Note that the application as a whole, including the database and the client browser, is three-tier; we could call this 1-tier configuration a 3-tier architecture if we counted the browser and the database (if one is used) as separate tiers. For the rest of this article we will ignore the database and browser as separate tiers so that we can focus on how to divide the main ASP.NET application layers logically, putting the n-layer pattern to its best use. We will first try to separate the data access and logical code into their own layers and see how we can introduce flexibility and re-usability into our solution. We will understand this with a sample project. Before we go ahead into the technical details and code, we will first learn about two important terms: ER Diagram and Domain Model, and how they help us get a good understanding of the application we need to develop.

Entity-Relationship Diagram

Entity-Relationship diagrams, or ER diagrams for short, are graphical representations depicting the relationships between different entities in a system. We humans understand and remember pictures or images more easily than textual information. When we first start to understand a project, we need to see how the different entities in the project relate to each other. ER diagrams help us achieve that goal by describing the relationships graphically.

An entity can be thought of as an object in a system that can be identified uniquely. An entity can have attributes; an attribute is simply a property we can associate with an entity. For example, a Car entity can have the following attributes: EngineCapacity, NumberofGears, SeatingCapacity, Mileage, and so on. So attributes are basically fields holding data that identify an entity. Attributes cannot exist without an entity.

Let us understand ER diagrams in detail with a simple e-commerce example: a very basic Order Management System. We will be building a simple web-based system to track customers' orders, and to manage customers and products. To start with, let us list the basic entities for our simplified Order Management System (OMS):

Customer: A person who can place Orders to buy Products.
Order: An order placed by a Customer. There can be multiple Products bought by a Customer in one Order.
Product: A Product is an object that can be purchased by a Customer.
Category: The Category of a Product. A Category can have multiple Products, and a Product can belong to many Categories. For example, a mixer-grinder can be under the Electronic Gadgets category as well as in Home Appliances.
OrderLineItem: An Order can be for multiple Products. Each individual Product in an Order is encapsulated by an OrderLineItem, so an Order can have multiple OrderLineItems.

Now, let us see how the relationships between the core business entities are defined using an Entity-Relationship diagram. Our ER diagram will show the relational associations between the entities from a database's perspective. So it is more of a relational model and will not show any of the object-oriented associations (for which we will use the Domain Model in the later sections of this article). In an ER diagram, we show entities using rectangular boxes, relationships between entities using diamond boxes, and attributes using oval boxes, as shown below. The purpose of using such shapes is to make the ER diagram clear and concise, depicting the relational model as closely as possible without using long sentences or text.
So the Customer entity, with some of its basic attributes, can be depicted in an ER diagram as follows. Now, let us create an ER diagram for our Order Management System. For the sake of simplicity, we will not list the attributes of the entities involved. The resulting ER diagram depicts the relationships between the OMS entities but is still incomplete, because the relationships do not show how the entities are quantitatively related to each other. We will now look at how to quantify relationships using degree and cardinality.

Degree and Cardinality of a Relationship

The relationships in an ER diagram can also have a degree. The degree specifies the multiplicity of a relationship; in simpler terms, it refers to the number of entities involved in a relationship. All of the relationships in our OMS ER diagram have a degree of two, and are also called binary relationships. For example, in the Customer-Order relationship only two entities are involved (Customer and Order), so it is a relationship of degree two. Most relationships you come across will be binary.

Another term associated with a relationship is cardinality. The cardinality of a relationship identifies the number of instances of entities involved in that particular relationship. For example, an Order can have multiple OrderLineItems, which means the cardinality of the relationship between Order and OrderLineItem is one-to-many. The three commonly-used cardinalities of a relationship are:

One-to-one: Depicted as 1:1. Example: one OrderLineItem can have only one Product, so the OrderLineItem and Product entities share a one-to-one relationship.
One-to-many: Depicted as 1:n. Example: one Customer can place multiple Orders, so the Customer and Order entities share a one-to-many relationship.
Many-to-many: Depicted as n:m. Example: one Product can be included in multiple Categories and one Category can contain multiple Products, so the Product and Category entities share a many-to-many relationship.

After adding the cardinality of the relationships to our ER diagram, here is how it will look. This basic ER diagram tells us a lot about how the different entities in the system are related to each other, and can help new programmers quickly understand the logic and the relationships of the system they are working on. Each entity will be a unique table in the database.

OMS Project using 2-Layer

We know that the default coding style in ASP.NET 2.0 already supports the 1-tier 1-layer style, with two sub-layers in the main UI layer:

Designer code files: ASPX markup files
Code-behind files: Files containing C# or VB.NET code

Because both of these layers contain UI code, we can include them as a part of the UI layer. These two layers help us to separate the markup and the code from each other. However, it is still not advisable to have logical code, such as data access or business logic, directly in these code-behind files. Now, one way to create an ASP.NET web application for our Order Management System (OMS) in just one layer is by using a DataSet (or DataReader) to fill the front-end UI elements directly in the code-behind classes. This would involve writing data access code in the UI layer (code-behind), and would tightly bind this UI layer to the data access logic, making the application rigid (inflexible), harder to maintain, and less scalable. In order to have greater flexibility, and to keep the UI layer completely independent of the data access and business logic code, we need to put these elements in separate files.
So we will now try to introduce some loose coupling by following a 2-layer approach this time. We will write all data access code in separate class files instead of in the code-behind files of the UI layer. This will make the UI layer independent of the data access code. We are assuming that we do not have any specific business logic code at this point, or else we would have put that under another layer with its own namespace, making it a 3-layered architecture. We will examine this in the upcoming sections of this article.

Sample Project

Let us see how we can move from this 1-tier 1-layer style to a 1-tier 2-layer style. Using the ER diagram above as a reference, we can create a 2-layer architecture for our OMS with these layers:

UI layer with ASPX and code-behind classes
Data access classes under a different namespace but in the same project

So let's start with a new VS 2008 project. We will create a new ASP.NET Web Project in C#, and add a new web form, ProductList.aspx, which will simply display a list of all the products using a Repeater control. The purpose of this project is to show how we can logically break up the UI layer further by separating the data access code into another class file. The following is the ASPX markup of the ProductList page (unnecessary elements and tags have been removed to keep things simple):

<asp:Repeater ID="prodRepeater" runat="server">
    <ItemTemplate>
        Product Code: <%# Eval("Code")%> <br>
        Name: <%# Eval("Name")%> <br>
        Unit Price: $<%# Eval("UnitPrice")%> <br>
    </ItemTemplate>
</asp:Repeater>

In this ASPX file, we only have a Repeater control, which we will bind with the data in the code-behind file. Here is the code in the ProductList.aspx.cs code-behind file:

namespace OMS
{
    public partial class _Default : System.Web.UI.Page
    {
        /// <summary>
        /// Page Load method
        /// </summary>
        /// <param name="sender"></param>
        /// <param name="e"></param>
        protected void Page_Load(object sender, EventArgs e)
        {
            DataTable dt = DAL.GetAllProducts();
            prodRepeater.DataSource = dt;
            prodRepeater.DataBind();
        }
    } //end class
} //end namespace

Note that we don't have any data access code in the code-behind sample above. We are just calling the GetAllProducts() method, which has all of the data access code wrapped in a different class named DAL. We can logically separate out the code by using different namespaces to achieve code re-use and greater architectural flexibility. So we created a new class named DAL under a namespace different from that of the UI layer code files. Here is the DAL code:

namespace OMS.Code
{
    public class DAL
    {
        /// <summary>
        /// Load all products from the database
        /// </summary>
        public static DataTable GetAllProducts()
        {
            string sCon = ConfigurationManager.ConnectionStrings[0].ConnectionString;
            using (SqlConnection cn = new SqlConnection(sCon))
            {
                string sQuery = @"SELECT * FROM OMS_Product";
                SqlCommand cmd = new SqlCommand(sQuery, cn);
                SqlDataAdapter da = new SqlDataAdapter(cmd);
                DataSet ds = new DataSet();
                cn.Open();
                da.Fill(ds);
                return ds.Tables[0];
            }
        }
    } //end class
} //end namespace

So we have separated the data access code into a new logical layer, using a separate namespace, OMS.Code, and a new class. Now, if we want to, we can re-use the same code in other pages as well. Furthermore, methods to add and edit a product can be defined in this class and then used in the UI layer, as shown in the sketch below. This also allows multiple developers to work on the DAL and UI layers simultaneously.
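As an illustration of that point (this sketch is not from the original article), a method to add a product could sit in the same DAL class; the OMS_Product table and its Code, Name, and UnitPrice columns are assumptions based on the SELECT query and the Repeater bindings shown above:

// Hypothetical addition to the OMS.Code.DAL class shown above.
public static void AddProduct(string code, string name, decimal unitPrice)
{
    string sCon = ConfigurationManager.ConnectionStrings[0].ConnectionString;
    using (SqlConnection cn = new SqlConnection(sCon))
    {
        // Parameterized INSERT; column names are assumed from the UI bindings.
        string sQuery = "INSERT INTO OMS_Product (Code, Name, UnitPrice) " +
                        "VALUES (@Code, @Name, @UnitPrice)";
        SqlCommand cmd = new SqlCommand(sQuery, cn);
        cmd.Parameters.AddWithValue("@Code", code);
        cmd.Parameters.AddWithValue("@Name", name);
        cmd.Parameters.AddWithValue("@UnitPrice", unitPrice);
        cn.Open();
        cmd.ExecuteNonQuery();
    }
}

The UI layer could then call DAL.AddProduct(...) from, say, a button click handler, without knowing anything about the underlying SQL.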
Even though we have a logical separation of the code in this 2-layer sample architecture, we are still not using real Object-Oriented Programming (OOP). All of the object orientation we have used so far is the default structure that the .NET Framework provides, such as the Page class, and so on. When a project grows in size as well as complexity, the 2-layer model discussed above can become cumbersome and cause scalability and flexibility issues, because we end up putting all of the business logic code in either the DAL or the UI layer. This business logic code includes business rules. For example, if a customer orders a certain number of products in one order, they get a certain level of discount. If we code such business rules in the UI layer, then whenever the rules change we need to change the UI as well. That is not ideal, especially in cases where we have multiple UIs for the same code, for example a normal web browser UI and a mobile-based UI. We also cannot put business logic code in the DAL, because the DAL should only contain data access code, which should not be mixed with any kind of business processing logic. In fact, the DAL should be quite "dumb": there should be no "logic" inside it, because it is mostly a utility layer that only needs to put data into, and pull data out of, a data store. To make our applications more scalable, and to reap the benefits of OOP, we need to create objects and wrap business behavior in their methods. This is where the Domain Model comes into the picture.
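To make the discount example above concrete, here is a minimal sketch (my own illustration, not code from the article) of how such a rule could live in a separate business logic layer instead of the UI or the DAL; the threshold and the discount rate are made-up values:

namespace OMS.BusinessLogic
{
    // Hypothetical business-logic class; the rule values below are assumptions
    // used purely for illustration.
    public class OrderCalculator
    {
        private const int BulkQuantityThreshold = 10;    // assumed business rule
        private const decimal BulkDiscountRate = 0.05m;  // 5% discount, for illustration

        /// <summary>
        /// Applies a discount when an order contains enough items.
        /// The UI layer calls this method; the DAL stays free of business rules.
        /// </summary>
        public decimal CalculateOrderTotal(decimal lineItemTotal, int totalQuantity)
        {
            if (totalQuantity >= BulkQuantityThreshold)
            {
                return lineItemTotal * (1 - BulkDiscountRate);
            }
            return lineItemTotal;
        }
    }
}

If the discount rule changes, only this class needs to be updated; neither the web UI nor a future mobile UI has to be touched.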


Data science and machine learning: what to learn in 2020

Richard Gall
19 Dec 2019
5 min read
It's hard to keep up with the pace of change in the data science and machine learning fields. And when you're under pressure to deliver projects, learning new skills and technologies might be the last thing on your mind. But if you don't have at least one eye on what you need to learn next, you run the risk of falling behind. In turn this means you miss out on new solutions and new opportunities to drive change: you might miss the chance to do things differently. That's why we want to make it easy for you with this quick list of what you need to watch out for and learn in 2020.

The growing TensorFlow ecosystem

TensorFlow remains the most popular deep learning framework in the world. With TensorFlow 2.0, the Google-based development team behind it have attempted to rectify a number of issues and improve overall performance. Most notably, some of the problems around usability have been addressed, which should help the project's continued growth and perhaps even lower the barrier to entry. Relatedly, TensorFlow.js is proving that the wider TensorFlow ecosystem is incredibly healthy. It will be interesting to see what projects emerge in 2020 - it might even bring JavaScript web developers into the machine learning fold. Explore Packt's huge range of TensorFlow eBooks and videos on the store.

PyTorch

PyTorch hasn't quite managed to topple TensorFlow from its perch, but it's nevertheless growing quickly. Easier to use and more accessible than TensorFlow, if you want to start building deep learning systems quickly, your best bet is probably to get started with PyTorch. Search PyTorch eBooks and videos on the Packt store.

End-to-end data analysis on the cloud

When it comes to data analysis, one of the most pressing issues is to speed up pipelines. This is, of course, notoriously difficult - even in organizations that do their best to be agile and fast, it's not uncommon to find that their data is fragmented and diffuse, with little alignment across teams. One of the opportunities for changing this is the cloud. When used effectively, cloud platforms can dramatically speed up analytics pipelines and make it much easier for data scientists and analysts to deliver insights quickly. This might mean that we need increased collaboration between data professionals, engineers, and architects, but if we're to really deliver on the data at our disposal, then this shift could be massive. Learn how to perform analytics on the cloud with Cloud Analytics with Microsoft Azure.

Data science strategy and leadership

While cloud might help to smooth some of the friction that exists in our organizations when it comes to data analytics, there's no substitute for strong and clear leadership. The split between the engineering side of data and the more scientific or interpretive aspect has been noted, which means that there is going to be a real demand for people that have a strong understanding of what data can do, what it shows, and what it means in terms of action. Indeed, the article just linked to also mentions that there is likely to be an increasing need for executive-level understanding. That means data scientists have the opportunity to take a more senior role inside their organizations, by either working closely with execs or even moving up to that level. Learn how to build and manage a data science team and initiative that delivers with Managing Data Science.
Going back to the algorithms

In the excitement about the opportunities of machine learning and artificial intelligence, it's possible that we've lost sight of some of the fundamentals: the algorithms. Indeed, given the conversation around algorithmic bias and unintended consequences, it certainly makes sense to place renewed attention on the algorithms that lie right at the center of our work. Even if you're not an experienced data analyst or data scientist - even if you're a beginner - it's just as important to dive deep into algorithms. This will give you a robust foundation for everything else you do. And while statistics and mathematics will feel a long way from the supposed sexiness of data science, carefully considering what role they play will ensure that the models you build are accurate and perform as they should. Get stuck into algorithms with Data Science Algorithms in a Week.

Computer vision and natural language processing

Computer vision and Natural Language Processing are two of the most exciting aspects of modern machine learning and artificial intelligence. Both can be used for analytics projects, but they also have applications in real-world digital products. Indeed, with augmented reality and conversational UI becoming more and more common, businesses need to be thinking very carefully about whether this could give them an edge in how they interact with customers. These sorts of innovations can be driven from many different departments - but technologists and data professionals should be seizing the opportunity to lead the way on how innovation can transform customer relationships. For more technology eBooks and videos to help you prepare for 2020, head to the Packt store.


First Steps with Selenium RC

Packt
23 Nov 2010
4 min read
Selenium 1.0 Testing Tools: Beginner's Guide

Important preliminary points

To complete the examples in this article you will need to make sure that you have at least the Java JRE installed. You can download it from http://java.sun.com. Selenium Remote Control has been written in Java to allow it to be cross-platform, so we can test on Mac, Linux, and Windows.

What is Selenium Remote Control

Selenium IDE only works with Firefox, so we have only been checking a small subsection of the browsers that our users prefer. We, as web developers and testers, know that unfortunately our users do not just use one browser. Some may use Internet Explorer, others may use Mozilla Firefox, and that is not to mention the growth of browsers such as Google Chrome and Opera. Selenium Remote Control was initially developed by Patrick Lightbody as a way to test all of these different web browsers without having to install Selenium Core on the web server. It was developed to act as a proxy between the application under test and the test scripts: Selenium Core is bundled with Selenium Remote Control instead of being installed on the server. This change to the way that Selenium tests are run allowed developers to interact with the proxy directly, giving developers and testers a chance to use one of the most prominent programming languages to send commands to the browser. Java and C# have been the main languages used by developers to create Selenium tests, because most web applications are created in one of those languages. We have seen language bindings for dynamic languages being created and supported as more developers move their web applications to those languages; Ruby and Python are the most popular languages that people are moving to. Using a programming language to write your tests, instead of using the HTML-style tests of Selenium IDE, allows you, as a developer or tester, to make your tests more robust and to take advantage of the setup and teardown functions that are common in most testing frameworks. Now that we understand how Selenium Remote Control works, let us have a look at setting it up.

Setting up Selenium Remote Control

Selenium Remote Control is required on all machines that will be used to run tests. It is good practice to limit the number of Selenium Remote Control instances to one per CPU core. This is because web applications are becoming more "chatty" as we use more AJAX in them. Limiting the Selenium instances to one per core makes sure that the browsers load cleanly and that Selenium runs as quickly as possible.

Time for action – setting up Selenium Remote Control

1. Download Selenium Remote Control from http://seleniumhq.org/download.
2. Extract the ZIP file.
3. Start a Command Prompt or a console window and navigate to where the ZIP file was extracted.
4. Run the command java -jar selenium-server-standalone.jar and the output should appear similar to the following screenshot:

What just happened?

We have successfully set up Selenium Remote Control. This is the proxy that our tests will communicate with: the language bindings send commands to Selenium Remote Control, which then passes them on to the relevant browser. It keeps track of browsers by attaching a unique ID to each browser session, and each command needs to include that ID in the request. Now that we have finished setting up Selenium Remote Control, we can have a look at running our first set of tests in a number of different browsers.
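For a sense of what a test that talks to this proxy looks like, here is a rough C# sketch (not part of the original article), assuming the Selenium RC .NET client library is referenced and the server is listening on its default port, 4444; the browser string and URL are placeholders:

using Selenium;

public class FirstRemoteControlTest
{
    public static void Main()
    {
        // Connect to the Selenium RC proxy started above; "*firefox" and the
        // URL are placeholder values - substitute your own.
        ISelenium selenium = new DefaultSelenium("localhost", 4444, "*firefox",
                                                 "http://www.example.com/");
        selenium.Start();   // launches a browser session via the proxy
        selenium.Open("/"); // each command carries the session ID for us
        System.Console.WriteLine(selenium.GetTitle());
        selenium.Stop();    // closes the browser and ends the session
    }
}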
Pop quiz – setting up Selenium Remote Control

Where can you download Selenium Remote Control from?
Once you have placed Selenium Remote Control somewhere accessible, how do you start Selenium Remote Control?

Running Selenium IDE tests with Selenium Remote Control

The tests that we have created with Selenium IDE so far have only been run against applications in Firefox. This means that the testing coverage you are offering is very limited. Users will use a number of different browsers to interact with your application. Browser and operating system combinations can mean that a developer or tester will have to run your tests more than nine times, to make sure that you cover all the popular browser and operating system combinations. Now let's have a look at running the IDE tests with Selenium Remote Control.


Relational Databases with SQLAlchemy

Packt
02 Nov 2015
28 min read
In this article by Matthew Copperwaite, author of the book Learning Flask Framework, we look at how relational databases are the bedrock upon which almost every modern web application is built. Learning to think about your application in terms of tables and relationships is one of the keys to a clean, well-designed project. We will be using SQLAlchemy, a powerful object relational mapper that allows us to abstract away the complexities of multiple database engines, to work with the database directly from within Python. In this article, we shall:

Present a brief overview of the benefits of using a relational database
Introduce SQLAlchemy, the Python SQL Toolkit and Object Relational Mapper
Configure our Flask application to use SQLAlchemy
Write a model class to represent blog entries
Learn how to save and retrieve blog entries from the database
Perform queries: sorting, filtering, and aggregation
Create schema migrations using Alembic

Why use a relational database?

Our application's database is much more than a simple record of things that we need to save for future retrieval. If all we needed to do was save and retrieve data, we could easily use flat text files. The fact is, though, that we want to be able to perform interesting queries on our data. What's more, we want to do this efficiently and without reinventing the wheel. While non-relational databases (sometimes known as NoSQL databases) are very popular and have their place in the world of the web, relational databases long ago solved the common problems of filtering, sorting, aggregating, and joining tabular data. Relational databases allow us to define sets of data in a structured way that maintains the consistency of our data. Using relational databases also gives us, the developers, the freedom to focus on the parts of our app that matter.

In addition to efficiently performing ad hoc queries, a relational database server will also do the following:

Ensure that our data conforms to the rules set forth in the schema
Allow multiple people to access the database concurrently, while at the same time guaranteeing the consistency of the underlying data
Ensure that data, once saved, is not lost even in the event of an application crash

Relational databases and SQL, the programming language used with relational databases, are topics worthy of an entire book. Because this book is devoted to teaching you how to build apps with Flask, I will show you how to use a tool that has been widely adopted by the Python community for working with databases, namely, SQLAlchemy. SQLAlchemy abstracts away many of the complications of writing SQL queries, but there is no substitute for a deep understanding of SQL and the relational model. For that reason, if you are new to SQL, I would recommend that you check out the colorful book Learn SQL the Hard Way by Zed Shaw, available online for free at http://sql.learncodethehardway.org/.

Introducing SQLAlchemy

SQLAlchemy is an extremely powerful library for working with relational databases in Python. Instead of writing SQL queries by hand, we can use normal Python objects to represent database tables and execute queries. There are a number of benefits to this approach, which are listed as follows:

Your application can be developed entirely in Python.
Subtle differences between database engines are abstracted away.
This allows you to do things such as use a lightweight database like SQLite for local development and testing, then switch to a database designed for high loads (such as PostgreSQL) in production.
Database errors are less common because there are now two layers between your application and the database server: the Python interpreter itself (which will catch the obvious syntax errors), and SQLAlchemy, which has well-defined APIs and its own layer of error-checking.
Your database code may become more efficient, thanks to SQLAlchemy's unit-of-work model, which helps reduce unnecessary round-trips to the database. SQLAlchemy also has facilities for efficiently pre-fetching related objects, known as eager loading.
Object Relational Mapping (ORM) makes your code more maintainable, an aspiration known as don't repeat yourself (DRY). Suppose you add a column to a model. With SQLAlchemy it will be available whenever you use that model. If, on the other hand, you had hand-written SQL queries strewn throughout your app, you would need to update each query, one at a time, to ensure that you were including the new column.
SQLAlchemy can help you avoid SQL injection vulnerabilities.
Excellent library support: there are a multitude of useful libraries that can work directly with your SQLAlchemy models to provide things like maintenance interfaces and RESTful APIs.

I hope you're excited after reading this list. If all the items in this list don't make sense to you right now, don't worry. Now that we have discussed some of the benefits of using SQLAlchemy, let's install it and start coding. If you'd like to learn more about SQLAlchemy, there is an article devoted entirely to its design in The Architecture of Open-Source Applications, available online for free at http://aosabook.org/en/sqlalchemy.html.

Installing SQLAlchemy

We will use pip to install SQLAlchemy into the blog app's virtualenv. To activate your virtualenv, change to the blog project directory and source the activate script as follows:

$ cd ~/projects/blog
$ source bin/activate
(blog) $ pip install sqlalchemy
Downloading/unpacking sqlalchemy
…
Successfully installed sqlalchemy
Cleaning up...

You can check if your installation succeeded by opening a Python interpreter and checking the SQLAlchemy version; note that your exact version number is likely to differ.

$ python
>>> import sqlalchemy
>>> sqlalchemy.__version__
'0.9.0b2'

Using SQLAlchemy in our Flask app

SQLAlchemy works very well with Flask on its own, but the author of Flask has released a special Flask extension named Flask-SQLAlchemy that provides helpers for many common tasks, and can save us from having to re-invent the wheel later on. Let's use pip to install this extension:

(blog) $ pip install flask-sqlalchemy
…
Successfully installed flask-sqlalchemy

Flask provides a standard interface for developers who are interested in building extensions. As the framework has grown in popularity, the number of high-quality extensions has increased. If you'd like to take a look at some of the more popular extensions, there is a curated list available on the Flask project website at http://flask.pocoo.org/extensions/.

Choosing a database engine

SQLAlchemy supports a multitude of popular database dialects, including SQLite, MySQL, and PostgreSQL. Depending on the database you would like to use, you may need to install an additional Python package containing a database driver. Listed next are several popular databases supported by SQLAlchemy and the corresponding pip-installable drivers.
Some databases have multiple driver options, so I have listed the most popular one first.

SQLite: not needed, part of the Python standard library since version 2.5
MySQL: MySQL-python, PyMySQL (pure Python), OurSQL
PostgreSQL: psycopg2
Firebird: fdb
Microsoft SQL Server: pymssql, PyODBC
Oracle: cx-Oracle

SQLite comes as standard with Python and does not require a separate server process, so it is perfect for getting up and running quickly. For simplicity in the examples that follow, I will demonstrate how to configure the blog app for use with SQLite. If you have a different database in mind that you would like to use for the blog project, feel free to use pip to install the necessary driver package at this time.

Connecting to the database

Using your favorite text editor, open the config.py module for our blog project (~/projects/blog/app/config.py). We are going to add an SQLAlchemy-specific setting to instruct Flask-SQLAlchemy how to connect to our database. The new line is the last one in the following:

class Configuration(object):
    APPLICATION_DIR = current_directory
    DEBUG = True
    SQLALCHEMY_DATABASE_URI = 'sqlite:///%s/blog.db' % APPLICATION_DIR

The SQLALCHEMY_DATABASE_URI is comprised of the following parts:

dialect+driver://username:password@host:port/database

Because SQLite databases are stored in local files, the only information we need to provide is the path to the database file. On the other hand, if you wanted to connect to PostgreSQL running locally, your URI might look something like this:

postgresql://postgres:secretpassword@localhost:5432/blog_db

If you're having trouble connecting to your database, try consulting the SQLAlchemy documentation on database URIs: http://docs.sqlalchemy.org/en/rel_0_9/core/engines.html

Now that we've specified how to connect to the database, let's create the object responsible for actually managing our database connections. This object is provided by the Flask-SQLAlchemy extension and is conveniently named SQLAlchemy. Open app.py and make the following additions:

from flask import Flask
from flask.ext.sqlalchemy import SQLAlchemy

from config import Configuration

app = Flask(__name__)
app.config.from_object(Configuration)
db = SQLAlchemy(app)

These changes instruct our Flask app, and in turn SQLAlchemy, how to communicate with our application's database. The next step will be to create a table for storing blog entries and, to do so, we will create our first model.

Creating the Entry model

A model is the data representation of a table of data that we want to store in the database. These models have attributes called columns that represent the individual data items. So, if we were creating a Person model, we might have columns for storing the first and last name, date of birth, home address, hair color, and so on. Since we are interested in creating a model to represent blog entries, we will have columns for things like the title and body content. Note that we don't say a People model or Entries model – models are singular even though they commonly represent many different objects. With SQLAlchemy, creating a model is as easy as defining a class and specifying a number of attributes assigned to that class. Let's start with a very basic model for our blog entries.
Create a new file named models.py in the blog project's app/ directory and enter the following code:

import datetime, re

from app import db

def slugify(s):
    return re.sub('[^\w]+', '-', s).lower()

class Entry(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(100))
    slug = db.Column(db.String(100), unique=True)
    body = db.Column(db.Text)
    created_timestamp = db.Column(db.DateTime, default=datetime.datetime.now)
    modified_timestamp = db.Column(
        db.DateTime,
        default=datetime.datetime.now,
        onupdate=datetime.datetime.now)

    def __init__(self, *args, **kwargs):
        super(Entry, self).__init__(*args, **kwargs)  # Call parent constructor.
        self.generate_slug()

    def generate_slug(self):
        self.slug = ''
        if self.title:
            self.slug = slugify(self.title)

    def __repr__(self):
        return '<Entry: %s>' % self.title

There is a lot going on, so let's start with the imports and work our way down. We begin by importing the standard library datetime and re modules. We will be using datetime to get the current date and time, and re to do some string manipulation. The next import statement brings in the db object that we created in app.py. As you recall, the db object is an instance of the SQLAlchemy class, which is a part of the Flask-SQLAlchemy extension. The db object provides access to the classes that we need to construct our Entry model, which is just a few lines ahead.

Before the Entry model, we define a helper function, slugify, which we will use to give our blog entries some nice URLs. The slugify function takes a string such as A post about Flask and uses a regular expression to turn it into something human-readable in a URL, returning a-post-about-flask.

Next is the Entry model. Our Entry model is a normal class that extends db.Model. By extending db.Model, our Entry class will inherit a variety of helpers which we'll use to query the database. The attributes of the Entry model are a simple mapping of the names and data that we wish to store in the database, and are listed as follows:

id: This is the primary key for our database table. This value is set for us automatically by the database when we create a new blog entry, usually an auto-incrementing number for each new entry. While we will not explicitly set this value, a primary key comes in handy when you want to refer one model to another.
title: The title for a blog entry, stored as a String column with a maximum length of 100.
slug: The URL-friendly representation of the title, stored as a String column with a maximum length of 100. This column also specifies unique=True, so that no two entries can share the same slug.
body: The actual content of the post, stored in a Text column. This differs from the String type of the Title and Slug as you can store as much text as you like in this field.
created_timestamp: The time a blog entry was created, stored in a DateTime column. We instruct SQLAlchemy to automatically populate this column with the current time by default when an entry is first saved.
modified_timestamp: The time a blog entry was last updated. SQLAlchemy will automatically update this column with the current time whenever we save an entry.

For short strings such as titles or names of things, the String column is appropriate, but when the text may be especially long it is better to use a Text column, as we did for the entry body. We've overridden the constructor for the class (__init__) so that, when a new model is created, it automatically sets the slug for us based on the title.
The last piece is the __repr__ method, which is used to generate a helpful representation of instances of our Entry class. The specific meaning of __repr__ is not important, but it allows you to reference the object that the program is working with when debugging.

A final bit of code needs to be added to main.py, the entry point to our application, to ensure that the models are imported. Add the highlighted changes to main.py as follows:

from app import app, db
import models
import views

if __name__ == '__main__':
    app.run()

Creating the Entry table

In order to start working with the Entry model, we first need to create a table for it in our database. Luckily, Flask-SQLAlchemy comes with a nice helper for doing just this. Create a new sub-folder named scripts in the blog project's app directory. Then create a file named create_db.py:

(blog) $ cd app/
(blog) $ mkdir scripts
(blog) $ touch scripts/create_db.py

Add the following code to the create_db.py module. This function will automatically look at all the code that we have written and create a new table in our database for the Entry model based on our models:

from main import db

if __name__ == '__main__':
    db.create_all()

Execute the script from inside the app/ directory. Make sure the virtualenv is active. If everything goes successfully, you should see no output.

(blog) $ python create_db.py
(blog) $

If you encounter errors while creating the database tables, make sure you are in the app directory, with the virtualenv activated, when you run the script. Next, ensure that there are no typos in your SQLALCHEMY_DATABASE_URI setting.

Working with the Entry model

Let's experiment with our new Entry model by saving a few blog entries. We will be doing this from the Python interactive shell. At this stage let's install IPython, a sophisticated shell with features like tab-completion (that the default Python shell lacks):

(blog) $ pip install ipython

Now check that we are in the app directory, then start the shell and create a couple of entries as follows:

(blog) $ ipython

In []: from models import *  # First things first, import our Entry model and db object.
In []: db  # What is db?
Out[]: <SQLAlchemy engine='sqlite:////home/charles/projects/blog/app/blog.db'>

If you are familiar with the normal Python shell but not IPython, things may look a little different at first. The main thing to be aware of is that In[] refers to the code you type in, and Out[] is the output of the commands you put into the shell. IPython has a neat feature that allows you to print detailed information about an object. This is done by typing in the object's name followed by a question mark (?). Introspecting the Entry model provides a bit of information, including the argument signature and the string representing that object (known as the docstring) of the constructor:

In []: Entry?  # What is Entry and how do we create it?
Type: _BoundDeclarativeMeta
String Form: <class 'models.Entry'>
File: /home/charles/projects/blog/app/models.py
Docstring: <no docstring>
Constructor information:
Definition: Entry(self, *args, **kwargs)

We can create Entry objects by passing column values in as keyword arguments. In the preceding example, it uses **kwargs; this is a shortcut for taking a dict object and using it as the values for defining the object, as shown next:

In []: first_entry = Entry(title='First entry', body='This is the body of my first entry.')

In order to save our first entry, we will add it to the database session.
The session is simply an object that represents our actions on the database. Even after adding the entry to the session, it will not be saved to the database yet. In order to save the entry to the database, we need to commit our session:

In []: db.session.add(first_entry)
In []: first_entry.id is None  # No primary key, the entry has not been saved.
Out[]: True
In []: db.session.commit()
In []: first_entry.id
Out[]: 1
In []: first_entry.created_timestamp
Out[]: datetime.datetime(2014, 1, 25, 9, 49, 53, 1337)

As you can see from the preceding code examples, once we commit the session, a unique id will be assigned to our first entry and the created_timestamp will be set to the current time. Congratulations, you've created your first blog entry! Try adding a few more on your own. You can add multiple entry objects to the same session before committing, so give that a try as well. At any point while you are experimenting, feel free to delete the blog.db file and re-run the create_db.py script to start over with a fresh database.

Making changes to an existing entry

In order to make changes to an existing Entry, simply make your edits and then commit. Let's retrieve our Entry using the id that was returned to us earlier, make some changes, and commit it. SQLAlchemy will know that it needs to be updated. Here is how you might make edits to the first entry:

In []: first_entry = Entry.query.get(1)
In []: first_entry.body = 'This is the first entry, and I have made some edits.'
In []: db.session.commit()

And just like that your changes are saved.

Deleting an entry

Deleting an entry is just as easy as creating one. Instead of calling db.session.add, we will call db.session.delete and pass in the Entry instance that we wish to remove:

In []: bad_entry = Entry(title='bad entry', body='This is a lousy entry.')
In []: db.session.add(bad_entry)
In []: db.session.commit()  # Save the bad entry to the database.
In []: db.session.delete(bad_entry)
In []: db.session.commit()  # The bad entry is now deleted from the database.

Retrieving blog entries

While creating, updating, and deleting are fairly straightforward operations, the real fun starts when we look at ways to retrieve our entries. We'll start with the basics, and then work our way up to more interesting queries. We will use a special attribute on our model class to make queries: Entry.query. This attribute exposes a variety of APIs for working with the collection of entries in the database. Let's simply retrieve a list of all the entries in the Entry table:

In []: entries = Entry.query.all()
In []: entries  # What are our entries?
Out[]: [<Entry u'First entry'>, <Entry u'Second entry'>, <Entry u'Third entry'>, <Entry u'Fourth entry'>]

As you can see, in this example the query returns a list of the Entry instances that we created. When no explicit ordering is specified, the entries are returned to us in an arbitrary order chosen by the database.
Let's specify that we want the entries returned to us in alphabetical order by title:

In []: Entry.query.order_by(Entry.title.asc()).all()
Out []: [<Entry u'First entry'>, <Entry u'Fourth entry'>, <Entry u'Second entry'>, <Entry u'Third entry'>]

Shown next is how you would list your entries in reverse-chronological order, based on when they were last updated:

In []: oldest_to_newest = Entry.query.order_by(Entry.modified_timestamp.desc()).all()
Out []: [<Entry: Fourth entry>, <Entry: Third entry>, <Entry: Second entry>, <Entry: First entry>]

Filtering the list of entries

It is very useful to be able to retrieve the entire collection of blog entries, but what if we want to filter the list? We could always retrieve the entire collection and then filter it in Python using a loop, but that would be very inefficient. Instead we will rely on the database to do the filtering for us, and simply specify the conditions for which entries should be returned. In the following example, we will specify that we want to filter by entries where the title equals 'First entry'.

In []: Entry.query.filter(Entry.title == 'First entry').all()
Out[]: [<Entry u'First entry'>]

If this seems somewhat magical to you, it's because it really is! SQLAlchemy uses operator overloading to convert expressions like <Model>.<column> == <some value> into an abstracted object called BinaryExpression. When you are ready to execute your query, these data structures are then translated into SQL. A BinaryExpression is simply an object that represents the logical comparison and is produced by overriding the standard methods that are typically called on an object when comparing values in Python.

In order to retrieve a single entry, you have two options: .first() and .one(). Their differences and similarities are summarized in the following list:

1 matching row: first() returns the object; one() returns the object.
0 matching rows: first() returns None; one() raises sqlalchemy.orm.exc.NoResultFound.
2+ matching rows: first() returns the first object (based on either explicit ordering or the ordering chosen by the database); one() raises sqlalchemy.orm.exc.MultipleResultsFound.

Let's try the same query as before, but instead of calling .all(), we will call .first() to retrieve a single Entry instance:

In []: Entry.query.filter(Entry.title == 'First entry').first()
Out[]: <Entry u'First entry'>

Notice how previously .all() returned a list containing the object, whereas .first() returned just the object itself.

Special lookups

In the previous example we tested for equality, but there are many other types of lookups possible. In the following list, I have included some that you may find useful. A complete list can be found in the SQLAlchemy documentation.

Entry.title == 'The title': Entries where the title is "The title", case-sensitive.
Entry.title != 'The title': Entries where the title is not "The title".
Entry.created_timestamp < datetime.date(2014, 1, 25): Entries created before January 25, 2014. For less than or equal, use <=.
Entry.created_timestamp > datetime.date(2014, 1, 25): Entries created after January 25, 2014. For greater than or equal, use >=.
Entry.body.contains('Python'): Entries where the body contains the word "Python", case-sensitive.
Entry.title.endswith('Python'): Entries where the title ends with the string "Python", case-sensitive. Note that this will also match titles that end with the word "CPython", for example.
Entry.title.startswith('Python'): Entries where the title starts with the string "Python", case-sensitive. Note that this will also match titles like "Pythonistas".
Entry.body.ilike('%python%'): Entries where the body contains the word "python" anywhere in the text, case-insensitive. The "%" character is a wild-card.
Entry.title.in_(['Title one', 'Title two']): Entries where the title is in the given list, either 'Title one' or 'Title two'.

Combining expressions

The expressions listed above can be combined using bitwise operators to produce arbitrarily complex expressions. Let's say we want to retrieve all blog entries that have the word Python or Flask in the title. To accomplish this, we will create two contains expressions, then combine them using Python's bitwise OR operator, which is a pipe (|) character, unlike a lot of other languages that use a double pipe (||):

Entry.query.filter(Entry.title.contains('Python') | Entry.title.contains('Flask'))

Using bitwise operators, we can come up with some pretty complex expressions. Try to figure out what the following example is asking for:

Entry.query.filter(
    (Entry.title.contains('Python') | Entry.title.contains('Flask')) &
    (Entry.created_timestamp > (datetime.date.today() - datetime.timedelta(days=30)))
)

As you probably guessed, this query returns all entries where the title contains either Python or Flask, and which were created within the last 30 days. We are using Python's bitwise OR and AND operators to combine the sub-expressions. For any query you produce, you can view the generated SQL by printing the query as follows:

In []: query = Entry.query.filter(
    (Entry.title.contains('Python') | Entry.title.contains('Flask')) &
    (Entry.created_timestamp > (datetime.date.today() - datetime.timedelta(days=30)))
)
In []: print str(query)
SELECT entry.id AS entry_id, ...
FROM entry
WHERE (
    (entry.title LIKE '%%' || :title_1 || '%%')
    OR (entry.title LIKE '%%' || :title_2 || '%%')
) AND entry.created_timestamp > :created_timestamp_1

Negation

There is one more piece to discuss, which is negation. If we wanted to get a list of all blog entries that did not contain Python or Flask in the title, how would we do that? SQLAlchemy provides two ways to create these types of expressions, using either Python's unary negation operator (~) or by calling db.not_(). This is how you would construct this query with SQLAlchemy:

Using unary negation:

In []: Entry.query.filter(~(Entry.title.contains('Python') | Entry.title.contains('Flask')))

Using db.not_():

In []: Entry.query.filter(db.not_(Entry.title.contains('Python') | Entry.title.contains('Flask')))

Operator precedence

Not all operations are considered equal by the Python interpreter. This is like in math class, where we learned that expressions like 2 + 3 * 4 are equal to 14 and not 20, because the multiplication operation occurs first. In Python, bitwise operators all have a higher precedence than things like equality tests, so this means that, when you are building your query expression, you have to pay attention to the parentheses. Let's look at some example Python expressions and see the corresponding results:

(Entry.title == 'Python' | Entry.title == 'Flask'): Wrong! SQLAlchemy throws an error because the first thing to be evaluated is actually 'Python' | Entry.title!
(Entry.title == 'Python') | (Entry.title == 'Flask'): Right. Returns entries where the title is either "Python" or "Flask".
~Entry.title == 'Python': Wrong! SQLAlchemy will turn this into a valid SQL query, but the results will not be meaningful.
~(Entry.title == 'Python'): Right. Returns entries where the title is not equal to "Python".
If you find yourself struggling with operator precedence, it's a safe bet to put parentheses around any comparison that uses ==, !=, <, <=, >, and >=.

Making changes to the schema

The final topic we will discuss in this article is how to make modifications to an existing model definition. From the project specification, we know we would like to be able to save drafts of our blog entries. Right now we don't have any way to tell whether an entry is a draft or not, so we will need to add a column that lets us store the status of our entry. Unfortunately, while db.create_all() works perfectly for creating tables, it will not automatically modify an existing table; to do this we need to use migrations.

Adding Flask-Migrate to our project

We will use Flask-Migrate to help us automatically update our database whenever we change the schema. In the blog virtualenv, install Flask-Migrate using pip:

(blog) $ pip install flask-migrate

The author of SQLAlchemy has a project called alembic; Flask-Migrate makes use of this and integrates it with Flask directly, making things easier. Next we will add a Migrate helper to our app. We will also create a script manager for our app. The script manager allows us to execute special commands within the context of our app, directly from the command line. We will be using the script manager to execute the migrate command. Open app.py and make the following additions:

from flask import Flask
from flask.ext.migrate import Migrate, MigrateCommand
from flask.ext.script import Manager
from flask.ext.sqlalchemy import SQLAlchemy

from config import Configuration

app = Flask(__name__)
app.config.from_object(Configuration)
db = SQLAlchemy(app)
migrate = Migrate(app, db)

manager = Manager(app)
manager.add_command('db', MigrateCommand)

In order to use the manager, we will add a new file named manage.py alongside app.py. Add the following code to manage.py:

from app import manager
from main import *

if __name__ == '__main__':
    manager.run()

This looks very similar to main.py, the key difference being that instead of calling app.run(), we are calling manager.run(). Django has a similar, although auto-generated, manage.py file that serves a similar function.

Creating the initial migration

Before we can start changing our schema, we need to create a record of its current state. To do this, run the following commands from inside your blog's app directory. The first command will create a migrations directory inside the app folder which will track the changes we make to our schema. The second command, db migrate, will create a snapshot of our current schema so that future changes can be compared to it.

(blog) $ python manage.py db init
Creating directory /home/charles/projects/blog/app/migrations ... done
...
(blog) $ python manage.py db migrate
INFO [alembic.migration] Context impl SQLiteImpl.
INFO [alembic.migration] Will assume non-transactional DDL.
Generating /home/charles/projects/blog/app/migrations/versions/535133f91f00_.py ... done

Finally, we will run db upgrade to run the migration, which will indicate to the migration system that everything is up-to-date:

(blog) $ python manage.py db upgrade
INFO [alembic.migration] Context impl SQLiteImpl.
INFO [alembic.migration] Will assume non-transactional DDL.
INFO [alembic.migration] Running upgrade None -> 535133f91f00, empty message

Adding a status column

Now that we have a snapshot of our current schema, we can start making changes.
We will be adding a new column named status, which will store an integer value corresponding to a particular status. Although there are only two statuses at the moment (PUBLIC and DRAFT), using an integer instead of a Boolean gives us the option to easily add more statuses in the future. Open models.py and make the following additions to the Entry model:

class Entry(db.Model):
    STATUS_PUBLIC = 0
    STATUS_DRAFT = 1

    id = db.Column(db.Integer, primary_key=True)
    title = db.Column(db.String(100))
    slug = db.Column(db.String(100), unique=True)
    body = db.Column(db.Text)
    status = db.Column(db.SmallInteger, default=STATUS_PUBLIC)
    created_timestamp = db.Column(db.DateTime, default=datetime.datetime.now)
    ...

From the command line, we will once again be running db migrate to generate the migration script. You can see from the command's output that it found our new column:

(blog) $ python manage.py db migrate
INFO [alembic.migration] Context impl SQLiteImpl.
INFO [alembic.migration] Will assume non-transactional DDL.
INFO [alembic.autogenerate.compare] Detected added column 'entry.status'
Generating /home/charles/projects/blog/app/migrations/versions/2c8e81936cad_.py ... done

Because we have blog entries in the database, we need to make a small modification to the auto-generated migration to ensure the statuses for the existing entries are initialized to the proper value. To do this, open up the migration file (mine is migrations/versions/2c8e81936cad_.py) and change the following line:

op.add_column('entry', sa.Column('status', sa.SmallInteger(), nullable=True))

The replacement of nullable=True with server_default='0' tells the migration script to not set the column to null by default, but instead to use 0:

op.add_column('entry', sa.Column('status', sa.SmallInteger(), server_default='0'))

Finally, run db upgrade to run the migration and create the status column:

(blog) $ python manage.py db upgrade
INFO [alembic.migration] Context impl SQLiteImpl.
INFO [alembic.migration] Will assume non-transactional DDL.
INFO [alembic.migration] Running upgrade 535133f91f00 -> 2c8e81936cad, empty message

Congratulations, your Entry model now has a status field!

Summary

By now you should be familiar with using SQLAlchemy to work with a relational database. We covered the benefits of using a relational database and an ORM, configured a Flask application to connect to a relational database, and created SQLAlchemy models. All this allowed us to create relationships between our data and perform queries. To top it off, we also used a migration tool to handle future database schema changes. We will set aside the interactive interpreter and start creating views to display blog entries in the web browser. We will put all our SQLAlchemy knowledge to work by creating interesting lists of blog entries, as well as a simple search feature. We will build a set of templates to make the blogging site visually appealing, and learn how to use the Jinja2 templating language to eliminate repetitive HTML coding.


Perform CRUD operations on MongoDB with PHP

Amey Varangaonkar
17 Mar 2018
6 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book Mastering MongoDB 3.x authored by Alex Giamas. This book covers the key concepts, and tips & tricks needed to build fault-tolerant applications in MongoDB. It gives you the power to become a true expert when it comes to the world’s most popular NoSQL database.[/box] In today’s tutorial, we will cover the CRUD (Create, Read, Update and Delete) operations using the popular PHP language with the official MongoDB driver. Create and delete operations To perform the create and delete operations, run the following code: $document = array( "isbn" => "401", "name" => "MongoDB and PHP" ); $result = $collection->insertOne($document); var_dump($result); This is the output: MongoDBInsertOneResult Object ( [writeResult:MongoDBInsertOneResult:private] => MongoDBDriverWriteResult Object ( [nInserted] => 1 [nMatched] => 0 [nModified] => 0 [nRemoved] => 0 [nUpserted] => 0 [upsertedIds] => Array ( ) [writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDBDriverWriteConcern Object ( ) ) [insertedId:MongoDBInsertOneResult:private] => MongoDBBSONObjectID Object ( [oid] => 5941ac50aabac9d16f6da142 ) [isAcknowledged:MongoDBInsertOneResult:private] => 1 ) The rather lengthy output contains all the information that we may need. We can get the ObjectId of the document inserted; the number of inserted, matched, modified, removed, and upserted documents by fields prefixed with n; and information about writeError or writeConcernError. There are also convenience methods in the $result object if we want to get the Information: $result->getInsertedCount(): To get the number of inserted objects $result->getInsertedId(): To get the ObjectId of the inserted document We can also use the ->insertMany() method to insert many documents at once, like this: $documentAlpha = array( "isbn" => "402", "name" => "MongoDB and PHP, 2nd Edition" ); $documentBeta = array( "isbn" => "403", "name" => "MongoDB and PHP, revisited" ); $result = $collection->insertMany([$documentAlpha, $documentBeta]); print_r($result); The result is: ( [writeResult:MongoDBInsertManyResult:private] => MongoDBDriverWriteResult Object ( [nInserted] => 2 [nMatched] => 0 [nModified] => 0 [nRemoved] => 0 [nUpserted] => 0 [upsertedIds] => Array ( ) [writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDBDriverWriteConcern Object ( ) ) [insertedIds:MongoDBInsertManyResult:private] => Array ( [0] => MongoDBBSONObjectID Object ( [oid] => 5941ae85aabac9d1d16c63a2 ) [1] => MongoDBBSONObjectID Object ( [oid] => 5941ae85aabac9d1d16c63a3 ) ) [isAcknowledged:MongoDBInsertManyResult:private] => 1 ) Again, $result->getInsertedCount() will return 2, whereas $result->getInsertedIds() will return an array with the two newly created ObjectIds: array(2) { [0]=> object(MongoDBBSONObjectID)#13 (1) { ["oid"]=> string(24) "5941ae85aabac9d1d16c63a2" } [1]=> object(MongoDBBSONObjectID)#14 (1) { ["oid"]=> string(24) "5941ae85aabac9d1d16c63a3" } } Deleting documents is similar to inserting but with the deleteOne() and deleteMany() methods; an example of deleteMany() is shown here: $deleteQuery = array( "isbn" => "401"); $deleteResult = $collection->deleteMany($deleteQuery); print_r($result); print($deleteResult->getDeletedCount()); Here is the output: MongoDBDeleteResult Object ( [writeResult:MongoDBDeleteResult:private] => MongoDBDriverWriteResult Object ( [nInserted] => 0 [nMatched] => 0 [nModified] => 0 [nRemoved] => 2 [nUpserted] => 0 [upsertedIds] => Array ( ) 
[writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDBDriverWriteConcern Object ( ) ) [isAcknowledged:MongoDBDeleteResult:private] => 1 ) 2 In this example, we used ->getDeletedCount() to get the number of affected documents, which is printed out in the last line of the output. Bulk write The new PHP driver supports the bulk write interface to minimize network calls to MongoDB: $manager = new MongoDBDriverManager('mongodb://localhost:27017'); $bulk = new MongoDBDriverBulkWrite(array("ordered" => true)); $bulk->insert(array( "isbn" => "401", "name" => "MongoDB and PHP" )); $bulk->insert(array( "isbn" => "402", "name" => "MongoDB and PHP, 2nd Edition" )); $bulk->update(array("isbn" => "402"), array('$set' => array("price" => 15))); $bulk->insert(array( "isbn" => "403", "name" => "MongoDB and PHP, revisited" )); $result = $manager->executeBulkWrite('mongo_book.books', $bulk); print_r($result); The result is: MongoDBDriverWriteResult Object ( [nInserted] => 3 [nMatched] => 1 [nModified] => 1 [nRemoved] => 0 [nUpserted] => 0 [upsertedIds] => Array ( ) [writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDBDriverWriteConcern Object ( ) ) In the preceding example, we executed two inserts, one update, and a third insert in an ordered fashion. The WriteResult object contains a total of three inserted documents and one modified document. The main difference compared to simple create/delete queries is that executeBulkWrite() is a method of the MongoDBDriverManager class, which we instantiate on the first line. Read operation Querying an interface is similar to inserting and deleting, with the findOne() and find() methods used to retrieve the first result or all results of a query: $document = $collection->findOne( array("isbn" => "101") ); $cursor = $collection->find( array( "name" => new MongoDBBSONRegex("mongo", "i") ) ); In the second example, we are using a regular expression to search for a key name with the value mongo (case-insensitive). Embedded documents can be queried using the . notation, as with the other languages that we examined earlier in this chapter: $cursor = $collection->find( array('meta.price' => 50) ); We do this to query for an embedded document price inside the meta key field. Similarly to Ruby and Python, in PHP we can query using comparison operators, like this: $cursor = $collection->find( array( 'price' => array('$gte'=> 60) ) ); Querying with multiple key-value pairs is an implicit AND, whereas queries using $or, $in, $nin, or AND ($and) combined with $or can be achieved with nested queries: $cursor = $collection->find( array( '$or' => array( array("price" => array( '$gte' => 60)), array("price" => array( '$lte' => 20)) ))); This finds documents that have price>=60 OR price<=20. Update operation Updating documents has a similar interface with the ->updateOne() OR ->updateMany() method. The first parameter is the query used to find documents and the second one will update our documents. We can use any of the update operators explained at the end of this chapter to update in place or specify a new document to completely replace the document in the query: $result = $collection->updateOne( array( "isbn" => "401"), array( '$set' => array( "price" => 39 ) ) ); We can use single quotes or double quotes for key names, but if we have special operators starting with $, we need to use single quotes. We can use array( "key" => "value" ) or ["key" => "value"]. We prefer the more explicit array() notation in this book. 
The ->getMatchedCount() and ->getModifiedCount() methods return the number of documents matched by the query part and the number actually modified by the update, respectively. If the new value is the same as the existing value of a document, it will not be counted as modified.

As we saw, PHP is a fairly easy and effective language and tool for performing efficient CRUD operations against MongoDB. If you are interested in more information on how to effectively handle data using MongoDB, you may check out the book Mastering MongoDB 3.x.
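The chapter repeatedly compares the PHP driver with its Ruby and Python counterparts; as an illustrative aside (not taken from the book), here is roughly how the update example above looks with pymongo. The connection string and the mongo_book.books namespace mirror the bulk write example earlier; adjust them to your own setup.

# Rough pymongo equivalent of the PHP updateOne() example above.
# Assumes a local MongoDB instance and the mongo_book.books namespace used earlier.
from pymongo import MongoClient

client = MongoClient('mongodb://localhost:27017')
books = client['mongo_book']['books']

# $set update, same as the PHP snippet.
result = books.update_one({'isbn': '401'}, {'$set': {'price': 39}})

# matched_count / modified_count mirror getMatchedCount() / getModifiedCount().
print(result.matched_count, result.modified_count)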

Dart with JavaScript

Packt
18 Nov 2014
12 min read
In this article by Sergey Akopkokhyants, author of Mastering Dart, we will combine the simplicity of jQuery and the power of Dart in a real example. (For more resources related to this topic, see here.) Integrating Dart with jQuery For demonstration purposes, we have created the js_proxy package to help the Dart code to communicate with jQuery. It is available on the pub manager at https://pub.dartlang.org/packages/js_proxy. This package is layered on dart:js and has a library of the same name and sole class JProxy. An instance of the JProxy class can be created via the generative constructor where we can specify the optional reference on the proxied JsObject: JProxy([this._object]); We can create an instance of JProxy with a named constructor and provide the name of the JavaScript object accessible through the dart:js context as follows: JProxy.fromContext(String name) { _object = js.context[name]; } The JProxy instance keeps the reference on the proxied JsObject class and makes all the manipulation on it, as shown in the following code: js.JsObject _object;    js.JsObject get object => _object; How to create a shortcut to jQuery? We can use JProxy to create a reference to jQuery via the context from the dart:js library as follows: var jquery = new JProxy.fromContext('jQuery'); Another very popular way is to use the dollar sign as a shortcut to the jQuery variable as shown in the following code: var $ = new JProxy.fromContext('jQuery'); Bear in mind that the original jQuery and $ variables from JavaScript are functions, so our variables reference to the JsFunction class. From now, jQuery lovers who moved to Dart have a chance to use both the syntax to work with selectors via parentheses. Why JProxy needs a method call? Usually, jQuery send a request to select HTML elements based on IDs, classes, types, attributes, and values of their attributes or their combination, and then performs some action on the results. We can use the basic syntax to pass the search criteria in the jQuery or $ function to select the HTML elements: $(selector) Dart has syntactic sugar method call that helps us to emulate a function and we can use the call method in the jQuery syntax. Dart knows nothing about the number of arguments passing through the function, so we use the fixed number of optional arguments in the call method. Through this method, we invoke the proxied function (because jquery and $ are functions) and returns results within JProxy: dynamic call([arg0 = null, arg1 = null, arg2 = null,    arg3 = null, arg4 = null, arg5 = null, arg6 = null,    arg7 = null, arg8 = null, arg9 = null]) { var args = []; if (arg0 != null) args.add(arg0); if (arg1 != null) args.add(arg1); if (arg2 != null) args.add(arg2); if (arg3 != null) args.add(arg3); if (arg4 != null) args.add(arg4); if (arg5 != null) args.add(arg5); if (arg6 != null) args.add(arg6); if (arg7 != null) args.add(arg7); if (arg8 != null) args.add(arg8); if (arg9 != null) args.add(arg9); return _proxify((_object as js.JsFunction).apply(args)); } How JProxy invokes jQuery? The JProxy class is a proxy to other classes, so it marks with the @proxy annotation. We override noSuchMethod intentionally to call the proxied methods and properties of jQuery when the methods or properties of the proxy are invoked. The logic flow in noSuchMethod is pretty straightforward. It invokes callMethod of the proxied JsObject when we invoke the method on proxy, or returns a value of property of the proxied object if we call the corresponding operation on proxy. 
The code is as follows: @override dynamic noSuchMethod(Invocation invocation) { if (invocation.isMethod) {    return _proxify(_object.callMethod(      symbolAsString(invocation.memberName),      _jsify(invocation.positionalArguments))); } else if (invocation.isGetter) {    return      _proxify(_object[symbolAsString(invocation.memberName)]); } else if (invocation.isSetter) {    throw new Exception('The setter feature was not implemented      yet.'); } return super.noSuchMethod(invocation); } As you might remember, all map or Iterable arguments must be converted to JsObject with the help of the jsify method. In our case, we call the _jsify method to check and convert passed arguments aligned with a called function, as shown in the following code: List _jsify(List params) { List res = []; params.forEach((item) {    if (item is Map || item is List) {      res.add(new js.JsObject.jsify(item));    } else {      res.add(item);    } }); return res; } Before return, the result must be passed through the _proxify function as follows: dynamic _proxify(value) {    return value is js.JsObject ? new JProxy(value) : value; } This function wraps all JsObject within a JProxy class and passes other values as it is. An example project Now create the jquery project, open the pubspec.yaml file, and add js_proxy to the dependencies. Open the jquery.html file and make the following changes: <!DOCTYPE html> <html> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1"> <title>jQuery</title> <link rel="stylesheet" href="jquery.css"> </head> <body> <h1>Jquery</h1> <p>I'm a paragraph</p> <p>Click on me to hide</p> <button>Click me</button> <div class="container"> <div class="box"></div> </div> </body> <script src="//code.jquery.com/jquery-1.11.0.min.js"></script> <script type="application/dart" src="jquery.dart"></script> <script src="packages/browser/dart.js"></script> </html> This project aims to demonstrate that: Communication is easy between Dart and JavaScript The syntax of the Dart code could be similar to the jQuery code In general, you may copy the JavaScript code, paste it in the Dart code, and probably make slightly small changes. How to get the jQuery version? It's time to add js_proxy in our code. Open jquery.dart and make the following changes: import 'dart:html'; import 'package:js_proxy/js_proxy.dart'; /** * Shortcut for jQuery. */ var $ = new JProxy.fromContext('jQuery'); /** * Shortcut for browser console object. */ var console = window.console; main() { printVersion(); } /** * jQuery code: * *   var ver = $().jquery; *   console.log("jQuery version is " + ver); * * JS_Proxy based analog: */ printVersion() { var ver = $().jquery; console.log("jQuery version is " + ver); } You should be familiar with jQuery and console shortcuts yet. The call to jQuery with empty parentheses returns JProxy and contains JsObject with reference to jQuery from JavaScript. The jQuery object has a jQuery property that contains the current version number, so we reach this one via noSuchMethod of JProxy. Run the application, and you will see the following result in the console: jQuery version is 1.11.1 Let's move on and perform some actions on the selected HTML elements. How to perform actions in jQuery? 
The syntax of jQuery is based on selecting the HTML elements and it also performs some actions on them: $(selector).action(); Let's select a button on the HTML page and fire the click event as shown in the following code: /** * jQuery code: * *   $("button").click(function(){ *     alert('You click on button'); *   }); * * JS_Proxy based analog: */ events() { // We remove 'function' and add 'event' here $("button").click((event) {    // Call method 'alert' of 'window'    window.alert('You click on button'); }); } All we need to do here is just remove the function keyword, because anonymous functions on Dart do not use it and add the event parameter. This is because this argument is required in the Dart version of the event listener. The code calls jQuery to find all the HTML button elements to add the click event listener to each of them. So when we click on any button, a specified alert message will be displayed. On running the application, you will see the following message: How to use effects in jQuery? The jQuery supports animation out of the box, so it sounds very tempting to use it from Dart. Let's take an example of the following code snippet: /** * jQuery code: * *   $("p").click(function() { *     this.hide("slow",function(){ *       alert("The paragraph is now hidden"); *     }); *   }); *   $(".box").click(function(){ *     var box = this; *     startAnimation(); *     function startAnimation(){ *       box.animate({height:300},"slow"); *       box.animate({width:300},"slow"); *       box.css("background-color","blue"); *       box.animate({height:100},"slow"); *       box.animate({width:100},"slow",startAnimation); *     } *   }); * * JS_Proxy based analog: */ effects() { $("p").click((event) {    $(event['target']).hide("slow",(){      window.alert("The paragraph is now hidden");    }); }); $(".box").click((event) {    var box = $(event['target']);    startAnimation() {      box.animate({'height':300},"slow");      box.animate({'width':300},"slow");      box.css("background-color","blue");      box.animate({'height':100},"slow");      box.animate({'width':100},"slow",startAnimation);    };    startAnimation(); }); } This code finds all the paragraphs on the web page to add a click event listener to each one. The JavaScript code uses the this keyword as a reference to the selected paragraph to start the hiding animation. The this keyword has a different notion on JavaScript and Dart, so we cannot use it directly in anonymous functions on Dart. The target property of event keeps the reference to the clicked element and presents JsObject in Dart. We wrap the clicked element to return a JProxy instance and use it to call the hide method. The jQuery is big enough and we have no space in this article to discover all its features, but you can find more examples at https://github.com/akserg/js_proxy. What are the performance impacts? Now, we should talk about the performance impacts of using different approaches across several modern web browsers. 
The algorithm must perform all the following actions: It should create 10000 DIV elements Each element should be added into the same DIV container Each element should be updated with one style All elements must be removed one by one This algorithm must be implemented in the following solutions: The clear jQuery solution on JavaScript The jQuery solution calling via JProxy and dart:js from Dart The clear Dart solution based on dart:html We implemented this algorithm on all of them, so we have a chance to compare the results and choose the champion. The following HTML code has three buttons to run independent tests, three paragraph elements to show the results of the tests, and one DIV element used as a container. The code is as follows: <div>  <button id="run_js" onclick="run_js_test()">Run JS</button> <button id="run_jproxy">Run JProxy</button> <button id="run_dart">Run Dart</button> </div>   <p id="result_js"></p> <p id="result_jproxy"></p> <p id="result_dart"></p> <div id="container"></div> The JavaScript code based on jQuery is as follows: function run_js_test() { var startTime = new Date(); process_js(); var diff = new Date(new Date().getTime() –    startTime.getTime()).getTime(); $('#result_js').text('jQuery tooks ' + diff +    ' ms to process 10000 HTML elements.'); }     function process_js() { var container = $('#container'); // Create 10000 DIV elements for (var i = 0; i < 10000; i++) {    $('<div>Test</div>').appendTo(container); } // Find and update classes of all DIV elements $('#container > div').css("color","red"); // Remove all DIV elements $('#container > div').remove(); } The main code registers the click event listeners and the call function run_dart_js_test. The first parameter of this function must be investigated. The second and third parameters are used to pass the selector of the result element and test the title: void main() { querySelector('#run_jproxy').onClick.listen((event) {    run_dart_js_test(process_jproxy, '#result_jproxy', 'JProxy'); }); querySelector('#run_dart').onClick.listen((event) {    run_dart_js_test(process_dart, '#result_dart', 'Dart'); }); } run_dart_js_test(Function fun, String el, String title) { var startTime = new DateTime.now(); fun(); var diff = new DateTime.now().difference(startTime); querySelector(el).text = '$title tooks ${diff.inMilliseconds} ms to process 10000 HTML elements.'; } Here is the Dart solution based on JProxy and dart:js: process_jproxy() { var container = $('#container'); // Create 10000 DIV elements for (var i = 0; i < 10000; i++) {    $('<div>Test</div>').appendTo(container.object); } // Find and update classes of all DIV elements $('#container > div').css("color","red"); // Remove all DIV elements $('#container > div').remove(); } Finally, a clear Dart solution based on dart:html is as follows: process_dart() { // Create 10000 DIV elements var container = querySelector('#container'); for (var i = 0; i < 10000; i++) {    container.appendHtml('<div>Test</div>'); } // Find and update classes of all DIV elements querySelectorAll('#container > div').forEach((Element el) {    el.style.color = 'red'; }); // Remove all DIV elements querySelectorAll('#container > div').forEach((Element el) {    el.remove(); }); } All the results are in milliseconds. Run the application and wait until the web page is fully loaded. Run each test by clicking on the appropriate button. 
My results of the tests on Dartium, Chrome, Firefox, and Internet Explorer are shown in the following table (times in milliseconds):

Web browser         jQuery framework   jQuery via JProxy   Library dart:html
Dartium             2173               3156                714
Chrome              2935               6512                795
Firefox             2485               5787                582
Internet Explorer   12262              17748               2956

Now we have the absolute champion: the Dart-based solution. Even when the Dart code is compiled to JavaScript and executed in Chrome, Firefox, and Internet Explorer, it runs quicker than jQuery (four to five times) and much quicker than the dart:js and JProxy class-based solution (four to ten times).

Summary

This article showed you how to use Dart and JavaScript together to build web applications. It listed problems and solutions you can use to communicate between Dart and an existing JavaScript program. We compared the jQuery, JProxy/dart:js, and plain dart:html solutions to identify which one is the quickest.

Resources for Article:

Further resources on this subject:

Handling the DOM in Dart [article]
Dart Server with Dartling and MongoDB [article]
Handle Web Applications [article]

Web app penetration testing in Kali

Packt
30 Oct 2013
4 min read
(For more resources related to this topic, see here.)

Web apps are now a major part of today's World Wide Web, and keeping them safe and secure is the prime focus of webmasters. Building web apps from scratch can be a tedious task, and small bugs in the code can lead to a security breach. This is where web application penetration testing tools jump in and help you secure your application. Web app penetration testing can be carried out on various fronts, such as the frontend interface, the database, and the web server. Let us leverage the power of some of the important tools of Kali that can be helpful during web app penetration testing.

WebScarab proxy

WebScarab is an HTTP and HTTPS proxy interceptor framework that allows the user to review and modify the requests created by the browser before they are sent to the server. Similarly, the responses received from the server can be modified before they are reflected in the browser. The new version of WebScarab has many more advanced features such as XSS/CSRF detection, Session ID analysis, and Fuzzing. Follow these three steps to get started with WebScarab:

1. To launch WebScarab, browse to Applications | Kali Linux | Web applications | Web application proxies | WebScarab.
2. Once the application is loaded, you will have to change your browser's network settings. Set the proxy settings for IP as 127.0.0.1 and Port as 8008. Save the settings and go back to the WebScarab GUI.
3. Click on the Proxy tab and check Intercept request. Make sure that both GET and POST requests are highlighted on the left-hand side panel. To intercept the response, check Intercept responses to begin reviewing the responses coming from the server.

A short Python sketch for sending test traffic through this proxy follows at the end of this article.

Attacking the database using sqlninja

sqlninja is a popular tool used to test SQL injection vulnerabilities in Microsoft SQL servers. Databases are an integral part of web apps; hence, even a single flaw in them can lead to mass compromise of information. Let us see how sqlninja can be used for database penetration testing.

To launch sqlninja, browse to Applications | Kali Linux | Web applications | Database Exploitation | sqlninja. This will launch the terminal window with the sqlninja parameters. The important parameter to look for is the mode, or –m, parameter. The –m parameter specifies the type of operation we want to perform over the target database. Let us pass a basic command and analyze the output:

root@kali:~# sqlninja –m test
Sqlninja rel. 0.2.3-r1
Copyright (C) 2006-2008 icesurfer
[-] sqlninja.conf does not exist. You want to create it now ? [y/n]

This will prompt you to set up your configuration file (sqlninja.conf). You can pass the respective values and create the config file. Once you are through with it, you are ready to perform database penetration testing.

The Websploit framework

Websploit is an open source framework designed for vulnerability analysis and penetration testing of web applications. It is very similar to Metasploit and incorporates many of its plugins to add functionality. To launch Websploit, browse to Applications | Kali Linux | Web Applications | Web Application Fuzzers | Websploit. We can begin by updating the framework.
Passing the update command at the terminal will begin the updating process as follows: wsf>update [*]Updating Websploit framework, Please Wait… Once the update is over, you can check out the available modules by passing the following command: wsf>show modules Let us launch a simple directory scanner module against www.target.com as follows: wsf>use web/dir_scanner wsf:Dir_Scanner>show options wsf:Dir_Scanner>set TARGET www.target.com wsf:Dir_Scanner>run Once the run command is executed, Websploit will launch the attack module and display the result. Similarly, we can use other modules based on the requirements of our scenarios. Summary In this article, we covered the following sections: WebScarab proxy Attacking the database using sqlninja The Websploit framework Resources for Article: Further resources on this subject: Installing VirtualBox on Linux [Article] Linux Shell Script: Tips and Tricks [Article] Installing Arch Linux using the official ISO [Article]
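As an illustrative addition (not part of the original article), the following Python sketch shows one way to push test traffic through the WebScarab proxy configured above, so you can watch requests appear in the Intercept tab without clicking around a browser. It assumes the requests package is installed and uses example.com as a stand-in for whichever test target you are authorized to probe.

# Illustrative only: send a request through the WebScarab intercepting proxy.
# Assumes 'pip install requests' and WebScarab listening on 127.0.0.1:8008.
import requests

proxies = {
    "http": "http://127.0.0.1:8008",
    "https": "http://127.0.0.1:8008",
}

# verify=False is only relevant for HTTPS targets, where the intercepting
# proxy re-signs the TLS traffic with its own certificate.
response = requests.get("http://example.com/", proxies=proxies,
                        verify=False, timeout=10)
print(response.status_code)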

How to publish Docker and integrate with Maven

Pravin Dhandre
11 Apr 2018
6 min read
We have learned how to create Dockers, and how to run them, but these Dockers are stored in our system. Now we need to publish them so that they are accessible anywhere. In this post, we will learn how to publish our Docker images, and how to finally integrate Maven with Docker to easily do the same steps for our microservices. Understanding repositories In our previous example, when we built a Docker image, we published it into our local system repository so we can execute Docker run. Docker will be able to find them; this local repository exists only on our system, and most likely we need to have this access to wherever we like to run our Docker. For example, we may create our Docker in a pipeline that runs on a machine that creates our builds, but the application itself may run in our pre production or production environments, so the Docker image should be available on any system that we need. One of the great advantages of Docker is that any developer building an image can run it from their own system exactly as they would on any server. This will minimize the risk of having something different in each environment, or not being able to reproduce production when you try to find the source of a problem. Docker provides a public repository, Docker Hub, that we can use to publish and pull images, but of course, you can use private Docker repositories such as Sonatype Nexus, VMware Harbor, or JFrog Artifactory. To learn how to configure additional repositories refer to the repositories documentation. Docker Hub registration After registering, we need to log into our account, so we can publish our Dockers using the Docker tool from the command line using Docker login: docker login Login with your Docker ID to push and pull images from Docker Hub. If you don't have a Docker ID, head over to https://hub.Docker.com to create one. Username: mydockerhubuser Password: Login Succeeded When we need to publish a Docker, we must always be logged into the registry that we are working with; remember to log into Docker. Publishing a Docker Now we'd like to publish our Docker image to Docker Hub; but before we can, we need to build our images for our repository. When we create an account in Docker Hub, a repository with our username will be created; in this example, it will be mydockerhubuser. In order to build the Docker for our repository, we can use this command from our microservice directory: docker build . -t mydockerhubuser/chapter07 This should be quite a fast process since all the different layers are cached: Sending build context to Docker daemon 21.58MB Step 1/3 : FROM openjdk:8-jdk-alpine ---> a2a00e606b82 Step 2/3 : ADD target/*.jar microservice.jar ---> Using cache ---> 4ae1b12e61aa Step 3/3 : ENTRYPOINT java -jar microservice.jar ---> Using cache ---> 70d76cbf7fb2 Successfully built 70d76cbf7fb2 Successfully tagged mydockerhubuser/chapter07:latest Now that our Docker is built, we can push it to Docker Hub with the following command: docker push mydockerhubuser/chapter07 This command will take several minutes since the whole image needs to be uploaded. With our Docker published, we can now run it from any Docker system with the following command: docker run mydockerhubuser/chapter07 Or else, we can run it as a daemon, with: docker run -d mydockerhubuser/chapter07 Integrating Docker with Maven Now that we know most of the Docker concepts, we can integrate Docker with Maven using the Docker-Maven-plugin created by fabric8, so we can create Docker as part of our Maven builds. 
First, we will move our Dockerfile to a different folder. In the IntelliJ Project window, right-click on the src folder and choose New | Directory. We will name it Docker. Now, drag and drop the existing Dockerfile into this new directory, and we will change it to the following: FROM openjdk:8-jdk-alpine ADD maven/*.jar microservice.jar ENTRYPOINT ["java","-jar", "microservice.jar"] To manage the Dockerfile better, we just move into our project folders. When our Docker is built using the plugin, the contents of our application will be created in a folder named Maven, so we change the Dockerfile to reference that folder. Now, we will modify our Maven pom.xml, and add the Dockerfile-Maven-plugin in the build | plugins section: <build> .... <plugins> .... <plugin> <groupId>io.fabric8</groupId> <artifactId>Docker-maven-plugin</artifactId> <version>0.23.0</version> <configuration> <verbose>true</verbose> <images> <image> <name>mydockerhubuser/chapter07</name> <build> <dockerFileDir>${project.basedir}/src/Docker</dockerFileDir> <assembly> <descriptorRef>artifact</descriptorRef> </assembly> <tags> <tag>latest</tag> <tag>${project.version}</tag> </tags> </build> <run> <ports> <port>8080:8080</port> </ports> </run> </image> </images> </configuration> </plugin> </plugins> </build> Here, we are specifying how to create our Docker, where the Dockerfile is, and even which version of the Docker we are building. Additionally, we specify some parameters when our Docker runs, such as the port that it exposes. If we need IntelliJ to reload the Maven changes, we may need to click on the Reimport all maven projects button in the Maven Project window. For building our Docker using Maven, we can use the Maven Project window by running the task Docker: build, or by running the following command: mvnw docker:build This will build the Docker image, but we require to have it before it's packaged, so we can perform the following command: mvnw package docker:build We can also publish our Docker using Maven, either with the Maven Project window to run the Docker: push task, or by running the following command: mvnw docker:push This will push our Docker into the Docker Hub, but if we'd like to do everything in just one command, we can just use the following code: mvnw package docker:build docker:push Finally, the plugin provides other tasks such as Docker: run, Docker: start, and Docker: stop, which we can use in the commands that we've already learned on the command line. With this, we learned how to publish docker manually and integrate them into the Maven lifecycle. Do check out the book Hands-On Microservices with Kotlin to start simplifying development of microservices and building high quality service environment. Check out other posts: The key differences between Kubernetes and Docker Swarm How to publish Microservice as a service onto a Docker Building Docker images using Dockerfiles  
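The post drives the build and push through Maven and the docker CLI; purely as an optional aside (not from the original), the same two steps can also be scripted with the Docker SDK for Python. The sketch below assumes the docker package from PyPI is installed, the Docker daemon is running, docker login has already been performed, and it reuses the mydockerhubuser/chapter07 image name from the examples above.

# Rough sketch: build and push the microservice image from Python.
# Assumes: pip install docker; Docker daemon running; already logged in to Docker Hub.
import docker

client = docker.from_env()

# Build from the directory containing the Dockerfile (path is an assumption).
image, build_logs = client.images.build(path=".", tag="mydockerhubuser/chapter07:latest")

# Push to Docker Hub, streaming progress as decoded dictionaries.
for line in client.images.push("mydockerhubuser/chapter07", tag="latest",
                               stream=True, decode=True):
    print(line)

This is simply another way to automate the same workflow; in a Maven-centric build, the fabric8 plugin configuration shown above remains the more natural fit.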

Deep Learning Algorithms: How to classify Irises using multi-layer perceptrons

Sunith Shetty
16 Dec 2017
9 min read
[box type="note" align="" class="" width=""]This article is an excerpt taken from the book Big Data Analytics with Java by Rajat Mehta. Java is the de facto language for major big data environments like Hadoop, MapReduce etc. This book will teach you how to perform analytics on big data with production-friendly Java.[/box]

In this post, you will learn how to classify flower species from the Iris dataset using multi-layer perceptrons. Code files are available for download towards the end of the post.

Flower species classification using multi-layer perceptrons

This is a simple hello world-style program for performing classification using multi-layer perceptrons. For this, we will be using the famous Iris dataset, which can be downloaded from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/Iris. This dataset has four feature columns plus a class label, shown as follows:

Attribute name   Attribute description
Petal Length     Petal length in cm
Petal Width      Petal width in cm
Sepal Length     Sepal length in cm
Sepal Width      Sepal width in cm
Class            The type of iris flower, that is, Iris Setosa, Iris Versicolour, or Iris Virginica

This is a simple dataset with three types of Iris classes, as mentioned in the table. From the perspective of our neural network of perceptrons, we will be using the multi-layer perceptron algorithm bundled inside the spark ml library and will demonstrate how you can combine it with the Spark-provided pipeline API for easy manipulation of the machine learning workflow. We will also split our dataset into training and testing bundles so as to separately train our model on the training set and finally test its accuracy on the test set.

Let's now jump into the code of this simple example. First, create the Spark configuration object. In our case, we also mention that the master is local as we are running it on our local machine:

SparkConf sc = new SparkConf().setMaster("local[*]");

Next, build the SparkSession with this configuration and provide the name of the application; in our case, it is JavaMultilayerPerceptronClassifierExample:

SparkSession spark = SparkSession
  .builder()
  .config(sc)
  .appName("JavaMultilayerPerceptronClassifierExample")
  .getOrCreate();

Next, provide the location of the iris dataset file:

String path = "data/iris.csv";

Now load this dataset file into a Spark dataset object. As the file is in csv format, we also specify the format of the file while reading it using the SparkSession object:

Dataset<Row> dataFrame1 = spark.read().format("csv").load(path);

After loading the data from the file into the dataset object, let's now extract this data from the dataset and put it into a Java class, IrisVO. This IrisVO class is a plain POJO and has the attributes to store the data point types, as shown:

public class IrisVO {
private Double sepalLength;
private Double petalLength;
private Double petalWidth;
private Double sepalWidth;
private String labelString;

On the dataset object dataFrame1, we invoke the toJavaRDD method to convert it into an RDD object and then invoke the map function on it. The map function is linked to a lambda function, as shown. In the lambda function, we go over each row of the dataset, pull the data items from it, and fill them into the IrisVO POJO object before finally returning this object from the lambda function.
This way, we get a dataMap rdd object filled with IrisVO objects: JavaRDD<IrisVO> dataMap = dataFrame1.toJavaRDD().map( r -> {  IrisVO irisVO = new IrisVO();  irisVO.setLabelString(r.getString(5));  irisVO.setPetalLength(Double.parseDouble(r.getString(3)));  irisVO.setSepalLength(Double.parseDouble(r.getString(1)));  irisVO.setPetalWidth(Double.parseDouble(r.getString(4)));  irisVO.setSepalWidth(Double.parseDouble(r.getString(2)));  return irisVO; }); As we are using the latest Spark ML library for applying our machine learning algorithms from Spark, we need to convert this RDD back to a dataset. In this case, however, this dataset would have the schema for the individual data points as we had mapped them to the IrisVO object attribute types earlier: Dataset<Row> dataFrame = spark.createDataFrame(dataMap.rdd(), IrisVO. class); We will now split the dataset into two portions: one for training our multi-layer perceptron model and one for testing its accuracy later. For this, we are using the prebuilt randomSplit method available on the dataset object and will provide the parameters. We keep 70 percent for training and 30 percent for testing. The last entry is the 'seed' value supplied to the randomSplit method. Dataset<Row>[] splits = dataFrame.randomSplit(new double[]{0.7, 0.3}, 1234L); Next, we extract the splits into individual datasets for training and testing: Dataset<Row> train = splits[0]; Dataset<Row> test = splits[1]; Until now we had seen the code that was pretty much generic across most of the Spark machine learning implementations. Now we will get into the code that is specific to our multi-layer perceptron model. We will create an int array that will contain the count for the various attributes needed by our model: int[] layers = new int[] {4, 5, 4, 3}; Let's now look at the attribute types of this int array, as shown in the following table: Attribute value at array index Description 0 This is the number of neurons or perceptrons at the input layer of the network. This is the count of the number of features that are passed to the model. 1 This is a hidden layer containing five perceptrons (sigmoid neurons only, ignore the terminology). 2 This is another hidden layer containing four sigmoid neurons. 3 This is the number of neurons representing the output label classes. In our case, we have three types of Iris flowers, hence three classes. After creating the layers for the neural network and specifying the number of neurons in each layer, next build a StringIndexer class. Since our models are mathematical and look for mathematical inputs for their computations, we have to convert our string labels for classification (that is, Iris Setosa, Iris Versicolour, and Iris Virginica) into mathematical numbers. To do this, we use the StringIndexer class that is provided by Apache Spark. In the instance of this class, we also provide the place from where we can read the data for the label and the column where it will output the numerical representation for that label: StringIndexer labelIndexer = new StringIndexer(). setInputCol("labelString").setOutputCol("label"); Now we build the features array. These would be the features that we use when training our model: String[] featuresArr = {"sepalLength","sepalWidth","petalLength","pet alWidth"}; Next, we build a features vector as this needs to be fed to our model. To put the feature in vector form, we use the VectorAssembler class from the Spark ML library. 
We also provide a features array as input and provide the output column where the vector array will be printed: VectorAssembler va = new VectorAssembler().setInputCols(featuresArr). setOutputCol("features"); Now we build the multi-layer perceptron model that is bundled within the Spark ML library. To this model we supply the array of layers we created earlier. This layer array has the number of neurons (sigmoid neurons) that are needed in each layer of the multi-perceptron network: MultilayerPerceptronClassifier trainer = new MultilayerPerceptronClassifier()  .setLayers(layers)  .setBlockSize(128)  .setSeed(1234L)  .setMaxIter(25); The other parameters that are being passed to this multi-layer perceptron model are: Block Size Block size for putting input data in matrices for faster computation. The default value is 128. Seed Seed for weight initialization if weights are not set. Maximum iterations Maximum number of iterations to be performed on the dataset while learning. The default value is 100. Finally, we hook all the workflow pieces together using the pipeline API. To this pipeline API, we pass the different pieces of the workflow, that is, the labelindexer and vector assembler, and finally provide the model: Pipeline pipeline = new Pipeline().setStages(new PipelineStage[] {labelIndexer, va, trainer}); Once our pipeline object is ready, we fit the model on the training dataset to train our model on the underlying training data: PipelineModel model = pipeline.fit(train); Once the model is trained, it is not yet ready to be run on the test data to figure out its predictions. For this, we invoke the transform method on our model and store the result in a Dataset object: Dataset<Row> result = model.transform(test); Let's see the first few lines of this result by invoking a show method on it: result.show(); This would print the result of the first few lines of the result dataset as shown: As seen in the previous image, the last column depicts the predictions made by our model. After making the predictions, let's now check the accuracy of our model. For this, we will first select two columns in our model which represent the predicted label, as well as the actual label (recall that the actual label is the output of our StringIndexer): Dataset<Row> predictionAndLabels = result.select("prediction", "label"); Finally, we will use a standard class called MulticlassClassificationEvaluator, which is provided by Spark for checking the accuracy of the models. We will create an instance of this class. Next, we will set the metric name of the metric, that is, accuracy, for which we want to get the value from our predicted results: MulticlassClassificationEvaluator evaluator = new MulticlassClassificationEvaluator() .setMetricName("accuracy"); Next, using the instance of this evaluator, invoke the evaluate method and pass the parameter of the dataset that contains the column for the actual result and predicted result (in our case, it is the predictionAndLabels column): System.out.println("Test set accuracy = " + evaluator.evaluate(predictionAndLabels)); This would print the output as: If we get this value in a percentage, this means that our model is 95% accurate. This is the beauty of neural networks - they can give us very high accuracy when tweaked properly. With this, we come to an end for our small hello world-type program on multi-perceptrons. Unfortunately, Spark support on neural networks and deep learning is not extensive; at least not until now. 
To summarize, we covered a sample case study for the classification of Iris flower species based on the features that were used to train our neural network. If you are keen to know more about real-time analytics using deep learning methodologies such as neural networks and multi-layer perceptrons, you can refer to the book Big Data Analytics with Java. [box type="download" align="" class="" width=""]Download Code files[/box]      
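The book's example is written in Java, but the same Spark ML pipeline translates almost line for line to PySpark. The sketch below is illustrative only and not taken from the book: the CSV column names and the leading id column are assumptions inferred from the Java row-indexing code, and it skips the IrisVO mapping step by letting Spark infer the schema.

# Illustrative PySpark sketch of the same multi-layer perceptron pipeline.
# Column names and the leading id column are assumptions, not from the book.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.master("local[*]").appName("IrisMLP").getOrCreate()

cols = ["sepalLength", "sepalWidth", "petalLength", "petalWidth", "labelString"]
df = spark.read.csv("data/iris.csv", inferSchema=True).toDF("id", *cols)

train, test = df.randomSplit([0.7, 0.3], seed=1234)

label_indexer = StringIndexer(inputCol="labelString", outputCol="label")
assembler = VectorAssembler(
    inputCols=["sepalLength", "sepalWidth", "petalLength", "petalWidth"],
    outputCol="features")
# Same layer sizes as the Java example: 4 inputs, two hidden layers, 3 classes.
trainer = MultilayerPerceptronClassifier(
    layers=[4, 5, 4, 3], blockSize=128, seed=1234, maxIter=25)

pipeline = Pipeline(stages=[label_indexer, assembler, trainer])
model = pipeline.fit(train)
result = model.transform(test)

evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
print("Test set accuracy =", evaluator.evaluate(result.select("prediction", "label")))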

Unit Testing

Packt
18 Feb 2015
18 min read
In this article by Mikael Lundin, author of the book Testing with F#, we will see how unit testing is the art of designing our program in such a way that we can easily test each function as isolated units and such verify its correctness. Unit testing is not only a tool for verification of functionality, but also mostly a tool for designing that functionality in a testable way. What you gain is the means of finding problems early, facilitating change, documentation, and design. In this article, we will dive into how to write good unit tests using F#: Testing in isolation Finding the abstraction level (For more resources related to this topic, see here.) FsUnit The current state of unit testing in F# is good. You can get all the major test frameworks running with little effort, but there is still something that feels a bit off with the way tests and asserts are expressed: open NUnit.Framework Assert.That(result, Is.EqualTo(42)) Using FsUnit, you can achieve much higher expressiveness in writing unit tests by simply reversing the way the assert is written: open FsUnit result |> should equal 42 The FsUnit framework is not a test runner in itself, but uses an underlying test framework to execute. The underlying framework can be of MSTest, NUnit, or xUnit. FsUnit can best be explained as having a different structure and syntax while writing tests. While this is a more dense syntax, the need for structure still exists and AAA is more needed more than ever. Consider the following test example: [<Measure>] type EUR [<Measure>] type SEK type Country = | Sweden | Germany | France   let calculateVat country (amount : float<'u>) =    match country with    | Sweden -> amount * 0.25    | Germany -> amount * 0.19    | France -> amount * 0.2   open NUnit.Framework open FsUnit   [<Test>] let ``Sweden should have 25% VAT`` () =    let amount = 200.<SEK>      calculateVat Sweden amount |> should equal 50<SEK> This code will calculate the VAT in Sweden in Swedish currency. What is interesting is that when we break down the test code and see that it actually follows the AAA structure, even it doesn't explicitly tell us this is so: [<Test>] let ``Germany should have 19% VAT`` () =    // arrange    let amount = 200.<EUR>    // act    calculateVat Germany amount    //assert    |> should equal 38<EUR> The only thing I did here was add the annotations for AAA. It gives us the perspective of what we're doing, what frames we're working inside, and the rules for writing good unit tests. Assertions We have already seen the equal assertion, which verifies that the test result is equal to the expected value: result |> should equal 42 You can negate this assertion by using the not' statement, as follows: result |> should not' (equal 43) With strings, it's quite common to assert that a string starts or ends with some value, as follows: "$12" |> should startWith "$" "$12" |> should endWith "12" And, you can also negate that, as follows: "$12" |> should not' (startWith "€") "$12" |> should not' (endWith "14") You can verify that a result is within a boundary. 
This will, in turn, verify that the result is somewhere between the values of 35-45: result |> should (equalWithin 5) 40 And, you can also negate that, as follows: result |> should not' ((equalWithin 1) 40) With the collection types list, array, and sequence, you can check that it contains a specific value: [1..10] |> should contain 5 And, you can also negate it to verify that a value is missing, as follows: [1; 1; 2; 3; 5; 8; 13] |> should not' (contain 7) It is common to test the boundaries of a function and then its exception handling. This means you need to be able to assert exceptions, as follows: let getPersonById id = failwith "id cannot be less than 0" (fun () -> getPersonById -1 |> ignore) |> should throw typeof<System.Exception> There is a be function that can be used in a lot of interesting ways. Even in situations where the equal assertion can replace some of these be structures, we can opt for a more semantic way of expressing our assertions, providing better error messages. Let us see examples of this, as follows: // true or false 1 = 1 |> should be True 1 = 2 |> should be False        // strings as result "" |> should be EmptyString null |> should be NullOrEmptyString   // null is nasty in functional programming [] |> should not' (be Null)   // same reference let person1 = new System.Object() let person2 = person1 person1 |> should be (sameAs person2)   // not same reference, because copy by value let a = System.DateTime.Now let b = a a |> should not' (be (sameAs b))   // greater and lesser result |> should be (greaterThan 0) result |> should not' (be lessThan 0)   // of type result |> should be ofExactType<int>   // list assertions [] |> should be Empty [1; 2; 3] |> should not' (be Empty) With this, you should be able to assert most of the things you're looking for. But there still might be a few edge cases out there that default FsUnit asserts won't catch. Custom assertions FsUnit is extensible, which makes it easy to add your own assertions on top of the chosen test runner. This has the possibility of making your tests extremely readable. The first example will be a custom assert which verifies that a given string matches a regular expression. This will be implemented using NUnit as a framework, as follows: open FsUnit open NUnit.Framework.Constraints open System.Text.RegularExpressions   // NUnit: implement a new assert type MatchConstraint(n) =    inherit Constraint() with       override this.WriteDescriptionTo(writer : MessageWriter) : unit =            writer.WritePredicate("matches")            writer.WriteExpectedValue(sprintf "%s" n)        override this.Matches(actual : obj) =            match actual with            | :? string as input -> Regex.IsMatch(input, n)            | _ -> failwith "input must be of string type"            let match' n = MatchConstraint(n)   open NUnit.Framework   [<Test>] let ``NUnit custom assert`` () =    "2014-10-11" |> should match' "d{4}-d{2}-d{2}"    "11/10 2014" |> should not' (match' "d{4}-d{2}-d{2}") In order to create your own assert, you need to create a type that implements the NUnit.Framework.Constraints.IConstraint interface, and this is easily done by inheriting from the Constraint base class. You need to override both the WriteDescriptionTo() and Matches() method, where the first one controls the message that will be output from the test, and the second is the actual test. In this implementation, I verify that input is a string; or the test will fail. Then, I use the Regex.IsMatch() static function to verify the match. 
Next, we create an alias for the MatchConstraint() function, match', with the extra apostrophe to avoid conflict with the internal F# match expression, and then we can use it as any other assert function in FsUnit. Doing the same for xUnit requires a completely different implementation. First, we need to add a reference to NHamcrest API. We'll find it by searching for the package in the NuGet Package Manager: Instead, we make an implementation that uses the NHamcrest API, which is a .NET port of the Java Hamcrest library for building matchers for test expressions, shown as follows: open System.Text.RegularExpressions open NHamcrest open NHamcrest.Core   // test assertion for regular expression matching let match' pattern =    CustomMatcher<obj>(sprintf "Matches %s" pattern, fun c ->        match c with        | :? string as input -> Regex.IsMatch(input, pattern)        | _ -> false)   open Xunit open FsUnit.Xunit   [<Fact>] let ``Xunit custom assert`` () =    "2014-10-11" |> should match' "d{4}-d{2}-d{2}"    "11/10 2014" |> should not' (match' "d{4}-d{2}-d{2}") The functionality in this implementation is the same as the NUnit version, but the implementation here is much easier. We create a function that receives an argument and return a CustomMatcher<obj> object. This will only take the output message from the test and the function to test the match. Writing an assertion for FsUnit driven by MSTest works exactly the same way as it would in Xunit, by NHamcrest creating a CustomMatcher<obj> object. Unquote There is another F# assertion library that is completely different from FsUnit but with different design philosophies accomplishes the same thing, by making F# unit tests more functional. Just like FsUnit, this library provides the means of writing assertions, but relies on NUnit as a testing framework. Instead of working with a DSL like FsUnit or API such as with the NUnit framework, the Unquote library assertions are based on F# code quotations. Code quotations is a quite unknown feature of F# where you can turn any code into an abstract syntax tree. Namely, when the F# compiler finds a code quotation in your source file, it will not compile it, but rather expand it into a syntax tree that represents an F# expression. The following is an example of a code quotation: <@ 1 + 1 @> If we execute this in F# Interactive, we'll get the following output: val it : Quotations.Expr = Call (None, op_Addition, [Value (1), Value (1)]) This is truly code as data, and we can use it to write code that operates on code as if it was data, which in this case, it is. It brings us closer to what a compiler does, and gives us lots of power in the metadata programming space. We can use this to write assertions with Unquote. Start by including the Unquote NuGet package in your test project, as shown in the following screenshot: And now, we can implement our first test using Unquote, as follows: open NUnit.Framework open Swensen.Unquote   [<Test>] let ``Fibonacci sequence should start with 1, 1, 2, 3, 5`` () =     test <@ fibonacci |> Seq.take 5 |> List.ofSeq = [1; 1; 2; 3; 5] @> This works by Unquote first finding the equals operation, and then reducing each side of the equals sign until they are equal or no longer able to reduce. Writing a test that fails and watching the output more easily explains this. 
The following test should fail because 9 is not a prime number: [<Test>] let ``prime numbers under 10 are 2, 3, 5, 7, 9`` () =    test <@ primes 10 = [2; 3; 5; 7; 9] @> // fail The test will fail with the following message: Test Name: prime numbers under 10 are 2, 3, 5, 7, 9 Test FullName: chapter04.prime numbers under 10 are 2, 3, 5, 7, 9 Test Outcome: Failed Test Duration: 0:00:00.077   Result Message: primes 10 = [2; 3; 5; 7; 9] [2; 3; 5; 7] = [2; 3; 5; 7; 9] false   Result StackTrace: at Microsoft.FSharp.Core.Operators.Raise[T](Exception exn) at chapter04.prime numbers under 10 are 2, 3, 5, 7, 9() In the resulting message, we can see both sides of the equals sign reduced until only false remains. It's a very elegant way of breaking down a complex assertion. Assertions The assertions in Unquote are not as specific or extensive as the ones in FsUnit. The idea of having lots of specific assertions for different situations is to get very descriptive error messages when the tests fail. Since Unquote actually outputs the whole reduction of the statements when the test fails, the need for explicit assertions is not that high. You'll get a descript error message anyway. The absolute most common is to check for equality, as shown before. You can also verify that two expressions are not equal: test <@ 1 + 2 = 4 - 1 @> test <@ 1 + 2 <> 4 @> We can check whether a value is greater or smaller than the expected value: test <@ 42 < 1337 @> test <@ 1337 > 42 @> You can check for a specific exception, or just any exception: raises<System.NullReferenceException> <@ (null : string).Length @> raises<exn> <@ System.String.Format(null, null) @> Here, the Unquote syntax excels compared to FsUnit, which uses a unit lambda expression to do the same thing in a quirky way. The Unquote library also has its reduce functionality in the public API, making it possible for you to reduce and analyze an expression. Using the reduceFully syntax, we can get the reduction in a list, as shown in the following: > <@ (1+2)/3 @> |> reduceFully |> List.map decompile;; val it : string list = ["(1 + 2) / 3"; "3 / 3"; "1"] If we just want the output to console output, we can run the unquote command directly: > unquote <@ [for i in 1..5 -> i * i] = ([1..5] |> List.map (fun i -> i * i)) @>;; Seq.toList (seq (Seq.delay (fun () -> Seq.map (fun i -> i * i) {1..5}))) = ([1..5] |> List.map (fun i -> i * i)) Seq.toList (seq seq [1; 4; 9; 16; ...]) = ([1; 2; 3; 4; 5] |> List.map (fun i -> i * i)) Seq.toList seq [1; 4; 9; 16; ...] = [1; 4; 9; 16; 25] [1; 4; 9; 16; 25] = [1; 4; 9; 16; 25] true It is important to know what tools are out there, and Unquote is one of those tools that is fantastic to know about when you run into a testing problem in which you want to reduce both sides of an equals sign. Most often, this belongs to difference computations or algorithms like price calculation. We have also seen that Unquote provides a great way of expressing tests for exceptions that is unmatched by FsUnit. Testing in isolation One of the most important aspects of unit testing is to test in isolation. This does not only mean to fake any external dependency, but also that the test code itself should not be tied up to some other test code. If you're not testing in isolation, there is a potential risk that your test fails. This is not because of the system under test, but the state that has lingered from a previous test run, or external dependencies. Writing pure functions without any state is one way of making sure your test runs in isolation. 
Another way is to make sure that the test creates all the state it needs itself. Shared state between tests, such as connections, is a bad idea. Using TestFixtureSetUp/TearDown attributes to set up state for a set of tests is a bad idea. Keeping slow resources around because they're expensive to set up is a bad idea. The most common kinds of shared state are the following:

The ASP.NET Model View Controller (MVC) session state
Dependency injection setup
Database connections, even though a test that touches one is no longer strictly a unit test

Here's how one should think about unit testing in isolation: each test is responsible for setting up the SUT and its database/web service stubs in order to perform the test and assert on the result. It is equally important that the test cleans up after itself, which in the case of unit tests can most often be handed over to the garbage collector and doesn't need to be explicitly disposed.

It is common to think that one should only isolate a test fixture from other test fixtures, but this idea of a test fixture is bad. Instead, one should strive for having each test stand on its own to as large an extent as possible, and not be dependent on outside setups. This does not mean you will have unnecessarily long unit tests, provided you write the SUT and the tests well within that context. The problem we often run into is that the SUT itself maintains some kind of state that is present between tests. The state can simply be a value that is set in the application domain and is present between different test runs, as follows:

let getCustomerFullNameByID id =
    if cache.ContainsKey(id) then
        (cache.[id] :?> Customer).FullName
    else
        // get from database
        // NOTE: stub code
        let customer = db.getCustomerByID id
        cache.[id] <- customer
        customer.FullName

The problem we see here is that the cache will be present from one test to another, so when the second test is running, it needs to make sure that it's running with a clean cache, or the result might not be as expected. One way to test this properly would be to separate the core logic from the cache and test each of them independently. Another would be to treat the function as a black box and ignore the cache completely: if the cache makes the test fail, then the functionality fails as a whole. Which approach to take depends on whether we see the cache as an implementation detail of the function or as functionality in its own right. Testing implementation details, or private functions, is dirty because our tests might break even if the functionality hasn't changed. And yet, there might be benefits in taking the implementation detail into account. In this case, we could use the cache functionality to easily stub out the database without the need for any mocking framework.

Vertical slice testing

Most often, we deal with dependencies as something we need to mock away, whereas the better option would be to implement a test harness directly into the product. We know what kind of data and what kind of calls we need to make to the database, so right there, we have a public API for the database. This is often called a data access layer in a three-tier architecture (but no one ever does those anymore, right?).
As we have a public data access layer, we could easily implement an in-memory representation that can be used not only by our tests, but also while developing the product. When you're running the application in development mode, you configure it to use the in-memory version of the dependency. This provides you with the following benefits:

You'll get a faster development environment
Your tests will become simpler
You have complete control of your dependency

As your development environment is doing everything in memory, it becomes blazing fast. And as you develop your application, you will appreciate adjusting that public API and getting to understand completely what you expect from that dependency. It will lead to a cleaner API, where very few side effects are allowed to seep through. Your tests will become much simpler, as instead of mocking away the dependency, you can call the in-memory dependency and set whatever state you want. Here's an example of what a public data access API might look like:

type IDataAccess =
    abstract member GetCustomerByID : int -> Customer
    abstract member FindCustomerByName : string -> Customer option
    abstract member UpdateCustomerName : int -> string -> Customer
    abstract member DeleteCustomerByID : int -> bool

This is surely a very simple API, but it will demonstrate the point. There is a database with customers inside it, and we want to perform some operations on them. In this case, our in-memory implementation would look like this:

type InMemoryDataAccess() =
    let data = new System.Collections.Generic.Dictionary<int, Customer>()

    // expose the add method so tests can seed state
    member this.Add customer = data.Add(customer.ID, customer)

    interface IDataAccess with
        // throws an exception if not found
        member this.GetCustomerByID id =
            data.[id]

        member this.FindCustomerByName fullName =
            data.Values |> Seq.tryFind (fun customer -> customer.FullName = fullName)

        member this.UpdateCustomerName id fullName =
            data.[id] <- { data.[id] with FullName = fullName }
            data.[id]

        member this.DeleteCustomerByID id =
            data.Remove(id)

This is a simple implementation that provides the same functionality as the database would, but in memory. This makes it possible to run the tests completely in isolation without worrying about mocking away the dependencies. The dependencies are already substituted with in-memory replacements, and as seen in this example, the in-memory replacement doesn't have to be very extensive. The only extra function beyond the interface implementation is the Add() function, which lets us set the state prior to the test, as this is something the interface itself doesn't provide for us.
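As a rough illustration of how this plays out in a test (a sketch, not code from the book), a unit test can construct the in-memory implementation, seed it through Add(), and exercise the interface directly. The Customer record shown here is an assumption about its shape, and in a real project it would be declared before IDataAccess:

// assumed shape of the record used by IDataAccess above
type Customer = { ID : int; FullName : string }

open NUnit.Framework
open FsUnit

[<Test>]
let ``in-memory data access can be seeded and queried directly`` () =
    // arrange: seed the in-memory replacement instead of configuring a mock
    let dataAccess = InMemoryDataAccess()
    dataAccess.Add { ID = 1; FullName = "Jane Doe" }
    let sut = dataAccess :> IDataAccess

    // act and assert
    (sut.GetCustomerByID 1).FullName |> should equal "Jane Doe"
    sut.FindCustomerByName "Jane Doe" |> Option.isSome |> should equal true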
Now, in order to sew this together with the real implementation, we need a configuration setting to select which version to use, as shown in the following code:

open System.Configuration
open System.Collections.Specialized

// TryGetValue extension method to NameValueCollection
type NameValueCollection with
    member this.TryGetValue (key : string) =
        if this.Get(key) = null then
            None
        else
            Some (this.Get key)

let dataAccess : IDataAccess =
    match ConfigurationManager.AppSettings.TryGetValue("DataAccess") with
    | Some "InMemory" -> new InMemoryDataAccess() :> IDataAccess
    | Some _ | None -> new DefaultDataAccess() :> IDataAccess

// usage
let fullName = (dataAccess.GetCustomerByID 1).FullName

Again, with only a few lines of code, we manage to select the appropriate IDataAccess instance and execute against it without using dependency injection or taking a penalty in code readability, as we would in C#. The code is straightforward and easy to read, and we can execute any tests we want without touching the external dependency, or in this case, the database.

Finding the abstraction level

In order to start unit testing, you have to start writing tests; this is what they'll tell you. If you want to get good at it, just start writing tests, any tests, and a lot of them. The rest will solve itself. I've watched experienced developers sit around staring dumbfounded at an empty screen because they couldn't work out how to get started or what to test. The question is not unfounded. In fact, it is still debated in the Test Driven Development (TDD) community what should be tested. The ground rule is that the test should bring at least as much value as the cost of writing it, but that is a bad rule for someone new to testing, as all tests are expensive for them to write.

Summary

In this article, we've learned how to write unit tests by using the appropriate tools at our disposal: NUnit, FsUnit, and Unquote. We have also learned about different techniques for handling external dependencies, using interfaces and functional signatures, and performing dependency injection into constructors, properties, and methods.

Resources for Article:

Further resources on this subject:

Learning Option Pricing [article]
Pentesting Using Python [article]
Penetration Testing [article]

Using Nginx as a Reverse Proxy

Packt
23 May 2011
7 min read
Nginx 1 Web Server Implementation Cookbook: Over 100 recipes to master using the Nginx HTTP server and reverse proxy

(For more resources on Nginx, see here.)

Introduction

Nginx finds one of its most common applications acting as a reverse proxy for many sites. A reverse proxy is a type of proxy server that retrieves resources for a client from one or more servers. These resources are returned to the client as though they originated from the proxy server itself. Due to its event-driven architecture and C codebase, Nginx consumes significantly less CPU power and memory than many other better-known solutions out there. This article will deal with the usage of Nginx as a reverse proxy in various common scenarios. We will have a look at how we can set up a Rails application, set up load balancing, and also look at a caching setup using Nginx, which can potentially enhance the performance of your existing site without any codebase changes.

Using Nginx as a simple reverse proxy

Nginx in its simplest form can be used as a reverse proxy for any site; it acts as an intermediary layer for security, load distribution, caching, and compression purposes. In effect, it can potentially enhance the overall quality of the site for the end user without any change to the application source code, by distributing the load from incoming requests to multiple backend servers and by caching static as well as dynamic content.

How to do it...

You will need to first define proxy.conf, which will be later included in the main configuration of the reverse proxy that we are setting up:

proxy_redirect off;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
client_max_body_size 10m;
client_body_buffer_size 128k;
proxy_connect_timeout 90;
proxy_send_timeout 90;
proxy_read_timeout 90;
proxy_buffers 32 4k;

To use Nginx as a reverse proxy for a site running on a local port of the server, the following configuration will suffice:

server {
    listen 80;
    server_name example1.com;
    access_log /var/www/example1.com/log/nginx.access.log;
    error_log /var/www/example1.com/log/nginx_error.log debug;

    location / {
        include proxy.conf;
        proxy_pass http://127.0.0.1:8080;
    }
}

How it works...

In this recipe, Nginx simply acts as a proxy for the defined backend server, which is running on port 8080 of the server and can be any HTTP web application. Later in this article, other advanced recipes will have a look at how one can define more backend servers, and how we can set them up to respond to requests.

Setting up a Rails site using Nginx as a reverse proxy

In this recipe, we will set up a working Rails site and set up Nginx working on top of the application. This will assume that the reader has some knowledge of Rails and thin. There are other ways of running Nginx and Rails as well, such as using Phusion Passenger.

How to do it...

This will require you to set up thin first, then to configure thin for your application, and then to configure Nginx.

If you already have gems installed, then the following command will install thin; otherwise, you will need to install it from source:

sudo gem install thin

Now you need to generate the thin configuration. This will create a configuration in the /etc/thin directory:

sudo thin config -C /etc/thin/myapp.yml -c /var/rails/myapp --servers 5 -e production

Now you can start the thin service. Depending on your operating system, the startup command will vary.
Assuming that you have Nginx installed, you will need to add the following to the configuration file:

upstream thin_cluster {
    server unix:/tmp/thin.0.sock;
    server unix:/tmp/thin.1.sock;
    server unix:/tmp/thin.2.sock;
    server unix:/tmp/thin.3.sock;
    server unix:/tmp/thin.4.sock;
}

server {
    listen 80;
    server_name www.example1.com;
    root /var/www.example1.com/public;

    location / {
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header Host $http_host;
        proxy_redirect off;
        try_files $uri $uri/index.html $uri.html @thin;
    }

    location @thin {
        include proxy.conf;
        proxy_pass http://thin_cluster;
    }

    error_page 500 502 503 504 /50x.html;
    location = /50x.html {
        root html;
    }
}

How it works...

This is a fairly simple Rails stack, where we basically configure and run five upstream thin instances which interact with Nginx through socket connections. The try_files rule ensures that Nginx serves the static files, while all dynamic requests are processed by the Rails backend. It can also be seen how we set the proxy headers correctly to ensure that the client IP is forwarded correctly to the Rails application. It is important for a lot of applications to be able to access the client IP to show geo-located information, and logging this IP can be useful in identifying if geography is a problem when the site is not working properly for specific clients.

Setting up correct reverse proxy timeouts

In this section we will set up correct reverse proxy timeouts, which will affect your user's interaction when your backend application is unable to respond to the client's request. In such a case, it is advisable to set up some sensible timeout pages so that the user can understand that further refreshing may only aggravate the issues on the web application.

How to do it...

You will first need to set up proxy.conf, which will later be included in the configuration:

proxy_redirect off;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
client_max_body_size 10m;
client_body_buffer_size 128k;
proxy_connect_timeout 90;
proxy_send_timeout 90;
proxy_read_timeout 90;
proxy_buffers 32 4k;

Reverse proxy timeouts are some fairly simple flags that we need to set up in the Nginx configuration, as in the following example:

server {
    listen 80;
    server_name example1.com;
    access_log /var/www/example1.com/log/nginx.access.log;
    error_log /var/www/example1.com/log/nginx_error.log debug;

    #set your default location
    location / {
        include proxy.conf;
        proxy_read_timeout 120;
        proxy_connect_timeout 120;
        proxy_pass http://127.0.0.1:8080;
    }
}

How it works...

In the preceding configuration, we have raised two timeouts to 120 seconds: proxy_connect_timeout, which controls how long Nginx will wait while establishing a connection to the backend server, and proxy_read_timeout, which controls how long Nginx will wait for a response from the backend before giving up and returning a gateway timeout error to the client.

Setting up caching on the reverse proxy

In a setup where Nginx acts as the layer between the client and the backend web application, it is clear that caching can be one of the benefits that can be achieved. In this recipe, we will have a look at setting up caching for any site to which Nginx is acting as a reverse proxy. Due to its extremely small footprint and modular architecture, Nginx has become quite the Swiss Army knife of the modern web stack.

How to do it...
This example configuration shows how we can use caching when utilizing Nginx as a reverse proxy web server:

http {
    proxy_cache_path /var/www/cache levels=1:2 keys_zone=my-cache:8m max_size=1000m inactive=600m;
    proxy_temp_path /var/www/cache/tmp;
    ...

    server {
        listen 80;
        server_name example1.com;
        access_log /var/www/example1.com/log/nginx.access.log;
        error_log /var/www/example1.com/log/nginx_error.log debug;

        #set your default location
        location / {
            include proxy.conf;
            proxy_pass http://127.0.0.1:8080/;
            proxy_cache my-cache;
            proxy_cache_valid 200 302 60m;
            proxy_cache_valid 404 1m;
        }
    }
}

How it works...

This configuration implements a simple cache with a 1000MB maximum size, and keeps all HTTP 200 and 302 responses in the cache for 60 minutes and HTTP 404 responses in the cache for 1 minute. There is an initial directive that creates the cache storage on initialization; in the further directives we basically configure the location that is going to be cached. It is possible to actually set up more than one cache path for multiple locations.

There's more...

This was a relatively small demonstration of what can be achieved with the caching aspect of the proxy module. There are more directives in the proxy module that can be really useful in optimizing and making your stack faster and more efficient.
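As a hedged illustration (this snippet is not part of the original recipe, and the values shown are arbitrary examples rather than recommendations), a few of the additional proxy cache directives that are commonly tuned look like the following:

# Serve stale cached content while the backend is erroring or being refreshed
proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;

# Control how the cache key is built
proxy_cache_key "$scheme$host$request_uri";

# Only cache an item after it has been requested a few times
proxy_cache_min_uses 3;

# Let clients or cookies bypass the cache when needed
proxy_cache_bypass $cookie_nocache $arg_nocache;

# Expose the cache status (HIT/MISS/EXPIRED) for debugging
add_header X-Cache-Status $upstream_cache_status;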

Introduction to SQL and SQLite

Packt
10 Feb 2016
22 min read
In this article by Gene Da Rocha, author of the book Learning SQLite for iOS, we are introduced to the background of the Structured Query Language (SQL) and the mobile database SQLite. Whether you are an experienced SQL technologist or a novice, the book will be a great aid in helping you understand this cool subject, which is gaining momentum. SQLite is the database used on the mobile smartphone or tablet that is local to the device. SQLite has been modified by different vendors to harden and secure it for a variety of uses and applications.

(For more resources related to this topic, see here.)

SQLite was released in 2000 and has grown to be the de facto database on mobile smartphones today. It is an open source piece of software with a low footprint or overhead, which is packaged with a relational database management system. Mr D. Richard Hipp is the inventor and author of SQLite, which he designed and developed for software used on board a battleship while he was at a company called General Dynamics, working on a contract for the U.S. Navy. The software he was working with was built for the HP-UX operating system with Informix as the database engine. It could take many hours in the day to upgrade or install the database software, and it was an over-the-top database for this experienced DBA (database administrator). Mr Hipp wanted a portable, self-contained, easy-to-use database, which could be mobile, quick to install, and not dependent on the operating system.

Initially, SQLite 1.0 used gdbm as its storage system, but later, this was replaced with SQLite's own B-tree implementation and technology for the database. The B-tree implementation was enhanced to support transactions and store rows of data in key order. From 2001 onwards, open source family extensions for other languages, such as Java, Python, and Perl, were written to support their applications, and the database and its popularity within the open source community and beyond kept growing.

Originally based upon relational algebra and tuple relational calculus, SQL consists of a data definition and a data manipulation language. The scope of SQL includes data insert, query, update and delete, schema creation and modification, and data access control. Although SQL is often described as, and to a great extent is, a declarative language (4GL), it also includes procedural elements.

Internationalization support for UTF-16 and UTF-8, including text-collating sequences, was added in versions 2 and 3 in 2004, supported by funding from AOL (America Online). SQLite works with a variety of browsers, which sometimes have in-built support for this technology. For example, there are many extensions for Chrome or Firefox that allow you to manage an SQLite database. Many features have been added to this product over time, and with the growth in mobile phones, this quick and easy relational database system is set to quantum leap its use within the mobile and tablet application space.

SQLite uses PostgreSQL as a point of reference. SQLite does not enforce any type checking: the schema does not constrain it, since the type of a value is dynamic, and a trigger can be activated when converting a data type.

About SQL

In June 1970, a research paper was published by Dr. E.F. Codd called A Relational Model of Data for Large Shared Data Banks. The Association for Computing Machinery (ACM) accepted Codd's data and technology model, which has today become the standard for the RDBMS (Relational Database Management System).
IBM Corporation had invented the language, called Structured English Query Language (SEQUEL), where the word "English" was dropped to become SQL; it is still pronounced "sequel" by many today. IBM went on to build products on this SQL technology, as did Oracle, Sybase, and Microsoft with SQL Server. The standard commercial relational database management system language today is SQL.

Today, there are ANSI standards for SQL, and there are many variations of this technology. Besides the mentioned manufacturers, there are also others available in the open source world, for example, SQL query engines such as Presto. Presto is a distributed SQL query engine released under an open source license, made to execute interactive analytic queries. Presto queries run against data sources of a variety of sizes—gigabytes to petabytes. Companies such as Facebook and Dropbox use the Presto SQL engine for their queries and analytics in data warehouse and related applications.

SQL is made up of a data manipulation and definition language built with tuple and algebra calculation in a relational format. The SQL language has a variety of statements, but most would recognize the INSERT, SELECT, UPDATE and DELETE statements. These statements form a part of the database schema management process and aid data access and security access. SQL includes procedural elements as part of its setup.

Is SQLite used anywhere?

Companies may use applications without being aware of the SQL engines that drive their data storage and information. Although SQL became a standard with the American National Standards Institute (ANSI) in 1986, SQL features and functionality are not 100% portable among different SQL systems and require code changes to be useful. These standards are always up for revision to ensure ANSI compliance is maintained. There are many variants of SQL engines on the market from companies such as Oracle, SQL Server (Microsoft), DB2 (IBM), Sybase (SAP), MySQL (Oracle), and others. Different companies operate several types of pricing structures, such as free open source, paid per seat, or paid by transactions, server types, or loads. Today, there is a preference for using server technology and SQL in the cloud with different providers, for example, Amazon Web Services (AWS). SQLite, as its name suggests, is SQL in a light environment, which is also flexible and versatile.

Enveloped and embedded database among other processes

SQLite has been designed and developed to work and coexist with other applications and processes in its area. The RDBMS is tightly integrated with the native application software, which needs to store information, but it is masked and hidden from users and requires minimal administration or maintenance. SQLite can work through different APIs while staying hidden from users, and it needs minimal administration or maintenance. The RDBMS is intertwined with the applications that use it: it requires minimal supervision; there is no network traffic; no network access conflicts or configuration; no access limitations with privileges or permissions; and a greatly reduced overhead. These characteristics make it easier and quicker to deploy your applications to the app stores or other locations. The different components work seamlessly together in a harmonized way to link up data with the SQLite library and other processes.
For example, an Apache process or a C/C++ process works together with the SQLite C library to interface and link with it, so that the database becomes seamless and integrates with the operating system. SQLite has been developed and integrated in such a way that it will interface and gel with a variety of applications and multiple solutions. As a lightweight RDBMS, it can stand on its own by virtue of its versatility and is not cumbersome or too complex to benefit your application. It can be used on many platforms and comes with a binary compatible file format, which is easier to dovetail within your mobile application.

Different types of I.T. professionals will be involved with SQLite, since it holds the data and affects performance: database designers, user or mobile interface design specialists, analysts, and consultancy types. These professionals can use their previous knowledge of SQL to quickly grasp SQLite. SQLite can act as a data processor for information or deal with data in memory, and perform well either way. The different pieces of the software jigsaw can interface properly by using the C API to SQLite, which can in turn be wrapped by code in another programming language. For example, C or C++ code can be programmed to communicate with the SQLite C API, which will then talk to the operating system, and thus communicate with the database engine. Another language such as PHP can communicate using its own language data objects, which will in turn communicate with the SQLite C API and the database.

SQLite is a great database to learn, especially for computer scientists who want to use a tool that can open your mind to investigating caching, B-tree structures and algorithms, database design architecture, and other concepts.

The architecture of the SQLite database

The SQLite library implements its many functions in the C language; the TCL language binding, for example, is implemented through the source file tclsqlite.c. Since many technologies and reserved words are used across languages, in this case the C language, the prefix sqlite3 is used at the beginning of function names within the SQLite library to avoid any confusion. The core functions are to be found in main.c, legacy.c, and vdbeapi.c.

The Tokeniser code base is found within tokenize.c. Its task is to look at strings that are passed to it and partition or separate them into tokens, which are then passed to the parser.

The Parser code base is found within parse.y. The Lemon LALR(1) parser generator is the parser for SQLite; it uses the context of tokens and assigns them a meaning. To keep within the low-sized footprint of the RDBMS, only one C file is used for the parser generator.

The Code Generator is then used to turn the parser's output into code. It produces virtual machine code that will carry out the work of the SQL statements. Several files, such as attach.c, build.c, delete.c, select.c, and update.c, handle the different SQL statements and syntax.

The virtual machine executes the code that is generated by the Code Generator. It has in-built storage, and each instruction may have up to three additional operands as part of its code. The source file is called vdbe.c, which is a part of the SQLite database library. Built in is also a computing engine, which has been specially created to integrate with the database system. Two further files link the virtual machine to the rest of the SQLite library: the vdbe.h header file, which defines its interface, and vdbeaux.c, which has utilities used by other modules.
The vdbeapi.c file also connects the virtual machine with sqlite3_bind and other related interfaces. The C language routines are called from the SQL functions that reference them. For example, functions such as count() are defined in func.c, and date functions are located in date.c.

B-tree is the type of table implementation used in SQLite, and the C source file is btree.c. The btree.h header file defines the interface to the B-tree system. There is a separate B-tree for every table and index, but all of them are held within the same file. There is a header portion within btree.c which describes the details of the B-tree in a large comment field.

The B-tree asks the Pager, or Page Cache, for data in a fixed-size format. The default page size is 1024 bytes, but it can be between 512 and 65536 bytes. Commit and Rollback operations, coupled with the caching, reading, and writing of data, are handled by the Page Cache or Pager. Data locking mechanisms are also handled by the Page Cache. The C file pager.c is implemented to handle these requests within the SQLite library, and the header file is pager.h.

The OS interface is defined in os.h. It addresses how SQLite can be used on different operating systems and remain transparent and portable to the user, thus becoming a valuable solution for any developer. An abstraction layer to handle Win32 and POSIX-compliant systems is also in place. Different operating systems have their own C files: for example, os_win.c is for Windows and os_unix.c is for Unix, coupled with their own os_win.h and os_unix.h header files.

util.c is the C file that handles memory allocation and string comparisons. The utf.c C file holds the Unicode conversion subroutines.

Since the memory of the device is limited and the database size has the same constraints, the developer has to think outside the box to use these resources well. These types of memory and resource management form a part of an approach similar to the overlay techniques used in the past, when disk and memory were limited. The SQL engine itself can be used as a mechanism for computing data, for example:

SELECT parameter1, STDDEV(parameter2)
FROM Table1
GROUP BY parameter1
HAVING parameter1 > MAX(parameter3)

Features

As part of its standards support, SQLite uses and implements most of the SQL-92 standard, but not all of the potential features or parts of functionality are realized. For example, the support for triggers is not 100%, as triggers cannot write output to views; as a substitute, the INSTEAD OF statement can be used. As mentioned previously, the way a type is used for a column is different: most relational database systems assign the type to the column, whereas SQLite assigns it to individual values. SQLite will convert a string into an integer if the column's preferred type is an integer. This is a good piece of functionality when SQLite is bound to a dynamically typed scripting language, but the technique is not portable to other RDBMS systems. SQLite also has its critics, who argue that it does not have as good a data integrity mechanism as systems with statically typed columns. As mentioned previously, it has bindings to many languages, such as Basic, C, C#, C++, D, Java, JavaScript, Lua, PHP, Objective-C, Python, Ruby, and TCL. Its popularity with the open source community and its usage by customers and developers have enabled its growth to continue.
This lightweight RDBMS can be used with Google Chrome, Firefox, Safari, Opera, and the Android browsers, and it has middleware support using ADO.NET, ODBC, COM (ActiveX), and XULRunner. It also has support for web application frameworks such as Django (Python-based), Ruby on Rails, and Bugzilla (Mozilla). There are other applications, such as Adobe Photoshop Lightroom and Skype, which use SQLite. It is also part of the Windows 8, Symbian OS, Android, and OpenBSD operating systems. Apple also includes it in OS X, with API support available to applications.

Apart from not having the large overhead of other database engines, SQLite has some major enhancements, such as the EXPLAIN keyword and its manifest typing. To control constraint conflicts, the REPLACE and ON CONFLICT statements are used. Within the same query, multiple independent databases can be accessed using the DETACH and ATTACH statements. New SQL functions and collating sequences can be created using the predefined APIs, which offer much more flexibility.

As there is no configuration required, SQLite just does the job and works. There is no need to initialize, stop, restart, or start server processes, and no administrator is required to create the database with proper access control or security permissions. After any failure, no user actions are required to recover the database, since it is self-repairing. SQLite is more advanced than it is given credit for at first glance.

Unlike other RDBMS, it does not require a server setup to serve up data, nor does it incur network traffic costs. There are no TCP/IP calls and no frequent communication backwards and forwards. SQLite is direct: the operating system process deals with database access to its file, and controls database writes and reads with no middleman process handshaking. By having no server backend, the work of installation, configuration, and administration is reduced significantly, and access to the database is granted to any program that requires this type of data operation. This is an advantage in one way, but also a disadvantage for security, for protection from data-driven misuse, and for data concurrency or data row-locking mechanisms. It also allows the database to be accessed several times by different applications at the same time.

It supports a form of portability through its cross-platform database file. The database file can be updated on one system and copied to another, on either 32-bit or 64-bit systems with different architectures; this does not make a difference to SQLite. The use of different architectures, and the developers' promise to keep the file format stable and compatible across previous, current, and future developments, will allow this database to grow and thrive. SQLite databases don't need to migrate old data into newly formatted, upgraded databases; it just works. By having a single disk file for the database, the information can be copied onto a USB stick and shared, or simply reused on another device very quickly, keeping all the information intact; most other RDBMSes do not offer this kind of single-file portability.

Another feature of this portable database is its size, which can start with a single 512-byte page and expand to 2147483646 pages at 65536 bytes per page, or 140,737,488,224,256 bytes, which equates to about 140 terabytes.
Most other RDBMS are much larger; IBM's Cloudscape is small, with a 2MB jar file, but it is still larger than SQLite. The Firebird alternative's client (frontend) library is about 350KB, whereas the Oracle Berkeley DB library is around 450KB without SQL support, offering only a simple key/value pair option.

This advanced, portable database system and its source code are in the public domain; the authors make no copyright claim on the source code. However, there are open source license terms and controls for some test code and documentation. This is great news for developers who might want to code up new extensions or database functionality that works with their programs, which could be made into a 'product extension' for SQLite. You cannot get this sort of access to the source code of most other SQL engines, since everything has a patent, limited access, or just no access. There are signed affidavits by the developers disowning any copyright interest in the SQLite code. SQLite is different because it is simply not governed or ruled by copyright law, which is arguably the way software should really work and be used.

SQLite uses manifest typing. This means that you can define a column with a datatype of integer, but its property is dictated by the values put into it and not by the column itself. This can allow any value to be stored in any declared data type for a column, with the exception of an integer primary key. This feature suits TCL or Python, which are dynamically typed programming languages.

When you declare a column as char(50) in most RDBMS, the database system will allocate the full 50 bytes of disk space even if you do not use them. In SQLite, if only three characters of a char(50) column are used, the disk space used is only those three characters plus a couple of bytes of overhead (including the data type and length), not 50 characters as in other database engines. This type of operation reduces disk space usage and uses only the space that is required. With small allocations and variable-length records, the application runs faster, database access is quicker, manifest typing can be used, and the database stays small and nimble.

The ease of using this RDBMS makes it easier for most programmers at an intermediate level to create applications using this technology, with its detailed documentation and examples. Other RDBMS are internally complex, with links to data structures and objects. SQLite instead uses a virtual machine language, which can be inspected by placing the EXPLAIN reserved word in front of a query. The virtual machine has benefitted this database engine by providing an excellent, controlled environment between the backend (where the results are computed and outputted) and the frontend (where the SQL is parsed and executed).

The SQL implementation is comparable to other RDBMS, especially given its lightweight base; it supports recursive triggers and the FOR EACH ROW behavior. The FOR EACH STATEMENT form is not currently supported, but that functionality cannot be ruled out in the future. There is ALTER TABLE support with some exceptions: RENAME TABLE and ADD COLUMN are supported, but DROP COLUMN, ADD CONSTRAINT, and ALTER COLUMN are not. Again, this functionality cannot be ruled out in the future. The RIGHT OUTER JOIN and FULL OUTER JOIN are not supported, but the LEFT OUTER JOIN is implemented.
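As a quick, hedged illustration (the Customer and Orders tables here are hypothetical, not from the book), the supported join looks like the following; rewriting it as a RIGHT OUTER JOIN or FULL OUTER JOIN would be rejected with an error by the SQLite versions discussed here:

SELECT c.FullName, o.OrderDate
FROM Customer c
LEFT OUTER JOIN Orders o ON o.CustomerID = c.ID;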
The views within this RDBMS are read-only. As described so far in this article, SQLite is a nimble and easy-to-use database that developers can engage with quickly, using existing skills, to deliver systems to mobile devices and tablets far more simply than ever before. With the advantage of today's HTML5 and other JavaScript frameworks, the advancement of SQL and the number of SQLite installations will take a quantum leap.

Working with SQLite

The website for SQLite is www.sqlite.org, where you can download all the binaries for the database, the documentation, and the source code, which work on operating systems such as Linux, Windows, and Mac OS X. The SQLite shared library or DLL is the library to be used for the Windows operating system and can be installed or referenced via Visual Studio with the C++ language. The developer writes code against the library, which is linked as a reference from the application; when execution takes place, the DLL is loaded and all references in the code link to those in the DLL at the right time. The SQLite3 command-line program, CLP, is a self-contained program that has all the components built in for you to run at the command line. It also comes with an extension for TCL, so within TCL you can connect to and update the SQLite database. SQLite downloads come with a TAR version for Unix systems and a ZIP version for Windows systems.

iOS with SQLite

Among the hundreds of thousands of apps in all the app stores, it would be difficult to find one that does not require a database of some sort to store or handle data in a particular way. There are different formats of data, called datafeeds, but they all require some temporary or permanent storage. Small amounts of data may not need it, but medium or large amounts of data will require a storage mechanism such as a database to assist the app. Using SQLite with iOS enables developers to use their existing skills to run their DBMS on this platform as well.

For SQLite, there is the embedded C library that is available to use with iOS through the Xcode IDE. Apple fully supports SQLite, which is pulled in through an include statement as part of the library call, but there is no ready-made, easy mechanism to engage with it. Developers therefore also tend to use FMDB, a Cocoa/Objective-C wrapper around SQLite. SQLite is fast and lightweight, it lets developers use their existing SQL knowledge, it is reliable and supported by Apple on Mac OS and iOS, it has support from many developers, and it can be integrated without much outside involvement.

The libsqlite3 library is added under the General tab once the main project name is highlighted on the left-hand side. Then, at the bottom of the page, within 'Linked Frameworks and Libraries', click + and a modal window appears. Enter the word sqlite and then select the libsqlite3.dylib library. This is one way to set up the environment to get going. In effect, it is this library, libsqlite3.dylib, within the frameworks section which allows your code to work with the SQLite API and its commands.

The SQLite database file is created in much the same way as a text file is created in iOS: it is saved to the location iOS provides for the app's files (the Documents directory). Before anything can happen, the database must be opened and made ready for querying; upon success, the open call returns the constant SQLITE_OK, which is defined as 0.
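The following is a minimal sketch in C of that first step: opening (or creating) the database file and checking for SQLITE_OK. It is an illustration rather than code from the book, and the file path handling is left to the caller as an assumption:

#include <sqlite3.h>
#include <stdio.h>

int open_database(const char *path)
{
    sqlite3 *db = NULL;

    /* sqlite3_open creates the file if it does not exist and returns SQLITE_OK (0) on success */
    if (sqlite3_open(path, &db) != SQLITE_OK) {
        printf("Failed to open database: %s\n", sqlite3_errmsg(db));
        sqlite3_close(db);
        return 1;
    }

    /* ... create tables and run queries here, for example with sqlite3_exec ... */

    sqlite3_close(db);
    return 0;
}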
In order to create a table in the SQLite database using the iOS connection and API, the sqlite3_exec function is set up to work with the open sqlite3 object, the CREATE TABLE SQL statement, and a callback function. When the statement is executed and a status of SQLITE_OK is returned, it is successful; otherwise, the constant SQLITE_ERROR (defined as 1) is returned. Once the wrapper is in place and access to the SQLite commands is available, it becomes an easier process to use SQLite with iOS.

Summary

In this article, you read about the history of SQL, the impact of relational databases, and the use of a mobile SQL database, namely SQLite. It outlines the history and beginnings of SQLite and how it has grown to be the most used database on mobile devices so far.

Resources for Article:

Further resources on this subject:

Team Project Setup [article]
Introducing Sails.js [article]
Advanced Fetching [article]

React Conf 2019: Concurrent Mode preview out, CSS-in-JS, React docs in 40 languages, and more

Bhagyashree R
29 Oct 2019
9 min read
React Conf 2019 wrapped up last week. It was kick-started with a keynote by Tom Occhino and Yuzhi Zheng from the React team, who both talked about Concurrent Mode and Suspense. They were followed by Frank Yan, also from the React team, who explained how they are building the "new Facebook" with React and Relay. One of the major highlights of his talk was the CSS-in-JS library that will be open-sourced once ready. Sophie Alpert, former manager of the React team, gave a talk on building a custom React renderer. To demonstrate that, she implemented a small version of ReactDOM in just 30 minutes. There were many other lightning talks and presentations on translated React, building inclusive apps by improving their accessibility, and much more. React Conf 2019 was a two-day event that took place from Oct 24-25 at Lake Las Vegas, Nevada. This conference brought together front-end and full-stack developers to "share knowledge, skills, to network, and just to have fun."

React's long-term goal: "Making it easier to build great user experiences"

Tom Occhino, Engineering Director of the React group, took to the stage to talk about the goals for React and the community. He says that React's long-term goal is to make it easier for developers to build great user experiences. "Easier to build" means improving the developer experience. The three factors that contribute to a great developer experience are a low barrier to entry, developer productivity, and the ability to scale. React is constantly working towards improving the developer experience by introducing new features. Two such features are Concurrent Mode and Suspense.

Concurrent Mode

Concurrent Mode is a set of features to make React apps more responsive by rendering component trees without blocking the main thread. It gives React the ability to interrupt big blocks of low-priority work in order to focus on higher-priority work like responding to user input. This will enable React to work on several state updates concurrently while removing jarring and too-frequent DOM updates. The team also released the first early community preview of Concurrent Mode last week.

https://twitter.com/reactjs/status/1187411505001746432

Suspense

Suspense was introduced as an improvement to the developer experience when dealing with asynchronous data fetching within React apps. It suspends your component rendering and shows a fallback until some condition is met. Occhino describes Suspense as a "React system for orchestrating asynchronous loading of code, data, and resources." He adds, "Suspense lets the component wait for something before they render. This helps consolidate nested dependencies and nested spinners and things behind the single simple loading experience."

Towards the end of his keynote, Occhino also touched upon how the team plans to make the React community more inclusive and diverse. He said, "Over the past 10 years, I have learned that diverse teams build better products and make better decisions. Everyone working on React shares my conviction about this." He adds, "Up until recently we have taken a pretty passive stance to building and shaping the React community. We have a responsibility to you all and I feel like we let many of you down. We are committed to doing better!" As a first step, the team has now replaced the React code of conduct with the contributor covenant.
Read also: #Reactgate forces React leaders to confront community's toxic culture head on

What the React team is working on next

Yuzhi Zheng, Engineering Manager for the React and Relay teams at Facebook, gave an insight into what projects the core teams are working on. She started off by giving a recap of Hooks, one of the most-awaited React features announced at React Conf 2018. "Hooks are designed for the future of React in the way that it naturally encourages code that is compatible with all the plumbing features such as accessibility, server-side rendering, suspense, and concurrent mode. Since its release, the reception of Hooks has been really positive," she shared. If you want to understand the fundamentals of React Hooks and use them for implementing responsive design and more, check out our book, Learning Hooks.

Another long-term project the team is focusing on is giving developers a way to easily build accessibility features in React. Currently, developers can create accessible websites using standard HTML techniques, but that approach has some limitations. To help build accessibility directly into React, the team is working on two areas: managing focus and input interfaces. For managing focus, the team plans to add primitives that provide "a more structured way of making sure component flows well" for cases like React portals and Suspense fallbacks, and that are accessible by default. For input interfaces, they plan to add support for rich gestures that work across platforms and are accessible by default.

The team is also focusing on improving initial render times. Server-side rendering helps in reducing the amount of CPU usage on the client for the initial render to some extent, but it does have some limitations. To address these limitations, the team plans to add built-in support for server-side rendering. This will work with lazily loaded components to reduce the bytes needed on the client, support streaming markup down in chunks, and be fully compatible with Concurrent Mode and Suspense.

The CSS-in-JS library

Frank Yan, Engineering Manager in the React group at Facebook, talked about how the team has rebuilt and redesigned the Facebook website and the key lessons they have learned along the way. The new Facebook website is a single-page app with React organizing the HTML and JavaScript into components from the top down, and with GraphQL and Relay colocating the queries declaratively in the components. The only key part that the team did not reorganize was the CSS. They instead created a new library to embed styles in components, called CSS-in-JS. It aims to make the styles easier to read, understand, and update. Its syntax is inspired by React Native and other frameworks. Since it enables you to embed styles inside JavaScript files, you can also use JavaScript tooling like type checkers and linters.

React docs translated into 40 languages

Nat Alison is a freelance front-end developer who helped the React team coordinate translations of reactjs.org into 40 languages. She shared why and how they were able to translate the docs for this massively popular library. She shared, "More than 80% of the world's population does not know English. If we restrict React, one of the most popular JavaScript frameworks, we restrict who gets to create and shape the web." Providing officially translated docs will make it easier for the many non-English-speaking React developers to understand and use it in their projects.
This will also prevent users from creating unofficial translations, which can be incorrect, outdated, or difficult to find. Initially, they thought of integrating a SaaS platform that allows users to submit translations, but this was not a feasible solution. Then they decided to check out the solution used by Vue, which is maintaining separate repositories for each language forked from the original repo. Similar to Vue, the team also created a bot that periodically tracks for changes in the English repo and submits pull requests whenever there is a change. If you want to contribute to translating React docs in your language, check out the IsReactTranslatedYet website. Developing accessible apps Brittany Feenstra, a developer at Formidable, took to the stage to talk about why accessibility is important and how you can approach it. Accessibility or a11y is making your apps and websites usable for everyone, including people with any kind of disabilities.  There are four types of disabilities that developers need to design for: visual, auditory, motor, and cognitive. Feenstra mentioned that though we all are aware of the importance of accessibility, we often “end up saving it for later” because of tight deadlines. Feenstra, however, compares accessibility with marathons. It is not something that you can achieve in just one sprint, she says. You should instead look at it as a training program that you will follow when participating in a marathon. You need to take a step-by-step approach to make an accessible app. If we do that “we will be way less fatigued and well-equipped,” she adds. Sharing some starting tips she said that we need to focus on three areas. First, learn to run, or in accessibility context, understand the HTML semantics then explore reference patterns, navigation, and focus traps. Second, improve nutritional habits, or in accessibility context, use environments and tools that help us write sturdier code. She recommends using axe, an accessibility checker for WCAG 2 and Section 508 accessibility. Also, check out the tools that basically simulate how people with visual impairment will see your UI such as NoCoffee and I want to see like the colour blind. She emphasizes on linting and testing your code for accessibility with the help of eslint-plugin-jsx-a11y and accessibility assessment automation tools. Third, cross-train and stretch, or in accessibility context, learn to “interact with the UI in ways that let us understand the update we are making to our code.” “React is Fiction” This was a talk by Jenn Creighton, a Frontend Architect at The Wing, who comes from a creative writing background. “Writing React to me felt like coming home. It was really familiar in a way that I could not pinpoint,” she said. Then she realized that writing React reminded her of fiction and merging the two disciplines helped her write better components. Creighton drew the similarities between developing in React and creative writing. One of the key principles of creative writing is “Show, don’t tell” that advises authors to describe a situation instead of just telling it. This will help engage the readers as they will be able to picture the situation in their heads. According to Creighton, React also has a similar principle: “Declarative, not imperative.”  React is declarative, which allows developers to describe what the final state should be, instead of listing all the steps to reach that state. There were many other exciting talks about progressive web animations, building React-Select, and more. 
Check out the live streams to watch the full talks: Day1: https://www.youtube.com/watch?v=RCiccdQObpo Day2: https://www.youtube.com/watch?v=JDDxR1a15Yo&t=2376s Ionic React released; Ionic Framework pivots from Angular to a native React version ReactOS 0.4.12 releases with kernel improvements, Intel e1000 NIC driver support, and more React Native 0.61 introduces Fast Refresh for reliable hot reloading

Modifying an Existing Theme in Drupal 6: Part 1

Packt
20 Oct 2009
10 min read
Setting up the workspace

There are several software tools that can make your work modifying themes more efficient. Though no specific tools are required to work with Drupal themes, there are a couple of applications that you might want to consider adding to your tool kit. I work with Firefox as my primary browser, principally due to the fact that I can add various extensions into Firefox that make my life easier. The Web Developer extension, for example, is hugely helpful when dealing with CSS and related issues. I recommend the combination of Firefox and the Web Developer extension to anyone working with Drupal themes. Another extension popular with many developers is Firebug, which is very similar to the Web Developer extension, and indeed more powerful in several regards.

Pick up Web Developer, Firebug, and other popular Firefox add-ons at https://addons.mozilla.org/en-US/firefox/

When it comes to working with PHP files and the various theme files, you will need an editor. The most popular application is probably Dreamweaver, from Adobe, although any editor that has syntax highlighting would work well too. I use Dreamweaver as it helps me manage multiple projects and provides a number of features that make working with code easier (particularly for designers). If you choose to use Dreamweaver, you will want to tailor the program a little bit to make it easier to work with Drupal theme files. Specifically, you should configure the application preferences to open and edit the various types of files common to PHPTemplate themes. To set this up, open Dreamweaver, then:

Go to the Preferences dialogue.
Open file types/editors.
Add the following list of file types to Dreamweaver's open in code view field: .engine .info .module .install .theme
Save the changes and exit.

With these changes, your Dreamweaver application should be able to open and edit all the various PHPTemplate theme files.

Previewing your work

Note that, as a practical matter, previewing Drupal themes requires the use of a server. Themes are really difficult to preview (with any accuracy) without a server environment. A quick solution to this problem is the XAMPP package. XAMPP provides a one-step installer containing everything you need to set up a server environment on your local machine (Apache, MySQL, PHP, phpMyAdmin, and more). Visit http://www.ApacheFriends.org to download XAMPP and you can have your own dev server quickly and easily.

Another tool that should be at the top of your list is the Theme developer extension for the popular Drupal Devel module. Theme developer can save you untold hours of digging around trying to find the right function or template. When the module is active, all you need to do is click on an element and the Theme developer pop-up window will show you what is generating the element, along with other useful information. In the example later in this article, we will also use another feature of the Devel module, that is, the ability to automatically generate sample content for your site.

You can download Theme developer as part of the Devel project at Drupal.org: http://drupal.org/project/devel

Note that Theme developer only works on Drupal 6 and, due to the way it functions, is only suitable for use in a development environment—you don't want this installed on a client's public site! Visit http://drupal.org/node/209561 for more information on the Theme developer aspects of the Devel module.
The article includes links to a screencast showing the module in action—a good quick start and a solid help in grasping what this useful tool can do.

Planning the modifications

We're going to base our work on the popular Zen theme. We'll take Zen, create a new subtheme, and then modify the subtheme until we reach our final goal. Let's call our new theme "Tao". The Zen theme was chosen for this exercise because it has a great deal of flexibility. It is a good solid place to start if you wish to build a CSS-based theme. The present version of Zen even comes with a generic subtheme (named "STARTERKIT") designed specifically for themers who wish to take a basic theme and customize it. We'll use the Starterkit subtheme as the way forward in the steps that follow.

The Zen theme is one of the most active theme development projects. Updated versions of the theme are released regularly. We used version 6.x-1.0-beta2 for the examples in this article. Though that version was current at the time this text was prepared, it is unlikely to be current at the time you read this. To avoid difficulties, we have placed a copy of the files used in this article in the software archive that is provided on the Packt website. Download the files used in this article at http://www.packtpub.com/files/code/5661_Code.zip. You can download the current version of Zen at http://drupal.org/project/zen.

Any time you set off down the path of transforming an existing theme into something new, you need to spend some time planning. The principle here is the same as in many other areas of life: a little time spent planning at the front end of a project can pay off big in savings later. A proper dissertation on site planning and usability is beyond the scope of this article; so for our purposes let us focus on defining some loose goals and then work towards satisfying a specific wish list for the final site functionality.

Our goal is to create a two-column blog-type theme with solid usability and good branding. Our hypothetical client for this project needs space for advertising and a top banner. The theme must also integrate a forum and a user comments functionality. Specific changes we want to implement include:

Main navigation menu in the right column
Secondary navigation mirrored at the top and bottom of each page
A top banner space below the top nav but above the branding area
Color scheme and fonts to match brand identity
Enable and integrate the Drupal blog, forum, and comments modules

In order to make the example easier to follow and to avoid the need to install a variety of third-party extensions, the modifications we will make in this article will be done using only the default components—excepting only the theme itself, Zen. Arguably, were you building a site like this for deployment in the real world (rather than simply for skills development) you might wish to consider implementing one or more specialized third-party extensions to handle certain tasks.

Creating a new subtheme

Install the Zen theme if you have not done so before now; once that is done we're ready to create a new subtheme. First, make a copy of the directory named STARTERKIT and place the copied files into the directory sites/all/themes. Rename the directory "tao". Note that in Drupal 5.x, subthemes were kept in the same directory as the parent theme, but for Drupal 6.x this is no longer the case. Subthemes should now be placed in their own directory inside the sites/all/themes/ directory.
Note that the authors of Zen have chosen to vary from the default stylesheet naming. Most themes use a file named style.css for their primary CSS. In Zen, however, the file is named zen.css. We need to grab that file and incorporate it into Tao. Copy the Zen CSS file (zen/zen/zen.css), rename it tao.css, and place it in the Tao directory (tao/tao.css).
When you look in the zen/zen directory, in addition to the key zen.css file, you will note the presence of a number of other CSS files. We need not concern ourselves with the other CSS files. The styles contained in those stylesheets will remain available to us (we inherit them as Zen is our base theme) and, if we need to alter them, we can override the selectors as needed via our new tao.css file.
In addition to renaming the theme directory, we also need to rename any other theme-name-specific files or functions. Do the following:
1. Rename the STARTERKIT.info file to tao.info.
2. Edit the tao.info file to replace all occurrences of STARTERKIT with tao.
3. Open the tao.info file and find this copy:
; The name and description of the theme used on the admin/build/themes page.
name = Zen Themer's StarterKit
description = Read the <a href="http://drupal.org/node/226507">online docs</a> on how to create a sub-theme.
Replace that text with this copy:
; The name and description of the theme used on the admin/build/themes page.
name = Tao
description = A 2-column fixed-width sub-theme based on Zen.
Make sure the name = and description = lines are not commented out, else they will not register.
4. Edit the template.php file to replace all occurrences of STARTERKIT with tao.
5. Edit the theme-settings.php file to replace all occurrences of STARTERKIT with tao.
6. Copy the file zen/layout-fixed.css and place it in the tao directory, creating tao/layout-fixed.css.
7. Include the new layout-fixed.css by modifying the tao.info file: change stylesheets[all][] = layout.css to stylesheets[all][] = layout-fixed.css.
The .info file functions similarly to a .ini file: it provides configuration information, in this case, for your theme. A good discussion of the options available within the .info file can be found on the Drupal.org site at http://drupal.org/node/171205
Making the transition from Zen to Tao
The process of transforming an existing theme into something new consists of a set of tasks that can be categorized into three groups:
Configuring the theme
Adapting the CSS
Adapting the templates and themable functions
Configuring the theme
As stated previously, the goal of this redesign is to create a blog theme with solid usability and a clean look and feel. The resulting site will need to support forums and comments and will need advertising space. Let's start by enabling the functionality we need, and then we can drop in some sample content. Technically speaking, adding sample content is not 100% necessary, but practically speaking, it is extremely useful, as it lets us see the impact of our work with the CSS, the templates, and the themable functions.
Before we begin, enable your new theme if you have not done so already. Log in as the administrator, go to the themes manager (Administer | Site building | Themes), and enable the theme Tao. Set it to be the default theme and save the changes. Now we're set to begin customizing this theme, first through the Drupal system's default configuration options, and then through our custom styling.
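For reference, this is roughly how the edited portion of tao.info could look after the steps above. It is only a sketch of the lines we changed; the remaining lines inherited from STARTERKIT (core compatibility, regions, any additional stylesheets) should be left exactly as they appear in your copy of Zen:
; The name and description of the theme used on the admin/build/themes page.
name = Tao
description = A 2-column fixed-width sub-theme based on Zen.
; Use Zen's fixed-width layout instead of the default layout.css.
stylesheets[all][] = layout-fixed.css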
Enabling Modules
To meet the client's functional requirements, we need to activate several features of Drupal which, although contained in the default distro, are not activated by default. Accordingly, we need to identify the necessary modules and enable them. Let's do that now.
Access the module manager screen (Administer | Site building | Modules), and enable the following modules:
Blog (enables blog-type presentation of content)
Contact (enables the site contact forms)
Forum (enables the threaded discussion forum)
Search (enables users to search the site)
Save your changes and let's move on to the next step in the configuration process.
Performance Optimization

Packt
19 Dec 2014
30 min read
This article, written by Mark Kerzner and Sujee Maniyam, the authors of HBase Design Patterns, covers how to write high-performance and scalable HBase applications. In particular, we will take a look at the following topics:
The bulk loading of data into HBase
Profiling HBase applications
Tips to get good performance on writes
Tips to get good performance on reads
Loading bulk data into HBase
When deploying HBase for the first time, we usually need to import a significant amount of data. This is called initial loading or bootstrapping. There are three methods that can be used to import data into HBase, given as follows:
Using the Java API to insert data into HBase. This can be done in a single client, using single or multiple threads.
Using MapReduce to insert data in parallel (this approach also uses the Java API), as shown in the following diagram:
Using MapReduce to generate HBase store files in parallel in bulk and then import them into HBase directly. (This approach does not require the use of the API; it does not require code and is very efficient.)
On comparing the three methods speed-wise, we have the following order: Java client < MapReduce insert < HBase file import
The Java client and MapReduce use HBase APIs to insert data. MapReduce runs on multiple machines and can exploit parallelism. However, both of these methods go through the write path in HBase. Importing HBase files directly, however, skips the usual write path. HBase files already have data in the correct format that HBase understands. That's why importing them is much faster than using MapReduce and the Java client.
We covered the Java API earlier. Let's start with how to insert data using MapReduce.
Importing data into HBase using MapReduce
MapReduce is the distributed processing engine of Hadoop. Usually, programs read/write data from HDFS. Luckily, HBase supports MapReduce. HBase can be the source and the sink for MapReduce programs. A source means MapReduce programs can read from HBase, and sink means results from MapReduce can be sent to HBase. The following diagram illustrates various sources and sinks for MapReduce:
The diagram we just saw can be summarized as follows:
Scenario 1 (Source: HDFS, Sink: HDFS): This is a typical MapReduce method that reads data from HDFS and also sends the results to HDFS.
Scenario 2 (Source: HDFS, Sink: HBase): This imports the data from HDFS into HBase. It's a very common method that is used to import data into HBase for the first time.
Scenario 3 (Source: HBase, Sink: HBase): Data is read from HBase and written to it. It is most likely that these will be two separate HBase clusters. It's usually used for backups and mirroring.
Importing data from HDFS into HBase
Let's say we have lots of data in HDFS and want to import it into HBase. We are going to write a MapReduce program that reads from HDFS and inserts data into HBase. This is depicted in the second scenario in the table we just saw. Now, we'll set up the environment for the following discussion. In addition, you can find the code and the data for this discussion in our GitHub repository at https://github.com/elephantscale/hbase-book.
The dataset we will use is the sensor data. Our (imaginary) sensor data is stored in HDFS as CSV (comma-separated values) text files.
This is how their format looks:
sensor_id, max temperature, min temperature
Here is some sample data:
sensor11,90,70
sensor22,80,70
sensor31,85,72
sensor33,75,72
We have two sample files (sensor-data1.csv and sensor-data2.csv) in our repository under the /data directory. Feel free to inspect them.
The first thing we have to do is copy these files into HDFS. Create a directory in HDFS as follows:
$ hdfs dfs -mkdir hbase-import
Now, copy the files into HDFS:
$ hdfs dfs -put sensor-data* hbase-import/
Verify that the files exist as follows:
$ hdfs dfs -ls hbase-import
We are ready to insert this data into HBase. Note that we are designing the table to match the CSV files we are loading, for ease of use. Our row key is sensor_id. We have one column family and we call it f (short for family). We will store two columns, max temperature and min temperature, in this column family.
Pig for MapReduce
Pig allows you to write MapReduce programs at a very high level, and inserting data into HBase is just as easy. Here's a Pig script that reads the sensor data from HDFS and writes it in HBase:
-- ## hdfs-to-hbase.pig
data = LOAD 'hbase-import/' using PigStorage(',') as (sensor_id:chararray, max:int, min:int);
-- describe data;
-- dump data;
Now, store the data in hbase://sensors using the following line of code:
STORE data INTO 'hbase://sensors' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:max,f:min');
In the first command, we load data from the hbase-import directory in HDFS. The schema for the data is defined as follows:
sensor_id : chararray (string)
max : int
min : int
The describe and dump statements can be used to inspect the data; in Pig, describe will give you the structure of the data object you have, and dump will output all the data to the terminal.
The final STORE command is the one that inserts the data into HBase. Let's analyze how it is structured:
INTO 'hbase://sensors': This tells Pig to connect to the sensors HBase table.
org.apache.pig.backend.hadoop.hbase.HBaseStorage: This is the Pig class that will be used to write in HBase. Pig has adapters for multiple data stores.
The first field in the tuple, sensor_id, will be used as a row key.
We are specifying the column names for the max and min fields (f:max and f:min, respectively). Note that we have to specify the column family (f:) to qualify the columns.
Before running this script, we need to create an HBase table called sensors. We can do this from the HBase shell, as follows:
$ hbase shell
hbase> create 'sensors', 'f'
hbase> quit
Then, run the Pig script as follows:
$ pig hdfs-to-hbase.pig
Now watch the console output. Pig will execute the script as a MapReduce job. Even though we are only importing two small files here, we can insert a fairly large amount of data by exploiting the parallelism of MapReduce. At the end of the run, Pig will print out some statistics:
Input(s):
Successfully read 7 records (591 bytes) from: "hdfs://quickstart.cloudera:8020/user/cloudera/hbase-import"
Output(s):
Successfully stored 7 records in: "hbase://sensors"
Looks good! We should have seven rows in our HBase sensors table.
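Before moving to the shell, you can also confirm the row count from Java with a simple client-side scan. The following is a minimal sketch, not part of the book's repository (the class name SensorRowCount is ours); it assumes the sensors table exists and that the cluster is reachable via the configuration on the classpath:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class SensorRowCount {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HTable htable = new HTable(config, "sensors"); // the table we created for the Pig import
        Scan scan = new Scan();
        ResultScanner scanner = htable.getScanner(scan);
        int rows = 0;
        for (Result r : scanner) { // one Result per row key
            rows++;
        }
        scanner.close();
        htable.close();
        System.out.println("Row count: " + rows); // we expect 7 after the import
    }
}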
We can inspect the table from the HBase shell with the following commands:
$ hbase shell
hbase> scan 'sensors'
This is how your output might look:
ROW                      COLUMN+CELL
sensor11                 column=f:max, timestamp=1412373703149, value=90
sensor11                 column=f:min, timestamp=1412373703149, value=70
sensor22                 column=f:max, timestamp=1412373703177, value=80
sensor22                 column=f:min, timestamp=1412373703177, value=70
sensor31                 column=f:max, timestamp=1412373703177, value=85
sensor31                 column=f:min, timestamp=1412373703177, value=72
sensor33                 column=f:max, timestamp=1412373703177, value=75
sensor33                 column=f:min, timestamp=1412373703177, value=72
sensor44                 column=f:max, timestamp=1412373703184, value=55
sensor44                 column=f:min, timestamp=1412373703184, value=42
sensor45                 column=f:max, timestamp=1412373703184, value=57
sensor45                 column=f:min, timestamp=1412373703184, value=47
sensor55                 column=f:max, timestamp=1412373703184, value=55
sensor55                 column=f:min, timestamp=1412373703184, value=42
7 row(s) in 0.0820 seconds
There you go; you can see that seven rows have been inserted! With Pig, it was very easy. It took us just two lines of Pig script to do the import.
Java MapReduce
We have just demonstrated MapReduce using Pig, and you now know that Pig is a concise and high-level way to write MapReduce programs, as demonstrated by our previous script of essentially two lines of Pig code. However, there are situations where you do want to use the Java API, and it would make more sense to use it than a Pig script. This can happen when you need Java to access Java libraries or do some other detailed tasks for which Pig is not a good match. For that, we have provided the Java version of the MapReduce code in our GitHub repository.
Using HBase's bulk loader utility
HBase is shipped with a bulk loader tool called ImportTsv that can import files from HDFS into HBase tables directly. It is very easy to use, and as a bonus, it uses MapReduce internally to process files in parallel. Perform the following steps to use ImportTsv:
1. Stage data files into HDFS (remember that the files are processed using MapReduce).
2. Create a table in HBase if required.
3. Run the import.
Staging data files into HDFS
The first step, staging data files into HDFS, has already been outlined in the previous section. The following sections explain the remaining two steps.
Creating an HBase table
We will do this from the HBase shell. A note on regions is in order here. Regions are shards created automatically by HBase. It is the regions that are responsible for the distributed nature of HBase. However, you need to pay some attention to them in order to assure performance. If you put all the data in one region, you will cause what is called region hotspotting. What is especially nice about the bulk loader is that, when creating a table, it lets you presplit the table into multiple regions. Precreating regions will allow faster imports (because the insert requests will go out to multiple region servers).
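The next snippet shows the shell command we use to create the presplit table. If you would rather do the same thing from Java, the admin API offers an equivalent. The following is a rough sketch using the same split points (the class name CreatePresplitTable is ours, not from the book's code, and it uses the same generation of client API, HBaseAdmin and HTableDescriptor, as the other examples in this article):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class CreatePresplitTable {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(config);

        HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("sensors"));
        desc.addFamily(new HColumnDescriptor("f")); // single column family, as in the shell example

        // the same split points we pass to the shell; they yield four regions
        byte[][] splits = new byte[][] {
            Bytes.toBytes("sensor20"),
            Bytes.toBytes("sensor40"),
            Bytes.toBytes("sensor60")
        };
        admin.createTable(desc, splits);
        admin.close();
    }
}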
Here, we are creating a single column family:
$ hbase shell
hbase> create 'sensors', {NAME => 'f'}, {SPLITS => ['sensor20', 'sensor40', 'sensor60']}
0 row(s) in 1.3940 seconds
=> Hbase::Table - sensors
hbase> describe 'sensors'
DESCRIPTION                                                        ENABLED
 'sensors', {NAME => 'f', DATA_BLOCK_ENCODING => 'NONE',           true
 BLOOMFILTER => 'ROW', REPLICATION_SCOPE => '0', VERSIONS => '1',
 COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => 'FOREVER',
 KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536',
 IN_MEMORY => 'false', BLOCKCACHE => 'true'}
1 row(s) in 0.1140 seconds
We are creating regions here. Why there are exactly four regions will be clear from the following diagram:
On inspecting the table in the HBase Master UI, we will see this. Also, you can see how the Start Key and End Key values we specified are showing up.
Run the import
OK, now it's time to insert data into HBase. To see the usage of ImportTsv, do the following:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv
This will print the usage. Now, run the import as follows:
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=, -Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min sensors hbase-import/
The following list explains what the parameters mean:
-Dimporttsv.separator: Here, our separator is a comma (,). The default value is a tab (\t).
-Dimporttsv.columns=HBASE_ROW_KEY,f:max,f:min: This is where we map our input files into HBase tables. The first field, sensor_id, is our key, and we use HBASE_ROW_KEY to denote that. The rest we insert into column family f: the second field, max temp, maps to f:max, and the last field, min temp, maps to f:min.
sensors: This is the table name.
hbase-import: This is the HDFS directory where the data files are located.
When we run this command, we will see that a MapReduce job is being kicked off. This is how an import is parallelized. Also, from the console output, we can see that MapReduce is importing two files, as follows:
[main] mapreduce.JobSubmitter: number of splits:2
While the job is running, we can inspect the progress from YARN (or the JobTracker UI). One thing that we can note is that the MapReduce job only consists of mappers. This is because we are reading a bunch of files and inserting them into HBase directly. There is nothing to aggregate, so there is no need for reducers.
After the job is done, inspect the counters and we can see this:
Map-Reduce Framework
Map input records=7
Map output records=7
This tells us that the mappers read seven records from the files and inserted seven records into HBase.
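If you want to spot-check a single row from Java before scanning the whole table, a Get is enough. Here is a small sketch (the class name SensorSpotCheck is ours, not from the book's code); note that ImportTsv stores the CSV fields as plain string bytes, so we decode the values with Bytes.toString():
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class SensorSpotCheck {
    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        HTable htable = new HTable(config, "sensors");

        Get get = new Get(Bytes.toBytes("sensor11")); // a row key from the sample data
        Result result = htable.get(get);

        byte[] max = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("max"));
        byte[] min = result.getValue(Bytes.toBytes("f"), Bytes.toBytes("min"));

        System.out.println("sensor11 max=" + Bytes.toString(max)
                + " min=" + Bytes.toString(min));

        htable.close();
    }
}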
Let's also verify the data in HBase:
$ hbase shell
hbase> scan 'sensors'
ROW                 COLUMN+CELL
sensor11            column=f:max, timestamp=1409087465345, value=90
sensor11            column=f:min, timestamp=1409087465345, value=70
sensor22            column=f:max, timestamp=1409087465345, value=80
sensor22            column=f:min, timestamp=1409087465345, value=70
sensor31            column=f:max, timestamp=1409087465345, value=85
sensor31            column=f:min, timestamp=1409087465345, value=72
sensor33            column=f:max, timestamp=1409087465345, value=75
sensor33            column=f:min, timestamp=1409087465345, value=72
sensor44            column=f:max, timestamp=1409087465345, value=55
sensor44            column=f:min, timestamp=1409087465345, value=42
sensor45            column=f:max, timestamp=1409087465345, value=57
sensor45            column=f:min, timestamp=1409087465345, value=47
sensor55            column=f:max, timestamp=1409087465345, value=55
sensor55            column=f:min, timestamp=1409087465345, value=42
7 row(s) in 2.1180 seconds
Your output might vary slightly. We can see that seven rows are inserted, confirming the MapReduce counters!
Let's take another quick look at the HBase UI, which is shown here:
As you can see, the inserts go to different regions. So, on an HBase cluster with many region servers, the load will be spread across the cluster. This is because we have presplit the table into regions.
Here is a question to test your understanding: run the same ImportTsv command again and see how many records are in the table. Do you get duplicates? Try to find the answer and explain why it is correct, then check it against the GitHub repository (https://github.com/elephantscale/hbase-book).
Bulk import scenarios
Here are a few bulk import scenarios:
Scenario 1: The data is already in HDFS and needs to be imported into HBase.
Methods: If the ImportTsv tool can work for you, then use it, as it will save the time of writing custom MapReduce code. Sometimes, you might have to write a custom MapReduce job to import (for example, for complex time series data, data mapping, and so on).
Notes: It is probably a good idea to presplit the table before a bulk import. This spreads the insert requests across the cluster and results in a higher insert rate. If you are writing a custom MapReduce job, consider using a high-level MapReduce platform such as Pig or Hive; they are much more concise to write than the equivalent Java code.
Scenario 2: The data is in another database (RDBMS/NoSQL) and you need to import it into HBase.
Methods: Use a utility such as Sqoop to bring the data into HDFS and then use the tools outlined in the first scenario.
Notes: Avoid writing MapReduce code that directly queries databases. Most databases cannot handle many simultaneous connections. It is best to bring the data into Hadoop (HDFS) first and then use MapReduce.
Profiling HBase applications
Just like any software development process, once we have our HBase application working correctly, we want to make it faster. At times, developers get too carried away and start optimizing before the application is finalized. There is a well-known rule that premature optimization is the root of all evil; one of the sources for this rule is Scott Meyers' Effective C++. We can perform some ad hoc profiling in our code by timing various function calls. Also, we can use profiling tools to pinpoint the trouble spots.
Using profiling tools is highly encouraged for the following reasons:
Profiling takes out the guesswork (and a good majority of developers' guesses are wrong).
There is no need to modify the code. Manual profiling means that we have to go and insert the instrumentation code all over the code. Profilers work by inspecting the runtime behavior.
Most profilers have a nice and intuitive UI to visualize the program flow and time flow.
The authors use JProfiler. It is a pretty effective profiler. However, it is neither free nor open source. So, for the purpose of this article, we are going to show you a simple manual profiling, as follows:
public class UserInsert {

    static String tableName = "users";
    static String familyName = "info";

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        // change the following to connect to remote clusters
        // config.set("hbase.zookeeper.quorum", "localhost");
        long t1a = System.currentTimeMillis();
        HTable htable = new HTable(config, tableName);
        long t1b = System.currentTimeMillis();
        System.out.println("Connected to HTable in : " + (t1b - t1a) + " ms");
        int total = 100;
        long t2a = System.currentTimeMillis();
        for (int i = 0; i < total; i++) {
            int userid = i;
            String email = "user-" + i + "@foo.com";
            String phone = "555-1234";

            byte[] key = Bytes.toBytes(userid);
            Put put = new Put(key);

            put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));
            htable.put(put);
        }
        long t2b = System.currentTimeMillis();
        System.out.println("inserted " + total + " users in " + (t2b - t2a) + " ms");
        htable.close();
    }
}
The code we just saw inserts some sample user data into HBase. We are profiling two operations, that is, connection time and actual insert time. A sample run of the Java application yields the following:
Connected to HTable in : 1139 ms
inserted 100 users in 350 ms
We spent a lot of time in connecting to HBase. This makes sense: the connection process has to go to ZooKeeper first and then to HBase, so it is an expensive operation. How can we minimize the connection cost? The answer is by using connection pooling. Luckily for us, HBase comes with a connection pool manager. The Java class for this is HConnectionManager. It is very simple to use.
Let's update our class to use HConnectionManager:
Code file name: hbase_dp.ch8.UserInsert2.java
package hbase_dp.ch8;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HConnection;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class UserInsert2 {

    static String tableName = "users";
    static String familyName = "info";

    public static void main(String[] args) throws Exception {
        Configuration config = HBaseConfiguration.create();
        // change the following to connect to remote clusters
        // config.set("hbase.zookeeper.quorum", "localhost");

        long t1a = System.currentTimeMillis();
        HConnection hConnection = HConnectionManager.createConnection(config);
        long t1b = System.currentTimeMillis();
        System.out.println("Connection manager in : " + (t1b - t1a) + " ms");

        // simulate the first 'connection'
        long t2a = System.currentTimeMillis();
        HTableInterface htable = hConnection.getTable(tableName);
        long t2b = System.currentTimeMillis();
        System.out.println("first connection in : " + (t2b - t2a) + " ms");

        // second connection
        long t3a = System.currentTimeMillis();
        HTableInterface htable2 = hConnection.getTable(tableName);
        long t3b = System.currentTimeMillis();
        System.out.println("second connection : " + (t3b - t3a) + " ms");

        int total = 100;
        long t4a = System.currentTimeMillis();
        for (int i = 0; i < total; i++) {
            int userid = i;
            String email = "user-" + i + "@foo.com";
            String phone = "555-1234";

            byte[] key = Bytes.toBytes(userid);
            Put put = new Put(key);

            put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));
            htable.put(put);
        }
        long t4b = System.currentTimeMillis();
        System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms");
        hConnection.close();
    }
}
A sample run yields the following timings:
Connection manager in : 98 ms
first connection in : 808 ms
second connection : 0 ms
inserted 100 users in 393 ms
The first connection takes a long time, but then take a look at the time of the second connection. It is almost instant! This is cool! If you are connecting to HBase from web applications (or interactive applications), use connection pooling.
More tips for high-performing HBase writes
Here we will discuss some techniques and best practices to improve writes in HBase.
Batch writes
Currently, in our code, each time we call htable.put(one_put), we make an RPC call to an HBase region server. This round-trip delay can be minimized if we call htable.put() with a bunch of put records. Then, with one round trip, we can insert a bunch of records into HBase. This is called batch puts. Here is an example of batch puts. Only the relevant section is shown for clarity.
For the full code, see hbase_dp.ch8.UserInsert3.java:
        int total = 100;
        long t4a = System.currentTimeMillis();
        List<Put> puts = new ArrayList<>();
        for (int i = 0; i < total; i++) {
            int userid = i;
            String email = "user-" + i + "@foo.com";
            String phone = "555-1234";

            byte[] key = Bytes.toBytes(userid);
            Put put = new Put(key);

            put.add(Bytes.toBytes(familyName), Bytes.toBytes("email"), Bytes.toBytes(email));
            put.add(Bytes.toBytes(familyName), Bytes.toBytes("phone"), Bytes.toBytes(phone));

            puts.add(put); // just add to the list
        }
        htable.put(puts); // do a batch put
        long t4b = System.currentTimeMillis();
        System.out.println("inserted " + total + " users in " + (t4b - t4a) + " ms");
A sample run with a batch put is as follows:
inserted 100 users in 48 ms
The same code with individual puts took around 350 milliseconds! Use batch writes when you can to minimize latency. Note that the HTableUtil class that comes with HBase implements some smart batching options for your use and enjoyment.
Setting memory buffers
We can control when the puts are flushed by setting the client write buffer option. Once the data in memory exceeds this setting, it is flushed to disk. The default setting is 2 M. Its purpose is to limit how much data is stored in the buffer before writing it to disk. There are two ways of setting this:
In hbase-site.xml (this setting will be cluster-wide):
<property>
  <name>hbase.client.write.buffer</name>
  <value>8388608</value>   <!-- 8 M -->
</property>
In the application (this applies only to that application):
htable.setWriteBufferSize(1024*1024*10); // 10 M
Keep in mind that a bigger buffer takes more memory on both the client side and the server side. As a practical guideline, estimate how much memory you can dedicate to the client and put the rest of the load on the cluster.
Turning off autoflush
If autoflush is enabled, each htable.put() call incurs a round-trip RPC call to HRegionServer. Turning autoflush off can reduce the number of round trips and decrease latency. To turn it off, use this code:
htable.setAutoFlush(false);
The risk of turning off autoflush is that if the client crashes before the data is sent to HBase, that data will be lost. Still, when would you want to do it? The answer is: when the danger of data loss is not important and speed is paramount. Also, see the batch write recommendations we saw previously.
Turning off WAL
Before we discuss this, we need to emphasize that the write-ahead log (WAL) is there to prevent data loss in the case of server crashes. By turning it off, we are bypassing this protection. Be very careful when choosing this. Bulk loading is one of the cases where turning off WAL might make sense. To turn off WAL, set it for each put:
put.setDurability(Durability.SKIP_WAL);
More tips for high-performing HBase reads
So far, we looked at tips to write data into HBase. Now, let's take a look at some tips to read data faster.
The scan cache
When reading a large number of rows, it is better to set scan caching to a high number (in the hundreds or even thousands). Otherwise, each row that is scanned will result in a trip to HRegionServer. This is especially encouraged for MapReduce jobs, as they will likely consume a lot of rows sequentially.
To set scan caching, use the following code:
Scan scan = new Scan();
scan.setCaching(1000);
Only read the families or columns needed
When fetching a row, by default, HBase returns all the families and all the columns. If you only care about one family or a few attributes, specifying them will save needless I/O.
To specify a family, use this:
scan.addFamily(Bytes.toBytes("family1"));
To specify columns, use this:
scan.addColumn(Bytes.toBytes("family1"), Bytes.toBytes("col1"));
The block cache
When scanning large rows sequentially (say in MapReduce), it is recommended that you turn off the block cache. Turning off the cache might seem completely counter-intuitive. However, caches are only effective when we repeatedly access the same rows. During sequential scanning, there is no repeated access, and turning on the block cache will introduce a lot of churn in the cache (new data is constantly brought into the cache and old data is evicted to make room for the new data). So, we have the following points to consider:
Turn off the block cache for sequential scans
Leave the block cache on for random/repeated access
Benchmarking or load testing HBase
Benchmarking is a good way to verify HBase's setup and performance. There are a few good benchmarks available:
HBase's built-in benchmark
The Yahoo Cloud Serving Benchmark (YCSB)
JMeter for custom workloads
HBase's built-in benchmark
HBase's built-in benchmark is PerformanceEvaluation. To find its usage, use this:
$ hbase org.apache.hadoop.hbase.PerformanceEvaluation
To perform a write benchmark, use this:
$ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred randomWrite 5
Here we are using five threads and no MapReduce.
To accurately measure the throughput, we need to presplit the table that the benchmark writes to, which is TestTable:
$ hbase org.apache.hadoop.hbase.PerformanceEvaluation --nomapred --presplit=3 randomWrite 5
Here, the table is split in three ways. It is good practice to split the table into as many regions as the number of region servers. There is a read option along with a whole host of scan options.
YCSB
The YCSB is a comprehensive benchmark suite that works with many systems such as Cassandra, Accumulo, and HBase.
Download it from GitHub, as follows:
$ git clone git://github.com/brianfrankcooper/YCSB.git
Build it like this:
$ mvn -DskipTests package
Create an HBase table to test against:
$ hbase shell
hbase> create 'ycsb', 'f1'
Now, copy hdfs-site.xml for your cluster into the hbase/src/main/conf/ directory and run the benchmark:
$ bin/ycsb load hbase -P workloads/workloada -p columnfamily=f1 -p table=ycsb
YCSB offers lots of workloads and options. Please refer to its wiki page at https://github.com/brianfrankcooper/YCSB/wiki.
JMeter for custom workloads
The standard benchmarks will give you an idea of your HBase cluster's performance. However, nothing can substitute for measuring your own workload. We want to measure at least the insert speed or the query speed. We also want to run a stress test, so we can measure the ceiling on how much our HBase cluster can support. We can do simple instrumentation as we did earlier too. However, there are tools such as JMeter that can help us with load testing. Please refer to the JMeter website and check out the Hadoop or HBase plugins for JMeter.
Monitoring HBase
Running any distributed system involves decent monitoring. HBase is no exception.
Luckily, HBase has the following capabilities:
HBase exposes a lot of metrics.
These metrics can be directly consumed by monitoring systems such as Ganglia.
We can also obtain these metrics in the JSON format via the REST interface and JMX.
Monitoring is a big subject and we consider it a part of HBase administration. So, in this section, we will give pointers to tools and utilities that allow you to monitor HBase.
Ganglia
Ganglia is a generic system monitor that can monitor hosts (CPU, disk usage, and so on). The Hadoop stack has had pretty good integration with Ganglia for some time now. HBase and Ganglia integration is set up by modern installers from Cloudera and Hortonworks. To enable Ganglia metrics, update the hadoop-metrics.properties file in the HBase configuration directory. Here's a sample file:
hbase.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
hbase.period=10
hbase.servers=ganglia-server:PORT
jvm.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
jvm.period=10
jvm.servers=ganglia-server:PORT
rpc.class=org.apache.hadoop.metrics.ganglia.GangliaContext31
rpc.period=10
rpc.servers=ganglia-server:PORT
This file has to be uploaded to all the HBase servers (master servers as well as region servers). Here are some sample graphs from Ganglia (these are Wikimedia statistics, for example):
These graphs show cluster-wide resource utilization.
OpenTSDB
OpenTSDB is a scalable time series database. It can collect and visualize metrics on a large scale. OpenTSDB uses collectors, lightweight agents that send metrics to the OpenTSDB server, and there is a collector library that can collect metrics from HBase. You can see all the collectors at http://opentsdb.net/docs/build/html/user_guide/utilities/tcollector.html.
An interesting factoid is that OpenTSDB is itself built on Hadoop/HBase.
Collecting metrics via the JMX interface
HBase exposes a lot of metrics via JMX. This page can be accessed from the web dashboard at http://<hbase master>:60010/jmx. For example, for an HBase instance that is running locally, it will be http://localhost:60010/jmx.
Here is a sample screenshot of the JMX metrics via the web UI:
Here's a quick example of how to programmatically retrieve these metrics using curl:
$ curl 'localhost:60010/jmx'
Since this is a web service, we can write a script/application in any language (Java, Python, or Ruby) to retrieve and inspect the metrics.
Summary
In this article, you learned how to push the performance of your HBase applications up. We looked at how to effectively load a large amount of data into HBase. You also learned about benchmarking and monitoring HBase and saw tips on how to do high-performing reads/writes.