You're reading from Statistical Application Development with R and Python Develop applications using data processing, statistical models, and CART

Product type Paperback

Published in Aug 2017

ISBN-13 9781788621199

Length 432 pages

Edition 2nd Edition

Languages

Python

Concepts

Application Development

Table of Contents (19) Chapters

Statistical Application Development with R and Python - Second Edition

Credits

About the Author

Acknowledgment

About the Reviewers

www.PacktPub.com

Customer Feedback

Preface

1. Data Characteristics

2. Import/Export Data

3. Data Visualization

4. Exploratory Analysis

5. Statistical Inference

6. Linear Regression Analysis

7. Logistic Regression Model

8. Regression Models with Regularization

9. Classification and Regression Trees

10. CART and Beyond

Index

Preface

R and Python are both required languages these days for anybody engaged in data analysis. The growth of these two languages and their interdependency creates a natural requirement to learn them both. This is naturally where the second edition of my previous title, R Statistical Application Development by Example, was headed. I took this opportunity to add Python as an important layer, and hence you will find Doing it in Python sections spread throughout the book. The book is now useful on many fronts: for those who need to learn both languages, for those who use R and need to switch to Python, and vice versa. While the abstract development of ideas and algorithms has been retained in R only, the standard and more commonly required data analysis techniques are now available in both languages. Where a Python parallel is not provided, the only reason is to keep the book from becoming too bulky.

The open source language R is fast becoming one of the preferred companions for statistics, even as the subject continues to add many friends in machine learning, data mining, and so on to its already rich scientific network. The era in which mathematical theory and statistical application are so closely embedded is truly a remarkable one for society, and R and Python have played a pivotal role in it. This book is a humble attempt at presenting statistical models through R for any reader who has a bit of familiarity with the subject. In my experience of practicing the subject with colleagues and friends from different backgrounds, I realized that many are interested in learning the subject and applying it in their domain, which enables them to take appropriate decisions in analyses that involve uncertainty. A decade earlier, my friends would have been content with being pointed to a useful reference book. Not so anymore! The work in almost every domain is done through computers, and naturally they have their data available in spreadsheets, databases, and sometimes in plain text format. The request for an appropriate statistical model is invariably followed by a one-word question: software? My answer to them has always been a single-letter reply: R! Why? It is really a very simple decision, and R has been my companion over the last seven years. In this book, this experience has been converted into detailed chapters and a cleaner breakup of model building in R.

A by-product of my interactions with colleagues and friends who are all aspiring statistical model builders has been that I have been able to pick up the trough of their learning curve of the subject. The first attempt towards fixing the hurdle has been to introduce the fundamental concepts through what beginners are most familiar with, which is data. The difference is simply in the subtleties, and as such I firmly believe that introducing the subject on their turf motivates readers a long way in their journey. As with most statistical software, R provides modules and packages which cover many of the recently invented statistical methodologies. The first five chapters of the book focus on the fundamental aspects of the subject and the R language, and hence cover R basics, data visualization, exploratory data analysis, and statistical inference.

The foundational aspects are illustrated using interesting examples and set up the framework for the next five chapters. Linear and logistic regression models, being at the forefront, are of paramount importance in applications. The discussion is generic in nature, and the techniques can be easily adapted across different domains. The last two chapters have been inspired by the Breiman school, and hence the modern method of using classification and regression trees has been developed in detail and illustrated through a practical dataset.

What this book covers

Chapter 1, Data Characteristics, introduces the different types of data through a questionnaire and dataset. The need for statistical models is elaborated in some interesting contexts. This is followed by a brief explanation of the installation of R and Python and their related packages. Discrete and continuous random variables are discussed through introductory programs. The programs are available in both languages; they are expository in nature and need not be followed in detail.

Chapter 2, Import/Export Data, begins with a concise development of R basics. Data frames, vectors, matrices, and lists are discussed with clear and simple examples. Importing data from external files in CSV, XLS, and other formats is elaborated next. Writing data and objects from R for other languages is considered, and the chapter concludes with a dialogue on R session management. Python basics, mathematical operations, and other essential operations are explained. Reading data from external files in different formats is also illustrated, along with the session management required.

Chapter 3, Data Visualization, discusses efficient graphics separately for categorical and numeric datasets. This translates into techniques for the bar chart, dot chart, spine and mosaic plots, and fourfold plot for categorical data, and the histogram, box plot, and scatter plot for continuous/numeric data. A very brief introduction to ggplot2 is also provided here. Generating similar plots in both R and Python is treated here.

Chapter 4, Exploratory Analysis, encompasses highly intuitive techniques for the preliminary analysis of data. The visualizing techniques of EDA, such as stem-and-leaf plots and letter values, and the modeling techniques of resistant line, smoothing data, and median polish provide rich insight as a preliminary analysis step. This chapter is driven mainly in R.

Chapter 5, Statistical Inference, begins with an emphasis on the likelihood function and computing the maximum likelihood estimate. Confidence intervals for parameters of interest are developed using functions defined for specific problems. The chapter also considers important statistical tests: the z-test and t-test for comparison of means, and the chi-square test and F-test for comparison of variances. The reader will learn how to create new R and Python functions.

Chapter 6, Linear Regression Analysis, builds a linear relationship between an output and a set of explanatory variables. The linear regression model has many underlying assumptions, and these are verified using validation techniques. A model may be affected by a single observation, a single output value, or an explanatory variable. Statistical metrics that help remove one or more types of anomalies are discussed in depth. Given a large number of covariates, an efficient model is developed using model selection techniques. While the core stats package suffices in R, the statsmodels package is very useful in Python.

Chapter 7, The Logistic Regression Model, is useful as a classification model when the output is a binary variable. Diagnostics and model validation through residuals are used, which lead to an improved model. ROC curves are discussed next, which help in identifying a better classification model. The R packages pscl and ROCR are useful, while pysal and sklearn are useful in Python.

Chapter 8, Regression Models with Regularization, discusses the problem of overfitting, which arises from the use of the models developed in the previous two chapters. Ridge regression significantly reduces the probability of an overfit model, and the development of natural spline models also lays the basis for the models considered in the next chapter. Regularization in R is achieved using the packages ridge and MASS, while sklearn and statsmodels help in Python.

Chapter 9, Classification and Regression Trees, provides a tree-based regression model. The trees are initially built using raw R functions, and the final trees are also reproduced using rudimentary code, leading to a clear understanding of the CART mechanism. The pruning procedure is illustrated in one of the languages, and the reader is encouraged to work it out in the other.

Chapter 10, CART and Beyond, considers two enhancements to CART, using bagging and random forests. A consolidation of all the models from Chapter 6, Linear Regression Analysis, to Chapter 10, CART and Beyond, is also provided through a dataset. Ensemble methods are fast emerging as a very effective and popular machine learning technique, and doing it in both languages will improve users' confidence.
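To give a flavor of the kind of short program the chapters rely on, here is a minimal Python sketch of the maximum likelihood idea mentioned under Chapter 5. It is not taken from the book: the sample data and the function name bernoulli_log_likelihood are hypothetical and chosen only for illustration. It estimates a success probability from 0/1 observations in closed form and checks the result against a simple grid search over the log-likelihood.

```python
import math


def bernoulli_log_likelihood(p, data):
    """Log-likelihood of success probability p for a list of 0/1 observations."""
    successes = sum(data)
    failures = len(data) - successes
    return successes * math.log(p) + failures * math.log(1 - p)


# Hypothetical sample (assumption for illustration): 1 = success, 0 = failure.
data = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]

# Closed-form maximum likelihood estimate: the sample proportion of successes.
p_hat = sum(data) / len(data)

# Numerical check: evaluate the log-likelihood on a grid strictly inside (0, 1)
# and keep the value of p that maximizes it.
grid = [i / 1000 for i in range(1, 1000)]
p_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, data))

print("closed-form MLE:", p_hat)
print("grid-search MLE:", p_grid)
```

With seven successes in ten trials, both approaches agree on an estimate of 0.7. The grid search is deliberately naive; it is only there to show that the closed-form estimate is indeed the maximizer of the likelihood.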

What you need for this book

You will need the following to work with the examples in this book:

R
Python
RStudio

Who this book is for

If you want to have a brief understanding of the nature of data and perform advanced statistical analysis using both R and Python, then this book is what you need. No prior knowledge is required. It is aimed at aspiring data scientists, R users trying to learn Python, and Python users trying to learn R.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: “We can include other contexts through the use of the include directive.”

A block of code is set as follows:

abline(h=0.33,lwd=3,col="red")
abline(h=0.67,lwd=3,col="red")
abline(v=0.33,lwd=3,col="green")

Any command-line input or output is written as follows:

sudo apt-get update
sudo apt-get install python3.6

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: “Clicking the Next button moves you to the next screen.”

Note

Warnings or important notes appear in a box like this.

Tip

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book—what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of.

To send us general feedback, simply e-mail <[email protected]>, and mention the book’s title in the subject of your message.

If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you to get the most from your purchase.

Downloading the example code

You can download the example code files for this book from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

You can download the code files by following these steps:

1. Log in or register to our website using your e-mail address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you’re looking to download the code files.
6. Choose from the drop-down menu where you purchased this book.
7. Click on Code Download.

You can also download the code files by clicking on the Code Files button on the book’s webpage at the Packt Publishing website. This page can be accessed by entering the book’s name in the Search box. Please note that you need to be logged in to your Packt account.

Once the file is downloaded, please make sure that you unzip or extract the folder using the latest version of:

WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Statistical-Application-Development-with-R-and-Python-Second-Edition. We also have other code bundles from our rich catalog of books and videos available at https://github.com/PacktPublishing/. Check them out!

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books—maybe a mistake in the text or the code—we would be grateful if you could report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata Submission Form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded to our website or added to any list of existing errata under the Errata section of that title.

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required information will appear under the Errata section.

Piracy

Piracy of copyrighted material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works in any form on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at <[email protected]> with a link to the suspected pirated material.

We appreciate your help in protecting our authors and our ability to bring you valuable content.