Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

Warming Up

Save for later
  • 660 min read
  • 2015-02-06 00:00:00

article-image

In this article by Bater Makhabel, author of Learning Data Mining with R, you will learn basic data mining terms such as data definition, preprocessing, and so on.

(For more resources related to this topic, see here.)

The most important data mining algorithms will be illustrated with R to help you grasp the principles quickly, including but not limited to, classification, clustering, and outlier detection. Before diving right into data mining, let's have a look at the topics we'll cover:

  • Data mining
  • Social network mining

In the history of humankind, the results of data from every aspect is extensive, for example websites, social networks by user's e-mail or name or account, search terms, locations on map, companies, IP addresses, books, films, music, and products.

Data mining techniques can be applied to any kind of old or emerging data; each data type can be best dealt with using certain, but not all, techniques. In other words, the data mining techniques are constrained by data type, size of the dataset, context of the tasks applied, and so on. Every dataset has its own appropriate data mining solutions.

New data mining techniques always need to be researched along with new data types once the old techniques cannot be applied to it or if the new data type cannot be transformed onto the traditional data types. The evolution of stream mining algorithms applied to Twitter's huge source set is one typical example. The graph mining algorithms developed for social networks is another example.

The most popular and basic forms of data are from databases, data warehouses, ordered/sequence data, graph data, text data, and so on. In other words, they are federated data, high dimensional data, longitudinal data, streaming data, web data, numeric, categorical, or text data.

Big data

Big data is large amount of data that does not fit in the memory of a single machine. In other words, the size of data itself becomes a part of the issue when studying it. Besides volume, two other major characteristics of big data are variety and velocity; these are the famous three Vs of big data. Velocity means data process rate or how fast the data is being processed. Variety denotes various data source types. Noises arise more frequently in big data source sets and affect the mining results, which require efficient data preprocessing algorithms.

As a result, distributed filesystems are used as tools for successful implementation of parallel algorithms on large amounts of data; it is a certainty that we will get even more data with each passing second. Data analytics and visualization techniques are the primary factors of the data mining tasks related to massive data. Some data types that are important to big data are as follows:

  • The data from the camera video, which includes more metadata for analysis to expedite crime investigations, enhanced retail analysis, military intelligence, and so on.
  • The second data type is from embedded sensors, such as medical sensors, to monitor any potential outbreaks of virus.
  • The third data type is from entertainment, information freely published through social media by anyone.
  • The last data type is consumer images, aggregated from social media, and tagging on these like images are important.

Here is a table illustrating the history of data size growth. It shows that information will be more than double every two years, changing the way researchers or companies manage and extract value through data mining techniques from data, revealing new data mining studies.

Year

Data Sizes

Comments

N/A

 

1 MB (Megabyte) = 220. The human brain holds about 200 MB of information.

N/A

 

1 PB (Petabyte) = 250. It is similar to the size of 3 years' observation data for Earth by NASA and is equivalent of 70.8 times the books in America's Library of Congress.

1999

1 EB

1 EB (Exabyte) = 260. The world produced 1.5 EB of unique information.

2007

281 EB

The world produced about 281 Exabyte of unique information.

2011

1.8 ZB

1 ZB (Zetabyte)= 270. This is all data gathered by human beings in 2011.

Very soon

Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at $15.99/month. Cancel anytime

 

1 YB(Yottabytes)= 280.

Scalability and efficiency

Efficiency, scalability, performance, optimization, and the ability to perform in real time are important issues for almost any algorithms, and it is the same for data mining. There are always necessary metrics or benchmark factors of data mining algorithms.

As the amount of data continues to grow, keeping data mining algorithms effective and scalable is necessary to effectively extract information from massive datasets in many data repositories or data streams.

The storage of data from a single machine to wide distribution, the huge size of many datasets, and the computational complexity of the data mining methods are all factors that drive the development of parallel and distributed data-intensive mining algorithms.

Data source

Data serves as the input for the data mining system and data repositories are important. In an enterprise environment, database and logfiles are common sources. In web data mining, web pages are the source of data. The data that continuously fetched various sensors are also a typical data source.

Here are some free online data sources particularly helpful to learn about data mining:

  • Frequent Itemset Mining Dataset Repository: A repository with datasets for methods to find frequent itemsets (http://fimi.ua.ac.be/data/).
  • UCI Machine Learning Repository: This is a collection of dataset, suitable for classification tasks (http://archive.ics.uci.edu/ml/).
  • The Data and Story Library at statlib: DASL (pronounced "dazzle") is an online library of data files and stories that illustrate the use of basic statistics methods. We hope to provide data from a wide variety of topics so that statistics teachers can find real-world examples that will be interesting to their students. Use DASL's powerful search engine to locate the story or data file of interest. (http://lib.stat.cmu.edu/DASL/)
  • WordNet: This is a lexical database for English (http://wordnet.princeton.edu)

Data mining

Data mining is the discovery of a model in data; it's also called exploratory data analysis, and discovers useful, valid, unexpected, and understandable knowledge from the data. Some goals are shared with other sciences, such as statistics, artificial intelligence, machine learning, and pattern recognition. Data mining has been frequently treated as an algorithmic problem in most cases. Clustering, classification, association rule learning, anomaly detection, regression, and summarization are all part of the tasks belonging to data mining.

The data mining methods can be summarized into two main categories of data mining problems: feature extraction and summarization.

Feature extraction

This is to extract the most prominent features of the data and ignore the rest. Here are some examples:

  • Frequent itemsets: This model makes sense for data that consists of baskets of small sets of items.
  • Similar items: Sometimes your data looks like a collection of sets and the objective is to find pairs of sets that have a relatively large fraction of their elements in common. It's a fundamental problem of data mining.

Summarization

The target is to summarize the dataset succinctly and approximately, such as clustering, which is the process of examining a collection of points (data) and grouping the points into clusters according to some measure. The goal is that points in the same cluster have a small distance from one another, while points in different clusters are at a large distance from one another.

The data mining process

There are two popular processes to define the data mining process in different perspectives, and the more widely adopted one is CRISP-DM:

  • Cross-Industry Standard Process for Data Mining (CRISP-DM)
  • Sample, Explore, Modify, Model, Assess (SEMMA), which was developed by the SAS Institute, USA

CRISP-DM

There are six phases in this process that are shown in the following figure; it is not rigid, but often has a great deal of backtracking:

warming-img-0

Let's look at the phases in detail:

  • Business understanding: This task includes determining business objectives, assessing the current situation, establishing data mining goals, and developing a plan.
  • Data understanding: This task evaluates data requirements and includes initial data collection, data description, data exploration, and the verification of data quality.
  • Data preparation: Once available, data resources are identified in the last step. Then, the data needs to be selected, cleaned, and then built into the desired form and format.
  • Modeling: Visualization and cluster analysis are useful for initial analysis. The initial association rules can be developed by applying tools such as generalized rule induction. This is a data mining technique to discover knowledge represented as rules to illustrate the data in the view of causal relationship between conditional factors and a given decision/outcome. The models appropriate to the data type can also be applied.
  • Evaluation :The results should be evaluated in the context specified by the business objectives in the first step. This leads to the identification of new needs and in turn reverts to the prior phases in most cases.
  • Deployment: Data mining can be used to both verify previously held hypotheses or for knowledge.

SEMMA

Here is an overview of the process for SEMMA:

warming-img-1

Let's look at these processes in detail:

  • Sample: In this step, a portion of a large dataset is extracted
  • Explore: To gain a better understanding of the dataset, unanticipated trends and anomalies are searched in this step
  • Modify: The variables are created, selected, and transformed to focus on the model construction process
  • Model: A variable combination of models is searched to predict a desired outcome
  • Assess: The findings from the data mining process are evaluated by its usefulness and reliability

Social network mining

As we mentioned before, data mining finds a model on data and the mining of social network finds the model on graph data in which the social network is represented.

Social network mining is one application of web data mining; the popular applications are social sciences and bibliometry, PageRank and HITS, shortcomings of the coarse-grained graph model, enhanced models and techniques, evaluation of topic distillation, and measuring and modeling the Web.

Social network

When it comes to the discussion of social networks, you will think of Facebook, Google+, LinkedIn, and so on. The essential characteristics of a social network are as follows:

  • There is a collection of entities that participate in the network. Typically, these entities are people, but they could be something else entirely.
  • There is at least one relationship between the entities of the network. On Facebook, this relationship is called friends. Sometimes, the relationship is all-or-nothing; two people are either friends or they are not. However, in other examples of social networks, the relationship has a degree. This degree could be discrete, for example, friends, family, acquaintances, or none as in Google+. It could be a real number; an example would be the fraction of the average day that two people spend talking to each other.
  • There is an assumption of nonrandomness or locality. This condition is the hardest to formalize, but the intuition is that relationships tend to cluster. That is, if entity A is related to both B and C, then there is a higher probability than average that B and C are related.

Here are some varieties of social networks:

  • Telephone networks: The nodes in this network are phone numbers and represent individuals
  • E-mail networks: The nodes represent e-mail addresses, which represent individuals
  • Collaboration networks: The nodes here represent individuals who published research papers; the edge connecting two nodes represent two individuals who published one or more papers jointly

Social networks are modeled as undirected graphs. The entities are the nodes, and an edge connects two nodes if the nodes are related by the relationship that characterizes the network. If there is a degree associated with the relationship, this degree is represented by labeling the edges.

warming-img-2

Here is an example in which Coleman's High School Friendship Data from the sna R package is used for analysis. The data is from a research on friendship ties between 73 boys in a high school in one chosen academic year; reported ties for all informants are provided for two time points (fall and spring). The dataset's name is coleman, which is an array type in R language. The node denotes a specific student and the line represents the tie between two students.

warming-img-3

Summary

The book has, as showcased in this article, a lot more interesting coverage with regard to data mining and R. Deep diving into the algorithms associated with data mining and efficient methods to implement them using R.

Resources for Article:


Further resources on this subject: