
Tech Guides - Data

281 Articles
Janu Verma
17 Jun 2015
5 min read

8 NoSQL Databases Compared

NoSQL, or non-relational, databases are increasingly used in big data and real-time web applications. They provide a mechanism for the storage and retrieval of information that is not tabular. There are many advantages to using a NoSQL database:

- Horizontal scalability
- Automatic replication (using multiple nodes)
- Loosely defined or no schema (a huge advantage, if you ask me!)
- Sharding and distribution

Recently we were discussing the possibility of changing our data storage from HDF5 files to some NoSQL system. HDF5 files are great for storage and retrieval, but with huge data coming in we need to scale up, and the hierarchical schema of HDF5 files is not well suited for all the sorts of data we are using. I am a bioinformatician working on data science applications for genomic data. We have genomic annotation files (GFF format), genotype sequences (FASTA format), phenotype data (tables), and many other data formats. We want to store data in a space- and memory-efficient way, and the framework should also facilitate fast retrieval.

I did some research on the NoSQL options and prepared this cheat sheet, which should be very useful for anyone thinking about moving their storage to a non-relational database. Data scientists also need to be comfortable with the basic ideas of NoSQL databases. In the course Introduction to Data Science by Prof. Bill Howe (University of Washington) on Coursera, NoSQL databases formed a significant part of the lectures; I highly recommend the lectures on these topics, and the course in general. This cheat sheet should also assist aspiring data scientists in their interviews.

Some options for NoSQL databases:

Membase: This is a key-value database. It is very efficient if you only need to quickly retrieve a value by key, and it has all of the advantages of memcached when it comes to the low cost of implementation.
There is not much emphasis on scalability, but lookups are very fast. It uses a JSON format with no predefined schema. Its weakness for important data is that it is a pure key-value store, and thus is not queryable on properties.

MongoDB: If you need to associate a more complex structure, such as a document, with a key, then MongoDB is a good option. With a single query you retrieve the whole document, which can be a huge win. However, using these documents as simple key-value stores would not be as fast or as space-efficient as Membase. Documents are the basic unit; they are in JSON format with no predefined schema, which makes integration of data easier and faster.

Berkeley DB: It stores records as key-value pairs. Both key and value can be arbitrary byte strings of variable length, so you can put native programming-language data structures into the database without first converting them to a foreign record. Storage and retrieval are very simple, but the application needs to know the structure of a key and a value in advance; it can't ask the DB. It offers simple data access services, no limit to the data types that can be stored, and no special support for binary large objects (unlike some others).

Berkeley DB vs MongoDB: Berkeley DB has no partitioning, while MongoDB supports sharding. MongoDB has predefined data types such as float, string, integer, double, boolean, and date; Berkeley DB is a key-value store, while MongoDB stores documents. Both are schema-free. Berkeley DB has no official support for Python, for example, although there are many third-party libraries.

Redis: If you need richer structures like lists, sets, ordered sets, and hashes, then Redis is the best bet. It's very fast and provides useful data structures. It just works, but don't expect it to handle every use case. Nevertheless, it is certainly possible to use Redis as your primary data store.
It is less focused on distributed scalability, though: it optimizes for high-performance lookups at the cost of not supporting relational queries.

Cassandra: Each key has values as columns, and columns are grouped together into sets called column families, so each key identifies a row of a variable number of elements. A column family contains rows and columns; each row is uniquely identified by a key and has multiple columns. Think of a column family as a table, with each key-value pair being a row. Unlike in an RDBMS, different rows in a column family don't have to share the same set of columns, and a column may be added to one or multiple rows at any time. Cassandra is a hybrid between a key-value and a column-oriented database, with a partially defined schema. It can handle large amounts of data across many servers (clusters), and is fault-tolerant and robust. Example: Cassandra was originally written by Facebook for Inbox search, and was later replaced there by HBase.

HBase: It is modeled after Google's Bigtable. The ideal use for HBase is when you need improved flexibility, great performance, and scaling, and have big data. The data structure is similar to Cassandra's, with column families. It is built on Hadoop (HDFS) and can do MapReduce without any external support. It is very efficient for storing sparse data, and big data (2 billion rows) is easy to deal with. Example: a scalable email/messaging system with search.

HBase vs Cassandra: HBase is more suitable for data warehousing and large-scale data processing and analysis (such as indexing the web in a search engine), while Cassandra is more apt for real-time transaction processing and serving interactive data. Cassandra is more write-centric and HBase more read-centric. Cassandra has multi-data-center support, which can be very useful.
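The column-family model described above can be sketched in plain Python, with dictionaries standing in for rows (a toy illustration of the data model only, not any real Cassandra or HBase API): rows in the same family need not share columns, and columns can be added to a single row at any time.

```python
# A column family as a dict of rows; each row maps column names to values.
# Note the two rows deliberately have *different* column sets.
users = {
    "row-1": {"name": "Janu", "lab": "Buckler Lab"},
    "row-2": {"name": "Akram", "city": "Birmingham", "role": "editor"},
}

def add_column(family, row_key, column, value):
    """A column may be added to one row at any time, as described above."""
    family.setdefault(row_key, {})[column] = value

add_column(users, "row-1", "field", "bioinformatics")
assert users["row-1"]["field"] == "bioinformatics"
assert "city" not in users["row-1"]  # rows need not share the same columns
```

The same sketch also shows why a column family only *looks* like a table: the set of columns is per-row, not fixed by a schema.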
Resources

NoSQL explained
Why NoSQL
Big Table

About the Author

Janu Verma is a Quantitative Researcher at the Buckler Lab, Cornell University, where he works on problems in bioinformatics and genomics. His background is in mathematics and machine learning, and he leverages tools from these areas to answer questions in biology.

Akram Hussain
18 Mar 2015
4 min read

Is 2015 the Year of Deep Learning?

The new phenomenon to hit the world of 'Big Data' seems to be 'Deep Learning'. I've read many articles and papers in which people question whether there's a future for it, or whether it's just a buzzword that will die out like many a term before it. Likewise, I have seen people who are genuinely excited and truly believe it is the future of artificial intelligence: the one solution that can greatly improve the accuracy of our data and the development of systems. Deep learning is currently a very active research area. It is by no means established as an industry standard, but rather one that is picking up pace and brings a strong promise of being a game changer when dealing with raw, unstructured data.

So what is Deep Learning?

Deep learning is a concept conceived from machine learning. In very simple terms, we think of machine learning as a method of teaching machines (using complex algorithms to form neural networks) to make improved predictions of outcomes based on patterns and behaviour in initial data sets. Deep learning goes a step further: the idea is based around a set of techniques used to train machines (neural networks) to process information with levels of accuracy nearly equivalent to that of the human eye. Deep learning is currently one of the best providers of solutions to problems in image recognition, speech recognition, object recognition, and natural language processing. There is a growing number of libraries available, in a wide range of languages (Python, R, Java) and frameworks such as Caffe, Theano, Torch, H2O, Deeplearning4j, DeepDist, and so on.

How does Deep Learning work?

The central idea is the 'deep neural network'. Deep neural networks take traditional neural networks (or artificial neural networks) and build them on top of one another to form layers arranged in a hierarchy. Deep learning allows each layer in the hierarchy to learn more about the qualities of the initial data.
To put this in perspective, the output of level one becomes the input of level two. The same filtering process is repeated a number of times, until the level of accuracy allows the machine to identify its goal as accurately as possible; it's essentially a repeated process that keeps refining the initial dataset.

Here is a simple example of deep learning. Imagine a face. We as humans are very good at making sense of what our eyes show us, all the while doing it without even realising. We can easily make out one's face shape, eyes, ears, nose, mouth, and so on. We take this for granted, and don't fully appreciate how difficult (and complex) it is to write programs for machines to do what comes naturally to us. The difficulty for machines in this case is pattern recognition: identifying edges, shapes, objects, and so on. The aim is to develop these deep neural networks by increasing and improving the number of layers, training each network to learn more about the data to the point where (in our example) it's equal to human accuracy.

What is the future of Deep Learning?

Deep learning seems to have a bright future for sure; not that it is a new concept, but I would argue it's now practical rather than theoretical. We can expect to see the development of new tools, libraries, and platforms, and even improvements to current technologies such as Hadoop, to accommodate the growth of deep learning. However, it may not be all smooth sailing. Deep learning is still a very difficult and time-consuming task to understand, especially when trying to optimise networks as datasets grow larger and larger; surely they will be prone to errors? Additionally, the hierarchy of networks formed would have to be scaled for larger, complex, data-intensive AI problems. Nonetheless, the popularity around deep learning has seen large organisations invest heavily: Yahoo, Facebook, Google's $400 million acquisition of DeepMind, and Twitter's purchase of Madbits.
These are just a few of the high-profile investments among many. 2015 really does seem like the year deep learning will show its true potential. Prepare for the advent of deep learning by ensuring you know all there is to know about machine learning with our article: read 'How to do Machine Learning with Python' now. Discover more machine learning tutorials and content on our dedicated page. Find it here.
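The layer-on-layer idea described in this article can be sketched with NumPy (a toy forward pass with made-up random weights, not any particular framework's API): each layer's output becomes the next layer's input.

```python
import numpy as np

def layer(x, w, b):
    """One neural-network layer: a linear map followed by a ReLU nonlinearity."""
    return np.maximum(0, x @ w + b)

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                      # toy input features
w1, b1 = rng.normal(size=(4, 3)), np.zeros(3)    # level one
w2, b2 = rng.normal(size=(3, 2)), np.zeros(2)    # level two

h = layer(x, w1, b1)     # the output of level one...
out = layer(h, w2, b2)   # ...is the input of level two
assert out.shape == (1, 2)
```

In a real deep network, training adjusts the weights of every layer so that the stacked filtering steps progressively refine the raw input, as described above.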

Akram Hussain
30 Dec 2014
5 min read

What Did Big Data Deliver In 2014?

Big Data has always been a hot topic, and in 2014 it came into play. 'Big data' has developed, evolved, and matured to give significant value to business intelligence. However, there is much more to big data than meets the eye. Understanding enormous amounts of unstructured data is not easy by any means; yet once that data is analysed and understood, organisations have started to value its importance and need. Big data has helped create a number of opportunities, ranging from new platforms, tools, and technologies, to improved economic performance in different industries, through the development of specialist skills, job creation, and business growth. Let's do a quick recap of 2014 and what big data has offered to the tech world, from the perspective of a tech publisher.

Data Science

The term 'data science' has admittedly been around for some time, yet in 2014 it received a lot more attention thanks to the demands created by big data. Looking at data science from a tech publisher's point of view, it's a concept that has rapidly been adopted, with potential for greater levels of investment and growth. To address the needs of big data, data science has been split into four key categories: data mining, data analysis, data visualization, and machine learning. Equally, there are important topics that fit in between, such as data cleaning (munging), which I believe takes up the majority of a data scientist's time. The number of jobs for data scientists has exploded in recent times and will continue to do so; according to global management firm McKinsey & Company, there will be a shortage of 140,000 to 190,000 data scientists due to the continued rise of big data, and the role has been described as the 'sexiest job of the 21st century'.

Real-time Analytics

The competitive battle in big data throughout 2014 focused on how fast data could be streamed to achieve real-time performance.
Real-time analytics' most important feature is instant access: querying data as soon as it comes through. The concept is applicable to different industries and supports the growth of new technologies and ideas. Live analytics are most valuable to social media sites and marketers, in order to provide actionable intelligence. Likewise, real-time data is becoming increasingly important with the phenomenon known as the Internet of Things. The ability to make decisions instantly and to plan outcomes in real time is more possible now than ever before, thanks to the development of technologies like Spark and Storm, and NoSQL databases like Apache Cassandra, which enable organisations to rapidly retrieve data with fault-tolerant performance.

Deep Learning

Machine learning (ML) became the new black and is in constant demand by many organisations, especially new startups. However, even though machine learning is gaining adoption and an improved appreciation of its value, the concept of deep learning seems to be the one that really pushed on in 2014. Now, granted, both ML and deep learning might have been around for some time; we are looking at the topics in terms of current popularity levels and adoption in tech publishing. Deep learning is a subset of machine learning that refers to the use of artificial neural networks composed of many layers. The idea is based around a complex set of techniques for finding information to generate greater accuracy of data and results. The value gained from deep learning is that the information (from hierarchical data models) helps AI machines move towards greater efficiency and accuracy, learning to recognize and extract information by themselves, unsupervised! The popularity around deep learning has seen large organisations invest heavily, such as Google's $400 million acquisition of DeepMind and Twitter's purchase of Madbits, just a few of the high-profile investments among many. Watch this space in 2015!
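The real-time analytics idea above, querying data as soon as it comes through, can be sketched with a streaming aggregate in plain Python (a generator-based toy, not Spark's or Storm's actual API): the answer is updated per event rather than computed in one batch at the end.

```python
def running_average(stream):
    """Maintain an aggregate per event, so it is queryable at any moment."""
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count  # up-to-date answer after every single event

events = [10, 20, 30, 40]                 # e.g. response times arriving live
snapshots = list(running_average(events))
assert snapshots == [10.0, 15.0, 20.0, 25.0]
```

A batch system would only produce the final 25.0; the streaming version has an actionable number after every event, which is the property that makes live analytics valuable.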
New Hadoop and Data Platforms

Hadoop, the technology best associated with big data, changed its batch-processing engine from MapReduce to what's better known as YARN towards the end of 2013, with Hadoop v2. MapReduce demonstrated the value and benefits of large-scale distributed processing. However, as big data demands increased and more flexibility, multiple data models, and visual tools became a requirement, Hadoop introduced YARN to address these problems. YARN stands for 'Yet Another Resource Negotiator'. In 2014, the emergence and adoption of YARN allowed users to carry out multiple workloads, such as streaming, real-time, and generic distributed applications of any kind (YARN handles and supervises their execution!), alongside the MapReduce model. The biggest trend I've seen with the change in Hadoop in 2014 is this transition from MapReduce to YARN. The real value in big data and data platforms is the analytics, and in my opinion that will be the primary point of focus and improvement in 2015.

Rise of NoSQL

NoSQL, also interpreted as 'Not Only SQL', has exploded, with a wide variety of databases coming to maturity in 2014. NoSQL databases have grown in popularity thanks to big data. There are many ways to look at stored data, but it is very difficult to process, manage, store, and query huge sets of messy, complex, and unstructured data. Traditional SQL systems just wouldn't allow that, so NoSQL was created to offer a way to look at data with no restrictive schemas. The emergence of graph, document, wide-column, and key-value-store databases has shown no slowdown, and the growth continues to attract a higher level of adoption. However, NoSQL seems to be taking shape and settling on a few major players, such as Neo4j, MongoDB, and Cassandra. Whatever 2015 brings, I am sure it will be faster, bigger, and better!
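The MapReduce model discussed above can be made concrete with the classic word-count example, sketched here in plain Python (ordinary functions standing in for the map, shuffle, and reduce phases, not Hadoop's actual API):

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big value", "big deal"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
assert counts["big"] == 3 and counts["data"] == 1
```

In a real cluster, the map and reduce calls run on many nodes in parallel, and it is exactly this execution that YARN now schedules and supervises alongside other workload types.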
Akram Hussain
16 Dec 2014
4 min read

Big Data Is More Than Just a Buzz Word!

We all agree big data sounds cool (well, I think it does!), but what is it? Put simply, big data is the term used to describe massive volumes of data; we are thinking of data along the lines of terabytes, petabytes, and exabytes in size. In my opinion, that's as simple as it gets when thinking about the term 'big data'.

Despite this simplicity, big data has been one of the hottest, and most misunderstood, terms in the business intelligence industry recently; every manager, CEO, and director is demanding it. However, once the realization sets in of just how difficult big data is to implement, they may be scared off! The real reason behind the 'buzz' was all the new benefits that organizations could gain from big data. Yet many overlooked the difficulties involved, such as:

- How do you get that level of data?
- If you do, what do you do with it?
- Cultural change is involved, and most decisions would be driven by data; no decision would be made without it.
- The cost and skills required to implement and benefit from big data.

The concept was misunderstood initially; organisations wanted data but failed to understand what they wanted it for and why, even though they were happy to go on the chase.

Where did the buzz start?

I truly believe Hadoop is what gave big data its fame. Initially founded at Yahoo, used in-house, and then open sourced as an Apache project, Hadoop served a true market need for large-scale storage and analytics. Hadoop is so well linked to big data that it's become natural to think of the two together. The graphic above demonstrates the similarities in how often people searched for the two terms; there's a visible correlation (if not causation). I would argue that 'buzz words' in general (or trends) don't take off before the technology that allows them to exist does. If we consider buzz words like 'responsive web design', it needed the correct CSS rules; 'IoT' needed Arduino and Raspberry Pi; likewise, 'big data' needed Hadoop.
Hadoop was on the rise before big data had taken off, which supports my theory. Platforms like Hadoop allowed businesses to collect more data than they could have conceived of a few years ago. Big data grew as a buzz word because the technology supported it.

After the data comes the analysis

However, the issue still remains of collecting data with no real purpose, which ultimately yields very little in return; in short, you need to know what you want and what your end goal is. This is something that organisations are slowly starting to realize and appreciate, represented well by Gartner's 2014 Hype Cycle. Big data is currently in the 'Trough of Disillusionment', which I like to describe as 'the morning after the night before': realisation is setting in, and the excitement and buzz of big data has come down to something akin to shame and regret.

The true value of big data can be categorised into three sections: data types, speed, and reliance. By this we mean: the larger the data, the more difficult it becomes to manage the types of data collected, which will be messy, unstructured, and complex; the speed of analytics is crucial to growth and on-demand expectations; and a reliable infrastructure is at the core of sustainable efficiency. Big data's actual value lies in processing and analyzing complex data to help discover, identify, and make better-informed, data-driven decisions. Likewise, big data can offer clear insight into strengths, weaknesses, and areas of improvement by discovering crucial patterns for success and growth. However, this comes at a cost, as mentioned earlier.

What does this mean for big data?

I envisage that the invisible hand of big data will be ever present. Even though devices are getting smaller, data is increasing at a rapid rate. When the true meaning of big data is appreciated, it will genuinely turn from a buzz word into one that smaller organisations might become reluctant to adopt. In order to implement big data, they will need to appreciate the need for structural change, the costs involved, the skill levels required, and an overall shift towards a data-driven culture. To gain maximum efficiency from big data, and to appreciate that it's more than a buzz word, organizations will have to be very agile and accept the risks involved in this level of change.

Ed Bowkett
04 Dec 2014
4 min read

Top 4 Business Intelligence Tools

With the boom in data analytics, business intelligence has taken something of a front stage in recent years, and as a result a number of Business Intelligence (BI) tools have appeared. These allow a business to obtain a reliable set of data faster and more easily, and to set business objectives. This is a list of the more prominent tools, with the advantages and disadvantages of each.

Pentaho

Pentaho was founded in 2004 and offers, among others, a suite of open source BI applications under the name Pentaho Business Analytics. It has two editions, enterprise and community. It allows easy access to data, and even easier ways of visualizing that data, from a variety of sources including Excel and Hadoop, and it covers almost every platform, from mobile (Android and iPhone) through to Windows and even the web. With the pros come cons, however: the Pentaho Metadata Editor is difficult to understand, and the documentation offers few solutions for this tool (which is a key component). Also, compared to the other tools mentioned below, the advanced analytics in Pentaho need improving. However, given that it is open source, there is continual improvement.

Tableau

Founded in 2003, Tableau also offers a range of suites, focusing on three products: Desktop, Server, and Public. Benefits of using Tableau over other products include its ease of use and a fairly simple UI with drag-and-drop tools, which allows pretty much anyone to use it. Creating a highly interactive dashboard drawing on various data sources is simple and quick. To sum up, Tableau is fast. Incredibly fast! There are relatively few cons when it comes to Tableau, but some automated features you would usually expect in other suites aren't offered for most of the processes and uses here.
Jaspersoft

Another open source suite, Jaspersoft ships with a number of data visualization, data integration, and reporting tools. Added to the small licensing cost, this makes Jaspersoft justifiably one of the leaders in this area. It can be used with a variety of databases, including Cassandra, CouchDB, MongoDB, Neo4j, and Riak. Other benefits include ease of installation, and the functionality of the tools in Jaspersoft is better than that of most competitors on the market. However, the documentation has been said to be lacking in helping customers dive deeper into Jaspersoft, and if you customize it, customer service can no longer assist you if it breaks. Given the functionality and the ability to extend it, though, these cons seem minor.

Qlikview

Qlikview is one of the oldest business intelligence software tools on the market, having been around since 1993. It has many features, and as a result many pros and cons, including some I have mentioned for the previous suites. Among its advantages: it takes a very small amount of time to implement and it's incredibly quick, quicker even than Tableau in this regard! It also has 64-bit in-memory processing, which is among the best on the market. Qlikview also has good data-mining tools, good features (having been on the market for a long time), and a visualization function, all of which make it much easier to deal with than others on the market. The learning curve is relatively small. Some cons: while Qlikview is easy to use, Tableau is seen as the better suite for analyzing data in depth, and Qlikview has difficulties integrating map data, which other BI tools are better at.

This list is not definitive! It lays out some open source tools that companies and individuals can use to help them analyze data and prepare business performance KPIs.
There are other tools that are used by businesses including Microsoft BI tools, Cognos, MicroStrategy, and Oracle Hyperion. I’ve chosen to explore some BI tools that are quick to use out of the box and are incredibly popular and expanding in usage.

Akram Hussain
31 Oct 2014
3 min read

Python Data Stack

The Python programming language has grown significantly in popularity and importance, both as a general programming language and as one of the most advanced providers of data science tools. There are six key libraries every Python analyst should be aware of:

1 - NumPy

Also known as Numerical Python, NumPy is an open source Python library used for scientific computing. NumPy gives both speed and higher productivity through arrays and matrices, which basically means it's super useful when analyzing basic mathematical data and calculations. This was one of the first libraries to push the boundaries for Python in big data. The benefit of using something like NumPy is that it takes care of your mathematical problems with useful functions that are cleaner and faster to write than normal Python code, thanks to its similarities with the C language.

2 - SciPy

Also known as Scientific Python, SciPy is built on top of NumPy and takes scientific computing to another level. It's an advanced form of NumPy that lets users carry out tasks such as solving differential equations, using special functions, optimization, and integration. SciPy can be viewed as a library of predefined complex algorithms that are fast and efficient and save you time. However, there is such a plethora of SciPy tools that they might confuse users more than help them.

3 - Pandas

Pandas is the key data manipulation and analysis library in Python. Its strengths lie in its ability to provide rich data functions that work amazingly well with structured data. There have been many comparisons between pandas and R packages due to their similarities in data analysis, but the general consensus is that it is very easy for anyone using R to migrate to pandas, as it supposedly brings the best features of R and Python programming together in one place.
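As a small illustration of the NumPy point above (a generic sketch, not an example from the article): array operations replace explicit Python loops, which is where the speed and cleaner code come from.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]

# Plain Python: loop over the elements by hand.
squared_loop = [v ** 2 for v in values]

# NumPy: one vectorized expression over the whole array.
arr = np.array(values)
squared_vec = arr ** 2

assert squared_vec.tolist() == squared_loop
assert arr.mean() == 2.5  # handy built-in reductions, too
```

On small lists the difference is cosmetic; on millions of elements the vectorized form is dramatically faster, because the loop runs in compiled C rather than in the Python interpreter.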
4 - Matplotlib

Matplotlib is a visualization powerhouse for Python programming, offering a large library of customizable tools to help visualize complex datasets. Providing appealing visuals is vital in the fields of research and data analysis. Python's 2D plotting library is used to produce plots and make them interactive with just a few lines of code, and it offers a range of graphs including histograms, bar charts, error charts, scatter plots, and much more.

5 - scikit-learn

scikit-learn is Python's most comprehensive machine learning library and is built on top of NumPy and SciPy. One of its advantages is the all-in-one-resource approach it takes: it contains the various tools needed to carry out machine learning tasks, such as supervised and unsupervised learning.

6 - IPython

IPython makes life easier for Python developers working with data. It's a great interactive web notebook that provides an environment for exploration with prewritten Python programs and equations. The ultimate goal behind IPython is improved efficiency through high performance, by allowing scientific computation and data analysis to happen concurrently using multiple third-party libraries.

Continue learning Python with a fun (and potentially lucrative!) way to use decision trees. Read on to find out more.
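To give a flavour of the supervised-learning workflow mentioned under scikit-learn above, here is a minimal sketch using its standard construct/fit/predict pattern on a tiny made-up dataset (points above or below the line y = x; not an example from the article):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy labelled data: 1 = point above the line y = x, 0 = below it.
X = [[0, 1], [1, 2], [2, 3], [1, 0], [2, 1], [3, 2]]
y = [1, 1, 1, 0, 0, 0]

# The standard scikit-learn pattern: construct a model, fit it, predict.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)
predictions = list(model.predict([[0, 2], [2, 0]]))
assert predictions == [1, 0]
```

Almost every estimator in the library follows this same fit/predict interface, which is a large part of its all-in-one appeal: swapping in a different algorithm usually means changing only the constructor line.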
Akram Hussain
31 Oct 2014
4 min read

Top 5 NoSQL Databases

NoSQL has seen a sharp rise in both adoption and migration from the tried and tested relational database management systems. The open source world has accepted it with open arms, which wasn't the case with large enterprise organisations, which still prefer and require ACID-compliant databases. However, as there are so many NoSQL databases, it's difficult to keep track of them all! Let's explore the most popular and distinctive ones available to us:

1 - Apache Cassandra

Apache Cassandra is an open source NoSQL database: a distributed database management system that is massively scalable. An advantage of using Cassandra is its ability to manage large amounts of structured, semi-structured, and unstructured data. What makes Cassandra more appealing as a database system is its ability to scale horizontally; it's also one of the few database systems that can process data in real time while generating high performance and maintaining high availability. Its mixture of a column-oriented database with a key-value store means not every row requires the same columns, but the columns are grouped, which is what makes them look like tables. Cassandra is perfect for 'mission-critical' big data projects, as it offers no single point of failure if a data node goes down.

2 - MongoDB

MongoDB is an open source, schemaless NoSQL database system; its unique appeal is that it's a 'document database' as opposed to a relational database. This basically means it's a 'data dumpster' that's free for all. The added benefit of using MongoDB is that it provides high performance, high availability, and easy scalability (auto-sharding) for large sets of unstructured data in JSON-like documents. MongoDB is the ultimate opposite of the popular MySQL: MySQL data has to be read in rows and columns, which has its own set of benefits with smaller sets of data.

3 - Neo4j

Neo4j is an open source NoSQL graph database, and the frontrunner of the graph-based model.
As a graph database, it manages and queries highly connected data reliably and efficiently. It allows developers to store data more naturally for domains such as social networks and recommendation engines. The data collected from sites and applications is stored in nodes, which are then represented as graphs.

4 - Hadoop

Hadoop is easy to mistake for a NoSQL database, due to its ecosystem of tools for big data. It is in fact a framework for distributed data storage and processing, designed to handle huge amounts of data while limiting financial and processing-time overheads. The Hadoop ecosystem does, however, include a database known as HBase, which runs on top of HDFS and is a distributed, column-oriented data store. HBase is better known as a distributed storage system for Hadoop nodes, which are then used to run analytics with MapReduce v2, also known as YARN.

5 - OrientDB

OrientDB has been included as a wildcard! It's a very interesting database, and one that has everything going for it, but it has always been in the shadow of Neo4j. OrientDB is an open source NoSQL hybrid graph-document database, developed to combine the flexibility of a document database with the connectedness of a graph database (Mongo and Neo4j all in one!). With the growth of complex and unstructured data (such as social media), relational databases were not able to handle the demands of storing and querying this type of data. Document databases were developed as one solution, and representing data as nodes was another; OrientDB has combined both into one, which sounds awesome in theory but might be very different in practice! Whether the hybrid approach works and is adopted remains to be seen.
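The 'highly connected data' idea behind graph databases like Neo4j can be sketched in plain Python with an adjacency list (a toy model of the data, not Neo4j's query language): relationship-hopping queries that are awkward in a row-and-column store become simple traversals.

```python
# A tiny social graph as an adjacency list: node -> set of connected nodes.
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}

def friends_of_friends(g, person):
    """A typical connected-data query: people exactly two hops away."""
    direct = g[person]
    return {fof for friend in direct for fof in g[friend]} - direct - {person}

# Recommendation-engine style question: who might alice know?
assert friends_of_friends(graph, "alice") == {"dave"}
```

In a real graph database, this kind of traversal stays fast even across millions of nodes, because relationships are stored directly rather than reconstructed through joins.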
The War on Data Science: Python versus R

Akram Hussain
30 Jun 2014
7 min read
Data science

The relatively new field of data science has taken the world of big data by storm. Data science gives valuable meaning to large sets of complex and unstructured data. The focus is around concepts like data analysis and visualization. However, in the field of artificial intelligence, a valuable concept known as machine learning has now been adopted by organizations and is becoming a core area for many data scientists to explore and implement. In order to fully appreciate and carry out these tasks, data scientists are required to use powerful languages. R and Python currently dominate this field, but which is better and why?

The power of R

R offers a broad, flexible approach to data science. As a programming language, R focuses on allowing users to write algorithms and computational statistics for data analysis. R can be very rewarding to those who are comfortable using it. One of the greatest benefits R brings is its ability to integrate with other languages like C++, Java, and C, and tools such as SPSS, Stata, Matlab, and so on. R's rise to prominence as the most powerful language for data science was supported by its strong community and the more than 5,600 packages available. However, R is very different from other languages; it's not as easily applicable to general programming (not to say it can't be done). R's strength, its ability to communicate with every data analysis platform, also limits its use outside this category. Game dev, web dev, and so on are all achievable, but there's just no benefit to using R in these domains. As a language, R is difficult to adopt, with a steep learning curve, even for those who have experience using statistical tools like SPSS and SAS.

The violent Python

Python is a high-level, multi-paradigm programming language. Python has emerged as one of the more promising languages of recent times thanks to its easy syntax and interoperability with a wide variety of different ecosystems.
More interestingly, Python has caught the attention of data scientists over the years, and thanks to its object-oriented features and very powerful libraries, Python has become the go-to language for data science, with many arguing it has taken over R. However, like R, Python has its flaws too. One of the drawbacks of using Python is its speed. Python is a slow language, and one of the fundamentals of data science is speed! As mentioned, Python is very good as a programming language, but it's a bit of a jack of all trades and master of none. Unlike R, it doesn't focus purely on data analysis, but it has impressive libraries to carry out such tasks.

The great battle begins

While comparing the two languages, we will go over four fundamental areas of data science and discuss which is better. The topics we will explore are data mining, data analysis, data visualization, and machine learning.

Data mining: As mentioned, one of the key components of data science is data mining. R seems to win this battle; in the 2013 Data Miners Survey, 70% of data miners (from the 1,200 who participated in the survey) use R for data mining. However, it could be argued that you wouldn't really use Python to "mine" data, but rather use the language and its libraries for data analysis and the development of data models.

Data analysis: R and Python boast impressive packages and libraries. Python's NumPy, Pandas, and SciPy libraries are very powerful for data analysis and scientific computing. R, on the other hand, is different in that it doesn't offer just a few packages; the whole language is formed around analysis and computational statistics. An argument could be made for Python being faster than R for analysis, and its code for working with data sets is cleaner. However, I noticed that Python excels at the programming side of analysis, whereas for statistical and mathematical programming R is a lot stronger thanks to its array-oriented syntax. The winner here is debatable; for mathematical analysis, R wins.
But for general analysis and programming clean statistical code more related to machine learning, I would say Python wins.

Data visualization: the "cool" part of data science. The phrase "a picture paints a thousand words" has never been truer than in this field. R boasts its ggplot2 package, which allows you to write impressively concise code that produces stunning visualizations. However, Python has Matplotlib, a 2D plotting library that is equally impressive, with which you can create anything from bar charts and pie charts to error charts and scatter plots. The overall consensus is that R's ggplot2 offers a more professional feel and look to data models. Another one for R.

Machine learning: it knows the things you like before you do. Machine learning is one of the hottest things to hit the world of data science. Companies such as Netflix, Amazon, and Facebook have all adopted this concept. Machine learning is about using complex algorithms and data patterns to predict user likes and dislikes, making it possible to generate recommendations based on a user's behaviour. Python has a very impressive library, scikit-learn, to support machine learning. It covers everything from clustering and classification to building your very own recommendation systems. However, R has a whole ecosystem of packages specifically created to carry out machine learning tasks. Which is better for machine learning? I would say Python's strong libraries and OOP syntax might have the edge here.

One to rule them all

From the surface, both languages seem equally matched on the majority of data science tasks. Where they really differ depends on an individual's needs and what they want to achieve. There is nothing stopping data scientists from using both languages. One of the benefits of using R is that it is compatible with other languages and tools, as R's rich packages can be used within a Python program using RPy (R from Python).
An example of such a situation would be using the IPython environment to carry out data analysis tasks with NumPy and SciPy, yet deciding to use the R ggplot2 package to visually represent the data: the best of both worlds. An interesting theory that has been floating around for some time is to integrate R into Python as a data science library; the benefit of such an approach would be that data scientists have one awesome place providing R's strong data analysis and statistical packages alongside all of Python's OOP benefits, but whether this will happen remains to be seen.

The dark horse

We have explored both Python and R and discussed their individual strengths and flaws in data science. As mentioned earlier, they are the two most popular and dominant languages available in this field. However, a new emerging language called Julia might challenge both in the future. Julia is a high-performance language that is essentially trying to solve the problem of speed for large-scale scientific computation. Julia is expressive and dynamic, it's as fast as C, it can be used for general programming (though its focus is on scientific computing), and the language is easy and clean to use. Sounds too good to be true, right?
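Much of the statistical computation the two languages compete on is simple in substance. As a rough illustration, here is Pearson's correlation coefficient written out longhand in plain Python (no packages assumed): the kind of calculation that R's `cor(x, y)`, or NumPy's `np.corrcoef`, reduces to a one-liner.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient, computed longhand.
    R: cor(x, y).  NumPy: np.corrcoef(x, y)[0, 1]."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Covariance numerator and the two standard-deviation terms.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # perfectly correlated: 1.0
```

The longhand version makes the trade-off in the article tangible: Python handles the general programming comfortably, while R's array-oriented syntax would express the same statistic with no explicit loops at all.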
Aspiring Data Analyst, Meet Your New Best Friend: Excel

Akram Hussain
30 Jun 2014
4 min read
In general, people want to associate themselves with cool job titles, ideally one that indirectly says both that you're clever and that you get paid well, so what's better than telling someone you're a data analyst? Personally, as a graduate in economics, I always thought my natural career progression would be into an analyst role at a banking organization, a private hedge fund, or an investment firm. I'm guessing at some point everyone with a background in maths or some form of statistics has envisaged becoming a hotshot investment banker, right?

However, the story was very different for me; I was somehow fortunate enough to fall into the tech world and develop a real interest in programming. What I found really interesting was that programming languages and data sets go hand in hand surprisingly well, which uncovered a relatively new field to me known as data science. Here's how the story goes: I combined my academic skills with programming, which opened up a world of opportunity, allowing me to appreciate and explore data analysis on a whole new level. Nowadays, I'm using languages like Python and R to mix background knowledge of statistical data with my new-found passion. Yet that's not how it started. It started with Excel.

Now, if you want to eventually move into the field of data science, you have to become competent in data analysis, and I personally recommend Excel as a starting point. There are many reasons for this, one being that you don't have to be a technical wizard to get started. More importantly, Excel's data analysis functionality is more powerful than you would expect: quick and efficient at resolving queries, and able to visualize the results too. Excel has a built-in Data tab to get you started: The screenshot shows the basic analytical features available within Excel, separate from any functions and sum calculations that could be used.
However, one useful and really handy plugin called Data Analysis is missing from that list. If you go to File | Options | Add-ins, choose Analysis ToolPak and Analysis ToolPak - VBA from the list, and select Go, you will be prompted with the following dialog: Once you select the add-ins (as shown above), you will find an awesome new item in your Data tab called Data Analysis: This allows you to run different methods of analysis on your data, anything from histograms, regressions, and correlations to t-tests. Personally, I found this saved me tons of time.

Excel also offers features such as pivot tables and functions like VLOOKUP, both extremely useful for data analysis, especially when you require multiple tables of information for large sets of data. A VLOOKUP function is very useful when trying to identify products in a database that share a set of IDs but are difficult to find by hand. An even more useful feature for analysis is the pivot table. One of the best things about a pivot table is that it saves so much time and effort when you have a large set of data that you need to categorize and analyze quickly from a database. Additionally, there's a visual option named a pivot chart, which allows you to visualize all the data in your pivot table. There are many useful tutorials and free training materials available online on pivot tables.

Overall, Excel provides a solid foundation for most analysts starting out. A general search on the job market for "Excel data" returns over 120,000 jobs, all specific to an analyst role. To conclude, I wouldn't underestimate Excel for learning the basics and getting valuable experience with large sets of data. From there, you can progress to learning a language like Python or R (and then head towards the exciting and supercool field of data science). Given R's steep learning curve, Python is often recommended as the best place to start, especially for people with little or no background in programming.
But don't dismiss Excel: it's a powerful first step, and it can easily become your best friend when entering the world of data analysis.
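What a pivot table does under the hood is a group-and-aggregate pass over rows, and seeing it written out makes the Excel feature less mysterious. Here is a rough sketch in Python; the sales figures below are invented purely for illustration.

```python
from collections import defaultdict

# Made-up rows standing in for a worksheet of raw data.
sales = [
    {"region": "North", "product": "A", "amount": 120},
    {"region": "North", "product": "B", "amount": 80},
    {"region": "South", "product": "A", "amount": 200},
]

def pivot_sum(rows, group_by, value):
    """Group rows by one column and sum another -- the core of a pivot table."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[group_by]] += row[value]
    return dict(totals)

print(pivot_sum(sales, "region", "amount"))   # {'North': 200, 'South': 200}
print(pivot_sum(sales, "product", "amount"))  # {'A': 320, 'B': 80}
```

Excel, of course, adds the interactive drag-and-drop layer and the pivot chart on top; the point of the sketch is only that the underlying operation is a simple grouping, which is why pivot tables are such a natural bridge towards the same operation in Python or R.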
The Mysteries of Big Data and the Orient … DB

Julian Ursell
30 Jun 2014
4 min read
Mapping the world of big data must be a lot like demystifying the antiquated concept of the Orient: trying to decipher a mass of unknowns. With the ever-multiplying expanse of data and the natural desire of humans to simultaneously understand it (as soon as possible and in real time), technology is continually evolving to allow us to make sense of it, make connections between it, turn it into actionable insight, and act upon it physically in the real world.

It's a huge enterprise, and you've got to imagine that with the masses of data collated years before on legacy database systems, without the capacity for the technological insight and analysis we have now, there are relationships within the data that remain undefined: the known unknowns, the unknown knowns, and the known knowns (that Rumsfeld guy was making sense, you see?). It's fascinating to think what we might learn from the data we have already collected.

There is a burning need these days to break down the mysteries of big data, and developers out there are continually thinking of ways we can interpret it, mapping data so that it is intuitive and understandable. The major way developers have reconceptualized data in order to make sense of it is as a network connected tightly together by relationships. The obvious examples are Facebook or LinkedIn, which map out vast networks of people connected by various shared properties, such as education, location, interest, or profession.

One way of mapping highly connected data is by structuring it in the form of a graph, a design that has emerged in recent years as databases have evolved. The main progenitor of this data structure is Neo4j, which is far and away the leader in the field of graph databases, mobilized by a huge number of enterprises working with big data. Neo4j has cornered the market, and it's not hard to see why: it offers a powerful solution with heavy commercial support for enterprise deployments.
In truth there aren't many alternatives out there, but alternatives exist. OrientDB is a hybrid graph-document database that offers the unique flexibility of modeling data in the form of either documents or graphs, while incorporating object-oriented programming as a way of encapsulating relationships. Again, it's a great example of developers imagining ways we can accommodate the myriad of different data types, and the relationships that connect them all together.

The real mystery of the Orient(DB), however, is the relatively low (visible) adoption of a database that offers both innovation and reputedly staggering levels of performance (the claim is that it can store up to 150,000 records a second). The question isn't just why it hasn't managed to dent a market essentially owned by Neo4j, but why, on its own merits, more developers haven't opted for the database.

The answer may in the end be vaguely related to commercial drivers: outside of Europe, it seems OrientDB has struggled to create the kind of traction that would push greater levels of adoption. Or perhaps it is related to the considerable development and tuning the project requires for use in production; related to that, maybe OrientDB still has a way to go in terms of enterprise-grade support. For sure, it's hard to say what the deciding factor is here. In many ways it's a simple reiteration of the level of difficulty facing startups and new technologies endeavoring to gain adoption, and the road to that goal is typically a long one.

Regardless, what both Neo4j and OrientDB are valuable for is adapting both familiar and unfamiliar programming concepts in order to reimagine the way we represent, model, and interpret connections in data, mapping the information of the world.
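The property-graph idea behind Neo4j and OrientDB (nodes carrying properties, connected by named relationships) can be sketched in a few lines of plain Python; the people and relationships below are made up purely for illustration.

```python
# Toy property graph: node properties in a dict, edges as named triples.
nodes = {
    "alice": {"type": "person"},
    "bob": {"type": "person"},
    "neo4j": {"type": "database"},
}
edges = [
    ("alice", "KNOWS", "bob"),
    ("alice", "USES", "neo4j"),
    ("bob", "USES", "neo4j"),
]

def neighbours(node, relation):
    """Follow outgoing edges of one relationship type from a node."""
    return [dst for src, rel, dst in edges if src == node and rel == relation]

print(neighbours("alice", "KNOWS"))  # ['alice' knows 'bob']
```

A real graph database adds indexing, traversal languages, and persistence on top of this shape, but the core appeal described in the article, relationships as first-class data rather than foreign-key joins, is already visible in the sketch.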
The Rise of Data Science

Akram Hussain
30 Jun 2014
5 min read
The rise of big data and business intelligence has been one of the hottest topics to hit the tech world. Everybody who's anybody has heard of the term business intelligence, yet very few can actually articulate what it means. Nonetheless, it's something all organizations are demanding. But you must be wondering why, and how do you develop business intelligence? Enter data scientists!

The concept of data science was developed to work with large sets of structured and unstructured data. So what does this mean? Let me explain. Data science was introduced to explore and give meaning to the random sets of data floating around (we are talking about huge quantities here, that is, terabytes and petabytes), which are then analyzed to help identify areas of poor performance, areas of improvement, and areas to capitalize on. The concept was introduced for large data-driven organizations that required consultants and specialists to deal with complex sets of data. However, data science has been adopted very quickly by organizations of all shapes and sizes, so naturally an element of flexibility is required to fit data scientists into the modern workflow.

There seems to be a shortage of data scientists and an increase in the amount of data out there. The modern data scientist is one who can apply the necessary analytical skills to any organization, with or without large sets of data available. They are required to carry out data mining tasks to discover relevant, meaningful data. Yet smaller organizations often don't have enough capital to pay for someone experienced enough to derive such results. Because of the need for information, they might instead turn to a general data analyst, help them move towards data science, and provide them with tools, processes, and frameworks that allow for the rapid prototyping of models.
The natural flow of work would suggest data analysis comes after data mining, and in my opinion analysis is at the heart of data science. Learning languages like R and Python is fundamental to a good data scientist's toolkit. However, would a data scientist with a background in mathematics and statistics and little to no knowledge of R and Python still be as effective?

Now, the way I see it, data science is composed of four key topic areas crucial to achieving business intelligence: data mining, data analysis, data visualization, and machine learning. Data analysis can be carried out in many forms; it's essentially looking at data and understanding it in order to draw a factual conclusion from it (in simple terms). A data scientist may choose to use Microsoft Excel and VBA to analyze their data, and while the result wouldn't be as accurate, clean, or in-depth as using Python or R, it sure would be useful as a quick win with smaller sets of data. The point here is that starting with something like Excel doesn't mean it doesn't count as data science; it's just a different form of it, and more importantly it gives a good foundation from which to progress to tools like MySQL, R, Julia, and Python, as with time business needs grow and so do expectations of the level of analysis. In my opinion, a good data scientist is not one who knows one or two languages or tools, but one who is well versed in the majority of them and knows which language and skill set is best suited to the task in hand.

Data visualization is hugely important, as the numbers themselves tell a story, but when it comes to presenting the data to customers or investors, they're going to want to view all the different aspects of that data as quickly and easily as possible. Graphically representing complex data is one of the most desirable methods, but the way the data is represented varies depending on the tool used, for example R's ggplot2 or Python's Matplotlib.
Whether you're working for a small organization or a huge data-driven company, data visualization is crucial.

The world of artificial intelligence introduced the concept of machine learning, which has exploded onto the scene and, to an extent, is now fundamental to large organizations. The opportunity for organizations to move forward by understanding consumers' behaviour and matching their expectations has never been so valuable. Data scientists are required to learn complex algorithms and core concepts such as classification, recommenders, neural networks, and supervised and unsupervised learning techniques. And this is just touching the edges of this exciting field, which goes into much more depth, especially with emerging concepts such as deep learning.

To conclude, we covered the basic fundamentals of data science and what it means to be a data scientist. For all you R and Python developers (not forgetting any mathematical wizards out there), data science has been described as the 'sexiest job of the 21st century', as well as being handsomely rewarded. The rise in jobs for data scientists has without question exploded and will continue to do so; according to global management firm McKinsey & Company, there will be a shortage of 140,000 to 190,000 data scientists due to the continued rise of 'big data'.
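As a small taste of the classification idea mentioned above, here is a toy nearest-neighbour classifier in plain Python. The training points are invented, and real work would of course reach for a dedicated library such as scikit-learn rather than this sketch.

```python
# Toy 1-nearest-neighbour classifier: label a point with the label of
# its single closest labelled example (squared Euclidean distance).
def classify(point, labelled):
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest_point, nearest_label = min(
        labelled, key=lambda item: dist2(point, item[0]))
    return nearest_label

# Invented training data: 2D feature vectors with a preference label.
training = [((1.0, 1.0), "likes"), ((8.0, 9.0), "dislikes")]

print(classify((2.0, 1.5), training))  # 'likes'
```

Crude as it is, this is the same predict-from-past-behaviour pattern behind the recommendation systems at companies like Netflix and Amazon, just scaled down to two points and one distance function.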