Data | 0 articles | Tech News, Tutorials & Expert Insights

article-image-tableau-powerful-analytics-platform-interview-joshua-milligan

22 May 2018

9 min read

“Tableau is the most powerful and secure end-to-end analytics platform”: An interview with Joshua Milligan

22 May 2018

Tableau is one of the leading BI tools used by data science and business intelligence professionals today. You can not only use it to create powerful data visualizations but also use it to extract actionable insights for quality decision making thanks to the plethora of tools and features it offers. We recently interviewed Joshua Milligan, a Tableau Zen Master and the author of the book, Learning Tableau. Joshua takes us on an insightful journey into Tableau explaining why it is the Google of data visualization. He tells us all about its current and future focus areas such as Geospatial analysis and automating workflows, the exciting new features and tools such as Hyper, Tableau Prep among other topics. He also gives us a preview of things to come in his upcoming book. Author’s Bio Joshua Milligan, author of the bestselling book, Learning Tableau, has been with Teknion Data Solutions since 2004 and currently serves as a principal consultant. With a strong background in software development and custom .NET solutions, he brings a blend of analytical and creative thinking to BI solutions. Joshua has been named Tableau Zen Master, the highest recognition of excellence from Tableau Software not once but thrice. In 2017, Joshua competed as one of three finalists in the prestigious Tableau Iron Viz competition. As a Tableau trainer, mentor, and leader in the online Tableau community, he is passionate about helping others gain insights from their data. His work has been featured multiple times on Tableau Public’s Viz of the Day and Tableau’s website. He also shares frequent Tableau (and Maestro) tips, tricks, and advice on his blog VizPainter.com. Key Takeaways Tableau is perfectly tailored for business intelligence professionals given its extensive list of offerings from data exploration to powerful data storytelling. The drag-and-drop interface allows you to understand data visually thus enabling anyone to perform and share self service data analytics with colleagues in seconds. Hyper is new in-memory data engine designed for powerful query analytical processing on complex datasets. Tableau Prep, a new data preparation tool released with Tableau 2018.1, allows users to easily combine, shape, analyze and clean the data for compelling analytics. Tableau 2018.1 is expected to bring new geospatial tools, enterprise enhancements to Tableau Server, and new extensions and plugins to create interactive dashboards. Tableau users can expect to see artificial intelligence and machine learning becoming major features in both Tableau and Tableau Prep - thus deriving insights based on users behavior across the enterprise. Full Interview There are many enterprise software for business intelligence, how does Tableau compare against the others? What are the main reasons for Tableau's popularity? Tableau's paradigm is what sets it apart from others. It's not just about creating a chart or dashboard. It's about truly having a conversation with the data: asking questions and seeing instant results as you drag and drop to get new answers that raise deeper questions and then iterating. Tableau allows for a flow of thought through the entire cycle of analytics from data exploration through analysis to data storytelling. Once you understand this paradigm, you will flow with Tableau and do amazing things! There's a buzz in the developer's community that Tableau is the Google of data visualization. Can you list the top 3-5 features in Tableau 10.5 that are most appreciated by the community? How do you use Tableau in your day-to-day work? Tableau 10.5 introduced Hyper - a next-generation data engine that really lays a foundation for enterprise scaling as well as a host of exciting new features and Tableau 2018.1 builds on this foundation. One of the most exciting new features is a completely new data preparation tool - Tableau Prep. Tableau Prep complements Tableau Desktop and allows users to very easily clean, shape, and integrate their data from multiple sources. It’s intuitive and gives you a hands-on, instant feedback paradigm for data preparation in a similar way to what Tableau Desktop enables with data visualization. Tableau 2018.1 also includes new geospatial features that make all kinds of analytics possible. I’m particularly excited about support for the geospatial data types and functions in SQL Server which have allowed me to dynamically draw distances and curves on maps. Additionally, web authoring in Tableau Server is now at parity with Tableau Desktop. I use Tableau every day to help my clients see and understand their data and to make key decisions that drive new business, avoid risk, and find hidden opportunities. Tableau Prep makes it easier to access the data I need and shape it according to the analysis I’ll be doing. Tableau offers a wide range of products to suit their users' needs. How does one choose the right product from their data analytics or visualization need? For example, what are the key differences between Tableau Desktop, Server and Public? Are there any plans for a unified product for the Tableau newbie in the near future? As a consultant at Teknion Data Solutions (a Tableau Gold Partner), I work with clients all the time to help them make the best decisions around which Tableau offering best meets their needs. Tableau Desktop is the go-to authoring tool for designing visualizations and dashboards. Tableau Server, which can be hosted on premises or in the cloud, gives enterprises and organizations the ability to share and scale Tableau. It is now at near parity with Tableau Desktop in terms of authoring. Tableau Online is the cloud-based, Tableau managed solution. Tableau Public allows for sharing public visualizations and dashboards with a world-wide audience. How good is Tableau for Self-Service Analytics / automating workflows? What are the key challenges and limitations? Tableau is amazing for this. Combined with the new data prep tool - Tableau Prep - Tableau really does offer users, across the spectrum (from business users to data scientists), the ability to quickly and easily perform self-service analytics. As with any tool, there are definitely cases which require some expertise to reach a solution. Pulling data from an API or web-based source or even sometimes structuring the data in just the right way for the desired analysis are examples that might require some know-how. But even there, Tableau has the tools that make it possible (for example, the web data connector) and partners (like Teknion Data Solutions) to help put it all together. In the third edition of Learning Tableau, I expand the scope of the book to show the full cycle of analytics from data prep and exploration to analysis and data storytelling. Expect updates on new features and concepts (such as the changes Hyper brings), a new chapter focused on Tableau Prep and strategies for shaping data to perform analytics, and new examples throughout that span multiple industries and common analytics questions. What is the development roadmap for Tableau 2018.1? Are we expecting major feature releases this year to overcome some of the common pain areas in business intelligence? I'm particularly excited about Tableau 2018.1. Tableau hasn't revealed everything yet, but things such as new geospatial tools and features, enterprise enhancements to Tableau Server, the new extensions API, new dashboard tools, and even a new visualization type or two look to be amazing! Tableau is working a lot in the geospatial domain coming up with new plugins/connectors and features. Can we expect Tableau to further strengthen their support for spatial data? What are the other areas/domains that Tableau is currently focused on? I couldn't say what the top 3-5 areas are - but you are absolutely correct that Tableau is really putting some emphasis on geospatial analytics. I think the speed and power of the Hyper data engine makes a lot of things like this possible. Although I don't have any specific knowledge beyond what Tableau has publicly shared, I wouldn't be surprised to see some new predictive and statistical models and expansion of data preparation abilities. What's driving Tableau to Cloud? Can we expect more organizations adopting Tableau on Cloud? There has been a major shift to the cloud by organizations. The ability to manage, scale, ensure up-time, and save costs are driving this move and that in turn makes Tableau's cloud-based offerings very attractive. What does Tableau's future hold, according to you? For example, do you see machine learning and AI-powered analytics platform transformation? Or can we expect Tableau entering the IoT and IIoT domain? Tableau demonstrated a concept around NLQ at the Tableau Conference and has already started building in a few machine learning features. For example, Tableau now recommends joins based on what is learns from behavior of users across the enterprise. Tableau Prep has been designed from the ground-up with machine learning in mind. I fully expect to see AI and machine learning become major features in both Tableau and Tableau Prep – but true to Tableau’s paradigm, they will complement the work of the analyst and allow for deeper insight without obscuring the role that humans play in reaching that insight. I'm excited to see what is announced next! Give us a sneak peek into the book you are currently writing "Learning Tableau 2018.1, Third Edition", expected to be released in the 3rd Quarter this year. What should our readers get most excited about as they wait for this book? Although the foundational concepts behind learning Tableau remain the same, I'm excited about the new features that have been released or will be as I write. Among these are a couple of game-changers such as the new geospatial features and the new data prep tool: Tableau Prep. In addition to updating the existing material, I'll definitely have a new chapter or two covering those topics! If you found this interview to be interesting, make sure you check out other insightful articles on business intelligence: Top 5 free Business Intelligence tools [Opinion] Tableau 2018.1 brings new features to help organizations easily scale analytics [News] Ride the third wave of BI with Microsoft Power BI [Interview - Part 1] Unlocking the secrets of Microsoft Power BI [Interview - Part 2] How Qlik Sense is driving self-service Business Intelligence [Interview]

0
0
4510

article-image-pandas-answers-data-analysis-problems-interview

Amey Varangaonkar

24 Apr 2018

9 min read

“Pandas is an effective tool to explore and analyze data”: An interview with Theodore Petrou

Amey Varangaonkar

24 Apr 2018

9 min read

It comes as no surprise to many developers, Python has grown to become the preferred language of choice for data science. One of the reasons for its staggering adoption in the data science community is the rich suite of libraries for effective data analysis and visualization - allowing you to extract useful, actionable insights from your data. Pandas is one such Python-based library, that provides a solid platform to carry out high-performance data analysis. Ted Petrou is a data scientist and the founder of Dunder Data, a professional educational company focusing on exploratory data analysis. Before founding Dunder Data, Ted was a data scientist at Schlumberger, a large oil services company, where he spent the vast majority of his time exploring data. Ted received his Master’s degree in statistics from Rice University and has used his analytical skills to play poker professionally. He taught math before becoming a data scientist. He is a strong supporter of learning through practice and can often be found answering questions about pandas on Stack Overflow. In this exciting interview, Ted takes us through an insightful journey into pandas - Python’s premier library for exploratory data analysis, and tells us why it is the go-to library for many data scientists to discover new insights from their data. Key Takeaways Data scientists are in the business of making predictions. To make the right predictions you must know how to analyse your data. to perform data analysis efficiently, you must have a good understanding of the concepts as well be proficient using the tools like pandas. Pandas Cookbook contains step by step solutions to the master the pandas syntax while going through the data exploration journey (missteps et al) to solve the most common and not-so-common problems in data analysis. Unlike R which has several different packages for different data science tasks, pandas offers all data analysis capabilities as a single large Python library. Pandas has good time-series capabilities, making it well-suited for building financial applications. That said, its best use is in data exploration - to find interesting discoveries within the data. Ted says beginners in data science should focus on learning one data science concept at a time and master it thoroughly, rather than getting an overview of multiple concepts at once. Let us start with a very fundamental question - Why is data crucial to businesses these days? What problems does it solve? All businesses, from a child’s lemonade stand to the largest corporations, must account for all their operations in order to be successful. This accounting of supplies, transactions, people, etc., is what we call ‘data’ and gives us historical records of what has transpired in a business. Without this data, we would be reduced to oral history or what humans used for accounting before the advent of writing systems. By collecting and analyzing data, we gain a deeper understanding of how the business is progressing. In the most basic instances, such as with a child’s lemonade stand, we know how many glasses of lemonade have been sold, how much was spent on supplies, and importantly whether the business is profitable. This example is incredibly trivial, but it should be noted that such simple data collection is not something that comes naturally to humans. For instance, many people have a desire to lose weight at some point in their life, but fail to accurately record their daily weight or calorie intake in any regular manner, despite the large number of free services available to help with this. There are so many Python-based libraries out there which can be used for a variety of data science tasks. Where does pandas fit into this picture? pandas is the most popular library to perform the most fundamental tasks of a data analysis. Not many libraries can claim to provide the power and flexibility of pandas for working with tabular data. How does pandas help data scientists in overcoming different challenges in data analysis? What advantages does it offer over domain-specific languages such as R? One of the best reasons to use pandas is because it is so popular. There are a tremendous amount of resources available for it, and an excellent database of questions and answers on StackOverflow. Because the community is so large, you can almost always get an immediate answer to your problem. Comparing pandas to R is difficult as R is an entire language that provides tools for a wide variety of tasks. Pandas is a single large Python library. Nearly all the tasks capable in pandas can be replicated with the right library in R. We would love to hear your journey as a data scientist. Did having a master's degree in statistics help you in choosing this profession? Also tell us something about how you leveraged analytics in professional Poker! My journey to becoming a “data scientist” began long before the term even existed. As a math undergrad, I found out about the actuarial profession, which appealed to me because of its meritocratic pathway to success. Because I wasn’t certain that I wanted to become an actuary, I entered a Ph.D. program in statistics in 2004, the same year that an online poker boom began. After a couple of unmotivating and half-hearted attempts at learning probability theory, I left the program with a masters degree to play poker professionally. Playing poker has been by far the most influential and beneficial resource for understanding real-world risk. Data scientists are in the business of making predictions and there’s no better way to understand the outcomes of predictions you make than by exposing yourself to risk. Your recently published 'pandas Cookbook' has received a very positive response from the readers. What problems in data analysis do you think this book solves? I worked extremely hard to make pandas Cookbook the best available book on the fundamentals of data analysis. The material was formulated by teaching dozens of classes and hundreds of students with my company Dunder Data and my meetup group Houston Data Science. Before getting to what makes a good data analysis, it’s important to understand the difference between the tools available to you and the theoretical concepts. Pandas is a tool and is not much different than a big toolbox in your garage. It is possible to master the syntax of pandas without actually knowing how to complete a thorough data analysis. This is like knowing how to use all the individual tools in your toolbox without knowing how to build anything useful, such as a house. Similarly, understanding theoretical concepts such as ‘split-apply-combine’ or ‘tidy data’ without knowing how to implement them with a specific tool will not get you very far. Thus, in order to make a good data analysis, you need to understand both the tools and the concepts. This is what pandas Cookbook attempts to provide. The syntax of pandas is learned together with common theoretical concepts using real-world datasets. Your readers loved the way you have structured the book and the kind of datasets, examples and functions you have chosen to showcase pandas in all its glory. Was is experience, intuition, or observations that led to this fantastic writing insight? The official pandas documentation is very thorough (well over 1,000 pages) but does not present the features as you would see them in a real data analysis. Most of the operations are shown in isolation on contrived or randomly generated data. In a typical data analysis, it is common for many pandas operations to be called one after another. The recipes in pandas Cookbook expose this pattern to the reader, which will help them when they are completing an actual data analysis. This is not meant to disparage the documentation as I have read it multiple times myself and recommend reading it along with pandas Cookbook. Quantitative finance is one domain where pandas finds major application. How does pandas help in developing better financial applications? In what other domains does pandas find important applications and how? Pandas has good time-series capabilities which makes it well-suited for financial applications. It’s ability to group by specific time periods is a very useful feature. In my opinion, pandas most important application is with exploratory data analysis. It is possible for an analyst to quickly use pandas to find interesting discoveries within the data and visualize the results with either matplotlib or Seaborn. This tight integration, coupled with the Jupyter Notebook interface make for an excellent ecosystem for generating and reporting results to others. Please tell us more about 'pandas Cookbook'. What in your opinion are the 3 major takeaways from it? Are there any prerequisites needed to get the most out of the book? The only prerequisite for pandas Cookbook is a fundamental understanding of the Python programming language. The recipes progress in difficulty from chapter to chapter and for those with no pandas experience, I would recommend reading it cover to cover. One of the major takeaways from the book is to be able to write modern and idiomatic pandas code. Pandas is a huge library and there are always multiple ways of completing each task. This is more of a negative than a positive as beginners notoriously write poorly written and inefficient code. Another takeaway is the ability to probe and investigate data until you find something interesting. Many of the recipes are written as if the reader is experiencing the discovery process alongside the author. There are occasional (and purposeful) missteps in some recipes to show how often the right course of action is not always known. Lastly, I wanted to teach common theoretical concepts of doing a data analysis while simultaneously learning pandas syntax. Finally, what advice would you have for beginners in data science? What things should they keep in mind while designing and developing their data science workflow? Are there any specific resources which they could refer to, apart from this book of course? For those just beginning their data science journey, I would suggest keeping their ‘universe small’. This means concentrating on as few things as possible. It is easy to get caught up with a feeling that you need to keep learning as much as possible. Mastering a few subjects is much better than having a cursory knowledge of many. If you found this interview to be intriguing, make sure you check out Ted’s pandas Cookbook which presents more than 90 unique recipes for effective scientific computation and data analysis.

0
1
4301

article-image-python-machine-learning-expert-interviews

Richard Gall

13 Mar 2018

7 min read

Why is Python so good for AI and Machine Learning? 5 Python Experts Explain

Richard Gall

13 Mar 2018

7 min read

Python is one of the best programming languages for machine learning, quickly coming to rival R's dominance in academia and research. But why is Python so popular in the machine learning world? Why is Python good for AI? Mike Driscoll spoke to five Python experts and machine learning community figures about why the language is so popular as part of the book Python Interviews. Programming is a social activity - Python's community has acknowledged this best Glyph Lefkowitz (@glyph), founder of Twisted, a Python network programming framework, awarded The PSF’s Community Service Award in 2017 AI is a bit of a catch-all term that tends to mean whatever the most advanced areas in current computer science research are. There was a time when the basic graph-traversal stuff that we take for granted was considered AI. At that time, Lisp was the big AI language, just because it was higher-level than average and easier for researchers to do quick prototypes with. I think Python has largely replaced it in the general sense because, in addition to being similarly high-level, it has an excellent third-party library ecosystem, and a great integration story for operating system facilities. Lispers will object, so I should make it clear that I'm not making a precise statement about Python's position in a hierarchy of expressiveness, just saying that both Python and Lisp are in the same class of language, with things like garbage collection, memory safety, modules, namespaces and high-level data structures. In the more specific sense of machine learning, which is what more people mean when they say AI these days, I think there are more specific answers. The existence of NumPy and its accompanying ecosystem allows for a very research-friendly mix of high-level stuff, with very high-performance number-crunching. Machine learning is nothing if not very intense number-crunching. "...Statisticians, astronomers, biologists, and business analysts have become Python programmers and have improved the tooling." The Python community's focus on providing friendly introductions and ecosystem support to non-programmers has really increased its adoption in the sister disciplines of data science and scientific computing. Countless working statisticians, astronomers, biologists, and business analysts have become Python programmers and have improved the tooling. Programming is fundamentally a social activity and Python's community has acknowledged this more than any other language except JavaScript. Machine learning is a particularly integration-heavy discipline, in the sense that any AI/machine learning system is going to need to ingest large amounts of data from real-world sources as training data, or system input, so Python's broad library ecosystem means that it is often well-positioned to access and transform that data. Python allows users to focus on real problems Marc-Andre Lemburg (@malemburg), co-founder of The PSF and CEO of eGenix Python is very easy to understand for scientists who are often not trained in computer science. It removes many of the complexities that you have to deal with, when trying to drive the external libraries that you need to perform research. After Numeric (now NumPy) started the development, the addition of IPython Notebooks (now Jupyter Notebooks), matplotlib, and many other tools to make things even more intuitive, Python has allowed scientists to mainly think about solutions to problems and not so much about the technology needed to drive these solutions. "Python is an ideal integration language which binds technologies together with ease." As in other areas, Python is an ideal integration language, which binds technologies together with ease. Python allows users to focus on the real problems, rather than spending time on implementation details. Apart from making things easier for the user, Python also shines as an ideal glue platform for the people who develop the low-level integrations with external libraries. This is mainly due to Python being very accessible via a nice and very complete C API. Python is really easy to use for math and stats-oriented people Sebastian Raschka (@rasbt), researcher and author of Python Machine Learning I think there are two main reasons, which are very related. The first reason is that Python is super easy to read and learn. I would argue that most people working in machine learning and AI want to focus on trying out their ideas in the most convenient way possible. The focus is on research and applications, and programming is just a tool to get you there. The more comfortable a programming language is to learn, the lower the entry barrier is for more math and stats-oriented people. Python is also super readable, which helps with keeping up-to-date with the status quo in machine learning and AI, for example, when reading through code implementations of algorithms and ideas. Trying new ideas in AI and machine learning often requires implementing relatively sophisticated algorithms and the more transparent the language, the easier it is to debug. The second main reason is that while Python is a very accessible language itself, we have a lot of great libraries on top of it that make our work easier. Nobody would like to spend their time on reimplementing basic algorithms from scratch (except in the context of studying machine learning and AI). The large number of Python libraries which exist, help us to focus on more exciting things than reinventing the wheel. Python is also an excellent wrapper language for working with more efficient C/C++ implementations of algorithms and CUDA/cuDNN, which is why existing machine learning and deep learning libraries run efficiently in Python. This is also super important for working in the fields of machine learning and AI. To summarize, I would say that Python is a great language that lets researchers and practitioners focus on machine learning and AI and provides less of a distraction than other languages. Python has so many features that are attractive for scientific computing Luciano Ramalho (@ramalhoorg) technical principal at ThoughtWorks and fellow of The PSF The most important and immediate reason is that the NumPy and SciPy libraries enable projects such as scikit-learn, which is currently almost a de facto standard tool for machine learning. The reason why NumPy, SciPy, scikit-learn, and so many other libraries were created in the first place is because Python has some features that make it very attractive for scientific computing. Python has a simple and consistent syntax which makes programming more accessible to people who are not software engineers. "Python benefits from a rich ecosystem of libraries for scientific computing." Another reason is operator overloading, which enables code that is readable and concise. Then there's Python's buffer protocol (PEP 3118), which is a standard for external libraries to interoperate efficiently with Python when processing array-like data structures. Finally, Python benefits from a rich ecosystem of libraries for scientific computing, which attracts more scientists and creates a virtuous cycle. Python is good for AI because it is strict and consistent Mike Bayer (@zzzeek), Senior Software Engineer at Red Hat and creator of SQLAlchemy What we're doing in that field is developing our math and algorithms. We're putting the algorithms that we definitely want to keep and optimize into libraries such as scikit-learn. Then we're continuing to iterate and share notes on how we organize and think about the data. A high-level scripting language is ideal for AI and machine learning, because we can quickly move things around and try again. The code that we create spends most of its lines on representing the actual math and data structures, not on boilerplate. A scripting language like Python is even better, because it is strict and consistent. Everyone can understand each other's Python code much better than they could in some other language that has confusing and inconsistent programming paradigms. The availability of tools like IPython notebook has made it possible to iterate and share our math and algorithms on a whole new level. Python emphasizes the core of the work that we're trying to do and completely minimizes everything else about how we give the computer instructions, which is how it should be. Automate whatever you don't need to be thinking about. Getting Started with Python and Machine Learning 4 ways to implement feature selection in Python for machine learning Is Python edging R out in the data science wars?

0
1
11250

article-image-mongodb-popular-nosql-database-today

Amey Varangaonkar

23 Jan 2018

12 min read

Why MongoDB is the most popular NoSQL database today

Amey Varangaonkar

23 Jan 2018

12 min read

If NoSQL is the king, MongoDB is surely its crown jewel. With over 15 million downloads and counting, MongoDB is the most popular NoSQL database today, empowering users to query, manipulate and find interesting insights from their data. Alex Giamas is a Senior Software Engineer at the Department for International Trade, UK. Having worked as a consultant for various startups, he is an experienced professional in systems engineering, as well as NoSQL and Big Data technologies. Alex holds an M. Sc., from Carnegie Mellon University in Information Networking and has attended professional courses in Stanford University. He is a MongoDB-certified developer and a Cloudera-certified developer for Apache Hadoop & Data Science essentials. Alex has worked with a wide array of NoSQL and Big Data technologies, and built scalable and highly available distributed software systems in C++, Java, Ruby and Python. In this insightful interview with MongoDB expert Alex Giamas, we talk about all things related to MongoDB - from why NoSQL databases gained popularity to how MongoDB is making developers’ and data scientists’ work easier and faster. Alex also talks about his book Mastering MongoDB 3.x, and how it can equip you with the tools to become a MongoDB expert! Key Takeaways NoSQL databases have grown in popularity over the last decade because they allow users to query their data without having to learn and master SQL. The rise in popularity of the Javascript-based MEAN stack meant many programmers now prefer MongoDB as their choice of database. MongoDB has grown from being just a JSON data store to become the most popular NoSQL database solution with efficient data manipulation and administration capabilities. The sharding and aggregation framework, coupled with document validations, fine-grained locking, a mature ecosystem of tools and a vibrant community of users are some of the key reasons why MongoDB is the go-to database for many. Database schema design, data modeling, backup and security are some of the common challenges faced by database administrators today. Mastering MongoDB 3.x focuses on these common pain points of the database administrators and shows them how to build robust, scalable database solutions with ease. NoSQL databases seem to have taken the world by storm, and many people now choose various NoSQL database solutions over relational databases. What do you think is the reason for this rise in popularity? That's an excellent question. There are several factors contributing to the rise in popularity for NoSQL databases. Relational databases have served us for 30 years. At some point we realised that the one size fits all model is no longer applicable. While “software is eating the world” as Marc Andreessen has famously written, the diversity and breadth of use cases we use software for has brought an unprecedented specialisation in the level of solutions to our problems. Graph databases, column-based databases and of course document-oriented databases like MongoDB are in essence specialised solutions to particular database problems. If our problem fits the document-oriented use case, it makes more sense to use the right tool for the problem (e.g. MongoDB) than a generic one-size-fits-all RDBMS. Another contributing factor to the rise of NoSQL databases and especially MongoDB is the rise of the MEAN stack, which means Javascript developers can now work from frontend to backend and database. Last but not the least, more than a generation of developers have struggled with SQL and its several variations. The promise that one does not need to learn and master SQL to extract data from the database but can rather do it using Javascript or other more developer friendly tools is just too exciting to pass on. MongoDB struck gold in this aspect, as Javascript is one of the most commonly used programming languages. Using Javascript for querying also opened up database querying to the front end developers which I believe has driven adoption as well. MongoDB is one of the most popular NoSQL databases out there today, and finds application in web development as well as Big Data processing. How does MongoDB aid in effective analytics? In the past few years we have seen the explosive growth of generated data. 80% of the world’s data has been generated in the past 3 years and this will continue to happen even more in the near future with the rise of IoT. This data needs to be stored and most importantly analysed to derive insights and actions. The answer to this problem has been to separate the transactional loads from the analytical loads into OLTP and OLAP databases respectively. Hadoop ecosystem has several frameworks that can store and analyse data. The problem with Hadoop data warehouses/data lakes however is threefold. You need experts to analyse data, they are expensive and it’s difficult to get quickly the answers to your questions. MongoDB bridges this gap by offering efficient analytics capabilities. MongoDB can help developers and technical people get quick insights from data that can help define the direction of research for the data scientists working on the data lake. By utilising tools like the new charts or the BI connector, data warehousing and MongoDB are converging. MongoDB does not aim to substitute Hadoop-based systems but rather complement them and decrease the time to market for data-driven solutions. You have been using MongoDB since 2009, way back when it was in its 1.x version. How has the database has evolved over the years? When I started using MongoDB, it was not much more than a JSON data store. It’s amazing how far MongoDB has come in these 9 years in every aspect. Every piece of software has to evolve and adapt to the always changing environment. MongoDB started off as the JSON data store that is easy to setup and use while being blazingly fast with some caveats. The turning point for MongoDB early in its evolution was introducing sharding. Challenging as it may be to choose the right shard key, being able to horizontally scale using commodity hardware is the feature that has been appreciated the most by developers and architects throughout all these years. The introduction of aggregation framework was another turning point for MongoDB since it allowed developers to build data pipelines using MongoDB data, reducing time to market. Geospatial related features were there from an early point in time and actually one of MongoDB’s earliest and most visible customers, FourSquare was a vivid user of geospatial features in MongoDB. Overall, with time MongoDB has matured and is now a robust database for a wide set of use cases. Document validations, fine grained locking, a mature ecosystem of tools around it and a vibrant community means that no matter the language, state of development, startup or corporate environment, MongoDB can be evaluated as the database choice. There have been of course features and directions that didn’t end up as well as we were originally hoping for. A striking example is the MongoDB MapReduce framework which never lived up to the expectations of developers using MapReduce via Hadoop and has gradually been superseded by the more advanced and more developer-friendly Aggregation framework. What do you think are the most striking features of MongoDB? How does it help you in your day to day activities as a Senior Software Engineer? In my day to day development tasks I almost always use the Aggregation framework. It helps to quickly prototype a pipeline that can transform my data to a format that I can then collaborate with the data scientists to derive useful insights in a fraction of the time needed by traditional tools. Day to day or sprint to the next sprint - what you want from any technology is to be reliable and not get in your way but rather help you achieve the business goals. With MongoDB we can easily store data in JSON format, process it, analyse it and pass it on to different frontend or backend systems without much hassle. What are the different challenges that MongoDB developers and architects usually face while working with MongoDB? How does your book 'Mastering MongoDB 3.x' help, in this regard? The major challenge developers and architects face when choosing to work with MongoDB is the database design. Irrespective of whether we come from an RDBMS or a NoSQL background, designing the database such that it can solve our current and future problems is a difficult task. Having been there and struggled with it in the past, I have put emphasis on how someone coming from a relational background can model different relationships in MongoDB. I have also included easy to understand and follow checklists around different aspects of MongoDB. Backup and security is another challenge that users often face. Backups are many times ignored until it’s too late. In my book I identify all available options and the tradeoffs they come with, including cloud-based options. Security on the other hand is becoming an ever increasing concern for computing systems with data leaks and security breaches happening more often. I have put an emphasis on security both in the relevant chapters and also across most chapters by highlighting common security pitfalls and promoting secure practices wherever possible. MongoDB has commanded a significant market share in the NoSQL databases domain for quite some time now, highlighting its usefulness and viability in the community. That said, what are the 3 areas where MongoDB can get better, in order to stay ahead of its competition? MongoDB has conquered the NoSQL space in terms of popularity. The real question is how/if NoSQL can increase its market share in the overall database market. The most important area of improvement is interoperability. What developers get with popular RDBMS is not only the database engine itself, but also easy ways to integrate it with different systems, from programming frameworks to Big Data and analytics systems. MongoDB could invest heavier in building these libraries that can make a developer’s life easier. Real-time analytics is another area with huge potential in the near future. With IoT rapidly increasing the data volume, data analysts need to be able to quickly derive insights from data. MongoDB can introduce features to address this problem. Finally, MongoDB could improve by becoming more tunable in terms of the performance/consistency tradeoff. It’s probably a bit too much to ask from a NoSQL database to support transactions as this is not what it was designed to be from the very beginning, but it would greatly increase the breadth of use cases if we could sparingly link up different documents and treat them as one, even with severed performance degradation. Artificial Intelligence and Machine Learning are finding useful applications in every possible domain today. Although it's a database, do you foresee MongoDB going the Oracle way and incorporating features to make it AI-compatible? Throughout the past few years, algorithms, processing power and the sheer amount of data that we have available have brought a renewed trust in AI. It is true that we use ML algorithms in almost every problem domain, which is why every vendor is trying to make the developer’s life easier by making their products more AI-friendly. It’s only natural for MongoDB to do the same. I believe that not only MongoDB but every database vendor will have to gradually focus more on how to serve AI effectively, and this will become a key part of their strategy going ahead. Please tell us something more about your book 'Mastering MongoDB 3.x'. What are the 3 key takeaways for the readers? Are there any prerequisites to get the most out of the book? First of all, I would like to say that as a “Mastering” level book we assume that readers have some basic understanding of both MongoDB and programming in general. That being said, I encourage readers to start reading the book and try to pick up the missing parts along way. It’s better to challenge yourself than the other way around. As for the most important takeaways, in no specific order of importance: Know your problem. It’s important to understand and analyse as much as possible the problem that you are trying to solve. This will dictate everything, from data structures, indexing, to database design decisions, to technology choices. On the other hand, if the problem is not well defined then this may be the chance to shine for MongoDB as a database choice as we can store data with minimal hassle. Be ready to scale ahead of time. Whether that is replication or sharding, make sure that you have investigated and identified the correct design and implementation steps so that you can scale when needed. Trying to add an extra shard when load has already peaked in the existing shards is neither fun, nor easy to do. Use aggregation. Being able to transform data in MongoDB before extracting it for processing in an external database is a really important feature and should be used whenever possible, instead of querying large datasets and transforming their data in our application server. Finally, what advice would you give to beginners who would like to be an expert in using MongoDB? How would the learning path to mastering MongoDB look like? What are the key things to focus on in order to master data analytics using MongoDB? To become an expert in MongoDB, one should start by understanding its history and roots. They should understand and master schema design and data modelling. After mastering data modelling, the next step would be to master querying - both CRUD and more advanced concepts. Understanding the aggregation framework and how or when to index would be the next step. With this foundation, one can then move on to cross-cutting concerns like monitoring, backup and security, understanding the different storage engines that MongoDB supports and how to use MongoDB with Big Data. All this knowledge should then provide a strong foundation to move on to the scaling aspects like replication and sharding, with the goal of providing effective fault tolerance and high availability systems. Mastering MongoDB 3.x explains these topics in this order with the intention of getting from beginner to expert in a structured and easy to follow and understand way.

0
0
14472

article-image-statistics-data-science-interview-james-miller

Amey Varangaonkar

09 Jan 2018

9 min read

Why You Need to Know Statistics To Be a Good Data Scientist

Amey Varangaonkar

09 Jan 2018

9 min read

Data Science has popularly been dubbed as the sexiest job of the 21st century. So much so that everyone wants to become a data scientist. But what do you need to get started with data science? Do you need to have a degree in statistics? Why is having sound knowledge of statistics so important to be a good data scientist? We seek answers to these questions and look at data science through a statistical lens, in an interesting conversation with James D. Miller. [author title="James D. Miller"]James is an IBM certified expert and a creative innovator. He has over 35 years of experience in applications and system design & development across multiple platforms and technologies. Jim has also been responsible for managing and directing multiple resources in various management roles including project and team leader, lead developer and applications development director. He is the author or several popular books such as Big Data Visualization, Learning IBM Watson Analytics, Mastering Splunk, and many more. In addition, Jim has written a number of whitepapers and continues to write on a number of relevant topics based upon his personal experiences and industry best practices.[/author] In this interview, we look at some of the key challenges faced by many while transitioning from a data developer role to a data scientist. Jim talks about his new book, Statistics for Data Science and discusses how statistics plays a key role when it comes to finding unique, actionable insights from data in order to make crucial business decisions. Key Takeaways - Statistics for Data Science Data science attempts to uncover the hidden context of data by going beyond answering generic questions such as ‘what is happening’, to tackling questions such as ‘what should be done next’. Statistics for data science cultivates 'structured thinking' in one. For most data developers who are transitioning to the role of data scientist, the biggest challenge often comes in calibrating their thought process - from being data design-driven to more insight-driven Having a sound knowledge of statistics differentiates good data scientists from mediocre ones - it helps them accurately identify patterns in data that can potentially cause changes in outcomes Statistics for Data Science attempts to bridge the learning gap between database development and data science by implementing the statistical concepts and methodologies in R to build intuitive and accurate data models. These methodologies and their implementations are easily transferable to other popular programming languages such as Python. While many data science tasks are being automated these days using different tools and platforms, the statistical concepts and methodologies will continue to form their backbone. Investing in statistics for data science is worth every penny! Full Interview Everyone wants to learn data science today as it is one of the most in-demand skills out there. In order to be a good data scientist, having a strong foundation in statistics has become a necessity. Why do you think is this the case? What importance does statistics have in data science? With Statistics, it has always been about "explaining" (data). With data science, the objective is going beyond questions such as "what happened?" and the "what is happening?" to try to determine "what should be done next?". Understanding the fundamentals of statistics allows one to apply "structured thinking" to interpret knowledge and insights sourced from statistics. You are a seasoned professional in the field of Data Science with over 30 years of experience. We would like to know how your journey in Data Science began, and what changes you have observed in this domain over the 3 decades. I have been fortunate to have had a career that has traversed many platforms and technological trends (in fact over 37 years of diversified projects). Starting as a business applications and database developer, I have almost always worked for the office of finance. Typically, these experiences started with the collection - and then management of - data to be able to report results or assess performance. Over time, the industry has evolved and this work as becoming a “commodity” – with many mature tool options available and plenty of seasoned professionals available to perform the work. Businesses have now become keen to “do something more” with their data assets and are looking to move into the world of data science. The world before us offers enormous opportunities for those not only with a statistical background but someone with a business background that understands and can apply the statistical data sciences to identify new opportunities or competitive advantages. What are the key challenges involved in the transition from being a data developer to becoming a data scientist? How does the knowledge of statistics affect this transition? Does one need a degree in statistics before jumping into Data Science? Someone who has been working actively with data already has a “head start” in that they have experience with managing and manipulating data and data sources. They would also most likely have programming experience and possess the ability to apply logic to data. The challenge will be to “retool” their thinking from data developer to data scientist – for example, going from data querying to data mining. Happily, there is much that the data developer “already knows” about data science and my book Statistics for Data Science attempts to “point out” the skills and experiences that the data developer will recognize as the same or at least have significant similarities. You will find that the field of data science is still evolving and the definition of “data scientist” depends upon the industry, project or organization you are referring to. This means that there are many roles that may involve data science with each having perhaps quite different prerequisites (such as a statistical degree). You have authored a lot of books such as Big Data Visualization, Learning IBM Watson Analytics, etc. with the latest being Statistics for Data Science. Please tell us something about your latest book. The latest book, “Statistics for Data Science”, looks to point out the synergies between a data developer and data scientist and hopes to evolve the data developers thinking “beyond database structures”, but also introduces key concepts and terminologies such as probability, statistical inference, model fitting, classification, regression and more, that can be used to journey into statistics and data science. How is statistics used when it comes to cleaning and pre-processing the data? How does it help the analysis? What other tasks can these statistical techniques be used for? Simple examples of the use of statistics when cleaning and/or pre-processing of data (by a data developer) include data-typing, Min/Max limitation, addressing missing values and so on. A really good opportunity for the use of statistics in data or database development is while modeling data to design appropriate storage structures. Using statistics in data development applies a methodical, structured approach to the process. The use of statistics can be a competitive advantage to any data development project. In the book, for practical purposes, you have shown the implementation of the different statistical techniques using the popular R programming language. Why do you think R is favored by the statisticians so much? What advantages does it offer? R is a powerful, feature-rich, extendable free language with many, many easy to use packages free for download. In addition, R has “a history” within the data science industry. R is also quite easy to learn and be productive with quickly. It also includes many graphics and other abilities “built-in”. Do you foresee a change in the way statistics for data science is used in the near future? In other words, will the dependency on statistical techniques for performing different data science tasks reduce? Statistics will continue to be important to data science. I do see more “automation” of more and more data science tasks through the availability of “off the shelf” packages that can be downloaded and installed and used. Also, the more popular tools will continue to incorporate statistical functions over time. This will allow for the main-streaming of statistics and data science into even more areas of life. The key will be for the user to have an understanding of the key statistical concepts and uses. What advice would you like to give to - 1 Those transitioning from the developer to the data scientist role, and 2. Absolute beginners, who want to take up statistics and data science as a career option? Buy my book! But seriously, keep reading and researching. Expose yourself to as much statistics and data science use cases and projects a possible. Most importantly, as you read about the topic, look for similarities between what you do today and what you are reading about. How does it relate? Always look for opportunities to use something that is new to you to do something you do routinely today. Your book 'Statistics for Data Science' highlights different statistical techniques for data analysis and finding unique insights from data. What are the three key takeaways for the readers, from this book? Again, I see (and point out in the book) key synergies between data or database development and data science. I would urge the reader – or anyone looking to move from data developer to data scientist - to learn through these and perhaps additional examples he or she may be able to find and leverage on their own. Using this technique, one can perhaps navigate laterally, rather than losing the time it would take to “start over” at the beginning (or bottom?) of the data science learning curve. Additionally, I would suggest to the reader that time taken to get acquainted with the R programs and the logic used for statistical computations (this book should be a good start) is time well spent.

0
0
3359

article-image-why-choose-ibm-spss-statistics-r

Amey Varangaonkar

22 Dec 2017

9 min read

Why choose IBM SPSS Statistics over R for your data analysis project

Amey Varangaonkar

22 Dec 2017

9 min read

Data analysis plays a vital role in organizations today. It enables effective decision-making by addressing fundamental business questions based on the understanding of the available data. While there are tons of open source and enterprise tools for conducting data analysis, IBM SPSS Statistics has emerged as a popular tool among statistical analysts and researchers. It offers them the perfect platform to quickly perform data exploration and analysis, and share their findings with ease. [author title=""] Dr. Kenneth Stehlik-Barry Kenneth joined SPSS as Manager of Training in 1980 after using SPSS for his own research for several years. He has used SPSS extensively to analyze and discover valuable patterns that can be used to address pertinent business issues. He received his PhD in Political Science from Northwestern University and currently teaches in the Masters of Science in Predictive Analytics program there. Anthony J. Babinec Anthony joined SPSS as a Statistician in 1978 after assisting Norman Nie, the founder of SPSS, at the University of Chicago. Anthony has led a business development effort to find products implementing technologies such as CHAID decision trees and neural networks. Anthony received his BA and MA in Sociology with a specialization in Advanced Statistics from the University of Chicago and is on the Board of Directors of the Chicago Chapter of the American Statistical Association, where he has served in different positions including the President. [/author] In this interview, we take a look at the world of statistical data analysis and see how IBM SPSS Statistics makes it easier to derive business sense from data. Kenneth and Anthony also walk us through their recently published book - Data Analysis with IBM SPSS Statistics - and tell us how it benefits aspiring data analysts and statistical researchers. Key Takeaways - IBM SPSS Statistics IBM SPSS Statistics is a key offering of IBM Analytics - providing an integrated interface for statistical analysis on-premise and on the cloud SPSS Statistics is a self-sufficient tool - it does not require you to have any knowledge of SQL or any other scripting language SPSS Statistics helps you avoid the 3 most common pitfalls in data analysis, i.e. handling missing data, choosing the best statistical method for analysis and understanding the results of the analysis R and Python are not direct competitors to SPSS Statistics - instead, you can create customized solutions by integrating SPSS Statistics with these tools for effective analyses and visualization Data Analysis with IBM SPSS Statistics highlights various popular statistical techniques to the readers, and how to use them in order to gather useful hidden insights from their data Full Interview IBM SPSS Statistics is a popular tool for efficient statistical analysis. What do you think are the 3 notable features of SPSS Statistics that make it stand apart from the other tools available out there? SPSS Statistics has a very short learning curve which makes it ideal for analysts to use efficiently. It also has a very comprehensive set of statistical capabilities so virtually everything a researcher would ever need is encompassed in a single application. Finally, SPSS Statistics provides a wealth of features for preparing and managing data so it is not necessary to master SQL or another database language to address data-related tasks. With over 20 years of experience in this field, you have a solid understanding of the subject and, equally, of SPSS Statistics. How do you use the tool in your work? How does it simplify your day to day tasks related to data analysis? I have used SPSS Statistics in my work with SPSS and IBM clients over the years. In addition, I use SPSS for my own research analysis. It allows me to make good use of my time whether I'm serving clients or doing my own analysis because of the breadth of capabilities available within this one program. The fact that SPSS produces presentation-ready output further simplifies things for me since I can collect key results as I work and put them into a draft report and share them as required. What are the prerequisites to use SPSS Statistics effectively? For someone who intends to use SPSS Statistics for their data analysis tasks, how steep is the curve when it comes to mastering the tool? It certainly helps to have a understanding of basic statistics when you begin to use SPSS Statistics but it can be a valuable tool even with a limited background in statistics. The learning curve is a very "gentle slope" when it comes to acquiring sufficient familiarity with SPSS Statistics to use it very effectively. Mastering the software does involve more time and effort but one can accomplish this over time as one builds on the initial knowledge that comes fairly easily. The good news is that one can obtain a lot of value from the software well before one truly masters it by discovering the many features. What are some of the common problems in data analysis? How does this book help the readers overcome them? Some of the most common pitfalls encountered when analyzing data involve handling missing/incomplete data, deciding which statistical method(s) to employ and understanding the results. In the book, we go into the details of detecting and addressing data issues including missing data. We also describe what each statistical technique provides and when it is most appropriate to use each of them. There are numerous examples of SPSS Statistics output and how the results can be used to assess whether a meaningful pattern exists. In the context of all the above, how does your book Data Analysis with IBM SPSS Statistics help readers in their statistical analysis journey? What, according to you, are the 3 key takeaways for the readers from this book? The approach we took with our book was to share with readers the most straightforward ways to use SPSS Statistics to quickly obtain the results needed to effectively conduct data analysis. We did this by showing the best way to proceed when it comes to analyzing data and then showing how this process can be done best in the software. The key takeaways from our book are the way to approach the discovery process when analyzing data, how to find hidden patterns present in the data and what to look for in the results provided by the statistical techniques covered in the book. IBM SPSS Statistics 25 was released recently. What are the major improvements or features introduced in this version? How do these features help the analysts and researchers? There are a lot of interesting new features introduced in SPSS Statistics 25. For starters, you can copy charts as Microsoft Graphic Objects, which allows you to manipulate charts in Microsoft Office. There are changes to the chart editor that make it easier to customize colors, borders, and grid line settings in charts. Most importantly, it allows the implementation of Bayesian statistical methods. Bayesian statistical methods enable the researcher to incorporate prior knowledge and assumptions about model parameters. This facility looks like a good teaching tool for Statistical Educators. Data visualization goes a long way in helping decision-makers get an accurate sense of their data. How does SPSS Statistics help them in this regard? Kenneth: Data visualization is very helpful when it comes to communicating findings to a broader audience and we spend time in the book describing when and how to create useful graphics to use for this purpose. Graphical examination of the data can also provide clues regarding data issues and hidden patterns that warrant deeper exploration. These topics are also covered in the book. Anthony: SPSS Statistics’ data visualizations capabilities are excellent. The menu system makes it easy to generate common chart types. You can develop customized looks and save them as a template to be applied to future charts. Underlying SPSS Graphics is an influential approach called the Grammar of Graphics. The SPSS graphics capabilities are embodied in a versatile syntax called Graphics Programming Language. Do you foresee SPSS Statistics facing stiff competition from open source alternatives in the near future? What is the current sentiment in the SPSS community regarding these topics? Kenneth: Open source tools based alternatives such as Python and R are potential competition for SPSS Statistics but I would argue otherwise. These tools, while powerful, have a much steeper learning curve and will prove difficult for subject matter experts that periodically need to analyze data. SPSS is ideally suited for these periodic analysts whose main expertise lies in their field which could be healthcare, law enforcement, education, human resources, marketing, etc. Anthony: The open source programs have a lot of capability but they are also fairly low-level languages, so you must learn to code. The learning curve is steep, and there are many maintainability issues. R has 2 major releases a year. You can have a situation where the data and commands remain the same, but the result changes when you update R. There are many dependencies among R packages. R has many contributors and is an avenue for getting your hands on new methods. However, there is a wide variance in the quality of the contributors and contributed packages. The occasional user of SPSS has an easier time jumping back in than does the occasional user of open source software. Most importantly, it is easier to employ SPSS in production settings. SPSS Statistics supports custom analytical solutions through integration with R and Python. Is this an intent from IBM to join hands with the open source community? This is a good follow-up question to the one asked before. Actually, the integration with R and Python allows SPSS Statistics to be extended to accommodate a situation in which an analyst wishes to try an algorithm or graphical technique not directly available in the software but which is supported in one of these languages. It also allows those familiar with R or Python to use SPSS Statistics as their platform and take advantage of all the built-in features it comes with, out of the box while still having the option to employ these other languages where they provide additional value. Lastly, this book is designed for analysts and researchers who want to get meaningful insights from their data as quickly as possible. How does this book help them in this regard? SPSS Statistics does make it possible to very quickly pull in data and get insightful results. This book is designed to streamline the steps involved in getting this done while also pointing out some of the less obvious "hidden gems" that we have discovered during the decades of using SPSS in virtually every possible situation.

0
0
3187

article-image-qlik-sense-driving-self-service-business-intelligence

Amey Varangaonkar

12 Dec 2017

11 min read

How Qlik Sense is driving self-service Business Intelligence

Amey Varangaonkar

12 Dec 2017

11 min read

Delivering Business Intelligence solutions to over 40000 customers worldwide, there is no doubt that Qlik has established a strong foothold in the analytics market for many years now. With the self-service capabilities of Qlik Sense, you can take better and more informed decisions than ever before. From simple data exploration to complex dashboarding and cloud-ready, multi-platform analytics, Qlik Sense gives you the power to find crucial, hidden insights from the depths of your data. We got some fascinating insights from our interview with two leading Qlik community members, Ganapati Hegde and Kaushik Solanki, on what Qlik Sense offers to its users and what the future looks like for the BI landscape. [box type="shadow" align="" class="" width=""] Ganapati Hegde Ganapati is an engineer by background and carries an overall IT experience of over 16 years. He is currently working with Predoole Analytics, an award-winning Qlik partner in India, in the presales role. He has worked on BI projects in several industry verticals and works closely with customers, helping them with their BI strategies. His experience in other aspects of IT, like application design and development, cloud computing, networking, and IT Security - helps him design perfect BI solutions. He also conducts workshops on various technologies to increase user awareness and drive their adoption. Kaushik Solanki Kaushik has been a Qlik MVP (Most Valuable Player) for the years 2016 and 2017 and has been working with the Qlik technology for more than 7 years now. An Information technology engineer by profession, he also holds a master’s degree in finance. Having started his career as a Qlik developer, Kaushik currently works with Predoole Analytics as the Qlik Project Delivery Manager and is also a certified QlikView administrator. An active member of Qlik community, his great understanding of project delivery - right from business requirement to final implementation, has helped many businesses take valuable business decisions.[/box] In this exciting interview, Ganapati and Kaushik take us through a compelling journey in self-service analytics, by talking about the rich features and functionalities offered by Qlik Sense. They also talk about their recently published book ‘Implementing Qlik Sense’ and what the readers can learn from it. Key Takeaways With many self-service and guided analytics features, Qlik Sense is perfectly tailored to business users Qlik Sense allows you to build customized BI solutions with an easy interface, good mobility, collaboration, focus on high performance and very good enterprise governance Built-in capabilities for creating its own warehouse, a strong ETL layer and a visualization layer for creating intuitive Business Intelligence solutions are some of the strengths of Qlik Sense With support for open APIs, the BI solutions built using Qlik Sense can be customized and integrated with other applications without any hassle. Qlik Sense is not a rival to Open Source technologies such as R and Python. Qlik Sense can be integrated with R or Python to perform effective predictive analytics ‘Implementing Qlik Sense’ allows you to upgrade your skill-set from a Qlik developer to a Qlik Consultant. The end goal of the book is to empower the readers to implement successful Business Intelligence solutions using Qlik Sense. Complete Interview There has been a significant rise in the adoption of Self-service Business Intelligence across many industries. What role do you think visualization plays in self-service BI? In a vast ocean of self-service tools, where do you think Qlik stands out from the others? As Qlik says visualization alone is not the answer. A strong backend engine is needed which is capable of strong data integration and associations. This then enables businesses to perform self-service and get answers to all their questions. Self-service plays an important role in the choice of visualization tools, as business users today no longer want to go to IT every time they need changes. Self service enable business users to quickly build their own visualization with simple drag and drop. Qlik stands out from the rest in its capability to bring in multiple data sources, enabling users to easily answers questions. Its unique associative engine allows users to find hidden insights. The open API allows easy customization and integrations which is a must for enterprises. Data security and governance is one of the best in Qlik. What are the key differences between QlikView and Qlik Sense? What are the factors crucial to building powerful Business Intelligence solutions with Qlik Sense? QlikView and Qlik Sense are similar yet different. Both share the same engine. On one hand, QlikView is a developer’s delight with the options it offers, and on the other hand, Qlik Sense with its self-service is more suited for business users. Qlik Sense has better mobility and open API as compared to QlikView, making Qlik Sense more customizable and extensible. The beauty of Qlik Sense lies in its ability to help business get answers to their questions. It helps correlate the data between different data sources and making it very meaningful to users. Powerful data visualizations do not necessarily mean beautiful visualizations and Qlik Sense lays special emphasis on this. Finally what the users need is performance, easy interface, good mobility, collaboration and good enterprise governance - something which Qlik Sense provides. Ganapati, you have over 15 years of experience in IT, and have extensively worked in the BI domain for many years. Please tell us something about your journey. How does your daily schedule look like? I have been fortunate in my career to be able to work on multiple technologies ranging from programming, databases, information security, integrations and cloud solutions. All this knowledge is helping me propose the best solutions for my Qlik customers. It’s a pleasure helping customers in their analytical journey and working for a services company helps in meeting customers from multiple domains. The daily schedule involves doing Proof of Concepts/Demos for customers, designing optimum solutions on Qlik, and conducting requirement gathering workshops. It’s a pleasure facing new challenges every day and this helps me increase my knowledge base. Qlik open API opens up amazing new possibilities and lets me come up with out of the box solutions. Kaushik, you have been awarded the Qlik MVP for 2016 and 2017, and have experience of using Qlik's tools for over 7 years. Please tell us something about your journey in this field. How do you use the tool in your day to day work? I started my career by working with the Qlik technology. My hunger for learning Qlik made me addicted to the Qlik community. I learned lot many things from the community by asking questions and solving real-world problems of community members. This helped me to get awarded by Qlik as MVP for consecutively 2 years. MVP award motivated me to help Qlik customers and users and that is one of the reasons why I thought about writing a book on Qlik Sense. I have implemented Qlik not only for clients but also for my personal use cases. There are many ways in which Qlik helps me in my day-to-day work and makes my life much easier. It’s safe to say that I absolutely love Qlik. Your book 'Implementing Qlik Sense' is primarily divided into 4 sections - with each section catering to a specific need when it comes to building a solid BI solution. Could you please talk more about how you have structured the book, and why? BI projects are challenging, and it really hurts when a project doesn’t succeed. The purpose of the book is to enable Qlik Sense developers to get to implement successful Qlik Projects. There is often a lot of focus on development and thereby Qlik developers miss several other crucial factors which contribute to project success. To make the journey from a Qlik developer to a Qlik consultant the book is divided into 4 sections. The first section focuses on the initial preparation and intended to help consultant to get their groundwork done. The second section focuses on the execution of the project and intended to help consultants play a key role in rest of phases involving requirement gathering, architecture, design, development UAT. The third section is intended to make consultant familiar with some industry domains. This section is intended to help consultant in engaging better with business users and suggesting value-additions to project. The last section is to use the knowledge gained in the three sections and approaching a project with a case study which we come across routinely. Who is the primary target audience for this book? Are there any prerequisites they need to know before they start reading this book? The primary target audience is the Qlik Developers who are looking to progress in their career and are looking to wear the hat of a Qlik consultant. The book is also for existing consultants who would like to sharpen their skills and use Qlik Sense more efficiently. The book will help them become trusted advisors to their clients. Those who are already familiar with some Qlik development will be able to get the most out of this book. Qlik Sense is primarily an enterprise tool. With the rise of open source languages such as R and Python, why do you think people would still prefer enterprise tools for their data visualization? Qlik Sense is not a competition to R and Python but there are lots of synergies. The customer gets the best value when Qlik co-exists with R/Python and can leverage the capabilities of both Qlik and R/Python. Qlik Sense does not have the predictive capability which is easily fulfilled by R/Python. For the customer, the tight integration ensures he/she doesn’t have to leave the Qlik screen. There can be other use cases for using them jointly such as analyzing unstructured data and using machine learning. The reports and visualizations built using Qlik Sense can be viewed and ported across multiple platforms. Can you please share your views on this? How does it help the users? Qlik has opened all gates to integrate its reporting and visualization with most of the technologies through APIs. This has empowered customers to integrate Qlik with their existing portals and provide easy access to end users. Qlik provides APIs for almost all its products, which makes Qlik the first choice for many CIOs because with those APIs they get a variety of options to integrate and automate their work. What are the other key functionalities of Qlik Sense that help the users build better BI solutions? Qlik Sense is not just a pure play data visualization tool. It has capabilities for creating its own warehouse, having an ETL layer and then of course there’s the visualization layer. For the customers, it’s all about getting all the relevant components required for their BI project in a single solution. Qlik is investing heavily in R&D and with its recent acquisitions and a strong portfolio, it is a complete solution enabling users to get all their use cases fulfilled. The open API has enabled opening newer avenues with custom visualizations, amazing concepts such as chatbots, augmented intelligence and much more. The core strength of strong data association, enterprise scalability, governance combined with all other aspects make Qlik one of the best in overall customer satisfaction. Do you foresee Qlik Sense competing strongly with major players such as Tableau and Power BI in the near future? Also, how do you think Qlik plans to tackle the rising popularity of the Open Source alternatives? Qlik has been classified as a Leader in the Gartner’s Magic Quadrant for several years now. We often come across Tableau and Microsoft Power BI as competition. We suggest our customers do a thorough evaluation and more often than not they choose Qlik for its features and the simplicity it offers. With recent acquisitions, Qlik Sense has now become an end-to-end solution for BI, covering uses cases ranging from report distributions, data-as-a-service, and geoanalytics as well. Open source alternatives have their own market and it makes more sense to leverage their capability rather than compete with them. An example, of course, is the strong integration of many BI tools with R or Python which makes life so much easier when it comes to finding useful insights from data. Lastly, what are the 3 key takeaways from your book 'Implementing Qlik Sense'? How will this book help the readers? The book is all about meeting your client’s expectations. The key takeaways are: Understand the role and importance of Qlik consultant and why it’s crucial to be a trusted advisor to your clients Successfully navigating through all aspects which enable successful implementation of your Qlik BI Project. Focus on mitigating risks, driving adoption and avoiding common mistakes while using Qlik Sense. The book is ideal for Qlik developers who aspire to become Qlik consultants. The book uses simple language and gives examples to make the learning journey as simple as possible. It helps the consultants to give equal importance to certain phases of project development that often neglected. Ultimately, the book will enable Qlik consultants to deliver quality Qlik projects. If this interview has nudged you to explore Qlik Sense, make sure you check out our book Implementing Qlik Sense right away!

0
0
4401

article-image-industrial-internet-iiot-architects

Aaron Lazar

21 Nov 2017

8 min read

Why the Industrial Internet of Things (IIoT) needs Architects

Aaron Lazar

21 Nov 2017

8 min read

The Industrial Internet, the IIoT, the 4th Industrial Revolution or Industry 4.0, whatever you may call it, has gained a lot of traction in recent times. Many leading companies are driving this revolution, connecting smart edge devices to cloud-based analysis platforms and solving their business challenges in new and smarter ways. To ensure the smooth integration of such machines and devices, effective architectural strategies based on accepted principles, best practices, and lessons learned, must be applied. In this interview, Shyam throws light on his new book, Architecting the Industrial Internet, and shares expert insights into the world of IIoT, Big Data, Artificial Intelligence and more. Shyam Nath Shyam is the director of technology integrations for Industrial IoT at GE Digital. His area of focus is building go-to-market solutions. His technical expertise lies in big data and analytics architecture and solutions with focus on IoT. He joined GE in Sep 2013 prior to which he has worked in IBM, Deloitte, Oracle, and Halliburton. He is the Founder/President of the BIWA Group, a global community of professional in Big Data, analytics, and IoT. He has often been listed as one of the top social media influencers for Industrial IoT. You can follow him on Twitter @ShyamVaran. He talks about the IIoT, the various impacts that technologies like AI and Deep Learning will have on IIoT and he gives a futuristic direction to where IIoT is headed towards. He talks about the challenges that Architects face while architecting IIoT solutions and how his book will help them overcome such issues. Key Takeaways The fourth Industrial Revolution will break silos and bring IT and Ops teams together to function more smoothly. Choosing the right technology to work with involves taking risks and experimenting with custom solutions. The Predix platform and Predix.io allow developers and architects, quickly learn from others and build working prototypes that can be used to get quick feedback from the business users. Interoperability issues and a lack of understanding of all the security ramifications of the hyper-connected world could be a few challenges that adoption of IIoT must overcome Supporting technologies like AI, Deep Learning, AR and VR will have major impacts on the Industrial Internet In-depth Interview On the promise of a future with the Industrial Internet The 4th Industrial Revolution is evolving at a terrific pace. Can you highlight some of the most notable aspects of Industry 4.0? The Industrial Internet is the 4th Industrial Revolution. It will have a profound impact on both the industrial productivity as well as the future of work. Due to more reliable power, cleaner water, and Intelligent Cities, the standard of living will improve, at large for the world citizens. Industrial Internet will forge new collaborations between the IT and OT, in the organizations, and each side will develop a better appreciation of the problems and technologies of the other. They will work together to create smoother overall operations by breaking the silos. On Shyam’s IIoT toolbox that he uses on a day to day basis You have a solid track record of architecting IIoT applications in the Big Data space over the years. What tools do you use on a day-to-day basis? In order to build Industrial Internet applications, GE's Predix is my preferred IIoT platform. It is built for Digital Industrial solutions, with security and compliance baked into it. Customer IIoT solutions can be quickly built on Predix and extended with the services in the marketplace from the ecosystem. For Asset Health Monitoring and for reducing the downtime, Asset Performance Management (APM) can be used to get a jump start and its extensibility framework can be used to extend it. On how to begin one’s journey into building the Industry 4.0 For an IIoT architect, what would your recommended learning plan be? What aspects of architecting Industry 4.0 applications are tricky to master and how does your book Architecting the Industrial Internet, prepare its readers to be industry ready? An IIoT Architect can start with the book Architecting the Industrial Internet, to get a good grasp of the area broadly. This book provides a diverse set of perspectives and architectural principles, from authors who work in GE Digital, Oracle and Microsoft. The end-to-end IIoT applications involve an understanding of sensors, machines, control systems, connectivity and cloud or server systems, along with the understanding of associated enterprise data, the architect needs to focus on a limited solution or proof of concept first. The book provides coverage for the end-to-end requirements of the IIoT solutions for the architects, developers and business managers. The extensive set of use cases and case studies provides examples from many different industry domains to allow the readers to easily related to it. The book is written, in a style that would not overwhelm the reader, yet explain the workings of the architecture and the solutions. The book will be best suited for Enterprise Architects and Data Architects who are trying to understand how IIoT solutions differ from traditional IT solutions. The layer-by-layer description of the IIoT Architecture will provide a systematic approach to help develop a deep understanding, for Architects. IoT Developers who have some understanding of this area can learn the IIoT platform-based approach to building solutions quickly. On how to choose the best technology solution to optimize ROI There are so many IIoT technologies, that manufacturers are confused as to how to choose the best technology to obtain the best ROI. What would your advice to manufacturers be, in this regard? The manufacturers and operation leaders look for quick solutions to known issues, in a proven way. Hence, often they do not have the appetite to experiment with a custom solution, rather they like to know where the solution provider has solved similar problems and what was the outcome. The collection of use cases and case studies will help business leaders get an idea of the potential ROI while evaluating the solution. Getting to know Predix, GE’s IIoT platform, better Let's talk a bit about Predix, GE's IIoT platform. What advantages does Predix offer developers and architects? Do you foresee any major improvements coming to Predix in the near future? The GE's Predix platform has a growing developer community that is approaching 40,000 strong. Likewise, the ecosystem of Partners is approaching 1000. Coupled with the free access to create developer accounts on Predix.io, the developers and architects can quickly learn from others and build working prototypes that can be used to get quick feedback from the business users. The catalog of microservices at Predix.io will continue to expand. Likewise, applications written on top of Predix, such as APM and OPM (Operations Performance Management) will continue to become feature-rich, providing coverage to many common Digital Industrial challenges. On the impact of other emerging technologies like AI on IIoT What according to you will the impact be of AI and Deep Learning, on IIoT? AI and Deep Learning help to build robust Digtal Twins of the industrial assets. These Digital Twins, will make the job of predictive maintenance and optimization, much easier for the operators of these assets. Further, IIoT will benefit from many new advances in technologies like AI, AR/VR and make the job of Field Services Technicians easier. IIoT is already widely used in energy generation and distribution, Intelligent Cities for law enforcement and to ease traffic congestion. The field of healthcare is evolving, due to increasing use of wearables. Finally, precision agriculture is enabled by IoT as well. On likely barriers to IIoT adoption What are the roadblocks you expect in the adoption of IIoT? Today the challenges to rapid adoption of IoT, are interoperability issues and lack of understanding of all the security ramifications of the hyper-connected world. Finally, how to explain the business case of the IoT to the decision makers and different stakeholders is still evolving. On why Architecting the Industrial Internet is a must read for Architects Would you like to give architects 3 reasons on why they should pick up your book? It is written by IIoT practitioners from large companies who are building solutions for both internal and external consumption. The book captures the architectural best practices and advocates a platform based approach, to solutions. The theory is put to practice in the form of use cases and case studies, to provide a comprehensive guide to the architects. If you enjoyed this interview, do check out Shyam’s latest book, Architecting the Industrial Internet.

0
0
2530

article-image-sports-analytics-empowering-better-decision-making

Amey Varangaonkar

14 Nov 2017

11 min read

Expert Insights: How sports analytics is empowering better decision-making

Amey Varangaonkar

14 Nov 2017

11 min read

Analytics is slowly changing the face of the sports industry as we know it. Data-driven insights are being used to improve the team and individual performance, and to get that all-important edge over the competition. But what exactly is sports analytics? And how is it being used? What better way to get answers to these questions than asking an expert himself! [author title="Gaurav Sundararaman"]A Senior Stats Analyst at ESPN currently based in Bangalore, India. With over 10 years of experience in the field of analytics, Gaurav worked as a Research Analyst and a consultant in the initial phase of his career. He then ventured into sports analytics in 2012, and played a major role in the Analytics division of SportsMechanics India Pvt. Ltd. where he was the Analytics Consultant for the T20 World Cup winning West Indies team in 2016.[/author] In this interview, Gaurav takes us through the current landscape of sports analytics, and talks about how analytics is empowering better decision-making in sports. Key Takeaways Sports analytics pertains to finding actionable, useful insights from sports data which teams can use to gain competitive advantage over the opposition Instincts backed by data make on and off-field decisions more powerful and accurate Rise of IoT and wearable technology has boosted sports analytics. With more data available for analysis, insights can be unique and very helpful Analytics is being used in sports right from improving player performance to optimizing ticket prices and understanding fan sentiments Knowledge of tools for data collection, analysis and visualization such as R, Python and Tableau is essential for a sports analyst Thorough understanding of the sport, up to date skillset and strong communication with players and management are equally important factors to perform efficient analytics Adoption of analytics within sports has been slow, but steady. More and more teams are now realizing the benefits of sports analytics and are adopting an analytics-based strategy Complete Interview Analytics today is finding widespread applications in almost every industry today - how has the sports industry changed over the years? What role is analytics playing in this transformation? The sports industry has been relatively late in adopting analytics. That said, the use of analytics in sports has also varied geographically. In the west, analytics plays a big role in helping teams, as well as individual athletes, take up decisions. Better infrastructure and a quick adoption of the latest trends in technology is an important factor here. Also, investment in sports starts from a very young age in the west, which also makes a huge difference. In contrast, many countries in Asia are still lagging behind when it comes to adopting analytics, and still leverage on traditional techniques to solve problems. A combination of analytics with traditional knowledge from experience would go a long way in helping teams, players and businesses succeed. Previously the sports industry was a very close community. Now with the advent of analytics, the industry has managed to expand its horizon. We witness more non-sportsmen playing a major part in the decision making. They understand the dynamics of the sports business and how to use data-driven insights to influence the same. Many major teams across different sports such as Football (Soccer), Cricket, American Football, Basketball and more have realized the value of data and analytics. How are they using it? What advantages does analytics offer to them? One thing I firmly believe is that analytics can’t replace skills or can’t guarantee wins. What it can do is ensure there is logic towards certain plans and decisions. Instincts backed by data make the decisions more powerful. I always tell the coaches or players – Go with your gut and instincts as Plan A. If it does not work out your fall back could be Plan B based on trends and patterns derived from data. It turns out to be a win-win for both. Analytics offers a neutral perspective which sometimes players or coaches may not realize. Each sport has a unique way of applying analytics to make decisions and obviously, as analysts, we need to understand the context and map the relevant data. As far as using the analytics is concerned, the goals are pretty straightforward - be the best, beat the opponents and aim for sustained success. Analytics helps you achieve each of these objectives. The rise of IoT and wearable technology over the last few years has been incredible. How has it affected sports, and sports analytics, in particular? It is great to see that many companies are investing in such technologies. It is important to identify where wearables and IoT can be used in sport and where it can cause maximum impact. These devices allow in-game monitoring of players, their performance, and their current physical state. Also, I believe more than on-field, these technologies would be very useful in engaging fans as well. Data derived from these devices could be used in broadcasting as well as providing a good experience for fans in the stadiums. This will encourage more and more people to watch games in stadiums and not in the comfort of their homes. We have already seen a beginning with a few stadiums around the world leveraging technology (IoT). The Mercedes Benz stadium (home of Atlanta Falcons) has a high tech stadium powered by IBM. Sacramento is building a state-of-the-art facility for the Sacramento Kings. This is just the start, and it will only get better with time. How does one become a sports analyst? Are there any particular courses/certifications that one needs to complete in order to become one? Can you share with us your journey in sports analytics? To be honest there are no professional courses yet in India to become an Analyst. There are a couple of colleges which have just started offering Sports Analytics as a course in their Post-Graduation Program. However, there are a few companies (Sports Mechanics and Kadamba Technologies in Chennai) that offer jobs that can enable you to become a Sports Analyst if you are really good. If you are a freelancer then my advice would be to ensure you brand yourself well and showcase your knowledge through social media platforms and get a breakthrough via contacts. Post my MBA, Sports Mechanics (a leader in this space), a company based in Chennai were looking for someone to work to start their data practice. I was just lucky to be at the right place at the right time. I worked for 4 years there and was able to learn a lot about the industry and what works and what does not. Being a small company, I was lucky to don multiple hats and work on different projects across the value chain. I moved and joined the lovely team Of ESPNCricinfo where I work for their stats team. What are the tools and frameworks that you use for your day to day tasks? How do they make your work easier? There are no specific tools or frameworks. It depends on the enterprise you are working for. Usually, they are proprietary tools of the company. Most of these tools are used either to collect, mine or visualize data. Interpreting the information and presenting it in a manner in which users understand is important and that is where certain applications or frameworks are used. However to be ready for the future it would be good to be skilled on tools that support data collection, analysis and visualization namely R, Python and Tableau, to name a few. Do sports analysts have to interact with players and the coaching staff directly? How do you communicate your insights and findings with the relevant stakeholders? Yes, they have to interact with players and management directly. If not, the impact will be minimal. Communicating insights is very important in this industry. Too much analysis could lead to paralysis. We need to identify what exactly each player or coach is looking for, based on their game and try to provide them the information in a crisp manner which helps them make decisions on and off the field. For each stakeholder the magnitude of the information provided is different. For the coach and management, the insights can be in detail while for the players we need to keep it short and to the point. The insights you generate must not only be limited to enhancing the performance of a team on the field but much more than that. Could you give us some examples? Insights can vary. For the management, it could deal with how to maximise the revenue or save some money in an auction. For coaches, it could help them know about his team’s as well as the opposition’s strengths and weaknesses from a different perspective. For captains, data could help in identifying some key strategies on the field. For example, in Cricket, it could help the captain determine which bowler to bring on to which opposition batsmen, or where to place the fielders. Off the field, one area where analytics could play a big role would be in grassroots development and tracking of an athlete from an early age to ensure he is prepared for the biggest stage. Monitoring performance, improving physical attributes by following a specific regimen, assessing injury record and designing specific training programs, etc. are some ways in which this could be done. What are some of the other challenges that you face in your day to day work? Growth in this industry can be slow sometimes. You need to be very patient, work hard and ensure you follow the sport very closely. There are not many analytical obstacles as such, but understanding the requirements and what exactly the data needs are can be quite a challenge. Despite all the buzz, there are quite a few sports teams and organizations who are still reluctant to adopt an analytics-based strategy – why do you think is that the case? What needs to change? The reason for the slow adoption could be the lack of successful case studies and the awareness. In most sports when so many decisions are taken on the field sometimes the players' ability and skill seems far more superior to anything else. As more instances of successful execution of data-based trends come up, we are likely to see more teams adopting the data-based strategy. Like I mentioned earlier, analytics needs to be used to make the coach and captain take the most logical and informed decisions. Decision-makers need to be aware of the way it is used and how much impact it can cause. This awareness is vital towards increasing the adoption of analytics in sports. Where do you see sports analytics in the next 5-10 years? Today in sports many decisions are taken on gut feeling, and I believe there should be a balance. That is where analytics can help. In sports like Cricket, only around 30% of the data is used and there is more emphasis given to video. Meanwhile, if we look at Soccer or Basketball, the usage of data and video analytics is close to 60-70% of its potential. Through awareness and trying out new plans based on data, we can increase usage of analytics in cricket to 60-70 % in the next few years. Despite the current shortcomings, It is fair to say that there is a progressive and positive change at the grassroots level across the world. Data-based coaching and access to technology are slowly being made available to teams as well as budding sportsmen/women. Another positive is that the investment in the sports industry is growing steadily. I am confident that in a couple of years, we will see more job opportunities in sports. Maybe in five years, the entire ecosystem would be more structured and professional. We would witness analytics playing a much bigger role in helping stakeholders make informed decisions, as data-based insights become even more crucial. Lastly, what advice do you have for aspiring sports analysts? My only advice would be - Be passionate, build a strong network of people around you, and constantly be on the lookout for opportunities. Also, it is important to keep updating your skill-set in terms of the tools and techniques needed to perform efficient and faster analytics. Newer and better tools keep coming up very quickly, which make your work easier and faster. Be on the lookout for such tools! One also needs to identify their own niche based on their strengths and try to build on that. The industry is on the cusp of growth and as budding analysts, we need to be prepared to take off when the industry matures. Build your brand and talk to more people in the industry - figure out what you want to do to keep yourself in the best position to grow with the industry.

0
0
3654

Amey Varangaonkar

03 Nov 2017

9 min read

Why learn IBM SPSS Modeler in 2017

Amey Varangaonkar

03 Nov 2017

9 min read

IBM’s SPSS Modeler provides a powerful, versatile workbench that allows you to build efficient and accurate predictive models in no time. What else separates IBM SPSS Modeler from other enterprise analytics tools out there today? To know just that, we talk to arguably two of the most popular members of the SPSS community. [box type="shadow" align="" class="" width=""] Keith McCormick Keith is a career-long practitioner of predictive analytics and data science, has been engaged in statistical modeling, data mining, and mentoring others in this area for more than 20 years. He is also a consultant, an established author, and a speaker. Although his consulting work is not restricted to any one tool, his writing and speaking have made him particularly well known in the IBM SPSS Statistics and IBM SPSS Modeler communities. Jesus Salcedo Jesus is an independent statistical consultant and has been using SPSS products for over 20 years. With a Ph.D., in Psychometrics from Fordham University, he is a former SPSS Curriculum Team Lead and Senior Education Specialist, and has developed numerous SPSS learning courses and trained thousands of users.[/box] In this interview with Packt, Keith and Jesus give us more insights on the Modeler as a tool, the different functionalities it offers, and how to get the most out of it for all your data mining and analytics needs. Key Interview Takeaways IBM SPSS Modeler is easy to get started with but can be a tricky tool to master Knowing your business, your dataset and what algorithms you are going to apply are some key factors to consider before building your analytics solution with SPSS Modeler SPSS Modeler’s scripting language is Python, and the tool has support for running R code IBM SPSS Modeler Essentials helps you effectively learn data mining and analytics, with a focus on working with data than on coding Full Interview Predictive Analytics has garnered a lot of attention of late, and adopting an analytics-based strategy has become the norm for many businesses. Why do you think this is the case? Jesus: I think this is happening because everyone wants to make better-informed decisions. Additionally, predictive analytics brings the added benefit of discovering new relationships that you were previously not aware of. Keith: That’s true, but it’s even more exciting when the models are deployed and are potentially driving automated decisions. With over 40 years of combined experience in this field, you are master consultants and trainers, with an unrivaled expertise when it comes to using the IBM SPSS products. Please share with us the story of your journey in this field. Our readers would also love to know how your day-to-day schedule looks like. Jesus: When I was in college, I had no idea what I wanted to be. I took courses in many areas, however I avoided statistics because I thought it would be a waste of time, after all, what else is there to learn other than calculating a mean and plugging it into fancy formulas (as a kid I loved baseball, so I was very familiar with how to calculate various baseball statistics). Anyway, I took my first statistics course (where I learned SPSS) since it was a requirement, and I loved it. Soon after I became a teaching assistant for more advanced statistics courses and I eventually earned my Ph.D. in Psychometrics, all the while doing statistical consulting on the side. After graduate school, my first job was as an education consultant for SPSS (where I met Keith). I worked at SPSS (and later IBM) for seven years, at first focusing on training customers on statistics and data-mining, and then later on developing course materials for our trainings. In 2013 Keith invited me to join him as an IBM partner, so we both trained customers and developed a lot of new and exciting material in both book and video formats. Currently, I work as an independent statistical and data-mining consultant and my daily projects range from analyzing data for customers, training customers so they can analyze their own data, or creating books and videos on statistics and data mining. Keith: Our careers have lots of similarities. My current day to day is similar too. Lately, about 1/3rd of my year is lecturing and curriculum development for organizations like TDWI (Transforming Data with Intelligence), The Modeling Agency, and UC Irvine Extension. The majority of my work is in predictive analytics consulting. I especially enjoy projects where I’m brought in early and can help with strategy and planning. Then, the coach and mentor take over a team until they are self-sufficient. Sometimes building the team is even more exciting than the first project because I know that they will be able to do many more projects in the future. There is a plethora of predictive analytics tools used today - for desktop and enterprises. IBM SPSS Modeler is one such tool. What advantages does SPSS Modeler have over the others, in your opinion? Keith: One of our good friends who co-authored the IBM SPSS Modeler Cookbook made an interesting comment about this at a conference. He is unique in that he has done one-day seminars using several different software tools. As you know, it is difficult to present data mining in just one day. He said that only with Modeler he is able to spend some time on each of the CRISP-DM phases of a case study in a day. I think he feels this way because it’s among the easiest options to use. We agree. While powerful, and while it takes a whole career to master everything, it is easy to get started. Are there any prerequisites for using SPSS Modeler? How steep is the learning curve in order to start using the tool effectively? Keith: Well, the first thing I want to mention is that there are no prerequisites for our PACKT video IBM SPSS Modeler Essentials. In that, we assume that you are starting from scratch. For the tool in general, there aren’t any specific requisites as such, however knowing your data, and what insights you are looking for always helps. Jesus: Once you are back at the office, in order to be successful on a data mining project or efficiently utilize the tool, you’ll need to know your business, your data, and the modeling algorithm you are using. Keith: The other question that we get all the time is how much statistics and machine learning do you have to know. Our advice is to start with one or maybe two algorithms and learn them well. Try to stick to algorithms that you know. In our PACKT course, we mostly focus on just Decision Trees, which one of the easiest to learn. What do you think are the 3 key takeaways from your course - IBM SPSS Modeler Essentials? The 3 key takeaways from this course, we feel are: Start slow. Don’t pressure yourself to learn everything all at once. There are dozens of “nodes” in Modeler. We introduce the most important ones so start there. Be brilliant in the basics. Get comfortable with the software environment. We recommend the bests ways to organize your work. Don’t rush to Modeling. Remember the Cross Industry Standard Process for Data Mining (CRISP-DM), which we cover in the video. Use it to make sure that you proceed systematically and don’t skip critical steps. IBM recently announced that SPSS Modeler would be available freely for educational usage. How can one make the most of this opportunity? Jesus: A large portion of the work that we have done over the past few years has been to train people on how to analyze data. Professors are in a unique position to expose more students to data mining since we teach only those students whose work requires this type of training, whereas professors can expose a much larger group of people to data mining. IBM offers several programs that support professors, students, and faculty; for more information visit: https://www-01.ibm.com/software/analytics/spss/academic/ Keith: When seeking out a university class, whether it be classroom or online, ask them if they use Modeler or if they allow you to complete your homework assignments in Modeler. We recognize that R based classes are very popular now, but you potentially won’t learn as much about Data Mining. Sometimes too much of the class is spent on coding so you learn R, but learn less about analytics. You want to spend most of the class time actively working with data and producing results. With the rise of open source languages such as R and Python and their applications in predictive analytics, how do you foresee enterprise tools like SPSS Modeler competing with them? Keith: Perhaps surprisingly, we don’t think Modeler does compete with R or Python. A lot of folks don’t know that Python is Modeler’s scripting language. Now, that is an advanced feature, and we don’t cover it in the Essentials video, but learning Python actually increases your knowledge of Modeler. And Modeler supports running R code right in a Modeler stream by using the R nodes. So Modeler power users (or future power users) should keep learning R on their to-do list. If you prefer not to use code, you can produce powerful results without learning either by just using Modeler straight out of the box. So, it really is all up to you. If this interview has sparked your interest in learning more about IBM SPSS Modeler, make sure you check out our video course IBM SPSS Modeler Essentials right away!

0
0
2534

article-image-unlocking-the-secrets-of-microsoft-power-bi-interview-part-2-of-2-with-brett-powell-founder-of-frontline-analytics-llc

Amey Varangaonkar

10 Oct 2017

12 min read

Unlocking the secrets of Microsoft Power BI

Amey Varangaonkar

10 Oct 2017

12 min read

0
0
3070

article-image-microsoft-power-bi-interview-part1-brett-powell

Amey Varangaonkar

09 Oct 2017

8 min read

Ride the third wave of BI with Microsoft Power BI

Amey Varangaonkar

09 Oct 2017

8 min read

[dropcap]S[/dropcap]elf-service Business Intelligence is the buzzword everyone's talking about today. It gives modern business users the ability to find unique insights from their data without any hassle. Amidst a myriad of BI tools and platforms out there in the market, Microsoft’s Power BI has emerged as a powerful, all-encompassing BI solution - empowering users to tailor and manage Business Intelligence to suit their unique needs and scenarios. [author title="Brett Powell"]A Microsoft Power BI partner, and the founder and owner of Frontline Analytics LLC., a BI and analytics research and consulting firm. Brett has contributed to the design and development of Microsoft BI stack and Power BI solutions of diverse scale and complexity across the retail, manufacturing, financial, and services industries. He regularly blogs about the latest happenings in Microsoft BI and Power BI features at Insight Quest. He is also an organizer of the Boston BI User Group.[/author] In this two part interview Brett talks about his new book, Microsoft Power BI Cookbook, and shares his insights and expertise in the area of BI and data analytics with a particular focus on Power BI. In part one, Brett shares his views on topics ranging from what it takes to be successful in the field of BI & data analytics to why he thinks Microsoft is going to lead the way in shaping the future of the BI landscape. In part two of the interview, he shares his expertise with us on the unique features that differentiate Power BI from other tools and platforms in the BI space. Key Takeaways Ease of deployment across multiple platforms, efficient data-driven insights, ease of use and support for a data-driven corporate culture are factors to consider while choosing a Business Intelligence solution for enterprises. Power BI leads in self-service BI because it’s the first Software-as-a-Service (SaaS) platform to offer ‘End User BI’ where anyone, not just a business analyst, can leverage powerful tools to obtain greater value from data. Microsoft Power BI has been identified as a leader in Gartner’s Magic Quadrant for BI and Analytics platforms, and provides a visually rich and easy to access interface that modern business users require. You can isolate report authoring from dataset development in Power BI, or quickly scale up or down a Power BI dataset as per your needs. Power BI is much more than just a tool for reports and dashboards. With a thorough understanding of the query and analytical engines of Power BI, users can customize more powerful and sustainable BI solutions. Part One Interview Excerpts - Power BI from a Bird’s Eye View On choosing the right BI solution for your enterprise needs What are some key criteria one must evaluate while choosing a BI solution for enterprises? How does Power BI fare against these criteria as compared with other leading solutions from IBM, Oracle and Qlikview? Enterprises require a platform which can be implemented on their terms and adapted to their evolving needs. For example, the platform must support on-premises, cloud, and hybrid deployments with seamless integration allowing organizations to both leverage on-premises assets as well as fully manage their cloud solution. Additionally, the platform must fully support both corporate business intelligence processes such as staged deployments across development and production environments as well as self-service tools which empower business teams to contribute to BI projects and a data driven corporate culture. Furthermore, enterprises must consider the commitment of the vendor to BI and analytics, the full cost of scaling and managing the solution, as well as the vendors’ vision for delivering emerging capabilities such as artificial intelligence and natural language. Microsoft Power BI has been identified as a leader in Gartner’s Magic Quadrant for BI and Analytics platforms based on both its currently ability to execute as well as its vision. Particularly now with Power BI Premium, the Power BI Report Server, and Power BI embedded offerings, Power BI truly offers organizations the ability to tailor and manage BI to their unique needs and scenarios. Power BI’s mobile application, available on all common platforms (iOS, Android) in addition to continued user experience improvements in the Power BI service provides a visually rich and common interface for the ‘anytime access’ that modern business users require. Additionally, since Power BI’s self-service authoring tool of Power BI Desktop shares the same engine as SQL Server Analysis Services, Power BI has a distinct advantage in enabling organizations to derive value from both self-service and corporate BI. The BI landscape is very competitive and other vendors such as Tableau and Qlikview have obtained significant market share. However, as organizations fully consider the features distinguishing the products in addition to the licensing structures and the integration with Microsoft Azure, Office 365, and common existing BI assets such as Excel and SQL Server Reporting Services and Analysis Services, they will (and are) increasingly concluding that Power BI provides a compelling value. On the future of BI and why Brett is betting on Microsoft to lead the way Self-service BI as a trend has become mainstream. How does Microsoft Power BI lead this trend? Where do you foresee the BI market heading next i.e., are there other trends we should watch out for? Power BI leads in self-service BI because it’s the first software as a service (SaaS) platform to offer ‘End User BI’ in which anyone, not just a business analyst, can leverage powerful tools to obtain greater value from data. This ‘third wave’ of BI, as Microsoft suggests, further follows and supplements the first and second waves of BI in Corporate and self-service BI, respectively. For example, Power BI’s Q & A experience with natural language queries and integration with Cortana goes far beyond the traditional self-service process of an analyst finding field names and dragging and dropping items on a canvas to build a report. Additionally, an end user has the power of machine learning algorithms at their fingertips with features such as Quick Insights now built into Power BI Desktop. Furthermore, it’s critical to understand that Microsoft has a much larger vision for self-service BI than other vendors. Self-service BI is not exclusively the visualization layer over a corporate IT controlled data model – it’s also the ability for self-service solutions to be extended and migrated to corporate solutions as part of a complete BI strategy. Given their common underlying technologies, Microsoft is able to remove friction between corporate and self-service BI and allows organizations to manage modern, iterative BI project lifecycles. On staying ahead of the curve in the data analytics & BI industry For someone just starting out in the data analytics and BI fields, what would your advice be? How can one keep up with the changes in this industry? I would focus on building a foundation in the areas which don’t change frequently such as math, statistics, and dimensional modeling. You don’t need to become a data scientist or a data warehouse architect to deliver great value to organizations but you do need to know the basic tools of storing and analysing data to answer business questions. To succeed in this industry over time you need to consistently invest in your skills in the areas and technologies relevant to your chosen path. You need to hold yourself accountable for becoming a better data professional and this can be accomplished by certification exams, authoring technical blogs, giving presentations, or simply taking notes from technical books and testing out tools and code on your machine. For hard skills I’d recommend standard SQL, relational database fundamentals, data warehouse architecture and dimensional model design, and at least a core knowledge of common data transformation processes and/or tools such as SQL Server Integration Services (SSIS) and SQL stored procedures. You’ll need to master an analytical language as well and for Microsoft BI projects that language is increasingly DAX. For soft skills, you need to move beyond simply looking for a list of requirements for your projects. You need to learn to become flexible and active – you need to become someone who offers ideas and looks to show value and consistently improve projects rather than just ‘deliver requirements’. You need to be able to have both a deeply technical conversation but also have a very practical conversation with business stakeholders. You need to able to build relationships with both business and IT. You don’t ever want to dominate or try to impress anyone but if you’re truly passionate about your work then this will be visible in how you speak about your projects and the positive energy you bring to work every day and to your ongoing personal development. If you enjoyed this interview, check out Brett’s latest book, Microsoft Power BI Cookbook. In part two of the interview, Brett shares 5 Power BI features to watch out for, 7 reasons to choose Power BI to build enterprise solutions and more. Visit us tomorrow to read part two of the interview.

0
0
2789

article-image-romeo-kienzler-mastering-apache-spark

Amey Varangaonkar

02 Oct 2017

7 min read

Is Apache Spark today's Hadoop?

Amey Varangaonkar

02 Oct 2017

7 min read

With businesses generating data at an enormous rate today, many Big Data processing alternatives such as Apache Hadoop, Spark, Flink, and more have emerged in the last few years. Apache Spark among them has gained a lot of popularity of late, as it offers ease of use and sophisticated analytics, and helps you process data with speed and efficiency. [author title="Romeo Kienzler" image="https://www.linkedin.com/in/romeo-kienzler-089b4557/detail/photo/"]Chief Data Scientist in the IBM Watson IoT worldwide team, has been helping clients all over the world find insights from their IoT data using Apache Spark. An Associate Professor for Artificial Intelligence at Swiss University of Applied Sciences, Berne, he is also a member of the IBM Technical Expert Council and the IBM Academy of Technology, IBM's leading brains trust.[/author] In this interview, Romeo talks about his new book on Apache Spark and Spark’s evolution from just a data processing framework to becoming a solid, all-encompassing platform for real-time processing, streaming analytics and distributed Machine Learning. Key Takeaways Apache Spark has evolved to become a full-fledged platform for real-time batch processing and stream processing. Its in-memory computing capabilities allow for efficient streaming analytics, graph processing, and machine learning. It gives you the ability to work with your data at scale, without worrying if it is structured or unstructured. Popular frameworks like H2O and DeepLearning4J are using Apache Spark as their preferred platform for distributed AI, Machine Learning, and Deep Learning. Full-length Interview As a data scientist and an assistant professor, you must have used many tools both for your work and for research? What are some key criteria one must evaluate while choosing a big data analytics solution? What are your go-to tools and where does Spark rank among them? Scalability. Make sure you can use a cluster to accelerate execution of your processes TCO – How much do I have to pay for licensing and deployment. Consider the usage of Open Source (but keep maintenance in mind). Also, consider Cloud. I’ve shifted completely away from non-scalable environments like R and python pandas. I’ve also shifted away from scala for prototyping. I’m using scala only for mission-critical applications which have to be maintained for the long term. Otherwise, I’m using python. I’m trying to completely stay on Apache Spark for everything I’m doing which is feasible since Spark supports: SQL Machine Learning DeepLearning The advantage is that everything I’m doing is scalable by definition and once I need it I can scale without changing code. What does the road to mastering Apache Spark look like? What are some things that users may not have known about Apache Spark? Can readers look forward to learning about some of them in your new book: Mastering Apache Spark, second edition? Scaling on very large clusters is still tricky with Apache Spark because at a certain point scale-out is not linear anymore. So, a lot of tweaking of the various knobs is necessary. Also, the Spark API somehow is slightly more tedious that the one of R or python Pandas – so it needs some energy to really stick with it and not to go back to “the good old R-Studio”. Next, I think the strategic shift from RDDs to DataFrames and Datasets was a disrupting but necessary step. In the book, I try to justify this step and first explain how the new API and the two related projects Tungsten and Catalyst work. Then I show how things like machine learning, streaming, and graph processing are done in the traditional, RDD based way as well as in the new DataFrames and Datasets based way. What are the top 3 data analysis challenges that never seem to go away even as time and technology keep changing? How does Spark help alleviate them? Data quality. Data is often noisy and in bad formats. The majority of the time I spend improving it through various methodologies. Apache Spark helps me to scale. SparkSQL and SparkML pipelines introduce a standardized framework for doing so. Unstructured data preparation. A lot of data is unstructured in the form of text. Apache Spark allows me to pre-process vast amount of text and create tiny mathematical representations out of it for downstream analysis. Instability on technology. Every six months there is a new hype which seems to make everything you’ve learned redundant. So, for example, there exist various scripting languages for big data. SparkSQL ensures that I can use my already acquired SQL skills now and in future. How is the latest Apache Spark 2.2.0 a significant improvement over the previous version? The most significant change, in my opinion, was labeling Structured Streaming GA and no longer as experimental. Otherwise, there have been “only” minor improvements, mainly on performance, 72 to be precise as all are documented in JIRA since it is an Apache project. The most significant improvement between version 1.6 to 2.0 was whole stage code generation in Tungsten which is also covered in this book. Streaming analytics has become mainstream. What role did Apache Spark play in leading this trend? Actually, Apache Spark takes it to the next level by introducing the concept of continuous applications. So with Apache Spark, the streaming and batch API have been unified that you actually don’t have to care anymore on what type of data you are running your queries on. You can even mix and match. For example joining a structured stream, a relational database, a NoSQL database and a file in HDFS within a single SQL statement. Everything is possible. Mastering Apache Spark was first published back in 2015. Big data has greatly evolved since then. What does the second edition of Mastering Apache Spark offer readers today in this context? Back in 2015, Apache Spark was just another framework within the Hadoop ecosystem. Now, Apache Spark has grown to be one of the largest open source projects on this planet! Apache Spark is the new big data operating system like Hadoop was back in 2015. AI and Deep Learning are the most important trends and as explained in this book, Frameworks like H2O, DeepLearning4J and Apache SystemML are using Apache Spark as their big data operation system to scale. I think I’ve done a very good job in taking real-life examples from my work and finding a good open data source or writing a good simulator to give hands-on experience in solving real-world problems. So in the book, you should find a recipe for all the current data science problems you find in the industry. 2015 was also the year when Apache Spark and IBM Watson chose to join hands. As the Chief data scientist for IBM Watson IoT, give us a glimpse of what this partnership is set to achieve. This partnership underpins IBM’s strong commitment to open source. Not only is IBM contributing to Apache Spark, IBM also creates new open source projects on top of it. The most prominent example is Apache SystemML which is also covered in this book. The next three years are dedicated to DeepLearning and AI. And IBM’s open source contributions will help the Apache Spark community to succeed. The most prominent example is PowerAI where IBM outperformed all state-of-the-art deep learning technologies for image recognition. For someone just starting out in the field of big data and analytics, what would your advice be? I suggest taking a Machine Learning course of one of the leading online training vendors. Then take a Spark course (or read my book). Finally, try to do everything yourself. Participate in Kaggle competitions and try to replicate papers.

0
0
2725

Amey Varangaonkar

13 Sep 2017

5 min read

Why you should use Keras for deep learning

Amey Varangaonkar

13 Sep 2017

5 min read

A lot of people rave about TensorFlow and Theano, but there are is one complaint you hear fairly regularly: that they can be a little challenging to use if you're directly building deep learning models. That’s where Keras comes to the rescue. It's a high-level deep learning library written in Python that can be used as a wrapper on top of TensorFlow or Theano, to simplify the model training process and to make the models more efficient. Sujit Pal is Technology Research Director at Elsevier Labs. He has been working with Keras for some time. He is an expert in Semantic Search, Natural Language Processing and Machine Learning. He's also the co-author of Deep Learning with Keras, which is why we spoke to him about why you should use start using Keras (he's very convincing). 5 reasons you should start using Keras Keras is easy to get started with if you’ve worked with Python before and have some basic knowledge of neural networks. It works on top of Theano and TensorFlow seamlessly to create efficient deep learning models. It offers just the right amount of abstraction - allowing you to focus on the problem at hand rather than worry about the complexity of using the framework. It is a handy tool to use if you’re looking to build models related to Computer Vision or Natural Language Processing. Keras is a very expressive framework that allows for rapid prototyping of models. Why I started using Keras Packt: Why did you start using using Keras? Sujit Pal: My first deep learning toolkit was actually Caffe, then TensorFlow, both for work related projects. I learned Keras for a personal project and I was impressed by the Goldilocks (i.e. just right) quality of the abstraction. Thinking at the layer level was far more convenient than having to think in terms of matrix multiplication that TensorFlow makes you do, and at the same time I liked the control I got from using a programming language (Python) as opposed to using JSON in Caffe. I've used Keras for multiple projects now. Packt: How has this experience been different from other frameworks and tools? What problems does it solve exclusively? Sujit: I think Keras has the right combination of simplicity and power. In addition, it allows you to run against either TensorFlow or Theano backends. I understand that it is being extended to support two other backends - CNTK and MXNet. The documentation on the Keras site is extremely good and the API itself (both the Sequential and Functional ones) are very intuitive. I personally took to it like a fish to water, and I have heard from quite a few other people that their experiences were very similar. What you need to know to start using Keras Packt: What are the prerequisites to learning Keras? And what aspects are tricky to learn? Sujit: I think you need to know some basic Python and have some idea about Neural Networks. I started with Neural Networks from the Google/edX course taught by Vincent Van Houke. It’s pretty basic (and taught using TensorFlow) but you can start building networks with Keras even with that kind of basic background. Also, if you have used numpy or scikit-learn, some of the API is easier to pick up because of the similarities. I think the one aspect I have had a few problems with is building custom layers. While there is some documentation that is just enough to get you started, I think Keras would be usable in many more situations if the documentation for the custom layers was better, maybe more in line with the rest of Keras. Things like how to signal that a layer supports masking or multiple tensors, debugging layers, etc. Packt: Why do you use Keras in your day-to-day programming and data science tasks? Sujit: I have spent most of last year working with Image classification and similarity, and I've used Keras to build most of my more recent models. This year I am hoping to do some work with NLP as it relates to images, such as generating image captions, etc. On the personal projects side, I have used Keras for building question answering and disease prediction models, both with data from Kaggle competitions. How Keras could be improved Packt: As a developer, what do you think are the areas of development for Keras as a library? Where do you struggle the most? Sujit: As I mentioned before, the Keras API is quite comprehensive and most of the time Keras is all you need to build networks, but occasionally you do hit its limits. So I think the biggest area of Keras that could be improved would be extensibility, using its backend interface. Another thing I am excited about is the contrib.keras package in TensorFlow, I think it might open up even more opportunity for customization, or at least the potential to maybe mix and match TensorFlow with Keras.

0
0
3680

article-image-machine-learning-can-useful-almost-every-problem-domain-interview-sebastian-raschka

Packt Editorial Staff

04 Sep 2017

9 min read

Has Machine Learning become more accessible?

Packt Editorial Staff

04 Sep 2017

9 min read

0
0
2260

Author Posts - Data

“Tableau is the most powerful and secure end-to-end analytics platform”: An interview with Joshua Milligan

“Pandas is an effective tool to explore and analyze data”: An interview with Theodore Petrou

Why is Python so good for AI and Machine Learning? 5 Python Experts Explain

Why MongoDB is the most popular NoSQL database today

Why You Need to Know Statistics To Be a Good Data Scientist

Why choose IBM SPSS Statistics over R for your data analysis project

How Qlik Sense is driving self-service Business Intelligence

Why the Industrial Internet of Things (IIoT) needs Architects

Expert Insights: How sports analytics is empowering better decision-making

Why learn IBM SPSS Modeler in 2017

Trending Topics

Unlocking the secrets of Microsoft Power BI

Ride the third wave of BI with Microsoft Power BI

Is Apache Spark today's Hadoop?

Why you should use Keras for deep learning

Has Machine Learning become more accessible?