Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
All Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Newsletter Hub
Free Learning
Arrow right icon
timer SALE ENDS IN
0 Days
:
00 Hours
:
00 Minutes
:
00 Seconds

How-To Tutorials - Data

1204 Articles
article-image-julia-computing-research-team-runs-machine-learning-model-on-encrypted-data-without-decrypting-it
Fatema Patrawala
28 Nov 2019
5 min read
Save for later

Julia Computing research team runs machine learning model on encrypted data without decrypting it

Fatema Patrawala
28 Nov 2019
5 min read
Last week, the team at Julia Computing published a research based on cutting edge cryptographic techniques. The research involved cryptography techniques to practically perform computation on data without ever decrypting it. For example, the user would send encrypted data (e.g. images) to the cloud API, which would run the machine learning model and then return the encrypted answer. Nowhere is the user data decrypted and in particular the cloud provider does not have access to either the original image nor is it able to decrypt the prediction it computed. The team made this possible by building a machine learning service for handwriting recognition of encrypted images (from the MNIST dataset). The ability to compute on encrypted data is generally referred to as “secure computation” and is a fairly large area of research, with many different cryptographic approaches and techniques for a plethora of different application scenarios. For their research, Julia team focused on using a technique known as “homomorphic encryption”. What is homomorphic encryption Homomorphic encryption is a form of encryption that allows computation on ciphertexts, generating an encrypted result which, when decrypted, matches the result of the operations as if they had been performed on the plaintext. This technique can be used for privacy-preserving outsourced storage and computation. It allows data to be encrypted and out-sourced to commercial cloud environments for processing, all while encrypted. In highly regulated industries, such as health care, homomorphic encryption can be used to enable new services by removing privacy barriers inhibiting data sharing. In this research, the Julia Computing team used a homomorphic encryption system which involves the following operations: pub_key, eval_key, priv_key = keygen() encrypted = encrypt(pub_key, plaintext) decrypted = decrypt(priv_key, encrypted) encrypted′ = eval(eval_key, f, encrypted) So the first three are fairly straightforward and are familiar to anyone who has used asymmetric cryptography before. The last one is important as it evaluates some function f on the encryption and returns another encrypted value corresponding to the result of evaluating f on the encrypted value. It is this property that gives homomorphic computation its name. Further the Julia Computing team talks about CKKS (Cheon-Kim-Kim-Song), a homomorphic encryption scheme that allowed homomorphic evaluation on the following primitive operations: Element-wise addition of length n vectors of complex numbers Element-wise multiplication of length n complex vectors Rotation (in the circshift sense) of elements in the vector Complex conjugation of vector elements But they also mentioned that computations using CKKS were noisy, and hence they tested to perform these operations in Julia. Which convolutional neural network did the Julia Computing team use As a starting point the Julia Computing team used the convolutional neural network example given in the Flux model zoo. They kept training the loop, prepared the data and tweaked the ML model slightly. It is essentially the same model as the one used in the paper “Secure Outsourced Matrix Computation and Application to Neural Networks”, which uses the same (CKKS) cryptographic scheme. This paper also encrypts the model, which the Julia team neglected for simplicity and they involved bias vectors after every layer (which Flux does by default). This resulted in a higher test set accuracy of the model used by Julia team which was (98.6% vs 98.1%). An unusual feature in this model are the x.^2 activation functions. More common choices here would have been tanh or relu or something more advanced. While those functions (relu in particular) are cheap to evaluate on plaintext values, they would however, be quite expensive to evaluate on encrypted values. Also, the team would have ended up evaluating a polynomial approximation had they adopted these common choices. Fortunately  x.^2 worked fine for their purpose. How was the homomorphic operation carried out The team performed homomorphic operation on Convolutions and Matrix Multiply assuming a batch size of 64. They precomputed each convolution window of 7x7 extraction from the original images which gave them 64 7x7 matrices per input image. Then they collected the same position in each window into one vector and got a 64-element vector for each image, (i.e. a total of 49 64x64 matrices), and encrypted these matrices. In this way the convolution became a scalar multiplication of the whole matrix with the appropriate mask element, and by summing all 49 elements later, the team got the result of the convolution. Then the team moved to Matrix Multiply by rotating elements in the vector to effect a re-ordering of the multiplication indices. They considered a row-major ordering of matrix elements in the vector. Then shifted the vector by a multiple of the row-size, and got the effect of rotating the columns, which is a sufficient primitive for implementing matrix multiply. The team was able to get everything together and it worked. You can take a look at the official blog post to know the step by step implementation process with codes. Further they also executed the whole encryption process in Julia as it allows powerful abstractions and they could encapsulate the whole convolution extraction process as a custom array type. The Julia Computing team states, “Achieving the dream of automatically executing arbitrary computations securely is a tall order for any system, but Julia’s metaprogramming capabilities and friendly syntax make it well suited as a development platform.” Julia co-creator, Jeff Bezanson, on what’s wrong with Julialang and how to tackle issues like modularity and extension Julia v1.3 released with new multithreading features, and much more! The Julia team shares its finalized release process with the community Julia announces the preview of multi-threaded task parallelism in alpha release v1.3.0 How to make machine learning based recommendations using Julia [Tutorial]
Read more
  • 0
  • 0
  • 3547

article-image-10-key-announcements-from-microsoft-ignite-2019-you-should-know-about
Sugandha Lahoti
26 Nov 2019
7 min read
Save for later

10 key announcements from Microsoft Ignite 2019 you should know about

Sugandha Lahoti
26 Nov 2019
7 min read
This year’s Microsoft Ignite was jam-packed with new releases and upgrades in Microsoft’s line of products and services. The company elaborated on its growing focus to address the needs of its customers to help them do their business in smarter, more productive and more efficient ways. Most of the products were AI-based and Microsoft was committed to security and privacy. Microsoft Ignite 2019 took place on November 4-8, 2019 in Orlando, Florida and was attended by 26,000 IT implementers and decision-makers, developers, data professionals and people from various industries. There were a total of 175 separate announcements made! We have tried to cover the top 10 here. Microsoft’s Visual Studio IDE is now available on the web The web-based version of Microsoft’s Visual Studio IDE is now available to all developers. Called the Visual Studio Online, this IDE will allow developers to configure a fully configured development environment for their repositories and use the web-based editor to work on their code. Visual Studio Online is deeply integrated with GitHub (also owned by Microsoft), although developers can also attach their own physical and virtual machines to their Visual Studio-based environments. Visual Studio Online’s cloud-hosted environments, as well as extended support for Visual Studio Code and the web UI, are now available in preview. Support for Visual Studio 2019 is in private preview, which you can also sign up for through the Visual Studio Online web portal. Project Cortex will classify all content in a single network Project Cortex is a new service in Microsoft 365 useful to maintain the everyday flow of work in enterprises. Project Cortex collates enterprises generated documents and data, which is often spread across numerous repositories. It uses AI and machine learning to automatically classify all your content into topics to form a knowledge network. Cortex improves individual productivity and organizational intelligence and can be used across Microsoft 365, such as in the Office apps, Outlook, and Microsoft Teams. Project Cortex is now in private preview and will be generally available in the first half of 2020. Single-view device management with ‘Microsoft Endpoint Manager’ Microsoft has combined its Configuration Manager with Intune, its cloud-based endpoint management system to form what they call an Endpoint Manager. ConfigMgr allows enterprises to manage the PCs, laptops, phones, and tablets they issue to their employees. Intune is used for cloud-based management of phones. The Endpoint Manager will provide unique co-management options to organizations to provision, deploy, manage and secure endpoints and applications across their organization. Touted as the most important release of the event by Satya Nadella, this solution will give enterprises a single view of their deployments. ConfigMgr users will now also get a license to Intune to allow them to move to cloud-based management. No-code bot builder ‘Microsoft Power Virtual Agents’ is available in public preview Built on the Azure Bot Framework, Microsoft Power Virtual Agents is a low-code and no-code bot-building solution now available in public preview. Power Virtual Agents enables programmers with little to no developer experience to create and deploy intelligent virtual agents. The solution also includes Azure Machine Learning to help users create and improve conversational agents for personalized customer service. Power Virtual Agents will be generally available Dec. 1. Microsoft’s Chromium-based version of Edge is now more privacy-focused Microsoft Ignite announced the release candidate for Microsoft’s Chromium-based version of Edge browser with the general availability release on January 15. InPrivate search will be available for Microsoft Edge and Microsoft Bing to keep online searches and identities private, giving users more control over their data.  When searching InPrivate, search history and personally identifiable data will not be saved nor be associated back to you. Users’ identities and search histories are completely private. There will also be a new security baseline for the all-new Microsoft Edge. Security baselines are pre-configured groups of security settings and default values that are recommended by the relevant security teams. The next version of Microsoft Edge will feature a new icon symbolizing the major changes in Microsoft Edge, built on the Chromium open source project. It will appear in an Easter egg hunt designed to reward the Insider community. ML.NET 1.4 announces General Availability ML.NET 1.4, Microsoft’s open-source machine learning framework is now generally available. The latest release adds image classification training with the ML.NET API, as well as a relational database loader API for reading data used for training models with ML.NET. ML.NET also includes Model Builder (easy to use UI tool in Visual Studio) and Command-Line Interface to make it super easy to build custom Machine Learning models using AutoML. This release also adds a new preview of the Visual Studio Model Builder extension that supports image classification training from a graphical user interface. A preview of Jupyter support for writing C# and F# code for ML.NET scenarios is also available. Azure Arc extends Azure services across multiple infrastructures One of the most important features of Microsoft Ignite 2019 was Azure Arc. This new service enables Azure services anywhere and extends Azure management to any infrastructure — including those of competitors like AWS and Google Cloud.  With Azure Arc, customers can use Azure’s cloud management experience for their own servers (Linux and Windows Server) and Kubernetes clusters by extending Azure management across environments. Enterprises can also manage and govern resources at scale with powerful scripting, tools, Azure Portal and API, and Azure Lighthouse. Announcing Azure Synapse Analytics Azure Synapse Analytics builds upon Microsoft’s previous offering Azure SQL Data Warehouse. This analytics service combines traditional data warehousing with big data analytics bringing serverless on-demand or provisioned resources—at scale. Using Azure Synapse Analytics, customers can ingest, prepare, manage, and serve data for immediate BI and machine learning applications within the same service. Safely share your big data with Azure Data Share, now generally available As the name suggests, Azure Data Share allows you to safely share your big data with other organizations. Organizations can share data stored in their data lakes with third party organizations outside their Azure tenancy. Data providers wanting to share data with their customers/partners can also easily create a new share, populate it with data residing in a variety of stores and add recipients. It employs Azure security measures such as access controls, authentication, and encryption to protect your data. Azure Data Share supports sharing from SQL Data Warehouse and SQL DB, in addition to Blob and ADLS (for snapshot-based sharing). It also supports in-place sharing for Azure Data Explorer (in preview). Azure Quantum to be made available in private preview Microsoft has been working on Quantum computing for some time now. At Ignite, Microsoft announced that it will be launching Azure Quantum in private preview in the coming months. Azure Quantum is a full-stack, open cloud ecosystem that will bring quantum computing to developers and organizations. Azure Quantum will assemble quantum solutions, software, and hardware across the industry in a  single, familiar experience in Azure. Through Azure Quantum, you can learn quantum computing through a series of tools and learning tutorials, like the quantum katas. Developers can also write programs with Q# and the QDK Solve. Microsoft Ignite 2019 organizers have released an 88-page document detailing about all 175 announcements which you can access here. You can also view the conference Keynote delivered by Satya Nadella on YouTube as well as Microsoft Ignite’s official blog. Facebook mandates Visual Studio Code as default development environment and partners with Microsoft for remote development extensions Exploring .Net Core 3.0 components with Mark J. Price, a Microsoft specialist Yubico reveals Biometric YubiKey at Microsoft Ignite Microsoft announces .NET Jupyter Notebooks
Read more
  • 0
  • 0
  • 4417

article-image-why-geospatial-analysis-and-gis-matters-more-than-ever-today
Richard Gall
18 Nov 2019
7 min read
Save for later

Why geospatial analysis and GIS matters more than ever today

Richard Gall
18 Nov 2019
7 min read
Due to the hype around big data and artificial intelligence, it can be easy to miss some of the powerful but specific ways data can be truly impactful. One of the most important areas of modern data analysis that rarely gets given its due is geospatial analysis. At a time when both the natural and human worlds are going through a period of seismic change, the ability to throw a spotlight on issues of climate and population change is as transformative as the smartest chatbot (indeed, probably much more transformative). The foundation of geospatial analysis are GIS systems. GIS, in case you’re new to the field ,is an acronym for Geographic Information System. GIS applications and tools allow you to store, manipulate, analyze, and visualize data that corresponds to different aspects of the existing environment. Central to this is topographical information, but it could also include many other aspects, from contours and slopes, the built environment, land types and bodies of water. In the context of climate and human geography it’s easy to see how this kind of data can help us see the bigger picture - quite literally - behind what’s happening in our region, across our countries, and indeed, across the whole world. The history of geospatial analysis is a testament to its power. In 1854 physician John Snow identified the source of a cholera outbreak in London by marking out the homes of victims on a map. The cluster of victims that Snow’s map revealed led him to an infected water supply. Read next: Neo4j introduces Aura, a new cloud service to supply a flexible, reliable and developer-friendly graph database How GIS and geospatial analysis is being used today While this example is, of course, incredibly low-tech, it highlights exactly why geospatial analysis and GIS tools can be so valuable. To bring us up to date, there are many more examples of how geospatial analysis is making a real impact in social and environmental issues. This article on Forbes, for example, details some of the ways in which GIS projects are helping to uncover information that offers some unique insights on the history of racism, and its continuing reality today. The list includes a map of historical lynchings occurring between 1877 and 1950, and a map by the Urban Institute that shows the reality of racial segregation in U.S. schools in the 21st century. https://twitter.com/urbaninstitute/status/504668921962577921 That’s just a small snapshot - there are a huge range of incredible GIS projects that are having a massive impact on both how we understand issues, but even on policy. That's analytics enacting real, demonstrable change. Here are a few of the different areas in which GIS is being used: How GIS can be used in agriculture GIS can be used to tackle crop diseases by identifying issues across a large area of land. It’s possible to gain a deeper insight into what can drive improvements to crop yields by looking at the geographic and environmental factors that influence successful growth. How GIS can be used in retail GIS can help provide an insight on the relationship between consumer behavior and factors such as weather and congestion. It can also be used to better understand how consumers interact with products in shops. This can influence things like store design and product placement. How GIS can be used in meteorology and climate science Without GIS, it would be impossible to properly understand and visualise rainfall around the world. GIS can also be used to make predictions about the weather. For example, identifying anomalies in patterns and trends could indicate extreme weather events. How GIS can be used in medicine and health As we saw in the example above, by identifying clusters of disease, it becomes much easier to determine the causes of certain illnesses. GIS can also help us better understand the relationship between illness and environment - like pollution and asthma. How GIS can be used for humanitarian purposes Geospatial tools can help humanitarian teams to understand patterns of violence in given areas. This can help them to better manage and distribute resources and support to where it’s needed (Map Kibera is a great example of how this can be done). GIS tools are good at helping to bridge the gap between local populations and humanitarian workers in times of crisis. For example, during the Haiti earthquake non-profit tech company Ushahidi’s product helped to collate and coordinate reports from across the island. This made it possible to align what might have otherwise been a mess of data and information. There are many, many more examples of GIS being used for both commercial and non-profit purposes. If you want an in-depth look at a huge range of examples, it’s well worth checking out this article, which features 1000 GIS projects. Although geospatial analysis can be used across many different domains, all the examples above have a trend running through them: they all help us to understand the impact of space and geography. From social mobility and academic opportunity to soil erosion, GIS and other geospatial tools are brilliant because they help us to identify relationships that we might otherwise be unable to see. GIS and geospatial analysis project ideas This is an important point if you’re not sure where to start when it comes to starting a new GIS project. Forget the data (to begin with at least) and just think about what sort of questions you’d like to answer. The list is potentially endless, but here are some questions that I thought of just off the top of my head: Are there certain parts of your region more prone to flooding? Why are certain parts of your town congested and not others? Do economically marginalised people have to travel further to receive healthcare? Does one part of your region receive more rainfall/snowfall than other parts? Are there more new buildings in one area than another? Getting this right is integral to any good analysis project. Ultimately it’s what makes the whole thing worthwhile. Read next: PostGIS 3.0.0 releases with raster support as a separate extension Where to find data for a GIS project Once you’ve decided on something you want to find out, the next part is to collect your data. This can be tricky, but there are nevertheless a massive range of free data sources you can use for your project. This web page has a comprehensive collection of datasets; while it might not have exactly what you’re looking for, it's nevertheless a good place to begin if you simply want to try something out. Conclusion: Geospatial analysis is one of the most exciting and potentially transformative fields in analytics GIS and geospatial analysis is quite literally rooted in the real world. In the maps and visualizations that we create we’re able to offer unique perspectives on history or provide practical guidance on how we should act, what we need to do. This is significant: all too often technology can feel like its divorced from reality, as if it is folded into its own world that has no connection to real people. So, be ambitious, and be bold with your next GIS project: who knows what impact it could have.
Read more
  • 0
  • 0
  • 5271
Banner background image

article-image-facebook-releases-pytorch-1-3-with-named-tensors-pytorch-mobile-8-bit-model-quantization-and-more
Bhagyashree R
11 Oct 2019
5 min read
Save for later

Facebook releases PyTorch 1.3 with named tensors, PyTorch Mobile, 8-bit model quantization, and more

Bhagyashree R
11 Oct 2019
5 min read
Yesterday, at the PyTorch Developer Conference, Facebook announced the release of PyTorch 1.3. This release comes with three experimental features: named tensors, 8-bit model quantization, and PyTorch Mobile. Along with these exciting features, Facebook also announced the general availability of Google Cloud TPU support and a newly launched integration with Alibaba Cloud. Key updates in PyTorch 1.3 Named Tensors for more readable and maintainable code Though tensors are the building blocks of modern machine learning, researchers have argued that they are “broken.” Tensors have their own share of shortcomings: they expose private dimensions, broadcast based on absolute position, and keep the type information in the documentation. PyTorch 1.3 tries to solve this problem by introducing experimental support for named tensors, which was proposed by Sasha Rush, an Associate Professor at Cornell Tech. He has built a library called NamedTensor, which serves as a “thin-wrapper” on Torch tensor. This update introduces a few changes to the API. Dimension access and reduction now use a ‘dim’ argument instead of an index. Constructing and adding dimensions requires a “name” argument. Functions now broadcast based on set operations, not through heuristic ordering rules. 8-bit model quantization for mobile-optimized AI Quantization in deep learning is the method of approximating a neural network that uses 32-bit floating-point numbers by a neural network that uses a lower-precision numerical format. It is used to reduce the bandwidth and compute requirements of deep learning models. This is extremely essential for on-device applications that have limited memory size and number of computations. PyTorch 1.3 brings experimental support for 8-bit model quantization with the eager mode Python API for efficient deployment on servers and edge devices. This feature includes techniques like post-training quantization, dynamic quantization, and quantization-aware training. Moving from 32-bits to 8-bits can result in two to four times faster computations with one-quarter the memory usage. PyTorch Mobile for more efficient on-device machine learning Running machine learning models directly on edge devices is of great importance as it reduces latency. This is why PyTorch 1.3 introduces PyTorch Mobile that enables “an end-to-end workflow from Python to deployment on iOS and Android.” The current release is experimental. In the future releases, we can expect PyTorch Mobile to come with build-level optimization, selective compilation, support for QNNPACK quantized kernel libraries and ARM CPUs, further performance improvements, and more. Model interpretability and privacy tools in PyTorch 1.3 Captum and Captum Insights Captum is an easy-to-use model interpretability library for PyTorch. It is backed by state-of-the-art interpretability algorithms such as Integrated Gradients, DeepLIFT, and Conductance to help developers improve and troubleshoot their models. Developers can identify different features that contribute to a model’s output and improve its design. Facebook has also released an early release of Captum Insights. It is an interpretability visualization widget built on top of Captum. It works across images, text, and other features to help users understand feature attribution. Check out Facebook’s announcement to know more about Captum. CrypTen Machine learning via cloud-based platforms poses various security and privacy challenges. Facebook writes, “In particular, users of these platforms may not want or be able to share unencrypted data, which prevents them from taking full advantage of ML tools.” PyTorch 1.3 comes with CrypTen, a framework for privacy-preserving machine learning. It aims to make secure computing techniques accessible to machine learning practitioners. You can find more about CrypTen on GitHub. Libraries for multimodal AI systems Detectron2: It is an object detection library implemented in PyTorch. It features support for the latest models and tasks and increased flexibility to aid computer vision research. There are also improvements in maintainability and scalability to support production use cases. Fairseq gets speech extensions: With this release, Fairseq, a framework for sequence-to-sequence applications such as language translation includes support for end-to-end learning for speech and audio recognition tasks. The release of PyTorch 1.3 started a discussion on Hacker News and naturally many developers compared it with TensorFlow 2.0. Here’s what a user commented, “This is a common trend for being second in the market when we see Pytorch and TensorFlow 2.0, TF 2.0 was created to compete directly with Pytorch pythonic implementation (Keras based, Eager execution).” They further added, “Facebook at least on PyTorch has been delivering a quality product. Although for us running production pipelines TF is still ahead in many areas (GPU, TPU implementation, TensorRT, TFX and other pipeline tools) I can see Pytorch catching up on the next couple of years which by my prediction many companies will be running serious and advanced workflows and we may be able to see a winner there.” The named tensors implementation is being well-received by the PyTorch community: https://twitter.com/leopd/status/1182342855886376965 https://twitter.com/rasbt/status/1182647527906140161 These were some of the updates in PyTorch 1.3. Check out the official announcement by Facebook to know more. PyTorch 1.2 is here with a new TorchScript API, expanded ONNX export, and more PyTorch announces the availability of PyTorch Hub for improving machine learning research reproducibility Sherin Thomas explains how to build a pipeline in PyTorch for deep learning workflows Facebook AI open-sources PyTorch-BigGraph for faster embeddings in large graphs Facebook open-sources PyText, a PyTorch based NLP modeling framework
Read more
  • 0
  • 0
  • 3580

article-image-get-ready-for-open-data-science-conference-2019-in-europe-and-california
Sugandha Lahoti
10 Oct 2019
3 min read
Save for later

Get Ready for Open Data Science Conference 2019 in Europe and California

Sugandha Lahoti
10 Oct 2019
3 min read
Get ready to learn and experience the very latest in data science and AI with expert-led trainings, workshops, and talks at ​ODSC West 2019 in San Francisco and ODSC Europe 2019 in London. ODSC events are built for the community and feature the most comprehensive breadth and depth of training opportunities available in data science, machine learning, and deep learning. They also provide numerous opportunities to connect, network, and exchange ideas with data science peers and experts from across the country and the world. What to expect at ODSC West 2019 ODSC West 2019 is scheduled to take place in San Francisco, California on Tuesday, Oct 29, 2019, 9:00 AM – Friday, Nov 1, 2019, 6:00 PM PDT. This year, ODSC West will host several networking events, including ODSC Networking Reception, Dinner and Drinks with Data Scientists, Meet the Speakers, Meet the Experts, and Book Signings Hallway Track. Core areas of focus include Open Data Science, Machine Learning & Deep Learning, Research frontiers, Data Science Kick-Start, AI for engineers, Data Visualization, Data Science for Good, and management & DataOps. Here are just a few of the experts who will be presenting at ODSC: Anna Veronika Dorogush, CatBoost Team Lead, Yandex Sarah Aerni, Ph.D., Director of Data Science and Engineering, Salesforce Brianna Schuyler, Ph.D., Data Scientist, Fenix International Katie Bauer, Senior Data Scientist, Reddit, Inc Jennifer Redmon, Chief Data Evangelist, Cisco Systems, Inc Sanjana Ramprasad, Machine Learning Engineer, Mya Systems Cassie Kozyrkov, Ph.D., Chief Decision Scientist, Google Rachel Thomas, Ph.D., Co-Founder, fast.ai Check out the conference’s more industry-leading speakers here. ODSC also conducts the Accelerate AI Business Summit, which brings together leading experts in AI and business to discuss three core topics: AI Innovation, Expertise, and Management. Don’t miss out on the event You can also use code ODSC_PACKT right now to exclusively save 30% before Friday on your ticket to ODSC West 2019. What to expect at ODSC Europe 2019 ODSC Europe 2019 is scheduled to take place in London, the UK on Tuesday, Nov 19, 2019 – Friday, Nov 22, 2019. Europe Talks/Workshops schedule includes Thursday, Nov 21st and Friday, Nov 22nd. It is available to Silver, Gold, Platinum, and Diamond pass holders. Europe Trainings schedule includes Tuesday, November 19th and Wednesday, November 20th. It is available to Training,  Gold ( Wed Nov 20th only), Platinum, and Diamond pass holders. Some talks scheduled to take place include ML for Social Good: Success Stories and Challenges, Machine Learning Interpretability Toolkit, Tools for High-Performance Python, The Soul of a New AI, Machine Learning for Continuous Integration, Practical, Rigorous Explainability in AI, and more. ODSC has released a preliminary schedule with information on attending speakers and their training, workshop, and talk topics. The full schedule is going to be available soon. They’ve also recently added several excellent speakers, including Manuela Veloso, Ph.D. | Head of AI Research, JP Morgan Dr. Wojciech Samek | Head of Machine Learning, Fraunhofer Heinrich Hertz Institute Samik Chandanara | Head of Analytics and Data Science, JP Morgan Tom Cronin | Head of Data Science & Data Engineering, Lloyds Banking Group Gideon Mann, Ph.D. | Head of Data Science, Bloomberg, LP There are more chances to learn, connect, and share ideas at this year’s event than ever before. Don’t miss out. Use code ODSC_PACKT right now to save 30% on your ticket to ODSC Europe 2019.
Read more
  • 0
  • 0
  • 2606

article-image-tensorflow-2-0-released-tighter-keras-integration-eager-execution-enabled-by-default
Bhagyashree R
03 Oct 2019
5 min read
Save for later

TensorFlow 2.0 released with tighter Keras integration, eager execution enabled by default, and more!

Bhagyashree R
03 Oct 2019
5 min read
After releasing the beta version of TensorFlow 2.0 in June, Google announced its final release on Monday. This release comes with tighter integration with Keras, eager execution enabled by default, promises three times faster training performance, a cleaned-up API, and more. Key updates in TensorFlow 2.0 Tighter Keras integration for better developer productivity One of the important updates in TensorFlow 2.0 is its tighter integration with Keras, a popular high-level API used for easy and fast prototyping, building, and training deep learning models. This will enable developers to easily leverage its various model-building APIs including Sequential, Functional, and Subclassing. Explaining the motivation behind this change, the TensorFlow team wrote, “By establishing Keras as the high-level API for TensorFlow, we are making it easier for developers new to machine learning to get started with TensorFlow. A single high-level API reduces confusion and enables us to focus on providing advanced capabilities for researchers.” Eager execution enabled by default In TensorFlow 1.x, developers were required to define an abstract data structure named Graph and to run this graph they needed an encapsulation called Session. TensorFlow 2.0 has eager execution enabled by default to “eagerly” run code, similar to normal Python code. Eager execution enables fast iteration and intuitive debugging without building a graph. It also makes creating and experimenting with models using TensorFlow much easier. It can be especially useful when using the tf.keras model subclassing API. Also Read: Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out Distribution Strategy API The Distribution Strategy API in TensorFlow 2.0 allows machine learning researchers to distribute training across a wide variety of compute configurations. This will allow them to “attain great out-of-the-box performance” with minimal code changes. This release also allows distributed training with Keras’ model.fit and custom training loops. Performance improvements on GPUs TensorFlow 2.0 includes multi-GPU support and experimental support for multi worker and Cloud TPUs. This release also has a number of performance improvements on GPUs. It promises three times faster training performance when using mixed precision on NVIDIA’s Volta and Turing GPUs. It includes tight integration with NVIDIA TensorRT, a platform for high-performance deep learning inference. The standardized SavedModel file format The SavedModel API allows you to save your trained ML model into a language-neutral format. With TensorFlow 2.0, all TensorFlow ecosystem projects including TensorFlow Lite, TensorFlow JS, TensorFlow Serving, and TensorFlow Hub, support SavedModels. Standardizing the SavedModel file format will enable developers to run their models on a variety of runtimes including the cloud, web, browser, Node.js, mobile, and embedded systems. “This allows you to run your models with TensorFlow, deploy them with TensorFlow Serving, use them on mobile and embedded systems with TensorFlow Lite, and train and run in the browser or Node.js with TensorFlow.js,” the team writes. API simplification TensorFlow 2.0 includes a number of API updates. Many API symbols are removed or renamed for better consistency and clarity. Also, the tf.app, tf.flags, and tf.logging API are removed in favor of abseil-py. Because of the huge number of API changes, developers in a discussion on Hacker News expressed that transitioning from TensorFlow 1.X to TensorFlow 2.0 is quite complicated. Some also mentioned switching to PyTorch instead. A user commented, “As someone who uses TensorFlow a lot, I predict an enormous clusterfuck of a transition. Tensorflow has turned into a multiheaded monster, supporting many things and approaches but none of them very well...In my opinion, there are some architectural problems with TF, which have not been addressed in this update...If you need to transition from TF1 to TF2, consider doing the TF1 to PyTorch transition instead.” While some others were happy with the recommended Keras API and eager execution. “I don't know if I'm the only one, but I actually love the changes they've made since v1. Eager execution and tf.function are fantastic, and the built-in Keras is even better than the standalone version. A big improvement compared to TF from last year,” a user commented on Reddit. Another user added, “The most important change in terms of usability, IMO, is the use of tf.keras as the recommended interface to TensorFlow. There hasn't been a case yet where I've needed to dip outside of Keras into raw TensorFlow, but the option is there and is easy to do. That said, TF 2.0 changes a lot. Many repos might break, so expect to see lots of tensorflow==1.14 in requirement.txt files from now on.” These were some of the updates in TensorFlow 2.0. Check out the official announcement and release notes to know more in detail. Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages TensorFlow 2.0 to be released soon with eager execution, removal of redundant APIs, tf function and more Introducing TensorFlow Graphics packed with TensorBoard 3D, object transformations, and much more Train a convolutional neural network in Keras and improve it with data augmentation [Tutorial] Train a convolutional neural network in Keras and improve it with data augmentation [Tutorial]
Read more
  • 0
  • 0
  • 5311
Unlock access to the largest independent learning library in Tech for FREE!
Get unlimited access to 7500+ expert-authored eBooks and video courses covering every tech area you can think of.
Renews at €14.99/month. Cancel anytime
article-image-transformers-2-0-nlp-library-with-deep-interoperability-between-tensorflow-2-0-and-pytorch
Fatema Patrawala
30 Sep 2019
3 min read
Save for later

Transformers 2.0: NLP library with deep interoperability between TensorFlow 2.0 and PyTorch, and 32+ pretrained models in 100+ languages

Fatema Patrawala
30 Sep 2019
3 min read
Last week, Hugging Face, a startup specializing in natural language processing, released a landmark update to their popular Transformers library, offering unprecedented compatibility between two major deep learning frameworks, PyTorch and TensorFlow 2.0. Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between TensorFlow 2.0 and PyTorch. Transformers 2.0 embraces the ‘best of both worlds’, combining PyTorch’s ease of use with TensorFlow’s production-grade ecosystem. The new library makes it easier for scientists and practitioners to select different frameworks for the training, evaluation and production phases of developing the same language model. “This is a lot deeper than what people usually think when they talk about compatibility,” said Thomas Wolf, who leads Hugging Face’s data science team. “It’s not only about being able to use the library separately in PyTorch and TensorFlow. We’re talking about being able to seamlessly move from one framework to the other dynamically during the life of the model.” https://twitter.com/Thom_Wolf/status/1177193003678601216 “It’s the number one feature that companies asked for since the launch of the library last year,” said Clement Delangue, CEO of Hugging Face. Notable features in Transformers 2.0 8 architectures with over 30 pretrained models, in more than 100 languages Load a model and pre-process a dataset in less than 10 lines of code Train a state-of-the-art language model in a single line with the tf.keras fit function Share pretrained models, reducing compute costs and carbon footprint Deep interoperability between TensorFlow 2.0 and PyTorch models Move a single model between TF2.0/PyTorch frameworks at will Seamlessly pick the right framework for training, evaluation, production As powerful and concise as Keras About Hugging Face Transformers With half a million installs since January 2019, Transformers is the most popular open-source NLP library. More than 1,000 companies including Bing, Apple or Stitchfix are using it in production for text classification, question-answering, intent detection, text generation or conversational. Hugging Face, the creators of Transformers, have raised US$5M so far from investors in companies like Betaworks, Salesforce, Amazon and Apple. On Hacker News, users are appreciating the company and how Transformers has become the most important library in NLP. Other interesting news in data Baidu open sources ERNIE 2.0, a continual pre-training NLP model that outperforms BERT and XLNet on 16 NLP tasks Dr Joshua Eckroth on performing Sentiment Analysis on social media platforms using CoreNLP Facebook open-sources PyText, a PyTorch based NLP modeling framework
Read more
  • 0
  • 0
  • 5936

article-image-can-a-modified-mit-hippocratic-license-to-restrict-misuse-of-open-source-software-prompt-a-wave-of-ethical-innovation-in-tech
Savia Lobo
24 Sep 2019
5 min read
Save for later

Can a modified MIT ‘Hippocratic License’ to restrict misuse of open source software prompt a wave of ethical innovation in tech?

Savia Lobo
24 Sep 2019
5 min read
Open source licenses allow software to be freely distributed, modified, and used. These licenses give developers an additional advantage of allowing others to use their software as per their own rules and conditions. Recently, software developer and open-source advocate Coraline Ada Ehmke has caused a stir in the software engineering community with ‘The Hippocratic License.’ Ehmke was also the original author of Contributor Covenant, a “code of conduct" for open source projects that encourages participants to use inclusive language and to refrain from personal attacks and harassment. In a tweet posted in September last year, following the code of conduct, she mentioned, “40,000 open source projects, including Linux, Rails, Golang, and everything OSS produced by Google, Microsoft, and Apple have adopted my code of conduct.” [box type="shadow" align="" class="" width=""]The term ‘Hippocratic’ is derived from the Hippocratic Oath, the most widely known of Greek medical texts. The Hippocratic Oath in literal terms requires a new physician to swear upon a number of healing gods that he will uphold a number of professional ethical standards.[/box] Ehmke explained the license in more detail in a post published on Sunday. In it, she highlights how the idea that writing software with the goals of clarity, conciseness, readability, performance, and elegance are limiting, and potentially dangerous.“All of these technologies are inherently political,” she writes. “There is no neutral political position in technology. You can’t build systems that can be weaponized against marginalized people and take no responsibility for them.”The concept of the Hippocratic license is relatively simple. In a tweet, Ehmke said that it “specifically prohibits the use of open-source software to harm others.” Open source software and the associated harm Out of the many privileges that open source software allows such as free redistribution of the software as well as the source code, the OSI also defines there is no discrimination against who uses it or where it will be put to use. A few days ago, a software engineer, Seth Vargo pulled his open-source software, Chef-Sugar, offline after finding out that Chef (a popular open source DevOps company using the software) had recently signed a contract selling $95,000-worth of licenses to the US Immigrations and Customs Enforcement (ICE), which has faced widespread condemnation for separating children from their parents at the U.S. border and other abuses. Vargo took down the Chef Sugar library from both GitHub and RubyGems, the main Ruby package repository, as a sign of protest. In May, this year, Mijente, an advocacy organization released documents stating that Palantir was responsible for the 2017 ICE operation that targeted and arrested family members of children crossing the border alone. Also, in May 2018, Amazon employees, in a letter to Jeff Bezos, protested against the sale of its facial recognition tech to Palantir where they “refuse to contribute to tools that violate human rights”, citing the mistreatment of refugees and immigrants by ICE. Also, in July, the WYNC revealed that Palantir’s mobile app FALCON was being used by ICE to carry out raids on immigrant communities as well as enable workplace raids in New York City in 2017. Founder of OSI responds to Ehmke’s Hippocratic License Bruce Perens, one of the founders of the Open Source movement in software, responded to Ehmke in a post titled “Sorry, Ms. Ehmke, The “Hippocratic License” Can’t Work” . “The software may not be used by individuals, corporations, governments, or other groups for systems or activities that actively and knowingly endanger harm, or otherwise threaten the physical, mental, economic, or general well-being of underprivileged individuals or groups,” he highlights in his post. “The terms are simply far more than could be enforced in a copyright license,” he further adds.  “Nobody could enforce Ms. Ehmke’s license without harming someone, or at least threatening to do so. And it would be easy to make a case for that person being underprivileged,”  he continued. He concluded saying that, though the terms mentioned in Ehmke’s license were unagreeable, he will “happily support Ms. Ehmke in pursuit of legal reforms meant to achieve the protection of underprivileged people.” Many have welcomed Ehmke's idea of an open source license with an ethical clause. However, the license is not OSI approved yet and chances are slim after Perens’ response. There are many users who do not agree with the license. Reaching a consensus will be hard. https://twitter.com/seannalexander/status/1175853429325008896 https://twitter.com/AdamFrisby/status/1175867432411336704 https://twitter.com/rishmishra/status/1175862512509685760 Even though developers host their source code on open source repositories, a license may bring certain level of restrictions on who is allowed to use the code. However, as Perens mentions, many of the terms in Ehmke’s license hard to implement. Irrespective of the outcome of this license’s approval process, Coraline Ehmke has widely opened up the topic of the need for long overdue FOSS licensing reforms in the open source community. It would be interesting to see if such a license would boost ethical reformation by giving more authority to the developers in imbibing their values and preventing the misuse of their software. Read the Hippocratic license to know more in detail. Other interesting news Tech ImageNet Roulette: New viral app trained using ImageNet exposes racial biases in artificial intelligent system Machine learning ethics: what you need to know and what you can do Facebook suspends tens of thousands of apps amid an ongoing investigation into how apps use personal data
Read more
  • 0
  • 0
  • 3950

article-image-machine-learning-ethics-what-you-need-to-know-and-what-you-can-do
Richard Gall
23 Sep 2019
10 min read
Save for later

Machine learning ethics: what you need to know and what you can do

Richard Gall
23 Sep 2019
10 min read
Ethics is, without a doubt, one of the most important topics to emerge in machine learning and artificial intelligence over the last year. While the reasons for this are complex, it nevertheless underlines that the area has reached technological maturity. After all, if artificial intelligence systems weren’t having a real, demonstrable impact on wider society, why would anyone be worried about its ethical implications? It’s easy to dismiss the debate around machine learning and artificial intelligence as abstract and irrelevant to engineers’ and developers’ immediate practical concerns. However this is wrong. Ethics needs to be seen as an important practical consideration for anyone using and building machine learning systems. If we fail to do so the consequences could be serious. The last 12 months has been packed with stories of artificial intelligence not only showing theoretical bias, but also causing discriminatory outcomes in the real world. Amazon scrapped its AI tool for hiring last October because it showed significant bias against female job applicants. Even more recently, last month it emerged that algorithms built to detect hate speech online have in-built biases against black people. Although these might seem like edge cases, it’s vital that everyone in the industry takes responsibility. This isn’t something we can leave up to regulation or other organizations the people who can really affect change are the developers and engineers on the ground. It’s true that machine learning and artificial intelligence systems will be operating in ways where ethics isn’t really an issue - and that’s fine. But by focusing on machine learning ethics, and thinking carefully about the impact of your work you will ultimately end up building better systems that are more robust and have better outcomes. So with that in mind, let’s look at the practical ways to start thinking about ethics in machine learning and artificial intelligence. Machine learning ethics and bias The first step towards thinking seriously about ethics in machine learning is to think about bias. Once you are aware of how bias can creep into machine learning systems, and how that can have ethical implications, it becomes much easier to identify issues and make changes - or, even better, stop them before they arise. Bias isn’t strictly an ethical issue. It could be a performance issue that’s affecting the effectiveness of your system. But in the conversation around AI and machine learning ethics, it’s the most practical way of starting to think seriously about the issue. Types of machine learning and algorithmic bias Although there are a range of different types of bias, the best place to begin is with two top level concepts. You may have read lists of numerous different biases, but for the purpose of talking about ethics there are two important things to think about. Pre-existing and data set biases Pre-existing biases are embedded in the data on which we choose to train algorithms. While it’s true that just about every data set will be ‘biased’ in some way (data is a representation, after all - there will always be something ‘missing), the point here is that we need to be aware of the extent of the bias and the potential algorithmic consequences. You might have heard terms like ‘sampling bias’, ‘exclusion bias’ and ‘prejudice bias’ - these aren’t radically different. They all result from pre-existing biases about how a data set looks or what it represents. Technical and contextual biases Technical machine learning bias is about how an algorithm is programmed. It refers to the problems that arise when an algorithm is built to operate in a specific way. Essentially, it occurs when the programmed elements of an algorithm fail to properly account for the context in which it is being used. A good example is the plagiarism checker Turnitin - this used an algorithm that was trained to identify strings of texts, which meant it would target non-native English speakers over English speaking ones, who were able to make changes to avoid detection. Although there are, as I’ve said, many different biases in the field of machine learning, by thinking about the data on which your algorithm is trained and the context in which the system is working, you will be in a much better place to think about the ethical implications of your work. Equally, you will also be building better systems that don’t cause unforeseen issues. Read next: How to learn data science: from data mining to machine learning The importance of context in machine learning The most important thing for anyone working in machine learning and artificial intelligence is context. Put another way, you need to have a clear sense of why you are trying to do something and what the possible implications could be. If this is unclear, think about it this way: when you use an algorithm, you’re essentially automating away decision making. That’s a good thing when you want to make lots of decisions at a huge scale. But the one thing you lose when turning decision making into a mathematical formula is context. The decisions an algorithm makes lack context because it is programmed to react in a very specific way. This means contextual awareness is your problem. That’s part of the bargain of using an algorithm. Context in data collection Let’s look at what thinking about context means when it comes to your data set. Step 1: what are you trying to achieve? Essentially, the first thing you’ll want to consider is what you’re trying to achieve. Do you want to train an algorithm to recognise faces? Do you want it to understand language in some way? Step 2: why are you doing this? What’s the point of doing what you’re doing? Sometimes this might be a straightforward answer, but be cautious if the answer is too easy to answer. Making something work more efficiently or faster isn’t really a satisfactory reason. What’s the point of making something more efficient? This is often where you’ll start to see ethical issues emerge more clearly. Sometimes they’re not easily resolved. You might not even be in a position to resolve them yourself (if you’re employed by a company, after all, you’re quite literally contracted to perform a specific task). But even if you do feel like there’s little room to maneuver, it’s important to ensure that these discussions actually take place and that you consider the impact of an algorithm. That will make it easier for you to put safeguarding steps in place. Step 3: Understanding the data set Think about how your data set fits alongside the what and the why. Is there anything missing? How was the data collected? Could it be biased or skewed in some way? Indeed, it might not even matter. But if it does, it’s essential that you pay close attention to the data you’re using. It’s worth recording any potential limitations or issues, so if a problem arises at a later stage in your machine learning project, the causes are documented and visible to others. The context of algorithm implementation The other aspect of thinking about context is to think carefully about how your machine learning or artificial intelligence system is being implemented. Is it working how you thought it would? Is it showing any signs of bias? Many articles about the limitations of artificial intelligence and machine learning ethics cite the case of Microsoft’s Tay. Tay was a chatbot that ‘learned’ from its interactions with users on Twitter. Built with considerable naivety, Twitter users turned Tay racist in a matter of days. Users ‘spoke’ to Tay using racist language, and because Tay learned through interactions with Twitter users, the chatbot quickly became a reflection of the language and attitudes of those around it. This is a good example of how the algorithm’s designers didn’t consider how the real-world implementation of the algorithm would have a negative consequence. Despite, you’d think, the best of intentions, the developers didn’t have the foresight to consider the reality of the world into which they were releasing their algorithmic progeny. Read next: Data science vs. machine learning: understanding the difference and what it means today Algorithmic impact assessments It’s true that ethics isn’t always going to be an urgent issue for engineers. But in certain domains, it’s going to be crucial, particularly in public services and other aspects of government, like justice. Maybe there should be a debate about whether artificial intelligence and machine learning should be used in those contexts at all. But if we can’t have that debate, at the very least we can have tools that help us to think about the ethical implications of the machine learning systems we build. This is where Algorithmic Impact Assessments come in. The idea was developed by the AI Now institute and outlined in a paper published last year, and was recently implemented by the Canadian government. There’s no one way to do an algorithmic impact assessment - the Canadian government uses a questionnaire “designed to help you assess and mitigate the risks associated with deploying an automated decision system.” This essentially provides a framework for those using and building algorithms to understand the scope of their project and to identify any potential issues or problems that could arise. Tools for assessing bias and supporting ethical engineering However, although algorithmic impact assessments can provide you with a solid conceptual grounding for thinking about the ethical implications of artificial intelligence and machine learning systems, there are also a number of tools that can help you better understand the ways in which algorithms could be perpetuating biases or prejudices. One of these is FairML, “an end-to- end toolbox for auditing predictive models by quantifying the relative significance of the model's inputs” - helping engineers to identify the extent to which algorithmic inputs could cause harm or bias - while another is LIME (Local Interpretable Model Agnostic Explanations). LIME is not dissimilar to FairML. it aims to understand why an algorithm makes the decisions it does by ‘perturbing’ inputs and seeing how this affects its outputs. There’s also Deon, which is a lot like a more lightweight, developer-friendly version of an algorithmic assessment impact. It’s a command line tool that allows you to add an ethics checklist to your projects. All these tools underline some of the most important elements in the fight for machine learning ethics. FairML and LIME are both attempting to make interpretability easier, while Deon is making it possible for engineers to bring a holistic and critical approach directly into their day to day work. It aims to promote transparency and improve communication between engineers and others. The future of artificial intelligence and machine learning depends on developers taking responsibility Machine learning and artificial intelligence are hitting maturity. They’re technologies that are now, after decades incubated in computer science departments and military intelligence organizations, transforming and having an impact in a truly impressive range of domains. With this maturity comes more responsibility. Ethical questions arise as machine learning affects change everywhere, spilling out into everything from marketing to justice systems. If we can’t get machine learning ethics right, then we’ll never properly leverage the benefits of artificial intelligence and machine learning. People won’t trust it and legislation will start to severely curb what it can do. It’s only by taking responsibility for its effects and consequences that we can be sure it will not only have a transformative impact on the world, but also one that’s safe and for the benefit of everyone.
Read more
  • 0
  • 0
  • 13683

article-image-bad-metadata-can-get-you-in-legal-hot-water
Guest Contributor
21 Sep 2019
6 min read
Save for later

Bad Metadata can get you in legal hot water

Guest Contributor
21 Sep 2019
6 min read
Metadata isn't just something that concerns business intelligence and IT teams; but lawyers are extremely interested in it as well. Metadata, it turns out, can win or lose lawsuits, send politicians to jail, and even decide medical malpractice cases. It's not uncommon for attorneys who conduct discovery of electronic records in organizations to find that the claims of plaintiffs or defendants are contradicted by metadata, like time and date, type of data, etc. If a discovery process is initiated against them, an organization had better be sure that its metadata is in order. All it would take for an organization to lose a case would be for an attorney to discover a discrepancy in different databases – a different time stamp on some communication, a different job title for a principal in the case. Such discrepancies could lead to accusations of data tampering, fraud, or worse – and would most definitely put the organization in a very tough position versus a judge or jury. Metadata errors are difficult to spot The problem with that, of course, is that catching metadata errors is extremely difficult. In large organizations, data is stored in repositories that are spread throughout the organization, maybe even the world – in different departments. Each department is responsible for maintaining its own database, and the metadata in it; and on different cloud storage repositories, which may have their own system of classifying data. An enterprising attorney could have a field day with the different categories and tags data is stored under, making claims that the organization is trying to “hide something.” The organization's only defense: We're poor administrators. That may not be enough to impress the court. Types of Metadata Metadata is “data about data,” and comes in three flavors: System Metadata, which is data that is automatically generated from the computer and includes specific labeled criteria, like the date and time of creation and date a document was modified, etc. Substantive Metadata reflects changes to a document, like tracked changes. Embedded metadata is data entered into a document or file but not normally visible, such as formulas in cells in an Excel spreadsheet. All of these have increasingly become targets for attorneys in recent years. Metadata has been used in thousands of cases – medical, financial, patent and trademark law, product liability, civil rights, and many more. Metadata is both discoverable and admissible as evidence. According to one New York court, “General information about the creation of a document, including who authored a document and when it was created, is pedigree information often important for purposes of determining admissibility at trial.” According to legal experts, “from a legal standpoint metadata is evidence… that describes the characteristics, origins, usage, and validity of other electronic evidence.” The biggest metadata-linked payout until now - $10.8 million – occurred in 2017, when a jury awarded a plaintiff $8 million (eventually this was increased to nearly $11 million) after claiming he was fired from a biotechnology company after telling authorities about potential bribery in China. The key piece of evidence was the metadata timestamp on a performance review that was written after the plaintiff was fired; with that evidence, the court increased the defendant's payout for violating laws against firing whistleblowers. In that case, records claiming that the employee was fired for cause were belied by the metadata in the performance review. That, of course, was a case in which there was clear wrongdoing by an organization. But the same metadata errors could have cropped up in any number of scenarios, even if no laws were broken. The precedent in this case, and others like it, might be enough to convince the court to penalize an organization based on claims of a plaintiff. How can organizations defend themselves from this legal bind of metadata The answer would seem obvious; get control of your metadata and make sure it corresponds to the data it represents. With that kind of control over data, organizations would discover for themselves if something was amiss that could cost them in a settlement later. But execution of that obvious answer is a different story. With reams of data to pore through, it would take an organization's business intelligence team months, or even years, to manually sift through the databases. And because to err is human, there would be no guarantee they hadn't missed something. Clearly Business Intelligence and Data Analysis teams need some help in doing this. One solution would be to hire more staff, expanding teams at least temporarily to make sense of the data and metadata that could prove problematic. There are services that will lend their staff to an organization to do just that, and for companies that prefer the “human touch,” adding that temporary staff may be the best solution. Another idea is to automate the process, with advanced tools that will do a full examination of data, both across systems and within repositories themselves. Such automated tools would examine the data in the various repositories and find where the metadata for the same information is different – pointing BI teams in the right direction and cutting down on the time needed to determine what needs to be fixed. Using automated metadata management tools, companies can ensure that they remain secure. If a company is being sued and discovery has commenced, it will be too late for the organization to fix anything. Honest mistakes or disorganized file keeping can no longer be corrected, and the fate of the organization will be in the hands of a jury or a judge. Automated metadata management tools can help Business Intelligence and Data Analysis teams figure out which metadata entries are not consistent across the repositories, ensuring that things are fixed before discovery takes place. There are a variety of tools on the market, with various strengths and weaknesses. Companies will need to decide whether a data dictionary, a business glossary, or a more all-encompassing product best answers their needs. They’ll also need to make sure the enterprise software they currently use is supported by the metadata management solution they are after. As the market develops, AI will be a huge distinguishing factor between metadata solutions, as machine learning will reduce the cost and manpower investment of solution onboarding significantly. With the success of recent metadata-based lawsuits, you can be sure more attorneys will be using metadata in their discovery processes. Organizations that want to defend themselves need to get their data in order, and ensure that they won't end up losing lots of money because of their own errors. Author Bio Amnon Drori is the Co-Founder and CEO of Octopai and has over 20 years of leadership experience in technology companies. Before co-founding Octopai he led sales efforts at companies like Panaya (Acquired by Infosys), Zend Technologies (Acquired by Rogue Wave Software), ModusNovo and Alvarion. Other interesting news in Tech Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data and Society report. Open AI researchers advance multi-agent competition by training AI agents in a hide and seek environment. France and Germany reaffirm blocking Facebook’s Libra cryptocurrency
Read more
  • 0
  • 0
  • 4753
article-image-how-to-handle-categorical-data-for-machine-learning-algorithms
Packt Editorial Staff
20 Sep 2019
9 min read
Save for later

How to handle categorical data for machine learning algorithms

Packt Editorial Staff
20 Sep 2019
9 min read
The quality of data and the amount of useful information are key factors that determine how well a machine learning algorithm can learn. Therefore, it is absolutely critical that we make sure to encode categorical variables correctly, before we feed data into a machine learning algorithm. In this article, with simple yet effective examples we will explain how to deal with categorical data in computing machine learning algorithms and how we to map ordinal and nominal feature values to integer representations. The article is an excerpt from the book Python Machine Learning - Third Edition by Sebastian Raschka and Vahid Mirjalili. This book is a comprehensive guide to machine learning and deep learning with Python. It acts as both a clear step-by-step tutorial, and a reference you’ll keep coming back to as you build your machine learning systems.  It is not uncommon that real-world datasets contain one or more categorical feature columns. When we are talking about categorical data, we have to further distinguish between nominal and ordinal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order XL > L > M. In contrast, nominal features don't imply any order and, to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn't make sense to say that, for example, red is larger than blue. Categorical data encoding with pandas Before we explore different techniques to handle such categorical data, let's create a new DataFrame to illustrate the problem: >>> import pandas as pd >>> df = pd.DataFrame([ ...            ['green', 'M', 10.1, 'class1'], ...            ['red', 'L', 13.5, 'class2'], ...            ['blue', 'XL', 15.3, 'class1']]) >>> df.columns = ['color', 'size', 'price', 'classlabel'] >>> df color  size price  classlabel 0   green     M 10.1     class1 1     red   L 13.5      class2 2    blue   XL 15.3      class1 As we can see in the preceding output, the newly created DataFrame contains a nominal feature (color), an ordinal feature (size), and a numerical feature (price) column. The class labels (assuming that we created a dataset for a supervised learning task) are stored in the last column. Mapping ordinal features To make sure that the learning algorithm interprets the ordinal features correctly, we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between features, for example, XL = L + 1 = M + 2: >>> size_mapping = { ...                 'XL': 3, ...                 'L': 2, ...                 'M': 1} >>> df['size'] = df['size'].map(size_mapping) >>> df color  size price  classlabel 0   green     1 10.1     class1 1     red   2 13.5      class2 2    blue     3 15.3     class1 If we want to transform the integer values back to the original string representation at a later stage, we can simply define a reverse-mapping dictionary inv_size_mapping = {v: k for k, v in size_mapping.items()} that can then be used via the pandas map method on the transformed feature column, similar to the size_mapping dictionary that we used previously. We can use it as follows: >>> inv_size_mapping = {v: k for k, v in size_mapping.items()} >>> df['size'].map(inv_size_mapping) 0   M 1   L 2   XL Name: size, dtype: object Encoding class labels Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches. To encode the class labels, we can use an approach similar to the mapping of ordinal features discussed previously. We need to remember that class labels are not ordinal, and it doesn't matter which integer number we assign to a particular string label. Thus, we can simply enumerate the class labels, starting at 0: >>> import numpy as np >>> class_mapping = {label:idx for idx,label in ...                  enumerate(np.unique(df['classlabel']))} >>> class_mapping {'class1': 0, 'class2': 1} Next, we can use the mapping dictionary to transform the class labels into integers: >>> df['classlabel'] = df['classlabel'].map(class_mapping) >>> df     color  size price  classlabel 0   green     1 10.1         0 1     red   2 13.5           1 2    blue     3 15.3           0 We can reverse the key-value pairs in the mapping dictionary as follows to map the converted class labels back to the original string representation: >>> inv_class_mapping = {v: k for k, v in class_mapping.items()} >>> df['classlabel'] = df['classlabel'].map(inv_class_mapping) >>> df     color  size price  classlabel 0   green     1 10.1     class1 1     red   2 13.5      class2 2    blue     3 15.3     class1 Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this: >>> from sklearn.preprocessing import LabelEncoder >>> class_le = LabelEncoder() >>> y = class_le.fit_transform(df['classlabel'].values) >>> y array([0, 1, 0]) Note that the fit_transform method is just a shortcut for calling fit and transform separately, and we can use the inverse_transform method to transform the integer class labels back into their original string representation: >>> class_le.inverse_transform(y) array(['class1', 'class2', 'class1'], dtype=object) Performing a technique ‘one-hot encoding’ on nominal features In the Mapping ordinal features section, we used a simple dictionary-mapping approach to convert the ordinal size feature into integers. Since scikit-learn's estimators for classification treat class labels as categorical data that does not imply any order (nominal), we used the convenient LabelEncoder to encode the string labels into integers. It may appear that we could use a similar approach to transform the nominal color column of our dataset, as follows: >>> X = df[['color', 'size', 'price']].values >>> color_le = LabelEncoder() >>> X[:, 0] = color_le.fit_transform(X[:, 0]) >>> X array([[1, 1, 10.1],        [2, 2, 13.5],        [0, 3, 15.3]], dtype=object) After executing the preceding code, the first column of the NumPy array X now holds the new color values, which are encoded as follows: blue = 0 green = 1 red = 2 If we stop at this point and feed the array to our classifier, we will make one of the most common mistakes in dealing with categorical data. Can you spot the problem? Although the color values don't come in any particular order, a learning algorithm will now assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, the algorithm could still produce useful results. However, those results would not be optimal. A common workaround for this problem is to use a technique called one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of an example; for example, a blue example can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in scikit-learn's preprocessing module: >>> from sklearn.preprocessing import OneHotEncoder >>> X = df[['color', 'size', 'price']].values >>> color_ohe = OneHotEncoder() >>> color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray()  array([[0., 1., 0.],            [0., 0., 1.],            [1., 0., 0.]]) Note that we applied the OneHotEncoder to a single column (X[:, 0].reshape(-1, 1))) only, to avoid modifying the other two columns in the array as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer that accepts a list of (name, transformer, column(s)) tuples as follows: >>> from sklearn.compose import ColumnTransformer >>> X = df[['color', 'size', 'price']].values >>> c_transf = ColumnTransformer([ ...     ('onehot', OneHotEncoder(), [0]), ...     ('nothing', 'passthrough', [1, 2]) ... ]) >>> c_transf.fit_transform(X) .astype(float)     array([[0.0, 1.0, 0.0, 1, 10.1],            [0.0, 0.0, 1.0, 2, 13.5],            [1.0, 0.0, 0.0, 3, 15.3]]) In the preceding code example, we specified that we only want to modify the first column and leave the other two columns untouched via the 'passthrough' argument. An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies method implemented in pandas. Applied to a DataFrame, the get_dummies method will only convert string columns and leave all other columns unchanged: >>> pd.get_dummies(df[['price', 'color', 'size']])     price  size color_blue  color_green color_red 0    10.1     1   0 1          0 1    13.5     2   0 0          1 2    15.3     3   1 0          0 When we are using one-hot encoding datasets, we have to keep in mind that it introduces multicollinearity, which can be an issue for certain methods (for instance, methods that require matrix inversion). If features are highly correlated, matrices are computationally difficult to invert, which can lead to numerically unstable estimates. To reduce the correlation among variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column, though; for example, if we remove the column color_blue, the feature information is still preserved since if we observe color_green=0 and color_red=0, it implies that the observation must be blue. If we use the get_dummies function, we can drop the first column by passing a True argument to the drop_first parameter, as shown in the following code example: >>> pd.get_dummies(df[['price', 'color', 'size']], ...                drop_first=True)     price  size color_green  color_red 0    10.1     1     1 0 1    13.5     2     0 1 2    15.3     3     0 0 In order to drop a redundant column via the OneHotEncoder , we need to set drop='first' and set categories='auto' as follows: >>> color_ohe = OneHotEncoder(categories='auto', drop='first') >>> c_transf = ColumnTransformer([  ...            ('onehot', color_ohe, [0]), ...            ('nothing', 'passthrough', [1, 2]) ... ]) >>> c_transf.fit_transform(X).astype(float) array([[  1. , 0. ,  1. , 10.1],        [  0. ,  1. , 2. ,  13.5],        [  0. ,  0. , 3. ,  15.3]]) In this article, we have gone through some of the methods to deal with categorical data in datasets. We distinguished between nominal and ordinal features, and with examples we explained how they can be handled. To harness the power of the latest Python open source libraries in machine learning check out this book Python Machine Learning - Third Edition, written by Sebastian Raschka and Vahid Mirjalili. Other interesting read in data! The best business intelligence tools 2019: when to use them and how much they cost Introducing Microsoft’s AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine Media manipulation by Deepfakes and cheap fakes require both AI and social fixes, finds a Data & Society report
Read more
  • 0
  • 0
  • 16217

article-image-github-acquires-semmle-to-secure-open-source-supply-chain-attains-cve-numbering-authority-status
Savia Lobo
19 Sep 2019
5 min read
Save for later

GitHub acquires Semmle to secure open-source supply chain; attains CVE Numbering Authority status

Savia Lobo
19 Sep 2019
5 min read
Yesterday, GitHub announced that it has acquired Semmle, a code analysis platform provider and also that it is now a Common Vulnerabilities and Exposures (CVE) Numbering Authority. https://twitter.com/github/status/1174371016497405953 The Semmle acquisition is a part of the plan to securing the open-source supply chain, Nat Friedman explains in his blog post. Semmle provides a code analysis engine, named QL, which allows developers to write queries that identify code patterns in large codebases and search for vulnerabilities and their variants. Security researchers use Semmle to quickly find vulnerabilities in code with simple declarative queries. “Semmle is trusted by security teams at Uber, NASA, Microsoft, Google, and has helped find thousands of vulnerabilities in some of the largest codebases in the world, as well as over 100 CVEs in open source projects to date,” Friedman writes. Also Read: GitHub now supports two-factor authentication with security keys using the WebAuthn API Semmle originally spun out of research at Oxford in 2006 announced a $21 million Series B investment led by Accel Partners, last year. “In total, the company raised $31 million before this acquisition,” Techcrunch reports. Shanku Niyogi, Senior Vice President of Product at GitHub, in his blog post writes, “An important measure of the success of Semmle’s approach is the number of vulnerabilities that have been identified and disclosed through their technology. Today, over 100 CVEs in open source projects have been found using Semmle, including high-profile projects like Apache Struts, Apple’s XNU, the Linux Kernel, Memcached, U-Boot, and VLC. No other code analysis tool has a similar success rate.” GitHub also announced that it has been approved as a CVE Numbering Authority for open source projects. Now, GitHub will be able to issue CVEs for security advisories opened on GitHub, allowing for even broader awareness across the industry. With Semmle integration, every CVE-ID can be associated with a Semmle QL query, which can then be shared and tracked by the broader developer community. The CVE approval will make it easier for project maintainers to report security flaws directly from their repositories. Also, GitHub can assign CVE identifiers directly and post them to the CVE List and the National Vulnerability Database (NVD). Earlier this year, GitHub acquired Dependabot, to provide automatic security fixes natively within GitHub. With automatic security fixes, developers no longer need to manually patch their dependencies. When a vulnerability is found in a dependency, GitHub will automatically issue a pull request on downstream repositories with the information needed to accept the patch. In August, GitHub was in the limelight for being a part of the Capital One data breach that affected 106 million users in the US and Canada. The law firm Tycko & Zavareei LLP filed a lawsuit in California’s federal district court on behalf of their plaintiffs Seth Zielicke and Aimee Aballo. Also Read: GitHub acquires Spectrum, a community-centric conversational platform Both plaintiffs claimed Capital One and GitHub were unable to protect user’s personal data. The complaint highlighted that Paige A. Thompson, the alleged hacker stole the data in March, posted about the theft on GitHub in April. According to the lawsuit, “As a result of GitHub’s failure to monitor, remove, or otherwise recognize and act upon obviously-hacked data that was displayed, disclosed, and used on or by GitHub and its website, the Personal Information sat on GitHub.com for nearly three months.” The Semmle acquisition may be GitHub’s move to improve security for users in the future. It would be interesting to know how GitHub will mold security for users with additional CVE approval. A user on Reddit writes, “I took part in a tutorial session Semmle held at a university CS society event, where we were shown how to use their system to write semantic analysis passes to look for things like use-after-free and null pointer dereferences. It was only an hour and a bit long, but I found the query language powerful & intuitive and the platform pretty effective. At the time, you could set up your codebase to run Semmle passes on pre-commit hooks or CI deployments etc. and get back some pretty smart reporting if you had introduced a bug.” The user further writes, “The session focused on Java, but a few other languages were supported as first-class, iirc. It felt kinda like writing an SQL query, but over AST rather than tuples in a table, and using modal logic to choose the selections. It took a little while to first get over the 'wut' phase (like 'how do I even express this'), but I imagine that a skilled team, once familiar with the system, could get a lot of value out of Semmle's QL/semantic analysis, especially for large/enterprise-scale codebases.” https://twitter.com/kurtseifried/status/1174395660960796672 https://twitter.com/timneutkens/status/1174598659310313472 To know more about this announcement in detail, read GitHub’s official blog post. Other news in Data Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out Introducing Microsoft’s AirSim, an open-source simulator for autonomous vehicles built on Unreal Engine GQL (Graph Query Language) joins SQL as a Global Standards Project and is now the international standard declarative query language for graphs
Read more
  • 0
  • 0
  • 2624

article-image-the-best-business-intelligence-tools-2019-when-to-use-them-and-how-much-they-cost
Richard Gall
19 Sep 2019
11 min read
Save for later

The best business intelligence tools 2019: when to use them and how much they cost

Richard Gall
19 Sep 2019
11 min read
Business intelligence is big business. Salesforce’s purchase of Tableau earlier this year (for a cool $16 billion) proves the value of a powerful data analytics platform, and demonstrates how the business intelligence space is reshaping expectations and demands in the established CRM and ERP marketplace. To a certain extent, the amount Salesforce paid for Tableau highlights that when it comes to business intelligence, tooling is paramount. Without a tool that fits the needs and skill levels of those that need BI and use analytics, discussions around architecture and strategy are practically moot. So, what are the best business intelligence tools? And more importantly, how do they differ from one another? Which one is right for you? Read next: 4 important business intelligence considerations for the rest of 2019 The best business intelligence tools 2019 Tableau Let’s start with the obvious one: Tableau. It’s one of the most popular business intelligence tools on the planet, and with good reason; it makes data visualization and compelling data storytelling surprisingly easy. With a drag and drop interface, Tableau requires no coding knowledge from users. It also allows users to ask ‘what if’ scenarios to model variable changes, which means you can get some fairly sophisticated insights with just a few simple interactions. But while Tableau is undoubtedly designed to be simple, it doesn’t sacrifice complexity. Unlike other business intelligence tools, Tableau allows users to include an unlimited number of datapoints in their analytics projects. When should you use Tableau and how much does it cost? The core Tableau product is aimed at data scientists and data analysts who want to be able to build end-to-end analytics pipelines. You can trial the product for free for 14 days, but this will then cost you $70/month. This is perhaps one of the clearest use cases - if you’re interested and passionate about data, Tableau practically feels like a toy. However, for those that want to employ Tableau across their organization, the product offers a neat pricing tier. Tableau Creator is built for individual power users - like those described above, Tableau Explorer for self-service analytics, and Tableau Viewer for those that need access to Tableau for limited access to dashboard and analytics. Tableau eBooks and videos Mastering Tableau 2019.1 - Second Edition Tableau 2019.x Cookbook Getting Started with Tableau 2019.2 - Second Edition Tableau in 7 Steps [Video] PowerBI PowerBI is Microsoft’s business intelligence platform. Compared to Tableau it is designed more for reporting and dashboards rather than data exploration and storytelling. If you use a wide range of Microsoft products, PowerBI is particularly powerful. It can become a centralized space for business reporting and insights. Like Tableau, it’s also relatively easy to use. With support from Microsoft Cortana - the company’s digital assistant - it’s possible to perform natural language queries. When should you use PowerBI and how much does it cost? PowerBI is an impressive business intelligence product. But to get the most value, you need to be committed to Microsoft. This isn’t to say you shouldn’t be - the company has been on form in recent years and appears to really understand what modern businesses and users need. On a similar note, a good reason to use PowerBI is for unified and aligned business insights. If Tableau is more suited to personal exploration, or project-based storytelling, PowerBI is an effective option for organizations that want more clarity and shared visibility on key performance metrics. This is reflected in the price. For personal users the desktop version of PowerBI is free, while a pro license is $9.99 a month. A premium plan which includes cloud resources (storage and compute) starts at $4,995. This is the option for larger organizations that are fully committed to the Microsoft suite and has a clear vision of how it wants to coordinate analytics and reporting. PowerBI eBooks and videos Learn Power BI Microsoft Power BI Quick Start Guide Learning Microsoft Power BI [Video] Qlik Sense and QlikView Okay, so here we’re going to include two business intelligence products together: Qlik Sense and QlikView. Obviously, they’re both part of the same family - they’re built by business intelligence company Qlik. More importantly, they’re both quite different products. What’s the difference between Qlik Sense and QlikView? As we’ve said, Qlik Sense and QlikView are two different products. QlikView is the older and more established tool. It’s what’s usually described as a ‘guided analytics’ platform, which means dashboards and analytics applications can be built for end users. The tool gives freedom to engineers and data scientists to build what they want but doesn’t allow end users to ‘explore’ data in any more detail than what is provided. QlikView is quite a sophisticated platform and is widely regarded as being more complex to use than Tableau or PowerBI. While PowerBI or Tableau can be used by anyone with an intermediate level of data literacy and a willingness to learn, QlikView will always be the preserve of data scientists and analysts. This doesn’t make it a poor choice. If you know how to use it properly, QlikView can provide you with more in-depth analysis than any other business intelligence platforms, helping users to see patterns and relationships across different data sets. If you’re working with big data, for example, and you have a team of data scientists and data engineers, QlikView is a good option. Qlik Sense, meanwhile, could be seen as Qlik’s attempt to compete with the likes of Tableau and PowerBI. It’s a self-service BI tool, which allows end users to create their own data visualisations and explore data through a process of ‘data discovery’. When should you use QlikView and how much does it cost? QlikView should be used when you need to build a cohesive reporting and business intelligence solution. It’s perfect for when you need a space to manage KPIs and metrics across different teams. Although a free edition is available for personal use, Qlik doesn’t publish prices for enterprise users. You’ll need to get in touch with the company’s sales team to purchase. QlikView eBooks and videos QlikView: Advanced Data Visualization QlikView Dashboard Development [Video] When should you use Qlik Sense and how much does it cost? Qlik Sense should be used when you have an organization full of people curious and prepared to get their hands on their data. If you already have an established model of reporting performance, Qlik Sense is a useful extra that can give employees more autonomy over how data can be used. When it comes to pricing, Qlik Sense is one of the more complicated business intelligence options. Like QlikView, there’s a free option for personal use, and again like QlikView, there’s no public price available - so you’ll have to connect with Qlik directly. To add an additional layer of complexity, there’s also a product called ‘Cloud Basic’ - this is free and can be shared between up to 5 users. It’s essentially a SaaS version of the Qlik Sense product. If you need to add more than 5 users, it costs $15 per user/month. Qlik Sense eBooks and videos Mastering Qlik Sense [Video] Qlik Sense Cookbook - Second Edition Data Storytelling with Qlik Sense [Video] Hands-On Business Intelligence with Qlik Sense Read next: Top 5 free Business Intelligence tools Splunk Splunk isn’t just a business intelligence tool. To a certain extent, it’s related to application monitoring and logging tools such as Datadog, New Relic, and AppDynamics. It’s built for big data and real-time analytics, which means that it’s well-suited to offering insights on business processes and product performance. The copy on the Splunk website talks about “real-time visibility across the enterprise” and describes Splunk as a “data-to-everything” platform. The product, then, is pitching itself as something that can embed itself inside existing systems, and bring insight and intelligence to places and spaces where it’s particularly valuable. This is in contrast to PowerBI and Tableau, which are designed for exploration and accessibility. This isn’t to say that Splunk doesn’t enable sophisticated data exploration, but rather that it is geared towards monitoring systems and processes, understanding change. It’s a tool built for companies that need full transparency - or, in other words, dynamic operational intelligence. When should you use Splunk and how much does it cost? Splunk is a tool that should be used if you’re dealing with dynamic and real-time data. If you want to be able to model and explore wide-ranging existing sets of data Tableau or PowerBI are probably a better bet. But if you need to be able to make decisions in an active and ongoing scenario, Splunk is a tool that can provide substantial support. The reason that Splunk is included as a part of this list of business intelligence tools is because real-time visibility and insight is vital for businesses. Typically understanding application performance or process efficiency might have been embedded within particular departments, such as a centralized IT function. Now, with businesses dependent upon operational excellence, and security and reliability in the digital arena becoming business critical, Splunk is a tool that deserves its status inside (and across) organizations. Splunk’s pricing is complicated. Prices are generally dependent on how much data you want to index - or, in other words, how much you’re giving Splunk to deal with. But to add to that, Splunk also have a perpetual license ( a one time payment) and an annual term license, which needs to be renewed. So, you can index 1GB/day for $4,500 on a perpetual license, or $1,800 on an annual license. If you want to learn more about Splunk’s pricing option, this post is very useful. Splunk eBooks and videos Splunk 7 Essentials [E-Learning] Splunk 7.x Quick Start Guide Splunk Operational Intelligence Cookbook - Third Edition IBM Cognos IBM Cognos is IBM’s flagship business intelligence tool. It’s probably best viewed as existing somewhere between PowerBI and Tableau. It’s designed for reporting dashboards that allow monitoring and analytics, but it is nevertheless also intended for self-service. To that end, you might say it’s more limited in capabilities than PowerBI, but it’s nevertheless more accessible for non-technical end users to explore data. It’s also relatively easy to integrate with other systems and data sources. So, if your data is stored in Microsoft or Oracle cloud services and databases, it’s relatively straightforward to get started with IBM Cognos. However, it’s worth noting that despite the accesibility of IBM’s product, it still needs centralized control and implementation. It doesn’t offer the level of ease that you get with Tableau, for example. When should you use IBM Cognos and how much does it cost? Cognos is perhaps the go-to option if PowerBI and Tableau don’t quite work for you. Perhaps you like the idea of Tableau but need more centralization. Or maybe you need a strong and cohesive reporting system but don’t feel prepared to buy into Microsoft. This isn’t to make IBM Cognos sound like the outsider - in fact, from an efficiency perspective it’s possibly the best way to ensure to ensure some degree of portability between data sources and to manage the age-old problem of data silos. If you’re not quite sure what business intelligence tool is right for you, it’s well worth taking advantage of Cognos’s free trial - you get unlimited access for a month. If you like what you get, you then have a choice between a premium version - which costs $70 per user/month, and the enterprise plan, the price of which isn’t publicly available. IBM Cognos eBooks and videos IBM Cognos Framework Manager [Video] IBM Cognos Report Studio [Video] IBM Cognos Connection and Workspace Advanced [Video] Conclusion: To choose the best business intelligence solution for your organization, you need to understand your needs and goals Business intelligence is a crowded market. The products listed here are the tip of the iceberg when it comes to analytics, monitoring, and data visualization. This is good and bad - it means there are plenty of options and opportunities, but it also means that sorting through the options to find the right one might take up some of your time. That’s okay though - if possible, try to take advantage of free trial periods. And if you’re in a rush to get work done, use them on active projects. You could even allocate different platforms and tools to different team members and get them to report on what worked well and what didn’t. That way you can have documented insights on how the products might actually be used within the organization. This will help you to better reach a conclusion about the best tool for the job. Business intelligence done well can be extremely valuable - so don’t waste money and don’t waste time on tools that aren’t going to deliver what you need.
Read more
  • 0
  • 0
  • 7016
article-image-gql-graph-query-language-joins-sql-as-a-global-standards-project-and-is-now-the-international-standard-declarative-query-language-for-graphs
Amrata Joshi
19 Sep 2019
6 min read
Save for later

GQL (Graph Query Language) joins SQL as a Global Standards Project and will be the international standard declarative query language for graphs

Amrata Joshi
19 Sep 2019
6 min read
On Tuesday, the team at Neo4j, the graph database management system announced that the international committees behind the development of the SQL standard have voted to initiate GQL (Graph Query Language) as the new database query language. GQL is now going to be the international standard declarative query language for property graphs and it is also a Global Standards Project. GQL is developed and maintained by the same international group that maintains the SQL standard. How did the proposal for GQL pass? Last year in May, the initiative for GQL was first time processed in the GQL Manifesto. This year in June, the national standards bodies across the world from the ISO/IEC’s Joint Technical Committee 1 (responsible for IT standards) started voting on the GQL project proposal.  The ballot closed earlier this week and the proposal was passed wherein ten countries including Germany, Korea, United States, UK, and China voted in favor. And seven countries agreed to put forward their experts to work on this project. Japan was the only country to vote against in the ballot because according to Japan, existing languages already do the job, and SQL/Property Graph Query extensions along with the rest of the SQL standard can do the same job. According to the Neo4j team, the GQL project will initiate development of next-generation technology standards for accessing data. Its charter mandates building on core foundations that are established by SQL and ongoing collaboration in order to ensure SQL and GQL interoperability and compatibility. GQL would reflect rapid growth in the graph database market by increasing adoption of the Cypher language.  Stefan Plantikow, GQL project lead and editor of the planned GQL specification, said, “I believe now is the perfect time for the industry to come together and define the next generation graph query language standard.”  Plantikow further added, “It’s great to receive formal recognition of the need for a standard language. Building upon a decade of experience with property graph querying, GQL will support native graph data types and structures, its own graph schema, a pattern-based approach to data querying, insertion and manipulation, and the ability to create new graphs, and graph views, as well as generate tabular and nested data. Our intent is to respect, evolve, and integrate key concepts from several existing languages including graph extensions to SQL.” Keith Hare, who has served as the chair of the international SQL standards committee for database languages since 2005, charted the progress toward GQL, said, “We have reached a balance of initiating GQL, the database query language of the future whilst preserving the value and ubiquity of SQL.” Hare further added, “Our committee has been heartened to see strong international community participation to usher in the GQL project.  Such support is the mark of an emerging de jure and de facto standard .” The need for a graph-specific query language Researchers and vendors needed a graph-specific query language because of the following limitations: SQL/PGQ language is restricted to read-only queries SQL/PGQ cannot project new graphs The SQL/PGQ language can only access those graphs that are based on taking a graph view over SQL tables. Researchers and vendors needed a language like Cypher that would cover insertion and maintenance of data and not just data querying. But SQL wasn’t the apt model for a graph-centric language that takes graphs as query inputs and outputs a graph as a result. But GQL, on the other hand, builds in openCypher, a project that brings Cypher to Apache Spark and gives users a composable graph query language. SQL and GQL can work together According to most of the companies and national standards bodies that are supporting the GQL initiative, GQL and SQL are not competitors. Instead, these languages can complement each other via interoperation and shared foundations. Alastair Green, Query Languages Standards & Research Lead at Neo4j writes, “A SQL/PGQ query is in fact a SQL sub-query wrapped around a chunk of proto-GQL.” SQL is a language that is built around tables whereas GQL is built around graphs. Users can use GQL to find and project a graph from a graph.  Green further writes, “I think that the SQL standards community has made the right decision here: allow SQL, a language built around tables, to quote GQL when the SQL user wants to find and project a table from a graph, but use GQL when the user wants to find and project a graph from a graph. Which means that we can produce and catalog graphs which are not just views over tables, but discrete complex data objects.” It is still not clear when will the first implementation version of GQL will be out. The official page reads,  “The work of the GQL project starts in earnest at the next meeting of the SQL/GQL standards committee, ISO/IEC JTC 1 SC 32/WG3, in Arusha, Tanzania, later this month. It is impossible at this stage to say when the first implementable version of GQL will become available, but it is highly likely that some reasonably complete draft will have been created by the second half of 2020.” Developer community welcomes the new addition Users are excited to see how GQL will incorporate Cypher, a user commented on HackerNews, “It's been years since I've worked with the product and while I don't miss Neo4j, I do miss the query language. It's a little unclear to me how GQL will incorporate Cypher but I hope the initiative is successful if for no other reason than a selfish one: I'd love Cypher to be around if I ever wind up using a GraphDB again.” Few others mistook GQL to be Facebook’s GraphQL and are sceptical about the name. A comment on HackerNews reads, “Also, the name is of course justified, but it will be a mess to search for due to (Facebook) GraphQL.” A user commented, “I read the entire article and came away mistakenly thinking this was the same thing as GraphQL.” Another user commented, “That's quiet an unfortunate name clash with the existing GraphQL language in a similar domain.” Other interesting news in Data Media manipulation by Deepfakes and cheap fakes refquire both AI and social fixes, finds a Data & Society report Percona announces Percona Distribution for PostgreSQL to support open source databases  Keras 2.3.0, the first release of multi-backend Keras with TensorFlow 2.0 support is now out
Read more
  • 0
  • 0
  • 6515

article-image-4-important-business-intelligence-considerations-for-the-rest-of-2019
Richard Gall
16 Sep 2019
7 min read
Save for later

4 important business intelligence considerations for the rest of 2019

Richard Gall
16 Sep 2019
7 min read
Business intelligence occupies a strange position, often overshadowed by fields like data science and machine learning. But it remains a critical aspect of modern business - indeed, the less attention the world appears to pay to it, the more it is becoming embedded in modern businesses. Where analytics and dashboards once felt like a shiny and exciting interruption in our professional lives, today it is merely the norm. But with business intelligence almost baked into the day to day routines and activities of many individuals, teams, and organizations, what does this actually mean in practice. For as much as we’d like to think that we’re all data-driven now, the reality is that there’s much we can do to use data more effectively. Research confirms that data-driven initiatives often fail - so with that in mind here’s what’s important when it comes to business intelligence in 2019. Popular business intelligence eBooks and videos Oracle Business Intelligence Enterprise Edition 12c - Second Edition Microsoft Power BI Quick Start Guide Implementing Business Intelligence with SQL Server 2019 [Video] Hands-On Business Intelligence with Qlik Sense Hands-On Dashboard Development with QlikView Getting the balance between self-service business intelligence and centralization Self-service business intelligence is one of the biggest trends to emerge in the last two years. In practice, this means that a diverse range of stakeholders (marketers and product managers for example) have access to analytics tools. They’re no longer purely the preserve of data scientists and analysts. Self-service BI makes a lot of sense in the context of today’s data rich and data-driven environment. The best way to empower team members to actually use data is to remove any bottlenecks (like a centralized data team) and allow them to go directly to the data and tools they need to make decisions. In essence, self-service business intelligence solutions are a step towards the democratization of data. However, while the notion of democratizing data sounds like a noble cause, the reality is a little more complex. There are a number of different issues that make self-service BI a challenging thing to get right. One of the biggest pain points, for example, are the skill gaps of teams using these tools. Although self-service BI should make using data easy for team members, even the most user-friendly dashboards need a level of data literacy to be useful. Read next: What are the limits of self-service BI? Many analytics products are being developed with this problem in mind. But it’s still hard to get around - you don’t, after all, want to sacrifice the richness of data for simplicity and accessibility. Another problem is the messiness of data itself - and this ultimately points to one of the paradoxes of self-service BI. You need strong alignment - centralization even - if you’re to ensure true democratization. The answer to all this isn’t to get tied up in decentralization or centralization. Instead, what’s important is striking a balance between the two. Decentralization needs centralization - there needs to be strong governance and clarity over what data exists, how it’s used, how it’s accessed - someone needs to be accountable for that for decentralized, self-service BI to actually work. Read next: How Qlik Sense is driving self-service Business Intelligence Self-service business intelligence: recommended viewing Power BI Masterclass - Beginners to Advanced [Video] Data storytelling that makes an impact Data storytelling is a phrase that’s used too much without real consideration as to what it means or how it can be done. Indeed, all too often it’s used to refer to stylish graphs and visualizations. And yes, stylish graphs and data visualizations are part of data storytelling, but you can’t just expect some nice graphics to communicate in depth data insights to your colleagues and senior management. To do data storytelling well, you need to establish a clear sense of objectives and goals. By that I’m not referring only to your goals, but also those of the people around you. It goes without saying that data and insight needs context, but what that context should be, exactly, is often the hard part - objectives and aims are perhaps the straightforward way of establishing that context and ensuring your insights are able to establish the scope of a problem and propose a way forward. Data storytelling can only really make an impact if you are able to strike a balance between centralization and self-service. Stakeholders that use self-service need confidence that everything they need is both available and accurate - this can only really be ensured by a centralized team of data scientists, architects, and analysts. Data storytelling: recommend viewing Data Storytelling with Qlik Sense [Video] Data Storytelling with Power BI [Video] The impact of cloud It’s impossible to properly appreciate the extent to which cloud is changing the data landscape. Not only is it easier than ever to store and process data, it’s also easy to do different things with it. This means that it’s now possible to do machine learning, or artificial intelligence projects with relative ease (the word relative being important, of course). For business intelligence, this means there needs to be a clear strategy that joins together every piece of the puzzle, from data collection to analysis. This means there needs to be buy-in and input from stakeholders before a solution is purchased - or built - and then the solution needs to be developed with every individual use case properly understood and supported. Indeed, this requires a combination of business acumen, soft skills, and technical expertise. A large amount of this will rest on the shoulders of an organization’s technical leadership team, but it’s also worth pointing out that those in other departments still have a part to play. If stakeholders are unable to present a clear vision of what their needs and goals are it’s highly likely that the advantages of cloud will pass them by when it comes to business intelligence. Cloud and business intelligence: recommended viewing Going beyond Dashboards with IBM Cognos Analytics [Video] Business intelligence ethics Ethics has become a huge issue for organizations over the last couple of years. With the Cambridge Analytica scandal placing the spotlight on how companies use customer data, and GDPR forcing organizations to take a new approach to (European) user data, it’s undoubtedly the case that ethical considerations have added a new dimension to business intelligence. But what does this actually mean in practice? Ethics manifests itself in numerous ways in business intelligence. Perhaps the most obvious is data collection - do you have the right to use someone’s data in a certain way? Sometimes the law will make it clear. But other times it will require individuals to exercise judgment and be sensitive to the issues that could arise. But there are other ways in which individuals and organizations need to think about ethics. Being data-driven is great, especially if you can approach insight in a way that is actionable and proactive. But at the same time it’s vital that business intelligence isn’t just seen as a replacement for human intelligence. Indeed, this is true not just in an ethical sense, but also in terms of sound strategic thinking. Business intelligence without human insight and judgment is really just the opposite of intelligence. Conclusion: business intelligence needs organizational alignment and buy-in There are many issues that have been slowly emerging in the business intelligence world for the last half a decade. This might make things feel confusing, but in actual fact it underlines the very nature of the challenges organizations, leadership teams, and engineers face when it comes to business intelligence. Essentially, doing business intelligence well requires you - and those around you - to tie all these different elements. It's certainly not straightforward, but with focus and a clarity of thought, it's possible to build a really effective BI program that can fulfil organizational needs well into the future.
Read more
  • 0
  • 0
  • 6308