
How-To Tutorials - Data

1204 Articles

Microsoft open sources Infer.NET, its popular model-based machine learning framework

Melisha Dsouza
08 Oct 2018
3 min read
Last week, Microsoft open sourced Infer.NET, its cross-platform framework for model-based machine learning. This machine learning engine, used in Office, Xbox and Azure, is now available on GitHub under the permissive MIT license for free use in commercial applications.

Features of Infer.NET
The team at Microsoft Research in Cambridge initially envisioned Infer.NET as a research tool and released it for academic use in 2008. The framework has served as the basis for hundreds of papers across a variety of fields, including information retrieval and healthcare. The team then started using the framework as a machine learning engine within a wide range of Microsoft products.

A model-based approach to machine learning
Infer.NET allows users to incorporate domain knowledge into their model. The framework can then be used to build bespoke machine learning algorithms directly from that model. To sum it up, the framework constructs a learning algorithm for users based on the model they have provided.

Facilitates interpretability
Infer.NET also facilitates interpretability. If users have designed the model themselves and the learning algorithm follows that model, they can understand why the system behaves in a particular way or makes certain predictions.

Probabilistic approach
In Infer.NET, models are described using a probabilistic program, which describes real-world processes in a language that machines understand. Infer.NET compiles the probabilistic program into high-performance code that performs deterministic approximate Bayesian inference. This approach allows a notable amount of scalability; for instance, it can be used in a system that automatically extracts knowledge from billions of web pages comprising petabytes of data.

Additional features
The framework also supports learning as new data arrives. The team is working to develop and grow Infer.NET further: it will become part of ML.NET (the machine learning framework for .NET developers). They have already set up the repository under the .NET Foundation and moved the package and namespaces to Microsoft.ML.Probabilistic. Being cross-platform, Infer.NET supports .NET Framework 4.6.1, .NET Core 2.0, and Mono 5.0. Windows users can use Visual Studio 2017, while macOS and Linux users have command-line options that can be integrated into the code editor of their choice.

Download the framework to learn more about Infer.NET. You can also check the documentation for a detailed User Guide. To know more about this news, head over to Microsoft's official blog.

Microsoft announces new Surface devices to enhance user productivity, with style and elegance
Neural Network Intelligence: Microsoft's open source automated machine learning toolkit
Microsoft's new neural text-to-speech service lets machines speak like people
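Infer.NET itself exposes a C#/.NET API, so the snippet below is not Infer.NET code. It is only a minimal, language-neutral Python sketch of the model-based idea the article describes: you write down a generative model with your domain knowledge, and Bayesian inference over that model does the learning. The data and numbers here are made up for illustration.

```python
import numpy as np

# Observed data: outcomes of 20 coin flips (1 = heads), made up for this example.
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1,
                  1, 0, 1, 1, 1, 0, 1, 1, 0, 1])

# Model: each flip is Bernoulli(theta); the prior over theta is uniform on (0, 1).
theta_grid = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(theta_grid)

# Likelihood of the observed data under each candidate value of theta.
heads, tails = flips.sum(), (1 - flips).sum()
likelihood = theta_grid**heads * (1 - theta_grid)**tails

# Bayes' rule (up to normalization) turns prior + likelihood into a posterior.
posterior = prior * likelihood
posterior /= posterior.sum()

print("Posterior mean of theta:", float((theta_grid * posterior).sum()))
```

A framework like Infer.NET automates exactly this step at scale: given the model description, it generates efficient approximate inference code instead of the brute-force grid used above.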


7 tips for using Git and GitHub the right way

Sugandha Lahoti
06 Oct 2018
3 min read
GitHub has become a widely accepted and integral part of software development owing to the change-tracking features it offers. It is built on Git, the version control system created in 2005 by Linus Torvalds to support the development of the Linux kernel. In this post, Alex Magana and Joseph Muli, the authors of the Introduction to Git and GitHub course, discuss some of the best practices you should keep in mind while learning or using Git and GitHub.

Document everything
A good practice that eases work in any team is ample documentation. Documenting something as simple as a repository goes a long way in presenting your work and attracting contributors. It is largely a first-impression matter when someone is looking for a tool to aid their development.

Utilize the README.md and wikis
You should also utilize the README.md and wikis to explain the functionality the application delivers and to categorize guide topics and material. In particular, you should:
- Communicate the solution that the code in the repository provides.
- Specify the guidelines and rules of engagement that govern contributing to the codebase.
- Indicate the dependencies required to set up the working environment.
- Stipulate set-up instructions to get a working version of the application in a contributor's local environment.

Keep simple and concise naming conventions
Naming conventions are also highly encouraged for repositories and branches. They should be simple, descriptive and concise. For instance, a repository that houses code for a Git course could simply be named "git-tutorial-material". For a learner on that course, it is easier to find the material than it would be with a repository named just "material".

Adopt naming prefixes
You should also adopt naming prefixes for different task types when naming branches. For example, you may use feat- for feature branches, bug- for bug branches and fix- for fix branches (see the small sketch at the end of this post). Also, make use of templates that include a checklist for Pull Requests.

Correspond a PR and branch to a ticket or task
A PR and branch should correspond to a ticket or task on the project management board. This helps align the effort spent on a product with the appropriate milestones and the product vision.

Organize and track tasks using issues
Tasks such as technical debt, bugs, and workflow improvements should be organized and tracked using Issues. You should also enforce push and pull restrictions on the default branch and use webhooks to automate deployment and run pre-merge test suites.

Use atomic commits
Another best practice is to use atomic commits. An atomic commit is an operation that applies a set of distinct changes as a single operation. Persist changes in small changesets and use descriptive, concise commit messages to record the changes you make.

This is a guest post from Alex Magana and Joseph Muli, the authors of Introduction to Git and GitHub. We hope that these best practices help you manage your Git and GitHub more smoothly. Don't forget to check out Alex and Joseph's Introduction to Git and GitHub course to learn how to create and enforce checks and controls for the introduction, scrutiny, approval, merging, and reversal of changes.

GitHub introduces 'Experiments', a platform to share live demos of their research projects
Packt's GitHub portal hits 2,000 repositories
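To make the branch-prefix convention above concrete, here is a minimal, illustrative Python sketch (not from the course) that checks branch names against the feat-/bug-/fix- prefixes; the branch names used below are made up.

```python
import re

# Accept names like "feat-user-login": a known prefix followed by kebab-case words.
BRANCH_PATTERN = re.compile(r"^(feat|bug|fix)-[a-z0-9]+(-[a-z0-9]+)*$")

def check_branch_name(name: str) -> bool:
    """Return True if the branch follows the prefix + kebab-case convention."""
    return bool(BRANCH_PATTERN.match(name))

for branch in ["feat-user-login", "bug-cart-total", "material", "Fix_Login"]:
    status = "ok" if check_branch_name(branch) else "rename suggested"
    print(f"{branch}: {status}")
```

A check like this could run in a pre-push hook or CI job so that the convention is enforced automatically rather than by review comments.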


PyTorch 1.0 preview release is production ready with torch.jit, c10d distributed library, C++ API

Aarthi Kumaraswamy
02 Oct 2018
4 min read
Back in May, the PyTorch team shared their roadmap for the PyTorch 1.0 release and highlighted that this much-anticipated version would not only continue to provide stability and simplicity of use to its users, but would also make the framework production ready while offering a hassle-free migration experience. Today, Facebook announced the release of PyTorch 1.0 RC1.

The official announcement states, "PyTorch 1.0 accelerates the workflow involved in taking breakthrough research in artificial intelligence to production deployment. With deeper cloud service support from Amazon, Google, and Microsoft, and tighter integration with technology providers ARM, Intel, IBM, NVIDIA, and Qualcomm, developers can more easily take advantage of PyTorch's ecosystem of compatible software, hardware, and developer tools. The more software and hardware that is compatible with PyTorch 1.0, the easier it will be for AI developers to quickly build, train, and deploy state-of-the-art deep learning models."

PyTorch is an open-source Python-based deep learning framework that provides powerful GPU acceleration. PyTorch is known for advanced indexing and functions, imperative style, integration support and API simplicity. This is one of the key reasons why developers prefer PyTorch for research and hackability. On the downside, it has struggled with adoption in production environments. The PyTorch team acknowledged this in their roadmap and have worked on improving this aspect significantly in PyTorch 1.0, not just by improving the library but also by enriching its ecosystem through partnerships with key software and hardware vendors.

"One of its biggest downsides has been production-support. What we mean by production-support is the countless things one has to do to models to run them efficiently at massive scale:
- exporting to C++-only runtimes for use in larger projects
- optimizing mobile systems on iPhone, Android, Qualcomm and other systems
- using more efficient data layouts and performing kernel fusion to do faster inference (saving 10% of speed or memory at scale is a big win)
- quantized inference (such as 8-bit inference)", stated the PyTorch team in their roadmap post.

Below are some key highlights of this major milestone for PyTorch.

JIT
The JIT is a set of compiler tools for bridging the gap between research in PyTorch and production. It includes a language called Torch Script (a subset of Python), and two ways (tracing mode and script mode) in which existing code can be made compatible with the JIT (see the short tracing sketch at the end of this post). Torch Script code can be aggressively optimized, and it can be serialized for later use in the new C++ API, which doesn't depend on Python at all.

torch.distributed and the new "C10D" library
The torch.distributed package and the torch.nn.parallel.DistributedDataParallel module are now backed by the new "C10D" library. The main highlights of the new library are:
- C10D is performance driven and operates entirely asynchronously for all backends: Gloo, NCCL, and MPI.
- Significant Distributed Data Parallel performance improvements, especially for slower networks such as Ethernet-based hosts.
- Async support for all distributed collective operations in the torch.distributed package.
- Send and recv support in the Gloo backend.

C++ Frontend [API Unstable]
The C++ frontend is a pure C++ interface to the PyTorch backend that follows the API and architecture of the established Python frontend. It is intended to enable research in high-performance, low-latency and bare-metal C++ applications. It provides equivalents to torch.nn, torch.optim, torch.data and other components of the Python frontend. The C++ frontend is marked as "API Unstable" as part of PyTorch 1.0. This means it is ready to be used for building research applications, but still has some open construction sites that will stabilize over the next month or two. In other words, it is not ready for use in production yet.

N-dimensional empty tensors, a collection of new operators inspired by numpy and scipy, and new distributions such as the Weibull, negative binomial and multivariate log gamma distributions have also been introduced. There have also been a number of breaking changes, bug fixes, and other improvements made to PyTorch 1.0. For more details, read the official announcement and the official release notes for PyTorch.

What is PyTorch and how does it work?
Build your first neural network with PyTorch [Tutorial]
Is Facebook-backed PyTorch better than Google's TensorFlow?
Can a production-ready PyTorch 1.0 give TensorFlow a tough time?
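As a rough illustration of the JIT's tracing mode described above, here is a minimal sketch, assuming PyTorch 1.0 or later with torch.jit available; the tiny module and tensor shapes are made up for illustration and are not from the release notes.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """A small feed-forward module used only for demonstration."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = TinyNet()
example_input = torch.rand(1, 4)

# Tracing mode: run the model once on an example input and record the
# executed operations as a Torch Script graph.
traced = torch.jit.trace(model, example_input)

# The traced module can be serialized and later loaded from the C++ API,
# with no Python dependency at inference time.
traced.save("tiny_net.pt")
print(traced.graph)
```

Script mode (the @torch.jit.script path) is the alternative when the model contains data-dependent control flow that a single trace cannot capture.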


The ethical dilemmas developers working on Artificial Intelligence products must consider

Amey Varangaonkar
29 Sep 2018
10 min read
Facebook has recently come under the scanner for sharing the data of millions of users without their consent. Their use of Artificial Intelligence to predict their customers’ behavior and then to sell this information to advertisers has come under heavy criticism and has raised concerns over the privacy of users’ data. A lot of it inadvertently has to do with the ‘smart use’ of data by companies like Facebook. As Artificial Intelligence continues to revolutionize the industry, and as the applications of AI continue to rapidly grow across a spectrum of real-world domains, the need for a regulated, responsible use of AI has also become more important than ever. Several ethical questions are being asked of the way the technology is being used and how it is impacting our lives, Facebook being just one of the many examples right now. In this article, we look at some of these ethical concerns surrounding the use of AI. Infringement of users’ data privacy Probably the biggest ethical concern in the use of Artificial Intelligence and smart algorithms is the way companies are using them to gain customer insights, without getting the consent of the said customers in the first place. Tracking customers’ online activity, or using the customer information available on various social media and e-commerce websites in order to tailor marketing campaigns or advertisements that are targeted towards the customer is a clear breach of their privacy, and sometimes even amounts to ‘targeted harassment’. In the case of Facebook, for example, there have been many high profile instances of misuse and abuse of user data, such as: The recent Cambridge Analytica scandal where Facebook’s user data was misused Boston-based data analytics firm Crimson Hexagon misusing Facebook user data Facebook’s involvement in the 2016 election meddling Accusations of Facebook along with Twitter and Google having a bias against conservative views Accusation of discrimination with targeted job ads on the basis of gender and age How far will these tech giants such as Facebook go to fix what they have broken - the trust of many of its users? The European Union General Data Protection Regulation (GDPR) is a positive step to curb this malpractice. However, such a regulation needs to be implemented worldwide, which has not been the case yet. There needs to be a universal agreement on the use of public data in the modern connected world. Individual businesses and developers must be accountable and hold themselves ethically responsible when strategizing or designing these AI products, keeping the users’ privacy in mind. Risk of automation in the workplace The most fundamental ethical issue that comes up when we talk about automation, or the introduction of Artificial Intelligence in the workplace, is how it affects the role of human workers. ‘Does the AI replace them completely?’ is a common question asked by many. Also, if human effort is not going to be replaced by AI and automation, in what way will the worker’s role in the organization be affected? The World Economic Forum (WEF) recently released a Future of Jobs report in which they highlight the impact of technological advancements on the current workforce. The report states that machines will be able to do half of the current job tasks within the next 5 years. 
A few important takeaways from this report with regard to automation and its impact on the skilled human workers are: Existing jobs will be augmented through technology to create new tasks and resulting job roles altogether - from piloting drones to remotely monitoring patients. The inclusion of AI and smart algorithms is going to reduce the number of workers required for certain work tasks The layoffs in certain job roles will also involve difficult transitions for many workers and investment for reskilling and training, commonly referred to as collaborative automation. As we enter the age of machine augmented human productivity, employees will be trained to work along with the AI tools and systems, empowering them to work quickly and more efficiently. This will come with an additional cost of training which the organization will have to bear Artificial stupidity - how do we eliminate machine-made mistakes? It goes without saying that learning happens over time, and it is no different for AI. The AI systems are fed lots and lots of training data and real-world scenarios. Once a system is fully trained, it is then made to predict outcomes on real-world test data and the accuracy of the model is then determined and improved. It is only normal, however, that the training model cannot be fed with every possible scenario there is, and there might be cases where the AI is unprepared for or can be fooled by an unusual scenario or test-case. Some images where the deep neural network is unable to identify their pattern is an example of this. Another example would be the presence of random dots in an image that would lead the AI to think there is a pattern in an image, where there really isn’t any. Deceptive perceptions like this may lead to unwanted errors, which isn’t really the AI’s fault, it’s just the way they are trained. These errors, however, can prove costly to a business and can lead to potential losses. What is the way to eliminate these possibilities? How do we identify and weed out such training errors or inadequacies that go a long way in determining whether an AI system can work with near 100% accuracy? These are the questions that need answering. It also leads us to the next problem that is - who takes accountability for the AI’s failure? If the AI fails or misbehaves, who takes the blame? When an AI system designed to do a particular task fails to correctly perform the required task for some reason, who is responsible? This aspect needs careful consideration and planning before any AI system can be adopted, especially on an enterprise-scale. When a business adopts an AI system, it does so assuming the system is fail-safe. However, if for some reason the AI system isn’t designed or trained effectively because either: It was not trained properly using relevant datasets The AI system was not used in a relevant context and as a result, gave inaccurate predictions Any failure like this could lead to potentially millions in losses and could adversely affect the business, not to mention have adverse unintended effects on society. Who is accountable in such cases? Is it the AI developer who designed the algorithm or the model? Or is it the end-user or the data scientist who is using the tool as a customer? Clear expectations and accountabilities need to be defined at the very outset and counter-measures need to be set in place to avoid such failovers, so that the losses are minimal and the business is not impacted severely. 
Bias in Artificial Intelligence - A key problem that needs addressing One of the key questions in adopting Artificial Intelligence systems is whether they can be trusted to be impartial, fair or neutral. In her NIPS 2017 keynote, Kate Crawford - who is a Principal Researcher at Microsoft as well as the Co-Founder & Director of Research at the AI Now institute - argues that bias in AI cannot just be treated as a technical problem; the underlying social implications need to be considered as well. For example, a machine learning software to detect potential criminals, that tends to be biased against a particular race, raises a lot of questions on its ethical credibility. Or when a camera refuses to detect a particular kind of face because it does not fit into the standard template of a human face in its training dataset, it naturally raises the racism debate. Although the AI algorithms are designed by humans themselves, it is important that the learning data used to train these algorithms is as diverse as possible, and factors in possible kinds of variations to avoid these kinds of biases. AI is meant to give out fair, impartial predictions without any preset predispositions or bias, and this is one of the key challenges that is not yet overcome by the researchers and AI developers. The problem of Artificial Intelligence in cybersecurity As AI revolutionizes the security landscape, it is also raising the bar for the attackers. With passing time it is getting more difficult to breach security systems. To tackle this, attackers are resorting to adopting state-of-the-art machine learning and other AI techniques to breach systems, while security professionals adopt their own AI mechanisms to prevent and protect the systems from these attacks. A cybersecurity firm Darktrace reported an attack in 2017 that used machine learning to observe and learn user behavior within a network. This is one of the classic cases of facing disastrous consequences where technology falls into the wrong hands and necessary steps cannot be taken to tackle or prevent the unethical use of AI - in this case, a cyber attack. The threats posed by a vulnerable AI system with no security measures in place - it can be easily hacked into and misused, doesn’t need any new introduction. This is not a desirable situation for any organization to be in, especially when it has invested thousands or even millions of dollars into the technology. When the AI is developed, strict measures should be taken to ensure it is accessible to only a specific set of people and can be altered or changed by only its developers or by authorized personnel. Just because you can build an AI, should you? The more potent the AI becomes, the more potentially devastating its applications can be. Whether it is replacing human soldiers with AI drones, or developing autonomous weapons - the unmitigated use of AI for warfare can have consequences far beyond imagination. Earlier this year, we saw hundreds of Google employees quit the company over its ties with the Pentagon, protesting against the use of AI for military purposes. The employees were strong of the opinion that the technology they developed has no place on a battlefield, and should ideally be used for the benefit of mankind, to make human lives better. Google isn’t an isolated case of a tech giant lost in these murky waters. 
Microsoft employees too protested Microsoft’s collaboration with US Immigration and Customs Enforcement (ICE) over building face recognition systems for them, especially after the revelations that ICE was found to confine illegal immigrant children in cages and inhumanely separated asylum-seeking families at the US Mexican border. Amazon is also one of the key tech vendors of facial recognition software to ICE, but its employees did not openly pressure the company to drop the project. While these companies have assured their employees of no direct involvement, it is quite clear that all the major tech giants are supplying key AI technology to the government for defensive (or offensive, who knows) military measures. The secure and ethical use of Artificial Intelligence for non-destructive purposes currently remains one of the biggest challenges in its adoption today. Today, there are many risks and caveats associated with implementing an AI system. Given the tools and techniques we have at our disposal currently, it is far-fetched to think of implementing a flawless Artificial Intelligence within a given infrastructure. While we consider all the risks involved, it is also important to reiterate one important fact. When we look at the bigger picture, all technological advancements effectively translate to better lives for everyone. While AI has tremendous potential, whether its implementation is responsible is completely down to us, humans. Read more Sex robots, artificial intelligence, and ethics: How desire shapes and is shaped by algorithms New cybersecurity threats posed by artificial intelligence Google’s prototype Chinese search engine ‘Dragonfly’ reportedly links searches to phone numbers


BrainNet, an interface to communicate between human brains, could soon make Telepathy real

Sunith Shetty
28 Sep 2018
3 min read
BrainNet provides the first multi-person brain-to-brain interface, allowing direct, noninvasive collaboration between human brains. It can help small teams collaborate to solve a range of tasks using direct brain-to-brain communication.

How does BrainNet operate?
The noninvasive interface combines electroencephalography (EEG) to record brain signals and transcranial magnetic stimulation (TMS) to deliver the required information to the brain. For now, the interface allows three human subjects to collaborate on and solve a task using direct brain-to-brain communication. Two of the three subjects are "Senders". The senders' brain signals are decoded using real-time EEG data analysis, which extracts the decisions that need to be communicated in order to solve the challenge. Take the example of a Tetris-like game, where you need quick decisions about whether to rotate a block or drop it as it is in order to fill a line. The senders' signals (decisions) are transmitted via the Internet to the brain of the third subject, the "Receiver", through magnetic stimulation of the occipital cortex. The receiver can't see the game screen to decide whether the block needs rotating. The receiver integrates the decisions received and makes an informed call, using an EEG interface, about turning the block or keeping it in the same position. A second round of the game allows the senders to validate the previous move and provide the necessary feedback on the receiver's action.

How did the results look?
The researchers evaluated the performance of BrainNet on the Tetris task considering the following factors:
- Group-level performance during the game
- True/false positive rates of subjects' decisions
- Mutual information between subjects
The task was performed by five groups of three human subjects using the BrainNet interface. The average accuracy for the task was 0.813. The researchers also varied the information reliability by injecting artificially generated noise into one of the senders' signals; even so, the receiver was able to work out which sender was more reliable based on the information transmitted to their brain. These positive results have opened the gates to future brain-to-brain interfaces that could enable cooperative problem solving by humans using a "social network" of connected brains. To know more, you can refer to the research paper.

Read more
Diffractive Deep Neural Network (D2NN): UCLA-developed AI device can identify objects at the speed of light
Baidu announces ClariNet, a neural network for text-to-speech synthesis
Optical training of Neural networks is making AI more efficient


Did you know Facebook shares the data you share with them for ‘security’ reasons with advertisers?

Natasha Mathur
28 Sep 2018
5 min read
Facebook is constantly under the spotlight these days when it comes to controversies regarding user’s data and privacy. A new research paper published by the Princeton University researchers states that Facebook shares the contact information you handed over for security purposes, with their advertisers. This study was first brought to light by a Gizmodo writer, Kashmir Hill. “Facebook is not content to use the contact information you willingly put into your Facebook profile for advertising. It is also using contact information you handed over for security purposes and contact information you didn’t hand over at all, but that was collected from other people’s contact books, a hidden layer of details Facebook has about you that I’ve come to call “shadow contact information”, writes Hill. Recently, Facebook introduced a new feature called custom audiences. Unlike traditional audiences, the advertiser is allowed to target specific users. To do so, the advertiser uploads user’s PII (personally identifiable information) to Facebook. After the uploading is done, Facebook then matches the given PII against platform users. Facebook then develops an audience that comprises the matched users and allows the advertiser to further track the specific audience. Essentially with Facebook, the holy grail of marketing, which is targeting an audience of one, is practically possible; nevermind whether that audience wanted it or not. In today’s world, different social media platforms frequently collect various kinds of personally identifying information (PII), including phone numbers, email addresses, names and dates of birth. Majority of this PII often represent extremely accurate, unique, and verified user data. Because of this, these services have the incentive to exploit and use this personal information for other purposes. One such scenario includes providing advertisers with more accurate audience targeting. The paper titled ‘Investigating sources of PII used in Facebook’s targeted advertising’ is written by Giridhari Venkatadri, Elena Lucherini, Piotr Sapiezynski, and Alan Mislove. “In this paper, we focus on Facebook and investigate the sources of PII used for its PII-based targeted advertising feature. We develop a novel technique that uses Facebook’s advertiser interface to check whether a given piece of PII can be used to target some Facebook user and use this technique to study how Facebook’s advertising service obtains users’ PII,” reads the paper. The researchers developed a novel methodology, which involved studying how Facebook obtains the PII to provide custom audiences to advertisers. “We test whether PII that Facebook obtains through a variety of methods (e.g., directly from the user, from two-factor authentication services, etc.) is used for targeted advertising, whether any such use is clearly disclosed to users, and whether controls are provided to users to help them limit such use,” reads the paper. The paper uses size estimates to study what sources of PII are used for PII-based targeted advertising. Researchers used this methodology to investigate which range of sources of PII was actually used by Facebook for its PII-based targeted advertising platform. They also examined what information gets disclosed to users and what control users have over PII. What sources of PII are actually being used by Facebook? Researchers found out that Facebook allows its users to add contact information (email addresses and phone numbers) on their profiles. 
While any arbitrary email address or phone number can be added, it is not displayed to other users unless verified (through a confirmation email or confirmation SMS message, respectively). This is the most direct and explicit way of providing PII to advertisers. Researchers then further moved on to examine whether PII provided by users for security purposes such as two-factor authentication (2FA) or login alerts are being used for targeted advertising. They added and verified a phone number for 2FA to one of the authors’ accounts. The added phone number became targetable after 22 days. This proved that a phone number provided for 2FA was indeed used for PII-based advertising, despite having set the privacy controls to the choice. What control do users have over PII? Facebook allows users the liberty of choosing who can see each PII listed on their profiles, the current list of possible general settings being: Public, Friends, Only Me.   Users can also restrict the set of users who can search for them using their email address or their phone number. Users are provided with the following options: Everyone, Friends of Friends, and Friends. Facebook provides users a list of advertisers who have included them in a custom audience using their contact information. Users can opt out of receiving ads from individual advertisers listed here. But, information about what PII is used by advertisers is not disclosed. What information about how Facebook uses PII gets disclosed to the users? On adding mobile phone numbers directly to one’s Facebook profile, no information about the uses of that number is directly disclosed to them. This Information is only disclosed to users when adding a number from the Facebook website. As per the research results, there’s very little disclosure to users, often in the form of generic statements that do not refer to the uses of the particular PII being collected or that it may be used to allow advertisers to target users. “Our paper highlights the need to further study the sources of PII used for advertising, and shows that more disclosure and transparency needs to be provided to the user,” says the researchers in the paper. For more information, check out the official research paper. Ex-employee on contract sues Facebook for not protecting content moderators from mental trauma How far will Facebook go to fix what it broke: Democracy, Trust, Reality Mark Zuckerberg publishes Facebook manifesto for safeguarding against political interference

9 recommended blockchain online courses

Guest Contributor
27 Sep 2018
7 min read
Blockchain is reshaping the world as we know it. And we are not talking metaphorically because the new technology is really influencing everything from online security and data management to governance and smart contracting. Statistical reports support these claims. According to the study, the blockchain universe grows by over 40% annually, while almost 70% of banks are already experimenting with this technology. IT experts at the Editing AussieWritings.com Services claim that the potential in this field is almost limitless: “Blockchain offers a myriad of practical possibilities, so you definitely want to get acquainted with it more thoroughly.” Developers who are curious about blockchain can turn it into a lucrative career opportunity since it gives them the chance to master the art of cryptography, hierarchical distribution, growth metrics, transparent management, and many more. There were 5,743 mostly full-time job openings calling for blockchain skills in the last 12 months - representing the 320% increase - while the biggest freelancing website Upwork reported more than 6,000% year-over-year growth. In this post, we will recommend our 9 best blockchain online courses. Let’s take a look! Udemy Udemy offers users one of the most comprehensive blockchain learning sources. The target audience is people who have heard a little bit about the latest developments in this field, but want to understand more. This online course can help you to fully understand how the blockchain works, as well as get to grips with all that surrounds it. Udemy breaks down the course into several less complicated units, allowing you to figure out this complex system rather easily. It costs $19.99, but you can probably get it with a 40% discount. The one downside, however, is that content quality in terms of subject scope can vary depending on the instructor, but user reviews are a good way to gauge quality. Each tutorial lasts approximately 30 minutes, but it also depends on your own tempo and style of work. Pluralsight Pluralsight is an excellent beginner-level blockchain course. It comes in three versions: Blockchain Fundamentals, Surveying Blockchain Technologies for Enterprise, and Introduction to Bitcoin and Decentralized Technology. Course duration varies from 80 to 200 minutes depending on the package. The price of Pluralsight is $29 a month or $299 a year. Choosing one of these options, you are granted access to the entire library of documents, including course discussions, learning paths, channels, skill assessments, and other similar tools. Packt Publishing Packt Publishing has a wide portfolio of learning products on Blockchain for varying levels of experience in the field from beginners to experts. And what’s even more interesting is that you can choose your learning format from books, ebooks to videos, courses and live courses. Or you could simply subscribe to MAPT, their library to gain access to all products at a reasonable price of $29 monthly and $150 annually.  It offers several books and videos on the leading blockchain technology. You can purchase 5 blockchain titles at a discounted rate of $50. Here’s the list of top blockchain courses offered by Packt Publishing: Exploring Blockchain and Crypto-currencies: You will gain the foundational understanding of blockchain and crypto-currencies through various use-cases. Building Blockchain Projects: In this, you will be able to develop real-time practical DApps with Ethereum and JavaScript. 
Mastering Blockchain - Second Edition: You can learn about cryptography and cryptocurrencies, so you can build highly secure, decentralized applications and conduct trusted in-app transactions. Hands-On Blockchain with Hyperledger: This book will help you leverage the power of Hyperledger Fabric to develop Blockchain-based distributed ledgers with ease. Learning Blockchain Application Development [video ]: This interactive video will help you learn build smart contracts and DApps on Ethereum. Create Ethereum and Blockchain Applications using Solidity [video ]: This video will help you learn about Ethereum, Solidity, DAO, ICO, Bitcoin, Altcoin, Website Security, Ripple, Litecoin, Smart Contracts, and Apps. Cryptozombies Cryptozombies is an online blockchain course based on gamification elements. The tool teaches you to write smart contracts in Solidity through building your own crypto-collectibles game. It is entirely Ethereum-focused, but you don’t need any previous experience to understand how Solidity works. There is a step by step guide that explains to you even the smallest details, so you can quickly learn to create your own fully-functional blockchain-based game. The best thing about Cryptozombies is that you can test it for free and give up in case you don’t like it. Coursera The blockchain is the epicenter of the cryptocurrency world, so it’s necessary to study it if you want to deal with Bitcoin and other digital currencies. Coursera is the leading online resource in the field of virtual currencies, so you might want to check it out. After this course like Blockchain Specialization, you’ll know everything you need to be able to separate fact from fiction when reading claims about Bitcoin and other cryptocurrencies. You’ll have the conceptual foundations you need to engineer to secure software that interacts with the Bitcoin network. And you’ll be able to integrate ideas from Bitcoin in your own projects. The course is a 4-part course spanning a duration 4 weeks, but you can take each part separately. The price depends on the level and features you choose. LinkedIn Learning (formerly known as Lynda) LinkedIn Learning (what used to be Lynda) doesn't offer a specific blockchain course, but it does have a wide range of industry-related learning sources. A search for ‘blockchain’ will present you with almost 100 relevant video courses. You can find all sorts of lessons here, from beginner to expert levels. Lynda allows you to customize selection according to video duration, authors, software, subjects, etc. You can access the library for $15 a month. B9Lab B9Lab ETH-25 Certified Online Ethereum Developer Course is another course that promotes blockchain technology aimed at the Ethereum platform. It’s a 12-week in-depth learning solution that targets experienced programmers. B9Lab introduces everything there is to know about blockchain and how to build useful applications. Participants are taught about the Ethereum platform, the programming language Solidity, how to use web3 and the Truffle framework, and how to tie everything together. The price is €1450 or about $1700. IBM IBM made a self-paced blockchain course, titled Blockchain Essentials that lasts over two hours. The video lectures and lab in this course help you learn about blockchain for business and explore key use cases that demonstrate how the technology adds value. You can learn how to leverage blockchain benefits, transform your business with the new technology, and transfer assets. 
Besides that, you get a nice wrap-up and a quiz to test your knowledge upon completion. IBM's course is free of charge.

Khan Academy
Khan Academy is the last, but certainly not the least important online course on our list. It gives users a comprehensive overview of blockchain-powered systems, particularly Bitcoin. Using this platform, you can learn more on cryptocurrency transactions, security, proof of work, etc. As an online education platform, Khan Academy won't cost you a dime.

Blockchain is the groundbreaking technology that opens new boundaries in almost every field of business. It directly influences financial markets, data management, digital security, and a variety of other industries. In this post, we presented 9 best blockchain online courses you should try. These sources can teach you everything there is to know about the blockchain basics. Take some time to check them out and you won't regret it!

Author Bio: Olivia is a passionate blogger who writes on topics of digital marketing, career, and self-development. She constantly tries to learn something new and to share this experience on various websites. Connect with her on Facebook and Twitter.

Google introduces Machine Learning courses for AI beginners
Microsoft start AI School to teach Machine Learning and Artificial Intelligence.


The White House is reportedly launching an antitrust investigation against social media companies

Sugandha Lahoti
26 Sep 2018
3 min read
According to information obtained by Bloomberg, the White House is reportedly drafting an executive order targeting online platform bias at social media firms. Per this draft, federal antitrust and law enforcement agencies are instructed to investigate the practices of Google, Facebook, and other social media companies. The existence of the draft was first reported by Capital Forum.

Federal law enforcers are required to investigate primarily two violations: first, whether an online platform has acted in violation of the antitrust laws; second, how to curb anti-competitive conduct among online platforms and address online platform bias. Per Capital Forum's sources, the draft is written in two parts. The first part is a policy statement stating that online platforms are central to the flow of information and commerce and need to be held accountable through competition. The second part instructs agencies to investigate bias and anticompetitive conduct in online platforms where they have the authority. Where they lack authorization, they are required to report concerns or issues to the Federal Trade Commission or the Department of Justice. No online platforms are mentioned by name in the draft. It's unclear when, or if, the White House will decide to issue the order.

Donald Trump and the White House have always been vocal about the bias they see in social media platforms. In August, Trump tweeted about social media discriminating against Republican and conservative voices.

Source: Twitter

He also went on to claim that Google search results for "Trump News" report fake news, accusing the search engine's algorithms of being rigged. That allegation was not backed by evidence, and Google slammed Trump's accusations, asserting that its search engine algorithms do not favor any political ideology. Earlier this month, Facebook COO Sheryl Sandberg and Twitter CEO Jack Dorsey faced the Senate Select Intelligence Committee to discuss foreign interference through social media platforms in US elections. Google, Facebook, and Twitter also released a testimony ahead of appearing before the committee.

As reported by the Wall Street Journal, Google CEO Sundar Pichai also plans to meet privately with top Republican lawmakers this Friday to discuss a variety of topics, including the company's alleged political bias in search results. The meeting is organized by the House Majority Leader, Kevin McCarthy. Pichai said on Tuesday, "I look forward to meeting with members on both sides of the aisle, answering a wide range of questions, and explaining our approach."

Google is also facing public scrutiny over a report that it intends to launch a censored search engine in China. Google's custom search engine would link Chinese users' search queries to their personal phone numbers, making it easier for the government to track their searches. About a thousand Google employees, frustrated with a series of controversies involving Google, have signed a letter to demand transparency on building the alleged search engine.

Google's new Privacy Chief officer proposes a new framework for Security Regulation.
Amazon is the next target on EU's antitrust hitlist.
Mark Zuckerberg publishes Facebook manifesto for safeguarding against political interference.


What is a convolutional neural network (CNN)? [Video]

Richard Gall
25 Sep 2018
5 min read
What is a convolutional neural network, exactly? Well, let's start with the basics: a convolutional neural network (CNN) is a type of neural network that is most often applied to image processing problems. You've probably seen them in action anywhere a computer is identifying objects in an image. But you can also use convolutional neural networks in natural language processing projects, too. The fact that they are useful for these fast-growing areas is one of the main reasons they're so important in deep learning and artificial intelligence today.

What makes a convolutional neural network unique?
Once you understand how a convolutional neural network works and what makes it unique from other neural networks, you can see why they're so effective for processing and classifying images. But let's first take a regular neural network. A regular neural network has an input layer, hidden layers and an output layer. The input layer accepts inputs in different forms, while the hidden layers perform calculations on these inputs. The output layer then delivers the outcome of the calculations and extractions. Each of these layers contains neurons that are connected to neurons in the previous layer, and each neuron has its own weight. This means you aren't making any assumptions about the data being fed into the network - great usually, but not if you're working with images or language.

Convolutional neural networks work differently, as they treat data as spatial. Instead of neurons being connected to every neuron in the previous layer, they are instead only connected to neurons close to them, and they all share the same weight. This simplification in the connections means the network upholds the spatial aspect of the data set. It means your network doesn't think an eye is all over the image. The word 'convolutional' refers to the filtering process that happens in this type of network. Think of it this way: an image is complex, and a convolutional neural network simplifies it so it can be better processed and 'understood'.

What's inside a convolutional neural network?
Like a normal neural network, a convolutional neural network is made up of multiple layers. There are a couple of layers that make it unique - the convolutional layer and the pooling layer. However, like other neural networks, it will also have a ReLU or rectified linear unit layer, and a fully connected layer. The ReLU layer acts as an activation function, ensuring non-linearity as the data moves through each layer in the network - without it, the data being fed into each layer would lose the dimensionality that we want to maintain. The fully connected layer, meanwhile, allows you to perform classification on your dataset.

The convolutional layer
The convolutional layer is the most important, so let's start there. It works by placing a filter over an array of image pixels - this then creates what's called a convolved feature map. It's a bit like looking at an image through a window, which allows you to identify specific features you might not otherwise be able to see.

The pooling layer
Next we have the pooling layer - this downsamples, or reduces the sample size of, a particular feature map. This also makes processing much faster, as it reduces the number of parameters the network needs to process. The output of this is a pooled feature map. There are two ways of doing this: max pooling, which takes the maximum input of a particular convolved feature, or average pooling, which simply takes the average.

These steps amount to feature extraction, whereby the network builds up a picture of the image data according to its own mathematical rules. If you want to perform classification, you'll need to move into the fully connected layer. To do this, you'll need to flatten things out - remember, a neural network with a more complex set of connections can only process linear data.

How to train a convolutional neural network
There are a number of ways you can train a convolutional neural network. If you're working with unlabelled data, you can use unsupervised learning methods. One of the most popular ways of doing this is using auto-encoders - these allow you to squeeze data into a space with low dimensions, performing calculations in the first part of the convolutional neural network. Once this is done, you'll then need to reconstruct with additional layers that upsample the data you have. Another option is to use generative adversarial networks, or GANs. With a GAN, you train two networks. The first gives you artificial data samples that should resemble data in the training set, while the second is a 'discriminative network' - it should distinguish between the artificial and the 'true' model.

What's the difference between a convolutional neural network and a recurrent neural network?
Although there's a lot of confusion about the difference between a convolutional neural network and a recurrent neural network, it's actually simpler than many people realise. Whereas a convolutional neural network is a feedforward network that filters spatial data, a recurrent neural network, as the name implies, feeds data back into itself. From this perspective, recurrent neural networks are better suited to sequential data. Think of it like this: a convolutional network is able to perceive patterns across space, while a recurrent neural network can see them over time.

How to get started with convolutional neural networks
If you want to get started with convolutional neural networks, Python and TensorFlow are great tools to begin with (a short sketch follows below). It's worth exploring the MNIST dataset too. This is a database of handwritten digits that you can use to get started with building your first convolutional neural network. To learn more about convolutional neural networks, artificial intelligence, and deep learning, visit Packt's store for eBooks and videos.
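The following is a minimal sketch, not part of the original video, of how the convolutional, pooling, flatten, and fully connected layers described above fit together. It assumes TensorFlow 2.x with the bundled Keras API and MNIST-sized 28x28 grayscale inputs; the layer sizes are arbitrary choices for illustration.

```python
import tensorflow as tf

# A minimal CNN for 28x28 grayscale images (e.g. MNIST digits).
model = tf.keras.Sequential([
    # Convolutional layer: slides 32 3x3 filters over the image,
    # producing a stack of convolved feature maps.
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    # Pooling layer: downsamples each feature map (max pooling here).
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    # Flatten the pooled feature maps before the fully connected layers.
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),
    # Fully connected output layer performs the 10-class classification.
    tf.keras.layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Training would then be a single call to model.fit with the MNIST images reshaped to (28, 28, 1) and scaled to the [0, 1] range.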


How far will Facebook go to fix what it broke: Democracy, Trust, Reality

Aarthi Kumaraswamy
24 Sep 2018
19 min read
Facebook, along with other tech media giants, like Twitter and Google, broke the democratic process in 2016. Facebook also broke the trust of many of its users as scandal after scandal kept surfacing telling the same story in different ways - the story of user data and trust abused in exchange for growth and revenue. The week before last, Mark Zuckerberg posted a long explanation on Facebook titled ‘Preparing for Elections’. It is the first of a series of reflections by Zuckerberg that ‘address the most important issues facing Facebook’. That post explored what Facebook is doing to avoid ending up in a situation similar to the 2016 elections when the platform ‘inadvertently’ became a super-effective channel for election interference of various kinds. It follows just weeks after Facebook COO, Sheryl Sandberg appeared in front of a Senate Intelligence hearing alongside Twitter CEO, Jack Dorsey on the topic of social media’s role in election interference. Zuckerberg’s mobile-first rigor oversimplifies the issues Zuckerberg opened his post with a strong commitment to addressing the issues plaguing Facebook using the highest levels of rigor the company has known in its history. He wrote, “I am bringing the same focus and rigor to addressing these issues that I've brought to previous product challenges like shifting our services to mobile.”  To understand the weight of this statement we must go back to how Facebook became a mobile-first company that beat investor expectations wildly. Suffice to say it went through painful years of restructuring and reorientation in the process. Those unfamiliar with that phase of Facebook, please read the section ‘How far did Facebook go to become a mobile-first company?’ at the end of this post for more details. To be fair, Zuckerberg does acknowledge that pivoting to mobile was a lot easier than what it will take to tackle the current set of challenges. He writes, “These issues are even harder because people don't agree on what a good outcome looks like, or what tradeoffs are acceptable to make. When it comes to free expression, thoughtful people come to different conclusions about the right balances. When it comes to implementing a solution, certainly some investors disagree with my approach to invest so much on security. We have a lot of work ahead, but I am confident we will end this year with much more sophisticated approaches than we began, and that the focus and investments we've put in will be better for our community and the world over the long term.” However, what Zuckerberg does not acknowledge in the above statement is that the current set of issues is not merely a product challenge, but a business ethics and sustainability challenge. Unless ‘an honest look in the mirror’ kind of analysis is done on that side of Facebook, any level of product improvements will only result in cosmetic changes that will end in an ‘operation successful, patient dead’ scenario. In the coming sections, I attempt to dissect Zuckerberg’s post in the context of the above points by reading between the lines to see how serious the platform really is about changing its ways to ‘be better for our community and the world over the long term’. Why does Facebook’s commitment to change feel hollow? Let’s focus on election interference in this analysis as Zuckerberg limits his views to this topic in his post. Facebook has been at the center of this story on many levels. Here is some context on where Zuckerberg is coming from.   
Facebook’s involvement in the 2016 election meddling Apart from the traditional cyber-attacks (which they had even back then managed to prevent successfully), there were Russia-backed coordinated misinformation campaigns found on the platform. Then there was also the misuse of its user data by data analytics firm, Cambridge Analytica, which consulted on election campaigning. They micro-profiled users based on their psychographics (the way they think and behave) to ensure more effective ad spending by political parties. There was also the issue of certain kinds of ads, subliminal messages and peer pressure sent out to specific Facebook users during elections to prompt them to vote for certain candidates while others did not receive similar messages. There were also alleged reports of a certain set of users having been sent ‘dark posts’ (posts that aren’t publicly visible to all, but visible only to those on the target list) to discourage them from voting altogether. It also appears that Facebook staff offered both the Clinton and the Trump campaigns to assist with Facebook advertising. The former declined the offer while the latter accepted. We don’t know which of the above and to what extent each of these decisions and actions impacted the outcome of the 2016 US presidential elections. But one thing is certain, collectively they did have a significant enough impact for Zuckerberg and team to acknowledge these are serious problems that they need to address, NOW! Deconstructing Zuckerberg’s ‘Protecting Elections’ Before diving into what is problematic about the measures that are taken (or not taken) by Facebook, I must commend them for taking ownership of their role in election interference in the past and for attempting to rectify the wrongs. I like that Zuckerberg has made himself vulnerable by sharing his corrective plans with the public while it is a work in progress and is engaging with the public at a personal level. Facebook’s openness to academic research using anonymized Facebook data and their willingness to permit publishing findings without Facebook’s approval is also noteworthy. Other initiatives such as the political ad transparency report, AI enabled fake account & fake news reduction strategy, doubling the content moderator base, improving their recommendation algorithms are all steps in the right direction. However, this is where my list of nice things to say ends. The overall tone of Zuckerberg’s post is that of bargaining rather than that of acceptance. Interestingly this was exactly the tone adopted by Sandberg as well in the Senate hearing earlier this month, down to some very similar phrases. This makes one question if everything isn’t just one well-orchestrated PR disaster management plan. Disappointingly, most of the actions stated in Zuckerberg's post feel like half-measures; I get the sense that they aren’t willing to go the full distance to achieve the objectives they set for themselves. I hope to be wrong. 1. Zuckerberg focuses too much on ‘what’ and ‘how’, is ignoring the ‘why’ Zuckerberg identifies three key issues he wants to address in 2018: preventing election interference, protecting the community from abuse, and providing users with better control over their information. This clarity is a good starting point. In this post, he only focuses on the first issue. So I will reserve sharing my detailed thoughts on the other two for now. 
What I would say for now is that the key to addressing all issues on Facebook is taking a hard look at Facebook policies, including privacy, from a mission statement perspective. In other words, be honest about 'Why Facebook exists'. Users are annoyed, advertisers are dissatisfied, and shareholders are not confident about Facebook's future. Trying to be everyone's friend is clearly not working for Facebook. As such, I expected this to be tackled in the opening post of the series. 'Be better for our community and the world over the long term' is too vague a mission statement to be of any practical use.

2. The political ad transparency report is necessary, but not sufficient

In May this year, Facebook released its first political ad transparency report as a gesture to show its commitment to minimizing political interference. The report allows one to see who sponsored which issue advertisement and for how much. This move was welcomed by everyone, and soon others like Twitter and Google followed suit. By doing this, Facebook hopes to allow its users to form more informed views about political causes and other issues.

Here is my problem with this feature. (Yes, I do view this report as a 'feature' of the new Facebook app which serves a very specific need: to satisfy regulators and the media.) The average Facebook user is not a politically or technologically savvy consumer. They use Facebook to connect with friends and family and maybe play silly games now and then. The majority of these users aren't going to proactively check out this ad transparency report or the political ad database to arrive at the right conclusions. The people who will find this report interesting are academic researchers, campaign managers, and analysts. It is one more rich data point for understanding campaign strategy and thereby inferring who the target audience is. This could most likely lead to a downward spiral of more and more polarizing ads from parties across the spectrum.

3. How election campaigning, hate speech, and real violence are linked but unacknowledged

Another issue closely tied to political ads is hate speech and violence-inciting, polarizing content that isn't necessarily paid advertising. This is typically content in the form of posts, images, or videos posted in response to political ads or discourses. These act as carriers that amplify the political message, often in ways unintended by the campaigners themselves. The echo chambers still exist. And the more one's ecosystem or 'look-alike audience' responds to certain types of ads or posts, the more likely users are to keep seeing them, thanks to Facebook's algorithms. Seeing something that is endorsed by one's friends often primes one to trust what is said without verifying the facts, thus enabling fake news to go viral. The algorithm does the rest to ensure everyone who will engage with the content sees it. Newsy political ads will thrive in such a setup while getting away with saying 'we made full disclosure in our report'. All of this is great for Facebook's platform, as it not only gets great engagement from the content but also increased ad spending from all political parties, since they can't afford to be missing in action on Facebook. A by-product of this ultra-polarized scenario, though, is more protectionism and less free, open, and meaningful dialog and debate between candidates as well as supporters on the platform. That's bad news for the democratic process.
4. Facebook's election interference prevention model is not scalable

Facebook's single-minded focus on eliminating US election interference on its platforms through a multipronged approach to content moderation is worth appreciating. This also makes one optimistic about Facebook's role in consciously attempting to do the right thing when it comes to respecting election processes in other nations as well. But the current approach of creating an 'election war room' is neither scalable nor sustainable. What happens every time a constituency in the US, or some other part of the world, holds an election? What happens when multiple elections take place across the world simultaneously? Whom does Facebook prioritize for election interference defense support, and why? Also, I wouldn't go so far as to trust that they will uphold individual liberties in troubled nations with strong regimes or strongly divisive political discourses. What happens when the ruling party is the one interfering with the elections? Who is Facebook answerable to?

5. Facebook's headcount hasn't kept up with its own growth ambitions

Zuckerberg proudly states in his post that Facebook has deleted a billion fake accounts with machine learning and has doubled the number of people hired to work on safety and security.

"With advances in machine learning, we have now built systems that block millions of fake accounts every day. In total, we removed more than one billion fake accounts -- the vast majority within minutes of being created and before they could do any harm -- in the six months between October and March. ....it is still very difficult to identify the most sophisticated actors who build their networks manually one fake account at a time. This is why we've also hired a lot more people to work on safety and security -- up from 10,000 last year to more than 20,000 people this year."

'People working on safety and security' could cover a wide range of job responsibilities, from network security engineers to security guards hired at Facebook offices. What is conspicuously missing in the above picture is a breakdown of the number of people hired specifically to fact-check, moderate content, resolve policy-related disputes, and review flagged content. With billions of users posting on Facebook, the job of content moderators and policy enforcers, even when assisted by algorithms, is massive. It is important that they are rightly incentivized to do their job well and are set clear and measurable goals. The post talks neither about how Facebook plans to reward moderators nor about what the yardsticks for performance in this area would be. Facebook fails to acknowledge that it is not fully prepared, partly because it is understaffed.

6. The new 'Product Policy Director, Human Rights' role is a glorified public relations job

The weekend following Zuckerberg's post, a new job opening appeared on Facebook's careers page for the position of 'Product policy director, human rights'. The below snippet is taken from that job posting. Source: Facebook careers

The above is typically what a public relations head does as well. Not only are the responsibilities cited above heavily based on communication and public perception building, there's also not much authority given to this role to influence how other teams achieve their goals. Simply put, this role 'works with, coordinates or advises teams'; it does not 'guide or direct teams'.
Also, another key point to observe is that this role aims to add another layer of distance to further minimize exposure for Zuckerberg, Sandberg, and other top executives in public forums such as congressional hearings or press meets. Any role or area that is important to a business typically finds a place at the C-suite table. Had this new role been a C-suite role, it would have been advertised as such, and it may have had some teeth. Of the 24 key executives at Facebook, only one is concerned with privacy and policy, the 'Chief Privacy Officer & VP of U.S. Public Policy'. Even this role does not have a global directive or public welfare in mind. On the other hand, there are multiple product development, creative, and business development roles in Facebook's C-suite. There is even a separate Watch product head, a messaging product head, and one just dedicated to China called 'Head of Creative Shop - Greater China'.

This is why Facebook's plan to protect elections will fail

I am afraid Facebook's greatest strength is also its Achilles heel. The tech industry's deified hacker culture is embodied perfectly by Facebook. Facebook's flawed, ad-revenue-based business model is the ingenious creation of that very hacker culture. Any attempt to correct everything else is futile without correcting the issues with the current model. The ad-revenue-based model is why the Facebook app is designed the way it is: with 'relevant' news feeds, filter bubbles, and look-alike audience segmentation. It is the reason why viral content gets rewarded irrespective of its authenticity or the impact it has on society. It is also the reason why Facebook has a 'move fast and break things' internal culture where growth at all costs is favored and idolized. Facebook's Q2 2018 earnings summary highlights the above points succinctly. Source: Facebook's SEC Filing

The above snapshot means that even if we assume all 30k-odd employees do some form of content moderation (the probability of which is zero), every employee is responsible for 50k users' content daily. Let's say every user posts only one post a day. If we assume Facebook's news feed algorithms are super efficient and flag only 2% of user content as questionable or fake (as speculated by Sandberg in her Senate hearing this month), that would still mean nearly 1k posts per person to review every day!

What can Facebook do to turn over a new leaf?

Unless Facebook attempts to sincerely address at least some of the below, I will continue to be skeptical of any number of beautifully written posts by Zuckerberg or patriotically orated speeches by Sandberg.

A content moderation transparency report that shares not just the number of posts moderated and the number of people working to moderate content on Facebook, but also the nature of the content moderated, the moderators' job satisfaction levels, their tenure, qualifications, career aspirations, their challenges, and how much Facebook is investing in people, processes, and technology to make its platform safe and objective for everyone to engage with.

A general ad transparency report that not only lists advertisers on Facebook but also their spending and chosen ad filters, for the public and academia to review or analyze at any time.

Taking responsibility for the real-world consequences of actions enabled by Facebook, like the recent gender and age discrimination in employment ads shown on Facebook.

Really banning hate speech and fake viral content.
Bringing in a business/AI ethics head who reports only to Zuckerberg and sits on par with Sandberg's COO role.

Exploring and experimenting with alternative revenue channels to tackle the current ad-driven business model problem.

Resolving the UI problem so that users can regain control over their data and can easily choose not to participate in Facebook's data experiments. This would mean a potential loss of some ad revenue.

Tackling the 'growth hacker' culture problem that is a byproduct of years of moving fast and breaking things. This would mean a significant change in behavior by everyone, starting from the top, and probably restructuring the way teams are organized and business is done. It would also mean a different definition and measurement of success, which could lead to shareholder backlash. But Mark is uniquely placed to withstand these pressures given his clout over the board's voting powers.

Like his role model Augustus Caesar, Zuckerberg has a chance to make history. But he might have to put the company through hard times and sacrifices in exchange for the proverbial 200 years of world peace. He's got the best minds and limitless resources at his disposal to right what he and his platform wronged. But he would have to make enemies of the hands that feed him. Would he rise to the challenge? Like Augustus, who is rumored to have killed his grandson, will Zuckerberg ever be prepared to kill his ad-revenue-generating brainchild?

In the meantime, we must not underestimate the power of good digital citizenry. We must continue to fight the good fight to move tech giants like Facebook in the right direction. Just as persistent trickling water droplets can erode mountains and create new pathways, so can our mindful actions as digital platform users prompt major tech reforms. It could be as bold as deleting one's Facebook account (I haven't been on the platform for years now, and I don't miss it at all). You could organize groups to create awareness on topics like digital privacy, fake news, and filter bubbles, or deliberately choose to engage with those whose views differ from yours to understand their perspective and thereby do your part in reversing algorithmically accentuated polarity. It could also be by selecting the right individuals to engage in informed dialog with tech conglomerates. Not every action needs to be hard, though. It could be as simple as customizing your default privacy settings, choosing to spend only a limited amount of time on such platforms, or deciding to verify the authenticity and assess the toxicity of a post you wish to like, share, or forward to your network.

Addendum

How far did Facebook go to become a mobile-first company?

Following are some of the things Facebook did to become the largest mobile advertising platform in the world, surpassing Google by a huge margin.

Clear purpose and reason for the change: "For one, there are more mobile users. Second, they're spending more time on it... third, we can have better advertising on mobile, make more money," said Zuckerberg at TechCrunch Disrupt back in 2012 on why they were becoming mobile first. In other words, there was a lot of growth and revenue potential in investing in this space. This was a simple and clear 'what's in it for me' incentive for everyone working to make the transition, as well as for stockholders and advertisers to place their trust in Zuckerberg's endeavors.
Setting company-wide accountability: "We realigned the company around, so everybody was responsible for mobile," said the then President of Business and Marketing Partnerships, David Fischer, to Fortune in 2013.

Willing to sacrifice desktop for mobile: Facebook decided to make a bold gamble and risk losing its desktop users to grow its unproven mobile platform. Essentially, it was willing to bet its only cash cow on a dark horse that depended on many other factors going right.

Strict consequences for non-compliance: Back in the days of transitioning to a mobile-first company, Zuckerberg famously said to all his product teams that when they went in for reviews: "Come in with mobile. If you come in and try to show me a desktop product, I'm going to kick you out. You have to come in and show me a mobile product."

Expanding resources and investing in reskilling: Facebook grew from a team of 20 mobile engineers to literally all engineers at Facebook undergoing training courses on iOS and Android development. "We've completely changed the way we do product development. We've trained all our engineers to do mobile first," said Facebook's VP of corporate development, Vaughan Smith, to TechCrunch by the end of 2012.

Realigning product design philosophy: Facebook designed custom features for the mobile-first interface instead of trying to adapt features built for the web to mobile. In other words, it began with mobile as the default user interface.

Local and global user behavior sensitization: Some of its engineering teams even did field visits to developing nations like the Philippines to see first-hand how mobile apps are being used there.

Environmental considerations in app design: Facebook even had the foresight to consider scenarios where mobile users may not have good-quality internet signals or may face battery-related issues. It designed its apps keeping these future needs in mind.

How Facebook is advancing artificial intelligence [Video]

Richard Gall
14 Sep 2018
4 min read
Facebook is playing a huge role in artificial intelligence research. It's not only a core part of the Facebook platform, it's central to how the organization works. The company launched its AI research lab - FAIR - back in 2013. Today, led by some of the best minds in the field, it's not only helping Facebook to leverage artificial intelligence, it's also making it more accessible to researchers and engineers around the world. Let's take a look at some of the tools built by Facebook that are doing just that.

PyTorch: Facebook's leading artificial intelligence tool

PyTorch is a hugely popular deep learning framework (rivalling Google's TensorFlow) that, by combining flexibility and dynamism with stability, bridges the gap between research and production. Using a tape-based auto-differentiation system, PyTorch can be modified and changed by engineers without losing speed. That's good news for everyone. Although PyTorch steals the headlines, there is a range of supporting tools that are making artificial intelligence and deep learning more accessible and achievable for other engineers.

Read next: Is PyTorch better than Google's TensorFlow?

Find PyTorch eBooks and videos on the Packt website.

Facebook's computer vision tools

Another field that Facebook has revolutionized is computer vision and image processing. Detectron, Facebook's state-of-the-art object detection software system, has powered many research projects, including Mask R-CNN - a simple and flexible way of developing Convolutional Neural Networks for image processing. Mask R-CNN has also helped to power DensePose, a tool that maps all human pixels of an RGB image to a 3D surface-based representation of the human body. Facebook has also contributed heavily to research in detecting and recognizing human-object interactions. Its contribution to the field of generative modeling is equally important, with tasks such as minimizing variations in image quality, JPEG compression, and image quantization now becoming easier and more accessible.

Facebook, language and artificial intelligence

We share updates, we send messages - language is a cornerstone of Facebook. This is why it's such an important area for Facebook's AI researchers. There is a whole host of libraries and tools built for language problems. FastText is a library for text representation and classification, while ParlAI is a platform pushing the boundaries of dialog research. The platform is focused on tackling five key AI tasks: question answering, sentence completion, goal-oriented dialog, chit-chat dialog, and visual dialog. The ultimate aim for ParlAI is to develop a general dialog AI. There are also a few more language tools in Facebook's AI toolkit - Fairseq and Translate are helping with translation and text generation, while Wav2Letter is an Automatic Speech Recognition system that can be used for transcription tasks.

Rational artificial intelligence for gaming and smart decision making

Although Facebook isn't known for gaming, its interest in developing artificial intelligence that can reason could have an impact on the way games are built in the future. ELF is a tool developed by Facebook that allows game developers to train and test AI algorithms in a gaming environment. ELF was used by Facebook researchers to recreate DeepMind's AlphaGo Zero, the AI bot that has defeated Go champions. Running on a single GPU, the ELF OpenGo bot defeated four professional Go players 14-0. Impressive, right?
There are other tools built by Facebook that aim to build AI into game reasoning. TorchCraft is probably the most notable example - it's a library that's making AI research on StarCraft - a strategy game - accessible to game developers and AI specialists alike.

Facebook is defining the future of artificial intelligence

As you can see, Facebook is doing a lot to push the boundaries of artificial intelligence. However, it's not just keeping these tools for itself - all these tools are open source, which means they can be used by anyone.
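If you want a feel for the tape-based auto-differentiation mentioned above, here is a minimal sketch (assuming PyTorch is installed; the tensor values are made up purely for illustration):

```python
import torch

# Two scalar tensors that require gradients; PyTorch records ("tapes")
# every operation performed on them as the code runs.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

# Build a tiny computation graph dynamically, just by running Python code.
loss = a * b + a ** 2

# Walk the tape backwards to compute gradients.
loss.backward()

print(a.grad)  # d(loss)/da = b + 2a = 7.0
print(b.grad)  # d(loss)/db = a = 2.0
```

Because the graph is rebuilt on every forward pass, engineers can change model structure with ordinary Python control flow, without a separate compilation step - which is the flexibility-with-speed trade-off the article refers to.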

Getting started with Amazon Machine Learning workflow [Tutorial]

Melisha Dsouza
02 Sep 2018
14 min read
Amazon Machine Learning is useful for building ML models and generating predictions. It also enables the development of robust and scalable smart applications. The process of building ML models with Amazon Machine Learning consists of three operations: data analysis model training evaluation. The code files for this article are available on Github. This tutorial is an excerpt from a book written by Alexis Perrier titled Effective Amazon Machine Learning. The Amazon Machine Learning service is available at https://console.aws.amazon.com/machinelearning/. The Amazon ML workflow closely follows a standard Data Science workflow with steps: Extract the data and clean it up. Make it available to the algorithm. Split the data into a training and validation set, typically a 70/30 split with equal distribution of the predictors in each part. Select the best model by training several models on the training dataset and comparing their performances on the validation dataset. Use the best model for predictions on new data. As shown in the following Amazon ML menu, the service is built around four objects: Datasource ML model Evaluation Prediction The Datasource and Model can also be configured and set up in the same flow by creating a new Datasource and ML model. Let us take a closer look at each one of these steps. Understanding the dataset used We will use the simple Predicting Weight by Height and Age dataset (from Lewis Taylor (1967)) with 237 samples of children's age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm. This dataset is composed of 237 rows. Each row has the following predictors: sex (F, M), age (in months), height (in inches), and we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range and normalization is not required. We do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical variables (real-valued), and sex is a categorical variable. We will randomly select 20% of the rows as the held-out subset to use for prediction on previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor: By creating a new column with randomly generated numbers Sorting the spreadsheet by that column Selecting 190 rows for training and 47 rows for prediction (roughly a 80/20 split) Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creator of this dataset in 1967. As with all datasets, scripts, and resources mentioned in this book, the training and holdout files are available in the GitHub repository at https://github.com/alexperrier/packt-aml. It is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model. Loading the data on S3 Follow these steps to load the training and held-out datasets on S3: Go to your s3 console at https://console.aws.amazon.com/s3. Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all S3. We created a bucket named aml.packt. 
Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration. Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu:

Both files are small, only a few KB, and hosting costs should remain negligible for this exercise. Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed; what user, role, group, or AWS service may download, read, write, and delete the files; and whether or not they should be accessible from the open web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on.

Our data is now in the cloud in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file LT67_training.csv.

Declaring a datasource

Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default:

As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file {S3://bucket}{path}{file}. Note that the S3 location field automatically populates with the bucket names and file names that are available to your user:

Specifying a Datasource name is useful for organizing your Amazon ML assets. By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. In case it needs to be granted access to the file, you will be prompted to do so as shown in the following screenshot:

Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents.

Creating the datasource

An Amazon ML datasource is composed of the following:

The location of the data file: the data file is not duplicated or cloned in Amazon ML but accessed from S3
The schema, which contains information on the type of the variables contained in the CSV file: Categorical, Text, Numeric (real-valued), or Binary

It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML. At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has:

Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset as shown in the next screenshot:

At that point, Amazon ML needs to know which variable you are trying to predict. Be sure to tell Amazon ML the following:

The first line in the CSV file contains the column names
The target is the weight

We see here that Amazon ML has correctly inferred the following:

sex is categorical
age, height, and weight are numeric (continuous real values)

Since we chose a numeric variable as the target, Amazon ML will use Linear Regression as the predictive model. For a binary or categorical target, we would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data:

predicted weight = a * age + b * height + c * sex

Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not.
Row identifiers are useful when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model. You will be asked to review the datasource. You can go back to each one of the previous steps and edit the parameters for the schema, the target and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model. Understanding the model We select the default parameters for the training and evaluation settings. Amazon ML will do the following: Create a recipe for data transformation based on the statistical properties it has inferred from the dataset Split the dataset (ST67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially. The recipe will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1 for instance. No other transformation is needed. The default advanced settings for the model are shown in the following screenshot: We see that Amazon ML will pass over the data 10 times, shuffle splitting the data each time. It will use an L2 regularization strategy based on the sum of the square of the coefficients of the regression to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on. Regularization comes in 3 levels with a mild (10^-6), medium (10^-4), or aggressive (10^-02) setting, each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.00001 (10^-6) implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set). Clicking on the Create ML model button will launch the model creation. This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending. At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation. It's worth noting that Amazon ML will not create two extra files and store them on S3, but instead create two new datasources out of the initial datasource we have previously defined. Each new datasource is obtained from the original one via a Data rearrangement JSON recipe such as the following: { "splitting": { "percentBegin": 0, "percentEnd": 70 } } You can see these two new datasources in the Datasource dashboard. 
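For completeness, the validation datasource would be described by the complementary rearrangement sketched below. This is an assumption based on the sequential 70/30 split described above; Amazon ML generates these recipes for you, so you never have to write them by hand:

```json
{
  "splitting": {
    "percentBegin": 70,
    "percentEnd": 100
  }
}
```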
Three datasources are now available where there was initially only one, as shown by the following screenshot:

While the model is being trained, Amazon ML runs the Stochastic Gradient Descent algorithm several times on the training data with different parameters:

Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100.
Making several passes over the training data while shuffling the samples before each pass.
At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement over the last pass was obtained. If the decrease in RMSE is not really significant, the algorithm is considered to have converged, and no further pass shall be made.

At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version. Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is ready, you have access to the model's evaluation.

Evaluating the model

Amazon ML uses the standard RMSE metric for linear regression. RMSE is defined as the square root of the mean of the squared differences between the predicted values and the real values:

RMSE = sqrt( (1/n) * Σ (ŷ_i - y_i)² )

Here, ŷ denotes the predicted values, and y the real values we want to predict (the weight of the children in our case). The closer the predictions are to the real values, the lower the RMSE is. A lower RMSE means a better, more accurate prediction.

Making batch predictions

We now have a model that has been properly trained and selected among other models. We can use it to make predictions on new data. A batch prediction consists of applying a model to a datasource in order to make predictions on that datasource. We need to tell Amazon ML which model we want to apply to which data.

Batch predictions are different from streaming predictions. With batch predictions, all the data is already made available as a datasource, while for streaming predictions, the data is fed to the model as it becomes available; the dataset is not available beforehand in its entirety.

In the Main Menu, select Batch Predictions to access the predictions dashboard and click on Create a New Prediction:

The first step is to select one of the models available in your model dashboard. You should choose the one that has the lowest RMSE:

The next step is to associate a datasource with the model you just selected. We had uploaded the held-out dataset to S3 at the beginning of this chapter (under the Loading the data on S3 section) but had not used it to create a datasource. We will do so now. When asked for a datasource in the next screen, make sure to check My data is in S3, and I need to create a datasource, and then select the held-out dataset that should already be present in your S3 bucket:

Don't forget to tell Amazon ML that the first line of the file contains the column names. In our current project, our held-out dataset also contains the true values for the weight of the students. This would not be the case for "real" data in a real-world project, where the real values are truly unknown. However, in our case, this will allow us to calculate the RMSE score of our predictions and assess their quality. The final step is to click on the Verify button and wait for a few minutes:

Amazon ML will run the model on the new datasource and will generate predictions in the form of a CSV file. Contrary to the evaluation and model-building phase, we now have real predictions.
We are also no longer given a score associated with these predictions. After a few minutes, you will notice a new batch-prediction folder in your S3 bucket. This folder contains a manifest file and a results folder. The manifest file is a JSON file with the path to the initial datasource and the path to the results file. The results folder contains a gzipped CSV file: Uncompressed, the CSV file contains two columns, trueLabel, the initial target from the held-out set, and score, which corresponds to the predicted values. We can easily calculate the RMSE for those results directly in the spreadsheet through the following steps: Creating a new column that holds the square of the difference of the two columns. Summing all the rows. Taking the square root of the result. The following illustration shows how we create a third column C, as the squared difference between the trueLabel column A and the score (or predicted value) column B: As shown in the following screenshot, averaging column C and taking the square root gives an RMSE of 11.96, which is even significantly better than the RMSE we obtained during the evaluation phase (RMSE 14.4): The fact that the RMSE on the held-out set is better than the RMSE on the validation set means that our model did not overfit the training data, since it performed even better on new data than expected. Our model is robust. The left side of the following graph shows the True (Triangle) and Predicted (Circle) Weight values for all the samples in the held-out set. The right side shows the histogram of the residuals. Similar to the histogram of residuals we had observed on the validation set, we observe that the residuals are not centered on 0. Our model has a tendency to overestimate the weight of the students: In this tutorial, we have successfully performed the loading of the data on S3 and let Amazon ML infer the schema and transform the data. We also created a model and evaluated its performance. Finally, we made a prediction on the held -out dataset. To understand how to leverage Amazon's powerful platform for your predictive analytics needs,  check out this book Effective Amazon Machine Learning. Four interesting Amazon patents in 2018 that use machine learning, AR, and robotics Amazon Sagemaker makes machine learning on the cloud easy Amazon ML Solutions Lab to help customers “work backwards” and leverage machine learning    
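As a quick aside for readers who prefer scripting to spreadsheets, the RMSE check described above can be reproduced in a few lines of pandas. This is only a sketch: it assumes the batch prediction results have been downloaded and uncompressed locally as results.csv, with the trueLabel and score columns described above.

```python
import pandas as pd

# Load the uncompressed batch prediction results.
# Assumed columns: 'trueLabel' (actual weight) and 'score' (predicted weight).
results = pd.read_csv("results.csv")

# RMSE = square root of the mean squared difference between truth and prediction.
squared_errors = (results["trueLabel"] - results["score"]) ** 2
rmse = squared_errors.mean() ** 0.5

print(f"RMSE on the held-out set: {rmse:.2f}")
```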

Understanding Amazon Machine Learning Workflow [ Tutorial ]

Natasha Mathur
24 Aug 2018
11 min read
This article presents an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. Amazon Machine Learning is an online service by Amazon Web Services (AWS) that does supervised learning for predictive analytics. Launched in April 2015 at the AWS Summit, Amazon ML joins a growing list of cloud-based machine learning services, such as Microsoft Azure, Google prediction, IBM Watson, Prediction IO, BigML, and many others. These online machine learning services form an offer commonly referred to as Machine Learning as a Service or MLaaS following a similar denomination pattern of other cloud-based services such as SaaS, PaaS, and IaaS respectively for Software, Platform, or Infrastructure as a Service. The Amazon ML workflow closely follows a standard Data Science workflow with steps: Extract the data and clean it up. Make it available to the algorithm. Split the data into a training and validation set, typically a 70/30 split with equal distribution of the predictors in each part. Select the best model by training several models on the training dataset and comparing their performances on the validation dataset. Use the best model for predictions on new data. This article is an excerpt taken from the book 'Effective Amazon Machine Learning' written by Alexis Perrier. As shown in the following Amazon ML menu, the service is built around four objects: Datasource ML model Evaluation Prediction The Datasource and Model can also be configured and set up in the same flow by creating a new Datasource and ML model. We will take a closer look at the Datasource and ML model. Amazon ML  dataset For the rest of the article, we will use the simple Predicting Weight by Height and Age dataset (from Lewis Taylor (1967)) with 237 samples of children's age, weight, height, and gender, which is available at https://v8doc.sas.com/sashtml/stat/chap55/sect51.htm. This dataset is composed of 237 rows. Each row has the following predictors: sex (F, M), age (in months), height (in inches), and we are trying to predict the weight (in lbs) of these children. There are no missing values and no outliers. The variables are close enough in range and normalization is not required. In short, we do not need to carry out any preprocessing or cleaning on the original dataset. Age, height, and weight are numerical variables (real-valued), and sex is a categorical variable. We will randomly select 20% of the rows as the held-out subset to use for the prediction of previously unseen data and keep the other 80% as training and evaluation data. This data split can be done in Excel or any other spreadsheet editor: By creating a new column with randomly generated numbers Sorting the spreadsheet by that column Selecting 190 rows for training and 47 rows for prediction (roughly a 80/20 split) Let us name the training set LT67_training.csv and the held-out set that we will use for prediction LT67_heldout.csv, where LT67 stands for Lewis and Taylor, the creator of this dataset in 1967. Note that it is important for the distribution in age, sex, height, and weight to be similar in both subsets. We want the data on which we will make predictions to show patterns that are similar to the data on which we will train and optimize our model. Loading the data on Amazon S3 Follow these steps to load the training and held-out datasets on S3: Go to your s3 console at https://console.aws.amazon.com/s3. Create a bucket if you haven't done so already. Buckets are basically folders that are uniquely named across all S3. 
We created a bucket named aml.packt. Since that name has now been taken, you will have to choose another bucket name if you are following along with this demonstration. Click on the bucket name you created and upload both the LT67_training.csv and LT67_heldout.csv files by selecting Upload from the Actions drop-down menu:

Both files are small, only a few KB, and hosting costs should remain negligible for this exercise. Note that for each file, by selecting the Properties tab on the right, you can specify how your files are accessed; what user, role, group, or AWS service may download, read, write, and delete the files; and whether or not they should be accessible from the open web. When creating the datasource in Amazon ML, you will be prompted to grant Amazon ML access to your input data. You can specify the access rules to these files now in S3 or simply grant access later on.

Our data is now in the cloud in an S3 bucket. We need to tell Amazon ML where to find that input data by creating a datasource. We will first create the datasource for the training file LT67_training.csv.

Declaring a datasource

Go to the Amazon ML dashboard, and click on Create new... | Datasource and ML model. We will use the faster flow available by default:

As shown in the following screenshot, you are asked to specify the path to the LT67_training.csv file {S3://bucket}{path}{file}. Note that the S3 location field automatically populates with the bucket names and file names that are available to your user:

Specifying a Datasource name is useful for organizing your Amazon ML assets. By clicking on Verify, Amazon ML will make sure that it has the proper rights to access the file. In case it needs to be granted access to the file, you will be prompted to do so as shown in the following screenshot:

Just click on Yes to grant access. At this point, Amazon ML will validate the datasource and analyze its contents.

Creating the datasource

An Amazon ML datasource is composed of the following:

The location of the data file: the data file is not duplicated or cloned in Amazon ML but accessed from S3
The schema, which contains information on the type of the variables contained in the CSV file: Categorical, Text, Numeric (real-valued), or Binary

It is possible to supply Amazon ML with your own schema or modify the one created by Amazon ML. At this point, Amazon ML has a pretty good idea of the type of data in your training dataset. It has identified the different types of variables and knows how many rows it has:

Move on to the next step by clicking on Continue, and see what schema Amazon ML has inferred from the dataset as shown in the next screenshot:

At that point, Amazon ML needs to know which variable you are trying to predict. Be sure to tell Amazon ML the following:

The first line in the CSV file contains the column names
The target is the weight

We see here that Amazon ML has correctly inferred the following:

sex is categorical
age, height, and weight are numeric (continuous real values)

Since we chose a numeric variable as the target, Amazon ML will use Linear Regression as the predictive model. For a binary or categorical target, we would have used Logistic Regression. This means that Amazon ML will try to find the best a, b, and c coefficients so that the weight predicted by the following equation is as close as possible to the observed real weight present in the data:

predicted weight = a * age + b * height + c * sex

Amazon ML will then ask you if your data contains a row identifier. In our present case, it does not.
Row identifiers are used when you want to understand the prediction obtained for each row or add an extra column to your dataset later on in your project. Row identifiers are for reference purposes only and are not used by the service to build the model. You will be asked to review the datasource. You can go back to each one of the previous steps and edit the parameters for the schema, the target, and the input data. Now that the data is known to Amazon ML, the next step is to set up the parameters of the algorithm that will train the model. The machine learning model We select the default parameters for the training and evaluation settings. Amazon ML will do the following: Create a step for data transformation based on the statistical properties it has inferred from the dataset Split the dataset (ST67_training.csv) into a training part and a validation part, with a 70/30 split. The split strategy assumes the data has already been shuffled and can be split sequentially. The step will be used to transform the data in a similar way for the training and the validation datasets. The only transformation suggested by Amazon ML is to transform the categorical variable sex into a binary variable, where m = 0 and f = 1 for instance. No other transformation is needed. The default advanced settings for the model are shown in the following screenshot: We see that Amazon ML will pass over the data 10 times, shuffle splitting the data each time. It will use an L2 regularization strategy based on the sum of the square of the coefficients of the regression to prevent overfitting. We will evaluate the predictive power of the model using our LT67_heldout.csv dataset later on. Regularization comes in 3 levels with a mild (10^-6), medium (10^-4), or aggressive (10^-02) setting, each value stronger than the previous one. The default setting is mild, the lowest, with a regularization constant of 0.00001 (10^-6) implying that Amazon ML does not anticipate much overfitting on this dataset. This makes sense when the number of predictors, three in our case, is much smaller than the number of samples (190 for the training set). Clicking on the Create ML model button will launch the model creation. This takes a few minutes to resolve, depending on the size and complexity of your dataset. You can check its status by refreshing the model page. In the meantime, the model status remains pending. At that point, Amazon ML will split our training dataset into two subsets: a training and a validation set. It will use the training portion of the data to train several settings of the algorithm and select the best one based on its performance on the training data. It will then apply the associated model to the validation set and return an evaluation score for that model. By default, Amazon ML will sequentially take the first 70% of the samples for training and the remaining 30% for validation. It's worth noting that Amazon ML will not create two extra files and store them on S3, but instead create two new datasources out of the initial datasource we have previously defined. Each new datasource is obtained from the original one via a Data rearrangement JSON recipe such as the following: { "splitting": { "percentBegin": 0, "percentEnd": 70 } } You can see these two new datasources in the Datasource dashboard. 
Three datasources are now available where there was initially only one, as shown by the following screenshot: While the model is being trained, Amazon ML runs the Stochastic Gradient algorithm several times on the training data with different parameters: Varying the learning rate in increments of powers of 10: 0.01, 0.1, 1, 10, and 100. Making several passes over the training data while shuffling the samples before each path. At each pass, calculating the prediction error, the Root Mean Squared Error (RMSE), to estimate how much of an improvement over the last pass was obtained. If the decrease in RMSE is not really significant, the algorithm is considered to have converged, and no further pass shall be made. At the end of the passes, the setting that ends up with the lowest RMSE wins, and the associated model (the weights of the regression) is selected as the best version. Once the model has finished training, Amazon ML evaluates its performance on the validation datasource. Once the evaluation itself is also ready, you have access to the model's evaluation. The Amazon ML flow is smooth and facilitates the inherent data science loop: data, model, evaluation, and prediction. We looked at an overview of the workflow of a simple Amazon Machine Learning (Amazon ML) project. We discussed two objects of the Amazon ML menu: Datasource and ML model. If you found this post useful, be sure to check out the book 'Effective Amazon Machine Learning' to learn about evaluation and prediction in Amazon ML along with other AWS ML concepts. Integrate applications with AWS services: Amazon DynamoDB & Amazon Kinesis [Tutorial] AWS makes Amazon Rekognition, its image recognition AI, available for Asia-Pacific developers
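As an aside, if you would rather script the S3 upload step described in this walkthrough than use the console, here is a minimal boto3 sketch. It assumes your AWS credentials are already configured (for example via `aws configure`) and uses an illustrative bucket name of your choosing:

```python
import boto3

# Assumes credentials and a default region are already configured locally.
s3 = boto3.client("s3")

bucket = "your-unique-bucket-name"  # S3 bucket names are globally unique

# Upload the training and held-out CSV files used in this walkthrough.
for filename in ["LT67_training.csv", "LT67_heldout.csv"]:
    s3.upload_file(filename, bucket, filename)
    print(f"Uploaded {filename} to s3://{bucket}/{filename}")
```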

Four IBM facial recognition patents in 2018, we found intriguing

Natasha Mathur
11 Aug 2018
10 min read
The media has gone into a frenzy over Google's latest facial recognition patent, which shows an algorithm can track you across social media and gather your personal details. We thought we'd dive further into what other patents have been applied for in facial recognition technology in 2018. What we discovered was an eye opener (pun intended). Google is only the third-largest applicant, with IBM and Samsung leading the patent race in facial recognition. As of August 10, 2018, 1,292 patents related to facial recognition have been granted in 2018. Of those, IBM received 53. Here is the summary comparison of leading companies in facial recognition patents in 2018.

Read Also: Top four Amazon patents in 2018 that use machine learning, AR, and robotics

IBM has always been at the forefront of innovation. Let's go back several decades, to when IBM introduced its first general-purpose computer for business. It built complex software programs that helped launch the Apollo missions, putting the first man on the moon. Its chess-playing computer, Deep Blue, beat Garry Kasparov in a traditional chess match back in 1997 (the first time a computer beat a world champion). Its researchers are known for winning Nobel Prizes. Coming back to 2018, IBM unveiled the world's fastest supercomputer with AI capabilities, and last month it beat Wall Street expectations by making $20 billion in revenue for the quarter, with a market capitalization worth $132.14 billion as of August 9, 2018. Its patents are a major part of why it continues to be so highly valued.

IBM continues to come up with cutting-edge innovations, and to protect these proprietary inventions, it applies for patent grants. The United States is the largest consumer market in the world, so patenting the technologies that companies come out with is a standard way to attain competitive advantage. As per the United States Patent and Trademark Office (USPTO), a patent is an exclusive right to an invention and "the right to exclude others from making, using, offering for sale, or selling the invention in the United States or "importing" the invention into the United States".

As always, IBM has applied for patents for a wide spectrum of technologies this year, from Artificial Intelligence, Cloud, Blockchain, and Cybersecurity to Quantum Computing. Today we focus on IBM's patents in the facial recognition field in 2018.

Four IBM facial recognition innovations patented in 2018

Facial recognition is a technology that identifies and verifies a person from a digital image or a video frame from a video source, and IBM seems quite invested in it.

Controlling privacy in a face recognition application

Date of patent: January 2, 2018
Filed: December 15, 2015

Features: IBM has patented a face recognition application titled "Controlling privacy in a face recognition application". Face recognition technologies can be used on mobile phones and wearable devices, which may hamper user privacy. This happens when a "sensor" mobile user identifies a "target" mobile user without his or her consent. Present mobile device manufacturers don't provide privacy mechanisms for addressing this issue. This is the major reason why IBM has patented this technology.

Editor's Note: This looks like an answer to the concerns raised over Google's recent social media profiling facial recognition patent.

How does it work?

Controlling privacy in a face recognition application

It consists of a privacy control system, which is implemented using a cloud computing node.
The system uses a camera to find out information about the people, by using a face recognition service deployed in the cloud. As per the patent application “the face recognition service may have access to a face database, privacy database, and a profile database”. Controlling privacy in a face recognition application The facial database consists of one or more facial signatures of one or more users. The privacy database includes privacy preferences of target users. Privacy preferences will be provided by the target user and stored in the privacy database.The profile database contains information about the target user such as name, age, gender, and location. It works by receiving an input which includes a face recognition query and a digital image of a face. The privacy control system then detects a facial signature from the digital image. The target user associated with the facial signature is identified, and profile of the target user is extracted. It then checks the privacy preferences of the user. If there are no privacy preferences set, then it transmits the profile to the sensor user. But, if there are privacy preferences then the censored profile of the user is generated omitting out the private elements in the profile. There are no announcements, as for now, regarding when this technology will hit the market. Evaluating an impact of a user's content utilized in a social network Date of patent: January 30, 2018 Filed: April 11, 2015 Features:  IBM has patented for an application titled “Evaluating an impact of a user's content utilized in a social network”.  With so much data floating around on social network websites, it is quite common for the content of a document (e.g., e-mail message, a post, a word processing document, a presentation) to be reused, without the knowledge of an original author. Evaluating an impact of a user's content utilised in a social network Evaluating an impact of a user's content utilized in a social network Because of this, the original author of the content may not receive any credit, which creates less motivation for the users to post their original content in a social network. This is why IBM has decided to patent for this application. Evaluating an impact of a user's content utilized in a social network As per the patent application, the method/system/product  “comprises detecting content in a document posted on a social network environment being reused by a second user. The method further comprises identifying an author of the content. The method additionally comprises incrementing a first counter keeping track of a number of times the content has been adopted in derivative works”. There’s a processor, which generates an “impact score” which  represents the author's ability to influence other users to adopt the content. This is based on the number of times the content has been adopted in the derivative works. Also, “the method comprises providing social credit to the author of the content using the impact score”. Editor’s Note: This is particularly interesting to us as IBM, unlike other tech giants, doesn’t own a popular social network or media product. (Google has Google+, Microsoft has LinkedIn, Facebook and Twitter are social, even Amazon has stakes in a media entity in the form of Washington Post). No information is present about when or if this system will be used among social network sites. 
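To make the flow described above concrete, here is a toy sketch of the lookup logic. Every data structure and function name here is hypothetical and only illustrates the behavior the patent describes, not IBM's actual implementation:

```python
# Hypothetical illustration of the privacy-control flow described above.
# The databases and matching logic are stand-ins, not IBM's actual system.
FACE_DB = {"sig-123": "alice"}                         # facial signature -> target user
PRIVACY_DB = {"alice": {"hide": ["age", "location"]}}  # target user's privacy preferences
PROFILE_DB = {"alice": {"name": "Alice", "age": 34, "gender": "F", "location": "Boston"}}

def handle_query(facial_signature: str) -> dict:
    """Return the target user's profile, censored per their privacy preferences."""
    user = FACE_DB.get(facial_signature)
    if user is None:
        return {}
    profile = dict(PROFILE_DB[user])
    prefs = PRIVACY_DB.get(user)
    if prefs:  # censor the fields the target user marked as private
        for field in prefs["hide"]:
            profile.pop(field, None)
    return profile

print(handle_query("sig-123"))  # {'name': 'Alice', 'gender': 'F'}
```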
Spoof detection for facial recognition Date of patent: February 20, 2018 Filed: December 10, 2015 Features: IBM patented an application named “Spoof detection for facial recognition”.  It provides a method to determine whether the image is authentic or not. As per the patent “A facial recognition system is a computer application for automatically identifying or verifying a person from a digital image or a video frame from a video source.” Editor’s Note: This seems to have a direct impact on the work around tackling deepFakes, which incidentally is something DARPA is very keen on. Could IBM be vying for a long term contract with the government? How it works? The patent consists of a system that helps detect “if a face in a facial recognition authentication system is a three-dimensional structure based on multiple selected images from the input video”.                                      Spoof detection for facial recognition There are four or more two-dimensional feature points which are located via an image processing device connected to the camera. Here the two-dimensional feature points do not lie on the same two-dimensional plane. The patent reads that “one or more additional images of the user's face can be received with the camera; and, the at least four two-dimensional feature points can be located on each additional image with the image processor. The image processor can identify displacements between the two-dimensional feature points on the additional image and the two-dimensional feature points on the first image for each additional image” Spoof detection for facial recognition There is also a processor connected to the image processing device that helps figure out whether the displacements conform to a three-dimensional surface model. The processor can then determine whether to authenticate the user depending on whether the displacements conform to the three-dimensional surface model. Facial feature location using symmetry line Date of patent: June 5, 2018 Filed: July 20, 2015 Features: IBM patented for an application titled “Facial feature location using symmetry line”. As per the patent, “In many image processing applications, identifying facial features of the subject may be desired. Currently, location of facial features require a search in four dimensions using local templates that match the target features. Such a search tends to be complex and prone to errors because it has to locate both (x, y) coordinates, scale parameter and rotation parameter”. Facial feature location using symmetry line Facial feature location using symmetry line The application consists of a computer-implemented method that obtains an image of the subject’s face. After that it automatically detects a symmetry line of the face in the image, where the symmetry line intersects at least a mouth region of the face. It then automatically locates a facial feature of the face using the symmetry line. There’s also a computerised apparatus with a processor which performs the steps of obtaining an image of a subject’s face and helps locate the facial feature.  Editor’s note: Atleast, this patent makes direct sense to us. IBM is majorly focusing on bring AI to healthcare. A patent like this can find a lot of use in not just diagnostics and patient care, but also in cutting edge areas like robotics enabled surgeries. IBM is continually working on new technologies to provide the world with groundbreaking innovations. 
IBM's big investments in facial recognition technology speak volumes about how seriously it takes the technology's possibilities. With progress in facial recognition come privacy fears, but IBM's privacy-control patent addresses this by letting users set their own privacy preferences; this could become a benchmark, as not many existing applications currently do it. The social credit evaluation patent could help give credit back to users who post original content on social media platforms. The spoof detection application will help maintain authenticity by detecting forged images. Lastly, facial feature detection can act as a useful additional capability for image processing applications. There are no guarantees from IBM as to whether these patents will ever make it into practical applications, but they say a lot about how IBM thinks about the technology.

Four interesting Amazon patents in 2018 that use machine learning, AR, and robotics

Facebook patents its news feed filter tool to provide more relevant news to its users

Google's new facial recognition patent uses your social network to identify you!

article-image-time-series-modeling-what-is-it-why-it-matters-how-its-used

Time series modeling: What is it, Why it matters and How it's used

Sunith Shetty
10 Aug 2018
11 min read
A series can be defined as a number of events, objects, or people of a similar or related kind coming one after another; if we add the dimension of time, we get a time series. A time series can therefore be defined as a series of data points in time order. In this article, we will understand what a time series is and why it is essential for forecasting. This article is an excerpt from a book written by Harish Gulati titled SAS for Finance.

The importance of time series

What importance, if any, does time series have and how will it be relevant in the future? These are just a couple of fundamental questions that any user should find answers to before delving further into the subject. Let's try to answer this by posing a question. Have you heard the terms big data, artificial intelligence (AI), and machine learning (ML)? These three terms make learning time series analysis relevant. Big data is primarily about a large amount of data that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interaction. AI is a kind of technology that is being developed by data scientists, computational experts, and others to enable processes to become more intelligent, while ML is an enabler that is helping to implement AI. All three of these terms are interlinked with the data they use, and a lot of this data is time series in nature. This could be financial transaction data, the behavior pattern of individuals during various parts of the day, or data related to life events that we might experience. An effective mechanism that enables us to capture the data, store it, analyze it, and then build algorithms to predict transactions, behavior (and life events, in this instance) will depend on how big data is utilized and how AI and ML are leveraged.

A common perception in the industry is that time series data is used for forecasting only. In practice, time series data is used for:

- Pattern recognition
- Forecasting
- Benchmarking
- Evaluating the influence of a single factor on the time series
- Quality control

For example, a retailer may identify a pattern in clothing sales every time it gets a celebrity endorsement, or an analyst may decide to use car sales volume data from 2012 to 2017 to set a selling benchmark in units. An analyst might also build a model to quantify the effect of Lehman's crash, at the height of the 2008 financial crisis, in pushing up the price of gold. Variance in the success of treatments across time periods can also be used to highlight a problem, the tracking of which may enable a hospital to take remedial measures. These are just some of the examples that showcase how time series analysis isn't limited to forecasting. In this article, we will review how the financial industry and others use forecasting, discuss what a good and a bad forecast is, and aim to understand the characteristics of time series data and its associated problems.

Forecasting across industries

Since one of the primary uses of time series data is forecasting, it's wise that we learn about some of its fundamental properties. To understand what the industry means by forecasting and the steps involved, let's visit a common misconception about the financial industry: that only lending activities require forecasting.
We need forecasting in order to grant personal loans, mortgages, and overdrafts, or simply to assess someone's eligibility for a credit card, as the industry uses forecasting to assess a borrower's affordability and their willingness to repay the debt. Even deposit products such as savings accounts, fixed-term savings, and bonds are priced based on forecasts. How we forecast and the rationale for that methodology differ between borrowing and lending cases, however. All of these areas are related to time series, as we inevitably end up using time series data as part of the overall analysis that drives financial decisions. Let's understand the forecasts involved here a bit better.

When we are assessing an individual's lending needs and limits, we are forecasting for a single person, yet comparing the individual to a pool of good and bad customers who have been offered similar products. We are also assessing the individual's financial circumstances and behavior through industry-available scoring models, or by assessing their past behavior against the financial provider's lending criteria.

In the case of deposit products, as long as the customer is eligible to transact (can open an account and has passed know your customer (KYC), anti-money laundering (AML), and other checks), financial institutions don't perform forecasting at an individual level. Instead, the behavior of a particular customer is primarily driven by the interest rate offered by the financial institution. The interest rate, in turn, is driven by the forecasts the financial institution has done to assess its overall treasury position. The treasury is the department that manages the bank's money and has the responsibility of ensuring that all departments are funded; this funding is generated by lending and by attracting deposits at a lower rate than the bank lends at. The treasury forecasts its requirements for lending and deposits, while various teams within the treasury adhere to those limits. Therefore, a pricing manager for a deposit product will price the product in such a way that it attracts enough deposits to meet the forecasted targets shared by the treasury; the pricing manager also has to ensure that those targets aren't overshot by a significant margin, as the treasury only expects to manage a forecasted target.

In both lending and deposit decisions, financial institutions do tend to use forecasting. A lot of these forecasts are interlinked, as we saw in the example of the treasury's expectations and the subsequent pricing decision for a deposit product. To decide on its future lending and borrowing positions, the treasury must have used time series data to determine the potential business appetite for lending and borrowing in the market, and would have assessed that against the current cash flow situation within the relevant teams and institutions.

Characteristics of time series data

Any time series analysis has to take into account the following factors:

- Seasonality
- Trend
- Outliers and rare events
- Disruptions and step changes

Seasonality

Seasonality is a phenomenon that occurs each calendar year: the same behavior can be observed each year, and a good forecasting model will be able to incorporate the effect of seasonality in its forecasts. Christmas is a great example of seasonality, where retailers have come to expect higher sales over the festive period. Seasonality can extend into months but is usually only observed over days or weeks. The short sketch after this paragraph shows how such a recurring festive-period uplift surfaces once sales are averaged by calendar month.
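The following is a small pandas sketch, not from the book (which uses SAS): it fabricates four years of monthly sales with a December uplift and then averages by calendar month, exposing exactly the kind of seasonal signal a forecasting model needs to carry into future years. The magnitudes are made up purely for illustration.

```python
import numpy as np
import pandas as pd

# Four years of synthetic monthly sales: gentle upward trend, December uplift, noise.
dates = pd.date_range("2014-01-31", periods=48, freq="M")
trend = 100 + 0.5 * np.arange(48)
festive = np.where(dates.month == 12, 25, 0)      # Christmas-period bump
noise = np.random.default_rng(0).normal(0, 3, 48)
sales = pd.Series(trend + festive + noise, index=dates)

# Averaging by calendar month exposes the recurring seasonal pattern.
monthly_profile = sales.groupby(sales.index.month).mean()
print(monthly_profile.round(1))   # month 12 stands out well above the other months
```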
When looking at a time series where the periodicity is hours, you may find a seasonality effect for certain hours of the day. Some of the reasons for seasonality include holidays, climate, and changes in social habits. For example, travel companies usually run far fewer services on Christmas Day, citing a lack of demand. During most holidays people love to travel, but this lack of demand on Christmas Day can be attributed to social habits, where people tend to stay at home or have already traveled. Social habit becomes a driving factor in the seasonality of journeys undertaken on Christmas Day.

It's easier for the forecaster when a particular seasonal event occurs on a fixed calendar date each year; the issue comes when popular holidays depend on lunar movements, such as Easter, Diwali, and Eid. These holidays may occur in different weeks or months over the years, which shifts the seasonality effect. Also, if some holidays fall close to other holiday periods, individuals may take extended holidays and travel sales may increase more than expected in such years. The coffee shop near the office may also experience lower sales for a longer period. Changes in the weather can also impact seasonality; for example, a longer, warmer summer may be welcome in the UK, but it would dent retail sales in the autumn as most shoppers wouldn't need to buy a new wardrobe. In hotter countries, sales of air-conditioners would increase substantially compared to the usual summer seasonality. Forecasters can offset this unpredictability in seasonality by building in a weather forecast variable. We will explore similar challenges in the chapters ahead.

Seasonality shouldn't be confused with a cyclic effect. A cyclic effect is observed over a longer period, generally two years or more. The property sector is often associated with a cyclic effect, with long periods of growth or slowdown before the cycle continues.

Trend

A trend is merely a long-term direction of observed behavior that is found by plotting data against a time component. A trend may indicate an increase or decrease in behavior. Trends may not even be linear, but a broad movement can be identified by analyzing plotted data.

Outliers and rare events

Outliers and rare events are terms that are often used interchangeably by businesses. These concepts can have a big impact on data, and some sort of outlier treatment is usually applied to data before it is used for modeling. It is almost impossible to predict an outlier or rare event, but they do affect a trend. An example of an outlier could be a customer walking into a branch to deposit an amount that is 100 times the daily average of that branch. In this case, the forecaster wouldn't expect that trend to continue.

Disruptions

Disruptions and step changes are becoming more common in time series data. One reason for this is the abundance of available data and the growing ability to store and analyze it. Disruptions could include instances when a business hasn't been able to trade as normal. Flooding at the local pub may lead to reduced sales for a few days, for example. While analyzing daily sales across a pub chain, an analyst may have to make note of such a disruptive event and its impact on the chain's revenue. Step changes are also more common now due to technological shifts, mergers and acquisitions, and business process re-engineering. When two companies announce a merger, they often try to sync their data.
They might have been selling x and y quantities individually, but after the merger they will expect to sell x + y + c (where c is the positive or negative effect of the merger). Over time, when someone plots sales data in this case, they will probably spot a step change in sales that happened around the time of the merger, as shown in the following screenshot:

Trend, seasonality, and step change and disruptions charts for online travel bookings

In the trend graph, we can see that online travel bookings are increasing. In the step change and disruptions chart, we can see that Q1 of 2012 saw a substantive increase in bookings, while Q1 of 2014 saw a substantive dip. The increase was due to the merger of two companies that took place in Q1 of 2012. The decrease in Q1 of 2014 was attributed to prolonged snow storms in Europe and the ash cloud disruption from volcanic activity over Iceland. While online bookings kept increasing after the step change, the disruption caused by the snow storm and ash cloud only affected sales in Q1 of 2014. In this case, the modeler will have to treat the merger and the disruption differently while using them in the forecast, as the disruption can be disregarded as an outlier and treated accordingly. Also note that the seasonality chart shows that Q4 of each year sees almost a 20% increase in travel bookings, and this pattern repeats each calendar year.

In this article, we defined time series and learned why it is important for forecasting. We also looked at the characteristics of time series data. To learn more about how to leverage the analytical power of SAS to perform financial analysis efficiently, you can check out the book SAS for Finance.

Read more

Getting to know SQL Server options for disaster recovery

Implementing a simple Time Series Data Analysis in R

Training RNNs for Time Series Forecasting