
How-To Tutorials

6719 Articles
How NeurIPS 2018 is taking on its diversity and inclusion challenges

Sugandha Lahoti
06 Dec 2018
3 min read
This year, the Neural Information Processing Systems Conference is asking serious questions about how to improve diversity, equity, and inclusion at NeurIPS. "Our goal is to make the conference as welcoming as possible to all," said the conference's new diversity and inclusion chairs, a role introduced this year.

https://twitter.com/InclusionInML/status/1069987079285809152

The Diversity and Inclusion chairs are Hal Daume III, a professor at the University of Maryland who researches machine learning and fairness at Microsoft Research, and Katherine Heller, an assistant professor at Duke University and research scientist at Google Brain. They opened their talk by acknowledging the privilege they carry as a white man and a white woman, and the fact that they do not reflect the diversity of experience in the conference room, much less the world.

They laid out three major goals for inclusion at NeurIPS:

Learn about the challenges their colleagues have faced.
Support those doing the hard work of amplifying the voices of people who have been historically excluded.
Begin structural changes that will positively impact the community over the coming years.

They urged attendees to help build an environment where everyone can do their best work, asking people to:

see other perspectives
remember the feeling of being an outsider
listen, do research, and learn
make an effort and speak up

Concrete actions taken by the NeurIPS diversity and inclusion chairs

This year the chairs assembled an advisory board and ran a demographics and inclusion survey. They supported events such as WIML (Women in Machine Learning), Black in AI, LatinX in AI, and Queer in AI. In collaboration with Google and DeepMind, they established childcare subsidies and related activities to support families attending NeurIPS, offering a stipend of up to $100 USD per day.

They revised the Code of Conduct to provide all participants an experience free from harassment, bullying, discrimination, and retaliation. They shared inclusion tips and advice related to D&I efforts on Twitter. The conference also offers pronoun stickers (currently limited to "they" and "them"), first-time attendee stickers, and information for participant needs.

They have also made significant infrastructure improvements around visa handling: holding discussions with the people handling visas on location, sending out early invitation letters for visas, and choosing future locations with visa processing in mind. Looking ahead, they plan to establish a legal team for matters around the Code of Conduct, to pursue institutional structural changes that support the community, and to improve coordination around affinity groups and workshops.

For the first time, NeurIPS also invited a diversity and inclusion (D&I) speaker, Laura Gomez, to talk about the lack of diversity in the tech industry, which leads to biased algorithms, faulty products, and unethical tech.

Head over to the NeurIPS website for tutorials, invited talks, product releases, demonstrations, presentations, and announcements.

NeurIPS 2018: Deep learning experts discuss how to build adversarially robust machine learning models
NeurIPS 2018 paper: DeepMind researchers explore autoregressive discrete autoencoders (ADAs) to model music in raw audio at scale
NeurIPS 2018: A quick look at data visualization for Machine learning by Google PAIR researchers [Tutorial]

NeurIPS 2018: A quick look at data visualization for Machine learning by Google PAIR researchers [Tutorial]

Natasha Mathur
05 Dec 2018
9 min read
The 32nd annual NeurIPS (Neural Information Processing Systems) Conference, formerly known as NIPS, is being hosted in Montreal, Canada this week. The conference, the biggest machine learning conference of the year, started on 2nd December and ends on 8th December, and features a series of tutorials, invited talks, product releases, demonstrations, presentations, and announcements related to machine learning research.

One of the tutorials presented at NeurIPS earlier this week was "Visualization for machine learning" by Fernanda Viegas and Martin Wattenberg. Viegas and Wattenberg are co-leads of Google's PAIR (People + AI Research) initiative, which is part of Google Brain. Their work in machine learning focuses on transparency and interpretability, to improve human-AI interaction and to democratize AI technology. Here are some key highlights from the tutorial, which covers how visualization works, common visualization techniques, and uses of visualization in machine learning.

Viegas opened the talk by explaining the concept of data visualization: the process of representing and transforming data into visual encodings and context. It is used for data exploration, for gaining scientific insight, and for communicating results more clearly.

How does data visualization work?

Data visualization works by "finding visual encodings": you take data and transform it into visual encodings, which then perform several different functions. First, they guide the viewer's attention through the data. Viegas explains that if our brains are given "the right kind of visual stimuli", our visual system works dramatically faster; human visual systems are acutely sensitive to differences in shape, alignment, color, size, and so on. Second, encodings communicate the data effectively to the viewer, and third, they let the viewer make comparisons and calculations from the data. Once these functions are in place, the data can be explored interactively on a computer.

Wattenberg explains that different encodings have different properties. For instance, "position" and "length" are as good as text for communicating exact values within data, while "area" and "color" are good for drawing the viewer's attention. He gives the example of ColorBrewer, a color advice tool by the cartographer Cynthia Brewer that lets you try out different color palettes and scales; it is a handy tool when choosing colors for data visualization. A related trick, says Viegas, is to pick a palette or scale in which no single color looks more prominent than the others, since a dominant color can be read as one category being more important than another.

Common visualization techniques

Data density. When you have a lot of data, Viegas explains, you can use "small multiples": you reuse the same chart over and over again for each moment that is important. The example she presents is a New York Times infographic of drought in the US over the decades, where each row is a decade's worth of drought data. Another thing to notice is that the background color in the visualization is very faint, so that the map of the US recedes into the background. This is because the map is not the most important thing; the drought information is what needs to pop out, so a sharp, saturated highlight color is used for the drought itself.

Data faceting. Another technique discussed by Viegas is data faceting, which is essentially combining two different visualizations to understand and analyze the data better. The example shown visualizes the tax rates of different companies around the US and how much the tax amount varies among them. Each circle is a company, sized differently, and the color shows a distribution that goes from the lowest tax rate on the left to the highest on the right. "Just by looking at the distribution, you can tell that the tax rates are going up the further to the right they are. They have also calculated the tax rate for the entire distribution, so they are packing a ton of info in this graph," says Viegas. A second tab, "view by industry", shows another visualization presenting the distribution for each industry, along with its tax rates and some commentary, from utilities to insurance.

Visualization uses in ML

If you look at the machine learning pipeline, you can identify the stages where visualization is particularly needed and helpful. "It's thinking about through acquiring data, as you implement a model, training and when you deploy it for monitoring", says Wattenberg. Visualization is mainly used in machine learning for exploring training data, monitoring performance, improving interpretability, understanding high-dimensional data, and for education and communication. Let's look at some of these.

Visualizing training data

To explain why visualizing training data is useful, Viegas uses the example of CIFAR-10, a dataset of images commonly used to train machine learning and computer vision algorithms. She points out that there are many tools for looking at your data; one of them is Facets, an open source visualization tool for machine learning training data. In the demo, the pictures in CIFAR-10 are organized into categories such as airplane, automobile, bird, and so on. Facets not only gives a clear view of the different categories, it can also help with analyzing mistakes in your data: Facets Overview gives a sense of the shape of each feature of the dataset, while Facets Dive lets you explore individual observations. Together, these views help surface mistakes in your data and give an automatic understanding of the distribution of values across the different features of a dataset.

Visualizing performance monitoring

Viegas quickly went over how visualization shows up in performance monitoring, usually in the form of monitoring dashboards used almost daily in machine learning. Performance monitoring visualization relies on graphs and line charts, because while monitoring performance you are constantly trying to make sure that your system is working correctly and doing what it is supposed to do.

Visualizing interpretability

Interpretability in machine learning is the degree to which a human can consistently predict a model's result. Viegas discusses interpretability visualization by breaking it down into visualization for CNNs and for RNNs.

CNNs (Convolutional Neural Networks). Viegas describes image classification as a kind of petri dish for interpretability work: image classifiers are effective in practice, but what they do and how they do it remains mysterious, and their failures add to the mystery. Because classifiers are visual, it can also be hard to pin down exactly what they do, such as which features the networks really use or what roles are played by different layers. One example she presents is saliency maps, which show each pixel's contribution to a prediction and simplify or change the representation of an image into something more meaningful and easier to analyze. "The idea with saliency maps is to consider the sensitivity of class to each pixel. These can be sometimes deceiving, visually noisy,..and ..sometimes easy to project on them what you're seeing", adds Viegas. Another example that has been very helpful for visualizing CNNs is DrawNet by Antonio Torralba; this visualization is particularly good at showing people who are not from a machine learning background how neural networks actually work.

RNNs (Recurrent Neural Networks). For RNNs, Viegas presented Karpathy's visualization of text sequences, which looks at whether activating different cells can be interpreted. The color scale is friendly and the color is layered right on top of the data, making it a good example of the right trade-off when selecting colors to represent quantitative data, explains Wattenberg. Viegas further pointed out that it is usually better to go back to the raw data (in this case, text) and show it to the user, since that makes the visualization more effective.

Visualizing high-dimensional data

Wattenberg explains that visualizing high-dimensional data is very tough, almost "impossible", but there are approaches that help. They fall into two groups: linear and non-linear. Linear approaches include principal component analysis (PCA) and visualizing labeled data through linear transformations. Non-linear approaches include multidimensional scaling, Sammon mapping, t-SNE, and UMAP.

Wattenberg gives the example of running PCA in the Embedding Projector on MNIST, a large database of handwritten digits commonly used for training image processing systems. PCA does a reasonable job of visualizing MNIST, but non-linear methods are more effective because the clusters of digits get separated much more cleanly. There is, however, a lot of trickiness involved. t-SNE, a fairly complex non-linear technique with an adaptive sense of distance, translates well between the geometry of high- and low-dimensional space and is effective for visualizing high-dimensional data. Another method, UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction), is faster than t-SNE, embeds the data efficiently into low dimensions, and captures the global structure better.
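The linear-versus-non-linear contrast Wattenberg describes is easy to try out yourself. The following Python sketch is not from the tutorial; it assumes scikit-learn and matplotlib are installed and uses the small digits dataset bundled with scikit-learn as a stand-in for MNIST:

import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()            # 8x8 handwritten digits, 64 dimensions per image
X, y = digits.data, digits.target

# Linear approach: project onto the first two principal components
X_pca = PCA(n_components=2).fit_transform(X)

# Non-linear approach: t-SNE, which adapts its notion of distance locally
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, embedding, title in zip(axes, (X_pca, X_tsne), ('PCA', 't-SNE')):
    ax.scatter(embedding[:, 0], embedding[:, 1], c=y, cmap='tab10', s=8)
    ax.set_title(title)
plt.show()

Swapping TSNE for UMAP from the umap-learn package follows the same fit_transform pattern, which makes the different families of projections easy to compare side by side.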
Having seen how visualization is used in ML and which tools and methods work well, data scientists can start experimenting with and refining existing visualization methods, or even invent entirely new visual techniques. Now that you have a head start, dive into the full tutorial on the NeurIPS page.

NeurIPS 2018: Deep learning experts discuss how to build adversarially robust machine learning models
NeurIPS 2018 paper: DeepMind researchers explore autoregressive discrete autoencoders (ADAs) to model music in raw audio at scale
NIPS finally sheds its 'sexist' name for NeurIPS

NGI0 Consortium to award grants worth 5.6 million euro to open internet projects that promote inclusivity, privacy, interoperability, and data protection

Bhagyashree R
03 Dec 2018
3 min read
NLnet Foundation announced on Saturday that it is now taking submissions for project proposals that can deliver "potential break-through contributions to the open internet". Projects will be judged on their technical merits, their strategic relevance to the Next Generation Internet, and overall value for money. Proposals can be submitted under separate themes such as NGI Zero PET and NGI Zero Discovery.

The foundation will invest 5.6 million euros from 2018 to 2021 in small to medium-sized R&D grants aimed at improving search and discovery and at privacy- and trust-enhancing technologies. It is seeking project proposals between 5,000 and 50,000 euros, with the chance to scale them up if they show proven potential. The deadline for submitting proposals is February 1st, 2019, 12:00 CET.

NLnet Foundation supports the open internet and the privacy and security of internet users. The foundation helps independent organizations and people who contribute to an open information society by providing microgrants, advice, and access to a global network.

Next Generation Internet (NGI): Creating an open, trustworthy, and reliable internet for all

The European Commission launched the NGI initiative in 2016, aiming to make the internet an interoperable platform ecosystem. This future internet should respect human and societal values such as openness, inclusivity, transparency, privacy, cooperation, and protection of data. NGI wants to make the internet more human-centric while also driving the adoption of advanced concepts and methodologies in domains such as artificial intelligence, the Internet of Things, interactive technologies, and more.

To achieve these goals, NLnet has launched projects like NGI Zero Discovery and NGI Zero Privacy and Trust Enhancing Technologies (PET). NGI Zero Discovery aims to give individual researchers and developers an agile, effective, and low-threshold funding mechanism, helping them bring in new ideas that contribute to the establishment of the Next Generation Internet; the resulting projects will be made available as free/libre/open source software. NGI Zero PET is the sister project of NGI Zero Discovery, and its objective is to equip people with new technologies that give them better privacy.

On its website, NLnet said these investments are meant to help researchers and developers create an open internet: "Trust is one of the key drivers for the Next Generation Internet, and an adequate level of privacy is a non-negotiable requirement for that. We want to assist independent researchers and developers to create powerful new technology and to help them put it in the hands of future generations as building blocks for a fair and democratic society and an open economy that benefits all."

To read more, check out the NLnet Foundation's official website.

The State of Mozilla 2017 report focuses on internet health and user privacy
Tim Berners-Lee plans to decentralize the web with 'Solid', an open-source project for "personal empowerment through data"
Has the EU just ended the internet as we know it?

Marriott’s Starwood guest database faces a massive data breach affecting 500 million user data

Savia Lobo
03 Dec 2018
5 min read
Last week, the hospitality company Marriott International disclosed a massive data breach that exposed the personal and financial information of its customers. According to Marriott, the breach had been going on for the past four years and harvested information about customers who made reservations in its Starwood subsidiary. The breach affected approximately 500 million guests. For approximately 327 million of them, the exposed information includes some combination of name, mailing address, phone number, email address, passport number, Starwood Preferred Guest ("SPG") account information, date of birth, gender, arrival and departure information, reservation date, and communication preferences.

The four-year-long breach that hit Marriott's customer data

On September 8, 2018, Marriott received an alert from an internal security tool reporting that attempts had been made to access the Starwood guest reservation database in the United States. Marriott then carried out an investigation, which revealed that its Starwood network had been accessed by attackers since 2014. According to Marriott's news center, "On November 19, 2018, the investigation determined that there was unauthorized access to the database, which contained guest information relating to reservations at Starwood properties on or before September 10, 2018."

For some of the 500 million guests, the exposed information includes payment card details such as card numbers and expiration dates. However, "the payment card numbers were encrypted using Advanced Encryption Standard encryption (AES-128). There are two components needed to decrypt the payment card numbers, and at this point, Marriott has not been able to rule out the possibility that both were taken. For the remaining guests, the information was limited to name and sometimes other data such as mailing address, email address, or other information", stated the Marriott news release.

Arne Sorenson, Marriott's President and Chief Executive Officer, said, "We will continue to support the efforts of law enforcement and to work with leading security experts to improve. Finally, we are devoting the resources necessary to phase out Starwood systems and accelerate the ongoing security enhancements to our network." Marriott has reported the incident to law enforcement and is notifying regulatory authorities.

This is not the first time Starwood data was breached

Marriott did not say exactly when in 2014 the breach began. However, its subsidiary Starwood revealed, a few days after being acquired by Marriott, that more than 50 of Starwood's properties had been breached in November 2015. According to Starwood's disclosure at the time, that earlier breach stretched back at least one year, to November 2014. According to Krebs on Security, "Back in 2015, Starwood said the intrusion involved malicious software installed on cash registers at some of its resort restaurants, gift shops and other payment systems that were not part of its guest reservations or membership systems." In December 2016, KrebsOnSecurity reported that "banks were detecting a pattern of fraudulent transactions on credit cards that had one thing in common: They'd all been used during a short window of time at InterContinental Hotels Group (IHG) properties, including Holiday Inns and other popular chains across the United States."

Marriott said that its own network has not been affected by this four-year data breach and that the investigation only identified unauthorized access to the separate Starwood network. "Marriott is providing its affected guests in the United States, Canada, and the United Kingdom a free year's worth of service from WebWatcher, one of several companies that advertise the ability to monitor the cybercrime underground for signs that the customer's personal information is being traded or sold", said Krebs on Security.

What should compromised users do?

Companies can pay threat hunters to look out for new intrusions, test their own networks and employees for weaknesses, and run drills to check their breach response preparedness. Individuals who reuse the same password should consider a password manager, which remembers strong passwords and passphrases for each site and essentially lets you use a single strong master passphrase everywhere. Krebs on Security's "assume you're compromised" philosophy "involves freezing your credit files with the major credit bureaus and regularly ordering free copies of your credit file from annualcreditreport.com to make sure nobody is monkeying with your credit (except you)." Rob Rosenberger, co-founder of Vmyths, wryly advised everyone who has booked a room at a Starwood property since 2014 to change their mother's maiden name and their social security number soon.

https://twitter.com/vmyths/status/1069273409652224000

To know more about the Marriott breach in detail, visit Marriott's official website.

Uber fined by British ICO and Dutch DPA for nearly $1.2m over a data breach from 2016
Dell reveals details on its recent security breach
Twitter on the GDPR radar for refusing to provide a user his data due to 'disproportionate effort' involved

Microsoft becomes the world's most valuable public company, moves ahead of Apple

Sugandha Lahoti
03 Dec 2018
3 min read
Last week, Microsoft moved ahead of Apple as the world's most valuable publicly traded U.S. company. On Friday, Microsoft closed with a market value of $851 billion, with Apple just short of that at $847 billion.

The move from Windows to cloud

Microsoft's success can be attributed to its leadership under CEO Satya Nadella and his decision to move away from the flagship Windows operating system and focus on cloud-computing services with long-term business contracts. The company's biggest growth has come from its Azure cloud platform; cloud computing now accounts for more than a quarter of Microsoft's revenue, rivaling Amazon, the other leading provider.

Microsoft keeps building new products and features for Azure. Last month, it announced container support for Azure Cognitive Services to build intelligent applications. In October, it invested in Grab to jointly pursue the Southeast Asian on-demand services market with Azure's Intelligent Cloud. In September, at Ignite 2018, the company announced major changes and improvements to its cloud offering, including Azure Functions 2.0 with better workload support for serverless, general availability of Microsoft's immutable storage for Azure Storage Blobs, and Azure DevOps. In August, Microsoft added Azure support for NVIDIA GPU Cloud (NGC) and a new governance DApp for Azure.

Wedbush analyst Dan Ives commented that "Azure is still in its early days, meaning there's plenty of room for growth, especially considering the company's large customer base for Office and other products. While the tech carnage seen over the last month has been brutal, shares of (Microsoft) continue to hold up like the Rock of Gibraltar."

Focus on business and values

Microsoft has also prioritized business-oriented services such as Office and other workplace software, as well as newer additions such as LinkedIn and Skype. In 2016, Microsoft bought LinkedIn, the social network for professionals, for $26.2 billion, and this year it paid $7.5 billion for GitHub, an open software platform used by 28 million programmers.

Another reason Microsoft is flourishing is its focus on upholding its founding values without compromising on issues like internet censorship and surveillance. Daniel Morgan, senior portfolio manager for Synovus Trust, says, "Microsoft is outperforming its tech rivals in part because it doesn't face as much regulatory scrutiny as advertising-hungry Google and Facebook, which have attracted controversy over their data-harvesting practices. Unlike Netflix, it's not on a hunt for a diminishing number of international subscribers. And while Amazon also has a strong cloud business, it's still more dependent on online retail."

In a recent episode of Pivot with Kara Swisher and Scott Galloway, the two hosts also discussed why Microsoft is now more valuable than Apple. Scott argued that Microsoft's success comes from Nadella's decision to diversify the business into enough verticals that the company hasn't been hit as hard by the recent decline in tech stocks, and that Satya Nadella deserves the title of "tech CEO of the year".

Microsoft wins $480 million US Army contract for HoloLens
Microsoft amplifies focus on conversational AI: Acquires XOXCO; shares guide to developing responsible bots
Microsoft announces official support for Windows 10 to build 64-bit ARM apps

Google bypassed its own security and privacy teams for Project Dragonfly reveals Intercept

Sugandha Lahoti
30 Nov 2018
5 min read
Google's Project Dragonfly has faced significant criticism and scrutiny from both the public and Google employees. In a major report yesterday, The Intercept revealed how internal conversations around Google's censored search engine for China shut out Google's legal, privacy, and security teams. According to named and anonymous senior Googlers who worked on the project and spoke to The Intercept's Ryan Gallagher, company executives appeared intent on watering down the privacy review, and Google bosses worked to suppress employee criticism of the censored search engine.

Project Dragonfly is the secretive search engine that Google is allegedly developing to comply with Chinese censorship rules. It was kept secret from the company at large during the 18 months it was in development, until an insider leak led to its existence being revealed in The Intercept. Since then, it has faced a constant backlash from human rights organizations and investigative reporters. Earlier this week it drew criticism from Amnesty International, followed by Google employees signing a petition protesting the project.

The secretive way Google operated Dragonfly

Much of the leaked information came from Yonatan Zunger, a security engineer on the Dragonfly team who was asked to produce the privacy review for the project in early 2017. He faced opposition from Scott Beaumont, Google's top executive for China and Korea. According to Zunger, Beaumont "wanted the privacy review of [Dragonfly] to be pro forma and thought it should defer entirely to his views of what the product ought to be. He did not feel that the security, privacy, and legal teams should be able to question his product decisions, and maintained an openly adversarial relationship with them — quite outside the Google norm."

Beaumont also micromanaged the project and ensured that discussions about Dragonfly, and access to documents about it, stayed under his tight control. Members of the Dragonfly team who broke the strict confidentiality rules risked having their contracts at Google terminated.

Privacy report by Zunger

Despite these conditions, Zunger and his team were still able to produce a privacy report. The report described problematic scenarios that could arise if the search engine were launched in China: it would be difficult for Google to legally push back against government requests, to refuse to build systems specifically for surveillance, or even to notify people of how their data may be used. Zunger's meetings with the company's senior leadership to discuss the privacy report were repeatedly postponed. "When the meeting did finally take place, in late June 2017, I and my team were not notified, so we missed it and did not attend. This was a deliberate attempt to exclude us," Zunger said.

Dragonfly: not just an experiment

The Intercept's report even undercut Sundar Pichai's recent public statement on Dragonfly, in which he described it as "just an experiment" and added that it remained unclear whether the company "would or could" eventually launch it in China. Google employees were surprised, as they had been told to prepare the search engine for launch between January and April 2019, or sooner. "What Pichai said [about Dragonfly being an experiment] was ultimately horse shit," said one Google source with knowledge of the project. "This was run with 100 percent intention of launch from day one. He was just trying to walk back a delicate political situation."

It is also alleged that Beaumont had intended from day one that the project should only become known once it had launched. "He wanted to make sure there would be no opportunity for any internal or external resistance to Dragonfly," one Google source told The Intercept.

This makes us wonder how concerned Google really is about upholding its founding values, and how far it will go in advocating internet freedom, openness, and democracy. It now looks a lot like a company that simply prioritizes growth and expansion into new markets, even if that means compromising on issues like internet censorship and surveillance. Perhaps we shouldn't be surprised.

Google CEO Sundar Pichai is expected to testify in Congress on December 5 to discuss transparency and bias, and members of Congress will likely also ask about Google's plans in China. Public opinion on The Intercept's report is largely supportive.

https://twitter.com/DennGordon/status/1068228199149125634
https://twitter.com/mpjme/status/1068268991238541312
https://twitter.com/cynthiamw/status/1068240969990983680

Google employee and inclusion activist Liz Fong-Jones tweeted that she would match $100,000 in pledged donations to a fund to support employees who refuse to work in protest.

https://twitter.com/lizthegrey/status/1068212346236096513

She has also shown full support for Zunger.

https://twitter.com/lizthegrey/status/1068209548320747521

Google employees join hands with Amnesty International urging Google to drop Project Dragonfly
OK Google, why are you ok with mut(at)ing your ethos for Project DragonFly?
Amnesty International takes on Google over Chinese censored search engine, Project Dragonfly
Getting started with Web Scraping using Python [Tutorial]

Melisha Dsouza
29 Nov 2018
15 min read
Small manual tasks like scanning through information sources in search of small bits of relevant information are in fact, automatable.  Instead of performing tasks that get repeated over and over, we can use computers to do these kinds of menial tasks and focus our own efforts instead on what humans are good for—high-level analysis and decision making based on the result. This tutorial shows how to use the Python language to automatize common business tasks that can be greatly sped up if a computer is doing them. The code files for this article are available on Github. This tutorial is an excerpt from a book written by Jaime Buelta titled Python Automation Cookbook. The internet and the WWW (World Wide Web) is the most prominent source of information today.   In this article, we will learn to perform operations programmatically to automatically retrieve and process information. Python  requests module makes it very easy to perform these operations. We'll cover the following recipes: Downloading web pages Parsing HTML Crawling the web Accessing password-protected pages Speeding up web scraping Downloading web pages The basic ability to download a web page involves making an HTTP GET request against a URL. This is the basic operation of any web browser.  We'll see in this recipe how to make a simple request to obtain a web page. Install requests module: $ echo "requests==2.18.3" >> requirements.txt $ source .venv/bin/activate (.venv) $ pip install -r requirements.txt Download the example page because it is a straightforward HTML page that is easy to read in text mode. How to Download web pages Import the requests module: >>> import requests Make a request to the URL, which will take a second or two: >>> url = 'http://www.columbia.edu/~fdc/sample.html' >>> response = requests.get(url) Check the returned object status code: >>> response.status_code 200 Check the content of the result: >>> response.text '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html>\n<head>\n ... FULL BODY ... <!-- close the <html> begun above -->\n' Check the ongoing and returned headers: >>> response.request.headers {'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'} >>> response.headers {'Date': 'Fri, 25 May 2018 21:51:47 GMT', 'Server': 'Apache', 'Last-Modified': 'Thu, 22 Apr 2004 15:52:25 GMT', 'Accept-Ranges': 'bytes', 'Vary': 'Accept-Encoding,User-Agent', 'Content-Encoding': 'gzip', 'Content-Length': '8664', 'Keep-Alive': 'timeout=15, max=85', 'Connection': 'Keep-Alive', 'Content-Type': 'text/html', 'Set-Cookie': 'BIGipServer~CUIT~www.columbia.edu-80-pool=1764244352.20480.0000; expires=Sat, 26-May-2018 03:51:47 GMT; path=/; Httponly'} The operation of requests is very simple; perform the operation, GET in this case, over the URL. This returns a result object that can be analyzed. The main elements are the status_code and the body content, which can be presented as text. The full request can be checked in the request field: >>> response.request <PreparedRequest [GET]> >>> response.request.url 'http://www.columbia.edu/~fdc/sample.html' You can check out the full request's documentation for more information. Parsing HTML We'll use the excellent Beautiful Soup module to parse the HTML text into a memory object that can be analyzed. We need to use the beautifulsoup4 package to use the latest Python 3 version that is available. 
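Before adding Beautiful Soup, it can help to consolidate the downloading steps above into one small, reusable helper. This is a sketch rather than code from the recipe; the function name and timeout value are illustrative assumptions:

import requests

def download_page(url, timeout=10):
    """Fetch a URL and return its text, raising an exception on HTTP errors."""
    # The timeout guards against hanging forever on a slow or unreachable server
    response = requests.get(url, timeout=timeout)
    # Turn 4xx/5xx status codes into a requests.HTTPError instead of silent failures
    response.raise_for_status()
    return response.text

html = download_page('http://www.columbia.edu/~fdc/sample.html')
print(len(html), 'characters downloaded')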
Add the package to your requirements.txt and install the dependencies in the virtual environment: $ echo "beautifulsoup4==4.6.0" >> requirements.txt $ pip install -r requirements.txt How to perform HTML Parsing Import BeautifulSoup and requests: >>> import requests >>> from bs4 import BeautifulSoup Set up the URL of the page to download and retrieve it: >>> URL = 'http://www.columbia.edu/~fdc/sample.html' >>> response = requests.get(URL) >>> response <Response [200]> Parse the downloaded page: >>> page = BeautifulSoup(response.text, 'html.parser') Obtain the title of the page. See that it is the same as what's displayed in the browser: >>> page.title <title>Sample Web Page</title> >>> page.title.string 'Sample Web Page' Find all the h3 elements in the page, to determine the existing sections: >>> page.find_all('h3') [<h3><a name="contents">CONTENTS</a></h3>, <h3><a name="basics">1. Creating a Web Page</a></h3>, <h3><a name="syntax">2. HTML Syntax</a></h3>, <h3><a name="chars">3. Special Characters</a></h3>, <h3><a name="convert">4. Converting Plain Text to HTML</a></h3>, <h3><a name="effects">5. Effects</a></h3>, <h3><a name="lists">6. Lists</a></h3>, <h3><a name="links">7. Links</a></h3>, <h3><a name="tables">8. Tables</a></h3>, <h3><a name="install">9. Installing Your Web Page on the Internet</a></h3>, <h3><a name="more">10. Where to go from here</a></h3>] 6. Extract the text on the section links. Stop when you reach the next <h3> tag: >>> link_section = page.find('a', attrs={'name': 'links'}) >>> section = [] >>> for element in link_section.next_elements: ... if element.name == 'h3': ... break ... section.append(element.string or '') ... >>> result = ''.join(section) >>> result '7. Links\n\nLinks can be internal within a Web page (like to\nthe Table of ContentsTable of Contents at the top), or they\ncan be to external web pages or pictures on the same website, or they\ncan be to websites, pages, or pictures anywhere else in the world.\n\n\n\nHere is a link to the Kermit\nProject home pageKermit\nProject home page.\n\n\n\nHere is a link to Section 5Section 5 of this document.\n\n\n\nHere is a link to\nSection 4.0Section 4.0\nof the C-Kermit\nfor Unix Installation InstructionsC-Kermit\nfor Unix Installation Instructions.\n\n\n\nHere is a link to a picture:\nCLICK HERECLICK HERE to see it.\n\n\n' Notice that there are no HTML tags; it's all raw text. The first step is to download the page. Then, the raw text can be parsed, as in step 3. The resulting page object contains the parsed information. BeautifulSoup allows us to search for HTML elements. It can search for the first one with .find() or return a list with .find_all(). In step 5, it searched for a specific tag <a> that had a particular attribute, name=link. After that, it kept iterating on .next_elements until it finds the next h3 tag, which marks the end of the section. The text of each element is extracted and finally composed into a single text. Note the or that avoids storing None, returned when an element has no text. Crawling the web Given the nature of hyperlink pages, starting from a known place and following links to other pages is a very important tool in the arsenal when scraping the web. To do so, we crawl a page looking for a small phrase and will print any paragraph that contains it. We will search only in pages that belong to the same site. I.e. only URLs starting with www.somesite.com. We won't follow links to external sites. We'll use as an example a prepared example, available in the GitHub repo. 
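Before downloading the example site for the crawling recipe, here is a short, self-contained sketch that ties together the requests and Beautiful Soup steps from the parsing recipe above. It fetches the same sample page; the rest of the script is an illustrative assumption rather than the book's code:

import requests
from bs4 import BeautifulSoup

URL = 'http://www.columbia.edu/~fdc/sample.html'

response = requests.get(URL)
response.raise_for_status()
# Parse the raw HTML into a searchable tree
page = BeautifulSoup(response.text, 'html.parser')

print('Title:', page.title.string)
# List the section headings, mirroring the find_all('h3') step above
for heading in page.find_all('h3'):
    print('-', heading.get_text(strip=True))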
Download the whole site and run the included script. $ python simple_delay_server.py This serves the site in the URL http://localhost:8000. You can check it on a browser. It's a simple blog with three entries. Most of it is uninteresting, but we added a couple of paragraphs that contain the keyword python. How to crawl the web The full script, crawling_web_step1.py, is available in GitHub. The most relevant bits are displayed here: ... def process_link(source_link, text): logging.info(f'Extracting links from {source_link}') parsed_source = urlparse(source_link) result = requests.get(source_link) # Error handling. See GitHub for details ... page = BeautifulSoup(result.text, 'html.parser') search_text(source_link, page, text) return get_links(parsed_source, page) def get_links(parsed_source, page): '''Retrieve the links on the page''' links = [] for element in page.find_all('a'): link = element.get('href') # Validate is a valid link. See GitHub for details ... links.append(link) return links   Search for references to python, to return a list with URLs that contain it and the paragraph. Notice there are a couple of errors because of broken links: $ python crawling_web_step1.py https://localhost:8000/ -p python Link http://localhost:8000/: --> A smaller article , that contains a reference to Python Link http://localhost:8000/files/5eabef23f63024c20389c34b94dee593-1.html: --> A smaller article , that contains a reference to Python Link http://localhost:8000/files/33714fc865e02aeda2dabb9a42a787b2-0.html: --> This is the actual bit with a python reference that we are interested in. Link http://localhost:8000/files/archive-september-2018.html: --> A smaller article , that contains a reference to Python Link http://localhost:8000/index.html: --> A smaller article , that contains a reference to Python Another good search term is crocodile. Try it out: $ python crawling_web_step1.py http://localhost:8000/ -p crocodile Let's see each of the components of the script: A loop that goes through all the found links, in the main function: Downloading and parsing the link, in the process_link function: It downloads the file, and checks that the status is correct to skip errors such as broken links. It also checks that the type (as described in Content-Type) is a HTML page to skip PDFs and other formats. And finally, it parses the raw HTML into a BeautifulSoup object. It also parses the source link using urlparse, so later, in step 4, it can skip all the references to external sources. urlparse divides a URL into its composing elements: >>> from urllib.parse import urlparse >>> >>> urlparse('http://localhost:8000/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html') ParseResult(scheme='http', netloc='localhost:8000', path='/files/b93bec5d9681df87e6e8d5703ed7cd81-2.html', params='', query='', fragment='') It finds the text to search, in the search_text function: It searches the parsed object for the specified text. Note the search is done as a regex and only in the text. It prints the resulting matches, including source_link, referencing the URL where the match was found: for element in page.find_all(text=re.compile(text)): print(f'Link {source_link}: --> {element}') The  get_links function retrieves all links on a page: It searches in the parsed page all <a> elements, and retrieves the href elements, but only elements that have such href elements and that are a fully qualified URL (starting with http). This removes links that are not a URL, such as a '#' link, or that are internal to the page. 
An extra check is done to check they have the same source as the original link, then they are registered as valid links. The netloc attribute allows to detect that the link comes from the same URL domain than the parsed URL generated in step 2. Finally, the links are returned, where they'll be added to the loop described in step 1. Accessing password-protected pages Sometimes a web page is not open to the public but protected in some way. The most basic aspect is to use basic HTTP authentication, which is integrated into virtually every web server, and it's a user/password schema. We can test this kind of authentication in https://httpbin.org. It has a path, /basic-auth/{user}/{password}, which forces authentication, with the user and password stated. This is very handy for understanding how authentication works. How to Access password protected pages Import requests: >>> import requests Make a GET request to the URL with the wrong credentials. Notice that we set the credentials on the URL to be user and psswd: >>> requests.get('https://httpbin.org/basic-auth/user/psswd', auth=('user', 'psswd')) <Response [200]> Use the wrong credentials to return a 401 status code (Unauthorized): >>> requests.get('https://httpbin.org/basic-auth/user/psswd', auth=('user', 'wrong')) <Response [401]> The credentials can be also passed directly in the URL, separated by a colon and an @ symbol before the server, like this: >>> requests.get('https://user:[email protected]/basic-auth/user/psswd') <Response [200]> >>> requests.get('https://user:[email protected]/basic-auth/user/psswd') <Response [401]> Speeding up web scraping Most of the time spent downloading information from web pages is usually spent waiting. A request goes from our computer to whatever server will process it, and until the response is composed and comes back to our computer, we cannot do much about it. During the execution of the recipes in the book, you'll notice there's a wait involved in requests calls, normally of around one or two seconds. But computers can do other stuff while waiting, including making more requests at the same time. In this recipe, we will see how to download a list of pages in parallel and wait until they are all ready. We will use an intentionally slow server to show the point. We'll get the code to crawl and search for keywords, making use of the futures capabilities of Python 3 to download multiple pages at the same time. A future is an object that represents the promise of a value. This means that you immediately receive an object while the code is being executed in the background. Only, when specifically requesting for its .result() the code blocks until getting it. To generate a future, you need a background engine, called executor. Once created, submit a function and parameters to it to retrieve a future.  The retrieval of the result can be delayed as long as necessary, allowing the generation of several futures in a row, and waiting until all are finished, executing them in parallel, instead of creating one, wait until it finishes, creating another, and so on. There are several ways to create an executor; in this recipe, we'll use ThreadPoolExecutor, which will use threads. We'll use as an example a prepared example, available in the GitHub repo. Download the whole site and run the included script $ python simple_delay_server.py -d 2 This serves the site in the URL http://localhost:8000. You can check it on a browser. It's s simple blog with three entries. 
Most of it is uninteresting, but we added a couple of paragraphs that contain the keyword python. The parameter -d 2 makes the server intentionally slow, simulating a bad connection. How to speed up web scraping Write the following script, speed_up_step1.py. The full code is available in GitHub. Notice the differences in the main function. Also, there's an extra parameter added (number of concurrent workers), and the function process_link now returns the source link. Run the crawling_web_step1.py script to get a time baseline. Notice the output has been removed here for clarity: $ time python crawling_web_step1.py http://localhost:8000/ ... REMOVED OUTPUT real 0m12.221s user 0m0.160s sys 0m0.034s Run the new script with one worker, which is slower than the original one: $ time python speed_up_step1.py -w 1 ... REMOVED OUTPUT real 0m16.403s user 0m0.181s sys 0m0.068s Increase the number of workers: $ time python speed_up_step1.py -w 2 ... REMOVED OUTPUT real 0m10.353s user 0m0.199s sys 0m0.068s Adding more workers decreases the time: $ time python speed_up_step1.py -w 5 ... REMOVED OUTPUT real 0m6.234s user 0m0.171s sys 0m0.040s The main engine to create the concurrent requests is the main function. Notice that the rest of the code is basically untouched (other than returning the source link in the process_link function). This is the relevant part of the code that handles the concurrent engine: with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor: while to_check: futures = [executor.submit(process_link, url, to_search) for url in to_check] to_check = [] for data in concurrent.futures.as_completed(futures): link, new_links = data.result() checked_links.add(link) for link in new_links: if link not in checked_links and link not in to_check: to_check.append(link) max_checks -= 1 if not max_checks: return   The with context creates a pool of workers, specifying its number. Inside, a list of futures containing all the URLs to retrieve is created. The .as_completed() function returns the futures that are finished, and then there's some work dealing with obtaining newly found links and checking whether they need to be added to be retrieved or not. This process is similar to the one presented in the Crawling the web recipe. The process starts again until enough links have been retrieved or there are no links to retrieve. In this post, we learned to use the power of Python to automate web scraping tasks. To understand how to automate monotonous tasks with Python 3.7, check out our book: Python Automation Cookbook Google releases Magenta studio beta, an open source python machine learning library for music artists How to perform sentiment analysis using Python [Tutorial] Do you write Python Code or Pythonic Code?

Is Anti-trust regulation coming to Facebook following fake news inquiry made by a global panel in the House of Commons, UK?

Prasad Ramesh
28 Nov 2018
11 min read
The DCMS meeting for fake news inquiry on Facebook’s platform was held yesterday at the House of Commons, UK. This was the first time that parliamentarians from such number (nine) of countries gathered in one place. Representatives were from Argentina, Canada, France, Singapore, Ireland, Belgium, Brazil, Latvia, and a select few from the DCMS committee. Richard Allan is also a member of House of Lords, as Lord Allan of Hallam in addition to being Facebook VP for policy solutions. The committee was chaired by Damian Collins MP, head of UK parliament’s digital, culture, media and sport (DCMS) select committee. About Zuckerberg not attending the meeting himself Facebook had refused to send Zuckerberg to the hearing despite repeated requests from the DCMS committee and even after being flexible about remotely attending the meeting via FaceTime. The parliamentarians were clearly displeased with Mark Zuckerberg’s empty chair at the meeting. They made remarks about how he should be accountable as a CEO for a meeting that involves his company and representatives representing millions of Facebook users from different countries. There were plenty of remarks directed at the Facebook founder being absent in the hearing. Statement from Mark Zuckerberg to the US Senate hearing earlier this year: “We didn’t take a broad enough view of our responsibility, it was my mistake and I am sorry”. Allan was asked if he thought that was a genuine statement, he said yes. Then Nathaniel Erskine-Smith from Canada made a remark “Just not sorry enough to appear himself before nine parliaments.” Canada wasn’t done, another remark from Erksine-Smith: “Sense of corporate social responsibility, particularly in light of the immense power and profit of Facebook, has been empty as the chair beside you.” In Canada, only 270 people had used the app called Your Digital Life related to Cambridge Analytica and 620,000 had their information shared with the developer. Who gave Mr. Zuckerberg the advice to ignore this committee? Charles Angus Vice-Chair, from House of Commons, Canada made a remark that Zuckerberg decided to “blow off this meeting”. Richard Allan accepted full responsibility for decisions on public appearances for Facebook. How does it looks that Zuckerberg is not here and you’re apologizing for his absence? “Not great” was his answer. Don’t you see that Facebook has lost public trust due to misinformation tactics? Allan agreed to this point. Charles Angus said Facebook has lost the trust of the international committee that it can police itself. Damian Collins said, “It should be up to the parliaments to decide what regulatory measures need to be set in place and not Facebook.“ Were you sent because you could answer our questions or to defend Facebook’s position? Allan said that he was sent to answer questions and was in the company since 2009 and had experienced events first hand. He said that he volunteered to come, Mike Schroepfer, Facebook CTO was sent to an earlier hearing, but the committee was not happy with his answers. The Cambridge Analytica incident Questions were asked about when Facebook became aware of this incident. Allan said that it was when the story was out in the press. When did Mark Zuckerberg know about the GSR Cambridge Analytica incident? After some evasion, the answer was March 2018, as the timeline, when it was covered by the press. The same question was asked 6 months ago to Mike Schroepfer and he said he didn’t know. 
A follow up was if Facebook was aware of and banned any other apps that breached privacy. Allan said that there were many but on probing could not name even one. He promised to send the committee a written response to that question. After the US senate hearing in April, Zuckerberg was supposed to give a list of such apps that were banned, the committee still hasn’t got any such list. Ian Lucas MP (Wrexham, Labour) said: “You knew app developers were sharing information and the only time you took actions was when you were found out.” What were Facebook’s decision on its decisions on data and privacy controls that led to the Cambridge Analytica scandal? Allan explained that there are two versions of the way developers had access to user data: Before the 2015 policy changes, access to friends data was allowed After the changes, this access was removed Non user data is sitting on Facebook servers but they do not use it to create shadow profiles. Additionally, any third party apps are expected to have their own privacy policy which can be different from Facebook’s own privacy policy. Allan said that if any such app is found that has privacy measures that may lead to privacy issues, then they take actions but could not provide an example of having done so. Will Facebook apply GDPR standards across all countries as Zuckerberg stated? They believe that the tools, system that they built are GDPR complaint. Russian activity on Facebook From the recently seized documents, questions were asked but Allan deflected them by saying they are unverified and partial information. Why didn’t Facebook disclose that it knew Russian ads were run on its platform? The case made for the question was that no one from Facebook disclosed that information about Russian activity on its platform. It wasn’t disclosed until US Senate intelligence committee made a formal request. Allan said that their current policy at the moment is to investigate and publish any such activity. From the cache of documents obtained, a point was made about an email by a Facebook engineer in 2014 about Russian IP addresses using a Pinterest API key to pull over 3 billion data points through the ordered friends API. Allan said that the details from those seized mails/documents are unverified, partial and can be misleading. Allan stuck to his guns saying: “we will come back to you”. Facebook’s privacy controls Facebook user settings were overridden by a checkbox not in plain sight and that allowed your friends' apps to access your data. When did Facebook change the API that overrode its own central privacy page? In November 2009, Facebook central privacy page that allowed you to “control who can see your profile and personal information”. In November 2011 the US federal trade commission made a complaint against Facebook that they allowed external app developers to access personal information. Allan responded by saying that in privacy settings, there was a checkbox to disallow access to your data by applications installed by friends. What about non-Facebook user data? They use it to link connections when that person becomes a Facebook user. They do not make any money out of it. What’s your beef with Six4Three? Their app, pikini depended on friends data. When Facebook changed the API to version 2 as mentioned above, Six4Three sued Facebook as their app won’t work anymore. Discussions on can Facebook be held accountable for its actions Allan agrees that there should be a global regulation to hold the company accountable for its actions. 
Are you serious about regulation on a global level?

There are now tens of thousands of Facebook employees working towards securing user privacy; they were too few before. Allan agrees that a global-level regulation should be present, that the company should be accountable for its actions, and that sanctions should be made against any undesirable actions by the company. Maybe this will be communicated from a global organization like the United Nations (UN).

How is Facebook policing fake accounts and their networks?

It is an ongoing battle. Most fake-account creation is not done with political intent but with commercial intent: to sell followers, push spam, and so on. More clicks = more money for them. Many of these accounts are taken out within minutes of creation. "We have artificial intelligence systems that try and understand what a fake account looks like and shut them down as soon as they come up." This is for the mass creation of accounts. In political cases, only one or two accounts are created and they act as genuine users. Facebook is still removing fake accounts related to Russia. Allan says they're trying to get better all the time. Low-quality information has decreased by 50% on Facebook, as per academic research. Fake users that use a VPN are difficult to address. For running political ads, an account needs to be a regularly used account; a driving license or passport is needed, and the payment information is stored with Facebook in addition to any information that Facebook may already have. Allan says that, in this case, running ads from a fake account would be unwise, since the information can be used to prosecute the fake account user even if the documents used were fake.

In the context of fake ads or information, Allan agreed that the judicial authority of the specific country is best entrusted with taking down sensitive information. He gave an example: if someone claims that a politician is corrupt and he is not, taking it down is correct, but if he is corrupt and it is taken down, then genuine information is lost.

A case of non-regulation was pointed out

A hate-speech Facebook comment in Sri Lanka was pointed out by Edwin Tong of Singapore. The comment was in Sinhalese and Facebook did not remove it even after reports of it being hate speech. Allan said that it was a mistake and that they are investing heavily in artificial intelligence with a set of hate-speech keywords that can weed out such comments. They are working through the different languages on this.

How will Facebook provide transparency on the measures taken against fake news?

There is a big push around academic study in this area. They are working with academics to understand the fake news problem better. But they also want to ensure that Facebook doesn't end up sharing user data that people would find inappropriate.

How is Facebook monitoring new sign-ups and posts during elections?

There should not be anonymous users. The next time they log in, there is a checkpoint that says more information is required.

Would Facebook consider working actively with local election authorities to remove or flag posts that would influence voter choice?

They think that this is essential. It is the judicial system that can best decide if such posts are true or false and make the call to remove them. Making everyone feel that an election was free and fair is not something Facebook can do on its own.

What is Facebook doing to prevent misuse of its algorithms to influence elections?

First, there is a change in the algorithms and the way they surface information generally; they now better classify low-quality content.
Secondly, there is a category of borderline content which is not banned but is close to being banned. It currently gets rewarded by the algorithm; work is being done to reduce its reach instead. Thirdly, third-party fact checkers are used for marking posts as false or true. This is about tilting the scales in the algorithm towards higher-quality, less sensational content and away from lower-quality, more sensational content.

What measures are there for fake news on WhatsApp?

There are services that provide fake WhatsApp numbers. Report them to the company and they will take them down, says Allan. They are aware of this and its use, and it needs to be a part of the election protection effort.

Closing discussion

After the lengthy round of grilling in the fake news inquiry, Angus reiterated that they expect Facebook to be accountable for its actions. Would you be interested in asking your friend Mr Zuckerberg if we should have a discussion about anti-trust? You and Mr. Zuckerberg are the symptoms. Perhaps the best regulation is anti-trust: break Facebook up from WhatsApp and Instagram, and allow competition. Allan answered that it depends on the problem to be solved. Angus jolted back: "The problem is Facebook", which we need to address. It's unprecedented economic control of every form of social discourse and communication. Angus asked Facebook to have corporate accountability. Given its unwillingness to be accountable to the international body, perhaps anti-trust would be something to help get credible democratic responses from the corporation.

These were some of the highlights of the questions and answers at the committee meeting held on 27th November 2018 at the House of Commons. We recommend you watch the complete proceedings for a more comprehensive context here. In our view, Mr Allan tried answering many of the questions during the three-hour session of this fake news inquiry better than Sandberg or Zuckerberg did in their hearings, but the answers were less than satisfactory where important topics regarding Facebook's data and privacy controls were involved. It does appear that Facebook will continue to delay, deny, and deflect as much as it can.

Privacy experts urge the Senate Commerce Committee for a strong federal privacy bill "that sets a floor, not a ceiling"
Consumer protection organizations submit a new data protection framework to the Senate Commerce Committee
Facebook, Twitter open up at Senate Intelligence hearing, committee does 'homework' this time

Malicious code in npm ‘event-stream' package targets a bitcoin wallet and causes 8 million downloads in two months

Savia Lobo
28 Nov 2018
3 min read
Last week Ayrton Sparling, a Computer Science major at CSUF, California disclosed that the popular npm package, event-stream, contains a malicious package named flatmap-stream. He disclosed the issue via the GitHub issue on the EventStream’s repository. The event-stream npm package was originally created and maintained by Dominic Tarr. However, this popular package has not been updated for a long time now. According to Thomas Hunter’s post on Medium, “Ownership of event-stream, was transferred by the original author to a malicious user, right9ctrl.  The malicious user was able to gain the trust of the original author by making a series of meaningful contributions to the package.” The malicious owner then added a malicious library named flatmap-stream to the events-stream package as a dependency. This led to a download and invocation of the event-stream package (using the malicious 3.3.6 version) by every user. The malicious library download added up to nearly 8 million downloads since it was included in September 2018. The malicious package represents a highly targeted attack and affects an open source app called bitpay/copay. Copay is a secure bitcoin wallet platform for both desktop and mobile devices. “We know the malicious package specifically targets that application because the obfuscated code reads the description field from a project’s package.json file, then uses that description to decode an AES256 encrypted payload”, said Thomas in his post. Post this breakout, many users from Twitter and GitHub have positively supported Dominic. In a statement on the event-stream issue, Dominic stated, “I've shared publish rights with other people before. Of course, If I had realized they had a malicious intent I wouldn't have, but at the time it looked like someone who was actually trying to help me”. https://twitter.com/dominictarr/status/1067186943304159233 As a support to Dominic, André Staltz, an open source hacker, tweeted, https://twitter.com/andrestaltz/status/1067157915398746114 Users affected by this malicious code are advised to eliminate this package from their application by reverting back to version 3.3.4 of event-stream. If the user application deals with Bitcoin, they should inspect its activity in the last 3 months to see if any mined or transferred bitcoins did not make it into their wallet. However, if the application does not deal with bitcoin but is especially sensitive, an inspection of its activity in the last 3 months for any suspicious activity is recommended. This is to analyze the notably data sent on the network to unintended destinations. To know more about this in detail, visit Eventstream’s repository. A new data breach on Facebook due to malicious browser extensions allowed almost 81,000 users’ private data up for sale, reports BBC News Wireshark for analyzing issues and malicious emails in POP, IMAP, and SMTP [Tutorial] Machine learning based Email-sec-360°surpasses 60 antivirus engines in detecting malicious emails

Google employees join hands with Amnesty International urging Google to drop Project Dragonfly

Sugandha Lahoti
28 Nov 2018
3 min read
Yesterday, Google employees have signed a petition protesting Google’s infamous Project Dragonfly. “We are Google employees and we join Amnesty International in calling on Google to cancel project Dragonfly”, they wrote on a post on Medium. This petition also marks the first time over 300 Google employees (at the time of writing this post) have used their actual names in a public document. Project Dragonfly is the secretive search engine that Google is allegedly developing which will comply with the Chinese rules of censorship. It has been on the receiving end of constant backlash from various human rights organizations and investigative reporters, since it was revealed earlier this year. On Monday, it also faced critique from human rights organization Amnesty International. Amnesty launched a petition opposing the project, and coordinated protests outside Google offices around the world including San Francisco, Berlin, Toronto and London. https://twitter.com/amnesty/status/1067488964167327744 Yesterday, Google employees joined Amnesty and wrote an open letter to the firm. “We are protesting against Google’s effort to create a censored search engine for the Chinese market that enables state surveillance. Our opposition to Dragonfly is not about China: we object to technologies that aid the powerful in oppressing the vulnerable, wherever they may be. Dragonfly in China would establish a dangerous precedent at a volatile political moment, one that would make it harder for Google to deny other countries similar concessions. Dragonfly would also enable censorship and government-directed disinformation, and destabilize the ground truth on which popular deliberation and dissent rely.” Employees have expressed their disdain over Google’s decision by calling it a money-minting business. They have also highlighted Google’s previous disappointments including Project Maven, Dragonfly, and Google’s support for abusers, and believe that “Google is no longer willing to place its values above its profits. This is why we’re taking a stand.” Google spokesperson has redirected to their previous response on the topic: "We've been investing for many years to help Chinese users, from developing Android, through mobile apps such as Google Translate and Files Go, and our developer tools. But our work on search has been exploratory, and we are not close to launching a search product in China." Twitterati have openly sided with Google employees in this matter. https://twitter.com/Davidramli/status/1067582476262957057 https://twitter.com/shabirgilkar/status/1067642235724972032 https://twitter.com/nrambeck/status/1067517570276868097 https://twitter.com/kuminaidoo/status/1067468708291985408 OK Google, why are you ok with mut(at)ing your ethos for Project DragonFly? Amnesty International takes on Google over Chinese censored search engine, Project Dragonfly. Google’s prototype Chinese search engine ‘Dragonfly’ reportedly links searches to phone numbers

UK parliament seizes Facebook internal documents cache after Zuckerberg’s continuous refusal to answer questions

Prasad Ramesh
26 Nov 2018
4 min read
Last weekend, the UK parliament seized a cache of Facebook internal documents. The parliament exercised its legal powers after Facebook founder Mark Zuckerberg continuously refused to answer questions  regarding privacy and the Cambridge Analytica scandal. User data privacy was a major concern for the Digital, Culture, Media and Sport Committee (DCMS) committee of UK. As reported in the Observer, the cache of the obtained documents likely contains significant revelations about social media giant’s decisions on user data privacy and controls or the lack of which led to the Cambridge Analytica scandal. This includes confidential emails between Facebook’s senior executives. How did they seize the cache of documents? The chair of DCMS, Damian Collins initiated a parliamentary procedure to make the founder of Six4Three hand over the documents. This happened during his business trip in London. They also sent a serjeant at arms to recover the documents. When Six4Three’s founder did not comply, he was likely escorted to the parliament where he was facing fines and imprisonment for non-compliance. Collins said to the Observer: “We are in uncharted territory. This is an unprecedented move but it’s an unprecedented situation. We’ve failed to get answers from Facebook and we believe the documents contain information of very high public interest.” What is Six4Three and how did they get said documents? Six4Three was a software company that produced a way to search for bikini pictures from your Facebook contacts. After investing $250K, Six4Three alleged that the cache contains information that indicates Facebook was aware of the implications of its privacy policy and also actively exploited them. They intentionally created and then denied the loophole that allowed Cambridge Analytica to collect data which affected over 87 million users. This detail caught the attention of Collins and the DCMS committee. In 2015, Six4Three filed a lawsuit against Facebook. The complaint was that Facebook promised developers long-term access to user data for creating apps for them. But then, they later shut off access to such data. The documents which Six4Three obtained are under seal on order of a Californian court. This didn’t stop the UK parliament from enforcing its own power when the company’s owner was in London. Facebook asking not to read or reveal the documents A Facebook Spokesperson told the Observer: “The materials obtained by the DCMS committee are subject to a protective order of the San Mateo Superior Court restricting their disclosure. We have asked the DCMS committee to refrain from reviewing them and to return them to counsel or to Facebook. We have no further comment.” The email exchange Since the news was first covered by press, Facebook responded with an email: https://twitter.com/carolecadwalla/status/1066732715737837569 To which Collins wrote back saying that under parliamentary privilege the committee can publish these documents: https://twitter.com/DamianCollins/status/1066773746491498498 The session hearing It is not clear if Facebook can make any legal moves to restrict the publication of the documents. However, the court hearing to be held tomorrow will be attended by Richard Allan, Facebook vice-president of policy, after Zuckerberg refused to attend. Earlier, Zuckerberg also declined many video call requests by the committee. 
It will be a very long session; the committee says it has very serious questions for Facebook: "It [Facebook] misled us about Russian involvement on the platform. And it has not answered our questions about who knew what, when with regards to the Cambridge Analytica scandal." But with Allan, a former Liberal Democrat Member of Parliament now representing Facebook, it remains to be seen how many questions will be answered directly. This story was first published in The Observer.

NYT Facebook exposé fallout: Board defends Zuckerberg and Sandberg; Media call and transparency report highlights
Facebook's outgoing Head of communications and policy takes blame for hiring PR firm 'Definers' and reveals more
Did you know Facebook shares the data you share with them for 'security' reasons with advertisers?

Creating triggers in Azure Functions [Tutorial]

Bhagyashree R
26 Nov 2018
8 min read
A trigger is an event or situation that causes something to start. This something can be some sort of processing of data or some other service that performs some action. Triggers are just a set of functions that get executed when some event gets fired. In Azure, we have different types of triggers, such as an implicit trigger, and we can also create a manual trigger. With Azure Functions, you can write code in response to a trigger in Azure.

This article is taken from the book Learning Azure Functions by Manisha Yadav and Mitesh Soni. In this book, you will learn the techniques of scaling your Azure functions and making the most of serverless architecture.

In this article, we will see the common types of triggers and learn how to create a trigger with a very simple example. We will also learn about the HTTP trigger, event hub, and service bus.

Common types of triggers

Let's first understand how a trigger works and get acquainted with the different types of triggers available in Azure Functions. The architecture is simple: an event fires the trigger and, once the trigger is fired, it runs the Azure Function associated with it. We need to note a very important point here: one function must have exactly one trigger; in other words, one function can't have multiple triggers.

Now let's see the different types of trigger available in Azure:

TimerTrigger: This trigger is called on a predefined schedule. We can set the time for execution of the Azure Function using this trigger.
BlobTrigger: This trigger gets fired when a new or updated blob is detected. The blob contents are provided as input to the function.
EventHubTrigger: This trigger is used for application instrumentation, the user experience, workflow processing, and the Internet of Things (IoT). It gets fired when any events are delivered to an Azure event hub.
HTTPTrigger: This trigger gets fired when an HTTP request comes in.
QueueTrigger: This trigger gets fired when any new messages come in an Azure Storage queue.
Generic Webhook: This trigger gets fired when Webhook HTTP requests come from any service that supports Webhooks.
GitHub Webhook: This trigger is fired when an event occurs in your GitHub repositories. The GitHub repository supports events such as Branch created, Delete branch, Issue comment, and Commit comment.
Service Bus trigger: This trigger is fired when a new message comes from a service bus queue or topic.
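Before walking through the portal, it can help to see what a trigger looks like in code. The portal example that follows uses JavaScript, but Azure Functions also has a Python worker; the sketch below is a minimal, hypothetical Python timer-triggered function (not from the book), assuming the v1 programming model where the binding and schedule live in a function.json file next to the code:

import logging

import azure.functions as func

# __init__.py: a minimal timer-triggered function for the Python worker.
# The schedule is declared in the accompanying function.json, for example:
#   { "bindings": [ { "name": "mytimer", "type": "timerTrigger",
#                     "direction": "in", "schedule": "0 0 8 * * *" } ] }
# NCRONTAB fields: second, minute, hour, day, month, day-of-week.

def main(mytimer: func.TimerRequest) -> None:
    # Runs every day at 08:00 according to the schedule above.
    if mytimer.past_due:
        logging.warning("Timer is running late")
    logging.info("Good morning!")

With the schedule set to 0 0/5 * * * * instead, the same function would fire every five minutes, which is exactly what the portal walkthrough below starts with.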
Example of creating a simple scheduled trigger in Azure

Consider a simple example where we have to display a "good morning" message on screen every day at 8 AM. This situation is related to time, so we need to use a schedule trigger. Let's start by creating a function with the schedule trigger:

1. Log in to the Azure Portal.
2. Click on the top-left + icon | Compute | Function App.
3. Once we click on Function App, the next screen will appear, where we have to provide a unique Function App name, Subscription, Resource Group, Hosting Plan, Location, and Storage, and then click on the Create button. Azure will then start to deploy this function.
4. Once this function is deployed, it will be seen in Notifications. Click on Notifications and check the function details.
5. To add a trigger to this function, click on the + icon next to Functions and then click on Custom function.
6. Now we have to select the Language and type the name of the trigger. Once we provide the name and trigger value, the available templates are filtered for us.
7. Scroll down and type the trigger name and schedule. The Schedule value is a six-field CRON expression. Click on the Create button. By providing 0 0/5 * * * *, the function will run every 5 minutes from the first run.
8. Once we click on the Create button, we will see the template code on the screen. Here, we write whatever action we want the function to perform. Now write the code and click on the Save and run button. Once we run the code, we can see the output in the logs. Note the timing: it runs at an interval of 5 minutes.
9. Now we want it to run only once a day at 8 AM, so we have to change the value of the schedule (for example, the six-field expression 0 0 8 * * * fires at 8:00 every day). To edit the value in the trigger, click on Integrate, type the value, and then click on the Save button.
10. Now, again, click on goodMorningTriggerJS, modify the code, and test it.

So, this is all about creating a simple trigger with an Azure Function. Now, we will look at the different types of triggers available in Azure.

HTTP trigger

The HTTP trigger is normally used to create APIs or services, where we request data using the HTTP protocol and get a response. We can also integrate the HTTP trigger with a Webhook. Let's start creating the HTTP trigger. We have already created a simple Azure Function and trigger; now we will create an HTTP Login API. We will send the login credentials through an HTTP POST request and get a response indicating whether or not the user is valid. Since we have already created a Function App in the previous example, we can add multiple functions to it:

1. Click on +, select HttpTrigger-JavaScript, provide the function name, and click on the Create button.
2. After we click on the Create button, the default template will be available. Now we can edit and test the function.
3. Edit the code, then save and run it. The login service is ready.
4. Now let's check this service in Postman. To get the URL of the function, click on Get function URL.
5. We will use Postman to check our service. Postman is a Chrome extension for API developers to test APIs. To add the Chrome extension, go to Settings in Chrome and select More tools | Extensions, click on Get more extensions, search for postman, click on + ADD TO CHROME, and then click on Add app.
6. Launch the Postman app and click on Sign Up with Google.
7. Once the initial setup is done, test the API. Copy the function URL and paste it into Postman. Select the method type POST, provide a request body, and click on the Send button.

If we provide the correct username and password in the request body, we will get the response user is valid; otherwise, the response will be invalid user.
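The book implements this login check in JavaScript inside the portal; for comparison, here is a minimal, hypothetical equivalent as a Python Azure Function (again assuming the v1 programming model). The route, the JSON field names, and the hard-coded credential store are placeholders for illustration only:

import json

import azure.functions as func

# __init__.py: a minimal HTTP-triggered login check for the Python worker.
# function.json would declare an httpTrigger input binding named "req" and an
# http output binding named "$return".
VALID_USERS = {"admin": "password"}  # stand-in for a real credential store


def main(req: func.HttpRequest) -> func.HttpResponse:
    try:
        body = req.get_json()
    except ValueError:
        return func.HttpResponse("invalid request body", status_code=400)

    if VALID_USERS.get(body.get("username")) == body.get("password"):
        return func.HttpResponse(json.dumps({"result": "user is valid"}),
                                 mimetype="application/json")
    return func.HttpResponse(json.dumps({"result": "invalid user"}),
                             mimetype="application/json", status_code=401)

Posting a JSON body such as {"username": "admin", "password": "password"} to the function URL from Postman would then return user is valid, mirroring the behaviour described above.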
In the next section, we will discuss event hubs.

Event hubs

Event hubs are created to help us with the challenge of handling a huge amount of event-based messaging. The idea is that if we have apps or devices that publish a large number of events in a very short duration (for example, a real-time voting system), then an event hub can be the place where we send those events. The event hub creates a stream of all the events, which can be processed at some point in different ways. An event hub trigger is used to respond to an event sent to an event hub event stream.

Service bus

The service bus is used to provide interaction between services or applications running in the cloud and other services or applications. The service bus trigger is used to respond to messages that come from a service bus queue or topic. We have two types of service bus triggers:

Service bus queue trigger: A queue is basically for first-in-first-out messages. When a message comes from the service bus, the service bus queue trigger gets fired and the Azure Function is called. In the Azure Function, we can process the message and then deliver it.
Service bus topic trigger: The topic is useful for scaling to very large numbers of recipients.

Finally, we have completed the trigger part of Azure Functions. In this article, we discussed the architecture of a trigger and how a trigger works, and covered the different types of triggers available in Azure. We created a simple example of a schedule trigger and discussed its workflow, looked at the HTTP trigger in detail and created an API with it, and covered the event hub and service bus and how a trigger works with these services.

If you found this post useful, do check out the book, Learning Azure Functions. This book walks you through the techniques of scaling your Azure functions and will help you make the most of serverless architecture.

Anatomy of an Azure function App [Tutorial]
Implementing Identity Security in Microsoft Azure [Tutorial]
Azure Functions 2.0 launches with better workload support for serverless

Exploring Deep Learning Architectures [Tutorial]

Melisha Dsouza
25 Nov 2018
11 min read
This tutorial will focus on some of the important architectures present today in deep learning. A lot of the success of neural networks lies in the careful design of the neural network architecture. We will look at the architectures of autoencoder neural networks, variational autoencoders, CNNs, and RNNs. This tutorial is an excerpt from a book written by Dipanjan Sarkar, Raghav Bali, et al. titled Hands-On Transfer Learning with Python. This book extensively focuses on deep learning (DL) and transfer learning, comparing and contrasting the two with easy-to-follow concepts and examples.

Autoencoder neural networks

Autoencoders are typically used for reducing the dimensionality of data in neural networks. They are also successfully used for anomaly detection and novelty detection problems. Autoencoder neural networks come under the unsupervised learning category. The network is trained by minimizing the difference between input and output. A typical autoencoder architecture is a slight variant of the DNN architecture, where the number of units per hidden layer is progressively reduced until a certain point before being progressively increased, with the final layer dimension being equal to the input dimension. The key idea behind this is to introduce bottlenecks in the network and force it to learn a meaningful compact representation. The middle layer of hidden units (the bottleneck) is basically the dimension-reduced encoding of the input. The first half of the hidden layers is called the encoder, and the second half is called the decoder. In a simple autoencoder architecture, the layer named z is the representation layer (Source: cloud4scieng.org).

Variational autoencoders

Variational autoencoders (VAEs) are generative models and, compared to other deep generative models, VAEs are computationally tractable and stable and can be estimated by the efficient backpropagation algorithm. They are inspired by the idea of variational inference in Bayesian analysis. The idea of variational inference is as follows: given input distribution x, the posterior probability distribution over output y is too complicated to work with. So, let's approximate that complicated posterior, p(y | x), with a simpler distribution, q(y). Here, q is chosen from a family of distributions, Q, that best approximates the posterior. For example, this technique is used in training latent Dirichlet allocation (LDA) models (they do topic modeling for text and are Bayesian generative models).

Given a dataset, X, a VAE can generate new samples similar, but not necessarily equal, to those in X. Dataset X has N Independent and Identically Distributed (IID) samples of some continuous or discrete random variable, x. Let's assume that the data is generated by some random process involving an unobserved continuous random variable, z. In the simple autoencoder above, the variable z is deterministic; here, it is a stochastic variable. Data generation is a two-step process:

A value of z is generated from a prior distribution, pθ(z)
A value of x is generated from the conditional distribution, pθ(x|z)

So, p(x) is basically the marginal probability, calculated as pθ(x) = ∫ pθ(x|z) pθ(z) dz. The parameter of the distribution, θ, and the latent variable, z, are both unknown. Here, x can be generated by taking samples from the marginal p(x). Backpropagation cannot handle the stochastic variable z, or a stochastic layer z, within the network. Assuming the prior distribution, p(z), is Gaussian, we can leverage the location-scale property of the Gaussian distribution and rewrite the stochastic layer as z = μ + σε, where μ is the location parameter, σ is the scale, and ε is white noise. Now we can obtain multiple samples of the noise, ε, and feed them as deterministic input to the neural network. The model then becomes an end-to-end deterministic deep neural network, in which the decoder part is the same as in the case of the simple autoencoder that we looked at earlier.
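To make the location-scale trick concrete, here is a small NumPy sketch (not from the book) showing that drawing ε from N(0, 1) and computing z = μ + σε produces the same distribution as sampling z from N(μ, σ²) directly; the particular values of μ and σ are arbitrary placeholders. This is what allows the randomness to be moved outside the network so that z becomes a deterministic function of the encoder outputs:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.5, 0.4   # location and scale, as would be produced by the encoder
n = 100_000

# Direct sampling from N(mu, sigma^2)
z_direct = rng.normal(mu, sigma, size=n)

# Reparameterized sampling: epsilon is the only source of randomness,
# and z is a deterministic function of mu and sigma.
eps = rng.normal(0.0, 1.0, size=n)
z_reparam = mu + sigma * eps

print(z_direct.mean(), z_direct.std())    # ~1.5, ~0.4
print(z_reparam.mean(), z_reparam.std())  # ~1.5, ~0.4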
Types of CNN architectures

CNNs are multilayered neural networks designed specifically for identifying shape patterns with a high degree of invariance to translation, scaling, and rotation in two-dimensional image data. These networks need to be trained in a supervised way. Typically, a labeled set of object classes, such as MNIST or ImageNet, is provided as a training set. The crux of any CNN model is the convolution layer and the subsampling/pooling layer.

LeNet architecture

This is a pioneering seven-level convolutional network, designed by LeCun and their co-authors in 1998, that was used for digit classification. Later, it was applied by several banks to recognize handwritten numbers on cheques. The lower layers of the network are composed of alternating convolution and max pooling layers. The upper layers are fully connected, dense MLPs (formed of hidden layers and logistic regression). The input to the first fully connected layer is the set of all the feature maps of the previous layer.

AlexNet

In 2012, AlexNet significantly outperformed all the prior competitors and won the ILSVRC by reducing the top-5 error to 15.3%, compared to the runner-up with 26%. This work popularized the application of CNNs in computer vision. AlexNet has a very similar architecture to that of LeNet, but has more filters per layer and is deeper. Also, AlexNet introduces the use of stacked convolution, instead of always alternating convolution and pooling. A stack of small convolutions is better than one large receptive field of convolution layers, as this introduces more non-linearities and fewer parameters.

ZFNet

The ILSVRC 2013 winner was a CNN from Matthew Zeiler and Rob Fergus. It became known as ZFNet. It improved on AlexNet by tweaking the architecture hyperparameters, in particular by expanding the size of the middle convolutional layers and making the stride and filter size on the first layer smaller, going from 11 x 11 stride 4 in AlexNet to 7 x 7 stride 2 in ZFNet. The intuition behind this was that a smaller filter size in the first convolution layer helps to retain a lot of the original pixel information. Also, AlexNet was trained on 15 million images, while ZFNet was trained on only 1.3 million images.

GoogLeNet (inception network)

The ILSVRC 2014 winner was a convolutional network called GoogLeNet from Google. It achieved a top-5 error rate of 6.67%! This was very close to human-level performance. The runner-up was the network from Karen Simonyan and Andrew Zisserman known as VGGNet. GoogLeNet introduced a new architectural component using a CNN called the inception layer. The intuition behind the inception layer is to use larger convolutions, but also keep a fine resolution for smaller information on the images.
The following diagram describes the full GoogLeNet architecture: Visual Geometry Group Researchers from the Oxford Visual Geometry Group, or the VGG for short, developed the VGG network, which is characterized by its simplicity, using only 3 x 3 convolutional layers stacked on top of each other in increasing depth. Reducing volume size is handled by max pooling. At the end, two fully connected layers, each with 4,096 nodes, are then followed by a softmax layer. The only preprocessing done to the input is the subtraction of the mean RGB value, computed on the training set, from each pixel. Pooling is carried out by max pooling layers, which follow some of the convolution layers. Not all the convolution layers are followed by max pooling. Max pooling is performed over a 2 x 2 pixel window, with a stride of 2. ReLU activation is used in each of the hidden layers. The number of filters increases with depth in most VGG variants. The 16-layered architecture VGG-16 is shown in the following diagram. The 19-layered architecture with uniform 3 x 3 convolutions (VGG-19) is shown along with ResNet in the following section. The success of VGG models confirms the importance of depth in image representations: VGG-16: Input RGB image of size 224 x 224 x 3, the number of filters in each layer is circled Residual Neural Networks The main idea in this architecture is as follows. Instead of hoping that a set of stacked layers would directly fit a desired underlying mapping, H(x), they tried to fit a residual mapping. More formally, they let the stacked set of layers learn the residual R(x) = H(x) - x, with the true mapping later being obtained by a skip connection. The input is then added to the learned residual, R(x) + x. Also, batch normalization is applied right after each convolution and before activation: Here is the full ResNet architecture compared to VGG-19. The dotted skip connections show an increase in dimensions; hence, for the addition to be valid, no padding is done. Also, increases in dimensions are indicated by changes in color: Types of RNN architectures An recurrent neural Network (RNN) is specialized for processing a sequence of values, as in x(1), . . . , x(t).  We need to do sequence modeling if, say, we wanted to predict the next term in the sequence given the recent history of the sequence, or maybe translate a sequence of words in one language to another language. RNNs are distinguished from feedforward networks by the presence of a feedback loop in their architecture. It is often said that RNNs have memory. The sequential information is preserved in the RNNs hidden state. So, the hidden layer in the RNN is the memory of the network. In theory, RNNs can make use of information in arbitrarily long sequences, but in practice they are limited to looking back only a few steps. LSTMs RNNs start losing historical context over time in the sequence, and hence are hard to train for practical purposes. This is where  LSTMs (Long short-term memory)  come into the picture! Introduced by Hochreiter and Schmidhuber in 1997, LSTMs can remember information from really long sequence-based data and prevent issues such as the vanishing gradient problem. LSTMs usually consist of three or four gates, including input, output, and forget gates. The following diagram shows a high-level representation of a single LSTM cell: The input gate can usually allow or deny incoming signals or inputs to alter the memory cell state. The output gate usually propagates the value to other neurons as necessary. 
The forget gate controls the memory cell's self-recurrent connection to remember or forget previous states as necessary. Multiple LSTM cells are usually stacked in any deep learning network to solve real-world problems, such as sequence prediction. Stacked LSTMs If we want to learn about the hierarchical representation of sequential data, a stack of LSTM layers can be used. Each LSTM layer outputs a sequence of vectors rather than a single vector for each item of the sequence, which will be used as an input to a subsequent LSTM layer. This hierarchy of hidden layers enables a more complex representation of our sequential data. Stacked LSTM models can be used for modeling complex multivariate time series data. Encoder-decoder – Neural Machine Translation Machine translation is a sub-field of computational linguistics, and is about performing translation of text or speech from one language to another. Traditional machine translation systems typically rely on sophisticated feature engineering based on the statistical properties of text. Recently, deep learning has been used to solve this problem, with an approach known as Neural Machine Translation (NMT). An NMT system typically consists of two modules: an encoder and a decoder. It first reads the source sentence using the encoder to build a thought vector: a sequence of numbers that represents the sentence's meaning. A decoder processes the sentence vector to emit a translation to other target languages. This is called an encoder-decoder architecture. The encoders and decoders are typically forms of RNN. The following diagram shows an encoder-decoder architecture using stacked LSTMs. Source: tensorflow.org The source code for NMT in TensorFlow is available at Github. Gated Recurrent Units Gated Recurrent Units (GRUs) are related to LSTMs, as both utilize different ways of gating information to prevent the vanishing gradient problem and store long-term memory. A GRU has two gates: a reset gate, r, and an update gate, z, as shown in the following diagram. The reset gate determines how to combine the new input with the previous hidden state, ht-1, and the update gate defines how much of the previous state information to keep. If we set the reset to all ones and update gate to all zeros, we arrive at a simple RNN model: GRUs are computationally more efficient because of a simpler structure and fewer parameters. Summary This article covered various advances in neural network architectures including Autoencoder Neural Networks, Variational Autoencoders,  CNN's and  RNN's architectures. To understand how to simplify deep learning  by taking supervised, unsupervised, and reinforcement learning to the next level using the Python ecosystem, check out this book  Hands-On Transfer Learning with Python Neural Style Transfer: Creating artificial art with deep learning and transfer learning Dr. Brandon explains ‘Transfer Learning’ to Jon 5 cool ways Transfer Learning is being used today

Using machine learning for phishing domain detection [Tutorial]

Prasad Ramesh
24 Nov 2018
11 min read
Social engineering is one of the most dangerous threats facing every individual and modern organization. Phishing is a well-known, computer-based, social engineering technique. Attackers use disguised email addresses as a weapon to target large companies. With the huge number of phishing emails received every day, companies are not able to detect all of them. That is why new techniques and safeguards are needed to defend against phishing. This article will present the steps required to build three different machine learning-based projects to detect phishing attempts, using cutting-edge Python machine learning libraries. We will use the following Python libraries: scikit-learn Python (≥ 2.7 or ≥ 3.3) NumPy  (≥ 1.8.2) NLTK Make sure that they are installed before moving forward. You can find the code files here. This article is an excerpt from a book written by Chiheb Chebbi titled Mastering Machine Learning for Penetration Testing. In this book, you will you learn how to identify loopholes in a self-learning security system and will be able to efficiently breach a machine learning system. Social engineering overview Social engineering, by definition, is the psychological manipulation of a person to get useful and sensitive information from them, which can later be used to compromise a system. In other words, criminals use social engineering to gain confidential information from people, by taking advantage of human behavior. Social Engineering Engagement Framework The Social Engineering Engagement Framework (SEEF) is a framework developed by Dominique C. Brack and Alexander Bahmram. It summarizes years of experience in information security and defending against social engineering. The stakeholders of the framework are organizations, governments, and individuals (personals). Social engineering engagement management goes through three steps: Pre-engagement process: Preparing the social engineering operation During-engagement process: The engagement occurs Post-engagement process: Delivering a report There are many social engineering techniques used by criminals: Baiting: Convincing the victim to reveal information, promising him a reward or a gift. Impersonation: Pretending to be someone else. Dumpster diving: Collecting valuable information (papers with addresses, emails, and so on) from dumpsters. Shoulder surfing: Spying on other peoples' machines from behind them, while they are typing. Phishing: This is the most often used technique; it occurs when an attacker, masquerading as a trusted entity, dupes a victim into opening an email, instant message, or text message. Steps of social engineering penetration testing Penetration testing simulates a black hat hacker attack in order to evaluate the security posture of a company for deploying the required safeguard. Penetration testing is a methodological process, and it goes through well-defined steps. There are many types of penetration testing: White box pentesting Black box pentesting Grey box pentesting To perform a social engineering penetration test, you need to follow the following steps: Building real-time phishing attack detectors using different machine learning models In the next sections, we are going to learn how to build machine learning phishing detectors. We will cover the following two methods: Phishing detection with logistic regression Phishing detection with decision trees Phishing detection with logistic regression In this section, we are going to build a phishing detector from scratch with a logistic regression algorithm. 
Logistic regression is a well-known statistical technique used to make binomial predictions (two classes). Like in every machine learning project, we will need data to feed our machine learning model. For our model, we are going to use the UCI Machine Learning Repository (Phishing Websites Data Set). You can check it out at https://archive.ics.uci.edu/ml/datasets/Phishing+Websites: The dataset is provided as an arff file: The following is a snapshot from the dataset: For better manipulation, we have organized the dataset into a csv file: As you probably noticed from the attributes, each line of the dataset is represented in the following format – {30 Attributes (having_IP_Address URL_Length, abnormal_URL and so on)} + {1 Attribute (Result)}: For our model, we are going to import two machine learning libraries, NumPy and scikit-learn. Let's open the Python environment and load the required libraries: >>> import numpy as np >>> from sklearn import * >>> from sklearn.linear_model import LogisticRegression >>> from sklearn.metrics import accuracy_score Next, load the data: training_data = np.genfromtxt('dataset.csv', delimiter=',', dtype=np.int32) Identify the inputs (all of the attributes, except for the last one) and the outputs (the last attribute): >>> inputs = training_data[:,:-1] >>> outputs = training_data[:, -1] We need to divide the dataset into training data and testing data: training_inputs = inputs[:2000] training_outputs = outputs[:2000] testing_inputs = inputs[2000:] testing_outputs = outputs[2000:] Create the scikit-learn logistic regression classifier: classifier = LogisticRegression() Train the classifier: classifier.fit(training_inputs, training_outputs) Make predictions: predictions = classifier.predict(testing_inputs) Let's print out the accuracy of our phishing detector model: accuracy = 100.0 * accuracy_score(testing_outputs, predictions) print ("The accuracy of your Logistic Regression on testing data is: " + str(accuracy)) The accuracy of our model is approximately 85%. This is a good accuracy, since our model detected 85 phishing URLs out of 100. But let's try to make an even better model with decision trees, using the same data. Phishing detection with decision trees To build the second model, we are going to use the same machine learning libraries, so there is no need to import them again. However, we are going to import the decision tree classifier from sklearn: >>> from sklearn import tree Create the tree.DecisionTreeClassifier() scikit-learn classifier: classifier = tree.DecisionTreeClassifier() Train the model: classifier.fit(training_inputs, training_outputs) Compute the predictions: predictions = classifier.predict(testing_inputs) Calculate the accuracy: accuracy = 100.0 * accuracy_score(testing_outputs, predictions) Then, print out the results: print ("The accuracy of your decision tree on testing data is: " + str(accuracy)) The accuracy of the second model is approximately 90.4%, which is a great result, compared to the first model. We have now learned how to build two phishing detectors, using two machine learning techniques. NLP in-depth overview NLP is the art of analyzing and understanding human languages by machines. According to many studies, more than 75% of the used data is unstructured. Unstructured data does not have a predefined data model or not organized in a predefined manner. Emails, tweets, daily messages and even our recorded speeches are forms of unstructured data. 
NLP is a way for machines to analyze, understand, and derive meaning from natural language. NLP is widely used in many fields and applications, such as:

Real-time translation
Automatic summarization
Sentiment analysis
Speech recognition
Building chatbots

Generally, there are two different components of NLP:

Natural Language Understanding (NLU): This refers to mapping input into a useful representation.
Natural Language Generation (NLG): This refers to transforming internal representations into useful representations. In other words, it is transforming data into a written or spoken narrative. Written analysis for business intelligence dashboards is one application of NLG.

Every NLP project goes through five steps. To build an NLP project, the first step is identifying and analyzing the structure of words. This step involves dividing the data into paragraphs, sentences, and words. Later, we analyze the words in the sentences and the relationships among them. The third step involves checking the text for meaningfulness. Then we analyze the meaning of consecutive sentences. Finally, we finish the project with the pragmatic analysis.

Open source NLP libraries

There are many open source Python libraries that provide the structures required to build real-world NLP applications, such as:

Apache OpenNLP
GATE NLP library
Stanford NLP
And, of course, the Natural Language Toolkit (NLTK)

Let's fire up our Linux machine and try some hands-on techniques. Open the Python terminal and import nltk:

>>> import nltk

Download a book type, as follows:

>>> nltk.download()

You can also type:

>>> from nltk.book import *

To get text from a link, it is recommended to use the urllib module to crawl a website (in Python 3, urlopen lives in urllib.request):

>>> from urllib.request import urlopen
>>> url = "http://www.URL_HERE/file.txt"

As a demonstration, we are going to load a text called Security.in.Wireless.Ad.Hoc.and.Sensor.Networks. We crawled the text file, and used len to check its length and raw[:50] to display some content. As you can see, the text contains a lot of symbols that are useless for our projects. To get only what we need, we use tokenization:

>>> tokens = nltk.word_tokenize(raw)
>>> len(tokens)
>>> tokens[:10]

To summarize what we learned in the previous section, we saw how to download a web page, tokenize the text, and normalize the words.
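Pulling these fragments together, here is a small self-contained sketch (my assembly, not the book's code) of the same preprocessing pipeline: fetch a document, tokenize it, lemmatize the tokens, and drop stopwords. The URL is a placeholder, and the punkt, wordnet, and stopwords corpora are assumed to have been downloaded first:

import nltk
from urllib.request import urlopen
from nltk import word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords

# One-time downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("wordnet"); nltk.download("stopwords")

url = "http://www.example.com/file.txt"   # placeholder document
raw = urlopen(url).read().decode("utf-8", errors="ignore")

lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words("english"))

tokens = word_tokenize(raw)
normalized = [
    lemmatizer.lemmatize(tok.lower())
    for tok in tokens
    if tok.isalpha() and tok.lower() not in stop
]

print(len(tokens), len(normalized))
print(normalized[:10])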
Spam detection with NLTK

Now it is time to build our spam detector using NLTK. The principle of this type of classifier is simple: we need to detect the words used by spammers. We are going to build a spam/non-spam binary classifier using Python and the nltk library, to detect whether or not an email is spam. First, we need to import the library as usual:

>>> import nltk

We need to load data and feed our model with an emails dataset. To achieve that, we can use the dataset delivered by the Internet CONtent FIltering Group. You can visit the website at https://labs-repos.iit.demokritos.gr/skel/i-config/. Basically, the website provides four datasets:

Ling-spam
PU1
PU123A
Enron-spam

For our project, we are going to use the Enron-spam dataset. Let's download the dataset using the wget command, extract the archive with tar -xzf enron1.tar.gz, and copy the messages into one folder with cp spam/* emails && cp ham/* emails. To shuffle the emails, let's write a small Python script, Shuffle.py, to do the job:

import os
import random

# initiate a list called emails_list
emails_list = []
Directory = '/home/azureuser/spam_filter/enron1/emails/'
Dir_list = os.listdir(Directory)
for file in Dir_list:
    f = open(Directory + file, 'r')
    emails_list.append(f.read())
    f.close()
random.shuffle(emails_list)

Just change the Directory variable, and it will shuffle the files. After preparing the dataset, you should be aware that, as we learned previously, we need to tokenize the emails:

>>> from nltk import word_tokenize

Also, we need to perform another step, called lemmatizing. Lemmatizing connects words that have different forms, like hacker/hackers and is/are. We need to import WordNetLemmatizer:

>>> from nltk import WordNetLemmatizer

Create a sentence for the demonstration (any example string will do), and print out the result of the lemmatizer:

>>> lemmatizer = WordNetLemmatizer()
>>> sentence = "hackers are attacking the servers"
>>> [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(sentence)]

Then, we need to remove stopwords, such as of, is, the, and so on:

from nltk.corpus import stopwords
stop = stopwords.words('english')

To process the email, a function called Process must be created, to lemmatize and tokenize our dataset:

def Process(data):
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(data)]

The second step is feature extraction, by reading the emails' words:

from collections import Counter

def Features_Extraction(text, setting):
    if setting == 'bow':  # bow means bag-of-words
        return {word: count for word, count in Counter(Process(text)).items() if word not in stop}
    else:
        return {word: True for word in Process(text) if word not in stop}

Extract the features (here, emails is assumed to be a list of (text, label) pairs built from the spam and ham folders):

features = [(Features_Extraction(email, 'bow'), label) for (email, label) in emails]

Now, let's define the model training Python function:

def training_Model(Features, samples):
    Size = int(len(Features) * samples)
    training, testing = Features[:Size], Features[Size:]
    print('Training = ' + str(len(training)) + ' emails')
    print('Testing = ' + str(len(testing)) + ' emails')
    return training, testing

As a classification algorithm, we are going to use NaiveBayesClassifier:

from nltk import NaiveBayesClassifier, classify
classifier = NaiveBayesClassifier.train(training)

Finally, we define the evaluation Python function:

def evaluate(training, testing, classifier):
    print('Training Accuracy is ' + str(classify.accuracy(classifier, training)))
    print('Testing Accuracy is ' + str(classify.accuracy(classifier, testing)))
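The fragments above can be hard to piece together, so here is one self-contained sketch (my assembly under stated assumptions, not the book's verbatim code) that wires them into a runnable pipeline; the enron1 directory layout and the 80/20 split are assumptions:

import os
import random
from collections import Counter

from nltk import NaiveBayesClassifier, classify, word_tokenize, WordNetLemmatizer
from nltk.corpus import stopwords

lemmatizer = WordNetLemmatizer()
stop = set(stopwords.words('english'))


def process(text):
    # Tokenize and lemmatize one email body.
    return [lemmatizer.lemmatize(word.lower()) for word in word_tokenize(text)]


def extract_features(text):
    # Bag-of-words counts with stopwords removed.
    return {word: count for word, count in Counter(process(text)).items() if word not in stop}


def load_emails(directory, label):
    # Read every file in a folder and attach its label ('spam' or 'ham').
    emails = []
    for name in os.listdir(directory):
        with open(os.path.join(directory, name), 'r', errors='ignore') as f:
            emails.append((f.read(), label))
    return emails


# Assumed layout of the extracted Enron-spam archive.
base = '/home/azureuser/spam_filter/enron1'
emails = load_emails(os.path.join(base, 'spam'), 'spam') + \
         load_emails(os.path.join(base, 'ham'), 'ham')
random.shuffle(emails)

features = [(extract_features(text), label) for (text, label) in emails]

# 80/20 train/test split.
size = int(len(features) * 0.8)
train_set, test_set = features[:size], features[size:]

classifier = NaiveBayesClassifier.train(train_set)
print('Training Accuracy:', classify.accuracy(classifier, train_set))
print('Testing Accuracy:', classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(10)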
In this article, we learned to detect phishing attempts by building three different projects from scratch. First, we discovered how to develop a phishing detector using two different machine learning techniques: logistic regression and decision trees. The third project was a spam filter, based on NLP and Naive Bayes classification. To become a master at penetration testing using machine learning with Python, check out the book Mastering Machine Learning for Penetration Testing.

Google's Protect your Election program: Security policies to defend against state-sponsored phishing attacks, and influence campaigns
How the Titan M chip will improve Android security
New cybersecurity threats posed by artificial intelligence

Recode Decode #GoogleWalkout interview shows why data and evidence don’t always lead to right decisions in even the world’s most data-driven company

Natasha Mathur
23 Nov 2018
10 min read
Earlier this month, 20,000 Google employees along with temps, Vendors, and Contractors walked out of their respective Google offices to protest against the discrimination, racism, and sexual harassment that they encountered at Google’s workplace. As a part of the walkout, Google employees had laid out five demands urging Google to bring about structural changes within the workplace. In the latest episode of Recode Decode with Kara Swisher, yesterday, six of the Google walkout organizers, namely, Erica Anderson, Claire Stapleton, Meredith Whittaker, Stephanie Parker, Cecilia O’Neil-Hart and Amr Gaber spoke out about Google’s dismissive approach towards the five demands laid out by the Google employees. A day after the Walkout, Google addressed these demands in a note written by Sundar Pichai, where he admitted that they have “not always gotten everything right in the past” and are “sincerely sorry”. Pichai also mentioned that  “It’s clear that to live up to the high bar we set for Google, we need to make some changes. Going forward, we will provide more transparency into how you raise concerns and how we handle them”. The 'walkout for real change' was a response to the New York Times report, published last month, that exposed how Google has protected its senior executives (Andy Rubin, Android Founder being one of them) that had been accused of sexual misconduct in the recent past. We’ll now have a look at the major highlights from the podcast. Key Takeaways The podcast talks about how the organizers formulated their demands, the rights of contractors at Google, post walkout town hall meeting, and what steps will be taken next by the Google employees. How the walkout mobilized collective action and the formulation of demands As per the Google employees, collating demands was a collective effort from the very beginning. They were inspired by stories of sexual harassment at Google that were floating around in an internal email chain. This urged the organizers of the walkout to send out an email to a large group of women stating that they need to do something about it, to which a lot of employees suggested that they should put out their demands. A doc was prepared in Google Doc Live that listed all the suggested demands by the fellow Googlers. “it was just this truly collective action, living, moving in a Google Document that we were all watching and participating in” said Cecelia O’Neil Hart, a marketer at YouTube.  Cecelia also pointed out that the demands that were being collected were not new and had represented the voices of a lot of groups at Google. “It was just completely a process of defining what we wanted in solidarity with each other. I think it showed me the power of collective action, writing the demands quite literally as a collective” said Cecelia. Rights of Contractors One of the demands laid out by the Google employees as a part of the walkout, states, “commitment to ending pay and opportunity inequity for all levels of the organization”. They expected a change that is applicable to not just full-time employees, but also contract workers as well as subcontract workers, as they are the ones who work at Google with rights that are restricted and different than those of the full-time employees. 
“We have contractors that manage teams of upwards of 10, 20, even more, other people but left in this second-class state where they don’t have healthcare benefits, they don’t have paid sick leave and they definitely don’t get access to the same well-being resources: Counseling, professional development, any of that,” adds Stephanie Parker, a policy specialist on Trust and Safety at YouTube.

Other examples of discrimination against contractors at Google include the shooting at YouTube headquarters in April, where contract workers (security guards, cafeteria workers, etc.) were excluded from the post-shooting town hall meeting conducted by Susan Wojcicki, CEO of YouTube. Also, while the shooting was taking place, all the employees were receiving security updates via text, except the contractors. Similarly, the contractors were not allowed into the town hall meeting conducted six days after the walkout, although the demands applied to them just as much as to full-time employees. There is also systemic racism in hiring and promotion for certain job ladders, such as engineering, versus other job ladders, versus contract work.

Parker mentioned that by including contractors in the five demands, they wanted to bring to everyone’s attention that despite Google striving to be the company with the best workplace and the best benefits, it is quite far off from leading in that space. “The solution is to convert them to full-time or to treat them fairly with respect. Not to throw up our hands and say, ‘Oh well,’” said Parker.

Post walkout town hall meeting

Six days after the walkout, a mail about the town hall meeting was sent over to the employees, which Google said was accidentally “leaked”. Stapleton, a marketing manager at YouTube, says that “the town hall was really tough to watch” and that the Google executives “did not ever address, acknowledge, the list of demands nor did they adequately provide solutions to all the five. They did drop forced arbitration, but for sexual harassment only, not discrimination, which was a key omission”.

As per the employees, Google seemed to use the same old methods to get the situation under control. Google said it will focus on committing to OKRs (Objectives and Key Results), i.e. the main goals for the company as a whole. Moreover, the executives tried to play down the other concerns and core issues such as discrimination (beyond the sexual kind), racism, and the abuse of power, while focusing on only one kind of behavior, i.e. sexual assault. The organizers mentioned how Google refused to address any issues surrounding the TVCs (temps, vendors, and contractors), despite being asked about it in the town hall. Also, Google did not acknowledge that the HR processes and systems within the company are not working. Instead, Google decided to conduct a survey to gauge how people really feel about the HR teams within the workplace.

“They heard loud and clear from 20,000 of us that these processes and reporting lines that are in place are set up the wrong way and need to be redesigned so that we normal employees have more of a say and more of a look into the decision-making processes, and they didn’t even acknowledge that as a valid sentiment or idea,” said Parker. All in all, there wasn’t much “leadership”, and there wasn’t an understanding that “accountability was necessary”.

Employees want their demands to be met

Employees want an employee representative on the board to speak on behalf of all the employees.
They also want accountability systems in place, and they want Google to begin analyzing the cultures within companies that run on racism, discrimination, abuse of power, and sexism: the kind of culture that excludes many from power and accrues resources to only a few. The employees acknowledge that Google is continuing to discuss the issue, but say they will have to keep pushing the conversation forward every step of the way. “I think we need to not be afraid to say the real words. I want to hear our execs say the real words like ‘discrimination,’ which was erased from their response to the demands. Like ‘systemic racism’. I want to hear those real words,” said Cecilia.

Employees also want demand no. 2, i.e. ending pay inequity, specifically to be addressed by Google, as all they keep getting in response is that Google is “looking into it” and “studying” it. “I think that what they have to do is embrace the tough critique that they’ve gotten and try to understand where we’re coming from and make these changes, and make them in collaboration with us, which has not happened,” said Stapleton.

Employees continue to be cautiously hopeful

Employees believe that Google has incredible people at the company. Thousands of people came together and worked on their vision for the world, on something that really mattered. “You know, we’ve called this the ‘Walkout for Real Change’ for a reason. Even if all of our optimism comes true and the best outcome and our demands are met, real change happens over time and we’re going to hold people accountable to that real change actually going down, and hold us accountable for demanding it also, because we’ve got to get the rest of the demands met,” says Cecilia.

Our thoughts on this topic

Just as history has proven time and again, information and data can be used to drive a narrative that benefits the storyteller and their agenda. Based on feedback collected from workers across the company, the Google walkout organizers pointed out systemic issues within the company that enabled sexually predatory behavior. They pointed out that sexual harassment is one of the symptoms and not the cause, and demanded that the root causes be addressed holistically through their set of five demands.

To extinguish a movement or dissension in its infancy, regimes and corporations throughout history have used the following tactics:

- Be the benevolent ruler
- Divide and conquer the crowd by appealing to individual group needs but never to everyone’s collective demands
- Find a middle ground by agreeing to some demands while signaling that the other side also takes a few steps forward, thereby disengaging those whose demands aren’t met and weakening the movement’s leadership
- Use information to support the status quo
- Promote the influencers into top management roles

It appears that Google is using a lot of these approaches to appease the walkout participants. Google’s management adopted classic labor negotiation tactics: sanctioning the protest, encouraging managers to participate, then agreeing to adopt the easiest item on the list of demands, one already implemented at some other tech companies, while restricting it to its own employees. By restricting the reforms to only its employees and creating a larger distance for TVCs, Google seems to be thinning out the protesting crowd. By not engaging in open dialog on all the key issues highlighted, and by keeping key top decision makers out of the town hall, it has created a situation for deniability.
Lastly, by going back to surveying sentiments on key issues, Google is not only relying on time to subdue the anger felt but also counting on the grassroots voice to dissipate. Will this be the tipping point for Google employees to unionize?

BuzzFeed Report: Google’s sexual misconduct policy “does not apply retroactively to claims already compelled to arbitration”
OK Google, why are you ok with mut(at)ing your ethos for Project DragonFly?
Following Google, Facebook changes its forced arbitration policy for sexual harassment claims