
How-To Tutorials

6719 Articles

Analyzing Moby Dick through frequency distribution with NLTK

Richard Gall
30 Mar 2018
2 min read
What is frequency distribution and why does it matter? In the context of natural language processing, a frequency distribution is simply a tally of the number of times each unique word is used in a text. Recording the individual word counts of a text helps us understand not only what topics are being discussed and what information is important, but also how that information is being discussed. It's a useful method for better understanding language and different types of texts. This video tutorial has been taken from Natural Language Processing with Python.

Word frequency distribution is central to performing content analysis with NLP, and its applications are wide ranging. From understanding and characterizing an author's writing style to analyzing the vocabulary of rappers, the technique is playing a large part in wider cultural conversations. It's also used in psychological research in a number of ways to analyze how patients use language to form frameworks for thinking about themselves and the world. Trivial or serious, word frequency distribution is becoming more and more important in the world of research.

Of course, manually creating such a word frequency distribution model would be time consuming and inconvenient for data scientists. Fortunately for us, NLTK, Python's toolkit for natural language processing, makes life much easier.

How to use NLTK for frequency distribution

Take a look at how to use NLTK to create a frequency distribution for Herman Melville's Moby Dick in the video tutorial above. In it, you'll find a step-by-step guide to performing an important data analysis task. Once you've done that, you can try it for yourself, or have a go at performing a similar analysis on another data set.

Read Analyzing Textual information using the NLTK library. Learn more about natural language processing - read How to create a conversational assistant or chatbot using Python.
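As a companion to the video, here is a minimal sketch of the kind of frequency distribution described above. It is not taken from the tutorial itself, and it assumes NLTK is installed and that the gutenberg and stopwords corpora have been downloaded with nltk.download().

```python
# Minimal sketch: frequency distribution for Moby Dick with NLTK.
# Assumes: pip install nltk, plus nltk.download('gutenberg') and
# nltk.download('stopwords') have been run beforehand.
import nltk
from nltk.corpus import gutenberg, stopwords

# Load the tokenized text of Moby Dick that ships with the Gutenberg corpus.
words = gutenberg.words('melville-moby_dick.txt')

# Keep alphabetic tokens only and drop common English stop words.
stop_words = set(stopwords.words('english'))
tokens = [w.lower() for w in words if w.isalpha() and w.lower() not in stop_words]

# FreqDist tallies how many times each unique word occurs.
freq_dist = nltk.FreqDist(tokens)
print(freq_dist.most_common(10))  # the ten most frequent content words
```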


Getting started with Django and Django REST frameworks to build a RESTful app

Sugandha Lahoti
29 Mar 2018
12 min read
In this article, we will learn how to install Django and Django REST framework in an isolated environment. We will also look at the Django folders, files, and configurations, and how to create an app with Django. We will also introduce various command-line and GUI tools that are used to interact with RESTful Web Services.

Installing Django and Django REST framework in an isolated environment

First, run the following command to install the Django web framework:

pip install django==1.11.5

The last lines of the output will indicate that the django package has been successfully installed. The process will also install the pytz package, which provides world time zone definitions. Take into account that you may also see a notice to upgrade pip. The next lines show a sample of the last four lines of the output generated by a successful pip installation:

Collecting django
Collecting pytz (from django)
Installing collected packages: pytz, django
Successfully installed django-1.11.5 pytz-2017.2

Now that we have installed the Django web framework, we can install Django REST framework. Django REST framework works on top of Django and provides us with a powerful and flexible toolkit to build RESTful Web Services. We just need to run the following command to install this package:

pip install djangorestframework==3.6.4

The last lines of the output will indicate that the djangorestframework package has been successfully installed, as shown here:

Collecting djangorestframework
Installing collected packages: djangorestframework
Successfully installed djangorestframework-3.6.4

After following the previous steps, we will have Django REST framework 3.6.4 and Django 1.11.5 installed in our virtual environment.

Creating an app with Django

Now, we will create our first app with Django and we will analyze the directory structure that Django creates. First, go to the root folder for the virtual environment, named 01.

In Linux or macOS, enter the following command:

cd ~/HillarDjangoREST/01

If you prefer Command Prompt, run the following command in the Windows command line:

cd /d %USERPROFILE%\HillarDjangoREST\01

If you prefer Windows PowerShell, run the following command in Windows PowerShell:

cd $env:USERPROFILE\HillarDjangoREST\01

In Linux or macOS, run the following command to create a new Django project named restful01. The command won't produce any output:

python bin/django-admin.py startproject restful01

In Windows, in either Command Prompt or PowerShell, run the following command to create a new Django project named restful01. The command won't produce any output:

python Scripts\django-admin.py startproject restful01

The previous command creates a restful01 folder with other subfolders and Python files. Now, go to the recently created restful01 folder. Just execute the following command on any platform:

cd restful01

Then, run the following command to create a new Django app named toys within the restful01 Django project. The command won't produce any output:

python manage.py startapp toys

The previous command creates a new restful01/toys subfolder with the following files: views.py, tests.py, models.py, apps.py, admin.py, and __init__.py. In addition, the restful01/toys folder will have a migrations subfolder with an __init__.py Python script.
The following diagram shows the folders and files in the directory tree, starting at the restful01 folder with two subfolders - toys and restful01:

Understanding Django folders, files, and configurations

After we create our first Django project and then a Django app, there are many new folders and files. First, use your favorite editor or IDE to check the Python code in the apps.py file within the restful01/toys folder (restful01\toys in Windows). The following lines show the code for this file:

from django.apps import AppConfig

class ToysConfig(AppConfig):
    name = 'toys'

The code declares the ToysConfig class as a subclass of the django.apps.AppConfig class, which represents a Django application and its configuration. The ToysConfig class just defines the name class attribute and sets its value to 'toys'.

Now, we have to add toys.apps.ToysConfig as one of the installed apps in the restful01/settings.py file that configures settings for the restful01 Django project. We build this string by concatenating three values: app name + .apps. + class name, that is, toys + .apps. + ToysConfig. In addition, we have to add the rest_framework app to make it possible for us to use Django REST framework.

The restful01/settings.py file is a Python module with module-level variables that define the configuration of Django for the restful01 project. We will make some changes to this Django settings file. Open the restful01/settings.py file and locate the lines that specify the string list that declares the installed apps. The following code shows the first lines of the settings.py file. Note that the file has more code:

"""
Django settings for restful01 project.
Generated by 'django-admin startproject' using Django 1.11.5.
For more information on this file, see
https://docs.djangoproject.com/en/1.11/topics/settings/
For the full list of settings and their values, see
https://docs.djangoproject.com/en/1.11/ref/settings/
"""

import os

# Build paths inside the project like this: os.path.join(BASE_DIR, ...)
BASE_DIR = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))

# Quick-start development settings - unsuitable for production
# See https://docs.djangoproject.com/en/1.11/howto/deployment/checklist/

# SECURITY WARNING: keep the secret key used in production secret!
SECRET_KEY = '+uyg(tmn%eo+fpg+fcwmm&x(2x0gml8)=cs@$nijab%)y$a*xe'

# SECURITY WARNING: don't run with debug turned on in production!
DEBUG = True

ALLOWED_HOSTS = []

# Application definition

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
]

Add the following two strings to the INSTALLED_APPS list and save the changes to the restful01/settings.py file: 'rest_framework' and 'toys.apps.ToysConfig'. The following lines show the new code that declares the INSTALLED_APPS string list, with comments that explain what each added string means. The code file for the sample is included in the hillar_django_restful_01 folder:

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    # Django REST framework
    'rest_framework',
    # Toys application
    'toys.apps.ToysConfig',
]

This way, we have added Django REST framework and the toys application to our initial Django project named restful01.
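A quick way to confirm that the packages installed earlier are active and that both new entries in INSTALLED_APPS were registered is to query Django from the project's shell. The following is a sketch of that check, not part of the book's code bundle.

```python
# Sketch (not from the book): run inside `python manage.py shell` after
# editing settings.py, to confirm versions and app registration.
import django
import rest_framework
from django.apps import apps

print(django.get_version())                 # expected: 1.11.5
print(rest_framework.VERSION)               # expected: 3.6.4
print(apps.is_installed('rest_framework'))  # True once added to INSTALLED_APPS
print(apps.get_app_config('toys').name)     # 'toys', resolved through ToysConfig
```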
Installing tools

Now, we will leave Django for a while and install many tools that we will use to interact with the RESTful Web Services that we will develop throughout this book. We will use the following different kinds of tools to compose and send HTTP requests and visualize the responses: command-line tools, GUI tools, Python code, web browsers, and JavaScript code. You can use any other application that allows you to compose and send HTTP requests. There are many apps that run on tablets and smartphones that allow you to accomplish this task. However, we will focus our attention on the most useful tools when building RESTful Web Services with Django.

Installing Curl

We will start by installing command-line tools. One of the key advantages of command-line tools is that you can easily run the HTTP requests again after you have built them for the first time, and you don't need to use the mouse or tap the screen to run requests. You can also easily build a script with batch requests and run them. As happens with any command-line tool, it can take more time to perform the first requests compared with GUI tools, but it becomes easier once we have performed many requests and we can easily reuse the commands we have written in the past to compose new requests.

Curl, also known as cURL, is a very popular open source command-line tool and library that allows us to easily transfer data. We can use the curl command-line tool to easily compose and send HTTP requests and check their responses.

In Linux or macOS, you can open a Terminal and start using curl from the command line. In Windows, you have two options. You can work with curl in Command Prompt, or you can install curl as part of the Cygwin package installation option and execute it from the Cygwin terminal. You can read more about the Cygwin terminal and its installation procedure at http://cygwin.com/install.html. Windows PowerShell includes a curl alias that calls the Invoke-WebRequest command, and therefore, if you want to work with curl in Windows PowerShell, it is necessary to remove the curl alias.

If you want to use the curl command within Command Prompt, you just need to download and unzip the latest version of curl from the curl download page: https://curl.haxx.se/download.html. Make sure you download the version that includes SSL and SSH. The following screenshot shows the available downloads for Windows. The Win64 - Generic section includes the versions that we can run in Command Prompt or Windows PowerShell. After you unzip the .7zip or .zip file you have downloaded, include the folder that contains curl.exe in your path. For example, if you unzip the Win64 x86_64.7zip file, you will find curl.exe in the bin folder. The following screenshot shows the results of executing curl --version on Command Prompt in Windows 10. The --version option makes curl display its version and all the libraries, protocols, and features it supports:

Installing HTTPie

Now, we will install HTTPie, a command-line HTTP client written in Python that makes it easy to send HTTP requests and uses a syntax that is easier than curl. By default, HTTPie displays colorized output and uses multiple lines to display the response details. In some cases, HTTPie makes it easier to understand the responses than the curl utility.
However, one of the great disadvantages of HTTPie as a command-line utility is that it takes more time to load than curl, and therefore, if you want to code scripts with many commands, you have to evaluate whether it makes sense to use HTTPie.

We just need to make sure we run the following command in the virtual environment we have just created and activated. This way, we will install HTTPie only for our virtual environment. Run the following command in the terminal, Command Prompt, or Windows PowerShell to install the httpie package:

pip install --upgrade httpie

The last lines of the output will indicate that the httpie package has been successfully installed:

Collecting httpie
Collecting colorama>=0.2.4 (from httpie)
Collecting requests>=2.11.0 (from httpie)
Collecting Pygments>=2.1.3 (from httpie)
Collecting idna<2.7,>=2.5 (from requests>=2.11.0->httpie)
Collecting urllib3<1.23,>=1.21.1 (from requests>=2.11.0->httpie)
Collecting chardet<3.1.0,>=3.0.2 (from requests>=2.11.0->httpie)
Collecting certifi>=2017.4.17 (from requests>=2.11.0->httpie)
Installing collected packages: colorama, idna, urllib3, chardet, certifi, requests, Pygments, httpie
Successfully installed Pygments-2.2.0 certifi-2017.7.27.1 chardet-3.0.4 colorama-0.3.9 httpie-0.9.9 idna-2.6 requests-2.18.4 urllib3-1.22

Now, we will be able to use the http command to easily compose and send HTTP requests to the RESTful Web Services we will build with Django. The following screenshot shows the results of executing http on Command Prompt in Windows 10. HTTPie displays the valid options and indicates that a URL is required:

Installing the Postman REST client

So far, we have installed two terminal-based or command-line tools to compose and send HTTP requests to our Django development server: cURL and HTTPie. Now, we will start installing Graphical User Interface (GUI) tools.

Postman is a very popular GUI tool for API testing that allows us to easily compose and send HTTP requests, among other features. Postman is available as a standalone app in Linux, macOS, and Windows. You can download the versions of the Postman app from the following URL: https://www.getpostman.com. The following screenshot shows the HTTP GET request builder in Postman:

Installing Stoplight

Stoplight is a very useful GUI tool that focuses on helping architects and developers model complex APIs. If we need to consume our RESTful Web Service in many different programming languages, we will find Stoplight extremely helpful. Stoplight provides an HTTP request maker that allows us to compose and send requests and generate the necessary code to make them in different programming languages, such as JavaScript, Swift, C#, PHP, Node, and Go, among others. Stoplight provides a web version and is also available as a standalone app in Linux, macOS, and Windows. You can download the versions of Stoplight from the following URL: http://stoplight.io/. The following screenshot shows the HTTP GET request builder in Stoplight with the code generation at the bottom:

Installing iCurlHTTP

We can also use apps that compose and send HTTP requests from mobile devices to work with our RESTful Web Services. For example, we can work with the iCurlHTTP app on iOS devices such as the iPad and iPhone: https://itunes.apple.com/us/app/icurlhttp/id611943891. On Android devices, we can work with the HTTP Request app: https://play.google.com/store/apps/details?id=air.http.request&hl=en.
The following screenshot shows the UI for the iCurlHTTP app running on an iPad Pro. At the time of writing, the mobile apps that allow you to compose and send HTTP requests do not provide all the features you can find in Postman or command-line utilities.

In this article, we learnt to set up a virtual environment with Django and Django REST framework and created an app with Django. We looked at Django folders, files, and configurations, and installed command-line and GUI tools to interact with RESTful Web Services.

This article is an excerpt from the book Django RESTful Web Services, written by Gaston C. Hillar. The book serves as an easy guide to building Python RESTful APIs and web services with Django. The code bundle for the article is hosted on GitHub.
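Python code is listed among the request-composition tools earlier in this article. The following is a minimal sketch of that option (not from the book); it assumes the requests package is installed and that some HTTP service is already listening at the placeholder URL below.

```python
# Sketch: composing an HTTP GET with the requests library instead of curl
# or HTTPie. The URL is a placeholder for whatever endpoint you are testing.
import requests

response = requests.get('http://localhost:8000/',
                        headers={'Accept': 'application/json'})
print(response.status_code)   # e.g. 200
print(response.text)          # raw response body
```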


How to build and deploy Microservices using Payara Micro

Gebin George
28 Mar 2018
9 min read
Payara Micro offers a new way to run Java EE or microservice applications. It is based on the Web profile of GlassFish and bundles a few additional APIs. The distribution is designed with modern containerized environments in mind. Payara Micro is available to download as a standalone executable JAR, as well as a Docker image. It's an open source MicroProfile-compatible runtime. Today, we will learn to use Payara Micro to build and deploy microservices.

Here's a list of APIs that are supported in Payara Micro: Servlets, JSTL, EL, and JSPs; WebSockets; JSF; JAX-RS; EJB Lite; JTA; JPA; Bean Validation; CDI; Interceptors; JBatch; Concurrency; and JCache. We will be exploring how to build our services using Payara Micro in the next section.

Building services with Payara Micro

Let's start building parts of our Issue Management System (IMS), which is going to be a one-stop destination for collaboration among teams. As the name implies, this system will be used for managing issues that are raised as tickets and get assigned to users for resolution. To begin the project, we will identify our microservice candidates based on the business model of IMS. Here, let's define three functional services, which will be hosted in their own independent Git repositories: ims-micro-users, ims-micro-tasks, and ims-micro-notify.

You might wonder, why these three and why separate repositories? We could create much more fine-grained services and perhaps it wouldn't be wrong to do so. The answer lies in understanding the following points:

Isolating what varies: We need to be able to independently develop and deploy each unit. Changes to one business capability or domain shouldn't require changes in other services more often than desired.

Organisation or team structure: If you define teams by business capability, then they can work independently of others and release features with greater agility. The tasks team should be able to evolve independently of the teams that are handling users or notifications. The functional boundaries should allow independent version and release cycle management.

Transactional boundaries for consistency: Distributed transactions are not easy; creating services for related features that are too fine grained leads to more complexity than desired. You would need to become familiar with concepts like eventual consistency, but these are not easy to achieve in practice.

Source repository per service: Setting up a single repository that hosts all the services is ideal when it's the same team that works on these services and the project is relatively small. But we are building our fictional IMS, which is a large, complex system with many moving parts. Separate teams would get tightly coupled by sharing a repository. Moreover, versioning and tagging of releases will be yet another problem to solve.

The projects are created as standard Java EE projects, which are Skinny WARs, that will be deployed using the Payara Micro server. Payara Micro allows us to delay the decision of using a Fat JAR or Skinny WAR. This gives us flexibility in picking the deployment choice at a later stage.
As Maven is a widely adopted build tool among developers, we will use the same to create our example projects, using the following commands:

mvn archetype:generate -DgroupId=org.jee8ng -DartifactId=ims-micro-users -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false

mvn archetype:generate -DgroupId=org.jee8ng -DartifactId=ims-micro-tasks -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false

mvn archetype:generate -DgroupId=org.jee8ng -DartifactId=ims-micro-notify -DarchetypeArtifactId=maven-archetype-webapp -DinteractiveMode=false

Once the structure is generated, update the properties and dependencies sections of pom.xml with the following contents, for all three projects:

<properties>
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <maven.compiler.source>1.8</maven.compiler.source>
  <maven.compiler.target>1.8</maven.compiler.target>
  <failOnMissingWebXml>false</failOnMissingWebXml>
</properties>

<dependencies>
  <dependency>
    <groupId>javax</groupId>
    <artifactId>javaee-api</artifactId>
    <version>8.0</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>4.12</version>
    <scope>test</scope>
  </dependency>
</dependencies>

Next, create a beans.xml file under the WEB-INF folder for all three projects:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://xmlns.jcp.org/xml/ns/javaee"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://xmlns.jcp.org/xml/ns/javaee
                           http://xmlns.jcp.org/xml/ns/javaee/beans_2_0.xsd"
       bean-discovery-mode="all">
</beans>

You can delete the index.jsp and web.xml files, as we won't be needing them. The following is the project structure of ims-micro-users. The same structure will be used for ims-micro-tasks and ims-micro-notify.

The package names for the users, tasks, and notify services will be as follows:

org.jee8ng.ims.users (inside ims-micro-users)
org.jee8ng.ims.tasks (inside ims-micro-tasks)
org.jee8ng.ims.notify (inside ims-micro-notify)

Each of the above will in turn have sub-packages called boundary, control, and entity. The structure follows the Boundary-Control-Entity (BCE)/Entity-Control-Boundary (ECB) pattern. The JaxrsActivator shown as follows is required to enable the JAX-RS API and thus needs to be placed in each of the projects:

import javax.ws.rs.ApplicationPath;
import javax.ws.rs.core.Application;

@ApplicationPath("resources")
public class JaxrsActivator extends Application {}

All three projects will have REST endpoints that we can invoke over HTTP. When doing RESTful API design, a popular convention is to use plural names for resources, especially if the resource could represent a collection. For example: /users, /tasks. The resource class names in the projects use the plural form, as it's consistent with the resource URL naming used. This avoids confusion such as a resource URL being called a users resource while the class is named UserResource. Given that this is an opinionated approach, feel free to use singular class names if desired. Here's the relevant code for the ims-micro-users, ims-micro-tasks, and ims-micro-notify projects respectively.
Under ims-micro-users, define the UsersResource endpoint:

package org.jee8ng.ims.users.boundary;

import javax.ws.rs.*;
import javax.ws.rs.core.*;

@Path("users")
public class UsersResource {

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public Response get() {
    return Response.ok("user works").build();
  }
}

Under ims-micro-tasks, define the TasksResource endpoint:

package org.jee8ng.ims.tasks.boundary;

import javax.ws.rs.*;
import javax.ws.rs.core.*;

@Path("tasks")
public class TasksResource {

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public Response get() {
    return Response.ok("task works").build();
  }
}

Under ims-micro-notify, define the NotificationsResource endpoint:

package org.jee8ng.ims.notify.boundary;

import javax.ws.rs.*;
import javax.ws.rs.core.*;

@Path("notifications")
public class NotificationsResource {

  @GET
  @Produces(MediaType.APPLICATION_JSON)
  public Response get() {
    return Response.ok("notification works").build();
  }
}

Once you build all three projects using mvn clean install, you will get your Skinny WAR files generated in the target directory, which can be deployed on the Payara Micro server.

Running services with Payara Micro

Download the Payara Micro server if you haven't already, from this link: https://www.payara.fish/downloads. The micro server will have the name payara-micro-xxx.jar, where xxx is the version number, which might be different when you download the file. Here's how you can start Payara Micro with our services deployed locally. When doing so, we need to ensure that the instances start on different ports, to avoid any port conflicts:

java -jar payara-micro-xxx.jar --deploy ims-micro-users/target/ims-micro-users.war --port 8081
java -jar payara-micro-xxx.jar --deploy ims-micro-tasks/target/ims-micro-tasks.war --port 8082
java -jar payara-micro-xxx.jar --deploy ims-micro-notify/target/ims-micro-notify.war --port 8083

This will start three instances of Payara Micro running on the specified ports. This makes our applications available under these URLs:

http://localhost:8081/ims-micro-users/resources/users/
http://localhost:8082/ims-micro-tasks/resources/tasks/
http://localhost:8083/ims-micro-notify/resources/notifications/

Payara Micro can be started on a non-default port by using the --port parameter, as we did earlier. This is useful when running multiple instances on the same machine. Another option is to use the --autoBindHttp parameter, which will attempt to bind on 8080 as the default port, and if that port is unavailable, it will try to bind on the next port up, repeating until it finds an available port.

Uber JAR option: There's one more feature that Payara Micro provides. We can generate an Uber JAR as well, which would be the Fat JAR approach that we learnt about in the Fat JAR section. To package our ims-micro-users project as an Uber JAR, we can run the following command:

java -jar payara-micro-xxx.jar --deploy ims-micro-users/target/ims-micro-users.war --outputUberJar users.jar

This will generate the users.jar file in the directory where you run this command. The size of this JAR will naturally be larger than our WAR file, since it will also bundle the Payara Micro runtime in it. Here's how you can start the application using the generated JAR:

java -jar users.jar

The server parameters that we used earlier can be passed to this runnable JAR file too. Apart from the two choices we saw for running our microservice projects, there's a third option as well.
Payara Micro provides an API-based approach, which can be used to programmatically start the embedded server. We will expand upon these three services as we progress further into the realm of cloud-based Java EE. We saw how to leverage the power of Payara Micro to run Java EE or microservice applications.

You read an excerpt from the book Java EE 8 and Angular, written by Prashant Padmanabhan. This book helps you build high-performing enterprise applications using Java EE powered by Angular at the frontend.
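As a quick way to confirm that the three Payara Micro instances started above are responding, a short script can hit each endpoint. This is a sketch rather than part of the book's code; it assumes the instances are running on ports 8081-8083 and that Python with the requests package is available.

```python
# Sketch: smoke test for the three Payara Micro instances deployed above.
import requests

endpoints = [
    'http://localhost:8081/ims-micro-users/resources/users/',
    'http://localhost:8082/ims-micro-tasks/resources/tasks/',
    'http://localhost:8083/ims-micro-notify/resources/notifications/',
]

for url in endpoints:
    response = requests.get(url, headers={'Accept': 'application/json'})
    # Each resource returns a simple confirmation string such as "user works".
    print(url, response.status_code, response.text)
```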


How to build Microservices using REST framework

Gebin George
28 Mar 2018
7 min read
Today, we will learn to build microservices using the REST framework. Our microservices are Java EE 8 web projects, built using Maven and published as separate Payara Micro instances, running within Docker containers. The separation allows them to scale individually, as well as have independent operational activities. Given the BCE pattern used, we have the business component split into boundary, control, and entity, where the boundary comprises the web resource (REST endpoint) and business service (EJB). The web resource will publish the CRUD operations and the EJB will in turn provide the transactional support for each of them, along with making external calls to other resources. Here's a logical view for the boundary consisting of the web resource and business service.

The microservices will have the following REST endpoints published for the projects shown, along with the boundary classes XXXResource and XXXService. Server-Sent Events are covered in Power Your APIs with JAX-RS and CDI; in IMS, we publish task/issue updates to the browser using an SSE endpoint. The code observes the events using the CDI event notification model and triggers the broadcast.

The ims-users and ims-issues endpoints are similar in API format and behavior. While one deals with creating, reading, updating, and deleting a User, the other does the same for an Issue. Let's look at this in action. After you have the containers running, we can start firing requests to the /users web resource.

The following curl command maps the URI /users to the @GET resource method named getAll() and returns a collection (JSON array) of users. The Java code will simply return a Set<User>, which gets converted to a JsonArray due to the JSON binding support of JSON-B. The method invoked is as follows:

@GET
public Response getAll() {... }

curl -v -H 'Accept: application/json' http://localhost:8081/ims-users/resources/users
...
HTTP/1.1 200 OK
...
[{
  "id":1,"name":"Marcus","email":"[email protected]",
  "credential":{"password":"1234","username":"marcus"}
},
{
  "id":2,"name":"Bob","email":"[email protected]",
  "credential":{"password":"1234","username":"bob"}
}]

Next, for selecting one of the users, such as Marcus, we will issue the following curl command, which uses the /users/xxx path. This will map the URI to the @GET method which has the additional @Path("{id}") annotation as well. The value of the id is captured using the @PathParam("id") annotation placed before the field. The response is a User entity wrapped in the Response object returned. The method invoked is as follows:

@GET
@Path("{id}")
public Response get(@PathParam("id") Long id) { ... }

curl -v -H 'Accept: application/json' http://localhost:8081/ims-users/resources/users/1
...
HTTP/1.1 200 OK
...
{
  "id":1,"name":"Marcus","email":"[email protected]",
  "credential":{"password":"1234","username":"marcus"}
}

In both the preceding methods, we saw the response returned as 200 OK. This is achieved by using a Response builder. Here's the snippet for the method:

return Response.ok( ..entity here.. ).build();

Next, for submitting data to the resource method, we use the @POST annotation. You might have noticed earlier that the signature of the method also made use of a UriInfo object. This is injected at runtime for us via the @Context annotation. A curl command can be used to submit the JSON data of a user entity. The method invoked is as follows:

@POST
public Response add(User newUser, @Context UriInfo uriInfo)

We make use of the -d flag to send the JSON body in the request.
The POST request is implied:

curl -v -H 'Content-Type: application/json' http://localhost:8081/ims-users/resources/users -d '{"name": "james", "email":"[email protected]", "credential": {"username":"james","password":"test123"}}'
...
HTTP/1.1 201 Created
...
Location: http://localhost:8081/ims-users/resources/users/3

The 201 status code is sent by the API to signal that an entity has been created, and it also returns the location of the newly created entity. Here's the relevant snippet to do this:

// uriInfo is injected via the @Context parameter to this method
URI location = uriInfo.getAbsolutePathBuilder()
    .path(newUserId) // This is the new entity ID
    .build();
// To send the 201 status with the new Location
return Response.created(location).build();

Similarly, we can also send an update request using the PUT method. The method invoked is as follows:

@PUT
@Path("{id}")
public Response update(@PathParam("id") Long id, User existingUser)

curl -v -X PUT -H 'Content-Type: application/json' http://localhost:8081/ims-users/resources/users/3 -d '{"name": "jameson", "email":"[email protected]"}'
...
HTTP/1.1 200 Ok

The last method we need to map is the DELETE method, which is similar to the GET operation, with the only difference being the HTTP method used. The method invoked is as follows:

@DELETE
@Path("{id}")
public Response delete(@PathParam("id") Long id)

curl -v -X DELETE http://localhost:8081/ims-users/resources/users/3
...
HTTP/1.1 200 Ok

You can try out the Issues endpoint in a similar manner. For the GET requests of /users or /issues, the code simply fetches and returns a set of entity objects. But when requesting an item within this collection, the resource method has to look up the entity by the passed-in id value, captured by @PathParam("id"), and if found, return the entity, or else a 404 Not Found is returned. Here's a snippet showing just that:

final Optional<Issue> issueFound = service.get(id); // id obtained
if (issueFound.isPresent()) {
    return Response.ok(issueFound.get()).build();
}
return Response.status(Response.Status.NOT_FOUND).build();

The issue instance can be fetched from a database of issues, which the service object interacts with. The persistence layer can return a JPA entity object which gets converted to JSON for the calling code. We will look at persistence using JPA in a later section.

For the update request, which is sent as an HTTP PUT, the code captures the identifier using @PathParam("id"), similar to the previous GET operation, and then uses that to update the entity. The entity itself is submitted as JSON input and gets converted to an entity instance from the message body of the payload. Here's the code snippet for that:

@PUT
@Path("{id}")
public Response update(@PathParam("id") Long id, Issue updated) {
    updated.setId(id);
    boolean done = service.update(updated);
    return done
        ? Response.ok(updated).build()
        : Response.status(Response.Status.NOT_FOUND).build();
}

The code is simple to read and does one thing: it updates the identified entity and returns the response containing the updated entity, or a 404 for a non-existing entity. The service references that we have looked at so far are @Stateless beans which are injected into the resource class as fields:

// Project: ims-comments
@Stateless
public class CommentsService {... }

// Project: ims-issues
@Stateless
public class IssuesService {... }

// Project: ims-users
@Stateless
public class UsersService {... }

These will in turn have the EntityManager injected via @PersistenceContext.
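The curl calls above translate directly into Python. The following is a sketch rather than the book's code; it assumes the ims-users service is running on port 8081, that the requests package is installed, and the field values are illustrative only.

```python
# Sketch: the same CRUD calls as the curl commands above, using requests.
import requests

BASE = 'http://localhost:8081/ims-users/resources/users'

new_user = {
    'name': 'james',
    'email': 'james@example.com',   # illustrative address
    'credential': {'username': 'james', 'password': 'test123'},
}

created = requests.post(BASE, json=new_user)
print(created.status_code)                   # expect 201 Created
location = created.headers.get('Location')   # URI of the new user

print(requests.get(BASE).json())             # list all users
if location:
    requests.put(location, json={'name': 'jameson',
                                 'email': 'jameson@example.com'})
    print(requests.delete(location).status_code)  # expect 200
```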
Combined with the resource and service, our components have made the boundary ready for clients to use. Similar to the WebSockets section in Chapter 6, Power Your APIs with JAX-RS and CDI, in IMS we use a @ServerEndpoint which maintains the list of active sessions and then uses that to broadcast a message to all users who are connected. A ChatThread keeps track of the messages being exchanged through the @ServerEndpoint class. For the message to be sent, we use the stream of sessions and filter it by open sessions, then send the message for each of the sessions:

chatSessions.getSessions().stream().filter(Session::isOpen)
    .forEach(s -> {
        try {
            s.getBasicRemote().sendObject(chatMessage);
        } catch (Exception e) {...}
    });

To summarize, we saw in practice how to leverage the REST framework to build microservices.

This article is an excerpt from the book Java EE 8 and Angular, written by Prashant Padmanabhan. The book covers building modern, user-friendly web apps with Java EE.


Getting started with Django RESTful Web Services

Sugandha Lahoti
27 Mar 2018
19 min read
In this article, we will kick-start with learning RESTful Web Services and subsequently learn how to create models and perform migration, serialization, and deserialization in Django. Here's a quick glance at what to expect in this article:

Defining the requirements for RESTful Web Service
Analyzing and understanding Django tables and the database
Controlling serialization and deserialization in Django

Defining the requirements for RESTful Web Service

Imagine a team of developers working on a mobile app for iOS and Android that requires a RESTful Web Service to perform CRUD operations with toys. We definitely don't want to use a mock web service and we don't want to spend time choosing and configuring an ORM (short for Object-Relational Mapping). We want to quickly build a RESTful Web Service and have it ready as soon as possible to start interacting with it in the mobile app.

We really want the toys to persist in a database but we don't need it to be production-ready. Therefore, we can use the simplest possible relational database, as long as we don't have to spend time performing complex installations or configurations. Django REST framework, also known as DRF, will allow us to easily accomplish this task and start making HTTP requests to the first version of our RESTful Web Service. In this case, we will work with a very simple SQLite database, the default database for a new Django REST framework project.

First, we must specify the requirements for our main resource: a toy. We need the following attributes or fields for a toy entity:

An integer identifier
A name
An optional description
A toy category description, such as action figures, dolls, or playsets
A release date
A bool value indicating whether the toy has been on the online store's homepage at least once

In addition, we want to have a timestamp with the date and time of the toy's addition to the database table, which will be generated to persist toys.

In a RESTful Web Service, each resource has its own unique URL. In our web service, each toy will have its own unique URL. The following table shows the HTTP verbs, the scope, and the semantics of the methods that our first version of the web service must support. Each method is composed of an HTTP verb and a scope. All the methods have a well-defined meaning for toys and collections:

HTTP verb | Scope | Semantics
GET | Toy | Retrieve a single toy
GET | Collection of toys | Retrieve all the stored toys in the collection, sorted by their name in ascending order
POST | Collection of toys | Create a new toy in the collection
PUT | Toy | Update an existing toy
DELETE | Toy | Delete an existing toy

In the previous table, the GET HTTP verb appears twice but with two different scopes: toys and collection of toys. The first row shows a GET HTTP verb applied to a toy, that is, to a single resource. The second row shows a GET HTTP verb applied to a collection of toys, that is, to a collection of resources.

We want our web service to be able to differentiate collections from a single resource of the collection in the URLs. When we refer to a collection, we will use a slash (/) as the last character for the URL, as in http://localhost:8000/toys/. When we refer to a single resource of the collection, we won't use a slash (/) as the last character for the URL, as in http://localhost:8000/toys/5.

Let's consider that http://localhost:8000/toys/ is the URL for the collection of toys. If we add a number to the previous URL, we identify a specific toy with an ID or primary key equal to the specified numeric value.
For example, http://localhost:8000/toys/42 identifies the toy with an ID equal to 42.

We have to compose and send an HTTP request with the POST HTTP verb and the http://localhost:8000/toys/ request URL to create a new toy and add it to the toys collection. In this example, our RESTful Web Service will work with JSON (short for JavaScript Object Notation), and therefore we have to provide the JSON key-value pairs with the field names and the values to create the new toy. As a result of the request, the server will validate the provided values for the fields, make sure that it is a valid toy, and persist it in the database. The server will insert a new row with the new toy in the appropriate table and it will return a 201 Created status code and a JSON body with the recently added toy serialized to JSON, including the assigned ID that was automatically generated by the database and assigned to the toy object:

POST http://localhost:8000/toys/

We have to compose and send an HTTP request with the GET HTTP verb and the http://localhost:8000/toys/{id} request URL to retrieve the toy whose ID matches the specified numeric value in {id}. For example, if we use the request URL http://localhost:8000/toys/25, the server will retrieve the toy whose ID matches 25. As a result of the request, the server will retrieve a toy with the specified ID from the database and create the appropriate toy object in Python. If a toy is found, the server will serialize the toy object into JSON, return a 200 OK status code, and return a JSON body with the serialized toy object. If no toy matches the specified ID, the server will return only a 404 Not Found status:

GET http://localhost:8000/toys/{id}

We have to compose and send an HTTP request with the PUT HTTP verb and the request URL http://localhost:8000/toys/{id} to retrieve the toy whose ID matches the value in {id} and replace it with a toy created with the provided data. In addition, we have to provide the JSON key-value pairs with the field names and the values to create the new toy that will replace the existing one. As a result of the request, the server will validate the provided values for the fields, make sure that it is a valid toy, and replace the one that matches the specified ID with the new one in the database. The ID for the toy will be the same after the update operation. The server will update the existing row in the appropriate table and it will return a 200 OK status code and a JSON body with the recently updated toy serialized to JSON. If we don't provide all the necessary data for the new toy, the server will return a 400 Bad Request status code. If the server doesn't find a toy with the specified ID, the server will only return a 404 Not Found status:

PUT http://localhost:8000/toys/{id}

We have to compose and send an HTTP request with the DELETE HTTP verb and the request URL http://localhost:8000/toys/{id} to remove the toy whose ID matches the specified numeric value in {id}. For example, if we use the request URL http://localhost:8000/toys/34, the server will delete the toy whose ID matches 34. As a result of the request, the server will retrieve a toy with the specified ID from the database and create the appropriate toy object in Python. If a toy is found, the server will request that the ORM delete the toy row associated with this toy object and return a 204 No Content status code. If no toy matches the specified ID, the server will return only a 404 Not Found status:

DELETE http://localhost:8000/toys/{id}
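Taken together, the request/response contract above maps to client calls like the following sketch. It is not part of the book's code; it assumes the finished first version of the service is running at http://localhost:8000 with the requests package installed, the field values are illustrative, and pk refers to the primary-key field exposed by the serializer shown later in this article.

```python
# Sketch: exercising the toys collection and a single toy resource,
# following the verb/scope table above.
import requests

COLLECTION = 'http://localhost:8000/toys/'   # trailing slash = collection

new_toy = {
    'name': 'Action figure',
    'description': 'A sample toy',
    'toy_category': 'Action figures',
    'release_date': '2017-10-09T12:11:25.090335Z',
    'was_included_in_home': False,
}

created = requests.post(COLLECTION, json=new_toy)        # POST -> 201 Created
toy_id = created.json()['pk']                            # ID assigned by the database

print(requests.get(COLLECTION).json())                   # all toys, ordered by name
print(requests.get(f'{COLLECTION}{toy_id}').json())      # single toy -> 200 OK
requests.put(f'{COLLECTION}{toy_id}', json=new_toy)      # full update -> 200 OK
requests.delete(f'{COLLECTION}{toy_id}')                 # delete -> 204 No Content
```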
Creating the first Django model

Now, we will create a simple Toy model in Django, which we will use to represent and persist toys. Open the toys/models.py file. The following lines show the initial code for this file, with just one import statement and a comment that indicates we should create the models:

from django.db import models

# Create your models here.

The following lines show the new code that creates a Toy class, specifically, a Toy model, in the toys/models.py file. The code file for the sample is included in the hillar_django_restful_02_01 folder in the restful01/toys/models.py file:

from django.db import models


class Toy(models.Model):
    created = models.DateTimeField(auto_now_add=True)
    name = models.CharField(max_length=150, blank=False, default='')
    description = models.CharField(max_length=250, blank=True, default='')
    toy_category = models.CharField(max_length=200, blank=False, default='')
    release_date = models.DateTimeField()
    was_included_in_home = models.BooleanField(default=False)

    class Meta:
        ordering = ('name',)

The Toy class is a subclass of the django.db.models.Model class and defines the following attributes: created, name, description, toy_category, release_date, and was_included_in_home. Each of these attributes represents a database column or field. We specified the field types, maximum lengths, and defaults for many attributes. The class declares a Meta inner class that declares an ordering attribute and sets its value to a tuple of strings whose first value is the 'name' string. This way, the inner class indicates to Django that, by default, we want the results ordered by the name attribute in ascending order.

Running the initial migration

Now, it is necessary to create the initial migration for the new Toy model we recently coded. We will also synchronize the SQLite database for the first time. By default, Django uses the popular self-contained and embedded SQLite database, and therefore we don't need to make changes in the initial ORM configuration. In this example, we will be working with this default configuration. Of course, we will upgrade to another database after we have a sample web service built with Django. We will only use SQLite for this example.

We just need to run the following Python script in the virtual environment that we activated in the previous chapter. Make sure you are in the restful01 folder within the main folder for the virtual environment when you run the following command:

python manage.py makemigrations toys

The following lines show the output generated after running the previous command:

Migrations for 'toys':
  toys/migrations/0001_initial.py:
    - Create model Toy

The output indicates that the restful01/toys/migrations/0001_initial.py file includes the code to create the Toy model. The following lines show the code for this file that was automatically generated by Django.
The code file for the sample is included in the hillar_django_restful_02_01 folder in the restful01/toys/migrations/0001_initial.py file:

# -*- coding: utf-8 -*-
# Generated by Django 1.11.5 on 2017-10-08 05:19
from __future__ import unicode_literals

from django.db import migrations, models


class Migration(migrations.Migration):

    initial = True

    dependencies = [
    ]

    operations = [
        migrations.CreateModel(
            name='Toy',
            fields=[
                ('id', models.AutoField(auto_created=True, primary_key=True, serialize=False, verbose_name='ID')),
                ('created', models.DateTimeField(auto_now_add=True)),
                ('name', models.CharField(default='', max_length=150)),
                ('description', models.CharField(blank=True, default='', max_length=250)),
                ('toy_category', models.CharField(default='', max_length=200)),
                ('release_date', models.DateTimeField()),
                ('was_included_in_home', models.BooleanField(default=False)),
            ],
            options={
                'ordering': ('name',),
            },
        ),
    ]

Understanding migrations

The automatically generated code defines a subclass of the django.db.migrations.Migration class named Migration, which defines an operation that creates the Toy model's table and includes it in the operations attribute. The call to the migrations.CreateModel method specifies the model's name, the fields, and the options to instruct the ORM to create a table that will allow the underlying database to persist the model. The fields argument is a list of tuples that includes information about the field name, the field type, and additional attributes based on the data we provided in our model, that is, in the Toy class.

Now, run the following Python script to apply all the generated migrations. Make sure you are in the restful01 folder within the main folder for the virtual environment when you run the following command:

python manage.py migrate

The following lines show the output generated after running the previous command:

Operations to perform:
  Apply all migrations: admin, auth, contenttypes, sessions, toys
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying auth.0001_initial... OK
  Applying admin.0001_initial... OK
  Applying admin.0002_logentry_remove_auto_add... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying auth.0007_alter_validators_add_error_messages... OK
  Applying auth.0008_alter_user_username_max_length... OK
  Applying sessions.0001_initial... OK
  Applying toys.0001_initial... OK

After we run the previous command, we will notice that the root folder for our restful01 project now has a db.sqlite3 file that contains the SQLite database. We can use the SQLite command line or any other application that allows us to easily check the contents of the SQLite database to check the tables that Django generated.

The first migration will generate many tables required by Django and its installed apps before running the code that creates the table for the Toy model. These tables provide support for user authentication, permissions, groups, logs, and migration management. We will work with the models related to these tables after we add more features and security to our web services. After the migration process creates all these Django tables in the underlying database, the first migration runs the Python code that creates the table required to persist our model. Thus, the last line of the running migrations section displays Applying toys.0001_initial.
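At this point, a quick way to confirm that the Toy model and its table work as expected is to create a test instance from python manage.py shell. The following is a sketch rather than part of the book's code, and the field values are illustrative only.

```python
# Sketch: sanity-checking the Toy model from `python manage.py shell`
# after the migration has been applied.
from django.utils import timezone
from toys.models import Toy

toy = Toy.objects.create(
    name='Action figure',
    description='A sample toy used to verify the table',
    toy_category='Action figures',
    release_date=timezone.now(),
    was_included_in_home=False,
)
print(toy.pk, toy.created)   # auto-generated ID and creation timestamp
print(Toy.objects.count())   # 1
```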
Analyzing the database

In most modern Linux distributions and macOS, SQLite is already installed, and therefore you can run the sqlite3 command-line utility. In Windows, if you want to work with the sqlite3.exe command-line utility, you have to download the bundle of command-line tools for managing SQLite database files from the downloads section of the SQLite webpage at http://www.sqlite.org/download.html. For example, the ZIP file that includes the command-line tools for version 3.20.1 is sqlite-tools-win32-x86-3200100.zip. The name of the file changes with the SQLite version. You just need to make sure that you download the bundle of command-line tools and not the ZIP file that provides the SQLite DLLs. After you unzip the file, you can include the folder that contains the command-line tools in the PATH environment variable, or you can access the sqlite3.exe command-line utility by specifying the full path to it.

Run the following command to list the generated tables. The first argument, db.sqlite3, specifies the file that contains the SQLite database and the second argument indicates the command that we want the sqlite3 command-line utility to run against the specified database:

sqlite3 db.sqlite3 ".tables"

The following lines show the output for the previous command with the list of tables that Django generated in the SQLite database:

auth_group                  django_admin_log
auth_group_permissions      django_content_type
auth_permission             django_migrations
auth_user                   django_session
auth_user_groups            toys_toy
auth_user_user_permissions

The following command will allow you to check the contents of the toys_toy table after we compose and send HTTP requests to the RESTful Web Service and the web service makes CRUD operations to the toys_toy table:

sqlite3 db.sqlite3 "SELECT * FROM toys_toy ORDER BY name;"

Instead of working with the SQLite command-line utility, you can use a GUI tool to check the contents of the SQLite database. DB Browser for SQLite is a useful, free, multiplatform GUI tool that allows us to easily check the database contents of an SQLite database in Linux, macOS, and Windows. You can read more information about this tool and download its different versions from http://sqlitebrowser.org. Once you have installed the tool, you just need to open the db.sqlite3 file and you can check the database structure and browse the data for the different tables. After we start working with the first version of our web service, you need to check the contents of the toys_toy table with this tool.
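If you prefer to stay in Python rather than install extra tools, the standard library's sqlite3 module can run the same checks. This is a small sketch, not from the book; run it from the restful01 folder that contains db.sqlite3.

```python
# Sketch: inspecting the generated tables with Python's built-in sqlite3
# module instead of the sqlite3 command-line utility.
import sqlite3

connection = sqlite3.connect('db.sqlite3')
cursor = connection.cursor()

# Equivalent of the ".tables" command: list the tables in the database.
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")
print([row[0] for row in cursor.fetchall()])

# Equivalent of the SELECT shown above, once the web service has added toys.
cursor.execute("SELECT * FROM toys_toy ORDER BY name;")
print(cursor.fetchall())

connection.close()
```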
The SQLite database engine and the database file name are specified in the restful01/settings.py Python file. The following lines show the declaration of the DATABASES dictionary, which contains the settings for all the databases that Django uses. The nested dictionary maps the database named default with the django.db.backends.sqlite3 database engine and the db.sqlite3 database file located in the BASE_DIR folder (restful01):

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',
        'NAME': os.path.join(BASE_DIR, 'db.sqlite3'),
    }
}

After we execute the migrations, the SQLite database will have the following tables. Django uses prefixes to identify the modules and applications that each table belongs to. The tables that start with the auth_ prefix belong to the Django authentication module. The table that starts with the toys_ prefix belongs to our toys application. If we add more models to our toys application, Django will create new tables with the toys_ prefix:

auth_group: Stores authentication groups
auth_group_permissions: Stores permissions for authentication groups
auth_permission: Stores permissions for authentication
auth_user: Stores authentication users
auth_user_groups: Stores authentication user groups
auth_user_groups_permissions: Stores permissions for authentication user groups
django_admin_log: Stores the Django administrator log
django_content_type: Stores Django content types
django_migrations: Stores the scripts generated by Django migrations and the date and time at which they were applied
django_session: Stores Django sessions
toys_toy: Persists the Toy model
sqlite_sequence: Stores sequences for SQLite primary keys with autoincrement fields

Understanding the table generated by Django

The toys_toy table persists in the database the Toy class we recently created, specifically, the Toy model. Django's integrated ORM generated the toys_toy table based on our Toy model. Run the following command to retrieve the SQL used to create the toys_toy table:

sqlite3 db.sqlite3 ".schema toys_toy"

The following lines show the output for the previous command together with the SQL that the migrations process executed to create the toys_toy table that persists the Toy model. The next lines are formatted to make it easier to understand the SQL code. Notice that the output from the command is formatted in a different way:

CREATE TABLE IF NOT EXISTS "toys_toy" (
    "id" integer NOT NULL PRIMARY KEY AUTOINCREMENT,
    "created" datetime NOT NULL,
    "name" varchar(150) NOT NULL,
    "description" varchar(250) NOT NULL,
    "toy_category" varchar(200) NOT NULL,
    "release_date" datetime NOT NULL,
    "was_included_in_home" bool NOT NULL
);

The toys_toy table has the following columns (also known as fields) with their SQLite types, all of them not nullable:

id: The integer primary key, an autoincrement field
created: datetime
name: varchar(150)
description: varchar(250)
toy_category: varchar(200)
release_date: datetime
was_included_in_home: bool

Controlling serialization and deserialization

Our RESTful Web Service has to be able to serialize and deserialize the Toy instances into JSON representations. In Django REST framework, we just need to create a serializer class for the Toy instances to manage serialization to JSON and deserialization from JSON. Now, we will dive deep into the serialization and deserialization process in Django REST framework. It is very important to understand how it works because it is one of the most important components for all the RESTful Web Services we will build.

Django REST framework uses a two-phase process for serialization. The serializers are mediators between the model instances and Python primitives. Parsers and renderers act as mediators between Python primitives and HTTP requests and responses. We will configure our mediator between the Toy model instances and Python primitives by creating a subclass of the rest_framework.serializers.Serializer class to declare the fields and the necessary methods to manage serialization and deserialization.

We will repeat some of the information about the fields that we have included in the Toy model so that we understand all the things that we can configure in a subclass of the Serializer class. However, we will work with shortcuts, which will reduce boilerplate code, later in the following examples.
We will write less code in the following examples by using the ModelSerializer class. Now, go to the restful01/toys folder and create a new Python code file named serializers.py. The following lines show the code that declares the new ToySerializer class. The code file for the sample is included in the hillar_django_restful_02_01 folder, in the restful01/toys/serializers.py file:

from rest_framework import serializers
from toys.models import Toy


class ToySerializer(serializers.Serializer):
    pk = serializers.IntegerField(read_only=True)
    name = serializers.CharField(max_length=150)
    description = serializers.CharField(max_length=250)
    release_date = serializers.DateTimeField()
    toy_category = serializers.CharField(max_length=200)
    was_included_in_home = serializers.BooleanField(required=False)

    def create(self, validated_data):
        return Toy.objects.create(**validated_data)

    def update(self, instance, validated_data):
        instance.name = validated_data.get('name', instance.name)
        instance.description = validated_data.get('description', instance.description)
        instance.release_date = validated_data.get('release_date', instance.release_date)
        instance.toy_category = validated_data.get('toy_category', instance.toy_category)
        instance.was_included_in_home = validated_data.get('was_included_in_home', instance.was_included_in_home)
        instance.save()
        return instance

The ToySerializer class declares the attributes that represent the fields that we want to be serialized. Notice that we have omitted the created attribute that was present in the Toy model. When there is a call to the save method that ToySerializer inherits from the serializers.Serializer superclass, the overridden create and update methods define how to create a new instance or update an existing instance. In fact, these methods must be implemented in our class because they only raise a NotImplementedError exception in their base declaration in the serializers.Serializer superclass.

The create method receives the validated data in the validated_data argument. The code creates and returns a new Toy instance based on the received validated data. The update method receives an existing Toy instance that is being updated and the new validated data in the instance and validated_data arguments. The code updates the values of the attributes of the instance with the updated attribute values retrieved from the validated data. Finally, the code calls the save method on the updated Toy instance and returns the updated and saved instance.

We designed a RESTful Web Service to interact with a simple SQLite database and perform CRUD operations with toys. We defined the requirements for our web service and understood the tasks performed by each HTTP method and different scope.

You read an excerpt from the book Django RESTful Web Services, written by Gaston C. Hillar. This book helps developers build complex RESTful APIs from scratch with Django and the Django REST Framework. The code bundle for the article is hosted on GitHub.
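Returning to the ToySerializer defined above, the following sketch shows one way to exercise it from python manage.py shell, serializing a Toy to JSON and parsing JSON back into validated data. It is not part of the book's code bundle, and the field values are illustrative only.

```python
# Sketch: a serialization/deserialization round trip with ToySerializer,
# using DRF's JSONRenderer and JSONParser.
from io import BytesIO

from django.utils import timezone
from rest_framework.renderers import JSONRenderer
from rest_framework.parsers import JSONParser

from toys.models import Toy
from toys.serializers import ToySerializer

toy = Toy.objects.create(
    name='Action figure',
    description='Sample toy',
    toy_category='Action figures',
    release_date=timezone.now(),
)

# Model instance -> Python primitives -> JSON bytes
serializer = ToySerializer(toy)
json_bytes = JSONRenderer().render(serializer.data)
print(json_bytes)

# JSON bytes -> Python primitives -> validated data -> new Toy instance
parsed = JSONParser().parse(BytesIO(json_bytes))
deserialized = ToySerializer(data=parsed)
if deserialized.is_valid():
    new_toy = deserialized.save()   # calls the create method defined above
    print(new_toy.pk)
```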

How to perform full-text search (FTS) in PostgreSQL

Sugandha Lahoti
27 Mar 2018
8 min read
[box type="note" align="" class="" width=""]This article is an excerpt from the book, Mastering  PostgreSQL 10, written by Hans-Jürgen Schönig. This book provides expert techniques on PostgreSQL 10 development and administration.[/box] If you are looking up names or for simple strings, you are usually querying the entire content of a field. In Full-Text-Search (FTS), this is different. The purpose of the full-text search is to look for words or groups of words, which can be found in a text. Therefore, FTS is more of a contains operation as you are basically never looking for an exact string. In this article, we will show how to perform a full-text search operation in PostgreSQL. In PostgreSQL, FTS can be done using GIN indexes. The idea is to dissect a text, extract valuable lexemes (= "preprocessed tokens of words"), and index those elements rather than the underlying text. To make your search even more successful, those words are preprocessed. Here is an example: test=# SELECT to_tsvector('english', 'A car, I want a car. I would not even mind having many cars'); to_tsvector --------------------------------------------------------------- 'car':2,6,14 'even':10 'mani':13 'mind':11 'want':4 'would':8 (1 row) The example shows a simple sentence. The to_tsvector function will take the string, apply English rules, and perform a stemming process. Based on the configuration (english), PostgreSQL will parse the string, throw away stop words, and stem individual words. For example, car and cars will be transformed to the car. Note that this is not about finding the word stem. In the case of many, PostgreSQL will simply transform the string to mani by applying standard rules working nicely with the English language. Note that the output of the to_tsvector function is highly language dependent. If you tell PostgreSQL to treat the string as dutch, the result will be totally different: test=# SELECT to_tsvector('dutch', 'A car, I want a car. I would not even mind having many cars'); to_tsvector ----------------------------------------------------------------- 'a':1,5 'car':2,6,14 'even':10 'having':12 'i':3,7 'many':13 'mind':11 'not':9 'would':8 (1 row) To figure out which configurations are supported, consider running the following query: SELECT cfgname FROM pg_ts_config; Comparing strings After taking a brief look at the stemming process, it is time to figure out how a stemmed text can be compared to a user query. The following code snippet checks for the word wanted: test=# SELECT to_tsvector('english', 'A car, I want a car. I would not even mind having many cars') @@ to_tsquery('english', 'wanted'); ?column? ---------- t (1 row) Note that wanted does not actually show up in the original text. Still, PostgreSQL will return true. The reason is that want and wanted are both transformed to the same lexeme, so the result is true. Practically, this makes a lot of sense. Imagine you are looking for a car on Google. If you find pages selling cars, this is totally fine. Finding common lexemes is, therefore, an intelligent idea. Sometimes, people are not only looking for a single word, but want to find a set of words. With to_tsquery, this is possible, as shown in the next example: test=# SELECT to_tsvector('english', 'A car, I want a car. I would not even mind having many cars') @@ to_tsquery('english', 'wanted & bmw'); ?column? ---------- f (1 row) In this case, false is returned because bmw cannot be found in our input string. In the to_tsquery function, & means and and | means or. 
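As a quick illustration of combining these operators on the same sample sentence (the result shown below is what you should expect to see, not output copied from the excerpt):

test=# SELECT to_tsvector('english', 'A car, I want a car. I would not even mind having many cars')
           @@ to_tsquery('english', 'wanted | bmw');
 ?column?
----------
 t
(1 row)

Because | means or, the match on the want lexeme is enough, even though bmw is still missing from the text.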
It is therefore easily possible to build complex search strings. Defining GIN indexes If you want to apply text search to a column or a group of columns, there are basically two choices: Create a functional index using GIN Add a column containing ready-to-use tsvectors and a trigger to keep them in sync In this section, both options will be outlined. To show how things work, I have created some sample data: test=# CREATE TABLE t_fts AS SELECT comment FROM pg_available_extensions; SELECT 43 Indexing the column directly with a functional index is definitely a slower but more space efficient way to get things done: test=# CREATE INDEX idx_fts_func ON t_fts USING gin(to_tsvector('english', comment)); CREATE INDEX Deploying an index on the function is easy, but it can lead to some overhead. Adding a materialized column needs more space, but will lead to a better runtime behavior: test=# ALTER TABLE t_fts ADD COLUMN ts tsvector; ALTER TABLE The only trouble is, how do you keep this column in sync? The answer is by using a trigger: test=# CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE ON t_fts FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger(somename, 'pg_catalog.english', 'comment'); Fortunately, PostgreSQL already provides a C function that can be used by a trigger to sync the tsvector column. Just pass a name, the desired language, as well as a couple of columns to the function, and you are already done. The trigger function will take care of all that is needed. Note that a trigger will always operate within the same transaction as the statement making the modification. Therefore, there is no risk of being inconsistent. Debugging your search Sometimes, it is not quite clear why a query matches a given search string. To debug your query, PostgreSQL offers the ts_debug function. From a user's point of view, it can be used just like to_tsvector. It reveals a lot about the inner workings of the FTS infrastructure: test=# x Expanded display is on. test=# SELECT * FROM ts_debug('english', 'go to www.postgresql-support.de'); -[ RECORD 1 ]+---------------------------- alias  | asciiword description | Word, all ASCII token      | go dictionaries | {english_stem} dictionary           | english_stem lexemes    | {go} -[ RECORD 2 ]+---------------------------- alias  | blank Description | Space symbols token   |         dictionaries | {}         dictionary     |         lexemes       | -[ RECORD 3 ]+---------------------------- alias  | asciiword description | Word, all ASCII token      | to dictionaries | {english_stem} dictionary   | english_stem lexemes    | {} -[ RECORD 4 ]+---------------------------- alias  | blank description | Space symbols token | dictionaries | {} dictionary   | lexemes          | -[ RECORD 5 ]+---------------------------- alias  | host description | Host token      | www.postgresql-support.de dictionaries | {simple} dictionary | simple lexemes    | {www.postgresql-support.de} ts_debug will list every token found and display information about the token. You will see which token the parser found, the dictionary used, as well as the type of object. In my example, blanks, words, and hosts have been found. You might also see numbers, email addresses, and a lot more. Depending on the type of string, PostgreSQL will handle things differently. For example, it makes absolutely no sense to stem hostnames and e-mail addresses. Gathering word statistics Full-text search can handle a lot of data. 
To give end users more insights into their texts, PostgreSQL offers the ts_stat function, which returns a list of words:

SELECT * FROM ts_stat('SELECT to_tsvector(''english'', comment)
                       FROM pg_available_extensions')
ORDER BY 2 DESC
LIMIT 3;

   word   | ndoc | nentry
----------+------+--------
 function |   10 |     10
 data     |   10 |     10
 type     |    7 |      7
(3 rows)

The word column contains the stemmed word, ndoc tells us the number of documents a certain word occurs in, and nentry indicates how often a word was found altogether.

Taking advantage of exclusion operators

So far, indexes have been used to speed things up and to ensure uniqueness. However, a couple of years ago, somebody came up with the idea of using indexes for even more. As you have seen in this chapter, GiST supports operations such as intersects, overlaps, contains, and a lot more. So, why not use those operations to manage data integrity? Here is an example:

test=# CREATE EXTENSION btree_gist;
test=# CREATE TABLE t_reservation (
    room int,
    from_to tsrange,
    EXCLUDE USING GiST (room with =, from_to with &&)
);
CREATE TABLE

The EXCLUDE USING GiST clause defines additional constraints. If you are selling rooms, you might want to allow different rooms to be booked at the same time. However, you don't want to sell the same room twice during the same period. What the EXCLUDE clause says in my example is this: if a room is booked twice at the same time, an error should pop up (the values in from_to must not overlap (&&) if they relate to the same room). The following two rows will not violate constraints:

test=# INSERT INTO t_reservation VALUES (10, '["2017-01-01", "2017-03-03"]');
INSERT 0 1
test=# INSERT INTO t_reservation VALUES (13, '["2017-01-01", "2017-03-03"]');
INSERT 0 1

However, the next INSERT will cause a violation because the data overlaps:

test=# INSERT INTO t_reservation VALUES (13, '["2017-02-02", "2017-08-14"]');
ERROR: conflicting key value violates exclusion constraint "t_reservation_room_from_to_excl"
DETAIL: Key (room, from_to)=(13, ["2017-02-02 00:00:00","2017-08-14 00:00:00"]) conflicts with existing key (room, from_to)=(13, ["2017-01-01 00:00:00","2017-03-03 00:00:00"]).

The use of exclusion operators is very useful and can provide you with highly advanced means to handle integrity. To summarize, we learnt how to perform full-text search operations in PostgreSQL. If you liked our article, check out the book Mastering PostgreSQL 10 to understand how to perform operations such as indexing, query optimization, concurrent transactions, table partitioning, server tuning, and more.

Creating 2D and 3D plots using Matplotlib

Pravin Dhandre
22 Mar 2018
10 min read
[box type="note" align="" class="" width=""]This article is an excerpt from a book written by L. Felipe Martins, Ruben Oliva Ramos and V Kishore Ayyadevara titled SciPy Recipes. This book provides data science recipes for users to effectively process, manipulate, and visualize massive datasets using SciPy.[/box] In today’s tutorial, we will demonstrate how to create two-dimensional and three-dimensional plots for displaying graphical representation of data using a full-fledged scientific library -  Matplotlib. Creating two-dimensional plots of functions and data We will present the basic kind of plot generated by Matplotlib: a two-dimensional display, with axes, where datasets and functional relationships are represented by lines. Besides the data being displayed, a good graph will contain a title (caption), axes labels, and, perhaps, a legend identifying each line in the plot. Getting ready Start Jupyter and run the following commands in an execution cell: %matplotlib inline import numpy as np import matplotlib.pyplot as plt How to do it… Run the following code in a single Jupyter cell: xvalues = np.linspace(-np.pi, np.pi) yvalues1 = np.sin(xvalues) yvalues2 = np.cos(xvalues) plt.plot(xvalues, yvalues1, lw=2, color='red', label='sin(x)') plt.plot(xvalues, yvalues2, lw=2, color='blue', label='cos(x)') plt.title('Trigonometric Functions') plt.xlabel('x') plt.ylabel('sin(x), cos(x)') plt.axhline(0, lw=0.5, color='black') plt.axvline(0, lw=0.5, color='black') plt.legend() None This code will insert the plot shown in the following screenshot into the Jupyter Notebook: How it works… We start by generating the data to be plotted, with the three following statements: xvalues = np.linspace(-np.pi, np.pi, 300) yvalues1 = np.sin(xvalues) yvalues2 = np.cos(xvalues) We first create an xvalues array, containing 300 equally spaced values between -π and π. We then compute the sine and cosine functions of the values in xvalues, storing the results in the yvalues1 and yvalues2 arrays. Next, we generate the first line plot with the following statement: plt.plot(xvalues, yvalues1, lw=2, color='red', label='sin(x)') The arguments to the plot() function are described as follows: xvalues and yvalues1 are arrays containing, respectively, the x and y coordinates of the points to be plotted. These arrays must have the same length. The remaining arguments are formatting options. lw specifies the line width and color the line color. The label argument is used by the legend() function, discussed as follows. The next line of code generates the second line plot and is similar to the one explained previously. After the line plots are defined, we set the title for the plot and the legends for the axes with the following commands: plt.title('Trigonometric Functions') plt.xlabel('x') plt.ylabel('sin(x), cos(x)') We now generate axis lines with the following statements: plt.axhline(0, lw=0.5, color='black') plt.axvline(0, lw=0.5, color='black') The first arguments in axhline() and axvline() are the locations of the axis lines and the options specify the line width and color. We then add a legend for the plot with the following statement: plt.legend() Matplotlib tries to place the legend intelligently, so that it does not interfere with the plot. In the legend, one item is being generated by each call to the plot() function and the text for each legend is specified in the label option of the plot() function. 
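If you are working outside the Jupyter Notebook, the same recipe can be packaged as a small standalone script. The following is a sketch rather than code from the book; the only real changes are dropping the %matplotlib inline magic and calling plt.show() (or plt.savefig()) at the end:

import numpy as np
import matplotlib.pyplot as plt

# Same data as in the recipe: 300 points between -pi and pi
xvalues = np.linspace(-np.pi, np.pi, 300)
yvalues1 = np.sin(xvalues)
yvalues2 = np.cos(xvalues)

plt.plot(xvalues, yvalues1, lw=2, color='red', label='sin(x)')
plt.plot(xvalues, yvalues2, lw=2, color='blue', label='cos(x)')
plt.title('Trigonometric Functions')
plt.xlabel('x')
plt.ylabel('sin(x), cos(x)')
plt.axhline(0, lw=0.5, color='black')
plt.axvline(0, lw=0.5, color='black')
plt.legend()

# Opens an interactive window; use plt.savefig('trig.png') to write a file instead
plt.show()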
Generating multiple plots in a single figure Wouldn't it be interesting to know how to generate multiple plots in a single figure? Well, let's get started with that. Getting ready Start Jupyter and run the following three commands in an execution cell: %matplotlib inline import numpy as np import matplotlib.pyplot as plt How to do it… Run the following commands in a Jupyter cell: plt.figure(figsize=(6,6)) xvalues = np.linspace(-2, 2, 100) plt.subplot(2, 2, 1) yvalues = xvalues plt.plot(xvalues, yvalues, color='blue') plt.xlabel('$x$') plt.ylabel('$x$') plt.subplot(2, 2, 2) yvalues = xvalues ** 2 plt.plot(xvalues, yvalues, color='green') plt.xlabel('$x$') plt.ylabel('$x^2$') plt.subplot(2, 2, 3) yvalues = xvalues ** 3 plt.plot(xvalues, yvalues, color='red') plt.xlabel('$x$') plt.ylabel('$x^3$') plt.subplot(2, 2, 4) yvalues = xvalues ** 4 plt.plot(xvalues, yvalues, color='black') plt.xlabel('$x$') plt.ylabel('$x^3$') plt.suptitle('Polynomial Functions') plt.tight_layout() plt.subplots_adjust(top=0.90) None Running this code will produce results like those in the following screenshot: How it works… To start the plotting constructions, we use the figure() function, as shown in the following line of code: plt.figure(figsize=(6,6)) The main purpose of this call is to set the figure size, which needs adjustment, since we plan to make several plots in the same figure. After creating the figure, we add four plots with code, as demonstrated in the following segment: plt.subplot(2, 2, 3) yvalues = xvalues ** 3 plt.plot(xvalues, yvalues, color='red') plt.xlabel('$x$') plt.ylabel('$x^3$') In the first line, the plt.subplot(2, 2, 3) call tells pyplot that we want to organize the plots in a two-by-two layout, that is, in two rows and two columns. The last argument specifies that all following plotting commands should apply to the third plot in the array. Individual plots are numbered, starting with the value 1 and counting across the rows and columns of the plot layout. We then generate the line plot with the following statements: yvalues = xvalues ** 3 plt.plot(xvalues, yvalues, color='red') The first line of the preceding code computes the yvalues array, and the second draws the corresponding graph. Notice that we must set options such as line color individually for each subplot. After the line is plotted, we use the xlabel() and ylabel() functions to create labels for the axes. Notice that these have to be set up for each individual subplot too. After creating the subplots, we explain the subplots: plt.suptitle('Polynomial Functions') sets a common title for all Subplots plt.tight_layout() adjusts the area taken by each subplot, so that axes' legends do not overlap plt.subplots_adjust(top=0.90) adjusts the overall area taken by the plots, so that the title displays correctly Creating three-dimensional plots Matplotlib offers several different ways to visualize three-dimensional data. 
In this recipe, we will demonstrate the following methods: Drawing surfaces plots Drawing two-dimensional contour plots Using color maps and color bars Getting ready Start Jupyter and run the following three commands in an execution cell: %matplotlib inline import numpy as np import matplotlib.pyplot as plt How to do it… Run the following code in a Jupyter code cell: from mpl_toolkits.mplot3d import Axes3D from matplotlib import cm f = lambda x,y: x**3 - 3*x*y**2 fig = plt.figure(figsize=(12,6)) ax = fig.add_subplot(1,2,1,projection='3d') xvalues = np.linspace(-2,2,100) yvalues = np.linspace(-2,2,100) xgrid, ygrid = np.meshgrid(xvalues, yvalues) zvalues = f(xgrid, ygrid) surf = ax.plot_surface(xgrid, ygrid, zvalues, rstride=5, cstride=5, linewidth=0, cmap=cm.plasma) ax = fig.add_subplot(1,2,2) plt.contourf(xgrid, ygrid, zvalues, 30, cmap=cm.plasma) fig.colorbar(surf, aspect=18) plt.tight_layout() None Running this code will produce a plot of the monkey saddle surface, which is a famous example of a surface with a non-standard critical point. The displayed graph is shown in the following screenshot: How it works… We start by importing the Axes3D class from the mpl_toolkits.mplot3d library, which is the Matplotlib object used for creating three-dimensional plots. We also import the cm class, which represents a color map. We then define a function to be plotted, with the following line of code: f = lambda x,y: x**3 - 3*x*y**2 The next step is to define the Figure object and an Axes object with a 3D projection, as done in the following lines of code: fig = plt.figure(figsize=(12,6)) ax = fig.add_subplot(1,2,1,projection='3d') Notice that the approach used here is somewhat different than the other recipes in this chapter. We are assigning the output of the figure() function call to the fig variable and then adding the subplot by calling the add_subplot() method from the fig object. This is the recommended method of creating a three-dimensional plot in the most recent version of Matplotlib. Even in the case of a single plot, the add_subplot() method should be used, in which case the command would be ax = fig.add_subplot(1,1,1,projection='3d'). The next few lines of code, shown as follows, compute the data for the plot: xvalues = np.linspace(-2,2,100) yvalues = np.linspace(-2,2,100) xgrid, ygrid = np.meshgrid(xvalues, yvalues) zvalues = f(xgrid, ygrid) The most important feature of this code is the call to meshgrid(). This is a NumPy convenience function that constructs grids suitable for three-dimensional surface plots. To understand how this function works, run the following code: xvec = np.arange(0, 4) yvec = np.arange(0, 3) xgrid, ygrid = np.meshgrid(xvec, yvec) After running this code, the xgrid array will contain the following values: array([[0, 1, 2, 3], [0, 1, 2, 3], [0, 1, 2, 3]]) The ygrid array will contain the following values: array([[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]) Notice that the two arrays have the same dimensions. Each grid point is represented by a pair of the (xgrid[i,j],ygrid[i,j]) type. This convention makes the computation of a vectorized function on a grid easy and efficient, with the f(xgrid, ygrid) expression. The next step is to generate the surface plot, which is done with the following function call: surf = ax.plot_surface(xgrid, ygrid, zvalues, rstride=5, cstride=5, linewidth=0, cmap=cm.plasma) The first three arguments, xgrid, ygrid, and zvalues, specify the data to be plotted. 
We then use the rstride and cstride options to select a subset of the grid points. Notice that the xvalues and yvalues arrays both have length 100, so that xgrid and ygrid will have 10,000 entries each. Using all grid points would be inefficient and produce a poor plot from the visualization point of view. Thus, we set rstride=5 and cstride=5, which results in a plot containing every fifth point across each row and column of the grid. The next option, linewidth=0, sets the line width of the plot to zero, preventing the display of a wireframe. The final argument, cmap=cm.plasma, specifies the color map for the plot. We use the cm.plasma color map, which has the effect of plotting higher functional values with a hotter color. Matplotlib offer as large number of built-in color maps, listed at https:/​/​matplotlib.​org/​examples/​color/​colormaps_​reference.​html.​ Next, we add the filled contour plot with the following code: ax = fig.add_subplot(1,2,2) ax.contourf(xgrid, ygrid, zvalues, 30, cmap=cm.plasma) Notice that, when selecting the subplot, we do not specify the projection option, which is not necessary for two-dimensional plots. The contour plot is generated with the contourf() method. The first three arguments, xgrid, ygrid, zvalues, specify the data points, and the fourth argument, 30, sets the number of contours. Finally, we set the color map to be the same one used for the surface plot. The final component of the plot is a color bar, which provides a visual representation of the value associated with each color in the plot, with the fig.colorbar(surf, aspect=18) method call. Notice that we have to specify in the first argument which plot the color bar is associated to. The aspect=18 option is used to adjust the aspect ratio of the bar. Larger values will result in a narrower bar. To finish the plot, we call the tight_layout() function. This adjusts the sizes of each plot, so that axis labels are displayed correctly.   We generated 2D and 3D plots using Matplotlib and represented the results of technical computation in graphical manner. If you want to explore other types of plots such as scatter plot or bar chart, you may read Visualizing 3D plots in Matplotlib 2.0. Do check out the book SciPy Recipes to take advantage of other libraries of the SciPy stack and perform matrices, data wrangling and advanced computations with ease.

How to secure data in Salesforce Einstein Analytics

Amey Varangaonkar
22 Mar 2018
5 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book includes techniques to build effective dashboards and Business Intelligence metrics to gain useful insights from data.[/box]

Before getting into security in Einstein Analytics, it is important to set up your organization and define user types so that the platform is ready to use. In this article we explore key aspects of security in Einstein Analytics. The following are key points to consider for data security in Salesforce:

- Salesforce admins can restrict access to data by setting up field-level security and object-level security in Salesforce. These settings prevent dataflows from loading sensitive Salesforce data into a dataset.
- Dataset owners can restrict data access by using row-level security.
- Analytics supports security predicates, a robust row-level security feature that enables you to model many different types of access control on datasets.
- Analytics also supports sharing inheritance.

Take a look at the following diagram:

Salesforce data security

In Einstein Analytics, dataflows bring the data to the Analytics Cloud from Salesforce. It is important that Einstein Analytics has all the necessary permissions and access to objects as well as fields. If an object or a field is not accessible to Einstein, the dataflow fails and it cannot extract data from Salesforce. So we need to make sure that the required access is given to the integration user and the security user. We can configure the permission set for these users. Let's configure permissions for an integration user by performing the following steps:

1. Switch to classic mode and enter Profiles in the Quick Find / Search… box.
2. Select and clone the Analytics Cloud Integration User profile and the Analytics Cloud Security User profile for the integration user and security user respectively.
3. Save the cloned profiles and then edit them.
4. Set the permission to Read for all objects and fields.
5. Save the profile and assign it to users.

Take a look at the following diagram:

Data pulled from Salesforce can be made secure from both sides: Salesforce as well as Einstein Analytics. It is important to understand that Salesforce and Einstein Analytics are two independent databases, so a user security setting given to Einstein will not affect the data in Salesforce. There are the following ways to secure data pulled from Salesforce:

Salesforce Security | Einstein Analytics Security
Roles and profiles | Inheritance security
Organization-Wide Defaults (OWD) and record ownership | Security predicates
Sharing rules | Application-level security

Sharing mechanism in Einstein

All Analytics users start off with Viewer access to the default Shared App that's available out-of-the-box; administrators can change this default setting to restrict or extend access. All other applications created by individual users are private, by default; the application owner and administrators have Manager access and can extend access to other users, groups, or roles. The following diagram shows how the sharing mechanism works in Einstein Analytics.

Here's a summary of what users can do with Viewer, Editor, and Manager access:

- View dashboards, lenses, and datasets in the application (if the underlying dataset is in a different application than a lens or dashboard, the user must have access to both applications to view the lens or dashboard): Viewer, Editor, Manager
- See who has access to the application: Viewer, Editor, Manager
- Save contents of the application to another application that the user has Editor or Manager access to: Viewer, Editor, Manager
- Save changes to existing dashboards, lenses, and datasets in the application (saving dashboards requires the appropriate permission set license and permission): Editor, Manager
- Change the application's sharing settings: Manager
- Rename the application: Manager
- Delete the application: Manager

Confidentiality, integrity, and availability together are referred to as the CIA Triad, and it is designed to help organizations decide what security policies to implement within the organization. Salesforce knows that keeping information private and restricting access by unauthorized users is essential for business. By sharing the application, we can share a lens, dashboard, and dataset all together with one click. To share the entire application, do the following:

1. Go to your Einstein Analytics and then to Analytics Studio.
2. Click on the APPS tab and then the icon for the application that you want to share, as shown in the following screenshot.
3. Click on Share and it will open a new popup window, as shown in the following screenshot. Using this window, you can share the application with an individual user, a group of users, or a particular role. You can define the access level as Viewer, Editor, or Manager.
4. After selecting User, click on the user you wish to add and click on Add.
5. Save and then close the popup.

And that's it. It's done.

Mass-sharing the application

Sometimes, we are required to share the application with a wide audience. There are multiple approaches to mass-sharing the Wave application, such as by role or by username:

1. In the Salesforce classic UI, navigate to Setup | Public Groups | New.
2. For example, to share a sales application, label a public group as Analytics_Sales_Group.
3. Search and add users to the group by Role, Roles and Subordinates, or by Users (username).
4. Search for the Analytics_Sales public group.
5. Add the Viewer option as shown in the following screenshot.
6. Click on Save.

Protecting data from breaches, theft, or from any unauthorized user is very important, and we saw that Einstein Analytics provides the necessary tools to ensure the data is secure. If you found this excerpt useful and want to know more about securing your analytics in Einstein, make sure to check out this book Learning Einstein Analytics.

How to embed Einstein dashboards on Salesforce Classic

Amey Varangaonkar
21 Mar 2018
5 min read
[box type="note" align="" class="" width=""]The following excerpt is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book highlights the key techniques and know-how to unlock critical insights from your data using Salesforce Einstein Analytics.[/box] With Einstein Analytics, users have the power to embed their dashboards on various third-party applications and even on their web applications. In this article, we will show how to embed an Einstein dashboard on Salesforce Classic. In order to start embedding the dashboard, let's create a sample dashboard by performing the following steps: Navigate to Analytics Studio | Create | Dashboard. Add three chart widgets on the dashboard. Click on the Chart button in the middle and select the Opportunity dataset. Select Measures as Sum of Amount and select BillingCountry under Group by. Click on Done. Repeat the second step for the second widget, but select Account Source under Group by and make it a donut chart. Repeat the second step for the third widget but select Stage under Group by and make it a funnel chart. Click on Save (s) and enter Embedding Opportunities in the title field, as shown in the following screenshot: Now that we have created a dashboard, let's embed this dashboard in Salesforce Classic. In order to start embedding the dashboard, exit from the Einstein Analytics platform and go to Classic mode. The user can embed the dashboard on the record detail page layout in Salesforce Classic. The user can view the dashboard, drill in, and apply a filter, just like in the Einstein Analytics window. Let's add the dashboard to the account detail page by performing the following steps: Navigate to Setup | Customize | Accounts | Page Layouts as shown in the following screenshot: Click on Edit of Account Layout and it will open a page layout editor which has two parts: a palette on the upper portion of the screen, and the page layout on the lower portion of the screen. The palette contains the user interface elements that you can add to your page layout, such as Fields, Buttons, Links, and Actions, and Related Lists, as shown in the following screenshot: Click on the Wave Analytics Assets option from the palette and you can see all the dashboards on the right-side panel. Drag and drop a section onto the page layout, name it Einstein Dashboard, and click on OK. Drag and drop the dashboard which you wish to add to the record detail page. We are going to add Embedded Opportunities. Click on Save. Go to any accounting record and you should see a new section within the dashboard: Users can easily configure the embedded dashboards by using attributes. To access the dashboard properties, go to edit page layout again, and go to the section where we added the dashboard to the layout. Hover over the dashboard and click on the Tool icon. It will open an Asset Properties window: The Asset Properties window gives the user the option to change the following features: Width (in pixels or %): This feature allows you to adjust the width of the dashboard section. Height (in pixels): This feature allows you to adjust the height of the dashboard section. Show Title: This feature allows you to display or hide the title of the dashboard. Show Sharing Icon: Using this feature, by default, the share icon is disabled. The Show Sharing Icon option gives the user a flexibility to include the share icon on the dashboard. Show Header: This feature allows you to display or hide the header. 
Hide on error: This feature gives you control over whether the Analytics asset appears if there is an error.
Field mapping: Last but not least, field mapping is used to filter the relevant data to the record on the dashboard. To set up the dashboard to show only the data that's relevant to the record being viewed, use field mapping. Field mapping links data fields in the dashboard to the object's fields. We are using the Embedded Opportunity dashboard; let's add field mapping to it. The following is the format for field mapping:

{
  "datasets": {
    "datasetName": [{
      "fields": ["Actual Field name from object"],
      "filter": {"operator": "matches", "values": ["$dataset Fieldname"]}
    }]
  }
}

Let's add field mapping for account by using the following format:

{
  "datasets": {
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}

If your dashboard uses multiple datasets, then you can use the following format:

{
  "datasets": {
    "datasetName1": [{
      "fields": ["Actual Field name from object"],
      "filter": {"operator": "matches", "values": ["$dataset1 Fieldname"]}
    }],
    "datasetName2": [{
      "fields": ["Actual Field name from object"],
      "filter": {"operator": "matches", "values": ["$dataset2 Fieldname"]}
    }]
  }
}

Let's add field mapping for account and opportunities:

{
  "datasets": {
    "Opportunities": [{
      "fields": ["Account.Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }],
    "Account": [{
      "fields": ["Name"],
      "filter": {"operator": "matches", "values": ["$Name"]}
    }]
  }
}

Now that we have added field mapping, save the page layout and go to the actual record. Observe that the dashboard is now getting filtered per record, as shown in the following screenshot. To summarize, we saw it's fairly easy to embed your custom dashboards in Salesforce. Similarly, you can do so on other platforms such as Lightning, Visualforce pages, and even on your own websites and web applications. If you are keen to learn more, you may check out the book Learning Einstein Analytics.

How to write high quality code in Python: 15+ tips for data scientists and researchers

Aarthi Kumaraswamy
21 Mar 2018
5 min read
Writing code is easy. Writing high quality code is much harder. Quality is to be understood both in terms of actual code (variable names, comments, docstrings, and so on) and architecture (functions, modules, and classes). In general, coming up with a well-designed code architecture is much more challenging than the implementation itself. In this post, we will give a few tips about how to write high quality code. This is a particularly important topic in academia, as more and more scientists without prior experience in software development need to code. High quality code writing first principles Writing readable code means that other people (or you in a few months or years) will understand it quicker and will be more willing to use it. It also facilitates bug tracking. Modular code is also easier to understand and to reuse. Implementing your program's functionality in independent functions that are organized as a hierarchy of packages and modules is an excellent way of achieving high code quality. It is easier to keep your code loosely coupled when you use functions instead of classes. Spaghetti code is really hard to understand, debug, and reuse. Iterate between bottom-up and top-down approaches while working on a new project. Starting with a bottom-up approach lets you gain experience with the code before you start thinking about the overall architecture of your program. Still, make sure you know where you're going by thinking about how your components will work together. How these high quality code writing first principles translate in Python? Take the time to learn the Python language seriously. Review the list of all modules in the standard library—you may discover that functions you implemented already exist. Learn to write Pythonic code, and do not translate programming idioms from other languages such as Java or C++ to Python. Learn common design patterns; these are general reusable solutions to commonly occurring problems in software engineering. Use assertions throughout your code (the assert keyword) to prevent future bugs (defensive programming). Start writing your code with a bottom-up approach; write independent Python functions that implement focused tasks. Do not hesitate to refactor your code regularly. If your code is becoming too complicated, think about how you can simplify it. Avoid classes when you can. If you can use a function instead of a class, choose the function. A class is only useful when you need to store persistent state between function calls. Make your functions as pure as possible (no side effects). In general, prefer Python native types (lists, tuples, dictionaries, and types from Python's collections module) over custom types (classes). Native types lead to more efficient, readable, and portable code. Choose keyword arguments over positional arguments in your functions. Argument names are easier to remember than argument ordering. They make your functions self-documenting. Name your variables carefully. Names of functions and methods should start with a verb. A variable name should describe what it is. A function name should describe what it does. The importance of naming things well cannot be overstated. Every function should have a docstring describing its purpose, arguments, and return values, as shown in the following example. You can also look at the conventions chosen in popular libraries such as NumPy. The exact convention does not matter, the point is to be consistent within your code. 
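The docstring example referred to above is not included in this excerpt; a minimal sketch in the NumPy/reST spirit might look like the following (the function itself is only an illustration):

import numpy as np


def moving_average(values, window):
    """Compute the simple moving average of a one-dimensional array.

    Parameters
    ----------
    values : array-like
        The input samples.
    window : int
        The number of consecutive samples to average over.

    Returns
    -------
    numpy.ndarray
        The averaged values, of length len(values) - window + 1.
    """
    assert window >= 1, "window must be a positive integer"
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')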
You can use a markup language such as Markdown or reST to do that. Follow (at least partly) Guido van Rossum's Style Guide for Python, also known as Python Enhancement Proposal number 8 (PEP8). It is a long read, but it will help you write well-readable Python code. It covers many little things such as spacing between operators, naming conventions, comments, and docstrings. For instance, you will learn that it is considered a good practice to limit any line of your code to 79 or 99 characters. This way, your code can be correctly displayed in most situations (such as in a command-line interface or on a mobile device) or side by side with another file. Alternatively, you can decide to ignore certain rules. In general, following common guidelines is beneficial on projects involving many developers. You can check your code automatically against most of the style conventions in PEP8 with the pycodestyle Python package. You can also automatically make your code PEP8-compatible with the autopep8 package. Use a tool for static code analysis such as flake8 or Pylint. It lets you find potential errors or low-quality code statically, that is, without running your code. Use blank lines to avoid cluttering your code (see PEP8). You can also demarcate sections in a long Python module with salient comments. A Python module should not contain more than a few hundreds lines of code. Having too many lines of code in a module may be a sign that you need to split it into several modules. Organize important projects (with tens of modules) into subpackages (subdirectories). Take a look at how major Python projects are organized. For example, the code of IPython is well-organized into a hierarchy of subpackages with focused roles. Reading the code itself is also quite instructive. Learn best practices to create and distribute a new Python package. Make sure that you know setuptools, pip, wheels, virtualenv, PyPI, and so on. Also, you are highly encouraged to take a serious look at conda, a powerful and generic packaging system created by Anaconda. Packaging has long been a rapidly evolving topic in Python, so read only the most recent references. You enjoyed an excerpt from Cyrille Rossant’s latest book, IPython Cookbook, Second Edition. This book contains 100+ recipes for high-performance scientific computing and data analysis, from the latest IPython/Jupyter features to the most advanced tricks, to help you write better and faster code. For free recipes from the book, head over to the Ipython Cookbook Github page. If you loved what you saw, support Cyrille’s work by buying a copy of the book today!

The Cambridge Analytica scandal and ethics in data science

Richard Gall
20 Mar 2018
5 min read
Earlier this month, Stack Overflow published the results of its 2018 developer survey. In it, there was an interesting set of questions around the concept of 'ethical code'. The main takeaway was ultimately that the area remains a gray area. The Cambridge Analytica scandal, however, has given the issue of 'ethical code' a renewed urgency in the last couple of days. The data analytics company are alleged to have not only been involved in votes in the UK and US, but also of harvesting copious amounts of data from Facebook (illegally). For whistleblower Christopher Wylie, the issue of ethical code is particularly pronounced. “I created Steve Bannon’s psychological mindfuck tool” he told Carole Cadwalladr in an interview in the Guardian. Cambridge Analytica: psyops or just market research? Wylie is a data scientist whose experience over the last half a decade or so has been impressive. It’s worth noting however, that Wylie’s career didn’t begin in politics. His academic career was focused primarily on fashion forecasting. That might all seem a little prosaic, but it underlines the fact that data science never happens in a vacuum. Data scientists always operate within a given field. It might be tempting to view the world purely through the prism of impersonal data and cold statistics. To a certain extent you have to if you’re a data scientist or a statistician. But at the very least this can be unhelpful; at worst a potential threat to global democracy. At one point in the interview Wylie remarks that: ...it’s normal for a market research company to amass data on domestic populations. And if you’re working in some country and there’s an auxiliary benefit to a current client with aligned interests, well that’s just a bonus. This is potentially the most frightening thing. Cambridge Analytica’s ostensible role in elections and referenda isn’t actually that remarkable. For all the vested interests and meetings between investors, researchers and entrepreneurs, the scandal is really just the extension of data mining and marketing tactics employed by just about every organization with a digital presence on the planet. Data scientists are always going to be in a difficult position. True, we're not all going to end up working alongside Steve Bannon. But your skills are always being deployed with a very specific end in mind. It’s not always easy to see the effects and impact of your work until later, but it’s still essential for data scientists and analysts to be aware of whose data is being collected and used, how it’s being used and why. Who is responsible for the ethics around data and code? There was another interesting question in the Stack Overflow survey that's relevant to all of this. The survey asked respondents who was ultimately most responsible for code that accomplishes something unethical. 57.5% claimed upper management were responsible, 22.8% said the person who came up with the idea, and 19.7% said it was the responsibility of the developer themselves. Clearly the question is complex. The truth lies somewhere between all three. Management make decisions about what’s required from an organizational perspective, but the engineers themselves are, of course, a part of the wider organizational dynamic. They should be in a position where they are able to communicate any personal misgivings or broader legal issues with the work they are being asked to do. The case of Wylie and Cambridge Analytica is unique, however. 
But it does highlight that data science can be deployed in ways that are difficult to predict. And without proper channels of escalation and the right degree of transparency it's easy for things to remain secretive, hidden in small meetings, email threads and paper trails. That's another thing that data scientists need to remember. Office politics might be a fact of life, but when you're a data scientist you're sitting on the apex of legal, strategic and political issues. To refuse to be aware of this would be naive. What the Cambridge Analytica story can teach data scientists But there's something else worth noting. This story also illustrates something more about the world in which data scientists are operating. This is a world where traditional infrastructure is being dismantled. This is a world where privatization and outsourcing is viewed as the route towards efficiency and 'value for money'. Whether you think that’s a good or bad thing isn’t really the point here. What’s important is that it makes the way we use data, even the code we write more problematic than ever because it’s not always easy to see how it’s being used. Arguably Wylie was naive. His curiosity and desire to apply his data science skills to intriguing and complex problems led him towards people who knew just how valuable he could be. Wylie has evidently developed greater self-awareness. This is perhaps the main reason why he has come forward with his version of events. But as this saga unfolds it’s worth remembering the value of data scientists in the modern world - for a range of organizations. It’s made the concept of the 'citizen data scientist' take on an even more urgent and literal meaning. Yes data science can help to empower the economy and possibly even toy with democracy. But it can also be used to empower people, improve transparency in politics and business. If anything, the Cambridge Analytica saga proves that data science is a dangerous field - not only the sexiest job of the twenty-first century, but one of the most influential in shaping the kind of world we're going to live in. That's frightening, but it's also pretty exciting.

Getting started with Python Web Scraping

Amarabha Banerjee
20 Mar 2018
13 min read
[box type="note" align="" class="" width=""]Our article is an excerpt from the book Web Scraping with Python, written by Richard Lawson. This book contains step by step tutorials on how to leverage Python programming techniques for ethical web scraping. [/box] The amount of data available on the web is consistently growing both in quantity and in form. Businesses require this data to make decisions, particularly with the explosive growth of machine learning tools which require large amounts of data for training. Much of this data is available via Application Programming Interfaces, but at the same time a lot of valuable data is still only available through the process of web scraping. Python is the choice of programing language for many who build systems to perform scraping. It is an easy to use programming language with a rich ecosystem of tools for other tasks. In this article, we will focus on the fundamentals of setting up a scraping environment and perform basic requests for data with several tools of trade. Setting up a Python development environment If you have not used Python before, it is important to have a working development  environment. The recipes in this book will be all in Python and be a mix of interactive examples, but primarily implemented as scripts to be interpreted by the Python interpreter. This recipe will show you how to set up an isolated development environment with virtualenv and manage project dependencies with pip . We also get the code for the book and install it into the Python virtual environment. Getting ready We will exclusively be using Python 3.x, and specifically in my case 3.6.1. While Mac and Linux normally have Python version 2 installed, and Windows systems do not. So it is likely that in any case that Python 3 will need to be installed. You can find references for Python installers at www.python.org. You can check Python's version with python --version pip comes installed with Python 3.x, so we will omit instructions on its installation. Additionally, all command line examples in this book are run on a Mac. For Linux users the commands should be identical. On Windows, there are alternate commands (like dir instead of ls), but these alternatives will not be covered. How to do it We will be installing a number of packages with pip. These packages are installed into a Python environment. There often can be version conflicts with other packages, so a good practice for following along with the recipes in the book will be to create a new virtual Python environment where the packages we will use will be ensured to work properly. Virtual Python environments are managed with the virtualenv tool. This can be installed with the following command: ~ $ pip install virtualenv Collecting virtualenv Using cached virtualenv-15.1.0-py2.py3-none-any.whl Installing collected packages: virtualenv Successfully installed virtualenv-15.1.0 Now we can use virtualenv. But before that let's briefly look at pip. This command installs Python packages from PyPI, a package repository with literally 10's of thousands of packages. We just saw using the install subcommand to pip, which ensures a package is installed. We can also see all currently installed packages with pip list: ~ $ pip list alabaster (0.7.9) amqp (1.4.9) anaconda-client (1.6.0) anaconda-navigator (1.5.3) anaconda-project (0.4.1) aniso8601 (1.3.0) Packages can also be uninstalled using pip uninstall followed by the package name. I'll leave it to you to give it a try. Now back to virtualenv. 
Using virtualenv is very simple. Let's use it to create an environment and install the code from github. Let's walk through the steps: Create a directory to represent the project and enter the directory. ~ $ mkdir pywscb ~ $ cd pywscb Initialize a virtual environment folder named env: pywscb $ virtualenv env Using base prefix '/Users/michaelheydt/anaconda' New python executable in /Users/michaelheydt/pywscb/env/bin/python copying /Users/michaelheydt/anaconda/bin/python => /Users/michaelheydt/pywscb/env/bin/python copying /Users/michaelheydt/anaconda/bin/../lib/libpython3.6m.dylib => /Users/michaelheydt/pywscb/env/lib/libpython3. 6m.dylib Installing setuptools, pip, wheel...done. This creates an env folder. Let's take a look at what was installed. pywscb $ ls -la env total 8 drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 . drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 .. drwxr-xr-x 16 michaelheydt staff 544 Jan 18 15:38 bin drwxr-xr-x 3 michaelheydt staff 102 Jan 18 15:35 include drwxr-xr-x 4 michaelheydt staff 136 Jan 18 15:38 lib -rw-r--r-- 1 michaelheydt staff 60 Jan 18 15:38 pipselfcheck. json New we activate the virtual environment. This command uses the content in the env folder to configure Python. After this all python activities are relative to this virtual environment. pywscb $ source env/bin/activate (env) pywscb $ We can check that python is indeed using this virtual environment with the following command: (env) pywscb $ which python /Users/michaelheydt/pywscb/env/bin/python With our virtual environment created, let's clone the books sample code and take a look at its structure. (env) pywscb $ git clone https://github.com/PacktBooks/PythonWebScrapingCookbook.git Cloning into 'PythonWebScrapingCookbook'... remote: Counting objects: 420, done. remote: Compressing objects: 100% (316/316), done. remote: Total 420 (delta 164), reused 344 (delta 88), pack-reused 0 Receiving objects: 100% (420/420), 1.15 MiB | 250.00 KiB/s, done. Resolving deltas: 100% (164/164), done. Checking connectivity... done. This created a PythonWebScrapingCookbook directory. (env) pywscb $ ls -l total 0 drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 PythonWebScrapingCookbook drwxr-xr-x 6 michaelheydt staff 204 Jan 18 15:38 env Let's change into it and examine the content. (env) PythonWebScrapingCookbook $ ls -l total 0 drwxr-xr-x 15 michaelheydt staff 510 Jan 18 16:21 py drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 www There are two directories. Most the the Python code is is the py directory. www contains some web content that we will use from time-to-time using a local web server. Let's look at the contents of the py directory: (env) py $ ls -l total 0 drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 01 drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 03 drwxr-xr-x 21 michaelheydt staff 714 Jan 18 16:21 04 drwxr-xr-x 10 michaelheydt staff 340 Jan 18 16:21 05 drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 06 drwxr-xr-x 25 michaelheydt staff 850 Jan 18 16:21 07 drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 08 drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 09 drwxr-xr-x 7 michaelheydt staff 238 Jan 18 16:21 10 drwxr-xr-x 9 michaelheydt staff 306 Jan 18 16:21 11 drwxr-xr-x 8 michaelheydt staff 272 Jan 18 16:21 modules Code for each chapter is in the numbered folder matching the chapter (there is no code for chapter 2 as it is all interactive Python). Note that there is a modules folder. Some of the recipes throughout the book use code in those modules. 
Make sure that your Python path points to this folder. On Mac and Linux you can sets this in your .bash_profile file (and environments variables dialog on Windows): Export PYTHONPATH="/users/michaelheydt/dropbox/packt/books/pywebscrcookbook/code/py/modules" export PYTHONPATH The contents in each folder generally follows a numbering scheme matching the sequence of the recipe in the chapter. The following is the contents of the chapter 6 folder: (env) py $ ls -la 06 total 96 drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:21 . drwxr-xr-x 14 michaelheydt staff 476 Jan 18 16:26 .. -rw-r--r-- 1 michaelheydt staff 902 Jan 18 16:21 01_scrapy_retry.py -rw-r--r-- 1 michaelheydt staff 656 Jan 18 16:21 02_scrapy_redirects.py -rw-r--r-- 1 michaelheydt staff 1129 Jan 18 16:21 03_scrapy_pagination.py -rw-r--r-- 1 michaelheydt staff 488 Jan 18 16:21 04_press_and_wait.py -rw-r--r-- 1 michaelheydt staff 580 Jan 18 16:21 05_allowed_domains.py -rw-r--r-- 1 michaelheydt staff 826 Jan 18 16:21 06_scrapy_continuous.py -rw-r--r-- 1 michaelheydt staff 704 Jan 18 16:21 07_scrape_continuous_twitter.py -rw-r--r-- 1 michaelheydt staff 1409 Jan 18 16:21 08_limit_depth.py -rw-r--r-- 1 michaelheydt staff 526 Jan 18 16:21 09_limit_length.py -rw-r--r-- 1 michaelheydt staff 1537 Jan 18 16:21 10_forms_auth.py -rw-r--r-- 1 michaelheydt staff 597 Jan 18 16:21 11_file_cache.py -rw-r--r-- 1 michaelheydt staff 1279 Jan 18 16:21 12_parse_differently_based_on_rules.py In the recipes I'll state that we'll be using the script in <chapter directory>/<recipe filename>. Now just the be complete, if you want to get out of the Python virtual environment, you can exit using the following command: (env) py $ deactivate py $ And checking which python we can see it has switched back: py $ which python /Users/michaelheydt/anaconda/bin/python Scraping Python.org with Requests and Beautiful Soup In this recipe we will install Requests and Beautiful Soup and scrape some content from www.python.org. We'll install both of the libraries and get some basic familiarity with them. We'll come back to them both in subsequent chapters and dive deeper into each. Getting ready In this recipe, we will scrape the upcoming Python events from https:/ / www. python. org/events/ pythonevents. The following is an an example of The Python.org Events Page (it changes frequently, so your experience will differ): We will need to ensure that Requests and Beautiful Soup are installed. We can do that with the following: pywscb $ pip install requests Downloading/unpacking requests Downloading requests-2.18.4-py2.py3-none-any.whl (88kB): 88kB downloaded Downloading/unpacking certifi>=2017.4.17 (from requests) Downloading certifi-2018.1.18-py2.py3-none-any.whl (151kB): 151kB downloaded Downloading/unpacking idna>=2.5,<2.7 (from requests) Downloading idna-2.6-py2.py3-none-any.whl (56kB): 56kB downloaded Downloading/unpacking chardet>=3.0.2,<3.1.0 (from requests) Downloading chardet-3.0.4-py2.py3-none-any.whl (133kB): 133kB downloaded Downloading/unpacking urllib3>=1.21.1,<1.23 (from requests) Downloading urllib3-1.22-py2.py3-none-any.whl (132kB): 132kB downloaded Installing collected packages: requests, certifi, idna, chardet, urllib3 Successfully installed requests certifi idna chardet urllib3 Cleaning up... pywscb $ pip install bs4 Downloading/unpacking bs4 Downloading bs4-0.0.1.tar.gz Running setup.py (path:/Users/michaelheydt/pywscb/env/build/bs4/setup.py) egg_info for package bs4 How to do it Now let's go and learn to scrape a couple events. 
For this recipe, we will start by using interactive Python. Start it with the ipython command: $ ipython Python 3.6.1 |Anaconda custom (x86_64)| (default, Mar 22 2017, 19:25:17) Type "copyright", "credits" or "license" for more information. IPython 5.1.0 -- An enhanced Interactive Python. ? -> Introduction and overview of IPython's features. %quickref -> Quick reference. help -> Python's own help system. object? -> Details about 'object', use 'object??' for extra details. In [1]: Next, we import Requests: In [1]: import requests We now use Requests to make a GET HTTP request for the following URL: https://www.python.org/events/python-events/: In [2]: url = 'https://www.python.org/events/python-events/' In [3]: req = requests.get(url) That downloaded the page content, which is stored in our response object req. We can retrieve the content using the .text property. This prints the first 200 characters: req.text[:200] Out[4]: '<!doctype html>\n<!--[if lt IE 7]> <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9"> <![endif]-->\n<!--[if IE 7]> <html class="no-js ie7 lt-ie8 lt-ie9"> <![endif]-->\n<!--[if IE 8]> <h' We now have the raw HTML of the page. We can now use Beautiful Soup to parse the HTML and retrieve the event data. First, import Beautiful Soup: In [5]: from bs4 import BeautifulSoup Now we create a BeautifulSoup object and pass it the HTML. In [6]: soup = BeautifulSoup(req.text, 'lxml') Now we tell Beautiful Soup to find the main <ul> tag for the recent events, and then to get all the <li> tags below it. In [7]: events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li') And finally, we can loop through each of the <li> elements, extract the event details, and print each to the console:

In [13]: for event in events:
    ...:     event_details = dict()
    ...:     event_details['name'] = event.find('h3').find("a").text
    ...:     event_details['location'] = event.find('span', {'class': 'event-location'}).text
    ...:     event_details['time'] = event.find('time').text
    ...:     print(event_details)
    ...:

{'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. 2018'} {'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. 2018'} {'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. 2018'} {'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. 2018'} {'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. 2018'} {'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. 2018'} This entire example is available in the 01/01_events_with_requests.py script file. The following is its content; it pulls together, step by step, all of what we just did:

import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'lxml')
    events = soup.find('ul', {'class': 'list-recent-events'}).findAll('li')
    for event in events:
        event_details = dict()
        event_details['name'] = event.find('h3').find("a").text
        event_details['location'] = event.find('span', {'class': 'event-location'}).text
        event_details['time'] = event.find('time').text
        print(event_details)

get_upcoming_events('https://www.python.org/events/python-events/')

You can run this using the following command from the terminal: $ python 01_events_with_requests.py {'name': 'PyCascades 2018', 'location': 'Granville Island Stage, 1585 Johnston St, Vancouver, BC V6H 3R9, Canada', 'time': '22 Jan. – 24 Jan. 2018'} {'name': 'PyCon Cameroon 2018', 'location': 'Limbe, Cameroon', 'time': '24 Jan. – 29 Jan. 2018'} {'name': 'FOSDEM 2018', 'location': 'ULB Campus du Solbosch, Av. F. D. Roosevelt 50, 1050 Bruxelles, Belgium', 'time': '03 Feb. – 05 Feb. 2018'} {'name': 'PyCon Pune 2018', 'location': 'Pune, India', 'time': '08 Feb. – 12 Feb. 2018'} {'name': 'PyCon Colombia 2018', 'location': 'Medellin, Colombia', 'time': '09 Feb. – 12 Feb. 2018'} {'name': 'PyTennessee 2018', 'location': 'Nashville, TN, USA', 'time': '10 Feb. – 12 Feb. 2018'} How it works We will dive into details of both Requests and Beautiful Soup in the next chapter, but for now let's just summarize a few key points about how this works. The following are important points about Requests: Requests is used to execute HTTP requests. We used it to make a GET verb request of the URL for the events page. The response object returned by Requests holds the results of the request. This is not only the page content, but also many other items about the result, such as HTTP status codes and headers. Requests is used only to get the page; it does not do any parsing. We use Beautiful Soup to do the parsing of the HTML and also the finding of content within the HTML. To understand how this worked, note that the Upcoming Events section of the page is marked up as a <ul> element with the class list-recent-events, containing one <li> per event. We used the power of Beautiful Soup to: Find the <ul> element representing the section, which is found by looking for a <ul> whose class attribute has a value of list-recent-events. From that object, we find all the <li> elements. Each of these <li> tags represents a different event. We iterate over each of those, making a dictionary from the event data found in child HTML tags: The name is extracted from the <a> tag that is a child of the <h3> tag. The location is the text content of the <span> with a class of event-location. And the time is extracted from the text of the <time> tag. To summarize, we saw how to set up a Python environment for effective data scraping from the web and also explored ways to use Beautiful Soup to perform preliminary data scraping for ethical purposes. If you liked this post, be sure to check out Web Scraping with Python, which consists of useful recipes for scraping and processing data from the web with Python.
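As an optional refinement of the recipe above (not from the book), the following is a minimal sketch of a slightly more defensive version of the script: it returns the events as a list instead of printing them and guards against missing tags, so a layout change on python.org degrades gracefully rather than raising an AttributeError. The class names and URL are the ones used in the recipe; the timeout value is an arbitrary choice, and the 'lxml' parser requires the lxml package (swap in 'html.parser' if it is not installed).

import requests
from bs4 import BeautifulSoup

def get_upcoming_events(url):
    """Return a list of event dicts scraped from python.org's events page."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()                  # fail loudly on HTTP errors
    soup = BeautifulSoup(response.text, 'lxml')  # 'html.parser' also works

    events = []
    event_list = soup.find('ul', {'class': 'list-recent-events'})
    if event_list is None:                       # layout changed or section missing
        return events

    for li in event_list.findAll('li'):
        name_tag = li.find('h3')
        location_tag = li.find('span', {'class': 'event-location'})
        time_tag = li.find('time')
        events.append({
            'name': name_tag.find('a').text if name_tag and name_tag.find('a') else None,
            'location': location_tag.text if location_tag else None,
            'time': time_tag.text if time_tag else None,
        })
    return events

if __name__ == '__main__':
    for event in get_upcoming_events('https://www.python.org/events/python-events/'):
        print(event)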

article-image-25-datasets-deep-learning-iot
Sugandha Lahoti
20 Mar 2018
8 min read
Save for later

25 Datasets for Deep Learning in IoT

Deep learning is one of the major players in facilitating analytics and learning in the IoT domain. A really good roundup of the state of deep learning advances for big data and IoT is described in the paper Deep Learning for IoT Big Data and Streaming Analytics: A Survey by Mehdi Mohammadi, Ala Al-Fuqaha, Sameh Sorour, and Mohsen Guizani. In this article, we have attempted to draw inspiration from this research paper to establish the importance of IoT datasets for deep learning applications. The paper also provides a handy list of commonly used datasets suitable for building deep learning applications in IoT, which we have added at the end of the article. IoT and Big Data: The relationship IoT and big data have a two-way relationship. IoT is the main producer of big data, and as such an important target for big data analytics to improve the processes and services of IoT. However, there is a difference between the two. Large-scale streaming data: IoT data is large-scale streaming data, because a large number of IoT devices generate streams of data continuously. Big data, on the other hand, lacks real-time processing. Heterogeneity: IoT data is heterogeneous, as various IoT data acquisition devices gather different information. Big data devices are generally homogeneous in nature. Time and space correlation: IoT sensor devices are also attached to a specific location, and thus have a location and time-stamp for each of the data items. Big data sensors lack time-stamp resolution. High noise data: IoT data is highly noisy, owing to the tiny pieces of data in IoT applications, which are prone to errors and noise during acquisition and transmission. Big data, in contrast, is generally less noisy. Big data, on the other hand, is classified according to the conventional 3V's: Volume, Velocity, and Variety. As such, techniques used for big data analytics are not sufficient to analyze the kind of data that is being generated by IoT devices. For instance, autonomous cars need to make fast decisions on driving actions such as lane or speed changes. These decisions should be supported by fast analytics with data streaming from multiple sources (e.g., cameras, radars, left/right signals, traffic lights, and so on). This extends the classification of IoT big data to 6V's. Volume: The quantity of data generated by IoT devices is much greater than before and clearly fits this feature. Velocity: Advanced tools and technologies for analytics are needed to efficiently handle the high rate of data production. Variety: Big data may be structured, semi-structured, or unstructured data. The data types produced by IoT include text, audio, video, sensory data, and so on. Veracity: Veracity refers to the quality, consistency, and trustworthiness of the data, which in turn leads to accurate analytics. Variability: This property refers to the different rates of data flow. Value: Value is the transformation of big data into useful information and insights that bring competitive advantage to organizations. Despite the recent advancements in DL for big data, there are still significant challenges that need to be addressed to mature this technology. Each of the six characteristics of IoT big data imposes a challenge for DL techniques. One common denominator for all of them is the lack of availability of IoT big data datasets.
IoT datasets and why they are needed Deep learning methods have been promising, with state-of-the-art results in several areas such as signal processing, natural language processing, and image recognition, and the trend is going up in IoT verticals as well. IoT datasets play a major role in improving IoT analytics: real-world IoT datasets provide more data, which in turn improves the accuracy of DL algorithms. However, the lack of availability of large real-world datasets for IoT applications is a major hurdle for incorporating DL models in IoT. The shortage of these datasets acts as a barrier to the deployment and acceptance of IoT analytics based on DL, since the empirical validation and evaluation of such systems needs to be shown to be promising in the real world. The lack of availability is mainly because: Most IoT datasets are held by large organizations that are unwilling to share them easily. Access is restricted by copyright or privacy considerations; these are more common in domains with human data, such as healthcare and education. While there is a lot of ground to be covered in terms of making datasets for IoT available, here is a list of commonly used datasets suitable for building deep learning applications in IoT.

Dataset Name | Domain | Provider | Notes | Address/Link
CGIAR dataset | Agriculture, Climate | CCAFS | High-resolution climate datasets for a variety of fields, including agriculture | http://www.ccafs-climate.org/
Educational Process Mining | Education | University of Genova | Recordings of 115 subjects' activities through a logging application while learning with an educational simulator | http://archive.ics.uci.edu/ml/datasets/Educational+Process+Mining+%28EPM%29%3A+A+Learning+Analytics+Data+Set
Commercial Building Energy Dataset | Energy, Smart Building | IIITD | Energy-related dataset from a commercial building where data is sampled more than once a minute | http://combed.github.io/
Individual household electric power consumption | Energy, Smart home | EDF R&D, Clamart, France | One-minute sampling rate over a period of almost 4 years | http://archive.ics.uci.edu/ml/datasets/Individual+household+electric+power+consumption
AMPds dataset | Energy, Smart home | S. Makonin | Electricity, water, and natural gas measurements at one-minute intervals for 2 years of monitoring | http://ampds.org/
UK Domestic Appliance-Level Electricity | Energy, Smart Home | Kelly and Knottenbelt | Power demand from five houses; in each house, both the whole-house mains power demand and the power demand of individual appliances are recorded | http://www.doc.ic.ac.uk/~dk3810/data/
PhysioBank databases | Healthcare | PhysioNet | Archive of over 80 physiological datasets | https://physionet.org/physiobank/database/
Saarbruecken Voice Database | Healthcare | Universität des Saarlandes | A collection of voice recordings from more than 2000 persons for pathological voice detection | http://www.stimmdatebank.coli.uni-saarland.de/help_en.php4
T-LESS | Industry | CMP at Czech Technical University | An RGB-D dataset and evaluation methodology for detection and 6D pose estimation of texture-less objects | http://cmp.felk.cvut.cz/t-less/
CityPulse Dataset Collection | Smart City | CityPulse EU FP7 project | Road traffic data, pollution data, weather, parking | http://iot.ee.surrey.ac.uk:8080/datasets.html
Open Data Institute - node Trento | Smart City | Telecom Italia | Weather, air quality, electricity, telecommunication | http://theodi.fbk.eu/openbigdata/
Malaga datasets | Smart City | City of Malaga | A broad range of categories such as energy, ITS, weather, industry, sport, etc. | http://datosabiertos.malaga.eu/dataset
Gas sensors for home activity monitoring | Smart home | Univ. of California San Diego | Recordings of 8 gas sensors under three conditions, including background, wine, and banana presentations | http://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring
CASAS datasets for activities of daily living | Smart home | Washington State University | Several public datasets related to Activities of Daily Living (ADL) performance in a two-story home, an apartment, and an office setting | http://ailab.wsu.edu/casas/datasets.html
ARAS Human Activity Dataset | Smart home | Bogazici University | Human activity recognition datasets collected from two real houses with multiple residents during two months | https://www.cmpe.boun.edu.tr/aras/
MERLSense Data | Smart home, building | Mitsubishi Electric Research Labs | Motion sensor data of residual traces from a network of over 200 sensors for two years, containing over 50 million records | http://www.merl.com/wmd
SportVU | Sport | Stats LLC | Video of basketball and soccer games captured from 6 cameras | http://go.stats.com/sportvu
RealDisp | Sport | O. Banos | Includes a wide range of physical activities (warm up, cool down, and fitness exercises) | http://orestibanos.com/datasets.htm
Taxi Service Trajectory | Transportation | Prediction Challenge, ECML PKDD 2015 | Trajectories performed by all 442 taxis running in the city of Porto, Portugal | http://www.geolink.pt/ecmlpkdd2015-challenge/dataset.html
GeoLife GPS Trajectories | Transportation | Microsoft | GPS trajectories represented as sequences of time-stamped points | https://www.microsoft.com/en-us/download/details.aspx?id=52367
T-Drive trajectory data | Transportation | Microsoft | Contains one-week trajectories of 10,357 taxis | https://www.microsoft.com/en-us/research/publication/t-drive-trajectory-data-sample/
Chicago Bus Traces data | Transportation | M. Doering | Bus traces from the Chicago Transport Authority for 18 days, with a rate between 20 and 40 seconds | http://www.ibr.cs.tu-bs.de/users/mdoering/bustraces/
Uber trip data | Transportation | FiveThirtyEight | About 20 million Uber pickups in New York City during 12 months | https://github.com/fivethirtyeight/uber-tlc-foil-response
Traffic Sign Recognition | Transportation | K. Lim | Three datasets: Korean daytime, Korean nighttime, and German daytime traffic signs based on Vienna traffic rules | https://figshare.com/articles/Traffic_Sign_Recognition_Testsets/4597795
DDD17 | Transportation | J. Binas | End-to-end DAVIS driving dataset | http://sensors.ini.uzh.ch/databases.html
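To give a flavour of how one of these datasets might be fed into a deep learning pipeline, the following is a minimal sketch (not part of the original survey or article) that loads the UCI Individual household electric power consumption dataset with pandas and turns one column into fixed-length windows suitable for a sequence model. The file name, the ';' separator, the '?' missing-value marker, and the Global_active_power column name are assumptions based on how the UCI archive distributes this dataset; adjust them to match whatever you actually download.

import numpy as np
import pandas as pd

# Assumes household_power_consumption.txt has been downloaded from the UCI archive.
df = pd.read_csv('household_power_consumption.txt', sep=';', na_values='?', low_memory=False)
df = df.dropna(subset=['Global_active_power'])
series = df['Global_active_power'].astype(float).to_numpy()

def make_windows(values, window=60, horizon=1):
    """Slice a 1-D series into (window, target) pairs for supervised training."""
    X, y = [], []
    for i in range(len(values) - window - horizon + 1):
        X.append(values[i:i + window])
        y.append(values[i + window + horizon - 1])
    return np.array(X), np.array(y)

X, y = make_windows(series[:10000])   # use a slice to keep the example fast
print(X.shape, y.shape)               # e.g. (9940, 60) (9940,)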
article-image-create-prepare-first-dataset-salesforce-einstein
Amey Varangaonkar
19 Mar 2018
3 min read
Save for later

How to create and prepare your first dataset in Salesforce Einstein

Note: The following extract is taken from the book Learning Einstein Analytics written by Santosh Chitalkar. This book will help you learn Salesforce Einstein Analytics, to get insights faster and understand your customers better. In this article, we see how to start your analytics journey using Salesforce Einstein by taking the first step in the process, that is, by creating and preparing your dataset. A dataset is a set of source data, specially formatted and optimized for interactive exploration. Here are the steps to create a new dataset in Salesforce Einstein: 1. Click on the Create button in the top-right corner and then click on Dataset. You can see the following three options to create datasets: CSV File, Salesforce, and Informatica Rev. 2. Select CSV File and click on Continue, as shown in the following screenshot: 3. Select the Account_data.csv file or drag and drop the file. 4. Click on Next. The next screen loads the user interface to create a single dataset by using the external .csv file: 5. Click on Next to proceed, as shown in the following screenshot: 6. Change the dataset name if you want. You can select an application to store the dataset. You can also replace the CSV file from this screen. 7. In the Data Schema File section, select the Replace File option to change the file. You can also download the uploaded .csv file from here, as shown in the following screenshot: 8. Click on Next. In the next screen, you can change field attributes such as column name, dimensions, field type, and so on. 9. Click on the Next button and it will start uploading the file to Analytics and queuing it in the dataflow. Once done, click on the Got it button. 10. Wait for 10-15 minutes (depending on the data, it may take a longer time to create the dataset). 11. Go to Analytics Studio and open the DATASETS tab. You can see the Account_data dataset as shown in the following screenshot: Congrats!!! You have created your first dataset. Let's now update this dataset with the same information but with some additional columns. Updating datasets We need to update the dataset to add new fields, change application settings, remove fields, and so on. Einstein Analytics gives users the flexibility to update the dataset. Here are the steps to update an existing dataset: 1. Create a CSV file that includes some new fields and name it Account_Data_Updated. Save the file to a location that you can easily remember. 2. In Salesforce, go to the Analytics Studio home page and find the dataset. 3. Hover over the dataset, click on the button that appears, and then click on Edit, as shown in the following screenshot: 4. Salesforce displays the dataset editing screen. Click on the Replace Data button in the top-right corner of the page: 5. Click on the Next button and upload your new CSV file using the upload UI. 6. Click on the Next button again to get to the next screen for editing, and click on Next again. 7. Click on Replace, as shown in the following screenshot: Voila! You've successfully updated your dataset. As you can see, it's fairly easy to create and then update a dataset using Einstein, without any hassle. If you found this post useful, make sure to check out our book Learning Einstein Analytics for more tips and techniques on using Einstein Analytics effectively to uncover unique insights from your data.
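The steps above use the Analytics Studio UI. As an aside that goes beyond the extract, dataset loads can also be scripted against the Analytics External Data API; the following is a rough Python sketch of that approach using the simple_salesforce package. The credentials, the dataset alias, and the choice of the Overwrite operation are placeholders, and you should check your org's Analytics documentation for the exact field values and part-size limits before relying on this.

import base64
from simple_salesforce import Salesforce

# Placeholder credentials - replace with values for your own org.
sf = Salesforce(username='user@example.com', password='password', security_token='token')

# Read the CSV and base64-encode it, as the External Data API expects.
with open('Account_data.csv', 'rb') as f:
    csv_b64 = base64.b64encode(f.read()).decode('ascii')

# 1. Create the header record that describes the upload.
header = sf.InsightsExternalData.create({
    'Format': 'Csv',
    'EdgemartAlias': 'Account_data',   # API name of the dataset
    'Operation': 'Overwrite',          # 'Append' or 'Upsert' for incremental loads
    'Action': 'None',
})

# 2. Attach the encoded CSV as a data part (large files are split into multiple parts).
sf.InsightsExternalDataPart.create({
    'InsightsExternalDataId': header['id'],
    'PartNumber': 1,
    'DataFile': csv_b64,
})

# 3. Flip the action to Process so Analytics builds the dataset asynchronously.
sf.InsightsExternalData.update(header['id'], {'Action': 'Process'})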

article-image-perform-crud-operations-on-mongodb-with-php
Amey Varangaonkar
17 Mar 2018
6 min read
Save for later

Perform CRUD operations on MongoDB with PHP

Note: This article is an excerpt from the book Mastering MongoDB 3.x authored by Alex Giamas. This book covers the key concepts, and tips & tricks needed to build fault-tolerant applications in MongoDB. It gives you the power to become a true expert when it comes to the world's most popular NoSQL database. In today's tutorial, we will cover the CRUD (Create, Read, Update, and Delete) operations using the popular PHP language with the official MongoDB driver. The examples assume that $collection already refers to a collection object obtained from a connected MongoDB client. Create and delete operations To perform the create and delete operations, run the following code: $document = array( "isbn" => "401", "name" => "MongoDB and PHP" ); $result = $collection->insertOne($document); var_dump($result); This is the output: MongoDB\InsertOneResult Object ( [writeResult:MongoDB\InsertOneResult:private] => MongoDB\Driver\WriteResult Object ( [nInserted] => 1 [nMatched] => 0 [nModified] => 0 [nRemoved] => 0 [nUpserted] => 0 [upsertedIds] => Array ( ) [writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDB\Driver\WriteConcern Object ( ) ) [insertedId:MongoDB\InsertOneResult:private] => MongoDB\BSON\ObjectID Object ( [oid] => 5941ac50aabac9d16f6da142 ) [isAcknowledged:MongoDB\InsertOneResult:private] => 1 ) The rather lengthy output contains all the information that we may need. We can get the ObjectId of the inserted document; the number of inserted, matched, modified, removed, and upserted documents from the fields prefixed with n; and information about writeError or writeConcernError. There are also convenience methods in the $result object if we want to get the information: $result->getInsertedCount(): To get the number of inserted objects $result->getInsertedId(): To get the ObjectId of the inserted document We can also use the ->insertMany() method to insert many documents at once, like this: $documentAlpha = array( "isbn" => "402", "name" => "MongoDB and PHP, 2nd Edition" ); $documentBeta = array( "isbn" => "403", "name" => "MongoDB and PHP, revisited" ); $result = $collection->insertMany([$documentAlpha, $documentBeta]); print_r($result); The result is: ( [writeResult:MongoDB\InsertManyResult:private] => MongoDB\Driver\WriteResult Object ( [nInserted] => 2 [nMatched] => 0 [nModified] => 0 [nRemoved] => 0 [nUpserted] => 0 [upsertedIds] => Array ( ) [writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDB\Driver\WriteConcern Object ( ) ) [insertedIds:MongoDB\InsertManyResult:private] => Array ( [0] => MongoDB\BSON\ObjectID Object ( [oid] => 5941ae85aabac9d1d16c63a2 ) [1] => MongoDB\BSON\ObjectID Object ( [oid] => 5941ae85aabac9d1d16c63a3 ) ) [isAcknowledged:MongoDB\InsertManyResult:private] => 1 ) Again, $result->getInsertedCount() will return 2, whereas $result->getInsertedIds() will return an array with the two newly created ObjectIds: array(2) { [0]=> object(MongoDB\BSON\ObjectID)#13 (1) { ["oid"]=> string(24) "5941ae85aabac9d1d16c63a2" } [1]=> object(MongoDB\BSON\ObjectID)#14 (1) { ["oid"]=> string(24) "5941ae85aabac9d1d16c63a3" } } Deleting documents is similar to inserting, but with the deleteOne() and deleteMany() methods; an example of deleteMany() is shown here: $deleteQuery = array( "isbn" => "401"); $deleteResult = $collection->deleteMany($deleteQuery); print_r($deleteResult); print($deleteResult->getDeletedCount()); Here is the output: MongoDB\DeleteResult Object ( [writeResult:MongoDB\DeleteResult:private] => MongoDB\Driver\WriteResult Object ( [nInserted] => 0 [nMatched] => 0 [nModified] => 0 [nRemoved] => 2 [nUpserted] => 0 [upsertedIds] => Array ( )
[writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDB\Driver\WriteConcern Object ( ) ) [isAcknowledged:MongoDB\DeleteResult:private] => 1 ) 2 In this example, we used ->getDeletedCount() to get the number of affected documents, which is printed out in the last line of the output. Bulk write The new PHP driver supports the bulk write interface to minimize network calls to MongoDB: $manager = new MongoDB\Driver\Manager('mongodb://localhost:27017'); $bulk = new MongoDB\Driver\BulkWrite(array("ordered" => true)); $bulk->insert(array( "isbn" => "401", "name" => "MongoDB and PHP" )); $bulk->insert(array( "isbn" => "402", "name" => "MongoDB and PHP, 2nd Edition" )); $bulk->update(array("isbn" => "402"), array('$set' => array("price" => 15))); $bulk->insert(array( "isbn" => "403", "name" => "MongoDB and PHP, revisited" )); $result = $manager->executeBulkWrite('mongo_book.books', $bulk); print_r($result); The result is: MongoDB\Driver\WriteResult Object ( [nInserted] => 3 [nMatched] => 1 [nModified] => 1 [nRemoved] => 0 [nUpserted] => 0 [upsertedIds] => Array ( ) [writeErrors] => Array ( ) [writeConcernError] => [writeConcern] => MongoDB\Driver\WriteConcern Object ( ) ) In the preceding example, we executed two inserts, one update, and a third insert in an ordered fashion. The WriteResult object contains a total of three inserted documents and one modified document. The main difference compared to simple create/delete queries is that executeBulkWrite() is a method of the MongoDB\Driver\Manager class, which we instantiate on the first line. Read operation The querying interface is similar to inserting and deleting, with the findOne() and find() methods used to retrieve the first result or all results of a query: $document = $collection->findOne( array("isbn" => "101") ); $cursor = $collection->find( array( "name" => new MongoDB\BSON\Regex("mongo", "i") ) ); In the second example, we are using a regular expression to search for a key name with the value mongo (case-insensitive). Embedded documents can be queried using the . notation, as with the other languages that we examined earlier in this chapter: $cursor = $collection->find( array('meta.price' => 50) ); We do this to query for an embedded field price inside the meta key field. Similarly to Ruby and Python, in PHP we can query using comparison operators, like this: $cursor = $collection->find( array( 'price' => array('$gte'=> 60) ) ); Querying with multiple key-value pairs is an implicit AND, whereas queries using $or, $in, $nin, or AND ($and) combined with $or can be achieved with nested queries: $cursor = $collection->find( array( '$or' => array( array("price" => array( '$gte' => 60)), array("price" => array( '$lte' => 20)) ))); This finds documents that have price>=60 OR price<=20. Update operation Updating documents has a similar interface, with the ->updateOne() or ->updateMany() method. The first parameter is the query used to find documents and the second one will update our documents. We can use any of the update operators explained at the end of this chapter to update in place, or specify a new document to completely replace the document in the query: $result = $collection->updateOne( array( "isbn" => "401"), array( '$set' => array( "price" => 39 ) ) ); We can use single quotes or double quotes for key names, but if we have special operators starting with $, we need to use single quotes. We can use array( "key" => "value" ) or ["key" => "value"]. We prefer the more explicit array() notation in this book.
The ->getMatchedCount() and ->getModifiedCount() methods will return the number of documents matched by the query part and the number modified by the update, respectively. If the new value is the same as the existing value of a document, it will not be counted as modified. As we saw, it is fairly easy and advantageous to use PHP as a language and tool for performing efficient CRUD operations in MongoDB. If you are interested in more information on how to effectively handle data using MongoDB, you may check out the book Mastering MongoDB 3.x.