Tech News - Data

1208 Articles

How Near Real Time (NRT) Applications work

Amarabha Banerjee
10 Nov 2017
6 min read
In this article by Shilpi Saxena and Saurabh Gupta, from their book Practical Real-time Data Processing and Analytics, we explore what a near real-time architecture looks like and how an NRT app works.

It's very important to understand the key aspects where traditional monolithic application systems fall short of serving the need of the hour:

- Backend DB: single-point, monolithic data access.
- Ingestion flow: the pipelines are complex and tend to induce latency into the end-to-end flow.
- Failure and recovery: the systems are failure prone, but the recovery approach is difficult and complex.
- Synchronization and state capture: it's very difficult to capture and maintain the state of facts and transactions in the system. Diversely distributed systems and real-time system failures further complicate the design and maintenance of such systems.

The answer to the above issues is an architecture that supports streaming, and thus provides its end users access to actionable insights in real time over ever-flowing streams of real-time fact data. Such systems share a few characteristics:

- Local state and consistency of the system for large-scale, high-velocity systems
- Data doesn't arrive at intervals; it keeps flowing in, streaming all the time
- No single state of truth in the form of a backend database; instead, the applications subscribe or tap into the stream of fact data

Before we delve further, it's worthwhile to understand the notion of time. Looking at the figure, it's easy to correlate the SLAs of each type of implementation (batch, near real-time, and real-time) with the kinds of use cases each implementation caters to. For instance, batch implementations have SLAs ranging from a couple of hours to days, and such solutions are predominantly deployed for canned/pre-generated reports and trends. Near real-time solutions have SLAs on the order of a few seconds to hours, and cater to situations requiring ad hoc queries, mid-resolution aggregators, and so on. Real-time applications are the most mission-critical in terms of SLA and resolution: every event counts, and results have to return within the order of milliseconds to seconds.

Near real-time (NRT) architecture

In essence, the NRT architecture consists of four main components/layers, as depicted in the following figure:

- The message transport pipeline
- The stream processing component
- The low-latency data store
- Visualization and analytical tools

The first step is the collection of data from the source and handing it to the "data pipeline": a logical pipeline that collects the continuous events or streaming data from various producers and provides it to the consuming stream processing applications. These applications transform, collate, correlate, aggregate, and perform a variety of other operations on the live streaming data, and then finally store the results in the low-latency data store. A variety of analytical, business intelligence, and visualization tools and dashboards then read this data from the data store and present it to the business user.
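
To make the four layers concrete, here is a minimal, self-contained Python sketch of the same end-to-end flow, with an in-process queue standing in for the transport pipeline and a dictionary standing in for the low-latency store. The event schema and names are illustrative assumptions, not taken from the book:

    import queue
    import threading
    from collections import defaultdict

    # Illustrative stand-ins for the NRT layers:
    # producers -> transport queue -> processing loop -> low-latency store.
    transport = queue.Queue()      # message transport pipeline
    store = defaultdict(int)       # low-latency data store (here: a dict)

    def producer():
        """Emit a continuous stream of fact events (hypothetical schema)."""
        for i in range(100):
            transport.put({"user": f"u{i % 5}", "clicks": 1})
        transport.put(None)        # sentinel: end of this demo stream

    def processor():
        """Consume, aggregate, and write results to the store."""
        while True:
            event = transport.get()
            if event is None:
                break
            store[event["user"]] += event["clicks"]   # aggregate per user

    threading.Thread(target=producer).start()
    processor()
    print(dict(store))   # a dashboard would read from the store instead

In a production system the queue would be a distributed broker such as Kafka and the store a low-latency database, but the shape of the flow is the same.
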
Data collection

This is where the journey of all data processing begins, be it batch or real time; the foremost and most forthright challenge is getting the data from its source into the systems that will process it. We can look at the processing unit as a black box, with data sources acting as publishers and consumers acting as subscribers. This is captured in the following diagram.

The key criteria for data collection tools, in the general context of big data and real time specifically, are as follows:

- Performance and low latency
- Scalability
- The ability to handle structured and unstructured data

Apart from this, the data collection tool should be able to cater to data from a variety of sources, such as:

- Data from traditional transactional systems: one approach is to duplicate the ETL process of these traditional systems and tap the data from the source; another is to tap the data from the ETL systems themselves; the third, and better, approach is to go with a virtual data lake architecture for data replication.
- Structured data from IoT/sensors/devices or CDRs: this data arrives at a very high velocity and in a fixed format; it can come from a variety of sensors and telecom devices.
- Unstructured data from media files, text data, social media, and so on: this is the most complex of all incoming data, where the complexity is due to the dimensions of volume, velocity, variety, and structure.

Stream processing

The stream processing component itself consists of three main sub-components:

- The broker, which collects and holds the events or data streams from the data collection agents
- The processing engine, which actually transforms, correlates, and aggregates the data, and performs the other necessary operations
- The distributed cache, which serves as the mechanism for maintaining a common data set across all the distributed components of the processing engine

The same aspects of the stream processing component are shown in more detail in the following diagram. There are a few key attributes the stream processing component should cater to:

- Distributed components, thus offering resilience to failures
- Scalability, to cater to the growing needs of the application or a sudden surge of traffic
- Low latency, to meet the overall SLAs expected from such an application
- Easy operationalization, to be able to support evolving use cases
- Built for failure: the system should be able to recover from inevitable failures without any event loss, and should be able to reprocess from the point it failed (a minimal sketch of this idea appears at the end of this article)
- Easy integration points with off-heap/distributed caches or data stores
- A wide variety of operations, extensions, and functions to work with the business requirements of the use case

Analytical layer - serve it to the end user

The analytical layer is the most creative and interesting of all the components of an NRT application. So far, all we have talked about is backend processing, but this is the layer where we actually present the output/insights to the end user, graphically and visually, in the form of actionable items. A few of the challenges these visualization systems should be capable of handling are:

- The need for speed
- Understanding the data and presenting it in the right context
- Dealing with outliers

The figure depicts the flow of information from the event producers to the collection agents, followed by the brokers and the processing engine (transformation, aggregation, and so on), and then long-term storage. From the storage unit, the visualization tools pull the insights and present them in the form of graphs, alerts, charts, Excel sheets, dashboards, or maps to the business owners, who can assimilate the information and take action based upon it.

The above was an excerpt from the book Practical Real-time Data Processing and Analytics.
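
To make the "built for failure" attribute concrete, here is a minimal, hypothetical offset-checkpointing sketch in Python (ours, not the book's): the consumer persists the position of the last fully processed event, so after a crash it resumes from that point and reprocesses without losing events. The file name and event shape are made up for illustration:

    import json
    import os

    CHECKPOINT = "offsets.json"   # hypothetical checkpoint file

    def load_offset():
        """Resume point: the last offset whose processing fully completed."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return json.load(f)["offset"]
        return 0

    def save_offset(offset):
        """Persist progress only after the event's results are stored."""
        with open(CHECKPOINT, "w") as f:
            json.dump({"offset": offset}, f)

    def process(event):
        print("processing", event)   # transform/aggregate/store goes here

    stream = [{"id": i} for i in range(10)]   # stand-in for a replayable log

    start = load_offset()
    for offset in range(start, len(stream)):
        process(stream[offset])
        save_offset(offset + 1)   # a crash before this line means the event
                                  # is simply reprocessed on restart

Because the offset is saved only after processing completes, a crash between the two steps replays the event on restart: at-least-once delivery with no event loss.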

10th Nov.' 17 - Headlines

Packt Editorial Staff
10 Nov 2017
3 min read
Duer OS Prometheus Project, the PLATO platform, HPE's neural network chip, and BullSequana S AI servers, in today's trending data science news.

Baidu's new OS for AI capabilities

Baidu launches new operating system Duer OS Prometheus Project to advance conversational AI

Baidu Inc. has officially launched a new operating system to speed up its conversational AI capabilities. Known as the "Duer OS Prometheus Project," the operating system is already providing conversational support to 10 major domains and more than 100 subdomains in China. Baidu has announced a $1 million fund to invest in efforts in this space.

Announcing AI platform PLATO

AI.io launches PLATO, an AI-based operating platform for enterprise

AI.io has announced an AI-based neural network platform called PLATO, which stands for Perceptive Learning Artificial intelligence Technology Operating platform. The platform powers apps in consumer and enterprise use cases for machine learning, deep learning, natural language processing, computer vision, machine reasoning, and cognitive AI. Based on the specific needs of the business, PLATO can be used to extract value from large amounts of structured and unstructured data using an intuitive, easy-to-use interface and toolkit. The data can then be converted, normalized, and enriched with concepts, relationships, sentiment, and tone, giving PLATO the ability to understand the content in a fully cognitive manner. Businesses can then embed the data, with its new insights, into existing applications and workflows. Thus PLATO can enhance overall decision-making, resulting in revenue growth.

HPE's upcoming processor could well be an accelerator

HPE developing its own neural network chip that is faster than anything on the market

Hewlett Packard Enterprise could be developing an advanced chip for high-performance computing under the intense power and physical space limitations characteristic of space missions. Recently, when VP and GM Tom Bradicich was asked about the processor architecture, he said the Dot Product Engine is less of a full processor and more like an accelerator, which offloads certain multiplication elements common in neural network inference and broader HPC applications. "DPE is not a neural network per se, in the sense that it's not a fixed configuration, but rather is reconfigurable, and can be used for inference of several types of neural networks like DNN, CNN, RNN. Hence it can do neural network jobs and workloads," he said, adding that DPE executes linear algebra in the analog domain, which is more efficient than digital implementations such as dedicated ASICs. With the added advantage of reconfigurability, DPE is fast because it accelerates vector-matrix math (dot product multiplication) by exploiting Ohm's law on a memristor array. In fact, the way it is designed, it is "faster than anything available on the market," Bradicich claimed, with a "much better fit for the performance, power, and space requirements of extreme edge environments."

Announcing BullSequana S servers

Atos launches next-generation AI servers "BullSequana S" that are ultra-scalable, ultra-flexible

Atos has developed BullSequana S, its in-house next-generation server line optimized for machine learning, business-critical computing applications, and in-memory environments. BullSequana S comes with a unique combination of powerful CPUs and GPUs. Leveraging a modular architecture, the BullSequana S server's flexibility offers customers the agility to add machine learning and AI capacity to existing enterprise workloads, thanks to the introduction of a GPU. Within a single server, GPU, storage, and compute modules are mixed. BullSequana S integrates the advanced Intel Xeon Scalable (Skylake) processors with an innovative architecture designed by Atos' R&D teams.

Soft skills every data scientist should teach their child

Aaron Lazar
09 Nov 2017
7 min read
Data scientists work really hard to upskill their technical competencies. A rapidly changing technology landscape demands a continuous ramp-up of skills: mastering a programming language like R, Python, or Java, exploring new machine learning frameworks and libraries like TensorFlow or Keras, and understanding cutting-edge algorithms like deep convolutional networks and k-means, to name a few. Had they lived in Dr. Frankenstein's world, where scientists worked hard in their labs, cut off from the rest of the world, this would have sufficed. But in the real world, data scientists use data and work with people to solve real-world problems for people. They need to learn something more, something that forms a bridge between their ideas and hypotheses and the rest of the world. Something that's more of an art than a skill these days. We're talking about soft skills for data scientists. Today we'll listen in on a conversation between a father and son, as we learn some critical soft skills for data scientists necessary to make it big in the data science world.

One chilly evening, Tommy is sitting with his dad in their grassy backyard with the radio on, humming along to their favourite tunes. Tommy, gazing up at the sky for a while, asks his dad, "Dad, what are clouds made of?" Dad takes a sip of beer and replies, "Mostly servers, son. And tonnes of data." Still gazing up, Tommy takes a deep breath, pondering what his dad just said.

Tommy: Tell me something, what's the most important thing you've learned in your career as a data scientist?

Dad smiles: I'm glad you asked, son. I'm going to share something important with you. Something I have learned over all these years crunching and munching data. I want you to keep this to yourself and remember it for as long as you can, okay?

Tommy: Yes dad.

Dad: Atta boy! Okay, the first thing you gotta do if you want to be successful is you gotta be curious! Data is everywhere and it can tell you a lot. But if you're not curious to explore data and tackle it from every angle, you will remain mediocre at best. Have an open mind: look at things through a kaleidoscope and challenge assumptions and presumptions. Innovation is the key to making the cut as a data scientist.

Tommy nods his head approvingly. Dad, satisfied that Tommy is following along, continues.

Dad: One of the most important skills a data scientist should possess is great business acumen. Now, I know you must be wondering why one would need business acumen when all they're doing is gathering a heap of data and making sense of it.

Tommy looks straight-faced at his dad.

Dad: Well, a data scientist needs to know the business like the back of their hand, because unless they do, they won't understand what the business' strengths and weaknesses are and how data can contribute towards boosting its success. They need to understand where the business fits into the industry and what it needs to do to remain competitive.

Dad's last statement is rewarded by an energetic, affirmative nod from Tommy. Smiling, dad is quite pleased with the response.

Dad: Communication is next on the list. Without a clever tongue, a data scientist will find himself going nowhere in the tech world. Gone are the days when technical knowledge was all that was needed to sustain a career. A data scientist's job is to help a business make critical, data-driven decisions. Of what use is it to the non-technical marketing or sales teams if the data scientist can't communicate his or her insights in a clear and effective way? A data scientist must also be a good listener, to truly understand what the problem is and come up with the right solution.

Tommy leans back in his chair, looking up at the sky again, thinking about how he would communicate insights effectively.

Dad continues: Very closely associated with communication is the ability to present well, or as a data scientist would put it, tell tales that inspire action. A data scientist might have to put forward their findings before an entire board of directors, who will be extremely eager to know why they need to take a particular decision and how it will benefit the organization. Here, clear articulation, a knack for storytelling, and strong convincing skills are all important for the data scientist to get the message across in the best way.

Tommy quips: Like the way you convince mom to do the dishes every evening?

Dad playfully punches Tommy: Hahaha, you little rascal!

Tommy: Are there any more skills a data scientist needs to possess to excel at what they do?

Dad: Indeed, there are! True data science is a research activity, where problems with unclear or unobvious solutions get solved. There are times when even the nature of the problem isn't clear. A data scientist should be skilled at performing their own independent research: snooping around for information or data, gathering it, and preparing it for further analysis. Many organisations look for people with strong research capabilities before they recruit them.

Tommy: What about you? Would you recruit someone without a research background?

Dad: Well, personally, no. But that doesn't mean I would only hire someone if they had a PhD. Even an MSc would do, if they were able to justify their research project and convince me that they're capable of performing independent research. I wouldn't hesitate to take them on board. Here's where I want to share one of the most important skills I've learned in all my years. Any guesses on what it might be?

Tommy: Hiring?

Dad: Ummmmm... I'll give this one to you 'cos it's pretty close. The actual answer is, of course, a much broader term: 'management'. It encompasses everything from hiring the right candidates for your team to practically doing everything that a person handling a team does.

Tommy: And what's that?

Dad: Well, as a senior data scientist, one would be expected to handle a team of less experienced data scientists, managing, mentoring, and helping them achieve their goals. It's a very important skill to hone as you climb up the ladder. Some learn it through experience, others learn it by taking management courses. Either way, this skill is important for one to succeed in a senior role. And that's about all I have for now. I hope at least some of this benefits you as you step into your first job tomorrow.

Tommy smiles: Yeah dad, it's great to have someone in the same line of work to look up to when I'm just starting out my career. I'm glad we had this conversation. Holding up an empty can, he says, "I'm out, toss me another beer, please."

Soft Skills for Data Scientists - A quick recap

In addition to keeping yourself technically relevant, to succeed as a data scientist you need to:

- Be curious: explore data from different angles, and question assumptions and presumptions.
- Have strong business acumen: know your customer, know your business, know your market.
- Communicate effectively: speak the language of your audience, and listen carefully to understand the problem you want to solve.
- Master the art of presenting well: tell stories that inspire action, and get your message across through a combination of data storytelling, negotiation, and persuasion skills.
- Be a problem solver: do your own independent research, get your hands dirty, and dive deep for answers.
- Develop your management capabilities: manage, mentor, and help other data scientists reach their full potential.

9th Nov.' 17 - Headlines

Packt Editorial Staff
09 Nov 2017
3 min read
Bitcoin prices soar and tumble, MongoDB announces its biggest release, and a proposed Grid to improve blockchain systems, in today's top stories in data science news.

Bitcoin's roller-coaster amid SegWit2x cancellation

Bitcoin price surges to record high, then tanks, as plans to split the digital currency are called off

Bitcoin was scheduled to upgrade around Nov. 16 following a proposal called SegWit2x, which would have split the digital currency in two. But with major bitcoin developers dropping their support for the upgrade recently, the developers behind SegWit2x called off the upgrade plans on Wednesday. In response, the bitcoin price reached an all-time high of around $7,900. However, this was followed by a $1,000 crash, plummeting the price to $6,977. Experts believe the rapid price swing could reflect a conflict between the short- and long-term impacts of the SegWit2x cancellation. The hard fork would have split Bitcoin into two competing blockchains, resulting in an ugly fight for supremacy.

Announcing MongoDB 3.6

MongoDB 3.6 released: Change Streams, Retryable Writes among key updates in MongoDB's biggest ever release

MongoDB has announced its biggest release yet, version 3.6, with over a hundred new and updated features. With new array update operators, users can now specify in-place updates to specific array items at any depth of nesting. Extensions to the $lookup aggregation stage now allow uncorrelated subqueries and multiple matching conditions, so referencing and joining documents in complex combinations can be handled in the database. Also, MongoDB 3.6 introduces Change Streams, which applications can use to get real-time notification of updates to collection data (a minimal usage sketch appears at the end of this roundup). To handle network outages gracefully, MongoDB 3.6 adds Retryable Writes, a new feature ensuring that writes are performed exactly once, even in the face of outages. Besides, MongoDB 3.6 improves on its previous capabilities with the introduction of JSON Schema. "With MongoDB 3.6, schema isn't a straightjacket, it's a framework of validation you can tune to exactly the degree you need," co-founder Eliot Horowitz said in the official announcement.

A new 'Grid' blockchain system

Introducing Grid: a scalable blockchain system for better performance, resource segregation, and a working governance model

A new blockchain initiative, Grid, proposes to establish a blockchain system that functions as an operating system, similar to Linux. As per the modus operandi, Grid will run nodes on clusters. It will allow assigning transactions to different groups based on the mutex of the transactions. Transactions within a group will be processed in linear sequence, while all groups will be processed simultaneously. Grid adopts a Main Chain + N Side Chains architecture, which means each business scenario has its dedicated Side Chain to fulfill its requirements. By segregating resources like this, the processing efficiency of the system is increased and there is no congestion. Grid also promises a better governance model by permitting Side Chains to join or exit from the Main Chain dynamically based on stakeholder voting, thereby introducing competition and an incentive to improve each Side Chain. The Singapore-based Grid Foundation is promoting Grid's development and applications, while technical development will be led by Beijing Hoopox Information and Technology Co. Ltd.
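
As promised above, here is a minimal PyMongo sketch of the new Change Streams feature. It assumes a MongoDB 3.6+ deployment running as a replica set (change streams require one); the connection string, database, and collection names are placeholders:

    from pymongo import MongoClient

    # Change Streams require MongoDB 3.6+ and a replica set deployment.
    client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
    orders = client.shop.orders   # hypothetical database/collection

    # watch() opens a change stream: a cursor of real-time notifications
    # for inserts, updates, and deletes on the collection.
    with orders.watch() as stream:
        for change in stream:
            print(change["operationType"], change.get("fullDocument"))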

Frenemies: Intel and AMD partner on laptop chip to keep Nvidia at bay

Abhishek Jha
09 Nov 2017
3 min read
For decades, Intel and AMD have remained bitter archrivals. Today, they find themselves teaming up to thwart a common enemy: Nvidia. As Intel revealed its partnership with Advanced Micro Devices (AMD) on a next-generation notebook chip, it marked the first time the two chip giants have collaborated since the '80s.

The proposed chip for thin and lightweight laptops combines an Intel processor and an AMD graphics unit for complex video gaming. The new series of processors will be part of Intel's 8th-generation Core H-series mobile chips, expected to hit the market in the first quarter of 2018. What this means is that Intel's high-performance x86 cores will be combined with AMD Radeon graphics in the same processor package using Intel's EMIB multi-die technology. That is not all: Intel is also bundling the design with built-in High Bandwidth Memory (HBM2) RAM.

The new processor, Intel claims, reduces the usual silicon footprint by about 50%. And with a 'semi-custom' graphics processor from AMD, enthusiasts can look forward to discrete-graphics-level performance for playing games, editing photos or videos, and other tasks that can leverage modern GPU technologies.

What does AMD get? Having struggled to remain profitable in recent times, AMD has been losing share in the discrete notebook GPU market. The deal could bring additional revenue with increased market share. Most importantly, the laptops built with the new processors won't be competing with AMD's Ryzen chips (which are also designed for ultrathin laptops). AMD clarified the difference: while the new Intel chips are designed for serious gamers, the Ryzen chips (due out at the end of the year) can run games but are not specifically designed for that purpose.

"Our collaboration with Intel expands the installed base for AMD Radeon GPUs and brings to market a differentiated solution for high-performance graphics," said Scott Herkelman, vice president and general manager of AMD's Radeon Technologies Group. "Together we are offering gamers and content creators the opportunity to have a thinner-and-lighter PC capable of delivering discrete performance-tier graphics experiences in AAA games and content creation applications."

While more information will be available in the future, the first machines with the new technology are expected to release in the first quarter of 2018. Nvidia's stock fell on the news, while both AMD and Intel saw their shares surge.

A rivalry that began when AMD reverse-engineered the Intel 8080 microchip in 1975 could still be far from over, but in graphics, the two have been rather cordial. Despite hating each other since inception, both decided to pick the other as the lesser evil over Nvidia. This is why the Intel-AMD laptop chip partnership has a definite future. Currently centered around laptop solutions, it could even stretch to desktops, who knows!

Data Scientist: The sexiest role of the 21st century

Aarthi Kumaraswamy
08 Nov 2017
6 min read
"Information is the oil of the 21st century, and analytics is the combustion engine." -Peter Sondergaard, Gartner Research By 2018, it is estimated that companies will spend $114 billion on big data-related projects, an increase of roughly 300%, compared to 2013 (https://www.capgemini-consulting.com/resource-file-access/resource/pdf/big_dat a_pov_03-02-15.pdf). Much of this increase in expenditure is due to how much data is being created and how we are better able to store such data by leveraging distributed filesystems such as Hadoop. However, collecting the data is only half the battle; the other half involves data extraction, transformation, and loading into a computation system, which leverages the power of modern computers to apply various mathematical methods in order to learn more about data and patterns and extract useful information to make relevant decisions. The entire data workflow has been boosted in the last few years by not only increasing the computation power and providing easily accessible and scalable cloud services (for example, Amazon AWS, Microsoft Azure, and Heroku) but also by a number of tools and libraries that help to easily manage, control, and scale infrastructure and build applications. Such a growth in the computation power also helps to process larger amounts of data and to apply algorithms that were impossible to apply earlier. Finally, various computation- expensive statistical or machine learning algorithms have started to help extract nuggets of information from data. Finding a uniform definition of data science is akin to tasting wine and comparing flavor profiles among friends—everyone has their own definition and no one description is more accurate than the other. At its core, however, data science is the art of asking intelligent questions about data and receiving intelligent answers that matter to key stakeholders. Unfortunately, the opposite also holds true—ask lousy questions of the data and get lousy answers! Therefore, careful formulation of the question is the key for extracting valuable insights from your data. For this reason, companies are now hiring data scientists to help formulate and ask these questions. At first, it's easy to paint a stereotypical picture of what a typical data scientist looks like: t- shirt, sweatpants, thick-rimmed glasses, and debugging a chunk of code in IntelliJ... you get the idea. Aesthetics aside, what are some of the traits of a data scientist? One of our favorite posters describing this role is shown here in the following diagram: Math, statistics, and general knowledge of computer science is given, but one pitfall that we see among practitioners has to do with understanding the business problem, which goes back to asking intelligent questions of the data. It cannot be emphasized enough: asking more intelligent questions of the data is a function of the data scientist's understanding of the business problem and the limitations of the data; without this fundamental understanding, even the most intelligent algorithm would be unable to come to solid conclusions based on a wobbly foundation. A day in the life of a data scientist This will probably come as a shock to some of you—being a data scientist is more than reading academic papers, researching new tools, and model building until the wee hours of the morning, fueled on espresso; in fact, this is only a small percentage of the time that a data scientist gets to truly play (the espresso part however is 100% true for everyone)! 
Most part of the day, however, is spent in meetings, gaining a better understanding of the business problem(s), crunching the data to learn its limitations (take heart, this book will expose you to a ton of different feature engineering or feature extractions tasks), and how best to present the findings to non data-sciencey people. This is where the true sausage making process takes place, and the best data scientists are the ones who relish in this process because they are gaining more understanding of the requirements and benchmarks for success. In fact, we could literally write a whole new book describing this process from top-to-tail! So, what (and who) is involved in asking questions about data? Sometimes, it is process of saving data into a relational database and running SQL queries to find insights into data: "for the millions of users that bought this particular product, what are the top 3 OTHER products also bought?" Other times, the question is more complex, such as, "Given the review of a movie, is this a positive or negative review?" This book is mainly focused on complex questions, like the latter. Answering these types of questions is where businesses really get the most impact from their big data projects and is also where we see a proliferation of emerging technologies that look to make this Q and A system easier, with more functionality. Some of the most popular, open source frameworks that look to help answer data questions include R, Python, Julia, and Octave, all of which perform reasonably well with small (X < 100 GB) datasets. At this point, it's worth stopping and pointing out a clear distinction between big versus small data. Our general rule of thumb in the office goes as follows: If you can open your dataset using Excel, you are working with small data. Working with big data What happens when the dataset in question is so vast that it cannot fit into the memory of a single computer and must be distributed across a number of nodes in a large computing cluster? Can't we just rewrite some R code, for example, and extend it to account for more than a single-node computation? If only things were that simple! There are many reasons why the scaling of algorithms to more machines is difficult. Imagine a simple example of a file containing a list of names: B D X A D A We would like to compute the number of occurrences of individual words in the file. If the file fits into a single machine, you can easily compute the number of occurrences by using a combination of the Unix tools, sort and uniq: bash> sort file | uniq -c The output is as shown ahead: 2 A 1 B 1 D 1 X However, if the file is huge and distributed over multiple machines, it is necessary to adopt a slightly different computation strategy. For example, compute the number of occurrences of individual words for every part of the file that fits into the memory and merge the results together. Hence, even simple tasks, such as counting the occurrences of names, in a distributed environment can become more complicated. The above is an excerpt from the book  Mastering Machine Learning with Spark 2.x by Alex Tellez, Max Pumperla and Michal Malohlava. If you would like to learn how to solve the above problem and other cool machine learning tasks a data scientist carries out such as the following, check out the book. 
Use Spark streams to cluster tweets online Run the PageRank algorithm to compute user influence Perform complex manipulation of DataFrames using Spark Define Spark pipelines to compose individual data transformations Utilize generated models for off-line/on-line prediction
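
Returning to the distributed counting strategy above, here is a minimal Python illustration of the "count each part, then merge" idea, with in-memory lists standing in for the parts of a file spread across machines (our sketch, not code from the book):

    from collections import Counter

    # Stand-ins for the parts of a huge file spread across machines.
    chunks = [
        ["B", "D", "X"],
        ["A", "D", "A"],
    ]

    # Map: count occurrences locally on each "machine".
    partial_counts = [Counter(chunk) for chunk in chunks]

    # Reduce: merge the partial results into a global count.
    total = Counter()
    for partial in partial_counts:
        total += partial

    print(sorted(total.items()))   # [('A', 2), ('B', 1), ('D', 1), ('X', 1)]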

8th Nov.' 17 - Headlines

Packt Editorial Staff
08 Nov 2017
3 min read
spaCy's latest version, Microsoft's artificial intelligence processor, and a proposed AI broker, among today's tech stories in data science news.

Announcing spaCy 2.0

spaCy 2.0 released with 13 new neural network models for 7+ languages

The 2.0 version of spaCy has been released, making it up to date with the latest deep learning technologies, with over 60 bug fixes that include several long-standing issues. It is now easier to run spaCy in scalable cloud computing workflows. spaCy v2.0 comes with 13 new convolutional neural network models for 7+ languages, adding alpha tokenization support for 8 new languages. These models have been designed and implemented from scratch specifically for spaCy, the developer team said, adding that they "re-wrote almost all of the usage guides, API docs and code examples." For a full overview of changes in v2.0, users can see the guide on migrating from spaCy 1.x. (A minimal usage sketch appears at the end of this roundup.)

Microsoft goes full throttle on AI chip

Microsoft says it will extend HoloLens AI processor to other devices from daily life

In July, Microsoft revealed it was designing a custom AI chip for its next-generation HoloLens headsets. In the latest developments, the company's corporate vice president Panos Panay has said that while work on the proposed artificial intelligence processor is going at full speed, the AI chip may well be implemented in everyday devices beyond HoloLens, such as mobile phones, TVs, wearables, smart home devices, and computers. Panay said in an interview that Microsoft is not designing the processor just for its own products, but for devices from all other brands. The AI processor will analyze what users see and hear on the spot, without having to waste precious time sending the data to the cloud for analysis.

Other News

Cloud SQL for PostgreSQL integrates high availability and replication

Cloud SQL for PostgreSQL has now added support for high availability (HA) and read replicas, which can ensure that users' database workloads are fault tolerant. Announcing the release, the developers said the beta release of high availability provides isolation from failures, while read replicas provide additional read performance: requirements for demanding workloads.

Artificial intelligence creeps into crypto trading, AiX claims to develop first AI broker

Startup AiX has announced the creation of an electronic broker with artificial intelligence. AiX said its AI broker blends cutting-edge artificial intelligence with blockchain technology to make trading cheaper, faster, and more trustworthy. Using an AI chatbot and Alexa-style voice recognition, it will execute trades on behalf of individual traders and investment banks, which may cut down on trade costs altogether. As all actions will be recorded on a blockchain, AiX believes the process will bring reliability and transparency. Having already secured $16 million for this project, AiX plans to raise further capital through a token sale before year end.
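
As promised above, a minimal spaCy v2.0 usage sketch. It assumes the small English model has been installed (python -m spacy download en_core_web_sm); model names vary by version:

    import spacy

    # Load the small English convolutional neural network pipeline.
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

    # Tokens with part-of-speech tags and dependency labels.
    for token in doc:
        print(token.text, token.pos_, token.dep_)

    # Named entities recognized by the statistical model.
    for ent in doc.ents:
        print(ent.text, ent.label_)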

Introducing "Pyro" for deep probabilistic modeling

Abhishek Jha
08 Nov 2017
3 min read
Last year, when Uber set up its ambitious facility in San Francisco as Uber AI Labs, the aim was to leverage cutting-edge research in artificial intelligence and machine learning to move people and things in the real world: a challenge that is more complex and uncertain than it appears on paper. It extends, as the firm admitted, to teaching a self-driving machine to safely and autonomously navigate the world, whether a car on the roads, an aircraft through busy airspace, or new types of robotic devices.

Well, the first big initiative to come out of the Labs is Pyro, a deep universal probabilistic programming language. "Pyro is a tool for deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling," writes Stanford researcher Noah Goodman, a member of Uber AI Labs. Written in Python, the Pyro programming language uses PyTorch as its backend. Among the key principles underlying Pyro's design, it is a flexible and scalable programming library, implemented with a small core of powerful, composable abstractions.

Pyro: design principles and insights

In Uber's own words, Pyro was developed to satisfy the following four design principles:

- Universal: Pyro is a universal PPL; it can represent any computable probability distribution. How? By starting from a universal language with iteration and recursion (arbitrary Python code), and then adding random sampling, observation, and inference.
- Scalable: Pyro scales to large data sets with little overhead above hand-written code. How? By building in modern black-box optimization techniques, which use mini-batches of data, to approximate inference.
- Minimal: Pyro is agile and maintainable. How? Pyro is implemented with a small core of powerful, composable abstractions. Wherever possible, the heavy lifting is delegated to PyTorch and other libraries.
- Flexible: Pyro aims for automation when you want it and control when you need it. How? Pyro uses high-level abstractions to express generative and inference models, while allowing experts to easily customize inference.

In a way, Pyro is going to reflect interesting aspects of PPL research, from dynamic computational graphs to deep generative models and programmable inference. "In Pyro, both the generative models and the inference guides can include deep neural networks as components," Goodman wrote. "The resulting deep probabilistic models have shown great promise in recent work, especially for unsupervised and semi-supervised machine learning problems."

Pyro: installation

Remember to first install PyTorch. Then install via pip:

    pip install pyro-ppl    # Python 2.7.*
    pip3 install pyro-ppl   # Python 3.5

Or install from source:

    git clone git@github.com:uber/pyro.git
    cd pyro
    pip install .

Still in its alpha release, Pyro may see several enhancements in the coming days, with more and more engagement from the probabilistic programming and deep learning communities.
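
To give a flavor of the programming model, here is a minimal Pyro sketch of a generative model: a toy weather model where temperature depends on a sampled "cloudy" variable. It is written against the current Pyro API, which may differ in detail from the alpha release described above:

    import torch
    import pyro
    import pyro.distributions as dist

    def weather():
        # Sample a latent binary variable: is it cloudy?
        cloudy = pyro.sample("cloudy", dist.Bernoulli(torch.tensor(0.3)))
        # Condition the temperature distribution on that sample.
        mean = torch.tensor(55.0 if cloudy.item() == 1.0 else 75.0)
        temp = pyro.sample("temp", dist.Normal(mean, torch.tensor(10.0)))
        return cloudy.item(), temp.item()

    # Each call runs the generative story and returns one random world.
    print(weather())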

Dr. Brandon explains Decision Trees to Jon

Aarthi Kumaraswamy
08 Nov 2017
3 min read
Dr. Brandon: Hello and welcome to the third episode of 'Date with Data Science'. Today we talk about decision trees in machine learning.

Jon: Decisions are hard enough to make. Now you want me to grow a decision tree. Next, you'll say there are decision jungles too!

Dr. Brandon: It might come as a surprise to you, Jon, but decision trees can help you make decisions more easily. Imagine you are in a restaurant and you are given a menu card. A decision tree can help you decide if you want to have a burger, pizza, fries or a pie, for instance. And yes, there are decision jungles, but they are called random forests. We will talk about them another time.

Jon: You know Bran, I have never been very good at making decisions. But with food, it is easy. It's ALWAYS all you can have.

Dr. Brandon: Well, my mistake. Let's take another example. You go to the doctor's after your binge eating at the restaurant with stomach complaints. A decision tree can help your doctor decide if you have a problem, and then choose a treatment option based on what your symptoms are.

Jon: Really!? Tell me more.

Dr. Brandon: Alright. The following excerpt introduces decision trees, from the book Apache Spark 2.x Machine Learning Cookbook by Siamak Amirghodsi, Meenakshi Rajendran, Broderick Hall, and Shuen Mei. To learn how to implement them in Spark, read this article.

Decision trees are one of the oldest and most widely used methods of machine learning in commerce. What makes them popular is not only their ability to deal with more complex partitioning and segmentation (they are more flexible than linear models) but also their ability to explain how we arrived at a solution and "why" the outcome is predicted or classified as a class/label.

A quick way to think about the decision tree algorithm is as a smart partitioning algorithm that tries to minimize a loss function (for example, L2 or least squares) as it partitions the ranges to come up with a segmented space of best-fitted decision boundaries for the data. The algorithm gets more sophisticated through the application of sampling the data and trying combinations of features to assemble a more complex ensemble model, in which each learner (partial sample or feature combination) gets to vote toward the final outcome.

The following figure depicts a simplified version in which a simple binary tree (a stump) is trained to classify the data into segments belonging to two different colors (for example, healthy patient/sick patient). The figure depicts a simple algorithm that just breaks the x/y feature space in half every time it establishes a decision boundary (hence classifying), while minimizing the number of errors (for example, an L2 least squares measure).

The following figure provides the corresponding tree, so we can visualize the algorithm (in this case, a simple divide and conquer) against the proposed segmentation space. What makes decision tree algorithms popular is their ability to show their classification results in a language that can easily be communicated to a business user without much math.

If you liked the above excerpt, please be sure to check out the book Apache Spark 2.x Machine Learning Cookbook it is originally from, to learn how to implement deep learning using Spark and many more useful techniques for implementing machine learning solutions with the MLlib library in Apache Spark 2.0.
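
To ground the intuition, here is a minimal decision tree in scikit-learn rather than the book's Spark implementation, using a made-up healthy/sick dataset over two numeric symptom features; the printed rules illustrate the "explainability" point above:

    from sklearn.tree import DecisionTreeClassifier, export_text

    # Hypothetical patient data: [temperature_C, heart_rate_bpm]
    X = [[36.6, 70], [36.8, 75], [39.5, 110],
         [40.1, 120], [37.0, 72], [39.9, 115]]
    y = ["healthy", "healthy", "sick", "sick", "healthy", "sick"]

    # A shallow tree: each split is one decision boundary in feature space.
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(X, y)

    # The fitted rules read like plain language, which is why business
    # users find decision trees easy to interpret.
    print(export_text(tree, feature_names=["temperature", "heart_rate"]))
    print(tree.predict([[38.5, 100]]))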

4 Clustering Algorithms every Data Scientist should know

Sugandha Lahoti
07 Nov 2017
6 min read
This is an excerpt from the book Java Data Analysis by John R. Hubbard. In this article, we look at four popular clustering algorithms: hierarchical clustering, k-means clustering, k-medoids clustering, and affinity propagation, along with their pseudocode.

A clustering algorithm is one that identifies groups of data points according to their proximity to each other. These algorithms are similar to classification algorithms in that they also partition a dataset into subsets of similar points. But in classification, we already have data whose classes have been identified, such as sweet fruit. In clustering, we seek to discover the unknown groups themselves.

Hierarchical clustering

Of the several clustering algorithms we examine in this article, hierarchical clustering is probably the simplest. The trade-off is that it works well only with small datasets in Euclidean space. The general setup is that we have a dataset S of m points in R^n which we want to partition into a given number k of clusters C1, C2, ..., Ck, where within each cluster the points are relatively close together. Here is the algorithm:

1. Create a singleton cluster for each of the m data points.
2. Repeat m - k times:
   - Find the two clusters whose centroids are closest.
   - Replace those two clusters with a new cluster that contains their points.

The centroid of a cluster is the point whose coordinates are the averages of the corresponding coordinates of the cluster points. For example, the centroid of the cluster C = {(2, 4), (3, 5), (6, 6), (9, 1)} is the point (5, 4), because (2 + 3 + 6 + 9)/4 = 5 and (4 + 5 + 6 + 1)/4 = 4. This is illustrated in the figure below.

K-means clustering

A popular alternative to hierarchical clustering is the k-means algorithm. It is related to the k-nearest neighbor (KNN) classification algorithm. As with hierarchical clustering, the k-means clustering algorithm requires the number of clusters, k, as input. (This version is also called the k-means++ algorithm.) Here is the algorithm:

1. Select k points from the dataset.
2. Create k clusters, each with one of the initial points as its centroid.
3. For each dataset point x that is not already a centroid:
   - Find the centroid y that is closest to x.
   - Add x to that centroid's cluster.
   - Re-compute the centroid for that cluster.

It also requires k points, one for each cluster, to initialize the algorithm. These initial points can be selected at random, or by some a priori method. One approach is to run hierarchical clustering on a small sample taken from the given dataset and then pick the centroids of those resulting clusters.

K-medoids clustering

The k-medoids clustering algorithm is similar to the k-means algorithm, except that each cluster center, called its medoid, is one of the data points instead of being the mean of its points. The idea is to minimize the average distance from the medoids to the points in their clusters. The Manhattan metric is usually used for these distances. Since those averages will be minimal if and only if the distances are, the algorithm reduces to minimizing the sum of all distances from the points to their medoids. This sum is called the cost of the configuration. Here is the algorithm:

1. Select k points from the dataset to be medoids.
2. Assign each data point to its closest medoid. This defines the k clusters.
3. For each cluster Cj:
   - Compute the sum s = ∑j sj, where each sj = ∑ { d(x, yj) : x ∈ Cj }, and change the medoid yj to whichever point in the cluster Cj minimizes s.
   - If the medoid yj was changed, re-assign each x to the cluster whose medoid is closest.
4. Repeat step 3 until s is minimal.

This is illustrated by the simple example in Figure 8.16, which shows 10 data points in 2 clusters. The two medoids are shown as filled points. The initial configuration is:

    C1 = {(1,1), (2,1), (3,2), (4,2), (2,3)}, with y1 = x1 = (1,1)
    C2 = {(4,3), (5,3), (2,4), (4,4), (3,5)}, with y2 = x10 = (3,5)

The sums are:

    s1 = d(x2,y1) + d(x3,y1) + d(x4,y1) + d(x5,y1) = 1 + 3 + 4 + 3 = 11
    s2 = d(x6,y2) + d(x7,y2) + d(x8,y2) + d(x9,y2) = 3 + 4 + 2 + 2 = 11
    s = s1 + s2 = 11 + 11 = 22

The first part of step 3 changes the medoid for C1 to y1 = x3 = (3,2). This causes the clusters to change, at the second part of step 3, to:

    C1 = {(1,1), (2,1), (3,2), (4,2), (2,3), (4,3), (5,3)}, with y1 = x3 = (3,2)
    C2 = {(2,4), (4,4), (3,5)}, with y2 = x10 = (3,5)

This makes the sums:

    s1 = 3 + 2 + 1 + 2 + 2 + 3 = 13
    s2 = 2 + 2 = 4
    s = s1 + s2 = 13 + 4 = 17

The resulting configuration is shown in the second panel of the figure below. At step 3, the process repeats for cluster C2. The resulting configuration is shown in the third panel of the above figure. The computations are:

    C1 = {(1,1), (2,1), (3,2), (4,2), (4,3), (5,3)}, with y1 = x3 = (3,2)
    C2 = {(2,3), (2,4), (4,4), (3,5)}, with y2 = x8 = (2,4)
    s = s1 + s2 = (3 + 2 + 1 + 2 + 3) + (1 + 2 + 2) = 11 + 5 = 16

The algorithm continues with two more changes, finally converging to the minimal configuration shown in the fifth panel of the above figure. This version of k-medoid clustering is also called partitioning around medoids (PAM).

Affinity propagation clustering

One disadvantage of each of the clustering algorithms presented so far (hierarchical, k-means, k-medoids) is the requirement that the number of clusters k be determined in advance. The affinity propagation clustering algorithm does not have that requirement. Developed in 2007 by Brendan J. Frey and Delbert Dueck at the University of Toronto, it has become one of the most widely used clustering methods.

Like k-medoid clustering, affinity propagation selects cluster center points, called exemplars, from the dataset to represent the clusters. This is done by message-passing between the data points. The algorithm works with three two-dimensional arrays:

- sij = the similarity between xi and xj
- rik = responsibility: a message from xi to xk on how well-suited xk is as an exemplar for xi
- aik = availability: a message from xk to xi on how well-suited xk is as an exemplar for xi

Here is the complete algorithm:

1. Initialize the similarities: sij = -d(xi, xj)^2 for i ≠ j; sii = the average of those other sij values.
2. Repeat until convergence:
   - Update the responsibilities: rik = sik - max { aij + sij : j ≠ k }
   - Update the availabilities: aik = min { 0, rkk + ∑j { max {0, rjk} : j ≠ i ∧ j ≠ k } } for i ≠ k; akk = ∑j { max {0, rjk} : j ≠ k }

A point xk will be an exemplar for a point xi if aik + rik = maxj { aij + rij }.

If you enjoyed this excerpt from the book Java Data Analysis by John R. Hubbard, check out the book to learn how to implement various machine learning algorithms, data visualization, and more in Java.
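
As a concrete companion to the k-means pseudocode above, here is a minimal NumPy sketch (a Python stand-in for the book's Java setting), run on the ten points from the k-medoids example:

    import numpy as np

    def kmeans(points, k, iters=10, seed=0):
        """Minimal k-means: assign points to nearest centroid, then re-average."""
        rng = np.random.default_rng(seed)
        # Initialize centroids with k points chosen from the dataset.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(iters):
            # Assignment step: index of the nearest centroid for each point.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: each centroid becomes the mean of its cluster.
            new_centroids = []
            for j in range(k):
                members = points[labels == j]
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(members.mean(axis=0) if len(members) else centroids[j])
            centroids = np.array(new_centroids)
        return labels, centroids

    pts = np.array([[1, 1], [2, 1], [3, 2], [4, 2], [2, 3],
                    [4, 3], [5, 3], [2, 4], [4, 4], [3, 5]], dtype=float)
    labels, centroids = kmeans(pts, k=2)
    print(labels)
    print(centroids)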

Salesforce myEinstein: Now build AI apps with 'clicks, not code'

Abhishek Jha
07 Nov 2017
3 min read
This year's Dreamforce conference has started rather big. The Einstein machine learning platform has been updated with new predictive insights and chatbot capabilities, in ways that could truly make AI and deep learning more accessible to developers. The latest iteration, Salesforce myEinstein, allows users of all skill levels to develop custom AI apps "with clicks, without being a data scientist."

The tool has two new services: Einstein Prediction Builder and Einstein Bots. Einstein Prediction Builder enables automatic creation of custom AI models that can forecast outcomes for any field or object in Salesforce. With Einstein Bots, developers and admins can use a point-and-click interface to build custom chatbots: a service which can be trained to augment customer service workflows by automating tasks such as answering questions and retrieving information.

"We are further democratizing AI by empowering admins and developers to transform every process and customer interaction to be more intelligent with myEinstein," Salesforce GM and SVP John Ball said. "No other company is arming customers with both pre-built AI apps for CRM and the ability to build and customise their own with just clicks."

As far as business processes are concerned, it's high time that employees are freed from one-size-fits-all tools and, more importantly, the repetitive tasks that take up their days. But to this date, companies have been hindered by infrastructure costs, lack of expertise, and the resources required to optimize their workflows with AI. This is where Salesforce myEinstein is a remarkable announcement. With myEinstein, the employees who actually manage and drive business processes have the power to build and customize AI apps to fit their specific needs, paving the way for everyone to be smarter and more productive in the process.

So how does myEinstein work with 'simple clicks' after all? The declarative setup guide walks users through building, training, and deploying AI models using structured and unstructured Salesforce data. The service automates the model building and data scoring process, and custom predictive models and bots can then be embedded directly into Salesforce workflows. Models and bots automatically learn and improve as they're used, delivering accurate, personalized recommendations and predictions in the context of the business.

Both tools, Einstein Prediction Builder and Einstein Bots, are currently in pilot and will be generally available in the summer of 2018. Salesforce said pricing for each Einstein feature varies, as some are already covered under the existing license while others require additional charges. It remains to be seen to what extent Salesforce manages to reduce the complexity of creating bots and bring an element of underlying intelligence, but as the firm's vice president Jim Sinai said, myEinstein is "automating data science under the hood."

7th Nov.' 17 - Headlines

Packt Editorial Staff
07 Nov 2017
5 min read
Google’s Tangent, Salesforce’s myEinstein, Intel-AMD partnership, and HPE’s Superdome Flex among today’s top stories in data science news. Announcing Python library Tangent Google introduces Tangent, a Python library for automatic differentiation Google has announced a new, open-source Python library for automatic differentiation called Tangent. In contrast to existing machine learning libraries, Tangent is a source-to-source system, consuming a Python function f and emitting a new Python function that computes the gradient of f. This allows much better user visibility into gradient computations, as well as easy user-level editing and debugging of gradients. Tangent is useful to researchers and students who not only want to write their models in Python, but also read and debug automatically-generated derivative code without sacrificing speed and flexibility. Salesforce in news Salesforce announces machine learning platform myEinstein to build custom AI apps Salesforce has unveiled a machine learning platform myEinstein at its annual Dreamforce conference on Monday. The myEinstein platform enables users to develop custom AI apps "with clicks, without being a data scientist." The tool has two new services: Einstein Prediction Builder and Einstein Bots. Einstein Prediction Builder enables automatic creation of custom AI models that can predict outcomes for any field or object in Salesforce. Whereas Einstein Bots is a service which can be trained to augment customer service workflows by automating tasks such as answering questions and retrieving information. Salesforce, Google form strategic partnership on cloud Salesforce and Google have entered into a cloud partnership that could provide easier integration between Salesforce tools and Google’s G Suite and Google Analytics. Salesforce plans to use Google Cloud Platform (GCP) for its core services as part of its international infrastructure expansion. Intel-AMD partnership to target Nvidia Intel teams up with AMD for semi-custom GPU for next-gen mobile chips In a bid to counter rival Nvidia, Intel has joined hands with AMD to create a next-generation notebook chip. Intel said the new chips will be part of its 8th-generation Core H mobile processors, and will not only feature a discrete-level graphics cards, but also have built-in High Bandwidth Memory (HBM2) RAM packed onto a single board. While more information will be available in future, the first machines with the new technology will be released in the first quarter of 2018. New analytics platforms announced Rockwell unveils Project Scio, a scalable analytics platform for industrial IoT applications Rockwell Automation has announced Project Scio, a scalable and open platform that gives users secure, persona-based access to all data sources, structured or unstructured. The company said that Scio offers a configurable, easy-to-use interface with which “all users can become self-serving data scientists to solve problems and drive tangible business outcomes.” It can also intelligently fuse related data, delivering analytics in intuitive dashboards – called storyboards – that users can share and view. “Providing analytics at all levels of the enterprise – on the edge, on-premises or in the cloud – helps users have the ability to gain insights not possible before,” said John Genovesi, vice president of Information Software, Rockwell Automation. 
Salesforce in news

Salesforce announces machine learning platform myEinstein to build custom AI apps

Salesforce unveiled its machine learning platform myEinstein at its annual Dreamforce conference on Monday. The platform lets users develop custom AI apps “with clicks, without being a data scientist,” and ships with two new services: Einstein Prediction Builder and Einstein Bots. Einstein Prediction Builder automates the creation of custom AI models that can predict outcomes for any field or object in Salesforce. Einstein Bots, meanwhile, is a service that can be trained to augment customer service workflows by automating tasks such as answering questions and retrieving information.

Salesforce, Google form strategic partnership on cloud

Salesforce and Google have entered into a cloud partnership that could provide easier integration between Salesforce tools and Google’s G Suite and Google Analytics. Salesforce plans to use Google Cloud Platform (GCP) for its core services as part of its international infrastructure expansion.

Intel-AMD partnership to target Nvidia

Intel teams up with AMD on a semi-custom GPU for next-gen mobile chips

In a bid to counter rival Nvidia, Intel has joined hands with AMD to create a next-generation notebook chip. Intel said the new chips will be part of its 8th-generation Core H mobile processors and will not only feature a discrete-level graphics card but also pack built-in High Bandwidth Memory (HBM2) RAM onto a single board. While more details are yet to come, the first machines with the new technology are slated for release in the first quarter of 2018.

New analytics platforms announced

Rockwell unveils Project Scio, a scalable analytics platform for industrial IoT applications

Rockwell Automation has announced Project Scio, a scalable and open platform that gives users secure, persona-based access to all data sources, structured or unstructured. The company said that Scio offers a configurable, easy-to-use interface with which “all users can become self-serving data scientists to solve problems and drive tangible business outcomes.” It can also intelligently fuse related data, delivering analytics in intuitive dashboards – called storyboards – that users can share and view. “Providing analytics at all levels of the enterprise – on the edge, on-premises or in the cloud – helps users have the ability to gain insights not possible before,” said John Genovesi, vice president of Information Software, Rockwell Automation. “When users gain the ability to fuse multiple data sources and add machine learning, their systems could become more predictive and intelligent.”

HPE launches Superdome Flex, a high-performance data analytics platform for mission-critical workloads

Hewlett Packard Enterprise (HPE) has unveiled HPE Superdome Flex, a highly scalable and modular in-memory computing platform. The platform enables enterprises of any size to process and analyze massive amounts of data and turn it into real-time business insights. “With HPE Superdome Flex, customers can capitalize on in-memory data analytics for their most critical workloads and scale seamlessly as data volumes grow,” said Randy Meyer, vice president at HPE.

Other news in data science

Google releases its internal tool Colaboratory

Google has released yet another internal development tool, Colaboratory. Built on top of the open-source Jupyter project, Colaboratory is both an education tool and a collaboration tool for research. With Colaboratory, users create notebooks, or documents, that can be simultaneously edited like Google Docs, with the added ability to run code and show its output within the document. It supports Python 2.7, currently requires Google Chrome, and is integrated with Google Drive.

Neuromation announces ICO to facilitate AI adoption with blockchain-powered platform

Neuromation is using blockchain technology to create a marketplace, the Neuromation Platform, which will connect multiple parties and bridge the gap between the research, design, and implementation stages of AI modeling in a cost-effective manner. The Neuromation ICO is in its pre-sale stage, which will end with the public sale, starting on Nov. 28 and ending on Jan. 1, 2018. Out of a total of 100,000,000 Neurotokens, 60,000,000 will be available for distribution, with each token priced at 0.001 ETH. According to the project roadmap, the second version of the Neuromation Platform will launch in Q2 2018, followed by v3 on a custom blockchain in Q3 2018.

DefinedCrowd unveils data platform API at Web Summit 2017

Seattle-based startup DefinedCrowd Corp. announced version 1.0 of its public API at Web Summit 2017 in Lisbon. The product, generally available on November 8, helps companies create new projects, upload tasks, and execute data collection and data processing campaigns in a more streamlined way, directly from their own data and machine learning infrastructure. “The life of data scientists will become easier with this API,” said CEO and Founder Daniela Braga. “They will have the option to integrate their data platforms with DefinedCrowd, having complete control of their projects, working from their own platforms. This will give them direct access to high-quality large-scale data with very little overhead.”

Apache Kafka 1.0: From messaging system to streaming platform

Abhishek Jha
06 Nov 2017
4 min read
In the tech world, when a 1.0 version gets released, it’s assumed that the software is stable, mature, and production ready. But for Neha Narkhede, co-founder of Confluent and co-creator of Apache Kafka, the wait for Apache Kafka 1.0 “was less about stability and more about completeness of the vision” she and a team of engineers set out to build towards when they first started Kafka in 2009. After all, Kafka has been broadly adopted by thousands of companies for several years, including a third of the Fortune 500 enterprises that continue to trust the platform for their mission-critical applications.

Every software project has a unique story to tell on its journey towards 1.0. In the case of Kafka, named after the acclaimed German-language writer Franz Kafka (Jay Kreps spilled the beans in a 2014 Quora post), it’s the story of a transformation from a messaging system to a distributed streaming platform.

“Back in 2009 when we first set out to build Kafka, we thought there should be an infrastructure platform for streams of data. We didn’t start with the idea of making our own software, but started by observing the gaps in the technologies available at the time and realized how they were insufficient to serve the needs of a data-driven organization,” says Neha.

This is interesting because the team was not imagining some hypothetical need, but responding to a real-world business need – not by building Kafka first, but by asking why the stream processing startups of the 1990s and 2000s had failed. “They failed because companies did not have the ability to collect these streams and have them laying around to process,” she adds. “The big question we asked ourselves was ‘why not both scale and real-time?’ And more broadly, why not build a true infrastructure platform that allows you to build all of your applications on top of it, and have those applications handle streaming data by default.”

And thus followed a multi-stage transformation: implementing a log-like abstraction for continuous streams, making Kafka fault-tolerant and building replication into it, building APIs that made it easy to get data in and out of Kafka and process it, and, more recently, adding transactions to enable exactly-once semantics for stream processing.

Version 1.0.0 brings further performance improvements along with exactly-once semantics, which avoid sending the same messages multiple times in the case of a connection error. The exactly-once capabilities enable enterprise stream processing in a controlled manner, as they enable “closure-like functions” for stream processing. Fundamentally, delivering a message to an endpoint once, and no more than once, in distributed stateless systems has been an ongoing challenge. And while guaranteed exactly-once delivery remains a debated topic, Kafka’s continued enhancements around exactly-once semantics have earned it wide acceptance.
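To make the exactly-once guarantee concrete: an idempotent, transactional producer writes a batch of messages that either all become visible to consumers or none do. A minimal sketch using the confluent-kafka Python client (the topic name and transactional.id are illustrative, and note this transactional client API arrived in client libraries after the 1.0-era broker-side work):

```python
from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'localhost:9092',
    'enable.idempotence': True,        # broker de-duplicates retried sends
    'transactional.id': 'orders-tx-1', # stable id lets the broker fence stale producers
})

# One-time registration of the transactional id with the broker.
producer.init_transactions()

producer.begin_transaction()
try:
    for i in range(10):
        # These writes stay invisible to read_committed consumers
        # until the transaction commits.
        producer.produce('orders', key=str(i), value='order-%d' % i)
    producer.commit_transaction()      # all 10 records appear atomically
except Exception:
    producer.abort_transaction()       # ...or none of them do
```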
Besides this, Apache Kafka 1.0.0 carries several important improvements, such as significantly faster TLS and CRC32C implementations with Java 9 support, faster controlled shutdown, and better JBOD support, among other bug fixes. Other features also got the nod: Kafka can now tolerate disk failures better, there are better diagnostics for Simple Authentication and Security Layer (SASL) authentication failures, and the Streams API has been improved with functional enhancements.

“The nice thing about all this is that while the current instantiation of Kafka’s Streams APIs is in the form of Java libraries, it isn’t limited to Java per se. Kafka’s support for stream processing is primarily a protocol-level capability that can be represented in any language. This is an important distinction. Stream processing isn’t one interface, so there is no restriction for it to be available as a Java library alone. There are many ways to express continual programs: SQL, function-as-a-service, or collection-like DSLs in many programming languages. A foundational protocol is the right way to address this diversity in applications around an infrastructure platform,” said Neha.

Maybe it is this continual improvement she was talking about as part of Apache Kafka’s long road to completeness of vision, which has seen it trusted by companies like LinkedIn, Capital One, Goldman Sachs, Netflix, Pinterest, and The New York Times. “Kafka enabled us to process trillions of messages per day in a scalable way. This opened up a completely new frontier for us to efficiently process data in motion to help us better serve Netflix members around the world,” said Allen Wang, Senior Software Engineer at Netflix.

Apache Kafka 1.0 is more than just a release. As the company rightly puts it, 1.0.0 is not a ‘mere bump of the version number’ but a full-fledged streaming platform with the ability to read, write, move, and process streams of data with transactional correctness at enterprise-wide scale. It may, in fact, play an even bigger role if stream processing goes on to become the “central nervous system” for companies worldwide.
6th Nov.' 17 - Headlines

Packt Editorial Staff
06 Nov 2017
4 min read
Uber’s new programming language Pyro, Tableau’s new integration with AWS analytics, IBM’s cloud restructuring, and more in today’s tech stories on data science news.

Introducing Pyro for deep probabilistic modeling

Uber AI Labs announces Pyro, a PyTorch-based deep universal probabilistic programming language

As the first public project to come out of Uber AI Labs, Uber has released a programming language called Pyro that helps developers build probabilistic models for AI research. “Pyro is a tool for deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling,” wrote Noah Goodman, Stanford researcher and member of Uber AI Labs, in a blog post. Pyro is based on Python and the PyTorch library. “In Pyro, both the generative models and the inference guides can include deep neural networks as components,” Goodman added. “The resulting deep probabilistic models have shown great promise in recent work, especially for unsupervised and semi-supervised machine learning problems.” Uber notes that Pyro is an alpha release.
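To give a flavor of what a Pyro program looks like, here is a minimal sketch in the spirit of Pyro’s introductory examples (the weather model and its parameters are illustrative, not from Uber’s announcement):

```python
import pyro
import pyro.distributions as dist

def weather():
    # Latent variable: is it cloudy? Named sample sites are what
    # Pyro's inference algorithms reason about.
    cloudy = pyro.sample('cloudy', dist.Bernoulli(0.3))
    mean_temp = 55.0 if cloudy.item() == 1.0 else 75.0
    # Temperature given cloudiness, drawn from a Normal distribution.
    temp = pyro.sample('temp', dist.Normal(mean_temp, 10.0))
    return cloudy, temp

print(weather())  # each call draws a fresh sample from the joint model
```

Because models are ordinary Python functions built from named sample statements, the same function can later be paired with an inference guide, including one containing neural network components.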
Tableau in data science news

Tableau announces support for Amazon Redshift Spectrum in Tableau 10.4

Tableau has announced an update to its Amazon Redshift connector with support for Amazon Redshift Spectrum (external S3 tables). The feature shipped as part of Tableau 10.3.3 and will be broadly available in Tableau 10.4.1. In an official blog post, Tableau said its customers can now connect directly to data in Amazon Redshift and analyze it in conjunction with data in Amazon Simple Storage Service (S3).

IBM revamps its cloud strategy

IBM brings new cloud data tools, updates Unified Data Governance Platform

Presenting its latest vision for the cloud, IBM has announced a set of new data management products, adding the Data Catalog, Data Refinery, and Analytics Engine tools to its Watson Data Platform. With the European Union’s incoming GDPR (General Data Protection Regulation) in mind, IBM also updated its Unified Data Governance Platform so that businesses are better prepared to comply with the new regulation.

IBM’s goodbye to the Bluemix brand

In yet another rebranding, IBM Bluemix Cloud has been renamed simply “IBM Cloud.” The simplified naming, IBM said, is intended to put more focus on data and data science rather than on infrastructure. Last year, the company had retired the SoftLayer name in favor of Bluemix.

Newly launched data platforms

Periscope Data unveils new platform to bolster a “data-driven culture” for professional data teams

Targeting the large and growing market for big data and analytics software, Periscope Data has launched a new platform with which professional data teams can address the complete analytics life cycle. The Periscope Unified Data Platform enables teams to ingest, store, analyze, visualize, and report on data. Periscope said the Unified Data Platform extends the core product with built-in data warehousing capabilities based on Amazon Redshift, as well as new capabilities to ingest data from any source. The platform is designed to overcome the high costs of incomplete, obsolete, and potentially inaccurate data, which cause businesses huge monetary losses every year. It comprises core components for ingestion from virtually any source, storage in a data warehouse, analysis in seconds, instant visualizations from data charts, and reporting.

Caviar announces real estate-backed digital asset platform

Blockchain startup Caviar has launched a dual-purpose token and crowdfunding platform built on the Ethereum blockchain. Caviar’s token offers access to stable real estate and cryptocurrencies, with built-in downside protection and automatic diversification. In addition, the Caviar Platform will allow real estate developers to raise funds for their upcoming projects. The pre-sale, launching on November 28, aims to raise $25 million.

SIA: MCN collaborates with SAS to unveil single-source data platform

Multi Channel Network (MCN) is partnering with analytics software company SAS to deliver a new data management solution that integrates all of MCN’s different data sources, providing advertisers with a single consumer view across linear TV and all digital platforms. Known as SIA, the data tool combines MCN’s data sources from TV, online, mobile, location, and OOH, as well as data from agencies and advertisers, including data assets from Telstra and Near (MCN’s location data). It will act as a central nervous system for all of MCN’s data assets and business initiatives, including programmatic TV and addressable advertising, as well as new business models around data.

Cisco Spark Assistant: World's first AI voice assistant for meetings

Abhishek Jha
04 Nov 2017
3 min read
A few days back I wrote about how the cloud collaboration with Google could overturn Cisco’s dwindling fortunes. It seems the internet tech pioneer is now back at full throttle. It has a reason to remind the world what it did with the internet, after all. And no prizes for guessing that it intends to do the same with the next big tech sensation: artificial intelligence.

To start with, let’s be honest about daily corporate meetings: they’re boring. Meetings after meetings – every Monday, every other day, client meetings, internal meetings, vendor meetings, and possibly every other kind of stakeholder meeting – all ‘serious’ stuff, devoid of smiles.

Enter Cisco Spark Assistant. As you set up your office meetings, AI takes over with a simple “Hey, Spark.” Bots, in essence, have entered your meeting rooms.

“During the next few years, AI meeting bots will be joining our work teams. When they do, people will be able to ditch the drudgery of meeting setup and other logistics to become more creative than ever,” says Cisco SVP and GM Rowan Trollope. “The future of great meetings is Spark with AI, and our partners have an incredible opportunity to help customers take advantage of this game-changing technology.”

Cisco Spark Assistant is the latest in a series of innovations on the Cisco Spark platform. The announcement was made at Cisco Partner Summit, and the company said the world’s first enterprise-ready voice assistant for meetings will see a phased rollout. Early next year, it will be available first on the Cisco Spark Room Series portfolio, including the new flagship Cisco Spark Room 70.

In May, Cisco had entered a $125 million deal to buy MindMeld, and the new service leverages machine learning technology from that acquisition.

So how is it going to cut down the hassle? The Assistant lets you speak commands to Spark-registered devices in what amounts to a zero-touch meeting scenario: just tell the AI bot what you want it to do, from “Hey, Spark. Let’s get started.” and “Hey, Spark. Call Wilson’s meeting room.” to “Hey, Spark. End the meeting.” All without lifting a finger.

Apart from machine learning, speech recognition technology, and natural language understanding, Cisco said it has also applied its deep knowledge of meetings, honed over time: “Because we deliver 50 billion minutes of meetings every year. With this, we optimized the AI for the conference room.” And don’t forget that since the time you started your first job, you have probably seen a Cisco conference phone in every meeting room.

In the future, Cisco plans to further enhance the service based on feedback from early trials. The Assistant could become smarter, with added capabilities to assign action items and prepare minutes of the meeting. “Spark Assistant takes advantage of our meeting room endpoints’ industry-first advancements such as intelligent proximity, speaker tracking and real-time face recognition. These let it see and hear. As a result, Cisco Spark Assistant knows who enters the room, who leaves the room and who is speaking,” the company said in its official announcement.

The initial focus is clearly on simplifying everyday meetings, and voice commands promise to streamline things. Above all, they definitely add an ‘interactive’ incentive to drive away your Monday blues.