Join Roman Lavrik from Deloitte at Snyk-hosted DevSecCon 2024
Snyk is thrilled to announce DevSecCon 2024: Developing AI Trust (Oct 8-9), a FREE virtual summit designed for DevOps, developer, and security pros of all levels. Join Roman Lavrik from Deloitte, among many others, and learn prescriptive DevSecOps methods for AI-powered development.
Sponsored
Welcome to DataPro #112—Your Weekly Fix of Data Science & ML Magic! 🌟
In the fast-moving world of AI and ML, staying ahead means leveraging smart strategies for bold decisions. This week, we’re bringing you expert insights from our new Packt Signature Series. From real-time data mastery to AI modeling techniques, we’ve got everything you need to level up your data game!
Get ready to elevate your model accuracy, supercharge performance, and cut costs with the latest in scalable solutions. Dive into this week’s must-read articles, tips, and practical techniques.
📚 Must-Reads for Data Pros
✦ LLM-Powered Apps: Build smarter AI tools
✦ Python for Trading: Algorithmic insights
✦ Power BI Cookbook: Master data visualization
✦ The Prompt Engineering Playbook: Unlock AI secrets
✦ Mastering PyTorch: Deep learning unleashed
🔍 Algorithm Spotlight: Dive Deep into the Tech
✦ Automating Metrics with Amazon Prometheus: Simplify data tracking on EKS
✦ Graviton4 EC2 Instances: Memory-optimized power for your AI workloads
✦ OpenAI Safety Practices: An update on securing AI
✦ Mistral AI Release: Open-source models with unmatched flexibility
🚀 Trendspotting: The Future of AI
✦ Eureka AI Progress: Understand and evaluate AI advancements
✦ OpenAI o1 System Card: A glance into AI innovations
✦ Conversational Analytics Preview: What’s new in Looker?
✦ Comet’s Opik: Streamlining LLM evaluation and prompt tracking
🛠️ Tool Showdown: Which ML Platform Reigns Supreme?
✦ BigQuery’s Contribution Model: Fresh insights for your data
✦ Running Airflow on Google Cloud: Three easy approaches
✦ Python Tricks: Merge dictionaries like a pro
✦ Google AI’s DataGemma: A Set of Open Models that Utilize Data Commons
📊 Case Studies: ML Success Stories
✦ Handling Large Text with Longformer: A Hugging Face deep dive
✦ Confluent & Vertex AI: Integrating LLMs for big wins
✦ What Makes a Data Business Thrive? Lessons from the top
🌍 ML Buzz: Industry News & Discoveries
✦ Cracking PyTorch’s Mixed Precision Library: What you need to know
✦ MLflow, Azure, Docker: Managing models with ease
✦ Self-Learning Models: Teaching AI to improve autonomously
Get ready for a week of data-driven breakthroughs!
Take our weekly survey and get a free PDF copy of our best-selling book, "Interactive Data Visualization with Python - Second Edition." We appreciate your input and hope you enjoy the book!
We’re excited to present a new collection in our Signature Series, featuring the best-selling titles in the data industry. Packed with insights on Generative AI and multimodal systems, this collection is available for a limited time at 30% off both print and e-book formats. This offer ends Sunday, September 22nd. Don’t miss your chance to upskill and elevate your career. Let’s dive in!
➽ Building LLM Powered Applications: This new title is all about helping engineers and data pros use large language models (LLMs) effectively. It tackles key challenges like embedding LLMs into real-world apps and mastering prompt engineering techniques. You’ll learn to orchestrate LLMs with LangChain and explore various models, making it easier to create intelligent systems that can handle both structured and unstructured data. It’s a great way to boost your skills, whether you’re new to AI or already experienced! Start your free trial for access, renewing at $19.99/month.
➽ Python for Algorithmic Trading Cookbook: This book is your go-to guide for using Python in trading. It helps you tackle key issues like acquiring and visualizing market data, designing and backtesting trading strategies, and deploying them live with APIs. You’ll learn practical techniques to gather data, analyze it, and optimize your strategies using tools like OpenBB and VectorBT. Whether you’re just starting or looking to refine your skills, this book equips you with the know-how to trade smarter with Python! Start your free trial for access, renewing at $19.99/month.
➽ Microsoft Power BI Cookbook - Third Edition: The Power BI Cookbook is your essential guide to mastering data analysis and visualization with Power BI. It covers using Microsoft Data Fabric, managing Hybrid tables, and creating effective scorecards. Learn to transform complex data into clear visuals, implement robust models, and enhance reports with real-time data. This updated edition prepares you for future AI innovations, making it a must-have for beginners and seasoned users alike! Start your free trial for access, renewing at $19.99/month.
➽ The Definitive Guide to Power Query (M): The Definitive Guide to Power Query (M) focuses on mastering data transformation with Power Query. It covers fundamental and advanced concepts through hands-on examples that address real-world problems. You'll learn the Power Query M language, optimize performance, handle errors, and implement efficient data processes. By the end, you'll have the skills to enhance your data analysis effectively! Start your free trial for access, renewing at $19.99/month.
➽ Automating metrics collection on Amazon EKS with Amazon Managed Service for Prometheus managed scrapers: This blog discusses how Amazon Managed Service for Prometheus simplifies monitoring containerized applications in Amazon EKS by introducing a fully-managed, agentless scraper for Prometheus metrics, reducing operational overhead and enhancing efficiency through Terraform and AWS CloudFormation automation.
➽ Now available: Graviton4-powered memory-optimized Amazon EC2 X8g instances. This post introduces Graviton4-powered X8g instances, which offer high memory, enhanced performance, scalability, and security for applications like databases and electronic design automation. It emphasizes their efficiency, flexibility, and improved price-performance over previous instances.
➽ An update on OpenAI safety & security practices: This post introduces OpenAI's Safety and Security Committee, outlining five key recommendations to enhance governance, security, transparency, collaboration, and safety frameworks for AI model development and deployment, ensuring responsible and secure advancements in AI technology.
➽ Mistral AI Released Mistral-Small-Instruct-2409: A Game-Changing Open-Source Language Model Empowering Versatile AI Applications with Unmatched Efficiency and Accessibility. This article introduces Mistral AI's release of Mistral-Small-Instruct-2409, a powerful open-source large language model designed to enhance AI performance, promote accessibility, and support various natural language processing tasks with an emphasis on transparency, collaboration, and ethical AI development.
➽ Eureka: Evaluating and understanding progress in AI. This post introduces the EUREKA framework for evaluating AI models, emphasizing the need for in-depth measurement beyond standard benchmarks. It aims to uncover strengths, weaknesses, and real-world capabilities of state-of-the-art models through transparent and reproducible evaluations.
➽ OpenAI o1 System Card: This report outlines safety evaluations conducted before releasing OpenAI o1 models, addressing risks like bias, hallucinations, and disallowed content. It highlights mitigations, advanced reasoning capabilities, and overall safety ratings under OpenAI's Preparedness Framework.
➽ Conversational Analytics in Looker is now in preview: This post introduces Looker's Conversational Analytics, powered by AI and Looker’s semantic model, enabling users to ask data questions in natural language. It simplifies business intelligence, enhances accessibility, and promotes data-driven decision-making across organizations.
➽ Comet Launches Opik: A Comprehensive Open-Source Tool for End-to-End LLM Evaluation, Prompt Tracking, and Pre-Deployment Testing with Seamless Integration. This article introduces Opik, an open-source platform by Comet for enhancing observability and evaluation of large language models (LLMs). Opik helps developers and data scientists monitor, test, and track LLM applications, improving performance reliability and addressing issues like hallucinations.
➽ Introducing a new contribution analysis model in BigQuery: This post introduces contribution analysis in BigQuery ML, which helps organizations identify key data drivers behind trends and fluctuations, enabling faster, data-driven decisions by analyzing test and control datasets, and finding statistically significant contributors at scale.
➽ Three different ways to run Apache Airflow ETL on Google Cloud: This article explores three ways to run Apache Airflow on Google Cloud, comparing Compute Engine, managed solutions, and infrastructure setups. It highlights the pros and cons of each, providing Terraform code for implementation.
➽ 3 Simple Ways to Merge Python Dictionaries: This blog explains three common methods to merge dictionaries in Python: using the `update()` method, dictionary unpacking (`{**dict1, **dict2}`), and the union operator (`|`), providing code examples for each approach.
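For a quick feel of the three approaches, here is a minimal sketch. The dictionary names and values (`defaults`, `overrides`) are made up for illustration and are not taken from the linked blog. In all three cases, keys in the right-hand dictionary win on conflict, and the `|` operator requires Python 3.9 or later.

```python
# Hypothetical example data, not from the linked article.
defaults = {"host": "localhost", "port": 8080}
overrides = {"port": 9090, "debug": True}

# 1. update(): mutates the dictionary it is called on,
#    so copy first if the original must stay unchanged.
merged_update = defaults.copy()
merged_update.update(overrides)

# 2. Dictionary unpacking: builds a new dict; later entries win.
merged_unpack = {**defaults, **overrides}

# 3. Union operator (Python 3.9+): also builds a new dict.
merged_union = defaults | overrides

print(merged_update)
print(merged_unpack)
print(merged_union)
```

All three produce the same merged dictionary. The practical difference is that `update()` works in place, while unpacking and `|` return a new object and leave both inputs untouched.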
➽ Google AI Introduces DataGemma: A Set of Open Models that Utilize Data Commons through Retrieval Interleaved Generation (RIG) and Retrieval Augmented Generation (RAG). Google's DataGemma addresses hallucinations in large language models (LLMs) by grounding them in real-world statistical data through Google’s Data Commons. It introduces two advanced models, RAG-27B-IT and RIG-27B-IT, enhancing precision for tasks requiring deep analysis and real-time fact-checking.
➽ How to Handle Large Text Inputs with Longformer and Hugging Face Transformers? This post is a tutorial on using Longformer with Hugging Face Transformers for processing long text inputs in NLP tasks. It covers installing necessary packages, loading datasets, fine-tuning models, and evaluating results for tasks like review classification.
➽ Integrating Confluent and Vertex AI with LLMs: This blog explains how integrating large language models (LLMs) with Confluent and Vertex AI automates SQL query generation, streamlining real-time data analytics. It enhances data exploration, report generation, pipeline optimization, and anomaly detection, addressing challenges like complex queries and real-time decision-making.
➽ What Makes a Great Data Business? This post discusses how to identify and evaluate data businesses, highlighting their high margins and value potential. It covers key evaluation criteria: data sources, uses, nice-to-haves, and business models, providing a framework for private equity investors to spot valuable data businesses.
➽ The Mystery Behind the PyTorch Automatic Mixed Precision Library: This article explains how to accelerate deep learning model training using Nvidia's automatic mixed precision (AMP) technique. It introduces Nvidia's Tensor cores, reviews the "Mixed Precision Training" paper, and demonstrates a 2X training speed-up for ResNet50 on FashionMNIST with minimal code changes.
➽ Model Management with MLflow, Azure, and Docker: This article explains how to deploy MLflow, a tool for managing machine learning workflows, in a Docker container on Azure for scalability and collaboration. It covers MLflow's key components, focusing on MLflow Tracking, and provides a hands-on guide for setting up the system with Azure SQL Database and Blob Storage.
➽ Teaching Your Model to Learn from Itself: This article explains pseudo-labeling, a semi-supervised learning technique that uses confident predictions from a model to label unlabeled data. A case study on the MNIST dataset demonstrates how pseudo-labeling boosted accuracy from 90% to 95% by iteratively adding confident predictions to the training set.
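The pseudo-labeling loop described above can be sketched in a few lines. This is a toy illustration only, not the article's MNIST code: it assumes a two-class, one-dimensional nearest-centroid classifier, and the function names, the confidence formula, and the `threshold` value are all hypothetical choices made for this example.

```python
def fit_centroids(xs, ys):
    """Toy 'model': the mean of the points in each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(centroids, x):
    """Return (label, confidence). Assumes at least two classes.

    Confidence is an illustrative score in [0.5, 1.0]: it is 1.0 when x
    sits on the nearest centroid and 0.5 when x is equidistant from both.
    """
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    d_near = abs(x - centroids[ranked[0]])
    d_next = abs(x - centroids[ranked[1]])
    total = d_near + d_next
    confidence = 1.0 if total == 0 else d_next / total
    return ranked[0], confidence


def pseudo_label(labeled_x, labeled_y, unlabeled_x, threshold=0.75, max_rounds=5):
    """Iteratively move confidently predicted points into the training set."""
    train_x, train_y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(train_x, train_y)
        remaining = []
        for x in pool:
            label, confidence = predict_with_confidence(centroids, x)
            if confidence >= threshold:
                # Treat the confident prediction as if it were a true label.
                train_x.append(x)
                train_y.append(label)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):
            break  # nothing new was confident enough; stop early
        pool = remaining
    return fit_centroids(train_x, train_y), pool


centroids, still_unlabeled = pseudo_label(
    labeled_x=[0.0, 10.0],
    labeled_y=[0, 1],
    unlabeled_x=[1.0, 2.0, 9.0, 8.0, 5.0],
)
print(centroids, still_unlabeled)
```

The threshold is the main design choice: set it too low and wrong pseudo-labels are fed back into training, set it too high and few unlabeled points are ever used. Here the ambiguous point at 5.0 never clears the threshold and stays unlabeled.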