
PythonPro

27 Articles

PythonPro #56: Pandas Vectorized Operations, PyPI Deletion Rules, and ChatGPT vs. Gemini Accuracy Showdown

Divya Anne Selvaraj
19 Nov 2024
Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#56: Pandas Vectorized Operations, PyPI Deletion Rules, and ChatGPT vs. Gemini Accuracy Showdown

Hi,

Welcome to a brand new issue of PythonPro!

In today's Expert Insight we bring you an excerpt from the recently published book, Pandas Cookbook - Third Edition, which emphasizes the importance of using vectorized operations in pandas for better performance compared to Python loops.

News Highlights: Technion launches PyPIM for in-memory computing in Python; PEP 763 limits PyPI deletions to 72 hours post-upload; and the ColiVara API enables advanced document retrieval with visual embeddings.

My top 5 picks from today's learning resources:
- Any Python program fits in 24 characters* 🧩
- Flash Attention derived and coded from first principles with Triton (Python) ⚡
- Mastering Bivariate Maps with Plotly: A Step-by-Step Guide 🗺️
- 5 Overrated Python Libraries (And What You Should Use Instead) 🔄
- Attempts at immutability with dataclasses in Python 🔒

And, today's Featured Study evaluates the AI programming tools ChatGPT, Gemini, AlphaCode, and GitHub Copilot, highlighting ChatGPT's highest single-attempt accuracy (87.2% pass@1) and Gemini's strong multi-attempt performance.

Stay awesome!

Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month's survey is still live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.

Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️ News
- Researchers develop Python code for in-memory computing: Technion researchers have developed PyPIM, a tool that translates Python code into machine code for in-memory computing, enabling programmers to use Python without adaptation.
- PEP 763 – Limiting deletions on PyPI: PEP 763 proposes restricting the deletion of files, releases, and projects on PyPI to within 72 hours of upload, with exceptions for pre-release specifiers.
- ColiVara – State of the Art RAG API with Vision Models: ColiVara is a Python-based API and suite of tools for state-of-the-art document retrieval using visual embeddings, designed as a web-first implementation of the ColPali paper.

💼 Case Studies and Experiments 🔬
- Any Python program fits in 24 characters*: Demonstrates how to encode any Python program in 24 characters (excluding whitespace) by exploiting the flexibility of whitespace encoding and Unicode representations.
- Judge a Book by its Color: How to Visualize Decades of Book Cover Colors from Scratch — Scraping, Data, and Design: Explores six decades of bestseller book cover colors using web scraping, ColorThief, and other libraries.

📊 Analysis
- A pocket calculator using lambdatalk vs. Python: Compares building a browser-based pocket calculator using the lightweight functional programming language lambdatalk with the Python-to-JavaScript transpiler Brython.
- Building a macOS app with Python - BeeWare vs Kivy vs Flet vs Swift: Compares the Python GUI frameworks BeeWare, Kivy, and Flet with Swift for building a macOS voice cloning app using the F5 TTS MLX model.

🎓 Tutorials and Guides 🤓
- 📽️ Flash Attention derived and coded from first principles with Triton (Python): Provides a comprehensive tutorial on deriving and coding Flash Attention from scratch, covering mathematical foundations, CUDA, and Triton.
- Mastering Bivariate Maps with Plotly: A Step-by-Step Guide: Covers data generation, normalization, creating custom legends, and interactive map visualization, offering insights into crafting informative and visually appealing geospatial representations.
- 1969: Can You Land on The Moon? • A Python `turtle` Lunar Lander: Demonstrates how to create a lunar landing game using Python's turtle module, simulating realistic physics and controls for landing a lunar module.
- Generating realistic IoT data using Python & storing it in a MongoDB Time Series Collection, Part 1: Guides you through generating realistic IoT sensor data streams using Python and storing them in MongoDB Time Series Collections.
- Vector animations with Python: A notebook demonstrating how to create dynamic vector animations in Python using Gizeh for vector graphics and MoviePy for animation.
- Dependent Types in 200 Lines of Python: Demonstrates building a type checker for the Calculus of Constructions (CoC) in Python, illustrating dependent types, type polymorphism, and precise type guarantees.
- Data in the Small: Python package littletable combines in-memory NoSQL ORM with schemaless setup (and easy CSV import/export): Introduces littletable, a lightweight Python package for in-memory NoSQL ORM with a schema-less setup, offering SQL-like features.

🔑 Best Practices and Advice 🔏
- 5 Overrated Python Libraries (And What You Should Use Instead): Critiques Requests, BeautifulSoup, Pandas, Matplotlib, and Scikit-Learn as outdated or inefficient for modern needs, and suggests alternatives.
- Python Dictionary Comprehensions: How and When to Use Them: Covers creating dictionaries from iterables, transforming existing ones, and filtering key-value pairs with conditionals, while also advising on best practices.
- Using the Python zip() Function for Parallel Iteration: Covers key concepts such as zip()'s lazy evaluation, handling unequal-length iterables, and using zip() to build dictionaries, alongside techniques like unzipping sequences.
- Using the len() Function in Python: Delves into applying len() to built-in and third-party types, like NumPy arrays and pandas DataFrames, as well as extending its functionality to user-defined classes via the .__len__() method.
- Attempts at immutability with dataclasses in Python: Explores achieving immutability in Python through various methods, comparing old-style constants, new-style constants, dataclasses, enums, namedtuples, and metaprogramming.

🔍 Featured Study: Programming with AI 💥

In "Programming with AI: Evaluating ChatGPT, Gemini, AlphaCode, and GitHub Copilot for Programmers," Siam, Gu, and Cheng compare four leading AI-powered tools for programming. The researchers, from the New York Institute of Technology, aim to assess the tools' code-generation accuracy, capabilities, and implications for developers using rigorous benchmarks and evaluation metrics.

Context
LLMs like ChatGPT, Gemini, AlphaCode, and GitHub Copilot use transformer architectures to process natural language and generate programming code. Tools such as these are revolutionising software development by automating code creation and assisting with problem-solving tasks. The study's relevance lies in its comprehensive evaluation of their accuracy, efficiency, and potential to transform programming workflows. Metrics like pass@k (accuracy over k attempts) and test case pass rates (functional correctness) provide critical insight into the models' capabilities.
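A quick note on that headline metric: pass@k numbers are sample-based estimates. As a minimal illustration (this is the standard unbiased estimator introduced with the HumanEval benchmark by Chen et al., 2021, not code from the featured study), pass@k for one problem can be computed from n generated samples of which c pass the tests:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator for a single problem.

    n: total samples generated, c: samples passing all tests,
    k: attempt budget being scored.
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

print(pass_at_k(n=200, c=61, k=1))    # 0.305, i.e. c/n when k=1
print(pass_at_k(n=200, c=61, k=100))  # close to 1.0 with a large budget
```

Averaging this estimate over all benchmark problems gives the reported pass@k score.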
Key Findings
- ChatGPT: GPT-4-Turbo-0125 achieved the highest accuracy (87.2% pass@1) on HumanEval, outperforming other models in single-attempt code generation.
- Gemini: Gemini-1.5-Pro scored 74.9% on HumanEval, while Gemini-Ultra excelled in multiple-attempt scenarios with a 74.7% pass@100 on Natural2Code.
- AlphaCode: Designed for competitive programming, AlphaCode achieved pass rates of 54% (Python), 51% (Java), and 45% (C++) on Codeforces challenges.
- GitHub Copilot: On LeetCode, Copilot attained test case pass rates of 75.7% (Java) and 73.3% (C++), enhancing productivity by offering real-time code suggestions.
- Ethical Issues: Models exhibit biases in outputs, risk copyright infringement, and occasionally produce plausible but incorrect code. GitHub Copilot, in particular, has faced criticism over intellectual property concerns.

What This Means for You
The study is particularly valuable for programmers, software engineers, and organisations using AI tools to streamline coding tasks. It highlights which tools excel in accuracy and productivity, enabling developers to make informed decisions based on their specific needs, such as competitive programming (AlphaCode) or real-time coding assistance (GitHub Copilot). Ethical concerns warrant careful oversight when using these tools in professional environments.

Examining the Details
The study uses empirical methods, analysing performance across benchmarks like HumanEval, Codeforces, and Natural2Code. Metrics such as pass@1, pass@100, and test case pass rates were applied to ensure rigorous evaluation. By referencing 10 recent research papers, it validates the models' capabilities and relevance. However, the study also emphasises limitations, including computational costs and the need for human oversight due to occasional inaccuracies. Despite these challenges, the findings are robust, demonstrating how AI tools are reshaping the future of programming.

You can learn more by reading the entire paper.

🧠 Expert insight 💥

Here's an excerpt from "Chapter 10: General Usage and Performance Tips" in the Pandas Cookbook - Third Edition by William Ayd and Matthew Harrison, published in October 2024.

Use vectorized functions instead of loops

Python as a language is celebrated for its looping prowess. Whether you are working with a list or a dictionary, looping over an object in Python is a relatively easy task to perform, and can allow you to write really clean, concise code.

Even though pandas is a Python library, those same looping constructs are ironically an impediment to writing idiomatic, performant code.
In contrast to looping, pandas offers vectorized computations, i.e., computations that work with all of the elements contained within a pd.Series but which do not require you to explicitly loop.

How to do it

Let's start with a simple pd.Series constructed from a range:

```python
import pandas as pd  # assumed imported earlier in the chapter

ser = pd.Series(range(100_000), dtype=pd.Int64Dtype())
```

We could use the built-in pd.Series.sum method to easily calculate the summation:

```python
ser.sum()
```
```
4999950000
```

Looping over the pd.Series and accumulating your own result will yield the same number:

```python
result = 0
for x in ser:
    result += x
result
```
```
4999950000
```

Yet the two code samples are nothing alike. With pd.Series.sum, pandas performs the summation of elements in a lower-level language like C, avoiding any interaction with the Python runtime. In pandas speak, we would refer to this as a vectorized function. By contrast, the for loop is handled by the Python runtime, and as you may or may not be aware, Python is a much slower language than C.

To put some tangible numbers forth, we can run a simple timing benchmark using Python's timeit module. Let's start with pd.Series.sum:

```python
import timeit  # assumed imported earlier in the chapter

timeit.timeit(ser.sum, number=1000)
```
```
0.04479526499926578
```

Let's compare that to the Python loop:

```python
def loop_sum():
    result = 0
    for x in ser:
        result += x

timeit.timeit(loop_sum, number=1000)
```
```
5.392715779991704
```

That's a huge slowdown with the loop!

Generally, you should look to use the built-in vectorized functions of pandas for most of your analysis needs. For more complex applications, reach for the .agg, .transform, .map, and .apply methods, which were covered back in Chapter 5, Algorithms and How to Apply Them. You should be able to avoid using for loops in 99.99% of your analyses; if you find yourself using them more often, you should rethink your design, more than likely after a thorough re-read of Chapter 5, Algorithms and How to Apply Them.

The one exception to this rule, where it may make sense to use a for loop, is when dealing with a pd.GroupBy object, which can be efficiently iterated like a dictionary:

```python
df = pd.DataFrame({
    "column": ["a", "a", "b", "a", "b"],
    "value": [0, 1, 2, 4, 8],
})
df = df.convert_dtypes(dtype_backend="numpy_nullable")

for label, group in df.groupby("column"):
    print(f"The group for label {label} is:\n{group}\n")
```
```
The group for label a is:
  column  value
0      a      0
1      a      1
3      a      4

The group for label b is:
  column  value
2      b      2
4      b      8
```

Pandas Cookbook - Third Edition was published in October 2024.

Get the eBook for $39.99 $27.98
Get the Print Book for $49.99

And that's a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!


PythonPro #55: Python Project Packaging Guide, AWS Credentials Theft Alert, and PyTorch 2 Speeds Up ML

Divya Anne Selvaraj
12 Nov 2024
Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#55: Python Project Packaging Guide, AWS Credentials Theft Alert, and PyTorch 2 Speeds Up ML

Hi,

Welcome to a brand new issue of PythonPro!

In today's Expert Insight we bring you an excerpt from the recently published book, LLM Engineer's Handbook, which discusses comprehensive RAG evaluation through the Ragas and ARES frameworks.

News Highlights: The malicious Python package "Fabrice" on PyPI has been stealing AWS credentials by mimicking Fabric; and PyTorch 2 boosts ML speeds with dynamic bytecode transformation, achieving 2.27x inference and 1.41x training speedups on NVIDIA A100 GPUs.

My top 5 picks from today's learning resources:
- A Comprehensive Guide to Python Project Management and Packaging 📦
- 20 Python scripts to automate common daily tasks 🔄
- Principal Component Analysis with Python (A Deep Dive) - Part 1 📉
- State of Python 3.13 Performance: Free-Threading 🚀
- How I Got Started Making Maps with Python and SQL 🗺️

And, today's Featured Study introduces Magentic-One, a generalist multi-agent AI system developed by Microsoft Research, designed to coordinate specialised agents in tackling complex, multi-step tasks across diverse applications.

Stay awesome!

Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month's survey is now live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.

Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️ News
- Malicious Python package collects AWS credentials via 37,000 downloads: The malicious Python package "Fabrice" has been typosquatting on PyPI since 2021, gathering AWS credentials by imitating the popular Fabric SSH library.
- 🎥 PyTorch 2 Speeds Up ML with Dynamic Python Bytecode Transformation and Graph Compilation: Fully backward-compatible, the version achieves a 2.27x inference speedup and 1.41x training speedup on NVIDIA A100 GPUs, surpassing six other compilers across 180+ models.

💼 Case Studies and Experiments 🔬
- LangGraph for Healthcare: A Comprehensive Technical Guide: Delves into the technical aspects of integrating LangGraph into healthcare workflows, emphasizing its potential to improve patient care, streamline administrative tasks, and facilitate medical research.
- Timing-sensitive analysis in Python: Explores timing sensitivity in Python through a hands-on approach, testing timing stability under different conditions (like input size and intentional delay).

📊 Analysis
- State of Python 3.13 Performance: Free-Threading: Examines how free-threading affects Python's ability to handle parallel tasks, particularly through a practical example with a PageRank algorithm implementation.
- Modular, Faster DateOnly Operations in Python: Delves into the reasoning behind memory-efficient, modular date handling and performance optimization, examining different approaches and technologies like C structures.

🎓 Tutorials and Guides 🤓
- How to Reset a pandas DataFrame Index: Covers methods like .reset_index(), .index, and .set_axis() while exploring index alignment, duplicate removal, multi-index handling, and using columns as indexes.
- A Comprehensive Guide to Python Project Management and Packaging: Covers structuring projects, managing dependencies, and creating distributable packages, along with practical examples.
- Effortless Image Cropping with Python: Automate Your Workflow in Minutes: Shows you how to automate image cropping in Python using OpenCV to identify the focal area of an image and crop it to a 16:9 aspect ratio.
- Adding keyboard shortcuts to the Python REPL: Explains how to add custom keyboard shortcuts to the Python 3.13 REPL using a PYTHONSTARTUP file and the unsupported _pyrepl module.
- Simplifying News Scraping with Python's Newspaper4k Library: Demonstrates how to use Python's Newspaper4k library to automate the scraping and parsing of news articles, extracting key components.
- Principal Component Analysis with Python (A Deep Dive) - Part 1: Provides a step-by-step guide for reducing data dimensionality through mathematical and coding examples.
- Using the OpenAI Realtime API in Python: Covers key challenges such as managing the 24kHz, 16-bit audio format, avoiding jerky audio by handling recording and playback concurrently, and preventing echo by using a headset.

🔑 Best Practices and Advice 🔏
- How to Fix the Most Common Python Coding Errors: Covers IndentationError, SyntaxError, TypeError (NoneType not subscriptable), IndexError (list index out of range), and KeyError (missing dictionary key).
- Do Constructors Return Values in OOP?: Clarifies that while constructors don't explicitly return values, they implicitly return the newly created instance of the class, fulfilling their primary purpose of object initialization.
- 20 Python scripts to automate common daily tasks: Lists scripts for tasks such as sending emails, scraping news, downloading stock prices, backing up files, posting to social media, fetching weather updates, and resizing images.
- What time is it? A simple approach to AI agents: Explains how AI agents solve real-time queries by selecting and executing pre-defined functions, using tasks like fetching the current time and weather as examples.
- How I Got Started Making Maps with Python and SQL: Recounts the author's journey into spatial data visualization using tools like DuckDB, H3, and GeoPandas to create interactive maps, from building density to 3D dashboards.

🔍 Featured Study: Magentic-One 💥

In "Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks," Fourney et al. from AI Frontiers - Microsoft Research aim to develop a versatile, multi-agent AI system capable of autonomously completing complex tasks. The study presents Magentic-One as a generalist solution that orchestrates specialised agents to tackle tasks that require planning, adaptability, and error recovery.

Context
To address the need for AI systems capable of handling a wide range of tasks, Magentic-One leverages a multi-agent architecture. In this setup, agents are AI-driven components, each with a distinct skill, such as web browsing or code execution, all working under the direction of an Orchestrator agent. The Orchestrator not only delegates tasks but monitors and revises strategies to keep progress on track, ensuring effective task completion. This system responds to the growing demand for agentic systems in AI: those able to handle tasks involving multiple steps, real-time problem-solving, and error correction.

The importance of such systems has increased as AI technology advances in areas like software development, data analysis, and web-based research, where single-agent models often struggle with multi-step, unpredictable tasks.
By developing Magentic-One as a generalist system, the researchers offer a foundation that balances adaptability and reliability across diverse applications, helping establish future standards for agentic AI systems.

Key Features of Magentic-One
- Multi-Agent Architecture: Magentic-One uses multiple agents, each specialising in a specific task, coordinated by an Orchestrator agent.
- Orchestrator-Led Dynamic Planning: The Orchestrator creates and adapts task plans, tracks progress, and initiates corrective actions as needed, improving resilience (a toy sketch of this pattern follows the list).
- Specialised Agents:
  - Coder: Writes, analyses, and revises code.
  - ComputerTerminal: Executes code and manages shell commands.
  - WebSurfer: Browses the web and interacts with web pages.
  - FileSurfer: Reads and navigates files of various types.
- Performance on Benchmarks: Magentic-One achieved high performance on challenging benchmarks like GAIA (38% completion rate) and AssistantBench (27.7% accuracy), positioning it competitively among state-of-the-art systems.
- AutoGenBench Tool for Evaluation: AutoGenBench offers a controlled testing environment, allowing for repeatable, consistent evaluation of agentic systems like Magentic-One.
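To make the orchestration pattern concrete, here is a minimal, illustrative sketch of an Orchestrator delegating plan steps to skill-specific agents. The agent names follow the paper, but the code is a toy stand-in, not Magentic-One's implementation (which drives real LLM, browser, and shell backends and revises its plan when progress stalls):

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """One specialised worker, e.g. WebSurfer or Coder."""
    name: str

    def run(self, step: str) -> str:
        # Toy stand-in: a real agent would call an LLM, a browser, or a shell.
        return f"{self.name} completed: {step}"

@dataclass
class Orchestrator:
    """Delegates plan steps to agents and tracks progress in a ledger."""
    agents: dict[str, Agent]
    ledger: list[str] = field(default_factory=list)

    def solve(self, plan: list[tuple[str, str]], max_retries: int = 2) -> list[str]:
        for skill, step in plan:
            agent = self.agents[skill]
            for _attempt in range(max_retries + 1):
                result = agent.run(step)
                self.ledger.append(result)
                if "error" not in result.lower():  # naive progress check
                    break
                # A real Orchestrator would revise the plan here rather than
                # blindly retrying the same step.
        return self.ledger

orchestrator = Orchestrator(agents={
    "web": Agent("WebSurfer"),
    "code": Agent("Coder"),
    "shell": Agent("ComputerTerminal"),
    "files": Agent("FileSurfer"),
})
for entry in orchestrator.solve([
    ("web", "locate the dataset download URL"),
    ("code", "write a script to fetch and parse it"),
    ("shell", "run the script"),
]):
    print(entry)
```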
What This Means for You
The study's findings are particularly relevant for developers, researchers, and AI practitioners focused on real-world applications of AI for complex, multi-step tasks. For instance, fields such as autonomous software engineering, data management, and digital research can leverage Magentic-One's multi-agent system to automate complex workflows. Its modular, open-source design enables further adaptation, making it useful for those interested in customising AI tools to meet specific requirements or studying multi-agent coordination for diverse scenarios.

Examining the Details
The researchers applied a rigorous methodology to assess Magentic-One's reliability and practical value. Key benchmarks included GAIA, AssistantBench, and WebArena, each with unique tasks requiring multi-step reasoning, data handling, and planning. To verify the system's efficacy, Magentic-One's performance was compared against established state-of-the-art systems. The study reports a 38% task completion rate on GAIA, positioning Magentic-One competitively among leading systems without modifying core agent capabilities.

To analyse the system's interactions and address limitations, the team examined errors in detail, identifying recurring issues such as repetitive actions and insufficient data validation. By tracking these errors and using AutoGenBench, an evaluation tool ensuring isolated test conditions, the researchers provided a clear, replicable performance baseline. Their approach underscores the importance of modularity in AI design, as Magentic-One's agents operated effectively without interfering with each other, demonstrating both reliability and extensibility.

You can learn more by reading the entire paper or access the system here.

🧠 Expert insight 💥

Here's an excerpt from "Chapter 7: Evaluating LLMs" in the book, LLM Engineer's Handbook by Paul Iusztin and Maxime Labonne, published in October 2024.

RAG evaluation

While traditional LLM evaluation focuses on the model's inherent capabilities, RAG evaluation requires a more comprehensive approach that considers both the model's generative abilities and its interaction with external information sources.

RAG systems combine the strengths of LLMs with information retrieval mechanisms, allowing them to generate responses that are not only coherent and contextually appropriate but also grounded in up-to-date, externally sourced information. This makes RAG particularly valuable in fields where current and accurate information is crucial, such as news reporting, research, and customer support.

The evaluation of RAG systems goes beyond assessing a standalone LLM. It requires examining the entire system's performance, including:
- Retrieval accuracy: How well does the system fetch relevant information?
- Integration quality: How effectively is the retrieved information incorporated into the generated response?
- Factuality and relevance: Does the final output address the query appropriately while seamlessly blending retrieved and generated content?

Key metrics for RAG evaluation include retrieval precision and recall, which measure the accuracy and comprehensiveness of the retrieved information. Additionally, the quality of integration between retrieved data and generated text is crucial, as is the overall factuality and coherence of the output.

To illustrate how these metrics are applied in practice, consider a RAG system designed for a customer support chatbot in an e-commerce setting. In this scenario, the user asks "What's your return policy for laptops purchased during the holiday sale?" The RAG pipeline finds relevant documents on the electronics return policy and documents on holiday sale terms. This additional context is appended at the end of the question, and the model uses it to respond:

For laptops purchased during our holiday sale, you have an extended return period of 60 days from the date of purchase. This is longer than our standard 30-day return policy for electronics. Please ensure the laptop is in its original packaging with all accessories to be eligible for a full refund.

Table 7.3: Example of output from a RAG pipeline designed for customer support

In this pipeline, we can evaluate if the retrieved documents correspond to what was expected (retrieval accuracy). We can also measure the difference between responses with and without additional context (integration quality). Finally, we can assess whether the output is relevant and grounded in the information provided by the documents (factuality and relevance).
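As a back-of-the-envelope illustration of the retrieval side (a minimal sketch, not from the book; the document names are made up), retrieval precision and recall for the chatbot example can be computed directly:

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: the chatbot retrieved three policy documents,
# two of which annotators marked as relevant to the query.
p, r = retrieval_metrics(
    retrieved=["electronics_returns", "holiday_sale_terms", "shipping_faq"],
    relevant={"electronics_returns", "holiday_sale_terms"},
)
print(f"precision={p:.2f}, recall={r:.2f}")  # precision=0.67, recall=1.00
```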
In this section, we will cover two methods to evaluate how well RAG models incorporate external information into their responses.

Ragas

Retrieval-Augmented Generation Assessment (Ragas) is an open-source toolkit designed to provide developers with a comprehensive set of tools for RAG evaluation and optimization. It's designed around the idea of metrics-driven development (MDD), a product development approach that relies on data to make well-informed decisions, involving the ongoing monitoring of essential metrics over time to gain valuable insights into an application's performance. By embracing this methodology, Ragas enables developers to objectively assess their RAG systems, identify areas for improvement, and track the impact of changes over time.

One of the key capabilities of Ragas is its ability to synthetically generate diverse and complex test datasets. This feature addresses a significant pain point in RAG development, as manually creating hundreds of questions, answers, and contexts is both time-consuming and labor-intensive. Instead, it uses an evolutionary approach paradigm inspired by works like Evol-Instruct to craft questions with varying characteristics such as reasoning complexity, conditional elements, and multi-context requirements. This approach ensures a comprehensive evaluation of different components within the RAG pipeline.

Additionally, Ragas can generate conversational samples that simulate chat-based question-and-follow-up interactions, allowing developers to evaluate their systems in more realistic scenarios.

Figure 7.1: Overview of the Ragas evaluation framework

As illustrated in Figure 7.1, Ragas provides a suite of LLM-assisted evaluation metrics designed to objectively measure different aspects of RAG system performance. These metrics include:
- Faithfulness: This metric measures the factual consistency of the generated answer against the given context. It works by breaking down the answer into individual claims and verifying if each claim can be inferred from the provided context. The faithfulness score is calculated as the ratio of verifiable claims to the total number of claims in the answer.
- Answer relevancy: This metric evaluates how pertinent the generated answer is to the given prompt. It uses an innovative approach where an LLM is prompted to generate multiple questions based on the answer and then calculates the mean cosine similarity between these generated questions and the original question. This method helps identify answers that may be factually correct but off-topic or incomplete.
- Context precision: This metric evaluates whether all the ground-truth relevant items present in the contexts are ranked appropriately. It considers the position of relevant information within the retrieved context, rewarding systems that place the most pertinent information at the top.
- Context recall: This metric measures the extent to which the retrieved context aligns with the annotated answer (ground truth). It analyzes each claim in the ground truth answer to determine whether it can be attributed to the retrieved context, providing insights into the completeness of the retrieved information.

Finally, Ragas also provides building blocks for monitoring RAG quality in production environments. This facilitates continuous improvement of RAG systems. By leveraging the evaluation results from test datasets and insights gathered from production monitoring, developers can iteratively enhance their applications. This might involve fine-tuning retrieval algorithms, adjusting prompt engineering strategies, or optimizing the balance between retrieved context and LLM generation.
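Outside the excerpt, here is a minimal sketch of what scoring these four metrics with Ragas can look like. It assumes the Ragas 0.1.x interface (the evaluate function plus metric objects) and an LLM API key configured for the LLM-assisted metrics; the example rows are invented and column names have shifted between Ragas releases, so treat this as a shape sketch and check the current documentation:

```python
# Shape sketch only; verify against the installed Ragas version's docs.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

eval_data = Dataset.from_dict({
    "question": ["What's your return policy for laptops from the holiday sale?"],
    "answer": ["Holiday-sale laptops have an extended 60-day return period."],
    "contexts": [[
        "Holiday sale purchases carry an extended 60-day return window.",
        "Standard electronics returns are accepted within 30 days.",
    ]],
    "ground_truth": ["Laptops bought in the holiday sale can be returned within 60 days."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # e.g. {'faithfulness': 1.0, 'answer_relevancy': 0.97, ...}
```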
Ragas can be complemented with another approach, based on custom classifiers.

ARES

ARES (an automated evaluation framework for RAG systems) is a comprehensive tool designed to evaluate RAG systems. It offers an automated process that combines synthetic data generation with fine-tuned classifiers to assess various aspects of RAG performance, including context relevance, answer faithfulness, and answer relevance.

The ARES framework operates in three main stages: synthetic data generation, classifier training, and RAG evaluation. Each stage is configurable, allowing users to tailor the evaluation process to their specific needs and datasets.

In the synthetic data generation stage, ARES creates datasets that closely mimic real-world scenarios for robust RAG testing. Users can configure this process by specifying document file paths, few-shot prompt files, and output locations for the synthetic queries. The framework supports various pre-trained language models for this task, with the default being google/flan-t5-xxl. Users can control the number of documents sampled and other parameters to balance between comprehensive coverage and computational efficiency.

Figure 7.2: Overview of the ARES evaluation framework

The classifier training stage involves creating high-precision classifiers to determine the relevance and faithfulness of RAG outputs. Users can specify the classification dataset (typically generated from the previous stage), the test set for evaluation, label columns, and model choice. ARES uses microsoft/deberta-v3-large as the default model but supports other Hugging Face models. Training parameters such as the number of epochs, patience value for early stopping, and learning rate can be fine-tuned to optimize classifier performance.

The final stage, RAG evaluation, leverages the trained classifiers and synthetic data to assess the RAG model's performance. Users provide evaluation datasets, few-shot examples for guiding the evaluation, classifier checkpoints, and gold label paths. ARES supports various evaluation metrics and can generate confidence intervals for its assessments.

ARES offers flexible model execution options, supporting both cloud-based and local runs through vLLM integration. The framework also supports various artifact types (code snippets, documents, HTML, images, and so on), enabling comprehensive evaluation across different RAG system outputs.

In summary, Ragas and ARES complement each other through their distinct approaches to evaluation and dataset generation. Ragas's strength in production monitoring and LLM-assisted metrics can be combined with ARES's highly configurable evaluation process and classifier-based assessments. While Ragas may offer more nuanced evaluations based on LLM capabilities, ARES provides consistent and potentially faster evaluations once its classifiers are trained. Combining them offers a comprehensive evaluation framework, benefiting from quick iterations with Ragas and in-depth, customized evaluations with ARES at key stages.

LLM Engineer's Handbook was published in October 2024.

Get the eBook for $47.99
Get the Print Book for $59.99

And that's a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!


PythonPro #48: Python 3.13 JIT, Boosting Model Inference, and FastAPI Best Practices

Divya Anne Selvaraj
24 Sep 2024
Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#48: Python 3.13 JIT, Boosting Model Inference, and FastAPI Best Practices

3 Days. 25+ AI Experts. 30+ Sessions.

Join the Generative AI In Action conference from Nov 11-13 (LIVE | Virtual) and gain insights from top AI leaders across over 30 sessions. Explore key topics including GenAI tools, AI Agents, Open-Source LLMs, Small Language Models, LLM fine-tuning, and many more! This is your opportunity to dive deep into cutting-edge AI strategies and technologies. Save 40% with our Early Bird offer using code BIGSAVE40 – don't miss out!

Secure Your Seat Today!

Hi,

Welcome to a brand new issue of PythonPro!

In today's Expert Insight we bring you an excerpt from the recently published book, Machine Learning and Generative AI for Marketing, which discusses how to create effective prompts for Zero-Shot Learning to generate high-quality marketing content.

News Highlights: Opik, a new open-source LLM evaluation tool, integrates with CI/CD, and Model2Vec, a newly launched library, boosts CPU inference 500x and cuts model size by 15x.

Here are my top 5 picks from our learning resources today:
- Frankenstein's Ice cream shop 🍦
- Python 3.13 Preview: Free Threading and a JIT Compiler ⚙️
- Graph RAG into Production — Step-by-Step 🧩
- FastAPI Best Practices and Design Patterns - Building Quality Python APIs 🛠️
- From Spreadsheets to SDMX Effortless with Python and .Stat Suite 📊

And, today's Featured Study examines the performance of open-source models like Mistral and LLaMa and provides insights into the hardware needed for efficient deployment, using GPUs and optimisation techniques such as quantification.

Stay awesome!

Divya Anne Selvaraj
Editor-in-Chief

P.S.: With this issue, we have finished covering all content requests made through the September feedback survey. Stay tuned for next month's survey.

Sign Up | Advertise
What changed in the way you code for 2024? What has happened in the tech world in the last months? Take this shorter version of the Developer Nation survey, learn about new tools, influence the future of development, and share your insights with the world!

What's in it for you?
- A virtual goody bag packed with cool resources
- The more questions you answer, the more chances you have to win amazing prizes, including a Samsung Galaxy Watch 7!

Take the Survey now!

🐍 Python in the Tech 💻 Jungle 🌳

🗞️ News
- Opik, an open source LLM evaluation framework: The platform can be used for developing, evaluating, and monitoring LLM applications and offers features such as LLM call tracing, annotation, automated evaluation, and integration into CI/CD pipelines.
- Model2Vec: Distill a Small Fast Model from any Sentence Transformer: Model2Vec is a Python library that distills sentence transformers into small static embeddings, making inference 500x faster on CPU and reducing model size by 15x.

💼 Case Studies and Experiments 🔬
- Integrated Python and GIS Approach for Geomorphometric Investigation of Man River Basin, Western Madhya Pradesh, India: Analyzes the tectonic influence on the Man River Basin's development using satellite imagery, GIS software, and Python to compute and study geomorphometric indices.
- Frankenstein's Ice cream shop: Details how to automate the cleaning of messy Excel sheets using Python's Pandas library, focusing on a made-up ice cream sales commission dataset.

📊 Analysis
- The Python Package Index Should Get Rid Of Its Training Wheels: Discusses the challenges of PyPI's exponentially growing storage needs, particularly due to prebuilt binaries, and suggests leveraging modern build tools.
- UV — I am (somewhat) sold: Initially skeptical, the author of this article found UV useful for handling multiple Python versions, dependency management, and simplifying their development setup.

🎓 Tutorials and Guides 🤓
- Python 3.13 Preview: Free Threading and a JIT Compiler: Demonstrates the key new features in Python 3.13, including free threading, which makes the GIL optional, and a JIT compiler that compiles Python code into machine code.
- Graph RAG into Production — Step-by-Step: Discusses how to implement Graph Retrieval-Augmented Generation (Graph RAG) in production using a fully serverless, parallelized approach without using a graph database.
- Python Virtual Environments: A Primer: Covers how to create, activate, and manage virtual environments, explaining their importance for isolating dependencies, avoiding conflicts, and ensuring reproducibility.
- Python for Network Programming — A Beginner's Overview: Explains key concepts such as sockets, TCP, and UDP protocols, and walks you through practical examples of building TCP and UDP client-server applications.
- Mastering ChatGPT's Function Call API - The Smart Way and the… Not-So-Smart Way (in Python): Explains how to use ChatGPT's function call API for automating tasks in Python.
- Git With Python HowTo: GitPython Tutorial And PyGit2 Tutorial: Covers installation, exception handling, and common tasks like cloning, committing, branching, tagging, and pushing changes.
- 🎥 Program a RAG LLM Chat App with LangChain + Streamlit + o1, GPT-4o and Claude 3.5 API: Covers loading custom documents, integrating website content into LLM queries, and creating a web app that enables users to interact with GPT-4 and Claude models.

🔑 Best Practices and Advice 🔏
- Counting Sheep with Contracts in Python: Discusses using code contracts to enhance software development by ensuring preconditions and postconditions are met, making the code safer and easier to maintain.
- FastAPI Best Practices and Design Patterns - Building Quality Python APIs: Discusses applying SOLID principles and design patterns like DAO and Service Layer to build clean, maintainable, and scalable APIs using FastAPI.
- Recently I read a few articles and have a few questions: Covers managing dependencies without tools like Poetry, and handling Python version installations, particularly when a preferred version lacks an official installer.
- Unlocking the Magic of Docstrings: Introduces the power of Python docstrings for documenting code, enhancing readability, and providing functionality like automatic documentation generation and testing.
- From Spreadsheets to SDMX Effortless with Python and .Stat Suite: Highlights the importance of SDMX adoption for efficient data sharing among institutions and presents a step-by-step case study using World Bank data.

🔍 Featured Study: Deploying Open-Source Large Language Models Efficiently 💥

The study "Deploying Open-Source Large Language Models: A Performance Analysis," conducted by Bendi-Ouis et al., compares the performance of open-source large language models. The study aims to assist organisations in evaluating the hardware requirements for efficiently deploying models like Mistral and LLaMa.

Context
Since the release of ChatGPT in November 2022, there has been growing interest in deploying large language models. Many organisations and institutions are keen to harness LLMs, but the computational demands remain a challenge. While proprietary models require substantial resources, open-source models like Mistral and LLaMa provide alternatives that may be deployed with less hardware. This study explores how different hardware configurations and optimisation techniques, such as quantification, can make these models more accessible for public and private entities.

Key Findings
- The study used two types of GPUs, NVIDIA V100 16GB and NVIDIA A100 40GB, with tests conducted on models like Mistral-7B, Codestral-22B, Mixtral-8x7B, Mixtral-8x22B, and LLaMa-3-70B.
- Mistral-7B generated 119 tokens in 1.9 seconds with one request, but in 72.1 seconds with 128 requests, on two V100 16GB GPUs.
- Codestral-22B produced 63 tokens in 2.3 seconds with one request but took 96.2 seconds with 128 requests on an A100 40GB GPU.
- Larger models like Mixtral-8x22B and LLaMa-3-70B faced slower generation times as context size and simultaneous requests increased.
- Quantifying models to 4 or 6 bits helped reduce the memory load while maintaining performance, with negligible loss in accuracy for models with up to 70 billion parameters.

What This Means for You
For organisations and developers seeking to deploy LLMs, this analysis provides valuable insights into the hardware requirements and optimisation techniques necessary for efficient deployment. With moderate hardware investments, open-source models can perform competitively, reducing dependency on proprietary systems and enabling better control over digital resources. This ensures digital sovereignty and cost-effective deployment of advanced AI technologies.

Examining the Details
The researchers focused on GPU performance and model quantification to measure how efficiently LLMs could be deployed. Using vLLM, a Python library designed for inference optimisation, the study tested multiple models and configurations.
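Since the study serves its models with vLLM, here is a minimal, hedged sketch of such a setup. The checkpoint name, parallelism degree, and sampling values are illustrative assumptions, not the study's exact configuration, and running it requires compatible GPUs:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # assumed checkpoint
    tensor_parallel_size=2,                      # e.g. split across two V100s
)
params = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM batches simultaneous requests, which is what the study stresses
# when it compares latency at 1 vs. 128 concurrent requests.
prompts = ["Explain quantization in one paragraph."] * 4
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```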
For instance, Mistral-7B, when run on two V100 16GB GPUs, showed an increase in response time with higher numbers of simultaneous requests, highlighting the challenge of scaling for larger user bases.

Quantification emerged as a key method to reduce computational load, allowing models to use less memory by lowering precision from 16 or 32 bits to 4 or 8 bits. This method was effective for larger models, maintaining performance without significant loss in accuracy.

The study concluded that, although proprietary solutions like ChatGPT require significant resources, open-weight models like Mistral and LLaMa can deliver strong performance with commercially available GPUs. By deploying these models with vLLM and quantification techniques, organisations can achieve scalable, efficient AI deployment without excessive hardware costs.

You can learn more by reading the entire paper here.

🧠 Expert insight 💥

Here's an excerpt from "Chapter 9: Creating Compelling Content with Zero-Shot Learning" in the book, Machine Learning and Generative AI for Marketing by Yoon Hyup Hwang and Nicholas C. Burtch, published in August 2024.

Creating an effective prompt

Creating an effective prompt is the most crucial step in leveraging ZSL for marketing copy. In ZSL, the prompt effectively becomes the instruction manual for a model, telling it what kind of content to generate, as well as its style, tone, and substance.

The following are some guidelines around how to formulate prompts that will elicit the best possible marketing copy content from the model:
- Clarity: Ensure that your prompt is specific about what you want, whether it's a product description, headline, or call to action.
- Contextual: Provide sufficient background to guide a model. For eco-friendly products, mention key selling points like sustainability or biodegradability.
- Creative: While clarity is crucial, leaving room for creativity can yield surprising and innovative results. Phrases like "Imagine..." or "Create a story where..." can be particularly powerful.
- Concise: Lengthy prompts can dilute the focus. Aim for brevity while including essential details, ensuring that a model stays on topic.

In the following sections, we will illustrate the impact of prompt quality through examples, with different types of marketing copy. While good prompts elicit detailed, relevant, and engaging content, poor prompts can lead to vague and uninspiring outputs. To generate these responses, we will define the following function:

```python
# `client` is an OpenAI API client created earlier in the chapter,
# e.g. client = OpenAI().
def generate_response(prompt, model="gpt-4", max_tokens=100, temperature=0.7, n=1):
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        n=n,
        temperature=temperature,
    )
    return response.choices[0].message.content
```

This function will be used with different prompt types in the examples that follow.

Example 1: Product descriptions

In this example, we will generate product descriptions for our e-commerce brand, which is launching new lines of eco-friendly kitchenware.

The following is an example of a poor prompt:

```python
poor_product_description = "Talk about bamboo cutlery."
generate_response(poor_product_description)
```

This produces:

Bamboo cutlery is a type of eating utensil made from bamboo, a fast-growing renewable resource. This form of cutlery includes items such as forks, knives, spoons, and chopsticks.
They are an eco-friendly alternative to plastic disposable cutlery due to their reusable nature and biodegradable properties. Bamboo cutlery is lightweight and durable, making it a popular choice for camping, picnics, and travel.

Now, let's look at the following example of a good prompt:

```python
good_product_description = (
    "Write a captivating description for a bamboo cutlery set designed for "
    "eco-conscious consumers, emphasizing its sustainability and style."
)
generate_response(good_product_description)
```

This prompt produces the following output:

Discover the perfect harmony of sophistication and sustainability with our Bamboo Cutlery Set. Designed for the eco-conscious consumer, this set is not only a statement of your commitment to the environment but also a reflection of your exquisite taste. Each piece is crafted from 100% organic bamboo, a renewable resource that grows significantly faster than traditional hardwoods. This set boasts a sleek, minimalist design that perfectly complements any table setting, from casual picnic to elegant dinner parties.

From a marketing perspective, this example demonstrates the significance of creating detailed and audience-specific prompts with clear requirements when using ZSL for product descriptions, as well as how this leads to more specificity in the generated response. However, it is worth noting that older consumers may value more straightforward, factual information and, therefore, may favor the more generic prompt's response from an engagement standpoint. Tailoring GenAI outputs at the level of the individual consumer can be crucial as well and is a topic discussed in Chapter 11.

Packt library subscribers can continue reading the entire book for free. You can buy Machine Learning and Generative AI for Marketing here.

Get the eBook for $39.99 $27.98!

Other Python titles from Packt at 30% off:
- Get the eBook for $35.99 $24.99!
- Get the eBook for $25.99 $17.99!
- Get the eBook for $35.99 $24.99!

And that's a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
Subscribe to Packt PythonPro
PythonPro is a weekly newsletter that brings you the latest developments in the Python landscape along with handpicked tutorials, guides, and analyses from experts.


PythonPro #57: NASA Image Pipeline in Airflow, PyPI Crypto Scare, and Numba vs. Cython

Divya Anne Selvaraj
26 Nov 2024
Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#57: NASA Image Pipeline in Airflow, PyPI Crypto Scare, and Numba vs. Cython

Hi,

Welcome to a brand new issue of PythonPro!

In today's Expert Insight we bring you an excerpt from the recently published book, Apache Airflow Best Practices, which explains how to build and test a pipeline in Jupyter Notebook to extract daily images from NASA's APOD API, store them locally, and prepare the workflow for automation using Apache Airflow.

News Highlights: PyPI's aiocpa updated with code to steal private keys via Telegram; AWS Lambda SnapStart now supports Python 3.12+ and .NET 8+ for faster startups; Eel simplifies Python/JS HTML GUI apps with async support; and Marimo raises $5M for an open-source reactive Python notebook.

My top 5 picks from today's learning resources:
- Numba vs. Cython: A Technical Comparison ⚡
- NumPy Practical Examples: Useful Techniques 🔢
- Optimize Python with Nuitka 🛠️
- Speed Up Your Python Program With Concurrency 🏎️
- Efficient String Concatenation in Python 🧵

And, today's Featured Study introduces CODECLEANER, an open-source toolkit that employs automated code refactoring to mitigate data contamination in Code Language Models, significantly enhancing evaluation reliability across Python and Java through systematic and scalable techniques.

Stay awesome!

Divya Anne Selvaraj
Editor-in-Chief

P.S.: Thank you to all who participated in this month's survey. With this issue, we have fulfilled all content requests made this month.

Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️ News
- Python Crypto Library Updated to Steal Private Keys: The PyPI package aiocpa was updated with malicious code to exfiltrate private keys via Telegram, targeting crypto library users.
- AWS Lambda now supports SnapStart for Python and .NET functions: This opt-in feature is ideal for latency-sensitive applications and is available for Python 3.12+ and .NET 8+ across several AWS regions.
- Eel: For little HTML GUI applications, with easy Python/JS interop: The library simplifies development for utility scripts, supports asynchronous functions, and enables distribution through PyInstaller.
- Marimo raises $5M to build an open-source reactive Python notebook: The Python notebook will be optimized for reproducibility, Git compatibility, script execution, and web app deployment.

💼 Case Studies and Experiments 🔬
- Running Code from Strangers: Recounts the evolution of Livedocs' runtime architecture, from browser-based Pyodide to Kubernetes, for scalability, security, and performance in running isolated, interactive Python-based documents.
- Mach 1 with Python!: Details a fun DIY project using a Raspberry Pi, ultrasonic sensors, and Python to create a Mario pipe that plays a sound whenever someone passes through it.

📊 Analysis
- Numba vs. Cython: A Technical Comparison: Compares the two tools for optimizing Python performance, detailing their features, use cases, and benchmarking results to guide their practical application.
- Is async Django ready for prime time?: Details the setup required for fully async operations, such as using an ASGI server, async views, and an async ORM, while showcasing tools like django-ninja and aiohttp.

🎓 Tutorials and Guides 🤓
- How to Iterate Through a Dictionary in Python: Explores various methods for iterating through Python dictionaries, including using the .items(), .keys(), and .values() methods for accessing keys, values, or key-value pairs.
- NumPy Practical Examples: Useful Techniques: Demonstrates advanced NumPy techniques, including creating multidimensional arrays from file data, handling duplicates, and reconciling datasets with structured arrays.
- Build generative AI applications on Amazon Bedrock with the AWS SDK for Python (Boto3): Focuses on accessing and integrating foundation models into applications without managing infrastructure.
- Python range(): Represent Numerical Ranges: Covers Python's range() function, explaining its use for generating numerical sequences for loops and defining intervals with start, stop, and step parameters.
- A Complete Guide to Credit Risk Analysis With Python and Datalore AI: Utilizing Python and Datalore's AI-assisted coding capabilities, this guide demonstrates how to retrieve and process financial data, create visualizations, and perform statistical analyses to uncover patterns and insights.
- Optimize Python with Nuitka: Introduces Nuitka, a Python-to-C++ compiler, detailing its features like performance optimization, compatibility, cross-platform support, and the creation of standalone executables.
- NotebookLM-style podcasts with API in < 50 lines of Python code: A notebook-based tutorial leveraging the OpenAI API to generate a concise summary of the input text, which is then converted into audio with a text-to-speech (TTS) service.
- Language Translation with Python: Introduces LibreTranslate, a lightweight, Python-based API server for language detection and translation, particularly useful for analyzing data in non-English languages.

🔑 Best Practices and Advice 🔏
- The Code That Almost Led to Disaster • A Starbase Story: Narrates a fictional scenario where a Python import error nearly causes a catastrophic failure on a starbase, illustrating the critical importance of understanding Python's import system to prevent such issues.
- Speed Up Your Python Program With Concurrency: Explains threading, asynchronous tasks, and multiprocessing, and demonstrates how to optimize I/O-bound and CPU-bound tasks for improved performance.
- Efficient String Concatenation in Python: Covers methods including the + and += operators, the .join() method for lists, and tools like StringIO for handling large datasets, with best practices for performance and flexibility.
- Interacting With Python: Explores ways to interact with Python, including using the interactive REPL mode, running scripts from files via the command line, working in IDEs, and leveraging tools like Jupyter Notebooks.
- Samuel Colvin on Logfire, mixing Python with Rust & being pydantic about types: This podcast episode features Samuel Colvin, who introduces Logfire, a new observability platform for Python, while exploring the increasing integration of Rust into Python's ecosystem.

🔍 Featured Study: CODECLEANER 💥
In "CODECLEANER: Elevating Standards with a Robust Data Contamination Mitigation Toolkit," Cao et al. address the pervasive issue of data contamination in Code Language Models (CLMs). The study introduces CODECLEANER, an automated code refactoring toolkit designed to mitigate contamination, enabling more reliable performance evaluations for CLMs.

Context
Data contamination occurs when CLMs, trained on vast code repositories, inadvertently include test data, leading to inflated performance metrics. This undermines the credibility of CLMs in real-world applications, posing risks for software companies. Refactoring, a method of restructuring code without altering its functionality, offers a potential solution. However, the lack of automated tools and validated methods has hindered its adoption. CODECLEANER fills this gap by systematically evaluating refactoring operators for Python and Java code, ensuring they reduce contamination without semantic alterations.

Key Features of CODECLEANER
- Automated Code Refactoring: CODECLEANER provides a fully automated solution for restructuring code, eliminating the need for manual intervention while preserving original code functionality.
- Comprehensive Refactoring Operators: It includes 11 refactoring operators categorised into three distinct types, catering to different aspects of code restructuring.
- Syntactic Refactoring: Operators such as if-condition flipping, loop transformations, and iteration changes alter code structure without affecting its semantics, offering lightweight syntactic adjustments.
- Semantic Refactoring: Advanced operators like identifier renaming and performance measurement decorators disrupt patterns that models memorise, significantly reducing overlap with training data (see the sketch after this list).
- Code Style Modifications: Adjustments such as naming style switches (e.g., camel case to snake case) and code normalisation (e.g., consistent formatting) ensure stylistic uniformity while mitigating contamination.
- Cross-Language Functionality: While primarily designed for Python, CODECLEANER demonstrates adaptability by implementing selected operators in Java, addressing data contamination in a second language.
- Scalable Application: The toolkit works on both small-scale (method-level) and large-scale (class-level) codebases, proving its utility across various levels of complexity.
- Open Source and Accessible: CODECLEANER is available online, enabling widespread adoption and further research into mitigating data contamination in CLM evaluations.
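To make the operator idea concrete, here is a toy version of one semantic operator, identifier renaming, built on Python's ast module. It is illustrative only, not CODECLEANER's implementation, which must also handle scoping, attributes, imports, and name collisions:

```python
import ast

class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a mapping."""

    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.Name:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

source = """
def total(xs):
    acc = 0
    for x in xs:
        acc += x
    return acc
"""
tree = ast.parse(source)
tree = RenameIdentifiers({"xs": "values", "x": "item", "acc": "running_sum"}).visit(tree)
print(ast.unparse(tree))  # same behaviour, different surface form
```

The refactored output computes the same result, but its token sequence no longer matches whatever the model may have memorised from training data, which is exactly the overlap the study measures.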
What This Means for You
This study is particularly valuable for software developers and engineering teams seeking to integrate CLMs into production, researchers aiming to benchmark CLMs accurately, and organisations evaluating AI-based code tools. By addressing data contamination, CODECLEANER enhances the credibility and reliability of CLM-based solutions for real-world applications.

Examining the Details
The researchers evaluated CODECLEANER by applying 11 refactoring operators to Python and Java code at method, class, and cross-class levels. Effectiveness was measured using metrics like n-gram overlap and perplexity across over 7,000 code snippets sampled from The Stack dataset. Four CLMs, including StarCoder and CodeLlama, were used to assess changes in contamination severity.
Results showed that semantic operators, such as identifier renaming, reduced overlap by up to 39.3%, while applying all operators decreased overlap in Python code by 65%. On larger class-level Python codebases, contamination was reduced by 37%. Application to Java showed more modest improvements, with the most effective operator achieving a 17% reduction.
You can learn more by reading the entire paper and accessing the toolkit here.

🧠 Expert insight💥
Here's an excerpt from "Chapter 4: Basics of Airflow and DAG Authoring" in Apache Airflow Best Practices by Dylan Intorf, Dylan Storey, and Kendrick van Doorn, published in October 2024.

Extracting images from the NASA API
This pipeline is designed to extract an image every day, store this information in a folder, and notify you of the completion. This entire process will be orchestrated by Apache Airflow and will take advantage of the scheduler to automate re-running. As stated earlier, it is helpful to spend time practicing this in Jupyter Notebook or another tool to ensure the API calls and connections are operating as expected and to troubleshoot any issues.

The NASA API
For this data pipeline, we will be extracting data from NASA. My favorite API is the Astronomy Picture of the Day (APOD), where a new photo is selected and displayed. You can easily change the API to another of interest, but for this example, I recommend you stick with the APOD and explore others once completed.
A NASA API key is required to start this next step:
Create a NASA API key (https://api.nasa.gov/).
Input your name, email, and planned functional use of the API.
Navigate to your email to locate the API key information.
Figure 4.3: NASA API Key input screenshot

Building an API request in Jupyter Notebook
With the environment configured and the API set up, we can begin authoring a DAG to automate this process. As a reminder, most Python code can be pre-tested in a system outside of Airflow, such as Jupyter Notebook or locally. If you are running into problems, it is recommended to spend time analyzing what the code is doing and work to debug.
In Jupyter Notebook, we are going to use the following code block to represent the function of calling the API, accessing the location of the image, and then storing the image locally. We will keep this example as simple as possible and walk through each step:

import requests
import json
from datetime import date
from NASA_Keys import api_key

url = f'https://api.nasa.gov/planetary/apod?api_key={api_key}'
response = requests.get(url).json()
response

today_image = response['hdurl']
r = requests.get(today_image)
with open(f'todays_image_{date.today()}.png', 'wb') as f:
    f.write(requests.get(today_image).content)

The preceding code snippet is normally how we recommend starting any pipeline, ensuring that the API is functional, the API key works, and the current network requirements are in place to perform the procedures. It is best to ensure that the network connections are available and that no troubleshooting alongside the information security or networking teams is required.
Here is how the code looks in our Jupyter Notebook environment:
We begin by importing the required libraries to support our code. These libraries include the following:
requests: A common Python library for making HTTP requests. It is an easy-to-use library that makes working with HTTP requests simple and allows for easy use of GET and POST methods.
json: This library allows you to parse JSON from strings or files into a dictionary or list.
datetime: This library provides the current date and time parameters. We will use this later on to title the image file.
NASA_Keys: This is a local file on our machine holding the api_key parameter.
This is used in this example to keep things as simple as possible and also to mask the variable.
Figure 4.4: What your current Jupyter cell should look like
After importing the appropriate libraries and variables, we construct a variable called url to house the HTTP request call, including our api_key variable. This allows the api_key variable to be included in the URL while hidden by a mask. It calls api_key from the NASA_Keys file:

url = f'https://api.nasa.gov/planetary/apod?api_key={api_key}'

Next, we use the requests library to perform an HTTP GET method call on the URL that we have created. This calls on the API to send information for our program to interpret. Finally, we convert this information from the GET call into JSON format. For our own understanding and analysis of the information being sent back, we print out the response to get a view of how the dictionary is structured. In this dictionary, it seems that there is only one level with multiple key-value pairs including copyright, date, explanation, hdurl, media_type, service_version, title, and url:
Figure 4.5: Response from the NASA API call
In the next step, we will utilize the hdurl key to access the URL associated with the high-definition astronomy image of the day. Since I am an enthusiast and want the highest quality image available, I have decided that the highest definition available meets my user needs. This is a great example of a time to determine whether your users desire or need the highest quality available or whether there is an opportunity to deliver a product that meets their needs at a lower cost or lower memory requirement.
We store response['hdurl'] within the today_image variable for use in the next step, storing the image. Storing hdurl in a variable allows for manipulation of the string later on:
Figure 4.6: Saving the hdurl response in a variable
In the last step, we make use of hdurl and append date.today() to create a new name for the image each day. This is so that an image from yesterday does not have the same name as an image from today, thus reducing the risk of overwrites. There are additional ways to reduce the risk of overwrites, especially when creating an automated system, but this was chosen as the simplest option for our needs:
Figure 4.7: Writing the image content to a local file
Finally, we can look in the local repository or folder and find that the image was stored there:
Figure 4.8: The image file we saved in the local repository or folder
This walk-through in Jupyter Notebook may seem ...excessive..., but taking the time to ensure the API is working and thinking through the logic of the common steps that need to be automated or repeated can be extremely beneficial when stepping into creating the Airflow DAG.
Apache Airflow Best Practices was published in October 2024.
Get the eBook for $35.99 $24.99
Get the Print Book for $44.99
And that's a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!


PythonPro #68: Python 3.14 Changes, Google’s Agent Development Kit, Genkit for AI Apps, and Template Strings (PEP 750)

Divya Anne Selvaraj
15 Apr 2025
Bite-sized actionable content, practical tutorials, and resources for Python programmers.
#68: Python 3.14 Changes, Google's Agent Development Kit, Genkit for AI Apps, and Template Strings (PEP 750)
Hi, Welcome to a brand new issue of PythonPro!
News Highlights: Python 3.14 set to bring key changes like PEP 765 and deferred annotations; Google's ADK enables AI agent development in Python with Cloud integration; Genkit adds Python support for building structured, observable AI apps; and PEP 750 proposes Template Strings for safer, more flexible string processing.
My top 5 picks from today's learning resources:
Python at the Speed of Rust⚡
From Unstructured Text to Interactive Knowledge Graphs Using LLMs🧠
Building Transformers from Scratch🤖
Graceful API Failure 101 for Data Scientists🛠️
Python Best Practices Every Coder Should Know✅
And, in From the Cutting Edge, we introduce DataRec, a Python library that standardises dataset handling in recommender systems research, enabling reproducible, transparent, and framework-agnostic data preprocessing, filtering, and splitting.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
Sign Up|Advertise
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
🎥Python 3.14 | Upcoming Changes: Previews major changes in Python 3.14 due in a month, including PEP 765 (disallowing return, break, and continue in finally), deferred type annotation evaluation (PEP 649), and more.
Google's Agent Development Kit (ADK): Google has released an open-source Python toolkit for building, evaluating, and deploying AI agents with fine-grained control, offering code-first orchestration, multi-agent design, and seamless integration with Google Cloud services.
Announcing Genkit for Python and Go: Genkit for Python (Alpha) is an open-source framework for building AI applications with structured output, tool integration, and observability, supporting models from Google, OpenAI, and more.
PEP 750 – Template Strings: PEP 750, which proposes template strings (t-strings), a generalisation of f-strings that lets code process a string's static and interpolated parts before they are combined, has been accepted.
RxInferServer – Remote Bayesian Inference from Python via Julia: This newly released Julia package automates efficient inference in complex probabilistic models, offering performance and extensibility for AI applications with thousands of latent variables.
💼Case Studies and Experiments🔬
"Verified" "Compilation" of "Python" with Knuckledragger, GCC, and Ghidra: Presents a workflow for translating Python functions into C, compiling them, and formally verifying the resulting assembly.
Elliptical Python Programming: A humorous essay exploring Python's flexibility and quirks through intentionally obscure syntax, using comparison operations and ellipses to represent integers and executable code.
📊Analysis
Python Performance: Why 'if not list' is 2x Faster Than Using len(): Dissects CPython's bytecode execution, memory layout, and instruction specialisation to explain the performance gap (a quick illustration follows this section).
Python at the Speed of Rust: Introduces Function, a compiler that converts Python functions into native code using symbolic tracing and type annotations, achieving near-Rust performance.
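The idiom itself is simple: an empty list is falsy, so the truthiness check skips a function call entirely (a generic example, not code from the article):

items = []

if not items:            # truthiness test: no function call involved
    print("empty, via truthiness")

if len(items) == 0:      # calls len() on every check
    print("empty, via len()")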
🎓Tutorials and Guides🤓
Building Transformers from Scratch: Presents a comprehensive, code-driven walkthrough of implementing a GPT-2 style Transformer model entirely from scratch using NumPy, covering tokenization, embeddings, and more.
Open Access Course | Computational Fluid Dynamics (CFD) with high-performance Python programming: A 20-step online course covering core PDEs, array operations with NumPy, and advanced methods like JAX, implicit solvers, and the Lattice Boltzmann Method.
DNS Server in Python: Details the implementation of a custom local DNS server in Python, featuring caching, blocklist support, and upstream resolution.
Dropping Values (#2 in The `itertools` Series • `dropwhile()` and `takewhile()`): Explains how to use these functions to efficiently filter elements from the beginning of an iterable based on a condition, offering more concise and performant alternatives to traditional for loops.
From Unstructured Text to Interactive Knowledge Graphs Using LLMs: Describes how to use LLMs to extract structured subject–predicate–object triples from unstructured text, standardise and infer relationships, and render the results as interactive knowledge graphs in a browser.
The Magic of Manacher's Algorithm: Explains how Manacher's Algorithm efficiently finds the longest palindromic substring in O(n) time by transforming the input and leveraging symmetry to minimise redundant computations.
How Much YouTube Is Actually Ads? A Data-Driven Look at Sponsorships: Walks you through using open data and SQL/Python tooling to quantify YouTube sponsor trends, identify high-ad-density channels, and apply time-based algorithms like sweep line to visualise ad placement.
🔑Best Practices and Advice🔏
Speed up exploratory data analysis with Buckaroo: Introduces an open-source Jupyter extension that streamlines exploratory data analysis by displaying histograms, null counts, and summary statistics for DataFrames in a single interactive view.
Graceful API Failure 101 for Data Scientists: Shows how data scientists can use Python decorators to handle API failures in long-running pipelines more cleanly, using retry logic and skip strategies for timeouts and oversized inputs in Gemini API calls (a minimal sketch of the pattern follows this list).
Visualizing Recursion Trees: Details the author's iterative process of developing an interactive visualization tool for recursive functions, the challenges faced with various technologies, and the insights gained into effective collaboration with LLMs.
Essential Tips for Python Developers - Crafting Standout Documentation: Explains how to create documentation that improves usability, supports diverse user needs, increases adoption, and reduces support burdens.
Python Best Practices Every Coder Should Know: Outlines best practices such as using ExitStack() for managing multiple contexts, following consistent naming conventions, avoiding hardcoded secrets, safely accessing dictionary keys with .get(), and using match for cleaner conditionals.
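Here is a minimal sketch of the decorator-based retry pattern the "Graceful API Failure" piece describes; this is an illustrative example with hypothetical names (call_gemini), not the article's code:

import functools
import time

def retry(times=3, delay=1.0, exceptions=(TimeoutError,)):
    """Retry the wrapped call, re-raising only after the final attempt fails."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == times - 1:
                        raise
                    time.sleep(delay)  # back off before the next attempt
        return wrapper
    return decorator

@retry(times=3, delay=2.0)
def call_gemini(prompt):
    ...  # hypothetical long-running API call that may time out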
🔍From the Cutting Edge: DataRec—A Python Library for Standardized and Reproducible Data Management in Recommender Systems💥
In "DataRec: A Python Library for Standardized and Reproducible Data Management in Recommender Systems," Mancino et al. from Politecnico di Bari and Université Paris-Saclay introduce a Python library designed to standardise and simplify data handling in recommender system research. This work was accepted at SIGIR 2025.

Context
Recommender systems are central to modern digital platforms, influencing decisions in e-commerce, media, and social networks. Despite substantial progress in algorithms and evaluation, the reproducibility of experiments remains a challenge, particularly due to inconsistencies in data preprocessing, filtering, and splitting. Existing frameworks each handle these processes differently, leading to fragmented methodologies and results that cannot easily be compared. DataRec addresses this gap by providing a unified, reproducible approach to data management.

Key Features of DataRec
Standardised Data Handling: Provides reproducible routines for dataset preparation, filtering, and splitting, based on practices observed in 55 recent recommendation studies.
Built-in Dataset Access: Direct access to 18 widely used datasets with explicit versioning and referencing to ensure traceability.
Flexible Input/Output: Supports tabular, inline, and JSON formats, and can export datasets in formats compatible with popular frameworks like RecBole, Cornac, Elliot, and more.
Processing Tools: Includes tools such as binarisation, k-core filtering (user, item, iterative), and rating-based filtering.
Splitting Strategies: Implements multiple splitting methods (random, temporal, leave-one-out, and pre-computed), supporting user-stratified evaluation.
Reproducibility Support: Tracks all operations, allows random seed setting, and generates YAML config files with checksums for full reproducibility.

What This Means for You
DataRec is particularly valuable for researchers, developers, and students working on recommender systems. If you have struggled to reproduce results across different studies or frameworks, or need to ensure traceable dataset handling in your experiments, DataRec provides a consistent foundation. It also simplifies integration with existing pipelines, whether you are using general-purpose frameworks or domain-specific toolkits.

Examining the Details
The library's development was driven by an extensive survey of 55 papers published between 2020 and 2024, covering areas such as graph neural networks, contrastive learning, and reinforcement learning. This meta-analysis identified inconsistencies in how datasets are referenced, filtered, and split, issues that DataRec explicitly seeks to correct.
Dataset referencing, for example, was found to be unreliable: only 35% of papers referenced original sources; others linked to modified versions or broken links. DataRec counters this with built-in dataset access and public checksums. It supports transformation of raw data using filtering methods that mirror common practice, and provides traceable exports to major frameworks.
In contrast to monolithic recommendation frameworks, which are often non-interoperable, DataRec is modular and library-focused. This enables it to act as a shared layer for dataset handling, without duplicating model training or evaluation logic. The architecture is centred on a primary DataRec class backed by modules for I/O, processing, and splitting. Version control, detailed logging, and exportable configurations ensure that results can be reliably reproduced across different environments and research groups.
You can learn more by reading the entire paper or accessing the library on GitHub.
And that's a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!


PythonPro #67: PEP 751 Lock Files, Discord RAT, Prompt Toolkit 3.0, and Async Python at Duolingo

Divya Anne Selvaraj
01 Apr 2025
Bite-sized actionable content, practical tutorials, and resources for Python programmers.
#67: PEP 751 Lock Files, Discord RAT, Prompt Toolkit 3.0, and Async Python at Duolingo
Hi, Welcome to a brand new issue of PythonPro!
News Highlights: Python adopts PEP 751 for standardized lock files; new Discord-based Python RAT steals credentials; Prompt Toolkit 3.0 adds rich CLI features; and OpenAI Agents SDK gains MCP support for external tool access.
My top 5 picks from today's learning resources:
We hacked Google's A.I Gemini and leaked its source code (at least some part)🕵️‍♂️
Share Python Scripts Like a Pro: uv and PEP 723 for Easy Deployment📦
How to use Hinge Loss and Squared Hinge Loss with Keras📉
How we started our async python migration🚀
Python's 'shelve' is really useful for LLM debugging🗄️
And, in From the Cutting Edge, we introduce Freyja, a lightweight Python library for scalable data discovery in data lakes, enabling efficient join discovery and data augmentation by profiling attributes and predicting joinability without heavy infrastructure or deep learning models.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
Sign Up|Advertise
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
New Python lock file format will specify dependencies: Python has officially adopted PEP 751, introducing a universal, human-readable lock file format to standardize dependency specification for reproducible installs.
New Python-Based Discord RAT Attacking Users to Steal Login Credentials: The RAT uses Discord as its command-and-control channel to steal browser credentials, execute system commands, and capture screenshots.
Python Prompt Toolkit 3.0: The latest version of this Python library for building advanced interactive command-line applications includes features like syntax highlighting, autocompletion, multiline editing, and full-screen UI support.
OpenAI adds Model Context Protocol (MCP) support to Agents SDK: MCP is a standard for connecting LLMs to external tools and data sources using local or remote servers.
Big improvements to checkpoint performance in latest LangGraph Python: The latest release of LangGraph for Python (langgraph 0.3.21) achieves up to 1.7x faster checkpoint performance based on benchmark graphs.
💼Case Studies and Experiments🔬
We hacked Google's A.I Gemini and leaked its source code (at least some part): Demonstrates advanced LLM red-teaming techniques, sandbox inspection, and secure code exploitation, useful for developers working with AI sandboxes, custom interpreters, or secure system integrations.
Smuggling Python Code Using Shrugging Faces: Demonstrates how Python code can be covertly embedded within a single emoji using zero-width joiner sequences, effectively smuggling a working REPL inside what appears to be a shrugging face.
📊Analysis
Shadowing in Python gave me an UnboundLocalError: A personal account of encountering and resolving a common Python error with context, reflection, and a small illustrative example.
Democratizing AI Compute, Part 7: What about Triton and Python eDSLs?: Examines Python-based embedded domain-specific languages (eDSLs) like Triton as a means to combine Python's ease with GPU-level control for AI workloads.
🎓Tutorials and Guides🤓
📖Open Access Book | Architecture Patterns with Python: Introduces architectural patterns for building testable, maintainable Python applications using TDD, DDD, and event-driven principles.
Share Python Scripts Like a Pro: uv and PEP 723 for Easy Deployment: Shows you how to create and share self-contained Python scripts with embedded dependencies for hassle-free deployment across systems (a minimal PEP 723 header is sketched below).
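For reference, a PEP 723 script declares its own dependencies in a comment block that tools like uv read before running it. A minimal sketch (the requests dependency is chosen arbitrarily):

# /// script
# requires-python = ">=3.12"
# dependencies = ["requests"]
# ///
import requests

print(requests.get("https://example.org").status_code)

Running it with `uv run script.py` should create an isolated environment with requests installed before executing the file.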
How to build Hot Module Replacement (HMR) in Python: Shows you how to build HMR using a dependency map to reload only affected modules instead of restarting the entire process.
Writing a 6502 emulator in Python: Explains how CPUs work by building an emulator, deepening your understanding of processor architecture, memory access, and instruction flow.
The Dark Side of Python's pickle – How to Backdoor an AI Model: Walks you through how pickle works, how its vulnerabilities can be exploited, and how to mitigate them, all with detailed technical examples.
Specializing Python with E-graphs: Shows you how to build a Python expression compiler using e-graphs and MLIR. Demonstrates symbolic rewriting with Egglog, compiles optimized NumPy-style code to MLIR, lowers it to LLVM, and executes it via JIT.
How to use Hinge Loss and Squared Hinge Loss with Keras: Walks you through dataset generation, model architecture, training setup, and performance visualization on a non-linear dataset using TensorFlow 2.
🔑Best Practices and Advice🔏
How we started our async python migration: Demonstrates how Duolingo migrated a Python microservice from synchronous to asynchronous execution to improve performance and reduce infrastructure costs, achieving a 40% increase in request handling per instance.
Using overload to handle tagged union return types: Shows you how to improve static type checking in Python when functions return tagged unions based on input types, using typing.overload and Literal.
Operationalizing Python – Part 1: Why Does It Hurt?: Explains why Python codebases degrade in large organizations and how aligned tooling, not strict rules, can guide teams toward maintainability.
Claude 3.7 meta-reflects on Clojure vs Python: Explains how AI coding assistants perform better in structured, functional environments and how using AI in architectural discussions, TDD, and documentation can improve open-source project quality.
Python's 'shelve' is really useful for LLM debugging: Shows how Python's built-in shelve module can act as a persistent key-value store and help avoid redundant API calls by caching model responses as pickled Python objects for fast, low-cost reuse (see the sketch after this list).
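A minimal sketch of the shelve-as-cache idea (illustrative only; call_model stands in for whatever client you use):

import shelve

def cached_completion(prompt):
    # shelve persists pickled values to disk, so repeated prompts cost nothing
    with shelve.open("llm_cache") as db:
        if prompt not in db:
            db[prompt] = call_model(prompt)  # hypothetical model call
        return db[prompt]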
🔍From the Cutting Edge: Supporting Data Discovery Tasks at Scale with Freyja💥
In "Supporting Data Discovery Tasks at Scale with Freyja," Marc Maynou and Sergi Nadal introduce Freyja, a scalable data discovery system developed to support join discovery and data augmentation tasks within large and heterogeneous data lakes, released as a lightweight Python library.

Context
Data lakes are vast, schema-flexible repositories where different stakeholders contribute datasets of varying structure and semantics. In contrast to traditional data warehouses, which follow a model-first integration approach, data lakes adopt a load-first strategy, making data easier to ingest but harder to discover and relate.
Data discovery refers to the automatic identification of relevant datasets that can be combined for analysis. A core sub-task is join discovery, which aims to find attributes from different datasets that can be meaningfully joined. This is often used for data augmentation, where new features are added to training datasets to improve machine learning models.
Conventional approaches to join discovery are either too simplistic (e.g., relying on exact value overlaps) or too resource-intensive (e.g., involving deep learning or semantic embeddings). Freyja is introduced as a middle ground: semantically aware, but efficient and easy to deploy.

Key Features of Freyja
Attribute profiling: Generates compact representations of columns based on 62 profile features (e.g., entropy, value frequency, string lengths), using analytical databases like DuckDB for efficiency.
Joinability ranking: Predicts how well a column can be joined with others using a pretrained model that analyses profile similarities.
Data augmentation: Enables automated joining of datasets based on the predicted rankings to enrich data used in downstream models.
Scalability and portability: Profiles are small in size and independent, allowing the entire pipeline to run in-memory with linear scalability.
Ease of use: Integrated into notebooks with minimal setup, it is accessible to both novice and experienced data scientists.

What This Means for You
Freyja is highly relevant for data scientists and machine learning practitioners who work in organisations with extensive data lakes but limited infrastructure. It simplifies complex data discovery tasks without sacrificing accuracy or scalability. For public sector analysts, researchers, or commercial data teams looking to enrich datasets without extensive engineering overhead, Freyja provides a practical and portable solution.

Examining the Details
Freyja replaces computationally expensive set-overlap checks with a predictive model trained on a large corpus of attribute pairs with known joinability values. This model assesses the distance between attribute profiles (vectors of normalised feature values) and produces a continuous joinability score. Because profiles are standardised and comparisons are made via Z-score normalisation, the approach is robust to the heterogeneity typical of data lakes.
The system's architecture ensures that profiles are computed only once per attribute and reused across analyses. Freyja's profiling avoids numeric columns due to their low join potential and instead targets categorical and textual attributes.
In demonstration, Freyja significantly improved model accuracy through data augmentation. For example, augmenting a rental price prediction dataset with just one additional attribute reduced the root mean squared error by nearly half, from 76.44 to 39.19.
You can learn more by reading the entire paper or accessing the library on GitHub.
And that's a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!


PythonPro #63: Python Mobile Wheels for iOS & Android, LLM 0.22 Updates, and Top Python Visualization Tools 📊

Divya Anne Selvaraj
18 Feb 2025
Bite-sized actionable content, practical tutorials, and resources for Python programmers.
#63: Python Mobile Wheels for iOS & Android, LLM 0.22 Updates, and Top Python Visualization Tools 📊
Hi, Welcome to a brand new issue of PythonPro!
Here are today's News Highlights: PyPI now supports iOS and Android wheels for Python 3.13; LLM 0.22 adds API key support, a chatgpt-4o-latest alias, and logging improvements; llm-mlx enables running Apple MLX-optimized LLMs like Llama 3.2 3B on macOS.
My top 5 picks from today's learning resources:
Top Python Data Visualization Libraries of 2025📊
How to Integrate Ollama and Deepseek-R1 with FastAPI for AI-Powered Chatbots🤖
Transforming Non-Stationary Time Series with Differencing🔄
The "late-binding closure" pitfall in Python⚠️
8 Open-Source AutoML Frameworks: How to Choose the Right One🤯
And, in From the Cutting Edge, we introduce FVAPPS, the largest formal verification benchmark, transforming coding interview problems into theorem-proving tasks in Lean 4, providing a rigorous test for AI models and researchers working on provably correct code generation.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
PS: We're conducting market research to better understand the evolving landscape of software engineering and architecture, including how professionals like you learn, grow, and adapt to the impact of AI. We think your insights would be incredibly valuable, and would love to hear what you have to say in a quick 1:1 conversation with our team.
What's in it for you?
✅ A brief 20–30 minute conversation at a time that's convenient for you
✅ An opportunity to share your experiences and shape the future of learning
✅ A free credit to redeem any eBook of your choice from our library as a thank-you
How to Participate: Schedule a quick call at your convenience using the link provided after the form: https://forms.office.com/e/Bqc7gaDCKq
Looking forward to speaking with you soon!
Thank you,
Team Packt.
Note: Credits may take up to 15 working days to be applied to your account
Sign Up|Advertise
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
PyPI Now Supports iOS and Android Wheels for Mobile Python Development: This follows PEP 730 (iOS support) and PEP 738 (Android support), both targeting Tier 3 support in Python 3.13.
LLM 0.22, the annotated release notes: The version introduces API key support for models, a new chatgpt-4o-latest alias, improved logging and model search functionality, an embedding --prepend option, and various bug fixes.
Run LLMs on macOS using llm-mlx and Apple's MLX framework: llm-mlx, a new plugin for the LLM Python library and CLI, enables local execution of Apple MLX-optimized LLMs and simplifies running models like Llama 3.2 3B.
💼Case Studies and Experiments🔬
URL-addressable Pyodide Python environments: Explores Datasette Lite, a browser-based Python environment powered by Pyodide and WebAssembly, highlighting its ability to run Python entirely in the browser.
Lines, Scribbles, and an Algorithm: AI in Brain Mapping: Applies machine learning to classify EEG data, tests a model, analyzes its accuracy, and discusses broader implications.
📊Analysis
A year of uv: pros, cons, and should you migrate: Explains uv's advantages over traditional tools like pip and venv, when not to use it, and how Astral's development approach ensures reliability and future improvements.
Top Python Data Visualization Libraries of 2025: pygwalker leads as a low-code tool, matplotlib remains foundational with GPU support, and Plotly dominates dashboards.
Seaborn, Bokeh, and Altair cater to statistical, streaming, and academic needs.
🎓Tutorials and Guides🤓
Hermione's Undetectable Extension Charm: Revealed (Using Python): Uses Hermione's magical handbag from Harry Potter to explain Python's memory model, how references work, and why lists and other data structures appear to contain large objects while remaining small themselves.
Transforming Non-Stationary Time Series with Differencing: Explains how differencing transforms non-stationary time series by removing trends, using first and second-order differencing, and applies this method to global temperature anomalies to prepare data for statistical modeling.
Solving Markov Decision Processes with Value Iteration: Explains key MDP components, implements Bellman Equations for decision-making, and compares Value Iteration and Policy Iteration for optimizing movement strategies.
How to Integrate Ollama and Deepseek-R1 with FastAPI for AI-Powered Chatbots: Provides step-by-step instructions to build an AI-powered chatbot, including setup, coding, API development, testing, and next steps.
3D DMD and Visualization with OpenFOAM and Python: Provides a step-by-step guide to applying 3D DMD to OpenFOAM computational fluid dynamics (CFD) simulations, using Python and ParaView.
Quick Start Robotics and Reinforcement Learning with MuJoCo: Covers the library's core structure, MJCF format, integration with reinforcement learning algorithms such as PPO, and environment setup for training robotic agents in a simulated physics environment.
Creating a private, custom distilled Deepseek-R1 model on your local machine & using it within an operational Python pipeline: Covers using Ollama and integrating it into a Python pipeline for AI-driven text analysis while ensuring data privacy and offline functionality.
🔑Best Practices and Advice🔏
The "late-binding closure" pitfall in Python: Explains how closures in Python capture variables by reference, not value, and provides a practical workaround to ensure expected behavior when using closures inside loops (see the sketch after this list).
Python Tooling Guide (Evergreen): Provides an up-to-date overview of the best tools for managing dependencies, linting, formatting, type checking, and editing as of early 2025.
8 Open-Source AutoML Frameworks: How to Choose the Right One: Covers early projects like Auto-WEKA and TPOT, widely used tools like Auto-sklearn and H2O AutoML, and modern frameworks like AutoGluon, FLAML, and PyCaret.
Python Setup Tips: Recommends uv as the modern 2025 approach for speed and ease of use but includes Conda as an alternative; provides step-by-step instructions for installing Python, creating virtual environments, and managing dependencies.
#TechForNewbies: Inheritance: Explains inheritance in OOP using a family analogy, demonstrating how child classes inherit properties and methods from parent classes, the difference between single and multiple inheritance, and how to manage conflicts in variable names.
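The pitfall in one self-contained example (a standard illustration, not the article's code): closures look up the loop variable when they are called, not when they are defined, so all of them see its final value; binding it as a default argument captures the value at definition time:

funcs = [lambda: i for i in range(3)]
print([f() for f in funcs])                 # [2, 2, 2] — every closure reads the final i

funcs = [lambda i=i: i for i in range(3)]   # default argument freezes each value
print([f() for f in funcs])                 # [0, 1, 2]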
🔍From the Cutting Edge: FVAPPS -- A Benchmark for Formally Verified Code Generation💥
In "Proving the Coding Interview: A Benchmark for Formally Verified Code Generation," Dougherty and Mehta introduce Formally Verified Automated Programming Progress Standards (FVAPPS), the largest formal verification benchmark to date, containing 4,715 samples, including 1,083 curated ones. This benchmark expands upon the Automated Programming Progress Standards (APPS), integrating interactive theorem proving into coding interview-style problems.

Context
Formal verification mathematically proves a program's correctness, offering stronger assurances than traditional software testing. This is crucial in safety-critical applications such as medical devices, cybersecurity, and autonomous systems. While Large Language Models (LLMs) have improved at code generation, ensuring their correctness remains a challenge. FVAPPS provides a rigorous benchmark to evaluate both program synthesis and theorem proving.

Key Findings
Largest Formal Verification Benchmark: FVAPPS converts APPS coding problems into 4,715 theorem-proving tasks using Lean 4.
Structured Benchmark: Problems are provided as Lean 4 theorems with "sorry" placeholders, requiring formal proofs for correctness.
Automated Dataset Generation: Uses Anthropic's Claude Sonnet 3.5 for iterative improvements and verification.
Baseline Model Performance: Sonnet 3.5 proved 30% of the attempted theorems; Gemini 1.5 proved 18% of the attempted theorems.
Human vs AI Performance: A human spent 10 hours on a single problem without completing the proof, highlighting the challenge of formal verification.

What This Means for You
FVAPPS is valuable for AI researchers advancing formal verification and program synthesis, software engineers working on provably correct systems, machine learning developers refining AI theorem-proving capabilities, and hiring platforms assessing candidates using verifiable coding tasks.

Examining the Details
FVAPPS extends APPS by reformatting problems into theorem-proving tasks in Lean 4. The dataset undergoes a multi-stage process:
Data Processing: Extracts coding problems, consolidates unit tests, and reformulates solutions.
Lean Conversion: Converts Python solutions into function definitions and theorem statements with "sorry" placeholders.
Validation: Uses Lean's type-checking system to verify theorem structures.
Quality Assurance: Ensures problems pass unit tests and align with formal verification principles.
Curation: Filters out invalid or trivial problems, categorizing them into three subsets: unguarded (full dataset), guarded (unit test-verified), and guarded & plausible (most rigorous).
You can learn more by reading the entire paper or accessing FVAPPS.
And that's a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!


PythonPro #69: 14 Advanced Python Features, Understanding the CPython Compiler, and Unvibe for LLM Code Quality

Divya Anne Selvaraj
29 Apr 2025
Bite-sized actionable content, practical tutorials, and resources for Python programmers.
#69: 14 Advanced Python Features, Understanding the CPython Compiler, and Unvibe for LLM Code Quality
Sponsored: Build the Future of Apps on HubSpot: New developer products introduce deeper extensibility, flexible UI options, modern development tools, and more, shaping a more powerful, customisable platform experience. Learn more.
Hi, Welcome to a brand new issue of PythonPro!
News Highlights: Unvibe, a Python test-runner, uses unit tests and tree search to improve LLM-generated code; Rowboat debuts as an open-source IDE for multi-agent systems with custom LLM support; and pipdeptree 2.26.1 adds improved Python package dependency visualisation.
My top 5 picks from today's learning resources:
14 Advanced Python Features🧩
Understanding the CPython Compiler⚙️
Open Access Book | Mathematical Methods in Data Science (with Python) by Sebastien Roch📘
Blog Modernisation with Claude: A Python Makeover🛠️
Better ways to monitor NNs while training📈
And, in From the Cutting Edge, we introduce gdeltnews, an open-source Python tool that reconstructs full newspaper articles from GDELT's fragmented n-gram data, enabling researchers to access large-scale, structured news corpora at near-zero cost with high reconstruction accuracy.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
Sponsored: How to stop identity threats without drowning in tools: Know Whiteswan better. Demo walk-through.
Sign Up|Advertise
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
Unvibe: A Python Test-Runner that forces LLMs to generate correct code: The Python library uses unit tests as a reward signal to guide LLMs in generating correct code through a tree search approach, improving code quality in large existing projects without endless prompt tweaking.
Rowboat – Open-source IDE for multi-agent systems: Powered by OpenAI's Agents SDK, the platform supports custom LLM providers like LiteLLM and OpenRouter, integrates via HTTP API or Python SDK, and includes a UI for managing tools and servers.
pipdeptree 2.26.1 released: This command-line utility for visualising Python package dependencies as a tree helps detect conflicts, circular dependencies, and outdated packages.
A Python client for the Global CVE Allocation System: The gcve project is a newly updated Python client for the Global CVE Allocation System, offering command-line and library support to manage and verify decentralised vulnerability IDs (GNAs) through a flexible, open approach.
💼Case Studies and Experiments🔬
Minimal IRC server from scratch in Python: Recounts building pirc, a minimal, ~350-line Python IRC server prototype aimed at creating a lightweight chat environment without modern overhead like encryption or account systems, using only Python's standard library.
From slopes to stats: Building a snowboarding performance dashboard with Python and my own sensor data: Describes how the author combined Python, SAS, and consumer devices to build a personal performance dashboard, collecting and analysing GPS, heart rate, and metadata.
📊Analysis
14 Advanced Python Features: Presents 14 underused yet powerful Python features, including typing overloads, structural pattern matching, generics, protocols, and metaclasses, with code examples and references.
Python is an interpreted language with a compiler: Clarifies that while Python is commonly called an interpreted language, it actually uses a compiler internally to generate bytecode, which is then run by the Python virtual machine (a quick way to see this is sketched below).
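You can watch that compiler at work with the standard-library dis module, which prints the bytecode CPython generated for a function (exact instruction names vary by Python version):

import dis

def add(a, b):
    return a + b

dis.dis(add)  # prints the compiled bytecode: loads of a and b, a binary-add op, a return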
🎓Tutorials and Guides🤓
📖Open Access Book | Mathematical Methods in Data Science (with Python) by Sebastien Roch: A mathematically rigorous textbook introducing data science through multivariable calculus, linear algebra, and probability, with coding examples in Python.
Understanding the CPython Compiler: Demystifies how CPython compiles Python source code into bytecode through four stages: tokenization, AST generation, bytecode compilation, and execution by the Python Virtual Machine.
Powering React with Python (Wasm): Demonstrates how to build a web-based photo editor using Next.js for the UI and Python compiled to WebAssembly for performance-heavy image editing tasks.
I Can't Get No (Boolean) Satisfaction: Explores the significance of Boolean satisfiability (SAT) problems, illustrating their foundational role in computer science and their applicability to various real-world challenges, while providing a technical guide to generating and solving SAT instances using Python.
How I Built a Local MCP Server to Connect Obsidian with AI: Explains how the author built a local MCP server enabling secure, read-only AI access to a personal knowledge base for analysis, content completion, and review question generation.
Named Entity Recognition with Python in George Eliot's The Mill on the Floss: Uses Python and spaCy to map character frequencies and relationships through co-occurrence analysis, bar plots, and network graphs to support literary interpretation of social and emotional dynamics (the core spaCy call is sketched after this list).
Animated scatter plot with size legend using matplotlib: Explains how to create an animated scatter plot by plotting earthquake data from Nepal's 2015 earthquake, addressing challenges like dynamic marker sizing and legend scaling.
🔑Best Practices and Advice🔏
Better ways to monitor NNs while training: Offers an in-depth exploration of advanced techniques for monitoring neural network training beyond conventional methods like loss and accuracy metrics.
Blog Modernisation with Claude: A Python Makeover: Describes how the author modernised his 18-year-old Python blog generator using Claude AI, improving it with Poetry, pre-commit hooks, GitHub Actions, and a structured package layout, all while retaining functionality.
Optimizing Causal Decisions with Gurobi Machine Learning: A Step-by-Step Tutorial: Explains how to combine machine learning predictions and mathematical optimization using Gurobi to solve constrained causal decision-making problems.
Experimental Design in the AI Era: Explains how to design efficient, AI-driven experimental frameworks by combining optimal experimental design (OED), Bayesian optimization, and machine learning surrogate models.
Mapping Hollywood: Actor Collaboration Networks with d3Blocks, Streamlit and NetworkX: Describes how to build an interactive Streamlit app that maps Hollywood actor collaborations using TMDB data, visualises the network with d3Blocks, and analyses it with NetworkX metrics like degree centrality and clustering.
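For orientation, the core spaCy NER call that such an analysis builds on looks like this (assuming the small English model has been installed via `python -m spacy download en_core_web_sm`; the sentence is my own example):

import spacy

nlp = spacy.load("en_core_web_sm")  # small English pipeline with an NER component
doc = nlp("Maggie Tulliver grew up at Dorlcote Mill on the river Floss.")
print([(ent.text, ent.label_) for ent in doc.ents])  # recognised entities and their labels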
🔍From the Cutting Edge: gdeltnews—A Python Tool for Reconstructing Full News Text from GDELT💥
In "A Python Tool for Reconstructing Full News Text from GDELT," Andrea Fronzetti Colladon and Roberto Vestrelli introduce a Python-based method and tool, gdeltnews, for reconstructing full news articles from GDELT's Web News NGrams 3.0 dataset. Their work addresses critical challenges around affordable, large-scale access to news text for research purposes.

Context
News datasets are foundational across economics, finance, management, social sciences, and computer science. They help predict stock trends, study political discourse, assess corporate reputations, and train LLMs. However, access to comprehensive, full-text news datasets often requires expensive subscriptions to platforms like Factiva and LexisNexis, while free alternatives tend to lack completeness or transparency. GDELT, the Global Database of Events, Language, and Tone, is a free, open-access platform capturing global news coverage in over 100 languages. Its Web News NGrams 3.0 dataset provides n-grams (single-word units) along with minimal contextual information but does not offer full article texts. Colladon and Vestrelli's Python tool aims to bridge this gap by reconstructing articles from GDELT's fragmented data.

Key Features of gdeltnews
Open-source Python package: Implements the reconstruction method using modular, extensible code.
Fragment Assembly Algorithm: Automatically reconstructs articles by joining overlapping text fragments based on word similarity and article position.
Support for Space-Segmented Languages: Initially handles languages like English, French, and Spanish; future extensions are planned for scriptio continua languages like Chinese and Japanese.
Preprocessing for Clean Reconstruction: Detects and corrects known GDELT artefacts, such as article ends misplaced at beginnings.
Validation-backed Reliability: Demonstrated high fidelity (up to 95% similarity) in reconstructing original articles, based on systematic benchmarking against EventRegistry.
Parallel Processing Version: Provides a parallelised option for handling large datasets more efficiently.
URL and Metadata Handling: Allows researchers to filter or organise reconstructed articles by original source URLs and language metadata.

What This Means for You
This tool is particularly relevant for researchers, data scientists, and NLP practitioners who require large-scale news datasets but cannot afford costly proprietary services. gdeltnews empowers studies in economic forecasting, public opinion tracking, fake news detection, and AI training by offering near-complete reconstructed texts at no cost. It enables more flexible, verifiable, and customised analyses that were previously limited by access restrictions.

Examining the Details
The reconstruction method starts by grouping n-gram entries by source URL and combining the "pre", "ngram", and "post" fields into textual fragments. These fragments are then joined by detecting word overlaps and considering positional metadata (article deciles). The method includes logic to correct GDELT-specific artefacts, such as misplaced end-of-article content (a toy version of the overlap join is sketched below).
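As a toy illustration of the overlap-joining idea (my own sketch, far simpler than the paper's method, which also uses positional metadata):

def join_fragments(left, right, min_overlap=2):
    """Merge two fragments on the longest shared word sequence at the seam."""
    lw, rw = left.split(), right.split()
    for k in range(min(len(lw), len(rw)), min_overlap - 1, -1):
        if lw[-k:] == rw[:k]:
            return " ".join(lw + rw[k:])
    return None  # no sufficient overlap found

print(join_fragments("the market rallied sharply on", "sharply on Friday morning"))
# -> "the market rallied sharply on Friday morning"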
For validation, the authors matched 2,211 articles reconstructed from GDELT data to original full texts obtained from EventRegistry, covering major U.S. news outlets. After cleaning and tokenising both sets, they compared them using Levenshtein Similarity and SequenceMatcher Similarity, both sensitive to word order, which is critical when reconstructing coherent article narratives.
Without filtering, reconstructed articles achieved around 75% similarity to the originals; when filtering for articles with at least 80% token overlap, the similarity rose to 95%. These results confirm the method's strong fidelity even under minor noise or variations.
Limitations include the absence of article titles in GDELT's dataset and slower single-process performance, although a parallel version of gdeltnews mitigates the latter issue. Future improvements aim to support non-space-separated languages and enhance efficiency.
You can learn more by reading the entire paper or accessing the tool on GitHub.
And that's a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!


PythonPro #64: ChatGPT Explains Shapiro-Wilk Test, Python VS Code Updates, and dtaianomaly for Time Series

Divya Anne Selvaraj
25 Feb 2025
Bite-sized actionable content, practical tutorials, and resources for Python programmers.
#64: ChatGPT Explains Shapiro-Wilk Test, Python VS Code Updates, and dtaianomaly for Time Series
Join Snyk's Fetch the Flag CTF 2025! Snyk's annual Capture the Flag event, Fetch the Flag 2025, is back! Hosted by Snyk and John Hammond, the event will take place on February 27, 2025, from 9am ET to 9pm ET. Engage in 20+ real-world security challenges, enhance your professional development & earn CPE credits. Register for free!
Hi, In today's Expert Insight we bring you an excerpt from the book Coding with ChatGPT and Other LLMs, which discusses using LLMs like ChatGPT to explain and document code, specifically explaining the Shapiro-Wilk normality test code.
News Highlights: Python in Visual Studio Code – February 2025 Release adds no-config debugging, test discovery cancellation, Native REPL, and AI code actions; Cli2 v4 introduces a simplified CLI, Sphinx integration, and 12-factor config; and Python docs expose risks in legacy web development with insecure CGI code.
My top 5 picks from today's learning resources:
Introduction to CUDA Programming for Python Developers🖥️
Fine Tuning a Python Code Completion Model🔧
Slashing my ".bashrc" in half⚡
20 Important Statistical Approaches Every Data Scientist Knows📊
Python does a super weird thing that puts methods on the wrong object🤔
And, in From the Cutting Edge, we introduce dtaianomaly, an open-source Python library that standardises and simplifies time series anomaly detection for both researchers and industry, offering a scikit-learn-inspired API, extensibility for developing new models, and tools for preprocessing, benchmarking, and visual analysis.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
Sign Up|Advertise
🐍 Python in the Tech 💻 Jungle 🌳
🗞️News
Python in Visual Studio Code – February 2025 Release: This release of the Python, Pylance, and Jupyter extensions introduces no-config debugging, test discovery cancellation, Native REPL launch, Go to Implementation with Pylance, and an experimental AI code action for generating symbols.
Cli2 Python Automation framework v4 released: The Python automation framework for DevOps includes features like a simplified CLI alternative, Sphinx integration for documentation, and a 12-factor configuration library.
Python's official documentation contains textbook example of insecure code (XSS): The module with the insecure CGI code example is deprecated in Python 3.11 and removed in 3.13, but it exposes risks in legacy Python web development.
💼Case Studies and Experiments🔬
Lines, Scribbles, and an Algorithm: AI in Sleep Analysis and Brain Mapping: Explores how Python and machine learning, using libraries like MNE and Random Forest Classifiers, can be applied to EEG data for classifying sleep stages and analyzing brain activity.
Antivirus Bypass Method for Metasploit Tools: Details a method for bypassing antivirus detection using Metasploit tools, including creating a Python-based Meterpreter payload with msfvenom, obfuscating it through encoding and compression techniques, and configuring msfconsole for stealth and persistence.
📊Analysis
When Maps Made Me Dumb, Valuable Geospatial Insights Rescued Me: Explores how Python's GeoPandas library, along with tools like QGIS, can help accurately measure geographic areas and understand the distortions caused by map projections like Mercator.
Python does a super weird thing that puts methods on the wrong object: Explores why Python places the join method on strings instead of lists, examining its design, historical context, and alignment with Python's philosophy.
🎓Tutorials and Guides🤓
Introduction to CUDA Programming for Python Developers: Explains GPU architecture, parallel processing, and the CUDA platform for optimizing computations, especially for deep learning.
Building an Open, Multi-Engine Data Lakehouse with S3 and Python: Explains how to set up a Python-first, open lakehouse on cloud storage, integrate multiple query engines, and handle metadata management using Iceberg and Snowflake's Open Catalog.
Fine Tuning a Python Code Completion Model: Details the process of fine-tuning the smolLM 135M model for Python code completion using LoRA, with an iterative approach to optimizing hyperparameters, achieving efficient training and strong code generation performance.
Train a Small GPT-Style LLM from Scratch: Provides a Jupyter Notebook for training a small GPT-style, decoder-only language model from scratch using PyTorch, covering tokenization, positional encoding, self-attention, and more.
Automating FrameBlender with AI Agents and LangGraph: Explains how to automate a previously interactive system with multi-agent workflows using LangGraph to improve scalability and efficiency.
UV — An Ultimate Python Package Manager: The Beginner's Guide: Covers how to install and use UV for managing Python environments, packages, and projects efficiently, along with advanced features like multi-version support and isolated tool execution.
Synthetic Data (What, Why, When, and How?): Defines synthetic data, covers its use cases in scenarios like healthcare and finance, and provides a Python code snippet to generate synthetic data using libraries like Faker, Pandas, and xlsxwriter (a minimal Faker example follows this list).
🔑Best Practices and Advice🔏
Slashing my ".bashrc" in half: Covers how the author improved their shell configuration by replacing numerous plugins with efficient, modern Rust-based tools, resulting in a faster, leaner, and easier-to-maintain ".bashrc."
Benchmarking utility for Python: Details how Go's built-in testing framework simplifies benchmarking, and introduces a custom utility function, autobench, to replicate Go's automatic loop count and repetition handling in Python.
10 machine learning mistakes and how to avoid them: Outlines issues like AI hallucinations, model bias, poor data quality, and integration challenges, and offers insights on how to avoid these.
pytest snippets for python 3.11+: Provides useful snippets focusing on asynchronous test support, integration test handling, and SQLAlchemy async support with factory-boy.
20 Important Statistical Approaches Every Data Scientist Knows: Covers methods like Bayesian Inference, Maximum Likelihood Estimation, and Principal Component Analysis, with Python code snippets to illustrate each technique.
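A tiny taste of the Faker-based approach from the synthetic-data piece (illustrative, not the article's code):

from faker import Faker
import pandas as pd

fake = Faker()
# Each row is a fabricated but realistic-looking record
df = pd.DataFrame([
    {"name": fake.name(), "email": fake.email(), "city": fake.city()}
    for _ in range(5)
])
print(df)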
dtaianomaly addresses this issue by simplifying anomaly detection for practitioners while providing a framework for researchers to extend the tool with new methods.Key concepts in this domain include anomaly detection, which refers to the identification of unusual patterns in data; time series data, which consists of data points indexed by time; and APIs, which offer a standard interface for communication between software systems.Key FeaturesStandardised API: Provides consistency for new users and compatibility with tools like scikit-learn.Extensibility: Designed to allow easy addition of new anomaly detection methods.Experimental Validation: Simplifies benchmarking, enabling comparison of models with minimal code.Range of Anomaly Detectors: Includes various built-in detectors, such as Matrix Profile.Time Series Preprocessing: Provides tools for scaling, resampling, and smoothing data.Visual Analysis Tools: Visualisation techniques for qualitative assessment of models.Confidence Prediction: Estimates the confidence level of anomaly scores.Profiling: Supports runtime and memory profiling, useful for evaluating edge-device suitability.Cross-Platform Unit Testing: Ensures robustness with thorough testing across platforms.What This Means for Youdtaianomaly is relevant to both researchers and industry practitioners. Researchers can use it to rapidly implement and test new algorithms, while businesses can apply its established tools for practical anomaly detection tasks, such as in sensor data or financial transactions. The library's user-friendly documentation and standardised API make it accessible to newcomers, while its extensibility makes it valuable for ongoing research.Examining the Detailsdtaianomaly uses an object-oriented design similar to scikit-learn. Its core component, the BaseDetector, standardises the interface for all anomaly detection models, making it easy to integrate new methods. The library supports time series preprocessing, preparing data before anomaly detection, and offers an automated workflow for large-scale experimental validation. The LazyDataLoader ensures efficient handling of large datasets, while the Pipeline object allows for chaining preprocessing and detection steps. Evaluation is facilitated through both quantitative metrics (e.g., AUC, precision, recall) and qualitative visualisations.The library also supports runtime and memory profiling, helping determine if a model is suitable for deployment on edge devices, which is particularly useful for resource-constrained industries. Its cross-platform compatibility and continuous testing ensure reliability.You can learn more by reading the entire paper or accessing the library on GitHub.🧠 Expert insight💥Here’s an excerpt from “Chapter 3: Grounding Responses” in the book, Coding with ChatGPT and Other LLMs by Dr. Vincent Austin Hall.Documenting codeOften, you’ll come across code that someone else (including an earlier version of you) has written and it’ll not make sense to you. You might have to go through each line and function and class, running it in your mind and testing each bit of code inan IDE.Fortunately, there is now anotheroption: LLMs or chatbots. You can get one to be your personal code translator to clarify what is going on.You can put some code in the chat window and ask the bot to explain what it does. 
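The same workflow can be scripted as well as done in the chat window. Here is a minimal sketch of mine, assuming the OpenAI Python SDK (v1+) with an API key in the environment; the model name and snippet are placeholders, not something the book prescribes:

# Ask a chat model to explain a code snippet programmatically
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

snippet = "stat, p = shapiro(data)"  # placeholder snippet to explain
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name; any chat-capable model works
    messages=[{"role": "user", "content": f"Please explain this code:\n{snippet}"}],
)
print(response.choices[0].message.content)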
You will generally also get detailed explanations when the chatbots give you code.

Let's get ChatGPT to explain some code

Statistical tests are useful for statisticians and data scientists, so let's do some. Jason Brownlee can help here [Brownlee_17Stats]. What is a Shapiro-Wilk test? Here is some code in Python for this, from 17 Statistical Hypothesis Tests in Python (Cheat Sheet):

# Example of the Shapiro-Wilk Normality Test
from scipy.stats import shapiro

data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]
stat, p = shapiro(data)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')

What does ChatGPT say about this? Here is the prompt:

Please explain this code to me, I don't understand it at all.

Here is ChatGPT's response (I put it into a markdown interpreter to get it into a nice style, without the ** around the headings ** [Dillinger]):

Certainly! The provided code is an example of how to use the Shapiro-Wilk Normality Test in Python, specifically with the scipy.stats library. This statistical test is used to determine whether a given sample follows a Gaussian (normal) distribution. Here's a breakdown of the code:

# Importing the Shapiro-Wilk test from the scipy.stats module
from scipy.stats import shapiro

# Creating a sample dataset (in this case, a list of numbers)
data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]

# Applying the Shapiro-Wilk test to the dataset
stat, p = shapiro(data)

# Printing the test statistic and the p-value
print('stat=%.3f, p=%.3f' % (stat, p))

# Checking the p-value and making a decision based on a significance level (e.g., 0.05)
if p > 0.05:
    print('Probably Gaussian')  # If p > 0.05, we fail to reject the null hypothesis, suggesting the data is probably Gaussian.
else:
    print('Probably not Gaussian')  # If p <= 0.05, we reject the null hypothesis, suggesting the data is probably not Gaussian.

Here's a step-by-step explanation:

1. Importing the Shapiro-Wilk test function:

from scipy.stats import shapiro

This line imports the shapiro function from the scipy.stats module.

2. Defining a sample dataset:

data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]

This line creates a list named data containing a set of numerical values.

3. Applying the Shapiro-Wilk test:

stat, p = shapiro(data)

This line applies the Shapiro-Wilk test to the dataset, and the results are stored in two variables: stat (the test statistic) and p (the p-value).

4. Printing the results:

print('stat=%.3f, p=%.3f' % (stat, p))

This line prints the test statistic and the p-value with three decimal places.

5. Interpreting the results:

if p > 0.05:
    print('Probably Gaussian')
else:
    print('Probably not Gaussian')

Based on a significance level (0.05 is commonly used), this code prints whether the dataset is probably Gaussian or probably not Gaussian. If p > 0.05, we fail to reject the null hypothesis, suggesting the data is probably Gaussian. If p <= 0.05, we reject the null hypothesis, suggesting the data is probably not Gaussian.

Here is my review: ChatGPT (not GPT-4, but definitely use that instead) tells us what the test is for, how to import the relevant library and where it comes from, how to create example data and what type it is, about the statistics returned, about printing, and what each statistics result means. That's really helpful!

What could it do better? Maybe it could suggest other tests that might be able to do this, when to not use the test, and whether it might be able to critique the code.
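On that last point, scipy.stats ships other normality checks that could be suggested alongside Shapiro-Wilk. A minimal sketch of my own, not from the book or from ChatGPT's reply:

# Complementary normality tests from scipy.stats
from scipy.stats import normaltest, anderson

data = [0.873, 2.817, 0.121, -0.945, -0.055, -1.436, 0.360, -1.478, -1.637, -1.869]

# D'Agostino's K^2 combines skew and kurtosis (most reliable with ~20+ samples)
stat, p = normaltest(data)
print('K^2 stat=%.3f, p=%.3f' % (stat, p))

# Anderson-Darling reports critical values instead of a single p-value
result = anderson(data, dist='norm')
for crit, sig in zip(result.critical_values, result.significance_level):
    print('A-D stat=%.3f vs critical=%.3f at %.1f%% significance' % (result.statistic, crit, sig))

Cross-checking with tests like these is exactly what ChatGPT recommends when prompted, as we do next.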
We can ask the following:

ChatGPT, when should I not use this statistical test?

ChatGPT lists six points and details that the test gets better with larger sample sizes, but there is a limit: on very large samples it might also find significance where there isn't any. ChatGPT also mentions non-Gaussian distributions and sensitivity to outliers, and suggests using other tests and actions to confirm that the data is normally distributed. There are far more details, which I won't get into here, for brevity, but I'm sure if you ask the AI for it, it'll give you good information [ChatGPT].

We could also ask Gemini to critique the code:

Critique the code and give us some visualizations to help us understand the Shapiro-Wilk test.

Let's check how that compares with what Gemini says about the code...

Coding with ChatGPT and Other LLMs was published in November 2024. Packt library subscribers can continue reading the entire book for free.

Get the eBook for $35.99 $24.99
Get the Print Book for $44.99

And that's a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!
PythonPro #47: Python 3.13 Features, AI Debugging with Copilot, and Building Neural Networks from Scratch

Divya Anne Selvaraj
17 Sep 2024
Bite-sized actionable content, practical tutorials, and resources for Python programmers.#47:Python 3.13 Features, AI Debugging with Copilot, and Building Neural Networks from ScratchHi ,Welcome to a brand new issue of PythonPro!In today’sExpert Insight we bring you an excerpt from the recently published book, AI-Assisted Programming for Web and Machine Learning, which discusses how Copilot can assist in debugging and troubleshooting by adding error-handling features.News Highlights: DBOS Transact launches with durable Python workflow recovery; Python in Excel now live for data analysis; Python 3.13 is coming October 2024 with new interpreter, JIT, and more; and Hackers use fake coding tests on GitHub to target Python developers.Here are my top 5 picks from our learning resources today:How Does AI Work? Create a Neural Network from Scratch🤖Spam Mail Detection - Machine Learning with Python✉️Django from first principles🌱How to Use Conditional Expressions With NumPy where()🔄Why Learn Python Concurrency⚙️And, today’s Featured Study, introduces ComplexCodeEval, a benchmark designed to evaluate large code models (LCMs) in complex development environments.Stay awesome!Divya Anne SelvarajEditor-in-ChiefP.S.: This month’ssurvey is still live. Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.Sign Up|Advertise @media only screen and (max-width: 100%;} #pad-desktop {display: none !important;} } 🐍 Python in the Tech 💻 Jungle 🌳🗞️NewsDBOS Transact: Ultra-Lightweight Durable Execution for Python Workflows launched: The library ensures programs automatically resume from their last completed step after crashes or interruptions.Python in Excel – Available Now: Microsoft 365 users can now integrate Python libraries for advanced data analysis, visualization, and machine learning within Excel.What’s New In Python 3.13: The version, releasing on October 1, 2024, will include a new interactive interpreter, experimental free-threaded mode, a JIT compiler, enhanced error messages, and updates to the standard library.Fake password manager coding test used to hack Python developers: Posing as recruiters, the hackers use GitHub-hosted projects to infect victims' systems and pressure them to bypass security checks.💼Case Studies and Experiments🔬How Does AI Work? 
Create a Neural Network from Scratch: Explains how to build a basic neural network using Python, to predict house prices, while covering core concepts like gradient descent, backpropagation, and more.Text mining in Python - case-study with “Romeo and Juliet” from Project Gutenberg:Walks you through the steps of accessing the text, cleaning it, tokenizing words, analyzing word frequency, and visualizing the results.📊AnalysisStreamlit vs Gradio - The Ultimate Showdown for Python Dashboards: Evaluates their ease of use, customization options, deployment flexibility, and suitability for complex data visualization or rapid prototyping tasks.It’s time to stop using Python 3.8: Emphasizes the importance of upgrading from Python 3.8, which reaches end-of-life in October 2024, meaning no more bug or security fixes.🎓Tutorials and Guides🤓Understanding Proximal Policy Optimization (PPO) - A Game-Changer in AI Decision-Making Explained for RL Newcomers: explains PPO, detailing its key concepts, practical implementation, and how it improves decision-making stability and efficiency in AI systems.Use Python for Earth Engine Analysis, Save Directly to Your Local Drive: Explains how to use the Python library geemap for interacting with Google Earth Engine (GEE) to process and analyze satellite imagery.Django from first principles: A series on building a Django project starting with a single file, gradually expanding as necessary to manage complexity, simplifying Django for beginners by focusing on essential components first.Injecting syscall faults in Python and Ruby: Discusses how to simulate syscall failures in Python and Ruby using Cirron, a tool that integrates with strace to inject errors, delays, and signals into system calls.Deploying a Django app with Kamal, AWS ECR, and Github Actions: Covers setting up a VPS and preparing it for Kamal, creating a Dockerfile for containerized apps, and configuring a deployment pipeline.Implementing the Singleton Pattern in FastAPI for Efficient Database Management: Demonstrates how to implement the Singleton Pattern for efficient database management, particularly while handling expensive resources.Spam Mail Detection - Machine Learning with Python: Explains how to use a supervised learning approach with a dataset from Kaggle, analyzing email length, applying logistic regression, and creating a scanner to detect spam.🔑Best Practices and Advice🔏Let’s build and optimize a Rust extension for Python: Explains how to build and optimize a Rust extension for Python to improve performance and memory efficiency.Why Learn Python Concurrency: Explains how concurrent, parallel, and asynchronous execution allow programs to fully utilize modern hardware, improve performance, and scale more effectively.Therac-25, LLMs and the Zen of Python: Discusses the dangers of relying on LLMs to rewrite code across languages without understanding the underlying principles and context, drawing a parallel to the infamous Therac-25 disaster.Using Python's pip to Manage Your Projects' Dependencies: discusses using Python's pip to for installing and uninstalling packages, and handling errors.How to Use Conditional Expressions With NumPy where(): Explains how to work with multiple conditions, array broadcasting, and common pitfalls when using np.where() in data manipulation.🔍Featured Study: ComplexCodeEval - Benchmarking Large Code Models in Practice💥In ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code, Feng et al. 
introduce a new benchmark for assessing large code models (LCMs). The paper focuses on evaluating LCMs in real-world coding scenarios involving complex tasks and avoiding data leakage.ContextLCMs are AI models trained to handle coding tasks like code generation, completion, test case creation, and API recommendation. Existing benchmarks tend to evaluate LCMs on limited tasks, such as standalone code generation, without capturing the broader, more diverse challenges developers face. Additionally, they often overlook data leakage, where models are tested on data already seen during training, resulting in inflated performance scores.ComplexCodeEval is a comprehensive benchmark designed to test LCMs on multiple coding tasks and scenarios, reflecting real-world programming challenges. It assesses how well LCMs perform in contexts that include dependencies on third-party libraries and the need to create test functions and recommend APIs.Key Featured of ComplexCodeEvalReal-World Data: Uses 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories.Multiple Tasks: Evaluates LCMs on code generation, completion, API recommendation, and test case generation.Rich Context: Each sample includes function signatures, docstrings, API references, and test functions.Data Leakage Prevention: Multiple timestamps (creation, update) ensure the benchmark avoids testing on training data.Variety of Models Tested: Ten popular LCMs, including StarCoder2, CodeLlama, DeepSeek-Coder, and GPT-3.5-Turbo, were evaluated.What This Means for YouThis study is valuable for programmers and software engineers who use AI coding tools. ComplexCodeEval highlights which models perform best for tasks like generating Java code or recommending Python APIs, making it easier to select the right tools for complex programming tasks. It provides a realistic assessment of LCMs, avoiding inflated scores from limited or synthetic benchmarks.For developers working on AI models, the study offers insights into how additional contextual information, such as dependencies and function histories, can significantly improve model performance.Examining the DetailsTo create ComplexCodeEval, the authors sourced Java and Python samples from GitHub repositories that relied on popular third-party libraries. Each sample was annotated with relevant metadata like API references, docstrings, and timestamps, simulating real-world coding tasks.Ten LCMs, including StarCoder2, CodeLlama, DeepSeek-Coder, and GPT-3.5-Turbo, were tested on four tasks: code generation, code completion, API recommendation, and test case generation. CodeLlama-34B achieved the highest CodeBLEU score of 34.08 for Java code generation, and Python API recommendation saw an F1 score of 52.24.The researchers tested the impact of adding context to the inputs provided to LCMs. Starting with basic function signatures and docstrings, they added more context (e.g., dependencies and library imports) and found that full context improved average CodeBLEU scores by 70.73% in Java and 31.90% in Python.To assess data leakage, the team compared model performance on data created before and after the models’ knowledge cut-off dates. 
They found models performed better on leaked data, with average CodeBLEU scores increasing by 1.22 points in Java and 3.10 points in Python, demonstrating the importance of preventing data leakage in evaluations.You can learn more by reading the entirepaper and accessing the ComplexCodeEvalGithub repository.🧠 Expert insight💥Here’s an excerpt from “Chapter 20: Increasing Efficiency with GitHub Copilot” in the book, AI-Assisted Programming for Web and Machine Learning by Christoffer Noring, Anjali Jain, Marina Fernandez, Ayşe Mutlu, and Ajit Jaokar, published in August 2024.Debugging and troubleshootingBeing able to debug and troubleshoot code is an important skill to have. When you develop code, it seldom does what you need it to do the first time. You needto be able to debug and troubleshoot code to figure out what’s wrong with it. The same applies if you have a reported bug; you need to be able to diagnose at what point it goes wrong. Copilot can help, and a good approach is using Copilot chat and asking it to add support for debugging and troubleshooting.The following is a piece of sample code you’ve been given. It’s a REST API in the framework Flask for Python. It’s working code, but it might not be so easy to work with from a debugging and troubleshooting standpoint:from flask import Flask, jsonify, requestimport sqlite3app = Flask(__name__)# Create a connection to the databaseconn = sqlite3.connect('products.db')c = conn.cursor()# Create a table for products if it doesn't existc.execute('''CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, price REAL)''')conn.commit()@app.route('/products', methods=['GET'])def get_products(): # Retrieve all products from the database c.execute('SELECT * FROM products') products = c.fetchall() # Convert the products to a JSON response response = [] for product in products: response.append({ 'id': product[0], 'name': product[1], 'price': product[2] }) return jsonify(response)@app.route('/products', methods=['POST'])def add_product(): # Retrieve the product data from the request body data = request.get_json() name = data['name'] price = data['price'] # Insert the product into the database c.execute('INSERT INTO products (name, price) VALUES (?, ?)', (name, price)) conn.commit() # Return a success message return jsonify({'message': 'Product added successfully'})if __name__ == '__main__': app.run(debug=True)The above code works, but it doesn’t support debugging and troubleshooting; let’s see how we can add support for that by using Copilot Chat:[Prompt]Add support for troubleshooting and debugging[End of prompt]Copilot chat answers us with both text and code:[Prompt response]Sure, here’s an updated version of the Flask API that includes support for troubleshooting and debugging:from flask import Flask, jsonify, requestimport sqlite3app = Flask(__name__)# Create a connection to the databaseconn = sqlite3.connect('products.db')c = conn.cursor()# Create a table for products if it doesn't existc.execute('''CREATE TABLE IF NOT EXISTS products (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, price REAL)''')conn.commit()@app.route('/products', methods=['GET'])def get_products(): try: # Retrieve all products from the database c.execute('SELECT * FROM products') products = c.fetchall() # Convert the products to a JSON response response = [] for product in products: response.append({ 'id': product[0], 'name': product[1], 'price': product[2] }) return jsonify(response) except Exception as e: # Log the error and return an error 
message app.logger.error(str(e)) return jsonify({'error': 'An error occurred while retrieving products'})@app.route('/products', methods=['POST'])def add_product(): try: # Retrieve the product data from the request body data = request.get_json() name = data['name'] price = data['price'] # Insert the product into the database c.execute('INSERT INTO products (name, price) VALUES (?, ?)', (name, price)) conn.commit() # Return a success message return jsonify({'message': 'Product added successfully'}) except Exception as e: # Log the error and return an error message app.logger.error(str(e)) return jsonify({'error': 'An error occurred while adding the product'})if __name__ == '__main__': app.run(debug=True)The improvements are:Added atry/exceptblock around the code in theget_products()function.Addedloggerstatements to log errors.Addeddebug=Trueto theapp.run()statement.These changes will help us troubleshoot and debug the code.[End of response]Packt library subscribers can continue reading the entire book for free. You can buy AI-Assisted Programming for Web and Machine Learning,here.Get the eBook for $38.99 $26.99!Other Python titles from Packt at upto 30% offGet the eBook for $35.99 $24.99!Get the Print Book for $44.99 $32.99Get the eBook for $35.99 $24.99!Get the Print Book for $44.99 $32.99Get the eBook for $43.99 $29.99!Get the Print Book for $54.99 $40.99Print discounts end in 5 days on the 22nd of September, 2024. @media only screen and (max-width: 100%;} #pad-desktop {display: none !important;} } And that’s a wrap.We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most usefulhere. The complete PythonPro archives can be foundhere.If you have any suggestions or feedback, or would like us to find you aPythonlearning resource on a particular subject, take thesurveyor just respond to this email!*{box-sizing:border-box}body{margin:0;padding:0}a[x-apple-data-detectors]{color:inherit!important;text-decoration:inherit!important}#MessageViewBody a{color:inherit;text-decoration:none}p{line-height:inherit}.desktop_hide,.desktop_hide table{mso-hide:all;display:none;max-height:0;overflow:hidden}.image_block img+div{display:none}sub,sup{line-height:0;font-size:75%}#converted-body .list_block ol,#converted-body .list_block ul,.body [class~=x_list_block] ol,.body [class~=x_list_block] ul,u+.body .list_block ol,u+.body .list_block ul{padding-left:20px} @media (max-width: 100%;display:block}.mobile_hide{min-height:0;max-height:0;max-width: 100%;overflow:hidden;font-size:0}.desktop_hide,.desktop_hide table{display:table!important;max-height:none!important}} @media only screen and (max-width: 100%;} #pad-desktop {display: none !important;} } @media only screen and (max-width: 100%;} #pad-desktop {display: none !important;} }
PythonPro #44: Generative AI with PyTorch, uv Update, Choosing the Best Visualization Type, and FastAPI for Rapid Development

Divya Anne Selvaraj
28 Aug 2024
Bite-sized actionable content, practical tutorials, and resources for Python programmers.#44:Generative AI with PyTorch, uv Update, Choosing the Best Visualization Type, and FastAPI for Rapid DevelopmentHi ,Welcome to a brand new issue of PythonPro!In today’sExpert Insight we bring you an excerpt from the recently published book, Generative AI Foundations in Python, which provides a hands-on guide to implementing generative AI models—GANs, diffusion models, and transformers—using PyTorch and the diffusers library.News Highlights:Theuv Python packaging tool now offers comprehensive project management, tool installation, and support for single-file scripts; and Tach, written in Rust, enforces strict interfaces and dependency management for PythonHere are my top 5 picks from our learning resources today:Visualisation 101 - Choosing the Best Visualisation Type📊Using ffmpeg, yt-dlp, and gpt-4o to Automate Extraction and Explanation of Python Code from YouTube Videos🎥🤖FastAPI Tutorial - Build APIs with Python in Minutes🚀Flatten JSON data with different methods using Python📦Linear Algebra Concepts Every Data Scientist Should Know📐And, in today’sFeatured Study, we introduce PyRoboCOP, a Python-based package designed for optimizing robotic control and collision avoidance in complex environments.Stay awesome!Divya Anne SelvarajEditor-in-ChiefP.S.:We have covered all requests made so far this month, in this issue.Sign Up|Advertise🐍 Python in the Tech 💻 Jungle 🌳🗞️Newsuv: Unified Python packaging:The tool now offers end-to-end project management, tool installation, Python bootstrapping, and support for single-file scripts with embedded dependencies, all within a unified, fast, and reliable interface.Tach - Strict interfaces and dep management forPython, written in Rust:Inspired by modular monolithic architecture, Tach allows you to define dependencies and ensures that modules only import from authorized dependencies.💼Case Studies and Experiments🔬Using ffmpeg, yt-dlp, and gpt-4o to Automate Extraction and Explanation of Python Code from YouTubeVideos:Details downloading video segments, capturing screenshots, extracting code from images using GPT, and then explaining the code with an LLM.Packaging Python and PyTorch for a Machine Learning Application:Discusses the challenges of packaging Python and PyTorch for the Transformer Lab application, aiming for a seamless user experience across various operating systems.📊Analysis🎥Charlie Marsh on Astral, uv, and the Python packaging ecosystem:Discusses insights on the development of Astral's uv tool, a cargo-like tool for Python, following a significant upgrade.CPython Compiler Hardening:Outlines the author’s process of selecting and testing compiler options, addressing challenges like excessive warnings, performance impacts, and developing tools to track and manage these warnings🎓Tutorials and Guides🤓Flatten JSON data with different methods using Python:Techniques discussed include usingpandas'json_normalize, recursive functions, theflatten_jsonlibrary, custom functions, and tools like PySpark and SQL.FastAPI Tutorial - Build APIs with Python in Minutes:Guides you through setting up a development environment, creating a FastAPI app, building a logistic regression classifier, defining data models with Pydantic, and setting up API endpoints for predictions.What's the deal with setuptools, setup.py, pyproject.toml, and wheels?:Provides a detailed explanation of Python packaging tools and practices, offering insights and recommendations for how to approach 
packaging in modern projects.Python's Preprocessor:Debunks the myth that Python lacks a preprocessor by demonstrating how Python can be extended and customized through the use of custom codecs and path configuration files.📖Open Access Book |Kalman and Bayesian Filters in Python:Addresses the need for a practical introduction to Kalman filtering, offering accessible explanations and examples, along with exercises with answers and supporting libraries.Python Backend Development - A Complete Guide for Beginners:Provides a step-by-step guide to building web applications, including advanced topics like asynchronous programming, performance optimization, and real-time data handling.Working with Excel Spreadsheets in Python:Focuses on automating tasks using theopenpyxlmodule.Read to learn about reading, writing, modifying, and formatting Excel files, and advanced features like plotting charts and integrating images.🔑Best Practices and Advice🔏Visualisation 101 - Choosing the Best Visualisation Type:Explores how visualizations improve data-driven decisions, focusing on understanding context, audience, and visual perception.Readto learn how to implement visualizations.Simone's Creative Cooking Club • If You Haven't Got a Clue What "Pass by Value" or "Pass by Reference" Mean, Read On…:Demonstrates how Python handles function arguments, particularly mutable and immutable objects.How I ask GPT-4 to make tiny Python scripts in practice:Succinctly describes starting with a basic script, then converting it into a command-line interface using click, and adding features like stdin/stdout handling and error logging.Linear Algebra Concepts Every Data Scientist Should Know:Introduces key concepts such as vectors, vector operations, vector spaces, and matrices, with visual explanations and code examples to demonstrate their application in real-world data science tasks.🎥Python From a Java Developer's Perspective:Provides guidance for Java developers to write Python code effectively.Watch to learn how to smoothly transition between Java and Python while leveraging your existing Java knowledge.🔍Featured Study: Mastering Robotic Control with PyRoboCOP for Complex Tasks💥In “PyRoboCOP: Python-based Robotic Control & Optimization Package for Manipulation and Collision Avoidance” Raghunathan et al. introduce a Python-based software package designed for the optimisation and control of robotic systems. The package excels in handling complex interactions like contact and collision avoidance, crucial for autonomous robotic manipulation.ContextRobotic systems often operate in environments with numerous obstacles and objects, making it essential to model and optimise these interactions mathematically. These interactions, defined by complementarity constraints, are challenging to manage because they do not follow standard optimisation assumptions. Most existing physics engines simulate these interactions but do not offer real-time optimisation capabilities.PyRoboCOPaddresses this gap by providing a flexible and user-friendly package that allows robots to reason about their environment and optimise their behaviour, which is critical for achieving autonomous manipulation tasks.Key Features of PyRoboCOPPyRoboCOP is characterised by its ability to automatically reformulate complex mathematical constraints and integrate seamlessly with powerful optimisation tools. 
Key features include:Automatic Reformulation of Complementarity Constraints:Handles difficult constraints that describe object interactions.Direct Transcription via Orthogonal Collocation:Converts DAEs into a solvable set of nonlinear equations.Integration with ADOL-C and IPOPT:Supports automatic differentiation and efficient optimisation.Built-in Support for Contact and Obstacle Avoidance Constraints:Simplifies the setup of complex robotic tasks.Flexible User Interface:Allows for customisation and adaptation to various robotic systems.What This Means for YouThe package is particularly relevant for researchers, developers, and engineers working in the field of robotics, especially those involved in designing autonomous systems that require precise control and optimisation. PyRoboCOP’s ability to handle complex robotic interactions makes it a valuable tool for developing real-time, model-based control solutions in environments where contact and collision avoidance are critical.Examining the DetailsPyRoboCOP's performance was rigorously tested across several robotic scenarios, including planar pushing, car parking, and belt drive unit assembly. In a planar pushing task, PyRoboCOP optimised the robot's trajectory, balancing a normal force of 0.5 N and a friction coefficient of 0.3, successfully navigating from (0,0,0)(0,0,0)(0,0,0) to (0.5,0.5,0)(0.5,0.5,0)(0.5,0.5,0) and (−0.1,−0.1,3π/2)(−0.1,−0.1,3π/2)(−0.1,−0.1,3π/2). In a car parking scenario, the software optimised movement from (1,4,0,0)(1,4,0,0)(1,4,0,0) to (2,2.5,π/2,0)(2,2.5,π/2,0)(2,2.5,π/2,0), effectively avoiding obstacles. PyRoboCOP also managed the complex task of assembling a belt drive unit, demonstrating its ability to handle intricate manipulations. When benchmarked againstCasADiandPyomo, PyRoboCOP showed comparable performance, solving an acrobot system in a mean time of 2.282 seconds with 1,296 variables, versus CasADi's 1.175 seconds with 900 variables and Pyomo's 2.374 seconds with 909 variables.You can learn more by reading the entirepaperor access the packagehere.🧠 Expert insight 📚Tasks💥Here’s an excerpt from “Chapter 2: Surveying GenAI Types and Modes: An Overview of GANs, Diffusers, and Transformers” in the book,Generative AI Foundations in PythonbyCarlos Rodriguez, published in July 2024.Applying GAI models – image generation using GANs, diffusers, and transformersIn this hands-on section…You’ll get a first-hand experience and deep dive into theactual implementation of generative models, specifically GANs, diffusion models, and transformers….I'm a new paragraph block.We’ll be utilizing the highly versatilePyTorchlibrary, a popular choice among machine learning practitioners, to facilitate our operations.PyTorchprovides a powerful and dynamic toolset to define and compute gradients, which is central to trainingthese models.In addition, we’ll also use thediffuserslibrary. It’s a specialized library that provides functionality to implement diffusion models. This library enables us to reproduce state-of-the-art diffusion models directly from our workspace. It underpins the creation, training, and usage of denoising diffusion probabilistic models at an unprecedented level of simplicity, without compromising themodels’ complexity.Through this practical session, we’ll explore how to operate and integrate these libraries and implement and manipulate GANs, diffusers, and transformers using the Python programming language. 
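Because the chapter leans on Colab's GPU runtime, a quick sanity check before loading any model can save time. A tiny sketch of mine, not from the chapter:

# Confirm PyTorch can see a CUDA device before moving models onto it
import torch

if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No CUDA device found; in Colab, switch the runtime type to GPU")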
This hands-on experience will complement the theoretical knowledge we have gained in the chapter, enabling us to see these models in action in thereal world….Working with Jupyter Notebook and Google ColabJupyter notebooks enable live code execution, visualization, and explanatory text, suitable for prototyping and data analysis. Google Colab, conversely, is a cloud-based version of Jupyter Notebook, designed for machine learning prototyping. It provides free GPU resources and integrates with Google Drive for file storage and sharing. We’ll leverage Colab as our prototyping environmentgoing forward.Stable diffusion transformerWe begin with a pre-trained stable diffusion model, a text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI, and LAION (Patil et al., 2022). The diffusion process is used to draw samples from complex, high-dimensional distributions, and when it interacts with the text embeddings, it creates a powerful conditional imagesynthesis model.The term “stable” in this context refers to the fact that during training, a model maintains certain properties that stabilize the learning process. Stable diffusion models offer rich potential to create entirely new samples from a given data distribution, based ontext prompts.Again, for our practical example, we will Google Colab to alleviate a lot of initial setups. Colab also provides all of the computational resources needed to begin experimenting right away. We start by installing some libraries, and with three simple functions, we will build out a minimalStableDiffusionPipelineusing a well-established open-source implementation of the stablediffusion method.First, let’s navigate to our pre-configured Python environment, Google Colab, and install thediffusersopen-source library, which will provide most of the key underlying components we need forour experiment.In the first cell, we install all dependencies using the followingbashcommand. Note the exclamation point at the beginning of the line, which tells our environment to reach down to its underlying process and install the packageswe need:!pip install pytorch-fid torch diffusers clip transformers accelerateNext, we import the libraries we’ve just installed to make them available to ourPython program:from typing import Listimport torchimport matplotlib.pyplot as pltfrom diffusers import StableDiffusionPipeline, DDPMSchedulerNow, we’re ready for our three functions, which will execute the three tasks – loading the pre-trained model, generating the images based on prompting, and renderingthe images:def load_model(model_id: str) -> StableDiffusionPipeline:"""Load model with provided model_id."""return StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, revision="fp16", use_auth_token=False).to("cuda")def generate_images(pipe: StableDiffusionPipeline, prompts: List[str]) -> torch.Tensor:"""Generate images based on provided prompts."""with torch.autocast("cuda"):images = pipe(prompts).imagesreturn imagesdef render_images(images: torch.Tensor):"""Plot the generated images."""plt.figure(figsize=(10, 5))for i, img in enumerate(images):plt.subplot(1, 2, i + 1)plt.imshow(img)plt.axis("off")plt.show()In summary,load_modelloads a machine learning model identified bymodel_idonto a GPU for faster processing. Thegenerate_imagesfunction takes this model and a list of prompts to create our images. 
Within this function, you will notice torch.autocast("cuda"), which is a special command that allows PyTorch (our underlying machine learning library) to perform operations faster while maintaining accuracy. Lastly, the render_images function displays these images in a simple grid format, making use of the matplotlib visualization library to renderour output.With our functions defined, we select our model version, define our pipeline, and execute our imagegeneration process:# Executionmodel_id = "CompVis/stable-diffusion-v1-4"prompts = ["A hyper-realistic photo of a friendly lion","A stylized oil painting of a NYC Brownstone"]pipe = load_model(model_id)images = generate_images(pipe, prompts)render_images(images)The output inFigure 2.1is a vivid example of the imaginativeness and creativity we typically expect from human art, generated entirely by the diffusion process. Except, how do we measure whether the model was faithful to thetext provided?Figure 2.1: Output for the prompts “A hyper-realistic photo of a friendly lion” (left) and “A stylized oil painting of a NYC Brownstone” (right)The next step is to evaluate the quality and relevance of our generated images in relation to the prompts. This is where CLIP comes into play. CLIP is designed to measure the alignment between text and images by analyzing their semantic similarities, giving us a true quantitative measure of the fidelity of our synthetic images tothe prompts.Scoring with the CLIP modelCLIP is trained to understand the relationship between text and images by learning to place similar images and text near each other in a shared space. When evaluating a generated image, CLIP checks how closely the image aligns with the textual description provided. A higher score indicates a better match, meaning the image accurately represents the text. Conversely, a lower score suggests a deviation from the text, indicating a lesser quality or fidelity to the prompt, providing a quantitative measure of how well the generated image adheres to theintended description.Again, we will import thenecessary libraries:from typing import List, Tuplefrom PIL import Imageimport requestsfrom transformers import CLIPProcessor, CLIPModelimport torchWe begin by loading the CLIP model, processor, andnecessary parameters:# ConstantsCLIP_REPO = "openai/clip-vit-base-patch32"def load_model_and_processor(model_name: str) -> Tuple[CLIPModel, CLIPProcessor]:"""Loads the CLIP model and processor."""model = CLIPModel.from_pretrained(model_name)processor = CLIPProcessor.from_pretrained(model_name)return model, processorNext, we define a processing function to adjust the textual prompts and images, ensuring that they are in the correct format forCLIP inference:def process_inputs(processor: CLIPProcessor, prompts: List[str],images: List[Image.Image]) -> dict:"""Processes the inputs using the CLIP processor."""return processor(text=prompts, images=images,return_tensors="pt", padding=True)In this step, we initiate the evaluation process by inputting the images and textual prompts into the CLIP model. This is done in parallel across multiple devices to optimize performance. The model then computes similarity scores, known as logits, for each image-text pair. These scores indicate how well each image corresponds to the text prompts. 
To interpret these scores more intuitively, we convert them into probabilities, which indicate the likelihood that an image aligns with any of thegiven prompts:def get_probabilities(model: CLIPModel, inputs: dict) -> torch.Tensor:"""Computes the probabilities using the CLIP model."""outputs = model(**inputs)logits = outputs.logits_per_image# Define temperature - higher temperature will make the distribution more uniform.T = 10# Apply temperature to the logitstemp_adjusted_logits = logits / Tprobs = torch.nn.functional.softmax(temp_adjusted_logits, dim=1)return probsLastly, we display the images along with their scores, visually representing how well each image adheres to theprovided prompts:def display_images_with_scores(images: List[Image.Image], scores: torch.Tensor) -> None:"""Displays the images alongside their scores."""# Set print options for readabilitytorch.set_printoptions(precision=2, sci_mode=False)for i, image in enumerate(images):print(f"Image {i + 1}:")display(image)print(f"Scores: {scores[i, :]}")print()With everything detailed, let’s execute the pipelineas follows:# Load CLIP modelmodel, processor = load_model_and_processor(CLIP_REPO)# Process image and text inputs togetherinputs = process_inputs(processor, prompts, images)# Extract the probabilitiesprobs = get_probabilities(model, inputs)# Display each image with corresponding scoresdisplay_images_with_scores(images, probs)We now have scores for each of our synthetic images that quantify the fidelity of the synthetic image to the text provided, based on the CLIP model, which interprets both image and text data as one combined mathematical representation (or geometric space) and can measuretheir similarity.Figure 2.2: CLIP scoresFor our “friendly lion,” we computed scores of 83% and 17% for each prompt, which we can interpret as an 83% likelihood that the image aligns with thefirst prompt.Packt library subscribers cancontinue readingthe entire book for free. You can buyGenerative AI Foundations in Pythonby Carlos Rodriguez,here.Get the eBook for $31.99$21.99!And that’s a wrap.We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most usefulhere. The complete PythonPro archives can be foundhere.*{box-sizing:border-box}body{margin:0;padding:0}a[x-apple-data-detectors]{color:inherit!important;text-decoration:inherit!important}#MessageViewBody a{color:inherit;text-decoration:none}p{line-height:inherit}.desktop_hide,.desktop_hide table{mso-hide:all;display:none;max-height:0;overflow:hidden}.image_block img+div{display:none}sub,sup{line-height:0;font-size:75%}#converted-body .list_block ol,#converted-body .list_block ul,.body [class~=x_list_block] ol,.body [class~=x_list_block] ul,u+.body .list_block ol,u+.body .list_block ul{padding-left:20px} @media (max-width: 100%;display:block}.mobile_hide{min-height:0;max-height:0;max-width: 100%;overflow:hidden;font-size:0}.desktop_hide,.desktop_hide table{display:table!important;max-height:none!important}}
PythonPro #50: Python 3.13 Arrives, Offensive Security Practices, and Jupyter Notebook Tips

Divya Anne Selvaraj
08 Oct 2024
Bite-sized actionable content, practical tutorials, and resources for Python programmers.#50:Python 3.13 Arrives, Offensive Security Practices, and Jupyter Notebook TipsHi ,Welcome to a brand new issue of PythonPro!In today’sExpert Insight we bring you an excerpt from the recently published book, Offensive Security Using Python, which briefly discusses key practices such as input validation, secure authentication, session management, secure coding techniques, and the implementation of security headers.News Highlights: Python 3.13.0, released yesterday, adds an interactive interpreter, free-threaded mode, JIT compiler, and iOS/Android support; and Rev's Reverb models for ASR and diarization outperform other open-source models.Here are my top 5 picks from our learning resources today:10 Jupyter Notebook Features You Didn’t Know Exist📓A Guide to Modern Python String Formatting Tools🔠Modeling customers' decisions in Python with the Choice-Learn package🛍️Understanding Logarithmic Plots in Matplotlib: semilogx, semilogy, and loglog📈Best practices for securely consuming open source in Python — Ciara Carey🔐And, today’s Featured Study, evaluates the performance of AI models in geospatial code generation, revealing significant challenges in handling complex tasks, specific data formats, and specialised libraries.Stay awesome!Divya Anne SelvarajEditor-in-ChiefP.S.:This month's survey is still live, do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.Sign Up|Advertise🐍 Python in the Tech 💻 Jungle 🌳🗞️NewsPython3.13.0 Is Released: Released on October 7, 2024, the version includes a new interactive interpreter, free-threaded mode, and JIT compiler, and support for iOS and Android platforms.Introducing Reverb: The Future of Open-Source automatic speech recognition (ASR) and Diarization: Rev's new open-source models for ASR and speech diarization, built using Rev’s extensive human-transcribed English speech dataset, outperforms existing open-source models.💼Case Studies and Experiments🔬Using Kolmogorov-Arnold Networks (KAN) and Backtesting to Predict Stock Prices: Discusses predicting stock prices, focusing on deep learning models trained on historical data from Yahoo Finance.🎥Marketing Media Mix Models with Python & PyMC: a Case Study [PyCon DE & PyData Berlin 2024]: discusses how machine learning models can optimize marketing investments by analyzing various channels.📊Analysis10 Jupyter Notebook Features You Didn’t Know Exist: Discusses features including magic commands, interactive widgets, auto-reload for modules, in-notebook documentation, and collapsible headings.I Used Claude.ai to Create a Discord Bot — Here’s What I Learned About the State of AI Code Writing: Discusses the author's experience using Claude to rapidly generate Python code for a bot that deletes old Discord messages.🎓Tutorials and Guides🤓A Guide to Modern Python String Formatting Tools: Explains how to format values, create custom format specifiers, and embed expressions in strings. 
Read to learn practical techniques for dynamic string manipulation.DuckDB in Python in the Browser with Pyodide, PyScript, and JupyterLite: Shows you how to run DuckDB in Python within a browser environment and embed interactive Python environments in web pages.Tutorial: Creating a Twitter (X) Bot using Python: Explains how to build and deploy a Python-based Twitter (X) bot that autonomously tweets updates, including progress graphs, using the X API.Distilling python functions into LLM: Explains how to use the Instructor library to distill Python functions into a language model, enabling fine-tuning for function emulation using Pydantic type hints.Getting Started with Powerful Data Tables in Your Python Web Apps: Demonstrates building a finance app that fetches stock data, displays it interactively, and includes features like sorting, and graph visualization.Modeling customers decisions in Python with the Choice-Learn package: Introduces the Choice-Learn Python package, which simplifies implementing discrete choice models like Conditional Logit to predict customer decisions.Optimizing Inventory Management with Reinforcement Learning: A Hands-on Python Guide:Outlines how Q-learning helps balance holding and stockout costs by developing an optimal ordering policy.🔑Best Practices and Advice🔏Speeding up CRC-32 calculations in Mojo: Discusses speeding up CRC-32 calculations in Mojo, achieving an 18x improvement over Python's native implementation and reaching 3x slower performance compared to zlib library.Bad Schemas could break your LLM Structured Outputs: Explains how choosing the right response model dramatically impacts the performance of language models like GPT-4o and Claude, especially when using JSON mode or Tool Calling.Implementing a Python Singleton with Decorators: Explains how a decorator ensures only one instance of a class is created, using a _SingletonWrapper class to handle instantiation and simplifies global access.🎥Best practices for securely consuming open source in Python — Ciara Carey: Introduces a framework called Secure Supply Chain Consumption Framework (S2C2F) to help organizations improve open-source security.Understanding Logarithmic Plots in Matplotlib: semilogx, semilogy, and loglog: Walks you through plotting data with a logarithmic x-axis, y-axis, and both axes, respectively, and provides code snippets to generate these plots.🔍Featured Study: Current AI Models Fall Short in Geospatial Code Generation💥In "Evaluation of Code LLMs on Geospatial Code Generation," Gramacki et al. introduce a benchmark to assess LLMs' ability to handle tasks involving spatial reasoning and data processing.ContextLLMs generate code based on natural language inputs and are effective in general programming tasks, particularly in data science. Geospatial data science is a field focused on analysing spatial data tied to locations. It relies on libraries like GeoPandas and Shapely for tasks such as geo-coding, spatial analysis, and data visualisation. However, the domain poses unique challenges for LLMs due to the need for spatial reasoning and the use of specialised tools, making evaluation in this area crucial. 
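To make the domain concrete, here is the kind of single-step spatial task such a benchmark poses. This is an illustrative sketch with invented data, not a sample from the paper's dataset:

# Hypothetical task: find the points that fall inside a buffer zone
import geopandas as gpd
from shapely.geometry import Point

points = gpd.GeoDataFrame(
    {"name": ["a", "b", "c"]},
    geometry=[Point(0.5, 0.5), Point(3.0, 0.0), Point(0.2, -0.4)],
)
zone = Point(0, 0).buffer(1.0)  # circular zone of radius 1 around the origin
print(points[points.within(zone)])  # expect points "a" and "c"

Even a task this small mixes library knowledge (constructing a GeoDataFrame) with spatial reasoning (buffering and containment), which is where the evaluated models falter.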
As geospatial applications expand in industries such as urban planning and environmental science, reliable AI assistance is becoming increasingly important.Key FindingsLLMs underperform in geospatial tasks: Models like Code Llama and Starcoder2 show reduced accuracy compared to their performance in general coding.Starcoder2-7B leads but struggles: It achieved a pass@1 score of 32.47%, highlighting the difficulty of geospatial tasks even for top-performing models.Complex tasks pose a challenge: Single-step tasks had a 45.45% pass@1 success rate, but multi-step tasks were far more difficult, scoring only 15.15%.Data format matters: Models handled GeoDataFrames better than other formats like GeoJSON, showing varying levels of tool proficiency.Limited tool support: Libraries like MovingPandas and OSMNX, crucial for geospatial analysis, were inadequately supported by the models.What This Means for YouThis study is relevant for geospatial programmers and data scientists seeking to automate coding tasks. Current LLMs are not yet reliable for complex geospatial tasks, highlighting a need for models specifically trained for the domain. Developers and researchers can benefit by focusing on improving AI models to better support geospatial data science workflows.Examining the DetailsThe authors created a benchmark dataset categorising tasks by complexity, data format, and tool usage. The dataset includes 77 samples to test LLM performance on tasks like spatial reasoning and tool implementation. Evaluation metrics focused on accuracy and pass@1, with the results highlighting the models' struggles in handling geospatial problems. Libraries like GeoPandas and H3 were used to evaluate the models, while more complex tools like MovingPandas exposed the models' weaknesses.This rigorous benchmark, publicly available for future research, sets a foundation for improving geospatial code generation in LLMs. The study’s methodology ensures it reflects real-world geospatial coding challenges, offering valuable insights for the development of more domain-specific AI tools.You can learn more by reading the entire paper and accessing the benchmark dataset: geospatial-code-llms-dataset.🧠 Expert insight💥Here’s an excerpt from “Chapter 3: An Introduction to Web Security with Python” in the book, Offensive Security Using Python by Rejah Rehim and Manindar Mohan, published in September 2024.Proactive web security measures with PythonPython has developed as a versatile widely used programming language in the field of modern software development. Its ease of use, readability, and rich library support have made it a popular choice for developingweb-based applications in a variety of industries. Python frameworks such as Django, Flask, and Pyramid have enabled developers to create dynamic and feature-rich web applications with speed and agility.However, as Python web apps become more popular, there is a corresponding increase in the sophistication and diversity of attacks targeting these applications. Cybersecurity breaches can jeopardize valuable user data, interfere with corporate operations, and damage an organization’s brand. Python web applications become vulnerable to a variety of security vulnerabilities, including SQL injection, XSS, andcross-site request forgery(CSRF). The consequences of these vulnerabilities can be severe, demanding an effectivecybersecurity strategy.Developers must be proactive to counteract this. 
By implementing security practices such as input validation, output encoding, and other secure coding guidelines early in the development lifecycle, developers can reduce the attack surface and improve the resilience of their Pythonweb applications.Although we are only discussing Python-based applications here, these practices are universal and should be implemented in web applications built with anytechnology stack.To protect against a wide range of cyber threats, it is critical to implement strong best practices. This section explains key security practices that developers should follow while developingweb apps.Input validation and data sanitizationUserinput validationis essential for preventing code injection attacks. Malicious inputs can exploit vulnerabilities and cause unwanted commands to be executed. Properdata sanitizationguarantees that user inputs are handled as data rather than executable code by eliminating or escaping special characters. Using libraries such asinput()and frameworks such as Flask’srequestobject can help validate and sanitizeincoming data.Secure authentication and authorizationRestricting unauthorized access requires effective authentication and authorization procedures. Password hashing, which uses algorithms such asbcryptorArgon2, adds an extra degree of security by ensuring that plaintext passwords are never saved.Two-factor authentication(2FA) adds an additional verification step to user authentication, increasing security.Role-Based Access Control(RBAC) allows developers to provide specific permissions to different user roles, guaranteeing that users only access functionality relevant totheir responsibilities.Secure session managementKeeping user sessions secure is critical for avoiding session fixation and hijacking attempts. Using secure cookies with theHttpOnlyandSecurecharacteristics prohibits client-side script access and ensures that cookies are only sent over HTTPS. Session timeouts and measures such as session rotation can improve session securityeven further.Secure coding practicesFollowing secure coding practices reduces a slew of possible vulnerabilities. Parameterized queries, made possible by libraries such assqlite3, protect against SQL injection by separating data from SQL commands. Output encoding, achieved with techniques such ashtml.escape(), avoids XSS threats by converting user inputs to innocuous text. Similarly, omitting functions such aseval()andexec()avoids uncontrolled code execution, lowering the likelihood of codeinjection attacks.Implementing security headersSecurity headersare a fundamental component of web application security. They are HTTP response headers that provide instructions to web browsers, instructing them on how to behave when interacting with the web application. Properly configured security headers can mitigate various web vulnerabilities, enhance privacy, and protect against commoncyber threats.Here is an in-depth explanation of implementing security headers to enhance webapplication security:Content Security Policy (CSP): CSP is a security feature that helps prevent XSS attacks. By defining and specifying which resources (scripts, styles, images, etc.) can be loaded, CSP restricts script execution to trusted sources. Implementing CSP involves configuring theContent-Security-Policy HTTP header in your web server. This header helps prevent inline scripts and unauthorized script sources from being executed, reducing the risk of XSS attacks significantly. 
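In a Python web app, such a policy can be attached to every response from a single place. A minimal sketch assuming Flask; the chapter discusses the header itself, while this wiring is my own illustration:

# Attach a Content Security Policy to every Flask response
from flask import Flask

app = Flask(__name__)

@app.after_request
def set_csp(response):
    # Restrict all resources to the same origin by default
    response.headers["Content-Security-Policy"] = "default-src 'self'"
    return response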
Implementing security headers

Security headers are a fundamental component of web application security. They are HTTP response headers that provide instructions to web browsers, instructing them on how to behave when interacting with the web application. Properly configured security headers can mitigate various web vulnerabilities, enhance privacy, and protect against common cyber threats.

Here is an in-depth explanation of implementing security headers to enhance web application security:

- Content Security Policy (CSP): CSP is a security feature that helps prevent XSS attacks. By defining and specifying which resources (scripts, styles, images, etc.) can be loaded, CSP restricts script execution to trusted sources. Implementing CSP involves configuring the Content-Security-Policy HTTP header in your web server. This header helps prevent inline scripts and unauthorized script sources from being executed, reducing the risk of XSS attacks significantly. An example of the CSP header is as follows:

    Content-Security-Policy: default-src 'self'; script-src 'self' www.google-analytics.com;

- HTTP Strict Transport Security (HSTS): HSTS is a security feature that ensures secure, encrypted communication between the web browser and the server. It prevents Man-in-the-Middle (MITM) attacks by enforcing the use of HTTPS. Once a browser has visited a website with HSTS enabled, it will automatically establish a secure connection for all future visits, even if the user attempts to access the site via HTTP. An example HSTS header is as follows:

    Strict-Transport-Security: max-age=31536000; includeSubDomains; preload;

- X-Content-Type-Options: The X-Content-Type-Options header prevents browsers from interpreting files as a different media type, also known as a Multipurpose Internet Mail Extensions (MIME) type. It mitigates attacks such as MIME sniffing, where an attacker can trick a browser into interpreting content in an unintended way, potentially leading to security vulnerabilities. An example X-Content-Type-Options header is as follows:

    X-Content-Type-Options: nosniff

- X-Frame-Options: The X-Frame-Options header prevents clickjacking attacks by denying the browser permission to display a web page in a frame or iframe. This header ensures that your web content cannot be embedded within malicious iframes, protecting against UI redressing attacks. An example X-Frame-Options header is as follows:

    X-Frame-Options: DENY

- Referrer-Policy: The Referrer-Policy header controls what information is included in the Referrer header when a user clicks on a link that leads to another page. By setting an appropriate referrer policy, you can protect sensitive information, enhance privacy, and reduce the risk of data leakage. An example Referrer-Policy header is as follows:

    Referrer-Policy: strict-origin-when-cross-origin
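As an illustration of wiring these headers into a Python web app, here is a minimal Flask sketch of ours (not from the book) that attaches them to every response with an after_request hook; the policy values mirror the examples above, and in production such headers are often set at the reverse proxy instead.

```python
from flask import Flask

app = Flask(__name__)

# Header values mirroring the examples discussed above.
SECURITY_HEADERS = {
    "Content-Security-Policy":
        "default-src 'self'; script-src 'self' www.google-analytics.com;",
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "strict-origin-when-cross-origin",
}

@app.after_request
def set_security_headers(response):
    # Attach the headers to every outgoing response.
    for name, value in SECURITY_HEADERS.items():
        response.headers[name] = value
    return response

@app.route("/")
def index():
    return "Hello, secure world!"

if __name__ == "__main__":
    app.run()
```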
Packt library subscribers can continue reading the entire book for free. You can buy Offensive Security Using Python here.

Get the eBook for $39.99 $27.98!
Get the Print Book for $49.99 $34.98!

Other Python titles from Packt at 30% off:

Get the eBook for $39.99 $27.98!
Get the eBook for $35.99 $24.99!
Get the eBook for $27.99 $18.99!

And that's a wrap. We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
PythonPro #60: XGBoost Multithreading, Python 3.14 Updates, REST API Frameworks, and Quantum ML with sQUlearn

Divya Anne Selvaraj
28 Jan 2025
Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#60: XGBoost Multithreading, Python 3.14 Updates, REST API Frameworks, and Quantum ML with sQUlearn

Hi ,

Happy New Year! We're back! Did you try your hand at any exciting Python projects over the holidays that you'd like to share? If so, reply to this email and let me know. If it's brilliant, we'll share what you made in next week's issue.

In today's Expert Insight we bring you an excerpt from the recently published book, XGBoost for Regression Predictive Modeling and Time Series Analysis, which demonstrates XGBoost's multithreaded capabilities, showing how adjusting the nthread parameter can significantly accelerate model training by utilizing multiple CPU cores, as illustrated through a practical example with the California housing dataset.

News Highlights: Python 3.14.0 alpha 4 introduces features like PEP 649 for deferred annotations and improved error messages; Python wins Tiobe's Programming Language of the Year 2024 with a 9.3% popularity surge; and a new PEP proposes SBOMs for better package security and dependency tracking.

My top 5 picks from today's learning resources:

- A technical intro to Ibis: The portable Python DataFrame library🐦
- How to Split a Python List or Iterable Into Chunks🍰
- Build a chatbot web app under 5min in Python🤖
- Five Key Lessons for Google Earth Engine Beginners🌍
- Choosing your Python REST API framework🔧

And, in From the Cutting Edge, we introduce sQUlearn, a Python library for quantum machine learning that integrates seamlessly with classical tools like scikit-learn, offering high-level APIs, low-level customisation, and robust support for NISQ devices.

Stay awesome!

Divya Anne Selvaraj
Editor-in-Chief

Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

- Python 3.14.0 alpha 4 is out: This early developer preview showcases new features like PEP 649 (deferred annotation evaluation), PEP 741 (Python configuration C API), and improved error messages.
- Python wins Tiobe language of the year honors: Python won Tiobe's Programming Language of the Year 2024 with a 9.3% popularity increase, surpassing other languages like Java and Go.
- Software bill-of-materials (SBOMs) docs eyed for Python packages: A new Python Enhancement Proposal (PEP) suggests incorporating SBOM documents into Python packages to improve dependency tracking and vulnerability analysis.

💼Case Studies and Experiments🔬

- A Scheme for Network Programmability and Backup Automation Using Python Netmiko Library on Cisco; the Case Study of the Komfo Anokye Teaching Hospital Local Area Network: Presents a Python-based framework that addresses the inefficiencies of manual processes, achieving a 99% reduction in backup time, a 100% success rate, and enhanced resource utilization.
- Change Python's syntax with the "# coding:" trick: Demonstrates a playful yet risky manipulation of Python's behavior by creating a codec that converts braces into indentation, enabling alternative syntax like using {} for blocks.

📊Analysis

- A technical intro to Ibis: The portable Python DataFrame library: Introduces Ibis, which simplifies multi-backend data workflows through lazy evaluation, backend-agnostic code, and seamless backend translation.
- A Deeper Look into Node.js Docker Images: Help, My Node Image Has Python!: Analyzes various Node.js Docker images, comparing their sizes, security vulnerabilities, and use cases.

🎓Tutorials and Guides🤓
- Embedding Python in Rust (for tests): Covers setting up a Python interpreter within a Rust project, exposing Rust functions to Python, handling dynamic types, and building a test runner.
- How to Visualize your Python Project's Dependency Graph: Explains how to visualize a Python project's dependency graph using the Tach tool to define module boundaries, sync dependencies, and visualize the graph in a browser or with GraphViz.
- Build a chatbot web app under 5min in Python: Provides a beginner-level tutorial for building a chatbot web app in Python using Dash, Dash-Chat, and OpenAI's GPT models.
- Deploying the Python Masonite framework on Lambda: Covers creating a Lambda function, configuring the lambda_handler, and setting up CI/CD with GitHub Actions to automate deployments.
- The Shortest Python `import` Tutorial • A Picture Story: Explains the three main ways of importing in Python: importing the whole module, importing specific items, and importing everything with a wildcard.
- How to Split a Python List or Iterable Into Chunks: Techniques covered include using Python's standard library (itertools.batched()), third-party packages like more_itertools and NumPy, and custom implementations.
- Nine Pico PIO Wats with MicroPython (Part 1): Explores surprising behaviors ("Wats") in Raspberry Pi Pico's Programmable Input/Output (PIO) subsystem through a musical theremin project using MicroPython.

🔑Best Practices and Advice🔏

- Choosing your Python REST API framework: Evaluates popular frameworks like FastAPI, Django REST framework, Flask-RESTX, Sanic, and Tornado, offering guidance for selecting the most suitable framework.
- The Storeroom: Introduces a fix to the White Room analogy, a teaching method for explaining Python's variable handling and namespaces, and addresses the analogy's limitation in representing multiple references to the same object.
- Python's Mutable vs Immutable Types: What's the Difference?: Discusses key concepts like object identity, type, and value, along with common pitfalls, such as aliasing variables, mutating function arguments, and using mutable default values.
- Five Key Lessons for Google Earth Engine Beginners: Offers tips illustrated with real-world examples, including calculating water balance and drought in a water basin in Ecuador.
- Using Tree Sitter to extract insights from your code and drive your development metrics: Covers three methods for analyzing code: textual matching, syntax linting, and AST traversal, emphasizing the advantages of the latter for accuracy and extracting node values.

🔍From the Cutting Edge: sQUlearn – A Python Library for Quantum Machine Learning💥

In the paper, "sQUlearn – A Python Library for Quantum Machine Learning," Kreplin et al. introduce sQUlearn, a Python library for quantum machine learning (QML), designed to integrate seamlessly with classical machine learning tools like scikit-learn.

Context

Quantum Machine Learning (QML) combines quantum computing and machine learning to harness quantum principles for computational efficiency and enhanced algorithmic capabilities. However, many current QML tools demand in-depth quantum computing expertise. Noisy Intermediate-Scale Quantum (NISQ) devices, while promising, pose significant challenges due to their limitations in handling deep quantum circuits.
To bridge these gaps, sQUlearn focuses on NISQ-compatibility, usability, and integration with classical ML tools, particularly scikit-learn.

Key Features

sQUlearn offers:

- High-Level Interfaces: Provides scikit-learn-compatible APIs for quantum kernel methods (e.g., quantum SVMs) and quantum neural networks (QNNs) for classification and regression tasks (the general pattern is sketched after this section).
- Low-Level Functionalities: Offers tools for designing quantum circuits, customising encodings, and performing advanced differentiation for QML research.
- Quantum Kernel Methods: Supports fidelity-based and projected quantum kernels (FQK and PQK) for enhanced data embedding and efficient computation.
- Flexible Execution: Enables seamless transitions between simulations and real quantum hardware using Qiskit and PennyLane backends.
- Automation Features: Includes session management, result caching, error handling, and automatic restarts to simplify quantum experiment execution.
- Customisation Options: Allows users to create and modify data encoding strategies, observables, and outer kernels for tailored solutions.

What This Means for You

sQUlearn simplifies quantum machine learning for both researchers and practitioners. For researchers, it offers a flexible low-level framework for exploring novel QML algorithms and quantum circuit designs. For practitioners, it simplifies the deployment of QML solutions with minimal quantum-specific knowledge via high-level interfaces and pre-built models using familiar tools like scikit-learn.

Examining the Details

sQUlearn's dual-layer architecture enables flexibility, with high-level APIs for seamless integration into machine learning workflows and low-level tools for advanced customisation. The Executor module centralises quantum job execution, handling retries, caching results, and transitioning between simulation and real hardware. It supports quantum kernel methods and neural networks while addressing noise challenges on quantum devices through built-in regularisation techniques. This focus on automation and robustness ensures the library is both reliable for practical applications and adaptable for research needs.

You can learn more by reading the entire paper or accessing the library on GitHub.
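As a rough illustration of the scikit-learn pattern that sQUlearn's high-level estimators follow, here is a purely classical stand-in of ours (it does not use sQUlearn's actual API): a kernel machine consumes a precomputed kernel matrix, and a quantum kernel method such as an FQK would fill that matrix with fidelities between quantum-encoded data points instead of the RBF values used here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=40, n_features=4, random_state=0)

def rbf_kernel_matrix(A, B, gamma=0.5):
    # Classical stand-in; a fidelity quantum kernel would instead hold
    # k(x, x') = |<phi(x)|phi(x')>|^2 estimated on a quantum backend.
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

K_train = rbf_kernel_matrix(X, X)

# SVC accepts a precomputed Gram matrix; this is the plug-in point
# where a kernel (quantum or classical) enters the sklearn workflow.
clf = SVC(kernel="precomputed").fit(K_train, y)
print(f"Training accuracy: {clf.score(K_train, y):.2f}")
```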
🧠 Expert insight💥

Here's an excerpt from "Chapter 13: Deploying Your XGBoost Model" in the book, XGBoost for Regression Predictive Modeling and Time Series Analysis by Partha Pritam Deka and Joyce Weiner.

Using XGBoost's multithreaded features

XGBoost has built-in support for multithreaded computing, which allows you to speed up model training by utilizing multiple CPU cores. You can control this by setting the nthread parameter, which determines the number of threads to use. By default, XGBoost will automatically use the maximum number of available threads.

It's important to note that if you're using Dask, any value you set for nthread within XGBoost will take precedence over Dask's default configuration. The following example demonstrates how the multithreading parameter works. We'll revisit the California housing dataset that you worked with in Chapter 4:

1. Create a Python file to demonstrate XGBoost's multithreaded functionality. We've started with a header and named the file multithreaded.py.
2. Import the necessary modules. You can load the California housing dataset from scikit-learn (sklearn). You'll also be using pandas, numpy, a module called time to track how long code execution takes, and, of course, xgboost:

    import pandas as pd
    import numpy as np
    import time
    import xgboost as xgb
    from sklearn.metrics import r2_score
    from sklearn import datasets
    from sklearn.model_selection import train_test_split

3. Now, you can load in the California housing dataset and perform the train-test split using scikit-learn, as you did previously:

    housingX, housingy = datasets.fetch_california_housing(
        return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(
        housingX, housingy, test_size=0.2, random_state=17)

4. Previously, you used the scikit-learn interface for XGBoost. In this example, you'll use the XGBoost API for Python. One difference is that XGBoost uses a data structure called a DMatrix to manipulate data. So, the first thing you need to do is convert the dataset from numpy or pandas form into DMatrix form by using the DMatrix function and passing in the data and the labels. In this case, we'll be using dtrain = xgb.DMatrix(X_train, y_train) for the training dataset; do the same for the test dataset:

    dtrain = xgb.DMatrix(X_train, y_train)
    dtest = xgb.DMatrix(X_test, y_test)

Now, the data is in a format that XGBoost can manipulate with efficiency. As mentioned in Chapter 3, XGBoost does some sorting and performs other operations on the dataset to speed up execution.

5. At this point, you're ready to train a model using the XGBoost API and the multithreading feature. By default, XGBoost uses the maximum number of threads available. To see the difference, train the model with just two threads, and then increase to the maximum number of logical processors you have in your computer. You'll need to use the time module to get the computation time and print it out so that you can compare the results. First, save the start time with the following line of code:

    train_start = time.time()

6. You can set the training parameters for XGBoost by creating a dictionary with the parameters as key-value pairs. You can configure all the parameters listed in the Hyperparameters section of Chapter 5. Here, set eta = 0.3 (the learning rate), booster = gbtree, and nthread = 2:

    param = {"eta": 0.3, "booster": "gbtree", "nthread": 2}

7. Now that the training parameters have been set, you can train the model and save the end of the execution time by using the following code:

    housevalue_xgb = xgb.train(param, dtrain)
    train_end = time.time()

8. Print the execution time with a formatted print statement while subtracting train_start from train_end and converting it into milliseconds by multiplying by 10³:

    print("Training time with 2 threads is :{0:.3f}".format(
        (train_end - train_start) * 10**3), "ms")

9. Now, repeat the code and increase the number of threads XGBoost uses by changing the value of nthread. Since our computer has eight logical processors, I've chosen 8:

    train_start = time.time()
    param = {"eta": 0.3, "booster": "gbtree", "nthread": 8}
    housevalue_xgb = xgb.train(param, dtrain)
    train_end = time.time()
    print("Training time with 8 threads is :{0:.3f}".format(
        (train_end - train_start) * 10**3), "ms")

10. To ensure the model is working as expected, you can make a prediction and check the R² value. You can also time the prediction.
To make a prediction with the Python API, just call the predict method on your model and pass the test dataset:

    pred_start = time.time()
    ypred = housevalue_xgb.predict(dtest)
    pred_end = time.time()
    print("Prediction time is :{0:.3f}".format(
        (pred_end - pred_start) * 10**3), "ms")
    xgb_r2 = r2_score(y_true=y_test, y_pred=ypred)
    print("XGBoost Rsquared is {0:.2f}".format(xgb_r2))

Running this script results in the following output. Please note that the execution time on your computer will be different:

    Training time with 2 threads is :237.088 ms
    Training time with 8 threads is :130.723 ms
    Prediction time is :2.012 ms
    XGBoost Rsquared is 0.76

On our computer, going from two to eight threads sped up training by over 44%. This demonstrates the benefit XGBoost provides with multithreading. Recall that by default, it will use the maximum number of threads available. Next, you'll learn about using XGBoost with distributed compute by using Dask on Linux.
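If you work through the scikit-learn interface instead of the native API, the same thread control is exposed as the n_jobs estimator parameter. This comparison loop is our own sketch (not from the book) and assumes the X_train and y_train splits created above:

```python
import time
import xgboost as xgb

for n_jobs in (2, 8):
    model = xgb.XGBRegressor(
        learning_rate=0.3,   # plays the role of eta in the native API
        booster="gbtree",
        n_jobs=n_jobs,       # thread count, like nthread above
    )
    start = time.time()
    model.fit(X_train, y_train)
    elapsed_ms = (time.time() - start) * 10**3
    print(f"Training time with {n_jobs} threads: {elapsed_ms:.3f} ms")
```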
XGBoost for Regression Predictive Modeling and Time Series Analysis was published in December 2024.

Get the eBook for $39.99 $27.98
Get the Print Book for $49.99

And that's a wrap. We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!
PythonPro #45: Outlier Detection with Boxplots, Python 3.13 Updates, and Stripe Integration for Django

Divya Anne Selvaraj
10 Sep 2024
Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#46: Outlier Detection with Boxplots, Python 3.13 Updates, and Stripe Integration for Django

Hi ,

Welcome to a brand new issue of PythonPro!

In today's Expert Insight we bring you an excerpt from the recently published Python Feature Engineering Cookbook - Third Edition, which discusses using boxplots and the inter-quartile range (IQR) proximity rule to visualize outliers in data distributions.

Related Titles

- Covers numerous tools for mastering visualization including NumPy, Pandas, SQL, Matplotlib, and Seaborn; includes an introductory chapter on Python 3 basics; features companion files with numerous Python code samples and figures. Get the eBook for $54.99 $37.99!
- Explores cutting-edge techniques using ChatGPT/GPT-4 in harmony with Python for generating visuals that tell more compelling data stories; tackles actual data scenarios and builds your expertise as you apply learned concepts to real datasets. Get the eBook for $54.99 $37.99!
- Covers Python-based data visualization libraries and techniques; includes practical examples and Gemini-generated code samples for efficient learning; integrates Google Gemini for advanced data visualization capabilities. Get the eBook for $51.99 $35.99!

News Highlights: Python 3.13.0rc2 released with new interpreter, free-threaded build, JIT, and incremental garbage collection; Python survey shows pip dominance, rising interest in Conda, Poetry, and uv; and PSF expands CNA role to cover Pallets Projects like Flask and Jinja.

Here are my top 5 picks from our learning resources today:

- Breaking Bell's Inequality with Monte Carlo Simulations in Python🔗
- Python QuickStart for People Learning AI🤖
- Integrating Stripe Into A One-Product Django Python Shop🛒
- Python HTTP Clients - Requests vs. HTTPX vs. AIOHTTP🌐
- A comparison of hosts / providers for Python serverless functions (a.k.a. FaaS)☁️

And, today's Featured Study explores how ChatGPT can automate and streamline Python-based federated learning algorithm development, reducing human effort and improving coding efficiency.

Stay awesome!

Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month's survey is live.
Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.

Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

- Python 3.13.0rc2 released: This version introduces several major features such as a new interactive interpreter, an experimental free-threaded build mode, preliminary JIT for performance, and incremental garbage collection.
- Packaging Trends in Python: Highlights from the 2023 Developer Survey: Results show a strong preference for pip, with emerging interest in Conda and Poetry, and a new player, uv.
- Python Software Foundation (PSF) Expands CNA Scope to Include Pallets Projects: The PSF has expanded its CVE Numbering Authority role to include Pallets Projects like Flask and Jinja, ensuring better vulnerability management.

💼Case Studies and Experiments🔬

- Lessons learnt building a real-time audio application in Python: Key learnings covered include accepting inherent latency issues, leveraging modern operating systems' efficient memory management, and utilizing web browsers as effective interfaces for real-time applications.
- Breaking Bell's Inequality with Monte Carlo Simulations in Python: Discusses the use of Monte Carlo simulations in Python to challenge Bell's inequality through a quantum mechanics game.

📊Analysis

- Rust for the small things?... but what about Python?: Explores the enduring relevance of Python in data engineering, despite the allure of Rust for performance and safety.
- Multiversion Python Thoughts: Delves into the complexities of implementing multi-version package imports in Python, motivated by the desire to handle incompatible library versions concurrently.

🎓Tutorials and Guides🤓

- Python QuickStart for People Learning AI: Covers Python fundamentals, including data types, loops, and functions, and provides a concrete AI project example using the OpenAI API for summarizing research papers.
- Lists vs Tuples in Python: Explores the characteristics, uses, and differences between lists and tuples in Python, emphasizing their ordered nature, content diversity, mutability, and appropriate usage scenarios.
- Layman's Guide to Python Built-in Functions: Simplifies Python's built-in functions for beginners, providing plain English explanations and straightforward examples.
- 🎥Some tricks with UV: Demonstrates how UV not only facilitates quicker installations but also supports running Python scripts with on-the-fly dependency management.
- Python 3 Module of the Week: A series of articles detailing diverse library functionalities ranging from text handling, data structures, and algorithms to more complex areas like cryptography and network communication.
- Integrating Stripe Into A One-Product Django Python Shop: Part two of a series on creating a one-product shop using Django, htmx, and Stripe. Covers creating a Stripe account, defining a product, and configuring a webhook for transaction notifications.
- Practical Introduction to Polars: Compares Polars' key functionalities with Pandas, offering practical examples to help users transition from Pandas to Polars for more efficient data analysis.

🔑Best Practices and Advice🔏

- Understanding Python's __new__ Method Through a Magical Example: Introduces Python's lesser-known .__new__() method, used for creating instances before they're initialized with .__init__().
- Some fun with Python Enum: Explores the Enum class introduced in Python 3.4, detailing its benefits over using literal types for type-safety and avoiding errors in code.
- A comparison of hosts / providers for Python serverless functions (a.k.a. FaaS): Discusses various providers that support Python, their development experience (DevEx), pricing models, runtime limits, and other platform products.
- Python HTTP Clients - Requests vs. HTTPX vs. AIOHTTP: Details each library's strengths and appropriate use cases, helping developers choose the right tool based on project needs.
- Shades of testing HTTP requests in Python: Covers different techniques including mocking with AsyncMock and respx, parameterizing HTTP clients for flexible testing setups, and using integration tests with a Starlette server.

🔍Featured Study: Streamlining Federated Learning with Python and ChatGPT💥

In "PTB-FLA Development Paradigm Adaptation for ChatGPT," Popovic et al. explore how AI can be used to streamline the development of federated learning algorithms (FLAs). The study adapts a Python-based development paradigm to leverage ChatGPT for improved speed and efficiency in coding for machine learning tasks.

Context

Federated Learning (FL) allows machine learning algorithms to train across decentralized data sources, such as edge devices, without sharing the raw data. PTB-FLA is a Python framework designed to ease this process by providing a structured way for developers to create these algorithms. Traditionally, this has required significant human input. With ChatGPT, the authors of this paper aimed to reduce human effort by automating much of the coding work. This study is important because it shows how LLMs can help build complex systems like FL algorithms, particularly in environments such as edge computing, where efficiency and reduced human oversight are key.

Key Findings

- The adapted four-phase paradigm reduced human labour by 50%, achieving double the speed of the original development method.
- A new two-phase paradigm further streamlined the process, cutting human effort by 6 times compared to the original approach.
- ChatGPT-generated code was of higher quality, showing fewer errors compared to human-generated versions in comparable tasks.
- The study demonstrated a significant reduction in costs by reducing the size of ChatGPT prompts by 2.75 times.
- Both adapted paradigms were successfully validated using logistic regression as a case study for federated learning.

What This Means for You

If you work with machine learning, particularly in decentralized systems like IoT or edge computing, this research is highly relevant. Using ChatGPT to develop federated learning algorithms can save you substantial time by automating coding tasks that would otherwise require significant effort. By adopting the two-phase paradigm, developers can expect faster, more efficient development cycles, allowing you to focus on innovation rather than repetitive coding.
This also reduces costs when using AI-assisted tools like ChatGPT, as it optimises the prompt size.

Examining the Details

The study's methodology revolves around adapting an existing four-phase development process for federated learning into two paradigms tailored for ChatGPT. The original phases involved creating sequential code, transforming it into federated code, incorporating callbacks, and generating the final PTB-FLA code. The new two-phase paradigm simplifies this further by merging phases, allowing ChatGPT to generate the final federated code directly from the sequential code, bypassing intermediary steps. The team validated both paradigms through a case study using logistic regression. They iteratively refined the ChatGPT prompts to find the minimal context needed to achieve correct outputs, ensuring efficiency while maintaining code accuracy. The final results showed ChatGPT could develop high-quality code faster than humans, with far fewer resources.

You can learn more by reading the entire paper and accessing the PTB-FLA GitHub repository.

🧠 Expert insight💥

Here's an excerpt from "Chapter 5: Working with Outliers" in the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, published in August 2024.

Visualizing outliers with boxplots and the inter-quartile proximity rule

A common way to visualize outliers is by using boxplots. Boxplots provide a standardized display of the variable's distribution based on quartiles. The box contains the observations within the first and third quartiles, known as the Inter-Quartile Range (IQR). The first quartile is the value below which 25% of the observations lie (equivalent to the 25th percentile), while the third quartile is the value below which 75% of the observations lie (equivalent to the 75th percentile). The IQR is calculated as follows:

    IQR = 3rd quartile - 1st quartile

Boxplots also display whiskers, which are lines that protrude from each end of the box toward the minimum and maximum values and up to a limit. These limits are given by the minimum or maximum value of the distribution or, in the presence of extreme values, by the following equations:

    upper limit = 3rd quartile + IQR × 1.5
    lower limit = 1st quartile - IQR × 1.5

According to the IQR proximity rule, we can consider a value an outlier if it falls beyond the whisker limits determined by the previous equations. In boxplots, outliers are indicated as dots.

Note: If the variable has a normal distribution, about 99% of the observations will be located within the interval delimited by the whiskers. Hence, we can treat values beyond the whiskers as outliers. Boxplots are, however, non-parametric, which is why we also use them to visualize outliers in skewed variables.

In this recipe, we'll begin by visualizing the variable distribution with boxplots, and then we'll calculate the whiskers' limits manually to identify the points beyond which we could consider a value an outlier.

How to do it...

We will create boxplots utilizing the seaborn library.
1. Let's begin by importing the Python libraries and loading the dataset:

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import fetch_california_housing

2. Modify the default background from seaborn (it makes prettier plots, but that's subjective, of course):

    sns.set(style="darkgrid")

3. Load the California house prices dataset from scikit-learn:

    X, y = fetch_california_housing(
        return_X_y=True, as_frame=True)

4. Make a boxplot of the MedInc variable to visualize its distribution:

    plt.figure(figsize=(8, 3))
    sns.boxplot(data=X["MedInc"], orient="y")
    plt.title("Boxplot")
    plt.show()

In the following boxplot, we identify the box containing the observations within the IQR, that is, the observations between the first and third quartiles. We also see the whiskers. On the left, the whisker extends to the minimum value of MedInc; on the right, the whisker goes up to the third quartile plus 1.5 times the IQR. Values beyond the right whisker are represented as dots and could constitute outliers:

Figure 5.1 – Boxplot of the MedInc variable highlighting potential outliers on the right tail of the distribution

Note: As shown in Figure 5.1, the boxplot returns asymmetric boundaries denoted by the varying lengths of the left and right whiskers. This makes boxplots a suitable method for identifying outliers in highly skewed distributions. As we'll see in the coming recipes, alternative methods to identify outliers create symmetric boundaries around the center of the distribution, which may not be the best option for asymmetric distributions.

5. Let's now create a function to plot a boxplot next to a histogram:

    def plot_boxplot_and_hist(data, variable):
        f, (ax_box, ax_hist) = plt.subplots(
            2, sharex=True,
            gridspec_kw={"height_ratios": (0.50, 0.85)})
        sns.boxplot(x=data[variable], ax=ax_box)
        sns.histplot(data=data, x=variable, ax=ax_hist)
        plt.show()

6. Let's use the previous function to create the plots for the MedInc variable:

    plot_boxplot_and_hist(X, "MedInc")

In the following figure, we can see the relationship between the boxplot and the variable's distribution shown in the histogram. Note how most of MedInc's observations are located within the IQR box. MedInc's potential outliers lie on the right tail, corresponding to people with unusually high incomes:

Figure 5.2 – Boxplot and histogram – two ways of displaying a variable's distribution

...

How it works...

In this recipe, we used the boxplot method from Seaborn to create the boxplots and then we calculated the limits beyond which a value could be considered an outlier based on the IQR proximity rule.

In Figure 5.2, we saw that the box in the boxplot for MedInc extended from approximately 2 to 5, corresponding to the first and third quartiles (you can determine these values precisely by executing X["MedInc"].quantile(0.25) and X["MedInc"].quantile(0.75)). We also saw that the whiskers start at MedInc's minimum on the left and extend up to 8.013 on the right (we know this value exactly because we calculated it in step 8). MedInc showed values greater than 8.013, which were displayed in the boxplot as dots. Those are the values that could be considered outliers...
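To cross-check the whisker arithmetic from the How it works... discussion, the IQR proximity rule limits can be computed directly on the MedInc column. This short sketch is ours (not part of the recipe) and assumes the X DataFrame loaded above; the upper limit should come out at roughly 8.013:

```python
# Compute the IQR proximity rule limits for MedInc by hand.
q1 = X["MedInc"].quantile(0.25)
q3 = X["MedInc"].quantile(0.75)
iqr = q3 - q1

upper_limit = q3 + 1.5 * iqr   # ~8.013 for this dataset
lower_limit = q1 - 1.5 * iqr

print(f"Upper limit: {upper_limit:.3f}, lower limit: {lower_limit:.3f}")

# Observations beyond the right whisker under the IQR proximity rule.
outliers = X[X["MedInc"] > upper_limit]
print(f"{len(outliers)} potential outliers on the right tail")
```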
Packt library subscribers can continue reading the entire book for free. You can buy the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, here.

Get the eBook for $35.99 $24.99!

And that's a wrap. We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
PythonPro #61: Meta’s Llama Flaw, Codon’s NumPy Boost, and Web2Vec for Automated Website Analysis

Divya Anne Selvaraj
04 Feb 2025
Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#61: Meta's Llama Flaw, Codon's NumPy Boost, and Web2Vec for Automated Website Analysis

Hi ,

In today's Expert Insight we bring you an excerpt from the recently published book, Generative AI on Google Cloud with LangChain, which discusses how LLMs generate plausible but sometimes false responses (hallucinations), and demonstrates how structured prompting with LangChain can help mitigate the issue.

News Highlights: Meta's Llama flaw exposes AI servers to remote code execution via Python's pickle; Codon's 2025 update brings a faster, open-source NumPy with GPU support; Codegen, a Python library for code refactoring and analysis, is now open source.

My top 5 picks from today's learning resources:

- Decorator JITs - Python as a DSL⚡
- 100 Common Python Mistakes🐍
- 9 Statistical Distributions Every Data Scientist Should Know📊
- Create an Adaptive Customer Behavior Analytics Dashboard with Claude AI and Python📈
- Security and cryptography algorithms: A guide🔐

And, in From the Cutting Edge, we introduce Web2Vec, a Python library that automates website analysis by extracting over 200 structured parameters through web crawling and direct feature extraction, with potential applications in cybersecurity, SEO, and machine learning.

Stay awesome!

Divya Anne Selvaraj
Editor-in-Chief

Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

- Meta's Llama Framework Flaw Exposes AI Systems to Remote Code Execution Risks: The critical vulnerability exposed AI inference servers to remote code execution due to unsafe deserialization with Python's pickle module.
- Codon in 2025: New compiler-optimized NumPy implementation. Switching to an open source license: This update leverages Codon's multithreading, GPU capabilities, and compiler optimizations, achieving significant speed improvements over standard NumPy.
- Codegen is now open source: Codegen is a Python library for advanced code manipulation, enabling large-scale refactoring, pattern enforcement, and static analysis without requiring deep AST knowledge.

💼Case Studies and Experiments🔬

- How I Built a Python RL Trading Bot That Simulated 1150% Profit: Describes building a bot that identifies potential short squeezes using Financial Modeling Prep's Fail-to-Deliver and historical stock price data.
- An empirical study of developers' challenges in implementing Workflows as Code: A case study on Apache Airflow: Analyzes 1,000 Stack Overflow posts to categorize challenges into workflow definition, execution, environment setup, quality assurance, security, and optimization, identifying key pain points.

📊Analysis

- Observations: Using Python with DeepSeek-R1: Explores using DeepSeek-R1 for AI tasks, covering API integration, response streaming, and Retrieval-Augmented Generation (RAG) while analyzing its reasoning process.
- Decorator JITs - Python as a DSL: Covers AST-based, bytecode-based, and tracing JITs, showing how they convert Python code into optimized LLVM IR for execution.

🎓Tutorials and Guides🤓

- Security and cryptography algorithms: A guide: Covers block and stream ciphers, hashing, key exchange, public key encryption, and cryptographic libraries, including practical examples using Python.
- TLS and networking: Explains TLS handshakes, encryption, certificate verification, networking layers, HTTP protocols, sockets, firewalls, and secure app deployment. Also discusses mutual TLS, DNS, and network security.
- On Shared Birthdays (and a Bit on Pythagorean Triplets) • Pythonic Code: Demonstrates probability concepts with itertools, collections, and datetime; simulates birthday collisions, analyzes probability with brute force, and more.
- Make Sick Beats with Python: Explains how to build a simple drum machine in Python using the pygame library, covering setting up the environment, storing and playing sounds, representing music in code, and more.
- Create an Adaptive Customer Behavior Analytics Dashboard with Claude AI and Python: Demonstrates building a dashboard which analyzes uploaded CSV data, generates Python scripts, executes them, and creates visualizations.
- Nine Pico PIO Wats with MicroPython (Part 2): Through debugging techniques and real-world examples, it demonstrates workarounds for PIO limitations while building a theremin-like musical instrument.
- Managing Magento Configurations with PyGento: A Powerful Python CLI Tool for Developers: Explains how the tool integrates with PyGento, provides database access via SQLAlchemy, and automates tasks like searching, viewing, and editing Magento settings without using the admin panel.

🔑Best Practices and Advice🔏

- AI Python Libraries: A centralized resource listing 1,037 libraries with descriptions and use cases for AI development, covering deep learning, machine learning, NLP, and data science.
- Python Code for Automated Log Analysis & Alerting: Covers parsing system, firewall, and web server logs to detect suspicious activity like brute force attacks and malicious IPs.
- 9 Statistical Distributions Every Data Scientist Should Know: Introduces key statistical distributions, explains their characteristics, and provides practical examples.
- Crafting a Winning Conference Talk: Lessons from a PyCon US Reviewer: Provides guidance based on the author's experience as a PyCon US reviewer, outlining common mistakes and recommendations for writing a strong conference talk proposal.
- 100 Common Python Mistakes: Covers logic bugs, inefficient code, non-Pythonic practices, and best practices for readability and performance, using clear examples for each mistake.

🔍From the Cutting Edge: Web2Vec — A Python Library for Website-to-Vector Transformation💥

In "Web2Vec: A Python Library for Website-to-Vector Transformation," D. Frąszczak and E. Frąszczak introduce Web2Vec, a Python library that converts websites into structured vector representations. The library automates feature extraction from web pages, integrating over 200 parameters from website content and structure to enable efficient analysis.

Context

Website processing involves extracting and transforming web data for analysis. This includes web crawling, which systematically navigates websites to collect data, and web scraping, which extracts specific information from web pages. Website feature extraction identifies key attributes such as structure, security settings, and external references, while vectorisation converts unstructured data into numerical formats for computational analysis.

The internet hosts over a billion websites, with millions actively generating data daily. Extracting insights is essential for market research, cybersecurity, and machine learning. While many research papers discuss web data collection, they often rely on custom scripts, leading to inefficiencies. Existing services like WHOIS, SimilarWeb, and Google Search Index provide valuable data but restrict free API access, requiring users to parse raw HTML instead.
Web2Vec addresses these challenges with a unified, open-source solution for automated website data extraction and analysis.

Key Features

Web2Vec offers a structured approach to web data collection and analysis through:

- Automated Website Crawling – Uses Scrapy-based spiders to extract data from single pages or entire websites.
- Comprehensive Feature Extraction – Captures 211 parameters, including URL lexical features, HTML content, HTTP response details, SSL certificates, WHOIS data, and traffic metrics.
- Flexible Data Processing – Supports active crawling and pre-generated HTML snapshots.
- Cybersecurity Applications – Detects phishing sites, misinformation, and suspicious activity by integrating services like PhishTank and OpenPhish.
- Graph-Based Analysis – Visualises website relationships through network graphs for deeper insights.
- Open-Source & Extensible – Available on PyPI (pip install web2vec), with community support for updates and improvements.

What This Means for You

Web2Vec is a valuable tool for professionals and researchers working with web data. Data scientists can automate website feature extraction for large-scale analysis, while cybersecurity professionals can detect phishing and misinformation using structured data and threat intelligence services. SEO and marketing professionals can benefit from its ability to analyse rankings, metadata, and traffic sources. Developers and web scraping practitioners can replace custom scripts with Web2Vec's unified framework, reducing redundancy and improving efficiency.

Examining the Details

Web2Vec processes websites using two primary methods: direct extraction from pre-generated HTML snapshots and active web crawling with Scrapy-based spiders. It extracts 211 parameters across multiple categories, converting website characteristics into structured vector representations. The extracted data, stored in JSON format, includes details such as URL, title, HTML response, and status codes. Users can either crawl entire websites using the Web2VecSpider or extract specific parameters through direct method invocation. Web2Vec also supports visualising website relationships using directed graphs, making it particularly useful for cybersecurity applications such as phishing detection and misinformation analysis.

You can learn more by reading the entire paper or accessing the library on GitHub.

🧠 Expert insight💥

Here's an excerpt from "Chapter 3: Grounding Responses" in the book, Generative AI on Google Cloud with LangChain by Leonid Kuligin, Jorge Zaldívar, and Maximilian Tschochohei.

Overcoming hallucinations in LLMs

LLMs are trained on large amounts of publicly available data (please take a look at Appendix 1 for more details on how LLMs are trained). By design, they absorb the information they are explicitly given in the prompt and the information they have seen during the training.

By default, LLMs don't have access to any external information (except for what they have already memorized), and in most cases, they're autoregressive models (they predict output words or tokens one by one by looking at the previous input, and this limits their reasoning capabilities). We'll see some examples of how we can expand an LLM's reasoning capabilities with agentic workflows in Chapters 9, 10, and 11.

Simply put, LLMs will use the information from their training to respond to a prompt. LLMs effectively reproduce human language and, therefore, their answers sound very credible even if they are just a probabilistic continuation of the prompt.
This can be correct if the training corpus has the right information – or it can just be plausible, but factually wrong. If it's the latter, the LLM is said to hallucinate.

Let's take a simple example:

Prompt: Tell me about the black cucumbers.

Response:

    Black Cucumbers
    Black cucumbers, scientifically known as Cucumis sativus var. Nigra, are an intriguing variety of cucumber characterized by their unique dark green to black skin. They are not genetically modified but instead occur naturally due to a higher concentration of chlorophyll.
    Origin and Cultivation
    Black cucumbers are native to Europe and Asia, particularly Russia and China. They are grown in warm, moist climates and require well-drained soil rich in organic matter. The fruits typically mature within 60-70 days after planting.
    Physical Characteristics
    Skin: Deep green to black, with occasional lighter... (The response was truncated because it has reached the token limit. Try to increase the token limit if you need a longer response.)

At least to my knowledge, black cucumbers don't exist. But the LLM has produced a response based on a description of actual cucumbers that looks plausible and real. In the summer of 2023, a US lawyer used an LLM to answer legal questions. The LLM cited non-existing cases, but they looked so trustworthy that the lawyer used them in court and got into problems [1].

Sometimes hallucinations can be overcome by prompt engineering. Let's look at the following prompt template:

    from langchain.chains import LLMChain
    from langchain.prompts.prompt import PromptTemplate
    from langchain_google_vertexai import VertexAI

    llm = VertexAI(model_name="gemini-1.0-pro", temperature=0.8,
                   max_output_tokens=128)

    template = """Describe {plant}.
    First, think whether {plant} exist.
    If {plant} don't exist, answer "I don't have enough information about {plant}".
    Otherwise, give their title, a short summary and then talk about origin and cultivation.
    After that, describe their physical characteristics."""

    prompt_template = PromptTemplate(
        input_variables=["plant"],
        template=template,
    )

    chain = LLMChain(llm=llm, prompt=prompt_template)
    chain.run(plant="black cucumbers")

If we run this chain, we'll get a relevant answer:

    I don't have enough information about black cucumbers.

You can double-check and ask the question about green cucumbers to make sure that the LLM will give a correct answer with this prompt.
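One editorial note from us, not the book: LLMChain is a legacy interface, and recent LangChain releases compose the same chain with the LangChain Expression Language (LCEL) pipe operator instead. A minimal sketch, assuming the prompt_template and llm objects defined above and a recent LangChain version:

```python
# LCEL-style composition: the prompt template pipes into the model.
chain = prompt_template | llm

# invoke() replaces run(); inputs are passed as a dict.
print(chain.invoke({"plant": "black cucumbers"}))
```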
Hallucinations

Hallucinations are one of the key problems that the industry is facing at the moment.

The good news: There are ways to significantly reduce hallucination rates, and we're going to discuss them in this and the next chapters.

The bad news: Any generative AI (GenAI) might produce hallucinations, and you need to evaluate and monitor them during application development. We'll talk about evaluation in Chapter 14.

Generative AI on Google Cloud with LangChain was published in December 2024. Packt library subscribers can continue reading the entire book for free.

Get the eBook for $35.99 $24.99
Get the Print Book for $44.99

And that's a wrap. We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!