PythonPro

12 Nov 2024

PythonPro #55: Python Project Packaging Guide, AWS Credentials Theft Alert, and PyTorch 2 Speeds Up ML


Bite-sized actionable content, practical tutorials, and resources for Python programmers.

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, LLM Engineer's Handbook, which discusses comprehensive RAG evaluation through the Ragas and ARES frameworks.

News Highlights: Malicious Python package "Fabrice" on PyPI has been stealing AWS credentials by mimicking Fabric; and PyTorch 2 boosts ML speeds with dynamic bytecode transformation, achieving 2.27x inference and 1.41x training speedups on NVIDIA A100 GPUs.

My top 5 picks from today’s learning resources:
A Comprehensive Guide to Python Project Management and Packaging📦
20 Python scripts to automate common daily tasks🔄
Principal Component Analysis with Python (A Deep Dive) - Part 1📉
State of Python 3.13 Performance: Free-Threading🚀
How I Got Started Making Maps with Python and SQL🗺️

And today’s Featured Study introduces Magentic-One, a generalist multi-agent AI system developed by Microsoft Research, designed to coordinate specialised agents in tackling complex, multi-step tasks across diverse applications.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month's survey is now live.
Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
Malicious Python package collects AWS credentials via 37,000 downloads: The malicious Python package "Fabrice" has been typosquatting on PyPI since 2021, gathering AWS credentials by imitating the popular Fabric SSH library.
🎥PyTorch 2 Speeds Up ML with Dynamic Python Bytecode Transformation and Graph Compilation: Fully backward-compatible, the version achieves a 2.27x inference speedup and 1.41x training speedup on NVIDIA A100 GPUs, surpassing six other compilers across 180+ models.

💼Case Studies and Experiments🔬
LangGraph for Healthcare: A Comprehensive Technical Guide: Delves into the technical aspects of integrating LangGraph into healthcare workflows, emphasizing its potential to improve patient care, streamline administrative tasks, and facilitate medical research.
Timing-sensitive analysis in python: Explores timing sensitivity in Python through a hands-on approach, testing timing stability under different conditions (like input size and intentional delay).

📊Analysis
State of Python 3.13 Performance: Free-Threading: Examines how free-threading affects Python's ability to handle parallel tasks, particularly through a practical example with a PageRank algorithm implementation.
Modular, Faster DateOnly Operations in Python: Delves into the reasoning behind memory-efficient, modular date handling and performance optimization, examining different approaches and technologies like C structures.

🎓Tutorials and Guides🤓
How to Reset a pandas DataFrame Index: Covers methods like .reset_index(), .index, and .set_axis() while exploring index alignment, duplicate removal, multi-index handling, and using columns as indexes.
A Comprehensive Guide to Python Project Management and Packaging: Covers structuring projects, managing dependencies, and creating distributable packages, along with practical examples.
Effortless Image Cropping with Python: Automate Your Workflow in Minute: Shows you how to automate image cropping in Python using OpenCV to identify the focal area of an image and crop it to a 16:9 aspect ratio.
Adding keyboard shortcuts to the Python REPL: Explains how to add custom keyboard shortcuts to the Python 3.13 REPL using a PYTHONSTARTUP file and the unsupported _pyrepl module.
Simplifying News Scraping with Python’s Newspaper4k Library: Demonstrates how to use Python’s Newspaper4k library to automate the scraping and parsing of news articles, extracting key components.
Principal Component Analysis with Python (A Deep Dive) - Part 1: Provides a step-by-step guide for reducing data dimensionality through mathematical and coding examples.
Using the OpenAI Realtime API in python: Covers key challenges such as managing the 24kHz, 16-bit audio format, avoiding jerky audio by handling recording and playback concurrently, and preventing echo by using a headset.

🔑Best Practices and Advice🔏
How to Fix the Most Common Python Coding Errors: Covers IndentationError, SyntaxError, TypeError (NoneType not subscriptable), IndexError (list index out of range), and KeyError (missing dictionary key).
Do Constructors Return Values in OOP?: Clarifies that while constructors don’t explicitly return values, they implicitly return the newly created instance of the class, fulfilling their primary purpose of object initialization.
20 Python scripts to automate common daily tasks: Lists scripts for tasks such as sending emails, scraping news, downloading stock prices, backing up files, posting to social media, fetching weather updates, and resizing images.
What time is it? A simple approach to AI-agents: Explains how AI agents solve real-time queries by selecting and executing pre-defined functions, using tasks like fetching the current time and weather as examples.
How I Got Started Making Maps with Python and SQL: Recounts the author’s journey into spatial data visualization using tools like DuckDB, H3, and GeoPandas to create interactive maps, from building density to 3D dashboards.

🔍Featured Study: Magentic-One💥

In "Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks," Fourney et al. from AI Frontiers - Microsoft Research aim to develop a versatile, multi-agent AI system capable of autonomously completing complex tasks. The study presents Magentic-One as a generalist solution that orchestrates specialised agents to tackle tasks that require planning, adaptability, and error recovery.

Context

To address the need for AI systems capable of handling a wide range of tasks, Magentic-One leverages a multi-agent architecture. In this setup, agents are AI-driven components, each with a distinct skill, such as web browsing or code execution, all working under the direction of an Orchestrator agent. The Orchestrator not only delegates tasks but also monitors and revises strategies to keep progress on track, ensuring effective task completion. This system responds to the growing demand for agentic systems in AI, those able to handle tasks involving multiple steps, real-time problem-solving, and error correction.

The importance of such systems has increased as AI technology advances in areas like software development, data analysis, and web-based research, where single-agent models often struggle with multi-step, unpredictable tasks.
By developing Magentic-One as a generalist system, the researchers offer a foundation that balances adaptability and reliability across diverse applications, helping establish future standards for agentic AI systems.

Key Features of Magentic-One

Multi-Agent Architecture: Magentic-One uses multiple agents, each specialising in a specific task, coordinated by an Orchestrator agent.
Orchestrator-Led Dynamic Planning: The Orchestrator creates and adapts task plans, tracks progress, and initiates corrective actions as needed, improving resilience.
Specialised Agents:
    Coder: Writes, analyses, and revises code.
    ComputerTerminal: Executes code, manages shell commands.
    WebSurfer: Browses the web, interacts with web pages.
    FileSurfer: Reads and navigates files of various types.
Performance on Benchmarks: Magentic-One achieved high performance on challenging benchmarks like GAIA (38% completion rate) and AssistantBench (27.7% accuracy), positioning it competitively among state-of-the-art systems.
AutoGenBench Tool for Evaluation: AutoGenBench offers a controlled testing environment, allowing for repeatable, consistent evaluation of agentic systems like Magentic-One.

What This Means for You

The study’s findings are particularly relevant for developers, researchers, and AI practitioners focused on real-world applications of AI for complex, multi-step tasks. For instance, fields such as autonomous software engineering, data management, and digital research can leverage Magentic-One's multi-agent system to automate complex workflows. Its modular, open-source design enables further adaptation, making it useful for those interested in customising AI tools to meet specific requirements or studying multi-agent coordination for diverse scenarios.

Examining the Details

The researchers applied a rigorous methodology to assess Magentic-One's reliability and practical value.
Key benchmarks included GAIA, AssistantBench, and WebArena, each with unique tasks requiring multi-step reasoning, data handling, and planning. To verify the system’s efficacy, Magentic-One’s performance was compared against established state-of-the-art systems. The study reports a 38% task completion rate on GAIA, positioning Magentic-One competitively among leading systems without modifying core agent capabilities.

To analyse the system’s interactions and address limitations, the team examined errors in detail, identifying recurring issues such as repetitive actions and insufficient data validation. By tracking these errors and using AutoGenBench, an evaluation tool ensuring isolated test conditions, the researchers provided a clear, replicable performance baseline. Their approach underscores the importance of modularity in AI design, as Magentic-One's agents operated effectively without interfering with each other, demonstrating both reliability and extensibility.

You can learn more by reading the entire paper or accessing the system here.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 7: Evaluating LLMs” in the book, LLM Engineer's Handbook by Paul Iusztin and Maxime Labonne, published in October 2024.

RAG evaluation

While traditional LLM evaluation focuses on the model’s inherent capabilities, RAG evaluation requires a more comprehensive approach that considers both the model’s generative abilities and its interaction with external information sources.

RAG systems combine the strengths of LLMs with information retrieval mechanisms, allowing them to generate responses that are not only coherent and contextually appropriate but also grounded in up-to-date, externally sourced information. This makes RAG particularly valuable in fields where current and accurate information is crucial, such as news reporting, research, and customer support.

The evaluation of RAG systems goes beyond assessing a standalone LLM.
It requires examining the entire system’s performance, including:

Retrieval accuracy: How well does the system fetch relevant information?
Integration quality: How effectively is the retrieved information incorporated into the generated response?
Factuality and relevance: Does the final output address the query appropriately while seamlessly blending retrieved and generated content?

Key metrics for RAG evaluation include retrieval precision and recall, which measure the accuracy and comprehensiveness of the retrieved information. Additionally, the quality of integration between retrieved data and generated text is crucial, as is the overall factuality and coherence of the output.

To illustrate how these metrics are applied in practice, consider a RAG system designed for a customer support chatbot in an e-commerce setting. In this scenario, the user asks “What’s your return policy for laptops purchased during the holiday sale?” The RAG pipeline finds relevant documents on the electronics return policy and documents on holiday sale terms. This additional context is appended at the end of the question, and the model uses it to respond:

For laptops purchased during our holiday sale, you have an extended return period of 60 days from the date of purchase. This is longer than our standard 30-day return policy for electronics. Please ensure the laptop is in its original packaging with all accessories to be eligible for a full refund.

Table 7.3: Example of output from a RAG pipeline designed for customer support

In this pipeline, we can evaluate if the retrieved documents correspond to what was expected (retrieval accuracy). We can also measure the difference between responses with and without additional context (integration quality).
Finally, we can assess whether the output is relevant and grounded in the information provided by the documents (factuality and relevance).

In this section, we will cover two methods to evaluate how well RAG models incorporate external information into their responses.

Ragas

Retrieval-Augmented Generation Assessment (Ragas) is an open-source toolkit designed to provide developers with a comprehensive set of tools for RAG evaluation and optimization. It’s designed around the idea of metrics-driven development (MDD), a product development approach that relies on data to make well-informed decisions, involving the ongoing monitoring of essential metrics over time to gain valuable insights into an application’s performance. By embracing this methodology, Ragas enables developers to objectively assess their RAG systems, identify areas for improvement, and track the impact of changes over time.

One of the key capabilities of Ragas is its ability to synthetically generate diverse and complex test datasets. This feature addresses a significant pain point in RAG development, as manually creating hundreds of questions, answers, and contexts is both time-consuming and labor-intensive. Instead, it uses an evolutionary approach inspired by works like Evol-Instruct to craft questions with varying characteristics such as reasoning complexity, conditional elements, and multi-context requirements. This approach ensures a comprehensive evaluation of different components within the RAG pipeline.

Additionally, Ragas can generate conversational samples that simulate chat-based question-and-follow-up interactions, allowing developers to evaluate their systems in more realistic scenarios.

Figure 7.1: Overview of the Ragas evaluation framework

As illustrated in Figure 7.1, Ragas provides a suite of LLM-assisted evaluation metrics designed to objectively measure different aspects of RAG system performance.
These metrics include:

Faithfulness: This metric measures the factual consistency of the generated answer against the given context. It works by breaking down the answer into individual claims and verifying if each claim can be inferred from the provided context. The faithfulness score is calculated as the ratio of verifiable claims to the total number of claims in the answer.
Answer relevancy: This metric evaluates how pertinent the generated answer is to the given prompt. It uses an innovative approach where an LLM is prompted to generate multiple questions based on the answer and then calculates the mean cosine similarity between these generated questions and the original question. This method helps identify answers that may be factually correct but off-topic or incomplete.
Context precision: This metric evaluates whether all the ground-truth relevant items present in the contexts are ranked appropriately. It considers the position of relevant information within the retrieved context, rewarding systems that place the most pertinent information at the top.
Context recall: This metric measures the extent to which the retrieved context aligns with the annotated answer (ground truth). It analyzes each claim in the ground truth answer to determine whether it can be attributed to the retrieved context, providing insights into the completeness of the retrieved information.

Finally, Ragas also provides building blocks for monitoring RAG quality in production environments. This facilitates continuous improvement of RAG systems. By leveraging the evaluation results from test datasets and insights gathered from production monitoring, developers can iteratively enhance their applications.
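
The arithmetic behind the faithfulness and answer relevancy scores described above can be sketched in a few lines. This is a minimal illustration and not the Ragas API: the function names are hypothetical, and the per-claim verdicts and embedding vectors are assumed to be supplied by the caller (in practice an LLM and an embedding model would produce them).

```python
import math
from typing import Sequence


def faithfulness_score(claim_supported: Sequence[bool]) -> float:
    """Ratio of claims that the context supports to all claims in the answer.

    Assumption: each boolean is the verdict for one claim extracted from
    the answer (hypothetical input, normally produced by an LLM judge).
    """
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


def answer_relevancy_score(
    question_embedding: Sequence[float],
    generated_question_embeddings: Sequence[Sequence[float]],
) -> float:
    """Mean cosine similarity between the original question and the
    questions generated back from the answer (embeddings are assumed inputs)."""
    if not generated_question_embeddings:
        raise ValueError("at least one generated question is required")
    similarities = [
        cosine_similarity(question_embedding, embedding)
        for embedding in generated_question_embeddings
    ]
    return sum(similarities) / len(similarities)


# Three of four claims are supported by the context.
print(faithfulness_score([True, True, False, True]))  # 0.75
```

Keeping the scoring separate from the LLM and embedding calls makes the ratio and the averaging easy to test on their own with hand-made inputs.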
These enhancements might involve fine-tuning retrieval algorithms, adjusting prompt engineering strategies, or optimizing the balance between retrieved context and LLM generation.

Ragas can be complemented with another approach, based on custom classifiers.

ARES

ARES (an automated evaluation framework for RAG systems) is a comprehensive tool designed to evaluate RAG systems. It offers an automated process that combines synthetic data generation with fine-tuned classifiers to assess various aspects of RAG performance, including context relevance, answer faithfulness, and answer relevance.

The ARES framework operates in three main stages: synthetic data generation, classifier training, and RAG evaluation. Each stage is configurable, allowing users to tailor the evaluation process to their specific needs and datasets.

In the synthetic data generation stage, ARES creates datasets that closely mimic real-world scenarios for robust RAG testing. Users can configure this process by specifying document file paths, few-shot prompt files, and output locations for the synthetic queries. The framework supports various pre-trained language models for this task, with the default being google/flan-t5-xxl. Users can control the number of documents sampled and other parameters to balance between comprehensive coverage and computational efficiency.

Figure 7.2: Overview of the ARES evaluation framework

The classifier training stage involves creating high-precision classifiers to determine the relevance and faithfulness of RAG outputs. Users can specify the classification dataset (typically generated from the previous stage), test set for evaluation, label columns, and model choice. ARES uses microsoft/deberta-v3-large as the default model but supports other Hugging Face models.
Training parameters such as the number of epochs, patience value for early stopping, and learning rate can be fine-tuned to optimize classifier performance.

The final stage, RAG evaluation, leverages the trained classifiers and synthetic data to assess the RAG model’s performance. Users provide evaluation datasets, few-shot examples for guiding the evaluation, classifier checkpoints, and gold label paths. ARES supports various evaluation metrics and can generate confidence intervals for its assessments.

ARES offers flexible model execution options, supporting both cloud-based and local runs through vLLM integration. The framework also supports various artifact types (code snippets, documents, HTML, images, and so on), enabling comprehensive evaluation across different RAG system outputs.

In summary, Ragas and ARES complement each other through their distinct approaches to evaluation and dataset generation. Ragas’s strength in production monitoring and LLM-assisted metrics can be combined with ARES’s highly configurable evaluation process and classifier-based assessments. While Ragas may offer more nuanced evaluations based on LLM capabilities, ARES provides consistent and potentially faster evaluations once its classifiers are trained. Combining them offers a comprehensive evaluation framework, benefiting from quick iterations with Ragas and in-depth, customized evaluations with ARES at key stages.

LLM Engineer's Handbook was published in October 2024.
Get the eBook for $47.99
Get the Print Book for $59.99

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!


Divya Anne Selvaraj

05 Nov 2024

PythonPro #54: Global Forecasting Models, Python Overtakes JavaScript, and Hidden Python Libraries


Bite-sized actionable content, practical tutorials, and resources for Python programmers.

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, Modern Time Series Forecasting with Python - Second Edition, which explains the shift from traditional, isolated time series models to global forecasting models, which leverage related datasets to enhance scalability and accuracy and to reduce overfitting in large-scale applications.

News Highlights: Python has overtaken JavaScript on GitHub, driven by its role in AI and data science, per GitHub's Octoverse 2024 report; and IBM’s Deep Search team has released Docling v2, a Python library for document extraction with models on Hugging Face.

My top 5 picks from today’s learning resources:
Hidden Python Libraries That Will Blow Your Mind🌟
Python threading and subprocesses explained🧵
Books are Datasets: Mapping 12 Sacred Texts with Python and D3.js📖
Python Closures: Common Use Cases and Examples➿
ChatGPT-4o cannot run proper Generalized Additive Models currently - but can correctly interpret results from R🧠

And today’s Featured Study introduces SafePyScript, a machine-learning-based tool developed by researchers at the University of Passau, Germany, for detecting vulnerabilities in Python code.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
Python has overtaken JavaScript on GitHub: GitHub’s Octoverse 2024 report reveals Python as the most popular language on GitHub, driven by its role in AI, data science, and machine learning. Jupyter Notebooks usage has also surged.
Docling: Document extraction Python library from the Deep Search team at IBM: IBM’s Deep Search team released Docling v2, an MIT-licensed Python library for document extraction with custom models available on Hugging Face.

💼Case Studies and Experiments🔬
Programmed differently? Testing for gender differences in Python programming style and quality on GitHub: The study confirms that programming style can predict gender but these differences do not impact code quality.
Tune your guitar with python: Demonstrates using Python’s sounddevice and matplotlib modules to create a real-time guitar tuner, where a live spectrogram identifies key bass guitar note frequencies for tuning, with a custom interface.

📊Analysis
Package compatibility tracker: Python 3.13 free-threading and subinterpreters: This compatibility tracker shows that 83% of the 500 most downloaded Python packages are compatible with Python 3.13’s new free-threading feature, while 73% support importing without GIL in Docker tests.
Hidden Python Libraries That Will Blow Your Mind: Introduces six powerful yet lesser-known Python libraries including Streamlit for quick app-building, PyWhatKit for task automation, and Typer for simplified CLIs.

🎓Tutorials and Guides🤓
Python threading and subprocesses explained: Details Python’s threading and multiprocessing methods to handle parallel tasks, using thread/process pools for I/O-bound and CPU-bound tasks.
Tiny GraphRAG in 1000 lines of Python: Introduces a simplified, local implementation of the GraphRAG algorithm, using a graph-based structure to enhance RAG for more contextually aware information retrieval.
Building AI chatbots with marimo: Covers how to create adaptable chatbots capable of sophisticated tasks, including visualizing data and processing diverse inputs.
Explanation of Vision Transformer with implementation: Provides an in-depth explanation and step-by-step implementation of Vision Transformer, covering key concepts such as embedding patches with code examples.
PyD-Net: Advancing Real-Time Depth Estimation for Mobile and Embedded Devices: Showcases PyD-Net's practical applications across autonomous navigation, augmented reality, assistive technology, and design.
Explore Solvable and Unsolvable Equations with Python: Delves into solving equations in Python, discussing when closed-form solutions are feasible and when numerical methods become necessary.
Books are Datasets: Mapping 12 Sacred Texts with Python and D3.js: Explores using Python and D3.js to analyze and visualize 12 major sacred texts as datasets, showcasing text-processing techniques to reveal connections and patterns within and between these religious texts.

🔑Best Practices and Advice🔏
Variables in Python: Usage and Best Practices: Covers variable creation, dynamic typing, expressions, and best practices for naming and using variables in various scopes, along with parallel assignments and iterable unpacking.
The Python Square Root Function: Details Python’s sqrt() function from the math module, explaining its use for calculating square roots of positive numbers and zero, while raising errors for negative inputs.
Python Closures: Common Use Cases and Examples: Explains Python closures, inner functions that capture variables from their surrounding scope, enabling state retention, function-based decorators, and encapsulation.
Python ellipses considered harmful: Argues that using ellipses (...) to declare unimplemented methods in Python’s abstract classes can lead to hidden errors, and advocates for raise NotImplementedError instead.
ChatGPT-4o cannot run proper Generalized Additive Models currently - but it can correctly interpret results from R: Highlights limitations of ChatGPT-4o in advanced statistical modeling, informing Python users about workarounds and considerations when working with similar tools.

🔍Featured Study: SafePyScript💥

In "SafePyScript: A Web-Based Solution for Machine Learning-Driven Vulnerability Detection in Python," Farasat et al., researchers from the University of Passau, Germany, introduce SafePyScript, a machine-learning-based web tool designed to detect vulnerabilities in Python code.

Context

In software development, identifying vulnerabilities is a major concern due to the security risks posed by cyberattacks.
Vulnerabilities, or flaws in code that can be exploited by attackers, require constant detection and correction. Traditionally, vulnerability detection relies on:

Static Analysis: This rule-based approach scans code for known vulnerability patterns but often results in high false positives.
Dynamic Analysis (Penetration Testing): This approach tests code in a runtime environment, relying on security experts to simulate potential attacks, making it resource-heavy and dependent on professional expertise.

Machine learning offers a data-driven alternative, enabling automated vulnerability detection with improved accuracy. Despite its popularity, Python lacks dedicated machine-learning-based tools for this purpose, which SafePyScript aims to provide. SafePyScript leverages a specific machine learning model, BiLSTM (Bidirectional Long Short-Term Memory), and the ChatGPT API to not only detect vulnerabilities but also propose secure code, addressing this gap for Python developers.

Key Features of SafePyScript

BiLSTM Model for Vulnerability Detection: Trained on word2vec embeddings, this model has achieved an accuracy of 98.6% and ROC of 99.3% for Python code vulnerabilities.
Integration with ChatGPT API: SafePyScript uses ChatGPT (Turbo 3.5) to analyse and generate secure alternatives for vulnerable code.
Common Vulnerabilities Addressed: These include SQL Injection, Cross-Site Scripting (XSS), Remote Code Execution, Cross-Site Request Forgery (XSRF), and Open Redirect.
User-Friendly Interface: Built using Django (backend) and HTML, CSS, and JavaScript with Ajax (frontend) for a responsive, accessible user experience.
Report Generation: Users can download detailed reports on vulnerabilities detected in their code, making it easier to track and resolve issues systematically.
Feedback Mechanism: Users can provide feedback, allowing for tool improvement and adaptation to new security threats.

What This Means for You

SafePyScript is most useful for Python developers and software engineers who need an efficient way to detect vulnerabilities in their code without relying on traditional, labour-intensive methods. Its machine-learning foundation and integration with ChatGPT make it highly practical for real-world application, providing not only insights into code vulnerabilities but also generating secure code alternatives.

Examining the Details

SafePyScript’s effectiveness rests on a robust BiLSTM model. This model, using word2vec embeddings, achieved 98.6% accuracy, 96.2% precision, and 99.3% ROC in vulnerability detection. The researchers optimised the BiLSTM’s hyperparameters (such as a learning rate of 0.001 and a batch size of 128) through rigorous testing, achieving reliable benchmark results.

Additionally, SafePyScript leverages ChatGPT’s language model to generate secure code alternatives. The research team implemented precise prompt engineering to maximise ChatGPT’s effectiveness in analysing Python code vulnerabilities, further supporting the tool’s usability.

SafePyScript’s frontend design, built with HTML, CSS, JavaScript (with Ajax), and a Django backend, ensures a smooth user experience. This structure allows developers to log in, upload or import code, select detection models, review reports, and access secure code, all within an intuitive, accessible platform.

You can learn more by reading the entire paper or accessing SafePyScript.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 6: Time Series Forecasting as Regression” in the book, Modern Time Series Forecasting with Python - Second Edition by Manu Joseph and Jeffrey Tackes, published in October 2024.

Global forecasting models – a paradigm shift

Traditionally, each time series was treated in isolation. Because of that, traditional forecasting has always looked at the history of a single time series alone in fitting a forecasting function.
But recently, because of the ease of collecting data in today's digital-first world, many companies have started collecting large amounts of time series from similar sources, or related time series.

For example, retailers such as Walmart collect data on sales of millions of products across thousands of stores. Companies such as Uber or Lyft collect the demand for rides from all the zones in a city. In the energy sector, energy consumption data is collected across all consumers. All these sets of time series have shared behavior and are hence called related time series.

We can consider that all the time series in a related time series come from separate data generating processes (DGPs), and thereby model them all separately. We call these the local models of forecasting. An alternative to this approach is to assume that all the time series are coming from a single DGP. Instead of fitting a separate forecast function for each time series individually, we fit a single forecast function to all the related time series. This approach has been called global or cross-learning in literature.

The terminology global was introduced by David Salinas et al. in the DeepAR paper and Cross-learning by Slawek Smyl.

...having more data will lead to lower chances of overfitting and, therefore, lower generalization error (the difference between training and testing errors). This is exactly one of the shortcomings of the local approach. Traditionally, time series are not very long, and in many cases, it is difficult and time-consuming to collect more data as well. Fitting a machine learning model (with all its expressiveness) on small data is prone to overfitting. This is why time series models that enforce strong priors were used to forecast such time series, traditionally.
But these strong priors, which restrict the fitting of traditional time series models, can also lead to a form of underfitting and limit accuracy.

Strong and expressive data-driven models, as in machine learning, require a larger amount of data to have a model that generalizes to new and unseen data. A time series, by definition, is tied to time, and sometimes, collecting more data means waiting for months or years and that is not desirable. So, if we cannot increase the length of the time series dataset, we can increase the width of the time series dataset. If we add multiple time series to the dataset, we increase the width of the dataset, and thereby increase the amount of data the model is getting trained with. Figure 5.7 shows the concept of increasing the width of a time series dataset visually:

Figure 5.7 – The length and width of a time series dataset

This works in favor of machine learning models because, with higher flexibility in fitting a forecast function and more data to work with, the machine learning model can learn, in a completely data-driven way, a more complex forecast function than traditional time series models, and that function is typically shared between the related time series.

Another shortcoming of the local approach revolves around scalability. In the case of Walmart we mentioned earlier, there are millions of time series that need to be forecasted and it is not possible to have human oversight on all these models. If we think about this from an engineering perspective, training and maintaining millions of models in a production system would give any engineer a nightmare.
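To make the local-versus-global contrast concrete, here is a minimal pure-Python sketch. It is not the book's code: the series names and values, the helpers `make_lag_rows` and `fit_ar1`, and the one-lag, no-intercept model are all illustrative assumptions chosen to keep the example short.

```python
# Hypothetical toy data: three short, related series (names and values are made up).
series = {
    "store_a": [10.0, 12.0, 11.0, 13.0, 12.0, 14.0],
    "store_b": [20.0, 22.0, 21.0, 23.0, 22.0, 24.0],
    "store_c": [5.0, 6.0, 5.5, 6.5, 6.0, 7.0],
}


def make_lag_rows(values, n_lags):
    """Turn one series into (lag_features, target) rows for regression."""
    rows = []
    for t in range(n_lags, len(values)):
        rows.append((values[t - n_lags:t], values[t]))
    return rows


def fit_ar1(rows):
    """Least-squares slope (no intercept) for a one-lag model: y ~ phi * x."""
    numerator = sum(x[0] * y for x, y in rows)
    denominator = sum(x[0] ** 2 for x, _ in rows)
    return numerator / denominator


# Local approach: one training set and one fitted model per series.
local_rows = {name: make_lag_rows(vals, 1) for name, vals in series.items()}
local_models = {name: fit_ar1(rows) for name, rows in local_rows.items()}

# Global approach: stack the rows of every series (more "width") and fit once.
global_rows = [row for rows in local_rows.values() for row in rows]
global_model = fit_ar1(global_rows)

print(len(local_models), "local models, each trained on",
      len(local_rows["store_a"]), "rows")
print("1 global model trained on", len(global_rows), "rows")
```

Each local model here sees only five training rows, while the single global model sees all fifteen. Real pipelines often normalise each series before pooling and use richer features and models; those steps are left out of this sketch.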
Under the global approach, we only train a single model for all these time series, which drastically reduces the number of models we need to maintain and yet can generate all the required forecasts.

This new paradigm of forecasting has gained traction and has consistently been shown to improve on the local approaches in multiple time series competitions, mostly in datasets of related time series. In Kaggle competitions, such as Rossmann Store Sales (2015), Wikipedia Web Traffic Time Series Forecasting (2017), Corporación Favorita Grocery Sales Forecasting (2018), and M5 Competition (2020), the winning entries were all global models, either machine learning or deep learning or a combination of both. The Intermarché Forecasting Competition (2021) also had global models as the winning submissions. Links to these competitions are provided in the Further reading section.

Although we have many empirical findings where the global models have outperformed local models for related time series, global models are still a relatively new area of research. Montero-Manso and Hyndman (2020) reported a few very interesting results. They showed that any local method can be approximated by a global model with the required complexity, and the most interesting finding they put forward is that the global model will perform better, even with unrelated time series. We will talk more about global models and strategies for global models in Chapter 10, Global Forecasting Models.

Modern Time Series Forecasting with Python - Second Edition was published in October 2024.

Get the eBook for $46.99 $31.99!
Get the Print Book for $57.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!


Divya Anne Selvaraj

29 Oct 2024

PythonPro #53: FastAPI on Docker, Python-CUDA Integration with Numbast, and Concurrent Requests with httpx vs aiohttp


Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#53 FastAPI on Docker, Python-CUDA Integration with Numbast, and Concurrent Requests with httpx vs aiohttp

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, FastAPI Cookbook, which explains how to deploy FastAPI apps using Docker, covering Dockerfile creation, image building, and container generation.

News Highlights: Numbast simplifies Python-CUDA C++ integration by auto-generating Numba bindings for CUDA functions; and DJ Beat Drop enhances Django’s new developer onboarding with a streamlined project initializer.

Concurrent Requests in Python: httpx vs aiohttp🚦
Python Thread Safety: Using a Lock and Other Techniques🔐
Time-Series Data Meets Blockchain: Storing Time-Series Data with Solidity, Ganache and Python⛓️
Let's Eliminate General Bewilderment • Python's LEGB Rule, Scope, and Namespaces🧩
Optimization of Iceberg Table In AWS Glue🧊

And today’s Featured Study introduces LSS-SKAN, a Kolmogorov–Arnold Network (KAN) variant that uses a single-parameter function (Shifted Softplus) to improve accuracy and speed.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: Thank you to those who participated in this month's survey. With this issue, we have tried to fulfill at least one request made by each participant. Keep an eye out for next month's survey.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
Bridging the CUDA C++ Ecosystem and Python Developers with Numbast: Numbast streamlines the integration of CUDA C++ libraries with Python by automatically generating Numba bindings for CUDA functions.
Improving the New Django Developer Experience: Introduces DJ Beat Drop as a streamlined project initializer to improve the onboarding experience for new Django developers.

💼Case Studies and Experiments🔬
Concurrent Requests in Python: httpx vs aiohttp: Describes how switching from the httpx library to aiohttp resolved high-concurrency issues and improved stability in a computer vision application.
From Python to CPU instructions: Part 1: Explains how rewriting a Python program in C exposes low-level details Python abstracts away, particularly highlighting the manual effort required for tasks like input handling.

📊Analysis
Python 3.13, what didn't make the headlines: Highlights Python 3.13's understated but impactful improvements, focusing on debugging enhancements, filesystem fixes, and minor concurrency updates.
When should you upgrade to Python 3.13?: Advises waiting until December 2024 for Python 3.13 upgrades to ensure compatibility with libraries, tools, and bug-fix improvements.

🎓Tutorials and Guides🤓
Python Thread Safety: Using a Lock and Other Techniques: Explains how to address issues like race conditions and introduces synchronization techniques such as semaphores to ensure safe, concurrent code execution.
Time-Series Data Meets Blockchain: Storing Time-Series Data with Solidity, Ganache and Python: Walks you through the steps to set up Ethereum locally, deploy a smart contract, and store and retrieve data points.
Beautiful Soup: Build a Web Scraper With Python: Covers how to inspect site structure, scrape HTML content, and parse data using Requests and Beautiful Soup to build a script that extracts and displays job listings.
🎥Advanced Web Scraping Tutorial! (w/ Python Beautiful Soup Library): Covers using Requests to retrieve and parse data, especially from dynamic pages like Walmart's, with enhancements like using modified headers.
Fuzzy regex matching in Python: Introduces the orc library to simplify fuzzy matching by providing a human-friendly interface that highlights edits and can invert changes, enhancing usability for complex text correction tasks.
Achieving Symmetrical ManyToMany Filtering in Django Admin: Covers using Django's RelatedFieldWidgetWrapper and a custom ModelForm, allowing for consistent filtering on both sides of a ManyToMany relationship.
Get started with the free-threaded build of Python 3.13: Details installation, usage in Python programs, compatibility with C extensions, and how to detect GIL status programmatically.

🔑Best Practices and Advice🔏
Let's Eliminate General Bewilderment • Python's LEGB Rule, Scope, and Namespaces: Details how variables are resolved in local, enclosing, global, and built-in scopes, using accessible examples to clarify potential pitfalls.
🎥Robust LLM pipelines (Mathematica, Python, Raku): Given the unreliable and often slow nature of LLMs, this presentation outlines methods to enhance pipeline efficiency, robustness, and usability.
A new way of Python Debugging with the Frame Evaluation API: Introduces Python's Frame Evaluation API, a tool that allows real-time monitoring and control of program execution at the frame level.
Buffers on the edge: Python and Rust: Explains how Python's buffer protocol, which enables memory sharing between objects, can lead to undefined behavior due to data races in C, and the challenges Rust faces in maintaining soundness.
Optimization of Iceberg Table In AWS Glue: Discusses how AWS Glue offers built-in optimization, but a Python-based solution using boto3 and Athena SQL scripts provides customizable, cost-effective automation.

🔍Featured Study: LSS-SKAN💥

In "LSS-SKAN: Efficient Kolmogorov–Arnold Networks based on Single-Parameterized Function," Chen and Zhang from South China University of Technology present a refined Kolmogorov–Arnold Network (KAN) variant. Their study introduces an innovative design principle for neural networks, improving accuracy and computational speed while ensuring greater model interpretability.

Context

KANs are neural networks based on the Kolmogorov-Arnold theorem, which breaks down complex, multivariate functions into simpler univariate ones, aiding in better visualisation and interpretability. This makes them valuable in critical decision-making applications, where understanding a model's decision process is crucial. Unlike typical neural networks like Multilayer Perceptrons (MLPs), which rely on opaque linear and activation functions, KANs assign functions to network edges, creating a more interpretable structure. Over time, several KAN variants, such as FourierKAN and FastKAN, have emerged, each with unique basis functions to balance speed and accuracy.

LSS-SKAN builds on these advancements with the Efficient KAN Expansion (EKE) Principle, a new approach that scales networks using simpler basis functions, allocating parameters to the network's size instead.
This principle is central to LSS-SKAN's efficiency and demonstrates how a simpler basis function can yield high accuracy with reduced computational cost.

Key Features of LSS-SKAN
EKE Principle: Scales the network by prioritising size over basis function complexity, making LSS-SKAN faster and more efficient.
Single-Parameter Basis Function: Utilises the Shifted Softplus function, requiring only one learnable parameter for each function, which simplifies the network and reduces training time.
Superior Accuracy: Outperforms KAN variants, showing a 1.65% improvement over Spl-KAN, 2.57% over FastKAN, 0.58% over FourierKAN, and 0.22% over WavKAN on the MNIST dataset.
Reduced Training Time: Achieves significant reductions in training time, running 502.89% faster than MLP+rKAN and 41.78% faster than MLP+fKAN.

What This Means for You

For those working in machine learning or fields requiring interpretable AI, LSS-SKAN offers a practical solution to enhance neural network accuracy and speed while maintaining transparency in model decision-making. LSS-SKAN is particularly beneficial in applications involving image classification, scientific computing, or scenarios demanding high interpretability, such as medical or financial sectors where model explainability is crucial.

Examining the Details

The researchers conducted detailed experiments using the MNIST dataset to measure LSS-SKAN’s performance against other KAN variants. They tested both short-term (10-epoch) and long-term (30-epoch) training cycles, focusing on two key metrics: accuracy and execution speed.

Through these tests, LSS-SKAN consistently outperformed other KAN models in accuracy, achieving a 1.65% improvement over Spl-KAN, 2.57% over FastKAN, and 0.58% over FourierKAN, while also running 502.89% faster than MLP+rKAN and 41.78% faster than MLP+fKAN.

The LSS-SKAN Python library is available on GitHub, along with experimental code, so you can replicate and build on their findings. They recommend a learning rate between 0.0001 and 0.001 for best results, particularly due to KANs’ sensitivity to learning rate adjustments.

You can learn more by reading the entire paper and accessing LSS-SKAN.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 12: Deploying and Managing FastAPI Applications” in the book, FastAPI Cookbook by Giunio De Luca, published in August 2024.

Running FastAPI applications in Docker containers

Docker is a useful tool that lets developers wrap applications with their dependencies into a container. This method makes sure that the application operates reliably in different environments, avoiding the common "works on my machine" issue. In this recipe, we will see how to make a Dockerfile and run a FastAPI application inside a Docker container. By the end of this guide, you will know how to put your FastAPI application into a container, making it more flexible and simpler to deploy.

Getting ready

You will benefit from some knowledge of container technology, especially Docker, to follow the recipe better. But first, check that Docker Engine is set up properly on your machine. You can see how to do it at this link: https://docs.docker.com/engine/install/.

If you use Windows, it is better to install Docker Desktop, which is a Docker virtual machine distribution with a built-in graphical interface.

Whether you have Docker Engine or Docker Desktop, make sure the daemon is running by typing this command:

    $ docker images

If you don’t see any error about the daemon, that means that Docker is installed and working on the machine. The way to start the Docker daemon depends on the installation you choose. Look at the related documentation to see how to do it.

You can use the recipe for your applications or follow along with the Live Application application that we introduced in the first recipe, which we are using throughout the chapter.

How to do it…

It is not very complicated to run a simple FastAPI application in a Docker container.
The process consists of three steps:

1. Create the Dockerfile.
2. Build the image.
3. Generate the container.

Then, you just have to run the container to have the application working.

Creating the Dockerfile

The Dockerfile contains the instructions needed to build the image from an operating system and the file we want to specify.

It is good practice to create a separate Dockerfile for the development environment. We will name it Dockerfile.dev and place it under the project root folder.

We start the file by specifying the base image, which will be as follows:

    FROM python:3.10

This will pull an image from the Docker Hub, which already comes with Python 3.10 integrated. Then, we create a folder called /code that will host our code:

    WORKDIR /code

Next, we copy requirements.txt into the image and install the packages inside the image:

    COPY ./requirements.txt /code/requirements.txt
    RUN pip install --no-cache-dir -r /code/requirements.txt

The pip install command runs with the --no-cache-dir parameter to avoid pip caching operations that wouldn’t be beneficial inside a container. Also, in a production environment, for larger applications, it is recommended to pin fixed versions of the packages in requirements.txt to avoid potential compatibility issues due to package upgrades.

Then, we can copy the app folder containing the application into the image with the following command:

    COPY ./app /code/app

Finally, we define the server startup instruction as follows:

    CMD ["fastapi", "run", "app/main.py", "--port", "80"]

This is all we need to create our Dockerfile.dev file.

Building the image

Once we have Dockerfile.dev, we can build the image. We can do it by running the following from the command line at the project root folder level:

    $ docker build -f Dockerfile.dev -t live-application .

Since we named our Dockerfile Dockerfile.dev, we should specify it in an argument.
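For convenience, the instructions from the Creating the Dockerfile section can be collected in one place. The sketch below is a hypothetical helper script, not part of the recipe: the names `DOCKERFILE_DEV` and `write_dockerfile` are made up here, and the script only joins the six instructions quoted in the recipe into the text of Dockerfile.dev.

```python
from pathlib import Path

# The six instructions are copied from the recipe; only this helper is new.
DOCKERFILE_DEV = "\n".join([
    "FROM python:3.10",
    "WORKDIR /code",
    "COPY ./requirements.txt /code/requirements.txt",
    "RUN pip install --no-cache-dir -r /code/requirements.txt",
    "COPY ./app /code/app",
    'CMD ["fastapi", "run", "app/main.py", "--port", "80"]',
]) + "\n"


def write_dockerfile(project_root="."):
    """Write Dockerfile.dev under project_root and return its path."""
    path = Path(project_root) / "Dockerfile.dev"
    path.write_text(DOCKERFILE_DEV)
    return path


if __name__ == "__main__":
    # Run from the project root so the COPY paths resolve during the build.
    print("Wrote", write_dockerfile())
```

Run directly from the project root, this writes Dockerfile.dev into the current directory, after which the docker build command from the recipe can be used unchanged.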
Once the build is finished, you can check that the image has been correctly built by running the following:

    $ docker images live-application

You should see the details of the image in the output, like this:

    REPOSITORY         TAG      IMAGE ID       CREATED          SIZE
    live-application   latest   7ada80a535c2   43 seconds ago   1.06GB

With the image built, we can proceed with creating the container.

Creating the container

To create the container and run it, simply run the following:

    $ docker run -p 8000:80 live-application

This will create the container and run it. We can see the container by running the following:

    $ docker ps -a

Since we didn’t specify a container name, it will automatically be assigned a fancy name. Mine, for example, is bold_robinson.

Open the browser on http://localhost:8000 and you will see the home page response of our application.

This is all you need to run a FastAPI application inside a Docker container. Running a FastAPI application in a Docker container is a great way to use the advantages of both technologies. You can easily scale, update, and deploy your web app with minimal configuration.

See also

The Dockerfile can be used to specify several features of the image. Check the list of commands in the official documentation:

Dockerfile reference: https://docs.docker.com/reference/dockerfile/
Docker CLI documentation: https://docs.docker.com/reference/cli/docker/
FastAPI in Containers - Docker: https://fastapi.tiangolo.com/deployment/docker/

FastAPI Cookbook was published in August 2024.

Get the eBook for $35.99 $24.99!
Get the Print Book for $44.99 $30.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, just respond to this email!


Divya Anne Selvaraj

22 Oct 2024

PythonPro #52: AI-Powered Vulnhuntr for Python, SageMaker Core SDK, and Exploring User Behaviour with Python


Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#52 AI-Powered Vulnhuntr for Python, SageMaker Core SDK, and Exploring User Behaviour with Python

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, Building AI Applications with OpenAI APIs - Second Edition, which discusses how to create a language translation desktop app using OpenAI's ChatGPT API and Microsoft Word.

News Highlights: Protect AI to release Vulnhuntr, an AI tool for detecting Python zero-day vulnerabilities; Amazon launches SageMaker Core, a Python SDK simplifying machine learning with object-oriented interfaces; and PyCharm becomes the official IDE of OpenCV as JetBrains joins as a Silver Member.

Comprehensive Python Cheatsheet📚
Exploring User Behavior: A Python Case Study of Bike-Sharing Company Dataset🚴‍♂️
Python's property(): Add Managed Attributes to Your Classes🔧
Python approach to the Semantic Web: exploring linked data and RDF🌐
Assert vs. Raise: When to Use Each in Your ML/AI Projects⚠️

And today’s Featured Study presents ChangeGuard, a tool designed to compare code behaviour before and after changes to detect functionality modifications.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month's survey is still live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
Open source LLM tool primed to sniff out Python zero-days: Researchers with Seattle-based Protect AI will soon release Vulnhuntr, an AI-powered open-source tool that uses Claude AI to detect zero-day vulnerabilities in Python codebases by analyzing entire call chains for security issues.
Introducing SageMaker Core: A new object-oriented Python SDK for Amazon SageMaker: The SDK will simplify the machine learning lifecycle by replacing complex JSON structures with object-oriented interfaces.
Press Release: PyCharm Becomes Official IDE of OpenCV, JetBrains Joins as Silver Member: As a Silver Member, JetBrains will financially support OpenCV, ensuring its resources remain free.

💼Case Studies and Experiments🔬
Part 2: Data Quality Dashboard: A Visual Approach to Monitoring Expectations in Databricks: Explains how to quickly identify issues using graphical representations like pie charts and bar charts.
Exploring User Behavior: A Python Case Study of Bike-Sharing Company Dataset: Uses Python to uncover user behaviour patterns and develop strategies to convert casual riders into annual members.

📊Analysis
🎥Russell Keith-Magee on Beeware, packaging, GUI & money in Python: Focuses on the challenges of cross-platform Python packaging, particularly for desktop and mobile platforms, and discusses how BeeWare helps developers.
Should you use uv’s managed Python in production?: Advises careful consideration of uv’s production readiness, noting recent improvements but recommending thorough evaluation based on project-specific risks.

🎓Tutorials and Guides🤓
Python's property(): Add Managed Attributes to Your Classes: Covers creating read-only, read-write, and computed properties, logging, and more, while maintaining a stable public API for your classes.
A Multi-Agent AI Chatbot App using Databutton and Swarm: Explains how different agents can collaborate and hand off tasks, with an example of a multi-agent healthcare chatbot that connects users to specialized agents.
Understanding Pluggable Authentication Module (PAM) and Creating a Custom One in Python: Covers PAM’s architecture, module stacks, and control flags and walks you through building and integrating a custom PAM.
Python approach to the Semantic Web: exploring linked data and RDF: Covers creating RDF triples, querying SPARQL endpoints, and visualizing relationships using NetworkX.
Understanding Web Scraping in Python and Scrapy: Explains what web scraping is, its significance, and the tools required, such as BeautifulSoup, Requests, and Scrapy.
🎥A hand-holding guide to writing FUSE-based filesystems in Python: Covers the process of creating Python-based FUSE file systems, from basic functionality to more advanced features like file attributes.
Adding syntax to the cpython interpreter: Demonstrates how to add new syntax to Python, specifically making ternary statements default to None when no else condition is provided, similar to Ruby.

🔑Best Practices and Advice🔏
What I Learned from Making the Python Backend for YouTube Transcript Optimizer: Explains the process of building the Python backend for a YouTube Transcript Optimizer using FastAPI and SQLmodel.
Comprehensive Python Cheatsheet: An extensive resource covering a wide array of Python topics, including syntax, data structures, and advanced concepts.
How to Use Lambda Functions in Python: Covers their syntax, common use cases with functions like map(), filter(), and sorted(), along with advantages, limitations, and best practices for effective use in simplifying code.
Assert vs. Raise: When to Use Each in Your ML/AI Projects: Discusses when to use assert for internal checks during development and raise for handling user-facing errors in ML/AI projects to ensure robust error handling.
Structural Pattern Matching in Python: Explores customizing pattern matching for classes, extracting nested data, and common limitations in Python’s implementation.

🔍Featured Study: ChangeGuard - Validating Code Changes via Pairwise Learning-Guided Execution💥

In "ChangeGuard: Validating Code Changes via Pairwise Learning-Guided Execution," Gröninger et al. present a tool called ChangeGuard, which compares code behaviour before and after changes to determine whether the modifications alter functionality.

Context

Validating whether code changes preserve intended behaviour is a key challenge in software development, particularly when changes are deep within complex projects. Developers may make modifications to improve readability, performance, or to fix bugs, but unintended changes in functionality can lead to errors. Current methods, such as regression testing, often fail to catch these subtle changes. This study is relevant because it introduces a more reliable approach, ChangeGuard, which uses pairwise learning-guided execution.
This approach involves running two versions of a code snippet side by side and predicting any missing values so that the code can execute, even in complex scenarios.

Key Features of ChangeGuard
Pairwise learning-guided execution: Simultaneously executes old and new versions of code to compare their runtime behaviour.
Value injection: Predicts and injects missing or uninitialised values, ensuring the code executes smoothly and reaches all relevant paths.
High precision and recall: Achieves 77.1% precision and 69.5% recall in identifying behaviour-altering code changes.
Extensive evaluation: Tested on 224 manually annotated code changes and datasets generated by automated refactoring tools.
Outperforms regression tests: Traditional regression tests only achieved 7.6% recall in identifying semantics-changing code modifications.

What This Means for You

This paper will be most useful for software developers, especially those working with large and complex codebases. It provides practical insights into validating code changes more effectively than existing methods, offering a way to catch unintended behaviour early in the development process. Developers using automated refactoring tools or large language models like GPT-4 will particularly benefit from ChangeGuard's ability to detect subtle, behaviour-altering modifications.

Examining the Details

ChangeGuard's methodology is based on pairwise learning-guided execution, an extension of an existing technique. It predicts missing values dynamically, ensuring more execution paths are covered than previous approaches. The tool was evaluated on 224 annotated code changes from popular Python open-source projects, showing high accuracy in detecting semantics changes. Additionally, ChangeGuard was applied to automated refactoring tools and large language models like GPT-3.5 and GPT-4, where it found 87 out of 187 and 143 out of 258 code changes to unexpectedly alter behaviour.
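The basic idea of comparing behaviour before and after a change can be illustrated with a small differential check. This sketch is not ChangeGuard's implementation and leaves out the learning-guided value prediction entirely; `old_mean`, `new_mean`, and the two helper functions are hypothetical names made up for illustration.

```python
def old_mean(values):
    """Original version: returns 0 for an empty list."""
    if not values:
        return 0
    return sum(values) / len(values)


def new_mean(values):
    """'Refactored' version: the empty-list guard was dropped."""
    return sum(values) / len(values)


def run(func, arg):
    """Execute func and capture either its result or the exception type."""
    try:
        return ("ok", func(arg))
    except Exception as exc:
        return ("error", type(exc).__name__)


def behaviour_differences(old, new, inputs):
    """Return the inputs on which the two versions behave differently."""
    diffs = []
    for arg in inputs:
        before, after = run(old, arg), run(new, arg)
        if before != after:
            diffs.append((arg, before, after))
    return diffs


inputs = [[1, 2, 3], [4.0], []]
diffs = behaviour_differences(old_mean, new_mean, inputs)
for arg, before, after in diffs:
    print(f"input={arg!r}: before={before}, after={after}")
```

Only the empty list exposes the change: the old version returns 0, while the new one raises ZeroDivisionError. A check like this depends on someone supplying inputs that reach the changed code, which is the gap the paper's value prediction is described as addressing.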
Taken together, these evaluation results provide strong evidence for ChangeGuard's reliability and robustness.

You can learn more by reading the entire paper and accessing ChangeGuard.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 6: Language Translation Desktop App with the ChatGPT API and Microsoft Word” in the book, Building AI Applications with OpenAI APIs - Second Edition by Martin Yanev, published in October 2024.

Integrating the ChatGPT API with Microsoft Office

In this section, we will explore how to set up our project and install the docx Python library to extract text from Word documents. The docx library is a Python package that allows us to read and write Microsoft Word (.docx) files and provides a convenient interface to access information stored in these files.

The first step is to initiate your work by creating a new directory called Translation App and loading it with VSCode. This will enable you to have a dedicated area to craft and systematize your translation app code. Activate your virtual environment from the terminal window following the steps outlined in Chapter 1, Getting Started with the ChatGPT API for NLP Tasks.

To run the language translation desktop app, you will need to install the following libraries:

openai: The openai library allows you to interact with the OpenAI API and perform various NLP tasks
docx: The docx library allows you to read and write Microsoft Word .docx files using Python
tkinter: The tkinter library is a built-in Python library that allows you to create Graphical User Interfaces (GUIs) for your desktop app

As tkinter is a built-in library, there is no need for installation since it already exists within your Python environment. To install the openai and docx libraries, access the VSCode terminal, and then execute the following commands:

    pip install openai
    pip install python-docx

To access and read the contents of a Word document, you will need to create a sample Word file inside your project. Here are the steps to create a new Word file:

1. In your project, right-click on the project directory, select New Folder, and name it files.
2. Right-click on the files folder and select New File.
3. In the edit field that appears, enter a filename with the .docx extension – for example, info.docx.
4. Press the Enter key to create the file.
5. Once the file is created, open it using Microsoft Word.

You can now add some text or content to this file, which we will later access and read using the docx library in Python. For this example, we have created an article about New York City. You can find the complete article here: https://en.wikipedia.org/wiki/New_York_City. However, you can choose any Word document containing text that you want to analyze:

The United States’ most populous city, often referred to as New York City or NYC, is New York. In 2020, its population reached 8,804,190 people across 300.46 square miles, making it the most densely populated major city in the country and over two times more populous than the nation’s second-largest city, Los Angeles. The city’s population also exceeds that of 38 individual U.S. states. Situated at the southern end of New York State, New York City serves as the Northeast megalopolis and New York metropolitan area’s geographic and demographic center - the largest metropolitan area in the country by both urban area and population. Over 58 million people also live within 250 miles of the city. A significant influencer on commerce, health care and life sciences, research, technology, education, politics, tourism, dining, art, fashion, and sports, New York City is a global cultural, financial, entertainment, and media hub. It houses the headquarters of the United Nations, making it a significant center for international diplomacy, and is often referred to as the world’s capital.

Now that you have created the Word file inside your project, you can move on to the next step, which is to create a new Python file called app.py inside the Translation App root directory. This file will contain the code to read and manipulate the contents of the Word file using the docx library. With the Word file and the Python file in place, you are ready to start writing the code to extract data from the document and use it in your application.

To test whether we can read Word files with the python-docx library, we can implement the following code in our app.py file:

    import docx

    # Open the Word document
    doc = docx.Document("<full_path_to_docx_file>")

    # Concatenate the text of every paragraph into one string
    text = ""
    for para in doc.paragraphs:
        text += para.text
    print(text)

Make sure to replace <full_path_to_docx_file> with the actual path to your Word document file. Obtaining the file path is a simple task, achieved by right-clicking on your .docx file in VSCode and selecting the Copy Relative Path option from the drop-down menu.

Once you have done that, run the app.py file and verify the output. This code will read the contents of your Word document and print them to the console. If the text extraction works correctly, you should see the text of your document printed in the console (see Figure 6.1). The text variable now holds the data from info.docx as a Python string.

Figure 6.1 – Word text extraction console output

Packt library subscribers can continue reading the entire book for free. You can buy Building AI Applications with OpenAI APIs - Second Edition, here.

Get the eBook for $31.99 $21.99!
Get the Print Book for $39.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
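Returning to the Word-extraction excerpt above: the loop there appends each paragraph's text directly, so paragraph boundaries are lost in the resulting string. The following is a minimal sketch of one way to keep them, assuming the python-docx package is installed and that one paragraph per line suits your later processing. The helper names join_paragraphs and read_docx_text are hypothetical and do not come from the book.

```python
from types import SimpleNamespace


def join_paragraphs(paragraphs, separator="\n"):
    """Join the .text of each paragraph, skipping blank paragraphs."""
    return separator.join(p.text for p in paragraphs if p.text.strip())


def read_docx_text(path):
    """Return the text of a .docx file, one paragraph per line.

    Assumes python-docx is installed (pip install python-docx).
    """
    import docx  # imported lazily so join_paragraphs works without it

    document = docx.Document(path)
    return join_paragraphs(document.paragraphs)


# Stand-in objects with a .text attribute, mimicking doc.paragraphs,
# so the joining logic can be tried without a Word file.
sample = [
    SimpleNamespace(text="First paragraph."),
    SimpleNamespace(text=""),
    SimpleNamespace(text="Second paragraph."),
]
print(join_paragraphs(sample))
```

With a real document, calling read_docx_text with the path to info.docx would replace the loop in app.py; whether blank paragraphs should be skipped is a design choice that depends on the document.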


Divya Anne Selvaraj

15 Oct 2024

PythonPro #51: Python 3.13 REPL Enhancements, Python 3.12 vs. 3.13, and Visualizing Named Entities in Text


Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#51

Notion for Startups
Thousands of startups use Notion as a connected workspace to create and share docs, take notes, manage projects, and organize knowledge, all in one place.
We’re offering 6 months of new Plus plans, including unlimited Notion AI so you can try it all for free!
To redeem the Notion for Startups offer:
1. Submit an application using our custom link: https://ntn.so/packt and select Packt on the partner list.
2. Include our partner key: STARTUP4110P19151
Get your Free 6-month Notion Plus Access!

Hi,
Welcome to a brand new issue of PythonPro!
In today’s Expert Insight we bring you an excerpt from the recently published book, Python Natural Language Processing Cookbook - Second Edition, which explains how to use the displaCy library from spacy to visualize named entities in text.
News Highlights: PEP 762 in Python 3.13 adds multi-line editing, syntax highlighting, and custom commands to the REPL, and Pyinstrument 5 introduces a flamegraph timeline view for better code execution visualization.
Here are my top 5 picks from our learning resources today:
Python 3.12 vs Python 3.13 – performance testing⚡️
Exploring Infrastructure as Code (IaC) with Python: AWS CDK, Terraform CDK, and Pulumi🏗️
lintsampler: a new way to quickly get random samples from any distribution🎲
Python and SysV shared memory🧠
Gradient-Boosting anything (alert: high performance)🚀
And today’s Featured Study presents a method using LLMs to generate precise, transparent code transformations, improving accuracy and efficiency for compiler optimizations and legacy refactoring.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S.: This month's survey is still live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.

Your cloud deserves dedicated data protection
94% of cloud tenants were targeted last year, and 62% were successfully compromised. The
hard truth is that organizations are having a hard time securing their cloud data, and cyberattackers are ready to exploit that challenge.
Here’s a handy resource you’ll want with you as you map out your plan: Orchestrating the Symphony of Cloud Data Security.
You’ll learn how to overcome the challenges of securing data in the cloud, navigate multi-cloud data security, and balance data security with cloud economics.
Download Your Complimentary Copy Now
Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
PEP 762 – REPL-acing the default REPL: As of Python 3.13, the default REPL has been replaced with a Python-based version (PEP 762), offering modern features like multi-line editing, syntax highlighting, and custom commands.
Pyinstrument 5 - Flamegraphs for Python: The new version of the Python statistical profiler introduces a flamegraph-style timeline view for visualizing code execution, improves on previous timeline modes, and more.

💼Case Studies and Experiments🔬
Moving all our Python code to a monorepo: pytendi: Describes the migration of Attendi’s Python codebase into a monorepo using the Polylith architecture to improve code discoverability, reusability, and developer experience.
How Maintainable is Proficient Code?
A Case Study of Three PyPI Libraries: Aims to help you recognize when proficient coding might hinder future maintenance efforts.

📊Analysis
In the Making of Python Fitter and Faster: Provides insights into how Python's evolving interpreter architecture enhances execution speed, memory efficiency, and overall performance for modern applications.
Python 3.12 vs Python 3.13 – performance testing: Tests on AMD Ryzen 7000 and Intel 13th-gen processors show Python 3.13 generally performs faster, especially in asynchronous tasks, but there are slowdowns in certain areas.

🎓Tutorials and Guides🤓
Build a Contact Book App With Python, Textual, and SQLite: Covers creating the app’s text-based user interface (TUI), setting up a SQLite database for contact storage, and integrating both elements.
Syntactic Sugar: Why Python Is Sweet and Pythonic: Covers various Pythonic constructs like operators, assignment expressions, loops, comprehensions, and decorators, and shows how they simplify code.
The Ultimate Guide to Error Handling in Python: Provides a comprehensive guide to Python error handling, exploring common patterns like "Look Before You Leap" (LBYL) and "Easier to Ask Forgiveness than Permission" (EAFP).
Exploring Infrastructure as Code (IaC) with Python: AWS CDK, Terraform CDK, and Pulumi: Explains how Python integrates with IaC tools to automate cloud infrastructure management.
Web scraping of a dynamic website using Python with HTTP Client: Walks you through analyzing sites with JavaScript-rendered content and using the Crawlee framework to extract data in JSON format.
lintsampler: a new way to quickly get random samples from any distribution: Introduces a Python package designed to easily and efficiently generate random samples from any probability distribution.
Mastering Probability with Python: A Step-by-Step Guide with Simulations: Through examples like coin tosses, dice rolls, and event probabilities, this tutorial guides you on how to simulate and analyze real-world scenarios.

🔑Best Practices and Advice🔏
What's In A List—Yes, But What's *Really* In A List: Explains common pitfalls when multiplying lists and why it matters when working with mutable versus immutable data types.
Yes, you need to duplicate your frontend business logic on the server: Explains why backend validation is essential to protect data integrity, regardless of frontend sophistication.
Python and SysV shared memory: Explains how to wrap C functions like shmget, shmat, and shmctl for shared memory management, handle void pointers, and perform basic operations like writing to shared memory.
Gradient-Boosting anything (alert: high performance): Explores using Gradient Boosting with various machine learning models, adapting LSBoost in the Python package mlsauce for both regression and classification tasks.
Code Generation with ChatGPT o1-preview as a Story of Human-AI Collaboration: Through experiments in Python and C++, the author demonstrates that human-AI collaboration improves code generation, specifically in building sentiment analysis tools.

🔍Featured Study: Don't Transform the Code, Code the Transforms💥
In "Don't Transform the Code, Code the Transforms: Towards Precise Code Rewriting using LLMs," researchers from Meta, Cummins et al., introduce a novel method called Code the Transforms (CTT), which leverages LLMs to generate precise code transformations rather than directly rewriting code.

Context
Code transformation refers to rewriting or optimising existing code, a task essential for compiler optimisations, legacy code refactoring, or performance improvements. Traditional rule-based approaches to code transformations are difficult to implement and maintain. LLMs offer the potential to automate this process, but direct code rewriting by LLMs lacks precision and is challenging to debug.
This study introduces the CTT method, where LLMs generate the transformation logic, making the process more transparent and adaptable.

Key Features of the CTT Method
Chain-of-thought process: The method synthesises code transformations by iterating through input/output examples to create a precise transformation logic rather than rewriting code directly.
Improved transparency and adaptability: The generated transformations are explicit, making them easier to inspect, debug, and modify when necessary.
Higher precision: The method achieved perfect precision in 7 out of 16 Python code transformations, significantly outperforming traditional direct rewriting approaches.
Reduced computational costs: By generating transformation logic instead of rewriting code, the method requires less compute and review effort compared to direct LLM rewriting.
Iterative feedback loop: The method incorporates execution and feedback to ensure the generated transformations work as expected, leading to more reliable outcomes.

What This Means for You
This study is particularly beneficial for software engineers, developers, and those working on compiler optimisations or legacy code refactoring. By using this method, teams can reduce the time spent on manual code review and debugging, while improving the precision of code transformations.

Examining the Details
The study's methodology involved testing 16 different Python code transformations across a variety of tasks, ranging from simple operations like constant folding to more complex transformations such as converting dot products to PyTorch API calls. The CTT method achieved an overall F1 score of 0.97, compared to the 0.75 achieved by the direct rewriting method. The precision of transformations ranged from 93% to 100%, with tasks like dead code elimination and redundant function elimination reaching near-perfect performance.
In contrast, the traditional direct LLM rewriting approach showed an average precision of 60% and was prone to more frequent errors, requiring manual correction.
You can learn more by reading the entire paper.

🧠 Expert insight💥
Here’s an excerpt from “Chapter 7: Visualizing Text Data” in the book, Python Natural Language Processing Cookbook - Second Edition by Zhenya Antić and Saurabh Chakravarty, published in September 2024.

Visualizing NER
Named entity recognition, or NER, is a very useful tool for quickly finding people, organizations, locations, and other entities in texts. In order to visualize them better, we can use the displacy package to create compelling and easy-to-read images.
After working through this recipe, you will be able to create visualizations of named entities in a text using different formatting options and save the results in a file.

Getting ready
The displaCy library is part of the spacy package. You need at least version 2.0.12 of the spacy package for displaCy to work. The version in the poetry environment and requirements.txt file is 3.6.1.
The notebook is located at https://github.com/PacktPublishing/Python-Natural-Language-Processing-Cookbook-Second-Edition/blob/main/Chapter07/7.3_ner.ipynb.

How to do it...
We will use spacy to parse the sentence and then the displacy engine to visualize the named entities:

1. Import both spacy and displacy:

    import spacy
    from spacy import displacy

2. Run the language utilities file:

    %run -i "../util/lang_utils.ipynb"

3. Define the text to process:

    text = """iPhone 12: Apple makes jump to 5G
    Apple has confirmed its iPhone 12 handsets will be its first to work on faster 5G networks.
    The company has also extended the range to include a new "Mini" model that has a smaller 5.4in screen.
    The US firm bucked a wider industry downturn by increasing its handset sales over the past year.
    But some experts say the new features give Apple its best opportunity for growth since 2014, when it revamped its line-up with the iPhone 6.
    "5G will bring a new level of performance for downloads and uploads, higher quality video streaming, more responsive gaming,
    real-time interactivity and so much more," said chief executive Tim Cook.
    There has also been a cosmetic refresh this time round, with the sides of the devices getting sharper, flatter edges.
    The higher-end iPhone 12 Pro models also get bigger screens than before and a new sensor to help with low-light photography.
    However, for the first time none of the devices will be bundled with headphones or a charger."""

4. In this step, we process the text using the small model. This gives us a Doc object. We then modify the object to contain a title. This title will be part of the NER visualization:

    doc = small_model(text)
    doc.user_data["title"] = "iPhone 12: Apple makes jump to 5G"

5. Here, we set up color options for the visualization display. We set green for the ORG-labeled text and yellow for the PERSON-labeled text. We then set the options variable, which contains the colors. Finally, we use the render command to display the visualization. As arguments, we provide the Doc object and the options we previously defined. We also set the style argument to "ent", as we would like to display just entities. We set the jupyter argument to True in order to display directly in the notebook:

    colors = {"ORG": "green", "PERSON": "yellow"}
    options = {"colors": colors}
    displacy.render(doc, style='ent', options=options, jupyter=True)

The output should look like that in Figure 7.4.

Figure 7.4 – Named entities visualization

6. Now we save the visualization to an HTML file. We first define the path variable. Then, we use the same render command, but we set the jupyter argument to False this time and assign the output of the command to the html variable.
We then open the file, write the HTML, and close the file:

    path = "../data/ner_vis.html"
    html = displacy.render(doc, style="ent", options=options, jupyter=False)
    html_file = open(path, "w", encoding="utf-8")
    html_file.write(html)
    html_file.close()

This will create an HTML file with the entities visualization.

Packt library subscribers can continue reading the entire book for free. You can buy Python Natural Language Processing Cookbook - Second Edition, here.
Get the eBook for $35.99 $17.99!
Get the Print Book for $44.99 $30.99!

And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.
If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
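The save step in the NER recipe above opens, writes, and closes the file by hand. As a hedged alternative, a with block closes the file even if the write raises an error. The sketch below uses a placeholder HTML string in place of the displacy.render(...) output so that it runs without spaCy; the save_html name and the temporary output folder are assumptions for illustration, not part of the book's code.

```python
import tempfile
from pathlib import Path


def save_html(html: str, path) -> Path:
    """Write an HTML string to path as UTF-8 and return the path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # create missing folders
    with open(path, "w", encoding="utf-8") as html_file:
        html_file.write(html)  # the file is closed when the block exits
    return path


# Placeholder standing in for the string returned by
# displacy.render(doc, style="ent", options=options, jupyter=False)
html = "<div class='entities'>Apple <mark>ORG</mark></div>"

out_dir = Path(tempfile.mkdtemp())  # throwaway folder for the demo
saved = save_html(html, out_dir / "data" / "ner_vis.html")
print(saved.read_text(encoding="utf-8"))
```

In the recipe itself, the html variable from the render call and the "../data/ner_vis.html" path would be passed to the helper instead of the placeholders.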


Divya Anne Selvaraj

08 Oct 2024

PythonPro #50: Python 3.13 Arrives, Offensive Security Practices, and Jupyter Notebook Tips


Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#50: Python 3.13 Arrives, Offensive Security Practices, and Jupyter Notebook Tips

Hi,
Welcome to a brand new issue of PythonPro!
In today’s Expert Insight we bring you an excerpt from the recently published book, Offensive Security Using Python, which briefly discusses key practices such as input validation, secure authentication, session management, secure coding techniques, and the implementation of security headers.
News Highlights: Python 3.13.0, released yesterday, adds an interactive interpreter, free-threaded mode, JIT compiler, and iOS/Android support; and Rev's Reverb models for ASR and diarization outperform other open-source models.
Here are my top 5 picks from our learning resources today:
10 Jupyter Notebook Features You Didn’t Know Exist📓
A Guide to Modern Python String Formatting Tools🔠
Modeling customers' decisions in Python with the Choice-Learn package🛍️
Understanding Logarithmic Plots in Matplotlib: semilogx, semilogy, and loglog📈
Best practices for securely consuming open source in Python — Ciara Carey🔐
And today’s Featured Study evaluates the performance of AI models in geospatial code generation, revealing significant challenges in handling complex tasks, specific data formats, and specialised libraries.
Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief
P.S.: This month's survey is still live. Do take the opportunity to leave us your feedback, request a learning resource, and earn your one Packt credit for this month.
Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
Python 3.13.0 Is Released: Released on October 7, 2024, the version includes a new interactive interpreter, free-threaded mode, a JIT compiler, and support for iOS and Android platforms.
Introducing Reverb: The Future of Open-Source automatic speech recognition (ASR) and Diarization: Rev's new open-source models for ASR and speech diarization, built using Rev’s extensive human-transcribed English speech
dataset, outperform existing open-source models.

💼Case Studies and Experiments🔬
Using Kolmogorov-Arnold Networks (KAN) and Backtesting to Predict Stock Prices: Discusses predicting stock prices, focusing on deep learning models trained on historical data from Yahoo Finance.
🎥Marketing Media Mix Models with Python & PyMC: a Case Study [PyCon DE & PyData Berlin 2024]: Discusses how machine learning models can optimize marketing investments by analyzing various channels.

📊Analysis
10 Jupyter Notebook Features You Didn’t Know Exist: Discusses features including magic commands, interactive widgets, auto-reload for modules, in-notebook documentation, and collapsible headings.
I Used Claude.ai to Create a Discord Bot — Here’s What I Learned About the State of AI Code Writing: Discusses the author's experience using Claude to rapidly generate Python code for a bot that deletes old Discord messages.

🎓Tutorials and Guides🤓
A Guide to Modern Python String Formatting Tools: Explains how to format values, create custom format specifiers, and embed expressions in strings.
Read to learn practical techniques for dynamic string manipulation.
DuckDB in Python in the Browser with Pyodide, PyScript, and JupyterLite: Shows you how to run DuckDB in Python within a browser environment and embed interactive Python environments in web pages.
Tutorial: Creating a Twitter (X) Bot using Python: Explains how to build and deploy a Python-based Twitter (X) bot that autonomously tweets updates, including progress graphs, using the X API.
Distilling python functions into LLM: Explains how to use the Instructor library to distill Python functions into a language model, enabling fine-tuning for function emulation using Pydantic type hints.
Getting Started with Powerful Data Tables in Your Python Web Apps: Demonstrates building a finance app that fetches stock data, displays it interactively, and includes features like sorting and graph visualization.
Modeling customers' decisions in Python with the Choice-Learn package: Introduces the Choice-Learn Python package, which simplifies implementing discrete choice models like Conditional Logit to predict customer decisions.
Optimizing Inventory Management with Reinforcement Learning: A Hands-on Python Guide: Outlines how Q-learning helps balance holding and stockout costs by developing an optimal ordering policy.

🔑Best Practices and Advice🔏
Speeding up CRC-32 calculations in Mojo: Discusses speeding up CRC-32 calculations in Mojo, achieving an 18x improvement over Python's native implementation while remaining about 3x slower than the zlib library.
Bad Schemas could break your LLM Structured Outputs: Explains how choosing the right response model dramatically impacts the performance of language models like GPT-4o and Claude, especially when using JSON mode or Tool Calling.
Implementing a Python Singleton with Decorators: Explains how a decorator ensures only one instance of a class is created, using a _SingletonWrapper class to handle instantiation and simplify global access.
🎥Best practices for securely consuming open source in Python — Ciara Carey: Introduces a framework called Secure Supply Chain Consumption Framework (S2C2F) to help organizations improve open-source security.
Understanding Logarithmic Plots in Matplotlib: semilogx, semilogy, and loglog: Walks you through plotting data with a logarithmic x-axis, y-axis, and both axes, respectively, and provides code snippets to generate these plots.

🔍Featured Study: Current AI Models Fall Short in Geospatial Code Generation💥
In "Evaluation of Code LLMs on Geospatial Code Generation," Gramacki et al. introduce a benchmark to assess LLMs' ability to handle tasks involving spatial reasoning and data processing.

Context
LLMs generate code based on natural language inputs and are effective in general programming tasks, particularly in data science. Geospatial data science is a field focused on analysing spatial data tied to locations. It relies on libraries like GeoPandas and Shapely for tasks such as geo-coding, spatial analysis, and data visualisation. However, the domain poses unique challenges for LLMs due to the need for spatial reasoning and the use of specialised tools, making evaluation in this area crucial.
As geospatial applications expand in industries such as urban planning and environmental science, reliable AI assistance is becoming increasingly important.

Key Findings
LLMs underperform in geospatial tasks: Models like Code Llama and Starcoder2 show reduced accuracy compared to their performance in general coding.
Starcoder2-7B leads but struggles: It achieved a pass@1 score of 32.47%, highlighting the difficulty of geospatial tasks even for top-performing models.
Complex tasks pose a challenge: Single-step tasks had a 45.45% pass@1 success rate, but multi-step tasks were far more difficult, scoring only 15.15%.
Data format matters: Models handled GeoDataFrames better than other formats like GeoJSON, showing varying levels of tool proficiency.
Limited tool support: Libraries like MovingPandas and OSMNX, crucial for geospatial analysis, were inadequately supported by the models.

What This Means for You
This study is relevant for geospatial programmers and data scientists seeking to automate coding tasks. Current LLMs are not yet reliable for complex geospatial tasks, highlighting a need for models specifically trained for the domain. Developers and researchers can benefit by focusing on improving AI models to better support geospatial data science workflows.

Examining the Details
The authors created a benchmark dataset categorising tasks by complexity, data format, and tool usage. The dataset includes 77 samples to test LLM performance on tasks like spatial reasoning and tool implementation. Evaluation metrics focused on accuracy and pass@1, with the results highlighting the models' struggles in handling geospatial problems. Libraries like GeoPandas and H3 were used to evaluate the models, while more complex tools like MovingPandas exposed the models' weaknesses.
This rigorous benchmark, publicly available for future research, sets a foundation for improving geospatial code generation in LLMs.
The study’s methodology ensures it reflects real-world geospatial coding challenges, offering valuable insights for the development of more domain-specific AI tools.
You can learn more by reading the entire paper and accessing the benchmark dataset: geospatial-code-llms-dataset.

🧠 Expert insight💥
Here’s an excerpt from “Chapter 3: An Introduction to Web Security with Python” in the book, Offensive Security Using Python by Rejah Rehim and Manindar Mohan, published in September 2024.

Proactive web security measures with Python
Python has developed into a versatile, widely used programming language in the field of modern software development. Its ease of use, readability, and rich library support have made it a popular choice for developing web-based applications in a variety of industries. Python frameworks such as Django, Flask, and Pyramid have enabled developers to create dynamic and feature-rich web applications with speed and agility.
However, as Python web apps become more popular, there is a corresponding increase in the sophistication and diversity of attacks targeting these applications. Cybersecurity breaches can jeopardize valuable user data, interfere with corporate operations, and damage an organization’s brand. Python web applications can be exposed to a variety of security vulnerabilities, including SQL injection, cross-site scripting (XSS), and cross-site request forgery (CSRF). The consequences of these vulnerabilities can be severe, demanding an effective cybersecurity strategy.
Developers must be proactive to counteract this.
By implementing security practices such as input validation, output encoding, and other secure coding guidelines early in the development lifecycle, developers can reduce the attack surface and improve the resilience of their Python web applications.
Although we are only discussing Python-based applications here, these practices are universal and should be implemented in web applications built with any technology stack.
To protect against a wide range of cyber threats, it is critical to implement strong best practices. This section explains key security practices that developers should follow while developing web apps.

Input validation and data sanitization
User input validation is essential for preventing code injection attacks. Malicious inputs can exploit vulnerabilities and cause unwanted commands to be executed. Proper data sanitization guarantees that user inputs are handled as data rather than executable code by eliminating or escaping special characters. Python’s built-in input() function and frameworks such as Flask (through its request object) are common entry points for incoming data, which should be validated and sanitized before use.

Secure authentication and authorization
Restricting unauthorized access requires effective authentication and authorization procedures. Password hashing, which uses algorithms such as bcrypt or Argon2, adds an extra degree of security by ensuring that plaintext passwords are never saved. Two-factor authentication (2FA) adds an additional verification step to user authentication, increasing security. Role-Based Access Control (RBAC) allows developers to provide specific permissions to different user roles, guaranteeing that users only access functionality relevant to their responsibilities.

Secure session management
Keeping user sessions secure is critical for avoiding session fixation and hijacking attempts. Using secure cookies with the HttpOnly and Secure attributes prohibits client-side script access and ensures that cookies are only sent over HTTPS.
Session timeouts and measures such as session rotation can improve session security even further.

Secure coding practices
Following secure coding practices reduces a slew of possible vulnerabilities. Parameterized queries, made possible by libraries such as sqlite3, protect against SQL injection by separating data from SQL commands. Output encoding, achieved with techniques such as html.escape(), avoids XSS threats by converting user inputs to innocuous text. Similarly, avoiding functions such as eval() and exec() prevents uncontrolled code execution, lowering the likelihood of code injection attacks.

Implementing security headers
Security headers are a fundamental component of web application security. They are HTTP response headers that provide instructions to web browsers, instructing them on how to behave when interacting with the web application. Properly configured security headers can mitigate various web vulnerabilities, enhance privacy, and protect against common cyber threats.
Here is an in-depth explanation of implementing security headers to enhance web application security:

Content Security Policy (CSP): CSP is a security feature that helps prevent XSS attacks. By defining and specifying which resources (scripts, styles, images, etc.) can be loaded, CSP restricts script execution to trusted sources. Implementing CSP involves configuring the Content-Security-Policy HTTP header in your web server. This header helps prevent inline scripts and unauthorized script sources from being executed, reducing the risk of XSS attacks significantly. An example of the CSP header is as follows:

    Content-Security-Policy: default-src 'self'; script-src 'self' www.google-analytics.com;

HTTP Strict Transport Security (HSTS): HSTS is a security feature that ensures secure, encrypted communication between the web browser and the server. It prevents Man-in-the-Middle (MITM) attacks by enforcing the use of HTTPS.
Once a browser has visited a website with HSTS enabled, it will automatically establish a secure connection for all future visits, even if the user attempts to access the site via HTTP. An example HSTS header is as follows:

    Strict-Transport-Security: max-age=31536000; includeSubDomains; preload;

X-Content-Type-Options: The X-Content-Type-Options header prevents browsers from interpreting files as a different media type, also known as a Multipurpose Internet Mail Extensions (MIME) type, than the one the server declares. It mitigates attacks such as MIME sniffing, where an attacker can trick a browser into interpreting content in an unintended way, potentially leading to security vulnerabilities. An example X-Content-Type-Options header is as follows:

    X-Content-Type-Options: nosniff

X-Frame-Options: The X-Frame-Options header prevents clickjacking attacks by denying the browser permission to display a web page in a frame or iframe. This header ensures that your web content cannot be embedded within malicious iframes, protecting against UI redressing attacks. An example X-Frame-Options header is as follows:

    X-Frame-Options: DENY

Referrer-Policy: The Referrer-Policy header controls what information is included in the Referer header when a user clicks on a link that leads to another page. By setting an appropriate referrer policy, you can protect sensitive information, enhance privacy, and reduce the risk of data leakage. An example Referrer-Policy header is as follows:

    Referrer-Policy: strict-origin-when-cross-origin

Packt library subscribers can continue reading the entire book for free. You can buy Offensive Security Using Python, here.
Get the eBook for $39.99 $27.98!
Get the Print Book for $49.99 $34.98!

Other Python titles from Packt at 30% off
Get the eBook for $39.99 $27.98!
Get the eBook for $35.99 $24.99!
Get the eBook for $27.99 $18.99!

And that’s a wrap.
We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here.
The complete PythonPro archives can be found here. If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
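To make the secure coding and security header practices from the excerpt above more concrete, here is a minimal standard-library sketch. It assumes a hypothetical users table and hypothetical helper names (find_user, render_comment, with_security_headers); the header values are copied from the excerpt's examples, and a real application would normally set them through its web framework or server configuration rather than a plain dictionary.

```python
import html
import sqlite3

# Header values copied from the examples in the excerpt.
SECURITY_HEADERS = {
    "Content-Security-Policy": (
        "default-src 'self'; script-src 'self' www.google-analytics.com;"
    ),
    "Strict-Transport-Security": "max-age=31536000; includeSubDomains; preload;",
    "X-Content-Type-Options": "nosniff",
    "X-Frame-Options": "DENY",
    "Referrer-Policy": "strict-origin-when-cross-origin",
}


def with_security_headers(headers: dict) -> dict:
    """Return a copy of the response headers with security headers added."""
    merged = dict(headers)
    merged.update(SECURITY_HEADERS)
    return merged


def find_user(conn: sqlite3.Connection, username: str):
    """Look up a user with a parameterized query (no string formatting)."""
    cursor = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cursor.fetchone()


def render_comment(comment: str) -> str:
    """Escape user input before embedding it in HTML output."""
    return "<p>" + html.escape(comment) + "</p>"


# Hypothetical in-memory table used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.execute("INSERT INTO users (username) VALUES (?)", ("alice",))

print(find_user(conn, "alice"))               # matching row
print(find_user(conn, "alice' OR '1'='1"))    # treated as data, so no match
print(render_comment("<script>alert(1)</script>"))
print(with_security_headers({"Content-Type": "text/html"}))
```

Because the username is passed as a bound parameter, the injection attempt is compared as a literal string and finds no row, and html.escape() turns the script tag into inert text.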


Divya Anne Selvaraj

24 Sep 2024

PythonPro #48: Python 3.13 JIT, Boosting Model Inference, and FastAPI Best Practices


Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#48: Python 3.13 JIT, Boosting Model Inference, and FastAPI Best Practices

3 Days. 25+ AI Experts. 30+ Sessions.

Join the Generative AI In Action conference from Nov 11-13 (LIVE | Virtual) and gain insights from top AI leaders across over 30 sessions. Explore key topics including GenAI tools, AI Agents, Open-Source LLMs, Small Language Models, LLM fine-tuning, and many more! This is your opportunity to dive deep into cutting-edge AI strategies and technologies. Save 40% with our Early Bird offer using code BIGSAVE40 – don’t miss out! Secure Your Seat Today!

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, Machine Learning and Generative AI for Marketing, which discusses how to create effective prompts for Zero-Shot Learning to generate high-quality marketing content.

News Highlights: Opik, a new open-source LLM evaluation tool, integrates with CI/CD, and Model2Vec, a newly launched library, boosts CPU inference 500x and cuts model size by 15x.

Here are my top 5 picks from our learning resources today:
Frankenstein’s Ice cream shop 🍦
Python 3.13 Preview: Free Threading and a JIT Compiler ⚙️
Graph RAG into Production - Step-by-Step 🧩
FastAPI Best Practices and Design Patterns - Building Quality Python APIs 🛠️
From Spreadsheets to SDMX Effortless with Python and .Stat Suite 📊

And today’s Featured Study examines the performance of open-source models like Mistral and LLaMa and provides insights into the hardware needed for efficient deployment, using GPUs and optimisation techniques such as quantization.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: With this issue, we have finished covering all content requests made through the September feedback survey. Stay tuned for next month's survey.

What changed in the way you code for 2024?
What has happened in the tech world in the last months? Take this shorter version of the Developer Nation survey, learn about new tools, influence the future of development and share your insights with the world!

What’s in it for you?
A virtual goody bag packed with cool resources
The more questions you answer, the more chances you have to win amazing prizes, including a Samsung Galaxy Watch 7!

Take the Survey now!

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

Opik, an open source LLM evaluation framework: The platform can be used for developing, evaluating, and monitoring LLM applications and offers features such as LLM call tracing, annotation, automated evaluation, and integration into CI/CD pipelines.

Model2Vec: Distill a Small Fast Model from any Sentence Transformer: Model2Vec is a Python library that distills sentence transformers into small static embeddings, making inference 500x faster on CPU and reducing model size by 15x.

💼Case Studies and Experiments🔬

Integrated Python and GIS Approach for Geomorphometric Investigation of Man River Basin, Western Madhya Pradesh, India: Analyzes the tectonic influence on the Man River Basin's development using satellite imagery, GIS software, and Python to compute and study geomorphometric indices.

Frankenstein’s Ice cream shop: Details how to automate the cleaning of messy Excel sheets using Python's Pandas library, focusing on a made-up ice cream sales commission dataset.

📊Analysis

The Python Package Index Should Get Rid Of Its Training Wheels: Discusses the challenges of PyPI's exponentially growing storage needs, particularly due to prebuilt binaries, and suggests leveraging modern build tools.

UV - I am (somewhat) sold: Initially skeptical, the author of this article found UV useful for handling multiple Python versions, dependency management, and simplifying their development setup.

🎓Tutorials and Guides🤓

Python 3.13 Preview: Free Threading and a JIT Compiler: Demonstrates the key new features in Python 3.13, including free threading, which makes the GIL optional, and a JIT compiler that compiles Python code into machine code.

Graph RAG into Production - Step-by-Step: Discusses how to implement Graph Retrieval-Augmented Generation (Graph RAG) in production using a fully serverless, parallelized approach without using a graph database.

Python Virtual Environments: A Primer: Covers how to create, activate, and manage virtual environments, explaining their importance for isolating dependencies, avoiding conflicts, and ensuring reproducibility.

Python for Network Programming - A Beginner’s Overview: Explains key concepts such as sockets and the TCP and UDP protocols, and walks you through practical examples of building TCP and UDP client-server applications.

Mastering ChatGPT’s Function Call API - The Smart Way and the… Not-So-Smart Way (in Python): Explains how to use ChatGPT's function call API for automating tasks in Python.

Git With Python HowTo GitPython Tutorial And PyGit2 Tutorial: Covers installation, exception handling, and common tasks like cloning, committing, branching, tagging, and pushing changes.

🎥Program a RAG LLM Chat App with LangChain + Streamlit + *o1, GPT-4o and Claude 3.5 API: Covers loading custom documents, integrating website content into LLM queries, and creating a web app that enables users to interact with GPT-4 and Claude models.

🔑Best Practices and Advice🔏

Counting Sheep with Contracts in Python: Discusses using code contracts to enhance software development by ensuring preconditions and postconditions are met, making the code safer and easier to maintain.

FastAPI Best Practices and Design Patterns - Building Quality Python APIs: Discusses applying SOLID principles and design patterns like DAO and Service Layer to build clean, maintainable, and scalable APIs using FastAPI.

Recently I read a few articles and have a few questions: Covers managing dependencies without tools like Poetry, and handling Python version installations, particularly when a preferred version lacks an official installer.

Unlocking the Magic of Docstrings: Introduces the power of Python docstrings for documenting code, enhancing readability, and providing functionality like automatic documentation generation and testing.

From Spreadsheets to SDMX Effortless with Python and .Stat Suite: Highlights the importance of SDMX adoption for efficient data sharing among institutions and presents a step-by-step case study using World Bank data.

🔍Featured Study: Deploying Open-Source Large Language Models Efficiently💥

The study "Deploying Open-Source Large Language Models: A Performance Analysis", conducted by Bendi-Ouis et al., compares the performance of open-source large language models. The study aims to assist organisations in evaluating the hardware requirements for efficiently deploying models like Mistral and LLaMa.

Context

Since the release of ChatGPT in November 2022, there has been growing interest in deploying large language models. Many organisations and institutions are keen to harness LLMs, but the computational demands remain a challenge. While proprietary models require substantial resources, open-source models like Mistral and LLaMa provide alternatives that may be deployed with less hardware.
This study explores how different hardware configurations and optimisation techniques, such as quantization, can make these models more accessible for public and private entities.

Key Findings

The study used two types of GPUs, NVIDIA V100 16GB and NVIDIA A100 40GB, with tests conducted on models like Mistral-7B, Codestral-22B, Mixtral-8x7B, Mixtral-8x22B, and LLaMa-3-70B.
Mistral-7B generated 119 tokens in 1.9 seconds with one request, but took 72.1 seconds with 128 requests on two V100 16GB GPUs.
Codestral-22B produced 63 tokens in 2.3 seconds with one request, but took 96.2 seconds with 128 requests on an A100 40GB GPU.
Larger models like Mixtral-8x22B and LLaMa-3-70B faced slower generation times as context size and simultaneous requests increased.
Quantizing models to 4 or 6 bits helped reduce the memory load while maintaining performance, with negligible loss in accuracy for models with up to 70 billion parameters.

What This Means for You

For organisations and developers seeking to deploy LLMs, this analysis provides valuable insights into the hardware requirements and optimisation techniques necessary for efficient deployment. With moderate hardware investments, open-source models can perform competitively, reducing dependency on proprietary systems and enabling better control over digital resources. This supports digital sovereignty and cost-effective deployment of advanced AI technologies.

Examining the Details

The researchers focused on GPU performance and model quantization to measure how efficiently LLMs could be deployed. Using vLLM, a Python library designed for inference optimisation, the study tested multiple models and configurations. For instance, Mistral-7B, when run on two V100 16GB GPUs, showed an increase in response time with higher numbers of simultaneous requests, highlighting the challenge of scaling for larger user bases.

Quantization emerged as a key method to reduce computational load, allowing models to use less memory by lowering precision from 16 or 32 bits to 4 or 8 bits. This method was effective for larger models, maintaining performance without significant loss in accuracy.

The study concluded that, although proprietary solutions like ChatGPT require significant resources, open-weight models like Mistral and LLaMa can deliver strong performance with commercially available GPUs. By deploying these models with vLLM and quantization techniques, organisations can achieve scalable, efficient AI deployment without excessive hardware costs.

You can learn more by reading the entire paper here.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 9: Creating Compelling Content with Zero-Shot Learning” in the book, Machine Learning and Generative AI for Marketing by Yoon Hyup Hwang and Nicholas C. Burtch, published in August 2024.

Creating an effective prompt

Creating an effective prompt is the most crucial step in leveraging ZSL for marketing copy. In ZSL, the prompt effectively becomes the instruction manual for a model, telling it what kind of content to generate, as well as its style, tone, and substance.

The following are some guidelines around how to formulate prompts that will elicit the best possible marketing copy content from the model:

Clarity: Ensure that your prompt is specific about what you want, whether it’s a product description, headline, or call to action.
Contextual: Provide sufficient background to guide a model. For eco-friendly products, mention key selling points like sustainability or biodegradability.
Creative: While clarity is crucial, leaving room for creativity can yield surprising and innovative results. Phrases like “Imagine...” or “Create a story where...” can be particularly powerful.
Concise: Lengthy prompts can dilute the focus. Aim for brevity while including essential details, ensuring that a model stays on topic.

In the following sections, we will illustrate the impact of prompt quality through examples, with different types of marketing copy. While good prompts elicit detailed, relevant, and engaging content, poor prompts can lead to vague and uninspiring outputs. To generate these responses, we will define the following function:

    # Assumption: the excerpt does not show where `client` comes from. The two
    # lines below create it with the OpenAI Python SDK (v1+), which reads the
    # OPENAI_API_KEY environment variable.
    from openai import OpenAI

    client = OpenAI()

    def generate_response(prompt, model="gpt-4", max_tokens=100, temperature=0.7, n=1):
        # Send the prompt as a single user message
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
            n=n,
            temperature=temperature)
        # Return the text of the first completion
        return response.choices[0].message.content

This function will be used with different prompt types in the examples that follow.

Example 1: Product descriptions

In this example, we will generate product descriptions for our e-commerce brand, which is launching new lines of eco-friendly kitchenware.

The following is an example of a poor prompt:

    poor_product_description = "Talk about bamboo cutlery."
    generate_response(poor_product_description)

This produces:

Bamboo cutlery is a type of eating utensil made from bamboo, a fast-growing renewable resource. This form of cutlery includes items such as forks, knives, spoons, and chopsticks. They are an eco-friendly alternative to plastic disposable cutlery due to their reusable nature and biodegradable properties. Bamboo cutlery is lightweight and durable, making it a popular choice for camping, picnics, and travel.

Now, let’s look at the following example of a good prompt:

    good_product_description = "Write a captivating description for a bamboo cutlery set designed for eco-conscious consumers, emphasizing its sustainability and style."
    generate_response(good_product_description)

This prompt produces the following output:

Discover the perfect harmony of sophistication and sustainability with our Bamboo Cutlery Set. Designed for the eco-conscious consumer, this set is not only a statement of your commitment to the environment but also a reflection of your exquisite taste. Each piece is crafted from 100% organic bamboo, a renewable resource that grows significantly faster than traditional hardwoods. This set boasts a sleek, minimalist design that perfectly complements any table setting, from casual picnic to elegant dinner parties.

From a marketing perspective, this example demonstrates the significance of creating detailed and audience-specific prompts with clear requirements when using ZSL for product descriptions, as well as how this leads to more specificity in the generated response. However, it is worth noting that older consumers may value more straightforward, factual information and, therefore, may favor the more generic prompt’s response from an engagement standpoint. Tailoring GenAI outputs at the level of the individual consumer can be crucial as well and is a topic discussed in Chapter 11.

Packt library subscribers can continue reading the entire book for free. You can buy Machine Learning and Generative AI for Marketing here.

Get the eBook for $39.99 $27.98!

Other Python titles from Packt at 30% off
Get the eBook for $35.99 $24.99!
Get the eBook for $25.99 $17.99!
Get the eBook for $35.99 $24.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
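To make the Featured Study's point about lowering weight precision from 16 or 32 bits to 4 or 8 bits more concrete, here is a back-of-the-envelope sketch of the arithmetic. It is not taken from the paper: the function name `weight_memory_gb` is hypothetical, the parameter count is the nominal 7 billion suggested by the name Mistral-7B, and the estimate covers the weights only, ignoring the KV cache, activations, and framework overhead.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights alone, in GB (10^9 bytes).

    Each weight takes bits_per_weight / 8 bytes.
    """
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    nominal_params = 7e9  # assumption: nominal size of a "7B" model
    for bits in (32, 16, 8, 4):
        print(f"{bits:>2}-bit weights: {weight_memory_gb(nominal_params, bits):.1f} GB")
```

Under these assumptions, 16-bit weights for a 7-billion-parameter model need about 14 GB, which leaves little headroom on a single 16 GB card, while 4-bit weights need about 3.5 GB. Actual requirements depend on context length and the number of simultaneous requests.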


Divya Anne Selvaraj

17 Sep 2024

PythonPro #47: Python 3.13 Features, AI Debugging with Copilot, and Building Neural Networks from Scratch


Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#47: Python 3.13 Features, AI Debugging with Copilot, and Building Neural Networks from Scratch

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published book, AI-Assisted Programming for Web and Machine Learning, which discusses how Copilot can assist in debugging and troubleshooting by adding error-handling features.

News Highlights: DBOS Transact launches with durable Python workflow recovery; Python in Excel now live for data analysis; Python 3.13 is coming October 2024 with new interpreter, JIT, and more; and Hackers use fake coding tests on GitHub to target Python developers.

Here are my top 5 picks from our learning resources today:
How Does AI Work? Create a Neural Network from Scratch 🤖
Spam Mail Detection - Machine Learning with Python ✉️
Django from first principles 🌱
How to Use Conditional Expressions With NumPy where() 🔄
Why Learn Python Concurrency ⚙️

And today’s Featured Study introduces ComplexCodeEval, a benchmark designed to evaluate large code models (LCMs) in complex development environments.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month’s survey is still live.
Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News

DBOS Transact: Ultra-Lightweight Durable Execution for Python Workflows launched: The library ensures programs automatically resume from their last completed step after crashes or interruptions.

Python in Excel – Available Now: Microsoft 365 users can now integrate Python libraries for advanced data analysis, visualization, and machine learning within Excel.

What’s New In Python 3.13: The version, releasing on October 1, 2024, will include a new interactive interpreter, experimental free-threaded mode, a JIT compiler, enhanced error messages, and updates to the standard library.

Fake password manager coding test used to hack Python developers: Posing as recruiters, the hackers use GitHub-hosted projects to infect victims' systems and pressure them to bypass security checks.

💼Case Studies and Experiments🔬

How Does AI Work? Create a Neural Network from Scratch: Explains how to build a basic neural network using Python to predict house prices, while covering core concepts like gradient descent, backpropagation, and more.

Text mining in Python - case-study with “Romeo and Juliet” from Project Gutenberg: Walks you through the steps of accessing the text, cleaning it, tokenizing words, analyzing word frequency, and visualizing the results.

📊Analysis

Streamlit vs Gradio - The Ultimate Showdown for Python Dashboards: Evaluates their ease of use, customization options, deployment flexibility, and suitability for complex data visualization or rapid prototyping tasks.

It’s time to stop using Python 3.8: Emphasizes the importance of upgrading from Python 3.8, which reaches end-of-life in October 2024, meaning no more bug or security fixes.

🎓Tutorials and Guides🤓

Understanding Proximal Policy Optimization (PPO) - A Game-Changer in AI Decision-Making Explained for RL Newcomers: Explains PPO, detailing its key concepts, practical implementation, and how it improves decision-making stability and efficiency in AI systems.

Use Python for Earth Engine Analysis, Save Directly to Your Local Drive: Explains how to use the Python library geemap for interacting with Google Earth Engine (GEE) to process and analyze satellite imagery.

Django from first principles: A series on building a Django project starting with a single file and expanding it only as needed to manage complexity, which simplifies Django for beginners by focusing on essential components first.

Injecting syscall faults in Python and Ruby: Discusses how to simulate syscall failures in Python and Ruby using Cirron, a tool that integrates with strace to inject errors, delays, and signals into system calls.

Deploying a Django app with Kamal, AWS ECR, and Github Actions: Covers setting up a VPS and preparing it for Kamal, creating a Dockerfile for containerized apps, and configuring a deployment pipeline.

Implementing the Singleton Pattern in FastAPI for Efficient Database Management: Demonstrates how to implement the Singleton Pattern for efficient database management, particularly while handling expensive resources.

Spam Mail Detection - Machine Learning with Python: Explains how to use a supervised learning approach with a dataset from Kaggle, analyzing email length, applying logistic regression, and creating a scanner to detect spam.

🔑Best Practices and Advice🔏

Let’s build and optimize a Rust extension for Python: Explains how to build and optimize a Rust extension for Python to improve performance and memory efficiency.

Why Learn Python Concurrency: Explains how concurrent, parallel, and asynchronous execution allow programs to fully utilize modern hardware, improve performance, and scale more effectively.

Therac-25, LLMs and the Zen of Python: Discusses the dangers of relying on LLMs to rewrite code across languages without understanding the underlying principles and context, drawing a parallel to the infamous Therac-25 disaster.

Using Python's pip to Manage Your Projects' Dependencies: Discusses using Python's pip to install and uninstall packages and to handle errors.

How to Use Conditional Expressions With NumPy where(): Explains how to work with multiple conditions, array broadcasting, and common pitfalls when using np.where() in data manipulation.

🔍Featured Study: ComplexCodeEval - Benchmarking Large Code Models in Practice💥

In ComplexCodeEval: A Benchmark for Evaluating Large Code Models on More Complex Code, Feng et al. introduce a new benchmark for assessing large code models (LCMs). The paper focuses on evaluating LCMs in real-world coding scenarios involving complex tasks and avoiding data leakage.

Context

LCMs are AI models trained to handle coding tasks like code generation, completion, test case creation, and API recommendation. Existing benchmarks tend to evaluate LCMs on limited tasks, such as standalone code generation, without capturing the broader, more diverse challenges developers face.
Additionally, they often overlook data leakage, where models are tested on data already seen during training, resulting in inflated performance scores.

ComplexCodeEval is a comprehensive benchmark designed to test LCMs on multiple coding tasks and scenarios, reflecting real-world programming challenges. It assesses how well LCMs perform in contexts that include dependencies on third-party libraries and the need to create test functions and recommend APIs.

Key Features of ComplexCodeEval

Real-World Data: Uses 3,897 Java samples and 7,184 Python samples from high-star GitHub repositories.
Multiple Tasks: Evaluates LCMs on code generation, completion, API recommendation, and test case generation.
Rich Context: Each sample includes function signatures, docstrings, API references, and test functions.
Data Leakage Prevention: Multiple timestamps (creation, update) ensure the benchmark avoids testing on training data.
Variety of Models Tested: Ten popular LCMs, including StarCoder2, CodeLlama, DeepSeek-Coder, and GPT-3.5-Turbo, were evaluated.

What This Means for You

This study is valuable for programmers and software engineers who use AI coding tools. ComplexCodeEval highlights which models perform best for tasks like generating Java code or recommending Python APIs, making it easier to select the right tools for complex programming tasks. It provides a realistic assessment of LCMs, avoiding inflated scores from limited or synthetic benchmarks.

For developers working on AI models, the study offers insights into how additional contextual information, such as dependencies and function histories, can significantly improve model performance.

Examining the Details

To create ComplexCodeEval, the authors sourced Java and Python samples from GitHub repositories that relied on popular third-party libraries. Each sample was annotated with relevant metadata like API references, docstrings, and timestamps, simulating real-world coding tasks.

Ten LCMs, including StarCoder2, CodeLlama, DeepSeek-Coder, and GPT-3.5-Turbo, were tested on four tasks: code generation, code completion, API recommendation, and test case generation. CodeLlama-34B achieved the highest CodeBLEU score of 34.08 for Java code generation, and Python API recommendation saw an F1 score of 52.24.

The researchers tested the impact of adding context to the inputs provided to LCMs. Starting with basic function signatures and docstrings, they added more context (e.g., dependencies and library imports) and found that full context improved average CodeBLEU scores by 70.73% in Java and 31.90% in Python.

To assess data leakage, the team compared model performance on data created before and after the models’ knowledge cut-off dates. They found models performed better on leaked data, with average CodeBLEU scores increasing by 1.22 points in Java and 3.10 points in Python, demonstrating the importance of preventing data leakage in evaluations.

You can learn more by reading the entire paper and accessing the ComplexCodeEval Github repository.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 20: Increasing Efficiency with GitHub Copilot” in the book, AI-Assisted Programming for Web and Machine Learning by Christoffer Noring, Anjali Jain, Marina Fernandez, Ayşe Mutlu, and Ajit Jaokar, published in August 2024.

Debugging and troubleshooting

Being able to debug and troubleshoot code is an important skill to have. When you develop code, it seldom does what you need it to do the first time. You need to be able to debug and troubleshoot code to figure out what’s wrong with it. The same applies if you have a reported bug; you need to be able to diagnose at what point it goes wrong.
Copilot can help, and a good approach is using Copilot chat and asking it to add support for debugging and troubleshooting.

The following is a piece of sample code you’ve been given. It’s a REST API in the framework Flask for Python. It’s working code, but it might not be so easy to work with from a debugging and troubleshooting standpoint:

    from flask import Flask, jsonify, request
    import sqlite3

    app = Flask(__name__)

    # Create a connection to the database
    conn = sqlite3.connect('products.db')
    c = conn.cursor()

    # Create a table for products if it doesn't exist
    c.execute('''CREATE TABLE IF NOT EXISTS products
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  name TEXT,
                  price REAL)''')
    conn.commit()

    @app.route('/products', methods=['GET'])
    def get_products():
        # Retrieve all products from the database
        c.execute('SELECT * FROM products')
        products = c.fetchall()
        # Convert the products to a JSON response
        response = []
        for product in products:
            response.append({
                'id': product[0],
                'name': product[1],
                'price': product[2]
            })
        return jsonify(response)

    @app.route('/products', methods=['POST'])
    def add_product():
        # Retrieve the product data from the request body
        data = request.get_json()
        name = data['name']
        price = data['price']
        # Insert the product into the database
        c.execute('INSERT INTO products (name, price) VALUES (?, ?)', (name, price))
        conn.commit()
        # Return a success message
        return jsonify({'message': 'Product added successfully'})

    if __name__ == '__main__':
        app.run(debug=True)

The above code works, but it doesn’t support debugging and troubleshooting; let’s see how we can add support for that by using Copilot Chat:

[Prompt]
Add support for troubleshooting and debugging
[End of prompt]

Copilot chat answers us with both text and code:

[Prompt response]
Sure, here’s an updated version of the Flask API that includes support for troubleshooting and debugging:

    from flask import Flask, jsonify, request
    import sqlite3

    app = Flask(__name__)

    # Create a connection to the database
    conn = sqlite3.connect('products.db')
    c = conn.cursor()

    # Create a table for products if it doesn't exist
    c.execute('''CREATE TABLE IF NOT EXISTS products
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  name TEXT,
                  price REAL)''')
    conn.commit()

    @app.route('/products', methods=['GET'])
    def get_products():
        try:
            # Retrieve all products from the database
            c.execute('SELECT * FROM products')
            products = c.fetchall()
            # Convert the products to a JSON response
            response = []
            for product in products:
                response.append({
                    'id': product[0],
                    'name': product[1],
                    'price': product[2]
                })
            return jsonify(response)
        except Exception as e:
            # Log the error and return an error message
            app.logger.error(str(e))
            return jsonify({'error': 'An error occurred while retrieving products'})

    @app.route('/products', methods=['POST'])
    def add_product():
        try:
            # Retrieve the product data from the request body
            data = request.get_json()
            name = data['name']
            price = data['price']
            # Insert the product into the database
            c.execute('INSERT INTO products (name, price) VALUES (?, ?)', (name, price))
            conn.commit()
            # Return a success message
            return jsonify({'message': 'Product added successfully'})
        except Exception as e:
            # Log the error and return an error message
            app.logger.error(str(e))
            return jsonify({'error': 'An error occurred while adding the product'})

    if __name__ == '__main__':
        app.run(debug=True)

The improvements are:

Added a try/except block around the code in the get_products() function.
Added logger statements to log errors.
Added debug=True to the app.run() statement.

These changes will help us troubleshoot and debug the code.
[End of response]

Packt library subscribers can continue reading the entire book for free. You can buy AI-Assisted Programming for Web and Machine Learning here.

Get the eBook for $38.99 $26.99!

Other Python titles from Packt at up to 30% off
Get the eBook for $35.99 $24.99! Get the Print Book for $44.99 $32.99
Get the eBook for $35.99 $24.99! Get the Print Book for $44.99 $32.99
Get the eBook for $43.99 $29.99! Get the Print Book for $54.99 $40.99

Print discounts end in 5 days on the 22nd of September, 2024.

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!
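Building on the Copilot response in the excerpt above, the following is a sketch of one possible further refinement, not the book's or Copilot's method. In the excerpt, the error handlers return an error message with the default success status, and a single module-level connection is shared by every request, even though sqlite3 connections are by default tied to the thread that created them. The sketch below keeps the database logic in plain functions that open a connection per call, log the full traceback, and return a `(payload, status)` pair that a Flask view could pass on. The logger name, the `db_path` parameter, and the specific status codes are assumptions made for illustration; only the standard library is used so the logic can be tested without a web server.

```python
import logging
import sqlite3
from contextlib import closing

logger = logging.getLogger("products_api")  # hypothetical logger name


def init_db(db_path):
    """Create the products table if it does not exist yet."""
    with closing(sqlite3.connect(db_path)) as conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS products "
            "(id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, price REAL)"
        )
        conn.commit()


def get_products(db_path):
    """Return (payload, HTTP status), opening a fresh connection per call."""
    try:
        with closing(sqlite3.connect(db_path)) as conn:
            rows = conn.execute("SELECT id, name, price FROM products").fetchall()
    except sqlite3.Error:
        # logger.exception records the message together with the traceback
        logger.exception("Failed to retrieve products")
        return {"error": "An error occurred while retrieving products"}, 500
    return [{"id": r[0], "name": r[1], "price": r[2]} for r in rows], 200


def add_product(db_path, data):
    """Validate the request body, insert the product, return (payload, status)."""
    try:
        name, price = data["name"], data["price"]
    except (KeyError, TypeError):
        # Missing fields or a non-dict body are a client error, not a server error
        return {"error": "Both 'name' and 'price' are required"}, 400
    try:
        with closing(sqlite3.connect(db_path)) as conn:
            conn.execute(
                "INSERT INTO products (name, price) VALUES (?, ?)", (name, price)
            )
            conn.commit()
    except sqlite3.Error:
        logger.exception("Failed to add product")
        return {"error": "An error occurred while adding the product"}, 500
    return {"message": "Product added successfully"}, 201
```

Separating client errors (400) from server errors (500) makes failures easier to triage from logs and from the caller's side. Whether to open a connection per call or to use a pooled or per-request connection is a design choice that depends on the deployment.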


Divya Anne Selvaraj

11 Sep 2024

Master Python for Data, AI, and API Development


New Python books, designed for today’s needs

Master Python for Data, AI, and API Development

Hi,

Python powers some of the fastest-growing fields in tech today. According to the latest Python Developer Survey results, 47% of Python users apply it in data analysis, 42% in machine learning, and 39% in web development. With Python’s influence only expanding, staying ahead means mastering these key areas.

Packt's August 2024 releases offer the practical expertise you need to enhance your Python skills, whether you're working with big data, building machine learning models, or developing high-performance APIs.

Python Feature Engineering Cookbook - Third Edition by Soledad Galli
A complete guide to crafting powerful features for your machine learning models
Equips you with practical techniques for handling complex datasets and crafting features that will improve model performance.
Learn to impute missing values, transform numerical variables, and extract powerful features from complex datasets like time series and transactional data.
Get the eBook for $35.99 $24.99!
Get the Print Book for $44.99!

Polars Cookbook by Yuki Kakegawa
Over 60 practical recipes to transform, manipulate, and analyze your data using Python Polars 1.x
Optimise data analysis tasks with Python Polars, a blazingly fast alternative to pandas.
Ideal for data professionals looking to improve performance across a variety of datasets, solve common data problems, perform complex transformations, and analyse time-series data.
Get the eBook for $35.99 $24.99!
Get the Print Book for $44.99!

FastAPI Cookbook by Giunio De Luca
Develop high-performance APIs and web applications with Python
FastAPI is gaining ground rapidly, with 25% of Python developers now using it for web development.
Learn how to use FastAPI’s modern, async-friendly features, and take your backend development to the next level with custom middleware and WebSockets.
Get the eBook for $35.99 $24.99!
Get the Print Book for $44.99!


Divya Anne Selvaraj

10 Sep 2024

PythonPro #46: Outlier Detection with Boxplots, Python 3.13 Updates, and Stripe Integration for Django

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#46: Outlier Detection with Boxplots, Python 3.13 Updates, and Stripe Integration for Django

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published Python Feature Engineering Cookbook - Third Edition, which discusses using boxplots and the inter-quartile range (IQR) proximity rule to visualize outliers in data distributions.

Related Titles
Covers numerous tools for mastering visualization including NumPy, Pandas, SQL, Matplotlib, and Seaborn
Includes an introductory chapter on Python 3 basics
Features companion files with numerous Python code samples and figures
Get the eBook for $54.99 $37.99!

Explores cutting-edge techniques using ChatGPT/GPT-4 in harmony with Python for generating visuals that tell more compelling data stories
Tackles actual data scenarios and builds your expertise as you apply learned concepts to real datasets
Get the eBook for $54.99 $37.99!

Covers Python-based data visualization libraries and techniques
Includes practical examples and Gemini-generated code samples for efficient learning
Integrates Google Gemini for advanced data visualization capabilities
Get the eBook for $51.99 $35.99!

News Highlights: Python 3.13.0rc2 released with new interpreter, free-threaded build, JIT, and incremental garbage collection; Python survey shows pip dominance, rising interest in Conda, Poetry, and uv; and PSF expands CNA role to cover Pallets Projects like Flask and Jinja.

Here are my top 5 picks from our learning resources today:
Breaking Bell's Inequality with Monte Carlo Simulations in Python🔗
Python QuickStart for People Learning AI🤖
Integrating Stripe Into A One-Product Django Python Shop🛒
Python HTTP Clients - Requests vs. HTTPX vs. AIOHTTP🌐
A comparison of hosts / providers for Python serverless functions (a.k.a. FaaS)☁️

And today’s Featured Study explores how ChatGPT can automate and streamline Python-based federated learning algorithm development, reducing human effort and improving coding efficiency.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month’s survey is live. Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.

Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
Python 3.13.0rc2 released: This version introduces several major features such as a new interactive interpreter, an experimental free-threaded build mode, preliminary JIT for performance, and incremental garbage collection.
Packaging Trends in Python: Highlights from the 2023 Developer Survey: Results show a strong preference for pip, with emerging interest in Conda and Poetry, and a new player, uv.
Python Software Foundation (PSF) Expands CNA Scope to Include Pallets Projects: The PSF has expanded its CVE Numbering Authority role to include Pallets Projects like Flask and Jinja, ensuring better vulnerability management.

💼Case Studies and Experiments🔬
Lessons learnt building a real-time audio application in Python: Key learnings covered include accepting inherent latency issues, leveraging modern operating systems' efficient memory management, and utilizing web browsers as effective interfaces for real-time applications.
Breaking Bell's Inequality with Monte Carlo Simulations in Python: Discusses the use of Monte Carlo simulations in Python to challenge Bell's inequality through a quantum mechanics game.

📊Analysis
Rust for the small things?... but what about Python?: Explores the enduring relevance of Python in data engineering, despite the allure of Rust for performance and safety.
Multiversion Python Thoughts: Delves into the complexities of implementing multi-version package imports in Python, motivated by the desire to handle incompatible library versions concurrently.

🎓Tutorials and Guides🤓
Python QuickStart for People Learning AI: Covers Python fundamentals, including data types, loops, and functions, and provides a concrete AI project example using the OpenAI API for summarizing research papers.
Lists vs Tuples in Python: Explores the characteristics, uses, and differences between lists and tuples in Python, emphasizing their ordered nature, content diversity, mutability, and appropriate usage scenarios.
Layman's Guide to Python Built-in Functions: Simplifies Python's built-in functions for beginners, providing plain English explanations and straightforward examples.
🎥Some tricks with UV: Demonstrates how UV not only facilitates quicker installations but also supports running Python scripts with on-the-fly dependency management.
Python 3 Module of the Week: A series of articles detailing diverse library functionalities ranging from text handling, data structures, and algorithms to more complex areas like cryptography and network communication.
Integrating Stripe Into A One-Product Django Python Shop: Part two of a series on creating a one-product shop using Django, htmx, and Stripe.
Covers creating a Stripe account, defining a product, and configuring a webhook for transaction notifications.
Practical Introduction to Polars: Compares Polars' key functionalities with Pandas, offering practical examples to help users transition from Pandas to Polars for more efficient data analysis.

🔑Best Practices and Advice🔏
Understanding Python's __new__ Method Through a Magical Example: Introduces Python's lesser-known .__new__() method, used for creating instances before they're initialized with .__init__().
Some fun with Python Enum: Explores the Enum class introduced in Python 3.4, detailing its benefits over using literal types for type-safety and avoiding errors in code.
A comparison of hosts / providers for Python serverless functions (a.k.a. FaaS): Discusses various providers that support Python, their development experience (DevEx), pricing models, runtime limits, and other platform products.
Python HTTP Clients - Requests vs. HTTPX vs. AIOHTTP: Details each library's strengths and appropriate use cases, helping developers choose the right tool based on project needs.
Shades of testing HTTP requests in Python: Covers different techniques including mocking with AsyncMock and respx, parameterizing HTTP clients for flexible testing setups, and using integration tests with a Starlette server.

🔍Featured Study: Streamlining Federated Learning with Python and ChatGPT💥

In "PTB-FLA Development Paradigm Adaptation for ChatGPT," Popovic et al. explore how AI can be used to streamline the development of federated learning algorithms (FLAs). The study adapts a Python-based development paradigm to leverage ChatGPT for improved speed and efficiency in coding for machine learning tasks.

Context
Federated Learning (FL) allows machine learning algorithms to train across decentralized data sources, such as edge devices, without sharing the raw data. PTB-FLA is a Python framework designed to ease this process by providing a structured way for developers to create these algorithms. Traditionally, this has required significant human input. With ChatGPT, the authors of this paper aimed to reduce human effort by automating much of the coding work. This study is important because it shows how LLMs can help build complex systems like FL algorithms, particularly in environments such as edge computing, where efficiency and reduced human oversight are key.

Key Findings
The adapted four-phase paradigm reduced human labour by 50%, achieving double the speed of the original development method.
A new two-phase paradigm further streamlined the process, cutting human effort by a factor of 6 compared to the original approach.
ChatGPT-generated code was of higher quality, showing fewer errors compared to human-generated versions in comparable tasks.
The study demonstrated a significant reduction in costs by reducing the size of ChatGPT prompts by a factor of 2.75.
Both adapted paradigms were successfully validated using logistic regression as a case study for federated learning.

What This Means for You
If you work with machine learning, particularly in decentralized systems like IoT or edge computing, this research is highly relevant. Using ChatGPT to develop federated learning algorithms can save you substantial time by automating coding tasks that would otherwise require significant effort. By adopting the two-phase paradigm, you can expect faster, more efficient development cycles, allowing you to focus on innovation rather than repetitive coding. This also reduces costs when using AI-assisted tools like ChatGPT, as it optimises the prompt size.

Examining the Details
The study's methodology revolves around adapting an existing four-phase development process for federated learning into two paradigms tailored for ChatGPT. The original phases involved creating sequential code, transforming it into federated code, incorporating callbacks, and generating the final PTB-FLA code.
The new two-phase paradigm simplifies this further by merging phases, allowing ChatGPT to generate the final federated code directly from the sequential code, bypassing intermediary steps. The team validated both paradigms through a case study using logistic regression. They iteratively refined the ChatGPT prompts to find the minimal context needed to achieve correct outputs, ensuring efficiency while maintaining code accuracy. The final results showed ChatGPT could develop high-quality code faster than humans, with far fewer resources.

You can learn more by reading the entire paper and accessing the PTB-FLA GitHub repository.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 5: Working with Outliers” in the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, published in August 2024.

Visualizing outliers with boxplots and the inter-quartile proximity rule

A common way to visualize outliers is by using boxplots. Boxplots provide a standardized display of the variable’s distribution based on quartiles. The box contains the observations within the first and third quartiles, known as the Inter-Quartile Range (IQR). The first quartile is the value below which 25% of the observations lie (equivalent to the 25th percentile), while the third quartile is the value below which 75% of the observations lie (equivalent to the 75th percentile). The IQR is calculated as follows:

    IQR = 3rd quartile - 1st quartile

Boxplots also display whiskers, which are lines that protrude from each end of the box toward the minimum and maximum values and up to a limit. These limits are given by the minimum or maximum value of the distribution or, in the presence of extreme values, by the following equations:

    upper limit = 3rd quartile + IQR × 1.5
    lower limit = 1st quartile - IQR × 1.5

According to the IQR proximity rule, we can consider a value an outlier if it falls beyond the whisker limits determined by the previous equations.
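The IQR proximity rule described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the book's own implementation: the `iqr_limits` helper and the sample values are hypothetical, and the quartiles come from the standard library's `statistics.quantiles` with the inclusive method, which may differ slightly from other quartile conventions.

```python
from statistics import quantiles


def iqr_limits(values, fold=1.5):
    """Return (lower, upper) whisker limits under the IQR proximity rule.

    Hypothetical helper for illustration; `fold` is the IQR multiplier.
    """
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - fold * iqr, q3 + fold * iqr


# Made-up sample with one extreme value on the right tail.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
lower, upper = iqr_limits(sample)

# Values beyond either limit are candidate outliers.
outliers = [v for v in sample if v < lower or v > upper]
print(lower, upper, outliers)  # -3.5 14.5 [100]
```

Here the first and third quartiles are 3.25 and 7.75, so the IQR is 4.5 and the limits sit 6.75 beyond each quartile; only the value 100 falls outside them.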
In boxplots, outliers are indicated as dots.

Note
If the variable has a normal distribution, about 99% of the observations will be located within the interval delimited by the whiskers. Hence, we can treat values beyond the whiskers as outliers. Boxplots are, however, non-parametric, which is why we also use them to visualize outliers in skewed variables.

In this recipe, we’ll begin by visualizing the variable distribution with boxplots, and then we’ll calculate the whisker’s limits manually to identify the points beyond which we could consider a value as an outlier.

How to do it...

We will create boxplots utilizing the seaborn library. Let’s begin by importing the Python libraries and loading the dataset:

1. Let’s import the Python libraries and the dataset:

    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.datasets import fetch_california_housing

2. Modify the default background from seaborn (it makes prettier plots, but that’s subjective, of course):

    sns.set(style="darkgrid")

3. Load the California house prices dataset from scikit-learn:

    X, y = fetch_california_housing(
        return_X_y=True, as_frame=True)

4. Make a boxplot of the MedInc variable to visualize its distribution:

    plt.figure(figsize=(8, 3))
    sns.boxplot(data=X["MedInc"], orient="y")
    plt.title("Boxplot")
    plt.show()

In the following boxplot, we identify the box containing the observations within the IQR, that is, the observations between the first and third quartiles. We also see the whiskers. On the left, the whisker extends to the minimum value of MedInc; on the right, the whisker goes up to the third quartile plus 1.5 times the IQR. Values beyond the right whisker are represented as dots and could constitute outliers:

Figure 5.1 – Boxplot of the MedInc variable highlighting potential outliers on the right tail of the distribution

Note
As shown in Figure 5.1, the boxplot returns asymmetric boundaries denoted by the varying lengths of the left and right whiskers. This makes boxplots a suitable method for identifying outliers in highly skewed distributions. As we’ll see in the coming recipes, alternative methods to identify outliers create symmetric boundaries around the center of the distribution, which may not be the best option for asymmetric distributions.

5. Let’s now create a function to plot a boxplot next to a histogram:

    def plot_boxplot_and_hist(data, variable):
        f, (ax_box, ax_hist) = plt.subplots(
            2, sharex=True,
            gridspec_kw={"height_ratios": (0.50, 0.85)})
        sns.boxplot(x=data[variable], ax=ax_box)
        sns.histplot(data=data, x=variable, ax=ax_hist)
        plt.show()

6. Let’s use the previous function to create the plots for the MedInc variable:

    plot_boxplot_and_hist(X, "MedInc")

In the following figure, we can see the relationship between the boxplot and the variable’s distribution shown in the histogram. Note how most of MedInc’s observations are located within the IQR box. MedInc’s potential outliers lie on the right tail, corresponding to people with unusually high-income salaries:

Figure 5.2 – Boxplot and histogram – two ways of displaying a variable’s distribution

...

How it works...

In this recipe, we used the boxplot method from Seaborn to create the boxplots and then we calculated the limits beyond which a value could be considered an outlier based on the IQR proximity rule.

In Figure 5.2, we saw that the box in the boxplot for MedInc extended from approximately 2 to 5, corresponding to the first and third quartiles (you can determine these values precisely by executing X["MedInc"].quantile(0.25) and X["MedInc"].quantile(0.75)). We also saw that the whiskers start at MedInc’s minimum on the left and extend up to 8.013 on the right (we know this value exactly because we calculated it in step 8). MedInc showed values greater than 8.013, which were displayed in the boxplot as dots. Those are the values that could be considered outliers.

...

Packt library subscribers can continue reading the entire book for free. You can buy the Python Feature Engineering Cookbook - Third Edition, by Soledad Galli, here.
Get the eBook for $35.99 $24.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!


Divya Anne Selvaraj

03 Sep 2024

PythonPro #45: Converting DataFrames, Python Developer Survey, DBSCAN in 5 Minutes, and Web Scraping with Scrapy

Bite-sized actionable content, practical tutorials, and resources for Python programmers.

#45: Converting DataFrames, Python Developer Survey, DBSCAN in 5 Minutes, and Web Scraping with Scrapy

Hi,

Welcome to a brand new issue of PythonPro!

In today’s Expert Insight we bring you an excerpt from the recently published Polars Cookbook, which shows you how to convert DataFrames and Series between Polars and pandas.

News Highlights: Python Developer Survey: 55% use Linux, 6% still on Python 2; SuperTree enables interactive decision tree visuals in Jupyter; and OneBusAway launches Python and JavaScript SDKs for seamless data integration.

Here are my top 5 picks from our learning resources today:
Exploring the National Park Service API - Harvesting and Visualizing Data for National Parks🌲
Web Scraping With Scrapy and MongoDB🕸️
DBSCAN, Explained in 5 Minutes🧩
Python packaging is a MESS📦
Why I Still Use Python Virtual Environments in Docker🛳️

And today’s Featured Study highlights how process mining, using tools like pm4py, can uncover insights into workflow efficiency, variability, and algorithmic performance.

Stay awesome!
Divya Anne Selvaraj
Editor-in-Chief

P.S.: This month’s survey is now live. Do take the opportunity to tell us what you think of PythonPro, request learning resources, and earn your one Packt Credit for this month.

Sign Up | Advertise

🐍 Python in the Tech 💻 Jungle 🌳

🗞️News
Python Developer Survey - 55% Use Linux, 6% Use Python 2: The 7th annual Python Developers Survey, which gathered responses from over 25,000 developers worldwide, also found that Visual Studio Code is the leading IDE.
supertree - Interactive Decision Tree Visualization: This Python package is designed to create interactive visualizations of decision trees within Jupyter Notebooks, Jupyter Lab, Google Colab, and similar environments that support HTML rendering.
OneBusAway Launches Official Python and JavaScript SDKs: Developed as part of the Google Summer of Code, these SDKs simplify the incorporation of OneBusAway's data, offer consistent API usage across platforms, and include comprehensive documentation.

💼Case Studies and Experiments🔬
Exploring the National Park Service API - Harvesting and Visualizing Data for National Parks: Provides a step-by-step guide on accessing the API, retrieving data such as park entrance fees, and organizing it into a Pandas DataFrame for analysis.
Code Without Any Syntax: Discusses an experiment in which the author uses an LLM to convert natural language instructions into functional Python code without traditional syntax.

📊Analysis
Make magic with Mesop - python based web apps: Reviews Mesop, a newly released Python-based framework for building web apps. Read for tips to get started.
Why I Prefer Django for My Projects: While acknowledging the strengths of Node.js and Express.js, the author of this article finds Django's holistic, secure, and efficient approach better suited to their needs in web development.

🎓Tutorials and Guides🤓
Web Scraping With Scrapy and MongoDB: Guides you through setting up a Scrapy project, building a web scraper, extracting data, and storing it in MongoDB.
Read to also learn about testing and debugging techniques.
Generate Images With DALL·E and the OpenAI API: Covers setting up the necessary environment, making API calls to create images from text prompts, handling image variations, and converting Base64 JSON responses to PNG files.
Primer on Jinja Templating: Covers installation, basic usage, and advanced features like loops, conditional statements, and macros. Read to learn how to integrate Jinja with Flask to build a basic web project with dynamic web pages.
How to Install Python on Your System - A Guide: Provides a comprehensive guide to installing Python on various systems, including Windows, macOS, Linux, iOS, and Android.
Adventures building a spreadsheet engine in Python: Demonstrates using the Lark Python package to parse formulas and compute dependencies, employing a topological sort algorithm to determine the order of cell evaluation.
How to write your first Genetic Algorithm - Knapsack Problem: Guides you through implementing a genetic algorithm using Python. Read to learn how to apply genetic algorithms to solve complex optimization problems.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Explained in 5 Minutes: Provides a concise explanation of the DBSCAN algorithm, which identifies clusters in data based on spatial distance and detects outliers without needing to predefine the number of clusters.

🔑Best Practices and Advice🔏
Escaping from Anaconda's Stranglehold on macOS: Provides simple, non-technical instructions to move the .zshrc file, allowing users to switch between Anaconda and official Python installations without terminal commands.
Why I Still Use Python Virtual Environments in Docker: Argues that virtual environments simplify the management of Python applications, particularly in production settings, by ensuring consistent and isolated environments across different stages of development.
Python Classes - The Power of Object-Oriented Programming: Covers defining classes, creating objects, managing attributes and methods, and the benefits of using classes. Read to learn about advanced topics like inheritance.
Python packaging is a MESS: Stress-tests nine Python package managers, including pip, conda, poetry, and newer tools like pixi and hatch, highlighting the historical issues and modern solutions in Python packaging.
Use python -m http.server in SSL: Provides a custom script, ssl_server.py, that wraps http.server to enable serving static sites over HTTPS using a self-signed SSL certificate. Read to learn how to serve static content securely.

🔍Featured Study: Navigating Process Mining with pm4py💥

In "Navigating Process Mining: A Case Study using pm4py," Kovács et al. explore the application of the pm4py library in analysing road traffic fine management processes.
The study aims to demonstrate how process mining can uncover key insights into process efficiency and optimisation.

Context
Process mining is a technique that combines data mining and business process management to analyse event logs generated by information systems. It is particularly effective for uncovering hidden patterns, identifying bottlenecks, and optimising workflows. The study focuses on applying the pm4py library, an open-source Python tool, to a real-world road traffic fine management process. This approach offers a deeper understanding of process execution compared to traditional business intelligence tools.

Key Findings
The study's application of process mining to road traffic fine management revealed significant insights into process variability, algorithmic performance, and workflow complexity:
Process Variants: The analysis identified 231 distinct process variants, with one variant accounting for 56,482 cases (approximately 37.6% of the total 150,370 cases), indicating a dominant workflow path.
Algorithm Performance: Three process mining algorithms were evaluated:
  Alpha Miner: Revealed causal dependencies between activities, achieving simplicity and precision scores of 0.66.
  Inductive Miner: Employed a recursive approach to construct process models, scoring 0.62 in simplicity and 0.58 in precision.
  Heuristic Miner: Utilised heuristics to infer process models from event data, achieving a perfect precision score of 1.0 but a lower simplicity score of 0.54.
Start and End Events: The process log analysis showed that 'Create Fine' was the most frequent start event, occurring 150,370 times. Multiple end events, such as 'Send Fine', 'Payment', and 'Send for Credit Collection', were identified, indicating diverse process pathways.
Process Discovery and Visualisation: The discovered models allowed a detailed understanding of workflow structures and dependencies. Each mining approach had strengths and limitations in capturing the process dynamics, with pm4py proving effective in facilitating process mining tasks.

What This Means for You
This study is relevant to data scientists, business analysts, and operations managers interested in optimising business processes. The pm4py library, as demonstrated in this case study, provides practical tools for analysing complex workflows, identifying inefficiencies, and improving operational efficiency. The insights gained can be applied to other business processes, making it a valuable resource for those aiming to enhance process performance.

Examining the Details
The study used the pm4py library to analyse an event log related to the management of road traffic fines, covering activities such as creating fines, sending fines, adding penalties, managing appeals, and handling payments. The analysis involved three process mining algorithms (Alpha Miner, Inductive Miner, and Heuristic Miner) to discover process models from the event log data. The evaluation of simplicity and precision across these algorithms revealed that the Heuristic Miner achieved the highest precision score of 1.0, while the Alpha Miner provided a balance between simplicity and accuracy.

You can learn more by reading the entire paper and accessing the pm4py library.

🧠 Expert insight💥

Here’s an excerpt from “Chapter 10: Interoperability with Other Python Libraries” in the Polars Cookbook, by Yuki Kakegawa, published in August 2024.

Converting to and from a pandas DataFrame

Many of you have used pandas before, especially in your day-to-day work.
Although pandas and Polars are often compared as one-or-the-other tools, you can use these tools to supplement each other.

📚Related Titles from Packt
Understand key data science algorithms with Python-based examples
Increase the impact of your data science solutions by learning how to apply existing algorithms
Take your data science solutions to the next level by learning how to create new algorithms
Get the eBook for $35.99 $24.99!

Conduct Bayesian data analysis with step-by-step guidance
Gain insight into a modern, practical, and computational approach to Bayesian statistical modeling
Enhance your learning with best practices through sample problems and practice exercises
Get the eBook for $55.99 $38.99!

Polars allows you to convert between pandas and Polars DataFrames, which is exactly what we’ll cover in this recipe.

Getting ready

You need pandas and pyarrow installed for this recipe to work. Execute the following code to make sure that you have them installed:

    pip install pandas pyarrow

How to do it...

Here’s how to convert to and from pandas DataFrames. We’ll first create a Polars DataFrame and then go through ways to convert back and forth between Polars and pandas:

1. Create a Polars DataFrame from a Python dictionary:

    import polars as pl

    df = pl.DataFrame({
        'a': [1,2,3],
        'b': [4,5,6]
    })
    type(df)

The preceding code will return the following output:

    >> polars.dataframe.frame.DataFrame

2. Convert a Polars DataFrame to a pandas DataFrame using the .to_pandas() method:

    pandas_df = df.to_pandas()
    type(pandas_df)

The preceding code will return the following output:

    >> pandas.core.frame.DataFrame

3. Convert a pandas DataFrame to a Polars DataFrame using the .from_pandas() method:

    df = pl.from_pandas(pandas_df)
    type(df)

The preceding code will return the following output:

    >> polars.dataframe.frame.DataFrame

4. If you want to allow zero-copy operations, then you need to enable the use_pyarrow_extension_array parameter:

    df.to_pandas(use_pyarrow_extension_array=True).dtypes

The preceding code will return the following output:

    >>
    a    int64[pyarrow]
    b    int64[pyarrow]
    dtype: object

5. You can also create a Polars DataFrame by wrapping a pandas DataFrame using pl.DataFrame():

    type(pl.DataFrame(pandas_df))

The preceding code will return the following output:

    >> polars.dataframe.frame.DataFrame

How it works...

Polars has built-in methods to interoperate with pandas such as .from_pandas() and .to_pandas(). Each method is descriptive enough that you can see that .from_pandas() is used for reading data into Polars from pandas, whereas .to_pandas() is used to convert Polars objects into pandas.

The use_pyarrow_extension_array parameter of the .to_pandas() method uses PyArrow-supported arrays instead of NumPy arrays for the columns within the pandas DataFrame. This enables zero-copy operations and maintains the integrity of null values.

There’s more...

You can also convert between a pandas Series and a Polars Series:

    s = pl.Series([1,2,3])
    type(s.to_pandas())

The preceding code produces the following:

    >> pandas.core.series.Series

The .from_pandas() method returns a Series object when a pandas Series is passed in:

    type(pl.from_pandas(s.to_pandas()))

The preceding code produces the following:

    >> polars.series.series.Series

Packt library subscribers can continue reading the entire book for free. You can buy the Polars Cookbook, by Yuki Kakegawa, here.
Get the eBook for $35.99 $24.99!

And that’s a wrap.

We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.

If you have any suggestions or feedback, or would like us to find you a Python learning resource on a particular subject, take the survey or just respond to this email!


Divya Anne Selvaraj

28 Aug 2024

PythonPro #44: Generative AI with PyTorch, uv Update, Choosing the Best Visualization Type, and FastAPI for Rapid Development

Bite-sized actionable content, practical tutorials, and resources for Python programmers.#44:Generative AI with PyTorch, uv Update, Choosing the Best Visualization Type, and FastAPI for Rapid DevelopmentHi ,Welcome to a brand new issue of PythonPro!In today’sExpert Insight we bring you an excerpt from the recently published book, Generative AI Foundations in Python, which provides a hands-on guide to implementing generative AI models—GANs, diffusion models, and transformers—using PyTorch and the diffusers library.News Highlights:Theuv Python packaging tool now offers comprehensive project management, tool installation, and support for single-file scripts; and Tach, written in Rust, enforces strict interfaces and dependency management for PythonHere are my top 5 picks from our learning resources today:Visualisation 101 - Choosing the Best Visualisation Type📊Using ffmpeg, yt-dlp, and gpt-4o to Automate Extraction and Explanation of Python Code from YouTube Videos🎥🤖FastAPI Tutorial - Build APIs with Python in Minutes🚀Flatten JSON data with different methods using Python📦Linear Algebra Concepts Every Data Scientist Should Know📐And, in today’sFeatured Study, we introduce PyRoboCOP, a Python-based package designed for optimizing robotic control and collision avoidance in complex environments.Stay awesome!Divya Anne SelvarajEditor-in-ChiefP.S.:We have covered all requests made so far this month, in this issue.Sign Up|Advertise🐍 Python in the Tech 💻 Jungle 🌳🗞️Newsuv: Unified Python packaging:The tool now offers end-to-end project management, tool installation, Python bootstrapping, and support for single-file scripts with embedded dependencies, all within a unified, fast, and reliable interface.Tach - Strict interfaces and dep management forPython, written in Rust:Inspired by modular monolithic architecture, Tach allows you to define dependencies and ensures that modules only import from authorized dependencies.💼Case Studies and Experiments🔬Using ffmpeg, yt-dlp, and gpt-4o 
to Automate Extraction and Explanation of Python Code from YouTube Videos: Details downloading video segments, capturing screenshots, extracting code from images using GPT, and then explaining the code with an LLM.
Packaging Python and PyTorch for a Machine Learning Application: Discusses the challenges of packaging Python and PyTorch for the Transformer Lab application, aiming for a seamless user experience across various operating systems.

📊Analysis
🎥Charlie Marsh on Astral, uv, and the Python packaging ecosystem: Discusses insights on the development of Astral's uv tool, a cargo-like tool for Python, following a significant upgrade.
CPython Compiler Hardening: Outlines the author’s process of selecting and testing compiler options, addressing challenges like excessive warnings, performance impacts, and developing tools to track and manage these warnings.

🎓Tutorials and Guides🤓
Flatten JSON data with different methods using Python: Techniques discussed include using pandas' json_normalize, recursive functions, the flatten_json library, custom functions, and tools like PySpark and SQL.
FastAPI Tutorial - Build APIs with Python in Minutes: Guides you through setting up a development environment, creating a FastAPI app, building a logistic regression classifier, defining data models with Pydantic, and setting up API endpoints for predictions.
What's the deal with setuptools, setup.py, pyproject.toml, and wheels?: Provides a detailed explanation of Python packaging tools and practices, offering insights and recommendations for how to approach packaging in modern projects.
Python's Preprocessor: Debunks the myth that Python lacks a preprocessor by demonstrating how Python can be extended and customized through the use of custom codecs and path configuration files.
📖Open Access Book | Kalman and Bayesian Filters in Python: Addresses the need for a practical introduction to Kalman filtering, offering accessible explanations and examples, along with exercises with answers and supporting 
libraries.
Python Backend Development - A Complete Guide for Beginners: Provides a step-by-step guide to building web applications, including advanced topics like asynchronous programming, performance optimization, and real-time data handling.
Working with Excel Spreadsheets in Python: Focuses on automating tasks using the openpyxl module. Read to learn about reading, writing, modifying, and formatting Excel files, and advanced features like plotting charts and integrating images.

🔑Best Practices and Advice🔏
Visualisation 101 - Choosing the Best Visualisation Type: Explores how visualizations improve data-driven decisions, focusing on understanding context, audience, and visual perception. Read to learn how to implement visualizations.
Simone's Creative Cooking Club • If You Haven't Got a Clue What "Pass by Value" or "Pass by Reference" Mean, Read On…: Demonstrates how Python handles function arguments, particularly mutable and immutable objects.
How I ask GPT-4 to make tiny Python scripts in practice: Succinctly describes starting with a basic script, then converting it into a command-line interface using click, and adding features like stdin/stdout handling and error logging.
Linear Algebra Concepts Every Data Scientist Should Know: Introduces key concepts such as vectors, vector operations, vector spaces, and matrices, with visual explanations and code examples to demonstrate their application in real-world data science tasks.
🎥Python From a Java Developer's Perspective: Provides guidance for Java developers to write Python code effectively. Watch to learn how to smoothly transition between Java and Python while leveraging your existing Java knowledge.

🔍Featured Study: Mastering Robotic Control with PyRoboCOP for Complex Tasks💥
In “PyRoboCOP: Python-based Robotic Control & Optimization Package for Manipulation and Collision Avoidance” Raghunathan et al. introduce a Python-based software package designed for the optimisation and control of robotic systems. 
The package excels in handling complex interactions like contact and collision avoidance, crucial for autonomous robotic manipulation.

Context
Robotic systems often operate in environments with numerous obstacles and objects, making it essential to model and optimise these interactions mathematically. These interactions, defined by complementarity constraints, are challenging to manage because they do not follow standard optimisation assumptions. Most existing physics engines simulate these interactions but do not offer real-time optimisation capabilities. PyRoboCOP addresses this gap by providing a flexible and user-friendly package that allows robots to reason about their environment and optimise their behaviour, which is critical for achieving autonomous manipulation tasks.

Key Features of PyRoboCOP
PyRoboCOP is characterised by its ability to automatically reformulate complex mathematical constraints and integrate seamlessly with powerful optimisation tools. Key features include:
Automatic Reformulation of Complementarity Constraints: Handles difficult constraints that describe object interactions.
Direct Transcription via Orthogonal Collocation: Converts DAEs into a solvable set of nonlinear equations.
Integration with ADOL-C and IPOPT: Supports automatic differentiation and efficient optimisation.
Built-in Support for Contact and Obstacle Avoidance Constraints: Simplifies the setup of complex robotic tasks.
Flexible User Interface: Allows for customisation and adaptation to various robotic systems.

What This Means for You
The package is particularly relevant for researchers, developers, and engineers working in the field of robotics, especially those involved in designing autonomous systems that require precise control and optimisation. 
PyRoboCOP’s ability to handle complex robotic interactions makes it a valuable tool for developing real-time, model-based control solutions in environments where contact and collision avoidance are critical.

Examining the Details
PyRoboCOP's performance was rigorously tested across several robotic scenarios, including planar pushing, car parking, and belt drive unit assembly. In a planar pushing task, PyRoboCOP optimised the robot's trajectory, balancing a normal force of 0.5 N and a friction coefficient of 0.3, successfully navigating from (0, 0, 0) to (0.5, 0.5, 0) and (−0.1, −0.1, 3π/2). In a car parking scenario, the software optimised movement from (1, 4, 0, 0) to (2, 2.5, π/2, 0), effectively avoiding obstacles. PyRoboCOP also managed the complex task of assembling a belt drive unit, demonstrating its ability to handle intricate manipulations. When benchmarked against CasADi and Pyomo, PyRoboCOP showed comparable performance, solving an acrobot system in a mean time of 2.282 seconds with 1,296 variables, versus CasADi's 1.175 seconds with 900 variables and Pyomo's 2.374 seconds with 909 variables.

You can learn more by reading the entire paper or access the package here.

🧠 Expert insight 📚Tasks💥
Here’s an excerpt from “Chapter 2: Surveying GenAI Types and Modes: An Overview of GANs, Diffusers, and Transformers” in the book, Generative AI Foundations in Python by Carlos Rodriguez, published in July 2024.

Applying GAI models – image generation using GANs, diffusers, and transformers
In this hands-on section… You’ll get a first-hand experience and deep dive into the actual implementation of generative models, specifically GANs, diffusion models, and transformers….

We’ll be utilizing the highly versatile PyTorch library, a popular choice among machine learning practitioners, to facilitate our operations. PyTorch provides a powerful and dynamic toolset to define 
and compute gradients, which is central to training these models. In addition, we’ll also use the diffusers library. It’s a specialized library that provides functionality to implement diffusion models. This library enables us to reproduce state-of-the-art diffusion models directly from our workspace. It underpins the creation, training, and usage of denoising diffusion probabilistic models at an unprecedented level of simplicity, without compromising the models’ complexity.

Through this practical session, we’ll explore how to operate and integrate these libraries and implement and manipulate GANs, diffusers, and transformers using the Python programming language. This hands-on experience will complement the theoretical knowledge we have gained in the chapter, enabling us to see these models in action in the real world….

Working with Jupyter Notebook and Google Colab
Jupyter notebooks enable live code execution, visualization, and explanatory text, suitable for prototyping and data analysis. Google Colab, conversely, is a cloud-based version of Jupyter Notebook, designed for machine learning prototyping. It provides free GPU resources and integrates with Google Drive for file storage and sharing. We’ll leverage Colab as our prototyping environment going forward.

Stable diffusion transformer
We begin with a pre-trained stable diffusion model, a text-to-image latent diffusion model created by researchers and engineers from CompVis, Stability AI, and LAION (Patil et al., 2022). The diffusion process is used to draw samples from complex, high-dimensional distributions, and when it interacts with the text embeddings, it creates a powerful conditional image synthesis model.

The term “stable” in this context refers to the fact that during training, a model maintains certain properties that stabilize the learning process. 
Stable diffusion models offer rich potential to create entirely new samples from a given data distribution, based on text prompts.

Again, for our practical example, we will use Google Colab to alleviate a lot of initial setup. Colab also provides all of the computational resources needed to begin experimenting right away. We start by installing some libraries, and with three simple functions, we will build out a minimal StableDiffusionPipeline using a well-established open-source implementation of the stable diffusion method.

First, let’s navigate to our pre-configured Python environment, Google Colab, and install the diffusers open-source library, which will provide most of the key underlying components we need for our experiment.

In the first cell, we install all dependencies using the following bash command. Note the exclamation point at the beginning of the line, which tells our environment to reach down to its underlying process and install the packages we need:

    !pip install pytorch-fid torch diffusers clip transformers accelerate

Next, we import the libraries we’ve just installed to make them available to our Python program:

    from typing import List
    import torch
    import matplotlib.pyplot as plt
    from diffusers import StableDiffusionPipeline, DDPMScheduler
    from PIL import Image  # the pipeline returns PIL images by default

Now, we’re ready for our three functions, which will execute the three tasks – loading the pre-trained model, generating the images based on prompting, and rendering the images:

    def load_model(model_id: str) -> StableDiffusionPipeline:
        """Load model with provided model_id."""
        return StableDiffusionPipeline.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            revision="fp16",
            use_auth_token=False,
        ).to("cuda")

    def generate_images(pipe: StableDiffusionPipeline, prompts: List[str]) -> List[Image.Image]:
        """Generate images based on provided prompts."""
        with torch.autocast("cuda"):
            images = pipe(prompts).images
        return images

    def render_images(images: List[Image.Image]) -> None:
        """Plot the generated images."""
        plt.figure(figsize=(10, 5))
        for i, img in enumerate(images):
            # One row, two columns: this layout assumes two images.
            plt.subplot(1, 2, i + 1)
            plt.imshow(img)
            plt.axis("off")
        plt.show()

In summary, load_model loads a machine learning model identified by model_id onto a GPU for faster processing. The generate_images function takes this model and a list of prompts to create our images. Within this function, you will notice torch.autocast("cuda"), a context manager that enables automatic mixed precision, allowing PyTorch (our underlying machine learning library) to run eligible operations in lower precision for speed while largely maintaining accuracy. Lastly, the render_images function displays these images in a simple grid format, making use of the matplotlib visualization library to render our output.

With our functions defined, we select our model version, define our pipeline, and execute our image generation process:

    # Execution
    model_id = "CompVis/stable-diffusion-v1-4"
    prompts = [
        "A hyper-realistic photo of a friendly lion",
        "A stylized oil painting of a NYC Brownstone",
    ]
    pipe = load_model(model_id)
    images = generate_images(pipe, prompts)
    render_images(images)

The output in Figure 2.1 is a vivid example of the imaginativeness and creativity we typically expect from human art, generated entirely by the diffusion process. Except, how do we measure whether the model was faithful to the text provided?

Figure 2.1: Output for the prompts “A hyper-realistic photo of a friendly lion” (left) and “A stylized oil painting of a NYC Brownstone” (right)

The next step is to evaluate the quality and relevance of our generated images in relation to the prompts. This is where CLIP comes into play. CLIP is designed to measure the alignment between text and images by analyzing their semantic similarities, giving us a true quantitative measure of the fidelity of our synthetic images to the prompts.

Scoring with the CLIP model
CLIP is trained to understand the relationship between text and images by learning to place similar images and text near each other in a shared space. 
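As a rough illustration of what "near each other in a shared space" means, the sketch below compares small hand-made vectors using cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are hypothetical stand-ins invented for this example; real CLIP embeddings are much higher-dimensional and come from the model's image and text encoders.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for illustration (not CLIP outputs).
image_lion = [0.9, 0.1, 0.2]
text_lion = [0.8, 0.2, 0.1]
text_brownstone = [0.1, 0.9, 0.3]

match = cosine_similarity(image_lion, text_lion)
mismatch = cosine_similarity(image_lion, text_brownstone)

# Vectors pointing in similar directions score closer to 1.
print(f"lion image vs. lion text:       {match:.2f}")
print(f"lion image vs. brownstone text: {mismatch:.2f}")
```

In this toy setup, the image vector sits much closer to the matching text vector than to the unrelated one, which is the property a shared embedding space is meant to provide.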
When evaluating a generated image, CLIP checks how closely the image aligns with the textual description provided. A higher score indicates a better match, meaning the image accurately represents the text. Conversely, a lower score suggests a deviation from the text, indicating a lesser quality or fidelity to the prompt, providing a quantitative measure of how well the generated image adheres to the intended description.

Again, we will import the necessary libraries:

    from typing import List, Tuple
    from PIL import Image
    import requests
    from transformers import CLIPProcessor, CLIPModel
    import torch

We begin by loading the CLIP model, processor, and necessary parameters:

    # Constants
    CLIP_REPO = "openai/clip-vit-base-patch32"

    def load_model_and_processor(model_name: str) -> Tuple[CLIPModel, CLIPProcessor]:
        """Loads the CLIP model and processor."""
        model = CLIPModel.from_pretrained(model_name)
        processor = CLIPProcessor.from_pretrained(model_name)
        return model, processor

Next, we define a processing function to adjust the textual prompts and images, ensuring that they are in the correct format for CLIP inference:

    def process_inputs(processor: CLIPProcessor, prompts: List[str],
                       images: List[Image.Image]) -> dict:
        """Processes the inputs using the CLIP processor."""
        return processor(text=prompts, images=images,
                         return_tensors="pt", padding=True)

In this step, we initiate the evaluation process by inputting the images and textual prompts into the CLIP model, which scores every image against every prompt in a single batched forward pass. The model then computes similarity scores, known as logits, for each image-text pair. These scores indicate how well each image corresponds to the text prompts. 
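To make the idea of per-pair scores concrete, here is a minimal sketch that treats the scores as a small matrix, with one row per image and one column per prompt, and picks the best-scoring prompt for each image. The numbers and the `best_prompt_per_image` helper are hypothetical, chosen only for illustration; they are not outputs of the CLIP model.

```python
from typing import List


def best_prompt_per_image(logits: List[List[float]]) -> List[int]:
    """For each image (row), return the index of its highest-scoring prompt."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Hypothetical similarity scores: rows are images, columns are prompts.
example_logits = [
    [28.5, 12.7],  # image 1 scored against prompt 1 and prompt 2
    [14.2, 26.9],  # image 2 scored against prompt 1 and prompt 2
]

best = best_prompt_per_image(example_logits)
for image_index, prompt_index in enumerate(best):
    print(f"Image {image_index + 1} best matches prompt {prompt_index + 1}")
```

Raw scores like these are only meaningful relative to one another within a row, which is why a normalisation step is useful before reading them as likelihoods.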
To interpret these scores more intuitively, we convert them into probabilities, which indicate the likelihood that an image aligns with any of the given prompts:

    def get_probabilities(model: CLIPModel, inputs: dict) -> torch.Tensor:
        """Computes the probabilities using the CLIP model."""
        outputs = model(**inputs)
        logits = outputs.logits_per_image
        # Define temperature - higher temperature will make the distribution more uniform.
        T = 10
        # Apply temperature to the logits
        temp_adjusted_logits = logits / T
        probs = torch.nn.functional.softmax(temp_adjusted_logits, dim=1)
        return probs

Lastly, we display the images along with their scores, visually representing how well each image adheres to the provided prompts:

    # display is available by default in notebooks; imported here for clarity
    from IPython.display import display

    def display_images_with_scores(images: List[Image.Image], scores: torch.Tensor) -> None:
        """Displays the images alongside their scores."""
        # Set print options for readability
        torch.set_printoptions(precision=2, sci_mode=False)
        for i, image in enumerate(images):
            print(f"Image {i + 1}:")
            display(image)
            print(f"Scores: {scores[i, :]}")
            print()

With everything detailed, let’s execute the pipeline as follows:

    # Load CLIP model
    model, processor = load_model_and_processor(CLIP_REPO)
    # Process image and text inputs together
    inputs = process_inputs(processor, prompts, images)
    # Extract the probabilities
    probs = get_probabilities(model, inputs)
    # Display each image with corresponding scores
    display_images_with_scores(images, probs)

We now have scores for each of our synthetic images that quantify the fidelity of the synthetic image to the text provided, based on the CLIP model, which interprets both image and text data as one combined mathematical representation (or geometric space) and can measure their similarity.

Figure 2.2: CLIP scores

For our “friendly lion,” we computed scores of 83% and 17% for each prompt, which we can interpret as an 83% likelihood that the image aligns with the first prompt.

Packt library subscribers can continue reading the entire book for free. 
You can buy Generative AI Foundations in Python by Carlos Rodriguez, here. Get the eBook for $31.99 $21.99!

And that’s a wrap. We have an entire range of newsletters with focused content for tech pros. Subscribe to the ones you find the most useful here. The complete PythonPro archives can be found here.


PythonPro #55: Python Project Packaging Guide, AWS Credentials Theft Alert, and PyTorch 2 Speeds Up ML

PythonPro #54: Global Forecasting Models, Python Overtakes JavaScript, and Hidden Python Libraries

PythonPro #53: FastAPI on Docker, Python-CUDA Integration with Numbast, and Concurrent Requests with httpx vs aiohttp

PythonPro #52: AI-Powered Vulnhuntr for Python, SageMaker Core SDK, and Exploring User Behaviour with Python

PythonPro #51: Python 3.13 REPL Enhancements, Python 3.12 vs. 3.13, and Visualizing Named Entities in Text

PythonPro #50: Python 3.13 Arrives, Offensive Security Practices, and Jupyter Notebook Tips

PythonPro #48: Python 3.13 JIT, Boosting Model Inference, and FastAPI Best Practices

PythonPro #47: Python 3.13 Features, AI Debugging with Copilot, and Building Neural Networks from Scratch

Master Python for Data, AI, and API Development

PythonPro #45: Outlier Detection with Boxplots, Python 3.13 Updates, and Stripe Integration for Django

PythonPro #45: Converting DataFrames, Python Developer Survey, DBSCAN in 5 Minutes, and Web Scraping with Scrapy

PythonPro #44: Generative AI with PyTorch, uv Update, Choosing the Best Visualization Type, and FastAPI for Rapid Development
