{ "cells": [ { "cell_type": "markdown", "id": "0", "metadata": {}, "source": [ "# Maximising Performance\n", "\n", "PyProBE uses [Polars LazyFrames](https://docs.pola.rs/user-guide/lazy/) under the hood. This means that data isn't loaded into memory and calculations aren't run until the data is requested by the user, either as a plot or as a DataFrame. This is what makes working with PyProBE much faster than working with Pandas DataFrames, as [this example notebook](comparing-pyprobe-performance.ipynb) demonstrates.\n", "\n", "Working with LazyFrames efficiently, though, requires a few best practices, which this notebook demonstrates." ] }, { "cell_type": "code", "execution_count": null, "id": "1", "metadata": {}, "outputs": [], "source": [ "import timeit\n", "\n", "import matplotlib.pyplot as plt\n", "\n", "import pyprobe" ] }, { "cell_type": "code", "execution_count": null, "id": "2", "metadata": {}, "outputs": [], "source": [ "# Load test data\n", "data_directory = \"../../../tests/sample_data/neware\"\n", "info_dictionary = {\"test_name\": \"Sample\", \"device\": \"Neware\"}\n", "\n", "\n", "def load_data():\n", "    \"\"\"Helper function to load fresh data for each benchmark run.\"\"\"\n", "    cell_new = pyprobe.Cell(info=info_dictionary)\n", "    cell_new.import_data(\n", "        procedure_name=\"Sample\",\n", "        data_path=data_directory + \"/sample_data_neware.parquet\",\n", "    )\n", "    return (\n", "        cell_new.procedure[\"Sample\"].experiment(\"Break-in Cycles\").cycle(1).discharge(0)\n", "    )" ] }, { "cell_type": "markdown", "id": "3", "metadata": {}, "source": [ "## Single get() with Multiple Arguments vs Multiple get() Calls\n", "\n", "When you need to retrieve multiple columns, the most efficient approach is to use a single `get()` call with multiple column arguments. This processes all columns in a single lazy evaluation plan, rather than evaluating the plan again for each separate `get()` call." 
] }, { "cell_type": "code", "execution_count": null, "id": "4", "metadata": {}, "outputs": [], "source": [ "# Method 1: Multiple separate get() calls\n", "def multiple_get_calls():\n", "    result = load_data()\n", "    _ = result.get(\"Time [s]\")\n", "    _ = result.get(\"Current [A]\")\n", "    _ = result.get(\"Voltage [V]\")\n", "\n", "\n", "# Method 2: Single get() with multiple column arguments\n", "def single_get_multiple_args():\n", "    result = load_data()\n", "    _ = result.get(\"Time [s]\", \"Current [A]\", \"Voltage [V]\")\n", "\n", "\n", "# Benchmark the two methods\n", "num_runs = 10\n", "time_multiple_get = timeit.timeit(multiple_get_calls, number=num_runs) / num_runs\n", "time_single_get = timeit.timeit(single_get_multiple_args, number=num_runs) / num_runs\n", "\n", "# Visualize the results\n", "plt.figure(figsize=(8, 6))\n", "methods = [\n", "    \"Multiple get()\\ncalls\",\n", "    \"Single get()\\nwith multiple args\",\n", "]\n", "times = [\n", "    time_multiple_get * 1000,\n", "    time_single_get * 1000,\n", "]\n", "colors = [\"#ff7f0e\", \"#1f77b4\"]\n", "bars = plt.bar(methods, times, color=colors)\n", "plt.ylabel(\"Time (ms)\")\n", "plt.title(\"Single get() with Multiple Arguments vs Multiple get() Calls\")\n", "plt.ylim(0, max(times) * 1.2)\n", "\n", "# Add value labels on bars\n", "for bar, time in zip(bars, times):\n", "    height = bar.get_height()\n", "    plt.text(\n", "        bar.get_x() + bar.get_width() / 2,\n", "        height,\n", "        f\"{time:.2f} ms\",\n", "        ha=\"center\",\n", "        va=\"bottom\",\n", "    )\n", "\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "5", "metadata": {}, "source": [ "## Using collect() to Optimize Multiple get() Calls\n", "\n", "If you need to call `get()` multiple times, you can improve performance by calling `collect()` first. This materializes the lazy dataframe once, and subsequent `get()` calls operate on the collected data, avoiding repeated lazy evaluation." 
] }, { "cell_type": "code", "execution_count": null, "id": "6", "metadata": {}, "outputs": [], "source": [ "# Benchmark multiple numbers of get() calls\n", "num_calls_list = [1, 3, 5, 10, 15, 20]\n", "times_multiple_get = []\n", "times_collect_then_get = []\n", "\n", "for num_calls in num_calls_list:\n", "    # Method 1: Multiple separate get() calls\n", "    def multiple_get_calls():\n", "        result = load_data()\n", "        for _ in range(num_calls):\n", "            _ = result.get(\"Time [s]\")\n", "            _ = result.get(\"Current [A]\")\n", "            _ = result.get(\"Voltage [V]\")\n", "\n", "    # Method 2: Single collect() followed by multiple get() calls\n", "    def single_collect_then_get():\n", "        result = load_data()\n", "        result.collect()\n", "        for _ in range(num_calls):\n", "            _ = result.get(\"Time [s]\")\n", "            _ = result.get(\"Current [A]\")\n", "            _ = result.get(\"Voltage [V]\")\n", "\n", "    # Benchmark\n", "    num_runs = 10\n", "    time_mg = timeit.timeit(multiple_get_calls, number=num_runs) / num_runs\n", "    time_cg = timeit.timeit(single_collect_then_get, number=num_runs) / num_runs\n", "\n", "    times_multiple_get.append(time_mg * 1000)  # Convert to ms\n", "    times_collect_then_get.append(time_cg * 1000)  # Convert to ms\n", "\n", "# Plot the results\n", "plt.figure(figsize=(10, 6))\n", "plt.plot(\n", "    num_calls_list,\n", "    times_multiple_get,\n", "    marker=\"o\",\n", "    linewidth=2,\n", "    markersize=8,\n", "    label=\"Multiple get() calls\",\n", "    color=\"#ff7f0e\",\n", ")\n", "plt.plot(\n", "    num_calls_list,\n", "    times_collect_then_get,\n", "    marker=\"s\",\n", "    linewidth=2,\n", "    markersize=8,\n", "    label=\"Single collect() + get() calls\",\n", "    color=\"#2ca02c\",\n", ")\n", "plt.xlabel(\"Number of get() Call Sets\")\n", "plt.ylabel(\"Total Time (ms)\")\n", "plt.title(\"Performance: Multiple get() Calls vs collect() + get() Calls\")\n", "plt.xticks(num_calls_list)\n", "plt.legend()\n", "plt.grid(True, alpha=0.3)\n", "plt.tight_layout()\n", "plt.show()" ] } ], "metadata": { "language_info": { 
"codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 5 }