API Reference

A client for interacting with LLM completion APIs and tracking usage.

class llm_api_client.APIClient(max_requests_per_minute: int = 10000, max_tokens_per_minute: int = 2000000, max_workers: int = None, max_delay_seconds: int = 300)[source]

Bases: object

A generic API client to run rate-limited requests concurrently using threads.

By default, uses the LiteLLM completion API.

API requests and responses are logged and can optionally be saved to disk if a log file is specified. An APIUsageTracker instance is automatically instantiated to track the cost and usage of API calls.

Examples

>>> completion_api_client = APIClient(max_requests_per_minute=5)
>>> requests = [
...     dict(
...         model="gpt-3.5-turbo",
...         messages=[{"role": "user", "content": prompt}],
...     ) for prompt in user_prompts
... ]
>>> results = completion_api_client.make_requests(requests)

__init__(max_requests_per_minute: int = 10000, max_tokens_per_minute: int = 2000000, max_workers: int = None, max_delay_seconds: int = 300)[source]

Initialize the API client.

Parameters:
  • max_requests_per_minute (int, optional) – Maximum API requests allowed per minute. Default is OPENAI_API_RPM (10,000).

  • max_tokens_per_minute (int, optional) – Maximum tokens allowed per minute. Default is 2,000,000.

  • max_workers (int, optional) – Maximum number of worker threads. Default is min(CPU count * 20, max_rpm).

  • max_delay_seconds (int, optional) – Maximum time in seconds that the internal rate limiter will wait to acquire resources before timing out (applies to both RPM and TPM limiters). Default is 5 minutes.
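
For instance, a client tuned to a stricter provider quota might be constructed as follows (a sketch; the specific limits are illustrative, not recommendations):

>>> client = APIClient(
...     max_requests_per_minute=60,
...     max_tokens_per_minute=90_000,
...     max_workers=8,
...     max_delay_seconds=120,
... )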

count_messages_tokens(messages: list[dict], *, model: str, timeout: float = 10) → int[source]

Count the tokens in a list of messages using the model’s tokenizer.

Parameters:
  • messages (list[dict]) – The messages to count tokens for.

  • model (str) – The model to count tokens for.

  • timeout (float, optional) – The timeout for the token counting operation in seconds.

Returns:

The number of tokens in the messages.

Return type:

int
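
A minimal sketch of pre-flight token counting, reusing the client constructed above (the message content is illustrative):

>>> messages = [{"role": "user", "content": "Summarize this document."}]
>>> n_tokens = client.count_messages_tokens(messages, model="gpt-3.5-turbo")
>>> n_tokens <= client.get_max_context_tokens("gpt-3.5-turbo")
True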

property details: dict[str, Any]

Get the details of the API client.

get_max_context_tokens(model: str) → int[source]

Get the maximum context tokens for a model.

property history: list[dict]

The history of requests and responses.

Returns:

A list of request/response entries.

Return type:

list[dict]
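
The exact keys of each entry are an implementation detail; per the return type, the log is simply a list of dicts, so a conservative sketch inspects only its size:

>>> log = client.history
>>> print(f"{len(log)} request/response entries logged")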

make_requests(requests: list[dict], *, max_workers: int = None, sanitize: bool = True, timeout: float = None) → list[object][source]

Make a series of rate-limited API requests concurrently using threads.

Parameters:
  • requests (list[dict]) – A list of dictionaries, each containing the parameters to pass to the API function call.

  • max_workers (int, optional) – The maximum number of threads to use in the ThreadPoolExecutor. If not provided, will default to: min(CPU count * 20, max_rpm).

  • sanitize (bool, optional) – Whether to sanitize the requests; i.e., filter out request parameters that may be incompatible with the model and provider. Default is True.

  • timeout (float, optional) – Maximum number of seconds to wait for all requests to complete. If None (default), waits indefinitely.

Returns:

responses – A list of response objects returned by the API function calls. If a request fails, the corresponding response will be None.

Return type:

list[object]
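
Because failed requests come back as None, a common follow-up is to pair responses with their originating requests (a sketch, reusing the requests list from the class example):

>>> responses = client.make_requests(requests, timeout=120)
>>> failed = [req for req, resp in zip(requests, responses) if resp is None]
>>> successful = [resp for resp in responses if resp is not None]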

make_requests_with_retries(requests: list[dict], *, max_workers: int = None, max_retries: int = 2, sanitize: bool = True, timeout: float = None, current_retry: int = 0) → list[object][source]

Make a series of rate-limited API requests with automatic retries for failed requests.

Parameters:
  • requests (list[dict]) – A list of dictionaries, each containing the parameters to pass to the API function call.

  • max_workers (int, optional) – Maximum number of worker threads to use.

  • max_retries (int, optional) – Maximum number of retry attempts for failed requests.

  • sanitize (bool, optional) – Whether to sanitize the request parameters.

  • timeout (float, optional) – Maximum number of seconds to wait for all requests to complete. If None (default), waits indefinitely.

  • current_retry (int, optional) – Current retry attempt number (used internally for recursion).

Returns:

A list of response objects returned by the API function calls.

Return type:

list[object]
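
A sketch of a more fault-tolerant batch; per the parameters above, failed requests are retried up to max_retries times:

>>> responses = client.make_requests_with_retries(
...     requests,
...     max_retries=3,
...     timeout=300,
... )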

remove_unsupported_params(request: dict) → dict[source]

Ensure request params are compatible with the model and provider.

Checks and removes unsupported parameters for this model and provider.

Returns:

compatible_request – A dictionary containing the provided request with all unsupported parameters removed.

Return type:

dict
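
A sketch of checking a single request by hand; whether a parameter such as temperature is actually unsupported depends on the model and provider, so it is purely illustrative here:

>>> request = dict(
...     model="gpt-3.5-turbo",
...     messages=[{"role": "user", "content": "Hello"}],
...     temperature=0.2,
... )
>>> compatible_request = client.remove_unsupported_params(request)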

sanitize_completion_request(request: dict) → dict[source]

Sanitize the request parameters for the completion API.

  1. Checks and removes unsupported parameters for this model and provider.

  2. Truncates the request to the maximum context tokens for the model.

Returns:

sanitized_request – A dictionary containing parsed and filtered request parameters.

Return type:

dict
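
A sketch combining both steps in a single call. Note that make_requests already applies this sanitization when sanitize=True (the default), so calling it directly is mainly useful for inspecting a request ahead of time:

>>> sanitized_request = client.sanitize_completion_request(request)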

property tracker: APIUsageTracker

The API usage tracker instance.

Returns:

The API usage tracker.

Return type:

llm_api_client.api_tracker.APIUsageTracker
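
A sketch of reading accumulated usage after a batch, using the tracker methods documented under APIUsageTracker below:

>>> tracker = client.tracker
>>> print(tracker.get_stats_str())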

truncate_to_max_context_tokens(messages: list[dict], model: str) → list[dict][source]

Truncate a prompt to the maximum context tokens for a model.

Parameters:
  • messages (list[dict]) – The request messages to truncate.

  • model (str) – The name of the model to use.

Returns:

The request messages, truncated so that the total token count is less than or equal to the maximum context tokens for the model.

Return type:

list[dict]
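
A sketch of truncating an over-long conversation before building a request (which messages are dropped or shortened is an implementation detail):

>>> truncated = client.truncate_to_max_context_tokens(messages, model="gpt-3.5-turbo")
>>> request = dict(model="gpt-3.5-turbo", messages=truncated)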

class llm_api_client.APIUsageTracker[source]

Bases: object

Class to track the cost of API calls.

__del__()[source]

Destructor that prints usage stats when the object is destroyed.

__init__()[source]

Initialize the API usage tracker.

__str__()[source]

String representation of the API usage tracker.

property details: dict[str, Any]

Get the details of the API usage tracker.

get_stats_str() → str[source]

Get a string representation of the API usage tracker.

property mean_response_time: float | None

Mean response time of API calls in seconds.

property num_api_calls: int

Number of API calls; more precisely, the number of API responses received.

response_time_at_percentile(percentile: float) → float | None[source]

Response time at a given percentile in seconds.
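
A sketch, assuming the percentile is expressed on a 0–100 scale; the float | None return type suggests None is returned when no response times have been recorded yet:

>>> p95 = tracker.response_time_at_percentile(95)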

set_up_litellm_cost_tracking()[source]

Set up cost tracking for API calls using LiteLLM.

property total_completion_tokens: int

Total number of completion tokens used across all API calls.

property total_cost: float

Total cost of all tracked API calls.

property total_prompt_tokens: int

Total number of prompt tokens used across all API calls.

track_cost_callback(kwargs, completion_response, start_time, end_time)[source]

Function to track cost of API calls.

This function is registered as a callback with the litellm package by calling tracker.set_up_litellm_cost_tracking(), or manually by setting litellm.success_callback = [tracker.track_cost_callback].
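
Both registration paths described above, as a sketch (litellm.success_callback is LiteLLM’s real callback hook; the standalone tracker here is illustrative):

>>> tracker = APIUsageTracker()
>>> tracker.set_up_litellm_cost_tracking()
>>> # or, equivalently, register the callback manually:
>>> import litellm
>>> litellm.success_callback = [tracker.track_cost_callback]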
