API Reference

A client for interacting with LLM completion APIs and tracking usage.

class llm_api_client.APIClient(max_requests_per_minute: int = 10000, max_tokens_per_minute: int = 2000000, max_workers: int = None, max_delay_seconds: int = 300)[source]

Bases: object

A generic API client to run rate-limited requests concurrently using threads.

By default, uses the LiteLLM completion API.

API requests and responses are logged and can optionally be saved to disk if a log file is specified. An APIUsageTracker instance is automatically instantiated to track the cost and usage of API calls.

Examples

>>> completion_api_client = APIClient(max_requests_per_minute=5)
>>> requests = [
...     dict(
...         model="gpt-3.5-turbo",
...         messages=[{"role": "user", "content": prompt}],
...     ) for prompt in user_prompts
... ]
>>> results = completion_api_client.make_requests(requests)

__init__(max_requests_per_minute: int = 10000, max_tokens_per_minute: int = 2000000, max_workers: int = None, max_delay_seconds: int = 300)[source]

Initialize the API client.

Parameters:
  • max_requests_per_minute (int, optional) – Maximum API requests allowed per minute. Default is OPENAI_API_RPM (10,000).

  • max_tokens_per_minute (int, optional) – Maximum tokens allowed per minute. Default is 2,000,000.

  • max_workers (int, optional) – Maximum number of worker threads. Default is min(CPU count * 20, max_rpm).

  • max_delay_seconds (int, optional) – Maximum time in seconds that the internal rate limiter will wait to acquire resources before timing out (applies to both RPM and TPM limiters). Default is 5 minutes.
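
For instance, a client tuned to a stricter provider quota might be constructed as follows (a sketch; the specific limits are illustrative, not recommendations):

>>> client = APIClient(
...     max_requests_per_minute=60,
...     max_tokens_per_minute=90_000,
...     max_workers=8,
...     max_delay_seconds=120,
... )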

count_messages_tokens(messages: list[dict], *, model: str, timeout: float = 10) → int[source]

Count the tokens in a list of messages using the model’s tokenizer.

Parameters:
  • messages (list[dict]) – The messages to count tokens for.

  • model (str) – The model to count tokens for.

  • timeout (float, optional) – The timeout for the token counting operation in seconds.

Returns:

The number of tokens in the messages.

Return type:

int
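
A minimal sketch of pre-flight token counting, reusing the client constructed above (the message content is illustrative):

>>> messages = [{"role": "user", "content": "Summarize this document."}]
>>> n_tokens = client.count_messages_tokens(messages, model="gpt-3.5-turbo")
>>> n_tokens <= client.get_max_context_tokens("gpt-3.5-turbo")
True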

property details: dict[str, Any]

Get the details of the API client.

get_max_context_tokens(model: str) → int[source]

Get the maximum context tokens for a model.

property history: list[dict]

The history of requests and responses.

Returns:

A list of request/response entries.

Return type:

list[dict]
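
The exact keys of each entry are an implementation detail; per the return type, the log is simply a list of dicts, so a conservative sketch inspects only its size:

>>> log = client.history
>>> print(f"{len(log)} request/response entries logged")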

make_requests(requests: list[dict], *, max_workers: int = None, sanitize: bool = True, timeout: float = None) → list[object][source]

Make a series of rate-limited API requests concurrently using threads.

Parameters:
  • requests (list[dict]) – A list of dictionaries, each containing the parameters to pass to the API function call.

  • max_workers (int, optional) – The maximum number of threads to use in the ThreadPoolExecutor. If not provided, will default to: min(CPU count * 20, max_rpm).

  • sanitize (bool, optional) – Whether to sanitize the requests; i.e., filter out request parameters that may be incompatible with the model and provider. Default is True.

  • timeout (float, optional) – Maximum number of seconds to wait for all requests to complete. If None (default), waits indefinitely.

Returns:

responses – A list of response objects returned by the API function calls. If a request fails, the corresponding response will be None.

Return type:

list[object]
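
Because failed requests come back as None, a common follow-up is to pair responses with their originating requests (a sketch, reusing the requests list from the class example):

>>> responses = client.make_requests(requests, timeout=120)
>>> failed = [req for req, resp in zip(requests, responses) if resp is None]
>>> successful = [resp for resp in responses if resp is not None]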

make_requests_with_retries(requests: list[dict], *, max_workers: int = None, max_retries: int = 2, sanitize: bool = True, timeout: float = None, current_retry: int = 0) → list[object][source]

Make a series of rate-limited API requests with automatic retries for failed requests.

Parameters:
  • requests (list[dict]) – A list of dictionaries, each containing the parameters to pass to the API function call.

  • max_workers (int, optional) – Maximum number of worker threads to use.

  • max_retries (int, optional) – Maximum number of retry attempts for failed requests.

  • sanitize (bool, optional) – Whether to sanitize the request parameters.

  • timeout (float, optional) – Maximum number of seconds to wait for all requests to complete. If None (default), waits indefinitely.

  • current_retry (int, optional) – Current retry attempt number (used internally for recursion).

Returns:

A list of response objects returned by the API function calls.

Return type:

list[object]
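
A sketch of a more fault-tolerant batch; per the parameters above, failed requests are retried up to max_retries times:

>>> responses = client.make_requests_with_retries(
...     requests,
...     max_retries=3,
...     timeout=300,
... )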

remove_unsupported_params(request: dict) → dict[source]

Ensure request params are compatible with the model and provider.

Checks and removes unsupported parameters for this model and provider.

Returns:

compatible_request – A dictionary containing the provided request with all unsupported parameters removed.

Return type:

dict
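
A sketch of checking a single request by hand; whether a parameter such as temperature is actually unsupported depends on the model and provider, so it is purely illustrative here:

>>> request = dict(
...     model="gpt-3.5-turbo",
...     messages=[{"role": "user", "content": "Hello"}],
...     temperature=0.2,
... )
>>> compatible_request = client.remove_unsupported_params(request)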

sanitize_completion_request(request: dict) → dict[source]

Sanitize the request parameters for the completion API.

  1. Checks and removes unsupported parameters for this model and provider.

  2. Truncates the request to the maximum context tokens for the model.

Returns:

sanitized_request – A dictionary containing parsed and filtered request parameters.

Return type:

dict
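
A sketch combining both steps in a single call. Note that make_requests already applies this sanitization when sanitize=True (the default), so calling it directly is mainly useful for inspecting a request ahead of time:

>>> sanitized_request = client.sanitize_completion_request(request)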

property tracker: APIUsageTracker

The API usage tracker instance.

Returns:

The API usage tracker.

Return type:

llm_api_client.api_tracker.APIUsageTracker
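
A sketch of reading accumulated usage after a batch, using the tracker methods documented under APIUsageTracker below:

>>> tracker = client.tracker
>>> print(tracker.get_stats_str())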

truncate_to_max_context_tokens(messages: list[dict], model: str) → list[dict][source]

Truncate a prompt to the maximum context tokens for a model.

Parameters:
  • messages (list[dict]) – The request messages to truncate.

  • model (str) – The name of the model to use.

Returns:

The request messages, truncated so that the total token count is less than or equal to the maximum context tokens for the model.

Return type:

list[dict]
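
A sketch of truncating an over-long conversation before building a request (which messages are dropped or shortened is an implementation detail):

>>> truncated = client.truncate_to_max_context_tokens(messages, model="gpt-3.5-turbo")
>>> request = dict(model="gpt-3.5-turbo", messages=truncated)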

class llm_api_client.APIUsageTracker[source]

Bases: object

Class to track the cost of API calls.

__del__()[source]

Destructor that prints usage stats when the object is destroyed.

__init__()[source]

Initialize the API usage tracker.

__str__()[source]

String representation of the API usage tracker.

property details: dict[str, Any]

Get the details of the API usage tracker.

get_stats_str() → str[source]

Get a string representation of the API usage tracker.

property mean_response_time: float | None

Mean response time of API calls in seconds.

property num_api_calls: int

Number of API calls; more precisely, the number of API responses received.

response_time_at_percentile(percentile: float) → float | None[source]

Response time at a given percentile in seconds.
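
A sketch, assuming the percentile is expressed on a 0–100 scale; the float | None return type suggests None is returned when no response times have been recorded yet:

>>> p95 = tracker.response_time_at_percentile(95)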

set_up_litellm_cost_tracking()[source]

Set up cost tracking for API calls using LiteLLM.

property total_completion_tokens: int

Total number of completion tokens used across all API calls.

property total_cost: float

Total cost of all tracked API calls.

property total_prompt_tokens: int

Total number of prompt tokens used across all API calls.

track_cost_callback(kwargs, completion_response, start_time, end_time)[source]

Function to track cost of API calls.

This function is registered as a callback with the litellm package by calling tracker.set_up_litellm_cost_tracking(), or manually by setting litellm.success_callback = [tracker.track_cost_callback].
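
Both registration paths described above, as a sketch (litellm.success_callback is LiteLLM’s real callback hook; the standalone tracker here is illustrative):

>>> tracker = APIUsageTracker()
>>> tracker.set_up_litellm_cost_tracking()
>>> # or, equivalently, register the callback manually:
>>> import litellm
>>> litellm.success_callback = [tracker.track_cost_callback]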
