API Reference
A client for interacting with LLM completion APIs and tracking usage.
- class llm_api_client.APIClient(max_requests_per_minute: int = 10000, max_tokens_per_minute: int = 2000000, max_workers: int = None, max_delay_seconds: int = 300)[source]
Bases: object
A generic API client to run rate-limited requests concurrently using threads.
By default, uses the LiteLLM completion API.
API requests and responses are logged and can optionally be saved to disk if a log file is specified. An APIUsageTracker instance is automatically instantiated to track the cost and usage of API calls.
Examples
>>> completion_api_client = APIClient(max_requests_per_minute=5)
>>> requests = [
...     dict(
...         model="gpt-3.5-turbo",
...         messages=[{"role": "user", "content": prompt}],
...     )
...     for prompt in user_prompts
... ]
>>> results = completion_api_client.make_requests(requests)
- __init__(max_requests_per_minute: int = 10000, max_tokens_per_minute: int = 2000000, max_workers: int = None, max_delay_seconds: int = 300)[source]
Initialize the API client.
- Parameters:
max_requests_per_minute (int, optional) – Maximum API requests allowed per minute. Default is OPENAI_API_RPM (10,000).
max_tokens_per_minute (int, optional) – Maximum tokens allowed per minute. Default is 2,000,000.
max_workers (int, optional) – Maximum number of worker threads. Default is min(CPU count * 20, max_rpm).
max_delay_seconds (int, optional) – Maximum time in seconds that the internal rate limiter will wait to acquire resources before timing out (applies to both RPM and TPM limiters). Default is 5 minutes.
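The documented default for max_workers can be sketched as follows. This is a minimal illustration of the documented formula, not the library's internals; the function name is hypothetical.

```python
import os

def default_max_workers(max_rpm: int) -> int:
    # Documented default: min(CPU count * 20, max_rpm).
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    return min(cpu_count * 20, max_rpm)

# With a very low RPM limit, the limit itself caps the pool size:
workers = default_max_workers(5)
```

With the default RPM of 10,000, the CPU count usually determines the pool size instead.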
- count_messages_tokens(messages: list[dict], *, model: str, timeout: float = 10) int[source]
Count the tokens in a list of messages using the model’s tokenizer.
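The real method uses the model's tokenizer; as a rough stand-in that shows the expected input shape, one can count whitespace-separated words. The helper below is hypothetical and only illustrative.

```python
def approx_message_tokens(messages: list[dict]) -> int:
    # Crude stand-in for a real tokenizer: whitespace-separated words.
    return sum(len(m.get("content", "").split()) for m in messages)

msgs = [{"role": "user", "content": "What is the capital of France?"}]
count = approx_message_tokens(msgs)  # 6 "tokens" under this approximation
```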
- make_requests(requests: list[dict], *, max_workers: int = None, sanitize: bool = True, timeout: float = None) list[object][source]
Make a series of rate-limited API requests concurrently using threads.
- Parameters:
requests (list[dict]) – A list of dictionaries, each containing the parameters to pass to the API function call.
max_workers (int, optional) – The maximum number of threads to use in the ThreadPoolExecutor. If not provided, will default to: min(CPU count * 20, max_rpm).
sanitize (bool, optional) – Whether to sanitize the requests; i.e., filter out request parameters that may be incompatible with the model and provider. Default is True.
timeout (float, optional) – Maximum number of seconds to wait for all requests to complete. If None (default), waits indefinitely.
- Returns:
responses – A list of response objects returned by the API function calls. If a request fails, the corresponding response will be None.
- Return type:
list[object]
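The documented contract (responses ordered like the requests, with None for failures) can be sketched with a ThreadPoolExecutor and a stand-in API function. This is not the library's implementation; fake_api_call and the rate-limiting details are omitted or hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def fake_api_call(request: dict) -> str:
    # Stand-in for a completion call; raises on a malformed request.
    if "model" not in request:
        raise ValueError("missing model")
    return f"response for {request['model']}"

def make_requests_sketch(requests: list[dict], max_workers: int = 4) -> list[object]:
    # Responses come back in request order; a failed request yields
    # None, matching the documented behavior of make_requests.
    def run_one(request):
        try:
            return fake_api_call(request)
        except Exception:
            return None
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, requests))

results = make_requests_sketch([{"model": "gpt-3.5-turbo"}, {}])
# results[0] is a response string; results[1] is None
```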
- make_requests_with_retries(requests: list[dict], *, max_workers: int = None, max_retries: int = 2, sanitize: bool = True, timeout: float = None, current_retry: int = 0) list[object][source]
Make a series of rate-limited API requests with automatic retries for failed requests.
- Parameters:
requests (list[dict]) – A list of dictionaries, each containing the parameters to pass to the API function call.
max_workers (int, optional) – Maximum number of worker threads to use.
max_retries (int, optional) – Maximum number of retry attempts for failed requests.
sanitize (bool, optional) – Whether to sanitize the request parameters.
timeout (float, optional) – Maximum number of seconds to wait for all requests to complete. If None (default), waits indefinitely.
current_retry (int, optional) – Current retry attempt number (used internally for recursion).
- Returns:
A list of response objects returned by the API function calls.
- Return type:
list[object]
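One way the retry contract could look: re-send only the requests whose slot is still None, keeping results aligned with the original request order. This sequential sketch is not the library's threaded, recursive implementation; send and flaky_send are hypothetical.

```python
def retry_requests_sketch(requests, send, max_retries=2):
    # Retry only the requests whose slot is still None, keeping
    # results aligned with the original request order.
    results = [None] * len(requests)
    for attempt in range(max_retries + 1):
        pending = [i for i, r in enumerate(results) if r is None]
        if not pending:
            break
        for i in pending:
            try:
                results[i] = send(requests[i])
            except Exception:
                results[i] = None
    return results

# A sender that fails on each request's first attempt, then succeeds:
calls = {}
def flaky_send(request):
    calls[request["id"]] = calls.get(request["id"], 0) + 1
    if calls[request["id"]] == 1:
        raise RuntimeError("transient error")
    return "ok"

out = retry_requests_sketch([{"id": 1}, {"id": 2}], flaky_send)
```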
- remove_unsupported_params(request: dict) dict[source]
Ensure request params are compatible with the model and provider.
Checks and removes unsupported parameters for this model and provider.
- Returns:
compatible_request – A dictionary containing the provided request with all unsupported parameters removed.
- Return type:
dict
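The filtering step amounts to dropping request keys outside the set a model/provider supports. The allow-list below is hypothetical; the real client derives supported parameters from the model and provider.

```python
# Hypothetical allow-list for illustration only.
SUPPORTED_PARAMS = {"model", "messages", "temperature", "max_tokens"}

def remove_unsupported_params_sketch(request: dict) -> dict:
    # Keep only parameters the model/provider is known to accept.
    return {k: v for k, v in request.items() if k in SUPPORTED_PARAMS}

request = {"model": "gpt-3.5-turbo", "messages": [], "logprobs": 5}
compatible = remove_unsupported_params_sketch(request)
# "logprobs" is dropped; "model" and "messages" survive
```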
- sanitize_completion_request(request: dict) dict[source]
Sanitize the request parameters for the completion API.
Checks and removes unsupported parameters for this model and provider.
Truncates the request to the maximum context tokens for the model.
- Returns:
sanitized_request – A dictionary containing parsed and filtered request parameters.
- Return type:
dict
- property tracker: APIUsageTracker
The API usage tracker instance.
- Returns:
The API usage tracker.
- Return type:
APIUsageTracker
- class llm_api_client.APIUsageTracker[source]
Bases: object
Class to track the cost of API calls.
- response_time_at_percentile(percentile: float) float | None[source]
Response time at a given percentile in seconds.
- track_cost_callback(kwargs, completion_response, start_time, end_time)[source]
Function to track cost of API calls.
This function is registered as a callback with the litellm package by calling tracker.set_up_litellm_cost_tracking(), or manually by setting litellm.success_callback = [tracker.track_cost_callback].
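The general shape of such a success callback: a bound method that accumulates state each time the callback list is invoked. The response_cost field below is hypothetical, not litellm's actual payload; this only illustrates the registration pattern.

```python
class UsageTrackerSketch:
    def __init__(self):
        self.total_cost = 0.0
        self.response_times = []

    def track_cost_callback(self, kwargs, completion_response, start_time, end_time):
        # Hypothetical field: a pre-computed cost supplied by the caller.
        self.total_cost += kwargs.get("response_cost", 0.0)
        self.response_times.append(end_time - start_time)

tracker = UsageTrackerSketch()
# Mimics litellm.success_callback = [tracker.track_cost_callback]:
success_callbacks = [tracker.track_cost_callback]
for cb in success_callbacks:
    cb({"response_cost": 0.002}, None, 10.0, 11.5)
```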
Client
A helper class to run rate-limited API requests concurrently using threads.
- class llm_api_client.api_client.APIClient(max_requests_per_minute: int = 10000, max_tokens_per_minute: int = 2000000, max_workers: int = None, max_delay_seconds: int = 300)[source]
Bases: object
A generic API client to run rate-limited requests concurrently using threads.
By default, uses the LiteLLM completion API.
API requests and responses are logged and can optionally be saved to disk if a log file is specified. An APIUsageTracker instance is automatically instantiated to track the cost and usage of API calls.
Examples
>>> completion_api_client = APIClient(max_requests_per_minute=5)
>>> requests = [
...     dict(
...         model="gpt-3.5-turbo",
...         messages=[{"role": "user", "content": prompt}],
...     )
...     for prompt in user_prompts
... ]
>>> results = completion_api_client.make_requests(requests)
- __init__(max_requests_per_minute: int = 10000, max_tokens_per_minute: int = 2000000, max_workers: int = None, max_delay_seconds: int = 300)[source]
Initialize the API client.
- Parameters:
max_requests_per_minute (int, optional) – Maximum API requests allowed per minute. Default is OPENAI_API_RPM (10,000).
max_tokens_per_minute (int, optional) – Maximum tokens allowed per minute. Default is 2,000,000.
max_workers (int, optional) – Maximum number of worker threads. Default is min(CPU count * 20, max_rpm).
max_delay_seconds (int, optional) – Maximum time in seconds that the internal rate limiter will wait to acquire resources before timing out (applies to both RPM and TPM limiters). Default is 5 minutes.
- property tracker: APIUsageTracker
The API usage tracker instance.
- Returns:
The API usage tracker.
- Return type:
APIUsageTracker
- make_requests(requests: list[dict], *, max_workers: int = None, sanitize: bool = True, timeout: float = None) list[object][source]
Make a series of rate-limited API requests concurrently using threads.
- Parameters:
requests (list[dict]) – A list of dictionaries, each containing the parameters to pass to the API function call.
max_workers (int, optional) – The maximum number of threads to use in the ThreadPoolExecutor. If not provided, will default to: min(CPU count * 20, max_rpm).
sanitize (bool, optional) – Whether to sanitize the requests; i.e., filter out request parameters that may be incompatible with the model and provider. Default is True.
timeout (float, optional) – Maximum number of seconds to wait for all requests to complete. If None (default), waits indefinitely.
- Returns:
responses – A list of response objects returned by the API function calls. If a request fails, the corresponding response will be None.
- Return type:
list[object]
- make_requests_with_retries(requests: list[dict], *, max_workers: int = None, max_retries: int = 2, sanitize: bool = True, timeout: float = None, current_retry: int = 0) list[object][source]
Make a series of rate-limited API requests with automatic retries for failed requests.
- Parameters:
requests (list[dict]) – A list of dictionaries, each containing the parameters to pass to the API function call.
max_workers (int, optional) – Maximum number of worker threads to use.
max_retries (int, optional) – Maximum number of retry attempts for failed requests.
sanitize (bool, optional) – Whether to sanitize the request parameters.
timeout (float, optional) – Maximum number of seconds to wait for all requests to complete. If None (default), waits indefinitely.
current_retry (int, optional) – Current retry attempt number (used internally for recursion).
- Returns:
A list of response objects returned by the API function calls.
- Return type:
list[object]
- sanitize_completion_request(request: dict) dict[source]
Sanitize the request parameters for the completion API.
Checks and removes unsupported parameters for this model and provider.
Truncates the request to the maximum context tokens for the model.
- Returns:
sanitized_request – A dictionary containing parsed and filtered request parameters.
- Return type:
dict
- remove_unsupported_params(request: dict) dict[source]
Ensure request params are compatible with the model and provider.
Checks and removes unsupported parameters for this model and provider.
- Returns:
compatible_request – A dictionary containing the provided request with all unsupported parameters removed.
- Return type:
dict
- truncate_to_max_context_tokens(messages: list[dict], model: str) list[dict][source]
Truncate a prompt to the maximum context tokens for a model.
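One plausible truncation strategy: drop the oldest non-system messages until the conversation fits a token budget. This is a hypothetical sketch (the word_count stand-in replaces a real tokenizer, and the real method truncates against the model's context window), not the library's algorithm.

```python
def truncate_sketch(messages, max_tokens, count_tokens):
    # Drop the oldest non-system messages until the conversation fits.
    truncated = list(messages)
    while truncated and count_tokens(truncated) > max_tokens:
        # Preserve a leading system message if one is present.
        drop_index = 1 if truncated[0].get("role") == "system" and len(truncated) > 1 else 0
        truncated.pop(drop_index)
    return truncated

word_count = lambda msgs: sum(len(m["content"].split()) for m in msgs)
msgs = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "one two three four"},
    {"role": "user", "content": "five six"},
]
fit = truncate_sketch(msgs, 5, word_count)
# The oldest user message is dropped; system message is preserved
```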
Usage Tracker
API usage tracker.
- class llm_api_client.api_tracker.APIUsageTracker[source]
Bases: object
Class to track the cost of API calls.
- response_time_at_percentile(percentile: float) float | None[source]
Response time at a given percentile in seconds.
- track_cost_callback(kwargs, completion_response, start_time, end_time)[source]
Function to track cost of API calls.
This function is registered as a callback with the litellm package by calling tracker.set_up_litellm_cost_tracking(), or manually by setting litellm.success_callback = [tracker.track_cost_callback].