All functions listed in this document are safe to call from the main thread, and all callbacks run on the main thread unless explicitly noted otherwise.

ModelRunner

A ModelRunner represents a loaded model instance. The SDK returns concrete ModelRunner implementations, but your code only needs the protocol surface:
public protocol ModelRunner {
  func createConversation(systemPrompt: String?) -> Conversation
  func createConversationFromHistory(history: [ChatMessage]) -> Conversation
  func generateResponse(
    conversation: Conversation,
    generationOptions: GenerationOptions?,
    onResponseCallback: @escaping (MessageResponse) -> Void,
    onErrorCallback: ((LeapError) -> Void)?
  ) -> GenerationHandler
  func unload() async
  var modelId: String { get }
}

Lifecycle

  • Create conversations using createConversation(systemPrompt:) or createConversationFromHistory(history:).
  • Hold a strong reference to the ModelRunner for as long as you need to perform generations.
  • Call unload() when you are done to release native resources (optional; this also happens automatically on deinit).
  • Access modelId to identify the loaded model (for analytics, debugging, or UI labels).
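
Putting these steps together, a minimal lifecycle sketch (assuming runner is a ModelRunner already obtained from the SDK's model-loading API, which is not covered in this section):
// `runner` is assumed to come from the SDK's model-loading API.
let conversation = runner.createConversation(systemPrompt: "You are a helpful assistant.")
print("Active model:", runner.modelId)

// ... keep a strong reference to `runner` while generations are running ...

Task {
  // Optional: release native resources explicitly instead of waiting for deinit.
  await runner.unload()
}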

Low-level generation API

generateResponse(...) drives generation with callbacks and returns a GenerationHandler you can store to cancel the run. Most apps call the higher-level streaming helpers on Conversation, but you can invoke this method directly when you need fine-grained control (for example, integrating with custom async primitives).
let handler = runner.generateResponse(
  conversation: conversation,
  generationOptions: options,
  onResponseCallback: { message in
    // Handle MessageResponse values here
  },
  onErrorCallback: { error in
    // Handle LeapError
  }
)

// Stop generation early if needed
handler.stop()

GenerationHandler

public protocol GenerationHandler: Sendable {
  func stop()
}
The handler returned by ModelRunner.generateResponse or Conversation.generateResponse(..., onResponse:) lets you cancel generation without tearing down the conversation.

Conversation

Conversation tracks chat state and provides streaming helpers built on top of the model runner.
public class Conversation {
  public let modelRunner: ModelRunner
  public private(set) var history: [ChatMessage]
  public private(set) var functions: [LeapFunction]
  public private(set) var isGenerating: Bool

  public init(modelRunner: ModelRunner, history: [ChatMessage])

  public func registerFunction(_ function: LeapFunction)
  public func exportToJSON() throws -> [[String: Any]]

  public func generateResponse(
    userTextMessage: String,
    generationOptions: GenerationOptions? = nil
  ) -> AsyncThrowingStream<MessageResponse, Error>

  public func generateResponse(
    message: ChatMessage,
    generationOptions: GenerationOptions? = nil
  ) -> AsyncThrowingStream<MessageResponse, Error>

  @discardableResult
  public func generateResponse(
    message: ChatMessage,
    generationOptions: GenerationOptions? = nil,
    onResponse: @escaping (MessageResponse) -> Void
  ) -> GenerationHandler?
}

Properties

  • history: Copy of the accumulated chat messages. The SDK appends the assistant reply when a generation finishes successfully.
  • functions: Functions registered via registerFunction(_:) for function calling.
  • isGenerating: Boolean flag indicating whether a generation is currently running. Attempts to start a new generation while this is true finish immediately with an empty stream (or a nil handler for the callback variant); see the guard sketch below.
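
For example, a quick check keeps UI actions from starting overlapping requests (a minimal sketch):
guard !conversation.isGenerating else {
  // Another generation is already in flight; starting a new one would finish
  // immediately with an empty stream (or a nil handler for the callback variant).
  return
}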

Streaming Convenience

The most common pattern is to use the async-stream helpers:
let user = ChatMessage(role: .user, content: [.text("Hello! What can you do?")])

Task {
  do {
    for try await response in conversation.generateResponse(
      message: user,
      generationOptions: GenerationOptions(temperature: 0.7)
    ) {
      switch response {
      case .chunk(let delta):
        print(delta, terminator: "")
      case .reasoningChunk(let thought):
        print("Reasoning:", thought)
      case .functionCall(let calls):
        handleFunctionCalls(calls)
      case .audioSample(let samples, let sampleRate):
        audioRenderer.enqueue(samples, sampleRate: sampleRate)
      case .complete(let completion):
        let text = completion.message.content.compactMap { item in
          if case .text(let value) = item { return value }
          return nil
        }.joined()
        print("\nComplete:", text)
        if let stats = completion.stats {
          print("Prompt tokens: \(stats.promptTokens), completion tokens: \(stats.completionTokens)")
        }
      }
    }
  } catch {
    print("Generation failed: \(error)")
  }
}
Cancelling the task that iterates the stream stops generation and cleans up native resources.
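For example, keep a reference to the surrounding Task and cancel it when the user navigates away (a sketch; generationTask is a name introduced here for illustration):
let generationTask = Task {
  for try await response in conversation.generateResponse(message: user) {
    updateUI(with: response)
  }
}

// Later, e.g. when the view disappears:
generationTask.cancel()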

Callback Convenience

Use generateResponse(message:onResponse:) when you prefer callbacks or need to integrate with imperative UI components:
let handler = conversation.generateResponse(message: user) { response in
  updateUI(with: response)
}

// Later
handler?.stop()
If a generation is already running, the method returns nil and emits a .complete message with finishReason == .stop via the callback.
The callback overload does not surface generation errors. Use the async-stream helper or call ModelRunner.generateResponse with onErrorCallback when you need error handling.

Export Chat History

exportToJSON() serializes the conversation history into a [[String: Any]] payload that mirrors OpenAI’s chat-completions format. This is useful for persistence, analytics, or debugging tools.
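For example, the exported payload can be serialized with JSONSerialization and written to disk (a sketch; the file location is illustrative):
import Foundation

let payload = try conversation.exportToJSON()
let data = try JSONSerialization.data(withJSONObject: payload, options: [.prettyPrinted])
let url = FileManager.default.temporaryDirectory.appendingPathComponent("chat-history.json")
try data.write(to: url)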

MessageResponse

public enum MessageResponse {
  case chunk(String)
  case reasoningChunk(String)
  case audioSample(samples: [Float], sampleRate: Int)
  case functionCall([LeapFunctionCall])
  case complete(MessageCompletion)
}

public struct MessageCompletion {
  public let message: ChatMessage
  public let finishReason: GenerationFinishReason
  public let stats: GenerationStats?

  public var info: GenerationCompleteInfo { get }
}

public struct GenerationCompleteInfo {
  public let finishReason: GenerationFinishReason
  public let stats: GenerationStats?
}

public struct GenerationStats {
  public var promptTokens: UInt64
  public var completionTokens: UInt64
  public var totalTokens: UInt64
  public var tokenPerSecond: Float
}
  • chunk: Partial assistant text emitted during streaming.
  • reasoningChunk: Model reasoning tokens wrapped between <think> / </think> (only for models that expose reasoning traces).
  • audioSample: PCM audio frames streamed from audio-capable checkpoints. Feed them into an audio renderer or buffer for later playback.
  • functionCall: One or more function/tool invocations requested by the model. See the Function Calling guide.
  • complete: Signals the end of generation. Access the assembled assistant reply through completion.message. Stats and finish reason live on the completion object; completion.info is provided for backward compatibility.
Errors surfaced during streaming are delivered through the thrown error of AsyncThrowingStream, or via the onErrorCallback closure when using the lower-level API.

GenerationOptions

Tune generation behavior with GenerationOptions.
public struct GenerationOptions {
  public var temperature: Float?
  public var topP: Float?
  public var minP: Float?
  public var repetitionPenalty: Float?
  public var jsonSchemaConstraint: String?
  public var functionCallParser: LeapFunctionCallParserProtocol?

  public init(
    temperature: Float? = nil,
    topP: Float? = nil,
    minP: Float? = nil,
    repetitionPenalty: Float? = nil,
    jsonSchemaConstraint: String? = nil,
    functionCallParser: LeapFunctionCallParserProtocol? = LFMFunctionCallParser()
  )
}
  • Leave a field as nil to fall back to the defaults packaged with the model bundle.
  • functionCallParser controls how tool-call tokens are parsed. LFMFunctionCallParser (the default) handles Liquid Foundation Model Pythonic function calling. Supply HermesFunctionCallParser() for Hermes/Qwen3 formats, or set the parser to nil to receive raw tool-call text in MessageResponse.chunk; see the sketch after the structured-output example below.
  • jsonSchemaConstraint activates constrained generation. Use setResponseFormat(type:) to populate it from a type annotated with the @Generatable macro.
extension GenerationOptions {
  public mutating func setResponseFormat<T: GeneratableType>(type: T.Type) throws {
    self.jsonSchemaConstraint = try JSONSchemaGenerator.getJSONSchema(for: type)
  }
}
var options = GenerationOptions(temperature: 0.6, topP: 0.9)
try options.setResponseFormat(type: CityFact.self)

for try await response in conversation.generateResponse(
  message: user,
  generationOptions: options
) {
  // Handle structured output
}
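To change how tool calls are parsed, pass a different parser to the initializer (a sketch assuming a Hermes/Qwen3-format checkpoint):
// Hermes/Qwen3-style tool-call parsing instead of the default LFM parser.
let hermesOptions = GenerationOptions(functionCallParser: HermesFunctionCallParser())

// Disable parsing entirely to receive raw tool-call text as .chunk responses.
let rawOptions = GenerationOptions(functionCallParser: nil)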
LiquidInferenceEngineRunner exposes advanced utilities such as getPromptTokensSize(messages:addBosToken:) for applications that need to budget tokens ahead of time. These methods are backend-specific and may be elevated to the ModelRunner protocol in a future release.
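A hedged sketch of a pre-flight token check follows; the downcast, the return type, and the non-throwing synchronous call are assumptions here, so consult the backend's headers for the exact signature:
// Assumption: the concrete runner can be downcast and the call returns a token count.
if let engine = runner as? LiquidInferenceEngineRunner {
  let promptTokens = engine.getPromptTokensSize(messages: conversation.history, addBosToken: true)
  print("Prompt would consume \(promptTokens) tokens")
}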