Commit c80f79f

Merge pull request #17 from tpaulshippy/feature/video-file-support
Add video file support
2 parents 40b5e31 + a97e059 commit c80f79f

17 files changed, +600 -4 lines changed

README.md

Lines changed: 2 additions & 1 deletion
@@ -56,6 +56,7 @@ chat.ask "What's the best way to learn Ruby?"
 ```ruby
 # Analyze any file type
 chat.ask "What's in this image?", with: "ruby_conf.jpg"
+chat.ask "What's happening in this video?", with: "video.mp4"
 chat.ask "Describe this meeting", with: "meeting.wav"
 chat.ask "Summarize this document", with: "contract.pdf"
 chat.ask "Explain this code", with: "app.rb"
@@ -115,7 +116,7 @@ response = chat.with_schema(ProductSchema).ask "Analyze this product", with: "pr
 ## Features

 * **Chat:** Conversational AI with `RubyLLM.chat`
-* **Vision:** Analyze images and screenshots
+* **Vision:** Analyze images and videos
 * **Audio:** Transcribe and understand speech
 * **Documents:** Extract from PDFs, CSVs, JSON, any file type
 * **Image generation:** Create images with `RubyLLM.paint`

docs/_advanced/models.md

Lines changed: 2 additions & 2 deletions
@@ -42,7 +42,7 @@ The registry stores crucial information about each model, including:
 * **`name`**: A human-friendly name.
 * **`context_window`**: Max input tokens (e.g., `128_000`).
 * **`max_tokens`**: Max output tokens (e.g., `16_384`).
-* **`supports_vision`**: If it can process images.
+* **`supports_vision`**: If it can process images and videos.
 * **`supports_functions`**: If it can use [Tools]({% link _core_features/tools.md %}).
 * **`input_price_per_million`**: Cost in USD per 1 million input tokens.
 * **`output_price_per_million`**: Cost in USD per 1 million output tokens.
@@ -323,4 +323,4 @@ image = RubyLLM.paint(
 * **Your Responsibility:** Ensure the model ID is correct for the target endpoint.
 * **Warning Log:** A warning is logged indicating validation was skipped.

-Use these features when the standard registry doesn't cover your specific model or endpoint needs. For standard models, rely on the registry for validation and capability awareness. See the [Chat Guide]({% link _core_features/chat.md %}) for more on using the `chat` object.
+Use these features when the standard registry doesn't cover your specific model or endpoint needs. For standard models, rely on the registry for validation and capability awareness. See the [Chat Guide]({% link _core_features/chat.md %}) for more on using the `chat` object.

docs/_core_features/chat.md

Lines changed: 26 additions & 0 deletions
@@ -148,6 +148,31 @@ response = chat.ask "Compare the user interfaces in these two screenshots.", wit
 puts response.content
 ```

+### Working with Videos
+
+You can also analyze video files or URLs with vision-capable models. RubyLLM will automatically detect video files and handle them appropriately.
+
+```ruby
+# Ask about a local video file
+chat = RubyLLM.chat(model: 'gemini-2.5-flash')
+response = chat.ask "What happens in this video?", with: "path/to/demo.mp4"
+puts response.content
+
+# Ask about a video from a URL
+response = chat.ask "Summarize the main events in this video.", with: "https://example.com/demo_video.mp4"
+puts response.content
+
+# Combine videos with other file types
+response = chat.ask "Analyze these files for visual content.", with: ["diagram.png", "demo.mp4", "notes.txt"]
+puts response.content
+```
+
+Notes:
+
+Supported video formats include .mp4, .mov, .avi, .webm, and others (provider-dependent).
+Only Google Gemini models currently support video input; check the [Available Models Guide]({% link guides/available-models.md %}) for details.
+Large video files may be subject to size or duration limits imposed by the provider.
+
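The note that only some models accept video suggests checking a model's input modalities before attaching one. The sketch below is not RubyLLM's API: `Modalities`, `ModelInfo`, `MODELS`, and `ensure_video_support!` are hypothetical stand-ins, and the model IDs are made up. Only the body of `supports_video?` mirrors the predicate this commit adds in `lib/ruby_llm/model/info.rb`.

```ruby
# Hypothetical stand-ins for registry objects (assumption, not the gem's classes).
Modalities = Struct.new(:input, :output)

ModelInfo = Struct.new(:id, :modalities) do
  # Same check as the supports_video? method added in this commit.
  def supports_video?
    modalities.input.include?('video')
  end
end

# Made-up entries for illustration; real modalities come from the model registry.
MODELS = [
  ModelInfo.new('example-video-model', Modalities.new(%w[text image video], %w[text])),
  ModelInfo.new('example-text-model', Modalities.new(%w[text], %w[text]))
].freeze

# Hypothetical guard: fail early instead of sending a video to a model
# that does not list 'video' among its input modalities.
def ensure_video_support!(model)
  return model if model.supports_video?

  raise ArgumentError, "#{model.id} does not list 'video' as an input modality"
end

video_model, text_model = MODELS
ensure_video_support!(video_model)

begin
  ensure_video_support!(text_model)
rescue ArgumentError => e
  puts e.message
end
```

A guard like this keeps the failure local and descriptive, rather than surfacing as a provider-side error after the upload.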
 RubyLLM automatically handles image encoding and formatting for each provider's API. Local images are read and encoded as needed, while URLs are passed directly when supported by the provider.

 ### Image Generation with Chat
@@ -258,6 +283,7 @@ response = chat.ask "What's in this image?", with: { image: "photo.jpg" }

 **Supported file types:**
 - **Images:** .jpg, .jpeg, .png, .gif, .webp, .bmp
+- **Videos:** .mp4, .mov, .avi, .webm
 - **Audio:** .mp3, .wav, .m4a, .ogg, .flac
 - **Documents:** .pdf, .txt, .md, .csv, .json, .xml
 - **Code:** .rb, .py, .js, .html, .css (and many others)
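
The list above pairs file extensions with attachment categories. As a rough sketch of how a MIME-prefix check like the one in this commit's `MimeType.video?` and `Attachment#type` changes could classify files, the example below uses a hand-written `MIME_BY_EXTENSION` table and an `attachment_type` helper. Both are hypothetical: the way the gem resolves MIME types, the PDF and text checks, and the `:unknown` fallback are assumptions not shown in the diff.

```ruby
# Hypothetical extension-to-MIME table (assumption); the gem resolves
# MIME types its own way, which this diff does not show.
MIME_BY_EXTENSION = {
  '.jpg' => 'image/jpeg',
  '.png' => 'image/png',
  '.mp4' => 'video/mp4',
  '.mov' => 'video/quicktime',
  '.avi' => 'video/x-msvideo',
  '.webm' => 'video/webm',
  '.mp3' => 'audio/mpeg',
  '.wav' => 'audio/wav',
  '.pdf' => 'application/pdf',
  '.txt' => 'text/plain'
}.freeze

# Classify a path by MIME prefix, checking video right after image,
# in the same order as the type method in the diff.
def attachment_type(path)
  mime = MIME_BY_EXTENSION.fetch(File.extname(path).downcase, 'application/octet-stream')

  return :image if mime.start_with?('image/')
  return :video if mime.start_with?('video/')
  return :audio if mime.start_with?('audio/')
  return :pdf if mime == 'application/pdf' # assumed PDF check
  return :text if mime.start_with?('text/') # assumed text check

  :unknown # assumed fallback for unrecognized types
end

%w[diagram.png demo.mp4 clip.MOV notes.txt archive.bin].each do |path|
  puts "#{path}: #{attachment_type(path)}"
end
```

Matching on the `video/` prefix rather than on a fixed extension list means any container format with a `video/*` MIME type is treated as video, which fits the "provider-dependent" wording in the notes.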

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -102,6 +102,7 @@ chat.ask "What's the best way to learn Ruby?"
 ```ruby
 # Analyze any file type
 chat.ask "What's in this image?", with: "ruby_conf.jpg"
+chat.ask "What's happening in this video?", with: "video.mp4"
 chat.ask "Describe this meeting", with: "meeting.wav"
 chat.ask "Summarize this document", with: "contract.pdf"
 chat.ask "Explain this code", with: "app.rb"

lib/ruby_llm/attachment.rb

Lines changed: 5 additions & 0 deletions
@@ -78,6 +78,7 @@ def for_llm

 def type
   return :image if image?
+  return :video if video?
   return :audio if audio?
   return :pdf if pdf?
   return :text if text?
@@ -89,6 +90,10 @@ def image?
   RubyLLM::MimeType.image? mime_type
 end

+def video?
+  RubyLLM::MimeType.video? mime_type
+end
+
 def audio?
   RubyLLM::MimeType.audio? mime_type
 end

lib/ruby_llm/mime_type.rb

Lines changed: 4 additions & 0 deletions
@@ -15,6 +15,10 @@ def image?(type)
   type.start_with?('image/')
 end

+def video?(type)
+  type.start_with?('video/')
+end
+
 def audio?(type)
   type.start_with?('audio/')
 end

lib/ruby_llm/model/info.rb

Lines changed: 4 additions & 0 deletions
@@ -56,6 +56,10 @@ def supports_vision?
   modalities.input.include?('image')
 end

+def supports_video?
+  modalities.input.include?('video')
+end
+
 def supports_functions?
   function_calling?
 end

lib/ruby_llm/providers/gemini/capabilities.rb

Lines changed: 5 additions & 0 deletions
@@ -52,6 +52,10 @@ def supports_vision?(model_id)
   model_id.match?(/gemini|flash|pro|imagen/)
 end

+def supports_video?(model_id)
+  model_id.match?(/gemini/)
+end
+
 def supports_functions?(model_id)
   return false if model_id.match?(/text-embedding|embedding-001|aqa|flash-lite|imagen|gemini-2\.0-flash-lite/)

@@ -217,6 +221,7 @@ def modalities_for(model_id)
   modalities[:input] << 'pdf'
 end

+modalities[:input] << 'video' if supports_video?(model_id)
 modalities[:input] << 'audio' if model_id.match?(/audio/)
 modalities[:output] << 'embeddings' if model_id.match?(/embedding|gemini-embedding/)

spec/fixtures/ruby.mp4

561 KB
Binary file not shown.

spec/fixtures/vcr_cassettes/chat_video_models_gemini_gemini-2_0-flash_can_understand_local_videos.yml

Lines changed: 91 additions & 0 deletions
Large diffs are not rendered by default.
