Merge pull request #17 from tpaulshippy/feature/video-file-support

tpaulshippy · web-flow · commit c80f79fc36c4 · 2025-09-12T09:30:24.000-07:00
Add video file support
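
The core of the change is a MIME-type prefix check that runs between the image and audio checks when an attachment's type is resolved. A minimal standalone sketch of that ordering follows. `SketchMimeType`, `SketchAttachment`, and `attachment_type` are hypothetical stand-ins rather than the real `RubyLLM::MimeType` and `RubyLLM::Attachment`, and the `:unknown` fallback is a placeholder for the checks (pdf, text, and so on) that the actual method continues with.

```ruby
# Hypothetical stand-in for RubyLLM::MimeType; only the prefix checks
# shown in this diff are reproduced.
module SketchMimeType
  module_function

  def image?(type)
    type.start_with?('image/')
  end

  def video?(type)
    type.start_with?('video/')
  end

  def audio?(type)
    type.start_with?('audio/')
  end
end

# Hypothetical stand-in for RubyLLM::Attachment#type.
SketchAttachment = Struct.new(:mime_type) do
  def type
    # Same order as the diff: image, then video, then audio.
    return :image if SketchMimeType.image?(mime_type)
    return :video if SketchMimeType.video?(mime_type)
    return :audio if SketchMimeType.audio?(mime_type)

    # Placeholder: the real method goes on to check pdf, text, etc.
    :unknown
  end
end

# Convenience helper for the sketch (not part of RubyLLM).
def attachment_type(mime_type)
  SketchAttachment.new(mime_type).type
end

puts attachment_type('video/mp4')
puts attachment_type('image/png')
```

Because the check is a plain prefix match, any `video/*` MIME type is classified as `:video`, regardless of whether a given provider accepts that container format.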
diff --git a/README.md b/README.md
@@ -56,6 +56,7 @@ chat.ask "What's the best way to learn Ruby?"
 ```ruby
 # Analyze any file type
 chat.ask "What's in this image?", with: "ruby_conf.jpg"
+chat.ask "What's happening in this video?", with: "video.mp4"
 chat.ask "Describe this meeting", with: "meeting.wav"
 chat.ask "Summarize this document", with: "contract.pdf"
 chat.ask "Explain this code", with: "app.rb"
@@ -115,7 +116,7 @@ response = chat.with_schema(ProductSchema).ask "Analyze this product", with: "pr
 ## Features
 
 * **Chat:** Conversational AI with `RubyLLM.chat`
-* **Vision:** Analyze images and screenshots
+* **Vision:** Analyze images and videos
 * **Audio:** Transcribe and understand speech
 * **Documents:** Extract from PDFs, CSVs, JSON, any file type
 * **Image generation:** Create images with `RubyLLM.paint`
diff --git a/docs/_advanced/models.md b/docs/_advanced/models.md
@@ -42,7 +42,7 @@ The registry stores crucial information about each model, including:
 *   **`name`**: A human-friendly name.
 *   **`context_window`**: Max input tokens (e.g., `128_000`).
 *   **`max_tokens`**: Max output tokens (e.g., `16_384`).
-*   **`supports_vision`**: If it can process images.
+*   **`supports_vision`**: If it can process images and videos.
 *   **`supports_functions`**: If it can use [Tools]({% link _core_features/tools.md %}).
 *   **`input_price_per_million`**: Cost in USD per 1 million input tokens.
 *   **`output_price_per_million`**: Cost in USD per 1 million output tokens.
@@ -323,4 +323,4 @@ image = RubyLLM.paint(
 *   **Your Responsibility:** Ensure the model ID is correct for the target endpoint.
 *   **Warning Log:** A warning is logged indicating validation was skipped.
 
-Use these features when the standard registry doesn't cover your specific model or endpoint needs. For standard models, rely on the registry for validation and capability awareness. See the [Chat Guide]({% link _core_features/chat.md %}) for more on using the `chat` object.
+Use these features when the standard registry doesn't cover your specific model or endpoint needs. For standard models, rely on the registry for validation and capability awareness. See the [Chat Guide]({% link _core_features/chat.md %}) for more on using the `chat` object.
diff --git a/docs/_core_features/chat.md b/docs/_core_features/chat.md
@@ -148,6 +148,31 @@ response = chat.ask "Compare the user interfaces in these two screenshots.", wit
 puts response.content
 ```
 
+### Working with Videos
+
+You can also analyze video files or URLs with models that accept video input. RubyLLM detects video attachments automatically by their MIME type (`video/*`).
+
+```ruby
+# Ask about a local video file
+chat = RubyLLM.chat(model: 'gemini-2.5-flash')
+response = chat.ask "What happens in this video?", with: "path/to/demo.mp4"
+puts response.content
+
+# Ask about a video from a URL
+response = chat.ask "Summarize the main events in this video.", with: "https://example.com/demo_video.mp4"
+puts response.content
+
+# Combine videos with other file types
+response = chat.ask "Analyze these files for visual content.", with: ["diagram.png", "demo.mp4", "notes.txt"]
+puts response.content
+```
+
+Notes:
+
+- Supported video formats include .mp4, .mov, .avi, .webm, and others (provider-dependent).
+- Only Google Gemini models currently support video input; check the [Available Models Guide]({% link guides/available-models.md %}) for details.
+- Large video files may be subject to size or duration limits imposed by the provider.
+
 RubyLLM automatically handles image encoding and formatting for each provider's API. Local images are read and encoded as needed, while URLs are passed directly when supported by the provider.
 
 ### Image Generation with Chat
@@ -258,6 +283,7 @@ response = chat.ask "What's in this image?", with: { image: "photo.jpg" }
 
 **Supported file types:**
 - **Images:** .jpg, .jpeg, .png, .gif, .webp, .bmp
+- **Videos:** .mp4, .mov, .avi, .webm
 - **Audio:** .mp3, .wav, .m4a, .ogg, .flac
 - **Documents:** .pdf, .txt, .md, .csv, .json, .xml
 - **Code:** .rb, .py, .js, .html, .css (and many others)
diff --git a/docs/index.md b/docs/index.md
@@ -102,6 +102,7 @@ chat.ask "What's the best way to learn Ruby?"
 ```ruby
 # Analyze any file type
 chat.ask "What's in this image?", with: "ruby_conf.jpg"
+chat.ask "What's happening in this video?", with: "video.mp4"
 chat.ask "Describe this meeting", with: "meeting.wav"
 chat.ask "Summarize this document", with: "contract.pdf"
 chat.ask "Explain this code", with: "app.rb"
diff --git a/lib/ruby_llm/attachment.rb b/lib/ruby_llm/attachment.rb
@@ -78,6 +78,7 @@ def for_llm
 
     def type
       return :image if image?
+      return :video if video?
       return :audio if audio?
       return :pdf if pdf?
       return :text if text?
@@ -89,6 +90,10 @@ def image?
       RubyLLM::MimeType.image? mime_type
     end
 
+    def video?
+      RubyLLM::MimeType.video? mime_type
+    end
+
     def audio?
       RubyLLM::MimeType.audio? mime_type
     end
diff --git a/lib/ruby_llm/mime_type.rb b/lib/ruby_llm/mime_type.rb
@@ -15,6 +15,10 @@ def image?(type)
       type.start_with?('image/')
     end
 
+    def video?(type)
+      type.start_with?('video/')
+    end
+
     def audio?(type)
       type.start_with?('audio/')
     end
diff --git a/lib/ruby_llm/model/info.rb b/lib/ruby_llm/model/info.rb
@@ -56,6 +56,10 @@ def supports_vision?
         modalities.input.include?('image')
       end
 
+      def supports_video?
+        modalities.input.include?('video')
+      end
+
       def supports_functions?
         function_calling?
       end
diff --git a/lib/ruby_llm/providers/gemini/capabilities.rb b/lib/ruby_llm/providers/gemini/capabilities.rb
@@ -52,6 +52,10 @@ def supports_vision?(model_id)
           model_id.match?(/gemini|flash|pro|imagen/)
         end
 
+        def supports_video?(model_id)
+          model_id.match?(/gemini/)
+        end
+
         def supports_functions?(model_id)
           return false if model_id.match?(/text-embedding|embedding-001|aqa|flash-lite|imagen|gemini-2\.0-flash-lite/)
 
@@ -217,6 +221,7 @@ def modalities_for(model_id)
             modalities[:input] << 'pdf'
           end
 
+          modalities[:input] << 'video' if supports_video?(model_id)
           modalities[:input] << 'audio' if model_id.match?(/audio/)
           modalities[:output] << 'embeddings' if model_id.match?(/embedding|gemini-embedding/)
 
diff --git a/spec/fixtures/ruby.mp4 b/spec/fixtures/ruby.mp4
diff --git a/spec/fixtures/vcr_cassettes/chat_video_models_gemini_gemini-2_0-flash_can_understand_local_videos.yml b/spec/fixtures/vcr_cassettes/chat_video_models_gemini_gemini-2_0-flash_can_understand_local_videos.yml
diff --git a/spec/fixtures/vcr_cassettes/chat_video_models_gemini_gemini-2_0-flash_can_understand_remote_videos_without_extension.yml b/spec/fixtures/vcr_cassettes/chat_video_models_gemini_gemini-2_0-flash_can_understand_remote_videos_without_extension.yml
diff --git a/spec/fixtures/vcr_cassettes/chat_video_models_gemini_gemini-2_5-flash_can_understand_local_videos.yml b/spec/fixtures/vcr_cassettes/chat_video_models_gemini_gemini-2_5-flash_can_understand_local_videos.yml
diff --git a/spec/fixtures/vcr_cassettes/chat_video_models_gemini_gemini-2_5-flash_can_understand_remote_videos_without_extension.yml b/spec/fixtures/vcr_cassettes/chat_video_models_gemini_gemini-2_5-flash_can_understand_remote_videos_without_extension.yml
diff --git a/spec/ruby_llm/active_record/acts_as_attachment_spec.rb b/spec/ruby_llm/active_record/acts_as_attachment_spec.rb
@@ -89,6 +89,22 @@ def uploaded_file(path, type)
       expect(attachment.type).to eq(:image)
     end
 
+    it 'handles videos' do
+      video_path = File.expand_path('../../fixtures/ruby.mp4', __dir__)
+      chat = Chat.create!(model: model)
+      message = chat.messages.create!(role: 'user', content: 'Video test')
+
+      message.attachments.attach(
+        io: File.open(video_path),
+        filename: 'test.mp4',
+        content_type: 'video/mp4'
+      )
+
+      llm_message = message.to_llm
+      attachment = llm_message.content.attachments.first
+      expect(attachment.type).to eq(:video)
+    end
+
     it 'handles PDFs' do
       chat = Chat.create!(model: model)
       message = chat.messages.create!(role: 'user', content: 'PDF test')
diff --git a/spec/ruby_llm/chat_content_spec.rb b/spec/ruby_llm/chat_content_spec.rb
@@ -6,12 +6,14 @@
   include_context 'with configured RubyLLM'
 
   let(:image_path) { File.expand_path('../fixtures/ruby.png', __dir__) }
+  let(:video_path) { File.expand_path('../fixtures/ruby.mp4', __dir__) }
   let(:audio_path) { File.expand_path('../fixtures/ruby.wav', __dir__) }
   let(:mp3_path) { File.expand_path('../fixtures/ruby.mp3', __dir__) }
   let(:pdf_path) { File.expand_path('../fixtures/sample.pdf', __dir__) }
   let(:text_path) { File.expand_path('../fixtures/ruby.txt', __dir__) }
   let(:xml_path) { File.expand_path('../fixtures/ruby.xml', __dir__) }
   let(:image_url) { 'https://upload.wikimedia.org/wikipedia/commons/f/f1/Ruby_logo.png' }
+  let(:video_url) { 'https://filesamples.com/samples/video/mp4/sample_640x360.mp4' }
   let(:audio_url) { 'https://commons.wikimedia.org/wiki/File:LL-Q1860_(eng)-AcpoKrane-ruby.wav' }
   let(:pdf_url) { 'https://pdfobject.com/pdf/sample.pdf' }
   let(:text_url) { 'https://www.ruby-lang.org/en/about/license.txt' }
@@ -96,6 +98,35 @@
     end
   end
 
+  describe 'video models' do # rubocop:disable RSpec/MultipleMemoizedHelpers
+    VIDEO_MODELS.each do |model_info|
+      provider = model_info[:provider]
+      model = model_info[:model]
+
+      it "#{provider}/#{model} can understand local videos" do
+        chat = RubyLLM.chat(model: model, provider: provider)
+        response = chat.ask('What do you see in this video?', with: { video: video_path })
+
+        expect(response.content).to be_present
+        expect(response.content).not_to include('RubyLLM::Content')
+        expect(chat.messages.first.content).to be_a(RubyLLM::Content)
+        expect(chat.messages.first.content.attachments.first.filename).to eq('ruby.mp4')
+        expect(chat.messages.first.content.attachments.first.mime_type).to eq('video/mp4')
+      end
+
+      it "#{provider}/#{model} can understand remote videos without extension" do
+        chat = RubyLLM.chat(model: model, provider: provider)
+        response = chat.ask('What do you see in this video?', with: video_url)
+
+        expect(response.content).to be_present
+        expect(response.content).not_to include('RubyLLM::Content')
+        expect(chat.messages.first.content).to be_a(RubyLLM::Content)
+        expect(chat.messages.first.content.attachments.first.filename).to eq('sample_640x360.mp4')
+        expect(chat.messages.first.content.attachments.first.mime_type).to eq('video/mp4')
+      end
+    end
+  end
+
   describe 'audio models' do # rubocop:disable RSpec/MultipleMemoizedHelpers
     AUDIO_MODELS.each do |model_info|
       model = model_info[:model]
diff --git a/spec/ruby_llm/models_spec.rb b/spec/ruby_llm/models_spec.rb
@@ -36,11 +36,17 @@
 
       # There should be models from at least OpenAI and Anthropic
       expect(provider_counts.keys).to include('openai', 'anthropic')
+    end
 
-      # Select only models with vision support
+    it 'filters by vision support' do
       vision_models = RubyLLM.models.select(&:supports_vision?)
       expect(vision_models).to all(have_attributes(supports_vision?: true))
     end
+
+    it 'filters by video support' do
+      video_models = RubyLLM.models.select(&:supports_video?)
+      expect(video_models).to all(have_attributes(supports_video?: true))
+    end
   end
 
   describe 'finding models' do
diff --git a/spec/support/models_to_test.rb b/spec/support/models_to_test.rb
@@ -35,6 +35,11 @@
   { provider: :vertexai, model: 'gemini-2.5-flash' }
 ].freeze
 
+VIDEO_MODELS = [
+  { provider: :gemini, model: 'gemini-2.0-flash' },
+  { provider: :gemini, model: 'gemini-2.5-flash' }
+].freeze
+
 AUDIO_MODELS = [
   { provider: :openai, model: 'gpt-4o-mini-audio-preview' },
   { provider: :gemini, model: 'gemini-2.5-flash' }