Skip to content

Conversation

@amhsirak
Copy link
Member

@amhsirak amhsirak commented Nov 30, 2025

Summary by CodeRabbit

  • Refactor

    • Removed the public integration endpoint and its server registration.
    • Stopped reinitializing Socket.IO on server start.
  • Improvements

    • Page-to-Markdown/HTML now requires and uses an existing browser page with improved error handling and logging.
    • Ad-blocking setup is more resilient and falls back to connection-based handling if initialization fails.
    • UI highlighting refined to avoid showing highlights during certain pagination states and to clear pagination highlights when capture ends.
  • Behavior Change

    • Stealth browser plugin no longer applied for certain requests.
    • Container builds simplified and Chromium/runtime setup reduced.
    • Browser service exposes configurable WebSocket host and Playwright path via environment variables.
  • Chores

    • Project version bumped.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link

coderabbitai bot commented Nov 30, 2025

Walkthrough

Removed the integration route and its registration; markdown conversion functions now require a Playwright Page; RemoteBrowser adblock calls loosen typing and fall back to CDP on failure; removed Chromium stealth plugin; backend Dockerfile simplified and browser Dockerfile uses npm install; package versions bumped.

Changes

Cohort / File(s) Change Summary
Route export & wiring
server/src/routes/index.ts, server/src/server.ts
Removed integration import/export and app.use('/integration', ...); removed Socket.IO reinitialization.
Markdownify — require Page instance
server/src/markdownify/scrape.ts
convertPageToMarkdown / convertPageToHTML now accept page: Page; removed internal browser/page lifecycle; added try/catch, DOM sanitization, and gotoWithFallback usage.
RemoteBrowser — adblock & CDP
server/src/browser-management/classes/RemoteBrowser.ts
Cast adblock calls to any for currentPage; on adblock init failure, fall back to creating a CDP session so CDP-based operations continue.
Stealth plugin removal
server/src/routes/storage.ts
Removed chromium.use(stealthPlugin()) invocation.
Playwright typing
server/src/api/record.ts, server/src/markdownify/scrape.ts
Switched Page import to playwright-core and updated function signatures to accept Page.
Browser service / Docker
browser/Dockerfile, Dockerfile.backend, docker-compose.yml, browser/server.ts
browser/Dockerfile: use npm install instead of npm ci. Dockerfile.backend: base changed to node:20-slim and Chromium/apt blocks removed. docker-compose.yml: updated dockerfile refs and added BROWSER_WS_HOST / PLAYWRIGHT_BROWSERS_PATH env vars. browser/server.ts: wsEndpoint host substitution using BROWSER_WS_HOST.
UI highlighting & pagination guards
src/components/browser/BrowserWindow.tsx, src/components/recorder/DOMBrowserRenderer.tsx, src/components/recorder/RightSidePanel.tsx
Tightened highlight rendering guards to consider paginationSelector and paginationMode; added effect and cleanup to clear highlights when pagination mode or auto-detected pagination changes; added try/catch for iframe DOM modifications.
Version bumps
maxun-core/package.json, package.json
Bumped maxun-core 0.0.27 → 0.0.28; root package version and dependency updated.
Other small edits
server/src/routes/storage.ts, server/src/api/record.ts, browser/server.ts
Minor import/type adjustments and environment-driven wsEndpoint substitution.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25–35 minutes

  • Check all call sites updated to pass a Playwright Page into convertPageToMarkdown / convertPageToHTML.
  • Inspect RemoteBrowser adblock → CDP fallback for lifecycle and error-safety.
  • Confirm removal of integration route has no leftover references or tests.
  • Verify Dockerfile.backend base change and removed Chromium deps do not break CI or runtime.
  • Validate browser ws endpoint substitution and new env vars are consistent with deployment configs.

Possibly related PRs

Suggested labels

Type: Chore, Type: New Release, Scope: Setup | Installation

Poem

🐇 I hopped through routes and pages at dawn,

Cast a page as any, then fetched CDP on,
Stealth cloak dropped, Docker trimmed its hat,
Versions bumped — I left a tiny pat,
A carrot commit, then off to my lawn.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main purpose of this pull request—a pre-release version bump to v0.0.28, which is evident from the version updates in package.json files and maxun-core/package.json across the changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pre-release-28

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
server/src/markdownify/scrape.ts (2)

5-5: Use Page type instead of any for type safety.

Since Page is already imported, update the parameter type for consistency and better type checking.

-async function gotoWithFallback(page: any, url: string) {
+async function gotoWithFallback(page: Page, url: string) {

80-125: Consider extracting duplicated sanitization logic.

The page.evaluate block (lines 86-118) is identical to lines 32-64 in convertPageToMarkdown. Extract this to a shared helper for maintainability.

+async function sanitizePageContent(page: Page): Promise<string> {
+  return page.evaluate(() => {
+    const selectors = [
+      "script", "style", "link[rel='stylesheet']", "noscript", "meta",
+      "svg", "img", "picture", "source", "video", "audio", "iframe", "object", "embed"
+    ];
+    selectors.forEach(sel => document.querySelectorAll(sel).forEach(e => e.remove()));
+    document.querySelectorAll("*").forEach(el => {
+      [...el.attributes].forEach(attr => {
+        if (attr.name.startsWith("on")) el.removeAttribute(attr.name);
+      });
+    });
+    return document.documentElement.outerHTML;
+  });
+}

 export async function convertPageToMarkdown(url: string, page: Page): Promise<string> {
   try {
     logger.log('info', `[Scrape] Using existing page instance for markdown conversion of ${url}`);
     await gotoWithFallback(page, url);
-    const cleanedHtml = await page.evaluate(() => { /* ... duplicated code ... */ });
+    const cleanedHtml = await sanitizePageContent(page);
     const markdown = await parseMarkdown(cleanedHtml, url);
     return markdown;
   } catch (error: any) { /* ... */ }
 }

 export async function convertPageToHTML(url: string, page: Page): Promise<string> {
   try {
     logger.log('info', `[Scrape] Using existing page instance for HTML conversion of ${url}`);
     await gotoWithFallback(page, url);
-    const cleanedHtml = await page.evaluate(() => { /* ... duplicated code ... */ });
+    const cleanedHtml = await sanitizePageContent(page);
     return cleanedHtml;
   } catch (error: any) { /* ... */ }
 }
server/src/browser-management/classes/RemoteBrowser.ts (1)

551-561: Adblock/CDP init: add a small safety guard and confirm enable/disable intent

The fallback to always establish a CDP session even if adblocker init fails looks good. Two small nits here:

  • this.currentPage is typed as Page | null | undefined, but you’re casting it to any three times. A simple null check + local page variable would keep this type‑safe and clearer.
  • You enable blocking and then immediately disable it; that means you won’t actually have adblocking on for subsequent browsing. If the goal is just a one‑off “warm up”, that’s fine, but if the intent was to keep ads blocked, this behavior might be surprising.

You can keep current behavior but tighten things up like this:

-                await blocker.enableBlockingInPage(this.currentPage as any);
-                this.client = await this.currentPage.context().newCDPSession(this.currentPage);
-                await blocker.disableBlockingInPage(this.currentPage as any);
+                if (!this.currentPage) {
+                  throw new Error('No page available for adblocker initialization');
+                }
+
+                const page = this.currentPage;
+                await blocker.enableBlockingInPage(page as any);
+                this.client = await page.context().newCDPSession(page);
+                await blocker.disableBlockingInPage(page as any);

If you do want adblocking to remain active for the lifetime of the page, consider dropping the disableBlockingInPage call (or moving it to a more explicit teardown path instead of right after initialization).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 459218a and eb512d8.

📒 Files selected for processing (4)
  • server/src/browser-management/classes/RemoteBrowser.ts (1 hunks)
  • server/src/markdownify/scrape.ts (2 hunks)
  • server/src/routes/storage.ts (0 hunks)
  • server/src/server.ts (1 hunks)
💤 Files with no reviewable changes (1)
  • server/src/routes/storage.ts
🧰 Additional context used
🧬 Code graph analysis (1)
server/src/markdownify/scrape.ts (1)
server/src/markdownify/markdown.ts (1)
  • parseMarkdown (1-98)
🔇 Additional comments (3)
server/src/markdownify/scrape.ts (2)

1-1: LGTM! Proper type import for the new function signature.

The Page type import from playwright-core correctly types the new parameter.


26-72: LGTM! Good refactor to accept external Page instance.

The change to accept an existing Page instance instead of managing browser lifecycle internally is a solid improvement for resource management. Error handling with logging and re-throwing is appropriate.

server/src/server.ts (1)

8-8: Integration route cleanly removed from server.

The import statement on line 8 and routes exports have been properly updated. The integration route no longer exists in the codebase, and there are no dangling references to the route handler.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2bbf5d6 and 2f0516f.

📒 Files selected for processing (5)
  • Dockerfile.backend (1 hunks)
  • docker-compose.yml (2 hunks)
  • maxun-core/package.json (1 hunks)
  • package.json (3 hunks)
  • server/src/api/record.ts (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • maxun-core/package.json
  • docker-compose.yml
🔇 Additional comments (2)
package.json (1)

3-3: Version bump looks good.

Consistent pre-release patch bump: root version and maxun-core dependency both updated to 0.0.28. No breaking dependency changes introduced.

Also applies to: 55-55

server/src/api/record.ts (1)

14-14: LGTM. The change from playwright to playwright-core is appropriate for this type-only import. The Page type definition remains compatible, and playwright-core 1.57.0 is a documented release version.

Note: playwright-core 1.57.0 includes breaking changes in glob URL patterns, route.continue() cookie handling, and Page#accessibility removal—verify these don't affect other parts of the codebase if they use routing or accessibility APIs.

@@ -1,4 +1,4 @@
FROM --platform=$BUILDPLATFORM mcr.microsoft.com/playwright:v1.46.0-noble
FROM --platform=$BUILDPLATFORM node:20-slim
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for browser launch patterns that would require local Chromium
rg -n "chromium://|launch\(|launchPersistentContext" --type=ts --type=js -g '!node_modules' -g '!dist' | head -50

Repository: getmaxun/maxun

Length of output: 159


🏁 Script executed:

# Search for CDP and remote browser usage patterns
rg -n "connectOverCDP|RemoteBrowser|remote.*browser" --type=ts --type=js -g '!node_modules' -g '!dist' -C 2 | head -100

Repository: getmaxun/maxun

Length of output: 7400


🏁 Script executed:

# Check the Dockerfile.backend for any system dependency installation
cat Dockerfile.backend

Repository: getmaxun/maxun

Length of output: 630


🏁 Script executed:

# Look for docker-compose or orchestration files that might provide browser services
fd -e docker-compose.yml -e docker-compose.yaml -o compose.yml -o compose.yaml | head -20

Repository: getmaxun/maxun

Length of output: 199


🏁 Script executed:

# Get context around the chromium.launch() call
cat -n server/src/browser-management/browserConnection.ts | sed -n '40,80p'

Repository: getmaxun/maxun

Length of output: 1873


🏁 Script executed:

# Check the full browserConnection.ts file to understand the context
wc -l server/src/browser-management/browserConnection.ts

Repository: getmaxun/maxun

Length of output: 112


🏁 Script executed:

# Look at the full file if it's not too large
cat -n server/src/browser-management/browserConnection.ts

Repository: getmaxun/maxun

Length of output: 6979


🏁 Script executed:

# Search for error handling or fallback patterns around this launch call
rg -n "chromium\.launch|catch|fallback|CDP" server/src/browser-management/browserConnection.ts -C 5

Repository: getmaxun/maxun

Length of output: 3388


🏁 Script executed:

# Check the RemoteBrowser class/interface definition
rg -n "class RemoteBrowser|interface RemoteBrowser" --type=ts -C 10

Repository: getmaxun/maxun

Length of output: 3148


Remove local Chromium fallback or add Chromium binaries to Dockerfile.backend.

Switching from mcr.microsoft.com/playwright:v1.46.0-noble to node:20-slim creates a critical runtime failure. The code in server/src/browser-management/browserConnection.ts implements a fallback mechanism: it attempts to connect to a remote browser service first, but if that fails, it calls launchLocalBrowser() (line 128), which requires Chromium binaries and system libraries not present in the minimal base image.

When the remote browser service is unavailable, chromium.launch() will fail because:

  • Playwright binaries are not installed (requires npx playwright install chromium)
  • System dependencies are missing (libx11-6, libxdamage1, etc.)

Either install Chromium in the Dockerfile with RUN npx playwright@1.57.0 install chromium (and system packages), or remove the fallback and enforce remote browser service as mandatory.

🤖 Prompt for AI Agents
In Dockerfile.backend around line 1, the base image node:20-slim lacks
Playwright Chromium binaries and required system libraries so the local browser
fallback in server/src/browser-management/browserConnection.ts
(launchLocalBrowser at line 128) will fail; either (A) modify this Dockerfile to
install the Playwright Chromium browser and the system dependencies (add apt
packages required by Chromium and run the Playwright install for the specific
Playwright version used), or (B) remove/disable the local fallback in
browserConnection.ts and make the remote browser service mandatory (fail fast
with a clear error if connection to the remote browser cannot be established).
Ensure the chosen approach is consistent with CI/deployment and document the
change in the Dockerfile comments.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
browser/server.ts (1)

59-60: Consider using replaceAll() for robustness.

The implementation correctly customizes the WebSocket host for clients. However, .replace('localhost', BROWSER_WS_HOST) only replaces the first occurrence. While WebSocket endpoints typically contain a single 'localhost', using .replaceAll() would be more robust.

Additionally, the health endpoint (line 52) returns the original wsEndpoint while the root endpoint replaces the host. This appears intentional (health for internal monitoring, root for external clients), but please confirm this design choice.

Apply this diff for a more robust replacement:

-                const wsEndpoint = browserServer?.wsEndpoint().replace('localhost', BROWSER_WS_HOST) || '';
+                const wsEndpoint = browserServer?.wsEndpoint().replaceAll('localhost', BROWSER_WS_HOST) || '';
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b2fcf02 and bbfa78e.

📒 Files selected for processing (2)
  • browser/server.ts (2 hunks)
  • docker-compose.yml (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docker-compose.yml
🔇 Additional comments (2)
browser/server.ts (2)

14-14: LGTM: Configuration addition follows existing patterns.

The new BROWSER_WS_HOST configuration is consistent with the existing port configurations and provides sensible defaults.


48-56: LGTM: Clean refactoring for clarity.

Extracting the wsEndpoint to a variable improves code clarity and follows good practices.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
src/components/recorder/RightSidePanel.tsx (1)

418-456: Highlight cleanup behavior looks good; consider de-duplicating evaluateSelector for consistency

The new blocks that clear auto-detected pagination highlights on completion and discard are functionally solid: they safely guard on autoDetectedPagination?.selector and iframeElement?.contentDocument, and they reset the same inline styles (outline, outlineOffset, zIndex) that are set during highlighting. Including autoDetectedPagination in the callback deps ensures the latest selector is used.

However, the evaluateSelector helper is now duplicated here and in other places in this file (including a slightly more sophisticated variant that also handles comma-separated selectors with fallback). This duplication increases maintenance cost and makes it easy for highlight and cleanup behavior to diverge subtly if one copy is updated and others are not.

It would be cleaner to extract a single shared helper (either at module scope or as a small utility imported from clientPaginationDetector/a local helper file) that encapsulates the XPath vs. CSS selection logic and any fallback behavior, and then reuse it in:

  • Pagination-type-change cleanup,
  • Auto-detect highlighting,
  • “Choose Different Element” handling,
  • These new completion/discard cleanup blocks.

That would keep highlight and un-highlight logic symmetric and easier to evolve.

Also applies to: 789-827

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bbfa78e and 068a28a.

📒 Files selected for processing (2)
  • browser/server.ts (2 hunks)
  • src/components/recorder/RightSidePanel.tsx (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • browser/server.ts
🧰 Additional context used
🧬 Code graph analysis (1)
src/components/recorder/RightSidePanel.tsx (2)
src/helpers/clientPaginationDetector.ts (1)
  • evaluateSelector (290-318)
maxun-core/src/browserSide/scraper.js (3)
  • selector (28-28)
  • doc (225-225)
  • node (664-664)

@amhsirak amhsirak merged commit 7d0ab5f into develop Nov 30, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants