Merged
Changes from 12 commits
27 changes: 1 addition & 26 deletions Dockerfile.backend
@@ -1,4 +1,4 @@
FROM --platform=$BUILDPLATFORM mcr.microsoft.com/playwright:v1.46.0-noble
FROM --platform=$BUILDPLATFORM node:20-slim

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Search for browser launch patterns that would require local Chromium
rg -n "chromium://|launch\(|launchPersistentContext" --type=ts --type=js -g '!node_modules' -g '!dist' | head -50

Repository: getmaxun/maxun

Length of output: 159


🏁 Script executed:

# Search for CDP and remote browser usage patterns
rg -n "connectOverCDP|RemoteBrowser|remote.*browser" --type=ts --type=js -g '!node_modules' -g '!dist' -C 2 | head -100

Repository: getmaxun/maxun

Length of output: 7400


🏁 Script executed:

# Check the Dockerfile.backend for any system dependency installation
cat Dockerfile.backend

Repository: getmaxun/maxun

Length of output: 630


🏁 Script executed:

# Look for docker-compose or orchestration files that might provide browser services
fd -e docker-compose.yml -e docker-compose.yaml -o compose.yml -o compose.yaml | head -20

Repository: getmaxun/maxun

Length of output: 199


🏁 Script executed:

# Get context around the chromium.launch() call
cat -n server/src/browser-management/browserConnection.ts | sed -n '40,80p'

Repository: getmaxun/maxun

Length of output: 1873


🏁 Script executed:

# Check the full browserConnection.ts file to understand the context
wc -l server/src/browser-management/browserConnection.ts

Repository: getmaxun/maxun

Length of output: 112


🏁 Script executed:

# Look at the full file if it's not too large
cat -n server/src/browser-management/browserConnection.ts

Repository: getmaxun/maxun

Length of output: 6979


🏁 Script executed:

# Search for error handling or fallback patterns around this launch call
rg -n "chromium\.launch|catch|fallback|CDP" server/src/browser-management/browserConnection.ts -C 5

Repository: getmaxun/maxun

Length of output: 3388


🏁 Script executed:

# Check the RemoteBrowser class/interface definition
rg -n "class RemoteBrowser|interface RemoteBrowser" --type=ts -C 10

Repository: getmaxun/maxun

Length of output: 3148


Remove local Chromium fallback or add Chromium binaries to Dockerfile.backend.

Switching the base image from mcr.microsoft.com/playwright:v1.46.0-noble to node:20-slim causes a runtime failure whenever the remote browser service is unreachable. The code in server/src/browser-management/browserConnection.ts first tries to connect to the remote browser service; if that fails, it falls back to launchLocalBrowser() (line 128), which needs Chromium binaries and system libraries that the slim base image does not include.

When the remote browser service is unavailable, chromium.launch() will fail because:

  • Playwright binaries are not installed (requires npx playwright install chromium)
  • System dependencies are missing (libx11-6, libxdamage1, etc.)

Either install Chromium in the Dockerfile with RUN npx playwright@1.57.0 install chromium (plus the required system packages), or remove the fallback and make the remote browser service mandatory.
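
The second option (remote service mandatory, fail fast) could look roughly like the sketch below. It is a minimal illustration, not the project's code: connectRemoteOnly, remoteBrowserError, and the ws://browser:3001 endpoint used in the usage note are hypothetical names chosen for this example, and the actual connection call is passed in as a function so that no particular Playwright API is assumed.

```typescript
// Hypothetical helpers: none of these names come from the Maxun codebase.
type RemoteConnect<T> = () => Promise<T>;

/** Builds the error reported when the remote browser service is unreachable. */
function remoteBrowserError(endpoint: string, cause: unknown): Error {
  const reason = cause instanceof Error ? cause.message : String(cause);
  return new Error(
    `Remote browser service at ${endpoint} is unreachable (${reason}); ` +
      `this image ships no local Chromium, so there is no fallback.`
  );
}

/**
 * Runs the injected connect function and never falls back to a local launch.
 * Any connection failure is rethrown with the endpoint and original reason.
 */
async function connectRemoteOnly<T>(
  endpoint: string,
  connect: RemoteConnect<T>
): Promise<T> {
  try {
    return await connect();
  } catch (cause) {
    throw remoteBrowserError(endpoint, cause);
  }
}
```

A caller would pass something like connectRemoteOnly("ws://browser:3001", () => existingConnectCall()), where existingConnectCall stands for whatever browserConnection.ts already uses to reach the remote service. The catch branch then reports the failure instead of calling launchLocalBrowser().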

🤖 Prompt for AI Agents
In Dockerfile.backend around line 1, the base image node:20-slim lacks
Playwright Chromium binaries and required system libraries so the local browser
fallback in server/src/browser-management/browserConnection.ts
(launchLocalBrowser at line 128) will fail; either (A) modify this Dockerfile to
install the Playwright Chromium browser and the system dependencies (add apt
packages required by Chromium and run the Playwright install for the specific
Playwright version used), or (B) remove/disable the local fallback in
browserConnection.ts and make the remote browser service mandatory (fail fast
with a clear error if connection to the remote browser cannot be established).
Ensure the chosen approach is consistent with CI/deployment and document the
change in the Dockerfile comments.


# Set working directory
WORKDIR /app
@@ -18,31 +18,6 @@ COPY server/tsconfig.json ./server/
# Install dependencies
RUN npm install --legacy-peer-deps

# Create the Chromium data directory with necessary permissions
RUN mkdir -p /tmp/chromium-data-dir && \
chmod -R 777 /tmp/chromium-data-dir

# Install dependencies
RUN apt-get update && apt-get install -y \
libgbm1 \
libnss3 \
libatk1.0-0 \
libatk-bridge2.0-0 \
libdrm2 \
libxkbcommon0 \
libglib2.0-0 \
libdbus-1-3 \
libx11-xcb1 \
libxcb1 \
libxcomposite1 \
libxcursor1 \
libxdamage1 \
libxext6 \
libxi6 \
libxtst6 \
&& rm -rf /var/lib/apt/lists/* \
&& mkdir -p /tmp/.X11-unix && chmod 1777 /tmp/.X11-unix

# Expose backend port
EXPOSE ${BACKEND_PORT:-8080}

2 changes: 1 addition & 1 deletion browser/Dockerfile
@@ -6,7 +6,7 @@ WORKDIR /app
COPY browser/package*.json ./

# Install dependencies
RUN npm ci
RUN npm install

# Copy TypeScript source and config
COPY browser/server.ts ./
7 changes: 5 additions & 2 deletions browser/server.ts
@@ -11,6 +11,7 @@ let browserServer: BrowserServer | null = null;
// Configurable ports with defaults
const BROWSER_WS_PORT = parseInt(process.env.BROWSER_WS_PORT || '3001', 10);
const BROWSER_HEALTH_PORT = parseInt(process.env.BROWSER_HEALTH_PORT || '3002', 10);
const BROWSER_WS_HOST = process.env.BROWSER_WS_HOST || 'localhost';

async function start(): Promise<void> {
console.log('Starting Maxun Browser Service...');
@@ -44,17 +45,19 @@ async function start(): Promise<void> {
// Health check HTTP server
const healthServer = http.createServer((req, res) => {
if (req.url === '/health') {
const wsEndpoint = browserServer?.wsEndpoint();
res.writeHead(200, { 'Content-Type': 'application/json' });
res.end(JSON.stringify({
status: 'healthy',
wsEndpoint: browserServer?.wsEndpoint(),
wsEndpoint,
wsPort: BROWSER_WS_PORT,
healthPort: BROWSER_HEALTH_PORT,
timestamp: new Date().toISOString()
}));
} else if (req.url === '/') {
res.writeHead(200, { 'Content-Type': 'text/plain' });
res.end(`Maxun Browser Service\nWebSocket: ${browserServer?.wsEndpoint()}\nHealth: http://localhost:${BROWSER_HEALTH_PORT}/health`);
const wsEndpoint = browserServer?.wsEndpoint().replace('localhost', BROWSER_WS_HOST) || '';
res.end(`Maxun Browser Service\nWebSocket: ${wsEndpoint}\nHealth: http://localhost:${BROWSER_HEALTH_PORT}/health`);
} else {
res.writeHead(404);
res.end('Not Found');
14 changes: 8 additions & 6 deletions docker-compose.yml
@@ -30,9 +30,9 @@ services:
- minio_data:/data

backend:
#build:
#context: .
#dockerfile: server/Dockerfile
# build:
# context: .
# dockerfile: Dockerfile.backend
image: getmaxun/maxun-backend:latest
restart: unless-stopped
ports:
@@ -60,9 +60,9 @@ services:
- /var/run/dbus:/var/run/dbus

frontend:
#build:
#context: .
#dockerfile: Dockerfile
# build:
# context: .
# dockerfile: Dockerfile.frontend
image: getmaxun/maxun-frontend:latest
restart: unless-stopped
ports:
@@ -89,6 +89,8 @@ services:
- DEBUG=pw:browser*
- BROWSER_WS_PORT=${BROWSER_WS_PORT:-3001}
- BROWSER_HEALTH_PORT=${BROWSER_HEALTH_PORT:-3002}
- BROWSER_WS_HOST=${BROWSER_WS_HOST:-browser}
- PLAYWRIGHT_BROWSERS_PATH=/ms-playwright
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:${BROWSER_HEALTH_PORT:-3002}/health"]
2 changes: 1 addition & 1 deletion maxun-core/package.json
@@ -1,6 +1,6 @@
{
"name": "maxun-core",
"version": "0.0.27",
"version": "0.0.28",
"description": "Core package for Maxun, responsible for data extraction",
"main": "build/index.js",
"typings": "build/index.d.ts",
6 changes: 3 additions & 3 deletions package.json
@@ -1,6 +1,6 @@
{
"name": "maxun",
"version": "0.0.27",
"version": "0.0.28",
"author": "Maxun",
"license": "AGPL-3.0-or-later",
"dependencies": {
@@ -52,7 +52,7 @@
"lodash": "^4.17.21",
"loglevel": "^1.8.0",
"loglevel-plugin-remote": "^0.6.8",
"maxun-core": "^0.0.27",
"maxun-core": "^0.0.28",
"minio": "^8.0.1",
"moment-timezone": "^0.5.45",
"node-cron": "^3.0.3",
@@ -131,4 +131,4 @@
"vite": "^5.4.10",
"zod": "^3.25.62"
}
}
}
2 changes: 1 addition & 1 deletion server/src/api/record.ts
@@ -11,7 +11,7 @@ import { io, Socket } from "socket.io-client";
import { BinaryOutputService } from "../storage/mino";
import { AuthenticatedRequest } from "../routes/record"
import {capture} from "../utils/analytics";
import { Page } from "playwright";
import { Page } from "playwright-core";
import { WorkflowFile } from "maxun-core";
import { addGoogleSheetUpdateTask, googleSheetUpdateTasks, processGoogleSheetUpdates } from "../workflow-management/integrations/gsheet";
import { addAirtableUpdateTask, airtableUpdateTasks, processAirtableUpdates } from "../workflow-management/integrations/airtable";
4 changes: 2 additions & 2 deletions server/src/browser-management/classes/RemoteBrowser.ts
@@ -550,9 +550,9 @@ export class RemoteBrowser {

try {
const blocker = await PlaywrightBlocker.fromLists(fetch, ['https://easylist.to/easylist/easylist.txt']);
await blocker.enableBlockingInPage(this.currentPage);
await blocker.enableBlockingInPage(this.currentPage as any);
this.client = await this.currentPage.context().newCDPSession(this.currentPage);
await blocker.disableBlockingInPage(this.currentPage);
await blocker.disableBlockingInPage(this.currentPage as any);
console.log('Adblocker initialized');
} catch (error: any) {
console.warn('Failed to initialize adblocker, continuing without it:', error.message);
178 changes: 84 additions & 94 deletions server/src/markdownify/scrape.ts
@@ -1,4 +1,4 @@
import { connectToRemoteBrowser } from "../browser-management/browserConnection";
import { Page } from "playwright-core";
import { parseMarkdown } from "./markdown";
import logger from "../logger";

@@ -21,115 +21,105 @@ async function gotoWithFallback(page: any, url: string) {
* Fetches a webpage, strips scripts/styles/images/etc,
* returns clean Markdown using parser.
* @param url - The URL to convert
* @param existingPage - Optional existing Playwright page instance to reuse
* @param page - Existing Playwright page instance to use
*/
export async function convertPageToMarkdown(url: string): Promise<string> {
const browser = await connectToRemoteBrowser();
const page = await browser.newPage();

await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

const cleanedHtml = await page.evaluate(() => {
const selectors = [
"script",
"style",
"link[rel='stylesheet']",
"noscript",
"meta",
"svg",
"img",
"picture",
"source",
"video",
"audio",
"iframe",
"object",
"embed"
];

selectors.forEach(sel => {
document.querySelectorAll(sel).forEach(e => e.remove());
});
export async function convertPageToMarkdown(url: string, page: Page): Promise<string> {
try {
logger.log('info', `[Scrape] Using existing page instance for markdown conversion of ${url}`);

await gotoWithFallback(page, url);

const cleanedHtml = await page.evaluate(() => {
const selectors = [
"script",
"style",
"link[rel='stylesheet']",
"noscript",
"meta",
"svg",
"img",
"picture",
"source",
"video",
"audio",
"iframe",
"object",
"embed"
];

selectors.forEach(sel => {
document.querySelectorAll(sel).forEach(e => e.remove());
});

// Remove inline event handlers (onclick, onload…)
const all = document.querySelectorAll("*");
all.forEach(el => {
[...el.attributes].forEach(attr => {
if (attr.name.startsWith("on")) {
el.removeAttribute(attr.name);
}
const all = document.querySelectorAll("*");
all.forEach(el => {
[...el.attributes].forEach(attr => {
if (attr.name.startsWith("on")) {
el.removeAttribute(attr.name);
}
});
});
});

return document.documentElement.outerHTML;
});
return document.documentElement.outerHTML;
});

if (shouldCloseBrowser && browser) {
logger.log('info', `[Scrape] Closing browser instance created for markdown conversion`);
await browser.close();
} else {
logger.log('info', `[Scrape] Keeping existing browser instance open after markdown conversion`);
const markdown = await parseMarkdown(cleanedHtml, url);
return markdown;
} catch (error: any) {
logger.error(`[Scrape] Error during markdown conversion: ${error.message}`);
throw error;
}

// Convert cleaned HTML → Markdown
const markdown = await parseMarkdown(cleanedHtml, url);
return markdown;
}

/**
* Fetches a webpage, strips scripts/styles/images/etc,
* returns clean HTML.
* @param url - The URL to convert
* @param existingPage - Optional existing Playwright page instance to reuse
* @param page - Existing Playwright page instance to use
*/
export async function convertPageToHTML(url: string): Promise<string> {
const browser = await connectToRemoteBrowser();
const page = await browser.newPage();

await page.goto(url, { waitUntil: "networkidle", timeout: 100000 });

const cleanedHtml = await page.evaluate(() => {
const selectors = [
"script",
"style",
"link[rel='stylesheet']",
"noscript",
"meta",
"svg",
"img",
"picture",
"source",
"video",
"audio",
"iframe",
"object",
"embed"
];

selectors.forEach(sel => {
document.querySelectorAll(sel).forEach(e => e.remove());
});
export async function convertPageToHTML(url: string, page: Page): Promise<string> {
try {
logger.log('info', `[Scrape] Using existing page instance for HTML conversion of ${url}`);

await gotoWithFallback(page, url);

const cleanedHtml = await page.evaluate(() => {
const selectors = [
"script",
"style",
"link[rel='stylesheet']",
"noscript",
"meta",
"svg",
"img",
"picture",
"source",
"video",
"audio",
"iframe",
"object",
"embed"
];

selectors.forEach(sel => {
document.querySelectorAll(sel).forEach(e => e.remove());
});

// Remove inline event handlers (onclick, onload…)
const all = document.querySelectorAll("*");
all.forEach(el => {
[...el.attributes].forEach(attr => {
if (attr.name.startsWith("on")) {
el.removeAttribute(attr.name);
}
const all = document.querySelectorAll("*");
all.forEach(el => {
[...el.attributes].forEach(attr => {
if (attr.name.startsWith("on")) {
el.removeAttribute(attr.name);
}
});
});
});

return document.documentElement.outerHTML;
});
return document.documentElement.outerHTML;
});

if (shouldCloseBrowser && browser) {
logger.log('info', `[Scrape] Closing browser instance created for HTML conversion`);
await browser.close();
} else {
logger.log('info', `[Scrape] Keeping existing browser instance open after HTML conversion`);
return cleanedHtml;
} catch (error: any) {
logger.error(`[Scrape] Error during HTML conversion: ${error.message}`);
throw error;
}

// Return cleaned HTML directly
return cleanedHtml;
}
2 changes: 0 additions & 2 deletions server/src/routes/index.ts
@@ -2,7 +2,6 @@ import { router as record } from './record';
import { router as workflow } from './workflow';
import { router as storage } from './storage';
import { router as auth } from './auth';
import { router as integration } from './integration';
import { router as proxy } from './proxy';
import { router as webhook } from './webhook';

@@ -11,7 +10,6 @@ export {
workflow,
storage,
auth,
integration,
proxy,
webhook
};
1 change: 0 additions & 1 deletion server/src/routes/storage.ts
@@ -15,7 +15,6 @@ import { encrypt, decrypt } from '../utils/auth';
import { WorkflowFile } from 'maxun-core';
import { cancelScheduledWorkflow, scheduleWorkflow } from '../storage/schedule';
import { pgBossClient } from '../storage/pgboss';
chromium.use(stealthPlugin());

export const router = Router();
